This is Part 3 of our three-part series on hosting your own LLM in AWS with APrime's LLM Kit. See Part 1 for an introduction and Part 2 for the quickstart guide.
Overview
This guide walks through deploying a self-hosted AI model in AWS using our free quickstart script and open-source Terraform modules. For most users, the quickstart guide in Part 2 will be sufficient; this post drills into the implementation details behind it.
Prerequisites
This walkthrough assumes familiarity with the demo repository, the quickstart.sh script, and the open-source Terraform module.
Deploying the Text Generation Inference (TGI) Service
Resource Requirements
For optimal performance, allocate a single ECS task per EC2 instance so the task has exclusive GPU access. Set CPU and memory slightly below the instance's available capacity to leave buffer room for the ECS agent and other system processes.
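For example, on a g4dn.xlarge (4 vCPUs, 16 GiB of memory), the task-level cpu and memory inputs of the ECS service module might look like the sketch below; exactly how much headroom to reserve is a judgment call:

# Task sizing for a g4dn.xlarge: reserve slightly less than the full
# instance so the ECS agent and OS have headroom. CPU is in CPU units
# (1024 = 1 vCPU); memory is in MiB.
cpu    = 3584  # 3.5 of 4 vCPUs
memory = 14336 # 14 of 16 GiB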
Model and Quantization Settings
Quantization reduces model weight precision, accelerating inference while lowering memory requirements.
Choose quantization settings that balance performance and accuracy for your use case. The bitsandbytes strategy works well on NVIDIA T4 GPUs and is compatible across model formats, though newer GPUs may support faster methods. Always check whether your chosen model was published with a specific quantization format in mind.
Learn more about quantization.
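As a sketch of what swapping strategies looks like, on newer GPUs (e.g., an A10G on a g5 instance) you might change the QUANTIZE environment value shown below. Note the assumptions here: eetq quantizes weights on the fly, while awq and gptq require models published with pre-quantized weights, so verify support for your GPU and model before relying on this:

{
  # Hypothetical alternative to bitsandbytes for newer GPUs (e.g., g5/A10G).
  name  = "QUANTIZE"
  value = "eetq"
}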
Container Environment Variables
Specify environment variables in the TGI container definition. The NUMBA_CACHE_DIR setting resolves a Docker startup issue where Numba (a JIT compiler that speeds up Python code) requires a writable cache directory:
environment = [
  {
    name  = "NUMBA_CACHE_DIR"
    value = "/tmp/numba_cache"
  },
  {
    name  = "MODEL_ID"
    value = "teknium/OpenHermes-2.5-Mistral-7B"
  },
  {
    name  = "QUANTIZE"
    value = "bitsandbytes"
  }
]
EFS Persistence
Use Amazon Elastic File System (EFS) for persistent /data storage, so downloaded model weights survive task restarts and accelerate future deployments. EFS mounts require a custom image configured with a non-root user to avoid running the container privileged. A pre-built image is available at open-webui, though building your own is recommended for production use.
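A minimal sketch of what the EFS wiring could look like in the service module, assuming a hypothetical aws_efs_file_system resource named tgi_data (mount targets and EFS security group rules omitted for brevity):

# Hypothetical file system persisting TGI's /data directory across tasks.
resource "aws_efs_file_system" "tgi_data" {
  creation_token = "tgi-data"
}

# In the service module: declare the volume, then mount it in the
# container definition with a matching mount_points entry for /data.
volume = {
  data = {
    efs_volume_configuration = {
      file_system_id     = aws_efs_file_system.tgi_data.id
      transit_encryption = "ENABLED"
    }
  }
}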
Putting It Together Into the Service Definition
module "ecs_service" {
source = "terraform-aws-modules/ecs/aws//modules/service"
version = "5.11.0"
create = true
capacity_provider_strategy = {
"my-cluster-asg" = {
capacity_provider = module.ecs_cluster.autoscaling_capacity_providers[ "my-cluster-asg"].name
weight = 100
base = 1
}
}
cluster_arn = module.ecs_cluster.arn
desired_count = 1
name = "text-generation-inference"
placement_constraints = [{
type = "distinctInstance"
}]
create_task_definition = true
container_definitions = {
text_generation_inference = {
name = "text_generation_inference"
image = "https://ghcr.io/huggingface/text-generation-inference:2.0.4"
environment = [
{
name = "NUMBA_CACHE_DIR"
value = "/tmp/numba_cache"
},
{
name = "MODEL_ID"
value = "teknium/OpenHermes-2.5-Mistral-7B"
},
{
name = "QUANTIZE"
value = "bitsandbytes"
}
]
port_mappings = [{
name = "http"
containerPort = 11434
hostPort = 11434
protocol = "tcp"
}]
resource_requirements = [
{
type = "GPU"
value = 1
}
],
}
}
}
Deploying Open WebUI
Open WebUI provides a ChatGPT-like frontend for any OpenAI-compatible API: point it at your API's base URL and it functions as a drop-in replacement. While TGI exposes compatible endpoints, the integration requires some additional configuration.
Integration with TGI
An NGINX sidecar provides an OpenAI-compatible /v1/models endpoint so the UI can interact with TGI seamlessly. Two new container definitions are added: one that writes the NGINX configuration to disk, and another that runs NGINX.
NGINX Configuration
This configuration returns a static response from /v1/models, enabling Open WebUI to discover the running model. The time_static resource pins the created timestamp so it stays stable across applies:
resource "time_static" "activation_date" {}
locals {
nginx_config = <<EOF
server {
listen 80;
location /v1/models {
default_type application/json;
return 200 '{ "object": "list", "data": [ { "id": "teknium/OpenHermes-2.5-Mistral-7B", "object": "model", "created": ${time_static.activation_date.unix}, "owned_by": "system" } ]}';
}
location / {
proxy_pass http://localhost:11434/;
proxy_set_header Host $host;
proxy_set_header X-Real-IP $remote_addr;
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
proxy_set_header X-Forwarded-Proto $scheme;
}
}
EOF
}
Empty Service Volume
To share the NGINX configuration written by the init container, create an empty volume that is mounted into both containers:
volume = {
  nginx_config = {}
}
Add Container Definitions
Initialize NGINX with this container definition:
container_definitions = {
  init_nginx = {
    entrypoint = [
      "bash",
      "-c",
      "set -ueo pipefail; mkdir -p /etc/nginx/conf.d/; echo ${base64encode(local.nginx_config)} | base64 -d > /etc/nginx/conf.d/default.conf; cat /etc/nginx/conf.d/default.conf",
    ]
    image = "public.ecr.aws/docker/library/bash:5"
    name  = "init_nginx"
    mount_points = [{
      containerPath = "/etc/nginx/conf.d"
      sourceVolume  = "nginx_config"
      readOnly      = false
    }]
  },
...
Continue with the main NGINX container, which depends on successful completion of the init container:
nginx = {
  dependencies = [
    {
      containerName = "init_nginx"
      condition     = "SUCCESS"
    }
  ]
  image = "nginx:stable-alpine"
  name  = "nginx"
  port_mappings = [{
    name          = "http-proxy"
    containerPort = 80
    hostPort      = 80
    protocol      = "tcp"
  }]
  mount_points = [{
    containerPath = "/etc/nginx/conf.d"
    sourceVolume  = "nginx_config"
    readOnly      = false
  }]
},
...
ECS Service Connect
Configure ECS Service Connect to keep the TGI endpoint private: it is reachable only from other services in the namespace (such as Open WebUI), not from the public internet.
Create a service discovery namespace resource:
resource "aws_service_discovery_http_namespace" "this" {
name = "mynamespace"
}
Add this input to the TGI service module:
service_connect_configuration = {
  enabled   = true
  namespace = aws_service_discovery_http_namespace.this.arn
  service = {
    client_alias = {
      port = 80
    }
    port_name      = "http-proxy"
    discovery_name = "text-generation-inference"
  }
}
Open WebUI Service
Create an ECS service for Open WebUI, configured to reach TGI over Service Connect:
module "ecs_service" {
source = "terraform-aws-modules/ecs/aws//modules/service"
version = "5.11.0"
create = true
cluster_arn = module.ecs_cluster.arn
desired_count = 1
name = "open-webui"
create_task_definition = true
container_definitions = {
open_webui = {
name = "open_webui"
image = "ghcr.io/open-webui/open-webui:main"
environment = [
{
name = "WEBUI_URL"
value = "https://open-webui.mydomain.com"
},
{
name = "PORT"
value = 8080
},
{
name = "OPENAI_API_KEY"
value = "fake"
},
{
name = "OPENAI_API_BASE_URL"
value = "text-generation-inference.${aws_service_discovery_http_namespace.this.name}"
},
],
port_mappings = [{
name = "http"
containerPort = 8080
hostPort = 8080
protocol = "tcp"
}]
load_balancer = {
service = {
target_group_arn = module.alb.target_groups["open_webui"].arn
container_name = "open_webui"
container_port = 8080
}
}
}
}
}
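Note that for Open WebUI to resolve the text-generation-inference alias, the service generally must also join the Service Connect namespace as a client. A sketch of the extra input to this module, mirroring the TGI configuration but without a service block:

# Client-only Service Connect configuration: joins the namespace so this
# task can resolve aliases such as text-generation-inference, without
# advertising any ports of its own.
service_connect_configuration = {
  enabled   = true
  namespace = aws_service_discovery_http_namespace.this.arn
}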
At this point, Terraform should be ready to apply. Upon successful completion, proceed to the next section.
Application Load Balancer (ALB) for the Service
Certificate
Domain lookup and SSL certificate creation use the Route 53 zone for mydomain.com and the official AWS ACM module:
data "aws_route53_zone" "this" {
name = "mydomain.com"
private_zone = true
}
module "acm" {
source = "terraform-aws-modules/acm/aws"
version = "5.0.0"
create_certificate = true
domain_name = "mydomain.com"
validate_certificate = true
validation_method = "DNS"
zone_id = data.aws_route53_zone.this.zone_id
}
ALB
An Application Load Balancer supporting SSL on port 443 is configured using the official ALB module with the previously created certificate:
module "alb" {
source = "terraform-aws-modules/alb/aws"
version = "9.1.0"
create = true
load_balancer_type = "application"
name = "my-cluster-tgi-alb"
internal = false
listeners = {
http-https-redirect = {
port = 80
protocol = "HTTP"
redirect = {
port = "443"
protocol = "HTTPS"
status_code = "HTTP_301"
}
}
https = {
port = 443
protocol = "HTTPS"
ssl_policy = "ELBSecurityPolicy-TLS13-1-2-Res-2021-06"
certificate_arn = module.acm.acm_certificate_arn
}
}
target_groups = {
text_generation_inference = {
name = "my-cluster-tgi"
protocol = "HTTP"
port = 11434
create_attachment = false
target_type = "ip"
deregistration_delay = 10
load_balancing_cross_zone_enabled = true
health_check = {
enabled = true
healthy_threshold = 5
interval = 30
matcher = "200"
path = "/health"
port = "traffic-port"
protocol = "HTTP"
timeout = 5
unhealthy_threshold = 2
}
}
}
create_security_group = true
vpc_id = var.vpc_id
security_group_ingress_rules = {
http = {
from_port = 80
to_port = 80
ip_protocol = "tcp"
cidr_ipv4 = "0.0.0.0/0"
}
https = {
from_port = 443
to_port = 443
ip_protocol = "tcp"
cidr_ipv4 = "0.0.0.0/0"
}
}
security_group_egress_rules = {
all = {
ip_protocol = "-1"
cidr_ipv4 = "0.0.0.0/0"
}
}
route53_records = {
A = {
name = "open-webui"
type = "A"
zone_id = var.route53_zone_id
}
AAAA = {
name = "open-webui"
type = "AAAA"
zone_id = var.route53_zone_id
}
}
}
Create an ECS Cluster with GPU Capacity
To simplify common cluster setup, the official AWS Terraform modules provide an autoscaling module which we’ll leverage in this section.
Autoscaling Group Setup
Creating an autoscaling group ensures the appropriate number of EC2 instances run to meet workload demands. The instances must be configured to join the ECS cluster, with the IAM roles and security groups needed for the instances and the ECS service to communicate.
ECS Optimized GPU AMI
Retrieve the recommended ECS optimized GPU AMI ID for launching EC2 instances:
data "aws_ssm_parameter" "ecs_optimized_ami" {
name = "/aws/service/ecs/optimized-ami/amazon-linux-2/gpu/recommended"
}
Security Group
Using the ALB's security group ID, create a security group that allows the ALB to reach the autoscaling group instances:
module "autoscaling_sg" {
source = "terraform-aws-modules/security-group/aws"
version = "~> 5.0"
create = true
name = "my-cluster-asg-sg"
vpc_id = var.vpc_id
computed_ingress_with_source_security_group_id = [
{
rule = "http-80-tcp"
source_security_group_id = var.alb_security_group_id
}
]
number_of_computed_ingress_with_source_security_group_id = 1
egress_rules = ["all-all"]
}
The Autoscaling Group
Using the security group and AMI ID with the autoscaling module, attach the AmazonEC2ContainerServiceforEC2Role managed policy so the EC2 instances can join the ECS cluster:
module "autoscaling" {
source = "terraform-aws-modules/autoscaling/aws"
version = "~> 6.5"
create = true
name = "my-cluster-asg"
image_id = jsondecode(data.aws_ssm_parameter.ecs_optimized_ami.value)["image_id"]
instance_type = "g4dn.xlarge"
security_groups = [module.autoscaling_sg.security_group_id]
user_data = base64encode(local.user_data)
create_iam_instance_profile = true
iam_role_policies = {
AmazonEC2ContainerServiceforEC2Role = "arn:aws:iam::aws:policy/service-role/AmazonEC2ContainerServiceforEC2Role"
AmazonSSMManagedInstanceCore = "arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore"
}
vpc_zone_identifier = var.vpc_private_subnets
health_check_type = "EC2"
min_size = 1
max_size = 1
desired_capacity = 1
# https://github.com/hashicorp/terraform-provider-aws/issues/12582
autoscaling_group_tags = {
AmazonECSManaged = true
}
}
User Data
The user_data script runs at EC2 startup; here it appends to the ECS agent configuration file that ships with the ECS-optimized AMI:
locals {
  # https://github.com/aws/amazon-ecs-agent/blob/master/README.md#environment-variables
  user_data = <<-EOT
    #!/bin/bash
    cat <<'EOF' >> /etc/ecs/ecs.config
    ECS_CLUSTER=my-cluster
    ECS_LOGLEVEL=info
    ECS_ENABLE_TASK_IAM_ROLE=true
    EOF
  EOT
}
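If GPU tasks fail to place even after the instance registers with the cluster, check the agent's GPU support flag. Depending on the AMI and agent version it may need to be set explicitly; this is an assumption worth verifying against the ECS agent documentation, since the GPU-optimized AMI may already enable it by default:

# Hypothetical extra line for the ecs.config heredoc above.
ECS_ENABLE_GPU_SUPPORT=true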
ECS Cluster
Capacity Provider
Create a capacity provider representing the EC2 autoscaling group, with managed scaling targeting 100% capacity utilization:
locals {
  default_autoscaling_capacity_providers = {
    "my-cluster-asg" = {
      auto_scaling_group_arn         = module.autoscaling.autoscaling_group_arn
      managed_termination_protection = "ENABLED"

      managed_scaling = {
        maximum_scaling_step_size = 2
        minimum_scaling_step_size = 1
        status                    = "ENABLED"
        target_capacity           = 100
      }

      default_capacity_provider_strategy = {
        weight = 0
      }
    }
  }
}
Cluster
Define the cluster using the capacity provider:
module "ecs_cluster" {
source = "terraform-aws-modules/ecs/aws//modules/cluster"
version = "5.11.0"
create = true
# Cluster
cluster_name = "my-cluster"
# Capacity providers
default_capacity_provider_use_fargate = true
autoscaling_capacity_providers = local.default_autoscaling_capacity_providers
}
We are here to help!
If these modules prove useful, star the repo and follow APrime on GitHub for future updates. For help implementing AI models or incorporating LLMs into your product, reach out or schedule a call; the APrime team looks forward to exploring how we can help with your work.