APrime's LLM Kit, Part 3: In-Depth Walkthrough of Self-Hosted AI Setup

This is Part 3 of our three-part series on hosting your own LLM in AWS. See Part 1 for an introduction and Part 2 for the quickstart guide.

Overview

This guide walks through deploying a self-hosted AI model in AWS using our free quickstart script and Terraform modules. For most users, the quickstart covered in Part 2 will be sufficient; this post drills into the implementation details behind it.

Prerequisites

This walkthrough assumes familiarity with the demo repository, the quickstart.sh script, and the open-source Terraform module.

Deploying the Text Generation Inference (TGI) Service

Resource Requirements

For optimal performance, allocate a single ECS task per EC2 instance, ensuring exclusive GPU access. CPU and memory should be set slightly below available capacity to leave buffer room for system processes.
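
As an illustration (the numbers below are assumptions for a g4dn.xlarge with 4 vCPUs and 16 GiB of memory, not values taken from the module), the task-level sizing inputs on the ECS service module shown later might look like:

# Sketch only: task sizing for a single TGI task on a g4dn.xlarge (4 vCPU / 16 GiB).
# These numbers are illustrative; leave headroom for the ECS agent and OS processes.
cpu    = 3584   # ~3.5 of the 4 available vCPUs, in CPU units
memory = 14336  # 14 GiB of the instance's 16 GiB, in MiB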

Model and Quantization Settings

Quantization reduces model weight precision, accelerating inference while lowering memory requirements.

We recommend selecting quantization settings that balance performance and accuracy for your specific use case. The bitsandbytes strategy works well on T4 NVIDIA GPUs and maintains compatibility across model formats, though newer GPUs may support superior methods. Always examine whether your chosen model was designed with specific quantization requirements.

Learn more about quantization.
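
For illustration, the QUANTIZE value passed to TGI can be swapped depending on your hardware and model; the sketch below shows where that choice lives (check TGI's documentation for the options supported by your TGI version):

# Quantization is selected via the QUANTIZE environment variable on the TGI container.
environment = [
  {
    name  = "QUANTIZE"
    # "bitsandbytes" is a safe default on T4 GPUs; models published with
    # pre-quantized weights (e.g. GPTQ or AWQ) typically expect "gptq" or "awq" instead.
    value = "bitsandbytes"
  }
]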

Container Environment Variables

Specify environment variables in the TGI container definition. The NUMBA_CACHE_DIR setting resolves a Docker startup issue: Numba (a JIT compiler that speeds up Python execution) needs a writable cache directory:

environment = [
    {
      name  = "NUMBA_CACHE_DIR"
      value = "/tmp/numba_cache"
    },
    {
      name  = "MODEL_ID"
      value = "teknium/OpenHermes-2.5-Mistral-7B"
    },
    {
      name  = "QUANTIZE"
      value = "bitsandbytes"
    }
]

EFS Persistence

Use Amazon Elastic File System for persistent /data directory storage, accelerating future deployments. EFS requires a custom image with non-root user configuration to prevent privileged container execution. A pre-built image is available at open-webui, though building your own is recommended for production use.
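
As a rough sketch of that wiring (the resource names, the efs_security_group_id variable, and the assumption that the service module passes efs_volume_configuration through to the task definition are all illustrative, not taken from the published module):

# Hypothetical EFS file system for persisting TGI's /data model cache across deployments.
resource "aws_efs_file_system" "tgi_data" {
  creation_token = "my-cluster-tgi-data"
}

# One mount target per private subnet; the security group must allow NFS (port 2049)
# from the ECS container instances.
resource "aws_efs_mount_target" "tgi_data" {
  for_each        = toset(var.vpc_private_subnets)
  file_system_id  = aws_efs_file_system.tgi_data.id
  subnet_id       = each.value
  security_groups = [var.efs_security_group_id]
}

# In the TGI service module: declare the volume...
volume = {
  data = {
    efs_volume_configuration = {
      file_system_id     = aws_efs_file_system.tgi_data.id
      transit_encryption = "ENABLED"
    }
  }
}

# ...and mount it at /data on the text_generation_inference container definition.
mount_points = [{
  containerPath = "/data"
  sourceVolume  = "data"
  readOnly      = false
}]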

Putting It Together Into the Service Definition

module "ecs_service" {
  source  = "terraform-aws-modules/ecs/aws//modules/service"
  version = "5.11.0"

  create = true

  capacity_provider_strategy = {
    "my-cluster-asg" = {
      capacity_provider = module.ecs_cluster.autoscaling_capacity_providers["my-cluster-asg"].name
      weight            = 100
      base              = 1
    }
  }

  cluster_arn   = module.ecs_cluster.arn
  desired_count = 1
  name          = "text-generation-inference"

  placement_constraints = [{
    type = "distinctInstance"
  }]

  create_task_definition = true
  
  container_definitions = {
    text_generation_inference = {
      name  = "text_generation_inference"
      image = "ghcr.io/huggingface/text-generation-inference:2.0.4"
      environment = [
        {
          name  = "NUMBA_CACHE_DIR"
          value = "/tmp/numba_cache"
        },
        {
          name  = "MODEL_ID"
          value = "teknium/OpenHermes-2.5-Mistral-7B"
        },
        {
          name  = "QUANTIZE"
          value = "bitsandbytes"
        }
      ]
      port_mappings = [{
        name          = "http"
        containerPort = 11434
        hostPort      = 11434
        protocol      = "tcp"
      }]
      resource_requirements = [
        {
          type  = "GPU"
          value = "1"
        }
      ]
    }
  }
}

Deploying Open WebUI

Open WebUI provides a ChatGPT-like frontend for any OpenAI-compatible API: point it at your API's URL and it works as a drop-in replacement. While TGI exposes compatible endpoints, the integration requires some additional configuration.

Integration with TGI

An NGINX sidecar provides an OpenAI-compatible /v1/models endpoint so the UI can interact with the model seamlessly. Two new container definitions are added: one that writes the NGINX configuration to disk, and another that runs NGINX itself.

NGINX Configuration

This configuration returns a static response from /v1/models, enabling Open WebUI to discover the running model:

resource "time_static" "activation_date" {}

locals {
  nginx_config = <<EOF
server {
    listen 80;

    location /v1/models {
        default_type application/json;
        return 200 '{ "object": "list", "data": [ { "id": "teknium/OpenHermes-2.5-Mistral-7B", "object": "model", "created": ${time_static.activation_date.unix}, "owned_by": "system" } ]}';
    }

    location / {
        proxy_pass http://localhost:11434/;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
    }
}
EOF
}

Empty Service Volume

To populate NGINX configuration in the init container, create an empty volume mounted to each container:

volume = {
  nginx_config = {}
}

Add Container Definitions

Initialize NGINX with this container definition:

container_definitions = {
  init_nginx = {
    entrypoint = [
      "bash",
      "-c",
      "set -ueo pipefail; mkdir -p /etc/nginx/conf.d/; echo ${base64encode(local.nginx_config)} | base64 -d > /etc/nginx/conf.d/default.conf; cat /etc/nginx/conf.d/default.conf",
    ]
    image = "public.ecr.aws/docker/library/bash:5"
    name  = "init_nginx"
    mount_points = [{
      containerPath = "/etc/nginx/conf.d"
      sourceVolume  = "nginx_config"
      readOnly      = false
    }]
  },
  ...

Continue with the main NGINX container, which depends on the init container completing successfully:

  nginx = {
    dependencies = [
      {
        containerName = "init_nginx"
        condition     = "SUCCESS"
      }
    ]
    image = "nginx:stable-alpine"
    name  = "nginx"
    port_mappings = [{
      name          = "http-proxy"
      containerPort = 80
      hostPort      = 80
      protocol      = "tcp"
    }]
    mount_points = [{
      containerPath = "/etc/nginx/conf.d"
      sourceVolume  = "nginx_config"
      readOnly      = false
    }]
  },
  ...

ECS Service Connect

Configure ECS Service Connect to maintain TGI endpoint privacy and security, restricting access to authorized users and services.

Create a service discovery namespace resource:

resource "aws_service_discovery_http_namespace" "this" {
  name        = "mynamespace"
}

Add this input to the TGI service module:

service_connect_configuration = {
  enabled   = true
  namespace = aws_service_discovery_http_namespace.this.arn
  service = {
    client_alias = {
      port = 80
    }
    port_name      = "http-proxy"
    discovery_name = "text-generation-inference"
  }
}

Open WebUI Service

Create an ECS service for Open WebUI configured for TGI communication over Service Connect:

module "ecs_service" {
  source  = "terraform-aws-modules/ecs/aws//modules/service"
  version = "5.11.0"

  create = true

  cluster_arn   = module.ecs_cluster.arn
  desired_count = 1
  name          = "open-webui"

  create_task_definition = true
  
  container_definitions = {
    open_webui = {
      name  = "open_webui"
      image = "ghcr.io/open-webui/open-webui:main"
      environment = [
        {
          name  = "WEBUI_URL"
          value = "https://open-webui.mydomain.com"
        },
        {
          name  = "PORT"
          value = "8080"
        },
        {
          name  = "OPENAI_API_KEY"
          value = "fake"
        },
        {
          name  = "OPENAI_API_BASE_URL"
          value = "text-generation-inference.${aws_service_discovery_http_namespace.this.name}"
        },
      ]
      port_mappings = [{
        name          = "http"
        containerPort = 8080
        hostPort      = 8080
        protocol      = "tcp"
      }]
    }
  }

  load_balancer = {
    service = {
      target_group_arn = module.alb.target_groups["open_webui"].arn
      container_name   = "open_webui"
      container_port   = 8080
    }
  }
}
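
Note that for Open WebUI to resolve the text-generation-inference alias over Service Connect, the Open WebUI service generally also needs to join the same namespace in client mode. A minimal sketch of that input on the Open WebUI service module (not shown in the definition above):

# Client-mode Service Connect: joins the namespace so Service Connect aliases resolve,
# without publishing an endpoint of its own.
service_connect_configuration = {
  enabled   = true
  namespace = aws_service_discovery_http_namespace.this.arn
}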

At this point, Terraform should be ready to apply. Upon successful completion, proceed to the next section.

Application Load Balancer (ALB) for the Service

Certificate

Domain lookup and SSL certificate creation use the Route 53 zone for mydomain.com and the official AWS ACM module:

data "aws_route53_zone" "this" {
  name         = "mydomain.com"
  private_zone = true
}

module "acm" {
  source  = "terraform-aws-modules/acm/aws"
  version = "5.0.0"

  create_certificate = true

  domain_name          = "mydomain.com"
  validate_certificate = true
  validation_method    = "DNS"
  zone_id              = data.aws_route53_zone.this.zone_id
}

ALB

An Application Load Balancer supporting SSL on port 443 is configured using the official ALB module with the previously created certificate:

module "alb" {
  source  = "terraform-aws-modules/alb/aws"
  version = "9.1.0"

  create = true

  load_balancer_type = "application"
  name               = "my-cluster-tgi-alb"
  internal           = false

  listeners = {
    http-https-redirect = {
      port     = 80
      protocol = "HTTP"

      redirect = {
        port        = "443"
        protocol    = "HTTPS"
        status_code = "HTTP_301"
      }
    }

    https = {
      port            = 443
      protocol        = "HTTPS"
      ssl_policy      = "ELBSecurityPolicy-TLS13-1-2-Res-2021-06"
      certificate_arn = module.acm.acm_certificate_arn
    }
  }

  target_groups = {
    text_generation_inference = {
      name                              = "my-cluster-tgi"
      protocol                          = "HTTP"
      port                              = 11434
      create_attachment                 = false
      target_type                       = "ip"
      deregistration_delay              = 10
      load_balancing_cross_zone_enabled = true

      health_check = {
        enabled             = true
        healthy_threshold   = 5
        interval            = 30
        matcher             = "200"
        path                = "/health"
        port                = "traffic-port"
        protocol            = "HTTP"
        timeout             = 5
        unhealthy_threshold = 2
      }
    }
  }

  create_security_group = true
  vpc_id                = var.vpc_id

  security_group_ingress_rules = {
    http = {
      from_port   = 80
      to_port     = 80
      ip_protocol = "tcp"
      cidr_ipv4   = "0.0.0.0/0"
    }
    https = {
      from_port   = 443
      to_port     = 443
      ip_protocol = "tcp"
      cidr_ipv4   = "0.0.0.0/0"
    }
  }

  security_group_egress_rules = {
    all = {
      ip_protocol = "-1"
      cidr_ipv4   = "0.0.0.0/0"
    }
  }

  route53_records = {
    A = {
      name    = "open-webui"
      type    = "A"
      zone_id = var.route53_zone_id
    }
    AAAA = {
      name    = "open-webui"
      type    = "AAAA"
      zone_id = var.route53_zone_id
    }
  }
}
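
The Open WebUI service above attaches to module.alb.target_groups["open_webui"], which is not defined in the ALB configuration as shown. A sketch of such a target group (the name, port, and health-check path below are assumptions based on Open WebUI's defaults) might look like:

target_groups = {
  open_webui = {
    name                              = "my-cluster-open-webui"
    protocol                          = "HTTP"
    port                              = 8080
    target_type                       = "ip"
    create_attachment                 = false
    deregistration_delay              = 10
    load_balancing_cross_zone_enabled = true

    health_check = {
      enabled  = true
      path     = "/health"   # Open WebUI's health endpoint (assumption)
      matcher  = "200"
      port     = "traffic-port"
      protocol = "HTTP"
    }
  }
}

If Open WebUI is the default destination for HTTPS traffic, the https listener would also gain a default action along the lines of forward = { target_group_key = "open_webui" }.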

Create an ECS Cluster with GPU Capacity

To simplify common cluster setup, the official AWS Terraform modules provide an autoscaling module which we’ll leverage in this section.

Autoscaling Group Setup

Creating an autoscaling group ensures the appropriate number of EC2 instances are running to meet workload demands. The instances are configured to join the ECS cluster, with the IAM roles and security groups they need to communicate with the ECS service.

ECS Optimized GPU AMI

Retrieve the recommended ECS optimized GPU AMI ID for launching EC2 instances:

data "aws_ssm_parameter" "ecs_optimized_ami" {
  name = "/aws/service/ecs/optimized-ami/amazon-linux-2/gpu/recommended"
}

Security Group

Using the ALB's security group ID, create a security group that allows the ALB to communicate with the autoscaling group instances:

module "autoscaling_sg" {
  source  = "terraform-aws-modules/security-group/aws"
  version = "~> 5.0"

  create = true

  name        = "my-cluster-asg-sg"
  vpc_id      = var.vpc_id

  computed_ingress_with_source_security_group_id = [
    {
      rule                     = "http-80-tcp"
      source_security_group_id = var.alb_security_group_id
    }
  ]
  number_of_computed_ingress_with_source_security_group_id = 1

  egress_rules = ["all-all"]
}

The Autoscaling Group

Using the security group and AMI ID with the autoscaling module, attach the AmazonEC2ContainerServiceforEC2Role policy so the EC2 instances can join the ECS cluster:

module "autoscaling" {
  source  = "terraform-aws-modules/autoscaling/aws"
  version = "~> 6.5"

  create = true

  name = "my-cluster-asg"

  image_id      = jsondecode(data.aws_ssm_parameter.ecs_optimized_ami.value)["image_id"]
  instance_type = "g4dn.xlarge"

  security_groups                 = [module.autoscaling_sg.security_group_id]
  user_data                       = base64encode(local.user_data)

  create_iam_instance_profile = true
  iam_role_policies           = {
    AmazonEC2ContainerServiceforEC2Role = "arn:aws:iam::aws:policy/service-role/AmazonEC2ContainerServiceforEC2Role"
    AmazonSSMManagedInstanceCore        = "arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore"
  }

  vpc_zone_identifier = var.vpc_private_subnets
  health_check_type   = "EC2"
  min_size            = 1
  max_size            = 1
  desired_capacity    = 1

  # https://github.com/hashicorp/terraform-provider-aws/issues/12582
  autoscaling_group_tags = {
    AmazonECSManaged = true
  }
}

User Data

The user_data script runs at EC2 startup and writes the ECS agent configuration file that the ECS-optimized AMI reads:

locals {
  # https://github.com/aws/amazon-ecs-agent/blob/master/README.md#environment-variables
  user_data = <<-EOT
    #!/bin/bash

    cat <<'EOF' >> /etc/ecs/ecs.config
    ECS_CLUSTER=my-cluster
    ECS_LOGLEVEL=info
    ECS_ENABLE_TASK_IAM_ROLE=true
    EOF
  EOT
}

ECS Cluster

Capacity Provider

Create a capacity provider that represents the EC2 autoscaling group, with managed scaling targeting 100% capacity:

locals {
  default_autoscaling_capacity_providers = {
    "my-cluster-asg" = {
      auto_scaling_group_arn         = module.autoscaling.autoscaling_group_arn
      managed_termination_protection = "ENABLED"

      managed_scaling = {
        maximum_scaling_step_size = 2
        minimum_scaling_step_size = 1
        status                    = "ENABLED"
        target_capacity           = 100
      }

      default_capacity_provider_strategy = {
        weight = 0
      }
    }
  }
}

Cluster

Define the cluster using the capacity provider:

module "ecs_cluster" {
  source  = "terraform-aws-modules/ecs/aws//modules/cluster"
  version = "5.11.0"

  create = true

  # Cluster
  cluster_name          = "my-cluster"

  # Capacity providers
  default_capacity_provider_use_fargate = true
  autoscaling_capacity_providers        = local.default_autoscaling_capacity_providers
}

We are here to help!

If these modules prove useful, star the repo and follow APrime on GitHub for future updates. If you'd like help implementing AI models or incorporating LLMs into your product, reach out or schedule a call; we look forward to exploring how we can help with your work.
