Deploy Meta Llama 3.1-8B on AWS Inferentia using Amazon EKS and vLLM

With the rise of large language models (LLMs) like Meta Llama 3.1, there is an increasing need for scalable, reliable, and cost-effective solutions to deploy and serve these models. AWS Trainium and AWS Inferentia based instances, combined with Amazon Elastic Kubernetes Service (Amazon EKS), provide a performant, low-cost framework for running LLMs efficiently in a containerized environment.

In this post, we walk through the steps to deploy the Meta Llama 3.1-8B model on Inferentia 2 instances using Amazon EKS.

Solution overview

The steps to implement the solution are as follows:

Create the EKS cluster.
Set up the Inferentia 2 node group.
Install the Neuron device plugin and scheduling extension.
Prepare the Docker image.
Deploy the Meta Llama 3.1-8B model.

We also demonstrate how to test the solution and monitor performance, and discuss options for scaling and multi-tenancy.

Prerequisites

Before you begin, make sure you have the following utilities installed on your local machine or development environment. If you don’t have them installed, follow the instructions provided for each tool.

The AWS Command Line Interface (AWS CLI)
eksctl
kubectl
docker

In this post, the examples use an inf2.48xlarge instance; make sure you have a sufficient service quota to use this instance. For more information on how to view and increase your quotas, refer to Amazon EC2 service quotas.
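As a quick sanity check before you start, the following sketch reports which of the tools listed above are missing from your PATH. The function name missing_tools is a hypothetical helper introduced here for illustration; it is not part of any AWS tooling.

```shell
# Hypothetical helper: print each named command that is not found on PATH.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || echo "$tool"
  done
}

# Check the four prerequisites from this post.
# Prints nothing if all of them are installed.
missing_tools aws eksctl kubectl docker
```

If the last command prints any names, install those tools before continuing.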

Create the EKS cluster

If you don’t have an existing EKS cluster, you can create one using eksctl. Adjust the following configuration to suit your needs, such as the Amazon EKS version, cluster name, and AWS Region. Before running the following commands, make sure you are authenticated to AWS:

export AWS_REGION=us-east-1
export CLUSTER_NAME=my-cluster
export EKS_VERSION=1.30
export AWS_ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text)

Then complete the following steps:

Create a new file named eks_cluster.yaml with the following command:

cat > eks_cluster.yaml <<EOF
---
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig

metadata:
  name: $CLUSTER_NAME
  region: $AWS_REGION
  version: "$EKS_VERSION"

addons:
- name: vpc-cni
  version: latest

cloudWatch:
  clusterLogging:
    enableTypes: ["*"]
    
iam:
  withOIDC: true
EOF

This configuration file contains the following parameters:

metadata.name – Specifies the name of your EKS cluster, which is set to my-cluster in this example. You can change it to a name of your choice.
metadata.region – Specifies the Region where you want to create the cluster. In this example, it’s set to us-east-1. Change this to your desired Region. Because we’re using Inf2 instances, you should choose a Region where those instances are available.
metadata.version – Specifies the Kubernetes version to use for the cluster. In this example, it’s set to 1.30. You can change this to a different version if needed, but make sure to use a version that is supported by Amazon EKS. For a list of supported versions, see Review release notes for Kubernetes versions on standard support.
addons.vpc-cni – Specifies the version of the Amazon VPC CNI (Container Network Interface) add-on to use. Setting it to latest will install the latest available version.
cloudWatch.clusterLogging – Enables cluster logging, which sends logs from the control plane to Amazon CloudWatch Logs.
iam.withOIDC – Enables the OpenID Connect (OIDC) provider for the cluster, which is required for certain AWS services to interact with the cluster.
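One way to confirm that your chosen Region offers the instance type used in this post is to query Amazon EC2 instance type offerings. The following is a minimal sketch; the function name instance_type_offered is a hypothetical helper, and it assumes the AWS CLI is installed and authenticated.

```shell
# Hypothetical helper: succeeds if the instance type is offered in the Region.
# $1 = instance type, $2 = Region
instance_type_offered() {
  offered=$(aws ec2 describe-instance-type-offerings \
    --location-type region \
    --filters "Name=instance-type,Values=$1" \
    --region "$2" \
    --query 'InstanceTypeOfferings[].InstanceType' \
    --output text)
  # The query returns the instance type name when it is offered, else nothing.
  [ "$offered" = "$1" ]
}

# Usage (requires AWS credentials):
#   instance_type_offered inf2.48xlarge "$AWS_REGION" && echo "available"
```

This only checks that the instance type exists in the Region; it does not check your service quota.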

After you create the eks_cluster.yaml file, you can create the EKS cluster by running the following command:

eksctl create cluster --config-file eks_cluster.yaml

This command will create the EKS cluster based on the configuration specified in the eks_cluster.yaml file. The process will take approximately 15–20 minutes to complete.

During the cluster creation process, eksctl will also create a default node group with a recommended instance type and configuration. However, in the next section, we create a separate node group with Inf2 instances, specifically for running the Meta Llama 3.1-8B model.

To complete the setup of kubectl, run the following code:

aws eks update-kubeconfig --region $AWS_REGION --name $CLUSTER_NAME

Set up the Inferentia 2 node group

To run the Meta Llama 3.1-8B model, you’ll need to create an Inferentia 2 node group. Complete the following steps:

First, retrieve the latest Amazon EKS optimized accelerated AMI ID:

export ACCELERATED_AMI=$(aws ssm get-parameter \
  --name /aws/service/eks/optimized-ami/$EKS_VERSION/amazon-linux-2-gpu/recommended/image_id \
  --region $AWS_REGION \
  --query "Parameter.Value" \
  --output text)

Create the Inferentia 2 node group using eksctl:

cat > eks_nodegroup.yaml <<EOF
---
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig

metadata:
  name: $CLUSTER_NAME
  region: $AWS_REGION
  version: "$EKS_VERSION"
    
managedNodeGroups:
  - name: neuron-group
    instanceType: inf2.48xlarge
    desiredCapacity: 1
    volumeSize: 512
    ami: "$ACCELERATED_AMI"
    amiFamily: AmazonLinux2
    iam:
      attachPolicyARNs:
      - arn:aws:iam::aws:policy/AmazonEKSWorkerNodePolicy
      - arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryReadOnly
      - arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore
      - arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess

    overrideBootstrapCommand: |
      #!/bin/bash

      /etc/eks/bootstrap.sh $CLUSTER_NAME
EOF

Run eksctl create nodegroup --config-file eks_nodegroup.yaml to create the node group.

This will take approximately 5 minutes.
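After the node group is created, you may want to confirm that the Inf2 node has joined the cluster. The following sketch filters nodes by the well-known node.kubernetes.io/instance-type label; the function name nodes_of_type is a hypothetical helper, and it assumes kubectl is already configured for your cluster.

```shell
# Hypothetical helper: list nodes of a given instance type.
# $1 = instance type, for example inf2.48xlarge
nodes_of_type() {
  kubectl get nodes -l "node.kubernetes.io/instance-type=$1" -o wide
}

# Usage (requires access to the cluster):
#   nodes_of_type inf2.48xlarge
```

The node should eventually report a STATUS of Ready.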

Install the Neuron device plugin and scheduling extension

To set up your EKS cluster for running workloads on Inferentia chips, you need to install two key components: the Neuron device plugin and the Neuron scheduling extension.

The Neuron device plugin is essential for exposing Neuron cores and devices as resources in Kubernetes. The Neuron scheduling extension facilitates the optimal scheduling of pods requiring multiple Neuron cores or devices.

For detailed instructions on installing and verifying these components, refer to Kubernetes environment setup for Neuron. Following these instructions will help you make sure your EKS cluster is properly configured to schedule and run workloads that require Neuron cores, such as the Meta Llama 3.1-8B model.
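Once the device plugin is running, the Neuron cores should appear as an allocatable resource on the node. As a rough check, and not a substitute for the verification steps in the Neuron documentation, the following sketch prints the aws.amazon.com/neuroncore count per node. The function name neuron_allocatable is a hypothetical helper.

```shell
# Hypothetical helper: show allocatable Neuron cores per node.
# The dots in the resource name are escaped so that kubectl treats
# aws.amazon.com/neuroncore as a single key.
neuron_allocatable() {
  kubectl get nodes \
    -o 'custom-columns=NODE:.metadata.name,NEURONCORES:.status.allocatable.aws\.amazon\.com/neuroncore'
}

# Usage (requires access to the cluster):
#   neuron_allocatable
```

An inf2.48xlarge instance has 12 Inferentia 2 chips with 2 Neuron cores each, so you would expect 24 in the NEURONCORES column for that node.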

Prepare the Docker image

To run the model, you’ll need to prepare a Docker image with the required dependencies. We use the following code to create an Amazon Elastic Container Registry (Amazon ECR) repository and then build a custom Docker image based on the AWS Deep Learning Container (DLC).

Set up environment variables:

export ECR_REPO_NAME=vllm-neuron

Create an ECR repository:

aws ecr create-repository --repository-name $ECR_REPO_NAME --region $AWS_REGION

Although the base Docker image already includes TorchServe, to keep things simple, this implementation uses the server provided by the vLLM repository, which is based on FastAPI. In your production scenario, you can connect TorchServe to vLLM with your own custom handler.

Create the Dockerfile:

cat > Dockerfile <<EOF
FROM public.ecr.aws/neuron/pytorch-inference-neuronx:2.1.2-neuronx-py310-sdk2.20.0-ubuntu20.04
# Clone the vllm repository
RUN git clone https://github.com/vllm-project/vllm.git
# Set the working directory
WORKDIR /vllm
RUN git checkout v0.6.0
# Set the environment variable
ENV VLLM_TARGET_DEVICE=neuron
# Install the dependencies
RUN python3 -m pip install -U -r requirements-neuron.txt
RUN python3 -m pip install .
# Modify the arg_utils.py file to support larger block_size option
RUN sed -i "/parser.add_argument('--block-size',/ {N;N;N;N;N;s/\[8, 16, 32\]/[8, 16, 32, 128, 256, 512, 1024, 2048, 4096, 8192]/}" vllm/engine/arg_utils.py
# Install ray
RUN python3 -m pip install ray
RUN python3 -m pip install -U "triton>=3.0.0"
# Set the entry point
ENTRYPOINT ["python3", "-m", "vllm.entrypoints.openai.api_server"]
EOF

Use the following commands to authenticate Docker to Amazon ECR, build your Docker image, and push it to the repository you created. The account ID and Region come from the environment variables set earlier, which avoids hard-coded values.

# Authenticate Docker to your ECR registry
aws ecr get-login-password --region $AWS_REGION | docker login --username AWS --password-stdin $AWS_ACCOUNT_ID.dkr.ecr.$AWS_REGION.amazonaws.com
# Build the Docker image
docker build -t ${ECR_REPO_NAME}:latest .

# Tag the image
docker tag ${ECR_REPO_NAME}:latest $AWS_ACCOUNT_ID.dkr.ecr.$AWS_REGION.amazonaws.com/${ECR_REPO_NAME}:latest
# Push the image to ECR
docker push $AWS_ACCOUNT_ID.dkr.ecr.$AWS_REGION.amazonaws.com/${ECR_REPO_NAME}:latest
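The deployment specification in the next section needs the full URI of the image you just pushed. The following sketch assembles it from the same values used in the tag and push commands; the function name image_uri is a hypothetical helper.

```shell
# Hypothetical helper: build the ECR image URI.
# $1 = account ID, $2 = Region, $3 = repository name, $4 = tag (default: latest)
image_uri() {
  printf '%s.dkr.ecr.%s.amazonaws.com/%s:%s\n' "$1" "$2" "$3" "${4:-latest}"
}

# Usage with the variables exported earlier in this post:
#   export IMAGE_URI=$(image_uri "$AWS_ACCOUNT_ID" "$AWS_REGION" "$ECR_REPO_NAME")
```

You can then paste the value of IMAGE_URI into the image field of the deployment.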

Deploy the Meta Llama 3.1-8B model

With the setup complete, you can now deploy the model using a Kubernetes deployment. The following is an example deployment specification that requests specific resources and sets up multiple replicas:

cat > neuronx-vllm-deployment.yaml <<EOF
apiVersion: apps/v1
kind: Deployment
metadata:
  name: neuronx-vllm-deployment
  labels:
    app: neuronx-vllm
spec:
  replicas: 3
  selector:
    matchLabels:
      app: neuronx-vllm
  template:
    metadata:
      labels:
        app: neuronx-vllm
    spec:
      schedulerName: my-scheduler
      containers:
      - name: neuronx-vllm
        image: <replace with the url to the docker image you pushed to the ECR>
        resources:
          limits:
            cpu: 32
            memory: "64G"
            aws.amazon.com/neuroncore: "8"
          requests:
            cpu: 32
            memory: "64G"
            aws.amazon.com/neuroncore: "8"
        ports:
        - containerPort: 8000
        env:
        - name: HF_TOKEN
          value: <your huggingface token>
        - name: FI_EFA_FORK_SAFE
          value: "1"
        args:
        - "--model"
        - "meta-llama/Meta-Llama-3.1-8B"
        - "--tensor-parallel-size"
        - "8"
        - "--max-num-seqs"
        - "64"
        - "--max-model-len"
        - "8192"
        - "--block-size"
        - "8192"
EOF

Apply the deployment specification with kubectl apply -f neuronx-vllm-deployment.yaml.

This deployment configuration sets up multiple replicas of the Meta Llama 3.1-8B model using tensor parallelism (TP) of 8. In the current setup, we’re hosting three copies of the model across the available Neuron cores. This configuration allows for the efficient utilization of the hardware resources while enabling multiple concurrent inference requests.

The use of TP=8 helps in distributing the model across multiple Neuron cores, which improves inference performance and throughput. The specific number of replicas and cores used may vary depending on your particular hardware setup and performance requirements.

To modify the setup, update the neuronx-vllm-deployment.yaml file, adjusting the replicas field in the deployment specification and, if you change the number of cores per replica, the aws.amazon.com/neuroncore requests and limits together with the --tensor-parallel-size argument in the container specification. Always verify that the total number of cores used (replicas * cores per replica) doesn’t exceed your available hardware resources and that the number of attention heads is evenly divisible by the TP degree for optimal performance.
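The two checks described above are simple arithmetic, and can be sketched as follows. The function name check_plan is a hypothetical helper; the figure of 24 Neuron cores is for a single inf2.48xlarge instance, and 32 is the number of attention heads in Meta Llama 3.1-8B.

```shell
# Hypothetical helper: validate a replica and tensor-parallel plan.
# $1 = replicas, $2 = TP degree (Neuron cores per replica),
# $3 = Neuron cores available, $4 = attention heads of the model
check_plan() {
  needed=$(( $1 * $2 ))
  if [ "$needed" -gt "$3" ]; then
    echo "over capacity: need $needed NeuronCores, have $3"
    return 1
  fi
  if [ $(( $4 % $2 )) -ne 0 ]; then
    echo "attention heads ($4) not divisible by TP degree ($2)"
    return 1
  fi
  echo "ok: $needed of $3 NeuronCores used"
}

# This post's setup: 3 replicas with TP=8 on one inf2.48xlarge.
check_plan 3 8 24 32
```

For the configuration in this post, the check prints "ok: 24 of 24 NeuronCores used", which shows that three replicas fully occupy one instance.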

The deployment also includes environment variables for the Hugging Face token and EFA fork safety. The args section (see the preceding code) configures the model and its parameters, including an increased max model length and block size of 8192.

Test the deployment

After you deploy the model, it’s important to monitor its progress and verify its readiness. Complete the following steps:

Check the deployment status:

kubectl get deployments

This will show you the desired, current, and up-to-date number of replicas.

Monitor the pods:

kubectl get pods -l app=neuronx-vllm -w

The -w flag will watch for changes. You’ll see the pods transitioning from "Pending" to "ContainerCreating" to "Running".

Check the logs of a specific pod:

kubectl logs <pod-name>

The initial startup process takes around 15 minutes. During this time, the model is being compiled for the Neuron cores. You’ll see the compilation progress in the logs.

To support proper management of your vLLM pods, you should configure Kubernetes probes in your deployment. These probes help Kubernetes determine when a pod is ready to serve traffic, when it’s alive, and when it has successfully started.

Add the following probe configurations to your container spec in the deployment YAML:

spec:
  containers:
  - name: neuronx-vllm
    # ... other container configurations ...
    readinessProbe:
      httpGet:
        path: /health
        port: 8000
      initialDelaySeconds: 1800
      periodSeconds: 10
    livenessProbe:
      httpGet:
        path: /health
        port: 8000
      initialDelaySeconds: 1800
      periodSeconds: 15
    startupProbe:
      httpGet:
        path: /health
        port: 8000
      initialDelaySeconds: 1800
      failureThreshold: 30
      periodSeconds: 10

The configuration consists of three probes:

Readiness probe – Checks whether the pod is ready to serve traffic. As configured, it starts checking after 1,800 seconds (30 minutes) and repeats every 10 seconds.
Liveness probe – Verifies that the pod is still running correctly. As configured, it begins after 1,800 seconds and checks every 15 seconds.
Startup probe – Gives the application time to start up. As configured, it begins after 1,800 seconds and then tolerates up to 30 failed checks at 10-second intervals (about 5 more minutes) before the container is considered failed.

These probes assume that your vLLM application exposes a /health endpoint. If it doesn’t, you’ll need to implement one or adjust the probe configurations accordingly.

With these probes in place, Kubernetes will do the following:

Only send traffic to pods that are ready
Restart pods that are no longer alive
Allow sufficient time for initial startup and compilation

This configuration helps facilitate high availability and proper functioning of your vLLM deployment.
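When tuning these values, it can help to work out roughly how long a container is given to start before Kubernetes restarts it. As an approximation that ignores probe timeouts, the budget is initialDelaySeconds plus failureThreshold times periodSeconds. The function name startup_budget_seconds below is a hypothetical helper.

```shell
# Hypothetical helper: approximate seconds before a startup probe gives up.
# $1 = initialDelaySeconds, $2 = failureThreshold, $3 = periodSeconds
startup_budget_seconds() {
  echo $(( $1 + $2 * $3 ))
}

# Values from the startupProbe in the YAML above: 1800 + 30 * 10
startup_budget_seconds 1800 30 10   # prints 2100, that is 35 minutes
```

If your model compiles faster or slower than the roughly 15 minutes observed in this post, adjust the three values so that the budget comfortably covers your startup time.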

Now you’re ready to access the pods.

Identify the pod that is running your inference server. You can use the following command to list the pods with the neuronx-vllm label:

kubectl get pods -l app=neuronx-vllm

This command will output a list of pods, and you’ll need the name of the pod you want to forward.

Use kubectl port-forward to forward the port from the Kubernetes pod to your local machine. Use the name of your pod from the previous step:

kubectl port-forward <pod-name> 8000:8000

This command forwards port 8000 on the pod to port 8000 on your local machine. You can now access the inference server at http://localhost:8000.

Because we’re forwarding a port directly from a single pod, requests will only be sent to that specific pod. As a result, traffic won’t be balanced across all replicas of your deployment. This is suitable for testing and development purposes, but it doesn’t utilize the deployment efficiently in a production scenario where load balancing across multiple replicas is crucial to handle higher traffic and provide fault tolerance.

In a production environment, a proper solution like a Kubernetes service with a LoadBalancer or Ingress should be used to distribute traffic across available pods. This facilitates the efficient utilization of resources, a balanced load, and improved reliability of the inference service.
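As one possible starting point, the following sketch writes a Service manifest of type LoadBalancer that selects the pods labeled app: neuronx-vllm and forwards port 80 to the container port 8000. The file name and the Service name neuronx-vllm-service are assumptions made for illustration; a production setup will likely also need access controls and load balancer settings that are out of scope here.

```shell
# Write a minimal Service manifest (names are illustrative assumptions).
cat > neuronx-vllm-service.yaml <<EOF
apiVersion: v1
kind: Service
metadata:
  name: neuronx-vllm-service
spec:
  type: LoadBalancer
  selector:
    app: neuronx-vllm   # matches the pod label in the deployment
  ports:
  - port: 80            # port exposed by the load balancer
    targetPort: 8000    # containerPort of the vLLM server
EOF
```

You could then apply it with kubectl apply -f neuronx-vllm-service.yaml and look up the external address with kubectl get service neuronx-vllm-service.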

You can test the inference server by making a request from your local machine. The following code is an example of how to make an inference call using curl:

curl -X POST http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
  "model": "meta-llama/Meta-Llama-3.1-8B",
  "prompt": "Explain the theory of relativity.",
  "max_tokens": 100
}'

This setup allows you to test and interact with your inference server locally without needing to expose your service publicly or set up complex networking configurations. For production use, make sure that load balancing and scalability considerations are addressed appropriately.
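Because model compilation can take several minutes, a request sent too early will fail. The following sketch polls the /health endpoint used by the probes until it returns HTTP 200. The function name wait_for_http_200 is a hypothetical helper, and the attempt count and interval are arbitrary assumptions.

```shell
# Hypothetical helper: poll a URL until it returns HTTP 200.
# $1 = URL, $2 = maximum attempts, $3 = seconds to sleep between attempts
wait_for_http_200() {
  attempt=0
  while [ "$attempt" -lt "$2" ]; do
    # -w prints only the status code; connection failures yield 000.
    code=$(curl -s -o /dev/null -w '%{http_code}' "$1" || true)
    [ "$code" = "200" ] && return 0
    attempt=$(( attempt + 1 ))
    sleep "$3"
  done
  return 1
}

# Usage with the port-forward from this section still running:
#   wait_for_http_200 http://localhost:8000/health 120 15 && echo "server is ready"
```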

For more information about routing, see Route application and HTTP traffic with Application Load Balancers.

Monitor performance

AWS offers powerful tools to monitor and optimize your vLLM deployment on Inferentia chips. The AWS Neuron Monitor container, used with Prometheus and Grafana, provides advanced visualization of your ML application performance. Additionally, CloudWatch Container Insights for Neuron offers deep, Neuron-specific analytics.

These tools allow you to track Inferentia chip utilization, model performance, and overall cluster health. By analyzing this data, you can make informed decisions about resource allocation and scaling to meet your workload requirements.

Remember that the initial 15-minute startup time for model compilation is a one-time process per deployment, with subsequent restarts being faster due to caching.

To learn more about setting up and using these monitoring capabilities, see Scale and simplify ML workload monitoring on Amazon EKS with AWS Neuron Monitor container.

Scaling and multi-tenancy

As your application’s demand grows, you may need to scale your deployment to handle more requests. Scaling your Meta Llama 3.1-8B deployment on Amazon EKS with Neuron cores involves two coordinated steps:

Increasing the number of nodes in your EKS node group to provide additional Neuron cores
Increasing the number of replicas in your deployment to utilize these new resources

You can scale your deployment manually. Use the AWS Management Console or AWS CLI to increase the size of your EKS node group. When new nodes are available, scale your deployment with the following code:

kubectl scale deployment neuronx-vllm-deployment --replicas=<new-number>

Alternatively, you can set up auto scaling:

Configure auto scaling for your EKS node group to automatically add nodes based on resource demands
Use Horizontal Pod Autoscaling (HPA) to automatically adjust the number of replicas in your deployment

You can configure the node group’s auto scaling to respond to increased CPU, memory, or custom metric demands, automatically provisioning new nodes with Neuron cores as needed. This makes sure that as the number of incoming requests grows, both your infrastructure and your deployment can scale accordingly.
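For the pod side, a minimal way to try Horizontal Pod Autoscaling is the kubectl autoscale command. The following sketch wraps it for the deployment in this post; the function name autoscale_vllm and the example thresholds are assumptions. It also assumes a metrics source such as the Kubernetes Metrics Server is installed, and CPU utilization may be a poor proxy for load on the Neuron cores, so treat it as a starting point only.

```shell
# Hypothetical helper: create an HPA for this post's deployment.
# $1 = minimum replicas, $2 = maximum replicas,
# $3 = target average CPU utilization in percent
autoscale_vllm() {
  kubectl autoscale deployment neuronx-vllm-deployment \
    --min="$1" --max="$2" --cpu-percent="$3"
}

# Usage (requires access to the cluster):
#   autoscale_vllm 3 6 70
```

Additional replicas can only be scheduled if the node group has free Neuron cores, which is why pod and node scaling need to be coordinated.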

Example scaling solutions include:

Cluster Autoscaler with Karpenter – Though not currently installed in this setup, Karpenter offers more flexible and efficient auto scaling for future consideration. It can dynamically provision the right number of nodes needed for your Neuron workloads based on pending pods and custom scheduling constraints. For more details, see Scale cluster compute with Karpenter and Cluster Autoscaler.
Multi-cluster federation – For even larger scale, you could set up multiple EKS clusters, each with its own Neuron-equipped nodes, and use a multi-cluster federation tool to distribute traffic among them.

You should consider the following when scaling:

Alignment of resources – Make sure that your scaling strategy for both nodes and pods aligns with the Neuron core requirements (multiples of 8 for optimal performance). This figure is model dependent; it reflects the TP degree of 8 used for Meta Llama 3.1-8B in this post.
Compilation time – Remember the 15-minute compilation time for new pods when planning your scaling strategy. Consider pre-warming pods during off-peak hours.
Cost management – Monitor costs closely as you scale, because Neuron-equipped instances can be expensive.
Performance testing – Conduct thorough performance testing as you scale to verify that increased capacity translates to improved throughput and reduced latency.
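To illustrate the alignment point, the following sketch estimates how many nodes a given replica count requires. It assumes each replica must fit entirely on one node, so only whole replicas per node are counted. The function name nodes_needed is a hypothetical helper.

```shell
# Hypothetical helper: nodes required for a number of replicas.
# $1 = replicas, $2 = Neuron cores per replica, $3 = Neuron cores per node
nodes_needed() {
  per_node=$(( $3 / $2 ))                      # whole replicas that fit on one node
  echo $(( ($1 + per_node - 1) / per_node ))   # ceiling division
}

nodes_needed 3 8 24   # prints 1: this post's setup fits on one inf2.48xlarge
nodes_needed 7 8 24   # prints 3: seven replicas need three nodes
```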

By coordinating the scaling of both your node group and your deployment, you can effectively handle increased request volumes while maintaining optimal performance. The auto scaling capabilities of both your node group and deployment can work together to automatically adjust your cluster’s capacity based on incoming request volumes, providing a more responsive and efficient scaling solution.

Clean up

Use the following code to delete the cluster created in this solution:

eksctl delete cluster --name $CLUSTER_NAME --region $AWS_REGION

Conclusion

Deploying LLMs like Meta Llama 3.1-8B at scale poses significant computational challenges. Using Inferentia 2 instances and Amazon EKS can help overcome these challenges by enabling efficient model deployment in a containerized, scalable, and multi-tenant environment.

This solution combines the exceptional performance and cost-effectiveness of Inferentia 2 chips with the robust and flexible landscape of Amazon EKS. Inferentia 2 chips deliver high throughput and low latency inference, ideal for LLMs. Amazon EKS provides dynamic scaling, efficient resource utilization, and multi-tenancy capabilities.

The process involves setting up an EKS cluster, configuring an Inferentia 2 node group, installing Neuron components, and deploying the model as a Kubernetes deployment. This approach facilitates high availability, resilience, and efficient resource sharing for language model services, while allowing for automatic scaling, load balancing, and self-healing capabilities.

For the complete code and detailed implementation steps, visit the GitHub repository.

About the Authors

Dmitri Laptev is a Senior GenAI Solutions Architect at AWS, based in Munich. With 17 years of experience in the IT industry, his interest in AI and ML dates back to his university years, fostering a long-standing passion for these technologies. Dmitri is enthusiastic about cloud computing and the ever-evolving landscape of technology.

Maurits de Groot is a Solutions Architect at Amazon Web Services, based out of Amsterdam. He specializes in machine learning-related topics and has a predilection for startups. In his spare time, he enjoys skiing and bouldering.

Ziwen Ning is a Senior Software Development Engineer at AWS. He currently focuses on enhancing the AI/ML experience through the integration of AWS Neuron with containerized environments and Kubernetes. In his free time, he enjoys challenging himself with kickboxing, badminton, and other various sports, and immersing himself in music.

Jianying Lang is a Principal Solutions Architect at the AWS Worldwide Specialist Organization (WWSO). She has over 15 years of working experience in the HPC and AI fields. At AWS, she focuses on helping customers deploy, optimize, and scale their AI/ML workloads on accelerated computing instances. She is passionate about combining the techniques in HPC and AI fields. Jianying holds a PhD in Computational Physics from the University of Colorado at Boulder.
