The use of large language models (LLMs) and generative AI has exploded over the last year. With the release of powerful publicly available foundation models, tools for training, fine-tuning, and hosting your own LLM have also become democratized. Using vLLM on AWS Trainium and Inferentia makes it possible to host LLMs for high-performance inference and scalability.
In this post, we will walk you through how you can quickly deploy Meta’s latest Llama models, using vLLM on an Amazon Elastic Compute Cloud (Amazon EC2) Inf2 instance. For this example, we will use the 1B version, but other sizes can be deployed using these steps, along with other popular LLMs.
Deploy vLLM on AWS Trainium and Inferentia EC2 instances
In these sections, you will be guided through using vLLM on an AWS Inferentia EC2 instance to deploy Meta’s newest Llama 3.2 model. You will learn how to request access to the model, create a Docker container to use vLLM to deploy the model and how to run online and offline inference on the model. We will also talk about performance tuning the inference graph.
Prerequisite: Hugging Face account and model access
To use the meta-llama/Llama-3.2-1B model, you’ll need a Hugging Face account and access to the model. Please go to the model card, sign up, and agree to the model license. You will then need a Hugging Face token, which you can get by following these steps. When you get to the Save your Access Token screen, as shown in the following figure, make sure you copy the token because it will not be shown again.
Create an EC2 instance
You can create an EC2 Instance by following the guide. A few things to note:
- If this is your first time using inf/trn instances, you will need to request a quota increase.
- You will use inf2.xlarge as your instance type. inf2.xlarge instances are only available in these AWS Regions.
- Increase the gp3 volume to 100 GiB.
- You will use Deep Learning AMI Neuron (Ubuntu 22.04) as your AMI, as shown in the following figure.
After the instance is launched, you can connect to it to access the command line. In the next step, you’ll use Docker (preinstalled on this AMI) to run a vLLM container image for neuron.
Start vLLM server
You will use Docker to create a container with all the tools needed to run vLLM. Create a Dockerfile using the following command:
Then run:
Building the image will take about 10 minutes. After it’s done, use the new Docker image (replace YOUR_TOKEN_HERE with the token from Hugging Face):
You can now start the vLLM server with the following command:
This command runs vLLM with the following parameters:
- serve meta-llama/Llama-3.2-1B: The Hugging Face modelID of the model that is being deployed for inference.
- --device neuron: Configures vLLM to run on the neuron device.
- --tensor-parallel-size 2: Sets the number of partitions for tensor parallelism. inf2.xlarge has 1 neuron device and each neuron device has 2 neuron cores.
- --max-model-len 4096: This is set to the maximum sequence length (input tokens plus output tokens) for which to compile the model.
- --block-size 8: For neuron devices, this is internally set to the max-model-len.
- --max-num-seqs 32: This is set to the hardware batch size or a desired level of concurrency that the model server needs to handle.
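As a rough illustration of how these parameters fit together, the following Python sketch assembles the same argument list and derives the tensor parallel size from the core count stated above for inf2.xlarge. The helper name build_serve_command and the two constants are hypothetical, introduced only for this example; the flags themselves are the ones described in this post.

```python
import shlex

# Assumption: values for inf2.xlarge as stated in this post
# (1 Neuron device, 2 Neuron cores per device).
NEURON_DEVICES = 1
CORES_PER_DEVICE = 2


def build_serve_command(model_id, max_model_len=4096, max_num_seqs=32, block_size=8):
    """Return the `vllm serve` argument list described in this post."""
    # One tensor-parallel partition per available Neuron core.
    tensor_parallel_size = NEURON_DEVICES * CORES_PER_DEVICE
    return [
        "vllm", "serve", model_id,
        "--device", "neuron",
        "--tensor-parallel-size", str(tensor_parallel_size),
        "--block-size", str(block_size),
        "--max-model-len", str(max_model_len),
        "--max-num-seqs", str(max_num_seqs),
    ]


command = build_serve_command("meta-llama/Llama-3.2-1B")
# Print the command as a single shell-ready line.
print(shlex.join(command))
```

On a larger instance type, you would adjust the two constants to match its device and core counts before reusing the helper.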
The first time you load a model, if there isn’t a previously compiled model, it will need to be compiled. This compiled model can optionally be saved so the compilation step is not necessary if the container is recreated. After everything is done and the model server is running, you should see the following logs:
This means that the model server is running, but it isn’t yet processing requests because none have been received. You can now detach from the container by pressing ctrl + p and ctrl + q.
Inference
When you started the Docker container, you ran it with the option -p 8000:8000. This told Docker to forward port 8000 from the container to port 8000 on the host (your EC2 instance). When you run the following command, you should see that the model server with meta-llama/Llama-3.2-1B is running.
This should return something like:
Now, send it a prompt:
You should get back a response similar to the following from vLLM:
Offline inference with vLLM
Another way to use vLLM on Inferentia is by sending a few requests all at the same time in a script. This is useful for automation or when you have a batch of prompts that you want to send all at the same time.
You can reattach to your Docker container and stop the online inference server with the following:
At this point, you should see a blank cursor. Press ctrl + c to stop the server, and you should be back at the bash prompt in the container. Create a file for using the offline inference engine:
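A minimal sketch of what offline_inference.py could contain is shown here. It assumes that vLLM's Python LLM class accepts keyword arguments mirroring the vllm serve flags used earlier (including device="neuron"), which may vary between vLLM versions. The four prompts and the sampling values are hypothetical placeholders.

```python
import importlib.util

MODEL_ID = "meta-llama/Llama-3.2-1B"

# Hypothetical example prompts; replace them with your own batch.
PROMPTS = [
    "Hello, my name is",
    "The capital of France is",
    "The largest planet in the solar system is",
    "The future of AI is",
]


def engine_kwargs():
    """Engine settings mirroring the flags passed to `vllm serve` earlier."""
    # Assumption: these keyword names match the CLI flags in your vLLM version.
    return {
        "model": MODEL_ID,
        "device": "neuron",
        "tensor_parallel_size": 2,
        "block_size": 8,
        "max_model_len": 4096,
        "max_num_seqs": 32,
    }


def main():
    # Imported lazily so the settings above can be inspected without vLLM.
    from vllm import LLM, SamplingParams

    llm = LLM(**engine_kwargs())
    # Hypothetical sampling values, chosen only for illustration.
    sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=128)
    # generate() takes the whole batch and returns one result per prompt.
    for output in llm.generate(PROMPTS, sampling_params):
        print(f"Prompt: {output.prompt!r}")
        print(f"Generated text: {output.outputs[0].text!r}")


if __name__ == "__main__":
    if importlib.util.find_spec("vllm") is None:
        print("vLLM is not installed here; run this inside the vLLM container.")
    else:
        main()
```

Because the engine settings match the server settings, a model that was already compiled for the server may not need to be compiled again, depending on how the compilation cache is configured.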
Now, run the script with python offline_inference.py, and you should get back responses for the four prompts. This may take a minute as the model needs to be started again.
You can now type exit and press return, and then press ctrl + c to shut down the Docker container and go back to your inf2 instance.
Clean up
Now that you’re done testing the Llama 3.2 1B LLM, you should terminate your EC2 instance to avoid additional charges.
Performance tuning for variable sequence lengths
You will probably have to process variable-length sequences during LLM inference. The Neuron SDK generates buckets and a computation graph that works with the shape and size of the buckets. To fine-tune the performance based on the length of input and output tokens in the inference requests, you can set two kinds of buckets, corresponding to the two phases of LLM inference, through the following environment variables as a list of integers:
- NEURON_CONTEXT_LENGTH_BUCKETS corresponds to the context encoding phase. Set this to the estimated length of prompts during inference.
- NEURON_TOKEN_GEN_BUCKETS corresponds to the token generation phase. Set this to a range of powers of two within your generation length.
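To make the bucket idea more concrete, here is a small Python sketch that builds a powers-of-two range for the token generation buckets and shows how a sequence length would map to the smallest bucket that fits it. The helper names, the sample lengths, and the comma-separated value format are assumptions made for this example; check the bucketing guide for the exact format your version expects. The selection logic illustrates the general bucketing idea rather than the Neuron SDK's internal implementation.

```python
import bisect


def power_of_two_buckets(smallest, largest):
    """Return the powers of two in [smallest, largest], in ascending order."""
    buckets = []
    size = 1
    while size <= largest:
        if size >= smallest:
            buckets.append(size)
        size *= 2
    return buckets


def to_env_value(buckets):
    """Format buckets as a comma-separated string (assumed format)."""
    return ",".join(str(b) for b in buckets)


def select_bucket(buckets, length):
    """Return the smallest bucket that fits `length`, or None if none does."""
    index = bisect.bisect_left(buckets, length)
    return buckets[index] if index < len(buckets) else None


# Hypothetical values: prompts estimated at up to 1024 or 2048 tokens,
# generation lengths covered up to the 4096-token model length.
context_buckets = [1024, 2048]
token_gen_buckets = power_of_two_buckets(256, 4096)

env = {
    "NEURON_CONTEXT_LENGTH_BUCKETS": to_env_value(context_buckets),
    "NEURON_TOKEN_GEN_BUCKETS": to_env_value(token_gen_buckets),
}
for name, value in env.items():
    print(f"{name}={value}")

# A 300-token sequence falls into the 512 bucket in this sketch.
print(select_bucket(token_gen_buckets, 300))
```

Fewer, well-chosen buckets keep compilation time down, while buckets that closely track your real sequence lengths reduce padding.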
You can use the docker run command to set the environment variables while starting the vLLM server (remember to replace YOUR_TOKEN_HERE with your Hugging Face token):
You can then start the server using the same command:
vllm serve meta-llama/Llama-3.2-1B --device neuron --tensor-parallel-size 2 --block-size 8 --max-model-len 4096 --max-num-seqs 32
As the model graph has changed, the model will need to be recompiled. If the container was terminated, the model will be downloaded again. You can then detach from the container by pressing ctrl + p and ctrl + q, and send a request using the same command:
curl localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{"model": "meta-llama/Llama-3.2-1B", "prompt": "What is Gen AI?", "temperature":0, "max_tokens": 128}' | jq '.choices[0].text'
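The curl request above can also be sent from a script. The following is a minimal Python sketch using only the standard library; it assumes the server is reachable at localhost:8000 as configured earlier, and the function names completion_request and complete are hypothetical helpers for this example.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"


def completion_request(prompt, model="meta-llama/Llama-3.2-1B",
                       max_tokens=128, temperature=0):
    """Build the same POST request that the curl command sends."""
    payload = {
        "model": model,
        "prompt": prompt,
        "temperature": temperature,
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{BASE_URL}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt, timeout=120):
    """Send the prompt and return the generated text of the first choice."""
    request = completion_request(prompt)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    # Equivalent to the jq filter '.choices[0].text'.
    return body["choices"][0]["text"]
```

With the server running, calling print(complete("What is Gen AI?")) from the instance should print the generated text.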
For more information about how to configure the buckets, see the developer guide on bucketing. Note that NEURON_CONTEXT_LENGTH_BUCKETS corresponds to context_length_estimate in the documentation and NEURON_TOKEN_GEN_BUCKETS corresponds to n_positions in the documentation.
Conclusion
You’ve just seen how to deploy meta-llama/Llama-3.2-1B using vLLM on an Amazon EC2 Inf2 instance. If you’re interested in deploying other popular LLMs from Hugging Face, you can replace the modelID in the vllm serve command. More details on the integration between the Neuron SDK and vLLM can be found in the Neuron user guide for continuous batching and the vLLM guide for Neuron.
After you’ve identified a model that you want to use in production, you will want to deploy it with autoscaling, observability, and fault tolerance. You can also refer to this blog post to understand how to deploy vLLM on Inferentia through Amazon Elastic Kubernetes Service (Amazon EKS). In the next post of this series, we’ll go into using Amazon EKS with Ray Serve to deploy vLLM into production with autoscaling and observability.
About the authors
Omri Shiv is an Open Source Machine Learning Engineer focusing on helping customers through their AI/ML journey. In his free time, he likes cooking, tinkering with open source and open hardware, and listening to and playing music.
Pinak Panigrahi works with customers to build ML-driven solutions to solve strategic business problems on AWS. In his current role, he works on optimizing training and inference of generative AI models on AWS AI chips.