Deploy Meta Llama 3.1-8B on AWS Inferentia using Amazon EKS and vLLM

November 26, 2024

by Maurits de Groot Amazon AWS

With the rise of large language models (LLMs) like Meta Llama 3.1, there is an increasing need for scalable, reliable, and cost-effective solutions to deploy and serve these models. AWS Trainium and AWS Inferentia based instances, combined with Amazon Elastic Kubernetes Service (Amazon EKS), provide a performant and low-cost framework to run LLMs efficiently in a containerized environment.

In this post, we walk through the steps to deploy the Meta Llama 3.1-8B model on Inferentia 2 instances using Amazon EKS.

Solution overview

The steps to implement the solution are as follows:

Create the EKS cluster.
Set up the Inferentia 2 node group.
Install the Neuron device plugin and scheduling extension.
Prepare the Docker image.
Deploy the Meta Llama 3.1-8B model.

We also demonstrate how to test the solution and monitor performance, and discuss options for scaling and multi-tenancy.

Prerequisites

Before you begin, make sure you have the following utilities installed on your local machine or development environment. If you don’t have them installed, follow the instructions provided for each tool.

The AWS Command Line Interface (AWS CLI)
eksctl
kubectl
docker

In this post, the examples use an inf2.48xlarge instance; make sure you have a sufficient service quota to use this instance. For more information on how to view and increase your quotas, refer to Amazon EC2 service quotas.

Create the EKS cluster

If you don’t have an existing EKS cluster, you can create one using eksctl. Adjust the following configuration to suit your needs, such as the Amazon EKS version, cluster name, and AWS Region. Before running the following commands, make sure you are authenticated with AWS:

export AWS_REGION=us-east-1
export CLUSTER_NAME=my-cluster
export EKS_VERSION=1.30
export AWS_ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text)

Then complete the following steps:

Create a new file named eks_cluster.yaml with the following command:

cat > eks_cluster.yaml <<EOF
---
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig

metadata:
  name: $CLUSTER_NAME
  region: $AWS_REGION
  version: "$EKS_VERSION"

addons:
- name: vpc-cni
  version: latest

cloudWatch:
  clusterLogging:
    enableTypes: ["*"]
    
iam:
  withOIDC: true
EOF

This configuration file contains the following parameters:

metadata.name – Specifies the name of your EKS cluster, which is set to my-cluster in this example. You can change it to a name of your choice.
metadata.region – Specifies the Region where you want to create the cluster. In this example, it’s set to us-east-1. Change this to your desired Region. Because we’re using Inf2 instances, you should choose a Region where those instances are available.
metadata.version – Specifies the Kubernetes version to use for the cluster. In this example, it’s set to 1.30. You can change this to a different version if needed, but make sure to use a version that is supported by Amazon EKS. For a list of supported versions, see Review release notes for Kubernetes versions on standard support.
addons.vpc-cni – Specifies the version of the Amazon VPC CNI (Container Network Interface) add-on to use. Setting it to latest will install the latest available version.
cloudWatch.clusterLogging – Enables cluster logging, which sends logs from the control plane to Amazon CloudWatch Logs.
iam.withOIDC – Enables the OpenID Connect (OIDC) provider for the cluster, which is required for certain AWS services to interact with the cluster.

After you create the eks_cluster.yaml file, you can create the EKS cluster by running the following command:

eksctl create cluster --config-file eks_cluster.yaml

This command will create the EKS cluster based on the configuration specified in the eks_cluster.yaml file. The process will take approximately 15–20 minutes to complete.

During the cluster creation process, eksctl will also create a default node group with a recommended instance type and configuration. However, in the next section, we create a separate node group with Inf2 instances, specifically for running the Meta Llama 3.1-8B model.

To complete the setup of kubectl, run the following code:

aws eks update-kubeconfig --region $AWS_REGION --name $CLUSTER_NAME

Set up the Inferentia 2 node group

To run the Meta Llama 3.1-8B model, you’ll need to create an Inferentia 2 node group. Complete the following steps:

First, retrieve the latest Amazon EKS optimized accelerated AMI ID:

export ACCELERATED_AMI=$(aws ssm get-parameter \
--name /aws/service/eks/optimized-ami/$EKS_VERSION/amazon-linux-2-gpu/recommended/image_id \
--region $AWS_REGION \
--query "Parameter.Value" \
--output text)

Create the Inferentia 2 node group using eksctl:

cat > eks_nodegroup.yaml <<EOF
---
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig

metadata:
  name: $CLUSTER_NAME
  region: $AWS_REGION
  version: "$EKS_VERSION"
    
managedNodeGroups:
  - name: neuron-group
    instanceType: inf2.48xlarge
    desiredCapacity: 1
    volumeSize: 512
    ami: "$ACCELERATED_AMI"
    amiFamily: AmazonLinux2
    iam:
      attachPolicyARNs:
      - arn:aws:iam::aws:policy/AmazonEKSWorkerNodePolicy
      - arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryReadOnly
      - arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore
      - arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess

    overrideBootstrapCommand: |
      #!/bin/bash

      /etc/eks/bootstrap.sh $CLUSTER_NAME
EOF

Run eksctl create nodegroup --config-file eks_nodegroup.yaml to create the node group.

This will take approximately 5 minutes.

Install the Neuron device plugin and scheduling extension

To set up your EKS cluster for running workloads on Inferentia chips, you need to install two key components: the Neuron device plugin and the Neuron scheduling extension.

The Neuron device plugin is essential for exposing Neuron cores and devices as resources in Kubernetes. The Neuron scheduling extension facilitates the optimal scheduling of pods requiring multiple Neuron cores or devices.

For detailed instructions on installing and verifying these components, refer to Kubernetes environment setup for Neuron. Following these instructions will help you make sure your EKS cluster is properly configured to schedule and run workloads that require Neuron devices, such as the Meta Llama 3.1-8B model.

Prepare the Docker image

To run the model, you’ll need to prepare a Docker image with the required dependencies. We use the following code to create an Amazon Elastic Container Registry (Amazon ECR) repository and then build a custom Docker image based on the AWS Deep Learning Container (DLC).

Set up environment variables:

export ECR_REPO_NAME=vllm-neuron

Create an ECR repository:

aws ecr create-repository --repository-name $ECR_REPO_NAME --region $AWS_REGION

Although the base Docker image already includes TorchServe, to keep things simple, this implementation uses the server provided by the vLLM repository, which is based on FastAPI. In your production scenario, you can connect TorchServe to vLLM with your own custom handler.

Create the Dockerfile:

cat > Dockerfile <<EOF
FROM public.ecr.aws/neuron/pytorch-inference-neuronx:2.1.2-neuronx-py310-sdk2.20.0-ubuntu20.04
# Clone the vllm repository
RUN git clone https://github.com/vllm-project/vllm.git
# Set the working directory
WORKDIR /vllm
RUN git checkout v0.6.0
# Set the environment variable
ENV VLLM_TARGET_DEVICE=neuron
# Install the dependencies
RUN python3 -m pip install -U -r requirements-neuron.txt
RUN python3 -m pip install .
# Modify the arg_utils.py file to support larger block_size option
RUN sed -i "/parser.add_argument('--block-size',/ {N;N;N;N;N;s/\[8, 16, 32\]/[8, 16, 32, 128, 256, 512, 1024, 2048, 4096, 8192]/}" vllm/engine/arg_utils.py
# Install ray
RUN python3 -m pip install ray
RUN pip install -U "triton>=3.0.0"
# Set the entry point
ENTRYPOINT ["python3", "-m", "vllm.entrypoints.openai.api_server"]
EOF

Use the following commands to authenticate Docker with Amazon ECR, build your Docker image, and push it to the repository you created earlier. The account ID and Region come from the environment variables set at the start of this post, which avoids hard-coded values.

# Authenticate Docker to your ECR registry
aws ecr get-login-password --region $AWS_REGION | docker login --username AWS --password-stdin $AWS_ACCOUNT_ID.dkr.ecr.$AWS_REGION.amazonaws.com
# Build the Docker image
docker build -t ${ECR_REPO_NAME}:latest .

# Tag the image
docker tag ${ECR_REPO_NAME}:latest $AWS_ACCOUNT_ID.dkr.ecr.$AWS_REGION.amazonaws.com/${ECR_REPO_NAME}:latest
# Push the image to ECR
docker push $AWS_ACCOUNT_ID.dkr.ecr.$AWS_REGION.amazonaws.com/${ECR_REPO_NAME}:latest
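The deployment manifest in the next section needs the full URI of the image you just pushed. As a minimal sketch, the following Python helper assembles that URI from the same values used in the shell commands above. The function name and the sample account ID are hypothetical and serve only as an illustration.

```python
def ecr_image_uri(account_id: str, region: str, repo: str, tag: str = "latest") -> str:
    """Build the image URI in the same shape as the docker tag/push commands."""
    return f"{account_id}.dkr.ecr.{region}.amazonaws.com/{repo}:{tag}"


# Hypothetical account ID, used only to show the resulting format.
uri = ecr_image_uri("123456789012", "us-east-1", "vllm-neuron")
print(uri)
```

The printed value is what replaces the image placeholder in the deployment specification.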

Deploy the Meta Llama 3.1-8B model

With the setup complete, you can now deploy the model using a Kubernetes deployment. The following is an example deployment specification that requests specific resources and sets up multiple replicas:

cat > neuronx-vllm-deployment.yaml <<EOF
apiVersion: apps/v1
kind: Deployment
metadata:
  name: neuronx-vllm-deployment
  labels:
    app: neuronx-vllm
spec:
  replicas: 3
  selector:
    matchLabels:
      app: neuronx-vllm
  template:
    metadata:
      labels:
        app: neuronx-vllm
    spec:
      schedulerName: my-scheduler
      containers:
      - name: neuronx-vllm
        image: <replace with the url to the docker image you pushed to the ECR>
        resources:
          limits:
            cpu: 32
            memory: "64G"
            aws.amazon.com/neuroncore: "8"
          requests:
            cpu: 32
            memory: "64G"
            aws.amazon.com/neuroncore: "8"
        ports:
        - containerPort: 8000
        env:
        - name: HF_TOKEN
          value: <your huggingface token>
        - name: FI_EFA_FORK_SAFE
          value: "1"
        args:
        - "--model"
        - "meta-llama/Meta-Llama-3.1-8B"
        - "--tensor-parallel-size"
        - "8"
        - "--max-num-seqs"
        - "64"
        - "--max-model-len"
        - "8192"
        - "--block-size"
        - "8192"
EOF

Apply the deployment specification with kubectl apply -f neuronx-vllm-deployment.yaml.

This deployment configuration sets up multiple replicas of the Meta Llama 3.1-8B model using tensor parallelism (TP) of 8. In the current setup, we’re hosting three copies of the model across the available Neuron cores. This configuration allows for the efficient utilization of the hardware resources while enabling multiple concurrent inference requests.

The use of TP=8 helps in distributing the model across multiple Neuron cores, which improves inference performance and throughput. The specific number of replicas and cores used may vary depending on your particular hardware setup and performance requirements.

To modify the setup, update the neuronx-vllm-deployment.yaml file, adjusting the replicas field in the deployment specification together with the aws.amazon.com/neuroncore resource values and the --tensor-parallel-size argument in the container specification. Always verify that the total number of cores used (replicas × cores per replica) doesn’t exceed your available hardware resources and that the number of attention heads is evenly divisible by the TP degree for optimal performance.
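The two checks described above can be expressed as a short pre-flight script. This is only a sketch: the function name is hypothetical, and the figures passed in (24 Neuron cores on one inf2.48xlarge, 32 attention heads for Llama 3.1-8B) are assumptions you should confirm against the instance specification and the model configuration.

```python
def check_deployment(replicas: int, cores_per_replica: int,
                     available_cores: int, attention_heads: int) -> list[str]:
    """Return a list of problems; an empty list means both checks pass."""
    problems = []
    needed = replicas * cores_per_replica
    if needed > available_cores:
        problems.append(
            f"{needed} Neuron cores requested but only {available_cores} available")
    # In this deployment the TP degree equals the cores requested per replica.
    if attention_heads % cores_per_replica != 0:
        problems.append(
            f"{attention_heads} attention heads not divisible by TP={cores_per_replica}")
    return problems


# Assumed values: TP=8 per replica, one inf2.48xlarge (24 cores), 32 heads.
print(check_deployment(3, 8, 24, 32))   # the configuration used in this post
print(check_deployment(4, 8, 24, 32))   # one replica too many for a single node
```

Running a check like this before kubectl apply can catch an over-subscribed node early, before pods are left in the Pending state.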

The deployment also includes environment variables for the Hugging Face token and EFA fork safety. The args section (see the preceding code) configures the model and its parameters, including an increased max model length and block size of 8192.

Test the deployment

After you deploy the model, it’s important to monitor its progress and verify its readiness. Complete the following steps:

Check the deployment status:

kubectl get deployments

This will show you the desired, current, and up-to-date number of replicas.

Monitor the pods:

kubectl get pods -l app=neuronx-vllm -w

The -w flag will watch for changes. You’ll see the pods transitioning from "Pending" to "ContainerCreating" to "Running".

Check the logs of a specific pod:

kubectl logs <pod-name>

The initial startup process takes around 15 minutes. During this time, the model is being compiled for the Neuron cores. You’ll see the compilation progress in the logs.

To support proper management of your vLLM pods, you should configure Kubernetes probes in your deployment. These probes help Kubernetes determine when a pod is ready to serve traffic, when it’s alive, and when it has successfully started.

Add the following probe configurations to your container spec in the deployment YAML:

spec:
  containers:
  - name: neuronx-vllm
    # ... other container configurations ...
    readinessProbe:
      httpGet:
        path: /health
        port: 8000
      initialDelaySeconds: 1800
      periodSeconds: 10
    livenessProbe:
      httpGet:
        path: /health
        port: 8000
      initialDelaySeconds: 1800
      periodSeconds: 15
    startupProbe:
      httpGet:
        path: /health
        port: 8000
      initialDelaySeconds: 1800
      failureThreshold: 30
      periodSeconds: 10

The configuration consists of three probes:

Readiness probe – Checks if the pod is ready to serve traffic. It starts checking after 1,800 seconds (30 minutes) and repeats every 10 seconds.
Liveness probe – Verifies if the pod is still running correctly. It begins after 1,800 seconds and checks every 15 seconds.
Startup probe – Gives the application time to start up. After an initial delay of 1,800 seconds, it tolerates up to 30 failed checks at 10-second intervals, which allows roughly 35 minutes in total before the application is considered failed.
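How long a pod gets before Kubernetes gives up on it follows from the startup probe fields. The sketch below does that arithmetic with the values from the YAML above. The function name is hypothetical, and the formula (initial delay plus failure threshold times period) ignores per-probe timeouts, so treat the result as approximate.

```python
def startup_budget_seconds(initial_delay: int, failure_threshold: int, period: int) -> int:
    """Approximate time before a startup probe marks the container as failed."""
    return initial_delay + failure_threshold * period


# Values taken from the startupProbe block in the YAML above.
budget = startup_budget_seconds(initial_delay=1800, failure_threshold=30, period=10)
print(f"{budget} s (~{budget / 60:.0f} min)")
```

If model compilation regularly finishes well inside this window, lowering initialDelaySeconds lets pods begin serving sooner.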

These probes assume that your vLLM application exposes a /health endpoint. If it doesn’t, you’ll need to implement one or adjust the probe configurations accordingly.

With these probes in place, Kubernetes will do the following:

Only send traffic to pods that are ready
Restart pods that are no longer alive
Allow sufficient time for initial startup and compilation

This configuration helps facilitate high availability and proper functioning of your vLLM deployment.

Now you’re ready to access the pods.

Identify the pod that is running your inference server. You can use the following command to list the pods with the neuronx-vllm label:

kubectl get pods -l app=neuronx-vllm

This command will output a list of pods, and you’ll need the name of the pod you want to forward.

Use kubectl port-forward to forward the port from the Kubernetes pod to your local machine. Use the name of your pod from the previous step:

kubectl port-forward <pod-name> 8000:8000

This command forwards port 8000 on the pod to port 8000 on your local machine. You can now access the inference server at http://localhost:8000.

Because we’re forwarding a port directly from a single pod, requests will only be sent to that specific pod. As a result, traffic won’t be balanced across all replicas of your deployment. This is suitable for testing and development purposes, but it doesn’t utilize the deployment efficiently in a production scenario where load balancing across multiple replicas is crucial to handle higher traffic and provide fault tolerance.

In a production environment, a proper solution like a Kubernetes service with a LoadBalancer or Ingress should be used to distribute traffic across available pods. This facilitates the efficient utilization of resources, a balanced load, and improved reliability of the inference service.

You can test the inference server by making a request from your local machine. The following code is an example of how to make an inference call using curl:

curl -X POST http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
  "model": "meta-llama/Meta-Llama-3.1-8B",
  "prompt": "Explain the theory of relativity.",
  "max_tokens": 100
}'

This setup allows you to test and interact with your inference server locally without needing to expose your service publicly or set up complex networking configurations. For production use, make sure that load balancing and scalability considerations are addressed appropriately.
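If you prefer to call the endpoint from a script, the same request can be built with the Python standard library. The following is a sketch rather than a reference client: the helper names are hypothetical, and the response parsing assumes the OpenAI-style completions format in which the generated text sits under choices[0].text.

```python
import json
import urllib.request


def build_completion_request(base_url: str, model: str, prompt: str,
                             max_tokens: int) -> urllib.request.Request:
    """Create the POST request for the /v1/completions endpoint."""
    payload = {"model": model, "prompt": prompt, "max_tokens": max_tokens}
    return urllib.request.Request(
        url=f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(request: urllib.request.Request) -> str:
    """Send the request and return the text of the first choice."""
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["text"]


request = build_completion_request(
    "http://localhost:8000", "meta-llama/Meta-Llama-3.1-8B",
    "Explain the theory of relativity.", 100)
# Requires the port-forward from the previous step to be running:
# print(complete(request))
```

Keeping request construction separate from sending makes it possible to inspect the payload without a running server.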

For more information about routing, see Route application and HTTP traffic with Application Load Balancers.

Monitor performance

AWS offers powerful tools to monitor and optimize your vLLM deployment on Inferentia chips. The AWS Neuron Monitor container, used with Prometheus and Grafana, provides advanced visualization of your ML application performance. Additionally, CloudWatch Container Insights for Neuron offers deep, Neuron-specific analytics.

These tools allow you to track Inferentia chip utilization, model performance, and overall cluster health. By analyzing this data, you can make informed decisions about resource allocation and scaling to meet your workload requirements.

Remember that the initial 15-minute startup time for model compilation is a one-time process per deployment, with subsequent restarts being faster due to caching.

To learn more about setting up and using these monitoring capabilities, see Scale and simplify ML workload monitoring on Amazon EKS with AWS Neuron Monitor container.

Scaling and multi-tenancy

As your application’s demand grows, you may need to scale your deployment to handle more requests. Scaling your Meta Llama 3.1-8B deployment on Amazon EKS with Neuron cores involves two coordinated steps:

Increasing the number of nodes in your EKS node group to provide additional Neuron cores
Increasing the number of replicas in your deployment to utilize these new resources

You can scale your deployment manually. Use the AWS Management Console or AWS CLI to increase the size of your EKS node group. When new nodes are available, scale your deployment with the following code:

kubectl scale deployment neuronx-vllm-deployment --replicas=<new-number>

Alternatively, you can set up auto scaling:

Configure auto scaling for your EKS node group to automatically add nodes based on resource demands
Use Horizontal Pod Autoscaling (HPA) to automatically adjust the number of replicas in your deployment

You can configure the node group’s auto scaling to respond to increased CPU, memory, or custom metric demands, automatically provisioning new nodes with Neuron cores as needed. This makes sure that as the number of incoming requests grows, both your infrastructure and your deployment can scale accordingly.

Example scaling solutions include:

Cluster Autoscaler with Karpenter – Though not currently installed in this setup, Karpenter offers more flexible and efficient auto scaling for future consideration. It can dynamically provision the right number of nodes needed for your Neuron workloads based on pending pods and custom scheduling constraints. For more details, see Scale cluster compute with Karpenter and Cluster Autoscaler.
Multi-cluster federation – For even larger scale, you could set up multiple EKS clusters, each with its own Neuron-equipped nodes, and use a multi-cluster federation tool to distribute traffic among them.

You should consider the following when scaling:

Alignment of resources – Make sure that your scaling strategy for both nodes and pods aligns with the Neuron core requirements (multiples of 8 for optimal performance). This requirement is model dependent; the value of 8 is specific to the Meta Llama 3.1-8B setup in this post.
Compilation time – Remember the 15-minute compilation time for new pods when planning your scaling strategy. Consider pre-warming pods during off-peak hours.
Cost management – Monitor costs closely as you scale, because Neuron-equipped instances can be expensive.
Performance testing – Conduct thorough performance testing as you scale to verify that increased capacity translates to improved throughput and reduced latency.
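To make the alignment point concrete, the following sketch estimates how many nodes a target replica count needs. The function name is hypothetical, and the defaults (8 Neuron cores per replica, 24 Neuron cores per inf2.48xlarge node) are assumptions taken from the setup in this post.

```python
import math


def nodes_needed(replicas: int, cores_per_replica: int = 8,
                 cores_per_node: int = 24) -> int:
    """Smallest node count whose Neuron cores can host all replicas.

    A pod runs on a single node, so capacity is counted in whole
    replicas per node rather than in total cores.
    """
    replicas_per_node = cores_per_node // cores_per_replica
    if replicas_per_node == 0:
        raise ValueError("one replica does not fit on a single node")
    return math.ceil(replicas / replicas_per_node)


for target in (3, 4, 7):
    print(target, "replicas ->", nodes_needed(target), "node(s)")
```

Going from three to four replicas already requires a second node, which is why node group and deployment scaling need to be planned together.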

By coordinating the scaling of both your node group and your deployment, you can effectively handle increased request volumes while maintaining optimal performance. The auto scaling capabilities of both your node group and deployment can work together to automatically adjust your cluster’s capacity based on incoming request volumes, providing a more responsive and efficient scaling solution.

Clean up

Use the following code to delete the cluster created in this solution:

eksctl delete cluster --name $CLUSTER_NAME --region $AWS_REGION

Conclusion

Deploying LLMs like Meta Llama 3.1-8B at scale poses significant computational challenges. Using Inferentia 2 instances and Amazon EKS can help overcome these challenges by enabling efficient model deployment in a containerized, scalable, and multi-tenant environment.

This solution combines the exceptional performance and cost-effectiveness of Inferentia 2 chips with the robust and flexible landscape of Amazon EKS. Inferentia 2 chips deliver high throughput and low latency inference, ideal for LLMs. Amazon EKS provides dynamic scaling, efficient resource utilization, and multi-tenancy capabilities.

The process involves setting up an EKS cluster, configuring an Inferentia 2 node group, installing Neuron components, and deploying the model as a Kubernetes pod. This approach facilitates high availability, resilience, and efficient resource sharing for language model services, while allowing for automatic scaling, load balancing, and self-healing capabilities.

For the complete code and detailed implementation steps, visit the GitHub repository.

About the Authors

Dmitri Laptev is a Senior GenAI Solutions Architect at AWS, based in Munich. With 17 years of experience in the IT industry, his interest in AI and ML dates back to his university years, fostering a long-standing passion for these technologies. Dmitri is enthusiastic about cloud computing and the ever-evolving landscape of technology.

Maurits de Groot is a Solutions Architect at Amazon Web Services, based out of Amsterdam. He specializes in machine learning-related topics and has a predilection for startups. In his spare time, he enjoys skiing and bouldering.

Ziwen Ning is a Senior Software Development Engineer at AWS. He currently focuses on enhancing the AI/ML experience through the integration of AWS Neuron with containerized environments and Kubernetes. In his free time, he enjoys challenging himself with kickboxing, badminton, and other various sports, and immersing himself in music.

Jianying Lang is a Principal Solutions Architect at the AWS Worldwide Specialist Organization (WWSO). She has over 15 years of working experience in the HPC and AI fields. At AWS, she focuses on helping customers deploy, optimize, and scale their AI/ML workloads on accelerated computing instances. She is passionate about combining the techniques in HPC and AI fields. Jianying holds a PhD in Computational Physics from the University of Colorado at Boulder.

Serving LLMs using vLLM and Amazon EC2 instances with AWS AI chips

November 26, 2024

by Omri Shiv Amazon AWS

The use of large language models (LLMs) and generative AI has exploded over the last year. With the release of powerful publicly available foundation models, tools for training, fine-tuning, and hosting your own LLM have also become democratized. Using vLLM on AWS Trainium and Inferentia makes it possible to host LLMs for high performance inference and scalability.

In this post, we will walk you through how you can quickly deploy Meta’s latest Llama models, using vLLM on an Amazon Elastic Compute Cloud (Amazon EC2) Inf2 instance. For this example, we will use the 1B version, but other sizes can be deployed using these steps, along with other popular LLMs.

Deploy vLLM on AWS Trainium and Inferentia EC2 instances

In these sections, you will be guided through using vLLM on an AWS Inferentia EC2 instance to deploy Meta’s newest Llama 3.2 model. You will learn how to request access to the model, create a Docker container to use vLLM to deploy the model and how to run online and offline inference on the model. We will also talk about performance tuning the inference graph.

Prerequisite: Hugging Face account and model access

To use the meta-llama/Llama-3.2-1B model, you’ll need a Hugging Face account and access to the model. Please go to the model card, sign up, and agree to the model license. You will then need a Hugging Face token, which you can get by following these steps. When you get to the Save your Access Token screen, as shown in the following figure, make sure you copy the token because it will not be shown again.

Create an EC2 instance

You can create an EC2 Instance by following the guide. A few things to note:

If this is your first time using inf/trn instances, you will need to request a quota increase.
You will use inf2.xlarge as your instance type. inf2.xlarge instances are only available in these AWS Regions.
Increase the gp3 volume size to 100 GiB.
You will use Deep Learning AMI Neuron (Ubuntu 22.04) as your AMI, as shown in the following figure.

After the instance is launched, you can connect to it to access the command line. In the next step, you’ll use Docker (preinstalled on this AMI) to run a vLLM container image for neuron.

Start vLLM server

You will use Docker to create a container with all the tools needed to run vLLM. Create a Dockerfile using the following command:

cat > Dockerfile <<'EOF'
# default base image
ARG BASE_IMAGE="public.ecr.aws/neuron/pytorch-inference-neuronx:2.1.2-neuronx-py310-sdk2.20.0-ubuntu20.04"
FROM $BASE_IMAGE
RUN echo "Base image is $BASE_IMAGE"
# Install some basic utilities
RUN apt-get update && \
    apt-get install -y \
        git \
        python3 \
        python3-pip \
        ffmpeg libsm6 libxext6 libgl1
### Mount Point ###
# When launching the container, mount the code directory to /app
ARG APP_MOUNT=/app
VOLUME [ ${APP_MOUNT} ]
WORKDIR ${APP_MOUNT}/vllm
RUN python3 -m pip install --upgrade pip
RUN python3 -m pip install --no-cache-dir fastapi ninja tokenizers pandas
RUN python3 -m pip install sentencepiece transformers==4.36.2 -U
RUN python3 -m pip install transformers-neuronx --extra-index-url=https://pip.repos.neuron.amazonaws.com -U
RUN python3 -m pip install --pre neuronx-cc==2.15.* --extra-index-url=https://pip.repos.neuron.amazonaws.com -U
ENV VLLM_TARGET_DEVICE=neuron
RUN git clone https://github.com/vllm-project/vllm.git && \
    cd vllm && \
    git checkout v0.6.2 && \
    python3 -m pip install -U \
        "cmake>=3.26" ninja packaging "setuptools-scm>=8" wheel jinja2 \
        -r requirements-neuron.txt && \
    pip install --no-build-isolation -v -e . && \
    pip install --upgrade triton==3.0.0
CMD ["/bin/bash"]
EOF

Then run:

docker build . -t vllm-neuron

Building the image will take about 10 minutes. After it’s done, use the new Docker image (replace YOUR_TOKEN_HERE with the token from Hugging Face):

export HF_TOKEN="YOUR_TOKEN_HERE"
docker run \
        -it \
        -p 8000:8000 \
        --device /dev/neuron0 \
        -e HF_TOKEN=$HF_TOKEN \
        -e NEURON_CC_FLAGS=-O1 \
        vllm-neuron

You can now start the vLLM server with the following command:

vllm serve meta-llama/Llama-3.2-1B --device neuron --tensor-parallel-size 2 --block-size 8 --max-model-len 4096 --max-num-seqs 32

This command runs vLLM with the following parameters:

serve meta-llama/Llama-3.2-1B: The Hugging Face model ID of the model that is being deployed for inference.
--device neuron: Configures vLLM to run on the Neuron device.
--tensor-parallel-size 2: Sets the number of partitions for tensor parallelism. inf2.xlarge has 1 Neuron device and each Neuron device has 2 Neuron cores.
--max-model-len 4096: This is set to the maximum sequence length (input tokens plus output tokens) for which to compile the model.
--block-size 8: For Neuron devices, this is internally set to the max-model-len.
--max-num-seqs 32: This is set to the hardware batch size or a desired level of concurrency that the model server needs to handle.
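Two of these parameters reduce to simple arithmetic, sketched below. Both helper names are hypothetical. The first derives the tensor parallel size from the device count stated above, and the second checks whether a request fits within the compiled sequence length.

```python
def tensor_parallel_size(neuron_devices: int, cores_per_device: int = 2) -> int:
    """Neuron cores available for tensor parallelism on the instance."""
    return neuron_devices * cores_per_device


def fits_model_len(prompt_tokens: int, max_new_tokens: int,
                   max_model_len: int = 4096) -> bool:
    """True if input plus output tokens stay within the compiled length."""
    return prompt_tokens + max_new_tokens <= max_model_len


print(tensor_parallel_size(1))      # inf2.xlarge has 1 Neuron device
print(fits_model_len(4000, 128))    # 4000 + 128 tokens against a 4096 limit
```

A check like the second one can help a client reject or truncate oversized prompts before they reach the server.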

The first time you load a model, if there isn’t a previously compiled model, it will need to be compiled. This compiled model can optionally be saved so the compilation step is not necessary if the container is recreated. After everything is done and the model server is running, you should see the following logs:

Avg prompt throughput: 0.0 tokens/s ...

This means that the model server is running, but it isn’t yet processing requests because none have been received. You can now detach from the container by pressing ctrl + p and ctrl + q.

Inference

When you started the Docker container, you passed the option -p 8000:8000. This told Docker to forward port 8000 from the container to port 8000 on the host, which here is your EC2 instance. When you run the following command on the instance, you should see that the model server with meta-llama/Llama-3.2-1B is running.

curl localhost:8000/v1/models

This should return something like:

{"object":"list","data":[{"id":"meta-llama/Llama-3.2-1B","object":"model","created":1732552038,"owned_by":"vllm","root":"meta-llama/Llama-3.2-1B","parent":null,"max_model_len":4096,"permission":[{"id":"modelperm-6d44a6f6e52447eb9074b13ae1e9e285","object":"model_permission","created":1732552038,"allow_create_engine":false,"allow_sampling":true,"allow_logprobs":true,"allow_search_indices":false,"allow_view":true,"allow_fine_tuning":false,"organization":"*","group":null,"is_blocking":false}]}]}
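The response is easier to read once parsed. As a small sketch, the following Python snippet extracts the model ID and compiled sequence length from a trimmed copy of the response above. The trimming is only for brevity, and the variable names are illustrative.

```python
import json

# Trimmed copy of the response shown above (the "permission" list is omitted).
response_text = (
    '{"object":"list","data":[{"id":"meta-llama/Llama-3.2-1B",'
    '"object":"model","owned_by":"vllm","max_model_len":4096}]}'
)

models = json.loads(response_text)["data"]
for model in models:
    print(model["id"], model["max_model_len"])
```

The id value is the string to pass as "model" in later completion requests.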

Now, send it a prompt:

curl localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{"model": "meta-llama/Llama-3.2-1B", "prompt": "What is Gen AI?", "temperature":0, "max_tokens": 128}' | jq '.choices[0].text'

You should get back a response similar to the following from vLLM:

ubuntu@ip-172-31-13-178:~$ curl localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{"model": "meta-llama/Llama-3.2-1B", "prompt": "What is Gen AI?", "temperature":0, "max_tokens": 128}' | jq '.choices[0].text'
  % Total    % Received % Xferd  Average Speed   Time    Time    Time  Current
                                 Dload  Upload   Total   Spent  Left  Speed
100  1067  100   966  100   101    108     11  0:00:09  0:00:08 0:00:01   258
" How does it work?\nGen AI is a new type of artificial intelligence that is designed to learn and adapt to new situations and environments. It is based on the idea that the human brain is a complex system that can learn and adapt to new situations and environments. Gen AI is designed to be able to learn and adapt to new situations and environments in a way that is similar to how the human brain does.\nGen AI is a new type of artificial intelligence that is designed to learn and adapt to new situations and environments. It is based on the idea that the human brain is a complex system that can learn and adapt to new situations and environments."

Offline inference with vLLM

Another way to use vLLM on Inferentia is by sending a few requests all at the same time in a script. This is useful for automation or when you have a batch of prompts that you want to send all at the same time.

You can reattach to your Docker container and stop the online inference server with the following:

docker attach $(docker ps --format "{{.ID}}")

At this point, you should see a blank cursor. Press ctrl + c to stop the server, and you should be back at the bash prompt in the container. Create a file for using the offline inference engine:

cat > offline_inference.py <<EOF
from vllm.entrypoints.llm import LLM
from vllm.sampling_params import SamplingParams

# Sample prompts.
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]
# Create a sampling params object.
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

# Create an LLM.
llm = LLM(model="meta-llama/Llama-3.2-1B",
        max_num_seqs=32,
        max_model_len=4096,
        block_size=8,
        device="neuron",
        tensor_parallel_size=2)
# Generate texts from the prompts. The output is a list of RequestOutput objects
# that contain the prompt, generated text, and other information.
outputs = llm.generate(prompts, sampling_params)
# Print the outputs.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

EOF

Now run the script with python offline_inference.py; you should get back responses for the four prompts. This may take a minute because the model needs to be started again.

Processed prompts: 100%|
█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:01<00:00,  2.53it/s, est. speed input: 16.46 toks/s, output: 40.51 toks/s]
Prompt: 'Hello, my name is', Generated text: ' Anna and I am the 4th year student of the Bachelor of Engineering at'
Prompt: 'The president of the United States is', Generated text: ' the head of state and head of government of the United States of America. A'
Prompt: 'The capital of France is', Generated text: ' also the most expensive city to live in. The average cost of living in Paris'
Prompt: 'The future of AI is', Generated text: ' now\nThe 10 most influential AI professionals to watch in 2019\n'

You can now type exit and press return and then press ctrl + c to shut down the Docker container and go back to your inf2 instance.

Clean up

Now that you’re done testing the Llama 3.2 1B LLM, you should terminate your EC2 instance to avoid additional charges.

Performance tuning for variable sequence lengths

You will probably have to process variable length sequences during LLM inference. The Neuron SDK generates buckets and a computation graph that works with the shape and size of the buckets. To fine tune the performance based on the length of input and output tokens in the inference requests, you can set two kinds of buckets corresponding to the two phases of LLM inference through the following environment variables as a list of integers:

NEURON_CONTEXT_LENGTH_BUCKETS corresponds to the context encoding phase. Set this to the estimated length of prompts during inference.
NEURON_TOKEN_GEN_BUCKETS corresponds to the token generation phase. Set this to a range of powers of two within your generation length.
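
As a minimal sketch of the "range of powers of two" guidance, the helper below lists the powers of two between a lower and an upper generation length and formats them as a comma-separated string. The function names and the 256 to 1024 range are illustrative assumptions, not part of the Neuron SDK.

```python
def power_of_two_buckets(min_len: int, max_len: int) -> list[int]:
    """Return the powers of two in [min_len, max_len], in ascending order."""
    buckets = []
    size = 1
    while size <= max_len:
        if size >= min_len:
            buckets.append(size)
        size *= 2
    return buckets


def as_env_value(buckets: list[int]) -> str:
    """Format a bucket list as a comma-separated string of integers."""
    return ",".join(str(b) for b in buckets)


# Hypothetical range: generations between 256 and 1024 tokens.
token_gen_buckets = power_of_two_buckets(256, 1024)
print(as_env_value(token_gen_buckets))  # 256,512,1024
```

The printed string can be pasted as the value of NEURON_TOKEN_GEN_BUCKETS; adjust the range to the generation lengths you expect.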

You can use the docker run command to set the environment variables while starting the vLLM server (remember to replace YOUR_TOKEN_HERE with your Hugging Face token):

export HF_TOKEN="YOUR_TOKEN_HERE"
docker run \
        -it \
        -p 8000:8000 \
        --device /dev/neuron0 \
        -e HF_TOKEN=$HF_TOKEN \
        -e NEURON_CC_FLAGS=-O1 \
        -e NEURON_CONTEXT_LENGTH_BUCKETS="1024,1280,1536,1792,2048" \
        -e NEURON_TOKEN_GEN_BUCKETS="256,512,1024" \
        vllm-neuron

You can then start the server using the same command:

vllm serve meta-llama/Llama-3.2-1B --device neuron --tensor-parallel-size 2 --block-size 8 --max-model-len 4096 --max-num-seqs 32

As the model graph has changed, the model will need to be recompiled. If the container was terminated, the model will be downloaded again. You can then send a request by detaching from the container by pressing ctrl + p and ctrl + q and using the same command:

curl localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{"model": "meta-llama/Llama-3.2-1B", "prompt": "What is Gen AI?", "temperature":0, "max_tokens": 128}' | jq '.choices[0].text'

For more information about how to configure the buckets, see the developer guide on bucketing. Note, NEURON_CONTEXT_LENGTH_BUCKETS corresponds to context_length_estimate in the documentation and NEURON_TOKEN_GEN_BUCKETS corresponds to n_positions in the documentation.

Conclusion

You’ve just seen how to deploy meta-llama/Llama-3.2-1B using vLLM on an Amazon EC2 Inf2 instance. If you’re interested in deploying other popular LLMs from Hugging Face, you can replace the model ID in the vllm serve command. More details on the integration between the Neuron SDK and vLLM can be found in the Neuron user guide for continuous batching and the vLLM guide for Neuron.

After you’ve identified a model that you want to use in production, you will want to deploy it with autoscaling, observability, and fault tolerance. You can also refer to this blog post to understand how to deploy vLLM on Inferentia through Amazon Elastic Kubernetes Service (Amazon EKS). In the next post of this series, we’ll go into using Amazon EKS with Ray Serve to deploy vLLM into production with autoscaling and observability.

About the authors

Omri Shiv is an Open Source Machine Learning Engineer focusing on helping customers through their AI/ML journey. In his free time, he likes cooking, tinkering with open source and open hardware, and listening to and playing music.

Pinak Panigrahi works with customers to build ML-driven solutions to solve strategic business problems on AWS. In his current role, he works on optimizing training and inference of generative AI models on AWS AI chips.

Using LLMs to fortify cyber defenses: Sophos’s insight on strategies for using LLMs with Amazon Bedrock and Amazon SageMaker

November 26, 2024

by Benoît de Patoul Amazon AWS

This post is co-written with Adarsh Kyadige and Salma Taoufiq from Sophos.

As a leader in cutting-edge cybersecurity, Sophos is dedicated to safeguarding over 500,000 organizations and millions of customers across more than 150 countries. By harnessing the power of threat intelligence, machine learning (ML), and artificial intelligence (AI), Sophos delivers a comprehensive range of advanced products and services. These solutions are designed to protect and defend users, networks, and endpoints against a wide array of cyber threats including phishing, ransomware, and malware. The Sophos Artificial Intelligence (AI) group (SophosAI) oversees the development and maintenance of Sophos’s major ML security technology.

Large language models (LLMs) have demonstrated impressive capabilities in natural language understanding and generation across diverse domains, as showcased in numerous leaderboards (e.g., HELM, Hugging Face Open LLM leaderboard) that evaluate them on a myriad of generic tasks. However, their effectiveness in specialized fields like cybersecurity relies heavily on domain-specific knowledge. In this context, fine-tuning emerges as a crucial technique to adapt these general-purpose models to the intricacies of cybersecurity. For example, we could use instruction fine-tuning to improve model performance on incident classification or summarization. However, before fine-tuning, it’s important to determine an out-of-the-box model’s potential by testing its abilities on a set of tasks based on the domain. We have defined three specialized tasks that are covered later in this post. These same tasks can also be used to measure the gains in performance obtained through fine-tuning, Retrieval-Augmented Generation (RAG), or knowledge distillation.

In this post, SophosAI shares insights in using and evaluating an out-of-the-box LLM for the enhancement of a security operations center’s (SOC) productivity using Amazon Bedrock and Amazon SageMaker. We use Anthropic’s Claude 3 Sonnet on Amazon Bedrock to illustrate the use cases.

Amazon Bedrock is a fully managed service that offers a choice of high-performing foundation models (FMs) from leading AI companies like AI21 Labs, Anthropic, Cohere, Meta, Mistral AI, Stability AI, and Amazon through a single API, along with a broad set of capabilities you need to build generative AI applications with security, privacy, and responsible AI.

Tasks

We will showcase three example tasks to delve into using LLMs in the context of an SOC. An SOC is an organizational unit responsible for monitoring, detecting, analyzing, and responding to cybersecurity threats and incidents. It employs a combination of technology, processes, and skilled personnel to maintain the confidentiality, integrity, and availability of information systems and data. SOC analysts continuously monitor security events, investigate potential threats, and take appropriate action to mitigate risks. Known challenges faced by SOCs are the high volume of alerts generated by detection tools and the subsequent alert fatigue among analysts. These challenges are often coupled with staffing shortages. To address these challenges and enhance operational efficiency and scalability, many SOCs are increasingly turning to automation technologies to streamline repetitive tasks, prioritize alerts, and accelerate incident response. Considering the nature of tasks analysts need to perform, LLMs are good tools to enhance the level of automation in SOCs and empower security teams.

For this work, we focus on three essential SOC use cases where LLMs have the potential of greatly assisting analysts, namely:

SQL Query generation from natural language to simplify data extraction
Incident severity prediction to prioritize which incidents analysts should focus on
Incident summarization based on its constituent alert data to increase analyst productivity

Based on the token consumption of these tasks, particularly the summarization component, we need a model with a context window of at least 4000 tokens. While the tasks have been tested in English, Anthropic’s Claude 3 Sonnet model can perform in other languages. However, we recommend evaluating the performance in your specific language of interest.

Let’s dive into the details of each task.

Task 1: Query generation from natural language

This task’s objective is to assess a model’s capacity to translate natural language questions into SQL queries, using contextual knowledge of the underlying data schema. This skill simplifies the data extraction process, allowing security analysts to conduct investigations more efficiently without requiring deep technical knowledge. We used prompt engineering guidelines to tailor our prompts to generate better responses from the LLM.

A three-shot prompting strategy is used for this task. Given a database schema, the model is provided with three examples pairing a natural-language question with its corresponding SQL query. Following these examples, the model is then prompted to generate the SQL query for a question of interest.

The prompt below is a three-shot prompt example for query generation from natural language. Empirically, we have obtained better results with few-shot prompting as opposed to one-shot (where the model is provided with only one example question and corresponding query before the actual question of interest) or zero-shot (where the model is directly prompted to generate a desired query without any examples).

Translate the following request into SQL
Schema for alert_table table
   <Table schema>
Schema for process_table table
   <Table schema>
Schema for network_table table
   <Table schema>

Here are some examples
<examples>
Request:tell me a list of processes that were executed between 2021/10/19 and 2021/11/30
   SQL:select * from process_table where timestamp between '2021-10-19' and '2021-11-30';

Request:show me any low severity security alerts for the 23 days ago
   SQL:select * from alert_table where severity='low' and timestamp>=DATEADD('day', -23, CURRENT_TIMESTAMP());

Request:show me the count of msword.exe processes that ran between Dec/01 and Dec/11
   SQL:select count(*) from process_table where process='msword.exe' and timestamp>='2022-12-01' and timestamp<='2022-12-11';
</examples>

Request:"Any Ubuntu processes that was run by the user ""admin"" from host ""db-server"""
SQL:
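
A prompt with this layout can be assembled programmatically. The following is a minimal sketch; the function name and the placeholder schema strings are hypothetical, and the layout simply mirrors the example prompt above.

```python
def build_few_shot_prompt(schemas: dict[str, str],
                          examples: list[tuple[str, str]],
                          request: str) -> str:
    """Assemble a few-shot text-to-SQL prompt from schemas, examples, and a request."""
    lines = ["Translate the following request into SQL"]
    for table, schema in schemas.items():
        lines.append(f"Schema for {table} table")
        lines.append(f"   {schema}")
    lines += ["", "Here are some examples", "<examples>"]
    for question, sql in examples:
        lines.append(f"Request:{question}")
        lines.append(f"   SQL:{sql}")
        lines.append("")  # blank line between examples
    if examples:
        lines.pop()  # no blank line after the last example
    lines.append("</examples>")
    # End with an open "SQL:" so the model completes the query.
    lines += ["", f"Request:{request}", "SQL:"]
    return "\n".join(lines)


prompt = build_few_shot_prompt(
    schemas={"process_table": "<Table schema>"},  # placeholder schema text
    examples=[
        ("tell me a list of processes that were executed between 2021/10/19 and 2021/11/30",
         "select * from process_table where timestamp between '2021-10-19' and '2021-11-30';"),
    ],
    request="show me all processes started by the user admin",  # hypothetical request
)
print(prompt)
```

Passing zero, one, or three example pairs gives the zero-shot, one-shot, and three-shot variants of the same prompt.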

To evaluate a model’s performance on this task, we rely on a proprietary dataset of about 100 target queries based on a test database schema. To determine the accuracy of the queries generated by the model, we follow a multi-step evaluation. First, we verify whether the model’s output is an exact match to the expected SQL statement. Exact matches are recorded as successful outcomes. If there is a mismatch, we run both the model’s query and the expected query against our mock database and compare their results. However, this method can be prone to false positives and false negatives. To mitigate this, we further perform a query equivalence assessment using a different, stronger LLM. This method is known as LLM-as-a-judge.
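
The first two evaluation steps can be sketched as follows. This is an illustration under stated assumptions rather than the evaluation code used for the results: it uses an in-memory SQLite database with an invented two-column process_table as the mock database, treats whitespace and a trailing semicolon as irrelevant for the exact-match check, and leaves out the LLM-as-a-judge step.

```python
import sqlite3
from collections import Counter


def normalize(sql: str) -> str:
    """Collapse whitespace and drop a trailing semicolon (an assumed normalization)."""
    return " ".join(sql.strip().rstrip(";").split())


def evaluate(conn: sqlite3.Connection, generated: str, expected: str) -> str:
    """Return 'exact_match', 'result_match', or 'mismatch' for one query pair."""
    if normalize(generated) == normalize(expected):
        return "exact_match"
    try:
        # Compare result sets as multisets so row order does not matter.
        got = Counter(conn.execute(generated).fetchall())
        want = Counter(conn.execute(expected).fetchall())
    except sqlite3.Error:
        return "mismatch"  # a query failed to run
    return "result_match" if got == want else "mismatch"


# Hypothetical mock database with a few invented rows.
conn = sqlite3.connect(":memory:")
conn.execute("create table process_table (process text, timestamp text)")
conn.executemany(
    "insert into process_table values (?, ?)",
    [("msword.exe", "2022-12-02"), ("msword.exe", "2022-12-05"), ("cmd.exe", "2022-12-03")],
)

expected = "select count(*) from process_table where process='msword.exe';"
generated = "SELECT COUNT(*) FROM process_table WHERE process = 'msword.exe'"
print(evaluate(conn, generated, expected))
```

Two different queries can return the same rows on a small mock database, which is one source of the false positives mentioned above and the reason a further equivalence check is useful.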

Anthropic’s Claude 3 Sonnet model achieved a good accuracy rate of 88 percent on the chosen dataset, suggesting that this natural-language-to-SQL task is quite simple for LLMs. With basic few-shot prompting, an LLM can therefore be used out-of-the-box without fine-tuning by security analysts to assist them in retrieving key information while investigating threats. The above model performance is based on our dataset and our experiment. This means that you can perform your own test using the strategy explained above.

Task 2: Incident severity prediction

For the second task, we assess a model’s ability to recognize the severity of observed events as indicators of an incident. Specifically, we try to determine whether an LLM can review a security incident and accurately gauge its importance. Armed with such a capability, a model can assist analysts in determining which incidents are most pressing, so they can work more efficiently by organizing their work queue based on severity levels, cut through the noise, and save time and energy.

The input data in this use case is semi-structured alert data, typical of what is produced by various detection systems during an incident. We clearly define severity categories—critical, high, medium, low, and informational—across which the model is to classify the severity of the incident. This is therefore a classification problem that tests an LLM’s intrinsic cybersecurity knowledge.

Each security incident within the Sophos Managed Detection and Response (MDR) platform is made up of multiple detections that highlight suspicious activities occurring in a user’s environment. A detection might involve identifying potentially harmful patterns, such as unusual command executions, abnormal file access, anomalous network traffic, or suspicious script use. We have attached below an example input data.

The “detection” section provides detailed information about each specific suspicious activity that was identified. It includes the type of security incident, such as “Execution,” along with a description that explains the nature of the threat, like the use of suspicious PowerShell commands. The detection is tied to a unique identifier for tracking and reference purposes. Additionally, it contains details from the MITRE ATT&CK framework which categorizes the tactics and techniques involved in the threat. This section might also reference related Sigma rules, which are community-driven signatures for detecting threats across different systems. By including these elements, the detection section serves as a comprehensive outline of the potential threat, helping analysts understand not just what was detected but also why it matters.

The “machine_data” section holds crucial information about the machine on which the detection occurred. It can provide further metadata on the machine, helping to pinpoint where exactly in the environment the suspicious activity was observed.

{
    ...
  "detection": {
    "attack": "Execution",
    "description": "Identifies the use of suspicious PowerShell IEX patterns. IEX is the shortened version of the Invoke-Expression PowerShell cmdlet. The cmdlet runs the specified string as a command.",
    "id": <Detection ID>,
    "mitre_attack": [
      {
        "tactic": {
          "id": "TA0002",
          "name": "Execution",
          "techniques": [
            {
              "id": "T1059.001",
              "name": "PowerShell"
            }
          ]
        }
      },
      {
        "tactic": {
          "id": "TA0005",
          "name": "Defense Evasion",
          "techniques": [
            {
              "id": "T1027",
              "name": "Obfuscated Files or Information"
            }
          ]
        }
      }
    ],
    "sigma": {
      "id": <Detection ID>,
      "references": [
        "https://github.com/SigmaHQ/sigma/blob/master/rules/windows/process_creation/proc_creation_win_susp_powershell_download_iex.yml",
        "https://github.com/VirtualAlllocEx/Payload-Download-Cradles/blob/main/Download-Cradles.cmd"
      ]
    },
    "type": "process",
  },
  "machine_data": {
    ...
    "username": <Username>
    },
    "customer_id": <Customer ID>,
    "decorations": {
        <Customer data>
    },
    "original_file_name": "powershell.exe",
    "os_platform": "windows",
    "parent_process_name": "cmd.exe",
    "parent_process_path": "C:\Windows\System32\cmd.exe",
    "powershell_code": "iex ([system.text.encoding]::ASCII.GetString([Convert]::FromBase64String('aWYoR2V0LUNvbW1hbmQgR2V0LVdpbmRvd3NGZWF0dXJlIC1lYSBTaWxlbnRseUNvbnRpbnVlKQp7CihHZXQtV2luZG93c0ZlYXR1cmUgfCBXaGVyZS1PYmplY3QgeyRfLm5hbWUgLWVxICdSRFMtUkQtU2VydmVyJ30gfCBTZWxlY3QgSW5zdGFsbFN0YXRlKS5JbnN0YWxsU3RhdGUKfQo=')))",
    "process_name": "powershell.exe",
    "process_path": "C:\Windows\System32\WindowsPowerShell\v1.0\powershell.exe",
  },
  ...
}

To facilitate evaluation, the prompt used for this task requires that the model communicates its severity assessments in a uniform way, providing the response in a standardized format, for example, as a dictionary with severity_pred as the key and their chosen severity level as the value. The prompt below is an example for incident severity classification. Model performance is then evaluated against a test set of over 3,800 security incidents with target severity levels.

You are a helpful cybersecurity incident investigation expert that classifies incidents according to their severity level given a set of detections per incident.
Respond strictly with this JSON format: {"severity_pred": "xxx"} where xxx should only be either:
    - Critical,
    <Criteria for a critical incident>
    - High,
    <Criteria for a high severity incident>
    - Medium,
    <Criteria for a medium severity incident>
    - Low,
    <Criteria for a low severity incident>
    - Informational
    <Criteria for an informational incident>
    No other value is allowed.

Detections:
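
Because the prompt fixes the response format, the model’s answer can be parsed and validated mechanically. The helper below is a minimal sketch; the function name is hypothetical, and returning None for any malformed or unexpected answer is an assumption about how such cases might be handled.

```python
import json
from typing import Optional

ALLOWED_LEVELS = {"Critical", "High", "Medium", "Low", "Informational"}


def parse_severity(raw_response: str) -> Optional[str]:
    """Extract the severity level from a {"severity_pred": "..."} response.

    Returns None when the response is not valid JSON, is not a JSON object,
    or carries a value outside the allowed severity levels.
    """
    try:
        payload = json.loads(raw_response)
    except json.JSONDecodeError:
        return None
    if not isinstance(payload, dict):
        return None
    label = payload.get("severity_pred")
    if isinstance(label, str) and label in ALLOWED_LEVELS:
        return label
    return None


print(parse_severity('{"severity_pred": "High"}'))    # High
print(parse_severity('{"severity_pred": "Urgent"}'))  # None
```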

Various experimental setups are used for this task, including zero-shot prompting, three-shot prompting using random or nearest-neighbor incidents examples, and simple classifiers.

This task turned out to be quite challenging, because of the noise in the target labels and the inherent difficulty of assessing the criticality of an incident without further investigation by models that weren’t trained specifically for this use case.

Even under various setups, such as few-shot prompting with nearest neighbor incidents, the model’s performance couldn’t reliably outperform random chance. For reference, the baseline accuracy on the test set is approximately 71 percent and the baseline balanced accuracy is 20 percent.
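
To make the two baseline figures concrete, the sketch below computes accuracy and balanced accuracy (the mean of per-class recall) for a classifier that always predicts the most frequent label. Reading the baseline as a majority-class predictor is an assumption, since the post does not say how it was computed, but it is consistent with the numbers: with five classes, such a predictor has a recall of 1 on one class and 0 on the other four, giving a balanced accuracy of 1/5. The class counts are invented and say nothing about the real label distribution.

```python
from collections import Counter

LABELS = ["Critical", "High", "Medium", "Low", "Informational"]


def accuracy(y_true, y_pred):
    """Fraction of predictions that equal the target label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def balanced_accuracy(y_true, y_pred, labels):
    """Mean of per-class recall over the classes present in y_true."""
    recalls = []
    for label in labels:
        indices = [i for i, t in enumerate(y_true) if t == label]
        if not indices:
            continue
        recalls.append(sum(y_pred[i] == label for i in indices) / len(indices))
    return sum(recalls) / len(recalls)


# Invented toy distribution: 100 incidents, 71 of them in one class.
toy_counts = {"Critical": 2, "High": 7, "Medium": 71, "Low": 15, "Informational": 5}
y_true = [label for label, n in toy_counts.items() for _ in range(n)]

# Majority-class baseline: always predict the most frequent label.
majority = Counter(y_true).most_common(1)[0][0]
y_pred = [majority] * len(y_true)

print(accuracy(y_true, y_pred))                   # 0.71
print(balanced_accuracy(y_true, y_pred, LABELS))  # 0.2
```

The gap between the two numbers shows why plain accuracy is misleading on imbalanced severity labels and why balanced accuracy is reported alongside it.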

Figure 1 presents the confusion matrix of the model’s responses, which shows the model’s classification performance for every label in a single graph. Only 12% (0.12) of the actual Critical incidents were correctly classified; 50% of the Critical incidents were predicted as High, 25% as Medium, and 12% as Informational. Accuracy is similarly low on the other labels, and lowest on the Low label, where only 2% of the incidents were correctly predicted. There is also a notable tendency to overpredict the High and Medium categories across the board.

Figure 1: Confusion matrix for the five-severity-level classification using Anthropic Claude 3 Sonnet
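
A confusion matrix of this kind, with each row normalized by the number of actual incidents in that class, can be computed as in the following sketch. The eight toy Critical incidents are invented so that the first row reproduces the proportions quoted above; they are not the real data.

```python
LABELS = ["Critical", "High", "Medium", "Low", "Informational"]


def row_normalized_confusion(y_true, y_pred, labels):
    """Rows are actual labels, columns are predicted labels.

    Each non-empty row sums to 1, so cell [i][j] is the share of actual
    class i that was predicted as class j.
    """
    index = {label: i for i, label in enumerate(labels)}
    counts = [[0] * len(labels) for _ in labels]
    for t, p in zip(y_true, y_pred):
        counts[index[t]][index[p]] += 1
    matrix = []
    for row in counts:
        total = sum(row)
        matrix.append([c / total if total else 0.0 for c in row])
    return matrix


# Invented toy data: 8 actual Critical incidents and how they were predicted.
y_true = ["Critical"] * 8
y_pred = ["Critical"] + ["High"] * 4 + ["Medium"] * 2 + ["Informational"]

matrix = row_normalized_confusion(y_true, y_pred, LABELS)
print(dict(zip(LABELS, matrix[0])))
```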

The performance observed in this benchmark task indicates this is a particularly hard problem for an unmodified, all-purpose LLM, and the problem requires a more specialized model, specifically trained or fine-tuned on cybersecurity data.

Task 3: Incident summarization

The third task is concerned with the summarization of incoming incidents. It evaluates the potential of a model to assist threat analysts in the triage and investigation of security incidents as they come in by providing a concise summary of the activity that triggered the incident.

Security incidents typically consist of a series of events occurring on a user endpoint or network, associated with detected suspicious activity. The analysts investigating the incident are presented with a series of events that occurred on the endpoint at the time the suspicious activity was detected. However, analyzing this event sequence can be challenging and time-consuming, resulting in difficulty in identifying noteworthy events. This is where LLMs can be beneficial by helping organize and categorize event data following a specific template, thereby aiding comprehension, and helping analysts quickly determine the appropriate next actions.

We use real incident data from Sophos’s MDR for incident summarization. The input for this task encompasses a set of JSON events, each having distinct schemas and attributes based on the capturing sensor. Along with instructions and a predefined template, this data is provided to the model to generate a summary. The prompt below is an example template prompt for generating incident summaries from SOC data.

As a cybersecurity assistant, your task is to:
    1. Analyze the provided cybersecurity detections data.
    2. Create a report of the events using the information from the '### Detections' section, which may include security artifacts such as command lines and file paths.
    3. [Any other additional general requirements for formatting, etc.]
The report outline should look like this:
Summary:
    <Few sentence description of the activity. [Any additional requirements for the summary: what to  include, etc.]>
Observed MITRE Techniques:
    <List only the registered MITRE Technique or Tactic ID and name pairs if available. The ID should start with 'T'.>
Impacted Hosts:
    <List of all hostname observed in the detections, provide corresponding IPs if available>
Active Users:
    <List of all usernames observed in the detections. There could be multiple, list all of them>
Events:
    <One sentence description for top three detection events. Start the list with \n1. >
IPs/URLs:
    <List available IPs and URLs.>
    <Enumerate only up to ten artifacts under each report category, and summarize any remaining events beyond that.>
Files: 
    <List the files found in the incident as follows:>
    <TEMPLATE FOR FILES WITH DETAILS>
Command Lines: 
    <List the command lines found in the detections as follows:>
    <TEMPLATE FOR COMMAND LINES WITH DETAILS>

### Detections:

Evaluating these generated incident summaries is tricky because several factors must be considered. For example, it’s crucial that the extracted information is not only correct, but also relevant. To gain a general understanding of the quality of a model’s incident summarization, we use a set of five distinct metrics and rely on a dataset comprising N incidents. We compare the generated descriptions with corresponding gold-standard descriptions crafted based on Sophos analysts’ feedback.

We compute two classes of metrics. The first class of metrics assesses factual accuracy; they are used to evaluate how many artifacts such as command lines, file paths, usernames, and so on were correctly identified and summarized by the model. The computation here is straightforward; we compute the average distance across extracted artifacts between the generated description and the target. We use two distance metrics, Levenshtein distance and longest common subsequence (LCS).
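
The two distance measures can be sketched in plain Python as shown below. How distances are turned into an accuracy score is not specified in this post, so the normalization by the longer string’s length and the simple averaging over artifact pairs are assumptions made for illustration; the artifact strings are invented.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution or match
        prev = cur
    return prev[-1]


def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b."""
    prev = [0] * (len(b) + 1)
    for ca in a:
        cur = [0]
        for j, cb in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if ca == cb else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def levenshtein_similarity(a: str, b: str) -> float:
    """Assumed normalization: 1 - distance / length of the longer string."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def lcs_similarity(a: str, b: str) -> float:
    """Assumed normalization: LCS length / length of the longer string."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else lcs_length(a, b) / longest


def mean_similarity(pairs, similarity) -> float:
    """Average a similarity function over (generated, target) artifact pairs."""
    return sum(similarity(g, t) for g, t in pairs) / len(pairs)


# Invented artifact pairs: (extracted by the model, target).
pairs = [("powershell.exe", "powershell.exe"), ("cmd.exe /c whoami", "cmd.exe /c whoami /all")]
print(mean_similarity(pairs, levenshtein_similarity))
print(mean_similarity(pairs, lcs_similarity))
```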

The second class of metrics is used to provide a more semantic evaluation of the generated description, using three different metrics:

BERTScore metric: This metric is used to evaluate the generated summaries using a pre-trained BERT model’s contextual embeddings. It determines the similarity between the generated summary and the reference summary using cosine similarity.
ADA2 embeddings cosine similarity: This metric assesses the cosine similarity of ADA2 embeddings of tokens in the generated summary with those of the reference summary.
METEOR score: METEOR is an evaluation metric based on the harmonic mean of unigram precision and recall.
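
The first two metrics both come down to a cosine similarity between embedding vectors. The sketch below shows that computation on toy vectors; in practice the vectors would come from the embedding model, which is out of scope here, and the function name is illustrative.

```python
import math


def cosine_similarity(u, v) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Toy vectors standing in for summary embeddings.
generated_embedding = [0.2, 0.1, 0.7]
reference_embedding = [0.25, 0.05, 0.6]
print(cosine_similarity(generated_embedding, reference_embedding))
```

Values close to 1 mean the two vectors point in nearly the same direction, which is why semantically similar summaries can score high even when their wording differs.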

More advanced evaluation methods can be used such as training a reward model on human preferences and using it as an evaluator, but for the sake of simplicity and cost-effectiveness, we limited the scope to these metrics.

Below is a summary of our results on this task:

Model	Levenshtein-based factual accuracy	LCS-based factual accuracy	BERTScore	Cosine similarity of ADA2 embeddings	METEOR score
Anthropic’s Claude 3 Sonnet	0.810	0.721	0.886	0.951	0.4165

Based on these findings, we gain a broad understanding of the performance of the model when it comes to generating incident summaries, focusing especially on factual accuracy and retrieval rate. Anthropic’s Claude 3 Sonnet model can capture the activity that’s occurring in the incident and summarize it well. However, it ignores certain instructions such as defanging all IPs and URLs. The returned reports are also not fully aligned with the target responses on a token level as signaled by the METEOR score. Anthropic’s Claude 3 Sonnet model skims over some details and explanations in the reports.

Experimental setup using Amazon Bedrock and Amazon SageMaker

This section outlines the experimental setup for evaluating various large language models (LLMs) using Amazon Bedrock and Amazon SageMaker. These services allowed us to efficiently interact with and deploy multiple LLMs for quick and cost-effective experimentation.

Amazon Bedrock

Amazon Bedrock is a managed service that lets you experiment with various LLMs quickly and on demand. This means you can interact and experiment with LLMs without having to self-host them, and pay only for the tokens consumed. We used the InvokeModel API to interact with the model with minimal latency. We wrote the following function that lets us call different models by passing the necessary inference parameters to the API. For more details on what the inference parameters are per provider, we recommend you read the Inference request parameters and response fields for foundation models section in the Amazon Bedrock documentation. The example below uses the function based on Anthropic’s Claude 3 Sonnet model. Notice that we gave the model a role via the system prompt and that we prefilled its response.

import json

system_prompt = "You are a helpful cybersecurity incident investigation expert that classifies incidents according to their severity level given a set of detections per incident"
messages = [
    {
        "role": "user",
        # Triple quotes keep the embedded JSON example and the line breaks intact.
        "content": """Respond strictly with this JSON format: {"severity_pred": "xxx"} where xxx should only be either:
- Critical,
<Criteria for a critical incident>
- High,
<Criteria for a high severity incident>
- Medium,
<Criteria for a medium severity incident>
- Low,
<Criteria for a low severity incident>
- Informational
<Criteria for an informational incident>
No other value is allowed.""",
    },
    # Prefill the beginning of the assistant's response.
    {"role": "assistant", "content": " Detections:"},
]

def generate_message(bedrock_runtime, model_id, system_prompt, messages, max_tokens):
    # Build the request body in the Anthropic Messages format used on Amazon Bedrock.
    body = json.dumps(
        {
            "anthropic_version": "bedrock-2023-05-31",
            "max_tokens": max_tokens,
            "system": system_prompt,
            "messages": messages,
        }
    )
    response = bedrock_runtime.invoke_model(body=body, modelId=model_id)
    # The response body is a stream; read it and decode the JSON payload.
    response_body = json.loads(response.get("body").read())
    return response_body

The above example is based on our use case. The model_id parameter specifies the identifier of the model you want to invoke using the Amazon Bedrock runtime. We used the model ID anthropic.claude-3-sonnet-20240229-v1:0. For other model IDs, refer to the Amazon Bedrock documentation. For further details about this API, we recommend you read the API documentation. We advise you to adapt the function to your own use case and requirements.

Our analysis in this blog post has focused on Anthropic’s Claude 3 Sonnet model and three specific use cases. These insights can be adapted to other SOCs’ specific requirements and desired models. For example, it’s possible to access other models such as Meta’s Llama models, Mistral models, Amazon Titan models, and others. For additional models, we used Amazon SageMaker JumpStart.

Amazon SageMaker

Amazon SageMaker is a fully managed machine learning (ML) service. With SageMaker, data scientists and developers can quickly and confidently build, train, and deploy ML models into a production-ready hosted environment. Amazon SageMaker JumpStart is a feature within SageMaker that offers practitioners a hub of publicly available and proprietary foundation models (FMs), including a wide range of LLMs that you can quickly tune and deploy in a low-code manner. To quickly deploy and experiment with the out-of-the-box models in SageMaker in a cost-effective manner, we deployed the LLMs from SageMaker JumpStart using asynchronous inference endpoints.

Inference endpoints were an effortless way for us to directly download these models from the respective Hugging Face repositories and deploy them using a few lines of code and pre-made Text Generation Inference (TGI) containers (see the example notebook on GitHub). In addition, we used asynchronous inference endpoints with autoscaling, which helped us manage costs by automatically scaling the inference endpoints down to zero when they weren’t being used. Considering the number of endpoints we were creating, asynchronous inference made it simple for us to manage them: the endpoints were ready to use whenever they were needed and scaled down when they weren’t, without additional management on our end after the scaling policy was defined.
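
As a rough sketch of what such a scaling policy can look like, the helper below builds the two request payloads for Application Auto Scaling: a scalable target with a minimum capacity of zero, and a target-tracking policy on the asynchronous endpoint’s backlog metric. The endpoint name, the AllTraffic variant name, the instance limit, the backlog target, and the cooldown values are illustrative assumptions, not the settings used in our experiments.

```python
def scale_to_zero_config(endpoint_name: str,
                         variant_name: str = "AllTraffic",
                         max_instances: int = 2,
                         backlog_target: float = 5.0):
    """Build keyword arguments for register_scalable_target and put_scaling_policy."""
    resource_id = f"endpoint/{endpoint_name}/variant/{variant_name}"
    dimension = "sagemaker:variant:DesiredInstanceCount"
    target = {
        "ServiceNamespace": "sagemaker",
        "ResourceId": resource_id,
        "ScalableDimension": dimension,
        "MinCapacity": 0,  # allows the endpoint to scale in to zero instances
        "MaxCapacity": max_instances,
    }
    policy = {
        "PolicyName": f"{endpoint_name}-backlog-tracking",  # hypothetical name
        "ServiceNamespace": "sagemaker",
        "ResourceId": resource_id,
        "ScalableDimension": dimension,
        "PolicyType": "TargetTrackingScaling",
        "TargetTrackingScalingPolicyConfiguration": {
            "TargetValue": backlog_target,  # queued requests per instance
            "CustomizedMetricSpecification": {
                "MetricName": "ApproximateBacklogSizePerInstance",
                "Namespace": "AWS/SageMaker",
                "Dimensions": [{"Name": "EndpointName", "Value": endpoint_name}],
                "Statistic": "Average",
            },
            "ScaleInCooldown": 600,   # seconds; illustrative
            "ScaleOutCooldown": 300,  # seconds; illustrative
        },
    }
    return target, policy


target, policy = scale_to_zero_config("my-async-llm-endpoint")  # hypothetical endpoint name

# With AWS credentials configured, the payloads could be applied like this:
#   import boto3
#   autoscaling = boto3.client("application-autoscaling")
#   autoscaling.register_scalable_target(**target)
#   autoscaling.put_scaling_policy(**policy)
```

Keeping the payloads in plain dictionaries makes it easy to reuse the same policy across many endpoints by changing only the endpoint name.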

Next steps

In this blog post, we applied the tasks to a single model to showcase it as an example; in reality, you would select a couple of LLMs based on your requirements and put them through the experiments in this post. From there, if the out-of-the-box models aren’t sufficient for the task, you would select the best-suited LLM and then fine-tune it on the specific task.

For example, based on the outcomes of our three experimental tasks, we found that the results of the incident information summarization task didn’t meet our expectations. Therefore, we will fine-tune the out-of-the-box model that best suits our needs. This fine-tuning process can be accomplished using Amazon Bedrock Custom Models or SageMaker fine-tuning, and the fine-tuned model could then be deployed either by importing it into Amazon Bedrock as a custom model or by deploying it to a SageMaker endpoint.

In this blog we covered the experimentation phase. Once you identify an LLM that meets your performance requirements, it’s important to start considering how to productionize it. When productionizing an LLM, it is important to consider things like guardrails and scalability. Implementing guardrails helps you minimize the risk of model misuse and security breaches. Amazon Bedrock Guardrails enables you to implement safeguards for your generative AI applications based on your use cases and responsible AI policies. This blog covers how to build guardrails in your generative AI applications. When moving an LLM into production, you also want to validate the scalability of the LLM based on request traffic. In Amazon Bedrock, consider increasing the quotas of your model, using batch inference, queuing the requests, or even distributing the requests between different Regions that have the same model. Select the technique that suits you based on your use case and traffic.

Conclusion

In this post, SophosAI shared insights on how to use and evaluate out-of-the-box LLMs following a set of specialized tasks for the enhancement of a security operations center’s (SOC) productivity by using Amazon Bedrock and Amazon SageMaker. We used Anthropic’s Claude 3 Sonnet model on Amazon Bedrock to illustrate three use cases.

Amazon Bedrock and SageMaker have been key to enabling us to run these experiments. With the convenient access to high-performing foundation models (FMs) from leading AI companies provided by Amazon Bedrock through a single API call, we were able to test various LLMs without needing to deploy them ourselves. Additionally, the on-demand pricing model allowed us to only pay for what we used based on token consumption.

To access additional models with flexible control, SageMaker is a great alternative that offers a wide range of LLMs ready for deployment. While you would deploy these models yourself, you can still achieve great cost optimization by using asynchronous endpoints with a scaling policy that scales the instance down to zero when not in use.

General takeaways as to the applicability of an LLM such as Anthropic’s Claude 3 Sonnet model in cybersecurity can be summarized as follows:

An out-of-the-box LLM can be an effective assistant in threat hunting and incident investigation. However, it still requires some guardrails and guidance. We believe that this potential application can be implemented using an existing powerful model, such as Anthropic’s Claude 3 Sonnet model, with careful prompt engineering.
When it comes to summarizing incident information from raw data, Anthropic’s Claude 3 Sonnet model performs adequately, but there’s room for improvement through fine-tuning.
Evaluating individual artifacts or groups of artifacts remains a challenging task for a pre-trained LLM. To tackle this problem, a specialized LLM trained specifically on cybersecurity data might be required.

It is also worth noting that while we used the InvokeModel API from Amazon Bedrock, another, simpler way to access Amazon Bedrock models is by using the Converse API. The Converse API provides consistent API calls that work with Amazon Bedrock models that support messages. This means you can write code once and use it with different models. Should a model have unique inference parameters, the Converse API also allows you to pass those unique parameters in a model-specific structure.
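As a minimal sketch of the Converse API described above, the following Python code builds one request shape that can be reused across models and passes any model-specific parameters in a separate structure. It assumes a boto3 `bedrock-runtime` client is supplied by the caller; the function names `build_converse_request` and `ask` are illustrative and are not taken from the experiments in this post.

```python
from typing import Any, Dict, Optional


def build_converse_request(
    model_id: str,
    prompt: str,
    max_tokens: int = 512,
    temperature: float = 0.0,
    extra_model_fields: Optional[Dict[str, Any]] = None,
) -> Dict[str, Any]:
    """Build keyword arguments for the Bedrock Runtime Converse API."""
    request: Dict[str, Any] = {
        "modelId": model_id,
        "messages": [{"role": "user", "content": [{"text": prompt}]}],
        "inferenceConfig": {"maxTokens": max_tokens, "temperature": temperature},
    }
    if extra_model_fields:
        # Model-specific inference parameters travel in their own structure.
        request["additionalModelRequestFields"] = extra_model_fields
    return request


def ask(client: Any, **request_options: Any) -> str:
    """Send one user message and return the first text block of the reply."""
    response = client.converse(**build_converse_request(**request_options))
    return response["output"]["message"]["content"][0]["text"]


# Example usage (requires AWS credentials and model access in Amazon Bedrock):
#   import boto3
#   bedrock_runtime = boto3.client("bedrock-runtime", region_name="us-east-1")
#   print(ask(bedrock_runtime,
#             model_id="anthropic.claude-3-sonnet-20240229-v1:0",
#             prompt="Summarize this incident in two sentences: ..."))
```

Because only `model_id` and the optional `extra_model_fields` change between models, the same helper can be pointed at a different model that supports messages without rewriting the request body.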

About the Authors

Benoit de Patoul is a GenAI/AI/ML Specialist Solutions Architect at AWS. He helps customers by providing guidance and technical assistance to build solutions related to GenAI/AI/ML using Amazon Web Services. In his free time, he likes to play piano and spend time with friends.

Naresh Nagpal is a Solutions Architect at AWS with extensive experience in application development, integration, and technology architecture. At AWS, he works with ISV customers in the UK to help them build and modernize their SaaS applications on AWS. He is also helping customers to integrate GenAI capabilities in their SaaS applications.

Adarsh Kyadige oversees the Research wing of the Sophos AI team, where he has been working since 2018 at the intersection of Machine Learning and Security. He earned a Master’s degree in Computer Science, with a specialization in Artificial Intelligence and Machine Learning, from UC San Diego. His interests and responsibilities involve applying Deep Learning to Cybersecurity, as well as orchestrating pipelines for large scale data processing. In his leisure time, Adarsh can be found at the archery range, tennis courts, or in nature. His latest research can be found on Google Scholar.

Salma Taoufiq was a Senior Data Scientist at Sophos focusing on the intersection of machine learning and cybersecurity. With an undergraduate background in computer science, she graduated from the Central European University with an MSc in Mathematics and Its Applications. When not developing a malware detector, Salma is an avid hiker, traveler, and consumer of thrillers.

Enhanced observability for AWS Trainium and AWS Inferentia with Datadog

November 26, 2024

by Curtis Maher Amazon AWS

This post is co-written with Curtis Maher and Anjali Thatte from Datadog.

This post walks you through Datadog’s new integration with AWS Neuron, which helps you monitor your AWS Trainium and AWS Inferentia instances by providing deep observability into resource utilization, model execution performance, latency, and real-time infrastructure health, enabling you to optimize machine learning (ML) workloads and achieve high performance at scale.

Neuron is the SDK used to run deep learning workloads on Trainium and Inferentia based instances. AWS AI chips, Trainium and Inferentia, enable you to build and deploy generative AI models at higher performance and lower cost. With the increasing use of large models, requiring a large number of accelerated compute instances, observability plays a critical role in ML operations, empowering you to improve performance, diagnose and fix failures, and optimize resource utilization.

Datadog, an observability and security platform, provides real-time monitoring for cloud infrastructure and ML operations. Datadog is excited to launch its Neuron integration, which pulls metrics collected by the Neuron SDK’s Neuron Monitor tool into Datadog, enabling you to track the performance of your Trainium and Inferentia based instances. By providing real-time visibility into model performance and hardware usage, Datadog helps you achieve efficient training and inference, optimized resource utilization, and the prevention of service slowdowns.

Comprehensive monitoring for Trainium and Inferentia

Datadog’s integration with the Neuron SDK automatically collects metrics and logs from Trainium and Inferentia instances and sends them to the Datadog platform. Upon enabling the integration, you will find an out-of-the-box dashboard in Datadog, making it straightforward to start monitoring quickly. You can also modify preexisting dashboards and monitors, and add new ones tailored to your specific monitoring requirements.

The Datadog dashboard offers a detailed view of your AWS AI chip (Trainium or Inferentia) performance, such as the number of instances, availability, and AWS Region. Real-time metrics give an immediate snapshot of infrastructure health, with preconfigured monitors alerting teams to critical issues like latency, resource utilization, and execution errors. The following screenshot shows an example dashboard.

For instance, when latency spikes on a specific instance, a monitor in the monitor summary section of the dashboard will turn red and trigger alerts through Datadog or other paging mechanisms (like Slack or email). High latency may indicate high user demand or inefficient data pipelines, which can slow down response times. By identifying these signals early, teams can quickly respond in real time to maintain high-quality user experiences.

Datadog’s Neuron integration enables tracking of key performance aspects, providing crucial insights for troubleshooting and optimization:

NeuronCore counters – Monitoring NeuronCore utilization helps make sure that cores are being used efficiently, helping you identify if you need to make adjustments to balance workloads or optimize performance.
Execution status – You can monitor the progress of training jobs, including completed tasks and failed runs. This data makes sure models are being trained smoothly and reliably. If failures increase, it may signal issues with data quality, model configurations, or resource limitations that need to be addressed.
Memory used – You can gain a granular view of memory usage across both the host and Neuron device, including memory allocated for tensors and model execution. This helps you understand how effectively resources are being used, and when it might be time to rebalance workloads or scale resources to prevent bottlenecks from causing disruptions during training.
Neuron runtime vCPU usage – You can keep an eye on vCPU utilization to make sure your models aren’t overburdening the infrastructure. When vCPU usage crosses a certain threshold, you will be alerted to decide whether to redistribute workloads or upgrade instance types to avoid training slowdowns.

By consolidating these metrics into one view, Datadog provides a powerful tool for maintaining efficient, high-performance Neuron workloads, helping teams identify issues in real time and optimize infrastructure as needed. Using the Neuron integration combined with Datadog’s LLM Observability capabilities, you can gain comprehensive visibility into your large language model (LLM) applications.

Get started with Datadog and Inferentia and Trainium

Datadog’s integration with Neuron provides real-time visibility into Trainium and Inferentia, helping you optimize resource utilization, troubleshoot issues, and achieve seamless performance at scale. To get started, see AWS Inferentia and AWS Trainium Monitoring.

To learn more about how Datadog integrates with Amazon ML services and Datadog LLM Observability, see Monitor Amazon Bedrock with Datadog and Monitoring Amazon SageMaker with Datadog.

If you don’t already have a Datadog account, you can sign up for a free 14-day trial today.

About the Authors

Curtis Maher is a Product Marketing Manager at Datadog, focused on the platform’s cloud and AI/ML integrations. Curtis works closely with Datadog’s product, marketing, and sales teams to coordinate product launches and help customers observe and secure their cloud infrastructure.

Anjali Thatte is a Product Manager at Datadog. She currently focuses on building technology to monitor AI infrastructure and ML tooling and helping customers gain visibility across their AI application tech stacks.

Jason Mimick is a Senior Partner Solutions Architect at AWS working closely with product, engineering, marketing, and sales teams daily.

Anuj Sharma is a Principal Solution Architect at Amazon Web Services. He specializes in application modernization with hands-on technologies such as serverless, containers, generative AI, and observability. With over 18 years of experience in application development, he currently leads co-building with containers and observability focused AWS Software Partners.

Create a virtual stock technical analyst using Amazon Bedrock Agents

November 26, 2024

by Bharath Sridharan Amazon AWS

Stock technical analysis questions can be as unique as the individual stock analysts themselves. Queries often involve multiple technical indicators like Simple Moving Average (SMA), Exponential Moving Average (EMA), Relative Strength Index (RSI), and others. Answering these varied questions would mean writing complex business logic to unpack the query into parts and fetch the necessary data. With the number of indicators available, the possibility of having one or many of them in any combination, and each of those indicators over different time periods, building this business logic into code can get quite complex.

As AI technology continues to evolve, the capabilities of generative AI agents continue to expand, offering even more opportunities for you to gain a competitive edge. At the forefront of this evolution sits Amazon Bedrock, a fully managed service that makes high-performing foundation models (FMs) from Amazon and other leading AI companies available through a single API. With Amazon Bedrock, you can build and scale generative AI applications with security, privacy, and responsible AI. Amazon Bedrock Agents plans and runs multistep tasks using company systems and data sources—from answering customer questions about your product availability to taking their orders. With Amazon Bedrock, you can create an agent in just a few quick steps by first selecting an FM and providing it access to your enterprise systems, knowledge bases, and actions to securely execute your APIs. These actions can be implemented in the cloud using AWS Lambda, or you can use local business logic with return of control. An agent analyzes the user request and automatically calls the necessary APIs and data sources to fulfill the request. Amazon Bedrock Agents offers enhanced security and privacy—no need for you to engineer prompts, manage session context, or manually orchestrate tasks.

In this post, we create a virtual analyst that can answer natural language queries about stocks matching certain technical indicator criteria using Amazon Bedrock Agents. As part of the agent, we configure action groups consisting of Lambda functions, which give the agent the ability to perform various actions. The Amazon Bedrock agent will transform the user’s natural language query into relevant Lambda calls, passing the technical indicators and their needed duration. Lambda will access the open source stock data pre-fetched into an Amazon Simple Storage Service (Amazon S3) bucket, calculate the technical indicator in real time, and pass it back to the agent. The agent will take further actions like other Lambda calls or filtering and ordering based on the task.

Solution overview

The technical analysis assistant solution will use Amazon Bedrock Agents to answer natural language technical analysis queries, from simple ones like “Can you give me a list of stocks in the NASDAQ 100 index” to complex ones like “Which stocks in the NASDAQ 100 index has both grown over 10% in last 6 months and also closed above their 20-day SMA?” Agents orchestrate and analyze the task and break it down into the correct logical sequence using the FM’s reasoning abilities. Agents automatically call the necessary Lambda functions to fetch the relevant stock technical analysis data, determining along the way if they can proceed or if they need to gather more information.

The following diagram illustrates the solution architecture.

The workflow consists of the following steps:

The solution spins up a Python-based Lambda function that fetches the daily stock data for the last year using the yfinance package. The Lambda function is triggered to run every day using an Amazon EventBridge rule. The function puts the last 1 year of stock data into an S3 bucket.
A user asks a natural language query, like “Can you give me a list of stocks in NASDAQ 100 index” or “Which stocks have closed over both 20 SMA and 50 EMA in the FTSE 100 index?”
This query is passed to the Amazon Bedrock agent powered by Anthropic’s Claude 3 Sonnet. The agent deconstructs the user query, creates an action plan, and executes it step-by-step to fetch various data needed to answer the question. To fetch the needed data, the agent has three action groups, each powered by a Lambda function that can use the raw data stored in the S3 bucket to calculate technical indicators and other stock-related information. Based on the response from the action group and the agent’s plan of action, the agent will continue to make calls or take other actions like filtering or summarizing until it arrives at the answer to the question. The action groups are as follows:
1. get-index – Get the stock symbol of constituents for a given index. The example currently has constituents configured for Nasdaq 100, FTSE 100, and Nifty 50 indexes.
2. get-stock-change – For a given stock or list of stocks, calculate the change percentage over a given period based on the pre-fetched raw data in Amazon S3. This solution is currently configured with data for the past 1 year.
3. get-technical-analysis – For a given stock or list of stocks, calculate the given technical indicator for a given time period. It also fetches the last closing price of the stock based on the pre-fetched raw data in Amazon S3. This solution is currently configured to handle the SMA, EMA, and RSI technical indicators for up to 1 year.
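To make the arithmetic behind the get-stock-change and get-technical-analysis action groups concrete, the following is a plain-Python sketch of a percentage change, an SMA, and an EMA over a list of closing prices. It is an illustration under stated assumptions, not the code deployed by this solution: the function names are hypothetical, the EMA is seeded with the first close (one common convention), and the deployed Lambda functions may compute these values differently.

```python
from typing import List, Optional, Sequence


def percent_change(closes: Sequence[float], days: int) -> float:
    """Percentage change from the close `days` trading days ago to the last close."""
    if days <= 0 or len(closes) <= days:
        raise ValueError("need more than `days` closing prices")
    start, end = closes[-1 - days], closes[-1]
    return (end - start) / start * 100.0


def sma(closes: Sequence[float], window: int) -> List[Optional[float]]:
    """Simple moving average; None until `window` prices are available."""
    out: List[Optional[float]] = []
    running = 0.0
    for i, price in enumerate(closes):
        running += price
        if i >= window:
            running -= closes[i - window]  # drop the price leaving the window
        out.append(running / window if i >= window - 1 else None)
    return out


def ema(closes: Sequence[float], window: int) -> List[float]:
    """Exponential moving average with alpha = 2 / (window + 1)."""
    alpha = 2.0 / (window + 1)
    out: List[float] = []
    for price in closes:
        out.append(price if not out else alpha * price + (1 - alpha) * out[-1])
    return out


def closed_above_sma(closes: Sequence[float], window: int) -> bool:
    """True if the last close is above the latest SMA value."""
    if len(closes) < window:
        return False
    latest = sma(closes, window)[-1]
    return latest is not None and closes[-1] > latest
```

A query such as "grown over 10% in the last 6 months and closed above the 20-day SMA" then reduces to combining `percent_change(closes, days) > 10` with `closed_above_sma(closes, 20)` for each stock in the index.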

Prerequisites

To set up this solution, you need a basic knowledge of AWS and the relevant AWS services. Additionally, request model access on Amazon Bedrock for Anthropic’s Claude 3 Sonnet.

Deploy the solution

Complete the following steps to deploy the solution using AWS CloudFormation:

Launch the CloudFormation stack in the us-east-1 AWS Region:

For Stack name, enter a stack name of your choice.
Leave the rest as defaults.
Choose Next.
Choose Next.
Select the acknowledgement check box and choose Submit.

Wait for the stack creation to complete.
Verify all resources are created on the stack details page.

The CloudFormation stack creates the solution described in the solution overview and the following key resources:

StockDataS3Bucket – The S3 bucket to store the 1-year stock data.
YfinDailyLambda – A Python Lambda function to fetch the last 1 year of stock data for the Nasdaq 100, FTSE 100, and Nifty 50 indexes from Yahoo Finance using the yfinance package:

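# Example call to Yahoo Finance to get stock history
import yfinance as yf

stock_symbol = "AMZN"  # example stock symbol; replace with the ticker you need
stock_ticker = yf.Ticker(stock_symbol)
# period is the time duration of history to fetch, for example "1y"
stock_history = stock_ticker.history(period="1y")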

YfinDailyLambdaScheduleRule – An EventBridge rule to trigger the YfinDailyLambda function daily to get the latest stock data.
InvokeYfinDailyLambda and InvokeLambdaFunction – A custom CloudFormation resource and its Lambda function to invoke the YfinDailyLambda function as part of the stack creation to fetch the initial data.
GetIndexLambda – This function takes in an index name as input and returns the list of stocks in the given index.
GetStockChangeLambda – This function takes a list of stocks and number of days as input, fetches the stock data from the S3 bucket, calculates the percentage change over the period for the stocks, and returns the data.
GetStockTechAnalysisLambda – This function takes a list of stocks, number of days, and a technical indicator as input and returns the last close and the technical indicator over the number of days for the given list of stocks. For example:

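# Sample code to calculate the Simple Moving Average (SMA), a technical indicator
from ta.trend import SMAIndicator

# close_prices is a pandas Series of closing prices, for example stock_history["Close"]
close_prices = stock_history["Close"]
# window is the SMA window in days (20 is an example value)
indicator_ta = SMAIndicator(close=close_prices, window=20)
stock_SMA = indicator_ta.sma_indicator()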

StockBotAgent – The Amazon Bedrock agent created with Anthropic’s Claude 3 Sonnet model with three action groups, each mapped to a Lambda function. We give the agent instructions in natural language. In our solution, part of our instruction is that “You can fetch the list of stocks in a given index” so the agent knows it can fetch stocks in an index. We configure the action groups as part of the agent and use the OpenAPI 3 schema standard to describe the Lambda functionality, so the agent understands when and how to invoke the Lambda functions. The following is a snippet of the get-index action group OpenAPI schema where we describe its functionality, its input and output parameters, and format:

{
  "openapi": "3.0.0",
  "info": {
    "title": "Get Index list of stocks api",
    "version": "1.0.0",
    "description": "API to fetch the list of stocks in a given index"
  },
  "paths": {
    "/get-index": {
      "get": {
        "summary": "Get list of stock symbols in index",
        "description": "Based on provided index, return list of stock symbols in the index",
        "operationId": "getIndex",
        "parameters": [
          {
            "name": "indexName",
            "in": "path",
            "description": "Index Name",
            "required": true,
            "schema": {
              "type": "string"
            }
          }
        ],
        "responses": {
          "200": {
            "description": "Get Index stock list",
            "content": {
              "application/json": {
                "schema": {
                  "type": "array",

Test the solution

To test the solution, complete the following steps:

On the Amazon Bedrock console, choose Agents in the navigation pane.
Choose the agent created by the CloudFormation stack.

Under Test, choose the alias with the name Version 1 and expand the Test

Now you can enter questions to interact with the agent.

Let’s start with the query “Can you give me a list of stocks in Nasdaq.”

You can see the answer in the following screenshot. Expand the Trace Step section in the right pane to see the agent’s rationale and the call to the Lambda function.

Now let’s ask a question that is likely to use all three action groups and their Lambda functions: “Can you give list of stocks that has both grown over 10% in last 6 months and also closed above their 20-day SMA. Use stocks from the Nasdaq index.”

You will get the response shown in the following screenshot, and in the trace steps, you will see the various Lambda functions being invoked at different steps as the agent reasons through the steps to get to the answer to the question.
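Each Lambda invocation that appears in the trace follows the request and response contract that Amazon Bedrock Agents uses for action groups defined with an OpenAPI schema. The following is a minimal sketch of a get-index style handler, not the code deployed by the CloudFormation stack; the `INDEX_CONSTITUENTS` dictionary and its abbreviated ticker list are hypothetical placeholders.

```python
import json
from typing import Any, Dict, List

# Hypothetical, abbreviated constituent list for illustration only.
INDEX_CONSTITUENTS: Dict[str, List[str]] = {
    "NASDAQ 100": ["AAPL", "AMZN", "MSFT"],
}


def lambda_handler(event: Dict[str, Any], context: Any) -> Dict[str, Any]:
    """Handle an action group call that was defined with an OpenAPI schema."""
    # Parameters arrive as a list of {"name", "type", "value"} dictionaries.
    params = {p["name"]: p["value"] for p in event.get("parameters", [])}
    index_name = params.get("indexName", "").upper()

    symbols = INDEX_CONSTITUENTS.get(index_name)
    if symbols is not None:
        status_code, body = 200, symbols
    else:
        status_code, body = 404, {"error": f"Unknown index: {index_name}"}

    # Echo the routing fields back so the agent can match the response to its call.
    return {
        "messageVersion": "1.0",
        "response": {
            "actionGroup": event["actionGroup"],
            "apiPath": event["apiPath"],
            "httpMethod": event["httpMethod"],
            "httpStatusCode": status_code,
            "responseBody": {"application/json": {"body": json.dumps(body)}},
        },
    }
```

The response body is returned as a JSON string keyed by content type, which the agent then reads to decide on its next step.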

You can further test with additional prompts, such as:

Can you give the top three gainers in terms of percentage in the last 6 months in the Nifty 50 index?
Which stocks have closed over both 20 SMA and 50 EMA in the FTSE 100 index?
Can you give list of stocks that has grown over 10% in last 6 months and closed above 20-day SMA and 50-day EMA. Use stocks from the FTSE 100 index. Follow up question: Of these stocks, are there any that have grown over 25% in the last months? If so, can you give me the stocks and their growth percent over 6 months?

Programmatically invoke the agent

When you’re satisfied with the performance of your agent, you can build an application to programmatically invoke the agent using the InvokeAgent API. The agent ID and the agent alias ID needed for invoking the agent alias programmatically can be found on the Outputs tab of the CloudFormation stack, titled AgentId and AgentAliasId, respectively. To learn more about the programmatic invocation, refer to the following Python example and JavaScript example.
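As a minimal sketch of such an invocation (separate from the linked examples), the following Python function calls the InvokeAgent API through a boto3 `bedrock-agent-runtime` client and joins the streamed response chunks into one string. The function name `ask_agent` and the randomly generated session ID are illustrative assumptions.

```python
import uuid
from typing import Any, List, Optional


def ask_agent(
    client: Any,
    agent_id: str,
    agent_alias_id: str,
    question: str,
    session_id: Optional[str] = None,
) -> str:
    """Invoke a Bedrock agent alias and return the full text of its answer."""
    response = client.invoke_agent(
        agentId=agent_id,
        agentAliasId=agent_alias_id,
        # Reuse the same session ID across calls to keep conversation context.
        sessionId=session_id or str(uuid.uuid4()),
        inputText=question,
    )
    parts: List[str] = []
    # The completion is an event stream; answer text arrives in "chunk" events.
    for event in response["completion"]:
        chunk = event.get("chunk")
        if chunk:
            parts.append(chunk["bytes"].decode("utf-8"))
    return "".join(parts)


# Example usage (requires AWS credentials; take the IDs from the stack's Outputs tab):
#   import boto3
#   agents_runtime = boto3.client("bedrock-agent-runtime", region_name="us-east-1")
#   print(ask_agent(agents_runtime, "<AgentId>", "<AgentAliasId>",
#                   "Can you give me a list of stocks in the Nasdaq 100 index?"))
```

Passing the same `session_id` on a follow-up call lets the agent answer questions that refer back to earlier results in the conversation.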

Clean up

To avoid charges in your AWS account, clean up the solution’s provisioned resources:

On the Amazon S3 console, empty the S3 bucket created as part of the CloudFormation stack. The bucket name should start with the stack name that you entered while creating the CloudFormation stack. You can also check the name of the bucket on the CloudFormation stack’s Resources tab.
On the AWS CloudFormation console, select the stack you created for this solution and choose Delete.

Conclusion

In this post, we showed how you can use Amazon Bedrock Agents to carry out complex tasks that need multistep orchestration through just natural language instructions. With agents, you can automate tasks for your customers and answer questions for them. We encourage you to explore the Amazon Bedrock Agents User Guide to understand its capabilities further and use it for your use cases.

About the authors

Bharath Sridharan is a Senior Technical Account Manager at AWS and works with strategic AWS customers to proactively optimize their workloads on AWS. Bharath also specializes in AWS machine learning services, with a focus on generative AI.

Apply Amazon SageMaker Studio lifecycle configurations using AWS CDK

November 26, 2024

by Gabriel Rodriguez Garcia Amazon AWS

This post serves as a step-by-step guide on how to set up lifecycle configurations for your Amazon SageMaker Studio domains. With lifecycle configurations, system administrators can apply automated controls to their SageMaker Studio domains and their users. We cover core concepts of SageMaker Studio and provide code examples of how to apply lifecycle configuration to your SageMaker Studio domain to automate behaviors such as preinstallation of libraries and automated shutdown of idle kernels.

Amazon SageMaker Studio is the first integrated development environment (IDE) purposefully designed to accelerate end-to-end machine learning (ML) development. Amazon SageMaker Studio provides a single web-based visual interface where data scientists create dedicated workspaces to perform all ML development steps required to prepare data and build, train, and deploy models. You can create multiple Amazon SageMaker domains, which define environments with dedicated data storage, security policies, and networking configurations. With your domains in place, you can then create domain user profiles, which serve as an access point for data scientists to enter the workspace with user-defined least-privilege permissions. Data scientists use their domain user profiles to launch private or shared Amazon SageMaker Studio spaces to manage the storage and resource needs of the IDEs they use to tackle different ML projects.

To effectively manage and govern both user profiles and domains with SageMaker Studio, you can use Amazon SageMaker Studio lifecycle configurations. For instance, this feature allows you to install custom packages, configure notebook extensions, preload datasets, set up code repositories, or shut down idle notebook kernels automatically. Amazon SageMaker Studio now also supports configuration of idle kernel shutdown directly on the user interface for JupyterLab and Code Editor applications that use Amazon SageMaker Distribution image version 2.0 or newer.

These automations can greatly decrease overhead related to ML project setup, facilitate technical consistency, and save costs related to running idle instances. SageMaker Studio lifecycle configurations can be deployed on two different levels: on the domain level (all users in a domain are affected) or on the user level (only specific users are affected).
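For orientation, the following sketch shows what the two attachment levels might look like when calling the SageMaker API directly with boto3, rather than through the AWS CDK custom resources used in this solution. The script content, package list, and function names are hypothetical, and the caller is assumed to supply a boto3 `sagemaker` client.

```python
import base64
from typing import Any

# Hypothetical lifecycle script; the packages are examples only.
INSTALL_PACKAGES_SCRIPT = """#!/bin/bash
set -eux
pip install --upgrade pandas scikit-learn
"""


def encode_script(script: str) -> str:
    """Lifecycle configuration content is passed as a base64-encoded string."""
    return base64.b64encode(script.encode("utf-8")).decode("utf-8")


def register_lifecycle_config(
    sagemaker_client: Any, name: str, script: str, app_type: str = "JupyterLab"
) -> str:
    """Create the lifecycle configuration and return its ARN."""
    response = sagemaker_client.create_studio_lifecycle_config(
        StudioLifecycleConfigName=name,
        StudioLifecycleConfigContent=encode_script(script),
        StudioLifecycleConfigAppType=app_type,
    )
    return response["StudioLifecycleConfigArn"]


def attach_to_domain(sagemaker_client: Any, domain_id: str, config_arn: str) -> None:
    """Domain level: the configuration is made available to all users in the domain."""
    sagemaker_client.update_domain(
        DomainId=domain_id,
        DefaultUserSettings={"JupyterLabAppSettings": {"LifecycleConfigArns": [config_arn]}},
    )


def attach_to_user(
    sagemaker_client: Any, domain_id: str, user_profile_name: str, config_arn: str
) -> None:
    """User level: only the named user profile is affected."""
    sagemaker_client.update_user_profile(
        DomainId=domain_id,
        UserProfileName=user_profile_name,
        UserSettings={"JupyterLabAppSettings": {"LifecycleConfigArns": [config_arn]}},
    )
```

The only difference between the two levels in this sketch is which settings object receives the ARN: the domain's default user settings or one user profile's settings.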

In this post, we demonstrate how you can configure custom lifecycle configurations for SageMaker Studio to manage your own ML environments efficiently at scale.

Solution overview

The solution constitutes a best-practice Amazon SageMaker domain setup with a configurable list of domain user profiles and a shared SageMaker Studio space using the AWS Cloud Development Kit (AWS CDK). The AWS CDK is a framework for defining cloud infrastructure as code.

In addition, we demonstrate how to implement two different use cases of SageMaker Studio lifecycle configurations: 1) automatic installation of Python packages and 2) automatic shutdown of idle kernels. Both are deployed and managed with AWS CDK custom resources. These are powerful, low-level, and highly customizable AWS CDK constructs that can be used to manage the behavior of resources at creation, update, and deletion events. We use Python as the main language for our AWS CDK application, but the code can be easily translated to other AWS CDK supported languages. For more information, refer to Work with the AWS CDK library.

The following architecture diagram captures the main infrastructure that is deployed by the AWS CDK, typically carried out by a DevOps engineer. The domain administrator defines the configuration of the Studio domain environment, which also includes the selection of Studio lifecycle configurations to include in the deployment of the infrastructure. After the infrastructure is provisioned, data scientists can access the SageMaker Studio IDE through their domain user profiles in the SageMaker console.

After the data scientists access the IDE, they can select from a variety of available applications, including JupyterLab and Code Editor, and run the provisioned space. In this solution, a JupyterLab space has been included in the infrastructure stack. Upon opening JupyterLab, data scientists can immediately start tackling their development work, which includes retrieving or dumping data on Amazon Simple Storage Service (Amazon S3), developing ML models, and pushing changes to their code repository. If multiple data scientists are working on the same project, they can access the shared Studio spaces using their domain user profiles to foster collaboration. The main Python libraries are already installed by the Studio lifecycle configuration, saving time for value-generating tasks. As the data scientists complete their daily work, their spaces will be automatically shut down by the Studio lifecycle configuration.

High-level architecture diagram of this solution, which includes a SageMaker Studio domain, user profiles, and two Studio lifecycle configurations.

Prerequisites

To get started, make sure you fulfill the following prerequisites:

The AWS Command Line Interface (AWS CLI) installed.
The AWS CDK installed. For more information, refer to Getting started with the AWS CDK and Working with the AWS CDK in Python.
Python 3.9 or higher installed.

Clone the GitHub repository

First, clone this GitHub repository.

Upon cloning the repository, you can observe a classic AWS CDK project setup, where app.py is the CDK entry point script that deploys two stacks in sequence. The first stack (NetworkingStack) deploys the networking infrastructure and the second stack (SageMakerStudioStack) deploys the domain, user profiles and spaces. The application logic is covered by AWS Lambda functions which are found under the source directory.

The AWS CDK stacks

In the following subsections we elaborate on the provisioned resources for each of the two CDK stacks.

Virtual private cloud (VPC) setup for the NetworkingStack

The NetworkingStack deploys all the necessary networking resources and builds the foundation of the application. This includes a VPC with a public and a private subnet and a NAT gateway to enable connection from instances in the private subnet to AWS services (for example, to Amazon SageMaker). The SageMaker Studio domain is deployed into the VPC’s private subnet, shielded from direct internet access for enhanced security. The stack also includes security groups to control traffic within the VPC and a custom resource to delete security groups when destroying the infrastructure via CDK. We elaborate on custom resources in the subsection CustomResource class.

SageMaker Studio domain and user profiles for SageMakerStudioStack

The SageMakerStudioStack is deployed on top of the NetworkingStack and captures project-specific resources. This includes the domain user profiles and the name of the workspace. By default, it creates one workspace called “project1” with three users called “user1”, “user2”, and “user3”. The SageMakerStudioStack is instantiated, as shown in the following code example.

SagemakerStudioStack(
  env=env,
  scope=app,
  construct_id="SageMakerStudioStack",
  domain_name="sagemaker-domain",
  vpc_id=networking_stack.vpc_id,
  subnet_ids=networking_stack.subnet_ids,
  security_group_id=networking_stack.security_group_id,
  workspace_id="project1",
  user_ids=[
    "user1",
    "user2",
    "user3",
  ],
)

You can adjust the names according to your own requirements and even deploy multiple SageMaker Studio domains by instantiating multiple objects of the SageMakerStudioStack class in your CDK app.py script.

Apply the lifecycle configurations

The application of lifecycle configurations in this solution relies on CDK custom resources, which are powerful constructs that allow you to deploy and manage highly bespoke infrastructure components to fit your specific needs. To facilitate the usage of these components, this solution comprises a general CustomResource class that is inherited by the following five CustomResource subclasses:

InstallPackagesCustomresource subclass: Installs the required packages automatically when launching JupyterLab within SageMaker Studio using lifecycle configurations.
ShutdownIdleKernelsCustomResource subclass: Shuts down idle kernels after the user-specified time window (default 1 hour) using lifecycle configurations.
EFSCustomResource subclass: Deletes the Elastic File System (EFS) of SageMaker Studio when destroying the infrastructure.
StudioAppCustomResource subclass: Deletes the JupyterLab application for each user profile when destroying the infrastructure.
VPCCustomResource subclass: Deletes the security groups when destroying the infrastructure.

Note that only the first two subclasses in the list are used for lifecycle configurations; the other subclasses are not related to lifecycle configurations but serve other purposes. This code structure allows you to easily define your own custom resources following the same pattern. In the following subsections, we dive deeper into how custom resources work and elaborate on a specific example.
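To give a feel for that pattern, the following is a minimal sketch of the two Lambda handlers a custom resource of this kind might ship with, written against the event contract of the AWS CDK custom resource provider framework. It is not taken from the repository: the physical ID scheme and the `name` property are hypothetical, and a real handler would call the relevant AWS service in each branch.

```python
from typing import Any, Dict


def on_event_handler(event: Dict[str, Any], context: Any) -> Dict[str, Any]:
    """Start the work for a Create, Update, or Delete request."""
    request_type = event["RequestType"]
    properties = event.get("ResourceProperties", {})

    if request_type == "Create":
        # Hypothetical ID scheme: derive the physical ID from a "name" property.
        return {"PhysicalResourceId": f"custom-{properties.get('name', 'resource')}"}
    if request_type in ("Update", "Delete"):
        # Keep the existing ID so CloudFormation treats this as the same resource.
        return {"PhysicalResourceId": event["PhysicalResourceId"]}
    raise ValueError(f"Unsupported request type: {request_type}")


def is_complete_handler(event: Dict[str, Any], context: Any) -> Dict[str, bool]:
    """Report whether the operation started by on_event_handler has finished."""
    # A real handler would poll the underlying service here; this sketch
    # assumes the work finished synchronously.
    return {"IsComplete": True}
```

Returning `{"IsComplete": False}` from the second handler would cause the provider framework to call it again later, until the operation finishes or the provider's total timeout is reached.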

The CustomResource class

The CDK CustomResource class is composed of three key elements: a Lambda function that contains the logic for the Create, Update, and Delete cycles; a Provider that manages the creation of the Lambda function as well as its IAM role; and the custom resource itself, which references the Provider and carries the properties that are passed to the Lambda function. The class definition is illustrated below and can be found in the repository under stacks/sagemaker/constructs/custom_resources/CustomResource.py.

from aws_cdk import (
  aws_iam as iam,
  aws_lambda as lambda_,
  aws_logs as logs,
)
import aws_cdk as cdk
from aws_cdk.custom_resources import Provider
from constructs import Construct
import os
from typing import Dict

class CustomResource(Construct):
    def __init__(
        self,
        scope: Construct,
        construct_id: str,
        properties: Dict,
        lambda_file_name: str,
        iam_policy: iam.PolicyStatement,
        **kwargs,
    ) -> None:
        super().__init__(scope, construct_id, **kwargs)

        on_event_lambda_fn = lambda_.Function(
            self,
            "EventLambda",
            runtime=lambda_.Runtime.PYTHON_3_12,
            handler="index.on_event_handler",
            code=lambda_.Code.from_asset(
                os.path.join(os.getcwd(), "src", "lambda", lambda_file_name)
            ),
            initial_policy=[iam_policy],
            timeout=cdk.Duration.minutes(3),
        )
        is_complete_lambda_fn = lambda_.Function(
            self,
            "CompleteLambda",
            runtime=lambda_.Runtime.PYTHON_3_12,
            handler="index.is_complete_handler",
            code=lambda_.Code.from_asset(
                os.path.join(os.getcwd(), "src", "lambda", lambda_file_name)
            ),
            initial_policy=[iam_policy],
            timeout=cdk.Duration.minutes(3),
        )

        provider = Provider(
            self,
            "Provider",
            on_event_handler=on_event_lambda_fn,
            is_complete_handler=is_complete_lambda_fn,
            total_timeout=cdk.Duration.minutes(10),
            log_retention=logs.RetentionDays.ONE_DAY,
        )

        cdk.CustomResource(
            self,
            "CustomResource",
            service_token=provider.service_token,
            properties={
                **properties,
                "on_event_lambda_version": on_event_lambda_fn.current_version.version,
                "is_complete_lambda_version": is_complete_lambda_fn.current_version.version,
            },
        )

The InstallPackagesCustomResource subclass

This subclass inherits from the CustomResource to deploy the lifecycle configurations for SageMaker Studio to automatically install Python packages within JupyterLab environments. The lifecycle configuration is defined on the domain level to cover all users at once. The subclass definition is illustrated below and can be found in the repository under stacks/sagemaker/constructs/custom_resources/InstallPackagesCustomResource.py.

from aws_cdk import (
    aws_iam as iam,
)
from constructs import Construct
from stacks.sagemaker.constructs.custom_resources import CustomResource


class InstallPackagesCustomResource(CustomResource):
    def __init__(
        self,
        scope: Construct,
        construct_id: str,
        domain_id: str,
    ) -> None:
        super().__init__(
            scope,
            construct_id,
            properties={
                "domain_id": domain_id,
                "package_lifecycle_config": f"{domain_id}-package-lifecycle-config",
            },
            lambda_file_name="lcc_install_packages_lambda",
            iam_policy=iam.PolicyStatement(
                effect=iam.Effect.ALLOW,
                actions=[
                    "sagemaker:CreateStudioLifecycleConfig",
                    "sagemaker:DeleteStudioLifecycleConfig",
                    "sagemaker:Describe*",
                    "sagemaker:List*",
                    "sagemaker:UpdateDomain",
                ],
                resources=["*"],
            ),
        )

The code for the AWS Lambda function used for the custom resources is stored in the repository under src/lambda/lcc_install_packages_lambda/index.py. During the Create event, the Lambda function uses the Boto3 client method create_studio_lifecycle_config to create the lifecycle configuration. In a subsequent step, it uses the update_domain method to update the configuration of the domain and attach the created lifecycle configuration. During the Update event, the lifecycle configuration is deleted and recreated, because lifecycle configurations can’t be modified in place after they’re provisioned. During the Delete event, the delete_studio_lifecycle_config method is called to remove the lifecycle configuration. The lifecycle configuration itself is a shell script that is executed once deployed into the domain. As an example, the content of the install packages script is displayed below.

#!/bin/bash
set -eux

# Packages to install
pip install --upgrade darts pip-install-test

In this example, two packages are automatically installed for every new kernel instance provisioned by a Studio user: darts and pip-install-test. You can modify and extend this list of packages to fit your own requirements.

The source code for the idle kernel shutdown lifecycle configuration follows the same design principle and is stored in the repository under src/lambda/lcc_shutdown_idle_kernels_lambda/index.py. The main difference between the two Studio lifecycle configurations is the content of the bash scripts, which in this case was referenced from sagemaker-studio-lifecycle-config-examples.
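The Create, Update, and Delete handling described above can be sketched roughly as follows. This is a minimal illustration and not the code from the repository: the helper names (_get_client, _create, _delete), the optional client argument used for stubbing, and the inline SCRIPT constant are assumptions made for this example. Detaching the configuration from the domain before deletion and error handling are omitted.

```python
import base64

# Illustrative script content only; the real script lives in the repository.
SCRIPT = "#!/bin/bash\nset -eux\npip install --upgrade darts pip-install-test\n"


def _get_client(client=None):
    """Return the injected client, or lazily create a real SageMaker client."""
    if client is not None:
        return client
    import boto3  # imported lazily so the sketch can be exercised with a stub

    return boto3.client("sagemaker")


def _create(sm, props):
    """Create the lifecycle configuration and attach it to the domain."""
    name = props["package_lifecycle_config"]
    response = sm.create_studio_lifecycle_config(
        StudioLifecycleConfigName=name,
        # The API expects the script as a base64-encoded string.
        StudioLifecycleConfigContent=base64.b64encode(SCRIPT.encode()).decode(),
        StudioLifecycleConfigAppType="JupyterLab",
    )
    sm.update_domain(
        DomainId=props["domain_id"],
        DefaultUserSettings={
            "JupyterLabAppSettings": {
                "LifecycleConfigArns": [response["StudioLifecycleConfigArn"]]
            }
        },
    )
    return name


def _delete(sm, props):
    """Remove the lifecycle configuration by name."""
    sm.delete_studio_lifecycle_config(
        StudioLifecycleConfigName=props["package_lifecycle_config"]
    )


def on_event_handler(event, context, client=None):
    """Dispatch on the CloudFormation request type."""
    sm = _get_client(client)
    props = event["ResourceProperties"]
    request_type = event["RequestType"]
    if request_type == "Create":
        return {"PhysicalResourceId": _create(sm, props)}
    if request_type == "Update":
        # Lifecycle configurations can't be modified in place: delete, then recreate.
        _delete(sm, props)
        return {"PhysicalResourceId": _create(sm, props)}
    if request_type == "Delete":
        _delete(sm, props)
        return {"PhysicalResourceId": event["PhysicalResourceId"]}
    raise ValueError(f"Unsupported request type: {request_type}")


def is_complete_handler(event, context):
    """All calls above are synchronous, so the operation is complete immediately."""
    return {"IsComplete": True}
```

The handler names match the handler settings in the CustomResource class (index.on_event_handler and index.is_complete_handler), and the property keys match those passed by InstallPackagesCustomResource.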

Deploy the AWS CDK stacks

To deploy the AWS CDK stacks, run the following commands in the location where you cloned the repository. Depending on your path configurations, the command may be python instead of python3.

Create a virtual environment:
1. For macOS/Linux, use python3 -m venv .cdk-venv
2. For Windows, use python3 -m venv .cdk-venv
Activate the virtual environment:
1. For macOS/Linux, use source .cdk-venv/bin/activate
2. For Windows, use .cdk-venv/Scripts/activate.bat
3. For PowerShell, use .cdk-venv/Scripts/activate.ps1
Install the required dependencies:
1. pip install -r requirements.txt
2. pip install -r requirements-dev.txt
(Optional) Synthesize the AWS CloudFormation template for this application: cdk synth
Deploy the solution with the following commands:
1. aws configure
2. cdk bootstrap
3. cdk deploy --all (alternatively, you can deploy the two stacks individually using cdk deploy <StackName>)

When the stacks are successfully deployed, you’ll be able to view the deployed stacks in the AWS CloudFormation console, as shown below.

You’ll also be able to view the Studio domain and the Studio lifecycle configurations on the SageMaker console, as shown in the following screenshots.

Choose one of the lifecycle configurations to view the shell code and its configuration details, as follows.

To make sure your lifecycle configuration is included in your space, launch SageMaker Studio from your user profile, navigate to JupyterLab, and choose the provisioned space. You can then select a lifecycle configuration that is associated with your domain or user profile and activate it, as shown below.

After you run the space and open JupyterLab, you can validate the functionality. In the example shown in the following screenshot, the preinstalled package can be imported directly.

Optional: How to attach Studio lifecycle configurations manually

If you want to manually attach a lifecycle configuration to an already existing domain, perform the following steps:

On the SageMaker console, choose Domains in the navigation pane.
Choose the domain name you’re using and the current user profile, then choose Edit.
Select the lifecycle configuration you want to use and choose Attach, as shown in the following screenshot.

From here, you can also set it as default.

Clean up

Complete the steps in this section to remove all your provisioned resources from your environment.

Delete the AWS CDK stacks

When you’re done with the resources you created, you can destroy your AWS CDK stack by running the following command in the location where you cloned the repository:

cdk destroy --all

When asked to confirm the deletion of the stack, enter yes.

You can also delete the stack on the AWS CloudFormation console with the following steps:

On the AWS CloudFormation console, choose Stacks in the navigation pane.
Choose the stack that you want to delete.
In the stack details pane, choose Delete.
Choose Delete stack when prompted.

User profile applications can sometimes take several minutes to delete, which can interfere with the deletion of the stack. If you run into any errors during stack deletion, you may have to manually delete the user profile apps and retry.

Conclusion

In this post, we described how customers can deploy a SageMaker Studio domain with automated lifecycle configurations to control their SageMaker resources. Lifecycle configurations are based on custom shell scripts to perform automated tasks and can be deployed with AWS CDK Custom Resources.

Whether you are already using SageMaker domains in your organization or starting out on your SageMaker adoption, effectively managing lifecycles on your SageMaker Studio domains will greatly improve the productivity of your data science team and alleviate administrative work. By implementing the described steps, you can streamline your workflows, reduce operational overhead, and empower your team to focus on driving insights and innovation.

About the Authors

Gabriel Rodriguez Garcia is a Machine Learning Engineer at AWS Professional Services in Zurich. In his current role, he has helped customers achieve their business goals on a variety of ML use cases, ranging from setting up MLOps inference pipelines to developing generative AI applications.

Gabriel Zylka is a Machine Learning Engineer within AWS Professional Services. He works closely with customers to accelerate their cloud adoption journey. Specializing in the MLOps domain, he focuses on productionizing ML workloads by automating end-to-end ML lifecycles and helping to achieve desired business outcomes.

Krithi Balasubramaniyan is a Principal Consultant at AWS. He enables global enterprise customers in their digital transformation journeys and helps architect cloud native solutions.

Cory Hairston is a Software Engineer with AWS Bedrock. He currently works on providing reusable software solutions.

Gouri Pandeshwar is an Engineering Manager with AWS Bedrock. He and his team of engineers are working to build reusable solutions and frameworks that help accelerate adoption of AWS AI/ML services for customers’ business use cases.

Build a read-through semantic cache with Amazon OpenSearch Serverless and Amazon Bedrock

November 26, 2024

by Kamran Razi Amazon AWS

In the field of generative AI, latency and cost pose significant challenges. The commonly used large language models (LLMs) often process text sequentially, predicting one token at a time in an autoregressive manner. This approach can introduce delays, resulting in less-than-ideal user experiences. Additionally, the growing demand for AI-powered applications has led to a high volume of calls to these LLMs, potentially exceeding budget constraints and creating financial pressures for organizations.

This post presents a strategy for optimizing LLM-based applications. Given the increasing need for efficient and cost-effective AI solutions, we present a serverless read-through caching blueprint that uses repeated data patterns. With this cache, developers can effectively save and access similar prompts, thereby enhancing their systems’ efficiency and response times. The proposed cache solution uses Amazon OpenSearch Serverless and Amazon Bedrock, a fully managed service that offers a choice of high-performing foundation models (FMs) from leading AI companies like AI21 Labs, Anthropic, Cohere, Meta, Stability AI, and Amazon through a single API, along with a broad set of capabilities to build generative AI applications with security, privacy, and responsible AI.

Solution overview

The cache in this solution acts as a buffer, intercepting prompts—requests to the LLM expressed in natural language—before they reach the main model. The semantic cache functions as a memory bank storing previously encountered similar prompts. It’s designed for efficiency and swiftly matching a user’s prompt with its closest semantic counterparts. However, in a practical cache system, it’s crucial to refine the definition of similarity. This refinement is necessary to strike a balance between two key factors: increasing cache hits and reducing cache collisions. A cache hit occurs when a requested prompt is found in the cache, meaning the system doesn’t need to send it to the LLM for a new generation. Conversely, a cache collision happens when multiple prompts are mapped to the same cache location due to similarities in their semantic features. To better understand these concepts, let’s examine a couple of examples.

Imagine a concierge AI assistant powered by an LLM, specifically designed for a travel company. It excels at providing personalized responses drawn from a pool of past interactions, making sure that each reply is relevant and tailored to travelers’ needs. Here, we might prioritize high recall, meaning we’d rather have more cached responses even if it occasionally leads to overlapping prompts.

Now, consider a different scenario: an AI assistant, designed to assist back desk agents at this travel company, uses an LLM to translate natural language queries into SQL commands. This enables the agents to generate reports from invoices and other financial data, applying filters such as dates and total amounts to streamline report creation. Precision is key here. We need every user request mapped accurately to its corresponding SQL command, leaving no room for error. In this case, we’d opt for a tighter similarity threshold, making sure that cache collisions are kept to an absolute minimum.

In essence, the read-through semantic cache isn’t just a go-between; it’s a strategic tool for optimizing system performance based on the specific demands of different applications. Whether it’s prioritizing recall for a chatbot or precision for a query parser, the adjustable similarity feature makes sure that the cache operates at peak efficiency, enhancing the overall user experience.

At its core, a semantic cache system operates as a database storing numerical vector embeddings of text queries. Before being stored, each natural language query is transformed into a corresponding embedding vector. With Amazon Bedrock, you have the flexibility to select from various managed embedding models, including Amazon’s proprietary Amazon Titan embedding model or third-party alternatives like Cohere. These embedding models are specifically designed to map similar natural language queries to vectors that lie close together (for example, as measured by Euclidean distance), so that proximity in the vector space reflects semantic similarity. With OpenSearch Serverless, you can establish a vector database suitable for setting up a robust cache system.
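To make the role of the similarity threshold concrete, the following sketch compares a query vector against two cached vectors under a tight and a loose threshold. The two-dimensional vectors are made-up values for illustration (real embeddings have hundreds or thousands of dimensions), and cosine similarity is used here as an assumption; the deployed solution may score similarity differently.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def euclidean_distance(a, b):
    """Straight-line distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def matches(query_vec, cached, threshold):
    """Names of cached entries whose similarity to the query meets the threshold."""
    return [
        name
        for name, vec in cached.items()
        if cosine_similarity(query_vec, vec) >= threshold
    ]


# Made-up unit vectors standing in for embeddings of cached prompts.
cached = {
    "close paraphrase": [0.8, 0.6],  # cosine similarity to the query is about 0.8
    "loosely related": [0.6, 0.8],   # cosine similarity to the query is about 0.6
}
query = [1.0, 0.0]

strict = matches(query, cached, threshold=0.75)  # favors precision
loose = matches(query, cached, threshold=0.5)    # favors recall
print(strict)
print(loose)
```

With the tighter threshold only the close paraphrase qualifies as a hit, whereas the looser threshold also admits the loosely related prompt, which raises the hit rate at the risk of more collisions. For unit-length vectors, the two measures are interchangeable because the squared Euclidean distance equals 2 minus twice the cosine similarity.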

By harnessing these technologies, developers can build a semantic cache that efficiently stores and retrieves semantically related queries, improving the performance and responsiveness of their systems. In this post, we demonstrate how to use various AWS technologies to establish a serverless semantic cache system. This setup allows for quick lookups to retrieve available responses, bypassing the time-consuming LLM calls. The result is not only faster response times, but also a notable reduction in cost.

The solution presented in this post can be deployed through an AWS CloudFormation template. It uses the following AWS services:

An Amazon Bedrock managed text generation model, for example Anthropic’s Claude
An Amazon Bedrock managed text embedding model, for example Amazon Titan Text Embeddings Model V2
An OpenSearch Serverless vector search collection
AWS Lambda as the cache handler

The following architecture shows a serverless read-through semantic cache pattern you can use to integrate into an LLM-based solution.

In this architecture, examples of a cache miss and a cache hit are shown in red and green, respectively. In the cache miss scenario (red path), the client sends a query, which is then semantically compared to previously seen queries. Because no cached query meets the similarity threshold, the Lambda function, acting as the cache manager, prompts an LLM for a new generation. The new generation is then sent to the client and used to update the vector database. In the case of a cache hit (green path), the previously generated response to the semantically similar query is sent to the client immediately.

The following table summarizes the response latency for the short query “Who was the first US president?”, tested on Anthropic’s Claude V2.
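The read-through flow described above can be sketched in a few lines. This is an in-memory illustration under stated assumptions, not the deployed Lambda function: the SemanticCache class, the embed and generate callables, and the linear scan over stored entries are all hypothetical stand-ins for the Amazon Bedrock embedding model, the LLM call, and the OpenSearch Serverless vector search.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class SemanticCache:
    """Minimal read-through cache keyed by embedding similarity."""

    def __init__(self, embed, generate, threshold):
        self.embed = embed          # prompt -> vector (stand-in for an embedding model)
        self.generate = generate    # prompt -> text (stand-in for the LLM)
        self.threshold = threshold  # minimum similarity that counts as a hit
        self.entries = []           # list of (vector, response) pairs

    def query(self, prompt):
        """Return (response, was_cache_hit)."""
        vector = self.embed(prompt)
        best_score, best_response = -1.0, None
        for cached_vector, cached_response in self.entries:
            score = cosine_similarity(vector, cached_vector)
            if score > best_score:
                best_score, best_response = score, cached_response
        if best_response is not None and best_score >= self.threshold:
            return best_response, True  # green path: serve the cached response
        # Red path: call the LLM, then store the result for future lookups.
        response = self.generate(prompt)
        self.entries.append((vector, response))
        return response, False


# Toy usage with made-up vectors and a fake generator.
toy_vectors = {"q1": [1.0, 0.0], "q1 rephrased": [0.8, 0.6], "unrelated": [0.0, 1.0]}
cache = SemanticCache(
    embed=toy_vectors.__getitem__,
    generate=lambda prompt: f"answer to {prompt}",
    threshold=0.75,
)
print(cache.query("q1"))
print(cache.query("q1 rephrased"))
print(cache.query("unrelated"))
```

In this toy run, the first and third prompts reach the generator, while the rephrased prompt is served from the cache because its similarity to the first prompt exceeds the threshold.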

Query Under Test	Without Cache Hit	With Cache Hit
Who was the first US president?	2 seconds	Under 0.5 seconds

Prerequisites

Amazon Bedrock users need to request access to FMs before they are available for use. This is a one-time action and takes less than a minute. For this solution, you’ll need one of the embedding models such as Cohere Embed-English on Amazon Bedrock or Amazon Titan Text Embedding. For text generation, you can choose from Anthropic’s Claude models. For a complete list of text generation models, refer to Amazon Bedrock.

Deploy the solution

This solution entails setting up a Lambda layer that includes dependencies to interact with services like OpenSearch Serverless and Amazon Bedrock. A pre-built layer is compiled and added to a public Amazon Simple Storage Service (Amazon S3) prefix, available in the provided CloudFormation template. You have the option to build your own layer with other libraries; for more information, refer to the following GitHub repo.

You can deploy this solution with the required roles by using the provided template:

This solution uses the following input parameters:

Embedding model
LLM
Similarity threshold

After a successful deployment (which takes about 2 minutes), you can get your Lambda function name and start experimenting. You can find the Lambda function name on the Outputs tab of your CloudFormation stack, as shown in the following screenshot.

You can invoke the Lambda function from the Lambda console or through the AWS Command Line Interface (AWS CLI):

aws lambda invoke \
--function-name YOUR_LAMBDA_FUNCTION_NAME \
--invocation-type RequestResponse \
--cli-binary-format raw-in-base64-out \
--payload '{"prompt": "your question here"}' \
output.txt

Your payload can have other options to control cache and generation as follows:

{
"max_tokens_to_sample": int, // Optional. Controls the length of LLM generation in case of a cache miss. Default is 1000.
"reset": bool // Optional. To reset cache
}
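If you prefer to call the function from Python, the same invocation can be sketched with Boto3. The helper names build_payload and invoke_cache, and the optional client argument used for stubbing, are assumptions for this example. The structure of the returned document depends on the deployed function and is not assumed here beyond being JSON.

```python
import json


def build_payload(prompt, max_tokens_to_sample=None, reset=None):
    """Assemble the request payload; optional keys are added only when set."""
    payload = {"prompt": prompt}
    if max_tokens_to_sample is not None:
        payload["max_tokens_to_sample"] = int(max_tokens_to_sample)
    if reset is not None:
        payload["reset"] = bool(reset)
    return payload


def invoke_cache(function_name, payload, client=None):
    """Invoke the cache Lambda function synchronously and parse its JSON result."""
    if client is None:
        import boto3  # imported lazily so the sketch can be exercised with a stub

        client = boto3.client("lambda")
    response = client.invoke(
        FunctionName=function_name,
        InvocationType="RequestResponse",
        Payload=json.dumps(payload).encode("utf-8"),
    )
    return json.loads(response["Payload"].read())


# Example (requires AWS credentials and the deployed function name):
# result = invoke_cache(
#     "YOUR_LAMBDA_FUNCTION_NAME",
#     build_payload("your question here", max_tokens_to_sample=500),
# )
```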

Test the solution

To test your cache using the Lambda console, open the Functions page. Navigate to the function you retrieved from the output of your stack. Set up a test event as illustrated in the following screenshot.

Upon a successful run, the result appears similar to the following screenshot.

Let’s try another experiment to take a closer look into the functioning of our cache. For this experiment, we use the Cohere embedding model on Amazon Bedrock and a similarity threshold of 0.75. We asked the following question, and it took about 15 seconds for the model to generate the complete response with Anthropic’s Claude V2:

Prompt: What are the potential long-term impacts of artificial intelligence and blockchain technology on the global financial industry?

Generation: Here are some potential long-term impacts of AI and blockchain on the global financial industry:

Automation of routine tasks – AI and machine learning can automate many routine financial tasks like loan processing, fraud detection, client onboarding etc. This could lead to lower operational costs and improved efficiency for financial firms.

Better data analytics and insights – AI can analyze large volumes of financial data more efficiently than humans. This can lead to better insights, personalized recommendations for clients and improved risk management by firms.

Improved trading and investment platforms – AI algorithms can potentially execute trades faster and more optimally than human traders. They can also enable robo-advisors that provide automated investment advice and portfolio management.

Enhanced security and fraud detection – Machine learning models can detect patterns in data to identify potential fraud, money laundering or cybersecurity threats. Blockchain provides a decentralized, encrypted record of transactions that is harder to tamper with.

When we tried these prompt variations, the cache system successfully returned the previously generated response (under 0.5 seconds):

Similar Prompt 1: What are the potential future implications of AI and blockchain on the worldwide financial sector?

Similar Prompt 2: Discuss the possible effects of artificial intelligence and blockchain advancements on the global finance landscape over the coming decades.

Let’s replace the term “financial sector” with “healthcare sector”; we expect no cache collision and a new generation:

Prompt: Could you outline the future implications of AI and blockchain on the worldwide healthcare sector?

Generation: Here are some potential long-term impacts AI could have on the healthcare sector:

Personalized medicine – AI can analyze huge datasets about genetics, lifestyle, medical history etc. to provide personalized treatment plans tailored to each individual patient’s needs and risks. This could improve outcomes and reduce waste in the system.

Early disease detection – Algorithms are getting very good at analyzing medical images and data to identify potential diseases and conditions much earlier than human doctors. This could lead to earlier treatment and prevention of serious illness.

Reduced costs – AI has the potential to automate and streamline many processes in healthcare leading to greater efficiency and lower costs. For example, AI chatbots and virtual assistants could provide some basic services at a fraction of the cost of human staff.

The following table summarizes the query latency test results without and with cache hit tested on Anthropic’s Claude V2.

Query Under Test	Without Cache Hit	With Cache Hit
Could you outline the future implications of AI and blockchain on the worldwide healthcare sector?	15 seconds	Under 0.5 seconds

In addition to reducing latency, the cache can also lower the cost of your LLM system. Typically, embedding models are more cost-efficient than generation models. For example, Amazon Titan Text Embedding V2 costs $0.00002 per 1,000 input tokens, whereas Anthropic’s Claude V2 costs $0.008 per 1,000 input tokens and $0.024 per 1,000 output tokens. Even considering the additional cost of OpenSearch Serverless, which depends on the scale of the cache data, the cache system can be cost-efficient for many use cases.
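As a rough back-of-the-envelope illustration of this comparison, the following sketch computes the expected model cost per query for a given cache hit rate, using the per-token prices quoted above. The token counts and the hit rate are hypothetical values chosen for the example, and the vector store cost is deliberately left out.

```python
# Prices quoted in this post, in USD per 1,000 tokens.
EMBED_PRICE_PER_1K_INPUT = 0.00002  # Amazon Titan Text Embedding V2
GEN_PRICE_PER_1K_INPUT = 0.008      # Anthropic's Claude V2, input
GEN_PRICE_PER_1K_OUTPUT = 0.024     # Anthropic's Claude V2, output


def generation_cost(input_tokens, output_tokens):
    """Cost of one LLM call."""
    return (
        input_tokens / 1000 * GEN_PRICE_PER_1K_INPUT
        + output_tokens / 1000 * GEN_PRICE_PER_1K_OUTPUT
    )


def embedding_cost(input_tokens):
    """Cost of embedding one prompt."""
    return input_tokens / 1000 * EMBED_PRICE_PER_1K_INPUT


def expected_cost_per_query(input_tokens, output_tokens, hit_rate):
    """Every query is embedded; only cache misses pay for generation."""
    return embedding_cost(input_tokens) + (1 - hit_rate) * generation_cost(
        input_tokens, output_tokens
    )


# Hypothetical workload: 50-token prompts, 500-token responses, 50% hit rate.
without_cache = generation_cost(50, 500)
with_cache = expected_cost_per_query(50, 500, hit_rate=0.5)
print(f"without cache: ${without_cache:.6f} per query")
print(f"with cache:    ${with_cache:.6f} per query")
```

Under these assumed numbers, the embedding cost is negligible next to the generation cost, so the model cost per query falls roughly in proportion to the hit rate.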

Clean up

After you are done experimenting with the Lambda function, you can quickly delete all the resources you used to build this semantic cache, including your OpenSearch Serverless collection and Lambda function. To do so, locate your CloudFormation stack on the AWS CloudFormation console and delete it.

Make sure that the status of your stack changes from Delete in progress to Deleted.

Conclusion

In this post, we walked you through the process of setting up a serverless read-through semantic cache. By implementing the pattern outlined here, you can reduce the latency of your LLM-based applications while simultaneously optimizing costs and improving the user experience. Our solution allows for experimentation with embedding models of varying sizes, conveniently hosted on Amazon Bedrock. Moreover, it enables fine-tuning of similarity thresholds to strike the right balance between cache hit and cache collision rates. Embrace this approach to unlock enhanced efficiency and effectiveness within your projects.

For more information, refer to the Amazon Bedrock User Guide and Amazon OpenSearch Serverless Developer Guide.

About the Authors

Kamran Razi is a Data Scientist at the Amazon Generative AI Innovation Center. With a passion for delivering cutting-edge generative AI solutions, Kamran helps customers unlock the full potential of AWS AI/ML services to solve real-world business challenges. Leveraging over a decade of experience in software development, he specializes in building AI-driven solutions, including chatbots, document processing, and retrieval-augmented generation (RAG) pipelines. Kamran holds a PhD in Electrical Engineering from Queen’s University.

Sungmin Hong is a Senior Applied Scientist at the Amazon Generative AI Innovation Center, where he helps expedite a variety of AWS customer use cases. Before joining Amazon, Sungmin was a postdoctoral research fellow at Harvard Medical School. He holds a Ph.D. in Computer Science from New York University. Outside of work, Sungmin enjoys hiking, reading, and cooking.

Yash Shah is a Science Manager in the AWS Generative AI Innovation Center. He and his team of applied scientists and machine learning engineers work on a range of machine learning use cases from healthcare, sports, automotive and manufacturing.

Anila Joshi has more than a decade of experience building AI solutions. As a Senior Manager, Applied Science at AWS Generative AI Innovation Center, Anila pioneers innovative applications of AI that push the boundaries of possibility and accelerate the adoption of AWS services with customers by helping customers ideate, identify, and implement secure generative AI solutions.

Rad AI reduces real-time inference latency by 50% using Amazon SageMaker

November 26, 2024

by Ken Kao Amazon AWS

This post is co-written with Ken Kao and Hasan Ali Demirci from Rad AI.

Rad AI has reshaped radiology reporting, developing solutions that streamline the most tedious and repetitive tasks and save radiologists’ time. Since 2018, using state-of-the-art proprietary and open source large language models (LLMs), our flagship product, Rad AI Impressions, has significantly reduced the time radiologists spend dictating reports by generating Impression sections.

The Impression section serves as the conclusion of a radiology report, including summarization, follow-up recommendations, and highlights of significant findings. It stands as the primary result for the clinician who ordered the study, influencing the subsequent course of the patient’s treatment. Given its pivotal role, accuracy and clarity in this section are paramount. Traditionally, radiologists dictated every word of the impressions section, creating it from scratch for each report. This time-consuming process led to fatigue and burnout, and involved redundant manual dictation in many studies.

The automation provided by Rad AI Impressions not only reduces burnout, but also safeguards against errors arising from manual repetition. It increases the capacity to generate reports, reducing health system turnaround times and making high-quality care available to more patients. Impressions are meticulously customized to each radiologist’s preferred language and style. Radiologists review and revise the output as they see fit, maintaining exact control over the final report, and Rad AI also helps radiologists catch and fix a wide variety of errors in their reports. This improves the overall quality of patient care.

Today, by executing abstractive summarization tasks at scale, Rad AI’s language models generate impressions for millions of radiology studies every month, assisting thousands of radiologists at more than 40% of all US health systems and 9 of the 10 largest US radiology practices. Based on years of working with customers, we estimate that our solutions save 1 hour for every 9-hour radiology shift.

Operating within the real-time radiology workflow, our product functions online around the clock, adhering to strict latency requirements. For years, Rad AI has been a reliable partner to radiology practices and health systems, consistently delivering high availability and generating complete results seamlessly in 0.5–3 seconds, with minimal latency. This efficiency empowers radiologists to achieve optimal results in their studies.

In this post, we share how Rad AI reduced real-time inference latency by 50% using Amazon SageMaker.

Challenges in deploying advanced ML models in healthcare

Rad AI, being an AI-first company, integrates machine learning (ML) models across various functions—from product development to customer success, from novel research to internal applications. AI models are ubiquitous within Rad AI, enhancing multiple facets of the organization. It might seem straightforward to integrate ML models into healthcare workflows, but the challenges are many and interconnected.

Healthcare applications make some of the usual AI complexities more challenging. Although any AI solution has to balance speed against accuracy, radiologists rely on the timeliness of our impressions to care for patients, and expect our clinical accuracy to always improve. This constant innovation requires new kinds of models and demands continually improving specialized software and hardware. As inference logic becomes more complex, composing results from multiple models (each seeing regular releases), a streamlined and reproducible process for orchestration and management is of paramount importance. Even diagnosing basic issues, at this level of complexity, requires a deliberate and methodical approach.

Rad AI’s ML organization tackles this challenge on two fronts. First, it enhances researcher productivity by providing the necessary processes and automation, positioning them to deliver high-quality models with regularity. Second, it navigates operational requirements by making strategic infrastructure choices and partnering with vendors that offer both computational resources and managed services. By enhancing both researcher productivity and operational efficiency, Rad AI creates an environment that fosters ML innovation.

To succeed in this environment, Rad AI takes advantage of the availability and consistency offered by SageMaker real-time endpoints, a fully managed AI inference service that allows seamless deployment and scaling of models independently from the applications that use them. By integrating Amazon Elastic Container Service (Amazon ECS) and SageMaker, Rad AI’s ML system forms a complex server-side architecture with numerous online components. This infrastructure enables Rad AI to navigate the complexities of real-time model deployment, so radiologists receive timely and accurate impressions.

With focused effort and strategic planning, Rad AI continues to enhance its systems and processes, ultimately improving outcomes for patients and clinicians alike.

Let’s transition to exploring solutions and architectural strategies.

Approaches to researcher productivity

To translate our strategic planning into action, we developed approaches focused on refining our processes and system architectures. By improving our deployment pipelines and enhancing collaboration between researchers and MLOps engineers, we streamlined the integration of models into our healthcare workflows. In this section, we discuss the practices that have enabled us to optimize our operations and advance our ML capabilities.

To enable researchers to work at full capacity while minimizing synchronization with MLOps engineers, we recognized the need for normalization in our deployment processes. The pipeline begins when researchers manage tags and metadata on the corresponding model artifact. This approach abstracts away the complexity beneath the surface and eliminates the usual ceremony involved in deploying models. By centralizing model registration and aligning practices across team members, we clamp the entry point for model deployment. This allows us to build additional tooling as we identify bottlenecks or areas for improvement.

Instead of frequent synchronization between MLOps and research teams, we observe practices and identify needs as they arise. Under the hood, we employ an in-house tool combined with modular, reusable infrastructure as code to automate the creation of pull requests. No one writes any code manually. The protocol between researchers and engineers is reduced to pull request reviews, eliminating the need for circulating documents or holding alignment meetings. The declarative nature of the infrastructure code, coupled with intuitive design, answers most questions that MLOps engineers would typically ask researchers—all within the file added to the repository and pull requested.

These approaches, combined with the power and streamlining offered by SageMaker, have reduced the model deployment process to a matter of minutes after a model artifact is ready. Deploying a new model to a target environment now requires minimal effort. Only when dealing with peculiar characteristics of an architecture or specific configurations—such as adjustments for tensor parallelism—do additional considerations arise. By minimizing the complexity and time involved in deployment, we enable researchers to concentrate on innovation rather than operational hurdles.

Architectural strategies

In our architectural strategies, we aimed to achieve high performance and scalability while effectively deploying ML models. The need for low latency in inference tasks—especially critical in healthcare settings where delays can impact patient care—required architectures capable of efficiently handling both GPU-bound and CPU-bound workloads. Additionally, straightforward configuration options that allow us to quickly generate benchmarks became essential. This capability enables us to swiftly evaluate different backend engines, a necessity in latency-bound environments.

In addition to process improvements, we implemented architectural strategies to address the technical aspects. As previously mentioned, real-world inference systems often combine GPU-bound and CPU-bound inference tasks, along with the need to compose results from multiple services. This complexity is typically required for an ML organization to provide product-side functionality. We use AWS Fargate to run CPU inferences and other supporting components, usually alongside a comprehensive frontend API. This setup implements a classic architecture consisting of a frontend API and backend application services. GPU inferences are served through SageMaker real-time inference endpoints. An illustration of this architecture is provided in the following diagram.

We standardized on using SageMaker Large Model Inference (LMI) containers, maintained and offered from public Amazon repositories. These containers support several optimization frameworks and provide simple configuration delivery options. This setup is straightforward for researchers to interpret and spares them the unnecessary hassle of dealing with dependencies and compatibility issues among various ML libraries and managing the underlying container layers.

Diving deeper into our architecture, we consider one of the deployment strategies used in our online inference system. On a single instance, we employ a server that schedules inference tasks with DJL Serving as the model server. This approach allows us to select from and experiment with multiple backend engines, including popular frameworks such as TensorRT-LLM and vLLM. The abstractions and built-in integration with SageMaker real-time endpoints, along with support for multi-GPU inference and tensor parallelism, enable us to quickly evaluate different backends for a given task.
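To make the backend comparison concrete, the following sketch renders a `serving.properties` file for DJL Serving with a chosen backend and tensor parallel degree. The keys `option.model_id`, `option.rolling_batch`, and `option.tensor_parallel_degree` follow the LMI container documentation as we understand it; check them against the container version you use, because some versions expect additional keys. The helper name and the example model location are hypothetical.

```python
# Backends mentioned in this post, written as LMI rolling-batch values.
SUPPORTED_BACKENDS = ("vllm", "trtllm")


def render_serving_properties(model_id: str, backend: str, tensor_parallel_degree: int) -> str:
    """Return the text of a serving.properties file for one benchmark run."""
    if backend not in SUPPORTED_BACKENDS:
        raise ValueError(f"unsupported backend: {backend}")
    if tensor_parallel_degree < 1:
        raise ValueError("tensor_parallel_degree must be at least 1")
    properties = {
        "option.model_id": model_id,
        "option.rolling_batch": backend,
        "option.tensor_parallel_degree": str(tensor_parallel_degree),
    }
    return "\n".join(f"{key}={value}" for key, value in properties.items()) + "\n"


if __name__ == "__main__":
    # Hypothetical model location; render one configuration per backend to benchmark.
    for backend in SUPPORTED_BACKENDS:
        print(render_serving_properties("s3://example-bucket/model/", backend, 4))
```

Keeping the backend choice in one small text file is what makes it cheap to benchmark several engines against the same model artifact.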

As Rad AI has matured, our architectural solutions have evolved. Initially, we relied on custom components, managing our own container images and running NVIDIA Triton Server directly on instances provided by Amazon ECS. However, by migrating to SageMaker managed hosting and using instance types ranging from 1–8 GPUs of various kinds, we implemented the architectural strategies discussed earlier. Removing the undifferentiated heavy lifting involved in building and optimizing model hosting infrastructure reduced our total cost of ownership by 50%. Optimizing the instance types and container parameters decreased latency by the same margin.

When deploying models with SageMaker Inference, consider the following key best practices:

It’s important to build a robust model deployment pipeline that automates the process of registering, testing, and promoting models to production. This can involve integrating SageMaker with continuous integration and delivery (CI/CD) tools to streamline the model release process.
In terms of infrastructure choices, it’s important to right-size your SageMaker endpoints to match the expected traffic and model complexity, using features like auto scaling to dynamically adjust capacity.
Performance optimization techniques like model optimization and inference container parameter tuning can help improve latency and reduce costs.
Comprehensive monitoring and logging of model performance in production is critical to quickly identify and address any issues that arise.
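As an illustration of the auto scaling practice, the following sketch builds the two request payloads that Application Auto Scaling expects for a SageMaker endpoint variant: a scalable target and a target tracking policy on invocations per instance. The endpoint name, variant name, capacity limits, and target value are placeholders, and the helper name is hypothetical.

```python
def build_scaling_requests(endpoint_name: str, variant_name: str,
                           min_instances: int, max_instances: int,
                           invocations_per_instance: float):
    """Return (scalable_target, scaling_policy) keyword arguments."""
    resource_id = f"endpoint/{endpoint_name}/variant/{variant_name}"
    dimension = "sagemaker:variant:DesiredInstanceCount"
    scalable_target = {
        "ServiceNamespace": "sagemaker",
        "ResourceId": resource_id,
        "ScalableDimension": dimension,
        "MinCapacity": min_instances,
        "MaxCapacity": max_instances,
    }
    scaling_policy = {
        "PolicyName": f"{endpoint_name}-invocations-target",
        "ServiceNamespace": "sagemaker",
        "ResourceId": resource_id,
        "ScalableDimension": dimension,
        "PolicyType": "TargetTrackingScaling",
        "TargetTrackingScalingPolicyConfiguration": {
            # Scale so that each instance handles roughly this many invocations per minute.
            "TargetValue": float(invocations_per_instance),
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
            },
        },
    }
    return scalable_target, scaling_policy
```

With boto3, the payloads would be passed to the `application-autoscaling` client as `register_scalable_target(**scalable_target)` followed by `put_scaling_policy(**scaling_policy)`. The right target value depends on load tests of your own model.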

Conclusion

One of the enduring challenges in healthcare is enhancing patient care on a global scale. Rad AI is committed to meeting this challenge by transforming the field of radiology. By refining our processes and implementing strategic architectural solutions, we have enhanced both researcher productivity and operational efficiency.

Our deliberate approach to model deployment and infrastructure management has streamlined workflows and significantly reduced costs and latency. Every additional second saved not only increases bandwidth and reduces fatigue for the radiologists we serve, but also improves patient outcomes and benefits healthcare organizations in a variety of ways. Our inference systems are instrumental in realizing these objectives, using SageMaker’s scalability and flexibility to integrate ML models seamlessly into healthcare settings. As we continue to evolve, our commitment to innovation and excellence positions Rad AI at the forefront of AI-driven healthcare solutions.

Share your thoughts and questions in the comments.

References

Rad AI Impressions
Deep Java Library: Large Model Inference
Weights & Biases. (2023, May 2). Continuous Deployment with Weights & Biases Automations YouTube video

About the authors

Ken Kao is an executive leader with 12+ years leading engineering and product across early, mid-stage startups and public companies. He is currently the VP of Engineering at Rad AI, pushing the frontier of applying Gen AI to healthcare to help make physicians more efficient and improve patient outcomes. Prior to that, he was at Meta driving VR device performance, emulation, and development tooling and infrastructure. He has also previously held engineering leadership roles at Airbnb, Flatiron Health, and Palantir. Ken holds M.S. and B.S. degrees in Electrical Engineering from Stanford University.

Hasan Ali Demirci is a Staff Engineer at Rad AI, specializing in software and infrastructure for machine learning. Since joining as an early engineer hire in 2019, he has steadily worked on the design and architecture of Rad AI’s online inference systems. He is certified as an AWS Certified Solutions Architect and holds a bachelor’s degree in mechanical engineering from Boğaziçi University in Istanbul and a graduate degree in finance from the University of California, Santa Cruz.

Karan Jain is a Senior Machine Learning Specialist at AWS, where he leads the worldwide Go-To-Market strategy for Amazon SageMaker Inference. He helps customers accelerate their generative AI and ML journey on AWS by providing guidance on deployment, cost-optimization, and GTM strategy. He has led product, marketing, and business development efforts across industries for over 10 years, and is passionate about mapping complex service features to customer solutions.

Dmitry Soldatkin is a Senior Machine Learning Solutions Architect at Amazon Web Services (AWS), helping customers design and build AI/ML solutions. Dmitry’s work covers a wide range of ML use cases, with a primary interest in generative AI, deep learning, and scaling ML across the enterprise. He has helped companies in many industries, including insurance, financial services, utilities, and telecommunications. He has a passion for continuous innovation and using data to drive business outcomes. Prior to joining AWS, Dmitry was an architect, developer, and technology leader in data analytics and machine learning in the financial services industry.

Read graphs, diagrams, tables, and scanned pages using multimodal prompts in Amazon Bedrock

November 26, 2024

by Mithil Shah Amazon AWS

Large language models (LLMs) have come a long way from being able to read only text to now being able to read and understand graphs, diagrams, tables, and images. In this post, we discuss how to use LLMs from Amazon Bedrock to not only extract text, but also understand information available in images.

Amazon Bedrock is a fully managed service that offers a choice of high-performing foundation models (FMs) from leading AI companies like AI21 Labs, Anthropic, Cohere, Meta, Mistral AI, Stability AI, and Amazon through a single API. It also provides a broad set of capabilities to build generative AI applications with security, privacy, and responsible AI.

Solution overview

In this post, we demonstrate how to use models on Amazon Bedrock to retrieve information from images, tables, and scanned documents. We provide the following examples:

Performing object classification and object detection tasks
Reading and querying graphs
Reading flowcharts and architecture diagrams (such as an AWS architecture diagram) and converting them to text

We also explore specific use cases where Amazon Bedrock can help you generate information from an image:

Identify the action in the image
Name a product and extract metadata to generate a tagline and description
Generate a real estate listing for a property
Generate a recipe from an image of a dish

Perform object classification and object detection tasks using LLM

At the time of writing, the best algorithms for object classification and detection are based on convolutional neural networks. However, LLMs such as Anthropic’s Claude 3 Sonnet on Amazon Bedrock can also perform these tasks using zero-shot prompting, a technique in which you give the model a task without providing specific examples or training for that task. In the following example, we show how to classify an image based on the classes that we pass with the prompt. The following image is a sample Australian Medicare card.

We use the following prompt with Anthropic’s Claude 3 Sonnet to classify the image:

This is one of the documents that a customer seeking a loan has uploaded. The image can belong to one of the following four classes : ID, bank statement, application form or utility bill. Identify what the class of this document is and give me just the class name. I don’t need any other information for now

We get the following response from the LLM:

Based on the image provided, the class of this document appears to be an ID card or identification document. The image shows what seems to be a Medicare card or similar government-issued ID card displaying names and an identification number.

It has correctly identified the class; however, a response in JSON would make it more straightforward to process that response. We add the following to the end of the prompt:

provide the response in json format with the key as “class” and the value as the class of the document

We get the following response:

{
"class": "ID"
}

You can now read the JSON response using a library of your choice, such as the Python JSON library.
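The following sketch shows one way to package the image and prompt for the Amazon Bedrock Converse API and to read the class from the reply with the Python `json` module. The helper names are hypothetical, and the parser assumes the reply contains a single JSON object, possibly surrounded by extra text.

```python
import json


def build_messages(image_bytes: bytes, image_format: str, prompt: str) -> list:
    """Build a single-turn Converse API message containing an image and a text prompt."""
    return [
        {
            "role": "user",
            "content": [
                {"image": {"format": image_format, "source": {"bytes": image_bytes}}},
                {"text": prompt},
            ],
        }
    ]


def parse_class(response_text: str) -> str:
    """Extract the value of the "class" key from the model's reply."""
    start = response_text.find("{")
    end = response_text.rfind("}")
    if start == -1 or end < start:
        raise ValueError("no JSON object found in the response")
    # Slice from the first "{" to the last "}" so that surrounding prose is ignored.
    return json.loads(response_text[start:end + 1])["class"]


if __name__ == "__main__":
    print(parse_class('{\n"class": "ID"\n}'))
```

With boto3, the messages would be sent with `boto3.client("bedrock-runtime").converse(modelId=..., messages=messages)`, and the reply text is found at `response["output"]["message"]["content"][0]["text"]`. Use the model ID that the Amazon Bedrock console lists for Claude 3 Sonnet in your Region.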

In this next example, we use Meta’s Llama 3.2 90B Vision model to identify an object in an image. The following image is of a gearbox.

We use the following prompt to identify the gearbox from the image:

identify the entire machine part in this image. Give just the name of the machine part in a JSON format with the key as “name” and the value as the name of the machine part

The following screenshot shows the prompt and the answer the model gives.

Read and query graphs

In this next example, we show how to read a graph and then interpolate and extrapolate values in the graph. The following graph shows the population growth for a few countries (source).

We pass this image to Anthropic’s Claude 3 Sonnet along with the following prompt:

From this graph create a table for the population of India, with the first column as the year and the second column as the population in millions. Create a row for every 5 years starting from 1950 to 2025.

We ask it to create a row for every 5 years, so the model must interpolate values. We also ask it to extend the table until 2025, and because the data is only until 2021, the model will have to extrapolate the values. We get the following response:

Here is a table for the population of India, with the first column as the year and the second column as the population in millions, for every 5 years starting from 1950 to 2025:

| Year | Population (in millions) |
|------|--------------------------|
| 1950 | 357.6 |
| 1955 | 390.5 |
| 1960 | 427.8 |
| 1965 | 468.0 |
| 1970 | 510.9 |
| 1975 | 556.9 |
| 1980 | 605.7 |
| 1985 | 658.0 |
| 1990 | 713.8 |
| 1995 | 773.0 |
| 2000 | 835.6 |
| 2005 | 901.7 |
| 2010 | 971.5 |
| 2015 | 1045.0 |
| 2020 | 1122.0 |
| 2025 | 1202.4 |

The extrapolation may not be accurate, and this is where a more traditional machine learning (ML) approach would be helpful.
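As one simple stand-in for such an approach, the following sketch fits an ordinary least-squares line to the most recent values the model read from the graph and extends it to 2025. The post does not prescribe a method; the choice of a straight line and of the 2005 to 2020 window is our assumption for illustration only.

```python
def fit_line(points):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(points, year):
    """Extrapolate (or interpolate) the fitted line at the given year."""
    slope, intercept = fit_line(points)
    return slope * year + intercept


# Values taken from the model's table above (population in millions).
RECENT_POINTS = [(2005, 901.7), (2010, 971.5), (2015, 1045.0), (2020, 1122.0)]

if __name__ == "__main__":
    print(round(predict(RECENT_POINTS, 2025), 1))
```

A fitted model like this gives a reproducible estimate that you can compare with the value the LLM produced, rather than relying on the LLM's own extrapolation.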

You can also ask the model to combine its knowledge with the knowledge from the graph. The following prompt asks for the population for countries in this graph for 2021, along with the country’s capital:

Can you create a table for 2021 with the first column as the country name, the second column as population in millions and third column as the capital of that country?

The following screenshot shows the response.

The LLM has filled in the table based on the graph and its own knowledge about the capital of each country.

You can try out something harder as well. The following image is a weather map from the

You can ask the LLM a specific question, such as the following:

This is a precipitation map for Australia. Is it raining in Sydney right now?

You don’t have to tell the LLM where Sydney is or that the image is for rainfall.

We get the following response:

Based on the precipitation map shown, there does not appear to be any significant rainfall occurring over the Sydney area, which is located on the eastern coast of the Australian continent. The map uses contour lines and colors to represent precipitation levels, but the region around Sydney has no shaded areas or contour lines indicating rainfall at this particular timeframe depicted by the map. Therefore, the available data suggests that it is likely not raining in Sydney at the time represented by this precipitation map.

There are a couple of interesting observations from this use case:

Anthropic’s Claude 3 Sonnet can read maps
The model is able to read overlays on a map
Phrases such as “region around Sydney” show that the model doesn’t need to work with exact information but can use an approximation, just as humans do

Read flowcharts and architecture diagrams

You can read AWS architecture diagrams using the Meta Llama 3.2 90B Vision model. The following is an example architecture diagram for modernizing applications with microservices using Amazon Elastic Kubernetes Service (Amazon EKS).

We use the following prompt to read this diagram:

The steps in this diagram are explained using numbers 1 to 11. The numbers are shown in blue squares. Can you explain the diagram using the numbers 1 to 11 and an explanation of what happens at each of those steps?

The following screenshot shows the response that we get from the LLM (truncated for brevity).

Furthermore, you can use this diagram to ask follow-up questions:

Why do we need a network load balancer in this architecture

The following screenshot shows the response from the model.

As you can see, the LLM acts as your advisor now for questions related to this architecture.
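A follow-up question only works if the earlier turns travel with the request. The following sketch shows how the conversation history could be kept in the message format used by the Amazon Bedrock Converse API; the helper names are hypothetical, and the assistant text would come from the model's previous reply.

```python
def start_conversation(image_bytes: bytes, image_format: str, question: str) -> list:
    """First user turn: the architecture diagram plus the initial question."""
    return [
        {
            "role": "user",
            "content": [
                {"image": {"format": image_format, "source": {"bytes": image_bytes}}},
                {"text": question},
            ],
        }
    ]


def add_follow_up(messages: list, model_answer: str, follow_up: str) -> list:
    """Append the model's previous answer and the next user question.

    Returns a new list so the earlier history is left unchanged.
    """
    return messages + [
        {"role": "assistant", "content": [{"text": model_answer}]},
        {"role": "user", "content": [{"text": follow_up}]},
    ]
```

Because the image stays in the first turn, the model can answer the follow-up about the load balancer without the diagram being uploaded again.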

However, we’re not limited to using generative AI for only software engineering. You can also read diagrams and images from engineering, architecture, and healthcare.

For this example, we use a process diagram taken from Wikipedia.

To find out what this process diagram is for and to describe the process, you can use the following prompt:

Can you name the process shown in the example. Also describe the process using numbered steps and go from left to right.

The following screenshot shows the response.

The LLM has done a good job figuring out that the diagram is for the Haber process to produce ammonia. It also describes the steps of the process.

Identify actions in an image

You can identify and classify the actions taking place in the image. The model’s ability to accurately identify actions is further enhanced by its capacity to analyze contextual information, such as the surrounding objects, environments, and the positions of individuals or entities within the image. By combining these visual cues and contextual elements, Anthropic’s Claude 3 Sonnet can make informed decisions about the nature of the actions being performed, providing a comprehensive understanding of the scene depicted in the image.

The following is an example where we can not only classify the action of the player but also provide feedback to the player comparing the action to a professional player.

We provide the model the following image of a tennis player. The image was generated using the Stability AI (SDXL 1.0) model on Amazon Bedrock.

The following screenshot shows the prompt and the model’s response.

Name a product and extract metadata to generate a tagline and description

In the field of marketing and product development, coming up with a perfect product name and creative promotional content can be challenging. With the image-to-text capabilities of Anthropic’s Claude 3 Sonnet, you can upload the image of the product and the model can generate a unique product name and craft taglines to suit the target audience.

For this example, we provide the following image of a sneaker to the model (the image was generated using the Stability AI (SDXL 1.0) model on Amazon Bedrock).

The following screenshot shows the prompt.

The following screenshot shows the model’s response.

In the retail and ecommerce domain, you can also use Anthropic’s Claude 3 Sonnet to extract detailed product information from the images for inventory management.

For example, we use the prompt shown in the following screenshot.

The following screenshot shows the model’s response.

Create a real estate listing for a property

You can upload images of a property floor plan and pictures of interior and exterior of the house and then get a description to use in a real estate listing. This is useful to increase the creativity and productivity of real estate agents while advertising properties. Architects could also use this mechanism to explain the floor plan to customers.

We provide the following example floor plan to the model.

The following screenshot shows the prompt.

The following screenshot shows the response.

Generate a recipe from the image of a dish

You can also use Anthropic’s Claude 3 Sonnet to create a recipe based on a picture of a dish. However, out of the box, the model can identify only the dishes that are included in the dataset used for the model training. Factors such as ingredient substitutions, cooking techniques, and cultural variations in cuisine can pose significant challenges.

For example, we provide the following image of a cake to the model to extract the recipe. The image was generated using the Stability AI model (SDXL 1.0) on Amazon Bedrock.

The following screenshot shows the prompt.

The model successfully identifies the dish as Black Forest cake and creates a detailed recipe. The recipe may not create the exact cake shown in the figure, but it does get close to a Black Forest Cake.

Conclusion

FMs such as Anthropic’s Claude 3 Sonnet and Meta Llama 3.2 90B Vision model, available on Amazon Bedrock, have demonstrated impressive capabilities in image processing. These FMs unlock a range of powerful features, including image classification, optical character recognition (OCR), and the ability to interpret complex visuals such as graphs and architectural blueprints. Such innovations offer novel solutions to challenging problems, from searching through scanned document archives to generating image-inspired text content and converting visual information into structured data.

To start using these capabilities for your specific needs, we recommend exploring the chat playground feature on Amazon Bedrock, which allows you to interact with and extract information from images.

About the Authors

Mithil Shah is a Principal AI/ML Solution Architect at Amazon Web Services. He helps commercial and public sector customers use AI/ML to achieve their business outcomes. He is currently helping customers build chat bots and search functionality using LLM agents and RAG.

Santosh Kulkarni is a Senior Solutions Architect at Amazon Web Services specializing in AI/ML. He is passionate about generative AI and is helping customers unlock business potential and drive actionable outcomes with machine learning at scale. Outside of work, he enjoys reading and traveling.

How Crexi achieved ML models deployment on AWS at scale and boosted efficiency

November 26, 2024

by Isaac Smothers Amazon AWS

This post is co-written with Isaac Smothers and James Healy-Mirkovich from Crexi.

With the current demand for AI and machine learning (AI/ML) solutions, the processes to train and deploy models and scale inference are crucial to business success. Even though AI/ML and especially generative AI progress is rapid, machine learning operations (MLOps) tooling is continuously evolving to keep pace. Customers are looking for success stories about how best to adopt the culture and new operational solutions to support their data scientists. Solutions should be flexible to adopt, allow seamless integration with other systems, and provide a path to automate MLOps using AWS services and third-party tools, as we’ll explore in this post with Pulumi and Datadog. This framework not only helps achieve operational excellence in the DevOps space but also allows stakeholders to optimize tools such as infrastructure as code (IaC) automation and DevOps Research and Assessment (DORA) observability of pipelines for MLOps.

Commercial Real Estate Exchange, Inc. (Crexi), is a digital marketplace and platform designed to streamline commercial real estate transactions. It allows brokers to manage the entire process from listing to closing on one platform, including digital letters of intent, best and final offer negotiations, and transaction management tools. Its data and research features allow investors and other commercial real estate stakeholders to conduct due diligence and proactively connect with other professionals ahead of the transaction process.

In this post, we will review how Crexi achieved its business needs and developed a versatile and powerful framework for AI/ML pipeline creation and deployment. This customizable and scalable solution allows its ML models to be efficiently deployed and managed to meet diverse project requirements.

Datadog is a monitoring service for cloud-scale applications, bringing together data from servers, databases, tools, and services to present a unified view of your entire stack. Datadog is a SaaS-based data analytics platform that enables Dev and Ops teams to work collaboratively to avoid downtime, resolve performance problems, and make sure that development and deployment cycles finish on time.

Pulumi’s modern infrastructure as code (IaC) platform empowers teams to manage cloud resources using their favorite languages including Python, JavaScript, TypeScript, Go, and C#. Pulumi’s open source SDK integrates with its free and commercial software as a service (SaaS) to simplify infrastructure provisioning, delivery, architecture, policy, and testing on a cloud.

Solution overview

Central to Crexi’s infrastructure are boilerplate AWS Lambda triggers that call Amazon SageMaker endpoints, executing any given model’s inference logic asynchronously. This modular approach supports complex pipeline pathways, with final results directed to Amazon Simple Storage Service (Amazon S3) and Amazon Data Firehose for seamless integration into other systems. One of the SageMaker endpoints also uses Amazon Textract, but any model can be used.

ML pipeline engineering requirements

The goal of the ML pipeline is to build a robust infrastructure for model deployments. The engineering requirements are:

Rapid deployment of ML models: Model pipeline deployments should be managed through a continuous integration and continuous deployment (CI/CD) infrastructure, facilitating model pipeline rollbacks, regression testing, and click deploys. This automated CI/CD deployment process is used to automatically test and deploy pipeline changes, minimizing the risk of errors and downtime.
Distinct separation of concerns for production and development ML pipelines: This requirement prevents ongoing model experiments in the development environment from affecting the production environment, thereby maintaining the stability and reliability of the production models.
Model pipeline health monitoring: Health monitoring allows for proactive identification and resolution of potential issues in model pipelines before they impact downstream engineering teams and users.
Readily accessible models: Model pipelines should be accessible across engineering teams and straightforward to integrate into new and existing products.

The goal is to build reliable, efficient ML pipelines that can be used by other engineering teams with confidence.

Technical overview

The ML pipeline infrastructure is an amalgamation of various AWS products, designed to seamlessly invoke and retrieve output from ML models. This infrastructure is deployed using Pulumi, a modern IaC tool that allows Crexi to handle the orchestration of AWS products in a streamlined and efficient manner.

The AWS products managed by Pulumi in the infrastructure include:

AWS Identity and Access Management (IAM) for secure access management
Amazon S3 for storing model tar.gz files and model prediction outputs, and Amazon SageMaker for model inference
AWS Lambda to send outputs from SageMaker models to one another
Amazon Simple Notification Service (Amazon SNS) to notify downstream teams when ML models produce predictions, helping to ensure timely communication and collaboration
Data Firehose to ship model predictions as needed, further enhancing the flexibility of the pipeline

To protect the robustness and reliability of the infrastructure, Crexi uses Datadog for pipeline log monitoring, which allows the team to keep a close eye on the pipeline’s performance and quickly identify and address issues that might arise.

Lastly, Crexi uses GitHub Actions to run Pulumi scripts in a CI/CD fashion for ML pipeline deploys, updates, and destroys. These GitHub Actions workflows keep the infrastructure reproducible and sufficiently hardened against code regression.

Pipeline as code

Pulumi-managed ML pipelines are coded as YAML files that data scientists can quickly create and deploy. Deploying IaC using YAML files that data scientists can write has three key advantages:

Increased efficiency and speed: A streamlined deployment process allows data scientists to write and deploy their own models. Enabling data scientists in this way reduces delivery time by not requiring additional data engineering or ops personnel (that is, it reduces cross-functional dependencies) for deployments.
Flexibility and customization: YAML files allow data scientists to specify the necessary configurations such as instance types, model images, and additional permissions. This level of customization helps the team to optimize the deployed models for specific use cases.
Simplicity and readability: YAML files are human-readable, facilitating the evaluation, review, and auditing of infrastructure and deployment configurations.

Implementation

Now, let’s look at the implementation details of the ML pipeline.

The pipeline contains three SageMaker endpoints named model-a, model-b, and model-c. Each endpoint is asynchronous and has a specified number of running instances. Each has a specified Docker image to run the model hosted on the endpoint, a specified location of the model.tar.gz file that the endpoint will host, and a specified type of machine instance to run the endpoint on. The model-b and model-c endpoints depend on the output from model-a.

The model-a endpoint has access to input Amazon S3 objects in the Crexi AWS account and depends on the crexi-model-input-dev bucket for input. Lastly, the model-c endpoint also has access to input S3 objects in the Crexi AWS account in addition to Amazon Textract.

After a new version of an input is uploaded to the crexi-model-input-dev S3 bucket, a Lambda function passes it to the model-a SageMaker endpoint. After results are ready and delivered to the model-a-model-output bucket, the relevant Lambda functions execute model-b and model-c SageMaker endpoints accordingly.
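The trigger step described above can be sketched as a Lambda handler that turns each S3 event record into an asynchronous SageMaker invocation. This is a minimal illustration, not Crexi's boilerplate: the factory function, the JSON content type, and the returned dictionary are assumptions, while `invoke_endpoint_async` with `EndpointName` and `InputLocation` is the SageMaker Runtime call for asynchronous endpoints.

```python
import urllib.parse


def input_locations(event: dict) -> list:
    """Convert S3 event records into s3:// URIs."""
    locations = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 event notifications.
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
        locations.append(f"s3://{bucket}/{key}")
    return locations


def make_handler(sagemaker_runtime, endpoint_name: str):
    """Create a Lambda handler bound to one asynchronous endpoint."""
    def handler(event, context):
        output_locations = []
        for location in input_locations(event):
            response = sagemaker_runtime.invoke_endpoint_async(
                EndpointName=endpoint_name,
                InputLocation=location,
                ContentType="application/json",  # assumed input format
            )
            output_locations.append(response["OutputLocation"])
        return {"outputLocations": output_locations}
    return handler
```

In a deployed function, the client would be created once at module level, for example `handler = make_handler(boto3.client("sagemaker-runtime"), "model-a")`. Passing the client in also makes the handler easy to test with a stub.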

The visualization that follows depicts the pipeline flow.

To automate changes in the resources and new models, the Crexi team manages infrastructure using Pulumi and defines resources using YAML. SageMakerPipelineExample.yaml creates a stack of AWS resources that deploy service models to production. The AWS stack contains the necessary Lambda functions, S3 buckets, SageMaker endpoints, IAM permissions, and so on. As an example, the following is part of the YAML files that define the SageMaker endpoints.

team: Mlops

identifier: SagemakerPipelineExample

data_dev: 
  buckets:
    - name: "crexi-model-storage-dev" 
      additionalWriters:
        - "arn:aws:iam::<aws_account_id>:role/DataDevelopers"
    - name: "crexi-model-input-dev"

sagemakerPipelines:
  - name: "Infrared"
    models:
      - name: model-a 
        async: true 
        count: 4
        image: "<aws_account_id>.dkr.ecr.us-west-2.amazonaws.com/huggingface-pytorch-inference-with-t
        s3Path: "crexi-model-storage-dev/model-a.tar.gz"
        access:
          filesCrexiAccess: true
        instanceType: ml.c5.4xlarge
        dependsOn:
          s3Buckets:
            - bucketName: "crexi-model-input-dev"
              prefix: "manifests/"
              suffix: ".json"
      - name: model-b 
        async: true 
        count: 1
        instanceType: ml.m5.xlarge
        image: "<aws_account_id>.dkr.ecr.us-west-2.amazonaws.com/huggingface-pytorch-inference:2.1.0
        s3Path: "crexi-model-storage-dev/model-b.tar.gz"
        dependsOn: 
          models:
            - "model-a"
      - name: model-c 
        async: true 
        count: 1
        instanceType: ml.m5.large
        image: "<aws_account_id>.dkr.ecr.us-west-2.amazonaws.com/huggingface-pytorch-inference:2.1.0
        s3Path: "crexi-model-storage-dev/model-c.tar.gz" 
        access:
          filesCrexiAccess: true
        textract: true 
        dependsOn:
          models:
            - "model-a"
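To illustrate how the `dependsOn` entries could drive wiring between endpoints, the following sketch orders models so that each one appears after the models it depends on. It works on a plain Python structure that mirrors the `models` list above; how Crexi's tooling actually processes the YAML is not described in this post, so the function is a hypothetical example.

```python
def trigger_order(models: list) -> list:
    """Return model names ordered so that dependencies come first."""
    dependencies = {
        model["name"]: list(model.get("dependsOn", {}).get("models", []))
        for model in models
    }
    ordered = []
    in_progress = set()

    def visit(name):
        if name in ordered:
            return
        if name not in dependencies:
            raise KeyError(f"unknown model in dependsOn: {name}")
        if name in in_progress:
            raise ValueError(f"circular dependency involving {name}")
        in_progress.add(name)
        for dependency in dependencies[name]:
            visit(dependency)
        in_progress.discard(name)
        ordered.append(name)

    for name in dependencies:
        visit(name)
    return ordered


if __name__ == "__main__":
    pipeline = [
        {"name": "model-b", "dependsOn": {"models": ["model-a"]}},
        {"name": "model-a", "dependsOn": {"s3Buckets": [{"bucketName": "crexi-model-input-dev"}]}},
        {"name": "model-c", "dependsOn": {"models": ["model-a"]}},
    ]
    print(trigger_order(pipeline))
```

Rejecting unknown names and cycles at this stage surfaces configuration mistakes during review, before any resources are created.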

Pipeline deployment

ML pipelines can be quickly deployed, modified, and destroyed using a continuous delivery GitHub workflow named Deploy self-service infrastructure that has been set up in a Crexi repository. After new models are tested and everything is ready in Crexi’s repository, the GitHub workflow triggers deployment using Pulumi and a YAML file that defines the resources, as described in the previous section of this post.

The Deploy self-service infrastructure workflow takes four arguments:

branch
- Description: GitHub branch to source the pipeline YAML file from
- Input (options):
  - GitHub branch (for example, main)
action
- Description: Specifies the type of Pulumi action to run
- Input (options):
  - up: Create or update resources
  - destroy: Tear down resources
  - preview: Preview changes without applying them
environment
- Description: Defines the environment against which the action will be executed
- Input (options):
  - data_dev: Development environment
  - data_prod: Production environment
YAML
- Description: Path to the infrastructure YAML file that defines the resources to be managed
- Input (string)
  - Filename of SageMaker model pipeline YAML file to deploy, modify, or destroy
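
To illustrate how these four arguments could translate into a Pulumi invocation, here is a hedged sketch rather than the actual workflow code. It assumes that the Pulumi stack is named after the environment and that the YAML path is handed to the Pulumi program through an environment variable. The function name build_pulumi_command and the variable names PIPELINE_YAML and PIPELINE_BRANCH are hypothetical. The --stack and --yes flags are standard Pulumi CLI options.

```python
# Allowed values, taken from the workflow argument list above.
ACTIONS = {"up", "destroy", "preview"}
ENVIRONMENTS = {"data_dev", "data_prod"}


def build_pulumi_command(branch, action, environment, yaml_file):
    """Validate the workflow inputs and return (command, extra_env)."""
    if not branch:
        raise ValueError("branch must not be empty")
    if action not in ACTIONS:
        raise ValueError(f"unsupported action: {action!r}")
    if environment not in ENVIRONMENTS:
        raise ValueError(f"unsupported environment: {environment!r}")
    if not yaml_file:
        raise ValueError("yaml_file must not be empty")

    # Assumption: one Pulumi stack per environment, named after it.
    command = ["pulumi", action, "--stack", environment]
    if action in {"up", "destroy"}:
        # Skip the interactive confirmation prompt in CI; preview has none.
        command.append("--yes")

    # Assumption: the Pulumi program reads these hypothetical variables.
    extra_env = {"PIPELINE_YAML": yaml_file, "PIPELINE_BRANCH": branch}
    return command, extra_env


command, extra_env = build_pulumi_command(
    "main", "preview", "data_dev", "sample-pipeline.yaml"
)
print(" ".join(command))
```

Validating the inputs before calling Pulumi keeps a mistyped action or environment from reaching the production stack.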

The following screenshot shows GitHub workflow parameters and history.

Pipeline monitoring

Pipeline monitoring for Pulumi-deployed ML pipelines uses a comprehensive Datadog dashboard (shown in the following figure) that offers extensive logging capabilities and visualizations. Key metrics and logs are collected and visualized to facilitate real-time monitoring and historical analysis. Pipeline monitoring has dramatically simplified the assessment of a given pipeline’s health status, allowing for the rapid detection of potential bottlenecks and bugs, thereby improving operation of the ML pipelines.

The dashboard offers several core features:

Error tracking: The dashboard tracks 4xx and 5xx errors in aggregate, correlating errors to specific logged events within the model pipelines, which aids in quick and effective diagnosis by providing insights into the frequency and distribution of these errors.
Invocation metrics for SageMaker models: The dashboard aggregates data on instance resource utilization, invocation latency, invocation failures, and endpoint backlog for the SageMaker models deployed through Pulumi, giving a detailed view of performance bottlenecks and latencies.
Lambda function monitoring: The dashboard monitors the success and failure rates of invocations for triggerable Lambda functions, thus delivering a holistic view of the system’s performance.

Conclusion

The ML pipeline deployment framework explored here offers a robust, scalable, and highly customizable solution for AI/ML needs and addresses Crexi’s requirements. Because pipelines can be built and deployed rapidly, experiments and new ML techniques can be tested at scale with minimal effort. The framework separates the model development workflow from production deployments and makes it possible to proactively monitor for issues. Additionally, routing model outputs to Amazon S3 supports seamless integration with Snowflake, facilitating storage and accessibility of data. This interconnected ecosystem does more than just improve current operations; it lays the groundwork for continuous innovation. The data housed in Snowflake serves as a rich resource for training new models that can be deployed quickly with new ML pipelines, enabling a cycle of improvement and experimentation that propels Crexi’s projects forward.

If you have any thoughts or questions, leave them in the comments section.

Isaac Smothers is a Senior DevOps Engineer at Crexi. Isaac focuses on automating the creation and maintenance of robust, secure cloud infrastructure with built-in observability. Based in San Luis Obispo, he is passionate about providing self-service solutions that enable developers to build, configure, and manage their services independently, without requiring cloud or DevOps expertise. In his free time, he enjoys hiking, video editing, and gaming.

James Healy-Mirkovich is a Principal Data Scientist at Crexi in Los Angeles. Passionate about making data actionable and impactful, he develops and deploys customer-facing AI/ML solutions and collaborates with product teams to explore the possibilities of AI/ML. Outside work, he unwinds by playing guitar, traveling, and enjoying music and movies.

Marina Novikova is a Senior Partner Solution Architect at AWS. Marina works on the technical co-enablement of AWS ISV Partners in the DevOps and Data and Analytics segments to enrich partner solutions and solve complex challenges for AWS customers. Outside of work, Marina spends time climbing high peaks around the world.