Running NVIDIA NeMo 2.0 Framework on Amazon SageMaker HyperPod

This post is cowritten with Abdullahi Olaoye, Akshit Arora and Eliuth Triana Isaza at NVIDIA.

As enterprises continue to push the boundaries of generative AI, scalable and efficient model training frameworks are essential. The NVIDIA NeMo Framework provides a robust, end-to-end solution for developing, customizing, and deploying large-scale AI models, while Amazon SageMaker HyperPod delivers the distributed infrastructure needed to handle multi-GPU, multi-node workloads seamlessly.

In this blog post, we explore how to integrate NeMo 2.0 with SageMaker HyperPod to enable efficient training of large language models (LLMs). We cover the setup process and provide a step-by-step guide to running a NeMo job on a SageMaker HyperPod cluster.

NVIDIA NeMo Framework Overview

The NVIDIA NeMo Framework is an end-to-end solution for developing cutting-edge generative AI models such as LLMs, vision language models (VLMs), video and speech models, and others.

At its core, NeMo Framework provides model builders with:

  • Comprehensive development tools: A complete ecosystem of tools, scripts, and proven recipes that guide users through every phase of the LLM lifecycle, from initial data preparation to final deployment.
  • Advanced customization: Flexible customization options that teams can use to tailor models to their specific use cases while maintaining peak performance.
  • Optimized infrastructure: Sophisticated multi-GPU and multi-node configurations that maximize computational efficiency for both language and image applications.
  • Enterprise-grade features with built-in capabilities including:
    • Advanced parallelism techniques
    • Memory optimization strategies
    • Distributed checkpointing
    • Streamlined deployment pipelines

By consolidating these powerful features into a unified framework, NeMo significantly reduces the complexity and cost associated with generative AI development. NeMo Framework 2.0 is a flexible, IDE-independent Python-based framework that integrates smoothly into each developer’s workflow. The framework provides capabilities such as code completion, type checking, programmatic extensions, and configuration customization. The NeMo Framework includes NeMo-Run, a library designed to streamline the configuration, execution, and management of machine learning experiments across various computing environments.
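To give a sense of the NeMo-Run pattern, the following is a minimal, hedged sketch in the spirit of the Slurm-based example later in this post. It assumes NeMo-Run is installed; the use of run.LocalExecutor and the task name are illustrative assumptions rather than code from the sample.

# Minimal NeMo-Run sketch (illustrative; assumes NeMo-Run is installed)
import nemo_run as run

def hello(name: str = "NeMo"):
    print(f"Hello from {name}")

if __name__ == "__main__":
    # Wrap the function as a configurable task, then hand it to an executor.
    task = run.Partial(hello, name="SageMaker HyperPod")
    with run.Experiment("hello-nemo-run", log_level="INFO") as exp:
        exp.add(task, executor=run.LocalExecutor(), name="hello")
        exp.run(detach=False)

The same experiment pattern reappears later in this post, where the local executor is replaced with a Slurm executor that targets the SageMaker HyperPod cluster.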

The end-to-end NeMo Framework includes the following key features that streamline and accelerate AI development:

  • Data curation: NeMo Curator is a Python library that includes a suite of modules for data-mining and synthetic data generation. They are scalable and optimized for GPUs, making them ideal for curating natural language data to train or fine-tune LLMs. With NeMo Curator, you can efficiently extract high-quality text from extensive raw web data sources.
  • Training and customization: NeMo Framework provides tools for efficient training and customization of LLMs and multimodal models. It includes default configurations for compute cluster setup, data downloading, and model hyperparameter autotuning, which can be adjusted to train on new datasets and models. In addition to pre-training, NeMo supports both supervised fine-tuning (SFT) and parameter-efficient fine-tuning (PEFT) techniques such as LoRA, P-tuning, and more (see the brief sketch after this list).
  • Alignment: NeMo Aligner is a scalable toolkit for efficient model alignment. The toolkit supports state-of-the-art model alignment algorithms such as SteerLM, DPO, reinforcement learning from human feedback (RLHF), and more. By using these algorithms, you can align language models to be safer, harmless, and more helpful.
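The following is the brief, hedged sketch of the PEFT path mentioned in the training and customization item above. The recipe helper llm.llama3_8b.finetune_recipe, the peft_scheme argument, and the paths are assumptions for illustration and may differ across NeMo versions.

# Hedged sketch: build a LoRA (PEFT) fine-tuning recipe with NeMo 2.0 recipe helpers.
from nemo.collections import llm

finetune = llm.llama3_8b.finetune_recipe(
    dir="/fsx/ubuntu/checkpoints/llama3-8b-lora",  # assumed output directory
    name="llama3-8b-lora",
    num_nodes=1,
    num_gpus_per_node=8,
    peft_scheme="lora",  # assumed: set to "none" for full supervised fine-tuning (SFT)
)
# The resulting recipe can be launched the same way as the pretraining example later
# in this post, for example with run.Experiment and a Slurm or local executor.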

Solution overview

In this post, we show you how to efficiently train large-scale generative AI models with NVIDIA NeMo Framework 2.0 using SageMaker HyperPod, a managed distributed training service designed for high-performance workloads. This solution integrates NeMo Framework 2.0 with the scalable infrastructure of SageMaker HyperPod, creating seamless orchestration of multi-node, multi-GPU clusters.

The key steps to deploying this solution include:

  • Setting up SageMaker HyperPod prerequisites: Configuring networking, storage, and permissions management (AWS Identity and Access Management (IAM) roles).
  • Launching the SageMaker HyperPod cluster: Using lifecycle scripts and a predefined cluster configuration to deploy compute resources.
  • Configuring the environment: Setting up NeMo Framework and installing the required dependencies.
  • Building a custom container: Creating a Docker image that packages NeMo Framework and installs the required AWS networking dependencies.
  • Running NeMo model training: Using NeMo-Run with a Slurm-based execution setup to train an example LLaMA (180M) model efficiently.

Architecture diagram

The architecture, shown in the preceding diagram, centers on an Amazon SageMaker HyperPod cluster.

Prerequisites

First, you deploy a SageMaker HyperPod cluster before running the job. But to deploy the cluster, you need to create some prerequisite resources.

Note that there is a cost associated with running a SageMaker HyperPod cluster; see Amazon SageMaker AI Pricing (HyperPod pricing under On-demand pricing) for more information.

The following prerequisite steps are adapted from the Amazon SageMaker HyperPod workshop, which you can visit for additional information.

Use the following steps to deploy the prerequisite resources.

  1. Sign in to the AWS Management Console using the AWS account you want to deploy the SageMaker HyperPod cluster in. You will create a VPC, subnets, an FSx for Lustre volume, an Amazon Simple Storage Service (Amazon S3) bucket, and an IAM role as prerequisites, so make sure that your IAM role or user for console access has permissions to create these resources.
  2. Use the provided CloudFormation template link to open the AWS CloudFormation console and launch the solution template.
  3. Template parameters:
    • Change the Availability Zone to match the AWS Region where you’re deploying the template. See Availability Zone IDs for the AZ ID for your Region.
    • All other parameters can be left as default or changed as needed for your use case.
  4. Select the acknowledgement box in the Capabilities section and create the stack.

It takes about 10 minutes for the CloudFormation stack creation to complete. The following figure shows the deployment timeline of the CloudFormation stack deployment for the prerequisite infrastructure components.

Launch the training job

With the prerequisite infrastructure deployed in your AWS account, you next deploy the SageMaker HyperPod cluster that you’ll use for the model training example. For the model training job, you will use the NeMo Framework to launch training jobs efficiently.

Step 1: Set up a SageMaker HyperPod cluster

After the prerequisite resources are successfully deployed, create a SageMaker HyperPod cluster.

The deployment steps are adapted from the SageMaker HyperPod workshop, which you can review for additional information.

  1. Install and configure the AWS Command Line Interface (AWS CLI). If you already have it installed, verify that the version is at least 2.17.1 by running the following command:
$ aws --version
  2. Configure the environment variables using outputs from the CloudFormation stack deployed earlier.
$ curl -O https://raw.githubusercontent.com/aws-samples/awsome-distributed-training/main/1.architectures/5.sagemaker-hyperpod/create_config.sh
# Change the region below to the region you wish to use
$ AWS_REGION=us-east-1 bash create_config.sh
$ source env_vars
# Confirm environment variables
$ cat env_vars
  3. Download the lifecycle scripts and upload them to the S3 bucket created in the prerequisites. SageMaker HyperPod uses lifecycle scripts to bootstrap a cluster. Examples of actions the lifecycle script manages include setting up Slurm and mounting the FSx for Lustre file system.
$ git clone --depth=1 https://github.com/aws-samples/awsome-distributed-training/
$ cd awsome-distributed-training && git checkout e67fc352de83e13ad6a34b4a33da11c8a71b4d9c
# upload script
$ aws s3 cp --recursive 1.architectures/5.sagemaker-hyperpod/LifecycleScripts/base-config/ s3://${BUCKET}/src
  4. Create a cluster config file for setting up the cluster. The following is an example of creating a cluster config from a template. The example cluster config is for g5.48xlarge compute nodes accelerated by 8 x NVIDIA A10G GPUs. See Create Cluster for cluster config examples of additional Amazon Elastic Compute Cloud (Amazon EC2) instance types. A cluster config file contains the following information:
    1. Cluster name
    2. Three instance groups:
      1. Login-group: Acts as the entry point for users and administrators. Typically used for managing jobs, monitoring, and debugging.
      2. Controller-machine: The head node for the HyperPod Slurm cluster. It manages the overall orchestration of the distributed training process and handles job scheduling and communication within nodes.
      3. Worker-group: The group of nodes that executes the actual model training workload.
    3. VPC configuration
$ cd 3.test_cases/22.nemo-run/slurm
$ curl -O https://awsome-distributed-training.s3.amazonaws.com/blog-assets/nemo2.0-hyperpod/cluster-config-template.json 
$ cp cluster-config-template.json cluster-config.json
# Replace the placeholders in the cluster config
$ source env_vars
$ sed -i "s/\$BUCKET/${BUCKET}/g" cluster-config.json
$ sed -i "s/\$ROLE/${ROLE}/g" cluster-config.json
$ sed -i "s/\$SECURITY_GROUP/${SECURITY_GROUP}/g" cluster-config.json
$ sed -i "s/\$SUBNET_ID/${SUBNET_ID}/g" cluster-config.json
  5. Create a config file based on the following example with the cluster provisioning parameters and upload it to the S3 bucket.
$ instance_type=$(jq '.InstanceGroups[] | select(.InstanceGroupName == "worker-group-1").InstanceType' cluster-config.json)
$ cat > provisioning_parameters.json << EOL
{
  "version": "1.0.0",
  "workload_manager": "slurm",
  "controller_group": "controller-machine",
  "login_group": "login-group",
  "worker_groups": [
    {
      "instance_group_name": "worker-group-1",
      "partition_name": ${instance_type}
    }
  ],
  "fsx_dns_name": "${FSX_ID}.fsx.${AWS_REGION}.amazonaws.com",
  "fsx_mountname": "${FSX_MOUNTNAME}"
}
EOL
# copy to the S3 Bucket
$ aws s3 cp provisioning_parameters.json s3://${BUCKET}/src/
  6. Create the SageMaker HyperPod cluster.
$ aws sagemaker create-cluster \
    --cli-input-json file://cluster-config.json --region $AWS_REGION
  7. Use the following code or the console to check the status of the cluster. The status should be Creating. Wait for the cluster status to be InService before proceeding.
$ aws sagemaker list-clusters --output table

The following screenshot shows the results of the --output table command showing the cluster status as Creating.

The following screenshot shows the Cluster Management page and status of the cluster in the Amazon SageMaker AI console.

The following screenshot shows the results of the --output table command showing the cluster status as InService.

Step 2: SSH into the cluster

After the cluster is ready (that is, has a status of InService), you can connect to it using AWS Systems Manager Session Manager and an SSH helper script. See SSH into Cluster for more information.

  1. Install the AWS SSM Session Manager Plugin.
  2. Create a local key pair that can be added to the cluster by the helper script for easier SSH access and run the following SSH helper script.
$ ssh-keygen -t rsa -q -f "$HOME/.ssh/id_rsa" -N ""
$ curl -O https://raw.githubusercontent.com/aws-samples/awsome-distributed-training/main/1.architectures/5.sagemaker-hyperpod/easy-ssh.sh
$ chmod +x easy-ssh.sh
$ ./easy-ssh.sh -c controller-machine ml-cluster

Step 3: Interact with the cluster and clone the repository

After connecting to the cluster, you can validate that the cluster is properly configured by running several commands. See Get to know your Cluster for more information.

  1. View the existing partition and nodes per partition.
$ sinfo
  2. List the jobs that are in the queue or running.
$ squeue
  3. SSH to the compute nodes.
# First ssh into the cluster head node as ubuntu user
$ ssh ml-cluster

#SSH into one of the compute nodes
$ salloc -N 1
$ ssh $(srun hostname)

#Exit to the head node
$ exit

#Exit again to cancel the srun job above
$ exit
  4. Clone the code sample GitHub repository onto the cluster controller node (head node).
$ cd /fsx/ubuntu
$ git clone https://github.com/aws-samples/awsome-distributed-training/
$ cd awsome-distributed-training && git checkout e67fc352de83e13ad6a34b4a33da11c8a71b4d9c
$ cd 3.test_cases/22.nemo-run/slurm

Now, you’re ready to run your NeMo Framework Jobs on the SageMaker HyperPod cluster.

Step 4: Build the job container

The next step is to build the job container. By using a container, you can create a consistent, portable, and reproducible environment, helping to ensure that all dependencies, configurations, and optimizations remain intact. This is particularly important for high-performance computing (HPC) and AI workloads, where variations in the software stack can impact performance and compatibility.

To have a fully functioning and optimized environment, you need to add AWS-specific networking dependencies (EFA, OFI plugin, update NCCL, and NCCL tests) to the NeMo Framework container from NVIDIA GPU Cloud (NGC) Catalog. After building the Docker image, you will use Enroot to create a squash file from it. A squash file is a compressed, read-only file system that encapsulates the container image in a lightweight format. It helps reduce storage space, speeds up loading times, and improves efficiency when deploying the container across multiple nodes in a cluster. By converting the Docker image into a squash file, you can achieve a more optimized and performant execution environment, especially in distributed training scenarios.

Make sure that you have a registered account with NVIDIA and can access NGC. Retrieve the NGC API key following the instructions from NVIDIA. Use the following command to configure NGC. When prompted, use $oauthtoken for the login username and the API key from NGC for the password.

$ docker login nvcr.io

You can use the following commands to build the Docker image from the Dockerfile and create a SquashFS file from it.

$ docker build --progress=plain -t nemo_hyperpod:24.12 -f Dockerfile .
$ sudo enroot import -o /fsx/ubuntu/nemo-hyperpod-24-12-02102025.sqsh dockerd://nemo_hyperpod:24.12

Step 5: Set up NeMo-Run and other dependencies on the head node

Before continuing:

  1. NeMo-Run requires Python 3.10. Verify that it is installed on the head node before proceeding.
  2. You can use the following steps to set up NeMo-Run dependencies in a virtual environment. The steps create and activate a virtual environment, then execute the venv.sh script to install the dependencies. Dependencies being installed include the NeMo toolkit, NeMo-Run, PyTorch, Megatron-LM, and others.
$ python3.10 -m venv temp-env
$ source temp-env/bin/activate
$ bash venv.sh
  3. To prepare for the pre-training of the LLaMA model in an offline mode and to help ensure consistent tokenization, use the widely adopted GPT-2 vocabulary and merges files. This approach helps avoid potential issues related to downloading tokenizer files during training:
$ mkdir -p /fsx/ubuntu/temp/megatron
$ wget https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-vocab.json -O /fsx/ubuntu/temp/megatron/megatron-gpt-345m_vocab
$ wget https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-merges.txt -O /fsx/ubuntu/temp/megatron/megatron-gpt-345m_merges

Step 6: Launch the pretraining job using NeMo-Run

Run the training script to start the LLaMA pretraining job. The training script run.py defines the configuration for a LLaMA 180M parameter model, defines a Slurm executor, defines the experiment, and launches the experiment.

The following function defines the model configuration.

def small_llama_cfg() -> llm.GPTConfig:
    return run.Config(
        llm.Llama3Config8B,
        rotary_base=500_000,
        seq_length=1024,
        num_layers=12,
        hidden_size=768,
        ffn_hidden_size=2688,
        num_attention_heads=16,
        init_method_std=0.023,
    )
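For context, the following is a hedged sketch (not the exact contents of run.py) of how a small config like this is typically attached to a NeMo 2.0 pretraining recipe before it is handed to the executor; the recipe helper llm.llama3_8b.pretrain_recipe and the output directory are assumptions for illustration.

# Hedged sketch: start from the stock Llama 3 8B pretraining recipe and swap in
# the small 180M-parameter configuration defined above.
import nemo_run as run
from nemo.collections import llm

pretrain_recipe = llm.llama3_8b.pretrain_recipe(
    dir="/fsx/ubuntu/nemo2-sm-hyperpod/checkpoints",  # assumed output path
    name="llama-180m-pretrain",
    num_nodes=2,
    num_gpus_per_node=8,
)
pretrain_recipe.model = run.Config(llm.LlamaModel, config=small_llama_cfg())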

The following function defines the Slurm executor.

def slurm_executor(
    account: str,
    partition: str,
    nodes: int,
    user: str = "local",
    host: str = "local",
    remote_job_dir: str = "/fsx/ubuntu/nemo2-sm-hyperpod/tmp/",
    time: str = "01:00:00",
    custom_mounts: Optional[list[str]] = None,
    custom_env_vars: Optional[dict[str, str]] = None,
    container_image: str = "/fsx/ubuntu/nemo-hyperpod-24-12-02102025.sqsh",
    retries: int = 0,
) -> run.SlurmExecutor:

The following function runs the experiment.

with run.Experiment(exp_name, log_level="INFO") as exp:
    exp.add(pretrain_recipe, executor=executor, tail_logs=True, name="training")
    # Run the experiment
    exp.run(detach=True)

Use the following command to run the training job.

$ python run.py --nodes 2 --max_steps 1000

The --nodes argument specifies the number of nodes to use during the pretraining job, while the --max_steps argument specifies the maximum number of training iterations. This is useful for controlling the duration of training.

The following figure shows the logs of a running training job.

You can download the training logs from the cluster to your local machine and use machine learning visualization tools like TensorBoard to visualize your experimentation. See Install TensorFlow 2 for information about installing TensorBoard. The following is an example of downloading logs from the cluster and visualizing the logs on TensorBoard.

  1. After installing TensorBoard, download the log files from the cluster to your workstation where TensorBoard is installed.
$ rsync -aP ml-cluster:/path/to/logs/checkpoints/tb_logs/events.out.tfevents.1741213162.ip-10-1-7-21.55692.0 .

  2. After the logs are downloaded, you can launch TensorBoard with the log files in the current directory.
$ tensorboard --logdir .

The following is a TensorBoard screenshot for a training job, showing the reduced_train_loss metric with a decreasing loss curve over the training steps.

Troubleshooting

  • If some of the nodes appear as “down” or “down*”, as shown in the following figure where both nodes are in the down* state, the Slurm daemon on those nodes might need to be restarted.

Solution: Log in to the affected nodes and run sudo systemctl restart slurmd. As shown in the following figure, the two nodes then return to an idle state.

Clean up

Use the following steps to clean up the infrastructure created for this post and avoid incurring ongoing costs. You can also find cleanup instructions in Cleanup.

  1. Delete the SageMaker HyperPod cluster.
    $ aws sagemaker delete-cluster --cluster-name ml-cluster

  2. Delete the CloudFormation stack created in the prerequisites.
    $ aws cloudformation delete-stack --stack-name sagemaker-hyperpod
    $ aws cloudformation wait stack-delete-complete --stack-name sagemaker-hyperpod

Conclusion

Using the NVIDIA NeMo 2.0 framework on SageMaker HyperPod offers a scalable, cost-efficient, and streamlined approach to training large-scale generative AI models. By following the step-by-step deployment process, you can use the power of distributed computing with minimal setup complexity.

About the authors

Abdullahi Olaoye is a Senior AI Solutions Architect at NVIDIA, specializing in integrating NVIDIA AI libraries, frameworks, and products with cloud AI services and open-source tools to optimize AI model deployment, inference, and generative AI workflows. He collaborates with AWS to enhance AI workload performance and drive adoption of NVIDIA-powered AI and generative AI solutions.

Greeshma Nallapareddy is a Sr. Business Development Manager at AWS working with NVIDIA on go-to-market strategy to accelerate AI solutions for customers at scale. Her experience includes leading solutions architecture teams focused on working with startups.

Akshit Arora is a senior data scientist at NVIDIA, where he works on deploying conversational AI models on GPUs at scale. He’s a graduate of University of Colorado at Boulder, where he applied deep learning to improve knowledge tracking on a K-12 online tutoring service. His work spans multilingual text-to-speech, time series classification, ed-tech, and practical applications of deep learning.

Ankur Srivastava is a Sr. Solutions Architect in the ML Frameworks Team. He focuses on helping customers with self-managed distributed training and inference at scale on AWS. His experience includes industrial predictive maintenance, digital twins, probabilistic design optimization and has completed his doctoral studies from Mechanical Engineering at Rice University and post-doctoral research from Massachusetts Institute of Technology.

Eliuth Triana Isaza is a Developer Relations Manager at NVIDIA empowering Amazon AI MLOps, DevOps, Scientists, and AWS technical experts to master the NVIDIA computing stack for accelerating and optimizing generative AI foundation models spanning from data curation, GPU training, model inference, and production deployment on AWS GPU instances. In addition, Eliuth is a passionate mountain biker, skier, tennis and poker player.

NeMo Retriever Llama 3.2 text embedding and reranking NVIDIA NIM microservices now available in Amazon SageMaker JumpStart

Today, we are excited to announce that the NeMo Retriever Llama3.2 Text Embedding and Reranking NVIDIA NIM microservices are available in Amazon SageMaker JumpStart. With this launch, you can now deploy NVIDIA’s optimized reranking and embedding models to build, experiment, and responsibly scale your generative AI ideas on AWS.

In this post, we demonstrate how to get started with these models on SageMaker JumpStart.

About NVIDIA NIM on AWS

NVIDIA NIM microservices integrate closely with AWS managed services such as Amazon Elastic Compute Cloud (Amazon EC2), Amazon Elastic Kubernetes Service (Amazon EKS), and Amazon SageMaker to enable the deployment of generative AI models at scale. As part of NVIDIA AI Enterprise available in AWS Marketplace, NIM is a set of user-friendly microservices designed to streamline and accelerate the deployment of generative AI. These prebuilt containers support a broad spectrum of generative AI models, from open source community models to NVIDIA AI foundation models (FMs) and custom models. NIM microservices provide straightforward integration into generative AI applications using industry-standard APIs and can be deployed with just a few lines of code, or with a few clicks on the SageMaker JumpStart console. Engineered to facilitate seamless generative AI inferencing at scale, NIM helps you deploy your generative AI applications.

Overview of NVIDIA NeMo Retriever NIM microservices

In this section, we provide an overview of the NVIDIA NeMo Retriever NIM microservices discussed in this post.

NeMo Retriever text embedding NIM

The NVIDIA NeMo Retriever Llama3.2 embedding NIM is optimized for multilingual and cross-lingual text question-answering retrieval with support for long documents (up to 8,192 tokens) and dynamic embedding size (Matryoshka Embeddings). This model was evaluated on 26 languages: English, Arabic, Bengali, Chinese, Czech, Danish, Dutch, Finnish, French, German, Hebrew, Hindi, Hungarian, Indonesian, Italian, Japanese, Korean, Norwegian, Persian, Polish, Portuguese, Russian, Spanish, Swedish, Thai, and Turkish. In addition to enabling multilingual and cross-lingual question-answering retrieval, this model reduces the data storage footprint by 35-fold through dynamic embedding sizing and support for longer token length, making it feasible to handle large-scale datasets efficiently.

NeMo Retriever text reranking NIM

The NVIDIA NeMo Retriever Llama3.2 reranking NIM is optimized for providing a logit score that represents how relevant a document is to a given query. The model was fine-tuned for multilingual, cross-lingual text question-answering retrieval, with support for long documents (up to 8,192 tokens). This model was evaluated on the same 26 languages mentioned earlier.

SageMaker JumpStart overview

SageMaker JumpStart is a fully managed service that offers state-of-the-art FMs for various use cases such as content writing, code generation, question answering, copywriting, summarization, classification, and information retrieval. It provides a collection of pre-trained models that you can deploy quickly, accelerating the development and deployment of ML applications. One of the key components of SageMaker JumpStart is model hubs, which offer a vast catalog of pre-trained models, such as Mistral, for a variety of tasks.

Solution overview

You can now discover and deploy the NeMo Retriever text embedding and reranking NIM microservices in Amazon SageMaker Studio or programmatically through the Amazon SageMaker Python SDK, enabling you to derive model performance and MLOps controls with SageMaker features such as Amazon SageMaker Pipelines, Amazon SageMaker Debugger, or container logs. The model is deployed in a secure AWS environment and in your virtual private cloud (VPC), helping to support data security for enterprise security needs.

In the following sections, we demonstrate how to deploy these microservices and run real-time and batch inference.

Make sure your SageMaker AWS Identity and Access Management (IAM) service role has the AmazonSageMakerFullAccess permission policy attached.

To deploy NeMo Retriever Llama3.2 embedding and reranking microservices successfully, confirm one of the following:

  • Make sure your IAM role has the following permissions and you have the authority to make AWS Marketplace subscriptions in the AWS account used:
    • aws-marketplace:ViewSubscriptions
    • aws-marketplace:Unsubscribe
    • aws-marketplace:Subscribe
  • Alternatively, confirm your AWS account has a subscription to the model. If so, you can skip the following deployment instructions and start at the Subscribe to the model package section.

Deploy NeMo Retriever microservices on SageMaker JumpStart

For those new to SageMaker JumpStart, we demonstrate using SageMaker Studio to access models on SageMaker JumpStart. The following screenshot shows the NeMo Retriever text embedding and reranking microservices available on SageMaker JumpStart.

NeMo Retriever text embedding and reranking microservices available on SageMaker JumpStart.

Deployment starts when you choose the Deploy option. You might be prompted to subscribe to this model through AWS Marketplace. If you are already subscribed, then you can move forward with choosing the second Deploy button. After deployment finishes, you will see that an endpoint is created. You can test the endpoint by passing a sample inference request payload or by selecting the testing option using the SDK.

Deploy the NeMo Retriever microservice.

Subscribe to the model package

To subscribe to the model package, complete the following steps:

  1. Depending on the model you want to deploy, open the model package listing page for Llama-3.2-NV-EmbedQA-1B-v2 or Llama-3.2-NV-RerankQA-1B-v2.
  2. On the AWS Marketplace listing, choose Continue to subscribe.
  3. On the Subscribe to this software page, review and choose Accept Offer if you and your organization agree with EULA, pricing, and support terms.
  4. Choose Continue to configuration and then choose an AWS Region.

A product Amazon Resource Name (ARN) will be displayed. This is the model package ARN that you need to specify while creating a deployable model using Boto3.

Deploy NeMo Retriever microservices using the SageMaker SDK

In this section, we walk through deploying the NeMo Retriever text embedding NIM through the SageMaker SDK. A similar process can be followed for deploying the NeMo Retriever text reranking NIM as well.

Define the SageMaker model using the model package ARN

To deploy the model using the SDK, copy the product ARN from the previous step and specify it in the model_package_arn in the following code:

import boto3

# Create a SageMaker client and set the execution role used to create the model
sm = boto3.client("sagemaker")
role = "Specify your SageMaker execution role ARN here"

# Define the model details
model_package_arn = "Specify the model package ARN here"
sm_model_name = "nim-llama-3-2-nv-embedqa-1b-v2"

# Create the SageMaker model
create_model_response = sm.create_model(
    ModelName=sm_model_name,
    PrimaryContainer={
        'ModelPackageName': model_package_arn
    },
    ExecutionRoleArn=role,
    EnableNetworkIsolation=True
)
print("Model Arn: " + create_model_response["ModelArn"])

Create the endpoint configuration

Next, we create an endpoint configuration specifying instance type; in this case, we are using an ml.g5.2xlarge instance type accelerated by NVIDIA A10G GPUs. Make sure you have the account-level service limit for using ml.g5.2xlarge for endpoint usage as one or more instances. To request a service quota increase, refer to AWS service quotas. For further performance improvements, you can use NVIDIA Hopper GPUs (P5 instances) on SageMaker.

# Create the endpoint configuration
endpoint_config_name = sm_model_name

create_endpoint_config_response = sm.create_endpoint_config(
    EndpointConfigName=endpoint_config_name,
    ProductionVariants=[
        {
            'VariantName': 'AllTraffic',
            'ModelName': sm_model_name,
            'InitialInstanceCount': 1,
            'InstanceType': 'ml.g5.2xlarge',
            'InferenceAmiVersion': 'al2-ami-sagemaker-inference-gpu-2',
            'RoutingConfig': {'RoutingStrategy': 'LEAST_OUTSTANDING_REQUESTS'},
            'ModelDataDownloadTimeoutInSeconds': 3600, # Specify the model download timeout in seconds.
            'ContainerStartupHealthCheckTimeoutInSeconds': 3600, # Specify the health checkup timeout in seconds
        }
    ]
)
print("Endpoint Config Arn: " + create_endpoint_config_response["EndpointConfigArn"])

Create the endpoint

Using the preceding endpoint configuration, we create a new SageMaker endpoint and wait for the deployment to finish. The status will change to InService after the deployment is successful.

# Create the endpoint
endpoint_name = endpoint_config_name
create_endpoint_response = sm.create_endpoint(
    EndpointName=endpoint_name,
    EndpointConfigName=endpoint_config_name
)

print("Endpoint Arn: " + create_endpoint_response["EndpointArn"])

Deploy the NIM microservice

Track the NIM microservice deployment with the following code, which polls the endpoint status until the deployment is complete:

import time

resp = sm.describe_endpoint(EndpointName=endpoint_name)
status = resp["EndpointStatus"]
print("Status: " + status)

while status == "Creating":
    time.sleep(60)
    resp = sm.describe_endpoint(EndpointName=endpoint_name)
    status = resp["EndpointStatus"]
    print("Status: " + status)

print("Arn: " + resp["EndpointArn"])
print("Status: " + status)

We get the following output:

Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: InService
Arn: arn:aws:sagemaker:us-west-2:611951037680:endpoint/nim-llama-3-2-nv-embedqa-1b-v2
Status: InService

After you deploy the model, your endpoint is ready for inference. In the following section, we use a sample text to do an inference request. For inference request format, NIM on SageMaker supports the OpenAI API inference protocol (at the time of writing). For an explanation of supported parameters, see Create an embedding vector from the input text.

Inference example with NeMo Retriever text embedding NIM microservice

The NVIDIA NeMo Retriever Llama3.2 embedding model is optimized for multilingual and cross-lingual text question-answering retrieval with support for long documents (up to 8,192 tokens) and dynamic embedding size (Matryoshka Embeddings). In this section, we provide examples of running real-time inference and batch inference.

Real-time inference example

The following code example illustrates how to perform real-time inference using the NeMo Retriever Llama3.2 embedding model:

import json
import pprint

import boto3

# SageMaker runtime client used to invoke the endpoint
client = boto3.client("sagemaker-runtime")

pp1 = pprint.PrettyPrinter(indent=2, width=80, compact=True, depth=3)

input_embedding = '''{
"model": "nvidia/llama-3.2-nv-embedqa-1b-v2",
"input": ["Sample text 1", "Sample text 2"],
"input_type": "query"
}'''

print("Example input data for embedding model endpoint:")
print(input_embedding)

response = client.invoke_endpoint(
    EndpointName=endpoint_name,
    ContentType="application/json",
    Accept="application/json",
    Body=input_embedding
)

print("\nEmbedding endpoint response:")
response = json.load(response["Body"])
pp1.pprint(response)

We get the following output:

Example input data for embedding model endpoint:
{
"model": "nvidia/llama-3.2-nv-embedqa-1b-v2", 
"input": ["Sample text 1", "Sample text 2"],
"input_type": "query"
}

Embedding endpoint response:
{ 'data': [ {'embedding': [...], 'index': 0, 'object': 'embedding'},
            {'embedding': [...], 'index': 1, 'object': 'embedding'}],
  'model': 'nvidia/llama-3.2-nv-embedqa-1b-v2',
  'object': 'list',
  'usage': {'prompt_tokens': 14, 'total_tokens': 14}}
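Because the model produces Matryoshka-style embeddings, the leading dimensions carry most of the semantic signal, so you can truncate the returned vectors client-side to reduce storage. The following hedged sketch reuses the response object from the preceding example; the target dimension of 384 is an illustrative assumption, not a value from the model card.

import numpy as np

# Truncate a full embedding to its leading dimensions and re-normalize it.
full_vector = np.array(response["data"][0]["embedding"])
truncated = full_vector[:384]                      # illustrative target dimension
truncated = truncated / np.linalg.norm(truncated)  # re-normalize for cosine similarity
print(f"Full dimension: {full_vector.shape[0]}, truncated dimension: {truncated.shape[0]}")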

Batch inference example

When you have many documents, you can vectorize each of them with a for loop. This will often result in many requests. Alternatively, you can send requests consisting of batches of documents to reduce the number of requests to the API endpoint. We use the following example with a dataset of 10 documents. Let’s test the model with a number of documents in different languages:

documents = [
"El futuro de la computación cuántica en aplicaciones criptográficas.",
"L’application des réseaux neuronaux dans les systèmes de véhicules autonomes.",
"Analyse der Rolle von Big Data in personalisierten Gesundheitslösungen.",
"L’evoluzione del cloud computing nello sviluppo di software aziendale.",
"Avaliando o impacto da IoT na infraestrutura de cidades inteligentes.",
"Потенциал граничных вычислений для обработки данных в реальном времени.",
"评估人工智能在欺诈检测系统中的有效性。",
"倫理的なAIアルゴリズムの開発における課題と機会。",
"دمج تقنية الجيل الخامس (5G) في تعزيز الاتصال بالإنترنت للأشياء (IoT).",
"सुरक्षित लेनदेन के लिए बायोमेट्रिक प्रमाणीकरण विधियों में प्रगति।"
]

The following code demonstrates how to group the documents into batches and invoke the endpoint repeatedly to vectorize the whole dataset. Specifically, the example code loops over the 10 documents in batches of size 5 (batch_size=5).

pp2 = pprint.PrettyPrinter(indent=2, width=80, compact=True, depth=2)

encoded_data = []
batch_size = 5

# Loop over the documents in increments of the batch size
for i in range(0, len(documents), batch_size):
    input = json.dumps({
        "input": documents[i:i+batch_size],
        "input_type": "passage",
        "model": "nvidia/llama-3.2-nv-embedqa-1b-v2",
    })

    response = client.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="application/json",
        Accept="application/json",
        Body=input,
    )

    response = json.load(response["Body"])

    # Concatenating vectors into a single list; preserve original index
    encoded_data.extend({"embedding": data[1]["embedding"], "index": data[0]} for
                        data in zip(range(i, i+batch_size), response["data"]))

# Print the response data
pp2.pprint(encoded_data)

We get the following output:

[ {'embedding': [...], 'index': 0}, {'embedding': [...], 'index': 1},
  {'embedding': [...], 'index': 2}, {'embedding': [...], 'index': 3},
  {'embedding': [...], 'index': 4}, {'embedding': [...], 'index': 5},
  {'embedding': [...], 'index': 6}, {'embedding': [...], 'index': 7},
  {'embedding': [...], 'index': 8}, {'embedding': [...], 'index': 9}]
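To illustrate how these embeddings support cross-lingual retrieval, the following hedged sketch embeds an English query with the same endpoint and ranks the multilingual documents by cosine similarity. It reuses the client, endpoint_name, documents, and encoded_data objects from the preceding examples; the query text is illustrative.

import json

import numpy as np

# Embed an English query (note input_type "query" for queries, "passage" for documents).
query_payload = json.dumps({
    "model": "nvidia/llama-3.2-nv-embedqa-1b-v2",
    "input": ["Which document discusses smart city infrastructure?"],
    "input_type": "query",
})
query_response = json.load(client.invoke_endpoint(
    EndpointName=endpoint_name,
    ContentType="application/json",
    Accept="application/json",
    Body=query_payload,
)["Body"])
query_vector = np.array(query_response["data"][0]["embedding"])

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Rank the previously encoded passages by similarity to the query.
scores = [(item["index"], cosine(query_vector, np.array(item["embedding"])))
          for item in encoded_data]
for idx, score in sorted(scores, key=lambda s: s[1], reverse=True)[:3]:
    print(f"{score:.3f}  {documents[idx]}")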

Inference example with NeMo Retriever text reranking NIM microservice

The NVIDIA NeMo Retriever Llama3.2 reranking NIM microservice is optimized for providing a logit score that represents how relevant a document is to a given query. The model was fine-tuned for multilingual, cross-lingual text question-answering retrieval, with support for long documents (up to 8,192 tokens).

In the following example, we create an input payload for a list of emails in multiple languages:

payload_model = "nvidia/llama-3.2-nv-rerankqa-1b-v2"
query = {"text": "What emails have been about returning items?"}
documents = [
    {"text":"Contraseña incorrecta. Hola, llevo una hora intentando acceder a mi cuenta y sigue diciendo que mi contraseña es incorrecta. ¿Puede ayudarme, por favor?"},
    {"text":"Confirmation Email Missed. Hi, I recently purchased a product from your website but I never received a confirmation email. Can you please look into this for me?"},
    {"text":"أسئلة حول سياسة الإرجاع. مرحبًا، لدي سؤال حول سياسة إرجاع هذا المنتج. لقد اشتريته قبل بضعة أسابيع وهو معيب"},
    {"text":"Customer Support is Busy. Good morning, I have been trying to reach your customer support team for the past week but I keep getting a busy signal. Can you please help me?"},
    {"text":"Falschen Artikel erhalten. Hallo, ich habe eine Frage zu meiner letzten Bestellung. Ich habe den falschen Artikel erhalten und muss ihn zurückschicken."},
    {"text":"Customer Service is Unavailable. Hello, I have been trying to reach your customer support team for the past hour but I keep getting a busy signal. Can you please help me?"},
    {"text":"Return Policy for Defective Product. Hi, I have a question about the return policy for this product. I purchased it a few weeks ago and it is defective."},
    {"text":"收到错误物品. 早上好,关于我最近的订单,我有一个问题。我收到了错误的商品,需要退货。"},
    {"text":"Return Defective Product. Hello, I have a question about the return policy for this product. I purchased it a few weeks ago and it is defective."}
]

payload = {
  "model": payload_model,
  "query": query,
  "passages": documents,
  "truncate": "END"
}

response = client.invoke_endpoint(
    EndpointName=endpoint_name,
    ContentType="application/json",
    Body=json.dumps(payload)
)

output = json.loads(response["Body"].read().decode("utf8"))
print(f'Documents: {response}')
print(json.dumps(output, indent=2))

In this example, the relevance (logit) scores are raw values and are not normalized to the range [0, 1]. Higher scores indicate higher relevance to the query, and lower (more negative) scores indicate lower relevance.

Documents: {'ResponseMetadata': {'RequestId': 'a3f19e06-f468-4382-a927-3485137ffcf6', 'HTTPStatusCode': 200, 'HTTPHeaders': {'x-amzn-requestid': 'a3f19e06-f468-4382-a927-3485137ffcf6', 'x-amzn-invoked-production-variant': 'AllTraffic', 'date': 'Tue, 04 Mar 2025 21:46:39 GMT', 'content-type': 'application/json', 'content-length': '349', 'connection': 'keep-alive'}, 'RetryAttempts': 0}, 'ContentType': 'application/json', 'InvokedProductionVariant': 'AllTraffic', 'Body': <botocore.response.StreamingBody object at 0x7fbb00ff94b0>}
{
  "rankings": [
    {
      "index": 4,
      "logit": 0.0791015625
    },
    {
      "index": 8,
      "logit": -0.1904296875
    },
    {
      "index": 7,
      "logit": -2.583984375
    },
    {
      "index": 2,
      "logit": -4.71484375
    },
    {
      "index": 6,
      "logit": -5.34375
    },
    {
      "index": 1,
      "logit": -5.64453125
    },
    {
      "index": 5,
      "logit": -11.96875
    },
    {
      "index": 3,
      "logit": -12.2265625
    },
    {
      "index": 0,
      "logit": -16.421875
    }
  ],
  "usage": {
    "prompt_tokens": 513,
    "total_tokens": 513
  }
}

Let’s see the top-ranked document for our query:

# 1. Extract the array of rankings
rankings = output["rankings"]  # or output.get("rankings", [])

# 2. Get the top-ranked entry (highest logit)
top_ranked_entry = rankings[0]
top_index = top_ranked_entry["index"]  # e.g. 4 in your example

# 3. Retrieve the corresponding document
top_document = documents[top_index]

print("Top-ranked document:")
print(top_document)

The following is the top-ranked document based on the provided relevance scores:

Top-ranked document:
{'text': 'Falschen Artikel erhalten. Hallo, ich habe eine Frage zu meiner letzten Bestellung. Ich habe den falschen Artikel erhalten und muss ihn zurückschicken.'}

This translates to the following:

"Wrong item received. Hello, I have a question about my last order. I received the wrong item and need to return it."

Based on the preceding results from the model, we see that a higher logit indicates stronger alignment with the query, whereas lower (or more negative) values indicate lower relevance. In this case, the document discussing receiving the wrong item (in German) was ranked first with the highest logit, confirming that the model quickly and effectively identified it as the most relevant passage regarding product returns.
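If you prefer bounded scores for thresholding or display, one common option (a client-side illustration, not part of the NIM response) is to pass the logits through a sigmoid, which maps them into (0, 1) without changing the ranking:

import math

# Map raw logits to (0, 1) scores with a sigmoid; the ordering is unchanged.
for entry in output["rankings"]:
    score = 1.0 / (1.0 + math.exp(-entry["logit"]))
    print(f"index={entry['index']}  logit={entry['logit']:.4f}  sigmoid={score:.4f}")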

Clean up

To clean up your resources, use the following commands:

sm.delete_model(ModelName=sm_model_name)
sm.delete_endpoint_config(EndpointConfigName=endpoint_config_name)
sm.delete_endpoint(EndpointName=endpoint_name)

Conclusion

The NVIDIA NeMo Retriever Llama 3.2 NIM microservices bring powerful multilingual capabilities to enterprise search and retrieval systems. These models excel in diverse use cases, including cross-lingual search applications, enterprise knowledge bases, customer support systems, and content recommendation engines. The text embedding NIM’s dynamic embedding size (Matryoshka Embeddings) reduces storage footprint by 35-fold while supporting 26 languages and documents up to 8,192 tokens. The reranking NIM accurately scores document relevance across languages, enabling precise information retrieval even for multilingual content. For organizations managing global knowledge bases or customer-facing search experiences, these NVIDIA-optimized microservices provide a significant advantage in latency, accuracy, and efficiency, allowing developers to quickly deploy sophisticated search capabilities without compromising on performance or linguistic diversity.

SageMaker JumpStart provides a straightforward way to use state-of-the-art large language FMs for text embedding and reranking. Through the UI or just a few lines of code, you can deploy a highly accurate text embedding model to generate dense vector representations that capture semantic meaning and a reranking model to find semantic matches and retrieve the most relevant information from various data stores at scale and cost-efficiently.


About the Authors

Niithiyn Vijeaswaran is a Generative AI Specialist Solutions Architect with the Third-Party Model Science team at AWS. His area of focus is AWS AI accelerators (AWS Neuron). He holds a Bachelor’s in Computer Science and Bioinformatics.

Greeshma Nallapareddy is a Sr. Business Development Manager at AWS working with NVIDIA on go-to-market strategy to accelerate AI solutions for customers at scale. Her experience includes leading solutions architecture teams focused on working with startups.

Abhishek Sawarkar is a product manager in the NVIDIA AI Enterprise team working on integrating NVIDIA AI Software in Cloud MLOps platforms. He focuses on integrating the NVIDIA AI end-to-end stack within cloud platforms and enhancing user experience on accelerated computing.

Abdullahi Olaoye is a Senior AI Solutions Architect at NVIDIA, specializing in integrating NVIDIA AI libraries, frameworks, and products with cloud AI services and open source tools to optimize AI model deployment, inference, and generative AI workflows. He collaborates with AWS to enhance AI workload performance and drive adoption of NVIDIA-powered AI and generative AI solutions.

Banu Nagasundaram leads product, engineering, and strategic partnerships for Amazon SageMaker JumpStart, the machine learning and generative AI hub provided by Amazon SageMaker. She is passionate about building solutions that help customers accelerate their AI journey and unlock business value.

Chase Pinkerton is a Startups Solutions Architect at Amazon Web Services. He holds a Bachelor’s in Computer Science with a minor in Economics from Tufts University. He’s passionate about helping startups grow and scale their businesses. When not working, he enjoys road cycling, hiking, playing volleyball, and photography.

Eliuth Triana Isaza is a Developer Relations Manager at NVIDIA, empowering Amazon’s AI MLOps, DevOps, scientists, and AWS technical experts to master the NVIDIA computing stack for accelerating and optimizing generative AI foundation models spanning from data curation, GPU training, model inference, and production deployment on AWS GPU instances. In addition, Eliuth is a passionate mountain biker, skier, and tennis and poker player.

Amazon Bedrock Guardrails announces IAM Policy-based enforcement to deliver safe AI interactions

As generative AI adoption accelerates across enterprises, maintaining safe, responsible, and compliant AI interactions has never been more critical. Amazon Bedrock Guardrails provides configurable safeguards that help organizations build generative AI applications with industry-leading safety protections. With Amazon Bedrock Guardrails, you can implement safeguards in your generative AI applications that are customized to your use cases and responsible AI policies. You can create multiple guardrails tailored to different use cases and apply them across multiple foundation models (FMs), improving user experiences and standardizing safety controls across generative AI applications. Beyond Amazon Bedrock models, the service offers the flexible ApplyGuardrail API that enables you to assess text using your pre-configured guardrails without invoking FMs, allowing you to implement safety controls across generative AI applications—whether running on Amazon Bedrock or on other systems—at both input and output levels.
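As a quick illustration of the standalone ApplyGuardrail API, the following is a hedged sketch using the AWS SDK for Python (Boto3); the guardrail identifier, version, and sample text are placeholders for values from your own account.

import boto3

bedrock_runtime = boto3.client("bedrock-runtime")

# Placeholder guardrail identifier and version; replace with your own values.
response = bedrock_runtime.apply_guardrail(
    guardrailIdentifier="exampleguardrail",
    guardrailVersion="1",
    source="INPUT",  # evaluate user input; use "OUTPUT" for model responses
    content=[{"text": {"text": "How do I return a defective product?"}}],
)
print(response["action"])   # for example, NONE or GUARDRAIL_INTERVENED
print(response["outputs"])  # any masked or replacement text produced by the guardrail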

Today, we’re announcing a significant enhancement to Amazon Bedrock Guardrails: AWS Identity and Access Management (IAM) policy-based enforcement. This powerful capability enables security and compliance teams to establish mandatory guardrails for every model inference call, making sure organizational safety policies are consistently enforced across AI interactions. This feature enhances AI governance by enabling centralized control over guardrail implementation.

Challenges with building generative AI applications

Organizations deploying generative AI face critical governance challenges: content appropriateness, where models might produce undesirable responses to problematic prompts; safety concerns, with potential generation of harmful content even from innocent prompts; privacy protection requirements for handling sensitive information; and consistent policy enforcement across AI deployments.

Perhaps most challenging is making sure that appropriate safeguards are applied consistently across AI interactions within an organization, regardless of which team or individual is developing or deploying applications.

Amazon Bedrock Guardrails capabilities

Amazon Bedrock Guardrails enables you to implement safeguards in generative AI applications customized to your specific use cases and responsible AI policies. Guardrails currently supports six types of policies:

  • Content filters – Configurable thresholds across six harmful categories: hate, insults, sexual, violence, misconduct, and prompt injections
  • Denied topics – Definition of specific topics to be avoided in the context of an application
  • Sensitive information filters – Detection and removal of personally identifiable information (PII) and custom regex entities to protect user privacy
  • Word filters – Blocking of specific words in generative AI applications, such as harmful words, profanity, or competitor names and products
  • Contextual grounding checks – Detection and filtering of hallucinations in model responses by verifying if the response is properly grounded in the provided reference source and relevant to the user query
  • Automated reasoning – Prevention of factual errors from hallucinations using sound mathematical, logic-based algorithmic verification and reasoning processes to verify the information generated by a model, so outputs align with known facts and aren’t based on fabricated or inconsistent data

Policy-based enforcement of guardrails

Security teams often have organizational requirements to enforce the use of Amazon Bedrock Guardrails for every inference call to Amazon Bedrock. To support this requirement, Amazon Bedrock Guardrails provides the new IAM condition key bedrock:GuardrailIdentifier, which can be used in IAM policies to enforce the use of a specific guardrail for model inference. The condition key can be applied to model inference APIs such as InvokeModel and InvokeModelWithResponseStream, which are used in the policy examples that follow.

The following diagram illustrates the policy-based enforcement workflow.

If the guardrail configured in your IAM policy doesn’t match the guardrail specified in the request, the request will be rejected with an access denied exception, enforcing compliance with organizational policies.
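With such a policy in place, every inference request must reference the mandated guardrail. The following is a hedged sketch using the AWS SDK for Python (Boto3) that matches the shape of the first policy example below; the guardrail ARN, version, and model ID are placeholders.

import json

import boto3

bedrock_runtime = boto3.client("bedrock-runtime")

# Placeholders: use the guardrail ARN and version required by your IAM policy and a
# model you have access to in your Region.
response = bedrock_runtime.invoke_model(
    modelId="anthropic.claude-3-haiku-20240307-v1:0",
    guardrailIdentifier="arn:aws:bedrock:<region>:<account-id>:guardrail/exampleguardrail",
    guardrailVersion="1",
    body=json.dumps({
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 256,
        "messages": [{"role": "user", "content": "Summarize our return policy."}],
    }),
)
print(json.loads(response["body"].read()))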

Policy examples

In this section, we present several policy examples demonstrating how to enforce guardrails for model inference.

Example 1: Enforce the use of a specific guardrail and its numeric version

The following example illustrates the enforcement of exampleguardrail and its numeric version 1 during model inference:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "InvokeFoundationModelStatement1",
            "Effect": "Allow",
            "Action": [
                "bedrock:InvokeModel",
                "bedrock:InvokeModelWithResponseStream"
            ],
            "Resource": [
                "arn:aws:bedrock:region::foundation-model/*"
            ],
            "Condition": {
                "StringEquals": {
                    "bedrock:GuardrailIdentifier": "arn:aws:bedrock:<region>:<account-id>:guardrail/exampleguardrail:1"
                }
            }
        },
        {
            "Sid": "InvokeFoundationModelStatement2",
            "Effect": "Deny",
            "Action": [
                "bedrock:InvokeModel",
                "bedrock:InvokeModelWithResponseStream"
            ],
            "Resource": [
                "arn:aws:bedrock:region::foundation-model/*"
            ],
            "Condition": {
                "StringNotEquals": {
                    "bedrock:GuardrailIdentifier": "arn:aws:bedrock:<region>:<account-id>:guardrail/exampleguardrail:1"
                }
            }
        },
        {
            "Sid": "ApplyGuardrail",
            "Effect": "Allow",
            "Action": [
                "bedrock:ApplyGuardrail"
            ],
            "Resource": [
                "arn:aws:bedrock:<region>:<account-id>:guardrail/exampleguardrail"
            ]
        }
    ]
}

The added explicit deny rejects requests that call the listed actions with any other GuardrailIdentifier and GuardrailVersion values, irrespective of other permissions the user might have.

Example 2: Enforce the use of a specific guardrail and its draft version

The following example illustrates the enforcement of exampleguardrail and its draft version during model inference:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "InvokeFoundationModelStatement1",
            "Effect": "Allow",
            "Action": [
                "bedrock:InvokeModel",
                "bedrock:InvokeModelWithResponseStream"
            ],
            "Resource": [
                "arn:aws:bedrock:region::foundation-model/*"
            ],
            "Condition": {
                "StringEquals": {
                    "bedrock:GuardrailIdentifier": "arn:aws:bedrock:<region>:<account-id>:guardrail/exampleguardrail"
                }
            }
        },
        {
            "Sid": "InvokeFoundationModelStatement2",
            "Effect": "Deny",
            "Action": [
                "bedrock:InvokeModel",
                "bedrock:InvokeModelWithResponseStream"
            ],
            "Resource": [
                "arn:aws:bedrock:region::foundation-model/*"
            ],
            "Condition": {
                "StringNotEquals": {
                    "bedrock:GuardrailIdentifier": "arn:aws:bedrock:<region>:<account-id>:guardrail/exampleguardrail"
                }
            }
        },
        {
            "Sid": "ApplyGuardrail",
            "Effect": "Allow",
            "Action": [
                "bedrock:ApplyGuardrail"
            ],
            "Resource": [
                "arn:aws:bedrock:<region>:<account-id>:guardrail/exampleguardrail"
            ]
        }
    ]
}

Example 3: Enforce the use of a specific guardrail and its numeric versions

The following example illustrates the enforcement of exampleguardrail and its numeric versions during model inference:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "InvokeFoundationModelStatement1",
            "Effect": "Allow",
            "Action": [
                "bedrock:InvokeModel",
                "bedrock:InvokeModelWithResponseStream"
            ],
            "Resource": [
                "arn:aws:bedrock:region::foundation-model/*"
            ],
            "Condition": {
                "StringLike": {
                    "bedrock:GuardrailIdentifier": "arn:aws:bedrock:<region>:<account-id>:guardrail/exampleguardrail:*"
                }
            }
        },
        {
            "Sid": "InvokeFoundationModelStatement2",
            "Effect": "Deny",
            "Action": [
                "bedrock:InvokeModel",
                "bedrock:InvokeModelWithResponseStream"
            ],
            "Resource": [
                "arn:aws:bedrock:region::foundation-model/*"
            ],
            "Condition": {
                "StringNotLike": {
                    "bedrock:GuardrailIdentifier": "arn:aws:bedrock:<region>:<account-id>:guardrail/exampleguardrail:*"
                }
            }
        },
        {
            "Sid": "ApplyGuardrail",
            "Effect": "Allow",
            "Action": [
                "bedrock:ApplyGuardrail"
            ],
            "Resource": [
                "arn:aws:bedrock:<region>:<account-id>:guardrail/exampleguardrail"
            ]
        }
    ]
}

Example 4: Enforce the use of a specific guardrail and its versions, including the draft

The following example illustrates the enforcement of exampleguardrail and its versions, including the draft, during model inference:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "InvokeFoundationModelStatement1",
            "Effect": "Allow",
            "Action": [
                "bedrock:InvokeModel",
                "bedrock:InvokeModelWithResponseStream"
            ],
            "Resource": [
                "arn:aws:bedrock:region::foundation-model/*"
            ],
            "Condition": {
                "StringLike": {
                    "bedrock:GuardrailIdentifier": "arn:aws:bedrock:<region>:<account-id>:guardrail/exampleguardrail*"
                }
            }
        },
        {
            "Sid": "InvokeFoundationModelStatement2",
            "Effect": "Deny",
            "Action": [
                "bedrock:InvokeModel",
                "bedrock:InvokeModelWithResponseStream"
            ],
            "Resource": [
                "arn:aws:bedrock:region::foundation-model/*"
            ],
            "Condition": {
                "StringNotLike": {
                    "bedrock:GuardrailIdentifier": "arn:aws:bedrock:<region>:<account-id>:guardrail/exampleguardrail*"
                }
            }
        },
        {
            "Sid": "ApplyGuardrail",
            "Effect": "Allow",
            "Action": [
                "bedrock:ApplyGuardrail"
            ],
            "Resource": [
                "arn:aws:bedrock:<region>:<account-id>:guardrail/exampleguardrail"
            ]
        }
    ]
}

Example 5: Enforce the use of a specific guardrail and version pair from a list of guardrail and version pairs

The following example illustrates the enforcement of exampleguardrail1 and its version 1, exampleguardrail2 and its version 2, or exampleguardrail3 and its draft version during model inference:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "InvokeFoundationModelStatement1",
            "Effect": "Allow",
            "Action": [
                "bedrock:InvokeModel",
                "bedrock:InvokeModelWithResponseStream"
            ],
            "Resource": [
                "arn:aws:bedrock:region::foundation-model/*"
            ],
            "Condition": {
                "StringEquals": {
                    "bedrock:GuardrailIdentifier": [
                        "arn:aws:bedrock:<region>:<account-id>:guardrail/exampleguardrail1:1",
                        "arn:aws:bedrock:<region>:<account-id>:guardrail/exampleguardrail2:2",
                        "arn:aws:bedrock:<region>:<account-id>:guardrail/exampleguardrail3"
                    ]
                }
            }
        },
        {
            "Sid": "InvokeFoundationModelStatement2",
            "Effect": "Deny",
            "Action": [
                "bedrock:InvokeModel",
                "bedrock:InvokeModelWithResponseStream"
            ],
            "Resource": [
                "arn:aws:bedrock:region::foundation-model/*"
            ],
            "Condition": {
                "StringNotEquals": {
                    "bedrock:GuardrailIdentifier": [
                        "arn:aws:bedrock:<region>:<account-id>:guardrail/exampleguardrail1:1",
                        "arn:aws:bedrock:<region>:<account-id>:guardrail/exampleguardrail2:2",
                        "arn:aws:bedrock:<region>:<account-id>:guardrail/exampleguardrail3"
                    ]
                }
            }
        },
        {
            "Sid": "ApplyGuardrail",
            "Effect": "Allow",
            "Action": [
                "bedrock:ApplyGuardrail"
            ],
            "Resource": [
                "arn:aws:bedrock:<region>:<account-id>:guardrail/exampleguardrail1",
                "arn:aws:bedrock:<region>:<account-id>:guardrail/exampleguardrail2",
                "arn:aws:bedrock:<region>:<account-id>:guardrail/exampleguardrail3"
            ]
        }
    ]
}
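
To satisfy these policies at request time, the caller must pass the enforced guardrail when invoking the model so that the bedrock:GuardrailIdentifier condition matches. The following minimal sketch is illustrative only; it assumes a placeholder account ID and Anthropic's Claude 3.5 Sonnet as the target model, and passes exampleguardrail1 version 1 from the last example:

import boto3
import json

bedrock_runtime = boto3.client("bedrock-runtime", region_name="us-east-1")

# Placeholder ARN; substitute your own Region, account ID, and guardrail name.
guardrail_arn = "arn:aws:bedrock:us-east-1:111122223333:guardrail/exampleguardrail1"

body = json.dumps({
    "anthropic_version": "bedrock-2023-05-31",
    "max_tokens": 256,
    "messages": [{"role": "user", "content": "Summarize our refund policy."}],
})

response = bedrock_runtime.invoke_model(
    modelId="anthropic.claude-3-5-sonnet-20240620-v1:0",
    body=body,
    guardrailIdentifier=guardrail_arn,  # must correspond to a guardrail ARN allowed by the policy
    guardrailVersion="1",               # omitting or changing the guardrail triggers the Deny statement
)

print(json.loads(response["body"].read())["content"][0]["text"])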

Known limitations

When implementing policy-based guardrail enforcement, be aware of these limitations:

  • At the time of this writing, Amazon Bedrock Guardrails doesn’t support resource-based policies for cross-account access.
  • If a user assumes a role that has a specific guardrail configured using the bedrock:GuardrailIdentifier condition key, the user can strategically use input tags to help avoid having guardrail checks applied to certain parts of their prompt. Input tags let users mark the specific sections of text that should be processed by the guardrail, leaving other sections unprocessed. For example, a user could intentionally leave sensitive or potentially harmful content outside of the tagged sections, preventing those portions from being evaluated against the guardrail policies. However, regardless of how the prompt is structured or tagged, the guardrail is still fully applied to the model’s response (see the sketch after this list).
  • If a user has a role configured with a specific guardrail requirement (using the bedrock:GuardrailIdentifier condition), they shouldn’t use that same role to access services like Amazon Bedrock Knowledge Bases RetrieveAndGenerate or Amazon Bedrock Agents InvokeAgent. These higher-level services work by making multiple InvokeModel calls behind the scenes on the user’s behalf. Although some of these calls might include the required guardrail, others don’t. When the system attempts to make these guardrail-free calls using a role that requires guardrails, it results in AccessDenied errors, breaking the functionality of these services. To help avoid this issue, organizations should separate permissions—using different roles for direct model access with guardrails versus access to these composite Amazon Bedrock services.
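
The following minimal sketch illustrates the input-tag behavior described above. It assumes a placeholder guardrail ARN and Anthropic's Claude 3.5 Sonnet, and uses the guardContent tag with a matching tagSuffix value in the request body, which is the convention Amazon Bedrock uses for selective input evaluation at the time of writing; the model's response is still evaluated in full:

import boto3
import json

bedrock_runtime = boto3.client("bedrock-runtime", region_name="us-east-1")

# Only the text inside the guardContent tag is evaluated as input by the guardrail.
# The suffix ("abc123" here) is an arbitrary value chosen by the caller and must match
# the tagSuffix declared in amazon-bedrock-guardrailConfig below.
prompt = (
    "<amazon-bedrock-guardrails-guardContent_abc123>"
    "What is our refund policy?"
    "</amazon-bedrock-guardrails-guardContent_abc123>"
    " Text outside the tags is not evaluated as input."
)

body = json.dumps({
    "anthropic_version": "bedrock-2023-05-31",
    "max_tokens": 256,
    "messages": [{"role": "user", "content": prompt}],
    "amazon-bedrock-guardrailConfig": {"tagSuffix": "abc123"},
})

response = bedrock_runtime.invoke_model(
    modelId="anthropic.claude-3-5-sonnet-20240620-v1:0",
    body=body,
    guardrailIdentifier="arn:aws:bedrock:us-east-1:111122223333:guardrail/exampleguardrail",
    guardrailVersion="1",
)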

Conclusion

The new IAM policy-based guardrail enforcement in Amazon Bedrock represents a crucial advancement in AI governance as generative AI becomes integrated into business operations. By enabling centralized policy enforcement, security teams can maintain consistent safety controls across AI applications regardless of who develops or deploys them, effectively mitigating risks related to harmful content, privacy violations, and bias. This approach offers significant advantages: it scales efficiently as organizations expand their AI initiatives without creating administrative bottlenecks, helps prevent technical debt by standardizing safety implementations, and enhances the developer experience by allowing teams to focus on innovation rather than compliance mechanics.

This capability demonstrates organizational commitment to responsible AI practices through comprehensive monitoring and audit mechanisms. Organizations can use model invocation logging in Amazon Bedrock to capture complete request and response data in Amazon CloudWatch Logs or Amazon Simple Storage Service (Amazon S3) buckets, including specific guardrail trace documentation showing when and how content was filtered. Combined with AWS CloudTrail integration that records guardrail configurations and policy enforcement actions, businesses can confidently scale their generative AI initiatives with appropriate safety mechanisms protecting their brand, customers, and data—striking the essential balance between innovation and ethical responsibility needed to build trust in AI systems.

Get started today with Amazon Bedrock Guardrails and implement configurable safeguards that balance innovation with responsible AI governance across your organization.


About the Authors

Shyam Srinivasan is on the Amazon Bedrock Guardrails product team. He cares about making the world a better place through technology and loves being part of this journey. In his spare time, Shyam likes to run long distances, travel around the world, and experience new cultures with family and friends.

Antonio Rodriguez is a Principal Generative AI Specialist Solutions Architect at AWS. He helps companies of all sizes solve their challenges, embrace innovation, and create new business opportunities with Amazon Bedrock. Apart from work, he loves to spend time with his family and play sports with his friends.

Satveer Khurpa is a Sr. WW Specialist Solutions Architect, Amazon Bedrock at Amazon Web Services. In this role, he uses his expertise in cloud-based architectures to develop innovative generative AI solutions for clients across diverse industries. Satveer’s deep understanding of generative AI technologies allows him to design scalable, secure, and responsible applications that unlock new business opportunities and drive tangible value.

Read More

Build your gen AI–based text-to-SQL application using RAG, powered by Amazon Bedrock (Claude 3 Sonnet and Amazon Titan for embedding)

SQL is one of the key languages widely used across businesses, and it requires an understanding of databases and table metadata. This can be overwhelming for nontechnical users who lack proficiency in SQL. Today, generative AI can help bridge this knowledge gap for nontechnical users to generate SQL queries by using a text-to-SQL application. This application allows users to ask questions in natural language and then generates a SQL query for the user’s request.

Large language models (LLMs) are trained to generate accurate SQL queries for natural language instructions. However, off-the-shelf LLMs can’t be used without some modification. First, LLMs don’t have access to enterprise databases, and they need to be customized to understand the specific database of an enterprise. Additionally, complexity increases when columns have synonyms and when internal metrics are involved.

The limitation of LLMs in understanding enterprise datasets and human context can be addressed using Retrieval Augmented Generation (RAG). In this post, we explore using Amazon Bedrock to create a text-to-SQL application using RAG. We use Anthropic’s Claude 3.5 Sonnet to generate SQL queries and Amazon Titan Text Embeddings v2 for text embedding, both accessed through Amazon Bedrock.

Amazon Bedrock is a fully managed service that offers a choice of high-performing foundation models (FMs) from leading AI companies like AI21 Labs, Anthropic, Cohere, Meta, Mistral AI, Stability AI, and Amazon through a single API, along with a broad set of capabilities you need to build generative AI applications with security, privacy, and responsible AI.

Solution overview

This solution is primarily based on the following services:

  1. Foundational model – We use Anthropic’s Claude 3.5 Sonnet on Amazon Bedrock as our LLM to generate SQL queries for user inputs.
  2. Vector embeddings – We use Amazon Titan Text Embeddings v2 on Amazon Bedrock for embeddings. Embedding is the process by which text, images, and audio are given a numerical representation in a vector space. Embedding is usually performed by a machine learning (ML) model. The following diagram provides more details about embeddings.
  3. RAG – We use RAG for providing more context about table schema, column synonyms, and sample queries to the FM. RAG is a framework for building generative AI applications that can make use of enterprise data sources and vector databases to overcome knowledge limitations. RAG works by using a retriever module to find relevant information from an external data store in response to a user’s prompt. This retrieved data is used as context, combined with the original prompt, to create an expanded prompt that is passed to the LLM. The language model then generates a SQL query that incorporates the enterprise knowledge. The following diagram illustrates the RAG framework.
  4. Streamlit – This open source Python library makes it straightforward to create and share beautiful, custom web apps for ML and data science. In just a few minutes you can build powerful data apps using only Python.

The following diagram shows the solution architecture.


We need to provide the LLM with information about the enterprise-specific database. This makes sure that the model can correctly understand the database and generate responses tailored to the enterprise’s data schema and tables. There are multiple file formats available for storing this information, such as JSON, PDF, TXT, and YAML. In our case, we created JSON files to store the table schema, table descriptions, columns with synonyms, and sample queries. JSON’s inherently structured format allows for a clear and organized representation of complex data such as table schemas, column definitions, synonyms, and sample queries. This structure facilitates quick parsing and manipulation of data in most programming languages, reducing the need for custom parsing logic.

There can be multiple tables with similar information, which can lower the model’s accuracy. To increase the accuracy, we categorized the tables into four different types based on the schema and created four JSON files to store the different tables. We added a dropdown menu with four choices. Each choice represents one of these four categories and is linked to an individual JSON file. After the user selects a value from the dropdown menu, the relevant JSON file is passed to Amazon Titan Text Embeddings v2, which converts text into embeddings. These embeddings are stored in a vector database for faster retrieval.

We added the prompt template to the FM to define the roles and responsibilities of the model. You can add additional information such as which SQL engine should be used to generate the SQL queries.

When the user provides the input through the chat prompt, we use similarity search to find the relevant table metadata from the vector database for the user’s query. The user input is combined with the relevant table metadata and the prompt template, which is passed to the FM as a single combined input. The FM generates the SQL query based on this final input.
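
To illustrate the retrieval step in isolation, the following minimal sketch assumes the vector index object returned by the get_index function defined later in library.py and runs a plain similarity search against its underlying FAISS store:

import library as lib  # library.py is created in a later section of this post

# Build the vector index for one schema type (get_index is defined later in library.py).
index = lib.get_index("Schema_Type_A")

user_question = "Count of orders cancelled by customer id: 9782226"

# Retrieve the table-metadata chunks most similar to the user's question.
relevant_chunks = index.vectorstore.similarity_search(user_question, k=2)

for doc in relevant_chunks:
    print(doc.page_content[:200])  # Preview the retrieved table metadata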

To evaluate the model’s accuracy and track usage, we store every user input and output in Amazon Simple Storage Service (Amazon S3).

Prerequisites

To create this solution, complete the following prerequisites:

  1. Sign up for an AWS account if you don’t already have one.
  2. Enable model access for Amazon Titan Text Embeddings v2 and Anthropic’s Claude 3.5 Sonnet on Amazon Bedrock.
  3. Create an S3 bucket named simplesql-logs-****, replacing **** with your unique identifier. Bucket names are globally unique across Amazon S3.
  4. Choose your testing environment. We recommend that you test in Amazon SageMaker Studio, although you can use other available local environments.
  5. Install the following libraries to execute the code:
    pip install streamlit
    pip install jq
    pip install openpyxl
    pip install "faiss-cpu"
    pip install langchain

Procedure

There are three main components in this solution:

  1. JSON files store the table schema and configure the LLM
  2. Vector indexing using Amazon Bedrock
  3. Streamlit for the front-end UI

You can download all three components and code snippets provided in the following section.

Generate the table schema

We use the JSON format to store the table schema. To provide more inputs to the model, we added a table name and its description, columns and their synonyms, and sample queries in our JSON files. Create a JSON file as Table_Schema_A.json by copying the following code into it:

{
  "tables": [
    {
      "separator": "table_1",
      "name": "schema_a.orders",
      "schema": "CREATE TABLE schema_a.orders (order_id character varying(200), order_date timestamp without time zone, customer_id numeric(38,0), order_status character varying(200), item_id character varying(200) );",
      "description": "This table stores information about orders placed by customers.",
      "columns": [
        {
          "name": "order_id",
          "description": "unique identifier for orders.",
          "synonyms": ["order id"]
        },
        {
          "name": "order_date",
          "description": "timestamp when the order was placed",
          "synonyms": ["order time", "order day"]
        },
        {
          "name": "customer_id",
          "description": "Id of the customer associated with the order",
          "synonyms": ["customer id", "userid"]
        },
        {
          "name": "order_status",
          "description": "current status of the order, sample values are: shipped, delivered, cancelled",
          "synonyms": ["order status"]
        },
        {
          "name": "item_id",
          "description": "item associated with the order",
          "synonyms": ["item id"]
        }
      ],
      "sample_queries": [
        {
          "query": "select count(order_id) as total_orders from schema_a.orders where customer_id = '9782226' and order_status = 'cancelled'",
          "user_input": "Count of orders cancelled by customer id: 978226"
        }
      ]
    },
    {
      "separator": "table_2",
      "name": "schema_a.customers",
      "schema": "CREATE TABLE schema_a.customers (customer_id numeric(38,0), customer_name character varying(200), registration_date timestamp without time zone, country character varying(200) );",
      "description": "This table stores the details of customers.",
      "columns": [
        {
          "name": "customer_id",
          "description": "Id of the customer, unique identifier for customers",
          "synonyms": ["customer id"]
        },
        {
          "name": "customer_name",
          "description": "name of the customer",
          "synonyms": ["name"]
        },
        {
          "name": "registration_date",
          "description": "registration timestamp when customer registered",
          "synonyms": ["sign up time", "registration time"]
        },
        {
          "name": "country",
          "description": "customer's original country",
          "synonyms": ["location", "customer's region"]
        }
      ],
      "sample_queries": [
        {
          "query": "select count(customer_id) as total_customers from schema_a.customers where country = 'India' and to_char(registration_date, 'YYYY') = '2024'",
          "user_input": "The number of customers registered from India in 2024"
        },
        {
          "query": "select count(o.order_id) as order_count from schema_a.orders o join schema_a.customers c on o.customer_id = c.customer_id where c.customer_name = 'john' and to_char(o.order_date, 'YYYY-MM') = '2024-01'",
          "user_input": "Total orders placed in January 2024 by customer name john"
        }
      ]
    },
    {
      "separator": "table_3",
      "name": "schema_a.items",
      "schema": "CREATE TABLE schema_a.items (item_id character varying(200), item_name character varying(200), listing_date timestamp without time zone );",
      "description": "This table stores the complete details of items listed in the catalog.",
      "columns": [
        {
          "name": "item_id",
          "description": "Id of the item, unique identifier for items",
          "synonyms": ["item id"]
        },
        {
          "name": "item_name",
          "description": "name of the item",
          "synonyms": ["name"]
        },
        {
          "name": "listing_date",
          "description": "listing timestamp when the item was registered",
          "synonyms": ["listing time", "registration time"]
        }
      ],
      "sample_queries": [
        {
          "query": "select count(item_id) as total_items from schema_a.items where to_char(listing_date, 'YYYY') = '2024'",
          "user_input": "how many items are listed in 2024"
        },
        {
          "query": "select count(o.order_id) as order_count from schema_a.orders o join schema_a.customers c on o.customer_id = c.customer_id join schema_a.items i on o.item_id = i.item_id where c.customer_name = 'john' and i.item_name = 'iphone'",
          "user_input": "how many orders are placed for item 'iphone' by customer name john"
        }
      ]
    }
  ]
}
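
Optionally, you can quickly confirm that the file is valid JSON and list the tables it defines with a few lines of Python:

import json

# Quick check: parse the schema file and list the tables it defines.
with open("Table_Schema_A.json") as f:
    schema = json.load(f)

for table in schema["tables"]:
    print(table["name"], "-", table["description"])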

Configure the LLM and initialize vector indexing using Amazon Bedrock

Create a Python file as library.py by following these steps:

  1. Add the following import statements to add the necessary libraries:
    import boto3  # AWS SDK for Python
    from langchain_community.document_loaders import JSONLoader  # Utility to load JSON files
    from langchain_community.chat_models import BedrockChat  # Chat interface for Bedrock LLM
    from langchain.embeddings import BedrockEmbeddings  # Embeddings for Titan model
    from langchain.memory import ConversationBufferWindowMemory  # Memory to store chat conversations
    from langchain.indexes import VectorstoreIndexCreator  # Create vector indexes
    from langchain.vectorstores import FAISS  # Vector store using FAISS library
    from langchain.text_splitter import RecursiveCharacterTextSplitter  # Split text into chunks
    from langchain.chains import ConversationalRetrievalChain  # Conversational retrieval chain
    from langchain.callbacks.manager import CallbackManager

  2. Initialize the Amazon Bedrock client and configure Anthropic’s Claude 3.5 Sonnet. You can limit the number of output tokens to optimize the cost:
    # Create a Boto3 client for Bedrock Runtime
    bedrock_runtime = boto3.client(
        service_name="bedrock-runtime",
        region_name="us-east-1"
    )
    
    # Function to get the LLM (Large Language Model)
    def get_llm():
        model_kwargs = {  # Configuration for Anthropic model
            "max_tokens": 512,  # Maximum number of tokens to generate
            "temperature": 0.2,  # Sampling temperature for controlling randomness
            "top_k": 250,  # Consider the top k tokens for sampling
            "top_p": 1,  # Consider the top p probability tokens for sampling
            "stop_sequences": ["nnHuman:"]  # Stop sequence for generation
        }
        # Create a callback manager with a default callback handler
        callback_manager = CallbackManager([])
        
        llm = BedrockChat(
            model_id="anthropic.claude-3-5-sonnet-20240620-v1:0",  # Set the foundation model
            model_kwargs=model_kwargs,  # Pass the configuration to the model
            callback_manager=callback_manager
        )
    
        return llm

  3. Create and return an index for the given schema type. This approach is an efficient way to filter tables and provide relevant input to the model:
    # Function to load the schema file based on the schema type
    def load_schema_file(schema_type):
        if schema_type == 'Schema_Type_A':
            schema_file = "Table_Schema_A.json"  # Path to Schema Type A
        elif schema_type == 'Schema_Type_B':
            schema_file = "Table_Schema_B.json"  # Path to Schema Type B
        elif schema_type == 'Schema_Type_C':
            schema_file = "Table_Schema_C.json"  # Path to Schema Type C
        return schema_file
    
    # Function to get the vector index for the given schema type
    def get_index(schema_type):
        embeddings = BedrockEmbeddings(model_id="amazon.titan-embed-text-v2:0",
                                       client=bedrock_runtime)  # Initialize embeddings
    
        db_schema_loader = JSONLoader(
            file_path=load_schema_file(schema_type),  # Load the schema file for the selected type
            jq_schema='.',  # Select the entire JSON content
            text_content=False)  # The loaded content is JSON, not plain text
    
        db_schema_text_splitter = RecursiveCharacterTextSplitter(  # Create a text splitter
            separators=["separator"],  # Split chunks at the "separator" string
            chunk_size=10000,  # Divide into 10,000-character chunks
            chunk_overlap=100  # Allow 100 characters to overlap with previous chunk
        )
    
        db_schema_index_creator = VectorstoreIndexCreator(
            vectorstore_cls=FAISS,  # Use FAISS vector store
            embedding=embeddings,  # Use the initialized embeddings
            text_splitter=db_schema_text_splitter  # Use the text splitter
        )
    
        db_index_from_loader = db_schema_index_creator.from_loaders([db_schema_loader])  # Create index from loader
    
        return db_index_from_loader

  4. Use the following function to create and return memory for the chat session:
    # Function to get the memory for storing chat conversations
    def get_memory():
        memory = ConversationBufferWindowMemory(memory_key="chat_history", return_messages=True)  # Create memory
    
        return memory

  5. Use the following prompt template to generate SQL queries based on user input:
    # Template for the question prompt
    template = """ Read table information from the context. Each table contains the following information:
    - Name: The name of the table
    - Description: A brief description of the table
    - Columns: The columns of the table, listed under the 'columns' key. Each column contains:
      - Name: The name of the column
      - Description: A brief description of the column
      - Type: The data type of the column
      - Synonyms: Optional synonyms for the column name
    - Sample Queries: Optional sample queries for the table, listed under the 'sample_queries' key
    
    Given this structure, your task is to provide a SQL query using Amazon Redshift syntax that retrieves the data for the following question. The produced query should be functional, efficient, and adhere to best practices in SQL query optimization.
    
    Question: {}
    """

  6. Use the following function to get a response from the RAG chat model:
    # Function to get the response from the conversational retrieval chain
    def get_rag_chat_response(input_text, memory, index):
        llm = get_llm()  # Get the LLM
    
        conversation_with_retrieval = ConversationalRetrievalChain.from_llm(
            llm, index.vectorstore.as_retriever(), memory=memory, verbose=True)  # Create conversational retrieval chain
    
        chat_response = conversation_with_retrieval.invoke({"question": template.format(input_text)})  # Invoke the chain
    
        return chat_response['answer']  # Return the answer
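
Before building the UI, you can optionally sanity-check library.py from a Python shell or notebook. The following minimal sketch only reuses the functions defined above; the question is an example tied to the sample schema:

import library as lib

# Build the index for one schema type, create conversation memory, and ask a sample question.
index = lib.get_index("Schema_Type_A")
memory = lib.get_memory()

answer = lib.get_rag_chat_response(
    input_text="How many orders were cancelled by customer id 9782226?",
    memory=memory,
    index=index,
)
print(answer)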

Configure Streamlit for the front-end UI

Create the file app.py by following these steps:

  1. Import the necessary libraries:
    import streamlit as st
    import library as lib
    from io import StringIO
    import boto3
    from datetime import datetime
    import csv
    import pandas as pd
    from io import BytesIO

  2. Initialize the S3 client:
    s3_client = boto3.client('s3')
    bucket_name = 'simplesql-logs-****'
    # Replace 'simplesql-logs-****' with your S3 bucket name
    log_file_key = 'logs.xlsx'

  3. Configure Streamlit for UI:
    st.set_page_config(page_title="Your App Name")
    st.title("Your App Name")
    
    # Define the available menu items for the sidebar
    menu_items = ["Home", "How To", "Generate SQL Query"]
    
    # Create a sidebar menu using radio buttons
    selected_menu_item = st.sidebar.radio("Menu", menu_items)
    
    # Home page content
    if selected_menu_item == "Home":
        # Display introductory information about the application
        st.write("This application allows you to generate SQL queries from natural language input.")
        st.write("")
        st.write("**Get Started** by selecting the button Generate SQL Query !")
        st.write("")
        st.write("")
        st.write("**Disclaimer :**")
        st.write("- Model's response depends on user's input (prompt). Please visit How-to section for writing efficient prompts.")
               
    # How-to page content
    elif selected_menu_item == "How To":
        # Provide guidance on how to use the application effectively
        st.write("The model's output completely depends on the natural language input. Below are some examples which you can keep in mind while asking the questions.")
        st.write("")
        st.write("")
        st.write("")
        st.write("")
        st.write("**Case 1 :**")
        st.write("- **Bad Input :** Cancelled orders")
        st.write("- **Good Input :** Write a query to extract the cancelled order count for the items which were listed this year")
        st.write("- It is always recommended to add required attributes, filters in your prompt.")
        st.write("**Case 2 :**")
        st.write("- **Bad Input :** I am working on XYZ project. I am creating a new metric and need the sales data. Can you provide me the sales at country level for 2023 ?")
        st.write("- **Good Input :** Write an query to extract sales at country level for orders placed in 2023 ")
        st.write("- Every input is processed as tokens. Do not provide un-necessary details as there is a cost associated with every token processed. Provide inputs only relevant to your query requirement.") 

  4. Generate the query:
    # SQL-AI page content
    elif selected_menu_item == "Generate SQL Query":
        # Define the available schema types for selection
        schema_types = ["Schema_Type_A", "Schema_Type_B", "Schema_Type_C"]
        schema_type = st.sidebar.selectbox("Select Schema Type", schema_types)

  5. Use the following for SQL generation:
    if schema_type:
            # Initialize or retrieve conversation memory from session state
            if 'memory' not in st.session_state:
                st.session_state.memory = lib.get_memory()
    
            # Initialize or retrieve chat history from session state
            if 'chat_history' not in st.session_state:
                st.session_state.chat_history = []
    
            # Initialize or update vector index based on selected schema type
            if 'vector_index' not in st.session_state or 'current_schema' not in st.session_state or st.session_state.current_schema != schema_type:
                with st.spinner("Indexing document..."):
                    # Create a new index for the selected schema type
                    st.session_state.vector_index = lib.get_index(schema_type)
                    # Update the current schema in session state
                    st.session_state.current_schema = schema_type
    
            # Display the chat history
            for message in st.session_state.chat_history:
                with st.chat_message(message["role"]):
                    st.markdown(message["text"])
    
            # Get user input through the chat interface, set the max limit to control the input tokens.
            input_text = st.chat_input("Chat with your bot here", max_chars=100)
            
            if input_text:
                # Display user input in the chat interface
                with st.chat_message("user"):
                    st.markdown(input_text)
    
                # Add user input to the chat history
                st.session_state.chat_history.append({"role": "user", "text": input_text})
    
                # Generate chatbot response using the RAG model
                chat_response = lib.get_rag_chat_response(
                    input_text=input_text, 
                    memory=st.session_state.memory,
                    index=st.session_state.vector_index
                )
                
                # Display chatbot response in the chat interface
                with st.chat_message("assistant"):
                    st.markdown(chat_response)
    
                # Add chatbot response to the chat history
                st.session_state.chat_history.append({"role": "assistant", "text": chat_response})

  6. Log the conversations to the S3 bucket:
                timestamp = datetime.now().strftime('%Y-%m-%d %H:%M:%S')
    
                try:
                    # Attempt to download the existing log file from S3
                    log_file_obj = s3_client.get_object(Bucket=bucket_name, Key=log_file_key)
                    log_file_content = log_file_obj['Body'].read()
                    df = pd.read_excel(BytesIO(log_file_content))
    
                except s3_client.exceptions.NoSuchKey:
                    # If the log file doesn't exist, create a new DataFrame
                    df = pd.DataFrame(columns=["User Input", "Model Output", "Timestamp", "Schema Type"])
    
                # Create a new row with the current conversation data
                new_row = pd.DataFrame({
                    "User Input": [input_text], 
                    "Model Output": [chat_response], 
                    "Timestamp": [timestamp],
                    "Schema Type": [schema_type]
                })
                # Append the new row to the existing DataFrame
                df = pd.concat([df, new_row], ignore_index=True)
                
                # Prepare the updated DataFrame for S3 upload
                output = BytesIO()
                df.to_excel(output, index=False)
                output.seek(0)
                
                # Upload the updated log file to S3
                s3_client.put_object(Body=output.getvalue(), Bucket=bucket_name, Key=log_file_key)
    

Test the solution

Open your terminal and invoke the following command to run the Streamlit application.

streamlit run app.py

To visit the application in your browser, navigate to http://localhost:8501 (the default Streamlit port).

To visit the application using SageMaker, copy your notebook URL and replace default/lab in the URL with default/proxy/8501/. It should look something like the following:

https://your_sagemaker_lab_url.studio.us-east-1.sagemaker.aws/jupyterlab/default/proxy/8501/

Choose Generate SQL query to open the chat window. Test your application by asking questions in natural language. We tested the application with the following questions and it generated accurate SQL queries.

Count of orders placed from India last month?
Write a query to extract the canceled order count for the items that were listed this year.
Write a query to extract the top 10 item names having highest order for each country.

Troubleshooting tips

Use the following solutions to address errors:

Error – An error occurred (AccessDeniedException) when calling the InvokeModel operation. You don’t have access to the model with the specified model ID.
Solution – Make sure you have enabled access in Amazon Bedrock to the FMs used in this solution: Amazon Titan Text Embeddings v2 and Anthropic’s Claude 3.5 Sonnet.

Error – app.py does not exist
Solution – Make sure your JSON file and Python files are in the same folder and you’re invoking the command in the same folder.

Error – No module named streamlit
Solution – Open the terminal and install the streamlit module by running the command pip install streamlit

Error – An error occurred (NoSuchBucket) when calling the GetObject operation. The specified bucket doesn’t exist.
Solution – Verify your bucket name in the app.py file and update the name based on your S3 bucket name.

Clean up

Clean up the resources you created to avoid incurring charges. To clean up your S3 bucket, refer to Emptying a bucket.

Conclusion

In this post, we showed how Amazon Bedrock can be used to create a text-to-SQL application based on enterprise-specific datasets. We used Amazon S3 to store the outputs generated by the model for corresponding inputs. These logs can be used to test the accuracy and enhance the context by providing more details in the knowledge base. With the aid of a tool like this, you can create automated solutions that are accessible to nontechnical users, empowering them to interact with data more efficiently.

Ready to get started with Amazon Bedrock? Start learning with these interactive workshops.

For more information on SQL generation, refer to the following:

We recently launched a managed NL2SQL module to retrieve structured data in Amazon Bedrock Knowledge Bases. To learn more, visit Amazon Bedrock Knowledge Bases now supports structured data retrieval.


About the Author

Rajendra Choudhary is a Sr. Business Analyst at Amazon. With 7 years of experience in developing data solutions, he possesses profound expertise in data visualization, data modeling, and data engineering. He is passionate about supporting customers by leveraging generative AI–based solutions. Outside of work, Rajendra is an avid foodie and music enthusiast, and he enjoys swimming and hiking.

Read More

Unleash AI innovation with Amazon SageMaker HyperPod

The rise of generative AI has significantly increased the complexity of building, training, and deploying machine learning (ML) models. It now demands deep expertise, access to vast datasets, and the management of extensive compute clusters. Customers also face the challenges of writing specialized code for distributed training, continuously optimizing models, addressing hardware issues, and keeping projects on track and within budget. To simplify this process, AWS introduced Amazon SageMaker HyperPod during AWS re:Invent 2023, and it has emerged as a pioneering solution, revolutionizing how companies approach AI development and deployment.

As Amazon CEO Andy Jassy recently shared, “One of the most exciting innovations we’ve introduced is SageMaker HyperPod. HyperPod accelerates the training of machine learning models by distributing and parallelizing workloads across numerous powerful processors like AWS’s Trainium chips or GPUs. HyperPod also constantly monitors your infrastructure for problems, automatically repairing them when detected. During repair, your work is automatically saved, ensuring seamless resumption. This innovation is widely adopted, with most SageMaker AI customers relying on HyperPod for their demanding training needs.”

In this post, we show how SageMaker HyperPod, and its new features introduced at AWS re:Invent 2024, is designed to meet the demands of modern AI workloads, offering a persistent and optimized cluster tailored for distributed training and accelerated inference at cloud scale and attractive price-performance.

Customers using SageMaker HyperPod

Leading startups like Writer, Luma AI, and Perplexity, as well as major enterprises such as Thomson Reuters and Salesforce, are accelerating model development with SageMaker HyperPod. Amazon itself used SageMaker HyperPod to train its new Amazon Nova models, significantly reducing training costs, enhancing infrastructure performance, and saving months of manual effort that would have otherwise been spent on cluster setup and end-to-end process management.

Today, more organizations are eager to fine-tune popular publicly available models or train their own specialized models to revolutionize their businesses and applications with generative AI. To support this demand, SageMaker HyperPod continues to evolve, introducing new innovations that make it straightforward, faster, and more cost-effective for customers to build, train, and deploy these models at scale.

Deep infrastructure control

SageMaker HyperPod offers persistent clusters with deep infrastructure control, enabling builders to securely connect using SSH to Amazon Elastic Compute Cloud (Amazon EC2) instances for advanced model training, infrastructure management, and debugging. To maximize availability, HyperPod maintains a pool of dedicated and spare instances (at no additional cost), minimizing downtime during critical node replacements.

You can use familiar orchestration tools such as Slurm or Amazon Elastic Kubernetes Service (Amazon EKS), along with the libraries built on these tools, to enable flexible job scheduling and compute sharing. Integrating SageMaker HyperPod clusters with Slurm also allows the use of NVIDIA’s Enroot and Pyxis for efficient container scheduling in performant, unprivileged sandboxes. The underlying operating system and software stack are based on the Deep Learning AMI, preconfigured with NVIDIA CUDA, NVIDIA cuDNN, and the latest versions of PyTorch and TensorFlow. SageMaker HyperPod also is integrated with Amazon SageMaker AI distributed training libraries, optimized for AWS infrastructure, enabling automatic workload distribution across thousands of accelerators for efficient parallel training.

Builders can use built-in ML tools within SageMaker HyperPod to enhance model performance. For example, Amazon SageMaker with TensorBoard helps visualize model architecture and address convergence issues, as shown in the following screenshot. Integration with observability tools like Amazon CloudWatch Container Insights, Amazon Managed Service for Prometheus, and Amazon Managed Grafana offers deeper insights into cluster performance, health, and utilization, streamlining development time.

SageMaker HyperPod allows you to implement custom libraries and frameworks, enabling the service to be tailored to specific AI project needs. This level of personalization is essential in the rapidly evolving AI landscape, where innovation often requires experimenting with cutting-edge techniques and technologies. The adaptability of SageMaker HyperPod means that businesses are not constrained by infrastructure limitations, fostering creativity and technological advancement.

Intelligent resource management

As organizations increasingly provision large amounts of accelerated compute capacity for model training, they face challenges in effectively governing resource usage. These compute resources are both expensive and finite, making it crucial to prioritize critical model development tasks and avoid waste or underutilization. Without proper controls over task prioritization and resource allocation, some projects stall due to insufficient resources, while others leave resources underused. This creates a significant burden for administrators, who must constantly reallocate resources, and for data scientists, who struggle to maintain progress. These inefficiencies delay AI innovation and drive up costs.

SageMaker HyperPod addresses these challenges with its task governance capabilities, enabling you to maximize accelerator utilization for model training, fine-tuning, and inference. With just a few clicks, you can define task priorities and set limits on compute resource usage for teams. Once configured, SageMaker HyperPod automatically manages the task queue, making sure the most critical work receives the necessary resources. This reduction in operational overhead allows organizations to reallocate valuable human resources toward more innovative and strategic initiatives. This reduces model development costs by up to 40%.

For instance, if an inference task powering a customer-facing service requires urgent compute capacity but all resources are currently in use, SageMaker HyperPod reallocates underutilized or non-urgent resources to prioritize the critical task. Non-urgent tasks are automatically paused, checkpoints are saved to preserve progress, and these tasks resume seamlessly when resources become available. This makes sure you maximize your compute investments without compromising ongoing work.

As a fast-growing generative AI startup, Articul8 AI constantly optimizes their compute environment to allocate accelerated compute resources as efficiently as possible. With automated task prioritization and resource allocation in SageMaker HyperPod, they have seen a dramatic improvement in GPU utilization, reducing idle time and accelerating their model development process by optimizing tasks ranging from training and fine-tuning to inference. The ability to automatically shift resources to high-priority tasks has increased their team’s productivity, allowing them to bring new generative AI innovations to market faster than ever before.

At its core, SageMaker HyperPod represents a paradigm shift in AI infrastructure, moving beyond the traditional emphasis on raw computational power to focus on intelligent and adaptive resource management. By prioritizing optimized resource allocation, SageMaker HyperPod minimizes waste, maximizes efficiency, and accelerates innovation—all while reducing costs. This makes AI development more accessible and scalable for organizations of all sizes.

Get started faster with SageMaker HyperPod recipes

Many customers want to customize popular publicly available models, like Meta’s Llama and Mistral, for their specific use cases using their organization’s data. However, optimizing training performance often requires weeks of iterative testing—experimenting with algorithms, fine-tuning parameters, monitoring training impact, debugging issues, and benchmarking performance.

To simplify this process, SageMaker HyperPod now offers over 30 curated model training recipes for some of today’s most popular models, including DeepSeek R1, DeepSeek R1 Distill Llama, DeepSeek R1 Distill Qwen, Llama, Mistral, and Mixtral. These recipes enable you to get started in minutes by automating key steps like loading training datasets, applying distributed training techniques, and configuring systems for checkpointing and recovery from infrastructure failures. This empowers users of all skill levels to achieve better price-performance for model training on AWS infrastructure from the outset, eliminating weeks of manual evaluation and testing.

You can browse the GitHub repo to explore available training recipes, customize parameters to fit your needs, and deploy in minutes. With a simple one-line change, you can seamlessly switch between GPU or AWS Trainium based instances to further optimize price-performance.

Researchers at Salesforce were looking for ways to quickly get started with foundation model (FM) training and fine-tuning, without having to worry about the infrastructure, or spend weeks optimizing their training stack for each new model. With SageMaker HyperPod recipes, researchers at Salesforce can conduct rapid prototyping when customizing FMs. Now, Salesforce’s AI Research teams are able to get started in minutes with a variety of pre-training and fine-tuning recipes, and can operationalize frontier models with high performance.

Integrating Kubernetes with SageMaker HyperPod

Though the standalone capabilities of SageMaker HyperPod are impressive, its integration with Amazon EKS takes AI workloads to new levels of power and flexibility. Amazon EKS simplifies the deployment, scaling, and management of containerized applications, making it an ideal solution for orchestrating complex AI/ML infrastructure.

By running SageMaker HyperPod on Amazon EKS, organizations can use Kubernetes’s advanced scheduling and orchestration features to dynamically provision and manage compute resources for AI/ML workloads, providing optimal resource utilization and scalability.

“We were able to meet our large language model training requirements using Amazon SageMaker HyperPod,” says John Duprey, Distinguished Engineer, Thomson Reuters Labs. “Using Amazon EKS on SageMaker HyperPod, we were able to scale up capacity and easily run training jobs, enabling us to unlock benefits of LLMs in areas such as legal summarization and classification.”

This integration also enhances fault tolerance and high availability. With self-healing capabilities, HyperPod automatically replaces failed nodes, maintaining workload continuity. Automated GPU health monitoring and seamless node replacement provide reliable execution of AI/ML workloads with minimal downtime, even during hardware failures.

Additionally, running SageMaker HyperPod on Amazon EKS enables efficient resource isolation and sharing using Kubernetes namespaces and resource quotas. Organizations can isolate different AI/ML workloads or teams while maximizing resource utilization across the cluster.

Flexible training plans help meet timelines and budgets

Although infrastructure innovations help reduce costs and improve training efficiency, customers still face challenges in planning and managing the compute capacity needed to complete training tasks on time and within budget. To address this, AWS is introducing flexible training plans for SageMaker HyperPod.

With just a few clicks, you can specify your desired completion date and the maximum amount of compute resources needed. SageMaker HyperPod then helps acquire capacity and sets up clusters, saving teams weeks of preparation time. This eliminates much of the uncertainty customers encounter when acquiring large compute clusters for model development tasks.


SageMaker HyperPod training plans are now available in US East (N. Virginia), US East (Ohio), and US West (Oregon) AWS Regions and support ml.p4d.48xlarge, ml.p5.48xlarge, ml.p5e.48xlarge, ml.p5en.48xlarge, and ml.trn2.48xlarge instances. Trn2 and P5en instances are only in the US East (Ohio) Region. To learn more, visit the SageMaker HyperPod product page and SageMaker pricing page.

Hippocratic AI is an AI company that develops the first safety-focused large language model (LLM) for healthcare. To train its primary LLM and the supervisor models, Hippocratic AI required powerful compute resources, which were in high demand and difficult to obtain. SageMaker HyperPod flexible training plans made it straightforward for them to gain access to EC2 P5 instances.

Developers and data scientists at OpenBabylon, an AI company that customizes LLMs for underrepresented languages, have been using SageMaker HyperPod flexible training plans for a few months to streamline their access to GPU resources to run large-scale experiments. Using the multi-node SageMaker HyperPod distributed training capabilities, they conducted 100 large-scale model training experiments, achieving state-of-the-art results in English-to-Ukrainian translation. This breakthrough was achieved on time and cost-effectively, demonstrating the ability of SageMaker HyperPod to deliver complex projects on time and within budget.

Integrating training and inference infrastructures

A key focus area is integrating next-generation AI accelerators like the anticipated AWS Trainium2 release. These advanced accelerators promise unparalleled computational performance, offering 30–40% better price-performance than the current generation of GPU-based EC2 instances, significantly boosting AI model training and deployment efficiency and speed. This will be crucial for real-time applications and processing vast datasets simultaneously. The seamless accelerator integration with SageMaker HyperPod enables businesses to harness cutting-edge hardware advancements, driving AI initiatives forward.

Another pivotal aspect is that SageMaker HyperPod, through its integration with Amazon EKS, enables scalable inference solutions. As real-time data processing and decision-making demands grow, the SageMaker HyperPod architecture efficiently handles these requirements. This capability is essential across sectors like healthcare, finance, and autonomous systems, where timely, accurate AI inferences are critical. Offering scalable inference enables deploying high-performance AI models under varying workloads, enhancing operational effectiveness.

Moreover, integrating training and inference infrastructures represents a significant advancement, streamlining the AI lifecycle from development to deployment and providing optimal resource utilization throughout. Bridging this gap facilitates a cohesive, efficient workflow, reducing transition complexities from development to real-world applications. This holistic integration supports continuous learning and adaptation, which is key for next-generation, self-evolving AI models (continuously learning models, which possess the ability to adapt and refine themselves in real time based on their interactions with the environment).

SageMaker HyperPod uses established open source technologies, including MLflow integration through SageMaker, container orchestration through Amazon EKS, and Slurm workload management, providing users with familiar and proven tools for their ML workflows. By engaging the global AI community and encouraging knowledge sharing, SageMaker HyperPod continuously evolves, incorporating the latest research advancements. This collaborative approach helps SageMaker HyperPod remain at the forefront of AI technology, providing the tools to drive transformative change.

Conclusion

SageMaker HyperPod represents a fundamental change in AI infrastructure, offering a future-fit solution that empowers organizations to unlock the full potential of AI technologies. With its intelligent resource management, versatility, scalability, and forward-thinking design, SageMaker HyperPod enables businesses to accelerate innovation, reduce operational costs, and stay ahead of the curve in the rapidly evolving AI landscape.

Whether it’s optimizing the training of LLMs, processing complex datasets for medical imaging inference, or exploring novel AI architectures, SageMaker HyperPod provides a robust and flexible foundation for organizations to push the boundaries of what is possible in AI.

As AI continues to reshape industries and redefine what is possible, SageMaker HyperPod stands at the forefront, enabling organizations to navigate the complexities of AI workloads with unparalleled agility, efficiency, and innovation. With its commitment to continuous improvement, strategic partnerships, and alignment with emerging technologies, SageMaker HyperPod is poised to play a pivotal role in shaping the future of AI, empowering organizations to unlock new realms of possibility and drive transformative change.

Take the first step towards revolutionizing your AI initiatives by scheduling a consultation with our experts. Let us guide you through the process of harnessing the power of SageMaker HyperPod and unlock a world of possibilities for your business.


About the authors

Ilan Gleiser is a Principal GenAI Specialist at AWS WWSO Frameworks team focusing on developing scalable Artificial General Intelligence architectures and optimizing foundation model training and inference. With a rich background in AI and machine learning, Ilan has published over 20 blogs and delivered 100+ prototypes globally over the last 5 years. Ilan holds a Master’s degree in mathematical economics.

Trevor Harvey is a Principal Specialist in Generative AI at Amazon Web Services and an AWS Certified Solutions Architect – Professional. Trevor works with customers to design and implement machine learning solutions and leads go-to-market strategies for generative AI services.

Shubha Kumbadakone is a Sr. Mgr on the AWS WWSO Frameworks team focusing on Foundation Model Builders and self-managed machine learning with a focus on open-source software and tools. She has more than 19 years of experience in cloud infrastructure and machine learning and is helping customers build their distributed training and inference at scale for their ML models on AWS. She also holds a patent on a caching algorithm for rapid resume from hibernation for mobile systems.

Matt Nightingale is a Solutions Architect Manager on the AWS WWSO Frameworks team focusing on Generative AI Training and Inference. Matt specializes in distributed training architectures with a focus on hardware performance and reliability. Matt holds a bachelors degree from University of Virginia and is based in Boston, Massachusetts.

Read More

Revolutionizing clinical trials with the power of voice and AI

In the rapidly evolving healthcare landscape, patients often find themselves navigating a maze of complex medical information, seeking answers to their questions and concerns. However, accessing accurate and comprehensible information can be a daunting task, leading to confusion and frustration. This is where the integration of cutting-edge technologies, such as audio-to-text translation and large language models (LLMs), holds the potential to revolutionize the way patients receive, process, and act on vital medical information.

As the healthcare industry continues to embrace digital transformation, solutions that combine advanced technologies like audio-to-text translation and LLMs will become increasingly valuable in addressing key challenges, such as patient education, engagement, and empowerment. By taking advantage of these innovative technologies, healthcare providers can deliver more personalized, efficient, and effective care, ultimately improving patient outcomes and driving progress in the life sciences domain.

For instance, envision a voice-enabled virtual assistant that not only understands your spoken queries, but also transcribes them into text with remarkable accuracy. This transcription then serves as the input for a powerful LLM, which draws upon its vast knowledge base to provide personalized, context-aware responses tailored to your specific situation. This solution can transform the patient education experience, empowering individuals to make informed decisions about their healthcare journey.

In this post, we discuss possible use cases for combining speech recognition technology with LLMs, and how the solution can revolutionize clinical trials.

By combining speech recognition technology with LLMs, the solution can accurately transcribe a patient’s spoken queries into text, enabling the LLM to understand and analyze the context of the question. The LLM can then use its extensive knowledge base, which can be regularly updated with the latest medical research and clinical trial data, to provide relevant and trustworthy responses tailored to the patient’s specific situation.

Some of the potential benefits of this integrated approach are that patients can receive instant access to reliable information, empowering them to make more informed decisions about their healthcare. Additionally, the solution can help alleviate the burden on healthcare professionals by providing patients with a convenient and accessible source of information, freeing up valuable time for more critical tasks. Furthermore, the voice-enabled interface can enhance accessibility for patients with disabilities or those who prefer verbal communication, making sure that no one is left behind in the pursuit of better health outcomes.

Use cases overview

In this section, we discuss several possible use cases for this solution.

Use case 1: Audio-to-text translation and LLM integration for clinical trial patient interactions

In the domain of clinical trials, effective communication between patients and physicians is crucial for gathering accurate data, enforcing patient adherence, and maintaining study integrity. This use case demonstrates how audio-to-text translation combined with LLM capabilities can streamline and enhance the process of capturing and analyzing patient-physician interactions during clinical trial visits and telemedicine sessions.


The process flow consists of the following steps:

  1. Audio capture – During patient visits or telemedicine sessions, the audio of the patient-physician interaction is recorded securely, with appropriate consent and privacy measures in place.
  2. Audio-to-text translation – The recorded audio is processed through an automatic speech recognition (ASR) system, which converts the audio into text transcripts. This step provides an accurate and efficient conversion of spoken words into a format suitable for further analysis.
  3. Text preprocessing – The transcribed text undergoes preprocessing steps, such as removing identifying information, formatting the data, and enforcing compliance with relevant data privacy regulations.
  4. LLM integration – The preprocessed text is fed into a powerful LLM tailored for the healthcare and life sciences (HCLS) domain. The LLM analyzes the text, identifying key information relevant to the clinical trial, such as patient symptoms, adverse events, medication adherence, and treatment responses.
  5. Intelligent insights and recommendations – Using its large knowledge base and advanced natural language processing (NLP) capabilities, the LLM provides intelligent insights and recommendations based on the analyzed patient-physician interaction. These insights can include:
    1. Potential adverse event detection and reporting.
    2. Identification of protocol deviations or non-compliance.
    3. Recommendations for personalized patient care or adjustments to treatment regimens.
    4. Extraction of relevant data points for electronic health records (EHRs) and clinical trial databases.
  6. Data integration and reporting – The extracted insights and recommendations are integrated into the relevant clinical trial management systems, EHRs, and reporting mechanisms. This streamlines the process of data collection, analysis, and decision-making for clinical trial stakeholders, including investigators, sponsors, and regulatory authorities.

The solution offers the following potential benefits:

  • Improved data accuracy – By accurately capturing and analyzing patient-physician interactions, this approach minimizes the risks of manual transcription errors and provides high-quality data for clinical trial analysis and decision-making.
  • Enhanced patient safety – The LLM’s ability to detect potential adverse events and protocol deviations can help identify and mitigate risks, improving patient safety and study integrity.
  • Personalized patient care – Using the LLM’s insights, physicians can provide personalized care recommendations, tailored treatment plans, and better manage patient adherence, leading to improved patient outcomes.
  • Streamlined data collection and analysis – Automating the process of extracting relevant data points from patient-physician interactions can significantly reduce the time and effort required for manual data entry and analysis, enabling more efficient clinical trial management.
  • Regulatory compliance – By integrating the extracted insights and recommendations into clinical trial management systems and EHRs, this approach facilitates compliance with regulatory requirements for data capture, adverse event reporting, and trial monitoring.

This use case demonstrates the potential of combining audio-to-text translation and LLM capabilities to enhance patient-physician communication, improve data quality, and support informed decision-making in the context of clinical trials. By using advanced technologies, this integrated approach can contribute to more efficient, effective, and patient-centric clinical research processes.

Use case 2: Intelligent site monitoring with audio-to-text translation and LLM capabilities

In the HCLS domain, site monitoring plays a crucial role in maintaining the integrity and compliance of clinical trials. Site monitors conduct on-site visits, interview personnel, and verify documentation to assess adherence to protocols and regulatory requirements. However, this process can be time-consuming and prone to errors, particularly when dealing with extensive audio recordings and voluminous documentation.

By integrating audio-to-text translation and LLM capabilities, we can streamline and enhance the site monitoring process, leading to improved efficiency, accuracy, and decision-making support.

The process flow consists of the following steps:

  1. Audio capture and transcription – During site visits, monitors record interviews with site personnel, capturing valuable insights and observations. These audio recordings are then converted into text using ASR and audio-to-text translation technologies.
  2. Document ingestion – Relevant site documents, such as patient records, consent forms, and protocol manuals, are digitized and ingested into the system.
  3. LLM-powered data analysis – The transcribed interviews and ingested documents are fed into a powerful LLM, which can understand and correlate the information from multiple sources. The LLM can identify key insights, potential issues, and areas of non-compliance by analyzing the content and context of the data.
  4. Case report form generation – Based on the LLM’s analysis, a comprehensive case report form (CRF) is generated, summarizing the site visit findings, identifying potential risks or deviations, and providing recommendations for corrective actions or improvements.
  5. Decision support and site selection – The CRFs and associated data can be further analyzed by the LLM to identify patterns, trends, and potential risks across multiple sites. This information can be used to support decision-making processes, such as site selection for future clinical trials, based on historical performance and compliance data.

The solution offers the following potential benefits:

  • Improved efficiency – By automating the transcription and data analysis processes, site monitors can save significant time and effort, allowing them to focus on more critical tasks and cover more sites within the same time frame.
  • Enhanced accuracy – LLMs can identify and correlate subtle patterns and nuances within the data, reducing the risk of overlooking critical information or making erroneous assumptions.
  • Comprehensive documentation – The generated CRFs provide a standardized and detailed record of site visits, facilitating better communication and collaboration among stakeholders.
  • Regulatory compliance – The LLM-powered analysis can help identify potential areas of non-compliance, enabling proactive measures to address issues and mitigate risks.
  • Informed decision-making – The insights derived from the LLM’s analysis can support data-driven decision-making processes, such as site selection for future clinical trials, based on historical performance and compliance data.

By combining audio-to-text translation and LLM capabilities, this integrated approach offers a powerful solution for intelligent site monitoring in the HCLS domain, supporting improved efficiency, accuracy, and decision-making while providing regulatory compliance and quality assurance.

Use case 3: Enhancing adverse event reporting in clinical trials with audio-to-text and LLMs

Clinical trials are crucial for evaluating the safety and efficacy of investigational drugs and therapies. Accurate and comprehensive adverse event reporting is essential for identifying potential risks and making informed decisions. By combining audio-to-text translation with LLM capabilities, we can streamline and augment the adverse event reporting process, leading to improved patient safety and more efficient clinical research.

The process flow consists of the following steps:

  1. Audio data collection – During clinical trial visits or follow-ups, audio recordings of patient-doctor interactions are captured, including detailed descriptions of adverse events or symptoms experienced by the participants. These audio recordings can be obtained through various channels, such as in-person visits, telemedicine consultations, or dedicated voice reporting systems.
  2. Audio-to-text transcription – The audio recordings are processed through an audio-to-text translation system, converting the spoken words into written text format. ASR and NLP techniques provide accurate transcription, accounting for factors like accents, background noise, and medical terminology.
  3. Text data integration – The transcribed text data is integrated with other sources of adverse event reporting, such as electronic case report forms (eCRFs), patient diaries, and medication logs. This comprehensive dataset provides a holistic view of the adverse events reported across multiple data sources.
  4. LLM analysis – The integrated dataset is fed into an LLM specifically trained on medical and clinical trial data. The LLM analyzes the textual data, identifying patterns, extracting relevant information, and generating insights related to adverse event occurrences, severity, and potential causal relationships.
  5. Intelligent reporting and decision support – The LLM generates detailed adverse event reports, highlighting key findings, trends, and potential safety signals. These reports can be presented to clinical trial teams, regulatory bodies, and safety monitoring committees, supporting informed decision-making processes. The LLM can also provide recommendations for further investigation, protocol modifications, or risk mitigation strategies based on the identified adverse event patterns.

The solution offers the following potential benefits:

  • Improved data capture – By using audio-to-text translation, valuable information from patient-doctor interactions can be captured and included in adverse event reporting, reducing the risk of missed or incomplete data.
  • Enhanced accuracy and completeness – The integration of multiple data sources, combined with the LLM’s analysis capabilities, provides a comprehensive and accurate understanding of adverse events, reducing the potential for errors or omissions.
  • Efficient data analysis – The LLM can rapidly process large volumes of textual data, identifying patterns and insights that might be difficult or time-consuming for human analysts to detect manually.
  • Timely decision support – Real-time adverse event reporting and analysis enable clinical trial teams to promptly identify and address potential safety concerns, mitigating risks and protecting participant well-being.
  • Regulatory compliance – Comprehensive adverse event reporting and detailed documentation facilitate compliance with regulatory requirements and support transparent communication with regulatory agencies.

By integrating audio-to-text translation with LLM capabilities, this approach addresses the critical need for accurate and timely adverse event reporting in clinical trials, ultimately enhancing patient safety, improving research efficiency, and supporting informed decision-making in the HCLS domain.

Use case 4: Audio-to-text and LLM integration for enhanced patient care

In the healthcare domain, effective communication and accurate data capture are crucial for providing personalized and high-quality care. By integrating audio-to-text translation capabilities with LLM technology, we can streamline processes and unlock valuable insights, ultimately improving patient outcomes.

The process flow consists of the following steps:

  1. Audio input collection – Caregivers or healthcare professionals can record audio updates on a patient’s condition, mood, or relevant observations using a secure and user-friendly interface. This could be done through mobile devices, dedicated recording stations, or during virtual consultations.
  2. Audio-to-text transcription – The recorded audio files are securely transmitted to a speech-to-text engine, which converts the spoken words into text format. Advanced NLP techniques provide accurate transcription, handling accents, medical terminology, and background noise.
  3. Text processing and contextualization – The transcribed text is then fed into an LLM trained on various healthcare datasets, including medical literature, clinical guidelines, and deidentified patient records. The LLM processes the text, identifies key information, and extracts relevant context and insights.
  4. LLM-powered analysis and recommendations – Using its sizeable knowledge base and natural language understanding capabilities, the LLM can perform various tasks, such as:
    1. Identifying potential health concerns or risks based on the reported symptoms and observations.
    2. Suggesting personalized care plans or treatment options aligned with evidence-based practices.
    3. Providing recommendations for follow-up assessments, diagnostic tests, or specialist consultations.
    4. Flagging potential drug interactions or contraindications based on the patient’s medical history.
    5. Generating summaries or reports in a structured format for efficient documentation and communication.
  5. Integration with EHRs – The analyzed data and recommendations from the LLM can be seamlessly integrated into the patient’s EHR, providing a comprehensive and up-to-date medical profile. This enables healthcare professionals to access relevant information promptly and make informed decisions during consultations or treatment planning.

The solution offers the following potential benefits:

  • Improved efficiency – By automating the transcription and analysis process, healthcare professionals can save time and focus on providing personalized care, rather than spending extensive hours on documentation and data entry.
  • Enhanced accuracy – ASR and NLP techniques provide accurate transcription, reducing errors and improving data quality.
  • Comprehensive patient insights – The LLM’s ability to process and contextualize unstructured audio data provides a more holistic understanding of the patient’s condition, enabling better-informed decision-making.
  • Personalized care plans – By using the LLM’s knowledge base and analytical capabilities, healthcare professionals can develop tailored care plans aligned with the patient’s specific needs and medical history.
  • Streamlined communication – Structured reports and summaries generated by the LLM facilitate efficient communication among healthcare teams, making sure everyone has access to the latest patient information.
  • Continuous learning and improvement – As more data is processed, the LLM can continuously learn and refine its recommendations, improving its performance over time.

By integrating audio-to-text translation and LLM capabilities, healthcare organizations can unlock new efficiencies, enhance patient-provider communication, and ultimately deliver superior care while staying at the forefront of technological advancements in the industry.

Use case 5: Audio-to-text translation and LLM integration for clinical trial protocol design

Efficient and accurate protocol design is crucial for successful study execution and regulatory compliance. By combining audio-to-text translation capabilities with the power of LLMs, we can streamline the protocol design process, using diverse data sources and AI-driven insights to create high-quality protocols in a timely manner.

The process flow consists of the following steps:

  1. Audio input collection – Clinical researchers, subject matter experts, and stakeholders provide audio inputs, such as recorded meetings, discussions, or interviews, related to the proposed clinical trial. These audio files can capture valuable insights, requirements, and domain-specific knowledge.
  2. Audio-to-text transcription – Using ASR technology, the audio inputs are converted into text transcripts with high accuracy. This step makes sure that valuable information is captured and transformed into a format suitable for further processing by LLMs.
  3. Data integration – Relevant data sources, such as previous clinical trial protocols, regulatory guidelines, scientific literature, and medical databases, are integrated into the workflow. These data sources provide contextual information and serve as a knowledge base for the LLM.
  4. LLM processing – The transcribed text, along with the integrated data sources, is fed into a powerful LLM. The LLM uses its knowledge base and NLP capabilities to analyze the inputs, identify key elements, and generate a draft clinical trial protocol.
  5. Protocol refinement and review – The draft protocol generated by the LLM is reviewed by clinical researchers, medical experts, and regulatory professionals. They provide feedback, make necessary modifications, and enforce compliance with relevant guidelines and best practices.
  6. Iterative improvement – As the AI system receives feedback and correlated outcomes from completed clinical trials, it continuously learns and refines its protocol design capabilities. This iterative process enables the LLM to become more accurate and efficient over time, leading to higher-quality protocol designs.

The solution offers the following potential benefits:

  • Efficiency – By automating the initial protocol design process, researchers can save valuable time and resources, allowing them to focus on more critical aspects of clinical trial execution.
  • Accuracy and consistency – LLMs can use vast amounts of data and domain-specific knowledge, reducing the risk of errors and providing consistency across protocols.
  • Knowledge integration – The ability to seamlessly integrate diverse data sources, including audio recordings, scientific literature, and regulatory guidelines, enhances the quality and comprehensiveness of the protocol design.
  • Continuous improvement – The iterative learning process allows the AI system to adapt and improve its protocol design capabilities based on real-world outcomes, leading to increasingly accurate and effective protocols over time.
  • Decision-making support – By providing well-structured and comprehensive protocols, the AI-driven approach enables better-informed decision-making for clinical researchers, sponsors, and regulatory bodies.

This integrated approach using audio-to-text translation and LLM capabilities has the potential to revolutionize the clinical trial protocol design process, ultimately contributing to more efficient and successful clinical trials, accelerating the development of life-saving treatments, and improving patient outcomes.

Use case 6: Voice-enabled clinical trial and disease information assistant

In the HCLS domain, effective communication and access to accurate information are crucial for patients, caregivers, and healthcare professionals. This use case demonstrates how audio-to-text translation combined with LLM capabilities can address these needs by providing an intelligent, voice-enabled assistant for clinical trial and disease information.

The process flow consists of the following steps:

  1. Audio input – The user, whether a patient, caregiver, or healthcare professional, can initiate the process by providing a voice query related to a specific disease or clinical trial. This could include questions about the disease itself, treatment options, ongoing trials, eligibility criteria, or other relevant information.
  2. Audio-to-text translation – The audio input is converted into text using state-of-the-art speech recognition technology. This step makes sure that the user’s query is accurately transcribed and ready for further processing by the LLM.
  3. Data integration – The system integrates various data sources, including clinical trial data, disease-specific information from reputable sources (such as PubMed or WebMD), and other relevant third-party resources. This comprehensive data integration makes sure that the LLM has access to a large knowledge base for generating accurate and comprehensive responses.
  4. LLM processing – The transcribed query is fed into the LLM, which uses its natural language understanding capabilities to comprehend the user’s intent and extract relevant information from the integrated data sources. The LLM can provide intelligent responses, insights, and recommendations based on the query and the available data.
  5. Response generation – The LLM generates a detailed, context-aware response addressing the user’s query. This response can be presented in various formats, such as text, audio (using text-to-speech technology), or a combination of both, depending on the user’s preferences and accessibility needs.
  6. Feedback and continuous improvement – The system can incorporate user feedback mechanisms to improve its performance over time. This feedback can be used to refine the LLM’s understanding, enhance the data integration process, and make sure that the system remains up to date with the latest clinical trial and disease information.

The solution offers the following potential benefits:

  • Improved access to information – By using voice input and NLP capabilities, the system empowers patients, caregivers, and healthcare professionals to access accurate and comprehensive information about diseases and clinical trials, regardless of their technical expertise or literacy levels.
  • Enhanced communication – The voice-enabled interface facilitates seamless communication between users and the system, enabling them to ask questions and receive responses in a conversational manner, mimicking human-to-human interaction.
  • Personalized insights – The LLM can provide personalized insights and recommendations based on the user’s specific query and context, enabling more informed decision-making and tailored support for individuals.
  • Time and efficiency gains – By automating the process of information retrieval and providing intelligent responses, the system can significantly reduce the time and effort required for healthcare professionals to manually search and synthesize information from multiple sources.
  • Improved patient engagement – By offering accessible and user-friendly access to disease and clinical trial information, the system can empower patients and caregivers to actively participate in their healthcare journey, fostering better engagement and understanding.

This use case highlights the potential of integrating audio-to-text translation with LLM capabilities to address real-world challenges in the HCLS domain. By using cutting-edge technologies, this solution can improve information accessibility, enhance communication, and support more informed decision-making for all stakeholders involved in clinical trials and disease management.

For demonstration purposes, we focus on the following use case:

Use case overview: Patient reporting and analysis in clinical trials

In clinical trials, it’s crucial to gather accurate and comprehensive patient data to assess the safety and efficacy of investigational drugs or therapies. Traditional methods of collecting patient reports can be time-consuming, prone to errors, and might result in incomplete or inconsistent data. By combining audio-to-text translation with LLM capabilities, we can streamline the patient reporting process and unlock valuable insights to support decision-making.

The process flow consists of the following steps:

  1. Audio input – Patients participating in clinical trials can provide their updates, symptoms, and feedback through voice recordings using a mobile application or a dedicated recording device.
  2. Audio-to-text transcription – The recorded audio files are securely transmitted to a cloud-based infrastructure, where they undergo automated transcription using ASR technology. The audio is converted into text, providing accurate and verbatim transcripts.
  3. Data consolidation – The transcribed patient reports are consolidated into a structured database, enabling efficient storage, retrieval, and analysis.
  4. LLM processing – The consolidated textual data is then processed by an LLM trained on biomedical and clinical trial data. The LLM can perform various tasks, including:
    1. Natural language processing – Extracting relevant information and identifying key symptoms, adverse events, or treatment responses from the patient reports.
    2. Sentiment analysis – Analyzing the emotional and psychological state of patients based on their language and tone, which can provide valuable insights into their overall well-being and treatment experience.
    3. Pattern recognition – Identifying recurring themes, trends, or anomalies across multiple patient reports, enabling early detection of potential safety concerns or efficacy signals.
    4. Knowledge extraction – Using the LLM’s understanding of biomedical concepts and clinical trial protocols to derive meaningful insights and recommendations from the patient data.
  5. Insights and reporting – The processed data and insights derived from the LLM are presented through interactive dashboards, visualizations, and reports. These outputs can be tailored to different stakeholders, such as clinical researchers, medical professionals, and regulatory authorities.

The solution offers the following potential benefits:

  • Improved data quality – By using audio-to-text transcription, the risk of errors and inconsistencies associated with manual data entry is minimized, providing high-quality patient data.
  • Time and cost-efficiency – Automated transcription and LLM-powered analysis can significantly reduce the time and resources required for data collection, processing, and analysis, leading to faster decision-making and cost savings.
  • Enhanced patient experience – Patients can provide their updates conveniently through voice recordings, reducing the burden of manual data entry and enabling more natural communication.
  • Comprehensive analysis – The combination of NLP, sentiment analysis, and pattern recognition capabilities offered by LLMs allows for a holistic understanding of patient experiences, treatment responses, and potential safety signals.
  • Regulatory compliance – Accurate and comprehensive patient data, coupled with robust analysis, can support compliance with regulatory requirements for clinical trial reporting and data documentation.

By integrating audio-to-text translation and LLM capabilities, clinical trial sponsors and research organizations can benefit from streamlined patient reporting, enhanced data quality, and powerful insights to support informed decision-making throughout the clinical development process.

Solution overview

The following diagram illustrates the solution architecture.

Solution overview: patient reporting and analysis in clinical trials

Key AWS services used in this solution include Amazon Simple Storage Service (Amazon S3), AWS HealthScribe, Amazon Transcribe, and Amazon Bedrock.

Prerequisites

This solution requires the following prerequisites:

Data samples

To illustrate the concept and provide a practical understanding, we have curated a collection of audio samples. These samples serve as representative examples, simulating site interviews conducted by researchers at clinical trial sites with patient participants.

The audio recordings offer a glimpse into the type of data typically encountered during such interviews. We encourage you to listen to these samples to gain a better appreciation of the data and its context.

These samples are for demonstration purposes only and don’t contain any real patient information or sensitive data. They are intended solely to provide a sample structure and format for the audio recordings used in this particular use case.

Sample data audio files:

  • Site interview 1
  • Site interview 2
  • Site interview 3
  • Site interview 4
  • Site interview 5

Prompt templates

Prior to deploying and executing this solution, it’s essential to comprehend the input prompts and the anticipated output from the LLM. Although this is merely a sample, the potential outcomes and possibilities can be vastly expanded by crafting creative prompts.

We use the following input prompt template:

You are an expert medical research analyst for clinical trials of medicines.

You will be provided with a dictionary containing text transcriptions of clinical trial interviews conducted between patients and interviewers.

The dictionary keys represent the interview_id, and the values contain the interview transcripts.

<interview_transcripts>add_interview_transcripts</interview_transcripts>

Your task is to analyze all the transcripts and generate a comprehensive report summarizing the key findings and conclusions from the clinical trial.

The response from Amazon Bedrock will look like the following:

Based on the interview transcripts provided, here is a comprehensive report summarizing the key findings and conclusions from the clinical trial:

Introduction:

This report analyzes transcripts from interviews conducted with patients participating in a clinical trial for a new investigational drug. The interviews cover various aspects of the trial, including the informed consent process, randomization procedures, dosing schedules, follow-up visits, and patient experiences with potential side effects.

Key Findings:

1. Informed Consent Process:

– The informed consent process was thorough, with detailed explanations provided to patients about the trial’s procedures, potential risks, and benefits (Transcript 5).

– Patients were given ample time to review the consent documents, discuss them with family members, and have their questions addressed satisfactorily by the study team (Transcript 5).

– Overall, patients felt they fully understood the commitments and requirements of participating in the trial (Transcript 5).

2. Randomization and Blinding:

– Patients were randomized to either receive the investigational drug or a placebo, as part of a placebo-controlled study design (Transcript 2).

– The randomization process was adequately explained to patients, and they understood the rationale behind blinding, which is to prevent bias in the results (Transcript 2).

– Patients expressed acceptance of the possibility of receiving a placebo, recognizing its importance for the research (Transcript 2).

3. Dosing Schedule and Adherence:

– The dosing schedule involved taking the medication twice daily, in the morning and evening (Transcript 4).

– Some patients reported occasional difficulties in remembering the evening dose but implemented strategies like setting reminders on their phones to improve adherence (Transcript 4).

4. Follow-up Visits and Assessments:

– Follow-up visits were scheduled at specific intervals, such as 30 days, 3 months, and 6 months after the last dose (Transcripts 1 and 3).

– During these visits, various assessments were conducted, including blood tests, physical exams, ECGs, and evaluation of patient-reported outcomes like pain levels (Transcripts 1 and 3).

– Patients were informed that they would receive clinically significant findings from these assessments (Transcript 3).

5. Patient-Reported Side Effects:

– Some patients reported experiencing mild side effects, such as headaches, nausea, and joint pain improvement (Transcripts 3 and 4).

– The study team diligently documented and monitored these side effects, noting them in case report forms for further evaluation (Transcript 4).

6. Study Conduct and Communication:

– The study team provided 24/7 contact information, allowing patients to reach out with concerns between scheduled visits (Transcript 1).

– Patients were informed that they would receive information about the overall study results once available (Transcript 1).

– Patients were made aware of their ability to withdraw from the study at any time if they became uncomfortable (Transcript 2).

Conclusions:

Based on the interview transcripts, the clinical trial appears to have been conducted in a thorough and ethical manner, adhering to principles of informed consent, randomization, and blinding. Patients were adequately informed about the trial procedures, potential risks, and their rights as participants. The study team diligently monitored patient safety, documented adverse events, and maintained open communication channels. Overall, the transcripts suggest a well-managed clinical trial with a focus on patient safety, data integrity, and adherence to research protocols.

Deploy resources with AWS CloudFormation

To deploy the solution, use the provided AWS CloudFormation template.

Test the application

To test the application, complete the following steps:

  1. On the Amazon S3 console, choose Buckets in the navigation pane.
  2. Locate your bucket starting with blog-hcls-assets-*.
  3. Navigate to the S3 prefix hcls-framework/samples-input-audio/. You will see sample audio files, which we reviewed earlier in this post.
  4. Select these files, and on the Actions menu, choose Copy.
  5. For Destination, choose Browse S3 and navigate to the S3 path for hcls-framework/input-audio/.

Copying these sample files will trigger an S3 event invoking the AWS Lambda function audio-to-text. To review the invocations of the Lambda function on the AWS Lambda console, navigate to the audio-to-text function and then the Monitor tab, which contains detailed logs.

Review AWS Lambda execution logs
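For reference, a handler of this kind typically reads the S3 event and starts an Amazon Transcribe job. The following is a minimal sketch in Python with boto3, assuming the bucket prefixes described above and MP3 input; the audio-to-text function deployed by the CloudFormation template may differ in naming and error handling.

```python
import json
import urllib.parse

import boto3

transcribe = boto3.client("transcribe")

# Assumed output prefix; the deployed stack may use different names.
OUTPUT_PREFIX = "hcls-framework/input-text/"


def lambda_handler(event, context):
    """Start an Amazon Transcribe job for each audio file dropped into S3."""
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
        # Transcription job names must be unique within the account.
        job_name = key.split("/")[-1].rsplit(".", 1)[0]

        transcribe.start_transcription_job(
            TranscriptionJobName=job_name,
            Media={"MediaFileUri": f"s3://{bucket}/{key}"},
            MediaFormat="mp3",          # assumption: sample files are MP3
            LanguageCode="en-US",
            OutputBucketName=bucket,
            OutputKey=f"{OUTPUT_PREFIX}{job_name}.json",
        )

    return {"statusCode": 200, "body": json.dumps("Transcription jobs started")}
```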

You can review the status of the Amazon Transcribe jobs on the Amazon Transcribe console.

At this step, the interview transcripts are ready. They should be available in Amazon S3 under the prefix hcls-framework/input-text/.

You can download a sample file and review the contents. You will notice that the content of this file is JSON, with a text transcript available under the key transcripts, along with other metadata.

Now let’s run Anthropic’s Claude 3 Sonnet using the Lambda function hcls_clinical_trial_analysis to analyze the transcripts and generate a comprehensive report summarizing the key findings and conclusions from the clinical trial.

  1. On the Lambda console, navigate to the function named hcls_clinical_trial_analysis.
  2. Choose Test.
  3. If the console prompts you to create a new test event, create one with the default or empty input.
  4. Run the test event.

To review the output, open the Lambda console, navigate to the function named hcls_clinical_trial_analysis, and on the Monitor tab, choose View CloudWatch Logs. In the logs, you will see your comprehensive report on the clinical trial.
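If you prefer to trigger the analysis programmatically instead of through the console, a minimal sketch using boto3 might look like the following. The function name comes from the steps above; the empty payload mirrors the default test event and is an assumption.

```python
import json

import boto3

lambda_client = boto3.client("lambda")

# Invoke the analysis function synchronously with an empty test payload.
response = lambda_client.invoke(
    FunctionName="hcls_clinical_trial_analysis",
    InvocationType="RequestResponse",
    Payload=json.dumps({}).encode("utf-8"),
)

# Print whatever the function returns; the full report is also in CloudWatch Logs.
print(response["Payload"].read().decode("utf-8"))
```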

So far, we have completed a process involving:

  • Collecting audio interviews from clinical trials
  • Transcribing the audio to text
  • Compiling transcripts into a dictionary
  • Using Amazon Bedrock (Anthropic’s Claude 3 Sonnet) to generate a comprehensive summary

Although we focused on summarization, this approach can be extended to other applications such as sentiment analysis, extracting key learnings, identifying common complaints, and more.
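For reference, the Amazon Bedrock call at the heart of the summarization step can be sketched as follows. This is a minimal example assuming boto3, the Anthropic Messages API request format, and an on-demand Claude 3 Sonnet model ID; the Lambda function deployed by the template may structure its prompt assembly and response parsing differently.

```python
import json

import boto3

bedrock_runtime = boto3.client("bedrock-runtime", region_name="us-east-1")  # assumed Region

# Hypothetical transcripts dictionary keyed by interview_id,
# built from the JSON files under hcls-framework/input-text/.
transcripts = {"interview_1": "...", "interview_2": "..."}

prompt = (
    "You are an expert medical research analyst for clinical trials of medicines.\n"
    "You will be provided with a dictionary containing text transcriptions of clinical "
    "trial interviews conducted between patients and interviewers.\n"
    "The dictionary keys represent the interview_id, and the values contain the interview "
    "transcripts.\n"
    f"<interview_transcripts>{json.dumps(transcripts)}</interview_transcripts>\n"
    "Your task is to analyze all the transcripts and generate a comprehensive report "
    "summarizing the key findings and conclusions from the clinical trial."
)

response = bedrock_runtime.invoke_model(
    modelId="anthropic.claude-3-sonnet-20240229-v1:0",  # assumption: Claude 3 Sonnet model ID
    body=json.dumps(
        {
            "anthropic_version": "bedrock-2023-05-31",
            "max_tokens": 2000,
            "messages": [
                {"role": "user", "content": [{"type": "text", "text": prompt}]}
            ],
        }
    ),
)

result = json.loads(response["body"].read())
print(result["content"][0]["text"])
```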

Summary

Healthcare patients often find themselves in need of reliable information about their conditions, clinical trials, or treatment options. However, accessing accurate and up-to-date medical knowledge can be a daunting task. Our innovative solution integrates cutting-edge audio-to-text translation and LLM capabilities to revolutionize how patients receive vital healthcare information. By using speech recognition technology, we can accurately transcribe patients’ spoken queries, allowing our LLM to comprehend the context and provide personalized, evidence-based responses tailored to their specific needs. This empowers patients to make informed decisions, enhances accessibility for those with disabilities or preferences for verbal communication, and alleviates the workload on healthcare professionals, ultimately improving patient outcomes and driving progress in the HCLS domain.

Take charge of your healthcare journey with our innovative voice-enabled virtual assistant. Empower yourself with accurate and personalized information by simply asking your questions aloud. Our cutting-edge solution integrates speech recognition and advanced language models to provide reliable, context-aware responses tailored to your specific needs. Embrace the future of healthcare today and experience the convenience of instantaneous access to vital medical information.


About the Authors

Vrinda Dabke leads AWS Professional Services North America Delivery. Prior to joining AWS, Vrinda held a variety of leadership roles in Fortune 100 companies like UnitedHealth Group, The Hartford, Aetna, and Pfizer. Her work has focused on business intelligence, analytics, and AI/ML. She is a motivational people leader with experience in leading and managing high-performing global teams in complex matrix organizations.

Kannan Raman leads the North America Delivery for AWS Professional Services Healthcare and Life Sciences practice at AWS. He has over 24 years of healthcare and life sciences experience and provides thought leadership in digital transformation. He works with C level customer executives to help them with their digital transformation agenda.

Rushabh Lokhande is a Senior Data & ML Engineer with AWS Professional Services Analytics Practice. He helps customers implement big data, machine learning, and analytics solutions. Outside of work, he enjoys spending time with family, reading, running, and playing golf.

Bruno Klein is a Senior Machine Learning Engineer with AWS Professional Services Analytics Practice. He helps customers implement big data and analytics solutions. Outside of work, he enjoys spending time with family, traveling, and trying new food.

Read More

Intelligent healthcare assistants: Empowering stakeholders with personalized support and data-driven insights

Large language models (LLMs) have revolutionized the field of natural language processing, enabling machines to understand and generate human-like text with remarkable accuracy. However, despite their impressive language capabilities, LLMs are inherently limited by the data they were trained on. Their knowledge is static, which becomes problematic when dealing with dynamic and constantly evolving domains like healthcare.

The healthcare industry is a complex, ever-changing landscape with a vast and rapidly growing body of knowledge. Medical research, clinical practices, and treatment guidelines are constantly being updated, rendering even the most advanced LLMs quickly outdated. Additionally, patient data, including electronic health records (EHRs), diagnostic reports, and medical histories, are highly personalized and unique to each individual. Relying solely on an LLM’s pre-trained knowledge is insufficient for providing accurate and personalized healthcare recommendations.

Furthermore, healthcare decisions often require integrating information from multiple sources, such as medical literature, clinical databases, and patient records. LLMs lack the ability to seamlessly access and synthesize data from these diverse and distributed sources. This limits their potential to provide comprehensive and well-informed insights for healthcare applications.

Overcoming these challenges is crucial for using the full potential of LLMs in the healthcare domain. Patients, healthcare providers, and researchers require intelligent agents that can provide up-to-date, personalized, and context-aware support, drawing from the latest medical knowledge and individual patient data.

Enter LLM function calling, a powerful capability that addresses these challenges by allowing LLMs to interact with external functions or APIs, enabling them to access and use additional data sources or computational capabilities beyond their pre-trained knowledge. By combining the language understanding and generation abilities of LLMs with external data sources and services, LLM function calling opens up a world of possibilities for intelligent healthcare agents.

In this blog post, we will explore how Mistral LLM on Amazon Bedrock can address these challenges and enable the development of intelligent healthcare agents with LLM function calling capabilities, while maintaining robust data security and privacy through Amazon Bedrock Guardrails.

Healthcare agents equipped with LLM function calling can serve as intelligent assistants for various stakeholders, including patients, healthcare providers, and researchers. They can assist patients by answering medical questions, interpreting test results, and providing personalized health advice based on their medical history and current conditions. For healthcare providers, these agents can help with tasks such as summarizing patient records, suggesting potential diagnoses or treatment plans, and staying up to date with the latest medical research. Additionally, researchers can use LLM function calling to analyze vast amounts of scientific literature, identify patterns and insights, and accelerate discoveries in areas such as drug development or disease prevention.

Benefits of LLM function calling

LLM function calling offers several advantages for enterprise applications, including enhanced decision-making, improved efficiency, personalized experiences, and scalability. By combining the language understanding capabilities of LLMs with external data sources and computational resources, enterprises can make more informed and data-driven decisions, automate and streamline various tasks, provide tailored recommendations and experiences for individual users or customers, and handle large volumes of data and process multiple requests concurrently.

Potential use cases for LLM function calling in the healthcare domain include patient triage, medical question answering, and personalized treatment recommendations. LLM-powered agents can assist in triaging patients by analyzing their symptoms, medical history, and risk factors, and providing initial assessments or recommendations for seeking appropriate care. Patients and healthcare providers can receive accurate and up-to-date answers to medical questions by using LLMs’ ability to understand natural language queries and access relevant medical knowledge from various data sources. Additionally, by integrating with electronic health records (EHRs) and clinical decision support systems, LLM function calling can provide personalized treatment recommendations tailored to individual patients’ medical histories, conditions, and preferences.

Amazon Bedrock supports a variety of foundation models. In this post, we will be exploring how to perform function calling using Mistral from Amazon Bedrock. Mistral supports function calling, which allows agents to invoke external functions or APIs from within a conversation flow. This capability enables agents to retrieve data, perform calculations, or use external services to enhance their conversational abilities. Function calling in Mistral is achieved through the use of specific function call blocks that define the external function to be invoked and handle the response or output.
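To make this concrete, the following is a minimal sketch of declaring a tool and invoking a Mistral model through the Amazon Bedrock Converse API with boto3. The tool name, input schema, and model ID are illustrative assumptions, not the exact definitions used in the demo.

```python
import boto3

bedrock_runtime = boto3.client("bedrock-runtime", region_name="us-west-2")  # assumed Region

# Hypothetical tool definition for looking up a member's insurance coverage.
tool_config = {
    "tools": [
        {
            "toolSpec": {
                "name": "get_insurance_coverage",
                "description": "Look up insurance coverage details for a member.",
                "inputSchema": {
                    "json": {
                        "type": "object",
                        "properties": {
                            "member_id": {
                                "type": "string",
                                "description": "Member identifier",
                            }
                        },
                        "required": ["member_id"],
                    }
                },
            }
        }
    ]
}

response = bedrock_runtime.converse(
    modelId="mistral.mistral-large-2402-v1:0",  # assumption: a Mistral model with tool use support
    messages=[
        {
            "role": "user",
            "content": [{"text": "What does my plan cover? My member ID is M-1234."}],
        }
    ],
    toolConfig=tool_config,
)

# When the model decides to call a tool, stopReason is "tool_use" and the
# requested call appears in the assistant message content.
print(response["stopReason"])
```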

Solution overview

LLM function calling typically involves integrating an LLM model with an external API or function that provides access to additional data sources or computational capabilities. The LLM model acts as an interface, processing natural language inputs and generating responses based on its pre-trained knowledge and the information obtained from the external functions or APIs. The architecture typically consists of the LLM model, a function or API integration layer, and external data sources and services.

Healthcare agents can integrate LLM models and call external functions or APIs through a series of steps: natural language input processing, self-correction, chain of thought, function or API calling through an integration layer, data integration and processing, and persona adoption. The agent receives natural language input, processes it through the LLM model, calls relevant external functions or APIs if additional data or computations are required, combines the LLM model’s output with the external data or results, and provides a comprehensive response to the user.

High-level architecture: healthcare assistant

The architecture for the Healthcare Agent is shown in the preceding figure and is as follows:

  1. Consumers interact with the system through Amazon API Gateway.
  2. An AWS Lambda orchestrator, along with tool configurations and prompts, handles the conversation flow and invokes the Mistral model on Amazon Bedrock.
  3. Agent function calling allows agents to invoke Lambda functions to retrieve data, perform computations, or use external services.
  4. Functions such as insurance, claims, and pre-filled Lambda functions handle specific tasks.
  5. Conversation history is stored, a member database (MemberDB) holds member information, and a knowledge base contains static documents used by the agent.
  6. AWS CloudTrail, AWS Identity and Access Management (IAM), and Amazon CloudWatch handle data security.
  7. AWS Glue, Amazon SageMaker, and Amazon Simple Storage Service (Amazon S3) facilitate data processing.

Sample code using function calling with the Mistral LLM can be found in the mistral-on-aws repository.
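Building on the previous sketch, a Lambda orchestrator typically loops on the model’s tool requests: it executes each requested function (for example, by invoking another Lambda function), returns the result as a toolResult message, and calls the model again until a final answer is produced. The following simplified sketch uses the same assumptions as before; the actual orchestration in the mistral-on-aws sample may differ.

```python
import json

import boto3

bedrock_runtime = boto3.client("bedrock-runtime")
lambda_client = boto3.client("lambda")


def run_agent(messages, tool_config, model_id="mistral.mistral-large-2402-v1:0"):
    """Call the model, execute any requested tools, and loop until done."""
    while True:
        response = bedrock_runtime.converse(
            modelId=model_id, messages=messages, toolConfig=tool_config
        )
        output_message = response["output"]["message"]
        messages.append(output_message)

        if response["stopReason"] != "tool_use":
            return output_message  # final answer

        tool_results = []
        for block in output_message["content"]:
            if "toolUse" not in block:
                continue
            tool_use = block["toolUse"]
            # Hypothetical mapping: each tool name matches a Lambda function name.
            lambda_response = lambda_client.invoke(
                FunctionName=tool_use["name"],
                Payload=json.dumps(tool_use["input"]).encode("utf-8"),
            )
            result = json.loads(lambda_response["Payload"].read())
            tool_results.append(
                {
                    "toolResult": {
                        "toolUseId": tool_use["toolUseId"],
                        "content": [{"json": result}],
                    }
                }
            )

        # Feed the tool outputs back to the model as a user turn.
        messages.append({"role": "user", "content": tool_results})
```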

Security and privacy considerations

Data privacy and security are of utmost importance in the healthcare sector because of the sensitive nature of personal health information (PHI) and the potential consequences of data breaches or unauthorized access. Compliance with regulations such as HIPAA and GDPR is crucial for healthcare organizations handling patient data. To maintain robust data protection and regulatory compliance, healthcare organizations can use Amazon Bedrock Guardrails, a comprehensive set of security and privacy controls provided by Amazon Web Services (AWS).

Amazon Bedrock Guardrails offers a multi-layered approach to data security, including encryption at rest and in transit, access controls, audit logging, ground truth validation and incident response mechanisms. It also provides advanced security features such as data residency controls, which allow organizations to specify the geographic regions where their data can be stored and processed, maintaining compliance with local data privacy laws.

When using LLM function calling in the healthcare domain, it’s essential to implement robust security measures and follow best practices for handling sensitive patient information. Amazon Bedrock Guardrails can play a crucial role in this regard by helping to provide a secure foundation for deploying and operating healthcare applications and services that use LLM capabilities.

Some key security measures enabled by Amazon Bedrock Guardrails are:

  • Data encryption: Patient data processed by LLM functions can be encrypted at rest and in transit, making sure that sensitive information remains secure even in the event of unauthorized access or data breaches.
  • Access controls: Amazon Bedrock Guardrails enables granular access controls, allowing healthcare organizations to define and enforce strict permissions for who can access, modify, or process patient data through LLM functions.
  • Secure data storage: Patient data can be stored in secure, encrypted storage services such as Amazon S3 or Amazon Elastic File System (Amazon EFS), making sure that sensitive information remains protected even when at rest.
  • Anonymization and pseudonymization: Healthcare organizations can use Amazon Bedrock Guardrails to implement data anonymization and pseudonymization techniques, making sure that patient data used for training or testing LLM models doesn’t contain personally identifiable information (PII).
  • Audit logging and monitoring: Comprehensive audit logging and monitoring capabilities provided by Amazon Bedrock Guardrails enable healthcare organizations to track and monitor all access and usage of patient data by LLM functions, enabling timely detection and response to potential security incidents.
  • Regular security audits and assessments: Amazon Bedrock Guardrails facilitates regular security audits and assessments, making sure that the healthcare organization’s data protection measures remain up-to-date and effective in the face of evolving security threats and regulatory requirements.

By using Amazon Bedrock Guardrails, healthcare organizations can confidently deploy LLM function calling in their applications and services, maintaining robust data security, privacy protection, and regulatory compliance while enabling the transformative benefits of AI-powered healthcare assistants.
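As an illustration of configuring such controls programmatically, the following is a minimal sketch that creates a guardrail to anonymize a few common PII types with boto3. The guardrail name, the specific PII entity types, and the blocked messaging are assumptions for illustration; a production healthcare workload would define a broader policy aligned with its compliance requirements.

```python
import boto3

bedrock = boto3.client("bedrock", region_name="us-east-1")  # assumed Region

# Create a guardrail that masks common PII types in model inputs and outputs.
response = bedrock.create_guardrail(
    name="hcls-pii-guardrail",  # hypothetical name
    description="Anonymize PII in healthcare assistant conversations",
    sensitiveInformationPolicyConfig={
        "piiEntitiesConfig": [
            {"type": "NAME", "action": "ANONYMIZE"},
            {"type": "EMAIL", "action": "ANONYMIZE"},
            {"type": "PHONE", "action": "ANONYMIZE"},
        ]
    },
    blockedInputMessaging="Sorry, I can't process that request.",
    blockedOutputsMessaging="Sorry, I can't share that information.",
)

print(response["guardrailId"], response["version"])
```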

Case studies and real-world examples

3M Health Information Systems is collaborating with AWS to accelerate AI innovation in clinical documentation by using AWS machine learning (ML) services, compute power, and LLM capabilities. This collaboration aims to enhance 3M’s natural language processing (NLP) and ambient clinical voice technologies, enabling intelligent healthcare agents to capture and document patient encounters more efficiently and accurately. These agents, powered by LLMs, can understand and process natural language inputs from healthcare providers, such as spoken notes or queries, and use LLM function calling to access and integrate relevant medical data from EHRs, knowledge bases, and other data sources. By combining 3M’s domain expertise with AWS ML and LLM capabilities, the companies can improve clinical documentation workflows, reduce administrative burdens for healthcare providers, and ultimately enhance patient care through more accurate and comprehensive documentation.

GE Healthcare developed Edison, a secure intelligence solution running on AWS, to ingest and analyze data from medical devices and hospital information systems. This solution uses AWS analytics, ML, and Internet of Things (IoT) services to generate insights and analytics that can be delivered through intelligent healthcare agents powered by LLMs. These agents, equipped with LLM function calling capabilities, can seamlessly access and integrate the insights and analytics generated by Edison, enabling them to assist healthcare providers in improving operational efficiency, enhancing patient outcomes, and supporting the development of new smart medical devices. By using LLM function calling to retrieve and process relevant data from Edison, the agents can provide healthcare providers with data-driven recommendations and personalized support, ultimately enabling better patient care and more effective healthcare delivery.

Future trends and developments

Future advancements in LLM function calling for healthcare might include more advanced natural language processing capabilities, such as improved context understanding, multi-turn conversational abilities, and better handling of ambiguity and nuances in medical language. Additionally, the integration of LLM models with other AI technologies, such as computer vision and speech recognition, could enable multimodal interactions and analysis of various medical data formats.

Emerging technologies such as multimodal models, which can process and generate text, images, and other data formats simultaneously, could enhance LLM function calling in healthcare by enabling more comprehensive analysis and visualization of medical data. Personalized language models, trained on individual patient data, could provide even more tailored and accurate responses. Federated learning techniques, which allow model training on decentralized data while preserving privacy, could address data-sharing challenges in healthcare.

These advancements and emerging technologies could shape the future of healthcare agents by making them more intelligent, adaptive, and personalized. Agents could seamlessly integrate multimodal data, such as medical images and lab reports, into their analysis and recommendations. They could also continuously learn and adapt to individual patients’ preferences and health conditions, providing truly personalized care. Additionally, federated learning could enable collaborative model development while maintaining data privacy, fostering innovation and knowledge sharing across healthcare organizations.

Conclusion

LLM function calling has the potential to revolutionize the healthcare industry by enabling intelligent agents that can understand natural language, access and integrate various data sources, and provide personalized recommendations and insights. By combining the language understanding capabilities of LLMs with external data sources and computational resources, healthcare organizations can enhance decision-making, improve operational efficiency, and deliver superior patient experiences. However, addressing data privacy and security concerns is crucial for the successful adoption of this technology in the healthcare domain.

As the healthcare industry continues to embrace digital transformation, we encourage readers to explore and experiment with LLM function calling in their respective domains. By using this technology, healthcare organizations can unlock new possibilities for improving patient care, advancing medical research, and streamlining operations. With a focus on innovation, collaboration, and responsible implementation, the healthcare industry can harness the power of LLM function calling to create a more efficient, personalized, and data-driven future. AWS can help organizations use LLM function calling and build intelligent healthcare assistants through its AI/ML services, including Amazon Bedrock, Amazon Lex, and Lambda, while maintaining robust security and compliance using Amazon Bedrock Guardrails. To learn more, see AWS for Healthcare & Life Sciences.


About the Authors

Laks Sundararajan is a seasoned Enterprise Architect helping companies reset, transform and modernize their IT, digital, cloud, data and insight strategies. A proven leader with significant expertise around Generative AI, Digital, Cloud and Data/Analytics Transformation, Laks is a Sr. Solutions Architect with Healthcare and Life Sciences (HCLS).

Subha Venugopal is a Senior Solutions Architect at AWS with over 15 years of experience in the technology and healthcare sectors. Specializing in digital transformation, platform modernization, and AI/ML, she leads AWS Healthcare and Life Sciences initiatives. Subha is dedicated to enabling equitable healthcare access and is passionate about mentoring the next generation of professionals.

Read More

Getting started with computer use in Amazon Bedrock Agents

Computer use is a breakthrough capability from Anthropic that allows foundation models (FMs) to visually perceive and interpret digital interfaces. This capability enables Anthropic’s Claude models to identify what’s on a screen, understand the context of UI elements, and recognize actions that should be performed such as clicking buttons, typing text, scrolling, and navigating between applications. However, the model itself doesn’t execute these actions—it requires an orchestration layer to safely implement the supported actions.

Today, we’re announcing computer use support within Amazon Bedrock Agents using Anthropic’s Claude 3.5 Sonnet v2 and Claude 3.7 Sonnet models on Amazon Bedrock. This integration brings Anthropic’s visual perception capabilities as a managed tool within Amazon Bedrock Agents, providing you with a secure, traceable, and managed way to implement computer use automation in your workflows.

Organizations across industries struggle with automating repetitive tasks that span multiple applications and systems of record. Whether processing invoices, updating customer records, or managing human resource (HR) documents, these workflows often require employees to manually transfer information between different systems – a process that’s time-consuming, error-prone, and difficult to scale.

Traditional automation approaches require custom API integrations for each application, creating significant development overhead. Computer use capabilities change this paradigm by allowing machines to perceive existing interfaces just as humans.

In this post, we create a computer use agent demo that provides the critical orchestration layer that transforms computer use from a perception capability into actionable automation. Without this orchestration layer, computer use would only identify potential actions without executing them. The computer use agent demo powered by Amazon Bedrock Agents provides the following benefits:

  • Secure execution environment – Execution of computer use tools in a sandbox environment with limited access to the AWS ecosystem and the web. It is important to note that Amazon Bedrock Agents doesn’t currently provide a sandbox environment, so this demo supplies its own
  • Comprehensive logging – Ability to track each action and interaction for auditing and debugging
  • Detailed tracing capabilities – Visibility into each step of the automated workflow
  • Simplified testing and experimentation – Reduced risk when working with this experimental capability through managed controls
  • Seamless orchestration – Coordination of complex workflows across multiple systems without custom code

This integration combines Anthropic’s perceptual understanding of digital interfaces with the orchestration capabilities of Amazon Bedrock Agents, creating a powerful agent for automating complex workflows across applications. Rather than build custom integrations for each system, developers can now create agents that perceive and interact with existing interfaces in a managed, secure way.

With computer use, Amazon Bedrock Agents can automate tasks through basic GUI actions and built-in Linux commands. For example, your agent could take screenshots, create and edit text files, and run built-in Linux commands. Using Amazon Bedrock Agents and compatible Anthropic’s Claude models, you can use the following action groups:

  • Computer tool – Enables interactions with user interfaces (clicking, typing, scrolling)
  • Text editor tool – Provides capabilities to edit and manipulate files
  • Bash – Allows execution of built-in Linux commands

Solution overview

An example computer use workflow consists of the following steps:

  1. Create an Amazon Bedrock agent and use natural language to describe what the agent should do and how it should interact with users, for example: “You are computer use agent capable of using Firefox web browser for web search.”
  2. Add the Amazon Bedrock Agents supported computer use action groups to your agent using CreateAgentActionGroup API.
  3. Invoke the agent with a user query that requires computer use tools, for example, “What is Amazon Bedrock, can you search the web?”
  4. The Amazon Bedrock agent uses the tool definitions at its disposal and decides to use the computer action group to take a screenshot of the environment. Using the return control capability of Amazon Bedrock Agents, the agent then responds with the tool or tools that it wants to execute. The return control capability is required for using computer use with Amazon Bedrock Agents.
  5. The workflow parses the agent response and executes the tool returned in a sandbox environment. The output is given back to the Amazon Bedrock agent for further processing.
  6. The Amazon Bedrock agent continues to respond with tools at its disposal until the task is complete.

You can recreate this example in the us-west-2 AWS Region with the AWS Cloud Development Kit (AWS CDK) by following the instructions in the GitHub repository. This demo deploys a containerized application using AWS Fargate across two Availability Zones in the us-west-2 Region. The infrastructure operates within a virtual private cloud (VPC) containing public subnets in each Availability Zone, with an internet gateway providing external connectivity. The architecture is complemented by essential supporting services, including AWS Key Management Service (AWS KMS) for security and Amazon CloudWatch for monitoring, creating a resilient, serverless container environment that alleviates the need to manage underlying infrastructure while maintaining robust security and high availability.

The following diagram illustrates the solution architecture.

At the core of our solution are two Fargate containers managed through Amazon Elastic Container Service (Amazon ECS), each protected by its own security group. The first is our orchestration container, which not only handles the communication between Amazon Bedrock Agents and end users, but also orchestrates the workflow that enables tool execution. The second is our environment container, which serves as a secure sandbox where the Amazon Bedrock agent can safely run its computer use tools. The environment container has limited access to the rest of the ecosystem and the internet. We utilize service discovery to connect Amazon ECS services with DNS names.

The orchestration container includes the following components:

  • Streamlit UI – The Streamlit UI that facilitates interaction between the end user and computer use agent
  • Return control loop – The workflow responsible for parsing the tools that the agent wants to execute and returning the output of these tools

The environment container includes the following components:

  • UI and pre-installed applications – A lightweight UI and pre-installed Linux applications like Firefox that can be used to complete the user’s tasks
  • Tool implementation – Code that can execute computer use tools in the environment, such as “screenshot” or “double-click”
  • Quart (RESTful) JSON API – A RESTful JSON API built with Quart that the orchestration container calls to execute tools in the sandbox environment

The following diagram illustrates these components.

Prerequisites

  1. AWS Command Line Interface (AWS CLI) installed and configured with credentials, following the instructions here.
  2. Python 3.11 or later.
  3. Node.js 14.15.0 or later.
  4. AWS CDK CLI installed, following the instructions here.
  5. Model access enabled for Anthropic’s Claude 3.5 Sonnet V2 and Anthropic’s Claude 3.7 Sonnet.
  6. Boto3 version 1.37.10 or later.

Create an Amazon Bedrock agent with computer use

You can use the following code sample to create a simple Amazon Bedrock agent with computer, bash, and text editor action groups. It is crucial to provide a compatible action group signature when using Anthropic’s Claude 3.5 Sonnet V2 and Anthropic’s Claude 3.7 Sonnet as highlighted here.

The following list maps each model to its action group signatures:

  • Anthropic’s Claude 3.5 Sonnet V2 – computer_20241022, text_editor_20241022, bash_20241022
  • Anthropic’s Claude 3.7 Sonnet – computer_20250124, text_editor_20250124, bash_20250124

import boto3
import time

# Step 1: Create the Amazon Bedrock Agents client

bedrock_agent = boto3.client("bedrock-agent", region_name="us-west-2")

# Step 2: Create an agent

agent_role_arn = "arn:aws:iam::<YOUR_ACCOUNT_ID>:role/<YOUR_AGENT_ROLE>"  # Amazon Bedrock Agents execution role

create_agent_response = bedrock_agent.create_agent(
    agentResourceRoleArn=agent_role_arn,
    agentName="computeruse",
    description="""Example agent for computer use.
        This agent should only operate on
        sandbox environments with limited privileges.""",
    foundationModel="us.anthropic.claude-3-7-sonnet-20250219-v1:0",
    instruction="""You are computer use agent capable of using Firefox
        web browser for web search.""",
)

time.sleep(30) # wait for agent to be created

# Step 3.1: Create and attach computer action group

bedrock_agent.create_agent_action_group(
    actionGroupName="ComputerActionGroup",
    actionGroupState="ENABLED",
    agentId=create_agent_response["agent"]["agentId"],
    agentVersion="DRAFT",
    parentActionGroupSignature="ANTHROPIC.Computer",
    parentActionGroupSignatureParams={
        "type": "computer_20250124",
        "display_height_px": "768",
        "display_width_px": "1024",
        "display_number": "1",
    },
)

# Step 3.2: Create and attach bash action group

bedrock_agent.create_agent_action_group(
    actionGroupName="BashActionGroup",
    actionGroupState="ENABLED",
    agentId=create_agent_response["agent"]["agentId"],
    agentVersion="DRAFT",
    parentActionGroupSignature="ANTHROPIC.Bash",
    parentActionGroupSignatureParams={
        "type": "bash_20250124",
    },
)

# Step 3.3: Create and attach text editor action group

bedrock_agent.create_agent_action_group(
    actionGroupName="TextEditorActionGroup",
    actionGroupState="ENABLED",
    agentId=create_agent_response["agent"]["agentId"],
    agentVersion="DRAFT",
    parentActionGroupSignature="ANTHROPIC.TextEditor",
    parentActionGroupSignatureParams={
        "type": "text_editor_20250124",
    },
)

# Step 3.4: Create and attach weather action group (uses return control)

bedrock_agent.create_agent_action_group(
    actionGroupName="WeatherActionGroup",
    agentId=create_agent_response["agent"]["agentId"],
    agentVersion="DRAFT",
    actionGroupExecutor={
        "customControl": "RETURN_CONTROL",
    },
    functionSchema={
        "functions": [
            {
                "name": "get_current_weather",
                "description": "Get the current weather in a given location.",
                "parameters": {
                    "location": {
                        "type": "string",
                        "description": "The city, e.g., San Francisco",
                        "required": True,
                    },
                    "unit": {
                        "type": "string",
                        "description": 'The unit to use, e.g., fahrenheit or celsius. Defaults to "fahrenheit"',
                        "required": False,
                    },
                },
                "requireConfirmation": "DISABLED",
            }
        ]
    },
)
time.sleep(10) # wait for the action groups to be attached

# Step 4: Prepare agent

bedrock_agent.prepare_agent(agentId=create_agent_response["agent"]["agentId"])
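
After the agent is prepared, you can create an alias and invoke the agent, handling the tools it returns through the return control capability. The following is a minimal, single-turn sketch rather than the full workflow from the GitHub repository; the agent ID, alias ID, and the execute_tool helper (your own code that runs the requested tool inside the sandbox environment container) are placeholders.

import uuid
import boto3

bedrock_agent_runtime = boto3.client("bedrock-agent-runtime", region_name="us-west-2")
session_id = str(uuid.uuid4())

# Send the user query to the agent
response = bedrock_agent_runtime.invoke_agent(
    agentId="<YOUR_AGENT_ID>",
    agentAliasId="<YOUR_AGENT_ALIAS_ID>",
    sessionId=session_id,
    inputText="What is Amazon Bedrock, can you search the web?",
)

# The agent answers with return control events describing the tools it wants to run
for event in response["completion"]:
    if "returnControl" in event:
        rc = event["returnControl"]
        # Run each requested tool in the sandbox and collect the results
        results = [execute_tool(tool_input) for tool_input in rc["invocationInputs"]]
        # Return the tool outputs so the agent can continue the task
        response = bedrock_agent_runtime.invoke_agent(
            agentId="<YOUR_AGENT_ID>",
            agentAliasId="<YOUR_AGENT_ALIAS_ID>",
            sessionId=session_id,
            sessionState={
                "invocationId": rc["invocationId"],
                "returnControlInvocationResults": results,
            },
        )
    elif "chunk" in event:
        print(event["chunk"]["bytes"].decode())

The demo application repeats this loop until the agent stops requesting tools and returns a final answer.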

Example use case

In this post, we demonstrate an example where we use Amazon Bedrock Agents with the computer use capability to complete a web form. In the example, the computer use agent can also switch Firefox tabs to interact with a customer relationship management (CRM) agent to get the required information to complete the form. Although this example uses a sample CRM application as the system of record, the same approach works with Salesforce, SAP, Workday, or other systems of record with the appropriate authentication frameworks in place.

In the demonstrated use case, you can observe how well the Amazon Bedrock agent performed with computer use tools. Our implementation completed the customer ID, customer name, and email fields by visually examining the Excel data. However, for the overview field, it decided to select the cell and copy the data, because the information wasn’t completely visible on the screen. Finally, the CRM agent was used to get additional information about the customer.

Best practices

The following are some ways you can improve the performance for your use case:

Considerations

The computer use feature is made available to you as a beta service as defined in the AWS Service Terms. It is subject to your agreement with AWS and the AWS Service Terms, and the applicable model EULA. Computer use poses unique risks that are distinct from standard API features or chat interfaces. These risks are heightened when using the computer use feature to interact with the internet. To minimize risks, consider taking precautions such as:

  • Operate computer use functionality in a dedicated virtual machine or container with minimal privileges to minimize direct system exploits or accidents
  • To help prevent information theft, avoid giving the computer use API access to sensitive accounts or data
  • Limit the computer use API’s internet access to required domains to reduce exposure to malicious content
  • To enforce proper oversight, keep a human in the loop for sensitive tasks (such as making decisions that could have meaningful real-world consequences) and for anything requiring affirmative consent (such as accepting cookies, executing financial transactions, or agreeing to terms of service)

Any content that you enable Anthropic’s Claude to see or access can potentially override instructions or cause the model to make mistakes or perform unintended actions. Taking proper precautions, such as isolating Anthropic’s Claude from sensitive surfaces, is essential to avoid risks, including those related to prompt injection. Before enabling or requesting permissions necessary to enable computer use features in your own products, inform end users of any relevant risks, and obtain their consent as appropriate.

Clean up

When you are done using this solution, make sure to clean up all the resources. Follow the instructions in the provided GitHub repository.

Conclusion

Organizations across industries face significant challenges with cross-application workflows that traditionally require manual data entry or complex custom integrations. The integration of Anthropic’s computer use capability with Amazon Bedrock Agents represents a transformative approach to these challenges.

By using Amazon Bedrock Agents as the orchestration layer, organizations can alleviate the need for custom API development for each application, benefit from comprehensive logging and tracing capabilities essential for enterprise deployment, and implement automation solutions quickly.

As you begin exploring computer use with Amazon Bedrock Agents, consider workflows in your organization that could benefit from this approach. From invoice processing to customer onboarding, HR documentation to compliance reporting, the potential applications are vast and transformative.

We’re excited to see how you will use Amazon Bedrock Agents with the computer use capability to securely streamline operations and reimagine business processes through AI-driven automation.

Resources

To learn more, refer to the following resources:


About the Authors

Eashan Kaushik is a Specialist Solutions Architect AI/ML at Amazon Web Services. He is driven by creating cutting-edge generative AI solutions while prioritizing a customer-centric approach to his work. Before this role, he obtained an MS in Computer Science from NYU Tandon School of Engineering. Outside of work, he enjoys sports, lifting, and running marathons.

Maira Ladeira Tanke is a Tech Lead for Agentic workloads in Amazon Bedrock at AWS, where she enables customers on their journey to develop autonomous AI systems. With over 10 years of experience in AI/ML, Maira partners with enterprise customers to accelerate the adoption of agentic applications using Amazon Bedrock, helping organizations harness the power of foundation models to drive innovation and business transformation. In her free time, Maira enjoys traveling, playing with her cat, and spending time with her family someplace warm.

Raj Pathak is a Principal Solutions Architect and Technical advisor to Fortune 50 and Mid-Sized FSI (Banking, Insurance, Capital Markets) customers across Canada and the United States. Raj specializes in Machine Learning with applications in Generative AI, Natural Language Processing, Intelligent Document Processing, and MLOps.

Adarsh Srikanth is a Software Development Engineer at Amazon Bedrock, where he develops AI agent services. He holds a master’s degree in computer science from USC and brings three years of industry experience to his role. He spends his free time exploring national parks, discovering new hiking trails, and playing various racquet sports.

Abishek Kumar is a Senior Software Engineer at Amazon, bringing over 6 years of valuable experience across both retail and AWS organizations. He has demonstrated expertise in developing generative AI and machine learning solutions, specifically contributing to key AWS services including SageMaker Autopilot, SageMaker Canvas, and AWS Bedrock Agents. Throughout his career, Abishek has shown passion for solving complex problems and architecting large-scale systems that serve millions of customers worldwide. When not immersed in technology, he enjoys exploring nature through hiking and traveling adventures with his wife.

Krishna Gourishetti is a Senior Software Engineer for the Bedrock Agents team in AWS. He is passionate about building scalable software solutions that solve customer problems. In his free time, Krishna loves to go on hikes.

Read More

Evaluating RAG applications with Amazon Bedrock knowledge base evaluation

Evaluating RAG applications with Amazon Bedrock knowledge base evaluation

Organizations building and deploying AI applications, particularly those using large language models (LLMs) with Retrieval Augmented Generation (RAG) systems, face a significant challenge: how to evaluate AI outputs effectively throughout the application lifecycle. As these AI technologies become more sophisticated and widely adopted, maintaining consistent quality and performance becomes increasingly complex.

Traditional AI evaluation approaches have significant limitations. Human evaluation, although thorough, is time-consuming and expensive at scale. Although automated metrics are fast and cost-effective, they can only evaluate the correctness of an AI response, without capturing other evaluation dimensions or providing explanations of why an answer is problematic. Furthermore, traditional automated evaluation metrics typically require ground truth data, which for many AI applications is difficult to obtain. Especially for those involving open-ended generation or retrieval augmented systems, defining a single “correct” answer is practically impossible. Finally, metrics such as ROUGE and F1 can be fooled by shallow linguistic similarities (word overlap) between the ground truth and the LLM response, even when the actual meaning is very different. These challenges make it difficult for organizations to maintain consistent quality standards across their AI applications, particularly for generative AI outputs.

Amazon Bedrock has recently launched two new capabilities to address these evaluation challenges: LLM-as-a-judge (LLMaaJ) under Amazon Bedrock Evaluations and a brand new RAG evaluation tool for Amazon Bedrock Knowledge Bases. Both features rely on the same LLM-as-a-judge technology under the hood, with slight differences depending on whether a model or a RAG application built with Amazon Bedrock Knowledge Bases is being evaluated. These evaluation features combine the speed of automated methods with human-like nuanced understanding, enabling organizations to:

  • Assess AI model outputs across various tasks and contexts
  • Evaluate multiple evaluation dimensions of AI performance simultaneously
  • Systematically assess both retrieval and generation quality in RAG systems
  • Scale evaluations across thousands of responses while maintaining quality standards

These capabilities integrate seamlessly into the AI development lifecycle, empowering organizations to improve model and application quality, promote responsible AI practices, and make data-driven decisions about model selection and application deployment.

This post focuses on RAG evaluation with Amazon Bedrock Knowledge Bases, provides a guide to set up the feature, discusses nuances to consider as you evaluate your prompts and responses, and finally discusses best practices. By the end of this post, you will understand how the latest Amazon Bedrock evaluation features can streamline your approach to AI quality assurance, enabling more efficient and confident development of RAG applications.

Key features

Before diving into the implementation details, we examine the key features that make the capabilities of RAG evaluation on Amazon Bedrock Knowledge Bases particularly powerful. The key features are:

  1. Amazon Bedrock Evaluations
    • Evaluate Amazon Bedrock Knowledge Bases directly within the service
    • Systematically evaluate both retrieval and generation quality in RAG systems to inform changes to knowledge base build-time or runtime parameters
  2. Comprehensive, understandable, and actionable evaluation metrics
    • Retrieval metrics: Assess context relevance and coverage using an LLM as a judge
    • Generation quality metrics: Measure correctness, faithfulness (to detect hallucinations), completeness, and more
    • Provide natural language explanations for each score in the output and on the console
    • Compare results across multiple evaluation jobs for both retrieval and generation
    • Metric scores are normalized to a 0–1 range
  3. Scalable and efficient assessment
    • Scale evaluation across thousands of responses
    • Reduce costs compared to manual evaluation while maintaining high quality standards
  4. Flexible evaluation framework
    • Support both ground truth and reference-free evaluations
    • Equip users to select from a variety of metrics for evaluation
    • Support evaluating fine-tuned or distilled models on Amazon Bedrock
    • Provide a choice of evaluator models
  5. Model selection and comparison
    • Compare evaluation jobs across different generating models
    • Facilitate data-driven optimization of model performance
  6. Responsible AI integration
    • Incorporate built-in responsible AI metrics such as harmfulness, answer refusal, and stereotyping
    • Seamlessly integrate with Amazon Bedrock Guardrails

These features enable organizations to comprehensively assess AI performance, promote responsible AI development, and make informed decisions about model selection and optimization throughout the AI application lifecycle. Now that we’ve explained the key features, we examine how these capabilities come together in a practical implementation.

Feature overview

The Amazon Bedrock Knowledge Bases RAG evaluation feature provides a comprehensive, end-to-end solution for assessing and optimizing RAG applications. This automated process uses the power of LLMs to evaluate both retrieval and generation quality, offering insights that can significantly improve your AI applications.

The workflow is as follows, as shown moving from left to right in the following architecture diagram:

  1. Prompt dataset – Prepared set of prompts, optionally including ground truth responses
  2. JSONL file – Prompt dataset converted to JSONL format for the evaluation job
  3. Amazon Simple Storage Service (Amazon S3) bucket – Storage for the prepared JSONL file
  4. Amazon Bedrock Knowledge Bases RAG evaluation job – Core component that processes the data, integrating with Amazon Bedrock Guardrails and Amazon Bedrock Knowledge Bases.
  5. Automated report generation – Produces a comprehensive report with detailed metrics and insights at individual prompt or conversation level
  6. Analyze the report to derive actionable insights for RAG system optimization

Designing holistic RAG evaluations: Balancing cost, quality, and speed

RAG system evaluation requires a balanced approach that considers three key aspects: cost, speed, and quality. Although Amazon Bedrock Evaluations primarily focuses on quality metrics, understanding all three components helps create a comprehensive evaluation strategy. The following diagram shows how these components interact and feed into a comprehensive evaluation strategy, and the next sections examine each component in detail.

Cost and speed considerations

The efficiency of RAG systems depends on model selection and usage patterns. Costs are primarily driven by data retrieval and token consumption during retrieval and generation, and speed depends on model size and complexity as well as prompt and context size. For applications requiring high-performance content generation with lower latency and costs, model distillation can be an effective way to create a generator model. The result is smaller, faster models that maintain the quality of larger models for specific use cases.

Quality assessment framework

Amazon Bedrock knowledge base evaluation provides comprehensive insights through various quality dimensions:

  • Technical quality through metrics such as context relevance and faithfulness
  • Business alignment through correctness and completeness scores
  • User experience through helpfulness and logical coherence measurements
  • Responsible AI through built-in metrics such as harmfulness, stereotyping, and answer refusal

Establishing baseline understanding

Begin your evaluation process by choosing default configurations in your knowledge base (vector or graph database), such as default chunking strategies, embedding models, and prompt templates. These are just some of the possible options. This approach establishes a baseline performance, helping you understand your RAG system’s current effectiveness across available evaluation metrics before optimization. Next, create a diverse evaluation dataset. Make sure this dataset contains a diverse set of queries and knowledge sources that accurately reflect your use case. The diversity of this dataset will provide a comprehensive view of your RAG application performance in production.

Iterative improvement process

Understanding how different components affect these metrics enables informed decisions about:

  • Knowledge base configuration (chunking strategy or embedding size or model) and inference parameter refinement
  • Retrieval strategy modifications (semantic or hybrid search)
  • Prompt engineering refinements
  • Model selection and inference parameter configuration
  • Choice between different vector stores including graph databases

Continuous evaluation and improvement

Implement a systematic approach to ongoing evaluation:

  • Schedule regular offline evaluation cycles aligned with knowledge base updates
  • Track metric trends over time to identify areas for improvement
  • Use insights to guide knowledge base refinements and generator model customization and selection

Prerequisites

To use the knowledge base evaluation feature, make sure that you have satisfied the following requirements:

  • An active AWS account.
  • Selected evaluator and generator models enabled in Amazon Bedrock. You can confirm that the models are enabled for your account on the Model access page of the Amazon Bedrock console.
  • Confirm the AWS Regions where the model is available and quotas.
  • Complete the knowledge base evaluation prerequisites related to AWS Identity and Access Management (IAM) creation and add permissions for an S3 bucket to access and write output data.
  • Have an Amazon Bedrock knowledge base created and sync your data such that it’s ready to be used by a knowledge base evaluation job.
  • If you’re using a custom model instead of an on-demand model for your generator model, make sure you have sufficient quota for running a Provisioned Throughput during inference. Go to the Service Quotas console and check the following quotas:
    • Model units no-commitment Provisioned Throughputs across custom models
    • Model units per provisioned model for [your custom model name]
    • Both fields need to have enough quota to support your Provisioned Throughput model unit. Request a quota increase if necessary to accommodate your expected inference workload.

Prepare input dataset

To prepare your dataset for a knowledge base evaluation job, you need to follow two important steps:

  1. Dataset requirements:
    1. Maximum 1,000 conversations per evaluation job (1 conversation is contained in the conversationTurns key in the dataset format)
    2. Maximum 5 turns (prompts) per conversation
    3. File must use JSONL format (.jsonl extension)
    4. Each line must be a valid JSON object and complete prompt
    5. Stored in an S3 bucket with CORS enabled
  2. Use the following formats:
    1. Retrieve-only evaluation jobs:

Special note: On March 20, 2025, the referenceContexts key will change to referenceResponses. The content of referenceResponses should be the expected ground truth answer that an end-to-end RAG system would have generated given the prompt, not the expected passages/chunks retrieved from the Knowledge Base.

{
    "conversationTurns": [{
        ## required for Context Coverage metric
        "referenceContexts": [{
            "content": [{
                "text": "This is reference retrieved context"
            }]
        }],
        ## your prompt to the model
        "prompt": {
            "content": [{
                "text": "This is a prompt"
            }]
        }
    }]
}
    2. Retrieve and generate evaluation jobs:
{
    "conversationTurns": [{
        ##optional
        "referenceResponses": [{
            "content": [{
                "text": "This is a reference response used as groud truth"
            }]
        }],
        ## your prompt to the model
        "prompt": {
            "content": [{
                "text": "This is a prompt"
            }]
        }
    }]
}
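
You can also generate the JSONL dataset programmatically. The following is a minimal sketch, assuming you have prompt and ground truth pairs in memory and an S3 bucket with CORS enabled; the bucket name and object keys are placeholders.

import json
import boto3

# Placeholder prompt and ground truth pairs
examples = [
    {
        "prompt": "This is a prompt",
        "reference": "This is a reference response used as ground truth",
    },
]

# Write one conversation per line in the retrieve and generate format shown above
with open("input.jsonl", "w") as f:
    for example in examples:
        record = {
            "conversationTurns": [{
                "referenceResponses": [{"content": [{"text": example["reference"]}]}],
                "prompt": {"content": [{"text": example["prompt"]}]},
            }]
        }
        f.write(json.dumps(record) + "\n")

# Upload the dataset to the S3 location that the evaluation job will read
boto3.client("s3").upload_file("input.jsonl", "<YOUR_BUCKET>", "evaluation_data/input.jsonl")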

Start a knowledge base RAG evaluation job using the console

Amazon Bedrock Evaluations provides you with an option to run an evaluation job through a guided user interface on the console. To start an evaluation job through the console, follow these steps:

  1. On the Amazon Bedrock console, under Inference and Assessment in the navigation pane, choose Evaluations and then choose Knowledge Bases.
  2. Choose Create, as shown in the following screenshot.
  3. Enter an Evaluation name and a Description, and choose an Evaluator model, as shown in the following screenshot. This model will be used as a judge to evaluate the response of the RAG application.
  4. Choose the knowledge base and the evaluation type, as shown in the following screenshot. Choose Retrieval only if you want to evaluate only the retrieval component and Retrieval and response generation if you want to evaluate the end-to-end retrieval and response generation. Select a model, which will be used for generating responses in this evaluation job.
  5. (Optional) To change inference parameters, choose configurations. You can update or experiment with different values of temperature, top-P, update knowledge base prompt templates, associate guardrails, update search strategy, and configure numbers of chunks retrieved. The following screenshot shows the Configurations screen.
  6. Choose the Metrics you would like to use to evaluate the RAG application, as shown in the following screenshot.
  7. Provide the S3 URI for the evaluation data (the dataset prepared earlier) and for the evaluation results. You can use the Browse S3 option to select these locations.
  8. Select a service (IAM) role with the proper permissions. This includes service access to Amazon Bedrock, the S3 buckets in the evaluation job, the knowledge base in the job, and the models being used in the job. You can also create a new IAM role in the evaluation setup and the service will automatically give the role the proper permissions for the job.
  9. Choose Create.
  10. You will be able to check the evaluation job In Progress status on the Knowledge Base evaluations screen, as shown in the following screenshot.
  11. Wait for the job to be complete. This could be 10–15 minutes for a small job or a few hours for a large job with hundreds of long prompts and all metrics selected. When the evaluation job has been completed, the status will show as Completed, as shown in the following screenshot.
  12. When it’s complete, select the job, and you’ll be able to observe the details of the job. The following screenshot is the Metric summary.
  13. You should also observe a directory with the evaluation job name in the Amazon S3 path. You can find the output S3 path from your job results page in the evaluation summary section.
  14. You can compare two evaluation jobs to gain insights about how different configurations or selections are performing. You can view a radar chart comparing performance metrics between two RAG evaluation jobs, making it simple to visualize relative strengths and weaknesses across different dimensions, as shown in the following screenshot.

On the Evaluation details tab, examine score distributions through histograms for each evaluation metric, showing average scores and percentage differences. Hover over the histogram bars to check the number of conversations in each score range, helping identify patterns in performance, as shown in the following screenshots.

Start a knowledge base evaluation job using Python SDK and APIs

To use the Python SDK for creating a knowledge base evaluation job, follow these steps. First, set up the required configurations:

import boto3
from datetime import datetime

# Generate unique name for the job
job_name = f"kb-evaluation-{datetime.now().strftime('%Y-%m-%d-%H-%M-%S')}"

# Configure your knowledge base and model settings
knowledge_base_id = "<YOUR_KB_ID>"
evaluator_model = "mistral.mistral-large-2402-v1:0"
generator_model = "anthropic.claude-3-sonnet-20240229-v1:0"
role_arn = "arn:aws:iam::<YOUR_ACCOUNT_ID>:role/<YOUR_IAM_ROLE>"

# Specify S3 locations for evaluation data and output
input_data = "s3://<YOUR_BUCKET>/evaluation_data/input.jsonl"
output_path = "s3://<YOUR_BUCKET>/evaluation_output/"

# Configure retrieval settings
num_results = 10
search_type = "HYBRID"

# Create Bedrock client
bedrock_client = boto3.client('bedrock')

For retrieval-only evaluation, create a job that focuses on assessing the quality of retrieved contexts:

retrieval_job = bedrock_client.create_evaluation_job(
    jobName=job_name,
    jobDescription="Evaluate retrieval performance",
    roleArn=role_arn,
    applicationType="RagEvaluation",
    inferenceConfig={
        "ragConfigs": [{
            "knowledgeBaseConfig": {
                "retrieveConfig": {
                    "knowledgeBaseId": knowledge_base_id,
                    "knowledgeBaseRetrievalConfiguration": {
                        "vectorSearchConfiguration": {
                            "numberOfResults": num_results,
                            "overrideSearchType": search_type
                        }
                    }
                }
            }
        }]
    },
    outputDataConfig={
        "s3Uri": output_path
    },
    evaluationConfig={
        "automated": {
            "datasetMetricConfigs": [{
                "taskType": "Custom",
                "dataset": {
                    "name": "RagDataset",
                    "datasetLocation": {
                        "s3Uri": input_data
                    }
                },
                "metricNames": [
                    "Builtin.ContextRelevance",
                    "Builtin.ContextCoverage"
                ]
            }],
            "evaluatorModelConfig": {
                "bedrockEvaluatorModels": [{
                    "modelIdentifier": evaluator_model
                }]
            }
        }
    }
)

For a complete evaluation of both retrieval and generation, use this configuration:

retrieve_generate_job=bedrock_client.create_evaluation_job(
    jobName=job_name,
    jobDescription="Evaluate retrieval and generation",
    roleArn=role_arn,
    applicationType="RagEvaluation",
    inferenceConfig={
        "ragConfigs": [{
            "knowledgeBaseConfig": {
                "retrieveAndGenerateConfig": {
                    "type": "KNOWLEDGE_BASE",
                    "knowledgeBaseConfiguration": {
                        "knowledgeBaseId": knowledge_base_id,
                        "modelArn": generator_model,
                        "retrievalConfiguration": {
                            "vectorSearchConfiguration": {
                                "numberOfResults": num_results,
                                "overrideSearchType": search_type
                            }
                        }
                    }
                }
            }
        }]
    },
    outputDataConfig={
        "s3Uri": output_path
    },
    evaluationConfig={
        "automated": {
            "datasetMetricConfigs": [{
                "taskType": "Custom",
                "dataset": {
                    "name": "RagDataset",
                    "datasetLocation": {
                        "s3Uri": input_data
                    }
                },
                "metricNames": [
                    "Builtin.Correctness",
                    "Builtin.Completeness",
                    "Builtin.Helpfulness",
                    "Builtin.LogicalCoherence",
                    "Builtin.Faithfulness"
                ]
            }],
            "evaluatorModelConfig": {
                "bedrockEvaluatorModels": [{
                    "modelIdentifier": evaluator_model
                }]
            }
        }
    }
)

To monitor the progress of your evaluation job, use the following code:

# Depending on the job type, retrieve the job ARN and monitor it to take any downstream actions
evaluation_job_arn = retrieval_job['jobArn']
# evaluation_job_arn = retrieve_generate_job['jobArn']  # use this ARN for the retrieve-and-generate job

response = bedrock_client.get_evaluation_job(
    jobIdentifier=evaluation_job_arn 
)
print(f"Job Status: {response['status']}")

Interpreting results

After your evaluation jobs are completed, Amazon Bedrock RAG evaluation provides a detailed comparative dashboard across the evaluation dimensions.

The evaluation dashboard includes comprehensive metrics, but we focus on one example, the completeness histogram shown below. This visualization represents how well responses cover all aspects of the questions asked. In our example, we notice a strong right-skewed distribution with an average score of 0.921. The majority of responses (15) scored above 0.9, while a small number fell in the 0.5-0.8 range. This type of distribution helps quickly identify if your RAG system has consistent performance or if there are specific cases needing attention.

Selecting specific score ranges in the histogram reveals detailed conversation analyses. For each conversation, you can examine the input prompt, generated response, number of retrieved chunks, ground truth comparison, and most importantly, the detailed score explanation from the evaluator model.

Consider this example response that scored 0.75 for the question, “What are some risks associated with Amazon’s expansion?” Although the generated response provided a structured analysis of operational, competitive, and financial risks, the evaluator model identified missing elements around IP infringement and foreign exchange risks compared to the ground truth. This detailed explanation helps in understanding not just what’s missing, but why the response received its specific score.

This granular analysis is crucial for systematic improvement of your RAG pipeline. By understanding patterns in lower-performing responses and specific areas where context retrieval or generation needs improvement, you can make targeted optimizations to your system—whether that’s adjusting retrieval parameters, refining prompts, or modifying knowledge base configurations.
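
Beyond the console views, the raw results are written to the output S3 path and can be inspected programmatically. The following is a minimal sketch that lists and previews the result records without assuming a particular schema; the bucket, prefix, and .jsonl extension are assumptions based on the output path configured earlier.

import json
import boto3

s3 = boto3.client("s3")
bucket, prefix = "<YOUR_BUCKET>", "evaluation_output/"

# List the result files written by the evaluation job and preview each record
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
    for obj in page.get("Contents", []):
        if obj["Key"].endswith(".jsonl"):
            body = s3.get_object(Bucket=bucket, Key=obj["Key"])["Body"].read().decode("utf-8")
            for line in body.splitlines():
                record = json.loads(line)
                print(json.dumps(record, indent=2)[:500])  # print a preview of each record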

Best practices for implementation

These best practices help build a solid foundation for your RAG evaluation strategy:

  1. Design your evaluation strategy carefully, using representative test datasets that reflect your production scenarios and user patterns. If you have large workloads greater than 1,000 prompts per batch, optimize your workload by employing techniques such as stratified sampling to promote diversity and representativeness within your constraints such as time to completion and costs associated with evaluation.
  2. Schedule periodic batch evaluations aligned with your knowledge base updates and content refreshes because this feature supports batch analysis rather than real-time monitoring.
  3. Balance metrics with business objectives by selecting evaluation dimensions that directly impact your application’s success criteria.
  4. Use evaluation insights to systematically improve your knowledge base content and retrieval settings through iterative refinement.
  5. Maintain clear documentation of evaluation jobs, including the metrics selected and improvements implemented based on results. The job creation configuration settings in your results pages can help keep a historical record here.
  6. Optimize your evaluation batch size and frequency based on application needs and resource constraints to promote cost-effective quality assurance.
  7. Structure your evaluation framework to accommodate growing knowledge bases, incorporating both technical metrics and business KPIs in your assessment criteria.

To help you dive deeper into the scientific validation of these practices, we’ll be publishing a technical deep-dive post that explores detailed case studies using public datasets and internal AWS validation studies. This upcoming post will examine how our evaluation framework performs across different scenarios and demonstrate its correlation with human judgments across various evaluation dimensions. Stay tuned as we explore the research and validation that powers Amazon Bedrock Evaluations.

Conclusion

Amazon Bedrock knowledge base RAG evaluation enables organizations to confidently deploy and maintain high-quality RAG applications by providing comprehensive, automated assessment of both retrieval and generation components. By combining the benefits of managed evaluation with the nuanced understanding of human assessment, this feature allows organizations to scale their AI quality assurance efficiently while maintaining high standards. Organizations can make data-driven decisions about their RAG implementations, optimize their knowledge bases, and follow responsible AI practices through seamless integration with Amazon Bedrock Guardrails.

Whether you’re building customer service solutions, technical documentation systems, or enterprise knowledge base RAG, Amazon Bedrock Evaluations provides the tools needed to deliver reliable, accurate, and trustworthy AI applications. To help you get started, we’ve prepared a Jupyter notebook with practical examples and code snippets. You can find it on our GitHub repository.

We encourage you to explore these capabilities in the Amazon Bedrock console and discover how systematic evaluation can enhance your RAG applications.


About the Authors

Ishan Singh is a Generative AI Data Scientist at Amazon Web Services, where he helps customers build innovative and responsible generative AI solutions and products. With a strong background in AI/ML, Ishan specializes in building Generative AI solutions that drive business value. Outside of work, he enjoys playing volleyball, exploring local bike trails, and spending time with his wife and dog, Beau.

Ayan Ray is a Senior Generative AI Partner Solutions Architect at AWS, where he collaborates with ISV partners to develop integrated Generative AI solutions that combine AWS services with AWS partner products. With over a decade of experience in Artificial Intelligence and Machine Learning, Ayan has previously held technology leadership roles at AI startups before joining AWS. Based in the San Francisco Bay Area, he enjoys playing tennis and gardening in his free time.

Adewale Akinfaderin is a Sr. Data Scientist–Generative AI, Amazon Bedrock, where he contributes to cutting edge innovations in foundational models and generative AI applications at AWS. His expertise is in reproducible and end-to-end AI/ML methods, practical implementations, and helping global customers formulate and develop scalable solutions to interdisciplinary problems. He has two graduate degrees in physics and a doctorate in engineering.

Evangelia Spiliopoulou is an Applied Scientist in the AWS Bedrock Evaluation group, where the goal is to develop novel methodologies and tools to assist automatic evaluation of LLMs. Her overall work focuses on Natural Language Processing (NLP) research and developing NLP applications for AWS customers, including LLM Evaluations, RAG, and improving reasoning for LLMs. Prior to Amazon, Evangelia completed her Ph.D. at Language Technologies Institute, Carnegie Mellon University.

Jesse Manders is a Senior Product Manager on Amazon Bedrock, the AWS Generative AI developer service. He works at the intersection of AI and human interaction with the goal of creating and improving generative AI products and services to meet our needs. Previously, Jesse held engineering team leadership roles at Apple and Lumileds, and was a senior scientist in a Silicon Valley startup. He has an M.S. and Ph.D. from the University of Florida, and an MBA from the University of California, Berkeley, Haas School of Business.

Read More

How GoDaddy built a category generation system at scale with batch inference for Amazon Bedrock

How GoDaddy built a category generation system at scale with batch inference for Amazon Bedrock

This post was co-written with Vishal Singh, Data Engineering Leader at the Data & Analytics team of GoDaddy

Generative AI solutions have the potential to transform businesses by boosting productivity and improving customer experiences, and using large language models (LLMs) in these solutions has become increasingly popular. However, inference of LLMs as single model invocations or API calls doesn’t scale well with many applications in production.

With batch inference, you can run multiple inference requests asynchronously to process a large number of requests efficiently. You can also use batch inference to improve the performance of model inference on large datasets.

This post provides an overview of a custom solution developed by the Generative AI Innovation Center for GoDaddy, a domain registrar, registry, web hosting, and ecommerce company that seeks to make entrepreneurship more accessible by using generative AI to provide personalized business insights to over 21 million customers—insights that were previously only available to large corporations. In this collaboration, the Generative AI Innovation Center team created an accurate and cost-efficient generative AI–based solution using batch inference in Amazon Bedrock, helping GoDaddy improve their existing product categorization system.

Solution overview

GoDaddy wanted to enhance their product categorization system that assigns categories to products based on their names. For example:

Input: Fruit by the Foot Starburst

Output: color -> multi-colored, material -> candy, category -> snacks, product_line -> Fruit by the Foot,…

GoDaddy used an out-of-the-box Meta Llama 2 model to generate the product categories for six million products where a product is identified by an SKU. The generated categories were often incomplete or mislabeled. Moreover, employing an LLM for individual product categorization proved to be a costly endeavor. Recognizing the need for a more precise and cost-effective solution, GoDaddy sought an alternative approach that was a more accurate and cost-efficient way for product categorization to improve their customer experience.

This solution uses the following components to categorize products more accurately and efficiently:

The key steps are illustrated in the following figure:

  1. A JSONL file containing product data is uploaded to an S3 bucket, triggering the first Lambda function. Amazon Bedrock batch processes this single JSONL file, where each row contains input parameters and prompts. It generates an output JSONL file with a new model_output value appended to each row, corresponding to the input data.
  2. The Lambda function spins up an Amazon Bedrock batch processing endpoint and passes the S3 file location.
  3. The Amazon Bedrock endpoint performs the following tasks:
    1. It reads the product name data and generates a categorized output, including category, subcategory, season, price range, material, color, product line, gender, and year of first sale.
    2. It writes the output to another S3 location.
  4. The second Lambda function performs the following tasks:
    1. It monitors the batch processing job on Amazon Bedrock.
    2. It shuts down the endpoint when processing is complete.

The security measures are inherently integrated into the AWS services employed in this architecture. For detailed information, refer to the Security Best Practices section of this post.

We used a dataset that consisted of 30 labeled data points and 100,000 unlabeled test data points. The labeled data points were generated by llama2-7b and verified by a human subject matter expert (SME). As shown in the following screenshot of the sample ground truth, some fields have N/A or missing values, which isn’t ideal because GoDaddy wants a solution with high coverage for downstream predictive modeling. Higher coverage for each possible field can provide more business insights to their customers.

The distribution of the number of words or tokens per SKU shows only mild outlier concerns, making it suitable for bundling many products to be categorized in a single prompt and potentially yielding more efficient model responses.

The solution delivers a comprehensive framework for generating insights within GoDaddy’s product categorization system. It’s designed to be compatible with a range of LLMs on Amazon Bedrock, features customizable prompt templates, and supports batch and real-time (online) inferences. Additionally, the framework includes evaluation metrics that can be extended to accommodate changes in accuracy requirements.

In the following sections, we look at the key components of the solution in more detail.

Batch inference

We used Amazon Bedrock for batch inference processing. Amazon Bedrock provides the CreateModelInvocationJob API to create a batch job with a unique job name. This API returns a response containing jobArn. Refer to the following code:

Request: POST /model-invocation-job HTTP/1.1

Content-type: application/json
{
  "clientRequestToken": "string",
  "inputDataConfig": {
    "s3InputDataConfig": {
      "s3Uri": "string",
      "s3InputFormat": "JSONL"
    }
   },
  "jobName": "string",
  "modelId": "string",
  "outputDataConfig": {
    "s3OutputDataConfig": {
      "s3Uri": "string"
    }
  },
  "roleArn": "string",
  "tags": [{
  "key": "string",
  "value": "string"
  }]
}

Response
HTTP/1.1 200 Content-type: application/json
{
  "jobArn": "string"
}
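
The same request can be made with the boto3 Amazon Bedrock client. The following is a minimal sketch; the role ARN, S3 paths, and model ID are placeholders.

import boto3

bedrock = boto3.client("bedrock")

# Start a batch inference job over the JSONL file of prompts stored in Amazon S3
response = bedrock.create_model_invocation_job(
    jobName="product-categorization-batch-job",
    roleArn="arn:aws:iam::<YOUR_ACCOUNT_ID>:role/<YOUR_BATCH_INFERENCE_ROLE>",
    modelId="anthropic.claude-instant-v1:2",
    inputDataConfig={
        "s3InputDataConfig": {
            "s3Uri": "s3://<YOUR_BUCKET>/input/products.jsonl",
            "s3InputFormat": "JSONL",
        }
    },
    outputDataConfig={
        "s3OutputDataConfig": {"s3Uri": "s3://<YOUR_BUCKET>/output/"}
    },
)
job_arn = response["jobArn"]

# Check the status of the job using the returned ARN
print(bedrock.get_model_invocation_job(jobIdentifier=job_arn)["status"])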

We can monitor the job status using GetModelInvocationJob with the jobArn returned on job creation. The following are valid statuses during the lifecycle of a job:

  • Submitted – The job is marked Submitted when the JSON file is ready to be processed by Amazon Bedrock for inference.
  • InProgress – The job is marked InProgress when Amazon Bedrock starts processing the JSON file.
  • Failed – The job is marked Failed if there was an error while processing. The error can be written into the JSON file as a part of modelOutput. If it was a 4xx error, it’s written in the metadata of the Job.
  • Completed – The job is marked Completed when the output JSON file is generated for the input JSON file and has been uploaded to the S3 output path submitted as a part of the CreateModelInvocationJob in outputDataConfig.
  • Stopped – The job is marked Stopped when a StopModelInvocationJob API is called on a job that is InProgress. A terminal state job (Succeeded or Failed) can’t be stopped using StopModelInvocationJob.

The following is example code for the GetModelInvocationJob API:

GET /model-invocation-job/jobIdentifier HTTP/1.1

Response:
{
  'ResponseMetadata': {
    'RequestId': '081afa52-189f-4e83-a3f9-aa0918d902f4',
    'HTTPStatusCode': 200,
    'HTTPHeaders': {
       'date': 'Tue, 09 Jan 2024 17:00:16 GMT',
       'content-type': 'application/json',
       'content-length': '690',
       'connection': 'keep-alive',
       'x-amzn-requestid': '081afa52-189f-4e83-a3f9-aa0918d902f4'
      },
     'RetryAttempts': 0
   },
  'jobArn': 'arn:aws:bedrock:<region>:<account-id>:model-invocation-job/<id>',
  'jobName': 'job47',
  'modelId': 'arn:aws:bedrock:<region>::foundation-model/anthropic.claude-instant-v1:2',
  'status': 'Submitted',
  'submitTime': datetime.datetime(2024, 1, 8, 21, 44, 38, 611000, tzinfo=tzlocal()),
  'lastModifiedTime': datetime.datetime(2024, 1, 8, 23, 5, 47, 169000, tzinfo=tzlocal()),
  'inputDataConfig': {'s3InputDataConfig': {'s3Uri': <path to input jsonl file>}},
  'outputDataConfig': {'s3OutputDataConfig': {'s3Uri': <path to output jsonl.out file>}}
}

When the job is complete, the S3 path specified in s3OutputDataConfig will contain a new folder with an alphanumeric name. The folder contains two files:

  • json.out – The following code shows an example of the format:
{
   "processedRecordCount":<number>,
   "successRecordCount":<number>,
   "errorRecordCount":<number>,
   "inputTokenCount":<number>,
   "outputTokenCount":<number>
}
  • <file_name>.jsonl.out – The following screenshot shows an example of the file, containing the successfully processed records under modelOutput. The modelOutput contains a list of categories for a given product name in JSON format.

We then process the jsonl.out file in Amazon S3. This file is parsed using LangChain’s PydanticOutputParser to generate a .csv file. The PydanticOutputParser requires a schema to be able to parse the JSON generated by the LLM. We created a CCData class that contains the list of categories to be generated for each product as shown in the following code example. Because we enable n-packing, we wrap the schema with a List, as defined in List_of_CCData.

from typing import List, Optional

from pydantic import BaseModel, Field

class CCData(BaseModel):
    product_name: Optional[str] = Field(default=None, description="product name, which will be given as input")
    brand: Optional[str] = Field(default=None, description="Brand of the product inferred from the product name")
    color: Optional[str] = Field(default=None, description="Color of the product inferred from the product name")
    material: Optional[str] = Field(default=None, description="Material of the product inferred from the product name")
    price: Optional[str] = Field(default=None, description="Price of the product inferred from the product name")
    category: Optional[str] = Field(default=None, description="Category of the product inferred from the product name")
    sub_category: Optional[str] = Field(default=None, description="Sub-category of the product inferred from the product name")
    product_line: Optional[str] = Field(default=None, description="Product Line of the product inferred from the product name")
    gender: Optional[str] = Field(default=None, description="Gender of the product inferred from the product name")
    year_of_first_sale: Optional[str] = Field(default=None, description="Year of first sale of the product inferred from the product name")
    season: Optional[str] = Field(default=None, description="Season of the product inferred from the product name")

class List_of_CCData(BaseModel):
    list_of_dict: List[CCData]

We also use OutputFixingParser to handle situations where the initial parsing attempt fails. The following screenshot shows a sample generated .csv file.
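
The following is a minimal sketch of how the parsing step might look with LangChain, assuming the List_of_CCData schema defined earlier; the exact import paths and the ChatBedrock model used for fixing are assumptions that can vary across LangChain versions.

from langchain.output_parsers import OutputFixingParser, PydanticOutputParser
from langchain_aws import ChatBedrock  # assumed LLM wrapper for the fixing step

# Base parser that validates the model output against the List_of_CCData schema
parser = PydanticOutputParser(pydantic_object=List_of_CCData)

# Fallback parser that asks an LLM to repair malformed JSON before re-parsing
fixing_parser = OutputFixingParser.from_llm(
    parser=parser,
    llm=ChatBedrock(model_id="anthropic.claude-instant-v1:2"),
)

def parse_model_output(model_output: str) -> List_of_CCData:
    try:
        return parser.parse(model_output)
    except Exception:
        # Retry with the fixing parser when the raw JSON doesn't match the schema
        return fixing_parser.parse(model_output)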

Prompt engineering

Prompt engineering involves the skillful crafting and refining of input prompts. This process entails choosing the right words, phrases, sentences, punctuation, and separator characters to efficiently use LLMs for diverse applications. Essentially, prompt engineering is about effectively interacting with an LLM. The most effective prompt engineering strategy varies based on the specific task and data; in this case, data card generation for GoDaddy SKUs.

Prompts consist of particular inputs from the user that direct LLMs to produce a suitable response or output based on a specified task or instruction. These prompts include several elements, such as the task or instruction itself, the surrounding context, full examples, and the input text that guides LLMs in crafting their responses. The composition of the prompt will vary based on factors like the specific use case, data availability, and the nature of the task at hand. For example, in a Retrieval Augmented Generation (RAG) use case, we provide additional context and add a user-supplied query in the prompt that asks the LLM to focus on contexts that can answer the query. In a metadata generation use case, we can provide the image and ask the LLM to generate a description and keywords describing the image in a specific format.

In this post, we briefly distribute the prompt engineering solutions into two steps: output generation and format parsing.

Output generation

The following are best practices and considerations for output generation:

  • Provide simple, clear and complete instructions – This is the general guideline for prompt engineering work.
  • Use separator characters consistently – In this use case, we use the newline character \n
  • Deal with default output values such as missing – For this use case, we don’t want special values such as N/A or missing, so we include multiple instructions in the prompt aiming to exclude the default or missing values.
  • Use few-shot prompting – Also termed in-context learning, few-shot prompting involves providing a handful of examples, which can be beneficial in helping LLMs understand the output requirements more effectively. In this use case, 0–10 in-context examples were tested for both Llama 2 and Anthropic’s Claude models.
  • Use packing techniques – We combined multiple SKUs and product names into one LLM query so that some prompt instructions can be shared across different SKUs for cost and latency optimization (see the packing sketch after this list). In this use case, 1–10 packing numbers were tested for both Llama 2 and Anthropic’s Claude models.
  • Test for good generalization – You should keep a hold-out test set and correct responses to check if your prompt modifications generalize.
  • Use additional techniques for Anthropic’s Claude model families – We incorporated the following techniques:
    • Enclosing examples in XML tags:
<example>
H: <question> The list of product names is:
{few_shot_product_name} </question>
A: <response> The category information generated with absolutely no missing value, in JSON format is:
{few_shot_field} </response>
</example>
    • Using the Human and Assistant annotations:
\n\nHuman:
...
...
\n\nAssistant:
    • Guiding the assistant prompt:
\n\nAssistant: Here is the answer with NO missing, unknown, null, or N/A values (in JSON format):
  • Use additional techniques for Llama model families – For Llama 2 model families, you can enclose examples in [INST] tags:
[INST]
If the list of product names is:
{few_shot_product_name}
[/INST]

Then the answer with NO missing, unknown, null, or N/A values is (in JSON format):

{few_shot_field}

[INST]
If the list of product names is:
{product_name}
[/INST]

Then the answer with NO missing, unknown, null, or N/A values is (in JSON format):
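
The packing technique referenced in the list above can be implemented as a simple chunking step before prompt construction. The following is an illustrative sketch with our own naming (pack_product_names), not the project code.

from typing import List

def pack_product_names(product_names: List[str], num_pack: int = 5) -> List[List[str]]:
    """Split the SKU list into groups of num_pack so each LLM call shares one set of instructions."""
    return [product_names[i:i + num_pack] for i in range(0, len(product_names), num_pack)]

# Example: 5-packing turns 5,000 product names into 1,000 LLM queries instead of 5,000.
batches = pack_product_names([f"product-{i}" for i in range(5000)], num_pack=5)
assert len(batches) == 1000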

Format parsing

The following are best practices and considerations for format parsing:

  • Refine the prompt with modifiers – Refinement of task instructions typically involves altering the instruction, task, or question part of the prompt. The effectiveness of these techniques varies based on the task and data. Some beneficial strategies in this use case include:
    • Role assumption – Ask the model to assume it’s playing a role. For example:

You are a Product Information Manager, Taxonomist, and Categorization Expert who follows instruction well.

    • Prompt specificity – Being very specific and providing detailed instructions to the model can help generate better responses for the required task. For example:

EVERY category information needs to be filled based on BOTH product name AND your best guess. If you forget to generate any category information, leave it as missing or N/A, then an innocent people will die.

    • Output format description – We provided the JSON format instructions through a JSON string directly, as well as through the few-shot examples indirectly.
  • Pay attention to few-shot example formatting – The LLMs (Anthropic’s Claude and Llama) are sensitive to subtle formatting differences. Parsing time improved significantly after several iterations on few-shot example formatting. The final solution is as follows:
few_shot_field = (
    '{"list_of_dict":[' +
    ', \n'.join([true_df.iloc[i].to_json() for i in range(num_few_shot)]) +
    ']}'
)
  • Use additional techniques for Anthropic’s Claude model families – For Anthropic’s Claude models, we instructed them to format the output in JSON format:
{
    "list_of_dict": [{
        "some_category": "your_generated_answer",
        "another_category": "your_generated_answer",
    },
    {
        <category information for the 2nd product name, in json format>
    },
    {
        <category information for the 3rd product name, in json format>
    },
// ... {additional product information, in json format} ...
    ]
}
  • Use additional techniques for Llama 2 model families – For the Llama 2 model, we instructed it to format the output in JSON format as follows:

Format your output in the JSON format (ensure to escape special character):
The output should be formatted as a JSON instance that conforms to the JSON schema below.
As an example, for the schema {"properties": {"foo": {"title": "Foo", "description": "a list of strings", "type": "array", "items": {"type": "string"}}}, "required": ["foo"]}
the object {"foo": ["bar", "baz"]} is a well-formatted instance of the schema. The object {"properties": {"foo": ["bar", "baz"]}} is not well-formatted.

Here is the output schema:

{"properties": {"list_of_dict": {"title": "List Of Dict", "type": "array", "items": {"$ref": "#/definitions/CCData"}}}, "required": ["list_of_dict"], "definitions": {"CCData": {"title": "CCData", "type": "object", "properties": {"product_name": {"title": "Product Name", "description": "product name, which will be given as input", "type": "string"}, "brand": {"title": "Brand", "description": "Brand of the product inferred from the product name", "type": "string"}, "color": {"title": "Color", "description": "Color of the product inferred from the product name", "type": "string"}, "material": {"title": "Material", "description": "Material of the product inferred from the product name", "type": "string"}, "price": {"title": "Price", "description": "Price of the product inferred from the product name", "type": "string"}, "category": {"title": "Category", "description": "Category of the product inferred from the product name", "type": "string"}, "sub_category": {"title": "Sub Category", "description": "Sub-category of the product inferred from the product name", "type": "string"}, "product_line": {"title": "Product Line", "description": "Product Line of the product inferred from the product name", "type": "string"}, "gender": {"title": "Gender", "description": "Gender of the product inferred from the product name", "type": "string"}, "year_of_first_sale": {"title": "Year Of First Sale", "description": "Year of first sale of the product inferred from the product name", "type": "string"}, "season": {"title": "Season", "description": "Season of the product inferred from the product name", "type": "string"}}}}}
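
This schema text is essentially what LangChain’s PydanticOutputParser emits for the List_of_CCData model, so the instruction can be generated programmatically rather than maintained by hand. A minimal sketch, assuming the parser defined in the earlier parsing example:

# Sketch: generate the JSON schema instruction from the Pydantic models instead
# of maintaining it by hand (assumes the PydanticOutputParser defined earlier).
format_instruction = parser.get_format_instructions()  # full schema instruction (longer length)
# A shorter, handcrafted pseudo example (as shown for Anthropic's Claude above)
# can be substituted when the full schema is unnecessarily long.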

Models and parameters

We used the following prompting parameters:

  • Number of packings – 1, 5, 10
  • Number of in-context examples – 0, 2, 5, 10
  • Format instruction – JSON format pseudo example (shorter length), JSON format full example (longer length)

For Llama 2, the model choices were meta.llama2-13b-chat-v1 or meta.llama2-70b-chat-v1. We used the following LLM parameters:

{
    "temperature": 0.1,
    "top_p": 0.9,
    "max_gen_len": 2048,
}

For Anthropic’s Claude, the model choices were anthropic.claude-instant-v1 and anthropic.claude-v2. We used the following LLM parameters:

{
   "temperature": 0.1,
   "top_k": 250,
   "top_p": 1,
   "max_tokens_to_sample": 4096,
   "stop_sequences": ["\n\nHuman:"],
   "anthropic_version": "bedrock-2023-05-31"
}

The solution is straightforward to extend to other LLMs hosted on Amazon Bedrock, such as Amazon Titan (switch the model ID to amazon.titan-tg1-large, for example), Jurassic (model ID ai21.j2-ultra), and more.
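
For reference, the following is a minimal near-real-time invocation sketch using the boto3 bedrock-runtime client and the Anthropic Claude request shape listed above; it illustrates how swapping the model ID and request body targets a different model family, and it is not the batch inference pipeline itself. The invoke_claude helper name is ours.

import json
import boto3

bedrock_runtime = boto3.client("bedrock-runtime")

def invoke_claude(prompt: str, model_id: str = "anthropic.claude-instant-v1") -> str:
    """Near-real-time call for one packed query; body keys follow the Claude text-completion API."""
    body = {
        "prompt": f"\n\nHuman: {prompt}\n\nAssistant:",
        "temperature": 0.1,
        "top_k": 250,
        "top_p": 1,
        "max_tokens_to_sample": 4096,
        "stop_sequences": ["\n\nHuman:"],
    }
    response = bedrock_runtime.invoke_model(
        modelId=model_id,
        body=json.dumps(body),
        accept="application/json",
        contentType="application/json",
    )
    return json.loads(response["body"].read())["completion"]

# For Llama 2, switch the model ID and body keys, for example:
# model_id="meta.llama2-70b-chat-v1", body={"prompt": ..., "temperature": 0.1,
# "top_p": 0.9, "max_gen_len": 2048}; the response field is "generation".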

Evaluations

The framework includes evaluation metrics that can be extended further to accommodate changes in accuracy requirements. Currently, it involves five different metrics (an illustrative computation sketch follows the list):

  • Content coverage – Measures portions of missing values in the output generation step.
  • Parsing coverage – Measures portions of missing samples in the format parsing step:
    • Parsing recall on product name – An exact match serves as a lower bound for parsing completeness (parsing coverage is the upper bound for parsing completeness) because in some cases, two virtually identical product names need to be normalized and transformed to be an exact match (for example, “Nike Air Jordan” and “nike. air Jordon”).
    • Parsing precision on product name – For an exact match, we use a similar metric to parsing recall, but use precision instead of recall.
  • Final coverage – Measures portions of missing values in both output generation and format parsing steps.
  • Human evaluation – Focuses on holistic quality evaluation such as accuracy, relevance, and comprehensiveness (richness) of the text generation.
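
The following is an illustrative sketch of the programmatic metrics; the function and column names are ours, and it assumes the generated and ground truth records are pandas DataFrames keyed by product_name with the CCData category columns.

from typing import List
import pandas as pd

def content_coverage(generated: pd.DataFrame, category_cols: List[str]) -> float:
    """Share of category cells that are filled in the generated output (treats NaN as missing;
    extend to strings such as 'N/A' or 'missing' if needed)."""
    return 1.0 - generated[category_cols].isna().mean().mean()

def parsing_recall(generated: pd.DataFrame, ground_truth: pd.DataFrame) -> float:
    """Exact-match recall on product name (lower bound for parsing completeness)."""
    parsed, expected = set(generated["product_name"]), set(ground_truth["product_name"])
    return len(parsed & expected) / len(expected)

def parsing_precision(generated: pd.DataFrame, ground_truth: pd.DataFrame) -> float:
    """Exact-match precision on product name."""
    parsed, expected = set(generated["product_name"]), set(ground_truth["product_name"])
    return len(parsed & expected) / max(len(parsed), 1)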

Results

The following are the approximate sample input and output lengths under some best performing settings:

  • Input length for Llama 2 model family – 2,068 tokens for 10-shot, 1,585 tokens for 5-shot, 1,319 tokens for 2-shot
  • Input length for Anthropic’s Claude model family – 1,314 tokens for 10-shot, 831 tokens for 5-shot, 566 tokens for 2-shot, 359 tokens for zero-shot
  • Output length with 5-packing – Approximately 500 tokens

Quantitative results

The following table summarizes our consolidated quantitative results.

  • To be concise, the table contains only some of our final recommendations for each model type.
  • The metrics used are latency and accuracy.
  • The best-performing configuration overall is Anthropic’s Claude-Instant with zero-shot prompting.
Batch process service | Model | Prompt | Batch latency, 5-packing (test set = 20) | Batch latency, 5-packing (test set = 5k) | GoDaddy requirement @ 5k | Near-real-time latency (1-packing) | Parsing recall (exact match) | Final content coverage
Amazon Bedrock batch inference | Llama2-13b | zero-shot | n/a | n/a | 3600s | n/a | n/a | n/a
Amazon Bedrock batch inference | Llama2-13b | 5-shot (template12) | 65.4s | 1704s | 3600s | 72/20 = 3.6s | 92.60% | 53.90%
Amazon Bedrock batch inference | Llama2-70b | zero-shot | n/a | n/a | 3600s | n/a | n/a | n/a
Amazon Bedrock batch inference | Llama2-70b | 5-shot (template13) | 139.6s | 5299s | 3600s | 156/20 = 7.8s | 98.30% | 61.50%
Amazon Bedrock batch inference | Claude-v1 (instant) | zero-shot (template6) | 29s | 723s | 3600s | 44.8/20 = 2.24s | 98.50% | 96.80%
Amazon Bedrock batch inference | Claude-v1 (instant) | 5-shot (template12) | 30.3s | 644s | 3600s | 51/20 = 2.6s | 99% | 84.40%
Amazon Bedrock batch inference | Claude-v2 | zero-shot (template6) | 82.2s | 1706s | 3600s | 104/20 = 5.2s | 99% | 84.40%
Amazon Bedrock batch inference | Claude-v2 | 5-shot (template14) | 49.1s | 1323s | 3600s | 104/20 = 5.2s | 99.40% | 90.10%

The following tables summarize the scaling effect in batch inference.

  • When scaling from 5,000 to 100,000 samples, only eight times more computation time was needed.
  • Performing categorization with individual LLM calls for each product would have increased the inference time for 100,000 products by approximately 40 times compared to the batch processing method.
  • The accuracy in coverage remained stable, and cost scaled approximately linearly.
Batch process service | Model | Prompt | Batch latency, 5-packing (test set = 20) | Batch latency, 5-packing (test set = 5k) | GoDaddy requirement @ 5k | Batch latency, 5-packing (test set = 100k) | Near-real-time latency (1-packing)
Amazon Bedrock batch | Claude-v1 (instant) | zero-shot (template6) | 29s | 723s | 3600s | 5733s | 44.8/20 = 2.24s
Amazon Bedrock batch | Anthropic’s Claude-v2 | zero-shot (template6) | 82.2s | 1706s | 3600s | 7689s | 104/20 = 5.2s

Batch process service | Model | Near-real-time latency (1-packing) | Parsing recall on product name (test set = 5k) | Parsing recall on product name (test set = 100k) | Final content coverage (test set = 5k) | Final content coverage (test set = 100k)
Amazon Bedrock batch | Claude-v1 (instant) | 44.8/20 = 2.24s | 98.50% | 98.40% | 96.80% | 96.50%
Amazon Bedrock batch | Anthropic’s Claude-v2 | 104/20 = 5.2s | 99% | 98.80% | 84.40% | 97%

The following table summarizes the effect of n-packing. Llama 2 has an output length limit of 2,048 tokens, which fits up to around 20-packing; Anthropic’s Claude has a higher limit. We tested 1-, 5-, and 10-packing on 20 ground truth samples and selected results across all models and prompt templates. The scaling effect on latency was more pronounced in the Anthropic’s Claude model family than in Llama 2. Anthropic’s Claude also generalized better than Llama 2 when extending the packing numbers in the output.

We only tried few-shot prompting with Llama 2 models, which showed improved accuracy over zero-shot.

Batch process service | Model | Prompt | Latency, npack = 1 (test set = 20) | Latency, npack = 5 | Latency, npack = 10 | Final coverage, npack = 1 | Final coverage, npack = 5 | Final coverage, npack = 10
Amazon Bedrock batch inference | Llama2-13b | 5-shot (template12) | 72s | 65.4s | 65s | 95.90% | 93.20% | 88.90%
Amazon Bedrock batch inference | Llama2-70b | 5-shot (template13) | 156s | 139.6s | 150s | 85% | 97.70% | 100%
Amazon Bedrock batch inference | Claude-v1 (instant) | zero-shot (template6) | 45s | 29s | 27s | 99.50% | 99.50% | 99.30%
Amazon Bedrock batch inference | Claude-v1 (instant) | 5-shot (template12) | 51.3s | 30.3s | 27.4s | 99.50% | 99.50% | 100%
Amazon Bedrock batch inference | Claude-v2 | zero-shot (template6) | 104s | 82.2s | 67s | 85% | 97.70% | 94.50%
Amazon Bedrock batch inference | Claude-v2 | 5-shot (template14) | 104s | 49.1s | 43.5s | 97.70% | 100% | 99.80%

Qualitative results

We noted the following qualitative results:

  • Human evaluation – The categories generated were evaluated qualitatively by GoDaddy SMEs. The categories were found to be of good quality.
  • Learnings – We used an LLM in two separate calls: output generation and format parsing. We observed the following:
    • For this use case, we saw that Llama 2 didn’t perform well in format parsing but was relatively capable in output generation. To be consistent and make a fair comparison, we required the same LLM in both calls: both steps invoke llama2-13b-chat-v1, or both invoke anthropic.claude-instant-v1. However, GoDaddy chose Llama 2 as the LLM for category generation. Because of Llama 2’s relatively lower model capability, we found that using Llama 2 for output generation only and Anthropic’s Claude for format parsing was suitable for this use case.
    • Prompt engineering (the JSON format instruction is critical) improves format parsing and reduces latency. For example, with Anthropic’s Claude-Instant on a 20-sample test set, averaged over multiple prompt templates, latency was reduced by approximately 77% (from 90 seconds to 20 seconds). This eliminates the need for a JSON fine-tuned version of the LLM.
  • Llama2 – We observed the following:
    • Llama2-13b and Llama2-70b models both need the full instruction as format_instruction() in zero-shot prompts.
    • Llama2-13b seems to be worse in content coverage and formatting (for example, it can’t correctly escape characters such as \"), which can incur significant parsing time and cost and also degrade accuracy.
    • Llama 2 shows clear performance drops and instability when the packing number varies among 1, 5, and 10, indicating poorer generalizability compared to the Anthropic’s Claude model family.
  • Anthropic’s Claude – We observed the following:
    • Anthropic’s Claude-Instant and Claude-v2, regardless of using zero-shot or few-shot prompting, need only partial format instruction instead of the full instruction format_instruction(). It shortens the input length, and is therefore more cost-effective. It also shows Anthropic’s Claude’s better capability in following instructions.
    • Anthropic’s Claude generalizes well when varying packing numbers among 1, 5, and 10.

Business takeaways

We had the following key business takeaways:

  • Improved latency – Our solution processes 5,000 products in 12 minutes, which is 80% faster than GoDaddy’s requirement of 5,000 products in 1 hour. Batch inference in Amazon Bedrock provides efficient batch processing and is expected to scale further as AWS deploys more cloud instances, leading to additional time and cost savings.
  • Improved cost-effectiveness – The solution built by the Generative AI Innovation Center using Anthropic’s Claude-Instant is 8% more affordable than the existing proposal using Llama2-13b while also providing 79% more coverage.
  • Enhanced accuracy – The deliverable produces 97% category coverage on both the 5,000- and 100,000-sample hold-out test sets, exceeding GoDaddy’s target of 90%. The framework can facilitate future iterative improvements over the current model parameters and prompt templates.
  • Qualitative assessment – The category generation quality was satisfactory based on human evaluation by GoDaddy SMEs.

Technical takeaways

We had the following key technical takeaways:

  • The solution features both batch inference and near real-time inference (2 seconds per product) capability and multiple backend LLM selections.
  • Anthropic’s Claude-Instant with zero-shot is the clear winner:
    • It was best in latency, cost, and accuracy on the 5,000 hold-out test set.
    • It showed better generalizability to higher packing numbers (number of SKUs in one query), with potentially more cost and latency improvement.
  • Iteration on prompt templates improved results for all of these models, suggesting that good prompt engineering is a practical approach for the category generation task.
  • Input-wise, increasing to 10-shot may further improve performance, as observed in small-scale science experiments, but it also increases cost by around 30%. Therefore, we tested at most 5-shot in large-scale batch experiments.
  • Output-wise, increasing to 10-packing or even 20-packing (Anthropic’s Claude only; Llama 2 has a 2,048-token output length limit) might further improve latency and cost, because more SKUs can share the same input instructions.
  • For this use case, we saw Anthropic’s Claude model family having better accuracy and generalizability, for example:
    • Final category coverage performance was better with Anthropic’s Claude-Instant.
    • When increasing packing numbers from 1, 5, to 10, Anthropic’s Claude-Instant showed improvement in latency and stable accuracy in comparison to Llama 2.
    • To achieve the final categories for the use case, we noticed that Anthropic’s Claude required a shorter prompt input to follow the instruction and had a longer output length limit for a higher packing number.

Next steps for GoDaddy

The following are the recommendations that the GoDaddy team is considering as a part of future steps:

  • Dataset enhancement – Aggregate a larger set of ground truth examples and expand programmatic evaluation to better monitor and refine the model’s performance. Relatedly, normalizing product names with domain knowledge yields cleaner input and better LLM responses. For example, the product name ”<product_name> Power t-shirt, ladyfit vest or hoodie” can prompt the LLM to respond for multiple SKUs instead of one SKU (similarly, “<product_name> – $5 or $10 or $20 or $50 or $100”).
  • Human evaluation – Increase human evaluations to provide higher generation quality and alignment with desired outcomes.
  • Fine-tuning – Consider fine-tuning as a potential strategy for enhancing category generation when a more extensive training dataset becomes available.
  • Prompt engineering – Explore automatic prompt engineering techniques to enhance category generation, particularly when additional training data becomes available.
  • Few-shot learning – Investigate techniques such as dynamic few-shot selection and crafting in-context examples based on the model’s parameter knowledge to enhance the LLMs’ few-shot learning capabilities.
  • Knowledge integration – Improve the model’s output by connecting LLMs to a knowledge base (internal or external database) and enabling it to incorporate more relevant information. This can help to reduce LLM hallucinations and enhance relevance in responses.

Conclusion

In this post, we shared how the Generative AI Innovation Center team worked with GoDaddy to create a more accurate and cost-efficient generative AI–based solution using batch inference in Amazon Bedrock, helping GoDaddy improve their existing product categorization system. We implemented n-packing techniques and used Anthropic’s Claude and Meta Llama 2 models to improve latency. We experimented with different prompts to improve the categorization with LLMs and found that Anthropic’s Claude model family gave better accuracy and generalizability than the Llama 2 model family. The GoDaddy team will test this solution on a larger dataset and evaluate the categories generated from the recommended approaches.

If you’re interested in working with the AWS Generative AI Innovation Center, please reach out.


About the Authors

Vishal Singh is a Data Engineering leader on the Data and Analytics team at GoDaddy. His key focus is building data products and generating insights from them by applying data engineering tools along with generative AI.

Yun Zhou is an Applied Scientist at AWS where he helps with research and development to ensure the success of AWS customers. He works on pioneering solutions for various industries using statistical modeling and machine learning techniques. His interest includes generative models and sequential data modeling.

Meghana Ashok is a Machine Learning Engineer at the Generative AI Innovation Center. She collaborates closely with customers, guiding them in developing secure, cost-efficient, and resilient solutions and infrastructure tailored to their generative AI needs.

Karan Sindwani is an Applied Scientist at AWS where he works with AWS customers across different verticals to accelerate their use of Gen AI and AWS Cloud services to solve their business challenges.

Vidya Sagar Ravipati is a Science Manager at the Generative AI Innovation Center, where he uses his vast experience in large-scale distributed systems and his passion for machine learning to help AWS customers across different industry verticals accelerate their AI and cloud adoption.
