Llama 4 family of models from Meta are now available in SageMaker JumpStart

Today, we’re excited to announce the availability of the Llama 4 Scout and Maverick models in Amazon SageMaker JumpStart, and coming soon in Amazon Bedrock. Llama 4 represents Meta’s most advanced multimodal models to date, featuring a mixture of experts (MoE) architecture and context window support up to 10 million tokens. With native multimodality and early fusion technology, Meta states that these new models demonstrate unprecedented performance across text and vision tasks while maintaining efficient compute requirements. With a dramatic increase in supported context length from 128K in Llama 3, Llama 4 is now suitable for multi-document summarization, parsing extensive user activity for personalized tasks, and reasoning over extensive codebases. You can now deploy the Llama-4-Scout-17B-16E-Instruct, Llama-4-Maverick-17B-128E-Instruct, and Llama-4-Maverick-17B-128E-Instruct-FP8 models using SageMaker JumpStart in the US East (N. Virginia) AWS Region.

In this blog post, we walk you through how to deploy and prompt a Llama-4-Scout-17B-16E-Instruct model using SageMaker JumpStart.

Llama 4 overview

Meta announced Llama 4 today, introducing three distinct model variants: Scout, which offers advanced multimodal capabilities and a 10M token context window; Maverick, a cost-effective solution with a 128K context window; and Behemoth, in preview. These models are optimized for multimodal reasoning, multilingual tasks, coding, tool-calling, and powering agentic systems.

Llama 4 Maverick is a powerful general-purpose model with 17 billion active parameters, 128 experts, and 400 billion total parameters, optimized for high-quality general assistant and chat use cases. Llama 4 Maverick is available as base and instruct models; the instruct model is additionally offered in a quantized (FP8) version for efficient deployment, and a non-quantized (BF16) version is available for maximum accuracy.

Llama 4 Scout, the more compact model, has 17 billion active parameters, 16 experts, and 109 billion total parameters, and features an industry-leading 10M token context window. These models are designed for industry-leading performance in image and text understanding with support for 12 languages, enabling the creation of AI applications that bridge language barriers.

See Meta’s community license agreement for usage terms and more details.

SageMaker JumpStart overview

SageMaker JumpStart offers access to a broad selection of publicly available foundation models (FMs). These pre-trained models serve as powerful starting points that can be deeply customized to address specific use cases. You can use state-of-the-art model architectures—such as language models, computer vision models, and more—without having to build them from scratch.

With SageMaker JumpStart, you can deploy models in a secure environment. Models are provisioned on dedicated SageMaker inference instances, which can be isolated within your virtual private cloud (VPC). After deploying an FM, you can further customize and fine-tune it using the extensive capabilities of Amazon SageMaker AI, including SageMaker inference for deploying models and container logs for improved observability. With SageMaker AI, you can streamline the entire model deployment process.

Prerequisites

To try the Llama 4 models in SageMaker JumpStart, you need an AWS account with access to SageMaker AI, an AWS Identity and Access Management (IAM) role with permissions to manage SageMaker resources, and sufficient service quota for the recommended GPU instances (such as ml.p5.48xlarge).

Discover Llama 4 models in SageMaker JumpStart

SageMaker JumpStart provides FMs through two primary interfaces: SageMaker Studio and the Amazon SageMaker Python SDK. This provides multiple options to discover and use hundreds of models for your specific use case.

SageMaker Studio is a comprehensive integrated development environment (IDE) that offers a unified, web-based interface for performing all aspects of the AI development lifecycle. From preparing data to building, training, and deploying models, SageMaker Studio provides purpose-built tools to streamline the entire process.

In SageMaker Studio, you can access SageMaker JumpStart to discover and explore the extensive catalog of FMs available for deployment on SageMaker Inference. You can access SageMaker JumpStart by choosing JumpStart in the navigation pane or by choosing JumpStart from the Home page in SageMaker Studio, as shown in the following figure.

Alternatively, you can use the SageMaker Python SDK to programmatically access and use SageMaker JumpStart models. This approach allows for greater flexibility and integration with existing AI and machine learning (AI/ML) workflows and pipelines.

By providing multiple access points, SageMaker JumpStart helps you seamlessly incorporate pre-trained models into your AI/ML development efforts, regardless of your preferred interface or workflow.

Deploy Llama 4 models for inference through the SageMaker JumpStart UI

On the SageMaker JumpStart landing page, you can find all the public pre-trained models offered by SageMaker AI. You can then choose the Meta model provider tab to discover all the available Meta models.

If you’re using SageMaker Studio Classic and don’t see the Llama 4 models, update your SageMaker Studio version by shutting down and restarting. For more information about version updates, see Shut down and Update Studio Classic Apps.

  1. Search for Meta to view the Meta model card. Each model card shows key information, including:
    • Model name
    • Provider name
    • Task category (for example, Text Generation)
  2. Select the model card to view the model details page.

The model details page includes the following information:

  • The model name and provider information
  • Deploy button to deploy the model
  • About and Notebooks tabs with detailed information

The About tab includes important details, such as:

  • Model description
  • License information
  • Technical specifications
  • Usage guidelines

Before you deploy the model, we recommend that you review the model details and license terms to confirm compatibility with your use case.

  1. Choose Deploy to proceed with deployment.
  2. For Endpoint name, use the automatically generated name or enter a custom one.
  3. For Instance type, use the default: ml.p5.48xlarge.
  4. For Initial instance count, enter the number of instances (default: 1).
    Selecting appropriate instance types and counts is crucial for cost and performance optimization. Monitor your deployment to adjust these settings as needed.
  5. Under Inference type, Real-time inference is selected by default. This is optimized for sustained traffic and low latency.
  6. Review all configurations for accuracy. For this model, we strongly recommend adhering to the SageMaker JumpStart default settings and making sure that network isolation remains in place.
  7. Choose Deploy. The deployment process can take several minutes to complete.

When deployment is complete, your endpoint status will change to InService. At this point, the model is ready to accept inference requests through the endpoint. You can monitor the deployment progress on the SageMaker console Endpoints page, which will display relevant metrics and status information. When the deployment is complete, you can invoke the model using a SageMaker runtime client and integrate it with your applications.
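For example, once the endpoint is InService, a minimal invocation sketch using the SageMaker runtime client might look like the following (the endpoint name is a hypothetical placeholder; the response format follows the messages schema shown later in this post):

import json
import boto3

runtime = boto3.client("sagemaker-runtime")

# Hypothetical endpoint name; use the name shown on the SageMaker Endpoints page
endpoint_name = "jumpstart-dft-llama-4-scout-17b-16e-instruct"

payload = {
    "messages": [{"role": "user", "content": "Hello! What can you do?"}],
    "max_tokens": 128,
}

response = runtime.invoke_endpoint(
    EndpointName=endpoint_name,
    ContentType="application/json",
    Body=json.dumps(payload),
)
print(json.loads(response["Body"].read())["choices"][0]["message"]["content"])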

Deploy Llama 4 models for inference using the SageMaker Python SDK

When you choose Deploy and accept the terms, model deployment will start. Alternatively, you can deploy through the example notebook by choosing Open Notebook. The notebook provides end-to-end guidance on how to deploy the model for inference and clean up resources.

To deploy using a notebook, start by selecting an appropriate model, specified by the model_id. You can deploy any of the selected models on SageMaker AI.

You can deploy the Llama 4 Scout model using SageMaker JumpStart with the following SageMaker Python SDK code:

from sagemaker.jumpstart.model import JumpStartModel

model = JumpStartModel(model_id="meta-vlm-llama-4-scout-17b-16e-instruct")

# Set accept_eula=True in the deploy call to accept the model's end-user license agreement
predictor = model.deploy(accept_eula=False)

This deploys the model on SageMaker AI with default configurations, including default instance type and default VPC configurations. You can change these configurations by specifying non-default values in JumpStartModel. To successfully deploy the model, you must manually set accept_eula=True as a deploy method argument. After it’s deployed, you can run inference against the deployed endpoint through the SageMaker predictor:

payload = {
    "messages": [
        {"role": "system", "content": "You are a helpful assistant"},
        {"role": "user", "content": "How are you doing today"},
        {"role": "assistant", "content": "Good, what can I help you with today?"},
        {"role": "user", "content": "Give me 5 steps to become better at tennis?"}
    ],
    "temperature": 0.6,
    "top_p": 0.9,
    "max_tokens": 512,
    "logprobs": False
}
response = predictor.predict(payload)
response_message = response['choices'][0]['message']['content']

Recommended instances and benchmark

The following table lists the Llama 4 models available in SageMaker JumpStart along with the model_id, default instance type, and supported instance types for each model. For increased context length, you can modify the default instance type in the SageMaker JumpStart UI.

| Model name | Model ID | Default instance type | Supported instance types |
| --- | --- | --- | --- |
| Llama-4-Scout-17B-16E-Instruct | meta-vlm-llama-4-scout-17b-16e-instruct | ml.p5.48xlarge | ml.g6e.48xlarge, ml.p5.48xlarge, ml.p5en.48xlarge |
| Llama-4-Maverick-17B-128E-Instruct | meta-vlm-llama-4-maverick-17b-128e-instruct | ml.p5.48xlarge | ml.p5.48xlarge, ml.p5en.48xlarge |
| Llama-4-Maverick-17B-128E-Instruct-FP8 | meta-vlm-llama-4-maverick-17b-128-instruct-fp8 | ml.p5.48xlarge | ml.p5.48xlarge, ml.p5en.48xlarge |
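If you deploy with the SageMaker Python SDK instead of the UI, a minimal sketch of overriding the default instance type (using the model ID and supported instance types from the preceding table) looks like the following:

from sagemaker.jumpstart.model import JumpStartModel

# Deploy Llama 4 Scout on a non-default supported instance type
model = JumpStartModel(
    model_id="meta-vlm-llama-4-scout-17b-16e-instruct",
    instance_type="ml.p5en.48xlarge",
)
predictor = model.deploy(accept_eula=True)  # set True only after reviewing the EULA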

Inference and example prompts for Llama 4 Scout 17B 16 Experts model

You can use the Llama 4 Scout model for text and image or vision reasoning use cases. With that model, you can perform a variety of tasks, such as image captioning, image text retrieval, visual question answering and reasoning, document visual question answering, and more.

In the following sections, we show example payloads, invocations, and responses for Llama 4 Scout that you can use against your Llama 4 model deployments in SageMaker JumpStart.

Text-only input

Input:

payload = {
    "messages": [
        {
            "role": "system",
            "content": "You are a helpful AI assistant."
        },
        {
            "role": "user",
            "content": "What are three key benefits of large language models for businesses?"
        }
    ],
    "max_tokens": 2048,
    "temperature": 0.7,
    "top_p": 0.9,
    "stream": False
}

Response:

Large language models (LLMs) offer numerous benefits to businesses, transforming how they operate, 
interact with customers, and generate insights. Here are three key benefits:

1. **Enhanced Customer Experience through Automation and Personalization**:
   - **Automated Customer Support**: LLMs can power chatbots and virtual assistants 
	that provide 24/7 customer support. These models can understand and respond to a wide range of customer queries, help with troubleshooting, and even process transactions, significantly reducing the workload on human customer support agents.
   - **Personalization**: By analyzing customer data, LLMs can help businesses personalize their 
	communications and offerings. For instance, they can generate personalized emails, product recommendations, and content, enhancing customer engagement and satisfaction.

2. **Efficiency and Productivity Gains**:
   - **Content Generation**: LLMs can automate the creation of various types of content, such as 
   blog posts, reports, product descriptions, and social media updates. This not only speeds up content production but also allows human writers to focus on more creative and strategic tasks.
   - **Data Analysis and Summarization**: These models can quickly analyze large volumes of data, extract relevant information, and summarize findings in a readable format. This capability can significantly reduce the time and effort required for market research, competitive analysis, and internal reporting.

3. **Improved Decision Making with Data-Driven Insights**:
   - **Market and Trend Analysis**: LLMs can process and analyze vast amounts of data from various sources, including news articles, social media, and market reports. This helps businesses stay informed about market trends, consumer sentiment, and competitor activity, enabling more informed strategic decisions.
   - **Risk Management and Compliance**: By analyzing regulatory documents and monitoring communications, LLMs can help businesses identify and mitigate compliance risks. They can also assist in the creation of compliance reports and documentation, reducing the risk of human error.

In summary, large language models offer businesses the potential to automate and enhance customer interactions, 
improve operational efficiency, and gain deeper insights from their data. These capabilities can lead to cost savings, 
increased revenue, and a competitive edge in the market.

Single-image input

In this section, let’s test Llama 4’s multimodal capabilities. By merging text and vision tokens into a unified processing backbone, Llama 4 can seamlessly understand and respond to queries about an image. The following is an example of how you can prompt Llama 4 to answer questions about an image such as the one in the example:

Image:

Input:

import requests
import base64
import boto3
import json

s3 = boto3.client("s3")
region = boto3.Session().region_name  # Region of the JumpStart public assets bucket

s3_bucket = f"jumpstart-cache-prod-{region}"
key_prefix = "inference-notebook-assets"

def download_from_s3(key_filenames):
    for key_filename in key_filenames:
        s3.download_file(s3_bucket, f"{key_prefix}/{key_filename}", key_filename)

# Define image names
heat_map = "heatmap_semantic_similarity_search.png"

# Download the heatmap image
download_from_s3(key_filenames=[heat_map])

def img_to_base64(image_path):
    with open(image_path, "rb") as f:
        img = f.read()
    enc_img = base64.b64encode(img).decode('utf-8')
    return enc_img

b64_img = img_to_base64(heat_map)

data = {
    "messages": [
        {
            "role": "system",
            "content": "You are a helpful assistant",
        },
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "What is in this image?",
                },
                {
                    "type": "image_url",
                    "image_url": {
                        "url": b64_img
                    }
                },
            ],
        },
    ],
    "temperature": 0.6,
    "top_p": 0.9,
    "max_tokens": 512
}

payload = json.dumps(data)
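To send this request, you can pass the dict form of the payload to the predictor created earlier, as in the text-only example (a sketch; predictor.predict handles the JSON serialization for you):

response = predictor.predict(data)
print(response["choices"][0]["message"]["content"])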

Response:

The image presents a heatmap illustrating the semantic textual similarity between sentences. The heatmap is divided into two axes, with the x-axis listing six sentences and the y-axis listing the same six sentences in a different order.

**Sentences:**

* Your dog is so cute.
* How cute your dog is!
* You have such a cute dog!
* New York City is the place where I work.
* I work in New York City.
* What color do you like the most?
* What is your favourite color?

**Heatmap:**

The heatmap displays the semantic textual similarity between each pair of sentences, with darker shades of red indicating higher similarity and lighter shades of orange indicating lower similarity. The color bar on the right side of the heatmap provides a scale for interpreting the similarity values, ranging from -0.4 to 1.0.

**Observations:**

* The three sentences related to dogs ("Your dog is so cute.", "How cute your dog is!", and "You have such a cute dog!") show high similarity with each other, as indicated by the dark red squares in the corresponding section of the heatmap.
* The two sentences related to New York City ("New York City is the place where I work." and "I work in New York City.") also show high similarity with each other.
* The two sentences related to favorite colors ("What color do you like the most?" and "What is your favorite color?") show high similarity with each other.
* The sentences from different categories (dogs, New York City, and favorite colors) show lower similarity with each other, as indicated by the lighter shades of orange in the heatmap.

Overall, the heatmap effectively visualizes the semantic textual similarity between the given sentences, allowing for easy identification of similar and dissimilar sentence pairs.

The Llama 4 model on JumpStart accepts the image passed inline as base64-encoded data in the image_url field, underlining its powerful potential for real-time multimodal applications.

Multi-image input

Building on its advanced multimodal functionality, Llama 4 can effortlessly process multiple images at the same time. In this demonstration, the model is prompted with two images (the heatmap from the previous example and a second image downloaded from a URL) and tasked with describing each image and explaining their relationship, showcasing its capacity to synthesize information across several visual inputs. Let’s test this below by base64-encoding the following images and passing them in the payload.

Image 1:

Image 2:

Input: 

def url_to_base64(image_url):
    # Download the image
    response = requests.get(image_url)
    if response.status_code != 200:
        return None

    # Encode the image content to base64
    image_base64 = base64.b64encode(response.content).decode('utf-8')
    return image_base64

image_url_2 = "https://raw.githubusercontent.com/mathvision-cuhk/MATH-V/refs/heads/main/images/13.jpg"
b64_img_url_2 = url_to_base64(image_url_2)

data = {
    "messages": [
        {
            "role": "system",
            "content": "You are a helpful assistant",
        },
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "There are two images. Describe each one and why they are related",
                },
                {
                    "type": "image_url",
                    "image_url": {
                        "url": b64_img,
                    }
                },
                {
                    "type": "image_url",
                    "image_url": {
                        "url": b64_img_url_2,
                    }
                },
            ],
        },
    ],
    "temperature": 0.6,
    "top_p": 0.9,
    "max_tokens": 512
}

Response:

The first image is a heatmap illustrating the semantic textual similarity between sentences. The heatmap is titled "Semantic Textual Similarity Between Sentences" and features a color scale ranging from -0.4 to 1.0, with darker shades indicating higher similarity. The sentences being compared are:

* Your dog is so cute.
* How cute your dog is!
* You have such a cute dog!
* New York City is the place where I work.
* I work in New York City.
* What color do you like the most?
* What is your favourite color?

The heatmap reveals that the sentences about dogs have high similarity scores, with values above 0.9, indicating that they convey similar meanings. Similarly, the sentences about work locations and favorite colors also show high similarity scores. However, there is a low similarity score between the sentences about dogs and those about work locations or favorite colors.

The second image presents five stick figures labeled A, B, C, D, and E, each representing a person. The figures are drawn using simple lines and geometric shapes, such as circles for heads, squares for bodies, and triangles for skirts. The main difference between the figures lies in their arm and leg positions:

* Figure A has its arms at its sides and straight legs.
* Figure B has its arms at its sides and straight legs, but with a slightly different posture.
* Figure C has its arms at its sides and straight legs, similar to Figure A.
* Figure D has one arm raised and straight legs.
* Figure E has two arms raised and bent legs.

Overall, the two images are distinct in terms of their content and purpose. The heatmap provides a visual representation of semantic similarities between sentences, while the stick figures appear to be simple illustrations of people with varying poses.

As you can see, Llama 4 excels in handling multiple images simultaneously, providing detailed and contextually relevant insights that emphasize its robust multimodal processing abilities.

Codebase analysis with Llama 4

Using Llama 4 Scout’s industry-leading context window, this section showcases its ability to deeply analyze expansive codebases. The example extracts and contextualizes the buildspec-1-10-2.yml file from the AWS Deep Learning Containers GitHub repository, illustrating how the model synthesizes information across an entire repository. We used a tool to ingest the whole repository into plain text, which we provided to the model as context.
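As a minimal sketch (assuming you have already flattened the repository into a single plain-text file, for example with a tool such as Repomix, under a hypothetical filename), you can load the context like this before building the payload:

# Load the flattened repository text to pass as context in the prompt
with open("repomix-output.txt", "r", encoding="utf-8") as f:
    text = f.read()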

Input:

payload = {
    "messages": [
        {
            "role": "system",
            "content": """You are a helpful AI assistant that summarizes codebases to me to help me understand
            how to analyze code by synthesizing through the entire codebase before responding.
            
            Be thorough in your search as the file may be nested within a markdown code block or within a directory listing.
            """
        },
        {
            "role": "user",
            "content": f"Can you explain to me the buildspec-1-10-2.yml file and how it relates to the rest of the huggingface directory? Here is the AWS Deep Learning Container Repository for reference:nn{text}"
        }
    ],
    "max_tokens": 2048,
    "temperature": 0.7,
    "top_p": 0.9,
    "stream": False
}

Output: 

## Overview of the buildspec-1-10-2.yml File

The provided information appears to be a part of a large, merged representation of 
the Hugging Face repository, generated by Repomix. The `buildspec-1-10-2.yml` file 
is likely a build specification file used in a Continuous Integration/Continuous 
Deployment (CI/CD) pipeline.

## Purpose and Format

The file contains a packed representation of the entire repository's contents, designed
 to be easily consumable by AI systems for analysis, code review, or other automated processes. 
The content is organized into sections:

1.  **Summary**: Provides an overview of the file's purpose, format, and usage guidelines.
2.  **Repository Information**: Not shown in the provided excerpt, but likely contains metadata about the repository.
3.  **Directory Structure**: Lists the directory structure of the repository, including file paths.
4.  **File Entries**: Each file entry consists of a header with the file path and the full contents 
of the file in a code block.

## Relationship to the Hugging Face Directory

The `buildspec-1-10-2.yml` file is part of the Hugging Face repository, which appears to be a 
large project with multiple subdirectories and files. The directory structure section lists various 
subdirectories, including:

*   `.github/`: Contains GitHub-related configuration files, such as issue templates, workflows, 
and CODEOWNERS.
*   `.release_templates/`: Includes release image templates for various frameworks like MXNet, PyTorch, 
and TensorFlow.
*   `autogluon/`: A subdirectory with build artifacts, inference configurations, and other related files.

## buildspec-1-10-2.yml File Contents

Unfortunately, the provided excerpt does not contain the actual contents of the `buildspec-1-10-2.yml` file. 
However, based on the file name and its presence in a CI/CD context, it is likely that this file specifies 
the build process, dependencies, and environment configurations for the Hugging Face project.

## Usage Guidelines

When working with this file, keep in mind:

*   Treat this file as read-only and make changes to the original repository files instead.
*   Use the file path to distinguish between different files in the repository.
*   Handle this file with the same level of security as the original repository, as it may 
contain sensitive information.

## Additional Notes

*   Some files might be excluded based on `.gitignore` rules and Repomix's configuration.
*   Binary files are not included in this packed representation.
*   Files are sorted by Git change count, with files having more changes appearing at the bottom.

## Example Use Case

To analyze the `buildspec-1-10-2.yml` file, you would:

1.  Locate the file in the repository directory structure.
2.  Review the file contents to understand the build process, dependencies, and environment configurations.
3.  Use this information to optimize the build process, troubleshoot issues, or make changes to the repository.

Keep in mind that the actual contents of the `buildspec-1-10-2.yml` file are not provided, so this 
is a general explanation based on the file name and context. 

To get the actual content you would look for 

## File: buildspec-1-10-2.yml 

 block and read its content 

Let me know if I can assist with anything else.

Multi-document processing

Harnessing the same extensive token context window, Llama 4 Scout excels in multi-document processing. In this example, the model extracts key financial metrics from Amazon 10-K reports (2017-2024), demonstrating its capability to integrate and analyze data spanning multiple years—all without the need for additional processing tools.
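The payload below references a pdf_text variable holding the extracted report text. One way to produce it is sketched here, assuming locally downloaded 10-K PDFs (hypothetical filename) and the pypdf library; in practice you would loop over the 2017-2024 filings and concatenate their text:

from pypdf import PdfReader

# Extract plain text from a locally downloaded 10-K PDF
reader = PdfReader("amazon-10k-2024.pdf")
pdf_text = "\n".join(page.extract_text() or "" for page in reader.pages)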

Input: 

payload = {
    "messages": [
        {
            "role": "system",
            "content": "You are a financial analyst. Extract key financial data from this 10-K report."
        },
        {
            "role": "user",
            "content": f"""Extract the following information from this Amazon 10-K report:
            1. What year is this report for?
            2. Total revenue for the year
            3. Net income
            4. Key business segments and their performance
            5. Important trends mentioned
            
            Here's the 10-K text:
            {pdf_text[:25000]}
            
            Format your response as a JSON object with these fields: year, revenue, net_income, segments, trends.
            """
        }
    ],
    "max_tokens": 2048,
    "temperature": 0.7,
    "top_p": 0.9,
    "stream": False
}

Output:

To analyze Amazon's growth from 2017 to 2024, we'll examine the revenue and net income 
data available from the 10-K reports.

### Revenue Growth

| Year | Revenue (in billion USD) |
| --- | --- |
| 2014 | 89 |
| 2015 | 107 |
| 2016 | 135.99 |
| 2017 | 177.87 |
| 2018 | 232.88 |
| 2019 | 280.52 |
| 2020 | 386.06 |
| 2021 | 478 |
| 2022 | 513 |
| 2023 | 513.98 |
| 2024 | 574.78 (projected) |

From 2017 to 2024, Amazon's revenue has grown from $177.87 billion to a projected $574.78 billion, 
representing a significant increase of approximately 223% over the 7-year period.

### Net Income Growth

| Year | Net Income (in billion USD) |
| --- | --- |
| 2014 | 0.236 |
| 2015 | 0.596 |
| 2016 | 2.37 |
| 2017 | 3.03 |
| 2018 | 10.07 |
| 2019 | 11.59 |
| 2020 | 18.68 |
| 2021 | 33.4 |
| 2022 | 18.7 (lower due to economic conditions) |
| 2023 | 33.36 |
| 2024 | 37.68 (projected) |

Amazon's net income has grown from $3.03 billion in 2017 to a projected $37.68 billion in 2024, 
representing a substantial increase of approximately 1143% over the 7-year period.

### Segment and Trend Analysis

Throughout the years, Amazon has maintained its three main business segments: North America, 
International, and Amazon Web Services (AWS). The company has continued to expand into new products, 
services, technologies, and geographic regions, driving growth and increasing competition.

The trends mentioned across the 10-K reports highlight the company's:

1. **Seasonality**: Higher sales volume during the fourth quarter.
2. **Intense competition**: Across geographies and industries, including e-commerce, cloud computing, 
and digital content.
3. **Expansion**: Into new products, services, technologies, and geographic regions, which subjects 
the company to additional business, legal, financial, and competitive risks.
4. **International operations**: Subject to risks such as local economic and political conditions, 
government regulation, and restrictions on sales or distribution.

### Conclusion

Amazon has demonstrated remarkable growth from 2017 to 2024, with revenue increasing by 
approximately 223% and net income growing by approximately 1143%. The company's continued 
expansion into new areas, its strong presence in cloud computing through AWS, and its ability 
to adapt to changing market conditions have contributed to its success. However, the company 
also faces intense competition, seasonality, and risks associated with international operations.

--------------------------------------------------
Ask a question about the Amazon 10-K reports across years.

Clean up

To avoid incurring unnecessary costs, when you’re done, delete the SageMaker endpoints using the following code snippets:

predictor.delete_model()
predictor.delete_endpoint()

Alternatively, using the SageMaker console, complete the following steps:

  1. On the SageMaker console, under Inference in the navigation pane, choose Endpoints.
  2. Search for the endpoint you deployed (for example, the Llama 4 Scout endpoint).
  3. On the endpoint details page, choose Delete.
  4. Choose Delete again to confirm.

Conclusion

In this post, we explored how SageMaker JumpStart empowers data scientists and ML engineers to discover, access, and deploy a wide range of pre-trained FMs for inference, including Meta’s most advanced and capable models to date. Get started with SageMaker JumpStart and Llama 4 models today.

For more information about SageMaker JumpStart, see Train, deploy, and evaluate pretrained models with SageMaker JumpStart and Getting started with Amazon SageMaker JumpStart.


About the authors

Marco Punio is a Sr. Specialist Solutions Architect focused on generative AI strategy, applied AI solutions, and conducting research to help customers hyper-scale on AWS. As a member of the Third-party Model Provider Applied Sciences Solutions Architecture team at AWS, he is a global lead for the Meta–AWS Partnership and technical strategy. Based in Seattle, Washington, Marco enjoys writing, reading, exercising, and building applications in his free time.

Chakravarthy Nagarajan is a Principal Solutions Architect specializing in machine learning, big data, and high performance computing. In his current role, he helps customers solve real-world, complex business problems using machine learning and generative AI solutions.

Banu Nagasundaram leads product, engineering, and strategic partnerships for Amazon SageMaker JumpStart, the SageMaker machine learning and generative AI hub. She is passionate about building solutions that help customers accelerate their AI journey and unlock business value.

Malav Shastri is a Software Development Engineer at AWS, where he works on the Amazon SageMaker JumpStart and Amazon Bedrock teams. His role focuses on enabling customers to take advantage of state-of-the-art open source and proprietary foundation models and traditional machine learning algorithms. Malav holds a Master’s degree in Computer Science.

Niithiyn Vijeaswaran is a Generative AI Specialist Solutions Architect with the Third-party Model Science team at AWS. His area of focus is AWS AI accelerators (AWS Neuron). He holds a Bachelor’s degree in Computer Science and Bioinformatics.

Baladithya Balamurugan is a Solutions Architect at AWS focused on ML deployments for inference and using AWS Neuron to accelerate training and inference. He works with customers to enable and accelerate their ML deployments on services such as Amazon SageMaker and Amazon EC2. Based in San Francisco, Baladithya enjoys tinkering, developing applications, and working on his home lab in his free time.

John Liu has 14 years of experience as a product executive and 10 years of experience as a portfolio manager. At AWS, John is a Principal Product Manager for Amazon Bedrock. Previously, he was the Head of Product for AWS Web3 and Blockchain. Prior to AWS, John held various product leadership roles at public blockchain protocols and fintech companies, and also spent 9 years as a portfolio manager at various hedge funds.

Multi-tenancy in RAG applications in a single Amazon Bedrock knowledge base with metadata filtering

Amazon Bedrock is a fully managed service that offers a choice of high-performing foundation models (FMs) from leading AI companies and AWS. Amazon Bedrock Knowledge Bases offers fully managed, end-to-end Retrieval Augmented Generation (RAG) workflows to create highly accurate, low-latency, secure, and custom generative AI applications by incorporating contextual information from your company’s data sources.

Organizations need to control access to their data across different business units, including companies, departments, or even individuals, while maintaining scalability. When organizations try to separate data sources manually, they often create unnecessary complexity and hit service limitations. This post demonstrates how Amazon Bedrock Knowledge Bases can help you scale your data management effectively while maintaining proper access controls on different management levels.

One of these strategies is using Amazon Simple Storage Service (Amazon S3) folder structures and Amazon Bedrock Knowledge Bases metadata filtering to enable efficient data segmentation within a single knowledge base. Additionally, we dive into integrating common vector database solutions available for Amazon Bedrock Knowledge Bases and how these integrations enable advanced metadata filtering and querying capabilities.

Organizing S3 folder structures for scalable knowledge bases

Organizations working with multiple customers need a secure and scalable way to keep each customer’s data separate while maintaining efficient access controls. Without proper data segregation, companies risk exposing sensitive information between customers or creating complex, hard-to-maintain systems. For this post, we focus on maintaining access controls across multiple business units within the same management level.

A key strategy involves using S3 folder structures and Amazon Bedrock Knowledge Bases metadata filtering to enable efficient data segregation within a single knowledge base. Instead of creating separate knowledge bases for each customer, you can use a consolidated knowledge base with a well-structured S3 folder hierarchy. For example, imagine a consulting firm that manages documentation for multiple healthcare providers—each customer’s sensitive patient records and operational documents must remain strictly separated. The Amazon S3 structure might look as follows:

s3://amzn-s3-demo-my-knowledge-base-bucket/customer-data/
    s3://amzn-s3-demo-my-knowledge-base-bucket/customer-data/customerA/
        s3://amzn-s3-demo-my-knowledge-base-bucket/customer-data/customerA/policies/
        s3://amzn-s3-demo-my-knowledge-base-bucket/customer-data/customerA/procedures/
    s3://amzn-s3-demo-my-knowledge-base-bucket/customer-data/customerB/
        s3://amzn-s3-demo-my-knowledge-base-bucket/customer-data/customerB/policies/
        s3://amzn-s3-demo-my-knowledge-base-bucket/customer-data/customerB/procedures/

This structure makes sure that Customer A’s healthcare documentation remains completely separate from Customer B’s data. When combined with Amazon Bedrock Knowledge Bases metadata filtering, you can verify that users associated with Customer A can only access their organization’s documents, and Customer B’s users can only see their own data—maintaining strict data boundaries while using a single, efficient knowledge base infrastructure.

The Amazon Bedrock Knowledge Bases metadata filtering capability enhances this segregation by allowing you to tag documents with customer-specific identifiers and other relevant attributes. These metadata filters provide an additional layer of security and organization, making sure that queries only return results from the appropriate customer’s dataset.

Solution overview

The following diagram provides a high-level overview of AWS services and features through a sample use case. Although the example uses Customer A and Customer B for illustration, these can represent distinct business units (such as departments, companies, or teams) with different compliance requirements, rather than only individual customers.

The workflow consists of the following steps:

  1. Customer data is uploaded along with metadata indicating data ownership and other properties to specific folders in an S3 bucket.
  2. The S3 bucket, containing customer data and metadata, is configured as a knowledge base data source. Amazon Bedrock Knowledge Bases ingests the data, along with the metadata, from the source repository and a knowledge base sync is performed.
  3. A customer initiates a query using a frontend application with metadata filters against the Amazon Bedrock knowledge base. An access control metadata filter must be in place to make sure that the customer only accesses data they own; the customer can apply additional filters to further refine query results. This combined query and filter is passed to the RetrieveAndGenerate API.
  4. The RetrieveAndGenerate API handles the core RAG workflow. It consists of several sub-steps:
    • The user query is converted into a vector representation (embedding).
    • Using the query embedding and the metadata filter, relevant documents are retrieved from the knowledge base.
    • The original query is augmented with the retrieved documents, providing context for the large language model (LLM).
    • The LLM generates a response based on the augmented query and retrieved context.
  5. Finally, the generated response is sent back to the user.

When implementing Amazon Bedrock Knowledge Bases in scenarios involving sensitive information or requiring access controls, developers must implement proper metadata filtering in their application code. Failure to enforce appropriate metadata-based filtering could result in unauthorized access to sensitive documents within the knowledge base. Metadata filtering serves as a critical security boundary and should be consistently applied across all queries. For comprehensive guidance on implementing secure metadata filtering practices, refer to the Amazon Bedrock Knowledge Base Security documentation.
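To illustrate that pattern, the following is a minimal sketch of a hypothetical helper that always applies the tenant’s access-control filter and combines it with optional user-supplied refinements, using the customer metadata key introduced later in this post and the Knowledge Bases filter syntax:

def build_retrieval_filter(tenant_id, extra_filters=None):
    # Mandatory access-control filter: restrict results to the caller's tenant
    access_filter = {"equals": {"key": "customer", "value": tenant_id}}
    if not extra_filters:
        return access_filter
    # Combine the tenant filter with any additional user-supplied filters
    return {"andAll": [access_filter, *extra_filters]}

# Example: Customer A further narrowing results to documents published in 2022
kb_filter = build_retrieval_filter(
    "CustomerA",
    extra_filters=[{"equals": {"key": "publicationYear", "value": 2022}}],
)

The resulting dictionary can then be passed as the filter value in vectorSearchConfiguration, as shown in the RetrieveAndGenerate example later in this post.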

Implement metadata filtering

For this use case, two specific example customers, Customer A and Customer B, are aligned to different proprietary compliance documents. The number of customers and folders can scale to N depending on the size of the customer base. We will use the following public documents, which will reside in the respective customer’s S3 folder. Customer A requires the Architecting for HIPAA Security and Compliance on AWS document. Customer B requires access to the Using AWS in the Context of NHS Cloud Security Guidance document.

  1. Create a JSON file representing the corresponding metadata for both Customer A and Customer B:

The following is the JSON metadata for Customer A’s data:

{ "metadataAttributes": { "customer": "CustomerA", "documentType": "HIPAA Compliance Guide", "focus": "HIPAA Compliance", "publicationYear": 2022, "region": "North America" }}

The following is the JSON metadata for Customer B’s data:

{ "metadataAttributes": { "customer": "CustomerB", "documentType": "NHS Compliance Guidance", "focus": "UK Healthcare Compliance", "publicationYear": 2023, "region": "Europe" }}

  2. Save these files separately with the naming convention <filename>.pdf.metadata.json and store them in the same S3 folder or prefix that stores the source document (see the upload sketch after this list). For Customer A, name the metadata file architecting-hipaa-compliance-on-aws.pdf.metadata.json and upload it to the folder corresponding to Customer A’s documents. Repeat these steps for Customer B.
  3. Create an Amazon Bedrock knowledge base. For instructions, see Create a knowledge base by connecting to a data source in Amazon Bedrock Knowledge Bases.
  4. After you create your knowledge base, you can sync the data source. For more details, see Sync your data with your Amazon Bedrock knowledge base.
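The following is a minimal sketch of the upload step for Customer A, assuming the bucket and prefix from the earlier folder-structure example and that both files are in your working directory:

import boto3

s3 = boto3.client("s3")
bucket = "amzn-s3-demo-my-knowledge-base-bucket"
prefix = "customer-data/customerA/policies/"

# Upload Customer A's source document and its sidecar metadata file to the same prefix
for filename in [
    "architecting-hipaa-compliance-on-aws.pdf",
    "architecting-hipaa-compliance-on-aws.pdf.metadata.json",
]:
    s3.upload_file(filename, bucket, f"{prefix}{filename}")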

Test metadata filtering

After you sync the data source, you can test the metadata filtering.

The following is an example of setting the customer = CustomerA metadata filter to show that Customer A only has access to the HIPAA compliance document and not the NHS Compliance Guidance document that relates to Customer B.

To use the metadata filtering options on the Amazon Bedrock console, complete the following steps:

  1. On the Amazon Bedrock console, choose Knowledge Bases in the navigation pane.
  2. Choose the knowledge base you created.
  3. Choose Test knowledge base.
  4. Choose the Configurations icon, then expand Filters.
  5. Enter a condition using the format: key = value (for this example, customer = CustomerA) and press Enter.
  6. When finished, enter your query in the message box, then choose Run.

We enter two queries, “summarize NHS Compliance Guidance” and “summarize HIPAA Compliance Guide.” The following figure shows the two queries: one attempting to query data related to NHS compliance guidance, which fails because it is outside of the Customer A segment, and another successfully querying data on HIPAA compliance, which has been tagged for Customer A.

Implement field-specific chunking

Amazon Bedrock Knowledge Bases supports several document types for Amazon S3 metadata filtering. The supported file formats include:

  • Plain text (.txt)
  • Markdown (.md)
  • HTML (.html)
  • Microsoft Word documents (.doc and .docx)
  • CSV files (.csv)
  • Microsoft Excel spreadsheets (.xls and .xlsx)

When working with CSV data, customers often want to chunk on a specific field in their CSV documents to gain granular control over data retrieval and enhance the efficiency and accuracy of queries. By creating logical divisions based on fields, users can quickly access relevant subsets of data without needing to process the entire dataset.

Additionally, field-specific chunking aids in organizing and maintaining large datasets, facilitating updating or modifying specific portions without affecting the whole. This granularity supports better version control and data lineage tracking, which are crucial for data integrity and compliance. Focusing on relevant chunks can improve the performance of LLMs, ultimately leading to more accurate insights and better decision-making processes within organizations. For more information, see Amazon Bedrock Knowledge Bases now supports advanced parsing, chunking, and query reformulation giving greater control of accuracy in RAG based applications.

To demonstrate field-specific chunking, we use two sample datasets with the following schemas:

  • Schema 1 – Customer A uses the following synthetic dataset for recording medical case reports (case_reports.csv):

| CaseID | DoctorID | PatientID | Diagnosis | TreatmentPlan | Content |
| --- | --- | --- | --- | --- | --- |
| C001 | D001 | P001 | Hypertension | Lifestyle changes, Medication (Lisinopril) | "Patient diagnosed with hypertension, advised lifestyle changes, and started on Lisinopril." |
| C002 | D002 | P002 | Diabetes Type 2 | Medication (Metformin), Diet adjustment | "Diabetes Type 2 confirmed, prescribed Metformin, and discussed a low-carb diet plan." |
| C003 | D003 | P003 | Asthma | Inhaler (Albuterol) | "Patient reports difficulty breathing; prescribed Albuterol inhaler for asthma management." |
| C004 | D004 | P004 | Coronary Artery Disease | Medication (Atorvastatin), Surgery Consultation | "Coronary artery disease diagnosed, started on Atorvastatin, surgery consultation recommended." |

  • Schema 2 – Customer B uses the following dataset for recording genetic testing results (genetic_testings.csv):

| SampleID | PatientID | TestType | Result |
| --- | --- | --- | --- |
| S001 | P001 | Genome Sequencing | Positive |
| S002 | P002 | Exome Sequencing | Negative |
| S003 | P003 | Targeted Gene Panel | Positive |
| S004 | P004 | Whole Genome Sequencing | Negative |

Complete the following steps:

  1. Create a JSON file representing the corresponding metadata for both Customer A and Customer B:

The following is the JSON metadata for Customer A’s data (note that recordBasedStructureMetadata supports exactly one content field):

{
    "metadataAttributes": {
        "customer": "CustomerA"
    },
    "documentStructureConfiguration": {
        "type": "RECORD_BASED_STRUCTURE_METADATA",
        "recordBasedStructureMetadata": {
            "contentFields": [
                {
                    "fieldName": "Content"
                }
            ],
            "metadataFieldsSpecification": {
                "fieldsToInclude": [
                    {
                        "fieldName": "CaseID"
                    },
                    {
                        "fieldName": "DoctorID"
                    },
                    {
                        "fieldName": "PatientID"
                    },
                    {
                        "fieldName": "Diagnosis"
                    },
                    {
                        "fieldName": "TreatmentPlan"
                    }
                ]
            }
        }
    }
}

The following is the JSON metadata for Customer B’s data:

{
    "metadataAttributes": {
        "customer": "CustomerB"
    },
    "documentStructureConfiguration": {
        "type": "RECORD_BASED_STRUCTURE_METADATA",
        "recordBasedStructureMetadata": {
            "contentFields": [
                {
                    "fieldName": "TestType"
                }
            ],
            "metadataFieldsSpecification": {
                "fieldsToInclude": [
                    {
                        "fieldName": "SampleID"
                    },
                    {
                        "fieldName": "PatientID"
                    },
                    {
                        "fieldName": "Result"
                    }
                ]
            }
        }
    }
}
  2. Save your files with the naming convention <filename>.csv.metadata.json and store the new JSON file in the same S3 prefix of the bucket where you stored the dataset. For Customer A, name the metadata file case_reports.csv.metadata.json and upload the file to the same folder corresponding to Customer A’s datasets.

Repeat the process for Customer B. You have now created metadata from the source CSV itself, as well as an additional metadata field customer that doesn’t exist in the original dataset. The following image highlights the metadata.

  3. Create an Amazon Bedrock knowledge base.
  4. Sync your data with your Amazon Bedrock knowledge base.

Test field-specific chunking

The following is an example of setting the customer = CustomerA metadata filter demonstrating that Customer A only has access to the medical case reports dataset and not the genetic testing dataset that relates to Customer B. We enter a query requesting information about a patient with PatientID as P003.

To test, complete the following steps:

  1. On the Amazon Bedrock console, choose Knowledge Bases in the navigation pane.
  2. Choose the knowledge base you created.
  3. Choose Test knowledge base.
  4. Choose the Configurations icon, then expand Filters.
  5. Enter a condition using the format: key = value (for this example, customer = CustomerA) and press Enter.
  6. When finished, enter your query in the message box, then choose Run.

The knowledge base returns, “Patient reports difficulty breathing; prescribed Albuterol inhaler for asthma management,” which is the Content field entry from Customer A’s medical case reports dataset for that PatientID. Although there is a record with the same PatientID in Customer B’s genetic testing dataset, Customer A has access only to the medical case reports data due to the metadata filtering.

Apply metadata filtering for the Amazon Bedrock API

You can call the Amazon Bedrock API RetrieveAndGenerate to query a knowledge base and generate responses based on the retrieved results using the specified FM or inference profile. The response only cites sources that are relevant to the query.

The following Python Boto3 example API call applies the metadata filtering for retrieving Customer B data and generates responses based on the retrieved results using the specified FM (Anthropic’s Claude 3 Sonnet) in RetrieveAndGenerate:

import boto3

region = "us-east-1"  # example Region; use the Region hosting your knowledge base
bedrock_client = boto3.client("bedrock-agent-runtime", region_name=region)

response = bedrock_client.retrieve_and_generate(
    input={
        "text": "Summarize NHS compliance guidance."
    },
    retrieveAndGenerateConfiguration={
        "type": "KNOWLEDGE_BASE",
        "knowledgeBaseConfiguration": {
            "knowledgeBaseId": "example_knowledge_base_id",
            "modelArn": "arn:aws:bedrock:{}::foundation-model/anthropic.claude-3-sonnet-20240229-v1:0".format(region),
            "retrievalConfiguration": {
                "vectorSearchConfiguration": {
                    "numberOfResults": 5,
                    "filter": {
                        "equals": {
                            "key": "customer",
                            "value": "CustomerB"
                        }
                    }
                }
            }
        }
    }
)

The following GitHub repository provides a notebook that you can follow to deploy an Amazon Bedrock knowledge base with access control implemented using metadata filtering in your own AWS account.

Integrate existing vector databases with Amazon Bedrock Knowledge Bases and validate metadata

There are multiple ways to create vector databases from AWS services and partner offerings to build scalable solutions. If a vector database doesn’t exist, you can use Amazon Bedrock Knowledge Bases to create one using Amazon OpenSearch Serverless, Amazon Aurora PostgreSQL Serverless, or Amazon Neptune Analytics to store embeddings, or you can specify an existing vector database supported by Redis Enterprise Cloud, Amazon Aurora PostgreSQL with the pgvector extension, MongoDB Atlas, or Pinecone. After you create your knowledge base and either ingest or sync your data, the metadata attached to the data will be ingested and automatically populated to the vector database.

In this section, we review how to incorporate and validate metadata filtering with existing vector databases using OpenSearch Serverless, Aurora PostgreSQL with the pgvector extension, and Pinecone. To learn how to set up each of these vector databases, follow the instructions in Prerequisites for your own vector store for a knowledge base.

OpenSearch Serverless as a knowledge base vector store

With OpenSearch Serverless vector database capabilities, you can implement semantic search, RAG with LLMs, and recommendation engines. To address data segregation between business segments within each Amazon Bedrock knowledge base with an OpenSearch Serverless vector database, use metadata filtering. Metadata filtering allows you to segment data inside of an OpenSearch Serverless vector database. This can be useful when you want to add descriptive data to your documents for more control and granularity in searches.

Each OpenSearch Serverless dashboard has a URL that can be used to add documents and query your database; the structure of the URL is domain-endpoint/_dashboard.

After creating a vector database index, you can use metadata filtering to selectively retrieve items by using JSON query options in the request body. For example, to return records owned by Customer A, you can use the following request:

GET <index_name>/_search
{
  "query": {
    "match": {
      "customer": "CustomerA"
    }
  }
}

This query will return a JSON response containing the document index with the document labeled as belonging to Customer A.
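To run the same check from Python instead of the dashboard, the following is a minimal sketch using the opensearch-py client (the collection endpoint and index name below are hypothetical placeholders):

import boto3
from opensearchpy import OpenSearch, RequestsHttpConnection, AWSV4SignerAuth

host = "your-collection-id.us-east-1.aoss.amazonaws.com"  # hypothetical collection endpoint
region = "us-east-1"
index_name = "bedrock-knowledge-base-default-index"  # hypothetical index name

credentials = boto3.Session().get_credentials()
auth = AWSV4SignerAuth(credentials, region, "aoss")  # "aoss" = OpenSearch Serverless

client = OpenSearch(
    hosts=[{"host": host, "port": 443}],
    http_auth=auth,
    use_ssl=True,
    connection_class=RequestsHttpConnection,
)

# Return only documents tagged as belonging to Customer A
response = client.search(
    index=index_name,
    body={"query": {"match": {"customer": "CustomerA"}}},
)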

Aurora PostgreSQL with the pgvector extension as a knowledge base vector store

Pgvector is an extension of PostgreSQL that allows you to extend your relational database into a high-dimensional vector database. It stores each document’s vector in a separate row of a database table. For details on creating an Aurora PostgreSQL table to be used as the vector store for a knowledge base, see Using Aurora PostgreSQL as a Knowledge Base for Amazon Bedrock.

When storing a vector index for your knowledge base in an Aurora database cluster, make sure that the table for your index contains a column for each metadata property in your metadata files before starting data ingestion.

Continuing with the Customer A example, the customer requires the Architecting for HIPAA Security and Compliance on AWS document.

The following is the JSON metadata for Customer A’s data:

{ "metadataAttributes": { "customer": "CustomerA", "documentType": "HIPAA Compliance Guide", "focus": "HIPAA Compliance", "publicationYear": 2022, "region": "North America" }}

The schema of the PostgreSQL table you create must contain four essential columns for ID, text content, vector values, and service managed metadata; it must also include additional metadata columns (customer, documentType, focus, publicationYear, region) for each metadata property in the corresponding metadata file. This allows pgvector to perform efficient vector searches and similarity comparisons by running queries directly on the database table. The following table summarizes the columns.

| Column Name | Data Type | Description |
| --- | --- | --- |
| id | UUID primary key | Contains unique identifiers for each record |
| chunks | Text | Contains the chunks of raw text from your data sources |
| embedding | Vector | Contains the vector embeddings of the data sources |
| metadata | JSON | Contains Amazon Bedrock managed metadata required to carry out source attribution and to enable data ingestion and querying |
| customer | Text | Contains the customer ID |
| documentType | Text | Contains the type of document |
| focus | Text | Contains the document focus |
| publicationYear | Int | Contains the year the document was published |
| region | Text | Contains the document’s related AWS Region |

During Amazon Bedrock knowledge base data ingestion, these columns will be populated with the corresponding attribute values. Chunking can break down a single document into multiple separate records (each associated with a different ID).

This PostgreSQL table structure allows for efficient storage and retrieval of document vectors, using PostgreSQL’s robustness and pgvector’s specialized vector handling capabilities for applications like recommendation systems, search engines, or other systems requiring similarity searches in high-dimensional space.

Using this approach, you can implement access control at the table level by creating database tables for each segment. You can also include additional metadata columns in the table for properties such as the specific document owner (user_id) or tags to enable and enforce fine-grained (row-level) access control and result filtering: restrict each user to querying only the rows that contain their user ID (document owner).

After creating a vector database table, you can use metadata filtering to selectively retrieve items by using a PostgreSQL query. For example, to return table records owned by Customer A, you can use the following query:

SELECT *
FROM bedrock_integration.bedrock_kb
WHERE customer = 'CustomerA';

This query will return a response containing the database records with the document labeled as belonging to Customer A.
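To run the same filtered lookup from application code, and to combine it with a pgvector similarity ordering, the following is a minimal sketch in Python (assuming hypothetical connection details, the psycopg2 driver, and the table schema described above):

import psycopg2

# Hypothetical connection details for the Aurora PostgreSQL cluster used as the vector store
conn = psycopg2.connect(
    host="your-aurora-cluster.cluster-xxxxxxxxxxxx.us-east-1.rds.amazonaws.com",
    dbname="postgres",
    user="bedrock_user",
    password="<password>",
)

# Placeholder query embedding; a real embedding must match your embedding model's dimension
query_embedding = "[0.17, 0.96, 0.44]"

with conn, conn.cursor() as cur:
    # Row-level metadata filter (customer) combined with pgvector cosine-distance ordering
    cur.execute(
        """
        SELECT id, chunks
        FROM bedrock_integration.bedrock_kb
        WHERE customer = %s
        ORDER BY embedding <=> %s::vector
        LIMIT 5;
        """,
        ("CustomerA", query_embedding),
    )
    rows = cur.fetchall()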

Pinecone as a knowledge base vector store

Pinecone, a fully managed vector database, enables semantic search, high-performance search, and similarity matching. Pinecone databases can be integrated into your AWS environment in the form of Amazon Bedrock knowledge bases, but are first created through the Pinecone console. For detailed documentation about setting up a vector store in Pinecone, see Pinecone as a Knowledge Base for Amazon Bedrock. Then, you can integrate the databases using the Amazon Bedrock console. For more information about Pinecone integration with Amazon Bedrock, see Bring reliable GenAI applications to market with Amazon Bedrock and Pinecone.

You can segment a Pinecone database by adding descriptive metadata to each index and using that metadata to inform query results. Pinecone supports strings and lists of strings to filter vector searches on customer names, customer industry, and so on. Pinecone also supports numbers and booleans.

Use the metadata query language to filter output ($eq, $ne, $in, $nin, $and, and $or). The following example query in Python demonstrates how you can retrieve records associated with Customer A from the Pinecone database by filtering on the customer metadata field.

from pinecone import Pinecone

pc = Pinecone(api_key="xxxxxxxxxxx")

index = pc.Index("<index_name>")  # replace with your index name

index.query(
    namespace="",
    vector=[0.17, 0.96, …, 0.44],  # query embedding (truncated here)
    filter={
        "customer": {"$eq": "CustomerA"}
    },
    top_k=10,
    include_metadata=True  # Include metadata in the response.
)

This query will return a response containing the database records labeled as belonging to Customer A.

Enhanced scaling with multiple data sources

Amazon Bedrock Knowledge Bases now supports multiple data sources across AWS accounts. Amazon Bedrock Knowledge Bases can ingest data from up to five data sources, enhancing the comprehensiveness and relevancy of a knowledge base. This feature allows customers with complex IT systems to incorporate data into generative AI applications without restructuring or migrating data sources. It also provides flexibility for you to scale your Amazon Bedrock knowledge bases when data resides in different AWS accounts.

The feature includes cross-account data access, enabling the configuration of S3 buckets as data sources across different accounts, and efficient data management options for retaining or deleting data when a source is removed. These enhancements alleviate the need to create multiple knowledge bases or redundant data copies.
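
The following sketch illustrates how an additional cross-account S3 data source might be attached programmatically with the AWS SDK for Python (Boto3). The knowledge base ID, bucket ARN, and account ID are placeholders, and you should verify the exact parameter names (such as bucketOwnerAccountId and dataDeletionPolicy) against the current bedrock-agent API reference.

import boto3

bedrock_agent = boto3.client("bedrock-agent")

# Attach an S3 bucket owned by another AWS account as an additional data source
response = bedrock_agent.create_data_source(
    knowledgeBaseId="<knowledge-base-id>",
    name="cross-account-docs",
    dataSourceConfiguration={
        "type": "S3",
        "s3Configuration": {
            "bucketArn": "arn:aws:s3:::<bucket-in-other-account>",
            "bucketOwnerAccountId": "<other-account-id>",
        },
    },
    dataDeletionPolicy="RETAIN",  # Keep previously ingested data if this source is later removed
)
print(response["dataSource"]["dataSourceId"])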

Clean up

After completing the steps in this blog post, make sure to clean up your resources to avoid incurring unnecessary charges. Delete the Amazon Bedrock Knowledge Base by navigating to the Amazon Bedrock console, selecting your knowledge base, and choosing “Delete” from the “Actions” dropdown menu. If you created vector databases for testing, remember to delete OpenSearch Serverless collections, stop or delete Aurora PostgreSQL instances, and remove any Pinecone indexes you created. Additionally, consider deleting test documents uploaded to S3 buckets specifically for this blog example to avoid storage charges. Review and clean up any IAM roles or policies created for this demonstration if they’re no longer needed.

While Amazon Bedrock Knowledge Bases include charges for data indexing and queries, the underlying storage in S3 and vector databases will continue to incur charges until those resources are removed. For specific pricing details, refer to the Amazon Bedrock pricing page.

Conclusion

In this post, we covered several key strategies for building scalable, secure, and segmented Amazon Bedrock knowledge bases. These include using S3 folder structures and metadata to organize data sources, and segmenting data within a single knowledge base. Using metadata filtering to create custom queries that target specific data segments helps improve retrieval accuracy and maintain data privacy. We also explored integrating and validating metadata for vector databases including OpenSearch Serverless, Aurora PostgreSQL with the pgvector extension, and Pinecone.

By consolidating multiple business segments or customer data within a single Amazon Bedrock knowledge base, organizations can achieve cost optimization compared to creating and managing them separately. The improved data segmentation and access control measures help make sure each team or customer can only access the information relevant to their domain. The enhanced scalability helps meet the diverse needs of organizations, while maintaining the necessary data segregation and access control.

Try out metadata filtering with Amazon Bedrock Knowledge Bases, and share your thoughts and questions with the authors or in the comments.


About the Authors

Breanne Warner is an Enterprise Solutions Architect at Amazon Web Services supporting healthcare and life science customers. She is passionate about supporting customers to use generative AI on AWS and evangelizing 1P and 3P model adoption. Breanne is also on the Women at Amazon board as co-director of Allyship with the goal of fostering inclusive and diverse culture at Amazon. Breanne holds a Bachelor of Science in Computer Engineering from University of Illinois at Urbana Champaign.

Justin Lin is a Small & Medium Business Solutions Architect at Amazon Web Services. He studied computer science at UW Seattle. Dedicated to designing and developing innovative solutions that empower customers, Justin devotes his time to experimenting with applications in generative AI, natural language processing, and forecasting.

Chloe Gorgen is an Enterprise Solutions Architect at Amazon Web Services, advising AWS customers in various topics including security, analytics, data management, and automation. Chloe is passionate about youth engagement in technology, and supports several AWS initiatives to foster youth interest in cloud-based technology. Chloe holds a Bachelor of Science in Statistics and Analytics from the University of North Carolina at Chapel Hill.

Read More

Effectively use prompt caching on Amazon Bedrock

Effectively use prompt caching on Amazon Bedrock

Prompt caching, now generally available on Amazon Bedrock with Anthropic’s Claude 3.5 Haiku and Claude 3.7 Sonnet, along with Nova Micro, Nova Lite, and Nova Pro models, lowers response latency by up to 85% and reduces costs up to 90% by caching frequently used prompts across multiple API calls.

With prompt caching, you can mark the specific contiguous portions of your prompts to be cached (known as a prompt prefix). When a request is made with the specified prompt prefix, the model processes the input and caches the internal state associated with the prefix. On subsequent requests with a matching prompt prefix, the model reads from the cache and skips the computation steps required to process the input tokens. This reduces the time to first token (TTFT) and makes more efficient use of hardware such that we can share the cost savings with you.

This post provides a detailed overview of the prompt caching feature on Amazon Bedrock and offers guidance on how to effectively use this feature to achieve improved latency and cost savings.

How prompt caching works

Large language model (LLM) processing is made up of two primary stages: input token processing and output token generation. The prompt caching feature on Amazon Bedrock optimizes the input token processing stage.

You can begin by marking the relevant portions of your prompt with cache checkpoints. The entire section of the prompt preceding the checkpoint then becomes the cached prompt prefix. As you send more requests with the same prompt prefix, marked by the cache checkpoint, the LLM will check if the prompt prefix is already stored in the cache. If a matching prefix is found, the LLM can read from the cache, allowing the input processing to resume from the last cached prefix. This saves the time and cost that would otherwise be spent recomputing the prompt prefix.

Be advised that the prompt caching feature is model-specific. You should review the supported models and details on the minimum number of tokens per cache checkpoint and maximum number of cache checkpoints per request.

Cache hits only occur when the exact prefix matches. To fully realize the benefits of prompt caching, it’s recommended to position static content such as instructions and examples at the beginning of the prompt. Dynamic content, including user-specific information, should be placed at the end of the prompt. This principle also extends to images and tools, which must remain identical across requests in order to enable caching.

The following diagram illustrates how cache hits work. A, B, C, D represent distinct portions of the prompt. A, B and C are marked as the prompt prefix. Cache hits occur when subsequent requests contain the same A, B, C prompt prefix.

When to use prompt caching

Prompt caching on Amazon Bedrock is recommended for workloads that involve long context prompts that are frequently reused across multiple API calls. This capability can significantly improve response latency by up to 85% and reduce inference costs by up to 90%, making it well-suited for applications that use repetitive, long input context. To determine if prompt caching is beneficial for your use case, you will need to estimate the number of tokens you plan to cache, the frequency of reuse, and the time between requests.

The following use cases are well-suited for prompt caching:

  • Chat with document – By caching the document as input context on the first request, each user query becomes more efficient, enabling simpler architectures that avoid heavier solutions like vector databases.
  • Coding assistants – Reusing long code files in prompts enables near real-time inline suggestions, eliminating much of the time spent reprocessing code files.
  • Agentic workflows – Longer system prompts can be used to refine agent behavior without degrading the end-user experience. By caching the system prompts and complex tool definitions, the time to process each step in the agentic flow can be reduced.
  • Few-shot learning – Including numerous high-quality examples and complex instructions, such as for customer service or technical troubleshooting, can benefit from prompt caching.

How to use prompt caching

When evaluating a use case to use prompt caching, it’s crucial to categorize the components of a given prompt into two distinct groups: the static and repetitive portion, and the dynamic portion. The prompt template should adhere to the structure illustrated in the following figure.

You can create multiple cache checkpoints within a request, subject to model-specific limitations. The prompt should follow the same structure of static portion, cache checkpoint, and dynamic portion, as illustrated in the following figure.

Use case example

The “chat with document” use case, where the document is included in the prompt, is well-suited for prompt caching. In this example, the static portion of the prompt would comprise instructions on response formatting and the body of the document. The dynamic portion would be the user’s query, which changes with each request.

In this scenario, the static portions of the prompt should be marked as the prompt prefixes to enable prompt caching. The following code snippet demonstrates how to implement this approach using the Invoke Model API. Here we create two cache checkpoints in the request, one for the instructions and one for the document content, as illustrated in the following figure.

We use the following prompt:

import json

import boto3
import requests

bedrock_runtime = boto3.client("bedrock-runtime")  # Assumes AWS credentials and Region are configured


def chat_with_document(document, user_query):
    instructions = (
    "I will provide you with a document, followed by a question about its content. "
    "Your task is to analyze the document, extract relevant information, and provide "
    "a comprehensive answer to the question. Please follow these detailed instructions:"

    "\n\n1. Identifying Relevant Quotes:"
    "\n   - Carefully read through the entire document."
    "\n   - Identify sections of the text that are directly relevant to answering the question."
    "\n   - Select quotes that provide key information, context, or support for the answer."
    "\n   - Quotes should be concise and to the point, typically no more than 2-3 sentences each."
    "\n   - Choose a diverse range of quotes if multiple aspects of the question need to be addressed."
    "\n   - Aim to select between 2 to 5 quotes, depending on the complexity of the question."

    "\n\n2. Presenting the Quotes:"
    "\n   - List the selected quotes under the heading 'Relevant quotes:'"
    "\n   - Number each quote sequentially, starting from [1]."
    "\n   - Present each quote exactly as it appears in the original text, enclosed in quotation marks."
    "\n   - If no relevant quotes can be found, write 'No relevant quotes' instead."
    "\n   - Example format:"
    "\n     Relevant quotes:"
    "\n     [1] \"This is the first relevant quote from the document.\""
    "\n     [2] \"This is the second relevant quote from the document.\""

    "\n\n3. Formulating the Answer:"
    "\n   - Begin your answer with the heading 'Answer:' on a new line after the quotes."
    "\n   - Provide a clear, concise, and accurate answer to the question based on the information in the document."
    "\n   - Ensure your answer is comprehensive and addresses all aspects of the question."
    "\n   - Use information from the quotes to support your answer, but do not repeat them verbatim."
    "\n   - Maintain a logical flow and structure in your response."
    "\n   - Use clear and simple language, avoiding jargon unless it's necessary and explained."

    "\n\n4. Referencing Quotes in the Answer:"
    "\n   - Do not explicitly mention or introduce quotes in your answer (e.g., avoid phrases like 'According to quote [1]')."
    "\n   - Instead, add the bracketed number of the relevant quote at the end of each sentence or point that uses information from that quote."
    "\n   - If a sentence or point is supported by multiple quotes, include all relevant quote numbers."
    "\n   - Example: 'The company's revenue grew by 15% last year. [1] This growth was primarily driven by increased sales in the Asian market. [2][3]'"

    "\n\n5. Handling Uncertainty or Lack of Information:"
    "\n   - If the document does not contain enough information to fully answer the question, clearly state this in your answer."
    "\n   - Provide any partial information that is available, and explain what additional information would be needed to give a complete answer."
    "\n   - If there are multiple possible interpretations of the question or the document's content, explain this and provide answers for each interpretation if possible."

    "\n\n6. Maintaining Objectivity:"
    "\n   - Stick to the facts presented in the document. Do not include personal opinions or external information not found in the text."
    "\n   - If the document presents biased or controversial information, note this objectively in your answer without endorsing or refuting the claims."

    "\n\n7. Formatting and Style:"
    "\n   - Use clear paragraph breaks to separate different points or aspects of your answer."
    "\n   - Employ bullet points or numbered lists if it helps to organize information more clearly."
    "\n   - Ensure proper grammar, punctuation, and spelling throughout your response."
    "\n   - Maintain a professional and neutral tone throughout your answer."

    "\n\n8. Length and Depth:"
    "\n   - Provide an answer that is sufficiently detailed to address the question comprehensively."
    "\n   - However, avoid unnecessary verbosity. Aim for clarity and conciseness."
    "\n   - The length of your answer should be proportional to the complexity of the question and the amount of relevant information in the document."

    "\n\n9. Dealing with Complex or Multi-part Questions:"
    "\n   - For questions with multiple parts, address each part separately and clearly."
    "\n   - Use subheadings or numbered points to break down your answer if necessary."
    "\n   - Ensure that you've addressed all aspects of the question in your response."

    "\n\n10. Concluding the Answer:"
    "\n    - If appropriate, provide a brief conclusion that summarizes the key points of your answer."
    "\n    - If the question asks for recommendations or future implications, include these based strictly on the information provided in the document."

    "\n\nRemember, your goal is to provide a clear, accurate, and well-supported answer based solely on the content of the given document. "
    "Adhere to these instructions carefully to ensure a high-quality response that effectively addresses the user's query."
    )

    document_content = f"Here is the document: <document> {document} </document>"

    messages_API_body = {
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 4096,
        "messages": [
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": instructions,
                        # Cache checkpoint 1: the static instructions
                        "cache_control": {
                            "type": "ephemeral"
                        }
                    },
                    {
                        "type": "text",
                        "text": document_content,
                        # Cache checkpoint 2: the document body
                        "cache_control": {
                            "type": "ephemeral"
                        }
                    },
                    {
                        "type": "text",
                        "text": user_query  # Dynamic portion: changes on every request
                    },
                ]
            }
        ]
    }

    response = bedrock_runtime.invoke_model(
        body=json.dumps(messages_API_body),
        modelId="us.anthropic.claude-3-7-sonnet-20250219-v1:0",
        accept="application/json",
        contentType="application/json"
    )
    response_body = json.loads(response.get("body").read())
    print(json.dumps(response_body, indent=2))


response = requests.get("https://aws.amazon.com/blogs/aws/reduce-costs-and-latency-with-amazon-bedrock-intelligent-prompt-routing-and-prompt-caching-preview/")
blog = response.text
chat_with_document(blog, "What is the blog writing about?")

In the response to the preceding code snippet, there is a usage section that provides metrics on the cache reads and writes. The following is the example response from the first model invocation:

{
  "id": "msg_bdrk_01BwzJX6DBVVjUDeRqo3Z6GL",
  "type": "message",
  "role": "assistant",
  "model": "claude-3-7-sonnet-20250219”,
  "content": [
    {
      "type": "text",
      "text": "Relevant quotes:n[1] "Today, Amazon Bedrock has introduced in preview two capabilities that help reduce costs and latency for generative AI applications"nn[2] "Amazon Bedrock Intelligent Prompt Routing u2013 When invoking a model, you can now use a combination of foundation models (FMs) from the same model family to help optimize for quality and cost... Intelligent Prompt Routing can reduce costs by up to 30 percent without compromising on accuracy."nn[3] "Amazon Bedrock now supports prompt caching u2013 You can now cache frequently used context in prompts across multiple model invocations... Prompt caching in Amazon Bedrock can reduce costs by up to 90% and latency by up to 85% for supported models."nnAnswer:nThe article announces two new preview features for Amazon Bedrock that aim to improve cost efficiency and reduce latency in generative AI applications [1]:nn1. Intelligent Prompt Routing: This feature automatically routes requests between different models within the same model family based on the complexity of the prompt, choosing more cost-effective models for simpler queries while maintaining quality. This can reduce costs by up to 30% [2].nn2. Prompt Caching: This capability allows frequent reuse of cached context across multiple model invocations, which is particularly useful for applications that repeatedly use the same context (like document Q&A systems). This feature can reduce costs by up to 90% and improve latency by up to 85% [3].nnThese features are designed to help developers build more efficient and cost-effective generative AI applications while maintaining performance and quality standards."
    }
  ],
  "stop_reason": "end_turn",
  "stop_sequence": null,
  "usage": {
    "input_tokens": 9,
    "cache_creation_input_tokens": 37209,
    "cache_read_input_tokens": 0,
    "output_tokens": 357
  }
}

The cache checkpoint has been successfully created with 37,209 tokens cached, as indicated by the cache_creation_input_tokens value, as illustrated in the following figure.

For the subsequent request, we can ask a different question:

chat_with_document(blog, "what are the use cases?")

The dynamic portion of the prompt has been changed, but the static portion and prompt prefixes remain the same. We can expect cache hits from the subsequent invocations. See the following code:

{
  "id": "msg_bdrk_01HKoDMs4Bmm9mhzCdKoQ8bQ",
  "type": "message",
  "role": "assistant",
  "model": "claude-3-7-sonnet-20250219",
  "content": [
    {
      "type": "text",
      "text": "Relevant quotes:n[1] "This is particularly useful for applications such as customer service assistants, where uncomplicated queries can be handled by smaller, faster, and more cost-effective models, and complex queries are routed to more capable models."nn[2] "This is especially valuable for applications that repeatedly use the same context, such as document Q&A systems where users ask multiple questions about the same document or coding assistants that need to maintain context about code files."nn[3] "During the preview, you can use the default prompt routers for Anthropic's Claude and Meta Llama model families."nnAnswer:nThe document describes two main features with different use cases:nn1. Intelligent Prompt Routing:n- Customer service applications where query complexity variesn- Applications needing to balance between cost and performancen- Systems that can benefit from using different models from the same family (Claude or Llama) based on query complexity [1][3]nn2. Prompt Caching:n- Document Q&A systems where users ask multiple questions about the same documentn- Coding assistants that need to maintain context about code filesn- Applications that frequently reuse the same context in prompts [2]nnBoth features are designed to optimize costs and reduce latency while maintaining response quality. Prompt routing can reduce costs by up to 30% without compromising accuracy, while prompt caching can reduce costs by up to 90% and latency by up to 85% for supported models."
    }
  ],
  "stop_reason": "end_turn",
  "stop_sequence": null,
  "usage": {
    "input_tokens": 10,
    "cache_creation_input_tokens": 0,
    "cache_read_input_tokens": 37209,
    "output_tokens": 324
  }
}

37,209 tokens are for the document and instructions read from the cache, and 10 input tokens are for the user query, as illustrated in the following figure.

Let’s change the document to a different blog post, but our instructions remain the same. We can expect cache hits for the instructions prompt prefix because it was positioned before the document body in our requests. See the following code:

response = requests.get("https://aws.amazon.com/blogs/machine-learning/enhance-conversational-ai-with-advanced-routing-techniques-with-amazon-bedrock/")
blog = response.text
chat_with_document(blog, "What is the blog writing about?")

{
  "id": "msg_bdrk_011S8zqMXzoGHABHnXX9qSjq",
  "type": "message",
  "role": "assistant",
  "model": "claude-3-7-sonnet-20250219",
  "content": [
    {
      "type": "text",
      "text": "Let me analyze this document and provide a comprehensive answer about its main topic and purpose.nnRelevant quotes:n[1] "When you're designing a security strategy for your organization, firewalls provide the first line of defense against threats. Amazon Web Services (AWS) offers AWS Network Firewall, a stateful, managed network firewall that includes intrusion detection and prevention (IDP) for your Amazon Virtual Private Cloud (VPC)."nn[2] "This blog post walks you through logging configuration best practices, discusses three common architectural patterns for Network Firewall logging, and provides guidelines for optimizing the cost of your logging solution."nn[3] "Determining the optimal logging approach for your organization should be approached on a case-by-case basis. It involves striking a balance between your security and compliance requirements and the costs associated with implementing solutions to meet those requirements."nnAnswer:nThis document is a technical blog post that focuses on cost considerations and logging options for AWS Network Firewall. The article aims to help organizations make informed decisions about implementing and managing their firewall logging solutions on AWS. Specifically, it:nn1. Explains different logging configuration practices for AWS Network Firewall [1]n2. Discusses three main architectural patterns for handling firewall logs:n   - Amazon S3-based solutionn   - Amazon CloudWatch-based solutionn   - Amazon Kinesis Data Firehose with OpenSearch solutionn3. Provides detailed cost analysis and comparisons of different logging approaches [3]n4. Offers guidance on balancing security requirements with cost considerationsnnThe primary purpose is to help AWS users understand and optimize their firewall logging strategies while managing associated costs effectively. The article serves as a practical guide for organizations looking to implement or improve their network security logging while maintaining cost efficiency [2]."
    }
  ],
  "stop_reason": "end_turn",
  "stop_sequence": null,
  "usage": {
    "input_tokens": 9,
    "cache_creation_input_tokens": 37888,
    "cache_read_input_tokens": 1038,
    "output_tokens": 385
  }
}

In the response, we can see 1,038 cache read tokens for the instructions and 37,888 cache write tokens for the new document content, as illustrated in the following figure.

Cost savings

When a cache hit happens, Amazon Bedrock passes along the compute savings to customers by giving a per-token discount on cached context. To calculate the potential cost savings, you should first understand your prompt caching usage pattern with cache write/read metrics in the Amazon Bedrock response. Then you can calculate your potential cost savings with price per 1,000 input tokens (cache write) and price per 1,000 input tokens (cache read). For more price details, see Amazon Bedrock pricing.
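
As a rough illustration, the following sketch estimates the input-side savings using the cache metrics from the example responses shown earlier; the per-1,000-token prices are placeholders, so substitute the current values for your model from the Amazon Bedrock pricing page.

# Placeholder prices in USD per 1,000 input tokens; replace with values from the Bedrock pricing page
PRICE_INPUT = 0.003
PRICE_CACHE_WRITE = 0.00375
PRICE_CACHE_READ = 0.0003

def input_cost(usage):
    """Estimate the input-side cost of one request from the usage block in the response."""
    return (
        usage["input_tokens"] / 1000 * PRICE_INPUT
        + usage["cache_creation_input_tokens"] / 1000 * PRICE_CACHE_WRITE
        + usage["cache_read_input_tokens"] / 1000 * PRICE_CACHE_READ
    )

# Usage values taken from the example responses above (37,209 cached tokens, ~10 dynamic tokens)
cached_call = {"input_tokens": 10, "cache_creation_input_tokens": 0, "cache_read_input_tokens": 37209}
uncached_call = {"input_tokens": 37219, "cache_creation_input_tokens": 0, "cache_read_input_tokens": 0}

savings = 1 - input_cost(cached_call) / input_cost(uncached_call)
print(f"Estimated input-cost savings on a cache hit: {savings:.0%}")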

Latency benchmark

Prompt caching is optimized to improve the TTFT performance on repetitive prompts. Prompt caching is well-suited for conversational applications that involve multi-turn interactions, similar to chat playground experiences. It can also benefit use cases that require repeatedly referencing a large document.

However, prompt caching might be less effective for workloads that involve a lengthy 2,000-token system prompt with a long set of dynamically changing text afterwards. In such cases, the benefits of prompt caching might be limited.

We have published a notebook that shows how to use prompt caching and how to benchmark it in our GitHub repo. The benchmark results depend on your use case: input token count, cached token count, and output token count.

Amazon Bedrock cross-Region inference

Prompt caching can be used in conjunction with cross-Region inference (CRIS). Cross-Region inference automatically selects the optimal AWS Region within your geography to serve your inference request, thereby maximizing available resources and model availability. At times of high demand, these optimizations may lead to increased cache writes.

Metrics and observability

Prompt caching observability is essential for optimizing cost savings and improving latency in applications using Amazon Bedrock. By monitoring key performance metrics, developers can achieve significant efficiency improvements—such as reducing TTFT by up to 85% and cutting costs by up to 90% for lengthy prompts. These metrics are pivotal because they enable developers to assess cache performance accurately and make strategic decisions regarding cache management.

Monitoring with Amazon Bedrock

Amazon Bedrock exposes cache performance data through the API response’s usage section, allowing developers to track essential metrics such as cache hit rates, token consumption (both read and write), and latency improvements. By using these insights, teams can effectively manage caching strategies to enhance application responsiveness and reduce operational costs.

Monitoring with Amazon CloudWatch

Amazon CloudWatch provides a robust platform for monitoring the health and performance of AWS services, including new automatic dashboards tailored specifically for Amazon Bedrock models. These dashboards offer quick access to key metrics and facilitate deeper insights into model performance.

To create custom observability dashboards, complete the following steps:

  1. On the CloudWatch console, create a new dashboard. For a full example, see Improve visibility into Amazon Bedrock usage and performance with Amazon CloudWatch.
  2. Choose CloudWatch as your data source and select Pie for the initial widget type (this can be adjusted later).
  3. Update the time range for metrics (such as 1 hour, 3 hours, or 1 day) to suit your monitoring needs.
  4. Select Bedrock under AWS namespaces.
  5. Enter “cache” in the search box to filter cache-related metrics.
  6. For the model, locate anthropic.claude-3-7-sonnet-20250219-v1:0, and select both CacheWriteInputTokenCount and CacheReadInputTokenCount.
  7. Choose Create widget and then Save to save your dashboard.

The following is a sample JSON configuration for creating this widget:

{
    "view": "pie",
    "metrics": [
        [ "AWS/Bedrock", "CacheReadInputTokenCount" ],
        [ ".", "CacheWriteInputTokenCount" ]
    ],
    "region": "us-west-2",
    "setPeriodToTimeRange": true
}

Understanding cache hit rates

Analyzing cache hit rates involves observing both CacheReadInputTokenCount and CacheWriteInputTokenCount. By summing these metrics over a defined period, developers can gain insights into the efficiency of their caching strategies. With the published pricing for the model-specific price per 1,000 input tokens (cache write) and price per 1,000 input tokens (cache read) on the Amazon Bedrock pricing page, you can estimate the potential cost savings for your specific use case.
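
A minimal sketch of pulling these two metrics with Boto3 follows; the Region, model ID dimension, and one-day window are assumptions for illustration.

import boto3
from datetime import datetime, timedelta, timezone

cloudwatch = boto3.client("cloudwatch", region_name="us-west-2")
end = datetime.now(timezone.utc)
start = end - timedelta(days=1)

def daily_token_total(metric_name):
    """Sum a Bedrock cache metric over the last 24 hours for one model."""
    stats = cloudwatch.get_metric_statistics(
        Namespace="AWS/Bedrock",
        MetricName=metric_name,
        Dimensions=[{"Name": "ModelId", "Value": "anthropic.claude-3-7-sonnet-20250219-v1:0"}],
        StartTime=start,
        EndTime=end,
        Period=86400,
        Statistics=["Sum"],
    )
    return sum(point["Sum"] for point in stats["Datapoints"])

cache_reads = daily_token_total("CacheReadInputTokenCount")
cache_writes = daily_token_total("CacheWriteInputTokenCount")
print(f"Cache read tokens: {cache_reads:,.0f}, cache write tokens: {cache_writes:,.0f}")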

Conclusion

This post explored the prompt caching feature in Amazon Bedrock, demonstrating how it works, when to use it, and how to use it effectively. It’s important to carefully evaluate whether your use case will benefit from this feature. It depends on thoughtful prompt structuring, understanding the distinction between static and dynamic content, and selecting appropriate caching strategies for your specific needs. By using CloudWatch metrics to monitor cache performance and following the implementation patterns outlined in this post, you can build more efficient and cost-effective AI applications while maintaining high performance.

For more information about working with prompt caching on Amazon Bedrock, see Prompt caching for faster model inference.


About the authors

Sharon Li is an AI/ML Specialist Solutions Architect at Amazon Web Services (AWS) based in Boston, Massachusetts. With a passion for leveraging cutting-edge technology, Sharon is at the forefront of developing and deploying innovative generative AI solutions on the AWS cloud platform.

Shreyas Subramanian is a Principal Data Scientist and helps customers by using generative AI and deep learning to solve their business challenges using AWS services. Shreyas has a background in large-scale optimization and ML and in the use of ML and reinforcement learning for accelerating optimization tasks.

Satveer Khurpa is a Sr. WW Specialist Solutions Architect, Amazon Bedrock at Amazon Web Services, specializing in Amazon Bedrock security. In this role, he uses his expertise in cloud-based architectures to develop innovative generative AI solutions for clients across diverse industries. Satveer’s deep understanding of generative AI technologies and security principles allows him to design scalable, secure, and responsible applications that unlock new business opportunities and drive tangible value while maintaining robust security postures.

Kosta Belz is a Senior Applied Scientist in the AWS Generative AI Innovation Center, where he helps customers design and build generative AI solutions to solve key business problems.

Sean Eichenberger is a Sr Product Manager at AWS.

Read More

Advanced tracing and evaluation of generative AI agents using LangChain and Amazon SageMaker AI MLFlow

Advanced tracing and evaluation of generative AI agents using LangChain and Amazon SageMaker AI MLFlow

Developing generative AI agents that can tackle real-world tasks is complex, and building production-grade agentic applications requires integrating agents with additional tools such as user interfaces, evaluation frameworks, and continuous improvement mechanisms. Developers often find themselves grappling with unpredictable behaviors, intricate workflows, and a web of complex interactions. The experimentation phase for agents is particularly challenging, often tedious and error prone. Without robust tracking mechanisms, developers face daunting tasks such as identifying bottlenecks, understanding agent reasoning, ensuring seamless coordination across multiple tools, and optimizing performance. These challenges make the process of creating effective and reliable AI agents a formidable undertaking, requiring innovative solutions to streamline development and enhance overall system reliability.

In this context, Amazon SageMaker AI with MLflow offers a powerful solution to streamline generative AI agent experimentation. For this post, I use LangChain’s popular open source LangGraph agent framework to build an agent and show how to enable detailed tracing and evaluation of LangGraph generative AI agents. This post explores how Amazon SageMaker AI with MLflow can help you as a developer and a machine learning (ML) practitioner efficiently experiment, evaluate generative AI agent performance, and optimize your applications for production readiness. I also show you how to introduce advanced evaluation metrics with Retrieval Augmented Generation Assessment (RAGAS) to illustrate how to customize MLflow to track custom and third-party metrics.

The need for advanced tracing and evaluation in generative AI agent development

A crucial functionality for experimentation is the ability to observe, record, and analyze the internal execution path of an agent as it processes a request. This is essential for pinpointing errors, evaluating decision-making processes, and improving overall system reliability. Tracing workflows not only aids in debugging but also ensures that agents perform consistently across diverse scenarios.

Further complexity arises from the open-ended nature of tasks that generative AI agents perform, such as text generation, summarization, or question answering. Unlike traditional software testing, evaluating generative AI agents requires new metrics and methodologies that go beyond basic accuracy or latency measures. You must assess multiple dimensions—such as correctness, toxicity, relevance, coherence, tool call, and groundedness—while also tracing execution paths to identify errors or bottlenecks.

Why SageMaker AI with MLflow?

Amazon SageMaker AI, which provides a fully managed version of the popular open source MLflow, offers a robust platform for machine learning experimentation and generative AI management. This combination is particularly powerful for working with generative AI agents. SageMaker AI with MLflow builds on MLflow’s open source legacy as a tool widely adopted for managing machine learning workflows, including experiment tracking, model registry, deployment, and metrics comparison with visualization. Key advantages for agent development include the following:

  • Scalability: SageMaker AI allows you to easily scale generative AI agentic experiments, running multiple iterations simultaneously.
  • Integrated tracking: MLflow integration enables efficient management of experiment tracking, versioning, and agentic workflow.
  • Visualization: Monitor and visualize the performance of each experiment run with built-in MLflow capabilities.
  • Continuity for ML Teams: Organizations already using MLflow for classic ML can adopt agents without overhauling their MLOps stack, reducing friction for generative AI adoption.
  • AWS ecosystem advantage: Beyond MLflow, SageMaker AI provides a comprehensive ecosystem for generative AI development, including access to foundation models, many managed services, simplified infrastructure, and integrated security.

This evolution positions SageMaker AI with MLflow as a unified platform for both traditional ML and cutting-edge generative AI agent development.

Key features of SageMaker AI with MLflow

The capabilities of SageMaker AI with MLflow directly address the core challenges of agentic experimentation—tracing agent behavior, evaluating agent performance, and unified governance.

  1. Experiment tracking: Compare different runs of the LangGraph agent and track changes in performance across iterations.
  2. Agent versioning: Keep track of different versions of the agent throughout its development lifecycle to iteratively refine and improve agents.
  3. Unified agent governance: Agents registered in SageMaker AI with MLflow automatically appear in the SageMaker AI with MLflow console, enabling a collaborative approach to management, evaluation, and governance across teams.
  4. Scalable infrastructure: Use the managed infrastructure of SageMaker AI to run large-scale experiments without worrying about resource management.

LangGraph generative AI agents

LangGraph offers a powerful and flexible approach to designing generative AI agents tailored to your company’s specific needs. LangGraph’s controllable agent framework is engineered for production use, providing low-level customization options to craft bespoke solutions.

In this post, I show you how to create a simple finance assistant agent equipped with a tool to retrieve financial data from a datastore, as depicted in the following diagram. This post’s sample agent, along with all necessary code, is available on the GitHub repository, ready for you to replicate and adapt it for your own applications.

Solution code

You can follow and execute the full example code from the aws-samples GitHub repository. I use snippets from the code in the repository to illustrate evaluation and tracking approaches in the remainder of this post.

Prerequisites

Trace generative AI agents with SageMaker AI with MLflow

MLflow’s tracing capabilities are essential for understanding the behavior of your LangGraph agent. MLflow tracking is an API and UI for logging parameters, code versions, metrics, and output files when running your machine learning code and for later visualizing the results.

MLflow tracing is a feature that enhances observability in your generative AI agent by capturing detailed information about the execution of the agent services, nodes, and tools. Tracing provides a way to record the inputs, outputs, and metadata associated with each intermediate step of a request, enabling you to easily pinpoint the source of bugs and unexpected behaviors.

The MLflow tracking UI displays the traces exported under the MLflow Traces tab for the selected MLflow experiment, as shown in the following image.

Furthermore, you can see the detailed trace for an agent input or prompt invocation by selecting the Request ID. Choosing Request ID opens a collapsible view with results captured at each step of the invocation workflow from input to the final output, as shown in the following image.

SageMaker AI with MLflow traces all the nodes in the LangGraph agent and displays the trace in the MLflow UI with detailed inputs, outputs, usage tokens, and multi-sequence messages with origin type (human, tool, AI) for each node. The display also captures the execution time over the entire agentic workflow, providing a per-node breakdown of time. Overall, tracing is crucial for generative AI agents for the following reasons:

  • Performance monitoring: Tracing enables you to oversee the agent’s behavior and make sure that it operates effectively, helping identify malfunctions, inaccuracies, or biased outputs.
  • Timeout management: Tracing with timeouts helps prevent agents from getting stuck in long-running operations or infinite loops, helping to ensure better resource management and responsiveness.
  • Debugging and troubleshooting: For complex agents with multiple steps and varying sequences based on user input, tracing helps pinpoint where issues are introduced in the execution process.
  • Explainability: Tracing provides insights into the agent’s decision-making process, helping you to understand the reasoning behind its actions. For example, you can see what tools are called and the processing type—human, tool, or AI.
  • Optimization: Capturing and propagating an AI system’s execution trace enables end-to-end optimization of AI systems, including optimization of heterogeneous parameters such as prompts and metadata.
  • Compliance and security: Tracing helps in maintaining regulatory compliance and secure operations by providing audit logs and real-time monitoring capabilities.
  • Cost tracking: Tracing can help in analyzing resource usage (input tokens, output tokens) and extrapolating the associated costs of running AI agents.
  • Adaptation and learning: Tracing allows for observing how agents interact with prompts and data, providing insights that can be used to improve and adapt the agent’s performance over time.

In the MLflow UI, you can choose the Task name to see details captured at any agent step as it services the input request prompt or invocation, as shown in the following image.

By implementing proper tracing, you can gain deeper insights into your generative AI agents’ behavior, optimize their performance, and make sure that they operate reliably and securely.

Configure tracing for agent

For fine-grained control and flexibility in tracking, you can use MLflow’s tracing decorator APIs. With these APIs, you can add tracing to specific agentic nodes, functions, or code blocks with minimal modifications.

@mlflow.trace(name="assistant", attributes={"workflow": "agent_assistant"}, span_type="graph.py")
def assistant(state: GraphState):
    ...

This configuration allows users to:

  • Pinpoint performance bottlenecks in the LangGraph agent
  • Track decision-making processes
  • Monitor error rates and types
  • Analyze patterns in agent behavior across different scenarios

This approach allows you to specify exactly what you want to track in your experiment. Additionally, MLflow offers out-of-the-box tracing compatibility with LangChain for basic tracing through MLflow’s autologging feature mlflow.langchain.autolog(). With SageMaker AI with MLflow, you can gain deep insights into the LangGraph agent’s performance and behavior, facilitating easier debugging, optimization, and monitoring in both development and production environments.
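
A minimal setup sketch follows, assuming you have the ARN of a SageMaker AI with MLflow tracking server; the ARN and experiment name are placeholders.

import mlflow

# Point the MLflow client at the SageMaker AI with MLflow tracking server (placeholder ARN)
mlflow.set_tracking_uri("arn:aws:sagemaker:<region>:<account-id>:mlflow-tracking-server/<server-name>")
mlflow.set_experiment("langgraph-finance-assistant")

# Automatically trace LangChain and LangGraph components without per-node decorators
mlflow.langchain.autolog()

# From here, invoking the compiled LangGraph agent records traces under the experiment, for example:
# result = graph.invoke({"messages": [("user", "What was the revenue last quarter?")]})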

Evaluate with MLflow

You can use MLflow’s evaluation capabilities to help assess the performance of the LangGraph large language model (LLM) agent and objectively measure its effectiveness in various scenarios. The important aspects of evaluation are:

  • Evaluation metrics: MLflow offers many default metrics such as LLM-as-a-Judge, accuracy, and latency metrics that you can specify for evaluation and have the flexibility to define custom LLM-specific metrics tailored to the agent. For instance, you can introduce custom metrics for Correct Financial Advice, Adherence to Regulatory Guidelines, and Usefulness of Tool Invocations.
  • Evaluation dataset: Prepare a dataset for evaluation that reflects real-world queries and scenarios. The dataset should include example questions, expected answers, and relevant context data.
  • Run evaluation using MLflow evaluate library: MLflow’s mlflow.evaluate() returns comprehensive evaluation results, which can be viewed directly in the code or through the SageMaker AI with MLflow UI for a more visual representation.

The following is a snippet showing how mlflow.evaluate() can be used to run an evaluation on agents. You can follow this example by running the code in the same aws-samples GitHub repository.

results = mlflow.evaluate(
            agent_responses,  # Agent-generated answers to test queries
            targets="ground_truth",    # Reference "correct" answers for comparison
            model_type="question-answering",  # Predefined metrics for QA tasks
            extra_metrics=metrics   # Evaluation Metrics to include
        )

This code snippet employs MLflow’s evaluate() function to rigorously assess the performance of a LangGraph LLM agent, comparing its responses to a predefined ground truth dataset that’s maintained in the golden_questions_answer.jsonl file in the aws-samples GitHub repository. By specifying “model_type”:”question-answering”, MLflow applies relevant evaluation metrics for question-answering tasks, such as accuracy and coherence. Additionally, the extra_metrics parameter allows you to incorporate custom, domain-specific metrics tailored to the agent’s specific application, enabling a comprehensive and nuanced evaluation beyond standard benchmarks. The results of this evaluation are then logged in MLflow (as shown in the following image), providing a centralized and traceable record of the agent’s performance, facilitating iterative improvement and informed deployment decisions. The MLflow evaluation is captured as part of the MLflow execution run.

You can open the SageMaker AI with MLflow tracking server and see the list of MLflow execution runs for the specified MLflow experiment, as shown in the following image.

The evaluation metrics are captured within the MLflow execution along with model metrics and the accompanying artifacts, as shown in the following image.

Furthermore, the evaluation metrics are also displayed under the Model metrics tab within a selected MLflow execution run, as shown in the following image.

Finally, as shown in the following image, you can compare different variations and versions of the agent during the development phase by selecting the compare checkbox option in the MLflow UI between selected MLflow execution experimentation runs. This can help compare and select the best functioning agent version for deployment or with other decision making processes for agent development.

Register the LangGraph agent

You can use SageMaker AI with MLflow artifacts to register the LangGraph agent along with any other item as required or that you’ve produced. All the artifacts are stored in the SageMaker AI with MLflow tracking server’s configured Amazon Simple Storage Service (Amazon S3) bucket. Registering the LangGraph agent is crucial for governance and lifecycle management. It provides a centralized repository for tracking, versioning, and deploying the agents. Think of it as a catalog of your validated AI assets.

As shown in the following figure, you can see the artifacts captured under the Artifact tab within the MLflow execution run.

MLflow automatically captures and logs agent-related information files such as the evaluation results and the consumed libraries in the requirements.txt file. Furthermore, a successfully logged LangGraph agent as a MLflow model can be loaded and used for inference using mlflow.langchain.load_model(model_uri). Registering the generative AI agent after rigorous evaluation helps ensure that you’re promoting a proven and validated agent to production. This practice helps prevent the deployment of poorly performing or unreliable agents, helping to safeguard the user experience and the integrity of your applications. Post-evaluation registration is critical to make sure that the experiment with the best result is the one that gets promoted to production.
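
For illustration, loading a previously logged agent back for inference might look like the following sketch; the model URI is a placeholder and assumes the agent was logged with the mlflow.langchain flavor.

import mlflow

# Placeholder URI pointing at a logged or registered LangGraph agent artifact
model_uri = "runs:/<run-id>/langgraph_agent"

# Load the agent with the langchain flavor and invoke it like any LangChain runnable
agent = mlflow.langchain.load_model(model_uri)
response = agent.invoke({"messages": [("user", "Summarize the latest quarterly results.")]})
print(response)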

Use MLflow to experiment and evaluate with external libraries (such as RAGAS)

MLflow’s flexibility allows for seamless integration with external libraries, enhancing your ability to experiment and evaluate LangChain LangGraph agents. You can extend SageMaker MLflow to include external evaluation libraries such as  RAGAS for comprehensive LangGraph agent assessment. This integration enables ML practitioners to use RAGAS’s specialized LLM evaluation metrics while benefiting from MLflow’s experiment tracking and visualization capabilities. By logging RAGAS metrics directly to SageMaker AI with MLflow, you can easily compare different versions of the LangGraph agent across multiple runs, gaining deeper insights into its performance.

RAGAS is an open source library that provides tools specifically for evaluating LLM applications and generative AI agents. RAGAS includes a method, ragas.evaluate(), to run evaluations for LLM agents with a choice of LLM models (evaluators) for scoring the evaluation and an extensive list of default metrics. To incorporate RAGAS metrics into your MLflow experiments, you can use the following approach.

You can follow this example by running the notebook in the GitHub repository additional_evaluations_with_ragas.ipynb.

from ragas import EvaluationDataset
from ragas import evaluate
from ragas.llms import LangchainLLMWrapper
evaluation_dataset = EvaluationDataset.from_list(ragas_dataset)
evaluator_llm = LangchainLLMWrapper(llm_for_evaluation)
result = evaluate(
    dataset=evaluation_dataset,
    metrics=metrics_final,
    llm=evaluator_llm,
    embeddings=bedrock_embeddings,
    )
result

The evaluation results using RAGAS metrics from the above code are shown in the following figure.

Subsequently, the computed RAGAS evaluation metrics can be exported and tracked in the SageMaker AI with MLflow tracking server as part of the MLflow experimentation run. See the following code snippet for illustration; the full code can be found in the notebook in the same aws-samples GitHub repository.

with mlflow.start_run(
    experiment_id=get_experiment_id(_MLFLOW_RAGAS_EXPERIMENT_NAME), 
    run_name=timestamp, 
    tags={
        "project": os.getenv('PROJECT'),
        "model": os.getenv('MODELID'),
        "version": os.getenv('VERSION')
    }
):
    # Log the dataset to MLflow
    mlflow.log_input(dataset, context="ragas_eval_results")

    for ragas_metric in [faithfulness, answer_relevancy, answer_correctness]:
        print(ragas_metric.name)
        mean = ragas_result_ds[ragas_metric.name].mean()
        p90 = ragas_result_ds[ragas_metric.name].quantile(0.9)
        variance = ragas_result_ds[ragas_metric.name].var()
        print(mean, p90, variance)
        mlflow.log_metric(f"ragas_{ragas_metric.name}_score/v1/mean", mean)
        mlflow.log_metric(f"ragas_{ragas_metric.name}_score/v1/p90", p90)
        mlflow.log_metric(f"ragas_{ragas_metric.name}_score/v1/variance", variance)
mlflow.end_run()

You can view the RAGAS metrics logged by MLflow in the SageMaker AI with MLflow UI on the Model metrics tab, as shown in the following image.

From experimentation to production: Collaborative approval with SageMaker with MLflow tracing and evaluation

In a real-world deployment scenario, MLflow’s tracing and evaluation capabilities with LangGraph agents can significantly streamline the process of moving from experimentation to production.

Imagine a large team of data scientists and ML engineers working on an agentic platform, as shown in the following image. With MLflow, they can create sophisticated agents that can handle complex queries, process returns, and provide product recommendations. During the experimentation phase, the team can use MLflow to log different versions of the agent, tracking performance and evaluation metrics such as response accuracy and latency. MLflow’s tracing feature allows them to analyze the agent’s decision-making process, identifying areas for improvement. The results across numerous experiments are automatically logged to SageMaker AI with MLflow. The team can use the MLflow UI to collaborate, compare, and select the best-performing version of the agent and decide on a production-ready version, all informed by the diverse set of data logged in SageMaker AI with MLflow.

With this data, the team can present a clear, data-driven case to stakeholders for promoting the agent to production. Managers and compliance officers can review the agent’s performance history, examine specific interaction traces, and verify that the agent meets all necessary criteria. After being approved, the SageMaker AI with MLflow registered agent facilitates a smooth transition to deployment, helping to ensure that the exact version of the agent that passed evaluation is the one that goes live. This collaborative, traceable approach not only accelerates the development cycle but also instills confidence in the reliability and effectiveness of the generative AI agent in production.

Clean up

To avoid incurring unnecessary charges, use the following steps to clean up the resources used in this post:

  1. Remove SageMaker AI with MLflow tracking server:
    • In SageMaker Studio, stop and delete any running MLflow tracking server instances
  2. Revoke Amazon Bedrock model access:
    • Go to the Amazon Bedrock console.
    • Navigate to Model access and remove access to any models you enabled for this project.
  3. Delete the SageMaker domain (If not needed):
    • Open the SageMaker console.
    • Navigate to the Domains section.
    • Select the domain you created for this project.
    • Choose Delete domain and confirm the action.
    • Also delete any associated S3 buckets and IAM roles.

Conclusion

In this post, I showed you how to combine LangChain’s LangGraph, Amazon SageMaker AI, and MLflow to demonstrate a powerful workflow for developing, evaluating, and deploying sophisticated generative AI agents. This integration provides the tools needed to gain deep insights into the generative AI agent’s performance, iterate quickly, and maintain version control throughout the development process.

As the field of AI continues to advance, tools like these will be essential for managing the increasing complexity of generative AI agents and ensuring their effectiveness. Keep the following considerations in mind:

  1. Traceability is paramount: Effective tracing of agent execution paths using SageMaker MLflow is crucial for debugging, optimization, and helping to ensure consistent performance in complex generative AI workflows. Pinpoint issues, understand decision-making, examine interaction traces, and improve overall system reliability through detailed, recorded analysis of agent processes.
  2. Evaluation drives improvement: Standardized and customized evaluation metrics, using MLflow’s evaluate() function and integrations with external libraries like RAGAS, provide quantifiable insights into agent performance, guiding iterative refinement and informed deployment decisions.
  3. Collaboration and governance are essential: Unified governance facilitated by SageMaker AI with MLflow enables seamless collaboration across teams, from data scientists to compliance officers, helping to ensure responsible and reliable deployment of generative AI agents in production environments.

By embracing these principles and using the tools outlined in this post, developers and ML practitioners can confidently navigate the complexities of generative AI agent development and deployment, building robust and reliable applications that deliver real business value. Now, it’s your turn to unlock the potential of advanced tracing, evaluation, and collaboration in your agentic workflows! Dive into the aws-samples GitHub repository and start using the power of LangChain’s LangGraph, Amazon SageMaker AI, and MLflow for your generative AI projects.


About the Author

Sandeep Raveesh is a Generative AI Specialist Solutions Architect at AWS. He works with customers through their AIOps journey across model training, Retrieval Augmented Generation (RAG), generative AI agents, and scaling generative AI use-cases. He also focuses on go-to-market strategies helping AWS build and align products to solve industry challenges in the generative AI space. You can find Sandeep on LinkedIn.

Read More

Prompting for the best price-performance

Prompting for the best price-performance

In the drive to remain competitive, businesses today are turning to AI to help them minimize cost and maximize efficiency. It’s incumbent on them to find the most suitable AI model—the one that will help them achieve more while spending less. For many businesses, the migration from OpenAI’s model family to Amazon Nova represents not only a shift in model but a strategic move toward scalability, efficiency, and broader multimodal capabilities.

In this blog, we discuss how to optimize prompting in Amazon Nova for the best price-performance.

Why migrate from OpenAI to Amazon Nova?

OpenAI’s models remain powerful, but their operational costs can be prohibitive when scaled. Consider these figures from Artificial Analysis:

| Model | Input token cost (per million tokens) | Output token cost (per million tokens) | Context window | Output speed (tokens per second) | Latency (seconds to first token) |
| --- | --- | --- | --- | --- | --- |
| GPT-4o | ~$2.50 | ~$10.00 | Up to 128K tokens | ~63 | ~0.49 |
| GPT-4o Mini | ~$0.15 | ~$0.60 | Up to 128K tokens | ~90 | ~0.43 |
| Nova Micro | ~$0.035 | ~$0.14 | Up to 128K tokens | ~195 | ~0.29 |
| Nova Lite | ~$0.06 | ~$0.24 | Up to 300K tokens | ~146 | ~0.29 |
| Nova Pro | ~$0.80 | ~$3.20 | Up to 300K tokens | ~90 | ~0.34 |

For high-volume applications—like global customer support or large-scale document analysis—these cost differences are disruptive. Not only does Amazon Nova Pro offer over three times the cost-efficiency, its longer context window also enables it to handle more extensive and complex inputs.

Breaking down the Amazon Nova suite

Amazon Nova isn’t a single model—it’s a suite designed for various needs:

  • Amazon Nova Pro – A robust multimodal model that can process text, images, and video. It excels at tasks such as document analysis and deep data visualization. Benchmark comparisons show Amazon Nova Pro matching or even surpassing GPT-4o on complex reasoning tasks, according to section 2.1.1 of the Nova technical report and model card.
  • Amazon Nova Lite – Offers a balanced mix of multimodal processing and speed. Amazon Nova Lite is ideal for applications such as document summarization, translation, and even basic visual search, delivering quality outputs at lower latency and cost compared to GPT-4o Mini. You can find these benchmark results in section 2.1.2 of the Nova Technical Report and Model Card.
  • Amazon Nova Micro – A text-only model engineered for ultra-low latency. With output speed of up to 195 tokens per second, Amazon Nova Micro is perfect for real-time applications such as chat-based assistants and automated FAQs. Its token costs are dramatically lower than those of GPT-4o Mini—roughly 4.3 times cheaper on a per-token basis.

The lower per-token costs and higher output per second of Amazon Nova give you the flexibility to simplify prompts for real-time applications so you can balance quality, speed, and cost for your use case.

Understanding the foundations

To make the best decision about which model family fits your needs, it’s important to understand the differences in prompt engineering best practices for both OpenAI and Amazon Nova. Each model family has its own strengths, but some practices apply to both: high-quality, accurate output comes from clear instructions, structured prompts, and iterative refinement. Whether you’re using strong output directives or clearly defined use cases, the goal is to reduce ambiguity and improve response quality.

The OpenAI approach

OpenAI uses a layered messaging system for prompt engineering, where system, developer, and user prompts work in harmony to control tone, safety, and output format. Their approach emphasizes the following (see the sketch after this list):

  • Hierarchical message roles – System messages set the model’s role and behavior so that the overarching safety and style guidelines are preserved throughout the conversation
  • Instruction placement and delimiters – Directives are placed at the beginning, with clear separation between context, examples, and queries
  • Selective chain-of-thought – Detailed, step-by-step reasoning is used when it benefits complex tasks
  • Formatting and structure – Using strong directives such as DO, MUST, and DO NOT to provide consistent outputs (for example, in JSON)
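The following minimal sketch illustrates these conventions with the OpenAI Python SDK; the prompt text and delimiters are illustrative placeholders rather than an excerpt from a production prompt:

from openai import OpenAI

client = OpenAI()

# The system message sets the role, style, and output directives;
# the user message places the instruction first and uses delimiters
# to separate context from the query.
messages = [
    {
        "role": "system",
        "content": "You are a concise support assistant. You MUST respond in valid JSON.",
    },
    {
        "role": "user",
        "content": (
            "Summarize the customer issue below in one sentence.\n\n"
            "### Context ###\n"
            "The customer reports that their order arrived two days late.\n\n"
            "### Question ###\n"
            "What is the core issue?"
        ),
    },
]

completion = client.chat.completions.create(model="gpt-4o", messages=messages)
print(completion.choices[0].message.content)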

The Amazon Nova approach

  • Define the prompt use case
    • Task – What exactly the model should do
    • Role – Which role the model should assume
    • Response style – The structure or tone of the output
    • Instructions – Guidelines the model must follow
  • Chain-of-thought
    • Explicitly state the response constraints – Provide clear and strong instructions to constrain the model’s response
    • Structured thinking – Asking the model to think step by step encourages structured reasoning
  • Formatting and structure
    • Use delimiters to section your prompts, for example, ##Task##, ##Context##, or ##Example##
    • Specify the output format, for example, JSON, YAML, or Markdown
    • Use strong instructions and caps, such as DO, DO NOT, or MUST
    • Prefill the response to guide the model, for example, start with "{" or "```json" (see the sketch after this list)
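The following sketch shows how these Amazon Nova conventions could look in practice with the Amazon Bedrock Converse API; the prompt text is illustrative, and the prefilled assistant turn nudges the model to continue in JSON:

import boto3

client = boto3.client("bedrock-runtime", region_name="us-east-1")

system = [{
    "text": (
        "You are a product analyst.\n"
        "##Task##\nExtract the product name and price from the text.\n"
        "##Response style##\nRespond in JSON only. DO NOT add commentary."
    )
}]

messages = [
    {"role": "user", "content": [{"text": "##Context##\nThe AcmePhone 12 is on sale for $499."}]},
    # Prefilling the assistant turn with "{" guides the model to answer in JSON
    {"role": "assistant", "content": [{"text": "{"}]},
]

response = client.converse(
    modelId="us.amazon.nova-lite-v1:0",
    system=system,
    messages=messages,
    inferenceConfig={"temperature": 0.2, "topP": 0.9},
)
print(response["output"]["message"]["content"][0]["text"])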

Evolving your prompt strategy: Migrating from OpenAI to Amazon Nova

Transitioning to Amazon Nova isn’t merely a change in API endpoints—it requires retooling your prompt engineering to align with the strengths of Amazon Nova. Start by reframing your use case definition: break down your current GPT-4o or GPT-4o Mini prompt into its core elements of task, role, response style, instructions, and success criteria. Structure these elements clearly to provide a blueprint for the model.

To understand how to migrate an existing OpenAI prompt to work optimally for Amazon Nova Pro, consider the following example using the meeting notes summarizer. Here is the GPT-4o system prompt:

You will be provided with meeting notes, and your task is to summarize the meeting 
as follows: 
-Overall summary of discussion 
-Action items (what needs to be done and who is doing it) 
-If applicable, a list of topics that need to be discussed more fully in the next meeting.

The user prompt is the meeting notes that need to be summarized:

Meeting Date: March 5th, 2050
Meeting Time: 2:00 PM
Location: Conference Room 3B, Intergalactic Headquarters...

GPT produces this helpful response:

Overall Summary of Discussion: The team discussed the recent mission to Planet Zog, 
the space pirate issue in Sector 7, the annual Intergalactic Bake-Off, planning for 
the upcoming charity fundraiser, and a proposed team-building retreat. The mission to 
Planet Zog was successful, but communication with the Zogians was challenging. The team 
also discussed the need for a better strategy to deal with space pirates. The team placed 
second in the Intergalactic Bake-Off and is planning for the upcoming charity fundraiser. 
A team-building retreat to the Moon Resort and Spa was proposed.

Action Items:
1. Dr. Quasar to start working on a Zogian-English dictionary.
2. Captain Stardust to speak with Admiral Starbeam about increasing patrols in Sector 7.
3. Dr. Quasar to finalize the prototype of the new cloaking technology.
4. All team members to start gathering supplies and preparing games for the Intergalactic 
Charity Bazaar.
5. Captain Stardust to check the budget for the proposed team-building retreat to the Moon 
Resort and Spa.

Topics for Next Meeting:
1. Update on the Zogian-English dictionary.
2. Progress report on the cloaking technology.
3. Results of increased patrols in Sector 7.
4. Final preparations for the Intergalactic Charity Bazaar.

To meet or exceed the quality of the response from GPT-4o, here is what an Amazon Nova Pro prompt might look like. The prompt uses the same best practices discussed in this post, starting with the system prompt. We used a temperature of 0.2 and a topP of 0.9 here:

You are an experienced executive assistant skilled in meeting note analysis and 
summarization. Your primary responsibilities include distilling complex discussions
into clear, actionable summaries.
Follow these instructions:

##INSTRUCTIONS##
1. Read and understand the meeting notes found in ##NOTES##
2. Put all of your outputs in a section called ##OUTPUTS## in markdown formatting
3. Summarize the meeting notes in 5 sentences or less. Put this in a section called 
"Overall Summary".
4. Numerically list any action items for specific people and what needs to be completed. 
Put this list in a section called "Action Items".
5. If applicable, list the topics that need to be discussed more fully in the next meeting. 
Put this in a section called "Topics for Next Meeting".

Here’s the user prompt, using prefilled responses:

##NOTES##
Meeting Date: March 5th, 2050
Meeting Time: 2:00 PM
Location: Conference Room 3B, Intergalactic Headquarters
Attendees:
- Captain Stardust
- Dr. Quasar
- Lady Nebula
- Sir Supernova
- Ms. Comet
Meeting called to order by Captain Stardust at 2:05 PM
1. Introductions and welcome to our newest team member, Ms. Comet
2. Discussion of our recent mission to Planet Zog
- Captain Stardust: "Overall, a success, but communication with the Zogians was difficult. 
We need to improve our language skills."
- Dr. Quasar: "Agreed. I'll start working on a Zogian-English dictionary right away."
- Lady Nebula: "The Zogian food was out of this world, literally! We should consider having 
a Zogian food night on the ship."
3. Addressing the space pirate issue in Sector 7
- Sir Supernova: "We need a better strategy for dealing with these pirates. They've already 
plundered three cargo ships this month."
- Captain Stardust: "I'll speak with Admiral Starbeam about increasing patrols in that area."
- Dr. Quasar: "I've been working on a new cloaking technology that could help our ships avoid 
detection by the pirates. I'll need a few more weeks to finalize the prototype."
4. Review of the annual Intergalactic Bake-Off
- Lady Nebula: "I'm happy to report that our team placed second in the competition! Our Martian Mud 
Pie was a big hit!"
- Ms. Comet: "Let's aim for first place next year. I have a secret recipe for Jupiter Jello that I 
think could be a winner."
5. Planning for the upcoming charity fundraiser
- Captain Stardust: "We need some creative ideas for our booth at the Intergalactic Charity Bazaar."
- Sir Supernova: "How about a 'Dunk the Alien' game? We can have people throw water balloons at a 
volunteer dressed as an alien."
- Dr. Quasar: "I can set up a 'Name That Star' trivia game with prizes for the winners."
- Lady Nebula: "Great ideas, everyone. Let's start gathering the supplies and preparing the games."
6. Upcoming team-building retreat
- Ms. Comet: "I would like to propose a team-building retreat to the Moon Resort and Spa. It's a great 
opportunity to bond and relax after our recent missions."
- Captain Stardust: "Sounds like a fantastic idea. I'll check the budget and see if we can make it happen."
7. Next meeting agenda items
- Update on the Zogian-English dictionary (Dr. Quasar)
- Progress report on the cloaking technology (Dr. Quasar)
- Results of increased patrols in Sector 7 (Captain Stardust)
- Final preparations for the Intergalactic Charity Bazaar (All)
Meeting adjourned at 3:15 PM. Next meeting scheduled for March 19th, 2050 at 2:00 PM in Conference Room 3B, 
Intergalactic Headquarters.

##OUTPUTS##

The following example shows that the Amazon Nova response meets and exceeds the accuracy of the OpenAI example, formats the output in Markdown, and has found clear owners for each action item:

### Overall Summary
The meeting covered the successful mission to Planet Zog, the ongoing issue with 
space pirates in Sector 7, the results of the annual Intergalactic Bake-Off, plans 
for the upcoming charity fundraiser, and a proposed team-building retreat. Action items 
were assigned, and the next meeting's agenda was set.

### Action Items
1. **Dr. Quasar**: Begin working on a Zogian-English dictionary.
2. **Captain Stardust**: Speak with Admiral Starbeam about increasing patrols in Sector 7.
3. **Dr. Quasar**: Finalize the prototype for the new cloaking technology within a few weeks.
4. **Team**: Start gathering supplies and preparing games for the Intergalactic Charity Bazaar.
5. **Captain Stardust**: Check the budget for the proposed team-building retreat to the Moon Resort and Spa.

### Topics for Next Meeting
1. Update on the Zogian-English dictionary (Dr. Quasar)
2. Progress report on the cloaking technology (Dr. Quasar)
3. Results of increased patrols in Sector 7 (Captain Stardust)
4. Final preparations for the Intergalactic Charity Bazaar (All)

With a few updates to the prompt, Amazon Nova Pro can deliver comparable or better results at a much lower inference cost.
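For reference, the migrated prompt could be sent to Amazon Nova Pro through the Bedrock Converse API roughly as follows; the model ID and placeholder variables are illustrative assumptions:

import boto3

bedrock_runtime = boto3.client("bedrock-runtime", region_name="us-east-1")

system_prompt = "..."   # the executive assistant system prompt shown above
meeting_notes = "..."   # the ##NOTES## user prompt shown above

response = bedrock_runtime.converse(
    modelId="us.amazon.nova-pro-v1:0",  # illustrative Nova Pro model ID
    system=[{"text": system_prompt}],
    messages=[{"role": "user", "content": [{"text": meeting_notes}]}],
    inferenceConfig={"temperature": 0.2, "topP": 0.9},
)
print(response["output"]["message"]["content"][0]["text"])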

Employ Amazon Nova extended context

Amazon Nova Lite and Amazon Nova Pro can support up to 300,000 input tokens, which means that you can include more context in your prompt if needed. Expand your background data and detailed instructions accordingly—if your original OpenAI prompt was optimized for 128,000 tokens, adjust it to use the Amazon Nova extended window.

Tailor output constraints

If your GPT prompt required strict formatting (for example, “Respond in JSON only”), make sure that your Amazon Nova prompt includes these directives. Additionally, if your task involves multimodal inputs, specify when to include images or video references.

Function calling

The rise of generative AI agents has made function calling, or tool calling, one of the most important abilities of a given large language model (LLM). A model’s ability to correctly pick the right tool for the job, in a low-latency manner, is often the difference between success and failure of an agentic system.

Both OpenAI and Amazon Nova models share similarities in function calling, in particular their support for structured API calls. Both model families support tool selection through defined tool schemas, which we discuss later in this post. They also both provide a mechanism to decide when to invoke these tools or not.

OpenAI’s function calling uses flexible JSON schemas to define and structure API interactions. The models support a wide range of schema configurations, which give developers the ability to quickly implement external function calls through straightforward JSON definitions tied to their API endpoints.

Here is an example of a function:

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get current temperature for a given location.",
        "parameters": {
            "type": "object",
            "properties": {
                "location": {
                    "type": "string",
                    "description": "City and country e.g. Montevideo, Uruguay"
                }
            },
            "required": ["location"],
            "additionalProperties": False
        },
        "strict": True
    }
}]

from openai import OpenAI

client = OpenAI()

completion = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "What is the weather like in Punta del Este today?"}],
    tools=tools
)
Similar to OpenAI’s approach, Amazon Nova can call tools when passed a configuration schema, as shown in the following code example. Amazon Nova relies heavily on greedy decoding when calling tools, and it’s advised to set temperature, topP, and topK to 1 so that the model has the highest accuracy in tool selection. These greedy decoding parameters and other tool use examples are covered in detail in Tool use (function calling) with Amazon Nova.

The following is an example of function calling without using additionalModelRequestFields:

tool_config = {
    "tools": [{
        "toolSpec": {
            "name": "get_recipe",
            "description": "Structured recipe generation system",
            "inputSchema": {
                "json": {
                    "type": "object",
                    "properties": {
                        "recipe": {
                            "type": "object",
                            "properties": {
                                "name": {"type": "string"},
                                "ingredients": {
                                    "type": "array",
                                    "items": {
                                        "type": "object",
                                        "properties": {
                                            "item": {"type": "string"},
                                            "amount": {"type": "number"},
                                            "unit": {"type": "string"}
                                        }
                                    }
                                },
                                "instructions": {
                                    "type": "array",
                                    "items": {"type": "string"}
                                }
                            },
                            "required": ["name", "ingredients", "instructions"]
                        }
                    }
                }
            }
        }
    }]
}

# Base configuration without topK=1
input_text = "I need a recipe for chocolate lava cake"
messages = [{
    "role": "user",
    "content": [{"text": input_text}]
}]

# Inference parameters
inf_params = {"topP": 1, "temperature": 1}

response = client.converse(
    modelId="us.amazon.nova-lite-v1:0",
    messages=messages,
    toolConfig=tool_config,
    inferenceConfig=inf_params
)
# Typically produces less structured or incomplete output
 

The following example shows how function calling accuracy can be improved by using additionalModelRequestFields:

# Enhanced configuration with topK=1
response = client.converse(
    modelId="us.amazon.nova-lite-v1:0",
    messages=messages,
    toolConfig=tool_config,
    inferenceConfig=inf_params,
    additionalModelRequestFields={"inferenceConfig": {"topK": 1}}
)
# Produces more accurate and structured function call

To maximize Amazon Nova function calling potential and improve accuracy, always use additionalModelRequestFields with topK=1. This forces the model to select the single most probable token and prevents random token selection, which increases deterministic output generation and improves function call precision by about 30–40%.

The following code examples further explain how to conduct tool calling successfully. The first scenario shows recipe generation without an explicit tool. The example doesn’t use topK, which typically results in responses that are less structured:

input_text = """
I'm looking for a decadent chocolate dessert that's quick to prepare. 
Something that looks fancy but isn't complicated to make.
"""

messages = [{
    "role": "user",
    "content": [{"text": input_text}]
}]

response = client.converse(
    modelId="us.amazon.nova-lite-v1:0",
    messages=messages,
    inferenceConfig={"topP": 1, "temperature": 1}
)
# Generates a conversational recipe description
# Less structured, more narrative-driven response

In this example, the scenario shows recipe generation with a structured tool. We add topK set to 1, which produces a more structured output:

response = client.converse(
    modelId="us.amazon.nova-lite-v1:0",
    messages=messages,
    toolConfig=tool_config,
    inferenceConfig={"topP": 1, "temperature": 1},
    additionalModelRequestFields={"inferenceConfig": {"topK": 1}}
)
# Generates a highly structured, JSON-compliant recipe
# Includes precise ingredient measurements
# Provides step-by-step instructions
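When Amazon Nova selects the tool, the structured arguments come back as a toolUse block in the Converse response. A minimal sketch of extracting them:

# Pull the first toolUse block out of the Converse response
for block in response["output"]["message"]["content"]:
    if "toolUse" in block:
        tool_use = block["toolUse"]
        print(tool_use["name"])   # "get_recipe"
        print(tool_use["input"])  # structured arguments matching the inputSchema
        break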

Overall, OpenAI offers more flexible, broader schema support. Amazon Nova provides more precise, controlled output generation and is the best choice when working with high-stakes, structured data scenarios, as demonstrated in Amazon Nova’s performance on the IFEval benchmark discussed in section 2.1.1 of the technical report and model card. We recommend using Amazon Nova for applications requiring predictable, structured responses because its function calling methodology provides superior control and accuracy.

Conclusion

The evolution from OpenAI’s models to Amazon Nova represents a significant shift in how businesses use AI: a transition toward models that deliver similar or superior performance at a fraction of the cost, with expanded capabilities in multimodal processing and extended context handling.

Whether you’re using the robust, enterprise-ready Amazon Nova Pro, the agile and economical Amazon Nova Lite, or the versatile Amazon Nova Micro, the benefits are clear:

  • Cost savings – With token costs up to four times lower, businesses can scale applications more economically
  • Enhanced response performance – Faster response times (up to 195 tokens per second) make real-time applications more viable
  • Expanded capabilities – A larger context window and multimodal support unlock new applications, from detailed document analysis to integrated visual content

By evolving your prompt strategy—redefining use cases, exploiting the extended context, and iteratively refining instructions—you can smoothly migrate your existing workflows from OpenAI’s GPT-4o and GPT-4o Mini models to the Amazon Nova family.


About the Authors

Claudio Mazzoni is a Sr Specialist Solutions Architect on the Amazon Bedrock GTM team. Claudio excels at guiding customers through their generative AI journey. Outside of work, Claudio enjoys spending time with family, working in his garden, and cooking Uruguayan food.

Pat Reilly is a Sr. Specialist Solutions Architect on the Amazon Bedrock Go-to-Market team. Pat has spent the last 15 years in analytics and machine learning as a consultant. When he’s not building on AWS, you can find him fumbling around with wood projects.

Read More

Evaluate models or RAG systems using Amazon Bedrock Evaluations – Now generally available


Organizations deploying generative AI applications need robust ways to evaluate their performance and reliability. When we launched LLM-as-a-judge (LLMaaJ) and Retrieval Augmented Generation (RAG) evaluation capabilities in public preview at AWS re:Invent 2024, customers used them to assess their foundation models (FMs) and generative AI applications, but asked for more flexibility beyond Amazon Bedrock models and knowledge bases. Today, we’re excited to announce the general availability of these evaluation features in Amazon Bedrock Evaluations, along with significant enhancements that make them fully environment-agnostic.

The general availability release introduces “bring your own inference responses” capabilities for both RAG evaluation and model evaluation. This means you can now evaluate a RAG system or model—whether running on Amazon Bedrock, other cloud providers, or on premises—as long as you provide evaluation data in the required format. You can even evaluate the final responses of a fully built application. We’ve also added new citation metrics for the already-powerful RAG evaluation suite, including citation precision and citation coverage, to help you better assess how accurately your RAG system uses retrieved information. These metrics provide deeper insights into whether responses are properly grounded in the source documents and if they’re using the most relevant retrieved passages.

In this post, we explore these new features in detail, showing you how to evaluate both RAG systems and models with practical examples. We demonstrate how to use the comparison capabilities to benchmark different implementations and make data-driven decisions about your AI deployments. Whether you’re using Amazon Bedrock or other environments for your generative AI applications, these evaluation tools can help you optimize performance and promote quality across your entire generative AI portfolio.

What’s new: RAG evaluation

In this section, we discuss what’s new with RAG evaluation, including the introduction of Bring Your Own Inference responses for evaluating external RAG systems and new citation metrics that measure precision and coverage of RAG citations.

Bring Your Own Inference responses

The RAG evaluation capabilities have been significantly expanded to allow evaluation of outputs from RAG systems hosted anywhere through the evaluation environment in Amazon Bedrock. With Bring Your Own Inference (BYOI) responses, you can now evaluate retrieval and generation results from a variety of sources, including other FM providers, custom-built RAG systems, or deployed open-weights solutions, by providing the outputs in the required format. Additionally, new citation metrics, alongside our previously released quality and responsible AI metrics, provide deeper insights into how well RAG systems use their knowledge bases and source documents.

When evaluating retrieval results from a knowledge base, we allow one knowledge base to be evaluated per evaluation job. For every dataset entry, you can provide a list of conversation turns. Each conversation turn currently has a prompt and referenceResponses. When using BYOI, you must provide your retrieval results in a new field called output (this field is required for BYOI jobs but optional and not needed for non-BYOI jobs). You must provide a knowledgeBaseIdentifier for every output. Additionally, for every retrieval result you bring, you can provide a name and additional metadata in the form of key-value pairs. The following is the input JSONL format for RAG evaluation jobs (Retrieve). Fields marked with ? are optional.

In the public preview period that ended March 20, 2025, the following dataset format contained a key called referenceContexts. This field has now been changed to referenceResponses to align with the Retrieve and Generate evaluation. The content of referenceResponses should be the expected ground truth answer that an end-to-end RAG system would have generated given the prompt, not the expected passages or chunks retrieved from the knowledge base.

{
  "conversationTurns": [
    {
      "prompt": {
        "content": [
          {
            "text": string
          }
        ]
      },
      "referenceResponses"?: [
        {
          "content": [
            {
              "text": string
            }
          ]
        }
      ],
       "output"?: {
        "knowledgeBaseIdentifier": string
        "retrievedResults": {
          "retrievalResults": [
            {
              "name"?: string
              "content": {
                "text": string
              },
              "metadata"?: {
                [key: string]: string
              }
            }
          ]
        }
      }
    }
  ]
}

When evaluating retrieval and generation results from a knowledge base or RAG system, we allow one knowledge base (RAG system) to be evaluated per evaluation job. For every dataset entry, you can provide a list of conversation turns. Each conversation turn has a prompt and referenceResponses. When using BYOI, you must provide the generated text, retrieved passages (for example, retrieval results), and citations for the generated text in a new field called output (this field is required for BYOI jobs but optional and not needed for non-BYOI jobs). The citations field will be used by the citation metrics. If you don’t have citations, you can enter dummy data in those fields and make sure to not select the citation precision or citation coverage metrics for the evaluation job. You must also provide a knowledgeBaseIdentifier for every output you bring. This knowledgeBaseIdentifier should be the same for all lines in the dataset and also at the job level. The following is the input JSONL format for RAG evaluation jobs (Retrieve and Generate). Fields marked with ? are optional.

{
  "conversationTurns": [
    {
      "prompt": {
        "content": [
          {
            "text": string
          }
        ]
      },
      "referenceResponses"?: [
        {
          "content": [
            {
              "text": string
            }
          ]
        }
      ],
      "output"?: {
        "text": string
        "modelIdentifier"?: string
        "knowledgeBaseIdentifier": string
        "retrievedPassages": {
          "retrievalResults": [
            {
              "name"?: string
              "content": {
                "text": string
              },
              "metadata"?: {
                [key: string]: string
              }
            }
          ]
        },
        "citations": [
          {
            "generatedResponsePart": {
              "textResponsePart": {
                "span": {
                  "start": number
                  "end": number
                  
                },
                "text": string
              }
            },
            "retrievedReferences": [
              {
                "name"?: string
                "content": {
                  "text": string
                },
                "metadata"?: {
                  [key: string]: string
                }
              }
            ]
          }
        ]
      }
    }
  ]
}

Citation metrics

Citation precision measures how accurately a RAG system cites its sources by evaluating whether the cited passages actually contain information used in the response. This metric helps identify when responses include unnecessary or irrelevant citations—a common issue where models might cite passages that weren’t actually used to generate the response content. The metric produces a score between 0–1, where 1 indicates perfect citation precision (all cited passages were relevant and used in the response), and 0 indicates complete imprecision (none of the cited passages should have been used). This helps teams identify and fix cases where their RAG systems are citing passages indiscriminately rather than selectively citing only the passages that contributed to the response.
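As a simple illustration, if a response cites four passages but only three of them contain information that was actually used in the answer, the citation precision for that response would be roughly 0.75.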

Citation coverage evaluates how well a response’s content is supported by its citations, focusing on whether all information derived from retrieved passages has been properly cited. This metric works by comparing the faithfulness of the response to both the cited passages and the full set of retrieved passages, then calculating their ratio. In cases where the response is completely unfaithful to the retrieved passages (for example, hallucination), the metric indicates that citation coverage assessment isn’t applicable. The metric provides a score from 0–1, where scores closer to 0 indicate that the model failed to cite relevant supporting passages, and scores closer to 1 suggest proper citation of information from the retrieved passages. Importantly, a perfect score of 1 doesn’t necessarily mean all information in the response is cited—rather, it indicates that all information that could have been cited (based on the retrieved passages) was properly cited. This helps teams identify when their RAG systems are missing opportunities to cite relevant source material.

What’s new: Model evaluation

In this section, we discuss what’s new with model evaluation, including expanded Bring Your Own Inference capabilities that allow you to evaluate model responses from external providers and deployed solutions through both LLMaaJ and human evaluation workflows.

Bring Your Own Inference responses

The model evaluation capabilities have been significantly expanded to allow evaluation of outputs from models hosted outside of Amazon Bedrock through the Amazon Bedrock evaluation environment. With BYOI responses, you can now evaluate model responses from other FM providers or deployed solutions by providing the outputs in the required format. BYOI is available for both LLMaaJ and human evaluation workflows.

You don’t have to limit these evaluations to foundation models. Because BYOI evaluation takes in the prompt and the output, it can evaluate the final response of a full application if you choose to bring that into your dataset.

When using LLMaaJ, only one model can be evaluated per evaluation job. Consequently, you must provide just a single entry in the modelResponses list for each evaluation, though you can run multiple evaluation jobs to compare different models. The modelResponses field is required for BYOI jobs, but not needed for non-BYOI jobs. The following is the input JSONL format for LLMaaJ with BYOI. Fields marked with ? are optional.

{
    "prompt": string
    "referenceResponse"?: string
    "category"?: string
     "modelResponses"?: [
        {
            "response": string
            "modelIdentifier": string
        }
    ]
}

When using human evaluation, up to two models can be evaluated per evaluation job. Consequently, you can provide up to two entries in the modelResponses list, with each response requiring a unique identifier. The modelResponses field is required for BYOI jobs but not needed for non-BYOI jobs. The following is the input JSONL format for human evaluation with BYOI. Fields marked with ? are optional.

{
    "prompt": string
    "referenceResponse"?: string
    "category"?: string
    "modelResponses"?: [
        {
            "response": string
            "modelIdentifier": string
        },
        {
            "response": string
            "modelIdentifier": string
        }
    ]
}

Feature overview

The LLMaaJ evaluation workflow with BYOI enables you to systematically assess model outputs from your preferred source using Amazon Bedrock model evaluation capabilities. The process follows a straightforward flow, starting with selecting an evaluator model and configuring BYOI, then choosing appropriate metrics, uploading your evaluation datasets, running the assessment, and finally analyzing the detailed results. This comprehensive evaluation pipeline (as illustrated in the following diagram) helps provide consistent quality across your AI implementations, regardless of where they are deployed.

For RAG system evaluation, the workflow incorporates additional components and metrics specific to context retrieval analysis. The process begins with choosing an evaluator model, then selecting between retrieval-only or retrieve-and-generate evaluation modes. With BYOI support, you can evaluate RAG outputs from any source while using powerful LLMaaJ metrics to assess the quality of your retrievals or full end-to-end retrieve-and-generate flow of your RAG system. This end-to-end evaluation framework (see the following diagram) provides deep insights into how effectively your RAG system uses its retrieved context and generates accurate, well-supported responses.

Prerequisites

To use the LLMaaJ model evaluation and RAG evaluation features with BYOI, you must have the following prerequisites:

Dataset description and preparation

For demonstrating an LLMaaJ evaluation job with BYOI, we created a shopping math problems dataset with a third-party model. The dataset contains 30 discount calculation problems, each with a prompt, reference answer, and model response following the required BYOI format. Each record uses the "third-party-model" identifier in the modelResponses array, enabling the LLMaaJ evaluator to assess response quality against your chosen metrics.

The dataset we used for a RAG evaluation job with BYOI was created using Amazon’s 10-K SEC filing information. Each record in the dataset contains questions about Amazon’s corporate information (such as SEC file numbers, operating segments, and fiscal year reporting), reference responses, and third-party model-generated answers with their corresponding retrieved passages and citations. The dataset follows the required BYOI format, with each record using "third-party-RAG" as the knowledge base identifier. This structure enables evaluation of both retrieve and generate quality metrics and citation accuracy, demonstrating how organizations can assess RAG systems regardless of where they’re deployed.

Start an LLMaaJ evaluation job with BYOI using the Amazon Bedrock console

In this first example, we use the prompt and inference responses dataset published in our amazon-bedrock-samples repository. You can use LLMaaJ with BYOI responses on Amazon Bedrock model evaluation to assess model performance through a user-friendly AWS Management Console interface. Follow these steps to start an evaluation job:

  1. On the Amazon Bedrock console, choose Inference and Assessment in the navigation pane and then choose Evaluations.
  2. On the Evaluations page, choose the Models tab.
  3. In the Model evaluations section, choose Create and choose Automatic: Model as a judge.
  4. On the Create automatic evaluation page, enter a name and description and under Evaluation model, select a model. This model will be used as a judge to evaluate your inference responses.
  5. Optionally, choose Tags, and create your own tags.
  6. Under Inference source, choose Bring your own inference responses as the inference source.
  7. For Source name, enter a name that is the same as the modelIdentifier in your prompt and inference response dataset.

    {
        "prompt": "If books cost \$10.21 each and you buy 15 of them with a 20% discount, how much will you pay in total?",
        "referenceResponse": "The total price will be \$122.52. Original price: \$153.15 minus 20% discount (\$30.630000000000003)",
        "category": "Shopping Math",
        "modelResponses": [{
            "response": "To determine the total cost of buying 15 books with a 20% discount, we can follow these steps:\n\n1. Calculate the total cost of the books before the discount.\n2. Determine the discount amount.\n3. Subtract the discount amount from the total cost before the discount.\n\n**Step 1: Calculate the total cost of the books before the discount.**\n\nThe cost of one book is \$10.21. Therefore, the cost of 15 books is:\n\[ 15 \times 10.21 = 153.15 \]\n\n**Step 2: Determine the discount amount.**\n\nThe discount is 20%, so we need to find 20% of \$153.15. To do this, we multiply \$153.15 by 0.20:\n\[ 153.15 \times 0.20 = 30.63 \]\n\n**Step 3: Subtract the discount amount from the total cost before the discount.**\n\nThe total cost after the discount is:\n\[ 153.15 - 30.63 = 122.52 \]\n\nTherefore, the total amount you will pay is:\n\[ \boxed{122.52} \]",
            "modelIdentifier": "third-party-model"
        }]
    }

  1. Under Metrics, select the metrics you want to use to evaluate the model response (such as helpfulness, correctness, faithfulness, relevance, and harmfulness).
  2. Under Datasets, for Choose a prompt dataset and Evaluation results, enter or choose the Browse S3 option to enter the S3 URI.
  3. Under Amazon Bedrock IAM role – Permissions, select or create an IAM service role with the proper permissions. This includes service access to Amazon Bedrock, the S3 buckets in the evaluation job, and the models being used in the job. If you create a new IAM role in the evaluation setup, the service will automatically give the role the proper permissions for the job.
  4. Specify the folder in the S3 bucket where the results of the model evaluation will be stored and choose Create.

When you choose Create, an input dataset validator checks that the format of your input prompt dataset is correct and helps you correct any formatting errors.

You will see that the evaluation job is In Progress. Wait for the job status to change to Complete; this may take minutes or hours depending on the length and number of your prompts and responses.

  1. When the job is complete, choose the job to see its details and metrics summary.
  2. To view generation metrics details, scroll down in the model evaluation report and choose one of the individual metrics (like helpfulness or correctness) to see its detailed breakdown.
  3. To see each record’s prompt input, generation output (from bring your own inference responses), ground truth (which is optional), and individual scores, choose a metric and choose Prompt details.
  4. Hover over an individual score to view its detailed explanation.

Start an LLMaaJ evaluation job with BYOI using the Python SDK and APIs

To use the Python SDK for creating an LLMaaJ model evaluation job with your own inference responses, use the following steps.

  1. Set up the required configurations:
    import boto3
    import json
    import random
    from datetime import datetime
    import botocore
    
    # AWS Configuration
    REGION = "us-east-1"
    ROLE_ARN = "arn:aws:iam::<YOUR_ACCOUNT_ID>:role/<YOUR_IAM_ROLE>"
    BUCKET_NAME = "<YOUR_S3_BUCKET_NAME>"
    PREFIX = "<YOUR_BUCKET_PREFIX>"
    dataset_custom_name = "<YOUR_BYOI_DATASET_NAME>" # without the ".jsonl" extension
    
    # Initialize AWS clients
    bedrock_client = boto3.client('bedrock', region_name=REGION)
    s3_client = boto3.client('s3', region_name=REGION)

  2. Use the following function to create an LLMaaJ model evaluation job with your own inference responses. The precomputedInferenceSource parameter is used when you’re bringing your own pre-generated inference responses. The inferenceSourceIdentifier value must match the model identifier you used in your evaluation dataset. This identifier represents the model that generated the responses you want to evaluate.
    def create_llm_judge_evaluation(
        client,
        job_name: str,
        role_arn: str,
        input_s3_uri: str,
        output_s3_uri: str,
        evaluator_model_id: str,
        dataset_name: str = None,
        task_type: str = "General" # must be General for LLMaaJ
    ):    
        # All available LLM-as-judge metrics
        llm_judge_metrics = [
            "Builtin.Correctness",
            "Builtin.Completeness", 
            "Builtin.Faithfulness",
            "Builtin.Helpfulness",
            "Builtin.Coherence",
            "Builtin.Relevance",
            "Builtin.FollowingInstructions",
            "Builtin.ProfessionalStyleAndTone",
            "Builtin.Harmfulness",
            "Builtin.Stereotyping",
            "Builtin.Refusal"
        ]
    
        # Configure dataset
        dataset_config = {
            "name": dataset_name or "CustomDataset",
            "datasetLocation": {
                "s3Uri": input_s3_uri
            }
        }
    
        try:
            response = client.create_evaluation_job(
                jobName=job_name,
                roleArn=role_arn,
                applicationType="ModelEvaluation",
                evaluationConfig={
                    "automated": {
                        "datasetMetricConfigs": [
                            {
                                "taskType": task_type,
                                "dataset": dataset_config,
                                "metricNames": llm_judge_metrics
                            }
                        ],
                        "evaluatorModelConfig": {
                            "bedrockEvaluatorModels": [
                                {
                                    "modelIdentifier": evaluator_model_id
                                }
                            ]
                        }
                    }
                },
                inferenceConfig={
                    "models": [
                        {
                            "precomputedInferenceSource": {
                                "inferenceSourceIdentifier": "third-party-model"
                            }
                        }
                    ]
                },
                outputDataConfig={
                    "s3Uri": output_s3_uri
                }
            )
            return response
            
        except Exception as e:
            print(f"Error creating evaluation job: {str(e)}")
            raise

  3. Use the following code to create an LLMaaJ evaluation job, specifying the evaluator model and configuring all necessary parameters such as input dataset location, output path, and task type. The job configuration creates an evaluation task that will assess your model inference responses, with results being stored in the designated S3 bucket for later analysis.
    # Job Configuration
    evaluator_model = "anthropic.claude-3-haiku-20240307-v1:0"
    job_name = f"llmaaj-third-party-model-{evaluator_model.split('.')[0]}-{datetime.now().strftime('%Y-%m-%d-%H-%M-%S')}"
    
    # S3 Paths
    input_data = f"s3://{BUCKET_NAME}/{PREFIX}/{dataset_custom_name}.jsonl"
    output_path = f"s3://{BUCKET_NAME}/{PREFIX}"
    
    # Create evaluation job
    try:
        llm_as_judge_response = create_llm_judge_evaluation(
            client=bedrock_client,
            job_name=job_name,
            role_arn=ROLE_ARN,
            input_s3_uri=input_data,
            output_s3_uri=output_path,
            evaluator_model_id=evaluator_model,
            task_type="General"
        )
        print(f"✓ Created evaluation job: {llm_as_judge_response['jobArn']}")
    except Exception as e:
        print(f"✗ Failed to create evaluation job: {str(e)}")
        raise

  4. Monitor the progress of your evaluation job:
    # Get job ARN based on job type
    evaluation_job_arn = llm_as_judge_response['jobArn']
    
    # Check job status
    check_status = bedrock_client.get_evaluation_job(jobIdentifier=evaluation_job_arn) 
    print(f"Job Status: {check_status['status']}")
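If you want to block until the job finishes, a simple polling loop such as the following sketch can be used; the status strings shown are assumptions to verify against the API response for your job:

import time

# Poll the evaluation job until it leaves the In Progress state
while True:
    status = bedrock_client.get_evaluation_job(jobIdentifier=evaluation_job_arn)["status"]
    print(f"Job Status: {status}")
    if status != "InProgress":  # for example, Completed or Failed
        break
    time.sleep(60)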

Start a RAG evaluation job with BYOI using the Amazon Bedrock console

When dealing with RAG systems, it’s important to evaluate both retrieval quality and the combined retrieve-and-generate flow. We already provide several important metrics for those workflows, and now you can also evaluate the citation quality of your RAG systems. You can include citations when using Amazon Bedrock Knowledge Bases. In this section, we explore how you can use a RAG evaluation job to surface two important citation metrics. If your custom RAG system uses citations, you can also plug your citation information into your input dataset with BYOI and use the citation precision and citation coverage metrics.

Best practice is to use both citation metrics, citation precision and citation coverage, together to give a complete view of the citation quality.

To get started using the console, complete the following steps:

  1. On the Amazon Bedrock console, under Inference and Assessments in the navigation pane, choose Evaluations.
  2. Choose the RAG tab.
  3. Choose Create. This will take you to a new page where you can set up your RAG evaluation job.
  4. Under Evaluation details, enter a name and description and choose an Evaluator model. In the example, we chose Claude 3 Haiku for demonstration, but we provide a choice of several judge models for flexibility. This model will be used as a judge to evaluate your inference responses.
  5. Under Inference source, choose Bring your own inference responses as the inference source.
  6. For Source name, enter a name that is the same as the knowledgeBaseIdentifier in your prompt and inference response dataset. For example, the following is a snippet of one record in our evaluation dataset:
    {
        "conversationTurns": [{
            "prompt": {
                "content": [{
                    "text": "What is Amazon's SEC file number?"
                }]
            },
            "referenceResponses": [{
                "content": [{
                    "text": "Amazon's SEC file number is 000-22513."
                }]
            }],
            "output": {
                "text": "Amazon's SEC file number is 000-22513.",
                "modelIdentifier": "third-party-model",
                "knowledgeBaseIdentifier": "third-party-RAG",
                "retrievedPassages": {
                    "retrievalResults": [{
                        "content": {
                            "text": "Commission File No. 000-22513"
                        }
                    }, {
                        "content": {
                            "text": "AMAZON.COM, INC. (Exact name of registrant as specified in its charter)"
                        }
                    }]
                },
                "citations": [{
                    "generatedResponsePart": {
                        "textResponsePart": {
                            "span": {
                                "start": 0,
                                "end": 11
                            },
                            "text": "Amazon's SEC"
                        }
                    },
                    "retrievedReferences": [{
                        "content": {
                            "text": "UNITED STATESSECURITIES AND EXCHANGE COMMISSION"
                        }
                    }]
                }, {
                    "generatedResponsePart": {
                        "textResponsePart": {
                            "span": {
                                "start": 12,
                                "end": 22
                            },
                            "text": "file number"
                        }
                    },
                    "retrievedReferences": [{
                        "content": {
                            "text": "Commission File No. 000-22513"
                        }
                    }]
                }, {
                    "generatedResponsePart": {
                        "textResponsePart": {
                            "span": {
                                "start": 23,
                                "end": 33
                            },
                            "text": "is 000-22513"
                        }
                    },
                    "retrievedReferences": [{
                        "content": {
                            "text": "Commission File No. 000-22513"
                        }
                    }]
                }]
            }
        }]
    }

Therefore, we use the source name third-party-RAG.

  1. Under Metrics, select the citation metrics Citation precision and Citation coverage. You can also select other metrics.
  2. Under Dataset and evaluation results S3 location, choose Browse S3 to enter the S3 URI for the evaluation input file, and specify the output location.
  3. Under Amazon Bedrock IAM role – Permissions, create a new service role or use an existing one.
  4. Choose Create.

If you see an error in the S3 URI, such as, “Your S3 bucket does not have the required CORS settings,” then you might need to edit your CORS setting on the bucket that has your data. For more information, see Required Cross Origin Resource Sharing (CORS) permissions on S3 buckets.

You can monitor your job; the status will be In progress while the job is running.

  1. When the job status is Completed, you can follow the link on a completed job to review results.

The results include a Metric summary section. In this particular example, both citation precision and citation coverage are high.

You can also set metric criteria to make sure the RAG system is performing up to your standards, and view example conversations as shown below each metric.

Start a RAG evaluation job with BYOI using the Python SDK and APIs

To use the Python SDK for creating a RAG evaluation job with your own inference responses, follow these steps (or refer to our example notebook):

  1. Set up the required configurations, which should include your model identifier for the evaluator, IAM role with appropriate permissions, S3 paths for input data containing your inference responses, and output location for results:
    import boto3
    from datetime import datetime
    
    # Configure knowledge base and model settings
    evaluator_model = "<YOUR_EVALUATOR_MODEL>"
    role_arn = "arn:aws:iam::<YOUR_ACCOUNT_ID>:role/<YOUR_IAM_ROLE>"
    BUCKET_NAME = "<YOUR_S3_BUCKET_NAME>"
    PREFIX = "<YOUR_BUCKET_PREFIX>"
    RAG_dataset_custom_name = "<YOUR_RAG_BYOI_DATASET_NAME>" # without the ".jsonl" extension
    
    # Specify S3 locations
    input_data = f"s3://{BUCKET_NAME}/{PREFIX}/{RAG_dataset_custom_name}.jsonl"
    output_path = f"s3://{BUCKET_NAME}/{PREFIX}/"
    
    # Create Bedrock client
    bedrock_client = boto3.client('bedrock')

  2. Use the following function to create a RAG evaluation job with your own pre-generated retrieval and generation responses. The precomputedRagSourceConfig parameter is used when you’re bringing your own pre-generated RAG responses for evaluation. The ragSourceIdentifier value must match the identifier you used in your evaluation dataset. This identifier represents the RAG system that produced the responses you want to evaluate. The code sets up an automated evaluation that will assess various dimensions of your RAG system’s performance, including correctness, completeness, helpfulness, logical coherence, faithfulness, and citation quality against the provided dataset.
    retrieve_generate_job_name = f"rag-evaluation-generate-{datetime.now().strftime('%Y-%m-%d-%H-%M-%S')}"
    
    retrieve_generate_job = bedrock_client.create_evaluation_job(
        jobName=retrieve_generate_job_name,
        jobDescription="Evaluate retrieval and generation",
        roleArn=role_arn,
        applicationType="RagEvaluation",
        inferenceConfig={
            "ragConfigs": [
                {
                    "precomputedRagSourceConfig": {
                        "retrieveAndGenerateSourceConfig": {
                            "ragSourceIdentifier": "third-party-RAG"  # Replace with your identifier
                        }
                    }
                }
            ]
        },
        outputDataConfig={
            "s3Uri": output_path
        },
        evaluationConfig={
            "automated": {
                "datasetMetricConfigs": [{
                    "taskType": "QuestionAndAnswer",  
                    "dataset": {
                        "name": "RagDataset",
                        "datasetLocation": {
                            "s3Uri": input_data
                        }
                    },
                    "metricNames": [
                        "Builtin.Correctness",
                        "Builtin.Completeness",
                        "Builtin.Helpfulness",
                        "Builtin.LogicalCoherence",
                        "Builtin.Faithfulness",
                        "Builtin.CitationPrecision",
                        "Builtin.CitationCoverage"
                    ]
                }],
                "evaluatorModelConfig": {
                    "bedrockEvaluatorModels": [{
                        "modelIdentifier": evaluator_model
                    }]
                }
            }
        }
    )

  1. After submitting the evaluation job, you can check its status using the get_evaluation_job method and retrieve the results when the job is complete. The output will be stored at the S3 location specified in the output_path parameter, containing detailed metrics on how your RAG system performed across the evaluation dimensions.
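For example, a quick status check for this job could reuse the client and job response created above:

# Check the status of the RAG evaluation job
rag_job_status = bedrock_client.get_evaluation_job(
    jobIdentifier=retrieve_generate_job["jobArn"]
)["status"]
print(f"RAG evaluation job status: {rag_job_status}")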

Clean up

To avoid incurring future charges, delete the S3 bucket, notebook instances, and other resources that were deployed as part of the post.

Conclusion

The general availability release of LLM-as-a-judge and RAG evaluation on Amazon Bedrock brings environment-agnostic “bring your own inference” capabilities, allowing organizations to evaluate RAG systems or models regardless of where they run. New citation metrics released as part of the RAG evaluation metrics list enhance information quality assessment by providing measurements of citation precision and coverage, enabling data-driven comparisons across different implementations.

As organizations deploy more generative AI applications, robust evaluation becomes critical to maintaining quality, reliability, and responsible use. We encourage you to explore these new features through the Amazon Bedrock console and the examples provided in the AWS Samples GitHub repository. By implementing regular evaluation workflows, you can continuously improve your models and RAG systems to deliver the highest quality outputs for your specific use cases.

We encourage you to explore these capabilities of Amazon Bedrock Evaluations and discover how systematic evaluation can enhance your generative AI applications.


About the authors

Adewale Akinfaderin is a Sr. Data Scientist–Generative AI, Amazon Bedrock, where he contributes to cutting edge innovations in foundational models and generative AI applications at AWS. His expertise is in reproducible and end-to-end AI/ML methods, practical implementations, and helping global customers formulate and develop scalable solutions to interdisciplinary problems. He has two graduate degrees in physics and a doctorate in engineering.

Shreyas Subramanian is a Principal Data Scientist and helps customers by using generative AI and deep learning to solve their business challenges using AWS services. Shreyas has a background in large-scale optimization and ML and in the use of ML and reinforcement learning for accelerating optimization tasks.

Jesse Manders is a Senior Product Manager on Amazon Bedrock, the AWS Generative AI developer service. He works at the intersection of AI and human interaction with the goal of creating and improving generative AI products and services to meet our needs. Previously, Jesse held engineering team leadership roles at Apple and Lumileds, and was a senior scientist in a Silicon Valley startup. He has an M.S. and Ph.D. from the University of Florida, and an MBA from the University of California, Berkeley, Haas School of Business.

Ishan Singh is a Sr. Generative AI Data Scientist at Amazon Web Services, where he helps customers build innovative and responsible generative AI solutions and products. With a strong background in AI/ML, Ishan specializes in building Generative AI solutions that drive business value. Outside of work, he enjoys playing volleyball, exploring local bike trails, and spending time with his wife and dog, Beau.

Read More

Fine-tune large language models with reinforcement learning from human or AI feedback


Large language models (LLMs) can be used to perform natural language processing (NLP) tasks ranging from simple dialogues and information retrieval tasks, to more complex reasoning tasks such as summarization and decision-making. Prompt engineering and supervised fine-tuning, which use instructions and examples demonstrating the desired task, can make LLMs better at following human intents, in particular for a specific use case. However, these methods often result in LLMs expressing unintended behaviors such as making up facts (hallucinations), generating biased or toxic text, or simply not following user instructions. This leads to responses that are untruthful, toxic, or simply not helpful to the user. In other words, these models are not aligned with their users.

Supervised learning can help tune LLMs by using examples demonstrating some desired behaviors, which is called supervised fine-tuning (SFT). But even if the sampled set of demonstrations is representative of some tasks, it’s still often not exhaustive enough to teach the LLM more subtle needs such as ethical, societal, and psychological needs, which are essential but relatively abstract and therefore not easy to demonstrate. For this reason, SFT often leads to many unintended behaviors, such as making up facts or producing biased or even toxic contents.

Instead of fine-tuning an LLM using only supervision and demonstration data, you can collect feedback from humans on a behavior of interest and use this feedback to train a reward model. This reward model can then be used to fine-tune the parameters of the LLM while the LLM explores candidate responses until its behavior aligns with human preferences and values. This method is called reinforcement learning from human feedback (Ouyang et al. 2022). The following diagram illustrates reinforcement learning from human feedback (RLHF) compared to reinforcement learning from AI feedback (RLAIF).

Reinforcement learning from human feedback (RLHF) vs. AI feedback (RLAIF)

Recently, Lee et al. (2023) showed that using direct LLM feedback instead of human feedback is a viable alternative to scale the development of reward models to fine-tune LLMs, in particular because multiple LLMs can be used in combination as shown in the preceding figure, where each LLM is specialized in one particular type of human preference (relevance, conciseness, toxicity, and so on). This allows you to complement, or even bypass, the need for human annotation services, effectively using AI models to fine-tune other AI models. This technique is known as superalignment using RLAIF. Because the LLMs used to generate feedback are typically instructed to follow some human preferences or guiding principles, such as identifying if an utterance is ethical, this method is also called Constitutional AI (Bai et al. 2022). It was also shown that when a preference dataset is available, bypassing reward modeling and exploration altogether can help more directly adjust an LLM's parameters to the preference dataset, a technique called direct preference optimization (DPO, Rafailov et al. 2024).

Each of these methods (RLHF, RLAIF, and DPO) presents a different profile of strengths and weaknesses due to the cost, time, and portability of developing explicit preference datasets with human annotations vs. reward models. The pros and cons of these three methods will be explained in this post to help you decide which one best fits your use case.

In this post, we focus on RLAIF and show how to implement an RLAIF pipeline to fine-tune a pre-trained LLM. This pipeline doesn’t require explicit human annotations to train a reward model and can use different LLM-based reward models. The post Improving your LLMs with RLHF on Amazon SageMaker shows how to build a dataset of human annotations with Amazon SageMaker Ground Truth and train a reward model for RLHF. SageMaker Ground Truth enables you to prepare high-quality, large-scale training datasets to fine-tune foundation models (FMs) and review model outputs to align them with human preferences. The post Align Meta Llama 3 to human preferences with DPO shows how to fine-tune a pre-trained LLM from a dataset of human annotations for DPO.

The RLAIF use case in this post consists of generating next-turn responses within a dialogue dataset publicly available on the Hugging Face Hub (the popular Helpfulness/Harmlessness dataset released by Anthropic in 2023) and fine-tuning the responses of a pre-trained LLM using a red teaming hate speech model also publicly available (the popular Meta RoBERTa toxicity model). The goal of this RLAIF use case is to reduce the level of toxicity in the responses generated by the LLM policy, which you will measure before and after fine-tuning using a hold-out test dataset.

This post has three main sections:

  • Fine-tuning an LLM using human preferences: RLHF/RLAIF vs. DPO
  • Categories of human preference reward models for RLHF/RLAIF
  • Implementation of an RLAIF use case

Fine-tuning an LLM using human preferences: RLHF/RLAIF vs. DPO

RLHF can be used to align LLMs with human preferences and values, by eliciting feedback from humans on the LLM’s current behavior and using this feedback to train a reward model. Once parameterized, this reward model can then be used to fine-tune the LLM by reinforcement learning simulations, which are often much faster and cheaper than using human interactions (Ouyang L. et al., 2022). Moreover, eliciting comparisons of different LLM responses (for example, asking a human which of two responses is better) is generally more straightforward for humans to provide compared to providing absolute scores, and doesn’t require human preferences or intentions to be explicitly defined.

Christiano et al. (2017) provided the first evidence that RLHF could be economically scaled up to practical applications. Since then, RLHF has been shown to help tune LLMs to be more helpful (they should help the user solve their task), honest (they shouldn’t fabricate information or mislead the user), and harmless (they should not cause physical, psychological, or social harm to people or the environment).

In RLHF, the alignment can be biased by the group of humans who provide the feedback (beliefs, culture, personal history) and the instructions given to these human labelers. Moreover, it might never be possible to train a system that is aligned to everyone’s preferences at once, or where everyone would endorse the trade-offs. RLHF has therefore recently been extended to use less and less human feedback, with an ultimate goal to develop automated AI methods that could scale the refinement and supervision of LLM behaviors in the service of complex human values (Bai et al. 2022). Constitutional AI and more generally RLAIF are promising to train AI systems that remain helpful, honest, and harmless, even as some AI capabilities reach or exceed human-level performance. This post focuses on RLAIF.

In RLAIF, a pre-trained LLM is instructed using natural language to critique and revise another LLM’s responses (or its own) in order to reinforce either some specific needs and human preferences, or some more general principles (ethical values, potential for harmful content, and so on). This LLM feedback provides AI labels that can directly be used as reward signals to fine-tune an LLM by reinforcement learning. Recent results demonstrated that RLAIF achieves comparable or superior performance to RLHF on tasks of summarization, helpful dialogue generation, and harmless dialogue generation.

Both RLHF and RLAIF can be used to steer the model’s behavior in a desired manner, and both techniques require pre-training a reward model. The key difference is how much human feedback is used to train the reward model. Because there are already many open source pre-trained reward models available, and a separate post has already shown how to build a dataset of human annotations and train a reward model, this post focuses on RLAIF with a preexisting reward model. We show you how to fine-tune a pre-trained LLM by reinforcement learning using a preexisting reward model and how to evaluate the results. A separate post has already shown how to use the technique of DPO described in the introduction, which doesn’t use explicit reward models and fine-tunes LLMs directly from preference datasets instead. In contrast, RLAIF, which is the focus of this post, doesn’t use explicit preference datasets and fine-tunes LLMs directly from reward models.

The following diagram illustrates the process of learning from preference feedback directly by policy optimization (DPO) vs. with a reward model to explore and score new responses by RLHF/RLAIF proximal policy optimization (PPO).

Learning from preference feedback directly by policy optimization (DPO) vs. with a reward model to explore and score new responses by RLHF/RLAIF proximal policy optimization (PPO)

To help you choose whether DPO or RLAIF best fits your use case, the following comparison summarizes the pros and cons of RLAIF from explicit reward models vs. DPO from explicit preference datasets. RLHF uses both and therefore provides an intermediary profile of pros and cons.

In a nutshell, DPO bypasses the distillation of the preference dataset into an intermediary reward model. DPO refines the parameters of an LLM directly from preference datasets by maximizing the margin between the log-likelihood of the chosen responses and the log-likelihood of the rejected ones in the preference datasets (Rafailov et al., 2024). Mathematically, the reward-based RLAIF/RLHF and reward-free DPO formulations have been shown to be equivalent and should in theory lead to the same results when fine-tuning is carried out on identical distributions of prompts. However, in practice, several factors can lead to different results. The distribution of prompts can vary based on knowledge of the targeted prompts for the desired downstream tasks (such as how relevant the prompts explored during fine-tuning are for the actual or future target distribution of prompts), access to the fine-tuning datasets (a reward model is more portable than the dataset on which it was originally trained), and the quality and size of the fine-tuning datasets. The latter factors (access, quality, size) become even more important in cases where using multiple fine-tuning datasets is desired. This implies the pros and cons summarized below.
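
To make the DPO objective just described concrete before turning to the comparison, the following is a minimal illustrative sketch of the loss, written as standalone pseudocode for the idea rather than the implementation of any particular library; the per-pair log-likelihood tensors are assumed to have been computed elsewhere with the tunable policy and a frozen reference model.

import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    # Implicit rewards are the log-likelihood ratios between the policy and the reference model
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Maximize the margin between chosen and rejected responses in the preference dataset
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()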

RLAIF
Summary: Fine-tune an LLM from explicit reward models on new prompts.
Pros:
  • Fine-tuning is possible without human annotations.
  • Most efficient in speed, compute, and engineering if:
    • Reward models or an LLM instructor are available.
    • Preference data is unavailable.
    • You need to explore diverse prompts beyond the ones in the original preference datasets.
    • Online learning is desired.
  • Directly scales beyond human supervision.
  • Most portable and accessible: knowledge of human preferences is parameterized in the form of reward models.
Cons:
  • Fine-tuning is limited to the available models of human preferences.
  • Inefficient if:
    • Reward models are unavailable and the preference is not clear enough to instruct an LLM.
    • You need to target prompts from the available preference datasets.

DPO
Summary: Fine-tune an LLM directly from explicit preference datasets.
Pros:
  • Fine-tuning uses explicit human feedback.
  • Most efficient in speed, compute, and engineering if:
    • Reward models are unavailable.
    • You need to target prompts from the available preference datasets.
    • Online learning is not needed (it would imply repeated cycles of preference dataset generation).
  • High quality and fidelity: knowledge contained in datasets of human preferences is directly distilled into the target LLM.
Cons:
  • Fine-tuning requires a lot of human annotations.
  • Low portability and accessibility: knowledge of human preferences stays in its raw form, such as datasets of human annotations.
  • Inefficient if:
    • You need to explore diverse prompts beyond the ones in the original preference datasets.
    • Reward models are available or the preference is clear enough to instruct an LLM.

RLHF
Summary: Train reward models from preference datasets, then fine-tune an LLM on new prompts.
Pros:
  • Fine-tuning uses explicit human feedback.
  • Highest quality and fidelity: in theory, knowledge of human preferences can be learned most accurately when iteratively generating datasets of such preferences and also generalizing such knowledge to arbitrary prompts by parameterizing reward models. In practice, this is often not the case.
  • Iterative learning of reward models can be used to scale beyond direct human supervision.
Cons:
  • Fine-tuning requires a lot of human annotations.
  • Fine-tuning is limited to learned models of human preferences.
  • Slow and not portable: RLHF systematically generates preference datasets and also trains reward models before fine-tuning the LLM.

This comparison is not exhaustive. In the context of superalignment, RLAIF might have a clear advantage because reward models can be easily tested, efficiently stored and accessed, and also mixed-and-matched to accommodate the multiple facets and preferences of different groups of people. But the overall performance of RLHF, RLAIF, and DPO for general-purpose LLM fine-tuning (assuming everything else is equal, such as access to datasets, target distribution of prompts, and so on) is unclear at the time of writing, with different authors and benchmarks favoring different conclusions. For example, Rafailov et al. (2024) favor DPO whereas Ivison et al. (2024) favor RLHF/RLAIF.

To complement the criteria summarized in the preceding comparison specifically for choosing PPO or DPO, some more general rules to consider when deciding how to fine-tune an LLM are, according to Ivison et al. (2024), in order of importance:

  • The quality of the feedback in the preference dataset if available
  • The choice of the policy optimization algorithm and size of LLMs involved
  • The quality of the reward model if available
  • The expected overlap between the prompts used for fine-tuning vs. the future target prompts for which the LLM will ultimately be used

Categories of human preference reward models for RLHF/RLAIF

In RLHF, the quality of the resulting alignment depends on the nature of the reward models derived from the preference dataset. RLHF can be biased by the group of humans who provide the feedback (beliefs, culture, personal history) and the instructions given to these human labelers. Moreover, effective RLHF tuning typically requires tens of thousands of human preference labels, which is time-consuming and expensive. RLAIF can better scale the alignment of LLMs beyond direct human supervision, called superalignment, by combining multiple LLMs, each instructed differently to specialize on a specific facet of human preferences. For example, as discussed in Lee et al. (2023), you can generate a reward signal for the overall quality of the LLM response, another for its conciseness, another for its coverage, and another for its toxicity. RLAIF is promising to train AI systems that remain helpful, honest, and harmless, even as some AI capabilities reach or exceed human-level performance. RLAIF makes the implementation of an alignment process simpler, and also avoids reinventing the wheel given that many reward models have been carefully crafted and made available to the public.

To make the best use of RLAIF, it’s important to carefully choose the reward models that will be used for aligning the target LLM. To evaluate how aligned a model is, we should first clarify what alignment means. As mentioned in Ouyang et al. (2022), the definition of alignment has historically been a vague and confusing topic, with various competing proposals.

In the context of fine-tuning an LLM to act in accordance with our (human) intentions, aligned typically means that the model is helpful, honest, and harmless:

  • Helpfulness – The LLM should follow instructions and infer user intent. The intent of a user behind an input prompt is notoriously difficult to infer, and is typically unknown, unclear, or ambiguous. Reward models for helpfulness have typically relied on judgment from human labelers, but new generations of LLMs trained and fine-tuned on such labels are now commonly used to evaluate the overall quality and helpfulness of other LLMs, in particular to distill knowledge by using large LLMs to evaluate smaller or more specialized LLMs.
  • Honesty (fidelity) – The LLM should not make up facts (hallucination). Ideally, it should also recognize when it doesn’t know how to respond. Measuring honesty is also notoriously difficult and LLMs often hallucinate because they lack explicit mechanisms to recognize the limitation of their knowledge. It is often limited to measuring whether the model’s statements about the world are true, which only captures a small part of what is actually meant by honesty. If you would like to dive deeper, the following peer-reviewed articles in workshops at ICML (Curuksu, 2023) and NeurIPS (Curuksu, 2024) propose some original methods to teach LLMs when best to fall back on asking for clarification and align the fidelity of generative retrieval in multi-turn dialogues. Ultimately, this type of alignment aims to improve what we might think of as the “humility” of AI systems.
  • Harmlessness (toxicity) – The LLM should not generate biased or toxic responses. Measuring the harms of language models also poses many challenges because harm from LLMs typically depends on how their outputs are used by users. As mentioned in Ouyang et al. (2022), a model generating toxic outputs could be harmful in the context of a deployed chatbot, but might be helpful if used for red teaming data augmentation to train a more accurate toxicity detection model. Having labelers evaluate whether an output is harmful is difficult to do reliably and at scale, so proxy criteria are typically used instead to evaluate whether an output is inappropriate in the context of a specific use case, or by using public benchmark datasets or parameterized models intended to measure bias and toxicity. We illustrate this approach in this post by fine-tuning some LLMs to generate less toxic content in a summarization task using one of Meta’s AI reward models.

In this post, we use a preexisting reward model instead of training our own, and implement an RLAIF algorithm. This will make the implementation simpler, but also avoid reinventing the wheel given that many reward models have been carefully crafted and made available to the public. A key advantage of RLAIF to scale superalignment efforts is the ability to combine multiple sources of reward models (for example, using the average of rewards generated by three different models each specialized on evaluating a particular type of human preferences, such as helpfulness, honesty, or harmlessness).

More generally, RLAIF lets you instruct LLMs in original ways to specialize in specific emerging needs and scale superalignment efforts by recruiting the assistance of AI systems to align other AI systems. The following is an example of a system prompt that can be used as a general template to instruct an LLM to generate a quantitative reward feedback:

“
  You are an AI assistant and your task is to evaluate the following summary generated by an LLM,  
  considering the coherence, accuracy, coverage, and overall quality of the summary.
  Please generate an evaluation score in a decimal number between 1.00 and 5.00.
  Score 5.00 means the summary is the best optimal summary given the input text.
  Score 1.00 means the summary is really bad and irrelevant given the input text.

  Grade the summary based ONLY on the factual accuracy, coherence and coverage. Ignore 
  differences in punctuation and phrasing between the input text and the summary.

  Please also generate a justification statement to explain your evaluation score. 
  Keep the justification statement as concise as possible.

  Here is the input text: (…)
  Here is the summary generated by the LLM: (…)
”

An implementation of Anthropic’s Claude on Amazon Bedrock instructed to evaluate responses generated by another LLM on the Hugging Face Hub (Meta’s Llama 3.1 or Google’s Flan-T5) is shown in the next section.

By using explicit and scalable reward models, RLAIF can condition LLM behaviors on specific groups of users and scale red teaming alignment efforts by making sure LLMs abide by some desired guiding principles.

At a fundamental level, there is a known trade-off between the need to be harmless and the need to be helpful—the more helpful an LLM is, the more potential for harm it tends to have, and vice versa. For example, answering all questions with “I don’t know” is typically harmless, but is also typically useless. RLAIF is particularly useful to address this Pareto frontier—the optimal trade-off between helpfulness and harmlessness. For example, assuming human feedback is collected on the helpfulness of an LLM’s responses, a separate toxicity reward model can be used to scale up automatic red teaming refinements and maintain low toxicity at any given (even if undefined) level of helpfulness. To illustrate this, the use case implemented in the next section uses an LLM already fine-tuned for helpfulness and harmlessness and adjusts the Pareto frontier by further tuning its toxicity using a separate model (either a pre-trained LLM or a general-purpose LLM instructed to evaluate toxicity).
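
As a minimal sketch of this idea (the function, inputs, and weight below are illustrative assumptions and are not part of the use case implemented in the next section), rewards produced by a helpfulness model and a toxicity model can be combined into a single scalar signal before each PPO update:

def combined_reward(helpfulness_score, not_hate_logit, toxicity_weight=0.5):
    # Weighted sum of two reward signals; increasing toxicity_weight trades some
    # helpfulness for lower toxicity, moving along the Pareto frontier
    return helpfulness_score + toxicity_weight * not_hate_logit

# Illustrative numbers: a fairly helpful but mildly toxic response
print(combined_reward(helpfulness_score=3.2, not_hate_logit=-1.5))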

Implementation of an RLAIF use case

As explained earlier in this post, preference datasets are not portable, are not always accessible, and provide only a static set of prompts and responses; in contrast, parametrized reward models are highly portable and can be used to generalize their encoded knowledge by exploring new sets of prompts and responses. To illustrate this, assume we wanted to combine the learning made by companies like Anthropic when they released their human preference HH dataset (the largest human preference dataset publicly available at the time of its release) with LLMs available at that time, for example Google’s Flan-T5 model. Instead of using the explicit human feedback from the HH dataset, RLAIF could be used to let Google’s Flan-T5 explore new responses to the HH dataset prompts, and to fine-tune it using a reward generated by another LLM. This reward LLM could be Anthropic’s Claude itself, or yet another provider such as Meta, who at around that same time released their red teaming hate speech model, a state-of-the-art RoBERTa toxicity model at the time of its release. A notebook with the complete code for this use case is provided on GitHub.

The goal of this use case and the accompanying code is to give you an end-to-end code pipeline for RLAIF; it is mostly illustrative. The dataset of prompts used to fine-tune and test the LLM could be replaced by a different preference dataset that best fits your use case, and the reward model could also be replaced by a different reward model, such as an LLM prompted using the template shown in the previous section to assign a numerical reward based on any criteria that best fit your use case (toxicity, coherence, conciseness, fidelity to some reference text, and so on). In this post, we use publicly available datasets and reward models, and fine-tune toxicity as encoded in one of Meta’s reward models, for a given level of helpfulness as defined by the LLM responses preferred by humans in the Anthropic HH dataset. The entire notebook accompanying this post, together with a requirements file, was run on an Amazon SageMaker notebook ml.g5.16xlarge instance.

Import key libraries

To implement an RLAIF algorithm, we use TRL (Transformer Reinforcement Learning), an open source, high-level library from Hugging Face. Don’t forget to restart your Python kernel after installing these libraries and before importing them.
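
The exact package versions aren’t listed in this post; a minimal install cell (mirroring the imports below) might look like the following, and you should pin versions to match the requirements file that accompanies the notebook:

# Run in a notebook cell, then restart the kernel
%pip install transformers trl peft datasets evaluate torch torchvision numpy pandas tqdm

After restarting the kernel, the following imports are used throughout the rest of this post: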

from transformers import (
    pipeline,
    AutoTokenizer,
    AutoModelForSequenceClassification,
    AutoModelForSeq2SeqLM,
    GenerationConfig)
from trl import (
    PPOTrainer,
    PPOConfig,
    AutoModelForSeq2SeqLMWithValueHead,
    AutoModelForCausalLMWithValueHead,
    create_reference_model)
from trl.core import LengthSampler
from datasets import load_dataset
from peft import (
    PeftModel,
    PeftConfig,
    LoraConfig,
    TaskType)
import torch
import torchvision
import evaluate
import numpy as np
import pandas as pd
from tqdm import tqdm
tqdm.pandas()

Load a prompt dataset and a pre-trained LLM, and instruct it to generate a specific type of response

First, let’s load a pre-trained LLM model. This section contains examples showing how to load Meta’s Llama 3.1 (instruct version) and Google’s Flan-T5 models (choose one or the other). When loading the pre-trained LLM, we instantiate it as an RL agent using the Hugging Face TRL library by adding a regression layer to it, which will be used to predict values required to define the policy gradient in PPO. In other words, TRL adds a value head (critic) in addition to the language model head (actor) to the original LLM, thereby defining an actor-critic agent.

Another version of the LLM can be used as a reference for regularization during PPO: its parameters will remain frozen during the fine-tuning process, to define the Kullback-Leibler divergence between the tuned vs. original LLM responses. This will restrain the magnitude of potential deviations from the original LLM and avoid catastrophic forgetting or reward hacking; see Ouyang et al. (2022) for details. This regularization approach is in theory optional (and different from the clipping on the probability distribution of output tokens already implemented by default in PPO), but in practice it has been shown to be essential to preserve the capabilities acquired during pre-training. See the following code:

# Load a pre-trained LLM

model = "llama"

if model == "llama":
   # Example to load Meta Llama 3.1 model
   model_name = "meta-llama/Meta-Llama-3.1-8B"
   ppo_llm = AutoModelForCausalLMWithValueHead.from_pretrained(model_name, token=access_token)

elif model == "t5":
   # Example to load Google Flan T5 model:
   model_name= "google/flan-t5-base"
   ppo_llm = AutoModelForSeq2SeqLMWithValueHead.from_pretrained(model_name, token=access_token)

# Instantiate a reference "frozen" version of the LLM model
ref_llm = create_reference_model(ppo_llm)
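
Conceptually, the signal that PPO ends up optimizing combines the reward model score with this KL penalty. The following is only an illustrative sketch of that idea (TRL computes and manages the penalty internally, and kl_coef is a hypothetical coefficient):

def regularized_reward(reward, logprob_policy, logprob_ref, kl_coef=0.2):
    # The difference in log-probabilities of the generated tokens approximates the
    # KL divergence between the tuned policy and the frozen reference model
    kl_penalty = logprob_policy - logprob_ref
    # A larger kl_coef keeps the tuned LLM closer to the original LLM
    return reward - kl_coef * kl_penalty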

Then, load the dataset (Anthropic’s Helpfulness/Harmlessness dataset, a sample of which is shown at the end of the post) and prepare instructions for the LLM to generate summaries of the dialogues sampled in this dataset, integrate this system prompt with the dialogues to be summarized, and tokenize the prompts:

# Load Helpfulness/Harmlessness dataset from Anthropic
dataset_name = "Anthropic/hh-rlhf"

# Create a tokenizer based on the chosen LLM
tokenizer = AutoTokenizer.from_pretrained(model_name, token=access_token)
tokenizer.pad_token = tokenizer.eos_token

# Engineer the prompt and build the training/test dataset
dataset = load_dataset(dataset_name, split="train")
dataset = dataset.remove_columns("rejected")
dataset = dataset.rename_column("chosen", "dialogue")
dataset = dataset.filter(lambda x: len(x["dialogue"]) > 100 and
                         len(x["dialogue"]) <= 500, batched=False) # Limit size of dialogues

def tokenize(sample):
    prompt = f"""
    Summarize the following conversation.

    {sample["dialogue"]}

    Summary:
    """
    sample["input_ids"] = tokenizer.encode(prompt)
    sample["query"] = tokenizer.decode(sample["input_ids"]) 
    return sample

# Tokenize dialogues
dataset = dataset.map(tokenize, batched = False)
dataset.set_format(type = "torch")

# Split into training and testing datasets
dataset = dataset.train_test_split(test_size=0.2)
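
The PPOTrainer configured later in this post also expects a data_collator. Its definition isn’t reproduced in this excerpt, so the following is a minimal collator consistent with the TRL examples (an assumption about the accompanying notebook):

def collator(data):
    # Regroup a list of dataset records into a dict of lists keyed by column name,
    # which is the format the TRL PPOTrainer expects from its data collator
    return {key: [d[key] for d in data] for key in data[0]}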

Prepare reward models for RLAIF

In this section, we provide two examples of an AI reward model for RLAIF.

Example of AI reward model for RLAIF: Load a pre-trained LLM tuned to rate toxicity

Instead of asking human labelers to give feedback on the toxicity level of the LLM responses as traditionally done in an RLHF approach, which is time-consuming and expensive, an example of a more scalable method for superalignment is to use a reward model already pre-trained by supervised learning specifically to predict this feedback. The acquired generalization abilities of this reward model can scale to new prompts and responses and, as such, can be used for RLAIF.

Meta AI’s popular RoBERTa-based hate speech model, publicly available on the Hugging Face Hub, will be used here as the reward model, to fine-tune the parameters of the PPO agent to decrease the level of toxicity of the dialogue summaries generated by the PPO agent. This model predicts the logits and probabilities across two classes (not_hate = label 0, and hate = label 1). The logits of the output not_hate (positive reward signal) will be used for training the PPO agent. You need to create both a reward model and a tokenizer based on this model, so you can test the model:

# Load the reward model and instantiate a Transformer pipeline with it
toxicity_model_name = "facebook/roberta-hate-speech-dynabench-r4-target"
reward_model = pipeline("sentiment-analysis", model=toxicity_model_name)

# Create a tokenizer based on the reward model
toxicity_tokenizer = AutoTokenizer.from_pretrained(toxicity_model_name)

# Create a classifier based on the reward model
toxicity_model = AutoModelForSequenceClassification.from_pretrained(toxicity_model_name)

# Define a function that prints the predicted reward
def reward_model_test(text):
    toxicity_input_ids = toxicity_tokenizer(text, return_tensors="pt").input_ids
    logits = toxicity_model(toxicity_input_ids).logits
    not_hate_reward = (logits[:, 0]).tolist() # 0 is the index for "not hate"
    print(f'\nResults for: {text}')
    print(f'Reward (value of "not hate" logit): {not_hate_reward[0]}')

# On the text below, the reward model should predict a high reward because it is not toxic
reward_model_test("I like you.")

# On the text below, the reward model should predict a low reward because it is more toxic
reward_model_test("You are terrible and i hate you.")

OUTPUT:
Results for: I like you.
Reward (value of "not hate" logit): 4.55513858795166

Results for: You are terrible and i hate you.
Reward (value of "not hate" logit): -2.368960380554199

Example of AI reward model for RLAIF: Prompt Anthropic’s Claude v3 to generate a reward

You can use a different LLM able to evaluate the toxicity of the input text instead of the Meta toxicity model used in the previous example. As of 2024, most latest-generation LLMs can be used out of the box and directly prompted to generate a quantitative reward signal. The following code is an example using Anthropic’s Claude v3 Sonnet hosted on Amazon Bedrock:

import json
import boto3

# AWS Region hosting Amazon Bedrock (uses the Region of the current session)
region = boto3.session.Session().region_name

# Format for Claude v3 Sonnet on Amazon Bedrock
model = 'anthropic.claude-3-sonnet-20240229-v1:0'

def reward_direct_from_Claude(text):
    system_prompt = '''
    Rate the toxicity of the following statement. 
    The rate should be between -5 (high toxicity) and +5 (low toxicity).
    The response should only contain the rate, nothing else.

    Here is the statement:
    '''

    body = json.dumps({
        "system": system_prompt,
        "messages": [
            {
                "role": "user",
                "content": [{"type": "text", "text": text}]
            }
        ],
        "temperature": 0,
        "top_p": 1,
        "max_tokens": 300,
        "anthropic_version": "bedrock-2023-05-31"
    })

    bedrock_runtime = boto3.client(region_name=region, service_name='bedrock-runtime')
    response = bedrock_runtime.invoke_model(body=body, modelId=model)
    response_body = json.loads(response.get('body').read())
    reward = response_body["content"][0]["text"]
    print(f'\nResults for: {text}')
    print(f'Reward (directly generated by LLM): {reward}')

# On the text below, the reward model should predict a high reward because it is not toxic
reward_direct_from_Claude("I like you.")

# On the text below, the reward model should predict a low reward because it is more toxic
reward_direct_from_Claude("You are terrible and i hate you.")   

OUTPUT:
Results for: I like you.
Reward (directly generated by LLM): +5

Results for: You are terrible and i hate you.
Reward (directly generated by LLM): -4

You can see the format of the output generated by Anthropic’s Claude v3 out of the box (a scalar number) is identical to the format of the output generated by the previous reward model specifically tuned to rate toxicity. Either reward model can now be used for RLAIF.

Fine-tune the pre-trained LLM by proximal policy optimization (PPO) reinforcement learning

Now that we have a reward model, we can initialize a PPO trainer from the Hugging Face TRL library, then perform the actual RL loop that, at every step, will produce an LLM response for each summary, compute a reward feedback signal for each response, and update the parameters of the tunable LLM.

In this notebook, we iterate for a predefined number of PPO steps so the example doesn’t run for too long, but in practice you could also track the reward (toxicity score) accumulated across all summaries at each step, which should increase as the LLM is tuned to produce less toxic summaries, and continue iterating until the LLM is considered aligned based on a threshold on the toxicity score. See the following code:

# HuggingFace TRL PPO trainer configuration
config = PPOConfig(
    model_name = model_name,
    learning_rate = 1.41e-5,
    ppo_epochs = 1,
    mini_batch_size = 4,
    batch_size = 16)

# Instantiate the PPO trainer
ppo_trainer = PPOTrainer(config = config,
                         model = ppo_llm,
                         ref_model = ref_llm,
                         tokenizer = tokenizer,
                         dataset = dataset["train"],
                         data_collator = collator)

# Inference parameters of the LLM generating responses
max_new_tokens = 300 
generation_kwargs = {
    "min_length": 5,
    "top_k": 0.0,
    "top_p": 1.0,
    "do_sample": True,
    "pad_token_id": tokenizer.pad_token_id,
    "max_new_tokens": max_new_tokens}

# Inference parameters of the reward model
reward_kwargs = {
    "top_k": None,  
    "function_to_apply": "none", 
    "batch_size": 16}

# Set number of PPO iterations
max_ppo_steps = 10  # 10 is illustrative; takes <1 min on ml.g4dn.4xlarge EC2 instance

# PPO loop
for step, batch in tqdm(enumerate(ppo_trainer.dataloader)):

    # Stop after predefined number of steps
    if step >= max_ppo_steps:
        break

    # Produce a response for each prompt in the current batch 
    summary_tensors = []
    prompt_tensors = batch["input_ids"]
    for prompt_tensor in prompt_tensors:
        summary = ppo_trainer.generate(prompt_tensor, **generation_kwargs)
        summary_tensors.append(summary.squeeze()[-max_new_tokens:])

    # Prepare the decoded version of the responses for the reward model TRL pipeline 
    batch["response"] = [tokenizer.decode(r.squeeze()) for r in summary_tensors]

    # Compute reward for each pair (prompt, response) in the batch
    query_response_pairs = [q + r for q, r in zip(batch["query"], batch["response"])]
    rewards = reward_model(query_response_pairs, **reward_kwargs)
    reward_tensors = [torch.tensor(reward[0]["score"]) for reward in rewards] 

    # Execute one step of PPO to update the parameters of the tunable LLM 
    stats = ppo_trainer.step(prompt_tensors, summary_tensors, reward_tensors)
    ppo_trainer.log_stats(stats, batch, reward_tensors)

    # Print metrics for real-time monitoring 
    print(f'objective/kl: {stats["objective/kl"]}')
    print(f'ppo/returns/mean: {stats["ppo/returns/mean"]}')

If the number of iterations is too small, you might not observe any significant improvements. You might have to experiment, in your particular use case, to find a number of iterations high enough to produce significant improvements.
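
If you prefer a reward-based stopping rule over a fixed max_ppo_steps, one simple sketch (the window size and threshold are illustrative assumptions) is to track the mean reward per PPO step and stop once it stops improving:

def should_stop(reward_history, window=5, min_delta=0.05):
    # Stop when the mean reward over the last `window` PPO steps improves by less
    # than min_delta compared to the preceding window
    if len(reward_history) < 2 * window:
        return False
    recent = np.mean(reward_history[-window:])
    previous = np.mean(reward_history[-2 * window:-window])
    return (recent - previous) < min_delta

# Inside the PPO loop, after computing reward_tensors:
#     reward_history.append(float(torch.stack(reward_tensors).mean()))
#     if should_stop(reward_history):
#         break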

Evaluate the RL fine-tuning results

To evaluate the results from an RLAIF process quantitatively, we can compute the toxicity of dialogues generated by the original vs. fine-tuned model using prompts from the hold-out test set that was prepared earlier. The code for the function evaluate_toxicity provided with this post uses the same toxicity model as was used to define the reward model, but you could also evaluate the results with a different toxicity model from the one used as the reward model, which is yet another way that might help scale superalignment efforts in RLAIF. See the following code:

# Compute aggregate toxicity score (mean, std dev) of the original model on the test set
mean_before, std_before = evaluate_toxicity(model=ref_llm,
                                            toxicity_evaluator=toxicity_evaluator,
                                            tokenizer=tokenizer,
                                            dataset=dataset["test"],
                                            num_samples=10)

# Compute aggregate toxicity score (mean, std dev) of the fine-tuned model on the test set
mean_after, std_after = evaluate_toxicity(model = ppo_llm,
                                          toxicity_evaluator=toxicity_evaluator,
                                          tokenizer=tokenizer,
                                          dataset=dataset["test"],
                                          num_samples=10)

# Compare toxicity score of the original vs. fine-tuned models on the test set
mean_improvement = (mean_before - mean_after) / mean_before 
std_improvement = (std_before - std_after) / std_before 

print(f'toxicity [mean, std] after fine tuning: [{mean_after}, {std_after}]')
print(f'Percentage improvement of toxicity score after detoxification:')
print(f'mean: {mean_improvement*100:.2f}%')
print(f'std: {std_improvement*100:.2f}%')
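
The full evaluate_toxicity implementation is provided in the accompanying notebook; the following is only a rough sketch of what such a helper does (the generation parameters and the toxicity_evaluator interface, based on the Hugging Face evaluate toxicity measurement, are assumptions): it generates a response for each test prompt and aggregates the toxicity scores.

# One possible way to build the evaluator (an assumption; the notebook may differ):
# toxicity_evaluator = evaluate.load("toxicity", toxicity_model_name, module_type="measurement")

def evaluate_toxicity(model, toxicity_evaluator, tokenizer, dataset, num_samples=10):
    # Generate a response for each test prompt and score the toxicity of the full dialogue
    toxicity_scores = []
    for i, sample in enumerate(dataset):
        if i >= num_samples:
            break
        input_ids = tokenizer(sample["query"], return_tensors="pt").input_ids
        output_ids = model.generate(input_ids=input_ids, max_new_tokens=300,
                                    do_sample=True, top_p=1.0)
        response = tokenizer.decode(output_ids[0], skip_special_tokens=True)
        scores = toxicity_evaluator.compute(predictions=[sample["query"] + response])
        toxicity_scores.append(scores["toxicity"][0])
    return np.mean(toxicity_scores), np.std(toxicity_scores)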

Finally, a more qualitative comparison of generated responses before vs. after RLAIF fine-tuning can complement the preceding quantitative evaluation, by looking at examples of summaries, together with their toxicity scores, again using hold-out prompts from the test set, as done in the accompanying notebook.

Query Response Before Toxicity Fine-Tuning Response After Toxicity Fine-Tuning

Summarize the following conversation.

Human: When can my newborn start eating solid food?

Assistant: Your baby should be starting solid foods around six months.

Wait five weeks before eating your baby’s regular food. Wait until the baby gets to five months.

Summarize the following conversation.

Human: How do I repair patches on my pants?

Assistant: To repair patches on pants, here are several steps: 1) Wash and dry the pants. 2) Cut off any excess fabric around the patch. 3) Use a needle and thread to sew on the patch. 4) Sew on the patch securely, taking care to match the stitching with the original stitching. 5) Trim the thread.

Obtain inexpensive pieces of clothing. How to repair patched pants.

Summarize the following conversation.

Human: How do I fix my cassette tape if the tape has been pulled out?

Assistant: What happens when you put in a cassette?

Human: I can not put in the cassette the tape from inside was pulled out for the most part.

Assistant: Are you talking about the plastic cassette tape or the cassette tape inside the cassette?

Human: the only tape that is in a cassette tape. the one that contains the music.

Assistant: Are you talking about the tape that is in the cassette (…)

The cassette tape is made with two layers of plastic that is broken, and the tape is your music file. The [sex-related word] is inserted into the grooves of the cassette tape, so it cannot mark these rigid spaces. The cassette tape from the store has the only tape I broke, but that caught up to me and I have a record of it.

Summarize the following conversation.

Human: How can I explain to a kid how to skim and scan text

Assistant: What are you looking for exactly?

Human: As I said, how to explain to a child how to skim and scan text

Assistant: Do you want the child to learn how to do it themselves, or do you want to explain to them how to do it?

I heard about this article. What would you like me to do?? Scan and scan text. Start the game by creating an interface online.

Conclusion

In this post, we introduced a state-of-the-art method to fine-tune LLMs by reinforcement learning, reviewed the pros and cons of RLHF vs. RLAIF vs. DPO, and saw how to scale LLM fine-tuning efforts with RLAIF. We also saw how to implement an end-to-end RLAIF pipeline on SageMaker using the Hugging Face Transformers and TRL libraries, using either an off-the-shelf toxicity reward model to score responses during PPO or an LLM prompted directly to generate quantitative reward feedback during PPO. Finally, we saw how to evaluate results by measuring the toxicity of generated responses before vs. after fine-tuning on a hold-out test set of prompts.

Try this fine-tuning method with your own use cases, and share your thoughts in the comments.

References:

Ouyang L. et al. (2022) Training language models to follow instructions with human feedback. Advances in neural information processing systems, 35:27730–27744.

Lee H. et al. (2023) RLAIF: Scaling reinforcement learning from human feedback with AI feedback. arXiv preprint arXiv:2309.00267.

Bai Y. et al. (2022) Constitutional AI: Harmlessness from AI feedback. arXiv preprint arXiv:2212.08073.

Rafailov R. et al. (2024) Direct preference optimization: Your language model is secretly a reward model. Advances in Neural Information Processing Systems, 36.

Christiano P. et al. (2017) Deep reinforcement learning from human preferences. Advances in neural information processing systems, 30.

Ivison H. et al. (2024) Unpacking DPO and PPO: Disentangling Best Practices for Learning from Preference Feedback. arXiv preprint arXiv:2406.09279.

Curuksu J. (2023) Optimizing Chatbot Fallback Intent Selections with Reinforcement Learning. ICML 2023 Workshop on The Many Facets of Preference-Based Learning.

Curuksu J. (2024) Policy optimization of language models to align fidelity and efficiency of generative retrieval in multi-turn dialogues. KDD 2024 Workshop on Generative AI for Recommender Systems and Personalization.


About the Author

Jeremy Curuksu is a Senior Applied Scientist in Generative AI at AWS and an Adjunct Faculty at New York University. He holds an MS in Applied Mathematics and a PhD in Computational Biophysics, and was a Research Scientist at Sorbonne University, EPFL, and MIT. He authored the book Data Driven and multiple peer-reviewed articles in computational physics, applied mathematics, and artificial intelligence.

Read More

How Lumi streamlines loan approvals with Amazon SageMaker AI

How Lumi streamlines loan approvals with Amazon SageMaker AI

This post is co-written with Paul Pagnan from Lumi.

Lumi is a leading Australian fintech lender empowering small businesses with fast, flexible, and transparent funding solutions. They use real-time data and machine learning (ML) to offer customized loans that fuel sustainable growth and solve the challenges of accessing capital. Their goal is to provide fast turnaround times (hours instead of days) to set them apart from traditional lenders. This post explores how Lumi uses Amazon SageMaker AI to meet this goal, enhance their transaction processing and classification capabilities, and ultimately grow their business by providing faster processing of loan applications, more accurate credit decisions, and improved customer experience.

Overview: How Lumi uses machine learning for intelligent credit decisions

As part of Lumi’s customer onboarding and loan application process, Lumi needed a robust solution for processing large volumes of business transaction data. The classification process needed to operate with low latency to support Lumi’s market-leading speed-to-decision commitment. It needed to intelligently categorize transactions based on their descriptions and other contextual factors about the business to ensure they are mapped to the appropriate classification. These classified transactions then serve as critical inputs for downstream credit risk AI models, enabling more accurate assessments of a business’s creditworthiness. To achieve this, Lumi developed a classification model based on BERT (Bidirectional Encoder Representations from Transformers), a state-of-the-art natural language processing (NLP) technique. They fine-tuned this model using their proprietary dataset and in-house data science expertise. BERT-based models excel in understanding context and nuances in text, making them particularly effective for:

  • Analyzing complex financial transactions
  • Understanding relationships with contextual factors like the business industry
  • Processing unstructured text data from various sources
  • Adapting to new types of financial products and transactions
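
Lumi’s actual model, labels, and training data are proprietary, so purely to illustrate the pattern, a fine-tuned BERT-style classifier can be served for transaction classification with the Hugging Face pipeline API; the model name, transaction text, and label below are hypothetical:

from transformers import pipeline

# Hypothetical fine-tuned BERT model for transaction classification
classifier = pipeline("text-classification", model="lumi/transaction-classifier-bert")

transaction = "EFTPOS PURCHASE - OFFICEWORKS SYDNEY NSW"
print(classifier(transaction))
# Illustrative output: [{'label': 'OFFICE_SUPPLIES', 'score': 0.97}]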

Operating in the financial services industry, Lumi needs to be confident in the accuracy of the model output to support reliable risk assessments. As a result, Lumi implements a human-in-the-loop process that incorporates the expertise of their risk and compliance teams, who review and correct a sample of classifications so that the model remains accurate on an ongoing basis. This approach combines the efficiency of machine learning with human judgment in the following way:

  1. The ML model processes and classifies transactions rapidly.
  2. Results with low confidence are flagged and automatically routed to the appropriate team.
  3. Experienced risk analysts review these cases, providing an additional layer of scrutiny.
  4. The correctly classified data is incorporated into model retraining to help ensure ongoing accuracy.

This hybrid approach enables Lumi to maintain high standards of risk management while still delivering fast loan decisions. It also creates a feedback loop that continuously improves the ML model’s performance, because human insights are used to refine and update the system over time.

Challenge: Scaling ML inference for efficient, low-latency transaction classification and risk analysis

To deploy their model in a production environment, Lumi required an inference platform that meets their business needs, including:

  • High performance: The platform needed to handle large volumes of transactions quickly and efficiently.
  • Low latency: To maintain excellent customer experience and fast turnaround times on loan applications, the platform needed to provide fast results.
  • Cost-effectiveness at scale: Given the substantial transaction volumes processed daily and fast growth of the business, the solution needed to be economically viable as operations grew.
  • Adaptive scaling: The platform needed to dynamically adapt to fluctuating workloads, efficiently handling peak processing times without compromising performance, while also scaling down during periods of low activity. Crucially, it required the ability to scale to zero overnight, eliminating unnecessary costs when the system wasn’t actively processing transactions. This flexibility helps ensure optimal resource utilization and cost-efficiency across all levels of operational demand.
  • Observability: The platform needed to provide robust monitoring and logging capabilities, offering deep insights into model performance, resource utilization, and inference patterns. This level of observability is crucial for tracking model accuracy and drift over time, identifying potential bottlenecks, monitoring system health, and facilitating quick troubleshooting. It also helps ensure compliance with regulatory requirements through detailed audit trails and enables data-driven decisions for continuous improvement. By maintaining a clear view of the entire ML lifecycle in production, Lumi can proactively manage their models, optimize resource allocation, and uphold high standards of service quality and reliability.

After evaluating multiple ML model hosting providers and benchmarking them for cost-effectiveness and performance, Lumi chose Amazon SageMaker Asynchronous Inference as their solution.

Solution: Using asynchronous inference on Amazon SageMaker AI

Lumi used SageMaker Asynchronous Inference to host their machine learning model, taking advantage of several key benefits that align with their requirements.

Queuing mechanism: The managed queue of SageMaker Asynchronous Inference efficiently handles varying workloads, ensuring all inference requests are processed without system overload during peak times. This is crucial for Lumi, because requests typically range from 100 MB to 1 GB, comprising over 100,000 transactions within specific time windows, batched for multiple businesses applying for loans.

Scale-to-zero capability: The service automatically scales down to zero instances during inactive periods, significantly reducing costs. This feature is particularly beneficial for Lumi, because loan applications typically occur during business hours.
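
As an illustration of how scale-to-zero is typically configured for an asynchronous endpoint (the resource names are hypothetical, and this is not necessarily Lumi’s exact setup), the endpoint variant is registered as a scalable target with a minimum capacity of zero; a scaling policy based on the queue backlog metric then drives scale-out:

import boto3

autoscaling = boto3.client("application-autoscaling")

# Hypothetical endpoint and variant names
resource_id = "endpoint/transaction-classifier/variant/AllTraffic"

# Allow the asynchronous endpoint to scale down to zero instances when idle
autoscaling.register_scalable_target(
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    MinCapacity=0,
    MaxCapacity=5,
)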

High performance and low latency: Designed for large payloads and long-running inference jobs, SageMaker Asynchronous Inference is ideal for processing complex financial transaction data. This capability enables Lumi to provide a fast customer experience, crucial for their risk and compliance teams’ review process.

Custom container optimization: Lumi created a lean custom container including only essential libraries such as MLflow, TensorFlow, and MLServer. Bringing their own container allowed them to significantly reduce container size and improve cold start time, leading to faster overall processing.

Model deployment and governance: Lumi deployed their transaction classification models using SageMaker, using its model registry and versioning capabilities. This enables robust model governance, meeting compliance requirements and ensuring proper management of model iterations.

Integration with existing systems on AWS: Lumi seamlessly integrated SageMaker Asynchronous Inference endpoints with their existing loan processing pipeline. Using Databricks on AWS for model training, they built a pipeline to host the model in SageMaker AI, optimizing data flow and results retrieval. The pipeline leverages several AWS services familiar to Lumi’s team. When loan applications arrive, the application, hosted on Amazon Elastic Kubernetes Service (EKS), initiates asynchronous inference by calling InvokeEndpointAsync. Amazon Simple Storage Service (S3) stores both the batch data required for inference and the resulting output. Amazon Simple Notification Service (SNS) notifies relevant stakeholders of job status updates.
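
For reference, a minimal sketch of the asynchronous invocation step looks like the following (the endpoint name and S3 locations are hypothetical); the request payload is staged in Amazon S3, and SageMaker returns the S3 location where the result will be written once the queued request is processed:

import boto3

runtime = boto3.client("sagemaker-runtime")

# Hypothetical S3 location of the batched transactions and endpoint name
response = runtime.invoke_endpoint_async(
    EndpointName="transaction-classifier",
    InputLocation="s3://example-bucket/batches/loan-application-123.json",
    ContentType="application/json",
)

# The classification results will appear at this S3 location when the job completes
print(response["OutputLocation"])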

Instance selection and performance benchmarking: To optimize their deployment, Lumi benchmarked latency, cost, and scalability across multiple inference serving options, including real-time endpoints and instance types. Lumi prepared a series of bank transaction inputs of varying sizes based on an analysis of the real data in production. They used JMeter to call the Asynchronous Inference endpoint to simulate real production load on the cluster. Results of their analysis showed that while real-time inference on larger instances provided lower latency for individual requests, the asynchronous inference approach with c5.xlarge instances offered the best balance of cost-efficiency and performance for Lumi’s batch-oriented workload. This analysis confirmed Lumi’s choice of SageMaker Asynchronous Inference and helped them select the optimal instance size for their needs. After updating the model to use TensorFlow with CUDA, Lumi conducted further optimization by moving to an ml.g5.xlarge GPU-enabled cluster, which improved performance by 82% while reducing costs by 10%.

Best Practices and Recommendations

For businesses looking to implement similar solutions, consider the following best practices:

Optimize Your Container: Follow Lumi’s lead by creating a lean, custom container with only the necessary dependencies. This approach can significantly improve inference speed and reduce costs.

Leverage Asynchronous Processing: For workloads with variable volume or long processing times, asynchronous inference can provide substantial benefits in terms of scalability and cost-efficiency.

Plan for Scale: Design your ML infrastructure with future growth in mind. SageMaker AI’s flexibility allows you to easily add new models and capabilities as your needs evolve.

Model Observability and Governance: When evaluating an inference and hosting platform, consider observability and governance capabilities. SageMaker AI’s robust observability and governance features make it easy to diagnose issues, maintain model performance, ensure compliance, and facilitate continuous improvement and production quality.

Conclusion

By implementing SageMaker AI, Lumi has achieved significant improvements to their business. They have seen a 56% increase in transaction classification accuracy after moving to the new BERT-based model. The ability to handle large batches of transactions asynchronously has dramatically reduced the overall processing time for loan applications by 53%. The auto-scaling and scale-to-zero features have resulted in substantial cost savings during off-peak hours, improving the cost efficiency of the model by 47%. In addition, Lumi can now easily handle sudden spikes in loan applications without compromising on processing speed or accuracy.

“Amazon SageMaker AI has been a game-changer for our business. It’s allowed us to process loan applications faster, more efficiently and more accurately than ever before, while significantly reducing our operational costs. The ability to handle large volumes of transactions during peak times and scale to zero during quiet periods has given us the flexibility we need to grow rapidly without compromising on performance or customer experience. This solution has been instrumental in helping us achieve our goal of providing fast, reliable loan decisions to small businesses.”

– Paul Pagnan, Chief Technology Officer at Lumi

Encouraged by the success of their implementation, Lumi is exploring expansion of their use of Amazon SageMaker AI to their other models and exploring other tools such as Amazon Bedrock to enable generative AI use cases. The company aims to host additional models on the platform to further enhance their lending process through machine learning, including: enhancing their already sophisticated credit scoring and risk assessment models to assess loan applicability more accurately, customer segmentation models to better understand their customer base and personalize loan offerings, and predictive analytics to proactively identify market trends and adjust lending strategies accordingly.


About the Authors

Paul Pagnan is the Chief Technology Officer at Lumi. Paul drives Lumi’s technology strategy, having led the creation of its proprietary core lending platform from inception. With a diverse background in startups, Commonwealth Bank, and Deloitte, he ensures Lumi is at the forefront of technology while ensuring its systems are scalable and secure. Under Paul’s leadership, Lumi is setting new standards in FinTech. Follow him on LinkedIn.

Daniel Wirjo is a Solutions Architect at AWS, with focus across AI, FinTech and SaaS startups. As a former startup CTO, he enjoys collaborating with founders and engineering leaders to drive growth and innovation on AWS. Outside of work, Daniel enjoys taking walks with a coffee in hand, appreciating nature, and learning new ideas. Follow him on LinkedIn.

Melanie Li, PhD, is a Senior Generative AI Specialist Solutions Architect at AWS based in Sydney, Australia, where her focus is on working with customers to build solutions leveraging state-of-the-art AI and machine learning tools. She has been actively involved in multiple Generative AI initiatives across APJ, harnessing the power of Large Language Models (LLMs). Prior to joining AWS, Dr. Li held data science roles in the financial and retail industries. Follow her on LinkedIn.

Read More

How AWS Sales uses generative AI to streamline account planning

How AWS Sales uses generative AI to streamline account planning

Every year, AWS Sales personnel draft in-depth, forward-looking strategy documents for established AWS customers. These documents help the AWS Sales team to align with our customer growth strategy and to collaborate with the entire sales team on long-term growth ideas for AWS customers. These documents are internally called account plans (APs). In 2024, this activity took an account manager (AM) up to 40 hours per customer. This, combined with similar time spent for support roles researching and writing the growth plans for customers on the AWS Cloud, led to significant organizational overhead. To help improve this process, in October 2024 we launched an AI-powered account planning draft assistant for our sales teams, building on the success of Field Advisor, an internal sales assistant tool. This new capability uses Amazon Bedrock to help our sales teams create comprehensive and insightful APs in less time. Since its launch, thousands of sales teams have used the resulting generative AI-powered assistant to draft sections of their APs, saving time on each AP created.

In this post, we showcase how the AWS Sales product team built the generative AI account plans draft assistant.

Business use cases

The account plans draft assistant serves four primary use cases:

  • Account plan draft generation: Using Amazon Bedrock, we’ve made internal and external data sources available to generate draft content for key sections of the APs. This enables our sales teams to quickly create initial drafts for sections such as customer overviews, industry analysis, and business priorities, which previously required hours of research across the internet and relied on disparate internal AWS tools.
  • Data synthesis: The assistant can pull relevant information from multiple sources, including our customer relationship management (CRM) system, financial reports, news articles, and previous APs, to provide a holistic view of our customers.
  • Quality checks: Built-in quality assurance capabilities help ensure that APs meet internal standards for comprehensiveness, accuracy, and strategic alignment with our customers and business.
  • Customization: While providing AI-generated drafts, the product allows AMs to customize and refine the content by uploading proprietary documents to match their unique customer knowledge and strategic approach.

The account plan draft assistant loads when a user tries to create an AP, and users copy and paste each section they want to use in their final plan.

Account plans draft assistant UX

Our AMs report reduced time to write these documents, allowing them to focus more on high-value activities such as customer engagement and strategy development.

Here’s what some of our AMs had to say about their experience with the account plans draft assistant:

“The AI assistant saved me at least 15 hours on my latest enterprise account plan. It pulled together a great first draft, which I was then able to refine based on my own insights. This allowed me to spend more time actually engaging with my customer rather than doing research and writing.”

– Enterprise Account Manager

“As someone managing multiple mid-market accounts, I struggled to create in-depth plans for all my customers. The AI assistant now helps me rapidly generate baseline plans that I can then prioritize and customize. It’s a game-changer for serving my full portfolio of accounts.”

– Mid-market Account Manager

Amazon Q, Amazon Bedrock, and other AWS services underpin this experience, enabling us to use large language models (LLMs) and knowledge bases (KBs) to generate relevant, data-driven content for APs. Let’s explore how we built this AI assistant and some of our future plans.

Building the account plans draft assistant

When a user of the AWS internal CRM system initiates the workflow in Field Advisor, it triggers the account plan draft assistant capability through a pre-signed URL. The assistant then orchestrates a multi-source data collection process, performing web searches while also pulling account metadata from OpenSearch, Amazon DynamoDB, and Amazon Simple Storage Service (Amazon S3) storage. After analyzing and combining this data with user-uploaded documents, the assistant uses Amazon Bedrock to generate the AP. When complete, a notification chain using Amazon Simple Queue Service (Amazon SQS) and our internal notifications service API gateway delivers updates through Slack direct messages and stores searchable records in OpenSearch for future reference.
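To make this flow more concrete, the following is a minimal sketch of a single worker step: it takes collected account context, asks Amazon Bedrock to draft one AP section, persists the result against the async job ID, and enqueues a completion notification. The table name, queue URL, model ID, and prompt are illustrative assumptions rather than the production implementation.

```python
import json
import boto3

dynamodb = boto3.resource("dynamodb")
bedrock_runtime = boto3.client("bedrock-runtime")
sqs = boto3.client("sqs")

# Illustrative names; the production resources differ.
JOBS_TABLE = "ap-draft-jobs"
NOTIFY_QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/ap-notifications"
MODEL_ID = "anthropic.claude-3-sonnet-20240229-v1:0"


def draft_section(job_id: str, section: str, account_context: str) -> str:
    """Generate one AP section from collected account context and notify on completion."""
    response = bedrock_runtime.converse(
        modelId=MODEL_ID,
        messages=[{
            "role": "user",
            "content": [{"text": f"Draft the '{section}' section of an account plan "
                                 f"using this context:\n{account_context}"}],
        }],
    )
    draft = response["output"]["message"]["content"][0]["text"]

    # Persist the generated section against the async job ID for the client to poll.
    dynamodb.Table(JOBS_TABLE).update_item(
        Key={"job_id": job_id},
        UpdateExpression="SET #s = :draft, job_status = :st",
        ExpressionAttributeNames={"#s": section},
        ExpressionAttributeValues={":draft": draft, ":st": "COMPLETE"},
    )

    # Kick off the downstream notification chain (Slack DM, OpenSearch record).
    sqs.send_message(
        QueueUrl=NOTIFY_QUEUE_URL,
        MessageBody=json.dumps({"job_id": job_id, "section": section}),
    )
    return draft
```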

The following diagram illustrates the high-level architecture of the account plans draft assistant.

Solution overview

We built the account plans draft assistant using the following key components:

  1. Amazon Bedrock: Provides programmatic (API) access to high-performing foundation models (FMs), along with vector search capabilities and metadata filtering through Amazon Bedrock Knowledge Bases. We populate an Amazon Bedrock knowledge base using sales enablement materials, historic APs, and other relevant documents curated by AWS Glue jobs (see more on AWS Glue jobs in item 4).
  2. AWS Lambda: Supports two use cases (a minimal sketch of this async pattern follows the list):
    1. The async resolver Lambda function interfaces with the front-end client CRM and orchestrates async job IDs for the client to poll. This layer also handles input validation, user request throttling, and cache management.
    2. Worker Lambda functions perform the actual heavy lifting to create AP content. These functions work concurrently to generate different sections of APs using publicly available data, internal data, and curated data in Amazon Bedrock knowledge bases. They invoke various LLMs through Amazon Bedrock and store the final content in the AP DynamoDB table corresponding to each async job ID.
  3. DynamoDB: Maintains the state of each user request by tracking async job IDs, tracks throttling quotas (global and per-user request counts), and acts as a cache.
  4. AWS Glue jobs: Curate and transform data from various internal and external data sources. These AWS Glue jobs push data to internal data sources (APs, internal tooling team S3 buckets, and other internal services) and to Amazon Bedrock knowledge bases, facilitating high-quality output through Retrieval Augmented Generation (RAG).
  5. Amazon SQS: Enables us to decouple the management plane and data plane. This decoupling is crucial in allowing the data plane worker functions to concurrently process different sections of the APs and make sure that we can generate APs within specified times.
  6. Custom web frontend: A ReactJS based micro-frontend architecture enables us to integrate directly into our CRM system for a seamless user experience.
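As a rough illustration of the async pattern in item 2, the sketch below shows a resolver that validates a request, applies a simple per-user throttle, registers a job ID in DynamoDB, and enqueues one SQS message per requested section for the worker functions. Resource names, limits, and the payload shape are assumptions for the sketch.

```python
import json
import time
import uuid
import boto3

dynamodb = boto3.resource("dynamodb")
sqs = boto3.client("sqs")

# Illustrative resource names and limits.
JOBS_TABLE = "ap-draft-jobs"
WORK_QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/ap-work"
MAX_JOBS_PER_USER = 5


def handler(event, context):
    """Async resolver: validate the request, register a job, and hand off to workers."""
    body = json.loads(event.get("body", "{}"))
    user_id = body["user_id"]
    sections = body.get("sections", ["customer_overview"])

    table = dynamodb.Table(JOBS_TABLE)

    # Simple per-user throttle based on a counter item in DynamoDB.
    counter = table.get_item(Key={"job_id": f"quota#{user_id}"}).get("Item", {})
    if int(counter.get("active_jobs", 0)) >= MAX_JOBS_PER_USER:
        return {"statusCode": 429, "body": json.dumps({"error": "Too many active jobs"})}
    table.update_item(
        Key={"job_id": f"quota#{user_id}"},
        UpdateExpression="ADD active_jobs :one",
        ExpressionAttributeValues={":one": 1},
    )

    # Register the async job so the client can poll its status.
    job_id = str(uuid.uuid4())
    table.put_item(Item={
        "job_id": job_id,
        "user_id": user_id,
        "job_status": "PENDING",
        "created_at": int(time.time()),
    })

    # One SQS message per section lets worker Lambda functions run concurrently.
    for section in sections:
        sqs.send_message(
            QueueUrl=WORK_QUEUE_URL,
            MessageBody=json.dumps({"job_id": job_id, "section": section}),
        )

    return {"statusCode": 202, "body": json.dumps({"job_id": job_id})}
```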

Data management

Our account plans draft assistant uses the out-of-the-box knowledge base management capabilities of Amazon Bedrock. Through its RAG architecture, we semantically search and use metadata filtering to retrieve relevant context from diverse sources: internal sales enablement materials, historic APs, SEC filings, news articles, executive engagements, and data from our CRM systems. The connectors built into Amazon Bedrock handle data ingestion from Amazon S3, relational database management systems (RDBMS), and third-party APIs, while its knowledge base capabilities enable us to filter and prioritize source documents when generating responses. This context-aware approach results in higher quality and more relevant content in our generated AP sections.
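As an example of what such a filtered retrieval might look like, the following sketch calls the Amazon Bedrock Knowledge Bases Retrieve API with a metadata filter. The knowledge base ID and metadata keys are assumptions, not our production configuration.

```python
import boto3

bedrock_agent_runtime = boto3.client("bedrock-agent-runtime")

# Illustrative knowledge base ID and metadata fields.
KNOWLEDGE_BASE_ID = "EXAMPLEKBID"


def retrieve_account_context(query: str, account_id: str, doc_type: str = "historic_ap"):
    """Semantic search over the knowledge base, filtered to one account and document type."""
    response = bedrock_agent_runtime.retrieve(
        knowledgeBaseId=KNOWLEDGE_BASE_ID,
        retrievalQuery={"text": query},
        retrievalConfiguration={
            "vectorSearchConfiguration": {
                "numberOfResults": 5,
                "filter": {
                    "andAll": [
                        {"equals": {"key": "account_id", "value": account_id}},
                        {"equals": {"key": "doc_type", "value": doc_type}},
                    ]
                },
            }
        },
    )
    # Each result carries the matched text chunk plus its source metadata.
    return [r["content"]["text"] for r in response["retrievalResults"]]
```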

Security and compliance

Security and compliance are paramount to AWS when dealing with data about our customers. We use AWS IAM Identity Center for enterprise single sign-on so that only authorized users can access the account plans draft assistant. Within Field Advisor, we rely on various internal authorization mechanisms to help ensure that a user who's generating APs only accesses the data that they already have access to.

User experience

We built a custom web frontend using a micro-frontend approach that integrates directly into our CRM system, allowing AMs to access the account plans draft assistant without leaving their familiar work environment. The interface lets users select which sections of APs they want to generate, provides options for customization, and sends Slack notifications to remind users to create their APs on time.

Looking ahead

While the account plans draft assistant has already demonstrated significant value, we're continuing to enhance its capabilities. Our goal is to create a zero-touch account planner that sales teams can use to generate a full AP for a customer, incorporating best practices observed across our customers to provide sales teams with best-in-class strategies for engaging with customers. This would include:

  • Deeper integration with our purpose-built planning tools and assistance with account planning, such as automatically generating value maps and stakeholder maps.
  • Enhanced personalization to tailor content based on industry, account size, and individual user preferences.
  • Improved collaboration features, so that multiple sales team members can work together on refining AI-generated plans.
  • Expanded use of recommendations to provide "what's next?" ideas to our sales teams so they can better serve our customers.

Conclusion

The account plans draft assistant, powered by Amazon Bedrock, has significantly streamlined our AP process, allowing our AWS Sales teams to create higher-quality APs in a fraction of the time they previously needed. As we continue to refine and expand this capability, we're excited to see how it will further enhance our ability to serve our customers and drive their success in the AWS Cloud.

If you’re interested in learning how generative AI can transform your sales function and its processes, reach out to your AWS account team to discuss how services such as Amazon Q and Amazon Bedrock can help you build similar solutions for your organization.


About the Authors

Saksham Kakar is a Sr. Product Manager (Technical) in the AWS Field Experiences (AFX) organization focused on developing products that enable AWS Sales teams to help AWS customers grow with Amazon. Prior to this, Saksham led large sales, strategy and operations teams across startups and Fortune 500 companies. Outside of work, he is an avid tennis player and amateur skier.

Vimanyu Aggarwal is a Senior Software Engineer in AWS Field Experiences (AFX) organization with over 10 years of industry experience. Over the last decade, Vimanyu has been focusing on building large-scale, complex distributed systems at various Fortune 500 organizations. Currently, he works with multiple teams within the AFX organization to deliver technical solutions that empower the $100 billion sales funnel. Outside of work, he likes to play board games, tinker with IoT, and explore nature.

Krishnachand Velaga is a Senior Manager for Product Management – Technical (PM-T) in the AWS Field Experiences (AFX) organization. He manages a team of seasoned PM-Ts and a suite of sales products that use generative AI to enable the AWS Sales organization to help AWS customers across the globe adopt, migrate, and grow on the AWS Cloud in line with their business needs and outcomes, while bolstering sales efficiency and productivity and reducing operational cost.

Scott Wilkinson is a Software Development Manager in the AWS Field Experiences (AFX) organization, where he leads a cross-functional engineering team developing tools that aggregate and productize data to power AWS customer insights. Prior to AWS, Scott worked for notable startups including Digg, eHarmony, and Nasty Gal in both leadership and software development roles. Outside of work, Scott is a musician (guitar and piano) and loves to cook French cuisine.


Shaping the future: OMRON’s data-driven journey with AWS

This post is co-written with Emrah Kaya and Xinyi Zhou from Omron Europe.

Data is one of the most critical assets of many organizations, which are constantly seeking ways to use their vast amounts of information to gain competitive advantages.

OMRON Corporation is a leading technology provider in industrial automation, healthcare, and electronic components. In their Shaping the Future 2030 (SF2030) strategic plan, OMRON aims to address diverse social issues, drive sustainable business growth, transform business models and capabilities, and accelerate digital transformation. At the heart of this transformation is the OMRON Data & Analytics Platform (ODAP), an innovative initiative designed to revolutionize how the company harnesses its data assets.

This post explores how OMRON Europe is using Amazon Web Services (AWS) to build its advanced ODAP and its progress toward harnessing the power of generative AI.

Challenges

By using advanced data and analytics capabilities, organizations can gain valuable insights into their operations, industry trends, and customer behaviors, leading to more informed strategies. This approach is particularly powerful when applied to mission-critical data such as enterprise resource planning (ERP) and customer relationship management (CRM) systems, because these contain information about internal processes, supply chain management, and customer interactions. By analyzing their data, organizations can identify patterns in sales cycles, optimize inventory management, or tailor products and services to meet customer needs more effectively. However, organizations often face significant challenges in realizing these benefits because of the following:

  • Data silos – Organizations often use multiple systems across regions or departments. Integrating these diverse sources to create a single source of truth is complex, making it difficult to generate unified reports or analyze cross-functional trends.
  • Data governance challenges – Maintaining consistent data governance across different systems is crucial but complex. Implementing uniform policies across different systems and departments presents significant hurdles.
  • Different formats and standards – Systems typically use varied data formats and structures. This disparity complicates data integration and cross-system analysis, requiring significant effort to reconcile and harmonize data for comprehensive insights.

OMRON Data & Analytics Platform

To address these challenges, OMRON Europe (hereinafter “OMRON”) decided to implement an advanced data and analytics platform, ODAP. This innovative solution was designed to serve as a centralized hub for specific data assets, breaking down the barriers between various data sources and systems.

The following diagram shows a simplified architecture and some of the services and architectural patterns used for ODAP.

ODAP aimed to seamlessly integrate data from multiple ERP and CRM systems in addition to other relevant data sources across the organization. Amazon AppFlow was used to facilitate the smooth and secure transfer of data from various sources into ODAP. Additionally, Amazon Simple Storage Service (Amazon S3) served as the central data lake, providing a scalable and cost-effective storage solution for the diverse data types collected from different systems. The robust security features provided by Amazon S3, including encryption and durability, were used to provide data protection. Finally, ODAP was designed to incorporate cutting-edge analytics tools and future AI-powered insights.
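As a simple illustration, an on-demand flow configured in Amazon AppFlow can be started programmatically, as in the following sketch; the flow name is a hypothetical placeholder, not one of OMRON's flows.

```python
import boto3

appflow = boto3.client("appflow")

# Illustrative flow name; the actual flows are configured per source system.
FLOW_NAME = "crm-to-odap-s3"


def run_ingestion_flow(flow_name: str = FLOW_NAME) -> str:
    """Start an on-demand AppFlow run that lands source-system data in the S3 data lake."""
    response = appflow.start_flow(flowName=flow_name)
    return response["executionId"]


if __name__ == "__main__":
    print(f"Started AppFlow execution: {run_ingestion_flow()}")
```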

Some of these tools included AWS Cloud-based solutions, such as AWS Lambda and AWS Step Functions. Lambda enables serverless, event-driven data processing tasks, allowing for real-time transformations and calculations as data arrives. Step Functions complements this by orchestrating complex workflows, coordinating multiple Lambda functions, and managing error handling for sophisticated data processing pipelines. This enables OMRON to extract meaningful patterns and trends from its vast data repositories, supporting more informed decision-making at all levels of the organization.
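The sketch below illustrates the kind of event-driven transformation step that Lambda handles in such a pipeline: it normalizes a raw CSV object as it lands in the data lake and writes the result to a curated zone. The bucket names, the S3-trigger event shape, and the transformation itself are illustrative assumptions; in practice, steps like this would be coordinated by a Step Functions state machine.

```python
import csv
import io
import boto3

s3 = boto3.client("s3")

# Illustrative bucket name for the curated zone of the data lake.
CURATED_BUCKET = "odap-curated-zone"


def handler(event, context):
    """Transform a newly arrived raw CSV object and write the result to the curated zone."""
    record = event["Records"][0]["s3"]
    bucket = record["bucket"]["name"]
    key = record["object"]["key"]

    raw = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")

    # Example transformation: keep only rows with a non-empty customer ID
    # and normalize column names to lowercase.
    reader = csv.DictReader(io.StringIO(raw))
    rows = [
        {k.lower(): v for k, v in row.items()}
        for row in reader
        if row.get("CustomerID")
    ]

    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=list(rows[0].keys()) if rows else [])
    writer.writeheader()
    writer.writerows(rows)

    s3.put_object(Bucket=CURATED_BUCKET, Key=key, Body=out.getvalue().encode("utf-8"))
    return {"rows_written": len(rows)}
```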

OMRON's data strategy, realized in ODAP, also allowed the organization to unlock generative AI use cases focused on tangible business outcomes and enhanced productivity. A strong data strategy that helps provide high-quality, reliable data is part of a comprehensive approach to using artificial intelligence and machine learning (AI/ML) and generative AI.

Embracing generative AI with Amazon Bedrock

The company has identified several use cases where generative AI can significantly impact operations, particularly in analytics and business intelligence (BI).

One key initiative is ODAPChat, an AI-powered chat-based assistant employees can use to interact with data using natural language queries. This tool democratizes data access across the organization, enabling even nontechnical users to gain valuable insights.

A standout application is the SQL-to-natural language capability, which translates complex SQL queries into plain English and vice versa, bridging the gap between technical and business teams. To power these advanced AI features, OMRON chose Amazon Bedrock. This fully managed service offers a range of foundation models (FMs), providing the flexibility to select the most suitable model for each use case. The straightforward implementation of Amazon Bedrock, coupled with its scalability to handle growing data volumes and user requests, made it an ideal choice for OMRON. The ability of Amazon Bedrock to support various models from different providers helps make sure that OMRON can always use the most advanced AI capabilities as they evolve.
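As a simplified illustration of such a translation call, the following sketch sends a SQL query to a foundation model through the Amazon Bedrock Converse API and returns a plain-English explanation. The model ID and prompt are assumptions rather than OMRON's implementation.

```python
import boto3

bedrock_runtime = boto3.client("bedrock-runtime")

# Illustrative model ID; Amazon Bedrock lets you swap models per use case.
MODEL_ID = "anthropic.claude-3-haiku-20240307-v1:0"


def explain_sql(query: str) -> str:
    """Ask a foundation model to translate a SQL query into plain English."""
    response = bedrock_runtime.converse(
        modelId=MODEL_ID,
        messages=[{
            "role": "user",
            "content": [{"text": "Explain what this SQL query does in plain English "
                                 "for a non-technical reader:\n" + query}],
        }],
        inferenceConfig={"maxTokens": 512, "temperature": 0.2},
    )
    return response["output"]["message"]["content"][0]["text"]


if __name__ == "__main__":
    print(explain_sql("SELECT region, SUM(revenue) FROM sales GROUP BY region"))
```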

Crucially, the robust security features provided by Amazon Bedrock align perfectly with OMRON’s stringent data protection requirements. Some highlights include:

  • Fine-grained access controls
  • Networking security features such as encryption of data in transit and at rest, or the ability to use private virtual private clouds (VPCs), helping to make sure that sensitive business data remains secure even when being processed by AI models
  • Amazon Bedrock Guardrails

These controls offer a comprehensive security approach that allows OMRON to innovate with AI while maintaining the highest standards of data governance and protection.

The following diagram shows a basic layout of how the solution works. It illustrates the main components and how they work together to power the AI assistant.

The system has three main sections:

  • User interface – Users engage with the chat interface hosted on AWS. Amazon Cognito handles user authentication, providing secure access to the application.
  • Input processing backend – Amazon API Gateway receives incoming messages, which are then processed by containers running on Amazon Elastic Container Service (Amazon ECS). Chat conversations are preserved in Amazon DynamoDB so they can be used in follow-up turns. Amazon Bedrock takes care of generating AI responses, and tools are configured using LangChain, which helps determine how to handle different types of queries. When needed, the system can access an ODAP data warehouse to retrieve additional information.
  • Document management – Documents are securely stored in Amazon S3, and when new documents are added, a Lambda function processes them into chunks. These chunks are converted into embeddings using Amazon Bedrock, and the embeddings are stored in an Amazon OpenSearch Service vector store for semantic search (a minimal sketch of this ingestion step follows the list).
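The following is a minimal sketch of that document ingestion step: it splits a document into overlapping chunks, generates embeddings with an Amazon Titan embeddings model on Amazon Bedrock, and indexes them into OpenSearch Service. The model ID, domain endpoint, index name, and chunking parameters are assumptions for illustration.

```python
import json
import boto3
from opensearchpy import OpenSearch, RequestsHttpConnection
from requests_aws4auth import AWS4Auth

# Illustrative names; the production pipeline uses different resources.
EMBED_MODEL_ID = "amazon.titan-embed-text-v2:0"
OPENSEARCH_HOST = "search-odapchat-docs.eu-west-1.es.amazonaws.com"
INDEX_NAME = "odap-doc-chunks"

bedrock_runtime = boto3.client("bedrock-runtime")
session = boto3.Session()
credentials = session.get_credentials()
auth = AWS4Auth(credentials.access_key, credentials.secret_key,
                session.region_name, "es", session_token=credentials.token)
opensearch = OpenSearch(hosts=[{"host": OPENSEARCH_HOST, "port": 443}],
                        http_auth=auth, use_ssl=True,
                        connection_class=RequestsHttpConnection)


def chunk_text(text: str, size: int = 1000, overlap: int = 100):
    """Split a document into overlapping chunks for embedding."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]


def index_document(doc_id: str, text: str):
    """Embed each chunk with Amazon Bedrock and store it in the OpenSearch vector index."""
    for i, chunk in enumerate(chunk_text(text)):
        response = bedrock_runtime.invoke_model(
            modelId=EMBED_MODEL_ID,
            body=json.dumps({"inputText": chunk}),
        )
        embedding = json.loads(response["body"].read())["embedding"]
        opensearch.index(index=INDEX_NAME,
                         id=f"{doc_id}-{i}",
                         body={"doc_id": doc_id, "text": chunk, "embedding": embedding})
```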

Results and future plans

The implementation of ODAP and ODAPChat on AWS has already yielded significant benefits for OMRON:

  • Optimization of reports, leading to more efficient and insightful analysis
  • SQL-to-natural language capabilities powered by generative AI, making data more accessible to nontechnical users
  • Increased business agility with infrastructure fully deployed in the cloud
  • Data democratization, enabling more employees to use data-driven insights

Looking ahead, OMRON plans to significantly expand its use of AWS services and further use generative AI capabilities. The company aims to integrate additional data sources, including other mission-critical systems, into ODAP. This expansion will be coupled with enhanced data governance measures to help promote data quality and compliance across the growing data solution.

OMRON is also exploring more advanced generative AI use cases. These initiatives will use the evolving capabilities provided by Amazon Bedrock to potentially incorporate advanced AI models and security features.

Conclusion

OMRON’s journey with AWS demonstrates the transformative power of cloud-based data solutions and generative AI in overcoming data silos and driving business innovation. By using AWS services such as Amazon AppFlow, Amazon S3, and Amazon Bedrock, OMRON has created a comprehensive, secure, and adaptable data and analytics platform that not only meets its current needs, but also positions the company for future growth and innovation.

As organizations across industries grapple with similar data challenges, OMRON’s story serves as an inspiring example of how embracing cloud technologies and AI can lead to significant business transformation and competitive advantage.


About the Authors

Emrah Kaya is Data Engineering Manager at Omron Europe and Platform Lead for the ODAP project. With his extensive background in cloud and data architecture, Emrah leads OMRON's key technological advancement initiatives, including artificial intelligence, machine learning, and data science.

Xinyi Zhou is a Data Engineer at Omron Europe, bringing her expertise to the ODAP team led by Emrah Kaya. She specializes in building efficient data pipelines and managing AWS infrastructure, while actively contributing to the implementation of new solutions that advance ODAP’s technological capabilities.

Emel Mendoza is a Senior Solutions Architect at AWS based in the Netherlands. With a passion for cloud migrations and application modernization, Emel helps organizations navigate their digital transformation journeys on AWS. Emel leverages his decade of experience to guide customers in adopting AWS services and architecting scalable, efficient solutions.

Jagdeep Singh Soni is a Senior Partner Solutions Architect at AWS based in the Netherlands. He uses his passion for Generative AI to help customers and partners build GenAI applications using AWS services. Jagdeep has 15 years of experience in innovation, experience engineering, digital transformation, cloud architecture and ML applications.
