Amazon AWS – Page 60

Llama 3.2 models from Meta are now available in Amazon SageMaker JumpStart

September 25, 2024

by Supriya Puragundla Amazon AWS

Today, we are excited to announce the availability of Llama 3.2 models in Amazon SageMaker JumpStart. Llama 3.2 oﬀers multi-modal vision and lightweight models representing Meta’s latest advancement in large language models (LLMs), providing enhanced capabilities and broader applicability across various use cases. With a focus on responsible innovation and system-level safety, these new models demonstrate state-of-the-art performance on a wide range of industry benchmarks and introduce features that help you build a new generation of AI experiences. SageMaker JumpStart is a machine learning (ML) hub that provides access to algorithms, models, and ML solutions so you can quickly get started with ML.

In this post, we show how you can discover and deploy the Llama 3.2 11B Vision model using SageMaker JumpStart. We also share the supported instance types and context for all the Llama 3.2 models available in SageMaker JumpStart. Although not highlighted in this blog, you can also use the lightweight models along with fine-tuning using SageMaker JumpStart.

Llama 3.2 models are available in SageMaker JumpStart initially in the US East (Ohio) AWS Region. Please note that Meta has restrictions on your usage of the multi-modal models if you are located in the European Union. See Meta’s community license agreement for more details.

Llama 3.2 overview

Llama 3.2 represents Meta’s latest advancement in LLMs. Llama 3.2 models are offered in various sizes, from small and medium-sized multi-modal models. The larger Llama 3.2 models come in two parameter sizes—11B and 90B—with 128,000 context length, and are capable of sophisticated reasoning tasks including multi-modal support for high resolution images. The lightweight text-only models come in two parameter sizes—1B and 3B—with 128,000 context length, and are suitable for edge devices. Additionally, there is a new safeguard Llama Guard 3 11B Vision parameter model, which is designed to support responsible innovation and system-level safety.

Llama 3.2 is the first Llama model to support vision tasks, with a new model architecture that integrates image encoder representations into the language model. With a focus on responsible innovation and system-level safety, Llama 3.2 models help you build and deploy cutting-edge generative AI models to ignite new innovations like image reasoning and are also more accessible for on-edge applications. The new models are also designed to be more efficient for AI workloads, with reduced latency and improved performance, making them suitable for a wide range of applications.

SageMaker JumpStart overview

SageMaker JumpStart offers access to a broad selection of publicly available foundation models (FMs). These pre-trained models serve as powerful starting points that can be deeply customized to address specific use cases. You can now use state-of-the-art model architectures, such as language models, computer vision models, and more, without having to build them from scratch.

With SageMaker JumpStart, you can deploy models in a secure environment. The models can be provisioned on dedicated SageMaker Inference instances, including AWS Trainium and AWS Inferentia powered instances, and are isolated within your virtual private cloud (VPC). This enforces data security and compliance, because the models operate under your own VPC controls, rather than in a shared public environment. After deploying an FM, you can further customize and fine-tune it using the extensive capabilities of Amazon SageMaker, including SageMaker Inference for deploying models and container logs for improved observability. With SageMaker, you can streamline the entire model deployment process.

Prerequisites

To try out the Llama 3.2 models in SageMaker JumpStart, you need the following prerequisites:

An AWS account that will contain all your AWS resources.
An AWS Identity and Access Management (IAM) role to access SageMaker. To learn more about how IAM works with SageMaker, refer to Identity and Access Management for Amazon SageMaker.
Access to Amazon SageMaker Studio or a SageMaker notebook instance or an interactive development environment (IDE) such as PyCharm or Visual Studio Code. We recommend using SageMaker Studio for straightforward deployment and inference.
Access to accelerated instances (GPUs) for hosting the LLMs.

Discover Llama 3.2 models in SageMaker JumpStart

SageMaker JumpStart provides FMs through two primary interfaces: SageMaker Studio and the SageMaker Python SDK. This provides multiple options to discover and use hundreds of models for your specific use case.

SageMaker Studio is a comprehensive IDE that offers a unified, web-based interface for performing all aspects of the ML development lifecycle. From preparing data to building, training, and deploying models, SageMaker Studio provides purpose-built tools to streamline the entire process. In SageMaker Studio, you can access SageMaker JumpStart to discover and explore the extensive catalog of FMs available for deployment to inference capabilities on SageMaker Inference.

In SageMaker Studio, you can access SageMaker JumpStart by choosing JumpStart in the navigation pane or by choosing JumpStart from the Home page.

Alternatively, you can use the SageMaker Python SDK to programmatically access and use SageMaker JumpStart models. This approach allows for greater flexibility and integration with existing AI/ML workflows and pipelines. By providing multiple access points, SageMaker JumpStart helps you seamlessly incorporate pre-trained models into your AI/ML development efforts, regardless of your preferred interface or workflow.

Deploy Llama 3.2 multi-modality models for inference using SageMaker JumpStart

On the SageMaker JumpStart landing page, you can discover all public pre-trained models offered by SageMaker. You can choose the Meta model provider tab to discover all the Meta models available in SageMaker.

If you’re using SageMaker Classic Studio and don’t see the Llama 3.2 models, update your SageMaker Studio version by shutting down and restarting. For more information about version updates, refer to Shut down and Update Studio Classic Apps.

You can choose the model card to view details about the model such as license, data used to train, and how to use. You can also find two buttons, Deploy and Open Notebook, which help you use the model.

When you choose either button, a pop-up window will show the End-User License Agreement (EULA) and acceptable use policy for you to accept.

Upon acceptance, you can proceed to the next step to use the model.

Deploy Llama 3.2 11B Vision model for inference using the Python SDK

When you choose Deploy and accept the terms, model deployment will start. Alternatively, you can deploy through the example notebook by choosing Open Notebook. The notebook provides end-to-end guidance on how to deploy the model for inference and clean up resources.

To deploy using a notebook, you start by selecting an appropriate model, specified by the model_id. You can deploy any of the selected models on SageMaker.

You can deploy a Llama 3.2 11B Vision model using SageMaker JumpStart with the following SageMaker Python SDK code:

from sagemaker.jumpstart.model import JumpStartModel
model = JumpStartModel(model_id = "meta-vlm-llama-3-2-11b-vision")
predictor = model.deploy(accept_eula=accept_eula)

This deploys the model on SageMaker with default configurations, including default instance type and default VPC configurations. You can change these configurations by specifying non-default values in JumpStartModel. To successfully deploy the model, you must manually set accept_eula=True as a deploy method argument. After it’s deployed, you can run inference against the deployed endpoint through the SageMaker predictor:

payload = {
    "messages": [
        {"role": "system", "content": "You are a helpful assistant"},
        {"role": "user", "content": "How are you doing today"},
        {"role": "assistant", "content": "Good, what can i help you with today?"},
        {"role": "user", "content": "Give me 5 steps to become better at tennis?"}
    ],
    "temperature": 0.6,
    "top_p": 0.9,
    "max_tokens": 512,
    "logprobs": False
}
response = predictor.predict(payload)
response_message = response['choices'][0]['message']['content']

Recommended instances and benchmark

The following table lists all the Llama 3.2 models available in SageMaker JumpStart along with the model_id, default instance types, and the maximum number of total tokens (sum of number of input tokens and number of generated tokens) supported for each of these models. For increased context length, you can modify the default instance type in the SageMaker JumpStart UI.

Model Name	Model ID	Default instance type	Supported instance types
Llama-3.2-1B	meta-textgeneration-llama-3-2-1b, meta-textgenerationneuron-llama-3-2-1b	ml.g6.xlarge (125K context length), ml.trn1.2xlarge (125K context length)	All g6/g5/p4/p5 instances; ml.inf2.xlarge, ml.inf2.8xlarge, ml.inf2.24xlarge, ml.inf2.48xlarge, ml.trn1.2xlarge, ml.trn1.32xlarge, ml.trn1n.32xlarge
Llama-3.2-1B-Instruct	meta-textgeneration-llama-3-2-1b-instruct, meta-textgenerationneuron-llama-3-2-1b-instruct	ml.g6.xlarge (125K context length), ml.trn1.2xlarge (125K context length)	All g6/g5/p4/p5 instances; ml.inf2.xlarge, ml.inf2.8xlarge, ml.inf2.24xlarge, ml.inf2.48xlarge, ml.trn1.2xlarge, ml.trn1.32xlarge, ml.trn1n.32xlarge
Llama-3.2-3B	meta-textgeneration-llama-3-2-3b, meta-textgenerationneuron-llama-3-2-3b	ml.g6.xlarge (125K context length), ml.trn1.2xlarge (125K context length)	All g6/g5/p4/p5 instances; ml.inf2.xlarge, ml.inf2.8xlarge, ml.inf2.24xlarge, ml.inf2.48xlarge, ml.trn1.2xlarge, ml.trn1.32xlarge, ml.trn1n.32xlarge
Llama-3.2-3B-Instruct	meta-textgeneration-llama-3-2-3b-instruct, meta-textgenerationneuron-llama-3-2-3b-instruct	ml.g6.xlarge (125K context length), ml.trn1.2xlarge (125K context length)	All g6/g5/p4/p5 instances; ml.inf2.xlarge, ml.inf2.8xlarge, ml.inf2.24xlarge, ml.inf2.48xlarge, ml.trn1.2xlarge, ml.trn1.32xlarge, ml.trn1n.32xlarge
Llama-3.2-11B-Vision	meta-vlm-llama-3-2-11b-vision	ml.p4d.24xlarge (125K context length)	p4d.24xlarge, p4de.24xlarge, p5.48xlarge
Llama-3.2-11B-Vision-Instruct	meta-vlm-llama-3-2-11b-vision-instruct	ml.p4d.24xlarge (125K context length)	p4d.24xlarge, p4de.24xlarge, p5.48xlarge
Llama-3.2-90B-Vision	meta-vlm-llama-3-2-90b-vision	ml.p5.24xlarge (125K context length)	p4d.24xlarge, p4de.24xlarge, p5.48xlarge
Llama-3.2-90B-Vision-Instruct	meta-vlm-llama-3-2-90b-vision-instruct	ml.p5.24xlarge (125K context length)	p4d.24xlarge, p4de.24xlarge, p5.48xlarge
Llama-Guard-3-11B-Vision	meta-vlm-llama-guard-3-11b-vision	ml.p4d.24xlarge	p4d.24xlarge, p4de.24xlarge, p5.48xlarge

Llama 3.2 models have been evaluated on over 150 benchmark datasets, demonstrating competitive performance with leading FMs.

Inference and example prompts for Llama-3.2 11B Vision

You can use Llama 3.2 11B and 90B models for text and image or vision reasoning use cases. You can perform a variety of tasks, such as image captioning, image text retrieval, visual question answering and reasoning, document visual question answering, and more. Input payload to the endpoint looks like the following code examples.

Text-only input

The following is an example of text-only input:

{
    'messages': [
        {'role': 'system', 'content': 'You are a helpful assistant'}, 
        {'role': 'user', 'content': 'How are you doing today'}, 
        {'role': 'assistant', 'content': 'Good, what can i help you with today?'}, 
        {'role': 'user', "content": 'Give me 5 steps to become better at tennis?'}
    ], 
    'temperature': 0.6, 
    'top_p': 0.9, 
    'max_tokens': 512, 
    'logprobs': False
}

This produces the following response:

Here are 5 steps to help you improve your tennis game:

**Step 1: Develop a Consistent Serve**

A strong serve is the foundation of a good tennis game. Practice your serve regularly to develop a consistent and powerful shot. Focus on proper footwork, grip, and body positioning. Experiment with different serves, such as topspin and slice, to keep your opponent guessing.

**Step 2: Improve Your Footwork and Agility**

Good footwork is essential for getting into position to hit shots and covering the court. Practice shuffling, sliding, and turning to improve your agility and reaction time. Incorporate exercises like ladder drills and cone drills into your training to enhance your foot speed and agility.

**Step 3: Focus on Your Volley and Overhead Shots**

The volley and overhead shots are critical for winning points at the net. Practice hitting volleys and overheads with both forehand and backhand grips. Focus on keeping your racquet close to the ball and using your wrist and arm to generate power.

**Step 4: Work on Your Return of Serve**

The return of serve is a crucial shot in tennis, as it sets up the point for your opponent. Practice returning serves with different types of spin, such as topspin and slice. Focus on getting your racquet head to the ball early and using your legs to generate power.

**Step 5: Analyze Your Game and Practice with a Purpose**

To improve your game, you need to identify areas for improvement and practice with a purpose. Record your matches and analyze your game to identify weaknesses and areas for improvement. Create a practice plan that targets specific areas, such as your forehand or backhand, and focus on making progress in those areas.

Remember, improvement takes time and practice. Stay committed, and with consistent effort, you'll see improvement in your tennis game!

Single-image input

You can set up vision-based reasoning tasks with Llama 3.2 models with SageMaker JumpStart as follows:

import requests
import base64

def url_to_base64(image_url):
    # Download the image
    response = requests.get(image_url)
    if response.status_code != 200:
        return None

    # Encode the image content to base64
    image_base64 = base64.b64encode(response.content).decode('utf-8')
    return image_base64

Let’s load an image from the open source MATH-Vision dataset:

url = "https://raw.githubusercontent.com/mathvision-cuhk/MATH-V/refs/heads/main/images/13.jpg"
image_data = url_to_base64(url)

We can structure the message object with our base64 image data:

{
  "messages": [
    {
      "role": "user",
      "content": [
        {
          "type": "text",
          "text": "Which of these figures differs from the rest four?"
        },
        {
          "type": "image_url",
          "image_url": {
            "url": f"""data:image/jpeg;base64,{image_data}"""
          }
        }
      ]
    }
  ],
  "temperature": 0.4,
  "top_p": 0.95,
  "max_tokens": 512,
  "logprobs": False
}

This produces the following response:

The figure labeled E is the one that differs from the rest.

Multi-image input

The following code is an example of multi-image input:

image_data1 = url_to_base64("https://d2908q01vomqb2.cloudfront.net/887309d048beef83ad3eabf2a79a64a389ab1c9f/2023/04/27/DBBLOG-2448-SingleAZInstance.png")
image_data2 = url_to_base64("https://d2908q01vomqb2.cloudfront.net/887309d048beef83ad3eabf2a79a64a389ab1c9f/2023/04/27/DBBLOG-2448-MultiAZInstance.png")
payload = {
  "messages": [
    {
      "role": "user",
      "content": [
        {
          "type": "text",
          "text": "There are two images. Describe each one and why they are related"
        },
        {
          "type": "image_url",
          "image_url": {
            "url": f"""data:image/jpeg;base64,{image_data1}"""
          }
        },
        {
          "type": "image_url",
          "image_url": {
            "url": f"""data:image/jpeg;base64,{image_data2}"""
          }
        }
      ]
    }
  ],
  "temperature": 0.6,
  "top_p": 0.9,
  "max_tokens": 512,
  "logprobs": False
}

This produces the following response:

The two images depict different architectures for a distributed database system. The main difference between them is the location of the database instances and how they are connected.

**Image 1: Centralized Architecture**

In this image, there is a single database instance (DB1) that serves as the central repository for all data. This instance is connected to multiple application servers (App1, App2, App3) through a network. The application servers interact with the database instance to read and write data.

**Image 2: Distributed Architecture**

In this image, there are multiple database instances (DB1, DB2, DB3) that are distributed across different locations. Each database instance is connected to a local application server (App1, App2, App3) through a network. The application servers interact with their local database instances to read and write data.

The key differences between the two images are:

* **Scalability**: The distributed architecture (Image 2) allows for easier scalability, as new database instances can be added or removed as needed, without affecting the overall system.
* **Availability**: The distributed architecture provides higher availability, as if one database instance goes down, the other instances can still serve requests.
* **Performance**: The centralized architecture (Image 1) may experience performance issues if the central database instance becomes a bottleneck, while the distributed architecture can distribute the load across multiple instances.
* **Data consistency**: The centralized architecture may have easier data consistency management, as all data is stored in a single location, while the distributed architecture requires more complex mechanisms to ensure data consistency across multiple instances.

In summary, the centralized architecture is suitable for small to medium-sized applications with low traffic, while the distributed architecture is more suitable for large-scale applications with high traffic and scalability requirements.

Clean up

To avoid incurring unnecessary costs, when you’re done, delete the SageMaker endpoints using the following code snippets:

predictor.delete_model()
predictor.delete_endpoint()

Alternatively, to use the SageMaker console, complete the following steps:

On the SageMaker console, under Inference in the navigation pane, choose Endpoints.
Search for the embedding and text generation endpoints.
On the endpoint details page, choose Delete.
Choose Delete again to confirm.

Conclusion

In this post, we explored how SageMaker JumpStart empowers data scientists and ML engineers to discover, access, and deploy a wide range of pre-trained FMs for inference, including Meta’s most advanced and capable models to date. Get started with SageMaker JumpStart and Llama 3.2 models today. For more information about SageMaker JumpStart, see Train, deploy, and evaluate pretrained models with SageMaker JumpStart and Getting started with Amazon SageMaker JumpStart.

About the Authors

Supriya Puragundla is a Senior Solutions Architect at AWS
Armando Diaz is a Solutions Architect at AWS
Sharon Yu is a Software Development Engineer at AWS
Siddharth Venkatesan is a Software Development Engineer at AWS
Tony Lian is a Software Engineer at AWS
Evan Kravitz is a Software Development Engineer at AWS
Jonathan Guinegagne is a Senior Software Engineer at AWS
Tyler Osterberg is a Software Engineer at AWS
Sindhu Vahini Somasundaram is a Software Development Engineer at AWS
Hemant Singh is an Applied Scientist at AWS
Xin Huang is a Senior Applied Scientist at AWS
Adriana Simmons is a Senior Product Marketing Manager at AWS
June Won is a Senior Product Manager at AWS
Karl Albertsen is a Head of ML Algorithm and JumpStart at AWS

Vision use cases with Llama 3.2 11B and 90B models from Meta

September 25, 2024

by Natarajan Chennimalai Kumar Amazon AWS

Today, we are excited to announce the availability of Llama 3.2 in Amazon SageMaker JumpStart and Amazon Bedrock. The Llama 3.2 models are a collection of state-of-the-art pre-trained and instruct fine-tuned generative AI models that come in various sizes—in lightweight text-only 1B and 3B parameter models suitable for edge devices, to small and medium-sized 11B and 90B parameter models capable of sophisticated reasoning tasks, including multimodal support for high-resolution images. SageMaker JumpStart is a machine learning (ML) hub that provides access to algorithms, models, and ML solutions so you can quickly get started with ML. Amazon Bedrock is a fully managed service that offers a choice of high-performing foundation models (FMs) from leading AI companies, like Meta, through a single API, along with a broad set of capabilities you need to build generative AI applications with security, privacy, and responsible AI.

In this post, we demonstrate how you can use Llama 3.2 11B and 90B models for a variety of vision-based use cases. This is the first time Meta’s Llama models have been released with vision capabilities. These new capabilities expand the usability of Llama models from their traditional text-only applications. The vision-based use cases that we discuss in this post include document visual question answering, extracting structured entity information from images, and image captioning.

Overview of Llama 3.2 11B and 90B Vision models

The Llama 3.2 collection of multimodal and multilingual large language models (LLMs) is a collection of pre-trained and instruction-tuned generative models in a variety of sizes. The 11B and 90B models are multimodal—they support text in/text out, and text+image in/text out.

Llama 3.2 11B and 90B are the ﬁrst Llama models to support vision tasks, with a new model architecture that integrates image encoder representations into the language model. The new models are designed to be more eﬃcient for AI workloads, with reduced latency and improved performance, making them suitable for a wide range of applications. All Llama 3.2 models support a 128,000 context length, maintaining the expanded token capacity introduced in Llama 3.1. Additionally, the models oﬀer improved multilingual support for eight languages, including English, German, French, Italian, Portuguese, Hindi, Spanish, and Thai.

Llama 3.2 models are available today for inferencing in SageMaker JumpStart and Amazon Bedrock. With SageMaker JumpStart, you can access Llama 3.2 models initially in the US East (Ohio) AWS region and support the required instance types. Meta’s Llama 3.2 90B and 11B models are also available in Amazon Bedrock in the US West (Oregon) Region, and in the US East (Ohio, N. Virginia) Regions via cross-region inference. Llama 3.2 1B and 3B models are available in the US West (Oregon) and Europe (Frankfurt) Regions, and in the US East (Ohio, N. Virginia) and Europe (Ireland, Paris) Regions via cross-region inference with planned expanded regional availability in the future.

Solution overview

In the following sections, we walk through how to configure Llama 3.2 vision models in Amazon Bedrock and Amazon SageMaker JumpStart for vision-based reasoning. We also demonstrate use cases for document question answering, entity extraction, and caption generation.

For the examples shown in this post, we use the Llama 3.2 90B model unless otherwise noted. The fashion images are from the Fashion Product Images Dataset. Caption generation images are from Human Preference Synthetic Dataset. The interior design and real estate images are from the Interior design dataset.

Prerequisites

The following prerequisites are needed to implement the steps outlined in this post:

An AWS account that will contain all your AWS resources.
An AWS Identity and Access Management (IAM) role to access Amazon SageMaker and Amazon Bedrock. For more information, see Identity and Access Management for Amazon SageMaker and Identity and access management for Amazon Bedrock.
Access to Amazon SageMaker Studio or a SageMaker notebook instance or an interactive development environment (IDE) such as PyCharm or Visual Studio Code.

For information about how to set up Llama 3.2 model access for Amazon Bedrock, see launch post. For details on creating model endpoints in SageMaker JumpStart, refer to the launch post.

Configure Llama 3.2 for vision-based reasoning in Amazon Bedrock

To set up vision-based reasoning tasks with Llama 3.2 models in Amazon Bedrock, use the following code snippet:

import boto3
import json
import base64
from botocore.config import Config

# Initialize the Bedrock client
config = Config(
            region_name = os.getenv("BEDROCK_REGION", "us-west-2"),
            )
bedrock_runtime = boto3.client('bedrock-runtime', config=config)
MODEL_ID = " us.meta.llama3-2-90b-instruct-v1:0"

Amazon Bedrock supports the messages object as part of the Converse API. With the Converse API, you don’t have to convert the image into base64 (compared to SageMaker JumpStart).

You can read the image with the following code:

# Read and encode the image
image_path = "<your_file_path>"  # Replace with the actual path to your image
try:
    # Open the image file and read its contents
    with open(image_path, "rb") as image_file:
        image_bytes = image_file.read()
    # Encode the image bytes to base64
    image_data = image_bytes
except FileNotFoundError:
    print(f"Image file not found at {image_path}")
    image_data = None

Use the following code to create a messages object:

# Construct the messages for the model input

# Construct the messages for the model input
messages = [    
    {
        "role": "user",
        "content": [
            {                
                "text": prompt
            },
            {                
                "image": {
                    "format": "<your_file_format>",
                    "source": {
                        "bytes":image_data
                }
            }
        ]
    }
]

Invoke the Amazon Bedrock Converse API as follows:

try:
    # Invoke the SageMaker endpoint
    response = bedrock_runtime.converse(
        modelId=MODEL_ID, # MODEL_ID defined at the beginning
        messages=[
            messages
        ],
        inferenceConfig={
        "maxTokens": 4096,
        "temperature": 0,
        "topP": .1
        },        
    )
    
    # Read the response 
    print(response['output']['message']['content'][0]['text'])

except Exception as e:
    print(f"An error occurred while invoking the endpoint: {str(e)}")

Configure Llama 3.2 for vision-based reasoning in SageMaker

You can set up vision-based reasoning tasks with Llama 3.2 vision models with a SageMaker endpoint with the following code snippet (please refer to Llama 3.2 in SageMaker JumpStart blog to setup the inference endpoint):

import boto3
import json
import base64

# Initialize the SageMaker runtime client
sagemaker_runtime = boto3.client('sagemaker-runtime')
endpoint_name = '<model-endpoint>'  # Replace with your actual endpoint name

SageMaker JumpStart deployment can also take in a Messages API style messages object as the input (similar to the Amazon Bedrock Converse API). First, the image needs to be read into a base64 format before sending it through the messages object.

Read the image with the following code:

# Read and encode the image
image_path = "<your_file_path>"  # Replace with the actual path to your image
try:
    # Open the image file and read its contents
    with open(image_path, "rb") as image_file:
        image_bytes = image_file.read()
    # Encode the image bytes to base64
    image_data = base64.b64encode(image_bytes).decode('utf-8')
    image_media_type = 'image/jpeg'  # Adjust if using a different image format
except FileNotFoundError:
    print(f"Image file not found at {image_path}")
    image_data = None
    image_media_type = None

Create a messages object with the following code:

# Create a data URL for the image
my_url = f"""data:image/jpeg;base64,{image_data}"""

# Construct the messages for the model input
messages = [    
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": prompt
            },
            {
                "type": "image_url",
                "image_url": {
                    "url": my_url
                }
            }
        ]
    }
]

In the preceding code, prompt is the question we ask about the reasoning of the model with the image.

After you create the messages object, you can send that as payload to the SageMaker endpoint:

try:
    # Invoke the SageMaker endpoint
    response = sagemaker_runtime.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType='application/json',
        Body=json.dumps(payload)
    )
    
    # Read the response body
    response_body = response['Body'].read()
    
    if response_body:
        try:
            # Parse the JSON response
            result = json.loads(response_body.decode('utf-8'))
            # Print the model's response
            print(result['choices'][0]['message']['content'])
        except json.JSONDecodeError as json_err:
            print(f"Failed to parse JSON: {json_err}")
            print(f"Raw response: {response_body['choices'][0]['message']['content']}")
    else:
        print("The response body is empty")

except Exception as e:
    print(f"An error occurred while invoking the endpoint: {str(e)}")

Document question answering

In this section, we present example use cases for document question answering. We focus on two use cases: first on answering questions based on financial results slides that contain complex graphical information, and secondly solving math-based problems by looking at an image.

Financial results slides Q&A

Llama 3.2 vision models can reason and answer questions based on images with complicated information in the image such as plots, graphs, and tables. Let’s use the model for a financial analyst assistant to analyze slides with different types of financial information to answer our questions. We use the Amazon Q2 2024 Financial Results Conference Calls Slides document to ask Llama 3.2 some questions based on this slide deck.

For our first prompt, we ask “How is the sales trending over the last twelve months?” with the following slide.

We get the following response:
The sales are trending up 12% year over year. The bar graph shows that the sales have increased from $538,046 in Q2 2023 to $604,334 in Q1 2024, and further to $604,334 in Q2 2024. This indicates a steady growth in sales over the last twelve months.

Next, we ask “Can you please help me understand this slide? How are the operating and free cash flow trending the last 12 month period? What is the significant contributor to the free cash flow trend?” We use the following slide.

We get the following response:
The slide presents a table with two main sections: "Operating Cash Flow" and "Free Cash Flow Less Equipment Finance Leases and Principal Repayments of All Other Finance Leases and Financing Obligations." The table is divided into five columns, each representing a quarter from Q2 2023 to Q2 2024.

**Operating Cash Flow Trend:**

* Q2 2023: $61,841

* Q3 2023: $71,654

* Q4 2023: $84,946

* Q1 2024: $99,147

* Q2 2024: $107,952

The operating cash flow has consistently increased over the five quarters, indicating a positive trend.

**Free Cash Flow Trend:**

* Q2 2023: $6,734

* Q3 2023: $20,241

* Q4 2023: $35,549

* Q1 2024: $48,778

* Q2 2024: $51,448

The free cash flow has also shown an upward trend, with a significant increase from Q2 2023 to Q2 2024.

**Significant Contributor to Free Cash Flow Trend:**

The significant contributor to the free cash flow trend is the "Principal repayments of all other finance leases" and "Principal repayments of financing obligations." These two items have consistently increased over the five quarters, contributing to the overall growth in free cash flow.

In summary, both operating cash flow and free cash flow have demonstrated a positive trend over the last 12-month period. The significant contributor to the free cash flow trend is the increase in principal repayments of all other finance leases and financing obligations.

Visual math question answering

Another interesting problem for visual question answering is answering visual math and logic puzzles. Llama 3.2 vision models are highly capable in solving and answering visual math questions. We show a few examples here.

For our first prompt, we use the following image and ask “The kangaroo is inside how many circles?”

We get the following response:
The kangaroo is inside 3 circles.

Next, we use the following image and ask “Which of these figures differ from the other four?”

We get the following response:
The figure labeled E is the only one with a different shape for the skirt. The other figures have triangular skirts, while figure E has a rectangular skirt.

Entity extraction

Extracting information about products is an important use case for the retail industry. On platforms supporting third-party sellers, product images are the most available resources to gather information about the product. Therefore, accurately extracting relevant information about the product is essential for the success of the product. In this section, we show a couple of examples of clothing product images, and how you can use Llama 3.2 to extract relevant information about the product that can support it on an ecommerce application.

We first use the following images and the prompt “Can you please extract all information about the product in the image? Think very carefully in a step-by-step manner to construct a json file to organize all the information about the product. Please consider formulating the json file so that the extracted information can be stored in a database for further use in a retail ecommerce application. Return the json file as the output.”

For the pink dress, we get the following output:
Based on the image, here is a JSON representation of the product information:

```
{
"product_id": "unknown",
"name": "Pink Floral Dress",
"description": "A sleeveless dress with a pink floral pattern and a pink trim at the bottom",
"category": "Dresses",
"sub_category": "Casual Dresses",
"color": "Pink",
"size": "unknown",
"price": "unknown",
"brand": "unknown",
"material": "unknown",
"features": [
"Sleeveless",
"Floral pattern",
"Pink trim at the bottom"
],
"images": [
{
"url": "image_url",
"alt_text": "Pink Floral Dress"
}
]
}
```

Please note that some information such as product_id, size, price, brand, and material are unknown as they are not visible in the image. Also, the image_url is a placeholder and should be replaced with the actual URL of the image.

For the blue cap, we get the following output:

Based on the image, here is a JSON representation of the product information:

```json
{
"product_id": "NIKE-CAP-001",
"name": "Nike Cap",
"description": "A blue baseball cap with a white Nike logo",
"category": "Accessories",
"subcategory": "Hats",
"brand": "Nike",
"color": "Blue",
"size": "One Size",
"material": "Cotton",
"price": 29.99,
"image_url": "https://example.com/nike-cap.jpg",
"features": [
"Adjustable strap in the back",
"Embroidered Nike logo on the front"
]
}
```

This JSON object includes the following fields:

* `product_id`: a unique identifier for the product
* `name`: the name of the product
* `description`: a brief description of the product
* `category` and `subcategory`: the categories and subcategories that the product belongs to
* `brand`: the brand of the product
* `color`: the color of the product
* `size`: the size of the product
* `material`: the material that the product is made of
* `price`: the price of the product
* `image_url`: the URL of the product image
* `features`: an array of features that describe the product

Note that this is just one possible way to represent the product information in JSON. Depending on the specific requirements of your application, you may need to add or remove fields, or use different data types for certain fields.

Caption generation

Another important and useful capability of vision models is to generate captions automatically for images. In this section, we showcase caption generation with a simple caption, a more creative and funnier caption, and a marketing style caption.

Simple captioning

For simple captioning, we use the following image and the prompt “Please provide a caption for this image. Be creative but respectful.”

We get the following response:

The image shows a young woman sitting in a dentist's chair, with a female dentist examining her teeth. The woman is wearing a bib to protect her clothes from any water or other substances that may be used during the examination. The dentist is wearing a white lab coat and a mask, and she is holding a dental mirror and a probe. The background of the image is a dental office, with various dental tools and equipment visible.

The overall atmosphere of the image is one of professionalism and care. The dentist appears to be focused on her work, and the woman seems to be relaxed and comfortable. The image suggests that the woman is receiving a routine dental checkup or cleaning, and that the dentist is taking good care of her.

Creative captioning

For our second example, we want a more creative and funny tone. We use the same image and the prompt “Please provide a short caption for this image. Be creative and funny but respectful. This is meant to be shared on a social network application.”

We get the following response:

"Brace yourself for a sparkling smile, because this dentist is about to make your teeth shine brighter than a disco ball!"

Listing generation

To generate a real estate listing, we use the following image and the prompt “You are a real estate marketing expert. Please create a listing description with the attached image of a property.”

We get the following output:

**Welcome to Your Dream Home**

This stunning property boasts an open-plan living area that seamlessly blends style and functionality. The spacious living room, dining area, and kitchen are perfect for entertaining, with plenty of natural light pouring in through the large windows. The sleek, modern kitchen features high-end appliances and ample counter space, making meal prep a breeze.

The living room is a cozy retreat, with a comfortable sofa and chairs arranged around a stylish coffee table. The adjacent dining area is perfect for hosting dinner parties, with a large table that seats eight.

But the real showstopper is the stunning view of the water, which can be enjoyed from the comfort of your own home. Imagine sipping your morning coffee or enjoying a glass of wine in the evening while taking in the breathtaking scenery.

This property is the perfect blend of form and function, offering the ultimate in comfort and style. Don't miss out on this incredible opportunity to make it your own.

**Key Features:**

* Open-plan living area with kitchen, dining area, and living room
* Modern kitchen with high-end appliances and ample counter space
* Comfortable living room with sofa and chairs
* Large dining table that seats eight
* Stunning view of the water
* Perfect for entertaining

**Don't miss out on this incredible opportunity. Contact us today to schedule a viewing.**

Conclusion

In this post, we demonstrated the diverse vision capabilities of the Llama 3.2 11B and 90B models from Meta. Llama 3.2 vision models enable you to solve multiple use cases, including document understanding, math and logic puzzle solving, entity extraction, and caption generation. These capabilities can drive productivity in a number of enterprise use cases, including ecommerce (retail), marketing, and much more.

To learn more about Llama 3.2 features and capabilities in Amazon Bedrock, refer to the launch post, product page, and documentation. To learn more about using Llama 3.2 in SageMaker JumpStart, see the launch post, and for more information about using foundation models in SageMaker JumpStart, check out product page and documentation.

We can’t wait to see what you build with the Llama 3.2 models on AWS!

About the Authors

Dr. Natarajan Chennimalai Kumar is a Principal Solutions Architect in the 3rd Party Model Provider team at AWS, working closely with the Llama partner engineering team at Meta to enable AWS customers use Llama models. He holds a PhD from University of Illinois at Urbana-Champaign. He is based in the Bay Area in California. Outside of work, he enjoys watching shows with his kids, playing tennis, and traveling with his family.

Sebastian Bustillo is a Solutions Architect at AWS. He focuses on AI/ML technologies with a profound passion for generative AI and compute accelerators. At AWS, he helps customers unlock business value through generative AI. When he’s not at work, he enjoys brewing a perfect cup of specialty coffee and exploring the outdoors with his wife.

Marco Punio is a Sr. Specialist Solutions Architect focused on generative AI strategy, applied AI solutions, and conducting research to help customers hyperscale on AWS. As a member of the 3rd Party Model Provider Applied Sciences Solutions Architecture team at AWS, he is a Global Lead for the Meta – AWS Partnership and technical strategy. Based in Seattle, WA, Marco enjoys writing, reading, exercising, and building applications in his free time.

Armando Diaz is a Solutions Architect at AWS. He focuses on generative AI, AI/ML, and data analytics. At AWS, Armando helps customers integrating cutting-edge generative AI capabilities into their systems, fostering innovation and competitive advantage. When he’s not at work, he enjoys spending time with his wife and family, hiking, and traveling the world.

AWS AI call for proposals — Fall 2024

September 25, 2024

by Amazon AWS

Advancing the frontiers of machine learning.Read More

AI for Information Security call for proposals — Fall 2024

September 25, 2024

by Amazon AWS

Advancing possible solutions for some of the most challenging problems in information security.Read More

How generative AI is transforming legal tech with AWS

September 24, 2024

by Vineet Kacchawaha Amazon AWS

Legal professionals often spend a significant portion of their work searching through and analyzing large documents to draw insights, prepare arguments, create drafts, and compare documents. The rise of generative artificial intelligence (AI) has brought an inflection of foundation models (FMs). These FMs, with simple instructions (prompts), can perform various tasks such as drafting emails, extracting key terms from contracts or briefs, summarizing documents, searching through multiple documents, and more. As a result, these models are fit for legal tech. Goldman Sachs estimated that generative AI could automate 44% of legal tasks in the US. A special report published by Thompson Reuters reported that generative AI awareness is significantly higher among legal professionals, with 91% of respondents saying they have heard of or read about these tools.

However, such models alone are not sufficient due to legal and ethical concerns around data privacy. Security and confidentiality are of paramount importance in the legal field. Legal tech professionals, like any other business handling sensitive customer information, require robust security and confidentiality practices. Advancements in AI and natural language processing (NLP) show promise to help lawyers with their work, but the legal industry also has valid questions around the accuracy and costs of these new techniques, as well as how customer data will be kept private and secure. AWS AI and machine learning (ML) services help address these concerns within the industry.

In this post, we share how legal tech professionals can build solutions for different use cases with generative AI on AWS.

AI/ML on AWS

AI and ML have been a focus for Amazon for over 25 years, and many of the capabilities customers use with Amazon are driven by ML. Ecommerce recommendation engines, Just Walk Out technology, Alexa devices, and route optimizations are some examples. These capabilities are built using the AWS Cloud. At AWS, we have played a key role in and making ML accessible to anyone who wants to use it, including more than 100,000 customers of all sizes and industries. Thomson Reuters, Booking.com, and Merck are some of the customers who are using the generative AI capabilities of AWS services to deliver innovative solutions.

AWS makes it straightforward to build and scale generative AI customized for your data, your use cases, and your customers. AWS gives you the flexibility to choose different FMs that work best for your needs. Your organization can use generative AI for various purposes like chatbots, intelligent document processing, media creation, and product development and design. You can now apply that same technology to the legal field.

When you’re building generative AI applications, FMs are part of the architecture and not the entire solution. There are other components involved, such as knowledge bases, data stores, and document repositories. It’s important to understand how your enterprise data is integrating with different components and the controls that can be put in place.

Security and your data on AWS

Robust security and confidentiality are foundations to the legal tech domain. At AWS, security is our top priority. AWS is architected to be the most secure global cloud infrastructure on which to build, migrate, and manage applications and workloads. This is backed by our deep set of over 300 cloud security tools and the trust of our millions of customers, including the most security sensitive organizations like government, healthcare, and financial services.

Security is a shared responsibility model. Core security disciplines, like identity and access management, data protection, privacy and compliance, application security, and threat modeling, are still critically important for generative AI workloads, just as they are for any other workload. For example, if your generative AI applications is accessing a database, you’ll need to know what the data classification of the database is, how to protect that data, how to monitor for threats, and how to manage access. But beyond emphasizing long-standing security practices, it’s crucial to understand the unique risks and additional security considerations that generative AI workloads bring. To learn more, refer to Securing generative AI: An introduction to the Generative AI Security Scoping Matrix.

Sovereignty has been a priority for AWS since the very beginning, when we were the only major cloud provider to allow you to control the location and movement of your customer data and address stricter data residency requirements. The AWS Digital Sovereignty Pledge is our commitment to offering AWS customers the most advanced set of sovereignty controls and features available in the cloud. We are committed to expanding our capabilities to allow you to meet your digital sovereignty needs, without compromising on the performance, innovation, security, or scale of the AWS Cloud.

AWS generative AI approach for legal tech

AWS solutions enable legal professionals to refocus their expertise on high-value tasks. On AWS, generative AI solutions are now within reach for legal teams of all sizes. With virtually unlimited cloud computing capacity, the ability to fine-tune models for specific legal tasks, and services tailored for confidential client data, AWS provides the ideal environment for applying generative AI in legal tech.

In the following sections, we share how we’re working with several legal customers on different use cases that are focused on improving the productivity of various tasks in legal firms.

Boost productivity to allow a search based on context and conversational Q&A

Legal professionals store their information in different ways, such as on premises, in the cloud, or a combination of the two. It can take hours or days to consolidate the documents prior to reviewing them if they are scattered across different locations. The industry relies on tools where searching is limited to each domain, and may not flexible enough for users to search for information.

To address this issue, AWS used AI/ML and search engines to provide a managed service where users can ask a human-like, open-ended generative AI-powered assistant to answer questions based on data and information. Users can prompt the assistant to extract key attributes that serve as metadata, find relevant documents, and answer legal questions and terms inquiries. What used to take hours can now be done in a matter of minutes, and based on what we have learned with our customers, AWS generative AI has been able to improve productivity of resources by up to a 15% increase compared to manual processes during its initial phases.

Boost productivity with legal document summarization

Legal tech workers can realize a benefit from the generation of first draft that can then be reviewed and revised by the process owner. Multiple use cases are being implemented under this category:

Contract summarization for tax approval
Approval attachment summarization
Case summarization

The summarization of documents can either use existing documents and videos from your document management system or allow users to upload a document and ask questions in real time. Instead of writing the summary, generative AI uses FMs to create the content so the lawyer can review the final content. This approach reduces these laborious tasks to 5–10 minutes instead of 20–60 minutes.

Boost attorney productivity by drafting and reviewing legal documents using generative AI

Generative AI can help boost attorney productivity by automating the creation of legal documents. Tasks like drafting contracts, briefs, and memos can be time-consuming for attorneys. With generative AI, attorneys can describe the key aspects of a document in plain language and instantly generate an initial draft. This new approach uses generative AI to use templates and chatbot interactions to add allowed text to an initial validation prior to legal review.

Another use case is to improve reviewing contracts using generative AI. Attorneys spend valuable time negotiating contracts. Generative AI can streamline this process by reviewing and redlining contracts, and identify potential discrepancies and conflicting provisions. Given a set of documents, this functionality allows attorneys to ask open-ended questions based on the documents along with follow-up questions, enabling human-like conversational experiences with enterprise data.

Start your AWS generative AI journey today

We are at the beginning of a new and exciting foray into generative AI, and we have just scratched the surface of some potential applications in the legal field—from text summarization, drafting legal documents, or searching based on context. The AWS generative AI stack offers you the infrastructure to build and train your own FMs, services to build with existing FMs, or applications that use other FMs. You can start with the following services:

Amazon Q Business is a new type of generative AI-powered assistant. It can be tailored to your business to have conversations, solve problems, generate content, and take actions using the data and expertise found in your company’s information repositories, code bases, and enterprise systems. Amazon Q Business provides quick, relevant, and actionable information and advice to help streamline tasks, speed up decision-making and problem-solving, and help spark creativity and innovation.
Amazon Bedrock is a fully managed service that offers a choice of high-performing FMs from leading AI companies like AI21 Labs, Anthropic, Cohere, Meta, Mistral AI, Stability AI, and Amazon through a single API, along with a broad set of capabilities to build generative AI applications with security, privacy, and responsible AI. With Amazon Bedrock, you can experiment with and evaluate top FMs for your use case, privately customize them with your data using techniques such as fine-tuning and Retrieval Augmented Generation (RAG), and build agents that perform tasks using your enterprise systems and data sources.

In upcoming posts, we will dive deeper into different architectural patterns that describe how to use AWS generative AI services to solve for these different use cases.

Conclusion

Generative AI solutions are empowering legal professionals to reduce the difficulty in finding documents and performing summarization, and allow your business to standardize and modernize contract generation and revisions. These solutions do not envision to replace law experts, but instead increase their productivity and time working on practicing law.

We are excited about how legal professionals can build with generative AI on AWS. Start exploring our services and find out where generative AI could benefit your organization. Our mission is to make it possible for developers of all skill levels and for organizations of all sizes to innovate using generative AI in a secure and scalable manner. This just the beginning of what we believe will be the next wave of generative AI, powering new possibilities in legal tech.

Resources

About the Authors

Victor Fiss a Sr. Solution Architect Leader at AWS, helping customers in their cloud journey from infrastructure to generative AI solutions at scale. In his free time, he enjoys hiking and playing with his family.

Vineet Kachhawaha is a Sr. Solutions Architect at AWS focusing on AI/ML and generative AI. He co-leads the AWS for Legal Tech team within AWS. He is passionate about working with enterprise customers and partners to design, deploy, and scale AI/ML applications to derive business value.

Pallavi Nargund is a Principal Solutions Architect at AWS. She is a generative AI lead for East – Greenfield. She leads the AWS for Legal Tech team. She is passionate about women in technology and is a core member of Women in AI/ML at Amazon. She speaks at internal and external conferences such as AWS re:Invent, AWS Summits, and webinars. Pallavi holds a Bachelor’s of Engineering from the University of Pune, India. She lives in Edison, New Jersey, with her husband, two girls, and a Labrador pup.

Deploy generative AI agents in your contact center for voice and chat using Amazon Connect, Amazon Lex, and Amazon Bedrock Knowledge Bases

September 24, 2024

by Vraj Shah Amazon AWS

This post is co-written with Vraj Shah and Chaitanya Hari from DoorDash.

DoorDash connects consumers with their favorite local businesses in more than 30 countries across the globe. Recently, they faced a significant challenge in handling the high volume of calls from its contractor delivery workers, known as Dashers. With a user base of over 37 million active consumers and 2 million monthly active Dashers at the end of 2023, the company recognized the need to reduce the burden on its live agents by providing a more efficient self-service experience for Dashers.

To address this challenge, the contact center team at DoorDash wanted to harness the power of generative AI to deploy a solution quickly, and at scale, while maintaining their high standards for issue resolution and customer satisfaction. Dashers, who generally prefer calling into support rather than texting while they’re on the road, require fast and reliable assistance, with minimal response latency. This low latency requirement became a critical factor in DoorDash’s quest for an effective, voice-enabled self-service solution.

Working with the AWS Generative AI Innovation Center, DoorDash built a solution to provide Dashers with a low-latency self-service voice experience to answer frequently asked questions, reducing the need for live agent assistance, in just 2 months.

The solution uses Amazon Lex, a voice-enabled conversational AI service; Amazon Bedrock, a fully managed service that makes foundation models (FMs) from leading AI startups and Amazon available through an API, so you can choose from a wide range of FMs to find the model that is best suited for your use case; and Amazon Bedrock Knowledge Bases, a fully managed service that connects large language models (LLMs) to your data sources. It’s a fully serverless architecture that uses Amazon OpenSearch Serverless, which can run petabyte-scale workloads, without you having to manage the underlying infrastructure.

The solution fields hundreds of thousands of calls per day, responding to Dashers with answers to their questions in 2.5 seconds or less. It also enables operational capabilities including automated testing, conversation analytics, monitoring and observability, and LLM hallucination prevention and detection.

“We wanted to empower Dashers to get help with their most common questions and issues as quickly and efficiently as possible, saving them time, effort, and increasing their trust in DoorDash’s self-service capabilities,”

– Chaitanya Hari, Contact Center Product Lead at DoorDash.

In this post, we show how you can deploy generative AI agents in your contact center using AWS services.

Solution overview

For this post, we use a fictional hotel chain example called Example Corp Hospitality Group. The following is a sample interaction between a customer and the LLM-powered automated agent:

Customer: Hi, we’re looking for a great place for a family vacation

Agent: Example Corp Family Getaways offers family-friendly hotels and resorts located near top attractions like theme parks, national parks, and historic sites. The resorts feature spacious accommodations, activities for all ages, and diverse dining options to create unforgettable family memories.

Customer: Where are they located?

Agent: Example Corp Family Getaways offers family-friendly resorts in a variety of popular vacation destinations, including Orlando, Florida; Anaheim, California; Williamsburg, Virginia; Pigeon Forge, Tennessee; San Antonio, Texas; Scottsdale, Arizona; Branson, Missouri; Wisconsin Dells, Wisconsin; and Hershey, Pennsylvania.

Customer: Pigeon Forge sounds interesting.

Agent: Our Pigeon Forge resort offers families a chance to experience the great outdoors while enjoying the excitement of Dollywood and other nearby attractions. Guests can enjoy our indoor water park, mini-golf course, and nightly s’mores roasts around the fire pit.

Customer: Do they have kids’ menus?

Agent: Example Corp Family Getaways resorts in Pigeon Forge, Tennessee feature family-friendly dining options, including dedicated kids’ menus with favorites like pizza, chicken nuggets, and mac and cheese.

You can deploy the solution in your own AWS account and try the example solution. The following diagram illustrates the solution architecture.

We will walk you through deploying and testing these major components of the solution:

An AWS CloudFormation stack to set up an Amazon Bedrock knowledge base, where you store the content used by the solution to answer questions.
A CloudFormation stack to create an Amazon Lex bot and an AWS Lambda fulfillment function, which implement the core Retrieval Augmented Generation (RAG) question answering capability.
An optional CloudFormation stack to deploy a data pipeline to enable a conversation analytics dashboard.
An optional CloudFormation stack to enable an asynchronous LLM hallucination detection feature.
Optional Jupyter notebooks in Amazon SageMaker that provide an automated testing capability that compares generated answers to ground truth answers, providing pass/fail grades with explanations.

Everything you need is also provided as open source in our GitHub repo.

Prerequisites

You need to have an AWS account and an AWS Identity and Access Management (IAM) role and user with permissions to create and manage the necessary resources and components for this application. If you don’t have an AWS account, see How do I create and activate a new Amazon Web Services account?

This solution uses Amazon Bedrock LLMs to find answers to questions from your knowledge base. Before proceeding, if you have not previously done so, request access to at least the following Amazon Bedrock models:

Amazon Titan Embeddings G1 – Text
Cohere Embed English v3 and Cohere Embed Multilingual v3
Anthropic’s Claude 3 Haiku and Anthropic’s Claude 3 Sonnet

If you’ll be integrating with Amazon Connect, make sure you have an instance available in your account. If you don’t already have one, you can create one. If you plan to deploy the conversation analytics stack, you need Amazon QuickSight, so make sure you have enabled it in your AWS account.

At the time of writing, this solution is available in the following AWS Regions: Asia Pacific (Singapore, Sydney, Tokyo), Canada (Central), Europe (Frankfurt, London), US East (N. Virginia), and US West (Oregon).

Deploy the Amazon Bedrock knowledge base

You can use the provided CloudFormation stack for the Amazon Bedrock knowledge base instances you may need using Amazon Simple Storage Service (Amazon S3) as a data source. Complete the following steps to set up your knowledge base:

Provide a stack name, for example contact-center-kb.
Provide the name for an existing S3 bucket, for example contact-center-kb-(your-account-number). This is where the content for the demo solution will be stored. Create this S3 bucket if you don’t already have one.
Do not specify an S3 prefix.
Choose an embedding model, such as amazon.titan-embed-text-v2:0.
Choose the Fixed-sized chunking chunking strategy.
For the maximum tokens per chunk entry, use 600 for the Amazon Titan embeddings model. (If you are using the Cohere embeddings model, use 512). This represents about a full page of text.
For the percentage overlap, use 10%.
Leave the four entries for Index Details at their default values (index name, vector field name, metadata field name, and text field name).
Choose Next.
On the Configure stack options page, choose Next
On the Review and create page, acknowledge the IAM capabilities message and choose Submit.

The stack will take about 10 minutes to deploy.

Upload the sample content and test your knowledge base

The demonstration sample for the solution includes an LLM-based hotel-bot that can answer questions about the fictional hotel chain Example Corp Hospitality Group. You need to load the content for this hotel chain into the S3 bucket that you specified for the knowledge base stack. You can find the S3 bucket used by the CloudFormation stack on the Outputs tab for the stack.

Either using the AWS Command Line Interface (AWS CLI) or the AWS Management Console, upload the following folders from the content section of the GitHub repo:

- corporate
- family-getaways
- luxury-suites
- party-times
- seaside-resorts
- waypoint-inns

You can choose either the PDF versions or the Word document versions (Word versions recommended). When you’re done, the top level of your S3 bucket should contain six folders, each containing a single Word or PDF document.

On the Amazon Bedrock console, choose Knowledge bases in the navigation pane.
Choose your new knowledge base to open it.

A message appears that says “One or more data sources have not been synced.”

Select the data source and choose Sync.

The sync process should only take a minute or two.

After your data source has been synced, you can try some question answering on the Amazon Bedrock console. Make sure you have enabled all the models approved by your organization on the Amazon Bedrock Model access page.

Select an LLM model, such as Anthropic’s Claude 3 Haiku on Amazon Bedrock, and start asking questions! You might want to peruse the sample documents you uploaded for some ideas about questions to ask.

Deploy the hallucination detection stack (optional)

If you want to use the optional asynchronous hallucination detection feature, deploy this stack. Otherwise, move on to the next section. You can use this CloudFormation stack for any RAG-based solution requiring asynchronous hallucination detection.

Choose Launch Stack:

Provide a stack name, for example contact-center-hallucination-detection.
Specify an LLM to perform the hallucination detection. At the time of writing, there are seven LLMs that are recommended for hallucination detection. For the demo solution, choose the default (Claude V3 Sonnet).
Optionally, create an Amazon Key Management Service (AWS KMS) customer managed key (CMK) to encrypt the Amazon Simple Queue Service (Amazon SQS) queue and the Amazon CloudWatch Logs log group for the Lambda function (recommended for production).

There are two types of Amazon CloudWatch alarms in this stack:

ERROR alarms – For code issues with the Lambda function that does the hallucination detection work
WARNING alarms – For when the Lambda function actually detects a hallucination

Both alarm types are optional, but recommended.

Choose yes to enable or no to disable the alarms.
For the alarms that you enable, you can specify an optional email address or distribution list to receive email notifications about the alarms.
Choose Next.
On the Configure stack options page, choose Next
On the Review and create page, acknowledge the IAM capabilities message and choose Submit.

The stack will take about a minute or two to deploy.

When the stack is complete, you can review the resources it creates on the Resources tab for the CloudFormation stack. In particular, review the Lambda function code.

If you entered email addresses for the alarm notifications, you should receive email requests asking you to confirm the subscriptions. Confirm them to receive email notifications about alarms that may occur.

Deploy the RAG solution stack

If you’re integrating with Amazon Connect, make sure you have an instance available in your account. If you don’t already have one, you can create one. Then complete the following steps to deploy the Amazon Lex bot and Lambda fulfillment function:

Choose Launch Stack:

Provide a stack name, for example contact-center-rag-solution.
Provide a name for the Amazon Lex bot, for example hotel-bot.
Specify the number of conversation turns to retain for context. This can be optimized for different use cases and datasets. For the hotel-bot demo, try the default of 4.
Optionally, specify an existing CloudWatch Logs log group ARN for the Amazon Lex conversation logs. You’ll need this if you’re planning to deploy the conversation analytics stack. Create a log group if you don’t already have one.
Optionally, enter a value for Lambda provisioned concurrency units for the Amazon Lex bot handler function. If set to a non-zero number, this will prevent Lambda cold starts and is recommended for production and for internal testing. For development, 0 or 1 is recommended.
Optionally, select the option to create a KMS CMK to encrypt the CloudWatch Logs log groups for the Lambda functions (recommended for production).
If you’re integrating with Amazon Connect, provide the Amazon Connect instance ARN, as well as the name for a new contact flow that the stack will create for you.
Provide the knowledge base ID from the knowledge base stack you just created. You can find this on the Outputs tab of the knowledge base stack.
Provide the S3 bucket used by the knowledge base stack (also referenced on the Outputs tab).
If you created the hallucination detection stack, enter the SQS queue name. You can find this on the Outputs tab of the hallucination detection stack.
If you opted for a KMS key for your hallucination detection stack, enter the KMS key ARN.
Choose Next.
On the Configure stack options page, choose Next
On the Review and create page, acknowledge the IAM capabilities message and choose Submit.

The stack will take a few minutes to complete.

To try the RAG solution, navigate to the Amazon Lex console and open the hotel-bot bot. The bot has a single language section for the English language. Choose Intents in the navigation pane to check out the intents for this sample bot. They include the following:

Intents related to questions about the hotel chain and its various hotel brands – This includes Accommodations, Amenities, CorporateOverview, Locations, Parking, and more. These intents are routed to the RAG solution by Amazon Lex. Technically, intents like these could be omitted, allowing the FallbackIntent to handle requests of this nature. However, including these intents (and their sample utterances) provides Amazon Lex with information about the “language” of your solution domain, allowing it to better optimize its speech-to-text engine and improve speech transcription accuracy. In addition, including these intents is useful for conversation analytics.
SwitchBrand – This intent is designed to improve conversation flow by allowing the user to say things like “What about at your other hotels?” in the middle of a conversation.
Booking – This demonstrates an example of routing the caller to a live agent queue.
SpeakToAgent – This intent is for when a caller specifically requests a live agent.
Welcome, Goodbye, and Help – These conversation support intents are for starting and ending the conversation, or asking what the bot can do.
FallbackIntent – This is the standard intent for questions or requests that don’t match other intents. In this example solution, such requests are also routed to the RAG solution to allow the LLM to answer based on the content in the knowledge base.
SelectKnowledgeBase and SelectLLM – These allow the user to direct the RAG solution to use a different knowledge base instance (if more than one is available) or a different LLM. These intents are designed for testing purposes, and should normally be included only in non-production deployments. You can test the RAG solution with any of the LLMs available on Amazon Bedrock. You can also switch to a different knowledge base or LLM mid-conversation, if desired.
ToggleLLMGuardrails and ToggleLLMContext – These allow the user to turn the prompt-based LLM guardrails off or on, and to disable or enable the retrieval of information from the knowledge base. These intents are designed for testing purposes, and should normally be included only in non-production environments. You can turn these settings off and on mid-conversation, if desired.

You can choose Test on the Amazon Lex console to try the solution.

Try some sample conversations, for example:

Ask “We’re looking for a nice place for a family vacation” and the bot will respond “Example Corp Family Getaways offers family-friendly accommodations…”
Ask “Where are they located?” and the bot will respond “Example Corp Family Getaways has locations in…”
Ask “Tell me more about the one in Pigeon Forge” and the bot will respond “The Example Corp Family Getaways resort in Pigeon Forge, Tennessee is…”

You can refer to the sample documents you uploaded for some ideas about questions to ask.

If you deployed the hallucination detection stack, you can look at its assessment of the answers you got when you tested. From the hallucination detection stack details page, on the Resources tab, choose the HallucinationDetectionFunctionLogGroup entry. This opens the CloudWatch Logs log group for the Lambda hallucination detection function. You can inspect the log statements to observe the hallucination detection process in action, as shown in the following screenshot.

If you’re integrating with Amazon Connect, there will be a new contact flow in the Amazon Connect instance you specified, as shown in the following screenshot.

To test using voice, just claim a phone number, associate it with this contact flow, and give it a call!

Deploy the conversation analytics stack (optional)

This stack uses QuickSight for analytics, so make sure you have already enabled it in your AWS account before deploying this stack.

Choose Launch Stack:

Provide a stack name, for example contact-center-analytics.
Provide the name (not the ARN) of the Amazon Lex conversation logs log group. This is the same CloudWatch Logs log group you used for the the RAG solution CloudFormation stack.
Choose an option for purging source log streams from the log group. For testing, choose no.
Choose an option for redacting sensitive data using from the conversation logs. For testing, choose no.
Leave the personally identifiable information (PII) entity types and confidence score thresholds at their default values.
Choose an option for allowing unredacted logs for the Lambda function in the data pipeline. For testing, choose yes.
Select an option for creating a KMS CMK.

If you create a CMK, it will be used to encrypt the data in the S3 bucket that this stack creates, where the normalized conversation data is housed. This allows you to control which IAM principals are allowed to decrypt the data and view it. This setting is recommended for production.

Select the options for enabling CloudWatch alarms for ERRORS and WARNINGS in the Amazon Lex data pipeline. It is recommended to enable these alarms.
For the alarms that you enable, you can specify an optional email address or distribution list to receive email notifications about the alarms.
Choose Next.
On the Configure stack options page, choose Next
On the Review and create page, acknowledge the IAM capabilities message and choose Submit.

The stack should about 5 minutes to complete.

The following diagram illustrates the architecture of the stack.

As Amazon Lex writes conversation log entries to CloudWatch Logs (1), they are picked up by Amazon Data Firehose and streamed to an S3 bucket (2). Along the way, a Lambda transformation function (3) simplifies the JSON structure of the data to make it more user-friendly for querying purposes. The Lambda function can also redact sensitive data using Amazon Comprehend (4), and optionally purge the entries from the CloudWatch Logs log group as it consumes them.

On a scheduled basis (every 5 minutes), an AWS Glue crawler (5) inspects new data in the S3 bucket, and updates a data schema that is used by Amazon Athena (6) to provide a SQL interface to the data. This allows tools like QuickSight (7) to create near real-time dashboards, analytics, and visualizations of the data.

Set up the QuickSight dashboard (optional)

Before you create the QuickSight dashboard, make sure to return to the Amazon Lex console and ask a few questions, in order to generate some data for the dashboard. It will take about 5 minutes for the pipeline to process this new conversation data and make it available to QuickSight.

To set up dashboards and visualizations in QuickSight, complete the following steps:

On the QuickSight console, choose the user profile icon and choose Manage QuickSight.
Under Security & permissions, choose Manage in the QuickSight access to AWS services
Under Amazon S3, choose Select S3 buckets.
Enable access to the S3 bucket created by the conversation analytics stack (it will have a name with a 12-character unique identifier prepended to lex-conversation-logs). You don’t need to enable write permissions.
Choose Finish, then choose Save.
Choose the QuickSight menu icon to return to the main page in QuickSight.
In the navigation pane, choose Datasets.
Choose New dataset.
From the list of dataset sources, choose Athena.
Enter a data source name (for example contact-center-analytics).
Choose Create data source.
In the Choose your table window, choose your database, select your lex_conversation_logs table, and choose Edit/Preview data.

This opens your new QuickSight dataset. You can review the various attributes available, and see some results from your testing.

For improved speed in displaying the data, you can select the SPICE option for Query mode, but that will mean you need to refresh SPICE (or set up an hourly auto-update schedule) when you want to see data updates based on additional testing.

For now, leave the setting as Direct query.
When you’re ready, choose PUBLISH & VISUALIZE.
In the New sheet window, keep the defaults and choose CREATE.

This opens the analysis page, where you can start creating visualizations.

Automated testing notebooks (optional)

To try the automated testing capability, you need a SageMaker Jupyter notebook. Alternatively, you can run the notebooks locally in your integrated development environment (IDE) or other environment that supports Jupyter notebooks.

On the SageMaker console, under Notebook in the navigation pane, choose Notebook instances.
Choose Create notebook instance.
Give your notebook a name, such as contact-center-rag-testing.
To enable multi-threaded testing, it’s recommended to select a larger instance, such as ml.m5.2xlarge (which has 8 vCPUs) or ml.m5.4xlarge (which has 16 vCPUs). Don’t forget to stop them when they’re not in use.
Keep the default setting for Platform identifier (Amazon Linux 2, Jupyter Lab 3).
Under Additional configuration, increase the Volume size in GB setting to 50 GB.
In the Permissions and encryption section, under IAM role, choose Create a new role in the drop down list (don’t use the role creation wizard).
In the Create an IAM role window, you can specify any S3 buckets you want to provide access to (none are needed for this solution).
Choose Create role.

Choose Create notebook instance.

It will take several minutes for your notebook instance to become available. While it’s being created, you can update the IAM role to add some inline policies you’ll need for accessing Amazon Bedrock and Amazon Lex.

On the Notebook instances page, open your notebook instance (for example, contact-center-rag-testing) and then choose the entry under IAM role ARN to open the role.
Add the following inline policies (available in the notebooks/iam-roles folder in the GitHub repository):

You can revise these roles to limit resource access as needed.

After your notebook instance has started, choose Open Jupyter to open the notebook.
Upload the following to your notebook instance (if desired, you can zip the files locally, upload the zip archive, and then unzip it in SageMaker):
1. bedrock_helpers.py – This script configures LLM instances for the notebooks.
2. bedrock_utils – You should make sure to upload all subfolders and files, and confirm that the folder structure is correct.
3. run_tests.ipynb – This notebook runs a set of test cases.
4. generate_ground_truths.ipynb – Given a set of questions, this notebook generates potential ground truth answers.
5. test-runs – This folder should contain Excel workbooks.
Open the run_tests.ipynb notebook.
In the second cell, replace the bot_id and bot_alias_id values with the values for your Amazon Lex bot (you can find these on the Outputs tab of the RAG solution stack).
After you updated these values, choose Restart & Run All on the Kernel

If you’re using a ml.m5.2xlarge instance type, it should take about a minute to run the 50 test cases in the test-runs/test-cases-claude-haiku-2024-09-02.xlsx workbook. When it’s complete, you should find a corresponding test-results workbook in the test-runs folder in your notebook.

After a few minutes, you can also see the test results in your conversation analytics dashboard.

Adapt the solution to your use case

You can adapt this solution to your specific use cases with minimal work:

Replace the Amazon Bedrock Knowledge Bases sample content with your content – Replace the content in the S3 bucket and organize it into a folder structure that makes sense for your use case. You can create a new knowledge base for your content.
Replace the intents in the Amazon Lex bot with intents for your use case – Modify the Amazon Lex bot definition to reflect the interactions you want to enable for your use case.
Modify the LLM prompts in the bedrock_utils code – In the Amazon Lex bot fulfillment Lambda function, review the LLM prompt definitions in the bedrock_utils folder. For example, provide a use case-specific definition for the role of the LLM-based agent.
Modify the bot handler code if necessary – In the Amazon Lex bot fulfillment Lambda function, review the code in the TopicIntentHandler.py function. For the knowledge base search, this code provides an example that uses the sample hotel brands as topics. You can replace this metadata search query with one appropriate for your use cases.

Clean up

Congratulations! You have completed all the steps for setting up your voice-enabled contact center generative AI agent solution using AWS services.

When you no longer need the solution deployed in your AWS account, you can delete the CloudFormation stacks that you deployed, as well as the SageMaker notebook instance if you created one.

Conclusion

The contact center generative AI agent solution offers a scalable, cost-effective approach to automate Q&A conversations in your contact center, using AWS services like Amazon Bedrock, Amazon Bedrock Knowledge Bases, OpenSearch Serverless, and Amazon Lex.

The solution code is provided as open source—use it as a starting point for your own solution, and help us make it better by contributing back fixes and features through GitHub pull requests. Browse to the GitHub repository to explore the code, and check the CHANGELOG for the latest changes and the README for the latest documentation updates.

For expert assistance, the AWS Generative AI Innovation Center, AWS Professional Services, and our AWS Partners are here to help.

About the Authors

Vraj Shah is a Connect Developer at DoorDash.

Chaitanya Hari is a Voice/Contact Center Product Lead at DoorDash.

Marcelo Silva is a Principal Product Manager at Amazon Web Services, leading strategy and growth for Amazon Bedrock Knowledge Bases and Amazon Lex.

Adam Diesterhaft is a Sr. Pursuit Solutions Architect on the Amazon Connect team.

Brian Yost is a Principal Deep Learning Architect in the AWS Generative AI Innovation Center.

Migrating to Amazon SageMaker: Karini AI Cut Costs by 23%

September 24, 2024

by Deepali Rajale Amazon AWS

This post is co-written with Deepali Rajale from Karini AI.

Karini AI, a leading generative AI foundation platform built on AWS, empowers customers to quickly build secure, high-quality generative AI apps. GenAI is not just a technology; it’s a transformational tool that is changing how businesses use technology. Depending on where they are in the adoption journey, the adoption of generative AI presents a significant challenge for enterprises. While pilot projects using Generative AI can start effortlessly, most enterprises need help progressing beyond this phase. According to Everest Research, more than a staggering 50% of projects do not move beyond the pilots as they face hurdles due to the absence of standardized or established GenAI operational practices.

Karini AI offers a robust, user-friendly GenAI foundation platform that empowers enterprises to build, manage, and deploy Generative AI applications. It allows beginners and expert practitioners to develop and deploy Gen AI applications for various use cases beyond simple chatbots, including agentic, multi-agentic, Generative BI, and batch workflows. The no-code platform is ideal for quick experimentation, building PoCs, and rapid transition to production with built-in guardrails for safety and observability for troubleshooting. The platform includes an offline and online quality evaluation framework to assess quality during experimentation and continuously monitor applications post-deployment. Karini AI’s intuitive prompt playground allows authoring prompts, comparison with different models across providers, prompt management, and prompt tuning. It supports iterative testing of more straightforward, agentic, and multi-agentic prompts. For production deployment, the no-code recipes enable easy assembly of the data ingestion pipeline to create a knowledge base and deployment of RAG or agentic chains. The platform owners can monitor costs and performance in real-time with detailed observability and seamlessly integrate with Amazon Bedrock for LLM inference, benefiting from extensive enterprise connectors and data preprocessing techniques.

The following diagram illustrates how Karini AI delivers a comprehensive Generative AI foundational platform encompassing the entire application lifecycle. This platform delivers a holistic solution that speeds up time to market and optimizes resource utilization by providing a unified framework for development, deployment, and management.

In this post, we share how Karini AI’s migration of vector embedding models from Kubernetes to Amazon SageMaker endpoints improved concurrency by 30% and saved over 23% in infrastructure costs.

Karini AI’s Data Ingestion Pipeline for creating vector embeddings

Enriching large language models (LLMs) with new data is crucial to building practical generative AI applications. This is where Retrieval Augmented Generation (RAG) comes into play. RAG enhances LLMs’ capabilities by incorporating external data and producing state-of-the-art performance in knowledge-intensive tasks. Karini AI offers no-code solutions for creating Generative AI applications using RAG. These solutions include two primary components: a data ingestion pipeline for building a knowledge base and a system for knowledge retrieval and summarization. Together, these pipelines simplify the development process, enabling the creation of powerful AI applications with ease.

Data Ingestion Pipeline

Ingesting data from diverse sources is essential for executing Retrieval Augmented Generation (RAG). Karini AI’s data ingestion pipeline enables connection to multiple data sources, including Amazon S3, Amazon Redshift, Amazon Relational Database Service (RDS), websites and Confluence, handling structured and unstructured data. This source data is pre-processed, chunked, and transformed into vector embeddings before being stored in a vector database for retrieval. Karini AI’s platform provides flexibility by offering a range of embedding models from their model hub, simplifying the creation of vector embeddings for advanced AI applications.

Here is a screenshot of Karini AI’s no-code data ingestion pipeline.

Karini AI’s model hub streamlines adding models by integrating with leading foundation model providers such as Amazon Bedrock and self-managed serving platforms.

Infrastructure challenges

As customers explore complex use cases and datasets grow in size and complexity, Karini AI scales the data ingestion process efficiently to provide high concurrency for creating vector embeddings using state-of-the-art embedding models, such as those listed in the MTEB leaderboard, which are rapidly evolving and unavailable on managed platforms.

Before migrating to Amazon SageMaker, we deployed our models on self-managed Kubernetes(K8s) on EC2 instances. Kubernetes offered significant flexibility to deploy models from HuggingFace quickly, but soon, our engineering had to manage many aspects of scaling and deployment. We faced the following challenges with our existing setup that must be addressed to improve efficiency and performance.

Keeping up with SOTA(State-Of-The-Art) models: We managed different deployment manifests for each model type (such as classifiers, embeddings, and autocomplete), which was time-consuming and error-prone. We also had to maintain the logic to determine the memory allocation for different model types.
Managing dynamic concurrency was hard: A significant challenge with using models hosted on Kubernetes was achieving the highest dynamic concurrency level. We aimed to maximize endpoint performance to achieve target transactions per second (TPS) while meeting strict latency requirements.
Higher Costs: While Kubernetes (K8s) provides robust capabilities, it has become more costly due to the dynamic nature of data ingestion pipelines, which results in under-utilized instances and higher costs.

Our search for an inference platform led us to Amazon SageMaker, a solution that efficiently manages our models for higher concurrency, meets customer SLAs, and scales down serving when not needed. The reliability of SageMaker’s performance gave us confidence in its capabilities.

Amazon SageMaker for Model Serving

Choosing Amazon SageMaker was a strategic decision for Karini AI. It balanced the need for higher concurrencies at a lower cost, providing a cost-effective solution for our needs. SageMaker’s ability to scale and maximize concurrency while ensuring sub-second latency addresses various generative AI use cases making it a long-lasting investment for our platform.

Amazon SageMaker is a fully managed service that allows developers and data scientists to quickly build, train, and deploy machine learning (ML) models. With SageMaker, you can deploy your ML models on hosted endpoints and get real-time inference results. You can easily view the performance metrics for your endpoints in Amazon CloudWatch, automatically scale endpoints based on traffic, and update your models in production without losing any availability.

Karini AI’s data ingestion pipeline architecture with Amazon SageMaker Model endpoint is here.

Advantages of using SageMaker hosting

Amazon SageMaker offered our Gen AI ingestion pipeline many direct and indirect benefits.

Technical Debt Mitigation: Amazon SageMaker, being a managed service, allowed us to free our ML engineers from the burden of inference, enabling them to focus more on our core platform features—this relief from technical debt is a significant advantage of using SageMaker, reassuring us of its efficiency.
Meet customer SLAs: Knowledgebase creation is a dynamic task that may require higher concurrencies during vector embedding generation and minuscule load during query time. Based on customer SLAs and data volume, we can choose batch inference, real-time hosting with auto-scaling, or serverless hosting. Amazon SageMaker also provides recommendations for instance types suitable for embedding models.
Reduced Infrastructure cost: SageMaker is a pay-as-you-go service that allows you to create batch or real-time endpoints when there is demand and destroy them when work is complete. This approach reduced our infrastructure cost by more than 23% over the Kubernetes (K8s) platform.
SageMaker Jumpstart: SageMaker Jumpstart provides access to SOTA (State-Of-The-Art) models and optimized inference containers, making it ideal for creating new models that are accessible to our customers.
Amazon Bedrock compatibility: Karini AI integrates with Amazon Bedrock for LLM (Large Language Model) inference. The custom model import feature allows us to reuse the model weights used in SageMaker model hosting in Amazon Bedrock to maintain a joint code base and interchange serving between Bedrock and SageMaker as per the workload.

Conclusion

Karini AI significantly improved, achieving high performance and reducing model hosting costs by migrating to Amazon SageMaker. We can deploy custom third-party models to SageMaker and quickly make them available to Karini’s model hub for data ingestion pipelines. We can optimize our infrastructure configuration for model hosting as needed, depending on model size and our expected TPS. Using Amazon SagaMaker for model inference enabled Karini AI to handle increasing data complexities efficiently and meet concurrency needs while optimizing costs. Moreover, Amazon SageMaker allows easy integration and swapping of new models, ensuring that our customers can continuously leverage the latest advancements in AI technology without compromising performance or incurring unnecessary incremental costs.

Amazon SageMaker and Karini.ai offer a powerful platform to build, train, and deploy machine learning models at scale. By leveraging these tools, you can:

Accelerate development:Build and train models faster with pre-built algorithms and frameworks.
Enhance accuracy: Benefit from advanced algorithms and techniques for improved model performance.
Scale effortlessly:Deploy models to production with ease and handle increasing workloads.
Reduce costs:Optimize resource utilization and minimize operational overhead.

Don’t miss out on this opportunity to gain a competitive edge.

About Authors

Deepali Rajale is the founder of Karini AI, which is on a mission to democratize generative AI across enterprises. She enjoys blogging about Generative AI and coaching customers to optimize Generative AI practice. In her spare time, she enjoys traveling, seeking new experiences, and keeping up with the latest technology trends. You can find her on LinkedIn.

Ravindra Gupta is the Worldwide GTM lead for SageMaker and with a passion to help customers adopt SageMaker for their Machine Learning/ GenAI workloads. Ravi is fond of learning new technologies, and enjoy mentoring startups on their Machine Learning practice. You can find him on Linkedin

Harnessing the power of AI to drive equitable climate solutions: The AI for Equity Challenge

September 24, 2024

by Joe Fontaine Amazon AWS

The climate crisis is one of the greatest challenges facing our world today. Its impacts are far-reaching, affecting every aspect of our lives—from public health and food security to economic stability and social justice. What’s more, the effects of climate change disproportionately burden the world’s most vulnerable populations, exacerbating existing inequities around gender, race, and socioeconomic status.

But we have the power to create change. By harnessing the transformative potential of AI, we can develop innovative solutions to tackle the intersectional challenges at the heart of the climate crisis. That’s why the International Research Centre on Artificial Intelligence (IRCAI), Zindi, and Amazon Web Services (AWS) are proud to announce the launch of the “AI for Equity Challenge: Climate Action, Gender, and Health”—a global virtual competition aimed at empowering organizations to use advanced AI and cloud technologies to drive real-world impact with a focus on benefitting vulnerable populations around the world.

Aligning with the United Nations Sustainable Development Goals (SDGs) 3, 5, and 13—focused on good health and well-being, gender equality, and climate action respectively—this challenge seeks to uncover the most promising AI-powered solutions that address the compounding issues of climate change, gender equity, and public health. By bringing together a diverse global community of innovators, we hope to accelerate the development of equitable, sustainable, and impactful applications of AI for the greater good.

“As artificial intelligence rapidly evolves, it is crucial that we harness its potential to address real-world challenges. At IRCAI, our mission is to guide the ethical development of AI technologies, ensuring they serve the greater good and are inclusive of marginalized AI communities. This challenge, in collaboration with AWS, is an opportunity to discover and support the most innovative minds that are using AI and advanced computing to create impactful solutions for the climate crisis.”

– Davor Orlic, COO at IRCAI.

The challenge will unfold in two phases, welcoming both ideators and solution builders to participate. In the first phase, organizations are invited to submit technical proposals outlining specific challenges at the intersection of climate action, gender equity, and health that they aim to address using AI and cloud technologies. A steering committee convened by IRCAI will evaluate these proposals based on criteria such as innovation, feasibility, and potential for global impact. The competition will be judged and mentored in collaboration with NAIXUS, a network of AI and sustainable development research organizations.

The top two winning proposals from the first phase will then advance to the second round, where they will serve as the foundation for two AI challenges hosted on the Zindi platform. During this phase, developers and data scientists from around the world will compete to build the most successful AI-powered solutions to tackle the real-world problems identified by the first-round winners.

AI for Equity Challenge Timeline

The winning AI solutions from the second-round challenges will belong entirely to the organizations that submitted the original winning proposals, who will also receive $15,000 in AWS credits and technical support from AWS and IRCAI to help implement their solutions. Additionally, the first-place teams in each of the two final Zindi challenges will receive cash prizes of $6,000, $4,000, and $2,500 for first, second, and third place respectively.

But the true reward goes beyond the prizes. By participating in this challenge, organizations and individuals alike will have the opportunity to make a lasting impact on the lives of those most vulnerable to the effects of climate change. Through the power of AI and advanced cloud computing, we can develop groundbreaking solutions that empower women, improve public health outcomes, and drive sustainable progress on the climate action front.

Throughout the hackathon, participants will have access to a wealth of resources, including mentorship from industry experts, training materials, and AWS cloud computing resources. Amazon Sustainability Data Initiative (ASDI), a collaboration between AWS and leading scientific organizations, provides a catalog of over 200 datasets spanning climate projections, satellite imagery, air quality data, and more, enabling participants to build robust and data-driven solutions.

“Climate change is one of the greatest threats of our time, and we believe innovation is key to overcoming it. The AI for Equity Challenge invites innovators to bring forward their most visionary ideas, and we’ll support them with AWS resources — whether that’s computing power or advanced cloud technologies — to turn those ideas into reality. Our goal is to drive cloud innovation, support sustainability solutions, and make a meaningful impact on the climate crisis.”

– Dave Levy, Vice President of Worldwide Public Sector, AWS

This initiative is made possible through the support of ASDI, which provides researchers, scientists, and innovators with access to a wealth of publicly available datasets on AWS to advance their sustainability-focused work. The AI for Equity Challenge: Climate Action, Gender, and Health is open for submissions from September 23 to November 4, 2024. The two winning proposals from the first round will be announced on December 2, 2024, with the final AI challenge winners revealed on February 12, 2025.

Don’t miss your chance to be part of the solution. Visit https://zindi.africa/ai-equity-challenge to learn more and submit your proposal today. Together, we can harness the power of AI to create a more sustainable, equitable, and just world for generations to come.

Visit http://zindi.africa/ai-equity-challenge to learn more and participate.

This contest is hosted in collaboration with:

About the author

Joe Fontaine is the Product marketing lead for AWS AI Builder Programs. He is passionate about making machine learning more accessible to all through hands-on educational experiences. Outside of work he enjoys freeride mountain biking, aerial cinematography, and exploring the wilderness with his family.

Enhancing Just Walk Out technology with multi-modal AI

September 23, 2024

by Tian Lan Amazon AWS

Since its launch in 2018, Just Walk Out technology by Amazon has transformed the shopping experience by allowing customers to enter a store, pick up items, and leave without standing in line to pay. You can find this checkout-free technology in over 180 third-party locations worldwide, including travel retailers, sports stadiums, entertainment venues, conference centers, theme parks, convenience stores, hospitals, and college campuses. Just Walk Out technology’s end-to-end system automatically determines which products each customer chose in the store and provides digital receipts, eliminating the need for checkout lines.

In this post, we showcase the latest generation of Just Walk Out technology by Amazon, powered by a multi-modal foundation model (FM). We designed this multi-modal FM for physical stores using a transformer-based architecture similar to that underlying many generative artificial intelligence (AI) applications. The model will help retailers generate highly accurate shopping receipts using data from multiple inputs including a network of overhead video cameras, specialized weight sensors on shelves, digital floor plans, and catalog images of products. To put it in plain terms, a multi-modal model means using data from multiple inputs.

Our research and development (R&D) investments in state-of-the-art multi-modal FMs enables the Just Walk Out system to be deployed in a wide range of shopping situations with greater accuracy and at lower cost. Similar to large language models (LLMs) that generate text, the new Just Walk Out system is designed to generate an accurate sales receipt for every shopper visiting the store.

The challenge: Tackling complicated long-tail shopping scenarios

Because of their innovative checkout-free environment, Just Walk Out stores presented us with a unique technical challenge. Retailers and shoppers as well as Amazon demand nearly 100 percent checkout accuracy, even in the most complex shopping situations. These include unusual shopping behaviors that can create a long and complicated sequence of activities requiring additional effort to analyze what happened.

Previous generations of the Just Walk Out system utilized a modular architecture; it tackled complex shopping situations by breaking down the shopper’s visit into discrete tasks, such as detecting shopper interactions, tracking items, identifying products, and counting what is selected. These individual components were then integrated into sequential pipelines to enable the overall system functionality. While this approach produced highly accurate receipts, significant engineering efforts are required to address challenges in new, previously unencountered situations and complex shopping scenarios. This limitation restricted the scalability of this approach.

The solution: Just Walk Out multi-modal AI

To meet these challenges, we introduced a new multi-modal FM that we designed specifically for retail store environments, enabling Just Walk Out technology to handle complex real-world shopping scenarios. The new multi-modal FM further enhances the Just Walk Out system’s capabilities by generalizing more effectively to new store formats, products, and customer behaviors, which is crucial for scaling up Just Walk Out technology.

The incorporation of continuous learning enables the model training to automatically adapt and learn from new challenging scenarios as they arise. This self-improving capability helps ensure the system maintains high performance, even as shopping environments continue to evolve.

Through this combination of end-to-end learning and enhanced generalization, the Just Walk Out system can tackle a wider range of dynamic and complex retail settings. Retailers can confidently deploy this technology, knowing it will provide a frictionless checkout-free experience for their customers.

The following video shows our system’s architecture in action.

Key elements of our Just Walk Out multi-modal AI model include:

Flexible data inputs –The system tracks how users interact with products and fixtures, such as shelves or fridges. It primarily relies on multi-view video feeds as inputs, using weight sensors solely to track small items. The model maintains a digital 3D representation of the store and can access catalog images to identify products, even if the shopper returns items to the shelf incorrectly.
Multi-modal AI tokens to represent shoppers’ journeys – The multi-modal data inputs are processed by the encoders, which compress them into transformer tokens, the basic unit of input for the receipt model. This allows the model to interpret hand movements, differentiate between items, and accurately count the number of items picked up or returned to the shelf with speed and precision.
Continuously updating receipts – The system uses tokens to create digital receipts for each shopper. It can differentiate between different shopper sessions and dynamically updates each receipt as they pick up or return items.

Training the Just Walk Out FM

By feeding vast amounts of multi-modal data into the Just Walk Out FM, we found it could consistently generate—or, technically, “predict”— accurate receipts for shoppers. To improve accuracy, we designed over 10 auxiliary tasks, such as detection, tracking, image segmentation, grounding (linking abstract concepts to real-world objects), and activity recognition. All of these are learned within a single model, enhancing the model’s ability to handle new, never-before-seen store formats, products, and customer behaviors. This is crucial for bringing Just Walk Out technology to new locations.

AI model training—in which curated data is fed to selected algorithms—helps the system refine itself to produce accurate results. We quickly discovered we could accelerate the training of our model by using a data flywheel that continuously mines and labels high-quality data in a self-reinforcing cycle. The system is designed to integrate these progressive improvements with minimal manual intervention. The following diagram illustrates the process.

To train an FM effectively, we invested in a robust infrastructure that can efficiently process the massive amounts of data needed to train high-capacity neural networks that mimic human decision-making. We built the infrastructure for our Just Walk Out model with the help of several Amazon Web Services (AWS) services, including Amazon Simple Storage Service (Amazon S3) for data storage and Amazon SageMaker for training.

Here are some key steps we followed in training our FM:

Selecting challenging data sources – To train our AI model for Just Walk Out technology, we focus on training data from especially difficult shopping scenarios that test the limits of our model. Although these complex cases constitute only a small fraction of shopping data, they are the most valuable for helping the model learn from its mistakes.
Leveraging auto labeling – To increase operational efficiency, we developed algorithms and models that automatically attach meaningful labels to the data. In addition to receipt prediction, our automated labeling algorithms cover the auxiliary tasks, ensuring the model gains comprehensive multi-modal understanding and reasoning capabilities.
Pre-training the model – Our FM is pre-trained on a vast collection of multi-modal data across a diverse range of tasks, which enhances the model’s ability to generalize to new store environments never encountered before.
Fine-tuning the model – Finally, we refined the model further and used quantization techniques to create a smaller, more efficient model that uses edge computing.

As the data flywheel continues to operate, it will progressively identify and incorporate more high-quality, challenging cases to test the robustness of the model. These additional difficult samples are then fed into the training set, further enhancing the model’s accuracy and applicability across new physical store environments.

Conclusion

In this post, we showed how our multi-modal, AI system represents significant new possibilities for Just Walk Out technology. With our innovative approach, we are moving away from modular AI systems that rely on human-defined subcomponents and interfaces. Instead, we’re building simpler and more scalable AI systems that can be trained end-to-end. Although we’ve just scratched the surface, multi-modal AI has raised the bar for our already highly accurate receipt system and will enable us to improve the shopping experience at more Just Walk Out technology stores around the world.

Visit About Amazon to read the official announcement about the new multi-modal AI system and learn more about the latest improvements in Just Walk Out technology.

To find where you can find Just Walk Out technology locations, visit Just Walk Out technology locations near you. Learn more about how to power your store or venue with Just Walk Out technology by Amazon on the Just Walk Out technology product page.

Visit Build and scale the next wave of AI innovation on AWS to learn more about how AWS can reinvent customer experiences with the most comprehensive set of AI and ML services.

About the Authors

Tian Lan is a Principal Scientist at AWS. He currently leads the research efforts in developing the next-generation Just Walk Out 2.0 technology, transforming it into an end-to-end learned, store domain–focused multi-modal foundation model.

Chris Broaddus is a Senior Manager at AWS. He currently manages all the research efforts for Just Walk Out technology, including the multi-modal AI model and other projects, such as deep learning for human pose estimation and Radio Frequency Identification (RFID) receipt prediction.

Generate synthetic data for evaluating RAG systems using Amazon Bedrock

September 23, 2024

by Lukas Wenzel Amazon AWS

Evaluating your Retrieval Augmented Generation (RAG) system to make sure it fulfils your business requirements is paramount before deploying it to production environments. However, this requires acquiring a high-quality dataset of real-world question-answer pairs, which can be a daunting task, especially in the early stages of development. This is where synthetic data generation comes into play. With Amazon Bedrock, you can generate synthetic datasets that mimic actual user queries, enabling you to evaluate your RAG system’s performance efficiently and at scale. With synthetic data, you can streamline the evaluation process and gain confidence in your system’s capabilities before unleashing it to the real world.

This post explains how to use Anthropic Claude on Amazon Bedrock to generate synthetic data for evaluating your RAG system. Amazon Bedrock is a fully managed service that offers a choice of high-performing foundation models (FMs) from leading AI companies like AI21 Labs, Anthropic, Cohere, Meta, Stability AI, and Amazon through a single API, along with a broad set of capabilities to build generative AI applications with security, privacy, and responsible AI.

Fundamentals of RAG evaluation

Before diving deep into how to evaluate a RAG application, let’s recap the basic building blocks of a naive RAG workflow, as shown in the following diagram.

The workflow consists of the following steps:

In the ingestion step, which happens asynchronously, data is split into separate chunks. An embedding model is used to generate embeddings for each of the chunks, which are stored in a vector store.
When the user asks a question to the system, an embedding is generated from the questions and the top-k most relevant chunks are retrieved from the vector store.
The RAG model augments the user input by adding the relevant retrieved data in context. This step uses prompt engineering techniques to communicate effectively with the large language model (LLM). The augmented prompt allows the LLM to generate an accurate answer to user queries.
An LLM is prompted to formulate a helpful answer based on the user’s questions and the retrieved chunks.

Amazon Bedrock Knowledge Bases offers a streamlined approach to implement RAG on AWS, providing a fully managed solution for connecting FMs to custom data sources. To implement RAG using Amazon Bedrock Knowledge Bases, you begin by specifying the location of your data, typically in Amazon Simple Storage Service (Amazon S3), and selecting an embedding model to convert the data into vector embeddings. Amazon Bedrock then creates and manages a vector store in your account, typically using Amazon OpenSearch Serverless, handling the entire RAG workflow, including embedding creation, storage, management, and updates. You can use the RetrieveAndGenerate API for a straightforward implementation, which automatically retrieves relevant information from your knowledge base and generates responses using a specified FM. For more granular control, the Retrieve API is available, allowing you to build custom workflows by processing retrieved text chunks and developing your own orchestration for text generation. Additionally, Amazon Bedrock Knowledge Bases offers customization options, such as defining chunking strategies and selecting custom vector stores like Pinecone or Redis Enterprise Cloud.

A RAG application has many moving parts, and on your way to production you’ll need to make changes to various components of your system. Without a proper automated evaluation workflow, you won’t be able to measure the effect of these changes and will be operating blindly regarding the overall performance of your application.

To evaluate such a system properly, you need to collect an evaluation dataset of typical user questions and answers.

Moreover, you need to make sure you evaluate not only the generation part of the process but also the retrieval. An LLM without relevant retrieved context can’t answer the user’s question if the information wasn’t present in the training data. This holds true even if it has exceptional generation capabilities.

As such, a typical RAG evaluation dataset consists of the following minimum components:

A list of questions users will ask the RAG system
A list of corresponding answers to evaluate the generation step
The context or a list of contexts that contain the answer for each question to evaluate the retrieval

In an ideal world, you would take real user questions as a basis for evaluation. Although this is the optimal approach because it directly resembles end-user behavior, this is not always feasible, especially in the early stages of building a RAG system. As you progress, you should aim for incorporating real user questions into your evaluation set.

To learn more about how to evaluate a RAG application, see Evaluate the reliability of Retrieval Augmented Generation applications using Amazon Bedrock.

Solution overview

We use a sample use case to illustrate the process by building an Amazon shareholder letter chatbot that allows business analysts to gain insights about the company’s strategy and performance over the past years.

For the use case, we use PDF files of Amazon’s shareholder letters as our knowledge base. These letters contain valuable information about the company’s operations, initiatives, and future plans. In a RAG implementation, the knowledge retriever might use a database that supports vector searches to dynamically look up relevant documents that serve as the knowledge source.

The following diagram illustrates the workflow to generate the synthetic dataset for our RAG system.

The workflow includes the following steps:

Load the data from your data source.
Chunk the data as you would for your RAG application.
Generate relevant questions from each document.
Generate an answer by prompting an LLM.
Extract the relevant text that answers the question.
Evolve the question according to a specific style.
Filter questions and improve the dataset either using domain experts or LLMs using critique agents.

We use a model from the Anthropic’s Claude 3 model family to extract questions and answers from our knowledge source, but you can experiment with other LLMs as well. Amazon Bedrock makes this effortless by providing standardized API access to many FMs.

For the orchestration and automation steps in this process, we use LangChain. LangChain is an open source Python library designed to build applications with LLMs. It provides a modular and flexible framework for combining LLMs with other components, such as knowledge bases, retrieval systems, and other AI tools, to create powerful and customizable applications.

The next sections walk you through the most important parts of the process. If you want to dive deeper and run it yourself, refer to the notebook on GitHub.

Load and prepare the data

First, load the shareholder letters using LangChain’s PyPDFDirectoryLoader and use the RecursiveCharacterTextSplitter to split the PDF documents into chunks. The RecursiveCharacterTextSplitter divides the text into chunks of a specified size while trying to preserve context and meaning of the content. It’s a good way to start when working with text-based documents. You don’t have to split your documents to create your evaluation dataset if your LLM supports a context window that is large enough to fit your documents, but you could potentially end up with a lower quality of generated questions due to the larger size of the task. You want to have the LLM generate multiple questions per document in this case.

from langchain.text_splitter import CharacterTextSplitter, RecursiveCharacterTextSplitter
from langchain.document_loaders.pdf import PyPDFLoader, PyPDFDirectoryLoader

# Load PDF documents from directory
loader = PyPDFDirectoryLoader("./synthetic_dataset_generation/")  
documents = loader.load()
# Use recursive character splitter, works better for this PDF data set
text_splitter = RecursiveCharacterTextSplitter(
    # Split documents into small chunks
    chunk_size = 1500,  
    # Overlap chunks to reduce cutting sentences in half
    chunk_overlap  = 100,
    separators=["nn", "n", ".", " ", ""],
)

# Split loaded documents into chunks
docs = text_splitter.split_documents(documents)

To demonstrate the process of generating a corresponding question and answer and iteratively refining them, we use an example chunk from the loaded shareholder letters throughout this post:

“page_content=''Our AWS and Consumer businesses have had different demand trajectories during the pandemic. In thenfirst year of the pandemic, AWS revenue continued to grow at a rapid clip—30% year over year (“Y oY”) in2020 on a $35 billion annual revenue base in 2019—but slower than the 37% Y oY growth in 2019. [...] This shift by so many companies (along with the economy recovering) helped re-accelerate AWS’s revenue growth to 37% Y oY in 2021.nConversely, our Consumer revenue grew dramatically in 2020. In 2020, Amazon’s North America andnInternational Consumer revenue grew 39% Y oY on the very large 2019 revenue base of $245 billion; and,this extraordinary growth extended into 2021 with revenue increasing 43% Y oY in Q1 2021. These areastounding numbers. We realized the equivalent of three years’ forecasted growth in about 15 months.nAs the world opened up again starting in late Q2 2021, and more people ventured out to eat, shop, and travel,”

Generate an initial question

To facilitate prompting the LLM using Amazon Bedrock and LangChain, you first configure the inference parameters. To accurately extract more extensive contexts, set the max_tokens parameter to 4096, which corresponds to the maximum number of tokens the LLM will generate in its output. Additionally, define the temperature parameter as 0.2 because the goal is to generate responses that adhere to the specified rules while still allowing for a degree of creativity. This value differs for different use cases and can be determined by experimentation.

import boto3

from langchain_community.chat_models import BedrockChat

# set up a Bedrock-runtime client for inferencing large language models
boto3_bedrock = boto3.client('bedrock-runtime')
# Choosing claude 3 Haiku due to cost and performance efficiency
claude_3_haiku = "anthropic.claude-3-haiku-20240307-v1:0"
# Set-up langchain LLM for implementing the synthetic dataset generation logic

# for each model provider there are different parameters to define when inferencing against the model
inference_modifier = {
                        "max_tokens": 4096,
                        "temperature": 0.2
                    }
                                         
llm = BedrockChat(model_id = claude_3_haiku,
                    client = boto3_bedrock, 
                    model_kwargs = inference_modifier 
                    )

You use each generated chunk to create synthetic questions that mimic those a real user might ask. By prompting the LLM to analyze a portion of the shareholder letter data, you generate relevant questions based on the information presented in the context. We use the following sample prompt to generate a single question for a specific context. For simplicity, the prompt is hardcoded to generate a single question, but you can also instruct the LLM to generate multiple questions with a single prompt.

The rules can be adapted to better guide the LLM in generating questions that reflect the types of queries your users would pose, tailoring the approach to your specific use case.

# Create a prompt template to generate a question a end-user could have about a given context
initial_question_prompt_template = PromptTemplate(
    input_variables=["context"],
    template="""
    <Instructions>
    Here is some context:
    <context>
    {context}
    </context>

    Your task is to generate 1 question that can be answered using the provided context, following these rules:

    <rules>
    1. The question should make sense to humans even when read without the given context.
    2. The question should be fully answered from the given context.
    3. The question should be framed from a part of context that contains important information. It can also be from tables, code, etc.
    4. The answer to the question should not contain any links.
    5. The question should be of moderate difficulty.
    6. The question must be reasonable and must be understood and responded by humans.
    7. Do not use phrases like 'provided context', etc. in the question.
    8. Avoid framing questions using the word "and" that can be decomposed into more than one question.
    9. The question should not contain more than 10 words, make use of abbreviations wherever possible.
    </rules>

    To generate the question, first identify the most important or relevant part of the context. Then frame a question around that part that satisfies all the rules above.

    Output only the generated question with a "?" at the end, no other text or characters.
    </Instructions>
    
    """)

The following is the generated question from our example chunk:

What is the price-performance improvement of AWS Graviton2 chip over x86 processors?

Generate answers

To use the questions for evaluation, you need to generate a reference answer for each of the questions to test against. With the following prompt template, you can generate a reference answer to the created question based on the question and the original source chunk:

# Create a prompt template that takes into consideration the the question and generates an answer
answer_prompt_template = PromptTemplate(
    input_variables=["context", "question"],
    template="""
    <Instructions>
    <Task>
    <role>You are an experienced QA Engineer for building large language model applications.</role>
    <task>It is your task to generate an answer to the following question <question>{question}</question> only based on the <context>{context}</context></task>
    The output should be only the answer generated from the context.

    <rules>
    1. Only use the given context as a source for generating the answer.
    2. Be as precise as possible with answering the question.
    3. Be concise in answering the question and only answer the question at hand rather than adding extra information.
    </rules>

    Only output the generated answer as a sentence. No extra characters.
    </Task>
    </Instructions>
    
    Assistant:""")

The following is the generated answer based on the example chunk:

“The AWS revenue grew 37% year-over-year in 2021.”

Extract relevant context

To make the dataset verifiable, we use the following prompt to extract the relevant sentences from the given context to answer the generated question. Knowing the relevant sentences, you can check whether the question and answer are correct.

# To check whether an answer was correctly formulated by the large language model you get the relevant text passages from the documents used for answering the questions.
source_prompt_template = PromptTemplate(
    input_variables=["context", "question"],
    template="""Human:
    <Instructions>
    Here is the context:
    <context>
    {context}
    </context>

    Your task is to extract the relevant sentences from the given context that can potentially help answer the following question. You are not allowed to make any changes to the sentences from the context.

    <question>
    {question}
    </question>

    Output only the relevant sentences you found, one sentence per line, without any extra characters or explanations.
    </Instructions>
    Assistant:""")

The following is the relevant source sentence extracted using the preceding prompt:

“This shift by so many companies (along with the economy recovering) helped re-accelerate AWS's revenue growth to 37% Y oY in 2021.”

Refine questions

When generating question and answer pairs from the same prompt for the whole dataset, it might appear that the questions are repetitive and similar in form, and therefore don’t mimic real end-user behavior. To prevent this, take the previously created questions and prompt the LLM to modify them according to the rules and guidance established in the prompt. By doing so, a more diverse dataset is synthetically generated. The prompt for generating questions tailored to your specific use case heavily depends on that particular use case. Therefore, your prompt must accurately reflect your end-users by setting appropriate rules or providing relevant examples. The process of refining questions can be repeated as many times as necessary.

# To generate a more versatile testing dataset you alternate the questions to see how your RAG systems performs against differently formulated of questions
question_compress_prompt_template = PromptTemplate(
    input_variables=["question"],
    template="""
    <Instructions>
    <role>You are an experienced linguistics expert for building testsets for large language model applications.</role>

    <task>It is your task to rewrite the following question in a more indirect and compressed form, following these rules:

    <rules>
    1. Make the question more indirect
    2. Make the question shorter
    3. Use abbreviations if possible
    </rules>

    <question>
    {question}
    </question>

    Your output should only be the rewritten question with a question mark "?" at the end. Do not provide any other explanation or text.
    </task>
    </Instructions>
    
    """)

Users of your application might not always use your solution in the same way, for instance using abbreviations when asking questions. This is why it’s crucial to develop a diverse dataset:

“AWS rev YoY growth in ’21?”

Automate dataset generation

To scale the process of the dataset generation, we iterate over all the chunks in our knowledge base; generate questions, answers, relevant sentences, and refinements for each question; and save them to a pandas data frame to prepare the full dataset.

To provide a clearer understanding of the structure of the dataset, the following table presents a sample row based on the example chunk used throughout this post.

Chunk	Our AWS and Consumer businesses have had different demand trajectories during the pandemic. In thenfirst year of the pandemic, AWS revenue continued to grow at a rapid clip—30% year over year (“Y oY”) in2020 on a $35 billion annual revenue base in 2019—but slower than the 37% Y oY growth in 2019. […] This shift by so many companies (along with the economy recovering) helped re-accelerate AWS’s revenue growth to 37% Y oY in 2021.nConversely, our Consumer revenue grew dramatically in 2020. In 2020, Amazon’s North America andnInternational Consumer revenue grew 39% Y oY on the very large 2019 revenue base of $245 billion; and,this extraordinary growth extended into 2021 with revenue increasing 43% Y oY in Q1 2021. These areastounding numbers. We realized the equivalent of three years’ forecasted growth in about 15 months.nAs the world opened up again starting in late Q2 2021, and more people ventured out to eat, shop, and travel,”
Question	“What was the YoY growth of AWS revenue in 2021?”
Answer	“The AWS revenue grew 37% year-over-year in 2021.”
Source Sentence	“This shift by so many companies (along with the economy recovering) helped re-accelerate AWS’s revenue growth to 37% Y oY in 2021.”
Evolved Question	“AWS rev YoY growth in ’21?”

On average, the generation of questions with a given context of 1,500–2,000 tokens results in an average processing time of 2.6 seconds for a set of initial question, answer, evolved question, and source sentence discovery using Anthropic Claude 3 Haiku. The generation of 1,000 sets of questions and answers costs approximately $2.80 USD using Anthropic Claude 3 Haiku. The pricing page gives a detailed overview of the cost structure. This results in a more time- and cost-efficient generation of datasets for RAG evaluation compared to the manual generation of these questions sets.

Improve your dataset using critique agents

Although using synthetic data is a good starting point, the next step should be to review and refine the dataset, filtering out or modifying questions that aren’t relevant to your specific use case. One effective approach to accomplish this is by using critique agents.

Critique agents are a technique used in natural language processing (NLP) to evaluate the quality and suitability of questions in a dataset for a particular task or application using a machine learning model. In our case, the critique agents are employed to assess whether the questions in the dataset are valid and appropriate for our RAG system.

The two main metrics evaluated by the critique agents in our example are question relevance and answer groundedness. Question relevance determines how relevant the generated question is for a potential user of our system, and groundedness assesses whether the generated answers are indeed based on the given context.

groundedness_check_prompt_template = PromptTemplate(
    input_variables=["context","question"],
    template="""
    <Instructions>
    You will be given a context and a question related to that context.

    Your task is to provide an evaluation of how well the given question can be answered using only the information provided in the context. Rate this on a scale from 1 to 5, where:

    1 = The question cannot be answered at all based on the given context
    2 = The context provides very little relevant information to answer the question
    3 = The context provides some relevant information to partially answer the question 
    4 = The context provides substantial information to answer most aspects of the question
    5 = The context provides all the information needed to fully and unambiguously answer the question

    First, read through the provided context carefully:

    <context>
    {context}
    </context>

    Then read the question:

    <question>
    {question}
    </question>

    Evaluate how well you think the question can be answered using only the context information. Provide your reasoning first in an <evaluation> section, explaining what relevant or missing information from the context led you to your evaluation score in only one sentence.

    Provide your evaluation in the following format:

    <rating>(Your rating from 1 to 5)</rating>
    
    <evaluation>(Your evaluation and reasoning for the rating)</evaluation>


    </Instructions>
    
    """)

relevance_check_prompt_template = PromptTemplate(
    input_variables=["question"],
    template="""
    <Instructions>
    You will be given a question related to Amazon Shareholder letters. Your task is to evaluate how useful this question would be for a financial and business analyst working in wallstreet.

    To evaluate the usefulness of the question, consider the following criteria:

    1. Relevance: Is the question directly relevant to your work? Questions that are too broad or unrelated to this domain should receive a lower rating.

    2. Practicality: Does the question address a practical problem or use case that analysts might encounter? Theoretical or overly academic questions may be less useful.

    3. Clarity: Is the question clear and well-defined? Ambiguous or vague questions are less useful.

    4. Depth: Does the question require a substantive answer that demonstrates understanding of financial topics? Surface-level questions may be less useful.

    5. Applicability: Would answering this question provide insights or knowledge that could be applied to real-world company evaluation tasks? Questions with limited applicability should receive a lower rating.

    Provide your evaluation in the following format:

    <rating>(Your rating from 1 to 5)</rating>
    
    <evaluation>(Your evaluation and reasoning for the rating)</evaluation>

    Here is the question:

    {question}
    </Instructions>
    """)

Evaluating the generated questions helps with assessing the quality of a dataset and eventually the quality of the evaluation. The generated question was rated very well:

Groundedness score: 5
“The context provides the exact information needed to answer the question[...]”

Relevance score: 5
“This question is highly relevant and useful for a financial and business analyst working on Wall Street. AWS (Amazon Web Services) is a key business segment for Amazon, and understanding its year-over-year (YoY) revenue growth is crucial for evaluating the company's overall performance and growth trajectory.[...].

Best practices for generating synthetic datasets

Although generating synthetic datasets offers numerous benefits, it’s essential to follow best practices to maintain the quality and representativeness of the generated data:

Combine with real-world data – Although synthetic datasets can mimic real-world scenarios, they might not fully capture the nuances and complexities of actual human interactions or edge cases. Combining synthetic data with real-world data can help address this limitation and create more robust datasets.
Choose the right model – Choose different LLMs for dataset creation than used for RAG generation in order to avoid self-enhancement bias.
Implement robust quality assurance – You can employ multiple quality assurance mechanisms, such as critique agents, human evaluation, and automated checks, to make sure the generated datasets meet the desired quality standards and accurately represent the target use case.
Iterate and refine – You should treat synthetic dataset generation as an iterative process. Continuously refine and improve the process based on feedback and performance metrics, adjusting parameters, prompts, and quality assurance mechanisms as needed.
Domain-specific customization – For highly specialized or niche domains, consider fine-tuning the LLM (such as with PEFT or RLHF) by injecting domain-specific knowledge to improve the quality and accuracy of the generated datasets.

Conclusion

The generation of synthetic datasets is a powerful technique that can significantly enhance the evaluation process of your RAG system, especially in the early stages of development when real-world data is scarce or difficult to obtain. By taking advantage of the capabilities of LLMs, this approach enables the creation of diverse and representative datasets that accurately mimic real human interactions, while also providing the scalability necessary to meet your evaluation needs.

Although this approach offers numerous benefits, it’s essential to acknowledge its limitations. Firstly, the quality of the synthetic dataset heavily relies on the performance and capabilities of the underlying language model, knowledge retrieval system, and quality of prompts used for generation. Being able to understand and adjust the prompts for generation is crucial in this process. Biases and limitations present in these components may be reflected in the generated dataset. Additionally, capturing the full complexity and nuances of real-world interactions can be challenging because synthetic datasets may not account for all edge cases or unexpected scenarios.

Despite these limitations, generating synthetic datasets remains a valuable tool for accelerating the development and evaluation of RAG systems. By streamlining the evaluation process and enabling iterative development cycles, this approach can contribute to the creation of better-performing AI systems.

We encourage developers, researchers, and enthusiasts to explore the techniques mentioned in this post and the accompanying GitHub repository and experiment with generating synthetic datasets for your own RAG applications. Hands-on experience with this technique can provide valuable insights and contribute to the advancement of RAG systems in various domains.

About the Authors

Johannes Langer is a Senior Solutions Architect at AWS, working with enterprise customers in Germany. Johannes is passionate about applying machine learning to solve real business problems. In his personal life, Johannes enjoys working on home improvement projects and spending time outdoors with his family.

Lukas Wenzel is a Solutions Architect at Amazon Web Services in Hamburg, Germany. He focuses on supporting software companies building SaaS architectures. In addition to that, he engages with AWS customers on building scalable and cost-efficient generative AI features and applications. In his free-time, he enjoys playing basketball and running.

David Boldt is a Solutions Architect at Amazon Web Services. He helps customers build secure and scalable solutions that meet their business needs. He is specialized in machine learning to address industry-wide challenges, using technologies to drive innovation and efficiency across various sectors.

Llama 3.2 overview

SageMaker JumpStart overview

Prerequisites

Discover Llama 3.2 models in SageMaker JumpStart

Deploy Llama 3.2 multi-modality models for inference using SageMaker JumpStart

Deploy Llama 3.2 11B Vision model for inference using the Python SDK

Recommended instances and benchmark

Inference and example prompts for Llama-3.2 11B Vision

Text-only input

Single-image input

Multi-image input

Clean up

Conclusion

About the Authors

Overview of Llama 3.2 11B and 90B Vision models

Solution overview

Prerequisites

Configure Llama 3.2 for vision-based reasoning in Amazon Bedrock

Configure Llama 3.2 for vision-based reasoning in SageMaker

Document question answering

Financial results slides Q&A

Visual math question answering

Entity extraction

Caption generation

Simple captioning

Creative captioning

Listing generation

Conclusion

About the Authors

AI/ML on AWS

Security and your data on AWS

AWS generative AI approach for legal tech

Boost productivity to allow a search based on context and conversational Q&A

Boost productivity with legal document summarization

Boost attorney productivity by drafting and reviewing legal documents using generative AI

Start your AWS generative AI journey today

Conclusion

Resources

About the Authors

Solution overview

Prerequisites

Deploy the Amazon Bedrock knowledge base

Upload the sample content and test your knowledge base

Deploy the hallucination detection stack (optional)

Deploy the RAG solution stack

Deploy the conversation analytics stack (optional)

Set up the QuickSight dashboard (optional)

Automated testing notebooks (optional)

Adapt the solution to your use case

Clean up

Conclusion

About the Authors

Karini AI’s Data Ingestion Pipeline for creating vector embeddings

Data Ingestion Pipeline

Infrastructure challenges

Amazon SageMaker for Model Serving

Advantages of using SageMaker hosting

Conclusion

About Authors

AI for Equity Challenge Timeline

About the author

The challenge: Tackling complicated long-tail shopping scenarios

The solution: Just Walk Out multi-modal AI

Training the Just Walk Out FM

Conclusion

About the Authors

Fundamentals of RAG evaluation

Solution overview

Load and prepare the data

Generate an initial question

Generate answers

Extract relevant context

Refine questions

Automate dataset generation

Improve your dataset using critique agents

Best practices for generating synthetic datasets

Conclusion

About the Authors

Navigation

GenAI Vision Endless Possibilities