From fridge to table: Use Amazon Rekognition and Amazon Bedrock to generate recipes and combat food waste

In today’s fast-paced world, time is of the essence and even basic tasks like grocery shopping can feel rushed and challenging. Despite our best intentions to plan meals and shop accordingly, we often end up ordering takeout, leaving unused perishable items to spoil in the refrigerator. This seemingly small issue of wasted groceries, paired with the about-to-perish grocery supplies thrown away by grocery stores, contributes significantly to the global food waste problem. This post demonstrates how we can help solve this problem by harnessing the power of generative AI on AWS.

By using computer vision capabilities through Amazon Rekognition and the content generation capabilities offered by foundation models (FMs) available through Amazon Bedrock, we developed a solution that will recommend recipes based on what you already have in your refrigerator and an inventory of about-to-expire items in local supermarkets, making sure that both food in your home and food in grocery stores are used, saving money and reducing waste.

In this post, we walk through how to build the FoodSavr solution (fictitious name used for the purposes of this post) using Amazon Rekognition Custom Labels to detect the ingredients and generate personalized recipes using Anthropic’s Claude 3.0 on Amazon Bedrock. We demonstrate an end-to-end architecture where a user can upload an image of their fridge, and using the ingredients found there (detected by Amazon Rekognition), the solution will give them a list of recipes (generated by Amazon Bedrock). The architecture also recognizes missing ingredients and provides the user with a list of nearby grocery stores.

Solution overview

The following reference architecture shows how you can use Amazon Bedrock, Amazon Rekognition, and other AWS services to implement the FoodSavr solution.

As shown in the preceding figure, the architecture includes the following steps:

  1. For an end-to-end solution, we recommend having a frontend where your users can upload images of items that they want detected and labeled. To learn more about frontend deployment on AWS, see Front-end Web & Mobile on AWS.
  2. The picture taken by the user is stored in an Amazon Simple Storage Service (Amazon S3) bucket. This S3 bucket should be configured with a lifecycle policy that deletes the image after use. To learn more about S3 lifecycle policies, see Managing your storage lifecycle.
  3. This architecture uses different AWS Lambda functions. Lambda is a serverless AWS compute service that runs event-driven code and automatically manages the compute resources. The first Lambda function, DetectIngredients, harnesses the power of Amazon Rekognition by using the Boto3 Python API. Amazon Rekognition is a cutting-edge computer vision service that uses machine learning (ML) models to analyze the uploaded images.
  4. We use Rekognition Custom Labels to train a model with a dataset of ingredients. You can adapt this architecture to use Rekognition Custom Labels with your own use case. With the aid of custom labels trained to recognize various ingredients, Amazon Rekognition identifies the items present in the images.
  5. The detected ingredient names are then securely stored in an Amazon DynamoDB table (DynamoDB is a fully managed NoSQL database service) for retrieval and modification. Users are presented with a list of the ingredients that have been detected, along with the option of adding other ingredients or deleting ingredients that they might not want or that were misidentified.
  6. After the ingredient list is confirmed by the user through the web interface, they can initiate the recipe generation process with a click of a button. This action invokes another Lambda function called GenerateRecipes, which calls an FM through the Amazon Bedrock API (Anthropic’s Claude v3 in this post). The FM analyzes the confirmed ingredient list retrieved from DynamoDB and generates relevant recipes tailored to those specific ingredients. Additionally, an image generation model provides images to accompany each recipe, providing a visually appealing and inspiring culinary experience.
  7. Amazon Bedrock contains two key FMs that are used for this solution example: Anthropic’s Claude v3 (newer versions have been released since the writing of this post) and Stable Diffusion, used for recipe generation and image generation respectively. For this solution, you can use any combination of FMs that suit your use case. The generated content (recipes as text and recipe images, in this case) can then be displayed to the user on the frontend.
  8. For this use case, you can also set up an optional ordering pipeline, which allows a user to place orders for the ingredients described by the FMs. This would be fronted by a Lambda function, FindGroceryItems, that can look for the recommended grocery items in a database contributed to by local supermarkets. This database would consist of about-to-expire ingredients along with prices for those ingredients.
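
As a rough illustration of the lifecycle policy recommended in step 2, the following sketch builds an expiration rule and applies it with Boto3. The rule ID, the `uploads/` prefix, and the one-day expiration are illustrative assumptions rather than values from this solution. S3 expiration works in whole days, so the image is removed some time after use, not immediately.

```python
def build_expiration_rule(prefix="uploads/", days=1):
    """Return a lifecycle configuration that expires objects under `prefix`."""
    if days < 1:
        raise ValueError("S3 expiration is expressed in whole days (minimum 1)")
    return {
        "Rules": [
            {
                "ID": "expire-uploaded-fridge-images",  # hypothetical rule name
                "Filter": {"Prefix": prefix},
                "Status": "Enabled",
                "Expiration": {"Days": days},
            }
        ]
    }


def apply_expiration_rule(s3_client, bucket, prefix="uploads/", days=1):
    """Attach the rule to `bucket`; `s3_client` is a boto3.client("s3")."""
    s3_client.put_bucket_lifecycle_configuration(
        Bucket=bucket,
        LifecycleConfiguration=build_expiration_rule(prefix, days),
    )
```

You would call `apply_expiration_rule(boto3.client("s3"), "MY_BUCKET")` once when setting up the bucket.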

In the following sections, we dive into how you can set up this architecture on your own account. Step 8 is optional and therefore not covered in this post.

Using Amazon Rekognition to detect images

The image recognition is powered by Amazon Rekognition, which offers pre-trained and customizable computer vision capabilities to allow users to obtain information and insights from their images. For customizability, you can use Rekognition Custom Labels to identify scenes and objects in your images that are specific to your business needs. If your images are already labeled, you can begin training a model from the Amazon Rekognition console. Otherwise, you can label them directly from the Amazon Rekognition labeling interface, or use other services such as Amazon SageMaker Ground Truth. The following screenshot shows an example of what the bounding box process would look like on the Amazon Rekognition labeling interface.

To get started with labeling, see Using Amazon Rekognition Custom Labels and Amazon A2I for detecting pizza slices and augmenting predictions. For this architecture, we collected a dataset of up to 70 images of common food items typically found in refrigerators. We recommend that you gather your own relevant images and store them in an S3 bucket to use for training with Amazon Rekognition. You can then use Rekognition Custom Labels to create labels with food names, and assign bounding boxes on the images so the model knows where to look. To get started with training your own custom model, see Training an Amazon Rekognition Custom Labels model.

When model training is complete, you will see all your trained models under Projects on the AWS Management Console for Amazon Rekognition. Here, you can also look at the model performance, measured by the F1 score (shown in the following screenshot).

You can also iterate and modify your existing models to create newer versions. Before using your model, make sure it’s in the RUNNING state. To start the model, choose the model you want to use, and on the Use model tab, choose Start.

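You also have the option to programmatically start and stop your model (the exact API call can be copied from the Amazon Rekognition console, but the following is provided as an example):

aws rekognition start-project-version \
--project-version-arn "MODEL_ARN" \
--min-inference-units 1 \
--region us-east-1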

Use the following API call (which is also made from the Lambda function) to detect groceries in an image using your custom labels and custom models:

aws rekognition detect-custom-labels \
--project-version-arn "MODEL_ARN" \
--image '{"S3Object": {"Bucket": "MY_BUCKET","Name": "PATH_TO_MY_IMAGE"}}' \
--region us-east-1

To stop incurring costs, you can also stop your model when not in use:

aws rekognition stop-project-version \
--project-version-arn "MODEL_ARN" \
--region us-east-1

Because we’re using Python, the boto3 Python package is used to make all AWS API calls mentioned in this post. For more information about Boto3, see the Boto3 documentation.
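
For reference, a minimal Boto3 sketch of the detect-custom-labels call shown earlier might look like the following. The function name and the 70% confidence threshold are assumptions made for illustration; the client is passed in so the function can be exercised without AWS credentials.

```python
def detect_ingredients(rekognition, model_arn, bucket, key, min_confidence=70.0):
    """Return the names of custom labels found in s3://bucket/key.

    `rekognition` is expected to be a boto3.client("rekognition").
    The confidence threshold is an assumed default, not a value from the post.
    """
    response = rekognition.detect_custom_labels(
        ProjectVersionArn=model_arn,
        Image={"S3Object": {"Bucket": bucket, "Name": key}},
        MinConfidence=min_confidence,
    )
    # Each entry in CustomLabels carries a Name and a Confidence score
    return [label["Name"] for label in response["CustomLabels"]]
```

In a Lambda function, you would typically create the client once at module level with `boto3.client("rekognition")` and pass it in.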

Starting a model might take a few minutes to complete. To check the current status of the model readiness, check the details page for the project or use DescribeProjectVersions. Wait for the model status to change to RUNNING.
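
One way to wait for the model programmatically is to poll DescribeProjectVersions, as in the following sketch. The function name, polling interval, and retry limit are assumptions; the project ARN and version name are placeholders you would take from your own project.

```python
import time


def wait_until_running(rekognition, project_arn, version_name,
                       poll_seconds=30, max_polls=40, sleep=time.sleep):
    """Poll DescribeProjectVersions until the model reports RUNNING.

    Returns True when the model is running and False if it is still not
    running after `max_polls` checks. `rekognition` is a
    boto3.client("rekognition").
    """
    for _ in range(max_polls):
        response = rekognition.describe_project_versions(
            ProjectArn=project_arn, VersionNames=[version_name]
        )
        status = response["ProjectVersionDescriptions"][0]["Status"]
        if status == "RUNNING":
            return True
        if status == "FAILED":
            raise RuntimeError("Model failed to start")
        sleep(poll_seconds)
    return False
```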

In the meantime, you can explore the different statistics provided by Amazon Rekognition about your model. Some notable ones are the model performance (F1 score), precision, and recall. These statistics are gathered by Amazon Rekognition at both the model level (as seen in the earlier screenshot) and the individual custom label level (as shown in the following screenshot).

For more information on these statistics, see Metrics for evaluating your model.
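
As a quick reminder of how these metrics relate, the F1 score is the harmonic mean of precision and recall. The following helper is a generic illustration of that formula and is not code from the solution.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (both between 0 and 1)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

For example, a label with precision 0.9 and recall 0.6 has an F1 score of 0.72, which is lower than the arithmetic mean (0.75) because the harmonic mean penalizes the weaker of the two values.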

Be aware that, while Anthropic’s Claude models offer impressive multi-modal capabilities for understanding and generating content based on text and images, we chose to use Amazon Rekognition Custom Labels for ingredient detection in this solution. Amazon Rekognition is a specialized computer vision service optimized for tasks such as object detection and image classification, using state-of-the-art models trained on massive datasets. Additionally, Rekognition Custom Labels allows us to train custom models tailored to recognize specific food items and ingredients, providing a level of customization that might not be as straightforward with a general-purpose language model. Furthermore, as a fully managed service, Amazon Rekognition can scale seamlessly to handle large volumes of images. While a hybrid approach combining Rekognition and Claude’s multi-modal capabilities could be explored, we chose Rekognition Custom Labels for its specialized computer vision capabilities, customizability, and to demonstrate combining FMs on Amazon Bedrock with other AWS services for this specific use case.

Using Amazon Bedrock FMs to generate recipes

To generate the recipes, we use Amazon Bedrock, a fully managed service that offers high-performing FMs. We use the Amazon Bedrock API to query Anthropic’s Claude v3 Sonnet model. We use the following prompt to provide context to the FM:

You are an expert chef, with expertise in diverse cuisines and recipes. 
I am currently a novice and I require you to write me recipes based on the ingredients provided below. 
The requirements for the recipes are as follows:
- I need 3 recipes from you
- These recipes can only use ingredients listed below, and nothing else
- For each of the recipes, provide detailed step by step methods for cooking. Format it like this:
1. Step 1: <instructions>
2. Step 2: <instructions>
...
n. Step n: <instructions>
Remember, you HAVE to use ONLY the ingredients that are provided to you. DO NOT use any other ingredient. 
This is crucial. For example, if you are given ingredients "Bread" and "Butter", you can ONLY use Bread and Butter, 
and no other ingredient can be added on. 
An example recipe with these two can be:
Recipe 1: Fried Bread
Ingredients:
- Bread
- Butter
1. Step 1: Heat up the pan until it reaches 40 degrees
2. Step 2: Drop in a knob of butter and melt it
3. Step 3: Once butter is melted, add a piece of bread onto pan
4. Step 4: Cook until the bread is browned and crispy
5. Step 5: Repeat on the other side
6. Step 6: You can repeat this for other breads, too

The following code is the body of the Amazon Bedrock API call:

# user_ingredients_str: ingredients detected in the user's image, retrieved from DynamoDB
# master_ingredients_str: grocery store ingredients, retrieved from DynamoDB
# prompt: Prompt shown above
content = (
    "Here is a list of ingredients that a person currently has. "
    + user_ingredients_str
    + "\n\n And here is a list of ingredients at a local grocery store "
    + master_ingredients_str
    + prompt
)

body = json.dumps({
    "max_tokens": 2047,
    "messages": [{"role": "user", "content": content}],
    "anthropic_version": "bedrock-2023-05-31"
})

modelId = "anthropic.claude-3-sonnet-20240229-v1:0"

response = bedrock.invoke_model(body=body, modelId=modelId)
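
The same request body can also carry sampling parameters. The following sketch shows one way to add them; the helper name and the default values are assumptions chosen to match the recommendations discussed in this section, and `top_k` is only sent when you set it.

```python
import json


def build_claude_body(content, temperature=0.1, top_p=0.9, top_k=None,
                      max_tokens=2047):
    """Build a Claude Messages API request body with sampling parameters."""
    body = {
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": max_tokens,
        "messages": [{"role": "user", "content": content}],
        "temperature": temperature,  # low value for more repeatable output
        "top_p": top_p,              # nucleus sampling threshold
    }
    if top_k is not None:
        body["top_k"] = top_k        # keep only the k most probable tokens
    return json.dumps(body)
```

The returned string can be passed as `body` to `bedrock.invoke_model` in place of the body built above.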

Using the combination of the prompt and API call, we generate three recipes using the ingredients retrieved from the DynamoDB table. You can add parameters such as temperature, top_p, and top_k to body to control how the model samples its output. For more information on getting responses from Anthropic’s Claude 3 model using the Amazon Bedrock API, see Anthropic Claude Messages API.

We recommend setting the temperature to something low (such as 0.1 or 0.2) to make the generated recipes more deterministic and consistently structured. We also recommend setting the top_p value (nucleus sampling) to something high (such as 0.9) to limit the FM’s predictions to the most probable tokens (in this case, the model will consider the most probable tokens that make up 90% of the total probability mass for its next prediction). top_k is another sampling technique that limits the model’s predictions to the top_k most probable tokens. For example, if top_k = 10, the model will only consider the 10 most probable tokens for its next prediction.

One of the key benefits of using Amazon Bedrock is the ability to use multiple FMs for different tasks within the same solution. In addition to generating textual recipes with Anthropic’s Claude 3, we can also dynamically generate visually appealing images to accompany those recipes. For this task, we chose to use the Stable Diffusion model available on Amazon Bedrock. Amazon Bedrock also offers other powerful image generation models such as Titan, and we’ve given you an example API call for that, too. Similar to using the Amazon Bedrock API to generate a response from Anthropic’s Claude 3, we use the following code:

# brt is assumed to be a Boto3 "bedrock-runtime" client, like bedrock in the previous example
modelId = "stability.stable-diffusion-xl-v0"
accept = "application/json"
contentType = "application/json"

body = json.dumps({
    "text_prompts": [
        {
            "text": recipe_name
        }
    ],
    "cfg_scale": 10,
    "seed": 20,
    "steps": 50
})

response = brt.invoke_model(
    body=body,
    modelId=modelId,
    accept=accept,
    contentType=contentType
)

For Titan, you might use something like:

# No trailing commas here: a trailing comma would turn each value into a tuple
modelId = "amazon.titan-image-generator-v1"
accept = "application/json"
contentType = "application/json"

body = json.dumps({
    "taskType": "TEXT_IMAGE",
    "textToImageParams": {
        "text": prompt,        # Required
    },
    "imageGenerationConfig": {
        "numberOfImages": 1,   # Range: 1 to 5
        "quality": "premium",  # Options: standard or premium
        "height": 768,         # Supported height list in the docs
        "width": 1280,         # Supported width list in the docs
        "cfgScale": 7.5,       # Range: 1.0 (exclusive) to 10.0
        "seed": 42             # Non-negative integer; supported range in the docs
    }
})

response = brt.invoke_model(
    body=body,
    modelId=modelId,
    accept=accept,
    contentType=contentType
)

This returns a base64-encoded string that you need to decode in your frontend so that you can display it. For more information about other parameters that you can include in your API call, see Stability.ai Diffusion 1.0 text to image and Using Amazon Bedrock to generate images with Titan Image Generator models.

In the following sections, you walk through the steps to deploy the solution in your AWS account.
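
To make the base64 decoding step mentioned above concrete, here is a small sketch that pulls the first image out of either response shape. The function name is hypothetical. The field names reflect the documented response formats for these two model families, and are worth verifying against the current documentation for the model version you use.

```python
import base64
import json


def extract_image_bytes(raw_body):
    """Decode the first image from an invoke_model response body.

    `raw_body` is the JSON string or bytes read from response["body"].
    """
    payload = json.loads(raw_body)
    if "artifacts" in payload:      # Stable Diffusion XL response shape
        encoded = payload["artifacts"][0]["base64"]
    elif "images" in payload:       # Titan Image Generator response shape
        encoded = payload["images"][0]
    else:
        raise ValueError("No image found in response body")
    return base64.b64decode(encoded)
```

For example, `extract_image_bytes(response["body"].read())` returns raw image bytes that can be written to a file or served to the frontend.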

Prerequisites

You need an AWS account to deploy this solution. If you don’t have an existing account, you can sign up for one. The instructions in this post use the us-east-1 AWS Region. Make sure you deploy your resources in a Region with AWS Machine Learning services available. For the Lambda functions to run successfully, Lambda requires an AWS Identity and Access Management (IAM) role and policy with the appropriate permissions. Complete the necessary steps from Defining Lambda function permissions with an execution role to create and attach a Lambda execution role for the Lambda functions to access all necessary actions for DynamoDB, Amazon Rekognition, and Amazon Bedrock.
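
As a starting point for the execution role, the following sketch assembles a policy document covering the API calls made by the Lambda functions in this post. The function name and ARNs are placeholders, and the wildcard resource for Amazon Bedrock is a deliberately broad assumption that you should narrow to the model ARNs you use. Logging permissions (for example, through the AWSLambdaBasicExecutionRole managed policy) are not included.

```python
import json


def build_execution_policy(table_arns, model_arn, bucket_arn):
    """Return an IAM policy document as a dict (all ARNs are caller-supplied)."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {   # scan, put_item, and batch writes/deletes on the tables
                "Effect": "Allow",
                "Action": ["dynamodb:Scan", "dynamodb:PutItem",
                           "dynamodb:BatchWriteItem"],
                "Resource": list(table_arns),
            },
            {   # inference against the trained Custom Labels model
                "Effect": "Allow",
                "Action": "rekognition:DetectCustomLabels",
                "Resource": model_arn,
            },
            {   # placeholder scope; restrict to specific model ARNs
                "Effect": "Allow",
                "Action": "bedrock:InvokeModel",
                "Resource": "*",
            },
            {   # read uploaded images
                "Effect": "Allow",
                "Action": "s3:GetObject",
                "Resource": f"{bucket_arn}/*",
            },
        ],
    }
```

Calling `json.dumps(build_execution_policy(...), indent=2)` produces JSON that you can paste into the IAM console's policy editor.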

Create the Lambda function to detect ingredients

Complete the following steps to create your first Lambda function (DetectIngredients):

  1. On the Lambda console, choose Functions in the navigation pane.
  2. Choose Create function.
  3. Choose Author from scratch.
  4. Name your function DetectIngredients, select Python 3.12 for Runtime, and choose Create function.
  5. For your Lambda configuration, choose lambdaDynamoRole for Execution role, increase Timeout to 8 seconds, verify the settings, and choose Save.
  6. Replace the text in the Lambda function code with the following sample code and choose Save:
import json
import boto3
import inference  # local module packaged with the function (not shown in this post)
import time

s3 = boto3.client('s3')

dynamodb = boto3.resource('dynamodb')
table_name = 'TestDataTable'
table = dynamodb.Table(table_name)


def lambda_handler(event, context):
    # Remove items left over from a previous image
    clearTable()

    labels, label_count = inference.main()

    # The names array will contain a list of all the grocery ingredients detected
    # in the image
    names = []

    for label_dic in labels:
        name = label_dic['Name']
        # Getting rid of unnecessary parts of label string
        if "Food" in name:
            name = name.replace("Food", "")
        if "In Fridge" in name:
            name = name.replace("In Fridge", "")
        name = name.strip()
        names.append(name)

    # Remove all duplicates (case-insensitive) while preserving order
    seen = set()
    unique_grocery_items = []
    for name in names:
        val = name.lower()
        if val not in seen:
            unique_grocery_items.append({'grocery_item': name})
            seen.add(val)

    # batch_write_item accepts at most 25 items per call, so cap the list
    batch_write_all(unique_grocery_items[:25])

    # Marker item; GenerateRecipes filters it out when reading the table
    table.put_item(Item={'grocery_item': "DONE"})


def batch_write_all(items):
    # batch_write_item rejects an empty request list
    if not items:
        return
    batch_write_requests = [{'PutRequest': {'Item': item}} for item in items]
    dynamodb.batch_write_item(RequestItems={table_name: batch_write_requests})


def clearTable():
    response = table.scan()
    with table.batch_writer() as batch:
        for each in response['Items']:
            batch.delete_item(Key={'grocery_item': each['grocery_item']})
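
DynamoDB's BatchWriteItem operation accepts at most 25 put or delete requests per call, which is why the sample code refers to writing up to 25 items at a time. If you expect more detected ingredients than that, one option is to split the list into chunks, as in this sketch (the helper name is hypothetical):

```python
def chunked(items, size=25):
    """Yield successive lists of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]
```

You could then call `batch_write_all(chunk)` for each `chunk in chunked(unique_grocery_items)`. Alternatively, the `table.batch_writer()` context manager already used in clearTable handles this batching for you.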

Create a DynamoDB table to store ingredients

Complete the following steps to create your DynamoDB table.

  1. On the DynamoDB console, choose Tables in the navigation pane.
  2. Choose Create table.
  3. For Table name, enter MasterGroceryDB.
  4. For Partition key, use grocery_item (string).
  5. Verify that all entries on the page are accurate, leave the rest of the settings as default, and choose Create.

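Wait for the table creation to complete and for your table status to change to Active before proceeding to the next step. The Lambda functions in this post also use a second table, TestDataTable, to hold the ingredients detected in the uploaded image; repeat the preceding steps to create it with the same grocery_item partition key.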
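
If you prefer to script this step, a Boto3 sketch along the following lines creates a table with the same partition key and waits for it to become active. The function name is hypothetical, and on-demand billing is an assumption made to keep the example short; the console defaults may differ.

```python
def create_grocery_table(dynamodb_client, table_name):
    """Create a table keyed on grocery_item and wait until it exists.

    `dynamodb_client` is expected to be a boto3.client("dynamodb").
    """
    dynamodb_client.create_table(
        TableName=table_name,
        KeySchema=[{"AttributeName": "grocery_item", "KeyType": "HASH"}],
        AttributeDefinitions=[
            {"AttributeName": "grocery_item", "AttributeType": "S"}
        ],
        BillingMode="PAY_PER_REQUEST",  # assumed; not specified in the post
    )
    # Block until DynamoDB reports the table as created
    dynamodb_client.get_waiter("table_exists").wait(TableName=table_name)
```

For this post, you would call it once for MasterGroceryDB and once for TestDataTable, the table name used in the Lambda code.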

Create the Lambda function to call Amazon Bedrock

Complete the following steps to create another Lambda function that will call the Amazon Bedrock APIs to generate recipes:

  1. On the Lambda console, choose Functions in the navigation pane.
  2. Choose Create function.
  3. Choose Author from scratch.
  4. Name your function GenerateRecipes, choose Python 3.12 for Runtime, and choose Create function.
  5. For your Lambda configuration, choose lambdaDynamoRole for Execution role, increase Timeout to 8 seconds, verify the settings, and choose Save.
  6. Replace the text in the Lambda function code with the following sample code and choose Save:
import json
import boto3
import re
import base64
import image_gen

dynamodb = boto3.resource('dynamodb')

bedrock = boto3.client(service_name='bedrock-runtime')

def get_ingredients(tableName):
    table = dynamodb.Table(tableName)
    response = table.scan()
    data = response['Items']

    # Support for pagination
    while 'LastEvaluatedKey' in response:
        response = table.scan(ExclusiveStartKey=response['LastEvaluatedKey'])
        data.extend(response['Items'])

    data = [g_i for g_i in data if g_i['grocery_item'] != 'DONE']
    return data


# Converts dynamoDB grocery items into a string
def convertItemsToString(grocery_dict):
    ingredients_list = []
    for each in grocery_dict:
        ingredients_list.append(each['grocery_item'])
    # Join once, after the loop, so an empty table yields an empty string
    ingredients_list_str = ", ".join(ingredients_list)
    return ingredients_list_str

def read_prompt():
    with open ('Prompt.md', 'r') as f:
        text = f.read() 
    return text

# Gets the names of all the recipes generated
def get_recipe_names(response_body):
    recipe_names = []
    for i in range(len(response_body) - 2):
        # A recipe title follows a blank line, for example "\n\nRecipe 1: Fried Bread"
        if response_body[i] == '\n' and response_body[i + 1] == '\n' and response_body[i + 2] == 'R':
            recipe_str = ""
            j = i + 2
            # Collect characters up to the end of the title line
            while j < len(response_body) and response_body[j] != '\n':
                recipe_str += response_body[j]
                j += 1
            recipe_str = recipe_str.replace("Recipe", '')
            recipe_str = recipe_str.replace(": ", '')
            recipe_str = re.sub(r" \d+", "", recipe_str)
            recipe_names.append(recipe_str)
    return recipe_names

def lambda_handler(event, context):
    # Read the detected ingredients and the grocery store inventory from DynamoDB
    user_ingredients_dict = get_ingredients('TestDataTable')
    master_ingredients_dict = get_ingredients('MasterGroceryDB')

    # Convert both item lists into comma-separated strings to pass into the prompt
    user_ingredients_str = convertItemsToString(user_ingredients_dict)
    master_ingredients_str = convertItemsToString(master_ingredients_dict)

    # Read the prompt file
    prompt = read_prompt()

    # Query for recipes using prompt + ingredients
    content = (
        "Here is a list of ingredients that a person currently has. "
        + user_ingredients_str
        + "\n\n And here is a list of ingredients at a local grocery store "
        + master_ingredients_str
        + prompt
    )

    body = json.dumps({
        "max_tokens": 2047,
        "messages": [{"role": "user", "content": content}],
        "anthropic_version": "bedrock-2023-05-31"
    })

    modelId = "anthropic.claude-3-sonnet-20240229-v1:0"

    
    response = bedrock.invoke_model(body=body, modelId=modelId)


    response_body = json.loads(response.get('body').read())
    response_body_content = response_body.get("content")
    response_body_completion = response_body_content[0]['text']

    recipe_names_list = get_recipe_names(response_body_completion)

    first_image_imgstr = image_gen.image_gen(recipe_names_list[0])
    second_image_imgstr = image_gen.image_gen(recipe_names_list[1])
    third_image_imgstr = image_gen.image_gen(recipe_names_list[2])

    return response_body_completion, first_image_imgstr, second_image_imgstr, third_image_imgstr
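
As a more compact alternative to the character-by-character parsing in get_recipe_names, a regular expression can extract the titles. This sketch assumes the model follows the "Recipe N: Title" heading format from the prompt example; the function name is hypothetical, and real model output may vary, so treat it as a starting point.

```python
import re

# Matches lines such as "Recipe 1: Fried Bread" and captures the title
RECIPE_HEADING = re.compile(r"^Recipe\s+\d+:\s*(.+?)\s*$", re.MULTILINE)


def get_recipe_names_regex(text):
    """Return the recipe titles found in the model's response text."""
    return RECIPE_HEADING.findall(text)
```

Because the number of matches depends on the model's output, it is worth checking the length of the returned list before indexing into it, as lambda_handler does with recipe_names_list[0] through recipe_names_list[2].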

Create an S3 bucket to store the images

Lastly, you create an S3 bucket to store the images you upload, which automatically invokes the DetectIngredients Lambda function after each upload. Complete the following steps to create the bucket and configure the Lambda function:

  1. On the Amazon S3 console, choose Buckets in the navigation pane.
  2. Choose Create bucket.
  3. Enter a unique bucket name, set the desired Region to us-east-1, and choose Create bucket.
  4. On the Lambda console, navigate to the DetectIngredients function.
  5. On the Configuration tab, choose Add trigger.
  6. Select the trigger type as S3 and choose the bucket you created.
  7. Set Event type to All object create events and choose Add.
  8. On the Amazon S3 console, navigate to the bucket you created.
  9. Under Properties and Event Notifications, choose Create event notification.
  10. Enter an event name (for example, Trigger DetectIngredients) and set the events to All object create events.
  11. For Destination, select Lambda Function and select the DetectIngredients Lambda function.
  12. Choose Save.
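
When the trigger fires, Lambda receives an S3 event notification that identifies the uploaded object. The sample DetectIngredients code doesn't show how the bucket and key are read, so the following is a hedged sketch of one way to do it (the function name is hypothetical). Object keys arrive URL-encoded in the event, so they are decoded before use.

```python
from urllib.parse import unquote_plus


def uploaded_objects(event):
    """Return (bucket, key) pairs from an S3 event notification payload."""
    pairs = []
    for record in event.get("Records", []):
        s3_info = record["s3"]
        bucket = s3_info["bucket"]["name"]
        # Keys are URL-encoded in the event (spaces arrive as "+")
        key = unquote_plus(s3_info["object"]["key"])
        pairs.append((bucket, key))
    return pairs
```

Inside lambda_handler, you could call `uploaded_objects(event)` and pass each bucket and key to the Amazon Rekognition call.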

Conclusion

In this post, we explored the use of Amazon Rekognition and FMs on Amazon Bedrock with AWS services such as Lambda and DynamoDB to build a comprehensive solution that addresses food waste in the US. With the use of cutting-edge AWS services including Rekognition Custom Labels and content generation with models on Amazon Bedrock, this application provides value and serves as a proof of concept for AWS generative AI capabilities.

Stay on the lookout for a follow-up to this post, where we demonstrate using the multi-modal capabilities of FMs such as Anthropic’s Claude v3.1 on Amazon Bedrock to deploy this entire solution end-to-end.

Although we highlighted a food waste use case in this post, we urge you to apply your own use case to this solution. The flexibility of this architecture allows you to adapt these services to multiple scenarios, enabling you to solve a wide range of challenges.

Special thanks to Tommy Xie and Arnav Verma for their contributions to the blog.


About the Authors

Aman Shanbhag is an Associate Specialist Solutions Architect on the ML Frameworks team at Amazon Web Services, where he helps customers and partners with deploying ML training and inference solutions at scale. Before joining AWS, Aman graduated from Rice University with degrees in Computer Science, Mathematics, and Entrepreneurship.

Michael Lue is a Sr. Solution Architect at AWS Canada based out of Toronto. He works with Canadian enterprise customers to accelerate their business through optimization, innovation, and modernization. He is particularly passionate and curious about disruptive technologies like containers and AI/ML. In his spare time, he coaches and plays tennis and enjoys hanging at the beach with his French Bulldog, Marleé.

Vineet Kachhawaha is a Solutions Architect at AWS with expertise in machine learning. He is responsible for helping customers architect scalable, secure, and cost-effective workloads on AWS.

Transforming financial analysis with CreditAI on Amazon Bedrock: Octus’s journey with AWS

Investment professionals face the mounting challenge of processing vast amounts of data to make timely, informed decisions. The traditional approach of manually sifting through countless research documents, industry reports, and financial statements is not only time-consuming but can also lead to missed opportunities and incomplete analysis. This challenge is particularly acute in credit markets, where the complexity of information and the need for quick, accurate insights directly impacts investment outcomes. Financial institutions need a solution that can not only aggregate and process large volumes of data but also deliver actionable intelligence in a conversational, user-friendly format. The intersection of AI and financial analysis presents a compelling opportunity to transform how investment professionals access and use credit intelligence, leading to more efficient decision-making processes and better risk management outcomes.

Founded in 2013, Octus, formerly Reorg, is the essential credit intelligence and data provider for the world’s leading buy side firms, investment banks, law firms and advisory firms. By surrounding unparalleled human expertise with proven technology, data and AI tools, Octus unlocks powerful truths that fuel decisive action across financial markets. Visit octus.com to learn how we deliver rigorously verified intelligence at speed and create a complete picture for professionals across the entire credit lifecycle. Follow Octus on LinkedIn and X.

Using advanced GenAI, CreditAI by Octus™ is a flagship conversational chatbot that supports natural language queries and real-time data access with source attribution, significantly reducing analysis time and streamlining research workflows. It gives instant access to insights on over 10,000 companies from hundreds of thousands of proprietary intel articles, helping financial institutions make informed credit decisions while effectively managing risk. Key features include chat history management, being able to ask questions that are targeted to a specific company or more broadly to a sector, and getting suggestions on follow-up questions.

In this post, we demonstrate how Octus migrated its flagship product, CreditAI, to Amazon Bedrock, transforming how investment professionals access and analyze credit intelligence. We walk through the journey Octus took from managing multiple cloud providers and costly GPU instances to implementing a streamlined, cost-effective solution using AWS services including Amazon Bedrock, AWS Fargate, and Amazon OpenSearch Service. We share detailed insights into the architecture decisions, implementation strategies, security best practices, and key learnings that enabled Octus to maintain zero downtime while significantly improving the application’s performance and scalability.

Opportunities for innovation

CreditAI by Octus™ version 1.x uses Retrieval Augmented Generation (RAG). It was built using a combination of in-house and external cloud services on Microsoft Azure for large language models (LLMs), Pinecone for vectorized databases, and Amazon Elastic Compute Cloud (Amazon EC2) for embeddings. Based on our operational experience, and as we started scaling up, we realized that there were several operational inefficiencies and opportunities for improvement:

  • Our in-house services for embeddings (deployed on EC2 instances) were not as scalable and reliable as needed. They also required more time on operational maintenance than our team could spare.
  • The overall solution was incurring high operational costs, especially due to the use of on-demand GPU instances. The real-time nature of our application meant that Spot Instances were not an option. Additionally, our investigation of lower-cost CPU-based instances revealed that they couldn’t meet our latency requirements.
  • The use of multiple external cloud providers complicated DevOps, support, and budgeting.

These operational inefficiencies meant that we had to revisit our solution architecture. It became apparent that a cost-effective solution for our generative AI needs was required. Enter Amazon Bedrock Knowledge Bases. With its support for knowledge bases that simplify RAG operations, vectorized search as part of its integration with OpenSearch Service, availability of multi-tenant embeddings, as well as Anthropic’s Claude suite of LLMs, it was a compelling choice for Octus to migrate its solution architecture. Along the way, it also simplified operations as Octus is an AWS shop more generally. However, we were still curious about how we would go about this migration, and whether there would be any downtime through the transition.

Strategic requirements

To help us move forward systematically, Octus identified the following key requirements to guide the migration to Amazon Bedrock:

  • Scalability – A crucial requirement was the need to scale operations from handling hundreds of thousands of documents to millions of documents. A significant challenge in the previous system was the slow (and relatively unreliable) process of embedding new documents into vector databases, which created bottlenecks in scaling operations.
  • Cost-efficiency and infrastructure optimization – CreditAI 1.x, though performant, was incurring high infrastructure costs due to the use of GPU-based, single-tenant services for embeddings and reranking. We needed multi-tenant alternatives that were much cheaper while enabling elasticity and scale.
  • Response performance and latency – The success of generative AI-based applications depends on the response quality and speed. Given our user base, it’s important that our responses are accurate while valuing users’ time (low latency). This is a challenge when the data size and complexity grow. We want to balance spatial and temporal retrieval in order to give responses that have the best answer and context relevance, especially when we get large quantities of data updated every day.
  • Zero downtime – CreditAI is in production and we could not afford any downtime during this migration.
  • Technological agility and innovation – In the rapidly evolving AI landscape, Octus recognized the importance of maintaining technological competitiveness. We wanted to move away from in-house development and feature maintenance such as embeddings services, rerankers, guardrails, and RAG evaluators. This would allow Octus to focus on product innovation and faster feature deployment.
  • Operational consolidation and reliability – Octus’s goal is to consolidate cloud providers, and to reduce support overheads and operational complexity.

Migration to Amazon Bedrock and addressing our requirements

Migrating to Amazon Bedrock addressed our aforementioned requirements in the following ways:

  • Scalability – The architecture of Amazon Bedrock, combined with AWS Fargate for Amazon ECS, Amazon Textract, and AWS Lambda, provided the elastic and scalable infrastructure necessary for this expansion while maintaining performance, data integrity, compliance, and security standards. The solution’s efficient document processing and embedding capabilities addressed the previous system’s limitations, enabling faster and more efficient knowledge base updates.
  • Cost-efficiency and infrastructure optimization – By migrating to Amazon Bedrock multi-tenant embedding, Octus achieved significant cost reduction while maintaining performance standards through Anthropic’s Claude Sonnet and improved embedding capabilities. This move alleviated the need for GPU-instance-based services in favor of more cost-effective and serverless Amazon ECS and Fargate solutions.
  • Response performance and latency – Octus verified the quality and latency of responses from Anthropic’s Claude Sonnet to confirm that response accuracy and latency were maintained (or even improved) as part of this migration. With this LLM, CreditAI was now able to respond better to broader, industry-wide queries than before.
  • Zero downtime – We were able to achieve zero downtime migration to Amazon Bedrock for our application using our in-house centralized infrastructure frameworks. Our frameworks comprise infrastructure as code (IaC) through Terraform, continuous integration and delivery (CI/CD), SOC2 security, monitoring, observability, and alerting for our infrastructure and applications.
  • Technological agility and innovation – Amazon Bedrock emerged as an ideal partner, offering solutions specifically designed for AI application development. Amazon Bedrock built-in features, such as embeddings services, reranking, guardrails, and the upcoming RAG evaluator, alleviated the need for in-house development of these components, allowing Octus to focus on product innovation and faster feature deployment.
  • Operational consolidation and reliability – The comprehensive suite of AWS services offers a streamlined framework that simplifies operations while providing high availability and reliability. This consolidation minimizes the complexity of managing multiple cloud providers and creates a more cohesive technological ecosystem. It also enables economies of scale with development velocity given that over 75 engineers at Octus already use AWS services for application development.

In addition, the Amazon Bedrock Knowledge Bases team worked closely with us to address several critical elements, including expanding embedding limits, managing the metadata limit (250 characters), testing different chunking methods, and syncing throughput to the knowledge base.

In the following sections, we explore our solution and how we addressed the details around the migration to Amazon Bedrock and Fargate.

Solution overview

The following figure illustrates our system architecture for CreditAI on AWS, with two key paths: the document ingestion and content extraction workflow, and the Q&A workflow for live user query response.

Solution Architecture

In the following sections, we dive into crucial details within key components in our solution. In each case, we connect them to the requirements discussed earlier for readability.

The document ingestion workflow (numbered in blue in the preceding diagram) processes content through five distinct stages:

  1. Documents uploaded to Amazon Simple Storage Service (Amazon S3) automatically invoke Lambda functions through S3 Event Notifications. This event-driven architecture provides immediate processing of new documents.
  2. Lambda functions process the event payload containing document location, perform format validation, and prepare content for extraction. This includes file type verification, size validation, and metadata extraction before routing to Amazon Textract.
  3. Amazon Textract processes the documents to extract both text and structural information. This service handles various formats, including PDFs, images, and forms, while preserving document layout and relationships between content elements.
  4. The extracted content is stored in a dedicated S3 prefix, separate from the source documents, maintaining clear data lineage. Each processed document maintains references to its source file, extraction timestamp, and processing metadata.
  5. The extracted content flows into Amazon Bedrock Knowledge Bases, where our semantic chunking strategy is implemented to divide content into optimal segments. The system then generates embeddings for each chunk and stores these vectors in OpenSearch Service for efficient retrieval. Throughout this process, the system maintains comprehensive metadata to support downstream filtering and source attribution requirements.
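
The validation step in stage 2 can be sketched as a small Lambda handler. This is a minimal illustration rather than Octus's actual code: the allowed extensions, the size limit, and the function names are assumptions, and only the S3 Event Notification field layout is taken from the documented payload format.

```python
import os
from urllib.parse import unquote_plus

# Hypothetical limits; the post does not state the real ones.
ALLOWED_EXTENSIONS = {".pdf", ".png", ".jpg", ".jpeg", ".tiff"}
MAX_SIZE_BYTES = 50 * 1024 * 1024


def parse_s3_records(event):
    """Extract bucket, key, and size from an S3 Event Notification payload."""
    documents = []
    for record in event.get("Records", []):
        s3 = record["s3"]
        documents.append({
            "bucket": s3["bucket"]["name"],
            # Object keys arrive URL-encoded in S3 event notifications.
            "key": unquote_plus(s3["object"]["key"]),
            "size": s3["object"].get("size", 0),
        })
    return documents


def validate_document(doc):
    """Return (ok, reason) after file type and size checks."""
    extension = os.path.splitext(doc["key"])[1].lower()
    if extension not in ALLOWED_EXTENSIONS:
        return False, f"unsupported file type: {extension or 'none'}"
    if doc["size"] > MAX_SIZE_BYTES:
        return False, "file too large"
    return True, "ok"


def lambda_handler(event, context):
    """Split incoming documents into accepted and rejected lists."""
    accepted, rejected = [], []
    for doc in parse_s3_records(event):
        ok, reason = validate_document(doc)
        (accepted if ok else rejected).append({**doc, "reason": reason})
    # A real handler would now route each accepted document to Amazon Textract.
    return {"accepted": accepted, "rejected": rejected}
```

Keeping parsing and validation as separate pure functions makes the checks easy to unit test without deploying the function.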

The Q&A workflow (numbered in yellow in the preceding diagram) processes user interactions through six integrated stages:

  1. The web application, hosted on AWS Fargate, handles user interactions and query inputs, managing initial request validation before routing queries to appropriate processing services.
  2. Amazon Managed Streaming for Apache Kafka (Amazon MSK) serves as the streaming service, providing reliable inter-service communication while maintaining message ordering and high-throughput processing for query handling.
  3. The Q&A handler, running on AWS Fargate, orchestrates the complete query response cycle by coordinating between services and processing responses through the LLM pipeline.
  4. The pipeline integrates with Amazon Bedrock foundation models through these components:
    1. Cohere Embeddings model performs vector transformations of the input.
    2. Amazon OpenSearch Service manages vector embeddings and performs similarity searches.
    3. Amazon Bedrock Knowledge Bases provides efficient access to the document repository.
  5. Amazon Bedrock Guardrails implements content filtering and safety checks as part of the query processing pipeline.
  6. Anthropic Claude LLM performs the natural language processing, generating responses that are then returned to the web application.

This integrated workflow provides efficient query processing while maintaining response quality and system reliability.

For scalability: Using OpenSearch Service as our vector database

Amazon OpenSearch Serverless emerged as the optimal solution for CreditAI’s evolving requirements, offering advanced capabilities while maintaining seamless integration within the AWS ecosystem:

  • Vector search capabilities – OpenSearch Serverless provides robust built-in vector search capabilities essential for our needs. The service supports hybrid search, allowing us to combine vector embeddings with raw text search without modifying our embedding model. This capability proved crucial for enabling broader question support in CreditAI 2.x, enhancing its overall usability and flexibility.
  • Serverless architecture benefits – The serverless design alleviates the need to provision, configure, or tune infrastructure, significantly reducing operational complexities. This shift allows our team to focus more time and resources on feature development and application improvements rather than managing underlying infrastructure.
  • AWS integration advantages – The tight integration with other AWS services, particularly Amazon S3 and Amazon Bedrock, streamlines our content ingestion process. This built-in compatibility provides a cohesive and scalable landscape for future enhancements while maintaining optimal performance.

OpenSearch Serverless enabled us to scale our vector search capabilities efficiently while minimizing operational overhead and maintaining high performance standards.
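
As a rough sketch of what hybrid search looks like from the application side, the following builds a request for the Amazon Bedrock Knowledge Bases Retrieve operation with the search type overridden to hybrid. The knowledge base ID, the default result count, and the helper names are placeholders chosen for illustration, not values from Octus's deployment.

```python
def build_retrieve_request(knowledge_base_id, query, top_k=5, hybrid=True):
    """Build keyword arguments for the bedrock-agent-runtime Retrieve operation."""
    vector_config = {"numberOfResults": top_k}
    if hybrid:
        # HYBRID combines vector similarity with text search; SEMANTIC is vector only.
        vector_config["overrideSearchType"] = "HYBRID"
    return {
        "knowledgeBaseId": knowledge_base_id,
        "retrievalQuery": {"text": query},
        "retrievalConfiguration": {"vectorSearchConfiguration": vector_config},
    }


def retrieve_chunks(request):
    """Send the request. Needs boto3 and AWS credentials, so it is not run here."""
    import boto3  # imported lazily so the builder above works without the SDK

    client = boto3.client("bedrock-agent-runtime")
    response = client.retrieve(**request)
    return [result["content"]["text"] for result in response["retrievalResults"]]
```

Because the query text is embedded by the service, the caller never handles vectors directly, which is what lets the search type change without touching the embedding model.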

For scalability and security: Splitting data across multiple vector databases with in-house support for intricate permissions

To enhance scalability and security, we implemented isolated knowledge bases (each backed by its own vector database) for each client’s data. Although this approach slightly increases costs, it delivers multiple significant benefits. Primarily, it maintains complete isolation of client data, providing enhanced privacy and security. Thanks to Amazon Bedrock Knowledge Bases, this solution doesn’t compromise on performance. Amazon Bedrock Knowledge Bases enables concurrent embedding and synchronization across multiple knowledge bases, allowing us to maintain real-time updates without delays, something that was unattainable with our earlier GPU-based architecture.

Additionally, we introduced two in-house services within Octus to strengthen this system:

  • AuthZ access management service – This service enforces granular access control, making sure users and applications can only interact with the data they are authorized to access. We had to migrate our AuthZ backend from Airbyte to native SQL replication so that it can support access management in near real time at scale.
  • Global identifiers service – This service provides a unified framework to link identifiers across multiple domains, enabling seamless integration and cross-referencing of identifiers across multiple datasets.

Together, these enhancements create a robust, secure, and highly efficient environment for managing and accessing client data.
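
The combination of per-client knowledge bases and an authorization check can be pictured with a small routing function. Everything here is hypothetical: the mappings, IDs, and names stand in for the in-house AuthZ service and the real knowledge base registry, which the post does not describe in detail.

```python
class AccessDenied(Exception):
    """Raised when a user is not entitled to a client's data."""


# Hypothetical data; in practice these would come from the AuthZ service
# and from the knowledge base configuration.
KNOWLEDGE_BASE_BY_CLIENT = {"client-a": "KB_A_ID", "client-b": "KB_B_ID"}
ENTITLEMENTS = {"alice": {"client-a"}, "bob": {"client-a", "client-b"}}


def resolve_knowledge_base(user, client_id):
    """Return the isolated knowledge base for a client, if the user may access it."""
    if client_id not in ENTITLEMENTS.get(user, set()):
        raise AccessDenied(f"{user} may not access {client_id}")
    return KNOWLEDGE_BASE_BY_CLIENT[client_id]
```

Checking entitlements before resolving the knowledge base ID means an unauthorized query never reaches another client's vector store.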

For cost efficiency: Adopting a multi-tenant embedding service

In our migration to Amazon Bedrock Knowledge Bases, Octus made a strategic shift from using an open-source embedding service on EC2 instances to using the managed embedding capabilities of Amazon Bedrock through Cohere’s multilingual model. This transition was carefully evaluated based on several key factors.

Our selection of Cohere’s multilingual model was driven by two primary advantages. First, it demonstrated superior retrieval performance in our comparative testing. Second, it offered robust multilingual support capabilities that were essential for our global operations.

The technical benefits of this migration manifested in two distinct areas: document embedding and message embedding. In document embedding, we transitioned from a CPU-based system to Amazon Bedrock Knowledge Bases, which enabled faster and higher throughput document processing through its multi-tenant architecture. For message embedding, we alleviated our dependency on dedicated GPU instances while maintaining optimal performance with 20–30 millisecond embedding times. The Amazon Bedrock Knowledge Bases API also simplified our operations by combining embedding and retrieval functionality into a single API call.

The migration to Amazon Bedrock Knowledge Bases managed embedding delivered two significant advantages: it eliminated the operational overhead of maintaining our own open-source solution while providing access to industry-leading embedding capabilities through Cohere’s model. This helped us achieve both our cost-efficiency and performance objectives without compromises.

For cost-efficiency and response performance: Choice of chunking strategy

Our primary goal was to improve three critical aspects of CreditAI’s responses: quality (accuracy of information), groundedness (ability to trace responses back to source documents), and relevance (providing information that directly answers user queries). To achieve this, we tested three different approaches to breaking down documents into smaller pieces (chunks):

  • Fixed chunking – Breaking text into fixed-length pieces
  • Semantic chunking – Breaking text based on natural semantic boundaries like paragraphs, sections, or complete thoughts
  • Hierarchical chunking – Creating a two-level structure with smaller child chunks for precise matching and larger parent chunks for contextual understanding

Our testing showed that both semantic and hierarchical chunking performed significantly better than fixed chunking in retrieving relevant information. However, each approach came with its own technical considerations.

Hierarchical chunking requires a larger chunk size to maintain comprehensive context during retrieval. This approach creates a two-level structure: smaller child chunks for precise matching and larger parent chunks for contextual understanding. During retrieval, the system first identifies relevant child chunks and then automatically includes their parent chunks to provide broader context. Although this method optimizes both search precision and context preservation, we couldn’t implement it with our preferred Cohere embeddings because they only support chunks up to 512 tokens, which is insufficient for the parent chunks needed to maintain effective hierarchical relationships.

Semantic chunking uses LLMs to intelligently divide text by analyzing both semantic similarity and natural language structures. Instead of arbitrary splits, the system identifies logical break points by calculating embedding-based similarity scores between sentences and paragraphs, making sure semantically related content stays together. The resulting chunks maintain context integrity by considering both linguistic features (like sentence and paragraph boundaries) and semantic coherence, though this precision comes at the cost of additional computational resources for LLM analysis and embedding calculations.

After evaluating our options, we chose semantic chunking despite two trade-offs:

  • It requires additional processing by our LLMs, which increases costs
  • It has a limit of 1,000,000 tokens per document processing batch

We made this choice because semantic chunking offered the best balance between implementation simplicity and retrieval performance. Although hierarchical chunking showed promise, it would have been more complex to implement and harder to scale. This decision helped us maintain high-quality, grounded, and relevant responses while keeping our system manageable and efficient.
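
To make the trade-off concrete, the sketch below builds the chunking configuration that a Bedrock knowledge base data source accepts for the semantic and hierarchical strategies. The token sizes are illustrative defaults, not Octus's settings, and the 512-token check simply mirrors the embedding limit described above.

```python
EMBEDDING_MAX_TOKENS = 512  # chunk limit the post cites for the Cohere embedding model


def semantic_chunking(max_tokens=300, buffer_size=1, breakpoint_percentile=95):
    """Chunking configuration for the SEMANTIC strategy (illustrative values)."""
    if max_tokens > EMBEDDING_MAX_TOKENS:
        raise ValueError("chunk size exceeds the embedding model limit")
    return {
        "chunkingStrategy": "SEMANTIC",
        "semanticChunkingConfiguration": {
            "maxTokens": max_tokens,
            "bufferSize": buffer_size,
            "breakpointPercentileThreshold": breakpoint_percentile,
        },
    }


def hierarchical_chunking(parent_tokens=1500, child_tokens=300, overlap_tokens=60):
    """Chunking configuration for the HIERARCHICAL strategy (illustrative values)."""
    if parent_tokens > EMBEDDING_MAX_TOKENS:
        # Mirrors the constraint described in the post: parent chunks large enough
        # to carry context do not fit within the 512-token limit.
        raise ValueError("parent chunk exceeds the embedding model limit")
    return {
        "chunkingStrategy": "HIERARCHICAL",
        "hierarchicalChunkingConfiguration": {
            "levelConfigurations": [
                {"maxTokens": parent_tokens},
                {"maxTokens": child_tokens},
            ],
            "overlapTokens": overlap_tokens,
        },
    }
```

The returned dictionary would sit under `vectorIngestionConfiguration.chunkingConfiguration` when creating the data source; with the default parent size, `hierarchical_chunking()` raises, which is the situation that ruled the strategy out here.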

For response performance and technical agility: Adopting Amazon Bedrock Guardrails with Amazon Bedrock Knowledge Bases

Our implementation of Amazon Bedrock Guardrails focused on three key objectives: enhancing response security, optimizing performance, and simplifying guardrail management. This service plays a crucial role in making sure our responses are both safe and efficient.

Amazon Bedrock Guardrails provides a comprehensive framework for content filtering and response moderation. The system works by evaluating content against predefined rules before the LLM processes it, helping prevent inappropriate content and maintaining response quality. Through the Amazon Bedrock Guardrails integration with Amazon Bedrock Knowledge Bases, we can configure, test, and iterate on our guardrails without writing complex code.

We achieved significant technical improvements in three areas:

  • Simplified moderation framework – Instead of managing multiple separate denied topics, we consolidated our content filtering into a unified guardrail service. This approach allows us to maintain a single source of truth for content moderation rules, with support for customizable sample phrases that help fine-tune our filtering accuracy.
  • Performance optimization – We improved system performance by integrating guardrail checks directly into our main prompts, rather than running them as separate operations. This optimization reduced our token usage and minimized unnecessary API calls, resulting in lower latency for each query.
  • Enhanced content control – The service provides configurable thresholds for filtering potentially harmful content and includes built-in capabilities for detecting hallucinations and assessing response relevance. This alleviated our dependency on external services like TruLens while maintaining robust content quality controls.

These improvements have helped us maintain high response quality while reducing both operational complexity and processing overhead. The integration with Amazon Bedrock has given us a more streamlined and efficient approach to content moderation.
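
One way to picture the integration is a single RetrieveAndGenerate request that carries the guardrail alongside the knowledge base settings. This is a sketch under assumptions: the IDs and model ARN are placeholders supplied by the caller, and the helper names are ours, not part of the service.

```python
def build_rag_request(query, knowledge_base_id, model_arn, guardrail_id, guardrail_version):
    """Build keyword arguments for bedrock-agent-runtime RetrieveAndGenerate."""
    return {
        "input": {"text": query},
        "retrieveAndGenerateConfiguration": {
            "type": "KNOWLEDGE_BASE",
            "knowledgeBaseConfiguration": {
                "knowledgeBaseId": knowledge_base_id,
                "modelArn": model_arn,
                "generationConfiguration": {
                    # The guardrail is applied as part of the same call,
                    # so no separate moderation request is needed.
                    "guardrailConfiguration": {
                        "guardrailId": guardrail_id,
                        "guardrailVersion": guardrail_version,
                    }
                },
            },
        },
    }


def answer(request):
    """Send the request. Needs boto3 and AWS credentials, so it is not run here."""
    import boto3  # imported lazily so the builder above works without the SDK

    client = boto3.client("bedrock-agent-runtime")
    response = client.retrieve_and_generate(**request)
    # guardrailAction reports whether the guardrail intervened on this response.
    return response["output"]["text"], response.get("guardrailAction")
```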

To achieve zero downtime: Infrastructure migration

Our migration to Amazon Bedrock required careful planning to provide uninterrupted service for CreditAI while significantly reducing infrastructure costs. We achieved this through our comprehensive infrastructure framework that addresses deployment, security, and monitoring needs:

  • IaC implementation – We used reusable Terraform modules to manage our infrastructure consistently across environments. These modules enabled us to share configurations efficiently between services and projects. Our approach supports multi-Region deployments with minimal configuration changes while maintaining infrastructure version control alongside application code.
  • Automated deployment strategy – Our GitOps-embedded framework streamlines the deployment process by implementing a clear branching strategy for different environments. This automation handles CreditAI component deployments through CI/CD pipelines, reducing human error through automated validation and testing. The system also enables rapid rollback capabilities if needed.
  • Security and compliance – To maintain SOC2 compliance and robust security, our framework incorporates comprehensive access management controls and data encryption at rest and in transit. We follow network security best practices, conduct regular security audits and monitoring, and run automated compliance checks in the deployment pipeline.

We maintained zero downtime during the entire migration process while reducing infrastructure costs by 70% by eliminating GPU instances. The successful transition from Amazon ECS on Amazon EC2 to Amazon ECS with Fargate has simplified our infrastructure management and monitoring.

Achieving excellence

CreditAI’s migration to Amazon Bedrock has yielded remarkable results for Octus:

  • Scalability – We have almost doubled the number of documents available for Q&A across three environments in days instead of weeks. Our use of Amazon ECS with Fargate with auto scaling rules and controls gives us elastic scalability for our services during peak usage hours.
  • Cost-efficiency and infrastructure optimization – By moving away from GPU-based clusters to Fargate, our monthly infrastructure costs are now 78.47% lower, and our per-question costs have reduced by 87.6%.
  • Response performance and latency – Latency has not degraded, and we have seen a 27% increase in questions answered successfully. We have also seen a 250% boost in user engagement. Users especially love our support for broad, industry-wide questions enabled by Anthropic’s Claude Sonnet.
  • Zero downtime – We experienced zero downtime during migration and 99% uptime overall for the whole application.
  • Technological agility and innovation – We have been able to add new document sources in a quarter of the time it took pre-migration. In addition, we adopted enhanced guardrails support for free and no longer have to retrieve documents from the knowledge base and pass the chunks to Anthropic’s Claude Sonnet to trigger a guardrail.
  • Operational consolidation and reliability – Post-migration, our DevOps and SRE teams see 20% less maintenance burden and overheads. Supporting SOC2 compliance is also straightforward now that we’re using only one cloud provider.

Operational monitoring

We use Datadog to monitor both LLM latency and our document ingestion pipeline, providing real-time visibility into system performance. The following screenshot showcases how we use custom Datadog dashboards to provide a live view of the document ingestion pipeline. This visualization offers both a high-level overview and detailed insights into the ingestion process, helping us understand the volume, format, and status of the documents processed. The bottom half of the dashboard presents a time-series view of document processing volumes. The timeline tracks fluctuations in processing rates, identifies peak activity periods, and provides actionable insights to optimize throughput. This detailed monitoring system enables us to maintain efficiency, minimize failures, and provide scalability.

Observability Dashboard

Roadmap

Looking ahead, Octus plans to continue enhancing CreditAI by taking advantage of new capabilities released by Amazon Bedrock that continue to meet and exceed our requirements. Future developments will include:

  • Enhance retrieval by testing and integrating with reranking techniques, allowing the system to prioritize the most relevant search results for better user experience and accuracy.
  • Explore the Amazon Bedrock RAG evaluator to capture detailed metrics on CreditAI’s performance. This will add to the existing mechanisms at Octus to track performance that include tracking unanswered questions.
  • Expand to ingest large-scale structured data, making it capable of handling complex financial datasets. The integration of text-to-SQL will enable users to query structured databases using natural language, simplifying data access.
  • Explore replacing our in-house content extraction service (ADE) with the Amazon Bedrock advanced parsing solution to potentially further reduce document ingestion costs.
  • Improve CreditAI’s disaster recovery and redundancy mechanisms, making sure that our services and infrastructure are more fault tolerant and can recover from outages faster.

These upgrades aim to boost the precision, reliability, and scalability of CreditAI.

Vishal Saxena, CTO at Octus, shares: “CreditAI is a first-of-its-kind generative AI application that focuses on the entire credit lifecycle. It is truly ’AI embedded’ software that combines cutting-edge AI technologies with an enterprise data architecture and a unified cloud strategy.”

Conclusion

CreditAI by Octus is the company’s flagship conversational chatbot that supports natural language queries and gives instant access to insights on over 10,000 companies from hundreds of thousands of proprietary intel articles. In this post, we described in detail our motivation, process, and results on Octus’s migration to Amazon Bedrock. Through this migration, Octus achieved remarkable results that included an over 75% reduction in operating costs as well as a 250% boost in engagement. Future steps include adopting new features such as reranking, RAG evaluator, and advanced parsing to further reduce costs and improve performance. We believe that the collaboration between Octus and AWS will continue to revolutionize financial analysis and research workflows.

To learn more about Amazon Bedrock, refer to the Amazon Bedrock User Guide.


About the Authors

Vaibhav Sabharwal is a Senior Solutions Architect with Amazon Web Services based out of New York. He is passionate about learning new cloud technologies and assisting customers in building cloud adoption strategies, designing innovative solutions, and driving operational excellence. As a member of the Financial Services Technical Field Community at AWS, he actively contributes to the collaborative efforts within the industry.

Yihnew Eshetu is a Senior Director of AI Engineering at Octus, leading the development of AI solutions at scale to address complex business problems. With seven years of experience in AI/ML, his expertise spans GenAI and NLP, specializing in designing and deploying agentic AI systems. He has played a key role in Octus’s AI initiatives, including leading AI Engineering for its flagship GenAI chatbot, CreditAI.

Harmandeep Sethi is a Senior Director of SRE Engineering and Infrastructure Frameworks at Octus, with nearly 10 years of experience leading high-performing teams in the design, implementation, and optimization of large-scale, highly available, and reliable systems. He has played a pivotal role in transforming and modernizing CreditAI infrastructure and services by driving best practices in observability, resilience engineering, and the automation of operational processes through Infrastructure Frameworks.

Rohan Acharya is an AI Engineer at Octus, specializing in building and optimizing AI-driven solutions at scale. With expertise in GenAI and NLP, he focuses on designing and deploying intelligent systems that enhance automation and decision-making. His work involves developing robust AI architectures and advancing Octus’s AI initiatives, including the evolution of CreditAI.

Hasan Hasibul is a Principal Architect at Octus leading the DevOps team, with nearly 12 years of experience in building scalable, complex architectures while following software development best practices. A true advocate of clean code, he thrives on solving complex problems and automating infrastructure. Passionate about DevOps, infrastructure automation, and the latest advancements in AI, he architected Octus’s initial CreditAI, pushing the boundaries of innovation.

Philipe Gutemberg is a Principal Software Engineer and AI Application Development Team Lead at Octus, passionate about leveraging technology for impactful solutions. An AWS Certified Solutions Architect – Associate (SAA), he has expertise in software architecture, cloud computing, and leadership. Philipe led both backend and frontend application development for CreditAI, ensuring a scalable system that integrates AI-driven insights into financial applications. A problem-solver at heart, he thrives in fast-paced environments, delivering innovative solutions for financial institutions while fostering mentorship, team development, and continuous learning.

Kishore Iyer is the VP of AI Application Development and Engineering at Octus. He leads teams that build, maintain, and support Octus’s customer-facing GenAI applications, including CreditAI, our flagship AI offering. Before joining Octus, Kishore spent 15+ years in engineering leadership roles across large corporations, startups, research labs, and academia. He holds a Ph.D. in computer engineering from Rutgers University.

Kshitiz Agarwal is an Engineering Leader at Amazon Web Services (AWS), where he leads the development of Amazon Bedrock Knowledge Bases. With a decade of experience at Amazon, having joined in 2012, Kshitiz has gained deep insights into the cloud computing landscape. His passion lies in engaging with customers and understanding the innovative ways they leverage AWS to drive their business success. Through his work, Kshitiz aims to contribute to the continuous improvement of AWS services, enabling customers to unlock the full potential of the cloud.

Sandeep Singh is a Senior Generative AI Data Scientist at Amazon Web Services, helping businesses innovate with generative AI. He specializes in generative AI, machine learning, and system design. He has successfully delivered state-of-the-art AI/ML-powered solutions to solve complex business problems for diverse industries, optimizing efficiency and scalability.

Tim Ramos is a Senior Account Manager at AWS. He has 12 years of sales experience and 10 years of experience in cloud services, IT infrastructure, and SaaS. Tim is dedicated to helping customers develop and implement digital innovation strategies. His focus areas include business transformation, financial and operational optimization, and security. Tim holds a BA from Gonzaga University and is based in New York City.

Optimize reasoning models like DeepSeek with prompt optimization on Amazon Bedrock

DeepSeek-R1 models, now available on Amazon Bedrock Marketplace, Amazon SageMaker JumpStart, and as a serverless model on Amazon Bedrock, were recently popularized by their long and elaborate thinking style, which, according to DeepSeek’s published results, leads to impressive performance on highly challenging math benchmarks like AIME-2024 and MATH-500, as well as competitive performance compared to then state-of-the-art models like Anthropic’s Claude 3.5 Sonnet, GPT-4o, and OpenAI o1 (more details in this paper).

During training, researchers showed how DeepSeek-R1-Zero naturally learns to solve tasks with more thinking time, which leads to a boost in performance. However, what often gets ignored is the number of thinking tokens required at inference time, and the time and cost of generating these tokens before answering the original question.

In this post, we demonstrate how to optimize reasoning models like DeepSeek-R1 using prompt optimization on Amazon Bedrock.

Long reasoning chains and challenges with maximum token limits

Let’s try out a straightforward question on DeepSeek-R1:

For the given math problem: Nate’s dog can dig six holes a day. He digs for 14 days while Nate is on vacation. When Nate gets home, he starts filling in 9 holes a day, but the dog keeps digging 6 new holes every night. How many weeks does it take him to fill in all the holes?, write out the steps you would take to solve it.

On the Amazon Bedrock Chat/Text Playground, you can follow along by choosing the new DeepSeek-R1 model, as shown in the following screenshot.

You might see that sometimes, based on the question, reasoning models don’t finish thinking within the overall maximum token budget.

Increasing the output token budget allows the model to think for longer. With the maximum tokens increased from 2,048 to 4,096, you should see the model reasoning for a while before printing the final answer.

The appendix at the end of this post provides the complete response. You can also collapse the reasoning steps to view just the final answer.

As we can see in the case with the 2,048-token budget, the thinking process didn’t end. This not only cost us 2,048 tokens’ worth of time and money, but we also didn’t get the final answer! This observation of high token counts for thinking usually leads to a few follow-up questions, such as:

  • Is it possible to reduce the thinking tokens and still get a correct answer?
  • Can the thinking be restricted to a maximum number of thinking tokens, or a thinking budget?
  • At a high level, should thinking-intensive models like DeepSeek be used in real-time applications at all?
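
Outside the playground, the same truncation can be detected programmatically. The following sketch uses the Amazon Bedrock Converse API shape: the request sets the output token budget, and the response's stop reason shows whether the model ran out of tokens mid-thought. The model ID is left as a caller-supplied placeholder, and the helper names are ours.

```python
def build_converse_request(model_id, prompt, max_tokens):
    """Build keyword arguments for the bedrock-runtime Converse operation."""
    return {
        "modelId": model_id,
        "messages": [{"role": "user", "content": [{"text": prompt}]}],
        "inferenceConfig": {"maxTokens": max_tokens},
    }


def summarize_response(response):
    """Report whether generation stopped because the token budget ran out."""
    return {
        "truncated": response["stopReason"] == "max_tokens",
        "output_tokens": response["usage"]["outputTokens"],
    }


def ask(request):
    """Send the request. Needs boto3 and AWS credentials, so it is not run here."""
    import boto3  # imported lazily so the helpers above work without the SDK

    client = boto3.client("bedrock-runtime")
    return summarize_response(client.converse(**request))
```

A caller could retry with a larger budget when `truncated` is true, at the cost of paying for the discarded tokens from the first attempt.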

In this post, we show you how you can optimize thinking models like DeepSeek-R1 using prompt optimization on Amazon Bedrock, resulting in more succinct thinking traces without sacrificing accuracy.

Optimize DeepSeek-R1 prompts

To get started with prompt optimization, select DeepSeek-R1 in the model playground on Amazon Bedrock, enter your prompt, and choose the magic wand icon, or call the Amazon Bedrock optimize_prompt() API. You can also use prompt optimization on the console: add variables if required, set your model to DeepSeek-R1, configure the model parameters, and choose “Optimize”.
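
For the API route, a minimal sketch with boto3 might look like the following. It assumes the response is an event stream whose events include an optimized prompt event; the target model ID is a caller-supplied placeholder, and the event field names should be checked against the current SDK documentation before relying on them.

```python
def extract_optimized_prompt(events):
    """Return the text of the last optimized prompt event, or None if absent."""
    optimized = None
    for event in events:
        if "optimizedPromptEvent" in event:
            optimized = event["optimizedPromptEvent"]["optimizedPrompt"]["textPrompt"]["text"]
    return optimized


def optimize(prompt, target_model_id):
    """Call optimize_prompt. Needs boto3 and AWS credentials, so it is not run here."""
    import boto3  # imported lazily so the parser above works without the SDK

    client = boto3.client("bedrock-agent-runtime")
    response = client.optimize_prompt(
        input={"textPrompt": {"text": prompt}},
        targetModelId=target_model_id,
    )
    return extract_optimized_prompt(response["optimizedPrompt"])
```

Separating the stream parsing from the network call makes it possible to test the parsing against recorded events.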

To demonstrate how prompt optimization on Amazon Bedrock can help with reasoning models, we first need a challenging dataset. Humanity’s Last Exam (HLE), a benchmark of extremely challenging questions from dozens of subject areas, is designed to be the “final” closed-ended benchmark of broad academic capabilities. HLE is multi-modal, featuring questions that are either text-only or accompanied by an image reference, and includes both multiple-choice and exact-match questions for automated answer verification. The questions require deep domain knowledge in various verticals; they are unambiguous and resistant to simple internet lookups or database retrieval. For context, several state-of-the-art models (including thinking models) perform poorly on the benchmark (see the results table in this full paper).

Let’s look at an example question from this dataset:

In an alternate universe where the mass of the electron was 1% heavier and the charges of the
electron and proton were both 1% smaller, but all other fundamental constants stayed the same, 
approximately how would the speed of sound in diamond change? 

Answer Choices:
A. Decrease by 2%
B. Decrease by 1.5%
C. Decrease by 1%
D. Decrease by 0.5%
E. Stay approximately the same 
F. Increase by 0.5%
G. Increase by 1%
H. Increase by 1.5%
I. Increase by 2%

The question requires a deep understanding of physics, which most large language models (LLMs) today will fail at. Our goal with prompt optimization on Amazon Bedrock for reasoning models is to reduce the number of thinking tokens but not sacrifice accuracy. After using prompt optimization, the optimized prompt is as follows:

## Question
<extracted_question_1>In an alternate universe where the mass of the electron was 1% heavier 
and the charges of the electron and proton were both 1% smaller, but all other fundamental constants
stayed the same, approximately how would the speed of sound in diamond change?

Answer Choices:
A. Decrease by 2%
B. Decrease by 1.5%
C. Decrease by 1%
D. Decrease by 0.5%
E. Stay approximately the same
F. Increase by 0.5%
G. Increase by 1%
H. Increase by 1.5%
I. Increase by 2%</extracted_question_1>

## Instruction
Read the question above carefully and provide the most accurate answer possible.
If multiple choice options are provided within the question, respond with the entire text of the
correct answer option, not just the letter or number. Do not include any additional explanations or
preamble in your response.

Remember, your goal is to answer as precisely and accurately as possible!

The following figure shows how, for this specific case, the number of thinking tokens fell by about 35% (from 5,000 to 3,300) while the final answer remained correct (B. Decrease by 1.5%). We also notice that with the original prompts, in this and other examples, part of the reasoning is summarized or repeated before the final answer. The optimized prompt, by contrast, gives clear instructions, separates the prompt into sections, and provides additional guidance based on the type of question and how to answer it. This leads to shorter, clearer reasoning traces and a directly extractable final answer.

Optimized prompts can also turn wrong answers into correct ones, because long-form thinking doesn’t guarantee a correct final answer. In the next case, the number of thinking tokens fell from 5,000 to 1,555, and the answer is given directly rather than after another long, post-thinking explanation. The following figure shows this example.

The preceding two examples demonstrate ways in which prompt optimization can improve results while shortening the output for models like DeepSeek-R1. We also applied prompt optimization to 400 questions from HLE. The following table summarizes the results.

| Experiment | Overall Accuracy (%) | Average Number of Prompt Tokens | Average Number of Completion Tokens (Thinking + Response) | Average Number of Tokens (Response Only) | Average Number of Tokens (Thinking Only) | Percentage of Thinking Completed (6,000 Maximum Output Tokens) |
| --- | --- | --- | --- | --- | --- | --- |
| Baseline DeepSeek | 8.75 | 288 | 3334 | 271 | 3063 | 80.0% |
| Prompt Optimized DeepSeek | 11 | 326 | 1925 | 27 | 1898 | 90.3% |

As we can see, the overall accuracy rises from 8.75% to 11% on this subset of the HLE dataset, the number of thinking and output tokens is reduced (therefore reducing the time to last token and cost), and the rate of completing thinking increases from 80% to about 90%. From our experiments, we see that although the optimized prompt makes no explicit reference to reducing the thinking tokens, its clearer, more detailed instructions about the task at hand might reduce the additional effort that models like DeepSeek-R1 spend on self-clarification or deeper problem understanding. Prompt optimization for reasoning models leaves the quality of thinking and overall flow, which is self-adaptive and dependent on the question, largely unaffected, leading to better final answers.
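To make the size of these changes concrete, the following short sketch derives relative reductions from the figures in the table. The numbers are taken directly from the table; the dictionary keys and helper names are our own, and the conversion to a count of correct answers assumes that accuracy is measured over all 400 questions.

```python
QUESTIONS = 400  # size of the HLE subset used in the experiment

# Figures copied from the results table.
baseline = {"accuracy_pct": 8.75, "completion": 3334, "response": 271, "thinking": 3063}
optimized = {"accuracy_pct": 11.0, "completion": 1925, "response": 27, "thinking": 1898}


def pct_reduction(before: float, after: float) -> float:
    """Relative reduction from before to after, in percent, rounded to one decimal."""
    return round(100.0 * (before - after) / before, 1)


def correct_answers(accuracy_pct: float, total: int = QUESTIONS) -> int:
    """Number of correctly answered questions implied by an accuracy percentage."""
    return round(accuracy_pct * total / 100)


summary = {
    "thinking_reduction_pct": pct_reduction(baseline["thinking"], optimized["thinking"]),
    "completion_reduction_pct": pct_reduction(baseline["completion"], optimized["completion"]),
    "response_reduction_pct": pct_reduction(baseline["response"], optimized["response"]),
    "baseline_correct": correct_answers(baseline["accuracy_pct"]),
    "optimized_correct": correct_answers(optimized["accuracy_pct"]),
}

for name, value in summary.items():
    print(f"{name}: {value}")
```

Under these assumptions, average thinking tokens drop by about 38%, total completion tokens by about 42%, and response-only tokens by about 90%, while the number of correct answers goes from 35 to 44 out of 400.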

Conclusion

In this post, we demonstrated how prompt optimization on Amazon Bedrock can effectively enhance the performance of thinking-intensive models like DeepSeek-R1. Through our experiments with the HLE dataset, we showed that optimized prompts not only reduced the average number of thinking tokens (from 3,063 to 1,898), but also improved overall accuracy from 8.75% to 11%. The optimization resulted in more efficient reasoning paths without sacrificing the quality of answers, leading to faster response times and lower costs. This improvement in both efficiency and effectiveness suggests that prompt optimization can be a valuable tool for deploying reasoning-heavy models in production environments where both accuracy and computational resources need to be carefully balanced. As the field of AI continues to evolve with more sophisticated thinking models, techniques like prompt optimization will become increasingly important for practical applications.

To get started with prompt optimization on Amazon Bedrock, refer to Optimize a prompt and Improve the performance of your Generative AI applications with Prompt Optimization on Amazon Bedrock.

Appendix

The following is the full response for the question about Nate’s dog:


Okay, let's see. So the problem is about Nate's dog digging holes and Nate trying to fill them. I need to figure out how many weeks it takes Nate to fill all the holes once he starts. Let me break it down step by step.

First, the dog digs 6 holes a day. He does this for 14 days while Nate is on vacation. So when Nate comes back, there are a certain number of holes already dug. Let me calculate that first. If the dog digs 6 holes each day for 14 days, that's 6 multiplied by 14. Let me do that math: 6*14=84. So there are 84 holes when Nate gets home.

Now, once Nate starts filling the holes, he fills 9 holes per day. But here's the catch: the dog keeps digging 6 new holes every night. So each day, Nate fills 9 holes, but the dog adds 6 more overnight. I need to figure out the net change per day. That would be the number of holes filled minus the new holes dug. So 9 filled - 6 dug = 3 holes net reduction per day. Wait, but does this happen every day? Let me make sure.

So Nate works during the day, filling 9 holes. Then, during the night, the dog digs 6 new holes. So each full day (day and night) results in a net decrease of 3 holes. So starting from 84 holes, each day the total number decreases by 3. But I need to check if the process is that Nate fills 9, then the dog digs 6, so the next day starts with 84 -9 +6 = 84 -3 =81, and so on. So yes, each day the net is -3.

But wait, maybe the timing is important. Let's think: when Nate starts filling, does he fill 9 holes on the first day, and then the dog digs 6 that night? So each day after Nate fills, the dog adds 6. So the sequence is: Day 1: Nate fills 9, holes become 84-9=75. Then night: dog digs 6, holes become 75+6=81. Day 2: Nate fills 9, holes 81-9=72. Night: +6=78. So each full day cycle (day and night) reduces the total by 3. So yes, net 3 per day.

But wait, maybe on the last day, if Nate fills the remaining holes and the dog doesn't get to dig that night because all holes are filled. Hmm, that's a crucial point. For example, if on a certain day, the number of holes left is less than or equal to 9, Nate can fill them all in that day, and the dog doesn't add any more because there are no holes left to dig. So we need to check if on the day when Nate fills the remaining holes, the total would reach zero, and the dog doesn't dig that night. So we can't just do total holes divided by 3, because the last day might not have the dog digging.

So let's model this properly. Let's denote the number of holes each day.

Starting with 84 holes.

Each day:

1. Nate fills 9 holes. Subtract 9.

2. If the result is zero or negative, then done.

3. Otherwise, the dog digs 6 holes. Add 6.

So we need to iterate this until the number of holes is zero or less after Nate fills them.

Let me try to calculate day by day.

Start: 84

Day 1:

After Nate fills: 84 -9 =75

Since 75 >0, dog digs 6: 75 +6=81

Day 2:

After Nate fills:81-9=72

Dog digs 6:72+6=78

Day3:

78-9=69

+6=75

Day4:75-9=66+6=72

Day5:72-9=63+6=69

Day6:69-9=60+6=66

Day7:66-9=57+6=63

Day8:63-9=54+6=60

Day9:60-9=51+6=57

Day10:57-9=48+6=54

Day11:54-9=45+6=51

Day12:51-9=42+6=48

Day13:48-9=39+6=45

Day14:45-9=36+6=42

Day15:42-9=33+6=39

Day16:39-9=30+6=36

Day17:36-9=27+6=33

Day18:33-9=24+6=30

Day19:30-9=21+6=27

Day20:27-9=18+6=24

Day21:24-9=15+6=21

Day22:21-9=12+6=18

Day23:18-9=9+6=15

Day24:15-9=6+6=12

Day25:12-9=3+6=9

Day26:9-9=0

At this point, after Nate fills on day26, the holes are 0. Since there are no holes left, the dog can't dig any more that night. So the process stops here.

So total days:26 days.

But the question asks for how many weeks. Since a week is 7 days, we divide 26 by7.

26 divided by7 is 3 with a remainder of5. So 3 weeks and5 days. But the question says "how many weeks does it take him to fill in all the holes". Depending on whether they want it rounded up to full weeks or expressed as a fraction. Let me check the problem statement again.

"How many weeks does it take him to fill in all the holes?"

Typically, such problems expect the answer in whole weeks, rounded up if there's a remainder. Because you can't have a fraction of a week in practical terms. So 26 days is 3 weeks and5 days, which is more than3 weeks but less than4. So they might expect 4 weeks. But let me verify.

Alternatively, maybe they want the exact number in weeks as a fraction. 26 days is 26/7 weeks ≈3.714 weeks. But the problem might expect an integer number of weeks, rounded up. Let's see.

But let's check the problem again. It says "how many weeks does it take him to fill in all the holes?" So if it takes 26 days, which is 3 weeks and5 days, then the answer would be 4 weeks if we're counting full weeks needed. Because after 3 weeks (21 days), he still has 5 days left. So total weeks is4.

But maybe the answer is 26/7 weeks, which is approximately3.714, but since the problem might expect an exact answer in weeks, possibly as a fraction. However, the problem might be designed to have a whole number. Wait, let's check the math again.

Wait, when I did the day-by-day count, I got 26 days. Let me recount to make sure I didn't make a mistake.

Starting from day1 to day26. Let's check day26:

Start with day1:84

After day1:81

...

Let me check day25:

After day24:12

Day25:12-9=3, then dog digs6:3+6=9

Day26:9-9=0. So day26 is the day when Nate fills the last 9 holes. So yes, 26 days.

26 days divided by7 days/week=3.714... weeks. So the answer is 26/7 weeks, which is3 and5/7 weeks. But the problem might want it expressed as a mixed number or an exact number. However, the question says "how many weeks does it take him", so likely they expect the answer in weeks, possibly as a fraction. But let's see if there's a better way to model this without counting each day.

Alternative approach:

Total holes after vacation:84

Each day, Nate fills9, dog digs6, net -3 per day, except possibly the last day.

But if on the last day, Nate fills the remaining holes without the dog digging, then we need to adjust.

Let’s model it as:

Let H be the number of holes. Start with H=84.

Each day, H decreases by9, then if H>0, increases by6.

We need to find the number of days until H<=0.

This can be modeled as:

Each full cycle (day and night) reduces H by3, except the last day where if H<=9, then H becomes0 and no addition.

So the number of full cycles (days where H>9 after filling) plus the final day.

Let’s denote:

Let’s say after n days, the holes would be:

Each day, H = H -9 +6 = H -3, but only if after subtracting9, H >=1 (since if H-9 <=0, then the dog doesn’t dig). Wait, no. Wait, the dog digs6 only if after Nate fills9, there are still holes left. Wait, no. Wait, the problem says the dog keeps digging6 new holes every night. So regardless of whether there are holes left, the dog digs6 every night. Wait, but if all holes are filled, can the dog dig new ones? The problem says "fill in all the holes", so once all holes are filled, the dog can't dig new ones because there are no existing holes? Wait, the problem might be implying that the dog digs new holes regardless, but Nate is filling the existing ones. Wait, the problem says "the dog keeps digging6 new holes every night". So perhaps every night, regardless of the current state, the dog digs6 new holes. But that would mean that even if Nate fills all holes on a day, the dog digs6 new ones that night, making it impossible to ever finish. But that can't be, because the problem asks how long to fill all holes, which implies that eventually Nate can catch up.

Wait, this is a critical point. Let me re-examine the problem statement:

"When Nate gets home, he starts filling in 9 holes a day, but the dog keeps digging 6 new holes every night. How many weeks does it take him to fill in all the holes?"

So the key is: every night, the dog digs6 new holes. So even if Nate fills all the holes during the day, the dog will dig6 new ones that night. But that would mean the process never ends. But the problem must assume that once all holes are filled, the dog stops. But the wording is ambiguous. However, logically, if all holes are filled, there's nothing for the dog to dig. So perhaps the dog only digs6 holes if there's ground to dig, but the problem states "keeps digging6 new holes every night", implying that regardless of the current state, the dog digs6 each night. This is a problem because if Nate ever fills all holes, the next night the dog digs6, so Nate would have to fill those6 the next day, but then the dog digs6 again, leading to an infinite loop. But the problem must have a solution, so likely the dog stops digging once all holes are filled. Therefore, the correct interpretation is that each night, the dog digs6 new holes, but if there are no holes (i.e., all filled), the dog can't dig. Therefore, the last day when Nate fills the remaining holes, the dog doesn't dig that night.

Therefore, the correct model is:

Each day:

1. Nate fills9 holes.

2. If holes remain after filling, the dog digs6 that night.

So the process is:

Start with H=84.

While H>0:

    H = H -9

    if H >0:

        H = H +6

    days +=1

So let's model this with equations.

Let’s find the number of days required.

This is similar to the classic problem where each day you have a net change, but with a different condition on the last day.

Let’s denote that each "cycle" (day and night) reduces H by3, except possibly the last day where H <=9, so after filling, H becomes0, and no addition.

So the total number of days is the number of full cycles (each reducing H by3) plus the final day.

But how to calculate this?

Let’s think of it as:

Let’s suppose that after n days, the number of holes is H_n =84 -3n.

But this is only true if every day the net change is -3. However, this would be the case until H_n -9 <=0. Wait, no. Wait, the H after filling is H -9. If H -9 <=0, then the dog doesn't dig, so the total reduction is H (to 0). So we need to find the day when H -9 <=0.

So let’s solve for when H -9 <=0:

H =84 -3(n-1) -9 <=0

Wait, this is getting complicated. Let me use another approach.

Let’s consider that each day before the final day, the net change is -3. On the final day, the net change is -9 (since the dog doesn’t add 6).

Let’s let x be the number of full days (with net -3), and 1 final day (net -9).

Total reduction: 3x +9 =84

So 3x =84 -9=75

x=75/3=25

So x=25 days of net -3, and 1 final day of -9.

Total days=25+1=26 days.

Yes, this matches the day-by-day count. Therefore, 26 days.

Convert to weeks:26/7=3.714... weeks. But the problem asks "how many weeks does it take him". If they want a whole number of weeks, rounded up, it's4 weeks. If they accept fractions, it's26/7≈3.714. But the problem might expect the answer in weeks as a mixed number, like3 and5/7 weeks. However, typically, such problems expect the answer in whole weeks, rounded up. Let me check the problem statement again.

The problem says "how many weeks does it take him to fill in all the holes". The answer should be in weeks, and since it's more than3 weeks but less than4, but you can't have a fraction of a week in practical terms when scheduling, so they might expect4 weeks. However, mathematically, it's26/7 weeks. Let me see if the problem expects an exact answer or a whole number.

Looking back at the problem statement, it's a math problem, so likely expects the exact value, which is26/7 weeks, which can be written as3 5/7 weeks. However, sometimes problems like this expect the answer in whole weeks, rounding up. Let's check both possibilities.

If the answer is26 days, which is3 weeks and5 days, then depending on the context:

- If partial weeks count as a full week, then4 weeks.

- If exact, then3 5/7 weeks.

But the problem doesn't specify, so as a math problem, the precise answer is26/7 weeks, which is3 and5/7 weeks. However, the question might want it in weeks as a whole number, rounded up. Let's see.

But let's check the problem again. The problem says "how many weeks does it take him to fill in all the holes?" The answer choices (if any) aren't given, but since the user hasn't provided options, we need to infer. In math problems like this, unless specified otherwise, the answer is expected in the most precise form, which would be26/7 weeks or3 5/7 weeks. However, sometimes problems expect whole numbers, so maybe the answer is4 weeks. But let's verify with the day count.

26 days is exactly3 weeks and5 days. If the question asks for the number of weeks, and if part of a week counts as a full week, then it's4. But if they want the exact number, it's3 5/7. However, the problem might have an integer answer. Wait, let's check the calculations again.

Wait, when I did the day-by-day count, I got26 days. Let me verify that again quickly.

Start with84.

Each day:

1. Subtract9.

2. If result >0, add6.

Repeat until H<=0.

Let’s model this with a formula.

Let’s denote that each day, except the last, the net change is -3. The last day is -9.

So total holes =84 =3x +9

So3x=75 →x=25 days of net -3, plus1 day of -9. Total26 days.

Yes, that's correct. So26 days.

26 days divided by7 days/week=3.714... weeks.

But the problem might want the answer as a mixed number, so3 and5/7 weeks. However, in some contexts, especially in school problems, they might expect the answer in whole weeks, rounded up, so4 weeks. But since the problem is from Nate's perspective, he would count the exact days, so converting26 days to weeks gives3 weeks and5 days. But the question asks for weeks, so unless specified, it's better to present it as26/7 weeks, which is3 5/7 weeks. However, let's check if there's a different approach that gives a whole number.

Alternatively, maybe I made a mistake in the day count. Let me recount the days more carefully.

Starting with84 holes.

Day1:

Fill9:84-9=75. Since75>0, dog digs6:75+6=81.

Day2:

81-9=72>0 →+6=78.

Day3:78-9=69>0 →+6=75.

Day4:75-9=66>0 →+6=72.

Day5:72-9=63>0 →+6=69.

Day6:69-9=60>0 →+6=66.

Day7:66-9=57>0 →+6=63.

Day8:63-9=54>0 →+6=60.

Day9:60-9=51>0 →+6=57.

Day10:57-9=48>0 →+6=54.

Day11:54-9=45>0 →+6=51.

Day12:51-9=42>0 →+6=48.

Day13:48-9=39>0 →+6=45.

Day14:45-9=36>0 →+6=42.

Day15:42-9=33>0 →+6=39.

Day16:39-9=30>0 →+6=36.

Day17:36-9=
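The day-by-day model that the trace settles on (Nate fills 9 holes each day, and the dog digs 6 more that night only if holes remain) is easy to check with a short simulation. The following is a minimal sketch in Python; the function and variable names are ours and are not part of the model's response, and rounding up to whole weeks is only one of the two readings the trace weighs.

```python
import math


def days_to_fill(initial_holes: int, filled_per_day: int, dug_per_night: int) -> int:
    """Count the days until no holes remain after Nate's daily filling."""
    if filled_per_day <= dug_per_night and initial_holes > filled_per_day:
        raise ValueError("Nate never catches up with the dog")
    holes = initial_holes
    days = 0
    while holes > 0:
        holes -= filled_per_day      # Nate fills holes during the day
        days += 1
        if holes > 0:                # the dog digs only if holes remain
            holes += dug_per_night
    return days


initial = 6 * 14                     # 6 holes a day for 14 days of vacation
days = days_to_fill(initial, filled_per_day=9, dug_per_night=6)
weeks_exact = days / 7
weeks_rounded_up = math.ceil(weeks_exact)

print(days, round(weeks_exact, 3), weeks_rounded_up)
```

The simulation confirms the 26-day count reached twice in the trace, which is 26/7 weeks exactly, or 4 weeks when partial weeks are rounded up.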

About the authors

Shreyas Subramanian is a Principal Data Scientist and helps customers by using generative AI and deep learning to solve their business challenges using AWS services. Shreyas has a background in large-scale optimization and ML and in the use of ML and reinforcement learning for accelerating optimization tasks.

Zhengyuan Shen is an Applied Scientist at Amazon Bedrock, specializing in foundational models and ML modeling for complex tasks including natural language and structured data understanding. He is passionate about leveraging innovative ML solutions to enhance products or services, thereby simplifying the lives of customers through a seamless blend of science and engineering. Outside work, he enjoys sports and cooking.

Xuan Qi is an Applied Scientist at Amazon Bedrock, where she applies her background in physics to tackle complex challenges in machine learning and artificial intelligence. Xuan is passionate about translating scientific concepts into practical applications that drive tangible improvements in technology. Her work focuses on creating more intuitive and efficient AI systems that can better understand and interact with the world. Outside of her professional pursuits, Xuan finds balance and creativity through her love for dancing and playing the violin, bringing the precision and harmony of these arts into her scientific endeavors.

Shuai Wang is a Senior Applied Scientist and Manager at Amazon Bedrock, specializing in natural language processing, machine learning, large language modeling, and other related AI areas.

Amazon Bedrock announces general availability of multi-agent collaboration

Today, we’re announcing the general availability (GA) of multi-agent collaboration on Amazon Bedrock. This capability allows developers to build, deploy, and manage networks of AI agents that work together to execute complex, multi-step workflows efficiently.

Since its preview launch at re:Invent 2024, organizations across industries—including financial services, healthcare, supply chain and logistics, manufacturing, and customer support—have used multi-agent collaboration to orchestrate specialized agents, driving efficiency, accuracy, and automation. With this GA release, we’ve introduced enhancements based on customer feedback, further improving scalability, observability, and flexibility—making AI-driven workflows easier to manage and optimize.

What is multi-agent collaboration?

Generative AI is no longer just about models generating responses; it’s about automation. The next wave of innovation is driven by agents that can reason, plan, and act autonomously across company systems. Generative AI applications now take action, solve problems, and execute complex workflows in addition to generating content. The shift is clear: businesses need AI that doesn’t just respond to prompts but orchestrates entire workflows, automating processes end to end.

Agents enable generative AI applications to perform tasks across company systems and data sources, and Amazon Bedrock already simplifies building them. With Amazon Bedrock, customers can quickly create agents that handle sales orders, compile financial reports, analyze customer retention, and much more. However, as applications become more capable, the tasks customers want them to perform can exceed what a single agent can manage, because the tasks require specialized expertise, involve multiple steps, or demand continuous execution over time.

Coordinating potentially hundreds of agents at scale is also challenging, because managing dependencies, ensuring efficient task distribution, and maintaining performance across a large network of specialized agents requires sophisticated orchestration. Without the right tools, businesses can face inefficiencies, increased latency, and difficulties in monitoring and optimizing performance. For customers looking to advance their agents and tackle more intricate, multi-step workflows, Amazon Bedrock supports multi-agent collaboration, enabling developers to easily build, deploy, and manage multiple specialized agents working together seamlessly.

Multi-agent collaboration enables developers to create networks of specialized agents that communicate and coordinate under the guidance of a supervisor agent. Each agent contributes its expertise to the larger workflow by focusing on a specific task. This approach breaks down complex processes into manageable sub-tasks processed in parallel. By facilitating seamless interaction among agents, Amazon Bedrock enhances operational efficiency and accuracy, ensuring workflows run more effectively at scale. Because each agent only accesses the data required for its role, this approach minimizes exposure of sensitive information while reinforcing security and governance. This allows businesses to scale their AI-driven workflows without the need for manual intervention in coordinating agents. As more agents are added, the supervisor ensures smooth collaboration between them all.

By using multi-agent collaboration on Amazon Bedrock, organizations can:

  • Streamline AI-driven workflows by distributing workloads across specialized agents.
  • Improve execution efficiency by parallelizing tasks where possible.
  • Enhance security and governance by restricting agent access to only necessary data.
  • Reduce operational complexity by eliminating manual intervention in agent coordination.

A key challenge in building effective multi-agent collaboration systems is managing the complexity and overhead of coordinating multiple specialized agents at scale. Amazon Bedrock simplifies the process of building, deploying, and orchestrating effective multi-agent collaboration systems while addressing efficiency challenges through several key features and optimizations:

  • Quick setup – Create, deploy, and manage AI agents working together in minutes without the need for complex coding.
  • Composability – Integrate your existing agents as subagents within a larger agent system, allowing them to seamlessly work together to tackle complex workflows.
  • Efficient inter-agent communication – The supervisor agent can interact with subagents using a consistent interface, supporting parallel communication for more efficient task completion.
  • Optimized collaboration modes – Choose between supervisor mode and supervisor with routing mode. With routing mode, the supervisor agent will route simple requests directly to specialized subagents, bypassing full orchestration. For complex queries or when no clear intention is detected, it automatically falls back to the full supervisor mode, where the supervisor agent analyzes, breaks down problems, and coordinates multiple subagents as needed.
  • Integrated trace and debug console – Visualize and analyze multi-agent interactions behind the scenes using the integrated trace and debug console.

What’s new in general availability?

The GA release introduces several key enhancements based on customer feedback, making multi-agent collaboration more scalable, flexible, and efficient:

  • Inline agent support – Enables the creation of supervisor agents dynamically at runtime, allowing for more flexible agent management without predefined structures.
  • AWS CloudFormation and AWS Cloud Development Kit (AWS CDK) support – Lets customers deploy agent networks as code, providing scalable, reusable agent templates across AWS accounts.
  • Enhanced traceability and debugging – Provides structured execution logs, sub-step tracking, and Amazon CloudWatch integration to improve monitoring and troubleshooting.
  • Increased collaborator and step count limits – Expands self-service limits for agent collaborators and execution steps, supporting larger-scale workflows.
  • Payload referencing – Reduces latency and costs by allowing the supervisor agent to reference external data sources without embedding them in the agent request.
  • Improved citation handling – Enhances accuracy and attribution when agents pull external data sources into their responses.

These features collectively improve coordination capabilities, communication speed, and overall effectiveness of the multi-agent collaboration framework in tackling complex, real-world problems.

Multi-agent collaboration across industries

Multi-agent collaboration is already transforming AI automation across sectors:

  • Investment advisory – A financial firm uses multiple agents to analyze market trends, risk factors, and investment opportunities to deliver personalized client recommendations.
  • Retail operations – A retailer deploys agents for demand forecasting, inventory tracking, pricing optimization, and order fulfillment to increase operational efficiency.
  • Fraud detection – A banking institution assigns agents to monitor transactions, detect anomalies, validate customer behaviors, and flag potential fraud risks in real time.
  • Customer support – An enterprise customer service platform uses agents for sentiment analysis, ticket classification, knowledge base retrieval, and automated responses to reduce resolution times.
  • Healthcare diagnosis – A hospital system integrates agents for patient record analysis, symptom recognition, medical imaging review, and treatment plan recommendations to assist clinicians.

Deep dive: Syngenta’s use of multi-agent collaboration

Syngenta, a global leader in agricultural innovation, has integrated cutting-edge generative AI into its Cropwise service, resulting in the development of Cropwise AI. This advanced system is designed to enhance the efficiency of agronomic advisors and growers by providing tailored recommendations for crop management practices.

Business challenge

The agricultural sector faces the complex task of optimizing crop yields while ensuring sustainability and profitability. Farmers and agronomic advisors must consider a multitude of factors, including weather patterns, soil conditions, crop growth stages, and potential pest and disease threats. In the past, analyzing these variables required extensive manual effort and expertise. Syngenta recognized the need for a more efficient, data-driven approach to support decision-making in crop management.

Solution: Cropwise AI

To address these challenges, Syngenta collaborated with AWS to develop Cropwise AI, using Amazon Bedrock Agents to create a multi-agent system that integrates various data sources and AI capabilities. This system offers several key features:

  • Advanced seed recommendation and placement – Uses predictive machine learning algorithms to deliver personalized seed recommendations tailored to each grower’s unique environment.
  • Sophisticated predictive modeling – Employs state-of-the-art machine learning algorithms to forecast crop growth patterns, yield potential, and potential risk factors by integrating real-time data with comprehensive historical information.
  • Precision agriculture optimization – Provides hyper-localized, site-specific recommendations for input application, minimizing waste and maximizing resource efficiency.

Agent architecture

Cropwise AI is built on AWS architecture and designed for scalability, maintainability, and security. The system uses Amazon Bedrock Agents to orchestrate multiple AI agents, each specializing in distinct tasks:

  • Data aggregation agent – Collects and integrates extensive datasets, including over 20 years of weather history, soil conditions, and more than 80,000 observations on crop growth stages.
  • Recommendation agent – Analyzes the aggregated data to provide tailored recommendations for precise input applications, product placement, and strategies for pest and disease control.
  • Conversational AI agent – Uses a multilingual conversational large language model (LLM) to interact with users in natural language, delivering insights in a clear format.

This multi-agent collaboration enables Cropwise AI to process complex agricultural data efficiently, offering actionable insights and personalized recommendations to enhance crop yields, sustainability, and profitability.

Results

By implementing Cropwise AI, Syngenta has achieved significant improvements in agricultural practices:

  • Enhanced decision-making: Agronomic advisors and growers receive data-driven recommendations, leading to optimized crop management strategies.
  • Increased yields: Utilizing Syngenta’s seed recommendation models, Cropwise AI helps growers increase yields by up to 5%.
  • Sustainable practices: The system promotes precision agriculture, reducing waste and minimizing environmental impact through optimized input applications.

Highlighting the significance of this advancement, Feroz Sheikh, Chief Information and Digital Officer at Syngenta Group, stated:

“Agricultural innovation leader Syngenta is using Amazon Bedrock Agents as part of its Cropwise AI solution, which gives growers deep insights to help them optimize crop yields, improve sustainability, and drive profitability. With multi-agent collaboration, Syngenta will be able to use multiple agents to further improve their recommendations to growers, transforming how their end-users make decisions and delivering even greater value to the farming community.” 

This collaboration between Syngenta and AWS exemplifies the transformative potential of generative AI and multi-agent systems in agriculture, driving innovation and supporting sustainable farming practices.

How multi-agent collaboration works

Amazon Bedrock automates agent collaboration, including task delegation, execution tracking, and data orchestration. Developers can configure their system in one of two collaboration modes:

  • Supervisor mode
    • The supervisor agent receives an input, breaks down complex requests, and assigns tasks to specialized sub-agents.
    • Sub-agents execute tasks in parallel or sequentially, returning responses to the supervisor, which consolidates the results.
  • Supervisor with routing mode
    • Simple queries are routed directly to a relevant sub-agent.
    • Complex or ambiguous requests trigger the supervisor to coordinate multiple agents to complete the task.
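The difference between the two modes can be sketched in a few lines of plain Python. This is only an illustration of the control flow described above, not the Amazon Bedrock API: the sub-agents, the keyword-based `matching_agents` matcher, and the function names are all hypothetical.

```python
from typing import Callable, Dict, List

# Hypothetical sub-agents: each one maps a request to a reply.
SubAgent = Callable[[str], str]

def weather_agent(request: str) -> str:
    return "weather: forecast summarized"

def soil_agent(request: str) -> str:
    return "soil: conditions summarized"

SUB_AGENTS: Dict[str, SubAgent] = {"weather": weather_agent, "soil": soil_agent}

def matching_agents(request: str) -> List[str]:
    """Toy intent matcher: an agent is relevant if its name appears in the request."""
    text = request.lower()
    return [name for name in SUB_AGENTS if name in text]

def supervisor(request: str) -> str:
    """Supervisor mode: delegate to the relevant sub-agents, then consolidate."""
    # Fall back to every sub-agent when the request is ambiguous.
    names = matching_agents(request) or list(SUB_AGENTS)
    replies = [SUB_AGENTS[name](request) for name in names]
    return " | ".join(replies)

def supervisor_with_routing(request: str) -> str:
    """Routing mode: a request that matches exactly one agent skips orchestration."""
    names = matching_agents(request)
    if len(names) == 1:
        return SUB_AGENTS[names[0]](request)
    return supervisor(request)
```

In this sketch, a request that mentions only the weather goes straight to `weather_agent`, while a request that mentions both topics (or neither) falls back to the supervisor, which consolidates the replies.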

Watch the Amazon Bedrock multi-agent collaboration video to learn how to get started.

Conclusion

By enabling seamless multi-agent collaboration, Amazon Bedrock empowers businesses to scale their generative AI applications with greater efficiency, accuracy, and flexibility. As organizations continue to push the boundaries of AI-driven automation, having the right tools to orchestrate complex workflows will be essential. With Amazon Bedrock, companies can confidently build AI systems that don’t just generate responses but drive real impact—automating processes, solving problems, and unlocking new possibilities across industries.

Amazon Bedrock multi-agent collaboration is now generally available.

Multi-agent collaboration opens new possibilities for AI-driven automation. Whether in finance, healthcare, retail, or agriculture, Amazon Bedrock helps organizations scale AI workflows with efficiency and precision.

Start building today—and let us know what you create!


About the authors

Sri Koneru has spent the last 13.5 years honing her skills in both cutting-edge product development and large-scale infrastructure. At Salesforce for 7.5 years, she had the incredible opportunity to build and launch brand new products from the ground up, reaching over 100,000 external customers. This experience was instrumental in her professional growth. Then, at Google for 6 years, she transitioned to managing critical infrastructure, overseeing capacity, efficiency, fungibility, job scheduling, data platforms, and spatial flexibility for all of Alphabet. Most recently, Sri joined Amazon Web Services, leveraging her diverse skillset to make a significant impact on AI/ML services and infrastructure at AWS. Personally, Sri and her husband recently became empty nesters, relocating to Seattle from the Bay Area. They’re a basketball-loving family who even catch pre-season Warriors games but are looking forward to cheering on the Seattle Storm this year. Beyond basketball, Sri enjoys cooking, recipe creation, reading, and her newfound hobby of hiking. While she’s a sun-seeker at heart, she is looking forward to experiencing the unique character of Seattle weather.


Accelerating insurance policy reviews with generative AI: Verisk’s Mozart companion


This post is co-authored with Sundeep Sardana, Malolan Raman, Joseph Lam, Maitri Shah and Vaibhav Singh from Verisk.

Verisk (Nasdaq: VRSK) is a leading strategic data analytics and technology partner to the global insurance industry, empowering clients to strengthen operating efficiency, improve underwriting and claims outcomes, combat fraud, and make informed decisions about global risks. Through advanced data analytics, software, scientific research, and deep industry knowledge, Verisk helps build global resilience across individuals, communities, and businesses. At the forefront of using generative AI in the insurance industry, Verisk’s generative AI-powered solutions, like Mozart, remain rooted in ethical and responsible AI use. Mozart, the leading platform for creating and updating insurance forms, enables customers to organize, author, and file forms seamlessly, while its companion uses generative AI to compare policy documents and provide summaries of changes in minutes, cutting the change adoption time from days or weeks to minutes.

The generative AI-powered Mozart companion uses sophisticated AI to compare legal policy documents and provides essential distinctions between them in a digestible and structured format. The new Mozart companion is built using Amazon Bedrock. Amazon Bedrock is a fully managed service that offers a choice of high-performing foundation models (FMs) from leading AI companies like AI21 Labs, Anthropic, Cohere, Meta, Mistral AI, Stability AI, and Amazon through a single API, along with a broad set of capabilities to build generative AI applications with security, privacy, and responsible AI. The Mozart application rapidly compares policy documents and presents comprehensive change details, such as descriptions, locations, excerpts, in a tracked change format.

The following screenshot shows an example of the output of the Mozart companion displaying the summary of changes between two legal documents, the excerpt from the original document version, the updated excerpt in the new document version, and the tracked changes represented with redlines.

Verisk Application Screen Shot

In this post, we describe the development journey of the generative AI companion for Mozart, the data, the architecture, and the evaluation of the pipeline.

Data: Policy forms

Mozart is designed to author policy forms like coverage and endorsements. These documents provide information about policy coverage and exclusions (as shown in the following screenshot) and help in determining the risk and premium associated with an insurance policy.

Document Example

Solution overview

The policy documents reside in Amazon Simple Storage Service (Amazon S3) storage. An AWS Batch job reads these documents, chunks them into smaller slices, then creates embeddings of the text chunks using the Amazon Titan Text Embeddings model through Amazon Bedrock and stores them in an Amazon OpenSearch Service vector database. Along with each document slice, we store the metadata associated with it using an internal Metadata API, which provides document characteristics like document type, jurisdiction, version number, and effective dates. This process has been implemented as a periodic job to keep the vector database updated with new documents. During the solution design process, Verisk also considered using Amazon Bedrock Knowledge Bases because it’s purpose built for creating and storing embeddings within Amazon OpenSearch Serverless. In the future, Verisk intends to use the Amazon Titan Embeddings V2 model.

The user can pick the two documents that they want to compare. This action invokes an AWS Lambda function to retrieve the document embeddings from the OpenSearch Service database and present them to Anthropic’s Claude 3 Sonnet FM, which is accessed through Amazon Bedrock. The results are stored in a JSON structure and provided using the API service to the UI for consumption by the end-user.

The following diagram illustrates the solution architecture.

Application Architecture

Security and governance

Generative AI is a very new technology and brings with it new challenges related to security and compliance. Verisk has a governance council that reviews generative AI solutions to make sure that they meet Verisk’s standards of security, compliance, and data use. Verisk also has a legal review for IP protection and compliance within their contracts. It’s important that Verisk makes sure the data that is shared with the FM is transmitted securely and the FM doesn’t retain any of their data or use it for its own training. The quality of the solution, speed, cost, and ease of use were the key factors that led Verisk to pick Amazon Bedrock and Anthropic’s Claude Sonnet within their generative AI solution.

Evaluation criteria

To assess the quality of the results produced by generative AI, Verisk evaluated based on the following criteria:

  • Accuracy
  • Consistency
  • Adherence to context
  • Speed and cost

To assess the generative AI results’ accuracy and consistency, Verisk designed human evaluation metrics with the help of in-house insurance domain experts. Verisk conducted multiple rounds of human evaluation of the generated results. During these tests, in-house domain experts would grade accuracy, consistency, and adherence to context on a manual grading scale of 1–10. The Verisk team measured how long it took to generate the results by tracking latency. Feedback from each round of tests was incorporated in subsequent tests.
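As a rough illustration of how one round of expert grades could be rolled up per criterion, consider the following sketch. The grades, the criterion keys, and the `summarize_round` helper are invented for illustration; the post does not describe Verisk’s actual tooling.

```python
from statistics import mean

# Hypothetical grades (1-10) from one evaluation round: one dict per reviewed summary.
round_grades = [
    {"accuracy": 8, "consistency": 7, "adherence": 9},
    {"accuracy": 6, "consistency": 8, "adherence": 7},
    {"accuracy": 9, "consistency": 9, "adherence": 8},
]

def summarize_round(grades):
    """Average each criterion across all graded summaries in the round."""
    criteria = grades[0].keys()
    return {c: round(mean(g[c] for g in grades), 2) for c in criteria}

summary = summarize_round(round_grades)
```

Comparing these per-criterion averages from round to round is one simple way to see whether a prompt or design change moved the results in the right direction.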

The initial results that Verisk got from the model were good but not close to the desired level of accuracy and consistency. The development process underwent iterative improvements that included redesign, making multiple calls to the FM, and testing various FMs. The primary metric used to evaluate the success of FM and non-FM solutions was a manual grading system where business experts would grade results and compare them. FM solutions are improving rapidly, but to achieve the desired accuracy, consistency, and efficiency, Verisk’s generative AI software solution needed more than FMs alone: it also relied on prompt engineering, retrieval augmented generation, and system design optimizations.

Prompt optimization

The change summary is different from showing differences in text between the two documents. The Mozart application needs to be able to describe the material changes and ignore the noise from non-meaningful changes. Verisk created prompts using the knowledge of their in-house domain experts to achieve these objectives. With each round of testing, Verisk added detailed instructions to the prompts to capture the pertinent information and reduce possible noise and hallucinations. The added instructions focused on reducing any issues identified by the business experts reviewing the end results. To get the best results, Verisk needed to adjust the prompts based on the FM used—there are differences in how each FM responds to prompts, and using the prompts specific to the given FM provides better results. Through this process, Verisk instructed the model on the role it is playing along with the definition of common terms and exclusions. In addition to optimizing prompts for the FMs, Verisk also explored techniques for effectively splitting and processing the document text itself.

Splitting document pages

Verisk tested multiple strategies for document splitting. For this use case, a recursive character text splitter with a chunk size of 500 characters with 15% overlap provided the best results. This splitter is part of the LangChain framework; it splits on a prioritized list of separators (paragraph breaks first, then line breaks, then spaces), which tends to keep semantically related text in the same chunk. Verisk also considered the NLTK splitter. With an effective approach for splitting the document text into processable chunks, Verisk then focused on enhancing the quality and relevance of the summarized output.
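To make the chunk size and overlap numbers concrete, the following sketch splits text into 500-character chunks that share 15% (75 characters) with their neighbor. It is a simplified fixed-width stand-in written for illustration, not the LangChain splitter that Verisk used, and the function name `split_with_overlap` is hypothetical.

```python
def split_with_overlap(text: str, chunk_size: int = 500, overlap_ratio: float = 0.15):
    """Split text into fixed-width chunks where consecutive chunks overlap."""
    overlap = round(chunk_size * overlap_ratio)  # 75 characters for the defaults
    step = chunk_size - overlap                  # each chunk starts 425 characters after the last
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this chunk already reached the end of the text
    return chunks
```

The overlap means that a sentence falling on a chunk boundary still appears whole in at least one chunk, at the cost of embedding some text twice.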

Quality of summary

The quality assessment starts with confirming that the correct documents are picked for comparison. Verisk enhanced the quality of the solution by using document metadata to narrow the search results by specifying which documents to include or exclude from a query, resulting in more relevant responses generated by the FM. For the generative AI description of change, Verisk wanted to capture the essence of the change instead of merely highlighting the differences. The results were reviewed by their in-house policy authoring experts, and their feedback was used to determine the prompts, document splitting strategy, and FM. The prompt engineering techniques applied included few-shot prompting, chain of thought prompting, and the needle in a haystack approach. With techniques in place to enhance output quality and relevance, Verisk also prioritized optimizing the performance and cost-efficiency of their generative AI solution.
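The metadata-based narrowing described above can be sketched as a simple include/exclude filter over stored chunks. The `Chunk` class, the `filter_chunks` helper, and the metadata keys shown are assumptions made for illustration; in the actual solution the filtering would run against the vector database rather than in application code.

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    text: str
    metadata: dict  # for example: document type, jurisdiction, version number

def filter_chunks(chunks, include=None, exclude=None):
    """Keep chunks that match every include pair and no exclude pair."""
    include = include or {}
    exclude = exclude or {}
    selected = []
    for chunk in chunks:
        meta = chunk.metadata
        if any(meta.get(key) != value for key, value in include.items()):
            continue  # misses a required attribute value
        if any(meta.get(key) == value for key, value in exclude.items()):
            continue  # carries an excluded attribute value
        selected.append(chunk)
    return selected
```

For example, `filter_chunks(chunks, include={"jurisdiction": "NJ"}, exclude={"version": 1})` would keep only New Jersey chunks that are not from version 1, so the FM never sees text from the wrong document.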

Price-performance

To achieve lower cost, Verisk regularly evaluated various FM options and changed them as new options with lower cost and better performance were released. During the development process, Verisk redesigned the solution to reduce the number of calls to the FM and wherever possible used non-FM based options.

As mentioned earlier, the overall solution consists of a few different components:

  • Location of the change
  • Excerpts of the changes
  • Change summary
  • Changes shown in the tracked change format

Verisk reduced the FM load and improved accuracy by identifying the sections that contained differences and then passing these sections to the FM to generate the change summary. For constructing the tracked difference format, containing redlines, Verisk used a non-FM based solution. In addition to optimizing performance and cost, Verisk also focused on developing a modular, reusable architecture for their generative AI solution.
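Both ideas in this paragraph, finding the changed sections before calling the FM and building redlines without an FM, can be approximated with Python’s standard `difflib` module. This is a minimal sketch under assumed inputs (documents already split into named sections), not Verisk’s implementation; the function names and the `[-…-]` / `{+…+}` markers are illustrative choices.

```python
import difflib

def changed_sections(old_sections: dict, new_sections: dict) -> list:
    """Return the names of sections whose text differs between the two versions."""
    names = sorted(set(old_sections) | set(new_sections))
    return [n for n in names if old_sections.get(n) != new_sections.get(n)]

def redline(old_text: str, new_text: str) -> str:
    """Word-level tracked changes: [-deleted-] and {+inserted+} markers."""
    old_words, new_words = old_text.split(), new_text.split()
    matcher = difflib.SequenceMatcher(a=old_words, b=new_words, autojunk=False)
    out = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            out.extend(old_words[i1:i2])
            continue
        if tag in ("delete", "replace"):
            out.append("[-" + " ".join(old_words[i1:i2]) + "-]")
        if tag in ("insert", "replace"):
            out.append("{+" + " ".join(new_words[j1:j2]) + "+}")
    return " ".join(out)
```

Only the sections returned by `changed_sections` would need to be sent to the FM for a change summary, while `redline` produces the tracked-change view deterministically and at no model cost.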

Reusability

Good software development practices apply to the development of generative AI solutions too. You can create a decoupled architecture with reusable components. The Mozart generative AI companion is provided as an API, which decouples it from the frontend development and allows for reusability of this capability. Similarly, the API consists of many reusable components like common prompts, common definitions, retrieval service, embedding creation, and persistence service. Through their modular, reusable design approach and iterative optimization process, Verisk was able to achieve highly satisfactory results with their generative AI solution.

Results

Based on Verisk’s evaluation template questions and rounds of testing, Verisk concluded that over 90% of the generated summaries were good or acceptable. Testing was done by providing results of the solution to business experts, and having these experts grade the results using a grading scale.

Business impact

Verisk’s customers regularly spend significant time reviewing changes to the policy forms. The generative AI-powered Mozart companion can simplify the review process by ingesting these complex and unstructured policy documents and providing a summary of changes in minutes. This enables Verisk’s customers to cut the change adoption time from days to minutes. The improved adoption speed not only increases productivity, but also enables timely implementation of changes.

Conclusion

Verisk’s generative AI-powered Mozart companion uses advanced natural language processing and prompt engineering techniques to provide rapid and accurate summaries of changes between insurance policy documents. By harnessing the power of large language models like Anthropic’s Claude 3 Sonnet while incorporating domain expertise, Verisk has developed a solution that significantly accelerates the policy review process for their customers, reducing change adoption time from days or weeks to just minutes. This innovative application of generative AI delivers tangible productivity gains and operational efficiencies to the insurance industry. With a strong governance framework promoting responsible AI use, Verisk is at the forefront of unlocking generative AI’s potential to transform workflows and drive resilience across the global risk landscape.



About the Authors

Sundeep Sardana is the Vice President of Software Engineering at Verisk Analytics, based in New Jersey. He leads the Reimagine program for the company’s Rating business, driving modernization across core services such as forms, rules, and loss costs. A dynamic change-maker and technologist, Sundeep specializes in building high-performing teams, fostering a culture of innovation, and leveraging emerging technologies to deliver scalable, enterprise-grade solutions. His expertise spans cloud computing, Generative AI, software architecture, and agile development, ensuring organizations stay ahead in an evolving digital landscape. Connect with him on LinkedIn.

Malolan Raman is a Principal Engineer at Verisk, based out of New Jersey, specializing in the development of Generative AI (GenAI) applications. With extensive experience in cloud computing and artificial intelligence, he has been at the forefront of integrating cutting-edge AI technologies into scalable, secure, and efficient cloud solutions.

Joseph Lam is the senior director of commercial multi-lines that include general liability, umbrella/excess, commercial property, businessowners, capital assets, crime and inland marine. He leads a team responsible for research, development, and support of commercial casualty products, which mostly consist of forms and rules. The team is also tasked with supporting new and innovative solutions for the emerging marketplace.

Maitri Shah is a Software Development Engineer at Verisk with over two years of experience specializing in developing innovative solutions in Generative AI (GenAI) on Amazon Web Services (AWS). With a strong foundation in machine learning, cloud computing, and software engineering, Maitri has successfully implemented scalable AI models that drive business value and enhance user experiences.

Vaibhav Singh is a Product Innovation Analyst at Verisk, based out of New Jersey. With a background in data science, engineering, and management, he works as a pivotal liaison between technology and business, enabling both sides to build transformative products and solutions that tackle some of the most significant current challenges in the insurance domain. He is driven by his passion for leveraging data and technology to build innovative products that not only address the current obstacles but also pave the way for future advancements in that domain.

Ryan Doty is a Solutions Architect Manager at AWS, based out of New York. He helps financial services customers accelerate their adoption of the AWS Cloud by providing architectural guidelines to design innovative and scalable solutions. Coming from a software development and sales engineering background, the possibilities that the cloud can bring to the world excite him.

Tarik Makota is a Sr. Principal Solutions Architect with Amazon Web Services. He provides technical guidance, design advice, and thought leadership to AWS’ customers across the US Northeast. He holds an M.S. in Software Development and Management from Rochester Institute of Technology.

Alex Oppenheim is a Senior Sales Leader at Amazon Web Services, supporting consulting and services customers. With extensive experience in the cloud and technology industry, Alex is passionate about helping enterprises unlock the power of AWS to drive innovation and digital transformation.


Announcing general availability of Amazon Bedrock Knowledge Bases GraphRAG with Amazon Neptune Analytics


Today, Amazon Web Services (AWS) announced the general availability of Amazon Bedrock Knowledge Bases GraphRAG (GraphRAG), a capability in Amazon Bedrock Knowledge Bases that enhances Retrieval-Augmented Generation (RAG) with graph data in Amazon Neptune Analytics. This capability enhances responses from generative AI applications by automatically creating embeddings for semantic search and generating a graph of the entities and relationships extracted from ingested documents. The graph, stored in Amazon Neptune Analytics, provides enriched context during the retrieval phase to deliver more comprehensive, relevant, and explainable responses tailored to customer needs. Developers can enable GraphRAG with just a few clicks on the Amazon Bedrock console to boost the accuracy of generative AI applications without any graph modeling expertise.

In this post, we discuss the benefits of GraphRAG and how to get started with it in Amazon Bedrock Knowledge Bases.

Enhance RAG with graphs for more comprehensive and explainable GenAI applications

Generative AI is transforming how humans interact with technology by enabling natural conversations that provide helpful, nuanced, and insightful responses. However, a key challenge facing current generative AI systems is providing responses that are comprehensive, relevant, and explainable when the underlying data is stored across multiple documents. Without effectively mapping shared context across input data sources, responses risk being incomplete and inaccurate.

To address this, AWS announced a public preview of GraphRAG at re:Invent 2024, and is now announcing its general availability. This new capability integrates the power of graph data modeling with advanced natural language processing (NLP). GraphRAG automatically creates graphs which capture connections between related entities and sections across documents. More specifically, the graph created will connect chunks to documents, and entities to chunks.

During response generation, GraphRAG first does semantic search to find the top k most relevant chunks, and then traverses the surrounding neighborhood of those chunks to retrieve the most relevant content. By linking this contextual information, the generative AI system can provide responses that are more complete, precise, and grounded in source data. Whether answering complex questions across topics or summarizing key details from lengthy reports, GraphRAG delivers the comprehensive and explainable responses needed to enable more helpful, reliable AI conversations.
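The two-step retrieval described above (semantic search for the top k chunks, then a walk through the surrounding graph) can be illustrated with a toy example. The vectors, entity sets, and the one-hop chunk-to-entity-to-chunk traversal below are assumptions chosen for clarity; the post does not specify how the service traverses or ranks the neighborhood.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def graph_retrieve(query_vec, chunk_vecs, chunk_entities, k=1):
    """Return the top-k chunks by similarity plus chunks that share an entity with them."""
    # Step 1: semantic search over chunk embeddings.
    ranked = sorted(chunk_vecs, key=lambda c: cosine(query_vec, chunk_vecs[c]), reverse=True)
    seeds = ranked[:k]
    # Step 2: one-hop traversal, chunk -> entity -> chunk.
    seed_entities = set().union(*(chunk_entities[c] for c in seeds))
    neighbors = [c for c in ranked if c not in seeds and chunk_entities[c] & seed_entities]
    return seeds + neighbors

# Hypothetical two-dimensional embeddings and extracted entities.
chunk_vecs = {"c1": [1.0, 0.0], "c2": [0.0, 1.0], "c3": [0.7, 0.7]}
chunk_entities = {"c1": {"ILA strike"}, "c2": {"ILA strike", "trade"}, "c3": {"capex"}}
result = graph_retrieve([1.0, 0.0], chunk_vecs, chunk_entities, k=1)
```

Here `c2` has the lowest similarity to the query and would be missed by plain top-1 semantic search, yet it is retrieved because it shares the "ILA strike" entity with the seed chunk `c1`.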

GraphRAG boosts relevance and accuracy when relevant information is dispersed across multiple sources or documents, which can be seen in the following three use cases.

Streamlining market research to accelerate business decisions

A leading global financial institution sought to enhance insight extraction from its proprietary research. With a vast repository of economic and market research reports, the institution wanted to explore how GraphRAG could improve information retrieval and reasoning for complex financial queries. To evaluate this, they added their proprietary research papers, focusing on critical market trends and economic forecasts.

To evaluate the effectiveness of GraphRAG, the institution partnered with AWS to build a proof-of-concept using Amazon Bedrock Knowledge Bases and Amazon Neptune Analytics. The goal was to determine if GraphRAG could more effectively surface insights compared to traditional retrieval methods. GraphRAG structures knowledge into interconnected entities and relationships, enabling multi-hop reasoning across documents. This capability is crucial for answering intricate questions such as “What are some headwinds and tailwinds to capex growth in the next few years?” or “What is the impact of the ILA strike on international trade?”. Rather than relying solely on keyword matching, GraphRAG allows the model to trace relationships between economic indicators, policy changes, and industry impacts, ensuring responses are contextually rich and data-driven.

When comparing the quality of responses from GraphRAG and other retrieval methods, notable differences emerged in their comprehensiveness, clarity, and relevance. While other retrieval methods delivered straightforward responses, they often lacked deeper insights and broader context. GraphRAG instead provided more nuanced answers by incorporating related factors and offering additional relevant information, which made the responses more comprehensive than the other retrieval methods.

Improving data-driven decision-making in automotive manufacturing

An international auto company manages a large dataset, supporting thousands of use cases across engineering, manufacturing, and customer service. With thousands of users querying different datasets daily, making sure insights are accurate and connected across sources has been a persistent challenge.

To address this, the company worked with AWS to prototype a graph that maps relationships between key data points, such as vehicle performance, supply chain logistics, and customer feedback. This structure allows for more precise results across datasets, rather than relying on disconnected query results.

With Amazon Bedrock Knowledge Bases GraphRAG with Amazon Neptune Analytics automatically constructing a graph from ingested documents, the company can surface relevant insights more efficiently in their RAG applications. This approach helps teams identify patterns in manufacturing quality, predict maintenance needs, and improve supply chain resilience, making data analysis more effective and scalable across the organization.

Enhancing cybersecurity incident analysis

A cybersecurity company is using GraphRAG to improve how its AI-powered assistant analyzes security incidents. Traditional detection methods rely on isolated alerts, often missing the broader context of an attack.

By using a graph, the company connects disparate security signals, such as login anomalies, malware signatures, and network traffic patterns, into a structured representation of threat activity. This allows for faster root cause analysis and more comprehensive security reporting.

Amazon Bedrock Knowledge Bases and Neptune Analytics enable this system to scale while maintaining strict security controls, providing resource isolation. With this approach, the company’s security teams can quickly interpret threats, prioritize responses, and reduce false positives, leading to more efficient incident handling.

Solution overview

In this post, we provide a walkthrough to build Amazon Bedrock Knowledge Bases GraphRAG with Amazon Neptune Analytics, using files in an Amazon Simple Storage Service (Amazon S3) bucket. Running this example will incur costs in Amazon Neptune Analytics, Amazon S3, and Amazon Bedrock. Amazon Neptune Analytics costs for this example will be approximately $0.48 per hour. Amazon S3 costs will vary depending on how large your dataset is; for details, see the Amazon S3 pricing page. Amazon Bedrock costs will vary depending on the embeddings model and chunking strategy you select; for details, see the Amazon Bedrock pricing page.

Prerequisites

To follow along with this post, you need an AWS account with the necessary permissions to access Amazon Bedrock, and an Amazon S3 bucket containing data to serve as your knowledge base. Also make sure that you have enabled model access to Claude 3 Haiku (anthropic.claude-3-haiku-20240307-v1:0) and to the model that you want to use as your embeddings model. For more details on how to enable model access, refer to the Amazon Bedrock documentation.

Build Amazon Bedrock Knowledge Bases GraphRAG with Amazon Neptune Analytics

To get started, complete the following steps:

  1. On the Amazon Bedrock console, choose Knowledge Bases under Builder tools in the navigation pane.
  2. In the Knowledge Bases section, choose Create, and then choose Knowledge Base with vector store.
  3. For Knowledge Base details, enter a name and an optional description.
  4. For IAM permissions, select Create and use a new service role to create a new AWS Identity and Access Management (IAM) role.
  5. For Data source details, select Amazon S3 as your data source.
  6. Choose Next.
  7. For S3 URI, choose Browse S3 and choose the appropriate S3 bucket.
  8. For Parsing strategy, select Amazon Bedrock default parser.
  9. For Chunking strategy, choose Default chunking (recommended for GraphRAG) or any other strategy as you wish.
  10. Choose Next.
  11. For Embeddings model, choose an embeddings model, such as Amazon Titan Text Embeddings v2.
  12. For Vector database, select Quick create a new vector store and then select Amazon Neptune Analytics (GraphRAG).
  13. Choose Next.
  14. Review the configuration details and choose Create Knowledge Base.

Sync the data source

  1. Once the knowledge base is created, click Sync under the Data source section. The data sync can take a few minutes to a few hours, depending on how many source documents you have and how big each one is.

Test the knowledge base

Once the data sync is complete:

  1. Choose the expansion icon to expand the full view of the testing area.
  2. Configure your knowledge base by adding filters or guardrails.
  3. We encourage you to enable reranking to take full advantage of the capabilities of GraphRAG. Reranking allows GraphRAG to refine and optimize search results. For information about pricing for reranking models, see Amazon Bedrock Pricing.
  4. You can also supply a custom metadata file (up to 10 KB each) for each document in the knowledge base. You can apply filters to your retrievals, instructing the vector store to pre-filter based on document metadata and then search for relevant documents. This way, you have control over the retrieved documents, especially if your queries are ambiguous. Note that the list type is not supported.
  5. Use the chat area in the right pane to ask questions about the documents from your Amazon S3 bucket.

The responses use GraphRAG and include references to the chunks and documents they draw on.

Now that you’ve enabled GraphRAG, test it out by querying your generative AI application and observe how the responses have improved compared to baseline RAG approaches. You can monitor the Amazon CloudWatch logs for performance metrics on indexing, query latency, and accuracy.
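For the custom metadata files mentioned in the testing steps, Amazon Bedrock Knowledge Bases expects a sidecar JSON file stored next to each source document and named after it with a `.metadata.json` suffix. The following sketch builds such a file and enforces the two constraints stated in this post (the 10 KB size limit and no list values). The helper name, the example attributes, and the reading of 10 KB as 10 × 1024 bytes are assumptions.

```python
import json

MAX_METADATA_BYTES = 10 * 1024  # assumes "10 KB" means 10 * 1024 bytes

def build_metadata_sidecar(document_key: str, attributes: dict):
    """Return (sidecar_key, json_body) for one source document."""
    for key, value in attributes.items():
        if isinstance(value, list):
            # The post notes that the list type is not supported.
            raise ValueError(f"list values are not supported: {key!r}")
    body = json.dumps({"metadataAttributes": attributes}, indent=2)
    if len(body.encode("utf-8")) > MAX_METADATA_BYTES:
        raise ValueError("metadata file exceeds the 10 KB limit")
    return f"{document_key}.metadata.json", body

# Hypothetical document key and attributes.
sidecar_key, sidecar_body = build_metadata_sidecar(
    "reports/q3-outlook.pdf", {"region": "EMEA", "year": 2024}
)
```

Uploading `sidecar_body` to the S3 key in `sidecar_key` and then syncing the data source again makes the attributes available for filtering.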

Clean up

When you’re done exploring the solution, make sure to clean up by deleting any resources you created. Resources to clean up include the Amazon Bedrock knowledge base, the associated AWS IAM role that the Amazon Bedrock knowledge base uses, and the Amazon S3 bucket that was used for the source documents.

You will also need to separately delete the Amazon Neptune Analytics graph that Amazon Bedrock Knowledge Bases created on your behalf.

Conclusion

In this post, we discussed how to get started with Amazon Bedrock Knowledge Bases GraphRAG with Amazon Neptune Analytics. For further experimentation, check out the Amazon Bedrock Knowledge Bases Retrieval APIs to use the power of GraphRAG in your own applications. Refer to our documentation for code samples and best practices.
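As a starting point for calling the retrieval API from your own code, the following sketch builds a request for the `retrieve` operation of the `bedrock-agent-runtime` client in boto3 and extracts the text of each returned chunk. The knowledge base ID, the region, the metadata key, and the helper names are placeholders, and the sketch assumes that boto3 is installed and that AWS credentials are configured when `retrieve` is called.

```python
def build_retrieve_request(knowledge_base_id, question, top_k=5, metadata_filter=None):
    """Build keyword arguments for the Knowledge Bases retrieve call."""
    vector_config = {"numberOfResults": top_k}
    if metadata_filter is not None:
        vector_config["filter"] = metadata_filter
    return {
        "knowledgeBaseId": knowledge_base_id,
        "retrievalQuery": {"text": question},
        "retrievalConfiguration": {"vectorSearchConfiguration": vector_config},
    }

def retrieve(request, region_name="us-east-1"):
    """Send the request and return the text of each retrieved chunk."""
    import boto3  # imported lazily so that building a request needs no AWS SDK

    client = boto3.client("bedrock-agent-runtime", region_name=region_name)
    response = client.retrieve(**request)
    return [item["content"]["text"] for item in response["retrievalResults"]]

# Placeholder knowledge base ID and a hypothetical metadata attribute.
request = build_retrieve_request(
    "KB_ID_PLACEHOLDER",
    "What is the impact of the ILA strike on international trade?",
    top_k=3,
    metadata_filter={"equals": {"key": "region", "value": "EMEA"}},
)
```

Keeping request construction separate from the network call makes the filter and result count easy to inspect or unit test before any AWS resources are involved.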


About the authors

Denise Gosnell is a Principal Product Manager for Amazon Neptune, focusing on generative AI infrastructure and graph data applications that enable scalable, cutting-edge solutions across industry verticals.

Melissa Kwok is a Senior Neptune Specialist Solutions Architect at AWS, where she helps customers of all sizes and verticals build cloud solutions according to best practices. When she’s not at her desk you can find her in the kitchen experimenting with new recipes or reading a cookbook.

Ozan Eken is a Product Manager at AWS, passionate about building cutting-edge Generative AI and Graph Analytics products. With a focus on simplifying complex data challenges, Ozan helps customers unlock deeper insights and accelerate innovation. Outside of work, he enjoys trying new foods, exploring different countries, and watching soccer.

Harsh Singh is a Principal Product Manager Technical at AWS AI. Harsh enjoys building products that bring AI to software developers and everyday users to improve their productivity.

Mani Khanuja is a Tech Lead – Generative AI Specialists, author of the book Applied Machine Learning and High-Performance Computing on AWS, and a member of the Board of Directors for Women in Manufacturing Education Foundation Board. She leads machine learning projects in various domains such as computer vision, natural language processing, and generative AI. She speaks at internal and external conferences such as AWS re:Invent, Women in Manufacturing West, YouTube webinars, and GHC 23. In her free time, she likes to go for long runs along the beach.


Build a Multi-Agent System with LangGraph and Mistral on AWS


Agents are revolutionizing the landscape of generative AI, serving as the bridge between large language models (LLMs) and real-world applications. These intelligent, autonomous systems are poised to become the cornerstone of AI adoption across industries, heralding a new era of human-AI collaboration and problem-solving. By using the power of LLMs and combining them with specialized tools and APIs, agents can tackle complex, multistep tasks that were previously beyond the reach of traditional AI systems. The Multi-Agent City Information System demonstrated in this post exemplifies the potential of agent-based architectures to create sophisticated, adaptable, and highly capable AI applications.

As we look to the future, agents will have a very important role to play in:

  1. Improving decision-making with deeper, context-aware information
  2. Automating complex workflows across various domains, from customer service to scientific research
  3. Enabling more natural and intuitive human-AI interactions
  4. Generating new ideas by bringing together diverse data sources and specialized knowledge
  5. Addressing ethical concerns by providing more transparent and explainable AI systems

Building and deploying multi-agent systems like the one in this post is a step toward unlocking the full potential of generative AI. As these systems evolve, they will transform industries, expand possibilities, and open new doors for artificial intelligence.

Solution overview

In this post, we explore how to use LangGraph and Mistral models on Amazon Bedrock to create a powerful multi-agent system that can handle sophisticated workflows through collaborative problem-solving. This integration enables the creation of AI agents that can work together to solve complex problems, mimicking humanlike reasoning and collaboration.

The result is a system that delivers comprehensive details about events, weather, activities, and recommendations for a specified city, illustrating how stateful, multi-agent applications can be built and deployed on Amazon Web Services (AWS) to address real-world challenges.

LangGraph is essential to our solution because it offers a well-organized way to define and manage the flow of information between agents. Its built-in support for state management and checkpointing keeps processes continuous from one step to the next. The framework also makes agentic workflows straightforward to visualize, and it integrates easily with LLMs and Amazon Bedrock. Additionally, its support for conditional routing lets workflows adjust dynamically based on intermediate results, which gives flexibility in handling different scenarios.

The multi-agent architecture we present offers several key benefits:

  • Modularity – Each agent focuses on a specific task, making the system easier to maintain and extend
  • Flexibility – Agents can be quickly added, removed, or modified without affecting the entire system
  • Complex workflow handling – The system can manage advanced and complex workflows by distributing tasks among multiple agents
  • Specialization – Each agent is optimized for its specific task, improving latency, accuracy, and overall system efficiency
  • Security – The system enhances security by making sure that each agent only has access to the tools necessary for its task, reducing the potential for unauthorized access to sensitive data or other agents’ tasks

How our multi-agent system works

In this section, we explore how our Multi-Agent City Information System works, based on the multi-agent LangGraph Mistral Jupyter notebook available in the Mistral on AWS examples for Bedrock & SageMaker repository on GitHub.

This agentic workflow takes a city name as input and provides detailed information, demonstrating adaptability in handling different scenarios:

  1. Events – It searches a local database and online sources for upcoming events in the city. Whenever local database information is unavailable, it triggers an online search using the Tavily API. This makes sure that users receive up-to-date event information, regardless of whether it’s stored locally or needs to be retrieved from the web
  2. Weather – The system fetches current weather data using the OpenWeatherMap API, providing accurate and timely weather information for the queried location. Based on the weather, the system also offers outfit and activity recommendations tailored to the conditions, providing relevant suggestions for each city
  3. Restaurants – Recommendations are provided through a Retrieval Augmented Generation (RAG) system. This method combines prestored information with real-time generation to offer relevant and up-to-date dining suggestions

The system adapts to whatever information is available, so users receive the most comprehensive and up-to-date results possible even when data availability differs from city to city. For instance:

  • Some cities might require the use of the search tool for event information when local database data is unavailable
  • Other cities might have data available in the local database, providing quick access to event information without needing an online search
  • In cases where restaurant recommendations are unavailable for a particular city, the system can still provide valuable insights based on the available event and weather data

The following diagram is the solution’s reference architecture:

Data sources

The Multi-Agent City Information System can take advantage of two sources of data.

Local events database

This SQLite database is populated with city events data from a JSON file, providing quick access to local event information that ranges from community happenings to cultural events and citywide activities. This database is used by the events_database_tool() for efficient querying and retrieval of city event details, including location, date, and event type.
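To make the loading step concrete, here is a minimal sketch of how such a database could be populated from JSON. The table and column names mirror the query used by events_database_tool(), but the sample records, the `load_events` helper, and the in-memory database are assumptions for illustration; the notebook's actual loader and JSON layout may differ.

```python
import json
import sqlite3

# Hypothetical sample of the JSON events file; the real file's fields may differ.
events_json = """
[
  {"city": "Philadelphia", "event_name": "Jazz Night",
   "event_date": "2025-03-02", "description": "Live jazz downtown"},
  {"city": "Philadelphia", "event_name": "Food Fair",
   "event_date": "2025-03-01", "description": "Local vendors"}
]
"""

def load_events(conn, events):
    """Create the local_events table and insert the given event records."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS local_events (
               city TEXT, event_name TEXT, event_date TEXT, description TEXT
           )"""
    )
    # Named placeholders are filled from the keys of each event dict
    conn.executemany(
        "INSERT INTO local_events "
        "VALUES (:city, :event_name, :event_date, :description)",
        events,
    )
    conn.commit()

conn = sqlite3.connect(":memory:")  # in-memory database keeps the sketch self-contained
load_events(conn, json.loads(events_json))
rows = conn.execute(
    "SELECT event_name FROM local_events WHERE city = ? ORDER BY event_date",
    ("Philadelphia",),
).fetchall()
print(rows)
```

Storing dates in ISO format (YYYY-MM-DD) lets a plain ORDER BY on the text column sort events chronologically.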

Restaurant RAG system

For restaurant recommendations, the generate_restaurants_dataset() function generates synthetic data, creating a custom dataset specifically tailored to our recommendation system. The create_restaurant_vector_store() function processes this data, generates embeddings using Amazon Titan Text Embeddings, and builds a vector store with Facebook AI Similarity Search (FAISS). Although this approach is suitable for prototyping, for a more scalable and enterprise-grade solution, we recommend using Amazon Bedrock Knowledge Bases.
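The idea behind the vector store can be illustrated without any external library. The sketch below swaps FAISS for a brute-force cosine similarity search and swaps the Titan embeddings for hand-written three-dimensional vectors; the restaurant names, the vectors, and the `top_k` helper are all invented for illustration and are not part of the notebook.

```python
import math

# Toy 3-dimensional "embeddings"; real ones would come from an embedding model.
restaurants = {
    "Bayside Grill (seafood, Tampa)": [0.9, 0.1, 0.0],
    "Pasta Corner (Italian, Philadelphia)": [0.1, 0.9, 0.1],
    "Green Bowl (vegan, Tampa)": [0.2, 0.1, 0.9],
}

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def top_k(query_vec, store, k=2):
    """Return the names of the k entries most similar to the query vector."""
    scored = sorted(
        store.items(), key=lambda kv: cosine(query_vec, kv[1]), reverse=True
    )
    return [name for name, _ in scored[:k]]

query = [0.8, 0.2, 0.1]  # hypothetical embedding of "seafood near the bay"
best_matches = top_k(query, restaurants)
print(best_matches)
```

A library such as FAISS performs the same nearest-neighbor lookup, but with index structures that stay fast when the store holds many more vectors.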

Building the multi-agent architecture

At the heart of our Multi-Agent City Information System lies a set of specialized functions and tools designed to gather, process, and synthesize information from various sources. They form the backbone of our system, enabling it to provide comprehensive and up-to-date information about cities. In this section, we explore the key components that drive our system: the generate_text() function, which uses a Mistral model, and the specialized data retrieval functions for local database queries, online searches, weather information, and restaurant recommendations. Together, these functions and tools create a robust and versatile system capable of delivering valuable insights to users.

Text generation function

This function serves as the core of our agents, allowing them to generate text using the Mistral model as needed. It uses the Amazon Bedrock Converse API, which supports text generation, streaming, and external function calling (tools).

The function works as follows:

  1. Sends a user message to the Mistral model using the Amazon Bedrock Converse API
  2. Invokes the appropriate tool and incorporates the results into the conversation
  3. Continues the conversation until a final response is generated

Here’s the implementation:

def generate_text(bedrock_client, model_id, tool_config, input_text):
    ......
    
    while True:
        response = bedrock_client.converse(**kwargs)
        output_message = response['output']['message']
        messages.append(output_message) # Add assistant's response to messages
        
        stop_reason = response.get('stopReason')

        if stop_reason == 'tool_use' and tool_config:
            tool_use = output_message['content'][0]['toolUse']
            tool_use_id = tool_use['toolUseId']
            tool_name = tool_use['name']
            tool_input = tool_use['input']

            try:
                if tool_name == 'get_upcoming_events':
                    tool_result = events_database_tool(tool_input['city'])
                    json_result = json.dumps({"events": tool_result})
                elif tool_name == 'get_city_weather':
                    tool_result = weather_tool(tool_input['city'])
                    json_result = json.dumps({"weather": tool_result})
                elif tool_name == 'search_and_summarize_events':
                    tool_result = search_tool(tool_input['city'])
                    json_result = json.dumps({"events": tool_result})
                else:
                    raise ValueError(f"Unknown tool: {tool_name}")
                
                tool_response = {
                    "toolUseId": tool_use_id,
                    "content": [{"json": json.loads(json_result)}]
                }
                
            ......
            
            messages.append({
                "role": "user",
                "content": [{"toolResult": tool_response}]
            })
            
            # Update kwargs with new messages
            kwargs["messages"] = messages
        else:
            break

    return output_message, tool_result
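The if/elif chain that maps tool names to functions could also be written as a lookup table, which keeps the loop unchanged when tools are added. The following is only a sketch of that alternative, not the notebook's code: `TOOL_REGISTRY`, `run_tool`, and the stub tools are hypothetical names, and the stubs stand in for the real database, weather, and search tools.

```python
import json

# Stub tools standing in for the real ones (assumption: each takes a city name).
def events_stub(city):
    return f"No upcoming events found for {city}."

def weather_stub(city):
    return f"Sunny in {city}"

# Map each tool name from the tool config to (callable, key used in the JSON result).
TOOL_REGISTRY = {
    "get_upcoming_events": (events_stub, "events"),
    "get_city_weather": (weather_stub, "weather"),
}

def run_tool(tool_use):
    """Turn a toolUse block from the model into a toolResult block."""
    name = tool_use["name"]
    if name not in TOOL_REGISTRY:
        raise ValueError(f"Unknown tool: {name}")
    func, result_key = TOOL_REGISTRY[name]
    result = func(tool_use["input"]["city"])
    return {
        "toolUseId": tool_use["toolUseId"],
        "content": [{"json": {result_key: result}}],
    }

tool_use = {
    "toolUseId": "abc123",
    "name": "get_city_weather",
    "input": {"city": "Tampa"},
}
tool_response = run_tool(tool_use)
print(json.dumps(tool_response))
```

The returned dictionary has the same shape as the tool_response built inside generate_text(), so it could be appended to the messages list in the same way.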

Local database query tool

The events_database_tool() queries the local SQLite database for events information by connecting to the database, executing a query to fetch upcoming events for the specified city, and returning the results as a formatted string. It’s used by the events_database_agent() function. Here’s the code:

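import sqlite3

import pandas as pd

def events_database_tool(city: str) -> str:
    # db_path is the path to the SQLite events database, defined elsewhere in the notebook
    conn = sqlite3.connect(db_path)
    query = """
        SELECT event_name, event_date, description 
        FROM local_events 
        WHERE city = ?
        ORDER BY event_date
        LIMIT 3
    """
    # The parameterized query keeps the city name out of the SQL string
    df = pd.read_sql_query(query, conn, params=(city,))
    conn.close()
    print(df)
    if not df.empty:
        # Format each row as "<name> on <date>: <description>"
        events = df.apply(
            lambda row: (
                f"{row['event_name']} on {row['event_date']}: {row['description']}"
            ),
            axis=1
        ).tolist()
        return "\n".join(events)
    else:
        return f"No upcoming events found for {city}."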

Weather tool

The weather_tool() fetches current weather data for the specified city by calling the OpenWeatherMap API. It’s used by the weather_agent() function. Here’s the code:

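from langchain_community.utilities import OpenWeatherMapAPIWrapper

def weather_tool(city: str) -> str:
    # The wrapper reads the API key from the OPENWEATHERMAP_API_KEY environment variable
    weather = OpenWeatherMapAPIWrapper()
    # Query the requested city rather than a hard-coded one
    tool_result = weather.run(city)
    return tool_result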

Online search tool

When local event information is unavailable, the search_tool() performs an online search using the Tavily API to find upcoming events in the specified city and return a summary. It’s used by the search_agent() function. Here’s the code:

import os

from tavily import TavilyClient

def search_tool(city: str) -> str:
    client = TavilyClient(api_key=os.environ['TAVILY_API_KEY'])
    query = f"What are the upcoming events in {city}?"
    response = client.search(query, search_depth="advanced")
    # Join the content of each search result, separated by blank lines
    results_content = "\n\n".join([result['content'] for result in response['results']])
    return results_content

Restaurant recommendation function

The query_restaurants_RAG() function uses a RAG system to provide restaurant recommendations by performing a similarity search in the vector database for relevant restaurant information, filtering for highly rated restaurants in the specified city and using Amazon Bedrock with the Mistral model to generate a summary of the top restaurants based on the retrieved information. It’s used by the query_restaurants_agent() function.

For the detailed implementation of these functions and tools, environment setup, and use cases, refer to the Multi-Agent LangGraph Mistral Jupyter notebook.

Implementing AI agents with LangGraph

Our multi-agent system consists of several specialized agents. Each agent in this architecture is represented by a Node in LangGraph, which, in turn, interacts with the tools and functions defined previously. The following diagram shows the workflow:

The workflow follows these steps:

  1. Events database agent (events_database_agent) – Uses the events_database_tool() to query a local SQLite database and find local event information
  2. Online search agent (search_agent) – Whenever local event information is unavailable in the database, this agent uses the search_tool() to find upcoming events by searching online for a given city
  3. Weather agent (weather_agent) – Fetches current weather data using the weather_tool() for the specified city
  4. Restaurant recommendation agent (query_restaurants_agent) – Uses the query_restaurants_RAG() function to provide restaurant recommendations for a specified city
  5. Analysis agent (analysis_agent) – Aggregates information from other agents to provide comprehensive recommendations

Here’s an example of how we created the weather agent:

def weather_agent(state: State) -> State:
    ......
    
    tool_config = {
        "tools": [
            {
                "toolSpec": {
                    "name": "get_city_weather",
                    "description": "Get current weather information for a specific city",
                    "inputSchema": {
                        "json": {
                            "type": "object",
                            "properties": {
                                "city": {
                                    "type": "string",
                                    "description": "The name of the city to look up weather for"
                                }
                            },
                            "required": ["city"]
                        }
                    }
                }
            }
        ]
    }
    
    input_text = f"Get current weather for {state.city}"
    output_message, tool_result = generate_text(bedrock_client, DEFAULT_MODEL, tool_config, input_text)
    
    if tool_result:
        state.weather_info = {"city": state.city, "weather": tool_result}
    else:
        state.weather_info = {"city": state.city, "weather": "Weather information not available."}
    
    print(f"Weather info set to: {state.weather_info}")
    return state
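The agent above reads and writes attributes on a shared State object whose definition is not shown here. As a minimal sketch, assuming a dataclass-based state, it could look like the following. Only `city`, `events_result`, and `weather_info` appear in the code in this post; the other fields and the `weather_agent_stub` function are hypothetical, and the notebook's actual State class may be defined differently.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class State:
    # Input parameter supplied by the caller
    city: str
    # Filled in by the events database agent (or the online search agent)
    events_result: str = ""
    # Filled in by the weather agent
    weather_info: dict = field(default_factory=dict)
    # Hypothetical fields for the remaining agents
    restaurant_info: Optional[str] = None
    analysis_result: Optional[str] = None

def weather_agent_stub(state: State) -> State:
    """Stand-in for weather_agent that skips the model and API calls."""
    state.weather_info = {"city": state.city, "weather": "moderate rain"}
    return state

state = weather_agent_stub(State(city="Tampa"))
print(state)
```

Each agent receives the state, fills in its own field, and returns it, which is what lets later agents such as the analysis agent see everything gathered so far.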

Orchestrating agent collaboration

In the Multi-Agent City Information System, several key primitives orchestrate agent collaboration. The build_graph() function defines the workflow in LangGraph, utilizing nodes, routes, and conditions. The workflow is dynamic, with conditional routing based on event search results, and incorporates memory persistence to store the state across different executions of the agents. Here’s an overview of the function’s behavior:

  1. Initialize workflow – The function begins by creating a StateGraph object called workflow, which is initialized with a State. In LangGraph, the State represents the data or context that is passed through the workflow as the agents perform their tasks. In our example, the state includes things like the results from previous agents (for example, event data, search results, and weather information), input parameters (for example, city name), and other relevant information that the agents might need to process:
# Define the graph
def build_graph():
    workflow = StateGraph(State)
    ...
  2. Add nodes (agents) – Each agent is associated with a specific function, such as retrieving event data, performing an online search, fetching weather information, recommending restaurants, or analyzing the gathered information:
    workflow.add_node("Events Database Agent", events_database_agent)
    workflow.add_node("Online Search Agent", search_agent)
    workflow.add_node("Weather Agent", weather_agent)
    workflow.add_node("Restaurants Recommendation Agent", query_restaurants_agent)
    workflow.add_node("Analysis Agent", analysis_agent)
  3. Set entry point and conditional routing – The entry point for the workflow is set to the Events Database Agent, meaning the execution of the workflow starts from this agent. Also, the function defines a conditional route using the add_conditional_edges method. The route_events() function decides the next step based on the results from the Events Database Agent:
    workflow.set_entry_point("Events Database Agent")
    
    def route_events(state):
        print(f"Routing events. Current state: {state}")
        print(f"Events content: '{state.events_result}'")
        if f"No upcoming events found for {state.city}" in state.events_result:
            print("No events found in local DB. Routing to Online Search Agent.")
            return "Online Search Agent"
        else:
            print("Events found in local DB. Routing to Weather Agent.")
            return "Weather Agent"

    workflow.add_conditional_edges(
        "Events Database Agent",
        route_events,
        {
            "Online Search Agent": "Online Search Agent",
            "Weather Agent": "Weather Agent"
        }
    )
  4. Add edges between agents – These edges define the order in which agents interact in the workflow. The agents will proceed in a specific sequence: from Online Search Agent to Weather Agent, from Weather Agent to Restaurants Recommendation Agent, and from there to Analysis Agent, before finally reaching the END:
    workflow.add_edge("Online Search Agent", "Weather Agent")
    workflow.add_edge("Weather Agent", "Restaurants Recommendation Agent")
    workflow.add_edge("Restaurants Recommendation Agent", "Analysis Agent")
    workflow.add_edge("Analysis Agent", END)
  5. Initialize memory for state persistence – The MemorySaver class is used to make sure that the state of the workflow is preserved between runs. This is especially useful in multi-agent systems where the state of the system needs to be maintained as the agents interact:
    # Initialize memory to persist state between graph runs
    checkpointer = MemorySaver()
  6. Compile the workflow and visualize the graph – The workflow is compiled, and the memory-saving object (checkpointer) is included to make sure that the state is persisted between executions. Then, it outputs a graphical representation of the workflow:
    # Compile the workflow
    app = workflow.compile(checkpointer=checkpointer)
    
    # Visualize the graph
    display(
        Image(
            app.get_graph().draw_mermaid_png(
                draw_method=MermaidDrawMethod.API
            )
        )
    )
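To see which path a given city takes through this graph, the routing rule and the fixed edges can be traced in plain Python without LangGraph. This is only an illustrative sketch: `NEXT_NODE` and `trace_path` are hypothetical helpers, and the event strings are made up.

```python
def route_events(events_result, city):
    """Same rule as the workflow: fall back to online search when the DB has nothing."""
    if f"No upcoming events found for {city}" in events_result:
        return "Online Search Agent"
    return "Weather Agent"

# Fixed edges from the workflow description; None marks the end of the graph.
NEXT_NODE = {
    "Online Search Agent": "Weather Agent",
    "Weather Agent": "Restaurants Recommendation Agent",
    "Restaurants Recommendation Agent": "Analysis Agent",
    "Analysis Agent": None,
}

def trace_path(events_result, city):
    """List the agents visited, starting from the entry point."""
    path = ["Events Database Agent"]
    node = route_events(events_result, city)
    while node is not None:
        path.append(node)
        node = NEXT_NODE[node]
    return path

no_local_data = trace_path("No upcoming events found for Tampa.", "Tampa")
with_local_data = trace_path("Jazz Night on 2025-03-02: Live jazz", "Philadelphia")
print(" -> ".join(no_local_data))
print(" -> ".join(with_local_data))
```

The first trace includes the Online Search Agent and the second skips it, which is the only branch point in the workflow.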

The following diagram illustrates these steps:

Results and analysis

To demonstrate the versatility of our Multi-Agent City Information System, we run it for three different cities: Tampa, Philadelphia, and New York. Each example showcases different aspects of the system’s functionality.

The main() function orchestrates the entire process:

  1. Calls the build_graph() function, which implements the agentic workflow
  2. Initializes the state with the specified city
  3. Streams the events through the workflow
  4. Retrieves and displays the final analysis and recommendations

To run the code, do the following:

if __name__ == "__main__":
    cities = ["Tampa", "Philadelphia", "New York"]
    for city in cities:
        print(f"\nStarting script execution for city: {city}")
        main(city)

Three example use cases

For Example 1 (Tampa), the following diagram shows how the agentic workflow produces the output in response to the user’s question, “What’s happening in Tampa and what should I wear?”

The system produced the following results:

  1. Events – Not found in the local database, triggering the search tool which called the Tavily API to find several upcoming events
  2. Weather – Retrieved from the weather tool. Current conditions include moderate rain, 28°C, and 87% humidity
  3. Activities – The system suggested various indoor and outdoor activities based on the events and weather
  4. Outfit recommendations – Considering the warm, humid, and rainy conditions, the system recommended light, breathable clothing and rain protection
  5. Restaurants – Recommendations provided through the RAG system

For Example 2 (Philadelphia), the agentic workflow identified events in the local database, including cultural events and festivals. It retrieved weather data from the OpenWeatherMap API, then suggested activities based on local events and weather conditions. Outfit recommendations were made in line with the weather forecast, and restaurant recommendations were provided through the RAG system.

For Example 3 (New York), the workflow identified events such as Broadway shows and city attractions in the local database. It retrieved weather data from the OpenWeatherMap API and suggested activities based on the variety of local events and weather conditions. Outfit recommendations were tailored to New York’s weather and urban environment. However, the RAG system was unable to provide restaurant recommendations for New York because the synthetic dataset created earlier hadn’t included any restaurants from this city.

These examples demonstrate the system’s ability to adapt to different scenarios. For detailed output of these examples, refer to the Results and Analysis section of the Multi-Agent LangGraph Mistral Jupyter notebook.

Conclusion

In the Multi-Agent City Information System we developed, agents integrate various data sources and APIs within a flexible, modular framework to provide valuable information about events, weather, activities, outfit recommendations, and dining options across different cities. Using Amazon Bedrock and LangGraph, we’ve created a sophisticated agent-based workflow that adapts seamlessly to varying levels of available information, switching between local and online data sources as needed. These agents autonomously gather, process, and consolidate data into actionable insights, orchestrating and automating business logic to streamline processes and provide real-time insights. As a result, this multi-agent approach enables the creation of robust, scalable, and intelligent agentic systems that push the boundaries of what’s possible with generative AI.

Want to dive deeper? Explore the implementation of Multi-Agent Collaboration and Orchestration using LangGraph for Mistral Models on GitHub to observe the code in action and try out the solution yourself. You’ll find step-by-step instructions for setting up and running the multi-agent system, along with code for interacting with data sources, agents, routing data, and visualizing the workflow.


About the Author

Andre Boaventura is a Principal AI/ML Solutions Architect at AWS, specializing in generative AI and scalable machine learning solutions. With over 25 years in the high-tech software industry, he has deep expertise in designing and deploying AI applications using AWS services such as Amazon Bedrock, Amazon SageMaker, and Amazon Q. Andre works closely with global system integrators (GSIs) and customers across industries to architect and implement cutting-edge AI/ML solutions to drive business value. Outside of work, Andre enjoys practicing Brazilian Jiu-Jitsu with his son (often getting pinned or choked by a teenager), cheering for his daughter at her dance competitions (despite not knowing ballet terms—he claps enthusiastically anyway), and spending ‘quality time’ with his wife—usually in shopping malls, pretending to be interested in clothes and shoes while secretly contemplating a new hobby.


Evaluate RAG responses with Amazon Bedrock, LlamaIndex and RAGAS


In the rapidly evolving landscape of artificial intelligence, Retrieval Augmented Generation (RAG) has emerged as a game-changer, revolutionizing how Foundation Models (FMs) interact with organization-specific data. As businesses increasingly rely on AI-powered solutions, the need for accurate, context-aware, and tailored responses has never been more critical.

Enter the trio of Amazon Bedrock, LlamaIndex, and RAGAS, a combination for evaluating and optimizing RAG responses. This blog post delves into how these tools work together to improve the performance of your AI applications so that they meet the exacting standards of enterprise-level deployments.

Whether you’re a seasoned AI practitioner or a business leader exploring the potential of generative AI, this guide will equip you with the knowledge and tools to:

  1. Harness the full potential of Amazon Bedrock’s robust foundation models
  2. Utilize RAGAS’s comprehensive evaluation metrics for RAG systems

In this post, we’ll explore how to leverage Amazon Bedrock, LlamaIndex, and RAGAS to enhance your RAG implementations. You’ll learn practical techniques to evaluate and optimize your AI systems, enabling more accurate, context-aware responses that align with your organization’s specific needs. Let’s dive in and discover how these powerful tools can help you build more effective and reliable AI-powered solutions.

RAG Evaluation

RAG evaluation is important to ensure that RAG models produce accurate, coherent, and relevant responses. By analyzing the retrieval and generator components both jointly and independently, RAG evaluation helps identify bottlenecks, monitor performance, and improve the overall system. Current RAG pipelines frequently employ similarity-based metrics such as ROUGE, BLEU, and BERTScore to assess the quality of the generated responses, which is essential for refining and enhancing the model’s capabilities.

The similarity-based metrics mentioned above (ROUGE, BLEU, and BERTScore) have limitations in assessing relevance and detecting hallucinations. More sophisticated metrics are needed to evaluate factual alignment and accuracy.
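A small worked example shows why word-overlap scores can miss hallucinations. The function below is a simplified unigram recall in the spirit of ROUGE-1, not the official ROUGE implementation, and the sentences are invented for illustration (the instance limit is not a real service quota).

```python
def unigram_recall(reference, candidate):
    """Fraction of reference words that also appear in the candidate."""
    ref_tokens = reference.lower().split()
    cand_tokens = set(candidate.lower().split())
    matched = sum(1 for tok in ref_tokens if tok in cand_tokens)
    return matched / len(ref_tokens)

reference = "the endpoint supports up to 5 instances per region"
# Correct answer, phrased differently from the reference
faithful = "you can run at most 5 instances in each region"
# Wrong answer (50 instead of 5) that copies the reference wording
hallucinated = "the endpoint supports up to 50 instances per region"

faithful_score = unigram_recall(reference, faithful)
hallucinated_score = unigram_recall(reference, hallucinated)
print(f"faithful:     {faithful_score:.2f}")
print(f"hallucinated: {hallucinated_score:.2f}")
```

The factually wrong answer shares eight of the nine reference words, so it scores far higher than the correct paraphrase, which shares only three.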

Evaluate RAG components with Foundation models

We can also use a Foundation Model as a judge to compute various metrics for both retrieval and generation. Here are some examples of these metrics:

  • Retrieval component
    • Context precision – Evaluates whether the ground-truth relevant items present in the retrieved contexts are ranked above the irrelevant ones.
    • Context recall – Ensures that the context contains all relevant information needed to answer the question.
  • Generator component
    • Faithfulness – Verifies that the generated answer is factually accurate based on the provided context, helping to identify errors or “hallucinations.”
    • Answer relevancy – Measures how well the answer matches the question. Higher scores mean the answer is complete and relevant, while lower scores indicate missing or redundant information.

Ragas Metrics - generation, retrieval
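The arithmetic behind two of these metrics can be sketched as follows. In an LLM-as-judge setup, the foundation model produces the per-chunk relevance verdicts and the per-claim support verdicts; here they are supplied by hand so the calculation is visible. The function names are hypothetical, and this is a simplified illustration rather than the RAGAS implementation.

```python
def context_precision(relevance):
    """Average of precision@k over the ranks that hold a relevant chunk.

    relevance: list of 0/1 verdicts for the retrieved chunks, in rank order.
    """
    total_relevant = sum(relevance)
    if total_relevant == 0:
        return 0.0
    score, hits = 0.0, 0
    for k, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            score += hits / k  # precision@k, counted only at relevant ranks
    return score / total_relevant

def faithfulness(claim_verdicts):
    """Share of claims in the answer that the retrieved context supports."""
    if not claim_verdicts:
        return 0.0
    return sum(claim_verdicts) / len(claim_verdicts)

ranked_well = context_precision([1, 1, 0])    # relevant chunks ranked first
ranked_poorly = context_precision([0, 1, 1])  # irrelevant chunk ranked first
answer_faithfulness = faithfulness([True, True, False])
print(ranked_well, ranked_poorly, answer_faithfulness)
```

Both rankings contain the same two relevant chunks, but the score drops when an irrelevant chunk is placed first, which is exactly the ordering effect that context precision is meant to capture.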

Overview of solution

This post guides you through assessing the quality of RAG responses with evaluation frameworks such as RAGAS and LlamaIndex on Amazon Bedrock.

In this post, we also use LangChain to create a sample RAG application.

Amazon Bedrock is a fully managed service that offers a choice of high-performing Foundation Models (FMs) from leading AI companies like AI21 Labs, Anthropic, Cohere, Meta, Stability AI, and Amazon via a single API, along with a broad set of capabilities you need to build generative AI applications with security, privacy, and responsible AI.

The Retrieval Augmented Generation Assessment (RAGAS) framework offers multiple metrics to evaluate each part of the RAG system pipeline, identifying areas for improvement. It utilizes foundation models to test individual components, aiding in pinpointing modules for development to enhance overall results.

LlamaIndex is a framework for building LLM applications. It simplifies data integration from various sources and provides tools for data indexing, engines, agents, and application integrations. Optimized for search and retrieval, it streamlines querying LLMs and retrieving documents. This blog post focuses on using its Observability/Evaluation modules.

LangChain is an open-source framework that simplifies the creation of applications powered by foundation models. It provides tools for chaining LLM operations, managing context, and integrating external data sources. LangChain is primarily used for building chatbots, question-answering systems, and other AI-driven applications that require complex language processing capabilities.

Diagram Architecture

The following diagram is a high-level reference architecture that explains how you can evaluate the RAG solution with RAGAS or LlamaIndex.

Architecture Diagram

The solution consists of the following components:

  1. Evaluation dataset – The source data for the RAG comes from the Amazon SageMaker FAQ, which contains 170 question-answer pairs. This corresponds to Step 1 in the architecture diagram.
  2. Build sample RAG – Documents are segmented into chunks and stored in Amazon Bedrock Knowledge Bases (Steps 2–4). We use LangChain Retrieval Q&A to answer user queries. This process retrieves relevant data from an index at runtime and passes it to the Foundation Model (FM).
  3. RAG evaluation – To assess the quality of the Retrieval-Augmented Generation (RAG) solution, we can use both RAGAS and LlamaIndex. An LLM performs the evaluation by comparing its predictions with ground truths (Steps 5–6).

To reproduce the solution, follow the provided notebook. We elaborate on the main code components in this post.

Prerequisites

To implement this solution, you need the following:

  1. An AWS account with privileges to create AWS Identity and Access Management (IAM) roles and policies. For more information, see Overview of access management: Permissions and policies.
  2. Access enabled for the Amazon Titan Text Embeddings V2 model and Anthropic Claude 3 Sonnet on Amazon Bedrock. For instructions, see Model access.
  3. Run the prerequisite code provided in the Python

Ingest FAQ data

The first step is to ingest the SageMaker FAQ data. For this purpose, LangChain provides a WebBaseLoader object to load text from HTML webpages into a document format. Then we split each document into multiple chunks of 2,000 tokens with a 100-token overlap. See the following code:

text_chunks = split_document_from_url(SAGEMAKER_URL, chunck_size= 2000,  chunk_overlap=100)
retriever_db= get_retriever(text_chunks, bedrock_embeddings)
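The split_document_from_url() helper lives in the notebook's utilities and isn't reproduced here. To illustrate what chunk size and overlap mean, the following sketch splits a string into fixed-size windows that share some content with their neighbors. It counts characters rather than tokens, and `split_with_overlap` is a hypothetical function, so treat it as a simplified stand-in rather than the helper's actual logic.

```python
def split_with_overlap(text, chunk_size, chunk_overlap):
    """Split text into windows of chunk_size that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks

chunks = split_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=1)
print(chunks)
```

Each chunk starts with the last character of the previous one. The overlap serves the same purpose at larger scale: a sentence that straddles a chunk boundary still appears whole in at least one chunk.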

Set up embeddings and LLM with Amazon Bedrock and LangChain

In order to build a sample RAG application, we need an LLM and an embedding model:

  • LLM – Anthropic Claude 3 Sonnet
  • Embedding – Amazon Titan Embeddings – Text V2

This code sets up a LangChain application using Amazon Bedrock, configuring Titan embeddings and a Claude 3 Sonnet model for text generation, with parameters that control the model’s output. See the following code from the notebook:

import boto3
from botocore.client import Config
from langchain.llms.bedrock import Bedrock
from langchain_aws import ChatBedrock
from langchain.embeddings import BedrockEmbeddings
from langchain.retrievers.bedrock import AmazonKnowledgeBasesRetriever
from langchain.chains import RetrievalQA
import nest_asyncio
nest_asyncio.apply()

# URL to fetch the document
SAGEMAKER_URL = "https://aws.amazon.com/sagemaker/faqs/"

# Bedrock parameters
EMBEDDING_MODEL = "amazon.titan-embed-text-v2:0"
BEDROCK_MODEL_ID = "anthropic.claude-3-sonnet-20240229-v1:0"

# Assumption: the notebook creates bedrock_client in its prerequisite step; a
# default Bedrock runtime client is created here so the snippet is self-contained
bedrock_client = boto3.client("bedrock-runtime")

bedrock_embeddings = BedrockEmbeddings(model_id=EMBEDDING_MODEL, client=bedrock_client)

model_kwargs = {
    "temperature": 0, 
    "top_k": 250, 
    "top_p": 1,
    "stop_sequences": ["\n\nHuman:"]
}    

llm_bedrock = ChatBedrock(
    model_id=BEDROCK_MODEL_ID,
    model_kwargs=model_kwargs
)

Set up Knowledge Bases

We create an Amazon Bedrock Knowledge Bases Web Crawler data source to process the SageMaker FAQ data.

In the following code, we load the embedded documents into the knowledge base and set up the retriever with LangChain:

from utils import split_document_from_url, get_bedrock_retriever
from botocore.exceptions import ClientError

text_chunks = split_document_from_url(SAGEMAKER_URL, chunck_size= 2000,  chunk_overlap=100)
retriever_db= get_bedrock_retriever(text_chunks, region)

Build a Q&A chain to query the retrieval API

After the database is populated, create a Q&A retrieval chain to perform question answering with context extracted from the vector store. You also define a prompt template following Claude prompt engineering guidelines. See the following code from the notebook:

from langchain_core.prompts import ChatPromptTemplate
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain.chains import create_retrieval_chain

system_prompt = (
    "Use the given context to answer the question. "
    "If you don't know the answer, say you don't know. "
    "Use three sentence maximum and keep the answer concise and short. "
    "Context: {context}"
    )

prompt_template = ChatPromptTemplate.from_messages([
    ("system", system_prompt),
    ("human", "{input}")
    ]
)
question_answer_chain = create_stuff_documents_chain(llm_bedrock, prompt_template)
chain = create_retrieval_chain(retriever_db, question_answer_chain)
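
Conceptually, a "stuff" documents chain concatenates the retrieved chunks and substitutes them for {context} in the prompt before calling the model. The following plain-Python sketch illustrates that idea only; it is not LangChain’s implementation. build_messages is a hypothetical helper, and the sample chunk is illustrative.

```python
SYSTEM_PROMPT = (
    "Use the given context to answer the question. "
    "If you don't know the answer, say you don't know. "
    "Use three sentence maximum and keep the answer concise and short. "
    "Context: {context}"
)


def build_messages(question, retrieved_chunks):
    """Return (role, text) pairs with every retrieved chunk stuffed into the system prompt."""
    context = "\n\n".join(retrieved_chunks)
    return [
        ("system", SYSTEM_PROMPT.format(context=context)),
        ("human", question),
    ]


messages = build_messages(
    "Can I stop a SageMaker Autopilot job manually?",
    ["Yes. You can stop a job at any time."],  # illustrative chunk
)
```

Because all chunks go into a single prompt, the number and size of retrieved chunks must fit within the model’s context window.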

Build a dataset to evaluate the RAG application

To evaluate a RAG application, we need a combination of the following datasets:

  • Questions – The user query that serves as input to the RAG pipeline
  • Context – The information retrieved from enterprise or external data sources based on the provided query
  • Answers – The responses generated by LLMs
  • Ground truths – Human-annotated, ideal responses for the questions that can be used as the benchmark to compare against the LLM-generated answers
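
As a rough illustration of how these four pieces fit together, the following sketch combines them into one record per question. It is not the notebook’s build_dataset function. assemble_eval_records is a hypothetical name, the key names are illustrative (check the column names your evaluation library version expects), and each chain output is assumed to be a dict with an "answer" string and a "context" list of chunk strings.

```python
def assemble_eval_records(questions, ground_truths, chain_outputs):
    """Combine question, retrieved context, answer, and ground truth per question.

    chain_outputs is assumed to be a list of dicts with an "answer" string
    and a "context" list of already-extracted chunk strings.
    """
    if not (len(questions) == len(ground_truths) == len(chain_outputs)):
        raise ValueError("all inputs must have the same length")
    records = []
    for question, truth, output in zip(questions, ground_truths, chain_outputs):
        records.append({
            "question": question,
            "contexts": list(output["context"]),
            "answer": output["answer"],
            "ground_truth": truth,
        })
    return records


# Hypothetical single-question example
example_records = assemble_eval_records(
    ["Can I stop a SageMaker Autopilot job manually?"],
    ["Yes. You can stop a job at any time."],
    [{"answer": "Yes, you can stop it at any time.",
      "context": ["You can stop a job at any time."]}],
)
```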

We are ready to evaluate the RAG application. As described in the introduction, we select three metrics to assess our RAG solution:

  1. Faithfulness
  2. Answer Relevancy
  3. Answer Correctness

For more information, refer to Metrics.

This step involves defining an evaluation dataset with a set of ground truth questions and answers. For this post, we choose four random questions from the SageMaker FAQ. See the following code from the notebook:

EVAL_QUESTIONS = [
"Can I stop a SageMaker Autopilot job manually?",
"Do I get charged separately for each notebook created and run in SageMaker Studio?",
"Do I get charged for creating and setting up an SageMaker Studio domain?",
"Will my data be used or shared to update the base model that is offered to customers using SageMaker JumpStart?",
]

#Defining the ground truth answers for each question

EVAL_ANSWERS = [
"Yes. You can stop a job at any time. When a SageMaker Autopilot job is stopped, all ongoing trials will be stopped and no new trial will be started.",
"""No. You can create and run multiple notebooks on the same compute instance.
You pay only for the compute that you use, not for individual items.
You can read more about this in our metering guide.
In addition to the notebooks, you can also start and run terminals and interactive shells in SageMaker Studio, all on the same compute instance.""",
"No, you don’t get charged for creating or configuring an SageMaker Studio domain, including adding, updating, and deleting user profiles.",
"No. Your inference and training data will not be used nor shared to update or train the base model that SageMaker JumpStart surfaces to customers."
]

Evaluation of RAG with RAGAS

Evaluating the RAG solution requires comparing LLM predictions with ground truth answers. To do so, we use the batch() function from LangChain to perform inference on all questions in our evaluation dataset.

Then we can use the evaluate() function from RAGAS to perform evaluation on each metric (answer relevancy, faithfulness, and answer correctness). It uses an LLM to compute the metrics. Feel free to use other metrics from RAGAS.

See the following code from the notebook:

from ragas.metrics import answer_relevancy, faithfulness, answer_correctness
from ragas import evaluate

#Batch invoke and dataset creation
result_batch_questions = chain.batch([{"input": q} for q in EVAL_QUESTIONS])

dataset= build_dataset(EVAL_QUESTIONS,EVAL_ANSWERS,result_batch_questions, text_chunks)

result = evaluate(
    dataset=dataset,
    metrics=[answer_relevancy, faithfulness, answer_correctness],
    llm=llm_bedrock,
    embeddings=bedrock_embeddings,
    raise_exceptions=False,
)
df = result.to_pandas()
df.head()

The following screenshot shows the evaluation results and the RAGAS answer relevancy score.

Relevancy score

Answer Relevancy

In the answer_relevancy_score column, a score closer to 1 indicates the response generated is relevant to the input query.

Faithfulness

In the second column, the first query result has a low faithfulness_score (0.2), which indicates that most of the response is not derived from the context and is likely hallucinated. The remaining query results have a high faithfulness_score (1.0), which indicates that the responses are derived from the context.
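
To see how a score such as 0.2 can arise: RAGAS describes faithfulness as the share of claims in the generated answer that can be inferred from the retrieved context. The following sketch computes that ratio from per-claim verdicts. In RAGAS a judge LLM produces the verdicts; here they are supplied by hand and are hypothetical, not taken from the notebook.

```python
def faithfulness_ratio(claim_supported):
    """Fraction of answer claims that the retrieved context supports.

    claim_supported holds one boolean per claim extracted from the answer.
    """
    if not claim_supported:
        raise ValueError("at least one claim is required")
    return sum(claim_supported) / len(claim_supported)


# Hypothetical verdicts: only one of five claims is backed by the context
score = faithfulness_ratio([True, False, False, False, False])
```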

Answer Correctness

In the last column, answer_correctness, the second and last rows have high answer correctness, meaning that the answers provided by the LLM are close to the ground truth.

Evaluation of RAG with LlamaIndex

LlamaIndex, similar to RAGAS, provides a comprehensive RAG evaluation module. This module offers a variety of metrics to assess the performance of your RAG system. The evaluation process generates two key outputs:

  1. Feedback: The judge LLM provides detailed evaluation feedback in the form of a string, offering qualitative insights into the system’s performance.
  2. Score: This numerical value indicates how well the answer meets the evaluation criteria. The scoring system varies depending on the specific metric being evaluated. For example, metrics like Answer Relevancy and Faithfulness are typically scored on a scale from 0 to 1.

These outputs allow for both qualitative and quantitative assessment of your RAG system’s performance, enabling you to identify areas for improvement and track progress over time.

The following is a code sample from the notebook:

from llama_index.llms.bedrock import Bedrock
from llama_index.core.evaluation import (
    AnswerRelevancyEvaluator,
    CorrectnessEvaluator,
    FaithfulnessEvaluator
)
from utils import evaluate_llama_index_metric

bedrock_llm_llama = Bedrock(model=BEDROCK_MODEL_ID)
faithfulness= FaithfulnessEvaluator(llm=bedrock_llm_llama)
answer_relevancy= AnswerRelevancyEvaluator(llm=bedrock_llm_llama)
correctness= CorrectnessEvaluator(llm=bedrock_llm_llama)

Answer Relevancy

df_answer_relevancy= evaluate_llama_index_metric(answer_relevancy, dataset)
df_answer_relevancy.head()

The Score column holds the result for the answer_relevancy evaluation criteria. All values are set to 1, meaning that all predictions are relevant to the retrieved context.

Additionally, the Feedback column explains why each answer received a passing score. We can observe that all answers align with the context extracted from the retriever.
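
The evaluate_llama_index_metric helper comes from the notebook’s utils module and is not shown in this post. The following is a minimal sketch of what such a helper might look like, not the notebook’s code. It assumes each record is a dict with question, answer, contexts, and ground_truth keys, and that the evaluator exposes an evaluate() method returning an object with score and feedback attributes, modeled on LlamaIndex evaluators. A stub evaluator is used so that the sketch runs without an LLM.

```python
from types import SimpleNamespace


def evaluate_metric(evaluator, records):
    """Run one evaluator over every record and collect its score and feedback."""
    rows = []
    for record in records:
        result = evaluator.evaluate(
            query=record["question"],
            response=record["answer"],
            contexts=record["contexts"],
            reference=record["ground_truth"],  # only some evaluators use this
        )
        rows.append({
            "Query": record["question"],
            "Score": result.score,
            "Feedback": result.feedback,
        })
    return rows


class AlwaysPassEvaluator:
    """Stand-in for a real evaluator; always returns a passing result."""

    def evaluate(self, query=None, response=None, contexts=None, **kwargs):
        return SimpleNamespace(score=1.0, feedback="YES")


demo_rows = evaluate_metric(
    AlwaysPassEvaluator(),
    [{"question": "q1", "answer": "a1", "contexts": ["c1"], "ground_truth": "g1"}],
)
```

The list of row dicts can be passed to pandas.DataFrame to obtain a table like the ones shown in this section.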

Answer Correctness

df_correctness= evaluate_llama_index_metric(correctness, dataset)
df_correctness.head()

All values in the Score column are set to 5.0, the top of the evaluator’s 1 to 5 scale, meaning that all predictions are consistent with the ground truth answers.

Faithfulness

The following screenshot shows the evaluation results for answer faithfulness.

df_faithfulness= evaluate_llama_index_metric(faithfulness, dataset)
df_faithfulness.head()

All values in the Score column are set to 1.0, which means all answers generated by the LLM are supported by the retrieved context.

Conclusion

While foundation models offer impressive generative capabilities, their effectiveness in addressing organization-specific queries has been a persistent challenge. The Retrieval Augmented Generation framework emerges as a powerful solution, bridging this gap by enabling LLMs to use external, organization-specific data sources.

To truly unlock the potential of RAG pipelines, the RAGAS framework, in conjunction with LlamaIndex, provides a comprehensive evaluation solution. By meticulously assessing both retrieval and generation components, this approach empowers organizations to pinpoint areas for improvement and refine their RAG implementations. The result? Responses that are not only factually accurate but also highly relevant to user queries.

By adopting this holistic evaluation approach, enterprises can fully harness the transformative power of generative AI applications. This not only maximizes the value derived from these technologies but also paves the way for more intelligent, context-aware, and reliable AI systems that can truly understand and address an organization’s unique needs.

As we continue to push the boundaries of what’s possible with AI, tools like Amazon Bedrock, LlamaIndex, and RAGAS will play a pivotal role in shaping the future of enterprise AI applications. By embracing these innovations, organizations can confidently navigate the exciting frontier of generative AI, unlocking new levels of efficiency, insight, and competitive advantage.

For further exploration, readers interested in enhancing the reliability of AI-generated content may want to look into Amazon Bedrock’s Guardrails feature, which offers additional tools like the Contextual Grounding Check.


About the authors

Madhu is a Senior Partner Solutions Architect specializing in worldwide public sector cybersecurity partners. With over 20 years in software design and development, he collaborates with AWS partners to ensure customers implement solutions that meet strict compliance and security objectives. His expertise lies in building scalable, highly available, secure, and resilient applications for diverse enterprise needs.

Babu Kariyaden Parambath is a Senior AI/ML Specialist at AWS. At AWS, he enjoys working with customers in helping them identify the right business use case with business value and solve it using AWS AI/ML solutions and services. Prior to joining AWS, Babu was an AI evangelist with 20 years of diverse industry experience delivering AI driven business value for customers.

Innovating at speed: BMW’s generative AI solution for cloud incident analysis

This post was co-authored with Johann Wildgruber, Dr. Jens Kohl, Thilo Bindel, and Luisa-Sophie Gloger from BMW Group.

The BMW Group—headquartered in Munich, Germany—is a vehicle manufacturer with more than 154,000 employees, and 30 production and assembly facilities worldwide as well as research and development locations across 17 countries. Today, the BMW Group (BMW) is the world’s leading manufacturer of premium automobiles and motorcycles, and provider of premium financial and mobility services.

BMW Connected Company is a division within BMW responsible for developing and operating premium digital services for BMW’s connected fleet, which currently numbers more than 23 million vehicles worldwide. These digital services are used by many BMW vehicle owners daily; for example, to lock or open car doors remotely using an app on their phone, to start window defrost remotely, to buy navigation map updates from the car’s menu, or to listen to music streamed over the internet in their car.

In this post, we explain how BMW uses generative AI technology on AWS to help run these digital services with high availability. Specifically, BMW uses Amazon Bedrock Agents to make remediating (partial) service outages quicker by speeding up the otherwise cumbersome and time-consuming process of root cause analysis (RCA). The fully automated RCA agent correctly identifies the root cause in most cases (measured at 85%), and helps engineers understand the system and gain real-time insights into their cases. This performance was further validated during the proof of concept, where employing the RCA agent on representative use cases clearly demonstrated the benefits of this solution, allowing BMW to achieve significantly lower diagnosis times.

The challenges of root cause analysis

Digital services are often implemented by chaining multiple software components together; components that might be built and run by different teams. For example, consider the service of remotely opening and locking vehicle doors. There might be a development team building and running the iOS app, another team for the Android app, a team building and running the backend-for-frontend used by both the iOS and Android app, and so on. Moreover, these teams might be geographically dispersed and run their workloads in different locations and regions; many hosted on AWS, some elsewhere.

Now consider a (fictitious) scenario where reports come in from car owners complaining that remotely locking doors with the app no longer works. Is the iOS app responsible for the outage, or the backend-for-frontend? Did a firewall rule change somewhere? Did an internal TLS certificate expire? Is the MQTT system experiencing delays? Was there an inadvertent breaking change in recent API changes? When did they actually deploy that? Or was the database password for the central subscription service rotated again?

It can be difficult to determine the root cause of issues in situations like this. It requires checking many systems and teams, many of which might be failing, because they’re interdependent. Developers need to reason about the system architecture, form hypotheses, and follow the chain of components until they have located the one that is the culprit. They often have to backtrack and reassess their hypotheses, and pursue the investigation in another chain of components.

Understanding the challenges in such complex systems highlights the need for a robust and efficient approach to root cause analysis. With this context in mind, let’s explore how BMW and AWS collaborated to develop a solution using Amazon Bedrock Agents to streamline and enhance the RCA process.

Solution overview

At a high level, the solution uses an Amazon Bedrock agent to do automated RCA. This agent has several custom-built tools at its disposal to do its job. These tools, implemented by AWS Lambda functions, use services like Amazon CloudWatch and AWS CloudTrail to analyze system logs and metrics. The following diagram illustrates the solution architecture.

High level diagram of the solution

When an incident occurs, an on-call engineer gives a description of the issue at hand to the Amazon Bedrock agent. The agent will then start investigating for the root cause of the issue, using its tools to do tasks that the on-call engineer would otherwise do manually, such as searching through logs. Based on the clues it uncovers, the agent proposes several likely hypotheses to the on-call engineer. The engineer can then resolve the issue, or give pointers to the agent to direct the investigation further. In the following section, we take a closer look at the tools the agent uses.

Amazon Bedrock agent tools

The Amazon Bedrock agent’s effectiveness in performing RCA lies in its ability to seamlessly integrate with custom tools. These tools, designed as Lambda functions, use AWS services like CloudWatch and CloudTrail to automate tasks that are typically manual and time-intensive for engineers. By organizing its capabilities into specialized tools, the Amazon Bedrock agent makes sure that RCA is both efficient and precise.

Architecture Tool

The Architecture Tool uses C4 diagrams to provide a comprehensive view of the system’s architecture. These diagrams, enhanced through Structurizr, give the agent a hierarchical understanding of component relationships, dependencies, and workflows. This allows the agent to target the most relevant areas during its RCA process, effectively narrowing down potential causes of failure based on how different systems interact.

For instance, if an issue affects a specific service, the Architecture Tool can identify upstream or downstream dependencies and suggest hypotheses focused on those systems. This accelerates diagnostics by enabling the agent to reason contextually about the architecture instead of blindly searching through logs or metrics.
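
The dependency reasoning described above can be sketched as a graph traversal. The following example is illustrative only and is not BMW’s Architecture Tool. The dependency map is hypothetical and uses the components named in this post’s case study, and downstream_components is a hypothetical helper.

```python
from collections import deque

# Hypothetical dependency map: each component lists what it calls
DEPENDS_ON = {
    "iOS app": ["backend-for-frontend API"],
    "Android app": ["backend-for-frontend API"],
    "backend-for-frontend API": ["Remote Vehicle Management API"],
    "Remote Vehicle Management API": ["MQTT messaging"],
}


def downstream_components(start, depends_on):
    """Return every component reachable from start, in breadth-first order."""
    seen = {start}
    order = []
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for dependency in depends_on.get(current, []):
            if dependency not in seen:
                seen.add(dependency)
                order.append(dependency)
                queue.append(dependency)
    return order


# Components worth inspecting when the iOS app reports a failure
suspects = downstream_components("iOS app", DEPENDS_ON)
```

Breadth-first order lists the closest dependencies first, which matches the idea of following the chain of components outward from the affected service.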

Logs Tool

The Logs Tool uses CloudWatch Logs Insights to analyze log data in real time. By searching for patterns, errors, or anomalies, as well as comparing the trend to the previous period, it helps the agent pinpoint issues related to specific events, such as failed authentications or system crashes.

For example, in a scenario involving database access failures, the Logs Tool might identify a new spike in the number of error messages such as “FATAL: password authentication failed” compared to the previous hour. This insight allows the agent to quickly associate the failure with potential root causes, such as an improperly rotated database password.
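
The comparison with the previous period can be illustrated with a small sketch. It works on in-memory log lines rather than CloudWatch Logs Insights queries, and it is not BMW’s Logs Tool. The function names and thresholds are assumptions chosen for illustration.

```python
def count_matches(log_lines, pattern):
    """Count log lines containing the pattern (plain substring match)."""
    return sum(1 for line in log_lines if pattern in line)


def is_new_spike(current_lines, previous_lines, pattern, min_ratio=3.0, min_count=5):
    """Flag the pattern when it is frequent now and much rarer in the previous period."""
    current = count_matches(current_lines, pattern)
    previous = count_matches(previous_lines, pattern)
    if current < min_count:
        return False
    # max(previous, 1) avoids division by zero when the error is brand new
    return current / max(previous, 1) >= min_ratio


# Hypothetical log samples for two consecutive hours
previous_hour = ["INFO ok"] * 50 + ["FATAL: password authentication failed"]
current_hour = ["INFO ok"] * 20 + ["FATAL: password authentication failed"] * 12
spike = is_new_spike(current_hour, previous_hour, "password authentication failed")
```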

Metrics Tool

The Metrics Tool provides the agent with real-time insights into the system’s health by monitoring key metrics through CloudWatch. This tool identifies statistical anomalies in critical performance indicators such as latency, error rates, resource utilization, or unusual spikes in usage patterns, which can often signal potential issues or deviations from normal behavior.

For instance, in a Kubernetes memory overload scenario, the Metrics Tool might detect a sharp increase in memory consumption or unusual resource allocation prior to the failure. By surfacing CloudWatch metric alarms for such anomalies, the tool enables the agent to prioritize hypotheses related to resource mismanagement, misconfigured thresholds, or unexpected system load, guiding the investigation more effectively toward resolving the issue.
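
One simple way to flag a statistical anomaly is a rolling z-score: compare each new data point with the mean and standard deviation of the points just before it. The following sketch shows that technique on a hypothetical memory series. It is not CloudWatch’s anomaly detection and not BMW’s Metrics Tool, and the threshold and window size are assumptions.

```python
from statistics import mean, stdev


def anomalous_points(values, threshold=3.0, baseline_size=10):
    """Return indexes of points that deviate strongly from the preceding window."""
    if baseline_size < 2:
        raise ValueError("baseline_size must be at least 2")
    anomalies = []
    for i in range(baseline_size, len(values)):
        baseline = values[i - baseline_size:i]
        mu = mean(baseline)
        sigma = stdev(baseline)
        if sigma == 0:
            # Flat baseline: any change at all counts as a deviation
            if values[i] != mu:
                anomalies.append(i)
            continue
        if abs(values[i] - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies


# Hypothetical memory utilization in percent; the last sample jumps sharply
memory_percent = [50, 51, 49, 50, 52, 50, 49, 51, 50, 50, 51, 95]
flagged = anomalous_points(memory_percent)
```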

Infrastructure Tool

The Infrastructure Tool uses CloudTrail data to analyze critical control-plane events, such as configuration changes, security group updates, or API calls. This tool is particularly effective in identifying misconfigurations or breaking changes that might trigger cascading failures.

Consider a case where a security group ingress rule is inadvertently removed, causing connectivity issues between services. The Infrastructure Tool can detect and correlate this event with the reported incident, providing the agent with actionable insights to guide its RCA process.
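
Correlating control-plane changes with an incident can be sketched as a filter over event records. The following example assumes each event is a dict with EventName and EventTime keys, similar in shape to CloudTrail event lookup results. The sample events, the two-hour lookback, and the function name are hypothetical, and this is not BMW’s Infrastructure Tool.

```python
from datetime import datetime, timedelta

# EC2 API actions that change security group rules
SECURITY_GROUP_EVENTS = {
    "AuthorizeSecurityGroupIngress",
    "RevokeSecurityGroupIngress",
    "AuthorizeSecurityGroupEgress",
    "RevokeSecurityGroupEgress",
    "ModifySecurityGroupRules",
}


def changes_before_incident(events, incident_time, lookback=timedelta(hours=2)):
    """Return security group changes in the lookback window, newest first."""
    window_start = incident_time - lookback
    matches = [
        event for event in events
        if event["EventName"] in SECURITY_GROUP_EVENTS
        and window_start <= event["EventTime"] <= incident_time
    ]
    return sorted(matches, key=lambda event: event["EventTime"], reverse=True)


# Hypothetical events around an incident reported at 10:00
incident = datetime(2025, 1, 15, 10, 0)
events = [
    {"EventName": "RevokeSecurityGroupIngress", "EventTime": datetime(2025, 1, 15, 9, 40)},
    {"EventName": "DescribeInstances", "EventTime": datetime(2025, 1, 15, 9, 50)},
    {"EventName": "AuthorizeSecurityGroupIngress", "EventTime": datetime(2025, 1, 14, 9, 0)},
]
candidates = changes_before_incident(events, incident)
```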

By combining these tools, the Amazon Bedrock agent mimics the step-by-step reasoning of an experienced engineer while executing tasks at machine speed. The modular nature of the tools allows for flexibility and customization, making sure that RCA is tailored to the unique needs of BMW’s complex, multi-regional cloud infrastructure.

In the next section, we discuss how these tools work together within the agent’s workflow.

Amazon Bedrock agents: The ReAct framework in action

At the heart of BMW’s rapid RCA lies the ReAct (Reasoning and Action) agent framework, an innovative approach that dynamically combines logical reasoning with task execution. By integrating ReAct with Amazon Bedrock, BMW gains a flexible solution for diagnosing and resolving complex cloud-based incidents. Unlike traditional methods, which rely on predefined workflows, ReAct agents use real-time inputs and iterative decision-making to adapt to the specific circumstances of an incident.

The ReAct agent in BMW’s RCA solution uses a structured yet adaptive workflow to diagnose and resolve issues. First, it interprets the textual description of an incident (for example, “Vehicle doors cannot be locked via the app”) to identify which parts of the system are most likely impacted. Guided by the ReAct framework’s iterative reasoning, the agent then gathers evidence by calling specialized tools, using data centrally aggregated in a cross-account observability setup. By continuously reevaluating the results of each tool invocation, the agent zeros in on potential causes—whether an expired certificate, a revoked firewall rule, or a spike in traffic—until it isolates the root cause. The following diagram illustrates this workflow.

The ReAct framework offers the following benefits:

  • Dynamic and adaptive – The ReAct agent tailors its approach to the specific incident, rather than a one-size-fits-all methodology. This adaptability is especially critical in BMW’s multi-regional, multi-service architecture.
  • Efficient tool utilization – By reasoning about which tools to invoke and when, the ReAct agent minimizes redundant queries, providing faster diagnostics without overloading AWS services like CloudWatch or CloudTrail.
  • Human-like reasoning – The ReAct agent mimics the logical thought process of a seasoned engineer, iteratively exploring hypotheses until it identifies the root cause. This capability bridges the gap between automation and human expertise.

By employing Amazon Bedrock ReAct agents, BMW achieves significantly lower diagnosis times. These agents not only enhance operational efficiency but also empower engineers to focus on strategic improvements rather than labor-intensive diagnostics.
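
The reason-then-act cycle can be sketched as a short loop. The following example is a toy illustration and does not use the Amazon Bedrock Agents API. A scripted function stands in for the LLM’s reasoning step, the tools return canned strings, and all names are hypothetical.

```python
def run_react_loop(incident, choose_action, tools, max_steps=5):
    """Alternate between reasoning (choose_action) and acting (tool calls).

    choose_action(incident, observations) returns ("call", tool_name) to
    gather more evidence or ("finish", hypothesis) to stop.
    """
    observations = []
    for _ in range(max_steps):
        kind, value = choose_action(incident, observations)
        if kind == "finish":
            return value, observations
        observations.append((value, tools[value](incident)))
    return "no conclusion within step budget", observations


# Canned tool outputs standing in for real log, metric, and CloudTrail lookups
TOOLS = {
    "architecture": lambda incident: "iOS app -> backend-for-frontend API -> Remote Vehicle Management API",
    "logs": lambda incident: "fetch timeout error",
    "infrastructure": lambda incident: "security group ingress rule removed",
}


def scripted_policy(incident, observations):
    """Stand-in for the LLM: call each tool once in a fixed order, then conclude."""
    plan = ["architecture", "logs", "infrastructure"]
    if len(observations) < len(plan):
        return ("call", plan[len(observations)])
    return ("finish", "security group change blocks traffic to the Remote Vehicle Management API")


hypothesis, trace = run_react_loop(
    "Users of the iOS app cannot unlock car doors remotely.", scripted_policy, TOOLS
)
```

In a real agent, the model decides which tool to call next based on the observations so far, which is what lets it skip irrelevant tools or backtrack to another chain of components.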

Case study: Root cause analysis “Unlocking vehicles via the iOS app”

To illustrate the power of Amazon Bedrock agents in action, let us explore a possible real-world scenario involving the interplay between BMW’s connected fleet and the digital services running in the cloud backend.

We deliberately change the security group for the central networking account in a test environment. This has the effect that requests from the fleet are (correctly) blocked by the changed security group and do not reach the services hosted in the backend. Hence, a test user cannot lock or unlock her vehicle door remotely.

Incident details

BMW engineers received a report from a tester indicating the remote lock/unlock functionality on the mobile app does not work.

This report raised immediate questions: was the issue in the app itself, the backend-for-frontend service, or deeper within the system, such as in the MQTT connectivity or authentication mechanisms?

How the ReAct agent addresses the problem

The problem is described to the Amazon Bedrock ReAct agent: “Users of the iOS app cannot unlock car doors remotely.” The agent immediately begins its analysis:

  1. The agent begins by understanding the overall system architecture, calling the Architecture Tool. The outputs of the architecture tool reveal that the iOS app, like the Android app, is connected to a backend-for-frontend API, and that the backend-for-frontend API itself is connected to several other internal APIs, such as the Remote Vehicle Management API. The Remote Vehicle Management API is responsible for sending commands to cars by using MQTT messaging.
  2. The agent uses the other tools at its disposal in a targeted way: it scans the logs, metrics, and control plane activities of only those components that are involved in remotely unlocking car doors: iOS app remote logs, backend-for-frontend API logs, and so on. The agent finds several clues:
    1. Anomalous logs that indicate connectivity issues (network timeouts).
    2. A sharp decrease in the number of successful invocations of the Remote Vehicle Management API.
    3. Control plane activities: several security groups in the central networking account hosted on the testing environment were changed.
  3. Based on those findings, the agent infers and defines several hypotheses and presents these to the user, ordered by their likelihood. In this case, the first hypothesis is the actual root cause: a security group was inadvertently changed in the central networking account, which meant that network traffic between the backend-for-frontend and the Remote Vehicle Management API was now blocked. The agent correctly correlated logs (“fetch timeout error”), metrics (decrease in invocations) and control plane changes (security group ingress rule removed) to come to this conclusion.
  4. If the on-call engineer wants further information, they can now ask follow-up questions to the agent, or instruct the agent to investigate elsewhere as well.

The entire process—from incident detection to resolution—took minutes, compared to the hours it could have taken with traditional RCA methods. The ReAct agent’s ability to dynamically reason, access cross-account observability data, and iterate on its hypotheses alleviated the need for tedious manual investigations.

Conclusion

By using Amazon Bedrock ReAct agents, BMW has shown how to improve its approach to root cause analysis, turning a complex and manual process into an efficient, automated workflow. The tools integrated within the ReAct framework significantly narrow down the potential reasoning space and enable dynamic hypothesis generation and targeted diagnostics, mimicking the reasoning process of seasoned engineers while operating at machine speed. This innovation has reduced the time required to identify and resolve service disruptions, further enhancing the reliability of BMW’s connected services and improving the experience for millions of customers worldwide.

The solution has demonstrated measurable success, with the agent identifying root causes in 85% of test cases and providing detailed insights in the remainder, greatly expediting engineers’ investigations. By lowering the barrier to entry for junior engineers, it has enabled less-experienced team members to diagnose issues effectively, maintaining reliability and scalability across BMW’s operations.

Incorporating generative AI into RCA processes showcases the transformative potential of AI in modern cloud-based operations. The ability to adapt dynamically, reason contextually, and handle complex, multi-regional infrastructures makes Amazon Bedrock Agents a game changer for organizations aiming to maintain high availability in their digital services.

As BMW continues to expand its connected fleet and digital offerings, the adoption of generative AI-driven solutions like Amazon Bedrock will play an important role in maintaining operational excellence and delivering seamless experiences to customers. By following BMW’s example, your organization can also benefit from Amazon Bedrock Agents for root cause analysis to enhance service reliability.

Get started by exploring Amazon Bedrock Agents to optimize your incident diagnostics or use CloudWatch Logs Insights to identify anomalies in your system logs. If you want a hands-on introduction to creating your own Amazon Bedrock agents—complete with code examples and best practices—check out the following GitHub repo. These tools are setting a new industry standard for efficient RCA and operational excellence.


About the Authors

Johann Wildgruber is a transformation lead reliability engineer at BMW Group, working currently to set up an observability platform to strengthen the reliability of ConnectedDrive services. Johann has several years of experience as a product owner in operating and developing large and complex cloud solutions. He is interested in applying new technologies and methods in software development.

Dr. Jens Kohl is a technology leader and builder with over 13 years of experience at the BMW Group. He is responsible for shaping the architecture and continuous optimization of the Connected Vehicle cloud backend. Jens has been leading software development and machine learning teams with a focus on embedded, distributed systems and machine learning for more than 10 years.

Thilo Bindel is leading the Offboard Reliability & Data Engineering team at BMW Group. He is responsible for defining and implementing strategies to ensure reliability, availability, and maintainability of BMW’s backend services in the Connected Vehicle domain. His goal is to establish reliability and data engineering best practices consistently across the organization and to position the BMW Group as a leader in data-driven observability within the automotive industry and beyond.

Luisa-Sophie Gloger is a Data Scientist at the BMW Group with a focus on Machine Learning. As a lead developer within the Connected Company’s Connected AI platform team, she enjoys helping teams to improve their products and workflows with Generative AI. She also has a background in working on Natural Language processing (NLP) and a degree in psychology.

Tanrajbir Takher is a Data Scientist at AWS’s Generative AI Innovation Center, where he works with enterprise customers to implement high-impact generative AI solutions. Prior to AWS, he led research for new products at a computer vision unicorn and founded an early generative AI startup.

Otto Kruse is a Principal Solutions Developer within AWS Industries – Prototyping and Customer Engineering (PACE), a multi-disciplinary team dedicated to helping large companies utilize the potential of the AWS cloud by exploring and implementing innovative ideas. Otto focuses on application development and security.

Huong Vu is a Data Scientist at the AWS Generative AI Innovation Center. She drives projects to deliver generative AI applications for enterprise customers from a diverse range of industries. Prior to AWS, she worked on improving NLP models for the Alexa shopping assistant both on the Amazon.com website and on Echo devices.

Aishwarya is a Senior Customer Solutions Manager with AWS Automotive. She is passionate about solving business problems using Generative AI and cloud-based technologies.

Satyam Saxena is an Applied Science Manager on the AWS Generative AI Innovation Center team. He leads generative AI customer engagements, driving innovative ML/AI initiatives from ideation to production with over a decade of experience in machine learning and data science. His research interests include deep learning, computer vision, NLP, recommender systems, and generative AI.

Kim Robins, a Senior AI Strategist at AWS’s Generative AI Innovation Center, leverages his extensive artificial intelligence and machine learning expertise to help organizations develop innovative products and refine their AI strategies, driving tangible business value.
