How Formula 1® uses generative AI to accelerate race-day issue resolution

Formula 1® (F1) races are high-stakes affairs where operational efficiency is paramount. During these live events, F1 IT engineers must triage critical issues across its services, such as network degradation to one of its APIs. This impacts downstream services that consume data from the API, including products such as F1 TV, which offers live and on-demand coverage of every race as well as real-time telemetry. Determining the root cause of these issues and preventing them from recurring takes significant effort. Due to the event schedule and change freeze periods, it can take up to 3 weeks to triage, test, and resolve a critical issue, requiring investigations across teams including development, operations, infrastructure, and networking.

“We used to have a recurring issue with the web API system, which was slow to respond and provided inconsistent outputs. Teams spent around 15 full engineer days to iteratively resolve the issue over several events: reviewing logs, inspecting anomalies, and iterating on the fixes,” says Lee Wright, head of IT Operations at Formula 1. Recognizing this challenge as an opportunity for innovation, F1 partnered with Amazon Web Services (AWS) to develop an AI-driven solution using Amazon Bedrock to streamline issue resolution. In this post, we show you how F1 created a purpose-built root cause analysis (RCA) assistant to empower users such as operations engineers, software developers, and network engineers to troubleshoot issues, narrow down the root cause, and significantly reduce the manual intervention required to fix recurrent issues during and after live events. We’ve also provided a GitHub repo for a general-purpose version of the accompanying chat-based application.

Users can ask the RCA chat-based assistant questions using natural language prompts, with the solution troubleshooting in the background, identifying potential reasons for the incident and recommending next steps. The assistant is connected to internal and external systems, with the capability to query various sources such as SQL databases, Amazon CloudWatch logs, and third-party tools to check the live system health status. Because the solution doesn’t require domain-specific knowledge, it even allows engineers of different disciplines and levels of expertise to resolve issues.

“With the RCA tool, the team could narrow down the root cause and implement a solution within 3 days, including deployments and testing over a race weekend. The system not only saves time on active resolution, it also routes the issue to the correct team to resolve, allowing teams to focus on other high-priority tasks, like building new products to enhance the race experience,” adds Wright. By using generative AI, engineers can receive a response within 5–10 seconds on a specific query and reduce the initial triage time from more than a day to less than 20 minutes. The end-to-end time to resolution has been reduced by as much as 86%.

Implementing the root cause analysis solution architecture

In collaboration with the AWS Prototyping team, F1 embarked on a 5-week prototype to demonstrate the feasibility of this solution. The objective was to use AWS to replicate and automate the current manual troubleshooting process for two candidate systems. As a starting point, the team reviewed real-life issues, drafting a flowchart outlining 1) the troubleshooting process, 2) teams and systems involved, 3) required live checks, and 4) log investigations required for each scenario. The following is a diagram of the solution architecture.

architecture diagram for a root cause analysis solution

To handle the log data efficiently, raw logs were centralized into an Amazon Simple Storage Service (Amazon S3) bucket. An Amazon EventBridge schedule checked this bucket hourly for new files and triggered log transformation extract, transform, and load (ETL) pipelines built using AWS Glue and Apache Spark. The transformed logs were stored in a separate S3 bucket, while another EventBridge schedule fed these transformed logs into Amazon Bedrock Knowledge Bases, an end-to-end managed Retrieval Augmented Generation (RAG) workflow capability, allowing the chat assistant to query them efficiently. Amazon Bedrock Agents facilitates interaction with internal systems such as databases and Amazon Elastic Compute Cloud (Amazon EC2) instances and external systems such as Jira and Datadog. Anthropic’s Claude 3 models (the latest model at the time of development) were used to orchestrate and generate high-quality responses, maintaining accurate and relevant information from the chat assistant. Finally, the chat application is hosted as an Amazon Elastic Container Service (Amazon ECS) service running on AWS Fargate, providing scalability and reliability to handle variable loads without compromising performance.

The following sections further explain the main components of the solution: ETL pipelines to transform the log data, agentic RAG implementation, and the chat application.

Creating ETL pipelines to transform log data

Preparing your data to provide quality results is the first step in an AI project. AWS helps you improve your data quality over time so you can innovate with trust and confidence. Amazon CloudWatch gives you visibility into system-wide performance and allows you to set alarms, automatically react to changes, and gain a unified view of operational health.

For this solution, AWS Glue and Apache Spark handled data transformations from these logs and other data sources to improve the chatbot’s accuracy and cost efficiency. AWS Glue helps you discover, prepare, and integrate your data at scale. For this project, there was a simple three-step process for the log data transformation. The following is a diagram of the data processing flow.

diagram showing steps to create an ETL pipeline
  1. Data standardization: Schemas, types and formats – Conforming the data to a unified format helps the chat assistant understand the data more thoroughly, improving output accuracy. To enable Amazon Bedrock Knowledge Bases to ingest data consumed from different sources and formats (such as structure, schema, column names, timestamp formats), the data must first be standardized.
  2. Data filtering: Removing unnecessary data – To improve the chat assistant’s performance further, it’s important to reduce the amount of data to scan. A simple way to do that is to determine which data columns wouldn’t be used by the chat assistant. This removed a considerable amount of data in the ETL process even before ingesting into the knowledge base, and it reduced costs in the embeddings process because less data needs to be transformed, tokenized, and stored in the vector database. All this helps improve the chat assistant’s accuracy, performance, and cost. For example, the chat assistant doesn’t need all the headers from some HTTP requests, but it does need the host and user agent.
  3. Data aggregation: Reducing data size – Users only need to know by the minute when a problem occurred, so aggregating data at the minute level helped to reduce the data size. For example, when there are 60 data points per minute with API response times, data was aggregated to a single data point per minute. This single aggregated event contains attributes such as the maximum time taken to fulfill a request, helping the chat assistant identify whether the response time was high—again reducing the data needed to analyze the issue. A PySpark sketch of the filtering and aggregation steps follows this list.
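
To make the transformation concrete, the following is a minimal PySpark sketch of the filtering and aggregation steps as they could be implemented in an AWS Glue job. The column names (request_time, host, user_agent, status_code, response_time_ms) and S3 paths are hypothetical placeholders, not F1’s actual schema.

# Minimal PySpark sketch of the filtering (step 2) and aggregation (step 3) steps.
# Column names and S3 paths are hypothetical; adjust them to your own log schema.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("log-etl").getOrCreate()

raw = spark.read.json("s3://raw-logs-bucket/api/")

# Step 2: keep only the columns the chat assistant actually needs
filtered = raw.select("request_time", "host", "user_agent", "status_code", "response_time_ms")

# Step 3: aggregate to a single data point per minute, keeping the worst response time
aggregated = (
    filtered
    .withColumn("minute", F.date_trunc("minute", F.col("request_time").cast("timestamp")))
    .groupBy("minute", "host")
    .agg(
        F.max("response_time_ms").alias("max_response_time_ms"),
        F.count("*").alias("request_count"),
    )
)

aggregated.write.mode("overwrite").json("s3://transformed-logs-bucket/api/")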

Building the RCA assistant with Amazon Bedrock Agents and Amazon Bedrock Knowledge Bases

Amazon Bedrock was used to build an agentic (agent-based) RAG solution for the RCA assistant. Amazon Bedrock Agents streamlines workflows and automates repetitive tasks. It uses the reasoning capability of foundation models (FMs) to break down user-requested tasks into multiple steps, creates an orchestration plan from the provided instructions, and then carries out the plan by invoking company APIs and accessing knowledge bases using RAG to provide a final response to the end user.

Knowledge bases are essential to the RAG framework, querying business data sources and adding relevant context to answer your questions. Amazon Bedrock Agents also allows interaction with internal and external systems, such as querying database statuses to check their health, querying Datadog for live application monitoring, and raising Jira tickets for future analysis and investigation. Anthropic’s Claude 3 Sonnet model was selected for its informative and comprehensive answers and its ability to understand diverse questions. For example, it can correctly interpret user input date formats such as “2024-05-10” or “10th May 2024.”

Amazon Bedrock Agents integrates with Amazon Bedrock Knowledge Bases, providing the end user with a single and consolidated frontend. The RCA agent considers the tools and knowledge bases available, then intelligently and autonomously creates an execution plan. After the agent receives documents from the knowledge base and responses from tool APIs, it consolidates the information to feed it to the large language model (LLM) and generate the final response. The following diagram illustrates the orchestration flow.
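
To illustrate how a client application talks to the agent, the following is a minimal sketch using the InvokeAgent operation of the bedrock-agent-runtime client in boto3. The agent ID, alias ID, and prompt are placeholders.

# Minimal sketch of calling an Amazon Bedrock agent from a client application.
# The agent ID, alias ID, and prompt are placeholders.
import uuid

import boto3

client = boto3.client("bedrock-agent-runtime")

response = client.invoke_agent(
    agentId="AGENT_ID",
    agentAliasId="AGENT_ALIAS_ID",
    sessionId=str(uuid.uuid4()),  # reuse the same ID to keep conversation context
    inputText="The web API has been responding slowly since 14:05 UTC. What is the likely root cause?",
    enableTrace=True,             # also return the orchestration trace
)

# The response is a stream of completion chunks and optional trace events
answer = ""
for event in response["completion"]:
    if "chunk" in event:
        answer += event["chunk"]["bytes"].decode("utf-8")
    elif "trace" in event:
        pass  # the trace can be surfaced in the UI for inspection

print(answer)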

architecture diagram for an agentic rag chat assistant

Systems security

With Amazon Bedrock, you have full control over the data used to customize the FMs for generative AI applications such as RCA. Data is encrypted in transit and at rest. Identity-based policies provide further control over your data, helping you manage what actions roles can perform, on which resources, and under what conditions.

To evaluate the system health of RCA, the agent runs a series of checks, such as AWS Boto3 API calls (for example, boto3_client.describe_security_groups, to determine if an IP address is allowed to access the system) or database SQL queries (SQL: sys.dm_os_schedulers, to query database system metrics such as CPU, memory, or user locks).

To help protect these systems against potential hallucinations or even prompt injections, agents aren’t allowed to create their own database queries or system health checks on the fly. Instead, a series of controlled SQL queries and API checks were implemented, following the principle of least privilege (PoLP). This layer also validates the input and output schema (see Powertools docs), making sure this aspect is also controlled. To learn more about protecting your application, refer to the ArXiv paper, From Prompt Injections to SQL Injection Attacks. The following code is an example.

"""
- Health Checks: one explicit function per Health Check, to avoid potential LLM hallucinations or risky syntax errors.
- DB is KMS-encrypted and behind private subnets. Connection uses Least-Privileges and Secrets Manager
- Schema is protected using OpenAPI, via AWS Lambda Powertools BedrockAgentResolver
"""

from typing import List, Annotated
from helpers import run_sql_query, check_ec2_port_access
from aws_lambda_powertools.event_handler.bedrock_agent import BedrockAgentResolver 
from aws_lambda_powertools.event_handler.openapi.params import Query, Body
from aws_lambda_powertools import Metrics, Tracer, Logger
from aws_lambda_powertools.metrics import MetricUnit

# Initialize Agents, Metrics, Loggers and Tracers
app = BedrockAgentResolver()
metrics = Metrics(namespace="rca-stack-api-logs", service="HealthChecks")
tracer = Tracer()
logger = Logger(level='INFO')

@tracer.capture_method
@app.get("/checkDatabaseCPUMemory", description='Checks the CPU and Memory usage, for the Database server.')
def check_db_cpu_memory() -> Annotated[List, Body(description='Returns Database CPU and Memory metrics')]:
    response = run_sql_query('db_cpu_memory')
    metrics.add_metric(name="DBCpuMemory", unit=MetricUnit.Count, value=1)
    logger.info(response)

    return response

Frontend application: The chat assistant UI

The chat assistant UI was developed using the Streamlit framework, which is Python-based and provides simple yet powerful application widgets. In the Streamlit app, users can test their Amazon Bedrock agent iterations seamlessly by providing or replacing the agent ID and alias ID. In the chat assistant, the full conversation history is displayed, and the conversation can be reset by choosing Clear. The response from the LLM application consists of two parts. On the left is the final neutral response based on the user’s questions. On the right is the trace of LLM agent orchestration plans and executions, which is hidden by default to keep the response clean and concise. The trace can be reviewed and examined by the user to make sure that the correct tools are invoked and the correct documents are retrieved by the LLM chatbot.
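
The following is a minimal Streamlit sketch of the UI pattern described above, assuming a hypothetical invoke_rca_agent() helper that wraps the InvokeAgent call shown earlier and returns both the final answer and the orchestration trace; the widget names and layout are illustrative rather than F1’s actual implementation.

# Minimal Streamlit sketch of the chat assistant UI. invoke_rca_agent() is a
# hypothetical helper that wraps the InvokeAgent call shown earlier and returns
# the final answer together with the orchestration trace.
import streamlit as st

from agent_client import invoke_rca_agent  # hypothetical helper module

st.title("RCA Assistant")

agent_id = st.sidebar.text_input("Agent ID")
agent_alias_id = st.sidebar.text_input("Agent alias ID")

clear = st.sidebar.button("Clear")
if "messages" not in st.session_state or clear:
    st.session_state.messages = []

# Replay the full conversation history
for message in st.session_state.messages:
    with st.chat_message(message["role"]):
        st.markdown(message["content"])

if prompt := st.chat_input("Describe the issue"):
    st.session_state.messages.append({"role": "user", "content": prompt})
    answer, trace = invoke_rca_agent(agent_id, agent_alias_id, prompt)
    st.session_state.messages.append({"role": "assistant", "content": answer})

    answer_col, trace_col = st.columns(2)
    with answer_col:
        st.markdown(answer)
    with trace_col:
        with st.expander("Agent trace", expanded=False):
            st.json(trace)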

A general-purpose version of the chat-based application is available from this GitHub repo, where you can experiment with the solution and modify it for additional use cases.

In the following demo, the scenario involves user complaints that they can’t connect to F1 databases. Using the chat assistant, users can check if the database driver version they’re using is supported by the server. Additionally, users can verify EC2 instance network connectivity by providing the EC2 instance ID and AWS Region. These checks are performed by API tools accessible by the agent. Furthermore, users can troubleshoot website access issues by checking system logs. In the demo, users provide an error code and date, and the chat assistant retrieves relevant logs from Amazon Bedrock Knowledge Bases to answer their questions and provide information for future analysis.

Technical engineers can now investigate system errors and issues using natural language queries. The assistant is integrated with existing incident management tools (such as Jira) to facilitate seamless communication and ticket creation. In most cases, the chat assistant can quickly identify the root cause and provide remediation recommendations, even if multiple issues are present. When warranted, particularly challenging issues are automatically escalated to the F1 engineering team for investigation, allowing engineers to better prioritize their tasks.

Conclusion

In this post, we explained how F1 and AWS have developed a root cause analysis (RCA) assistant powered by Amazon Bedrock to reduce manual intervention and accelerate the resolution of recurrent operational issues during races from weeks to minutes. The RCA assistant enables the F1 team to spend more time on innovation and improving its services, ultimately delivering an exceptional experience for fans and partners. The successful collaboration between F1 and AWS showcases the transformative potential of generative AI in empowering teams to accomplish more in less time.

Learn more about how AWS helps F1 on and off the track.


About the Author

Carlos Contreras is a Senior Big Data and Generative AI Architect, at Amazon Web Services. Carlos specializes in designing and developing scalable prototypes for customers, to solve their most complex business challenges, implementing RAG and Agentic solutions with Distributed Data Processing techniques.

Hin Yee Liu is a Senior Prototyping Engagement Manager at Amazon Web Services. She helps AWS customers to bring their big ideas to life and accelerate the adoption of emerging technologies. Hin Yee works closely with customer stakeholders to identify, shape and deliver impactful use cases leveraging Generative AI, AI/ML, Big Data, and Serverless technologies using agile methodologies. In her free time, she enjoys knitting, travelling and strength training.

Olga Miloserdova is an Innovation Lead at Amazon Web Services, where she supports executive leadership teams across industries to drive innovation initiatives leveraging Amazon’s customer-centric Working Backwards methodology.

Ying Hou, PhD is a Senior GenAI Prototyping Architect at AWS, where she collaborates with customers to build cutting-edge GenAI applications, specialising in RAG and agentic solutions. Her expertise spans GenAI, ASR, Computer Vision, NLP, and time series prediction models. When she’s not architecting AI solutions, she enjoys spending quality time with her family, getting lost in novels, and exploring the UK’s national parks.

Read More

Using Amazon Rekognition to improve bicycle safety

Cycling is a fun way to stay fit, enjoy nature, and connect with friends and acquaintances. However, riding is becoming increasingly dangerous, especially in situations where cyclists and cars share the road. According to the NHTSA, in the United States an average of 883 people on bicycles are killed in traffic crashes each year, with an average of about 45,000 injury-only crashes reported annually. While total bicycle fatalities account for just over 2% of all traffic fatalities in the United States, as a cyclist, it’s still terrifying to be pushed off the road by a large SUV or truck. To better protect themselves, many cyclists are starting to ride with cameras mounted to the front or back of their bicycle. In this blog post, I will demonstrate a machine learning solution that cyclists can use to better identify close calls.

Many US states and countries throughout the world have some sort of 3-feet law. A 3-feet law requires motor vehicles to provide about 3 feet (1 meter) of distance when passing a bicycle. To promote safety on the road, cyclists are increasingly recording their rides, and if they encounter a dangerous situation where they aren’t given an appropriate safe distance, they can provide a video of the encounter to local law enforcement to help correct behavior. However, finding a single encounter in a recording of a multi-hour ride is time consuming and often requires specialized video skills to generate a short clip of the encounter.

To solve some of these problems, I have developed a simple solution using Amazon Rekognition video analysis. Amazon Rekognition can detect labels (essentially objects) and the timestamp of when that object is detected in a video. Amazon Rekognition can be used to quickly find any vehicles that appear in the video of a recorded ride.

If a cyclist’s camera records a passing vehicle, the solution must then determine whether the vehicle is too close to the bicycle—in other words, if the vehicle is within the 3-foot range set by law. If it is, then I want to generate a clip of the encounter, which can be provided to the relevant authorities. The following figure shows the view from a cyclist’s camera with bounding boxes that identify a vehicle that’s passing too close to the bicycle. A box at the bottom of the image shows the approximate 3-foot area around the bicycle.

A red bounding box identifies a vehicle, while a green bounding box identifies the location of the bicycle. The boxes overlap, showing the vehicle is too close to the bicycle.

Solution overview

The architecture of the solution is shown in the following figure.

A video recording is uploaded to an S3 bucket, where it is processed, close encounters are detected, video is extracted, and links to the encounter are provided

The steps of the solution are:

  1. When a cyclist completes a ride, they upload their MP4 videos from the ride into an Amazon Simple Storage Service (Amazon S3) bucket.
  2. The bucket has been configured with an S3 event notification that sends object created notifications to an AWS Lambda function.
  3. The Lambda function kicks off an AWS Step Functions workflow that begins by calling the StartLabelDetection API, part of Amazon Rekognition video analysis. The StartLabelDetection API is configured to detect Bus, Car, Fire Truck, Pickup Truck, Truck, Limo, and Moving Van as labels. It ignores other related non-vehicle labels like License Plate, Wheel, Tire, and Car Mirror.
  4. The Amazon Rekognition API returns JSON identifying the selected labels and timestamps of detected objects.
  5. This JSON result is sent to a Lambda function that performs the geometry math to determine if a vehicle box overlapped with the bicycle safe area (see the sketch after this list).
  6. Any detected encounters are passed to AWS Elemental MediaConvert, which creates snippets of video corresponding to the detected encounters using the CreateJob API.
  7. MediaConvert creates these videos and uploads them to an S3 bucket.
  8. Another Lambda function is called to generate pre-signed URLs of the videos. This allows the videos to be temporarily downloaded by anyone with the pre-signed URL.
  9. Amazon Simple Notification Service (Amazon SNS) sends an email message with links to the pre-signed URLs.
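
The geometry check in step 5 can be as simple as testing whether a detected vehicle bounding box intersects the approximate 3-foot area around the bicycle. The following sketch assumes boxes in Amazon Rekognition’s BoundingBox format (Left, Top, Width, and Height as ratios of the frame); the safe-area coordinates are placeholders you would tune for your own camera.

# Sketch of the overlap check in step 5. Boxes use Amazon Rekognition's BoundingBox
# format: Left, Top, Width, and Height expressed as ratios of the frame dimensions.
# The safe-area box below is a placeholder to tune for your own camera position.

BICYCLE_SAFE_AREA = {"Left": 0.25, "Top": 0.70, "Width": 0.50, "Height": 0.30}

def boxes_overlap(a: dict, b: dict) -> bool:
    """Return True if two Rekognition-style bounding boxes intersect."""
    a_right, a_bottom = a["Left"] + a["Width"], a["Top"] + a["Height"]
    b_right, b_bottom = b["Left"] + b["Width"], b["Top"] + b["Height"]
    return not (
        a_right < b["Left"]     # a is entirely to the left of b
        or b_right < a["Left"]  # b is entirely to the left of a
        or a_bottom < b["Top"]  # a is entirely above b
        or b_bottom < a["Top"]  # b is entirely above a
    )

def find_encounters(label_detections: list) -> list:
    """Return timestamps (in milliseconds) where a detected vehicle enters the safe area.

    label_detections is the Labels list from a GetLabelDetection response, where each
    entry has a Timestamp and a Label with zero or more Instances (bounding boxes).
    """
    timestamps = []
    for detection in label_detections:
        for instance in detection["Label"].get("Instances", []):
            if boxes_overlap(instance["BoundingBox"], BICYCLE_SAFE_AREA):
                timestamps.append(detection["Timestamp"])
                break
    return timestamps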

Prerequisites

To use the solution outlined in this post, you must have:

  1. An AWS account with appropriate permissions to allow you to deploy AWS CloudFormation stacks
  2. A video recording in MP4 format with the .MP4 extension using the H.264 codec. The video should be from a front or rear-facing camera, from any off-the-shelf vendor (for example GoPro, DJI, or Cycliq). The maximum file size is 10 GB.

Deploying the solution

  1. Deploy this solution in your environment or select Launch Stack. This solution will deploy in the US East (N. Virginia) us-east-1 AWS Region.

Launch stack

  2. The Create stack page from the CloudFormation dashboard appears. At the bottom of the page, choose Next.
  3. On the Specify stack details page, enter the email address where you’d like to receive notifications. Choose Next.
  4. Select the box that says I acknowledge that AWS CloudFormation might create IAM resources, then choose Next. Choose Submit and the installation will begin. The solution takes about 5 minutes to install.
  5. You will receive an email confirming your Amazon SNS subscription. You will not receive emails from the solution unless you confirm your subscription.
  6. After the stack completes, select the Outputs tab and take note of the bucket name listed under InputBucket.

Using the solution

To test the solution, I have a sample video where I asked a stunt driver to drive very close to me.

To begin the video processing, I upload the video to the S3 bucket (the InputBucket from the Outputs tab). The bucket has encryption enabled, so under Properties, I choose Specify an encryption key and select Use bucket settings for default encryption. Choosing Upload begins the upload process, as shown in the following figure.

Uploading the video to S3, I specify the file and the settings for encryption

After a moment, the step function begins processing. After a few minutes, you will receive an email with links to any encounters identified, as shown in the following figure.

An email that contains links to view the detected encounters

In my case, it identified two encounters. In the first, I rode too close to a parked car. The second, however, shows the dangerous encounter I staged with my stunt driver.

Had this been an actual dangerous encounter, the video clip could be provided to the appropriate authorities to help change behavior and make the road safer for everyone.

Pricing

Because this is a fully serverless solution, you only pay for what you use. With Amazon Rekognition, you pay for the minutes of video that are processed. With MediaConvert, you pay for normalized minutes of video processed, which is each minute of video output with multipliers that apply based on features used. The solution’s use of Lambda, Step Functions, and SNS is minimal and will likely fall under the free tier for most users.

Clean up

To delete the resources created as part of this solution, go to the CloudFormation console, select the stack that was deployed, and choose Delete.

Conclusion

In this example I demonstrated how to use Amazon Rekognition video analysis in a unique scenario. Amazon Rekognition is a powerful computer vision tool that allows you to get insights out of images or video without the overhead of building or managing a machine learning model. Of course, Amazon Rekognition can also handle more advanced use cases than the one I demonstrated here.

I also showed how combining Amazon Rekognition with other serverless services can yield a serverless video processing workflow that, in this case, helps improve the safety of cyclists. While you might not be an avid cyclist, the solution demonstrated here can be extended to a variety of use cases and industries. For example, this solution could be extended to detect wildlife on nature cameras, or you could use Amazon Rekognition streaming video events to detect people and packages in security video.

Get started today by using Amazon Rekognition for your computer vision use case.


About the Author

Mike George is a Principal Solutions Architect at Amazon Web Services (AWS) based in Salt Lake City, Utah. He enjoys helping customers solve their technology problems. His interests include software engineering, security, artificial intelligence (AI), and machine learning (ML).

Read More

Build a dynamic, role-based AI agent using Amazon Bedrock inline agents

AI agents continue to gain momentum, as businesses use the power of generative AI to reinvent customer experiences and automate complex workflows. We are seeing Amazon Bedrock Agents applied in investment research, insurance claims processing, root cause analysis, advertising campaigns, and much more. Agents use the reasoning capability of foundation models (FMs) to break down user-requested tasks into multiple steps. They use developer-provided instructions to create an orchestration plan and carry out that plan by securely invoking company APIs and accessing knowledge bases using Retrieval Augmented Generation (RAG) to accurately handle the user’s request.

Although organizations see the benefit of agents that are defined, configured, and tested as managed resources, we have increasingly seen the need for an additional, more dynamic way to invoke agents. Organizations need solutions that adjust on the fly—whether to test new approaches, respond to changing business rules, or customize solutions for different clients. This is where the new inline agents capability in Amazon Bedrock Agents becomes transformative. It allows you to dynamically adjust your agent’s behavior at runtime by changing its instructions, tools, guardrails, knowledge bases, prompts, and even the FMs it uses—all without redeploying your application.

In this post, we explore how to build an application using Amazon Bedrock inline agents, demonstrating how a single AI assistant can adapt its capabilities dynamically based on user roles.

Inline agents in Amazon Bedrock Agents

This runtime flexibility enabled by inline agents opens powerful new possibilities, such as:

  • Rapid prototyping – Inline agents minimize the time-consuming create/update/prepare cycles traditionally required for agent configuration changes. Developers can instantly test different combinations of models, tools, and knowledge bases, dramatically accelerating the development process.
  • A/B testing and experimentation – Data science teams can systematically evaluate different model-tool combinations, measure performance metrics, and analyze response patterns in controlled environments. This empirical approach enables quantitative comparison of configurations before production deployment.
  • Subscription-based personalization – Software companies can adapt features based on each customer’s subscription level, providing more advanced tools for premium users.
  • Persona-based data source integration – Institutions can adjust content complexity and tone based on the user’s profile, providing persona-appropriate explanations and resources by changing the knowledge bases associated with the agent on the fly.
  • Dynamic tool selection – Developers can create applications with hundreds of APIs, and quickly and accurately carry out tasks by dynamically choosing a small subset of APIs for the agent to consider for a given request. This is particularly helpful for large software as a service (SaaS) platforms needing multi-tenant scaling.

Inline agents expand your options for building and deploying agentic solutions with Amazon Bedrock Agents. For workloads needing managed and versioned agent resources with a pre-determined and tested configuration (specific model, instructions, tools, and so on), developers can continue to use InvokeAgent on resources created with CreateAgent. For workloads that need dynamic runtime behavior changes for each agent invocation, you can use the new InvokeInlineAgent API. With either approach, your agents will be secure and scalable, with configurable guardrails, a flexible set of model inference options, native access to knowledge bases, code interpretation, session memory, and more.
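
The following is a minimal sketch of what a runtime-configured invocation could look like with the InvokeInlineAgent API in boto3. The model ID, instruction, Lambda ARN, and function definition are placeholders, and the exact parameter shapes should be confirmed against the current API reference.

# Minimal sketch of a runtime-configured invocation with InvokeInlineAgent.
# The model ID, instruction, Lambda ARN, and function definition are placeholders;
# confirm the exact parameter shapes against the current API reference.
import boto3

client = boto3.client("bedrock-agent-runtime")

response = client.invoke_inline_agent(
    sessionId="demo-session-1",
    foundationModel="anthropic.claude-3-sonnet-20240229-v1:0",
    instruction="You are an HR assistant. Only use the tools provided for this session.",
    actionGroups=[
        {
            "actionGroupName": "ApplyVacation",
            "actionGroupExecutor": {
                "lambda": "arn:aws:lambda:us-east-1:111122223333:function:apply-vacation"
            },
            "functionSchema": {
                "functions": [
                    {
                        "name": "apply_vacation",
                        "description": "Submit a vacation request for the current employee",
                        "parameters": {
                            "start_date": {"type": "string", "description": "Start date (YYYY-MM-DD)", "required": True},
                            "days": {"type": "integer", "description": "Number of vacation days", "required": True},
                        },
                    }
                ]
            },
        }
    ],
    inputText="I'd like to take 3 days off starting 2025-07-01.",
)

# The answer is streamed back as completion chunks
completion = "".join(
    event["chunk"]["bytes"].decode("utf-8")
    for event in response["completion"]
    if "chunk" in event
)
print(completion)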

Solution overview

Our HR assistant example shows how to build a single AI assistant that adapts to different user roles using the new inline agent capabilities in Amazon Bedrock Agents. When users interact with the assistant, the assistant dynamically configures agent capabilities (such as model, instructions, knowledge bases, action groups, and guardrails) based on the user’s role and their specific selections. This approach creates a flexible system that adjusts its functionality in real time, making it more efficient than creating separate agents for each user role or tool combination. The complete code for this HR assistant example is available on our GitHub repo.

This dynamic tool selection enables a personalized experience. When an employee logs in without direct reports, they see a set of tools that they have access to based on their role. They can select from options like requesting vacation time, checking company policies using the knowledge base, using a code interpreter for data analysis, or submitting expense reports. The inline agent assistant is then configured with only these selected tools, allowing it to assist the employee with their chosen tasks. In a real-world example, the user would not need to make the selection, because the application would make that decision and automatically configure the agent invocation at runtime. We make it explicit in this application so that you can see the impact.

Similarly, when a manager logs in to the same system, they see an extended set of tools reflecting their additional permissions. In addition to the employee-level tools, managers have access to capabilities like running performance reviews. They can select which tools they want to use for their current session, instantly configuring the inline agent with their choices.

The inclusion of knowledge bases is also adjusted based on the user’s role. Employees and managers see different levels of company policy information, with managers getting additional access to confidential data like performance review and compensation details. For this demo, we’ve implemented metadata filtering to retrieve only the appropriate level of documents based on the user’s access level, further enhancing efficiency and security.
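
To illustrate the idea, the following sketch shows role-based metadata filtering against a knowledge base. The attribute name access_level, the allowed values, and the knowledge base ID are assumptions for this example; the same filter shape can be supplied in the knowledge base configuration passed to the inline agent.

# Sketch of role-based metadata filtering against the knowledge base. The attribute
# name "access_level", the values, and the knowledge base ID are assumptions. Each
# document in the S3 data source would carry a sidecar <file>.metadata.json such as:
#   {"metadataAttributes": {"access_level": "manager"}}
import boto3

client = boto3.client("bedrock-agent-runtime")

def retrieve_policies(query: str, role: str, knowledge_base_id: str = "KB_ID") -> dict:
    # Employees see general documents only; managers also see manager-level documents
    allowed_levels = ["employee"] if role == "employee" else ["employee", "manager"]
    return client.retrieve(
        knowledgeBaseId=knowledge_base_id,
        retrievalQuery={"text": query},
        retrievalConfiguration={
            "vectorSearchConfiguration": {
                "numberOfResults": 5,
                "filter": {"in": {"key": "access_level", "value": allowed_levels}},
            }
        },
    )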

Let’s look at how the interface adapts to different user roles.

The employee view provides access to essential HR functions like vacation requests, expense submissions, and company policy lookups. Users can select which of these tools they want to use for their current session.

The manager view extends these capabilities to include supervisory functions like compensation management, demonstrating how the inline agent dynamically adjusts its available tools based on user permissions. Without inline agents, we would need to build and maintain two separate agents.

As shown in the preceding screenshots, the same HR assistant offers different tool selections based on the user’s role. An employee sees options like Knowledge Base, Apply Vacation Tool, and Submit Expense, whereas a manager has additional options like Performance Evaluation. Users can select which tools they want to add to the agent for their current interaction.

This flexibility allows for quick adaptation to user needs and preferences. For instance, if the company introduces a new policy for creating business travel requests, the tool catalog can be quickly updated to include a Create Business Travel Reservation tool. Employees can then choose to add this new tool to their agent configuration when they need to plan a business trip, or the application could automatically do so based on their role.

With Amazon Bedrock inline agents, you can create a catalog of actions that is dynamically selected by the application or by users of the application. This increases the level of flexibility and adaptability of your solutions, making them a perfect fit for navigating the complex, ever-changing landscape of modern business operations. Users have more control over their AI assistant’s capabilities, and the system remains efficient by only loading the necessary tools for each interaction.

Technical foundation: Dynamic configuration and action selection

Inline agents allow dynamic configuration at runtime, enabling a single agent to effectively perform the work of many. By specifying action groups and modifying instructions on the fly, even within the same session, you can create versatile AI applications that adapt to various scenarios without multiple agent deployments.

The following are key points about inline agents:

  • Runtime configuration – Change the agent’s configuration, including its FM, at runtime. This enables rapid experimentation and adaptation without redeploying the application, reducing development cycles.
  • Governance at tool level – Apply governance and access control at the tool level. With agents changing dynamically at runtime, tool-level governance helps maintain security and compliance regardless of the agent’s configuration.
  • Agent efficiency – Provide only necessary tools and instructions at runtime to reduce token usage and improve the agent’s accuracy. With fewer tools to choose from, it’s less complicated for the agent to select the right one, reducing hallucinations in the tool selection process. This approach can also lead to lower costs and improved latency compared to static agents because removing unnecessary tools, knowledge bases, and instructions reduces the number of input and output tokens being processed by the agent’s large language model (LLM).
  • Flexible action catalog – Create reusable actions for dynamic selection based on specific needs. This modular approach simplifies maintenance, updates, and scalability of your AI applications.

The following are examples of reusable actions:

  • Enterprise system integration – Connect with systems like Salesforce, GitHub, or databases
  • Utility tools – Perform common tasks such as sending emails or managing calendars
  • Team-specific API access – Interact with specialized internal tools and services
  • Data processing – Analyze text, structured data, or other information
  • External services – Fetch weather updates, stock prices, or perform web searches
  • Specialized ML models – Use specific machine learning (ML) models for targeted tasks

When using inline agents, you configure parameters for the following:

  • Contextual tool selection based on user intent or conversation flow
  • Adaptation to different user roles and permissions
  • Switching between communication styles or personas
  • Model selection based on task complexity

The inline agent uses the configuration you provide at runtime, allowing for highly flexible AI assistants that efficiently handle various tasks across different business contexts.

Building an HR assistant using inline agents

Let’s look at how we built our HR Assistant using Amazon Bedrock inline agents:

  1. Create a tool catalog – We developed a demo catalog of HR-related tools, including:
    • Knowledge Base – Using Amazon Bedrock Knowledge Bases for accessing company policies and guidelines based on the role of the application user. To filter the knowledge base content based on the user’s role, you also need to provide a metadata file specifying the employee roles that can access each file.
    • Apply Vacation – For requesting and tracking time off.
    • Expense Report – For submitting and managing expense reports.
    • Code Interpreter – For performing calculations and data analysis.
    • Compensation Management – For conducting and reviewing employee compensation assessments (manager-only access).
  2. Set conversation tone – We defined multiple conversation tones to suit different interaction styles:
    • Professional – For formal, business-like interactions.
    • Casual – For friendly, everyday support.
    • Enthusiastic – For upbeat, encouraging assistance.
  3. Implement access control – We implemented role-based access control. The application backend checks the user’s role (employee or manager) and provides access to appropriate tools and information and passes this information to the inline agent. The role information is also used to configure metadata filtering in the knowledge bases to generate relevant responses. The system allows for dynamic tool use at runtime. Users can switch personas or add and remove tools during their session, allowing the agent to adapt to different conversation needs in real time.
  4. Integrate the agent with other services and tools – We connected the inline agent to:
    • Amazon Bedrock Knowledge Bases for company policies, with metadata filtering for role-based access.
    • AWS Lambda functions for executing specific actions (such as submitting vacation requests or expense reports).
    • A code interpreter tool for performing calculations and data analysis.
  5. Create the UI – We created a Flask-based UI that performs the following actions:
    • Displays available tools based on the user’s role.
    • Allows users to select different personas.
    • Provides a chat window for interacting with the HR assistant.

To understand how this dynamic role-based functionality works under the hood, let’s examine the following system architecture diagram.

As shown in the preceding architecture diagram, the system works as follows:

  1. The end-user logs in and is identified as either a manager or an employee.
  2. The user selects the tools that they have access to and makes a request to the HR assistant.
  3. The agent breaks down the problem and uses the available tools to work through the query in steps, which may include:
    1. Amazon Bedrock Knowledge Bases (with metadata filtering for role-based access).
    2. Lambda functions for specific actions.
    3. Code interpreter tool for calculations.
    4. Compensation tool (accessible only to managers to submit base pay raise requests).
  4. The application uses the Amazon Bedrock inline agent to dynamically pass in the appropriate tools based on the user’s role and request (a sketch of this step follows the list).
  5. The agent uses the selected tools to process the request and provide a response to the user.
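
The following is a sketch of step 4, assembling the runtime configuration from the user’s role and tool selections before calling InvokeInlineAgent as shown earlier. The tool catalog, role mapping, and model ID are illustrative, not the demo application’s actual code.

# Sketch of step 4: assembling the inline agent configuration from the user's role
# and tool selections. The tool catalog, role mapping, and model ID are illustrative.
EMPLOYEE_TOOLS = ["KnowledgeBase", "ApplyVacation", "SubmitExpense", "CodeInterpreter"]
MANAGER_TOOLS = EMPLOYEE_TOOLS + ["CompensationManagement"]

# Hypothetical catalog mapping tool names to action group definitions (see the
# ApplyVacation example in the earlier InvokeInlineAgent sketch for the full shape)
TOOL_CATALOG = {
    "ApplyVacation": {"actionGroupName": "ApplyVacation"},
    "SubmitExpense": {"actionGroupName": "SubmitExpense"},
    "CompensationManagement": {"actionGroupName": "CompensationManagement"},
    "CodeInterpreter": {
        "actionGroupName": "CodeInterpreterAction",
        "parentActionGroupSignature": "AMAZON.CodeInterpreter",
    },
}

def build_inline_agent_kwargs(role: str, selected_tools: list, persona_instruction: str) -> dict:
    allowed = MANAGER_TOOLS if role == "manager" else EMPLOYEE_TOOLS
    action_groups = [
        TOOL_CATALOG[name]
        for name in selected_tools
        if name in allowed and name in TOOL_CATALOG
    ]
    return {
        "foundationModel": "anthropic.claude-3-sonnet-20240229-v1:0",
        "instruction": persona_instruction,
        "actionGroups": action_groups,
    }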

This approach provides a flexible, scalable solution that can quickly adapt to different user roles and changing business needs.

Conclusion

In this post, we introduced the Amazon Bedrock inline agent functionality and highlighted its application to an HR use case. We dynamically selected tools based on the user’s roles and permissions, adapted instructions to set a conversation tone, and selected different models at runtime. With inline agents, you can transform how you build and deploy AI assistants. By dynamically adapting tools, instructions, and models at runtime, you can:

  • Create personalized experiences for different user roles
  • Optimize costs by matching model capabilities to task complexity
  • Streamline development and maintenance
  • Scale efficiently without managing multiple agent configurations

For organizations demanding highly dynamic behavior—whether you’re an AI startup, SaaS provider, or enterprise solution team—inline agents offer a scalable approach to building intelligent assistants that grow with your needs. To get started, explore our GitHub repo and HR assistant demo application, which demonstrate key implementation patterns and best practices.

To learn more about how to be most successful in your agent journey, read our two-part blog series:

To get started with Amazon Bedrock Agents, check out the following GitHub repository with example code.


About the authors

Ishan Singh is a Generative AI Data Scientist at Amazon Web Services, where he helps customers build innovative and responsible generative AI solutions and products. With a strong background in AI/ML, Ishan specializes in building Generative AI solutions that drive business value. Outside of work, he enjoys playing volleyball, exploring local bike trails, and spending time with his wife and dog, Beau.

Maira Ladeira Tanke is a Senior Generative AI Data Scientist at AWS. With a background in machine learning, she has over 10 years of experience architecting and building AI applications with customers across industries. As a technical lead, she helps customers accelerate their achievement of business value through generative AI solutions on Amazon Bedrock. In her free time, Maira enjoys traveling, playing with her cat, and spending time with her family someplace warm.

Mark Roy is a Principal Machine Learning Architect for AWS, helping customers design and build generative AI solutions. His focus since early 2023 has been leading solution architecture efforts for the launch of Amazon Bedrock, the flagship generative AI offering from AWS for builders. Mark’s work covers a wide range of use cases, with a primary interest in generative AI, agents, and scaling ML across the enterprise. He has helped companies in insurance, financial services, media and entertainment, healthcare, utilities, and manufacturing. Prior to joining AWS, Mark was an architect, developer, and technology leader for over 25 years, including 19 years in financial services. Mark holds six AWS certifications, including the ML Specialty Certification.

Nitin Eusebius is a Sr. Enterprise Solutions Architect at AWS, experienced in Software Engineering, Enterprise Architecture, and AI/ML. He is deeply passionate about exploring the possibilities of generative AI. He collaborates with customers to help them build well-architected applications on the AWS platform, and is dedicated to solving technology challenges and assisting with their cloud journey.

Ashrith Chirutani is a Software Development Engineer at Amazon Web Services (AWS). He specializes in backend system design, distributed architectures, and scalable solutions, contributing to the development and launch of high-impact systems at Amazon. Outside of work, he spends his time playing ping pong and hiking through Cascade trails, enjoying the outdoors as much as he enjoys building systems.

Shubham Divekar is a Software Development Engineer at Amazon Web Services (AWS), working in Agents for Amazon Bedrock. He focuses on developing scalable systems on the cloud that enable AI applications frameworks and orchestrations. Shubham also has a background in building distributed, scalable, high-volume-high-throughput systems in IoT architectures.

Vivek Bhadauria is a Principal Engineer for Amazon Bedrock. He focuses on building deep learning-based AI and computer vision solutions for AWS customers. Outside of work, Vivek enjoys trekking and following cricket.

Read More

Use language embeddings for zero-shot classification and semantic search with Amazon Bedrock

In this post, we discuss what embeddings are, show how to practically use language embeddings, and explore how to use them to add functionality such as zero-shot classification and semantic search. We then use Amazon Bedrock and language embeddings to add these features to a Really Simple Syndication (RSS) aggregator application.

Amazon Bedrock is a fully managed service that makes foundation models (FMs) from leading AI startups and Amazon available through an API, so you can choose from a wide range of FMs to find the model that is best suited for your use case. Amazon Bedrock offers a serverless experience, so you can get started quickly, privately customize FMs with your own data, and integrate and deploy them into your applications using Amazon Web Services (AWS) services without having to manage infrastructure. For this post, we use the Cohere v3 Embed model on Amazon Bedrock to create our language embeddings.

Use case: RSS aggregator

To demonstrate some of the possible uses of these language embeddings, we developed an RSS aggregator website. RSS is a web feed that allows publications to publish updates in a standardized, computer-readable way. On our website, users can subscribe to an RSS feed and have an aggregated, categorized list of the new articles. We use embeddings to add the following functionalities:

  • Zero-shot classification – Articles are classified between different topics. There are some default topics, such as Technology, Politics, and Health & Wellbeing, as shown in the following screenshot. Users can also create their own topics.
    An example of the topics functionality
  • Semantic search – Users can search their articles using semantic search, as shown in the following screenshot. Users can not only search for a specific topic but also narrow their search by factors such as tone or style.
    Example of the semantic search functionality

This post uses this application as a reference point to discuss the technical implementation of the semantic search and zero-shot classification features.

Solution overview

This solution uses the following services:

  • Amazon API Gateway – The API is accessible through Amazon API Gateway. Caching is performed on Amazon CloudFront for certain topics to ease the database load.
  • Amazon Bedrock with Cohere v3 Embed – The articles and topics are converted into embeddings with the help of Amazon Bedrock and Cohere v3 Embed.
  • Amazon CloudFront and Amazon Simple Storage Service (Amazon S3) – The single-page React application is hosted using Amazon S3 and Amazon CloudFront.
  • Amazon Cognito – Authentication is done using Amazon Cognito user pools.
  • Amazon EventBridge – Amazon EventBridge and EventBridge schedules are used to coordinate new updates.
  • AWS Lambda – The API is a Fastify application written in TypeScript. It’s hosted on AWS Lambda.
  • Amazon Aurora PostgreSQL-Compatible Edition and pgvector – Amazon Aurora PostgreSQL-Compatible is used as the database, both for the functionality of the application itself and as a vector store using pgvector.
  • Amazon RDS Proxy – Amazon RDS Proxy is used for connection pooling.
  • Amazon Simple Queue Service (Amazon SQS) – Amazon SQS is used to queue events. It consumes one event at a time so it doesn’t hit the rate limit of Cohere in Amazon Bedrock.

The following diagram illustrates the solution architecture.

Scope of solution

What are embeddings?

This section offers a quick primer on what embeddings are and how they can be used.

Embeddings are numerical representations of concepts or objects, such as language or images. In this post, we discuss language embeddings. By reducing these concepts to numerical representations, we can then use them in a way that a computer can understand and operate on.

Let’s take Berlin and Paris as an example. As humans, we understand the conceptual links between these two words. Berlin and Paris are both cities, they’re capitals of their respective countries, and they’re both in Europe. We understand their conceptual similarities almost instinctively, because we can create a model of the world in our head. However, computers have no built-in way of representing these concepts.

To represent these concepts in a way a computer can understand, we convert them into language embeddings. Language embeddings are high-dimensional vectors that learn their relationships with each other through the training of a neural network. During training, the neural network is exposed to enormous amounts of text and learns patterns based on how words are colocated and relate to each other in different contexts.

Embedding vectors allow computers to model the world from language. For instance, if we embed “Berlin” and “Paris,” we can now perform mathematical operations on these embeddings. We can then observe some fairly interesting relationships. For instance, we could do the following: Paris – France + Germany ~= Berlin. This is because the embeddings capture the relationships between the words “Paris” and “France” and between “Germany” and “Berlin”—specifically, that Paris and Berlin are both capital cities of their respective countries.

The following graph shows the word vector distance between countries and their respective capitals.

The embedding representations of countries, and capitals.

Subtracting “France” from “Paris” removes the country semantics, leaving a vector representing the concept of a capital city. Adding “Germany” to this vector, we are left with something closely resembling “Berlin,” the capital of Germany. The vectors for this relationship are shown in the following graph.

Demonstrating the manipulation of these embeddings to show conceptual similarities.

For our use case, we use the pre-trained Cohere Embeddings model in Amazon Bedrock, which embeds entire texts rather than a single word. The embeddings represent the meaning of the text and can be operated on using mathematical operations. This property can be useful to map relationships such as similarity between texts.
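
As a concrete illustration, the following sketch embeds two short texts with Cohere Embed v3 on Amazon Bedrock and compares them with cosine similarity. The request body follows the Cohere Embed format (texts and input_type), but treat the exact fields as an assumption to verify against the model documentation.

# Sketch: embed two texts with Cohere Embed v3 on Amazon Bedrock and compare them
# with cosine similarity. Verify the request body against the current Cohere Embed
# documentation before relying on it.
import json

import boto3
import numpy as np

bedrock = boto3.client("bedrock-runtime")

def embed(texts: list, input_type: str = "search_document") -> np.ndarray:
    body = json.dumps({"texts": texts, "input_type": input_type})
    response = bedrock.invoke_model(modelId="cohere.embed-english-v3", body=body)
    return np.array(json.loads(response["body"].read())["embeddings"])

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

berlin, paris = embed(["Berlin is the capital of Germany.", "Paris is the capital of France."])
print(cosine_similarity(berlin, paris))  # conceptually related texts score high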

Zero-shot classification

One way in which we use language embeddings is by using their properties to calculate how similar an article is to one of the topics.

To do this, we break down a topic into a series of different and related embeddings. For instance, for culture, we have a set of embeddings for sports, TV programs, music, books, and so on. We then embed the incoming title and description of the RSS articles, and calculate the similarity against the topic embeddings. From this, we can assign topic labels to an article.
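
A sketch of this classification step follows, reusing the embed() and cosine_similarity() helpers from the previous example. The topic example phrases and the similarity threshold are illustrative.

# Sketch of the zero-shot classification step, reusing the embed() and
# cosine_similarity() helpers from the previous example. The topic example phrases
# and the similarity threshold are illustrative.
TOPIC_EXAMPLES = {
    "Culture": ["sports", "TV programs", "music", "books"],
    "Technology": ["software", "gadgets", "artificial intelligence", "programming"],
}

# Pre-compute one embedding per topic example phrase
topic_embeddings = {topic: embed(examples) for topic, examples in TOPIC_EXAMPLES.items()}

def classify(title: str, description: str, threshold: float = 0.4) -> list:
    article_embedding = embed([f"{title} {description}"])[0]
    labels = []
    for topic, embeddings in topic_embeddings.items():
        best = max(cosine_similarity(article_embedding, e) for e in embeddings)
        if best >= threshold:
            labels.append(topic)
    return labels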

The following figure illustrates how this works. The embeddings that Cohere generates are high-dimensional, containing 1,024 values (or dimensions). However, to demonstrate how this system works, we use an algorithm designed to reduce the dimensionality of the embeddings, t-distributed Stochastic Neighbor Embedding (t-SNE), so that we can view them in two dimensions. The following image uses these embeddings to visualize how topics are clustered based on similarity and meaning.

Clustering of different topics

You can use the embedding of an article and check the similarity of the article against the preceding embeddings. You can then say that if an article is clustered closely to one of these embeddings, it can be classified with the associated topic.

This is the k-nearest neighbor (k-NN) algorithm. This algorithm is used to perform classification and regression tasks. In k-NN, you can make assumptions around a data point based on its proximity to other data points. For instance, you can say that an article that has proximity to the music topic shown in the preceding diagram can be tagged with the culture topic.

The following figure demonstrates this with an ArsTechnica article. We plot the embedding of the article’s title and description: (The climate is changing so fast that we haven’t seen how bad extreme weather could get: Decades-old statistics no longer represent what is possible in the present day).

Display of ars-technica article clustered against different topics

The advantage of this approach is that you can add custom, user-generated topics. You can create a topic by first creating a series of embeddings of conceptually related items. For instance, an AI topic would be similar to the embeddings for AI, Generative AI, LLM, and Anthropic, as shown in the following screenshot.

Adding a custom topic to the application

In a traditional classification system, we’d be required to train a classifier—a supervised learning task where we’d need to provide a series of examples to establish whether an article belongs to its respective topic. Doing so can be quite an intensive task, requiring labeled data and model training. For our use case, we can provide a few example terms, create a cluster, and tag articles without having to provide labeled training examples or train additional models. This is shown in the following screenshot of the results page of our website.

In our application, we ingest new articles on a schedule. We use EventBridge schedules to periodically call a Lambda function, which checks if there are new articles. If there are, it creates an embedding from them using Amazon Bedrock and Cohere.

We calculate the article’s distance to the different topic embeddings, and can then determine whether the article belongs to that category. This is done with Aurora PostgreSQL with pgvector. We store the embeddings of the topics and then calculate their distance using the following SQL query:

const topics = await sqlClient.then(it => it.query(
    `SELECT name, embedding_description, similarity
     FROM (SELECT topic_id as name, embedding_description, (1 - ABS(1 - (embed.embedding <-> $1))) AS "similarity" FROM topic_embedding_link embed) topics
     ORDER BY similarity desc`,
    [toSql(articleEmbedding)]
  ))

The <-> operator in the preceding code calculates the Euclidean distance between the article and the topic embedding. This number allows us to understand how close an article is to one of the topics. We can then determine the appropriateness of a topic based on this ranking.

We then tag the article with the topic. We do this so that the subsequent request for a topic is as computationally light as possible; we do a simple join rather than calculating the Euclidean distance.

const formattedTopicInsert = pgformat(
    `INSERT INTO feed_article_topic_link(topic_id, feed_article_id) VALUES %L ON CONFLICT DO NOTHING`,
    topicLinks
  )

We also cache a specific topic/feed combination because these are calculated hourly and aren’t expected to change in the interim.

Semantic search

As previously discussed, the embeddings produced by Cohere contain a multitude of features; they embed the meanings and semantics of a word or phrase. We’ve also found that we can perform mathematical operations on these embeddings to do things such as calculate the similarity between two phrases or words.

We can use these embeddings and calculate the similarity between a search term and an embedding of an article with the k-NN algorithm to find articles that have similar semantics and meanings to the search term we’ve provided.

For example, in one of our RSS feeds, we have a lot of different articles that rate products. In a traditional search system, we’d rely on keyword matches to provide relevant results. Although it might be simple to find a specific article (for example, by searching “best digital notebooks”), we would need a different method to capture multiple product list articles.

In a semantic search system, we first transform the term “Product list” into an embedding. We can then use the properties of this embedding to perform a search within our embedding space. Using the k-NN algorithm, we can find articles that are semantically similar. As shown in the following screenshot, despite not containing the text “Product list” in either the title or description, we’ve been able to find articles that contain a product list. This is because we were able to capture the semantics of the query and match it to the existing embeddings we have for each article.

Semantic search example

In our application, we store these embeddings using pgvector on Aurora PostgreSQL. pgvector is an open source extension that enables vector similarity search in PostgreSQL. We transform our search term into an embedding using Amazon Bedrock and Cohere v3 Embed.
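
Because Cohere Embed v3 distinguishes documents from queries through its input_type parameter, the search term is embedded as a query while stored articles are embedded as documents. The following short sketch reuses the embed() helper from earlier; treat the parameter values as assumptions to confirm against the model documentation.

# Sketch: embed the user's search term before running the pgvector query below.
# Reuses the embed() helper from the earlier example; Cohere Embed v3 expects
# input_type="search_query" for queries and "search_document" for stored articles.
search_term = "Product list"
query_embedding = embed([search_term], input_type="search_query")[0]

# query_embedding is then bound as the $2 parameter of the SQL query that follows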

After we’ve converted the search term to an embedding, we can compare it with the embeddings on the article that have been saved during the ingestion process. We can then use pgvector to find articles that are clustered together. The SQL code for that is as follows:

SELECT *
FROM (
    SELECT feed_articles.id as id, title, feed_articles.feed_id as feed, feedName, slug, description, url, author, image, published_at as published, 1 - ABS(1 - (embedding <-> $2)) AS "similarity"
    FROM feed_articles
    INNER JOIN (select feed_id, name as feedName from feed_user_subscription fus where fus.user_id=$1) sub on feed_articles.feed_id=sub.feed_id
    ${feedId != undefined ? `WHERE feed_articles.feed_id = $4` : ""}
) AS results
WHERE similarity > 0.95
ORDER BY similarity desc
LIMIT $3;

This code calculates the similarity between the search term’s embedding and each article’s embedding. Articles whose similarity exceeds the threshold are considered semantically related to the search term and are returned, ordered by how closely they match.

Prerequisites

To deploy this application in your own account, you need the following prerequisites:

  • An active AWS account.
  • Model access for Cohere Embed English. On the Amazon Bedrock console, choose Model access in the navigation pane, then choose Manage model access. Select the FMs of your choice and request access.

Model access dialog

Deploy the AWS CDK stack

When the prerequisite steps are complete, you’re ready to set up the solution:

  1. Clone the GitHub repository containing the solution files:
    git clone https://github.com/aws-samples/rss-aggregator-using-cohere-embeddings-bedrock
  2. Navigate to the solution directory:
    cd infrastructure
  3. In your terminal, export your AWS credentials for a role or user in ACCOUNT_ID. The role needs to have all necessary permissions for AWS CDK deployment:
    • export AWS_REGION="<region>"
      – The AWS Region you want to deploy the application to
    • export AWS_ACCESS_KEY_ID="<access-key>"
      – The access key of your role or user
    • export AWS_SECRET_ACCESS_KEY="<secret-key>"
      – The secret key of your role or user
  4. If you’re deploying the AWS CDK for the first time, run the following command:
    cdk bootstrap
  5. To synthesize the AWS CloudFormation template, run the following command:
    cdk synth -c vpc_id=<ID of your VPC>
  6. To deploy, use the following command:
    cdk deploy -c vpc_id=<ID of your VPC>

When deployment is finished, you can check these deployed stacks by visiting the AWS CloudFormation console, as shown in the following screenshot.

Application CDK Stack

Clean up

Run the following command in the terminal to delete the CloudFormation stack provisioned using the AWS CDK:

cdk destroy --all

Conclusion

In this post, we explored what language embeddings are and how they can be used to enhance your application. We’ve learned how, by using the properties of embeddings, we can implement a real-time zero-shot classifier and can add powerful features such as semantic search.

The code for this application can be found on the accompanying GitHub repo. We encourage you to experiment with language embeddings and find out what powerful features they can enable for your applications!


About the Author

Thomas Rogers is a Solutions Architect based in Amsterdam, the Netherlands. He has a background in software engineering. At AWS, Thomas helps customers build cloud solutions, focusing on modernization, data, and integrations.

Read More

Fine-tune LLMs with synthetic data for context-based Q&A using Amazon Bedrock

There’s a growing demand from customers to incorporate generative AI into their businesses. Many use cases involve using pre-trained large language models (LLMs) through approaches like Retrieval Augmented Generation (RAG). However, for advanced, domain-specific tasks or those requiring specific formats, model customization techniques such as fine-tuning are sometimes necessary. Amazon Bedrock provides you with the ability to customize leading foundation models (FMs) such as Anthropic’s Claude 3 Haiku and Meta’s Llama 3.1.

Amazon Bedrock is a fully managed service that makes FMs from leading AI startups and Amazon available through an API, so you can choose from a wide range of FMs to find the model that is best suited for your use case. Amazon Bedrock offers a serverless experience, so you can get started quickly, privately customize FMs with your own data, and integrate and deploy them into your applications using AWS tools without having to manage any infrastructure.

Fine-tuning is a supervised training process where labeled prompt and response pairs are used to further train a pre-trained model to improve its performance for a particular use case. One consistent pain point of fine-tuning is the lack of data to effectively customize these models. Gathering relevant data is difficult, and maintaining its quality is another hurdle. Furthermore, fine-tuning LLMs requires substantial resource commitment. In such scenarios, synthetic data generation offers a promising solution. You can create synthetic training data using a larger language model and use it to fine-tune a smaller model, which has the benefit of a quicker turnaround time.

In this post, we explore how to use Amazon Bedrock to generate synthetic training data to fine-tune an LLM. Additionally, we provide concrete evaluation results that showcase the power of synthetic data in fine-tuning when data is scarce.

Solution overview

The solution comprises two main steps:

  1. Generate synthetic data using the Amazon Bedrock InvokeModel API.
  2. Fine-tune using an Amazon Bedrock custom model.

For synthetic data generation, we use a larger language model (such as Anthropic’s Claude 3 Sonnet on Amazon Bedrock) as the teacher model, and a smaller language model (such as Anthropic’s Claude Instant 1.2 or Claude 3 Haiku on Amazon Bedrock) as the student model for fine-tuning. We use the larger teacher model to generate new data based on its knowledge, which is then used to train the smaller student model. This concept is similar to knowledge distillation used in deep learning, except that we’re using the teacher model to generate a new dataset from its knowledge rather than directly modifying the architecture of the student model.

The following diagram illustrates the overall flow of the solution.

two steps of the described solution

Finally, we share our experiment results, where we compare the performance of the model fine-tuned with synthetic data to the baseline (not fine-tuned) model and to a model fine-tuned with an equal amount of original training data.

Prerequisites

To generate synthetic data and fine-tune models using Amazon Bedrock, you first need to create an AWS Identity and Access Management (IAM) service role with the appropriate permissions. This role is used by Amazon Bedrock to access the necessary resources on your behalf.

For instructions on creating the service role, refer to Create a service role for model customization. Also, make sure the role has the permission for the bedrock:InvokeModel action.

If you’re running this code using an Amazon SageMaker notebook instance, edit the IAM role that’s attached to the notebook (for example, AmazonSageMaker-ExecutionRole-XXX) instead of creating a new role. Follow Create a service role for model customization to modify the trust relationship and add the S3 bucket permission. Additionally, on the role’s Permissions tab, create the following inline policies:

  1. Policy name: bedrock-customization
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "VisualEditor0",
            "Effect": "Allow",
            "Action": [
                "bedrock:InvokeModel",
                "bedrock:ListModelCustomizationJobs",
                "bedrock:DeleteCustomModel",
                "bedrock:CreateModelCustomizationJob",
                "bedrock:StopModelCustomizationJob",
                "bedrock:ListCustomModels",
                "bedrock:GetCustomModel",
                "bedrock:GetModelCustomizationJob"
            ],
            "Resource": "*"
        }
    ]
}
  2. Policy name: iam-pass-role
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "VisualEditor0",
            "Effect": "Allow",
            "Action": "iam:PassRole",
            "Resource": [
                "${sagemaker-execution-role-arn}"
            ]
        }
    ]
}

The final permission policies for the SageMaker execution role should look like the following, which include AmazonSageMaker-ExecutionPolicy, AmazonSageMakerFullAccess, bedrock-customization, and iam-pass-role.

final list of permission policies for the role

Generate synthetic data using the Amazon Bedrock InvokeModel API

We use the Amazon Bedrock InvokeModel API to generate synthetic data for fine-tuning. You can use the API to programmatically send an inference (text generation) request to the model of your choice. All you need is a well-crafted prompt tailored for data synthesis. We used the following sample prompt for our use case:

PROMPT = """
You are an AI assistant who is an expert in Amazon services. Your task is to understand a system that takes in a list of documents, and based on that, answers a question by providing citations for the documents that it referred the answer from.

Your job is to generate three new Question/Answer pairs, emulating the tone, style, and grammar of the original data provided.

Here is the original data :
Input Documents and Question : {document}\n\nQuestion: {question}
Output Answer : {answer}

Strictly return a jsonl with the keys (question, answer, topic). Every topic should be different. The answers should be in the exact same format as the original. The question and the answer should be different in content from the original data provided, and all questions should be diverse and different from each other. Do not answer in any other format. The response should be parsable as a jsonl.
"""

The goal of our use case was to fine-tune a model to generate a relevant and coherent answer based on a given reference document and a question. RAG is a popular technique used for such Q&A tasks; however, one significant challenge with RAG is the potential for retrieving unrelated or irrelevant documents, which can lead to inaccurate responses. You can apply fine-tuning to guide the model to better focus on the relevance of the documents to the question instead of using the provided documents without context to answer the question.

Our dataset includes Q&A pairs with reference documents regarding AWS services. Each sample has up to five reference documents as context, and a single-line question follows. The following table shows an example.

document:

Context:

Document 1:

Step 1: Prepare to work with AWS CodeStar projects

In this step, you create an AWS CodeStar service role and an Amazon EC2 key pair, so that you can begin creating and working with AWS CodeStar projects. If you have used AWS CodeStar before, skip ahead to Step 2

Step 2: Create a Project in AWS CodeStar.

For this step, follow the instructions in Setting Up AWS CodeStar in the AWS CodeStar User Guide. Do not create a new AWS account, IAM user, or IAM group as part of those instructions. Use the ones you created or identified in Team Setup for AWS Cloud9. When you finish following those instructions, return to this topic.

Document 2:

Setting Up AWS CodeStar

Before you can start using AWS CodeStar, you must complete the following steps.

Topics:

Step 1: Create an account

Step 2: Create the AWS CodeStar Service Role

Step 3: Configure the User’s IAM Permissions

Step 4: Create an Amazon EC2 Key Pair for AWS CodeStar Projects

Step 5: Open the AWS CodeStar Console

Next Steps

Document 3:

How Do I Get Started with AWS CodeStar?

To get started with AWS CodeStar:

Prepare to use AWS CodeStar by following the steps in Setting Up AWS CodeStar.

Experiment with AWS CodeStar by following the steps in the Getting Started with AWS CodeStar tutorial.

Share your project with other developers by following the steps in Add Team Members to an AWS CodeStar Project.

Integrate your favorite IDE by following the steps in Use an IDE with AWS CodeStar.

Document 4:

Step 2: Create a project in AWS CodeStar

In this step, you create a project in AWS CodeStar.

If you already have a project in AWS CodeStar you want to use, skip ahead to Step 3: Create an AWS Cloud9 Development Environment and Connect It to the Project.

For this step, follow the instructions in Create a Project in AWS CodeStar in the AWS CodeStar User Guide. In the AWS CodeStar create project wizard, when you get to the Setup tools page or Connect to your source repository page, choose Skip, and then return to this topic.

Document 5:

Create a Project in AWS CodeStar

You use the AWS CodeStar console to create a project. If you use a project template, it sets up the required resources for you. The template also includes sample code that you can use to start coding.

To create a project, sign in to the AWS Management Console with an IAM user that has the AWSCodeStarFullAccess policy or equivalent permissions. For more information, see Setting Up AWS CodeStar.

Note: You must complete the steps in Setting Up AWS CodeStar before you can complete the procedures in this topic.

Topics

Create a Project in AWS CodeStar (Console)

Create a Project in AWS CodeStar (AWS CLI)

Answer the following question to the best of your ability. Use and paraphrase the information in the context, if it answers the question. Otherwise, rely on your knowledge and memory.

question: How do I get started with AWS CodeStar?
answer:

To get started with AWS CodeStar, follow these steps:

1. Prepare to work with AWS CodeStar projects:

a. Create an AWS CodeStar service role by following the instructions in “Setting Up AWS CodeStar” (Document 2) in the AWS CodeStar User Guide [1].

b. Create an Amazon EC2 key pair for AWS CodeStar projects, as described in the same guide [1].

2. Create a project in AWS CodeStar:

a. Sign in to the AWS Management Console with an IAM user that has the AWSCodeStarFullAccess policy or equivalent permissions [2].

b. In the AWS CodeStar console, select one of the available project templates, which will automatically provision all the required resources for your project [3].

c. Once your project is provisioned, you can see your running application from the Application endpoints tile [3].

3. Connect to the AWS CodeCommit source repository for your project:

a. Use the steps provided in the CodeStar console to connect to the AWS CodeCommit source repository for your project [3].

b. Begin coding and making changes to your project’s source code.

4. Track and manage changes in the release process:

a. Use the project dashboard to track and manage changes in the release process [3].

b. View the most recent project activity on the dashboard.

5. Share your project with other developers:

a. Follow the steps in “Add Team Members to an AWS CodeStar Project” to share your project with other developers [3].

6. Integrate your favorite IDE:

a. Follow the steps in “Use an IDE with AWS CodeStar” to integrate your preferred Integrated Development Environment (IDE) with AWS CodeStar [3].

[1] https://docs.aws.amazon.com/codestar/latest/userguide/setting-up.html

[2] https://docs.aws.amazon.com/codestar/latest/userguide/create-project.html

[3] https://docs.aws.amazon.com/codestar/latest/userguide/getting-started.html

For data synthesis, we asked the model to generate three new Q&A pairs per reference document. However, you can adjust the number as needed. The crucial part is to make the model think deeply about a variety of topics. Because the purpose of generating synthetic data is to enrich the training dataset, it’s more beneficial to have the model look at different parts of the documents and create Q&A pairs with different topics than the original.

The following example shows how to generate synthetic data with the Amazon Bedrock InvokeModel API. We tested the preceding prompt with Anthropic’s Claude 3 Sonnet. If you want to test a different model, retrieve the corresponding model ID from Amazon Bedrock model IDs, and replace the modelId variable in the function.

import boto3
import json

bedrock = boto3.client(service_name="bedrock-runtime")

def generate_synthetic_data(document, question, answer):
    
    values = {
        "document": document,
        "question": question,
        "answer": answer
    }
    
    body = {
        "messages": [{
            "role": "user", "content": PROMPT.format(**values)
        }],
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 2048,
        "temperature" : 0.5
    }
    
    response = bedrock.invoke_model(
        body=json.dumps(body),
        modelId="anthropic.claude-3-sonnet-20240229-v1:0",
        accept="application/json",
        contentType="application/json"
    )
    
    response_body = json.loads(response.get('body').read())
    
    return response_body['content'][0]['text']

The preceding function returns three JSONL records in strings with question, answer, and topic as keys. The following parse_llm_output function loads the strings and uses regular expressions to retrieve the generated questions and answers. Then, the create_synthetic_samples function combines those two functionalities to produce the final synthetic training samples.

import re
import pandas as pd

def parse_llm_output(jsonl_string):
    
    question_pattern = re.compile(r'"question":\s*"([^"]+)"')
    answer_pattern = re.compile(r'"answer":\s*"(.*?)"\s*,\s*"topic"')
    questions = question_pattern.findall(jsonl_string)
    answers = answer_pattern.findall(jsonl_string)
    
    return questions, answers


def create_synthetic_samples(row: pd.Series) -> pd.DataFrame:

    jsonl_string = generate_synthetic_data(row['document'], row['question'], row['answer'])
    questions, answers = parse_llm_output(jsonl_string)
    
    return pd.DataFrame({
        "document": [row['document']] * len(questions),
        "question": questions,
        "answer": answers
    })


def to_customization_format(row):

    msg = {
        "messages": [
            {"role": "user", "content": f"{row['document']}nnQuestion: {row['question']}"},
            {"role": "assistant", "content": row['answer']}
        ]
    }
    
    return msg

The following script combines all of the preceding functions and gives you the final training set with both original and synthetic samples. We convert the samples into the format required by the customization job using the to_customization_format function and save them as train.jsonl. Assume the input data is a CSV file with three columns: document, question, and answer.

import json
import pandas as pd

# Load original training samples
original_train = pd.read_csv(input_df_path)

# Create synthetic training samples
synthetic_train = pd.concat(original_train.apply(create_synthetic_samples, axis=1).tolist())

# Combine original and synthetic samples
final_train_df = pd.concat([original_train, synthetic_train])

# Convert to the format required by the customization job
final_train = final_train_df.apply(to_customization_format, axis=1).tolist()

# Write to JSONL file    
with open('train.jsonl', 'w') as file:
    for item in final_train:
        json.dump(item, file)
        file.write('\n')

Fine-tune using an Amazon Bedrock custom model

Now that you have the synthetic data generated by the teacher model along with your original data, it’s time to train the student model. We fine-tune the student model using the Amazon Bedrock custom model functionality.

Model customization is the process of providing training data to an FM to improve its performance for specific use cases. Amazon Bedrock offers three model customization methods as of this writing:

  • Fine-tuning
  • Continued pre-training
  • Distillation (preview)

You can create your own custom model using any of these methods through the Amazon Bedrock console or API. For more information on supported models and AWS Regions with various customization methods, please see User guide for model customization. In this section, we focus on how to fine-tune a model using the API.

To create a fine-tuning job in Amazon Bedrock, complete the following prerequisite steps:

  1. Create an Amazon Simple Storage Service (Amazon S3) bucket for your training data and another one for your output data (the names must be globally unique). A scripted alternative is sketched after this list.
  2. Upload the train.jsonl file to the training data bucket.
  3. Make sure that you have created an IAM role, as described in the Prerequisites section.
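
If you prefer to script the bucket setup and upload, the following is a minimal boto3 sketch. The bucket names are placeholders and must be globally unique, and buckets outside us-east-1 require a LocationConstraint.

import boto3

s3 = boto3.client("s3")
region = boto3.session.Session().region_name

for bucket in ["my-bedrock-training-bucket", "my-bedrock-output-bucket"]:   # placeholder names
    if region == "us-east-1":
        s3.create_bucket(Bucket=bucket)
    else:
        s3.create_bucket(
            Bucket=bucket,
            CreateBucketConfiguration={"LocationConstraint": region},
        )

# Upload the training file produced earlier
s3.upload_file("train.jsonl", "my-bedrock-training-bucket", "train.jsonl")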

When these steps are complete, run the following code to submit a new fine-tuning job. In our use case, the student model was Anthropic’s Claude Instant 1.2. At the time of writing, Anthropic’s Claude 3 Haiku is generally available, and we recommend following the rest of the code using Anthropic’s Claude 3 Haiku. For the release announcement, see Fine-tuning for Anthropic’s Claude 3 Haiku in Amazon Bedrock is now generally available.

If you want to try different models, you must check the model provider’s terms of service yourself. Many providers restrict using their models to train competing models. For the latest model support information, see Supported Regions and models for model customization, and replace baseModelIdentifier accordingly. Different models have different hyperparameters. For more information, see Custom model hyperparameters.

import boto3
import json
import time

bedrock = boto3.client(service_name='bedrock')
    
# Set parameters
customizationType = "FINE_TUNING"
baseModelIdentifier = "arn:aws:bedrock:us-east-1::foundation-model/anthropic.claude-instant-v1:2:100k"
roleArn = "${customization-role-arn}"
jobName = "${customization-job-name}"
customModelName = "${customization-model-name}"
hyperParameters = {
    "epochCount": "1",
    "batchSize": "96",
    "learningRateMultiplier": "0.5",
 }
trainingDataConfig = {"s3Uri": "s3://${training-bucket}/train.jsonl"}
outputDataConfig = {"s3Uri": "s3://${output-bucket}/myOutputData"}

# Create job
response_ft = bedrock.create_model_customization_job(
    jobName=jobName, 
    customModelName=customModelName,
    roleArn=roleArn,
    baseModelIdentifier=baseModelIdentifier,
    hyperParameters=hyperParameters,
    trainingDataConfig=trainingDataConfig,
    outputDataConfig=outputDataConfig
)

jobArn = response_ft.get('jobArn')

# Check job status
while True:
    status = bedrock.get_model_customization_job(jobIdentifier=jobArn).get('status')
    if status != 'InProgress':
        print(status)
        break
    else:
        print(status)
    time.sleep(30)

When the status changes to Completed, your fine-tuned student model is ready for use. To run an inference with this custom model, you need to purchase provisioned throughput. A flexible No commitment option is available for custom models, which can be turned off when not in use and billed by the hour. A cost estimate is provided on the console prior to purchasing provisioned throughput.
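
You can also purchase provisioned throughput programmatically. The following is a minimal boto3 sketch of the no-commitment option; the provisioned model name below is a placeholder, and you should verify the parameters against the current Amazon Bedrock API documentation. Alternatively, use the console flow described next.

import boto3

bedrock = boto3.client(service_name="bedrock")

# Omitting commitmentDuration selects the hourly billed, no-commitment option
response_pt = bedrock.create_provisioned_model_throughput(
    modelUnits=1,
    provisionedModelName="my-finetuned-model-pt",   # placeholder name
    modelId=customModelName                         # the custom model name or ARN from the fine-tuning job
)
provisioned_model_arn = response_pt["provisionedModelArn"]

# Check the provisioning status; wait until it becomes InService before invoking the model
status = bedrock.get_provisioned_model_throughput(
    provisionedModelId=provisioned_model_arn
)["status"]
print(status)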

On the Amazon Bedrock console, choose Custom models in the navigation pane. Select the model you fine-tuned and choose Purchase provisioned throughput.

purchase provisioned throughput UI

The model name and type are automatically selected for you. Select No commitment for Commitment term. After you make this selection, the estimated cost is shown. If you’re okay with the pricing, choose Confirm purchase.

purchase provisioned throughput UI details

When the Provisioned Throughput becomes available, retrieve the ARN of the provisioned custom model and run the inference:

import boto3
import json

bedrock = boto3.client(service_name="bedrock-runtime")

def run_student_model(document, question):
    
    values = {
        "document": document,
        "question": question,
    }
    
    # PROMPT here refers to your inference prompt template, which should contain only the
    # {document} and {question} placeholders (unlike the data synthesis prompt shown earlier)
    body = {
        "messages": [{
            "role": "user", "content": PROMPT.format(**values)
        }],
        "max_tokens": 2048,
        "temperature" : 0.5
    }
    
    response = bedrock.invoke_model(
        body=json.dumps(body),
        modelId="${provisioned_model_arn}",
        accept="application/json",
        contentType="application/json"
    )
    
    response_body = json.loads(response.get('body').read())
    
    return response_body['content'][0]['text']

Evaluate

In this section, we share our experiment results to provide data points on how the synthetic data generated by a teacher model can improve the performance of a student model. For evaluation methods, we used an LLM-as-a-judge approach, where a judge model compares responses from two different models and picks a better response. Additionally, we conducted a manual evaluation on a small subset to assess whether the LLM-as-a-judge and human judges have aligned preferences.

We carried out controlled experiments comparing the following four models. The 1,500 synthetic training samples for the fourth model were generated by Anthropic’s Claude 3 Sonnet, with three synthetic samples created per original reference document (3 samples * 500 original reference documents = 1,500 synthetic samples).

  • Instant base model – Anthropic’s Claude Instant without any customization
  • Fine-tuned 500 original – Anthropic’s Claude Instant fine-tuned with 500 original training samples
  • Fine-tuned 2,000 original – Anthropic’s Claude Instant fine-tuned with 2,000 original training samples
  • Fine-tuned with synthetic – Anthropic’s Claude Instant fine-tuned with 500 original training samples plus 1,500 synthetic training samples

LLM-as-a-judge results

LLM output evaluation is an important step in developing generative AI applications, but it is expensive and takes considerable time if done manually. An alternative solution to systematically evaluate output quality in large volume is the LLM-as-a-judge approach, where an LLM is used to evaluate another LLM’s responses.
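
The following is a simplified sketch of the pairwise judging setup (the exact judge prompts we used aren’t reproduced here); it asks a judge model on Amazon Bedrock to pick between two candidate answers or declare a tie.

import boto3
import json

bedrock = boto3.client(service_name="bedrock-runtime")

JUDGE_PROMPT = """You are an impartial judge. Given a reference document, a question, and two
candidate answers (A and B), decide which answer is more accurate, relevant, and faithful to the
document. Respond with exactly one of: A, B, or TIE.

Document: {document}
Question: {question}
Answer A: {answer_a}
Answer B: {answer_b}
"""

def judge(document, question, answer_a, answer_b):
    body = {
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 10,
        "temperature": 0,
        "messages": [{
            "role": "user",
            "content": JUDGE_PROMPT.format(document=document, question=question,
                                           answer_a=answer_a, answer_b=answer_b)
        }]
    }
    response = bedrock.invoke_model(
        body=json.dumps(body),
        modelId="anthropic.claude-3-sonnet-20240229-v1:0",
        accept="application/json",
        contentType="application/json"
    )
    return json.loads(response.get("body").read())["content"][0]["text"].strip()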

For our use case, we used Anthropic’s Claude 3 Sonnet and Meta Llama 3 70B as the judges. We asked the LLM judges to compare outputs from two different models and choose one over the other or state a tie. The following chart summarizes the judges’ decisions. Each number represents the percentage of times when the respective model was selected as providing a better answer, excluding tie cases. The test set contained 343 samples.

LLM Judge result (Sonnet 3.0)

As shown in the preceding chart, the Anthropic’s Claude 3 Sonnet judge preferred the responses from the fine-tuned model with synthetic examples over the Anthropic’s Claude Instant base model (84.8% preference) and over the fine-tuned model with 500 original samples (72.3% preference). However, against the fine-tuned model with 2,000 original examples, the synthetic-data model was preferred only 32.3% of the time, meaning the model trained on 2,000 original examples won that comparison. This aligns with the expectation that when a large amount of high-quality original data is available, it’s better to use the large training data that accurately reflects the target data distribution.

LLM Judge result (Llama 3 70B)

The Meta Llama judge reached a similar conclusion. As shown in the preceding chart, it preferred the response from the fine-tuned model with synthetic samples over the Anthropic’s Claude Instant base model (75.6% preference) and the fine-tuned model with original 500 examples (76.4% preference), but the fine-tuned model with 2,000 original examples was the ultimate winner.

Human evaluation results

To complement the LLM-as-a-judge result, we conducted manual evaluation with two human judges. We asked the two human evaluators to perform the same pairwise comparison task as the LLM judge, but for 20 examples. The following chart summarizes the results.

Human Evaluation Result

As shown in the preceding chart, the two human evaluators reached a similar conclusion, reinforcing the LLM-as-a-judge result. The fine-tuned model with synthetic examples produced outputs that were more preferable than the Anthropic’s Claude Instant base model and the fine-tuned model with the original 500 examples; however, it didn’t outperform the fine-tuned model with the 2,000 original examples.

These comparative evaluation results from both the LLM judges and human judges strongly demonstrate the power and potential of using data synthesis when training data is scarce. Moreover, by using high-quality data from the teacher model, we can effectively train the student model, which is lightweight and cost-effective for deployment in a production environment.

Amazon Bedrock evaluations

Running LLM-as-a-judge and human evaluation has become much easier with Amazon Bedrock. Model evaluation on Amazon Bedrock allows you to evaluate, compare, and select the best FMs for your use case. Human evaluation workflows can use your own employees or an AWS-managed team as reviewers. For more information on how to set up a human evaluation workflow, see Creating your first model evaluation that uses human workers. The latest feature, LLM-as-a-judge, is now in preview and allows you to assess multiple quality dimensions including correctness, helpfulness, and responsible AI criteria such as answer refusal and harmfulness. For step-by-step instructions, see New RAG evaluation and LLM-as-a-judge capabilities in Amazon Bedrock.

Clean up

Make sure to delete the following resources to avoid incurring cost:

  • Provisioned throughput for the custom model
  • The training_bucket and output_bucket S3 buckets

Conclusion

In this post, we explored how to use Amazon Bedrock to generate synthetic training data using a large teacher language model and fine-tune a smaller student model with synthetic data. We provided instructions on generating synthetic data using the Amazon Bedrock InvokeModel API and fine-tuning the student model using an Amazon Bedrock custom model. Our evaluation results, based on both an LLM-as-a-judge approach and human evaluation, demonstrated the effectiveness of synthetic data in improving the student model’s performance when original training data is limited.

Although fine-tuning with a large amount of high-quality original data remains the ideal approach, our findings highlight the promising potential of synthetic data generation as a viable solution when dealing with data scarcity. This technique can enable more efficient and cost-effective model customization for domain-specific or specialized use cases.

If you’re interested in working with the AWS Generative AI Innovation Center and learning more about LLM customization and other generative AI use cases, visit Generative AI Innovation Center.


About the Author

Sujeong Cha is a Deep Learning Architect at the AWS Generative AI Innovation Center, where she specializes in model customization and optimization. She has extensive hands-on experience in solving customers’ business use cases by utilizing generative AI as well as traditional AI/ML solutions. Sujeong holds a M.S. degree in Data Science from New York University.

Arijit Ghosh Chowdhury is a Scientist with the AWS Generative AI Innovation Center, where he works on model customization and optimization. In his role, he works on applied research in fine-tuning and model evaluations to enable GenAI for various industries. He has a Master’s degree in Computer Science from the University of Illinois at Urbana Champaign, where his research focused on question answering, search and domain adaptation.

Sungmin Hong is a Senior Applied Scientist at Amazon Generative AI Innovation Center where he helps expedite the variety of use cases of AWS customers. Before joining Amazon, Sungmin was a postdoctoral research fellow at Harvard Medical School. He holds Ph.D. in Computer Science from New York University. Outside of work, Sungmin enjoys hiking, reading and cooking.

Yiyue Qian is an Applied Scientist II at the AWS Generative AI Innovation Center, where she develops generative AI solutions for AWS customers. Her expertise encompasses designing and implementing innovative AI-driven and deep learning techniques, focusing on natural language processing, computer vision, multi-modal learning, and graph learning. Yiyue holds a Ph.D. in Computer Science from the University of Notre Dame, where her research centered on advanced machine learning and deep learning methodologies. Outside of work, she enjoys sports, hiking, and traveling.

Wei-Chih Chen is a Machine Learning Engineer at the AWS Generative AI Innovation Center, where he works on model customization and optimization for LLMs. He also builds tools to help his team tackle various aspects of the LLM development life cycle—including fine-tuning, benchmarking, and load testing—accelerating the adoption of diverse use cases for AWS customers. He holds an M.S. degree in Computer Science from UC Davis.

Hannah Marlowe is a Senior Manager of Model Customization at the AWS Generative AI Innovation Center. Her team specializes in helping customers develop differentiating Generative AI solutions using their unique and proprietary data to achieve key business outcomes. She holds a Ph.D in Physics from the University of Iowa, with a focus on astronomical X-ray analysis and instrumentation development. Outside of work, she can be found hiking, mountain biking, and skiing around the mountains in Colorado.

Read More

Achieve ~2x speed-up in LLM inference with Medusa-1 on Amazon SageMaker AI

This blog post is co-written with Moran Beladev, Manos Stergiadis, and Ilya Gusev from Booking.com.

Large language models (LLMs) have revolutionized the field of natural language processing with their ability to understand and generate humanlike text. Trained on broad, generic datasets spanning a wide range of topics and domains, LLMs use their parametric knowledge to perform increasingly complex and versatile tasks across multiple business use cases. Furthermore, companies are increasingly investing resources in customizing LLMs through few-shot learning and fine-tuning to optimize their performance for specialized applications.

However, the impressive performance of LLMs comes at the cost of significant computational requirements, driven by their large number of parameters and autoregressive decoding process which is sequential in nature. This combination makes achieving low latency a challenge for use cases such as real-time text completion, simultaneous translation, or conversational voice assistants, where subsecond response times are critical.

Researchers developed Medusa, a framework to speed up LLM inference by adding extra heads to predict multiple tokens simultaneously. This post demonstrates how to use Medusa-1, the first version of the framework, to speed up an LLM by fine-tuning it on Amazon SageMaker AI and confirms the speed up with deployment and a simple load test. Medusa-1 achieves an inference speedup of around two times without sacrificing model quality, with the exact improvement varying based on model size and data used. In this post, we demonstrate its effectiveness with a 1.8 times speedup observed on a sample dataset.

Introduction to Medusa and its benefits for LLM inference speed

LLMs generate text in a sequential manner, which involves autoregressive sampling, with each new token conditional on the previous ones. Generating K tokens necessitates K sequential executions of the model. This token-by-token processing introduces an inherent latency and computational overhead because the model needs to perform a separate forward pass for each new token in the output sequence. The following diagram from Role-Play with Large Language Models illustrates this flow.

Autoregressive sampling overview
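
To make the sequential nature concrete, the following toy Python sketch stands in for an LLM with a stub next-token function; generating K tokens requires K sequential model calls, which is the latency bottleneck Medusa targets.

def next_token(context):          # stand-in for one full forward pass of an LLM
    return (sum(context) + 7) % 50

def generate(prompt, k):
    seq = list(prompt)
    for _ in range(k):            # K new tokens -> K sequential forward passes
        seq.append(next_token(seq))
    return seq

print(generate([1, 2, 3], k=5))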

Speculative decoding tackles this challenge by using a smaller, faster draft model to generate multiple potential token continuations in parallel, which are then verified by a larger, more accurate target model. This parallelization speeds up text generation while maintaining the quality of the target model because the verification task is faster than autoregressive token generation. For a detailed explanation of the concept, refer to the paper Accelerating Large Language Model Decoding with Speculative Sampling. The speculative decoding technique can be implemented using the inference optimization toolkit on Amazon SageMaker Jumpstart.
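
The following toy sketch illustrates the draft-and-verify idea with stub next-token functions; a real implementation verifies all draft tokens in a single batched forward pass of the target model rather than in a Python loop.

def target_next(ctx):
    return (sum(ctx) * 7 + 3) % 11          # stand-in for the large target model

def draft_next(ctx):
    # An imperfect draft model: it usually agrees with the target, but sometimes doesn't
    return target_next(ctx) if len(ctx) % 3 else 0

def speculative_decode(prompt, new_tokens=8, k=4):
    seq = list(prompt)
    target_len = len(prompt) + new_tokens
    while len(seq) < target_len:
        # 1) The cheap draft model proposes k tokens autoregressively
        draft, ctx = [], list(seq)
        for _ in range(k):
            t = draft_next(ctx)
            draft.append(t)
            ctx.append(t)
        # 2) The target model verifies the proposals; accept the matching prefix and
        #    emit the target's own token at the first mismatch
        ctx = list(seq)
        for t in draft:
            expected = target_next(ctx)
            if t == expected:
                ctx.append(t)
            else:
                ctx.append(expected)
                break
        seq = ctx
    return seq[:target_len]

print(speculative_decode([1, 2, 3]))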

The paper Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads introduced Medusa as an alternative to speculative decoding. Instead of adding a separate draft model, it adds extra decoding heads to the LLM that generate candidate continuations simultaneously. These candidates are then evaluated in parallel using a tree-based attention mechanism. This parallel processing reduces the number of sequential steps needed, leading to faster inference times. The main advantage of Medusa over speculative decoding is that it eliminates the need to acquire and maintain a separate draft model while achieving higher speedups. For example, when tested on the MT-Bench dataset, the paper reports that Medusa-2 (the second version of Medusa) speeds up inference time by 2.8 times. This outperforms speculative decoding, which only manages to speed up inference time by 1.5 times on the same dataset.

The Medusa framework currently supports Llama and Mistral models. Although it offers significant speed improvements, it does come with a memory trade-off (similar to speculative decoding). For instance, adding five Medusa heads to the 7-billion-parameter Mistral model increases the total parameter count by 750 million (150 million per head), which means these additional parameters must be stored in GPU memory, leading to a higher memory requirement. However, in most cases, this increase doesn’t necessitate switching to a higher GPU memory instance. For example, you can still use an ml.g5.4xlarge instance with 24 GB of GPU memory to host your 7-billion-parameter Llama or Mistral model with extra Medusa heads.
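
The back-of-the-envelope arithmetic behind those numbers is straightforward. The following sketch assumes Mistral-7B’s hidden size of 4,096 and vocabulary of 32,000, with one residual block per head (the Medusa-1 default).

hidden_size = 4096
vocab_size = 32000

res_block = hidden_size * hidden_size + hidden_size   # Linear(hidden, hidden) with bias
lm_head = hidden_size * vocab_size                    # Linear(hidden, vocab), no bias
per_head = res_block + lm_head

num_heads = 5
print(f"per head: {per_head / 1e6:.0f}M parameters")              # ~148M, consistent with ~150M per head
print(f"{num_heads} heads: {num_heads * per_head / 1e6:.0f}M")    # ~740M, consistent with ~750M total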

Training Medusa heads requires additional development time and computational resources, which should be factored into project planning and resource allocation. Another important limitation to mention is that the current framework, when deployed on an Amazon SageMaker AI endpoint, only supports a batch size of one—a configuration typically used for low-latency applications.

The following diagram from the original Medusa paper authors’ FasterDecoding repository gives a visual Medusa framework overview.

Medusa framework overview

There are two main variants of Medusa:

  1. Medusa-1 – Requires a two-stage approach where you first fine-tune your LLM and then add Medusa heads and train them on top of your frozen fine-tuned LLM
  2. Medusa-2 – Introduced later as an improvement, fine-tunes both the additional heads and the backbone LLM parameters together, enabling potentially even further latency speedups

The Medusa paper reports that across models of varying sizes, you can achieve inference speedups of around two times for Medusa-1 and around three times for Medusa-2. With Medusa-1, the predictions are identical to those of the originally fine-tuned LLM. In contrast, with Medusa-2, we might observe slightly different results compared to simple fine-tuning of the LLM because both the heads and the backbone LLM parameters are updated together. In this post, we focus on Medusa-1.

Solution overview

We cover the following steps in our solution:

  • Prerequisites
  • Load and prepare the dataset
  • Fine-tune an LLM using a SageMaker AI training job
  • Train Medusa heads on top of a frozen fine-tuned LLM using a SageMaker AI training job
  • Deploy the fine-tuned LLM with Medusa heads on a SageMaker AI endpoint
  • Demonstrate LLM inference speedup

By following this solution, you can accelerate LLM inference in your applications, leading to faster response times and improved user experience.

Prerequisites

To build the solution yourself, you need the following prerequisites:

Load and prepare the dataset

Now that you have cloned the GitHub repository and opened the medusa_1_train.ipynb notebook, you will load and prepare the dataset in the notebook. We encourage you to read this post while running the code in the notebook. For this post, we use a dataset called sql-create-context, which contains samples of natural language instructions, schema definitions and the corresponding SQL query. It contains 78,577 examples of natural language queries, SQL CREATE TABLE statements, and SQL queries answering the question using the CREATE statement as context. For demonstration purposes, we select 3,000 samples and split them into train, validation, and test sets.

You need to run the “Load and prepare the dataset” section of the medusa_1_train.ipynb notebook to prepare the dataset for fine-tuning. We also included a data exploration script to analyze the length of input and output tokens. After data exploration, we prepare the train, validation, and test sets and upload them to Amazon Simple Storage Service (Amazon S3).
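
As a rough sketch of what that section does, the following loads the dataset with the Hugging Face datasets library and carves out 3,000 samples. The Hub ID and split ratios shown here are assumptions for illustration; the notebook pins the exact dataset reference and split sizes.

from datasets import load_dataset

dataset = load_dataset("b-mc2/sql-create-context", split="train")   # assumed Hub ID

# Select 3,000 samples and split them into train, validation, and test sets
subset = dataset.shuffle(seed=42).select(range(3000))
splits = subset.train_test_split(test_size=0.2, seed=42)
holdout = splits["test"].train_test_split(test_size=0.5, seed=42)

train_dataset, eval_dataset, test_dataset = splits["train"], holdout["train"], holdout["test"]
print(len(train_dataset), len(eval_dataset), len(test_dataset))     # 2400 300 300 with these ratios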

Fine-tune an LLM using SageMaker AI training job

We use the Zephyr 7B β model as our backbone LLM. Zephyr is a series of language models trained to act as helpful assistants, and Zephyr 7B β is a fine-tuned version of Mistral-7B-v0.1, trained on a mix of publicly available and synthetic datasets using Direct Preference Optimization.

To launch a SageMaker AI training job, we need to use the PyTorch or Hugging Face estimator. SageMaker AI starts and manages all the necessary Amazon Elastic Compute Cloud (Amazon EC2) instances for us, supplies the appropriate container, downloads data from our S3 bucket to the container, and uploads and runs the specified training script, in our case fine_tune_llm.py. We select the hyperparameters based on the QLoRA paper, but we encourage you to experiment with your own combinations. To expedite the execution of this code, we set the number of epochs to 1. However, for better results, it’s generally recommended to set the number of epochs to at least 2 or 3.

from sagemaker.pytorch.estimator import PyTorch
from sagemaker.debugger import TensorBoardOutputConfig
import time
import os

def get_current_time():
    return time.strftime("%Y-%m-%d-%H-%M-%S", time.localtime())

def create_estimator(hyperparameters_dict, job_name, role, sess, train_script_path):
    metric=[
        {"Name": "loss", "Regex": r"'loss':\s*([0-9.]+)"},
        {"Name": "epoch", "Regex": r"'epoch':\s*([0-9.]+)"},
    ]

    tensorboard_s3_output_path = os.path.join(
       "s3://", sess.default_bucket(), job_name, 'tensorboard'
    )
    print("Tensorboard output path:", tensorboard_s3_output_path)

    tensorboard_output_config = TensorBoardOutputConfig(
        s3_output_path=tensorboard_s3_output_path,
        container_local_output_path=hyperparameters_dict['logging_dir']
    )
    estimator = PyTorch(
        sagemaker_session    = sess,
        entry_point          = train_script_path,   # training script
        source_dir           = 'train',      # directory which includes all the files needed for training
        instance_type        = 'ml.g5.4xlarge',   # instances type used for the training job, "local_gpu" for local mode
        metric_definitions   = metric,
        instance_count       = 1,                 # the number of instances used for training
        role                 = role,              # Iam role used in training job to access AWS ressources, e.g. S3
        volume_size          = 300,               # the size of the EBS volume in GB
        framework_version      = '2.1.0',             # the PyTorch version used in the training job
        py_version           = 'py310',           # the python version used in the training job
        hyperparameters      =  hyperparameters_dict,  # the hyperparameters passed to the training job
        disable_output_compression = True,        # not compress output to save training time and cost
        tensorboard_output_config = tensorboard_output_config
    )
    return estimator
    
# hyperparameters, which are passed into the training job
sft_hyperparameters = {
  ### SCRIPT PARAMETERS ###
  'train_dataset_path': '/opt/ml/input/data/train/train_dataset.json', # path where sagemaker will save training dataset
  'eval_dataset_path': '/opt/ml/input/data/eval/eval_dataset.json', # path where sagemaker will save evaluation dataset
  'model_id': model_id,
  'max_seq_len': 256,                               # max sequence length for model and packing of the dataset
  'use_qlora': True,                                 # use QLoRA model
  ### TRAINING PARAMETERS ###
  'num_train_epochs': 1,                             # number of training epochs
  'per_device_train_batch_size': 1,                  # batch size per device during training
  'gradient_accumulation_steps': 16,                  # number of steps before performing a backward/update pass
  'gradient_checkpointing': True,                    # use gradient checkpointing to save memory
  'optim': "adamw_8bit",                             # use fused adamw 8bit optimizer
  'logging_steps': 15,                               # log every 15 steps
  'save_strategy': "steps",                          # save checkpoints based on steps
  'save_steps': 15,
  'save_total_limit': 2,
  'eval_strategy': "steps",
  'eval_steps': 15,
  'learning_rate': 1e-4,                             # learning rate, based on QLoRA paper
  'bf16': True,                                      # use bfloat16 precision
  'max_grad_norm': 10,                              # max gradient norm based on QLoRA paper
  'warmup_ratio': 0.03,                              # warmup ratio based on QLoRA paper
  'lr_scheduler_type': "constant",                   # use constant learning rate scheduler
  'output_dir': '/opt/ml/checkpoints/',              # Temporary output directory for model checkpoints
  'merge_adapters': True,                            # merge LoRA adapters into model for easier deployment
  'report_to': "tensorboard",                        # report metrics to tensorboard
  'logging_dir': "/opt/ml/output/tensorboard"        # tensorboard logging directory
}
 
sft_job_name = f"sft-qlora-text-to-sql-{get_current_time()}"
data = {
    'train': train_dataset_path,
    'eval': eval_dataset_path
}

sft_estimator = create_estimator(sft_hyperparameters, sft_job_name, role, sess, "fine_tune_llm.py")

sft_estimator.fit(job_name=sft_job_name, inputs=data, wait=False)

When our training job has completed successfully after approximately 1 hour, we can use the fine-tuned model artifact for the next step: training the Medusa heads on top of it. To visualize the training metrics in TensorBoard, you can follow the guidance in the documentation Load and visualize output tensors using the TensorBoard application.

Train Medusa heads on top of frozen fine-tuned LLM using a SageMaker AI training job

For training Medusa heads, we can reuse the functions previously mentioned to launch the training job. We selected hyperparameters based on a combination of what the Medusa paper reported and what we found to be best performing after a few experiments. We set the number of Medusa heads to 5 and used the 8-bit AdamW optimizer, as recommended by the paper. For simplicity, we maintained a constant learning rate of 1e-4 with a constant scheduler, similar to the previous fine-tuning step. Although the paper recommends an increased learning rate and a cosine scheduler, we found that our chosen combination of hyperparameters performed well on this dataset. However, we encourage you to experiment with your own hyperparameter settings to potentially achieve even better results.

# hyperparameters, which are passed into the training job
medusa_hyperparameters = {
  ### SCRIPT PARAMETERS ###
  'train_dataset_path': '/opt/ml/input/data/train/train_dataset.json', # path where sagemaker will save training dataset
  'eval_dataset_path': '/opt/ml/input/data/eval/eval_dataset.json', # path where sagemaker will save evaluation dataset
  'model_path': '/opt/ml/input/data/fine-tuned-model/',
  'max_seq_len': 256,                               # max sequence length for model and packing of the dataset
  'medusa_num_heads': 5,
  ### TRAINING PARAMETERS ###
  'num_train_epochs': 3,                             # number of training epochs
  'per_device_train_batch_size': 1,                  # batch size per device during training
  'gradient_accumulation_steps': 16,                  # number of steps before performing a backward/update pass
  'gradient_checkpointing': True,                    # use gradient checkpointing to save memory
  'optim': "adamw_8bit",                             # use fused adamw 8bit optimizer
  'logging_steps': 15,                               # log every 15 steps
  'save_strategy': "steps",                          # save checkpoints based on steps
  'save_steps': 15,
  'save_total_limit': 2,
  'eval_strategy': "steps",
  'eval_steps': 15,
  'learning_rate': 1e-4,                             # learning rate
  'bf16': True,                                      # use bfloat16 precision
  'max_grad_norm': 10,                              # max gradient norm based on QLoRA paper
  'warmup_ratio': 0.03,                              # warmup ratio based on QLoRA paper
  'lr_scheduler_type': "constant",                   # use constant learning rate scheduler
  'output_dir': '/opt/ml/checkpoints/',              # Temporary output directory for model checkpoints
  'report_to': "tensorboard",                        # report metrics to tensorboard
  'logging_dir': "/opt/ml/output/tensorboard"        # tensorboard logging directory
}

medusa_train_job_name = f"medusa-text-to-sql-{get_current_time()}"
data = {
    'train': train_dataset_path,
    'eval': eval_dataset_path,
    'fine-tuned-model': fine_tuned_model_path
}

medusa_estimator = create_estimator(medusa_hyperparameters, medusa_train_job_name, role, sess, "train_medusa_heads.py")

medusa_estimator.fit(job_name=medusa_train_job_name, inputs=data, wait=False)

We found that after 3 epochs, the evaluation loss of Medusa heads was converging, which can be observed in the TensorBoard graph in the following image.

TensorBoard graph showing the evaluation loss during Medusa heads training

Besides the hyperparameters, the main difference is that we pass train_medusa_heads.py as the training entry point, where we first add Medusa heads and freeze the fine-tuned LLM, and then we create a custom MedusaSFTTrainer class, which is a subclass of the transformers SFTTrainer.

# Add medusa heads and freeze base model
add_medusa_heads(
    model,
    medusa_num_heads=script_args.medusa_num_heads,
)
freeze_layers(model)
model.config.torch_dtype = torch_dtype
model.config.use_cache = False

logger.info("Finished loading model and medusa heads")

tokenizer = AutoTokenizer.from_pretrained(script_args.model_path, use_fast=True)
tokenizer.pad_token = tokenizer.eos_token

################
# Training
################
trainer = MedusaSFTTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    max_seq_length=script_args.max_seq_length,
    tokenizer=tokenizer,
    dataset_kwargs={
        "add_special_tokens": False,  # We template with special tokens
        "append_concat_token": False,  # No need to add additional separator token
    },
    medusa_num_heads=script_args.medusa_num_heads,
    medusa_heads_coefficient=script_args.medusa_heads_coefficient,
    medusa_decay_coefficient=script_args.medusa_decay_coefficient,
    medusa_scheduler=script_args.medusa_scheduler,
    train_only_medusa_heads=script_args.train_only_medusa_heads,
    medusa_lr_multiplier=script_args.medusa_lr_multiplier
)
trainer.train()

In the add_medusa_heads() function, we add the residual blocks of the Medusa heads, and we also override the forward pass of our model to make sure the frozen backbone LLM is not trained:

def add_medusa_heads(
    model,
    medusa_num_heads,
):
    """
    Args:
        model (nn.Module): The base language model to be used.
        medusa_num_heads (int, optional): Number of additional tokens to predict
    """
    hidden_size = model.lm_head.weight.shape[-1]
    vocab_size = model.lm_head.weight.shape[0]
    model.config.medusa_num_layers = 1
    model.config.medusa_num_heads = medusa_num_heads
    model.medusa_num_heads = medusa_num_heads
    # Create a list of Medusa heads
    model.medusa_heads = nn.ModuleList(
        [
            nn.Sequential(
                ResBlock(hidden_size),
                nn.Linear(hidden_size, vocab_size, bias=False),
            )
            for _ in range(medusa_num_heads)
        ]
    )

    # Ensure medusa_head's dtype and device align with the base_model
    model.medusa_heads.to(model.dtype).to(model.device)
    logger.info(f"Loading medusa heads in {str(model.dtype)} to device {model.device}")

    for i in range(medusa_num_heads):
        # Initialize the weights of each medusa_head using the base model's weights
        model.medusa_heads[i][-1].weight.data[:] = model.lm_head.weight.data[:]

    def forward(
        self,
        input_ids: torch.LongTensor = None,
        attention_mask: Optional[torch.Tensor] = None,
        position_ids: Optional[torch.LongTensor] = None,
        past_key_values: Optional[List[torch.FloatTensor]] = None,
        inputs_embeds: Optional[torch.FloatTensor] = None,
        labels: Optional[torch.LongTensor] = None,
        use_cache: Optional[bool] = None,
        output_attentions: Optional[bool] = None,
        output_hidden_states: Optional[bool] = None,
        return_dict: Optional[bool] = None,
        train_only_medusa_heads: bool = False,
    ):
        """Forward pass of the MedusaModel.
        Returns:
            torch.Tensor: A tensor containing predictions from all Medusa heads.
            (Optional) Original predictions from the base model's LM head.
        """
        maybe_grad = torch.no_grad() if train_only_medusa_heads else nullcontext()
        with maybe_grad:
            outputs = self.model(
                input_ids=input_ids,
                attention_mask=attention_mask,
                position_ids=position_ids,
                past_key_values=past_key_values,
                inputs_embeds=inputs_embeds,
                use_cache=use_cache,
                output_attentions=output_attentions,
                output_hidden_states=output_hidden_states,
                return_dict=return_dict,
            )
            hidden_states = outputs[0]
            medusa_logits = [self.lm_head(hidden_states)]
        for i in range(self.medusa_num_heads):
            medusa_logits.append(self.medusa_heads[i](hidden_states))
        return torch.stack(medusa_logits, dim=0)

    model.forward = types.MethodType(forward, model)

After the model training is finished (which takes about 1 hour), we prepare the model artifacts for deployment and upload them to Amazon S3. Your final model artifact contains both the original fine-tuned model from the previous step under the base-model prefix and the trained Medusa heads in a file named medusa_heads.safetensors.

Deploy the fine-tuned LLM with Medusa heads on a SageMaker AI endpoint

The Medusa framework is supported by the Text Generation Inference (TGI) server. After training the LLM with Medusa heads, we deploy it to a SageMaker AI real-time endpoint using the Hugging Face Inference Container set up with TGI.

First, we create a SageMaker AI HuggingFaceModel object and then deploy the model to an endpoint with the following function:

import json
from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri


def deploy_model(endpoint_name, instance_type, model_s3_path=None, hf_model_id=None):
    llm_image = get_huggingface_llm_image_uri(
      "huggingface",
      version="2.2.0",
      session=sess,
    )

    print(f"llm image uri: {llm_image}")

    model_data = None
    if model_s3_path:
        model_data = {'S3DataSource': {'S3Uri': model_s3_path, 'S3DataType': 'S3Prefix', 'CompressionType': 'None'}}
        hf_model_id = "/opt/ml/model"
    else:
        assert hf_model_id, "You need to provide either pretrained HF model id, or S3 model data to deploy"
    config = {
      'HF_MODEL_ID': hf_model_id,  # Hugging Face model ID, or the container path where SageMaker stores the model
      'SM_NUM_GPUS': json.dumps(1),  # Number of GPU used per replica
      'MAX_INPUT_LENGTH': json.dumps(1024),  # Max length of input text
      'MAX_TOTAL_TOKENS': json.dumps(2048),  # Max length of the generation (including input text)
    }

    llm_model = HuggingFaceModel(
      name=endpoint_name,
      role=role,
      image_uri=llm_image,
      model_data=model_data,
      env=config
    )

    deployed_llm = llm_model.deploy(
      endpoint_name=endpoint_name,
      initial_instance_count=1,
      instance_type=instance_type,
      container_startup_health_check_timeout=300,
    )
    return deployed_llm

We deploy three LLMs on three SageMaker AI endpoints:

  1. Base LLM which isn’t fine-tuned
  2. The LLM that we fine-tuned
  3. The fine-tuned LLM that also has trained Medusa heads

You can deploy the three models in parallel by using a function that we included in the notebook, or you can deploy the models one by one by running the code below:

base_deployed_llm = deploy_model(f"base-{get_current_time()}", instance_type="ml.g5.4xlarge", model_s3_path=None, hf_model_id=model_id)
sft_deployed_llm = deploy_model(f"sft-{get_current_time()}", instance_type="ml.g5.4xlarge", model_s3_path=fine_tuned_model_path)
medusa_deployed_llm = deploy_model(f"medusa-{get_current_time()}", instance_type="ml.g5.4xlarge", model_s3_path=medusa_trained_model_path)

After the status for each endpoint becomes InService, which should take around 15 minutes, we can invoke them for inference. We send the following input:

“You are a text to SQL query translator. Users will ask you questions in English and you will generate a SQL query based on the provided SCHEMA. SCHEMA: CREATE TABLE table_name_32 (time VARCHAR, heat VARCHAR, name VARCHAR)
What was Liu Limin's time in heat 4?”

We can observe the following responses:

  1. The base LLM response contains extra words that aren’t needed:
“To retrieve the time of Liu Limin in heat 4 based on the provided SCHEMA, we need to write a SQL query. Since the table name is not provided, let 's assume it's "my_table".  Assuming the table has a primary key (which is not mentioned in the SCHEMA), we can join the table with itself to compare the heat and name of Liu Limin in heat 4 with all the records in the table.  Here's the SQL query:  ``sql SELECT t1.time FROM my_table t1 JOIN my_table t2 ON t1.name = t2.name AND t2.heat = 4 WHERE t1.name = 'Liu Limin' AND t1.heat <> 4; `  Explanation:  1. We're selecting the time` column from the first table (t1) in the FROM clause. 2. We're joining the table with itself (my_table t1 JOIN my_table t2) to compare the name and heat of Liu Limin in heat 4 with all the records in the table. 3. We're filtering the results using the WHERE clause. We're selecting only the records where the name is 'Liu Limin' and the heat is not equal to 4 (i.e., not heat 4). This is to ensure that we're selecting the time of Liu Limin in heat 3.  Note: This query assumes that the table has a unique primary key. If the table doesn't have a primary key, you may need to add additional conditions to the JOIN and WHERE clauses to ensure that we're selecting the correct records.“
  2. The fine-tuned LLM response is improved significantly, and contains only the required output:
'SELECT time FROM table_name_32 WHERE heat = 4 AND name = "liu limin"'
  3. The fine-tuned LLM with trained Medusa heads provides the exact same response as the fine-tuned model, demonstrating that Medusa-1, by design, maintains the output (quality) of the original model:
'SELECT time FROM table_name_32 WHERE heat = 4 AND name = "liu limin"'

Demonstrate LLM inference speedup

To measure the inference speed improvements, we compare the response times of the deployed fine-tuned LLM and the fine-tuned LLM with Medusa heads on 450 test observations with the following code:

import time
import numpy as np
from tqdm import tqdm

def request(sample, deployed_llm):
    prompt = tokenizer.apply_chat_template(sample, tokenize=False, add_generation_prompt=True)
    outputs = deployed_llm.predict({
      "inputs": prompt,
      "parameters": {
        "max_new_tokens": 512,
        "do_sample": False,
        "return_full_text": False,
      }
    })
    return {"role": "assistant", "content": outputs[0]["generated_text"].strip()}

def predict(deployed_llm, test_dataset):
    predicted_answers = []
    latencies = []

    for sample in tqdm(test_dataset):
        start_time = time.time()
        predicted_answer = request(sample["messages"][:2], deployed_llm)
        end_time = time.time()

        latency = end_time - start_time
        latencies.append(latency)
        predicted_answers.append(predicted_answer)

    # Calculate p90 and average latencies
    p90_latency = np.percentile(latencies, 90)
    avg_latency = np.mean(latencies)

    print(f"P90 Latency: {p90_latency:.2f} seconds")
    print(f"Average Latency: {avg_latency:.2f} seconds")

    return predicted_answers

First, we run predictions using the fine-tuned LLM:

sft_predictions = predict(sft_deployed_llm, test_dataset)
P90 Latency: 1.28 seconds
Average Latency: 0.95 seconds

Then, we run predictions using the fine-tuned LLM with Medusa heads:

medusa_predictions = predict(medusa_deployed_llm, test_dataset)
P90 Latency: 0.80 seconds
Average Latency: 0.53 seconds

The prediction runs should take around 8 and 4 minutes, respectively. We can observe that the average latency decreased from 950 to 530 milliseconds, a 1.8-times improvement. You can achieve even higher improvements if your dataset contains longer inputs and outputs; in our dataset, the average was only 18 input tokens and 30 output tokens.

We want to highlight once again that, with this technique, the output quality is fully maintained and the prediction outputs are identical. The model responses for the test set of 450 observations are the same with and without Medusa heads:

match_percentage = sum(a["content"] == b["content"] for a, b in zip(sft_predictions, medusa_predictions)) / len(sft_predictions) * 100
print(f"Predictions with the fine-tuned model with medusa heads are the same as without medusa heads: {match_percentage:.2f}% of test set ")

Predictions with the fine-tuned model with medusa heads are the same as without medusa heads: 100.00% of test set 

In your own run, you might notice that a few observations don’t match exactly (for example, a 99% match rate) because of small floating-point differences caused by GPU optimizations.

Cleanup

At the end of this experiment, don’t forget to delete the SageMaker AI endpoints you created:

base_deployed_llm.delete_model()
base_deployed_llm.delete_endpoint()
sft_deployed_llm.delete_model()
sft_deployed_llm.delete_endpoint()
medusa_deployed_llm.delete_model()
medusa_deployed_llm.delete_endpoint()

Conclusion

In this post, we demonstrated how to fine-tune and deploy an LLM with Medusa heads using the Medusa-1 technique on Amazon SageMaker AI to accelerate LLM inference. By using this framework and SageMaker AI scalable infrastructure, we showed how to achieve up to twofold speedups in LLM inference while maintaining model quality. This solution is particularly beneficial for applications requiring low-latency text generation, such as customer service chat assistants, content creation, and recommendation systems.

As a next step, you can explore fine-tuning your own LLM with Medusa heads on your own dataset and benchmark the results for your specific use case, using the provided GitHub repository.


About the authors

Daniel Zagyva is a Senior ML Engineer at AWS Professional Services. He specializes in developing scalable, production-grade machine learning solutions for AWS customers. His experience extends across different areas, including natural language processing, generative AI and machine learning operations.

Aleksandra Dokic is a Senior Data Scientist at AWS Professional Services. She enjoys supporting customers to build innovative AI/ML solutions on AWS and she is excited about business transformations through the power of data.

Moran Beladev is a Senior ML Manager at Booking.com. She is leading the content intelligence track which is focused on building, training and deploying content models (computer vision, NLP and generative AI) using the most advanced technologies and models. Moran is also a PhD candidate, researching applying NLP models on social graphs.

Manos Stergiadis is a Senior ML Scientist at Booking.com. He specializes in generative NLP and has experience researching, implementing and deploying large deep learning models at scale.

Ilya Gusev is a Senior Machine Learning Engineer at Booking.com. He leads the development of several LLM systems inside Booking.com. His work focuses on building production ML systems that help millions of travelers plan their trips effectively.

Laurens van der Maas is a Machine Learning Engineer at AWS Professional Services. He works closely with customers building their machine learning solutions on AWS, specializes in natural language processing, experimentation and responsible AI, and is passionate about using machine learning to drive meaningful change in the world.

Read More

LLM-as-a-judge on Amazon Bedrock Model Evaluation

LLM-as-a-judge on Amazon Bedrock Model Evaluation

The evaluation of large language model (LLM) performance, particularly in response to a variety of prompts, is crucial for organizations aiming to harness the full potential of this rapidly evolving technology. The introduction of an LLM-as-a-judge framework represents a significant step forward in simplifying and streamlining the model evaluation process. This approach allows organizations to assess their AI models’ effectiveness using pre-defined metrics, making sure that the technology aligns with their specific needs and objectives. By adopting this method, companies can more accurately gauge the performance of their AI systems, making informed decisions about model selection, optimization, and deployment. This not only enhances the reliability and efficiency of AI applications, but also contributes to a more strategic and informed approach to technology adoption within the organization.

Amazon Bedrock, a fully managed service offering high-performing foundation models from leading AI companies through a single API, has recently introduced two significant evaluation capabilities: LLM-as-a-judge under Amazon Bedrock Model Evaluation and RAG evaluation for Amazon Bedrock Knowledge Bases. Both features use the LLM-as-a-judge technique behind the scenes but evaluate different things. This blog post explores LLM-as-a-judge on Amazon Bedrock Model Evaluation, providing comprehensive guidance on feature setup, evaluation job initiation through both the console and the Python SDK and APIs, and demonstrating how this innovative evaluation feature can enhance generative AI applications across multiple metric categories including quality, user experience, instruction following, and safety.

Before we explore the technical aspects and implementation details, let’s examine the key features that make LLM-as-a-judge on Amazon Bedrock Model Evaluation particularly powerful and distinguish it from traditional evaluation methods. Understanding these core capabilities will help illuminate why this feature represents a significant advancement in AI model evaluation.

Key features of LLM-as-a-judge

  1. Automated intelligent evaluation: LLM-as-a-judge uses pre-trained models to evaluate responses automatically, providing human-like evaluation quality with up to 98% cost savings. The system dramatically reduces evaluation time from weeks to hours while maintaining consistent evaluation standards across large datasets.
  2. Comprehensive metric categories: The evaluation system covers four key metric areas: quality assessment (correctness, completeness, faithfulness), user experience (helpfulness, coherence, relevance), instruction compliance (following instructions, professional style), and safety monitoring (harmfulness, stereotyping, refusal handling).
  3. Seamless integration: The feature integrates directly with Amazon Bedrock and remains compatible with existing Amazon Bedrock Model Evaluation features. Users can access the functionality through the AWS Management Console for Amazon Bedrock and quickly integrate their custom datasets for evaluation purposes.
  4. Flexible implementation: The system supports the evaluation of models hosted on Amazon Bedrock, custom fine-tuned models, and imported models. Users can seamlessly connect their evaluation datasets through Amazon Simple Storage Service (Amazon S3) buckets, making the evaluation process streamlined and efficient.
  5. Curated judge models: Amazon Bedrock provides pre-selected, high-quality evaluation models with optimized prompt engineering for accurate assessments. Users don’t need to bring external judge models, because the Amazon Bedrock team maintains and updates a selection of judge models and associated evaluation judge prompts.
  6. Cost-effective scaling: The feature enables organizations to perform comprehensive model evaluations at scale without the traditional costs and time investments associated with human evaluation. The automated process maintains high-quality assessments while significantly reducing operational overhead.

These features create a powerful evaluation framework that helps organizations optimize their AI model performance while maintaining high standards of quality and safety, all within their secure AWS environment.

Product overview

Now that you understand the key features of LLM-as-a-judge, let’s examine how to implement and use this capability within Amazon Bedrock Model Evaluation. This section provides a comprehensive overview of the architecture and walks through each component, demonstrating how they work together to deliver accurate and efficient model evaluations.

LLM-as-a-judge on Amazon Bedrock Model Evaluation provides a comprehensive, end-to-end solution for assessing and optimizing AI model performance. This automated process uses the power of LLMs to evaluate responses across multiple metric categories, offering insights that can significantly improve your AI applications. Let’s walk through the key components of this solution as shown in the following diagram:

LLM-as-a-judge on Amazon Bedrock Model Evaluation follows a streamlined workflow that enables systematic model evaluation. Here’s how each component works together in the evaluation process:

  • Prompt dataset: The process begins with a prepared dataset containing prompts that will be used to test the model’s performance. The evaluation can be conducted with or without ground truth responses—while including ground truth provides additional comparison points, it’s entirely optional and not required for successful evaluation.
  • JSONL file preparation: The prompt dataset is converted into JSONL format, which is specifically structured for LLM-as-a-judge evaluation jobs. This format promotes proper processing of evaluation data.
  • Amazon S3 storage: The prepared JSONL file is uploaded to an S3 bucket, serving as the secure storage location for the evaluation data.
  • Evaluation processing: The Amazon Bedrock LLM-as-a-judge model evaluation job processes the stored data, running comprehensive assessments across the selected metric categories (including quality, user experience, instruction following, and safety).
  • Automated report generation: Upon completion, the system generates detailed evaluation reports containing metrics, scores, and insights at both aggregate and individual response levels.
  • Expert analysis: Data scientists or machine learning engineers analyze the generated reports to derive actionable insights and make informed decisions.

With this solution architecture in mind, let’s explore how to implement LLM-as-a-judge model evaluations effectively, making sure that you get the most valuable insights from your assessment process.

Prerequisites

To use the LLM-as-a-judge model evaluation, make sure that you have satisfied the following requirements:

  • An active AWS account.
  • Selected evaluator and generator models enabled in Amazon Bedrock. You can confirm that the models are enabled for your account on the Model access page of the Amazon Bedrock console.
  • Confirm the AWS Regions where the models are available, and check the relevant service quotas.
  • Complete the model evaluation prerequisites related to AWS Identity and Access Management (IAM) service role creation, and add permissions for an S3 bucket to access and write output data.
  • If you’re using a custom model instead of an on-demand model for your generator model, make sure that you have sufficient quota for running a Provisioned Throughput during inference.
    • Complete the prerequisites for importing a custom model.
    • Go to the AWS Service Quotas console, and check the following quotas:
      • Model units no-commitment Provisioned Throughputs across custom models.
      • Model units per provisioned model for [your custom model name].
      • Both of these fields need to have enough quota to support your Provisioned Throughput model unit. Request a quota increase if necessary to accommodate your expected inference workload.

Prepare input dataset

When preparing your dataset for LLM-as-a-judge model evaluation jobs, each prompt must include specific key-value pairs. Here are the required and optional fields:

  • prompt (required): This key indicates the input for various tasks. It can be used for general text generation where the model needs to provide a response, question-answering tasks where the model must answer a specific question, text summarization tasks where the model needs to summarize a given text, or classification tasks where the model must categorize the provided text.
  • referenceResponse (used for specific metrics with ground truth): This key contains the ground truth or correct response. If provided, it serves as the reference point against which the model’s responses are evaluated.
  • category (optional): This key is used to generate evaluation scores reported by category, helping organize and segment evaluation results for better analysis.

Dataset requirements:

  • Each line must be a valid JSON object
  • The file must use JSONL format
  • The dataset should be stored in an Amazon S3 bucket

Example JSONL format without ground truth (category is optional):

{
    "prompt": "What is machine learning?",
    "category": "technical"
}
{
    "prompt": "Summarize climate change impacts",
    "category": "environmental"
}

Example JSONL format with ground truth (category is optional):

{
    "prompt": "What is machine learning?",
    "referenceResponse": "Machine learning is a subset of artificial intelligence that enables systems to learn and improve from experience without being explicitly programmed. It uses algorithms and statistical models to analyze and draw inferences from patterns in data, allowing computers to perform specific tasks without explicit instructions.",
    "category": "technical"
}
{
    "prompt": "Summarize climate change impacts",
    "referenceResponse": "Climate change leads to rising global temperatures, extreme weather events, sea level rise, and disruption of ecosystems. These changes result in more frequent natural disasters, threats to food security, loss of biodiversity, and various public health challenges. The impacts affect agriculture, coastal communities, and vulnerable populations disproportionately.",
    "category": "environmental"
}
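To make this concrete, the following is a small sketch (not part of the official sample code) that writes a handful of records to a JSONL file and uploads it to Amazon S3; the bucket name and key are placeholders:

import json
import boto3

records = [
    {"prompt": "What is machine learning?", "category": "technical"},
    {"prompt": "Summarize climate change impacts", "category": "environmental"},
]

# One JSON object per line (JSONL format)
jsonl_body = "\n".join(json.dumps(record) for record in records)

s3 = boto3.client("s3")
s3.put_object(
    Bucket="<YOUR_BUCKET>",               # placeholder bucket name
    Key="evaluation_data/input.jsonl",    # matches the input path used later in this post
    Body=jsonl_body.encode("utf-8"),
)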

Start an LLM-as-a-judge model evaluation job using the console

You can use LLM-as-a-judge on Amazon Bedrock Model Evaluation to assess model performance through a user-friendly console interface. Follow these steps to start an evaluation job:

  1. In the Amazon Bedrock console, choose Inference and Assessment, then select Evaluations. On the Evaluations page, choose the Models tab.

  2. Choose Create and select Automatic: LLM-as-a-judge.
  3. Enter a name and description and select an Evaluator model. This model will be used as a judge to evaluate the response of a prompt or model from your generative AI application.

  4. Choose Tags and select the model to be used for generating responses in this evaluation job.

  5. Select the metrics you want to use to evaluate the model response (such as helpfulness, correctness, faithfulness, relevance, and harmfulness).

  6. For Choose a prompt dataset and Evaluation results, select the S3 URIs. You can use the Browse S3 option.

  7. Select or create an IAM service role with the proper permissions. This includes service access to Amazon Bedrock, the S3 buckets in the evaluation job, and the models being used in the job. If you create a new IAM role in the evaluation setup, the service will automatically give the role the proper permissions for the job. Specify the output S3 bucket and choose Create.

  8. You can see that the evaluation job is In Progress. Wait for the job status to change to Complete.

  9. When complete, select the job to see its details. The following is the metrics summary (such as 0.83 for helpfulness, 1.00 for correctness, 1.00 for faithfulness, 1.00 for relevance, and 0.00 for harmfulness).

  10. To view generation metrics details, scroll down in the model evaluation report and choose any individual metric (like helpfulness or correctness) to see its detailed breakdown.

  11. To see each record’s prompt input, generation output, ground truth, and individual scores, choose a metric and select Prompt details. Hover over any individual score to view its detailed explanation.

Start an LLM-as-a-judge evaluation job using Python SDK and APIs

To use the Python SDK for creating an LLM-as-a-judge model evaluation job, use the following steps. First, set up the required configurations:

import boto3
from datetime import datetime

# Generate unique name for the job
job_name = f"Model-evaluation-{datetime.now().strftime('%Y-%m-%d-%H-%M-%S')}"

# Configure your evaluator and generator model settings
evaluator_model = "mistral.mistral-large-2402-v1:0"
generator_model = "amazon.nova-pro-v1:0"
role_arn = "arn:aws:iam::<YOUR_ACCOUNT_ID>:role/<YOUR_IAM_ROLE>"

# Specify S3 locations for evaluation data and output
input_data = "s3://<YOUR_BUCKET>/evaluation_data/input.jsonl"
output_path = "s3://<YOUR_BUCKET>/evaluation_output/"

# Create Bedrock client
bedrock_client = boto3.client('bedrock')

To create an LLM-as-a-judge model evaluation job:

def create_llm_judge_evaluation(
    client,
    job_name: str,
    role_arn: str,
    input_s3_uri: str,
    output_s3_uri: str,
    evaluator_model_id: str,
    generator_model_id: str,
    dataset_name: str = None,
    task_type: str = "General" # must be General for LLMaaJ
):    
    # All available LLM-as-judge metrics
    llm_judge_metrics = [
        "Builtin.Correctness",
        "Builtin.Completeness", 
        "Builtin.Faithfulness",
        "Builtin.Helpfulness",
        "Builtin.Coherence",
        "Builtin.Relevance",
        "Builtin.FollowingInstructions",
        "Builtin.ProfessionalStyleAndTone",
        "Builtin.Harmfulness",
        "Builtin.Stereotyping",
        "Builtin.Refusal"
    ]

    # Configure dataset
    dataset_config = {
        "name": dataset_name or "CustomDataset",
        "datasetLocation": {
            "s3Uri": input_s3_uri
        }
    }

    try:
        response = client.create_evaluation_job(
            jobName=job_name,
            roleArn=role_arn,
            applicationType="ModelEvaluation",
            evaluationConfig={
                "automated": {
                    "datasetMetricConfigs": [
                        {
                            "taskType": task_type,
                            "dataset": dataset_config,
                            "metricNames": llm_judge_metrics
                        }
                    ],
                    "evaluatorModelConfig": {
                        "bedrockEvaluatorModels": [
                            {
                                "modelIdentifier": evaluator_model_id
                            }
                        ]
                    }
                }
            },
            inferenceConfig={
                "models": [
                    {
                        "bedrockModel": {
                            "modelIdentifier": generator_model_id
                        }
                    }
                ]
            },
            outputDataConfig={
                "s3Uri": output_s3_uri
            }
        )
        return response
        
    except Exception as e:
        print(f"Error creating evaluation job: {str(e)}")
        raise
        
# Create evaluation job
try:
    llm_as_judge_response = create_llm_judge_evaluation(
        client=bedrock_client,
        job_name=job_name,
        role_arn=role_arn,
        input_s3_uri=input_data,
        output_s3_uri=output_path,
        evaluator_model_id=evaluator_model,
        generator_model_id=generator_model,
        task_type="General"
    )
    print(f"✓ Created evaluation job: {llm_as_judge_response['jobArn']}")
except Exception as e:
    print(f"✗ Failed to create evaluation job: {str(e)}")
    raise

To monitor the progress of your evaluation job:

# Get job ARN based on job type
evaluation_job_arn = llm_as_judge_response['jobArn']
# Check job status
check_status = bedrock_client.get_evaluation_job(jobIdentifier=evaluation_job_arn) 
print(f"Job Status: {check_status['status']}")

You can also compare multiple foundation models to determine which one works best for your needs. By using the same evaluator model across all comparisons, you’ll get consistent benchmarking results to help identify the optimal model for your use case.

from typing import Any, Dict, List

# Generator Models
GENERATOR_MODELS = [
    "anthropic.claude-3-haiku-20240307-v1:0",
    "amazon.nova-micro-v1:0"
]

# Consistent Evaluator
EVALUATOR_MODEL = "anthropic.claude-3-haiku-20240307-v1:0"

def run_model_comparison(
    generator_models: List[str],
    evaluator_model: str
) -> List[Dict[str, Any]]:
    evaluation_jobs = []
    
    for generator_model in generator_models:
        job_name = f"llmaaj-{generator_model.split('.')[0]}-{evaluator_model.split('.')[0]}-{datetime.now().strftime('%Y-%m-%d-%H-%M-%S')}"
        
        try:
            response = create_llm_judge_evaluation(
                client=bedrock_client,
                job_name=job_name,
                role_arn=role_arn,
                input_s3_uri=input_data,
                output_s3_uri=f"{output_path}/{job_name}/",
                evaluator_model_id=evaluator_model,
                generator_model_id=generator_model,
                task_type="General"
            )
            
            job_info = {
                "job_name": job_name,
                "job_arn": response["jobArn"],
                "generator_model": generator_model,
                "evaluator_model": evaluator_model,
                "status": "CREATED"
            }
            evaluation_jobs.append(job_info)
            
            print(f"✓ Created job: {job_name}")
            print(f"  Generator: {generator_model}")
            print(f"  Evaluator: {evaluator_model}")
            print("-" * 80)
            
        except Exception as e:
            print(f"✗ Error with {generator_model}: {str(e)}")
            continue
            
    return evaluation_jobs

# Run model comparison
evaluation_jobs = run_model_comparison(GENERATOR_MODELS, EVALUATOR_MODEL)

Correlation analysis for LLM-as-a-judge evaluations

You can use Spearman’s rank correlation coefficient to compare evaluation results between different generator models using LLM-as-a-judge in Amazon Bedrock. After retrieving the evaluation results, which contain evaluation scores across various metrics, from your S3 bucket, you can begin the correlation analysis.

Using scipy.stats, compute the correlation coefficient between pairs of generator models, filtering out constant values or error messages to have a valid statistical comparison. The resulting correlation coefficients help identify how similarly different models respond to the same prompts. A coefficient closer to 1.0 indicates stronger agreement between the models’ responses, while values closer to 0 suggest more divergent behavior. This analysis provides valuable insights into model consistency and helps identify cases where different models might produce significantly different outputs for the same input.

import json
import boto3
import numpy as np
from scipy import stats

def read_and_organize_metrics_from_s3(bucket_name, file_key):
    s3_client = boto3.client('s3')
    metrics_dict = {}
    
    try:
        response = s3_client.get_object(Bucket=bucket_name, Key=file_key)
        content = response['Body'].read().decode('utf-8')
        
        for line in content.strip().split('\n'):
            if line:
                data = json.loads(line)
                if 'automatedEvaluationResult' in data and 'scores' in data['automatedEvaluationResult']:
                    for score in data['automatedEvaluationResult']['scores']:
                        metric_name = score['metricName']
                        if 'result' in score:
                            metric_value = score['result']
                            if metric_name not in metrics_dict:
                                metrics_dict[metric_name] = []
                            metrics_dict[metric_name].append(metric_value)
        return metrics_dict
    
    except Exception as e:
        print(f"Error: {e}")
        return None

def get_spearmanr_correlation(scores1, scores2):
    if len(set(scores1)) == 1 or len(set(scores2)) == 1:
        return "undefined (constant scores)", "undefined"
    
    try:
        result = stats.spearmanr(scores1, scores2)
        return round(float(result.statistic), 4), round(float(result.pvalue), 4)
    except Exception as e:
        return f"error: {str(e)}", "undefined"

# Extract metrics
bucket_name = "<EVALUATION_OUTPUT_BUCKET>"
file_key1 = "<EVALUATION_FILE_KEY1>"
file_key2 = "<EVALUATION_FILE_KEY2>"

metrics1 = read_and_organize_metrics_from_s3(bucket_name, file_key1)
metrics2 = read_and_organize_metrics_from_s3(bucket_name, file_key2)

# Calculate correlations for common metrics
common_metrics = set(metrics1.keys()) & set(metrics2.keys())

for metric_name in common_metrics:
    scores1 = metrics1[metric_name]
    scores2 = metrics2[metric_name]
    
    if len(scores1) == len(scores2):
        correlation, p_value = get_spearmanr_correlation(scores1, scores2)
        
        print(f"nMetric: {metric_name}")
        print(f"Number of samples: {len(scores1)}")
        print(f"Unique values in Model 1 scores: {len(set(scores1))}")
        print(f"Unique values in Model 2 scores: {len(set(scores2))}")
        print(f"Model 1 scores range: [{min(scores1)}, {max(scores1)}]")
        print(f"Model 2 scores range: [{min(scores2)}, {max(scores2)}]")
        print(f"Spearman correlation coefficient: {correlation}")
        print(f"P-value: {p_value}")
    else:
        print(f"nMetric: {metric_name}")
        print("Error: Different number of samples between models")

Best practices for LLM-as-a-judge implementation

The following best practices will help you establish standardized, repeatable benchmarking when comparing different foundation models with LLM-as-a-judge.

  • Create diverse test datasets that represent real-world use cases and edge cases. For large workloads (more than 1,000 prompts), use stratified sampling to maintain comprehensive coverage while managing costs and completion time (see the sketch after this list). Include both simple and complex prompts to test model capabilities across different difficulty levels.
  • Choose evaluation metrics that align with your specific business objectives and application requirements. Balance quality metrics (correctness, completeness) with user experience metrics (helpfulness, coherence). Include safety metrics when deploying customer-facing applications.
  • Maintain consistent evaluation conditions when comparing different models. Use the same evaluator model across comparisons for standardized benchmarking. Document your evaluation configuration and parameters for reproducibility.
  • Schedule regular evaluation jobs to track model performance over time. Monitor trends across different metric categories to identify areas for improvement. Set up performance baselines and thresholds for each metric.
  • Optimize batch sizes based on your evaluation needs and cost constraints. Consider using smaller test sets for rapid iteration and larger sets for comprehensive evaluation. Balance evaluation frequency with resource utilization.
  • Maintain detailed records of evaluation jobs, including configurations and results. Track improvements and changes in model performance over time. Document any modifications made based on evaluation insights. The optional job description field can help you here.
  • Use evaluation results to guide model selection and optimization. Implement feedback loops to continuously improve prompt engineering. Regularly update evaluation criteria based on emerging requirements and user feedback.
  • Design your evaluation framework to accommodate growing workloads. Plan for increased complexity as you add more models or use cases. Consider automated workflows for regular evaluation tasks.
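As referenced in the first best practice, the following is a minimal sketch of stratified sampling by category. It assumes your dataset is a list of dictionaries with an optional category key, like the JSONL records shown earlier; the sample size is illustrative:

import random
from collections import defaultdict

def stratified_sample(dataset, samples_per_category=100, seed=42):
    """Draw an equal-sized random sample from each category."""
    random.seed(seed)
    by_category = defaultdict(list)
    for record in dataset:
        by_category[record.get("category", "uncategorized")].append(record)

    sampled = []
    for category, records in by_category.items():
        sampled.extend(random.sample(records, min(samples_per_category, len(records))))
    return sampled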

These best practices help establish a robust evaluation framework using LLM-as-a-judge on Amazon Bedrock. For deeper insights into the scientific validation of these practices, including case studies and correlation with human judgments, stay tuned for our upcoming technical deep-dive blog post.

Conclusion

LLM-as-a-judge on Amazon Bedrock Model Evaluation represents a significant advancement in automated model assessment, offering organizations a powerful tool to evaluate and optimize their AI applications systematically. This feature combines the efficiency of automated evaluation with the nuanced understanding typically associated with human assessment, enabling organizations to scale their quality assurance processes while maintaining high standards of performance and safety.

The comprehensive metric categories, flexible implementation options, and seamless integration with existing AWS services make it possible for organizations to establish robust evaluation frameworks that grow with their needs. Whether you’re developing conversational AI applications, content generation systems, or specialized enterprise solutions, LLM-as-a-judge provides the necessary tools to make sure that your models align with both technical requirements and business objectives.

We’ve provided detailed implementation guidance, from initial setup to best practices, to help you use this feature effectively. The accompanying code samples and configuration examples in this post demonstrate how to implement these evaluations in practice. Through systematic evaluation and continuous improvement, organizations can build more reliable, accurate, and trustworthy AI applications.

We encourage you to explore LLM-as-a-judge capabilities in the Amazon Bedrock console and discover how automatic evaluation can enhance your AI applications. To help you get started, we’ve prepared a Jupyter notebook with practical examples and code snippets that you can find on our GitHub repository.


About the Authors

Adewale Akinfaderin is a Sr. Data Scientist–Generative AI, Amazon Bedrock, where he contributes to cutting edge innovations in foundational models and generative AI applications at AWS. His expertise is in reproducible and end-to-end AI/ML methods, practical implementations, and helping global customers formulate and develop scalable solutions to interdisciplinary problems. He has two graduate degrees in physics and a doctorate in engineering.

Ishan Singh is a Generative AI Data Scientist at Amazon Web Services, where he helps customers build innovative and responsible generative AI solutions and products. With a strong background in AI/ML, Ishan specializes in building Generative AI solutions that drive business value. Outside of work, he enjoys playing volleyball, exploring local bike trails, and spending time with his wife and dog, Beau.

Jesse Manders is a Senior Product Manager on Amazon Bedrock, the AWS Generative AI developer service. He works at the intersection of AI and human interaction with the goal of creating and improving generative AI products and services to meet our needs. Previously, Jesse held engineering team leadership roles at Apple and Lumileds, and was a senior scientist in a Silicon Valley startup. He has an M.S. and Ph.D. from the University of Florida, and an MBA from the University of California, Berkeley, Haas School of Business.

Read More

From concept to reality: Navigating the Journey of RAG from proof of concept to production

From concept to reality: Navigating the Journey of RAG from proof of concept to production

Generative AI has emerged as a transformative force, captivating industries with its potential to create, innovate, and solve complex problems. However, the journey from a proof of concept to a production-ready application comes with challenges and opportunities. Moving from proof of concept to production is about creating scalable, reliable, and impactful solutions that can drive business value and user satisfaction.

One of the most promising developments in this space is the rise of Retrieval Augmented Generation (RAG) applications. RAG is the process of optimizing the output of a foundation model (FM), so it references a knowledge base outside of its training data sources before generating a response.

The following diagram illustrates a sample architecture.

In this post, we explore the movement of RAG applications from their proof of concept or minimal viable product (MVP) phase to full-fledged production systems. When transitioning a RAG application from a proof of concept to a production-ready system, optimization becomes crucial to make sure the solution is reliable, cost-effective, and high-performing. Let’s explore these optimization techniques in greater depth, setting the stage for future discussions on hosting, scaling, security, and observability considerations.

Optimization techniques

The diagram below illustrates the tradeoffs to consider for a production-ready RAG application.

The success of a production-ready RAG system is measured by its quality, cost, and latency. Machine learning (ML) engineers must make trade-offs and prioritize the most important factors for their specific use case and business requirements. For example, consider the use case of generating personalized marketing content for a luxury fashion brand. The brand might be willing to absorb the higher cost of using a more powerful and expensive FM to achieve the highest-quality content, because poor or off-brand output could lead to customer dissatisfaction and damage the brand’s reputation. Consider another use case of generating personalized product descriptions for an ecommerce site. The retailer might be willing to accept slightly longer latency to reduce infrastructure and operational costs, as long as the generated descriptions remain reasonably accurate and compelling. The optimal balance of quality, cost, and latency can vary significantly across different applications and industries.

Let’s look into practical guidelines on how you can enhance the overall quality of your RAG workflow, including the quality of the retriever and quality of the result generator using Amazon Bedrock Knowledge Bases and other features of Amazon Bedrock. Amazon Bedrock Knowledge Bases provides a fully managed capability that helps you implement the entire RAG workflow from ingestion to retrieval and prompt augmentation without having to build custom integrations to data sources and manage data flows.
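For illustration only, a single managed retrieve-and-generate call against a knowledge base might look like the following sketch; the knowledge base ID is a placeholder and the model ARN is just an example:

import boto3

bedrock_agent_runtime = boto3.client("bedrock-agent-runtime")

# Retrieve relevant chunks from the knowledge base and generate a grounded answer in one call.
response = bedrock_agent_runtime.retrieve_and_generate(
    input={"text": "What is our refund policy for international orders?"},
    retrieveAndGenerateConfiguration={
        "type": "KNOWLEDGE_BASE",
        "knowledgeBaseConfiguration": {
            "knowledgeBaseId": "<YOUR_KNOWLEDGE_BASE_ID>",  # placeholder
            "modelArn": "arn:aws:bedrock:us-east-1::foundation-model/anthropic.claude-3-haiku-20240307-v1:0",
        },
    },
)
print(response["output"]["text"])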

Evaluation framework

An effective evaluation framework is crucial for assessing and optimizing RAG systems as they move from proof of concept to production. These frameworks typically include overall metrics for a holistic assessment of the entire RAG pipeline, as well as specific diagnostic metrics for both the retrieval and generation components. This allows for targeted improvements in each phase of the system. By implementing a robust evaluation framework, developers can continuously monitor, diagnose, and enhance their RAG systems, achieving optimal performance across quality, cost, and latency dimensions as the application scales to production levels. Amazon Bedrock Evaluations can help you evaluate your retrieval or end-to-end RAG workflow in Amazon Bedrock Knowledge Bases. In the following sections, we discuss these specific metrics in different phases of the RAG workflow in more detail.

Retriever quality

For better retrieval performance, the way the data is stored in the vector store has a big impact. For example, your input document might include tables within the PDF. In such cases, using an FM to parse the data will provide better results. You can use advanced parsing options supported by Amazon Bedrock Knowledge Bases for parsing non-textual information from documents using FMs. Many organizations store their data in structured formats within data warehouses and data lakes. Amazon Bedrock Knowledge Bases offers a feature that lets you connect your RAG workflow to structured data stores. This fully managed out-of-the-box RAG solution can help you natively query structured data from where it resides.

Another important consideration is the way your source document is split up into chunks. If your document would benefit from inherent relationships within your document, it might be wise to use hierarchical chunking, which allows for more granular and efficient retrieval. Some documents benefit from semantic chunking by preserving the contextual relationship in the chunks, helping make sure that the related information stays together in logical chunks. You can also use your own custom chunking strategy for your RAG application’s unique requirements.

RAG applications process user queries by searching across a large set of documents. However, in many situations, you might need to retrieve documents with specific attributes or content. You can use metadata filtering to narrow down search results by specifying inclusion and exclusion criteria. Amazon Bedrock Knowledge Bases now also supports auto generated query filters, which extend the existing capability of manual metadata filtering by allowing you to narrow down search results without the need to manually construct complex filter expressions. This improves retrieval accuracy by making sure the documents are relevant to the query.
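As a hedged illustration, a metadata-filtered retrieval call could look like the following sketch; the knowledge base ID and the year metadata attribute are placeholders, and your documents must have been ingested with matching metadata for the filter to apply:

import boto3

bedrock_agent_runtime = boto3.client("bedrock-agent-runtime")

# Retrieve only chunks whose metadata attribute "year" equals "2024" (placeholder attribute).
response = bedrock_agent_runtime.retrieve(
    knowledgeBaseId="<YOUR_KNOWLEDGE_BASE_ID>",
    retrievalQuery={"text": "Summarize the quarterly revenue trends"},
    retrievalConfiguration={
        "vectorSearchConfiguration": {
            "numberOfResults": 5,
            "filter": {"equals": {"key": "year", "value": "2024"}},
        }
    },
)
for result in response["retrievalResults"]:
    print(result["content"]["text"][:200])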

Generator quality

Writing an effective query is just as important as any other consideration for generation accuracy. You can add a prompt providing instructions to the FM to provide an appropriate answer to the user. For example, a legal tech company would want to provide instructions to restrict the answers to be based on the input documents and not based on general information known to the FM. Query decomposition by splitting the input query into multiple queries is also helpful in retrieval accuracy. In this process, the subqueries with less semantic complexity might find more targeted chunks. These chunks can then be pooled and ranked together before passing them to the FM to generate a response.

Reranking, as a post-retrieval step, can significantly improve response quality. This technique uses LLMs to analyze the semantic relevance between the query and retrieved documents, reordering them based on their pertinence. By incorporating reranking, you make sure that only the most contextually relevant information is used for generation, leading to more accurate and coherent responses.

Adjusting inference parameters, such as temperature and top-k/p sampling, can help in further refining the output.
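For example, a generation call with explicit inference parameters might look like the following sketch using the Amazon Bedrock Converse API; the model ID and parameter values are illustrative, not recommendations:

import boto3

bedrock_runtime = boto3.client("bedrock-runtime")

response = bedrock_runtime.converse(
    modelId="anthropic.claude-3-haiku-20240307-v1:0",
    messages=[{"role": "user", "content": [{"text": "Answer using only the provided context: ..."}]}],
    inferenceConfig={
        "temperature": 0.2,  # lower temperature for more deterministic answers
        "topP": 0.9,
        "maxTokens": 512,
    },
)
print(response["output"]["message"]["content"][0]["text"])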

You can use Amazon Bedrock Knowledge Bases to configure and customize queries and response generation. You can also improve the relevance of your query responses with a reranker model in Amazon Bedrock.

Overall quality

The key metrics for retriever quality are context precision, context recall, and context relevance. Context precision measures how well the system ranks relevant pieces of information from the given context. It considers the question, ground truth, and context. Context recall provides the percentage of ground truth claims or key information covered by the retrieved context. Context relevance measures whether the retrieved passages or chunks are relevant for answering the given query, excluding extraneous details. Together, these three metrics offer insight into how effectively the retriever is able to surface the most relevant and focused source material to support a high-quality response.

Generator quality can be assessed through several key metrics. Context utilization examines how effectively the generator uses relevant information from the provided source material. Noise sensitivity gauges the generator’s propensity to include inaccurate details from the retrieved content. Hallucination measures the extent to which the generator produces incorrect claims not present in the source data. Self-knowledge reflects the proportion of accurate statements generated that can’t be found in the retrieved chunks. Finally, faithfulness evaluates how closely the generator’s output aligns with the information contained in the source material.

For measuring the overall generation quality, the key metrics include measuring the precision, recall, and answer similarity. Precision suggests the proportion of the correct claims in model’s response, whereas recall suggests the proportion of the ground truth claims covered by the model’s response. Answer similarity compares the meaning and content of a generated answer with a reference or ground truth answer. It evaluates how closely the generated answer matches the intended meaning of the ground truth answer.
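To make these definitions concrete, the following simplified sketch computes claim-level precision and recall given sets of claims. In practice, extracting claims from free text typically requires an LLM or natural language inference model; the claims here are hard-coded purely for illustration:

def claim_precision_recall(response_claims: set, ground_truth_claims: set):
    """Precision = correct claims / response claims; recall = covered claims / ground truth claims."""
    correct = response_claims & ground_truth_claims
    precision = len(correct) / len(response_claims) if response_claims else 0.0
    recall = len(correct) / len(ground_truth_claims) if ground_truth_claims else 0.0
    return precision, recall

response_claims = {"revenue grew 12% in 2023", "growth was driven by APAC"}
ground_truth_claims = {"revenue grew 12% in 2023", "growth was driven by APAC", "margins stayed flat"}

precision, recall = claim_precision_recall(response_claims, ground_truth_claims)
print(f"Precision: {precision:.2f}, Recall: {recall:.2f}")  # Precision: 1.00, Recall: 0.67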

Establishing a feedback loop with an evaluation framework against these quality metrics allows for continuous improvement, where the system can learn from user interactions and refine its performance over time. By optimizing these quality metrics, the RAG system can be designed to deliver reliable, cost-effective, and high-performing results for users.

For a demonstration on how you can use a RAG evaluation framework in Amazon Bedrock to compute RAG quality metrics, refer to New RAG evaluation and LLM-as-a-judge capabilities in Amazon Bedrock.

Responsible AI

Implementing responsible AI practices is crucial for maintaining ethical and safe deployment of RAG systems. This includes using guardrails to filter harmful content, deny certain topics, mask sensitive information, and ground responses in verified sources to reduce hallucinations.

You can use Amazon Bedrock Guardrails for implementing responsible AI policies. Along with protecting against toxicity and harmful content, it can also be used for Automated Reasoning checks, which helps you protect against hallucinations.

Cost and latency

Cost considers the compute resources and infrastructure required to run the system, and latency evaluates the response times experienced by end-users. To optimize cost and latency, implement caching strategies to reduce the need for expensive model inferences. Efficient query batching can also improve overall throughput and reduce resource usage. Balance performance and resource usage to find the ideal configuration that meets your application’s requirements.
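As a simple illustration of the caching idea, the following sketch wraps a generation call with an exact-match cache; production systems typically use a shared semantic cache (for example, Amazon MemoryDB, which appears in the hosting architecture later in this post), and generate_answer is a placeholder for your RAG generation function:

import hashlib

_cache = {}

def cached_generate(query: str, generate_answer):
    key = hashlib.sha256(query.strip().lower().encode("utf-8")).hexdigest()
    if key in _cache:
        return _cache[key]           # cache hit: skip the expensive model inference
    answer = generate_answer(query)  # cache miss: call the RAG pipeline
    _cache[key] = answer
    return answer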

Use tools like Amazon Bedrock Knowledge Bases so you can take advantage of fully managed support for the end-to-end RAG workflow. It supports many of the advanced RAG capabilities we discussed earlier. By addressing these optimization techniques, you can transition your RAG-powered proof of concept to a robust, production-ready system that delivers high-quality, cost-effective, and low-latency responses to your users.

For more information on building RAG applications using Amazon Bedrock Knowledge Bases, refer to Building scalable, secure, and reliable RAG applications using Amazon Bedrock Knowledge Bases.

Hosting and scaling

When it comes to hosting your web application or service, there are several approaches to consider. The key is to choose a solution that can effectively host your database and compute infrastructure. This could include server-based options like Amazon Elastic Compute Cloud (Amazon EC2), managed services like Amazon Relational Database Service (Amazon RDS) and Amazon DynamoDB, or serverless approaches such as AWS Amplify and Amazon Elastic Container Service (Amazon ECS). For a practical approach to building an automated AI assistant using Amazon ECS, see Develop a fully automated chat-based assistant by using Amazon Bedrock agents and knowledge bases.

In addition to the server or compute layer, you will also need to consider an orchestration tool, testing environments, and a continuous integration and delivery (CI/CD) pipeline to streamline your application deployment. Having a feedback loop established based on the quality metrics along with a CI/CD pipeline is an important first step to creating self-healing architectures.

As your application grows, you will need to make sure your infrastructure can scale to meet the increasing demand. This can involve containerization with Docker or choosing serverless options, implementing load balancing, setting up auto scaling, and choosing between on-premises, cloud, or hybrid solutions. It also includes unique scaling requirements of your frontend application and backend generative AI workflow, as well as the use of content delivery networks (CDNs) and disaster recovery and backup strategies.

The following is a sample architecture for a secure and scalable RAG-based web application. This architecture uses Amazon ECS for hosting the service, Amazon CloudFront as a CDN, AWS WAF as a firewall, and Amazon MemoryDB for providing a semantic cache.

By carefully considering these aspects of hosting and scaling your infrastructure, you can build a resilient and adaptable system to support your growing web application or service. Stay tuned for more detailed information on these topics in upcoming blog posts.

Data privacy, security, and observability

Maintaining data privacy and security is of utmost importance. This includes implementing security measures at each layer of your application, from encrypting data in transit to setting up robust authentication and authorization controls. It also involves focusing on compute and storage security, as well as network security. Compliance with relevant regulations and regular security audits are essential. Securing your generative AI system is another crucial aspect. By default, Amazon Bedrock Knowledge Bases encrypts the traffic using AWS managed AWS Key Management Service (AWS KMS) keys. You can also choose customer managed KMS keys for more control over encryption keys. For more information on application security, refer to Safeguard a generative AI travel agent with prompt engineering and Amazon Bedrock Guardrails.

Comprehensive logging, monitoring, and maintenance are crucial to maintaining a healthy infrastructure. This includes setting up structured logging, centralized log management, real-time monitoring, and strategies for system updates and migrations.

By addressing these critical areas, you can build a secure and resilient infrastructure to support your growing web application or service. Stay tuned for more in-depth coverage of these topics in upcoming blog posts.

Conclusion

To successfully transition a RAG application from a proof of concept to a production-ready system, you should focus on optimizing the solution for reliability, cost-effectiveness, and high performance. Key areas to address include enhancing retriever and generator quality, balancing cost and latency, and establishing a robust and secure infrastructure.

By using purpose-built tools like Amazon Bedrock Knowledge Bases to streamline the end-to-end RAG workflow, organizations can successfully transition their RAG-powered proofs of concept into high-performing, cost-effective, secure production-ready solutions that deliver business value.



About the Authors

Vivek Mittal is a Solution Architect at Amazon Web Services, where he helps organizations architect and implement cutting-edge cloud solutions. With a deep passion for Generative AI, Machine Learning, and Serverless technologies, he specializes in helping customers harness these innovations to drive business transformation. He finds particular satisfaction in collaborating with customers to turn their ambitious technological visions into reality.

Nitin Eusebius is a Sr. Enterprise Solutions Architect at AWS, experienced in Software Engineering, Enterprise Architecture, and AI/ML. He is deeply passionate about exploring the possibilities of generative AI. He collaborates with customers to help them build well-architected applications on the AWS platform, and is dedicated to solving technology challenges and assisting with their cloud journey.

Mani Khanuja is a Tech Lead – Generative AI Specialists, author of the book Applied Machine Learning and High-Performance Computing on AWS, and a member of the Board of Directors for the Women in Manufacturing Education Foundation. She leads machine learning projects in various domains such as computer vision, natural language processing, and generative AI. She speaks at internal and external conferences such as AWS re:Invent, Women in Manufacturing West, YouTube webinars, and GHC 23. In her free time, she likes to go for long runs along the beach.

Read More

Meta SAM 2.1 is now available in Amazon SageMaker JumpStart

Meta SAM 2.1 is now available in Amazon SageMaker JumpStart

This blog post is co-written with George Orlin from Meta.

Today, we are excited to announce that Meta’s Segment Anything Model (SAM) 2.1 vision segmentation model is publicly available through Amazon SageMaker JumpStart to deploy and run inference. Meta SAM 2.1 provides state-of-the-art video and image segmentation capabilities in a single model. This cutting-edge model supports long-context processing, complex segmentation scenarios, and fine-grained analysis, making it ideal for automating processes for various industries such as medical imaging in healthcare, satellite imagery for environment monitoring, and object segmentation for autonomous systems. Meta SAM 2.1 is well suited for zero-shot object segmentation and accurate object detection based on simple prompts such as point coordinates and bounding boxes in a frame for video tracking and image masking.

This model was predominantly trained on AWS, and AWS will also be the first cloud provider to make it available to customers. In this post, we walk through how to discover and deploy the Meta SAM 2.1 model using SageMaker JumpStart.

Meta SAM 2.1 overview

Meta SAM 2.1 is a state-of-the-art vision segmentation model designed for high-performance computer vision tasks, enabling advanced object detection and segmentation workflows. Building upon its predecessor, version 2.1 introduces enhanced segmentation accuracy, robust generalization across diverse datasets, and scalability for production-grade applications. These features enable AI researchers and developers in computer vision, image processing, and data-driven research to improve tasks that require detailed analysis and segmentation across multiple fields.

Meta SAM 2.1 has a streamlined architecture that is optimized for integration with popular model-serving frameworks like TorchServe and can be deployed on Amazon SageMaker AI to power real-time or batch inference pipelines. Meta SAM 2.1 empowers organizations to achieve precise segmentation outcomes in vision-centric workflows with minimal configuration and maximum efficiency.

Meta SAM 2.1 offers multiple variants—Tiny, Small, Base Plus, and Large—available now on SageMaker JumpStart, balancing model size, speed, and segmentation performance to cater to diverse application needs.

SageMaker JumpStart overview

SageMaker JumpStart offers access to a broad selection of publicly available foundation models (FMs). These pre-trained models serve as powerful starting points that can be deeply customized to address specific use cases. You can now use state-of-the-art model architectures, such as language models, computer vision models, and more, without having to build them from scratch.

With SageMaker JumpStart, you can deploy models in a secure environment. Models hosted on JumpStart can be provisioned on dedicated SageMaker Inference instances, including AWS Trainium and AWS Inferentia based instances, and are isolated within your virtual private cloud (VPC). This enforces data security and compliance, because the models operate under your own VPC controls, rather than in a shared public environment. After deploying an FM, you can further customize and fine-tune it using the extensive capabilities of SageMaker AI, including SageMaker Inference for deploying models and container logs for improved observability. With SageMaker AI, you can streamline the entire model deployment process.

Prerequisites

Make sure you have the following prerequisites to deploy Meta SAM 2.1 and run inference:

  • An AWS account that will contain all your AWS resources.
  • An AWS Identity and Access Management (IAM) role to access SageMaker AI. To learn more about how IAM works with SageMaker AI, refer to Identity and Access Management for Amazon SageMaker AI.
  • Access to Amazon SageMaker Studio, a SageMaker notebook instance, or an interactive development environment (IDE) such as PyCharm or Visual Studio Code. We recommend using SageMaker Studio for straightforward deployment and inference.
  • Access to accelerated instances (GPUs) for hosting the model.

Discover Meta SAM 2.1 in SageMaker JumpStart

SageMaker JumpStart provides FMs through two primary interfaces: SageMaker Studio and the SageMaker Python SDK. This provides multiple options to discover and use hundreds of models for your specific use case.

SageMaker Studio is a comprehensive IDE that offers a unified, web-based interface for performing all aspects of the machine learning (ML) development lifecycle. From preparing data to building, training, and deploying models, SageMaker Studio provides purpose-built tools to streamline the entire process. In SageMaker Studio, you can access SageMaker JumpStart to discover and explore the extensive catalog of FMs available for deployment on SageMaker Inference.

You can access the SageMaker JumpStart UI through either Amazon SageMaker Unified Studio or SageMaker Studio. To deploy Meta SAM 2.1 using the SageMaker JumpStart UI, complete the following steps:

  1. In SageMaker Unified Studio, on the Build menu, choose JumpStart models. If you’re already on the SageMaker Studio console, choose JumpStart in the navigation pane instead.
  2. When prompted, create a project, after which you can begin deployment.

Alternatively, you can use the SageMaker Python SDK to programmatically access and use SageMaker JumpStart models. This approach allows for greater flexibility and integration with existing AI/ML workflows and pipelines. By providing multiple access points, SageMaker JumpStart helps you seamlessly incorporate pre-trained models into your AI/ML development efforts, regardless of your preferred interface or workflow.
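As a quick illustration of programmatic discovery, the following sketch lists the available JumpStart model IDs and filters for the SAM 2.1 variants; the substring filter is an assumption based on the model IDs shown later in this post:

from sagemaker.jumpstart.notebook_utils import list_jumpstart_models

# List all JumpStart model IDs and keep the Meta SAM 2.1 variants
all_model_ids = list_jumpstart_models()
sam_model_ids = [m for m in all_model_ids if "sam-2-1" in m]
print(sam_model_ids)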

Deploy Meta SAM 2.1 for inference using SageMaker JumpStart

On the SageMaker JumpStart landing page, you can discover the public pre-trained models offered by SageMaker AI. You can choose the Meta model provider tab to discover the Meta models available.

If you’re using SageMaker Studio and don’t see the SAM 2.1 models, update your SageMaker Studio version by shutting down and restarting. For more information about version updates, refer to Shut down and Update Studio Classic Apps.

You can choose the model card to view details about the model, such as the license, the data used to train it, and how to use it. You can also find two buttons, Deploy and Open Notebook, which help you use the model.

When you choose Deploy, you’re prompted to choose an endpoint name and instance type to initiate deployment.

Upon defining your endpoint settings, you can proceed to the next step to use the model.

Deploy Meta SAM 2.1 vision segmentation model for inference using the Python SDK

When you choose Deploy, model deployment will start. Alternatively, you can deploy through the example notebook by choosing Open Notebook. The notebook provides end-to-end guidance on how to deploy the model for inference and clean up resources.

To deploy using a notebook, you start by selecting an appropriate model, specified by the model_id. You can deploy any of the selected models on SageMaker AI.

You can deploy a Meta SAM 2.1 vision segmentation model using SageMaker JumpStart with the following SageMaker Python SDK code:

from sagemaker.jumpstart.model import JumpStartModel 
model = JumpStartModel(model_id = "meta-vs-sam-2-1-hiera-tiny") 
predictor = model.deploy()
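Before moving on, note that the inference snippets later in this post assume a few common variables and AWS clients that the example notebook sets up, such as the Region, an Amazon S3 client, the SageMaker runtime client, and the endpoint name. The following is a minimal sketch of that setup; it assumes you’re running in an environment (such as SageMaker Studio) where AWS credentials are already configured:

import base64
import json

import boto3

# Region and clients assumed by the inference snippets in this post
region = boto3.Session().region_name
s3 = boto3.client("s3", region_name=region)
runtime_client = boto3.client("sagemaker-runtime", region_name=region)

# Name of the endpoint created by model.deploy() above
endpoint_name = predictor.endpoint_name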

The JumpStartModel deployment shown previously uses default configurations, including the default instance type and default VPC configuration; you can change these by specifying non-default values in JumpStartModel. After the model is deployed, you can run inference against the endpoint through the SageMaker runtime. Three tasks are available with this endpoint: automatic mask generator, image predictor, and video predictor; we provide a code snippet for each later in this post. To use the endpoint, the request payload needs to follow a specific schema. The endpoint has sticky sessions enabled, so to start inference, you send a start_session payload:

def start_session(asset_type, asset_path):
    # Base64 encode the image or video so it can be sent in the JSON payload
    with open(asset_path, 'rb') as f:
        asset_base64 = base64.b64encode(f.read()).decode('utf-8')

    response = runtime_client.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType='application/json',
        Body=json.dumps({
            "type": "start_session",
            "input_type": asset_type,
            "path": asset_base64
        }),
        SessionId="NEW_SESSION",
    )

    # The new session ID is returned in a response header
    session_id = response["ResponseMetadata"]["HTTPHeaders"].get(
        "x-amzn-sagemaker-new-session-id"
    )

    return session_id

The start_session invocation needs an input media type of either image or video and the base64 encoded data of the media. This will launch a session with an instance of the model and load the media to be segmented.

To close a session, send a close_session invocation:

def close_session(session_id):
    response = runtime_client.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType='application/json',
        Body=json.dumps({
            "type": "close_session",
            "session_id": session_id
        }),
        SessionId=session_id,
    )

    # A closed-session header confirms the session was ended
    closed_session_id = response["ResponseMetadata"]["HTTPHeaders"].get(
        "x-amzn-sagemaker-closed-session-id"
    )

    return closed_session_id

If x-amzn-sagemaker-closed-session-id exists as a header, then the session has been successfully closed.

For any operation other than start_session or close_session, the response includes the x-amzn-sagemaker-session-id header with the current session ID, which you can use to continue working with the existing session. These operations also need to be invoked with a response stream, because the resulting payload is larger than what a SageMaker real-time endpoint can return in a single response.

This is a basic example of interacting with the SAM 2.1 SageMaker JumpStart endpoint with sticky sessions. The following examples for each task reference these operations without repeating them. The returned data is of MIME type JSON Lines (JSONL). For more complete examples, refer to the example notebooks for Meta SAM 2.1 on SageMaker JumpStart.
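The snippets in this post also rely on a StreamParser helper from the example notebook to reassemble the JSONL response stream; the helper itself isn’t shown in this post. The following is a minimal sketch of what such a parser might look like, assuming each line of the stream is a standalone JSON object (refer to the example notebook for the actual implementation):

import json

class StreamParser:
    """Accumulates response-stream events and parses complete JSON lines."""

    def __init__(self):
        self._buffer = b""
        self._responses = []

    def write(self, event):
        # Each event from the response stream carries a chunk of bytes
        chunk = event.get("PayloadPart", {}).get("Bytes", b"")
        self._buffer += chunk
        # Split out complete lines and parse each one as a JSON object
        while b"\n" in self._buffer:
            line, self._buffer = self._buffer.split(b"\n", 1)
            if line.strip():
                self._responses.append(json.loads(line))

    def get_responses(self):
        return self._responses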

Recommended instances and benchmarks

The following table lists all the Meta SAM 2.1 models available in SageMaker JumpStart, along with each model’s model_id, default instance type, and supported instance types. You can change the default instance type in the SageMaker JumpStart UI, or through the SDK as sketched after the table.

Model Name | Model ID | Default Instance Type | Supported Instance Types
Meta SAM 2.1 Tiny | meta-vs-sam-2-1-hiera-tiny | ml.g6.24xlarge (5.5 MB total image or video size) | ml.g5.24xlarge, ml.g5.48xlarge, ml.g6.24xlarge, ml.g6.48xlarge, ml.p4d.24xlarge, ml.p4de.24xlarge
Meta SAM 2.1 Small | meta-vs-sam-2-1-hiera-small | ml.g6.24xlarge (5.5 MB total image or video size) | ml.g5.24xlarge, ml.g5.48xlarge, ml.g6.24xlarge, ml.g6.48xlarge, ml.p4d.24xlarge, ml.p4de.24xlarge
Meta SAM 2.1 Base Plus | meta-vs-sam-2-1-hiera-base-plus | ml.g6.24xlarge (5.5 MB total image or video size) | ml.g5.24xlarge, ml.g5.48xlarge, ml.g6.24xlarge, ml.g6.48xlarge, ml.p4d.24xlarge, ml.p4de.24xlarge
Meta SAM 2.1 Large | meta-vs-sam-2-1-hiera-large | ml.g6.24xlarge (5.5 MB total image or video size) | ml.g5.24xlarge, ml.g5.48xlarge, ml.g6.24xlarge, ml.g6.48xlarge, ml.p4d.24xlarge, ml.p4de.24xlarge
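If you deploy with the SageMaker Python SDK rather than the UI, you can also pick any supported instance type from the table by passing instance_type. The following is a minimal sketch; the choice of the Large variant and ml.g6.48xlarge here is only an example:

from sagemaker.jumpstart.model import JumpStartModel

# Example: deploy the Large variant on a larger supported instance type
model = JumpStartModel(
    model_id="meta-vs-sam-2-1-hiera-large",
    instance_type="ml.g6.48xlarge",
)
predictor = model.deploy()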

Meta SAM 2.1 use cases: Inference and prompt examples

After you deploy the model using SageMaker JumpStart, you can open a reference Jupyter notebook that contains the parser and helper functions needed to work with Meta SAM 2.1. After you run those cells in the notebook, you’re ready to use the model’s vision segmentation capabilities.

Meta SAM 2.1 offers support for three different tasks (automatic mask generator, image predictor, video predictor) to generate masks for various objects in images, including object tracking in videos. In the following examples, we demonstrate how to use the automatic mask generator and image predictor on a JPG of a truck. This truck.jpg file is stored in the jumpstart-cache-prod bucket; you can access it with the following code:

s3_bucket = f"jumpstart-cache-prod-{region}"
key_prefix = "inference-notebook-assets"

def download_from_s3(key_filenames):
    for key_filename in key_filenames:
        s3.download_file(s3_bucket, f"{key_prefix}/{key_filename}", key_filename)
        
truck_jpg = "truck.jpg"

#Download images.
download_from_s3(key_filenames=[truck_jpg])
display(Image(filename=truck_jpg))

After you have your image and it is encoded, you can create masks for objects in the image. For use cases where you want to generate masks for every object in the image, you can use the automatic mask generator task.

Automatic mask generator

The automatic mask generator is great for AI researchers working on computer vision tasks and applications such as medical imaging and diagnostics, where it can automatically segment regions of interest like tumors or specific organs to provide more accurate diagnostic support. The automatic mask generator can also be particularly useful in the autonomous vehicle space, where it can segment elements in a camera feed such as pedestrians, vehicles, and other objects. Let’s use the automatic mask generator to generate masks for all the objects in truck.jpg.

The following code is the prompt to generate masks for your base64 encoded image:

# Start session
session_id = start_session("image", truck_jpg)
    
# Generate and visualize masks with basic parameters
response = runtime_client.invoke_endpoint_with_response_stream(
        EndpointName=endpoint_name,
        ContentType='application/json',
        Body=json.dumps({
            "type": "generate_automatic_masks",
            "session_id": session_id,
            "points_per_side": 32,
            "min_mask_region_area": 100
        }),
        SessionId=session_id,
        Accept="application/jsonlines"
    )
    
# Parse response stream
parser = StreamParser()
for event in response['Body']:
    parser.write(event)

masks = parser.get_responses()

# Close the session
close_session(session_id)

We receive the following output (parsed and visualized).
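The plotting helpers used to produce that output live in the example notebook. As a rough illustration only, the following sketch overlays the returned masks on the source image; it assumes each parsed response contains a binary mask under a "mask" key as a nested list, which is an assumption about the response schema (refer to the example notebook for the actual format):

import numpy as np
import matplotlib.pyplot as plt
from PIL import Image as PILImage

# Assumption: each item in `masks` is a dict with a "mask" key holding a
# nested list of 0/1 values with the same height and width as the image.
image = np.array(PILImage.open(truck_jpg))

plt.figure(figsize=(10, 8))
plt.imshow(image)
for item in masks:
    mask = np.array(item["mask"], dtype=bool)
    overlay = np.zeros((*mask.shape, 4))
    overlay[mask] = [*np.random.random(3), 0.5]  # translucent random color
    plt.imshow(overlay)
plt.axis("off")
plt.show()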

Image predictor

Additionally, you can choose which objects in the image to mask by adding points within those objects for Meta SAM 2.1 to segment. The image predictor is valuable for design and modeling tasks because it automates processes that typically require manual effort. For example, it can help turn 2D images into 3D models by analyzing blueprints, sketches, or floor plans and generating preliminary 3D models, acting as a bridge between 2D and 3D workflows across many different tasks. We use the following image and points to prompt Meta SAM 2.1 to mask the object.

The following code is used to prompt Meta SAM 2.1 and plot the coordinates:

# Start session
session_id = start_session("image", truck_jpg)

points = [
            {"type": "point", "coordinates": [500, 375], "label": 1},
            {"type": "point", "coordinates": [1125, 625], "label": 1}
         ]
    
# Add multiple points
response = runtime_client.invoke_endpoint_with_response_stream(
        EndpointName=endpoint_name,
        ContentType='application/json',
        Body=json.dumps({
            "type": "add_points",
            "session_id": session_id,
            "points": [p["coordinates"] for p in points],
            "labels": [p["label"] for p in points],
            "clear_old_points": clear_old_point,
        }),
        SessionId=session_id,
        Accept="application/jsonlines"
    )

# Parse response stream
parser = StreamParser()
for event in response['Body']:
    parser.write(event)

# Intermediate Response
masks = parser.get_responses()
    
response = runtime_client.invoke_endpoint_with_response_stream(
        EndpointName=endpoint_name,
        ContentType='application/json',
        Body=json.dumps({
            "type": "predict",
            "session_id": session_id,
            "multimask_output": True,
            "return_logits": True
        }),
        SessionId=session_id,
        Accept="application/jsonlines"
    )

# Parse response stream
parser = StreamParser()
for event in response['Body']:
    parser.write(event)

masks = parser.get_responses()

# Close the session
close_session(session_id)

We receive the following output (parsed and visualized).

Video predictor

We now demonstrate how to prompt Meta SAM 2.1 for object tracking on video. One use case is ergonomic data collection and training: you can use the video predictor to analyze the movement and posture of humans in real time, helping reduce injury and improve performance by raising alerts for bad posture or movements. Let’s start by downloading the basketball-layup.mp4 file [1] from the jumpstart-cache-prod S3 bucket using the following code:

basketball_mp4 = "basketball-layup.mp4"

#Download video
download_from_s3(key_filenames=[basketball_mp4])
display(Video(filename=basketball_mp4))


The following code shows how to set up the prompt format to track objects in the video. The first object is prompted with two points, a positive point (label 1) to include in the mask and a negative point (label 0) to exclude from it; the second object is prompted with a single positive point.

# Start session
session_id = start_session("video", basketball_mp4)

# Object 1
prompts1 = [
        {"type": "point", "coordinates": [1478, 649], "label": 1},
        {"type": "point", "coordinates": [1433, 689], "label": 0},
    ]
    
# Extract points and labels
points = []
labels = []
for prompt in prompts1:
    if prompt["type"] == "point":
        points.append(prompt["coordinates"])
        labels.append(prompt["label"])

request = {
        "type": "add_points",
        "session_id": session_id,
        "frame_index": 0,
        "object_id": 1,
        "points": points,
        "labels": labels,
        "clear_old_points": True,
    }
    
# Add multiple points
response = runtime_client.invoke_endpoint_with_response_stream(
        EndpointName=endpoint_name,
        ContentType='application/json',
        Body=json.dumps(request),
        SessionId=session_id,
        Accept="application/jsonlines"
    )

# Parse response stream
parser = StreamParser()
for event in response['Body']:
    parser.write(event)

# Intermediate Response
masks = parser.get_responses()

# Object 2
prompts2 = [{"type": "point", "coordinates": [1433, 689], "label": 1}]

# Extract points and labels
points = []
labels = []
for prompt in prompts2:
    if prompt["type"] == "point":
        points.append(prompt["coordinates"])
        labels.append(prompt["label"])

request = {
        "type": "add_points",
        "session_id": session_id,
        "frame_index": 0,
        "object_id": 2,
        "points": points,
        "labels": labels,
        "clear_old_points": True,
    }
    
# Add multiple points
response = runtime_client.invoke_endpoint_with_response_stream(
        EndpointName=endpoint_name,
        ContentType='application/json',
        Body=json.dumps(request),
        SessionId=session_id,
        Accept="application/jsonlines"
    )

# Parse response stream
parser = StreamParser()
for event in response['Body']:
    parser.write(event)

# Intermediate Response
masks = parser.get_responses()
    
response = runtime_client.invoke_endpoint_with_response_stream(
        EndpointName=endpoint_name,
        ContentType='application/json',
        Body=json.dumps({
            "type": "propagate_in_video",
            "session_id": session_id,
            "start_frame_index": 0,
        }),
        SessionId=session_id,
        Accept="application/jsonlines"
    )

# Parse response stream
parser = StreamParser()
for event in response['Body']:
    parser.write(event)

masks = parser.get_responses()

# Close the session
close_session(session_id)

We receive the following output (parsed and visualized).


Here we can see that Meta SAM 2.1 Tiny was able to successfully track the objects based on the coordinates provided in the prompt.

Clean up

To avoid incurring unnecessary costs, when you’re done, delete the SageMaker AI endpoint using the following code:

predictor.delete_model()
predictor.delete_endpoint()

Alternatively, to use the SageMaker AI console, complete the following steps:

  1. On the SageMaker AI console, under Inference in the navigation pane, choose Endpoints.
  2. Search for the Meta SAM 2.1 endpoint you created.
  3. On the endpoint details page, choose Delete.
  4. Choose Delete again to confirm.

Conclusion

In this post, we explored how SageMaker JumpStart empowers data scientists and ML engineers to discover, access, and deploy a wide range of pre-trained FMs for inference, including Meta’s most advanced and capable models to date. Get started with SageMaker JumpStart and Meta SAM 2.1 models today. For more information about SageMaker JumpStart, see SageMaker JumpStart pretrained models and Getting started with Amazon SageMaker JumpStart.

Sources:

[1] Erčulj F, Štrumbelj E (2015) Basketball Shot Types and Shot Success in Different Levels of Competitive Basketball. PLOS ONE 10(6): e0128885. https://doi.org/10.1371/journal.pone.0128885


About the Authors

Marco Punio is a Sr. Specialist Solutions Architect focused on generative AI strategy, applied AI solutions, and conducting research to help customers hyper-scale on AWS. As a member of the 3rd Party Model Provider Applied Sciences Solutions Architecture team at AWS, he is a Global Lead for the Meta – AWS Partnership and technical strategy. Based in Seattle, WA, Marco enjoys writing, reading, exercising, and building applications in his free time.

Deepak Rupakula is a Principal GTM lead in the specialists group at AWS. He focuses on developing GTM strategy for large language models like Meta across AWS services like Amazon Bedrock and Amazon SageMaker AI. With over 15 years of experience in the tech industry, his experience includes leadership roles in product management, customer success, and analytics.

Harish Rao is a Senior Solutions Architect at AWS, specializing in large-scale distributed AI training and inference. He empowers customers to harness the power of AI to drive innovation and solve complex challenges. Outside of work, Harish embraces an active lifestyle, enjoying the tranquility of hiking, the intensity of racquetball, and the mental clarity of mindfulness practices.

Baladithya Balamurugan is a Solutions Architect at AWS focused on ML deployments for inference and using AWS Neuron to accelerate training and inference. He works with customers to enable and accelerate their ML deployments on services such as Amazon SageMaker AI and Amazon EC2. Based in San Francisco, Baladithya enjoys tinkering, developing applications, and building his homelab in his free time.

Banu Nagasundaram leads product, engineering, and strategic partnerships for Amazon SageMaker JumpStart, SageMaker AI’s machine learning and generative AI hub. She is passionate about building solutions that help customers accelerate their AI journey and unlock business value.

Naman Nandan is a software development engineer at AWS, specializing in enabling large-scale AI/ML inference workloads on Amazon SageMaker AI using TorchServe, a project jointly developed by AWS and Meta. In his free time, he enjoys playing tennis and going on hikes.
