Build multi-agent systems with LangGraph and Amazon Bedrock

Large language models (LLMs) have raised the bar for human-computer interaction where the expectation from users is that they can communicate with their applications through natural language. Beyond simple language understanding, real-world applications require managing complex workflows, connecting to external data, and coordinating multiple AI capabilities. Imagine scheduling a doctor’s appointment where an AI agent checks your calendar, accesses your provider’s system, verifies insurance, and confirms everything in one go—no more app-switching or hold times. In these real-world scenarios, agents can be a game changer, delivering more customized generative AI applications.

LLM agents serve as decision-making systems for application control flow. However, these systems face several operational challenges during scaling and development. The primary issues include tool selection inefficiency, where agents with access to numerous tools struggle with optimal tool selection and sequencing, context management limitations that prevent single agents from effectively managing increasingly complex contextual information, and specialization requirements as complex applications demand diverse expertise areas such as planning, research, and analysis. The solution lies in implementing a multi-agent architecture, which involves decomposing the main system into smaller, specialized agents that operate independently. Implementation options range from basic prompt-LLM combinations to sophisticated ReAct (Reasoning and Acting) agents, allowing for more efficient task distribution and specialized handling of different application components. This modular approach enhances system manageability and allows for better scaling of LLM-based applications while maintaining functional efficiency through specialized components.

This post demonstrates how to integrate the open-source multi-agent framework LangGraph with Amazon Bedrock. It explains how to use LangGraph and Amazon Bedrock to build powerful, interactive multi-agent applications that use graph-based orchestration.

AWS has introduced a multi-agent collaboration capability for Amazon Bedrock Agents, enabling developers to build, deploy, and manage multiple AI agents working together on complex tasks. This feature allows for the creation of specialized agents that handle different aspects of a process, coordinated by a supervisor agent that breaks down requests, delegates tasks, and consolidates outputs. This approach improves task success rates, accuracy, and productivity, especially for complex, multi-step tasks.

Challenges with multi-agent systems

In a single-agent system, planning involves the LLM agent breaking down tasks into a sequence of small tasks, whereas a multi-agent system must have workflow management involving task distribution across multiple agents. Unlike single-agent environments, multi-agent systems require a coordination mechanism where each agent must maintain alignment with others while contributing to the overall objective. This introduces unique challenges in managing inter-agent dependencies, resource allocation, and synchronization, necessitating robust frameworks that maintain system-wide consistency while optimizing performance.

Memory management in AI systems differs between single-agent and multi-agent architectures. Single-agent systems use a three-tier structure: short-term conversational memory, long-term historical storage, and external data sources like Retrieval Augmented Generation (RAG). Multi-agent systems require more advanced frameworks to manage contextual data, track interactions, and synchronize historical records across agents. These systems must handle real-time interactions, context synchronization, and efficient data retrieval, necessitating careful design of memory hierarchies, access patterns, and inter-agent sharing.

Agent frameworks are essential for multi-agent systems because they provide the infrastructure for coordinating autonomous agents, managing communication and resources, and orchestrating workflows. Agent frameworks alleviate the need to build these complex components from scratch.

LangGraph, part of LangChain, orchestrates agentic workflows through a graph-based architecture that handles complex processes and maintains context across agent interactions. It uses supervisory control patterns and memory systems for coordination.

LangGraph Studio enhances development with graph visualization, execution monitoring, and runtime debugging capabilities. The integration of LangGraph with Amazon Bedrock empowers you to take advantage of the strengths of multiple agents seamlessly, fostering a collaborative environment that enhances the efficiency and effectiveness of LLM-based systems.

Understanding LangGraph and LangGraph Studio

LangGraph implements state machines and directed graphs for multi-agent orchestration. The framework provides fine-grained control over both the flow and state of your agent applications. LangGraph models agent workflows as graphs. You define the behavior of your agents using three key components:

  • State – A shared data structure that represents the current snapshot of your application.
  • Nodes – Python functions that encode the logic of your agents.
  • Edges – Python functions that determine which Node to execute next based on the current state. They can be conditional branches or fixed transitions.
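
The following minimal sketch illustrates these three components with the langgraph Python package; the state fields, node names, and returned values are illustrative placeholders rather than the code from the accompanying repository.

from typing import TypedDict

from langgraph.graph import StateGraph, START, END


# State: a shared snapshot of the application
class TravelState(TypedDict):
    query: str
    destination: str


# Node: a Python function that reads the state and returns an update
def recommend_destination(state: TravelState) -> dict:
    # A real node would call an LLM or a tool; this placeholder returns a fixed update.
    return {"destination": "Barcelona"}


builder = StateGraph(TravelState)
builder.add_node("destination", recommend_destination)
# Edges: fixed transitions from START to the node and on to END
builder.add_edge(START, "destination")
builder.add_edge("destination", END)

graph = builder.compile()
print(graph.invoke({"query": "Suggest a travel destination", "destination": ""}))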

LangGraph implements a central persistence layer, enabling features that are common to most agent architectures, including:

  • Memory – LangGraph persists arbitrary aspects of your application’s state, supporting memory of conversations and other updates within and across user interactions.
  • Human-in-the-loop – Because state is checkpointed, execution can be interrupted and resumed, allowing for decisions, validation, and corrections at key stages through human input.
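
As a rough sketch of how this persistence layer can be used (reusing the builder from the previous sketch with the in-memory checkpointer; the thread ID and interrupt point are assumptions, not the configuration used in the sample application):

from langgraph.checkpoint.memory import MemorySaver

# Compile with a checkpointer so state persists across invocations, and pause
# before the destination node so a human can review or correct the state.
graph = builder.compile(
    checkpointer=MemorySaver(),
    interrupt_before=["destination"],
)

config = {"configurable": {"thread_id": "user-42"}}  # the thread keys the memory
graph.invoke({"query": "Suggest a travel destination", "destination": ""}, config)
print(graph.get_state(config))  # inspect the checkpointed state at the interruption
graph.invoke(None, config)      # resume execution after human review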

LangGraph Studio is an integrated development environment (IDE) specifically designed for AI agent development. It provides developers with powerful tools for visualization, real-time interaction, and debugging capabilities. The key features of LangGraph Studio are:

  • Visual agent graphs – The IDE’s visualization tools allow developers to represent agent flows as intuitive visual graphs, making it straightforward to understand and modify complex system architectures.
  • Real-time debugging – The ability to interact with agents in real time and modify responses mid-execution creates a more dynamic development experience.
  • Stateful architecture – Support for stateful and adaptive agents within a graph-based architecture enables more sophisticated behaviors and interactions.

The following screenshot shows the nodes, edges, and state of a typical LangGraph agent workflow as viewed in LangGraph Studio.


Figure 1: LangGraph Studio UI

In the preceding example, the state begins with __start__ and ends with __end__. You define the nodes for invoking the model and tools, and the edges show which paths the workflow can follow.

LangGraph Studio is available as a desktop application for macOS users. Alternatively, you can run a local in-memory development server that can be used to connect a local LangGraph application with a web version of the studio.

Solution overview

This example demonstrates the supervisor agentic pattern, where a supervisor agent coordinates multiple specialized agents. Each agent maintains its own scratchpad while the supervisor orchestrates communication and delegates tasks based on agent capabilities. This distributed approach improves efficiency by allowing agents to focus on specific tasks while enabling parallel processing and system scalability.

Let’s walk through an example with the following user query: “Suggest a travel destination and search flight and hotel for me. I want to travel on 15-March-2025 for 5 days.” The workflow consists of the following steps:

  1. The Supervisor Agent receives the initial query and breaks it down into sequential tasks:
    1. Destination recommendation required.
    2. Flight search needed for March 15, 2025.
    3. Hotel booking required for 5 days.
  2. The Destination Agent begins its work by accessing the user’s stored profile. It searches its historical database, analyzing patterns from similar user profiles to recommend the destination. Then it passes the destination back to the Supervisor Agent.
  3. The Supervisor Agent forwards the chosen destination to the Flight Agent, which searches available flights for the given date.
  4. The Supervisor Agent activates the Hotel Agent, which searches for hotels in the destination city.
  5. The Supervisor Agent compiles the recommendations into a comprehensive travel plan, presenting the user with a complete itinerary including destination rationale, flight options, and hotel suggestions.
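
A simplified code sketch of this supervisor pattern follows; the state fields, node names, and routing logic are illustrative placeholders, whereas in the accompanying repository each specialist is itself an LLM-backed agent with its own tools.

from typing import TypedDict

from langgraph.graph import StateGraph, START, END


class TripState(TypedDict, total=False):
    request: str
    destination: str
    flights: str
    hotels: str


def supervisor(state: TripState) -> TripState:
    # In the real workflow, an LLM call would decide which task is still open.
    return {}


def route(state: TripState) -> str:
    # Delegate to the first specialist whose output is still missing.
    if "destination" not in state:
        return "destination_agent"
    if "flights" not in state:
        return "flight_agent"
    if "hotels" not in state:
        return "hotel_agent"
    return END


def destination_agent(state: TripState) -> TripState:
    return {"destination": "Lisbon"}  # placeholder for an LLM + tools call


def flight_agent(state: TripState) -> TripState:
    return {"flights": "flight options for 15-March-2025"}


def hotel_agent(state: TripState) -> TripState:
    return {"hotels": "hotel options for 5 nights"}


builder = StateGraph(TripState)
builder.add_node("supervisor", supervisor)
for name, fn in [("destination_agent", destination_agent),
                 ("flight_agent", flight_agent),
                 ("hotel_agent", hotel_agent)]:
    builder.add_node(name, fn)
    builder.add_edge(name, "supervisor")  # each specialist reports back

builder.add_edge(START, "supervisor")
builder.add_conditional_edges("supervisor", route)

travel_graph = builder.compile()
print(travel_graph.invoke({"request": "Suggest a destination, flight, and hotel"}))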

The following figure shows a multi-agent workflow of how these agents connect to each other and which tools are involved with each agent.

Figure 2: Multi-agent workflow

Prerequisites

You will need the following prerequisites before you can proceed with this solution. For this post, we use the us-west-2 AWS Region. For details on available Regions, see Amazon Bedrock endpoints and quotas.

Core components

Each agent is structured with two primary components:

  • graph.py – This script defines the agent’s workflow and decision-making logic. It implements the LangGraph state machine for managing agent behavior and configures the communication flow between different components. For example:
    • The Flight Agent’s graph manages the flow between chat and tool operations.
    • The Hotel Agent’s graph handles conditional routing between search, booking, and modification operations.
    • The Supervisor Agent’s graph orchestrates the overall multi-agent workflow.
  • tools.py – This script contains the concrete implementations of agent capabilities. It implements the business logic for each operation and handles data access and manipulation. It provides specific functionalities like:
    • Flight tools: search_flights, book_flights, change_flight_booking, cancel_flight_booking.
    • Hotel tools: suggest_hotels, book_hotels, change_hotel_booking, cancel_hotel_booking.

This separation between graph (workflow) and tools (implementation) allows for a clean architecture where the decision-making process is separate from the actual execution of tasks. The agents communicate through a state-based graph system implemented using LangGraph, where the Supervisor Agent directs the flow of information and tasks between the specialized agents.
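
As a rough illustration of the tools.py side, a tool such as search_flights could be declared with LangChain’s tool decorator so the agent graph can call it. The signature and return value below are assumptions for the sketch, not the repository’s actual implementation.

from langchain_core.tools import tool


@tool
def search_flights(origin: str, destination: str, departure_date: str) -> str:
    """Search available flights between two cities on a given date."""
    # The real implementation would query a flight inventory API or database.
    return f"Found 3 flights from {origin} to {destination} on {departure_date}."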

To set up Amazon Bedrock with LangGraph, refer to the following GitHub repo. The high-level steps are as follows:

  1. Install the required packages:
pip install boto3 langchain-aws

These packages are essential for Amazon Bedrock integration:

  • boto3: AWS SDK for Python, handles AWS service communication
  • langchain-aws: Provides LangChain integrations for AWS services
  2. Import the modules:
import boto3
from langchain_aws import ChatBedrockConverse
from langchain_aws import ChatBedrock

  3. Create an LLM object:
bedrock_client = boto3.client("bedrock-runtime", region_name="<region_name>")
llm = ChatBedrockConverse(
    model="anthropic.claude-3-haiku-20240307-v1:0",
    temperature=0,
    max_tokens=None,
    client=bedrock_client,
    # other params...
)
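
With the LLM object in place, one way to wire it into an agent graph is LangGraph’s prebuilt ReAct helper. This is a sketch, not necessarily how the accompanying repository builds its agents; search_flights refers to the illustrative tool sketched earlier.

from langgraph.prebuilt import create_react_agent

flight_agent = create_react_agent(llm, tools=[search_flights])
result = flight_agent.invoke(
    {"messages": [("user", "Find flights from Amsterdam to Barcelona on 15-March-2025")]}
)
print(result["messages"][-1].content)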

LangGraph Studio configuration

This project uses a langgraph.json configuration file to define the application structure and dependencies. This file is essential for LangGraph Studio to understand how to run and visualize your agent graphs.

{
  "dependencies": [
    "boto3>=1.35.87",
    "langchain-aws>=0.2.10",
    "."
  ],
  "graphs": {
    "supervisor": "./src/supervisor_agent/graph.py:graph",
    "flight": "./src/flight_agent/graph.py:graph",
    "hotel": "./src/hotel_agent/graph.py:graph"
  },
  "env": "./.env"
}

LangGraph Studio uses this file to build and visualize the agent workflows, allowing you to monitor and debug the multi-agent interactions in real time.

Testing and debugging

You’re now ready to test the multi-agent travel assistant. You can start the graph using the langgraph dev command, which runs the LangGraph API server in development mode with hot reloading and debugging capabilities. As shown in the following screenshot, the interface provides a straightforward way to select which graph you want to test through the dropdown menu at the top left. The Manage Configuration button at the bottom lets you set up specific testing parameters before you begin. This development environment provides everything you need to thoroughly test and debug your multi-agent system with real-time feedback and monitoring capabilities.

Figure 3: LangGraph Studio with Destination Agent recommendation

LangGraph Studio offers flexible configuration management through its intuitive interface. As shown in the following screenshot, you can create and manage multiple configuration versions (v1, v2, v3) for your graph execution. For example, in this scenario, we want to use user_id to fetch historical user information. This versioning system makes it simple to track and switch between different test configurations while debugging your multi-agent system.

Figure 4: Runnable configuration details

In the preceding example, we set up the user_id that tools can use to retrieve history or other details.

Let’s test the Planner Agent. This agent has the compare_and_recommend_destination tool, which can check past travel data and recommend travel destinations based on the user profile. We use user_id in the configuration so that it can be used by the tool.

LangGraph has a concept of checkpoint memory that is managed using threads. The following screenshot shows that you can quickly manage threads in LangGraph Studio.

Figure 5: View graph state in the thread

In this example, destination_agent is using a tool; you can also check the tool’s output. Similarly, you can test flight_agent and hotel_agent to verify each agent.

When all the agents are working well, you’re ready to test the full workflow. You can evaluate the state and verify the input and output of each agent.

The following screenshot shows the full view of the Supervisor Agent with its sub-agents.

Figure 6: Supervisor Agent with complete workflow

Considerations

Multi-agent architectures must account for agent coordination, state management, communication, output consolidation, and guardrails, as well as maintaining processing context, error handling, and orchestration. Graph-based architectures offer significant advantages over linear pipelines, enabling complex workflows with nonlinear communication patterns and clearer system visualization. These structures allow for dynamic pathways and adaptive communication, ideal for large-scale deployments with simultaneous agent interactions. They excel in parallel processing and resource allocation but require sophisticated setup and might demand higher computational resources. Implementing these systems necessitates careful planning of system topology, robust monitoring, and well-designed fallback mechanisms for failed interactions.

When implementing multi-agent architectures in your organization, it’s crucial to align with your company’s established generative AI operations and governance frameworks. Prior to deployment, verify alignment with your organization’s AI safety protocols, data handling policies, and model deployment guidelines. Although this architectural pattern offers significant benefits, its implementation should be tailored to fit within your organization’s specific AI governance structure and risk management frameworks.

Clean up

Delete any IAM roles and policies created specifically for this post. Delete the local copy of this post’s code. If you no longer need access to an Amazon Bedrock FM, you can remove access from it. For instructions, see Add or remove access to Amazon Bedrock foundation models.

Conclusion

The integration of LangGraph with Amazon Bedrock significantly advances multi-agent system development by providing a robust framework for sophisticated AI applications. This combination uses LangGraph’s orchestration capabilities and FMs in Amazon Bedrock to create scalable, efficient systems. It addresses challenges in multi-agent architectures through state management, agent coordination, and workflow orchestration, offering features like memory management, error handling, and human-in-the-loop capabilities. LangGraph Studio’s visualization and debugging tools enable efficient design and maintenance of complex agent interactions. This integration offers a powerful foundation for next-generation multi-agent systems, providing effective workflow handling, context maintenance, reliable results, and optimal resource utilization.

For the example code and demonstration discussed in this post, refer to the accompanying GitHub repository. You can also refer to the following GitHub repo for Amazon Bedrock multi-agent collaboration code samples.


About the Authors

Jagdeep Singh Soni is a Senior Partner Solutions Architect at AWS based in the Netherlands. He uses his passion for generative AI to help customers and partners build generative AI applications using AWS services. Jagdeep has 15 years of experience in innovation, experience engineering, digital transformation, cloud architecture, and ML applications.

Ajeet Tewari is a Senior Solutions Architect for Amazon Web Services. He works with enterprise customers to help them navigate their journey to AWS. His specialties include architecting and implementing scalable OLTP systems and leading strategic AWS initiatives.

Rupinder Grewal is a Senior AI/ML Specialist Solutions Architect with AWS. He currently focuses on serving of models and MLOps on Amazon SageMaker. Prior to this role, he worked as a Machine Learning Engineer building and hosting models. Outside of work, he enjoys playing tennis and biking on mountain trails.

Dynamic text-to-SQL for enterprise workloads with Amazon Bedrock Agents

Generative AI enables us to accomplish more in less time. Text-to-SQL empowers people to explore data and draw insights using natural language, without requiring specialized database knowledge. Amazon Web Services (AWS) has helped many customers connect this text-to-SQL capability with their own data, which means more employees can generate insights. In this process, we discovered that a different approach is needed in enterprise environments where there are over 100 tables, each with dozens of columns. We also learned that robust error handling is critical when errors occur in the generated SQL query based on users’ questions.

This post demonstrates how enterprises can implement a scalable agentic text-to-SQL solution using Amazon Bedrock Agents, with advanced error-handling tools and automated schema discovery to enhance database query efficiency. Our agent-based solution offers two key strengths:

  1. Automated scalable schema discovery – The schema and table metadata can be dynamically updated to generate SQL when the initial attempt to execute the query fails. This is important for enterprise customers who have many tables and columns and many query patterns.
  2. Automated error handling – The error message is directly fed back to the agent to improve the success rate of running queries.

You’ll find that these features help you tackle enterprise-scale database challenges while making your text-to-SQL experience more robust and efficient.

Use case

An agentic text-to-SQL solution can benefit enterprises with complex data structures. In this post, to understand the mechanics and benefits of the agentic text-to-SQL solution in a complex enterprise environment, imagine you’re a business analyst on the risk management team in a bank. You need to answer questions such as “Find all transactions that occurred in the United States and were flagged as fraudulent, along with the device information used for those transactions,” or “Retrieve all transactions for John Doe that occurred between January 1, 2023, and December 31, 2023, including fraud flags and merchant details.” For this, there are dozens—or sometimes hundreds—of tables that you need to not only be aware of but also use to craft complex JOIN queries. The following diagram illustrates a sample table schema that might be needed for fraud investigations.

Sample schema for fraud detection

The key pain points of implementing a text-to-SQL solution in this complex environment include the following, but aren’t limited to:

  1. The amount of table and schema information becomes excessive, which entails manual prompt updates and limits how far the solution can scale.
  2. As a result, the solution might require additional validation, impacting the quality and performance of generating SQL.

Now, consider our solution and how it addresses these problems.

Solution overview

Amazon Bedrock Agents manages the entire process from question interpretation to query execution and result interpretation, without manual intervention. It seamlessly incorporates multiple tools, and the agent analyzes and responds to unexpected results. When queries fail, the agent autonomously analyzes error messages, modifies queries, and retries—a key benefit over static systems.

As of December 2024, the Amazon Bedrock with structured data feature provides built-in support for Amazon Redshift, offering seamless text-to-SQL capabilities without custom implementation. This is recommended as the primary solution for Amazon Redshift users.

Here are the capabilities that this solution offers:

  1. Executing text-to-SQL with autonomous troubleshooting:
    1. The agent can interpret natural language questions and convert them into SQL queries. It then executes these queries against an Amazon Athena database and returns the results.
    2. If a query execution fails, the agent can analyze the error messages returned by AWS Lambda and automatically retries the modified query when appropriate.
  2. Dynamic schema discovery:
    1. Listing tables – The agent can provide a comprehensive list of the tables in the fraud detection database. This helps users understand the available data structures.
    2. Describing table schemas – Users can request detailed information about the schema of specific tables. The agent will provide column names, data types, and associated comments, giving users a clear understanding of the data structure.

The solution uses direct database tools for schema discovery instead of vector store–based retrieval or static schema definitions. This approach provides complete accuracy with lower operational overhead because it doesn’t require a synchronization mechanism and continually reflects the current database structure. Direct schema access through tools is more maintainable than hardcoded approaches that require manual updates, and it provides better performance and cost-efficiency through real-time database interaction.

The workflow is as follows:

  1. A user asks a question to Amazon Bedrock Agents.
  2. To serve the user’s question, the agent determines the appropriate action to invoke:
    • To execute the generated query with confidence, the agent invokes the athena-query tool.
    • To confirm the database schema first, the agent invokes the athena-schema-reader tool to:
      • Retrieve a list of available tables using its /list_tables endpoint.
      • Obtain the specific schema of a certain table using its /describe_table endpoint.
  3. The Lambda function sends the query to Athena to execute.
  4. Athena queries the data from the Amazon Simple Storage Service (Amazon S3) data bucket and stores the query results in the S3 output bucket.
  5. The Lambda function retrieves and processes the results. If an error occurs:
    • The Lambda function captures and formats the error message for the agent to understand.
    • The error message is returned to Amazon Bedrock Agents.
    • The agent analyzes the error message and tries to resolve it. To retry with the modified query, the agent may repeat steps 2–5.
  6. The agent formats and presents the final response to the user.
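
For reference, step 1 of this workflow (the user’s question reaching the agent) can be issued programmatically with the bedrock-agent-runtime client. The following sketch is not part of the sample repository; the agent ID, alias ID, and Region are placeholders for your own values.

import uuid

import boto3

agent_runtime = boto3.client("bedrock-agent-runtime", region_name="<region_name>")

response = agent_runtime.invoke_agent(
    agentId="<agent_id>",             # placeholder
    agentAliasId="<agent_alias_id>",  # placeholder
    sessionId=str(uuid.uuid4()),
    inputText=(
        "Find all transactions that occurred in the United States "
        "and were flagged as fraudulent."
    ),
)

# The agent streams its final answer back as an event stream of chunks.
answer = "".join(
    event["chunk"]["bytes"].decode("utf-8")
    for event in response["completion"]
    if "chunk" in event
)
print(answer)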

The following architecture diagram shows this workflow.

Architecture diagram

Implementation walkthrough

To implement the solution, use the instructions in the following sections.

Intelligent error handling

Our agentic text-to-SQL solution implements practical error handling that helps agents understand and recover from issues. By structuring errors with consistent elements, returning nonbreaking errors where possible, and providing contextual hints, the system enables agents to self-correct and continue their reasoning process.

Agent instructions

Consider the key prompt components that make this solution unique. Intelligent error handling helps automate troubleshooting and refine the query by letting the agent understand the type of error and what to do when an error happens:

Execution and Error Handling:
   - Execute the query via the /athena_query endpoint
   - If the execution fails, carefully analyze the error message and hint provided by the Lambda function
   - Based on the error type received from the Lambda function, take appropriate action:
   - After identifying the issue based on the error message and hint:
     1. Modify your query or API request to address the specific problem
     2. If needed, use schema discovery tools (/list_tables, /describe_table) to gather updated information
     3. Reconstruct the query with the necessary corrections
     4. Retry the execution with the modified query or request

The prompt gives guidance on how to approach the errors. It also states that the error types and hints will be provided by Lambda. In the next section, we explain how Lambda processes the errors and passes them to the agent.

Implementation details

Here are some key examples from our error handling system:

ERROR_MESSAGES = {
    'QUERY_EXECUTION_FAILED': {
        'message': 'Failed to execute query',
        'hint': 'Please use fully qualified table names. Example: SELECT * FROM fraud_data.customers LIMIT 1'
    },
    'QUERY_RESULT_ERROR': {
        'message': 'Error occurred while getting query results',
        'hint': 'Check if the tables and columns in your query exist and you have proper permissions. Examples: "customers", "transactions", or "devices".'
    },
    'MISSING_QUERY': {
        'message': 'Query is required',
        'hint': 'No query was provided. Please provide a SQL query to execute'
    }
}

def create_query_response(query_result, status_code=200):
    if query_result.get('error'):
        error_info = ERROR_MESSAGES.get(query_result['error'])
        return {
            'error': query_result['error'],
            'message': error_info['message'],
            'hint': error_info['hint']
        }
    return query_result

These error types cover the main scenarios in text-to-SQL interactions:

  1. Query execution failures – Handles syntax errors and table reference issues, guiding the agent to use the correct table names and SQL syntax
  2. Result retrieval issues – Addresses permission problems and invalid column references, helping the agent verify the schema and access rights
  3. API validation – Verifies that basic requirements are met before query execution, minimizing unnecessary API calls

Each error type includes both an explanatory message and an actionable hint, enabling the agent to take appropriate corrective steps. This implementation shows how straightforward it can be to enable intelligent error handling; instead of handling errors traditionally within Lambda, we return structured error messages that the agent can understand and act upon.

Dynamic schema discovery

Schema discovery is pivotal to making sure Amazon Bedrock Agents consumes the most recent and relevant schema information.

Agent instructions

Instead of hardcoded database schema information, we allow the agent to discover the database schema dynamically. We’ve created two API endpoints for this purpose:

Schema Discovery: 
    - Use /list_tables endpoint to identify available tables in the database 
    - Use /describe_table endpoint to get detailed schema information for specific tables 
    - Always use the most recent and relevant table schemas, as the database structure may change frequently 
    - Before constructing queries, ensure you have up-to-date schema information

Implementation details

Based on the agent instructions, the agent will invoke the appropriate API endpoint.

The /list_tables endpoint lists the tables in a specified database. This is particularly useful when you have multiple databases or frequently add new tables:

@app.post("/list_tables", description="Retrieve a list of all tables in the specified database")
def list_tables(event, database_name):
    query = f"SHOW TABLES IN {database_name}"
    result = execute_and_get_results(query, s3_output)
    if isinstance(result, dict) and 'error' in result:
        return create_api_response(event, 400, get_error_response('QUERY_RESULT_ERROR'))
    return create_api_response(event, 200, result)

The /describe_table endpoint reads a specific table’s schema with details. We use the “DESCRIBE” command, which includes column comments along with other schema details. These comments help the agent better understand the meaning of the individual columns:

@app.post("/describe_table", description="Retrieve the schema information of a specific table")
def describe_table(event, database_name, table_name):
    query = f"DESCRIBE {database_name}.{table_name}"
    result = execute_and_get_results(query, s3_output)
    
    if isinstance(result, dict) and 'error' in result:
        return create_api_response(event, 400, get_error_response('QUERY_RESULT_ERROR'))
    
    formatted_result = {
        "table_name": table_name,
        "database": database_name,
        "columns": result
    }
    return create_api_response(event, 200, formatted_result)
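
Both endpoints delegate query execution to a helper such as execute_and_get_results. The repository’s implementation isn’t reproduced in this post, but a minimal version built on the boto3 Athena client might look like the following sketch; the polling interval and result formatting are assumptions.

import time

import boto3

athena = boto3.client("athena")


def execute_and_get_results(query: str, s3_output: str, poll_seconds: float = 1.0):
    """Run a query in Athena, wait for it to finish, and return the raw rows."""
    execution = athena.start_query_execution(
        QueryString=query,
        ResultConfiguration={"OutputLocation": s3_output},
    )
    query_id = execution["QueryExecutionId"]

    while True:
        status = athena.get_query_execution(QueryExecutionId=query_id)
        state = status["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(poll_seconds)

    if state != "SUCCEEDED":
        # Surface a structured error the agent-facing handlers can translate.
        return {"error": "QUERY_RESULT_ERROR"}

    results = athena.get_query_results(QueryExecutionId=query_id)
    return [
        [col.get("VarCharValue", "") for col in row["Data"]]
        for row in results["ResultSet"]["Rows"]
    ]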

When implementing a dynamic schema reader, consider including comprehensive column descriptions to enhance the agent’s understanding of the data model.

These endpoints enable the agent to maintain an up-to-date understanding of the database structure, improving its ability to generate accurate queries and adapt to changes in the schema.

Demonstration

You might not experience the exact same responses shown in the presented screenshots due to the nondeterministic nature of large language models (LLMs).

The solution is available for you to deploy in your environment with sample data. Clone the repository from this GitHub link and follow the README guidance. After you deploy the two stacks—AwsText2Sql-DbStack and AwsText2Sql-AgentStack—follow these steps to put the solution in action:

  1. Go to Amazon Bedrock and select Agents.
  2. Select AwsText-to-SQL-AgentStack-DynamicAgent and test by asking questions in the Test window on the right.
  3. Example interactions:
    • Which demographic groups or industries are most frequently targeted by fraudsters? Present aggregated data.
    • What specific methods or techniques are commonly used by perpetrators in the reported fraud cases?
    • What patterns or trends can we identify in the timing and location of fraud incidents?
    • Show the details of customers who have made transactions with merchants located in Denver.
    • Provide a list of all merchants along with the total number of transactions they’ve processed and the number of those transactions that were flagged as fraudulent.
    • List the top five customers based on the highest transaction amounts they’ve made.

Agent Builder screen shot

  4. Choose Show trace and examine each step to understand what tools are used and the agent’s rationale for approaching your question, as shown in the following screenshot.

Trace example screen shot

  5. (Optional) You can test the Amazon Bedrock Agents code interpreter by enabling it in Agent settings. Follow the instructions at Enable code interpretation in Amazon Bedrock and ask the agent “Create a bar chart showing the top three cities that have the most fraud cases.”

Code interpreter screen shot

Best practices

Building on our discussion of dynamic schema discovery and intelligent error handling, here are key practices to optimize your agentic text-to-SQL solution:

  1. Use dynamic schema discovery and error handling – Use endpoints such as /list_tables and /describe_table to allow the agent to dynamically adapt to your database structure. Implement comprehensive error handling as demonstrated earlier, enabling the agent to interpret and respond to various error types effectively.
  2. Balance static and dynamic information – Although dynamic discovery is powerful, consider including crucial, stable information in the prompt. This might include database names, key table relationships, or frequently used tables that rarely change. Striking this balance can improve performance without sacrificing flexibility.
  3. Tailor the solution to your environment – We designed the sample to always invoke /list_tables and /describe_table, and your implementation might need adjustments. Consider your specific database engine’s capabilities and limitations. You might need to provide additional context beyond only column comments. Think about including database descriptions, table relationships, or common query patterns. The key is to give your agent as much relevant information as possible about your data model and business context, whether through extended metadata, custom endpoints, or detailed instructions.
  4. Implement robust data protection – Although our solution uses Athena, which inherently doesn’t support write operations, it’s crucial to consider data protection in your specific environment. Start with clear instructions in the prompt (for example, “read-only operations only”), and consider additional layers such as Amazon Bedrock Guardrails or an LLM-based review system to make sure that generated queries align with your security policies.
  5. Implement layered authorization – To enhance data privacy when using Amazon Bedrock Agents, you can use services such as Amazon Verified Permissions to validate user access before the agent processes sensitive data. Pass user identity information, such as a JWT token, to the agent and its associated Lambda function, enabling fine-grained authorization checks against pre-built policies. By enforcing access control at the application level based on the Verified Permissions decision, you can mitigate unintended data disclosure and maintain strong data isolation. To learn more, refer to Enhancing data privacy with layered authorization for Amazon Bedrock Agents in the AWS Security Blog.
  6. Identify the best orchestration strategy for your agent – Amazon Bedrock provides you with an option to customize your agent’s orchestration strategy. Custom orchestration gives you full control of how you want your agents to handle multistep tasks, make decisions, and execute workflows.

By implementing these practices, you can create a text-to-SQL solution that not only uses the full potential of AI agents but also maintains the security and integrity of your data systems.

Conclusion

In conclusion, the implementation of a scalable agentic text-to-SQL solution using AWS services offers significant advantages for enterprise workloads. By using automated schema discovery and robust error handling, organizations can efficiently manage complex databases with numerous tables and columns. The agent-based approach promotes dynamic query generation and refinement, leading to higher success rates in data querying. We’d like to invite you to try this solution out today! Visit GitHub to dive deeper into the details of the solution, and follow the deployment guide to test in your AWS account.


About the Authors

Jimin Kim is a Prototyping Architect on the AWS Prototyping and Cloud Engineering (PACE) team, based in Los Angeles. With specialties in Generative AI and SaaS, she loves helping her customers succeed in their business. Outside of work, she cherishes moments with her wife and three adorable calico cats.

Jiwon Yeom is a Solutions Architect at AWS, based in New York City. She focuses on Generative AI in the financial services industry and is passionate about helping customers build scalable, secure, and human-centered AI solutions. Outside of work, she enjoys writing, and exploring hidden bookstores.

Building an AIOps chatbot with Amazon Q Business custom plugins

Many organizations rely on multiple third-party applications and services for different aspects of their operations, such as scheduling, HR management, financial data, customer relationship management (CRM) systems, and more. However, these systems often exist in silos, requiring users to manually navigate different interfaces, switch between environments, and perform repetitive tasks, which can be time-consuming and inefficient.

Moreover, while many enterprise systems are equipped with APIs for integration, users often lack the technical expertise to interact with these APIs directly. As a result, organizations need an intuitive and seamless way to query data and perform actions across these applications using natural language, without requiring specialized knowledge of each system or its APIs.

To address the challenge of integrating multiple third-party applications into a unified, natural language-driven interface, users can use plugins for Amazon Q Business. Plugins provide a way to bridge the gap between complex, siloed enterprise applications through a user-friendly interface, empowering users to take action across systems with ease. Amazon Q Business supports multiple enterprise systems with pre-built plugins, as well as custom plugins, that users can use to integrate a variety of enterprise systems with Amazon Q Business applications.

Solution overview

In this post, we demonstrate how you can use custom plugins for Amazon Q Business to build a chatbot that can interact with multiple APIs using natural language prompts. We showcase how to build an AIOps chatbot that enables users to interact with their AWS infrastructure through natural language queries and commands. The chatbot is capable of handling tasks such as querying the data about Amazon Elastic Compute Cloud (Amazon EC2) ports and Amazon Simple Storage Service (Amazon S3) buckets access settings. For example, users can ask the chatbot questions like “Which EC2 instances have port 3389 open?” or request actions such as “Please close public access for S3 buckets.”

By integrating other AWS services with Amazon Q using OpenAPI schemas, the chatbot can not only retrieve real-time information (such as checking which S3 buckets have public access), but also take corrective actions (such as closing open ports or public access) in response to user commands. This solution reduces manual intervention and simplifies complex cloud operations by enabling IT teams to manage infrastructure through natural language interactions. The chatbot will streamline operational tasks, reduce the need for switching between different tools, and improve the efficiency of IT and operations teams by allowing them to interact with complex systems using simple, intuitive language.

Architecture

To implement the solution, you will build the following architecture.

Users sign in to the AIOps chatbot using the credentials configured in AWS IAM Identity Center. To demonstrate the capability of this AIOps chatbot built with Amazon Q Business custom plugins, you will use two use cases: finding and removing public access from S3 buckets, and finding and closing specific open ports on Amazon EC2 instances. However, you can extend the architecture to support other operations use cases through API-based integration.

You deploy the required infrastructure using the AWS Serverless Application Model (AWS SAM).

The following is a summary of the functionality of the architecture:

Prerequisites

Deploy and run the solution

The resources in this demonstration will be provisioned in the US East (N. Virginia) AWS Region (us-east-1). You walk through the following phases to implement the solution:

  1. Deploy the solution using the AWS SAM template
  2. Configure a user for the AIOps Q Business chatbot application
  3. Test the AIOps Q Business chatbot application
  4. Clean up

Step 1: Deploy the solution using the AWS SAM template

See the GitHub repository for the latest instructions. Run the following steps to deploy the solution using the AWS SAM template.

  1. Create a new directory, navigate to that directory in a terminal, and clone the GitHub repository:
git clone https://github.com/aws-samples/ai-ops-with-amazon-q-business.git

2. Change directory to the solution directory:

cd ai-ops-with-amazon-q-business

3. Run the following command to deploy the resources using SAM.

sam deploy -g

4. When prompted, enter the following parameter values:

Stack Name [sam-app]: aiops
AWS Region [us-east-1]: us-east-1
Confirm changes before deploy [y/N]: N

Allow SAM CLI IAM role creation [Y/n]: Y

Disable rollback [y/N]: N

FindS3BucketsWithPublicAccessFunction has no authentication. Is this okay? [y/N]: y

RemovePublicAcessFromS3BucketFunction has no authentication. Is this okay? [y/N]: y

FindEC2WithSpecificOpenPortFunction has no authentication. Is this okay? [y/N]: y

CloseUnwantedPortForEC2Function has no authentication. Is this okay? [y/N]: y

Save arguments to configuration file [Y/n]: Y

SAM configuration file [samconfig.toml]: hit enter

SAM configuration environment [default]: hit enter  

5. Note the outputs from the AWS SAM deployment process. This contains the Amazon Q Business web experience (chatbot) URL. Before you can sign in to the chatbot application, you must set up a user.

Step 2: Configure a user for the AIOps Amazon Q Business chatbot application

Use the following steps to configure a user for the AIOps chatbot application.

  1. Open Amazon Q Business from the console and select the AIOps application.

Amazon Console for AI Ops

2. Choose Manage access and subscription.

Choose Manage and Access subscription

3. Choose Add groups and users.

Add groups and users

4. Select either Add and assign new users or Assign existing users and groups depending on whether you pre-created the user as mentioned in the prerequisites, and choose Next.

5. If you have an existing user that you want to provide access to your AIOps application, search for and select the username and choose Assign.

Choose Assign

6. On the review page, select the current subscription and choose Confirm.

Review page

Step 3: Test the AIOps Q Business chatbot application

Use the following steps to log into the chatbot and test it. Responses from large language models are non-deterministic. Hence, you may not get the exact same response every time.

  1. Open the QBusinessWebExperienceURL from the sam deploy output and sign in using the user credentials configured in the previous step.
  2. After signing in to the AIOps Chatbot, select the kebab menu option (three dots) at the bottom right corner and select the AIOpsCustomPlugin as follows:

AIOps Chatbot

3. Enable public access on an Amazon S3 bucket. This is done for testing purposes only, so check your organization policies before performing this test. For this demo we used a bucket named aiops-chatbot-demo.

4. Return to the AIOps Chatbot and enter a question such as: Do I have any S3 bucket with public access? and choose Submit. Provide the bucket prefix to narrow down the search.

AIOps Chatbot - S3 buckets test

5. The AIOps chatbot identifies the buckets that have public access:

AIOps Answer - S3 Buckets

6. Ask a follow up question such as: Please block the public access. The chat bot blocks public access. Validate the change from the S3 console.

Chatbot - public access block

7. Open a port, such as 1234, for an Amazon EC2 instance using security group inbound rules.

Port test

8. Return to the chat bot and enter a question such as: Do I have any EC2 instance with port 1234 open?

9. After the chat bot identifies the EC2 instance with the open port, confirm that you want to close the port.

10. The chat bot closes the open port and confirms.

port close testing

Clean up

Properly decommissioning provisioned AWS resources is an important best practice to optimize costs and enhance security posture after concluding proofs of concept and demonstrations. To delete the resources deployed to your AWS account through AWS SAM, run the following command:

sam delete

OpenAPI schema definition

After the custom plugin is deployed, Amazon Q Business will process a user’s prompt and use the OpenAPI schema to dynamically determine the appropriate APIs to call to accomplish the user’s goal. Therefore, the OpenAPI schema definition has a big impact on API selection accuracy. Follow the best practices for OpenAPI schema definition for ideal results. This AIOps chatbot demonstrates four operations, supported by the following API operations:

  • find-s3-bucket-with-public-access – This API finds S3 buckets that have the specified prefix and are configured for public access.
  • remove-public-access-from-s3-bucket – This API removes public access from a specific S3 bucket.
  • find-ec2-with-specific-open-port – This API finds EC2 instances that have a specified port open for inbound access.
  • close-unwanted-port-for-ec2 – This API removes a specified port from a given EC2 instance.

The API operations are implemented using API Gateway and Lambda functions.
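
As an illustration of what sits behind one of these operations, the following is a hedged sketch of logic similar to find-s3-bucket-with-public-access using boto3; the actual Lambda function in the sample repository may be structured differently, and the function name here is illustrative.

import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")


def find_public_buckets(prefix: str = "") -> list:
    """Return bucket names matching the prefix whose bucket policy is public."""
    public_buckets = []
    for bucket in s3.list_buckets()["Buckets"]:
        name = bucket["Name"]
        if prefix and not name.startswith(prefix):
            continue
        try:
            status = s3.get_bucket_policy_status(Bucket=name)["PolicyStatus"]
            if status.get("IsPublic"):
                public_buckets.append(name)
        except ClientError:
            # No bucket policy, or no permission to read it; treat as not public here.
            continue
    return public_buckets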

Troubleshooting

The following are some troubleshooting steps if you encounter errors while using the AIOps chatbot.

  • As Amazon Q Business dynamically determines the appropriate API operations to be invoked, the questions (prompts) must be unambiguous. Be specific rather than asking generic questions. For example: Do I have any EC2 instance with port 1234 open? instead of Do I have any EC2 exposed to internet?
  • The APIs are exposed using API Gateway backed by Lambda functions. Check that you can invoke the API operations using Curl or API testing tools.
  • Check the Lambda function logs in Amazon CloudWatch for errors. Follow the Lambda debugging steps if needed.

Conclusion

In this post, you learned an end-to-end process for creating an AIOps chatbot using Amazon Q Business custom plugins, demonstrating how users can use natural language processing to interact with AWS resources and streamline cloud operations. By integrating other AWS services with Amazon Q Business, the chatbot can query infrastructure for security and compliance status while automating key actions such as closing open ports or restricting public access to S3 buckets. This solution enhances operational efficiency, reduces manual intervention, and enables teams to manage complex environments more effectively through intuitive, conversational interfaces. With custom plugins and OpenAPI schemas, users can build a powerful, flexible chatbot solution tailored to their specific operational needs, transforming the way they manage IT operations and respond to business challenges.

Further study

For more information on Amazon Q Business and custom plugins:


About the authors

Upendra V is a Sr. Solutions Architect at Amazon Web Services, specializing in Generative AI and cloud solutions. He helps enterprise customers design and deploy production-ready Generative AI workloads, implement Large Language Models (LLMs) and Agentic AI systems, and optimize cloud deployments. With expertise in cloud adoption and machine learning, he enables organizations to build and scale AI-driven applications efficiently.

Biswanath Mukherjee is a Senior Solutions Architect at Amazon Web Services. He works with large strategic customers of AWS by providing them technical guidance to migrate and modernize their applications on AWS Cloud. With his extensive experience in cloud architecture and migration, he partners with customers to develop innovative solutions that leverage the scalability, reliability, and agility of AWS to meet their business needs. His expertise spans diverse industries and use cases, enabling customers to unlock the full potential of the AWS Cloud.

How TransPerfect Improved Translation Quality and Efficiency Using Amazon Bedrock

This post is co-written with Keith Brazil, Julien Didier, and Bryan Rand from TransPerfect.

TransPerfect, a global leader in language and technology solutions, serves a diverse array of industries. Founded in 1992, TransPerfect has grown into an enterprise with over 10,000 employees in more than 140 cities on six continents. The company offers a broad spectrum of services, including translation, localization, interpretation, multicultural marketing, website globalization, subtitling, voiceovers, and legal support services. TransPerfect also uses cutting-edge technology to offer AI-driven language solutions, such as its proprietary translation management system, GlobalLink.

This post describes how the AWS Customer Channel Technology – Localization Team worked with TransPerfect to integrate Amazon Bedrock into the GlobalLink translation management system, a cloud-based solution designed to help organizations manage their multilingual content and translation workflows. Organizations use TransPerfect’s solution to rapidly create and deploy content at scale in multiple languages using AI.

Amazon Bedrock is a fully managed service that simplifies the deployment and management of generative AI models. It offers access to a variety of foundation models (FMs), enabling developers to build and scale AI applications efficiently. Amazon Bedrock is designed to be highly scalable, secure, and straightforward to integrate with other AWS services, making it suitable for a broad array of use cases, including language translation.

The AWS Customer Channel Technology – Localization Team is a long-standing TransPerfect customer. The team manages the end-to-end localization process of digital content at AWS, including webpages, technical documentation, ebooks, banners, videos, and more. The AWS team handles billions of words in multiple languages across digital assets. Given the growing demand for multilingual content by internationally minded businesses and new local cloud adoption journeys, the AWS team needs to support an ever-increasing load and a wider set of languages. To do so, the team relies on the GlobalLink technology suite to optimize and automate translation processes.

The challenge

The AWS team and TransPerfect created streamlined custom workflows and toolsets that enable the translation and delivery of billions of words each year. Content localization is a multi-step process consisting minimally of asset handoff, asset preprocessing, machine translation, post-editing, quality review cycles, and asset handback. These steps are often manual, costly, and time-consuming. AWS and TransPerfect are continually striving to optimize this workflow to enable the processing of more content at a lower cost and to decrease those assets’ time to market—providing valuable, salient content faster for non-English-speaking customers. Additionally, transcreation of creative content posed a unique challenge, because it traditionally required highly skilled human linguists and was resistant to automation, resulting in higher costs and longer turnaround times. To address these issues, TransPerfect worked with AWS to evaluate generative AI-powered initiatives for transcreation and automatic post-editing within TransPerfect’s GlobalLink architecture.

Security and data safety

Amazon Bedrock helps make sure data is neither shared with FM providers nor used to improve base models. Amazon Bedrock adheres to major compliance standards like ISO and SOC and is also a FedRAMP-authorized service, making it suitable for government contracts. The extensive monitoring and logging capabilities of Amazon Bedrock allow TransPerfect to align with stringent auditability requirements.

Although data safety is a key requirement, there are many other factors to take into account, such as responsible AI. Amazon Bedrock Guardrails enabled TransPerfect to build and customize truthfulness protections for the automatic post-edit offering. Large language models (LLMs) can generate incorrect information due to hallucinations. Amazon Bedrock supports contextual grounding checks to detect and filter hallucinations if the responses are factually incorrect or inconsistent. This is a critical feature for a translation solution that requires perfect accuracy.

Harnessing LLMs for automatic post-editing

To translate at scale, Amazon Translate-powered machine translation is used in AWS team workflows. Segments whose translations can’t be recycled from translation memories (databases of previous high-quality human translations) are routed to machine translation workflows. Depending on the language or content, the AWS team either uses a machine translation-only workflow, where content is translated and published with no human touch, or machine translation post-edit workflows. Post-editing is when a linguist finesses the machine-translated output of a given segment to make sure it correctly conveys the meaning of the original sentence and is in line with AWS style guides and agreed glossaries. Because this process can add days to the translation timeline, automating some or all of the process would have a major impact on cost and turnaround times.

The following diagram illustrates the machine translation workflow.

The workflow consists of the following components:

  • TM (translation memory) – The translation memory is a client-specific repository of previously translated and approved content. It’s always applied first and maximizes the reuse of existing translations.
  • MT (machine translation) – After existing translations are applied, new content is processed through machine translation using Amazon Translate.
  • APE (automated post-edit) – An LLM is employed to edit, improve, and correct machine-translated content.
  • HPE (human post-edit) – A subject matter expert linguist revises and perfects the machine-translated content.

The following example follows the path through the preceding workflow for one source segment.

  • Source – To choose user name attributes, don’t select User name as a sign-in option when you create your user pool.
  • MT – Pour choisir des attributs de nom d’utilisateur, évitez de sélectionner User name (Nom d’utilisateur) comme option de connexion au moment de créer votre groupe d’utilisateurs.
  • APE – Pour choisir des attributs de nom d’utilisateur, évitez de sélectionner User name (Nom d’utilisateur) comme option de connexion lorsque vous créez votre groupe d’utilisateurs.
  • HPE – Pour choisir les attributs de nom d’utilisateur, évitez de sélectionner User name (Nom d’utilisateur) comme option de connexion lorsque vous créez votre groupe d’utilisateurs.

TransPerfect began working with generative AI and LLMs several years ago with the foresight that AI was on track to disrupt the translation industry. As expected, localization workflows have mostly shifted to “expert in the loop”, and are striving toward “no human touch” models. In pursuit of this, TransPerfect chose to use Amazon Bedrock within its GlobalLink Enterprise solution to further automate and optimize these workflows. Amazon Bedrock, by design, provides data ownership and security. This is a critical feature for TransPerfect clients, especially those in sensitive industries such as life sciences or banking.

With Amazon Bedrock and GlobalLink, machine-translated content is now routed through one of the LLMs available in Amazon Bedrock for automatic post-editing. By using style guides, relevant examples of approved translations, and examples of errors to avoid, the LLM is prompted to improve existing machine translations. This post-edited content is either handed off to a linguist for a lighter post-edit (a less difficult task) or is applied in “no human touch workflows” to greatly improve the output. The result is enhanced quality across the board and the ability for post-editors to focus on higher-value edits.
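
As a purely illustrative sketch (not TransPerfect’s actual prompt, pipeline, or model choice), an automatic post-edit call through the Amazon Bedrock Converse API could be structured roughly as follows; the model ID, instructions, and truncated machine translation string are assumptions.

import boto3

bedrock = boto3.client("bedrock-runtime", region_name="<region_name>")

source = "To choose user name attributes, don't select User name as a sign-in option when you create your user pool."
machine_translation = "Pour choisir des attributs de nom d'utilisateur, ..."  # MT output to post-edit

response = bedrock.converse(
    modelId="anthropic.claude-3-5-sonnet-20240620-v1:0",  # illustrative model choice
    system=[{"text": "You are a translation post-editor. Improve the machine translation so it "
                     "follows the provided style guide and approved terminology, changing only "
                     "what is necessary."}],
    messages=[{
        "role": "user",
        "content": [{"text": f"Source (en): {source}\n"
                             f"Machine translation (fr): {machine_translation}\n"
                             "Return only the post-edited French translation."}],
    }],
)
print(response["output"]["message"]["content"][0]["text"])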

For post-editing, over 95% of the edits suggested by Amazon Bedrock LLMs showed markedly improved translation quality, leading to overall translation cost savings of up to 50% for TransPerfect and freeing human linguists for higher-level tasks.

Harnessing LLMs for transcreation

Although machine translation shows great strength in technical, formal, and instructional content, it hasn’t historically performed as well with creative content that leans into nuance, subtlety, humor, descriptiveness, and cultural references. Creative content can sound stiff or unnatural when machine translated. Because of this, TransPerfect has traditionally relied on human linguists to manually transcreate this type of content.

Transcreation is the process of adapting a message from one language to another while maintaining its intent, style, tone, and context. In German, for example, Nike’s “Just do it” tagline is transcreated to “Du tust es nie nur für dich,” which actually means “you never do it just for yourself.”

A successfully transcreated message evokes the same emotions and carries the same implications in the target language as it does in the source language. The AWS team uses transcreation for highly creative marketing assets to maximize their impact in a given industry. However, transcreation historically hasn’t benefitted from the automation solutions used in other types of localization workflows due to the highly customized and creative nature of the process. This means there has been a lot of interest in using generative AI to potentially decrease the costs and time associated with transcreation.

TransPerfect sought to use LLMs to cut down on time and costs typically associated with transcreation. Rather than an all-human or fully automated process, translations are produced through Anthropic’s Claude or Amazon Nova Pro on Amazon Bedrock, with the prompt to create multiple candidate translations with some variations. Within the translation editor, the human linguist chooses the most suitable adapted translation instead of composing it from scratch.
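The candidate-generation step can be sketched in a similar, simplified way: ask the model for several numbered variants in one call and let the linguist pick the best one in the editor. The prompt wording, model ID, and parsing convention below are assumptions.

```python
import re
import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

def transcreation_candidates(source: str, target_language: str, brief: str, n: int = 3) -> list[str]:
    """Ask the model for n creative adaptations of a marketing line (illustrative sketch)."""
    prompt = (
        f"Transcreate the following marketing line into {target_language}. "
        "Preserve the intent, tone, and emotional impact rather than the literal wording.\n"
        f"Creative brief: {brief}\n"
        f"Source: {source}\n"
        f"Return exactly {n} numbered candidate adaptations, one per line."
    )
    response = bedrock.converse(
        modelId="amazon.nova-pro-v1:0",  # assumed model ID; an Anthropic Claude model works the same way
        messages=[{"role": "user", "content": [{"text": prompt}]}],
        inferenceConfig={"maxTokens": 400, "temperature": 0.9},  # higher temperature encourages variation
    )
    text = response["output"]["message"]["content"][0]["text"]
    candidates = []
    for raw in text.splitlines():
        line = raw.strip()
        if re.match(r"^\d+[.)]", line):                      # keep only the numbered lines
            candidates.append(re.sub(r"^\d+[.)]\s*", "", line))
    return candidates
```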

The following screenshot shows an LLM-powered transcreation within the GlobalLink Translate online editor.

Using GlobalLink powered by Amazon Bedrock for transcreation, users are seeing linguist productivity gains of up to 60%.

Conclusion

Thanks to LLM-powered transcreation and post-editing, customers in industries ranging from life sciences to finance to manufacturing have seen cost savings of up to 40% within their translation workflows and up to an 80% reduction in project turnaround times. In addition, the automatic post-edit step added to machine translation-only workflows provides a major quality boost to the no human touch output.

Amazon Bedrock safeguards data by not allowing sharing with FM providers and excluding it from model improvements. Beyond data security, responsible AI is essential. Amazon Bedrock Guardrails allows TransPerfect to customize truthfulness protections for post-editing. To address AI hallucinations, it offers contextual grounding checks to identify and filter inaccuracies—critical for producing precise translations.

Try out LLM-powered transcreation and post-editing with Amazon Bedrock for your own use case, and share your feedback and questions in the comments.


About the authors

Peter Chung is a Senior Solutions Architect at AWS, based in New York. Peter helps software and internet companies across multiple industries scale, modernize, and optimize. Peter is the author of “AWS FinOps Simplified”, and is an active member of the FinOps community.

Franziska Willnow is a Senior Program Manager (Tech) at AWS. A seasoned localization professional, Franziska Willnow brings over 15 years of expertise from various localization roles at Amazon and other companies. Franziska focuses on localization efficiency improvements through automation, machine learning, and AI/LLM. Franziska is passionate about building innovative products to support AWS’ global customers.

Ajit Manuel is a product leader at AWS, based in Seattle. Ajit heads the content technology product practice, which powers the AWS global content supply chain from creation to intelligence with practical enterprise AI. Ajit is passionate about enterprise digital transformation and applied AI product development. He has pioneered solutions that transformed InsurTech, MediaTech, and global MarTech.

Keith Brazil is Senior Vice President of Technology at TransPerfect, with specialization in Translation Management technologies as well as AI/ML data collection and annotation platforms. A native of Dublin, Ireland, Keith has been based in New York City for the last 23 years.

Julien Didier is Vice-President of Technology for translations.com and is responsible for the implementation of AI for both internal workflows and client-facing products. Julien manages a worldwide team of engineers, developers and architects who ensure successful deployments in addition to providing feedback for feature requests.

Bryan Rand is Senior Vice President of Global Solutions at TransPerfect, specializing in enterprise software, AI-driven digital marketing, and content management strategies. With over 20 years of experience leading business units and implementing customer experience innovations, Bryan has played a key role in driving successful global transformations for Fortune 1000 companies. He holds a BA in Economics from the University of Texas.

Read More

Racing beyond DeepRacer: Debut of the AWS LLM League

Racing beyond DeepRacer: Debut of the AWS LLM League

The AWS DeepRacer League is the world’s first autonomous racing league, open to anyone. Announced at re:Invent 2018, it puts machine learning in the hands of every developer through the fun and excitement of developing and racing self-driving remote control cars. Over the past 7 years, more than 560,000 developers of all skill levels have competed in the league at thousands of Amazon and customer events globally. While the final championships concluded at re:Invent 2024, that same event played host to a brand new AI competition, ushering in a new era of gamified learning in the age of generative AI.

AWS launched the AWS Large Language Model League (AWS LLM League) at re:Invent 2024 in December. This inaugural event marked a significant milestone in democratizing machine learning, bringing together over 200 enthusiastic attendees from diverse backgrounds to engage in hands-on technical workshops and a competitive foundation model fine-tuning challenge. Drawing on learnings from DeepRacer, the event’s primary objective was to simplify model customization learning while fostering a collaborative community around generative AI innovation through a gamified competition format.

AWS LLM League structure and outcomes

The AWS LLM League was designed to lower the barriers to entry in generative AI model customization by providing an experience where participants, regardless of their prior data science experience, could engage in fine-tuning LLMs. Using Amazon SageMaker JumpStart, attendees were guided through the process of customizing LLMs to address real business challenges adaptable to their domain.

LLM league structure.

As shown in the preceding figure, the challenge began with a workshop, where participants embarked on a competitive journey to develop highly effective fine-tuned LLMs. Competitors were tasked with customizing Meta’s Llama 3.2 3B base model for a specific domain, applying the tools and techniques they learned. Each submitted model was compared against a larger 90B reference model, with the quality of the responses decided using an LLM-as-a-Judge approach. Participants scored a win for each question where the LLM judge deemed the fine-tuned model’s response to be more accurate and comprehensive than that of the larger model.

In the preliminary rounds, participants submitted hundreds of unique fine-tuned models to the competition leaderboard, each striving to outperform the baseline model. These submissions were evaluated based on accuracy, coherence, and domain-specific adaptability. After rigorous assessments, the top five finalists were shortlisted, with the best models achieving win rates above 55% against the large reference models (as shown in the preceding figure). Demonstrating that a smaller model can achieve competitive performance highlights significant benefits in compute efficiency at scale. Using a 3B model instead of a 90B model reduces operational costs, enables faster inference, and makes advanced AI more accessible across various industries and use cases.

The competition culminated in the Grand Finale, where finalists showcased their models in a final round of evaluation to determine the ultimate winner.

The fine-tuning journey

This journey was carefully designed to guide participants through each critical stage of fine-tuning a large language model—from dataset creation to model evaluation—using a suite of no-code AWS tools. Whether they were newcomers or experienced builders, participants gained hands-on experience in customizing a foundation model through a structured, accessible process. Let’s take a closer look at how the challenge unfolded, starting with how participants prepared their datasets.

Stage 1: Preparing the dataset with PartyRock

During the workshop, participants learned how to generate synthetic data using an Amazon PartyRock playground (as shown in the following figure). PartyRock offers access to a variety of top foundation models through Amazon Bedrock at no additional cost. This enabled participants to use a no-code, AI-generated app to create the synthetic training data used for fine-tuning.

Participants began by defining the target domain for their fine-tuning task, such as finance, healthcare, or legal compliance. Using PartyRock’s intuitive interface, they generated instruction-response pairs that mimicked real-world interactions. To enhance dataset quality, they used PartyRock’s ability to refine responses iteratively, making sure that the generated data was both contextually relevant and aligned with the competition’s objectives.

This phase was crucial because the quality of synthetic data directly impacted the model’s ability to outperform a larger baseline model. Some participants further enhanced their datasets by employing external validation methods, such as human-in-the-loop review or reinforcement learning-based filtering.
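PartyRock itself is a no-code experience, but the underlying idea of prompting a foundation model to produce instruction-response pairs for a chosen domain can also be scripted directly against Amazon Bedrock. The following sketch is an illustrative assumption (model ID, prompt, and JSONL layout are placeholders), not the workshop's actual tooling.

```python
import json
import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

def generate_qa_pairs(domain: str, n: int = 20) -> list[dict]:
    """Ask a Bedrock model for synthetic question-answer pairs in a given domain (illustrative sketch)."""
    prompt = (
        f"Write {n} question-and-answer pairs that a {domain} expert would consider accurate and useful. "
        'Return them as a JSON array of objects with "question" and "answer" keys, and no other text.'
    )
    response = bedrock.converse(
        modelId="amazon.nova-pro-v1:0",  # assumed model ID
        messages=[{"role": "user", "content": [{"text": prompt}]}],
        inferenceConfig={"maxTokens": 4000, "temperature": 0.7},
    )
    # A production pipeline would validate the output; here we assume the model returned clean JSON.
    return json.loads(response["output"]["message"]["content"][0]["text"])

# Write the pairs to a JSONL file; the exact training schema depends on the fine-tuning tool,
# so treat this layout as a placeholder.
with open("synthetic_qa.jsonl", "w") as f:
    for pair in generate_qa_pairs("financial services compliance"):
        f.write(json.dumps(pair) + "\n")
```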

Stage 2: Fine-tuning with SageMaker JumpStart

After the datasets were prepared, participants moved to SageMaker JumpStart, a fully managed machine learning hub that simplifies the fine-tuning process. Using a pre-trained Meta Llama 3.2 3B model as the base, they customized it with their curated datasets, adjusting hyperparameters (shown in the following figure) such as:

  • Epochs: Determining how many times the model iterates over the dataset.
  • Learning rate: Controlling how much the model weights adjust with each iteration.
  • LoRA parameters: Optimizing efficiency with low-rank adaptation (LoRA) techniques.

One of the key advantages of SageMaker JumpStart is that it provides a no-code UI, shown in the following figure, allowing participants to fine-tune models without needing to write code. This accessibility enabled even those with minimal machine learning experience to engage in model customization effectively.

By using the distributed training capabilities of SageMaker, participants were able to run multiple experiments in parallel, optimizing their models for accuracy and response quality. The iterative fine-tuning process allowed them to explore different configurations to maximize performance.
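Although the workshop flow was UI-driven, the same step can be scripted with the SageMaker Python SDK. The sketch below is an assumption: the JumpStart model ID, hyperparameter names, instance type, and S3 path are illustrative and should be checked against the JumpStart model card.

```python
from sagemaker.jumpstart.estimator import JumpStartEstimator

# Assumed JumpStart model ID for Meta Llama 3.2 3B; look up the exact ID in SageMaker JumpStart.
# Assumes a SageMaker execution role is available in the environment (for example, SageMaker Studio).
estimator = JumpStartEstimator(
    model_id="meta-textgeneration-llama-3-2-3b",
    environment={"accept_eula": "true"},   # Meta models require accepting the EULA
    instance_type="ml.g5.12xlarge",        # illustrative training instance
    hyperparameters={
        "epoch": "3",                      # passes over the dataset
        "learning_rate": "0.0001",         # step size for weight updates
        "lora_r": "8",                     # LoRA rank for parameter-efficient fine-tuning
    },
)

# The training channel points at the prepared instruction-response dataset in Amazon S3.
estimator.fit({"training": "s3://my-bucket/llm-league/train/"})
```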

Stage 3: Evaluation with SageMaker Clarify

To make sure that their models were not only accurate but also unbiased, participants had the option to use Amazon SageMaker Clarify for evaluation, shown in the following figure.

This phase included:

  • Bias detection: Identifying skewed response patterns that might favor specific viewpoints.
  • Explainability metrics: Understanding why the model made certain predictions.
  • Performance scoring: Comparing model output against ground truth labels.

While not mandatory, the integration of SageMaker Clarify provided an additional layer of assurance for participants who wanted to validate their models further, verifying that their outputs were reliable and performant.

Stage 4: Submission and evaluation using LLM-as-a-Judge from Amazon Bedrock

After fine-tuned models were ready, they were submitted to the competition leaderboard for evaluation using the Amazon Bedrock Evaluations LLM-as-a-Judge approach. This automated evaluation system compares the fine-tuned models against the reference 90B model using predefined benchmarks, as shown in the following figure.

Each response was scored based on:

  • Relevance: How well the response addressed the question.
  • Depth: The level of detail and insight provided.
  • Coherence: Logical flow and consistency of the answer.

Participants’ models earned a score each time their response outperformed the 90B model in a head-to-head comparison. The leaderboard dynamically updated as new submissions were evaluated, fostering a competitive yet collaborative learning environment.
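Conceptually, the head-to-head scoring can be sketched as follows: a judge model sees the question and both answers, names a winner, and the fine-tuned model's win rate is the fraction of questions it wins. The prompt, judge model ID, and parsing rule are illustrative assumptions rather than the competition's actual evaluation harness.

```python
import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")
JUDGE_MODEL_ID = "anthropic.claude-3-5-sonnet-20240620-v1:0"  # assumed judge model

def judge_pair(question: str, answer_a: str, answer_b: str) -> str:
    """Ask the judge which answer is more accurate and comprehensive; returns 'A' or 'B'."""
    prompt = (
        "You are an impartial judge. Decide which answer is more accurate and comprehensive "
        "for the question. Reply with exactly one character: A or B.\n\n"
        f"Question: {question}\n\nAnswer A:\n{answer_a}\n\nAnswer B:\n{answer_b}"
    )
    response = bedrock.converse(
        modelId=JUDGE_MODEL_ID,
        messages=[{"role": "user", "content": [{"text": prompt}]}],
        inferenceConfig={"maxTokens": 5, "temperature": 0.0},
    )
    return response["output"]["message"]["content"][0]["text"].strip()[:1].upper()

def win_rate(eval_set: list[dict]) -> float:
    """eval_set items look like {'question': ..., 'fine_tuned': ..., 'reference_90b': ...}."""
    wins = sum(
        1
        for item in eval_set
        if judge_pair(item["question"], item["fine_tuned"], item["reference_90b"]) == "A"
    )
    return wins / len(eval_set)
```

In practice, an evaluation harness would also swap the answer positions between calls to reduce the judge's position bias.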

Grand Finale showcase

The Grand Finale of the AWS LLM League was an electrifying showdown, where the top five finalists, handpicked from hundreds of submissions, competed in a high-stakes live event. Among them was Ray, a determined contender whose fine-tuned model had consistently delivered strong results throughout the competition. Each finalist had to prove not just the technical superiority of their fine-tuned models, but also their ability to adapt and refine responses in real time.

The competition was intense from the outset, with each participant bringing unique strategies to the table. Ray’s ability to tweak prompts dynamically set him apart early on, providing optimal responses to a range of domain-specific questions. The energy in the room was palpable as finalists’ AI-generated answers were judged by a hybrid evaluation system—40% by an LLM, 40% by expert panelists from Meta AI and AWS, and 20% by an enthusiastic live audience against the following rubric:

  • Generalization ability: How well the fine-tuned model adapted to previously unseen questions.
  • Response quality: Depth, accuracy, and contextual understanding.
  • Efficiency: The model’s ability to provide comprehensive answers with minimal latency.

One of the most gripping moments came when contestants encountered the infamous Strawberry Problem, a deceptively simple letter-counting challenge that exposed an inherent weakness in LLMs. Ray’s model delivered the correct answer, but the AI judge misclassified it, sparking a debate among the human judges and audience. This pivotal moment underscored the importance of human-in-the-loop evaluation, highlighting how AI and human judgment must complement each other for fair and accurate assessments.

As the final round concluded, Ray’s model consistently outperformed expectations, securing him the title of AWS LLM League Champion. The Grand Finale was not just a test of AI—it was a showcase of innovation, strategy, and the evolving synergy between artificial intelligence and human ingenuity.

Conclusion and looking ahead

The inaugural AWS LLM League competition successfully demonstrated how large language model fine-tuning can be gamified to drive innovation and engagement. By providing hands-on experience with cutting-edge AWS AI and machine learning (ML) services, the competition not only demystified the fine-tuning process, but also inspired a new wave of AI enthusiasts to experiment and innovate in this space.

As the AWS LLM League moves forward, future iterations will expand on these learnings, incorporating more advanced challenges, larger datasets, and deeper model customization opportunities. Whether you’re a seasoned AI practitioner or a newcomer to machine learning, the AWS LLM League offers an exciting and accessible way to develop real-world AI expertise.

Stay tuned for upcoming AWS LLM League events and get ready to put your fine-tuning skills to the test!


About the authors

Vincent Oh is the Senior Specialist Solutions Architect in AWS for AI & Innovation. He works with public sector customers across ASEAN, owning technical engagements and helping them design scalable cloud solutions across various innovation projects. He created the LLM League in the midst of helping customers harness the power of AI in their use cases through gamified learning. He also serves as an Adjunct Professor at Singapore Management University (SMU), teaching computer science modules under the School of Computer & Information Systems (SCIS). Prior to joining Amazon, he worked as Senior Principal Digital Architect at Accenture and Cloud Engineering Practice Lead at UST.

Natasya K. Idries is the Product Marketing Manager for AWS AI/ML Gamified Learning Programs. She is passionate about democratizing AI/ML skills through engaging and hands-on educational initiatives that bridge the gap between advanced technology and practical business implementation. Her expertise in building learning communities and driving digital innovation continues to shape her approach to creating impactful AI education programs. Outside of work, Natasya enjoys traveling, cooking Southeast Asian cuisines and exploring nature trails.

Read More

Reduce ML training costs with Amazon SageMaker HyperPod

Reduce ML training costs with Amazon SageMaker HyperPod

Training a frontier model is highly compute-intensive, requiring a distributed system of hundreds, or thousands, of accelerated instances running for several weeks or months to complete a single job. For example, pre-training the Llama 3 70B model with 15 trillion training tokens took 6.5 million H100 GPU hours. On 256 Amazon EC2 P5 instances (p5.48xlarge, each with 8 NVIDIA H100 GPUs), this would take approximately 132 days.

Distributed training workloads run in a synchronous manner because each training step requires all participating instances to complete their calculations before the model can advance to the next step. This means that if a single instance fails, the entire job stops. As cluster sizes grow, the likelihood of failure increases due to the number of hardware components involved. Each hardware failure can result in wasted GPU hours and requires valuable engineering time to identify and resolve the issue, making the system prone to downtime that can disrupt progress and delay completion. To assess system reliability, engineering teams often rely on key metrics such as mean time between failures (MTBF), which measures the average operational time between hardware failures and serves as a valuable indicator of system robustness.

In this post, we explore the challenges of large-scale frontier model training, focusing on hardware failures and the benefits of Amazon SageMaker HyperPod—a resilient solution that minimizes disruptions, enhances efficiency, and reduces training costs.

Instance failure rate

To understand the typical MTBF for large-scale frontier model training, it helps to first understand instance failure rates by reviewing three noteworthy examples:

  1. When training OPT-175B on 992 A100 GPUs, Meta AI encountered significant hardware reliability challenges. Across 2 months, the team managed 35 manual restarts and cycled over 100 hosts due to hardware issues, and automated systems triggered more than 70 restarts. Operating 124 instances (each with 8 GPUs) continuously over 1,440 hours, Meta accumulated a total of 178,560 instance-hours. The observed failure rate during this period was around 0.0588% per instance-hour, underscoring the reliability hurdles in training large frontier models at this scale.
  2. During the training of Llama 3.1 405B on 16,000 H100 GPUs, a total of 417 unscheduled hardware failures occurred during a 54-day period. This translates to an effective failure rate of about 0.0161% per instance-hour.
  3. MPT-7B was trained on 1 trillion tokens over the course of 9.5 days on 440 x A100-40GB. During this period, the training job experienced four hardware failures, resulting in an effective failure rate of approximately 0.0319% per instance-hour.

Based on these examples, it’s realistic to expect that in a single hour of large-scale distributed training, an instance will fail about 0.02%–0.06% of the time.

Larger clusters, more failures, smaller MTBF

As cluster size increases, so does the number of hardware components that can fail, resulting in a lower MTBF. The following table illustrates how the MTBF (in hours) changes with the number of instances in a cluster and the estimated failure rate for each instance. For example, with a 0.04% per-hour failure rate per instance, a 512-instance system is expected to experience a failure approximately every 5 hours.

MTBF (in hours) by failure rate and size of cluster (instances):

| Failure rate (per instance per hour) | 4 | 8 | 16 | 32 | 64 | 128 | 256 | 512 |
|---|---|---|---|---|---|---|---|---|
| 0.01% | 2500 | 1250 | 625 | 313 | 157 | 79 | 40 | 20 |
| 0.02% | 1250 | 625 | 313 | 157 | 79 | 40 | 20 | 10 |
| 0.04% | 625 | 313 | 157 | 79 | 40 | 20 | 10 | 5 |
| 0.08% | 313 | 157 | 79 | 40 | 20 | 10 | 5 | 3 |

Table 1: The change in MTBF (in hours) with the number of instances in a training cluster (with assumed failure rates in the columns)
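The table values follow directly from MTBF = 1 / (per-instance failure rate × number of instances); the short sketch below reproduces them (up to small rounding differences).

```python
def mtbf_hours(failure_rate_per_instance_hour: float, num_instances: int) -> float:
    """Expected hours between failures for a cluster, assuming independent instance failures."""
    return 1.0 / (failure_rate_per_instance_hour * num_instances)

for rate in (0.0001, 0.0002, 0.0004, 0.0008):   # 0.01%, 0.02%, 0.04%, 0.08% per instance per hour
    row = [round(mtbf_hours(rate, n)) for n in (4, 8, 16, 32, 64, 128, 256, 512)]
    print(f"{rate:.2%}: {row}")
# For example, 0.04% with 512 instances gives about 5 hours between failures, matching Table 1.
```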

What happens after a failure?

In a perfect world, without failures, the training job proceeds as shown in the following graph, which illustrates the total training time without failures, demonstrating a linear progression.

Figure 1: Training is linear in a perfect world without failures, since there are no interruptions to completion.

However, as previously noted, hardware failures are inevitable. Troubleshooting these failures typically involves several steps:

  1. Root cause analysis (mean time to detect) – Identifying hardware failures as the root cause of training interruptions can be time-consuming, especially in complex systems with multiple potential failure points. The time taken to determine the root cause is referred to as mean time to detect (MTTD).
  2. Hardware repair or replacement (mean time to replace) – Sometimes, a simple instance restart resolves the issue. At other times, the instance must be replaced, which can involve logistical delays, especially if specialized components aren’t readily available. If a replacement instance isn’t on hand when a GPU fails, the system must wait for one to become available. Common distributed training approaches, such as PyTorch Fully Sharded Data Parallel (FSDP), don’t permit the workload to be redistributed among the remaining instances.
  3. System recovery and resumption (mean time to restart) – After resolving hardware issues and replacing the instance, additional time is needed to restore it to its previous state. The new instance must match the original configuration, and the entire cluster must load the model weights from the latest saved checkpoint.

Each failure incurs engineering effort to identify its root cause. When hardware issues arise, diagnostics confirm the problem and isolate the faulty instance, pausing the training job and increasing downtime. The impact of these failures is illustrated in the following figure and can be empirically measured for large distributed training jobs. The figure outlines the troubleshooting steps that follow a failure.

Figure 2: Impact of failures on a distributed training run. Once a failure occurs, time (idle GPUs) is spent on detecting (MTD), replacing (MTT Replace), and continuing (MTR Restart) a training run, often wasting time and expensive resources.

In a scenario where a distributed training job is running on an Amazon Elastic Kubernetes Service (Amazon EKS) cluster with n reserved instances and an Auto Scaling group set to maintain a minimum of n instances, a hardware issue such as a GPU failure can cause the job to fail. The affected instance will be marked as Unhealthy by a Kubernetes health monitor such as Node Problem Detector, and Amazon EKS will attempt to reschedule the training pods to healthy instances. If no instances have sufficient resources, the pods remain in a Pending state, and because the instance count is limited to n, no new instance will be automatically provisioned.

In such cases, the failed job must be manually identified through pod logs or the Kubernetes API and deleted. The failed instance also needs to be isolated and terminated manually, either through the AWS Management Console, AWS Command Line Interface (AWS CLI), or tools like kubectl or eksctl. To restore cluster capacity, the user must increase the cluster size by modifying the Auto Scaling group or updating the instance group. After the new instance is provisioned, bootstrapped, and added to the cluster, the training job must be restarted manually. If checkpointing is enabled, the job can resume from the last saved state. The overall downtime depends on the time required to provision a new instance and restart the job by rescheduling the pods.

Faster failure detection (shorter MTTD), shorter replacement times (shorter MTTR), and rapid resumption will all contribute to reducing total training time. Automating these processes with minimal user intervention is a key advantage of Amazon SageMaker HyperPod. 

Amazon SageMaker HyperPod resilient training infrastructure

SageMaker HyperPod is a compute environment optimized for large-scale frontier model training. This means users can build resilient clusters for machine learning (ML) workloads and develop or fine-tune state-of-the-art frontier models, as demonstrated by organizations such as Luma Labs and Perplexity AI. SageMaker HyperPod runs health monitoring agents in the background for each instance. When it detects a hardware failure, SageMaker HyperPod automatically repairs or replaces the faulty instance and resumes training from the last saved checkpoint. This automation alleviates the need for manual management, which means customers can train in distributed settings for weeks or months with minimal disruption. The benefits are particularly significant for customers deploying many instances (greater than 16) in a cluster.

Frontier model builders can further enhance model performance using built-in ML tools within SageMaker HyperPod. They can use Amazon SageMaker AI with MLflow to create, manage, and track ML experiments, or use Amazon SageMaker AI with TensorBoard to visualize model architecture and address convergence issues. Additionally, integrating with observability tools such as Amazon CloudWatch Container Insights, Amazon Managed Service for Prometheus, and Amazon Managed Grafana provides deeper insights into cluster performance, health, and utilization, ultimately saving valuable development time. The following figure compares the downtime of an infrastructure system using SageMaker HyperPod versus one without SageMaker HyperPod.

Figure 3: Comparing downtime chart from figure 1 with downtime on SageMaker HyperPod. When a failure occurs, it is detected automatically by HyperPod agents, and the instance is replaced in the background. Training is also resumed from the latest checkpoint

SageMaker HyperPod reduces the downtime per hardware failure by automatically detecting hardware issues. When these issues are detected, SageMaker HyperPod automatically replaces the faulty node(s) and resumes your training job from the latest checkpoint, assuming that checkpoints are written.

To evaluate this, we conducted experiments on SageMaker HyperPod using different cluster sizes of p5.48xlarge instances. The following table shows empirical measurements of time to resume by cluster size, reported at the 90th percentile (P90), meaning that 90% of observed times fall at or below this value.

| Cluster size (number of instances) | P90 time to detect (in seconds) | P90 time to replace (in seconds) | P90 time to resume (in seconds) | Total downtime per failure (in seconds) | Total downtime per failure (in minutes) |
|---|---|---|---|---|---|
| 16 | 83 | 912 | 1212 | 2207 | 36.8 |
| 64 | 90 | 963 | 1320 | 2373 | 39.6 |
| 256 | 89 | 903 | 1398 | 2390 | 39.8 |
| 1024 | 80 | 981 | 1440 | 2501 | 41.7 |

Table 2: MTTResume (in seconds) on clusters with different sizes

As shown, the mean time to replace an instance is independent of cluster size. For a cluster of 256 x p5.48xlarge instances training Meta Llama 3.1 70B parameter model with batch size = 8, replacing an instance takes about 940 seconds (or 15.7 minutes). After replacement, the new instance must install additional packages using lifecycle scripts and run deep health checks before reading from the latest saved checkpoint. When it’s operational, the training job resumes from the most recent checkpoint, minimizing progress loss despite the interruption. For a 256-instance cluster, it took us about 2,390 seconds (about 40 minutes) to automatically resume the training job after each failure.

Without SageMaker HyperPod, when a GPU failure occurs during a training job, the time it takes to resume the training can vary widely depending on the infrastructure and processes in place. With proper checkpointing, automated job orchestration, and efficient hardware provisioning, the resume time can be reduced. However, without these optimizations, the impact can be much more severe. Empirical evidence from customer experiences—including a leading open source frontier model provider, a top large language model (LLM) startup, an AI company specializing in enterprise frontier models, and a cutting-edge scientific research institute—indicates that without SageMaker HyperPod, the total downtime per GPU failure averages approximately 280 minutes. Thus, Amazon SageMaker HyperPod saves about 240 minutes (or about 4 hours) of downtime per failure:

| Metric | Without SageMaker HyperPod (in minutes) | With SageMaker HyperPod (in minutes) |
|---|---|---|
| Mean time to root-cause | 10 | 1.5 |
| Mean time to replace | 240 | 15 |
| Mean time to resume | 30 | 23.5 |
| Total downtime per failure | 280 | 40 |

Table 3: Typical failure numbers, in minutes (as described in section “What happens after a failure?” with and without SageMaker HyperPod)

Quantifying the downtime savings

Depending on the frequency of failures, we can calculate the time to train and the cost savings of using SageMaker HyperPod. To illustrate this calculation, we assume it takes 40 minutes to replace an instance with SageMaker HyperPod compared to 280 minutes without it (as previously explained). Additionally, for this calculation, let’s assume a training job requiring 10 million GPU hours on H100 instances, running on a 256-instance P5 cluster.

Although the actual overhead (in hours) depends on the size of the training job, the relative overhead remains constant. The benefits of SageMaker HyperPod in reducing total training time are demonstrated in the following table. For example, in a 256-instance cluster with a failure rate of 0.05%, SageMaker HyperPod reduces total training time by 32%.

Total training time reduced, by failure rate and size of cluster (instances):

| Failure rate (per instance per hour) | 4 | 8 | 16 | 32 | 64 | 128 | 256 | 512 |
|---|---|---|---|---|---|---|---|---|
| 0.01% | 0% | 0% | 1% | 1% | 2% | 5% | 9% | 17% |
| 0.02% | 0% | 1% | 1% | 2% | 5% | 9% | 17% | 28% |
| 0.05% | 1% | 2% | 3% | 6% | 11% | 20% | 32% | 48% |
| 0.07% | 1% | 2% | 4% | 8% | 15% | 25% | 40% | 55% |

Table 4: Total % of training time reduced by SageMaker HyperPod compared to a P5 cluster of comparable size

To translate this into actual savings, for a training job requiring 10 million GPU hours on a 256-instance cluster, SageMaker HyperPod saves 104 days of training time. As a result, customers can reduce time-to-market by 3.5 months. Without SageMaker HyperPod, the total time to train would be approximately 325 days, 121 of which are just spent on isolating and mitigating hardware issues. The following table shows the time to train benefits.

| Metric | Value |
|---|---|
| H100 GPU hours for training | 10,000,000 |
| Number of instances | 256 |
| Failure rate (per instance per hour) | 0.05% |
| Additional time to fix per failure (hours) | 4 |
| Days lost due to hardware issues (with SageMaker HyperPod) | 17 |
| Days lost due to hardware issues (without SageMaker HyperPod) | 121 |
| Time to train with SageMaker HyperPod (days) | 221 |
| Time to train without SageMaker HyperPod (days) | 325 |
| SageMaker HyperPod improvement | 32% |
| Time saved with SageMaker HyperPod (days) | 104 |

Table 5: Benefits presented by SageMaker HyperPod for a training run requiring 10 million GPU hours and a 256 instance cluster. SageMaker HyperPod saves 104 days of training time overall, resulting in a faster time to market (by 3.5 months!)

For the same example, we can estimate the total cost savings using:

Downtime due to hardware issues (in hours) = (Number of instances) × (Failure rate per instance per hour) × (24 hours per day) × (Total training days) × (Downtime per failure in hours). Dividing this result by 24 gives the days lost due to hardware issues.

The following table shows the cost-to-train benefits.

| Metric | Value |
|---|---|
| H100 GPU hours for training | 10,000,000 |
| Number of instances | 256 |
| Failure rate (per instance per hour) | 0.05% |
| Time saved with SageMaker HyperPod (days) | 104 |
| Cost per GPU per hour | $5 |
| Total cost saving with SageMaker HyperPod | $25,559,040 |

Table 6: Using the calculation described above, the cost to train benefits laid out for a training run requiring 10 million GPU hours, 256 GPU based instances, and an assumed failure rate of 0.05% per instance per hour

A training job requiring 10 million GPU hours and 104 additional days of resolving hardware issues results in significant idle cluster time. Assuming a GPU cost of $5 per hour (equivalent to the price of P5 instances on Capacity Blocks for ML), the total cost savings with SageMaker HyperPod amounts to $25,559,040.
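To adapt these numbers to a different cluster, the arithmetic behind Tables 5 and 6 can be reproduced in a few lines; the inputs below are the assumptions used in this post.

```python
GPU_HOURS_NEEDED = 10_000_000        # total H100 GPU hours for the training job
INSTANCES = 256                      # p5.48xlarge instances
GPUS_PER_INSTANCE = 8
FAILURE_RATE = 0.0005                # 0.05% per instance per hour
DOWNTIME_WITH_HP_HOURS = 40 / 60     # about 40 minutes per failure with SageMaker HyperPod
DOWNTIME_WITHOUT_HP_HOURS = 280 / 60 # about 280 minutes per failure without it
COST_PER_GPU_HOUR = 5.0              # dollars per GPU-hour

compute_days = GPU_HOURS_NEEDED / (INSTANCES * GPUS_PER_INSTANCE) / 24  # ~203.5 ideal training days
failures = INSTANCES * FAILURE_RATE * 24 * compute_days                 # ~625 expected failures

days_lost_with_hp = failures * DOWNTIME_WITH_HP_HOURS / 24              # ~17 days
days_lost_without_hp = failures * DOWNTIME_WITHOUT_HP_HOURS / 24        # ~121 days
days_saved = days_lost_without_hp - days_lost_with_hp                   # ~104 days

cost_saved = days_saved * 24 * INSTANCES * GPUS_PER_INSTANCE * COST_PER_GPU_HOUR
print(f"Days saved: {days_saved:.0f}, cost saved: ${cost_saved:,.0f}")  # roughly 104 days and $25.6 million
```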

Summary

Training frontier models is a complex, resource-intensive process that is particularly vulnerable to hardware failures. In this post, we explored the instance failure rate, which can range from about 0.02% to 0.07% per instance per hour during large-scale distributed training. As cluster size grows, the likelihood of failures increases, and the MTBF decreases. We also examined what happens after failure, including root cause analysis, hardware repair or replacement, and system recovery and resumption.

Next, we examined Amazon SageMaker HyperPod—a purpose-built, fully resilient cluster for frontier model training. By incorporating robust fault-tolerance mechanisms and automated health monitoring, SageMaker HyperPod minimizes disruptions caused by hardware issues. This not only streamlines the training process but also enhances the reliability and efficiency of model development, enabling faster and more effective innovation delivery. The benefits are measurable and correlate with both cluster size and failure rate. For a 256-instance cluster with a 0.05% per-instance-per-hour failure rate, SageMaker HyperPod reduces total training time by 32%, resulting in an approximate savings of $25.6 million in total training costs.

By addressing the reliability challenges of frontier model training, SageMaker HyperPod allows ML teams to focus on model innovation rather than infrastructure management. Organizations can now conduct long training runs with confidence, knowing that hardware failures will be automatically detected and resolved with minimal disruption to their ML workloads. Get started with Amazon SageMaker HyperPod.

Special thanks to Roy Allela, Senior AI/ML Specialist Solutions Architect for his support on the launch of this post.


About the Authors

Anoop Saha is a Sr GTM Specialist at Amazon Web Services (AWS) focusing on generative AI model training and inference. He partners with top frontier model builders, strategic customers, and AWS service teams to enable distributed training and inference at scale on AWS and lead joint GTM motions. Before AWS, Anoop held several leadership roles at startups and large corporations, primarily focusing on silicon and system architecture of AI infrastructure.

Trevor Harvey is a Principal Specialist in generative AI at Amazon Web Services (AWS) and an AWS Certified Solutions Architect – Professional. Trevor works with customers to design and implement machine learning solutions and leads go-to-market strategies for generative AI services.

Aman Shanbhag is a Specialist Solutions Architect on the ML Frameworks team at Amazon Web Services (AWS), where he helps customers and partners with deploying ML training and inference solutions at scale. Before joining AWS, Aman graduated from Rice University with degrees in computer science, mathematics, and entrepreneurship.

Read More

Model customization, RAG, or both: A case study with Amazon Nova

Model customization, RAG, or both: A case study with Amazon Nova

As businesses and developers increasingly seek to optimize their language models for specific tasks, the decision between model customization and Retrieval Augmented Generation (RAG) becomes critical. In this post, we seek to address this growing need by offering clear, actionable guidelines and best practices on when to use each approach, helping you make informed decisions that align with your unique requirements and objectives.

The introduction of Amazon Nova models represents a significant advancement in the field of AI, offering new opportunities for large language model (LLM) optimization. In this post, we demonstrate how to effectively perform model customization and RAG with Amazon Nova models as a baseline. We conducted a comprehensive comparison study between model customization and RAG using the latest Amazon Nova models, and share the resulting insights.

Approach and base model overview

In this section, we discuss the differences between a fine-tuning and RAG approach, present common use cases for each approach, and provide an overview of the base model used for experiments.

Demystifying RAG and model customization

RAG is a technique to enhance the capability of pre-trained models by allowing the model access to external domain-specific data sources. It combines two components: retrieval of external knowledge and generation of responses. It allows pre-trained language models to dynamically incorporate external data during the response-generation process, enabling more contextually accurate and updated outputs. Unlike fine-tuning, in RAG, the model doesn’t undergo any training and the model weights aren’t updated to learn the domain knowledge. Although fine-tuning implicitly uses domain-specific information by embedding the required knowledge directly into the model, RAG explicitly uses the domain-specific information through external retrieval.

Model customization refers to adapting a pre-trained language model to better fit specific tasks, domains, or datasets. Fine-tuning is one such technique, which helps in injecting task-specific or domain-specific knowledge for improving model performance. It adjusts the model’s parameters to better align with the nuances of the target task while using its general knowledge.

Common use cases for each approach

RAG is optimal for use cases requiring dynamic or frequently updated data (such as customer support FAQs and ecommerce catalogs), domain-specific insights (such as legal or medical Q&A), scalable solutions for broad applications (such as software as a service (SaaS) platforms), multimodal data retrieval (such as document summarization), and strict compliance with secure or sensitive data (such as financial and regulatory systems).

Conversely, fine-tuning thrives in scenarios demanding precise customization (such as personalized chatbots or creative writing), high accuracy for narrow tasks (such as code generation or specialized summarization), ultra-low latency (such as real-time customer interactions), stability with static datasets (such as domain-specific glossaries), and cost-efficient scaling for high-volume tasks (such as call center automation).

Although RAG excels at real-time grounding in external data and fine-tuning specializes in static, structured, and personalized workflows, choosing between them often depends on nuanced factors. This post offers a comprehensive comparison of RAG and fine-tuning, clarifying their strengths, limitations, and contexts where each approach delivers the best performance.

Introduction to Amazon Nova models

Amazon Nova is a new generation of foundation models (FMs) offering frontier intelligence and industry-leading price-performance. Amazon Nova Pro and Amazon Nova Lite are multimodal models excelling in accuracy and speed, with Amazon Nova Lite optimized for low-cost, fast processing. Amazon Nova Micro focuses on text tasks with ultra-low latency. They offer fast inference, support agentic workflows with Amazon Bedrock Knowledge Bases and RAG, and allow fine-tuning for text and multimodal data. Optimized for cost-effective performance, they are trained on data in over 200 languages.

Solution overview

To evaluate the effectiveness of RAG compared to model customization, we designed a comprehensive testing framework using a set of AWS-specific questions. Our study used Amazon Nova Micro and Amazon Nova Lite as baseline FMs and tested their performance across different configurations.

We structured our evaluation as follows:

  • Base model:
    • Used out-of-box Amazon Nova Micro and Amazon Nova Lite
    • Generated responses to AWS-specific questions without additional context
  • Base model with RAG:
    • Connected the base models to Amazon Bedrock Knowledge Bases
    • Provided access to relevant AWS documentation and blogs
  • Model customization:
    • Fine-tuned both Amazon Nova models using 1,000 AWS-specific question-answer pairs generated from the same set of AWS articles
    • Deployed the customized models through provisioned throughput
    • Generated responses to AWS-specific questions with fine-tuned models
  • Model customization and RAG combined approach:
    • Connected the fine-tuned models to Amazon Bedrock Knowledge Bases
    • Provided fine-tuned models access to relevant AWS articles at inference time

In the following sections, we walk through how to set up the second and third approaches (base model with RAG and model customization with fine-tuning) in Amazon Bedrock.

Prerequisites

To follow along with this post, you need the following prerequisites:

  • An AWS account and appropriate permissions
  • An Amazon Simple Storage Service (Amazon S3) bucket with two folders: one containing your training data, and one for your model output and training metrics

Implement RAG with the baseline Amazon Nova model

In this section, we walk through the steps to implement RAG with the baseline model. To do so, we create a knowledge base. Complete the following steps:

  1. On the Amazon Bedrock console, choose Knowledge Bases in the navigation pane.
  2. Under Knowledge Bases, choose Create.

kb_creation

  3. On the Configure data source page, provide the following information:
    1. Specify the Amazon S3 location of the documents.
    2. Specify a chunking strategy.
  4. Choose Next.

configure_kb

  5. On the Select embeddings model and configure vector store page, provide the following information:
    1. In the Embeddings model section, choose your embeddings model, which is used for embedding the chunks.
    2. In the Vector database section, create a new vector store or use an existing one where the embeddings will be stored for retrieval.
  6. Choose Next.

select_embedding_models

  7. On the Review and create page, review the settings and choose Create Knowledge Base.

kb_confirmation
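Once the knowledge base is created, you can exercise it outside the console with a RetrieveAndGenerate call from the AWS SDK; the knowledge base ID, model ARN, and question below are placeholders.

```python
import boto3

agent_runtime = boto3.client("bedrock-agent-runtime", region_name="us-east-1")

response = agent_runtime.retrieve_and_generate(
    input={"text": "How do I configure user name attributes for an Amazon Cognito user pool?"},
    retrieveAndGenerateConfiguration={
        "type": "KNOWLEDGE_BASE",
        "knowledgeBaseConfiguration": {
            "knowledgeBaseId": "KB1234567890",  # placeholder knowledge base ID
            "modelArn": "arn:aws:bedrock:us-east-1::foundation-model/amazon.nova-lite-v1:0",  # placeholder
        },
    },
)
print(response["output"]["text"])
```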

Fine-tune an Amazon Nova model using the Amazon Bedrock API

In this section, we provide detailed walkthroughs on fine-tuning and hosting customized Amazon Nova models using Amazon Bedrock. The following diagram illustrates the solution architecture.

ft_diagram 

Create a fine-tuning job

Fine-tuning Amazon Nova models through the Amazon Bedrock API is a streamlined process:

  1. On the Amazon Bedrock console, choose us-east-1 as your AWS Region.

At the time of writing, Amazon Nova model fine-tuning is exclusively available in us-east-1.

  2. Choose Custom models under Foundation models in the navigation pane.
  3. Under Customization methods, choose Create Fine-tuning job.

ft_job_creation

  4. For Source model, choose Select model.
  5. Choose Amazon as the provider and the Amazon Nova model of your choice.
  6. Choose Apply.

ft_model_selection

  7. For Fine-tuned model name, enter a unique name for the fine-tuned model.
  8. For Job name, enter a name for the fine-tuning job.
  9. Under Input data, enter the location of the source S3 bucket (training data) and target S3 bucket (model outputs and training metrics), and optionally the location of your validation dataset.

ft_input_data

Configure hyperparameters

For Amazon Nova models, the following hyperparameters can be customized:

| Parameter | Range/Constraints |
|---|---|
| Epochs | 1–5 |
| Batch Size | Fixed at 1 |
| Learning Rate | 0.000001–0.0001 |
| Learning Rate Warmup Steps | 0–100 |

Prepare the dataset for compatibility with Amazon Nova models

Similar to other LLMs, Amazon Nova requires prompt-completion pairs, also known as question and answer (Q&A) pairs, for supervised fine-tuning (SFT). This dataset should contain the ideal outputs you want the language model to produce for specific tasks or prompts. Refer to Guidelines for preparing your data for Amazon Nova for best practices and example formats when preparing datasets for fine-tuning Amazon Nova models.

Examine fine-tuning job status and training artifacts

After you create your fine-tuning job, choose Custom models under Foundation models in the navigation pane. You will find the current fine-tuning job listed under Jobs. You can use this page to monitor your fine-tuning job status.

examine_ft_status

When your fine-tuning job status changes to Complete, you can choose the job name and navigate to the Training job overview page. You will find the following information:

  • Training job specifications
  • Amazon S3 location for input data used for fine-tuning
  • Hyperparameters used during fine-tuning
  • Amazon S3 location for training output

ft_job_overview

Host the fine-tuned model with provisioned throughput

After your fine-tuning job completes successfully, you can access your customized model through the following steps:

  1. On the Amazon Bedrock console, choose Custom models under Foundation models in the navigation pane.
  2. Under Models, choose your custom model.

select_custom_models

The model details page shows the following information:

  • Fine-tuned model details
  • Amazon S3 location for input data used for fine-tuning
  • Hyperparameters used during fine-tuning
  • Amazon S3 location for training output

ft_output

  3. To make your fine-tuned model available for inference, choose Purchase provisioned throughput.
  4. Choose a commitment term (no commitment, 1 month, or 6 months) and review the associated cost for hosting the fine-tuned models.

After the customized model is hosted through provisioned throughput, a model ID will be assigned and can be used for inference.

The aforementioned fine-tuning and inference steps can also be done programmatically. For more information, refer to the following GitHub repo, which contains sample code.
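As a rough, assumption-heavy illustration of that programmatic path (role ARN, S3 URIs, model identifiers, and hyperparameter names are placeholders to verify against the sample code and the Amazon Bedrock API reference):

```python
import boto3

bedrock = boto3.client("bedrock", region_name="us-east-1")  # control-plane client

# 1. Create the fine-tuning (model customization) job.
job = bedrock.create_model_customization_job(
    jobName="nova-micro-aws-qa-ft",
    customModelName="nova-micro-aws-qa",
    roleArn="arn:aws:iam::111122223333:role/BedrockCustomizationRole",  # placeholder role
    baseModelIdentifier="amazon.nova-micro-v1:0",                       # placeholder base model ID
    customizationType="FINE_TUNING",
    trainingDataConfig={"s3Uri": "s3://my-bucket/train/train.jsonl"},
    outputDataConfig={"s3Uri": "s3://my-bucket/output/"},
    hyperParameters={"epochCount": "2", "learningRate": "0.00005"},     # names vary by model family
)

# 2. After the job completes, buy Provisioned Throughput so the custom model can serve inference.
pt = bedrock.create_provisioned_model_throughput(
    provisionedModelName="nova-micro-aws-qa-pt",
    modelId="nova-micro-aws-qa",   # the custom model name or ARN from the finished job
    modelUnits=1,
)

# 3. Invoke the hosted custom model through the runtime client using the provisioned model ARN.
runtime = boto3.client("bedrock-runtime", region_name="us-east-1")
response = runtime.converse(
    modelId=pt["provisionedModelArn"],
    messages=[{"role": "user", "content": [{"text": "What is Amazon S3 Intelligent-Tiering?"}]}],
)
print(response["output"]["message"]["content"][0]["text"])
```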

Evaluation framework and results

In this section, we first introduce our multi-LLM-judge evaluation framework, which is set up to mitigate an individual LLM judge’s bias. We then compare RAG vs. fine-tuning results in terms of response quality as well as latency and token implications.

Multiple LLMs as judges to mitigate bias

The following diagram illustrates our workflow using multiple LLMs as judges.

multi-llm-judge

Using LLMs as judges has become an increasingly popular approach to evaluate tasks that are challenging to assess through traditional methods or human evaluation. For our evaluation framework, we constructed 10 domain-specific test questions covering key aspects of AWS services and features, designed to test both factual accuracy and depth of understanding. Each model-generated response was evaluated using a standardized scoring system on a scale of 0–10, where 0–3 indicates incorrect or misleading information, 4–6 represents partially correct but incomplete answers, 7–8 signifies mostly correct with minor inaccuracies, and 9–10 denotes completely accurate with comprehensive explanation.

We use the following LLM judge evaluation prompt:

{
    "system_prompt": "You are a helpful assistant.",
    "prompt_template": "[Instruction] Please act as an impartial judge and evaluate the quality of the response provided by an AI assistant to the user question displayed below. Your evaluation should consider factors such as the helpfulness, relevance, accuracy, depth, creativity, and level of detail of the response. Begin your evaluation by providing a short explanation. Be as objective as possible. After providing your explanation, you must rate the response on a scale of 1 to 10 by strictly following this format: \"[[rating]]\", for example: \"Rating: [[5]]\".\n\n[Question]\n{question}\n\n[The Start of Assistant's Answer]\n{answer}\n[The End of Assistant's Answer]",
    "description": "Prompt for general questions",
    "category": "general",
    "output_format": "[[rating]]"
}

We use the following sample evaluation question and ground truth:


{
    "question_id": 9161,
    "category": "AWS",
    "turns": [
        " "What specific details are collected and sent to AWS when anonymous operational metrics are enabled for an Amazon EFS file system?",
        "What's required for a successful AWS CloudFormation launch?"
    ],
    "reference": [
        "When anonymous operational metrics are enabled for an Amazon EFS file system, the following specific details are collected and sent to AWS: Solution ID, Unique ID, Timestamp, Backup ID, Backup Start Time, Backup Stop Time, Backup Window, Source EFS Size, Destination EFS Size, Instance Type, Retain, S3 Bucket Size, Source Burst Credit Balance, Source Burst Credit Balance Post Backup, Source Performance Mode, Destination Performance Mode, Number of Files, Number of Files Transferred, Total File Size, Total Transferred File Size, Region, Create Hard Links Start Time, Create Hard Links Stop Time, Remove Snapshot Start Time, Remove Snapshot Stop Time, Rsync Delete Start Time, Rsync Delete Stop Time.",
        "For a successful AWS CloudFormation launch, you need to sign in to the AWS Management Console, choose the correct AWS Region, use the button to launch the template, verify the correct template URL, assign a name to your solution stack, review and modify the parameters as necessary, review and confirm the settings, check the boxes acknowledging that the template creates AWS Identity and Access Management resources and may require an AWS CloudFormation capability, and choose Create stack to deploy the stack. You should receive a CREATE_COMPLETE status in approximately 15 minutes."
    ]
}

To mitigate potential intrinsic biases among different LLM judges, we adopted two LLM judges to evaluate the model-generated responses: Anthropic’s Claude 3.5 Sonnet and Meta’s Llama 3.1 70B. Each judge was provided with the original test question, the model-generated response, and specific scoring criteria focusing on factual accuracy, completeness, relevance, and clarity. Overall, we observed a high level of rank correlation among LLM judges in assessing different approaches, with consistent evaluation patterns across all test cases.
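Putting these pieces together, each judge call can be sketched as follows: fill the prompt template shown earlier with the test question and a model's answer, send it to a judge model on Amazon Bedrock, and parse the numeric score from the [[rating]] pattern. The model IDs and the abbreviated template below are placeholders.

```python
import re
import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

# Abbreviated version of the prompt template shown above; use the full template in practice.
PROMPT_TEMPLATE = (
    "[Instruction] Please act as an impartial judge and evaluate the quality of the response "
    "provided by an AI assistant to the user question displayed below. [...] you must rate the "
    'response on a scale of 1 to 10 by strictly following this format: "[[rating]]".\n\n'
    "[Question]\n{question}\n\n[The Start of Assistant's Answer]\n{answer}\n[The End of Assistant's Answer]"
)

def judge_score(question: str, answer: str, judge_model_id: str) -> int:
    """Return the 1-10 rating a judge model assigns to an answer (illustrative sketch)."""
    response = bedrock.converse(
        modelId=judge_model_id,
        system=[{"text": "You are a helpful assistant."}],
        messages=[{"role": "user", "content": [{"text": PROMPT_TEMPLATE.format(question=question, answer=answer)}]}],
        inferenceConfig={"maxTokens": 512, "temperature": 0.0},
    )
    text = response["output"]["message"]["content"][0]["text"]
    match = re.search(r"\[\[(\d+)\]\]", text)
    return int(match.group(1)) if match else 0

# Average scores from two different judge models to reduce single-judge bias (placeholder model IDs).
judges = ["anthropic.claude-3-5-sonnet-20240620-v1:0", "meta.llama3-1-70b-instruct-v1:0"]
scores = [judge_score("What is Amazon EFS?", "Amazon EFS is a serverless, elastic file system.", j) for j in judges]
average_score = sum(scores) / len(scores)
```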

Response quality comparison

Both fine-tuning and RAG significantly improve the quality of generated responses on AWS-specific questions over the base model. Using Amazon Nova Lite as the base model, we observed that both fine-tuning and RAG improved the average LLM judge score on response quality by 30%, whereas combining fine-tuning with RAG enhanced the response quality by a total of 83%, as shown in the following figure.

nova_lite

Notably, our evaluation revealed an interesting finding (as shown in the following figure): when combining fine-tuning and RAG approaches, smaller models like Amazon Nova Micro showed significant performance improvements in domain-specific tasks, nearly matching the performance of bigger models. This suggests that for specialized use cases with well-defined scope, using smaller models with both fine-tuning and RAG could be a more cost-effective solution compared to deploying larger models.

nova_micro_lite

Latency and token implications

In addition to enhancing the response quality, both fine-tuning and RAG help reduce the response generation latency compared to the base model. For both Amazon Nova Micro and Amazon Nova Lite, fine-tuning reduced the base model latency by approximately 50%, whereas RAG reduced it by about 30%, as shown in the following figure.

latency

Fine-tuning also presented the unique advantage of improving the tone and style of the generated answers to align more closely with the training data. In our experiments, the average total tokens (input and output tokens) dropped by more than 60% with both fine-tuned models. However, the average total tokens more than doubled with the RAG approach due to passing of context, as shown in the following figure. This finding suggests that for latency-sensitive use cases or when the objective is to align the model’s responses to a specific tone, style, or brand voice, model customization might offer more business value.

tokens

Conclusion

In this post, we compared model customization (fine-tuning) and RAG for domain-specific tasks with Amazon Nova. We first provided a detailed walkthrough on how to fine-tune, host, and conduct inference with customized Amazon Nova through the Amazon Bedrock API. We then adopted an LLM-as-a-judge approach to evaluate response quality from different approaches. In addition, we examined the latency and token implications of different setups.

Both fine-tuning and RAG improved the model performance. Depending on the task and evaluation criteria, model customization showed similar, or sometimes better, performance compared to RAG. Model customization can also help improve the style and tone of a generated answer. In this experiment, the customized model’s responses followed the succinct answer style of the given training data, which resulted in lower latency compared to the baseline counterpart. Additionally, model customization can be used for many use cases where RAG isn’t as straightforward to apply, such as tool calling, sentiment analysis, entity extraction, and more. Overall, we recommend combining model customization and RAG for question answering or similar tasks to maximize performance.

For more information on Amazon Bedrock and the latest Amazon Nova models, refer to the Amazon Bedrock User Guide and Amazon Nova User Guide. The AWS Generative AI Innovation Center has a group of AWS science and strategy experts with comprehensive expertise spanning the generative AI journey, helping customers prioritize use cases, build a roadmap, and move solutions into production. Check out the Generative AI Innovation Center for our latest work and customer success stories.


About the Authors

Mengdie (Flora) Wang is a Data Scientist at AWS Generative AI Innovation Center, where she works with customers to architect and implement scalable Generative AI solutions that address their unique business challenges. She specializes in model customization techniques and agent-based AI systems, helping organizations harness the full potential of generative AI technology. Prior to AWS, Flora earned her Master’s degree in Computer Science from the University of Minnesota, where she developed her expertise in machine learning and artificial intelligence.

Sungmin Hong is a Senior Applied Scientist at Amazon Generative AI Innovation Center where he helps expedite the variety of use cases of AWS customers. Before joining Amazon, Sungmin was a postdoctoral research fellow at Harvard Medical School. He holds Ph.D. in Computer Science from New York University. Outside of work, he prides himself on keeping his indoor plants alive for 3+ years.

Jae Oh Woo is a Senior Applied Scientist at the AWS Generative AI Innovation Center, where he specializes in developing custom solutions and model customization for a diverse range of use cases. He has a strong passion for interdisciplinary research that connects theoretical foundations with practical applications in the rapidly evolving field of generative AI. Prior to joining Amazon, Jae Oh was a Simons Postdoctoral Fellow at the University of Texas at Austin, where he conducted research across the Mathematics and Electrical and Computer Engineering departments. He holds a Ph.D. in Applied Mathematics from Yale University.

Rahul Ghosh is an Applied Scientist at Amazon’s Generative AI Innovation Center, where he works with AWS customers across different verticals to expedite their use of Generative AI. Rahul holds a Ph.D. in Computer Science from the University of Minnesota.

Baishali Chaudhury is an Applied Scientist at the Generative AI Innovation Center at AWS, where she focuses on advancing Generative AI solutions for real-world applications. She has a strong background in computer vision, machine learning, and AI for healthcare. Baishali holds a PhD in Computer Science from the University of South Florida and completed a postdoc at Moffitt Cancer Centre.

Anila Joshi has more than a decade of experience building AI solutions. As an AWSI Geo Leader at the AWS Generative AI Innovation Center, Anila pioneers innovative applications of AI that push the boundaries of possibility and accelerates the adoption of AWS services by helping customers ideate, identify, and implement secure generative AI solutions.

Read More

Generate user-personalized communication with Amazon Personalize and Amazon Bedrock

Generate user-personalized communication with Amazon Personalize and Amazon Bedrock

Today, businesses are using AI and generative models to improve productivity in their teams and provide better experiences to their customers. Personalized outbound communication can be a powerful tool to increase user engagement and conversion.

For instance, as a marketing manager for a video-on-demand company, you might want to send personalized email messages tailored to each individual user—taking into account their demographic information, such as gender and age, and their viewing preferences. You want the messaging and movie recommendations to be both engaging and applicable to the customer. To achieve this, you can use Amazon Personalize to generate user-personalized recommendations and Amazon Bedrock to generate the text of the email.

Amazon Personalize enables your business to improve customer engagement by creating personalized product and content recommendations in websites, applications, and targeted marketing campaigns. You can get started without any prior machine learning (ML) experience, and Amazon Personalize allows you to use APIs to build sophisticated personalization capabilities. With this service, your data is encrypted so it stays private and secure, and it is used only to create recommendations for your users.

Amazon Bedrock is a fully managed service that offers a choice of high-performing foundation models (FMs) from leading AI companies and Amazon through a single API, along with a broad set of capabilities to build generative AI applications with security, privacy, and responsible AI. Using Amazon Bedrock, you can experiment with and evaluate top FMs for your use case, customize the model using fine-tuning, ground the model output in your data using Retrieval Augmented Generation (RAG), and build agents that execute tasks using your enterprise systems and data sources.

In this post, we demonstrate how to use Amazon Personalize and Amazon Bedrock to generate personalized outreach emails for individual users using a video-on-demand use case. This concept can be applied to other domains, such as compelling customer experiences for ecommerce and digital marketing use cases.

Solution overview

The following diagram shows how you can use Amazon Personalize and Amazon Bedrock to generate user-personalized outreach messages for each user.

Workflow Diagram: 1. Import your user, item, and interaction data into Amazon Personalize. 2. Train an Amazon Personalize “Top picks for you” recommender. 3. Get the top recommended movies for each user. 4. Use a prompt template, the recommended movies, and the user demographics to generate the model prompt. 5. Use Amazon Bedrock LLMs to generate personalized outbound communication with the prompt. 6. Share the personalized outbound communication with each of your users.

The workflow consists of the following steps:

  1. Import your user, item, and interaction data into Amazon Personalize. The user and item datasets are not required for Amazon Personalize to generate recommendations, but good item and user metadata yields the best results from your trained models.
  2. Train an Amazon Personalize “Top picks for you” recommender. Amazon Personalize recommenders are domain-specific resources that generate recommendations. When you create an Amazon Personalize recommender, Amazon Personalize trains the models backing the recommender with the best configurations for the use case. In our example, we use the “Top picks for you” recommender. This recommender generates personalized content recommendations for a user that you specify. With this use case, Amazon Personalize automatically filters videos the user watched.
  3. After the model is trained, you can get the top recommended movies for each user by querying the recommender with each user ID through the Amazon Personalize Runtime API.
  4. Combine a predefined prompt template with the top recommendations and user demographic information to generate an enhanced prompt.
  5. Use the enhanced prompt in Amazon Bedrock through its API to generate your personalized outbound communication.
  6. Amazon Bedrock returns the personalized outbound communication that you can email to your users.

We go deeper into each of these steps in the following sections. A code sample for this use case is available on AWS Samples on GitHub.

Prerequisites

To generate personalized recommendations, you must first set up Amazon Personalize resources. You start by creating your dataset group, loading your data, and then training a recommender. For full instructions, see Getting started tutorials.

    1. Create a dataset group.
    2. Create an Interactions dataset using the following schema:
      {
          "type": "record"
          "name": "Interactions",
          "namespace": "com.amazonaws.personalize.schema",
          "fields": [
              {
                  "name": "USER_ID",
                  "type": "string"
              },
              {
                  "name": "ITEM_ID",
                  "type": "string"
              },
              {
                  "name": "TIMESTAMP",
                  "type": "long"
              },
              {
                  "name": "EVENT_TYPE",
                  "type": "string"
              }
          ],
          "version": "1.0"
      }

      Interaction data consists of information about the user interactions with the content in your application. This usually comes from analytics tools or a customer data platform (CDP). The best interaction data to use in Amazon Personalize includes the sequential order of user behavior and the content the user watched or clicked on. For this example, we use the ml-latest-small dataset from the MovieLens dataset to simulate user-item interactions.

    3. Import the interaction data to Amazon Personalize from Amazon Simple Storage Service (Amazon S3). For this example, we convert the data to the appropriate format following the steps in the notebook 01_Introduction_and_Data_Preparation. A minimal conversion sketch also appears after this list.
    4. Item data consists of information about the content that is being interacted with, which generally comes from a content management system (CMS) in video-on-demand use cases. This can be information like the title, description, or movie genre. To provide additional metadata and a consistent experience for our users, we use a subset of the IMDb Essential Metadata for Movies/TV/OTT dataset. IMDb has multiple datasets available in AWS Data Exchange; for this post, we have extracted and prepared a subset of the IMDb Essential Metadata for Movies/TV/OTT (Bulk data) dataset. With this data, create an Items dataset using the following schema:
      items_schema = {
          "type": "record",
          "name": "Items",
          "namespace": "com.amazonaws.personalize.schema",
          "fields": [
              {
                  "name": "ITEM_ID",
                  "type": "string"
              },
              {
                  "name": "TITLE",
                  "type": "string"
              },
              {
                  "name": "YEAR",
                  "type": "int"
              },
              {
                  "name": "IMDB_RATING",
                  "type": "int"
              },
              {
                  "name": "IMDB_NUMBEROFVOTES",
                  "type": "int"
              },
              {
                  "name": "PLOT",
                  "type": "string",
                  "textual": True
              },
              {
                  "name": "US_MATURITY_RATING_STRING",
                  "type": "string"
              },
              {
                  "name": "US_MATURITY_RATING",
                  "type": "int"
              },
              {
                  "name": "GENRES",
                  "type": "string",
                  "categorical": True
              },
              {
                  "name": "CREATION_TIMESTAMP",
                  "type": "long"
              },
              {
                  "name": "PROMOTION",
                  "type": "string"
              }
          ],
          "version": "1.0
      }

    5. Import the item data to Amazon Personalize from Amazon S3. For this example, we convert the data to the appropriate format following the steps in the notebook 01_Introduction_and_Data_Preparation.
      For more information on formatting and importing your interactions and items data from Amazon S3, see Importing bulk records.
    6. Create a recommender. In this example, we create a “Top picks for you” recommender.
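
Before moving on, the following is a minimal sketch of what the interaction data conversion can look like, assuming the MovieLens ml-latest-small ratings.csv file is available locally. The file paths and the "watch" event type are illustrative assumptions; the notebook 01_Introduction_and_Data_Preparation remains the reference implementation.

import pandas as pd

# Load the MovieLens ml-latest-small ratings file (the local path is an assumption for this sketch)
ratings = pd.read_csv("ml-latest-small/ratings.csv")  # columns: userId, movieId, rating, timestamp

# Map the MovieLens columns onto the Interactions schema defined earlier
interactions = pd.DataFrame({
    "USER_ID": ratings["userId"].astype(str),
    "ITEM_ID": ratings["movieId"].astype(str),
    "TIMESTAMP": ratings["timestamp"].astype(int),  # MovieLens timestamps are already Unix epoch seconds
    "EVENT_TYPE": "watch",  # illustrative event type
})

# Write the CSV that you upload to Amazon S3 for the dataset import job
interactions.to_csv("interactions.csv", index=False)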

Get personalized recommendations using Amazon Personalize

Now that we have trained the “Top picks for you” recommender, we can generate recommendations for our users. For more details and ways to use Amazon Personalize to get recommendations, see Getting recommendations from Amazon Personalize. We include the item metadata in the response so we can use this information in our outbound communication in the next step. You can use the following code to get recommended movies for each user:

get_recommendations_response = personalize_runtime.get_recommendations(
    recommenderArn = workshop_recommender_top_picks_arn,
    userId = str(user_id),
    numResults = number_of_movies_to_recommend,
    metadataColumns = {
        "ITEMS": [
            'TITLE', 'PLOT', 'GENRES']
        }
)

In the items dataset, we can specify the metadata columns we want the recommender to return. In this case, we request the Title, Plot, and Genres of the recommended movies. You can request metadata columns only if this feature was enabled when the recommender was created.
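
To reuse these recommendations in the prompt later in this post, you can collect the returned metadata into the movie_list structure shown in the prompt examples. The following is a minimal sketch; the lowercase metadata keys are an assumption, so inspect the actual response in your account and adjust the key names if needed.

movie_list = []
for item in get_recommendations_response["itemList"]:
    metadata = item.get("metadata", {})
    movie_list.append({
        "title": metadata.get("title"),   # key casing is an assumption; confirm against your response
        "genres": metadata.get("genres"),
        "plot": metadata.get("plot"),
    })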

For an example user_Id, the following movies are recommended:

Title: There's Something About Mary
Genres: Comedy and Romance
Plot: A man gets a chance to meet up with his dream girl from high school, even though his date with her back then was a complete disaster.

Title: Shakespeare in Love
Genres: Comedy and Drama and History and Romance
Plot: The world's greatest ever playwright, William Shakespeare, is young, out of ideas and short of cash, but meets his ideal woman and is inspired to write one of his most famous plays.

Title: The Birdcage
Genres: Comedy
Plot: A gay cabaret owner and his drag queen companion agree to put up a false straight front so that their son can introduce them to his fiancée's right-wing moralistic parents.

Get the user’s favorite movie genre

To provide a better personalized outbound communication experience, we determine the user’s favorite movie genre based on the genres of all the movies they have interacted with in the past. There are a number of ways to do this, such as counting the number of interactions per genre for our user. In this example, our sample user’s favorite genre is Comedy.
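
As an illustration, the following minimal sketch counts interactions per genre. It assumes the interactions and items data are loaded into DataFrames named interactions_df and items_df with the column names from the schemas above, and that multiple genres are pipe-delimited in the prepared data; the DataFrame names and the delimiter are assumptions for this sketch.

from collections import Counter

# interactions_df and items_df are assumed to hold the prepared interactions and items data
user_items = interactions_df.loc[interactions_df["USER_ID"] == str(user_id), "ITEM_ID"]
genres = items_df.loc[items_df["ITEM_ID"].isin(user_items), "GENRES"]

# Count individual genres; splitting on "|" assumes pipe-delimited categorical values
genre_counts = Counter(g for entry in genres.dropna() for g in entry.split("|"))
favorite_genre = genre_counts.most_common(1)[0][0]
print(favorite_genre)  # "Comedy" for our sample user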

Generate personalized marketing emails with recommended movies

To generate personalized marketing emails, we use Amazon Bedrock. Amazon Bedrock users must request access to models before they are available for use. Amazon Bedrock is a fully managed service that makes base models from Amazon and third-party model providers accessible through an API.

To request access, choose Model access in the navigation pane of the Amazon Bedrock console. For more information, see Access Amazon Bedrock foundation models.

In this example, we use Anthropic’s Claude 3.7 Sonnet on Amazon Bedrock and have defined the following configuration parameters:

# The LLM we will be using
model_id = 'us.anthropic.claude-3-7-sonnet-20250219-v1:0'

# The maximum number of tokens to use in the generated response
max_tokens_to_sample = 1000

Let’s generate a simple outreach email using the recommended movies and the following prompt template:

prompt_template = f'''Write a marketing email advertising several movies available in a video-on-demand streaming platform next week, given the movie and user information below. The movies to recommend and their information is contained in the <movie> tag. Put the email between <email> tags.

<movie>
{movie_list}
</movie>

Assistant: Email body:
<email>
'''

Using the recommended movies, the full prompt is as follows:

"Write a marketing email advertising several movies available in a video-on-demand streaming platform next week, given the movie and user information below. The movies to recommend and their information is contained in the <movie> tag. Put the email between <email> tags.
n
n
<movie>
n
[
{
'title': "There's Something About Mary",
'genres': 'Comedy and Romance',
'plot': 'A man gets a chance to meet up with his dream girl from high school, even though his date with her back then was a complete disaster.'
},
{
'title': 'Shakespeare in Love',
'genres': 'Comedy and Drama and History and Romance',
'plot': "The world's greatest ever playwright, William Shakespeare, is young, out of ideas and short of cash, but meets his ideal woman and is inspired to write one of his most famous plays."
},
{
'title': 'The Birdcage',
'genres': 'Comedy',
'plot': "A gay cabaret owner and his drag queen companion agree to put up a false straight front so that their son can introduce them to his fiancu00e9e's right-wing moralistic parents."
}
]
n
</movie>
n
n
Assistant: Email body:
n
<email>.
"

We then use an Amazon Bedrock API call to generate the personalized email. For more information, see Amazon Bedrock API Reference.

import json

request_body = json.dumps({
    "max_tokens": max_tokens_to_sample,
    "messages": [{"role": "user", "content": prompt}],
    "anthropic_version": "bedrock-2023-05-31"
})

personalized_email_response = bedrock_client.invoke_model(
    body = request_body,
    modelId = model_id  # the model identifier defined earlier
)
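
To extract the generated email from the response, parse the returned body. The following sketch assumes the Anthropic Messages response format that Claude models return through invoke_model; verify the exact structure in the Amazon Bedrock API Reference for your model version.

# The response body is a stream; read and decode it, then pull out the generated text
response_body = json.loads(personalized_email_response["body"].read())
personalized_email = response_body["content"][0]["text"]
print(personalized_email)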

Amazon Bedrock returns a personalized email for the user:

Subject: Your Weekend Movie Escape Awaits! Three Laugh-Out-Loud Comedies Coming Next Week

Hi there,

Need a break from reality? We’ve got you covered with three fantastic comedies hitting our streaming platform next week!

## This Week’s Spotlight: Comedy Gems That Will Make Your Day

**There’s Something About Mary**
This hilarious romantic comedy follows a man who finally gets a second chance with his high school dream girl—after their first date went hilariously wrong. With unforgettable laughs and heartwarming moments, it’s the perfect weekend watch!

**Shakespeare in Love**
When the greatest playwright of all time faces writer’s block and money troubles, an unexpected romance changes everything! This award-winning comedy-drama blends history, romance, and witty humor as Shakespeare finds his muse and creates one of his most beloved plays. A delightful escape for literature lovers and romantics alike!

**The Birdcage**
Prepare for non-stop laughter in this comedy classic! When a gay cabaret owner and his drag queen partner pretend to be straight to impress their future in-laws (who happen to be ultra-conservative), chaos and hilarity ensue. A perfect blend of humor and heart that still resonates today.

So grab your popcorn, get comfortable on the couch, and enjoy these comedy classics starting next week!

Happy streaming!

The Movies-On-Demand Team

P.S. Don’t forget to check out our complete catalog for more great films in every genre!

Although this is already a good outreach email because the recommendations are personalized to the user, we can personalize it further by adding more information about the user.

Generate personalized communication with recommended movies, user demographic information, and favorite genre

We will generate emails by assuming two different demographics for the users as well as their favorite genre.

The version of the ml-latest-small dataset from the MovieLens dataset we used in this example doesn’t contain demographic data; therefore, we will try out multiple options. In a real-world scenario, you might know the demographics of your audience.

To experiment, let’s use the following example demographic:

# Sample user demographics
user_demographic_1 = f'The user is a 50 year old adult called Otto.'

We also add the user’s favorite genre to the prompt as follows:

prompt_template = f'''You are a skilled publicist. Write a high-converting marketing email advertising several movies available in a video-on-demand streaming platform next week,
given the movie and user information below. Do not add additional information. Your email will leverage the power of storytelling and persuasive language.
You want the email to impress the user, so make it appealing to them based on the information contained in the <user> tags,
and take into account the user's favorite genre in the <genre> tags.
The movies to recommend and their information is contained in the <movie> tag.
All movies in the <movie> tag must be recommended. Give a summary of the movies and why the human should watch them.
Put the email between <email> tags.

<user>
{user_demographic}
</user>

<genre>
{favorite_genre}
</genre>

<movie>
{movie_list}
</movie>

Assistant:

<email>
'''

After adding the information, the new prompt is as follows:

"You are a skilled publicist. Write a high-converting marketing email advertising several movies available in a video-on-demand streaming platform next week, given the movie and user information below. Do not add additional information. Your email will leverage the power of storytelling and persuasive language. You want the email to impress the user, so make it appealing to them based on the information contained in the <user> tags, and take into account the user's favorite genre in the <genre> tags. The movies to recommend and their information is contained in the <movie> tag. All movies in the <movie> tag must be recommended. Give a summary of the movies and why the human should watch them. Put the email between <email> tags.
n
n
<user>
n
The user is a 50 year old adult called Otto.
n
</user>
n
n
<genre>
n
Comedy
n
</genre>
n
n
<movie>
n
[
{
'title': "There's Something About Mary",
'genres': 'Comedy and Romance',
'plot': 'A man gets a chance to meet up with his dream girl from high school, even though his date with her back then was a complete disaster.'
},
{
'title': 'Shakespeare in Love',
'genres': 'Comedy and Drama and History and Romance',
'plot': "The world's greatest ever playwright, William Shakespeare, is young, out of ideas and short of cash, but meets his ideal woman and is inspired to write one of his most famous plays."
},
{
'title': 'The Birdcage',
'genres': 'Comedy',
'plot': "A gay cabaret owner and his drag queen companion agree to put up a false straight front so that their son can introduce them to his fiancu00e9e's right-wing moralistic parents."
}
]
n
</movie>
n
n
Assistant:
n
<email>
n    "

Amazon Bedrock returns a personalized email for the user:

Subject: Otto, Get Ready for a Comedy Extravaganza on Your Screen Next Week!

Dear Otto,

We’re thrilled to bring you an exclusive lineup of comedy classics hitting our streaming platform next week! As someone who appreciates a good laugh, you’re in for a treat with these award-winning comedies that will brighten your evenings.

## “There’s Something About Mary”
This hilarious romantic comedy follows the misadventures of a man who finally gets a second chance with his high school dream girl. After a first date that was nothing short of catastrophic, he’s determined to make things right years later. With its perfect blend of outrageous humor and heartwarming moments, this comedy classic delivers laughs that have stood the test of time.

## “Shakespeare in Love”
Experience the witty and charming story of a young, broke William Shakespeare who finds his muse in the most unexpected place. This brilliant comedy-drama offers a fictional account of how the greatest playwright found inspiration through love. With its clever dialogue, historical setting, and romantic storyline, this Academy Award-winning film combines your love of comedy with rich storytelling that will keep you engaged from beginning to end.

## “The Birdcage”
A comedy masterpiece that delivers non-stop laughs! When a gay cabaret owner and his flamboyant partner must pretend to be straight to impress their future in-laws (who happen to be ultra-conservative), chaos ensues. The brilliant performances and hilarious situations make this one of the most beloved comedies of its era. It’s the perfect film for when you need genuine belly laughs and brilliant comedic timing.

Otto, these comedies are among the best in their genre and will be available for your enjoyment starting next week. Whether you’re in the mood for slapstick humor, clever wit, or situational comedy, this collection has something perfect for your evening entertainment.

Grab your favorite snack, get comfortable on the couch, and prepare for an unforgettable comedy marathon!

Happy streaming!

The VOD Team

The email now contains information about the user’s favorite genre and is personalized to the user using their name and the recommended movies the user is most likely to be interested in.

Clean up

Make sure you clean up any unused resources you created in your account while following the steps outlined in this post. You can delete filters, recommenders, datasets, and dataset groups using the AWS Management Console or the Python SDK.
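
If you prefer to script the cleanup with the Python SDK, a minimal sketch could look like the following. The ARNs are placeholders, each delete call is asynchronous, and child resources (recommenders and datasets) must finish deleting before their dataset group can be removed.

import boto3

personalize = boto3.client("personalize")

# The ARNs below are placeholders for the resources created in this post
personalize.delete_recommender(recommenderArn="arn:aws:personalize:<region>:<account>:recommender/<name>")
personalize.delete_dataset(datasetArn="arn:aws:personalize:<region>:<account>:dataset/<group>/INTERACTIONS")
personalize.delete_dataset(datasetArn="arn:aws:personalize:<region>:<account>:dataset/<group>/ITEMS")
personalize.delete_dataset_group(datasetGroupArn="arn:aws:personalize:<region>:<account>:dataset-group/<group>")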

Conclusion

Traditional AI and generative AI allow you to build hyper-personalized experiences for your users. In this post, we showed how to generate personalized outbound communication by getting personalized recommendations for each user using Amazon Personalize and then using user preferences and demographic information to write a personalized email communication using Amazon Bedrock. By using AWS managed services, such as Amazon Personalize and Amazon Bedrock, you can create this content with only a few API calls—no ML experience required.

For more information about Amazon Personalize, see the Amazon Personalize Developer Guide. For more information on working with generative AI on AWS, see Announcing New Tools for Building with Generative AI on AWS.


About the Author

Anna Grüebler Clark is a Specialist Solutions Architect at AWS focusing on artificial intelligence. She has more than 16 years of experience helping customers develop and deploy machine learning applications. Her passion is taking new technologies and putting them in the hands of everyone, and solving difficult problems by using the advantages of traditional and generative AI in the cloud.

Read More

Automating regulatory compliance: A multi-agent solution using Amazon Bedrock and CrewAI

Automating regulatory compliance: A multi-agent solution using Amazon Bedrock and CrewAI

Financial institutions today face an increasingly complex regulatory world that demands robust, efficient compliance mechanisms. Although organizations traditionally invest countless hours reviewing regulations such as the Anti-Money Laundering (AML) rules and the Bank Secrecy Act (BSA), modern AI solutions offer a transformative approach to this challenge. By using Amazon Bedrock Knowledge Bases alongside CrewAI, an open source multi-agent orchestration framework, organizations can now deploy intelligent systems where multiple AI agents work together to automate and streamline specific compliance processes. This powerful combination enables financial institutions to move from manual, time-intensive compliance reviews to a streamlined, assisted compliance management approach that adapts to evolving regulatory requirements.

In this post, we explore how AI agents can streamline compliance and fulfill regulatory requirements for financial institutions using Amazon Bedrock and CrewAI. We demonstrate how to build a multi-agent system that can automatically summarize new regulations, assess their impact on operations, and provide prescriptive technical guidance. You’ll learn how to use Amazon Bedrock Knowledge Bases and Amazon Bedrock Agents with CrewAI to create a comprehensive, automated compliance solution.

This solution’s architecture can be adapted to help healthcare systems, enable manufacturers to maintain ISO safety documentation, and assist retailers in monitoring Federal Trade Commission (FTC) advertising regulations. It can also assist other segments, such as legal, finance, or human resources, offering wide-ranging potential for process automation and efficiency gains across various industries. The code used for this post is available on GitHub.

Solution overview

Traditional large language model (LLM) applications excel at following predefined instructions, but solving complex challenges such as compliance automation requires an autonomous network of specialized agents that mirror the structure of a comprehensive compliance department. Our system employs three key agents:

  1. Compliance analyst agent that continuously monitors and analyzes regulatory changes
  2. Compliance specialist agent that transforms requirements into organizational policies
  3. Enterprise architect agent that designs and implements the necessary security controls

In this multi-agent approach, specialized AI agents work together seamlessly to streamline the compliance lifecycle. The compliance analyst agent collects the latest regulatory changes and helps the organization stay ahead of them and their potential impact, while the compliance specialist agent translates these regulatory requirements into actionable organizational procedures. Meanwhile, the enterprise architect agent makes sure that the technical controls align with organizational controls. CrewAI provides an open source framework to orchestrate this collaborative system, enabling these agents to work in concert while maintaining clear handoffs and accountability. Next, we will explore how to create this multi-agent compliance automation system using CrewAI.

Although this solution demonstrates CrewAI’s capabilities, it’s important to note that Amazon Bedrock Agents has built-in support for multi-agent collaboration, and organizations could implement their agent workflows entirely within Amazon Bedrock Agents. However, we’ve chosen CrewAI for this demonstration to showcase how open source frameworks can extend Amazon Bedrock capabilities while maintaining enterprise-grade security through Bedrock Guardrails.

Solution components

This solution combines multiple capabilities. It shows you how to:

  1. Develop a multi-agent solution using a CrewAI framework
  2. Enrich the solution using domain-specific data using Amazon Bedrock Knowledge Bases
  3. Safeguard your generative AI application using Amazon Bedrock Guardrails
  4. Bring everything together using CrewAI and Amazon Bedrock Agents

You can use CrewAI to develop AI agents and coordinate tasks among those agents. This structure enables systematic management of complex AI workflows while maintaining oversight of agent interactions and outcomes. The framework has the following components, which are shown in the following figure:

The CrewAI framework is built around the following components (a short code sketch after this list shows how they fit together):

  • Agents in CrewAI are autonomous components designed to perform specific tasks or roles within a multi-agent system. They have specific roles (such as researcher or writer) and make autonomous decisions with or without using external tools. LLMs are the core intelligence behind CrewAI agents. LLMs enable agents to understand context, make decisions, and generate human-like responses.
  • Tasks are defined jobs assigned to agents with clear objectives, including execution details and required resources.
  • Crews are coordinated teams of agents working together on a shared goal. Crews require defining agent roles, task assignments, and execution order.
  • Tools refer to the skills or functions that agents can use to carry out various actions.
  • Processes are responsible for orchestrating how tasks are executed by agents, similar to project management in human teams. These processes make sure that tasks are allocated and completed efficiently, in accordance with a predefined strategy.
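
As a minimal illustration of how these components fit together in code, consider the sketch below. The role, task, and topic are placeholders rather than the agents defined later in this post, and the model string is passed through LiteLLM; CrewAI can also pick the model up from the MODEL value in your .env file.

from crewai import Agent, Task, Crew, Process

# An agent backed by an Amazon Bedrock model (the model string is resolved through LiteLLM)
analyst = Agent(
    role="Compliance Analyst",
    goal="Summarize new regulatory requirements",
    backstory="An experienced analyst who tracks regulatory changes.",
    llm="bedrock/us.amazon.nova-pro-v1:0",
)

# A task assigned to that agent, with a clear expected output
summary_task = Task(
    description="Summarize the key obligations introduced by {topic}.",
    expected_output="A short bullet-point summary.",
    agent=analyst,
)

# A crew coordinates agents and tasks; Process.sequential runs the tasks in order
crew = Crew(agents=[analyst], tasks=[summary_task], process=Process.sequential)
result = crew.kickoff(inputs={"topic": "PCI DSS v4.0"})
print(result)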

Prerequisites

Before getting started with the solution, you need to get access to Amazon Bedrock models:

  1. Sign in to the Amazon Bedrock console and in the navigation pane under Bedrock configurations, select Model access to request access to Amazon Bedrock models. This step is shown in the following screenshots.

In this example, we use Amazon Nova Pro through Amazon Bedrock as our LLM. CrewAI provides built-in integration with Amazon Bedrock.

  2. Clone the GitHub repo into a local folder:
git clone https://github.com/aws-samples/sample-compliance-assistant-with-agents.git

  3. Use the following command to install the dependencies for running CrewAI in your Python environment:

pip install crewai uv

Your compliance agents

In this step, you will define your agents:

  1. Define compliance agents in the agents.yaml file. Each agent has a specific role to play:
    compliance_analyst:
      role: {topic} Senior Compliance Analyst
      goal: Review and understand regulatory and compliance requirements around {topic}
      backstory: You're a seasoned Compliance analyst with deep expertise in areas such as PCI DSS, HIPAA, NIST, and ISO, and a knack for uncovering the latest regulations and requirements in {topic}.
    compliance_specialist:
      role: {topic} Compliance Specialist
      goal: Create detailed reports based on {topic} compliance analysis and research findings
      backstory: You're a meticulous compliance specialist with deep understanding of compliance and regulatory landscape for Financial services and Technology Industry. You create standards and policies for the organization to meet regulations and compliance needs.

  2. Define tasks for the agents:
    compliance_analysis_task:
      description: Conduct a thorough analysis about {topic}. Make sure you find relevant information given the current year is {current_year}.
      expected_output: A list with 10 bullet points of the most relevant information about {topic}
      agent: compliance_analyst
    compliance_reporting_task:
      description: Review the context you got and expand each topic into a full section for a report.
        Make sure the report is detailed and contains any and all relevant information for Financial Services Organization
      expected_output: A fully fledged report with the main topics, each with a full section of information.
      agent: compliance_specialist

  3. The execution and process steps are defined in crew.py:
    def crew(self) -> Crew:
        """Creates the Compliance Automation crew"""
        return Crew(
            agents=self.agents,
            tasks=self.tasks,
            process=Process.sequential,
            verbose=True)

  4. Define your LLM, topic, and runtime parameters in the .env file:
    MODEL=bedrock/us.amazon.nova-pro-v1:0
    AWS_REGION_NAME=us-west-2
    TOPIC='GDPR requirements for Data Privacy'

  5. Run the crew as follows:
    crewai run

  6. The following demo (ComplianceAgents-Topic_GDPR) shows the output of the crew. You can see the agents collaborating to generate a detailed solution.

In the output, notice that the compliance analyst and the compliance specialist are working together to solve multiple aspects of General Data Protection Regulation (GDPR) requirements for trading services. Note the synergistic collaboration between agents as they refine their approach and develop a comprehensive compliance management response through iterative problem-solving.

Addressing LLM challenges with domain-specific knowledge

LLMs, although impressive in their broad knowledge, face two key limitations when dealing with specialized domains or recent information. First, they struggle with specialized information specific to your organization. Second, because their knowledge is limited to their training data, they might not reflect the latest updates in rapidly changing fields. This limitation becomes particularly important when dealing with evolving compliance requirements, such as Payment Card Industry Data Security Standard (PCI DSS), GDPR, AML rules, and Know Your Customer (KYC) regulations. Additionally, organizations need solutions that are customized to their specific compliance requirements and internal standards, rather than generic responses from LLMs.

Retrieval Augmented Generation (RAG) is a technique that enables generative AI models to retrieve and incorporate current organizational and domain-specific information from external databases. Amazon Bedrock Knowledge Base is a managed capability that helps you implement the entire RAG technique without having to build custom integrations to data sources and manage data flows. By incorporating a knowledge base containing the latest publications, regulatory updates, and compliance guidelines from authoritative sources such as NIST, ISO, PCI, and regulatory bodies, Amazon Bedrock Knowledge Bases helps make sure that your AI system stays current with the latest compliance requirements. During prompt generation, RAG first retrieves relevant data from this continually updated knowledge base, then uses it to create informed responses. This helps provide more relevant, accurate, and customized responses aligned with current regulatory and organizational standards. For example, when querying about PCI DSS v4.0 requirements or recent GDPR amendments, the system can pull the most up-to-date information directly from authoritative sources rather than relying on potentially outdated training data.
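
To make the retrieval step concrete, the following is a minimal sketch that queries a knowledge base directly through the Amazon Bedrock Retrieve API; the knowledge base ID and the question are placeholders.

import boto3

bedrock_agent_runtime = boto3.client("bedrock-agent-runtime")

# Retrieve the most relevant passages for a compliance question (the knowledge base ID is a placeholder)
response = bedrock_agent_runtime.retrieve(
    knowledgeBaseId="<your-kb-id>",
    retrievalQuery={"text": "What are the key PCI DSS v4.0 requirements for encryption of stored data?"},
    retrievalConfiguration={"vectorSearchConfiguration": {"numberOfResults": 5}},
)

for result in response["retrievalResults"]:
    print(result["content"]["text"])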

Create an Amazon Bedrock knowledge base with contextual information from your data sources

  1. From the Amazon Bedrock navigation pane, select Knowledge Bases under Builder tools and choose a Knowledge Base with vector store.
  2. Provide the Knowledge Base name and Data source details. You’ll use the web crawler for ingesting data.
  3. The web crawler provided by Amazon Bedrock connects to and crawls URLs you selected for use in your Amazon Bedrock knowledge base. Add the URLs as data sources under Source URLs.
  4. Select the model for embeddings. We have selected Amazon Titan Text Embeddings v2, as shown in the following screenshot.

After a few minutes, the knowledge base will be ready. After it has synced, Amazon Bedrock Knowledge Bases handles retrieving the relevant content and formatting the results for your queries, simplifying the process of building natural language interfaces over your data.

Amazon Bedrock Agents

Amazon Bedrock Agents is a comprehensive environment for building sophisticated AI agent systems. At its core, it enables seamless multi-agent collaboration and maintains conversation context through native memory retention across interactions. Amazon Bedrock Agents integrates naturally with knowledge bases and enforces security through built-in guardrails. For this solution, we focus on two key capabilities: the RAG feature, which allows agents to access and use information from knowledge bases, and the security features provided through Amazon Bedrock Guardrails. These guardrails serve as an essential safeguard for your generative AI applications, promoting responsible and secure AI interactions.

  1. To create an agent, from the Amazon Bedrock navigation pane under Builder tools, select Agents and select Create Agent.
  2. Under Agent details, choose the model. We use Amazon Nova Pro for our use case, as shown in the following screenshot.
  3. Under Knowledge Bases, add knowledge bases to your agent.
  4. Choose the knowledge base name from the dropdown list, as shown in the following screenshot.

Amazon Bedrock Guardrails

Amazon Bedrock Guardrails provides safety controls that help maintain responsible AI use by providing a layer of security. Guardrails provide content filtering to monitor and filter AI model outputs to help prevent harmful, inappropriate, or biased content. You can set up filters for things such as hate speech, explicit content, or personally identifiable information (PII). You can also apply customizable rules and input/output validation.

  1. You can find Guardrails in the Amazon Bedrock navigation pane under Safeguards. Choose Create guardrail and provide a guardrail name.
  2. As shown in the following screenshot, select the content filters you want to implement for your Amazon Bedrock based application.
  3. Add denied topics with specific examples.
  4. After you’ve created your guardrail, attach the guardrail to the agent. A scripted alternative is sketched after this list.
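
The following is a minimal sketch of creating a guardrail with the Python SDK instead of the console; the filter choice, denied topic, and messages are illustrative assumptions, not the exact configuration used in this post.

import boto3

bedrock = boto3.client("bedrock")

# Illustrative guardrail with one content filter and one denied topic
response = bedrock.create_guardrail(
    name="compliance-assistant-guardrail",
    contentPolicyConfig={
        "filtersConfig": [
            {"type": "HATE", "inputStrength": "HIGH", "outputStrength": "HIGH"},
        ]
    },
    topicPolicyConfig={
        "topicsConfig": [
            {
                "name": "Investment advice",
                "definition": "Recommendations about specific securities or investment strategies.",
                "examples": ["Which stocks should I buy to beat the market?"],
                "type": "DENY",
            }
        ]
    },
    blockedInputMessaging="Sorry, I can't help with that request.",
    blockedOutputsMessaging="Sorry, I can't provide that information.",
)
guardrail_id = response["guardrailId"]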

Putting it all together: Integrating Amazon Bedrock Agents with CrewAI

CrewAI provides seamless integration with Amazon Bedrock features, including Amazon Bedrock Knowledge Bases and Amazon Bedrock Agents, through the CrewAI tools functionality. When these tools are triggered from CrewAI agents, they process your query, retrieve the relevant information from the Amazon Bedrock knowledge base or agent, and return the response to the CrewAI agent.
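
As an illustration of what such a tool can look like, here is a minimal sketch that wraps the Amazon Bedrock InvokeAgent API as a CrewAI tool. The environment variable names and the crewai.tools import path are assumptions that may differ across CrewAI versions; the sample code in the repository is the reference implementation.

import os
import uuid

import boto3
from crewai.tools import tool  # the import path can vary between CrewAI releases

bedrock_agent_runtime = boto3.client("bedrock-agent-runtime")

@tool("Ask the Amazon Bedrock compliance agent")
def ask_bedrock_agent(query: str) -> str:
    """Send a question to the Amazon Bedrock agent and return its answer."""
    response = bedrock_agent_runtime.invoke_agent(
        agentId=os.environ["BEDROCK_AGENT_ID"],          # defined in the .env file (variable name is an assumption)
        agentAliasId=os.environ["BEDROCK_AGENT_ALIAS_ID"],
        sessionId=str(uuid.uuid4()),
        inputText=query,
    )
    # The completion is an event stream; concatenate the returned text chunks
    return "".join(
        event["chunk"]["bytes"].decode("utf-8")
        for event in response["completion"]
        if "chunk" in event
    )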

  1. Refer to the sample code demonstrating CrewAI tools for Amazon Bedrock Agents. You need to define your Amazon Bedrock agent ID and alias as parameters in the .env file.
  2. Execute the crew again with Amazon Bedrock Agents:
    crewai run

  3. You can find the generated output below (demo: ComplianceAgents-Topic_PCI).

When you execute the crew, the compliance analyst agent initiates the process by invoking the CrewAI Bedrock tool to extract regulatory requirements from Amazon Bedrock Knowledge Bases, which are then transformed into technical requirements by the compliance specialist agent. Through iterative collaboration, these specialized agents work together to fill information gaps, and the enterprise architect agent synthesizes the gathered insights to develop a robust implementation strategy and execution plan. This streamlined process demonstrates how multiple AI agents can effectively coordinate to transform compliance requirements into actionable technical solutions.

Clean up

To avoid ongoing charges, follow these steps to clean up resources:

  1. Delete the Amazon Bedrock knowledge base that you created:
aws bedrock-agent delete-knowledge-base --knowledge-base-id <your-kb-id>
  2. Delete the Amazon Bedrock agents that you created:
aws bedrock-agent delete-agent --agent-id <your-agent-id>

Conclusion

In this post, we demonstrated how to:

  • Build a multi-agent AI system using CrewAI that mimics the structure of a comprehensive compliance department with specialized agents for different functions
  • Enhance AI responses with domain-specific knowledge by implementing RAG using Amazon Bedrock Knowledge Bases
  • Safeguard your generative AI applications with Amazon Bedrock Guardrails to help prevent harmful, inappropriate, or biased content
  • Create custom tools in CrewAI to integrate with Amazon Bedrock Agents for more powerful and context-aware compliance solutions
  • Automate the entire compliance lifecycle from monitoring regulatory changes to implementing technical controls without extensive manual effort
  • Deploy a production-ready solution that continually adapts to evolving regulatory requirements in financial services and other highly regulated industries

This solution combines Amazon Bedrock Knowledge Bases and CrewAI to create smart, multi-agent AI systems that help streamline regulatory compliance tasks. With simplified RAG implementation, sophisticated workflows that mirror human teams, and faster adaptation to new regulations, this approach shows how AI can assist organizations with specific aspects of complex regulatory requirements.

This solution serves as a practical starting point for organizations looking to enhance their compliance processes with AI capabilities, demonstrating how intelligent systems could complement and streamline existing compliance workflows. The complete source code for this project is available on the GitHub repository. Feel free to explore, fork, or contribute!


About the Authors

Balu Mathew is a Senior Solutions Architect at AWS, based in Raleigh, NC. He collaborates with Global Financial Services customers to design and implement secure, scalable and resilient solutions on AWS. With deep expertise in security, machine learning, and the financial services industry, he helps organizations build, protect, and scale large-scale distributed systems efficiently. Outside of work, he enjoys spending time with his kids and exploring the mountains and the outdoors.

Read More