Build AWS architecture diagrams using Amazon Q CLI and MCP

Creating professional AWS architecture diagrams is a fundamental task for solutions architects, developers, and technical teams. These diagrams serve as essential communication tools for stakeholders, documentation of compliance requirements, and blueprints for implementation teams. However, traditional diagramming approaches present several challenges:

  • Time-consuming process – Creating detailed architecture diagrams manually can take hours or even days
  • Steep learning curve – Learning specialized diagramming tools requires significant investment
  • Inconsistent styling – Maintaining visual consistency across multiple diagrams is difficult
  • Outdated AWS icons – Keeping up with the latest AWS service icons and best practices is challenging
  • Difficult maintenance – Updating diagrams as architectures evolve can become increasingly burdensome

Amazon Q Developer CLI with the Model Context Protocol (MCP) offers a streamlined approach to creating AWS architecture diagrams. By using generative AI through natural language prompts, architects can now generate professional diagrams in minutes rather than hours, while adhering to AWS best practices.

In this post, we explore how to use Amazon Q Developer CLI with the AWS Diagram MCP and the AWS Documentation MCP servers to create sophisticated architecture diagrams that follow AWS best practices. We discuss techniques for basic diagrams and real-world diagrams, with detailed examples and step-by-step instructions.

Solution overview

Amazon Q Developer CLI is a command line interface that brings the generative AI capabilities of Amazon Q directly to your terminal. Developers can interact with Amazon Q through natural language prompts, making it an invaluable tool for various development tasks.

Developed by Anthropic as an open protocol, the Model Context Protocol (MCP) provides a standardized way to connect AI models to virtually any data source or tool. Using a client-server architecture (as illustrated in the following diagram), the MCP helps developers expose their data through lightweight MCP servers while building AI applications as MCP clients that connect to these servers.

The MCP uses a client-server architecture containing the following components:

  • Host – A program or AI tool that requires access to data through the MCP protocol, such as Anthropic’s Claude Desktop, an integrated development environment (IDE), Amazon Q Developer CLI, or other AI applications
  • Client – Protocol clients that maintain one-to-one connections with servers
  • Server – Lightweight programs that expose specific capabilities through the standardized protocol or act as tools
  • Data sources – Local data sources such as databases and file systems, or external systems available over the internet through APIs (web APIs) that MCP servers can connect with

MCP client-server architecture diagram

As announced in April 2025, MCP enables Amazon Q Developer to connect with specialized servers that extend its capabilities beyond what’s possible with the base model alone. MCP servers act as plugins for Amazon Q, providing domain-specific knowledge and functionality. The AWS Diagram MCP server specifically enables Amazon Q to generate architecture diagrams using the Python diagrams package, with access to the complete AWS icon set and architectural best practices.
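
To make this concrete, the following is a minimal sketch of the kind of code the diagrams package expects. It is illustrative only; the code Amazon Q generates for your prompts will differ, and running it locally requires GraphViz:

# Minimal diagrams package example; running it writes web_service.png
# to the current directory (GraphViz must be installed).
from diagrams import Diagram
from diagrams.aws.compute import EC2
from diagrams.aws.database import RDS
from diagrams.aws.network import ELB

with Diagram("Web Service", show=False):
    ELB("alb") >> EC2("web server") >> RDS("database")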

Prerequisites

To implement this solution, you must have an AWS account with appropriate permissions and follow the steps below.

Set up your environment

Before you can start creating diagrams, you need to set up your environment with Amazon Q CLI, the AWS Diagram MCP server, and AWS Documentation MCP server. This section provides detailed instructions for installation and configuration.

Install Amazon Q Developer CLI

Amazon Q Developer CLI is available as a standalone installation. Complete the following steps to install it:

  1. Download and install Amazon Q Developer CLI. For instructions, see Using Amazon Q Developer on the command line.
  2. Verify the installation by running the following command: q --version
    You should see output similar to the following: Amazon Q Developer CLI version 1.x.x
  3. Configure Amazon Q CLI with your AWS credentials: q login
  4. Choose the login method that suits you.

Set up MCP servers

Complete the following steps to set up your MCP servers:

  1. Install uv using the following command: pip install uv
  2. Install Python 3.10 or newer: uv python install 3.10
  3. Install GraphViz for your operating system.
  4. Add the servers to your ~/.aws/amazonq/mcp.json file:
{
  "mcpServers": {
    "awslabs.aws-diagram-mcp-server": {
      "command": "uvx",
      "args": ["awslabs.aws-diagram-mcp-server"],
      "env": {
        "FASTMCP_LOG_LEVEL": "ERROR"
      },
      "autoApprove": [],
      "disabled": false
    },
    "awslabs.aws-documentation-mcp-server": {
      "command": "uvx",
      "args": ["awslabs.aws-documentation-mcp-server@latest"],
      "env": {
        "FASTMCP_LOG_LEVEL": "ERROR"
      },
      "autoApprove": [],
      "disabled": false
    }
  }
}

Now, Amazon Q CLI automatically discovers MCP servers in the ~/.aws/amazonq/mcp.json file.

Understanding MCP server tools

The AWS Diagram MCP server provides several powerful tools:

  • list_icons – Lists available icons from the diagrams package, organized by provider and service category
  • get_diagram_examples – Provides example code for different types of diagrams (AWS, sequence, flow, class, and others)
  • generate_diagram – Creates a diagram from Python code using the diagrams package

The AWS Documentation MCP server provides the following useful tools:

  • search_documentation – Searches AWS documentation using the official AWS Documentation Search API
  • read_documentation – Fetches and converts AWS documentation pages to markdown format
  • recommend – Gets content recommendations for AWS documentation pages

These tools work together to help you create accurate architecture diagrams that follow AWS best practices.

Test your setup

Let’s verify that everything is working correctly by generating a simple diagram:

  1. Start the Amazon Q CLI chat interface and verify the output shows the MCP servers being loaded and initialized: q chat
  2. In the chat interface, enter the following prompt:
    Please create a diagram showing an EC2 instance in a VPC connecting to an external S3 bucket. Include essential networking components (VPC, subnets, Internet Gateway, Route Table), security elements (Security Groups, NACLs), and clearly mark the connection between EC2 and S3. Label everything appropriately and concisely and indicate that all resources are in the us-east-1 region. Check for AWS documentation to ensure it adheres to AWS best practices before you create the diagram.
  3. Amazon Q CLI will ask you to trust the tool that is being used; enter t to trust it. Amazon Q CLI will generate and display a simple diagram showing the requested architecture. Your diagram should look similar to the following screenshot, though there might be variations in layout, styling, or specific details because it’s created using generative AI. The core architectural components and relationships will be represented, but the exact visual presentation might differ slightly with each generation.

    If you see the diagram, your environment is set up correctly. If you encounter issues, verify that Amazon Q CLI can access the MCP servers by making sure you installed the necessary tools and the servers are in the ~/.aws/amazonq/mcp.json file.

Configuration options

The AWS Diagram MCP server supports several configuration options to customize your diagramming experience:

  • Output directory – By default, diagrams are saved in a generated-diagrams directory in your current working directory. You can specify a different location in your prompts.
  • Diagram format – The default output format is PNG, but you can request other formats like SVG in your prompts.
  • Styling options – You can specify colors, shapes, and other styling elements in your prompts.

Now that our environment is set up, let’s create more diagrams.

Create AWS architecture diagrams

In this section, we walk through the process of creating multiple AWS architecture diagrams using Amazon Q CLI with the AWS Diagram MCP server and AWS Documentation MCP server to make sure our requirements follow best practices.

When you provide a prompt to Amazon Q CLI, the AWS Diagram and Documentation MCP servers complete the following steps:

  1. Interpret your requirements.
  2. Check the AWS documentation for best practices.
  3. Generate Python code using the diagrams package.
  4. Execute the code to create the diagram.
  5. Return the diagram as an image.

This process happens seamlessly, so you can focus on describing what you want rather than how to create it.

AWS architecture diagrams typically include the following components (the sketch after this list shows how they map to the diagrams package):

  • Nodes – AWS services and resources
  • Edges – Connections between nodes showing relationships or data flow
  • Clusters – Logical groupings of nodes, such as virtual private clouds (VPCs), subnets, and Availability Zones
  • Labels – Text descriptions for nodes and connections
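
The following sketch maps these concepts onto the open source diagrams package that the AWS Diagram MCP server builds on. It is illustrative only; the code generated for your prompts will look different:

# Nodes are AWS services, edges are connections, clusters group nodes
# (such as a VPC and its subnets), and labels annotate nodes and edges.
from diagrams import Cluster, Diagram, Edge
from diagrams.aws.compute import EC2
from diagrams.aws.database import RDS
from diagrams.aws.network import ELB

with Diagram("Sample Architecture", show=False, direction="LR"):
    with Cluster("VPC (us-east-1)"):
        with Cluster("Public subnet"):
            alb = ELB("Application Load Balancer")   # node with a label
        with Cluster("Private subnet"):
            web = [EC2("web-1"), EC2("web-2")]       # group of nodes
        with Cluster("DB subnet"):
            db = RDS("PostgreSQL")

    alb >> Edge(label="HTTP") >> web                 # labeled edge
    web >> db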

Example 1: Create a web application architecture

Let’s create a diagram for a simple web application hosted on AWS. Enter the following prompt:

Create a diagram for a simple web application with an Application Load Balancer, two EC2 instances, and an RDS database. Check for AWS documentation to ensure it adheres to AWS best practices before you create the diagram

After you enter your prompt, Amazon Q CLI will search AWS documentation for best practices using the search_documentation tool from awslabs.aws-documentation-mcp-server.

Following the search of the relevant AWS documentation, it will read the documentation using the read_documentation tool from the awslabs.aws-documentation-mcp-server MCP server.

Amazon Q CLI will then list the needed AWS service icons using the list_icons tool, and will use generate_diagram from awslabs.aws-diagram-mcp-server.

You should receive an output describing the diagram that was created based on the prompt, along with the location where the diagram was saved.

Amazon Q CLI will generate and display the diagram.

The generated diagram shows the key components of the requested architecture: the Application Load Balancer, the two EC2 instances, and the RDS database.

Example 2: Create a multi-tier architecture

Multi-tier architectures separate applications into functional layers (presentation, application, and data) to improve scalability and security. We use the following prompt to create our diagram:

Create a diagram for a three-tier web application with a presentation tier (ALB and CloudFront), application tier (ECS with Fargate), and data tier (Aurora PostgreSQL). Include VPC with public and private subnets across multiple AZs. Check for AWS documentation to ensure it adheres to AWS best practices before you create the diagram.

The diagram shows the following key components:

  • A presentation tier in public subnets
  • An application tier in private subnets
  • A data tier in isolated private subnets
  • Proper security group configurations
  • Traffic flow between tiers

Example 3: Create a serverless architecture

We use the following prompt to create a diagram for a serverless architecture:

Create a diagram for a serverless web application using API Gateway, Lambda, DynamoDB, and S3 for static website hosting. Include Cognito for user authentication and CloudFront for content delivery. Check for AWS documentation to ensure it adheres to AWS best practices before you create the diagram.

The diagram includes the key components of the serverless architecture, such as API Gateway, Lambda, DynamoDB, S3, Cognito, and CloudFront.

Example 4: Create a data processing diagram

We use the following prompt to create a diagram for a data processing pipeline:

Create a diagram for a data processing pipeline with components organized in clusters for data ingestion, processing, storage, and analytics. Include Kinesis, Lambda, S3, Glue, and QuickSight. Check for AWS documentation to ensure it adheres to AWS best practices before you create the diagram.

The diagram organizes components into distinct clusters for data ingestion, processing, storage, and analytics.

Real-world examples

Let’s explore some real-world architecture patterns and how to create diagrams for them using Amazon Q CLI with the AWS Diagram MCP server.

Ecommerce platform

Ecommerce platforms require scalable, resilient architectures to handle variable traffic and maintain high availability. We use the following prompt to create an example diagram:

Create a diagram for an e-commerce platform with microservices architecture. Include components for product catalog, shopping cart, checkout, payment processing, order management, and user authentication. Ensure the architecture follows AWS best practices for scalability and security. Check for AWS documentation to ensure it adheres to AWS best practices before you create the diagram.

The diagram includes the following key components:

  • Amazon API Gateway as the entry point for client applications, providing a secure and scalable interface
  • Microservices implemented as containers in ECS with Fargate, enabling flexible and scalable processing
  • Amazon RDS databases for product catalog, shopping cart, and order data, providing reliable structured data storage
  • Amazon ElastiCache for product data caching and session management, improving performance and user experience
  • Amazon Cognito for authentication, ensuring secure access control
  • Amazon Simple Queue Service and Amazon Simple Notification Service for asynchronous communication between services, enabling a decoupled and resilient architecture
  • Amazon CloudFront for content delivery and serving static assets from S3, optimizing global performance
  • Amazon Route 53 for DNS management, providing reliable routing
  • AWS WAF for web application security, protecting against common web exploits
  • AWS Lambda functions for serverless microservice implementation, offering cost-effective scaling
  • AWS Secrets Manager for secure credential storage, enhancing security posture
  • Amazon CloudWatch for monitoring and observability, providing insights into system performance and health

Intelligent document processing solution

We use the following prompt to create a diagram for an intelligent document processing (IDP) architecture:

Create a diagram for an intelligent document processing (IDP) application on AWS. Include components for document ingestion, OCR and text extraction, intelligent data extraction (using NLP and/or computer vision), human review and validation, and data output/integration. Ensure the architecture follows AWS best practices for scalability and security, leveraging services like S3, Lambda, Textract, Comprehend, SageMaker (for custom models, if applicable), and potentially Augmented AI (A2I). Check for AWS documentation related to intelligent document processing best practices to ensure it adheres to AWS best practices before you create the diagram.

The diagram includes the key components of the IDP workflow described in the prompt, covering document ingestion, OCR and text extraction, intelligent data extraction, human review and validation, and data output and integration, built on services such as Amazon S3, Amazon Textract, Amazon Comprehend, Amazon SageMaker, Amazon Augmented AI (A2I), and AWS Lambda.

Clean up

If you no longer need to use the AWS Diagram MCP server and AWS Documentation MCP server with Amazon Q CLI, you can remove them from your configuration:

  1. Open your ~/.aws/amazonq/mcp.json file.
  2. Remove or comment out the MCP server entries.
  3. Save the file.

This will prevent the server from being loaded when you start Amazon Q CLI in the future.

Conclusion

In this post, we explored how to use Amazon Q CLI with the AWS Documentation MCP and AWS Diagram MCP servers to create professional AWS architecture diagrams that adhere to AWS best practices referenced from official AWS documentation. This approach offers significant advantages over traditional diagramming methods:

  • Time savings – Generate complex diagrams in minutes instead of hours
  • Consistency – Make sure diagrams follow the same style and conventions
  • Best practices – Automatically incorporate AWS architectural guidelines
  • Iterative refinement – Quickly modify diagrams through simple prompts
  • Validation – Check architectures against official AWS documentation and recommendations

As you continue your journey with AWS architecture diagrams, we encourage you to deepen your knowledge by learning more about the Model Context Protocol (MCP) to understand how it enhances the capabilities of Amazon Q. When seeking inspiration for your own designs, the AWS Architecture Center offers a wealth of reference architectures that follow best practices. For creating visually consistent diagrams, be sure to visit the AWS Icons page, where you can find the complete official icon set. And to stay at the cutting edge of these tools, keep an eye on updates to the official AWS MCP Servers—they’re constantly evolving with new features to make your diagramming experience even better.


About the Authors

Joel Asante, an Austin-based Solutions Architect at Amazon Web Services (AWS), works with GovTech (Government Technology) customers. With a strong background in data science and application development, he brings deep technical expertise to creating secure and scalable cloud architectures for his customers. Joel is passionate about data analytics, machine learning, and robotics, leveraging his development experience to design innovative solutions that meet complex government requirements. He holds 13 AWS certifications and enjoys family time, fitness, and cheering for the Kansas City Chiefs and Los Angeles Lakers in his spare time.

Dunieski Otano is a Solutions Architect at Amazon Web Services based out of Miami, Florida. He works with World Wide Public Sector MNO (Multi-International Organizations) customers. His passion is Security, Machine Learning and Artificial Intelligence, and Serverless. He works with his customers to help them build and deploy high available, scalable, and secure solutions. Dunieski holds 14 AWS certifications and is an AWS Golden Jacket recipient. In his free time, you will find him spending time with his family and dog, watching a great movie, coding, or flying his drone.

Varun Jasti is a Solutions Architect at Amazon Web Services, working with AWS Partners to design and scale artificial intelligence solutions for public sector use cases to meet compliance standards. With a background in Computer Science, his work covers broad range of ML use cases primarily focusing on LLM training/inferencing and computer vision. In his spare time, he loves playing tennis and swimming.

Read More

AWS costs estimation using Amazon Q CLI and AWS Cost Analysis MCP

Managing and optimizing AWS infrastructure costs is a critical challenge for organizations of all sizes. Traditional cost analysis approaches often involve the following:

  • Complex spreadsheets – Creating and maintaining detailed cost models, which requires significant effort
  • Multiple tools – Switching between the AWS Pricing Calculator, AWS Cost Explorer, and third-party tools
  • Specialized knowledge – Understanding the nuances of AWS pricing across services and AWS Regions
  • Time-consuming analysis – Manually comparing different deployment options and scenarios
  • Delayed optimization – Cost insights often come too late to inform architectural decisions

Amazon Q Developer CLI with the Model Context Protocol (MCP) offers a revolutionary approach to AWS cost analysis. By using generative AI through natural language prompts, teams can now generate detailed cost estimates, comparisons, and optimization recommendations in minutes rather than hours, while improving accuracy through integration with official AWS pricing data.

In this post, we explore how to use Amazon Q CLI with the AWS Cost Analysis MCP server to perform sophisticated cost analysis that follows AWS best practices. We discuss basic setup and advanced techniques, with detailed examples and step-by-step instructions.

Solution overview

Amazon Q Developer CLI is a command line interface that brings the generative AI capabilities of Amazon Q directly to your terminal. Developers can interact with Amazon Q through natural language prompts, making it an invaluable tool for various development tasks.

Developed by Anthropic as an open protocol, the Model Context Protocol (MCP) provides a standardized way to connect AI models to different data sources or tools. Using a client-server architecture (as illustrated in the following diagram), the MCP helps developers expose their data through lightweight MCP servers while building AI applications as MCP clients that connect to these servers.

The MCP uses a client-server architecture containing the following components:

  • Host – A program or AI tool that requires access to data through the MCP protocol, such as Anthropic’s Claude Desktop, an integrated development environment (IDE), or other AI applications
  • Client – Protocol clients that maintain one-to-one connections with servers
  • Server – Lightweight programs that expose specific capabilities through the standardized protocol or act as tools
  • Data sources – Local data sources such as databases and file systems, or external systems available over the internet through APIs (web APIs) that MCP servers can connect with

MCP client-server architecture diagram

As announced in April 2025, the MCP enables Amazon Q Developer to connect with specialized servers that extend its capabilities beyond what’s possible with the base model alone. MCP servers act as plugins for Amazon Q, providing domain-specific knowledge and functionality. The AWS Cost Analysis MCP server specifically enables Amazon Q to generate detailed cost estimates, reports, and optimization recommendations using real-time AWS pricing data.

Prerequisites

To implement this solution, you must have an AWS account with appropriate permissions and follow the steps below.

Set up your environment

Before you can start analyzing costs, you need to set up your environment with Amazon Q CLI and the AWS Cost Analysis MCP server. This section provides detailed instructions for installation and configuration.

Install Amazon Q Developer CLI

Amazon Q Developer CLI is available as a standalone installation. Complete the following steps to install it:

  1. Download and install Amazon Q Developer CLI. For instructions, see Using Amazon Q Developer on the command line.
  2. Verify the installation by running the following command: q --version
    You should see output similar to the following: Amazon Q Developer CLI version 1.x.x
  3. Configure Amazon Q CLI with your AWS credentials: q login
  4. Choose the login method that suits you.

Set up MCP servers

Before using the AWS Cost Analysis MCP server with Amazon Q CLI, you must install several tools and configure your environment. The following steps guide you through installing the necessary tools and setting up the MCP server configuration:

  1. Install Pandoc, which is used to convert output to PDF, using the following command (you can also install it with brew): pip install pandoc
  2. Install uv with the following command: pip install uv
  3. Install Python 3.10 or newer: uv python install 3.10
  4. Add the servers to your ~/.aws/amazonq/mcp.json file:
    {
      "mcpServers": {
        "awslabs.cost-analysis-mcp-server": {
          "command": "uvx",
          "args": ["awslabs.cost-analysis-mcp-server"],
          "env": {
            "FASTMCP_LOG_LEVEL": "ERROR"
          },
          "autoApprove": [],
          "disabled": false
        }
      }
    }
    

    Now, Amazon Q CLI automatically discovers MCP servers in the ~/.aws/amazonq/mcp.json file.

Understanding MCP server tools

The AWS Cost Analysis MCP server provides several powerful tools:

  • get_pricing_from_web – Retrieves pricing information from AWS pricing webpages
  • get_pricing_from_api – Fetches pricing data from the AWS Price List API
  • generate_cost_report – Creates detailed cost analysis reports with breakdowns and visualizations
  • analyze_cdk_project – Analyzes AWS Cloud Development Kit (AWS CDK) projects to identify services used and estimate costs
  • analyze_terraform_project – Analyzes Terraform projects to identify services used and estimate costs
  • get_bedrock_patterns – Retrieves architecture patterns for Amazon Bedrock with cost considerations

These tools work together to help you create accurate cost estimates that follow AWS best practices.
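
The server’s internal implementation isn’t shown in this post; conceptually, though, fetching on-demand rates resembles a direct call to the AWS Price List API, which you can also make yourself with boto3. The filter fields below are illustrative:

import json
import boto3

# The Price List API is served from us-east-1.
pricing = boto3.client("pricing", region_name="us-east-1")

# Look up on-demand pricing for a t3.medium Linux instance in us-east-1.
response = pricing.get_products(
    ServiceCode="AmazonEC2",
    Filters=[
        {"Type": "TERM_MATCH", "Field": "instanceType", "Value": "t3.medium"},
        {"Type": "TERM_MATCH", "Field": "regionCode", "Value": "us-east-1"},
        {"Type": "TERM_MATCH", "Field": "operatingSystem", "Value": "Linux"},
        {"Type": "TERM_MATCH", "Field": "tenancy", "Value": "Shared"},
        {"Type": "TERM_MATCH", "Field": "preInstalledSw", "Value": "NA"},
        {"Type": "TERM_MATCH", "Field": "capacitystatus", "Value": "Used"},
    ],
    MaxResults=1,
)

# Each price list entry is returned as a JSON string.
product = json.loads(response["PriceList"][0])
print(json.dumps(product["terms"]["OnDemand"], indent=2))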

Test your setup

Let’s verify that everything is working correctly by generating a simple cost analysis:

  1. Start the Amazon Q CLI chat interface and verify the output shows the MCP server being loaded and initialized: q chat
  2. In the chat interface, enter the following prompt: Please create a cost analysis for a simple web application with an Application Load Balancer, two t3.medium EC2 instances, and an RDS db.t3.medium MySQL database. Assume 730 hours of usage per month and moderate traffic of about 100 GB data transfer. Convert estimation to a PDF format.
  3. Amazon Q CLI will ask for permission to trust the tool that is being used; enter t to trust it. Amazon Q should generate and display a detailed cost analysis. Your output should look like the following screenshot.

    If you see the cost analysis report, your environment is set up correctly. If you encounter issues, verify that Amazon Q CLI can access the MCP servers by making sure you installed the necessary tools and the servers are in the ~/.aws/amazonq/mcp.json file.

Configuration options

The AWS Cost Analysis MCP server supports several configuration options to customize your cost analysis experience:

  • Output format – Choose between Markdown, CSV, or PDF (which requires the pandoc package we installed) for cost reports
  • Pricing model – Specify on-demand, reserved instances, or savings plans
  • Assumptions and exclusions – Customize the assumptions and exclusions in your cost analysis
  • Detailed cost data – Provide specific usage patterns for more accurate estimates

Now that our environment is set up, let’s create more cost analyses.

Create AWS Cost Analysis reports

In this section, we walk through the process of creating AWS cost analysis reports using Amazon Q CLI with the AWS Cost Analysis MCP server.

When you provide a prompt to Amazon Q CLI, the AWS Cost Analysis MCP server completes the following steps:

  1. Interpret your requirements.
  2. Retrieve pricing data from AWS pricing sources.
  3. Generate a detailed cost analysis report.
  4. Provide optimization recommendations.

This process happens seamlessly, so you can focus on describing what you want rather than how to create it.

AWS Cost Analysis reports typically include the following information:

  • Service costs – Breakdown of costs by AWS service
  • Unit pricing – Detailed unit pricing information
  • Usage quantities – Estimated usage quantities for each service
  • Calculation details – Step-by-step calculations showing how costs were derived
  • Assumptions – Clearly stated assumptions used in the analysis
  • Exclusions – Costs that were not included in the analysis
  • Recommendations – Cost optimization suggestions

Example 1: Analyze a serverless application

Let’s create a cost analysis for a simple serverless application. Use the following prompt:

Create a cost analysis for a serverless application using API Gateway, Lambda, and DynamoDB. Assume 1 million API calls per month, average Lambda execution time of 200ms with 512MB memory, and 10GB of DynamoDB storage with 5 million read requests and 1 million write requests per month. Convert estimation to a PDF format.

After you enter your prompt, Amazon Q CLI will retrieve pricing data using the get_pricing_from_web or get_pricing_from_api tool, and will use generate_cost_report from awslabs.cost-analysis-mcp-server.

You should receive an output giving a detailed cost breakdown based on the prompt along with optimization recommendations.

The generated cost analysis shows the following information:

  • Amazon API Gateway costs for 1 million requests
  • AWS Lambda costs for compute time and requests
  • Amazon DynamoDB costs for storage, read, and write capacity
  • Total monthly cost estimate
  • Cost optimization recommendations
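
To sanity-check a generated report, you can reproduce the arithmetic yourself. The following sketch uses placeholder rate constants that are assumptions for illustration only; replace them with current prices for your Region, and note that it ignores the free tier:

# Back-of-the-envelope serverless cost estimate. All *_RATE values are
# placeholder assumptions -- substitute current on-demand prices.
API_GW_RATE_PER_MILLION = 3.50         # assumed REST API requests, per 1M
LAMBDA_RATE_PER_GB_SECOND = 0.0000167  # assumed, per GB-second
LAMBDA_RATE_PER_MILLION_REQ = 0.20     # assumed, per 1M invocations
DDB_WRITE_RATE_PER_MILLION = 1.25      # assumed on-demand writes, per 1M
DDB_READ_RATE_PER_MILLION = 0.25       # assumed on-demand reads, per 1M
DDB_STORAGE_RATE_PER_GB = 0.25         # assumed, per GB-month

requests_millions = 1.0                               # 1 million API calls per month
lambda_gb_seconds = 1_000_000 * 0.200 * (512 / 1024)  # 200 ms at 512 MB per call

api_gw = requests_millions * API_GW_RATE_PER_MILLION
lambda_cost = (lambda_gb_seconds * LAMBDA_RATE_PER_GB_SECOND
               + requests_millions * LAMBDA_RATE_PER_MILLION_REQ)
dynamodb = (1 * DDB_WRITE_RATE_PER_MILLION + 5 * DDB_READ_RATE_PER_MILLION
            + 10 * DDB_STORAGE_RATE_PER_GB)

print(f"API Gateway: ${api_gw:.2f}")
print(f"Lambda:      ${lambda_cost:.2f}")
print(f"DynamoDB:    ${dynamodb:.2f}")
print(f"Total:       ${api_gw + lambda_cost + dynamodb:.2f}")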

Example 2: Analyze multi-tier architectures

Multi-tier architectures separate applications into functional layers (presentation, application, and data) to improve scalability and security. This example analyzes costs for implementing such an architecture on AWS with components for each tier:

Create a cost analysis for a three-tier web application with a presentation tier (ALB and CloudFront), application tier (ECS with Fargate), and data tier (Aurora PostgreSQL). Include costs for 2 Fargate tasks with 1 vCPU and 2GB memory each, an Aurora db.r5.large instance with 100GB storage, an Application Load Balancer with 10

This time, we are formatting it into both PDF and DOCX.

The cost analysis shows the costs for each tier of the architecture, along with a total monthly estimate and cost optimization recommendations.

Example 3: Compare deployment options

When deploying containers on AWS, choosing between Amazon ECS with Amazon Elastic Compute Cloud (Amazon EC2) or Fargate involves different cost structures and management overhead. This example compares these options to determine the most cost-effective solution for a specific workload:

Compare the costs between running a containerized application on ECS with EC2 launch type versus Fargate launch type. Assume 4 containers each needing 1 vCPU and 2GB memory, running 24/7 for a month. For EC2, use t3.medium instances. Provide a recommendation on which option is more cost-effective for this workload. Convert estimation to a HTML webpage.

This time, we are formatting it into an HTML webpage.

The cost comparison includes the following information:

  • Amazon ECS with Amazon EC2 launch type costs
  • Amazon ECS with Fargate launch type costs
  • Detailed breakdown of each option’s pricing components
  • Side-by-side comparison of total costs
  • Recommendations for the most cost-effective option
  • Considerations for when each option might be preferred
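
If you want to reproduce the comparison logic outside the report, a rough sketch like the following works. The hourly rates are placeholder assumptions; replace them with current prices for your Region:

# Rough EC2-vs-Fargate monthly cost comparison for 4 containers,
# each 1 vCPU / 2 GB, running 24/7. Rates are placeholder assumptions.
HOURS_PER_MONTH = 730

T3_MEDIUM_HOURLY = 0.0416      # assumed on-demand rate (2 vCPU, 4 GB)
FARGATE_VCPU_HOURLY = 0.04048  # assumed per-vCPU-hour rate
FARGATE_GB_HOURLY = 0.004445   # assumed per-GB-hour rate

# EC2 launch type: 4 x (1 vCPU, 2 GB) tasks roughly fit on 2 t3.medium instances
ec2_monthly = 2 * T3_MEDIUM_HOURLY * HOURS_PER_MONTH

# Fargate launch type: pay per task for vCPU and memory
fargate_monthly = 4 * (1 * FARGATE_VCPU_HOURLY + 2 * FARGATE_GB_HOURLY) * HOURS_PER_MONTH

print(f"ECS on EC2 (2 x t3.medium): ${ec2_monthly:.2f}/month")
print(f"ECS on Fargate (4 tasks):   ${fargate_monthly:.2f}/month")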

Real-world examples

Let’s explore some real-world architecture patterns and how to analyze their costs using Amazon Q CLI with the AWS Cost Analysis MCP server.

Ecommerce platform

Ecommerce platforms require scalable, resilient architectures with careful cost management. These systems typically use microservices to handle various functions independently while maintaining high availability. This example analyzes costs for a complete ecommerce solution with multiple components serving moderate traffic levels:

Create a cost analysis for an e-commerce platform with microservices architecture. Include components for product catalog, shopping cart, checkout, payment processing, order management, and user authentication. Assume moderate traffic of 500,000 monthly active users, 2 million page views per day, and 50,000 orders per month. Ensure the analysis follows AWS best practices for cost optimization. Convert estimation to a PDF format.

The cost analysis covers the key components of the ecommerce platform described in the prompt, along with a total monthly cost estimate and optimization recommendations.

Data analytics platform

Modern data analytics platforms need to efficiently ingest, store, process, and visualize large volumes of data while managing costs effectively. This example examines the AWS services and costs involved in building a complete analytics pipeline handling significant daily data volumes with multiple user access requirements:

Create a cost analysis for a data analytics platform processing 500GB of new data daily. Include components for data ingestion (Kinesis), storage (S3), processing (EMR), and visualization (QuickSight). Assume 50 users accessing dashboards daily and data retention of 90 days. Ensure the analysis follows AWS best practices for cost optimization and includes recommendations for cost-effective scaling. Convert estimation to a HTML webpage.

The cost analysis includes the following key components:

  • Data ingestion costs (Amazon Kinesis Data Streams and Amazon Data Firehose)
  • Storage costs (Amazon S3 with lifecycle policies)
  • Processing costs (Amazon EMR cluster)
  • Visualization costs (Amazon QuickSight)
  • Data transfer costs between services
  • Total monthly cost estimate
  • Cost optimization recommendations for each component
  • Scaling considerations and their cost implications

Clean up

If you no longer need to use the AWS Cost Analysis MCP server with Amazon Q CLI, you can remove it from your configuration:

  1. Open your ~/.aws/amazonq/mcp.json file.
  2. Remove or comment out the “awslabs.cost-analysis-mcp-server” entry.
  3. Save the file.

This will prevent the server from being loaded when you start Amazon Q CLI in the future.

Conclusion

In this post, we explored how to use Amazon Q CLI with the AWS Cost Analysis MCP server to create detailed cost analyses that use accurate AWS pricing data. This approach offers significant advantages over traditional cost estimation methods:

  • Time savings – Generate complex cost analyses in minutes instead of hours
  • Accuracy – Make sure estimates use the latest AWS pricing information
  • Comprehensive – Include relevant cost components and considerations
  • Actionable – Receive specific optimization recommendations
  • Iterative – Quickly compare different scenarios through simple prompts
  • Validation – Check estimates against official AWS pricing

As you continue exploring AWS cost analysis, we encourage you to deepen your knowledge by learning more about the Model Context Protocol (MCP) to understand how it enhances the capabilities of Amazon Q. For hands-on cost estimation, the AWS Pricing Calculator offers an interactive experience to model and compare different deployment scenarios. To make sure your architectures follow financial best practices, the AWS Well-Architected Framework Cost Optimization Pillar provides comprehensive guidance on building cost-efficient systems. And to stay at the cutting edge of these tools, keep an eye on updates to the official AWS MCP servers—they’re constantly evolving with new features to make your cost analysis experience even more powerful and accurate.


About the Authors

Joel Asante, an Austin-based Solutions Architect at Amazon Web Services (AWS), works with GovTech (Government Technology) customers. With a strong background in data science and application development, he brings deep technical expertise to creating secure and scalable cloud architectures for his customers. Joel is passionate about data analytics, machine learning, and robotics, leveraging his development experience to design innovative solutions that meet complex government requirements. He holds 13 AWS certifications and enjoys family time, fitness, and cheering for the Kansas City Chiefs and Los Angeles Lakers in his spare time.

Dunieski Otano is a Solutions Architect at Amazon Web Services based out of Miami, Florida. He works with World Wide Public Sector MNO (Multi-International Organizations) customers. His passion is Security, Machine Learning and Artificial Intelligence, and Serverless. He works with his customers to help them build and deploy high available, scalable, and secure solutions. Dunieski holds 14 AWS certifications and is an AWS Golden Jacket recipient. In his free time, you will find him spending time with his family and dog, watching a great movie, coding, or flying his drone.

Varun Jasti is a Solutions Architect at Amazon Web Services, working with AWS Partners to design and scale artificial intelligence solutions for public sector use cases to meet compliance standards. With a background in Computer Science, his work covers broad range of ML use cases primarily focusing on LLM training/inferencing and computer vision. In his spare time, he loves playing tennis and swimming.

Read More

Tailor responsible AI with new safeguard tiers in Amazon Bedrock Guardrails

Amazon Bedrock Guardrails provides configurable safeguards to help build trusted generative AI applications at scale. It provides organizations with integrated safety and privacy safeguards that work across multiple foundation models (FMs), including models available in Amazon Bedrock, as well as models hosted outside Amazon Bedrock from other model providers and cloud providers. With the standalone ApplyGuardrail API, Amazon Bedrock Guardrails offers a model-agnostic and scalable approach to implementing responsible AI policies for your generative AI applications. Guardrails currently offers six key safeguards: content filters, denied topics, word filters, sensitive information filters, contextual grounding checks, and Automated Reasoning checks (preview), to help prevent unwanted content and align AI interactions with your organization’s responsible AI policies.

As organizations strive to implement responsible AI practices across diverse use cases, they face the challenge of balancing safety controls with varying performance and language requirements across different applications, making a one-size-fits-all approach ineffective. To address this, we’ve introduced safeguard tiers for Amazon Bedrock Guardrails, so you can choose appropriate safeguards based on your specific needs. For instance, a financial services company can implement comprehensive, multi-language protection for customer-facing AI assistants while using more focused, lower-latency safeguards for internal analytics tools, making sure each application upholds responsible AI principles with the right level of protection without compromising performance or functionality.

In this post, we introduce the new safeguard tiers available in Amazon Bedrock Guardrails, explain their benefits and use cases, and provide guidance on how to implement and evaluate them in your AI applications.

Solution overview

Until now, when using Amazon Bedrock Guardrails, you were provided with a single set of safeguards associated with specific AWS Regions and a limited set of supported languages. The introduction of safeguard tiers in Amazon Bedrock Guardrails provides three key advantages for implementing AI safety controls:

  • A tier-based approach that gives you control over which guardrail implementations you want to use for content filters and denied topics, so you can select the appropriate protection level for each use case. We provide more details about this in the following sections.
  • Cross-Region Inference Support (CRIS) for Amazon Bedrock Guardrails, so you can use compute capacity across multiple Regions, achieving better scaling and availability for your guardrails. With this, your requests get automatically routed during guardrail policy evaluation to the optimal Region within your geography, maximizing available compute resources and model availability. This helps maintain guardrail performance and reliability when demand increases. There’s no additional cost for using CRIS with Amazon Bedrock Guardrails, and you can select from specific guardrail profiles for controlling model versioning and future upgrades.
  • Advanced capabilities as a configurable tier option for use cases where more robust protection or broader language support are critical priorities, and where you can accommodate a modest latency increase.

Safeguard tiers are applied at the guardrail policy level, specifically for content filters and denied topics. You can tailor your protection strategy for different aspects of your AI application. Let’s explore the two available tiers:

  • Classic tier (default):
    • Maintains the existing behavior of Amazon Bedrock Guardrails
    • Limited language support: English, French, and Spanish
    • Does not require CRIS for Amazon Bedrock Guardrails
    • Optimized for lower-latency applications
  • Standard tier:
    • Provided as a new capability that you can enable for existing or new guardrails
    • Multilingual support for more than 60 languages
    • Enhanced robustness against prompt typos and manipulated inputs
    • Enhanced prompt attack protection covering modern jailbreak and prompt injection techniques, including token smuggling, AutoDAN, and many-shot, among others
    • Enhanced topic detection with improved understanding and handling of complex topics
    • Requires the use of CRIS for Amazon Bedrock Guardrails and might have a modest increase in latency profile compared to the Classic tier option

You can select each tier independently for content filters and denied topics policies, allowing for mixed configurations within the same guardrail, as illustrated in the following hierarchy. With this flexibility, companies can implement the right level of protection for each specific application.

  • Policy: Content filters
    • Tier: Classic or Standard
  • Policy: Denied topics
    • Tier: Classic or Standard
  • Other policies: Word filters, sensitive information filters, contextual grounding checks, and Automated Reasoning checks (preview)

To illustrate how these tiers can be applied, consider a global financial services company deploying AI in both customer-facing and internal applications:

  • For their customer service AI assistant, they might choose the Standard tier for both content filters and denied topics, to provide comprehensive protection across many languages.
  • For internal analytics tools, they could use the Classic tier for content filters prioritizing low latency, while implementing the Standard tier for denied topics to provide robust protection against sensitive financial information disclosure.

You can configure the safeguard tiers for content filters and denied topics in each guardrail through the AWS Management Console, or programmatically through the Amazon Bedrock SDK and APIs. You can use a new or existing guardrail. For information on how to create or modify a guardrail, see Create your guardrail.

Your existing guardrails are automatically set to the Classic tier by default to make sure you have no impact on your guardrails’ behavior.

Quality enhancements with the Standard tier

According to our tests, the new Standard tier improves harmful content filtering recall by more than 15% with a more than 7% gain in balanced accuracy compared to the Classic tier. A key differentiating feature of the new Standard tier is its multilingual support, maintaining strong performance with over 78% recall and over 88% balanced accuracy for the 14 most common languages.

The enhancements in protective capabilities extend across several other aspects. For example, content filters for prompt attacks in the Standard tier show a 30% improvement in recall and 16% gain in balanced accuracy compared to the Classic tier, while maintaining a lower false positive rate. For denied topic detection, the new Standard tier delivers a 32% increase in recall, resulting in an 18% improvement in balanced accuracy.

These substantial evolutions in detection capabilities for Amazon Bedrock Guardrails, combined with consistently low false positive rates and robust multilingual performance, also represent a significant advancement in content protection technology compared to other commonly available solutions. The multilingual improvements are particularly noteworthy, with the new Standard tier in Amazon Bedrock Guardrails showing consistent performance gains of 33–49% in recall across different language evaluations compared to other competitors’ options.

Benefits of safeguard tiers

Different AI applications have distinct safety requirements based on their audience, content domain, and geographic reach. For example:

  • Customer-facing applications often require stronger protection against potential misuse compared to internal applications
  • Applications serving global customers need guardrails that work effectively across many languages
  • Internal enterprise tools might prioritize controlling specific topics in just a few primary languages

The combination of the safeguard tiers with CRIS for Amazon Bedrock Guardrails also addresses various operational needs with practical benefits that go beyond feature differences:

  • Independent policy evolution – Each policy (content filters or denied topics) can evolve at its own pace without disrupting the entire guardrail system. You can configure these with specific guardrail profiles in CRIS for controlling model versioning in the models powering your guardrail policies.
  • Controlled adoption – You decide when and how to adopt new capabilities, maintaining stability for production applications. You can continue to use Amazon Bedrock Guardrails with your previous configurations without changes and only move to the new tiers and CRIS configurations when you consider it appropriate.
  • Resource efficiency – You can implement enhanced protections only where needed, balancing security requirements with performance considerations.
  • Simplified migration path – When new capabilities become available, you can evaluate and integrate them gradually by policy area rather than facing all-or-nothing choices. This also simplifies testing and comparison mechanisms such as A/B testing or blue/green deployments for your guardrails.

This approach helps organizations balance their specific protection requirements with operational considerations in a more nuanced way than a single-option system could provide.

Configure safeguard tiers on the Amazon Bedrock console

On the Amazon Bedrock console, you can configure the safeguard tiers for your guardrail in the Content filters tier or Denied topics tier sections by selecting your preferred tier.

Use of the new Standard tier requires setting up cross-Region inference for Amazon Bedrock Guardrails, choosing the guardrail profile of your choice.

Configure safeguard tiers using the AWS SDK

You can also configure the guardrail’s tiers using the AWS SDK. The following is an example to get started with the Python SDK:

import boto3
import json

bedrock = boto3.client(
    "bedrock",
    region_name="us-east-1"
)

# Create a guardrail with Standard tier for both Content Filters and Denied Topics
response = bedrock.create_guardrail(
    name="enhanced-safety-guardrail",
    # cross-Region is required for STANDARD tier
    crossRegionConfig={
        'guardrailProfileIdentifier': 'us.guardrail.v1:0'
    },
    # Configure Denied Topics with Standard tier
    topicPolicyConfig={
        "topicsConfig": [
            {
                "name": "Financial Advice",
                "definition": "Providing specific investment advice or financial recommendations",
                "type": "DENY",
                "inputEnabled": True,
                "inputAction": "BLOCK",
                "outputEnabled": True,
                "outputAction": "BLOCK"
            }
        ],
        "tierConfig": {
            "tierName": "STANDARD"
        }
    },
    # Configure Content Filters with Standard tier
    contentPolicyConfig={
        "filtersConfig": [
            {
                "inputStrength": "HIGH",
                "outputStrength": "HIGH",
                "type": "SEXUAL"
            },
            {
                "inputStrength": "HIGH",
                "outputStrength": "HIGH",
                "type": "VIOLENCE"
            }
        ],
        "tierConfig": {
            "tierName": "STANDARD"
        }
    },
    blockedInputMessaging="I cannot respond to that request.",
    blockedOutputsMessaging="I cannot provide that information."
)

Within a given guardrail, the content filter and denied topic policies can each be configured with their own tier independently, giving you granular control over how guardrails behave. For example, you might choose the Standard tier for content filtering while keeping denied topics in the Classic tier, based on your specific requirements.

To migrate existing guardrail configurations to the Standard tier, add the crossRegionConfig and tierConfig sections highlighted in the preceding example to your current guardrail definition. You can do this with the UpdateGuardrail API, or create a new guardrail with the CreateGuardrail API.
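
As a rough sketch (not the post’s exact code), such an update with the Python SDK might look like the following. The identifiers are placeholders, and it assumes you pass your full existing policy configuration back in along with the new crossRegionConfig and tierConfig sections:

import boto3

bedrock = boto3.client("bedrock", region_name="us-east-1")

# Hypothetical migration of an existing guardrail's content filters to the
# Standard tier; carry over the rest of your existing configuration as well.
response = bedrock.update_guardrail(
    guardrailIdentifier="your-guardrail-id",   # placeholder
    name="enhanced-safety-guardrail",
    crossRegionConfig={
        "guardrailProfileIdentifier": "us.guardrail.v1:0"
    },
    contentPolicyConfig={
        "filtersConfig": [
            {"inputStrength": "HIGH", "outputStrength": "HIGH", "type": "SEXUAL"},
            {"inputStrength": "HIGH", "outputStrength": "HIGH", "type": "VIOLENCE"}
        ],
        "tierConfig": {"tierName": "STANDARD"}
    },
    blockedInputMessaging="I cannot respond to that request.",
    blockedOutputsMessaging="I cannot provide that information."
)
print(response["version"])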

Evaluating your guardrails

To thoroughly evaluate your guardrails’ performance, consider creating a test dataset that includes the following:

  • Safe examples – Content that should pass through guardrails
  • Harmful examples – Content that should be blocked
  • Edge cases – Content that tests the boundaries of your policies
  • Examples in multiple languages – Especially important when using the Standard tier

You can also rely on openly available datasets for this purpose. Ideally, your dataset should be labeled with the expected response for each case for assessing accuracy and recall of your guardrails.

With your dataset ready, you can use the Amazon Bedrock ApplyGuardrail API as shown in the following example to efficiently test your guardrail’s behavior for user inputs without invoking FMs. This way, you can save the costs associated with the large language model (LLM) response generation.

import boto3
import json

bedrock_runtime = boto3.client(
    "bedrock-runtime",
    region_name="us-east-1"
)

# Test the guardrail with potentially problematic content
content = [
    {
        "text": {
            "text": "Your test prompt here"
        }
    }
]

response = bedrock_runtime.apply_guardrail(
    content=content,
    source="INPUT",
    guardrailIdentifier="your-guardrail-id",
    guardrailVersion="DRAFT"
)

print(json.dumps(response, indent=2, default=str))

Later, you can repeat the process for the outputs of the LLMs if needed. For this, you can use the ApplyGuardrail API if you want an independent evaluation for models hosted in AWS or with another provider, or you can directly use the Converse API if you intend to use models in Amazon Bedrock. When using the Converse API, the inputs and outputs are evaluated with the same invocation request, optimizing latency and reducing coding overhead.
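
For reference, a minimal sketch of evaluating both input and output in a single Converse call might look like the following; the model ID and guardrail ID are placeholders:

import boto3

bedrock_runtime = boto3.client("bedrock-runtime", region_name="us-east-1")

# Single invocation: the guardrail evaluates the user input and the
# model output as part of the same Converse request.
response = bedrock_runtime.converse(
    modelId="your-model-id",                  # placeholder model ID
    messages=[
        {"role": "user", "content": [{"text": "Your test prompt here"}]}
    ],
    guardrailConfig={
        "guardrailIdentifier": "your-guardrail-id",
        "guardrailVersion": "DRAFT",
        "trace": "enabled"                    # include assessment details
    },
)

print(response["output"]["message"]["content"][0]["text"])
print(response.get("stopReason"))             # "guardrail_intervened" if blocked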

Because your dataset is labeled, you can directly implement a mechanism for assessing the accuracy, recall, and potential false negatives or false positives through libraries such as scikit-learn metrics:

# scoring script
# labels and preds store list of ground truth label and guardrails predictions

from sklearn.metrics import confusion_matrix

tn, fp, fn, tp = confusion_matrix(labels, preds, labels=[0, 1]).ravel()

recall = tp / (tp + fn) if (tp + fn) != 0 else 0
fpr = fp / (fp + tn) if (fp + tn) != 0 else 0
balanced_accuracy = 0.5 * (recall + 1 - fpr)

Alternatively, if you don’t have labeled data or your use cases have subjective responses, you can also rely on mechanisms such as LLM-as-a-judge, where you pass the inputs and guardrails’ evaluation outputs to an LLM for assessing a score based on your own predefined criteria. For more information, see Automate building guardrails for Amazon Bedrock using test-driven development.

Best practices for implementing tiers

We recommend considering the following aspects when configuring your tiers for Amazon Bedrock Guardrails:

  • Start with staged testing – Test both tiers with a representative sample of your expected inputs and responses before making broad deployment decisions.
  • Consider your language requirements – If your application serves users in multiple languages, the Standard tier’s expanded language support might be essential.
  • Balance safety and performance – Evaluate both the accuracy improvements and latency differences to make informed decisions. Consider if you can afford a few additional milliseconds of latency for improved robustness with the Standard tier or prefer a latency-optimized option for more straightforward evaluations with the Classic tier.
  • Use policy-level tier selection – Take advantage of the ability to select different tiers for different policies to optimize your guardrails. You can choose separate tiers for content filters and denied topics, while combining with the rest of the policies and features available in Amazon Bedrock Guardrails.
  • Remember cross-Region requirements – The Standard tier requires cross-Region inference, so make sure your architecture and compliance requirements can accommodate this. With CRIS, your request originates from the Region where your guardrail is deployed, but it might be served from a different Region from the ones included in the guardrail inference profile for optimizing latency and availability.

Conclusion

The introduction of safeguard tiers in Amazon Bedrock Guardrails represents a significant step forward in our commitment to responsible AI. By providing flexible, powerful, and evolving safety tools for generative AI applications, we’re empowering organizations to implement AI solutions that are not only innovative but also ethical and trustworthy. This capabilities-based approach enables you to tailor your responsible AI practices to each specific use case. You can now implement the right level of protection for different applications while creating a path for continuous improvement in AI safety and ethics.

The new Standard tier delivers significant improvements in multilingual support and detection accuracy, making it an ideal choice for many applications, especially those serving diverse global audiences or requiring enhanced protection. This aligns with responsible AI principles by making sure AI systems are fair and inclusive across different languages and cultures. Meanwhile, the Classic tier remains available for use cases prioritizing low latency or those with simpler language requirements, allowing organizations to balance performance with protection as needed.

By offering these customizable protection levels, we’re supporting organizations in their journey to develop and deploy AI responsibly. This approach helps make sure that AI applications are not only powerful and efficient but also align with organizational values, comply with regulations, and maintain user trust.

To learn more about safeguard tiers in Amazon Bedrock Guardrails, refer to Detect and filter harmful content by using Amazon Bedrock Guardrails, or visit the Amazon Bedrock console to create your first tiered guardrail.


About the Authors

Koushik Kethamakka is a Senior Software Engineer at AWS, focusing on AI/ML initiatives. At Amazon, he led real-time ML fraud prevention systems for Amazon.com before moving to AWS to lead development of AI/ML services like Amazon Lex and Amazon Bedrock. His expertise spans product and system design, LLM hosting, evaluations, and fine-tuning. Recently, Koushik’s focus has been on LLM evaluations and safety, leading to the development of products like Amazon Bedrock Evaluations and Amazon Bedrock Guardrails. Prior to joining Amazon, Koushik earned his MS from the University of Houston.

Hang Su is a Senior Applied Scientist at AWS AI. He has been leading the Amazon Bedrock Guardrails Science team. His interest lies in AI safety topics, including harmful content detection, red-teaming, sensitive information detection, among others.

Shyam Srinivasan is on the Amazon Bedrock product team. He cares about making the world a better place through technology and loves being part of this journey. In his spare time, Shyam likes to run long distances, travel around the world, and experience new cultures with family and friends.

Aartika Sardana Chandras is a Senior Product Marketing Manager for AWS Generative AI solutions, with a focus on Amazon Bedrock. She brings over 15 years of experience in product marketing, and is dedicated to empowering customers to navigate the complexities of the AI lifecycle. Aartika is passionate about helping customers leverage powerful AI technologies in an ethical and impactful manner.

Satveer Khurpa is a Sr. WW Specialist Solutions Architect, Amazon Bedrock at Amazon Web Services, specializing in Amazon Bedrock security. In this role, he uses his expertise in cloud-based architectures to develop innovative generative AI solutions for clients across diverse industries. Satveer’s deep understanding of generative AI technologies and security principles allows him to design scalable, secure, and responsible applications that unlock new business opportunities and drive tangible value while maintaining robust security postures.

Antonio Rodriguez is a Principal Generative AI Specialist Solutions Architect at Amazon Web Services. He helps companies of all sizes solve their challenges, embrace innovation, and create new business opportunities with Amazon Bedrock. Apart from work, he loves to spend time with his family and play sports with his friends.

Read More

Structured data response with Amazon Bedrock: Prompt Engineering and Tool Use

Generative AI is revolutionizing industries by streamlining operations and enabling innovation. While textual chat interactions with GenAI remain popular, real-world applications often depend on structured data for APIs, databases, data-driven workloads, and rich user interfaces. Structured data can also enhance conversational AI, enabling more reliable and actionable outputs. A key challenge is that large language models (LLMs) are inherently unpredictable, which makes it difficult for them to produce consistently structured outputs like JSON. This challenge arises because their training data mainly includes unstructured text, such as articles, books, and websites, with relatively few examples of structured formats. As a result, LLMs can struggle with precision when generating JSON outputs, which is crucial for seamless integration into existing APIs and databases. Models vary in their ability to support structured responses, including recognizing data types and managing complex hierarchies effectively. These capabilities can make a difference when choosing the right model.

This blog demonstrates how Amazon Bedrock, a managed service for securely accessing top AI models, can help address these challenges by showcasing two alternative options:

  1. Prompt Engineering: A straightforward approach to shaping structured outputs using well-crafted prompts.
  2. Tool Use with the Bedrock Converse API: An advanced method that enables better control, consistency, and native JSON schema integration.

We will use a customer review analysis example to demonstrate how Bedrock generates structured outputs, such as sentiment scores, with simplified Python code.

Building a prompt engineering solution

This section demonstrates how to use prompt engineering effectively to generate structured outputs with Amazon Bedrock. Prompt engineering involves crafting precise input prompts to guide large language models (LLMs) in producing consistent and structured responses. It is a fundamental technique for developing Generative AI applications, particularly when structured outputs are required. Here are the five key steps we will follow:

  1. Configure the Bedrock client and runtime parameters.
  2. Create a JSON schema for structured outputs.
  3. Craft a prompt and guide the model with clear instructions and examples.
  4. Add a customer review as input data to analyze.
  5. Invoke Bedrock, call the model, and process the response.

While we demonstrate customer review analysis to generate a JSON output, these methods can also be used with other formats like XML or CSV.

Step 1: Configure Bedrock

To begin, we’ll set up some constants and initialize a Python Bedrock client connection object using the Python Boto3 SDK for Bedrock runtime, which facilitates interaction with Bedrock:

Python code configuring AWS Bedrock client with Anthropic Claude model and parameters

The REGION specifies the AWS region for model execution, while the MODEL_ID identifies the specific Bedrock model. The TEMPERATURE constant controls the output randomness, where higher values increase creativity, and lower values maintain precision, such as when generating structured output. MAX_TOKENS determines the output length, balancing cost-efficiency and data completeness.
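A minimal sketch of this configuration is shown below. The Region, model ID, and parameter values are illustrative assumptions; substitute the model and settings you have access to.

import boto3

# Illustrative values; adjust to your account, Region, and model access
REGION = "us-east-1"
MODEL_ID = "anthropic.claude-3-sonnet-20240229-v1:0"
TEMPERATURE = 0.0      # low randomness favors precise, structured output
MAX_TOKENS = 1024      # upper bound on response length

# Bedrock Runtime client used to invoke the model
bedrock_client = boto3.client("bedrock-runtime", region_name=REGION)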

Step 2: Define the Schema

Defining a schema is essential for facilitating structured and predictable model outputs, maintaining data integrity, and enabling seamless API integration. Without a well-defined schema, models may generate inconsistent or incomplete responses, leading to errors in downstream applications. The JSON standard schema used in the code below serves as a blueprint for structured data generation, guiding the model on how to format its output with explicit instructions.

Let’s create a JSON schema for customer reviews with three required fields: reviewId (string, max 50 chars), sentiment (number, -1 to 1), and summary (string, max 200 chars).

JSON schema for customer reviews with fields for ID, sentiment score, and summary, specifying data types and constraints
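Expressed as a Python dictionary, the schema described above might look like the following sketch; the structure mirrors the three required fields, and the exact keywords in the original code may differ.

# JSON schema (as a Python dict) describing the expected structured output
REVIEW_SCHEMA = {
    "type": "object",
    "properties": {
        "reviewId": {"type": "string", "maxLength": 50},
        "sentiment": {"type": "number", "minimum": -1, "maximum": 1},
        "summary": {"type": "string", "maxLength": 200},
    },
    "required": ["reviewId", "sentiment", "summary"],
}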

Step 3: Craft the Prompt text

To generate consistent, structured, and accurate responses, prompts must be clear and well-structured, as LLMs rely on precise input to produce reliable outputs. Poorly designed prompts can lead to ambiguity, errors, or formatting issues, disrupting structured workflows, so we follow these best practices:

  • Clearly outline the AI’s role and objectives to avoid ambiguity.
  • Divide tasks into smaller, manageable numbered steps for clarity.
  • Indicate that a JSON schema will be provided (see Step 5 below) to maintain a consistent and valid structure.
  • Use one-shot prompting with a sample output to guide the model; add more examples if needed for consistency, but avoid too many, as they may limit the model’s ability to handle new inputs.
  • Define how to handle missing or invalid data.

Instructions for AI system to analyze customer reviews and return JSON data with example response format
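The following is a sketch of what such a prompt could look like; the wording is illustrative and follows the best practices listed above rather than reproducing the exact prompt from the screenshot.

# Task instructions for the model (illustrative wording)
PROMPT = """You are an assistant that analyzes customer reviews.

1. Read the customer review provided between <input> tags.
2. Assess the overall sentiment and write a short summary.
3. Respond ONLY with a JSON object that conforms to the JSON schema provided below.
4. If a field cannot be determined from the review, use an empty string for text
   fields and 0 for the sentiment score.

Example response:
{"reviewId": "12345", "sentiment": 0.9, "summary": "Fast delivery and great quality."}
"""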

Step 4: Integrate Input Data

For demonstration purposes, we’ll include a review text in the prompt as a Python variable:

Customer review input data showing positive feedback about delivery, product quality, and service

Separating the input data with <input> tags improves readability and clarity, making the input straightforward to identify and reference. This hardcoded input simulates real-world data integration. For production use, you might dynamically populate input data from APIs or user submissions.
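For illustration, the hardcoded review variable might look like the following sketch; the review text is a stand-in consistent with the positive example described above.

# Sample review wrapped in <input> tags; in production this could come from an API or user submission
INPUT_REVIEW = """<input>
Review ID: 12345
The delivery was fast, the product quality exceeded my expectations,
and customer service was very helpful when I had a question.
</input>"""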

Step 5: Call Bedrock

In this section, we construct a Bedrock request by defining a body object that includes the JSON schema, prompt, and input review data from previous steps. This structured request makes sure the model receives clear instructions, adheres to a predefined schema, and processes sample input data correctly. Once the request is prepared, we invoke Amazon Bedrock to generate a structured JSON response.

AWS Bedrock client setup with model parameters, message content, and API call for customer review analysis

We reuse the MAX_TOKENS, TEMPERATURE, and MODEL_ID constants defined in Step 1. The body object contains essential inference configurations like anthropic_version for model compatibility and the messages array, which includes a single message to provide the model with task instructions, the schema, and the input data. The role defines the “speaker” in the interaction context, with the user value representing the program sending the request. Alternatively, we could simplify the input by combining instructions, schema, and data into one text prompt, which is straightforward to manage but less modular.

Finally, we use the client.invoke_model method to send the request. After invoking, the model processes the request, and the JSON data must be extracted from the Bedrock response (the extraction details are not covered here). For example:

JSON format customer feedback data showing high sentiment (0.9) with positive comments on delivery, quality, and service
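Putting the previous steps together, a minimal end-to-end sketch of the request might look like the following. It assumes the bedrock_client, constants, REVIEW_SCHEMA, PROMPT, and INPUT_REVIEW names from the earlier sketches, and it omits error handling for brevity.

import json

# Build the request body for the Anthropic Messages API on Bedrock
body = {
    "anthropic_version": "bedrock-2023-05-31",
    "max_tokens": MAX_TOKENS,
    "temperature": TEMPERATURE,
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "text",
                 "text": f"{PROMPT}\n\nJSON schema:\n{json.dumps(REVIEW_SCHEMA)}\n\n{INPUT_REVIEW}"}
            ],
        }
    ],
}

response = bedrock_client.invoke_model(modelId=MODEL_ID, body=json.dumps(body))
response_body = json.loads(response["body"].read())

# The generated text lives in the first content block; parse it as JSON
result = json.loads(response_body["content"][0]["text"])
print(result)  # for example: {"reviewId": "12345", "sentiment": 0.9, "summary": "..."}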

Tool Use with the Amazon Bedrock Converse API

In the previous section, we explored a solution using Bedrock Prompt Engineering. Now, let’s look at an alternative approach for generating structured responses with Bedrock.

We will extend the previous solution by using the Amazon Bedrock Converse API, a consistent interface designed to facilitate multi-turn conversations with Generative AI models. The API abstracts model-specific configurations, including inference parameters, simplifying integration.

A key feature of the Converse API is Tool Use (also known as Function Calling), which enables the model to execute external tools, such as calling an external API. This method supports standard JSON schema integration directly into tool definitions, facilitating output alignment with predefined formats. Not all Bedrock models support Tool Use, so make sure you check which models are compatible with this feature.

Building on the previously defined data, the following code provides a straightforward example of Tool Use tailored to our customer review use case:

AWS Bedrock API implementation code showing tool configuration, message structure, and model inference setup for review analysis
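As a concrete reference, a minimal sketch of this Tool Use call is shown below. It reuses REVIEW_SCHEMA, PROMPT, INPUT_REVIEW, TEMPERATURE, and MAX_TOKENS from the earlier sketches; the tool name and description are illustrative assumptions.

import boto3

client = boto3.client("bedrock-runtime", region_name="us-east-1")
MODEL_ID = "anthropic.claude-3-sonnet-20240229-v1:0"  # must be a model that supports Tool Use

# Tool definition that embeds the JSON schema directly
tool_list = [
    {
        "toolSpec": {
            "name": "analyze_customer_review",
            "description": "Returns structured sentiment analysis for a customer review.",
            "inputSchema": {"json": REVIEW_SCHEMA},
        }
    }
]

messages = [
    {"role": "user", "content": [{"text": f"{PROMPT}\n\n{INPUT_REVIEW}"}]}
]

response = client.converse(
    modelId=MODEL_ID,
    messages=messages,
    toolConfig={
        "tools": tool_list,
        # Force the model to answer through the tool so the output matches the schema
        "toolChoice": {"tool": {"name": "analyze_customer_review"}},
    },
    inferenceConfig={"temperature": TEMPERATURE, "maxTokens": MAX_TOKENS},
)

# The structured output arrives as the tool's input arguments
for block in response["output"]["message"]["content"]:
    if "toolUse" in block:
        print(block["toolUse"]["input"])  # dict conforming to REVIEW_SCHEMA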

In this code, the tool_list defines a custom customer review analysis tool with its input schema and purpose, while the messages provide the earlier defined instructions and input data. Unlike the previous prompt engineering example, the earlier defined JSON schema is embedded directly in the tool definition. Finally, the client.converse call combines these components, specifying the tool to use and the inference configuration, resulting in outputs tailored to the given schema and task.

After exploring Prompt Engineering and Tool Use in Bedrock solutions for structured response generation, let’s now evaluate how different foundation models perform across these approaches.

Test Results: Claude Models on Amazon Bedrock

Understanding the capabilities of foundation models in structured response generation is essential for maintaining reliability, optimizing performance, and building scalable, future-proof Generative AI applications with Amazon Bedrock. To evaluate how well models handle structured outputs, we conducted extensive testing of Anthropic’s Claude models, comparing prompt-based and tool-based approaches across 1,000 iterations per model. Each iteration processed 100 randomly generated items, providing broad test coverage across different input variations.

The examples shown earlier in this blog are intentionally simplified for demonstration purposes, where Bedrock performed seamlessly with no issues. To better assess the models under real-world challenges, we used a more complex schema that featured nested structures, arrays, and diverse data types to identify edge cases and potential issues. The outputs were validated for adherence to the JSON format and schema, maintaining consistency and accuracy. The following diagram summarizes the results, showing the number of successful, valid JSON responses for each model across the two demonstrated approaches: Prompt Engineering and Tool Use.

Bar graph showing success rates of prompt vs tool approaches in structured generation for haiku and sonnet AI models

The results demonstrated that all models achieved over 93% success across both approaches, with Tool Use methods consistently outperforming prompt-based ones. While the evaluation was conducted using a highly complex JSON schema, simpler schemas result in significantly fewer issues, often nearly none. Future updates to the models are expected to further enhance performance.
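The post doesn’t specify which validation tooling was used, but as one possible approach you can check both JSON validity and schema adherence with the jsonschema package, as in this sketch. REVIEW_SCHEMA is the schema dictionary defined earlier; for an evaluation like the one above, a more complex schema would be passed instead.

import json
from jsonschema import validate, ValidationError

def is_valid_structured_output(raw_text: str, schema: dict) -> bool:
    """Return True if raw_text parses as JSON and conforms to the given schema."""
    try:
        data = json.loads(raw_text)
        validate(instance=data, schema=schema)
        return True
    except (json.JSONDecodeError, ValidationError):
        return False

# Example: compute a success rate over many model responses
# success_rate = sum(is_valid_structured_output(r, REVIEW_SCHEMA) for r in responses) / len(responses)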

Final Thoughts

In conclusion, we demonstrated two methods for generating structured responses with Amazon Bedrock: Prompt Engineering and Tool Use with the Converse API. Prompt Engineering is flexible, works with Bedrock models (including those without Tool Use support), and handles various schema types (for example, OpenAPI schemas), making it a great starting point. However, it can be fragile, requiring exact prompts and struggling with complex needs. On the other hand, Tool Use offers greater reliability, consistent results, seamless API integration, and runtime validation of the JSON schema for enhanced control.

For simplicity, we did not demonstrate a few areas in this blog. Other techniques for generating structured responses include using models with built-in support for configurable response formats, such as JSON, when invoking models, or applying constrained decoding techniques with third-party libraries like LMQL. Additionally, generating structured data with GenAI can be challenging due to issues like invalid JSON, missing fields, or formatting errors. To maintain data integrity and handle unexpected outputs or API failures, effective error handling, thorough testing, and validation are essential.

To try the Bedrock techniques demonstrated in this blog, follow the steps to Run example Amazon Bedrock API requests through the AWS SDK for Python (Boto3). With pay-as-you-go pricing, you’re only charged for API calls, so little to no cleanup is required after testing. For more details on best practices, refer to the Bedrock prompt engineering guidelines and model-specific documentation, such as Anthropic’s best practices.

Structured data is key to leveraging Generative AI in real-world scenarios like APIs, data-driven workloads, and rich user interfaces beyond text-based chat. Start using Amazon Bedrock today to unlock its potential for reliable structured responses.


About the authors

Adam Nemeth is a Senior Solutions Architect at AWS, where he helps global financial customers embrace cloud computing through architectural guidance and technical support. With over 24 years of IT expertise, Adam previously worked at UBS before joining AWS. He lives in Switzerland with his wife and their three children.

Dominic Searle is a Senior Solutions Architect at Amazon Web Services, where he has had the pleasure of working with Global Financial Services customers as they explore how Generative AI can be integrated into their technology strategies. Providing technical guidance, he enjoys helping customers effectively leverage AWS Services to solve real business problems.

Read More

Using Amazon SageMaker AI Random Cut Forest for NASA’s Blue Origin spacecraft sensor data

Using Amazon SageMaker AI Random Cut Forest for NASA’s Blue Origin spacecraft sensor data

The successful deorbit, descent, and landing of spacecraft on the Moon requires precise control and monitoring of vehicle dynamics. Anomaly detection provides a unique utility for identifying important states that might represent vehicle behaviors of interest. By surfacing unique vehicle behavior points, anomaly detection helps identify critical spacecraft system states so they can be more appropriately addressed and potentially better understood. These identified states can be invaluable for efforts such as system failure mitigation, engineering design improvements, and mission planning. Today, space missions have become more frequent and complex, and the volume of telemetry data generated has grown exponentially. With this growth, methods of analyzing this data for anomalies need to scale effectively without risking missing subtle but important deviations in spacecraft behavior. Fortunately, Amazon SageMaker AI provides powerful AI/ML capabilities that can address these needs.

In this post, we demonstrate how to use SageMaker AI to apply the Random Cut Forest (RCF) algorithm to detect anomalies in spacecraft position, velocity, and quaternion orientation data from NASA and Blue Origin’s demonstration of lunar Deorbit, Descent, and Landing Sensors (BODDL-TP). The presented analysis focuses on detecting anomalies in spacecraft dynamics data, including positions, velocities, and quaternion orientations.

Solution overview

This solution provides an effective approach to anomaly detection in spacecraft data. We begin with data preprocessing and cleaning to produce quality input for our analysis. Using SageMaker AI, we train an RCF model specifically for detecting anomalies in complex spacecraft dynamics data. To handle the substantial volume of telemetry data efficiently, we implement batch processing for anomaly detection across large datasets.

After the model is trained and anomalies are detected, this solution produces robust visualization capabilities, presenting results with highlighted anomalies for clear interpretation of the findings. We use Amazon Simple Storage Service (Amazon S3) for seamless data storage and retrieval, including both raw data and generated plots. Throughout the implementation, we maintain careful cost management of SageMaker AI instances by deleting resources after they’re used to achieve efficient utilization while maintaining performance.

This combination of features creates a scalable, efficient pipeline for processing and analyzing spacecraft dynamics data, making it particularly suitable for space mission applications where reliability and precision are crucial.

Key concepts

In this section, we discuss some key concepts of spacecraft dynamics and machine learning (ML) in this solution.

Position and velocity in spacecraft dynamics

Position and velocity vectors in our NASA Blue Origin DDL data are represented in the Earth-Centered Earth-Fixed (ECEF) coordinate system. This reference frame rotates with the Earth, making it ideal for tracking spacecraft relative to landing sites on the lunar surface. The position vector [x, y, z] in ECEF pinpoints the spacecraft’s location in three-dimensional space. Its origin is at Earth’s center, with the X-axis intersecting the prime meridian at the equator, the Y-axis 90 degrees east in the equatorial plane, and the Z-axis aligned with Earth’s rotational axis. Measured in meters, this position data can reveal crucial information about orbital descent trajectories, landing approach paths, terminal descent profiles, and final touchdown positioning.

Complementing position data, the velocity vector [vx, vy, vz] represents the spacecraft’s rate of position change in each direction. Measured in meters per second, this velocity data is vital for monitoring descent rates, maintaining safe approach speeds, controlling deceleration profiles, and verifying landing constraints.

Our RCF algorithm scrutinizes both position and velocity data for anomalies. In position data, it looks for anomalies that might be caused by unexpected trajectory deviations, unrealistic position jumps, sensor glitches, or data recording errors. For velocity, its detected anomalies might be due to sudden speed changes, unusual acceleration patterns, potential thruster misfires, or navigation system issues.

The fusion of position and velocity data offers a comprehensive view of the spacecraft’s translational motion. When combined with quaternion data describing rotational state, we obtain a complete picture of the spacecraft’s dynamic state during critical mission phases. These metrics play essential roles in mission planning, real-time monitoring, post-flight analysis, safety verification, C2 (command and control), and overall system performance evaluation. By using these rich datasets and advanced anomaly detection techniques, we enhance our ability to achieve mission success and spacecraft safety throughout the dynamic phases of lunar deorbit, descent, and landing.

Quaternions in spacecraft dynamics

Quaternions play a crucial role in spacecraft attitude (orientation) representation. Although Euler angles (roll, pitch, and yaw) are more intuitive, they can suffer from gimbal lock—a loss of one degree of freedom in certain orientations. Quaternions solve this problem by using a four-parameter representation that avoids such singularities. This representation consists of one scalar component (q0) and three vector components (q1, q2, q3), providing a robust mathematical framework for describing spacecraft orientation.

In our NASA Blue Origin DDL data, quaternions serve a vital purpose: they represent the rotation from the spacecraft’s body-fixed coordinate system (CON) to the ECEF frame. This transformation is fundamental to several critical aspects of spacecraft operation, including maintaining precise attitude control during descent, preserving correct thrust vector orientation, facilitating accurate sensor measurements, and computing landing trajectories.

For reliable anomaly detection, quaternion values must satisfy two essential mathematical properties. First, they must maintain unit magnitude, meaning the sum of their squared components (q0² + q1² + q2² + q3² = 1) equals one. Second, they must demonstrate continuity, avoiding sudden jumps that would indicate physically impossible rotations. These properties help confirm the validity of our orientation measurements and the effectiveness of our anomaly detection system.

When our RCF algorithm identifies anomalies in quaternion data, these could signal various issues requiring attention. Such anomalies might indicate sensor malfunctions, attitude control system issues, data transmission errors, or actual problems with spacecraft orientation. By carefully monitoring these quaternion components alongside position and velocity data, we develop a comprehensive understanding of the spacecraft’s dynamic state during the critical phases of deorbit, descent, and landing.
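The two quaternion properties above translate directly into simple sanity checks. The following sketch (tolerances are illustrative assumptions) flags samples that violate unit magnitude or show implausible jumps between consecutive samples.

import numpy as np

def quaternion_checks(q: np.ndarray, norm_tol: float = 1e-3, jump_tol: float = 0.1):
    """Sanity-check an (N, 4) array of quaternions [q0, q1, q2, q3]."""
    # Unit-magnitude check: q0^2 + q1^2 + q2^2 + q3^2 should equal 1
    norms = np.sum(q ** 2, axis=1)
    bad_norm = np.abs(norms - 1.0) > norm_tol

    # Continuity check: large jumps between consecutive samples are physically implausible
    jumps = np.max(np.abs(np.diff(q, axis=0)), axis=1)
    bad_jump = np.concatenate([[False], jumps > jump_tol])

    return bad_norm, bad_jump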

The Random Cut Forest algorithm

Random Cut Forest is an unsupervised algorithm for detecting anomalies in high-dimensional data. The algorithm’s construction begins by creating multiple decision trees, each built through a process of repeatedly cutting the data space with random hyperplanes. This partitioning continues until each data point is isolated, creating a forest of trees that captures the underlying structure of the data. The novelty of RCF lies in the scoring mechanism. Points located in sparse regions of the data space that require fewer cuts to isolate score higher, while points in dense regions that need more cuts score lower. This fundamental principle allows the algorithm to assign anomaly scores inversely proportional to the number of cuts needed to isolate each point. Higher scores, therefore, indicate potential anomalies, making it straightforward to identify unusual patterns in the data.

In our spacecraft dynamics context, we apply RCF to 10-dimensional vectors that combine position (three dimensions), velocity (three dimensions), and quaternion orientation (four dimensions). Each vector represents a specific moment in time during the spacecraft’s mission. Nominal flight patterns create dense regions in this high-dimensional space, while anomalies appear as isolated points in sparse regions. The data is high-dimensional, multivariate time series data with no labels, which RCF handles well while remaining computationally efficient and tolerant of sensor noise. For this use case, RCF is able to detect subtle deviations between data points of spacecraft dynamics while handling the complex relationships between position, velocity, and orientation parameters. These features make RCF an effective tool for spacecraft dynamics monitoring and anomaly detection.
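Assembling those 10-dimensional vectors is straightforward once the telemetry is in a table. The following sketch assumes hypothetical column names; match them to your data file.

import numpy as np
import pandas as pd

# Assumed column names for illustration
POSITION_COLS = ["x", "y", "z"]
VELOCITY_COLS = ["vx", "vy", "vz"]
QUATERNION_COLS = ["q0", "q1", "q2", "q3"]

def build_feature_matrix(df: pd.DataFrame) -> np.ndarray:
    """Stack position, velocity, and quaternion columns into the 10-dimensional
    vectors that the RCF model scores."""
    cols = POSITION_COLS + VELOCITY_COLS + QUATERNION_COLS
    return df[cols].to_numpy(dtype="float32")  # shape: (num_samples, 10)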

Solution architecture

The solution architecture implements anomaly detection for NASA-Blue Origin Lunar DDL data using the RCF algorithm, as illustrated in the following diagram.

Architecture diagram showing data flow from Public DDL Data through Amazon SageMaker AI Domain, including JupyterLab processing and RCF model deployment, with outputs stored in Amazon S3. The entire process runs within a VPC.

Our solution’s data flow begins with public DDL (Deorbit, Descent, and Landing) data securely stored in an S3 bucket. This data is then accessed through a SageMaker AI domain using JupyterLab, providing a powerful and flexible environment for data scientists and engineers. Within JupyterLab, we use a custom notebook to process the raw data and implement our anomaly detection algorithms.

The core of our solution lies in the processing pipeline. It starts in the JupyterLab notebook, where we train an RCF model using SageMaker AI. After it’s trained, this model is deployed to a SageMaker AI endpoint, creating a scalable and responsive anomaly detection service. We then feed our spacecraft dynamics data through this model to identify potential anomalies. The pipeline concludes by generating detailed visualizations of these anomalies, providing clear and actionable insights.

For output, our system saves both the detected anomaly data and the generated plots back to Amazon S3. This makes sure the results are securely stored and accessible for further analysis or reporting. Additionally, we preserve all training data and model outputs in Amazon S3, enabling reproducibility and facilitating iterative improvements to our anomaly detection process. Throughout these operations, we maintain robust security measures, using Amazon Virtual Private Cloud (Amazon VPC) to enforce data privacy and integrity at every step of the process.

Prerequisites

Before standing up the project, you must set up the necessary tools and access rights:

  • The AWS environment should include an active AWS account with appropriate permissions for running ML workloads, along with the AWS Command Line Interface (AWS CLI) for command line operations installed
  • Access to SageMaker AI is essential for the ML implementation
  • On the development side, Python 3.7 or later needs to be installed, along with the key Python packages listed in the repository’s requirements.txt file (installed in a later step)

Set up the solution

The setup process includes accessing the SageMaker AI environment, where all the data analysis and model training is executed.

  1. On the SageMaker AI console, open the SageMaker domain details page.
  2. Open JupyterLab, then create a new Python notebook instance for this project.
  3. When the environment is ready, open a terminal in SageMaker AI JupyterLab to clone the project repository using the following commands:
git clone https://github.com/aws-samples/sample-SageMaker-ai-rcf-anomaly-detection-lunar-spacecraft.git
cd sample-SageMaker-ai-rcf-anomaly-detection-lunar-spacecraft
  4. Install the required Python libraries:

pip install -r requirements.txt

This process will set up the necessary dependencies for running anomaly detection analysis on the spacecraft data.

Execute anomaly detection

Update the bucket_name and file_name variables in the script with your S3 bucket and data file names.

Run the script in JupyterLab as a Jupyter notebook, or run it as a Python script: python Lunar_DDL_AD.py

Upon execution, the notebook or script performs a series of automated tasks to analyze the spacecraft data. It begins by loading and preprocessing the raw data, making sure it’s in the correct format for analysis. Next, it trains and deploys an RCF model using SageMaker AI, establishing the foundation for our anomaly detection system. When the model is operational, it processes the spacecraft dynamics data to identify potential anomalies in position, velocity, and quaternion measurements. Finally, the script generates detailed visualizations of these findings and automatically uploads both the plots and analysis results to Amazon S3 for secure storage and straightforward access.

Code structure

The Python implementation centers around an anomaly detection pipeline, structured in the main script. At its core is the AnomalyDetector class, which orchestrates the entire workflow from data ingestion to visualization. This class contains several methods that together process spacecraft telemetry data and identify anomalies.

The load_and_prepare_data method handles the initial data ingestion and preprocessing, making sure spacecraft measurements are properly formatted for analysis. After the data is prepared, train_and_deploy_model trains the RCF model and deploys it as a SageMaker endpoint. The predict_anomalies method then uses this trained model to identify unusual patterns in the spacecraft’s position, velocity, and quaternion data.

For visualization and storage, the plot_results method creates detailed graphs highlighting detected anomalies, and upload_plot_to_s3 makes sure these visualizations are securely stored in Amazon S3 for future reference and centralized access.

Together, these components create a comprehensive pipeline for processing spacecraft telemetry data and identifying potential anomalies that might warrant further investigation.
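As a simplified illustration, a skeleton of such a class might look like the following. The method names follow the description above; the constructor arguments are assumptions, and the bodies are omitted.

class AnomalyDetector:
    """Orchestrates the RCF anomaly detection workflow described above."""

    def __init__(self, bucket_name: str, file_name: str, threshold_percentile: float = 0.9):
        self.bucket_name = bucket_name
        self.file_name = file_name
        self.threshold_percentile = threshold_percentile

    def load_and_prepare_data(self):
        """Download telemetry from S3 and format position, velocity, and quaternion columns."""
        ...

    def train_and_deploy_model(self):
        """Train the SageMaker AI RCF model and deploy it to a real-time endpoint."""
        ...

    def predict_anomalies(self, batch_size: int = 1000):
        """Score telemetry in batches and flag points above the score threshold."""
        ...

    def plot_results(self):
        """Plot positions, velocities, and quaternions with detected anomalies overlaid."""
        ...

    def upload_plot_to_s3(self, plot_path: str):
        """Store the generated plots back in the S3 bucket."""
        ...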

Configuration

Adjust the following parameters in the script as needed:

  • threshold_percentile for the threshold for anomaly classification
  • RCF hyperparameters in train_and_deploy_model:
    • feature_dim: Number of input features
    • num_samples_per_tree: Random data points per tree
    • num_trees: Number of trees in the algorithmic forest
  • batch_size in predict_anomalies for large datasets

For RCF applications, the hyperparameters and threshold configuration significantly influence anomaly detections. We use the following configuration values for this example:

  • threshold_percentile=0.9
  • RCF hyperparameters in train_and_deploy_model():
    • feature_dim=10
    • num_samples_per_tree=512
    • num_trees=100
  • batch_size=1000 in predict_anomalies()

SageMaker AI instance type size for training and inference can affect anomaly results, processing time, and cost. In this example, we used an ml.m5.4xlarge instance for both training and inference.
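For reference, the following sketch shows how these hyperparameters and the instance type might be passed to the built-in RandomCutForest estimator in the SageMaker Python SDK; the exact setup in the repository script may differ.

import sagemaker
from sagemaker import RandomCutForest

session = sagemaker.Session()
role = sagemaker.get_execution_role()  # assumes you are running inside SageMaker AI

rcf = RandomCutForest(
    role=role,
    instance_count=1,
    instance_type="ml.m5.4xlarge",   # used for training in this example
    num_samples_per_tree=512,
    num_trees=100,
    sagemaker_session=session,
)

# features is the (num_samples, 10) array of position, velocity, and quaternion values
# rcf.fit(rcf.record_set(features))
# predictor = rcf.deploy(initial_instance_count=1, instance_type="ml.m5.4xlarge")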

In addition, SageMaker AI can be integrated with various security features for protecting sensitive data and models. It’s possible to operate in no-internet or VPC-only modes so SageMaker AI instances remain isolated within your Amazon VPC. Secure data access can also be achieved through AWS PrivateLink, enabling private connections to Amazon S3 without internet exposure. Also, integration with AWS Identity and Access Management (IAM) provides fine-grained access control through custom user profiles, enforcing data privacy and adhering to the principle of least privilege, which is especially important when working with sensitive spacecraft telemetry data. These are some of the security enhancements that can be applied to SageMaker AI according to your use case.

Data

The script uses public NASA-Blue Origin Demo of Lunar Deorbit, Descent, and Landing Sensors (BODDL-TP) data, which you can download. Make sure your data is in the correct format with columns for timestamps, positions, velocities, and quaternions.

Results

The script generates plots for positions, velocities, and quaternions. The respective data is plotted with the anomalies overlaid in red, and the plots are saved to the specified S3 bucket. Because of the small scale of the positions plot, anomalies are difficult to observe visually; however, the SageMaker AI RCF algorithm still detects them, and they are highlighted in red. In the following plots, the sharp changes in velocities and quaternions correspond with the anomalies shown.

Time series plot showing spacecraft position data in ECEF coordinates across three dimensions. Blue, green, and yellow lines represent different position components, with red markers indicating detected anomalies. A purple dashed line shows the anomaly score over time.

Unlike the positions plot, the velocities plot shows discontinuities, which are detected as anomalies. This is likely due to rate changes for vehicle maneuvers during the deorbit, descent, and landing demonstration stages.

Time series plot of spacecraft velocity data in ECEF coordinates, showing three velocity components in blue, green, and yellow. Red markers indicate detected anomalies, with a purple dashed line representing the anomaly score throughout the time series.

Similar to the velocities plot, the quaternions plot shows sharp changes, which are also detected as anomalies. This is likely due to rotational accelerations during vehicle maneuvers of the deorbit, descent, and landing demonstration stages.

Time series plot displaying spacecraft quaternion orientation data with four components shown in blue, green, yellow, and cyan. Red markers highlight detected anomalies, and a purple dashed line shows the anomaly score varying over time.

These anomalies most likely represent the lunar spacecraft vehicle dynamics at key maneuver stages of the deorbit, descent, and landing demonstration. Momentum wheels, thrusters, and various other C2 applications could be the cause of the observed abrupt positional, velocity, and quaternion changes being detected as anomalous. These results flag data points of interest for more precise, and potentially valuable, follow-up analysis that improves vehicle health and status awareness.

Clean up

The provided script includes SageMaker AI endpoint deletion after training and inference to avoid any unnecessary charges. If you’re using JupyterLab and want to further avoid charges, stop the SageMaker AI instance running the RCF JupyterLab Python notebook.

Conclusion

In this post, we demonstrated how the SageMaker AI RCF algorithm can effectively detect anomalies in spacecraft dynamics data from NASA and Blue Origin’s lunar Deorbit, Descent, and Landing demonstration. By detecting anomalies in position, velocity, and quaternion orientation data, we’ve shown how ML can enhance space mission analysis, situational awareness, and autonomy. The built-in algorithm processes complex, multi-dimensional spacecraft telemetry data. Through efficient batch processing, we can analyze large-scale mission data effectively, and our visualization approach enables quick identification of potential issues in spacecraft dynamics. From there, the solution’s scalability allows it to adapt to varying data volumes and mission durations, making it potentially suitable for a wide range of space applications.

Although this solution applies to a lunar mission demonstration, the approach could have broad applications throughout the space industry. You can adapt the same architecture for various space operations, such as landing missions on other celestial bodies, orbital rendezvous, space station docking, and satellite constellations. This integration of AWS services with aerospace applications creates a robust, secure, and scalable platform for space mission analytics, which is becoming increasingly valuable as we continue to execute missions in the space environment.

Looking forward, this solution opens many possibilities for enhancement and expansion. Real-time anomaly detection could be implemented for live mission data, providing immediate insights during critical operations. Also, the system could be enhanced by incorporating additional spacecraft parameters and sensor data, and automated alert services could be developed to provide immediate notification of detected anomalies. In addition, further developments might include extending the analysis to incorporate predictive ML models and creating custom metrics tailored to specific mission requirements. These potential advancements would continue to build upon the foundation we’ve established, creating even more powerful tools for spacecraft mission analysis.

The code and implementation details are available in our GitHub repository, enabling you to adapt and enhance the solution for your specific needs.

For space operations, the combination of cloud computing and ML has strong potential to play an increasingly crucial role in ensuring mission success. This solution demonstrates just one of many possible applications of AWS services for improving spacecraft mission compute and data analysis.

To learn more about the AWS services used in this solution, refer to Guide to getting set up with Amazon SageMaker AI, Train a Model with Amazon SageMaker, and the JupyterLab user guide.


About the authors

Dr. Ian Lunsford is an Aerospace AI Engineer at AWS Professional Services. He integrates cloud services into aerospace applications. Additionally, Ian focuses on building AI/ML solutions using AWS services.

Nick Biso is a Machine Learning Engineer at AWS Professional Services. He solves complex organizational and technical challenges using data science and engineering. In addition, he builds and deploys AI/ML models on the AWS Cloud. His passion extends to his proclivity for travel and diverse cultural experiences.

Read More

Build an intelligent multi-agent business expert using Amazon Bedrock

Build an intelligent multi-agent business expert using Amazon Bedrock

In this post, we demonstrate how to build a multi-agent system using multi-agent collaboration in Amazon Bedrock Agents to solve complex business questions in the biopharmaceutical industry. We show how specialized agents in research and development (R&D), legal, and finance domains can work together to provide comprehensive business insights by analyzing data from multiple sources.

Amazon Bedrock Agents and multi-agent collaboration

Business intelligence and market research enable large and small businesses to capture industry trends and the competitive landscape through data, and they inform key business strategies. For example, biopharmaceutical companies use data to understand drug market size, clinical trials, and the prevalence of side effects, and to track innovation and pitfalls by analyzing patents and legal briefs to form investment strategies. In doing so, organizations face the challenge of accessing and analyzing information scattered across multiple data sources. Consolidating and querying these disparate datasets can be a complex and time-consuming task, requiring developers to navigate different data formats, query languages, and access mechanisms. Additionally, gaining a comprehensive understanding of an organization’s operations often requires combining data insights from various segments, such as legal, finance, and R&D.

Generative AI agentic systems have emerged as a promising solution, enabling organizations to use generative AI for autonomous reasoning and action-based tasks. However, many agentic systems to date are built with a single-agent setup, which poses challenges in a complex business environment. Besides the challenge of managing multiple data sources, encoding information and guidance for multiple business domains might cause the prompt for an agent’s large language model (LLM) to grow to such an extent that it suffers from “forgetting the middle” of a long context. Therefore, there is a trade-off between the breadth and depth of knowledge for each domain that can be encoded in an agent effectively. Additionally, the use of a single LLM with an agent limits cost, latency, and accuracy optimizations for the selected model.

Amazon Bedrock Agents and its multi-agent collaboration feature provides powerful, enterprise-ready solutions for addressing these challenges and building intelligent and automated agentic systems. As a managed service within the AWS ecosystem, Amazon Bedrock Agents offers seamless integration with AWS data sources, built-in security controls, and enterprise-grade scalability. It contains built-in support for additional Amazon Bedrock features such as Amazon Bedrock Guardrails and Amazon Bedrock Knowledge Bases. The service significantly reduces deployment overhead, empowering developers to focus on agent logic through an API-driven, familiar AWS Cloud environment and console. The supervisor agent model with specialized sub-agents enables efficient distributed problem-solving, breaking down complex tasks with intelligent routing.

In this post, we discuss how to build a multi-agent system using multi-agent collaboration to solve complex business questions faced in a fictional biopharmaceutical company that requires expertise and data from three specialized domains: R&D, legal, and finance. We demonstrate how data in disparate sources can be combined intelligently to support complex reasoning, and how agent collaboration facilitates open-ended question answering, such as “What are the potential legal and financial risks associated with the side effects of therapeutic product X, and how might they affect the company’s long-term stock performance?”

The following image depicts the roles of the supervisor agent and the three sub-agents used in our pharmaceutical example, along with the benefits of using multi-agent collaboration.

Solution overview

Our use case centers around PharmaCorp, a fictional pharmaceutical company, which faces the challenge of managing vast amounts of data across its Pharma R&D, Legal, and Finance divisions. Each division works with structured data, such as stock prices, and unstructured data, such as clinical trial reports. The data for each division is located in different data stores, which makes it difficult for teams to access cross-functional insights and slows down decision-making processes.

To address this, we build a multi-agent system with domain-specific sub-agents for each division using multi-agent collaboration within Amazon Bedrock Agents. These sub-agents efficiently handle data queries and information retrieval, and the main agent passes necessary context between sub-agents and synthesizes insights across divisions. The multi-agent setup empowers PharmaCorp to access expertise and information within minutes that would otherwise take hours of human effort to compile. This approach breaks down data silos and strengthens organizational collaboration.

The following architecture diagram illustrates the solution setup.

The main agent acts as an orchestrator, asking questions to multiple sub-agents and synthesizing retrieved data:

  • The R&D sub-agent has access to clinical trial data through Amazon Athena and unstructured clinical trial reports
  • The legal sub-agent has access to unstructured patents and lawsuit legal briefs
  • The finance sub-agent has access to research budget data through Athena and historical stock price data stored in Amazon Redshift

Each sub-agent has granular permissions to only access the data in its domain. Detailed information about the data and models used and main agent interactions are described in the following sections.

Dataset

We generated synthetic data using Anthropic’s Claude 3.5 Sonnet model, comprising three domains: Pharma R&D, Legal, and Finance. The domains contain structured data stored in SQL tables and unstructured data that is used in domain knowledge bases. The data can be accessed through the following files: R&D, Legal, Finance.

Efforts have been made to align synthetic data within and across domains. For example, clinical trial reports map to each trial and side effects in related tables. Rises and dips in stock prices tend to correlate with patents and lawsuits. However, there might still be minor inconsistencies between data.

Pharma R&D domain

The Pharma R&D domain has three tables: Drugs, Drug Trials, and Side Effects. Each table is queried from Amazon Simple Storage Service (Amazon S3) through Athena. The Drugs table contains information on the company’s available products, therapeutic areas, target conditions, mechanisms of action, development phase, discovery year, and lead scientist. The Drug Trials table contains information on specific trials for each drug, such as phase, dates, number of participants, and outcomes. The Side Effects table contains side effects, frequency, and severity reported from each trial.

The unstructured data for the Pharma R&D domain consists of synthetic clinical trial reports for each trial, which contain more detailed information about the trial design, outcomes, and recommendations.

Legal domain

The Legal domain has unstructured data consisting of patents and lawsuit legal briefs. The patents contain information about invention background, description, and experimental results. The legal briefs contain information about lawsuit court proceedings, outcomes, and analysis.

Finance domain

The Finance domain has two tables: Stock Price and Research Budgets. The Stock Price table is stored in Amazon Redshift and contains PharmaCorp’s historical monthly stock prices and volume. Amazon Redshift is a database optimized for online analytical processing (OLAP), which generally entails analyzing large amounts of data and performing complex analysis, as might be done by analysts looking at historical stock prices. The Research Budgets table is accessed from Amazon S3 through Athena and contains annual budgets for each department.

The data setup showcases how a multi-agent framework can synthesize data from multiple data sources and databases. In practice, data could also be stored in other databases such as Amazon Relational Database Service (Amazon RDS).

Models used

Anthropic’s Claude 3 Sonnet, which has a good balance of intelligence and speed, is used in this multi-agent demonstration. With the multi-agent setup, you can also employ a more intelligent or a smaller, faster model depending on the use case and requirements such as accuracy and latency.

Prerequisites

To deploy this solution, you need the following prerequisites:

Deploy the solution

To deploy the solution resources, we use AWS CloudFormation. The CloudFormation template creates two S3 buckets, two AWS Lambda functions, an Amazon Bedrock agent, an Amazon Bedrock knowledge base, and an Amazon Elastic Compute Cloud (Amazon EC2) instance.

Download the provided CloudFormation template, then complete the following steps to deploy the stack:

  1. Open the AWS CloudFormation console (the preferred AWS Regions are us-west-2 or us-east-1 for the solution).
  2. Choose Stacks in the navigation pane.
  3. Choose Create stack and With new resources (standard).
  4. Select Choose existing template and upload the provided CloudFormation template file.
  5. Enter a stack name, then choose Next.
  6. Leave the stack settings as default and choose Next.
  7. Select the acknowledgement check box and create the stack.

After the stack is complete, you can view the new supervisor agent on the Amazon Bedrock console.

An example of agent collaboration

After you deploy the solution, you can test the communication among agents that help answer complex questions across PharmaCorp’s three divisions. For example, we ask the main agent “How did the results of NeuroClear’s Phase 2 trials affect PharmaCorp’s stock price, patent filings, and potential legal risks?”

This question requires a comprehensive understanding of the relationships between NeuroClear’s clinical trial results, financial impacts, and legal outcomes for PharmaCorp. Let’s see how the multi-agent system addresses this complex query.
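If you prefer to send this question programmatically rather than through the Amazon Bedrock console test window, a minimal sketch using the bedrock-agent-runtime invoke_agent API might look like the following; the agent ID and alias ID are placeholders you would replace with the values from the deployed supervisor agent.

import uuid
import boto3

AGENT_ID = "XXXXXXXXXX"        # placeholder: supervisor agent ID from the stack
AGENT_ALIAS_ID = "YYYYYYYYYY"  # placeholder: agent alias ID

client = boto3.client("bedrock-agent-runtime", region_name="us-west-2")

response = client.invoke_agent(
    agentId=AGENT_ID,
    agentAliasId=AGENT_ALIAS_ID,
    sessionId=str(uuid.uuid4()),
    inputText=(
        "How did the results of NeuroClear's Phase 2 trials affect PharmaCorp's "
        "stock price, patent filings, and potential legal risks?"
    ),
)

# The response is an event stream; concatenate the text chunks as they arrive
answer = ""
for event in response["completion"]:
    if "chunk" in event:
        answer += event["chunk"]["bytes"].decode("utf-8")
print(answer)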

The main agent identifies that it needs input from three specialized sub-agents to fully assess how NeuroClear’s clinical trial results might impact PharmaCorp’s legal and financial performance. It breaks down the user’s question into key components and develops a plan to gather detailed insights from each expert. The following is its chain-of-thought reasoning, task breakdown, and sub-agent routing:

This is a complex question that requires information from all three divisions to 
provide a comprehensive answer. 
I need to:
Get information about NeuroClear's Phase 2 trial results from Pharmaceutical division
Check stock price movements around the trial period from Finance division
Look for any patent filings and legal implications from Legal division

Then, the main agent asks a question to the R&D sub-agent:

Please provide details about NeuroClear's Phase 2 clinical trials, including trial dates,
outcomes, success rates, and any significant findings or side effects.

The R&D sub-agent correctly plans and executes its own sequence of steps, which include performing queries and searching its own knowledge base. It responds with the following:

Based on the information from the knowledge base search and Athena database queries,
here are the key details about NeuroClear's Phase 2 clinical trials:

* There were two Phase 2 trials conducted for NeuroClear, a novel drug for treating 
Alzheimer's disease.
* The first trial ran from June 2019 to May 2020 with 250 participants. It showed a 
moderate success rate of 0.65 in slowing cognitive decline compared to placebo.
* The second trial ran from August 2021 to August 2022 with 400 participants. It had 
a slightly higher success rate of 0.68.
* Both trials were randomized, double-blind, and placebo-controlled. The dosages tested 
were 25mg, 50mg, and 75mg once daily.
* Primary endpoints were change in ADAS-Cog score and secondary endpoints included CDR-SB, 
hippocampal volume, and CSF biomarkers.
* Notable side effects observed included insomnia (frequency 0.14-0.15, severity 3), 
anxiety (frequency 0.12-0.13, severity 3), tremors (frequency 0.09-0.1, severity 2), 
and weight loss (frequency 0.08, severity 2).

The trials showed NeuroClear had moderate efficacy in slowing cognitive decline in 
Alzheimer's patients, with an acceptable safety profile. The results supported 
advancing NeuroClear to larger Phase 3 trials to further evaluate its potential 
as a treatment for Alzheimer's disease.

The main agent takes this information and determines its next step:

Now that I have the trial dates and outcomes, I should check the stock price movements
during these periods and any patent/legal activities.

It asks the finance sub-agent the following:

Please provide stock price analysis for two periods:

1. June 2019 - May 2020
2. August 2021 - August 2022

Specifically looking for significant price movements and trading volume changes around
the completion dates of these periods, which correspond to NeuroClear's Phase 2 trial
results.

The finance sub-agent plans and executes queries on its stock database and responds with the following:

Stock Price Analysis for NeuroClear's Phase 2 Trial Periods:

Period 1 (June 2019 - May 2020):
* Final closing price: $34.69
* Average price: $31.43
* Price range: $27.99 - $34.69
* Average daily volume: 630,874 shares
* The period ended with moderate trading volume (566,927 shares)

Period 2 (August 2021 - August 2022):
* Final closing price: $53.71
* Average price: $49.79
* Price range: $44.11 - $55.41
* Average daily volume: 640,098 shares
* The period ended with slightly lower volume (491,629 shares)

Key Observations:
* Significant price appreciation between periods, with Period 2 showing ~58% 
higher average prices
* Both periods showed similar trading volume patterns, with Period 2 showing 
slightly higher average volume
* The stock showed slight negative movement on both completion dates 
(-0.20% and -0.77% respectively)

The main agent then asks a question to the legal sub-agent:

Please provide information about:
1. Any patent filings related to NeuroClear, particularly during or after the 
Phase 2 trials (2019-2022)
2. Any legal risks or challenges identified related to NeuroClear's development 
or trial results during this period.

The legal sub-agent searches its knowledge base and responds with the following:

Based on the patent search results, two key patent filings related to NeuroClear 
during or after its Phase 2 trials from 2019-2022 were:

1. A blood-brain barrier penetration enhancer system using nanoparticles to improve 
delivery of neurological drugs like NeuroClear to the brain (Patent ID: PAT004, 
filed in 2022). This technology could have been developed to address challenges 
with getting NeuroClear across the blood-brain barrier during clinical trials.
2. A targeted neurological receptor blocker designed to selectively inhibit 
receptors implicated in neurodegenerative diseases (Patent ID: PAT003, filed in 2021). 
This suggests NeuroClear may work by blocking certain neural receptors involved in 
conditions like Alzheimer's disease.

The legal search results identified a few key legal risks and challenges related to 
NeuroClear's development and trial results from 2019-2022:

1. A product liability lawsuit alleging NeuroClear caused cognitive impairment in a 
patient, with evidence that PharmaCorp failed to properly warn about this risk despite 
internal knowledge of it (Case 2023-CV-006).
2. A class action lawsuit alleging failure to warn about the risk of seizures associated 
with NeuroClear use, based on medical records, expert testimony, and internal PharmaCorp 
documents showing awareness of this risk.
3. Another product liability case where PharmaCorp was found liable for failing to warn 
about risk of liver damage from a different product, suggesting potential issues with 
adequate risk disclosure for NeuroClear as well.

Finally, the main agent synthesizes the information it has received into a response to the user’s question:

NeuroClear's Phase 2 trials had mixed impacts across different aspects of 
PharmaCorp's operations:

Clinical Results:
- Two Phase 2 trials (2019-2020 and 2021-2022) showed moderate success in treating
 Alzheimer's disease
- Success rates of 0.65 and 0.68 respectively
- Notable side effects included insomnia, anxiety, tremors, and weight loss

Stock Performance:
- The stock showed resilience during both trial periods
- Average stock price increased significantly (58%) between the first and second trial periods
- Interestingly, both trial completion dates saw slight negative price movements (-0.20% and -0.77%)

Patents and Legal Implications:
- Two significant patents were filed:
* Blood-brain barrier penetration enhancer system (2022)
* Targeted neurological receptor blocker (2021)
- However, several legal risks emerged:
* Product liability lawsuit regarding cognitive impairment
* Class action lawsuit concerning seizure risks
* Potential issues with risk disclosure based on previous cases

These results suggest that while NeuroClear showed promising clinical potential 
and led to valuable intellectual property, it also exposed PharmaCorp to significant 
legal risks that could impact future development and commercialization.

Through this example, we can see how multi-agent collaboration enables a comprehensive analysis of complex business questions by using specialized knowledge from different domains. The main agent effectively orchestrates the interaction between sub-agents, synthesizing their insights to provide a holistic answer that considers R&D, financial, and legal aspects of the NeuroClear clinical trials and their potential impacts on PharmaCorp.

Clean up

When you’re done testing the agent, complete the following steps to clean up your AWS environment and avoid unnecessary charges:

  1. Delete the S3 buckets:
    1. On the Amazon S3 console, empty the buckets structured-data-${AWS::AccountId}-${AWS::Region} and unstructured-data-${AWS::AccountId}-${AWS::Region}. Make sure that both of these buckets are empty by deleting the files.
    2. Select each file, choose Delete, and confirm by entering the bucket name.
  2. Delete the Lambda functions:
    1. On the Lambda console, select the CopyDataLambda function.
    2. Choose Delete and confirm the action.
    3. Repeat these steps for the CopyUnstructuredDataLambda function.
  3. Delete the Amazon Bedrock agent:
    1. On the Amazon Bedrock console, choose Agents in the navigation pane.
    2. Select the agent, then choose Delete.
  4. Delete the Amazon Bedrock knowledge base in Bedrock:
    1. On the Amazon Bedrock console, choose Knowledge bases under Builder tools in the navigation pane.
    2. Select the knowledge base and choose Delete.
  5. Delete the EC2 instance:
    1. On the Amazon EC2 console, choose Instances in the navigation pane.
    2. Select the EC2 instance you created, then choose Delete.

Business impact

Implementing this multi-agent system using Amazon Bedrock Agents can provide significant benefits for pharmaceutical companies. By automating data retrieval and analysis across domains, companies can reduce research time and enable faster, data-driven decision-making, especially when domain experts are distributed across different organizational units with limited direct interaction. The system’s ability to provide comprehensive, cross-functional insights in minutes can lead to improved risk mitigation, because potential legal and financial issues can be identified earlier by connecting disparate data points. This automation also allows for more effective allocation of human resources, freeing up experts to focus on high-value tasks rather than routine data analysis.

Our example demonstrates the power of multi-agent systems in pharmaceutical research and development, but the applications of this technology extend far beyond a single use case. For example, biotech companies can accelerate the discovery of cancer biomarkers by having specialist agents extract genomic signals from Amazon Redshift, perform Kaplan-Meier survival analyses, and interpret CT scans in parallel. Large health systems could automatically aggregate patient records, lab results, and trial data to streamline care coordination and flag urgent cases. Travel agencies can orchestrate end‑to‑end itineraries, and firms can manage personalized client communications. For more information on potential applications, see the following posts:

Although the potential of multi-agent systems is compelling across these diverse applications, it’s important to understand the practical considerations in implementing such systems. Complex orchestration workflows can drive up inference costs through multiple model calls, increase end‑to‑end latency, amplify testing and maintenance requirements, and introduce operational overhead around rate limits, retries, and inter‑agent or data connection protocols. However, the state of the art is rapidly advancing. New generations of faster, cheaper models can help keep per‑call expenses and latency low, and more powerful models can accomplish tasks in fewer turns. Observability tools offer end‑to‑end tracing and dashboarding for multi‑agent pipelines. Finally, protocols like Anthropic’s Model Context Protocol are beginning to standardize the way agents access data, paving the way for robust multi‑agent ecosystems.

Conclusion

In this post, we explored how a multi-agent generative AI system, implemented with Amazon Bedrock Agents using multi-agent collaboration, addresses data access and analysis challenges across multiple business domains. Through a demo use case with a fictional pharmaceutical company managing data across its different divisions, we showcased how specialized sub-agents tailored to each domain streamline information retrieval and synthesis. Each sub-agent uses domain-optimized models and securely accesses relevant data sources, enabling the organization to generate cross-functional insights.

With this multi-agent architecture, organizations can overcome data silos, enhance collaboration, and achieve efficient, data-driven decision-making while optimizing for cost, latency, and security. Amazon Bedrock Agents with multi-agent collaboration facilitates this setup by providing a secure, scalable framework that manages the collaboration, communication, and task delegation between agents. Explore other demos and workshops about multi-agent collaboration in Amazon Bedrock in the following resources:


About the authors

Justin Ossai is a GenAI Labs Specialist Solutions Architect based in Dallas, TX. He is a highly passionate IT professional with over 15 years of technology experience. He has designed and implemented solutions with on-premises and cloud-based infrastructure for small and enterprise companies.

Michael Hsieh is a Principal AI/ML Specialist Solutions Architect. He works with HCLS customers to advance their ML journey with AWS technologies and his expertise in medical imaging. As a Seattle transplant, he loves exploring the natural beauty the city has to offer, such as the hiking trails, scenic kayaking in South Lake Union, and the sunsets at Shilshole Bay.

Shreya Mohanty is a Deep Learning Architect at the AWS Generative AI Innovation Center, where she partners with customers across industries to design and implement high-impact GenAI-powered solutions. She specializes in translating customer goals into tangible outcomes that drive measurable impact.

Rachel Hanspal is a Deep Learning Architect at AWS Generative AI Innovation Center, specializing in end-to-end GenAI solutions with a focus on frontend architecture and LLM integration. She excels in translating complex business requirements into innovative applications, leveraging expertise in natural language processing, automated visualization, and secure cloud architectures.

Read More

Driving cost-efficiency and speed in claims data processing with Amazon Nova Micro and Amazon Nova Lite

Driving cost-efficiency and speed in claims data processing with Amazon Nova Micro and Amazon Nova Lite

Amazon operations span the globe, touching the lives of millions of customers, employees, and vendors every day. From the vast logistics network to the cutting-edge technology infrastructure, this scale is a testament to the company’s ability to innovate and serve its customers. With this scale comes a responsibility to manage risks and address claims—whether they involve worker’s compensation, transportation incidents, or other insurance-related matters. Risk managers oversee claims against Amazon throughout their lifecycle. Claim documents from various sources grow as the claims mature, with a single claim consisting of 75 documents on average. Risk managers are required to strictly follow the relevant standard operating procedure (SOP) and review the evolution of dozens of claim aspects to assess severity and to take proper actions, reviewing and addressing each claim fairly and efficiently. But as Amazon continues to grow, how are risk managers empowered to keep up with the growing number of claims?

In December 2024, an internal technology team at Amazon built and implemented an AI-powered solution as applied to data related to claims against the company. This solution generates structured summaries of claims under 500 words across various categories, improving efficiency while maintaining accuracy of the claims review process. However, the team faced challenges with high inference costs and processing times (3–5 minutes per claim), particularly as new documents are added. Because the team plans to expand this technology to other business lines, they explored Amazon Nova Foundation Models as potential alternatives to address cost and latency concerns.

The following graphs show performance versus latency and performance versus cost for various foundation models on the claim dataset.

Performance comparison charts of language models like Sonnet and Nova, plotting BERT-F1 scores against operational metrics

The evaluation of the claims summarization use case proved that Amazon Nova foundation models (FMs) are a strong alternative to other frontier large language models (LLMs), achieving comparable performance with significantly lower cost and higher overall speed. The Amazon Nova Lite model demonstrates strong summarization capabilities in the context of long, diverse, and messy documents.

Solution overview

The summarization pipeline begins by processing raw claim data using AWS Glue jobs. It stores data into intermediate Amazon Simple Storage Service (Amazon S3) buckets, and uses Amazon Simple Queue Service (Amazon SQS) to manage summarization jobs. Claim summaries are generated by AWS Lambda using foundation models hosted in Amazon Bedrock. We first filter the irrelevant claim data using an LLM-based classification model based on Nova Lite and summarize only the relevant claim data to reduce the context window. Considering relevance and summarization requires different levels of intelligence, we select the appropriate models to optimize cost while maintaining performance. Because claims are summarized upon arrival of new information, we also cache the intermediate results and summaries using Amazon DynamoDB to reduce duplicate inference and reduce cost. The following image shows a high-level architecture of the claim summarization use case solution.

AWS claims summarization workflow diagram integrating data preprocessing, queuing, AI processing, and storage services
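
To make the caching and summarization step concrete, the following minimal sketch shows what a Lambda consumer of the summarization queue could look like. It is illustrative only: the table name, queue payload fields, prompt, and model ID are assumptions rather than the team's production code.

# Illustrative sketch only: consume an SQS summarization job, reuse a cached summary
# from DynamoDB when available, and otherwise call a Bedrock-hosted Nova model.
# Table name, payload fields, prompt, and model ID are placeholders.
import hashlib
import json

import boto3

dynamodb = boto3.resource("dynamodb")
bedrock_runtime = boto3.client("bedrock-runtime")
cache_table = dynamodb.Table("claim-summary-cache")  # hypothetical table name


def handler(event, context):
    for record in event["Records"]:  # one record per SQS message
        job = json.loads(record["body"])
        claim_text = job["claim_text"]  # hypothetical payload field
        cache_key = hashlib.sha256(claim_text.encode("utf-8")).hexdigest()

        # Skip inference if this exact claim content was already summarized
        if "Item" in cache_table.get_item(Key={"cache_key": cache_key}):
            continue

        response = bedrock_runtime.converse(
            modelId="us.amazon.nova-lite-v1:0",  # placeholder model ID
            messages=[{
                "role": "user",
                "content": [{"text": "Summarize the relevant claim aspects in under 500 words:\n" + claim_text}],
            }],
        )
        summary = response["output"]["message"]["content"][0]["text"]
        cache_table.put_item(Item={"cache_key": cache_key, "summary": summary})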

Although the Amazon Nova team has published performance benchmarks across several different categories, claims summarization is a unique use case given its diversity of inputs and long context windows. This prompted the technology team owning the claims solution to investigate further with their own benchmarking study. To assess the performance, speed, and cost of Amazon Nova models for their specific use case, the team curated a benchmark dataset consisting of 95 pairs of claim documents and verified aspect summaries. Claim documents range from 1,000 to 60,000 words, with most being around 13,000 words (median 10,100). The verified summaries of these documents are usually brief, containing fewer than 100 words. Inputs to the models include diverse types of documents and summaries that cover a variety of aspects in production.

According to benchmark tests, the team observed that Amazon Nova Lite is twice as fast and costs 98% less than their current model. Amazon Nova Micro is even more efficient, running four times faster and costing 99% less. The substantial cost-effectiveness and latency improvements offer more flexibility for designing a sophisticated model and scaling up test compute to improve summary quality. Moreover, the team also observed that the latency gap between Amazon Nova models and the next best model widened for long context windows and long output, making Amazon Nova a stronger alternative in the case of long documents while optimizing for latency. Additionally, the team performed this benchmarking study using the same prompt as the current in-production solution with seamless prompt portability. Despite this, Amazon Nova models successfully followed instructions and generated the desired format for post-processing. Based on the benchmarking and evaluation results, the team used Amazon Nova Lite for classification and summarization use cases.

Conclusion

In this post, we shared how an internal technology team at Amazon evaluated Amazon Nova models, resulting in notable improvements in inference speed and cost-efficiency. Looking back on the initiative, the team identified several critical factors that offer key advantages:

  • Access to a diverse model portfolio – The availability of a wide array of models, including compact yet powerful options such as Amazon Nova Micro and Amazon Nova Lite, enabled the team to quickly experiment with and integrate the most suitable models for their needs.
  • Scalability and flexibility – The cost and latency improvements of the Amazon Nova models allow for more flexibility in designing sophisticated models and scaling up test compute to improve summary quality. This scalability is particularly valuable for organizations handling large volumes of data or complex workflows.
  • Ease of integration and migration – The models’ ability to follow instructions and generate outputs in the desired format simplifies post-processing and integration into existing systems.

If your organization has a similar use case of large document processing that is costly and time-consuming, the above evaluation exercise shows that Amazon Nova Lite and Amazon Nova Micro can be game-changing. These models excel at handling large volumes of diverse documents and long context windows—perfect for complex data processing environments. What makes this particularly compelling is the models’ ability to maintain high performance while significantly reducing operational costs. It’s important to iterate over new models for all three pillars—quality, cost, and speed. Benchmark these models with your own use case and datasets.

You can get started with Amazon Nova on the Amazon Bedrock console. Learn more at the Amazon Nova product page.


About the authors

Aitzaz Ahmad is an Applied Science Manager at Amazon, where he leads a team of scientists building various applications of machine learning and generative AI in finance. His research interests are in natural language processing (NLP), generative AI, and LLM agents. He received his PhD in electrical engineering from Texas A&M University.

Stephen Lau is a Senior Manager of Software Development at Amazon, where he leads teams of scientists and engineers. His team develops powerful fraud detection and prevention applications, saving Amazon billions annually. They also build Treasury applications that optimize Amazon’s global liquidity while managing risks, significantly impacting the financial security and efficiency of Amazon.

Yong Xie is an applied scientist in Amazon FinTech. He focuses on developing large language models and generative AI applications for finance.

Kristen Henkels is a Sr. Product Manager – Technical in Amazon FinTech, where she focuses on helping internal teams improve their productivity by leveraging ML and AI solutions. She holds an MBA from Columbia Business School and is passionate about empowering teams with the right technology to enable strategic, high-value work.

Shivansh Singh is a Principal Solutions Architect at Amazon. He is passionate about driving business outcomes through innovative, cost-effective and resilient solutions, with a focus on machine learning, generative AI, and serverless technologies. He is a technical leader and strategic advisor to large-scale games, media, and entertainment customers. He has over 16 years of experience transforming businesses through technological innovations and building large-scale enterprise solutions.

Dushan Tharmal is a Principal Product Manager – Technical on the Amazon Artificial General Intelligence team, responsible for the Amazon Nova Foundation Models. He earned his bachelor’s in mathematics at the University of Waterloo and has over 10 years of technical product leadership experience across financial services and loyalty. In his spare time, he enjoys wine, hikes, and philosophy.

Anupam Dewan is a Senior Solutions Architect with a passion for generative AI and its applications in real life. He and his team enable Amazon builders who build customer-facing applications using generative AI. He lives in the Seattle area, and outside of work, he loves to go hiking and enjoy nature.

Read More

Using generative AI to do multimodal information retrieval

Using generative AI to do multimodal information retrieval



With large datasets, directly generating data ID codes from query embeddings is much more efficient than performing pairwise comparisons between queries and candidate responses.

Search and information retrieval

June 25, 09:00 AM

For most of the past 10 years, machine learning (ML) relied heavily on the concept of embedding: an ML model would learn to convert input data into vectors (embeddings) such that geometric relationships within the vector space had semantic implications. For instance, words whose embeddings were near each other in the representational space might have similar meanings.

The concept of embedding implied an obvious information retrieval paradigm: a query would be embedded in the representational space, and the model would select the response whose embedding was closest to it. This worked with multimodal information retrieval, too, as text and images (or other modalities) could be embedded in the same space.

More recently, however, generative AI has come to dominate ML research, and at the 2025 Conference on Computer Vision and Pattern Recognition (CVPR), we presented a paper that updates ML-based information retrieval for the generative-AI era. Our model, dubbed GENIUS (for generative universal multimodal search), is a multimodal model whose inputs and outputs can be any combination of images, texts, or image-text pairs.

With embedding-based retrieval (a), a text embedding must be compared to every possible image embedding, or vice versa. With generative retrieval (b and c), by contrast, a retrieval model generates a single ID for each query. With GENIUS (c), the first digit of the ID code indicates the modality of the output.

Instead of comparing a query vector to every possible response vector (a time-consuming task if the image catalogue or text corpus is large enough), our model takes a query as input and generates a single ID code as output. This approach has been tried before, but GENIUS dramatically improves on previous generation-based information retrieval methods. In tests on two different datasets, using three different metrics (retrieval accuracy when one, five, or ten candidate responses are retrieved), GENIUS improves on the best-performing prior generative retrieval model by 22% to 36%.

When we then use conventional embedding-based methods to rerank the top generated response candidates, we improve performance still further, by 31% to 56%, significantly narrowing the gap between generation-based methods and embedding-based methods.

Paradigm shift

Information retrieval (IR) is the process of finding relevant information from a large database. With traditional embedding-based retrieval, queries and database items are both mapped into a high-dimensional space, and similarity is measured using metrics like cosine similarity. While effective, these methods face scalability issues as the database grows, due to the increasing cost of index building, maintenance, and nearest-neighbor search.
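
As a minimal sketch of this embedding-based paradigm (with random vectors standing in for learned embeddings), every query has to be scored against every candidate, which is why cost grows with the size of the database:

# Minimal sketch of embedding-based retrieval: score the query against every
# candidate embedding and return the top-k by cosine similarity.
import numpy as np

rng = np.random.default_rng(0)
catalogue = rng.normal(size=(10_000, 512))  # stand-in candidate embeddings
catalogue /= np.linalg.norm(catalogue, axis=1, keepdims=True)


def retrieve(query_embedding, k=5):
    query = query_embedding / np.linalg.norm(query_embedding)
    scores = catalogue @ query          # cosine similarity against all candidates
    return np.argsort(-scores)[:k]      # indices of the k best matches


print(retrieve(rng.normal(size=512)))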

Generative retrieval has emerged as a promising alternative. Instead of embedding items, generative models directly generate identifiers (IDs) of target data based on a query. This approach enables constant-time retrieval, regardless of database size. However, existing generative methods are often task specific, falling short in performance compared to embedding-based methods, and they struggle with multimodal data.

GENIUS

Unlike prior approaches that are limited to single-modality tasks or specific benchmarks, GENIUS generalizes across retrieval of texts, images, and image-text pairs, maintaining high speed and competitive accuracy. Its advantages over prior generation-based models are based on two key innovations:

Semantic quantization: During training, the model’s target output IDs are generated through residual quantization. Each ID is actually a sequence of codes, the first of which defines the data item’s modality (image, text, or image-text pair). The successive codes define the data item’s region of the representational space with greater specificity: items that share the first code are in the same general area; items that share the first two codes are clustered more tightly in that area; items that share the first three codes are clustered more tightly still, and so on. The model tries to learn to reproduce the sequence of codes from the input encodings.

Query augmentation: This approach results in a model that can generate accurate ID codes for familiar types of objects and texts, but it can struggle to generalize to new data types. To address this limitation, we use query augmentation. For a representative sampling of query-ID pairs, we generate new queries by interpolating between the initial query and the target ID in the representational space. This way, the model learns that a variety of queries can map to the same target, which helps it generalize.
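
The following toy sketch illustrates the residual-quantization idea behind semantic quantization. The codebooks are random and the modality prefix is a hard-coded assumption for illustration; in GENIUS the quantizer is learned through contrastive training, as described in the framework caption below.

# Toy residual quantization: each level quantizes the residual left by the previous
# level, so the code sequence localizes an embedding coarse-to-fine. Random codebooks
# are used for illustration only.
import numpy as np

rng = np.random.default_rng(0)
num_levels, codebook_size, dim = 4, 256, 512
codebooks = rng.normal(size=(num_levels, codebook_size, dim))

MODALITY_CODES = {"image": 0, "text": 1, "image-text": 2}  # assumed prefix convention


def quantize(embedding, modality="image"):
    codes = [MODALITY_CODES[modality]]  # first code indicates the output modality
    residual = embedding.copy()
    for level in range(num_levels):
        distances = np.linalg.norm(codebooks[level] - residual, axis=1)
        index = int(np.argmin(distances))    # nearest code word at this level
        codes.append(index)
        residual -= codebooks[level][index]  # pass the leftover to the next level
    return codes


print(quantize(rng.normal(size=dim), modality="text"))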

The GENIUS framework. <i>Stage 0:</i> Separate image and text encoders are pretrained. <i>Stage 1:</i> Through contrastive training, a residual-quantization module learns to map inputs to sequences of codes in which each code refines the coarser-grained specifications of the preceding code. <i>Stage 2:</i> A decoder is trained to generate output IDs directly from input encodings, using the outputs of the residual-quantization model as targets. At inference time, output codes are constrained by a data structure known as a trie, a tree whose traversals encode sequences of symbols.

Results

In experiments using the M-BEIR benchmark, GENIUS surpassed the best generative retrieval method by 28.6 points in Recall@5 on the COCO dataset for text-to-image retrieval. With embedding-based re-ranking, GENIUS often achieved results close to those of embedding-based baselines on the M-BEIR benchmark while preserving the efficiency benefits of generative retrieval.

GENIUS achieves state-of-the-art performance among generative methods and narrows the performance gap between generative and embedding-based methods. Its efficiency advantage becomes more significant as the dataset grows, maintaining high retrieval speed without the expensive index building typical of embedding-based methods. It thus represents a significant step forward in generative multimodal retrieval.

Research areas: Search and information retrieval

Tags: Generative AI

Read More

Power Your LLM Training and Evaluation with the New SageMaker AI Generative AI Tools

Power Your LLM Training and Evaluation with the New SageMaker AI Generative AI Tools

Today we are excited to introduce the Text Ranking and Question and Answer UI templates to SageMaker AI customers. The Text Ranking template enables human annotators to rank multiple responses from a large language model (LLM) based on custom criteria, such as relevance, clarity, or factual accuracy. This ranked feedback provides critical insights that help refine models through Reinforcement Learning from Human Feedback (RLHF), generating responses that better align with human preferences. The Question and Answer template facilitates the creation of high-quality Q&A pairs based on provided text passages. These pairs act as demonstration data for Supervised Fine-Tuning (SFT), teaching models how to respond to similar inputs accurately.

In this blog post, we’ll walk you through how to set up these templates in SageMaker to create high-quality datasets for training your large language models. Let’s explore how you can leverage these new tools.

Text Ranking

The Text Ranking template allows annotators to rank multiple text responses generated by a large language model based on customizable criteria such as relevance, clarity, or correctness. Annotators are presented with a prompt and several model-generated responses, which they rank according to guidelines specific to your use case. The ranked data is captured in a structured format, detailing the re-ranked indices for each criterion, such as “clarity” or “inclusivity.” This information is invaluable for fine-tuning models using RLHF, aligning the model outputs more closely with human preferences. In addition, this template is also highly effective for evaluating the quality of LLM outputs by allowing you to see how well responses match the intended criteria.

Setting Up in the SageMaker AI Console

A new Generative AI category has been added under Task Type in the SageMaker AI console, allowing you to select these templates. To configure the labeling job using the AWS Management Console, complete the following steps:

  1. On the SageMaker AI console, under Ground Truth in the navigation pane, choose Labeling job.
  2. Choose Create labeling job.
  3. Specify your input manifest location and output path. To configure the Text Ranking input file, use the Manual Data Setup under Create Labeling Job and input a JSON file with the prompt stored under the source field, while the list of model responses is placed under the responses field. Text Ranking does not support Automated Data Setup.

Here is an example of our input manifest file:
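
The sample file itself isn’t reproduced here, but based on the setup described above, each line of the manifest is a JSON object with the prompt under source and the candidate model responses under responses; an illustrative line might look like this:

{"source": "What are some ways to reduce my AWS costs?", "responses": ["Response 1 text...", "Response 2 text...", "Response 3 text..."]}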

Upload this input manifest file into your S3 location and provide the S3 path to this file under Input dataset location:

  1. Select Generative AI as the task type and choose the Text Ranking UI.

  2. Choose Next.
  3. Enter your labeling instructions. Enter the dimensions you want to include in the Ranking dimensions section. For example, in the image above, the dimensions are Helpfulness and Clarity, but you can add, remove, or customize these based on your specific needs by clicking the “+” button to add new dimensions or the trash icon to remove them. Additionally, you have the option to allow tie rankings by selecting the checkbox. This option enables annotators to rank two or more responses equally if they believe the responses are of the same quality for a particular dimension.
  4. Choose Preview to display the UI template for review.
  5. Choose Create to create the labeling job.

When the annotators submit their evaluations, their responses are saved directly to your specified S3 bucket. The output manifest file includes the original data fields and a worker-response-ref that points to a worker response file in S3. This worker response file contains the ranked responses for each specified dimension, which can be used to fine-tune or evaluate your model’s outputs. If multiple annotators have worked on the same data object, their individual annotations are included within this file under an answers key, which is an array of responses. Each response includes the annotator’s input and metadata such as acceptance time, submission time, and worker ID. Here is an example of the output json file containing the annotations:
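
The original output file isn’t reproduced here; as a rough illustration of the structure described above (field names and value shapes are approximations, so treat this as a sketch rather than the exact schema), a worker response file might look like the following:

{
  "answers": [
    {
      "acceptanceTime": "2025-01-15T18:12:34.567Z",
      "submissionTime": "2025-01-15T18:15:02.123Z",
      "workerId": "private.us-east-1.0123456789abcdef",
      "answerContent": {
        "Helpfulness": ["response_2", "response_1", "response_3"],
        "Clarity": ["response_1", "response_2", "response_3"]
      }
    }
  ]
}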

Question and Answer

The Question and Answer template allows you to create datasets for Supervised Fine-Tuning (SFT) by generating question-and-answer pairs from text passages. Annotators read the provided text and create relevant questions and corresponding answers. This process acts as a source of demonstration data, guiding the model on how to handle similar tasks. The template supports flexible input, letting annotators reference entire passages or specific sections of text for more targeted Q&A. A color-coded matching feature visually links questions to the relevant sections, helping streamline the annotation process. By using these Q&A pairs, you enhance the model’s ability to follow instructions and respond accurately to real-world inputs.

Setting Up in the SageMaker AI Console

The process for setting up a labeling job with the Question and Answer template follows similar steps as the Text Ranking template. However, there are differences in how you configure the input file and select the appropriate UI template to suit the Q&A task.

  1. On the SageMaker AI console, under Ground Truth in the navigation pane, choose Labeling job.
  2. Choose Create labeling job.
  3. Specify your input manifest location and output path. To configure the Question and Answer input file, use the Manual Data Setup and upload a JSON file where the source field contains the text passage. Annotators will use this text to generate questions and answers. Note that you can load the text from a .txt or .csv file and use Ground Truth’s Automated Data Setup to convert it to the required JSON format.

Here is an example of an input manifest file:
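
As with the Text Ranking template, the sample file isn’t reproduced here; an illustrative manifest line, with the passage to annotate under source, might look like this:

{"source": "Amazon SageMaker Ground Truth helps you build highly accurate training datasets for machine learning using human annotators..."}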

Upload this input manifest file into your S3 location and provide the S3 path to this file under Input dataset location:

  1. Select Generative AI as the task type and choose the Question and Answer UI.
  2. Choose Next.
  3. Enter your labeling instructions. You can configure additional settings to control the task. You can specify the minimum and maximum number of Q&A pairs that workers should generate from the provided text passage. Additionally, you can define the minimum and maximum word counts for both the question and answer fields, so that the responses fit your requirements. You can also add optional question tags to categorize the question and answer pairs. For example, you might include tags such as “What,” “How,” or “Why” to guide the annotators in their task. If these predefined tags are insufficient, you have the option to allow workers to enter their own custom tags by enabling the Allow workers to specify custom tags feature. This flexibility facilitates annotations that meet the specific needs of your use case.
  4. Once these settings are configured, you can choose to Preview the UI to verify that it meets your needs before proceeding.
  5. Choose Create to create the labeling job.

When annotators submit their work, their responses are saved directly to your specified S3 bucket. The output manifest file contains the original data fields along with a worker-response-ref that points to the worker response file in S3. This worker response file includes the detailed annotations provided by the workers, such as the ranked responses or question-and-answer pairs generated for each task.

Here’s an example of what the output might look like:

CreateLabelingJob API

In addition to creating these labeling jobs through the Amazon SageMaker AI console, customers can also use the CreateLabelingJob API to set up Text Ranking and Question and Answer jobs programmatically. This method provides more flexibility for automation and integration into existing workflows. Using the API, you can define job configurations, input manifests, and worker task templates, and monitor the job’s progress directly from your application or system.

For a step-by-step guide on how to implement this, you can refer to the following notebooks, which walk through the entire process of setting up Human-in-the-Loop (HITL) workflows for Reinforcement Learning from Human Feedback (RLHF) using both the Text Ranking and Question and Answer templates. These notebooks will guide you through setting up the required Ground Truth pre-requisites, downloading sample JSON files with prompts and responses, converting them to Ground Truth input manifests, creating worker task templates, and monitoring the labeling jobs. They also cover post-processing the results to create a consolidated dataset with ranked responses.
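
As a hedged sketch of what a programmatic setup can look like with boto3 (the role, workteam, Lambda ARNs, bucket names, and UI template location below are placeholders, and the notebooks above show the exact configuration for the Text Ranking and Question and Answer task types), a call might look like the following:

# Hedged sketch of creating a labeling job programmatically with boto3.
# All names and ARNs are placeholders; see the referenced notebooks for the
# exact HumanTaskConfig values for these templates.
import boto3

sagemaker = boto3.client("sagemaker")

sagemaker.create_labeling_job(
    LabelingJobName="text-ranking-job",
    LabelAttributeName="text-ranking-job",
    RoleArn="arn:aws:iam::111122223333:role/GroundTruthExecutionRole",  # placeholder
    InputConfig={
        "DataSource": {
            "S3DataSource": {"ManifestS3Uri": "s3://your-bucket/input/manifest.json"}
        }
    },
    OutputConfig={"S3OutputPath": "s3://your-bucket/output/"},
    HumanTaskConfig={
        "WorkteamArn": "arn:aws:sagemaker:us-east-1:111122223333:workteam/private-crowd/my-team",  # placeholder
        "UiConfig": {"UiTemplateS3Uri": "s3://your-bucket/templates/text-ranking.liquid.html"},  # placeholder
        "TaskTitle": "Rank model responses",
        "TaskDescription": "Rank each prompt's responses by helpfulness and clarity",
        "NumberOfHumanWorkersPerDataObject": 1,
        "TaskTimeLimitInSeconds": 3600,
        "PreHumanTaskLambdaArn": "arn:aws:lambda:us-east-1:111122223333:function:PreLabelTask",  # placeholder
        "AnnotationConsolidationConfig": {
            "AnnotationConsolidationLambdaArn": "arn:aws:lambda:us-east-1:111122223333:function:ConsolidateLabels"  # placeholder
        },
    },
)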

Conclusion

With the introduction of the Text Ranking and Question and Answer templates, Amazon SageMaker AI empowers customers to generate high-quality datasets for training large language models more efficiently. These built-in capabilities simplify the process of fine-tuning models for specific tasks and aligning their outputs with human preferences, whether through supervised fine-tuning or reinforcement learning from human feedback. By leveraging these templates, you can better evaluate and refine your models to meet the needs of your specific application, helping achieve more accurate, reliable, and user-aligned outputs. Whether you’re creating datasets for training or evaluating your models’ outputs, SageMaker AI provides the tools you need to succeed in building state-of-the-art generative AI solutions. To begin creating fine-tuning datasets with the new templates:


About the authors

Sundar Raghavan is a Generative AI Specialist Solutions Architect at AWS, helping customers use Amazon Bedrock and next-generation AWS services to design, build and deploy AI agents and scalable generative AI applications. In his free time, Sundar loves exploring new places, sampling local eateries and embracing the great outdoors.

Jesse Manders is a Senior Product Manager on Amazon Bedrock, the AWS Generative AI developer service. He works at the intersection of AI and human interaction with the goal of creating and improving generative AI products and services to meet our needs. Previously, Jesse held engineering team leadership roles at Apple and Lumileds, and was a senior scientist in a Silicon Valley startup. He has an M.S. and Ph.D. from the University of Florida, and an MBA from the University of California, Berkeley, Haas School of Business.

Niharika Jayanti is a Front-End Engineer at Amazon, where she designs and develops user interfaces to delight customers. She contributed to the successful launch of LLM evaluation tools on Amazon Bedrock and Amazon SageMaker Unified Studio. Outside of work, Niharika enjoys swimming, hitting the gym and crocheting.

Muyun Yan is a Senior Software Engineer on the Amazon Web Services (AWS) SageMaker AI team. With over 6 years at AWS, she specializes in developing machine learning-based labeling platforms. Her work focuses on building and deploying innovative software applications for labeling solutions, enabling customers to access cutting-edge labeling capabilities. Muyun holds an M.S. in Computer Engineering from Boston University.

Kavya Kotra is a Software Engineer on the Amazon SageMaker Ground Truth team, helping build scalable and reliable software applications. Kavya played a key role in the development and launch of the Generative AI Tools on SageMaker. Previously, Kavya held engineering roles within AWS EC2 Networking and Amazon Audible. In her free time, she enjoys painting and exploring Seattle’s nature scene.

Alan Ismaiel is a software engineer at AWS based in New York City. He focuses on building and maintaining scalable AI/ML products, like Amazon SageMaker Ground Truth and Amazon Bedrock. Outside of work, Alan is learning how to play pickleball, with mixed results.

Read More

Amazon Bedrock Agents observability using Arize AI

Amazon Bedrock Agents observability using Arize AI

This post is cowritten with John Gilhuly from Arize AI.

With Amazon Bedrock Agents, you can build and configure autonomous agents in your application. An agent helps your end-users complete actions based on organization data and user input. Agents orchestrate interactions between foundation models (FMs), data sources, software applications, and user conversations. In addition, agents automatically call APIs to take actions and invoke knowledge bases to supplement information for these actions. By integrating agents, you can accelerate your development effort to deliver generative AI applications. With agents, you can automate tasks for your customers and answer questions for them. For example, you can create an agent that helps customers process insurance claims or make travel reservations. You don’t have to provision capacity, manage infrastructure, or write custom code. Amazon Bedrock manages prompt engineering, memory, monitoring, encryption, user permissions, and API invocation.

AI agents represent a fundamental shift in how applications make decisions and interact with users. Unlike traditional software systems that follow predetermined paths, AI agents employ complex reasoning that often operates as a “black box.” Monitoring AI agents presents unique challenges for organizations seeking to maintain reliability, efficiency, and optimal performance in their AI implementations.

Today, we’re excited to announce a new integration between Arize AI and Amazon Bedrock Agents that addresses one of the most significant challenges in AI development: observability. Agent observability is a crucial aspect of AI operations that provides deep insights into how your Amazon Bedrock agents perform, interact, and execute tasks. It involves tracking and analyzing hierarchical traces of agent activities, from high-level user requests down to individual API calls and tool invocations. These traces form a structured tree of events, helping developers understand the complete journey of user interactions through the agent’s decision-making process. Key metrics that demand attention include response latency, token usage, runtime exceptions, and function calling behavior. As organizations scale their AI implementations from proof of concept to production, understanding and monitoring AI agent behavior becomes increasingly critical.

The integration between Arize AI and Amazon Bedrock Agents provides developers with comprehensive observability tools for tracing, evaluating, and monitoring AI agent applications. This solution delivers three primary benefits:

  • Comprehensive traceability – Gain visibility into every step of your agent’s execution path, from initial user query through knowledge retrieval and action execution
  • Systematic evaluation framework – Apply consistent evaluation methodologies to measure and understand agent performance
  • Data-driven optimization – Run structured experiments to compare different agent configurations and identify optimal settings

The Arize AI service is available in two versions:

  • Arize AX – An enterprise solution offering advanced monitoring capabilities
  • Arize Phoenix – An open source service making tracing and evaluation accessible to developers

In this post, we demonstrate the Arize Phoenix system for tracing and evaluation. Phoenix can run on your local machine, a Jupyter notebook, a containerized deployment, or in the cloud. We explore how this integration works, its key features, and how you can implement it in your Amazon Bedrock Agents applications to enhance observability and maintain production-grade reliability.

Solution overview

Large language model (LLM) tracing records the paths taken by requests as they propagate through multiple steps or components of an LLM application. It improves the visibility of your application or system’s health and makes it possible to debug behavior that is difficult to reproduce locally. For example, when a user interacts with an LLM application, tracing can capture the sequence of operations, such as document retrieval, embedding generation, language model invocation, and response generation, to provide a detailed timeline of the request’s execution.

For an application to emit traces for analysis, the application must be instrumented. Your application can be manually instrumented or be automatically instrumented. Arize Phoenix offers a set of plugins (instrumentors) that you can add to your application’s startup process that perform automatic instrumentation. These plugins collect traces for your application and export them (using an exporter) for collection and visualization. The Phoenix server is a collector and UI that helps you troubleshoot your application in real time. When you run Phoenix (for example, the px.launch_app() container), Phoenix starts receiving traces from an application that is exporting traces to it. For Phoenix, the instrumentors are managed through a single repository called OpenInference. OpenInference provides a set of instrumentations for popular machine learning (ML) SDKs and frameworks in a variety of languages. It is a set of conventions and plugins that is complementary to OpenTelemetry and uses the OpenTelemetry Protocol (OTLP) to enable tracing of AI applications. Phoenix currently supports OTLP over HTTP.

For AWS, Boto3 provides Python bindings to AWS services, including Amazon Bedrock, which provides access to a number of FMs. You can instrument calls to these models using OpenInference, enabling OpenTelemetry-aligned observability of applications built using these models. You can also capture traces on invocations of Amazon Bedrock agents using OpenInference and view them in Phoenix. The following high-level architecture diagram shows an LLM application created using Amazon Bedrock Agents, which has been instrumented to send traces to the Phoenix server.

In the following sections, we demonstrate how, by installing the openinference-instrumentation-bedrock library, you can automatically instrument interactions with Amazon Bedrock or Amazon Bedrock agents for observability, evaluation, and troubleshooting purposes in Phoenix.

Prerequisites

To follow this tutorial, you must have the following:

You can also clone the GitHub repo locally to run the Jupyter notebook yourself:

git clone https://github.com/awslabs/amazon-bedrock-agent-samples.git

Install required dependencies

Begin by installing the necessary libraries:

%pip install -r requirements.txt --quiet

Next, import the required modules:

import time
import boto3
import logging
import os
import nest_asyncio
from phoenix.otel import register
from openinference.instrumentation import using_metadata

nest_asyncio.apply()

The arize-phoenix-otel package provides a lightweight wrapper around OpenTelemetry primitives with Phoenix-aware defaults. These defaults are aware of environment variables you must set to configure Phoenix in the next steps, such as:

  • PHOENIX_COLLECTOR_ENDPOINT
  • PHOENIX_PROJECT_NAME
  • PHOENIX_CLIENT_HEADERS
  • PHOENIX_API_KEY

Configure the Phoenix environment

Set up the Phoenix Cloud environment for this tutorial. Phoenix can also be self-hosted on AWS instead.

os.environ["PHOENIX_COLLECTOR_ENDPOINT"] = "https://app.phoenix.arize.com“
if not os.environ.get("PHOENIX_CLIENT_HEADERS"):
os.environ["PHOENIX_CLIENT_HEADERS"] = "api_key=" + input("Enter your Phoenix API key: ")

Connect your notebook to Phoenix with auto-instrumentation enabled:

project_name = "Amazon Bedrock Agent Example"
tracer_provider = register(project_name=project_name, auto_instrument=True)

The auto_instrument parameter automatically locates the openinference-instrumentation-bedrock library and instruments Amazon Bedrock and Amazon Bedrock Agent calls without requiring additional configuration. Configure metadata for the span:

metadata = { "agent" : "bedrock-agent", 
            "env" : "development"
Metadata is used to filter search values in the dashboard
       }

Set up an Amazon Bedrock session and agent

Before using Amazon Bedrock, make sure that your AWS credentials are configured correctly. You can set them up using the AWS Command Line Interface (AWS CLI) or by setting environment variables:

session = boto3.Session()
REGION = session.region_name
bedrock_agent_runtime = session.client(service_name="bedrock-agent-runtime",region_name=REGION)

We assume you’ve already created an Amazon Bedrock agent. To configure the agent, use the following code:

agent_id = "XXXXXYYYYY" # ← Configure your Bedrock Agent ID
agent_alias_id = "Z0ZZZZZZ0Z" # ← Optionally set a different Alias ID if you have one

Before proceeding to the next step, you can validate whether the agent invocation is working correctly. The response content is not important; we are simply testing the API call.

print(f"Trying to invoke alias {agent_alias_id} of agent {agent_id}...")
agent_resp = bedrock_agent_runtime.invoke_agent(
    agentAliasId=agent_alias_id,
    agentId=agent_id,
    inputText="Hello!",
    sessionId="dummy-session",
)
if "completion" in agent_resp:
    print("✅ Got response")
else:
    raise ValueError(f"No 'completion' in agent response:\n{agent_resp}")

Run your agent with tracing enabled

Create a function to run your agent and capture its output:

@using_metadata(metadata)
def run(input_text):
    session_id = f"default-session1_{int(time.time())}"

    attributes = dict(
        inputText=input_text,
        agentId=agent_id,
        agentAliasId=agent_alias_id,
        sessionId=session_id,
        enableTrace=True,
    )
    response = bedrock_agent_runtime.invoke_agent(**attributes)

    # Stream the response
    for _, event in enumerate(response["completion"]):
        if "chunk" in event:
            print(event)
            chunk_data = event["chunk"]
            if "bytes" in chunk_data:
                output_text = chunk_data["bytes"].decode("utf8")
                print(output_text)
        elif "trace" in event:
            print(event["trace"])

Test your agent with a few sample queries:

run ("What are the total leaves for Employee 1?")
run ("If Employee 1 takes 4 vacation days off, What are the total leaves left for Employee 1?")

You should replace these queries with the queries that your application is built for. After executing these commands, you should see your agent’s responses in the notebook output. The Phoenix instrumentation is automatically capturing detailed traces of these interactions, including knowledge base lookups, orchestration steps, and tool calls.

View captured traces in Phoenix

Navigate to your Phoenix dashboard to view the captured traces. You will see a comprehensive visualization of each agent invocation, including:

  • The full conversation context
  • Knowledge base queries and results
  • Tool or action group calls and responses
  • Agent reasoning and decision-making steps

Phoenix’s tracing and span analysis capabilities are useful during the prototyping and debugging stages. By instrumenting application code with Phoenix, teams gain detailed insights into the execution flow, making it straightforward to identify and resolve issues. Developers can drill down into specific spans, analyze performance metrics, and access relevant logs and metadata to streamline debugging efforts. With Phoenix’s tracing capabilities, you can monitor the following:

  • Application latency – Identify latency bottlenecks and address slow invocations of LLMs, retrievers, and other components within your application, enabling you to optimize performance and responsiveness.
  • Token usage – Gain a detailed breakdown of token usage for your LLM calls, so you can identify and optimize the most expensive LLM invocations.
  • Runtime exceptions – Capture and inspect critical runtime exceptions, such as rate-limiting events, that can help you proactively address and mitigate potential issues.
  • Retrieved documents – Inspect the documents retrieved during a retriever call, including the score and order in which they were returned, to provide insight into the retrieval process.
  • Embeddings – Examine the embedding text used for retrieval and the underlying embedding model, so you can validate and refine your embedding strategies.
  • LLM parameters – Inspect the parameters used when calling an LLM, such as temperature and system prompts, to facilitate optimal configuration and debugging.
  • Prompt templates – Understand the prompt templates used during the prompting step and the variables that were applied, so you can fine-tune and improve your prompting strategies.
  • Tool descriptions – View the descriptions and function signatures of the tools your LLM has been given access to, in order to better understand and control your LLM’s capabilities.
  • LLM function calls – For LLMs with function call capabilities (such as Anthropic’s Claude, Amazon Nova, or Meta’s Llama), you can inspect the function selection and function messages in the input to the LLM. This can further help you debug and optimize your application.

The following screenshot shows the Phoenix dashboard for the Amazon Bedrock agent, with the latency, token usage, and total traces.

You can choose one of the traces to drill down to the level of the entire orchestration.

Evaluate the agent in Phoenix

Evaluating any AI application is a challenge. Evaluating an agent is even more difficult. Agents present a unique set of evaluation pitfalls to navigate. A common evaluation metric for agents is their function calling accuracy, in other words, how well they do at choosing the right tool for the job. For example, agents can take inefficient paths and still get to the right solution. How do you know if they took an optimal path? Additionally, bad responses upstream can lead to strange responses downstream. How do you pinpoint where a problem originated? Phoenix also includes built-in LLM evaluations and code-based experiment testing. An agent is characterized by what it knows about the world, the set of actions it can perform, and the pathway it took to get there. To evaluate an agent, you must evaluate each component. Phoenix has built evaluation templates for every step, such as:

You can evaluate the individual skills and responses using normal LLM evaluation strategies, such as retrieval evaluation, classification with LLM judges, hallucination, or Q&A correctness. In this post, we demonstrate evaluation of agent function calling. You can use the Agent Function Call eval to determine how well a model selects a tool to use, extracts the right parameters from the user query, and generates the tool call code. Now that you’ve traced your agent in the previous step, the next step is to add evaluations to measure its performance. A common evaluation metric for agents is their function calling accuracy (how well they do at choosing the right tool for the job). Complete the following steps:

  1. Up until now, you have just used the lighter-weight Phoenix OTEL tracing library. To run evals, you must install the full library:

!pip install -q arize-phoenix

  1. Import the necessary evaluation components:
import re
import json
import phoenix as px
from phoenix.evals import (
    TOOL_CALLING_PROMPT_RAILS_MAP,
    TOOL_CALLING_PROMPT_TEMPLATE,
    BedrockModel,
    llm_classify,
)
from phoenix.trace import SpanEvaluations
from phoenix.trace.dsl import SpanQuery

The following is our agent function calling prompt template:

TOOL_CALLING_PROMPT_TEMPLATE = """

You are an evaluation assistant evaluating questions and tool calls to
determine whether the tool called would answer the question. The tool
calls have been generated by a separate agent, and chosen from the list of
tools provided below. It is your job to decide whether that agent chose
the right tool to call.

    [BEGIN DATA]
    ************
    [Question]: {question}
    ************
    [Tool Called]: {tool_call}
    [END DATA]

Your response must be single word, either "correct" or "incorrect",
and should not contain any text or characters aside from that word.
"incorrect" means that the chosen tool would not answer the question,
the tool includes information that is not presented in the question,
or that the tool signature includes parameter values that don't match
the formats specified in the tool signatures below.

"correct" means the correct tool call was chosen, the correct parameters
were extracted from the question, the tool call generated is runnable and correct,
and that no outside information not present in the question was used
in the generated question.

    [Tool Definitions]: {tool_definitions}
"""
  1. Because we are only evaluating the inputs, outputs, and function call columns, let’s extract those into a simpler-to-use dataframe. Phoenix provides a method to query your span data and directly export only the values you care about:
query = (
    SpanQuery()
    .where(
        # Filter for the `LLM` span kind.
        # The filter condition is a string containing a valid Python boolean expression.
        "span_kind == 'LLM' and 'evaluation' not in input.value"
    )
    .select(
        question="input.value",
        outputs="output.value",
    )
)
trace_df = px.Client().query_spans(query, project_name=project_name)
  1. The next step is to prepare these traces into a dataframe with columns for input, tool call, and tool definitions. Parse the JSON input and output data to create these columns:
def extract_tool_calls(output_value):
    try:
        tool_calls = []
        # Look for tool calls within <function_calls> tags
        if "<function_calls>" in output_value:
            # Find all tool_name tags
            tool_name_pattern = r"<tool_name>(.*?)</tool_name>"
            tool_names = re.findall(tool_name_pattern, output_value)

            # Add each found tool name to the list
            for tool_name in tool_names:
                if tool_name:
                    tool_calls.append(tool_name)
    except Exception as e:
        print(f"Error extracting tool calls: {e}")

    return tool_calls
  1. Apply the function to each row of the trace_df["outputs"] column:
trace_df["tool_call"] = trace_df["outputs"].apply(
lambda x: extract_tool_calls(x) if isinstance(x, str) else []
)

# Display the tool calls found
print("Tool calls found in traces:", trace_df["tool_call"].sum())
  1. Add tool definitions for evaluation:
trace_df["tool_definitions"] = (
"phoenix-traces retrieves the latest trace information from Phoenix, phoenix-experiments retrieves the latest experiment information from Phoenix, phoenix-datasets retrieves the latest dataset information from Phoenix"
)

Now with your dataframe prepared, you can use Phoenix’s built-in LLM-as-a-Judge template for tool calling to evaluate your application. The following method takes in the dataframe of traces to evaluate, our built-in evaluation prompt, the eval model to use, and a rails object to snap responses from our model to a set of binary classification responses. We also instruct our model to provide explanations for its responses.

  1. Run the tool calling evaluation:
rails = list(TOOL_CALLING_PROMPT_RAILS_MAP.values())

eval_model = BedrockModel(session=session, model_id="us.anthropic.claude-3-5-haiku-20241022-v1:0")

response_classifications = llm_classify(
    data=trace_df,
    template=TOOL_CALLING_PROMPT_TEMPLATE,
    model=eval_model,
    rails=rails,
    provide_explanation=True,
)
response_classifications["score"] = response_classifications.apply(
    lambda x: 1 if x["label"] == "correct" else 0, axis=1
)

We use the following parameters:

  • data – A dataframe of cases to evaluate. The dataframe must have columns to match the default template.
  • question – The query made to the model. If you exported spans from Phoenix to evaluate, this will be the llm.input_messages column in your exported data.
  • tool_call – Information on the tool called and parameters included. If you exported spans from Phoenix to evaluate, this will be the llm.function_call column in your exported data.
  1. Finally, log the evaluation results to Phoenix:
px.Client().log_evaluations(
    SpanEvaluations(eval_name="Tool Calling Eval", dataframe=response_classifications),
)

After running these commands, you will see your evaluation results on the Phoenix dashboard, providing insights into how effectively your agent is using its available tools.

The following screenshot shows how the tool calling evaluation attribute shows up when you run the evaluation.

When you expand the individual trace, you can observe that the tool calling evaluation adds a score of 1 if the label is correct. This means that the agent responded correctly.

Conclusion

As AI agents become increasingly prevalent in enterprise applications, effective observability is crucial for facilitating their reliability, performance, and continuous improvement. The integration of Arize AI with Amazon Bedrock Agents provides developers with the tools they need to build, monitor, and enhance AI agent applications with confidence. We’re excited to see how this integration will empower developers and organizations to push the boundaries of what’s possible with AI agents.

Stay tuned for more updates and enhancements to this integration in the coming months. To learn more about Amazon Bedrock Agents and the Arize AI integration, refer to the Phoenix documentation and Integrating Arize AI and Amazon Bedrock Agents: A Comprehensive Guide to Tracing, Evaluation, and Monitoring.


About the Authors

Ishan Singh is a Sr. Generative AI Data Scientist at Amazon Web Services, where he helps customers build innovative and responsible generative AI solutions and products. With a strong background in AI/ML, Ishan specializes in building generative AI solutions that drive business value. Outside of work, he enjoys playing volleyball, exploring local bike trails, and spending time with his wife and dog, Beau.

John Gilhuly is the Head of Developer Relations at Arize AI, focused on AI agent observability and evaluation tooling. He holds an MBA from Stanford and a B.S. in C.S. from Duke. Prior to joining Arize, John led GTM activities at Slingshot AI, and served as a venture fellow at Omega Venture Partners. In his pre-AI life, John built out and ran technical go-to-market teams at Branch Metrics.

Richa Gupta is a Sr. Solutions Architect at Amazon Web Services. She is passionate about architecting end-to-end solutions for customers. Her specialization is machine learning and how it can be used to build new solutions that lead to operational excellence and drive business revenue. Prior to joining AWS, she worked in the capacity of a Software Engineer and Solutions Architect, building solutions for large telecom operators. Outside of work, she likes to explore new places and loves adventurous activities.

Aris Tsakpinis is a Specialist Solutions Architect for Generative AI, focusing on open weight models on Amazon Bedrock and the broader generative AI open source landscape. Alongside his professional role, he is pursuing a PhD in Machine Learning Engineering at the University of Regensburg, where his research focuses on applied natural language processing in scientific domains.

Yanyan Zhang is a Senior Generative AI Data Scientist at Amazon Web Services, where she has been working on cutting-edge AI/ML technologies as a Generative AI Specialist, helping customers use generative AI to achieve their desired outcomes. Yanyan graduated from Texas A&M University with a PhD in Electrical Engineering. Outside of work, she loves traveling, working out, and exploring new things.

Mani Khanuja is a Principal Generative AI Specialist SA and author of the book Applied Machine Learning and High-Performance Computing on AWS. She leads machine learning projects in various domains such as computer vision, natural language processing, and generative AI. She speaks at internal and external conferences such as AWS re:Invent, Women in Manufacturing West, YouTube webinars, and GHC 23. In her free time, she likes to go for long runs along the beach.

Musarath Rahamathullah is an AI/ML and GenAI Solutions Architect at Amazon Web Services, focusing on media and entertainment customers. She holds a Master’s degree in Analytics with a specialization in Machine Learning. She is passionate about using AI solutions in the AWS Cloud to address customer challenges and democratize technology. Her professional background includes a role as a Research Assistant at the prestigious Indian Institute of Technology, Chennai. Beyond her professional endeavors, she is interested in interior architecture, focusing on creating beautiful spaces to live.

Read More