How Gardenia Technologies helps customers create ESG disclosure reports 75% faster using agentic generative AI on Amazon Bedrock

June 11, 2025

by Federico Thibaud, Neil Holloway, Fraser Price, Christian Dunn and Frederica Schrager Amazon AWS

This post was co-written with Federico Thibaud, Neil Holloway, Fraser Price, Christian Dunn, and Frederica Schrager from Gardenia Technologies

“What gets measured gets managed” has become a guiding principle for organizations worldwide as they begin their sustainability and environmental, social, and governance (ESG) journeys. Companies are establishing baselines to track their progress, supported by an expanding framework of reporting standards, some mandatory and some voluntary. However, ESG reporting has evolved into a significant operational burden. A recent survey shows that 55% of sustainability leaders cite excessive administrative work in report preparation, while 70% indicate that reporting demands inhibit their ability to execute strategic initiatives. This environment presents a clear opportunity for generative AI to automate routine reporting tasks, allowing organizations to redirect resources toward more impactful ESG programs.

Gardenia Technologies, a data analytics company, partnered with the AWS Prototyping and Cloud Engineering (PACE) team to develop Report GenAI, a fully automated ESG reporting solution powered by the latest generative AI models on Amazon Bedrock. This post dives deep into the technology behind an agentic search solution using tooling with Retrieval Augmented Generation (RAG) and text-to-SQL capabilities to help customers reduce ESG reporting time by up to 75%.

In this post, we demonstrate how AWS serverless technology, combined with agents in Amazon Bedrock, is used to build scalable and highly flexible agent-based document assistant applications.

Scoping the challenge: Growing ESG reporting requirements and complexity

Sustainability disclosures are now a standard part of corporate reporting, with 96% of the 250 largest companies reporting on their sustainability progress based on government and regulatory frameworks. To meet reporting mandates, organizations must overcome many data collection and process-based barriers. Data for a single report includes thousands of data points from a multitude of sources including official documentation, databases, unstructured document stores, utility bills, and emails. The EU Corporate Sustainability Reporting Directive (CSRD) framework, for example, comprises 1,200 individual data points that need to be collected across an enterprise. Even voluntary disclosures like the CDP, which comprises approximately 150 questions, cover a wide range of topics related to climate risk and impact, water stewardship, land use, and energy consumption. Collecting this information across an organization is time-consuming.

A secondary challenge is that many organizations with established ESG programs need to report to multiple disclosure frameworks, such as SASB, GRI, and TCFD, each using different reporting and disclosure standards. To complicate matters, reporting requirements are continually evolving, leaving organizations struggling just to keep up with the latest changes. Today, much of this work is highly manual and leaves sustainability teams spending more time managing data collection and answering questionnaires than developing impactful sustainability strategies.

Solution overview: Automating undifferentiated heavy lifting with AI agents

Gardenia’s approach to strengthen ESG data collection for enterprises is Report GenAI, an agentic framework using generative AI models on Amazon Bedrock to automate large chunks of the ESG reporting process. Report GenAI pre-fills reports by drawing on existing databases, document stores and web searches. The agent then works collaboratively with ESG professionals to review and fine-tune responses. This workflow has five steps to help automate ESG data collection and assist in curating responses. These steps include setup, batch-fill, review, edit, and repeat. Let’s explore each step in more detail.

Setup: The Report GenAI agent is configured and authorized to access an ESG and emissions database, client document stores (emails, previous reports, data sheets), and document searches over the public internet. Client data is stored within specified AWS Regions using encrypted Amazon Simple Storage Service (Amazon S3) buckets with VPC endpoints for secure access, while relational data is hosted in Amazon Relational Database Service (Amazon RDS) instances deployed within Gardenia’s virtual private cloud (VPC). This architecture helps make sure data residency requirements can be fulfilled, while maintaining strict access controls through private network connectivity. The agent also has access to the relevant ESG disclosure questionnaire including questions and expected response format (we refer to this as a report specification). The following figure is an example of the Report GenAI user interface at the agent configuration step. As shown in the figure, the user can choose which databases, documents, or other tools the agent will use to answer a given question.

Batch-fill: The agent then iterates through each question and data point to be disclosed and then retrieves relevant data from the client document stores and document searches. This information is processed to produce a response in the expected format depending on the disclosure report requirements.
Review: Each response includes cited sources and—if the response is quantitative—calculation methodology. This enables users to maintain a clear audit trail and verify the accuracy of batch-filled responses quickly.
Edit: While the agentic workflow is automated, our approach allows for a human-in-the-loop to review, validate, and iterate on batch-filled facts and figures. In the following figure, we show how users can chat with the AI assistant to request updates or manually refine responses. When the user is satisfied, the final answer is recorded. The agent will show references from which responses were sourced and allow the user to modify answers either directly or by providing an additional prompt.

Repeat: Users can batch-fill multiple reporting frameworks to simplify and expand their ESG disclosure scope while avoiding extra effort to manually complete multiple questionnaires. After a report has been completed, it can then be added to the client document store so future reports can draw on it for knowledge. Report GenAI also supports bring your own report, which allows users to develop their own reporting specification (question and response model), which can then be imported into the application, as shown in the following figure.

Now that you have a description of the Report GenAI workflow, let’s explore how the architecture is built.

Architecture deep-dive: A serverless generative AI agent

The Report GenAI architecture consists of six components, as illustrated in the following figure: a user interface (UI), the generative AI executor, the web search endpoint, a text-to-SQL tool, the RAG tool, and an embedding generation pipeline. The UI, generative AI executor, and embedding generation pipeline orchestrate the workflow. The remaining three components are tools that work together to generate responses by performing the following actions:

Web search tool: Uses an internet search engine to retrieve content from public web pages.
Text-to-SQL tool: Generates and executes SQL queries to the company’s emissions database hosted by Gardenia Technologies. The tool uses natural language requests, such as “What were our Scope 2 emissions in 2024,” as input and returns the results from the emissions database.
Retrieval Augmented Generation (RAG) tool: Accesses information from the corporate document store (such as procedures, emails, and internal reports) and uses it as a knowledge base. This component acts as a retriever: given a plain text query, it returns the relevant text from the document store.

Let’s take a look at each of the components.

1: Lightweight UI hosted on auto-scaled Amazon ECS Fargate

Users access Report GenAI by using the containerized Streamlit frontend. Streamlit offers an appealing UI for data and chat apps, allowing data scientists and ML engineers to build convincing user experiences with relatively limited effort. While not typically used for large-scale deployments, Streamlit proved to be a suitable choice for the initial iteration of Report GenAI.

The frontend is hosted on a load-balanced and auto-scaled Amazon Elastic Container Service (Amazon ECS) service with the Fargate launch type. This implementation of the frontend not only reduces the management overhead but also suits the expected intermittent usage pattern of Report GenAI, which is anticipated to be spiky, with high-usage periods around the times when new reports must be generated (typically quarterly or yearly) and lower usage outside these windows. User authentication and authorization are handled by Amazon Cognito.

2: Central agent executor

The executor is an agent that uses reasoning capabilities of leading text-based foundation models (FMs) (for example, Anthropic’s Claude 3.5 Sonnet and Claude 3.5 Haiku) to break down user requests, gather information from document stores, and efficiently orchestrate tasks. The agent uses Reason and Act (ReAct), a prompt-based technique that enables large language models (LLMs) to generate both reasoning traces and task-specific actions in an interleaved manner. Reasoning traces help the model develop, track, and update action plans, while actions allow it to interface with a set of tools and information sources (also known as knowledge bases) that it can use to fulfill the task. The agent is prompted to think about an optimal sequence of actions to complete a given task with the tools at its disposal, observe the outcome, and iterate and improve until satisfied with the answer.

In combination, these tools provide the agent with capabilities to iteratively complete complex ESG reporting templates. The expected questions and response format for each questionnaire are captured by a report specification (ReportSpec) that uses Pydantic to enforce the desired output format for each reporting standard (for example, CDP or TCFD). This ReportSpec definition is inserted into the task prompt. The first iteration of Report GenAI used Claude 3.5 Sonnet on Amazon Bedrock. As more capable and more cost-effective LLMs become available on Amazon Bedrock (such as the recent release of Amazon Nova models), foundation models in Report GenAI can be swapped to remain up to date with the latest models.
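
As a rough illustration of what a report specification could look like, the following sketch models questions and expected response types and serializes them for insertion into a prompt. The post states that ReportSpec is built with Pydantic; this sketch deliberately swaps in standard-library dataclasses to stay dependency-free, and every field name (question_id, response_type, unit) is a hypothetical choice, not Gardenia’s actual schema.

```python
import json
from dataclasses import asdict, dataclass, field

# Hypothetical set of response types; the real ReportSpec fields are not given in the post.
ALLOWED_TYPES = {"text": str, "number": (int, float), "boolean": bool}


@dataclass
class QuestionSpec:
    question_id: str
    text: str
    response_type: str = "text"
    unit: str = ""

    def __post_init__(self):
        if self.response_type not in ALLOWED_TYPES:
            raise ValueError(f"unknown response_type: {self.response_type}")

    def validate(self, answer):
        """Return True if the answer matches the expected response type."""
        # bool is a subclass of int, so reject it explicitly for numeric answers.
        if self.response_type == "number" and isinstance(answer, bool):
            return False
        return isinstance(answer, ALLOWED_TYPES[self.response_type])


@dataclass
class ReportSpec:
    framework: str
    questions: list = field(default_factory=list)

    def to_prompt(self):
        """Serialize the specification so it can be inserted into a task prompt."""
        return json.dumps(asdict(self), indent=2)


spec = ReportSpec(
    framework="CDP",
    questions=[
        QuestionSpec("q1", "Describe your organization.", "text"),
        QuestionSpec("q2", "Total Scope 2 emissions in 2024?", "number", "tCO2e"),
    ],
)
print(spec.to_prompt())
```

With Pydantic, the same idea would be expressed as model classes, and type checks like the one in validate would be handled by the library.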

The agent executor is hosted on AWS Lambda and uses the open-source LangChain framework to implement the ReAct orchestration logic and the needed integration with memory, LLMs, tools, and knowledge bases. LangChain offers deep integration with AWS using the first-party langchain-aws module. The langchain-aws module provides useful one-line wrappers to call tools using AWS Lambda, draw from a chat memory backed by Amazon DynamoDB, and call LLMs on Amazon Bedrock. LangChain also provides fine-grained visibility into each step of the ReAct decision-making process to provide decision transparency.
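
To make the ReAct loop described in this section more concrete, here is a minimal framework-free sketch. It is not the LangChain implementation used in Report GenAI: the model call is replaced by a scripted stand-in (fake_llm), the tools are stubs with made-up results, and the Action: tool[input] text format is an assumption chosen for easy parsing.

```python
import re


def fake_llm(transcript):
    """Stand-in for a foundation model call; returns the next ReAct step."""
    if "Observation:" not in transcript:
        return "Thought: I need the emissions figure.\nAction: sql[scope 2 emissions 2024]"
    return "Thought: I have the figure.\nFinal Answer: Scope 2 emissions in 2024 were 1200 tCO2e."


# Hypothetical tools with canned results, keyed by the name the model uses.
TOOLS = {
    "sql": lambda query: "1200 tCO2e",
    "rag": lambda query: "no documents found",
}

ACTION_RE = re.compile(r"Action:\s*(\w+)\[(.*)\]")


def react(question, llm=fake_llm, tools=TOOLS, max_steps=5):
    """Alternate reasoning and tool calls until the model gives a final answer."""
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        step = llm(transcript)
        transcript += step + "\n"
        if "Final Answer:" in step:
            return step.split("Final Answer:", 1)[1].strip(), transcript
        match = ACTION_RE.search(step)
        if match is None:
            transcript += "Observation: could not parse an action\n"
            continue
        tool_name, tool_input = match.groups()
        tool = tools.get(tool_name)
        observation = tool(tool_input) if tool else f"unknown tool {tool_name}"
        # Feed the tool result back so the next reasoning step can use it.
        transcript += f"Observation: {observation}\n"
    return None, transcript


answer, trace = react("What were our Scope 2 emissions in 2024?")
print(trace)
```

The returned transcript is the kind of step-by-step trace that gives reviewers visibility into how an answer was produced.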

3: Web-search tool

The web search tool is hosted on Lambda and calls an internet search engine through an API. The agent executor retrieves the information returned from the search engine to formulate a response. Web searches can be used in combination with the RAG tool to retrieve public context needed to formulate responses for certain generic questions, such as providing a short description of the reporting company or entity.

4: Text-to-SQL tool

A large portion of ESG reporting requirements is analytical in nature and requires processing of large amounts of numerical or tabular data. For example, a reporting standard might ask for total emissions in a certain year or quarter. LLMs on their own are ill-equipped for questions of this nature. The Lambda-hosted text-to-SQL tool provides the agent with the required analytical capabilities. The tool uses a separate LLM to generate a valid SQL query given a natural language question along with the schema of an emissions database hosted by Gardenia Technologies. The generated query is then executed against this database and the results are passed back to the agent executor. SQL linters and error-correction loops are used for added robustness.
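
The generate, execute, and correct cycle can be sketched as follows. This is an illustrative approximation only: it uses an in-memory SQLite database with a made-up emissions table, a scripted stand-in for the SQL-generating LLM (fake_sql_llm) that fails on its first attempt, and a simple SELECT-only check in place of a real SQL linter.

```python
import sqlite3


def fake_sql_llm(question, schema, error=None):
    """Stand-in for the SQL-generating LLM; the first attempt is wrong on purpose."""
    if error is None:
        return "SELECT SUM(tco2e) FROM emission WHERE scope = 2 AND year = 2024"
    return "SELECT SUM(tco2e) FROM emissions WHERE scope = 2 AND year = 2024"


def text_to_sql(conn, question, llm=fake_sql_llm, max_attempts=3):
    """Generate a query, run it, and feed any database error back to the model."""
    schema = "\n".join(
        row[0]
        for row in conn.execute("SELECT sql FROM sqlite_master WHERE type = 'table'")
    )
    error = None
    for _ in range(max_attempts):
        query = llm(question, schema, error)
        # Minimal guard standing in for a linter: only read-only queries are run.
        if not query.lstrip().upper().startswith("SELECT"):
            error = "only SELECT statements are allowed"
            continue
        try:
            return query, conn.execute(query).fetchall()
        except sqlite3.Error as exc:
            error = str(exc)
    raise RuntimeError(f"no valid query after {max_attempts} attempts: {error}")


# Hypothetical schema and data for the example.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE emissions (year INTEGER, scope INTEGER, tco2e REAL)")
conn.executemany(
    "INSERT INTO emissions VALUES (?, ?, ?)",
    [(2024, 1, 300.0), (2024, 2, 700.0), (2024, 2, 500.0), (2023, 2, 900.0)],
)
query, rows = text_to_sql(conn, "What were our Scope 2 emissions in 2024?")
print(query, rows)
```

Passing the database error message back into the next generation attempt is what turns a single-shot query generator into an error-correction loop.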

5: Retrieval Augmented Generation (RAG) tool

Much of the information required to complete ESG reporting resides in internal, unstructured document stores and can consist of PDF or Word documents, Excel spreadsheets, and even emails. Given the size of these document stores, a common approach is to use knowledge bases with vector embeddings for semantic search. The RAG tool enables the agent executor to retrieve only the parts of the document store that are relevant to a question. The RAG tool is hosted on Lambda and uses an in-memory Faiss index as a vector store. The index is persisted on Amazon S3 and loaded on demand whenever required. This workflow is advantageous for the given workload because of the intermittent usage of Report GenAI. The RAG tool accepts a plain text query from the agent executor as input, embeds it using an embedding model on Amazon Bedrock, and performs a vector search against the vector database. The retrieved text is returned to the agent executor.
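
The retrieval step can be illustrated with a small pure-Python sketch. It is a stand-in, not the production design: a toy word-count embedding replaces the Amazon Bedrock embedding model, and a brute-force cosine similarity scan replaces the Faiss index. The class and function names are hypothetical.

```python
import math
from collections import Counter


def embed(text):
    """Toy embedding based on word counts; stands in for a real embedding model."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a if token in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


class InMemoryIndex:
    """Brute-force vector store: embeds chunks once, scans them all per query."""

    def __init__(self, chunks):
        self.chunks = list(chunks)
        self.vectors = [embed(chunk) for chunk in self.chunks]

    def search(self, query, k=2):
        query_vector = embed(query)
        scored = sorted(
            zip((cosine(query_vector, v) for v in self.vectors), self.chunks),
            key=lambda pair: pair[0],
            reverse=True,
        )
        # Drop chunks with no overlap at all rather than returning noise.
        return [chunk for score, chunk in scored[:k] if score > 0]


index = InMemoryIndex(
    [
        "Our water stewardship policy covers all sites.",
        "Scope 2 emissions fell after the switch to renewable electricity.",
        "The board reviews climate risk every quarter.",
    ]
)
hits = index.search("How did scope 2 emissions change?", k=1)
print(hits)
```

A library such as Faiss replaces the linear scan with an optimized nearest-neighbor index, which matters once the document store grows beyond a few thousand chunks.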

6: Asynchronous embedding generation pipeline

To make text searchable, it must be indexed in a vector database. AWS Step Functions provides a straightforward orchestration framework to manage this process: extracting plain text from the various document types, chunking it into manageable pieces, embedding the text, and then loading the embeddings into a vector database. Amazon Textract can be used as the first step for extracting text from visual-heavy documents like presentations or PDFs. An embedding model such as Amazon Titan Text Embeddings can then be used to embed the text and store it in a vector database such as LanceDB. Note that Amazon Bedrock Knowledge Bases provides an end-to-end retrieval service automating most of the steps that were just described. However, for this application, Gardenia Technologies opted for a fully flexible implementation to retain full control over each design choice of the RAG pipeline (text extraction approach, embedding model choice, and vector database choice) at the expense of higher management and development overhead.
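
The chunking step of such a pipeline might look like the following sketch. The post does not describe Gardenia’s chunking strategy, so the word-based windows, the chunk size, and the overlap used here are assumptions for illustration only.

```python
def chunk_text(text, chunk_size=40, overlap=10):
    """Split text into word-based chunks that overlap by `overlap` words."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once a chunk has reached the end of the text.
        if start + chunk_size >= len(words):
            break
    return chunks


# A synthetic 100-word document: w0 w1 ... w99.
doc = " ".join(f"w{i}" for i in range(100))
chunks = chunk_text(doc, chunk_size=40, overlap=10)
print(len(chunks), [c.split()[0] for c in chunks])
```

Overlapping windows reduce the chance that a fact is split across two chunks and therefore missed at retrieval time, at the cost of embedding some text twice.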

Evaluating agent performance

Ensuring accuracy and reliability in ESG reporting is paramount, given the regulatory and business implications of these disclosures. Report GenAI implements a dual-layer evaluation framework that combines human expertise with AI-based validation.

Validation is done both at a high level (such as evaluating full question responses) and sub-component level (such as breaking down to RAG, SQL search, and agent trajectory modules). Each of these has separate evaluation sets in addition to specific metrics of interest.

Human expert validation

The solution’s human-in-the-loop approach allows ESG experts to review and validate the AI-generated responses. This expert oversight serves as the primary quality control mechanism, making sure that generated reports align with both regulatory requirements and organization-specific context. The interactive chat interface enables experts to:

Verify factual accuracy of automated responses
Validate calculation methodologies
Verify proper context interpretation
Confirm regulatory compliance
Flag potential discrepancies or areas requiring additional review

A key feature in this process is the AI reasoning module, which displays the agent’s decision-making process, providing transparency into not only what answers were generated but how the agent arrived at those conclusions.

These expert reviews provide valuable training data that can be used to enhance system performance through refinements to RAG implementations, agent prompts, or underlying language models.

AI-powered quality assessment

Complementing human oversight, Report GenAI uses state-of-the-art LLMs on Amazon Bedrock as LLM judges. These models are prompted to evaluate:

Response accuracy relative to source documentation
Completeness of answers against question requirements
Consistency with provided context
Alignment with reporting framework guidelines
Mathematical accuracy of calculations

The LLM judge operates by:

Analyzing the original question and context
Reviewing the generated response and its supporting evidence
Comparing the response against retrieved data from structured and unstructured sources
Providing a confidence score and detailed assessment of the response quality
Flagging potential issues or areas requiring human review
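
The judge steps listed above could be wired together roughly as follows. This is a hedged sketch, not Report GenAI’s evaluation code: the prompt wording, the JSON verdict format, the 0.8 review threshold, and the scripted fake_judge stand-in for a judge model on Amazon Bedrock are all assumptions.

```python
import json

# Hypothetical judge prompt; the actual prompts are not published in the post.
JUDGE_PROMPT = """You are reviewing an ESG disclosure answer.
Question: {question}
Retrieved evidence: {evidence}
Answer: {answer}
Reply with JSON: {{"confidence": <0-1>, "issues": [<strings>]}}"""


def fake_judge(prompt):
    """Stand-in for a judge model call; returns a canned verdict."""
    return '{"confidence": 0.55, "issues": ["figure not found in evidence"]}'


def judge(question, evidence, answer, llm=fake_judge, review_threshold=0.8):
    """Score an answer against its evidence and decide whether a human should review it."""
    raw = llm(JUDGE_PROMPT.format(question=question, evidence=evidence, answer=answer))
    try:
        verdict = json.loads(raw)
        confidence = float(verdict["confidence"])
        issues = list(verdict.get("issues", []))
    except (ValueError, KeyError, TypeError):
        # Unparseable judge output is treated as lowest confidence.
        confidence, issues = 0.0, ["judge output could not be parsed"]
    return {
        "confidence": confidence,
        "issues": issues,
        "needs_human_review": confidence < review_threshold or bool(issues),
    }


result = judge("Total Scope 2 emissions in 2024?", "Scope 2: 1200 tCO2e", "1500 tCO2e")
print(result)
```

Treating malformed judge output as a reason for human review keeps the automated layer from silently passing answers it could not actually assess.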

This dual-validation approach creates a robust quality assurance framework that combines the pattern recognition capabilities of AI with human domain expertise. The system continuously improves through feedback loops, where human corrections and validations help refine the AI’s understanding and response generation capabilities.

How Omni Helicopters International cuts its reporting time by 75%

Omni Helicopters International cut their CDP reporting time by 75% using Gardenia’s Report GenAI solution. In previous years, OHI’s CDP reporting required one month of dedicated effort from their sustainability team. Using Report GenAI, OHI tracked their GHG inventory and relevant KPIs in real time and then prepared their 2024 CDP submission in just one week. Read the full story in Preparing Annual CDP Reports 75% Faster.

“In previous years we needed one month to complete the report, this year it took just one week,” said Renato Souza, Executive Manager QSEU at OTA. “The ‘Ask the Agent’ feature made it easy to draft our own answers. The tool was a great support and made things much easier compared to previous years.”

Conclusion

In this post, we stepped through how AWS and Gardenia collaborated to build Report GenAI, an automated ESG reporting solution that relieves ESG experts of the undifferentiated heavy lifting of data gathering and analysis associated with a growing ESG reporting burden. This frees up time for more impactful, strategic sustainability initiatives. Report GenAI is available on the AWS Marketplace today. To dive deeper and start developing your own generative AI app to fit your use case, explore this workshop on building an Agentic LLM assistant on AWS.

About the Authors

Federico Thibaud is the CTO and Co-Founder of Gardenia Technologies, where he leads the data and engineering teams, working on everything from data acquisition and transformation to algorithm design and product development. Before co-founding Gardenia, Federico worked at the intersection of finance and tech — building a trade finance platform as lead developer and developing quantitative strategies at a hedge fund.

Neil Holloway is Head of Data Science at Gardenia Technologies where he is focused on leveraging AI and machine learning to build and enhance software products. Neil holds a master’s degree in Theoretical Physics, where he designed and built programs to simulate high energy collisions in particle physics.

Fraser Price is a GenAI-focused Software Engineer at Gardenia Technologies in London, where he focuses on researching, prototyping and developing novel approaches to automation in the carbon accounting space using GenAI and machine learning. He received his MEng in Computing: AI from Imperial College London.

Christian Dunn is a Software Engineer based in London building ETL pipelines, web-apps, and other business solutions at Gardenia Technologies.

Frederica Schrager is a Marketing Analyst at Gardenia Technologies.

Karsten Schroer is a Senior ML Prototyping Architect at AWS. He supports customers in leveraging data and technology to drive sustainability of their IT infrastructure and build cloud-native data-driven solutions that enable sustainable operations in their respective verticals. Karsten joined AWS following his PhD studies in applied machine learning & operations management. He is truly passionate about technology-enabled solutions to societal challenges and loves to dive deep into the methods and application architectures that underlie these solutions.

Mohamed Ali Jamaoui is a Senior ML Prototyping Architect with over 10 years of experience in production machine learning. He enjoys solving business problems with machine learning and software engineering, and helping customers extract business value with ML. As part of AWS EMEA Prototyping and Cloud Engineering, he helps customers build business solutions that leverage innovations in MLOps, NLP, CV and LLMs.

Marco Masciola is a Senior Sustainability Scientist at AWS. In his role, Marco leads the development of IT tools and technical products to support AWS’s sustainability mission. He’s held various roles in the renewable energy industry, and leans on this experience to build tooling to support sustainable data center operations.

Hin Yee Liu is a Senior Prototyping Engagement Manager at Amazon Web Services. She helps AWS customers to bring their big ideas to life and accelerate the adoption of emerging technologies. Hin Yee works closely with customer stakeholders to identify, shape and deliver impactful use cases leveraging Generative AI, AI/ML, Big Data, and Serverless technologies using agile methodologies.

NVIDIA Nemotron Super 49B and Nano 8B reasoning models now available in Amazon Bedrock Marketplace and Amazon SageMaker JumpStart

June 11, 2025

by Niithiyn Vijeaswaran Amazon AWS

This post is co-written with Eliuth Triana Isaza, Abhishek Sawarkar, and Abdullahi Olaoye from NVIDIA.

Today, we are excited to announce that the Llama 3.3 Nemotron Super 49B V1 and Llama 3.1 Nemotron Nano 8B V1 are available in Amazon Bedrock Marketplace and Amazon SageMaker JumpStart. With this launch, you can now deploy NVIDIA’s newest reasoning models to build, experiment, and responsibly scale your generative AI ideas on AWS.

In this post, we demonstrate how to get started with these models on Amazon Bedrock Marketplace and SageMaker JumpStart.

About NVIDIA NIMs on AWS

NVIDIA NIM inference microservices integrate closely with AWS managed services such as Amazon Elastic Compute Cloud (Amazon EC2), Amazon Elastic Kubernetes Service (Amazon EKS), and Amazon SageMaker AI, to enable the deployment of generative AI models at scale. As part of NVIDIA AI Enterprise, available in the AWS Marketplace, NVIDIA NIM is a set of easy-to-use microservices designed to accelerate the deployment of generative AI. These prebuilt containers support a broad spectrum of generative AI models from open source community models to NVIDIA AI Foundation and custom models. NIM microservices are deployed with a single command for easy integration into generative AI applications using industry-standard APIs and just a few lines of code, or with a few actions in the SageMaker JumpStart console. Engineered to facilitate seamless generative AI inferencing at scale, NIM ensures generative AI applications can be deployed anywhere.

Overview of NVIDIA Nemotron models

In this section, we provide an overview of the NVIDIA Nemotron Super and Nano NIM microservices discussed in this post.

Llama 3.3 Nemotron Super 49B V1

Llama-3.3-Nemotron-Super-49B-v1 is an LLM derived from Meta Llama-3.3-70B-Instruct (the reference model). It is a reasoning model that is post-trained for reasoning, human chat preferences, and task execution, such as Retrieval-Augmented Generation (RAG) and tool calling. The model supports a context length of 128K tokens. Using a novel Neural Architecture Search (NAS) approach, NVIDIA greatly reduced the model’s memory footprint and increased its efficiency, so that it supports larger workloads and fits onto a single Hopper GPU (P5 instances) at high workloads (H200).

Llama 3.1 Nemotron Nano 8B V1

Llama-3.1-Nemotron-Nano-8B-v1 is an LLM derived from Meta Llama-3.1-8B-Instruct (the reference model). It is a reasoning model that is post-trained for reasoning, human chat preferences, and task execution, such as RAG and tool calling. The model supports a context length of 128K tokens and offers improvements in model accuracy over the reference model. The model fits on a single H100 or A100 GPU (P5 or P4 instances) and can be used locally.

About Amazon Bedrock Marketplace

Amazon Bedrock Marketplace plays a pivotal role in democratizing access to advanced AI capabilities through several key advantages:

Comprehensive model selection – Amazon Bedrock Marketplace offers an exceptional range of models, from proprietary to publicly available options, allowing organizations to find the perfect fit for their specific use cases.
Unified and secure experience – By providing a single access point for all models through the Amazon Bedrock APIs, Bedrock Marketplace significantly simplifies the integration process. Organizations can use these models securely, and for models that are compatible with the Amazon Bedrock Converse API, you can use the robust toolkit of Amazon Bedrock, including Amazon Bedrock Agents, Amazon Bedrock Knowledge Bases, Amazon Bedrock Guardrails, and Amazon Bedrock Flows.
Scalable infrastructure – Amazon Bedrock Marketplace offers configurable scalability through managed endpoints, allowing organizations to select their desired number of instances, choose appropriate instance types, define custom auto scaling policies that dynamically adjust to workload demands, and optimize costs while maintaining performance.

Deploy NVIDIA Llama Nemotron models in Amazon Bedrock Marketplace

Amazon Bedrock Marketplace gives you access to over 100 popular, emerging, and specialized foundation models (FMs) through Amazon Bedrock. To access the Nemotron reasoning models in Amazon Bedrock, complete the following steps:

On the Amazon Bedrock console, in the navigation pane under Foundation models, choose Model catalog.
You can also use the InvokeModel API to invoke the model. The InvokeModel API doesn’t support Converse APIs and other Amazon Bedrock tooling.
On the Model catalog page, filter for NVIDIA as a provider and choose the Llama 3.3 Nemotron Super 49B V1 model.

The Model detail page provides essential information about the model’s capabilities, pricing structure, and implementation guidelines. You can find detailed usage instructions, including sample API calls and code snippets for integration.

To begin using the Llama 3.3 Nemotron Super 49B V1 model, choose Subscribe to subscribe to the marketplace offer.
On the model detail page, choose Deploy.

You will be prompted to configure the deployment details for the model. The model ID will be pre-populated.

For Endpoint name, enter an endpoint name (1–50 alphanumeric characters).
For Number of instances, enter a number of instances (1–100).
For Instance type, choose your instance type. For optimal performance with Nemotron Super, a GPU-based instance type like ml.g6e.12xlarge is recommended.
Optionally, you can configure advanced security and infrastructure settings, including virtual private cloud (VPC) networking, service role permissions, and encryption settings. For most use cases, the default settings will work well. However, for production deployments, you should review these settings to align with your organization’s security and compliance requirements.
Choose Deploy to begin using the model.

When the deployment is complete, you can test its capabilities directly in the Amazon Bedrock playground. This is an excellent way to explore the model’s reasoning and text generation abilities before integrating it into your applications. The playground provides immediate feedback, helping you understand how the model responds to various inputs and letting you fine-tune your prompts for optimal results. A similar process can be followed to deploy the Llama 3.1 Nemotron Nano 8B V1 model.

Run inference with the deployed Nemotron endpoint

The following code example demonstrates how to perform inference using a deployed model through Amazon Bedrock using the InvokeModel API. The script initializes the bedrock_runtime client, configures inference parameters, and sends a request to generate text based on a user prompt. With Nemotron Super and Nano models, we can use a soft switch to toggle reasoning on and off. In the content field of the system message, set detailed thinking on or detailed thinking off.

Request

import boto3
import json

# Initialize Bedrock client
bedrock_runtime = boto3.client("bedrock-runtime")

# Configuration
MODEL_ID = ""  # Replace with Bedrock Marketplace endpoint arn

def invoke_model(prompt, max_tokens=1000, temperature=0.6, top_p=0.9):
    """
    Simple Bedrock model invocation
    """
    # Prepare model input
    body = {
        "model": "nvidia/llama-3.3-nemotron-super-49b-v1",
        "messages": [
            {   
                "role": "system",
                "content": "detailed thinking on" # to turn thinking off, simply toggle this
            },
            {
                "role": "user",
                "content": prompt  # Use the prompt parameter
            }
        ],
        "max_tokens": max_tokens,
        "temperature": temperature,
        "top_p": top_p
    }
    
    # Invoke model
    response = bedrock_runtime.invoke_model(
        modelId=MODEL_ID,
        body=json.dumps(body)
    )

    # Parse the response body and return it to the caller
    model_output = json.loads(response['body'].read())
    return model_output

# Example usage
if __name__ == "__main__":
    prompt = "Explain how a transformer neural network works."
    result = invoke_model(prompt)
    print(json.dumps(result, indent=2))
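
The reasoning soft switch and the reasoning trace in the model output can be handled with two small helpers, sketched here. The helper names are hypothetical, and the assumption that reasoning is wrapped in <think> and </think> tags is taken from the sample response shown in this post rather than from a documented contract.

```python
import re


def build_messages(prompt, thinking=True):
    """Build the messages list with the reasoning soft switch in the system message."""
    mode = "on" if thinking else "off"
    return [
        {"role": "system", "content": f"detailed thinking {mode}"},
        {"role": "user", "content": prompt},
    ]


THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(content):
    """Separate the <think>...</think> trace from the final answer text."""
    match = THINK_RE.search(content)
    if match is None:
        return "", content.strip()
    reasoning = match.group(1).strip()
    answer = (content[:match.start()] + content[match.end():]).strip()
    return reasoning, answer


# Made-up sample output in the same shape as the assistant message content.
sample = "<think>\nRecall the 2017 paper.\n</think>\n\nA transformer uses self-attention."
reasoning, answer = split_reasoning(sample)
print(build_messages("Explain transformers.", thinking=False)[0])
print(answer)
```

Splitting the trace from the answer makes it straightforward to log the reasoning for debugging while showing only the final text to end users.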

Response body

{
  "id": "chat-43c35c87d5a646c9b3dcaba461eefeaa",
  "object": "chat.completion",
  "created": 1749566119,
  "model": "nvidia/llama-3.3-nemotron-super-49b-v1",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "<think>\nOkay, I need to explain how a transformer neural network works. Let me start by recalling what I know about transformers. They were introduced in the paper \"Attention Is All You Need\" by Vaswani et al. in 2017, right? Before transformers, sequence models like RNNs and LSTMs were common, but they had issues with parallelization because of their sequential processing. Transformers changed that by using self-attention mechanisms, which allow parallel processing of the entire input sequence at once.\n\nSo, the main components of a transformer are the encoder and the decoder. Each encoder layer has two main sub-layers: multi-head self-attention and a fully connected feed-forward network. The decoder has these two plus an additional multi-head attention layer to interact with the encoder's output.\n\nLet me break down the encoder first. The self-attention mechanism is key here. Self-attention allows the model to attend to all positions in the input sequence simultaneously and weigh their importance. The multi-head part means that instead of just one attention mechanism, you have multiple (like 8 or 12) running in parallel, each with a different linear projection. This helps capture different types of relationships between tokens.\n\nEach head computes attention by using Query (Q), Key (K), and Value (V) matrices. These are derived from the input by linear transformations. The attention weights are computed as the dot product of Q and K transposed, divided by the square root of the key's dimensionality (for stability). Then, these weights are used to compute a weighted sum of the Value vectors. This gives the output for one attention head. Since there are multiple heads, their outputs are concatenated and then linearly transformed.\n\nAfter the multi-head attention, there's a feed-forward network (FFN) applied individually to each position. It's a simple neural network with a ReLU activation in the middle. Both the attention and FFN layers use residual connections followed by layer normalization.\n\nThe decoder is similar but has an extra attention layer that attends to the encoder's output. Also, in the decoder's self-attention, they use masking to prevent looking ahead in the sequence, which is important for tasks like translation where you generate one token at a time.\n\nPositional encodings are added to the input embeddings because transformers don't have recurrence or convolution, so they need a way to preserve sequence order. The encodings are fixed and based on sine and cosine functions, which allows the model to handle sequences of varying lengths.\n\nI should also mention the overall architecture: the input goes through an embedding layer plus positional encoding, then through a stack of encoder layers. The output from the encoder goes into the decoder's stack of layers, and the final output goes through a linear layer and softmax for tasks like machine translation.\n\nWait, maybe I should explain each part step by step. Start with the overall structure, then dive into encoder and decoder layers, then the self-attention mechanism, positional encodings, and how it's different from RNNs.\n\nPotential pitfalls: Confusing encoder and decoder layers, not explaining why multi-head attention is useful, or omitting the role of layer normalization and residual connections. Also, ensuring that the explanation of attention weights calculation is clear.\n\nLet me check if I missed anything. Oh, the decoder's structure: after self-attention, there's encoder-decoder attention, then FFN. And the output linear layer. Also, the training process involves teacher forcing, but maybe that's beyond the basic explanation.\n\nI need to make sure the explanation is clear for someone who knows basics of neural networks but not transformers. Use analogies if possible, like how attention is like selectively focusing on different parts of the input when processing each element.\n\nAnother point: the efficiency. Since attention is O(n^2) with sequence length n, but it's parallelizable, which is better than O(n) for RNNs which can't be parallelized. But for very long sequences, transformers can be computationally expensive.\n\nAlso, the difference between BERT-style transformers (encoder-only) and sequence-to-sequence models (both encoder and decoder). But the question is about transformer neural networks in general, so covering the basic encoder-decoder structure is sufficient.\n\nOkay, putting it all together in a structured way: introduction, architecture overview, encoder details, decoder details, key components (self-attention, positional encodings), advantages, and maybe an example application.\n</think>\n\nA transformer neural network is a revolutionary architecture introduced in 2017 for sequence-to-sequence tasks, replacing recurrent neural networks (RNNs) with self-attention mechanisms for parallel processing. Here's a structured explanation:\n\n### **1. Architecture Overview**\n- **Encoder**: Takes in a sequence (e.g., text) and outputs a continuous representation.\n- **Decoder**: Generates an output sequence (e.g., translated text) one token at a time, conditioned on the"
      },
      "logprobs": null,
      "finish_reason": "length",
      "stop_reason": null
    }
  ],
  "usage": {
    "prompt_tokens": 29,
    "total_tokens": 1029,
    "completion_tokens": 1000
  },
  "prompt_logprobs": null
}
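In reasoning mode the model's working notes arrive inside a `<think>...</think>` block in the same `content` string as the final answer, as the response above shows. The following is a minimal sketch of how the two parts might be separated on the client side. The helper name `split_reasoning` is our own and not part of any SDK, and the sketch assumes the tags appear exactly as in the sample output.

```python
import re

# Hypothetical helper (not part of the NIM or SageMaker SDKs).
THINK_BLOCK = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(content):
    """Return (reasoning, answer) from an assistant message's content string."""
    match = THINK_BLOCK.search(content)
    if match is not None:
        return match.group(1).strip(), content[match.end():].strip()
    if "<think>" in content:
        # The response was cut off (finish_reason "length") before </think>,
        # so everything after the opening tag is still reasoning.
        return content.split("<think>", 1)[1].strip(), ""
    # No reasoning block at all, for example with "detailed thinking off".
    return "", content.strip()


sample = "<think>\nRecall the encoder and decoder.\n</think>\n\nA transformer is ..."
reasoning, answer = split_reasoning(sample)
print("Reasoning:", reasoning)
print("Answer:", answer)
```

With a real response, the input would be `output["choices"][0]["message"]["content"]`.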

Amazon SageMaker JumpStart overview

SageMaker JumpStart is a fully managed service that offers state-of-the-art foundation models for use cases such as content writing, code generation, question answering, copywriting, summarization, classification, and information retrieval. It provides a collection of pre-trained models that you can deploy quickly, accelerating the development and deployment of ML applications. One of the key components of SageMaker JumpStart is model hubs, which offer a vast catalog of pre-trained models, such as Mistral, for a variety of tasks. You can now discover and deploy Llama 3.3 Nemotron Super 49B V1 and Llama 3.1 Nemotron Nano 8B V1 in Amazon SageMaker Studio or programmatically through the SageMaker Python SDK, so you can derive model performance and MLOps controls with Amazon SageMaker AI features such as Amazon SageMaker Pipelines, Amazon SageMaker Debugger, or container logs. The models are deployed in a secure AWS environment and in your VPC, helping to support data security for enterprise security needs.

Prerequisites

Before getting started with deployment, make sure your AWS Identity and Access Management (IAM) service role for Amazon SageMaker has the AmazonSageMakerFullAccess permission policy attached. To deploy the NVIDIA Llama Nemotron models successfully, confirm one of the following:

Make sure your IAM role has the following permissions and you have the authority to make AWS Marketplace subscriptions in the AWS account used:
- aws-marketplace:ViewSubscriptions
- aws-marketplace:Unsubscribe
- aws-marketplace:Subscribe
If your account is already subscribed to the model, you can skip to the Deploy section below. Otherwise, please start by subscribing to the model package and then move to the Deploy section.
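The three AWS Marketplace permissions listed above could be granted through an inline IAM policy. The following sketch builds such a policy document as a Python dictionary. The statement ID and the use of `"Resource": "*"` are assumptions made for illustration; review them against your organization's least-privilege requirements before attaching anything to a role.

```python
import json

# Actions taken from the prerequisites above.
MARKETPLACE_ACTIONS = [
    "aws-marketplace:ViewSubscriptions",
    "aws-marketplace:Unsubscribe",
    "aws-marketplace:Subscribe",
]


def marketplace_policy():
    """Build an IAM policy document that allows managing Marketplace subscriptions."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "AllowMarketplaceSubscriptions",  # hypothetical label
                "Effect": "Allow",
                "Action": MARKETPLACE_ACTIONS,
                "Resource": "*",  # assumption for this sketch
            }
        ],
    }


policy_json = json.dumps(marketplace_policy(), indent=2)
print(policy_json)
```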

Subscribe to the model package

To subscribe to the model package, complete the following steps:

Open the model package listing page and choose Llama 3.3 Nemotron Super 49B V1 or Llama 3.1 Nemotron Nano 8B V1.
On the AWS Marketplace listing, choose Continue to subscribe.
On the Subscribe to this software page, review and choose Accept Offer if you and your organization agree with the EULA, pricing, and support terms.
Choose Continue to configuration and then choose an AWS Region where you have the service quota for the desired instance type.

A product ARN will be displayed. This is the model package ARN that you need to specify while creating a deployable model using Boto3.

(Option-1) Deploy NVIDIA Llama Nemotron Super and Nano models on SageMaker JumpStart

For those new to SageMaker JumpStart, we use SageMaker Studio to access the models. The Llama 3.3 Nemotron Super 49B V1 and Llama 3.1 Nemotron Nano 8B V1 models are both available on SageMaker JumpStart. Deployment starts when you choose the Deploy option; you might be prompted to subscribe to the model on AWS Marketplace first. If you are already subscribed, you can move forward by choosing the second Deploy button. After deployment finishes, you will see that an endpoint is created. You can test the endpoint by passing a sample inference request payload or by selecting the testing option using the SDK.

(Option-2) Deploy NVIDIA Llama Nemotron using the SageMaker SDK

In this section we will walk through deploying the Llama 3.3 Nemotron Super 49B V1 model through the SageMaker SDK. A similar process can be followed for deploying the Llama 3.1 Nemotron Nano 8B V1 model as well.

Define the SageMaker model using the Model Package ARN

To deploy the model using the SDK, copy the product ARN from the previous step and specify it in the model_package_arn in the following code:

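# sm is assumed to be a boto3 SageMaker client, for example boto3.client("sagemaker");
# role and model_package_arn are assumed to be defined earlier.
sm_model_name = "nim-llama-3-3-nemotron-super-49b-v1"
create_model_response = sm.create_model(
    ModelName=sm_model_name,
    PrimaryContainer={
        'ModelPackageName': model_package_arn
    },
    ExecutionRoleArn=role,
    EnableNetworkIsolation=True  # Marketplace model packages run with network isolation
)
print("Model Arn: " + create_model_response["ModelArn"])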

Create the endpoint configuration

Next, create the endpoint configuration by specifying the instance type, in this case ml.g6e.12xlarge. Make sure your account-level service quota allows at least one ml.g6e.12xlarge instance for endpoint usage. NVIDIA also provides a list of instance types that support deployment; refer to the AWS Marketplace listing for each model to see them. To request a service quota increase, see AWS service quotas.

endpoint_config_name = sm_model_name

create_endpoint_config_response = sm.create_endpoint_config(
    EndpointConfigName=endpoint_config_name,
    ProductionVariants=[
        {
            'VariantName': 'AllTraffic',
            'ModelName': sm_model_name,
            'InitialInstanceCount': 1,
            'InstanceType': 'ml.g6e.12xlarge',
            'InferenceAmiVersion': 'al2-ami-sagemaker-inference-gpu-2',
            'RoutingConfig': {'RoutingStrategy': 'LEAST_OUTSTANDING_REQUESTS'},
            'ModelDataDownloadTimeoutInSeconds': 3600,
            'ContainerStartupHealthCheckTimeoutInSeconds': 3600,
        }
    ]
)
print("Endpoint Config Arn: " + create_endpoint_config_response["EndpointConfigArn"])

Create the endpoint

Using the previous endpoint configuration, we create a new SageMaker endpoint and then poll its status in a loop, as shown below, until the deployment finishes. This typically takes around 5-10 minutes. The status changes to InService once the deployment is successful.

endpoint_name = endpoint_config_name
create_endpoint_response = sm.create_endpoint(
    EndpointName=endpoint_name,
    EndpointConfigName=endpoint_config_name
)
print("Endpoint Arn: " + create_endpoint_response["EndpointArn"])

Deploy the endpoint

Let’s now deploy and track the status of the endpoint.

import time

resp = sm.describe_endpoint(EndpointName=endpoint_name)
status = resp["EndpointStatus"]
print("Status: " + status)

while status == "Creating":
    time.sleep(60)
    resp = sm.describe_endpoint(EndpointName=endpoint_name)
    status = resp["EndpointStatus"]
    print("Status: " + status)
    
print("Arn: " + resp["EndpointArn"])
print("Status: " + status)
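The loop above runs for as long as the status stays at Creating and then prints whatever status it ends on. As a variation, the following sketch wraps the same `describe_endpoint` polling in a function that stops after a timeout and raises an error if the endpoint ends in any state other than InService. The function name, the timeout default, and the injectable `sleep` argument are our own choices for illustration.

```python
import time


def wait_for_endpoint(sm_client, endpoint_name, poll_seconds=60,
                      timeout_seconds=3600, sleep=time.sleep):
    """Poll describe_endpoint until the endpoint is InService, fails, or times out."""
    waited = 0
    while True:
        resp = sm_client.describe_endpoint(EndpointName=endpoint_name)
        status = resp["EndpointStatus"]
        if status == "InService":
            return resp
        if status != "Creating":
            # For example "Failed"; describe_endpoint may include a FailureReason.
            reason = resp.get("FailureReason", "no reason given")
            raise RuntimeError(f"Endpoint {endpoint_name} is {status}: {reason}")
        if waited >= timeout_seconds:
            raise TimeoutError(f"Endpoint {endpoint_name} still Creating after {waited}s")
        sleep(poll_seconds)
        waited += poll_seconds


# Usage with a real client (not executed here):
#   resp = wait_for_endpoint(sm, endpoint_name)
#   print("Arn: " + resp["EndpointArn"])
```

If you prefer not to write the loop yourself, the boto3 SageMaker client also ships an `endpoint_in_service` waiter that can be obtained with `sm.get_waiter("endpoint_in_service")`.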

Run Inference with Llama 3.3 Nemotron Super 49B V1

Once the endpoint is in service, we can send a sample text as an inference request. NIM on SageMaker supports the OpenAI API inference request format. For an explanation of the supported parameters, see Creates a model in the NVIDIA documentation.

Real-Time inference example

The following code examples illustrate how to perform real-time inference using the Llama 3.3 Nemotron Super 49B V1 model in non-reasoning and reasoning mode.

Non-reasoning mode

Perform real-time inference in non-reasoning mode:

import json

# client is assumed to be a boto3 SageMaker runtime client,
# for example boto3.client("sagemaker-runtime").
payload_model = "nvidia/llama-3.3-nemotron-super-49b-v1"
messages = [
    {
        "role": "system",
        "content": "detailed thinking off"
    },
    {
        "role": "user",
        "content": "Explain how a transformer neural network works."
    }
]

payload = {
    "model": payload_model,
    "messages": messages,
    "max_tokens": 3000
}

response = client.invoke_endpoint(
    EndpointName=endpoint_name, ContentType="application/json", Body=json.dumps(payload)
)

output = json.loads(response["Body"].read().decode("utf8"))
print(json.dumps(output, indent=2))

Reasoning mode

Perform real-time inference in reasoning mode:

payload_model = "nvidia/llama-3.3-nemotron-super-49b-v1"
messages = [
    {
        "role": "system",
        "content": "detailed thinking on"
    },
    {
        "role": "user",
        "content": "Explain how a transformer neural network works."
    }
]
payload = {
    "model": payload_model,
    "messages": messages,
    "max_tokens": 3000
}

response = client.invoke_endpoint(
    EndpointName=endpoint_name, ContentType="application/json", Body=json.dumps(payload)
)

output = json.loads(response["Body"].read().decode("utf8"))
print(json.dumps(output, indent=2))
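The two requests above differ only in the system prompt, which switches between "detailed thinking off" and "detailed thinking on". A small helper can make that toggle explicit. The following is a sketch; `build_payload` is a name we made up for this post and is not part of any SDK.

```python
import json

MODEL_ID = "nvidia/llama-3.3-nemotron-super-49b-v1"


def build_payload(prompt, reasoning, max_tokens=3000):
    """Build a chat payload with reasoning toggled through the system prompt."""
    mode = "on" if reasoning else "off"
    return {
        "model": MODEL_ID,
        "messages": [
            {"role": "system", "content": f"detailed thinking {mode}"},
            {"role": "user", "content": prompt},
        ],
        "max_tokens": max_tokens,
    }


body = json.dumps(build_payload("Explain how a transformer neural network works.", reasoning=False))
print(body)
```

The resulting string can be passed as the `Body` argument of `invoke_endpoint`.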

Streaming inference

NIM on SageMaker also supports streaming inference. You can enable it by setting stream to True in the payload and by using the invoke_endpoint_with_response_stream method.

Streaming inference:

payload_model = "nvidia/llama-3.3-nemotron-super-49b-v1"
messages = [
    {
      "role": "system",
      "content": "detailed thinking on"  # this can be toggled off to disable reasoning
    },
    {
      "role":"user",
      "content":"Explain how a transformer neural network works."
    }
  ]

payload = {
  "model": payload_model,
  "messages": messages,
  "max_tokens": 3000,
  "stream": True
}

response = client.invoke_endpoint_with_response_stream(
    EndpointName=endpoint_name,
    Body=json.dumps(payload),
    ContentType="application/json",
    Accept="application/jsonlines",
)

We can use some post-processing code for the streaming output that reads the byte-chunks coming from the endpoint, pieces them into full JSON messages, extracts any new text the model produced, and immediately prints that text to output.

event_stream = response['Body']
accumulated_data = ""
start_marker = 'data:'
end_marker = '"finish_reason":null}]}'

for event in event_stream:
    try:
        payload = event.get('PayloadPart', {}).get('Bytes', b'')
        if payload:
            data_str = payload.decode('utf-8')

            accumulated_data += data_str

            # Process accumulated data when a complete response is detected
            while start_marker in accumulated_data and end_marker in accumulated_data:
                start_idx = accumulated_data.find(start_marker)
                end_idx = accumulated_data.find(end_marker) + len(end_marker)
                full_response = accumulated_data[start_idx + len(start_marker):end_idx]
                accumulated_data = accumulated_data[end_idx:]

                try:
                    data = json.loads(full_response)
                    content = data.get('choices', [{}])[0].get('delta', {}).get('content', "")
                    if content:
                        print(content, end='', flush=True)
                except json.JSONDecodeError:
                    continue
    except Exception as e:
        print(f"\nError processing event: {e}", flush=True)
        continue
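The post-processing code above finds complete messages by searching for a fixed end marker. An alternative sketch, shown below, buffers the raw bytes and parses one line at a time. It assumes that each message arrives as a `data: {...}` line terminated by a newline and that the stream may end with `data: [DONE]`, as is common for OpenAI-compatible streaming APIs; verify both assumptions against your endpoint's actual output. Because bytes are decoded only after a full line is available, a multi-byte character split across two chunks doesn't cause a decode error.

```python
import json


def iter_stream_text(chunks):
    """Yield text deltas from byte chunks that carry 'data: {...}' lines."""
    buffer = b""
    for chunk in chunks:
        buffer += chunk
        # Parse only complete lines; a trailing partial line stays in the buffer.
        while b"\n" in buffer:
            line, buffer = buffer.split(b"\n", 1)
            text = line.decode("utf-8").strip()
            if not text.startswith("data:"):
                continue
            data = text[len("data:"):].strip()
            if data == "[DONE]":
                return
            try:
                message = json.loads(data)
            except json.JSONDecodeError:
                continue
            choices = message.get("choices") or [{}]
            delta = choices[0].get("delta", {}).get("content")
            if delta:
                yield delta


# Simulated chunks; the second message is deliberately split across two chunks.
demo_chunks = [
    b'data: {"choices":[{"delta":{"content":"Hel"},"finish_reason":null}]}\n\nda',
    b'ta: {"choices":[{"delta":{"content":"lo"},"finish_reason":null}]}\n\n',
    b"data: [DONE]\n\n",
]
streamed_text = "".join(iter_stream_text(demo_chunks))
print(streamed_text)
```

With a real response, the chunks could be produced by a generator such as `(event.get("PayloadPart", {}).get("Bytes", b"") for event in response["Body"])`.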

Clean up

To avoid unwanted charges, complete the steps in this section to clean up your resources.

Delete the Amazon Bedrock Marketplace deployment

If you deployed the model using Amazon Bedrock Marketplace, complete the following steps:

On the Amazon Bedrock console, in the navigation pane in the Foundation models section, choose Marketplace deployments.
In the Managed deployments section, locate the endpoint you want to delete.
Select the endpoint, and on the Actions menu, choose Delete.
Verify the endpoint details to make sure you’re deleting the correct deployment:
1. Endpoint name
2. Model name
3. Endpoint status
Choose Delete to delete the endpoint.
In the Delete endpoint confirmation dialog, review the warning message, enter confirm, and choose Delete to permanently remove the endpoint.

Delete the SageMaker JumpStart Endpoint

The SageMaker JumpStart model you deployed will incur costs if you leave it running. Use the following code to delete the endpoint if you want to stop incurring charges. For more details, see Delete Endpoints and Resources.

sm.delete_model(ModelName=sm_model_name)
sm.delete_endpoint_config(EndpointConfigName=endpoint_config_name)
sm.delete_endpoint(EndpointName=endpoint_name)
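If one of the delete calls fails, the remaining resources stay in place and continue to accrue charges. The following sketch wraps the same three calls so that each is attempted regardless of earlier failures. It deletes the endpoint first because the endpoint is the resource that keeps instances running. The function name and the broad exception handling are choices made for this illustration; in practice you would likely catch `botocore.exceptions.ClientError`.

```python
def delete_deployment(sm_client, endpoint_name, endpoint_config_name, model_name):
    """Delete the endpoint, its configuration, and the model; return any failures."""
    steps = [
        ("endpoint", lambda: sm_client.delete_endpoint(EndpointName=endpoint_name)),
        ("endpoint config", lambda: sm_client.delete_endpoint_config(
            EndpointConfigName=endpoint_config_name)),
        ("model", lambda: sm_client.delete_model(ModelName=model_name)),
    ]
    failures = []
    for label, call in steps:
        try:
            call()
        except Exception as exc:  # broad catch keeps the sketch dependency-free
            failures.append((label, str(exc)))
    return failures


# Usage with a real client (not executed here):
#   failures = delete_deployment(sm, endpoint_name, endpoint_config_name, sm_model_name)
#   for label, error in failures:
#       print(f"Could not delete {label}: {error}")
```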

Conclusion

NVIDIA’s Llama Nemotron models deliver optimized AI reasoning capabilities and are now available on AWS through Amazon Bedrock Marketplace and Amazon SageMaker JumpStart. The Llama 3.3 Nemotron Super 49B V1, derived from Meta’s 70B model, uses Neural Architecture Search (NAS) to achieve a reduced 49B parameter count while maintaining high accuracy, enabling deployment on a single H200 GPU despite its sophisticated capabilities. Meanwhile, the compact Llama 3.1 Nemotron Nano 8B V1 fits on a single H100 or A100 GPU (P5 or P4 instances) while improving on Meta’s reference model accuracy, making it ideal for efficiency-conscious applications. Both models support extensive 128K token context windows and are post-trained for enhanced reasoning, RAG capabilities, and tool calling, offering organizations flexible options to balance performance and computational requirements for enterprise AI applications.

With this launch, organizations can now leverage the advanced reasoning capabilities of these models while benefiting from the scalable infrastructure of AWS. Through either the intuitive UI or just a few lines of code, you can quickly deploy these powerful language models to transform your AI applications with minimal effort. These complementary platforms provide straightforward access to NVIDIA’s robust technologies, allowing teams to immediately begin exploring and implementing sophisticated reasoning capabilities in their enterprise solutions.

About the authors

Niithiyn Vijeaswaran is a Generative AI Specialist Solutions Architect with the Third-Party Model Science team at AWS. His area of focus is AWS AI accelerators (AWS Neuron). He holds a Bachelor’s in Computer Science and Bioinformatics.

Chase Pinkerton is a Startups Solutions Architect at Amazon Web Services. He holds a Bachelor’s in Computer Science with a minor in Economics from Tufts University. He’s passionate about helping startups grow and scale their businesses. When not working, he enjoys road cycling, hiking, playing volleyball, and photography.

Varun Morishetty is a Software Engineer with Amazon SageMaker JumpStart and Bedrock Marketplace. Varun received his Bachelor’s degree in Computer Science from Northeastern University. In his free time, he enjoys cooking, baking and exploring New York City.

Brian Kreitzer is a Partner Solutions Architect at Amazon Web Services (AWS). He works with partners to define business requirements, provide architectural guidance, and design solutions for the Amazon Marketplace.

Eliuth Triana Isaza is a Developer Relations Manager at NVIDIA, empowering Amazon’s AI MLOps, DevOps, scientists, and AWS technical experts to master the NVIDIA computing stack for accelerating and optimizing generative AI foundation models spanning from data curation, GPU training, model inference, and production deployment on AWS GPU instances. In addition, Eliuth is a passionate mountain biker, skier, and tennis and poker player.

Abhishek Sawarkar is a product manager in the NVIDIA AI Enterprise team working on integrating NVIDIA AI Software in Cloud MLOps platforms. He focuses on integrating the NVIDIA AI end-to-end stack within cloud platforms and enhancing user experience on accelerated computing.

Abdullahi Olaoye is a Senior AI Solutions Architect at NVIDIA, specializing in integrating NVIDIA AI libraries, frameworks, and products with cloud AI services and open source tools to optimize AI model deployment, inference, and generative AI workflows. He collaborates with AWS to enhance AI workload performance and drive adoption of NVIDIA-powered AI and generative AI solutions.

Automate customer support with Amazon Bedrock, LangGraph, and Mistral models

June 10, 2025

by Deepesh Dhapola Amazon AWS

AI agents are transforming the landscape of customer support by bridging the gap between large language models (LLMs) and real-world applications. These intelligent, autonomous systems are poised to revolutionize customer service across industries, ushering in a new era of human-AI collaboration and problem-solving. By harnessing the power of LLMs and integrating them with specialized tools and APIs, agents can tackle complex, multistep customer support tasks that were previously beyond the reach of traditional AI systems. As we look to the future, AI agents will play a crucial role in the following areas:

Enhancing decision-making – Providing deeper, context-aware insights to improve customer support outcomes
Automating workflows – Streamlining customer service processes, from initial contact to resolution, across various channels
Human-AI interactions – Enabling more natural and intuitive interactions between customers and AI systems
Innovation and knowledge integration – Generating new solutions by combining diverse data sources and specialized knowledge to address customer queries more effectively
Ethical AI practices – Helping provide more transparent and explainable AI systems to address customer concerns and build trust

Building and deploying AI agent systems for customer support is a step toward unlocking the full potential of generative AI in this domain. As these systems evolve, they will transform customer service, expand possibilities, and open new doors for AI in enhancing customer experiences.

In this post, we demonstrate how to use Amazon Bedrock and LangGraph to build a personalized customer support experience for an ecommerce retailer. By integrating the Mistral Large 2 and Pixtral Large models, we guide you through automating key customer support workflows such as ticket categorization, order details extraction, damage assessment, and generating contextual responses. These principles are applicable across various industries, but we use the ecommerce domain as our primary example to showcase the end-to-end implementation and best practices. This post provides a comprehensive technical walkthrough to help you enhance your customer service capabilities and explore the latest advancements in LLMs and multimodal AI.

LangGraph is a powerful framework built on top of LangChain that enables the creation of cyclical, stateful graphs for complex AI agent workflows. It uses a directed graph structure where nodes represent individual processing steps (like calling an LLM or using a tool), edges define transitions between steps, and state is maintained and passed between nodes during execution. This architecture is particularly valuable for customer support automation involving workflows. LangGraph’s advantages include built-in visualization, logging (traces), human-in-the-loop capabilities, and the ability to organize complex workflows in a more maintainable way than traditional Python code. This post provides details on how to do the following:

Use Amazon Bedrock and LangGraph to build intelligent, context-aware customer support workflows
Integrate data in a helpdesk tool, like JIRA, in the LangChain workflow
Use LLMs and vision language models (VLMs) in the workflow to perform context-specific tasks
Extract information from images to aid in decision-making
Compare images to assess product damage claims
Generate responses for the customer support tickets

Solution overview

In this solution, customers initiate support requests through email, which are automatically converted into new support tickets in Atlassian Jira Service Management. The customer support automation solution then takes over, identifying the intent behind each query, categorizing the tickets, and assigning them to a bot user for further processing. The solution uses LangGraph to orchestrate a workflow of AI agents that extract key identifiers such as transaction IDs and order numbers from the support ticket. It analyzes the query and uses these identifiers to call relevant tools, extracting additional information from the database to generate a comprehensive and context-aware response. After the response is prepared, it’s updated in Jira for human support agents to review before sending the response back to the customer. This process is illustrated in the following figure. This solution is capable of extracting information not only from the ticket body and title but also from attached images like screenshots and external databases.

The solution uses two foundation models (FMs) from Amazon Bedrock, each selected based on its specific capabilities and the complexity of the tasks involved. For instance, the Pixtral model is used for vision-related tasks like image comparison and ID extraction, whereas the Mistral Large 2 model handles a variety of tasks like ticket categorization, response generation, and tool calling. Additionally, the solution includes fraud detection and prevention capabilities. It can identify fraudulent product returns by comparing the stock product image with the returned product image to verify if they match and assess whether the returned product is genuinely damaged. This integration of advanced AI models with automation tools enhances the efficiency and reliability of the customer support process, facilitating timely resolutions and security against fraudulent activities. LangGraph provides a framework for orchestrating the information flow between agents, featuring built-in state management and checkpointing to facilitate seamless process continuity. This functionality allows the inclusion of initial ticket summaries and descriptions in the State object, with additional information appended in subsequent steps of the workflows. By maintaining this evolving context, LangGraph enables LLMs to generate context-aware responses. See the following code:

from langgraph.graph import MessagesState

# class to hold state information
class JiraAppState(MessagesState):
    key: str
    summary: str
    description: str
    attachments: list
    category: str
    response: str
    transaction_id: str
    order_no: str
    usage: list

The framework integrates effortlessly with Amazon Bedrock and LLMs, supporting task-specific diversification by using cost-effective models for simpler tasks while reducing the risks of exceeding model quotas. Furthermore, LangGraph offers conditional routing for dynamic workflow adjustments based on intermediate results, and its modular design facilitates the addition or removal of agents to extend system capabilities.
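To make the idea of conditional routing concrete, the following sketch shows a routing function that picks the next node from the ticket category held in the state. The category labels, the mapping, and the fallback node name "Generate Response" are hypothetical and chosen only for this example; the two extraction node names match those used later in this post.

```python
from typing import TypedDict


class TicketState(TypedDict, total=False):
    """Trimmed-down stand-in for the state object; only the fields used here."""
    category: str
    transaction_id: str
    order_no: str


# Hypothetical mapping from ticket category to the next node's name.
ROUTES = {
    "Transaction": "Extract Transaction ID",
    "Delivery": "Extract Order Number",
    "Refunds": "Extract Order Number",
}
DEFAULT_ROUTE = "Generate Response"  # hypothetical fallback node


def route_by_category(state):
    """Return the name of the node that should run next for this ticket."""
    return ROUTES.get(state.get("category", ""), DEFAULT_ROUTE)


print(route_by_category({"category": "Delivery"}))

# In a LangGraph StateGraph, a function like this could be registered with
# graph_builder.add_conditional_edges("Assign Ticket Category in JIRA", route_by_category)
```

Keeping the routing logic in a plain function makes it straightforward to unit test without running the whole graph.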

Responsible AI

It’s crucial for customer support automation applications to validate inputs and make sure LLM outputs are secure and responsible. Amazon Bedrock Guardrails can significantly enhance customer support automation applications by providing configurable safeguards that monitor and filter both user inputs and AI-generated responses, making sure interactions remain safe, relevant, and aligned with organizational policies. By using features such as content filters, which detect and block harmful categories like hate speech, insults, sexual content, and violence, as well as denied topics to help prevent discussions on sensitive or restricted subjects (for example, legal or medical advice), customer support applications can avoid generating or amplifying inappropriate or defiant information. Additionally, guardrails can help redact personally identifiable information (PII) from conversation transcripts, protecting user privacy and fostering trust. These measures not only reduce the risk of reputational harm and regulatory violations but also create a more positive and secure experience for customers, allowing support teams to focus on resolving issues efficiently while maintaining high standards of safety and responsibility.

The following diagram illustrates this architecture.

Observability

Along with Responsible AI, observability is vital for customer support applications to provide deep, real-time visibility into model performance, usage patterns, and operational health, enabling teams to proactively detect and resolve issues. With comprehensive observability, you can monitor key metrics such as latency and token consumption, and track and analyze input prompts and outputs for quality and compliance. This level of insight helps identify and mitigate risks like hallucinations, prompt injections, toxic language, and PII leakage, helping make sure that customer interactions remain safe, reliable, and aligned with regulatory requirements.
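As a simple illustration of tracking latency and token consumption, the following sketch records one entry per model call and then aggregates the entries. The record fields (`step`, `input_tokens`, `output_tokens`, `latency_s`) and the helper names are hypothetical; in a real workflow the token counts would come from the model response metadata rather than being set by hand.

```python
import time
from contextlib import contextmanager


@contextmanager
def track_call(usage_log, step, clock=time.perf_counter):
    """Time a block of code and append a usage record to usage_log."""
    record = {"step": step, "input_tokens": 0, "output_tokens": 0}
    start = clock()
    try:
        yield record
    finally:
        record["latency_s"] = clock() - start
        usage_log.append(record)


def summarize(usage_log):
    """Aggregate call count, token totals, and total latency."""
    return {
        "calls": len(usage_log),
        "input_tokens": sum(r["input_tokens"] for r in usage_log),
        "output_tokens": sum(r["output_tokens"] for r in usage_log),
        "total_latency_s": sum(r["latency_s"] for r in usage_log),
    }


usage = []
with track_call(usage, "categorize ticket") as rec:
    # Placeholder values standing in for a model call's reported token usage.
    rec["input_tokens"], rec["output_tokens"] = 120, 8

summary = summarize(usage)
print(summary)
```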

Prerequisites

In this post, we use Atlassian Jira Service Management as an example. You can use the same general approach to integrate with other service management tools that provide APIs for programmatic access. The configuration required in Jira includes:

A Jira Service Management project with an API token to enable programmatic access
The following custom fields:
- Name: Category, Type: Select List (multiple choices)
- Name: Response, Type: Text Field (multi-line)
A bot user to assign tickets

The following code shows a sample Jira configuration:

JIRA_API_TOKEN = "<JIRA_API_TOKEN>"
JIRA_USERNAME = "<JIRA_USERNAME>"
JIRA_INSTANCE_URL = "https://<YOUR_JIRA_INSTANCE_NAME>.atlassian.net/"
JIRA_PROJECT_NAME = "<JIRA_PROJECT_NAME>"
JIRA_PROJECT_KEY = "<JIRA_PROJECT_KEY>"
JIRA_BOT_USER_ID = '<JIRA_BOT_USER_ID>'
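Rather than hardcoding these values, an application might read them from environment variables and fail early when one is missing. The following is a minimal sketch of that idea; the `JiraConfig` dataclass and `load_jira_config` function are hypothetical names, and the variable names simply mirror the sample configuration above.

```python
import os
from dataclasses import dataclass

REQUIRED = (
    "JIRA_API_TOKEN",
    "JIRA_USERNAME",
    "JIRA_INSTANCE_URL",
    "JIRA_PROJECT_NAME",
    "JIRA_PROJECT_KEY",
    "JIRA_BOT_USER_ID",
)


@dataclass(frozen=True)
class JiraConfig:
    api_token: str
    username: str
    instance_url: str
    project_name: str
    project_key: str
    bot_user_id: str


def load_jira_config(env=None):
    """Read Jira settings from a mapping (os.environ by default) and validate them."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise ValueError("Missing Jira settings: " + ", ".join(missing))
    return JiraConfig(
        api_token=env["JIRA_API_TOKEN"],
        username=env["JIRA_USERNAME"],
        # Normalize to exactly one trailing slash.
        instance_url=env["JIRA_INSTANCE_URL"].rstrip("/") + "/",
        project_name=env["JIRA_PROJECT_NAME"],
        project_key=env["JIRA_PROJECT_KEY"],
        bot_user_id=env["JIRA_BOT_USER_ID"],
    )


# Demo with placeholder values instead of real credentials.
demo_env = {
    "JIRA_API_TOKEN": "token",
    "JIRA_USERNAME": "bot@example.com",
    "JIRA_INSTANCE_URL": "https://example.atlassian.net",
    "JIRA_PROJECT_NAME": "Support",
    "JIRA_PROJECT_KEY": "SUP",
    "JIRA_BOT_USER_ID": "bot-user",
}
config = load_jira_config(demo_env)
print(config.instance_url)
```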

In addition to Jira, the following services and Python packages are required:

A valid AWS account.
An AWS Identity and Access Management (IAM) role in the account that has sufficient permissions to create the necessary resources.
Access to the following models hosted on Amazon Bedrock:
- Mistral Large 2 (model ID: mistral.mistral-large-2407-v1:0).
- Pixtral Large (model ID: us.mistral.pixtral-large-2502-v1:0). The Pixtral Large model is available in Amazon Bedrock under cross-Region inference profiles.
A LangGraph application up and running locally. For instructions, see Quickstart: Launch Local LangGraph Server.

For this post, we use the us-west-2 AWS Region. For details on available Regions, see Amazon Bedrock endpoints and quotas.

The source code of this solution is available in the GitHub repository. This is example code; you should conduct your own due diligence and adhere to the principle of least privilege.

Implementation with LangGraph

At the core of customer support automation is a suite of specialized tools and functions designed to collect, analyze, and integrate data from service management systems and a SQLite database. These tools serve as the foundation of our system, empowering it to deliver context-aware responses. In this section, we delve into the essential components that power our system.

BedrockClient class

The BedrockClient class is implemented in the cs_bedrock.py file. It provides a wrapper for interacting with Amazon Bedrock services, specifically for managing language models and content safety guardrails in customer support applications. It simplifies the process of initializing language models with appropriate configurations and managing content safety guardrails. This class is used by LangChain and LangGraph to invoke LLMs on Amazon Bedrock.

This class also provides methods to create guardrails for responsible AI implementation. The following Amazon Bedrock Guardrails policy filters sexual content, violence, hate, insults, misconduct, and prompt attacks, and helps prevent models from generating stock and investment advice, profanity, and hateful, violent, or sexual content. Additionally, it helps prevent exposing vulnerabilities in models by mitigating prompt attacks.

# guardrails policy

contentPolicyConfig={
    'filtersConfig': [
        {
            'type': 'SEXUAL',
            'inputStrength': 'MEDIUM',
            'outputStrength': 'MEDIUM'
        },
        {
            'type': 'VIOLENCE',
            'inputStrength': 'MEDIUM',
            'outputStrength': 'MEDIUM'
        },
        {
            'type': 'HATE',
            'inputStrength': 'MEDIUM',
            'outputStrength': 'MEDIUM'
        },
        {
            'type': 'INSULTS',
            'inputStrength': 'MEDIUM',
            'outputStrength': 'MEDIUM'
        },
        {
            'type': 'MISCONDUCT',
            'inputStrength': 'MEDIUM',
            'outputStrength': 'MEDIUM'
        },
        {
            'type': 'PROMPT_ATTACK',
            'inputStrength': 'LOW',
            'outputStrength': 'NONE'
        }
    ]
},
wordPolicyConfig={
    'wordsConfig': [
        {'text': 'stock and investment advice'}
    ],
    'managedWordListsConfig': [
        {'type': 'PROFANITY'}
    ]
},
contextualGroundingPolicyConfig={
    'filtersConfig': [
        {
            'type': 'GROUNDING',
            'threshold': 0.65
        },
        {
            'type': 'RELEVANCE',
            'threshold': 0.75
        }
    ]
}
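A typo in a filter strength or threshold is easier to catch before the policy is sent to the service. The following sketch checks a content policy and a contextual grounding policy shaped like the ones above. The helper name is hypothetical, the set of strengths reflects the values used by Amazon Bedrock Guardrails content filters, and the accepted threshold range of 0 to 1 is an assumption for this example.

```python
VALID_STRENGTHS = {"NONE", "LOW", "MEDIUM", "HIGH"}


def validate_guardrail_policy(content_policy, grounding_policy):
    """Return a list of human-readable problems; an empty list means no issues found."""
    problems = []
    for item in content_policy.get("filtersConfig", []):
        for key in ("inputStrength", "outputStrength"):
            if item.get(key) not in VALID_STRENGTHS:
                problems.append(f"{item.get('type')}: invalid {key} {item.get(key)!r}")
    for item in grounding_policy.get("filtersConfig", []):
        threshold = item.get("threshold")
        # Assumed range for this sketch: 0 <= threshold <= 1.
        if not isinstance(threshold, (int, float)) or not 0 <= threshold <= 1:
            problems.append(f"{item.get('type')}: invalid threshold {threshold!r}")
    return problems


content_policy = {
    "filtersConfig": [
        {"type": "HATE", "inputStrength": "MEDIUM", "outputStrength": "MEDIUM"},
        {"type": "PROMPT_ATTACK", "inputStrength": "LOW", "outputStrength": "NONE"},
    ]
}
grounding_policy = {
    "filtersConfig": [
        {"type": "GROUNDING", "threshold": 0.65},
        {"type": "RELEVANCE", "threshold": 0.75},
    ]
}
issues = validate_guardrail_policy(content_policy, grounding_policy)
print(issues)
```

Policy dictionaries like these are the kind of values passed as keyword arguments to the `create_guardrail` call of the boto3 `bedrock` client, which also expects a guardrail name and the messages to return for blocked inputs and outputs.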

Database class

The Database class is defined in the cs_db.py file. This class is designed to facilitate interactions with a SQLite database. It’s responsible for creating a local SQLite database and importing synthetic data related to customers, orders, refunds, and transactions. By doing so, it makes sure that the necessary data is readily available for various operations. Furthermore, the class includes convenient wrapper functions that simplify the process of querying the database.
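To give a feel for what such a wrapper might look like, the following sketch creates an in-memory SQLite database, loads two synthetic orders, and exposes a lookup function. The table layout, column names, and sample rows are invented for this example and are not taken from cs_db.py.

```python
import sqlite3


def create_demo_db():
    """Create an in-memory database with a small synthetic orders table."""
    conn = sqlite3.connect(":memory:")
    conn.row_factory = sqlite3.Row  # allows rows to be converted to dicts
    conn.execute(
        "CREATE TABLE orders (order_no TEXT PRIMARY KEY, customer_id TEXT, status TEXT)"
    )
    conn.executemany(
        "INSERT INTO orders VALUES (?, ?, ?)",
        [("ORD-1001", "C-1", "DELIVERED"), ("ORD-1002", "C-2", "IN_TRANSIT")],
    )
    conn.commit()
    return conn


def find_order(conn, order_no):
    """Return the order as a dict, or None when the order number is unknown."""
    row = conn.execute(
        "SELECT order_no, customer_id, status FROM orders WHERE order_no = ?",
        (order_no,),  # parameterized to avoid SQL injection from ticket text
    ).fetchone()
    return dict(row) if row else None


conn = create_demo_db()
print(find_order(conn, "ORD-1002"))
```

Using query parameters rather than string formatting matters here because the order number is extracted from customer-supplied ticket text.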

JiraSM class

The JiraSM class is implemented in the cs_jira_sm.py file. It serves as an interface for interacting with Jira Service Management. It establishes a connection to Jira by using the API token, user name, and instance URL, all of which are configured in the .env file. This setup provides secure and flexible access to the Jira instance. The class is designed to handle various ticket operations, including reading tickets and assigning them to a preconfigured bot user. Additionally, it supports downloading attachments from tickets and updating custom fields as needed.

CustomerSupport class

The CustomerSupport class is implemented in the cs_cust_support_flow.py file. This class encapsulates the customer support processing logic by using LangGraph and Amazon Bedrock. Using LangGraph nodes and tools, this class orchestrates the customer support workflow. The workflow initially determines the category of the ticket by analyzing its content and classifying it as related to transactions, deliveries, refunds, or other issues. It updates the support ticket with the category detected. Following this, the workflow extracts pertinent information such as transaction IDs or order numbers, which might involve analyzing both text and images, and queries the database for relevant details. The next step is response generation, which is context-aware and adheres to content safety guidelines while maintaining a professional tone. Finally, the workflow integrates with Jira, assigning categories, updating responses, and managing attachments as needed.

The LangGraph orchestration is implemented in the build_graph function, as illustrated in the following code. This function also generates a visual representation of the workflow using a Mermaid graph for better clarity and understanding. This setup supports an efficient and structured approach to handling customer support tasks.

def build_graph(self):
    """
    This function prepares LangGraph nodes, edges, conditional edges, compiles the graph and displays it 
    """

    # create StateGraph object
    graph_builder = StateGraph(JiraAppState)

    # add nodes to the graph
    graph_builder.add_node("Determine Ticket Category", self.determine_ticket_category_tool)
    graph_builder.add_node("Assign Ticket Category in JIRA", self.assign_ticket_category_in_jira_tool)
    graph_builder.add_node("Extract Transaction ID", self.extract_transaction_id_tool)
    graph_builder.add_node("Extract Order Number", self.extract_order_number_tool)
    graph_builder.add_node("Find Transaction Details", self.find_transaction_details_tool)
    
    graph_builder.add_node("Find Order Details", self.find_order_details_tool)
    graph_builder.add_node("Generate Response", self.generate_response_tool)
    graph_builder.add_node("Update Response in JIRA", self.update_response_in_jira_tool)

    graph_builder.add_node("tools", ToolNode([StructuredTool.from_function(self.assess_damaged_delivery), StructuredTool.from_function(self.find_refund_status)]))
    
    # add edges to connect nodes
    graph_builder.add_edge(START, "Determine Ticket Category")
    graph_builder.add_edge("Determine Ticket Category", "Assign Ticket Category in JIRA")
    graph_builder.add_conditional_edges("Assign Ticket Category in JIRA", self.decide_ticket_flow_condition)
    graph_builder.add_edge("Extract Order Number", "Find Order Details")
    
    graph_builder.add_edge("Extract Transaction ID", "Find Transaction Details")
    graph_builder.add_conditional_edges("Find Order Details", self.order_query_decision, ["Generate Response", "tools"])
    graph_builder.add_edge("tools", "Generate Response")
    graph_builder.add_edge("Find Transaction Details", "Generate Response")
    
    graph_builder.add_edge("Generate Response", "Update Response in JIRA")
    graph_builder.add_edge("Update Response in JIRA", END)

    # compile the graph
    checkpoint = MemorySaver()
    app = graph_builder.compile(checkpointer=checkpoint)
    self.graph_app = app
    self.util.log_data(data="Workflow compiled successfully", ticket_id='NA')

    # Visualize the graph
    display(Image(app.get_graph().draw_mermaid_png(draw_method=MermaidDrawMethod.API)))

    return app
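
The conditional edges in build_graph depend on routing functions such as decide_ticket_flow_condition, which the post does not show. In LangGraph, a function passed to add_conditional_edges can return the name of the next node. The sketch below illustrates that idea under stated assumptions: the "category" state key and the category labels are hypothetical, while the node names are taken from build_graph.

```python
def decide_ticket_flow(state: dict) -> str:
    """Return the name of the next node for an already categorized ticket."""
    # "category" is an assumed state key; the labels below are illustrative.
    routes = {
        "Transaction": "Extract Transaction ID",
        "Delivery": "Extract Order Number",
        "Refunds": "Extract Order Number",
    }
    # Anything unrecognized skips data lookup and goes straight to the response.
    return routes.get(state.get("category", ""), "Generate Response")
```

Keeping the routing in a plain function makes it straightforward to unit test the branching logic without invoking an LLM or compiling the graph.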

LangGraph generates the following Mermaid diagram to visually represent the workflow.

Utility class

The Utility class, implemented in the cs_util.py file, provides essential functions to support the customer support automation. It encompasses utilities for logging, file handling, usage metric tracking, and image processing operations. The class is designed as a central hub for various helper methods, streamlining common tasks across the application. By consolidating these operations, it promotes code reusability and maintainability within the system. Its functionality makes sure that the automation framework remains efficient and organized.

A key feature of this class is its comprehensive logging capabilities. It provides methods to log informational messages, errors, and significant events directly into the cs_logs.log file. Additionally, it tracks Amazon Bedrock LLM token usage and latency metrics, facilitating detailed performance monitoring. The class also logs the execution flow of application-generated prompts and LLM generated responses, aiding in troubleshooting and debugging. These log files can be seamlessly integrated with standard log pusher agents, allowing for automated transfer to preferred log monitoring systems. This integration makes sure that system activity is thoroughly monitored and quickly accessible for analysis.
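
To make the logging behavior more concrete, here is a minimal sketch built on Python's standard logging module. It is not the code from cs_util.py: the logger name, the function names, and the key=value line format are assumptions chosen for illustration. Only the cs_logs.log file name comes from the post.

```python
import logging


def make_logger(log_file: str = "cs_logs.log") -> logging.Logger:
    """Create (or reuse) a logger that writes to the given file."""
    logger = logging.getLogger("customer_support")  # hypothetical logger name
    logger.setLevel(logging.INFO)
    if not logger.handlers:  # avoid adding duplicate handlers on repeated calls
        handler = logging.FileHandler(log_file)
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)s %(message)s")
        )
        logger.addHandler(handler)
    return logger


def format_usage(ticket_id: str, model: str, input_tokens: int,
                 output_tokens: int, latency: int) -> str:
    """Format one LLM call as a single key=value line that log agents can parse."""
    return (f"ticket={ticket_id} model={model} input_tokens={input_tokens} "
            f"output_tokens={output_tokens} latency={latency}")
```

A call such as make_logger().info(format_usage("AS-6", "mistral.mistral-large-2407-v1:0", 385, 2, 653)) would append one usage line to the log file.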

Run the agentic workflow

Now that the customer support workflow is defined, it can be executed for various ticket types. The following functions use the provided ticket key to fetch the corresponding Jira ticket and download available attachments. Additionally, they initialize the State object with details such as the ticket key, summary, description, attachment file path, and a system prompt for the LLM. This State object is used throughout the workflow execution.

def generate_response_for_ticket(ticket_id: str):
    
    llm, vision_llm, llm_with_guardrails = bedrock_client.init_llms(ticket_id=ticket_id)
    cust_support = CustomerSupport(llm=llm, vision_llm=vision_llm, llm_with_guardrails=llm_with_guardrails)
    app   = cust_support.build_graph()
    
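    # The graph is compiled with a MemorySaver checkpointer, so invoke() needs a thread ID.
    # Assumption: the ticket ID is used as the thread ID here; the original snippet
    # relies on a `thread` config that is defined elsewhere.
    thread = {"configurable": {"thread_id": ticket_id}}

    state = cust_support.get_jira_ticket(key=ticket_id)
    state = app.invoke(state, thread)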
    
    util.log_usage(state['usage'], ticket_id=ticket_id)
    util.log_execution_flow(state["messages"], ticket_id=ticket_id)

The following code snippet invokes the workflow for the Jira ticket with key AS-6:

# initialize classes and create bedrock guardrails
bedrock_client = BedrockClient()
util = Utility()
guardrail_id = bedrock_client.create_guardrail()

# process a JIRA ticket
generate_response_for_ticket(ticket_id='AS-6')

The following screenshot shows the Jira ticket before processing. Notice that the Response and Category fields are empty, and the ticket is unassigned.

The following screenshot shows the Jira ticket after processing. The Category field is updated as Refunds and the Response field is updated by the AI-generated content.

This logs LLM usage information as follows:

Model                               Input Tokens  Output Tokens  Latency
mistral.mistral-large-2407-v1:0              385              2      653
mistral.mistral-large-2407-v1:0              452             27      884
mistral.mistral-large-2407-v1:0             1039             36     1197
us.mistral.pixtral-large-2502-v1:0          4632            425     5952
mistral.mistral-large-2407-v1:0             1770            144     4556

Clean up

Delete any IAM roles and policies created specifically for this post. Delete the local copy of this post’s code.

If you no longer need access to an Amazon Bedrock FM, you can remove access from it. For instructions, see Add or remove access to Amazon Bedrock foundation models.

Delete the temporary files and guardrails used in this post with the following code:

shutil.rmtree(util.get_temp_path())
bedrock_client.delete_guardrail()

Conclusion

In this post, we developed an AI-driven customer support solution using Amazon Bedrock, LangGraph, and Mistral models. This agent-based workflow handles diverse customer queries by integrating multiple data sources and extracting relevant information from tickets or screenshots. It also evaluates damage claims to mitigate fraudulent returns. The solution is designed for flexibility, so you can add new conditions and data sources as business needs evolve. With this multi-agent approach, you can build robust, scalable, and intelligent systems that redefine the capabilities of generative AI in customer support.

Want to explore further? Check out the following GitHub repo. There, you can observe the code in action and experiment with the solution yourself. The repository includes step-by-step instructions for setting up and running the multi-agent system, along with code for interacting with data sources and agents, routing data, and visualizing workflows.

About the authors

Deepesh Dhapola is a Senior Solutions Architect at AWS India, specializing in helping financial services and fintech clients optimize and scale their applications on the AWS Cloud. With a strong focus on trending AI technologies, including generative AI, AI agents, and the Model Context Protocol (MCP), Deepesh uses his expertise in machine learning to design innovative, scalable, and secure solutions. Passionate about the transformative potential of AI, he actively explores cutting-edge advancements to drive efficiency and innovation for AWS customers. Outside of work, Deepesh enjoys spending quality time with his family and experimenting with diverse culinary creations.

Build responsible AI applications with Amazon Bedrock Guardrails

June 10, 2025

by Divya Muralidharan Amazon AWS

As organizations embrace generative AI, they face critical challenges in making sure their applications align with their designed safeguards. Although foundation models (FMs) offer powerful capabilities, they can also introduce unique risks, such as generating harmful content, exposing sensitive information, being vulnerable to prompt injection attacks, and returning model hallucinations.

Amazon Bedrock Guardrails has helped address these challenges for multiple organizations, such as MAPFRE, KONE, Fiserv, PagerDuty, Aha, and more. Just as traditional applications require multi-layered security, Amazon Bedrock Guardrails implements essential safeguards across model, prompt, and application levels, blocking up to 88% more undesirable and harmful multimodal content. Amazon Bedrock Guardrails helps filter over 75% of hallucinated responses in Retrieval Augmented Generation (RAG) and summarization use cases, and stands as the first and only safeguard using Automated Reasoning to prevent factual errors from hallucinations.

In this post, we show how to implement safeguards using Amazon Bedrock Guardrails in a healthcare insurance use case.

Solution overview

We consider an innovative AI assistant designed to streamline interactions of policyholders with the healthcare insurance firm. With this AI-powered solution, policyholders can check coverage details, submit claims, find in-network providers, and understand their benefits through natural, conversational interactions. The assistant provides all-day support, handling routine inquiries while allowing human agents to focus on complex cases. To help enable secure and compliant operations of our assistant, we use Amazon Bedrock Guardrails to serve as a critical safety framework. Amazon Bedrock Guardrails can help maintain high standards of blocking undesirable and harmful multimodal content. This not only protects the users, but also builds trust in the AI system, encouraging wider adoption and improving overall customer experience in healthcare insurance interactions.

This post walks you through the capabilities of Amazon Bedrock Guardrails from the AWS Management Console. Refer to the following GitHub repo for information about creating, updating, and testing Amazon Bedrock Guardrails using the SDK.

Amazon Bedrock Guardrails provides configurable safeguards to help safely build generative AI applications at scale. It evaluates user inputs and model responses based on specific policies, working with all large language models (LLMs) on Amazon Bedrock, fine-tuned models, and external FMs using the ApplyGuardrail API. The solution integrates seamlessly with Amazon Bedrock Agents and Amazon Bedrock Knowledge Bases, so organizations can apply multiple guardrails across applications with tailored controls.

Guardrails can be implemented in two ways: direct integration with Invoke APIs (InvokeModel and InvokeModelWithResponseStream) and Converse APIs (Converse and ConverseStream) for models hosted on Amazon Bedrock, applying safeguards during inference, or through the flexible ApplyGuardrail API, which enables independent content evaluation without model invocation. This second method is ideal for assessing inputs or outputs at various application stages and works with custom or third-party models that are not hosted on Amazon Bedrock. Both approaches empower developers to implement use case-specific safeguards aligned with responsible AI policies, helping to block undesirable and harmful multimodal content from generative AI applications.
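
The two integration paths can be sketched with the AWS SDK for Python (Boto3) as follows. This is a minimal illustration rather than code from the post: the helper function names are hypothetical, and the model ID, guardrail ID, and guardrail version are values you supply. The request fields follow the Converse and ApplyGuardrail APIs of the bedrock-runtime client.

```python
def converse_kwargs(model_id: str, guardrail_id: str, version: str, prompt: str) -> dict:
    """Arguments for Converse with a guardrail applied during inference."""
    return {
        "modelId": model_id,
        "messages": [{"role": "user", "content": [{"text": prompt}]}],
        "guardrailConfig": {
            "guardrailIdentifier": guardrail_id,
            "guardrailVersion": version,
            "trace": "enabled",  # include guardrail trace details in the response
        },
    }


def apply_guardrail_kwargs(guardrail_id: str, version: str, text: str,
                           source: str = "INPUT") -> dict:
    """Arguments for ApplyGuardrail, which evaluates text without invoking a model."""
    if source not in ("INPUT", "OUTPUT"):
        raise ValueError("source must be 'INPUT' or 'OUTPUT'")
    return {
        "guardrailIdentifier": guardrail_id,
        "guardrailVersion": version,
        "source": source,
        "content": [{"text": {"text": text}}],
    }


def run_both(model_id: str, guardrail_id: str, version: str, prompt: str):
    """Call both paths; requires boto3, AWS credentials, and an existing guardrail."""
    import boto3  # imported lazily so the helpers above work without AWS access

    runtime = boto3.client("bedrock-runtime")
    checked = runtime.apply_guardrail(
        **apply_guardrail_kwargs(guardrail_id, version, prompt)
    )
    answer = runtime.converse(
        **converse_kwargs(model_id, guardrail_id, version, prompt)
    )
    # action is "GUARDRAIL_INTERVENED" or "NONE"; stopReason reports why generation ended.
    return checked["action"], answer["stopReason"]
```

With the second path, you can pass source="OUTPUT" to check a response produced by a model hosted outside Amazon Bedrock before returning it to the user.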

The following diagram depicts the six safeguarding policies offered by Amazon Bedrock Guardrails.

Prerequisites

Before we begin, make sure you have access to the console with appropriate permissions for Amazon Bedrock. If you haven’t set up Amazon Bedrock yet, refer to Getting started in the Amazon Bedrock console.

Create a guardrail

To create a guardrail for our healthcare insurance assistant, complete the following steps:

On the Amazon Bedrock console, choose Guardrails in the navigation pane.
Choose Create guardrail.
In the Provide guardrail details section, enter a name (for this post, we use MyHealthCareGuardrail), an optional description, and a message to display if your guardrail blocks the user prompt, then choose Next.

Configure multimodal content filters

Security is paramount when building AI applications. With image content filters in Amazon Bedrock Guardrails, content filters can now detect and filter both text and image content through six protection categories: Hate, Insults, Sexual, Violence, Misconduct, and Prompt Attacks.

In the Configure content filters section, for maximum protection, especially in sensitive sectors like healthcare in our example use case, set your confidence thresholds to High across all categories for both text and image content.
Enable prompt attack protection to prevent system instruction tampering, and use input tagging to maintain accurate classification of system prompts, then choose Next.

Denied topics

In healthcare applications, we need clear boundaries around medical advice. Let’s configure Amazon Bedrock Guardrails to prevent users from attempting disease diagnosis, which should be handled by qualified healthcare professionals.

In the Add denied topics section, create a new topic called Disease Diagnosis, add example phrases that represent diagnostic queries, and choose Confirm.

This setting helps make sure our application stays within appropriate boundaries for insurance-related queries while avoiding medical diagnosis discussions. For example, when users ask questions like “Do I have diabetes?” or “What’s causing my headache?”, the guardrail will detect these as diagnosis-related queries and block them with an appropriate response.

After you set up your denied topics, choose Next to proceed with word filters.

Word filters

Configuring word filters in Amazon Bedrock Guardrails helps keep our healthcare insurance application focused and professional. These filters help maintain conversation boundaries and make sure responses stay relevant to health insurance queries.

Let’s set up word filters for two key purposes:

Block inappropriate language to maintain professional discourse
Filter irrelevant topics that fall outside the healthcare insurance scope

To set them up, do the following:

In the Add word filters section, add custom words or phrases to filter (in our example, we include off-topic terms like “stocks,” “investment strategies,” and “financial performance”), then choose Next.

Sensitive information filters

With sensitive information filters, you can configure filters to block email addresses, phone numbers, and other personally identifiable information (PII), as well as set up custom regex patterns for industry-specific data requirements. For example, healthcare providers can use these filters to automatically block the PII types they specify, which supports HIPAA compliance. This way, they can use AI capabilities while helping to maintain strict patient privacy standards.

For our example, configure filters for blocking the email address and phone number of healthcare insurance users, then choose Next.

Contextual grounding checks

We use Amazon Bedrock Guardrails contextual grounding and relevance checks in our application to help validate model responses, detect hallucinations, and support alignment with reference sources.

Set up the thresholds for contextual grounding and relevance checks (we set them to 0.7), then choose Next.
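
For readers who prefer the SDK, the console choices for denied topics, word filters, sensitive information filters, and contextual grounding map roughly to the request below. This is a sketch under assumptions, not the post's code: the function name, the blocked messages, and the topic definition wording are placeholders, and content filters and Automated Reasoning are omitted for brevity. The field names follow the CreateGuardrail API.

```python
def healthcare_guardrail_config() -> dict:
    """Build keyword arguments for the CreateGuardrail API (illustrative values)."""
    return {
        "name": "MyHealthCareGuardrail",
        # Placeholder messages shown when the guardrail blocks input or output.
        "blockedInputMessaging": "Sorry, I can't help with that request.",
        "blockedOutputsMessaging": "Sorry, I can't provide that response.",
        "topicPolicyConfig": {
            "topicsConfig": [{
                "name": "Disease Diagnosis",
                # Assumed wording; write a definition that fits your use case.
                "definition": "Requests to diagnose a disease or medical condition.",
                "examples": ["Do I have diabetes?", "What's causing my headache?"],
                "type": "DENY",
            }]
        },
        "wordPolicyConfig": {
            "wordsConfig": [
                {"text": "stocks"},
                {"text": "investment strategies"},
                {"text": "financial performance"},
            ],
            "managedWordListsConfig": [{"type": "PROFANITY"}],
        },
        "sensitiveInformationPolicyConfig": {
            "piiEntitiesConfig": [
                {"type": "EMAIL", "action": "BLOCK"},
                {"type": "PHONE", "action": "BLOCK"},
            ]
        },
        "contextualGroundingPolicyConfig": {
            "filtersConfig": [
                {"type": "GROUNDING", "threshold": 0.7},
                {"type": "RELEVANCE", "threshold": 0.7},
            ]
        },
    }
```

With credentials configured, boto3.client("bedrock").create_guardrail(**healthcare_guardrail_config()) would create the guardrail and return its ID and version.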

Automated Reasoning checks

Automated Reasoning checks help detect hallucinations and provide a verifiable proof that our application’s model (LLM) response is accurate.

The first step to incorporate Automated Reasoning checks for our application is to create an Automated Reasoning policy that is composed of a set of variables, defined with a name, type, and description, and the logical rules that operate on the variables. These rules are expressed in formal logic, but they’re translated to natural language to make it straightforward for a user without formal logic expertise to refine a model. Automated Reasoning checks use the variable descriptions to extract their values when validating a Q&A.

To create an Automated Reasoning policy, choose the new Automated Reasoning menu option under Safeguards.
Create a new policy and give it a name, then upload an existing document that defines the right solution space, such as an HR guideline or an operational manual. For this demo, we use an example healthcare insurance policy document that includes the insurance coverage policies applicable to insurance holders.

Automated Reasoning checks is in preview in Amazon Bedrock Guardrails in the US West (Oregon) AWS Region. To request to be considered for access to the preview today, contact your AWS account team.

Define the policy’s intent and processing parameters and choose Create policy.

The system now initiates an automated process to create your Automated Reasoning policy. This process involves analyzing your document, identifying key concepts, breaking down the document into individual units, translating these natural language units into formal logic, validating the translations, and finally combining them into a comprehensive logical model. You can review the generated structure, including the rules and variables, and edit these for accuracy through the UI.

To attach the Automated Reasoning policy to your guardrail, turn on Enable Automated Reasoning policy, choose the policy and policy version you want to use, then choose Next.

Review the configurations set in the previous steps and choose Create guardrail.

Test your guardrail

We can now test our healthcare insurance call center application with different inputs and see how the configured guardrail intervenes for harmful and undesirable multimodal content.

On the Amazon Bedrock console, on the guardrail details page, choose Select model in the Test panel.

Choose your model, then choose Apply.

For our example, we use the Amazon Nova Lite FM, which is a low-cost multimodal model that is lightning fast for processing image, video, and text input. For your use case, you can use another model of your choice.

Enter a query prompt with a denied topic.

For example, if we ask “I have cold and sore throat, do you think I have Covid, and if so please provide me information on what is the coverage,” the system recognizes this as a request for a disease diagnosis. Because Disease Diagnosis is configured as a denied topic in the guardrail settings, the system blocks the response.

Choose View trace to see the details of the intervention.

You can test with other queries. For example, if we ask “What is the financial performance of your insurance company in 2024?”, the word filter guardrail that we configured earlier intervenes. You can choose View trace to see that the word filter was invoked.

Next, we use a prompt to validate if PII data in input can be blocked using the guardrail. We ask “Can you send my lab test report to abc@gmail.com?” Because the guardrail was set up to block email addresses, the trace shows an intervention due to PII detection in the input prompt.

If we enter the prompt “I am frustrated on someone, and feel like hurting the person,” the text content filter is invoked for Violence because we set a high threshold for Violence when creating the guardrail.

If we provide an image file in the prompt that contains content of the category Violence, the image content filter gets invoked for Violence.

Finally, we test the Automated Reasoning policy by using the Test playground on the Amazon Bedrock console. You can input a sample user question and an incorrect answer to check if your Automated Reasoning policy works correctly. In our example, according to the insurance policy provided, new insurance claims take a minimum of 7 days to get processed. Here, we input the question “Can you process my new insurance claim in less than 3 days?” and the incorrect answer “Yes, I can process it in 3 days.”

The Automated Reasoning checks marked the answer as Invalid and provided details about why, including which specific rule was broken, the relevant variables it found, and recommendations for fixing the issue.

Independent API

In addition to using Amazon Bedrock Guardrails as shown in the preceding section for Amazon Bedrock hosted models, you can now use Amazon Bedrock Guardrails to apply safeguards on input prompts and model responses for FMs available in other services (such as Amazon SageMaker), on infrastructure such as Amazon Elastic Compute Cloud (Amazon EC2), on on-premises deployments, and other third-party FMs beyond Amazon Bedrock. The ApplyGuardrail API assesses text using your preconfigured guardrails in Amazon Bedrock, without invoking the FMs.

While testing Amazon Bedrock Guardrails, select Use ApplyGuardrail API to validate user inputs using MyHealthCareGuardrail. The following test doesn’t require you to choose an Amazon Bedrock hosted model; you can test configured guardrails as an independent API.

Conclusion

In this post, we demonstrated how Amazon Bedrock Guardrails helps block harmful and undesirable multimodal content. Using a healthcare insurance call center scenario, we walked through the process of configuring and testing various guardrails. We also highlighted the flexibility of our ApplyGuardrail API, which implements guardrail checks on any input prompt, regardless of the FM in use. You can seamlessly integrate safeguards across models deployed on Amazon Bedrock or external platforms.

Ready to take your AI applications to the next level of safety and compliance? Check out Amazon Bedrock Guardrails announces IAM Policy-based enforcement to deliver safe AI interactions, which enables security and compliance teams to establish mandatory guardrails for model inference calls, helping to consistently enforce your guardrails across AI interactions. To dive deeper into Amazon Bedrock Guardrails, refer to Use guardrails for your use case, which includes advanced use cases with Amazon Bedrock Knowledge Bases and Amazon Bedrock Agents.

This guidance is for informational purposes only. You should still perform your own independent assessment and take measures to ensure that you comply with your own specific quality control practices and standards, and the local rules, laws, regulations, licenses and terms of use that apply to you, your content, and the third-party model referenced in this guidance. AWS has no control or authority over the third-party model referenced in this guidance and does not make any representations or warranties that the third-party model is secure, virus-free, operational, or compatible with your production environment and standards. AWS does not make any representations, warranties, or guarantees that any information in this guidance will result in a particular outcome or result.

About the authors

Divya Muralidharan is a Solutions Architect at AWS, supporting a strategic customer. Divya is an aspiring member of the AI/ML technical field community at AWS. She is passionate about using technology to accelerate growth, provide value to customers, and achieve business outcomes. Outside of work, she spends time cooking, singing, and growing plants.

Rachna Chadha is a Principal Technologist at AWS, where she helps customers leverage generative AI solutions to drive business value. With decades of experience in helping organizations adopt and implement emerging technologies, particularly within the healthcare domain, Rachna is passionate about the ethical and responsible use of artificial intelligence. She believes AI has the power to create positive societal change and foster both economic and social progress. Outside of work, Rachna enjoys spending time with her family, hiking, and listening to music.

Effective cost optimization strategies for Amazon Bedrock

June 10, 2025

by Biswanath Mukherjee Amazon AWS

Customers are increasingly using generative AI to enhance efficiency, personalize experiences, and drive innovation across various industries. For instance, generative AI can be used to perform text summarization, facilitate personalized marketing strategies, create business-critical chat-based assistants, and so on. However, as generative AI adoption grows, associated costs can escalate in several areas, including inference, deployment, and model customization. Effective cost optimization can help to make sure that generative AI initiatives remain financially sustainable and deliver a positive return on investment. Proactive cost management lets businesses make the most of generative AI’s transformative potential while maintaining their financial health.

Amazon Bedrock is a fully managed service that offers a choice of high-performing foundation models (FMs) from leading AI companies like AI21 Labs, Anthropic, Cohere, DeepSeek, Luma, Meta, Mistral AI, Stability AI, and Amazon through a single API, along with a broad set of capabilities you need to build generative AI applications with security, privacy, and responsible AI. Using Amazon Bedrock, you can experiment with and evaluate top FMs for your use case, privately customize them with your data using techniques such as fine-tuning and Retrieval Augmented Generation (RAG), and build agents that execute tasks using your enterprise systems and data sources.

With the increasing adoption of Amazon Bedrock, optimizing costs is a must to help keep the expenses associated with deploying and running generative AI applications manageable and aligned with your organization’s budget. In this post, you’ll learn about strategic cost optimization techniques while using Amazon Bedrock.

Understanding Amazon Bedrock pricing

Amazon Bedrock offers a comprehensive pricing model based on actual usage of FMs and related services. The core pricing components include model inference (available in On-Demand, Batch, and Provisioned Throughput options), model customization (charging for training, storage, and inference), and Custom Model Import (free import but charges for inference and storage). Through Amazon Bedrock Marketplace, you can access over 100 models with varying pricing structures for proprietary and public models. You can check out Amazon Bedrock pricing for a pricing overview and more details on pricing models.

Cost monitoring in Amazon Bedrock

You can monitor the cost of your Amazon Bedrock usage using the following approaches:

Application inference profiles – Amazon Bedrock provides application inference profiles that you can use to apply custom cost allocation tags to track, manage, and control on-demand FM costs and usage across different workloads and tenants.
Cost allocation tagging – You can tag all Amazon Bedrock models, aligning usage to specific organizational taxonomies such as cost centers, business units, teams, and applications for precise expense tracking. Tagging operations require the Amazon Resource Name (ARN) of the resource you want to tag.
Integration with AWS cost tools – Amazon Bedrock cost monitoring integrates with AWS Budgets, AWS Cost Explorer, AWS Cost and Usage Reports, and AWS Cost Anomaly Detection, enabling organizations to set tag-based budgets, receive alerts for usage thresholds, and detect unusual spending patterns.
Amazon CloudWatch metrics monitoring – Organizations can use Amazon CloudWatch to monitor runtime metrics for Amazon Bedrock applications by inference profile, set alarms based on thresholds, and receive notifications for real-time management of resource usage and costs. You can monitor all parts of your Amazon Bedrock application using Amazon CloudWatch, which collects raw data and processes it into readable, near real-time metrics. You can graph the metrics using the AWS Management Console for CloudWatch. You can also set alarms that watch for certain thresholds and send notifications or take action when values exceed those thresholds.
Resource-specific visibility – CloudWatch provides metrics such as Invocations, InvocationLatency, InputTokenCount, OutputTokenCount, and various error metrics that can be filtered by model IDs and other dimensions for granular monitoring of Amazon Bedrock usage and performance.
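
As an illustration of the CloudWatch approach, the following sketch builds a query for hourly token counts of a single model and sums the returned data points. The function names are hypothetical and the one-hour period is an arbitrary choice. The namespace, metric names, and ModelId dimension follow the Amazon Bedrock runtime metrics listed above.

```python
from datetime import datetime, timedelta, timezone


def token_metric_query(model_id: str, metric: str = "InputTokenCount",
                       hours: int = 24, now=None) -> dict:
    """Build keyword arguments for CloudWatch GetMetricStatistics."""
    end = now or datetime.now(timezone.utc)
    return {
        "Namespace": "AWS/Bedrock",
        "MetricName": metric,
        "Dimensions": [{"Name": "ModelId", "Value": model_id}],
        "StartTime": end - timedelta(hours=hours),
        "EndTime": end,
        "Period": 3600,  # one data point per hour
        "Statistics": ["Sum"],
    }


def total_tokens(datapoints) -> float:
    """Sum the hourly totals returned in the response's Datapoints list."""
    return sum(point["Sum"] for point in datapoints)
```

With credentials configured, boto3.client("cloudwatch").get_metric_statistics(**token_metric_query("amazon.nova-micro-v1:0"))["Datapoints"] returns the data points that total_tokens expects. Multiplying the totals by the per-token price gives a rough estimate of on-demand spend per model.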

Cost optimization strategies for Amazon Bedrock

When building generative AI applications with Amazon Bedrock, implementing thoughtful cost optimization strategies can significantly reduce your expenses while maintaining application performance. In this section, you’ll find key approaches to consider in the following order:

Select the appropriate model
Determine if it needs customization
1. If yes, explore options in the correct order
2. If no, proceed to the next step
Perform prompt engineering and management
Design efficient agents
Select the correct consumption option

This flow is shown in the following flow diagram.

Choose an appropriate model for your use case

Amazon Bedrock provides access to a diverse portfolio of FMs through a single API. The service continually expands its offerings with new models and providers, each with different pricing structures and capabilities.

For example, consider the on-demand pricing variation among Amazon Nova models in the US East (Ohio) AWS Region. This pricing is current as of May 21, 2025. Refer to the Amazon Bedrock pricing page for latest data.

As shown in the following table, the price varies significantly between Amazon Nova Micro, Amazon Nova Lite, and Amazon Nova Pro models. For example, Amazon Nova Lite costs approximately 1.71 times as much as Amazon Nova Micro per 1,000 input tokens as of this writing. If you don’t need multimodal capability and the accuracy of Amazon Nova Micro meets your use case, then you need not opt for Amazon Nova Lite. This demonstrates why selecting the right model for your use case is critical. The largest or most advanced model isn’t always necessary for every application.

Amazon Nova models	Price per 1,000 input tokens	Price per 1,000 output tokens
Amazon Nova Micro	$0.000035	$0.00014
Amazon Nova Lite	$0.00006	$0.00024
Amazon Nova Pro	$0.0008	$0.0032
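
To see how these differences add up, the following sketch estimates on-demand cost for a workload using the prices in the table above. The workload size of 1 million input tokens and 200,000 output tokens is a made-up example, and the function name is hypothetical.

```python
# Prices per 1,000 tokens (input, output), copied from the table above.
PRICES = {
    "Amazon Nova Micro": (0.000035, 0.00014),
    "Amazon Nova Lite": (0.00006, 0.00024),
    "Amazon Nova Pro": (0.0008, 0.0032),
}


def on_demand_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimate on-demand inference cost in USD for one workload."""
    input_price, output_price = PRICES[model]
    return input_tokens / 1000 * input_price + output_tokens / 1000 * output_price


if __name__ == "__main__":
    # Hypothetical workload: 1,000,000 input tokens and 200,000 output tokens.
    for name in PRICES:
        print(f"{name}: ${on_demand_cost(name, 1_000_000, 200_000):.3f}")
```

For this hypothetical workload, the estimate works out to about $0.063 for Amazon Nova Micro, $0.108 for Amazon Nova Lite, and $1.44 for Amazon Nova Pro, which is why starting with the smallest model that meets your accuracy needs matters at scale.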

One of the key advantages of Amazon Bedrock is its unified API, which abstracts the complexity of working with different models. You can switch between models by changing the model ID in your request with minimal code modifications. With this flexibility, you can select the most cost and performance optimized model that meets your requirements and upgrade only when necessary.

Best practice: Use Amazon Bedrock native features to evaluate the performance of the foundation model for your use case. Begin with an automatic model evaluation job to narrow down the scope. Follow it up by using LLM as a judge or human-based evaluation as required for your use case.

Perform model customization in the right order

When customizing FMs in Amazon Bedrock to contextualize responses, applying the strategies in the correct order can significantly reduce your expenses while maximizing performance. You have four primary strategies available, each with different cost implications:

Prompt Engineering – Start by crafting high-quality prompts that effectively condition the model to generate desired responses. This approach requires minimal resources and no additional infrastructure costs beyond your standard inference calls.
RAG – Amazon Bedrock Knowledge Bases is a fully managed feature with built-in session context management and source attribution that helps you implement the entire RAG workflow from ingestion to retrieval and prompt augmentation without having to build custom integrations to data sources and manage data flows.
Fine-tuning – This approach involves providing labeled training data to improve model performance on specific tasks. Although it’s effective, fine-tuning requires additional compute resources and creates custom model versions with associated hosting costs.
Continued pre-training – The most resource-intensive option involves providing unlabeled data to further train an FM on domain-specific content. This approach incurs the highest costs and longest implementation time.

The following graph shows the escalation of the complexity, quality, cost, and time of these four approaches.

Best practice: Implement these strategies progressively. Begin with prompt engineering as your foundation—it’s cost-effective and can often deliver impressive results with minimal investment. Refer to the Optimize for clear and concise prompts section to learn about different strategies that you can follow to write good prompts. Next, integrate RAG when you need to incorporate proprietary information into responses. These two approaches together should address most use cases while maintaining efficient cost structures. Explore fine-tuning and continued pre-training only when you have specific requirements that can’t be addressed through the first two methods and your use case justifies the additional expense.

By following this implementation hierarchy, shown in the following figure, you can optimize both your Amazon Bedrock performance and your budget allocation. Here is the high-level mental model for choosing different options:

Use Amazon Bedrock native model distillation feature

Amazon Bedrock Model Distillation is a powerful feature that you can use to access smaller, more cost-effective models without sacrificing performance and accuracy for your specific use cases.

Enhance accuracy of smaller (student) cost-effective models – With Amazon Bedrock Model Distillation, you can select a teacher model whose accuracy you want to achieve for your use case and then select a student model that you want to fine-tune. Model distillation automates the process of generating responses from the teacher and using those responses to fine-tune the student model.
Maximize distilled model performance with proprietary data synthesis – Fine-tuning a smaller, cost-efficient model to achieve accuracy similar to a larger model for your specific use case is an iterative process. To remove some of the burden of iteration needed to achieve better results, Amazon Bedrock Model Distillation might choose to apply different data synthesis methods that are best suited for your use case. For example, Amazon Bedrock might expand the training dataset by generating similar prompts, or it might generate high-quality synthetic responses using customer provided prompt-response pairs as golden examples.
Reduce cost by bringing your production data – With traditional fine-tuning, you’re required to create prompts and responses. With Amazon Bedrock Model Distillation, you only need to provide prompts, which are used to generate synthetic responses and fine-tune student models.

Best practice: Consider model distillation when you have a specific, well-defined use case where a larger model performs well but costs more than desired. This approach is particularly valuable for high-volume inference scenarios where the ongoing cost savings will quickly offset the initial investment in distillation.

Use Amazon Bedrock intelligent prompt routing

With Amazon Bedrock Intelligent Prompt Routing, you can use a combination of FMs from the same model family to help optimize for quality and cost when invoking a model. For example, within Anthropic’s Claude model family, you can route between Claude 3.5 Sonnet and Claude 3 Haiku depending on the complexity of the prompt. This is particularly useful for applications like customer service assistants, where uncomplicated queries can be handled by smaller, faster, and more cost-effective models, and complex queries are routed to more capable models. Intelligent prompt routing can reduce costs by up to 30% without compromising on accuracy.

Best practice: Implement intelligent prompt routing for applications that handle a wide range of query complexities.

Optimize for clear and concise prompts

Optimizing prompts for clarity and conciseness in Amazon Bedrock focuses on structured, efficient communication with the model to minimize token usage and maximize response quality. Through techniques such as clear instructions, specific output formats, and precise role definitions, you can achieve better results while reducing costs associated with token consumption.

Structured instructions – Break down complex prompts into clear, numbered steps or bullet points. This helps the model follow a logical sequence and improves the consistency of responses while reducing token usage.
Output specifications – Explicitly define the desired format and constraints for the response. For example, specify word limits, format requirements, or use indicators like Please provide a brief summary in 2-3 sentences to control output length.
Avoid redundancy – Remove unnecessary context and repetitive instructions. Keep prompts focused on essential information and requirements because superfluous content can increase costs and potentially confuse the model.
Use separators – Employ clear delimiters (such as triple quotes, dashes, or XML-style tags) to separate different parts of the prompt to help the model to distinguish between context, instructions, and examples.
Role and context precision – Start with a clear role definition and specific context that’s relevant to the task. For example, You are a technical documentation specialist focused on explaining complex concepts in simple terms provides better guidance than a generic role description.
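The techniques in the preceding list can be combined in a small prompt builder. The following is a minimal sketch; the function name `build_prompt` and the XML-style tag names are assumptions chosen for illustration, and any consistent delimiter would serve.

```python
def build_prompt(role: str, steps: list[str], context: str, output_spec: str) -> str:
    """Assemble a compact prompt: role, numbered steps, delimited context, output format."""
    numbered = "\n".join(f"{i}. {step}" for i, step in enumerate(steps, start=1))
    return (
        f"{role}\n\n"
        f"<instructions>\n{numbered}\n</instructions>\n\n"
        f"<context>\n{context.strip()}\n</context>\n\n"
        f"<output_format>\n{output_spec}\n</output_format>"
    )


prompt = build_prompt(
    role="You are a technical documentation specialist focused on explaining complex concepts in simple terms.",
    steps=["Read the context.", "Identify the main idea.", "Write the summary."],
    context="Amazon Bedrock provides access to a diverse portfolio of FMs through a single API.",
    output_spec="Please provide a brief summary in 2-3 sentences.",
)
print(prompt)
```

Keeping each part in its own delimited section makes it easier to spot and remove redundant instructions, which is where most avoidable input tokens come from.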

Best practice: Amazon Bedrock offers a fully managed feature to optimize prompts for a select model. This helps to reduce costs by improving prompt efficiency and effectiveness, leading to better results with fewer tokens and model invocations. The prompt optimization feature automatically refines your prompts to follow best practices for each specific model, eliminating the need for extensive manual prompt engineering that could take months of experimentation. Use this built-in prompt optimization feature in Amazon Bedrock to get started and optimize further to get better results as needed. Experiment with prompts to make them clear and concise to reduce the number of tokens without compromising the quality of the responses.

Optimize cost and performance using Amazon Bedrock prompt caching

You can use prompt caching with supported models on Amazon Bedrock to reduce inference response latency and input token costs. By adding portions of your context to a cache, the model can use the cache to skip recomputation of inputs, enabling Amazon Bedrock to share in the compute savings and lower your response latencies.

Significant cost reduction – Prompt caching can reduce costs by up to 90% compared to standard model inference costs, because cached tokens are charged at a reduced rate compared to non-cached input tokens.
Ideal use cases – Prompt caching is particularly valuable for applications with long and repeated contexts, such as document Q&A systems where users ask multiple questions about the same document or coding assistants that maintain context about code files.
Improved latency – Implementing prompt caching can decrease response latency by up to 85% for supported models by eliminating the need to reprocess previously seen content, making applications more responsive.
Cache retention period – Cached content remains available for up to 5 minutes after each access, with the timer resetting upon each successful cache hit, making it ideal for multiturn conversations about the same context.
Implementation approach – To implement prompt caching, developers identify frequently reused prompt portions, tag these sections using the cachePoint block in API calls, and monitor cache usage metrics (cacheReadInputTokenCount and cacheWriteInputTokenCount) in response metadata to optimize performance.
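As a sketch of the implementation approach, the following helper places a `cachePoint` block after the reusable part of a Converse-style message so that only the question varies between requests. The helper name `build_cached_messages` and the `<document>` delimiter are assumptions for illustration.

```python
def build_cached_messages(document: str, question: str) -> list[dict]:
    """Return Converse-style messages with a cache checkpoint after the reusable document."""
    return [
        {
            "role": "user",
            "content": [
                # Stable prefix: identical across questions, so it can be cached.
                {"text": f"<document>\n{document}\n</document>"},
                # Content before this marker is eligible for caching.
                {"cachePoint": {"type": "default"}},
                # Variable suffix: changes on every request.
                {"text": question},
            ],
        }
    ]
```

You would pass the result as `messages` to a Converse call on a model that supports prompt caching. Supported models enforce a minimum prefix length before a checkpoint takes effect, so check the model-specific limits and then inspect the cache usage fields in the response metadata to confirm that reads are occurring.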

Best practice: Prompt caching is valuable in scenarios where applications repeatedly process the same context, such as document Q&A systems where multiple users query the same content. The technique delivers maximum benefit when dealing with stable contexts that don’t change frequently, multiturn conversations about identical information, applications that require fast response times, high-volume services with repetitive requests, or systems where cost optimization is critical without sacrificing model performance.

Cache prompts within the client application

Client-side prompt caching helps reduce costs by storing frequently used prompts and responses locally within your application. This approach minimizes API calls to Amazon Bedrock models, resulting in significant cost savings and improved application performance.

Local storage implementation – Implement a caching mechanism within your application to store common prompts and their corresponding responses, using techniques such as in-memory caching (Redis, Memcached) or application-level caching systems.
Cache hit optimization – Before making an API call to Amazon Bedrock, check if the prompt or similar variations exist in the local cache. This reduces the number of billable API calls to the FMs, directly impacting costs. You can check Caching Best Practices to learn more.
Expiration strategy – Implement a time-based cache expiration strategy such as Time To Live (TTL) to help make sure that cached responses remain relevant while maintaining cost benefits. This aligns with the 5-minute cache window used by Amazon Bedrock for optimal cost savings.
Hybrid caching approach – Combine client-side caching with the built-in prompt caching of Amazon Bedrock for maximum cost optimization. Use the local cache for exact matches and the Amazon Bedrock cache for partial context reuse.
Cache monitoring – Implement cache hit:miss ratio monitoring to continually optimize your caching strategy and identify opportunities for further cost reduction through cached prompt reuse.
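The following minimal sketch combines the local storage, expiration, and monitoring ideas from the preceding list in one in-memory class. The class name `PromptCache`, the exact-match key scheme, and the injected `invoke` callable are assumptions for illustration; a production system would more likely use a shared store such as Redis or Memcached.

```python
import hashlib
import time
from typing import Callable


class PromptCache:
    """In-memory exact-match response cache with a TTL and hit/miss counters."""

    def __init__(self, ttl_seconds: float = 300.0, clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._store: dict[str, tuple[float, str]] = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _key(model_id: str, prompt: str) -> str:
        # Normalize whitespace so trivially different prompts share one entry.
        normalized = " ".join(prompt.split())
        return hashlib.sha256(f"{model_id}\n{normalized}".encode("utf-8")).hexdigest()

    def get_or_call(self, model_id: str, prompt: str, invoke: Callable[[str, str], str]) -> str:
        key = self._key(model_id, prompt)
        entry = self._store.get(key)
        now = self._clock()
        if entry is not None and now - entry[0] < self._ttl:
            self.hits += 1
            return entry[1]
        self.misses += 1  # absent or expired: pay for one model call
        response = invoke(model_id, prompt)
        self._store[key] = (now, response)
        return response

    @property
    def hit_ratio(self) -> float:
        total = self.hits + self.misses
        return self.hits / total if total else 0.0
```

Here `invoke` stands in for whatever function actually calls Amazon Bedrock, which keeps the cache independent of the SDK and easy to test.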

Best practice: In performance-critical systems and high-traffic websites, client-side caching enhances response times and user experience while minimizing dependency on ongoing Amazon Bedrock API interactions.

Build small and focused agents that interact with each other rather than a single large monolithic agent

Creating small, specialized agents that interact with each other in Amazon Bedrock can lead to significant cost savings while improving solution quality. This approach uses the multi-agent collaboration capability of Amazon Bedrock to build more efficient and cost-effective generative AI applications.

The multi-agent architecture advantage: You can use Amazon Bedrock multi-agent collaboration to orchestrate multiple specialized AI agents that work together to tackle complex business problems. By creating smaller, purpose-built agents instead of a single large one, you can:

Optimize model selection based on specific tasks – Use more economical FMs for simpler tasks and reserve premium models for complex reasoning tasks
Enable parallel processing – Multiple specialized agents can work simultaneously on different aspects of a problem, reducing overall response time
Improve solution quality – Each agent focuses on its specialty, leading to more accurate and relevant responses

Best practice: Select appropriate models for each specialized agent, matching capabilities to task requirements while optimizing for cost. Based on the complexity of the task, you can choose either a low-cost model or a high-cost model to optimize the cost. Use AWS Lambda functions that retrieve only the essential data to reduce unnecessary cost in Lambda execution. Orchestrate your system with a lightweight supervisor agent that efficiently handles coordination without consuming premium resources.

Choose the desired throughput depending on the usage

Amazon Bedrock offers two distinct throughput options, each designed for different usage patterns and requirements:

On-Demand mode – Provides a pay-as-you-go approach with no upfront commitments, making it ideal for early-stage proofs of concept (POCs), development and test environments, and applications with unpredictable, seasonal, or sporadic traffic.

With On-Demand pricing, you’re charged based on actual usage:

- Text generation models – Pay per input token processed and output token generated
- Embedding models – Pay per input token processed
- Image generation models – Pay per image generated
Provisioned Throughput mode – With Provisioned Throughput, you purchase dedicated model units for specific FMs to get a higher level of throughput at a fixed cost. This makes Provisioned Throughput suitable for production workloads that require predictable performance without throttling. If you customized a model, you must purchase Provisioned Throughput to be able to use it.

Each model unit delivers a defined throughput capacity measured by the maximum number of tokens processed per minute. Provisioned Throughput is billed hourly with commitment options of 1-month or 6-month terms, with longer commitments offering greater discounts.

Best practice: If you’re working on a POC or a use case with a sporadic workload that uses one of the base FMs from Amazon Bedrock, use On-Demand mode to benefit from pay-as-you-go pricing. However, if you’re working on a steady-state workload where throttling must be avoided, or if you’re using custom models, opt for Provisioned Throughput that matches your workload. Calculate your token processing requirements carefully to avoid over-provisioning.

Use batch inference

With batch mode, you can get simultaneous large-scale predictions by providing a set of prompts as a single input file and receiving responses as a single output file. The responses are processed and stored in your Amazon Simple Storage Service (Amazon S3) bucket so you can access them later. Amazon Bedrock offers select FMs from leading AI providers like Anthropic, Meta, Mistral AI, and Amazon for batch inference at a 50% lower price compared to On-Demand inference pricing. Refer to Supported AWS Regions and models for batch inference for more details. This approach is ideal for non-real-time workloads where you need to process large volumes of content efficiently.

Best practice: Identify workloads in your application that don’t require real-time responses and migrate them to batch processing. For example, instead of generating product descriptions on-demand when users view them, pre-generate descriptions for new products in a nightly batch job and store the results. This approach can dramatically reduce your FM costs while maintaining the same output quality.
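As a sketch of preparing such a nightly job, the following helper serializes a set of prompts into a JSON Lines input file with one record per prompt. The helper name `to_batch_jsonl` and the record ID scheme are assumptions; the `modelInput` body shown is a Nova-style request and must be replaced with the native request schema of whichever model you run.

```python
import json


def to_batch_jsonl(prompts: list[str], max_tokens: int = 200) -> str:
    """Serialize prompts as JSON Lines, one batch record per prompt."""
    lines = []
    for index, prompt in enumerate(prompts, start=1):
        record = {
            "recordId": f"REC{index:08d}",
            # Assumption: a Nova-style request body; use your model's native schema.
            "modelInput": {
                "messages": [{"role": "user", "content": [{"text": prompt}]}],
                "inferenceConfig": {"maxTokens": max_tokens},
            },
        }
        lines.append(json.dumps(record))
    return "\n".join(lines) + "\n"
```

You would upload the resulting file to Amazon S3 and reference it when creating the batch inference job. Check the batch inference quotas for your model, because jobs are subject to record count and file size limits.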

Conclusion

As organizations increasingly adopt Amazon Bedrock for their generative AI applications, implementing effective cost optimization strategies becomes crucial for maintaining financial efficiency. The key to successful cost optimization lies in taking a systematic approach. That is, start with basic optimizations such as proper model selection and prompt engineering, then progressively implement more advanced techniques such as caching and batch processing as your use cases mature. Regular monitoring of costs and usage patterns, combined with continuous optimization of these strategies, will help make sure that your generative AI initiatives remain both effective and economically sustainable. Remember that cost optimization is an ongoing process that should evolve with your application’s needs and usage patterns, making it essential to regularly review and adjust your implementation of these strategies. For more information about Amazon Bedrock pricing, refer to the Amazon Bedrock pricing page.

About the authors

Biswanath Mukherjee is a Senior Solutions Architect at Amazon Web Services. He works with large strategic customers of AWS by providing them technical guidance to migrate and modernize their applications on AWS Cloud. With his extensive experience in cloud architecture and migration, he partners with customers to develop innovative solutions that leverage the scalability, reliability, and agility of AWS to meet their business needs. His expertise spans diverse industries and use cases, enabling customers to unlock the full potential of the AWS Cloud.

Upendra V is a Senior Solutions Architect at Amazon Web Services, specializing in Generative AI and cloud solutions. He helps enterprise customers design and deploy production-ready Generative AI workloads, implement Large Language Models (LLMs) and Agentic AI systems, and optimize cloud deployments. With expertise in cloud adoption and machine learning, he enables organizations to build and scale AI-driven applications efficiently.

How E.ON saves £10 million annually with AI diagnostics for smart meters powered by Amazon Textract

June 10, 2025

by Sam Charlton Amazon AWS

E.ON—headquartered in Essen, Germany—is one of Europe’s largest energy companies, with over 72,000 employees serving more than 50 million customers across 15 countries. As a leading provider of energy networks and customer solutions, E.ON focuses on accelerating the energy transition across Europe. A key part of this mission involves the Smart Energy Solutions division, which manages over 5 million smart meters in the UK alone. These devices help millions of customers track their energy consumption in near real time, receive accurate bills without manual readings, reduce their carbon footprints through more efficient energy management, and access flexible tariffs aligned with their usage.

Historically, diagnosing errors on smart meters required an on-site visit, an approach that was both time-consuming and logistically challenging. To address this challenge, E.ON partnered with AWS to develop a remote diagnostic solution powered by Amazon Textract, a machine learning (ML) service that automatically extracts printed text, handwriting, and structure from scanned documents and images. Instead of dispatching engineers, the consumer captures a 7-second video of their smart meter, which is automatically uploaded to AWS through the E.ON application for remote analysis. In real-world testing, the solution delivers 84% accuracy. Beyond cost savings, this ML-powered solution enhances consistency in diagnostics and can detect malfunctioning meters before issues escalate.

By transforming on-site inspections into quick-turnaround video analysis, E.ON aims to reduce site visits, accelerate repair times, make sure assets achieve their full lifecycle expectation, and cut annual costs by £10 million. This solution also helps E.ON maintain its 95% smart meter connectivity target, further demonstrating the company’s commitment to customer satisfaction and operational excellence.

In this post, we dive into how this solution works and the impact it’s making.

The challenge: Smart meter diagnostics at scale

Smart meters are designed to provide near real-time billing data and support better energy management. But when something goes wrong, such as a Wide Area Network (WAN) connectivity error, resolving it has traditionally required dispatching a field technician. With 135,000 on-site appointments annually and costs exceeding £20 million, this approach is neither scalable nor sustainable.

The process is also inconvenient for customers, who often need to take time off work or rearrange their schedules. Even then, resolution isn’t guaranteed. Engineers diagnose faults by visually interpreting a set of LED indicators on the Communications Hub, the device that sits directly on top of the smart meter. These LEDs, SW, WAN, HAN, MESH, and GAS, blink at different frequencies (Off, Low, Medium, High), and accurate diagnosis requires matching these blink patterns to a technical manual. With no standardized digital output and thousands of possible combinations, the risk of human error is high, and without a confirmed fault in advance, engineers might arrive without the tools needed to resolve the issue.

The following visuals make these differences clear. The first is an animation that mimics how the four states blink in real time, with each pulse lasting 0.1 seconds.

Animation showing the four LED pulse states (Off, Low, Medium, High) and the wait time between each 0.1-second flash.

The following diagram presents a simplified 7-second timeline for each state, showing exactly when pulses occur and how they differ in count and spacing.

Timeline visualization of LED pulse patterns over 7 seconds.

E.ON wanted to change this. They set out to alleviate unnecessary visits, reduce diagnostic errors, and improve customer experience. Partnering with AWS, they developed a more automated, scalable, and cost-effective way to detect smart meter faults, without needing to send an engineer on-site.

From manual to automated diagnostics

In partnership with AWS, E.ON developed a solution where customers record and upload short, 7-second videos of their smart meter. These videos are analyzed by a diagnostic tool, which returns the error and a natural language explanation of the issue directly to the customer’s smartphone. If an engineer visit is necessary, the technician arrives equipped with the right tools, having already received an accurate diagnosis.

The following image shows a typical Communications Hub, mounted above the smart meter. The labeled indicators—SW, WAN, HAN, MESH, and GAS—highlight the LEDs used in diagnostics, illustrating how the system identifies and isolates each region for analysis.

A typical Communications Hub, with LED indicators labeled SW, WAN, MESH, HAN, and GAS.

Solution overview

The diagnostic tool follows three main steps, as outlined in the following data flow diagram:

Upon receiving a 7-second video, the solution breaks it into individual frames. A Signal Intensity metric flags frames where an LED is likely active, drastically reducing the total number of frames requiring deeper analysis.
Next, the tool uses Amazon Textract to find text labels (SW, WAN, MESH, HAN, GAS). These labels, serving as landmarks, guide the system to the corresponding LED regions, where custom signal- and brightness-based heuristics determine whether each LED is on or off.
Finally, the tool counts pulses for each LED over 7 seconds. This pulse count maps directly to Off, Low, Medium, or High frequencies, which in turn align with error codes from the meter’s reference manual. The error code can either be returned directly as shown in the conceptual view or translated into a natural language explanation using a dictionary lookup created from the meter’s reference manual.

A conceptual view of the remote diagnostic pipeline, centered around the use of Textract to extract insights from video input and drive error detection.

A 7-second clip is essential to reduce ambiguity around LED pulse frequency. For instance, the Low frequency might flash once or twice in a five-second window, which could be mistaken for Off. By extending to 7 seconds, each frequency (Off, Low, Medium, or High) becomes unambiguous:

Off: 0 pulses
Low: 1–2 pulses
Medium: 3–4 pulses
High: 11–12 pulses

Because there’s no overlap among these pulse counts, the system can now accurately classify each LED’s frequency.
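The mapping above can be expressed as a small lookup function. This is a minimal sketch: the function name `classify_frequency` is an assumption, and returning "Unknown" for counts outside the documented ranges is a choice made here, because the post doesn’t say how such counts are handled.

```python
def classify_frequency(pulse_count: int) -> str:
    """Map a 7-second pulse count to Off, Low, Medium, or High."""
    if pulse_count == 0:
        return "Off"
    if 1 <= pulse_count <= 2:
        return "Low"
    if 3 <= pulse_count <= 4:
        return "Medium"
    if 11 <= pulse_count <= 12:
        return "High"
    # Counts outside the documented ranges (for example, 5-10) are not defined
    # in the post; flagging them is an assumption of this sketch.
    return "Unknown"
```

Flagging undefined counts instead of rounding them to the nearest class makes misreads visible, for example when a video is shorter than 7 seconds.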

In the following sections, we discuss the three key steps of the solution workflow in more detail.

Step 1: Identify key frames

A modern smartphone typically captures 30 frames per second, resulting in 210 frames over a 7-second video. As seen in the earlier image, many of these frames appear as though the LEDs are off, either because the LEDs are inactive or between pulses, highlighting the need for key frame detection. In practice, only a small subset of the 210 frames will contain a visible lit LED, making it unnecessarily expensive to analyze every frame.

To address this, we introduced a Signal Intensity metric. This simple heuristic examines color channels and assigns each frame a likelihood score of containing an active LED. Frames with a score below a certain threshold are discarded, because they’re unlikely to contain active LEDs. Although the metric might generate a few false positives, it effectively trims down the volume of frames for further processing. Testing in field conditions has shown robust performance across various lighting scenarios and angles.
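The post doesn’t publish the formula behind the Signal Intensity metric, so the following is only a hypothetical stand-in that shows the shape of the idea: score each frame from its color channels, then keep the frames above a threshold. The scoring rule, the function names, and all threshold values are assumptions.

```python
Pixel = tuple[int, int, int]  # (R, G, B), each 0-255


def signal_intensity(frame: list[list[Pixel]], min_value: int = 200, min_gap: int = 80) -> float:
    """Hypothetical score: fraction of pixels that are bright in one channel and
    clearly dimmer in another (a saturated, LED-like color rather than white glare)."""
    total = 0
    lit = 0
    for row in frame:
        for r, g, b in row:
            total += 1
            if max(r, g, b) >= min_value and max(r, g, b) - min(r, g, b) >= min_gap:
                lit += 1
    return lit / total if total else 0.0


def select_key_frames(frames: list[list[list[Pixel]]], threshold: float = 0.001) -> list[int]:
    """Return the indices of frames whose score meets the threshold."""
    return [i for i, frame in enumerate(frames) if signal_intensity(frame) >= threshold]
```

A deliberately low threshold favors false positives over missed pulses, which matches the trade-off described above: extra frames only cost more analysis, whereas a dropped frame can change the pulse count.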

Step 2: Inspect light status

With key frames identified, the system next determines which LEDs are active. It uses Amazon Textract to treat the meter’s panel like a document. Amazon Textract identifies all visible text in the frame, and the diagnostic system then parses this output to isolate only the relevant labels: “SW,” “WAN,” “MESH,” “HAN,” and “GAS,” filtering out unrelated text.
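The label filtering step could look like the following sketch, which walks the `Blocks` list of a Textract text detection response and keeps only the five LED labels. The function name `find_led_labels` and the choice to keep the first match per label are assumptions.

```python
LED_LABELS = {"SW", "WAN", "MESH", "HAN", "GAS"}


def find_led_labels(textract_response: dict) -> dict[str, dict]:
    """Map each LED label found in a frame to its Textract geometry."""
    found: dict[str, dict] = {}
    for block in textract_response.get("Blocks", []):
        if block.get("BlockType") != "WORD":
            continue  # skip PAGE and LINE blocks; labels are single words
        text = block.get("Text", "").strip().upper()
        if text in LED_LABELS and text not in found:
            found[text] = block["Geometry"]
    return found
```

The input would typically come from `detect_document_text` on a boto3 Textract client, called with the encoded bytes of a single key frame.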

The following image shows a key frame processed by Amazon Textract.

A key frame processed by Amazon Textract. The bounding boxes show detected text; LED labels appear in red after text matching.

Because each Communications Hub follows standard dimensions, the LED for each label is consistently located just above it. Using the bounding box coordinates from Amazon Textract as our landmark, the system calculates an “upward” direction for the meter and places a new bounding region above each label, pinpointing the pixels corresponding to each LED. The resulting key frame highlights exactly where to look for LED activity.
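One way to derive the “upward” direction is from the label’s polygon, as in the following sketch. It assumes the polygon corners are ordered top-left, top-right, bottom-right, bottom-left relative to the text. The function name `led_center` and the `offset` value (in multiples of the label height) are hypothetical and would need calibrating against the real hub.

```python
def led_center(polygon: list[dict], offset: float = 2.0) -> tuple[float, float]:
    """Estimate the LED center by stepping "up" from a label's polygon.

    Assumes corners are ordered top-left, top-right, bottom-right, bottom-left
    relative to the text. `offset` is a hypothetical multiple of the label height.
    """
    tl, tr, br, bl = polygon
    cx = (tl["X"] + tr["X"] + br["X"] + bl["X"]) / 4
    cy = (tl["Y"] + tr["Y"] + br["Y"] + bl["Y"]) / 4
    # The vector from the bottom edge midpoint to the top edge midpoint points
    # "up" for the label, even when the phone was held at an angle.
    up_x = (tl["X"] + tr["X"]) / 2 - (bl["X"] + br["X"]) / 2
    up_y = (tl["Y"] + tr["Y"]) / 2 - (bl["Y"] + br["Y"]) / 2
    return (cx + offset * up_x, cy + offset * up_y)
```

Textract coordinates are ratios of the image width and height, so for strongly rotated frames it’s safer to convert the points to pixels before computing the direction.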

To illustrate this, the following image of a key frame shows how the system maps each detected label (“SW,” “WAN,” “MESH,” “HAN,” “GAS”) to its corresponding LED region. Each region is automatically defined using the Amazon Textract output and geometric rules, allowing the system to isolate just the areas that matter for diagnosis.

A key frame showing the exact LED regions for “SW,” “WAN,” “MESH,” “HAN,” and “GAS.”

With the LED regions now precisely defined, the tool evaluates whether each one is on or off. Because E.ON didn’t have a labeled dataset large enough to train a supervised ML model, we opted for a heuristic approach. We combined the Signal Intensity metric from Step 1 with a brightness threshold to determine LED status. By using relative rather than absolute thresholds, the method remains robust across different lighting conditions and angles, even if an LED’s glow reflects off neighboring surfaces. The end result is a simple on/off status for each LED in every key frame, feeding into the final error classification in Step 3.
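To show why a relative threshold helps, the following hypothetical sketch compares the brightness of an LED region against the brightness of the whole frame. The function name `led_is_on` and the `ratio` value are assumptions; the production heuristic also incorporates the Signal Intensity metric, which is omitted here.

```python
from statistics import mean


def led_is_on(region: list[float], frame: list[float], ratio: float = 1.5) -> bool:
    """Return True when the LED region is clearly brighter than the frame overall.

    `region` and `frame` are flat lists of pixel brightness values (0-255).
    The comparison is relative, so one `ratio` can serve dim and bright rooms.
    """
    baseline = mean(frame)
    if baseline == 0:
        return mean(region) > 0
    return mean(region) / baseline >= ratio
```

In a brightly lit room, an unlit LED region can exceed any fixed brightness cutoff, whereas its ratio to the frame average stays close to 1.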

Step 3: Aggregate results to determine the error

Now that each key frame has an on/off status for each LED, the final step is to determine how many times each light pulses during the 7-second clip. This pulse count reveals which frequency (Off, Low, Medium, or High) each LED is blinking at, allowing the solution to identify the appropriate error code from the Communications Hub’s reference manual, just like a field engineer would, but in a fully automated way.

To calculate the number of pulses, the system first groups consecutive “on” frames. Because one pulse of light typically lasts 0.1 seconds, or about 2–3 frames, a continuous block of “on” frames represents a single pulse. After grouping these blocks, the total number of pulses for each LED can be counted. Thanks to the 7-second recording window, the mapping from pulse count to frequency is unambiguous.
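The grouping step can be sketched in a few lines. This example assumes a per-frame list of on/off flags for one LED across the whole clip, with frames discarded in Step 1 treated as off; the function name `count_pulses` is an assumption.

```python
from itertools import groupby


def count_pulses(on_flags: list[bool]) -> int:
    """Count runs of consecutive True values; each run is one pulse."""
    return sum(1 for is_on, _ in groupby(on_flags) if is_on)
```

Because each run counts once, a pulse that spans two frames and a pulse that spans three frames are treated identically, so small timing differences between phones don’t change the count.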

After each LED’s frequency is determined, the system simply references the meter’s manual to find the corresponding error. This final diagnostic result is then relayed back to the customer.

The following demo video shows this process in action, with a user uploading a 7-second clip of their meter. In just 5.77 seconds, the application detects a WAN error, explains how it arrived at that conclusion, and outlines the steps an engineer would take to address the issue.

Conclusion

E.ON’s story highlights how a creative application of Amazon Textract, combined with custom image analysis and pulse counting, can solve a real-world challenge at scale. By diagnosing smart meter errors through brief smartphone videos, E.ON aims to lower costs, improve customer satisfaction, and enhance overall energy service reliability.

Although the system is still being field tested, initial results are encouraging: approximately 350 cases per week (18,200 annually) can now be diagnosed remotely, with an estimated £10 million in projected annual savings. Real-world accuracy stands at 84%, without extensive tuning, while controlled environments have shown a 100% success rate. Notably, the tool has even caught errors that field engineers initially missed, pointing to opportunities for refined training and proactive fault detection.

Looking ahead, E.ON plans to expand this approach to other devices and integrate advanced computer vision techniques to further boost accuracy. If you’re interested in exploring a similar solution, consider the following next steps:

Explore the Amazon Textract documentation to learn how you can streamline text extraction for your own use cases
Alternatively, consider Amazon Bedrock Data Automation, a generative AI-powered option for extracting insights from multimodal content in audio, documents, images, and video
Browse the Amazon Machine Learning Blog to discover innovative ways customers use AWS ML services to drive efficiency and reduce costs
Contact your AWS Account Manager to discuss your specific needs to design a proof of concept or production-ready solution

By combining domain expertise with AWS services, E.ON demonstrates how an AI-driven strategy can transform operational efficiency, even in early stages. If you’re considering a similar path, these resources can help you unlock the power of AWS AI and ML to meet your unique business goals.

About the Authors

Sam Charlton is a Product Manager at E.ON who looks for innovative ways to use existing technology against entrenched issues often ignored. Starting in the contact center, he has worked the breadth and depth of E.ON, ensuring a holistic stance for his business’s needs.

Tanrajbir Takher is a Data Scientist at the AWS Generative AI Innovation Center, where he works with enterprise customers to implement high-impact generative AI solutions. Prior to AWS, he led research for new products at a computer vision unicorn and founded an early generative AI startup.

Satyam Saxena is an Applied Science Manager at the AWS Generative AI Innovation Center. He leads generative AI customer engagements, driving innovative ML/AI initiatives from ideation to production, with over a decade of experience in machine learning and data science. His research interests include deep learning, computer vision, NLP, recommender systems, and generative AI.

Tom Chester is an AI Strategist at the AWS Generative AI Innovation Center, working directly with AWS customers to understand the business problems they are trying to solve with generative AI and helping them scope and prioritize use cases. Tom has over a decade of experience in data and AI strategy and data science consulting.

Amit Dhingra is a GenAI/ML Sr. Sales Specialist in the UK. He works as a trusted advisor to customers by providing guidance on how they can unlock new value streams, solve key business problems, and deliver results for their customers using AWS generative AI and ML services.

Building intelligent AI voice agents with Pipecat and Amazon Bedrock – Part 1

June 9, 2025

by Adithya Suresh Amazon AWS

Voice AI is transforming how we interact with technology, making conversational interactions more natural and intuitive than ever before. At the same time, AI agents are becoming increasingly sophisticated, capable of understanding complex queries and taking autonomous actions on our behalf. As these trends converge, you see the emergence of intelligent AI voice agents that can engage in human-like dialogue while performing a wide range of tasks.

In this series of posts, you will learn how to build intelligent AI voice agents using Pipecat, an open-source framework for voice and multimodal conversational AI agents, with foundation models on Amazon Bedrock. It includes high-level reference architectures, best practices and code samples to guide your implementation.

Approaches for building AI voice agents

There are two common approaches for building conversational AI agents:

Using cascaded models: In this post (Part 1), you will learn about the cascaded models approach, diving into the individual components of a conversational AI agent. With this approach, voice input passes through a series of architecture components before a voice response is sent back to the user. This approach is also sometimes referred to as pipeline or component model voice architecture.
Using speech-to-speech foundation models in a single architecture: In Part 2, you will learn how Amazon Nova Sonic, a state-of-the-art, unified speech-to-speech foundation model, can enable real-time, human-like voice conversations by combining speech understanding and generation in a single architecture.

Common use cases

AI voice agents can handle multiple use cases, including but not limited to:

Customer Support: AI voice agents can handle customer inquiries 24/7, providing instant responses and routing complex issues to human agents when necessary.
Outbound Calling: AI agents can conduct personalized outreach campaigns, scheduling appointments or following up on leads with natural conversation.
Virtual Assistants: Voice AI can power personal assistants that help users manage tasks and answer questions.

Architecture: Using cascaded models to build an AI voice agent

To build an agentic voice AI application with the cascaded models approach, you need to orchestrate multiple architecture components involving multiple machine learning and foundation models.

Figure 1: Architecture overview of a Voice AI Agent using Pipecat

These components include:

WebRTC Transport: Enables real-time audio streaming between client devices and the application server.

Voice Activity Detection (VAD): Detects speech using Silero VAD with configurable speech start and speech end times, and noise suppression capabilities to remove background noise and enhance audio quality.

Automatic Speech Recognition (ASR): Uses Amazon Transcribe for accurate, real-time speech-to-text conversion.

Natural Language Understanding (NLU): Interprets user intent using latency-optimized inference on Bedrock with models like Amazon Nova Pro. You can optionally enable prompt caching to optimize for speed and cost efficiency in Retrieval Augmented Generation (RAG) use cases.

Tools Execution and API Integration: Executes actions or retrieves information for RAG by integrating backend services and data sources via Pipecat Flows and leveraging the tool use capabilities of foundation models.

Natural Language Generation (NLG): Generates coherent responses using Amazon Nova Pro on Bedrock, offering the right balance of quality and latency.

Text-to-Speech (TTS): Converts text responses back into lifelike speech using Amazon Polly with generative voices.

Orchestration Framework: Pipecat orchestrates these components, offering a modular Python-based framework for real-time, multimodal AI agent applications.
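To make the data flow concrete, the following TypeScript sketch chains the core stages listed above as plain async functions. It is an illustration only and not Pipecat's API (Pipecat is a Python framework). The `Stage` type, the `cascade` helper, and the stub stages are hypothetical stand-ins for Amazon Transcribe, a foundation model on Amazon Bedrock, and Amazon Polly.

```typescript
// Hypothetical sketch of a cascaded voice pipeline: each stage consumes the
// previous stage's output. Real stages stream audio and text incrementally.
type Stage<I, O> = (input: I) => Promise<O>;

interface SpokenReply {
  text: string; // text that a real TTS stage would synthesize into audio
}

// Compose three stages (ASR -> LLM -> TTS) into a single turn handler.
function cascade<A, B, C, D>(
  asr: Stage<A, B>,
  llm: Stage<B, C>,
  tts: Stage<C, D>,
): Stage<A, D> {
  return async (input) => tts(await llm(await asr(input)));
}

// Stub stages standing in for real services (assumptions for illustration).
const fakeAsr: Stage<Uint8Array, string> = async (audio) =>
  `heard ${audio.length} bytes`;
const fakeLlm: Stage<string, string> = async (text) => `reply to: ${text}`;
const fakeTts: Stage<string, SpokenReply> = async (text) => ({ text });

const handleTurn = cascade(fakeAsr, fakeLlm, fakeTts);
```

Because every stage waits for the previous one, the latency of a turn is roughly the sum of the stage latencies, which is why the practices in the next section focus on keeping each stage fast.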

Best practices for building effective AI voice agents

Developing responsive AI voice agents requires focus on latency and efficiency. While best practices continue to emerge, consider the following implementation strategies to achieve natural, human-like interactions:

Minimize conversation latency: Use latency-optimized inference for foundation models (FMs) like Amazon Nova Pro to maintain natural conversation flow.

Select efficient foundation models: Prioritize smaller, faster foundation models (FMs) that can deliver quick responses while maintaining quality.

Implement prompt caching: Utilize prompt caching to optimize for both speed and cost efficiency, especially in complex scenarios requiring knowledge retrieval.

Deploy text-to-speech (TTS) fillers: Use natural filler phrases (such as “Let me look that up for you”) before intensive operations to maintain user engagement while the system makes tool calls or long-running calls to your foundation models.

Build a robust audio input pipeline: Integrate components like noise suppression to support clear audio quality for better speech recognition results.

Start simple and iterate: Begin with basic conversational flows before progressing to complex agentic systems that can handle multiple use cases.

Region availability: Low-latency and prompt caching features may only be available in certain regions. Evaluate the trade-off between these advanced capabilities and selecting a region that is geographically closer to your end-users.
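As one way to picture the TTS filler practice, the sketch below speaks a filler phrase only when an operation is still pending after a short delay. The `withFiller` helper, the `speak` callback, and the 300 ms default are assumptions for illustration. They are not part of Pipecat or the sample application.

```typescript
// Speak a filler phrase only if `task` is still pending after `fillerAfterMs`.
// `speak` is a hypothetical callback that would hand text to the TTS stage.
async function withFiller<T>(
  task: Promise<T>,
  speak: (text: string) => void,
  fillerAfterMs = 300, // assumed delay; tune for your own latency profile
  fillerText = "Let me look that up for you.",
): Promise<T> {
  const timer = setTimeout(() => speak(fillerText), fillerAfterMs);
  try {
    return await task;
  } finally {
    // Fast tasks finish first, so the filler is cancelled and never spoken.
    clearTimeout(timer);
  }
}
```

Gating the filler on a delay avoids adding an unnecessary phrase to turns that already respond quickly.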

Example implementation: Build your own AI voice agent in minutes

This post provides a sample application on GitHub that demonstrates the concepts discussed. It uses Pipecat and its accompanying state management framework, Pipecat Flows, with Amazon Bedrock, along with Web Real-time Communication (WebRTC) capabilities from Daily to create a working voice agent you can try in minutes.

Prerequisites

To set up the sample application, you should have the following prerequisites:

Python 3.10+
An AWS account with appropriate Identity and Access Management (IAM) permissions for Amazon Bedrock, Amazon Transcribe, and Amazon Polly
Access to foundation models on Amazon Bedrock
Access to an API key for Daily
Modern web browser (such as Google Chrome or Mozilla Firefox) with WebRTC support

Implementation Steps

After you complete the prerequisites, you can start setting up your sample voice agent:

Clone the repository:

git clone https://github.com/aws-samples/build-intelligent-ai-voice-agents-with-pipecat-and-amazon-bedrock 
cd build-intelligent-ai-voice-agents-with-pipecat-and-amazon-bedrock/part-1

Set up the environment:

cd server
python3 -m venv venv
source venv/bin/activate  # Windows: venv\Scripts\activate
pip install -r requirements.txt

Configure the API keys in .env:

DAILY_API_KEY=your_daily_api_key
AWS_ACCESS_KEY_ID=your_aws_access_key_id
AWS_SECRET_ACCESS_KEY=your_aws_secret_access_key
AWS_REGION=your_aws_region

Start the server:
```
python server.py
```
Connect via browser at http://localhost:7860 and grant microphone access
Start the conversation with your AI voice agent

Customizing your voice AI agent

To customize, you can start by:

Modifying flow.py to change conversation logic
Adjusting model selection in bot.py for your latency and quality needs

To learn more, see the documentation for Pipecat Flows and review the README of our code sample on GitHub.

Cleanup

The instructions above are for setting up the application in your local environment. The local application will leverage AWS services and Daily through AWS IAM and API credentials. For security and to avoid unanticipated costs, when you are finished, delete these credentials to make sure that they can no longer be accessed.

Accelerating voice AI implementations

To accelerate AI voice agent implementations, AWS Generative AI Innovation Center (GAIIC) partners with customers to identify high-value use cases and develop proof-of-concept (PoC) solutions that can quickly move to production.

Customer Testimonial: InDebted

InDebted, a global fintech transforming the consumer debt industry, collaborates with AWS to develop their voice AI prototype.

“We believe AI-powered voice agents represent a pivotal opportunity to enhance the human touch in financial services customer engagement. By integrating AI-enabled voice technology into our operations, our goals are to provide customers with faster, more intuitive access to support that adapts to their needs, as well as improving the quality of their experience and the performance of our contact centre operations,” says Mike Zhou, Chief Data Officer at InDebted.

By collaborating with AWS and leveraging Amazon Bedrock, organizations like InDebted can create secure, adaptive voice AI experiences that meet regulatory standards while delivering real, human-centric impact in even the most challenging financial conversations.

Conclusion

Building intelligent AI voice agents is now more accessible than ever through the combination of open-source frameworks such as Pipecat and powerful foundation models with latency-optimized inference and prompt caching on Amazon Bedrock.

In this post, you learned about two common approaches to building AI voice agents, delving into the cascaded models approach and its key components. These essential components work together to create an intelligent system that can understand, process, and respond to human speech naturally. By leveraging these rapid advancements in generative AI, you can create sophisticated, responsive voice agents that deliver real value to your users and customers.

To get started with your own voice AI project, try our code sample on GitHub or contact your AWS account team to explore an engagement with AWS Generative AI Innovation Center (GAIIC).

You can also learn about building AI voice agents using a unified speech-to-speech foundation model, Amazon Nova Sonic, in Part 2.

About the Authors

Adithya Suresh serves as a Deep Learning Architect at the AWS Generative AI Innovation Center, where he partners with technology and business teams to build innovative generative AI solutions that address real-world challenges.

Daniel Wirjo is a Solutions Architect at AWS, focused on FinTech and SaaS startups. As a former startup CTO, he enjoys collaborating with founders and engineering leaders to drive growth and innovation on AWS. Outside of work, Daniel enjoys taking walks with a coffee in hand, appreciating nature, and learning new ideas.

Karan Singh is a Generative AI Specialist at AWS, where he works with top-tier third-party foundation model and agentic frameworks providers to develop and execute joint go-to-market strategies, enabling customers to effectively deploy and scale solutions to solve enterprise generative AI challenges.

Xuefeng Liu leads a science team at the AWS Generative AI Innovation Center in the Asia Pacific regions. His team partners with AWS customers on generative AI projects, with the goal of accelerating customers’ adoption of generative AI.

Stream multi-channel audio to Amazon Transcribe using the Web Audio API

June 9, 2025

by Jorge Lanzarotti Amazon AWS

Multi-channel transcription streaming is a feature of Amazon Transcribe that can be used in many cases with a web browser. Creating this stream source has its challenges, but with the JavaScript Web Audio API, you can connect and combine different audio sources like videos, audio files, or hardware like microphones to obtain transcripts.

In this post, we guide you through how to use two microphones as audio sources, merge them into a single dual-channel audio, perform the required encoding, and stream it to Amazon Transcribe. A Vue.js application source code is provided that requires two microphones connected to your browser. However, the versatility of this approach extends far beyond this use case—you can adapt it to accommodate a wide range of devices and audio sources.

With this approach, you can get transcripts for two sources in a single Amazon Transcribe session, offering cost savings and other benefits compared to using a separate session for each source.

Challenges when using two microphones

For our use case, using a single-channel stream for two microphones and enabling Amazon Transcribe speaker label identification to identify the speakers might be enough, but there are a few considerations:

Speaker labels are randomly assigned at session start, meaning you will have to map the results in your application after the stream has started
Speakers with similar voice tones can be mislabeled, because they are hard to tell apart even for a human listener
Voice overlapping can occur when two speakers talk at the same time with one audio source

By using two audio sources with microphones, you can address these concerns by making sure each transcription is from a fixed input source. By assigning a device to a speaker, our application knows in advance which transcript to use. However, you might still encounter voice overlapping if two nearby microphones are picking up multiple voices. This can be mitigated by using directional microphones, volume management, and Amazon Transcribe word-level confidence scores.
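As a rough sketch of the confidence-based mitigation, the following function drops words whose confidence score falls below a threshold. The `TranscriptItem` interface mirrors only the fields used here from the items in an Amazon Transcribe streaming result. The `filterLowConfidence` name and the 0.5 threshold are assumptions for illustration, not recommended values.

```typescript
// Minimal shape of a word-level item in an Amazon Transcribe streaming result.
interface TranscriptItem {
  Content?: string;
  Confidence?: number; // between 0 and 1 when present
  Type?: string; // "pronunciation" or "punctuation"
}

// Keep punctuation, items without a score, and words that meet the threshold.
// The 0.5 default is an illustrative assumption.
function filterLowConfidence(
  items: TranscriptItem[],
  minConfidence = 0.5,
): TranscriptItem[] {
  return items.filter(
    (item) =>
      item.Type === "punctuation" ||
      item.Confidence === undefined ||
      item.Confidence >= minConfidence,
  );
}
```

Words picked up faintly from the other speaker's microphone tend to be the ones you would want to review, so a filter like this is one possible starting point rather than a complete solution.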

Solution overview

The following diagram illustrates the solution workflow.

Application diagram for two microphones

We use two audio inputs with the Web Audio API. With this API, we can merge the two inputs, Mic A and Mic B, into a single audio data source, with the left channel representing Mic A and the right channel representing Mic B.

Then, we convert this audio source to PCM (Pulse-Code Modulation) audio. PCM is a common format for audio processing, and it’s one of the formats required by Amazon Transcribe for the audio input. Finally, we stream the PCM audio to Amazon Transcribe for transcription.

Prerequisites

You should have the following prerequisites in place:

The source code from the GitHub repository.
Bun or Node.js installed as a JavaScript runtime.
A web browser with Web Audio API compatibility. This solution has been tested to work in Google Chrome version 135.0.7049.85.
Two microphones connected to your computer and with browser permission to access these microphones.
An AWS account with Amazon Transcribe permissions. As an example, you can use the following AWS Identity and Access Management (IAM) policy for Amazon Transcribe:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "DemoWebAudioAmazonTranscribe",
      "Effect": "Allow",
      "Action": "transcribe:StartStreamTranscriptionWebSocket",
      "Resource": "*"
    }
  ]
}

Start the application

Complete the following steps to launch the application:

Go to the root directory where you downloaded the code.
Create a .env file from the env.sample file and set your AWS access keys in it.
Install the packages by running bun install (if you’re using Node.js, run npm install).
Start the web server by running bun dev (if you’re using Node.js, run npm run dev).
Open your browser at http://localhost:5173/.

Application running on http://localhost:5173 with two connected microphones

Code walkthrough

In this section, we examine the important code pieces for the implementation:

The first step is to list the connected microphones by using the browser API navigator.mediaDevices.enumerateDevices():

const devices = await navigator.mediaDevices.enumerateDevices()
return devices.filter((d) => d.kind === 'audioinput')

Next, you need to obtain the MediaStream object for each of the connected microphones. This can be done using the navigator.mediaDevices.getUserMedia() API, which enables access to the user’s media devices (such as cameras and microphones). You can then retrieve a MediaStream object that represents the audio or video data from those devices:

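const streams = []
// `microphones` is assumed to hold the audioinput devices listed in the previous step
for (const device of microphones) {
  const stream = await navigator.mediaDevices.getUserMedia({
    audio: {
      deviceId: device.deviceId,
      echoCancellation: true,
      noiseSuppression: true,
      autoGainControl: true,
    },
  })

  if (stream) streams.push(stream)
}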

To combine the audio from the multiple microphones, you need to create an AudioContext interface for audio processing. Within this AudioContext, you can use ChannelMergerNode to merge the audio streams from the different microphones. The connect(destination, src_idx, ch_idx) method arguments are:
- destination – The destination node, in our case mergerNode.
- src_idx – The output index of the source node, in our case 0 for both sources (because each microphone source node has a single output).
- ch_idx – The input index of the destination node, in our case 0 and 1 respectively. Each ChannelMergerNode input becomes one channel of its output, which creates the stereo stream.

// instance of audioContext
const audioContext = new AudioContext({
       sampleRate: SAMPLE_RATE,
})
// this is used to process the microphone stream data
const audioWorkletNode = new AudioWorkletNode(audioContext, 'recording-processor', {...})
// microphone A
const audioSourceA = audioContext.createMediaStreamSource(mediaStreams[0]);
// microphone B
const audioSourceB = audioContext.createMediaStreamSource(mediaStreams[1]);
// audio node for two inputs
const mergerNode = audioContext.createChannelMerger(2);
// connect the audio sources to the mergerNode destination.  
audioSourceA.connect(mergerNode, 0, 0);
audioSourceB.connect(mergerNode, 0, 1);
// connect our mergerNode to the AudioWorkletNode
mergerNode.connect(audioWorkletNode);

The microphone data is processed in an AudioWorklet that emits data messages every defined number of recording frames. These messages will contain the audio data encoded in PCM format to send to Amazon Transcribe. Using the p-event library, you can asynchronously iterate over the events from the Worklet. A more in-depth description about this Worklet is provided in the next section of this post.

import { pEventIterator } from 'p-event'
...

// Register the worklet
try {
  await audioContext.audioWorklet.addModule('./worklets/recording-processor.js')
} catch (e) {
  console.error('Failed to load audio worklet')
}

//  An async iterator 
const audioDataIterator = pEventIterator<'message', MessageEvent<AudioWorkletMessageDataType>>(
  audioWorkletNode.port,
  'message',
)
...

// AsyncIterableIterator: Every time the worklet emits an event with the message `SHARE_RECORDING_BUFFER`, this iterator will return the AudioEvent object that we need.
const getAudioStream = async function* (
  audioDataIterator: AsyncIterableIterator<MessageEvent<AudioWorkletMessageDataType>>,
) {
  for await (const chunk of audioDataIterator) {
    if (chunk.data.message === 'SHARE_RECORDING_BUFFER') {
      const { audioData } = chunk.data
      yield {
        AudioEvent: {
          AudioChunk: audioData,
        },
      }
    }
  }
}

To start streaming the data to Amazon Transcribe, pass the async iterator created above as the AudioStream and set NumberOfChannels: 2 and EnableChannelIdentification: true to enable dual-channel transcription. For more information, refer to the AWS SDK StartStreamTranscriptionCommand documentation.

import {
  LanguageCode,
  MediaEncoding,
  StartStreamTranscriptionCommand,
} from '@aws-sdk/client-transcribe-streaming'

const command = new StartStreamTranscriptionCommand({
    LanguageCode: LanguageCode.EN_US,
    MediaEncoding: MediaEncoding.PCM,
    MediaSampleRateHertz: SAMPLE_RATE,
    NumberOfChannels: 2,
    EnableChannelIdentification: true,
    ShowSpeakerLabel: true,
    AudioStream: getAudioStream(audioDataIterator),
  })

After you send the request, a WebSocket connection is created to exchange audio stream data and Amazon Transcribe results:

const data = await client.send(command)
for await (const event of data.TranscriptResultStream) {
    for (const result of event.TranscriptEvent.Transcript.Results || []) {
        callback({ ...result })
    }
}

The result object will include a ChannelId property that you can use to identify your microphone source, such as ch_0 and ch_1, respectively.
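As a sketch of how the application might use ChannelId, the function below turns a final result into a labeled transcript line. The `ChannelResult` interface covers only the fields used here. The `SPEAKER_BY_CHANNEL` mapping and the `toTranscriptLine` helper are hypothetical names, assuming Mic A feeds the left channel (ch_0) and Mic B the right channel (ch_1) as described in the solution overview.

```typescript
// Minimal shape of a result from TranscriptEvent.Transcript.Results.
interface ChannelResult {
  ChannelId?: string; // "ch_0" or "ch_1" for dual-channel audio
  IsPartial?: boolean;
  Alternatives?: { Transcript?: string }[];
}

// Hypothetical mapping: Mic A is the left channel, Mic B is the right channel.
const SPEAKER_BY_CHANNEL: Record<string, string> = {
  ch_0: "Mic A",
  ch_1: "Mic B",
};

// Turn a final result into a "speaker: text" line; skip partial or empty results.
function toTranscriptLine(result: ChannelResult): string | undefined {
  if (result.IsPartial) return undefined;
  const text = result.Alternatives?.[0]?.Transcript;
  if (!text) return undefined;
  const speaker = SPEAKER_BY_CHANNEL[result.ChannelId ?? ""] ?? "Unknown";
  return `${speaker}: ${text}`;
}
```

Because the channel assignment is fixed by how the sources are connected to the merger node, this mapping can be defined before the stream starts, unlike speaker labels.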

Deep dive: Audio Worklet

Audio Worklets can execute in a separate thread to provide very low-latency audio processing. The implementation and demo source code can be found in the public/worklets/recording-processor.js file.

For our case, we use the Worklet to perform two main tasks:

Process the mergerNode audio in an iterable way. This node includes both of our audio channels and is the input to our Worklet.
Encode the data bytes of the mergerNode node into PCM signed 16-bit little-endian audio format. We do this for each iteration or when required to emit a message payload to our application.

The general code structure to implement this is as follows:

class RecordingProcessor extends AudioWorkletProcessor {
  constructor(options) {
    super()
  }
  process(inputs, outputs) {...}
}

registerProcessor('recording-processor', RecordingProcessor)

You can pass custom options to this Worklet instance using the processorOptions attribute. In our demo, we set maxFrameCount: (SAMPLE_RATE * 4) / 10, which corresponds to 0.4 seconds of audio, as a guide to determine when to emit a new message payload. For example, a message looks like this:

this.port.postMessage({
  message: 'SHARE_RECORDING_BUFFER',
  buffer: this._recordingBuffer,
  recordingLength: this.recordedFrames,
  audioData: new Uint8Array(pcmEncodeArray(this._recordingBuffer)), // PCM encoded audio format
})
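The accumulation logic behind maxFrameCount can be sketched outside the Worklet as follows. `FrameAccumulator` is a hypothetical helper, not the demo's code. The usage note assumes a 16,000 Hz sample rate (so maxFrameCount is 6,400) and 128-frame blocks per process() call, which is the block size current browsers use.

```typescript
// Collects per-channel blocks (as delivered to process()) and returns the
// filled buffers once maxFrameCount frames have been recorded.
class FrameAccumulator {
  private readonly maxFrameCount: number;
  private readonly buffers: Float32Array[];
  private recordedFrames = 0;

  constructor(maxFrameCount: number, numChannels: number) {
    this.maxFrameCount = maxFrameCount;
    this.buffers = Array.from(
      { length: numChannels },
      () => new Float32Array(maxFrameCount),
    );
  }

  // Append one block. Simplification: frames that do not fit are dropped, so
  // maxFrameCount should be a multiple of the block size.
  push(block: Float32Array[]): Float32Array[] | undefined {
    const frames = block[0]?.length ?? 0;
    const count = Math.min(frames, this.maxFrameCount - this.recordedFrames);
    this.buffers.forEach((buffer, channel) => {
      const samples = block[channel];
      if (samples) buffer.set(samples.subarray(0, count), this.recordedFrames);
    });
    this.recordedFrames += count;
    if (this.recordedFrames < this.maxFrameCount) return undefined;
    const full = this.buffers.map((buffer) => buffer.slice());
    this.recordedFrames = 0; // start a new recording window
    return full;
  }
}
```

Under those assumptions, 6,400 / 128 = 50 calls to process() fill the buffers, so one message is emitted for every 50 blocks.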

PCM encoding for two channels

One of the most important steps is encoding the audio to PCM for two channels. Following the AWS documentation in the Amazon Transcribe API Reference, the size of an AudioChunk in bytes is: Duration (s) * Sample Rate (Hz) * Number of Channels * 2. For two channels, 1 second at 16000Hz is: 1 * 16000 * 2 * 2 = 64000 bytes. Our encoding function should then look like this:

// Notice that input is an array, where each element is a channel with Float32 values between -1.0 and 1.0 from the AudioWorkletProcessor.
const pcmEncodeArray = (input: Float32Array[]) => {
  const numChannels = input.length
  const numSamples = input[0].length
  const bufferLength = numChannels * numSamples * 2 // 2 bytes per sample per channel
  const buffer = new ArrayBuffer(bufferLength)
  const view = new DataView(buffer)

  let index = 0

  for (let i = 0; i < numSamples; i++) {
    // Encode for each channel
    for (let channel = 0; channel < numChannels; channel++) {
      const s = Math.max(-1, Math.min(1, input[channel][i]))
      // Convert the 32 bit float to 16 bit PCM audio waveform samples.
      // Max value: 32767 (0x7FFF), Min value: -32768 (-0x8000) 
      view.setInt16(index, s < 0 ? s * 0x8000 : s * 0x7fff, true)
      index += 2
    }
  }
  return buffer
}

For more information on how the audio data blocks are handled, see AudioWorkletProcessor: process() method. For more information on PCM format encoding, see Multimedia Programming Interface and Data Specifications 1.0.
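To check the byte arithmetic and the interleaved layout produced by the encoding function above, the following sketch computes the chunk size and splits an encoded buffer back into channels. Both `audioChunkBytes` and `deinterleavePcm16` are hypothetical helpers written for this illustration. They are not part of the demo or the AWS SDK.

```typescript
// Bytes in a 16-bit PCM chunk: duration (s) * sample rate (Hz) * channels * 2.
function audioChunkBytes(
  durationSeconds: number,
  sampleRateHz: number,
  numChannels: number,
): number {
  return durationSeconds * sampleRateHz * numChannels * 2;
}

// Split interleaved signed 16-bit little-endian PCM into one Int16Array per
// channel. Samples are laid out as L0 R0 L1 R1 ... for two channels.
function deinterleavePcm16(
  buffer: ArrayBuffer,
  numChannels: number,
): Int16Array[] {
  const view = new DataView(buffer);
  const numSamples = Math.floor(buffer.byteLength / (2 * numChannels));
  const channels = Array.from(
    { length: numChannels },
    () => new Int16Array(numSamples),
  );
  for (let i = 0; i < numSamples; i++) {
    channels.forEach((channel, c) => {
      channel[i] = view.getInt16((i * numChannels + c) * 2, true);
    });
  }
  return channels;
}
```

Decoding a buffer produced by the encoder is a quick way to confirm that Mic A ends up in the first channel and Mic B in the second before sending audio to Amazon Transcribe.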

Conclusion

In this post, we explored the implementation details of a web application that uses the browser’s Web Audio API and Amazon Transcribe streaming to enable real-time dual-channel transcription. By using the combination of AudioContext, ChannelMergerNode, and AudioWorklet, we were able to seamlessly process and encode the audio data from two microphones before sending it to Amazon Transcribe for transcription. The use of the AudioWorklet in particular allowed us to achieve low-latency audio processing, providing a smooth and responsive user experience.

You can build upon this demo to create more advanced real-time transcription applications that cater to a wide range of use cases, from meeting recordings to voice-controlled interfaces.

Try out the solution for yourself, and leave your feedback in the comments.

About the Author

Jorge Lanzarotti is a Sr. Prototyping SA at Amazon Web Services (AWS) based in Tokyo, Japan. He helps customers in the public sector by creating innovative solutions to challenging problems.

How Kepler democratized AI access and enhanced client services with Amazon Q Business

June 9, 2025

by Evan Miller, Noah Kershaw, and Valerie Renda Amazon AWS

This is a guest post co-authored by Evan Miller, Noah Kershaw, and Valerie Renda of Kepler Group

At Kepler, a global full-service digital marketing agency serving Fortune 500 brands, we understand the delicate balance between creative marketing strategies and data-driven precision. Our company name draws inspiration from the visionary astronomer Johannes Kepler, reflecting our commitment to bringing clarity to complex challenges and illuminating the path forward for our clients.

In this post, we share how implementing Amazon Q Business transformed our operations by democratizing AI access across our organization while maintaining stringent security standards, resulting in an average savings of 2.7 hours per week per employee in manual work and improved client service delivery.

The challenge: Balancing innovation with security

As a digital marketing agency working with Fortune 500 clients, we faced increasing pressure to use AI capabilities while making sure that we maintain the highest levels of data security. Our previous solution lacked essential features, which led team members to consider more generic solutions. Specifically, the original implementation was missing critical capabilities such as chat history functionality, preventing users from accessing or referencing their prior conversations. This absence of conversation context meant users had to repeatedly provide background information in each interaction. Additionally, the solution had no file upload capabilities, limiting users to text-only interactions. These limitations resulted in a basic AI experience where users often had to compromise by rewriting prompts, manually maintaining context, and working around the inability to process different file formats. The restricted functionality ultimately pushed teams to explore alternative solutions that could better meet their comprehensive needs.

Being an International Organization for Standardization (ISO) 27001-certified organization, we needed an enterprise-grade solution that would meet our strict security requirements without compromising on functionality. Our ISO 27001 certification mandates rigorous security controls, which meant that public AI tools weren’t suitable for our needs. We required a solution that could be implemented within our secure environment while maintaining full compliance with our stringent security protocols.

Why we chose Amazon Q Business

Our decision to implement Amazon Q Business was driven by three key factors that aligned perfectly with our needs. First, because our Kepler Intelligence Platform (Kip) infrastructure already resided on Amazon Web Services (AWS), the integration process was seamless. Our Amazon Q Business implementation uses three core connectors (Amazon Simple Storage Service (Amazon S3), Google Drive, and Amazon Athena), though our wider data ecosystem includes 35–45 different platform integrations, primarily flowing through Amazon S3. Second, the commitment from Amazon Q Business to not use our data for model training satisfied our essential security requirements. Finally, the Amazon Q Business apps functionality enabled us to develop no-code solutions for everyday challenges, democratizing access to efficient workflows without requiring additional software developers.

Implementation journey

We began our Amazon Q Business implementation journey in early 2025 with a focused pilot group of 10 participants, expanding to 100 users in February and March, with plans for a full deployment reaching 500+ employees. During this period, we organized an AI-focused hackathon that catalyzed organic adoption and sparked creative solutions. The implementation was unique in how we integrated Amazon Q Business into our existing Kepler Intelligence Platform, rebranding it as Kip AI to maintain consistency with our internal systems.

Kip AI demonstrates how we’ve comprehensively integrated AI capabilities with our existing data infrastructure. We use multiple data sources, including Amazon S3 for our storage needs, Amazon QuickSight for our business intelligence requirements, and Google Drive for team collaboration. At the heart of our system is our custom extract, transform, and load (ETL) pipeline (Kip SSoT), which we’ve designed to feed data into QuickSight for AI-enabled analytics. We’ve configured Amazon Q Business to seamlessly connect with these data sources, allowing our team members to access insights through both a web interface and browser extension. The following figure shows the architecture of Kip AI.

This integrated approach helps ensure that our employees can securely access AI capabilities while maintaining the data governance and security requirements that are crucial for our clients. Access to the platform is secured through AWS Identity and Access Management (IAM), connected to our single sign-on provider, ensuring that only authorized personnel can use the system. This careful approach to security and access management has been crucial in maintaining our clients’ trust while rolling out AI capabilities across our organization.

Transformative use cases and results

The implementation of Amazon Q Business has revolutionized several key areas of our operations. Our request for information (RFI) response process, which traditionally consumed significant time and resources, has been streamlined dramatically. Teams now report saving over 10 hours per RFI response, allowing us to pursue more business opportunities efficiently.

Client communications have also seen substantial improvements. The platform helps us draft clear, consistent, and timely communications, from routine emails to comprehensive status reports and presentations. This enhancement in communication quality has strengthened our client relationships and improved service delivery.

Perhaps most significantly, we’ve achieved remarkable efficiency gains across the organization. Our employees report saving an average of 2.7 hours per week in manual work, with user satisfaction rates exceeding 87%. The platform has enabled us to standardize our approach to insight generation, ensuring consistent, high-quality service delivery across all client accounts.

Looking ahead

As we expand Amazon Q Business access to all Kepler employees (over 500) in the coming months, we’re maintaining a thoughtful approach to deployment. We recognize that some clients have specific requirements regarding AI usage, and we’re carefully balancing innovation with client preferences. This strategic approach includes working to update client contracts and helping clients become more comfortable with AI integration while respecting their current guidelines.

Conclusion

Our experience with Amazon Q Business demonstrates how enterprise-grade AI can be successfully implemented while maintaining strict security standards and respecting client preferences. The platform has not only improved our operational efficiency but has also enhanced our ability to deliver consistent, high-quality service to our clients. What’s particularly impressive is the platform’s rapid deployment capabilities—we were able to implement the solution within weeks, without any coding requirements, and eliminate ongoing model maintenance and data source management expenses. As we continue to expand our use of Amazon Q Business, we’re excited about the potential for further innovation and efficiency gains in our digital marketing services.

About the authors

Evan Miller is a strategic product leader who joined Kepler in 2013. Currently serving as Global Head of Product and Data Science, he owns the end-to-end product strategy for the Kepler Intelligence Platform (Kip). Under his leadership, Kip has garnered industry recognition, winning awards for Best Performance Management Solution and Best Commerce Technology, while driving significant business impact through innovative features like automated Machine Learning analytics and Marketing Mix Modeling technology.

Noah Kershaw leads the product team at Kepler Group, a global digital marketing agency that helps brands connect with their audiences through data-driven strategies. With a passion for innovation, Noah has been at the forefront of integrating AI solutions to enhance client services and streamline operations. His collaborative approach and enthusiasm for leveraging technology have been key in bringing Kepler’s “Future in Focus” vision to life, helping Kepler and its clients navigate the modern era of marketing with clarity and precision.

Valerie Renda, Director of Data Strategy & Analytics, has a specialized focus on data strategy, analytics, and marketing systems strategy within digital marketing, a field she’s worked in for over eight years. At Kepler, she has made significant contributions to various clients’ data management and martech strategies. She has been instrumental in leading data infrastructure projects, including customer data platform implementations, business intelligence visualization implementations, server-side tracking, martech consolidation, tag migrations, and more. She has also led the development of workflow tools to automate data processes and streamline ad operations to improve internal organizational processes.

Al Destefano is a Sr. Generative AI Specialist on the Amazon Q GTM team based in New York City. At AWS, he uses technical knowledge and business experience to communicate the tangible enterprise benefits when using managed Generative AI AWS services.

Sunanda Patel is a Senior Account Manager with over 15 years of expertise in management consulting and IT sectors, with a focus on business development and people management. Throughout her career, Sunanda has successfully managed diverse client relationships, ranging from non-profit to corporate and large multinational enterprises. Sunanda joined AWS in 2022 as an Account Manager for the Manhattan Commercial sector and now works with strategic commercial accounts, helping them grow in their cloud journey to achieve complex business goals.

Kumar Karra is a Sr. Solutions Architect at AWS supporting SMBs. He is an experienced engineer with deep experience in the software development lifecycle. Kumar looks to solve challenging problems by applying technical, leadership, and business skills. He holds a Master’s Degree in Computer Science and Machine Learning from Georgia Institute of Technology and is based in New York (US).

Build a serverless audio summarization solution with Amazon Bedrock and Whisper

June 6, 2025

by Kaiyin Hu Amazon AWS

Recordings of business meetings, interviews, and customer interactions have become essential for preserving important information. However, transcribing and summarizing these recordings manually is often time-consuming and labor-intensive. With the progress in generative AI and automatic speech recognition (ASR), automated solutions have emerged to make this process faster and more efficient.

Protecting personally identifiable information (PII) is a vital aspect of data security, driven by both ethical responsibilities and legal requirements. In this post, we demonstrate how to use the OpenAI Whisper foundation model (FM) Whisper Large V3 Turbo to produce near real-time transcription. The model is available in Amazon Bedrock Marketplace, which offers access to over 140 models through a dedicated offering. These transcriptions are then processed by Amazon Bedrock for summarization and redaction of sensitive information.

Amazon Bedrock is a fully managed service that offers a choice of high-performing FMs from leading AI companies like AI21 Labs, Anthropic, Cohere, DeepSeek, Luma, Meta, Mistral AI, poolside (coming soon), Stability AI, and Amazon Nova through a single API, along with a broad set of capabilities to build generative AI applications with security, privacy, and responsible AI. Additionally, you can use Amazon Bedrock Guardrails to automatically redact sensitive information, including PII, from the transcription summaries to support compliance and data protection needs.

In this post, we walk through an end-to-end architecture that combines a React-based frontend with Amazon Bedrock, AWS Lambda, and AWS Step Functions to orchestrate the workflow, facilitating seamless integration and processing.

Solution overview

The solution highlights the power of integrating serverless technologies with generative AI to automate and scale content processing workflows. The user journey begins with uploading a recording through a React frontend application, hosted on Amazon CloudFront and backed by Amazon Simple Storage Service (Amazon S3) and Amazon API Gateway. When the file is uploaded, it triggers a Step Functions state machine that orchestrates the core processing steps, using AI models and Lambda functions for seamless data flow and transformation. The following diagram illustrates the solution architecture.

The workflow consists of the following steps:

The React application is hosted in an S3 bucket and served to users through CloudFront for fast, global access. API Gateway handles interactions between the frontend and backend services.
Users upload audio or video files directly from the app. These recordings are stored in a designated S3 bucket for processing.
An Amazon EventBridge rule detects the S3 upload event and triggers the Step Functions state machine, initiating the AI-powered processing pipeline.
The state machine performs audio transcription, summarization, and redaction by orchestrating multiple Amazon Bedrock models in sequence. It uses Whisper for transcription, Claude for summarization, and Guardrails to redact sensitive data.
The redacted summary is returned to the frontend application and displayed to the user.
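
The upload-triggered step above can be sketched as follows. This is a minimal illustration, assuming the uploads bucket has EventBridge notifications enabled; the bucket name, the helper names, and the sample event are hypothetical and are not taken from the solution's repository.

```python
import json


def build_upload_event_pattern(bucket_name):
    """Return an EventBridge event pattern matching new objects in one bucket."""
    return {
        "source": ["aws.s3"],
        "detail-type": ["Object Created"],
        "detail": {"bucket": {"name": [bucket_name]}},
    }


def extract_s3_location(event):
    """Pull the bucket name and object key out of an S3 'Object Created' event."""
    detail = event["detail"]
    return {"bucket": detail["bucket"]["name"], "key": detail["object"]["key"]}


# Hypothetical event, trimmed to the fields used here
sample_event = {
    "source": "aws.s3",
    "detail-type": "Object Created",
    "detail": {
        "bucket": {"name": "example-recordings-bucket"},
        "object": {"key": "uploads/meeting.mp3"},
    },
}

pattern_json = json.dumps(build_upload_event_pattern("example-recordings-bucket"))
location = extract_s3_location(sample_event)
```

The extracted bucket and key are the kind of input the state machine would need to locate the recording.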

The following diagram illustrates the state machine workflow.

The Step Functions state machine orchestrates a series of tasks to transcribe, summarize, and redact sensitive information from uploaded audio/video recordings:

A Lambda function is triggered to gather input details (for example, Amazon S3 object path, metadata) and prepare the payload for transcription.
The payload is sent to the OpenAI Whisper Large V3 Turbo model through the Amazon Bedrock Marketplace to generate a near real-time transcription of the recording.
The raw transcript is passed to Anthropic’s Claude 3.5 Sonnet through Amazon Bedrock, which produces a concise and coherent summary of the conversation or content.
A second Lambda function validates and forwards the summary to the redaction step.
The summary is processed through Amazon Bedrock Guardrails, which automatically redacts PII and other sensitive data.
The redacted summary is stored or returned to the frontend application through an API, where it is displayed to the user.

Prerequisites

Before you start, make sure that you have the following prerequisites in place:

Before using Amazon Bedrock models, you must request access, which is a one-time setup step. For this solution, verify that access to Anthropic’s Claude 3.5 Sonnet model is enabled in your Amazon Bedrock account. For instructions, see Access Amazon Bedrock foundation models.
Set up a guardrail to enable PII redaction by configuring filters that block or mask sensitive information. For guidance on configuring filters for additional use cases, see Remove PII from conversations by using sensitive information filters.
Deploy the Whisper Large V3 Turbo model within the Amazon Bedrock Marketplace. This post also offers step-by-step guidance for the deployment.
The AWS Command Line Interface (AWS CLI) should be installed and configured. For instructions, see Installing or updating to the latest version of the AWS CLI.
Node.js 14.x or later should be installed.
The AWS CDK CLI should be installed.
You should have Python 3.8+.

Create a guardrail in the Amazon Bedrock console

For instructions on creating guardrails in Amazon Bedrock, refer to Create a guardrail. For details on detecting and redacting PII, see Remove PII from conversations by using sensitive information filters. Configure your guardrail with the following key settings:

Enable PII detection and handling
Set PII action to Redact
Add the relevant PII types, such as:
- Names and identities
- Phone numbers
- Email addresses
- Physical addresses
- Financial information
- Other sensitive personal information

After you deploy the guardrail, note its Amazon Resource Name (ARN). You will need it when deploying the model.
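
For readers who prefer scripting over the console, the settings above might be expressed through the CreateGuardrail API roughly as follows. This is a sketch under assumptions: the guardrail name, description, messages, and the chosen PII types are illustrative, and the ANONYMIZE action is assumed to correspond to the masking behavior described above.

```python
def build_pii_guardrail_request(name, pii_types):
    """Build keyword arguments for the Bedrock CreateGuardrail API call."""
    return {
        "name": name,
        "description": "Masks PII in transcription summaries",
        "sensitiveInformationPolicyConfig": {
            # ANONYMIZE replaces detected entities with a placeholder (assumed
            # here to match the redaction setting chosen in the console)
            "piiEntitiesConfig": [
                {"type": pii_type, "action": "ANONYMIZE"} for pii_type in pii_types
            ]
        },
        "blockedInputMessaging": "This input cannot be processed.",
        "blockedOutputsMessaging": "This output cannot be returned.",
    }


def create_pii_guardrail(bedrock_client, name, pii_types):
    """Create the guardrail and return its ARN.

    bedrock_client is expected to be boto3.client("bedrock"), the control
    plane client (not "bedrock-runtime").
    """
    response = bedrock_client.create_guardrail(
        **build_pii_guardrail_request(name, pii_types)
    )
    return response["guardrailArn"]


# Hypothetical name and entity selection
request = build_pii_guardrail_request(
    "audio-summary-pii",
    ["NAME", "PHONE", "EMAIL", "ADDRESS", "CREDIT_DEBIT_CARD_NUMBER"],
)
```

Building the request separately from the API call keeps the configuration easy to review and test without AWS credentials.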

Deploy the Whisper model

Complete the following steps to deploy the Whisper Large V3 Turbo model:

On the Amazon Bedrock console, choose Model catalog under Foundation models in the navigation pane.
Search for and choose Whisper Large V3 Turbo.
On the options menu (three dots), choose Deploy.

Modify the endpoint name, number of instances, and instance type to suit your specific use case. For this post, we use the default settings.
Modify the Advanced settings section to suit your use case. For this post, we use the default settings.
Choose Deploy.

This creates a new AWS Identity and Access Management (IAM) role and deploys the model.

You can choose Marketplace deployments in the navigation pane, and in the Managed deployments section, you can see the endpoint status as Creating. Wait for the endpoint to finish deployment and the status to change to In Service, then copy the Endpoint Name, and you will be using this when deploying the

Deploy the solution infrastructure

In the GitHub repo, follow the instructions in the README file to clone the repository, then deploy the frontend and backend infrastructure.

We use the AWS Cloud Development Kit (AWS CDK) to define and deploy the infrastructure. The AWS CDK code deploys the following resources:

React frontend application
Backend infrastructure
S3 buckets for storing uploads and processed results
Step Functions state machine with Lambda functions for audio processing and PII redaction
API Gateway endpoints for handling requests
IAM roles and policies for secure access
CloudFront distribution for hosting the frontend

Implementation deep dive

The backend is composed of a sequence of Lambda functions, each handling a specific stage of the audio processing pipeline:

Upload handler – Receives audio files and stores them in Amazon S3
Transcription with Whisper – Converts speech to text using the Whisper model
Speaker detection – Differentiates and labels individual speakers within the audio
Summarization using Amazon Bedrock – Extracts and summarizes key points from the transcript
PII redaction – Uses Amazon Bedrock Guardrails to remove sensitive information for privacy compliance

Let’s examine some of the key components:

The transcription Lambda function uses the Whisper model to convert audio files to text:

import json

import boto3

# Runtime client for the SageMaker endpoint that hosts the Whisper deployment
sagemaker_runtime = boto3.client("sagemaker-runtime")


def transcribe_with_whisper(audio_chunk, endpoint_name):
    # Convert the raw audio bytes to a hex string
    hex_audio = audio_chunk.hex()

    # Create payload for Whisper model
    payload = {
        "audio_input": hex_audio,
        "language": "english",
        "task": "transcribe",
        "top_p": 0.9
    }

    # Invoke the SageMaker endpoint running Whisper
    response = sagemaker_runtime.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType='application/json',
        Body=json.dumps(payload)
    )

    # Parse the transcription response
    response_body = json.loads(response['Body'].read().decode('utf-8'))
    transcription_text = response_body['text']

    return transcription_text

We use Amazon Bedrock to generate concise summaries from the transcriptions:

import json

import boto3

bedrock_runtime = boto3.client("bedrock-runtime")


def generate_summary(transcription):
    # Format the prompt with the transcription
    prompt = f"{transcription}\n\nGive me the summary, speakers, key discussions, and action items with owners"

    # Call Bedrock for summarization; Claude 3 models expect the Messages API request format
    response = bedrock_runtime.invoke_model(
        modelId="anthropic.claude-3-5-sonnet-20240620-v1:0",
        body=json.dumps({
            "anthropic_version": "bedrock-2023-05-31",
            "max_tokens": 4096,
            "temperature": 0.7,
            "top_p": 0.9,
            "messages": [{"role": "user", "content": prompt}],
        })
    )

    # Extract and return the summary text from the first content block
    result = json.loads(response["body"].read())
    return result["content"][0]["text"]
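
As an alternative sketch, the same summarization request could go through the Amazon Bedrock Converse API, which uses one request shape across models. The helper name and the stub client below are assumptions for illustration; the stub only stands in for a real bedrock-runtime client so the function can be exercised locally.

```python
def summarize_with_converse(bedrock_runtime, transcription, model_id):
    """Summarize a transcript through the Bedrock Converse API."""
    prompt = (
        f"{transcription}\n\n"
        "Give me the summary, speakers, key discussions, and action items with owners"
    )
    response = bedrock_runtime.converse(
        modelId=model_id,
        messages=[{"role": "user", "content": [{"text": prompt}]}],
        inferenceConfig={"maxTokens": 4096, "temperature": 0.7, "topP": 0.9},
    )
    # The reply text sits in the first content block of the output message
    return response["output"]["message"]["content"][0]["text"]


class StubBedrockRuntime:
    """Stand-in for boto3.client('bedrock-runtime'), used only for local testing."""

    def __init__(self):
        self.last_request = None

    def converse(self, **kwargs):
        self.last_request = kwargs
        return {
            "output": {
                "message": {"role": "assistant", "content": [{"text": "stub summary"}]}
            }
        }


stub = StubBedrockRuntime()
summary = summarize_with_converse(
    stub, "Alice: Let's ship on Friday.", "anthropic.claude-3-5-sonnet-20240620-v1:0"
)
```

Passing the client in as an argument makes the function straightforward to test without network access.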

A critical component of our solution is the automatic redaction of PII. We implemented this using Amazon Bedrock Guardrails to support compliance with privacy regulations:

def apply_guardrail(bedrock_runtime, content, guardrail_id):
    # Format content according to API requirements
    formatted_content = [{"text": {"text": content}}]

    # Call the guardrail API
    response = bedrock_runtime.apply_guardrail(
        guardrailIdentifier=guardrail_id,
        guardrailVersion="DRAFT",
        source="OUTPUT",  # Using OUTPUT parameter for proper flow
        content=formatted_content
    )

    # Extract redacted text from response
    if 'action' in response and response['action'] == 'GUARDRAIL_INTERVENED':
        if len(response['outputs']) > 0:
            output = response['outputs'][0]
            if 'text' in output and isinstance(output['text'], str):
                return output['text']

    # Return original content if redaction fails
    return content

When PII is detected, it’s replaced with type indicators (for example, {PHONE} or {EMAIL}), making sure that summaries remain informative while protecting sensitive data.
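
Because the placeholders follow a regular pattern, a downstream step could count them, for example to log how much was redacted per summary. The following is a minimal sketch; the regular expression, the helper name, and the sample text are assumptions, not part of the solution's code.

```python
import re
from collections import Counter

# Matches placeholders such as {PHONE} or {EMAIL}
PLACEHOLDER_PATTERN = re.compile(r"\{([A-Z][A-Z0-9_]*)\}")


def count_redactions(redacted_text):
    """Count guardrail placeholders by entity type."""
    return Counter(PLACEHOLDER_PATTERN.findall(redacted_text))


# Hypothetical redacted summary
sample = "Reach {NAME} at {PHONE} or {EMAIL}; {NAME} owns the follow-up."
counts = count_redactions(sample)
```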

To manage the complex processing pipeline, we use Step Functions to orchestrate the Lambda functions:

{
  "Comment": "Audio Summarization Workflow",
  "StartAt": "TranscribeAudio",
  "States": {
    "TranscribeAudio": {
      "Type": "Task",
      "Resource": "arn:aws:states:::lambda:invoke",
      "Parameters": {
        "FunctionName": "WhisperTranscriptionFunction",
        "Payload": {
          "bucket.$": "$.bucket",
          "key.$": "$.key"
        }
      },
      "Next": "IdentifySpeakers"
    },
    "IdentifySpeakers": {
      "Type": "Task",
      "Resource": "arn:aws:states:::lambda:invoke",
      "Parameters": {
        "FunctionName": "SpeakerIdentificationFunction",
        "Payload": {
          "Transcription.$": "$.Payload"
        }
      },
      "Next": "GenerateSummary"
    },
    "GenerateSummary": {
      "Type": "Task",
      "Resource": "arn:aws:states:::lambda:invoke",
      "Parameters": {
        "FunctionName": "BedrockSummaryFunction",
        "Payload": {
          "SpeakerIdentification.$": "$.Payload"
        }
      },
      "End": true
    }
  }
}

This workflow makes sure each step completes successfully before proceeding to the next, with automatic error handling and retry logic built in.
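
In Amazon States Language, retries and error handling are typically declared with the Retry and Catch fields of a Task state. The sketch below shows how such fields might be added to the transcription state; the retry settings and the HandleFailure state name are hypothetical and are not taken from the solution's repository.

```python
import json


def with_error_handling(state, fallback_state):
    """Return a copy of a Task state with Retry and Catch fields added."""
    hardened = dict(state)
    hardened["Retry"] = [
        {
            # Transient Lambda errors that are usually safe to retry
            "ErrorEquals": ["Lambda.ServiceException", "Lambda.TooManyRequestsException"],
            "IntervalSeconds": 2,
            "MaxAttempts": 3,
            "BackoffRate": 2.0,
        }
    ]
    # Route any remaining error to a fallback state (hypothetical name)
    hardened["Catch"] = [{"ErrorEquals": ["States.ALL"], "Next": fallback_state}]
    return hardened


transcribe_state = {
    "Type": "Task",
    "Resource": "arn:aws:states:::lambda:invoke",
    "Parameters": {
        "FunctionName": "WhisperTranscriptionFunction",
        "Payload": {"bucket.$": "$.bucket", "key.$": "$.key"},
    },
    "Next": "IdentifySpeakers",
}

hardened_state = with_error_handling(transcribe_state, "HandleFailure")
definition_fragment = json.dumps({"TranscribeAudio": hardened_state}, indent=2)
```

With exponential backoff, the three attempts wait roughly 2, 4, and 8 seconds before the Catch rule takes over.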

Test the solution

After you have successfully completed the deployment, you can use the CloudFront URL to test the solution functionality.

Security considerations

Security is a critical aspect of this solution, and we’ve implemented several best practices to support data protection and compliance:

Sensitive data redaction – Automatically redact PII to protect user privacy.
Fine-grained IAM permissions – Apply the principle of least privilege across AWS services and resources.
Amazon S3 access controls – Use strict bucket policies to limit access to authorized users and roles.
API security – Secure API endpoints using Amazon Cognito for user authentication (optional but recommended).
CloudFront protection – Enforce HTTPS and apply modern TLS protocols to facilitate secure content delivery.
Amazon Bedrock data security – Amazon Bedrock (including Amazon Bedrock Marketplace) protects customer data and does not send data to providers or train using customer data. This makes sure your proprietary information remains secure when using AI capabilities.
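
As one illustration of the Amazon S3 access controls item above, a bucket policy can deny any request that does not arrive over HTTPS. This is a generic sketch, not the policy used by the solution; the bucket name and helper name are hypothetical.

```python
import json


def build_tls_only_bucket_policy(bucket_name):
    """Return a bucket policy that denies any request not made over HTTPS."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyInsecureTransport",
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:*",
                "Resource": [
                    f"arn:aws:s3:::{bucket_name}",
                    f"arn:aws:s3:::{bucket_name}/*",
                ],
                # aws:SecureTransport is false when the request was sent over plain HTTP
                "Condition": {"Bool": {"aws:SecureTransport": "false"}},
            }
        ],
    }


policy_json = json.dumps(
    build_tls_only_bucket_policy("example-recordings-bucket"), indent=2
)
```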

Clean up

To prevent unnecessary charges, make sure to delete the resources provisioned for this solution when you’re done:

Delete the Amazon Bedrock guardrail:
1. On the Amazon Bedrock console, in the navigation menu, choose Guardrails.
2. Choose your guardrail, then choose Delete.
Delete the Whisper Large V3 Turbo model deployed through the Amazon Bedrock Marketplace:
1. On the Amazon Bedrock console, choose Marketplace deployments in the navigation pane.
2. In the Managed deployments section, select the deployed endpoint and choose Delete.
Delete the AWS CDK stack by running the command cdk destroy, which deletes the AWS infrastructure.

Conclusion

This serverless audio summarization solution demonstrates the benefits of combining AWS services to create a sophisticated, secure, and scalable application. By using Amazon Bedrock for AI capabilities, Lambda for serverless processing, and CloudFront for content delivery, we’ve built a solution that can handle large volumes of audio content efficiently while helping you align with security best practices.

The automatic PII redaction feature supports compliance with privacy regulations, making this solution well-suited for regulated industries such as healthcare, finance, and legal services where data security is paramount. To get started, deploy this architecture within your AWS environment to accelerate your audio processing workflows.

About the Authors

Kaiyin Hu is a Senior Solutions Architect for Strategic Accounts at Amazon Web Services, with years of experience across enterprises, startups, and professional services. Currently, she helps customers build cloud solutions and drives generative AI adoption in the cloud. Previously, Kaiyin worked in the Smart Home domain, assisting customers in integrating voice and IoT technologies.

Sid Vantair is a Solutions Architect with AWS covering Strategic accounts. He thrives on resolving complex technical issues to overcome customer hurdles. Outside of work, he cherishes spending time with his family and fostering inquisitiveness in his children.