A better training method for reinforcement learning with human feedback

May 2, 2025

by Sailik Sengupta Amazon AWS

A better training method for reinforcement learning with human feedback

Contrasting training pairs with large reward differences mitigate spurious correlations and improve performance of direct-alignment algorithms by as much as 20%40%.

Machine learning

Sailik Sengupta

Saket Dingliwal

May 02, 09:00 AMMay 02, 10:23 AM

Reinforcement learning with human feedback (RLHF) is the standard method for aligning large language models (LLMs) with human preferences such as the preferences for nontoxic language and factually accurate responses. Recently, one of the most popular RLHF methods has been direct preference optimization, in which the LLM chooses between two output options, one of which has been labeled as preferred by human annotators.

With direct preference optimization (DPO), however and with other, similar direct-alignment algorithms LLMs run the risk of learning spurious correlations from the data. In toxicity datasets, for instance, its common for the serious, thoughtful responses to be longer than the offensive responses. During RLHF, an LLM could thus learn to prefer longer responses to shorter ones, which may not be preferable in general.

At this years International Conference on Learning Representations (ICLR), we presented a method for limiting such spurious correlations, which we call SeRA, for self-reviewing and alignment. First, after the first round of RLHF on human-annotated data, we use the LLM itself to generate additional training examples. Then we use the LLMs output probabilities to assess the strength of preference for training pairs, keeping only those where the preferred response is strongly preferred.

To evaluate our approach, we compare a model trained using SeRA to three baseline models on four benchmark datasets. For each test input, we compare our models output to that of each of the baselines, and we use an off-the-shelf LLM to choose the better response. The SeRA-trained models win rate in these pairwise comparisons is higher than all three baselines across the board, sometimes by as much as 20% to 40%.

Direct preference optimization

Reinforcement learning is a trial-and-error method in which an agent interacts with the world and, depending on the actions it takes, receives greater or lesser rewards. Over time, the agent attempts to learn a policy that maximizes its cumulative reward.

In classical reinforcement learning, the interaction with the world can be literal: a robot, for instance, might receive a large reward for successfully navigating to a prescribed location and a negative reward for bumping into a wall. In RLHF, however, the reward depends on how well an LLMs output aligns with a paradigm case specified by a human.

With traditional RLHF, the reward is calculated by a separate model, which is also trained on human-annotated data. But this is a time-consuming approach that doesnt scale well. With DPO, theres no need for a second model: the LLM receives the reward if it picks the human-preferred output and not if it doesnt.

The drawback of DPO is that it treats all training pairs equally: the reward is the same whether the preferred output is strongly preferred or only mildly preferred. This increases the chances that the model will learn spurious correlations.

If, for instance, choosing strongly toxic responses incurred a greater penalty than choosing mildly toxic responses, the model could infer that toxicity and not response length was the relevant feature of the training examples. DPO irons out those differences; SeRA reintroduces them.

With SeRa, we first perform conventional DPO, using a dataset of human-annotated example pairs. After this first pass through the data, the LLM has learned something about the types of outputs that humans prefer.

We then use the updated model to generate a new set of training examples. For every generated response pair, we assign each response a preference score, which is based on the updated models probability of generating that response. We then keep only those pairs in which the preferred response scores significantly higher than the non-preferred response.

With SeRa (self-reviewing and alignment), the updated model generates a new response pair (a winner, or <i>y<sub>w</sub></i>, and loser, or <i>y<sub>l</sub></i>) for each sample input (<i>x</i>). Each response receives a preference score, which is based on the updated models probability of generating it. Pairs in which the score of the preferred response is significantly higher than that of the non-preferred response <i>(green)</i> are kept; the others <i>(red)</i> are discarded.<br/><br/>

Using the same metric, we next filter the data in the original, human-annotated dataset. Then we combine filtered samples from the original dataset with filtered samples from our new, generated dataset and perform DPO once again. This process repeats, with the generated samples constituting a larger and larger fraction of the dataset, until model performance converges.

The intuition here is that if a dataset is designed to represent some contrast, but it also contains spurious correlations, then the intended contrast between, say, toxic and non-toxic data will be significantly greater than the unintended contrast between, say, long and short responses.

This assumption held for the four benchmark datasets we used to evaluate our method, and we think that its a plausible assumption for other spurious correlations. But there could be instances in which it doesnt hold, so in applications of the SeRA method, the models convergence behavior should be monitored.

While we used DPO in our experiments, in our paper, we also demonstrate how to generalize our method to other direct-alignment algorithms. Finally, theres some risk that, when using model-generated data to train a model, we could get into a feedback loop where the model overamplifies some aspect of the initial dataset. As a consequence, in each pass through the data, the models reward is based not only on the current iteration but on past iterations as well, to ensure continuity in the characteristic features of the training data.

Acknowledgments: Sravan Bodapati

SeRA

Research areas: Machine learning, Conversational AI

Tags: Large language models (LLMs), Reinforcement learning, Contrastive learning

Best practices for Meta Llama 3.2 multimodal fine-tuning on Amazon Bedrock

May 1, 2025

by Yanyan Zhang Amazon AWS

Multimodal fine-tuning represents a powerful approach for customizing foundation models (FMs) to excel at specific tasks that involve both visual and textual information. Although base multimodal models offer impressive general capabilities, they often fall short when faced with specialized visual tasks, domain-specific content, or particular output formatting requirements. Fine-tuning addresses these limitations by adapting models to your specific data and use cases, dramatically improving performance on tasks that mater to your business. Our experiments show that fine-tuned Meta Llama 3.2 models can achieve up to 74% improvements in accuracy scores compared to their base versions with prompt optimization on specialized visual understanding tasks. Amazon Bedrock now offers fine-tuning capabilities for Meta Llama 3.2 multimodal models, so you can adapt these sophisticated models to your unique use case.

In this post, we share comprehensive best practices and scientific insights for fine-tuning Meta Llama 3.2 multimodal models on Amazon Bedrock. Our recommendations are based on extensive experiments using public benchmark datasets across various vision-language tasks, including visual question answering, image captioning, and chart interpretation and understanding. By following these guidelines, you can fine-tune smaller, more cost-effective models to achieve performance that rivals or even surpasses much larger models—potentially reducing both inference costs and latency, while maintaining high accuracy for your specific use case.

Recommended use cases for fine-tuning

Meta Llama 3.2 multimodal fine-tuning excels in scenarios where the model needs to understand visual information and generate appropriate textual responses. Based on our experimental findings, the following use cases demonstrate substantial performance improvements through fine-tuning:

Visual question answering (VQA) – Customization enables the model to accurately answer questions about images.
Chart and graph interpretation – Fine-tuning allows models to comprehend complex visual data representations and answer questions about them.
Image captioning – Fine-tuning helps models generate more accurate and descriptive captions for images.
Document understanding – Fine-tuning is particularly effective for extracting structured information from document images. This includes tasks like form field extraction, table data retrieval, and identifying key elements in invoices, receipts, or technical diagrams. When working with documents, note that Meta Llama 3.2 processes documents as images (such as PNG format), not as native PDFs or other document formats. For multi-page documents, each page should be converted to a separate image and processed individually.
Structured output generation – Fine-tuning can teach models to output information in consistent JSON formats or other structured representations based on visual inputs, making integration with downstream systems more reliable.

One notable advantage of multimodal fine-tuning is its effectiveness with mixed datasets that contain both text-only and image and text examples. This versatility allows organizations to improve performance across a range of input types with a single fine-tuned model.

Prerequisites

To use this feature, make sure that you have satisfied the following requirements:

An active AWS account.
Meta Llama 3.2 models enabled in your Amazon Bedrock account. You can confirm that the models are enabled on the Model access page of the Amazon Bedrock console.
As of writing this post, Meta Llama 3.2 model customization is available in the US West (Oregon) AWS Region. Refer to Supported models and Regions for fine-tuning and continued pre-training for updates on Regional availability and quotas.
The required training dataset (and optional validation dataset) prepared and stored in Amazon Simple Storage Service (Amazon S3).

To create a model customization job using Amazon Bedrock, you need to create an AWS Identity and Access Management (IAM) role with the following permissions (for more details, see Create a service role for model customization):

A trust relationship, which allows Amazon Bedrock to assume the role
Permissions to access training and validation data in Amazon S3
Permissions to write output data to Amazon S3
Optionally, permissions to decrypt an AWS Key Management Service (AWS KMS) key if you have encrypted resources with a KMS key

The following code is the trust relationship, which allows Amazon Bedrock to assume the IAM role:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "Service": "bedrock.amazonaws.com"
            },
            "Action": "sts:AssumeRole",
            "Condition": {
                "StringEquals": {
                    "aws:SourceAccount": <account-id>
                },
                "ArnEquals": {
                    "aws:SourceArn": "arn:aws:bedrock:<region>:account-id:model-customization-job/*" 
                }
            }
        }
    ] 
}

Key multimodal datasets and experiment setup

To develop our best practices, we conducted extensive experiments using three representative multimodal datasets:

LlaVA-Instruct-Mix-VSFT – This comprehensive dataset contains diverse visual question-answering pairs specifically formatted for vision-language supervised fine-tuning. The dataset includes a wide variety of natural images paired with detailed instructions and high-quality responses.
ChartQA – This specialized dataset focuses on question answering about charts and graphs. It requires sophisticated visual reasoning to interpret data visualizations and answer numerical and analytical questions about the presented information.
Cut-VQAv2 – This is a carefully curated subset of the VQA dataset, containing diverse image-question-answer triplets designed to test various aspects of visual understanding and reasoning.

Our experimental approach involved systematic testing with different sample sizes (ranging between 100–10,000 samples) from each dataset to understand how performance scales with data quantity. We fine-tuned both Meta Llama 3.2 11B and Meta Llama 3.2 90B models, using Amazon Bedrock Model Customization, to compare the impact of model size on performance gains. The models were evaluated using the SQuAD F1 score metric, which measures the word-level overlap between generated responses and reference answers.

Best practices for data preparation

The quality and structure of your training data fundamentally determine the success of fine-tuning. Our experiments revealed several critical insights for preparing effective multimodal datasets:

Data structure – You should use a single image per example rather than multiple images. Our research shows this approach consistently yields superior performance in model learning. With one image per example, the model forms clearer associations between specific visual inputs and corresponding textual outputs, leading to more accurate predictions across various tasks. Although we recommend single-image training examples for optimal results, you can include multiple images per training record based on your use case. Refer to Model requirements for training and validation datasets for detailed data preparation requirements.
Start small, scale as needed – Larger datasets generally produce better results, but initial gains are often substantial even with minimal data. Our experiments demonstrate that even small datasets (approximately 100 samples) yield significant performance improvements over base models. For specialized use cases, we recommend starting with approximately 100 high-quality examples, which can deliver meaningful improvements, then scaling up as resources allow to further enhance performance. This approach makes multimodal fine-tuning accessible even with limited data collection resources.
Annotation quality – This emerges as another critical factor, with the precision and consistency of text associated with images dramatically affecting fine-tuning results. Investing time in creating accurate, detailed annotations that reflect the desired output format yields substantial dividends in model performance. For complex reasoning tasks like chart interpretation, comprehensive annotations should capture the entire reasoning process rather than merely providing the final answer, enabling the model to learn sophisticated analytical pathways.
- Quality annotation (ChartQA as an example):
  - Question: Which mode of transport has CO2 emissions value 191.6 g?
  - Good annotation: Medium car (petrol)
  - Bad annotation: Medium car (diesel)
- Quality annotation (Cut-VQAv2 as an example):
  - Question: What does the sign in the window read?
  - Good annotation: ATM inside
  - Bad annotation: a clock
- Quality annotation (LlaVA as an example):
  - Question:
    - User: What type of building is featured in the image?
    - Assistant: The image features a historic church with two tall towers, which is a large and beautiful building.
    - User: Where is the church located in the city?
    - Assistant: The historic church is located at an intersection in the city.
    - User: Is there a street sign in the image? If so, can you describe it?
    - Assistant: Yes, there is a small street sign in the image, with the large and beautiful church in the background.
    - User: Are there any cars or other vehicles visible in the image?
    - Assistant: Yes, there are cars on the road near the old, large building in the city.
    - User: Is there any celebration or event happening in the area?
  - Good annotation:
    - Assistant: It appears that people are celebrating Cinco de Mayo in this area.
  - Bad annotation:
    - Assistant: People gather annually to celebrate National Pizza Day by launching tacos into orbit from the church rooftops.
Validation data – This provides additional performance insights during fine-tuning. We recommend allocating 10–20% of the dataset for validation purposes. Amazon Bedrock customization outputs validation loss metrics throughput the training process, allowing you to assess model convergence and potential overfitting without conducting extensive inference benchmarks. These validation metrics serve as early indicators of how your fine-tuned model performs on unseen data, providing additional performance insights during fine-tuning.
Formatting consistency – Consistency throughout your dataset further enhances learning efficiency. Standardizing the structure of training examples, particularly how images are referenced within the text, helps the model develop stable patterns for interpreting the relationship between visual and textual elements. This consistency enables more reliable learning across diverse examples and facilitates better generalization to new inputs during inference. Importantly, make sure that the data you plan to use for inference follows the same format and structure as your training data; significant differences between training and testing inputs can reduce the effectiveness of the fine-tuned model.

Configuring fine-tuning parameters

When fine-tuning Meta Llama 3.2 multimodal models on Amazon Bedrock, you can configure the following key parameters to optimize performance for your specific use case:

Epochs – The number of complete passes through your training dataset significantly impacts model performance. Our findings suggest:
- For smaller datasets (fewer than 500 examples): Consider using more epochs (7–10) to allow the model sufficient learning opportunities with limited data. With the ChartQA dataset at 100 samples, increasing from 3 to 8 epochs improved F1 scores by approximately 5%.
- For medium datasets (500–5,000 examples): The default setting of 5 epochs typically works well, balancing effective learning with training efficiency.
- For larger datasets (over 5,000 examples): You might achieve good results with fewer epochs (3–4), because the model sees sufficient examples to learn patterns without overfitting.
Learning rate – This parameter controls how quickly the model adapts to your training data, with significant implications for performance:
- For smaller datasets: Lower learning rates (5e-6 to 1e-5) can help prevent overfitting by making more conservative parameter updates.
- For larger datasets: Slightly higher learning rates (1e-5 to 5e-5) can achieve faster convergence without sacrificing quality.
- If uncertain: Start with a learning rate of 1e-5 (the default), which performed robustly across most of our experimental conditions.
Behind-the-scenes optimizations – Through extensive experimentation, we’ve optimized implementations of Meta Llama 3.2 multimodal fine-tuning in Amazon Bedrock for better efficiency and performance. These include batch processing strategies, LoRA configuration settings, and prompt masking techniques that improved fine-tuned model performance by up to 5% compared to open-source fine-tuning recipe performance. These optimizations are automatically applied, allowing you to focus on data quality and the configurable parameters while benefiting from our research-backed tuning strategies.

Model size selection and performance comparison

Choosing between Meta Llama 3.2 11B and Meta Llama 3.2 90B for fine-tuning presents an important decision that balances performance against cost and latency considerations. Our experiments reveal that fine-tuning dramatically enhances performance regardless of model size. Looking at ChartQA as an example, the 11B base model improved from 64.1 with prompt optimization to 69.5 F1 score with fine-tuning, a 8.4% increase, whereas the 90B model improved from 64.0 to 71.9 F1 score (12.3% increase). For Cut-VQAv2, the 11B model improved from 42.17 to 73.2 F1 score (74% increase) and the 90B model improved from 67.4 to 76.5 (13.5% increase). These substantial gains highlight the transformative impact of multimodal fine-tuning even before considering model size differences.

The following visualization demonstrates how these fine-tuned models perform across different datasets and training data volumes.

The visualization demonstrates that the 90B model (orange bars) consistently outperforms the 11B model (blue bars) across all three datasets and training sizes. This advantage is most pronounced in complex visual reasoning tasks such as ChartQA, where the 90B model achieves 71.9 F1 score compared to 69.5 for the 11B model at 10,000 samples. Both models show improved performance as training data increases, with the most dramatic gains observed in the LLaVA dataset, where the 11B model improves from 76.2 to 82.4 F1 score and 90B model improves from 76.6 to 83.1 F1 score, when scaling from 100 to 10,000 samples.

An interesting efficiency pattern emerges when comparing across sample sizes: in several cases, the 90B model with fewer training samples outperforms the 11B model with significantly more data. For instance, in the Cut-VQAv2 dataset, the 90B model trained on just 100 samples (72.9 F1 score) exceeds the performance of the 11B model trained on 1,000 samples (68.6 F1 score).

For optimal results, we recommend selecting the 90B model for applications demanding maximum accuracy, particularly with complex visual reasoning tasks or limited training data. The 11B model remains an excellent choice for balanced applications where resource efficiency is important, because it still delivers substantial improvements over base models while requiring fewer computational resources.

Conclusion

Fine-tuning Meta Llama 3.2 multimodal models on Amazon Bedrock offers organizations a powerful way to create customized AI solutions that understand both visual and textual information. Our experiments demonstrate that following best practices—using high-quality data with consistent formatting, selecting appropriate parameters, and validating results—can yield dramatic performance improvements across various vision-language tasks. Even with modest datasets, fine-tuned models can achieve remarkable enhancements over base models, making this technology accessible to organizations of all sizes.

Ready to start fine-tuning your own multimodal models? Explore our comprehensive code samples and implementation examples in our GitHub repository. Happy fine-tuning!

About the authors

Yanyan Zhang is a Senior Generative AI Data Scientist at Amazon Web Services, where she has been working on cutting-edge AI/ML technologies as a Generative AI Specialist, helping customers use generative AI to achieve their desired outcomes. Yanyan graduated from Texas A&M University with a PhD in Electrical Engineering. Outside of work, she loves traveling, working out, and exploring new things.

Ishan Singh is a Generative AI Data Scientist at Amazon Web Services, where he helps customers build innovative and responsible generative AI solutions and products. With a strong background in AI/ML, Ishan specializes in building Generative AI solutions that drive business value. Outside of work, he enjoys playing volleyball, exploring local bike trails, and spending time with his wife and dog, Beau.

Sovik Kumar Nath is an AI/ML and Generative AI senior solution architect with AWS. He has extensive experience designing end-to-end machine learning and business analytics solutions in finance, operations, marketing, healthcare, supply chain management, and IoT. He has double masters degrees from the University of South Florida, University of Fribourg, Switzerland, and a bachelors degree from the Indian Institute of Technology, Kharagpur. Outside of work, Sovik enjoys traveling, taking ferry rides, and watching movies.

Karel Mundnich is a Sr. Applied Scientist in AWS Agentic AI. He has previously worked in AWS Lex and AWS Bedrock, where he worked in speech recognition, speech LLMs, and LLM fine-tuning. He holds a PhD in Electrical Engineering from the University of Southern California. In his free time, he enjoys skiing, hiking, and cycling.

Marcelo Aberle is a Sr. Research Engineer at AWS Bedrock. In recent years, he has been working at the intersection of science and engineering to enable new AWS service launches. This includes various LLM projects across Titan, Bedrock, and other AWS organizations. Outside of work, he keeps himself busy staying up-to-date on the latest GenAI startups in his adopted home city of San Francisco, California.

Jiayu Li is an Applied Scientist at AWS Bedrock, where he contributes to the development and scaling of generative AI applications using foundation models. He holds a Ph.D. and a Master’s degree in computer science from Syracuse University. Outside of work, Jiayu enjoys reading and cooking.

Fang Liu is a principal machine learning engineer at Amazon Web Services, where he has extensive experience in building AI/ML products using cutting-edge technologies. He has worked on notable projects such as Amazon Transcribe and Amazon Bedrock. Fang Liu holds a master’s degree in computer science from Tsinghua University.

Jennifer Zhu is a Senior Applied Scientist at AWS Bedrock, where she helps building and scaling generative AI applications with foundation models. Jennifer holds a PhD degree from Cornell University, and a master degree from University of San Francisco. Outside of work, she enjoys reading books and watching tennis games.

Extend large language models powered by Amazon SageMaker AI using Model Context Protocol

May 1, 2025

by Mona Mona Amazon AWS

Organizations implementing agents and agent-based systems often experience challenges such as implementing multiple tools, function calling, and orchestrating the workflows of the tool calling. An agent uses a function call to invoke an external tool (like an API or database) to perform specific actions or retrieve information it doesn’t possess internally. These tools are integrated as an API call inside the agent itself, leading to challenges in scaling and tool reuse across an enterprise. Customers looking to deploy agents at scale need a consistent way to integrate these tools, whether internal or external, regardless of the orchestration framework they are using or the function of the tool.

Model Context Protocol (MCP) aims to standardize how these channels, agents, tools, and customer data can be used by agents, as shown in the following figure. For customers, this translates directly into a more seamless, consistent, and efficient experience compared to dealing with fragmented systems or agents. By making tool integration simpler and standardized, customers building agents can now focus on which tools to use and how to use them, rather than spending cycles building custom integration code. We will deep dive into the MCP architecture later in this post.

For MCP implementation, you need a scalable infrastructure to host these servers and an infrastructure to host the large language model (LLM), which will perform actions with the tools implemented by the MCP server. Amazon SageMaker AI provides the ability to host LLMs without worrying about scaling or managing the undifferentiated heavy lifting. You can deploy your model or LLM to SageMaker AI hosting services and get an endpoint that can be used for real-time inference. Moreover, you can host MCP servers on the compute environment of your choice from AWS, including Amazon Elastic Compute Cloud (Amazon EC2), Amazon Elastic Container Service (Amazon ECS), Amazon Elastic Kubernetes Service (Amazon EKS), and AWS Lambda, according to your preferred level of managed service—whether you want to have complete control of the machine running the server, or you prefer not to worry about maintaining and managing these servers.

In this post, we discuss the following topics:

Understanding the MCP architecture, why you should use the MCP compared to implementing microservices or APIs, and two popular ways of implementing MCP using LangGraph adapters:
- FastMCP for prototyping and simple use cases
- FastAPI for complex routing and authentication
Recommended architecture for scalable deployment of MCP
Using SageMaker AI with FastMCP for rapid prototyping
Implementing a loan underwriter MCP workflow with LangGraph and SageMaker AI with FastAPI for custom routing

Understanding MCP

Let’s deep dive into the MCP architecture. Developed by Anthropic as an open protocol, the MCP provides a standardized way to connect AI models to virtually any data source or tool. Using a client-server architecture (as illustrated in the following screenshot), MCP helps developers expose their data through lightweight MCP servers while building AI applications as MCP clients that connect to these servers.

The MCP uses a client-server architecture containing the following components:

Host – A program or AI tool that requires access to data through the MCP protocol, such as Anthropic’s Claude Desktop, an integrated development environment (IDE), or other AI applications
Client – Protocol clients that maintain one-to-one connections with servers
Server – Lightweight programs that expose capabilities through standardized MCP or act as tools
Data sources – Local data sources such as databases and file systems, or external systems available over the internet through APIs (web APIs) that MCP servers can connect to

Based on these components, we can define the protocol as the communication backbone connecting the MCP client and server within the architecture, which includes the set of rules and standards defining how clients and servers should interact, what messages they exchange (using JSON-RPC 2.0), and the roles of different components.

Now let’s understand the MCP workflow and how it interacts with an LLM to deliver you a response by using an example of a travel agent. You ask the agent to “Book a 5-day trip to Europe in January and we like warm weather.” The host application (acting as an MCP client) identifies the need for external data and connects through the protocol to specialized MCP servers for flights, hotels, and weather information. These servers return the relevant data through the MCP, which the host then integrates with the original prompt, providing enriched context to the LLM to generate a comprehensive, augmented response for the user. The following diagram illustrates this workflow.

When to use MCP instead of implementing microservices or APIs

MCP marks a significant advancement compared to traditional monolithic APIs and intricate microservices architectures. Traditional APIs often bundle the functionalities together, leading to challenges where scaling requires upgrading the entire system, updates carry high risks of system-wide failures, and managing different versions for various applications becomes overly complex. Although microservices offer more modularity, they typically demand separate, often complex, integrations for each service and intricate management overhead.

MCP overcomes these limitations by establishing a standardized client-server architecture specifically designed for efficient and secure integration. It provides a real-time, two-way communication interface enabling AI systems to seamlessly connect with diverse external tools, API services, and data sources using a “write once, use anywhere” philosophy. Using transports like standard input/output (stdio) or streamable HTTP under the unifying JSON-RPC 2.0 standard, MCP delivers key advantages such as superior fault isolation, dynamic service discovery, consistent security controls, and plug-and-play scalability, making it exceptionally well-suited for AI applications that require reliable, modular access to multiple resources.

FastMCP vs. FastAPI

In this post, we discuss two different approaches for implementing MCP servers: FastAPI with SageMaker, and FastMCP with LangGraph. Both are fully compatible with the MCP architecture and can be used interchangeably, depending on your needs. Let’s understand the difference between both.

FastMCP is used for rapid prototyping, educational demos, and scenarios where development speed is a priority. It’s a lightweight, opinionated wrapper built specifically for quickly standing up MCP-compliant endpoints. It abstracts away much of the boilerplate—such as input/output schemas and request handling—so you can focus entirely on your model logic.

For use cases where you need to customize request routing, add authentication, or integrate with observability tools like Langfuse or Prometheus, FastAPI gives you the flexibility to do so. FastAPI is a full-featured web framework that gives you finer-grained control over the server behavior. It’s well-suited for more complex workflows, advanced request validation, detailed logging, middleware, and other production-ready features.

You can safely use either approach in your MCP servers—the choice depends on whether you prioritize simplicity and speed (FastMCP) or flexibility and extensibility (FastAPI). Both approaches conform to the same interface expected by agents in the LangGraph pipeline, so your orchestration logic remains unchanged.

Solution overview

In this section, we walk through a reference architecture for scalable deployment of MCP servers and MCP clients, using SageMaker AI as the hosting environment for the foundation models (FMs) and LLMs. Although this architecture uses SageMaker AI as its reasoning core, it can be quickly adapted to support Amazon Bedrock models as well. The following diagram illustrates the solution architecture.

The architecture decouples the client from the server by using streamable HTTP as the transport layer. By doing this, clients and servers can scale independently, making it a great fit for serverless orchestration powered by Lambda, AWS Fargate for Amazon ECS, or Fargate for Amazon EKS. An additional benefit of decoupling is that you can better control authorization of applications and user by controlling AWS Identity and Access Management (IAM) permissions of client and servers separately, and propagating user access to the backend. If you’re running client and server with a monolithic architecture on the same compute, we suggest instead using stdio as the transport layer to reduce networking overhead.

Use SageMaker AI with FastMCP for rapid prototyping

With the architecture defined, let’s analyze the application flow as shown in the following figure.

In terms of usage patterns, MCP shares a logic similar to tool calling, with an initial addition to discover the available tools:

The client connects to the MCP server and obtains a list of available tools.
The client invokes the LLM using a prompt engineered with the list of tools available on the MCP server (message of type “user”).
The LLM reasons with respect to which ones it needs to call and how many times, and replies (“assistant” type message).
The client asks the MCP server to execute the tool calling and provides the result to the LLM (“user” type message).
This loop iterates until a final answer is reached and can be given back to the user.
The client disconnects from the MCP server.

Let’s start with the MCP server definition. To create an MCP server, we use the official Model Context Protocol Python SDK. For example, let’s create a simple server with just one tool. The tool will simulate searching for the most popular song played at a radio station, and return it in a Python dictionary. Make sure to add proper docstring and input/output typing, so that the both the server and client can discover and consume the resource correctly.

from mcp.server.fastmcp import FastMCP

# instantiate an MCP server client
mcp = FastMCP("Radio Station Server")

# DEFINE TOOLS
@mcp.tool()
def top_song(sign: str) -> dict:
"""Get the most popular song played on a radio station"""
# In this example, we simulate the return
# but you should replace this with your business logic
return {
"song": "In the end",
"author": "Linkin Park"
}

@mcp.tool()
def ...

if __name__ == "__main__":
# Start the MCP server using stdio/SSE transport
  mcp.run(transport="sse")

As we discussed earlier, MCP servers can be run on AWS compute services—Amazon EC2, Amazon EC2, Amazon EKS, or Lambda—and can then be used to safely access other resources in the AWS Cloud, for example databases in virtual private clouds (VPCs) or an enterprise API, as well as external resources. For example, a simple way to deploy an MCP server is to use Lambda support for Docker images to install the MCP dependency on the Lambda function or Fargate.

With the server set up, let’s turn our focus to the MCP client. Communication starts with the MCP client connecting to the MCP Server using streamable HTTP:

from mcp import ClientSession
from mcp.client.sse import sse_client

async def connect_to_sse_server(self, server_url: str):
"""Connect to an MCP server running with SSE transport"""
  # Store the context managers so they stay alive
  self._streams_context = sse_client(url=server_url)
  streams = await self._streams_context.__aenter__()

  self._session_context = ClientSession(*streams)
  self.session: ClientSession = await self._session_context.__aenter__()

  # Initialize
  await self.session.initialize()

  # List available tools to verify connection
  print("Initialized SSE client...")
  print("Listing tools...")
  response = await self.session.list_tools()
  tools = response.tools
  print("nConnected to server with tools:", [tool.name for tool in tools])

When connecting to the MCP server, a good practice is to ask the server for a list of available tools with the list_tools() API. With the tool list and their description, we can then define a system prompt for tool calling:

system_message = (
     "You are a helpful assistant with access to these tools:nn"
      f"{tools_description}n"
      "Choose the appropriate tool based on the user's question. "
      "If no tool is needed, reply directly.nn"
      "IMPORTANT: When you need to use a tool, you must ONLY respond with "
      "the exact JSON object format below, nothing else:n"
      "{n"
      '    "tool": "tool-name",n'
      '    "arguments": {n'
      '        "argument-name": "value"n'
      "    }n"
      "}nn"
      "After receiving a tool's response:n"
      "1. Transform the raw data into a natural, conversational responsen"
      "2. Keep responses concise but informativen"
      "3. Focus on the most relevant informationn"
      "4. Use appropriate context from the user's questionn"
      "5. Avoid simply repeating the raw datann"
      "Please use only the tools that are explicitly defined above."
)

Tools are usually defined using a JSON schema similar to the following example. This tool is called top_song and its function is to get the most popular song played on a radio station:

{
   "name": "top_song",
   "description": "Get the most popular song played on a radio station.",
   "parameters": {
     "type": "object",
     "properties": {
        "sign": {
           "type": "string",
           "description": "The call sign for the radio station for which you want the most popular song. Example calls signs are WZPZ and WKRP."
           }
         },
     "required": ["sign"]
     }
}

With the system prompt configured, you can run the chat loop as much as needed, alternating between invoking the hosted LLM and calling the tools powered by the MCP server. You can use packages such as SageMaker Boto3, the Amazon SageMaker Python SDK, or another third-party library, such as LiteLLM or similar.

messages = [
     {"role": "system", "content": system_message},
     {"role": "user", "content": "What is the most played song on WZPZ?"}
]

result = sagemaker_client.invoke_endpoint(...)
tool_name, tool_args = parse_tools_from_llm_response(result)
# Identify if there is a tool call in the message received from the LLM
result = await self.session.call_tool(tool_name, tool_args)
# Parse the output from the tool called, then invoke the endpoint again
result = sagemaker_client.invoke_endpoint(...)

A model hosted on SageMaker doesn’t support function calling natively in its API. This means that you will need to parse the content of the response using a regular expression or similar methods:

import re, json

def parse_tools_from_llm_response(message: str)->dict:
    match = re.search(r'(?s){(?:[^{}]|(?:{[^{}]*}))*}', content)
    content = json.loads(match.group(0))
    tool_name = content["tool"]
    tool_arguments = content["arguments"]
    return tool_name, tool_arguments

After no more tool requests are available in the LLM response, you can consider the content as the final answer and return it to the user. Finally, you close the stream to finalize interactions with the MCP server.

Implement a loan underwriter MCP workflow with LangGraph and SageMaker AI with FastAPI for custom routing

To demonstrate the power of MCP with SageMaker AI, let’s explore a loan underwriting system that processes applications through three specialized personas:

Loan officer – Summarizes the application
Credit analyst – Evaluates creditworthiness
Risk manager – Makes final approval or denial decisions

We will walk you through these personas through the following architecture for a loan processing workflow using MCP. The code for this solution is available in the following GitHub repo.

In the architecture, the MCP client and server are running on EC2 instances and the LLM is hosted on SageMaker endpoints. The workflow consists of the following steps:

The user enters a prompt with loan input details such as name, age, income, and credit score.
The request is routed to the loan MCP server by the MCP client.
The loan parser sends output as input to the credit analyzer MCP server.
The credit analyzer sends output as input to the risk manager MCP server.
The final prompt is processed by the LLM and sent back to the MCP client to provide the output to the user.

You can use LangGraph’s built-in human-in-the-loop feature when the credit analyzer sends the output to the risk manager and when the risk manager sends the output. For this post, we have not implemented this workflow.

Each persona is powered by an agent with LLMs hosted by SageMaker AI, and its logic is deployed by using a dedicated MCP server. Our MCP server implementation in the example uses the Awesome MCP FastAPI, but you can also build a standard MCP server implementation according to the original Anthropic package and specification. The dedicated MCP server in this example is running on a local Docker container, but it can be quickly deployed to the AWS Cloud using services like Fargate. To run the servers locally, use the following code:

uvicorn servers.loan_parser.main:app --port 8002
uvicorn servers.credit_analyzer.main:app --port 8003
uvicorn servers.risk_assessor.main:app --port 8004

When the servers are running, you can start creating the agents and the workflow. You will need to deploy the LLM endpoint by running the following command:

Python deploy_sm_endpoint.py

This example uses LangGraph, a common open source framework for agentic workflows, designed to support seamless integration of language models into complex workflows and applications. Workflows are represented as graphs made of nodes—actions, tools, or model queries—and edges with the flow of information between them. LangGraph provides a structured yet dynamic way to execute tasks, making it simple to write AI applications involving natural language understanding, automation, and decision-making.

In our example, the first agent we create is the loan officer:

graph = StateGraph(State)
graph.add_node("LoanParser", call_mcp_server(PARSER_URL))

The goal of the loan officer (or LoanParser) is to perform the tasks defined in its MCP server. To call the MCP server, we can use the httpx library:

import httpx
from langchain_core.runnables import RunnableLambda

def call_mcp_server(url):
    async def fn(state: State) -> State:
      print(f"[DEBUG] Calling {url} with payload:", state["output"])
      async with httpx.AsyncClient() as client:
        response = await client.post(url, json=state["output"])
        response.raise_for_status()
        return {"output": response.json()}
    return RunnableLambda(fn).with_config({"run_name": f"CallMCP::{url.split(':')[2]}"})

With that done, we can run the workflow using the scripts/run_pipeline.py file. We configured the repository to be traceable by using LangSmith. If you have correctly configured the environment variables, you will see a trace similar to this one in your LangSmith UI.

Configuring LangSmith UI for experiment tracing is optional. You can skip this step.

After running python3 scripts/run_pipeline.py, you should see the following in your terminal or log.

We use the following input:

loan_input = {
  "output": {
     "name": "Jane Doe",
     "age": 35,
     "income": 2000000,
     "loan_amount": 4500000,
     "credit_score": 820,
     "existing_liabilities": 15000,
     "purpose": "Home Renovation"
     }
}

We get the following output:

[DEBUG] Calling http://localhost:8002/process with payload: {'name': 'Jane Doe', 'age': 35, 'income': 2000000, 'loan_amount': 4500000, 'credit_score': 820, 'existing_liabilities': 15000, 'purpose': 'Home Renovation'}

[DEBUG] Calling http://localhost:8003/process with payload: {'summary': 'Jane Doe, 35 years old, applying for a loan of $4,500,000 to renovate her home. She has an income of $2,000,000, a credit score of 820, and existing liabilities of $150,000.', 'fields': {'name': 'Jane Doe', 'age': 35, 'income': 2000000.0, 'loan_amount': 4500000.0, 'credit_score': 820, 'existing_liabilities': 15000.0, 'purpose': 'Home Renovation'}}

[DEBUG] Calling http://localhost:8004/process with payload: {'credit_assessment': 'High', 'score': 'High', 'fields': {'name': 'Jane Doe', 'age': 35, 'income': 2000000.0, 'loan_amount': 4500000.0, 'credit_score': 820, 'existing_liabilities': 15000.0, 'purpose': 'Home Renovation'}}

Final result: {'decision': 'Approved', 'reasoning': 'Decision: Approved'}

Tracing with the LangSmith UI

LangSmith traces contain the full information of all the inputs and outputs of each step of the application, giving users full visibility into their agent. This is an optional step and in case you have configured LangSmith for tracing the MCP loan processing application. You can go the LangSmith login page and log in to the LangSmith UI. Then you can choose Tracing Project and run LoanUnderwriter. You should see a detailed flow of each MCP server, such as loan parser, credit analyzer, and risk assessor input and outputs by the LLM, as shown in the following screenshot.

Conclusion

The MCP proposed by Anthropic offers a standardized way of connecting FMs to data sources, and now you can use this capability with SageMaker AI. In this post, we presented an example of combining the power of SageMaker AI and MCP to build an application that offers a new perspective on loan underwriting through specialized roles and automated workflows.

Organizations can now streamline their AI integration processes by minimizing custom integrations and maintenance bottlenecks. As AI continues to evolve, the ability to securely connect models to your organization’s critical systems will become increasingly valuable. Whether you’re looking to transform loan processing, streamline operations, or gain deeper business insights, the SageMaker AI and MCP integration provides a flexible foundation for your next AI innovation.

The following are some examples of what you can build by connecting your SageMaker AI models to MCP servers:

A multi-agent loan processing system that coordinates between different roles and data sources
A developer productivity assistant that integrates with enterprise systems and tools
A machine learning workflow orchestrator that manages complex, multi-step processes while maintaining context across operations

If you’re looking for ways to optimize your SageMaker AI deployment, learn more about how to unlock cost savings with the new scale down to zero feature in SageMaker Inference, as well as how to unlock cost-effective AI inference using Amazon Bedrock serverless capabilities with a SageMaker trained model. For application development, refer to Build agentic AI solutions with DeepSeek-R1, CrewAI, and Amazon SageMaker AI

About the Authors

Mona Mona currently works as a Sr World Wide Gen AI Specialist Solutions Architect at Amazon focusing on Gen AI Solutions. She was a Lead Generative AI specialist in Google Public Sector at Google before joining Amazon. She is a published author of two books – Natural Language Processing with AWS AI Services and Google Cloud Certified Professional Machine Learning Study Guide. She has authored 19 blogs on AI/ML and cloud technology and a co-author on a research paper on CORD19 Neural Search which won an award for Best Research Paper at the prestigious AAAI (Association for the Advancement of Artificial Intelligence) conference.

Davide Gallitelli is a Senior Worldwide Specialist Solutions Architect for Generative AI at AWS, where he empowers global enterprises to harness the transformative power of AI. Based in Europe but with a worldwide scope, Davide partners with organizations across industries to architect custom AI agents that solve complex business challenges using AWS ML stack. He is particularly passionate about democratizing AI technologies and enabling teams to build practical, scalable solutions that drive organizational transformation.

Surya Kari is a Senior Generative AI Data Scientist at AWS, specializing in developing solutions leveraging state-of-the-art foundation models. He has extensive experience working with advanced language models including DeepSeek-R1, the Llama family, and Qwen, focusing on their fine-tuning and optimization for specific scientific applications. His expertise extends to implementing efficient training pipelines and deployment strategies using AWS SageMaker, enabling the scaling of foundation models from development to production. He collaborates with customers to design and implement generative AI solutions, helping them navigate model selection, fine-tuning approaches, and deployment strategies to achieve optimal performance for their specific use cases.

Giuseppe Zappia is a Principal Solutions Architect at AWS, with over 20 years of experience in full stack software development, distributed systems design, and cloud architecture. In his spare time, he enjoys playing video games, programming, watching sports, and building things.

Automate document translation and standardization with Amazon Bedrock and Amazon Translate

May 1, 2025

by Nadhya Polanco Amazon AWS

Multinational organizations face the complex challenge of effectively managing a workforce and operations across different countries, cultures, and languages. Maintaining consistency and alignment across these global operations can be difficult, especially when it comes to updating and sharing business documents and processes. Delays or miscommunications can lead to productivity losses, operational inefficiencies, or potential business disruptions. Accurate and timely sharing of translated documents across the organization is an important step in making sure that employees have access to the latest information in their native language.

In this post, we show how you can automate language localization through translating documents using Amazon Web Services (AWS). The solution combines Amazon Bedrock and AWS Serverless technologies, a suite of fully managed event-driven services for running code, managing data, and integrating applications—all without managing servers. Amazon Bedrock is a fully managed service that offers a choice of high-performing foundation models (FMs) from leading AI companies like AI21 Labs, Anthropic, Cohere, Meta, Mistral AI, and Stability AI. Amazon Bedrock is accessible through a single API, along with a broad set of capabilities you need to build generative AI applications with security, privacy, and responsible AI.

Solution overview

The solution uses AWS Step Functions to orchestrate the translation of the source document into the specified language (English, French, or Spanish) using AWS Lambda functions to call Amazon Translate. Note that Amazon Translate currently supports translation of 75 languages and 3 have been chosen for this demo. It then uses Amazon Bedrock to refine the translation and create natural, flowing content.

Building this solution, shown in the following diagram, on AWS fully managed and serverless technologies eliminates the need to operate infrastructure, manage capacity, or invest significant funding upfront to evaluate the business benefit. The compute and AI services used to process documents for translation run only on demand, resulting in a consumption-based billing model where you only pay for your use.

The document translation and standardization workflow consists of the following steps:

The user uploads their source document requiring translation to the input Amazon Simple Storage Service (Amazon S3) bucket. The bucket has three folders: English, French, and Spanish. The user uploads the source document to the folder that matches the current language of the document. This can be done using the AWS Management Console, the AWS Command Line Interface (AWS CLI), or third-party tools that allow them to navigate an S3 bucket as a file system.
The presence of a new document in the input bucket initiates the Step Functions workflow using Amazon S3 Event Notifications.
The first step of this workflow is an AWS Lambda function that retrieves the source document from the bucket, saves it in temporary storage, and calls the Amazon Translate API TranslateDocument specifying the source document as the target for translation.
The second step of the workflow is another Lambda function that queries Amazon Bedrock using a pre-generated prompt with the translated source document included as the target. This prompt instructs Amazon Bedrock to perform a transcreation check on the document content. This validates that the intent, style, and tone of the document is maintained. The final version of the document is now saved in the output S3 bucket.
The last step of the workflow uses Amazon Simple Notification Service (Amazon SNS) to notify an SNS topic of the outcome of the workflow (success or failure). This will send an email to the subscribers to the topic.
The user downloads their translated document from the output S3 bucket. This can be done using the console, the AWS CLI, or third-party tools that allow them to navigate an S3 bucket as a file system.

This solution is available on GitHub and provides the AWS Cloud Development Kit (AWS CDK) code to deploy in your own AWS account. The AWS CDK is an open source software development framework for defining cloud infrastructure as code (IaC) and provisioning it through AWS CloudFormation. This provides an automated deployment process for your AWS account.

Prerequisites

For this walkthrough, you should have the following prerequisites:

An AWS account to deploy the solution to.
An AWS Identity and Access Management (IAM) role in the account, with sufficient permissions to create the necessary resources. If you have administrator access, no additional action is required.
The AWS CDK installed on your local machine, or an AWS Cloud9 environment.
Python 3.9 or later.
Docker.

Deployment steps

To deploy this solution into your own AWS account:

Open your code editor of choice and authenticate to your AWS account. Instructions for linking to Visual Studio code can be found in Authentication and access for the AWS Toolkit for Visual Studio Code.

Clone the solution from the GitHub repository:

git clone https://github.com/aws-samples/sample-document-standardization-with-bedrock-and-translate.git

Follow the deployment instructions in the repository README file.
After the stack is deployed, go to the S3 console. Navigate to the S3 bucket that was created — docstandardizationstack-inputbucket. Upload the word_template.docx file that’s included in the repository. English, French, and Spanish folders will automatically be created.

Navigate to the Amazon Simple Notification Service (Amazon SNS) console and create a subscription to the topic DocStandardizationStack-ResultTopic created by the stack. After it’s created, make sure that you confirm subscription to the topic before testing the workflow by choosing the confirm subscription link in the automated email you receive from SNS.

After you have subscribed to the topic, you can test the workflow.

Language translation

To test the workflow, upload a .docx file to the folder corresponding to the document’s original language. For example, if you’re uploading a document that was written in English, this document should be uploaded to the English folder. If you don’t have a .docx file available, you can use the tone_test.docx file that’s included in the repository.

The Step Functions state machine will start after your document is uploaded. Translated versions of your source input document will be added to the other folders that were created in step 5. In this example, we uploaded a document in English and the document was translated in both Spanish and French.

Transcreation process

The translated documents are then processed using Amazon Bedrock. Amazon Bedrock reviews the documents’ intent, style and tone for use in a business setting. You can customize the output tone and style by modifying the Amazon Bedrock prompt to match your specific requirements. The final documents are added to the output S3 bucket with a suffix of _corrected, and each document is added to the folder that corresponds to the document’s language. The output bucket has the same format as the input bucket, with a separate folder created for each language.

The prompt used to instruct the generative AI model for the transcreation task has been designed to produce consistent and valid adjustments. It includes specific instructions, covering both what type of changes are expected from the model and rules to define boundaries that control adjustments. You can adjust this prompt if required to change the outcome of the document processing workflow.

The final documents will have a suffix of _corrected.

When the documents have been processed, you will receive an SNS notification. You will be able to download the processed documents from the S3 bucket DocStandardizationStack-OutputBucket.

Clean up

To delete the deployed resources, run the command cdk destroy in your terminal, or use the CloudFormation console to delete the CloudFormation stack DocStandardizationStack.

Conclusion

In this post, we explored how to automate the translation of business documents using AWS AI and serverless technologies. Through this automated translation process, companies can improve communication, consistency, and alignment across their global operations, making sure that employees can access the information they need when they need it. As organizations continue to expand their global footprint, tools like this will become increasingly important for maintaining a cohesive and informed workforce, no matter where in the world they might be located. By embracing the capabilities of AWS, companies can focus on their core business objectives without creating additional IT infrastructure overhead.

Bonne traduction!

Feliz traducción!

Happy translating!

About the Authors

Nadhya Polanco is an Associate Solutions Architect at AWS based in Brussels, Belgium. In this role, she supports organizations looking to incorporate AI and Machine Learning into their workloads. In her free time, Nadhya enjoys indulging in her passion for coffee and exploring new destinations.

Steve Bell is a Senior Solutions Architect at AWS based in Amsterdam, Netherlands. He helps enterprise organizations navigate the complexities of migration, modernization and multicloud strategy. Outside of work he loves walking his labrador, Lily, and practicing his amateur BBQ skills.

Autonomous mortgage processing using Amazon Bedrock Data Automation and Amazon Bedrock Agents

May 1, 2025

by Wrick Talukdar Amazon AWS

Mortgage processing is a complex, document-heavy workflow that demands accuracy, efficiency, and compliance. Traditional mortgage operations rely on manual review, rule-based automation, and disparate systems, often leading to delays, errors, and a poor customer experience. Recent industry surveys indicate that only about half of borrowers express satisfaction with the mortgage process, with traditional banks trailing non-bank lenders in borrower satisfaction. This gap in satisfaction level is largely attributed to the manual, error-prone nature of traditional mortgage processing, where delays, inconsistencies, and fragmented workflows create frustration for borrowers and impact overall experience.

In this post, we introduce agentic automatic mortgage approval, a next-generation sample solution that uses autonomous AI agents powered by Amazon Bedrock Agents and Amazon Bedrock Data Automation. These agents orchestrate the entire mortgage approval process—intelligently verifying documents, assessing risk, and making data-driven decisions with minimal human intervention. By automating complex workflows, businesses can accelerate approvals, accelerate approvals, minimize errors, and provide consistency while enhancing scalability and compliance.

The following video shows this agentic automation in action—enabling smarter, faster, and more reliable mortgage processing at scale.

Why agentic IDP?

Agentic intelligent document processing (IDP) revolutionizes document workflows by driving efficiency and autonomy. It automates tasks with precision, enabling systems to extract, classify, and process information while identifying and correcting errors in real time.

Agentic IDP goes beyond simple extraction by grasping context and intent, adding deeper insights to documents that fuel smarter decision-making. Powered by Amazon Bedrock Data Automation, it adapts to changing document formats and data sources, further reducing manual work.

Built for speed and scale, agentic IDP processes high volumes of documents quickly, reducing delays and optimizing critical business operations. Seamlessly integrating with AI agents and enterprise systems, it automates complex workflows, cutting operational costs and freeing teams to focus on high-value strategic initiatives.

IDP in mortgage processing

Mortgage processing involves multiple steps, including loan origination, document verification, underwriting, and closing; with each step requiring significant manual effort. These steps are often disjointed, leading to slow processing times (weeks instead of minutes), high operational costs (manual document reviews), and an increased risk of human errors and fraud. Organizations face numerous technical challenges when manually managing document-intensive workflows, as depicted in the following diagram.

These challenges include:

Document overload – Mortgage applications require verification of extensive documentation, including tax records, income statements, property appraisals, and legal agreements. For example, a single mortgage application might require manual review and cross-validation of hundreds of pages of tax returns, pay stubs, bank statements, and legal documents, consuming significant time and resources.
Data entry errors – Manual processing introduces inconsistencies, inaccuracies, and missing information during data entry. Incorrect transcription of applicant income from W-2 forms or misinterpreting property appraisal data can lead to miscalculated loan eligibility, requiring costly corrections and rework.
Delays in decision-making – Backlogs resulting from manual review processes extend processing times and negatively affect borrower satisfaction. A lender manually reviewing income verification and credit documentation might take several weeks to work through their backlog, causing delays that result in lost opportunities or frustrated applicants who turn to competitors.
Regulatory compliance complexity – Evolving mortgage industry regulations introduce complexity into underwriting and verification procedures. Changes in lending regulations, such as new mandatory disclosures or updated income verification guidelines, can require extensive manual updates to processes, leading to increased processing times, higher operational costs, and elevated error rates from manual data entry.

These challenges underscore the need for automation to enhance efficiency, speed, and accuracy for both lenders and mortgage borrowers.

Solution: Agentic workflows in mortgage processing

The following solution is self-contained and the applicant only interacts with the mortgage applicant supervisor agent to upload documents and check or retrieve application status. The following diagram illustrates the workflow.

The workflow consists of the following steps:

Applicant uploads documents to apply for a mortgage.
The supervisor agent confirms receipt of documents. Applicant can view and retrieve application status.
The underwriter updates the status of the application and sends approval documents to applicant.

At the core of the agentic mortgage processing workflow is a supervisor agent that orchestrates the entire workflow, manages sub-agents, and makes final decisions. Amazon Bedrock Agents is a capability within Amazon Bedrock that lets developers create AI-powered assistants capable of understanding user requests and executing complex tasks. These agents can break down requests into logical steps, interact with external tools and data sources, and use AI models to reason and take actions. They maintain conversation context while securely connecting to various APIs and AWS services, making them ideal for tasks like customer service automation, data analysis, and business process automation.

The supervisor agent intelligently delegates tasks to specialized sub-agents while maintaining the right balance between automated processing and human supervision. By aggregating insights and data from various sub-agents, the supervisor agent applies established business rules and risk criteria to either automatically approve qualifying loans or flag complex cases for human review, improving both efficiency and accuracy in the mortgage underwriting process.

In the following sections, we explore the sub-agents in more detail.

Data extraction agent

The data extraction agent uses Amazon Bedrock Data Automation to extract critical insights from mortgage application packages, including pay stubs, W-2 forms, bank statements, and identity documents. Amazon Bedrock Data Automation is a generative AI-powered capability of Amazon Bedrock that streamlines the development of generative AI applications and automates workflows involving documents, images, audio, and videos. The data extraction agent helps make sure that the validation, compliance, and decision-making agent receives accurate and structured data, enabling efficient validation, regulatory compliance, and informed decision-making. The following diagram illustrates the workflow.

The extraction workflow is designed to automate the process of extracting data from application packages efficiently. The workflow includes the following steps:

The supervisor agent assigns the extraction task to the data extraction agent.
The data extraction agent invokes Amazon Bedrock Data Automation to parse and extract applicant details from the application packages.
The extracted application information is stored in the extracted documents Amazon Simple Storage Service (Amazon S3) bucket.
The Amazon Bedrock Data Automation invocation response is sent back to the extraction agent.

Validation agent

The validation agent cross-checks extracted data with external resources such as IRS tax records and credit reports, flagging discrepancies for review. It flags inconsistencies such as doctored PDFs, low credit score, and also calculates debt-to-income (DTI) ratio, loan-to-value (LTV) limit, and an employment stability check. The following diagram illustrates the workflow.

The process consists of the following steps:

The supervisor agent assigns the validation task to the validation agent.
The validation agent retrieves the applicant details stored in the extracted documents S3 bucket.
The applicant details are cross-checked against third-party resources, such as tax records and credit reports, to validate the applicant’s information.
The third-party validated details are used by the validation agent to generate a status.
The validation agent sends the validation status to the supervisor agent.

Compliance agent

The compliance agent verifies that the extracted and validated data adheres to regulatory requirements, reducing the risk of compliance violations. It validates against lending rules. For example, loans are approved only if the borrower’s DTI ratio is below 43%, making sure they can manage monthly payments, or applications with a credit score below 620 are declined, whereas higher scores qualify for better interest rates. The following diagram illustrates the compliance agent workflow.

The workflow includes the following steps:

The supervisor agent assigns the compliance validation task to the compliance agent.
The compliance agent retrieves the applicant details stored in the extracted documents S3 bucket.
The applicant details are validated against mortgage processing rules.
The compliance agent calculates the applicant’s DTI ratio, applying corporate policy and lending rules to the application.
The compliance agent uses the validated details to generate a status.
The compliance agent sends the compliance status to the supervisor agent.

Underwriting agent

The underwriting agent generates an underwriting document for the underwriter to review. The underwriting agent workflow streamlines the process of reviewing and finalizing underwriting documents, as shown in the following diagram.

The workflow consists of the following steps:

The supervisor agent assigns the underwriting task to the underwriting agent.
The underwriting agent verifies the information and creates a draft of the underwriting document.
The draft document is sent to an underwriter for review.
Updates from the underwriter are sent back to the underwriting agent.

RACI matrix

The collaboration between intelligent agents and human professionals is key to efficiency and accountability. To illustrate this, we’ve crafted a RACI (Responsible, Accountable, Consulted, and Informed) matrix that maps out how responsibilities might be shared between AI-driven agents and human roles, such as compliance officers and the underwriting officer. This mapping serves as a conceptual guide, offering a glimpse into how agentic automation can enhance human expertise, optimize workflows, and provide clear accountability. Real-world implementations will differ based on an organization’s unique structure and operational needs.

The matrix components are as follows:

R: Responsible (executes the work)
A: Accountable (owns approval authority and outcomes)
C: Consulted (provides input)
I: Informed (kept informed of progress/status)

End-to-end IDP automation architecture for mortgage processing

The following architecture diagram illustrates the AWS services powering the solution and outlines the end-to-end user journey, showcasing how each component interacts within the workflow.

In Steps 1 and 2, the process begins when a user accesses the web UI in their browser, with Amazon CloudFront maintaining low-latency content delivery worldwide. In Step 3, Amazon Cognito handles user authentication, and AWS WAF provides security against malicious threats. Steps 4 and 5 show authenticated users interacting with the web application to upload required documentation to Amazon S3. The uploaded documents in Amazon S3 trigger Amazon EventBridge, which initiates the Amazon Bedrock Data Automation workflow for document processing and information extraction.

In Step 6, AWS AppSync manages user interactions, enabling real-time communication with AWS Lambda and Amazon DynamoDB for data storage and retrieval. Steps 7, 8, and 9 demonstrate how the Amazon Bedrock multi-agent collaboration framework comes into play, where the supervisor agent orchestrates the workflow between specialized AI agents. The verification agent verifies uploaded documents, manages data collection, and uses action groups to compute DTI ratios and generate an application summary, which is stored in Amazon S3.

Step 10 shows how the validation agent (broker assistant) evaluates the application based on predefined business criteria and automatically generates a pre-approval letter, streamlining loan processing with minimal human intervention. Throughout the workflow in Step 11, Amazon CloudWatch provides comprehensive monitoring, logging, and real-time visibility into all system components, maintaining operational reliability and performance tracking.

This fully agentic and automated architecture enhances mortgage processing by improving efficiency, reducing errors, and accelerating approvals, ultimately delivering a faster, smarter, and more scalable lending experience.

Prerequisites

You need to have an AWS account and an AWS Identity and Access Management (IAM) role and user with permissions to create and manage the necessary resources and components for this solution. If you don’t have an AWS account, see How do I create and activate a new Amazon Web Services account?

Deploy the solution

To get started, clone the GitHub repository and follow the instructions in the README to deploy the solution using AWS CloudFormation. The deployment steps offer clear guidance on how to build and deploy the solution. After the solution is deployed, you can proceed with the following instructions:

After you provision all the stacks, navigate to the stack AutoLoanAPPwebsitewafstackXXXXX on the AWS CloudFormation console.
On the Outputs tab, locate the CloudFront endpoint for the application UI.

You can also get the endpoint using the AWS Command Line Interface (AWS CLI) and the following command:

 aws cloudformation describe-stacks 
--stack-name $(aws cloudformation list-stacks 
--stack-status-filter CREATE_COMPLETE UPDATE_COMPLETE | jq -r '.StackSummaries[] | select(.StackName | startswith("AutoLoanAPPwebsitewafstack")) | .StackName') 
--query 'Stacks[0].Outputs[?OutputKey==`configwebsitedistributiondomain`].OutputValue' 
--output text

Open the (https://<domain_name>.cloudfront.net) in a new browser.

You should see the application login page.

Create an Amazon Cognito user in the user pool to access the application.
Sign in using your Amazon Cognito email and password credentials to access the application.

Monitoring and troubleshooting

Consider the following best practices:

Monitor stack creation and update status using the AWS CloudFormation console or AWS CLI
Monitor Amazon Bedrock model invocation metrics using CloudWatch:
- InvokeModel requests and latency
- Throttling exceptions
- 4xx and 5xx errors
Check Amazon CloudTrail for API invocations and errors
Check CloudWatch for solution-specific errors and logs:

aws cloudformation describe-stacks —stack-name <stack-name>

Clean up

To avoid incurring additional costs after testing this solution, complete the following steps:

Delete the relevant stacks from the AWS CloudFormation console.
Verify the S3 buckets are empty before deleting them.

Conclusion

The sample automated loan application sample solution demonstrates how you can use Amazon Bedrock Agents and Amazon Bedrock Data Automation to transform mortgage loan processing workflows. Beyond mortgage processing, you can adapt this solution to streamline claims processing or address other complex document-processing scenarios. By using intelligent automation, this solution significantly reduces manual effort, shortens processing times, and accelerates decision-making. Automating these intricate workflows helps organizations achieve greater operational efficiency, maintain consistent compliance with evolving regulations, and deliver exceptional customer experiences.

The sample solution is provided as open source—use it as a starting point for your own solution, and help us make it better by contributing back fixes and features using GitHub pull requests. Browse to the GitHub repository to explore the code, click watch to be notified of new releases, and check the README for the latest documentation updates.

As next steps, we recommend assessing your current document processing workflows to identify areas suitable for automation using Amazon Bedrock Agents and Amazon Bedrock Data Automation.

For expert assistance, AWS Professional Services and other AWS Partners are here to help.

We’d love to hear from you. Let us know what you think in the comments section, or use the issues forum in the repository.

About the Authors

Wrick Talukdar is a Tech Lead – Generative AI Specialist focused on Intelligent Document Processing. He leads machine learning initiatives and projects across business domains, leveraging multimodal AI, generative models, computer vision, and natural language processing. He speaks at conferences such as AWS re:Invent, IEEE, Consumer Technology Society(CTSoc), YouTube webinars, and other industry conferences like CERAWEEK and ADIPEC. In his free time, he enjoys writing and birding photography.

Jady Liu is a Senior AI/ML Solutions Architect on the AWS GenAI Labs team based in Los Angeles, CA. With over a decade of experience in the technology sector, she has worked across diverse technologies and held multiple roles. Passionate about generative AI, she collaborates with major clients across industries to achieve their business goals by developing scalable, resilient, and cost-effective generative AI solutions on AWS. Outside of work, she enjoys traveling to explore wineries and distilleries.

Farshad Bidanjiri is a Solutions Architect focused on helping startups build scalable, cloud-native solutions. With over a decade of IT experience, he specializes in container orchestration and Kubernetes implementations. As a passionate advocate for generative AI, he helps emerging companies leverage cutting-edge AI technologies to drive innovation and growth.

Keith Mascarenhas leads worldwide GTM strategy for Generative AI at AWS, developing enterprise use cases and adoption frameworks for Amazon Bedrock. Prior to this, he drove AI/ML solutions and product growth at AWS, and held key roles in Business Development, Solution Consulting and Architecture across Analytics, CX and Information Security.

Jessie-Lee Fry is a Product and Go-to Market (GTM) Strategy executive specializing in Generative AI and Machine Learning, with over 15 years of global leadership experience in Strategy, Product, Customer success, Business Development, Business Transformation and Strategic Partnerships. Jessie has defined and delivered a broad range of products and cross-industry go- to-market strategies driving business growth, while maneuvering market complexities and C-Suite customer groups. In her current role, Jessie and her team focus on helping AWS customers adopt Amazon Bedrock at scale enterprise use cases and adoption frameworks, meeting customers where they are in their Generative AI Journey.

Raj Jayaraman is a Senior Generative AI Solutions Architect at AWS, bringing over a decade of experience in helping customers extract valuable insights from data. Specializing in AWS AI and generative AI solutions, Raj’s expertise lies in transforming business solutions through the strategic application of AWS’s AI capabilities, ensuring customers can harness the full potential of generative AI in their unique contexts. With a strong background in guiding customers across industries in adopting AWS Analytics and Business Intelligence services, Raj now focuses on assisting organizations in their generative AI journey—from initial demonstrations to proof of concepts and ultimately to production implementations.

Amazon Bedrock Model Distillation: Boost function calling accuracy while reducing cost and latency

May 1, 2025

by Yanyan Zhang Amazon AWS

Amazon Bedrock Model Distillation is generally available, and it addresses the fundamental challenge many organizations face when deploying generative AI: how to maintain high performance while reducing costs and latency. This technique transfers knowledge from larger, more capable foundation models (FMs) that act as teachers to smaller, more efficient models (students), creating specialized models that excel at specific tasks. In this post, we highlight the advanced data augmentation techniques and performance improvements in Amazon Bedrock Model Distillation with Meta’s Llama model family.

Agent function calling represents a critical capability for modern AI applications, allowing models to interact with external tools, databases, and APIs by accurately determining when and how to invoke specific functions. Although larger models typically excel at identifying the appropriate functions to call and constructing proper parameters, they come with higher costs and latency. Amazon Bedrock Model Distillation now enables smaller models to achieve comparable function calling accuracy while delivering substantially faster response times and lower operational costs.

The value proposition is compelling: organizations can deploy AI agents that maintain high accuracy in tool selection and parameter construction while benefiting from the reduced footprint and increased throughput of smaller models. This advancement makes sophisticated agent architectures more accessible and economically viable across a broader range of applications and scales of deployment.

Prerequisites

For a successful implementation of Amazon Bedrock Model Distillation, you’ll need to meet several requirements. We recommend referring to the Submit a model distillation job in Amazon Bedrock in the official AWS documentation for the most up-to-date and comprehensive information.

Key requirements include:

An active AWS account
Selected teacher and student models enabled in your account (verify on the Model access page of the Amazon Bedrock console)
An S3 bucket for storing input datasets and output artifacts
Appropriate IAM permissions:
Trust relationship allowing Amazon Bedrock to assume the role
Permissions to access S3 for input/output data and invocation logs
Permissions for model inference when using inference profiles

If you’re using historical invocation logs, confirm if model invocation logging is enabled in your Amazon Bedrock settings with S3 selected as the logging destination.

Preparing your data

Effective data preparation is crucial for successful distillation of agent function calling capabilities. Amazon Bedrock provides two primary methods for preparing your training data: uploading JSONL files to Amazon S3 or using historical invocation logs. Regardless of which method to choose, you’ll need to prepare proper formatting of tool specifications to enable successful agent function calling distillation.

Tool specification format requirements

For agent function calling distillation, Amazon Bedrock requires that tool specifications be provided as part of your training data. These specifications must be encoded as text within the system or user message of your input data. The example shown is using the Llama model family’s function calling format:

system: 'You are an expert in composing functions. You are given a question and a set of possible functions. Based on the question, you will need to make one or more function/tool calls to achieve the purpose.

Here is a list of functions in JSON format that you can invoke.
[
    {
        "name": "lookup_weather",
        "description": "Lookup weather to a specific location",
        "parameters": {
            "type": "dict",
            "required": [
                "city"
            ],
            "properties": {
                "location": {
                    "type": "string",
                },
                "date": {
                    "type": "string",
                }
            }
        }
    }
 ]'
 user: "What's the weather tomorrow?"

This approach lets the model learn how to interpret tool definitions and make appropriate function calls based on user queries. Afterwards, when calling inference on the distilled student model, we suggest keeping the prompt format consistent with the distillation input data. This provides optimal performance by maintaining the same structure the model was trained on.

Preparing data using Amazon S3 JSONL upload

When creating a JSONL file for distillation, each record must follow this structure:

{
    "schemaVersion": "bedrock-conversation-2024",
    "system": [
        {
            "text": 'You are an expert in composing functions. You are given a question and a set of possible functions. Based on the question, you will need to make one or more function/tool calls to achieve the purpose.
                    Here is a list of functions in JSON format that you can invoke.
                    [
                        {
                            "name": "lookup_weather",
                            "description": "Lookup weather to a specific location",
                            "parameters": {
                                "type": "dict",
                                "required": [
                                    "city"
                                ],
                                "properties": {
                                    "location": {
                                        "type": "string",
                                    },
                                    "date": {
                                        "type": "string",
                                    }
                                }
                            }
                        }
                    ]'
        }
    ],
    "messages": [
        {
            "role": "user",
            "content": [
                {
                    "text": "What's the weather tomorrow?"
                }
            ]
        },
        {
            "role": "assistant",
            "content": [
               {
                   "text": "[lookup_weather(location="san francisco", date="tomorrow")]"
               }
            ]
        }
    ]
}

Each record must include the schemaVersion field with the value bedrock-conversation-2024. The system field contains instructions for the model, including available tools. The messages field contains the conversation, with required user input and optional assistant responses.

Using historical invocation logs

Alternatively, you can use your historical model invocation logs on Amazon Bedrock for distillation. This approach uses actual production data from your application, capturing real-world function calling scenarios. To use this method:

Enable invocation logging in your Amazon Bedrock account settings, selecting S3 as your logging destination.
Add metadata to your model invocations using the requestMetadata field to categorize interactions. For example:
```
"requestMetadata": { 
   "project": "WeatherAgent", 
   "intent": "LocationQuery", 
   "priority": "High"
}
```
When creating your distillation job, specify filters to select relevant logs based on metadata:
```
"requestMetadataFilters": { 
    "equals": {"project": "WeatherAgent"} 
}
```

Using historical invocation logs means that you can distill knowledge from your production workloads, allowing the model to learn from real user interactions and function calls.

Model distillation enhancements

Although the basic process for creating a model distillation job remains similar to what we described in our previous blog post, Amazon Bedrock Model Distillation introduces several enhancements with general availability that improve the experience, capabilities, and transparency of the service.

Expanded model support

With general availability, we have expanded the model options available for distillation. In addition to the models supported during preview, customers can now use:

Nova Premier as a teacher model for Nova Pro/Lite/Micro models distillation
Anthropic Claude Sonnet 3.5 v2 as a teacher model for Claude Haiku distillation
Meta’s Llama 3.3 70B as teacher and 3.2 1B and 3B as student models for Meta model distillation

This broader selection allows customers to find the balance between performance and efficiency across different use cases. For the most current list of supported models, refer to the Amazon Bedrock documentation.

Advanced data synthesis technology

Amazon Bedrock applies proprietary data synthesis techniques during the distillation process for certain use cases. This science innovation automatically generates additional training examples that improve the student model’s ability to generate better response.

For agent function calling with Llama models specifically, the data augmentation methods help bridge the performance gap between teacher and student models compared to vanilla distillation (vanilla distillation means directly annotating input data with teacher response and run student training with supervised fine-tuning). This makes the student models’ performance much more comparable to the teacher after distillation while maintaining the cost and latency benefits of a smaller model.

Enhanced training visibility

Amazon Bedrock model distillation now provides better visibility into the training process through multiple enhancements:

Synthetic data transparency – Model distillation now provides samples of the synthetically generated training data used to enhance model performance. For most model families, up to 50 sample prompts are exported (up to 25 for Anthropic models), giving you insight into how your model was trained, which can help support internal compliance requirements.
Prompt insights reporting – A summarized report of prompts accepted for distillation is provided, along with detailed visibility into prompts that were rejected and the specific reason for rejection. This feedback mechanism helps you identify and fix problematic prompts to improve your distillation success rate.

These insights are stored in the output S3 bucket specified during job creation, giving you a clearer picture of the knowledge transfer process.

Improved job status reporting

Amazon Bedrock Model Distillation also offers enhanced training job status reporting to provide more detailed information about where your model distillation job stands in the process. Rather than brief status indicators such as “In Progress” or “Complete,” the system now provides more granular status updates, helping you better track the progress of the distillation job.

You can track these job status details in both the AWS Management Console and AWS SDK.

Performance improvements and benefits

Now that we’ve explored the feature enhancements in Amazon Bedrock Model Distillation, we examine the benefits these capabilities deliver, particularly for agent function calling use cases.

Evaluation metric

We use abstract syntax tree (AST) to evaluate the function calling performance. AST parses the generated function call and performs fine-grained evaluation on the correctness of the generated function name, parameter values, and data types with the following workflow:

Function matching – Checks if the predicted function name is consistent with one of the possible answers
Required parameter matching – Extracts the arguments from the AST and checks if each parameter can be found and exact matched in possible answers
Parameter type and value matching – Checks if the predicted parameter values and types are correct

The process is illustrated in following diagram from Gorilla: Large Language Model Connected with Massive APIs.

Experiment results

To evaluate model distillation in the function call use case, we used the BFCL v2 dataset and filtered it to specific domains (entertainment, in this case) to match a typical use case of model customization. We also split the data into training and test sets and performed distillation on the training data while we ran evaluations on the test set. Both the training set and the test set contained around 200 examples. We assessed the performance of several models, including the teacher model (Llama 405B), the base student model (Llama 3B), a vanilla distillation version where Llama 405B is distilled into Llama 3B without data augmentation, and an advanced distillation version enhanced with proprietary data augmentation techniques.

The evaluation focused on simple and multiple categories defined in the BFCL V2 dataset. As shown in the following chart, there is a performance variance between the teacher and the base student model across both categories. Vanilla distillation significantly improved the base student model’s performance. In the simple category, performance increased from 0.478 to 0.783, representing a 63.8% relative improvement. In the multiple category, the score rose from 0.586 to 0.742, which is a 26.6% relative improvement. On average, vanilla distillation led to a 45.2% improvement across the two categories.

Applying data augmentation techniques provided further gains beyond vanilla distillation. In the simple category, performance improved from 0.783 to 0.826, and in the multiple category, from 0.742 to 0.828. On average, this resulted in a 5.8% relative improvement across both categories, calculated as the mean of the relative gains in each. These results highlight the effectiveness of both distillation and augmentation strategies in enhancing student model performance for function call tasks.

We show the latency and output speed comparison for different models in the following figure. The data is gathered from Artificial Analysis, a website that provides independent analysis of AI models and providers, on April 4, 2025. We find that there is a clear trend on latency and generation speed as different size Llama models evaluated. Notably, the Llama 3.1 8B model offers the highest output speed, making it the most efficient in terms of responsiveness and throughput. Similarly, Llama 3.2 3B performs well with a slightly higher latency but still maintains a solid output speed. On the other hand, Llama 3.1 70B and Llama 3.1 405B exhibit much higher latencies with significantly lower output speeds, indicating a substantial performance cost at higher model sizes. Compared to Llama 3.1 405B, Llama 3.2 3B provides 72% latency reduction and 140% output speed improvement. These results suggest that smaller models might be more suitable for applications where speed and responsiveness are critical.

In addition, we report the comparison of cost per 1M tokens for different Llama models. As shown in the following figure, it’s evident that smaller models (Llama 3.2 3B and Llama 3.1 8B) are significantly more cost-effective. As the model size increases (Llama 3.1 70B and Llama 3.1 405B), the pricing scales steeply. This dramatic increase underscores the trade-off between model complexity and operational cost.

Real-world agent applications require LLM models that can achieve a good balance between accuracy, speed, and cost. This result shows that using a distilled model for agent applications helps developers receive the speed and cost of smaller models while getting similar accuracy as a larger teacher model.

Conclusion

Amazon Bedrock Model Distillation is now generally available, offering organizations a practical pathway for deploying capable agent experiences without compromising on performance or cost-efficiency. As our performance evaluation demonstrates, distilled models for function calling can achieve accuracy comparable to models many times their size while delivering significantly faster inference and lower operational costs. This capability enables scalable deployment of AI agents that can accurately interact with external tools and systems across enterprise applications.

Start using Amazon Bedrock Model Distillation today through the AWS Management Console or API to transform your generative AI applications, including agentic use cases, with the balance of accuracy, speed, and cost efficiency. For implementation examples, check out our code samples in the amazon-bedrock-samples GitHub repository.

Appendix

BFCL V2 simple category

Definition: The simple category consists of tasks where the user is provided with a single function documentation (that is, one JSON function definition), and the model is expected to generate exactly one function call that matches the user’s request. This is the most basic and commonly encountered scenario, focusing on whether the model can correctly interpret a straightforward user query and map it to the only available function, filling in the required parameters as needed.

# Example
{
    "id": "live_simple_0-0-0",
    "question": [
        [{
            "role": "user",
            "content": "Can you retrieve the details for the user with the ID 7890, who has black as their special request?"
        }]
    ],
    "function": [{
        "name": "get_user_info",
        "description": "Retrieve details for a specific user by their unique identifier.",
        "parameters": {
            "type": "dict",
            "required": ["user_id"],
            "properties": {
                "user_id": {
                    "type": "integer",
                    "description": "The unique identifier of the user. It is used to fetch the specific user details from the database."
                },
                "special": {
                    "type": "string",
                    "description": "Any special information or parameters that need to be considered while fetching user details.",
                    "default": "none"
                }
            }
        }
    }]
}

BFCL V2 multiple category

Definition: The multiple category presents the model with a user query and several (typically two to four) function documentations. The model must select the most appropriate function to call based on the user’s intent and context and then generate a single function call accordingly. This category evaluates the model’s ability to understand the user’s intent, distinguish between similar functions, and choose the best match from multiple options.

{
    "id": "live_multiple_3-2-0",
    "question": [
        [{
            "role": "user",
            "content": "Get weather of Ha Noi for me"
        }]
    ],
    "function": [{
        "name": "uber.ride",
        "description": "Finds a suitable Uber ride for the customer based on the starting location, the desired ride type, and the maximum wait time the customer is willing to accept.",
        "parameters": {
            "type": "dict",
            "required": ["loc", "type", "time"],
            "properties": {
                "loc": {
                    "type": "string",
                    "description": "The starting location for the Uber ride, in the format of 'Street Address, City, State', such as '123 Main St, Springfield, IL'."
                },
                "type": {
                    "type": "string",
                    "description": "The type of Uber ride the user is ordering.",
                    "enum": ["plus", "comfort", "black"]
                },
                "time": {
                    "type": "integer",
                    "description": "The maximum amount of time the customer is willing to wait for the ride, in minutes."
                }
            }
        }
    }, {
        "name": "api.weather",
        "description": "Retrieve current weather information for a specified location.",
        "parameters": {
            "type": "dict",
            "required": ["loc"],
            "properties": {
                "loc": {
                    "type": "string",
                    "description": "The location for which weather information is to be retrieved, in the format of 'City, Country' (e.g., 'Paris, France')."
                }
            }
        }
    }]
}

About the authors

Ishan Singh is a Generative AI Data Scientist at Amazon Web Services, where he helps customers build innovative and responsible generative AI solutions and products. With a strong background in AI/ML, Ishan specializes in building generative AI solutions that drive business value. Outside of work, he enjoys playing volleyball, exploring local bike trails, and spending time with his wife and dog, Beau.

Yijun Tian is an Applied Scientist II at AWS Agentic AI, where he focuses on advancing fundamental research and applications in Large Language Models, Agents, and Generative AI. Prior to joining AWS, he obtained his Ph.D. in Computer Science from the University of Notre Dame.

Yawei Wang is an Applied Scientist at AWS Agentic AI, working at the forefront of generative AI technologies to build next-generation AI products within AWS. He also collaborates with AWS business partners to identify and develop machine learning solutions that address real-world industry challenges.

David Yan is a Senior Research Engineer at AWS Agentic AI, leading efforts in Agent Customization and Optimization. Prior to that, he was in AWS Bedrock, leading model distillation effort to help customers optimize LLM latency, cost and accuracy. His research interest includes AI agent, planning and prediction and inference optimization. Before joining AWS, David worked on planning and behavior prediction for autonomous driving in Waymo. Before that, he worked on nature language understanding for knowledge graph at Google. David received a M.S. in Electrical Engineering from Stanford University and a B.S. in Physics from Peking University.

Panpan Xu is a Principal Applied Scientist at AWS Agentic AI, leading a team working on Agent Customization and Optimization. Prior to that, she lead a team in AWS Bedrock working on research and development of inference optimization techniques for foundation models, covering modeling level techniques such as model distillation and sparsification to hardware-aware optimization. Her past research interest covers a broad range of topics including model interpretability, graph neural network, human-in-the-loop AI and interactive data visualization. Prior to joining AWS, she was a lead research scientist at Bosch Research and obtained her PhD in computer science from Hong Kong University of Science and Technology.

Shreeya Sharma is a Senior Technical Product Manager at AWS, where she has been working on leveraging the power of generative AI to deliver innovative and customer-centric products. Shreeya holds a master’s degree from Duke University. Outside of work, she loves traveling, dancing, and singing.

Build public-facing generative AI applications using Amazon Q Business for anonymous users

April 30, 2025

by Vishnu Elangovan Amazon AWS

Amazon Q Business is a generative AI-powered assistant that answers question, provides summaries, generates content, and securely completes tasks based on enterprise data and information. It connects to company data sources, applications, and internal systems to provide relevant, contextual answers while maintaining organizational security and compliance standards.

Today, we’re excited to announce that Amazon Q Business now supports anonymous user access. With this new feature, you can now create Amazon Q Business applications with anonymous user mode, where user authentication is not required and content is publicly accessible. These anonymous user applications can be used in use cases such as public website Q&A, documentation portals, and customer self-service experiences.

This capability allows guest users to use Amazon Q Business generative AI capabilities to quickly find product information, get technical answers, navigate documentation, and troubleshoot issues. Your public-facing websites, documentation, and support portals can now deliver the same powerful AI-driven assistance that authenticated users receive, creating an experience that enriches the guest user journey across your digital environments.

With this launch, you can seamlessly integrate an anonymous Amazon Q Business application into your websites and web applications through two pathways: either by embedding the ready-to-use web experience into your websites using an iframe for quick deployment, or by using our Chat, ChatSync, and PutFeedback APIs to build completely customized interfaces within your own applications. For anonymous Amazon Q Business applications, we’ve implemented a simple consumption-based pricing model where you’re charged based on the number of Chat or ChatSync API operations your anonymous Amazon Q Business applications make.

In this post, we demonstrate how to build a public-facing generative AI application using Amazon Q Business for anonymous users.

Solution overview

In this solution, we walk you through creating an anonymous Amazon Q Business application using both the AWS Management Console and AWS Command Line Interface (AWS CLI). Our example demonstrates a practical scenario: helping website visitors find information on public-facing documentation websites.

We demonstrate how to test the implementation with sample queries through the built-in web experience URL. The resulting application can be customized and embedded directly into your websites (using the API or the iframe method), providing immediate value for your users.

Prerequisites

To follow along with this post, you will need the following:

An AWS account.
At least one Amazon Q Business Pro user that has admin permissions to set up and configure Amazon Q Business. For pricing information, see Amazon Q Business pricing.
AWS Identity and Access Management (IAM) permissions to create and manage IAM roles and policies.
Public content to index (documents, FAQs, knowledge base articles) that can be shared with unauthenticated users.
A supported data source to connect, such as an Amazon Simple Storage Service (Amazon S3) bucket containing your public documents.
The AWS CLI configured with appropriate permissions (if following the AWS CLI method).

Create an anonymous Amazon Q Business application using the console

In this section, we walk through the steps to implement the solution using the console.

Create an IAM role for the web experience

Before creating your Amazon Q Business application, you will need to set up an IAM role with the appropriate permissions:

On the IAM console, choose Roles in the navigation pane and choose Create role.
Choose AWS service as the trusted entity
Select Amazon Q Business from the service list.
Choose Next: Permissions.
Create a custom policy or attach the necessary read-only policies, and add permissions for anonymous access.

We strongly recommend that you use a restricted policy for the role, like the one shown in the following screenshot, which will be used to create the web experience for anonymous access application environments.

An example of a restricted role policy for calling the Chat API for anonymous access application environments would be arn:aws:qbusiness:<your-region>:<your-aws-account-id>:application/<your-application-id>.

Create an IAM role with a trust policy that allows the Amazon Q Business service principal to assume the role using AWS Security Token Service (AWS STS), specifically scoped to your application’s Amazon Resource Name (ARN) in the designated AWS Region.

Create an Amazon Q Business application

Now you’re ready to create your Amazon Q Business application:

On the Amazon Q Business console, choose Create application.
For Application name, enter a name (for example, SupportDocs-Assistant).
For User access, select Anonymous access for this application environment.
Select Web experience to create a managed web experience to access the Amazon Q Business application.

You will see a notice about consumption-based billing for anonymous Amazon Q Business applications. For more details on pricing, refer to Amazon Q Business pricing.

Leave the default service role option unless you have specific requirements.
For Encryption, use the default AWS managed key unless you need custom encryption.
For Web experience settings, you can use an existing IAM role from your account or authorize Amazon Q Business to generate a new role with appropriate permissions. For this post, we select Use an existing service role and choose the IAM role created earlier (QBusinessAnonymousWebRole).
Optionally, customize the web experience title and welcome message.
Review all your configuration options and choose Create to create the application.

You should see a confirmation that your anonymous access application has been created successfully.

You will find the necessary parameters and details of your Amazon Q Business application on the landing page displayed after successful creation like the following screenshot, which provides comprehensive information about your newly created Amazon Q Business application.

Add data sources

After you create your application, you need to add an index and data sources. To learn more, refer to Index. You will see a pop-up like the following indicating that anonymous access is enabled.

Complete the following steps:

From your application dashboard, choose Add index.
Name your index (for example, Supportdocs-External) and keep the default settings.
Choose Add an index.
After you create the index, you can add data sources to it.

For our example, we use the Amazon Q Business public documentation as our data source by adding the URL https://docs.aws.amazon.com/amazonq/latest/qbusiness-ug/what-is.html. The Web Crawler will automatically index the content from this documentation page, making it searchable through your anonymous Amazon Q Business application.

For more information about Web Crawler configuration options and best practices, refer to Connecting Web Crawler to Amazon Q Business.

From your index dashboard, choose Add data source.
Enter a name for your data source and optional description.
For Source, select Source URLs and enter the URLs of the public websites you want to index.
For Authentication, select No authentication.
Configure the sync run schedule and field mappings.
Choose Add data source.

Alternatively, you can add Amazon S3 as the data source:

From your index dashboard, choose Add data source.
Select Amazon S3 as the source.
Configure your S3 bucket settings (make sure the bucket has public access).
Complete the data source creation process.

You must only ingest publicly available data sources without access control lists (ACLs).

Generate an anonymous web experience URL

After your data sources are set up, complete the following steps:

From your application dashboard, choose your application.
In the Web experience settings section, choose Share one-time URL.

The anonymous web experience URL can be shared as a single-use link that must be redeemed and accessed within 5 minutes. After it’s activated, the Amazon Q Business session remains active with a configurable timeout ranging from 15–60 minutes. This enables you to experience the web interface and test its functionality before deploying or offering the anonymous application to guest users.

Test your anonymous Amazon Q Business application

To test the application, choose Preview web experience.

The following screenshot shows the welcome page for your anonymous Amazon Q Business application’s web interface. Let’s begin asking Amazon Q Business some questions about the Amazon Q index.

In the first query, we ask “What is Q index? How is it useful for ISV’s?” The following screenshot shows the response.

In the following query, we ask “How can Q index enrich generative AI experiences for ISVs?”

In our next query, we ask “How is Q index priced?”

Having successfully tested our anonymous Amazon Q Business application through the console, we will now explore how to create an equivalent application using the AWS CLI.

Create your anonymous application using the AWS CLI

Make sure that your AWS CLI is configured with permissions to create Amazon Q Business resources and IAM roles.

Create an IAM role for Amazon Q Business

First, create an IAM role that Amazon Q Business can assume to access necessary resources:

# Create trust policy document
cat > trust-policy.json << 'EOF'
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Service": "qbusiness.amazonaws.com"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}
EOF

# Create IAM role
aws iam create-role 
  --role-name QBusinessAnonymousAppRole 
  --assume-role-policy-document file://trust-policy.json

# Attach necessary permissions
aws iam attach-role-policy 
  --role-name QBusinessAnonymousAppRole

Create an anonymous Amazon Q Business application

Use the following code to create your application:

#bash
aws qbusiness create-application 
--display-name "PublicKnowledgeBase" 
--identity-type ANONYMOUS 
--role-arn "arn:aws:iam:: <ACCOUNT_ID>:role/QBusinessAnonymousAppRole" 
--description "This is the QBiz application for anonymous use-case"

Save the applicationId from the response:

#json

{
  "applicationId": "your-application-id",
  "applicationArn": "arn:aws:qbusiness:region:account-id:application/your-application-id"
}

Create a restrictive policy for anonymous access

We strongly recommend using the following restricted policy for the role that will be used to call the chat APIs for anonymous access application environments. This policy limits actions to only the necessary APIs and restricts access to only your specific application.

Create the IAM role with the following policy:

# Create restrictive policy document
cat > anonymous-access-policy.json << 'EOF'
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "QBusinessConversationPermission",
      "Effect": "Allow",
      "Action": [
        "qbusiness:Chat",
        "qbusiness:ChatSync",
        "qbusiness:PutFeedback"
      ],
      "Resource": "arn:aws:qbusiness:<REGION>:<ACCOUNT_ID>:application/<APPLICATION_ID>"
    }
  ]
}
EOF

# Attach the policy to the role
aws iam put-role-policy 
  --role-name QBusinessAnonymousAppRole 
  --policy-name QBusinessAnonymousAccessPolicy 
  --policy-document file://anonymous-access-policy.json

Create an index

Create an index for your content, then upload documents using the BatchPutDocument API. For step-by-step guidance, see Select Retriever.

Test your anonymous Amazon Q Business application

To demonstrate the chat functionality using the AWS CLI, we uploaded Amazon Q Business documentation in PDF format to our index and tested the application using the following sample queries.

The following is an example chat interaction using the IAM role credentials. We first ask “What is Amazon Q index?”

#1)
#bash
aws qbusiness chat-sync 
  --application-id <APPLICATION_ID> 
  --user-message "What is Amazon Q index?"

The following screenshot shows part of the output from the chat-sync API when executed with our anonymous Amazon Q Business application ID, as shown in the previous command.

Next, we ask “How can Q index enrich generative AI experiences for ISV’s?”

2)
#bash
aws qbusiness chat-sync 
  --application-id <APPLICATION_ID> 
  --user-message "How can Q index enrich generative AI experiences for ISV's?"

The following screenshot shows part of the output from the chat-sync API when executed with our anonymous Amazon Q Business application ID.

Create a web experience for the anonymous web application

Use the following code to create the web experience:

#bash
aws qbusiness create-web-experience 
  --application-id <APPLICATION_ID> 
  --display-name "PublicKnowledgeBaseExperience" 
  --role-arn "arn:aws:iam::<ACCOUNT_ID>:role/QBusinessAnonymousAppRole" 
  --description "Web interface for my anonymous Q Business application"

To generate an anonymous URL, use the following code:

#bash
aws qbusiness create-anonymous-web-experience-url 
  --application-id <APPLICATION_ID> 
  --web-experience-id <WEB_EXPERIENCE_ID>

You can use the web experience URL generated by the preceding command and embed it into your web applications using an iframe.

Considerations

Consider the following when using anonymous access in Amazon Q Business:

The following are the only chat APIs that support anonymous access application environments:
- Chat
- ChatSync
- PutFeedback
You should only ingest publicly available data sources without ACLs. Examples of public data sources include:
- Data from the Amazon Q Business Web Crawler
- Amazon S3 data without ACLs
Amazon Q Business applications with anonymous access are billed on a consumption-based pricing model.
Chat history is not available for anonymous application environments.
Anonymous users and authenticated users are not supported on the same application environments.
Plugins are not supported for anonymous application environments.
Amazon QuickSight integration is not supported for anonymous application

Environments.

Amazon Q Apps are not supported for anonymous application environments.
Attachments are not supported for anonymous application environments.
Admin controls and guardrails are read-only for anonymous application environments, except for blocked words.
Topic rules using users and groups are not supported for anonymous application

The remaining Amazon Q Business functionality and features remain unchanged.

Clean up

When you are done with the solution, clean up the resources you created.

Conclusion

In this post, we introduced Amazon Q Business anonymous user access mode and demonstrated how to create, configure, and test an anonymous Amazon Q Business application using both the console and AWS CLI. This exciting feature extends enterprise-grade Amazon Q Business generative AI capabilities to your anonymous audiences without requiring authentication, opening up new possibilities for enhancing customer experiences on public websites, documentation portals, and self-service knowledge bases. This feature is available through a consumption pricing model that charges based on actual Chat and Chatsync API usage and index storage costs still applicable.

By following the implementation steps outlined in this post, you can quickly set up an Amazon Q Business application tailored for your external users, secured with appropriate IAM policies, and ready to embed in your end-user-facing applications.

To learn more about this anonymous access feature, see the Amazon Q Business User Guide. For detailed guidance on embedding Amazon Q Business in your web applications, see Add a generative AI experience to your website or web application with Amazon Q embedded. If you’re interested in building completely custom UI experiences with the Amazon Q Business API, check out Customizing an Amazon Q Business web experience.

About the authors

Vishnu Elangovan is a Worldwide Generative AI Solution Architect with over seven years of experience in Applied AI/ML. He holds a master’s degree in Data Science and specializes in building scalable artificial intelligence solutions. He loves building and tinkering with scalable AI/ML solutions and considers himself a lifelong learner. Outside his professional pursuits, he enjoys traveling, participating in sports, and exploring new problems to solve.

Jean-Pierre Dodel is a Principal Product Manager for Amazon Q Business, responsible for delivering key strategic product capabilities including structured data support in Q Business, RAG. and overall product accuracy optimizations. He brings extensive AI/ML and Enterprise search experience to the team with over 7 years of product leadership at AWS.

FloQast builds an AI-powered accounting transformation solution with Anthropic’s Claude 3 on Amazon Bedrock

April 30, 2025

by Kartik Bhatnagar Amazon AWS

With the advent of generative AI solutions, a paradigm shift is underway across industries, driven by organizations embracing foundation models (FMs) to unlock unprecedented opportunities. Amazon Bedrock has emerged as the preferred choice for numerous customers seeking to innovate and launch generative AI applications, leading to an exponential surge in demand for model inference capabilities. Amazon Bedrock customers aim to scale their worldwide applications to accommodate a variety of use cases. One such customer is FloQast.

Since its founding in 2013, FloQast has had the privilege of working with over 2,800 organizations across various industries and regions, helping them streamline their accounting operations. From automated reconciliations to tools that manage the entire close process, FloQast has seen firsthand how organizations, big and small, struggle to keep pace with their accounting needs as they scale. FloQast’s software (created by accountants, for accountants) brings AI and automation innovation into everyday accounting workflows. You can reconcile bank statements against internal ledgers, get real-time visibility into financial operations, and much more.

In this post, we share how FloQast built an AI-powered accounting transaction solution using Anthropic’s Claude 3 on Amazon Bedrock.

Accounting operations: Complexity amplified at scale

At the heart of every successful organization—whether small startups or large corporations—lies a well-oiled financial and accounting operation. Accounting is more than just a back-office function; it’s the backbone of every business. From processing payroll to generating financial statements, accounting is a ubiquitous force that touches every facet of business operations.

Consider this: when you sign in to a software system, a log is recorded to make sure there’s an accurate record of activity—essential for accountability and security. Similarly, when an incident occurs in IT, the responding team must provide a precise, documented history for future reference and troubleshooting. The same principle applies to accounting: when a financial event takes place, whether it’s receiving a bill from a vendor or signing a contract with a customer, it must be logged. These logs, known in accounting as journal entries, provide a clear financial record.

Now imagine this process scaled across hundreds, or even thousands, of transactions happening simultaneously in a large organization. The complexity of accounting increases exponentially with growth and diversification. As businesses expand, they encounter a vast array of transactions that require meticulous documentation, categorization, and reconciliation. At scale, upholding the accuracy of each financial event and maintaining compliance becomes a monumental challenge. With advancement in AI technology, the time is right to address such complexities with large language models (LLMs).

Amazon Bedrock has helped democratize access to LLMs, which have been challenging to host and manage. Amazon Bedrock offers a choice of industry-leading FMs along with a broad set of capabilities to build generative AI applications, simplifying development with security, privacy, and responsible AI. Because Amazon Bedrock is serverless, you don’t have to manage infrastructure to securely integrate and deploy generative AI capabilities into your application, handle spiky traffic patterns, and enable new features like cross-Region inference, which helps provide scalability and reliability across AWS Regions.

In this post, we highlight how the AI-powered accounting transformation platform uses Amazon Bedrock. FloQast addresses the most complex and custom aspects of financial processes (the final 20%)—those intricate, bespoke aspects of accounting that are highly specific to each organization and often require manual intervention. FloQast’s AI-powered solution uses advanced machine learning (ML) and natural language commands, enabling accounting teams to automate reconciliation with high accuracy and minimal technical setup.

FloQast AI Transaction Matching

Seamlessly integrated with the existing FloQast suite, the AI Transaction Matching product streamlines and automates your matching and reconciliation processes, delivering unparalleled precision and efficiency.

It offers the following key features:

AI-driven matching – You can automatically match transactions across multiple data sources with high accuracy
Flexible rule creation – You can use natural language to create custom matching rules tailored to your unique processes
Exception handling – You can quickly identify and manage unmatched transactions or discrepancies
Audit trail – You can maintain a comprehensive audit trail of matching activities for compliance and transparency
High-volume processing – You can efficiently handle large volumes of transactions, suitable for businesses of all sizes
Multi-source integration – You can seamlessly integrate and match transactions from various financial systems and data sources

Let’s review how it works:

Transaction data is gathered from bank statements and enterprise resource planning (ERP) systems.
An accountant will select specific transactions in both systems and choose Generate AI Rule.

The following screenshot shows the general ledger system on the left and the bank statement on the right.

Based on the selected transactions, text is generated (see the following screenshot).

At this point, the accountant has the option to either accept the generated text or edit the text.
The accountant chooses Save and apply to generate a rule in coded format that is further used to find additional matches, helping the accountant automate transaction reconciliation.

FloQast AI Transaction Matching offers the following benefits:

Unified environment – It seamlessly integrates with your existing FloQast products for a single source of truth
AI-powered automation – It uses advanced ML to handle complex matching scenarios
User-friendly interface – It’s designed by accountants for how accountants work, providing ease of use and adoption
Real-time insights – You can gain immediate visibility into your transaction data across systems
Scalability – It can adapt as your transaction volumes grow and business evolves

FloQast AI Annotations

FloQast’s new AI Annotations feature empowers teams to seamlessly and automatically annotate and review sample documents, streamlining compliance and audit processes through advanced automation and ML.

It offers the following key features:

Automated document annotation – You can upload sample documents to automatically annotate key data points with attributes specified in your testing criteria, saving time on manual reviews
AI-powered analysis – You can use advanced AI and natural language models to analyze document text, highlighting relevant information according to predefined controls and testing attributes
Bulk annotation for efficiency – You can select multiple documents or testing controls for bulk annotation, reducing time spent on repetitive document processing
Structured storage and audit trail – You can maintain a structured record of each annotated document, capturing all extracted data, annotation responses, and status updates for streamlined compliance and audit trails
Intuitive error handling – Smart checks identify and notify users of processing errors, making sure each annotation is complete and accurate.

The following diagram illustrates the architecture using AWS services.

The workflow starts with user authentication and authorization (steps 1-3). After those steps are complete, the workflow consists of the following steps:

Users upload supporting documents that provide audit evidence into a secure Amazon Simple Storage Service (Amazon S3) bucket.
The input documents are encrypted by Amazon S3 when consumed by Amazon Textract.
Amazon Textract (encrypts data in transit and at rest) extracts the data from the documents.
When complete, raw data is stored into an encrypted S3 bucket.
Data sanitization workflow kicks off using AWS Step Functions consisting of AWS Lambda functions.
Sanitized extracted data is written into an encrypted MongoDB.
Amazon Textract is polled to update the job status and written into Mongo DB.
The user starts the annotation process.
Application logic consumes data from Mongo DB and provides it to Anthropic’s Claude 3.5 Sonnet on Amazon Bedrock.
The LLM runs the audit rules (shown in the following screenshot) against the extracted data and generates an annotation for each audit rule, including pass/fail details of the audit rule.
Annotation results are filtered using Amazon Bedrock Guardrails to enhance content safety and privacy in generative AI applications.

FloQast AI Annotations offers the following benefits:

Seamless integration with FloQast – This feature is integrated into the FloQast platform, providing access to annotation tools alongside your existing compliance and financial workflows
Enhanced efficiency with AI-driven workflows – FloQast’s annotation feature uses AI to reduce manual workload, helping teams focus on high-value tasks rather than repetitive document review
Scalable solution for high-volume document processing – Designed to handle substantial document volumes, FloQast AI Annotations adapts to the demands of growing teams and complex audit requirements
Real-time document processing insights – You can stay informed with live tracking of each annotation job, with built-in monitoring for smooth and efficient workflows

FloQast’s AI technology choices

FloQast selected Amazon Bedrock because of its unmatched versatility, feature sets, and the robust suite of scalable AI models from top-tier providers like Anthropic. Anthropic’s Claude 3.5 Sonnet provides the advanced reasoning and contextual understanding necessary for handling complex financial workflows. However, a key feature of Amazon Bedrock—Amazon Bedrock Agents—is a game changer for FloQast. Amazon Bedrock Agents enables generative AI applications to run multi-step tasks across company systems and data sources. To learn more, see How Amazon Bedrock Agents works.

Amazon Bedrock Agents provides an intelligent orchestration layer, allowing FloQast to automate accounting workflows efficiently. It has added significant value in the following areas:

Instruction handling and task automation – Amazon Bedrock Agents enables FloQast to submit natural language instructions that the AI interprets and executes autonomously.
Session and memory management session – Attributes and promptSessionAttributes are passed between sessions related to a single workflow, but most user requests can be singular to a session.
Code generation that demonstrates business understanding – Amazon Bedrock Agents offers valuable features through its secure code interpretation capabilities and flexible configuration options. Amazon Bedrock agents can be tailored to the correct persona and business context, while operating within a protected test environment. This allows accountants to submit natural language instructions and input data, which is then processed in a controlled manner that aligns with security best practices. When FloQast integrates with Amazon Bedrock Agents, accountants can submit custom requests, and the agent can generate and test code within an isolated secure environment, with appropriate technical oversight and guardrails in place. The combination of Amazon Bedrock Agents’ secure code interpretation features and FloQast’s deep knowledge of accounting practices enables financial teams to operate efficiently while maintaining proper controls.
Data integration and output handling – By using Amazon Bedrock Agents, information is passed from upstream integrated financial systems, allowing FloQast to automate data retrieval and transformation tasks.
Multi-step task orchestration – Amazon Bedrock agents are designed to handle multi-step tasks by orchestrating complex workflows. For example, after FloQast retrieves data from a financial system, that data is passed to the agent, which runs the necessary calculations, generates the output code, and presents the results for user approval—all in one automated process. This orchestration is especially useful in accounting, where multiple steps must be completed in the correct sequence to maintain compliance and accuracy.

The flexibility of Amazon Bedrock Agents to manage these tasks and integrate them seamlessly into existing workflows enables FloQast to achieve scale, reduce complexity, and implement automation required to cater to the evolving needs of FloQast’s customers.

Anthropic’s Claude 3.5 Sonnet on Amazon Bedrock provides the best results in FloQast’s evaluation of other models for the use case. FloQast doesn’t need to fine-tune the model as a model consumer, so they use Retrieval Augmented Generation (RAG) with few-shot classification on data collected on the user’s behalf, removing the overhead of fine-tuning an LLM. For this use case, this design mechanism produces a higher level of accuracy, a better security model that is understood by FloQast’s customers, and ease of use as a developer.

Conclusion

FloQast’s AI-powered accounting transformation solution has had a substantial impact on its users. By automating routine, time-consuming accounting processes, the solution has saved accounting teams countless hours, enabling them to shift away from manual spreadsheet work and focus on higher-value activities, such as reviewing financial outcomes, assessing business health, and making data-driven decisions. This solution has removed the tedium of data reconciliation, delivering measurable improvements, including a 38% reduction in reconciliation time, a 23% decrease in audit process duration and discrepancies, and a 44% improvement in workload management.

Learn more about the FloQast platform at FloQast.com. Contact evelyn.cantu@floqast.com for more information about the FloQast and AWS partnership.

About the authors

Kartik Bhatnagar is a data security-focused Solutions Architect at AWS, based in San Francisco, CA. He has experience working with startups and enterprises across the tech, fintech, healthcare, and media & entertainment industries, in roles including DevOps Engineer and Systems Architect. In his current role, he partners with AWS customers to design and implement scalable, secure, and cost-effective solutions on the AWS platform. Outside of work, he enjoys playing cricket and tennis, food hopping, and hiking.

Aidan Anderson is a dynamic technology leader with over a decade of experience in software engineering, security, and artificial intelligence. Currently serving as the Director of AI Engineering at FloQast, he is at the forefront of integrating AI and automation into accounting workflows, enhancing operational efficiency and accuracy for finance teams. Aidan’s portfolio spans leadership across security, product development, and platform engineering – where he’s consistently driven innovation, built high-performing teams, and delivered impactful solutions in fast-paced startup environments.

Insights in implementing production-ready solutions with generative AI

April 30, 2025

by Giorgio Pessot Amazon AWS

As generative AI revolutionizes industries, organizations are eager to harness its potential. However, the journey from production-ready solutions to full-scale implementation can present distinct operational and technical considerations. This post explores key insights and lessons learned from AWS customers in Europe, Middle East, and Africa (EMEA) who have successfully navigated this transition, providing a roadmap for others looking to follow suit.

Building a solid business case: Operational excellence drives customer experience

The foundation of successful generative AI implementations are business cases with clear value propositions that fit with organizational goals, for example, improving efficiency, cost savings, or revenue growth. Typical examples include enhancing customer experience, optimizing operations, maintaining compliance with legal standards, improving level of services, or increasing employee productivity.

Companies in EMEA have used AWS services to transform their operations and improve customer experience using generative AI, with their stories illustrating how a strong business case can lead to tangible results across various industry verticals.

Il Sole 24 Ore, Italy’s leading multimedia publishing group, partnered with AWS Professional Services to boost the efficiency of a historic service, L’Esperto Risponde, where users can ask fiscal questions and receive responses from a team of experts. Il Sole 24 Ore leveraged its vast internal knowledge with a Retrieval Augmented Generation (RAG) solution powered by AWS. This solution maintained over 90% accuracy in responses and reduced the time spent by experts in searching and processing information, empowering them to focus on more strategic tasks. Additionally, the company is continuously incorporating end-user feedback to keep the service tailored to customer needs. For more information, you can watch the AWS Summit Milan 2024 presentation.

Booking.com, one of the world’s leading digital travel services, is using AWS to power emerging generative AI technology at scale, creating personalized customer experiences while achieving greater scalability and efficiency in its operations. Booking.com uses Amazon SageMaker AI to provide highly personalized customer accommodation recommendations.

“One of the things we really like about AWS’s approach to generative AI is choice. We love open source, and we feel it will play an important role in the evolution of generative AI,”

– Rob Francis, Chief Technology Officer of Booking.com.

With AWS support, Booking.com is enhancing its generative AI capabilities and positioning itself for future growth in the travel and hospitality industry. For more details, you can watch Booking.com’s keynote at AWS re:Invent 2023, their presentation on generative AI from idea to production on AWS at AWS London Summit 2024, and read the case study on how Booking.com helps customers experience a new world of travel using AWS and generative AI.

ENGIE is a global power and utilities company, with 25 business units operating worldwide. ENGIE’s One Data team partnered with AWS Professional Services to develop an AI-powered chatbot that enables natural language conversation search within ENGIE’s Common Data Hub data lake, over 3 petabytes of data. The solution complements traditional keyword-based search by allowing users to discover datasets through simple conversational queries, making it easier to find relevant data among tens of thousands of datasets. This dual approach to data discovery has accelerated the development of data-driven products and enhanced data assets sharing across the organization.

These examples demonstrate how companies across various sectors have successfully used AWS generative AI capabilities to address specific business challenges.

Getting ahead of implementation challenges

Though essential, a solid business case is only the first step. As organizations move their generative AI initiatives forward, they often encounter new challenges related to making the solution scalable, reliable, and compliant. Let’s explore what it takes to successfully advance generative AI projects from the preproduction phase, making sure that the original value of the business case is then fully realized in real-world application.

Achieving scale, reliability, and compliance

Factors to consider in transitioning to full-scale production include scalability, data governance, privacy, consistent and responsible AI behaviors, security, integration with existing systems, monitoring, end-user feedback collection, and business impact measurement. As organizations in EMEA have discovered, success in this transition requires a holistic approach that goes beyond mere technological considerations. With a multitude of customer learnings, paired with AWS expertise, we can identify key strategies for implementation.

Production-ready infrastructure, applications, and processes in the cloud

With the increase in scope, number, and complexity of generative AI applications, organizations have an increased need to reduce undifferentiated effort and set a high-quality bar for production-ready applications. Standard development best practices and effective cloud operating models, like AWS Well-Architected and the AWS Cloud Adoption Framework for Artificial Intelligence, Machine Learning, and Generative AI, are key to enabling teams to spend most of their time on tasks with high business value, rather than on recurrent, manual operations. Such an approach should include established industry standards such as infrastructure as code (IaC), continuous integration and continuous delivery (CI/CD), monitoring and observability, logging and auditing, and solutions for scalability and high availability.

For instance, Iveco Group, a global automotive leader active in the Commercial and Specialty Vehicles, Powertrain, adopted a structured cloud-operating model, leveraging IaC via Terraform for consistent and repeatable deployments across environments. A DevOps environment, via CI/CD pipelines, allows for frequent updates and testing of generative AI models and applications, allowing the developers to focus on improving and expanding the solutions rather then spending time on manual operations. This also helps make sure that generative AI solutions are optimized for performance, security, and cost-efficiency. This integrated approach not only accelerates the path from pre-production to full-scale implementation, but also enables them to adapt quickly to new generative AI advancements, manage complex dependencies, and scale resources as needed, ultimately driving innovation and competitive advantage in the rapidly evolving field of generative AI. See the re:Invent 2024 session for more information.

Accor Group, a major hospitality company that developed a generative AI-powered booking application, showcased how, even when working with new technologies like generative AI, fundamental software development principles remain a must-have. They implemented a three-layered comprehensive testing strategy. First, unit tests verify that the prompts consistently generate acceptable responses from the chatbot, even upon prompt modifications. Second, integration tests verify the end-to-end flow of the REST API and the chatbot’s interaction with the large language model (LLM). The final step is functional testing with predefined scenarios for manual testing and validation. They also implemented feedback systems, essential for the improvement flywheel of customer-facing applications, in the form of in-app surveys, instant feedback options (thumbs-up or thumbs-down), and a dedicated feedback portal for detailed user input. Finally, to measure the effectiveness of the solution and its business impact, they established a system to track room bookings made through the generative AI application.

Danske Bank, a leading Nordic bank, transitioned from a container-based on-premises setup to Amazon Elastic Container Service (Amazon ECS) with AWS Fargate. This allowed them to quickly move their API-based backend services to a cloud-native environment. This decoupled architecture, designed to be provider-agnostic, set them up for flexibility in leveraging different cloud-based generative AI tools and services as needed. The integration with Amazon Bedrock was seamless and impactful, as it provided faster access to multiple foundational models from more providers. This allowed the customer to rapidly experiment, iterate, and evaluate different models for their specific use cases. This case demonstrates how the combination of generative AI services and a cloud-native, API-driven architecture allowed this customer to iterate faster, and keep the focus on business value rather than integration of technologies.

The Schaeffler Group has been driving forward groundbreaking inventions and developments in the field of motion technology for over 75 years. The company developed a comprehensive generative AI framework, which establishes enterprise-grade governance and security guardrails for generative AI use case roll-out at scale with infrastructure blueprints. A generative AI inference gateway is integrated within the solution, offering centralized access to numerous foundational models while tracking usage and costs. Going forward, Schaeffler envisions to further integrate these capabilities into their wider generative AI and data landscape, including more fine-grained access controls to data assets and the adoption of generative AI agents.

These examples highlight a key theme for organizations across industries: Success in generative AI goes beyond developing standalone applications. A thorough cloud-based operating model is crucial for enterprises looking to keep pace with the rapidly evolving technology, with minimal operational overhead.

Security, compliance, and responsible AI

As an organization’s generative AI applications expand to handle increasingly sensitive data, security, compliance, and governance must be prioritized accordingly. This includes implementing authentication and access control, encrypting data at rest and in transit, monitoring and auditing of system access and usage, maintaining compliance with regulations (such as GDPR and the recent EU AI Act), as well as establishing clear policies for data handling and model usage.

Here are some examples of customers who have successfully navigated these critical requirements.

Il Sole24 Ore implemented a code of self-discipline for ethical AI application. It prescribes retention of high-quality standards and the centrality of trustworthy data. The principles include regulatory compliance, maintaining data provenance and reliability, incorporating human oversight via human-in-the-loop, inclusivity and diversity in data usage and algorithm adoption, responsibility and accountability, and digital education and communicative transparency. By adhering to these principles, Il Sole 24 Ore Group demonstrates its commitment to leveraging innovative technologies like generative AI in a safe and responsible manner, particularly in sensitive areas such as providing expert legal and tax advice. This approach allows them to harness the benefits of AI while mitigating potential risks and maintaining the trust of their users.

For Accor Group, the implementation of their next-generation booking application required direct customer interaction, emphasizing the critical need for responsible AI practices. To make sure the chatbot would deliver effective customer service while operating within strict ethical boundaries, they established specific safeguards to minimize misuse:

Blocking responses to discriminatory queries
Withholding responses to illegal activities
Implementing guardrails to keep conversations within appropriate business context
Installing protections against role-switching or tone-changing attempts during conversations
Implementing robust technical defenses against prompt injections

Conclusion

The transition from preproduction to full-scale implementation for generative AI applications presents new challenges and opportunities. It requires identifying a solid business case, maintaining high standards for infrastructure and processes, strategic thinking in choosing an efficient cloud operating model, robust data governance, security, compliance, ethical AI practices, and more.

Organizations across EMEA have demonstrated how using AWS services can help overcome hurdles and accelerate the advantages of generative AI by embracing a holistic approach. By learning from these use cases, more enterprises can achieve successful deployments of generative AI solutions, and benefit from this transformative technology in a reliable, productive, and responsible manner.

Explore more generative AI use cases and customer succcess stories and discover how to accelerate your AI adoption on the cloud with specialized training and the support of AWS Professional Services and the Generative AI Innovation Center.

About the Authors

Dr. Giorgio Pessot is a Machine Learning Engineer at Amazon Web Services Professional Services. With a background in computational physics, he specializes in architecting enterprise-grade AI systems at the confluence of mathematical theory, DevOps, and cloud technologies, where technology and organizational processes converge to achieve business objectives. When he’s not whipping up cloud solutions, you’ll find Giorgio engineering culinary creations in his kitchen.

Daniel Zagyva is a Senior ML Engineer at AWS Professional Services. He specializes in developing scalable, production-grade machine learning solutions for AWS customers. His experience extends across different areas, including natural language processing, generative AI and machine learning operations.

Nicolò Cosimo Albanese is a Data Scientist and Machine Learning Engineer at Amazon Web Services Professional Services. With a Master of Science in Engineering and postgraduate degrees in Machine Learning and Biostatistics, he specializes in developing AI/ML solutions that drive business value for enterprise customers. His expertise lies at the intersection of statistical modeling, cloud technologies, and scalable machine learning systems.

Subhro Bose is a Data Architect in Emergent Technologies and Intelligence Platform in Amazon. He loves working on ways for emergent technologies such as AI/ML, big data, quantum, and more to help businesses across different industry verticals succeed within their innovation journey.

Diar Sabri is a Machine Learning Engineer at AWS who helps organizations transform their business through innovative AI solutions. With experience across multiple industries, he excels at bridging the gap between strategic vision and practical technology implementation, enabling customers to achieve meaningful business outcomes.

Aamna Najmi is a GenAI and Data Specialist at AWS. She assists customers across industries and regions in operationalizing and governing their generative AI systems at scale, ensuring they meet the highest standards of performance, safety, and ethical considerations, bringing a unique perspective of modern data strategies to complement the field of AI. In her spare time, she pursues her passion of experimenting with food and discovering new places.

Anwar Rizal is a Senior Machine Learning consultant for AWS Professional Services based in Paris. He works with AWS customers to develop data and AI solutions to sustainably grow their business.

Amer Elhabbash is a Senior Data & AI Delivery Consultant with AWS Professional Services. With over 25 years of international experience in IT spanning multiple fields and domains; Telecommunication, Software Engineering , Database, Data Analytics and AI. He helps AWS’ customers migrating their legacy data systems and building innovative cloud-native data-driven solutions.

Hassen Riahi is a Delivery Practice Manager Data & AI at AWS Professional Services. He holds a PhD in Mathematics & Computer Science on large-scale data management. He collaborates with AWS customers to build data-driven solutions.

Dr. Marco Guerriero leads Data and GenAI at AWS Professional Services for France and Europe South, holding a Ph.D. in Electrical and Computer Engineering from the University of Connecticut. His expertise spans machine learning, statistical inference, and mathematical optimization, with experience at organizations like NATO, GE, and ABB across defense, manufacturing, energy, and industrial automation sectors. With over 60 publications and five US patents to his name, Dr. Guerriero focuses on leveraging emerging technologies like GenAI and Quantum computing to drive business innovation across industries.

Sri Elaprolu is Director of the AWS Generative AI Innovation Center, where he leads a global team implementing cutting-edge AI solutions for enterprise and government organizations. During his 12-year tenure at AWS, he has led ML science teams partnering with organizations like the NFL, Cerner, and NASA. Prior to AWS, he spent 14 years at Northrop Grumman in product development and software engineering leadership roles. Sri holds a Master’s in Engineering Science and an MBA.

Dragica Boca is Managing Director of Professional Services EMEA at Amazon Web Services (AWS), leading enterprise cloud migration and generative AI transformation initiatives. With 30 years of technology consulting experience across Microsoft and IBM Global Business Services, she specializes in implementing production-ready AI solutions for Public Sector and Financial Services organizations. Dragica currently oversees large-scale GenAI implementations across EMEA, helping enterprises navigate the complexities of responsible AI deployment, scalable architecture, and sustainable adoption patterns.

Responsible AI in action: How Data Reply red teaming supports generative AI safety on AWS

April 29, 2025

by Cassandre Vandeputte Amazon AWS

Generative AI is rapidly reshaping industries worldwide, empowering businesses to deliver exceptional customer experiences, streamline processes, and push innovation at an unprecedented scale. However, amidst the excitement, critical questions around the responsible use and implementation of such powerful technology have started to emerge.

Although responsible AI has been a key focus for the industry over the past decade, the increasing complexity of generative AI models brings unique challenges. Risks such as hallucinations, controllability, intellectual property breaches, and unintended harmful behaviors are real concerns that must be addressed proactively.

To harness the full potential of generative AI while reducing these risks, it’s essential to adopt mitigation techniques and controls as an integral part of the build process. Red teaming, an adversarial exploit simulation of a system used to identify vulnerabilities that might be exploited by a bad actor, is a crucial component of this effort.

At Data Reply and AWS, we are committed to helping organizations embrace the transformative opportunities generative AI presents, while fostering the safe, responsible, and trustworthy development of AI systems.

In this post, we explore how AWS services can be seamlessly integrated with open source tools to help establish a robust red teaming mechanism within your organization. Specifically, we discuss Data Reply’s red teaming solution, a comprehensive blueprint to enhance AI safety and responsible AI practices.

Understanding generative AI’s security challenges

Generative AI systems, though transformative, introduce unique security challenges that require specialized approaches to address them. These challenges manifest in two key ways: through inherent model vulnerabilities and adversarial threats.

The inherent vulnerabilities of these models include their potential of producing hallucinated responses (generating plausible but false information), their risk of generating inappropriate or harmful content, and their potential for unintended disclosure of sensitive training data.

These potential vulnerabilities could be exploited by adversaries through various threat vectors. Bad actors might employ techniques such as prompt injection to trick models into bypassing safety controls, intentionally altering training data to compromise model behavior, or systematically probing models to extract sensitive information embedded in their training data. For both types of vulnerabilities, red teaming is a useful mechanism to mitigate those challenges because it can help identify and measure inherent vulnerabilities through systematic testing, while also simulating real-world adversarial exploits to uncover potential exploitation paths.

What is red teaming?

Red teaming is a methodology used to test and evaluate systems by simulating real-world adversarial conditions. In the context of generative AI, it involves rigorously stress-testing models to identify weaknesses, evaluate resilience, and mitigate risks. This practice helps develop AI systems that are functional, safe, and trustworthy. By adopting red teaming as part of the AI development lifecycle, organizations can anticipate threats, implement robust safeguards, and promote trust in their AI solutions.

Red teaming is critical for uncovering vulnerabilities before they are exploited. Data Reply has partnered with AWS to offer support and best practices to help integrate responsible AI and red teaming into your workflows, helping you build secure AI models. This unlocks the following benefits:

Mitigating unexpected risks – Generative AI systems can inadvertently produce harmful outputs, such as biased content or factually inaccurate information. With red teaming, Data Reply helps organizations test models for these weaknesses and identify vulnerabilities to adversarial exploitation, such as prompt injections or data poisoning.
Compliance with AI regulation – As global regulations around AI continue to evolve, red teaming can help organizations by setting up mechanisms to systematically test their applications and make them more resilient, or serve as a tool to adhere to transparency and accountability requirements. Additionally, it maintains detailed audit trails and documentation of testing activities, which are critical artifacts that can be used as evidence for demonstrating compliance with standards and responding to regulatory inquiries.
Reducing data leakage and malicious use – Although generative AI has the potential to be a force for good, models might also be exploited by adversaries looking to extract sensitive information or perform harmful actions. For instance, adversaries might craft prompts to extract private data from training sets or generate phishing emails and malicious code. Red teaming simulates such adversarial scenarios to identify vulnerabilities, enabling safeguards like prompt filtering, access controls, and output moderation.

The following chart outlines some of the common challenges in generative AI systems where red teaming can serve as a mitigation strategy.

Before diving into specific threats, it’s important to acknowledge the value of having a systematic approach to AI security risk assessment for organizations deploying AI solutions. As an example, the OWASP Top 10 for LLMs can serve as a comprehensive framework for identifying and addressing critical AI vulnerabilities. This industry-standard framework categorizes key threats, including prompt injection, where malicious inputs manipulate model outputs; training data poisoning, which can compromise model integrity; and unauthorized disclosure of sensitive information embedded in model responses. It also addresses emerging risks such as insecure output handling and denial of service (DOS) that could disrupt AI operations. By using such frameworks alongside practical security testing approaches like red teaming exercises, organizations can implement targeted controls and monitoring to make sure their AI models remain secure, resilient, and align with regulatory requirements and responsible AI principles.

How Data Reply uses AWS services for responsible AI

Fairness is an essential component of responsible AI and, as such, part of the AWS core dimensions of responsible AI. To address potential fairness concerns, it can be helpful to evaluate disparities and imbalances in training data or outcomes. Amazon SageMaker Clarify helps identify potential biases during data preparation without requiring code. For example, you can specify input features such as gender or age, and SageMaker Clarify will run an analysis job to detect imbalances in those features. It generates a detailed visual report with metrics and measurements of potential bias, helping organizations understand and address imbalances.

During red teaming, SageMaker Clarify plays a key role by analyzing whether the model’s predictions and outputs treat all demographic groups equitably. If imbalances are identified, tools like Amazon SageMaker Data Wrangler can rebalance datasets using methods such as random undersampling, random oversampling, or Synthetic Minority Oversampling Technique (SMOTE). This supports the model’s fair and inclusive operation, even under adversarial testing conditions.

Veracity and robustness represent another critical dimension for responsible AI deployments. Tools like Amazon Bedrock provide comprehensive evaluation capabilities that enable organizations to assess model security and robustness through automated evaluation. These include specialized tasks such as question-answering assessments with adversarial inputs designed to probe model limitations. For instance, Amazon Bedrock can help you test model behavior across edge case scenarios by analyzing responses to carefully crafted inputs—from ambiguous queries to potentially misleading prompts—to evaluate if the models maintain reliability and accuracy even under challenging conditions.

Privacy and security go hand in hand when implementing responsible AI. Security at Amazon is “job zero” for all employees. Our strong security culture is reinforced from the top down with deep executive engagement and commitment, and from the bottom up with training, mentoring, and strong “see something, say something” as well as “when in doubt, escalate” and “no blame” principles. As an example of this commitment, Amazon Bedrock Guardrails provide organizations with a tool to incorporate robust content filtering mechanisms and protective measures against sensitive information disclosure.

Transparency is another best practice prescribed by industry standards, frameworks, and regulations, and is essential for building user trust in making informed decisions. LangFuse, an open source tool, plays a key role in providing transparency by keeping an audit trail of model decisions. This audit trail offers a way to trace model actions, helping organizations demonstrate accountability and adhere to evolving regulations.

Solution overview

To achieve the goals mentioned in the previous section, Data Reply has developed the Red Teaming Playground, a testing environment that combines several open source tools—like Giskard, LangFuse, and AWS FMEval—to assess the vulnerabilities of AI models. This playground allows AI builders to explore scenarios, perform white hat hacking, and evaluate how models react under adversarial conditions. The following diagram illustrates the solution architecture.

This playground is designed to help you responsibly develop and evaluate your generative AI systems, combining a robust multi-layered approach for authentication, user interaction, model management, and evaluation.

At the outset, the Identity Management Layer handles secure authentication, using Amazon Cognito and integration with external identity providers to help secure authorized access. Post-authentication, users access the UI Layer, a gateway to the Red Teaming Playground built on AWS Amplify and React. This UI directs traffic through an Application Load Balancer (ALB), facilitating seamless user interactions and allowing red team members to explore, interact, and stress-test models in real time. For knowledge retrieval, we use Amazon Bedrock Knowledge Bases, which integrates with Amazon Simple Storage Service (Amazon S3) for document storage, and Amazon OpenSearch Serverless for rapid and scalable search capabilities.

Central to this solution is the Foundation Model Management Layer, responsible for defining model policies and managing their deployment, using Amazon Bedrock Guardrails for safety, Amazon SageMaker services for model evaluation, and a vendor model registry comprising a range of foundation model (FM) options, including other vendor models, supporting model flexibility.

After the models are deployed, they go through online and offline evaluations to validate robustness.

Online evaluation uses AWS AppSync for WebSocket streaming to assess models in real time under adversarial conditions. A dedicated red teaming squad (authorized white hat testers) conducts evaluations focused on OWASP Top 10 for LLMs vulnerabilities, such as prompt injection, model theft, and attempts to alter model behavior. Online evaluation provides an interactive environment where human testers can pivot and respond dynamically to model answers, increasing the chances of identifying vulnerabilities or successfully jailbreaking the model.

Offline evaluation conducts a deeper analysis through services like SageMaker Clarify to check for biases and Amazon Comprehend to detect harmful content. The memory database captures interaction data, such as historical user prompts and model responses. LangFuse plays a vital role in maintaining an audit trail of model activities, allowing each model decision to be tracked for observability, accountability, and compliance. The offline evaluation pipeline uses tools like Giskard to detect performance, bias, and security issues in AI systems. It employs LLM-as-a-judge, where a large language model (LLM) evaluates AI responses for correctness, relevance, and adherence to responsible AI guidelines. Models are tested through offline evaluations first; if successful, they progress through online evaluation and ultimately move into the model registry.

The Red Teaming Playground is a dynamic environment designed to simulate scenarios and rigorously test models for vulnerabilities. Through a dedicated UI, the red team interacts with the model using a Q&A AI assistant (for instance, a Streamlit application), enabling real-time stress testing and evaluation. Team members can provide detailed feedback on model performance and log any issues or vulnerabilities encountered. This feedback is systematically integrated into the red teaming process, fostering continuous improvements and enhancing the model’s robustness and security.

Use case example: Mental health triage AI assistant

Imagine deploying a mental health triage AI assistant—an application that demands extra caution around sensitive topics like dosage information, health records, or judgement call questions. By defining a clear use case and establishing quality expectations, you can guide the model on when to answer, deflect, or provide a safe response:

Answer – When the bot is confident that the question is within its domain and is able to retrieve a relevant response, it can provide a direct answer. For example, if asked “What are some common symptoms of anxiety?”, the bot can respond: “Common symptoms of anxiety include restlessness, fatigue, difficulty concentrating, and excessive worry. If you’re experiencing these, consider speaking to a healthcare professional.”
Deflect – For questions outside the bot’s scope or purpose, the bot should deflect responsibility and guide the user toward appropriate human support. For instance, if asked “Why does life feel meaningless?”, the bot might reply: “It sounds like you’re going through a tough time. Would you like me to connect you to someone who can help?” This makes sure sensitive topics are handled carefully and responsibly.
Safe response – When the question requires human validation or advice that the bot can’t provide, it should offer generalized, neutral suggestions to minimize risks. For example, in response to “How can I stop feeling anxious all the time?”, the bot might say: “Some people find practices like meditation, exercise, or journaling helpful, but I recommend consulting a healthcare provider for advice tailored to your needs.”

Red teaming results help refine model outputs by identifying risks and vulnerabilities. For example, consider a medical AI assistant developed by the fictional company AnyComp. By subjecting this assistant to a red teaming exercise, AnyComp can detect potential risks, such as the assistant generating unsolicited medical advice before deployment. With this insight, AnyComp can refine the assistant to either deflect such queries or provide a safe, appropriate response.

This structured approach—answer, deflect, and safe response—provides a comprehensive strategy for managing various types of questions and scenarios effectively. By clearly defining how to handle each category, you can make sure the AI assistant fulfills its purpose while maintaining safety and reliability. Red teaming further validates these strategies by rigorously testing interactions, making sure that the assistant remains useful and trustworthy in different situations.

Conclusion

Implementing responsible AI policies involves continuous improvement. Scaling solutions, like integrating SageMaker for model lifecycle monitoring or AWS CloudFormation for controlled deployments, helps organizations maintain robust AI governance as they grow.

Integrating responsible AI through red teaming is a crucial step to assess that generative AI systems operate responsibly, securely, and remain compliant. Data Reply collaborates with AWS to industrialize these efforts, from fairness checks to security stress tests, helping organizations stay ahead of emerging threats and evolving standards.

Data Reply has extensive expertise in helping customers adopt generative AI, especially with their GenAI Factory framework, which simplifies the transition from proof of concept to production, benefiting industries such as maintenance and customer service FAQs. The GenAI Factory initiative by Data Reply France is designed to overcome integration challenges and scale generative AI applications effectively, using AWS managed services like Amazon Bedrock and OpenSearch Serverless.

To learn more about Data Reply’s work, check out their specialized offerings for red teaming in generative AI and LLMOps.

About the authors

Cassandre Vandeputte is a Solutions Architect for AWS Public Sector based in Brussels. Since her first steps into the digital world, she has been passionate about harnessing technology to drive positive societal change. Beyond her work with intergovernmental organizations, she drives responsible AI practices across AWS EMEA customers.

Davide Gallitelli is a Senior Specialist Solutions Architect for AI/ML in the EMEA region. He is based in Brussels and works closely with customers throughout Benelux. He has been a developer since he was very young, starting to code at the age of 7. He started learning AI/ML at university, and has fallen in love with it since then.

Amine Aitelharraj is a seasoned cloud leader and ex-AWS Senior Consultant with over a decade of experience driving large-scale cloud, data, and AI transformations. Currently a Principal AWS Consultant and AWS Ambassador, he combines deep technical expertise with strategic leadership to deliver scalable, secure, and cost-efficient cloud solutions across sectors. Amine is passionate about GenAI, serverless architectures, and helping organizations unlock business value through modern data platforms.

Direct preference optimization

Recommended use cases for fine-tuning

Prerequisites

Key multimodal datasets and experiment setup

Best practices for data preparation

Configuring fine-tuning parameters

Model size selection and performance comparison

Conclusion

About the authors

Understanding MCP

When to use MCP instead of implementing microservices or APIs

FastMCP vs. FastAPI

Solution overview

Use SageMaker AI with FastMCP for rapid prototyping

Implement a loan underwriter MCP workflow with LangGraph and SageMaker AI with FastAPI for custom routing

Tracing with the LangSmith UI

Conclusion

About the Authors

Solution overview

Prerequisites

Deployment steps

Language translation

Transcreation process

Clean up

Conclusion

Further reading

About the Authors

Why agentic IDP?

IDP in mortgage processing

Solution: Agentic workflows in mortgage processing

Data extraction agent

Validation agent

Compliance agent

Underwriting agent

RACI matrix

End-to-end IDP automation architecture for mortgage processing

Prerequisites

Deploy the solution

Monitoring and troubleshooting

Clean up

Conclusion

About the Authors

Prerequisites

Preparing your data

Tool specification format requirements

Preparing data using Amazon S3 JSONL upload

Using historical invocation logs

Model distillation enhancements

Expanded model support

Advanced data synthesis technology

Enhanced training visibility

Improved job status reporting

Performance improvements and benefits

Evaluation metric

Experiment results

Conclusion

Appendix

About the authors

Solution overview

Prerequisites

Create an anonymous Amazon Q Business application using the console

Create an IAM role for the web experience

Create an Amazon Q Business application

Add data sources

Generate an anonymous web experience URL

Test your anonymous Amazon Q Business application

Create your anonymous application using the AWS CLI

Create an IAM role for Amazon Q Business

Create an anonymous Amazon Q Business application

Create a restrictive policy for anonymous access

Create an index

Test your anonymous Amazon Q Business application

Create a web experience for the anonymous web application

Considerations

Clean up

Conclusion

About the authors

Accounting operations: Complexity amplified at scale

FloQast AI Transaction Matching

FloQast AI Annotations