AWS and DXC collaborate to deliver customizable, near real-time voice-to-voice translation capabilities for Amazon Connect

Providing effective multilingual customer support in global businesses presents significant operational challenges. Through collaboration between AWS and DXC Technology, we’ve developed a scalable voice-to-voice (V2V) translation prototype that transforms how contact centers handle multi-lingual customer interactions.

In this post, we discuss how AWS and DXC used Amazon Connect and other AWS AI services to deliver near real-time V2V translation capabilities.

Challenge: Serving customers in multiple languages

In Q3 2024, DXC Technology approached AWS with a critical business challenge: their global contact centers needed to serve customers in multiple languages without the exponential cost of hiring language-specific agents for lower-volume languages. DXC had previously explored several existing alternatives but found limitations in each approach – from communication constraints to infrastructure requirements that impacted reliability, scalability, and operational costs. DXC and AWS decided to organize a focused hackathon where DXC and AWS Solutions Architects collaborated to:

  • Define essential requirements for real-time translation
  • Establish latency and accuracy benchmarks
  • Create seamless integration paths with existing systems
  • Develop a phased implementation strategy
  • Prepare and test an initial proof of concept setup

Business impact

For DXC, this prototype served as an enabler, allowing it to maximize technical talent, transform operations, and improve costs through:

  • Best technical expertise delivery – Hiring and matching agents based on technical knowledge rather than spoken language, making sure customers get top technical support regardless of language barriers
  • Global operational flexibility – Removing geographical and language constraints in hiring, placement, and support delivery while maintaining consistent service quality across all languages
  • Cost reduction – Eliminating multi-language expertise premiums, specialized language training, and infrastructure costs through a pay-per-use translation model
  • Similar experience to native speakers – Maintaining natural conversation flow with near real-time translation and audio feedback, while delivering premium technical support in the customer’s preferred language

Solution overview

The Amazon Connect V2V translation prototype uses AWS advanced speech recognition and machine translation technologies to enable real-time conversation translation between agents and customers, allowing them to speak in their preferred languages while having natural conversations. It consists of the following key components:

  • Speech recognition – The customer’s spoken language is captured and converted into text using Amazon Transcribe, which serves as the speech recognition engine. The transcript (text) is then fed into the machine translation engine.
  • Machine translation – Amazon Translate, the machine translation engine, translates the customer’s transcript into the agent’s preferred language in near real time. The translated transcript is converted back into speech using Amazon Polly, which serves as the text-to-speech engine.
  • Bidirectional translation – The process is reversed for the agent’s response, translating their speech into the customer’s language and delivering the translated audio to the customer.
  • Seamless integration – The V2V translation sample project integrates with Amazon Connect, enabling agents to handle customer interactions in multiple languages without any additional effort or training, using the Amazon Connect Streams JS and Amazon Connect RTC JS libraries.

The prototype can be extended with other AWS AI services to further customize the translation capabilities. It’s open source and ready for customization to meet your specific needs.
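For illustration, the following minimal Python (boto3) sketch shows the translate-then-synthesize leg of this pipeline, assuming the customer’s speech has already been transcribed to text. The language codes and voice ID are placeholder assumptions, and the actual prototype performs these steps in the browser through the Amazon Connect Streams JS and RTC JS libraries, so treat this as a conceptual sketch rather than the project’s implementation.

# Conceptual sketch only: translate a transcript and synthesize the translated
# text for playback. Language codes and the voice ID are illustrative.
import boto3

translate = boto3.client("translate")
polly = boto3.client("polly")

def translate_and_synthesize(transcript, source_lang="es", target_lang="en", voice_id="Joanna"):
    # Machine translation step (Amazon Translate)
    translated_text = translate.translate_text(
        Text=transcript,
        SourceLanguageCode=source_lang,
        TargetLanguageCode=target_lang,
    )["TranslatedText"]

    # Text-to-speech step (Amazon Polly)
    speech = polly.synthesize_speech(
        Text=translated_text,
        OutputFormat="mp3",
        VoiceId=voice_id,
        Engine="neural",
    )
    return translated_text, speech["AudioStream"].read()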

The following diagram illustrates the solution architecture.

The following screenshot illustrates a sample agent web application.

The user interface consists of three sections:

  • Contact Control Panel – A softphone client using Amazon Connect
  • Customer Controls – Customer-to-agent interaction controls, including Transcribe Customer Voice, Translate Customer Voice, and Synthesize Customer Voice
  • Agent Controls – Agent-to-customer interaction controls, including Transcribe Agent Voice, Translate Agent Voice, and Synthesize Agent Voice

Challenges when implementing near real-time voice translation

The Amazon Connect V2V sample project was designed to minimize the audio processing time from the moment the customer or agent finishes speaking until the translated audio stream starts. However, even with the shortest audio processing time, the user experience still doesn’t match a real conversation in which both parties speak the same language. This is due to the specific pattern of the customer only hearing the agent’s translated speech, and the agent only hearing the customer’s translated speech. The following diagram displays that pattern.

The example workflow consists of the following steps:

  1. The customer starts speaking in their own language, and speaks for 10 seconds.
  2. Because the agent only hears the customer’s translated speech, the agent first hears 10 seconds of silence.
  3. When the customer finishes speaking, the audio processing time takes 1–2 seconds, during which time both the customer and agent hear silence.
  4. The customer’s translated speech is streamed to the agent. During that time, the customer hears silence.
  5. When the customer’s translated speech playback is complete, the agent starts speaking, and speaks for 10 seconds.
  6. Because the customer only hears the agent’s translated speech, the customer hears 10 seconds of silence.
  7. When the agent finishes speaking, the audio processing time takes 1–2 seconds, during which time both the customer and agent hear silence.
  8. The agent’s translated speech is streamed to the customer. During that time, the agent hears silence.

In this scenario, the customer hears a single block of 22–24 seconds of complete silence, from the moment they finish speaking until they hear the agent’s translated voice. This creates a suboptimal experience, because the customer might not be certain what is happening during these 22–24 seconds—for instance, if the agent was able to hear them, or if there was a technical issue.

Audio streaming add-ons

In a face-to-face conversation between two people who don’t speak the same language, a third person can act as a translator or interpreter. An example workflow consists of the following steps:

  1. Person A speaks in their own language, which is heard by Person B and the translator.
  2. The translator translates what Person A said to Person B’s language. The translation is heard by Person B and Person A.

Essentially, Person A and Person B hear each other speaking their own language, and they also hear the translation (from the translator). There’s no waiting in silence, which is even more important in non-face-to-face conversations (such as contact center interactions).

To optimize the customer/agent experience, the Amazon Connect V2V sample project implements audio streaming add-ons to simulate a more natural conversation experience. The following diagram illustrates an example workflow.

The workflow consists of the following steps:

  1. The customer starts speaking in their own language, and speaks for 10 seconds.
  2. The agent hears the customer’s original voice, at a lower volume (“Stream Customer Mic to Agent” enabled).
  3. When the customer finishes speaking, the audio processing time takes 1–2 seconds. During that time, the customer and agent hear subtle audio feedback—contact center background noise—at a very low volume (“Audio Feedback” enabled).
  4. The customer’s translated speech is then streamed to the agent. During that time, the customer hears their translated speech, at a lower volume (“Stream Customer Translation to Customer” enabled).
  5. When the customer’s translated speech playback is complete, the agent starts speaking, and speaks for 10 seconds.
  6. The customer hears the agent’s original voice, at a lower volume (“Stream Agent Mic to Customer” enabled).
  7. When the agent finishes speaking, the audio processing time takes 1–2 seconds. During that time, the customer and agent hear subtle audio feedback—contact center background noise—at a very low volume (“Audio Feedback” enabled).
  8. The agent’s translated speech is then streamed to the customer. During that time, the agent also hears their own translated speech, at a lower volume (“Stream Agent Translation to Agent” enabled).

In this scenario, the customer hears two short blocks (1–2 seconds) of subtle audio feedback, instead of a single block of 22–24 seconds of complete silence. This pattern is much closer to a face-to-face conversation that includes a translator.

The audio streaming add-ons provide additional benefits, including:

  • Voice characteristics – When the agent and customer only hear their translated and synthesized speech, the actual voice characteristics are lost. For instance, the agent can’t hear whether the customer was talking slowly or quickly, or whether the customer was upset or calm. The translated and synthesized speech doesn’t carry over that information.
  • Quality assurance – When call recording is enabled, only the customer’s original voice and the agent’s synthesized speech are recorded, because the translation and the synthesis are done on the agent (client) side. This makes it difficult for QA teams to properly evaluate and audit the conversations, including the many silent blocks within them. Instead, when the audio streaming add-ons are enabled, there are no silent blocks, and the QA team can hear the agent’s original voice, the customer’s original voice, and their respective translated and synthesized speech, all in a single audio file.
  • Transcription and translation accuracy – Having both the original and translated speech available in the call recording makes it straightforward to detect specific words that would improve transcription accuracy (by using Amazon Transcribe custom vocabularies) or translation accuracy (using Amazon Translate custom terminologies), to make sure that your brand names, character names, model names, and other unique content are transcribed and translated to the desired result, as illustrated in the sketch after this list.
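As an illustration of that last point, the following sketch shows how a custom vocabulary and a custom terminology could be referenced when transcribing and translating a recorded call. The job name, bucket path, vocabulary name, and terminology name are hypothetical placeholders; the streaming APIs used by the prototype accept equivalent settings.

# Hypothetical resource names, used purely for illustration.
import boto3

transcribe = boto3.client("transcribe")
translate = boto3.client("translate")

# Reference a custom vocabulary so brand or model names are transcribed correctly
transcribe.start_transcription_job(
    TranscriptionJobName="v2v-accuracy-check",                # hypothetical job name
    Media={"MediaFileUri": "s3://my-bucket/call-audio.wav"},   # hypothetical location
    LanguageCode="en-US",
    Settings={"VocabularyName": "my-brand-vocabulary"},        # hypothetical vocabulary
)

# Reference a custom terminology so those same terms translate as intended
result = translate.translate_text(
    Text="The AnyCompany X200 router keeps rebooting.",
    SourceLanguageCode="en",
    TargetLanguageCode="es",
    TerminologyNames=["my-brand-terminology"],                 # hypothetical terminology
)
print(result["TranslatedText"])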

Get started with Amazon Connect V2V

Ready to transform your contact center’s communication? Our Amazon Connect V2V sample project is now available on GitHub. We invite you to explore, deploy, and experiment with this powerful prototype. You can use it as a foundation for developing innovative multi-lingual communication solutions in your own contact center, through the following key steps:

  1. Clone the GitHub repository.
  2. Test different configurations for audio streaming add-ons.
  3. Review the sample project’s limitations in the README.
  4. Develop your implementation strategy:
    1. Implement robust security and compliance controls that meet your organization’s standards.
    2. Collaborate with your customer experience team to define your specific use case requirements.
    3. Balance automation with the agent’s manual controls (for example, use an Amazon Connect contact flow to automatically set contact attributes for preferred languages and audio streaming add-ons), as illustrated in the sketch after these steps.
    4. Use your preferred transcription, translation, and text-to-speech engines, based on specific language support requirements and business, legal, and regional preferences.
    5. Plan a phased rollout, starting with a pilot group, then iteratively optimize your transcription custom vocabularies and translation custom terminologies.
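As a sketch of the contact flow automation mentioned above, the following code reads hypothetical contact attributes (preferredLanguage and audioFeedback) that a contact flow could set, so the agent web application can configure the translation add-ons automatically. The attribute names are assumptions rather than part of the sample project, which reads attributes client-side through Amazon Connect Streams JS.

# Sketch with hypothetical attribute names; the prototype itself reads contact
# attributes in the browser via Amazon Connect Streams JS.
import boto3

connect = boto3.client("connect")

def get_translation_settings(instance_id, initial_contact_id):
    # Retrieve attributes that a contact flow may have set for this contact
    attributes = connect.get_contact_attributes(
        InstanceId=instance_id,
        InitialContactId=initial_contact_id,
    )["Attributes"]

    return {
        "customer_language": attributes.get("preferredLanguage", "en-US"),  # hypothetical attribute
        "audio_feedback": attributes.get("audioFeedback", "enabled"),       # hypothetical attribute
    }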

Conclusion

The Amazon Connect V2V sample project demonstrates how Amazon Connect and advanced AWS AI services can break down language barriers, enhance operational flexibility, and reduce support costs. Get started now and revolutionize how your contact center communicates across language barriers!


About the Authors

Milos Cosic is a Principal Solutions Architect at AWS.

EJ Ferrell is a Senior Solutions Architect at AWS.

Adam El Tanbouli is a Technical Program Manager for Prototyping and Support Services at DXC Modern Workplace.

Orchestrate an intelligent document processing workflow using tools in Amazon Bedrock

Generative AI is revolutionizing enterprise automation, enabling AI systems to understand context, make decisions, and act independently. Generative AI foundation models (FMs), with their ability to understand context and make decisions, are becoming powerful partners in solving sophisticated business problems. At AWS, we’re using the power of models in Amazon Bedrock to drive automation of complex processes that have traditionally been challenging to streamline.

In this post, we focus on one such complex workflow: document processing. This serves as an example of how generative AI can streamline operations that involve diverse data types and formats.

Challenges with document processing

Document processing often involves handling three main categories of documents:

  • Structured – For example, forms with fixed fields
  • Semi-structured – Documents that have a predictable set of information but might vary in layout or presentation
  • Unstructured – For example, paragraphs of text or notes

Traditionally, processing these varied document types has been a pain point for many organizations. Rule-based systems or specialized machine learning (ML) models often struggle with the variability of real-world documents, especially when dealing with semi-structured and unstructured data.

We demonstrate how generative AI along with external tool use offers a more flexible and adaptable solution to this challenge. Through a practical use case of processing a patient health package at a doctor’s office, you will see how this technology can extract and synthesize information from all three document types, potentially improving data accuracy and operational efficiency.

Solution overview

This intelligent document processing solution uses Amazon Bedrock FMs to orchestrate a sophisticated workflow for handling multi-page healthcare documents with mixed content types. The solution uses the FM’s tool use capabilities, accessed through the Amazon Bedrock Converse API. This enables the FMs to not just process text, but to actively engage with various external tools and APIs to perform complex document analysis tasks.

The solution employs a strategic multi-model approach, optimizing for both performance and cost by selecting the most appropriate model for each task:

  • Anthropic’s Claude 3 Haiku – Serves as the workflow orchestrator due to its low latency and cost-effectiveness. This model’s strong reasoning and tool use abilities make it ideal for the following:

    • Coordinating the overall document processing pipeline

    • Making routing decisions for different document types

    • Invoking appropriate processing functions

    • Managing the workflow state

  • Anthropic’s Claude 3.5 Sonnet (v2) – Used for its advanced reasoning and strong visual processing capabilities, particularly its ability to interpret charts and graphs. Its key strengths include:

    • Interpreting complex document layouts and structure

    • Extracting text from tables and forms

    • Processing medical charts and handwritten notes

    • Converting unstructured visual information into structured data

Through the Amazon Bedrock Converse API’s standardized tool use (function calling) interface, these models can work together seamlessly to invoke document processing functions, call external APIs for data validation, trigger storage operations, and execute content transformation tasks. The API serves as the foundation for this intelligent workflow, providing a unified interface for model communication while maintaining conversation state throughout the processing pipeline. The API’s standardized approach to tool definition and function calling provides consistent interaction patterns across different processing stages. For more details on how tool use works, refer to The complete tool use workflow.

The solution incorporates Amazon Bedrock Guardrails to implement robust content filtering policies and sensitive information detection, making sure that personal health information (PHI) and personally identifiable information (PII) data is appropriately protected through automated detection and masking capabilities while maintaining industry standard compliance throughout the document processing workflow.

Prerequisites

You need the following prerequisites before you can proceed with this solution. For this post, we use the us-west-2 AWS Region. For details on available Regions, see Amazon Bedrock endpoints and quotas.

Use case and dataset

For our example use case, we examine a patient intake process at a healthcare institution. The workflow processes a patient health information package containing three distinct document types:

  • Structured document – A new patient intake form with standardized fields for personal information, medical history, and current symptoms. This form follows a consistent layout with clearly defined fields and check boxes, making it an ideal example of a structured document.
  • Semi-structured document – A health insurance card that contains essential coverage information. Although insurance cards generally contain similar information (policy number, group ID, coverage dates), they come from different providers with varying layouts and formats, showing the semi-structured nature of these documents.
  • Unstructured document – A handwritten doctor’s note from an initial consultation, containing free-form observations, preliminary diagnoses, and treatment recommendations. This represents the most challenging category of unstructured documents, where information isn’t confined to any predetermined format or structure.

The example document can be downloaded from the following GitHub repo.

This healthcare use case is particularly relevant because it encompasses common challenges in document processing: the need for high accuracy, compliance with healthcare data privacy requirements, and the ability to handle multiple document formats within a single workflow. The variety of documents in this patient package demonstrates how a modern intelligent document processing solution must be flexible enough to handle different levels of document structure while maintaining consistency and accuracy in data extraction.

The following diagram illustrates the solution workflow.

IDP flow using external tool calling

This self-orchestrated workflow demonstrates how modern generative AI solutions can balance capability, performance, and cost-effectiveness in transforming traditional document processing workflows in healthcare settings.

Deploy the solution

  1. Create an Amazon SageMaker domain. For instructions, see Use quick setup for Amazon SageMaker AI.
  2. Launch SageMaker Studio, then create and launch a JupyterLab space. For instructions, see Create a space.
  3. Create a guardrail. Focus on adding sensitive information filters that would mask PII or PHI (a minimal example follows these setup steps).
  4. Clone the code from the GitHub repository:

    git clone https://github.com/aws-samples/anthropic-on-aws.git
  5. Change the directory to the root of the cloned repository:

    cd medical-idp
  6. Install dependencies:

    pip install -r requirements.txt
  7. Update setup.sh with the guardrail ID you created in Step 3. Then set the ENV variable:

    source setup.sh
  8. Finally, start the Streamlit application:

    streamlit run streamlit_app.py
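For step 3, the following minimal sketch shows one way to create such a guardrail with sensitive information filters using boto3. The guardrail name, messaging, and the specific PII entity types are assumptions that you should adapt to your own compliance requirements.

# Minimal sketch of creating a guardrail that masks common PII/PHI fields.
# The name, messaging, and entity list are illustrative assumptions.
import boto3

bedrock = boto3.client("bedrock")

response = bedrock.create_guardrail(
    name="medical-idp-guardrail",                      # hypothetical name
    description="Masks PII/PHI in document processing outputs",
    sensitiveInformationPolicyConfig={
        "piiEntitiesConfig": [
            {"type": "NAME", "action": "ANONYMIZE"},
            {"type": "EMAIL", "action": "ANONYMIZE"},
            {"type": "PHONE", "action": "ANONYMIZE"},
            {"type": "US_SOCIAL_SECURITY_NUMBER", "action": "ANONYMIZE"},
        ]
    },
    blockedInputMessaging="Sorry, this input cannot be processed.",
    blockedOutputsMessaging="Sorry, this output has been blocked.",
)
print(response["guardrailId"], response["version"])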

Now you’re ready to explore the intelligent document processing workflow using Amazon Bedrock.

Technical implementation

The solution is built around the Amazon Bedrock Converse API and tool use framework, with Anthropic’s Claude 3 Haiku serving as the primary orchestrator. When a document is uploaded through the Streamlit interface, Haiku analyzes the request and determines the sequence of tools needed by consulting the tool definitions in ToolConfig. These definitions include tools for the following:

  • Document processing pipeline – Handles initial PDF processing and classification
  • Document notes processing – Extracts information from medical notes
  • New patient information processing – Processes patient intake forms
  • Insurance form processing – Handles insurance card information

The following code is an example tool definition for extracting consultation notes. Here, extract_consultation_notes represents the name of the function that the orchestration workflow will call, and document_paths defines the schema of the input parameter that will be passed to the function. The FM will contextually extract the information from the document and pass it to the method. A similar toolspec will be defined for each step. Refer to the GitHub repo for the full toolspec definition.

{
            "toolSpec": {
                "name": "extract_consultation_notes",
                "description": "Extract diagnostics information from a doctor's consultation notes. Along with the extraction include the full transcript in a <transcript> node",
                "inputSchema": {
                    "json": {
                        "type": "object",
                        "properties": {
                            "document_paths": {
                                "type": "array",
                                "items": {"type": "string"},
                                "description": "Paths to the files that were classified as DOC_NOTES"
                            }
                        },
                        "required": ["document_paths"]
                    }
                }
            }
        }

When a PDF document is uploaded through the Streamlit interface, it is temporarily stored and passed to the FileProcessor class along with the tool specification and a user prompt:

prompt = ("1. Extract 2. save and 3. summarize the information from the patient information package located at " + tmp_file + ". " +
                          "The package might contain various types of documents including insurance cards. Extract and save information from all documents provided. "
                          "Perform any preprocessing or classification of the file provided prior to the extraction." + 
                          "Set the enable_guardrails parameter to " + str(enable_guardrails) + ". " + 
                          "At the end, list all the tools that you had access to. Give an explantion on why each tool was used and if you are not using a tool, explain why it was not used as well" + 
                          "Think step by step.")
processor.process_file(prompt=prompt,
                       toolspecs=toolspecs,
                       ...

The BedrockUtils class manages the conversation with Anthropic’s Claude 3 Haiku through the Amazon Bedrock Converse API. It maintains the conversation state and handles the tool use workflow:

# From bedrockutility.py
def invoke_bedrock(self, message_list, system_message=[], tool_list=[],
                  temperature=0, maxTokens=2048, guardrail_config=None):
    response = self.bedrock.converse(
        modelId=self.model_id,
        messages=message_list,
        system=system_message,
        inferenceConfig={
            "maxTokens": maxTokens,
            "temperature": temperature
        },
        **({"toolConfig": {"tools": tool_list}} if tool_list else {})
    )
    return response

When the processor receives a document, it initiates a conversation loop with Anthropic’s Claude 3 Haiku, which analyzes the document and determines which tools to use based on the content. The model acts as an intelligent orchestrator, making decisions about the following:

  • Which document processing tools to invoke
  • The sequence of processing steps
  • How to handle different document types within the same package
  • When to summarize and complete the processing

This orchestration is managed through a continuous conversation loop that processes tool requests and their results until the entire document package has been processed.
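The following simplified sketch illustrates the general shape of such a loop against the Converse API; the function and variable names are illustrative rather than copied from the repository.

# Simplified, illustrative tool-use loop; names are not taken from the repo.
def run_tool_loop(bedrock_utils, messages, toolspecs, tool_functions):
    while True:
        response = bedrock_utils.invoke_bedrock(message_list=messages, tool_list=toolspecs)
        output_message = response["output"]["message"]
        messages.append(output_message)

        # If the model did not request a tool, the workflow is complete
        if response["stopReason"] != "tool_use":
            return output_message

        # Execute each requested tool and feed the results back to the model
        tool_results = []
        for block in output_message["content"]:
            if "toolUse" in block:
                tool_use = block["toolUse"]
                result = tool_functions[tool_use["name"]](**tool_use["input"])
                tool_results.append({
                    "toolResult": {
                        "toolUseId": tool_use["toolUseId"],
                        "content": [{"json": result}],
                    }
                })
        messages.append({"role": "user", "content": tool_results})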

The first key decision in the workflow is initiating the document classification process. Through the DocumentClassifier class, the solution uses Anthropic’s Claude 3.5 Sonnet to analyze and categorize each page of the uploaded document into three main types: intake forms, insurance cards, and doctor’s notes:

# from document_classifier.py
class DocumentClassifier:
    def __init__(self, file_handler):
        self.sonnet_3_5_bedrock_utils = BedrockUtils(
            model_id=ModelIDs.anthropic_claude_3_5_sonnet
        )
        
    def categorize_document(self, file_paths):
        # Convert documents to binary format for model processing
        binary_data_array = []
        for file_path in file_paths:
            binary_data, media_type = self.file_handler.get_binary_for_file(file_path)
            binary_data_array.append((binary_data[0], media_type))

        # Prepare message for classification
        message_content = [
            {"image": {"format": media_type, "source": {"bytes": data}}}
            for data, media_type in binary_data_array
        ]
        
        # Create classification request
        message_list = [{
            "role": 'user',
            "content": [
                *message_content,
                {"text": "What types of document is in this image?"}
            ]
        }]
        
        # Define system message for classification
        system_message = [{
            "text": '''You are a medical document processing agent. 
                      Categorize images as: INTAKE_FORM, INSURANCE_CARD, or DOC_NOTES'''
        }]
        
        # Get classification from model
        response = self.sonnet_3_5_bedrock_utils.invoke_bedrock(
            message_list=message_list,
            system_message=system_message
        )
        return [response['output']['message']]

Based on the classification results, the FM determines the next tool to be invoked. The tool’s description and input schema define exactly what information needs to be extracted. Following the previous example, let’s assume the next page to be processed is a consultation note. The workflow will invoke the extract_consultation_notes function. This function processes documents to extract detailed medical information. Like the classification process discussed earlier, it first converts the documents to binary format suitable for model processing. The key to accurate extraction lies in how the images and system message are combined:

def extract_info(self, file_paths):
    # Convert documents to binary data
    # This will follow the same pattern as in the classification function
    message_content = [
        {"image": {"format": media_type, "source": {"bytes": data}}}
        for data, media_type in binary_data_array
    ]

    message_list = [{
        "role": 'user',
        "content": [
            *message_content,  # Include the processed document images
            {"text": '''Extract all information from this file
                       If you find a visualization
                           - Provide a detailed description in natural language
                           - Use domain specific language for the description
                    '''}
        ]
    }]
    
    system_message = [{
        "text": '''You are a medical consultation agent with expertise in diagnosing and treating various health conditions.
                   You have a deep understanding of human anatomy, physiology, and medical knowledge across different specialties.
                   During the consultation, you review the patient's medical records, test results, and documentation provided.
                   You analyze this information objectively and make associations between the data and potential diagnoses.
                   Associate a confidence score with each piece of extracted information. This should reflect how confident the model is that the extracted value matches the requested entity.
        '''}
    ]
    
    response = self.bedrock_utils.invoke_bedrock(
        message_list=message_list,
        system_message=system_message
    )
    return [response['output']['message']]

The system message serves three crucial purposes:

  • Establishing medical domain expertise for accurate interpretation.
  • Providing guidelines for handling different types of information (text and visualizations).
  • Providing a self-scored confidence. Although this is not an independent grading mechanism, the score is directionally indicative of how confident the model is in its own extraction.

Following the same pattern, the FM will use the other tools in the toolspec definition to save and summarize the results.

A unique advantage of using a multi-modal FM for the extraction task is its ability to have a deep understanding of the text it is extracting. For example, the following code is an excerpt of the data schema we request as input to the save_consultation_notes function. Refer to the code in constants.py for the full definition. The model needs to not only extract a transcript, but also understand it, to extract such structured data from an unstructured document. This significantly reduces the postprocessing effort required for the data to be consumed by a downstream application.

"consultation": {
                            "type": "object",
                            "properties": {
                            "date": {"type": "string"},
                            "concern": {
                                "type": "object",
                                "properties": {
                                    "primaryComplaint": {
                                        "type": "string",
                                        "description": "Primary medical complaint of the patient. Only capture the medical condition. no timelines"
                                    },
                                    "duration": {"type": "number"},
                                    "durationUnit": {"type": "string", "enum": ["days", "weeks", "months", "years"]},
                                    "associatedSymptoms": {
                                        "type": "object",
                                        "additionalProperties": {
                                            "type": "boolean"
                                        },
                                        "description": "Key-value pairs of symptoms and their presence (true) or absence (false)"
                                    },
                                    "absentSymptoms": {
                                        "type": "array",
                                        "items": {"type": "string"}
                                    }
                                },
                                "required": ["primaryComplaint", "duration", "durationUnit"]
                            }

The documents contain a treasure trove of personally identifiable information (PII) and personal health information (PHI). To redact this information, you can pass enable_guardrails as true. This will use the guardrail you set up earlier as part of the information extraction process and mask information identified as PII or PHI.

processor.process_file(prompt=prompt,
                       enable_guardrails=True,
                       toolspecs=toolspecs,
                       …
)

Finally, cross-document validation is crucial for maintaining data accuracy and compliance in healthcare settings. Although the current implementation performs basic consistency checks through the summary prompt, organizations can extend the framework by implementing a dedicated validation tool that integrates with their specific business rules and compliance requirements. Such a tool could perform sophisticated validation logic like insurance policy verification, appointment date consistency checks, or any other domain-specific validation requirements, providing complete data integrity across the document package.
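As a starting point, the following hedged sketch shows what a hypothetical validation tool definition and handler could look like; the field names and rules are assumptions intended to be replaced with your own business logic.

# Hypothetical cross-document validation tool; adapt the rules to your needs.
from datetime import date

validation_toolspec = {
    "toolSpec": {
        "name": "validate_patient_package",
        "description": "Cross-check extracted intake, insurance, and consultation data",
        "inputSchema": {
            "json": {
                "type": "object",
                "properties": {
                    "patient_name": {"type": "string"},
                    "insurance_policy_number": {"type": "string"},
                    "consultation_date": {"type": "string", "description": "ISO 8601 date"},
                },
                "required": ["patient_name", "consultation_date"],
            }
        },
    }
}

def validate_patient_package(patient_name, consultation_date, insurance_policy_number=None):
    issues = []
    # Example rule: the consultation date should not be in the future
    if date.fromisoformat(consultation_date) > date.today():
        issues.append("Consultation date is in the future")
    # Example rule: a policy number is expected to have a minimum length
    if insurance_policy_number is not None and len(insurance_policy_number) < 5:
        issues.append("Insurance policy number looks malformed")
    return {"valid": not issues, "issues": issues}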

Future considerations

As Amazon Bedrock continues to evolve, several powerful features can be integrated into this document processing workflow to enhance its enterprise readiness, performance, and cost-efficiency. Let’s explore how these advanced capabilities can take this solution to the next level:

  • Inference profiles in Amazon Bedrock define a model and its associated Regions for routing invocation requests, enabling various tasks such as usage tracking, cost monitoring, and cross-Region inference. These profiles help users track metrics through Amazon CloudWatch logs, monitor costs with cost allocation tags, and increase throughput by distributing requests across multiple Regions.
  • Prompt caching can help when you have workloads with long and repeated contexts that are frequently reused for multiple queries. Instead of reprocessing the entire context for each document, the workflow can reuse cached prompts, which is particularly beneficial when using the same image across different tooling workflows. With support for multiple cache checkpoints, this feature can substantially reduce processing time and inference costs while maintaining the workflow’s intelligent orchestration capabilities.
  • Intelligent prompt routing can dynamically select the most appropriate model for each task based on performance and cost requirements. Rather than explicitly assigning Anthropic’s Claude 3 Haiku for orchestration and Anthropic’s Claude 3.5 Sonnet for document analysis, the workflow can use intelligent routing to automatically choose the optimal model within the Anthropic family for each request. This approach simplifies model management while providing cost-effective processing of different document types, from simple structured forms to complex handwritten notes, all through a single endpoint. A brief sketch of these options follows this list.
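The following sketch shows how little the calling code has to change: the modelId passed to the Converse API can point to a cross-Region inference profile or an intelligent prompt router instead of a single model. The identifiers shown are illustrative and should be replaced with the profile or router available in your account and Region.

# Illustrative identifiers; replace with the inference profile or prompt
# router ARN/ID available in your account and Region.
import boto3

bedrock_runtime = boto3.client("bedrock-runtime")

# Cross-Region inference profile instead of a single-Region model ID
response = bedrock_runtime.converse(
    modelId="us.anthropic.claude-3-5-sonnet-20241022-v2:0",  # example inference profile ID
    messages=[{"role": "user", "content": [{"text": "Classify this document type."}]}],
)

# An intelligent prompt router is referenced the same way, by its ARN, for example:
# response = bedrock_runtime.converse(
#     modelId="arn:aws:bedrock:us-west-2:123456789012:default-prompt-router/anthropic.claude:1",
#     messages=[...],
# )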

Conclusion

This intelligent document processing solution demonstrates the power of combining Amazon Bedrock FMs with tool use capabilities to create sophisticated, self-orchestrating workflows. By using Anthropic’s Claude 3 Haiku for orchestration and Anthropic’s Claude 3.5 Sonnet for complex visual tasks, the solution effectively handles structured, semi-structured, and unstructured documents while maintaining high accuracy and compliance standards.

Key benefits of this approach include:

  • Reduced manual processing through intelligent automation
  • Improved accuracy through specialized model selection
  • Built-in compliance with guardrails for sensitive data
  • Flexible architecture that adapts to various document types
  • Cost-effective processing through strategic model usage

As organizations continue to digitize their operations, solutions like this showcase how generative AI can transform traditional document processing workflows. The combination of powerful FMs in Amazon Bedrock and the tool use framework provides a robust foundation for building intelligent, scalable document processing solutions across industries.

For more information about Amazon Bedrock and its capabilities, visit the Amazon Bedrock User Guide.


About the Author

Raju Rangan is a Senior Solutions Architect at AWS. He works with government-sponsored entities, helping them build AI/ML solutions using AWS. When not tinkering with cloud solutions, you’ll catch him hanging out with family or smashing birdies in a lively game of badminton with friends.

Reducing hallucinations in LLM agents with a verified semantic cache using Amazon Bedrock Knowledge Bases

Large language models (LLMs) excel at generating human-like text but face a critical challenge: hallucination—producing responses that sound convincing but are factually incorrect. While these models are trained on vast amounts of generic data, they often lack the organization-specific context and up-to-date information needed for accurate responses in business settings. Retrieval Augmented Generation (RAG) techniques help address this by grounding LLMs in relevant data during inference, but these models can still generate non-deterministic outputs and occasionally fabricate information even when given accurate source material. For organizations deploying LLMs in production applications—particularly in critical domains such as healthcare, finance, or legal services—these residual hallucinations pose serious risks, potentially leading to misinformation, liability issues, and loss of user trust.

To address these challenges, we introduce a practical solution that combines the flexibility of LLMs with the reliability of drafted, curated, verified answers. Our solution uses two key Amazon Bedrock services: Amazon Bedrock Knowledge Bases, a fully managed service that you can use to store, search, and retrieve organization-specific information for use with LLMs; and Amazon Bedrock Agents, a fully managed service that you can use to build, test, and deploy AI assistants that can understand user requests, break them down into steps, and execute actions. Similar to how a customer service team maintains a bank of carefully crafted answers to frequently asked questions (FAQs), our solution first checks if a user’s question matches curated and verified responses before letting the LLM generate a new answer. This approach helps prevent hallucinations by using trusted information whenever possible, while still allowing the LLM to handle new or unique questions. By implementing this technique, organizations can improve response accuracy, reduce response times, and lower costs. Whether you’re new to AI development or an experienced practitioner, this post provides step-by-step guidance and code examples to help you build more reliable AI applications.

Solution overview

Our solution implements a verified semantic cache using the Amazon Bedrock Knowledge Bases Retrieve API to reduce hallucinations in LLM responses while simultaneously improving latency and reducing costs. This read-only semantic cache acts as an intelligent intermediary layer between the user and Amazon Bedrock Agents, storing curated and verified question-answer pairs.

When a user submits a query, the solution first evaluates its semantic similarity with existing verified questions in the knowledge base. For highly similar queries (greater than 80% match), the solution bypasses the LLM completely and returns the curated and verified answer directly. When partial matches (60–80% similarity) are found, the solution uses the verified answers as few-shot examples to guide the LLM’s response, significantly improving accuracy and consistency. For queries with low similarity (less than 60%) or no match, the solution falls back to standard LLM processing, making sure that user questions receive appropriate responses.

This approach offers several key benefits:

  • Reduced costs: By minimizing unnecessary LLM invocations for frequently answered questions, the solution significantly reduces operational costs at scale
  • Improved accuracy: Curated and verified answers minimize the possibility of hallucinations for known user queries, while few-shot prompting enhances accuracy for similar questions.
  • Lower latency: Direct retrieval of cached answers provides near-instantaneous responses for known queries, improving the overall user experience.

The semantic cache serves as a growing repository of trusted responses, continuously improving the solution’s reliability while maintaining efficiency in handling user queries.

Solution architecture

The following diagram illustrates the solution architecture and the AWS services used.

The solution architecture in the preceding figure consists of the following components and workflow. Let’s assume that the question “What date will AWS re:Invent 2024 occur?” is in the verified semantic cache, with the corresponding verified answer “AWS re:Invent 2024 takes place on December 2–6, 2024.” Let’s walk through an example of how this solution handles a user’s question.

1. Query processing:

a. User submits a question “When is re:Invent happening this year?”, which is received by the Invoke Agent function.

b. The function checks the semantic cache (Amazon Bedrock Knowledge Bases) using the Retrieve API.

c. Amazon Bedrock Knowledge Bases performs a semantic search and finds a similar question with an 85% similarity score.

2. Response paths: (Based on the 85% similarity score in step 1.c, our solution follows the strong match path)

a. Strong match (similarity score greater than 80%):

i. The Invoke Agent function returns the verified answer “AWS re:Invent 2024 takes place on December 2–6, 2024” directly from the Amazon Bedrock knowledge base, providing a deterministic response.

ii. No LLM invocation needed, response in less than 1 second.

b. Partial match (similarity score 60–80%):

i. The Invoke Agent function invokes the Amazon Bedrock agent and provides the cached answer as a few-shot example for the agent through Amazon Bedrock Agents promptSessionAttributes.

ii. If the question was “What’s the schedule for AWS events in December?”, our solution would provide the verified re:Invent dates to guide the Amazon Bedrock agent’s response with additional context.

iii. Providing the Amazon Bedrock agent with a curated and verified example might help increase accuracy.

c. No match (similarity score less than 60%):

i. If the user’s question isn’t similar to any of the curated and verified questions in the cache, the Invoke Agent function invokes the Amazon Bedrock agent without providing it any additional context from cache.

ii. For example, if the question was “What hotels are near re:Invent?”, our solution would invoke the Amazon Bedrock agent directly, and the agent would use the tools at its disposal to formulate a response.

3. Offline knowledge management:

a. Verified question-answer pairs are stored in an Amazon Simple Storage Service (Amazon S3) bucket for verified Q&A content, and must be updated or reviewed periodically to make sure that the cache contains the most recent and accurate information.

b. The S3 bucket is periodically synchronized with the Amazon Bedrock knowledge base. This offline batch process makes sure that the semantic cache remains up-to-date without impacting real-time operations. A minimal synchronization sketch follows this workflow description.
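A minimal sketch of that periodic synchronization, assuming hypothetical knowledge base and data source IDs, could look like the following.

# Hypothetical IDs; a scheduled job (for example, EventBridge plus Lambda) could
# run this to re-ingest the verified Q&A bucket into the cache knowledge base.
import boto3

bedrock_agent = boto3.client("bedrock-agent")

def sync_verified_cache(knowledge_base_id, data_source_id):
    job = bedrock_agent.start_ingestion_job(
        knowledgeBaseId=knowledge_base_id,
        dataSourceId=data_source_id,
        description="Periodic sync of verified Q&A pairs",
    )["ingestionJob"]
    return job["ingestionJobId"], job["status"]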

Solution walkthrough

You need to meet the following prerequisites for the walkthrough:

Once you have the prerequisites in place, use the following steps to set up the solution in your AWS account.

Step 0: Set up the necessary infrastructure

Follow the “Getting started” instructions in the README of the Git repository to set up the infrastructure for this solution. All the following code samples are extracted from the Jupyter notebook in this repository.

Step 1: Set up two Amazon Bedrock knowledge bases

This step creates two Amazon Bedrock knowledge bases. The agent knowledge base stores Amazon Bedrock service documentation, while the cache knowledge base contains curated and verified question-answer pairs. This setup uses the AWS SDK for Python (Boto3) to interact with AWS services.

agent_knowledge_base = BedrockKnowledgeBase(
    kb_name=agent_knowledge_base_name,
    kb_description="Knowledge base used by Bedrock Agent",
    data_bucket_name=agent_bucket_name,
    chunking_strategy="FIXED_SIZE",
    suffix=f'{agent_unique_id}-f'
)

cache_knowledge_base = BedrockKnowledgeBase(
    kb_name=cache_knowledge_base_name,
    kb_description="Verified cache for Bedrock Agent System",
    data_bucket_name=cache_bucket_name,
    chunking_strategy="NONE",  # We do not want to chunk our question-answer pairs
    suffix=f'{cache_unique_id}-f'
)

This establishes the foundation for your semantic caching solution, setting up the AWS resources to store the agent’s knowledge and verified cache entries.

Step 2: Populate the agent knowledge base and associate it with an Amazon Bedrock agent

For this walkthrough, you create an Amazon Bedrock agent specialized in answering questions about Amazon Bedrock. For this example, you ingest Amazon Bedrock documentation in the form of the User Guide PDF into the Amazon Bedrock knowledge base. This will be the primary dataset. After ingesting the data, you create an agent with specific instructions:

agent_instruction = """You are the Amazon Bedrock Agent. You have access to a 
knowledge base with information about the Amazon Bedrock service on AWS. 
Use it to answer questions."""

agent_id = agents_handler.create_agent(
    agent_name,
    agent_description,
    agent_instruction,
    [agent_foundation_model],
    kb_arns=[agent_kb_arn] # Associate agent with our Agent knowledge base
)

This setup enables the Amazon Bedrock agent to use the ingested knowledge to provide responses about Amazon Bedrock services. To test it, you can ask a question that isn’t present in the agent’s knowledge base, making the LLM either refuse to answer or hallucinate.

invoke_agent("What are the dates for reinvent 2024?", session_id="test")
# Response: Unfortunately, the dates for the AWS re:Invent 2024 conference have not 
# been announced yet by Amazon. The re:Invent conference is typically held in late 
# November or early December each year, but the specific dates for 2024 are not 
# available at this time. AWS usually announces the dates for their upcoming 
# re:Invent event around 6-9 months in advance.

Step 3: Create a cache dataset with known question-answer pairs and populate the cache knowledge base

In this step, you create a raw dataset of verified question-answer pairs that aren’t present in the agent knowledge base. These curated and verified answers serve as our semantic cache to prevent hallucinations on known topics. Good candidates for inclusion in this cache are:

  1. Frequently asked questions (FAQs): Common queries that users often ask, which can be answered consistently and accurately.
  2. Critical questions requiring deterministic answers: Topics where precision is crucial, such as pricing information, service limits, or compliance details.
  3. Time-sensitive information: Recent updates, announcements, or temporary changes that might not be reflected in the main RAG knowledge base.

By carefully curating this cache with high-quality, verified answers to such questions, you can significantly improve the accuracy and reliability of your solution’s responses. For this walkthrough, use the following example pairs for the cache:

Q: 'What are the dates for reinvent 2024?'
A: 'The AWS re:Invent conference was held from December 2-6 in 2024.'

Q: 'What was the biggest new feature announcement for Bedrock Agents during reinvent 2024?'
A: 'During re:Invent 2024, one of the headline new feature announcements for Bedrock Agents was the custom orchestrator. This key feature allows users to implement their own orchestration strategies through AWS Lambda functions, providing granular control over task planning, completion, and verification while enabling real-time adjustments and reusability across multiple agents.'

You then format these pairs as individual text files with corresponding metadata JSON files, upload them to an S3 bucket, and ingest them into your cache knowledge base. This process makes sure that your semantic cache is populated with accurate, curated, and verified information that can be quickly retrieved to answer user queries or guide the agent’s responses.
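The following sketch illustrates one way to write each pair as a text file with a companion .metadata.json file and upload both to the cache bucket; the file naming and metadata attributes are assumptions for illustration.

# Illustrative file layout; adjust naming and metadata to your needs.
import json
import boto3

s3 = boto3.client("s3")

def upload_verified_pair(bucket, key_prefix, question, answer):
    body = f"Q: {question}\nA: {answer}\n"
    metadata = {"metadataAttributes": {"verified": True, "source": "faq"}}  # assumed attributes

    # The answer text itself is what the Retrieve API returns on a cache hit
    s3.put_object(Bucket=bucket, Key=f"{key_prefix}.txt", Body=body.encode("utf-8"))
    # Bedrock Knowledge Bases picks up <file>.metadata.json as document metadata
    s3.put_object(
        Bucket=bucket,
        Key=f"{key_prefix}.txt.metadata.json",
        Body=json.dumps(metadata).encode("utf-8"),
    )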

Step 4: Implement the verified semantic cache logic

In this step, you implement the core logic of your verified semantic cache solution. You create a function that integrates the semantic cache with your Amazon Bedrock agent, enhancing its ability to provide accurate and consistent responses. The function does the following:

  1. Queries the cache knowledge base for similar entries to the user question.
  2. If a high similarity match is found (greater than 80%), it returns the cached answer directly.
  3. For partial matches (60–80%), it uses the cached answer as a few-shot example for the agent.
  4. For low similarity (less than 60%), it falls back to standard agent processing.

This simplified logic forms the core of the semantic caching solution, efficiently using curated and verified information to improve response accuracy and reduce unnecessary LLM invocations.
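Here is a condensed, illustrative sketch of that logic. The thresholds match the description above, while the function signature, attribute names, and response parsing are assumptions rather than code copied from the repository.

# Condensed, illustrative version of the cache-then-agent routing logic.
import boto3

agent_runtime = boto3.client("bedrock-agent-runtime")

def invoke_agent_with_verified_cache(question, cache_kb_id, agent_id, agent_alias_id, session_id):
    # 1. Look for the most similar verified question-answer pair
    results = agent_runtime.retrieve(
        knowledgeBaseId=cache_kb_id,
        retrievalQuery={"text": question},
        retrievalConfiguration={"vectorSearchConfiguration": {"numberOfResults": 1}},
    )["retrievalResults"]

    score = results[0]["score"] if results else 0.0
    cached_answer = results[0]["content"]["text"] if results else None

    if score > 0.8:
        # 2. Strong match: return the verified answer without invoking the LLM
        return cached_answer

    session_state = {}
    if score > 0.6:
        # 3. Partial match: pass the verified answer as few-shot context
        session_state = {"promptSessionAttributes": {"verifiedExample": cached_answer}}

    # 4. Invoke the Amazon Bedrock agent (with or without the cached context)
    stream = agent_runtime.invoke_agent(
        agentId=agent_id,
        agentAliasId=agent_alias_id,
        sessionId=session_id,
        inputText=question,
        sessionState=session_state,
    )["completion"]
    return "".join(
        event["chunk"]["bytes"].decode("utf-8") for event in stream if "chunk" in event
    )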

Step 5: Evaluate results and performance

This step demonstrates the effectiveness of the verified semantic cache solution by testing it with different scenarios and comparing the results and latency. You’ll use three test cases to showcase the solution’s behavior:

  1. Strong semantic match (greater than 80% similarity)
  2. Partial semantic match (60–80% similarity)
  3. No semantic match (less than 60% similarity)

Here are the results:

  1. Strong semantic match (greater than 80% similarity) provides the exact curated and verified answer in less than 1 second.
    %%time
    invoke_agent_with_verified_cache("What were some new features announced for Bedrock Agents during reinvent 2024?")
    
    # Output:
    # Cache semantic similarity log: Strong match with score 0.9176399
    # CPU times: user 20.7 ms, sys: 442 μs, total: 21.1 ms
    # Wall time: 440 ms
    
    # During re:Invent 2024, one of the headline new feature announcements for Bedrock 
    # Agents was the custom orchestrator. This key feature allows users to implement 
    # their own orchestration strategies through AWS Lambda functions, providing 
    # granular control over task planning, completion, and verification while enabling 
    # real-time adjustments and reusability across multiple agents.

  2. Partial semantic match (60–80% similarity) passes the verified answer to the LLM during the invocation. The Amazon Bedrock agent answers the question correctly using the cached answer even though the information is not present in the agent knowledge base.
    %%time
    invoke_agent_with_verified_cache("What are the newest features for Bedrock Agents?") 
    
    # Output:
    # Cache semantic similarity log: Partial match with score 0.6443664
    # CPU times: user 10.4 ms, sys: 0 ns, total: 10.4 ms
    # Wall time: 12.8 s
    
    # One of the newest and most significant features for Amazon Bedrock Agents 
    # announced during re:Invent 2024 was the custom orchestrator. This feature 
    # allows users to implement their own orchestration strategies through AWS 
    # Lambda functions, providing granular control over task planning, completion, 
    # and verification. It enables real-time adjustments and reusability across 
    # multiple agents, enhancing the flexibility and power of Bedrock Agents.

  3. No semantic match (less than 60% similarity) invokes the Amazon Bedrock agent as usual. For this query, the LLM will either refuse to provide the information because it’s not present in the agent’s knowledge base, or will hallucinate and provide a response that is plausible but incorrect.
    %%time
    invoke_agent_with_verified_cache("Tell me about a new feature for Amazon Bedrock Agents")
    
    # Output:
    # Cache semantic similarity log: No match with score 0.532105
    # CPU times: user 22.3 ms, sys: 579 μs, total: 22.9 ms
    # Wall time: 13.6 s
    
    # Amazon Bedrock is a service that provides secure and scalable compute capacity 
    # for running applications on AWS. As for new features for the Bedrock Agents 
    # component, I do not have any specific information on recent or upcoming new 
    # features. However, AWS services are frequently updated with new capabilities, 
    # so it's possible there could be new agent features released in the future to 
    # enhance security, scalability, or integration with other AWS services. Without 
    # being able to consult the Knowledge Base, I cannot provide details on any 
    # particular new Bedrock Agent features at this time.

These results demonstrate the effectiveness of the semantic caching solution:

  1. Strong matches provide near-instant, accurate, and deterministic responses without invoking an LLM.
  2. Partial matches guide the LLM agent to provide a more relevant or accurate answer.
  3. No matches fall back to standard LLM agent processing, maintaining flexibility.

The semantic cache significantly reduces latency for known questions and improves accuracy for similar queries, while still allowing the agent to handle unique questions when necessary.

Step 6: Resource clean up

Make sure that the Amazon Bedrock knowledge bases that you created, along with the underlying Amazon OpenSearch Serverless collections, are deleted to avoid incurring unnecessary costs.

Production readiness considerations

Before deploying this solution in production, address these key considerations:

  1. Similarity threshold optimization: Experiment with different thresholds to balance cache hit rates and accuracy. This directly impacts the solution’s effectiveness in preventing hallucinations while maintaining relevance.
  2. Feedback loop implementation: Create a mechanism to continuously update the verified cache with new, accurate responses. This helps prevent cache staleness and maintains the solution’s integrity as a source of truth for the LLM.
  3. Cache management and update strategy: Regularly refresh the semantic cache with current, frequently asked questions to maintain relevance and improve hit rates. Implement a systematic process for reviewing, validating, and incorporating new entries to help ensure cache quality and alignment with evolving user needs.
  4. Ongoing tuning: Adjust similarity thresholds as your dataset evolves. Treat the semantic cache as a dynamic component, requiring continuous optimization for your specific use case.

Conclusion

This verified semantic cache approach offers a powerful solution to reduce hallucinations in LLM responses while improving latency and reducing costs. By using Amazon Bedrock Knowledge Bases, you can implement a solution that can efficiently serve curated and verified answers, guide LLM responses with few-shot examples, and gracefully fall back to full LLM processing when needed.


About the Authors

Dheer Toprani (author photo)Dheer Toprani is a System Development Engineer within the Amazon Worldwide Returns and ReCommerce Data Services team. He specializes in large language models, cloud infrastructure, and scalable data systems, focusing on building intelligent solutions that enhance automation and data accessibility across Amazon’s operations. Previously, he was a Data & Machine Learning Engineer at AWS, where he worked closely with customers to develop enterprise-scale data infrastructure, including data lakes, analytics dashboards, and ETL pipelines.

Chaithanya Maisagoni Author PhotoChaithanya Maisagoni is a Senior Software Development Engineer (AI/ML) in Amazon’s Worldwide Returns and ReCommerce organization. He specializes in building scalable machine learning infrastructure, distributed systems, and containerization technologies. His expertise lies in developing robust solutions that enhance monitoring, streamline inference processes, and strengthen audit capabilities to support and optimize Amazon’s global operations.

Rajesh Nedunuri Author PhotoRajesh Nedunuri is a Senior Data Engineer within the Amazon Worldwide Returns and ReCommerce Data Services team. He specializes in designing, building, and optimizing large-scale data solutions. At Amazon, he plays a key role in developing scalable data pipelines, improving data quality, and enabling actionable insights for reverse logistics and ReCommerce operations. He is deeply passionate about generative AI and consistently seeks opportunities to implement AI into solving complex customer challenges.

Karam Muppidi Author PhotoKaram Muppidi is a Senior Engineering Manager at Amazon Retail, where he leads data engineering, infrastructure and analytics for the Worldwide Returns and ReCommerce organization. He has extensive experience developing enterprise-scale data architectures and governance strategies using both proprietary and native AWS platforms, as well as third-party tools. Previously, Karam developed big-data analytics applications and SOX compliance solutions for Amazon’s Fintech and Merchant Technologies divisions.

LLM continuous self-instruct fine-tuning framework powered by a compound AI system on Amazon SageMaker

LLM continuous self-instruct fine-tuning framework powered by a compound AI system on Amazon SageMaker

Fine-tuning a pre-trained large language model (LLM) allows users to customize the model to perform better on domain-specific tasks or align more closely with human preferences. It is a continuous process that keeps the fine-tuned model accurate and effective in changing environments, adapting to data distribution shift (concept drift) and preventing performance degradation over time. Continuous fine-tuning also enables models to integrate human feedback, address errors, and tailor to real-world applications. You can use supervised fine-tuning (SFT) and instruction tuning to train the LLM to perform better on specific tasks using human-annotated datasets and instructions. When you have user feedback on the model responses, you can also use reinforcement learning from human feedback (RLHF) to guide the LLM’s responses by rewarding outputs that align with human preferences.

Producing precise and responsible outputs from fine-tuned LLMs requires significant effort from subject matter experts (SMEs). Manually annotating extensive training data for fine-tuning and collecting user feedback to align LLM responses with human preferences are both resource-heavy and time-intensive. In addition, the continuous fine-tuning process requires orchestrating the multiple steps of data generation, LLM training, feedback collection, and preference alignment with scalability, resiliency, and resource efficiency. To address these challenges, we present an innovative continuous self-instruct fine-tuning framework that streamlines the LLM fine-tuning process of training data generation and annotation, model training and evaluation, human feedback collection, and alignment with human preferences. This framework is designed as a compound AI system to drive the fine-tuning workflow for performance improvement, versatility, and reusability.

In this post, we introduce the continuous self-instruct fine-tuning framework and its pipeline, and present how to drive the continuous fine-tuning process for a question-answer task as a compound AI system. We use DSPy (Declarative Self-improving Python) to demonstrate the workflow of Retrieval Augmented Generation (RAG) optimization, LLM fine-tuning and evaluation, and human preference alignment for performance improvement.

Overview of the continuous self-instruct fine-tuning framework

The continuous self-instruct fine-tuning framework drives a workflow to customize the foundation model (FM) using human-labeled training samples and human feedback after model inference. This workflow runs on a continuous basis to be adaptive to a changing environment. The following diagram illustrates the workflow.

(Diagram: continuous self-instruct fine-tuning workflow)

The workflow consists of the following steps:

  1. Self-instruct supervised fine-tuning – First, we use a human-labeled training dataset to adapt the FM to tasks in a specific domain. Instruction tuning is a popular approach in domain-specific LLM fine-tuning, which trains the FM to follow instructions for a specific task rather than simply generating the next tokens. To address the lack of human effort available for data labeling, annotation, and validation, we designed a self-instruct fine-tuning method in which the LLM synthetically generates training labels from a small volume of high-quality human-annotated samples. This process scales up the training dataset used for fine-tuning the FM into a custom LLM.
  2. Human preference alignment – After the model is deployed in the production environment, the process moves into the human-in-the-loop workflow, in which we collect user feedback, including satisfaction scores and comments on the model responses. The human feedback data is not only used to measure model performance and hallucination, but is also used to further fine-tune the custom model from Step 1 through RLHF. Likewise, to address the shortage of human feedback data, we use LLMs to generate AI grades and feedback that scale up the dataset for reinforcement learning from AI feedback (RLAIF); a minimal sketch of building such preference pairs follows this list. Various preference alignment techniques, including proximal policy optimization (PPO), direct preference optimization (DPO), odds ratio policy optimization (ORPO), and group relative policy optimization (GRPO), can be used in this process.
  3. Evaluation and continuous learning – Model customization and preference alignment are not a one-time effort. We need to keep monitoring and evaluating the model performance, and restart the process in case of concept shift or model decay.
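
To make the RLAIF idea in step 2 concrete, the following is a minimal sketch of turning AI feedback into preference pairs for DPO or ORPO training. The judge prompt and the judge_llm helper are illustrative assumptions; the framework’s actual feedback collection and alignment training are covered later in the post and in the companion GitHub repository.

# Minimal sketch: build {prompt, chosen, rejected} preference records from AI feedback.
# judge_llm is an illustrative callable that sends a prompt to an LLM and returns its text answer.
def build_preference_record(question, response_a, response_b, judge_llm):
    verdict = judge_llm(
        f"Question: {question}\n"
        f"Response A: {response_a}\n"
        f"Response B: {response_b}\n"
        "Which response is more accurate and helpful? Answer with A or B only."
    )
    chosen, rejected = (response_a, response_b) if "A" in verdict else (response_b, response_a)
    # DPO/ORPO trainers typically consume records in this prompt/chosen/rejected form
    return {"prompt": question, "chosen": chosen, "rejected": rejected}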

The overall workflow consists of multiple steps of synthetic data generation, LLM training, feedback collection, preference alignment, and evaluation that involves multiple components and multiple LLMs. In the next section, we discuss using a compound AI system to implement this framework to achieve high versatility and reusability.

Compound AI system and the DSPy framework

With the rise of generative AI, scientists and engineers face a much more complex scenario to develop and maintain AI solutions, compared to classic predictive AI. The paper The Shift from Models to Compound AI Systems highlights that state-of-the-art AI results are increasingly obtained by compound systems with multiple components, not just monolithic models. Compound AI systems are systems that implement AI tasks by combining multiple interacting components. These components can include multiple calls to models, retrievers, or external tools. The following diagram compares predictive AI to generative AI.

(Diagram: predictive AI compared with generative AI compound systems)

The concept of a compound AI system enables data scientists and ML engineers to design sophisticated generative AI systems consisting of multiple models and components. You can use a module to incorporate prompt engineering and in-context learning to improve RAG performance, and also design a data architecture with tools to gather external data. You can also build an agentic architecture with multiple LLMs, fine-tune the model to achieve higher performance, and orchestrate the LLM access. Besides the efficiency in system design, the compound AI system also enables you to optimize complex generative AI systems, using a comprehensive evaluation module based on multiple metrics, benchmarking data, and even judgements from other LLMs. The optimization is on the holistic end-to-end solution, rather than on each component separately.

To efficiently build and optimize compound AI systems, we introduce DSPy, an open source Python framework for developers to build LLM applications using modular and declarative programming, whether you’re building simple classifiers, sophisticated RAG pipelines, or agentic workflows. It provides algorithms for optimizing LLMs’ prompts and weights, and automates the prompt tuning process, as opposed to the trial-and-error approach performed by humans. DSPy supports iteratively optimizing all prompts involved against defined metrics for the end-to-end compound AI solution.
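
To give a flavor of this declarative style before applying it to RAG, the following is a minimal DSPy sketch. It assumes DSPy is installed and a language model has already been configured for the session (for example, the Bedrock-backed model created in the next section); the signature and example sentence are illustrative.

import dspy

# A signature declares what the module should do, not how to prompt for it
class Sentiment(dspy.Signature):
    """Classify the sentiment of a sentence as positive, negative, or neutral."""
    sentence = dspy.InputField()
    sentiment = dspy.OutputField(desc="positive, negative, or neutral")

classify = dspy.Predict(Sentiment)
result = classify(sentence="The onboarding process was quick and painless.")
print(result.sentiment)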

The DSPy lifecycle is presented in the following diagram in seven steps. It separates the flow of your program (modules) from the parameters (language model prompts and weights) of each step. These modules define the system behavior in a portable, declarative way. The first four steps cover the DSPy programming stage, including defining your task and its constraints, exploring a few examples, and using that to inform your initial pipeline design. When your system works reasonably well, you can run the DSPy evaluation stage (Steps 5 and 6) to collect an initial development set, define your DSPy metric, and use these to iterate on your system more systematically. Afterwards, DSPy introduces new optimizers (compilers) in Step 7, with language model-driven algorithms to tune LLM prompts and weights, based on predefined evaluation metrics.

(Diagram: the DSPy lifecycle)

RAG pipeline with continuous fine-tuning in a compound AI system

In this post, we provide an example of a question-answer task, using a RAG pipeline along with the continuous self-instruct fine-tuning framework. We build this as a compound AI system and use DSPy to drive the RAG inference, prompt optimization, LLM fine-tuning, and performance evaluation. The overall workflow is shown in the following diagram.

(Diagram: RAG pipeline with continuous fine-tuning in a compound AI system)

The flow starts from a standard RAG pipeline, followed by a few optimizations on the prompts and the RAG retriever. We then generate a synthetic training dataset from the RAG knowledge base to fine-tune the generator LLM used by RAG for performance improvement. Lastly, we use a separate LLM to generate feedback on the fine-tuned model responses, and use it to conduct preference alignment training with DPO and PPO. The question-answer outputs from each step are measured by the underlying LLM-as-a-judge evaluation module. In this way, we demonstrate the effectiveness of the compound AI system for continuously optimizing the pipeline through RAG optimization and the fine-tuning framework.

In the next sections, we demonstrate how to build this workflow, including the RAG pipeline, optimization, instruction fine-tuning, preference alignment, and model evaluation, into a compound AI system using an Amazon SageMaker notebook instance with the DSPy framework and LLMs on Amazon Bedrock. The code from this post and more examples are available in the GitHub repository.

Prerequisites

To create and run this compound AI system in your AWS account, complete the following prerequisites:

  1. Create an AWS account if you don’t already have one.
  2. Set up a SageMaker notebook instance.
  3. Open JupyterLab in this newly created instance.
  4. Clone the GitHub repository and follow the steps explained in the README.
  5. Navigate to the cloned repository and open the notebook folder.
  6. Enable access to models hosted on Amazon Bedrock. For this post, we enable Anthropic’s Claude 3 Sonnet, Mistral 7B, and Meta Llama 3 8B.

Dataset

For the question-answering task, we use the Contract Understanding Atticus Dataset (CUAD), an open legal contract review dataset created with dozens of legal experts from The Atticus Project, which consists of over 13,000 annotations. The synthetic data generation notebook automatically downloads the CUAD_v1 ZIP file and places it in the required folder named cuad_data.

In case of any issues, you can alternatively download the dataset yourself by following the steps in the README file, store it in a folder within the SageMaker notebook instance, and use it to perform the steps in the next section.

Prepare question-answer pairs

The first step is to prepare question-answer pairs from the CUAD document by running synthetic data generation.

We use Anthropic’s Claude 3 Sonnet on Amazon Bedrock to synthetically generate question-answer pairs used to query the RAG pipeline in the compound AI system and demonstrate the improved accuracy after RAG optimization and model fine-tuning. The generated dataset consists of [context, question, answer] triplets from the document. We use the question to query the RAG pipeline and the answer as ground truth to evaluate the inference accuracy. Additionally, the question-answer pairs are used as training samples for the model fine-tuning. The following is a sample triplet with context and a question-answer pair.

Context (snippet from PDF file): THIS STRATEGIC ALLIANCE AGREEMENT (“Agreement”) is made and entered into as of November 6, 2016 (the “Effective Date”) by and between Dialog Semiconductor (UK) Ltd., a corporation organized under the laws of England and Wales, having its principal office at 100 Longwater Avenue, Green Park, Reading, RG2 6GP, United Kingdom (“DIALOG”) and Energous Corporation, a Delaware corporation, having its principal office at 3590 North First Street, Suite 210, San Jose, CA 95134 (“ENERGOUS”)

Question: What is the date of the contract?

Answer: November 6, 2016
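
The synthetic data generation notebook performs this step for you; as a rough illustration of the idea, the following sketch prompts Anthropic’s Claude 3 Sonnet on Amazon Bedrock to produce question-answer pairs for a single document chunk. The prompt wording and output handling are simplified assumptions, not the notebook’s exact implementation.

import json
import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-west-2")

# A single chunk from the CUAD document (truncated here for brevity)
chunk = "THIS STRATEGIC ALLIANCE AGREEMENT is made and entered into as of November 6, 2016 ..."

prompt = (
    "You are preparing evaluation data for a legal contract question-answering system.\n"
    f"Context:\n{chunk}\n\n"
    "Generate 3 question-answer pairs that can be answered from this context, "
    'as a JSON list of objects with keys "context", "question", and "answer".'
)

response = bedrock.invoke_model(
    modelId="anthropic.claude-3-sonnet-20240229-v1:0",
    body=json.dumps({
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 1024,
        "messages": [{"role": "user", "content": prompt}],
    }),
)
triplets = json.loads(response["body"].read())["content"][0]["text"]
print(triplets)  # [context, question, answer] records for RAG evaluation and fine-tuning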

Create a RAG pipeline

We implement a standard RAG pipeline with DSPy using the following components to create the vector database, set up context retrieval, and generate the answer:

  1. Configure DSPy to use LLMs on Amazon Bedrock as the RAG generator model:
dsp_bedrock = dspy.Bedrock(region_name='us-west-2')
claude_sonnet_model_id = "anthropic.claude-3-sonnet-20240229-v1:0"
bedrock_sonnet = dspy.AWSAnthropic(aws_provider=dsp_bedrock,
                                   model=claude_sonnet_model_id,
                                   max_new_tokens=4096,
                                   max_tokens=4096)
  2. Process the dataset to generate logical and syntactically readable chunks. The chunk size and overlap percentage can be determined empirically for your dataset. For more flexibility, you can generate multiple files from the dataset file and treat each file as one chunk.
  3. To set up the RAG retriever, we select ChromaDB as the vector store and use DSPy’s ChromadbRM module as the retriever model:
titan_embed_model_id = "amazon.titan-embed-text-v2:0"
bedrock_ef = AmazonBedrockEmbeddingFunction(session=session, 
                                            model_name=titan_embed_model_id)
collection_name = "contexts"
persist_dir = "cuad_db/"
rm = ChromadbRM(collection_name=collection_name,
                persist_directory=persist_dir,
                embedding_function=bedrock_ef,
                k=3) 
  4. Using these components, we orchestrate a DSPy RAG pipeline to clean the context, generate the answer, and use the LLM-as-a-judge to score the generated answer with respect to the ground truth:
class GenerateAnswer(dspy.Signature):
   """Answer questions with short factoid answers."""
   context = dspy.InputField(desc="may contain relevant facts")
   question = dspy.InputField()
   answer = dspy.OutputField(desc="often between 1 and 5 words")

class RAG(dspy.Module):
   def __init__(self, num_passages=3):
      super().__init__()
      self.retrieve = ChromadbRM("contexts", "./chroma", k=num_passages)
      self.generate_answer = dspy.ChainOfThought(GenerateAnswer)
   def forward(self, question):
      # Retrieve once, then normalize the passages before generation
      passages = self.retrieve(question).passages
      context = [unicodedata.normalize("NFKD", passage) for passage in passages]
      prediction = self.generate_answer(context=context, question=question)
      return dspy.Prediction(context=context, answer=prediction.answer)

RAG optimization with DSPy

The next step is to perform RAG optimization with DSPy. DSPy provides the Optimizer module, an algorithm that can tune the parameters of a DSPy program (the prompts and language model weights) to maximize the metrics you specify. It takes in a training set to bootstrap selected training examples, and relies on a metric function that measures proximity to, or matches against, the ground truth. With these, we can compile the RAG pipeline module with a defined optimizer instance to conduct the optimization.

In this post, we use the DSPy Optimizer to learn how to generate prompts that improve the RAG response accuracy. Because our dataset is small (fewer than 100 examples), we select the BootstrapFewShot teleprompter to compile the RAG prompts and overall pipeline, and use the synthetic dataset with ground truth and the LLM-as-a-judge metric function we defined in the previous sections:

def validate_context_and_answer(example, pred, trace=None):
   answer_EM = dspy.evaluate.answer_exact_match(example, pred)
   answer_PM = dspy.evaluate.answer_passage_match(example, pred)
   answer_LLMJudge = factuality_metric(example, pred)
   return answer_LLMJudge or answer_EM or answer_PM

rag_lm = RAG()
teleprompter = BootstrapFewShot(metric=validate_context_and_answer)
compiled_rag = teleprompter.compile(rag_lm, trainset=trainset)

The context retrieval is crucial to the overall RAG accuracy. To evaluate the RAG optimization we’ve described, we create a retriever evaluation by the LLM-as-a-judge to understand how well the retriever is able to pull out the relevant chunks for the incoming user question. The LLM judge is defined in the RetrievalJudge class:

class RetrievalJudge(dspy.Signature):
   """Given the question to be answered, check whether the groundtruth answer can be derived from the predicted context. Answer either Retrieved[True] or Retrieved[False]."""
   context = dspy.InputField(desc="Context for the prediction")
   question = dspy.InputField(desc="Question to be answered")
   groundtruth_answer = dspy.InputField(desc="groundtruth answer for the question")
   retrieval_correctness = dspy.OutputField(desc="Can the groundtruth answer be derived from the predicted context?", prefix="Retrieved[True/False]:")

retrieval_judge = dspy.ChainOfThought(RetrievalJudge)

Then we define the metric to measure the retrieval by using the RetrievalJudge, and use the DSPy Evaluate module to generate the accuracy score for retrieval:

def retrieval_metric(example, pred):
   retrieval = retrieval_judge(question=example.question, groundtruth_answer=example.answer, context=pred.context)
   llm_retriever_ans = bool("Retrieved[True]" in retrieval.retrieval_correctness
                            or '100% True' in retrieval.retrieval_correctness
                            or '100% retrieved correct' in retrieval.retrieval_correctness
                            or 'True.' in retrieval.retrieval_correctness)
   return llm_retriever_ans

# Evaluate takes the dev set and a metric; the resulting evaluator is then called with the program to score
rag_retrieval_evaluate = Evaluate(devset=trainset, num_threads=1, metric=retrieval_metric)  # trainset reused here; substitute a held-out dev set as appropriate
rag_retrieval_score = rag_retrieval_evaluate(compiled_rag)

Configure the continuous fine-tuning framework

After the RAG optimization, the compound AI system includes the instruction tuning and preference alignment modules, driven by the continuous fine-tuning framework. This includes using the synthetically generated dataset to train the LLM to follow question-answer instructions through SFT, and using AI-generated feedback (from another LLM) on RAG responses for RLAIF with PPO and for preference alignment with DPO and ORPO. In this step, we use Parameter-Efficient Fine-Tuning (PEFT) with Low-Rank Adaptation (LoRA) to reduce compute requirements and accelerate the training process.
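
As a rough illustration of the PEFT/LoRA setup (the actual training code lives in the companion GitHub repository), the following sketch attaches LoRA adapters to a Meta Llama 3 8B base model with the Hugging Face transformers and peft libraries. The model ID and hyperparameters are assumptions to adjust for your environment, and the base model requires access approval on Hugging Face.

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base_model_id = "meta-llama/Meta-Llama-3-8B"  # assumed model ID
model = AutoModelForCausalLM.from_pretrained(base_model_id)
tokenizer = AutoTokenizer.from_pretrained(base_model_id)

# LoRA trains small low-rank adapter matrices instead of all 8B parameters,
# which reduces compute requirements for SFT and the preference alignment steps.
lora_config = LoraConfig(
    r=16,                                 # adapter rank (illustrative)
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections commonly targeted for Llama models
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()        # typically well under 1% of the total parameters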

At the time of writing, the DSPy Optimization module supports distillation of a prompt-based DSPy program into LLM weight updates using BootstrapFinetune, and does not yet support the fine-tuning methods we defined in the compound AI system. Therefore, we conducted the fine-tuning (instruction tuning and preference alignment) on a Meta Llama 3 8B model separately; refer to the following GitHub repository for more details. With the compound AI system design, we are able to take the fine-tuning results back into the DSPy pipeline, use the LLM-as-a-judge evaluation function to generate the accuracy scores, and benchmark with the standard and optimized RAG inferences. This demonstrates the flexibility and interoperability of the compound AI system, which allows us to seamlessly replace one module with an external component without requiring changes to the entire pipeline.

The following diagram illustrates the workflow.

FT-model evaluation

Define an evaluation approach with DSPy

DSPy provides an Evaluate module for evaluating the compound AI system output by using user-defined metrics. In this post, we use LLM-as-a-judge to evaluate the system output and create the corresponding metrics for benchmarking the accuracy of standard RAG, optimized RAG, and fine-tuned models. Complete the following steps:

  1. Load the dataset for evaluation in the Example data type. Examples are similar to Python dictionaries but with added utilities, and DSPy modules return their outputs as dspy.Prediction objects. For example:
gt_answer = <ground truth of the answer>
pred_answer = <answer from RAG and/or fine-tuned model>
dspy_data = dspy.Example(gt_answer=gt_answer, pred_answer=pred_answer).with_inputs("gt_answer", "pred_answer")
  2. Define the LLM-as-a-judge class to adjudicate whether the predicted answer semantically matches the ground truth answer. For example, the following FactualityJudge_1 class provides a score between 0 and 1; 0 means a complete mismatch and 1 means a perfect match.
class FactualityJudge_1(dspy.Signature):
   """Judge whether the predicted answer semantically matches the groundtruth answer. Provide a score between 0 and 1, where 0 means a complete mismatch and 1 means a perfect match. In the response, only present the score, DO NOT add any preambles."""
   groundtruth_answer = dspy.InputField(desc="groundtruth answer")
   predicted_answer = dspy.InputField(desc="predicted answer")
   factually_correct = dspy.OutputField(desc="Is the predicted answer factually correct and semantically similar to the groundtruth answer?")
  3. Define the evaluation metrics from the LLM judge, using DSPy metrics, to mark whether the predicted answer is true or not. For example, the following function returns the accuracy score based on the output of FactualityJudge_1:
factualityJudge_1 = dspy.ChainOfThought(FactualityJudge_1)

def factuality_metric_1(example, pred=None, trace=None):
   # The evaluation Examples created above carry both the ground truth and the predicted answer
   pred_answer = example.pred_answer
   gt_answer = example.gt_answer
   factual_metric = factualityJudge_1(groundtruth_answer=gt_answer, predicted_answer=pred_answer)
   llm_judge_ans = float(factual_metric.factually_correct)
   print(f"llm_judge_ans = {llm_judge_ans}")
   return llm_judge_ans

metric_LLM_1 = factuality_metric_1
  4. Use the dspy.Evaluate module to generate an accuracy score using the LLM-as-a-judge metrics defined in the previous step:
evaluate_llm_judge = Evaluate(devset=dspy_data, metric=metric_LLM_1, num_threads=1)

This evaluation process should be conducted on a continuous basis in the compound AI system driven by self-instruct fine-tuning, to make sure the overall performance remains stable despite the changes in the environment or the introduction of new data.

Benchmark RAG and LLM fine-tuning with DSPy

We benchmark the approaches presented in this post using the LLM-as-a-judge evaluation function defined in the previous section with the following settings.

The benchmarking covers five methods: standard RAG, optimized RAG, an SFT instruction-tuned LLM, and DPO- and ORPO-aligned LLMs trained with AI feedback. For each method, the LLM judge provides a decimal accuracy score between 0 and 1.

The standard RAG uses Amazon Titan Text Embeddings V2 as the embedding model and Anthropic’s Claude 3 Haiku as the generator model. The RAG compilation uses 32 question-answer pairs to optimize the prompts, and the same dataset is used for inference. The fine-tuning by SFT, DPO, and ORPO is performed on the Meta Llama 3 8B FM, using training samples synthetically generated from the CUAD documents.

The results are presented in the following tables and charts. The different methods demonstrate different levels of improvement, calculated as a percentage: (accuracy of new method – accuracy of standard RAG) / (accuracy of standard RAG) * 100%.

The optimized RAG by DSPy improved the accuracy and reduced hallucinations.

  Method (accuracy by LLM judge, 0-1)    Standard RAG   Optimized by DSPy   Improvement %
  RAG with Anthropic's Claude 3 Haiku    0.3969         0.6656              67.70%
  RAG with Anthropic's Claude 3 Sonnet   0.3031         0.6375              110.33%
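
As a quick check of the improvement calculation, the following snippet reproduces the percentages in the preceding table from the reported accuracy scores.

def improvement_pct(new_accuracy, baseline_accuracy):
    """Relative improvement over the standard RAG baseline, in percent."""
    return (new_accuracy - baseline_accuracy) / baseline_accuracy * 100

print(f"{improvement_pct(0.6656, 0.3969):.2f}%")  # 67.70% (Claude 3 Haiku)
print(f"{improvement_pct(0.6375, 0.3031):.2f}%")  # 110.33% (Claude 3 Sonnet)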

The custom LLM trained by SFT yielded higher accuracy than the standard RAG.

  Comparison (accuracy by LLM judge, 0-1)   Standard RAG   SFT-tuned Meta Llama 3 8B   Improvement %
  vs. standard RAG with Claude 3 Haiku      0.3969         0.4813                      21.26%
  vs. standard RAG with Claude 3 Sonnet     0.3031         0.4813                      58.79%

The custom LLM with preference alignment from human and AI feedback (DPO and ORPO) further improved the model performance. The fine-tuned small model (Meta Llama 3 8B) outperformed the standard RAG pipeline with the medium-sized (Anthropic’s Claude 3 Haiku) and larger (Anthropic’s Claude 3 Sonnet) generator models, and was comparable with the prompt-optimized RAG using ground truth data.

  Comparison (accuracy by LLM judge, 0-1)   Standard RAG   DPO-tuned Meta Llama 3 8B   Improvement %   ORPO-tuned Meta Llama 3 8B   Improvement %
  vs. standard RAG with Claude 3 Haiku      0.3969         0.6719                      69.29%          0.6812                       71.63%
  vs. standard RAG with Claude 3 Sonnet     0.3031         0.6719                      121.68%         0.6812                       124.74%

The following charts compare the accuracy across all tested methods.

(Chart: accuracy comparison across all tested methods)

The preceding results were generated from a small dataset (32 question-answer pairs). You can use a larger sample set with more question-answer pairs to conduct the benchmarking and compare your own results.

Clean up

Make sure to clean up the following resources to avoid incurring additional costs:

  1. Delete Amazon Simple Storage Service (Amazon S3) buckets created for data storage and resource sharing.
  2. Back up the Jupyter notebooks in the SageMaker notebook instance.
  3. Shut down and delete the SageMaker notebook instance.

Cost considerations

Consider the following costs from the solution deployed on AWS:

  • You will incur charges for LLM inference on Amazon Bedrock. For more details, refer to Amazon Bedrock pricing.
  • You will incur charges for storing files in S3 buckets. For more details, refer to Amazon S3 pricing.
  • You will incur charges for your SageMaker notebook instance. For more details, refer to Amazon SageMaker pricing.

Conclusion

In this post, we presented the continuous self-instruct fine-tuning framework as a compound AI system implemented with the DSPy framework. The framework first generates a synthetic dataset from the domain knowledge base and documents for self-instruction, then drives model fine-tuning through SFT. It then introduces a human-in-the-loop workflow to collect human and AI feedback on the model responses, which is used to further improve model performance by aligning it with human preferences through reinforcement learning (RLHF/RLAIF).

We demonstrated the framework for a question-answer task with a RAG pipeline, which improved the end-to-end response accuracy. The workflow is implemented by the DSPy framework; the overall strategy is to use the dspy.Module to connect all the components (RAG pipeline, prompt optimization, LLMs fine-tuned by SFT and RLHF/RLAIF, performance evaluation) together into a compound AI system. Each module can be seamlessly maintained, updated, and replaced without affecting other components in the system. This robust and versatile system design strengthens control and trust through modular design, and increases flexibility and adaptability to changing environments and data sources.

You can implement this continuous fine-tuning framework for LLM performance improvement for your own business use cases, with a compound AI system that provides high flexibility and interoperability. For more details, follow the examples in our GitHub repository.


About the Authors

Yunfei Bai is a Principal Solutions Architect at AWS. With a background in AI/ML, data science, and analytics, Yunfei helps customers adopt AWS services to deliver business results. He designs AI/ML and data analytics solutions that overcome complex technical challenges and drive strategic objectives. Yunfei has a PhD in Electronic and Electrical Engineering. Outside of work, Yunfei enjoys reading and music.

Shayan Ray is an Applied Scientist at Amazon Web Services. His area of research is all things natural language (like NLP, NLU, and NLG). His work has been focused on conversational AI, task-oriented dialogue systems, and LLM-based agents. His research publications are on natural language processing, personalization, and reinforcement learning.

Jose Cassio dos Santos Junior is a Senior Data Scientist member of the MLU team. He is responsible for Curriculum Development for Advanced Modules. As a previous Senior Data Scientist on the AWS LATAM Professional Services Data Science team, he has over 20 years of experience working as a software engineer and more than 10 years of teaching experience at colleges and as an instructor for Linux certification preparation and Microsoft Innovation Center bootcamps. As a business process management expert, he participated in BPO projects for more than 7 years. He holds a Master’s degree in Computer Engineering, a Bachelor’s degree in Physics, and a Bachelor’s degree in Business Administration, specialized in IT Quantitative Methods.


Maximize your file server data’s potential by using Amazon Q Business on Amazon FSx for Windows

Maximize your file server data’s potential by using Amazon Q Business on Amazon FSx for Windows

Organizations need efficient ways to access and analyze their enterprise data. Amazon Q Business addresses this need as a fully managed generative AI-powered assistant that helps you find information, generate content, and complete tasks using enterprise data. It provides immediate, relevant information while streamlining tasks and accelerating problem-solving.

Amazon FSx for Windows File Server is a fully managed Windows file system that provides high-performance file storage for Windows-based applications. You can use Amazon FSx to lift and shift your on-premises Windows file server workloads to the cloud, taking advantage of the scalability, durability, and cost-effectiveness of AWS while maintaining full compatibility with your existing Windows applications and tooling.

Amazon Q Business is designed to be secure and private, seamlessly integrating with your existing identity provider (IdP). It works directly with your identities, roles, and permission sets, making sure users can’t access data they are not authorized to. Additionally, Amazon Q Business seamlessly integrates with multiple enterprise data stores, including FSx for Windows File Server, enabling you to index documents from file server systems and perform tasks such as summarization, Q&A, or data analysis of large numbers of files effortlessly.

In this post, we demonstrate how to use the Amazon Q connector for FSx for Windows File Server, explore a practical use case, and provide step-by-step instructions to help you get started and gain insights out of your data stored in FSx for Windows File Server.

Overview of the Amazon Q data source connector

A data source connector is a mechanism for integrating and synchronizing data from multiple repositories, including Microsoft SharePoint, Salesforce, Amazon Simple Storage Service (Amazon S3) buckets, and even your internal FSx for Windows File Server into one container index. Amazon Q Business offers multiple data source connectors that can connect to your data sources and help you create your generative AI solution with minimal configuration. For a list of supported connectors, see Supported connectors.

Supported document types

Amazon Q boasts impressive versatility, supporting a wide range of document types stored in various places in your environment, including Windows shares (FSx for Windows File Server). It can ingest and understand formats ranging from plaintext, PDF, HTML, XML, and JSON to Microsoft formats like Excel, Word, and PowerPoint. This provides a comprehensive search experience for your enterprise users.

Secure access with supported authentication types

Security is job zero at AWS, and Amazon Q has been built keeping that in mind. It supports a variety of authentication types, seamlessly integrating with your existing identity management systems. Whether you use single sign-on (SSO) or a custom authentication solution, Amazon Q can adapt to your specific needs.

Fine-grained control with ACLs and identity crawling

For organizations with highly sensitive data, Amazon Q offers an extra layer of security. Amazon Q Business supports crawling access control lists (ACLs) for document security by default. When you connect an Amazon FSx (Windows) data source to Amazon Q Business, it crawls ACL information attached to a document (user and group information) from the directory service of the Amazon FSx instance.

Overview of solution

The following diagram shows a high-level architecture of how AWS Managed Active Directory users, through AWS IAM Identity Center, can access and interact with an Amazon Q Business application. This enables an authenticated user to securely and privately interact with the application and gain insights from the enterprise data stored in FSx for Windows File Server, using the Amazon Q Business web experience from their web browser.

In this post, we walk you through the process of integrating Amazon Q Business with FSx for Windows File Server to extract meaningful insights from your file system using natural language processing (NLP). This solution enables you to interact with your file system data using conversational AI, making information discovery more intuitive and efficient.

To set up your Amazon Q Business application, complete the following high-level steps:

  1. Create a new Amazon Q application.
  2. Select the retriever.
  3. Add a data source (FSx for Windows File Server).
  4. Synchronize your file system data.

Lastly, we demonstrate the application functionality by testing its access for two different users.

Prerequisites

To implement this solution, you should have an AWS account with administrative privileges.

Follow the instructions in the GitHub repository’s README file to provision the infrastructure required for exploring the Amazon Q connector for FSx for Windows File Server.

Create an Amazon Q Business application

Complete the following steps to create a new Amazon Q Business application:

  1. On the Amazon Q Business console, choose Applications in the navigation pane.
  2. Choose Create application.

  3. For Application name, enter a name (for example, anycompany-filesystem-knowledgebase).
  4. For Access management method, select AWS IAM Identity Center.

If you completed the prerequisites, then IAM Identity Center is already enabled, and you should see the instance ARN listed.

  5. Under Quick start user, for Select user, choose your users.
  6. Leave Select subscription as Q Business Pro.
  7. For Application details, use the default values.
  8. Choose Create.

In the next step, you will select the data source to retrieve and index the data.

Select the retriever

In this step, you select the retriever to connect data sources to the application. There are two options: use a native retriever or use Amazon Kendra. For this example, we use a native retriever.

  1. On the application details page, under Q Recommendations, choose Data sources.

  2. Choose Select retriever.

  3. For Retrievers, select Native.
  4. For Index provisioning, select Enterprise.
  5. For Number of units, enter 1.
  6. Choose Confirm.

Add a data source

Complete the following steps to add a data source:

  1. On the application details page, choose Add data source.
  2. Search for Amazon FSx and choose the plus sign next to Amazon FSx (Windows).

  3. In the Name and description section, enter a name (for example, anycompany-filesystem-source) and an optional description.
  4. In the Source section, for Amazon FSx file system ID, choose the file system ID you created as a prerequisite.
  5. In the Authorization section, leave as default (ACLs are enabled for the connector).

  6. In the Authentication section, for AWS Secrets Manager secret, choose the AWS Secrets Manager secret that holds the active directory credentials to communicate with Amazon FSx to crawl the file system (QBusiness-fsx-creds).
  7. In the Configure VPC and security group section, provide the following information:
    • For Virtual Private Cloud (VPC), choose the virtual private cloud (VPC) created as a prerequisite (amazon-connector-for-win-fsx-blog-vpc).
    • For Subnets, choose the private subnets that hold the FSx for Windows File System and active directory instance.
    • For VPC security groups, choose your security group (<stack-name>-DefaultSecurityGroup).

  8. In the IAM role section, provide the following information:
    1. For IAM role, choose Create a new service role.
    2. For Role name, enter a name for the role.
  9. In the Sync scope section, provide the following information:
    1. For Maximum file size, use the default option of 50 MB.
    2. Under Regex patterns, you can add inclusion and exclusion patterns. For this post, we add an inclusion pattern for PDF file types, so the Amazon Q crawler will include PDF files.

  10. In the Sync mode section, select Full sync.

Full sync is preferable for the first sync; for subsequent runs, you can choose only the modified data.

  11. In the Sync run schedule section, for Frequency, choose Run on demand.

You also have the option to run the sync on a recurring basis like hourly or daily.

  12. In the Tags section, you can optionally add tags.

  13. In the Field mappings section, keep the default field mappings.

The Amazon Q connector offers seven fields. Modifying field mappings and adding custom fields will be available after you create the application and retriever. For more information on the field mappings, refer to Amazon FSx (Windows) data source connector field mappings.

  14. Choose Add data source.

Synchronize your file system data

When the data source is successfully created, a banner message appears. In the banner message (or on the data source details page), choose Sync now to sync your file system data.

You can monitor the status of the sync, which includes direct links to Amazon CloudWatch logs.

The sync can take a few minutes to a few hours to complete. Sync speeds are limited by factors such as remote repository throughput and throttling, network bandwidth, and the size of documents.
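
If you prefer to check sync progress programmatically instead of in the console, the following is a minimal sketch using the boto3 Amazon Q Business client to list sync jobs for a data source. The IDs are placeholders, and you should confirm the operation and parameter names against the current boto3 API reference for your SDK version.

import boto3

qbusiness = boto3.client("qbusiness", region_name="us-east-1")

# Placeholder IDs; copy the real values from the Amazon Q Business console
response = qbusiness.list_data_source_sync_jobs(
    applicationId="your-application-id",
    indexId="your-index-id",
    dataSourceId="your-data-source-id",
)

for job in response.get("history", []):
    print(job.get("executionId"), job.get("status"), job.get("metrics"))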

When the sync is complete, you should see the stats on the scan, which includes the number of items scanned and failed.

For this post, we have two Active Directory groups, ml-engineers and security-engineers. Each group has one user (John Doe and Jane Smith, respectively), and each user has access to only one whitepaper based on their group (Choosing a generative AI service and AWS Security Incident Response Guide, respectively). The following diagram illustrates this access.

Validate the Amazon Q application functionality

Now that you have completed the setup, you can validate the application functionality by testing the access controls. We test the access of two users, John Doe and Jane Smith, who are users of the ml-engineers group and security-engineers group, respectively. You can retrieve the user name and password for each user from Secrets Manager. The secret name for John Doe is jdoe, and for Jane Smith, it’s jsmith.

  1. On the application details page, in the Web experience settings section, choose the link for the deployed URL.

  2. Sign in as John Doe.

A successful login directs you to the Amazon Q Business chat interface. This window serves as the main workspace where users interact with the application, as shown in the following screenshot.

With the test configuration, John Doe has access to only one document: generative-ai-on-aws-how-to-choose.pdf. You can test the access controls by asking questions about this whitepaper through the chat interface. This restricted access demonstrates the effective implementation of document-level permissions.

  3. For our first question, we ask What are the key factors to consider when choosing a generative AI service?

The following screenshot shows the response.

  4. Next, we ask Does Amazon Bedrock provide an option to customize the model?

The response includes citations from Amazon Q with reference to the source data.

Testing confirms that John Doe successfully receives responses to questions about content from generative-ai-on-aws-how-to-choose.pdf. You can ask additional questions about generative AI services, such as:

  • What are the generative AI service offerings from AWS?
  • What is Amazon Q optimized for?
  • What are critical factors to consider when choosing an appropriate foundational model?

Next, we test access to the security incident response guide.

  5. We ask What are the four phases of the AWS security incident response process?

When asking questions about security topics from aws-security-incident-response-guide.pdf, the system returns no results. This behavior validates that document indexing respects the configured access permissions, and users can only access content they’re authorized to view.

  6. To validate access controls for the security-engineers user group, log in as Jane Smith.

You can test with questions about security incident response:

  • What are the key objectives of an AWS security incident response plan?
  • What are the four phases of the AWS security incident response process?
  • What are the recommended steps for containing and eradicating a security incident in AWS?
  • What types of data should be collected during an AWS security incident investigation?
  • What are the key considerations for recovering from an AWS security incident?

Troubleshooting

If you encounter issues during the setup or operation of your Amazon Q Business application with FSx for Windows File Server, refer to the detailed troubleshooting guide in the README file. The guide provides solutions for common configuration challenges and operational issues you might experience.

Clean up

To avoid ongoing charges, we recommend cleaning up the resources you created while following this guide. For step-by-step cleanup instructions, refer to the README file.

Conclusion

In this post, we provided an overview of the Amazon Q FSx connector and how you can use it for safe and seamless integration of generative AI assistance with your enterprise data source. By using Amazon Q in your organization, you can enable employees to be more data-driven, efficient, prepared, and productive. Lastly, we demonstrated how simple NLP search through Amazon Q Business enhances your ability to discover insights from your enterprise data more quickly and respond to your needs faster.

The Amazon Q Business application offers a compelling solution for organizations seeking to enhance their data-driven capabilities. By using its NLP and secure data source integration features, you can unlock the true value of your data and empower your teams to be more productive and efficient in their work.

To learn more about the Amazon Q connector for FSx for Windows File Server, refer to Connecting Amazon FSx (Windows) to Amazon Q Business.


About the Authors

Manjunath Arakere is a Senior Solutions Architect on the Worldwide Public Sector team at AWS, based in Atlanta, Georgia. He partners with AWS customers to design and scale well-architected solutions, supporting their cloud migrations and modernization initiatives. With extensive experience in the field, Manjunath specializes in migration strategies, application modernization, serverless, and Generative AI (GenAI). He is passionate about helping organizations leverage the full potential of cloud computing to drive innovation and operational efficiency. Outside of work, Manjunath enjoys outdoor runs, tennis, volleyball, and challenging his son in PlayStation soccer games.

Imtranur Rahman is an experienced Sr. Solutions Architect in WWPS team with 14+ years of experience. Imtranur works with large AWS Global SI partners and helps them build their cloud strategy and broad adoption of Amazon’s cloud computing platform. Imtranur specializes in Containers, Dev/SecOps, GitOps, microservices based applications, hybrid application solutions, application modernization and loves innovating on behalf of his customers. He is highly customer obsessed and takes pride in providing the best solutions through his extensive expertise.


Generate synthetic counterparty (CR) risk data with generative AI using Amazon Bedrock LLMs and RAG

Generate synthetic counterparty (CR) risk data with generative AI using Amazon Bedrock LLMs and RAG

Data is the lifeblood of modern applications, driving everything from application testing to machine learning (ML) model training and evaluation. As data demands continue to surge, the emergence of generative AI models presents an innovative solution. These large language models (LLMs), trained on expansive data corpora, possess the remarkable capability to generate new content across multiple media formats—text, audio, and video—and across various business domains, based on provided prompts and inputs.

In this post, we explore how you can use these LLMs with advanced Retrieval Augmented Generation (RAG) to generate high-quality synthetic data for a finance domain use case. You can use the same technique to generate synthetic data for other business domain use cases as well. For this post, we demonstrate how to generate counterparty risk (CR) data, which would be beneficial for over-the-counter (OTC) derivatives that are traded directly between two parties, without going through a formal exchange.

Solution overview

OTC derivatives are typically customized contracts between counterparties and include a variety of financial instruments, such as forwards, options, swaps, and other structured products. A counterparty is the other party involved in a financial transaction. In the context of OTC derivatives, the counterparty refers to the entity (such as a bank, financial institution, corporation, or individual) with whom a derivative contract is made.

For example, in an OTC swap or option contract, one entity agrees to terms with another party, and each entity becomes the counterparty to the other. The responsibilities, obligations, and risks (such as credit risk) are shared between these two entities according to the contract.

As financial institutions continue to navigate the complex landscape of CR, the need for accurate and reliable risk assessment models has become paramount. For our use case, ABC Bank, a fictional financial services organization, has taken on the challenge of developing an ML model to assess the risk of a given counterparty based on their exposure to OTC derivative data.

Building such a model presents numerous challenges. Although ABC Bank has gathered a large dataset from various sources and in different formats, the data may be biased, skewed, or lack the diversity needed to train a highly accurate model. The primary challenge lies in collecting and preprocessing the data to make it suitable for training an ML model. Deploying a poorly suited model could result in misinformed decisions and significant financial losses.

We propose a generative AI solution that uses the RAG approach. RAG is a widely used approach that enhances LLMs by supplying extra information from external data sources not included in their original training. The entire solution can be broadly divided into three steps: indexing, data generation, and validation.

Data indexing

In the indexing step, we parse, chunk, and convert the representative CR data into vector format using the Amazon Titan Text Embeddings V2 model and store this information in a Chroma vector database. Chroma is an open source vector database known for its ease of use, efficient similarity search, and support for multimodal data and metadata. It offers both in-memory and persistent storage options, integrates well with popular ML frameworks, and is suitable for a wide range of AI applications. It is particularly beneficial for smaller to medium-sized datasets and projects requiring local deployment or low resource usage. The following diagram illustrates this architecture.

Here are the steps for data indexing:

  • The sample CR data is segmented into smaller, manageable chunks to optimize it for embedding generation.
  • These segmented data chunks are then passed to a method responsible for both generating embeddings and storing them efficiently.
  • The Amazon Titan Text Embeddings V2 API is called upon to generate high-quality embeddings from the prepared data chunks.
  • The resulting embeddings are then stored in the Chroma vector database, providing efficient retrieval and similarity searches for future use.

Data generation

When the user requests data for a certain scenario, the request is converted into vector format and then looked up in the Chroma database to find matches with the stored data. The retrieved data is augmented with the user request and additional prompts to Anthropic’s Claude Haiku on Amazon Bedrock. Anthropic’s Claude Haiku was chosen primarily for its speed, processing over 21,000 tokens per second, which significantly outpaces its peers. Moreover, Anthropic’s Claude Haiku’s efficiency in data generation is remarkable, with a 1:5 input-to-output token ratio. This means it can generate a large volume of data from a relatively small amount of input or context. This capability not only enhances the model’s effectiveness, but also makes it cost-efficient for our application, where we need to generate numerous data samples from a limited set of examples. Anthropic’s Claude Haiku LLM is invoked iteratively to efficiently manage token consumption and help prevent reaching the maximum token limit. The following diagram illustrates this workflow.

Here are the steps for data generation:

  • The user initiates a request to generate new synthetic counterparty risk data based on specific criteria.
  • The Amazon Titan Text Embeddings V2 LLM is employed to create embeddings for the user’s request prompts, transforming them into a machine-interpretable format.
  • These newly generated embeddings are then forwarded to a specialized module designed to identify matching stored data.
  • The Chroma vector database, which houses previously stored embeddings, is queried to find data that closely matches the user’s request.
  • The identified matching data and the original user prompts are then passed to a module responsible for generating new synthetic data.
  • Anthropic’s Claude 3 Haiku model is invoked, using both the matching embeddings and user prompts as input to create high-quality synthetic data.
  • The generated synthetic data is then parsed and formatted into a .csv file using the Pydantic library, providing a structured and validated output.
  • To confirm the quality of the generated data, several statistical methods are applied, including quantile-quantile (Q-Q) plots and correlation heat maps of key attributes, providing a comprehensive validation process.

Data validation

When validating the synthetic CR data generated by the LLM, we employed Q-Q plots and correlation heat maps focusing on key attributes such as cp_exposure, cp_replacement_cost, and cp_settlement_risk. These statistical tools serve crucial roles in promoting the quality and representativeness of the synthetic data. By using the Q-Q plots, we can assess whether these attributes follow a normal distribution, which is often expected for many financial variables. By comparing the quantiles of our synthetic data against theoretical normal distributions, we can identify significant deviations that might indicate bias or unrealistic data generation.

Simultaneously, the correlation heat maps provide a visual representation of the relationships between these attributes and others in the dataset. This is particularly important because it helps verify that the LLM has maintained the complex interdependencies typically observed in real CR data. For instance, we would expect certain correlations between exposure and replacement cost, or between replacement cost and settlement risk. By making sure these correlations are preserved in our synthetic data, we can be more confident that analyses or models built on this data will yield insights that are applicable to real-world scenarios. This rigorous validation process helps to mitigate the risk of introducing artificial patterns or biases, thereby enhancing the reliability and utility of our synthetic CR dataset for subsequent research or modeling tasks.
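
As a minimal sketch of this validation step (assuming pandas is available and the generated data has been saved to a CSV file with the attribute names used above; the file name is illustrative), the following code draws Q-Q plots and a correlation heat map with the scipy, seaborn, and matplotlib packages already listed in requirements.txt.

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from scipy import stats

df = pd.read_csv("synthetic_cr_data.csv")  # illustrative file name
key_attrs = ["cp_exposure", "cp_replacement_cost", "cp_settlement_risk"]

# Q-Q plots: compare each attribute's quantiles against a theoretical normal distribution
fig, axes = plt.subplots(1, len(key_attrs), figsize=(15, 4))
for ax, col in zip(axes, key_attrs):
    stats.probplot(df[col], dist="norm", plot=ax)
    ax.set_title(f"Q-Q plot: {col}")
plt.tight_layout()
plt.show()

# Correlation heat map: check that expected interdependencies between attributes are preserved
sns.heatmap(df[key_attrs].corr(), annot=True, cmap="coolwarm", vmin=-1, vmax=1)
plt.title("Correlation of key synthetic CR attributes")
plt.show()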

We’ve created a Jupyter notebook containing three parts to implement the key components of the solution. We provide code snippets from the notebooks for better understanding.

Prerequisites

To set up the solution and generate test data, you should have the following prerequisites:

  • Python 3 must be installed on your machine
  • We recommend installing an integrated development environment (IDE) that can run Jupyter notebooks
  • Alternatively, you can create a Jupyter notebook instance using Amazon SageMaker from the AWS console and develop the code there.
  • You need to have an AWS account with access to Amazon Bedrock and the following LLMs enabled (be careful not to share the AWS account credentials):
    • Amazon Titan Text Embeddings V2
    • Anthropic’s Claude 3 Haiku

Setup

Here are the steps to set up the environment. First, install the required packages from within the notebook:

import sys
!{sys.executable} -m pip install -r requirements.txt

The contents of requirements.txt are as follows:

boto3
langchain
langchain-community
streamlit
chromadb==0.4.15
numpy
jq
langchain-aws
seaborn
matplotlib
scipy

The following code snippet will perform all the necessary imports.

from pprint import pprint 
from uuid import uuid4 
import chromadb 
from langchain_community.document_loaders import JSONLoader 
from langchain_community.embeddings import BedrockEmbeddings
from langchain_community.vectorstores import Chroma 
from langchain_text_splitters import RecursiveCharacterTextSplitter

Index data in the Chroma database

In this section, we show how indexing of data is done in a Chroma database as a locally maintained open source vector store. This index data is used as context for generating data.

The following code snippet shows the preprocessing steps of loading the JSON data from a file and splitting it into smaller chunks:

def load_using_jsonloaer(path):
    loader = JSONLoader(path,
                            jq_schema=".[]",
                            text_content=False)
    documents = loader.load()
    return documents

def split_documents(documents):
    doc_list = [item for item in documents]
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=1200, chunk_overlap=0)
    texts = text_splitter.split_documents(doc_list)
    return texts

The following snippet shows how an Amazon Bedrock embedding instance is created. We used the Amazon Titan Embeddings V2 model:

def get_bedrock_embeddings():
    aws_region = "us-east-1"
    model_id = "amazon.titan-embed-text-v2:0" #look for latest version of model
    bedrock_embeddings = BedrockEmbeddings(model_id=model_id, region_name=aws_region)
    return bedrock_embeddings

The following code shows how the embeddings are created and then loaded in the Chroma database:

persistent_client = chromadb.PersistentClient(path="../data/chroma_index")
collection = persistent_client.get_or_create_collection("test_124")
print(collection)
# Create the Chroma vector store backed by the persistent client
vector_store_with_persistent_client = Chroma(collection_name="test_124",
                                                 persist_directory="../data/chroma_index",
                                                 embedding_function=get_bedrock_embeddings(),
                                                 client=persistent_client)
load_json_and_index(vector_store_with_persistent_client)

Generate data

The following code snippet shows the configuration used during the LLM invocation using Amazon Bedrock APIs. The LLM used is Anthropic’s Claude 3 Haiku:

import boto3
from botocore.config import Config
from langchain_aws import ChatBedrock

config = Config(
    region_name='us-east-1',
    signature_version='v4',
    retries={
        'max_attempts': 2,
        'mode': 'standard'
    }
)
bedrock_runtime = boto3.client('bedrock-runtime', config=config)
model_id = "anthropic.claude-3-haiku-20240307-v1:0" #look for latest version of model
model_kwrgs = {
    "temperature": 0,
    "max_tokens": 8000,
    "top_p": 1.0,
    "top_k": 25,
    "stop_sequences": ["company-1000"],
}
# Initialize the language model
llm = ChatBedrock(
    model_id=model_id,
    model_kwargs=model_kwrgs,
    client=bedrock_runtime,
)

The following code shows how the context is fetched by looking up the Chroma database (where data was indexed) for matching embeddings. We use the same Amazon Titan model to generate the embeddings:

def get_context(scenario):
    region_name = 'us-east-1'
    credential_profile_name = "default"
    titan_model_id = "amazon.titan-embed-text-v2:0"
    kb_context = []
    be = BedrockEmbeddings(region_name=region_name,
                           credentials_profile_name=credential_profile_name,
                           model_id=titan_model_id)

    vector_store = Chroma(collection_name="test_124", persist_directory="../data/chroma_index",
                      embedding_function=be)
    search_results = vector_store.similarity_search(scenario, k=3)
    for doc in search_results:
        kb_context.append(doc.page_content)
    return json.dumps(kb_context)

The following snippet shows how we formulated the detailed prompt that was passed to the LLM. The prompt includes placeholders for the scenario, context, records count, start index, format instructions, and the last generated record. The prompt is subjective and can be adjusted for experimentation.

# Create a prompt template
prompt_template = ChatPromptTemplate.from_template(
    "You are a financial data expert tasked with generating records "
    "representing company OTC derivative data and "
    "should be good enough for investor and lending ML model to take decisions "
    "and data should accurately represent the scenario: {scenario} \n "
    "and as per examples given in context: "
    "and context is {context} "
    "the examples given in context is for reference only, do not use same values while generating dataset."
    "generate dataset with the diverse set of samples but record should be able to represent the given scenario accurately."
    "Please ensure that the generated data meets the following criteria: "
    "The data should be diverse and realistic, reflecting various industries, "
    "company sizes, financial metrics. "
    "Ensure that the generated data follows logical relationships and correlations between features "
    "(e.g., higher revenue typically corresponds to more employees, "
    "better credit ratings, and lower risk). "
    "And Generate {count} records starting from index {start_index}. "
    "generate just JSON as per schema and do not include any text or message before or after JSON. "
    "{format_instruction} \n"
    "If continuing, start after this record: {last_record}\n"
    "If stopping, do not include this record in the output."
    "Please ensure that the generated data is well-formatted and consistent."
)

The following code snippet shows the process for generating the synthetic data. You can call this method in an iterative manner to generate more records. The input parameters include scenario, context, count, start_index, and last_record. The response is parsed according to the format instructions provided by output_parser.get_format_instructions() and later exported to CSV:

def generate_records(start_index, count, scenario, context, last_record=""):
    try:
        response = chain.invoke({
            "count": count,
            "start_index": start_index,
            "scenario": scenario,
            "context": context,
            "last_record": last_record,
            "format_instruction": output_parser.get_format_instructions(),
            "data_set_class_schema": DataSet.schema_json()
        })
        
        return response
    except Exception as e:
        print(f"Error in generate_records: {e}")
        raise e
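
The chain and output_parser objects used in generate_records aren't defined in the preceding snippets. The following is a minimal sketch of how they might be wired together; the DataSet and DerivativeRecord Pydantic models and their field names are assumptions for illustration and should match your actual schema:

# Minimal sketch (assumption): Pydantic schema and chain wiring used by generate_records()
from typing import List
from pydantic import BaseModel
from langchain_core.output_parsers import PydanticOutputParser

class DerivativeRecord(BaseModel):  # illustrative field names
    start_index: int
    cp_exposure: float
    cp_replacement_cost: float
    cp_settlement_risk: float
    risk: int

class DataSet(BaseModel):
    records: List[DerivativeRecord]

output_parser = PydanticOutputParser(pydantic_object=DataSet)  # or the CustomPydanticOutputParser defined later
chain = prompt_template | llm | output_parser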

Parsing the output generated by the LLM and representing it in CSV was quite challenging. We used a Pydantic parser to parse the JSON output generated by the LLM, as shown in the following code snippet:

class CustomPydanticOutputParser(PydanticOutputParser):
    def parse(self, text: str) -> BaseModel:
        # Extract JSON from the text
        try:
            # Find the first occurrence of '{'
            start = text.index('{')
            # Find the last occurrence of '}'
            end = text.rindex('}') + 1
            json_str = text[start:end]

            # Parse the JSON string
            parsed_json = json.loads(json_str)

            # Use the parent class to convert to Pydantic object
            return super().parse_with_cls(parsed_json)
        except (ValueError, json.JSONDecodeError) as e:
            raise ValueError(f"Failed to parse output: {e}")

The following code snippet shows how the records are generated in an iterative manner with 10 records in each invocation to the LLM:

def generate_full_dataset(total_records, batch_size, scenario, context):
    dataset = []
    total_generated = 0
    last_record = ""
    batch: DataSet = generate_records(total_generated,
                                      min(batch_size, total_records - total_generated),
                                      scenario, context, last_record)
    # print(f"batch: {type(batch)}")
    total_generated = len(batch.records)
    dataset.extend(batch.records)
    while total_generated < total_records:
        try:
            batch = generate_records(total_generated,
                                     min(batch_size, total_records - total_generated),
                                     scenario, context, batch.records[-1].json())
            processed_batch = batch.records

            if processed_batch:
                dataset.extend(processed_batch)
                total_generated += len(processed_batch)
                last_record = processed_batch[-1].start_index
                print(f"Generated {total_generated} records.")
            else:
                print("Generated an empty or invalid batch. Retrying...")
                time.sleep(10)
        except Exception as e:
            print(f"Error occurred: {e}. Retrying...")
            time.sleep(5)

    return dataset[:total_records]  # Ensure exactly the requested number of records
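
A call to generate the full dataset and persist it as CSV might look like the following; the scenario text, record count, and output file name are illustrative:

import pandas as pd

scenario = "Counterparties with deteriorating credit quality and elevated settlement risk"  # illustrative
context = get_context(scenario)
records = generate_full_dataset(total_records=100, batch_size=10, scenario=scenario, context=context)

# Flatten the Pydantic records into a DataFrame and write the CSV output
pd.DataFrame([record.dict() for record in records]).to_csv("synthetic_otc_data.csv", index=False)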

Verify the statistical properties of the generated data

We generated Q-Q plots for key attributes of the generated data: cp_exposure, cp_replacement_cost, and cp_settlement_risk, as shown in the following screenshots. The Q-Q plots compare the quantiles of the data distribution with the quantiles of a normal distribution. If the data isn’t skewed, the points should approximately follow the diagonal line.
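
A Q-Q plot for each attribute can be produced with scipy and matplotlib; the following is a minimal sketch, assuming the generated records have been loaded into a pandas DataFrame named df:

import matplotlib.pyplot as plt
from scipy import stats

# Assumes df holds the generated records, one column per attribute
for column in ["cp_exposure", "cp_replacement_cost", "cp_settlement_risk"]:
    stats.probplot(df[column], dist="norm", plot=plt)
    plt.title(f"Q-Q plot: {column}")
    plt.show()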

As the next step of verification, we created a correlation heatmap of the following attributes: cp_exposure, cp_replacement_cost, cp_settlement_risk, and risk. The heatmap is symmetric, and the diagonal elements show a value of 1 because each column is perfectly correlated with itself. The following screenshot is the correlation heatmap.
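
A similar heatmap can be produced with pandas and seaborn, again assuming the generated records are in a DataFrame named df:

import matplotlib.pyplot as plt
import seaborn as sns

# Assumes df holds the generated records
corr = df[["cp_exposure", "cp_replacement_cost", "cp_settlement_risk", "risk"]].corr()
sns.heatmap(corr, annot=True, cmap="coolwarm", vmin=-1, vmax=1)
plt.title("Correlation heatmap of generated attributes")
plt.show()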

Clean up

It’s a best practice to clean up the resources you created as part of this post to prevent unnecessary costs and potential security risks from leaving resources running. If you created the Jupyter notebook instance in SageMaker, complete the following steps:

  1. Save and shut down the notebook:
    # First save your work
    # Then close all open notebooks by clicking File -> Close and Halt 
  2. Clear the output (if needed before saving):
    # Option 1: Using notebook menu
    # Kernel -> Restart & Clear Output
    
    # Option 2: Using code
    from IPython.display import clear_output
    clear_output()
  3. Stop and delete the Jupyter notebook instance created in SageMaker:
    # Option 1: Using the AWS CLI
    # Stop the notebook instance when not in use
    aws sagemaker stop-notebook-instance --notebook-instance-name <your-notebook-name>
    
    # If you no longer need the notebook instance
    aws sagemaker delete-notebook-instance --notebook-instance-name <your-notebook-name>
    
    # Option 2: Using the SageMaker console
    # Amazon SageMaker -> Notebooks
    # Select the notebook instance, choose Actions, and choose Stop.
    # After the instance stops, choose Actions, and then choose Delete.

Responsible use of AI

Responsible AI use and data privacy are paramount when using AI in financial applications. Although synthetic data generation can be a powerful tool, it’s crucial to make sure that no real customer information is used without proper authorization and thorough anonymization. Organizations must prioritize data protection, implement robust security measures, and adhere to relevant regulations. Additionally, when developing and deploying AI models, it’s essential to consider ethical implications, potential biases, and the broader societal impact. Responsible AI practices include regular audits, transparency in decision-making processes, and ongoing monitoring to help prevent unintended consequences. By balancing innovation with ethical considerations, financial institutions can harness the benefits of AI while maintaining trust and protecting individual privacy.

Conclusion

In this post, we showed how to generate a well-balanced synthetic dataset representing various aspects of counterparty data, using RAG-based prompt engineering with LLMs. Counterparty data analysis is imperative before entering into OTC transactions between two counterparties. Because actual business data in this domain isn’t easily available, this approach lets you generate synthetic training data for your ML models at minimal cost, often within minutes. After you train the model, you can use it to make intelligent decisions before entering into an OTC derivative transaction.


About the Authors

Santosh Kulkarni is a Senior Modernization Architect with over 16 years of experience, specializing in developing serverless, container-based, and data architectures for clients across various domains. Santosh’s expertise extends to machine learning as a certified AWS ML specialist. He is currently engaged in multiple initiatives that use Amazon Bedrock and hosted foundation models.

Joyanta Banerjee is a Senior Modernization Architect with AWS ProServe and specializes in building secure and scalable cloud-native applications for customers from different industry domains. He has developed an interest in the AI/ML space, particularly in the generative AI capabilities available on Amazon Bedrock.

Mallik Panchumarthy is a Senior Specialist Solutions Architect for generative AI and machine learning at AWS. Mallik works with customers to help them architect efficient, secure, and scalable AI and machine learning applications. Mallik specializes in the generative AI services Amazon Bedrock and Amazon SageMaker.

Turbocharging premium audit capabilities with the power of generative AI: Verisk’s journey toward a sophisticated conversational chat platform to enhance customer support

Turbocharging premium audit capabilities with the power of generative AI: Verisk’s journey toward a sophisticated conversational chat platform to enhance customer support

This post is co-written with Sajin Jacob, Jerry Chen, Siddarth Mohanram, Luis Barbier, Kristen Chenowith, and Michelle Stahl from Verisk.

Verisk (Nasdaq: VRSK) is a leading data analytics and technology partner for the global insurance industry. Through advanced analytics, software, research, and industry expertise across more than 20 countries, Verisk helps build resilience for individuals, communities, and businesses. The company is committed to ethical and responsible AI development with human oversight and transparency. Verisk is using generative AI to enhance operational efficiencies and profitability for insurance clients while adhering to its ethical AI principles.

Verisk’s Premium Audit Advisory Service (PAAS®) is the leading source of technical information and training for premium auditors and underwriters. PAAS helps users classify exposure for commercial casualty insurance, including general liability, commercial auto, and workers’ compensation. PAAS offers a wide range of essential services, including more than 40,000 classification guides and more than 500 bulletins. PAAS now includes PAAS AI, the first commercially available interactive generative AI chat developed specifically for premium audit, which reduces research time and empowers users to make informed decisions by answering questions and quickly retrieving and summarizing multiple PAAS documents such as class guides, bulletins, and rating cards.

In this post, we describe the development of the customer support process in PAAS, incorporating generative AI, the data, the architecture, and the evaluation of the results. Conversational AI assistants are rapidly transforming customer and employee support. Verisk has embraced this technology and developed its own PAAS AI, which provides an enhanced self-service capability to the PAAS platform.

The opportunity

The Verisk PAAS platform houses a vast array of documents—including class guides, advisory content, and bulletins—that aid Verisk’s customers in determining the appropriate rules and classifications for workers’ compensation, general liability, and commercial auto business. When premium auditors need accurate answers within this extensive document repository, the challenges they face are:

  • Overwhelming volume – The sheer volume of documents (advisories, bulletins, and so on) makes manual searching time-consuming and inefficient
  • Slow response times – Finding accurate information within this vast repository can be slow, hindering timely decision-making
  • Inconsistent quality of responses – Manual searches might yield irrelevant or incomplete results, leading to uncertainty and potential errors

To address this issue, Verisk PAAS AI is designed to alleviate the burden by providing round-the-clock support for business processing and delivering precise and quick responses to customer queries. This technology is deeply integrated into Verisk’s newly reimagined PAAS platform, using all of Verisk’s documentation, training materials, and collective expertise. It employs a retrieval augmented generation (RAG) approach and a combination of AWS services alongside proprietary evaluations to promptly answer most user questions about the capabilities of the Verisk PAAS platform.

When deployed at scale, this PAAS AI will enable Verisk staff to dedicate more time to complex issues, critical projects, and innovation, thereby enhancing the overall customer experience. Throughout the development process, Verisk encountered several considerations, key findings, and decisions that provide valuable insights for any enterprise looking to explore the potential of generative AI.

The approach

When creating an interactive agent using large language models (LLMs), two common approaches are RAG and model fine-tuning. The choice between these methods depends on the specific use case and available data. Verisk PAAS began developing a RAG pipeline for its PAAS AI and has progressively improved this solution. Here are some reasons why continuing with a RAG architecture was beneficial for Verisk:

  • Dynamic data access – The PAAS platform is constantly evolving, adding new business functions and technical capabilities. Verisk needed to make sure its responses are based on the most current information. The RAG approach allows access to continuously updated data, providing responses with the latest information without frequently retraining the model.
  • Multiple data sources – Besides data recency, another crucial aspect is the ability to draw from multiple PAAS resources to acquire relevant context. The ease of expanding the knowledge base without the need for fine-tuning new data sources makes the solution adaptable.
  • Reduced hallucinations – Retrieval minimizes the risk of hallucinations compared with free-form text generation because responses come directly from the provided excerpts. Verisk developed an evaluation tool to enhance response quality.
  • LLM linguistics – Although appropriate context can be retrieved from enterprise data sources, the underlying LLM manages the linguistics and fluency.
  • Transparency – Verisk aimed to consistently improve the PAAS AI’s response generation ability. A RAG architecture offered the transparency required in the context retrieval process, which would ultimately be used to generate user responses. This transparency helped Verisk identify areas where document restructuring was needed.
  • Data governance – With diverse users accessing the platform and differing data access permissions, data governance and isolation were critical. Verisk implemented controls within the RAG pipeline to restrict data access based on user permissions, helping to ensure that responses are delivered only to authorized users.

Although both RAG and fine-tuning have their pros and cons, RAG is the best approach for building a PAAS AI on the PAAS platform, given Verisk’s needs for real-time accuracy, explainability, and configurability. The pipeline architecture supports iterative enhancement as the use cases for the Verisk PAAS platform develop.

Solution overview

The following diagram showcases a high-level architectural data flow that highlights various AWS services used in constructing the solution. Verisk’s system demonstrates a complex AI setup, where multiple components interact and frequently call on the LLM to provide user responses. Employing the PAAS platform to manage these varied components was an intuitive decision.

Premium Audit Advisory Service AI Pipeline

The key components are as follows:

Amazon ElastiCache

Verisk’s PAAS team determined that ElastiCache is the ideal solution for storing all chat history. This storage approach allows for seamless integration in conversational chats and enables the display of recent conversations on the website, providing an efficient and responsive user experience.
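
Verisk’s implementation isn’t shown in this post; as an illustration of this storage approach, the following is a minimal sketch of storing and retrieving per-session chat history in ElastiCache (Redis OSS) with redis-py, where the endpoint, key naming, and retention period are assumptions:

import json
import redis

# Hypothetical ElastiCache (Redis OSS) endpoint; replace with your cluster endpoint
r = redis.Redis(host="my-elasticache-endpoint.example.com", port=6379, ssl=True)

def append_message(session_id: str, role: str, text: str) -> None:
    # Store each conversation turn as a JSON string in a per-session list
    r.rpush(f"chat:{session_id}", json.dumps({"role": role, "text": text}))
    r.expire(f"chat:{session_id}", 60 * 60 * 24)  # illustrative 24-hour retention

def get_history(session_id: str) -> list:
    # Return the full conversation history for a session, oldest turn first
    return [json.loads(m) for m in r.lrange(f"chat:{session_id}", 0, -1)]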

Amazon Bedrock

Anthropic’s Claude, available in Amazon Bedrock, played various roles within Verisk’s solution:

  • Response generation – When building their PAAS AI, Verisk conducted a comprehensive evaluation of leading LLMs, using their extensive dataset to test each model’s capabilities. Through Amazon Bedrock, Verisk gained streamlined access to multiple best-in-class foundation models (FMs), enabling efficient testing and comparison across key performance criteria. The Amazon Bedrock unified API and robust infrastructure provided the ideal platform to develop, test, and deploy LLM solutions at scale. After this extensive testing, Verisk found Anthropic’s Claude model consistently outperformed across key criteria. Anthropic’s Claude demonstrated superior language understanding in Verisk’s complex business domain, allowing more pertinent responses to user questions. Given the model’s standout results across Verisk PAAS platform use cases, it was the clear choice to power the PAAS AI’s natural language capabilities.
  • Conversation summarization – When a user asks a follow-up question, the PAAS AI can continue the conversational thread. To enable this, Verisk used Claude to summarize the dialogue to update the context from ElastiCache. The full conversation summary and new excerpts are input to the LLM to generate the next response. This conversational flow allows the PAAS AI to answer user follow-up questions and have a more natural, contextual dialogue, bringing Verisk PAAS closer to having a true AI assistant that can engage in useful, back-and-forth conversations with users.
  • Keyword extraction – Keywords are extracted from user questions and previous conversations to be used for creating the new summarized prompt and to be input to Verisk’s knowledge base retrievers to perform vector similarity search.

Amazon OpenSearch Service

Primarily used for the storage of text embeddings, OpenSearch facilitates efficient document retrieval by enabling rapid access to indexed data. These embeddings serve as semantic representations of documents, allowing for advanced search capabilities that go beyond simple keyword matching. This semantic search functionality enhances the system’s ability to retrieve relevant documents that are contextually similar to the search queries, thereby improving the overall accuracy and speed of data queries. Additionally, OpenSearch functions as a semantic cache for similarity searches, optimizing performance by reducing the computational load and improving response times during data retrieval operations. This makes it an indispensable tool in the larger PAAS ecosystem, where the need for quick and precise information access is paramount.

Snowflake in Amazon

The integration of Snowflake in the PAAS AI ecosystem helps provide scalable and real-time access to data, allowing Verisk to promptly address customer concerns and improve its services. By using Snowflake’s capabilities, Verisk can perform advanced analytics, including sentiment analysis and predictive modeling, to better understand customer needs and enhance user experiences. This continuous feedback loop is vital for refining the PAAS AI and making sure it remains responsive and relevant to user demands.

Structuring and retrieving the data

An essential element in developing the PAAS AI’s knowledge base was properly structuring and effectively querying the data to deliver accurate answers. Verisk explored various techniques to optimize both the organization of the content and the methods to extract the most relevant information:

  • Chunking – A key step in preparing the accumulated questions and answers was splitting the data into individual documents to facilitate indexing into OpenSearch Service. Rather than uploading large files containing multiple pages of content, Verisk chunked the data into smaller segments by document section and character lengths. By splitting the data into small, modular chunks focused on a single section of a document, Verisk could more easily index each document and had greater success in pulling back the correct context. Chunking the data also enabled straightforward updating and reindexing of the knowledge base over time.
  • Hybrid query – When querying the knowledge base, Verisk found that standard vector search alone wasn’t enough to retrieve all the relevant contexts for a question. Therefore, a sparse BM25 search was combined with the dense vector search to create a hybrid search approach, which yielded much better context retrieval results (see the sketch after this list).
  • Data separation and filters – Another issue Verisk ran into was that, because of the vast amount of documents and the overlapping content within certain topics, incorrect documents were being retrieved for some questions that asked for specific topics that were present across multiple sources—some of these weren’t needed or appropriate in the context of the user’s question. Therefore, data separation was implemented to split the documents based on document type and filter by line of business to improve context retrieval within the application.
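
Verisk’s query code isn’t shown in the post; one common way to express such a hybrid query against OpenSearch Service is to combine a BM25 match clause and a k-NN clause in a single bool query. In the following sketch, the domain endpoint, index name, field names, and authentication setup are assumptions:

from opensearchpy import OpenSearch

# Authentication and connection options omitted for brevity; configure them for your domain
client = OpenSearch(hosts=[{"host": "my-opensearch-domain.example.com", "port": 443}], use_ssl=True)

def hybrid_search(question: str, question_embedding: list, k: int = 5) -> dict:
    # BM25 keyword relevance and k-NN vector similarity both contribute to the score
    query = {
        "size": k,
        "query": {
            "bool": {
                "should": [
                    {"match": {"content": question}},                       # sparse BM25 search
                    {"knn": {"embedding": {"vector": question_embedding,    # dense vector search
                                           "k": k}}},
                ]
            }
        },
    }
    return client.search(index="paas-documents", body=query)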

By thoroughly experimenting and optimizing both the knowledge base powering the PAAS AI and the queries to extract answers from it, Verisk was able to achieve very high answer accuracy during the proof of concept, paving the way for further development. The techniques explored—hybrid querying, HTML section chunking, and index filtering—became core elements of Verisk’s approach for extracting quality contexts.

LLM parameters and models

Experimenting with prompt structure, length, temperature, role-playing, and context was key to improving the quality and accuracy of the PAAS AI’s Claude-powered responses. The prompt design guidelines provided by Anthropic were incredibly helpful.

Verisk crafted prompts that provided Anthropic’s Claude with clear context and set roles for answering user questions. Setting the temperature to 0 helped reduce the randomness and indeterministic nature of LLM-generated responses.

Verisk also experimented with different models to improve the efficiency of the overall solution. For scenarios where latency was more important and less reasoning was required, Anthropic’s Claude Haiku was the perfect solution. For other scenarios such as question answering using provided contexts where it was more important for the LLM to be able to understand every detail given in the prompt, Anthropic’s Claude Sonnet was the better choice to balance latency, performance, and cost.

Guardrails

LLM guardrails were implemented in the PAAS AI project using both the guardrails provided by Amazon Bedrock and specialized sections within the prompt to detect unrelated questions and prompt attack attempts. Amazon Bedrock guardrails can be attached to any Amazon Bedrock model invocation call and automatically detect if the given model input and output are in violation of the language filters that are set (violence, misconduct, sexual, and so on), which helps with screening user inputs. The specialized prompts further improve LLM security by creating a second net that uses the power of the LLMs to catch any inappropriate inputs from the users.

This allows Verisk to be confident that the model will answer only within its intended purpose surrounding premium auditing services and will not be misused by threat actors.
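
To illustrate how a guardrail is attached to a model invocation, the following is a minimal sketch using the Amazon Bedrock Converse API with boto3; the model ID, guardrail ID, and version are placeholders, not Verisk’s actual configuration:

import boto3

bedrock_runtime = boto3.client("bedrock-runtime", region_name="us-east-1")

response = bedrock_runtime.converse(
    modelId="anthropic.claude-3-sonnet-20240229-v1:0",  # placeholder model ID
    messages=[{"role": "user", "content": [{"text": "How do I classify this exposure?"}]}],
    guardrailConfig={
        "guardrailIdentifier": "gr-EXAMPLE-ID",  # placeholder guardrail ID
        "guardrailVersion": "1",
    },
)
print(response["output"]["message"]["content"][0]["text"])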

PAAS Evaluation API Pipeline

After evaluating several tools such as DeepEval, Ragas, and TruLens, the Verisk PAAS team found certain limitations in using them for its specific use case. Consequently, the team decided to develop its own evaluation API, shown in the following figure.

This custom API evaluates the answers based on three major metrics:

  • Answer relevancy score – Using LLMs, the process assesses whether the answers provided are relevant to the customer’s prompt. This helps make sure that the responses are directly addressing the questions posed.
  • Context relevancy score – By using LLMs, the process evaluates whether the context retrieved is appropriate and aligns well with the question. This helps make sure that the LLM has the appropriate and accurate contexts to generate a response.
  • Faithfulness score – Using LLMs, the process checks if the responses are generated based on their retrieved context or if they are hallucinated. This is crucial for maintaining the integrity and reliability of the information provided.

This custom evaluation approach helps make sure that the answers generated are not only relevant and contextually appropriate but also faithful to the established generative AI knowledge base, minimizing the risk of misinformation. By incorporating these metrics, Verisk has enhanced the robustness and reliability of their PAAS AI, providing customers with accurate and trustworthy responses.
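
Verisk’s evaluation API isn’t published, but the general LLM-as-a-judge pattern behind a metric such as the faithfulness score can be sketched as follows; the prompt wording, scoring scale, and model choice are illustrative assumptions:

import boto3

bedrock_runtime = boto3.client("bedrock-runtime", region_name="us-east-1")

def faithfulness_score(question: str, context: str, answer: str) -> float:
    # Ask a judge model to rate how well the answer is grounded in the retrieved context
    prompt = (
        "Rate from 0 to 1 how faithful the answer is to the context. "
        "Return only the number.\n"
        f"Question: {question}\nContext: {context}\nAnswer: {answer}"
    )
    response = bedrock_runtime.converse(
        modelId="anthropic.claude-3-haiku-20240307-v1:0",  # illustrative judge model
        messages=[{"role": "user", "content": [{"text": prompt}]}],
        inferenceConfig={"temperature": 0},
    )
    return float(response["output"]["message"]["content"][0]["text"].strip())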

Feedback loop of PAAS AI platform

The Verisk PAAS team has implemented a comprehensive feedback loop mechanism, shown in the following figure, to support continuous improvement and address any issues that might arise.

This feedback loop is structured around the following key components:

  • Customer feedback analysis – The team actively collects and analyzes feedback from customers to identify potential data issues or problems with the generative AI responses. This analysis helps pinpoint specific areas that need improvement.
  • Issue categorization – After an issue is identified, it’s categorized based on its nature. If it’s a data-related issue, it’s assigned to the internal business team for resolution. If it’s an application issue, a Jira ticket is automatically created for the PAAS IT team to address and fix the problem.
  • QA test case updates – The system provides an option to update QA test cases based on the feedback received. This helps make sure that the test scenarios remain relevant and comprehensive, covering a wide range of potential issues.
  • Ground truth agreements – Ground truth agreements, which serve as the benchmark for evaluating LLM response quality, are periodically reviewed and updated. This helps make sure that the evaluation metrics remain accurate and reflective of the desired standards.
  • Ongoing evaluations – Regular evaluations of the LLM responses are conducted using the updated QA test cases and ground truth agreements. This helps in maintaining high-quality responses and quickly addressing any deviations from the expected standards.

This robust feedback loop mechanism enables Verisk to continuously fine-tune the PAAS AI, making sure that it delivers precise, relevant, and contextually appropriate answers to customer queries. By integrating customer feedback, categorizing issues efficiently, updating test scenarios, and adhering to stringent evaluation protocols, Verisk maintains a high standard of service and drives continuous improvement in its generative AI capabilities.

Business impact

Verisk initially rolled out the PAAS AI to one beta customer to demonstrate real-world performance and impact. Supporting a customer this way is a stark contrast to how Verisk has historically engaged with and supported customers, where a team would typically be allocated to interact with the customer directly. Verisk’s PAAS AI has revolutionized the way subject matter experts (SMEs) work and scales cost-effectively while still providing high-quality assistance. What previously took hours of manual review can now be accomplished in minutes, resulting in an extraordinary 96–98% reduction in processing time per specialist. This dramatic improvement in efficiency not only streamlines operations but also allows Verisk’s experts to focus on more strategic initiatives that drive greater value for the organization.

In analyzing this early usage data, Verisk uncovered additional areas where it can drive business value for its customers. As Verisk collects additional information, this data will help uncover what will be needed to improve results and prepare to roll out to a wider customer base of approximately 15,000 users.

Ongoing development will focus on expanding these capabilities, prioritized based on the collected questions. Most exciting, though, are the new possibilities on the horizon with generative AI. Verisk knows this technology is rapidly advancing and is eager to harness innovations to bring even more value to customers. As new models and techniques emerge, Verisk plans to adapt the PAAS AI to take advantage of the latest capabilities. Although the PAAS AI currently focuses on responding to user questions, this is only the starting point. Verisk plans to quickly improve its capabilities to proactively make suggestions and configure functionality directly in the system itself. The Verisk PAAS team is inspired by the challenge of pushing the boundaries of what’s possible with generative AI and is excited to test those boundaries.

Conclusion

Verisk’s development of a PAAS AI for its PAAS platform demonstrates the transformative power of generative AI in customer support and operational efficiency. Through careful data harvesting, structuring, retrieval, and the use of LLMs, semantic search functionalities, and stringent evaluation protocols, Verisk has crafted a robust system that delivers accurate, real-time answers to user questions. By continuing to enhance the PAAS AI’s features while maintaining ethical and responsible AI practices, Verisk is set to provide increased value to its customers, enable staff to concentrate on innovation, and establish new benchmarks for customer service in the insurance sector.


About the Authors

Sajin Jacob is the Director of Software Engineering at Verisk, where he leads the Premium Audit Advisory Service (PAAS) development team. In this role, Sajin plays a crucial part in designing the architecture and providing strategic guidance to eight development teams, optimizing their efficiency and ensuring the maintainability of all solutions. He holds an MS in Software Engineering from Periyar University, India.

Jerry Chen is a Lead Software Developer at Verisk, based in Jersey City. He leads the GenAI development team, working on solutions for projects within the Verisk Underwriting department to enhance application functionalities and accessibility. Within PAAS, he has worked on the implementation of the conversational RAG architecture with enhancements such as hybrid search, guardrails, and response evaluations. Jerry holds a degree in Computer Science from Stevens Institute of Technology.

Sid Mohanram is the Senior Vice President of Core Lines Technology at Verisk. His area of expertise includes data strategy, analytics engineering, and digital transformation. Sid is head of the technology organization with global teams across five countries. He is also responsible for leading the technology transformation for the multi-year Core Lines Reimagine initiative. Sid holds an MS in Information Systems from Stevens Institute of Technology.

Luis Barbier is the Chief Technology Officer (CTO) of Verisk Underwriting at Verisk. He provides guidance to the development teams’ architectures to maximize efficiency and maintainability for all underwriting solutions. Luis holds an MBA from Iona University.

Kristen Chenowith, MSMSL, CPCU, WCP, APA, CIPA, AIS, is PAAS Product Manager at Verisk. She is currently the product owner for the Premium Audit Advisory Service (PAAS) product suite, including PAAS AI, a first to market generative AI chat tool for premium audit that accelerates research for many consultative questions by 98% compared to traditional methods. Kristen holds an MS in Management, Strategy and Leadership at Michigan State University and a BS in Business Administration at Valparaiso University. She has been in the commercial insurance industry and premium audit field since 2006.

Michelle Stahl, MBA, CPCU, AIM, API, AIS, is a Digital Product Manager with Verisk. She has over 20 years of experience building and transforming technology initiatives for the insurance industry. She has worked as a software developer, project manager, and product manager throughout her career.

Arun Pradeep Selvaraj is a Senior Solutions Architect at AWS. Arun is passionate about working with his customers and stakeholders on digital transformations and innovation in the cloud while continuing to learn, build, and reinvent. He is creative, fast-paced, deeply customer-obsessed, and uses the working backward process to build modern architectures to help customers solve their unique challenges. Connect with him on LinkedIn.

Ryan Doty is a Solutions Architect Manager at AWS, based out of New York. He helps financial services customers accelerate their adoption of the AWS Cloud by providing architectural guidelines to design innovative and scalable solutions. Coming from a software development and sales engineering background, the possibilities that the cloud can bring to the world excite him.

Apoorva Kiran, PhD, is a Senior Solutions Architect at AWS, based out of New York. He is aligned with the financial service industry, and is responsible for providing architectural guidelines to design innovative and scalable fintech solutions. He specializes in developing and commercializing artificial intelligence and machine learning products. Connect with him on LinkedIn.

Build verifiable explainability into financial services workflows with Automated Reasoning checks for Amazon Bedrock Guardrails

Build verifiable explainability into financial services workflows with Automated Reasoning checks for Amazon Bedrock Guardrails

Foundation models (FMs) and generative AI are transforming how financial service institutions (FSIs) operate their core business functions. AWS FSI customers, including NASDAQ, State Bank of India, and Bridgewater, have used FMs to reimagine their business operations and deliver improved outcomes.

FMs are probabilistic in nature and produce a range of outcomes. Though these models can produce sophisticated outputs through the interplay of pre-training, fine-tuning, and prompt engineering, their decision-making process remains less transparent than classical predictive approaches. Although emerging techniques such as tool use and Retrieval Augmented Generation (RAG) aim to enhance transparency, they too rely on probabilistic mechanisms—whether in retrieving relevant context or selecting appropriate tools. Even methods such as attention visualization and prompt tracing produce probabilistic insights rather than deterministic explanations.

AWS customers operating in regulated industries such as insurance, banking, payments, and capital markets, where decision transparency is paramount, want to launch FM-powered applications with the same confidence as traditional, deterministic software. To address these challenges, we’re introducing Automated Reasoning checks in Amazon Bedrock Guardrails (preview). Automated Reasoning checks can detect hallucinations, suggest corrections, and highlight unstated assumptions in the response of your generative AI application. More importantly, Automated Reasoning checks can explain why a statement is accurate using mathematically verifiable, deterministic formal logic.

To use Automated Reasoning checks, you first create an Automated Reasoning policy by encoding a set of logical rules and variables from available source documentation. Automated Reasoning checks can then validate that the questions (prompts) and the FM-suggested answers are consistent with the rules defined in the Automated Reasoning policy using sound mathematical techniques. This fundamentally changes the approach to a solution’s transparency in FM applications, adding a deterministic verification for process-oriented workflows common in FSI organizations.

In this post, we explore how Automated Reasoning checks work through various common FSI scenarios such as insurance legal triaging, underwriting rules validation, and claims processing.

What is Automated Reasoning and how does it help?

Automated Reasoning is a field of computer science focused on mathematical proof and logical deduction—similar to how an auditor might verify financial statements or how a compliance officer makes sure that regulatory requirements are met. Rather than using probabilistic approaches such as traditional machine learning (ML), Automated Reasoning tools rely on mathematical logic to definitively verify compliance with policies and provide certainty (under given assumptions) about what a system will or won’t do. Automated Reasoning checks in Amazon Bedrock Guardrails is the first offering from a major cloud provider in the generative AI space.

The following financial example serves as an illustration.

Consider a basic trading rule: “If a trade is over $1 million AND the client is not tier-1 rated, THEN additional approval is required.”

An Automated Reasoning system would analyze this rule by breaking it down into logical components:

  1. Trade value > $1,000,000
  2. Client rating ≠ tier-1
  3. Result: Additional approval required

When presented with a scenario, the system can provide a deterministic (yes or no) answer about whether additional approval is needed, along with the exact logical path it used to reach that conclusion. For instance:

  • Scenario A – $1.5M trade, tier-2 client → Additional approval required (Both conditions met)
  • Scenario B – $2M trade, tier-1 client → No additional approval (Second condition not met)
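
Expressed as code, the rule is a pure, deterministic predicate: the same inputs always produce the same output. A minimal illustration of the two scenarios above:

def additional_approval_required(trade_value: float, client_rating: str) -> bool:
    # Rule: trade value over $1M AND client not tier-1 => additional approval required
    return trade_value > 1_000_000 and client_rating != "tier-1"

assert additional_approval_required(1_500_000, "tier-2") is True   # Scenario A: both conditions met
assert additional_approval_required(2_000_000, "tier-1") is False  # Scenario B: second condition not met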

What makes Automated Reasoning different is its fundamental departure from probabilistic approaches common in generative AI. At its core, Automated Reasoning provides deterministic outcomes where the same input consistently produces the same output, backed by verifiable proof chains that trace each conclusion to its original rules. This mathematical certainty, based on formal logic rather than statistical inference, enables complete verification of possible scenarios within defined rules (and under given assumptions).

FSIs regularly apply Automated Reasoning to verify regulatory compliance, validate trading rules, manage access controls, and enforce policy frameworks. However, it’s important to understand its limitations. Automated Reasoning can’t predict future events or handle ambiguous situations, nor can it learn from new data the way ML models do. It requires a precise, formal definition of rules and isn’t suitable for subjective decisions that require human judgment. This is where the combination of generative AI and Automated Reasoning comes into play.

As institutions seek to integrate generative AI into their decision-making processes, Amazon Bedrock Guardrails Automated Reasoning checks provides a way to incorporate Automated Reasoning into the generative AI workflow. Automated Reasoning checks deliver deterministic verification of model outputs against documented rules, complete with audit trails and mathematical proof of policy adherence. This capability makes it particularly valuable for regulated processes where accuracy and governance are essential, such as risk assessment, compliance monitoring, and fraud detection. Most importantly, through its deterministic rule-checking and explainable audit trails, Automated Reasoning checks effectively address one of the major barriers to generative AI adoption: model hallucination, where models generate unreliable or unfaithful responses to the given task.

Using Automated Reasoning checks for Amazon Bedrock in financial services

A great candidate for applying Automated Reasoning in FSI is in scenarios where a process or workflow can be translated into a set of logical rules. Hard-coding rules as programmatic functions provides deterministic outcomes, but it becomes complex to maintain and requires highly structured inputs, potentially compromising the user experience. Alternatively, using an FM as the decision engine offers flexibility but introduces uncertainty. This is because FMs operate as black boxes where the internal reasoning process remains opaque and difficult to audit. In addition, the FM’s potential to hallucinate or misinterpret inputs means that conclusions would require human verification to verify accuracy.

Solution overview

This is where Automated Reasoning checks come into play. The following diagram demonstrates the workflow to combine generative AI and Automated Reasoning to incorporate both methods.

ARC Policy Diagram

The following steps explain the workflow in detail:

  1. The source document along with the intent instructions are passed to the Automated Reasoning checks service to build the rules and variables and create an Automated Reasoning checks policy.
  2. An Automated Reasoning checks policy is created and versioned.
  3. An Automated Reasoning checks policy and version is associated with an Amazon Bedrock guardrail.
  4. An ApplyGuardrail API call is made with the question and an FM response to the associated Amazon Bedrock guardrail (see the sketch after this list).
  5. The Automated Reasoning checks model is triggered with the inputs from the ApplyGuardrail API, building logical representation of the input and FM response.
  6. An Automated Reasoning check is completed based on the created rules and variables from the source document and the logical representation of the inputs.
  7. The results of the Automated Reasoning check are shared with the user along with what rules, variables, and variable values were used in its determination, plus suggestions on what would make the assertion valid.
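
The following is a minimal sketch of step 4 using boto3, assuming a guardrail with an Automated Reasoning policy attached; the guardrail ID, version, and the question and answer strings are placeholders:

import boto3

bedrock_runtime = boto3.client("bedrock-runtime", region_name="us-west-2")

question = "Is the risk acceptable for a driver with 2 chargeable accidents and one DUI?"  # placeholder
fm_answer = "Driver has unacceptable risk."  # placeholder FM response

response = bedrock_runtime.apply_guardrail(
    guardrailIdentifier="gr-EXAMPLE-ID",  # placeholder guardrail ID
    guardrailVersion="1",
    source="OUTPUT",
    content=[
        {"text": {"text": question, "qualifiers": ["query"]}},
        {"text": {"text": fm_answer, "qualifiers": ["guard_content"]}},
    ],
)
# The validation findings (rules, extracted variables, suggestions) are returned in the assessments
print(response["assessments"])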

Prerequisites

Before you build your first Automated Reasoning check for Amazon Bedrock Guardrails, make sure you have the following:

  • An AWS account that provides access to AWS services, including Amazon Bedrock.
  • The new Automated Reasoning checks safeguard is available today in preview in Amazon Bedrock Guardrails in the US West (Oregon) AWS Region. Make sure that you have access to the Automated Reasoning checks preview within Amazon Bedrock. To request access to the preview today, contact your AWS account team. To learn more, visit Amazon Bedrock Guardrails.
  • An AWS Identity and Access Management (IAM) user set up for the Amazon Bedrock API and appropriate permissions added to the IAM user

Solution walkthrough

To build an Automated Reasoning check for Amazon Bedrock Guardrails, follow these steps:

  1. On the Amazon Bedrock console, under Safeguards in the navigation pane, select Automated Reasoning.
  2. Choose Create policy, as shown in the following screenshot.

Create Policy Console view

  1. On the Create policy section, shown in the following screenshot, enter the following inputs:
  • Name – Name of the Automated Reasoning checks policy.
  • Description – Description of the Automated Reasoning checks policy.
  • Source content – The document to create the rules and variables from. You need to upload a document in PDF format.
  • Intent – Instructions on how to approach the creation of the rules and variables.

Create Policy Form

The following sections dive into some example uses of Automated Reasoning checks.

Automated Reasoning checks for insurance underwriting rules validation

Consider a scenario for an auto insurance company’s underwriting rules validation process.

Underwriting is a fundamental function within the insurance industry, serving as the foundation for risk assessment and management. Underwriters are responsible for evaluating insurance applications, determining the level of risk associated with each applicant, and making decisions on whether to accept or reject the application based on the insurer’s guidelines and risk appetite.

One of the key challenges in underwriting is the process of rule validations, which is the verification that the information provided in the documents adheres to the insurer’s underwriting guidelines. This is a complex task that deals with unstructured data and varying document formats.

This example uses an auto insurance company’s underwriting rules guideline document. A typical underwriting manual can have rules to define unacceptable drivers, unacceptable vehicles, and other definitions, as shown in the following example:

Unacceptable drivers

  • Drivers with 3 or more DUIs.
  • For new business or additional drivers, drivers with 3 or more accidents, regardless of fault.
  • Drivers with more than 2 major violations.
  • Drivers with more than 3 chargeable accidents.
  • Military personnel not stationed in California.
  • Drivers 75 and older without a completed company Physician’s Report form.
  • Any driver disclosing physical or mental conditions that might affect the driver’s ability to safely operate a motor vehicle may be required to complete a company Physician’s Report form to verify their ability to drive. In addition, if in the course of an investigation we discover an undisclosed medical concern, a completed company Physician’s Report form will be required.
  • Any unlisted or undisclosed driver that is a household member or has regular use of a covered vehicle.

Unacceptable Vehicles

  • Vehicles principally garaged outside the state of California.
  • Vehicles with more or less than 4 wheels.
  • Vehicles with cargo capacity over 1 ton.
  • Motor vehicles not eligible to be licensed for highway use.
  • Taxicabs, limousines, emergency vehicles, escort vehicles, and buses.
  • Vehicles used for pickup or delivery of goods at any time including pizzas, magazines, and newspapers.
  • Vehicles used for public livery, conveyance, and company fleets.
  • Vehicles made available to unlisted drivers for any use including business use such as sales, farming, or artisan use (for example, pooled vehicles).
  • Vehicles used to transport nursery or school children, migrant workers, or hotel or motel guests.
  • Vehicles with permanent or removable business-solicitation logos or advertising.
  • Vehicles owned or leased by a partnership or corporation.
  • Step vans, panel vans, dump trucks, flatbed trucks, amphibious vehicles, dune buggies, motorcycles, scooters, motor homes, travel trailers, micro or kit cars, antique or classic vehicles, custom, rebuilt, altered or modified vehicles.
  • Physical damage coverage for vehicles with an ISO symbol of more than 20 for model year 2010 and earlier or ISO symbol 41 for model year 2011 and later.
  • Liability coverage for vehicles with an ISO symbol of more than 25 for vehicles with model year 2010 and earlier or ISO symbol 59 for model year 2011 and later.
  • Salvaged vehicles for comprehensive and collision coverage. Liability only policies for salvaged vehicles are acceptable.
  • Physical damage coverage for vehicles over 15 years old for new business or for vehicles added during the policy term.

For this example, we entered the following inputs for the Automated Reasoning check:

  • Name – Auto Policy Rule Validation.
  • Description – A policy document outlining the rules and criteria that define unacceptable drivers and unacceptable vehicles.
  • Source content – A document describing the companies’ underwriting manual and guidelines. You can copy and paste the example provided and create a PDF document. Upload this document as your source content.
  • Intent – Create a logical model for auto insurance underwriting policy approval. An underwriter associate will provide the driver profile and type of vehicle and ask whether a policy can be written for this potential customer. The underwriting guideline document uses a list of unacceptable driver profiles and unacceptable vehicles. Make sure to create a separate rule for each unacceptable condition listed in the document, and create a variable to capture whether the driver is an acceptable risk or not. A customer that doesn’t violate any rule is acceptable. Here is an example: ” Is the risk acceptable for a driver with the following profile? A driver has 4 car accidents, uses the car as a Uber-Taxi, and has 3 DUIs”. The model should determine: “The driver has unacceptable risks. Driving a taxi is an unacceptable risk. The driver has multiple DUIs.”

The model creates rules and variables from the source content. Depending on the size of the source content, this process may take more than 10 minutes.

The process of rule and variable creation is probabilistic in nature, and we highly recommend that you edit the created rules and variables to align better with your source content.

After the process is complete, a set of rules and variables will be created and can be reviewed and edited.

The following screenshots show an extract of the rules and variables created by the Automated Reasoning checks feature. The actual policy will have more rules and variables that can be viewed in Amazon Bedrock, but we’re not showing them here due to space limits.

Rules Underwriting Auto

The Automated Reasoning checks policy must be associated to an Amazon Bedrock guardrail. For more information, refer to Create a guardrail.

Create guardrail Console view

Test the policy

To test this policy, we considered a hypothetical scenario with an FM-generated response to validate.

Question: Is the risk acceptable for a driver with the following profile? Has 2 chargeable accidents in a span of 10 years. Driving records show a negligent driving charge and one DUI.

Answer: Driver has unacceptable risk. Number of chargeable accidents count is 2.

After entering the question and answer inputs, choose Submit, as shown in the following screenshot.

The Automated Reasoning check returned as Invalid, as shown in the following screenshot. The components shown in the screenshot are as follows:

  • Validation result – This is the Automated Reasoning checks validation output. This conclusion is reached by computing the extracted variable assignments against the rules defined in the Automated Reasoning policy.
  • Applied rules – These are the rules that were used to reach the validation result for this finding.
  • Extracted variables – This list shows how Automated Reasoning checks interpreted the input Q&A and used it to assign values to variables in the Automated Reasoning policy. These variable values are computed against the rules in the policy to reach the validation result.
  • Suggestions – When the validation result is invalid, this list shows a set of variable assignments that would make the conclusion valid. When the validation result is valid, this list shows a list of assignments that are necessary for the result to hold; these are unstated assumptions in the answer. You can use these values alongside the rules to generate a string that provides feedback to your FM.

validation result Invalid case - Underwriting Auto

The model evaluated the answer against the Automated Reasoning logical rules, and in this scenario the following rule was triggered:

“A driver is considered an acceptable risk if and only if their number of violations is less than or equal to 2.”

The Extracted variables value for violation_count is 2, and the is_acceptable_risk variable was set to false, which is wrong according to the Automated Reasoning logic. Therefore, the answer isn’t valid.

The suggested value for is_acceptable_risk is true.

Here is an example with a revised answer.

Question: Is the risk acceptable for a driver with the following profile? Has 2 chargeable accidents in a span of 10 years. Driving records show a negligent driving charge and one DUI.

Answer: Driver has acceptable risk.

Because no rules were violated, the Automated Reasoning logic determines the assertion is Valid, as shown in the following screenshot.

validation result Valid case - Underwriting Auto

Automated Reasoning checks for insurance legal triaging

For the next example, consider a scenario where an underwriter is evaluating whether a long-term care (LTC) claim requires legal intervention.

For this example, we entered the following inputs:

  • Name – Legal LTC Triage
  • Description – A workflow document outlining the criteria, process, and requirements for referring LTC claims to legal investigation
  • Source content – A document describing your LTC legal triaging process. You need to upload your own legal LTC triage document in PDF format. This document should outline the criteria, process, and requirements for referring LTC claims to legal investigation.
  • Intent – Create a logical model that validates compliance requirements for LTC claims under legal investigation. The model must evaluate individual policy conditions including benefit thresholds, care durations, and documentation requirements that trigger investigations. It should verify timeline constraints, proper sequencing of actions, and policy limits. Each requirement must be evaluated independently, where a single violation results in noncompliance. For example: “A claim has two care plan amendments within 90 days, provider records covering 10 months, and a review meeting at 12 days. Is this compliant?” The model should determine: “Not compliant because: multiple amendments require investigation, provider records must cover 12 months, and review meetings must be within 10 days.”

The process of rule and variable creation is probabilistic in nature, and we highly recommend that you edit the created rules and variables to align better with your source content.

After the process is complete, a set of rules and variables will be created. To review and edit a rule or variable, select the more options icon under Actions and then choose Edit. The following screenshots show the Rules and Variables screens.

Legal LTC Triage Rules

Legal LTC Triage Variables

Test the policy

From here we can test our Automated Reasoning checks in the test playground. Note that to do this, the Automated Reasoning checks policy must be associated with an Amazon Bedrock guardrail. To test this policy, we posed the following hypothetical scenario with an FM-generated response for the Automated Reasoning checks policy to validate.

Question: A claim with care duration of 28 months, no documentation irregularities, and total projected benefit value of $200,000 has been submitted. Does this require legal investigation?

Answer: This claim does not require legal investigation because the total projected benefit value is below $250,000 and there are no documentation irregularities.

Legal LTC Triage Playground Console

After completing the check, the Automated Reasoning tool produces the validation result, which for this example was Invalid, as shown in the following screenshot. This means the FM generated response violates one or more rules from the generated Automated Reasoning checks policy.

Legal LTC Triage Invalid result

The rule that was triggered was the following:

“A claim is flagged for legal investigation if and only if there are documentation irregularities, or the total projected benefit exceeds $250,000, or the care duration is more than 24 months, or the number of care plan amendments within a 90-day period is greater than 1.”

Based on our input, the model determined the following variable assignments:

  • total_projected_benefit (Real number) = 200,000 – The total projected monetary value of benefits for a long-term care claim
  • flag_for_legal_investigation (Boolean) = FALSE – Indicates whether a claim should be flagged for legal investigation based on the specified criteria
  • has_documentation_irregularities (Boolean) = FALSE – Presence of irregularities in the care provider’s documentation
  • care_duration_months (Integer) = 28 – The length of time for which care is provided or expected to be provided

From this, we can determine where exactly our rule was found INVALID. Our input had care_duration_months > 24 months, and flag_for_legal_investigation was set as FALSE. This invalidated our rule.

In the suggestions, we observe that for our original Q&A to be correct, we’d have to have flag_for_legal_investigation as TRUE, along with the total_projected_benefit being 200,000.

We can validate whether the suggestion will yield a VALID response by adjusting our answer to the original question to the following.

“This claim does require legal investigation even though the total projected benefit value is below $250,000 and there are no documentation irregularities.”

Legal LTC Triage Valid results

As shown in the following screenshot, no rules were triggered. However, the extracted variables and the suggestions changed.

Legal LTC Triage Extracted Variables

Now that the assertion is valid, the remaining requirements appear as unstated assumptions that, according to our rules, must hold for this to be a VALID response. We can use the suggestions to add this more granular detail to the response we return to the end user.

Automated Reasoning checks for insurance claims processing

The final example demonstrates Automated Reasoning checks for claims processing.

Claims processing is another fundamental function within insurance companies; it’s the process policyholders use to exercise their policy and get compensation for an event (a car accident, for example). Claims processors validate the claim and the beneficiaries, determine the amount of compensation, and settle the claim. This process includes verification of the people involved, proof of the incident, and a host of legal guidelines that claims processors are required to follow.

One of the key issues in claims processing is validating the claim and the parties involved. In this example, we use Automated Reasoning checks to provide recommendations to individuals attempting to file a claim in the case of a house fire.

As in the previous examples, we create an Automated Reasoning guardrail policy as follows:

  • Name – Home Owners Insurance Claims Policy
  • Description – This policy is used for the validation of homeowners’ insurance claims and includes the processes and procedures needed to file a claim.
  • Source content – A document describing the company’s homeowners insurance claims process. This document should outline the processes and procedures needed to file a claim.
  • Intent – Create a logical model that validates the requirements for homeowner claims. The model must evaluate individual policy conditions, including benefit thresholds, durations, and documentation requirements needed for the creation of a claim. It should verify timeline constraints, proper sequencing of actions, and policy limits. Each requirement must be evaluated independently, where any single violation results in noncompliance. For example: “I had a fire at my house. What documents do I need in order to file a claim?” The model should determine: “You will need to provide a fire department report, police report, photos, and your policy number.”

The following screenshots show an extract of the rules and variables created by the Automated Reasoning checks feature. The actual policy contains more rules and variables, which can be viewed in Amazon Bedrock; we don’t show them all here for brevity.

Rules Policy Claims Processing

Variables Policy Claims Processing

Test the policy

To test this policy, we considered a hypothetical scenario with an FM-generated response to validate.

Question: I had a fire at my house. What documents do I need to file a claim?

Answer: You provide a report from the fire department, a police report, photos, and policy number.

In this case, the Automated Reasoning check returned as Valid, as shown in the following screenshot. Automated Reasoning checks validated that the answer is correct and aligns to the provided claims processing document.

Valid Result - Rules Policy Claims Processing

Conclusion

In this post, we demonstrated that Automated Reasoning checks address a core challenge with FMs: verifiably demonstrating the reasoning behind decision-making. By incorporating Automated Reasoning checks into our workflow, we were able to validate a complex triage scenario and determine the exact reason why a decision was made. Automated Reasoning is deterministic, meaning that with the same ruleset, the same variables, and the same input and FM response, the determination will be reproducible. This means you can reproduce the findings for compliance or regulatory reporting.

Automated Reasoning checks in Amazon Bedrock Guardrails empowers financial service professionals to work more effectively with generative AI by providing deterministic validation of FM responses for decision-oriented documents. This enhances human decision-making by reducing hallucination risk and creating reproducible, explainable safeguards that help professionals better understand and trust FM-generated insights.

The new Automated Reasoning checks safeguard is available today in preview in Amazon Bedrock Guardrails in the US West (Oregon) AWS Region. We invite you to build your first Automated Reasoning checks policy. For detailed guidance, visit our documentation and code examples in our GitHub repo. Please share your experiences in the comments or reach out to the authors with questions. Happy building!


About the Authors

Alfredo Castillo is a Senior Solutions Architect at AWS, where he works with Financial Services customers on all aspects of internet-scale distributed systems, and specializes in Machine Learning, Natural Language Processing, Intelligent Document Processing, and GenAI. Alfredo has a background in both electrical engineering and computer science. He is passionate about family, technology, and endurance sports.

Andy Hall is a Senior Solutions Architect with AWS and is focused on helping Financial Services customers with their digital transformation to AWS. Andy has helped companies to architect, migrate, and modernize large-scale applications to AWS. Over the past 30 years, Andy has led efforts around Software Development, System Architecture, Data Processing, and Development Workflows for large enterprises.

Raj Pathak is a Principal Solutions Architect and Technical advisor to Fortune 50 and Mid-Sized FSI (Banking, Insurance, Capital Markets) customers across Canada and the United States. Raj specializes in Machine Learning with applications in Generative AI, Natural Language Processing, Intelligent Document Processing, and MLOps.

Best practices for Amazon SageMaker HyperPod task governance

At AWS re:Invent 2024, we launched a new innovation in Amazon SageMaker HyperPod on Amazon Elastic Kubernetes Service (Amazon EKS) that enables you to run generative AI development tasks on shared accelerated compute resources efficiently and reduce costs by up to 40%. Administrators can use SageMaker HyperPod task governance to govern allocation of accelerated compute to teams and projects, and enforce policies that determine the priorities across different types of tasks. The resulting improvement in utilization of compute resources enables organizations to focus on accelerating their generative AI innovation and time to market, instead of spending time coordinating resource allocation and continuously replanning their generative AI development tasks.

In this post, we provide best practices to maximize the value of SageMaker HyperPod task governance and make the administration and data science experiences seamless. We also discuss common governance scenarios when administering and running generative AI development tasks.

Prerequisites

To get started with SageMaker HyperPod task governance on an existing SageMaker HyperPod cluster orchestrated by Amazon EKS, make sure you have uninstalled any existing Kueue installation and that your cluster is running Kubernetes version 1.30 or later.
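As a quick sanity check, you can confirm the Kubernetes version and look for an existing Kueue installation with standard kubectl commands. The namespace and CRD group below are Kueue defaults, so adjust them if your installation differs:

# Confirm the control plane is running Kubernetes 1.30 or later
kubectl version

# Look for an existing Kueue installation (CRDs and the default kueue-system namespace)
kubectl get crds | grep kueue.x-k8s.io
kubectl get pods -n kueue-system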

Administration experience

Administrators are the first persona interacting with SageMaker HyperPod task governance. They are responsible for managing the cluster compute allocation according to the organization’s priorities and goals.

Managing compute

The first step to managing capacity across teams is to set up compute allocations. When setting up a compute allocation, keep in mind the following considerations:

  • What type of tasks does this team typically run?
  • Does this team constantly run tasks and require reserved capacity?
  • What is this team’s priority relative to other teams?

When setting up a compute allocation, an administrator sets the team’s fair-share weight, which provides relative prioritization compared to other teams when vying for the same idle compute. A higher weight enables a team to access unutilized resources within shared capacity sooner. As a best practice, set the fair-share weight higher for teams that will require access to capacity sooner than other teams.

After the fair-share weight is set, the administrator then sets up the quota and borrowing strategy. Quota determines the allocation per instance type within the cluster’s instance groups. Borrowing strategy determines whether a team will share or reserve their allotted capacity. To enforce proper quota management, the total reserved quota should not surpass the cluster’s available capacity for that resource. For instance, if a cluster comprises 20 ml.c5.2xlarge instances, the cumulative quota assigned to teams should remain under 20.

If the compute allocations for teams allow for “Lend and Borrow” or “Lend,” the idle capacity is shared between these teams. For example, suppose Team A has a quota of 6 but is using only 2 for its tasks, and Team B has a quota of 5 and is using 4 for its tasks. If a task requiring 4 resources is submitted to Team B, 3 will be borrowed from Team A based on Team A’s “Lend and Borrow” setting, because only 1 unit of Team B’s own quota is free. If any team’s compute allocation setting is set to “Don’t Lend,” the team will not be able to borrow any additional capacity beyond its reserved capacity.

To maintain a pool of resources that all teams can borrow from, you can set up a dedicated team whose resources bridge the gap between the other teams’ allocations and the total cluster capacity. Make sure that this cumulative resource allocation includes the appropriate instance types and doesn’t exceed the total cluster capacity. To make sure that these resources can be shared among teams, set the participating teams’ compute allocations to “Lend and Borrow” or “Lend” for this common pool of resources. In addition, every time new teams are introduced, quota allocations change, or the cluster capacity changes, revisit the quota allocations of all teams to make sure the cumulative quota remains at or below cluster capacity.
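For example, on the 20-instance ml.c5.2xlarge cluster described earlier, if Team A reserves 6 instances and Team B reserves 5, a common-pool team could be allocated the remaining 9 instances (6 + 5 + 9 = 20), keeping the cumulative quota exactly at cluster capacity while leaving a shared pool available for borrowing.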

After compute allocations have been set, the administrator also needs to set a cluster policy, which comprises two components: task prioritization and idle compute allocation. Administrators set up task prioritization, which determines the priority level for tasks running in a cluster. Next, an administrator sets the idle compute allocation setting to either “first come, first serve,” in which tasks are not prioritized, or “fair-share allocation,” in which idle compute is distributed to teams based on their fair-share weight.

Observability

To get started with observability, install the Amazon CloudWatch Observability add-on with Kueue metrics selected. The SageMaker HyperPod task governance dashboard provides a single-pane-of-glass view of cluster utilization across teams. At present, you can view running PyTorch, TensorFlow, and MPI tasks. Administrators can analyze the graphs within the dashboard to understand equity in resource sharing and utilization of resources.
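If you prefer to script the add-on installation rather than use the console, an EKS add-on can typically be enabled with the AWS CLI. The following is a sketch with a placeholder cluster name; the Kueue metrics selection made in the console corresponds to the add-on’s configuration values, which we don’t reproduce here:

# Install the CloudWatch Observability add-on on the EKS cluster backing SageMaker HyperPod
aws eks create-addon \
    --cluster-name <your-eks-cluster-name> \
    --addon-name amazon-cloudwatch-observability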

To view utilization of resources, users can see the following dashboard showing GPU and vCPU utilization. These graphs show administrators where teams can further maximize their GPU utilization. In this example, administrators observe GPU utilization of around 52%.

Administrators have a real-time view of utilization of instances as tasks are running or moved to pending during preemption. In this example, the ML engineering team is borrowing 5 GPUs for their training task.

With SageMaker HyperPod, you can additionally set up observability tools of your choice. In our public workshop, we have steps on how to set up Amazon Managed Service for Prometheus and Grafana dashboards.

Data scientist experience

Data scientists are the second persona interacting with SageMaker HyperPod clusters. Data scientists are responsible for the training, fine-tuning, and deployment of models on accelerated compute instances. It’s important to make sure data scientists have the necessary capacity and permissions when interacting with clusters of GPUs.

Access control

When working with SageMaker HyperPod task governance, data scientists assume their specific role. Each data science team needs its own role and associated role-based access control (RBAC) on the cluster. RBAC prevents data scientists from submitting tasks to teams to which they don’t belong. For more information about data science role permissions, see AWS Identity and Access Management for SageMaker HyperPod. As a best practice, administrators should limit data scientists according to the principle of least privilege. After roles and access entries are set up, data scientists can assume their associated AWS Identity and Access Management (IAM) role to submit tasks to the corresponding namespaces. Note that users interacting with the console dashboard who didn’t create the associated EKS cluster will need to have their role added as an access entry for the EKS cluster.
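For example, an IAM role that needs dashboard visibility into an EKS cluster it didn’t create can be added as an access entry. The following AWS CLI sketch uses placeholder names; any additional access policies or Kubernetes RBAC bindings are associated separately:

# Add an IAM role as an access entry on the EKS cluster backing the HyperPod cluster
aws eks create-access-entry \
    --cluster-name <your-eks-cluster-name> \
    --principal-arn arn:aws:iam::<account-id>:role/<data-scientist-role>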

Submitting tasks

There are two ways to submit tasks on Amazon EKS orchestrated SageMaker HyperPod clusters: kubectl and the SageMaker HyperPod CLI. With both options, data scientists need to reference their team’s namespace and task priority class in the task configuration file in order to use their allocated quota with appropriate prioritization. If the user doesn’t specify a priority class, SageMaker HyperPod task governance automatically assigns the lowest priority.

In the following code snippet, we show the labels required in a kubectl manifest file for the researchers namespace with inference priority. Priority classes have -priority appended to the name set in the cluster policy. For further guidance on submitting tasks to SageMaker HyperPod task governance, refer to the documentation.

metadata:
    name: job-name
    namespace: hyperpod-ns-researchers
    labels:
        kueue.x-k8s.io/queue-name: hyperpod-ns-researchers-localqueue
        kueue.x-k8s.io/priority-class: inference-priority
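
For context, the following is a minimal sketch of how those labels sit inside a complete Kubeflow PyTorchJob manifest; the image name, replica count, and GPU request are illustrative placeholders rather than values prescribed by SageMaker HyperPod:

apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
    name: job-name
    namespace: hyperpod-ns-researchers
    labels:
        kueue.x-k8s.io/queue-name: hyperpod-ns-researchers-localqueue
        kueue.x-k8s.io/priority-class: inference-priority
spec:
    pytorchReplicaSpecs:
        Worker:
            replicas: 1
            restartPolicy: OnFailure
            template:
                spec:
                    containers:
                        - name: pytorch
                          image: <your-training-image> # placeholder training container
                          resources:
                              limits:
                                  nvidia.com/gpu: 1

You would apply this manifest with kubectl apply -f <file>.yaml and can then track it with the kubectl get pytorchjobs command shown later in this post.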

HyperPod CLI

The HyperPod CLI was created to abstract the complexities of working with kubectl and enable developers using SageMaker HyperPod to iterate faster with custom commands. HyperPod CLI v2.0.0 introduces a new default scheduler type with autofill commands, auto discovery of namespaces, improved cluster and task management features, and enhanced visibility into task priorities and accelerator quota allocations. Data scientists can use the new HyperPod CLI to quickly submit tasks, iterate, and experiment in their generative AI development lifecycle.

Sample commands

The following is a short reference guide for helpful commands when interacting with SageMaker HyperPod task governance:

  • Describing cluster policy with the AWS CLI – This AWS Command Line Interface (AWS CLI) command is useful to view the cluster policy settings for your cluster.
  • List compute quota allocations with the AWS CLI – This AWS CLI command is useful to view the different teams set up in task governance and their respective quota allocation settings.
  • HyperPod CLI – The HyperPod CLI abstracts common kubectl commands used to interact with SageMaker HyperPod clusters, such as submitting, listing, and cancelling tasks. Refer to the HyperPod CLI documentation for a full list of commands.
  • kubectl – You can also use kubectl to interact with task governance with the following example commands:
    • kubectl get pytorchjobs -n hyperpod-ns-<team-name> – This command shows you the PyTorch tasks running in the specified team namespace.
    • kubectl get workloads -n hyperpod-ns-<team-name> / kubectl describe workload <workload-name> -n hyperpod-ns-<team-name> – These commands show the workloads running in your cluster per namespace and provide detailed reasonings on Kueue Admission. You can use these commands to answer questions such as “Why was my task preempted?” or “Why did my task get admitted?”

Common scenarios

SageMaker HyperPod task governance enables allocating compute quota to teams, increasing utilization of compute resources, reducing costs, and accelerating waiting tasks by priority, which in turn accelerates time to market. To relate these value propositions to real-world scenarios, we will talk about an enterprise and a startup situation.

Enterprises have different teams working towards various business goals, each with budgets that limit their compute access. To maximize resource utilization within budget constraints, SageMaker HyperPod task governance allows enterprises to allocate compute quotas to teams for artificial intelligence and machine learning (AI/ML) tasks. When teams use up their allocation, they can access idle compute from other teams to accelerate waiting tasks, providing optimal resource utilization across the organization.

Startups aim to maximize compute resource utilization while achieving timely allocation for high-priority tasks. SageMaker HyperPod task governance’s prioritization feature allows you to assign priorities to different task types, such as prioritizing inference over training. This makes sure that high-priority tasks receive necessary compute resources before lower-priority ones, optimizing overall resource allocation.

Now we will walk you through two common scenarios for users interacting with SageMaker HyperPod task governance.

Scenario 1: Enterprise

In the first scenario, we have an enterprise company that wants to manage compute allocations to optimize for cost. This company has five teams sharing 80 GPUs, with the following configuration:

  • Team 1 – Compute allocation: 20; Strategy: Don’t Lend
  • Team 2 – Compute allocation: 20; Strategy: Don’t Lend
  • Team 3 – Compute allocation: 5; Strategy: Lend & Borrow at 150%; Fair-share weight: 100
  • Team 4 – Compute allocation: 10; Strategy: Lend & Borrow at 100%; Fair-share weight: 75
  • Team 5 – Compute allocation: 25; Strategy: Lend & Borrow at 50%; Fair-share weight: 50

This sample configuration reserves capacity for teams that will be constantly using instances for high-priority tasks. In addition, a few teams have the option to lend and borrow idle compute from other teams. This improves cost optimization by reserving capacity as needed and allowing less consistent workloads to run using available idle compute with prioritization.

Scenario 2: Startup

In the second scenario, we have a startup customer who wants to provide equitable compute allocation for members of their engineering and research teams. This company has three teams sharing 15 GPUs:

  • Team 1 (ML engineering) – Compute allocation: 6; Strategy: Lend & Borrow at 50%; Fair-share weight: 100
  • Team 2 (Researchers) – Compute allocation: 5; Strategy: Lend & Borrow at 50%; Fair-share weight: 100
  • Team 3 (Real-time chatbot) – Compute allocation: 4; Strategy: Don’t Lend; Fair-share weight: 100

This sample configuration promotes equitable compute allocation across the company because all teams have the same fair-share weight and are able to preempt tasks with lower priority.

Conclusion

In this post, we discussed best practices for efficient use of SageMaker HyperPod task governance. We also provided certain patterns that you can adopt while administering generative AI tasks, whether you are aiming to optimize for cost or optimize for equitable compute allocation. To get started with SageMaker HyperPod task governance, refer to the Amazon EKS Support in Amazon SageMaker HyperPod workshop and SageMaker HyperPod task governance.


About the Author

Nisha Nadkarni is a Senior GenAI Specialist Solutions Architect at AWS, where she guides companies through best practices when deploying large scale distributed training and inference on AWS. Prior to her current role, she spent several years at AWS focused on helping emerging GenAI startups develop models from ideation to production.

Chaitanya Hazarey leads software development for SageMaker HyperPod task governance at Amazon, bringing extensive expertise in full-stack engineering, ML/AI, and data science. As a passionate advocate for responsible AI development, he combines technical leadership with a deep commitment to advancing AI capabilities while maintaining ethical considerations. His comprehensive understanding of modern product development drives innovation in machine learning infrastructure.

Kareem Syed-Mohammed is a Product Manager at AWS. He is focused on compute optimization and cost governance. Prior to this, at Amazon QuickSight, he led embedded analytics and developer experience. In addition to QuickSight, he has been with AWS Marketplace and Amazon retail as a Product Manager. Kareem started his career as a developer for call center technologies, Local Expert and Ads for Expedia, and as a management consultant at McKinsey.
