Enhance speech synthesis and video generation models with RLHF using audio and video segmentation in Amazon SageMaker
As generative AI models advance in creating multimedia content, the difference between good and great output often lies in the details that only human feedback can capture. Audio and video segmentation provides a structured way to gather this detailed feedback, allowing models to learn through reinforcement learning from human feedback (RLHF) and supervised fine-tuning (SFT). Annotators can precisely mark and evaluate specific moments in audio or video content, helping models understand what makes content feel authentic to human viewers and listeners.
Take, for instance, text-to-video generation, where models need to learn not just what to generate but how to maintain consistency and natural flow across time. When creating a scene of a person performing a sequence of actions, factors like the timing of movements, visual consistency, and smoothness of transitions contribute to the quality. Through precise segmentation and annotation, human annotators can provide detailed feedback on each of these aspects, helping models learn what makes a generated video sequence feel natural rather than artificial. Similarly, in text-to-speech applications, understanding the subtle nuances of human speech—from the length of pauses between phrases to changes in emotional tone—requires detailed human feedback at a segment level. This granular input helps models learn how to produce speech that sounds natural, with appropriate pacing and emotional consistency. As large language models (LLMs) increasingly integrate more multimedia capabilities, human feedback becomes even more critical in training them to generate rich, multi-modal content that aligns with human quality standards.
The path to creating effective AI models for audio and video generation presents several distinct challenges. Annotators need to identify precise moments where generated content matches or deviates from natural human expectations. For speech generation, this means marking exact points where intonation changes, where pauses feel unnatural, or where emotional tone shifts unexpectedly. In video generation, annotators must pinpoint frames where motion becomes jerky, where object consistency breaks, or where lighting changes appear artificial. Traditional annotation tools, with basic playback and marking capabilities, often fall short in capturing these nuanced details.
Amazon SageMaker Ground Truth enables RLHF by allowing teams to integrate detailed human feedback directly into model training. Through custom human annotation workflows, organizations can equip annotators with tools for high-precision segmentation. This setup enables the model to learn from human-labeled data, refining its ability to produce content that aligns with natural human expectations.
In this post, we show you how to implement an audio and video segmentation solution using SageMaker Ground Truth; the complete code is available in the accompanying GitHub repository. We guide you through deploying the necessary infrastructure using AWS CloudFormation, creating an internal labeling workforce, and setting up your first labeling job. We demonstrate how to use Wavesurfer.js for precise audio visualization and segmentation, configure both segment-level and full-content annotations, and build the interface for your specific needs. We cover both console-based and programmatic approaches to creating labeling jobs, and provide guidance on extending the solution with your own annotation needs. By the end of this post, you will have a fully functional audio/video segmentation workflow that you can adapt for various use cases, from training speech synthesis models to improving video generation capabilities.
Feature overview
The integration of Wavesurfer.js in our UI provides a detailed waveform visualization where annotators can instantly see patterns in speech, silence, and audio intensity. For instance, when working on speech synthesis, annotators can visually identify unnatural gaps between words or abrupt changes in volume that might make generated speech sound robotic. The ability to zoom into these waveform patterns means they can work with millisecond precision—marking exactly where a pause is too long or where an emotional transition happens too abruptly.
In this snapshot of audio segmentation, we are capturing a customer-representative conversation, annotating speaker segments, emotions, and transcribing the dialogue. The UI allows for playback speed adjustment and zoom functionality for precise audio analysis.
The multi-track feature lets annotators create separate tracks for evaluating different aspects of the content. In a text-to-speech task, one track might focus on pronunciation accuracy, another on emotional consistency, and a third on natural pacing. For video generation tasks, annotators can mark segments where motion flows naturally, where object consistency is maintained, and where scene transitions work well. They can adjust playback speed to catch subtle details and use the visual timeline to set precise start and end points for each marked segment.
In this snapshot of video segmentation, we’re annotating a scene with dogs, tracking individual animals, their colors, emotions, and gaits. The UI also enables overall video quality assessment, scene change detection, and object presence classification.
Annotation process
Annotators begin by choosing Add New Track and selecting appropriate categories and tags for their annotation task. After creating the track, they choose Begin Recording at the point where they want to start a segment. As the content plays, they monitor the audio waveform or video frames until they reach the desired end point, then choose Stop Recording. The newly created segment appears in the right pane, where they can add classifications, transcriptions, or other relevant labels. This process can be repeated for as many segments as needed, with the ability to adjust segment boundaries, delete incorrect segments, or create new tracks for different annotation purposes.
Importance of high-quality data and reducing labeling errors
High-quality data is essential for training generative AI models that can produce natural, human-like audio and video content. The performance of these models depends directly on the accuracy and detail of human feedback, which stems from the precision and completeness of the annotation process. For audio and video content, this means capturing not just what sounds or looks unnatural, but exactly when and how these issues occur.
Our purpose-built UI in SageMaker Ground Truth addresses common challenges in audio and video annotation that often lead to inconsistent or imprecise feedback. When annotators work with long audio or video files, they need to mark precise moments where generated content deviates from natural human expectations. For example, in speech generation, an unnatural pause might last only a fraction of a second, but its impact on perceived quality is significant. The tool’s zoom functionality allows annotators to expand these brief moments across their screen, making it possible to mark the exact start and end points of these subtle issues. This precision helps models learn the fine details that separate natural from artificial-sounding speech.
Solution overview
This audio/video segmentation solution combines several AWS services to create a robust annotation workflow. At its core, Amazon Simple Storage Service (Amazon S3) serves as the secure storage for input files, manifest files, annotation outputs, and the web UI components. SageMaker Ground Truth provides annotators with a web portal to access their labeling jobs and manages the overall annotation workflow. The following diagram illustrates the solution architecture.
The UI template, which includes our specialized audio/video segmentation interface built with Wavesurfer.js, requires specific JavaScript and CSS files. These files are hosted through an Amazon CloudFront distribution, providing reliable and efficient delivery to annotators’ browsers. By using CloudFront with an origin access identity and appropriate bucket policies, we make sure the UI components are served securely to annotators. This setup follows AWS best practices for least-privilege access, making sure CloudFront can only access the specific UI files needed for the annotation interface.
Pre-annotation and post-annotation AWS Lambda functions are optional components that can enhance the workflow. The pre-annotation Lambda function can process the input manifest file before data is presented to annotators, enabling any necessary formatting or modifications. Similarly, the post-annotation Lambda function can transform the annotation outputs into specific formats required for model training. These functions provide flexibility to adapt the workflow to specific needs without requiring changes to the core annotation process.
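As an illustration, a minimal pre-annotation Lambda handler might look like the following sketch. The handler contract (a dataObject input, and a response with taskInput and isHumanAnnotationRequired) comes from Ground Truth; the taskObject and metadata field names are hypothetical and must match whatever your UI template reads from task.input.

```python
def lambda_handler(event, context):
    """Pre-annotation handler: shape each manifest record before it
    reaches the annotation UI."""
    # Ground Truth passes the current manifest line under "dataObject".
    data_object = event["dataObject"]

    # Pass the media location through and keep the remaining fields as
    # metadata; "taskObject" and "metadata" are hypothetical names that
    # must match what your UI template expects.
    task_input = {
        "taskObject": data_object.get("source-ref", ""),
        "metadata": {k: v for k, v in data_object.items() if k != "source-ref"},
    }
    return {"taskInput": task_input, "isHumanAnnotationRequired": "true"}
```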
The solution uses AWS Identity and Access Management (IAM) roles to manage permissions:
- A SageMaker Ground Truth IAM role enables access to Amazon S3 for reading input files and writing annotation outputs
- If used, Lambda function roles provide the necessary permissions for preprocessing and postprocessing tasks
Let’s walk through the process of setting up your annotation workflow. We start with a simple scenario: you have an audio file stored in Amazon S3, along with some metadata like a call ID and its transcription. By the end of this walkthrough, you will have a fully functional annotation system where your team can segment and classify this audio content.
Prerequisites
For this walkthrough, make sure you have the following:
- Familiarity with SageMaker Ground Truth labeling jobs and the workforce portal
- Basic understanding of CloudFormation templates
- An AWS account with permissions to deploy CloudFormation stacks
- A SageMaker Ground Truth private workforce configured for labeling jobs
- Permissions to launch CloudFormation stacks that create and configure S3 buckets, CloudFront distributions, and Lambda functions automatically
Create your internal workforce
Before we dive into the technical setup, let’s create a private workforce in SageMaker Ground Truth. This allows you to test the annotation workflow with your internal team before scaling to a larger operation.
- On the SageMaker console, choose Labeling workforces.
- Choose Private for the workforce type and create a new private team.
- Add team members using their email addresses—they will receive instructions to set up their accounts.
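If you prefer to script this step, you can also create the private team with the AWS SDK. The following sketch assumes you already have an Amazon Cognito user pool, user group, and app client (the console flow creates these for you); the team name and description are placeholder values.

```python
def build_workteam_request(user_pool_id, user_group, client_id):
    """Build a create_workteam request for a Cognito-backed private team.

    The Cognito identifiers are assumed to exist already; the team name
    and description below are placeholders.
    """
    return {
        "WorkteamName": "audio-video-annotation-team",
        "Description": "Internal team for audio/video segmentation tasks",
        "MemberDefinitions": [
            {
                "CognitoMemberDefinition": {
                    "UserPool": user_pool_id,
                    "UserGroup": user_group,
                    "ClientId": client_id,
                }
            }
        ],
    }


def create_private_team(user_pool_id, user_group, client_id):
    import boto3  # imported lazily so the builder above stays testable offline

    sagemaker = boto3.client("sagemaker")
    response = sagemaker.create_workteam(
        **build_workteam_request(user_pool_id, user_group, client_id)
    )
    return response["WorkteamArn"]
```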
Deploy the infrastructure
Although this post demonstrates deployment with a CloudFormation template for quick setup, you can also set up the components manually. The assets (JavaScript and CSS files) are available in our GitHub repository. Complete the following steps for manual deployment:
- Download these assets directly from the GitHub repository.
- Host them in your own S3 bucket.
- Set up your own CloudFront distribution to serve these files.
- Configure the necessary permissions and CORS settings.
This manual approach gives you more control over infrastructure setup and might be preferred if you have existing CloudFront distributions or a need to customize security controls and assets.
The rest of this post will focus on the CloudFormation deployment approach, but the labeling job configuration steps remain the same regardless of how you choose to host the UI assets.
This CloudFormation template creates and configures the following AWS resources:
- S3 bucket for UI components:
- Stores the UI JavaScript and CSS files
- Configured with CORS settings required for SageMaker Ground Truth
- Accessible only through CloudFront, not directly public
- Permissions are set using a bucket policy that grants read access only to the CloudFront Origin Access Identity (OAI)
- CloudFront distribution:
- Provides secure and efficient delivery of UI components
- Uses an OAI to securely access the S3 bucket
- Is configured with appropriate cache settings for optimal performance
- Access logging is enabled, with logs being stored in a dedicated S3 bucket
- S3 bucket for CloudFront logs:
- Stores access logs generated by CloudFront
- Is configured with the required bucket policies and ACLs to allow CloudFront to write logs
- Object ownership is set to ObjectWriter to enable ACL usage for CloudFront logging
- Lifecycle configuration is set to automatically delete logs older than 90 days to manage storage
- Lambda function:
- Downloads UI files from our GitHub repository
- Stores them in the S3 bucket for UI components
- Runs only during initial setup and uses least privilege permissions
- Permissions include Amazon CloudWatch Logs for monitoring and specific S3 actions (read/write) limited to the created bucket
After the CloudFormation stack deployment is complete, you can find the CloudFront URLs for the JavaScript and CSS files on the AWS CloudFormation console. Note these values; you will use them to update your UI template when creating the labeling job.
Prepare your input manifest
Before you create the labeling job, you need to prepare an input manifest file that tells SageMaker Ground Truth what data to present to annotators. The manifest structure is flexible and can be customized based on your needs. For this post, we use a simple structure:
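As a sketch, a manifest for this scenario could be generated as follows. The source-ref key is the standard Ground Truth attribute for the S3 object location; the call-id and transcription fields and the bucket path are hypothetical examples you can rename freely.

```python
import json

# Hypothetical manifest records: "source-ref" is the standard Ground Truth
# key for the S3 object; the other fields are example metadata.
records = [
    {
        "source-ref": "s3://your-input-bucket/audio/call-001.wav",
        "call-id": "call-001",
        "transcription": "Hello, thank you for calling...",
    },
]

# Ground Truth expects one JSON object per line (JSON Lines).
manifest = "\n".join(json.dumps(r) for r in records)
with open("input.manifest", "w") as f:
    f.write(manifest + "\n")
```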
You can adapt this structure to include additional metadata that your annotation workflow requires. For example, you might want to add speaker information, timestamps, or other contextual data. The key is making sure your UI template is designed to process and display these attributes appropriately.
Create your labeling job
With the infrastructure deployed, let’s create the labeling job in SageMaker Ground Truth. For full instructions, refer to Accelerate custom labeling workflows in Amazon SageMaker Ground Truth without using AWS Lambda.
- On the SageMaker console, choose Create labeling job.
- Give your job a name.
- Specify your input data location in Amazon S3.
- Specify an output bucket where annotations will be stored.
- For the task type, select Custom labeling task.
- In the UI template field, locate the placeholder values for the JavaScript and CSS files and update them as follows:
  - Replace audiovideo-wavesufer.js with your CloudFront JavaScript URL from the CloudFormation stack outputs.
  - Replace audiovideo-stylesheet.css with your CloudFront CSS URL from the CloudFormation stack outputs.
- Before you launch the job, use the Preview feature to verify your interface.
You should see the Wavesurfer.js interface load correctly with all controls working properly. This preview step is crucial—it confirms that your CloudFront URLs are correctly specified and the interface is properly configured.
Programmatic setup
Alternatively, you can create your labeling job programmatically using the CreateLabelingJob API. This is particularly useful for automation or when you need to create multiple jobs. See the following code:
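The following sketch shows one way to assemble and submit the request with boto3. The label attribute name, task title, worker count, timeout, and all ARNs and S3 URIs are placeholders to replace with your own values; pre- and post-annotation Lambdas are omitted here, which Ground Truth permits for custom workflows.

```python
def build_labeling_job_request(job_name, manifest_uri, output_uri,
                               ui_template_uri, workteam_arn, role_arn):
    """Assemble a CreateLabelingJob request for a custom labeling task.

    Every value below other than the structure itself is a placeholder.
    """
    return {
        "LabelingJobName": job_name,
        "LabelAttributeName": "audio-video-segments",
        "InputConfig": {
            "DataSource": {"S3DataSource": {"ManifestS3Uri": manifest_uri}}
        },
        "OutputConfig": {"S3OutputPath": output_uri},
        "RoleArn": role_arn,
        "HumanTaskConfig": {
            "WorkteamArn": workteam_arn,
            "UiConfig": {"UiTemplateS3Uri": ui_template_uri},
            "TaskTitle": "Audio/video segmentation",
            "TaskDescription": "Mark and classify segments in the media file",
            "NumberOfHumanWorkersPerDataObject": 1,
            "TaskTimeLimitInSeconds": 3600,
        },
    }


def create_job(**kwargs):
    import boto3  # imported lazily so the builder above stays testable offline

    sagemaker = boto3.client("sagemaker")
    return sagemaker.create_labeling_job(**build_labeling_job_request(**kwargs))
```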
The API approach offers the same functionality as the SageMaker console, but allows for automation and integration with existing workflows. Whether you choose the SageMaker console or API approach, the result is the same: a fully configured labeling job ready for your annotation team.
Understanding the output
After your annotators complete their work, SageMaker Ground Truth will generate an output manifest in your specified S3 bucket. This manifest contains rich information at two levels:
- Segment-level classifications – Details about each marked segment, including start and end times and assigned categories
- Full-content classifications – Overall ratings and classifications for the entire file
Let’s look at a sample output to understand its structure:
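Because the output shape is driven by your UI template, the exact fields will vary; a hypothetical record could look like the following, with segment-level entries under a track list and a separate full-content block. All field names and values here are illustrative.

```python
import json

# Hypothetical output record; the exact field names depend on your UI template.
sample_output = {
    "source-ref": "s3://your-input-bucket/audio/call-001.wav",
    "audio-video-segments": {
        # One entry per marked segment, with timing and assigned labels.
        "trackSegments": [
            {
                "track": "speaker",
                "startTime": 2.35,
                "endTime": 7.81,
                "labels": {"speaker": "customer", "emotion": "frustrated"},
            }
        ],
        # Ratings that apply to the entire file.
        "fullContentClassification": {
            "overallQuality": "good",
            "backgroundNoise": "low",
        },
    },
}

print(json.dumps(sample_output, indent=2))
```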
This two-level annotation structure provides valuable training data for your AI models, capturing both fine-grained details and overall content assessment.
Customizing the solution
Our audio/video segmentation solution is designed to be highly customizable. Let’s walk through how you can adapt the interface to match your specific annotation requirements.
Customize segment-level annotations
The segment-level annotations are controlled in the report() function of the JavaScript code. The following code snippet shows how you can modify the annotation options for each segment:
You can remove existing fields or add new ones based on your needs. Make sure you update the data model (the updateTrackListData function) to handle your custom fields.
Modify full-content classifications
For classifications that apply to the entire audio/video file, you can modify the HTML template. The following code is an example of adding custom classification options:
The classifications you add here will be included in your output manifest, allowing you to capture both segment-level and full-content annotations.
Extending Wavesurfer.js functionality
Our solution uses Wavesurfer.js, an open source audio visualization library. Although we’ve implemented core functionality for segmentation and annotation, you can extend this further using Wavesurfer.js’s rich feature set. For example, you might want to:
- Add spectrogram visualization
- Implement additional playback controls
- Enhance zoom functionality
- Add timeline markers
For these customizations, we recommend consulting the Wavesurfer.js documentation. When implementing additional Wavesurfer.js features, remember to test thoroughly in the SageMaker Ground Truth preview to verify compatibility with the labeling workflow.
Wavesurfer.js is distributed under the BSD-3-Clause license. Although we’ve tested the integration thoroughly, modifications you make to the Wavesurfer.js implementation should be tested in your environment. The Wavesurfer.js community provides excellent documentation and support for implementing additional features.
Clean up
To clean up the resources created during this tutorial, follow these steps:
- Stop the SageMaker Ground Truth labeling job if it’s still running and you no longer need it. This will halt ongoing labeling tasks and stop additional charges from accruing.
- Empty the S3 buckets by deleting all objects within them. S3 buckets must be empty before they can be deleted, so this step is required for the stack deletion to succeed.
- Delete the CloudFormation stack to remove all the AWS resources provisioned by the template. This action will automatically delete associated services like the S3 buckets, CloudFront distribution, Lambda function, and related IAM roles.
Conclusion
In this post, we walked through implementing an audio and video segmentation solution using SageMaker Ground Truth. We saw how to deploy the necessary infrastructure, configure the annotation interface, and create labeling jobs both through the SageMaker console and programmatically. The solution’s ability to capture precise segment-level annotations along with overall content classifications makes it particularly valuable for generating high-quality training data for generative AI models, whether you’re working on speech synthesis, video generation, or other multimedia AI applications. As you develop your AI models for audio and video generation, remember that the quality of human feedback directly impacts your model’s performance—whether you’re training models to generate more natural-sounding speech, create coherent video sequences, or understand complex audio patterns.
We encourage you to visit our GitHub repository to explore the solution further and adapt it to your specific needs. You can enhance your annotation workflows by customizing the interface, adding new classification categories, or implementing additional Wavesurfer.js features. To learn more about creating custom labeling workflows in SageMaker Ground Truth, visit Accelerate custom labeling workflows in Amazon SageMaker Ground Truth without using AWS Lambda and Custom labeling workflows.
If you’re looking for a turnkey data labeling solution, consider Amazon SageMaker Ground Truth Plus, which provides access to an expert workforce trained in various machine learning tasks. With SageMaker Ground Truth Plus, you can quickly receive high-quality annotations without the need to build and manage your own labeling workflows, reducing costs by up to 40% and accelerating the delivery of labeled data at scale.
Start building your annotation workflow today and contribute to the next generation of AI models that push the boundaries of what’s possible in audio and video generation.
About the Authors
Sundar Raghavan is an AI/ML Specialist Solutions Architect at AWS, helping customers leverage SageMaker and Bedrock to build scalable and cost-efficient pipelines for computer vision applications, natural language processing, and generative AI. In his free time, Sundar loves exploring new places, sampling local eateries and embracing the great outdoors.
Vineet Agarwal is a Senior Manager of Customer Delivery in the Amazon Bedrock team responsible for Human in the Loop services. He has been at AWS for over 2 years, managing go-to-market activities and business and technical operations. Prior to AWS, he worked in the SaaS, fintech, and telecommunications industries in services leadership roles. He has an MBA from the Indian School of Business and a B.Tech in Electronics and Communications Engineering from the National Institute of Technology, Calicut (India). In his free time, Vineet loves playing racquetball and enjoying outdoor activities with his family.
Using responsible AI principles with Amazon Bedrock Batch Inference
Amazon Bedrock is a fully managed service that offers a choice of high-performing foundation models (FMs) from leading AI companies like AI21 Labs, Anthropic, Cohere, Meta, Mistral AI, Stability AI, and Amazon through a single API, along with a broad set of capabilities to build generative AI applications with security, privacy, and responsible AI.
The recent announcement of batch inference in Amazon Bedrock enables organizations to process large volumes of data efficiently at 50% less cost compared to On-Demand pricing. It’s especially useful when the use case is not latency sensitive and you don’t need real-time inference. However, as we embrace these powerful capabilities, we must also address a critical challenge: implementing responsible AI practices in batch processing scenarios.
In this post, we explore a practical, cost-effective approach for incorporating responsible AI guardrails into Amazon Bedrock Batch Inference workflows. Although we use a call center’s transcript summarization as our primary example, the methods we discuss are broadly applicable to a variety of batch inference use cases where ethical considerations and data protection are a top priority.
Our approach combines two key elements:
- Ethical prompting – We demonstrate how to embed responsible AI principles directly into the prompts used for batch inference, steering the model toward ethical outputs from the start
- Postprocessing guardrails – We show how to apply additional safeguards to the batch inference output, making sure that the remaining sensitive information is properly handled
This two-step process offers several advantages:
- Cost-effectiveness – By applying heavy-duty guardrails to only the typically shorter output text, we minimize processing costs without compromising on ethics
- Flexibility – The technique can be adapted to various use cases beyond transcript summarization, making it valuable across industries
- Quality assurance – By incorporating ethical considerations at both the input and output stages, we maintain high standards of responsible AI throughout the process
Throughout this post, we address several key challenges in responsible AI implementation for batch inference. These include safeguarding sensitive information, providing accuracy and relevance of AI-generated content, mitigating biases, maintaining transparency, and adhering to data protection regulations. By tackling these challenges, we aim to provide a comprehensive approach to ethical AI use in batch processing.
To illustrate these concepts, we provide practical step-by-step guidance on implementing this technique.
Solution overview
This solution uses Amazon Bedrock for batch inference to summarize call center transcripts, coupled with the following two-step approach to maintain responsible AI practices. The method is designed to be cost-effective and flexible while maintaining high ethical standards.
- Ethical data preparation and batch inference:
- Use ethical prompting to prepare data for batch processing
- Store the prepared JSONL file in an Amazon Simple Storage Service (Amazon S3) bucket
- Use Amazon Bedrock batch inference for efficient and cost-effective call center transcript summarization
- Postprocessing with Amazon Bedrock Guardrails:
- After the completion of initial summarization, apply Amazon Bedrock Guardrails to detect and redact sensitive information, filter inappropriate content, and maintain compliance with responsible AI policies
- By applying guardrails to the shorter output text, you optimize for both cost and ethical compliance
This two-step approach combines the efficiency of batch processing with robust ethical safeguards, providing a comprehensive solution for responsible AI implementation in scenarios involving sensitive data at scale.
In the following sections, we walk you through the key components of implementing responsible AI practices in batch inference workflows using Amazon Bedrock, with a focus on ethical prompting techniques and guardrails.
Prerequisites
To implement the proposed solution, make sure you have satisfied the following requirements:
- Have an active AWS account.
- Have an S3 bucket to store your data prepared for batch inference. To learn more about uploading files in Amazon S3, see Uploading objects.
- Have an AWS Identity and Access Management (IAM) role for batch inference with a trust policy and Amazon S3 access (read access to the folder containing input data and write access to the folder storing output data).
- Enable your selected models hosted on Amazon Bedrock. Refer to Supported Regions and models for batch inference for a complete list of supported models.
- Create a guardrail based on your specific responsible AI needs. For instructions, see Create a guardrail.
Ethical prompting techniques
When setting up your batch inference job, it’s crucial to incorporate ethical guidelines into your prompts. The following is a concise example of how you might structure your prompt:
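A sketch of such a prompt, with placeholder wording you should adapt to your own policies, might look like this:

```python
# Hypothetical prompt template; adapt the guidelines to your own policies.
ETHICAL_SUMMARIZATION_PROMPT = """Summarize the following call center transcript.

Guidelines:
- Do not include any personally identifiable information (names, phone
  numbers, account numbers); refer to participants as "the customer" and
  "the agent".
- Stick to facts explicitly stated in the transcript; do not speculate.
- Use neutral, unbiased language; avoid assumptions about gender,
  ethnicity, location, or socioeconomic status.
- Focus on the customer's issue, the steps taken, and the resolution.

Transcript:
{transcript}

Summary:"""


def build_prompt(transcript: str) -> str:
    """Insert one transcript into the template."""
    return ETHICAL_SUMMARIZATION_PROMPT.format(transcript=transcript)
```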
This prompt sets the stage for ethical summarization by explicitly instructing the model to protect privacy, minimize bias, and focus on relevant information.
Set up a batch inference job
For detailed instructions on how to set up and run a batch inference job using Amazon Bedrock, refer to Enhance call center efficiency using batch inference for transcript summarization with Amazon Bedrock. It provides detailed instructions for the following steps:
- Preparing your data in the required JSONL format
- Understanding the quotas and limitations for batch inference jobs
- Starting a batch inference job using either the Amazon Bedrock console or API
- Collecting and analyzing the output from your batch job
By following the instructions in our previous post and incorporating the ethical prompt provided in the preceding section, you’ll be well-equipped to set up batch inference jobs.
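For reference, one JSONL record for an Anthropic Claude model on Amazon Bedrock could be assembled like the following sketch. Other model families expect a different modelInput shape, and the max_tokens value and record ID are arbitrary examples.

```python
import json


def build_batch_record(record_id: str, prompt: str) -> str:
    """One JSONL line for an Anthropic Claude model on Amazon Bedrock.

    The modelInput body below follows the Anthropic messages format;
    other model families use different request shapes.
    """
    record = {
        "recordId": record_id,
        "modelInput": {
            "anthropic_version": "bedrock-2023-05-31",
            "max_tokens": 512,
            "messages": [{"role": "user", "content": prompt}],
        },
    }
    return json.dumps(record)


# Write one record per transcript to the input file for the batch job.
with open("batch_input.jsonl", "w") as f:
    f.write(build_batch_record("rec-001", "Summarize this transcript ...") + "\n")
```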
Amazon Bedrock Guardrails
After the batch inference job has run successfully, apply Amazon Bedrock Guardrails as a postprocessing step. This provides an additional layer of protection against potential ethical violations or sensitive information disclosure. The following is a simple implementation, but you can update this based on your data volume and SLA requirements:
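A minimal sketch of this postprocessing step is shown below. The guardrail ID and version are placeholders, and the fixed three-second delay is one simple way to stay under a 20-requests-per-minute quota; production code would likely batch work and handle throttling errors explicitly.

```python
import time


def mask_pii_data(bedrock_runtime, guardrail_id, guardrail_version, text):
    """Apply the guardrail to one model output; return the masked text if
    the guardrail intervened, otherwise the original text."""
    response = bedrock_runtime.apply_guardrail(
        guardrailIdentifier=guardrail_id,
        guardrailVersion=guardrail_version,
        source="OUTPUT",  # we are postprocessing model outputs
        content=[{"text": {"text": text}}],
    )
    if response["action"] == "GUARDRAIL_INTERVENED":
        return response["outputs"][0]["text"]
    return text


def process_batch_outputs(summaries, guardrail_id, guardrail_version):
    """Run every batch inference summary through the guardrail and keep
    the masked versions for comparison and analysis."""
    import boto3  # imported lazily so mask_pii_data stays testable offline

    bedrock_runtime = boto3.client("bedrock-runtime")
    masked = []
    for summary in summaries:
        masked.append(
            mask_pii_data(bedrock_runtime, guardrail_id, guardrail_version, summary)
        )
        time.sleep(3)  # simple spacing to stay under 20 requests per minute
    return masked
```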
Key points about this implementation:
- We use the apply_guardrail method from the Amazon Bedrock runtime to process each output
- The guardrail is applied to the OUTPUT source, focusing on postprocessing
- We handle rate limiting by introducing a delay between API calls, making sure that we don’t exceed the quota of 20 requests per minute
- The mask_pii_data function applies the guardrail and returns the processed text if the guardrail intervened
- We store the masked version for comparison and analysis
This approach allows you to benefit from the efficiency of batch processing while still maintaining strict control over the AI’s outputs and protecting sensitive information. By addressing ethical considerations at both the input (prompting) and output (guardrails) stages, you’ll have a comprehensive approach to responsible AI in batch inference workflows.
Although this example focuses on call center transcript summarization, you can adapt the principles and methods discussed in this post to various batch inference scenarios across different industries, always prioritizing ethical AI practices and data protection.
Ethical considerations for responsible AI
Although the prompt in the previous section provides a basic framework, there are many ethical considerations you can incorporate depending on your specific use case. The following is a more comprehensive list of ethical guidelines:
- Privacy protection – Avoid including any personally identifiable information in the summary. This protects customer privacy and aligns with data protection regulations, making sure that sensitive personal data is not exposed or misused.
- Factual accuracy – Focus on facts explicitly stated in the transcript, avoiding speculation. This makes sure that the summary remains factual and reliable, providing an accurate representation of the interaction without introducing unfounded assumptions.
- Bias mitigation – Be mindful of potential biases related to gender, ethnicity, location, accent, or perceived socioeconomic status. This helps prevent discrimination and maintains fair treatment for your customers, promoting equality and inclusivity in AI-generated summaries.
- Cultural sensitivity – Summarize cultural references or idioms neutrally, without interpretation. This respects cultural diversity and minimizes misinterpretation, making sure that cultural nuances are acknowledged without imposing subjective judgments.
- Gender neutrality – Use gender-neutral language unless gender is explicitly mentioned. This promotes gender equality and minimizes stereotyping, creating summaries that are inclusive and respectful of all gender identities.
- Location neutrality – Include location only if relevant to the customer’s issue. This minimizes regional stereotyping and focuses on the actual issue rather than unnecessary generalizations based on geographic information.
- Accent awareness – If accent or language proficiency is relevant, mention it factually without judgment. This acknowledges linguistic diversity without discrimination, respecting the varied ways in which people communicate.
- Socioeconomic neutrality – Focus on the issue and resolution, regardless of the product or service tier discussed. This promotes fair treatment regardless of a customer’s economic background, promoting equal consideration of customers’ concerns.
- Emotional context – Use neutral language to describe emotions accurately. This provides insight into customer sentiment without escalating emotions, allowing for a balanced representation of the interaction’s emotional tone.
- Empathy reflection – Note instances of the agent demonstrating empathy. This highlights positive customer service practices, encouraging the recognition and replication of compassionate interactions.
- Accessibility awareness – Include information about any accessibility needs or accommodations factually. This promotes inclusivity and highlights efforts to accommodate diverse needs, fostering a more accessible and equitable customer service environment.
- Ethical behavior flagging – Identify potentially unethical behavior without repeating problematic content. This helps identify issues for review while minimizing the propagation of inappropriate content, maintaining ethical standards in the summarization process.
- Transparency – Indicate unclear or ambiguous information in the summary. This promotes transparency and helps identify areas where further clarification might be needed, making sure that limitations in understanding are clearly communicated.
- Continuous improvement – Highlight actionable insights for improving customer service. This turns the summarization process into a tool for ongoing enhancement of service quality, contributing to the overall improvement of customer experiences.
When implementing ethical AI practices in your batch inference workflows, consider which of these guidelines are most relevant to your specific use case. You may need to add, remove, or modify instructions based on your industry, target audience, and specific ethical considerations. Remember to regularly review and update your ethical guidelines as new challenges and considerations emerge in the field of AI ethics.
Clean up
To delete the guardrail you created, follow the steps in Delete a guardrail.
Conclusion
Implementing responsible AI practices, regardless of the specific feature or method, requires a thoughtful balance of privacy protection, cost-effectiveness, and ethical considerations. In our exploration of batch inference with Amazon Bedrock, we’ve demonstrated how these principles can be applied to create a system that not only efficiently processes large volumes of data, but does so in a manner that respects privacy, avoids bias, and provides actionable insights.
We encourage you to adopt this approach in your own generative AI implementations. Start by incorporating ethical guidelines into your prompts and applying guardrails to your outputs. Responsible AI is an ongoing commitment—continuously monitor, gather feedback, and adapt your approach to align with the highest standards of ethical AI use. By prioritizing ethics alongside technological advancement, we can create AI systems that not only meet business needs, but also contribute positively to society.
About the authors
Ishan Singh is a Generative AI Data Scientist at Amazon Web Services, where he helps customers build innovative and responsible generative AI solutions and products. With a strong background in AI/ML, Ishan specializes in building Generative AI solutions that drive business value. Outside of work, he enjoys playing volleyball, exploring local bike trails, and spending time with his wife and dog, Beau.
Yanyan Zhang is a Senior Generative AI Data Scientist at Amazon Web Services, where she has been working on cutting-edge AI/ML technologies as a Generative AI Specialist, helping customers use generative AI to achieve their desired outcomes. Yanyan graduated from Texas A&M University with a PhD in Electrical Engineering. Outside of work, she loves traveling, working out, and exploring new things.
Revolutionizing knowledge management: VW’s AI prototype journey with AWS
Today, we’re excited to share the journey of Volkswagen (VW)—an innovator in the automotive industry and Europe’s largest car maker—to enhance knowledge management by using generative AI, Amazon Bedrock, and Amazon Kendra to devise a solution based on Retrieval Augmented Generation (RAG) that makes internal information more easily accessible to its users. This solution efficiently handles documents that include both text and images, significantly enhancing VW’s knowledge management capabilities within their production domain.
The challenge
VW engaged the AWS Industries Prototyping & Customer Engineering Team (AWSI-PACE) to explore ways to improve knowledge management in the production domain by building a prototype that uses advanced features of Amazon Bedrock, specifically Anthropic’s Claude 3 models, to extract and analyze information from private documents, such as PDFs containing text and images. The main technical challenge was to efficiently retrieve and process data in a multi-modal setup to provide comprehensive and accurate information from Chemical Compliance private documents.
PACE, a multi-disciplinary rapid prototyping team, focuses on delivering feature-complete initial products that enable business evaluation, determining feasibility, business value, and path to production. Using the PACE-Way (an Amazon-based development approach), the team developed a time-boxed prototype over a maximum of 6 weeks, which included a full stack solution with frontend and UX, backed by specialist expertise, such as data science, tailored for VW’s needs.
The choice of Anthropic’s Claude 3 models within Amazon Bedrock was driven by Claude’s advanced vision capabilities, enabling it to understand and analyze images alongside text. This multimodal capability is crucial for extracting insights from complex documents that contain both textual content and images, making Claude 3 ideal for querying private PDF documents that include both.
The integrated approach and ease of use of Amazon Bedrock in deploying large language models (LLMs), along with built-in features that facilitate seamless integration with other AWS services like Amazon Kendra, made it the preferred choice. By using Claude 3’s vision capabilities, we could upload image-rich PDF documents. Claude analyzes each image contained within these documents to extract text and understand the contextual details embedded in these visual elements. The extracted text and context from the images are then added to Amazon Kendra, enhancing the searchability and accessibility of information within the system. This integration ensures that users can perform detailed and accurate searches across the indexed content, using the full depth of information extracted by Claude 3.
Architecture overview
Because of the need to provide access to proprietary information, it was decided early that the prototype would use RAG. The RAG approach, now an established technique for enhancing LLMs with private knowledge, is implemented using a blend of AWS services that streamline the processing, searching, and querying of documents while at the same time meeting non-functional requirements related to efficiency, scalability, and reliability. The architecture is centered around a native AWS serverless backend, which ensures minimal maintenance and high availability together with fast development.
Core components of the RAG system
- Amazon Simple Storage Service (Amazon S3): Amazon S3 serves as the primary storage for source data. It’s also used for hosting static website components, ensuring high durability and availability.
- Amazon Kendra: Amazon Kendra provides semantic search capabilities for ranking documents and passages. It also handles the overhead of text extraction, embeddings, and managing the vector datastore.
- Amazon Bedrock: This component is critical for processing and inference. It uses machine learning models to analyze and interpret the text and image data extracted from documents, integrating these insights to generate context-aware responses to queries.
- Amazon CloudFront: Distributes the web application globally to reduce latency, offering users fast and reliable access to the RAG system’s interface.
- AWS Lambda: Provides the serverless compute environment for running backend operations without provisioning or managing servers, which scales automatically with the application’s demands.
- Amazon DynamoDB: Used for storing metadata and other necessary information for quick retrieval during search operations. Its fast and flexible NoSQL database service accommodates high-performance needs.
- AWS AppSync: Manages real-time data synchronization and communication between the users’ interfaces and the serverless backend, enhancing the interactive experience.
- Amazon Cognito: Manages user authentication and authorization, providing secure and scalable user access control. It supports integration with various identity providers to facilitate easy and secure user sign-in and registration processes.
- Amazon API Gateway: Acts as the entry point for all RESTful API requests to the backend services, offering features such as throttling, monitoring, and API version management.
- AWS Step Functions: Orchestrates the various AWS services involved in the RAG system, ensuring coordinated execution of the workflow.
Solution walkthrough
The process flow handles complex documents efficiently from the moment a user uploads a PDF. These documents are often large and contain numerous images. This workflow integrates AWS services to extract, process, and make content available for querying. This section details the steps involved in processing uploaded documents and ensuring that extracted data is searchable and contextually relevant to user queries (shown in the following figure).
Initiation and initial processing:
- User access: A user accesses the web interface through CloudFront, which allows users to upload PDFs as shown in Image A in Results. These PDFs are stored in Amazon S3.
- Text extraction: With the Amazon Kendra S3 connector, the solution indexes the S3 bucket repository of documents that the user has uploaded in Step 1. Amazon Kendra supports popular document types or formats such as PDF, HTML, Word, PowerPoint, and more. An index can contain multiple document formats. Amazon Kendra extracts the content inside the documents to make the documents searchable. The documents are parsed to optimize search on the extracted text within the documents. This means structuring the documents into fields or attributes that are used for search.
- Step function activation: When an object is created in S3, such as a user uploading a file in Step 1, the solution will launch a step function that orchestrates the document processing workflow for adding image context to the Kendra index.
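The S3-triggered launch described above can be sketched as a small Lambda handler that starts one Step Functions execution per uploaded object. This is an illustrative sketch, not the prototype's actual code: the state machine ARN is a placeholder, and the Step Functions client is passed in so the logic can be exercised without AWS access (in a real Lambda function it would be `boto3.client("stepfunctions")`).

```python
import json

def start_processing(event, sfn_client, state_machine_arn):
    """Start one Step Functions execution per uploaded S3 object.

    event            -- S3 event notification payload delivered to Lambda
    sfn_client       -- a Step Functions client (boto3.client("stepfunctions"))
    state_machine_arn -- ARN of the document-processing state machine (placeholder)
    """
    executions = []
    for record in event.get("Records", []):
        # Pass the bucket and key of the uploaded PDF to the state machine
        payload = {
            "bucket_name": record["s3"]["bucket"]["name"],
            "pdf_key": record["s3"]["object"]["key"],
        }
        result = sfn_client.start_execution(
            stateMachineArn=state_machine_arn,
            input=json.dumps(payload),
        )
        executions.append(result)
    return executions
```

Injecting the client keeps the handler easy to unit test with a stub before wiring it to the real S3 event notification.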
Image extraction and analysis:
- Extract images: While Kendra indexes the text from the uploaded file, the step function extracts the images from the document. Extracting the images from the uploaded file allows the solution to process the images using Amazon Bedrock to extract text and contextual information. The code snippet that follows provides a sample of the code used to extract the images from the PDF file and save them back to S3.
import json
import fitz  # PyMuPDF
import os
import boto3

# Initialize the S3 client
s3 = boto3.client('s3')

def lambda_handler(event, context):
    bucket_name = event['bucket_name']
    pdf_key = event['pdf_key']

    # Define the local paths
    local_pdf_path = '/tmp/' + os.path.basename(pdf_key)
    local_image_dir = '/tmp/images'

    # Ensure the image directory exists
    if not os.path.exists(local_image_dir):
        os.makedirs(local_image_dir)

    # Download the PDF from S3
    s3.download_file(bucket_name, pdf_key, local_pdf_path)

    # Open the PDF file using PyMuPDF
    pdf_file = fitz.open(local_pdf_path)
    pdf_name = os.path.splitext(os.path.basename(local_pdf_path))[0]  # Extract PDF base name for labeling
    total_images_extracted = 0  # Counter for all images extracted from this PDF
    image_filenames = []  # List to store the filenames of extracted images

    # Iterate through each page of the PDF
    for current_page_index in range(len(pdf_file)):
        # Extract images from the current page
        for img_index, img in enumerate(pdf_file.get_page_images(current_page_index)):
            xref = img[0]
            image = fitz.Pixmap(pdf_file, xref)

            # Construct image filename with a global counter
            image_filename = f"{pdf_name}_image_{total_images_extracted}.png"
            image_path = os.path.join(local_image_dir, image_filename)
            total_images_extracted += 1

            # Save the image appropriately
            if image.n < 5:  # GRAY or RGB
                image.save(image_path)
            else:  # CMYK, requiring conversion to RGB
                new_image = fitz.Pixmap(fitz.csRGB, image)
                new_image.save(image_path)
                new_image = None
            image = None

            # Upload the image back to S3
            s3.upload_file(image_path, bucket_name, f'images/{image_filename}')

            # Add the image filename to the list
            image_filenames.append(image_filename)

    # Return the response with the list of image filenames and total images extracted
    return {
        'statusCode': 200,
        'image_filenames': image_filenames,
        'total_images_extracted': total_images_extracted
    }
- Lambda function code:
- Initialization: The function initializes the S3 client.
- Event extraction: Extracts the bucket name and PDF key from the incoming event payload.
- Local path set up: Defines local paths for storing the PDF and extracted images.
- Directory creation: Ensures the directory for images exists.
- PDF download: Downloads the PDF file from S3.
- Image extraction: Opens the PDF and iterates through its pages to extract images.
- Image processing: Saves the images locally and uploads them back to S3.
- Filename collection: Collects the filenames of the uploaded images.
- Return statement: Returns the list of image filenames and the total number of images extracted.
- Text extraction from images: The image files processed from the previous step are then sent to Amazon Bedrock, where advanced models extract textual content and contextual details from the images. The step function uses a map state to iterate over the list of images, processing each one individually. Claude 3 offers image-to-text vision capabilities that can process images and return text outputs. It excels at analyzing and understanding charts, graphs, technical diagrams, reports, and other visual assets. Claude 3 Sonnet achieves comparable performance to other best-in-class models with image processing capabilities while maintaining a significant speed advantage. The following is a sample snippet that extracts the contextual information from each image in the map state.
import json
import base64
import boto3
from botocore.exceptions import ClientError

# Initialize the boto3 clients for BedrockRuntime and S3
s3 = boto3.client('s3', region_name='us-west-2')
bedrock_runtime = boto3.client('bedrock-runtime', region_name='us-west-2')

def lambda_handler(event, context):
    source_bucket = event['bucket_name']
    destination_bucket = event['destination_bucket']
    image_filename = event['image_filename']

    try:
        # Get the image from S3
        image_file = s3.get_object(Bucket=source_bucket, Key=image_filename)
        contents = image_file['Body'].read()

        # Encode the image to base64
        encoded_string = base64.b64encode(contents).decode('utf-8')

        # Prepare the payload for Bedrock
        payload = {
            "modelId": "anthropic.claude-3-sonnet-20240229-v1:0",
            "contentType": "application/json",
            "accept": "application/json",
            "body": {
                "anthropic_version": "bedrock-2023-05-31",
                "max_tokens": 4096,
                "temperature": 0.7,
                "top_p": 0.999,
                "top_k": 250,
                "messages": [
                    {
                        "role": "user",
                        "content": [
                            {
                                "type": "image",
                                "source": {
                                    "type": "base64",
                                    "media_type": "image/png",
                                    "data": encoded_string
                                }
                            },
                            {
                                "type": "text",
                                "text": "Extract all text."
                            }
                        ]
                    }
                ]
            }
        }

        # Call Bedrock to extract text from the image
        body_bytes = json.dumps(payload['body']).encode('utf-8')
        response = bedrock_runtime.invoke_model(
            body=body_bytes,
            contentType=payload['contentType'],
            accept=payload['accept'],
            modelId=payload['modelId']
        )
        response = json.loads(response['body'].read().decode('utf-8'))
        response_content = response['content'][0]
        response_text = response_content['text']

        # Save the extracted text to S3
        text_file_key = image_filename.replace('.png', '.txt')
        s3.put_object(Bucket=destination_bucket, Key=text_file_key, Body=str(response_text))

        return {
            'statusCode': 200,
            'text_file_key': text_file_key,
            'message': f"Processed and saved text for {image_filename}"
        }
    except Exception as e:
        return {
            'statusCode': 500,
            'error': str(e),
            'message': f"An error occurred processing {image_filename}"
        }
- Lambda function code:
- Initialization: The script initializes the boto3 clients for BedrockRuntime and S3 services to interact with AWS resources.
- Lambda handler: The main function (lambda_handler) is invoked when the Lambda function is run. It receives the event and context parameters.
- Retrieve image: The image file is retrieved from the specified S3 bucket using the get_object method.
- Base64 encoding: The image is read and encoded to a base64 string, which is required for sending the image data to Bedrock.
- Payload preparation: A payload is constructed with the base64 encoded image and a request to extract text.
- Invoke Amazon Bedrock: The Amazon Bedrock model is invoked using the prepared payload to extract text from the image.
- Process response: The response from Amazon Bedrock is parsed to extract the textual content.
- Save text to S3: The extracted text is saved back to the specified S3 bucket with a filename derived from the original image filename.
- Return statement: The function returns a success message and the key of the saved text file. If an error occurs, it returns an error message.
Data storage and indexing:
- Save to S3: The extracted text from the images is saved back to S3 as text files.
- Indexing by Amazon Kendra: After being saved in S3, the data is indexed by Amazon Kendra, making it searchable and accessible for queries. This indexing adds the image context to perform similarity searches in the RAG system.
User query with semantic search and inference
The semantic search and inference process of our solution plays a critical role in providing users with accurate and contextually relevant information based on their queries.
Semantic search focuses on understanding the intent and contextual meaning behind a user’s query instead of relying solely on keyword matching. Amazon Kendra, an advanced enterprise search service, uses semantic search to deliver more accurate and relevant results. By using natural language processing (NLP) and machine learning algorithms, Amazon Kendra can interpret the nuances of a query, ensuring that the retrieved documents and data align closely with the user’s actual intent.
User query handling:
- User interaction: Users submit their queries through a user-friendly interface.
Semantic search with Amazon Kendra:
- Context retrieval: Upon receiving a query, Amazon Kendra performs a semantic search to identify the most relevant documents and data. The advanced NLP capabilities of Amazon Kendra allow it to understand the intent and contextual nuances of the query.
- Provision of relevant context: Amazon Kendra provides a list of documents that are ranked based on their relevance to the user’s query. This ensures that the response is not only based on keyword matches but also on the semantic relevance of the content. Note that Amazon Kendra also uses the text extracted from images, which was processed with Amazon Bedrock, to enhance the search results.
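The context-retrieval step above can be sketched as a small helper that queries Amazon Kendra and concatenates the top-ranked passages into a single context block for the LLM. This is a hypothetical helper, not the prototype's code; the index ID and query are placeholders, and the Kendra client is passed in so the function can be tested without AWS access (in practice it would be `boto3.client("kendra")` calling the Kendra Retrieve API).

```python
def retrieve_context(kendra_client, index_id, query, max_passages=5):
    """Return the top-ranked passage texts for a query, joined as one context block.

    kendra_client -- an Amazon Kendra client (boto3.client("kendra"))
    index_id      -- the Kendra index ID (placeholder)
    query         -- the user's natural-language question
    """
    # The Retrieve API returns semantically ranked passages in ResultItems
    response = kendra_client.retrieve(IndexId=index_id, QueryText=query)
    passages = [item["Content"] for item in response["ResultItems"][:max_passages]]
    return "\n\n".join(passages)

# Example usage (requires AWS credentials and a populated index):
# import boto3
# kendra = boto3.client("kendra", region_name="us-west-2")
# context = retrieve_context(kendra, "<your-index-id>", "Which substances are restricted?")
```

The returned string is what would be substituted for `{context}` in the QA prompt passed to Amazon Bedrock.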
Inference with Amazon Bedrock:
- Contextual analysis and inference: The relevant documents and data retrieved by Amazon Kendra are then passed to Amazon Bedrock. The inference models available in Amazon Bedrock consider both the context provided by Kendra and the specific details of the user query. This dual consideration allows Amazon Bedrock to formulate responses that are not only accurate but also finely tuned to the specifics of the query. The following are the snippets for generating prompts that help Bedrock provide accurate and contextually relevant responses:
def get_qa_prompt(self):
    template = """Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.
{context}
Question: {question}"""
    return PromptTemplate(template=template, input_variables=["context", "question"])

def get_prompt(self):
    template = """The following is a friendly conversation between a human and an AI. If the AI does not know the answer to a question, it truthfully says it does not know.
Current conversation:
{chat_history}
Question: {input}"""
    input_variables = ["input", "chat_history"]
    prompt_template_args = {
        "chat_history": "{chat_history}",
        "input_variables": input_variables,
        "template": template,
    }
    prompt_template = PromptTemplate(**prompt_template_args)
    return prompt_template

def get_condense_question_prompt(self):
    template = """<conv>
{chat_history}
</conv>
<followup>
{question}
</followup>
Given the conversation inside the tags <conv></conv>, rephrase the follow up question you find inside <followup></followup> to be a standalone question, in the same language as the follow up question.
"""
    return PromptTemplate(input_variables=["chat_history", "question"], template=template)
- QA prompt explanation:
- This prompt is designed to use the context provided by Amazon Kendra to answer a question accurately. The context provided by Amazon Kendra comes from the most relevant documents and data identified by the semantic search for the user query.
- It instructs the AI to use the given context and only provide an answer if it is certain; otherwise, it should admit not knowing the answer.
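To make the template mechanics concrete, here is an illustrative fill of the QA template with a retrieved passage and a question. Plain `str.format` is used as a stand-in for LangChain's PromptTemplate, and the context and question values are hypothetical.

```python
# Illustrative only: fill the QA template with a Kendra-retrieved passage.
# str.format stands in for PromptTemplate; the values below are hypothetical.
qa_template = """Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.
{context}
Question: {question}"""

context = "Substance X is listed as compliant under regulation Y."  # hypothetical passage
question = "Is substance X compliant?"

prompt = qa_template.format(context=context, question=question)
print(prompt)
```

The resulting string is what gets sent to the Amazon Bedrock model, so the model only sees the question alongside the passages Kendra judged relevant.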
Response delivery:
- Delivery to user: This response is then delivered back to the user, completing the cycle of query and response.
Results
Our evaluation of the system revealed significant multilingual capabilities, enhancing user interaction with documents in multiple languages:
- Multilingual support: The model showed strong performance across different languages. Despite the documents being primarily in German, the system handled queries in English effectively. It translated the extracted text from the PDFs or images from German to English, providing responses in English. This feature was crucial for English-speaking users.
- Seamless language transition: The system also supports transitions between languages. Users could ask questions in German and receive responses in German, maintaining context and accuracy. This dual-language functionality significantly enhanced efficiency, catering to documents containing both German and English.
- Enhanced user experience: This multilingual capability broadened the system’s accessibility and ensured users could receive information in their preferred language, making interactions more intuitive.
Image A demonstrates a user querying their private data. The solution successfully answers the query using the private data. The answer isn’t derived from the extracted text within the files, but from an image embedded in the uploaded file.
Image B shows the specific image from which Amazon Bedrock extracted the text and added it to the index, enabling the system to provide the correct answer.
Image C also shows a scenario where, without the image context, the question cannot be answered.
Following the successful prototype development, Stefan Krawinkel from VW shared his thoughts:
“We are thrilled by the AWS team’s joy of innovation and the constant questioning of solutions for the requirements we brought to the prototype. The solutions developed give us a good overview of what is possible with generative AI, and what limits still exist today. We are confident that we will continue to push existing boundaries together with AWS to be able to offer attractive products to our customers.”
This testimonial highlights how the collaborative effort addressed the complex challenges and underscores the ongoing potential for innovation in future projects.
Additional thanks to Fabrizio Avantaggiato, Verena Koutsovagelis, and Jon Reed for their work on this prototype.
About the Authors
Rui Costa specializes in Software Engineering and currently holds the position of Principal Solutions Developer within the AWS Industries Prototyping and Customer Engineering (PACE) Team based out of Jersey City, New Jersey.
Mahendra Bairagi is a Generative AI specialist who currently holds a position of Principal Solutions Architect – Generative AI within the AWS Industries and Customer Engineering (PACE) team. Throughout his more than 9 years at AWS, Mahendra has held a variety of pivotal roles, including Principal AI/ML Specialist, IoT Specialist, Principal Product Manager and head of Sports Innovations Lab. In these capacities, he has consistently led innovative solutions, driving significant advancements for both customers and partners.
Fine-tune large language models with Amazon SageMaker Autopilot
Fine-tuning foundation models (FMs) is a process that involves exposing a pre-trained FM to task-specific data and adjusting its parameters. The model can then develop a deeper understanding of the domain and produce more accurate and relevant outputs.
In this post, we show how to use an Amazon SageMaker Autopilot training job with the AutoMLV2 SDK to fine-tune a Meta Llama2-7B model on question answering tasks. Specifically, we train the model on multiple-choice science exam questions covering physics, chemistry, and biology. This fine-tuning approach can be extended to other tasks, such as summarization or text generation, in domains like healthcare, education, or financial services.
AutoMLV2 supports the instruction-based fine-tuning of a selection of general-purpose FMs powered by Amazon SageMaker JumpStart. We use Amazon SageMaker Pipelines, which helps automate the different steps, including data preparation, fine-tuning, and creating the model. We use the open source library fmeval to evaluate the model and register it in the Amazon SageMaker Model Registry based on its performance.
Solution overview
The following architecture diagram shows the various steps involved in creating an automated and scalable process to fine-tune large language models (LLMs) using AutoMLV2. The AutoMLV2 SDK simplifies the process of creating and managing AutoML jobs by providing high-level functions and abstractions, making it straightforward for developers who may not be familiar with AutoML concepts. The CreateAutoMLJobV2 API offers a low-level interface that allows for more control and customization. Using the SDK offers benefits like faster prototyping, better usability, and pre-built functions, and the API is better for advanced customizations.
To implement the solution, we use SageMaker Pipelines in Amazon SageMaker Studio to orchestrate the different steps. The solution consists of two pipelines: training and inference.
To create the training pipeline, you complete the following steps:
- Load and prepare the dataset.
- Create a SageMaker Autopilot CreateAutoMLJobV2 training job.
- Check the training job status.
- Deploy the best candidate model.
The following steps configure the inference pipeline:
- Preprocess data for evaluation.
- Evaluate the model using the fmeval library.
- Register the model if it meets the required performance.
To deploy the solution, refer to the GitHub repo, which provides step-by-step instructions for fine-tuning Meta Llama2-7B using SageMaker Autopilot and SageMaker Pipelines.
Prerequisites
For this walkthrough, complete the following prerequisite steps:
- Set up an AWS account.
- Create a SageMaker Studio environment.
- Create two AWS Identity and Access Management (IAM) roles, LambdaExecutionRole and SageMakerExecutionRole, with permissions as outlined in the SageMaker notebook. The managed policies should be scoped down further for improved security. For instructions, refer to Create a role to delegate permissions to an IAM user.
- On the SageMaker Studio console, upload the code from the GitHub repo.
- Open the SageMaker notebook (ipynb file) and run the cells.
Training pipeline
The following training pipeline shows a streamlined way to automate the fine-tuning of a pre-trained LLM and the deployment of the model to a real-time inference endpoint.
Prepare the data
For this project, we used the SciQ dataset, which contains science exam questions about physics, chemistry, biology, and other subjects. SageMaker Autopilot supports instruction-based fine-tuning datasets formatted as CSV files (default) or as Parquet files.
When you prepare your CSV file, make sure that it contains exactly two columns:
- The input column must be in string format and contain the prompt
- The output column must be in string format and contain the ground truth answer
In this project, we start by removing the irrelevant columns. Next, we combine the question and support columns to create a comprehensive prompt, which is then placed in the input column. SageMaker Autopilot sets a maximum limit on the number of rows in the dataset and the context length based on the type of model being used. We select 10,000 rows from the dataset.
Finally, we divide the data into training and validation sets:
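The preparation steps above can be sketched as follows. This is an illustrative sketch rather than the notebook's actual code: the SciQ record field names (`question`, `support`, `correct_answer`) and the 10% validation fraction are assumptions, and plain Python stands in for whatever dataframe library the notebook uses.

```python
import csv
import random

def build_rows(records):
    """Combine each question with its supporting evidence into a single prompt.

    Assumes SciQ-style records with 'question', 'support', and 'correct_answer'
    fields (an assumption for illustration).
    """
    rows = []
    for rec in records:
        prompt = f"Context: {rec['support']}\nQuestion: {rec['question']}"
        rows.append({"input": prompt, "output": rec["correct_answer"]})
    return rows

def train_val_split(rows, val_fraction=0.1, seed=42):
    """Shuffle deterministically and hold out a validation slice."""
    rows = rows[:]
    random.Random(seed).shuffle(rows)
    cut = int(len(rows) * (1 - val_fraction))
    return rows[:cut], rows[cut:]

def write_csv(path, rows):
    """Write the two-column CSV format SageMaker Autopilot expects."""
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["input", "output"])
        writer.writeheader()
        writer.writerows(rows)
```

The two CSV files produced this way map directly onto the training and validation channels of the AutoMLV2 job.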
Create a CreateAutoMLJobV2 training job
AutoMLV2 makes it straightforward to train, optimize, and deploy machine learning (ML) models by automating the tasks involved in the ML development lifecycle. It provides a simple approach to create highly accurate models tailored to your specific problem type, whether it’s classification, regression, forecasting, or others. In this section, we go through the steps to train a model with AutoMLV2, using an LLM fine-tuning job as an example. For this project, we used the Meta Llama2-7B model. You can change the model by choosing from the supported LLMs for fine-tuning.
Define the text generation configuration
AutoMLV2 automates the entire ML process, from data preprocessing to model training and deployment. However, for AutoMLV2 to work effectively, it’s crucial to provide the right problem configuration. This configuration acts as a guide, helping SageMaker Autopilot understand the nature of your problem and select the most appropriate algorithm or approach. By specifying details such as the problem type (such as classification, regression, forecasting, or fine-tuning), you give AutoMLV2 the necessary information to tailor its solution to your specific requirements.
For a fine-tuning job, the configuration consists of determining the model to be used and its access configuration, in addition to the hyperparameters that optimize the model learning process. See the following code:
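As a sketch of that configuration, the fields below mirror the TextGenerationJobConfig shape used by CreateAutoMLJobV2, populated with the hyperparameter values discussed next. Treat the exact field names and the model identifier string as assumptions to verify against the current API reference.

```python
# Sketch of the fine-tuning problem configuration for AutoMLV2.
# Field names follow the CreateAutoMLJobV2 TextGenerationJobConfig shape;
# verify against the current API reference before use.
text_generation_config = {
    # Model identifier string is an assumption; check the supported-models list
    "BaseModelName": "Llama2-7B",
    # Hyperparameter values are strings, per the API contract
    "TextGenerationHyperParameters": {
        "epochCount": "3",
        "learningRate": "0.00001",
        "batchSize": "1",
        "learningRateWarmupSteps": "1",
    },
    # Accepting the EULA is required for models such as Meta Llama2-7B
    "ModelAccessConfig": {"AcceptEula": True},
}
```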
The definitions of each parameter used in text_generation_config are:
- base_model_name – The name of the base model to fine-tune. SageMaker Autopilot supports fine-tuning a variety of LLMs. If no value is provided, the default model used is Falcon7BInstruct.
- accept_eula – The access configuration file to control access to the ML model. The value is set to True to accept the model end-user license agreement (EULA). This setting is necessary for models like Meta Llama2-7B, which require accepting the license terms before they can be used.
- epochCount – The number of times the model goes through the entire training dataset. Its value should be a string containing an integer value within the range of 1–10. One epoch means the Meta Llama2-7B model has been exposed to the 10,000 samples and had a chance to learn from them. You can set it to 3, meaning the model will make three complete passes, or increase the number if the model doesn’t converge with just three epochs.
- learningRate – The step size at which a model’s parameters are updated during training. Its value should be a string containing a floating-point value within the range of 0–1. A learning rate of 0.00001 or 0.00002 is a good standard when fine-tuning LLMs like Meta Llama2-7B.
- batchSize – The number of data samples used in each iteration of training. Its value should be a string containing an integer value within the range of 1–64. Start with 1 in order to avoid an out-of-memory error.
- learningRateWarmupSteps – The number of training steps during which the learning rate gradually increases before reaching its target or maximum value. Its value should be a string containing an integer value within the range of 0–250. Start with 1.
The configuration settings can be adjusted to align with your specific requirements and the chosen FM.
Start the AutoMLV2 job
Next, set up the AutoMLV2 job by providing the problem configuration details, the AWS role with the necessary permissions, a base name for job identification, and the output path where the model artifacts will be saved. To initiate the training process in a pipeline step, we invoke the create_auto_ml_job_v2 method to create an AutoML job object with specific inputs. The AutoMLJobInputDataConfig parameter takes a list that includes an AutoMLDataChannel, which specifies the type of data (in this case, ‘S3Prefix’) and the location of the training dataset (given by train_dataset_s3_path.default_value) in an S3 bucket. The channel_type is set to ‘training’, indicating that this dataset is used for training the model.
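A hedged sketch of that request using the boto3 SageMaker client follows; the job name, S3 URIs, and role ARN are placeholders, and the API call itself is commented out:

```python
# Illustrative create_auto_ml_job_v2 request; replace the placeholder job
# name, S3 URIs, and role ARN with your own values.
train_dataset_s3_path = "s3://<your-bucket>/train/"  # assumed training prefix

request = {
    "AutoMLJobName": "llm-fine-tuning-job",          # assumed job name
    "AutoMLJobInputDataConfig": [
        {
            # One AutoMLDataChannel: an S3 prefix holding the training data
            "ChannelType": "training",
            "DataSource": {
                "S3DataSource": {
                    "S3DataType": "S3Prefix",
                    "S3Uri": train_dataset_s3_path,
                }
            },
        }
    ],
    "OutputDataConfig": {"S3OutputPath": "s3://<your-bucket>/output/"},
    "AutoMLProblemTypeConfig": {
        "TextGenerationJobConfig": {"BaseModelName": "Llama2-7B"}
    },
    "RoleArn": "arn:aws:iam::111122223333:role/<YourSageMakerRole>",
}

# import boto3
# boto3.client("sagemaker").create_auto_ml_job_v2(**request)
```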
Check SageMaker Autopilot job status
This step tracks the status of the Autopilot training job. In the script check_autopilot_job_status.py, we repeatedly check the status of the training job until it’s complete.
The callback step sends a token in an Amazon Simple Queue Service (Amazon SQS) queue, which invokes the AWS Lambda function to check the training job status. If the job is complete, the Lambda function sends a success message back to the callback step and the pipeline continues with the next step.
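A simplified sketch of such a status-checking Lambda follows; the SQS message field names and the SageMaker calls (shown commented out) are assumptions about the wiring, not the post's exact script:

```python
import json

TERMINAL_STATES = {"Completed", "Failed", "Stopped"}

def decide_callback(job_status: str) -> str:
    """Map an AutoML job status to the action the callback step should take."""
    if job_status == "Completed":
        return "success"
    if job_status in TERMINAL_STATES:
        return "failure"
    return "wait"  # job still running; check again on the next invocation

def lambda_handler(event, context):
    # The callback token and job name arrive in the SQS message body
    # (field names are assumptions for this sketch).
    record = json.loads(event["Records"][0]["body"])
    token = record["token"]
    # status = boto3.client("sagemaker").describe_auto_ml_job_v2(
    #     AutoMLJobName=record["job_name"])["AutoMLJobStatus"]
    status = record.get("status", "InProgress")  # stubbed for illustration
    action = decide_callback(status)
    # On success, the real function would call
    # send_pipeline_execution_step_success(CallbackToken=token, ...).
    return {"token": token, "action": action}
```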
Deploy a model with AutoMLV2 using real-time inference
AutoMLV2 simplifies the deployment of models by automating the entire process, from model training to deployment. It takes care of the heavy lifting involved in selecting the best-performing model and preparing it for production use.
Furthermore, AutoMLV2 simplifies the deployment process. It can directly create a SageMaker model from the best candidate model and deploy it to a SageMaker endpoint with just a few lines of code.
In this section, we look at the code that deploys the best-performing model to a real-time SageMaker endpoint.
This pipeline step uses a Lambda step, which runs a serverless Lambda function. We use a Lambda step because the API call to create and deploy the SageMaker model is lightweight.
The first stage after the completion of the AutoMLV2 training process is to select the best candidate, making sure that the most accurate and efficient solution is chosen for deployment. We use the method describe_auto_ml_job_v2 to retrieve detailed information about a specific AutoMLV2 job. This method provides insights into the current status, configuration, and output of your AutoMLV2 job, allowing you to monitor its progress and access relevant information. See the following code:
In SageMaker Autopilot, the best candidate model is selected based on minimizing cross-entropy loss, a default metric that measures the dissimilarity between predicted and actual word distributions during fine-tuning. Additionally, the model’s quality is evaluated using metrics like ROUGE scores (ROUGE-1, ROUGE-2, ROUGE-L, and ROUGE-L-Sum), which measure the similarity between machine-generated text and human-written reference text, along with perplexity, which assesses how well the model predicts the next word in a sequence. The model with the lowest cross-entropy and perplexity, combined with strong ROUGE scores, is considered the best candidate.
With the best candidate model identified, you can create a SageMaker model object, encapsulating the trained model artifacts and necessary dependencies. For that, we use the create_model method of the AutoML job object:
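A sketch of that step follows; the best-candidate dictionary below is a local stand-in with the same shape describe_auto_ml_job_v2 returns, and the model name and role ARN are placeholders:

```python
# best_candidate = boto3.client("sagemaker").describe_auto_ml_job_v2(
#     AutoMLJobName="llm-fine-tuning-job")["BestCandidate"]
best_candidate = {  # stand-in with the shape the API returns
    "CandidateName": "candidate-1",
    "InferenceContainers": [
        {"Image": "<inference-image-uri>",
         "ModelDataUrl": "s3://<your-bucket>/model/model.tar.gz"}
    ],
}

# Build the SageMaker model from the best candidate's inference containers.
model_request = {
    "ModelName": "fine-tuned-llm-model",  # assumed name
    "Containers": best_candidate["InferenceContainers"],
    "ExecutionRoleArn": "arn:aws:iam::111122223333:role/<YourSageMakerRole>",
}
# boto3.client("sagemaker").create_model(**model_request)
```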
Next, we create a SageMaker endpoint configuration and deploy a SageMaker endpoint for real-time inference using the best candidate model. We use the instance type ml.g5.12xlarge to deploy the model. You may need to increase your quota to use this instance.
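The endpoint configuration and endpoint creation can be sketched as follows; the resource names are placeholders, and only the ml.g5.12xlarge instance type comes from the post:

```python
# Illustrative endpoint deployment for the model registered above.
endpoint_config_request = {
    "EndpointConfigName": "fine-tuned-llm-config",  # assumed name
    "ProductionVariants": [
        {
            "VariantName": "AllTraffic",
            "ModelName": "fine-tuned-llm-model",    # assumed name
            "InstanceType": "ml.g5.12xlarge",       # instance type from the post
            "InitialInstanceCount": 1,
        }
    ],
}
# sm = boto3.client("sagemaker")
# sm.create_endpoint_config(**endpoint_config_request)
# sm.create_endpoint(EndpointName="fine-tuned-llm-endpoint",
#                    EndpointConfigName="fine-tuned-llm-config")
```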
Inference pipeline
The inference pipeline is used for batch inference. It demonstrates a way to deploy and evaluate an FM and register it in SageMaker Model Registry. The following diagram shows the workflow starting with a preprocess data step, through model inference, to post-inference evaluation and conditional model registration.
Preprocess data for evaluation
The first crucial step in evaluating the performance of the fine-tuned LLM is to preprocess the data for evaluation. This preprocessing stage involves transforming the data into a format suitable for the evaluation process and verifying the compatibility with the chosen evaluation library.
In this particular case, we use a pipeline step to prepare the data for evaluation. The preprocessing script (preprocess_evaluation.py) creates a .jsonl (JSON Lines) file, which serves as the test dataset for the evaluation phase. The JSON Lines format is a convenient way to store structured data, where each line represents a single JSON object.
This test dataset is crucial for obtaining an unbiased evaluation of the model’s generalization capabilities and its ability to handle new, previously unseen inputs. After the evaluation_dataset.jsonl file is created, it’s saved in the appropriate path in an Amazon Simple Storage Service (Amazon S3) bucket.
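To make the format concrete, here is a minimal sketch of writing such a file; the record fields ("question", "answer") are assumptions about the schema, not the post's actual columns:

```python
import json

# Two illustrative records; each line of the output file is one JSON object.
records = [
    {"question": "What was the total revenue in 2020?", "answer": "..."},
    {"question": "Which channel sold the most units?", "answer": "..."},
]

with open("evaluation_dataset.jsonl", "w") as f:
    for rec in records:
        f.write(json.dumps(rec) + "\n")
```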
Evaluate the model using the fmeval library
SageMaker Autopilot streamlines the entire ML workflow, automating steps from data preprocessing to model evaluation. After training multiple models, SageMaker Autopilot automatically ranks them based on selected performance metrics, such as cross-entropy loss for text generation tasks, and identifies the best-performing model.
However, when deeper, more granular insights are required, particularly during post-training evaluation with a testing dataset, we use fmeval, an open source library tailored for fine-tuning and evaluating FMs. Fmeval provides enhanced flexibility and control, allowing for a comprehensive assessment of model performance using custom metrics tailored to the specific use case. This makes sure the model behaves as expected in real-world applications. Fmeval facilitates the evaluation of LLMs across a broad range of tasks, including open-ended text generation, summarization, question answering, and classification. Additionally, fmeval assesses models on metrics such as accuracy, toxicity, semantic robustness, and prompt stereotyping, helping identify the optimal model for diverse use cases while maintaining ethical and robust performance.
To start using the library, follow these steps:
- Create a ModelRunner that can perform invocation on your LLM. ModelRunner encapsulates the logic for invoking different types of LLMs, exposing a predict method to simplify interactions with LLMs within the eval algorithm code. For this project, we use SageMakerModelRunner from fmeval.
- In the evaluation script used by our pipeline, create a DataConfig object that points to the evaluation_dataset created in the previous step.
- Next, use an evaluation algorithm with the custom dataset. For this project, we use the QAAccuracy algorithm, which measures how well the model performs in question answering tasks. The model is queried for a range of facts, and we evaluate the accuracy of its response by comparing the model’s output to target answers under different metrics:
- Exact match (EM) – Binary score. 1 if the model output and target answer match exactly.
- Quasi-exact match – Binary score. Similar to exact match, but both model output and target answer are normalized first by removing articles and punctuation.
- Precision over words – The fraction of words in the prediction that are also found in the target answer. The text is normalized as before.
- Recall over words – The fraction of words in the target answer that are also found in the prediction.
- F1 over words – The harmonic mean of precision and recall over words (normalized).
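To illustrate how these word-overlap metrics behave, the following is a simplified re-implementation in plain Python; it is not fmeval's own code (fmeval computes these metrics for you), just a sketch of the definitions above:

```python
import re
import string

def normalize(text: str) -> str:
    """Lowercase, strip articles and punctuation, collapse whitespace."""
    text = re.sub(r"\b(a|an|the)\b", " ", text.lower())
    text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())

def f1_over_words(prediction: str, target: str) -> float:
    """Harmonic mean of word precision and recall after normalization."""
    pred, tgt = normalize(prediction).split(), normalize(target).split()
    common = sum(min(pred.count(w), tgt.count(w)) for w in set(pred))
    if common == 0:
        return 0.0
    precision, recall = common / len(pred), common / len(tgt)
    return 2 * precision * recall / (precision + recall)
```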
As an output, the evaluation step produces a file (evaluation_metrics.json) that contains the computed metrics. This file is stored in Amazon S3 and registered as a property file for later access in the pipeline.
Register the model
Before registering the fine-tuned model, we introduce a quality control step by implementing a condition based on the evaluation metrics obtained from the previous step. Specifically, we focus on the F1 score metric, which measures the harmonic mean of precision and recall between the normalized response and reference.
To make sure that only high-performing models are registered and deployed, we set a predetermined threshold for the F1 score metric. If the model’s performance meets or exceeds this threshold, it is suitable for registration and deployment. However, if the model fails to meet the specified threshold, the pipeline concludes without registering the model, stopping the deployment of suboptimal models.
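The gating logic amounts to a simple comparison; the following sketch assumes the metric key inside evaluation_metrics.json is named f1_over_words and the threshold is 0.5, both of which are assumptions you would adapt:

```python
import json

F1_THRESHOLD = 0.5  # assumed quality bar; tune for your use case

def model_meets_bar(metrics_file: str, threshold: float = F1_THRESHOLD) -> bool:
    """Return True when the evaluated F1 score clears the registration bar."""
    with open(metrics_file) as f:
        metrics = json.load(f)
    return metrics["f1_over_words"] >= threshold  # assumed metric key
```

In the actual pipeline, a condition step performs the equivalent comparison against the registered property file.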
Create and run the pipeline
After we define the pipeline steps, we combine them into a SageMaker pipeline. The steps are run sequentially. The pipeline runs the steps for an AutoML job, using SageMaker Autopilot for training, model evaluation, and model registration. See the following code:
Clean up
To avoid unnecessary charges and maintain a clean environment after running the demos outlined in this post, it’s important to delete all deployed resources. Follow these steps to properly clean up:
- To delete deployed endpoints, use the SageMaker console or the AWS SDK. This step is essential because endpoints can accrue significant charges if left running.
- Delete both SageMaker pipelines created during this walkthrough. This will help prevent residual executions that might generate additional costs.
- Remove all artifacts stored in your S3 buckets that were used for training, storing model artifacts, or logging. Make sure you delete only the resources related to this project to help avoid data loss.
- Clean up any additional resources. Depending on your implementation and any additional configurations, there may be other resources to consider, such as IAM roles, Amazon CloudWatch logs, or other AWS services. Identify and delete any resources that are no longer needed.
Conclusion
In this post, we explored how AutoMLV2 streamlines the process of fine-tuning FMs by automating the heavy lifting involved in model development. We demonstrated an end-to-end solution that uses SageMaker Pipelines to orchestrate the steps of data preparation, model training, evaluation, and deployment. The fmeval library played a crucial role in assessing the fine-tuned LLM’s performance, enabling us to select the best-performing model based on relevant metrics. By seamlessly integrating with the SageMaker infrastructure, AutoMLV2 simplified the deployment process, allowing us to create a SageMaker endpoint for real-time inference with just a few lines of code.
Get started by accessing the code on the GitHub repo to train and deploy your own custom AutoML models.
For more information on SageMaker Pipelines and SageMaker Autopilot, refer to Amazon SageMaker Pipelines and SageMaker Autopilot, respectively.
About the Author
Hajer Mkacher is a Solutions Architect at AWS, specializing in the Healthcare and Life Sciences industries. With over a decade in software engineering, she leverages generative AI to create innovative solutions, acting as a trusted advisor to her customers. In her free time, Hajer enjoys painting or working on creative robotics projects with her family.
Unify structured data in Amazon Aurora and unstructured data in Amazon S3 for insights using Amazon Q
In today’s data-intensive business landscape, organizations face the challenge of extracting valuable insights from diverse data sources scattered across their infrastructure. Whether it’s structured data in databases or unstructured content in document repositories, enterprises often struggle to efficiently query and use this wealth of information.
In this post, we explore how you can use Amazon Q Business, the AWS generative AI-powered assistant, to build a centralized knowledge base for your organization, unifying structured and unstructured datasets from different sources to accelerate decision-making and drive productivity. The solution combines data from an Amazon Aurora MySQL-Compatible Edition database and data stored in an Amazon Simple Storage Service (Amazon S3) bucket.
Solution overview
Amazon Q Business is a fully managed, generative AI-powered assistant that helps enterprises unlock the value of their data and knowledge. The key to using the full potential of Amazon Q lies in its ability to seamlessly integrate and query multiple data sources, from structured databases to unstructured content stores. In this solution, we use Amazon Q to build a comprehensive knowledge base that combines sales-related data from an Aurora MySQL database and sales documents stored in an S3 bucket. Aurora MySQL-Compatible is a fully managed, MySQL-compatible, relational database engine that combines the speed and reliability of high-end commercial databases with the simplicity and cost-effectiveness of open-source databases. Amazon S3 is an object storage service that offers industry-leading scalability, data availability, security, and performance.
This custom knowledge base that connects these diverse data sources enables Amazon Q to seamlessly respond to a wide range of sales-related questions using the chat interface. The following diagram illustrates the solution architecture.
Prerequisites
For this walkthrough, you should have the following prerequisites:
- A virtual private cloud (VPC) with at least two subnets
- An Aurora MySQL database
- An Amazon Elastic Compute Cloud (Amazon EC2) bastion host
- AWS IAM Identity Center configured
- An S3 bucket
Set up your VPC
Establishing a VPC provides a secure, isolated network environment for hosting the data sources that Amazon Q Business will access and index. In this post, we use an Aurora MySQL database in a private subnet, and Amazon Q Business accesses the private DB instance in a secure manner using an interface VPC endpoint.
Complete the following steps:
- Choose an AWS Region Amazon Q supports (for this post, we use the us-east-1 Region).
- Create a VPC or use an existing VPC with at least two subnets. These subnets must be in two different Availability Zones in the Region where you want to deploy your DB instance.
- Refer to Steps 1 and 2 in Configuring Amazon VPC support for Amazon Q Business connectors to configure your VPC so that you have a private subnet to host an Aurora MySQL database along with a security group for your database.
- Additionally, create a public subnet that will host an EC2 bastion server, which we create in the next steps.
- Create an interface VPC endpoint for Aurora powered by AWS PrivateLink in the VPC you created. For instructions, refer to Access an AWS service using an interface VPC endpoint.
- Specify the private subnet where the Aurora MySQL database resides along with the database security group you created.
Each interface endpoint is represented by one or more elastic network interfaces in your subnets, which are then used by Amazon Q Business to connect to the private database.
Set up an Aurora MySQL database
Complete the following steps to create an Aurora MySQL database to host the structured sales data:
- On the Amazon RDS console, choose Databases in the navigation pane.
- Choose Create database.
- Select Aurora, then Aurora (MySQL compatible).
- For Templates, choose Production or Dev/test.
- Under Settings, enter a name for your database cluster identifier. For example, q-aurora-mysql-source.
- For Credentials settings, choose Self-managed, give the admin user a password, and keep the rest of the parameters as default.
- Under Connectivity, for Virtual private cloud (VPC), choose the VPC that you created.
- For DB subnet group, create a new subnet group or choose an existing one. Keep the rest of the parameters as default.
- For Publicly accessible, choose NO.
- Under VPC security group (firewall), choose Existing and choose the existing security group that you created for the Aurora MySQL DB instance.
- Leave the remaining parameters as default and create the database.
Create an EC2 bastion host to connect to the private Aurora MySQL DB instance
In this post, you connect to the private DB instance from the MySQL Workbench client on your local machine through an EC2 bastion host. Launch the EC2 instance in the public subnet of the VPC you configured. The security group attached to this EC2 bastion host instance should be configured to allow SSH traffic (port 22) from your local machine’s IP address. To facilitate the connection between the EC2 bastion host and the Aurora MySQL database, the security group for the Aurora MySQL database should have an inbound rule to allow MySQL traffic (port 3306) from the security group of the EC2 bastion host. Conversely, the security group for the EC2 bastion host should have an outbound rule to allow traffic to the security group of the Aurora MySQL database on port 3306. Refer to Controlling access with security groups for more details.
Configure IAM Identity Center
An Amazon Q Business application requires you to use IAM Identity Center to manage user access. IAM Identity Center is a single place where you can assign your workforce users, also known as workforce identities, to provide consistent access to multiple AWS accounts and applications. In this post, we use IAM Identity Center as the SAML 2.0-aligned identity provider (IdP). Make sure you have enabled an IAM Identity Center instance, provisioned at least one user, and provided each user with a valid email address. The Amazon Q Business application needs to be in the same Region as the IAM Identity Center instance. For more information on enabling users in IAM Identity Center, see Add users to your Identity Center directory.
Create an S3 bucket
Create an S3 bucket in the us-east-1 Region with the default settings and create a folder with a name of your choice inside the bucket.
Create and load sample data
In this post, we use two sample datasets: a total sales dataset CSV file and a sales target document in PDF format. The total sales dataset contains information about orders placed by customers located in various geographical locations, through different sales channels. The sales document contains information about sales targets for the year for each sales channel. Complete the steps in the following sections to load both datasets.
Aurora MySQL database
In the Amazon Q Business application, you create two indexes for the same Aurora MySQL table: one on the total sales dataset and another on an aggregated view of the total sales data, to cater to the different types of queries. Complete the following steps:
- Securely connect to your private Aurora MySQL database using an SSH tunnel through an EC2 bastion host.
This enables you to manage and interact with your database resources directly from your local MySQL Workbench client.
- Create the database and tables using the following commands on the local MySQL Workbench client:
- Download the sample CSV file to your local environment.
- Use the following code to insert the sample data using your local MySQL client:
If you encounter the error LOAD DATA LOCAL INFILE file request rejected due to restrictions on access when running the statements in MySQL Workbench 8.0, you might need to edit the connection. On the Connection tab, go to the Advanced sub-tab, and in the Others field, add the line OPT_LOCAL_INFILE=1, then start a new query tab after testing the connection.
- Verify the data load by running a select statement:
This should return 7,991 rows.
The following screenshot shows the database table schema and the sample data in the table.
Amazon S3 bucket
Download the sample file 2020_Sales_Target.pdf to your local environment and upload it to the S3 bucket you created. This sales target document contains information about the sales targets for four sales channels and looks like the following screenshot.
Create an Amazon Q application
Complete the following steps to create an Amazon Q application:
- On the Amazon Q console, choose Applications in the navigation pane.
- Choose Create application.
- Provide the following details:
  - In the Application details section, for Application name, enter a name for the application (for example, sales_analyzer).
  - In the Service access section, for Choose a method to authorize Amazon Q, select Create and use a new service role.
  - Leave all other default options and choose Create.
- On the Select retriever page, you configure the retriever. The retriever is an index that will be used by Amazon Q to fetch data in real time.
- For Retrievers, select Use native retriever.
- For Index provisioning, select Starter.
- For Number of units, use the default value of 1. Each unit can support up to 20,000 documents. For a database, each database row is considered a document.
- Choose Next.
Configure Amazon Q to connect to Aurora MySQL-Compatible
Complete the following steps to configure Amazon Q to connect to Aurora MySQL-Compatible:
- On the Connect data sources page, under Data sources, choose the Aurora (MySQL) data source.
- Choose Next.
- In the Name and description section, configure the following parameters:
  - For Data source name, enter a name (for example, aurora_mysql_sales).
  - For Description, enter a description.
- In the Source section, configure the following parameters:
  - For Host, enter the database endpoint (for example, <databasename>.<ID>.<region>.rds.amazonaws.com). You can obtain the endpoint on the Amazon RDS console, on the instance’s Connectivity & security tab.
  - For Port, enter the Amazon RDS port for MySQL: 3306.
  - For Instance, enter the database name (for example, sales).
  - Select Enable SSL Certificate location.
- For Authentication, choose Create a new secret with a name of your choice.
- Provide the user name and password for your MySQL database to create the secret.
- In the Configure VPC and security group section, choose the VPC and subnets where your Aurora MySQL database is located, and choose the default VPC security group.
- For IAM role, choose Create a new service role.
- For Sync scope, under SQL query, enter the following query:
This SELECT statement returns a primary key column, a document title column, and a text column that serves as the document body for Amazon Q to answer questions. Make sure you don’t put ; at the end of the query.
- For Primary key column, enter order_number.
- For Title column, enter sales_channel.
- For Body column, enter sales_details.
- Under Sync run schedule, for Frequency, choose Run on demand.
- Keep all other parameters as default and choose Add data source.
This process may take a few minutes to complete. After the aurora_mysql_sales data source is added, you will be redirected to the Connect data sources page.
- Repeat the steps to add another Aurora MySQL data source, called aggregated_sales, for the same database, but with the following details in the Sync scope section. This data source will be used by Amazon Q for answering questions on aggregated sales.
  - Use the following SQL query:
  - For Primary key column, enter scoy_id.
  - For Title column, enter sales_channel.
  - For Body column, enter sales_aggregates.
- For Primary key column, enter
After adding the aggregated_sales data source, you will be redirected to the Connect data sources page again.
Configure Amazon Q to connect to Amazon S3
Complete the following steps to configure Amazon Q to connect to Amazon S3:
- On the Connect data sources page, under Data sources, choose Amazon S3.
- Under Name and description, enter a data source name (for example, s3_sales_targets) and a description.
- Under Configure VPC and security group settings, choose No VPC.
- For IAM role, choose Create a new service role.
- Under Sync scope, for the data source location, enter the S3 bucket name containing the sales target PDF document.
- Leave all other parameters as default.
- Under Sync run schedule, for Frequency, choose Run on demand.
- Choose Add data source.
- On the Connect data sources page, choose Next.
- In the Update groups and users section, choose Add users and groups.
- Choose the user as entered in IAM Identity Center and choose Assign.
- After you add the user, you can choose the Amazon Q Business subscription to assign to the user. For this post, we choose Q Business Lite.
- Under Web experience service access, select Create and use a new service role and enter a service role name.
- Choose Create application.
After a few minutes, the application will be created and you will be taken to the Applications page on the Amazon Q Business console.
Sync the data sources
Choose the name of your application and navigate to the Data sources section. For each of the three data sources, select the data source and choose Sync now. It will take several minutes to complete. After the sources have synced, you should see the Last sync status show as Completed.
Customize and interact with the Amazon Q application
At this point, you have created an Amazon Q application, synced the data source, and deployed the web experience. You can customize your web experience to make it more intuitive to your application users.
- On the application details page, choose Customize web experience.
- For this post, we have customized the Title, Subtitle and Welcome message fields for our assistant.
- After you have completed your customizations for the web experience, go back to the application details page and choose the web experience URL.
- Sign in with the IAM Identity Center user name and password you created earlier to start the conversation with the assistant.
You can now test the application by asking different questions, as shown in the following screenshot. You can observe in the following question that the channel names were fetched from the Amazon S3 sales target PDF.
The following screenshots show more example interactions.
The answer in the preceding example was derived from the two sources: the S3 bucket and the Aurora database. You can verify the output by cross-referencing the PDF, which has a target as $12 million for the in-store sales channel in 2020. The following SQL shows the actual sales achieved in 2020 for the same channel:
As seen from the sales target PDF data, the 2020 sales target for the distributor sales channel was $7 million.
The following SQL in the Aurora MySQL database shows the actual sales achieved in 2020 for the same channel:
The following screenshots show additional questions.
You can verify the preceding answers with the following SQL:
Clean up
To avoid incurring future charges, clean up any resources you created as part of this solution, including the Amazon Q Business application:
- On the Amazon Q Business console, choose Applications in the navigation pane, select the application you created, and on the Actions menu, choose Delete.
- Delete the AWS Identity and Access Management (IAM) roles created for the application and data retriever. You can identify the IAM roles used by the Amazon Q Business application and data retriever by inspecting the associated configuration using the AWS console or AWS Command Line Interface (AWS CLI).
- Delete the IAM Identity Center instance you created for this walkthrough.
- Empty the bucket you created and then delete the bucket.
- Delete the Aurora MySQL instance and Aurora cluster.
- Shut down the EC2 bastion host instance.
- Delete the VPC and related components—the NAT gateway and interface VPC endpoint.
Conclusion
In this post, we demonstrated how organizations can use Amazon Q to build a unified knowledge base that integrates structured data from an Aurora MySQL database and unstructured data from an S3 bucket. By connecting these disparate data sources, Amazon Q enables you to seamlessly query information from two data sources and gain valuable insights that drive better decision-making.
We encourage you to try this solution and share your experience in the comments. Additionally, you can explore the many other data sources that Amazon Q for Business can seamlessly integrate with, empowering you to build robust and insightful applications.
About the Authors
Monjumi Sarma is a Technical Account Manager at Amazon Web Services. She helps customers architect modern, scalable, and cost-effective solutions on AWS, which gives them an accelerated path towards modernization initiatives. She has experience across analytics, big data, ETL, cloud operations, and cloud infrastructure management.
Akchhaya Sharma is a Sr. Data Engineer at Amazon Ads. He builds and manages data-driven solutions for recommendation systems, working together with a diverse and talented team of scientists, engineers, and product managers. He has experience across analytics, big data, and ETL.
Automate Q&A email responses with Amazon Bedrock Knowledge Bases
Email remains a vital communication channel for business customers, especially in HR, where responding to inquiries can consume staff resources and cause delays. The breadth of knowledge required makes it difficult to respond to email inquiries manually, so automation will play a crucial role in this domain.
Using generative AI allows businesses to improve accuracy and efficiency in email management and automation. This technology allows for automated responses, with only complex cases requiring manual review by a human, streamlining operations and enhancing overall productivity.
Retrieval augmented generation (RAG) combined with knowledge bases enhances automated response accuracy. By pairing retrieval-based and generation-based models, RAG can access databases and generate accurate, contextually relevant responses. Access to reliable information from a comprehensive knowledge base helps the system provide better responses. This hybrid approach makes sure automated replies are not only contextually relevant but also factually correct, enhancing the reliability and trustworthiness of the communication.
In this post, we illustrate automating the responses to email inquiries by using Amazon Bedrock Knowledge Bases and Amazon Simple Email Service (Amazon SES), both fully managed services. By linking user queries to relevant company domain information, Amazon Bedrock Knowledge Bases offers personalized responses. Amazon Bedrock Knowledge Bases can achieve greater response accuracy and relevance by integrating foundation models (FMs) with internal company data sources for RAG. Amazon SES is an email service that provides a straightforward way to send and receive email using your own email addresses and domains.
Retrieval Augmented Generation
RAG is an approach that integrates information retrieval into the natural language generation process. It involves two key workflows: data ingestion and text generation. The data ingestion workflow creates semantic embeddings for documents and questions, storing document embeddings in a vector database. By comparing vector similarity to the question embedding, the text generation workflow selects the most relevant document chunks to enhance the prompt. The obtained information empowers the model to generate more knowledgeable and precise responses.
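The retrieval half of this workflow reduces to nearest-neighbor search over embeddings. The following toy sketch uses hand-made vectors; in a real system these would come from an embedding model:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# Toy document-chunk embeddings; in practice these are produced by the
# embedding model during data ingestion and stored in a vector database.
chunk_embeddings = {
    "vacation policy chunk": [0.9, 0.1, 0.0],
    "expense policy chunk": [0.1, 0.9, 0.2],
}
question_embedding = [0.85, 0.15, 0.05]  # embedding of the user question

# Select the chunk most similar to the question to augment the prompt.
best_chunk = max(chunk_embeddings,
                 key=lambda c: cosine(question_embedding, chunk_embeddings[c]))
```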
Amazon Bedrock Knowledge Bases
For RAG workflows, Amazon Bedrock offers managed knowledge bases, which are vector databases that store unstructured data semantically. This managed service simplifies deployment and scaling, allowing developers to focus on building RAG applications without worrying about infrastructure management. For more information on RAG and Amazon Bedrock Knowledge Bases, see Connect Foundation Models to Your Company Data Sources with Agents for Amazon Bedrock.
Solution overview
The solution presented in this post responds automatically to email inquiries using the following solution architecture. The primary functions are to enhance the RAG support knowledge base with domain-specific documents and automate email responses.
The workflow to populate the knowledge base consists of the following steps, as noted in the architecture diagram:
- The user uploads company- and domain-specific information, like policy manuals, to an Amazon Simple Storage Service (Amazon S3) bucket.
- This bucket is designated as the knowledge base data source.
- Amazon S3 invokes an AWS Lambda function to synchronize the data source with the knowledge base.
- The Lambda function starts data ingestion by calling the StartIngestionJob API. The knowledge base splits the documents in the data source into manageable chunks for efficient retrieval. The knowledge base is set up to use Amazon OpenSearch Serverless as its vector store and an Amazon Titan text embeddings model on Amazon Bedrock to create the embeddings. During this step, the chunks are converted to embeddings and stored in a vector index in the OpenSearch Serverless vector store for Amazon Bedrock Knowledge Bases, while also keeping track of the original document.
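A minimal sketch of this synchronization Lambda function follows. The environment variable names are assumptions for illustration; the StartIngestionJob call itself comes from the Amazon Bedrock Agent API:

```python
import os

def build_sync_request(knowledge_base_id, data_source_id):
    # The two identifiers StartIngestionJob needs to sync a data source.
    return {"knowledgeBaseId": knowledge_base_id, "dataSourceId": data_source_id}

def handler(event, context):
    # Lambda entry point, invoked by the S3 upload notification.
    import boto3  # imported lazily so the sketch loads without the SDK
    client = boto3.client("bedrock-agent")
    request = build_sync_request(os.environ["KNOWLEDGE_BASE_ID"],
                                 os.environ["DATA_SOURCE_ID"])
    job = client.start_ingestion_job(**request)
    return job["ingestionJob"]["ingestionJobId"]
```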
The workflow for automating email responses using generative AI with the knowledge base includes the following steps:
- A customer sends a natural language email inquiry to an address configured within your domain, such as info@example.com.
- Amazon SES receives the email and sends the entire email content to an S3 bucket with the unique email identifier as the object key.
- An Amazon EventBridge rule is invoked upon receipt of the email in the S3 bucket and starts an AWS Step Functions state machine to coordinate generating and sending the email response.
- A Lambda function retrieves the email content from Amazon S3.
- The email identifier and a received timestamp are recorded in an Amazon DynamoDB table. You can use the DynamoDB table to monitor and analyze the email responses that are generated.
- Using the body of the email inquiry, the Lambda function creates a prompt query and invokes the Amazon Bedrock RetrieveAndGenerate API to generate a response.
- Amazon Bedrock Knowledge Bases uses the Amazon Titan embeddings model to convert the prompt query to a vector embedding, and then finds chunks that are semantically similar. The prompt is then augmented with the chunks retrieved from the vector store, and the prompt plus the additional context is sent to a large language model (LLM) for response generation. In this solution, we use Anthropic’s Claude 3.5 Sonnet on Amazon Bedrock as our LLM to generate user responses using the additional context. Anthropic’s Claude 3.5 Sonnet is fast, affordable, and versatile, capable of handling tasks like casual dialogue, text analysis, summarization, and document question answering.
- A Lambda function constructs an email reply from the generated response and transmits the email reply using Amazon SES to the customer. Email tracking and disposition information is updated in the DynamoDB table.
- When there’s no automated email response, a Lambda function forwards the original email to an internal support team for them to review and respond to the customer. It updates the email disposition information in the DynamoDB table.
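The retrieve-and-generate step in the workflow above reduces to a single API call. The function and variable names here are illustrative; the request shape follows the Amazon Bedrock Agent Runtime RetrieveAndGenerate API:

```python
def build_rag_request(query_text, knowledge_base_id, model_arn):
    # Assemble the RetrieveAndGenerate request body.
    return {
        "input": {"text": query_text},
        "retrieveAndGenerateConfiguration": {
            "type": "KNOWLEDGE_BASE",
            "knowledgeBaseConfiguration": {
                "knowledgeBaseId": knowledge_base_id,
                "modelArn": model_arn,
            },
        },
    }

def answer_inquiry(email_body, knowledge_base_id, model_arn):
    # Retrieve relevant chunks and generate a grounded reply in one call.
    import boto3  # imported lazily so the sketch loads without the SDK
    client = boto3.client("bedrock-agent-runtime")
    response = client.retrieve_and_generate(
        **build_rag_request(email_body, knowledge_base_id, model_arn))
    return response["output"]["text"]
```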
Prerequisites
To set up this solution, you should have the following prerequisites:
- A local machine or virtual machine (VM) on which you can install and run AWS Command Line Interface (AWS CLI) tools.
- A local environment prepared to deploy the AWS Cloud Development Kit (AWS CDK) stack as documented in Getting started with the AWS CDK. You can bootstrap the environment with cdk bootstrap aws://{ACCOUNT_NUMBER}/{REGION}.
- A valid domain name with configuration rights over it. If you have a domain name registered in Amazon Route 53 and managed in this same account, the AWS CDK will configure Amazon SES for you. If your domain is managed elsewhere, some manual steps are necessary (as detailed later in this post).
- Amazon Bedrock models enabled for embedding and querying. For more information, see Access Amazon Bedrock foundation models. In the default configuration, the following models must be enabled:
- Amazon Titan Text Embeddings V2
- Anthropic’s Claude 3.5 Sonnet
Deploy the solution
To deploy the solution, complete the following steps:
- Configure an SES domain identity to allow Amazon SES to send and receive messages.
If you want to receive email for a domain managed in Route 53, the AWS CDK will automatically configure this for you if you provide the ROUTE53_HOSTED_ZONE context variable. If you manage your domain in a different account or in a registrar besides Route 53, refer to Creating and verifying identities in Amazon SES to manually verify your domain identity, and Publishing an MX record for Amazon SES email receiving to manually add the MX record required for Amazon SES to receive email for your domain.
- Clone the repository and navigate to the root directory:
- Install dependencies:
npm install
- Deploy the AWS CDK app, replacing {EMAIL_SOURCE} with the email address that will receive inquiries, {EMAIL_REVIEW_DEST} with the email address for internal review of messages that fail auto response, and {HOSTED_ZONE_NAME} with your domain name:
At this point, you have configured Amazon SES with a verified domain identity in sandbox mode. You can now send email to an address in that domain. If you need to send email to users with a different domain name, you need to request production access.
Upload domain documents to Amazon S3
Now that you have a running knowledge base, you need to populate your vector store with the raw data you want to query. To do so, upload your raw text data to the S3 bucket serving as the knowledge base data source:
- Locate the bucket name from the AWS CDK output (KnowledgeBaseSourceBucketArn/Name).
- Upload your text files, either through the Amazon S3 console or the AWS CLI.
If you’re testing this solution out, we recommend using the documents in the following open source HR manual. Upload the files in either the markdown or PDF folders. Your knowledge base will then automatically sync those files to the vector database.
Test the solution
To test the solution, send an email to the address defined in the “sourceEmail” context parameter. If you opted to upload the sample HR documents, you could use the following example questions:
- “How many days of PTO do I get?”
- “To whom do I report an HR violation?”
Clean up
Deploying the solution will incur charges. To clean up resources, run the following command from the project’s folder:
Conclusion
In this post, we discussed the essential role of email as a communication channel for business users and the challenges of manual email responses. Our description outlined the use of a RAG architecture and Amazon Bedrock Knowledge Bases to automate email responses, resulting in improved HR prioritization and enhanced user experiences. Lastly, we created a solution architecture and sample code in a GitHub repository for automatically generating and sending contextual email responses using a knowledge base.
For more information, see the Amazon Bedrock User Guide and Amazon SES Developer Guide.
About the Authors
Darrin Weber is a Senior Solutions Architect at AWS, helping customers realize their cloud journey with secure, scalable, and innovative AWS solutions. He brings over 25 years of experience in architecture, application design and development, digital transformation, and the Internet of Things. When Darrin isn’t transforming and optimizing businesses with innovative cloud solutions, he’s hiking or playing pickleball.
Marc Luescher is a Senior Solutions Architect at AWS, helping enterprise customers be successful, focusing strongly on threat detection, incident response, and data protection. His background is in networking, security, and observability. Previously, he worked in technical architecture and security hands-on positions within the healthcare sector as an AWS customer. Outside of work, Marc enjoys his 3 dogs, 4 cats, and over 20 chickens, and practices his skills in cabinet making and woodworking.
Matt Richards is a Senior Solutions Architect at AWS, assisting customers in the retail industry. Having formerly been an AWS customer himself with a background in software engineering and solutions architecture, he now focuses on helping other customers in their application modernization and digital transformation journeys. Outside of work, Matt has a passion for music, singing, and drumming in several groups.
Streamline RAG applications with intelligent metadata filtering using Amazon Bedrock
Retrieval Augmented Generation (RAG) has become a crucial technique for improving the accuracy and relevance of AI-generated responses. The effectiveness of RAG heavily depends on the quality of context provided to the large language model (LLM), which is typically retrieved from vector stores based on user queries. The relevance of this context directly impacts the model’s ability to generate accurate and contextually appropriate responses.
One effective way to improve context relevance is through metadata filtering, which allows you to refine search results by pre-filtering the vector store based on custom metadata attributes. By narrowing down the search space to the most relevant documents or chunks, metadata filtering reduces noise and irrelevant information, enabling the LLM to focus on the most relevant content.
In some use cases, particularly those involving complex user queries or a large number of metadata attributes, manually constructing metadata filters can become challenging and potentially error-prone. To address these challenges, you can use LLMs to create a robust solution. This approach, which we call intelligent metadata filtering, uses tool use (also known as function calling) to dynamically extract metadata filters from natural language queries. Function calling allows LLMs to interact with external tools or functions, enhancing their ability to process and respond to complex queries.
Amazon Bedrock is a fully managed service that offers a choice of high-performing foundation models (FMs) from leading AI companies like AI21 Labs, Anthropic, Cohere, Meta, Mistral AI, Stability AI, and Amazon through a single API, along with a broad set of capabilities to build generative AI applications with security, privacy, and responsible AI. One of its key features, Amazon Bedrock Knowledge Bases, allows you to securely connect FMs to your proprietary data using a fully managed RAG capability and supports powerful metadata filtering capabilities.
In this post, we explore an innovative approach that uses LLMs on Amazon Bedrock to intelligently extract metadata filters from natural language queries. By combining the capabilities of LLM function calling and Pydantic data models, you can dynamically extract metadata from user queries. This approach can also enhance the quality of retrieved information and responses generated by the RAG applications.
This approach not only addresses the challenges of manual metadata filter construction, but also demonstrates how you can use Amazon Bedrock to create more effective and user-friendly RAG applications.
Understanding metadata filtering
Metadata filtering is a powerful feature that allows you to refine search results by pre-filtering the vector store based on custom metadata attributes. This approach narrows down the search space to the most relevant documents or passages, reducing noise and irrelevant information. For a comprehensive overview of metadata filtering and its benefits, refer to Amazon Bedrock Knowledge Bases now supports metadata filtering to improve retrieval accuracy.
The importance of context quality in RAG applications
In RAG applications, the accuracy and relevance of generated responses heavily depend on the quality of the context provided to the LLM. This context, typically retrieved from the knowledge base based on user queries, directly impacts the model’s ability to generate accurate and contextually appropriate outputs.
To evaluate the effectiveness of a RAG system, we focus on three key metrics:
- Answer relevancy – Measures how well the generated answer addresses the user’s query. By improving the relevance of the retrieved context through dynamic metadata filtering, you can significantly enhance the answer relevancy.
- Context recall – Assesses the proportion of relevant information retrieved from the knowledge base. Dynamic metadata filtering helps improve context recall by more accurately identifying and retrieving the most pertinent documents or passages for a given query.
- Context precision – Evaluates the accuracy of the retrieved context, making sure the information provided to the LLM is highly relevant to the query. Dynamic metadata filtering enhances context precision by reducing the inclusion of irrelevant or tangentially related information.
By implementing dynamic metadata filtering, you can significantly improve these metrics, leading to more accurate and relevant RAG responses. Let’s explore how to implement this approach using Amazon Bedrock and Pydantic.
Solution overview
In this section, we illustrate the flow of the dynamic metadata filtering solution using the tool use (function calling) capability. The following diagram illustrates the high-level RAG architecture with dynamic metadata filtering.
The process consists of the following steps:
- The process begins when a user asks a query through their interface.
- The user’s query is first processed by an LLM using the tool use (function calling) feature. This step is crucial for extracting relevant metadata from the natural language query. The LLM analyzes the query and identifies key entities or attributes that can be used for filtering.
- The extracted metadata is used to construct an appropriate metadata filter. The combined query and filter are passed to the RetrieveAndGenerate API.
- This API, part of Amazon Bedrock Knowledge Bases, handles the core RAG workflow. It consists of several sub-steps:
- The user query is converted into a vector representation (embedding).
- Using the query embedding and the metadata filter, relevant documents are retrieved from the knowledge base.
- The original query is augmented with the retrieved documents, providing context for the LLM.
- The LLM generates a response based on the augmented query and retrieved context.
- Finally, the generated response is returned to the user.
This architecture uses the power of tool use for intelligent metadata extraction from a user’s query, combined with the robust RAG capabilities of Amazon Bedrock Knowledge Bases. The key innovation lies in Step 2, where the LLM is used to dynamically interpret the user’s query and extract relevant metadata for filtering. This approach allows for more flexible and intuitive querying, because users can express their information needs in natural language without having to manually specify metadata filters.
The subsequent steps (3–4) follow a more standard RAG workflow, but with the added benefit of using the dynamically generated metadata filter to improve the relevance of retrieved documents. This combination of intelligent metadata extraction and traditional RAG techniques results in more accurate and contextually appropriate responses to user queries.
Prerequisites
Before proceeding with this tutorial, make sure you have the following in place:
- AWS account – You should have an AWS account with access to Amazon Bedrock.
- Model access – Amazon Bedrock users need to request access to FMs before they’re available for use. For this solution, you need to enable access to the Amazon Titan Embeddings G1 – Text and Anthropic’s Claude Instant 1.2 model in Amazon Bedrock. For more information, refer to Access Amazon Bedrock foundation models.
- Knowledge base – You need a knowledge base created in Amazon Bedrock with ingested data and metadata. For detailed instructions on setting up a knowledge base, including data preparation, metadata creation, and step-by-step guidance, refer to Amazon Bedrock Knowledge Bases now supports metadata filtering to improve retrieval accuracy. This post walks you through the entire process of creating a knowledge base and ingesting data with metadata.
In the following sections, we explore how to implement dynamic metadata filtering using the tool use feature in Amazon Bedrock and Pydantic for data validation.
Tool use is a powerful feature in Amazon Bedrock that allows models to access external tools or functions to enhance their response generation capabilities. When you send a message to a model, you can provide definitions for one or more tools that could potentially help the model generate a response. If the model determines it needs a tool, it responds with a request for you to call the tool, including the necessary input parameters.
In our example, we use Amazon Bedrock to extract entities like genre and year from natural language queries about video games. For a query like “A strategy game with cool graphics released after 2023?”, it will extract “strategy” (genre) and “2023” (year). These extracted entities then dynamically construct metadata filters to retrieve only relevant games from the knowledge base. This allows flexible, natural language querying with precise metadata filtering.
Set up the environment
First, set up your environment with the necessary imports and Boto3 clients:
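A minimal setup might look like the following sketch. The Region, knowledge base ID, and model ID are placeholders to replace with your own values:

```python
REGION = "us-east-1"                      # assumption: use your own Region
KNOWLEDGE_BASE_ID = "YOUR_KB_ID"          # placeholder knowledge base ID
MODEL_ID = "anthropic.claude-instant-v1"  # the model named in the prerequisites

def get_clients(region=REGION):
    # Create the two clients used in this post: one for direct model
    # invocation (tool use) and one for knowledge base retrieval.
    import boto3  # imported lazily so the sketch loads without the SDK
    bedrock_runtime = boto3.client("bedrock-runtime", region_name=region)
    bedrock_agent_runtime = boto3.client("bedrock-agent-runtime", region_name=region)
    return bedrock_runtime, bedrock_agent_runtime
```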
Define Pydantic models
For this solution, you use Pydantic models to validate and structure our extracted entities:
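One way to model the extracted entities, assuming the video game example from earlier with genre and year attributes (adjust the fields to match your own metadata):

```python
from typing import Optional
from pydantic import BaseModel, Field

class ExtractedEntities(BaseModel):
    # Hypothetical attribute names; align them with your knowledge base metadata.
    genre: Optional[str] = Field(default=None, description="Video game genre, e.g. 'strategy'")
    year: Optional[int] = Field(default=None, description="Release year to filter on")
```

Validating the model's tool output through Pydantic catches malformed extractions (for example, a non-integer year) before they reach the filter.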
Implement entity extraction using tool use
You now define a tool for entity extraction with basic instructions and use it with Amazon Bedrock. You should use a proper description for this to work for your use case:
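A sketch of the extraction step using the Bedrock Converse API's toolConfig follows. The tool name, schema fields, and model ID are assumptions for the video game example; parse_tool_use simply pulls the structured input out of the model's tool-use response:

```python
TOOL_CONFIG = {
    "tools": [{
        "toolSpec": {
            "name": "extract_entities",  # hypothetical tool name
            "description": "Extract the video game genre and release year "
                           "mentioned in the user's query.",
            "inputSchema": {"json": {
                "type": "object",
                "properties": {
                    "genre": {"type": "string", "description": "Game genre, e.g. strategy"},
                    "year": {"type": "integer", "description": "Release year mentioned"},
                },
            }},
        }
    }]
}

def parse_tool_use(response):
    # Pull the tool-use input out of a Converse API response, if present.
    for block in response["output"]["message"]["content"]:
        if "toolUse" in block:
            return block["toolUse"]["input"]
    return {}

def extract_entities(query, model_id="anthropic.claude-instant-v1"):
    import boto3  # imported lazily so the sketch loads without the SDK
    client = boto3.client("bedrock-runtime")
    response = client.converse(
        modelId=model_id,
        messages=[{"role": "user", "content": [{"text": query}]}],
        toolConfig=TOOL_CONFIG,
    )
    return parse_tool_use(response)
```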
Construct a metadata filter
Create a function to construct the metadata filter based on the extracted entities:
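A sketch of the filter construction, using the Knowledge Bases filter operators (equals, greaterThan, andAll); the metadata keys are assumptions and must match the keys in your ingested metadata files:

```python
def build_metadata_filter(entities):
    # Map extracted entities to the Knowledge Bases metadata filter syntax.
    conditions = []
    if entities.get("genre"):
        conditions.append({"equals": {"key": "genre", "value": entities["genre"]}})
    if entities.get("year"):
        conditions.append({"greaterThan": {"key": "year", "value": entities["year"]}})
    if not conditions:
        return None                   # nothing to filter on
    if len(conditions) == 1:
        return conditions[0]          # a single condition needs no combinator
    return {"andAll": conditions}     # require every condition to match
```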
Create the main function
Finally, create a main function to process the query and retrieve results:
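The main function passes a dynamically built filter into the RetrieveAndGenerate call through the vector search configuration. Function and variable names are illustrative; the request shape follows the Amazon Bedrock Agent Runtime API:

```python
def build_request(query, kb_id, model_arn, metadata_filter=None):
    # Assemble a RetrieveAndGenerate request, attaching the filter if present.
    kb_config = {"knowledgeBaseId": kb_id, "modelArn": model_arn}
    if metadata_filter:
        kb_config["retrievalConfiguration"] = {
            "vectorSearchConfiguration": {"filter": metadata_filter}
        }
    return {
        "input": {"text": query},
        "retrieveAndGenerateConfiguration": {
            "type": "KNOWLEDGE_BASE",
            "knowledgeBaseConfiguration": kb_config,
        },
    }

def process_query(query, kb_id, model_arn, metadata_filter=None):
    import boto3  # imported lazily so the sketch loads without the SDK
    client = boto3.client("bedrock-agent-runtime")
    response = client.retrieve_and_generate(
        **build_request(query, kb_id, model_arn, metadata_filter))
    return response["output"]["text"]
```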
This implementation uses the tool use feature in Amazon Bedrock to dynamically extract entities from user queries. It then uses these entities to construct metadata filters, which are applied when retrieving results from the knowledge base.
The key advantages of this approach include:
- Flexibility – The system can handle a wide range of natural language queries without predefined patterns
- Accuracy – By using LLMs for entity extraction, you can capture nuanced information from user queries
- Extensibility – You can expand the tool definition to extract additional metadata fields as needed
Handling edge cases
When implementing dynamic metadata filtering, it’s important to consider and handle edge cases. In this section, we discuss some ways you can address them.
If the tool use process fails to extract metadata from the user query due to an absence of filters or errors, you have several options:
- Proceed without filters – This allows for a broad search, but may reduce precision.
- Apply a default filter – This can help maintain some level of filtering even when no specific metadata is extracted.
- Use the most common filter – If you have analytics on common user queries, you could apply the most frequently used filter.
- Strict policy handling – For cases where you want to enforce stricter policies or adhere to specific responsible AI guidelines, you might choose not to process queries that don’t yield metadata.
This approach makes sure that only queries with clear, extractable metadata are processed, potentially reducing errors and improving overall response quality.
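The options above can be combined into one small decision function. This is a sketch; the strict-mode exception and the fallback behavior are illustrative choices:

```python
def choose_filter(extracted_filter, default_filter=None, strict=False):
    # Decide how to proceed when entity extraction yields no usable filter.
    if extracted_filter:
        return extracted_filter  # use the dynamically built filter
    if strict:
        # Strict policy handling: refuse queries with no extractable metadata.
        raise ValueError("No metadata could be extracted from the query.")
    # Fall back to a default (or most common) filter; None means a broad,
    # unfiltered search.
    return default_filter
```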
Performance considerations
The dynamic approach introduces an additional FM call to extract metadata, which will increase both cost and latency. To mitigate this, consider the following:
- Use a faster, lighter FM for the metadata extraction step. This can help reduce latency and cost while still providing accurate entity extraction.
- Implement caching mechanisms for common queries to help avoid redundant FM calls.
- Monitor and optimize the performance of your metadata extraction model regularly.
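A simple in-process cache for the extraction step can be built with functools.lru_cache. The stub below stands in for the FM call; the counter only demonstrates that repeated queries skip the model:

```python
from functools import lru_cache

FM_CALLS = {"count": 0}

def _call_extraction_model(query):
    # Stand-in for the FM-based entity extraction; replace with your
    # Bedrock invocation. The counter tracks how often the "model" runs.
    FM_CALLS["count"] += 1
    return {"genre": "strategy"} if "strategy" in query else {}

@lru_cache(maxsize=1024)
def extract_entities_cached(query):
    # Identical queries hit the cache instead of triggering another FM call.
    return _call_extraction_model(query)
```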
Clean up
After you’ve finished experimenting with this solution, it’s crucial to clean up your resources to avoid unnecessary charges. For detailed cleanup instructions, see Amazon Bedrock Knowledge Bases now supports metadata filtering to improve retrieval accuracy. These steps will guide you through deleting your knowledge base, vector database, AWS Identity and Access Management (IAM) roles, and sample datasets, making sure that you don’t incur unexpected costs.
Conclusion
By implementing dynamic metadata filtering using Amazon Bedrock and Pydantic, you can significantly enhance the flexibility and power of RAG applications. This approach allows for more intuitive querying of knowledge bases, leading to improved context recall and more relevant AI-generated responses.
As you explore this technique, remember to balance the benefits of dynamic filtering against the additional computational costs. We encourage you to try this method in your own RAG applications and share your experiences with the community.
For additional resources, refer to the following:
- Retrieve data and generate AI responses with knowledge bases
- Use RAG to improve responses in generative AI applications
- Amazon Bedrock Knowledge Bases – Samples for building RAG workflows
Happy building with Amazon Bedrock!
About the Authors
Mani Khanuja is a Tech Lead – Generative AI Specialists, author of the book Applied Machine Learning and High-Performance Computing on AWS, and a member of the Board of Directors for Women in Manufacturing Education Foundation Board. She leads machine learning projects in various domains such as computer vision, natural language processing, and generative AI. She speaks at internal and external conferences such as AWS re:Invent, Women in Manufacturing West, YouTube webinars, and GHC 23. In her free time, she likes to go for long runs along the beach.
Ishan Singh is a Generative AI Data Scientist at Amazon Web Services, where he helps customers build innovative and responsible generative AI solutions and products. With a strong background in machine learning and natural language processing, Ishan specializes in developing safe and responsible AI systems that drive business value. Outside of work, he enjoys playing competitive volleyball, exploring local bike trails, and spending time with his wife and dog, Beau.
Embedding secure generative AI in mission-critical public safety applications
This post is co-written with Lawrence Zorio III from Mark43.
Public safety organizations face the challenge of accessing and analyzing vast amounts of data quickly while maintaining strict security protocols. First responders need immediate access to relevant data across multiple systems, while command staff require rapid insights for operational decisions. Mission-critical public safety applications require the highest levels of security and reliability when implementing technology capabilities. Mark43, a public safety technology company, recognized this challenge and embedded generative artificial intelligence (AI) capabilities into their application using Amazon Q Business to transform how law enforcement agencies interact with their mission-critical applications. By embedding advanced AI into their cloud-native platform, Mark43 enables officers to receive instant answers to natural language queries and automated case report summaries, reducing administrative time from minutes to seconds. This solution demonstrates how generative AI can enhance public safety operations while allowing officers to focus more time on serving their communities.
This post shows how Mark43 uses Amazon Q Business to create a secure, generative AI-powered assistant that drives operational efficiency and improves community service. We explain how they embedded Amazon Q Business web experience in their web application with low code, so they could focus on creating a rich AI experience for their customers.
Mark43’s public safety solution built on the AWS Cloud
Mark43 offers a cloud-native Public Safety Platform with powerful computer-aided dispatch (CAD), records management system (RMS), and analytics solutions, positioning agencies at the forefront of public safety technology. These solutions make sure public safety agencies have access to the essential tools and data they need to protect and serve their communities effectively. By using purpose-built Amazon Web Services (AWS) cloud services and modern software architecture, Mark43 delivers an intuitive, user-friendly experience that empowers both frontline personnel and command staff. The solution’s advanced analytical capabilities provide real-time insights to support data-driven decision making and enhance operational efficiency. Mark43 has built a robust and resilient microservices architecture using a combination of serverless technologies, such as AWS Lambda, AWS Fargate, and Amazon Elastic Compute Cloud (Amazon EC2). They use event-driven architectures, real-time processing, and purpose-built AWS services for hosting data and running analytics. This, combined with integrated AI capabilities, positions Mark43 to drive innovation in the industry. With its Open API architecture built on AWS and 100+ integrations, Mark43 connects to the applications and data sources agencies rely on for unmatched insights, situational awareness and decision support. This modern data foundation built on AWS allows agencies to leverage the latest technologies and AI models, keeping pace with the evolving technology landscape.
Opportunity for innovation with generative AI
Agency stakeholders have mission-critical roles that demand significant time and administrative interactions with core solutions. With its cloud-native CAD and RMS, Mark43 brought the modern solutions long established in other industries to police forces, replacing legacy systems and making them more efficient. Now, Mark43 values the opportunity to leverage AI to support the next evolution of innovative technology to drive efficiencies, enhance situational awareness, and support better public safety outcomes.
Leading agencies are embracing AI by setting high standards for data integrity and security, implementing a central strategy to prevent unauthorized use of consumer AI tools, and ensuring a human-in-the-loop approach. Meanwhile, value-add AI tools should seamlessly integrate with existing workflows and applications to prevent sprawl to yet more tools adding unwanted complexity. Mark43 and AWS worked backwards from these requirements to bring secure, easy-to-use, and valuable AI to public safety.
AWS collaborated with Mark43 to embed a frictionless AI assistant directly into their core products, CAD and RMS, for first responders and command staff. Together, we harnessed the power of AI into a secure, familiar, existing workflow with a low barrier to entry for adoption across the user base. The assistant enables first responders to search information, receive summaries, and complete tasks based on their authorized data access within Mark43’s systems, reducing the time needed to capture high value insights.
In just a few weeks, Mark43 deployed an Amazon Q Business application, integrated their data sources using Amazon Q Business built-in data connectors, embedded the Amazon Q Business application into their native app, tested and tuned responses to prompts, and completed a successful beta version of the assistant with their end users. Figure 1 depicts the overall architecture of Mark43’s application using Amazon Q Business.
Mark43’s solution uses the Amazon Q Business built-in data connectors to unite information from various enterprise applications, document repositories, chat applications, and knowledge management systems. The implementation draws data from objects stored in Amazon Simple Storage Service (Amazon S3) in addition to structured records stored in Amazon Relational Database Service (Amazon RDS). Amazon Q Business automatically uses the data from these sources as context to answer prompts from users of the AI assistant without requiring Mark43 to build and maintain a retrieval augmented generation (RAG) pipeline.
Amazon Q Business provides a chat interface web experience with a web address hosted by AWS. To embed the Amazon Q Business web experience in Mark43’s web application, Mark43 first allowlisted their web application domain using the Amazon Q Business console. Then, Mark43 added an inline frame (iframe) HTML component to their web application with the src attribute set to the web address of the Amazon Q Business web experience. For example, <iframe src="Amazon Q Business web experience URL"/>. This integration requires a small amount of development effort to create a seamless experience for Mark43 end users.
Security remains paramount in the implementation. Amazon Q Business integrates with Mark43’s existing identity and access management protocols, making sure that users can only access information to which they’re authorized. If a user doesn’t have access to the data outside of Amazon Q Business, then they cannot access the data within Amazon Q Business. The AI assistant respects the same data access restrictions that apply to users in their normal workflow. With Amazon Q Business administrative controls and guardrails, Mark43 administrators can block specific topics and filter both questions and answers using keywords, verifying that responses align with public safety agency guidelines and protocols.
Mark43 is committed to the responsible use of AI. We believe in transparency, informing our users that they’re interacting with an AI solution. We strongly recommend human-in-the-loop review for critical decisions. Importantly, our AI assistant’s responses are limited to authorized data sources only, not drawing from general Large Language Model (LLM) knowledge. Additionally, we’ve implemented Amazon Q’s guardrails to filter undesirable topics, further enhancing the safety and reliability of our AI-driven solutions.
“Mark43 is committed to empowering communities and governments with a modern public safety solution that elevates both safety and quality of life. When we sought a seamless way to integrate AI-powered search and summarization, Amazon Q Business proved the ideal solution.
We added Amazon Q Business into our solution, reaffirming our dedication to equipping agencies with resilient, dependable and powerful technology. Amazon Q’s precision in extracting insights from complex data sources provides law enforcement with immediate access to information, reducing administrative burden from minutes to seconds. By adding Amazon Q into our portfolio of AWS services, we continue to deliver an efficient, intuitive user experience which enables officers to stay focused on serving their communities but also empowers command staff to quickly interpret data and share insights with stakeholders, supporting real-time situational awareness and operational efficiency,”
– Bob Hughes, CEO Mark43, GovTech Provider of Public Safety Solutions.
Mark43 customer feedback
Public safety agencies are excited about the potential for AI-powered search to enhance investigations, drive real-time decision support, and increase situational awareness. At the International Association of Chiefs of Police (IACP) conference in Boston in October 2024, one agency that viewed a demo described it as a “game-changer,” while another recognized the value of AI capabilities to make officer training programs more efficient. Agency stakeholders noted that AI-powered search will democratize insights across the agency, allowing them to spend more time on higher-value work instead of answering basic questions.
Conclusion
In this post, we showed you how Mark43 embedded Amazon Q Business web experience into their public safety solution to transform how law enforcement agencies interact with mission-critical applications. Through this integration, Mark43 demonstrated how AI can reduce administrative tasks from minutes to seconds while maintaining the high levels of security required for law enforcement operations.
Looking ahead, Mark43 plans to expand their Amazon Q Business integration with a focus on continuous improvements to the user experience. Mark43 will continue to empower law enforcement with the most advanced, resilient, and user-friendly public safety technology powered by Amazon Q Business and other AWS AI services.
Visit the Amazon Q Business User Guide to learn more about how to embed generative AI into your applications. Request a demo with Mark43 and learn how your agency can benefit from Amazon Q Business in public safety software.
About the authors
Lawrence Zorio III serves as the Chief Information Security Officer at Mark43, where he leads a team of cybersecurity professionals dedicated to safeguarding the confidentiality, integrity, and availability (CIA) of enterprise and customer data, assets, networks, and products. His leadership ensures Mark43’s security strategy aligns with the unique requirements of public safety agencies worldwide. With over 20 years of global cybersecurity experience across the Public Safety, Finance, Healthcare, and Technology sectors, Zorio is a recognized leader in the field. He chairs the Integrated Justice Information System (IJIS) Cybersecurity Working Group, where he helps develop standards, best practices, and recommendations aimed at strengthening cybersecurity defenses against rising cyber threats. Zorio also serves as an advisor to universities and emerging technology firms. Zorio holds a Bachelor of Science in Business Information Systems from the University of Massachusetts Dartmouth and a Master of Science in Innovation from Northeastern University’s D’Amore-McKim School of Business. He has been featured in various news publications, authored multiple security-focused white papers, and is a frequent speaker at industry events.
Ritesh Shah is a Senior Generative AI Specialist at AWS. He partners with customers like Mark43 to drive AI adoption, resulting in millions of dollars in top and bottom line impact for these customers. Outside work, Ritesh tries to be a dad to his AWSome daughter. Connect with him on LinkedIn.
Prajwal Shetty is a GovTech Solutions Architect at AWS and collaborates with Justice and Public Safety (JPS) customers like Mark43. He designs purpose-driven solutions that foster an efficient and secure society, enabling organizations to better serve their communities through innovative technology. Connect with him on LinkedIn.
Garrett Kopeski is an Enterprise GovTech Senior Account Manager at AWS responsible for the business relationship with Justice and Public Safety partners such as Mark43. Garrett collaborates with his customers’ Executive Leadership Teams to connect their business objectives with AWS powered initiatives and projects. Outside work, Garrett pursues physical fitness challenges when he’s not chasing his energetic 2-year-old son. Connect with him on LinkedIn.
Bobby Williams is a Senior Solutions Architect at AWS. He has decades of experience designing, building, and supporting enterprise software solutions that scale globally. He works on solutions across industry verticals and horizontals and is driven to create a delightful experience for every customer.
How FP8 boosts LLM training by 18% on Amazon SageMaker P5 instances
Large language models (LLMs) are AI systems trained on vast amounts of text data, enabling them to understand, generate, and reason with natural language in highly capable and flexible ways. LLM training has seen remarkable advances in recent years, with organizations pushing the boundaries of what’s possible in terms of model size, performance, and efficiency. In this post, we explore how FP8 optimization can significantly speed up large model training on Amazon SageMaker P5 instances.
LLM training using SageMaker P5
In 2023, SageMaker announced P5 instances, which support up to eight of the latest NVIDIA H100 Tensor Core GPUs. Equipped with high-bandwidth networking technologies like Elastic Fabric Adapter (EFA), P5 instances provide a powerful platform for distributed training, enabling large models to be trained in parallel across multiple nodes. Using Amazon SageMaker Model Training on P5 instances, organizations have achieved higher training speed and efficiency, showcasing the transformative potential of training models of different scales faster and more efficiently with SageMaker Training.
LLM training using FP8
P5 instances, which are powered by NVIDIA H100 GPUs, also support training models in FP8 precision. The FP8 data type has emerged as a game changer in LLM training. By reducing the precision of the model’s weights and activations, FP8 enables more efficient memory usage and faster computation without significantly impacting model quality. Matrix operations such as multiplications and convolutions achieve much higher throughput on 8-bit float tensors than on 32-bit float tensors. FP8 precision reduces the data footprint and computational requirements, making it ideal for large-scale models where memory and speed are critical. This enables researchers to train larger models with the same hardware resources, or to train models faster while maintaining comparable performance. To make models compatible with FP8, NVIDIA released the Transformer Engine (TE) library, which provides FP8 support for layers such as Linear, LayerNorm, and DotProductAttention. To enable FP8 training, models need to use the TE API to incorporate these layers, which are then cast to FP8. For example, the following Python code shows how FP8-compatible layers can be integrated:
try:
    import transformer_engine.pytorch as te
    using_te = True
except ImportError:
    using_te = False
...
# Fall back to the standard PyTorch layer when Transformer Engine is unavailable
linear_type: type[nn.Module] = te.Linear if using_te else nn.Linear
...
in_proj = linear_type(dim, 3 * n_heads * head_dim, bias=False, device='cuda')
out_proj = linear_type(n_heads * head_dim, dim, bias=False, device='cuda')
...
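Beyond swapping in TE layers, FP8 execution is typically enabled by wrapping the forward pass in TE’s fp8_autocast context with a scaling recipe. The following sketch (the recipe values are illustrative, and the helper falls back to a no-op context when TE is not installed) shows one way to structure this:

```python
import contextlib

try:
    import transformer_engine.pytorch as te
    from transformer_engine.common.recipe import DelayedScaling, Format
    using_te = True
except ImportError:
    using_te = False

def fp8_context():
    """Return an FP8 autocast context when Transformer Engine is available, a no-op otherwise."""
    if using_te:
        # Illustrative recipe: HYBRID uses E4M3 for forward tensors, E5M2 for gradients
        recipe = DelayedScaling(fp8_format=Format.HYBRID, amax_history_len=16)
        return te.fp8_autocast(enabled=True, fp8_recipe=recipe)
    return contextlib.nullcontext()

# A training step would wrap the forward pass:
with fp8_context():
    pass  # loss = model(batch)
```

Because the context degrades to a no-op, the same training script runs unchanged on hardware without FP8 support.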
Results
We ran tests with 1B-parameter and 7B-parameter LLMs, training each with and without FP8. Each test ran on 24 billion tokens for one epoch, providing a comparison of throughput (in tokens per second per GPU) and model performance (in loss). For the 1B-parameter model, we compared performance with and without FP8 across different numbers of instances for distributed training. The following table summarizes our results:
| Number of P5 Nodes | Without FP8: Tokens/sec/GPU | Without FP8: % Decrease | Without FP8: Loss After 1 Epoch | With FP8: Tokens/sec/GPU | With FP8: % Decrease | With FP8: Loss After 1 Epoch | % Faster with FP8 | % Higher Loss with FP8 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 40200 | – | 6.205 | 40800 | – | 6.395 | 1.49 | 3.06 |
| 2 | 38500 | 4.2288 | 6.211 | 41600 | -3.4825 | 6.338 | 8.05 | 2.04 |
| 4 | 39500 | 1.7412 | 6.244 | 42000 | -4.4776 | 6.402 | 6.32 | 2.53 |
| 8 | 38200 | 4.9751 | 6.156 | 41800 | -3.98 | 6.365 | 9.42 | 3.39 |
| 16 | 35500 | 11.6915 | 6.024 | 39500 | 1.7412 | 6.223 | 11.26 | 3.3 |
| 32 | 33500 | 16.6667 | 6.112 | 38000 | 5.4726 | 6.264 | 13.43 | 2.48 |
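To put these throughput numbers in perspective, a back-of-the-envelope conversion to wall-clock time can be helpful. The sketch below is illustrative only: it uses the single-node 1B-parameter figures from the table above and ignores data loading, checkpointing, and other overhead.

```python
def epoch_hours(total_tokens: float, tokens_per_sec_per_gpu: float, num_gpus: int) -> float:
    """Rough wall-clock estimate (in hours) for one epoch, ignoring overhead."""
    return total_tokens / (tokens_per_sec_per_gpu * num_gpus) / 3600

# Single P5 node = 8 H100 GPUs; 24B tokens per epoch as in the tests above
baseline = epoch_hours(24e9, 40200, 8)  # without FP8
fp8 = epoch_hours(24e9, 40800, 8)       # with FP8
print(f"{baseline:.1f} h vs {fp8:.1f} h")  # → 20.7 h vs 20.4 h
```

Even the modest single-node speedup compounds across repeated training runs; the larger multi-node gains in the table translate to proportionally larger savings.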
The following graph shows the throughput performance of the 1B-parameter model, in tokens/second/GPU, across different numbers of P5 instances:
For the 7B-parameter model, we likewise compared performance with and without FP8 across different numbers of instances for distributed training. The following table summarizes our results:
| Number of P5 Nodes | Without FP8: Tokens/sec/GPU | Without FP8: % Decrease | Without FP8: Loss After 1 Epoch | With FP8: Tokens/sec/GPU | With FP8: % Decrease | With FP8: Loss After 1 Epoch | % Faster with FP8 | % Higher Loss with FP8 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 9350 | – | 6.595 | 11000 | – | 6.602 | 15 | 0.11 |
| 2 | 9400 | -0.5347 | 6.688 | 10750 | 2.2935 | 6.695 | 12.56 | 0.1 |
| 4 | 9300 | 0.5347 | 6.642 | 10600 | 3.6697 | 6.634 | 12.26 | -0.12 |
| 8 | 9250 | 1.0695 | 6.612 | 10400 | 4.9541 | 6.652 | 11.06 | 0.6 |
| 16 | 8700 | 6.9518 | 6.594 | 10100 | 8.7155 | 6.644 | 13.86 | 0.76 |
| 32 | 7900 | 15.508 | 6.523 | 9700 | 11.8182 | 6.649 | 18.56 | 1.93 |
The following graph shows the throughput performance of the 7B-parameter model, in tokens/second/GPU, across different numbers of P5 instances:
The preceding tables show that with FP8, training the 1B-parameter model is up to 13% faster and training the 7B-parameter model is up to 18% faster. The speedup comes with a trade-off: loss decreases somewhat more slowly with FP8. However, the impact on model performance after one epoch remains minimal, with only about 3% higher loss for the 1B model and 2% higher loss for the 7B model compared to training without FP8. The following graph illustrates the loss performance.
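The small loss gap reflects FP8’s coarser numeric resolution: E4M3 (the format TE typically uses for forward-pass tensors) has only 3 mantissa bits and saturates at 448. The following self-contained sketch (a simplified rounding model, not TE’s actual kernel code) quantizes a few values to their nearest E4M3-representable neighbors to show the magnitude of rounding error involved:

```python
import math

E4M3_MAX = 448.0  # largest finite value in the E4M3 (FN) format

def quantize_e4m3(v: float) -> float:
    """Round a float to the nearest representable FP8 E4M3 value (1 sign + 4 exponent + 3 mantissa bits)."""
    if v == 0.0:
        return 0.0
    sign = math.copysign(1.0, v)
    mag = min(abs(v), E4M3_MAX)                 # saturate instead of overflowing
    exp = max(math.floor(math.log2(mag)), -6)   # clamp exponent to the subnormal range
    step = 2.0 ** exp / 8                       # 3 mantissa bits -> 8 steps per power of two
    return sign * min(round(mag / step) * step, E4M3_MAX)

for x in (0.3, 3.3, 500.0):
    q = quantize_e4m3(x)
    print(f"{x} -> {q} (relative error {abs(q - x) / x:.1%})")
```

Per-tensor relative errors of a few percent are what the TE scaling recipes are designed to manage, which is why the end-of-epoch loss is only slightly higher rather than degrading outright.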
As discussed in Scalable multi-node training with TensorFlow, a small decline in overall throughput is observed as the number of nodes increases, due to inter-node communication.
The impact on LLM training and beyond
The use of FP8 precision combined with SageMaker P5 instances has significant implications for the field of LLM training. By demonstrating the feasibility and effectiveness of this approach, it opens the door for other researchers and organizations to adopt similar techniques, accelerating progress in large model training. Moreover, the benefits of FP8 and advanced hardware extend beyond LLM training. These advancements can also accelerate research in fields like computer vision and reinforcement learning by enabling the training of larger, more complex models with fewer resources, ultimately saving time and cost. In terms of inference, models with FP8 activations have been shown to deliver up to a two-fold improvement over BF16 models.
Conclusion
The adoption of FP8 precision and SageMaker P5 instances marks a significant milestone in the evolution of LLM training. By pushing the boundaries of model size, training speed, and efficiency, these advancements have opened up new possibilities for research and innovation in large models. As the AI community builds on these technological strides, we can expect even more breakthroughs in the future. Ongoing research is exploring further improvements through techniques such as PyTorch 2.0 Fully Sharded Data Parallel (FSDP) and TorchCompile. Coupling these advancements with FP8 training could lead to even faster and more efficient LLM training. For those interested in the potential impact of FP8, experiments with 1B or 7B models, such as GPT-Neo or Meta Llama 2, on SageMaker P5 instances could offer valuable insights into the performance differences compared to FP16 or FP32.
About the Authors
Romil Shah is a Sr. Data Scientist at AWS Professional Services. Romil has more than 8 years of industry experience in computer vision, machine learning, generative AI, and IoT edge devices. He works with customers, helping them train, optimize, and deploy foundation models for edge devices and the cloud.
Mike Garrison is a Global Solutions Architect based in Ypsilanti, Michigan. Utilizing his twenty years of experience, he helps accelerate the tech transformation of automotive companies. In his free time, he enjoys playing video games and traveling.