Generate AWS Resilience Hub findings in natural language using Amazon Bedrock

Resilient architectures are the foundation upon which successful businesses are built. However, keeping up with the latest advancements and making sure your systems are resilient can be a daunting task. Between monitoring, analyzing, and documenting architectural findings, crucial information can slip through the cracks, leaving your organization exposed to potential risks and inefficiencies. Even when architectural assessments are conducted, the reports can be highly technical and challenging for key stakeholders to comprehend.

In this post, we explore how to use the power of AWS Resilience Hub and Amazon Bedrock to bridge this gap and streamline the process of sharing architectural findings across your organization. We walk through a solution that uses the generative AI capabilities of Amazon Bedrock to translate technical reports into concise, natural language summaries, making them accessible to a broader audience.

By using the capabilities of Resilience Hub and Amazon Bedrock, you can share findings with C-suite executives, engineers, managers, and other personas within your corporation to provide better visibility over maintaining a resilient architecture.

Solution overview

By combining Resilience Hub and Amazon Bedrock, you can generate architectural findings in natural language to save time, better understand Recovery Time Objective (RTO) and Recovery Point Objective (RPO) requirements, and distribute assessments through a clear and concise view. Resilience Hub is a central location on the AWS Management Console to manage, define, and assess resilience goals with recommendations based on the AWS Well-Architected Framework. Amazon Bedrock is a fully managed service to build generative AI applications with foundation models (FMs) from leading AI companies such as Anthropic, Mistral AI, Meta, Stability AI, Cohere, AI21 Labs, and Amazon through a single API. Amazon Bedrock allows for integrating generative AI solutions within your application with the ability to test, fine-tune, and customize top FMs based on your use case.

The solution presented in this post uses Amazon Cognito to log in to a sample UI, which invokes AWS Lambda functions that send prompts to large language models (LLMs) through Amazon Bedrock. Resilience Hub provides resiliency and operational recommendations that include alarms, standard operating procedures (SOPs), and fault injection experiments through AWS Fault Injection Service (FIS). After the assessment Amazon Resource Name (ARN) is retrieved from Resilience Hub, the findings are summarized in natural language so you can share them with other users.

The following diagram illustrates the solution architecture.

The solution workflow includes the following steps:

  1. The user is authenticated through Amazon Cognito with a user name and password.
  2. The user accesses the main UI through Amazon CloudFront, which runs a single-page application hosted on Amazon Simple Storage Service (Amazon S3).
  3. Amazon API Gateway validates the access token with Amazon Cognito, then uses a Lambda function as the integration target.
  4. Lambda gathers the most recent assessment ARN from your published applications in Resilience Hub.
  5. A second Lambda function invokes the Amazon Bedrock API.
  6. Amazon Bedrock processes the assessment and uses prompt engineering techniques to generate the report in natural language based on target personas.
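
To make steps 4-6 concrete, the following minimal Python sketch shows one way a Lambda function could fetch the most recent assessment for an application and ask a Bedrock model for a persona-specific summary. The model ID, prompt wording, and lack of error handling are illustrative assumptions; the actual implementation lives in the GitHub repo.

    import json
    import boto3

    resiliencehub = boto3.client("resiliencehub")
    bedrock = boto3.client("bedrock-runtime")

    def summarize_latest_assessment(app_arn, persona="executive"):
        # Step 4: gather the most recent assessment ARN for the published application
        summaries = resiliencehub.list_app_assessments(appArn=app_arn)["assessmentSummaries"]
        latest = max(summaries, key=lambda s: s.get("endTime", s["startTime"]))
        assessment = resiliencehub.describe_app_assessment(
            assessmentArn=latest["assessmentArn"]
        )["assessment"]

        # Steps 5-6: invoke Amazon Bedrock with a persona-specific prompt
        prompt = (
            f"Summarize the following AWS Resilience Hub assessment for a {persona} "
            f"audience in plain language, highlighting RTO/RPO gaps and next steps:\n"
            f"{json.dumps(assessment, default=str)}"
        )
        response = bedrock.converse(
            modelId="anthropic.claude-3-haiku-20240307-v1:0",  # assumed model choice
            messages=[{"role": "user", "content": [{"text": prompt}]}],
            inferenceConfig={"maxTokens": 1024, "temperature": 0.2},
        )
        return response["output"]["message"]["content"][0]["text"]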

Prerequisites

For this walkthrough, the following are required:

Deploy solution resources

You can deploy the solution using a CloudFormation template, found in the GitHub repo, to automatically provision the necessary resources in your AWS account. You provision the Amazon S3 hosted UI using the AWS CDK.

Run the solution

Complete the following steps to run the solution:

  1. Within your terminal or preferred integrated development environment (IDE), run the following commands:
    git clone https://github.com/aws-samples/resilience-hub-genai.git 
    cd resilience-hub-genai/backend

  2. Using the text editor (vim, nano, notepad) of your choice, replace EMAIL in the constants.py file with your email.
  3. Deploy with the following code:
    cd ..
    ./deploy.sh

Wait for the CloudFormation template to successfully launch. This template takes approximately 10 minutes to deploy.

  4. On the AWS CloudFormation console, on the stack’s Outputs tab, locate the public-facing URL for your web application (labeled CLOUDFRONTDISTRIBUTION).

You should receive an email containing a temporary password; your user name is the email address you provided in the constants.py file.

  5. Log in using the provided credentials, then confirm the password change.
  6. In the UI, choose Report in the navigation pane.
  7. For Persona, choose your desired persona.
  8. For Application, choose your desired application from the list of existing published applications.
  9. Choose Generate Report to review the concise, summarized report generated from the most recent assessment, which is ready for distribution.

Review the summary

This solution includes an example summary generated from a sample stack for the executive persona. Due to the nature of generative AI, your results may vary slightly, but they will look similar to the following screenshot.

Clean up

To clean up the solution, complete the following steps:

  1. On the AWS CloudFormation console, delete the CloudFormation stack you created earlier.
  2. If you downloaded the sample CloudFormation template to assess in Resilience Hub, delete that stack as well.
  3. On the Resilience Hub console, delete the newly created application. This will delete the assessments.

Conclusion

In this post, we discussed how Resilience Hub and Amazon Bedrock can greatly improve the maintenance and evaluation of resilient architectures in your organization. This solution automates the translation of technical architectural findings into natural language summaries, making critical information accessible to various stakeholders, including C-suite executives, auditors, and managers. Streamlined communication leads to improved understanding and faster decision-making, ultimately benefiting your business operations. Integrating AWS services such as Lambda and Amazon Cognito further automates and simplifies the workflow, providing a seamless experience from assessment to reporting.

Ready to enhance your organization’s architectural resilience? Deploy the solution today and begin transforming your technical reports into concise summaries by following the steps outlined in this post. This allows stakeholders to access important information, promoting informed decision-making and a resilient culture.


About the Authors

Ibrahim Ahmad is a Solutions Architect at AWS with a focus in resilience and machine learning. He builds solutions for government technology customers to scale and modernize their cloud solutions. Outside of work, he loves to spend time with friends and family, work out, and race cars.

Mike P. is a Sr. Solutions Architect at AWS based in South Florida. He specializes in helping customers use AWS services to enhance their security posture and explore the potential of generative AI technologies. Mike works closely with organizations to design and implement robust security solutions while exploring innovative use cases for generative AI.

Leland Johnson is a Sr. Solutions Architect for AWS focusing on travel and hospitality. As a Solutions Architect, he plays a crucial role in guiding customers through their cloud journey by designing scalable and secure cloud solutions. Outside of work, he enjoys playing music and flying light aircraft.

Read More

Generate and evaluate images in Amazon Bedrock with Amazon Titan Image Generator G1 v2 and Anthropic Claude 3.5 Sonnet

Recent enhancements in the field of generative AI, such as media generation technologies, are rapidly transforming the way businesses create and manipulate visual content. Amazon Bedrock is a fully managed service that offers a choice of high-performing foundation models (FMs) from leading AI companies such as AI21 Labs, Anthropic, Cohere, Meta, Mistral AI, Stability AI, and Amazon through a single API, along with a broad set of capabilities to build generative AI applications with security, privacy, and responsible AI. With that, it brings functionalities such as model customization, fine-tuning, and Retrieval Augmented Generation (RAG).

In your business, you might want to use those capabilities to improve the user experience and generate media content—such as images, diagrams, infographics or custom shapes—and understand the level of confidence of that generated content according to another model or even a customized, pre-trained evaluation model, with data and parameters from your own organization.

In this post, we demonstrate how to interact with the Amazon Titan Image Generator G1 v2 model on Amazon Bedrock to generate an image. Then, we show you how to use Anthropic’s Claude 3.5 Sonnet on Amazon Bedrock to describe it, evaluate it with a score from 1–10, explain the reason behind the given score, and suggest improvements to the image. Amazon Titan Image Generator G1 v2 was recently released on Amazon Bedrock, bringing new features to image generation; Anthropic’s Claude 3.5 Sonnet, also newly released, sets new industry benchmarks for graduate-level reasoning and for grasping complex instructions.

Amazon Titan Image Generator G1 v2

Exclusive to Amazon Bedrock, the Amazon Titan family of models incorporates Amazon’s 25 years of experience innovating with AI and machine learning (ML) across its business. Amazon Titan Image Generator allows content creators to quickly generate high-quality, realistic images using simple English text prompts, and returns studio-quality images suitable for advertising, ecommerce, and entertainment.

The newly announced Amazon Titan Image Generator G1 v2 expands its initial version by allowing you to guide image creation using reference images, edit existing visuals, remove backgrounds, generate image variations, and securely customize the model to maintain brand style and subject consistency.

Anthropic Claude 3.5 Sonnet

Anthropic Claude 3.5 Sonnet raises the industry bar for intelligence, outperforming other generative AI models on a wide range of evaluations, including Anthropic’s previously most intelligent model, Anthropic Claude 3 Opus. Anthropic Claude 3.5 Sonnet is available on Amazon Bedrock with the speed and cost of the original Anthropic Claude 3 Sonnet model.

Solution overview

This solution runs in the us-east-1 AWS Region. It exposes an API endpoint through Amazon API Gateway that proxies the initial prompt request to a Python-based AWS Lambda function, which calls Amazon Bedrock twice. The following diagram illustrates the flow of events.

Solution Architecture Diagram

  1. Users or applications submit a prompt as an API request.
  2. The prompt and parameters are passed to Amazon Bedrock using an inference API called by the Lambda function.
  3. Amazon Bedrock generates a high-quality image based on the prompt with Amazon Titan Image Generator G1 v2.
  4. The Lambda function sends the image bytes and the original prompt to Anthropic’s Claude 3.5 Sonnet on Amazon Bedrock.
  5. Anthropic’s Claude 3.5 Sonnet evaluates the generated image against the original prompt.
  6. The Lambda function saves the image to an Amazon Simple Storage Service (Amazon S3) bucket and generates a pre-signed URL.
  7. The pre-signed URL and the evaluation are returned as an API response in JSON format.

Ultimately, the function saves the image in an S3 bucket and generates a pre-signed URL, returning it and the evaluation summary as the API response.

API Gateway proxies the request to a Lambda function that uses the Python Boto3 library to call Amazon Titan Image Generator v2 on Amazon Bedrock to generate the image and then decodes the image bytes. Then, it passes the image and an evaluation prompt through a multimodal call to Anthropic’s Claude 3.5 Sonnet and, after receiving the score, saves the image to Amazon S3, generates a pre-signed URL, and returns the complete response.
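
The following condensed Python sketch illustrates that flow. The bucket name, object key, model IDs, and evaluation prompt are illustrative assumptions rather than the exact Lambda code deployed by the stack.

    import base64
    import json
    import boto3

    bedrock = boto3.client("bedrock-runtime")
    s3 = boto3.client("s3")
    BUCKET = "image-gen-your-initials"  # the bucket name you provide as a stack parameter

    def generate_and_evaluate(prompt):
        # 1) Generate the image with Amazon Titan Image Generator G1 v2
        gen = bedrock.invoke_model(
            modelId="amazon.titan-image-generator-v2:0",
            body=json.dumps({
                "taskType": "TEXT_IMAGE",
                "textToImageParams": {"text": prompt},
                "imageGenerationConfig": {"numberOfImages": 1, "height": 1024, "width": 1024},
            }),
        )
        image_bytes = base64.b64decode(json.loads(gen["body"].read())["images"][0])

        # 2) Ask Anthropic's Claude 3.5 Sonnet to describe and score the image
        eval_prompt = (
            f"Describe this image, score how well it matches the prompt '{prompt}' "
            f"from 1 to 10, explain the score, and suggest prompt improvements. "
            f"Answer as JSON with keys description, score, reason, suggestions."
        )
        evaluation = bedrock.converse(
            modelId="anthropic.claude-3-5-sonnet-20240620-v1:0",
            messages=[{
                "role": "user",
                "content": [
                    {"image": {"format": "png", "source": {"bytes": image_bytes}}},
                    {"text": eval_prompt},
                ],
            }],
        )["output"]["message"]["content"][0]["text"]

        # 3) Save the image to Amazon S3 and return a pre-signed URL plus the evaluation
        key = "generated/image.png"
        s3.put_object(Bucket=BUCKET, Key=key, Body=image_bytes)
        url = s3.generate_presigned_url(
            "get_object", Params={"Bucket": BUCKET, "Key": key}, ExpiresIn=3600
        )
        # The prompt asks the model to reply as JSON; production code should
        # validate and handle malformed model output before parsing.
        return {"url": url, "evaluation_data": json.loads(evaluation)}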

Prerequisites

You should have the following prerequisites:

  • An AWS account to create and manage the necessary AWS resources for this solution
  • Amazon Titan Image Generator G1 v2 and Anthropic Claude 3.5 Sonnet models enabled on Amazon Bedrock in AWS Region us-east-1

Provision the solution

You can build the solution architecture using AWS CloudFormation. A single YAML file contains the infrastructure, including AWS Identity and Access Management (IAM) users, policies, API methods, the S3 bucket, and the Lambda function code. Complete the following steps to set up the solution resources:

  1. Sign in to the AWS Management Console as an IAM administrator or appropriate IAM user.
  2. Choose Launch Stack to deploy the CloudFormation template.

  3. Choose Next.
  4. In the Parameters section, enter the following:
    • A name for the new S3 bucket that will receive the images (for example, image-gen-your-initials)
    • The name of an existing S3 bucket where access logs will be stored
    • A token that you will use to authenticate your API (a string of your choice)
  5. After entering the parameters, choose Next.
  6. Choose Next again.
  7. Acknowledge the creation of IAM resources and choose Submit.

When the stack status is CREATE_COMPLETE, navigate to the Outputs tab and find the API information. Copy the ApiId, ApiUrl, and ResourceId values to a safe place before continuing to the tests.

Output of CloudFormation

Test the solution

You can test the deployed API by calling it with a programming language of your choice (Python, React, and so on), using the console, a terminal window, or the AWS Command Line Interface (AWS CLI). In this post, we review the console, the terminal, and the AWS CLI. For a visual reference, the following picture is a rendered representation of the image and its evaluation using Streamlit (Python) and the prompt "a black cat in an alleyway with blue eyes".

Test using Python Streamlit

Note that the use of Amazon Bedrock is subject to the AWS Responsible AI Policy. If you encounter errors, or if generations or evaluations are being blocked, your prompt might conflict with the AWS Acceptable Use Policy or the AWS Responsible AI Policy. Retry with a different prompt that adheres to the policy.

Test the solution using the console

Complete the following steps to test the solution using the console:

  1. On the API Gateway console, choose APIs in the navigation pane.
  2. On the APIs list, choose BedrockImageGenEval.
  3. In the Resources section, select the POST method below /generate-image.
  4. Choose the Test tab in the method execution settings.
  5. In the Request body section, enter the following JSON structure:
    {"prompt": "your prompt"}
  6. Choose Test.

Test using API Gateway console

Test the solution using the AWS CLI

To test the solution using the AWS CLI, make sure you have the latest version installed and configured. For instructions, see Install or update to the latest version of the AWS CLI. For configuration, see Configure the AWS CLI. Then complete the following steps:

  1. Retrieve the ApiId and ResourceId information you saved from the Outputs tab.
  2. In an environment running AWS CLI, run the following command:
aws apigateway test-invoke-method --rest-api-id ApiId --resource-id ResourceId --http-method POST --path-with-query-string "/generate-image" --body '{"prompt":"your prompt"}' | grep '"body"' | sed 's/.*"body": "\(.*\)".*/\1/' | sed 's/\\//g'

Test using AWS CLI

Test the solution using the terminal

To test the solution using a terminal window, you need to have the curl tool installed. After you have it, run the following command:

curl -X POST YOUR_API_URL -H 'Authorization: YOUR_TOKEN_STRING_PARAMETER' -H 'Content-Type: application/json' -d '{"prompt":"your prompt"}'

Test using terminal

Regardless of your choice, you should get a response with the following JSON structure:

{"url": "You pre-signed S3 URL",

"evaluation_data": {

"description": "Short description of your image given by the model",

"score": "9",

"reason": "Reason for the score above",

"suggestions": "Suggestions to improve the prompt to have better results."}}

Clean up

To avoid incurring future charges, clean up all the AWS resources that you created using CloudFormation. You can delete these resources on the console or using the AWS CLI. To clean up using the console:

  1. On the Amazon S3 console, empty the S3 bucket that you created and delete it.
  2. On the CloudFormation console, select the stack and choose Delete.

Conclusion

In this post, we demonstrated how to use Amazon Titan Image Generator G1 v2 and Anthropic’s Claude 3.5 Sonnet on Amazon Bedrock to generate and evaluate media assets (images) and create accurate, fine-grained, exclusive content for your users or internal business case. Thanks to the multimodal capabilities of Amazon Bedrock models, you can apply this approach to other content types and tasks, such as documents, summarization, translation, and more.

We encourage you to learn and experiment with Amazon Bedrock capabilities, such as how to customize a model to use your own data for generation or evaluation, or try different models and apply security guardrails to have standardized safety controls over the content generated.


About the Author

Raul Tavares is a Solutions Architect focused on games customers across EMEA. With a strong engineering approach, when not knee-deep in cloud architecture, you can find him transforming ideas into solutions, writing code samples or listening to some Japanese heavy metal bands to relax.

Read More

How InsuranceDekho transformed insurance agent interactions using Amazon Bedrock and generative AI

This post is co-authored with Nishant Gupta from InsuranceDekho.

The insurance industry is complex and overwhelming, with numerous options that can be hard for consumers to understand. This complexity hinders customers from making informed decisions. As a result, customers face challenges in selecting the right insurance coverage, while insurance aggregators and agents struggle to provide clear and accurate information.

InsuranceDekho is a leading InsurTech service that offers a wide range of insurance products from over 49 insurance companies in India. The service operates through a vast network of 150,000 point of sale person (POSP) agents and direct-to-customer channels. InsuranceDekho uses cutting-edge technology to simplify the insurance purchase process for all users. The company’s mission is to make insurance transparent, accessible, and hassle-free for all customers through tech-driven solutions.

In this post, we explain how InsuranceDekho harnessed the power of generative AI using Amazon Bedrock and Anthropic’s Claude to provide responses to customer queries on policy coverages, exclusions, and more. This lets our customer care agents and POSPs confidently help our customers understand the policies without reaching out to insurance subject matter experts (SMEs) or memorizing complex plans while providing sales and after-sales services. The use of this solution has improved sales, cross-selling, and overall customer service experience.

Amazon Bedrock provided the flexibility to explore various leading LLM models using a single API, reducing the undifferentiated heavy lifting associated with hosting third-party models. Leveraging this, InsuranceDekho developed the industry’s first Health Pro Genie with the most efficient engine. It facilitates the insurance agents to choose the right plan for the end customer from the pool of over 125 health plans from 21 different health insurers available on the InsuranceDekho platform.

– Ish Babbar, Co-Founder and CTO, InsuranceDekho

The challenge

InsuranceDekho faced a significant challenge in responding to customer queries on insurance products in a timely manner. For a given lead, the insurance advisors, particularly those who are new to insurance, would often reach out to SMEs to inquire about policy or product-specific queries. The added step of SME consultation resulted in a process slowdown, requiring advisors to await expert input before responding to customers, introducing delays of a few minutes. Additionally, although SMEs can provide valuable guidance and expertise, their involvement introduces additional costs.

This delay not only affects the customer’s experience but also results in lost prospects because potential customers may decide not to purchase and explore competing services if they get better clarity on those products. The current process was inefficient, and InsuranceDekho needed a solution to empower its agents to respond to customer queries confidently and efficiently, without requiring excessive memorization.

The following figure depicts a common scenario where an SME receives multiple calls from insurance advisors, resulting in delays for the customers. Because SMEs can handle one call at a time, the advisors are left waiting for a response. This further prolongs the time it takes for customers to get clarity on the insurance product and decide on which product they want to purchase.

Solution overview

To overcome the limitations of relying on SMEs, a generative AI-based chat assistant was developed to autonomously resolve agent queries with accuracy. One of the key considerations while designing the chat assistant was to avoid responses from the default large language model (LLM) trained on generic data and only use the insurance policy documents. To generate such high-quality responses, we decided to go with the Retrieval Augmented Generation (RAG) approach using Amazon Bedrock and Anthropic’s Claude Haiku.

Amazon Bedrock

We conducted a thorough evaluation of several generative AI model providers and selected Amazon Bedrock as our primary provider for our foundation model (FM) needs. The key reasons that influenced this decision were:

  • Managed service – Amazon Bedrock is a fully managed, serverless offering that provides a choice of industry-leading FMs without provisioning infrastructure, procuring GPUs around the clock, or configuring ML frameworks. As a result, it significantly reduces development and deployment overhead and total cost of ownership, while enhancing efficiency and accelerating innovation in disruptive technologies like generative AI.
  • Continuous model enhancements – Amazon Bedrock provides access to a vast and continuously expanding set of FMs through a single API. The continuous additions and updates to its model portfolio facilitate access to the latest advancements and improvements in AI technology, enabling us to evaluate upcoming LLMs and optimize output quality, latency, and cost by selecting the most suitable LLM for each specific task or application. We experienced this flexibility firsthand when we seamlessly transitioned from Anthropic’s Claude Instant to Anthropic’s Claude Haiku with the advent of Anthropic’s Claude 3, without requiring code changes.
  • Performance – Amazon Bedrock provides options to achieve high-performance, low-latency, and scalable inference capabilities through on-demand and provisioned throughput options depending on the requirements.
  • Secure model access – Secure, private model access using AWS PrivateLink gives controlled data transfer for inference without traversing the public internet, maintaining data privacy and helping to adhere to compliance requirements.

Retrieval Augmented Generation

RAG is a process in which LLMs access external documents or knowledge bases, promoting accurate and relevant responses. By referencing authoritative sources beyond their training data, RAG helps LLMs generate high-quality responses and overcome common pitfalls such as outdated or misleading information. RAG can be applied to various applications, including improving customer service, enhancing research capabilities, and streamlining business processes.

Solution building blocks

To begin designing the solution, we identified the key components needed, including the generative AI service, LLMs, vector databases, and caching engines. In this section, we delve into the key building blocks used in the solution, highlighting their importance in achieving optimal accuracy, cost-effectiveness, and performance:

  • LLMs – After a thorough evaluation of various LLMs and benchmarking, we chose Anthropic’s Claude Haiku for its exceptional performance. The benchmarking results demonstrated unparalleled speed and affordability in its category. Additionally, it delivered rapid and accurate responses while handling straightforward queries or complex requests, making it an ideal choice for our use case.
  • Embedding model – An embedding model is a type of machine learning (ML) model that maps discrete objects, such as words, phrases, or entities, into dense vector representations in a continuous embedding space. These vector representations, called embeddings, capture the semantic and syntactic relationships between the objects, allowing the model to reason about their similarities and differences. For our use case, we used a third-party embedding model.
  • Vector database – For the vector database, we chose Amazon OpenSearch Service because of its scalability, high-performance search capabilities, and cost-effectiveness. Additionally, OpenSearch Service’s flexible data model and integration with other features make it an ideal choice for our use case.
  • Caching – To enhance the performance, efficiency, and cost-effectiveness of our chat assistant, we used Redis on Amazon ElastiCache to cache frequently accessed responses. This approach enables the chat assistant to rapidly retrieve cached responses, minimizing latency and computational load and resulting in a significantly improved user experience and reduced cost.

Implementation details

The following diagram illustrates the workflow of the current solution. Overall, the workflow can be divided into two workflows: the ingestion workflow and the response generation workflow.

Ingestion workflow

The ingestion workflow serves as the foundation that fuels the entire response generation workflow by keeping the knowledge base up to date with the latest information. This process is crucial for making sure that the system can provide accurate and relevant responses based on the most recent insurance policy documents. The ingestion workflow involves three key components: policy documents, embedding model, and OpenSearch Service as a vector database.

  1. The policy documents contain the insurance policy information that needs to be ingested into the knowledge base.
  2. These documents are processed by the embedding model, which converts the textual content into high-dimensional vector representations, capturing the semantic meaning of the text. After the embedding model generates the vector representations of the policy documents, these embeddings are stored in OpenSearch Service. This ingestion workflow enables the chat assistant to provide responses based on the latest policy information available.
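
For illustration, the ingestion step could look something like the following Python sketch using the opensearch-py client. The post doesn’t name the embedding model InsuranceDekho used, so embed() is a placeholder, and the index name, vector dimension, and connection details are assumptions.

    from opensearchpy import OpenSearch

    client = OpenSearch(hosts=[{"host": "your-opensearch-domain", "port": 443}],
                        use_ssl=True)

    # Create a k-NN enabled index to hold policy document chunks and their vectors
    client.indices.create(
        index="policy-documents",
        body={"settings": {"index.knn": True},
              "mappings": {"properties": {
                  "text": {"type": "text"},
                  "embedding": {"type": "knn_vector", "dimension": 1536}}}},
        ignore=400,  # ignore "index already exists"
    )

    def embed(text):
        """Placeholder for the third-party embedding model used in the solution."""
        raise NotImplementedError

    def ingest_policy_chunks(chunks):
        # Convert each policy document chunk to a vector and store it for retrieval
        for i, chunk in enumerate(chunks):
            client.index(index="policy-documents",
                         id=str(i),
                         body={"text": chunk, "embedding": embed(chunk)})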

Response generation workflow

The response generation workflow is the core of our chat assistant solution. Insurance advisors use it to provide comprehensive responses to customers’ queries regarding policy coverage, exclusions, and other related topics.

  1. To initiate this workflow, our chatbot serves as the entry point, facilitating seamless interaction between the insurance advisors and the underlying response generation system.
  2. This solution incorporates a caching mechanism that uses semantic search to check if a query has been recently processed and answered. If a match is found in the cache (Redis), the chat assistant retrieves and returns the corresponding response, bypassing the full response generation workflow for redundant queries, thereby enhancing system performance.
  3. If no match is found in the cache, the query goes to the intent classifier powered by Anthropic’s Claude Haiku. It analyzes the query to understand the user’s intent and classify it accordingly. This enables dynamic prompting and tailored processing based on the query type. For generic or common queries, the intent classifier can provide the final response independently, bypassing the full RAG workflow, thereby optimizing efficiency and response times.
  4. For queries requiring the full RAG workflow, the intent classifier passes the query to the retrieval step, where a semantic search is performed on a vector database containing insurance policy documents to find the most relevant information, that is, the context based on the query.
  5. After the retrieval step, the retrieved context is integrated with the query and prompt, and this augmented information is fed into the generation process. This augmentation is the core component that enables the enhancement of the generation.
  6. In the final generation step, the actual response to the query is produced based on the external knowledge base of policy documents.
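
The following simplified Python sketch ties these steps together. The cache object, embed() helper (from the ingestion sketch), prompts, and index and field names are placeholders and assumptions, not InsuranceDekho’s production code.

    import boto3

    bedrock = boto3.client("bedrock-runtime")
    MODEL_ID = "anthropic.claude-3-haiku-20240307-v1:0"

    def answer_query(query, cache, search_client, embed):
        # Step 2: semantic cache lookup (Redis on ElastiCache in the real solution)
        cached = cache.lookup(query)            # returns a cached response or None
        if cached:
            return cached

        # Step 3: intent classification with Anthropic's Claude Haiku
        intent = invoke(f"Classify this insurance query as 'generic' or 'policy': {query}")
        if "generic" in intent.lower():
            return invoke(query)                # answer directly, skip the full RAG path

        # Step 4: retrieve relevant policy context from OpenSearch Service
        hits = search_client.search(index="policy-documents",
                                    body={"size": 3,
                                          "query": {"knn": {"embedding": {
                                              "vector": embed(query), "k": 3}}}})
        context = "\n".join(h["_source"]["text"] for h in hits["hits"]["hits"])

        # Steps 5-6: augment the prompt with the context and generate the answer
        answer = invoke(f"Answer only from this context:\n{context}\n\nQuestion: {query}")
        cache.store(query, answer)
        return answer

    def invoke(prompt):
        out = bedrock.converse(modelId=MODEL_ID,
                               messages=[{"role": "user", "content": [{"text": prompt}]}])
        return out["output"]["message"]["content"][0]["text"]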

Results

The implementation of the generative AI-powered RAG chat assistant solution has yielded impressive results for InsuranceDekho. By using this solution, insurance advisors can now confidently and efficiently address customer queries autonomously, without the constant need for SME involvement. Additionally, the implementation of this solution has resulted in a significant reduction in response time to address customer queries. InsuranceDekho has witnessed a remarkable 80% decrease in the response time of the customer queries to understand the plan features, inclusions, and exclusions.

InsuranceDekho’s adoption of this generative AI-powered solution has streamlined the customer service process, making sure that customers receive precise and trustworthy responses to their inquiries in a timely manner.

Conclusion

In this post, we discussed how InsuranceDekho harnessed the power of generative AI to equip its insurance advisors with the tools to efficiently respond to customer queries regarding various insurance policies. By implementing a RAG-based chat assistant using Amazon Bedrock and OpenSearch Service, InsuranceDekho empowered its insurance advisors to deliver exceptional service. This solution minimized the reliance on SMEs and significantly reduced response times so advisors could address customer inquiries promptly and accurately.


About the Authors

Vishal Gupta is a Senior Solutions Architect at AWS India, based in Delhi. In his current role at AWS, he works with digital native business customers and enables them to design, architect, and innovate highly scalable, resilient, and cost-effective cloud architectures. An avid blogger and speaker, Vishal loves to share his knowledge with the tech community. Outside of work, he enjoys traveling to new destinations and spending time with his family.

Nishant Gupta is working as Vice President, Engineering at InsuranceDekho with 14 years of experience. He is passionate about building highly scalable, reliable, and cost-optimized solutions that can handle massive amounts of data efficiently.

Read More

Considerations for addressing the core dimensions of responsible AI for Amazon Bedrock applications

The rapid advancement of generative AI promises transformative innovation, yet it also presents significant challenges. Concerns about legal implications, accuracy of AI-generated outputs, data privacy, and broader societal impacts have underscored the importance of responsible AI development. Responsible AI is a practice of designing, developing, and operating AI systems guided by a set of dimensions with the goal to maximize benefits while minimizing potential risks and unintended harm. Our customers want to know that the technology they are using was developed in a responsible way. They also want resources and guidance to implement that technology responsibly in their own organization. Most importantly, they want to make sure the technology they roll out is for everyone’s benefit, including end-users. At AWS, we are committed to developing AI responsibly, taking a people-centric approach that prioritizes education, science, and our customers, integrating responsible AI across the end-to-end AI lifecycle.

What constitutes responsible AI is continually evolving. For now, we consider eight key dimensions of responsible AI: fairness, explainability, privacy and security, safety, controllability, veracity and robustness, governance, and transparency. These dimensions make up the foundation for developing and deploying AI applications in a responsible and safe manner.

At AWS, we help our customers transform responsible AI from theory into practice—by giving them the tools, guidance, and resources to get started with purpose-built services and features, such as Amazon Bedrock Guardrails. In this post, we introduce the core dimensions of responsible AI and explore considerations and strategies on how to address these dimensions for Amazon Bedrock applications. Amazon Bedrock is a fully managed service that offers a choice of high-performing foundation models (FMs) from leading AI companies like AI21 Labs, Anthropic, Cohere, Meta, Mistral AI, Stability AI, and Amazon through a single API, along with a broad set of capabilities to build generative AI applications with security, privacy, and responsible AI.

Safety

The safety dimension in responsible AI focuses on preventing harmful system output and misuse. It focuses on steering AI systems to prioritize user and societal well-being.

Amazon Bedrock is designed to facilitate the development of secure and reliable AI applications by incorporating various safety measures. In the following sections, we explore different aspects of implementing these safety measures and provide guidance for each.

Addressing model toxicity with Amazon Bedrock Guardrails

Amazon Bedrock Guardrails supports AI safety by working towards preventing the application from generating or engaging with content that is considered unsafe or undesirable. These safeguards can be created for multiple use cases and implemented across multiple FMs, depending on your application and responsible AI requirements. For example, you can use Amazon Bedrock Guardrails to filter out harmful user inputs and toxic model outputs, block or mask sensitive information in user inputs and model outputs, or help prevent your application from responding to unsafe or undesired topics.

Content filters can be used to detect and filter harmful or toxic user inputs and model-generated outputs. By implementing content filters, you can help prevent your AI application from responding to inappropriate user behavior, and make sure your application provides only safe outputs. This can also mean providing no output at all, in situations where certain user behavior is unwanted. Content filters support six categories: hate, insults, sexual content, violence, misconduct, and prompt injections. Filtering is done based on confidence classification of user inputs and FM responses across each category. You can adjust filter strengths to determine the sensitivity of filtering harmful content; increasing a filter’s strength increases the probability that unwanted content is filtered.

Denied topics are a set of topics that are undesirable in the context of your application. These topics will be blocked if detected in user queries or model responses. You define a denied topic by providing a natural language definition of the topic along with a few optional example phrases of the topic. For example, if a medical institution wants to make sure their AI application avoids giving any medication or medical treatment-related advice, they can define the denied topic as “Information, guidance, advice, or diagnoses provided to customers relating to medical conditions, treatments, or medication” and optional input examples like “Can I use medication A instead of medication B,” “Can I use Medication A for treating disease Y,” or “Does this mole look like skin cancer?” Developers will need to specify a message that will be displayed to the user whenever denied topics are detected, for example “I am an AI bot and cannot assist you with this problem, please contact our customer service/your doctor’s office.” Avoiding specific topics that aren’t toxic by nature but can potentially be harmful to the end-user is crucial when creating safe AI applications.

Word filters are used to configure filters to block undesirable words, phrases, and profanity. Such words can include offensive terms or undesirable outputs, like product or competitor information. You can add up to 10,000 items to the custom word filter to filter out topics you don’t want your AI application to produce or engage with.

Sensitive information filters are used to block or redact sensitive information such as personally identifiable information (PII) or your specified context-dependent sensitive information in user inputs and model outputs. This can be useful when you have requirements for sensitive data handling and user privacy. If the AI application doesn’t process PII information, your users and your organization are safer from accidental or intentional misuse or mishandling of PII. The filter is configured to block sensitive information requests; upon such detection, the guardrail will block content and display a preconfigured message. You can also choose to redact or mask sensitive information, which will either replace the data with an identifier or delete it completely.
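
As a hypothetical example, the safeguards described above can be codified with the CreateGuardrail API in boto3. The names, example phrases, messages, and filter strengths below are illustrative, and parameter names should be verified against the current SDK documentation.

    import boto3

    bedrock = boto3.client("bedrock")

    guardrail = bedrock.create_guardrail(
        name="customer-support-guardrail",
        description="Blocks toxic content and medical advice, and redacts PII",
        contentPolicyConfig={"filtersConfig": [
            {"type": "HATE", "inputStrength": "HIGH", "outputStrength": "HIGH"},
            {"type": "VIOLENCE", "inputStrength": "HIGH", "outputStrength": "HIGH"},
            {"type": "PROMPT_ATTACK", "inputStrength": "HIGH", "outputStrength": "NONE"},
        ]},
        topicPolicyConfig={"topicsConfig": [{
            "name": "MedicalAdvice",
            "definition": ("Information, guidance, advice, or diagnoses provided to "
                           "customers relating to medical conditions, treatments, or medication"),
            "examples": ["Can I use medication A instead of medication B?"],
            "type": "DENY",
        }]},
        wordPolicyConfig={"managedWordListsConfig": [{"type": "PROFANITY"}]},
        sensitiveInformationPolicyConfig={"piiEntitiesConfig": [
            {"type": "EMAIL", "action": "ANONYMIZE"},
            {"type": "PHONE", "action": "BLOCK"},
        ]},
        blockedInputMessaging="I am an AI bot and cannot assist you with this request.",
        blockedOutputsMessaging="I am an AI bot and cannot assist you with this request.",
    )
    print(guardrail["guardrailId"], guardrail["version"])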

Measuring model toxicity with Amazon Bedrock model evaluation

Amazon Bedrock provides a built-in capability for model evaluation. Model evaluation is used to compare different models’ outputs and select the most appropriate model for your use case. Model evaluation jobs support common use cases for large language models (LLMs) such as text generation, text classification, question answering, and text summarization. You can choose to create either an automatic model evaluation job or a model evaluation job that uses a human workforce. For automatic model evaluation jobs, you can either use built-in datasets across three predefined metrics (accuracy, robustness, toxicity) or bring your own datasets. For human-in-the-loop evaluation, which can be done by either AWS managed or customer managed teams, you must bring your own dataset.

If you are planning on using automated model evaluation for toxicity, start by defining what constitutes toxic content for your specific application. This may include offensive language, hate speech, and other forms of harmful communication. Automated evaluations come with curated datasets to choose from. For toxicity, you can use either RealToxicityPrompts or BOLD datasets, or both. If you bring your custom model to Amazon Bedrock, you can implement scheduled evaluations by integrating regular toxicity assessments into your development pipeline at key stages of model development, such as after major updates or retraining sessions. For early detection, implement custom testing scripts that run toxicity evaluations on new data and model outputs continuously.

Amazon Bedrock and its safety capabilities help developers create AI applications that prioritize safety and reliability, thereby fostering trust and promoting ethical use of AI technology. You should experiment and iterate on your chosen safety approaches to achieve the desired performance. Diverse feedback is also important, so think about implementing human-in-the-loop testing to assess model responses for safety and fairness.

Controllability

Controllability focuses on having mechanisms to monitor and steer AI system behavior. It refers to the ability to manage, guide, and constrain AI systems to make sure they operate within desired parameters.

Guiding AI behavior with Amazon Bedrock Guardrails

To provide direct control over what content the AI application can produce or engage with, you can use Amazon Bedrock Guardrails, which we discussed under the safety dimension. This allows you to steer and manage the system’s outputs effectively.

You can use content filters to manage AI outputs by setting sensitivity levels for detecting harmful or toxic content. By controlling how strictly content is filtered, you can steer the AI’s behavior to help avoid undesirable responses. This allows you to guide the system’s interactions and outputs to align with your requirements. Defining and managing denied topics helps control the AI’s engagement with specific subjects. By blocking responses related to defined topics, you help AI systems remain within the boundaries set for its operation.

Amazon Bedrock Guardrails can also guide the system’s behavior for compliance with content policies and privacy standards. Custom word filters allow you to block specific words, phrases, and profanity, giving you direct control over the language the AI uses. And managing how sensitive information is handled, whether by blocking or redacting it, allows you to control the AI’s approach to data privacy and security.

Monitoring and adjusting performance with Amazon Bedrock model evaluation

To assess and adjust AI performance, you can look at Amazon Bedrock model evaluation. This helps systems operate within desired parameters and meet safety and ethical standards. You can explore both automatic and human-in-the-loop evaluation. These evaluation methods help you monitor and guide model performance by assessing how well models meet safety and ethical standards. Regular evaluations allow you to adjust and steer the AI’s behavior based on feedback and performance metrics.

Integrating scheduled toxicity assessments and custom testing scripts into your development pipeline helps you continuously monitor and adjust model behavior. This ongoing control helps AI systems to remain aligned with desired parameters and adapt to new data and scenarios effectively.

Fairness

The fairness dimension in responsible AI considers the impacts of AI on different groups of stakeholders. Achieving fairness requires ongoing monitoring, bias detection, and adjustment of AI systems to maintain impartiality and justice.

To help with fairness in AI applications that are built on top of Amazon Bedrock, application developers should explore model evaluation and human-in-the-loop validation for model outputs at different stages of the machine learning (ML) lifecycle. Measuring bias presence before and after model training as well as at model inference is the first step in mitigating bias. When developing an AI application, you should set fairness goals, metrics, and potential minimum acceptable thresholds to measure performance across different qualities and demographics applicable to the use case. On top of these, you should create remediation plans for potential inaccuracies and bias, which may include modifying datasets, finding and deleting the root cause for bias, introducing new data, and potentially retraining the model.

Amazon Bedrock provides a built-in capability for model evaluation, as we explored under the safety dimension. For general text generation evaluation for measuring model robustness and toxicity, you can use the built-in fairness dataset Bias in Open-ended Language Generation Dataset (BOLD), which focuses on five domains: profession, gender, race, religious ideologies, and political ideologies. To assess fairness for other domains or tasks, you must bring your own custom prompt datasets.

Transparency

The transparency dimension in generative AI focuses on understanding how AI systems make decisions, why they produce specific results, and what data they’re using. Maintaining transparency is critical for building trust in AI systems and fostering responsible AI practices.

To help meet the growing demand for transparency, AWS introduced AWS AI Service Cards, a dedicated resource aimed at enhancing customer understanding of our AI services. AI Service Cards serve as a cornerstone of responsible AI documentation, consolidating essential information in one place. They provide comprehensive insights into the intended use cases, limitations, responsible AI design principles, and best practices for deployment and performance optimization of our AI services. They are part of a comprehensive development process we undertake to build our services in a responsible way.

At the time of writing, we offer the following AI Service Cards for Amazon Bedrock models:

Service cards for other Amazon Bedrock models can be found directly on the provider’s website. Each card details the service’s specific use cases, the ML techniques employed, and crucial considerations for responsible deployment and use. These cards evolve iteratively based on customer feedback and ongoing service enhancements, so they remain relevant and informative.

An additional effort in providing transparency is the Amazon Titan Image Generator invisible watermark. Images generated by Amazon Titan come with this invisible watermark by default. This watermark detection mechanism enables you to identify images produced by Amazon Titan Image Generator, an FM designed to create realistic, studio-quality images in large volumes and at low cost using natural language prompts. By using watermark detection, you can enhance transparency around AI-generated content, mitigate the risks of harmful content generation, and reduce the spread of misinformation.

Content creators, news organizations, risk analysts, fraud detection teams, and more can use this feature to identify and authenticate images created by Amazon Titan Image Generator. The detection system also provides a confidence score, allowing you to assess the reliability of the detection even if the original image has been modified. Simply upload an image to the Amazon Bedrock console, and the API will detect watermarks embedded in images generated by the Amazon Titan model, including both the base model and customized versions. This tool not only supports responsible AI practices, but also fosters trust and reliability in the use of AI-generated content.

Veracity and robustness

The veracity and robustness dimension in responsible AI focuses on achieving correct system outputs, even with unexpected or adversarial inputs. The main focus of this dimension is to address possible model hallucinations. Model hallucinations occur when an AI system generates false or misleading information that appears to be plausible. Robustness in AI systems makes sure model outputs are consistent and reliable under various conditions, including unexpected or adverse situations. A robust AI model maintains its functionality and delivers consistent and accurate outputs even when faced with incomplete or incorrect input data.

Measuring accuracy and robustness with Amazon Bedrock model evaluation

As introduced in the AI safety and controllability dimensions, Amazon Bedrock provides tools for evaluating AI models in terms of toxicity, robustness, and accuracy. This makes sure the models don’t produce harmful, offensive, or inappropriate content and can withstand various inputs, including unexpected or adversarial scenarios.

Accuracy evaluation helps AI models produce reliable and correct outputs across various tasks and datasets. In the built-in evaluation, accuracy is measured against a TREX dataset and the algorithm calculates the degree to which the model’s predictions match the actual results. The actual metric for accuracy depends on the chosen use case; for example, in text generation, the built-in evaluation calculates a real-world knowledge score, which examines the model’s ability to encode factual knowledge about the real world. This evaluation is essential for maintaining the integrity, credibility, and effectiveness of AI applications.

Robustness evaluation makes sure the model maintains consistent performance across diverse and potentially challenging conditions. This includes handling unexpected inputs, adversarial manipulations, and varying data quality without significant degradation in performance.

Methods for achieving veracity and robustness in Amazon Bedrock applications

There are several techniques that you can consider when using LLMs in your applications to maximize veracity and robustness:

  • Prompt engineering – You can instruct the model to only engage in discussion about things that it knows and not generate any new information.
  • Chain-of-thought (CoT) – This technique involves the model generating intermediate reasoning steps that lead to the final answer, improving the model’s ability to solve complex problems by making its thought process transparent and logical. For example, you can ask the model to explain why it used certain information and created a certain output. This is a powerful method to reduce hallucinations. When you ask the model to explain the process it used to generate the output, the model has to identify the different steps it took and the information it used, thereby reducing hallucination. To learn more about CoT and other prompt engineering techniques for Amazon Bedrock LLMs, see General guidelines for Amazon Bedrock LLM users.
  • Retrieval Augmented Generation (RAG) – This helps reduce hallucination by providing the right context and augmenting generated outputs with internal data to the models. With RAG, you can provide the context to the model and tell the model to only reply based on the provided context, which leads to fewer hallucinations. With Amazon Bedrock Knowledge Bases, you can implement the RAG workflow from ingestion to retrieval and prompt augmentation. The information retrieved from the knowledge bases is provided with citations to improve AI application transparency and minimize hallucinations.
  • Fine-tuning and pre-training – There are different techniques for improving model accuracy for specific context, like fine-tuning and continued pre-training. Instead of providing internal data through RAG, with these techniques, you add data straight to the model as part of its dataset. This way, you can customize several Amazon Bedrock FMs by pointing them to datasets that are saved in Amazon Simple Storage Service (Amazon S3) buckets. For fine-tuning, you can take anything between a few dozen and hundreds of labeled examples and train the model with them to improve performance on specific tasks. The model learns to associate certain types of outputs with certain types of inputs. You can also use continued pre-training, in which you provide the model with unlabeled data, familiarizing the model with certain inputs for it to associate and learn patterns. This includes, for example, data from a specific topic that the model doesn’t have enough domain knowledge of, thereby increasing the accuracy of the domain. Both of these customization options make it possible to create an accurate customized model without collecting large volumes of annotated data, resulting in reduced hallucination.
  • Inference parameters – You can also look into the inference parameters, which are values that you can adjust to modify the model response. There are multiple inference parameters that you can set, and they affect different capabilities of the model. For example, if you want the model to get creative with the responses or generate completely new information, such as in the context of storytelling, you can modify the temperature parameter. This affects how the model samples words across the probability distribution and selects words that are farther apart from each other in that space (see the short sketch after this list).
  • Contextual grounding – Lastly, you can use the contextual grounding check in Amazon Bedrock Guardrails. Amazon Bedrock Guardrails provides mechanisms within the Amazon Bedrock service that allow developers to set content filters and specify denied topics to control allowed text-based user inputs and model outputs. You can detect and filter hallucinations in model responses if they are not grounded (factually inaccurate or add new information) in the source information or are irrelevant to the user’s query. For example, you can block or flag responses in RAG applications if the model response deviates from the information in the retrieved passages or doesn’t answer the user’s question.
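
The following small sketch shows the temperature parameter in action with the Amazon Bedrock Converse API; the model ID and prompts are assumptions for illustration.

    import boto3

    bedrock = boto3.client("bedrock-runtime")

    def ask(prompt, temperature):
        response = bedrock.converse(
            modelId="anthropic.claude-3-haiku-20240307-v1:0",
            messages=[{"role": "user", "content": [{"text": prompt}]}],
            inferenceConfig={"temperature": temperature, "maxTokens": 512},
        )
        return response["output"]["message"]["content"][0]["text"]

    print(ask("Summarize our refund policy.", temperature=0.1))        # factual, repeatable tone
    print(ask("Write a short story about a cloud.", temperature=0.9))  # creative tone

A low temperature keeps outputs close to the most probable tokens, which generally reduces fabricated details, whereas a higher temperature is better suited to creative tasks.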

Model providers and tuners might not mitigate these hallucinations, but can inform the user that they might occur. This could be done by adding some disclaimers about using AI applications at the user’s own risk. We currently also see advances in research in methods that estimate uncertainty based on the amount of variation (measured as entropy) between multiple outputs. These new methods have proved much better at spotting when a question was likely to be answered incorrectly than previous methods.

Explainability

The explainability dimension in responsible AI focuses on understanding and evaluating system outputs. By using an explainable AI framework, humans can examine the models to better understand how they produce their outputs. For the explainability of the output of a generative AI model, you can use techniques like training data attribution and CoT prompting, which we discussed under the veracity and robustness dimension.

For customers wanting to see attribution of information in model completions, we recommend using RAG with an Amazon Bedrock knowledge base. Attribution works with RAG because the possible attribution sources are included in the prompt itself. Information retrieved from the knowledge base comes with source attribution to improve transparency and minimize hallucinations. Amazon Bedrock Knowledge Bases manages the end-to-end RAG workflow for you. When using the RetrieveAndGenerate API, the output includes the generated response, the source attribution, and the retrieved text chunks.
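
As a brief sketch, you can access the generated response and its citations through the RetrieveAndGenerate API with boto3 as follows; the knowledge base ID, question, and model ARN are placeholders.

    import boto3

    agent_runtime = boto3.client("bedrock-agent-runtime")

    response = agent_runtime.retrieve_and_generate(
        input={"text": "What is our RTO target for tier-1 applications?"},
        retrieveAndGenerateConfiguration={
            "type": "KNOWLEDGE_BASE",
            "knowledgeBaseConfiguration": {
                "knowledgeBaseId": "YOUR_KB_ID",
                "modelArn": "arn:aws:bedrock:us-east-1::foundation-model/"
                            "anthropic.claude-3-haiku-20240307-v1:0",
            },
        },
    )

    print(response["output"]["text"])           # the generated response
    for citation in response["citations"]:      # source attribution for each passage used
        for ref in citation["retrievedReferences"]:
            print(ref["location"])              # for example, the S3 URI of the source chunk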

Security and privacy

If there is one thing that is absolutely critical to every organization using generative AI technologies, it is making sure everything you do is and remains private, and that your data is protected at all times. The security and privacy dimension in responsible AI focuses on making sure data and models are obtained, used, and protected appropriately.

Built-in security and privacy of Amazon Bedrock

With Amazon Bedrock, if we look from a data privacy and localization perspective, AWS does not store your data—if we don’t store it, it can’t leak, it can’t be seen by model vendors, and it can’t be used by AWS for any other purpose. The only data we store is operational metrics—for example, for accurate billing, AWS collects metrics on how many tokens you send to a specific Amazon Bedrock model and how many tokens you receive in a model output. And, of course, if you create a fine-tuned model, we need to store that in order for AWS to host it for you. Data used in your API requests remains in the AWS Region of your choosing—API requests to the Amazon Bedrock API to a specific Region will remain completely within that Region.

If we look at data security, a common adage is that if it moves, encrypt it. Communications to, from, and within Amazon Bedrock are encrypted in transit—Amazon Bedrock doesn’t have a non-TLS endpoint. Another adage is that if it doesn’t move, encrypt it. Your fine-tuning data and model will by default be encrypted using AWS managed AWS Key Management Service (AWS KMS) keys, but you have the option to use your own KMS keys.

When it comes to identity and access management, AWS Identity and Access Management (IAM) controls who is authorized to use Amazon Bedrock resources. For each model, you can explicitly allow or deny access to actions. For example, one team or account could be allowed to provision capacity for Amazon Titan Text, but not Anthropic models. You can be as broad or as granular as you need to be.

Looking at network data flows for Amazon Bedrock API access, it’s important to remember that traffic is encrypted at all times. If you’re using Amazon Virtual Private Cloud (Amazon VPC), you can use AWS PrivateLink to provide your VPC with private connectivity through the regional network direct to the frontend fleet of Amazon Bedrock, mitigating exposure of your VPC to internet traffic with an internet gateway. Similarly, from a corporate data center perspective, you can set up a VPN or AWS Direct Connect connection to privately connect to a VPC, and from there you can have that traffic sent to Amazon Bedrock over PrivateLink. This should negate the need for your on-premises systems to send Amazon Bedrock related traffic over the internet. Following AWS best practices, you secure PrivateLink endpoints using security groups and endpoint policies to control access to these endpoints following Zero Trust principles.

Let’s also look at network and data security for Amazon Bedrock model customization. The customization process will first load your requested baseline model, then securely read your customization training and validation data from an S3 bucket in your account. Connection to data can happen through a VPC using a gateway endpoint for Amazon S3. That means bucket policies that you have can still be applied, and you don’t have to open up wider access to that S3 bucket. A new model is built, which is then encrypted and delivered to the customized model bucket—at no time does a model vendor have access to or visibility of your training data or your customized model. At the end of the training job, we also deliver output metrics relating to the training job to an S3 bucket that you had specified in the original API request. As mentioned previously, both your training data and customized model can be encrypted using a customer managed KMS key.
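
The following sketch illustrates how such a customization job could be started with a customer managed KMS key and a VPC configuration; all names, ARNs, identifiers, and hyperparameter values are placeholders to adapt to your environment:

import boto3

bedrock = boto3.client("bedrock")

# Fine-tuning job sketch: training data is read from your S3 bucket (optionally through a VPC),
# and the resulting custom model is encrypted with a customer managed KMS key.
bedrock.create_model_customization_job(
    jobName="titan-text-finetune-demo",
    customModelName="my-custom-titan-text",
    roleArn="arn:aws:iam::111122223333:role/BedrockCustomizationRole",
    baseModelIdentifier="amazon.titan-text-express-v1",
    customizationType="FINE_TUNING",
    trainingDataConfig={"s3Uri": "s3://my-training-bucket/train.jsonl"},
    outputDataConfig={"s3Uri": "s3://my-output-bucket/metrics/"},
    hyperParameters={"epochCount": "2", "batchSize": "1", "learningRate": "0.00001"},
    customModelKmsKeyId="arn:aws:kms:us-east-1:111122223333:key/1234abcd-12ab-34cd-56ef-1234567890ab",
    vpcConfig={
        "subnetIds": ["subnet-0123456789abcdef0"],
        "securityGroupIds": ["sg-0123456789abcdef0"],
    },
)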

Best practices for privacy protection

The first thing to keep in mind when implementing a generative AI application is data encryption. As mentioned earlier, Amazon Bedrock uses encryption in transit and at rest. For encryption at rest, you have the option to choose your own customer managed KMS keys over the default AWS managed KMS keys. Depending on your company’s requirements, you might want to use a customer managed KMS key. For encryption in transit, we recommend using TLS 1.3 to connect to the Amazon Bedrock API.

For terms and conditions and data privacy, it’s important to read the terms and conditions of the models (EULA). Model providers are responsible for setting up these terms and conditions, and you as a customer are responsible for evaluating these and deciding if they’re appropriate for your application. Always make sure you read and understand the terms and conditions before accepting, including when you request model access in Amazon Bedrock. You should make sure you’re comfortable with the terms. Make sure your test data has been approved by your legal team.

For privacy and copyright, it is the responsibility of the provider and the model tuner to make sure the data used for training and fine-tuning is legally available and can actually be used to fine-tune and train those models. It is also the responsibility of the model provider to make sure the data they’re using is appropriate for the models. Publicly available data isn’t automatically cleared for commercial use, so you can’t necessarily use such data to fine-tune a model and expose the results to your customers.

To protect user privacy, you can use the sensitive information filters in Amazon Bedrock Guardrails, which we discussed under the safety and controllability dimension.
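
As a brief illustration, a guardrail with sensitive information filters could be created as follows; the guardrail name, the selected PII entity types, and the blocked messages are examples to adapt:

import boto3

bedrock = boto3.client("bedrock")

# Guardrail sketch that anonymizes common PII in model inputs and outputs.
bedrock.create_guardrail(
    name="pii-protection-guardrail",
    description="Masks personally identifiable information",
    sensitiveInformationPolicyConfig={
        "piiEntitiesConfig": [
            {"type": "EMAIL", "action": "ANONYMIZE"},
            {"type": "PHONE", "action": "ANONYMIZE"},
            {"type": "NAME", "action": "ANONYMIZE"},
        ]
    },
    blockedInputMessaging="Sorry, I can't process requests containing sensitive information.",
    blockedOutputsMessaging="Sorry, the response was blocked because it contained sensitive information.",
)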

Lastly, when automating with generative AI (for example, with Amazon Bedrock Agents), make sure you’re comfortable with the model making automated decisions, consider the consequences of the application providing wrong information or taking wrong actions, and factor these risks into your risk management.

Governance

The governance dimension makes sure AI systems are developed, deployed, and managed in a way that aligns with ethical standards, legal requirements, and societal values. Governance encompasses the frameworks, policies, and rules that direct AI development and use in a way that is safe, fair, and accountable. Setting and maintaining governance for AI allows stakeholders to make informed decisions around the use of AI applications. This includes transparency about how data is used, the decision-making processes of AI, and the potential impacts on users.

Robust governance is the foundation upon which responsible AI applications are built. AWS offers a range of services and tools that can empower you to establish and operationalize AI governance practices. AWS has also developed an AI governance framework that offers comprehensive guidance on best practices across vital areas such as data and model governance, AI application monitoring, auditing, and risk management.

When looking at auditability, Amazon Bedrock integrates with the AWS generative AI best practices framework v2 from AWS Audit Manager. With this framework, you can start auditing your generative AI usage within Amazon Bedrock by automating evidence collection. This provides a consistent approach for tracking AI model usage and permissions, flagging sensitive data, and alerting on issues. You can use collected evidence to assess your AI application across eight principles: responsibility, safety, fairness, sustainability, resilience, privacy, security, and accuracy.

For monitoring and auditing purposes, you can use Amazon Bedrock built-in integrations with Amazon CloudWatch and AWS CloudTrail. You can monitor Amazon Bedrock using CloudWatch, which collects raw data and processes it into readable, near real-time metrics. CloudWatch helps you track usage metrics such as model invocations and token count, and helps you build customized dashboards for audit purposes either across one or multiple FMs in one or multiple AWS accounts. CloudTrail is a centralized logging service that provides a record of user and API activities in Amazon Bedrock. CloudTrail collects API data into a trail, which needs to be created inside the service. A trail enables CloudTrail to deliver log files to an S3 bucket.
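
For example, the following sketch retrieves the daily invocation count for a single model from the AWS/Bedrock CloudWatch namespace; the model ID is an example to adjust to the model you want to audit:

import boto3
from datetime import datetime, timedelta

cloudwatch = boto3.client("cloudwatch")

# Daily invocation count for a specific model over the last week
response = cloudwatch.get_metric_statistics(
    Namespace="AWS/Bedrock",
    MetricName="Invocations",
    Dimensions=[{"Name": "ModelId", "Value": "anthropic.claude-3-sonnet-20240229-v1:0"}],
    StartTime=datetime.utcnow() - timedelta(days=7),
    EndTime=datetime.utcnow(),
    Period=86400,
    Statistics=["Sum"],
)
for point in sorted(response["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"].date(), int(point["Sum"]))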

Amazon Bedrock also provides model invocation logging, which is used to collect model input data, prompts, model responses, and request IDs for all invocations in your AWS account used in Amazon Bedrock. This feature provides insights into how your models are being used and how they are performing, enabling you and your stakeholders to make data-driven and responsible decisions around the use of AI applications. Model invocation logging needs to be enabled explicitly, and you can decide whether you want to store this log data in an S3 bucket or CloudWatch Logs.
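
The following sketch enables model invocation logging with delivery to an S3 bucket; the bucket name and prefix are placeholders, and you could equally configure delivery to CloudWatch Logs:

import boto3

bedrock = boto3.client("bedrock")

# Enable model invocation logging to an S3 bucket
bedrock.put_model_invocation_logging_configuration(
    loggingConfig={
        "s3Config": {
            "bucketName": "my-bedrock-invocation-logs",
            "keyPrefix": "invocation-logs/",
        },
        "textDataDeliveryEnabled": True,
        "imageDataDeliveryEnabled": False,
        "embeddingDataDeliveryEnabled": False,
    }
)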

From a compliance perspective, Amazon Bedrock is in scope for common compliance standards, including ISO, SOC, FedRAMP moderate, PCI, ISMAP, and CSA STAR Level 2, and is Health Insurance Portability and Accountability Act (HIPAA) eligible. You can also use Amazon Bedrock in compliance with the General Data Protection Regulation (GDPR). Amazon Bedrock is included in the Cloud Infrastructure Service Providers in Europe Data Protection Code of Conduct (CISPE CODE) Public Register. This register provides independent verification that Amazon Bedrock can be used in compliance with the GDPR. For the most up-to-date information about whether Amazon Bedrock is within the scope of specific compliance programs, see AWS services in Scope by Compliance Program and choose the compliance program you’re interested in.

Implementing responsible AI in Amazon Bedrock applications

When building applications in Amazon Bedrock, consider your application context, needs, and behaviors of your end-users. Also, look into your organization’s needs, legal and regulatory requirements, and metrics you want or need to collect when implementing responsible AI. Take advantage of managed and built-in features available. The following diagram outlines various measures you can implement to address the core dimensions of responsible AI. This is not an exhaustive list, but rather a proposition of how the measures mentioned in this post could be combined together. These measures include:

  • Model evaluation – Use model evaluation to assess fairness, accuracy, toxicity, robustness, and other metrics to evaluate your chosen FM and its performance.
  • Amazon Bedrock Guardrails – Use Amazon Bedrock Guardrails to establish content filters, denied topics, word filters, sensitive information filters, and contextual grounding. With guardrails, you can guide model behavior by denying any unsafe or harmful topics or words and protect the safety of your end-users.
  • Prompt engineering – Utilize prompt engineering techniques, such as CoT, to improve explainability, veracity and robustness, and safety and controllability of your AI application. With prompt engineering, you can set a desired structure for the model response, including tone, scope, and length of responses. You can emphasize safety and controllability by adding denied topics to the prompt template.
  • Amazon Bedrock Knowledge Bases – Use Amazon Bedrock Knowledge Bases for end-to-end RAG implementation to decrease hallucinations and improve accuracy of the model for internal data use cases. Using RAG will improve veracity and robustness, safety and controllability, and explainability of your AI application.
  • Logging and monitoring – Maintain comprehensive logging and monitoring to enforce effective governance.
Diagram outlining the various measures you can implement to address the core dimensions of responsible AI: model evaluation, Amazon Bedrock Guardrails, prompt engineering, Amazon Bedrock Knowledge Bases, and logging and monitoring.

Conclusion

Building responsible AI applications requires a deliberate and structured approach, iterative development, and continuous effort. Amazon Bedrock offers a robust suite of built-in capabilities that support the development and deployment of responsible AI applications. By providing customizable features and the ability to integrate your own datasets, Amazon Bedrock enables developers to tune AI solutions to their specific application contexts and align them with organizational requirements for responsible AI. This flexibility makes sure AI applications are not only effective, but also ethical and aligned with best practices for fairness, safety, transparency, and accountability.

Implementing AI by following the responsible AI dimensions is key for developing and using AI solutions transparently, and without bias. Responsible development of AI will also help with AI adoption across your organization and build reliability with end customers. The broader the use and impact of your application, the more important following the responsibility framework becomes. Therefore, consider and address the responsible use of AI early on in your AI journey and throughout its lifecycle.

To learn more about the Responsible Use of Machine Learning framework, refer to the following resources:


About the Authors

Laura Verghote is a senior solutions architect for public sector customers in EMEA. She works with customers to design and build solutions in the AWS Cloud, bridging the gap between complex business requirements and technical solutions. She joined AWS as a technical trainer and has wide experience delivering training content to developers, administrators, architects, and partners across EMEA.

Maria Lehtinen is a solutions architect for public sector customers in the Nordics. She works as a trusted cloud advisor to her customers, guiding them through cloud system development and implementation with a strong emphasis on AI/ML workloads. She joined AWS through an early-career professional program and previously worked as a cloud consultant at one of the AWS Advanced Consulting Partners.

Read More

From RAG to fabric: Lessons learned from building real-world RAGs at GenAIIC – Part 2

From RAG to fabric: Lessons learned from building real-world RAGs at GenAIIC – Part 2

In Part 1 of this series, we defined the Retrieval Augmented Generation (RAG) framework to augment large language models (LLMs) with a text-only knowledge base. We gave practical tips, based on hands-on experience with customer use cases, on how to improve text-only RAG solutions, from optimizing the retriever to mitigating and detecting hallucinations.

This post focuses on doing RAG on heterogeneous data formats. We first introduce routers and how they can help manage diverse data sources. We then give tips on how to handle tabular data and conclude with multimodal RAG, focusing specifically on solutions that handle both text and image data.

Overview of RAG use cases with heterogeneous data formats

After a first wave of text-only RAG, we saw an increase in customers wanting to use a variety of data for Q&A. The challenge here is to retrieve the relevant data source to answer the question and correctly extract information from that data source. Use cases we have worked on include:

  • Technical assistance for field engineers – We built a system that aggregates information about a company’s specific products and field expertise. This centralized system consolidates a wide range of data sources, including detailed reports, FAQs, and technical documents. The system integrates structured data, such as tables containing product properties and specifications, with unstructured text documents that provide in-depth product descriptions and usage guidelines. A chatbot enables field engineers to quickly access relevant information, troubleshoot issues more effectively, and share knowledge across the organization.
  • Oil and gas data analysis – Before beginning operations at a well, an oil and gas company will collect and process a diverse range of data to identify potential reservoirs, assess risks, and optimize drilling strategies. The data sources may include seismic surveys, well logs, core samples, geochemical analyses, and production histories, with some of it in industry-specific formats. Each category necessitates specialized generative AI-powered tools to generate insights. We built a chatbot that can answer questions across this complex data landscape, so that oil and gas companies can make faster and more informed decisions, improve exploration success rates, and decrease time to first oil.
  • Financial data analysis – The financial sector uses both unstructured and structured data for market analysis and decision-making. Unstructured data includes news articles, regulatory filings, and social media, providing qualitative insights. Structured data consists of stock prices, financial statements, and economic indicators. We built a RAG system that combines these diverse data types into a single knowledge base, allowing analysts to efficiently access and correlate information. This approach enables nuanced analysis by combining numerical trends with textual insights to identify opportunities, assess risks, and forecast market movements.
  • Industrial maintenance – We built a solution that combines maintenance logs, equipment manuals, and visual inspection data to optimize maintenance schedules and troubleshooting. This multimodal approach integrates written reports and procedures with images and diagrams of machinery, allowing maintenance technicians to quickly access both descriptive information and visual representations of equipment. For example, a technician could query the system about a specific machine part, receiving both textual maintenance history and annotated images showing wear patterns or common failure points, enhancing their ability to diagnose and resolve issues efficiently.
  • Ecommerce product search – We built several solutions to enhance the search capabilities on ecommerce websites to improve the shopping experience for customers. Traditional search engines rely mostly on text-based queries. By integrating multimodal (text and image) RAG, we aimed to create a more comprehensive search experience. The new system can handle both text and image inputs, allowing customers to upload photos of desired items and receive precise product matches.

Using a router to handle heterogeneous data sources

In RAG systems, a router is a component that directs incoming user queries to the appropriate processing pipeline based on the query’s nature and the required data type. This routing capability is crucial when dealing with heterogeneous data sources, because different data types often require distinct retrieval and processing strategies.

Consider a financial data analysis system. For a qualitative question like “What caused inflation in 2023?”, the router would direct the query to a text-based RAG that retrieves relevant documents and uses an LLM to generate an answer based on textual information. However, for a quantitative question such as “What was the average inflation in 2023?”, the router would direct the query to a different pipeline that fetches and analyzes the relevant dataset.

The router accomplishes this through intent detection, analyzing the query to determine the type of data and analysis required to answer it. In systems with heterogeneous data, this process makes sure each data type is processed appropriately, whether it’s unstructured text, structured tables, or multimodal content. For instance, analyzing large tables might require prompting the LLM to generate Python or SQL and running it, rather than passing the tabular data to the LLM. We give more details on that aspect later in this post.

In practice, the router module can be implemented with an initial LLM call. The following is an example prompt for a router, following the example of financial analysis with heterogeneous data. To avoid adding too much latency with the routing step, we recommend using a smaller model, such as Anthropic’s Claude Haiku on Amazon Bedrock.

router_template = """
You are a financial data assistant that can query different data sources
based on the user's request. The available data sources are:

<data_sources>
<source>
<name>Stock Prices Database</name>
<description>Contains historical stock price data for publicly traded companies.</description>
</source>
<source>
<name>Analyst Notes Database</name>
<description>Knowledge base containing reports from Analysts on their interpretation and analysis of economic events.</description>
</source>
<source>
<name>Economic Indicators Database</name>
<description>Holds macroeconomic data like GDP, inflation, unemployment rates, etc.</description>
</source>
<source>
<name>Regulatory Filings Database</name>
<description>Contains SEC filings, annual reports, and other regulatory documents for public companies.</description>
</source>
</data_sources>

<instructions>
When the user asks a query, analyze the intent and route it to the appropriate data source.
If the query is not related to any of the available data sources,
respond politely that you cannot assist with that request.
</instructions>

<example>
<query>What was the closing price of Amazon stock on January 1st, 2022?</query>
<data_source>Stock Prices Database</data_source>
<reason>The question is about a stock price.</reason>
</example>

<example>
<query>What caused inflation in 2021?</query>
<data_source>Analyst Notes Database</data_source>
<reason>This is asking for interpretation of an event, I will look in Analyst Notes.</reason>
</example>

<example>
<query>How has the US unemployment rate changed over the past 5 years?</query>
<data_source>Economic Indicators Database</data_source>
<reason>Unemployment rate is an Economic indicator.</reason>
</example>

<example>
<query>I need to see the latest 10-K filing for Amazon.</query>
<data_source>Regulatory Filings Database</data_source>
<reason>SEC 10-K filings are in the Regulatory Filings database.</reason>
</example>

<example>
<query>What's the best restaurant in town?</query>
<data_source>None</data_source>
<reason>Restaurant recommendations are not related to any data source.</reason>
</example>

Here is the user query
<query>
{user_query}
</query>

Output the data source in <data_source> tags and the explanation in <reason> tags.
"""

Prompting the LLM to explain the routing logic may help with accuracy, by forcing the LLM to “think” about its answer, and also for debugging purposes, to understand why a category might not be routed properly.

The prompt uses XML tags following Anthropic’s Claude best practices. Note that in this example prompt we used <data_source> tags but something similar such as <category> or <label> could also be used. Asking the LLM to also structure its response with XML tags allows us to parse out the category from the LLM answer, which can be done with the following code:

import re

# Parse out the data source from the LLM's XML-tagged response
pattern = r"<data_source>(.*?)</data_source>"
data_source = re.findall(
    pattern, llm_response, re.DOTALL
)[0]

From a user’s perspective, if the LLM fails to provide the right routing category, the user can explicitly ask for the data source they want to use in the query. For instance, instead of saying “What caused inflation in 2023?”, the user could disambiguate by asking “What caused inflation in 2023 according to analysts?”, and instead of “What was the average inflation in 2023?”, the user could ask “What was the average inflation in 2023? Look at the indicators.”

Another option for a better user experience is to add an option to ask for clarifications in the router, if the LLM finds that the query is too ambiguous. We can add this as an additional “data source” in the router using the following code:

<source>
<name>Clarifications</name>
<description>If the query is too ambiguous, use this to ask the user for more
clarifications. Put your reply to the user in the reason tags</description>
</source>

We use an associated example:

<example>
<query>What can you tell me about Amazon stock?</query>
<data_source>Clarifications</data_source>
<reason>I'm not sure how to best answer your question,
do you want me to look into Stock Prices, Analyst Notes, Regulatory filings?</reason>
</example>

If in the LLM’s response, the data source is Clarifications, we can then directly return the content of the <reason> tags to the user for clarifications.

An alternative approach to routing is to use the native tool use capability (also known as function calling) available within the Bedrock Converse API. In this scenario, each category or data source would be defined as a ‘tool’ within the API, enabling the model to select and use these tools as needed. Refer to this documentation for a detailed example of tool use with the Bedrock Converse API.
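
As a rough sketch of that approach, each data source from the router example above can be declared as a tool, and the model's tool choice becomes the routing decision; the tool names and schemas below are illustrative:

import boto3

bedrock_runtime = boto3.client("bedrock-runtime")

# Each data source is exposed as a "tool"; the model picks the one that matches the query
tool_config = {
    "tools": [
        {
            "toolSpec": {
                "name": "stock_prices_database",
                "description": "Query historical stock price data for publicly traded companies.",
                "inputSchema": {
                    "json": {
                        "type": "object",
                        "properties": {"query": {"type": "string"}},
                        "required": ["query"],
                    }
                },
            }
        },
        {
            "toolSpec": {
                "name": "analyst_notes_database",
                "description": "Search analyst reports interpreting economic events.",
                "inputSchema": {
                    "json": {
                        "type": "object",
                        "properties": {"query": {"type": "string"}},
                        "required": ["query"],
                    }
                },
            }
        },
    ]
}

response = bedrock_runtime.converse(
    modelId="anthropic.claude-3-haiku-20240307-v1:0",
    messages=[{"role": "user", "content": [{"text": "What caused inflation in 2021?"}]}],
    toolConfig=tool_config,
)

# If the model decides to route the query, the response contains a toolUse block
for block in response["output"]["message"]["content"]:
    if "toolUse" in block:
        print("Route to:", block["toolUse"]["name"], "with input:", block["toolUse"]["input"])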

Using LLM code generation abilities for RAG with structured data

Consider an oil and gas company analyzing a dataset of daily oil production. The analyst may ask questions such as "Show me all wells that produced oil on June 1st 2024," "What well produced the most oil in June 2024?", or "Plot the monthly oil production for well XYZ for 2024." Each question requires different treatment, with varying complexity. The first one involves filtering the dataset to return all wells with production data for that specific date. The second one requires computing the monthly production values from the daily data, then finding the maximum and returning the well ID. The third one requires computing the monthly average for well XYZ and then generating a plot.

LLMs don’t perform well at analyzing tabular data when it’s added directly in the prompt as raw text. A simple way to improve the LLM’s handling of tables is to add the table to the prompt in a more structured format, such as markdown or XML. However, this method only works if the question doesn’t require complex quantitative reasoning and the table is small enough. In other cases, we can’t reliably use an LLM to analyze tabular data, even when it’s provided in a structured format in the prompt.
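
As a minimal illustration of that simpler option, you can serialize a small table to markdown before adding it to the prompt; the DataFrame below is hypothetical:

import pandas as pd

# Hypothetical daily production table
df = pd.DataFrame(
    {
        "well_id": ["XYZ", "ABC"],
        "date": ["2024-06-01", "2024-06-01"],
        "oil_bbl": [120.5, 0.0],
    }
)

# Markdown makes rows and columns explicit for the LLM
# (DataFrame.to_markdown requires the tabulate package)
table_md = df.to_markdown(index=False)

prompt = f"""Here is the production table:
{table_md}

Question: Which wells produced oil on June 1st 2024?"""
print(prompt)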

On the other hand, LLMs are notably good at code generation; for instance, Anthropic’s Claude Sonnet 3.5 has 92% accuracy on the HumanEval code benchmark. We can take advantage of that capability by asking the LLM to write Python (if the data is stored in a CSV, Excel, or Parquet file) or SQL (if the data is stored in a SQL database) code that performs the required analysis. Popular libraries Llama Index and LangChain both offer out-of-the-box solutions for text-to-SQL (Llama Index, LangChain) and text-to-Pandas (Llama Index, LangChain) pipelines for quick prototyping. However, for better control over prompts, code execution, and outputs, it might be worth writing your own pipeline. Out-of-the-box solutions will typically prompt the LLM to write Python or SQL code to answer the user’s question, then parse and run the code from the LLM’s response, and finally send the code output back to the LLM for a final answer.

Going back to the oil and gas data analysis use case, take the question "Show me all wells that produced oil on June 1st 2024." There could be hundreds of entries in the dataframe. In that case, a custom pipeline that directly returns the code output to the UI (the filtered dataframe for the date of June 1st 2024, with oil production greater than 0) would be more efficient than sending it to the LLM for a final answer. If the filtered dataframe is large, the additional call might cause high latency and even risks introducing hallucinations. Writing your own custom pipeline also allows you to perform sanity checks on the code, to verify, for instance, that the code generated by the LLM will not create issues (such as modifying existing files or databases).

The following is an example of a prompt that can be used to generate Pandas code for data analysis:

prompt_template = """
You are an AI assistant designed to answer questions from oil and gas analysts.
You have access to a Pandas dataframe df that contains daily production data for oil producing wells.

Here is a sample from df:
<df_sample>
{sample}
</df_sample>

Here is the analyst's question:
<question>
{question}
</question>

<instructions>
 - Use <scratchpad> tags to think about what you are going to do.
 - Put the code in <code> tags.
 - The dataframes may contain nans, so make sure you account for those in your code.
 - In your code, the final variable should be named "result".
</instructions>
"""

We can then parse the code out from the <code> tags in the LLM response and run it using exec in Python. The following code is a full example:

import json
import re

import boto3
import pandas as pd

# Import the csv into a DataFrame (daily oil production data)
df = pd.read_csv('oil_production.csv')

# Create an Amazon Bedrock runtime client
bedrock_client = boto3.client('bedrock-runtime')

# Define the prompt
user_query = "Show me all wells that produced oil on June 1st 2024"
prompt = prompt_template.format(sample=df.sample(5), question=user_query)

# Call Anthropic Claude Sonnet
request_body = json.dumps(
    {
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 1000,
        "messages": [
            {
                "role": "user",
                "content": prompt
            }
        ]
    }
)
response = bedrock_client.invoke_model(
    modelId="anthropic.claude-3-sonnet-20240229-v1:0",
    body=request_body
)
# Get the LLM's response
llm_response = json.loads(
    response['body'].read().decode('utf-8')
)['content'][0]['text']

# Extract code from the LLM response
code_pattern = r"<code>(.*?)</code>"
code_matches = re.findall(
    code_pattern, llm_response, re.DOTALL
)

# Use a dictionary to pass the dataframe (and pandas) to the exec environment
local_vars = {"df": df, "pd": pd}
for match in code_matches:
    exec(
        match, local_vars
    )

# Variables created in the exec environment get stored in the local_vars dict
code_output = local_vars["result"]

# We can then return the code output directly, or send the code output
# to the LLM to get the final answer

# Call Anthropic Claude Sonnet again with the code output
request_body = json.dumps(
    {
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 4000,
        "messages": [
            {
                "role": "user",
                "content": prompt
            },
            {
                "role": "assistant",
                "content": llm_response
            },
            {
                "role": "user",
                "content": f"This is the code output: {code_output}"
            }
        ]
    }
)
response = bedrock_client.invoke_model(
    modelId="anthropic.claude-3-sonnet-20240229-v1:0",
    body=request_body
)

# Get the final LLM's response
final_llm_response = json.loads(
    response['body'].read().decode('utf-8')
)['content'][0]['text']

Because we explicitly prompt the LLM to store the final result in the result variable, we know it will be stored in the local_vars dictionary under that key, and we can retrieve it that way. We can then either directly return this result to the user, or send it back to the LLM to generate its final response. Sending the variable back to the user directly can be useful if the request requires filtering and returning a large dataframe, for instance. Directly returning the variable to the user removes the risk of hallucination that can occur with large inputs and outputs.

Multimodal RAG

An emerging trend in generative AI is multimodality, with models that can use text, images, audio, and video. In this post, we focus exclusively on mixing text and image data sources.

In an industrial maintenance use case, consider a technician facing an issue with a machine. To troubleshoot, they might need visual information about the machine, not just a textual guide.

In ecommerce, using multimodal RAG can enhance the shopping experience not only by allowing users to input images to find visually similar products, but also by providing more accurate and detailed product descriptions from visuals of the products.

We can categorize multimodal text and image RAG questions in three categories:

  • Image retrieval based on text input – For example:
    • “Show me a diagram to repair the compressor on the ice cream machine.”
    • “Show me red summer dresses with floral patterns.”
  • Text retrieval based on image input – For example:
    • A technician might take a picture of a specific part of the machine and ask, “Show me the manual section for this part.”
  • Image retrieval based on text and image input – For example:
    • A customer could upload an image of a dress and ask, “Show me similar dresses.” or “Show me items with a similar pattern.”

As with traditional RAG pipelines, the retrieval component is the basis of these solutions. Constructing a multimodal retriever requires having an embedding strategy that can handle this multimodality. There are two main options for this.

First, you could use a multimodal embedding model such as Amazon Titan Multimodal Embeddings, which can embed both images and text into a shared vector space. This allows for direct comparison and retrieval of text and images based on semantic similarity. This simple approach is effective for finding images that match a high-level description or for matching images of similar items. For instance, a query like “Show me summer dresses” would return a variety of images that fit that description. It’s also suitable for queries where the user uploads a picture and asks, “Show me dresses similar to that one.”

The following diagram shows the ingestion logic with a multimodal embedding. The images in the database are sent to a multimodal embedding model that returns vector representations of the images. The images and the corresponding vectors are paired up and stored in the vector database.

This diagram shows the ingestion of images in a vector database using a multimodal embedding.

At retrieval time, the user query (which can be text or image) is passed to the multimodal embedding model, which returns a vectorized user query that is used by the retriever module to search for images that are close to the user query, in the embedding distance. The closest images are then returned.

This diagram shows the retrieval of images from a user query in a vector database using a multimodal embedding.

Alternatively, you could use a multimodal foundation model (FM) such as Anthropic’s Claude 3 Haiku, Sonnet, or Opus, or Claude Sonnet 3.5, all available on Amazon Bedrock, to generate a caption for each image, which is then used for retrieval. Specifically, the generated image description is embedded using a traditional text embedding model (for example, Amazon Titan Text Embeddings V2) and stored in a vector store along with the image as metadata.

Captions can capture finer details in images, and can be guided to focus on specific aspects such as color, fabric, pattern, shape, and more. This would be better suited for queries where the user uploads an image and looks for similar items but only in some aspects (such as uploading a picture of a dress, and asking for skirts in a similar style). This would also work better to capture the complexity of diagrams in industrial maintenance.

The following figure shows the ingestion logic with a multimodal FM and text embedding. The images in the database are sent to a multimodal FM that returns image captions. The image captions are then sent to a text embedding model and converted to vectors. The images are paired up with the corresponding vectors and captions and stored in the vector database.

This diagram shows the ingestion of images in a vector database using a multimodal foundation model.

At retrieval time, the user query (text) is passed to the text embedding model, which returns a vectorized user query that is used by the retriever module to search for captions that are close to the user query, in the embedding distance. The images corresponding to the closest captions are then returned, optionally with the caption as well. If the user query contains an image, we need to use a multimodal LLM to describe that image similarly to the previous ingestion steps.

This diagram shows the retrieval of images from a user query in a vector database using a multimodal foundation model.

Example with a multimodal embedding model

The following is a code sample performing ingestion with Amazon Titan Multimodal Embeddings as described earlier. The embedded image is stored in an OpenSearch index with a k-nearest neighbors (k-NN) vector field.

# Helper functions (read_and_encode_image, get_embedding, get_open_search_client,
# create_opensearch_index) are defined in the accompanying utils module
from utils import *

# Read and encode the image
file_name = 'image.png'
image_base64 = read_and_encode_image(file_name)

# Embed the image using Amazon Titan Multimodal Embeddings
multi_embedding_model = "amazon.titan-embed-image-v1"
image_embedding = get_embedding(input = image_base64, model = multi_embedding_model)

# Get OpenSearch client (assume this function is available)
open_search = get_open_search_client()

# Create index in OpenSearch for storing embeddings
create_opensearch_index(name = 'multimodal-image-index', client = open_search)

# Index the image and its embedding in OpenSearch
request = {
    'image': image_base64,
    "vector_field": image_embedding,
    "_op_type": "index",
    "source": file_name  # replace with a URL or S3 location if needed
}
result = open_search.index(index='multimodal-image-index', body=request)


The following is the code sample performing the retrieval with Amazon Titan Multimodal Embeddings:

# Use Amazon Titan Multimodal Embeddings to embed the user query
query_text = "Show me a diagram to repair the compressor on the ice cream machine."

query_embedding = get_embedding(input = query_text, model = multi_embedding_model)

# Search for images that are close to that description in OpenSearch
search_query = {
        'query': {
            'bool': {
                'should': [
                    {
                        'knn': {
                            'vector_field': {
                                'vector': query_embedding,
                                'k': 5
                            }
                        }
                    }
                ]
            }
        }
    }

response = open_search.search(index='multimodal-image-index', body=search_query)

In the response, we have the images that are closest to the user query in embedding space, thanks to the multimodal embedding.

Example with a multimodal FM

The following is a code sample performing the retrieval and ingestion described earlier. It uses Anthropic’s Claude Sonnet 3 to caption the image first, and then Amazon Titan Text Embeddings to embed the caption. You could also use another multimodal FM such as Anthropic’s Claude Sonnet 3.5, Haiku 3, or Opus 3 on Amazon Bedrock. The image, caption embedding, and caption are stored in an OpenSearch index. At retrieval time, we embed the user query using the same Amazon Titan Text Embeddings model and perform a k-NN search on the OpenSearch index to retrieve the relevant image.

# Read and encode the image
file_name = 'image.png'
image_base64 = read_and_encode_image(file_name)

# Use Anthropic Claude Sonnet to caption the image
caption = call_multimodal_llm(
    modelId="anthropic.claude-3-sonnet-20240229-v1:0",
    text="Describe this image in detail. Only output the description, nothing else",
    image=image_base64
)

# Compute text embedding for the caption
text_embedding_model = "amazon.titan-embed-text-v2:0"
caption_embedding = get_embedding(input = caption, model = text_embedding_model)

# Define a mapping with a knn vector field for the caption embedding
# (1,024 dimensions for Amazon Titan Text Embeddings V2)
mapping = {
    "settings": {"index": {"knn": True}},
    "mappings": {
        "properties": {
            "vector_field": {"type": "knn_vector", "dimension": 1024},
            "image_base64": {"type": "binary"},
            "caption": {"type": "text"},
            "source": {"type": "keyword"}
        }
    }
}

# Create the index with a mapping that has a knn vector field
open_search.indices.create(index='image-caption-index', body=mapping)

# Index image in OpenSearch
open_search.index(
    index='image-caption-index',
    body={
        "image_base64": image_base64,
        "vector_field": caption_embedding,
        "caption": caption,
        "source": file_name
    }
)

The following is code to perform the retrieval step using text embeddings:

# Compute embedding for a natural language query with text embedding
user_query = "Show me a diagram to repair the compressor on the ice cream machine."
query_embedding = get_embedding(input = user_query, model = text_embedding_model)

# Search for images that match that query in OpenSearch
search_query ={
        'query': {
            'bool': {
                'should': [
                    {
                        'knn': {
                            'vector_field': {
                                'vector': query_embedding,
                                'k': 5
                            }
                        }
                    }
                ]
            }
        }
    }

response = open_search.search(index='image-caption-index', body=search_query)

This returns the images whose captions are closest to the user query in the embedding space, thanks to the text embeddings. In the response, we get both the images and the corresponding captions for downstream use.

Comparative table of multimodal approaches

The following table provides a comparison between using multimodal embeddings and using a multimodal LLM for image captioning, across several key factors. Multimodal embeddings offer faster ingestion and are generally more cost-effective, making them suitable for large-scale applications where speed and efficiency are crucial. On the other hand, using a multimodal LLM for captions, though slower and less cost-effective, provides more detailed and customizable results, which is particularly useful for scenarios requiring precise image descriptions. Considerations such as latency for different input types, customization needs, and the level of detail required in the output should guide the decision-making process when selecting your approach.

|  | Multimodal Embeddings | Multimodal LLM for Captions |
| --- | --- | --- |
| Speed | Faster ingestion | Slower ingestion due to additional LLM call |
| Cost | More cost-effective | Less cost-effective |
| Detail | Basic comparison based on embeddings | Detailed captions highlighting specific features |
| Customization | Less customizable | Highly customizable with prompts |
| Text Input Latency | Same as multimodal LLM | Same as multimodal embeddings |
| Image Input Latency | Faster, no extra processing required | Slower, requires extra LLM call to generate image caption |
| Best Use Case | General use, quick and efficient data handling | Precise searches needing detailed image descriptions |

Conclusion

Building real-world RAG systems with heterogeneous data formats presents unique challenges, but also unlocks powerful capabilities for enabling natural language interactions with complex data sources. By employing techniques like intent detection, code generation, and multimodal embeddings, you can create intelligent systems that can understand queries, retrieve relevant information from structured and unstructured data sources, and provide coherent responses. The key to success lies in breaking down the problem into modular components and using the strengths of FMs for each component. Intent detection helps route queries to the appropriate processing logic, and code generation enables quantitative reasoning and analysis on structured data sources. Multimodal embeddings and multimodal FMs enable you to bridge the gap between text and visual data, enabling seamless integration of images and other media into your knowledge bases.

Get started with FMs and embedding models in Amazon Bedrock to build RAG solutions that seamlessly integrate tabular, image, and text data for your organization’s unique needs.


About the Author

Aude Genevay is a Senior Applied Scientist at the Generative AI Innovation Center, where she helps customers tackle critical business challenges and create value using generative AI. She holds a PhD in theoretical machine learning and enjoys turning cutting-edge research into real-world solutions.

Read More

Cohere Embed multimodal embeddings model is now available on Amazon SageMaker JumpStart

Cohere Embed multimodal embeddings model is now available on Amazon SageMaker JumpStart

The Cohere Embed multimodal embeddings model is now generally available on Amazon SageMaker JumpStart. This model is the newest Cohere Embed 3 model, which is now multimodal and capable of generating embeddings from both text and images, enabling enterprises to unlock real value from their vast amounts of data that exist in image form.

In this post, we discuss the benefits and capabilities of this new model with some examples.

Overview of multimodal embeddings and multimodal RAG architectures

Multimodal embeddings are mathematical representations that integrate information not only from text but from multiple data modalities—such as product images, graphs, and charts—into a unified vector space. This integration allows for seamless interaction and comparison between different types of data. As foundation models (FMs) advance, they increasingly require the ability to interpret and generate content across various modalities to better mimic human understanding and communication. This trend toward multimodality enhances the capabilities of AI systems in tasks like cross-modal retrieval, where a query in one modality (such as text) retrieves data in another modality (such as images or design files).

Multimodal embeddings can enable personalized recommendations by understanding user preferences and matching them with the most relevant assets. For instance, in ecommerce, product images are a critical factor influencing purchase decisions. Multimodal embeddings models can enhance personalization through visual similarity search, where users can upload an image or select a product they like, and the system finds visually similar items. In the case of retail and fashion, multimodal embeddings can capture stylistic elements, enabling the search system to recommend products that fit a particular aesthetic, such as “vintage,” “bohemian,” or “minimalist.”

Multimodal Retrieval Augmented Generation (MM-RAG) is emerging as a powerful evolution of traditional RAG systems, addressing limitations and expanding capabilities across diverse data types. Traditionally, RAG systems were text-centric, retrieving information from large text databases to provide relevant context for language models. However, as data becomes increasingly multimodal in nature, extending these systems to handle various data types is crucial to provide more comprehensive and contextually rich responses. MM-RAG systems that use multimodal embeddings models to encode both text and images into a shared vector space can simplify retrieval across modalities. MM-RAG systems can also enable enhanced customer service AI agents that can handle queries that involve both text and images, such as product defects or technical issues.

Cohere Multimodal Embed 3: Powering enterprise search across text and images

Cohere’s embeddings model, Embed 3, is an industry-leading AI search model that is designed to transform semantic search and generative AI applications. Cohere Embed 3 is now multimodal and capable of generating embeddings from both text and images. This enables enterprises to unlock real value from their vast amounts of data that exist in image form. Businesses can now build systems that accurately search important multimodal assets such as complex reports, ecommerce product catalogs, and design files to boost workforce productivity.

Cohere Embed 3 translates input data into long strings of numbers that represent the meaning of the data. These numerical representations are then compared to each other to determine similarities and differences. Cohere Embed 3 places both text and image embeddings in the same space for an integrated experience.

The following figure illustrates an example of this workflow. This figure is simplified for illustrative purposes. In practice, the numerical representations of data (seen in the output column) are far longer and the vector space that stores them has a higher number of dimensions.

This similarity comparison enables applications to retrieve enterprise data that is relevant to an end-user query. In addition to being a fundamental component of semantic search systems, Cohere Embed 3 is useful in RAG systems because it gives generative models like the Command R series the most relevant context to inform their responses.

All businesses, across industry and size, can benefit from multimodal AI search. Specifically, customers are interested in the following real-world use cases:

  • Graphs and charts – Visual representations are key to understanding complex data. You can now effortlessly find the right diagrams to inform your business decisions. Simply describe a specific insight and Cohere Embed 3 will retrieve relevant graphs and charts, making data-driven decision-making more efficient for employees across teams.
  • Ecommerce product catalogs – Traditional search methods often limit you to finding products through text-based product descriptions. Cohere Embed 3 transforms this search experience. Retailers can build applications that surface products that visually match a shopper’s preferences, creating a differentiated shopping experience and improving conversion rates.
  • Design files and templates – Designers often work with vast libraries of assets, relying on memory or rigorous naming conventions to organize visuals. Cohere Embed 3 makes it simple to locate specific UI mockups, visual templates, and presentation slides based on a text description. This streamlines the creative process.

The following figure illustrates some examples of these use cases.

At a time when businesses are increasingly expected to use their data to drive outcomes, Cohere Embed 3 offers several advantages that accelerate productivity and improve customer experience.

The following chart compares Cohere Embed 3 with another embeddings model. All text-to-image benchmarks are evaluated using Recall@5; text-to-text benchmarks are evaluated using NDCG@10. Text-to-text benchmark accuracy is based on BEIR, a dataset focused on out-of-domain retrievals (14 datasets). Generic text-to-image benchmark accuracy is based on Flickr and CoCo. Graphs and charts benchmark accuracy is based on business reports and presentations constructed internally. Ecommerce benchmark accuracy is based on a mix of product catalog and fashion catalog datasets. Design files benchmark accuracy is based on a product design retrieval dataset constructed internally.

BEIR (Benchmarking IR) is a heterogeneous benchmark: it uses a diverse collection of datasets and tasks designed for evaluating information retrieval (IR) models. It provides a common framework for assessing the performance of natural language processing (NLP)-based retrieval models, making it straightforward to compare different approaches. Recall@5 is a specific metric used in information retrieval evaluation, including in the BEIR benchmark; it measures the proportion of relevant items retrieved within the top five results, compared to the total number of relevant items in the dataset.
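
As a simple illustration of the metric (not the benchmark's own implementation), Recall@5 can be computed as follows:

def recall_at_5(retrieved_ids, relevant_ids):
    """Fraction of all relevant items that appear in the top 5 retrieved results."""
    top_5 = set(retrieved_ids[:5])
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    return len(top_5 & relevant) / len(relevant)

# 2 of the 3 relevant documents appear in the top 5 results -> 0.67
print(recall_at_5(["d1", "d7", "d3", "d9", "d2", "d4"], ["d3", "d2", "d8"]))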

Cohere’s latest Embed 3 model’s text and image encoders share a unified latent space. This approach has a few important benefits. First, it enables you to include both image and text features in a single database and therefore reduces complexity. Second, it means current customers can begin embedding images without re-indexing their existing text corpus. In addition to leading accuracy and ease of use, Embed 3 continues to deliver the same useful enterprise search capabilities as before. It can output compressed embeddings to save on database costs, it’s compatible with over 100 languages for multilingual search, and it maintains strong performance on noisy real-world data.

Solution overview

SageMaker JumpStart offers access to a broad selection of publicly available FMs. These pre-trained models serve as powerful starting points that can be deeply customized to address specific use cases. You can now use state-of-the-art model architectures, such as language models, computer vision models, and more, without having to build them from scratch.

Amazon SageMaker is a comprehensive, fully managed machine learning (ML) platform that revolutionizes the entire ML workflow. It offers an unparalleled suite of tools that cater to every stage of the ML lifecycle, from data preparation to model deployment and monitoring. Data scientists and developers can use the SageMaker integrated development environment (IDE) to access a vast array of pre-built algorithms, customize their own models, and seamlessly scale their solutions. The platform’s strength lies in its ability to abstract away the complexities of infrastructure management, allowing you to focus on innovation rather than operational overhead.

You can access the Cohere Embed family of models using SageMaker JumpStart in Amazon SageMaker Studio.

For those new to SageMaker JumpStart, we walk through using SageMaker Studio to access models in SageMaker JumpStart.

Prerequisites

Make sure you meet the following prerequisites:

  • Make sure your SageMaker AWS Identity and Access Management (IAM) role has the AmazonSageMakerFullAccess permission policy attached.
  • To deploy Cohere multimodal embeddings successfully, confirm the following:
    • Your IAM role has the following permissions and you have the authority to make AWS Marketplace subscriptions in the AWS account used:
      • aws-marketplace:ViewSubscriptions
      • aws-marketplace:Unsubscribe
      • aws-marketplace:Subscribe
    • Alternatively, confirm your AWS account has a subscription to the model. If so, skip to the next section in this post.

Deployment starts when you choose the Deploy option. You may be prompted to subscribe to this model through AWS Marketplace. If you’re already subscribed, then you can proceed and choose Deploy. After deployment finishes, you will see that an endpoint is created. You can test the endpoint by passing a sample inference request payload or by selecting the testing option using the SDK.

Subscribe to the model package

To subscribe to the model package, complete the following steps:

  1. Depending on the model you want to deploy, open the model package listing page for it.
  2. On the AWS Marketplace listing, choose Continue to subscribe.
  3. On the Subscribe to this software page, choose Accept Offer if you and your organization agree with the EULA, pricing, and support terms.
  4. Choose Continue to configuration and then choose an AWS Region.

You will see a product ARN displayed. This is the model package ARN that you need to specify while creating a deployable model using Boto3.

  1. Subscribe to the Cohere embeddings model package on AWS Marketplace.
  2. Choose the appropriate model package ARN for your Region. For example, the ARN for Cohere Embed Model v3 – English is:
    arn:aws:sagemaker:[REGION]:[ACCOUNT_ID]:model-package/cohere-embed-english-v3-7-6d097a095fdd314d90a8400a620cac54

Deploy the model using the SDK

To deploy the model using the SDK, copy the product ARN from the previous step and specify it in the model_package_arn in the following code:

from cohere_aws import Client 
import boto3 
region = boto3.Session().region_name 
model_package_arn = "Specify the model package ARN here"

Use the SageMaker SDK to create a client and deploy the models:

co = Client(region_name=region)
co.create_endpoint(arn=model_package_arn, endpoint_name="cohere-embed-english-v3", instance_type="ml.g5.xlarge", n_instances=1)

If the endpoint is already created using SageMaker Studio, you can simply connect to it:

co.connect_to_endpoint(endpoint_name="cohere-embed-english-v3")

Consider the following best practices:

  • Choose an appropriate instance type based on your performance and cost requirements. This example uses ml.g5.xlarge, but you might need to adjust this based on your specific needs.
  • Make sure your IAM role has the necessary permissions, including AmazonSageMakerFullAccess.
  • Monitor your endpoint’s performance and costs using Amazon CloudWatch, as shown in the sketch after this list.
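
For example, the following sketch pulls the hourly invocation count for the endpoint from the AWS/SageMaker CloudWatch namespace; the endpoint and variant names are assumptions based on the deployment earlier in this post:

import boto3
from datetime import datetime, timedelta

cloudwatch = boto3.client("cloudwatch")

# Hourly invocation count for the embeddings endpoint over the last day
response = cloudwatch.get_metric_statistics(
    Namespace="AWS/SageMaker",
    MetricName="Invocations",
    Dimensions=[
        {"Name": "EndpointName", "Value": "cohere-embed-english-v3"},
        {"Name": "VariantName", "Value": "AllTraffic"},
    ],
    StartTime=datetime.utcnow() - timedelta(days=1),
    EndTime=datetime.utcnow(),
    Period=3600,
    Statistics=["Sum"],
)
for point in sorted(response["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], int(point["Sum"]))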

Inference example with Cohere Embed 3 using the SageMaker SDK

The following code example illustrates how to perform real-time inference using Cohere Embed 3. We walk through a sample notebook to get started. You can also find the source code on the accompanying GitHub repo.

Pre-setup

Import all required packages using the following code:

import requests
import base64
import os
import json
import mimetypes
import boto3
import numpy as np
from IPython.display import Image, display
import tqdm
import tqdm.auto

# SageMaker runtime client and the name of the endpoint created earlier in this post
sagemaker_runtime = boto3.client("sagemaker-runtime")
endpoint_name = "cohere-embed-english-v3"

Create helper functions

Use the following code to create helper functions that determine whether the input document is text or image, and download images given a list of URLs:

def is_image(doc):
    return (doc.endswith(".jpg") or doc.endswith(".png")) and os.path.exists(doc)

def is_txt(doc):
    return (doc.endswith(".txt")) and os.path.exists(doc)

def download_images(image_urls):
    image_names = []

    #print("Download some example images we want to embed")
    for url in image_urls:
        image_name = os.path.basename(url)
        image_names.append(image_name)

        if not os.path.exists(image_name):
            with open(image_name, "wb") as fOut:
                fOut.write(requests.get(url, stream=True).content)
    
    return image_names

Generate embeddings for text and image inputs

The following code shows a compute_embeddings() function we defined that will accept multimodal inputs to generate embeddings with Cohere Embed 3:

def compute_embeddings(docs):
    # Compute the embeddings
    embeddings = []
    for doc in tqdm.auto.tqdm(docs, desc="encoding"):
        if is_image(doc):
            print("Encode image:", doc)
            # Doc is an image, encode it as an image

            # Convert the images to base64
            with open(doc, "rb") as fIn:
                img_base64 = base64.b64encode(fIn.read()).decode("utf-8")
            
            #Get the mime type for the image
            mime_type = mimetypes.guess_type(doc)[0]
            
            payload = {
                "model": "embed-english-v3.0",
                "input_type": 'image',
                "embedding_types": ["float"],
                "images": [f"data:{mime_type};base64,{img_base64}"]
            }
        
            response = sagemaker_runtime.invoke_endpoint(
                EndpointName=endpoint_name,
                ContentType='application/json',
                Body=json.dumps(payload)
            )

            response = json.loads(response['Body'].read().decode("utf-8"))
            response = response["embeddings"]["float"][0]
        elif is_txt(doc):
            # Doc is a text file, encode it as a document
            with open(doc, "r") as fIn:
                text = fIn.read()

            print("Encode img desc:", doc, " - Content:", text[0:100]+"...")
            
            payload = {
                "texts": [text],
                "model": "embed-english-v3.0",
                "input_type": "search_document",
            }
            
            response = sagemaker_runtime.invoke_endpoint(
                EndpointName=endpoint_name,
                ContentType='application/json',
                Body=json.dumps(payload)
            )
            response = json.loads(response['Body'].read().decode("utf-8"))
            response = response["embeddings"][0]
        else:
            #Encode as document
            
            payload = {
                "texts": [doc],
                "model": "embed-english-v3.0",
                "input_type": "search_document",
            }
            
            response = sagemaker_runtime.invoke_endpoint(
                EndpointName=endpoint_name,
                ContentType='application/json',
                Body=json.dumps(payload)
            )
            response = json.loads(response['Body'].read().decode("utf-8"))
            response = response["embeddings"][0]
        embeddings.append(response)
    return np.asarray(embeddings, dtype="float")

Find the most relevant embedding based on query

The search() function generates the query embedding and computes cosine similarities between the query and the document embeddings:

def search(query, embeddings, docs):
    # Get the query embedding
    
    payload = {
        "texts": [query],
        "model": "embed-english-v3.0",
        "input_type": "search_document",
    }
    
    response = sagemaker_runtime.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType='application/json',
        Body=json.dumps(payload)
    )
    query_emb = json.loads(response['Body'].read().decode("utf-8"))
    query_emb = query_emb["embeddings"][0]

    # Compute L2 norms of the vector and matrix rows
    vector_norm = np.linalg.norm(query_emb)
    matrix_norms = np.linalg.norm(embeddings, axis = 1)

    # Compute the dot product between the vector and each row of the matrix
    dot_products = np.dot(embeddings, query_emb)
    
    
    # Compute cosine similarities
    similarity = dot_products / (matrix_norms * vector_norm)

    # Sort decreasing most to least similar
    top_hits = np.argsort(-similarity)

    print("Query:", query, "n")
    # print(similarity)
    print("Search results:")
    for rank, idx in enumerate(top_hits):
        print(f"#{rank+1}: ({similarity[idx]*100:.2f})")
        if is_image(docs[idx]):
            print(docs[idx])
            display(Image(filename=docs[idx], height=300))
        elif is_txt(docs[idx]):
            print(docs[idx]+" - Image description:")
            with open(docs[idx], "r") as fIn:
                print(fIn.read())
            #display(Image(filename=docs[idx].replace(".txt", ".jpg"), height=300))
        else:
            print(docs[idx])
        print("--------")

Test the solution

Let’s assemble all the input documents; notice that there are both text and image inputs:

# Download images
image_urls = [
    "https://images-na.ssl-images-amazon.com/images/I/31KqpOznU1L.jpg",
    "https://images-na.ssl-images-amazon.com/images/I/41RI4qgJLrL.jpg",
    "https://images-na.ssl-images-amazon.com/images/I/61NbJr9jthL.jpg",
    "https://images-na.ssl-images-amazon.com/images/I/31TW1NCtMZL.jpg",
    "https://images-na.ssl-images-amazon.com/images/I/51a6iOTpnwL.jpg",
    "https://images-na.ssl-images-amazon.com/images/I/31sa-c%2BfmpL.jpg",
    "https://images-na.ssl-images-amazon.com/images/I/41sKETcJYcL.jpg",
    "https://images-na.ssl-images-amazon.com/images/I/416GZ2RZEPL.jpg"
]
image_names = download_images(image_urls)
text_docs = [
    "Toy with 10 activities including a storybook, clock, gears; 13 double-sided alphabet blocks build fine motor skills and introduce letters, numbers, colors, and more.",
    "This is the perfect introduction to the world of scooters.",
    "2 -IN-1 RIDE-ON TOY- This convertible scooter is designed to grow with your child.",
    "Playful elephant toy makes real elephant sounds and fun music to inspire imaginative play."
]

docs = image_names + text_docs
print("Total docs:", len(docs))
print(docs)

Generate embeddings for the documents:

embeddings = compute_embeddings(docs)
print("Doc embeddings shape:", embeddings.shape)

The output is a matrix of 12 items (8 images and 4 text documents), each with 1,024 embedding dimensions.

Search for the most relevant documents given the query “Fun animal toy”:

search("Fun animal toy", embeddings, docs)

The following output shows the ranked results. In the notebook, the matched images are rendered inline; only their file names appear here.

Query: Fun animal toy 

Search results:
#1: (54.28)
Playful elephant toy makes real elephant sounds and fun music to inspire imaginative play.
--------
#2: (52.48)
31TW1NCtMZL.jpg

--------
#3: (51.83)
31sa-c%2BfmpL.jpg

--------
#4: (50.33)
51a6iOTpnwL.jpg

--------
#5: (47.81)
31KqpOznU1L.jpg

--------
#6: (44.70)
61NbJr9jthL.jpg

--------
#7: (44.36)
416GZ2RZEPL.jpg

--------
#8: (43.55)
41RI4qgJLrL.jpg

--------
#9: (41.40)
41sKETcJYcL.jpg

--------
#10: (37.69)
Learning toy with 10 activities including a storybook, clock, gears; 13 double-sided alphabet blocks build fine motor skills and introduce letters, numbers, colors, and more.
--------
#11: (35.50)
This is the perfect introduction to the world of scooters.
--------
#12: (33.14)
2 -IN-1 RIDE-ON TOY- This convertible scooter is designed to grow with your child.
--------

Try another query, “Learning toy for a 6 year old”.
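
The call mirrors the previous example, reusing the search() helper and the document embeddings computed earlier:

search("Learning toy for a 6 year old", embeddings, docs)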

Query: Learning toy for a 6 year old 

Search results:
#1: (47.59)
Playful elephant toy makes real elephant sounds and fun music to inspire imaginative play.
--------
#2: (41.86)
61NbJr9jthL.jpg

--------
#3: (41.66)
2 -IN-1 RIDE-ON TOY- This convertible scooter is designed to grow with your child.
--------
#4: (41.62)
Toy with 10 activities including a storybook, clock, gears; 13 double-sided alphabet blocks build fine motor skills and introduce letters, numbers, colors, and more.
--------
#5: (41.25)
This is the perfect introduction to the world of scooters.
--------
#6: (40.94)
31sa-c%2BfmpL.jpg

--------
#7: (40.11)
416GZ2RZEPL.jpg

--------
#8: (40.10)
41sKETcJYcL.jpg

--------
#9: (38.64)
41RI4qgJLrL.jpg

--------
#10: (36.47)
31KqpOznU1L.jpg

--------
#11: (35.27)
31TW1NCtMZL.jpg

--------
#12: (34.76)
51a6iOTpnwL.jpg
--------

As the results show, both images and text documents are returned and ranked based on the user’s query, demonstrating the multimodal embedding capability of the new version of Cohere Embed 3.

Clean up

To avoid incurring unnecessary costs, delete the SageMaker endpoint when you’re done by using the following code snippet:

# Delete the endpoint ('sagemaker' is assumed to be the boto3 SageMaker client created earlier in the notebook)
sagemaker.delete_endpoint(EndpointName='Endpoint-Cohere-Embed-Model-v3-English-1')
sagemaker.close()

Alternatively, to use the SageMaker console, complete the following steps:

  1. On the SageMaker console, under Inference in the navigation pane, choose Endpoints.
  2. Search for the Cohere Embed 3 embedding endpoint.
  3. On the endpoint details page, choose Delete.
  4. Choose Delete again to confirm.

Conclusion

Cohere Embed 3 for multimodal embeddings is now available with SageMaker and SageMaker JumpStart. To get started, refer to SageMaker JumpStart pretrained models.

Interested in diving deeper? Check out the Cohere on AWS GitHub repo.


About the Authors

Breanne Warner is an Enterprise Solutions Architect at Amazon Web Services supporting healthcare and life science (HCLS) customers. She is passionate about supporting customers to use generative AI on AWS and evangelizing model adoption. Breanne is also on the Women@Amazon board as co-director of Allyship with the goal of fostering inclusive and diverse culture at Amazon. Breanne holds a Bachelor of Science in Computer Engineering from University of Illinois at Urbana Champaign.

Karan Singh is a Generative AI Specialist for third-party models at AWS, where he works with top-tier third-party foundation model (FM) providers to develop and execute joint Go-To-Market strategies, enabling customers to effectively train, deploy, and scale FMs to solve industry specific challenges. Karan holds a Bachelor of Science in Electrical and Instrumentation Engineering from Manipal University, a master’s in science in Electrical Engineering from Northwestern University and is currently an MBA Candidate at the Haas School of Business at University of California, Berkeley.

Yang Yang is an Independent Software Vendor (ISV) Solutions Architect at Amazon Web Services based in Seattle, where he supports customers in the financial services industry. Yang focuses on developing generative AI solutions to solve business and technical challenges and help drive faster time-to-market for ISV customers. Yang holds a Bachelor’s and Master’s degree in Computer Science from Texas A&M University.

Malhar Mane is an Enterprise Solutions Architect at AWS based in Seattle. He supports enterprise customers in the Digital Native Business (DNB) segment and specializes in generative AI and storage. Malhar is passionate about helping customers adopt generative AI to optimize their business. Malhar holds a Bachelor’s in Computer Science from University of California, Irvine.

Read More

How GoDaddy built Lighthouse, an interaction analytics solution to generate insights on support interactions using Amazon Bedrock

How GoDaddy built Lighthouse, an interaction analytics solution to generate insights on support interactions using Amazon Bedrock

This post is co-written with Mayur Patel, Nick Koenig, and Karthik Jetti from GoDaddy.

GoDaddy empowers everyday entrepreneurs by providing all the help and tools to succeed online. With 21 million customers worldwide, GoDaddy’s global solutions help seamlessly connect entrepreneurs’ identity and presence with commerce, leading to profitable growth. At GoDaddy, we take pride in being a data-driven company. Our relentless pursuit of valuable insights from data fuels our business decisions and works to achieve customer satisfaction.

In this post, we discuss how GoDaddy’s Care & Services team, in close collaboration with the  AWS GenAI Labs team, built Lighthouse—a generative AI solution powered by Amazon Bedrock. Amazon Bedrock is a fully managed service that makes foundation models (FMs) from leading AI startups and Amazon available through an API, so you can choose from a wide range of FMs to find the model that is best suited for your use case. With the Amazon Bedrock serverless experience, you can get started quickly, privately customize FMs with your own data, and integrate and deploy them into your applications using the AWS tools without having to manage infrastructure. With Amazon Bedrock, GoDaddy’s Lighthouse mines insights from customer care interactions using crafted prompts to identify top call drivers and reduce friction points in customers’ product and website experiences, leading to improved customer experience.

GoDaddy’s business challenge

Data has always been a competitive advantage for GoDaddy, as has the Care & Services team. We realize the potential to derive meaningful insights from this data and identify key call drivers and pain points. In the world before generative AI, however, the technology for mining insights from unstructured data was computationally expensive and challenging to operationalize.

Solution overview

This changed with GoDaddy Lighthouse, a generative AI-powered interactions analytics solution, which unlocks the rich mine of insights sitting within our customer care transcript data. Fed by customer care interactions data, it enables scale for deep and actionable analysis, allowing us to:

  • Detect and size customer friction points in our product and website experiences, leading to improvements in customer experience (CX) and retention
  • Improve customer care operations, including quality assurance and routing optimization, leading to improvements in CX and operational expenditure (OpEx)
  • Deprecate our reliance on costly vendor solutions for voice analytics

The following diagram illustrates the high-level business workflow of Lighthouse.

GoDaddy Lighthouse is an insights solution powered by large language models (LLMs) that allows prompt engineers throughout the company to craft, manage, and evaluate prompts using a portal where they can interact with an LLM of their choice. By engineering prompts that run against an LLM, we can systematically derive powerful and standardized insights across text-based data. Product subject matter experts use the Lighthouse platform UI to test and iterate on generative AI prompts that produce tailored insights about a Care & Services interaction.

The following diagram shows the iterative process of creating and strengthening prompts.

After the prompts are tested and confirmed to work as intended, they are deployed into production, where they are scaled across thousands of interactions. Then, the insights produced for each interaction are aggregated and visualized in dashboards and other analytical tools. Additionally, Lighthouse lets GoDaddy users craft one-time generative AI prompts to reveal rich insights for a highly specific customer scenario.

Let’s dive into how the Lighthouse architecture and features support users in generating insights. The following diagram illustrates the Lighthouse architecture on AWS.

The Lighthouse UI is powered by data generated from Amazon Bedrock LLM calls on thousands of transcripts, using a library of prompts from GoDaddy’s internal prompt catalog. The UI facilitates selection of the LLM based on the user’s choice, making the solution independent of any one model. These LLM calls are processed sequentially using Amazon EMR and Amazon EMR Serverless. The seamless integration of backend data into the UI is facilitated by Amazon API Gateway and AWS Lambda functions, while the UI/UX is supported by AWS Fargate and Elastic Load Balancing to maintain high availability. For data storage and retrieval, Lighthouse employs a combination of Amazon DynamoDB, Amazon Simple Storage Service (Amazon S3), and Amazon Athena. Visual data analysis and representation are achieved through dashboards built on Tableau and Amazon QuickSight.

Prompt evaluation

Lighthouse offers a unique proposition by allowing users to evaluate their one-time generative AI prompts using an LLM of their choice. This feature empowers users to write a new one-time prompt specifically for evaluation purposes. Lighthouse processes this new prompt using the actual transcript and response from a previous LLM call.

This capability is particularly valuable for users seeking to refine their prompts through multiple iterations. By iteratively adjusting and evaluating their prompts, users can progressively enhance and solidify the effectiveness of their queries. This iterative refinement process makes sure that users can achieve the highest-quality outputs tailored to their specific needs.

The flexibility and precision offered by this feature make Lighthouse an indispensable tool for anyone trying to optimize their interactions with LLMs, fostering continuous improvement and innovation in prompt engineering.

The following screenshot illustrates how Lighthouse lets users validate the accuracy of the model response with an evaluation prompt.

After a prompt is evaluated for quality and the user is satisfied with the results, the prompt can be promoted into the prompt catalog.

Response summarization

After the user submits their prompt, Lighthouse processes this prompt against each available transcript, generating a series of responses. The user can then view the generated responses for that query on a dedicated page. This page serves as a valuable resource, allowing users to review the responses in detail and even download them into an Excel sheet for further analysis.

However, the sheer volume of responses can sometimes make this task overwhelming. To address this, Lighthouse offers a feature that allows users to pass these responses through a new prompt for summarization. This functionality enables users to obtain concise, single-line summaries of the responses, significantly simplifying the review process and enhancing efficiency.

The following screenshot shows an example of the prompt with which Lighthouse lets users meta-analyze all responses into one, reducing the time needed to review each response individually.

With this summarization tool, users can quickly distill large sets of data into easily digestible insights, streamlining their workflow and making Lighthouse an indispensable tool for data analysis and decision-making.

Insights

Lighthouse generates valuable insights, providing a deeper understanding of key focus areas, opportunities for improvement, and strategic directions. With these insights, GoDaddy can make informed, strategic decisions that enhance operational efficiency and drive revenue growth.

The following screenshot is an example of the dashboard based on insights generated by Lighthouse, showing the distribution of categories in each insight.

Through Lighthouse, we analyzed the distribution of root causes and intents across the vast number of daily calls handled by GoDaddy agents. This analysis identified the most frequent causes of escalations and factors most likely to lead to customer dissatisfaction.

Business value and impact

To date (as of the time of writing), Lighthouse has generated 15 new insights. Most notably, the team used insights from Lighthouse to quantify the impact and cost of the friction within the current process, enabling them to prioritize necessary improvements across multiple departments. This strategic approach led to a streamlined password reset process, reducing support contacts related to the password reset process and shortening resolution times, ultimately providing significant cost savings.

Other insights leading to improvements to the GoDaddy business include:

  • The discovery of call routing flows that were suboptimal for profit per interaction
  • Understanding the root cause of repeat contact interactions

Conclusion

GoDaddy’s Lighthouse, powered by Amazon Bedrock, represents a transformative leap in using generative AI to unlock the value hidden within unstructured customer interaction data. By scaling deep analysis and generating actionable insights, Lighthouse empowers GoDaddy to enhance customer experiences, optimize operations, and drive business growth. As a testament to its success, Lighthouse has already delivered financial and operational improvements, solidifying GoDaddy’s position as a data-driven leader in the industry.


About the Authors

Mayur Patel is a Director, Software Development in the Data & Analytics (DnA) team at GoDaddy, specializing in data engineering and AI-driven solutions. With nearly 20 years of experience in engineering, architecture, and leadership, he has designed and implemented innovative solutions to improve business processes, reduce costs, and increase revenue. His work has enabled companies to achieve their highest potential through data. Passionate about leveraging data and AI, he aims to create solutions that delight customers, enhance operational efficiency, and optimize costs. Outside of his professional life, he enjoys reading, hiking, DIY projects, and exploring new technologies.

Nick Koenig is a Senior Director of Data Analytics and has worked across GoDaddy building data solutions for the last 10 years. His first job at GoDaddy included listening to calls and finding trends, so he’s particularly proud to be involved in building an AI solution for this a decade later.

Karthik Jetti is a Senior Data Engineer in the Data & Analytics organization at GoDaddy. With more than 12 years of experience in engineering and architecture in data technologies, AI, and cloud platforms, he has produced data to support advanced analytics and AI initiatives. His work drives strategy and innovation, focusing on revenue generation and improving efficiency.

Ranjit Rajan is a Principal GenAI Lab Solutions Architect with AWS. Ranjit works with AWS customers to help them design and build data and analytics applications in the cloud.

Satveer Khurpa is a Senior Solutions Architect on the GenAI Labs team at Amazon Web Services. In this role, he uses his expertise in cloud-based architectures to develop innovative generative AI solutions for clients across diverse industries. Satveer’s deep understanding of generative AI technologies allows him to design scalable, secure, and responsible applications that unlock new business opportunities and drive tangible value.

Richa Gupta is a Solutions Architect at Amazon Web Services specializing in generative AI and AI/ML designs. She helps customers implement scalable, cloud-based solutions to use advanced AI technologies and drive business growth. She has also presented generative AI use cases in AWS Summits. Prior to joining AWS, she worked in the capacity of a software engineer and solutions architect, building solutions for large telecom operators.

Read More

Fine-tune multimodal models for vision and text use cases on Amazon SageMaker JumpStart

Fine-tune multimodal models for vision and text use cases on Amazon SageMaker JumpStart

In the rapidly evolving landscape of AI, generative models have emerged as a transformative technology, empowering users to explore new frontiers of creativity and problem-solving. These advanced AI systems have transcended their traditional text-based capabilities, now seamlessly integrating multimodal functionalities that expand their reach into diverse applications. These models have become increasingly powerful, enabling a wide range of applications beyond just text generation. They can now create striking images, generate engaging summaries, answer complex questions, and even produce code—all while maintaining a high level of accuracy and coherence. The integration of these multimodal capabilities has unlocked new possibilities for businesses and individuals, revolutionizing fields such as content creation, visual analytics, and software development.

In this post, we showcase how to fine-tune a text and vision model, such as Meta Llama 3.2, to better perform at visual question answering tasks. The Meta Llama 3.2 Vision Instruct models demonstrated impressive performance on the challenging DocVQA benchmark for visual question answering. The non-fine-tuned 11B and 90B models achieved strong ANLS (Aggregated Normalized Levenshtein Similarity) scores of 88.4 and 90.1, respectively, on the DocVQA test set. ANLS is a metric used to evaluate the performance of models on visual question answering tasks, which measures the similarity between the model’s predicted answer and the ground truth answer. However, by using the power of Amazon SageMaker JumpStart, we demonstrate the process of adapting these generative AI models to excel at understanding and responding to natural language questions about images. By fine-tuning these models using SageMaker JumpStart, we were able to further enhance their abilities, boosting the ANLS scores to 91 and 92.4. This significant improvement showcases how the fine-tuning process can equip these powerful multimodal AI systems with specialized skills for excelling at understanding and answering natural language questions about complex, document-based visual information.

For a detailed walkthrough on fine-tuning the Meta Llama 3.2 Vision models, refer to the accompanying notebook.

Overview of Meta Llama 3.2 11B and 90B Vision models

The Meta Llama 3.2 collection of multimodal and multilingual large language models (LLMs) is a collection of pre-trained and instruction-tuned generative models in a variety of sizes. The 11B and 90B models are multimodal—they support text in/text out, and text+image in/text out.

Meta Llama 3.2 11B and 90B are the first Llama models to support vision tasks, with a new model architecture that integrates image encoder representations into the language model. The new models are designed to be more efficient for AI workloads, with reduced latency and improved performance, making them suitable for a wide range of applications. All Meta Llama 3.2 models support a 128,000-token context length, maintaining the expanded token capacity introduced in Meta Llama 3.1. Additionally, the models offer improved multilingual support for eight languages: English, German, French, Italian, Portuguese, Hindi, Spanish, and Thai.

DocVQA dataset

The DocVQA (Document Visual Question Answering) dataset is a widely used benchmark for evaluating the performance of multimodal AI models on visual question answering tasks involving document-style images. This dataset consists of a diverse collection of document images paired with a series of natural language questions that require both visual and textual understanding to answer correctly. By fine-tuning a generative AI model like Meta Llama 3.2 on the DocVQA dataset using Amazon SageMaker, you can equip the model with the specialized skills needed to excel at answering questions about the content and structure of complex, document-based visual information.

For more information on the dataset used in this post, see DocVQA – Datasets.

Dataset preparation for visual question and answering tasks

The Meta Llama 3.2 Vision models can be fine-tuned on image-text datasets for vision and language tasks such as visual question answering (VQA). The training data should be structured with the image, the question about the image, and the expected answer. This data format allows the fine-tuning process to adapt the model’s multimodal understanding and reasoning abilities to excel at answering natural language questions about visual content.

The input includes the following:

  • A train and an optional validation directory. Train and validation directories should contain one directory named images hosting all the image data and one JSON Lines (.jsonl) file named metadata.jsonl.
  • In the metadata.jsonl file, each example is a dictionary that contains three keys named file_name, prompt, and completion. file_name defines the path to the image data, prompt defines the text input prompt, and completion defines the text completion corresponding to the input prompt. The following code is an example of the contents in the metadata.jsonl file (a short sketch for assembling and uploading this file follows the example):
{"file_name": "images/img_0.jpg", "prompt": "what is the date mentioned in this letter?", "completion": "1/8/93"}
{"file_name": "images/img_1.jpg", "prompt": "what is the contact person name mentioned in letter?", "completion": "P. Carter"}
{"file_name": "images/img_2.jpg", "prompt": "Which part of Virginia is this letter sent from", "completion": "Richmond"}

SageMaker JumpStart

SageMaker JumpStart is a powerful feature within the SageMaker machine learning (ML) environment that provides ML practitioners a comprehensive hub of publicly available and proprietary foundation models (FMs). With this managed service, ML practitioners get access to a growing list of cutting-edge models from leading model hubs and providers that you can deploy to dedicated SageMaker instances within a network isolated environment, and customize models using SageMaker for model training and deployment.

Solution overview

In the following sections, we discuss the steps to fine-tune Meta Llama 3.2 Vision models. We cover two approaches: using the Amazon SageMaker Studio UI for a no-code solution, and using the SageMaker Python SDK.

Prerequisites

To try out this solution using SageMaker JumpStart, you need the following prerequisites:

  • An AWS account that will contain all of your AWS resources.
  • An AWS Identity and Access Management (IAM) role to access SageMaker. To learn more about how IAM works with SageMaker, refer to Identity and Access Management for Amazon SageMaker.
  • Access to SageMaker Studio or a SageMaker notebook instance, or an interactive development environment (IDE) such as PyCharm or Visual Studio Code. We recommend using SageMaker Studio for straightforward deployment and inference.

No-code fine-tuning through the SageMaker Studio UI

SageMaker JumpStart provides access to publicly available and proprietary FMs from third-party providers. Data scientists and developers can quickly prototype and experiment with various ML use cases, accelerating the development and deployment of ML applications. It helps reduce the time and effort required to build ML models from scratch, allowing teams to focus on fine-tuning and customizing the models for their specific use cases. These models are released under different licenses designated by their respective sources. It’s essential to review and adhere to the applicable license terms before downloading or using these models to make sure they’re suitable for your intended use case.

You can access the Meta Llama 3.2 FMs through SageMaker JumpStart in the SageMaker Studio UI and the SageMaker Python SDK. In this section, we cover how to discover these models in SageMaker Studio.

SageMaker Studio is an IDE that offers a web-based visual interface for performing the ML development steps, from data preparation to model building, training, and deployment. For instructions on getting started and setting up SageMaker Studio, refer to Amazon SageMaker Studio.

When you’re in SageMaker Studio, you can access SageMaker JumpStart by choosing JumpStart in the navigation pane.

In the JumpStart view, you’re presented with the list of public models offered by SageMaker. You can explore other models from other providers in this view. To start using the Meta Llama 3.2 models, under Providers, choose Meta.

You’re presented with a list of the models available. Choose one of the Vision Instruct models, for example the Meta Llama 3.2 90B Vision Instruct model.

Here you can view the model details, as well as train, deploy, optimize, and evaluate the model. For this demonstration, we choose Train.

On this page, you can point to the Amazon Simple Storage Service (Amazon S3) bucket containing the training and validation datasets for fine-tuning. In addition, you can configure deployment configuration, hyperparameters, and security settings for fine-tuning. Choose Submit to start the training job on a SageMaker ML instance.

Deploy the model

After the model is fine-tuned, you can deploy it using the model page on SageMaker JumpStart. The option to deploy the fine-tuned model will appear when fine-tuning is finished, as shown in the following screenshot.

You can also deploy the model from this view. You can configure endpoint settings such as the instance type, number of instances, and endpoint name. You will need to accept the End User License Agreement (EULA) before you can deploy the model.

Fine-tune using the SageMaker Python SDK

You can also fine-tune Meta Llama 3.2 Vision Instruct models using the SageMaker Python SDK. A sample notebook with the full instructions can be found on GitHub. The following code example demonstrates how to fine-tune the Meta Llama 3.2 11B Vision Instruct model:

import os
import boto3
from sagemaker.jumpstart.estimator import JumpStartEstimator
model_id, model_version = "meta-vlm-llama-3-2-11b-vision-instruct", "*"

from sagemaker import hyperparameters
my_hyperparameters = hyperparameters.retrieve_default(
    model_id=model_id, model_version=model_version
)
my_hyperparameters["epoch"] = "1"
estimator = JumpStartEstimator(
    model_id=model_id,
    model_version=model_version,
    environment={"accept_eula": "true"},  # Please change {"accept_eula": "true"}
    disable_output_compression=True,
    instance_type="ml.p5.48xlarge",
    hyperparameters=my_hyperparameters,
)
train_data_location = "s3://<your-bucket>/docvqa/train/"  # placeholder: S3 prefix containing images/ and metadata.jsonl
estimator.fit({"training": train_data_location})

The code sets up a SageMaker JumpStart estimator for fine-tuning the Meta Llama 3.2 Vision Instruct model on a custom training dataset. It configures the estimator with the desired model ID, accepts the EULA, sets the number of training epochs as a hyperparameter, and initiates the fine-tuning process.

When the fine-tuning process is complete, you can review the evaluation metrics for the model. These metrics will provide insights into the performance of the fine-tuned model on the validation dataset, allowing you to assess how well the model has adapted. We discuss these metrics more in the following sections.

You can then deploy the fine-tuned model directly from the estimator, as shown in the following code:

# attached_estimator refers to an estimator re-attached to the completed training job
# (see the accompanying notebook); you can also call deploy() directly on the estimator above
estimator = attached_estimator
finetuned_predictor = estimator.deploy()

As part of the deploy settings, you can define the instance type you want to deploy the model on. For the full list of deployment parameters, refer to the deploy parameters in the SageMaker SDK documentation.
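
For example, a deployment call that pins the instance type and count might look like the following sketch; the instance type is a placeholder and should be sized for the model you fine-tuned:

finetuned_predictor = attached_estimator.deploy(
    instance_type="ml.g5.12xlarge",  # placeholder; choose an instance with enough GPU memory for your model
    initial_instance_count=1,
)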

After the endpoint is up and running, you can perform an inference request against it using the predictor object as follows:

# 'each' is a single DocVQA validation record; get_image_decode_64base and formulate_payload
# are helper functions defined in the accompanying notebook
q, a, image = each["prompt"], each['completion'], get_image_decode_64base(image_path=f"./docvqa/validation/{each['file_name']}")
payload = formulate_payload(q=q, image=image, instruct=is_chat_template)

ft_response = finetuned_predictor.predict(
    JumpStartSerializablePayload(payload)
)

For the full list of predictor parameters, refer to the predictor object in the SageMaker SDK documentation.

Fine-tuning quantitative metrics

SageMaker JumpStart automatically outputs various training and validation metrics, such as loss, during the fine-tuning process to help evaluate the model’s performance.

The DocVQA dataset is a widely used benchmark for evaluating the performance of multimodal AI models on visual question answering tasks involving document-style images. As shown in the following table, the non-fine-tuned Meta Llama 3.2 11B and 90B models achieved ANLS scores of 88.4 and 90.1 respectively on the DocVQA test set, as reported in the post Llama 3.2: Revolutionizing edge AI and vision with open, customizable models on the Meta AI website. After fine-tuning the 11B and 90B Vision Instruct models using SageMaker JumpStart, the fine-tuned models achieved improved ANLS scores of 91 and 92.4, demonstrating that the fine-tuning process significantly enhanced the models’ ability to understand and answer natural language questions about complex document-based visual information.

DocVQA test set (5,138 examples, metric: ANLS)    11B-Instruct    90B-Instruct
Non-fine-tuned                                    88.4            90.1
SageMaker JumpStart fine-tuned                    91              92.4
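
For reference, the ANLS scores in the preceding table follow the standard DocVQA formulation: each prediction is scored by its normalized Levenshtein similarity to the closest ground truth answer, scores below 0.5 are set to 0, and the results are averaged. The following is a minimal sketch of that computation, not the exact evaluation script used here:

def normalized_levenshtein_similarity(prediction, ground_truth):
    # 1 - edit_distance / max_length, computed on lowercased, stripped strings
    a, b = prediction.strip().lower(), ground_truth.strip().lower()
    if not a and not b:
        return 1.0
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (ca != cb))
    return 1.0 - dp[len(b)] / max(len(a), len(b))

def anls(predictions, ground_truths, threshold=0.5):
    # Average of per-question scores; the best match against any ground truth answer is used,
    # and scores below the threshold count as 0
    scores = []
    for pred, answers in zip(predictions, ground_truths):
        best = max(normalized_levenshtein_similarity(pred, ans) for ans in answers)
        scores.append(best if best >= threshold else 0.0)
    return sum(scores) / len(scores)

print(anls(["itc limited"], [["ITC Limited"]]))  # 1.0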

For the fine-tuning results shown in the table, the models were trained using the DeepSpeed framework on a single ml.p5.48xlarge instance with multi-GPU distributed training. The fine-tuning process used Low-Rank Adaptation (LoRA) on all linear layers, with a LoRA alpha of 8, LoRA dropout of 0.05, and a LoRA rank of 16. The 90B Instruct model was trained for 6 epochs, while the 11B Instruct model was trained for 4 epochs. Both models used a learning rate of 5e-5 with a linear learning rate schedule. Importantly, the Instruct models were fine-tuned using the built-in chat template format, where the loss was computed on the last turn of the conversation (the assistant’s response).

For the base model fine-tuning, you have the choice of using chat completion format or text completion format, controlled by the hyperparameter chat_template. For text completion, it is simply a concatenation of the image token, prompt, and completion, where the prompt and completion parts are connected by a response key ### Response:\n\n and loss values are computed on the completion part only.

Fine-tuning qualitative results

In addition to the quantitative evaluation metrics, you can observe qualitative differences in the model’s outputs after the fine-tuning process.

For the non-Instruct models, the fine-tuning was performed using a specific prompt template that doesn’t use the chat format. The prompt template was structured as follows:

prompt = f"![]({image})<|image|><|begin_of_text|>Read the text in the image carefully and answer the question with the text as seen exactly in the image. For yes/no questions, just respond Yes or No. If the answer is numeric, just respond with the number and nothing else. If the answer has multiple words, just respond with the words and absolutely nothing else. Never respond in a sentence or a phrase.n Question: {q}### Response:nn"

This prompt template required the model to generate a direct, concise response based on the visual information in the image, without producing additional context or commentary. The results of fine-tuning an 11B Vision non-Instruct base model using this prompt template are shown in the following qualitative examples, demonstrating how the fine-tuning process improved the model’s ability to accurately extract and reproduce the relevant information from the document images.

Each example in the original table also includes the corresponding document image, which is not reproduced here.

Example 1
Input prompt: What is the name of the company?
Pre-trained response: ### Response: ### Response: ### Response: ### Response: ### Response: ### Response: ### Response: ###
Fine-tuned response: ITC Limited
Ground truth: itc limited

Example 2
Input prompt: Where is the company located?
Pre-trained response: 1) Opening Stock : a) Cigarette Filter Rods Current Year Previous year b) Poly Propelene
Fine-tuned response: CHENNAI
Ground truth: chennai

Example 3
Input prompt: What the location address of NSDA?
Pre-trained response: Source: https://www.industrydocuments.ucsf.edu/docs/qqvf0227. <OCR/> The best thing between
Fine-tuned response: 1128 SIXTEENTH ST., N. W., WASHINGTON, D. C. 20036
Ground truth: 1128 SIXTEENTH ST., N. W., WASHINGTON, D. C. 20036

Example 4
Input prompt: What is the ‘no. of persons present’ for the sustainability committee meeting held on 5th April, 2012?
Pre-trained response: 1 2 3 4 5 6 7 8 9 10 11 12 13
Fine-tuned response: 6
Ground truth: 6

Clean up

After you’re done running the notebook, make sure to delete all the resources that you created in the process so that you stop incurring charges:

# Delete resources
finetuned_predictor.delete_model()
finetuned_predictor.delete_endpoint()

Conclusion

In this post, we discussed fine-tuning Meta Llama 3.2 Vision Instruct models using SageMaker JumpStart. We showed that you can use the SageMaker JumpStart console in SageMaker Studio or the SageMaker Python SDK to fine-tune and deploy these models. We also discussed the fine-tuning technique, instance types, and supported hyperparameters. Finally, we showcased both the quantitative metrics and qualitative results of fine-tuning the Meta Llama 3.2 Vision model on the DocVQA dataset, highlighting the model’s improved performance on visual question answering tasks involving complex document-style images.

As a next step, you can try fine-tuning these models on your own dataset using the code provided in the notebook to test and benchmark the results for your use cases.


About the Authors

Marc Karp is an ML Architect with the Amazon SageMaker Service team. He focuses on helping customers design, deploy, and manage ML workloads at scale. In his spare time, he enjoys traveling and exploring new places.

Dr. Xin Huang is a Senior Applied Scientist for Amazon SageMaker JumpStart and Amazon SageMaker built-in algorithms. He focuses on developing scalable machine learning algorithms. His research interests are in the area of natural language processing, explainable deep learning on tabular data, and robust analysis of non-parametric space-time clustering. He has published many papers in ACL, ICDM, KDD conferences, and Royal Statistical Society: Series A.


Appendix

Language models such as Meta Llama can be tens or even hundreds of gigabytes in size. Fine-tuning such large models requires instances with significantly more GPU memory. Furthermore, training these models can be very slow due to their size. Therefore, for efficient fine-tuning, we use the following optimizations:

  • Low-Rank Adaptation (LoRA) – To efficiently fine-tune the LLM, we employ LoRA, a type of parameter-efficient fine-tuning (PEFT) technique. Instead of training all the model parameters, LoRA introduces a small set of adaptable parameters that are added to the pre-trained model. This significantly reduces the memory footprint and training time compared to fine-tuning the entire model.
  • Mixed precision training (bf16) – To further optimize memory usage, we use mixed precision training using bfloat16 (bf16) data type. bf16 provides similar performance to full-precision float32 while using only half the memory, enabling us to train larger batch sizes and fit the model on available hardware.

The default hyperparameters are as follows (a configuration sketch follows the list):

  • Peft Type: lora – LoRA fine-tuning, which can efficiently adapt a pre-trained language model to a specific task
  • Chat Template: True – Enables the use of a chat-based template for the fine-tuning process
  • Gradient Checkpointing: True – Reduces the memory footprint during training by recomputing the activations during the backward pass, rather than storing them during the forward pass
  • Per Device Train Batch Size: 2 – The batch size for training on each device
  • Per Device Evaluation Batch Size: 2 – The batch size for evaluation on each device
  • Gradient Accumulation Steps: 2 – The number of steps to accumulate gradients for before performing an update
  • Bf16 16-Bit (Mixed) Precision Training: True – Enables the use of bfloat16 (bf16) data type for mixed precision training, which can speed up training and reduce memory usage
  • Fp16 16-Bit (Mixed) Precision Training: False – Disables the use of float16 (fp16) data type for mixed precision training
  • Deepspeed: True – Enables the use of the Deepspeed library for efficient distributed training
  • Epochs: 10 – The number of training epochs
  • Learning Rate: 6e-06 – The learning rate to be used during training
  • Lora R: 64 – The rank parameter for the LoRA fine-tuning
  • Lora Alpha: 16 – The alpha parameter for the LoRA fine-tuning
  • Lora Dropout: 0 – The dropout rate for the LoRA fine-tuning
  • Warmup Ratio: 0.1 – The ratio of the total number of steps to use for a linear warmup from 0 to the learning rate
  • Evaluation Strategy: steps – The strategy for evaluating the model during training
  • Evaluation Steps: 20 – The number of steps to use for evaluating the model during training
  • Logging Steps: 20 – The number of steps between logging training metrics
  • Weight Decay: 0.2 – The weight decay to be used during training
  • Load Best Model At End: False – Disables loading the best performing model at the end of training
  • Seed: 42 – The random seed to use for reproducibility
  • Max Input Length: -1 – The maximum length of the input sequence
  • Validation Split Ratio: 0.2 – The ratio of the training dataset to use for validation
  • Train Data Split Seed: 0 – The random seed to use for splitting the training data
  • Preprocessing Num Workers: None – The number of worker processes to use for data preprocessing
  • Max Steps: -1 – The maximum number of training steps to perform
  • Adam Beta1: 0.9 – The beta1 parameter for the Adam optimizer
  • Adam Beta2: 0.999 – The beta2 parameter for the Adam optimizer
  • Adam Epsilon: 1e-08 – The epsilon parameter for the Adam optimizer
  • Max Grad Norm: 1.0 – The maximum gradient norm to be used for gradient clipping
  • Label Smoothing Factor: 0 – The label smoothing factor to be used during training
  • Logging First Step: False – Disables logging the first step of training
  • Logging Nan Inf Filter: True – Enables filtering out NaN and Inf values from the training logs
  • Saving Strategy: no – Disables automatic saving of the model during training
  • Save Steps: 500 – The number of steps between saving the model during training
  • Save Total Limit: 1 – The maximum number of saved models to keep
  • Dataloader Drop Last: False – Disables dropping the last incomplete batch during data loading
  • Dataloader Num Workers: 32 – The number of worker processes to use for data loading
  • Eval Accumulation Steps: None – The number of prediction steps to accumulate output tensors for before moving the results to the CPU during evaluation
  • Auto Find Batch Size: False – Disables automatically finding the optimal batch size
  • Lr Scheduler Type: constant_with_warmup – The type of learning rate scheduler to use (for example, constant with warmup)
  • Warm Up Steps: 0 – The number of steps to use for linear warmup of the learning rate
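
To make these defaults concrete, the following sketch shows roughly how the LoRA and training settings above map onto the Hugging Face peft and transformers libraries. It is an illustration of the configuration only, not the actual JumpStart training script, and the LoRA target modules are an assumption:

from peft import LoraConfig
from transformers import TrainingArguments

# LoRA settings from the defaults listed above
lora_config = LoraConfig(
    r=64,
    lora_alpha=16,
    lora_dropout=0.0,
    target_modules="all-linear",  # assumption: apply LoRA to all linear layers
    task_type="CAUSAL_LM",
)

# Training settings from the defaults listed above
training_args = TrainingArguments(
    output_dir="llama-3-2-vision-finetune",  # placeholder output path
    num_train_epochs=10,
    learning_rate=6e-6,
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
    gradient_accumulation_steps=2,
    gradient_checkpointing=True,
    bf16=True,
    warmup_ratio=0.1,
    weight_decay=0.2,
    lr_scheduler_type="constant_with_warmup",
    evaluation_strategy="steps",
    eval_steps=20,
    logging_steps=20,
    max_grad_norm=1.0,
    save_strategy="no",
    seed=42,
)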

Read More

Principal Financial Group uses QnABot on AWS and Amazon Q Business to enhance workforce productivity with generative AI

Principal Financial Group uses QnABot on AWS and Amazon Q Business to enhance workforce productivity with generative AI

Principal is a global financial company with nearly 20,000 employees passionate about improving the wealth and well-being of people and businesses. In business for 145 years, Principal is helping approximately 64 million customers (as of Q2, 2024) plan, protect, invest, and retire, while working to support the communities where it does business and build a diverse, inclusive workforce.

As Principal grew, its internal support knowledge base considerably expanded. This wealth of content provides an opportunity to streamline access to information in a compliant and responsible way. Principal wanted to use existing internal FAQs, documentation, and unstructured data and build an intelligent chatbot that could provide quick access to the right information for different roles. With the QnABot on AWS (QnABot), integrated with Microsoft Azure Entra ID access controls, Principal launched an intelligent self-service solution rooted in generative AI. Now, employees at Principal can receive role-based answers in real time through a conversational chatbot interface. The chatbot improved access to enterprise data and increased productivity across the organization.

In this post, we explore how Principal used QnABot paired with Amazon Q Business and Amazon Bedrock to create Principal AI Generative Experience: a user-friendly, secure internal chatbot for faster access to information.

QnABot is a multilanguage, multichannel conversational interface (chatbot) that responds to customers’ questions, answers, and feedback. It allows companies to deploy a fully functional chatbot integrated with generative AI offerings from Amazon, including Amazon Bedrock, Amazon Q Business, and intelligent search services, with natural language understanding (NLU) such as Amazon OpenSearch Service and Amazon Bedrock Knowledge Bases. With QnABot, companies have the flexibility to tier questions and answers based on need, from static FAQs to generating answers on the fly based on documents, webpages, indexed data, operational manuals, and more.

Amazon Q Business is a generative AI-powered assistant that can answer questions, provide summaries, generate content, and securely complete tasks based on data and information in your enterprise systems. It empowers employees to be more creative, data-driven, efficient, prepared, and productive.

Amazon Bedrock is a fully managed service that offers a choice of high-performing foundation models (FMs) from leading AI companies like AI21 Labs, Anthropic, Cohere, Meta, Mistral AI, Stability AI, and Amazon through a single API, along with a broad set of capabilities to build generative AI applications with security, privacy, and responsible AI.

Challenges, opportunities, and constraints

Principal team members need insights from vast amounts of unstructured data to serve their customers. This data includes manuals, communications, documents, and other content across various systems like SharePoint, OneNote, and the company’s intranet. The information exists in various formats such as Word documents, ASPX pages, PDFs, Excel spreadsheets, and PowerPoint presentations that were previously difficult to systematically search and analyze. Principal sought to develop natural language processing (NLP) and question-answering capabilities to accurately query and summarize this unstructured data at scale. This solution would allow for greater understanding of a wide range of employee questions by searching internal documentation for responses and suggesting answers—all through a user-friendly interface. The solution had to adhere to compliance, privacy, and ethics regulations and brand standards and use existing compliance-approved responses without additional summarization. It was important for Principal to maintain fine-grained access controls and make sure all data and sources remained secure within its environment.

Principal needed a solution that could be rapidly deployed without extensive custom coding. It also wanted a flexible platform that it could own and customize for the long term. As a leader in financial services, Principal wanted to make sure all data and responses adhered to strict risk management and responsible AI guidelines. This included preventing any data from leaving its source or being accessible to third parties.

The chatbot solution deployed by Principal had to address two use cases. The first use case, treated as a proof of concept, was to respond to customers’ request for proposal (RFP) inquiries. This first use case was chosen because the RFP process relies on reviewing multiple types of information to generate an accurate response based on the most up-to-date information, which can be time-consuming.

The second use case applied to Principal employees in charge of responding to customer inquiries using a vast well of SharePoint data. The extensive amount of data employees must search to find appropriate answers for customers made it difficult and time-consuming to navigate. It is estimated these employees collectively spent hundreds of hours each year searching for information. As the volume and complexity of customer requests grows, without a solution to enhance search capabilities, costs were projected to significantly rise.

Principal saw an opportunity for an internal generic AI assistant to allow employees to use AI in their daily work without risking exposure of sensitive information through any unapproved or unregulated external AI vendors.

The solution: Principal AI Generative Experience with QnABot

Principal began its development of an AI assistant by using the core question-answering capabilities in QnABot. Within QnABot, company subject matter experts authored hard-coded questions and answers using the QnABot editor. Principal also used the AWS open source repository Lex Web UI to build a frontend chat interface with Principal branding.

Initially, Principal relied on the built-in capabilities of QnABot, using Anthropic’s Claude on Amazon Bedrock for information summarization and retrieval. Upon the release of Amazon Q Business in preview, Principal integrated QnABot with Amazon Q Business to take advantage of its advanced response aggregation algorithms and more complete AI assistant features. Integration enhanced the solution by providing a more human-like interaction for end-users.

Principal implemented several measures to improve the security, governance, and performance of its conversational AI platform. By integrating QnABot with Azure Active Directory, Principal facilitated single sign-on capabilities and role-based access controls. This allowed fine-tuned management of user access to content and systems. Generative AI models (for example, Amazon Titan) hosted on Amazon Bedrock were used for query disambiguation and semantic matching for answer lookups and responses.

Usability and continual improvement were top priorities, and Principal enhanced the standard user feedback from QnABot to gain input from end-users on answer accuracy, outdated content, and relevance. This input made it straightforward for administrators and developers to identify and improve answer relevancy. Custom monitoring dashboards in Amazon OpenSearch Service provided real-time visibility into platform performance. Additional integrations with services like Amazon Data Firehose, AWS Glue, and Amazon Athena allowed for historical reporting, user activity analytics, and sentiment trends over time through Amazon QuickSight.

Adherence to responsible and ethical AI practices were a priority for Principal. The Principal AI Enablement team, which was building the generative AI experience, consulted with governance and security teams to make sure security and data privacy standards were met. Model monitoring of key NLP metrics was incorporated and controls were implemented to prevent unsafe, unethical, or off-topic responses. The flexible, scalable nature of AWS services makes it straightforward to continually refine the platform through improvements to the machine learning models and addition of new features.

The initial proof of concept was deployed in a preproduction environment within 3 months. The first data source connected was an Amazon Simple Storage Service (Amazon S3) bucket, where a 100-page RFP manual was uploaded for natural language querying by users. The data source allowed accurate results to be returned based on indexed content.

The first large-scale use case directly interfaced with SharePoint data, indexing over 8,000 pages. The Principal team partnered with Amazon Q Business data connector developers to implement improvements to the SharePoint connector. Improvements included the ability to index pages in SharePoint lists and add data security features. The use case was piloted with 10 users during 1 month, while working to onboard an additional 300 users over the next 3 months.

During the initial pilot, the Principal AI Enablement team worked with business users to gather feedback. The first round of testers needed more training on fine-tuning the prompts to improve returned results. The enablement team took this feedback and partnered with training and development teams to design learning plans to help new users more quickly gain proficiency with the AI assistant. The goal was to onboard future users faster through improved guidance on how to properly frame questions for the assistant and additional coaching resources for those who needed more guidance to learn the system.

Some users lacked access to corporate data, but they used the platform as a generative AI chatbot to securely attach internal-use documentation (also called initial generic entitlement) and query it in real time or to ask questions of the model’s foundational knowledge without risk of data leaving the tenant. Queries from users were also analyzed to identify beneficial future features to implement.

The following diagram illustrates the Principal generative AI chatbot architecture with AWS services.

Principal started by deploying QnABot, which draws on numerous services including Amazon Bedrock, Amazon Q Business, QuickSight, and others. All AWS services are high-performing, secure, scalable, and purpose-built. AWS services are designed to meet your specific industry, cross-industry, and technology use cases and are developed, maintained, and supported by AWS. AWS solutions (for example, QnABot) bring together AWS services into preconfigured deployable products, with architecture diagrams and implementation guides. Developed, maintained, and supported by AWS, AWS solutions simplify the deployment of optimized infrastructure tailored to meet customer use cases.

Principal strategically worked with the Amazon Q Business and QnABot teams to test and improve the Amazon Q Business conversational AI platform. The QnABot team worked closely with the Principal AI Enablement team on the implementation of QnABot, helping to define and build out capabilities to meet the incoming use cases. As an early adopter of Amazon Q Business, engineers from Principal worked directly with the Amazon Q Business team to validate updates and new features. When Amazon Q Business became generally available, Principal collaborated with the team to implement the integration of AWS IAM Identity Center, helping to define the process for IAM Identity Center implementation and software development kit (SDK) integration. The results of the IAM Identity Center integration were contributed back to the QnABot Amazon Q Business plugin repository so other customers could benefit from this work.

Results

The initial proof of concept was highly successful in driving efficiencies for users. It achieved an estimated 50% reduction in time required for users to respond to client inquiries and requests for proposals. This reduction in time stemmed from the platform’s ability to search for and summarize the data needed to quickly and accurately respond to inquiries. This early success demonstrated the solution’s effectiveness, generating excitement within the organization to broaden use cases.

The initial generic entitlement option allowed users to attach files to their chat sessions and dynamically query content. This option proved popular because of large productivity gains across various roles, including project management, enterprise architecture, communications, and education. Users interacting with the application in their daily work have received it well. Some users have reported up to a 50% reduction in the time spent conducting rote work. Removing the time spent on routine work allows employees to focus on human judgement-based and strategic decisions.

The platform has delivered strong results across several key metrics. Over 95% of queries received answers users accepted or built upon, with only 4% of answers receiving negative feedback. For queries earning negative feedback, less than 1% involved answers or documentation deemed irrelevant to the original question. Over 99% of documents provided through the system were evaluated as relevant and containing up-to-date information. 56% of total queries were addressed through either sourcing documentation related to the question or having the user attach a relevant file through the chat interface. The remaining queries were answered based on foundational knowledge built into the platform or from the current session’s chat history. These results indicate that users benefit from both Retrieval Augmented Generation (RAG) functionality and Amazon Q Business foundational knowledge, which provide helpful responses based on past experiences.

The positive feedback validates the application’s ability to deliver timely, accurate information, optimizing processes and empowering employees with data-driven insights. The metrics indicate a high level of success in delivering the right information, reducing time spent on client inquiries and tasks, and producing significant savings in hours and dollars. The platform resolves issues quickly by surfacing relevant information faster than manual searches, improving processes, productivity, and the customer experience. As usage expands, the benefits are expected to multiply for both users and stakeholders. This initial proof of concept provides a strong foundation for continued optimization and value, with potential expansion to maximize benefits for Principal employees.

Roadmap

The Principal AI Enablement team has an ambitious roadmap for 2024 focused on expanding the capabilities of its conversational AI platform, with a commitment to scale and accelerate the development of generative AI technology to meet the growing needs of the enterprise.

Numerous use cases are currently in development by the AI Enablement team. Many future use cases are expected to use the Principal AI Generative Experience application due to its success in automating processes and empowering users with self-service insights. As adoption increases, ongoing feature additions will further strengthen the platform’s value.

The roadmap reflects Principal’s commitment to continual innovation, which will drive further optimization of operations and workflows and could create new opportunities to enhance customer and employee experiences through advanced AI applications.

Principal is well positioned to build upon early successes by fulfilling its vision. The roadmap provides a strategic framework to maximize the platform’s business impact and differentiate solutions in the years ahead.

Conclusion

Principal used QnABot on AWS paired with Amazon Q Business and Amazon Bedrock to deliver a generative AI experience for its users, reducing manual time spent on client inquiries and tasks and producing significant savings in hours and dollars. With generative AI, Principal’s employees can now focus on deeper, judgment-based decisions instead of manually scouring data sources for answers. Get started with QnABot on AWS, Amazon Q Business, and Amazon Bedrock.


About the Authors

Ajay Swamy is the Global Product Leader for Data, AIML and Generative AI AWS Solutions. He specializes in building AWS Solutions (production-ready software packages) that deliver compelling value to customers by solving for their unique business needs. Other than QnABot on AWS, he manages Generative AI Application Builder, Enhanced Document Understanding, Discovering Hot Topics using Machine Learning, and other AWS Solutions. He lives with his wife (Tina) and dog (Figaro), in New York, NY.

Dr. Nicki Susman is a Senior Machine Learning Engineer and the Technical Lead of the Principal AI Enablement team. She has extensive experience in data and analytics, application development, infrastructure engineering, and DevSecOps.

Joel Elscott is a Senior Data Engineer on the Principal AI Enablement team. He has over 20 years of software development experience in the financial services industry, specializing in ML/AI application development and cloud data architecture. Joel lives in Des Moines, Iowa, with his wife and five children, and is also a group fitness instructor.

Bob Strahan is a Principal Solutions Architect in the AWS Generative AI Innovation Center team.

Austin Johnson is a Solutions Architect, maintaining the Lex Web UI open source library.


The subject matter in this communication is educational only and provided with the understanding that Principal is not endorsing, or necessarily recommending use of artificial intelligence. You should consult with appropriate counsel, compliance, and information security for your business needs.

Insurance products and plan administrative services provided through Principal Life Insurance Company, a member of the Principal Financial Group, Des Moines, IA 50392.
© 2024, Principal Financial Services, Inc.
3778998-082024
