Unlocking complex problem-solving with multi-agent collaboration on Amazon Bedrock

Large language model (LLM)-based AI agents specialized for specific tasks have demonstrated strong problem-solving capabilities. By combining the reasoning power of multiple specialized agents, multi-agent collaboration has emerged as a powerful approach for tackling intricate, multistep workflows.

The concept of multi-agent systems isn’t entirely new—it has its roots in distributed artificial intelligence research dating back to the 1980s. However, with recent advancements in LLMs, the capabilities of specialized agents have significantly expanded in areas such as reasoning, decision-making, understanding, and generation through language and other modalities. For instance, a single attraction research agent can perform web searches and list potential destinations based on user preferences. By creating a network of specialized agents, we can combine the strengths of multiple specialist agents to solve increasingly complex problems, such as creating and optimizing an entire travel plan by considering weather forecasts in nearby cities, traffic conditions, flight and hotel availability, restaurant reviews, attraction ratings, and more.

The research team at AWS has worked extensively on building and evaluating a multi-agent collaboration (MAC) framework so customers can orchestrate multiple AI agents on Amazon Bedrock Agents. In this post, we explore the concept of MAC and its benefits, as well as the key components of our MAC framework. We also go deeper into our evaluation methodology and present insights from our studies. More technical details can be found in our technical report.

Benefits of multi-agent systems

Multi-agent collaboration offers several key advantages over single-agent approaches, primarily stemming from distributed problem-solving and specialization.

Distributed problem-solving refers to the ability to break down complex tasks into smaller subtasks that can be handled by specialized agents. By breaking down tasks, each agent can focus on a specific aspect of the problem, leading to more efficient and effective problem-solving. For example, a travel planning problem can be decomposed into subtasks such as checking weather forecasts, finding available hotels, and selecting the best routes.

The distributed aspect also contributes to the extensibility and robustness of the system. As the scope of a problem increases, we can simply add more agents to extend the capability of the system rather than try to optimize a monolithic agent packed with instructions and tools. On robustness, the system can be more resilient to failures because multiple agents can compensate for and even potentially correct errors produced by a single agent.

Specialization allows each agent to focus on a specific area within the problem domain. For example, in a network of agents working on software development, a coordinator agent can manage overall planning, a programming agent can generate correct code and test cases, and a code review agent can provide constructive feedback on the generated code. Each agent can be designed and customized to excel at a specific task.

For developers building agents, this means the workload of designing and implementing an agentic system can be organically distributed, leading to faster development cycles and better quality. Within enterprises, often development teams have distributed expertise that is ideal for developing specialist agents. Such specialist agents can be further reused by other teams across the entire organization.

In contrast, developing a single agent to perform all subtasks would require the agent to plan the problem-solving strategy at a high level while also keeping track of low-level details. For example, in the case of travel planning, the agent would need to maintain a high-level plan for checking weather forecasts, searching for hotel rooms and attractions, while simultaneously reasoning about the correct usage of a set of hotel-searching APIs. This single-agent approach can easily lead to confusion for LLMs because long-context reasoning becomes challenging when different types of information are mixed. Later in this post, we provide evaluation data points to illustrate the benefits of multi-agent collaboration.

A hierarchical multi-agent collaboration framework

The MAC framework for Amazon Bedrock Agents starts with a hierarchical approach and will expand to other collaboration mechanisms in the future. The framework consists of several key components designed to optimize performance and efficiency.

Here’s an explanation of each of the components of the multi-agent team:

  • Supervisor agent – This is an agent that coordinates a network of specialized agents. It’s responsible for organizing the overall workflow, breaking down tasks, and assigning subtasks to specialist agents. In our framework, a supervisor agent can assign and delegate tasks; however, the responsibility for solving the problem isn’t transferred.
  • Specialist agents – These are agents with specific expertise, designed to handle particular aspects of a given problem.
  • Inter-agent communication – Communication is the key component of multi-agent collaboration, allowing agents to exchange information and coordinate their actions. We use a standardized communication protocol that allows the supervisor agents to send and receive messages to and from the specialist agents.
  • Payload referencing – This mechanism enables efficient sharing of large content blocks (like code snippets or detailed travel itineraries) between agents, significantly reducing communication overhead. Instead of repeatedly transmitting large pieces of data, agents can reference previously shared payloads using unique identifiers. This feature is particularly valuable in domains such as software development.
  • Routing mode – For simpler tasks, this mode allows direct routing to specialist agents, bypassing the full orchestration process to improve efficiency for latency-sensitive applications.

The following figure shows inter-agent communication in an interactive application. The user first initiates a request to the supervisor agent. After coordinating with the subagents, the supervisor agent returns a response to the user.
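
To make the interaction model concrete, the following is a minimal sketch of how a client application might send a user request to a supervisor agent using the AWS SDK for JavaScript. The agent ID, alias ID, session ID, and Region are placeholders; the supervisor's coordination with its specialist subagents happens on the service side.

// Minimal sketch: invoking a supervisor agent from a client application.
// The agent IDs and session ID below are placeholders.
import {
  BedrockAgentRuntimeClient,
  InvokeAgentCommand,
} from "@aws-sdk/client-bedrock-agent-runtime";

const client = new BedrockAgentRuntimeClient({ region: "us-east-1" });

async function askSupervisor(inputText) {
  const command = new InvokeAgentCommand({
    agentId: "SUPERVISOR_AGENT_ID",      // placeholder
    agentAliasId: "SUPERVISOR_ALIAS_ID", // placeholder
    sessionId: "demo-session-1",         // keeps conversation context across turns
    inputText,
  });

  const response = await client.send(command);

  // The completion is streamed back as chunks; concatenate them into one reply.
  let reply = "";
  for await (const event of response.completion) {
    if (event.chunk?.bytes) {
      reply += new TextDecoder().decode(event.chunk.bytes);
    }
  }
  return reply;
}

// Example: the supervisor delegates to its specialist agents behind the scenes
// and returns a single consolidated answer.
console.log(await askSupervisor("Plan a two-day trip to Las Vegas for January 5, 2025."));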

Evaluation of multi-agent collaboration: A comprehensive approach

Evaluating the effectiveness and efficiency of multi-agent systems presents unique challenges due to several complexities:

  1. Users can follow up and provide additional instructions to the supervisor agent.
  2. For many problems, there are multiple ways to resolve them.
  3. The success of a task often requires an agentic system to correctly perform multiple subtasks.

Conventional evaluation methods based on matching ground-truth actions or states often fall short in providing intuitive results and insights. To address this, we developed a comprehensive framework that calculates success rates based on automatic judgments of human-annotated assertions. We refer to this approach as “assertion-based benchmarking.” Here’s how it works:

  • Scenario creation – We create a diverse set of scenarios across different domains, each with specific goals that an agent must achieve to obtain success.
  • Assertions – For each scenario, we manually annotate a set of assertions that must be true for the task to be considered successful. These assertions cover both user-observable outcomes and system-level behaviors.
  • Agent and user simulation – We simulate the behavior of the agent in a sandbox environment, where the agent is asked to solve the problems described in the scenarios. Whenever user interaction is required, we use an independent LLM-based user simulator to provide feedback.
  • Automated evaluation – We use an LLM to automatically judge whether each assertion is true based on the conversation transcript.
  • Human evaluation – Instead of using LLMs, we ask humans to directly judge the success based on simulated trajectories.

Here is an example of a scenario and corresponding assertions for assertion-based benchmarking:

  • Goals:
    • User needs the weather conditions expected in Las Vegas for tomorrow, January 5, 2025.
    • User needs to search for a direct flight from Denver International Airport to McCarran International Airport, Las Vegas, departing tomorrow morning, January 5, 2025.
  • Assertions:
    • User is informed about the weather forecast for Las Vegas tomorrow, January 5, 2025.
    • User is informed about the available direct flight options for a trip from Denver International Airport to McCarran International Airport in Las Vegas for tomorrow, January 5, 2025.
    • get_tomorrow_weather_by_city is triggered to find information on the weather conditions expected in Las Vegas tomorrow, January 5, 2025.
    • search_flights is triggered to search for a direct flight from Denver International Airport to McCarran International Airport departing tomorrow, January 5, 2025.

For better user simulation, we also include additional contextual information as part of the scenario. A multi-agent collaboration trajectory is judged as successful only when all assertions are met.
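
To illustrate the automated evaluation step, the following is a minimal sketch of an LLM-based assertion judge. The prompt wording, model ID, and parsing logic are illustrative assumptions rather than the evaluation framework's actual implementation.

// Minimal sketch of an LLM-based assertion judge (illustrative prompt and parsing).
import { BedrockRuntimeClient, InvokeModelCommand } from "@aws-sdk/client-bedrock-runtime";

const judgeClient = new BedrockRuntimeClient({});

async function judgeAssertion(transcript, assertion) {
  const body = JSON.stringify({
    anthropic_version: "bedrock-2023-05-31",
    max_tokens: 10,
    messages: [
      {
        role: "user",
        content:
          `Conversation transcript:\n${transcript}\n\n` +
          `Assertion: ${assertion}\n\n` +
          `Answer with exactly one word, TRUE or FALSE: is the assertion satisfied?`,
      },
    ],
  });

  const response = await judgeClient.send(
    new InvokeModelCommand({
      modelId: "anthropic.claude-3-5-sonnet-20240620-v1:0",
      contentType: "application/json",
      accept: "application/json",
      body,
    })
  );

  // Claude returns a content array; the first element holds the generated text.
  const payload = JSON.parse(new TextDecoder().decode(response.body));
  return payload.content?.[0]?.text?.trim().toUpperCase().startsWith("TRUE") ?? false;
}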

Key metrics

Our evaluation framework focuses on a high-level success rate across multiple tasks to provide a holistic view of system performance:

Goal success rate (GSR) – This is our primary measure of success, indicating the percentage of scenarios where all assertions were evaluated as true. The overall GSR is aggregated into a single number for each problem domain.
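
As a worked example of the metric, the following sketch computes GSR from per-assertion judgments, assuming a simple hypothetical data shape in which each scenario carries its judged assertions.

// Minimal sketch (hypothetical data shapes): a scenario succeeds only if every
// annotated assertion is judged true; GSR is the fraction of successful scenarios.
function goalSuccessRate(scenarios) {
  const successes = scenarios.filter((scenario) =>
    scenario.assertions.every((assertion) => assertion.judgedTrue)
  ).length;
  return successes / scenarios.length;
}

// Example: 2 of 3 scenarios have all assertions met, so GSR ≈ 0.67.
const gsr = goalSuccessRate([
  { assertions: [{ judgedTrue: true }, { judgedTrue: true }] },
  { assertions: [{ judgedTrue: true }, { judgedTrue: false }] },
  { assertions: [{ judgedTrue: true }] },
]);
console.log(`GSR: ${(gsr * 100).toFixed(0)}%`);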

Evaluation results

The following table shows the evaluation results of multi-agent collaboration on Amazon Bedrock Agents across three enterprise domains (travel planning, mortgage financing, and software development):

Evaluation type        Dataset                 Overall GSR
Automatic evaluation   Travel planning         87%
Automatic evaluation   Mortgage financing      90%
Automatic evaluation   Software development    77%
Human evaluation       Travel planning         93%
Human evaluation       Mortgage financing      97%
Human evaluation       Software development    73%

All experiments are conducted in a setting where the supervisor agents are driven by Anthropic’s Claude 3.5 Sonnet models.

Comparing to single-agent systems

We also conducted an apples-to-apples comparison with the single-agent approach under equivalent settings. The MAC approach achieved a 90% success rate across all three domains. In contrast, the single-agent approach scored 60%, 80%, and 53% in the travel planning, mortgage financing, and software development datasets, respectively, which are significantly lower than the multi-agent approach. Upon analysis, we found that when presented with many tools, a single agent tended to hallucinate tool calls and failed to reject some out-of-scope requests. These results highlight the effectiveness of our multi-agent system in handling complex, real-world tasks across diverse domains.

To understand the reliability of the automatic judgments, we conducted a human evaluation on the same scenarios to investigate the correlation between the model and human judgments and found high correlation on end-to-end GSR.

Comparison with other frameworks

To understand how our MAC framework stacks up against existing solutions, we conducted a comparative analysis with a widely adopted open source framework (OSF) under equivalent conditions, with Anthropic’s Claude 3.5 Sonnet driving the supervisor agent and Anthropic’s Claude 3.0 Sonnet driving the specialist agents. The results are summarized in the following figure:

These results demonstrate a significant performance advantage for our MAC framework across all the tested domains.

Best practices for building multi-agent systems

The design of multi-agent teams can significantly impact the quality and efficiency of problem-solving across tasks. Among the many lessons we learned, we found it crucial to carefully design team hierarchies and agent roles.

Design multi-agent hierarchies based on performance targets
It’s important to design the hierarchy of a multi-agent team by considering the priorities of different targets in a use case, such as success rate, latency, and robustness. For example, if the use case involves building a latency-sensitive customer-facing application, it might not be ideal to include too many layers of agents in the hierarchy because routing requests through multiple tertiary agents can add unnecessary delays. Similarly, to optimize latency, it’s better to avoid agents with overlapping functionalities, which can introduce inefficiencies and slow down decision-making.

Define agent roles clearly
Each agent must have a well-defined area of expertise. On Amazon Bedrock Agents, this can be achieved through collaborator instructions when configuring multi-agent collaboration. These instructions should be written in a clear and concise manner to minimize ambiguity. Moreover, there should be no confusion in the collaborator instructions across multiple agents because this can lead to inefficiencies and errors in communication.

The following is a clear, detailed instruction:

Trigger this agent for 1) searching for hotels in a given location, 2) checking availability of one or multiple hotels, 3) checking amenities of hotels, 4) asking for price quote of one or multiple hotels, and 5) answering questions of check-in/check-out time and cancellation policy of specific hotels.

The following instruction is too brief, making it unclear and ambiguous:

Trigger this agent for helping with accommodation.

The second, unclear, example can lead to confusion and lower collaboration efficiency when multiple specialist agents are involved. Because the instruction doesn’t explicitly define the capabilities of the hotel specialist agent, the supervisor agent may overcommunicate, even when the user query is out of scope.

Conclusion

Multi-agent systems represent a powerful paradigm for tackling complex real-world problems. By using the collective capabilities of multiple specialized agents, we demonstrate that these systems can achieve impressive results across a wide range of domains, outperforming single-agent approaches.

Multi-agent collaboration provides a framework for developers to combine the reasoning power of numerous AI agents powered by LLMs. As we continue to push the boundaries of what is possible, we can expect even more innovative and complex applications, such as networks of agents working together to create software or generate financial analysis reports. On the research front, it’s important to explore how different collaboration patterns, including cooperative and competitive interactions, will emerge and be applied to real-world scenarios.

About the author

Raphael Shu is a Senior Applied Scientist at Amazon Bedrock. He received his PhD from the University of Tokyo in 2020, earning a Dean’s Award. His research primarily focuses on Natural Language Generation, Conversational AI, and AI Agents, with publications in conferences such as ICLR, ACL, EMNLP, and AAAI. His work on the attention mechanism and latent variable models received an Outstanding Paper Award at ACL 2017 and the Best Paper Award for JNLP in 2018 and 2019. At AWS, he led the Dialog2API project, which enables large language models to interact with the external environment through dialogue. In 2023, he led a team aiming to develop the agentic capability for Amazon Titan. Since 2024, Raphael has worked on multi-agent collaboration with LLM-based agents.

Nilaksh Das is an Applied Scientist at AWS, where he works with the Bedrock Agents team to develop scalable, interactive and modular AI systems. His contributions at AWS have spanned multiple initiatives, including the development of foundational models for semantic speech understanding, integration of function calling capabilities for conversational LLMs and the implementation of communication protocols for multi-agent collaboration. Nilaksh completed his PhD in AI Security at Georgia Tech in 2022, where he was also conferred the Outstanding Dissertation Award.

Michelle Yuan is an Applied Scientist on Amazon Bedrock Agents. Her work focuses on scaling customer needs through Generative and Agentic AI services. She has industry experience, multiple first-author publications in top ML/NLP conferences, and strong foundation in mathematics and algorithms. She obtained her Ph.D. in Computer Science at University of Maryland before joining Amazon in 2022.

Monica Sunkara is a Senior Applied Scientist at AWS, where she works on Amazon Bedrock Agents. With over 10 years of industry experience, including 6.5 years at AWS, Monica has contributed to various AI and ML initiatives such as Alexa Speech Recognition, Amazon Transcribe, and Amazon Lex ASR. Her work spans speech recognition, natural language processing, and large language models. Recently, she worked on adding function calling capabilities to Amazon Titan text models. Monica holds a degree from Cornell University, where she conducted research on object localization under the supervision of Prof. Andrew Gordon Wilson before joining Amazon in 2018.

Dr. Yi Zhang is a Principal Applied Scientist at AWS, Bedrock. With 25 years of combined industrial and academic research experience, Yi’s research focuses on syntactic and semantic understanding of natural language in dialogues, and their application in the development of conversational and interactive systems with speech and text/chat. He has been technically leading the development of modeling solutions behind AWS services such as Bedrock Agents, AWS Lex, HealthScribe, etc.

How BQA streamlines education quality reporting using Amazon Bedrock

Given the value of data today, organizations across various industries are working with vast amounts of data across multiple formats. Manually reviewing and processing this information can be challenging, time-consuming, and prone to error. This is where intelligent document processing (IDP), coupled with the power of generative AI, emerges as a game-changing solution.

Enhancing the capabilities of IDP is the integration of generative AI, which harnesses large language models (LLMs) and generative techniques to understand and generate human-like text. This integration allows organizations to not only extract data from documents, but to also interpret, summarize, and generate insights from the extracted information, enabling more intelligent and automated document processing workflows.

The Education and Training Quality Authority (BQA) plays a critical role in improving the quality of education and training services in the Kingdom of Bahrain. BQA reviews the performance of all education and training institutions, including schools, universities, and vocational institutes, thereby promoting the professional advancement of the nation’s human capital.

BQA oversees a comprehensive quality assurance process, which includes setting performance standards and conducting objective reviews of education and training institutions. The process involves the collection and analysis of extensive documentation, including self-evaluation reports (SERs), supporting evidence, and various media formats from the institutions being reviewed.

The collaboration between BQA and AWS was facilitated through the Cloud Innovation Center (CIC) program, a joint initiative by AWS, Tamkeen, and leading universities in Bahrain, including Bahrain Polytechnic and University of Bahrain. The CIC program aims to foster innovation within the public sector by providing a collaborative environment where government entities can work closely with AWS consultants and university students to develop cutting-edge solutions using the latest cloud technologies.

As part of the CIC program, BQA has built a proof of concept solution, harnessing the power of AWS services and generative AI capabilities. The primary purpose of this proof of concept was to test and validate the proposed technologies, demonstrating their viability and potential for streamlining BQA’s reporting and data management processes.

In this post, we explore how BQA used the power of Amazon Bedrock, Amazon SageMaker JumpStart, and other AWS services to streamline the overall reporting workflow.

The challenge: Streamlining self-assessment reporting

BQA has traditionally provided education and training institutions with a template for the SER as part of the review process. Institutions are required to submit a review portfolio containing the completed SER and supporting material as evidence, which sometimes did not adhere fully to the established reporting standards.

The existing process had some challenges:

  • Inaccurate or incomplete submissions – Institutions might provide incomplete or inaccurate information in the submitted reports and supporting evidence, leading to gaps in the data required for a comprehensive review.
  • Missing or insufficient supporting evidence – The supporting material provided as evidence by institutions frequently did not substantiate the claims made in their reports, which challenged the evaluation process.
  • Time-consuming and resource-intensive – The process required dedicating significant time and resources to reviewing the submissions manually and following up with institutions to request additional information when needed to rectify the submissions, which slowed down the overall review process.

These challenges highlighted the need for a more streamlined and efficient approach to the submission and review process.

Solution overview

The proposed solution uses Amazon Bedrock and the Amazon Titan Express model to enable IDP functionalities. The architecture seamlessly integrates multiple AWS services with Amazon Bedrock, allowing for efficient data extraction and comparison.

Amazon Bedrock is a fully managed service that provides access to high-performing foundation models (FMs) from leading AI startups and Amazon through a unified API. It offers a wide range of FMs, allowing you to choose the model that best suits your specific use case.

The following diagram illustrates the solution architecture.

solution architecture diagram

The solution consists of the following steps:

  1. Relevant documents are uploaded and stored in an Amazon Simple Storage Service (Amazon S3) bucket.
  2. An event notification is sent to an Amazon Simple Queue Service (Amazon SQS) queue to queue each file for further processing. Amazon SQS serves as a buffer, enabling the different components to send and receive messages in a reliable manner without being directly coupled, enhancing the scalability and fault tolerance of the system.
  3. The text extraction AWS Lambda function is invoked by the SQS queue, processing each queued file and using Amazon Textract to extract text from the documents (a simplified sketch of this function follows these steps).
  4. The extracted text data is placed into another SQS queue for the next processing step.
  5. The text summarization Lambda function is invoked by this new queue containing the extracted text. This function sends a request to SageMaker JumpStart, where a Meta Llama text generation model is deployed to summarize the content based on the provided prompt.
  6. In parallel, the InvokeSageMaker Lambda function is invoked to perform comparisons and assessments. It compares the extracted text against the BQA standards that the model was trained on, evaluating the text for compliance, quality, and other relevant metrics.
  7. The summarized data and assessment results are stored in an Amazon DynamoDB table.
  8. Upon request, the InvokeBedrock Lambda function invokes Amazon Bedrock to generate generative AI summaries and comments. The function constructs a detailed prompt designed to guide the Amazon Titan Express model in evaluating the university’s submission.
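
The following is a simplified sketch of the text extraction Lambda function from step 3. It assumes the SQS message body carries the bucket and object key of the uploaded document and that a second queue receives the extracted text for the summarization step; names such as NEXT_QUEUE_URL are illustrative.

// Simplified sketch of the text extraction Lambda (assumed message shape and queue name).
import { TextractClient, DetectDocumentTextCommand } from "@aws-sdk/client-textract";
import { SQSClient, SendMessageCommand } from "@aws-sdk/client-sqs";

const textract = new TextractClient({});
const sqs = new SQSClient({});
const NEXT_QUEUE_URL = process.env.NEXT_QUEUE_URL; // queue for the summarization step (assumed)

export const handler = async (event) => {
  for (const record of event.Records) {
    // The S3 event notification forwarded through SQS carries the bucket and object key
    // (simplified here to a flat JSON body).
    const { bucket, key } = JSON.parse(record.body);

    const result = await textract.send(
      new DetectDocumentTextCommand({
        Document: { S3Object: { Bucket: bucket, Name: key } },
      })
    );

    // Keep only LINE blocks and join them into a single text payload.
    const text = (result.Blocks ?? [])
      .filter((block) => block.BlockType === "LINE")
      .map((block) => block.Text)
      .join("\n");

    // Hand the extracted text to the next queue for summarization (step 4).
    await sqs.send(
      new SendMessageCommand({
        QueueUrl: NEXT_QUEUE_URL,
        MessageBody: JSON.stringify({ key, text }),
      })
    );
  }
};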

Prompt engineering using Amazon Bedrock

To take advantage of the power of Amazon Bedrock and make sure the generated output adhered to the desired structure and formatting requirements, a carefully crafted prompt was developed according to the following guidelines:

  • Evidence submission – Present the evidence submitted by the institution under the relevant indicator, providing the model with the necessary context for evaluation
  • Evaluation criteria – Outline the specific criteria the evidence should be assessed against
  • Evaluation instructions – Instruct the model as follows:
    • Indicate N/A if the evidence is irrelevant to the indicator
    • Evaluate the university’s self-assessment based on the criteria
    • Assign a score from 1–5 for each comment, citing evidence directly from the content
  • Response format – Specify the response as bullet points, focusing on relevant analysis and evidence, with a word limit of 100 words

To use this prompt template, you can create a custom Lambda function with your project. The function should handle the retrieval of the required data, such as the indicator name, the university’s submitted evidence, and the rubric criteria. Within the function, include the prompt template and dynamically populate the placeholders (${indicatorName}, ${JSON.stringify(allContent)}, and ${JSON.stringify(c.comment)}) with the retrieved data.

The Amazon Titan Text Express model will then generate the evaluation response based on the provided prompt instructions, adhering to the specified format and guidelines. You can process and analyze the model’s response within your function, extracting the compliance score, relevant analysis, and evidence.

The following is an example prompt template:

// comments, indicatorName, allContent, and logger are populated earlier in the
// Lambda function from the retrieved submission data.
for (const c of comments) {
  const prompt = `
  Below is the evidence submitted by the university under the indicator "${indicatorName}":
  ${JSON.stringify(allContent)}

  Analyze and evaluate the university's evidence based on the provided rubric criteria:
  ${JSON.stringify(c.comment)}

  - If the evidence does not relate to the indicator, indicate that it is not applicable (N/A) without any additional commentary.

  Choose one of the compliance scores below based on the evidence submitted:
  1. Non-compliant: The comment does not meet the criteria or standards.
  2. Compliant with recommendation: The comment meets the criteria but includes a suggestion or recommendation for improvement.
  3. Compliant: The comment meets the criteria or standards.

  AT THE END OF THE RESPONSE, INCLUDE THE SCORE: [SCORE: COMPLIANT OR NON-COMPLIANT OR COMPLIANT WITH RECOMMENDATION]
  Write your response in concise bullet points, focusing strictly on relevant analysis and evidence.
  **LIMIT YOUR RESPONSE TO 100 WORDS ONLY.**
  `;

  logger.info(`Prompt for comment ${c.commentId}: ${prompt}`);

  // Request body for the Amazon Titan Text Express model.
  const body = JSON.stringify({
    inputText: prompt,
    textGenerationConfig: {
      maxTokenCount: 4096,
      stopSequences: [],
      temperature: 0,
      topP: 0.1,
    },
  });

  // The body is then sent to Amazon Bedrock (see the invocation sketch below).
}

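Once the request body is built, it can be sent to the Amazon Titan Text Express model through the Amazon Bedrock Runtime API. The following is a minimal sketch of that invocation; the function name is illustrative, and error handling is omitted.

// Minimal sketch: sending the request body built above to Amazon Titan Text Express.
import { BedrockRuntimeClient, InvokeModelCommand } from "@aws-sdk/client-bedrock-runtime";

const bedrock = new BedrockRuntimeClient({});

async function evaluateWithTitan(body) {
  const response = await bedrock.send(
    new InvokeModelCommand({
      modelId: "amazon.titan-text-express-v1",
      contentType: "application/json",
      accept: "application/json",
      body,
    })
  );

  // Titan returns a JSON document whose results array holds the generated text.
  const payload = JSON.parse(new TextDecoder().decode(response.body));
  return payload.results?.[0]?.outputText ?? "";
}
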
The following screenshot shows an example of the Amazon Bedrock generated response.

Amazon Bedrock generated response

Results

The implementation of Amazon Bedrock provided institutions with transformative benefits. By automating and streamlining the collection and analysis of extensive documentation, including SERs, supporting evidence, and various media formats, institutions can achieve greater accuracy and consistency in their reporting processes and be better prepared for the review process. This not only reduces the time and cost associated with manual data processing, but also improves compliance with quality expectations, thereby enhancing the credibility and quality of their institutions.

For BQA, the implementation helped achieve one of its strategic objectives: streamlining its reporting processes and delivering significant improvements across a range of critical metrics, substantially enhancing the overall efficiency and effectiveness of its operations.

Key success metrics anticipated include:

  • Faster turnaround times for generating 70% accurate and standards-compliant self-evaluation reports, leading to improved overall efficiency.
  • Reduced risk of errors or non-compliance in the reporting process, enforcing adherence to established guidelines.
  • Ability to summarize lengthy submissions into concise bullet points, allowing BQA reviewers to quickly analyze and comprehend the most pertinent information, reducing evidence analysis time by 30%.
  • More accurate compliance feedback functionality, empowering reviewers to effectively evaluate submissions against established standards and guidelines, while achieving 30% reduced operational costs through process optimizations.
  • Enhanced transparency and communication through seamless interactions, enabling users to request additional documents or clarifications with ease.
  • Real-time feedback, allowing institutions to make necessary adjustments promptly. This is particularly useful to maintain submission accuracy and completeness.
  • Enhanced decision-making by providing insights on the data. This helps universities identify areas for improvement and make data-driven decisions to enhance their processes and operations.

The following screenshot shows an example of generating new evaluations using Amazon Bedrock.

generating new evaluations using Amazon Bedrock

Conclusion

This post outlined the implementation of Amazon Bedrock at the Education and Training Quality Authority (BQA), demonstrating the transformative potential of generative AI in revolutionizing the quality assurance processes in the education and training sectors. For those interested in exploring the technical details further, the full code for this implementation is available in the following GitHub repo. If you are interested in conducting a similar proof of concept with us, submit your challenge idea to the Bahrain Polytechnic or University of Bahrain CIC website.


About the Author

Maram AlSaegh is a Cloud Infrastructure Architect at Amazon Web Services (AWS), where she supports AWS customers in accelerating their journey to cloud. Currently, she is focused on developing innovative solutions that leverage generative AI and machine learning (ML) for public sector entities.

Boosting team innovation, productivity, and knowledge sharing with Amazon Q Business – Web experience

Amazon Q Business can increase productivity across diverse teams, including developers, architects, site reliability engineers (SREs), and product managers. Amazon Q Business as a web experience makes AWS best practices readily accessible, providing cloud-centered recommendations quickly and making it straightforward to access AWS service functions, limits, and implementations. These elements are brought together in a web integration that serves various job roles and personas exactly when they need it.

As enterprises continue to grow their applications, environments, and infrastructure, it has become difficult to keep pace with technology trends, best practices, and programming standards. Enterprises provide their developers, engineers, and architects with a range of knowledge bases and documents, such as usage guides, wikis, and tools. But these resources tend to become siloed over time and inaccessible across teams, resulting in reduced knowledge, duplication of work, and reduced productivity.

MuleSoft from Salesforce provides the Anypoint platform that gives IT the tools to automate everything. This includes integrating data and systems, automating workflows and processes, and creating incredible digital experiences—all on a single, user-friendly platform.

This post shows how MuleSoft introduced a generative AI-powered assistant using Amazon Q Business to enhance their internal Cloud Central dashboard. This individualized portal shows assets owned, costs and usage, and well-architected recommendations to over 100 engineers. For more on MuleSoft’s journey to cloud computing, refer to Why a Cloud Operating Model?

Developers, engineers, FinOps, and architects can get the right answer at the right time when they’re ready to troubleshoot, address an issue, have an inquiry, or want to understand AWS best practices and cloud-centered deployments.

This post covers how to integrate Amazon Q Business into your enterprise setup.

Solution overview

The Amazon Q Business web experience provides seamless access to information, step-by-step instructions, troubleshooting, and prescriptive guidance so teams can deploy well-architected applications or cloud-centered infrastructure. Team members can chat directly or upload documents and receive summarization, analysis, or answers to a calculation. Amazon Q Business uses supported connectors such as Confluence, Amazon Relational Database Service (Amazon RDS), and web crawlers. The following diagram shows the reference architecture for various personas, including developers, support engineers, DevOps, and FinOps to connect with internal databases and the web using Amazon Q Business.

Reference Architecture

In this reference architecture, you can see how various user personas, spanning across teams and business units, use the Amazon Q Business web experience as an access point for information, step-by-step instructions, troubleshooting, or prescriptive guidance for deploying a well-architected application or cloud-centered infrastructure. The web experience allows team members to chat directly with an AI assistant or upload documents and receive summarization, analysis, or answers to a calculation.

Use cases for Amazon Q Business

Small, medium, and large enterprises, depending on their mode of operation, type of business, and level of investment in IT, will have varying approaches and policies on providing access to information. Amazon Q Business is part of the AWS suite of generative AI services and provides a web-based utility to set up, manage, and interact with Amazon Q. It can answer questions, provide summaries, generate content, and complete tasks using the data and expertise found in your enterprise systems. You can connect internal and external datasets without compromising security to seamlessly incorporate your specific standard operating procedures, guidelines, playbooks, and reference links. With Amazon Q, MuleSoft’s engineering teams were able to address their AWS-specific inquiries (such as support ticket escalation, operational guidance, and AWS Well-Architected best practices) at scale.

The Amazon Q Business web experience allows business users across various job titles and functions to interact with Amazon Q through the web browser. With the web experience, teams can access the same information and receive similar recommendations based on their prompt or inquiry, level of experience, and knowledge, ranging from beginner to advanced.

The following demos are examples of what the Amazon Q Business web experience looks like. Amazon Q Business securely connects to over 40 commonly used business tools, such as wikis, intranets, Atlassian, Gmail, Microsoft Exchange, Salesforce, ServiceNow, Slack, and Amazon Simple Storage Service (Amazon S3). Point Amazon Q Business at your enterprise data, and it will search your data, summarize it logically, analyze trends, and engage in dialogue with end users about the data. This helps users access their data no matter where it resides in their organization.

Amazon Q Business supports prompting and response for prescriptive guidance. For example, when asked about optimizing Amazon Elastic Block Store (Amazon EBS) volumes, it provided detailed migration steps from gp2 to gp3. This is a well-known use case asked about by several MuleSoft teams.

Through the web experience, you can effortlessly perform document uploads and prompts for summary, calculation, or recommendations based on your document. You have the flexibility to upload .pdf, .xls, .xlsx, or .csv files directly into the chat interface. You can also assume a persona such as FinOps or DevOps and get personalized recommendations or responses.

MuleSoft engineers used the Amazon Q Business web summarization feature to better understand Split Cost Allocation Data (SCAD) for Amazon Elastic Kubernetes Service (Amazon EKS). They uploaded the SCAD PDF documents to Amazon Q and got straightforward summaries. This helped them understand their customer’s use of MuleSoft Anypoint platform running on Amazon EKS.

Amazon Q helped analyze IPv4 costs by processing an uploaded Excel file. As the video shows, it calculated expenses for elastic IPs and outbound data transfers, supporting a proposed network estimate.

Amazon Q Business also demonstrated its ability to provide tailored advice by responding to a specific user scenario. As the video shows, a user took on the role of a FinOps professional and asked Amazon Q to recommend AWS tools for cost optimization. Amazon Q then offered personalized suggestions based on this FinOps persona perspective.

Prerequisites

To get started with your Amazon Q Business web experience, you need the following prerequisites:

Create an Amazon Q Business web experience

Complete the following steps to create your web experience:

The web experience can be used by a variety of business users or personas to yield accurate and repeatable recommendations for level 100, 200, and 300 inquiries. Amazon Q supports a variety of data sources and data connectors to personalize your user experience. You can also further enrich your dataset with knowledge bases within Amazon Q. With Amazon Q Business set up with your own datasets and sources, teams and business units within your enterprise can index from the same information on common topics such as cost optimization, modernization, and operational excellence while maintaining their own unique area of expertise, responsibility, and job function.

Clean up

After trying the Amazon Q Business web experience, remember to remove any resources you created to avoid unnecessary charges. Complete the following steps:

  1. Delete the web experience:
    • On the Amazon Q Business console, navigate to the Web experiences section within your application.
    • Select the web experience you want to remove.
    • On the Actions menu, choose Delete.
    • Confirm the deletion by following the prompts.
  2. If you granted specific users access to the web experience, revoke their permissions. This might involve updating AWS Identity and Access Management (IAM) policies or removing users from specific groups in IAM Identity Center.
  3. If you set up any custom configurations for the web experience, such as specific data source filters or custom prompts, make sure to remove these.
  4. If you integrated the web experience with other tools or services, remove those integrations.
  5. Check for and delete any Amazon CloudWatch alarms or logs specifically set up for monitoring this web experience.

After deletion, review your AWS billing to make sure that charges related to the web experience have stopped.

Deleting a web experience is irreversible. Make sure you have any necessary backups or exports of important data before proceeding with the deletion. Also, keep in mind that deleting a web experience doesn’t automatically delete the entire Amazon Q Business application or its associated data sources. If you want to remove everything, follow the Amazon Q Business application clean-up procedure for the entire application.

Conclusion

Amazon Q Business web experience is your gateway to a powerful generative AI assistant. Want to take it further? Integrate Amazon Q with Slack for an even more interactive experience.

Every organization has unique needs when it comes to AI. That’s where Amazon Q shines. It adapts to your business needs, user applications, and end-user personas. The best part? You don’t need to do the heavy lifting. No complex infrastructure setup. No need for teams of data scientists. Amazon Q connects to your data and makes sense of it with just a click. It’s AI power made simple, giving you the intelligence you need without the hassle.

To learn more about the power of a generative AI assistant in your workplace, see Amazon Q Business.


About the Authors

Rueben Jimenez is an AWS Sr Solutions Architect who designs and implements complex data analytics, machine learning, generative AI, and cloud infrastructure solutions.

Sona Rajamani is a Sr. Manager Solutions Architect at AWS.  She lives in the San Francisco Bay Area and helps customers architect and optimize applications on AWS. In her spare time, she enjoys traveling and hiking.

Erick Joaquin is a Sr Customer Solutions Manager for Strategic Accounts at AWS. As a member of the account team, he is focused on evolving his customers’ maturity in the cloud to achieve operational efficiency at scale.

Build an Amazon Bedrock based digital lending solution on AWS

Digital lending is a critical business enabler for banks and financial institutions. Customers apply for a loan online after completing the know your customer (KYC) process. A typical digital lending process involves various activities, such as user onboarding (including steps to verify the user through KYC), credit verification, risk verification, credit underwriting, and loan sanctioning. Currently, some of these activities are done manually, leading to delays in loan sanctioning and impacting the customer experience.

In India, the KYC verification usually involves identity verification through identification documents for Indian citizens, such as a PAN card or Aadhar card, address verification, and income verification. Credit checks in India are normally done using the PAN number of a customer. The ideal way to address these challenges is to automate them to the extent possible.

The digital lending solution primarily needs orchestration of a sequence of steps and other features such as natural language understanding, image analysis, real-time credit checks, and notifications. You can seamlessly build automation around these features using Amazon Bedrock Agents. Amazon Bedrock is a fully managed service that offers a choice of high-performing foundation models (FMs) from leading AI companies such as AI21 Labs, Anthropic, Cohere, Meta, Mistral AI, Stability AI, and Amazon through a single API, along with a broad set of capabilities to build generative AI applications with security, privacy, and responsible AI. With Amazon Bedrock Agents, you can orchestrate multi-step processes and integrate with enterprise data using natural language instructions.

In this post, we propose DigitalDhan, a generative AI-based solution to automate customer onboarding and digital lending. The proposed solution uses Amazon Bedrock Agents to automate services related to KYC verification, credit and risk assessment, and notification. Financial institutions can use this solution to help automate the customer onboarding, KYC verification, credit decisioning, credit underwriting, and notification processes. This post demonstrates how you can gain a competitive advantage by using Amazon Bedrock Agents to automate a complex business process.

Why generative AI is best suited for assistants that support customer journeys

Traditional AI assistants that use rules-based navigation or natural language processing (NLP) based guidance fall short when handling the nuances of complex human conversations. For instance, in a real-world customer conversation, the customer might provide inadequate information (for example, missing documents), ask random or unrelated questions that aren’t part of the predefined flow (for example, asking for loan pre-payment options while verifying the identity documents), or phrase inputs in varied natural language (for example, representing twenty thousand as “20K,” “20000,” or “20,000”). Additionally, rules-based assistants don’t provide additional reasoning and explanations (such as why a loan was denied). Some of the rigid and linear flow-related rules either force customers to start the process over again or require human assistance to continue the conversation.

Generative AI assistants excel at handling these challenges. With well-crafted instructions and prompts, a generative AI-based assistant can ask for missing details, converse in human-like language, and handle errors gracefully while explaining the reasoning for their actions when required. You can add guardrails to make sure that these assistants don’t deviate from the main topic and provide flexible navigation options that account for real-world complexities. Context-aware assistants also enhance customer engagement by flexibly responding to the various off-the-flow customer queries.

Solution overview

DigitalDhan, the proposed digital lending solution, is powered by Amazon Bedrock Agents. It fully automates the customer onboarding, KYC verification, and credit underwriting process. The DigitalDhan service provides the following features:

  • Customers can understand the step-by-step loan process and the documents required through the solution
  • Customers can upload KYC documents such as PAN and Aadhar, which DigitalDhan verifies through automated workflows
  • DigitalDhan fully automates the credit underwriting and loan application process
  • DigitalDhan notifies the customer about the loan application through email

We have modeled the digital lending process close to a real-world scenario. The high-level steps of the DigitalDhan solution are shown in the following figure.

Digital Lending Process

The key business process steps are:

  1. The loan applicant initiates the loan application flow by accessing the DigitalDhan solution.
  2. The loan applicant begins the loan application journey. Sample prompts for the loan application include:
    1. “What is the process to apply for loan?”
    2. “I would like to apply for loan.”
    3. “My name is Adarsh Kumar. PAN is ABCD1234 and email is john_doe@example.org. I need a loan for 150000.”
    4. The applicant uploads their PAN card.
    5. The applicant uploads their Aadhar card.
  3. The DigitalDhan solution processes each of the natural language prompts. As part of the document verification process, the solution extracts the key details from the uploaded PAN and Aadhar cards, such as name, address, and date of birth. The solution then identifies whether the user is an existing customer using the PAN.
    1. If the user is an existing customer, the solution gets the internal risk score for the customer.
    2. If the user is a new customer, the solution gets the credit score based on the PAN details.
  4. The solution uses the internal risk score for an existing customer to check for credit worthiness.
  5. The solution uses the external credit score for a new customer to check for credit worthiness.
  6. The credit underwriting process involves credit decisioning based on the credit score and risk score, and calculates the final loan amount for the approved customer.
  7. The loan application details along with the decision are sent to the customer through email.

Technical solution architecture

The solution primarily uses Amazon Bedrock Agents (to orchestrate the multi-step process), Amazon Textract (to extract data from the PAN and Aadhar cards), and Amazon Comprehend (to identify the entities from the PAN and Aadhar card). The solution architecture is shown in the following figure.

Technical Solution Architecture for Digital Dhan Solution

The key solution components of the DigitalDhan solution architecture are:

  1. A user begins the onboarding process with the DigitalDhan application. They provide various documents (including PAN and Aadhar) and a loan amount as part of the KYC process.
  2. After the documents are uploaded, they’re automatically processed using various artificial intelligence and machine learning (AI/ML) services.
  3. Amazon Textract is used to extract text information from the uploaded documents.
  4. Amazon Comprehend is used to identify entities such as PAN and Aadhar (a simplified sketch of steps 3 and 4 follows this list).
  5. The credit underwriting flow is powered by Amazon Bedrock Agents.
    1. The knowledge base contains loan-related documents to respond to loan-related queries.
    2. The loan handler AWS Lambda function uses the information in the KYC documents to check the credit score and internal risk score. After the credit checks are complete, the function calculates the loan eligibility and processes the loan application.
    3. The notification Lambda function emails information about the loan application to the customer.
  6. The Lambda function can be integrated with external credit APIs.
  7. Amazon Simple Email Service (Amazon SES) is used to notify customers of the status of their loan application.
  8. The events are logged using Amazon CloudWatch.
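
The following is a simplified sketch of steps 3 and 4, combining Amazon Textract and Amazon Comprehend to pull the customer's name and PAN out of an uploaded KYC document stored in Amazon S3. The function name and the PAN regular expression are illustrative assumptions.

// Simplified sketch of steps 3 and 4: OCR with Amazon Textract, then entity
// detection with Amazon Comprehend (function name and PAN regex are illustrative).
import { TextractClient, DetectDocumentTextCommand } from "@aws-sdk/client-textract";
import { ComprehendClient, DetectEntitiesCommand } from "@aws-sdk/client-comprehend";

const textract = new TextractClient({});
const comprehend = new ComprehendClient({});

async function extractKycDetails(bucket, key) {
  // Pull the raw text lines off the PAN or Aadhar card image.
  const ocr = await textract.send(
    new DetectDocumentTextCommand({ Document: { S3Object: { Bucket: bucket, Name: key } } })
  );
  const text = (ocr.Blocks ?? [])
    .filter((b) => b.BlockType === "LINE")
    .map((b) => b.Text)
    .join("\n");

  // Detect generic entities (PERSON, DATE, LOCATION, ...); a simple regex picks
  // out the standard PAN pattern (five letters, four digits, one letter).
  const entities = await comprehend.send(
    new DetectEntitiesCommand({ Text: text, LanguageCode: "en" })
  );
  const name = entities.Entities?.find((e) => e.Type === "PERSON")?.Text;
  const pan = text.match(/[A-Z]{5}[0-9]{4}[A-Z]/)?.[0];

  return { name, pan, text };
}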

Amazon Bedrock Agents deep dive

Because we used Amazon Bedrock Agents heavily in the DigitalDhan solution, let’s look at the overall functioning of Amazon Bedrock Agents. The flow of the various components of Amazon Bedrock Agents is shown in the following figure.

Amazon Bedrock Agents Flow

The Amazon Bedrock agents break each task into subtasks, determine the right sequence, and perform actions and knowledge searches. The detailed steps are:

  1. Processing the loan application is the primary task performed by the Amazon Bedrock agents in the DigitalDhan solution.
  2. The Amazon Bedrock agents use the user prompts, conversation history, knowledge base, instructions, and action groups to orchestrate the sequence of steps related to loan processing. The Amazon Bedrock agent takes natural language prompts as inputs. The following are the instructions given to the agent:
You are DigitalDhan, an advanced AI lending assistant designed to provide personal loan-related information and create loan applications. Always ask for relevant information and avoid making assumptions. If you're unsure about something, clearly state "I don't have that information."

Always greet the user by saying the following: Hi there! I am DigitalDhan bot. I can help you with loans over this chat. To apply for a loan, kindly provide your full name, PAN Number, email, and the loan amount."

When a user expresses interest in applying for a loan, follow these steps in order, always ask the user for necessary details:

1. Determine user status: Identify if they're an existing or new customer.

2. User greeting (mandatory, do not skip): After determining user status, welcome returning users using the following format:

  Existing customer: Hi {customerName}, I see you are an existing customer. Please upload your PAN for KYC.

  New customer: Hi {customerName}, I see you are a new customer. Please upload your PAN and Aadhar for KYC.

3. Call Pan Verification step using the uploaded PAN document

4. Call Aadhaar Verification step using the uploaded Aadhaar document. Request the user to upload their Aadhaar card document for verification.

5. Loan application: Collect all necessary details to create the loan application.

6. If the loan is approved (email will be sent with details):

   For existing customers: If the loan officer approves the application, inform the user that their loan application has been approved using following format: Congratulations {customerName}, your loan is sanctioned. Based on your PAN {pan}, your risk score is {riskScore} and your overall credit score is {cibilScore}. I have created your loan and the application ID is {loanId}. The details have been sent to your email.

   For new customers: If the loan officer approves the application, inform the user that their loan application has been approved using following format: Congratulations {customerName}, your loan is sanctioned. Based on your PAN {pan} and {aadhar}, your risk score is {riskScore} and your overall credit score is {cibilScore}. I have created your loan and the application ID is {loanId}. The details have been sent to your email.

7. If the loan is rejected ( no emails sent):

   For new customers: If the loan officer rejects the application, inform the user that their loan application has been rejected using following format: Hello {customerName}, Based on your PAN {pan} and aadhar {aadhar}, your overall credit score is {cibilScore}. Because of the low credit score, unfortunately your loan application cannot be processed.

   For existing customers: If the loan officer rejects the application, inform the user that their loan application has been rejected using following format: Hello {customerName}, Based on your PAN {pan}, your overall credit score is {creditScore}. Because of the low credit score, unfortunately your loan application cannot be processed.

Remember to maintain a friendly, professional tone and prioritize the user's needs and concerns throughout the interaction. Be short and direct in your responses and avoid making assumptions unless specifically requested by the user.

Be short and prompt in responses, do not answer queries beyond the lending domain and respond saying you are a lending assistant
  3. We configured the agent preprocessing and orchestration instructions to validate and perform the steps in a predefined sequence. The few-shot examples specified in the agent instructions boost the accuracy of the agent’s performance. Based on the instructions and the API descriptions, the Amazon Bedrock agent creates a logical sequence of steps to complete an action. In the DigitalDhan example, instructions are specified such that the Amazon Bedrock agent creates the following sequence:
    1. Greet the customer.
    2. Collect the customer’s name, email, PAN, and loan amount.
    3. Ask for the PAN card and Aadhar card to read and verify the PAN and Aadhar number.
    4. Categorize the customer as an existing or new customer based on the verified PAN.
    5. For an existing customer, calculate the customer internal risk score.
    6. For a new customer, get the external credit score.
    7. Use the internal risk score (for existing customers) or credit score (for new customers) for credit underwriting. If the internal risk score is less than 300 or if the credit score is more than 700, sanction the loan amount.
    8. Email the credit decision to the customer’s email address.
  4. Action groups define the APIs for performing actions such as creating the loan, checking the user, fetching the risk score, and so on. We described each of the APIs in the OpenAPI schema, which the agent uses to select the most appropriate API to perform the action. A Lambda function is associated with the action group (a simplified sketch of such a handler follows this list). The following code is an example of the create_loan API. The Amazon Bedrock agent uses the description for the create_loan API while performing the action. The API schema also specifies customerName, address, loanAmt, pan, and riskScore as required elements for the API. Therefore, the corresponding APIs read the PAN number for the customer (verify_pan_card API), calculate the risk score for the customer (fetch_risk_score API), and identify the customer’s name and address (verify_aadhar_card API) before calling the create_loan API.
"/create_loan":
  post:
    summary: Create New Loan application
    description: Create new loan application for the customer. This API must be
      called for each new loan application request after calculating riskscore and
      creditScore
    operationId: createLoan
    requestBody:
      required: true
      content:
        application/json:
          schema:
            type: object
            properties:
              customerName:
                type: string
                description: Customer’s Name for creating the loan application
                minLength: 3
              loanAmt:
                type: string
                description: Preferred loan amount for the loan application
                minLength: 5
              pan:
                type: string
                description: Customer's PAN number for the loan application
                minLength: 10
              riskScore:
                type: string
                description: Risk score of the customer
                minLength: 2
              creditScore:
                type: string
                description: Credit score of the customer
                minLength: 3
              address:
                type: string
                description: Customer's address for the loan application
            required:
            - customerName
            - address
            - loanAmt
            - pan
            - riskScore
            - creditScore
    responses:
      '200':
        description: Success
        content:
          application/json:
            schema:
              type: object
              properties:
                loanId:
                  type: string
                  description: Identifier for the created loan application
                status:
                  type: string
                  description: Status of the loan application creation process
  3. Amazon Bedrock Knowledge Bases provides a cloud-based Retrieval Augmented Generation (RAG) experience to the customer. We added the documents related to loan processing, such as the general information and the loan information guide, to the knowledge base, and we specified instructions for when the agent should use it. Therefore, at the beginning of the customer journey, when the customer is in the exploration stage, they get responses with how-to instructions and general loan-related information. For instance, if the customer asks “What is the process to apply for a loan?” the Amazon Bedrock agent fetches the relevant step-by-step details from the knowledge base.
  4. After the required steps are complete, the Amazon Bedrock agent curates the final response to the customer.
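
The action group is backed by a Lambda function that receives the agent's API invocation and returns the result. The following is a minimal Python sketch of such a handler for the create_loan API. It assumes the standard Amazon Bedrock Agents request and response format for OpenAPI-based action groups; the property handling and the loan-creation logic are illustrative placeholders, not the actual DigitalDhan implementation.

import json
import uuid

def lambda_handler(event, context):
    # Amazon Bedrock Agents passes the API path and the request body defined
    # in the OpenAPI schema to the action group Lambda function.
    api_path = event.get("apiPath", "")
    props = (
        event.get("requestBody", {})
        .get("content", {})
        .get("application/json", {})
        .get("properties", [])
    )
    body = {p["name"]: p["value"] for p in props}

    if api_path == "/create_loan":
        # Placeholder logic: a real implementation would persist the
        # application and trigger the credit decision workflow.
        result = {
            "loanId": str(uuid.uuid4()),
            "status": f"Loan application created for {body.get('customerName', 'customer')}",
        }
    else:
        result = {"status": f"Unsupported API path: {api_path}"}

    # Return the response in the shape expected by Amazon Bedrock Agents.
    return {
        "messageVersion": "1.0",
        "response": {
            "actionGroup": event.get("actionGroup"),
            "apiPath": api_path,
            "httpMethod": event.get("httpMethod"),
            "httpStatusCode": 200,
            "responseBody": {"application/json": {"body": json.dumps(result)}},
        },
    }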

Let’s explore an example flow for an existing customer. For this example, we have depicted various actions performed by Amazon Bedrock Agents for an existing customer. First, the customer begins the loan journey by asking exploratory questions. We have depicted one such question—“What is the process to apply for a loan?”—in the following figure. Amazon Bedrock responds to such questions by providing a step-by-step guide fetched from the configured knowledge base.

Conversation with Digital Lending Solution

The customer proceeds to the next step and tries to apply for a loan. The DigitalDhan solution asks for user details such as the customer name, email address, PAN number, and desired loan amount. After the customer provides those details, the solution asks for the actual PAN card to verify the details, as shown in the following figure.

Identity Verification with Digital Lending Solution

When the PAN verification and the risk score checks are complete, the DigitalDhan solution creates a loan application and notifies the customer of the decision through email, as shown in the following figure.

Notification in Digital Lending Solution

Prerequisites

This project is built using the AWS Cloud Development Kit (AWS CDK).

For reference, the following versions of node and AWS CDK are used:

  • Node.js: v20.16.0
  • AWS CDK: 2.143.0
  • The command to install a specific version of the AWS CDK is npm install -g aws-cdk@<X.YY.Z>

Deploy the Solution

Complete the following steps to deploy the solution. For more details, refer to the GitHub repo.

  1. Clone the repository:
    git clone https://github.com/aws-samples/DigitalDhan-GenAI-FSI-LendingSolution-India.git

  2. Enter the code sample backend directory:
    cd DigitalDhan-GenAI-FSI-LendingSolution-India/

  3. Install packages:
    npm install
    npm install -g aws-cdk

  4. Bootstrap AWS CDK resources on the AWS account. If the stack is deployed in any AWS Region other than us-east-1, it might fail because of a Lambda layers dependency. You can either comment out the layer and deploy in another Region, or deploy in us-east-1.
    cdk bootstrap aws://<ACCOUNT_ID>/<REGION>

  5. You must explicitly enable access to models before they can be used with the Amazon Bedrock service. Follow the steps in Access Amazon Bedrock foundation models to enable access to the models (Anthropic::Claude (Sonnet) and Cohere::Embed English).
  6. Deploy the sample in your account. The following command deploys one stack in your account: cdk deploy --all
    To protect against unintended changes that might affect your security posture, the AWS CDK prompts you to approve security-related changes before deploying them. You will need to answer yes to fully deploy the stack.

The AWS Identity and Access Management (IAM) role creation in this example is for illustration only. Always provision IAM roles with the least required privileges. The stack deployment takes approximately 10–15 minutes. After the stack is successfully deployed, you can find InsureAssistApiAlbDnsName in the output section of the stack—this is the application endpoint.

Enable user input

After deployment is complete, enable user input so the agent can prompt the customer to provide additional information if necessary.

  1. Open the Amazon Bedrock console in the deployed Region and edit the agent.
  2. Modify the additional settings to enable User Input to allow the agent to prompt for additional information from the user when it doesn’t have enough information to respond to a prompt.

Test the solution

We covered three test scenarios in the solution. The sample data and prompts for the three scenarios can be found in the GitHub repo.

  • Scenario 1 is an existing customer who will be approved for the requested loan amount
  • Scenario 2 is a new customer who will be approved for the requested loan amount
  • Scenario 3 is a new customer whose loan application will be denied because of a low credit score

Clean up

To avoid future charges, delete the sample data stored in Amazon Simple Storage Service (Amazon S3) and the stack:

  1. Remove all data from the S3 bucket.
  2. Delete the S3 bucket.
  3. Use the following command to destroy the stack: cdk destroy

Summary

The proposed digital lending solution discussed in this post onboards a customer by verifying the KYC documents (including the PAN and Aadhar cards) and categorizes the customer as an existing customer or a new customer. For an existing customer, the solution uses an internal risk score, and for a new customer, the solution uses the external credit score.

The solution uses Amazon Bedrock Agents to orchestrate the digital lending processing steps. The documents are processed using Amazon Textract and Amazon Comprehend, after which Amazon Bedrock Agents processes the workflow steps. The customer identification, credit checks, and customer notification are implemented using Lambda.

The solution demonstrates how you can automate a complex business process with the help of Amazon Bedrock Agents and enhance customer engagement through a natural language interface and flexible navigation options.

Try Amazon Bedrock for other banking use cases, such as building customer service bots, email classification, and sales assistants, by using the powerful FMs and Amazon Bedrock Knowledge Bases that provide a managed RAG experience. Explore using Amazon Bedrock Agents to help orchestrate and automate complex banking processes such as customer onboarding, document verification, digital lending, loan origination, and customer servicing.


About the Authors

Shailesh Shivakumar is a FSI Sr. Solutions Architect with AWS India. He works with financial enterprises such as banks, NBFCs, and trading enterprises to help them design secure cloud services and engages with them to accelerate their cloud journey. He builds demos and proofs of concept to demonstrate the possibilities of AWS Cloud. He leads other initiatives such as customer enablement workshops, AWS demos, cost optimization, and solution assessments to make sure that AWS customers succeed in their cloud journey. Shailesh is part of Machine Learning TFC at AWS, handling the generative AI and machine learning-focused customer scenarios. Security, serverless, containers, and machine learning in the cloud are his key areas of interest.

Reena Manivel is an AWS FSI Solutions Architect. She specializes in analytics and works with customers in lending and banking businesses to create secure, scalable, and efficient solutions on AWS. Besides her technical pursuits, she is also a writer and enjoys spending time with her family.


Build AI-powered malware analysis using Amazon Bedrock with Deep Instinct

This post is co-written with Yaniv Avolov, Tal Furman, and Maor Ashkenazi from Deep Instinct.

Deep Instinct is a cybersecurity company that offers a state-of-the-art, comprehensive zero-day data security solution, Data Security X (DSX), for safeguarding your data repositories across the cloud, applications, network attached storage (NAS), and endpoints. DSX provides unmatched prevention and explainability by using a powerful combination of the deep learning-based DSX Brain and the generative AI-powered DSX Companion to protect systems from known and unknown malware and ransomware in real time.

Using deep neural networks (DNNs), Deep Instinct analyzes threats with unmatched accuracy, adapting to identify new and unknown risks that traditional methods might miss. This approach significantly reduces false positives and enables unparalleled threat detection rates, making it popular among large enterprises and critical infrastructure sectors such as finance, healthcare, and government.

In this post, we explore how Deep Instinct’s generative AI-powered malware analysis tool, DIANNA, uses Amazon Bedrock to revolutionize cybersecurity by providing rapid, in-depth analysis of known and unknown threats, enhancing the capabilities of security operations center (SOC) teams and addressing key challenges in the evolving threat landscape.

Main challenges for SecOps

There are two main challenges for SecOps:

  • The growing threat landscape – With a rapidly evolving threat landscape, SOC teams are becoming overwhelmed with a continuous increase of security alerts that require investigation. This situation hampers proactive threat hunting and exacerbates team burnout. Most importantly, the surge in alert storms increases the risk of missing critical alerts. A solution is needed that provides the explainability necessary to allow SOC teams to perform quick risk assessments regarding the nature of incidents and make informed decisions.
  • The challenges of malware analysis – Malware analysis has become an increasingly critical and complex field. The challenge of zero-day attacks lies in the limited information about why a file was blocked and classified as malicious. Threat analysts often spend considerable time assessing whether it was a genuine exploit or a false positive.

Let’s explore some of the key challenges that make malware analysis demanding:

  • Identifying malware – Modern malware has become incredibly sophisticated in its ability to disguise itself. It often mimics legitimate software, making it challenging for analysts to distinguish between benign and malicious code. Some malware can even disable security tools or evade scanners, further obfuscating detection.
  • Preventing zero-day threats – The rise of zero-day threats, which have no known signatures, adds another layer of difficulty. Identifying unknown malware is crucial, because failure can lead to severe security breaches and potentially incapacitate organizations.
  • Information overload – The powerful malware analysis tools currently available can be both beneficial and detrimental. Although they offer high explainability, they can also produce an overwhelming amount of data, forcing analysts to sift through a digital haystack to find indicators of malicious activity, increasing the possibility of analysts overlooking critical compromises.
  • Connecting the dots – Malware often consists of multiple components interacting in complex ways. Not only do analysts need to identify the individual components, but they also need to understand how they interact. This process is like assembling a jigsaw puzzle to form a complete picture of the malware’s capabilities and intentions, with pieces constantly changing shape.
  • Keeping up with cybercriminals – The world of cybercrime is fluid, with bad actors relentlessly developing new techniques and exploiting newly emerging vulnerabilities, leaving organizations struggling to keep up. The time window between the discovery of a vulnerability and its exploitation in the wild is narrowing, putting pressure on analysts to work faster and more efficiently. This rapid evolution means that malware analysts must constantly update their skill set and tools to stay one step ahead of the cybercriminals.
  • Racing against the clock – In malware analysis, time is of the essence. Malicious software can spread rapidly across networks, causing significant damage in a matter of minutes, often before the organization realizes an exploit has occurred. Analysts face the pressure of conducting thorough examinations while also providing timely insights to prevent or mitigate exploits.

DIANNA, the DSX Companion

There is a critical need for malware analysis tools that can provide precise, real-time, in-depth malware analysis for both known and unknown threats, supporting SecOps efforts. Deep Instinct, recognizing this need, has developed DIANNA (Deep Instinct’s Artificial Neural Network Assistant), the DSX Companion. DIANNA is a groundbreaking malware analysis tool powered by generative AI to tackle real-world issues, using Amazon Bedrock as its large language model (LLM) infrastructure. It offers on-demand features that provide flexible and scalable AI capabilities tailored to the unique needs of each client. Amazon Bedrock is a fully managed service that grants access to high-performance foundation models (FMs) from top AI companies through a unified API. By concentrating our generative AI models on specific artifacts, we can deliver comprehensive yet focused responses to address this gap effectively.

DIANNA is a sophisticated malware analysis tool that acts as a virtual team of malware analysts and incident response experts. It enables organizations to shift strategically toward zero-day data security by integrating with Deep Instinct’s deep learning capabilities for a more intuitive and effective defense against threats.

DIANNA’s unique approach

Current cybersecurity solutions use generative AI to summarize data from existing sources, but this approach is limited to retrospective analysis with limited context. DIANNA enhances this by integrating the collective expertise of numerous cybersecurity professionals within the LLM, enabling in-depth malware analysis of unknown files and accurate identification of malicious intent.

DIANNA’s unique approach to malware analysis sets it apart from other cybersecurity solutions. Unlike traditional methods that rely solely on retrospective analysis of existing data, DIANNA harnesses generative AI to empower itself with the collective knowledge of countless cybersecurity experts, sources, blog posts, papers, threat intelligence reputation engines, and chats. This extensive knowledge base is effectively embedded within the LLM, allowing DIANNA to delve deep into unknown files and uncover intricate connections that would otherwise go undetected.

At the heart of this process are DIANNA’s advanced translation engines, which transform complex binary code into natural language that LLMs can understand and analyze. This unique approach bridges the gap between raw code and human-readable insights, enabling DIANNA to provide clear, contextual explanations of a file’s intent, malicious aspects, and potential system impact. By translating the intricacies of code into accessible language, DIANNA addresses the challenge of information overload, distilling vast amounts of data into concise, actionable intelligence.

This translation capability is key for linking between different components of complex malware. It allows DIANNA to identify relationships and interactions between various parts of the code, offering a holistic view of the threat landscape. By piecing together these components, DIANNA can construct a comprehensive picture of the malware’s capabilities and intentions, even when faced with sophisticated threats. DIANNA doesn’t stop at simple code analysis—it goes deeper. It provides insights into why unknown events are malicious, streamlining what is often a lengthy process. This level of understanding allows SOC teams to focus on the threats that matter most.

Solution overview

DIANNA’s integration with Amazon Bedrock allows us to harness the power of state-of-the-art language models while maintaining agility to adapt to evolving client requirements and security considerations. DIANNA benefits from the robust features of Amazon Bedrock, including seamless scaling, enterprise-grade security, and the ability to fine-tune models for specific use cases.

The integration offers the following benefits:

  • Accelerated development with Amazon Bedrock – The fast-paced evolution of the threat landscape necessitates equally responsive cybersecurity solutions. DIANNA’s collaboration with Amazon Bedrock has played a crucial role in optimizing our development process and speeding up the delivery of innovative capabilities. The service’s versatility has enabled us to experiment with different FMs, exploring their strengths and weaknesses in various tasks. This experimentation has led to significant advancements in DIANNA’s ability to understand and explain complex malware behaviors. We have also benefited from the following features:
    • Fine-tuning – Alongside its core functionalities, Amazon Bedrock provides a range of ready-to-use features for customizing the solution. One such feature is model fine-tuning, which allows you to train FMs on proprietary data to enhance their performance in specific domains. For example, organizations can fine-tune an LLM-based malware analysis tool to recognize industry-specific jargon or detect threats associated with particular vulnerabilities.
    • Retrieval Augmented Generation – Another valuable feature is the use of Retrieval Augmented Generation (RAG), enabling access to and the incorporation of relevant information from external sources, such as knowledge bases or threat intelligence feeds. This enhances the model’s ability to provide contextually accurate and informative responses, improving the overall effectiveness of malware analysis.
  • A landscape for innovation and comparison – Amazon Bedrock has also served as a valuable landscape for conducting LLM-related research and comparisons.
  • Seamless integration, scalability, and customization – Integrating Amazon Bedrock into DIANNA’s architecture was a straightforward process. The user-friendly, well-documented Amazon Bedrock API facilitated seamless integration with our existing infrastructure. Furthermore, the service’s on-demand nature allows us to scale our AI capabilities up or down based on customer demand. This flexibility makes sure that DIANNA can handle fluctuating workloads without compromising performance.
  • Prioritizing data security and compliance – Data security and compliance are paramount in the cybersecurity domain. Amazon Bedrock offers enterprise-grade security features that provide us with the confidence to handle sensitive customer data. The service’s adherence to industry-leading security standards, coupled with the extensive experience of AWS in data protection, makes sure DIANNA meets the highest regulatory requirements such as GDPR. By using Amazon Bedrock, we can offer our customers a solution that not only protects their assets, but also demonstrates our commitment to data privacy and security.

By combining Deep Instinct’s proprietary prevention algorithms with the advanced language processing capabilities of Amazon Bedrock, DIANNA offers a unique solution that not only identifies and analyzes threats with high accuracy, but also communicates its findings in clear, actionable language. This synergy between Deep Instinct’s expertise in cybersecurity and the leading AI infrastructure of Amazon positions DIANNA at the forefront of AI-driven malware analysis and threat prevention.

The following diagram illustrates DIANNA’s architecture.

DIANNA’s architecture

Evaluating DIANNA’s malware analysis

In our task, the input is a malware sample, and the output is a comprehensive, in-depth report on the behaviors and intents of the file. However, generating ground truth data is particularly challenging. The behaviors and intents of malicious files aren’t readily available in standard datasets and require expert malware analysts for accurate reporting. Therefore, we needed a custom evaluation approach.

We focused our evaluation on two core dimensions:

  • Technical features – This dimension focuses on objective, measurable capabilities. We used programmable metrics to assess how well DIANNA handled key technical aspects, such as extracting indicators of compromise (IOCs), detecting critical keywords, and processing the length and structure of threat reports. These metrics allowed us to quantitatively assess the model’s basic analysis capabilities (a short illustrative sketch follows this list).
  • In-depth semantics – Because DIANNA is expected to generate complex, human-readable reports on malware behavior, we relied on domain experts (malware analysts) to assess the quality of the analysis. The reports were evaluated based on the following:
    • Depth of information – Whether DIANNA provided a detailed understanding of the malware’s behavior and techniques.
    • Accuracy – How well the analysis aligned with the true behaviors of the malware.
    • Clarity and structure – Evaluating the organization of the report, making sure the output was clear and comprehensible for security teams.
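
As an illustration of what such programmable metrics can look like, the following Python sketch computes a few simple technical checks over a generated report: regex-based IOC extraction, critical keyword detection, and report length. The patterns and keyword list are assumptions for illustration only; they are not DIANNA's actual metrics.

import re

# Illustrative patterns; real IOC extraction is considerably more thorough.
IOC_PATTERNS = {
    "ipv4": r"\b(?:\d{1,3}\.){3}\d{1,3}\b",
    "url": r"https?://[^\s\"']+",
    "md5": r"\b[a-fA-F0-9]{32}\b",
}
CRITICAL_KEYWORDS = ["ransomware", "persistence", "exfiltration", "injection"]

def score_report(report: str) -> dict:
    """Compute simple, objective metrics over a generated malware report."""
    iocs = {name: re.findall(pattern, report) for name, pattern in IOC_PATTERNS.items()}
    keywords_found = [kw for kw in CRITICAL_KEYWORDS if kw in report.lower()]
    return {
        "ioc_counts": {name: len(found) for name, found in iocs.items()},
        "critical_keywords": keywords_found,
        "word_count": len(report.split()),
        "section_count": report.count("\n\n") + 1,
    }

if __name__ == "__main__":
    sample = "The sample contacts http://198.51.100.7/payload and establishes persistence."
    print(score_report(sample))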

Because human evaluation is labor-intensive, fine-tuning the key components (the model itself, the prompts, and the translation engines) involved iterative feedback loops. Small adjustments in a component led to significant variations in the output, requiring repeated validations by human experts. The meticulous nature of this process, combined with the continuous need for scaling, has subsequently led to the development of the auto-evaluation capability.

Fine-tuning process and human validation

The fine-tuning and validation process consisted of the following steps:

  • Gathering a malware dataset – To cover the breadth of malware techniques, families, and threat types, we collected a large dataset of malware samples, each with technical metadata.
  • Splitting the dataset – The data was split into subsets for training, validation, and evaluation. Validation data was continually used to test how well DIANNA adapted after each key component update.
  • Human expert evaluation – Each time we fine-tuned DIANNA’s model, prompts, and translation mechanisms, human malware analysts reviewed a portion of the validation data. This made sure improvements or degradations in the quality of the reports were identified early. Because DIANNA’s outputs are highly sensitive to even minor changes, each update required a full reevaluation by human experts to verify whether the response quality was improved or degraded.
  • Final evaluation on a broader dataset – After sufficient tuning based on the validation data, we applied DIANNA to a large evaluation set. Here, we gathered comprehensive statistics on its performance to confirm improvements in report quality, correctness, and overall technical coverage.

Automation of evaluation

To make this process more scalable and efficient, we introduced an automatic evaluation phase. We trained a language model specifically designed to critique DIANNA’s outputs, providing a level of automation in assessing how well DIANNA was generating reports. This critique model acted as an internal judge, allowing for continuous, rapid feedback on incremental changes during fine-tuning. This enabled us to make small adjustments across DIANNA’s three core components (model, prompts, and translation engines) while receiving real-time evaluations of the impact of those changes.

This automated critique model enhanced our ability to test and refine DIANNA without having to rely solely on the time-consuming manual feedback loop from human experts. It provided a consistent, reliable measure of performance and allowed us to quickly identify which model adjustments led to meaningful improvements in DIANNA’s analysis.
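
Deep Instinct trained a dedicated critique model for this purpose. Purely as an illustration of the general LLM-as-judge pattern, the following Python sketch asks a foundation model on Amazon Bedrock to score a generated report against the evaluation criteria described earlier. The model ID, prompt, and scoring format are assumptions for illustration, not DIANNA's implementation.

import json
import boto3

bedrock_runtime = boto3.client("bedrock-runtime")

JUDGE_PROMPT = (
    "You are reviewing a malware analysis report. Score it from 1 to 5 on "
    "depth of information, accuracy, and clarity, and return a JSON object "
    "with the keys 'depth', 'accuracy', 'clarity', and 'rationale'.\n\nReport:\n{report}"
)

def critique_report(report: str, model_id: str = "anthropic.claude-3-sonnet-20240229-v1:0") -> dict:
    """Ask a judge model to rate a generated report (illustrative only)."""
    response = bedrock_runtime.converse(
        modelId=model_id,
        messages=[{"role": "user", "content": [{"text": JUDGE_PROMPT.format(report=report)}]}],
        inferenceConfig={"maxTokens": 512, "temperature": 0.0},
    )
    # The judge is instructed to reply with a JSON object.
    return json.loads(response["output"]["message"]["content"][0]["text"])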

Advanced integration and proactive analysis

DIANNA is integrated with Deep Instinct’s proprietary deep learning algorithms, enabling it to detect zero-day threats with high accuracy and a low false positive rate. This proactive approach helps security teams quickly identify unknown threats, reduce false positives, and allocate resources more effectively. Additionally, it streamlines investigations, minimizes cross-tool efforts, and automates repetitive tasks, making the decision-making process clearer and faster. This ultimately helps organizations strengthen their security posture and significantly reduce the mean time to triage.

This analysis offers the following key features and benefits:

  • Performs on-the-fly file scans, allowing for immediate assessment without prior setup or delays
  • Generates comprehensive malware analysis reports for a variety of file types in seconds, making sure users receive timely information about potential threats
  • Streamlines the entire file analysis process, making it more efficient and user-friendly, thereby reducing the time and effort required for thorough evaluations
  • Supports a wide range of common file formats, including Office documents, Windows executable files, script files, and Windows shortcut files (.lnk), providing compatibility with various types of data
  • Offers in-depth contextual analysis, malicious file triage, and actionable insights, greatly enhancing the efficiency of investigations into potentially harmful files
  • Empowers SOC teams to make well-informed decisions without relying on manual malware analysis by providing clear and concise insights into the behavior of malicious files
  • Alleviates the need to upload files to external sandboxes or VirusTotal, thereby enhancing security and privacy while facilitating quicker analysis

Explainability and insights into better decision-making for SOC teams

DIANNA stands out by offering clear insights into why unknown events are flagged as malicious. Traditional AI tools often rely on lengthy, retrospective analyses that can take hours or even days to generate, and often lead to vague conclusions. DIANNA dives deeper, understanding the intent behind the code and providing detailed explanations of its potential impact. This clarity allows SOC teams to prioritize the threats that matter most.

Example scenario of DIANNA in action

In this section, we explore some DIANNA use cases.

For example, DIANNA can perform investigations on malicious files.

The following screenshot is an example of a Windows executable file analysis.

Windows executable file analysis

The following screenshot is an example of an Office file analysis.

Office file analysis

You can also quickly triage incidents with enriched data on file analysis provided by DIANNA. The following screenshot is an example using Windows shortcut files (LNK) analysis.

Windows shortcut files (LNK) analysis

The following screenshot is an example with a script file (JavaScript) analysis.

script file (JavaScript) analysis

The following figure presents a before and after comparison of the analysis process.

comparison of the analysis process

Additionally, a key advantage of DIANNA is its ability to provide explainability by correlating and summarizing the intentions of malicious files in a detailed narrative. This is especially valuable for zero-day and unknown threats that aren’t yet recognized, making investigations challenging when starting from scratch without any clues.

Potential advancements in AI-driven cybersecurity

AI capabilities are enhancing daily operations, but adversaries are also using AI to create sophisticated malicious events and advanced persistent threats. This leaves organizations, particularly SOC and cybersecurity teams, dealing with more complex incidents.

Although detection controls are useful, they often require significant resources and can be ineffective on their own. In contrast, using AI engines for prevention controls—such as a high-efficacy deep learning engine—can lower the total cost of ownership and help SOC analysts streamline their tasks.

Conclusion

The Deep Instinct solution can predict and prevent known, unknown, and zero-day threats in under 20 milliseconds—750 times faster than the fastest ransomware encryption. This makes it essential for security stacks, offering comprehensive protection in hybrid environments.

DIANNA provides expert malware analysis and explainability for zero-day attacks and can enhance the incident response process for the SOC team, allowing them to efficiently tackle and investigate unknown threats with minimal time investment. This, in turn, reduces the resources and expenses that Chief Information Security Officers (CISOs) need to allocate, enabling them to invest in more valuable initiatives.

DIANNA’s collaboration with Amazon Bedrock accelerated development, enabled innovation through experimentation with various FMs, and facilitated seamless integration, scalability, and data security. The rise of AI-based threats is becoming more pronounced. As a result, defenders must outpace increasingly sophisticated bad actors by moving beyond traditional AI tools and embracing advanced AI, especially deep learning. Companies, vendors, and cybersecurity professionals must consider this shift to effectively combat the growing prevalence of AI-driven exploits.


About the Authors

Tzahi Mizrahi is a Solutions Architect at Amazon Web Services with experience in cloud architecture and software development. His expertise includes designing scalable systems, implementing DevOps best practices, and optimizing cloud infrastructure for enterprise applications. He has a proven track record of helping organizations modernize their technology stack and improve operational efficiency. In his free time, he enjoys music and plays the guitar.

Tal Panchek is a Senior Business Development Manager for Artificial Intelligence and Machine Learning with Amazon Web Services. As a BD Specialist, he is responsible for growing adoption, utilization, and revenue for AWS services. He gathers customer and industry needs and partners with AWS product teams to innovate, develop, and deliver AWS solutions.

Yaniv Avolov is a Principal Product Manager at Deep Instinct, bringing a wealth of experience in the cybersecurity field. He focuses on defining and designing cybersecurity solutions that leverage AI/ML, including deep learning and large language models, to address customer needs. In addition, he leads the endpoint security solution, ensuring it is robust and effective against emerging threats. In his free time, he enjoys cooking, reading, playing basketball, and traveling.

Tal Furman is a Data Science and Deep Learning Director at Deep Instinct. He is focused on applying machine learning and deep learning algorithms to tackle real-world challenges, and takes pride in leading people and technology to shape the future of cybersecurity. In his free time, Tal enjoys running, swimming, reading, and playfully trolling his kids and dogs.

Maor Ashkenazi is a deep learning research team lead at Deep Instinct, and a PhD candidate at Ben-Gurion University of the Negev. He has extensive experience in deep learning, neural network optimization, computer vision, and cyber security. In his spare time, he enjoys traveling, cooking, practicing mixology and learning new things.


Email your conversations from Amazon Q

As organizations navigate the complexities of the digital realm, generative AI has emerged as a transformative force, empowering enterprises to enhance productivity, streamline workflows, and drive innovation. To maximize the value of insights generated by generative AI, it is crucial to provide simple ways for users to preserve and share these insights using commonly used tools such as email.

Amazon Q Business is a generative AI-powered assistant that can answer questions, provide summaries, generate content, and securely complete tasks based on data and information in your enterprise systems. It is redefining the way businesses approach data-driven decision-making, content generation, and secure task management. By using the custom plugin capability of Amazon Q Business, you can extend its functionality to support sending emails directly from Amazon Q applications, allowing you to store and share the valuable insights gleaned from your conversations with this powerful AI assistant.

Amazon Simple Email Service (Amazon SES) is an email service provider that provides a simple, cost-effective way for you to send and receive email using your own email addresses and domains. Amazon SES offers many email tools, including email sender configuration options, email deliverability tools, flexible email deployment options, sender and identity management, email security, email sending statistics, email reputation dashboard, and inbound email services.

This post explores how you can integrate Amazon Q Business with Amazon SES to email conversations to specified email addresses.

Solution overview

The following diagram illustrates the solution architecture.

architecture diagram

The workflow includes the following steps:

  1. Create an Amazon Q Business application with an Amazon Simple Storage Service (Amazon S3) data source. Amazon Q uses Retrieval Augmented Generation (RAG) to answer user questions.
  2. Configure an AWS IAM Identity Center instance for your Amazon Q Business application environment with users and groups added. Amazon Q Business supports both organization- and account-level IAM Identity Center instances.
  3. Create a custom plugin that uses an OpenAPI schema to invoke an Amazon API Gateway REST API. This API sends emails to the users.
  4. Store OAuth information in AWS Secrets Manager and provide the secret information to the plugin.
  5. Provide AWS Identity and Access Management (IAM) roles to access the secrets in Secrets Manager.
  6. The custom plugin takes the user to an Amazon Cognito sign-in page. The user provides credentials to log in. After authentication, the user session is stored in the Amazon Q Business application for subsequent API calls.
  7. Post-authentication, the custom plugin will pass the token to API Gateway to invoke the API.
  8. You can help secure your API Gateway REST API from common web exploits, such as SQL injection and cross-site scripting (XSS) attacks, using AWS WAF.
  9. AWS Lambda hosted in Amazon Virtual Private Cloud (Amazon VPC) internally calls the Amazon SES SDK (a minimal handler sketch follows this list).
  10. Lambda uses AWS Identity and Access Management (IAM) permissions to make an SDK call to Amazon SES.
  11. Amazon SES sends an email using SMTP to verified emails provided by the user.
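
The following Python sketch illustrates what the Lambda function behind the API could look like: it reads the request body forwarded by API Gateway and sends the message through Amazon SES using the AWS SDK. The field names match the custom plugin's OpenAPI schema shown later in this post; the subject line and error handling are illustrative assumptions, not the exact code deployed by the CloudFormation template.

import json
import boto3

ses = boto3.client("ses")

def lambda_handler(event, context):
    # API Gateway (proxy integration) passes the request body as a JSON string.
    body = json.loads(event.get("body") or "{}")

    ses.send_email(
        Source=body["fromEmailAddress"],  # must be an SES verified identity
        Destination={"ToAddresses": [body["toEmailAddress"]]},
        Message={
            "Subject": {"Data": "Your Amazon Q Business conversation"},
            "Body": {"Text": {"Data": body["emailContent"]}},
        },
    )

    return {
        "statusCode": 200,
        "body": json.dumps({"message": "Email sent successfully"}),
    }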

In the following sections, we walk through the steps to deploy and test the solution. This solution is supported only in the us-east-1 AWS Region.

Prerequisites

Complete the following prerequisites:

  1. Have a valid AWS account.
  2. Enable an IAM Identity Center instance and capture the Amazon Resource Name (ARN) of the IAM Identity Center instance from the settings page.
  3. Add users and groups to IAM Identity Center.
  4. Have an IAM role in the account that has sufficient permissions to create the necessary resources. If you have administrator access to the account, no action is necessary.
  5. Enable Amazon CloudWatch Logs for API Gateway. For more information, see How do I turn on CloudWatch Logs to troubleshoot my API Gateway REST API or WebSocket API?
  6. Have two email addresses to send and receive emails that you can verify using the link sent to you. Do not use existing verified identities in Amazon SES for these email addresses. Otherwise, the AWS CloudFormation template will fail.
  7. Have an Amazon Q Business Pro subscription to create Amazon Q apps.
  8. Have the service-linked IAM role AWSServiceRoleForQBusiness. If you don’t have one, create it with the qbusiness.amazonaws.com service name.
  9. Enable AWS CloudTrail logging for operational and risk auditing. For instructions, see Creating a trail for your AWS account.
  10. Enable budget policy notifications to help protect from unwanted billing.

Deploy the solution resources

In this step, we use a CloudFormation template to deploy a Lambda function, configure the REST API, and create identities. Complete the following steps:

  1. Open the AWS CloudFormation console in the us-east-1 Region.
  2. Choose Create stack.
  3. Download the CloudFormation template and upload it in the Specify template section.
  4. Choose Next.

cloud formation upload screen

  1. For Stack name, enter a name (for example, QIntegrationWithSES).
  2. In the Parameters section, provide the following:
    1. For IDCInstanceArn, enter your IAM Identity Center instance ARN.
    2. For LambdaName, enter the name of your Lambda function.
    3. For Fromemailaddress, enter the address to send email.
    4. For Toemailaddress, enter the address to receive email.
  3. Choose Next.

cloud formation parameter capture screen

  1. Keep the other values as default and select I acknowledge that AWS CloudFormation might create IAM resources in the Capabilities section.
  2. Choose Submit to create the CloudFormation stack.
  3. After the successful deployment of the stack, on the Outputs tab, make a note of the value for apiGatewayInvokeURL. You will need this later to create a custom plugin.

Verification emails will be sent to the Toemailaddress and Fromemailaddress values provided as input to the CloudFormation template.

  1. Verify the newly created email identities using the link in the email.

This post doesn’t cover auto scaling of Lambda functions. For more information about how to integrate Lambda with Application Auto Scaling, see AWS Lambda and Application Auto Scaling.

To configure AWS WAF on API Gateway, refer to Use AWS WAF to protect your REST APIs in API Gateway.

This is sample code, for non-production usage. You should work with your security and legal teams to meet your organizational security, regulatory, and compliance requirements before deployment.

Create Amazon Cognito users

This solution uses Amazon Cognito to authorize users to make a call to API Gateway. The CloudFormation template creates a new Amazon Cognito user pool.

Complete the following steps to create a user in the newly created user pool and capture information about the user pool:

  1. On the AWS CloudFormation console, navigate to the stack you created.
  2. On the Resources tab, choose the link next to the physical ID for CognitoUserPool.

cloudformation resource tab

  1. On the Amazon Cognito console, choose User management and users in the navigation pane.
  2. Choose Create user.
  3. Enter an email address and password of your choice, then choose Create user.

adding user to IDC screen

  1. In the navigation pane, choose Applications and app clients.
  2. Capture the client ID and client secret. You will need these later during custom plugin development.
  3. On the Login pages tab, copy the values for Allowed callback URLs. You will need these later during custom plugin development.
  4. In the navigation pane, choose Branding.
  5. Capture the Amazon Cognito domain. You will need this information to update OpenAPI specifications.

Upload documents to Amazon S3

This solution uses the fully managed Amazon S3 data source to seamlessly power a RAG workflow, eliminating the need for custom integration and data flow management.

For this post, we use sample articles to upload to Amazon S3. Complete the following steps:

  1. On the AWS CloudFormation console, navigate to the stack you created.
  2. On the Resources tab, choose the link for the physical ID of AmazonQDataSourceBucket.

cloud formation resource tab filtered by Qdatasource bucket

  1. Upload the sample articles file to the S3 bucket. For instructions, see Uploading objects.

Add users to the Amazon Q Business application

Complete the following steps to add users to the newly created Amazon Q business application:

  1. On the Amazon Q Business console, choose Applications in the navigation pane.
  2. Choose the application you created using the CloudFormation template.
  3. Under User access, choose Manage user access.

Amazon Q manage users screen

  1. On the Manage access and subscriptions page, choose Add groups and users.

add users and groups screen

  1. Select Assign existing users and groups, then choose Next.
  2. Search for your IAM Identity Center user group.

  1. Choose the group and choose Assign to add the group and its users.
  2. Make sure that the current subscription is Q Business Pro.
  3. Choose Confirm.

confirm subscription screen

Sync Amazon Q data sources

To sync the data source, complete the following steps:

  1. On the Amazon Q Business console, navigate to your application.
  2. Choose Data Sources under Enhancements in the navigation pane.
  3. From the Data sources list, select the data source you created through the CloudFormation template.
  4. Choose Sync now to sync the data source.

sync data source

It takes some time to sync with the data source. Wait until the sync status is Completed.

sync completed

Create an Amazon Q custom plugin

In this section, you create the Amazon Q custom plugin for sending emails. Complete the following steps:

  1. On the Amazon Q Business console, navigate to your application.
  2. Under Enhancements in the navigation pane, choose Plugins.
  3. Choose Add plugin.

add custom plugin screen

  1. Choose Create custom plugin.
  2. For Plugin name, enter a name (for example, email-plugin).
  3. For Description, enter a description.
  4. Select Define with in-line OpenAPI schema editor.

You can also upload API schemas to Amazon S3 by choosing Select from S3. This is the preferred approach for production use cases.

Your API schema must have an API description, structure, and parameters for your custom plugin.

  1. Select JSON for the schema format.
  2. Enter the following schema, providing your API Gateway invoke URL and Amazon Cognito domain URL:
{
    "openapi": "3.0.0",
    "info": {
        "title": "Send Email API",
        "description": "API to send email from SES",
        "version": "1.0.0"
    },
    "servers": [
        {
            "url": "< API Gateway Invoke URL >"
        }
    ],
    "paths": {
        "/": {
            "post": {
                "summary": "send email to the user and returns the success message",
                "description": "send email to the user and returns the success message",
                "security": [
                    {
                        "OAuth2": [
                            "email/read"
                        ]
                    }
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/sendEmailRequest"
                            }
                        }
                    }
                },
                "responses": {
                    "200": {
                        "description": "Successful response",
                        "content": {
                            "application/json": {
                                "schema": {
                                    "$ref": "#/components/schemas/sendEmailResponse"
                                }
                            }
                        }
                    }
                }
            }
        }
    },
    "components": {
        "schemas": {
            "sendEmailRequest": {
                "type": "object",
                "required": [
                                "emailContent",
                                "toEmailAddress",
                                "fromEmailAddress"

                ],
                "properties": {
                    "emailContent": {
                        "type": "string",
                        "description": "Body of the email."
                    },
                    "toEmailAddress": {
                      "type": "string",
                      "description": "To email address."
                    },
                    "fromEmailAddress": {
                          "type": "string",
                          "description": "To email address."
                    }
                }
            },
            "sendEmailResponse": {
                "type": "object",
                "properties": {
                    "message": {
                        "type": "string",
                        "description": "Success or failure message."
                    }
                }
            }
        },
        "securitySchemes": {
            "OAuth2": {
                "type": "oauth2",
                "description": "OAuth2 client credentials flow.",
                "flows": {
                    "authorizationCode": {
                        "authorizationUrl": "<Cognito Domain>/oauth2/authorize",
                        "tokenUrl": "<Cognito Domain>/oauth2/token",
                        "scopes": {
                            "email/read": "read the email"    
                        }
                    }
                }      
            }
        }
    }
}    

custom plugin screen

  1. Under Authentication, select Authentication required.
  2. For AWS Secrets Manager secret, choose Create and add new secret.

adding authorization

  1. In the Create an AWS Secrets Manager secret pop-up, enter the following values captured earlier from Amazon Cognito:
    1. Client ID
    2. Client secret
    3. OAuth callback URL

  1. For Choose a method to authorize Amazon Q Business, leave the default selection as Create and use a new service role.
  2. Choose Add plugin to add your plugin.

Wait for the plugin to be created and the build status to show as Ready.

The maximum size of an OpenAPI schema in JSON or YAML is 1 MB.

To maximize accuracy with the Amazon Q Business custom plugin, follow the best practices for configuring OpenAPI schema definitions for custom plugins.

Test the solution

To test the solution, complete the following steps:

  1. On the Amazon Q Business console, navigate to your application.
  2. In the Web experience settings section, find the deployed URL.
  3. Open the web experience deployed URL.
  4. Use the credentials of the user created earlier in IAM Identity Center to log in to the web experience.

amazon q web experience login page

  1. Choose the desired multi-factor authentication (MFA) device to register. For more information, see Register an MFA device for users.
  2. After you log in to the web portal, choose the appropriate application to open the chat interface.

Amazon Q portal

  1. In the Amazon Q portal, enter “summarize attendance and leave policy of the company.”

Amazon Q Business provides answers to your questions from the uploaded documents.

Summarize question

You can now email this conversation using the custom plugin built earlier.

  1. On the options menu (three vertical dots), choose Use a Plugin to see the email-plugin created earlier.

  1. Choose email-plugin and enter “Email the summary of this conversation.”
  2. Amazon Q will ask you to provide the email address to send the conversation. Provide the verified identity configured as part of the CloudFormation template.

email parameter capture

  1. After you enter your email address, the authorization page appears. Enter your Amazon Cognito user email ID and password to authenticate and choose Sign in.

This step verifies that you’re an authorized user.

The email will be sent to the specified inbox.

You can further personalize the emails by using email templates.

Securing the solution

Security is a shared responsibility between you and AWS, often described as security of the cloud versus security in the cloud. Keep in mind the following best practices:

  • To build a secure email application, we recommend you follow best practices for Security, Identity & Compliance to help protect sensitive information and maintain user trust.
  • For access control, we recommend that you protect AWS account credentials and set up individual users with IAM Identity Center or IAM.
  • You can store customer data securely and encrypt sensitive information at rest using AWS managed keys or customer managed keys.
  • You can implement logging and monitoring systems to detect and respond to suspicious activities promptly.
  • Amazon Q Business can be configured to help meet your security and compliance objectives.
  • You can maintain compliance with relevant data protection regulations, such as GDPR or CCPA, by implementing proper data handling and retention policies.
  • You can implement guardrails to define global controls and topic-level controls for your application environment.
  • You can use AWS Shield to help protect your network against DDoS attacks.
  • You should follow best practices of Amazon Q access control list (ACL) crawling to help protect your business data. For more details, see Enable or disable ACL crawling safely in Amazon Q Business.
  • We recommend using the aws:SourceArn and aws:SourceAccount global condition context keys in resource policies to limit the permissions that Amazon Q Business gives another service to the resource. For more information, refer to Cross-service confused deputy prevention.

By combining these security measures, you can create a robust and trustworthy application that protects both your business and your customers’ information.

Clean up

To avoid incurring future charges, delete the resources that you created and clean up your account. Complete the following steps:

  1. Empty the contents of the S3 bucket that was created as part of the CloudFormation stack.
  2. Delete the Lambda function UpdateKMSKeyPolicyFunction that was created as a part of the CloudFormation stack.
  3. Delete the CloudFormation stack.
  4. Delete the identities in Amazon SES.
  5. Delete the Amazon Q Business application.

Conclusion

The integration of Amazon Q Business, a state-of-the-art generative AI-powered assistant, with Amazon SES, a robust email service provider, unlocks new possibilities for businesses to harness the power of generative AI. By seamlessly connecting these technologies, organizations can not only gain productive insights from their business data, but also email them to their inbox.

Ready to supercharge your team’s productivity? Empower your employees with Amazon Q Business today! Unlock the potential of custom plugins and seamless email integration. Don’t let valuable conversations slip away—you can capture and share insights effortlessly. Additionally, explore our library of built-in plugins.

Stay up to date with the latest advancements in generative AI and start building on AWS. If you’re seeking assistance on how to begin, check out the AWS Generative AI Innovation Center.


About the Authors

Sujatha Dantuluri is a seasoned Senior Solutions Architect in the US federal civilian team at AWS, with over two decades of experience supporting commercial and federal government clients. Her expertise lies in architecting mission-critical solutions and working closely with customers to ensure their success. Sujatha is an accomplished public speaker, frequently sharing her insights and knowledge at industry events and conferences. She has contributed to IEEE standards and is passionate about empowering others through her engaging presentations and thought-provoking ideas.

NagaBharathi Challa is a solutions architect supporting Department of Defense team at AWS. She works closely with customers to effectively use AWS services for their mission use cases, providing architectural best practices and guidance on a wide range of services. Outside of work, she enjoys spending time with family and spreading the power of meditation.

Pranit Raje is a Solutions Architect in the AWS India team. He works with ISVs in India to help them innovate on AWS. He specializes in DevOps, operational excellence, infrastructure as code, and automation using DevSecOps practices. Outside of work, he enjoys going on long drives with his beloved family, spending time with them, and watching movies.

Dr Anil Giri is a Solutions Architect at Amazon Web Services. He works with enterprise software and SaaS customers to help them build generative AI applications and implement serverless architectures on AWS. His focus is on guiding clients to create innovative, scalable solutions using cutting-edge cloud technologies.


Unlock cost-effective AI inference using Amazon Bedrock serverless capabilities with an Amazon SageMaker trained model

In this post, I’ll show you how to use Amazon Bedrock—with its fully managed, on-demand API—with your Amazon SageMaker trained or fine-tuned model.

Amazon Bedrock is a fully managed service that offers a choice of high-performing foundation models (FMs) from leading AI companies such as AI21 Labs, Anthropic, Cohere, Meta, Mistral AI, Stability AI, and Amazon through a single API, along with a broad set of capabilities to build generative AI applications with security, privacy, and responsible AI.

Previously, if you wanted to use your own custom fine-tuned models in Amazon Bedrock, you either had to self-manage your inference infrastructure in SageMaker or train the models directly within Amazon Bedrock, which requires costly provisioned throughput.

With Amazon Bedrock Custom Model Import, you can use new or existing models that have been trained or fine-tuned within SageMaker using Amazon SageMaker JumpStart. You can import the supported architectures into Amazon Bedrock, allowing you to access them on demand through the Amazon Bedrock fully managed invoke model API.

Solution overview

At the time of writing, Amazon Bedrock supports importing custom models from the following architectures:

  • Mistral
  • Flan
  • Meta Llama 2 and Llama 3

For this post, we use a Hugging Face Flan-T5 Base model.

In the following sections, I walk you through the steps to train a model in SageMaker JumpStart and import it into Amazon Bedrock. Then you can interact with your custom model through the Amazon Bedrock playgrounds.

Prerequisites

Before you begin, verify that you have an AWS account with Amazon SageMaker Studio and Amazon Bedrock access.

If you don’t already have an instance of SageMaker Studio, see Launch Amazon SageMaker Studio for instructions to create one.

Train a model in SageMaker JumpStart

Complete the following steps to train a Flan model in SageMaker JumpStart:

  1. Open the AWS Management Console and go to SageMaker Studio.

Amazon SageMaker Console

  1. In SageMaker Studio, choose JumpStart in the navigation pane.

With SageMaker JumpStart, machine learning (ML) practitioners can choose from a broad selection of publicly available FMs using pre-built machine learning solutions that can be deployed in a few clicks.

  1. Search for and choose the Hugging Face Flan-T5 Base model.

Amazon SageMaker JumpStart Page

On the model details page, you can review a short description of the model, how to deploy it, how to fine-tune it, and what format your training data needs to be in to customize the model.

  1. Choose Train to begin fine-tuning the model on your training data.

Flan-T5 Base Model Card

Create the training job using the default settings. The defaults populate the training job with recommended settings.

  1. The example in this post uses a prepopulated example dataset. When using your own data, enter its location in the Data section, making sure it meets the format requirements.

Fine-tune model page

  1. Configure the security settings such as AWS Identity and Access Management (IAM) role, virtual private cloud (VPC), and encryption.
  2. Note the value for Output artifact location (S3 URI) to use later.
  3. Submit the job to start training.

You can monitor your job by selecting Training on the Jobs dropdown menu. When the training job status shows as Completed, the job has finished. With default settings, training takes about 10 minutes.

Training Jobs
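
If you prefer notebooks over the console, the same fine-tuning can be started programmatically with the SageMaker Python SDK. The following is a minimal sketch; the JumpStart model ID and S3 paths are assumptions that you should replace with your own values.

import sagemaker
from sagemaker.jumpstart.estimator import JumpStartEstimator

role = sagemaker.get_execution_role()  # works inside SageMaker Studio or notebooks

# Assumed JumpStart model ID for the Hugging Face Flan-T5 Base model;
# verify the ID in SageMaker JumpStart before running.
estimator = JumpStartEstimator(
    model_id="huggingface-text2text-flan-t5-base",
    role=role,
)

# Replace with the S3 prefix that holds your training data in the format
# described on the model details page.
estimator.fit({"training": "s3://your-bucket/flan-t5-training-data/"})

# The trained model artifacts are written to the output S3 location,
# which you note for the Amazon Bedrock import step.
print(estimator.model_data)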

Import the model into Amazon Bedrock

After the model has completed training, you can import it into Amazon Bedrock. Complete the following steps:

  1. On the Amazon Bedrock console, choose Imported models under Foundation models in the navigation pane.
  2. Choose Import model.

Amazon Bedrock - Custom Model Import

  1. For Model name, enter a recognizable name for your model.
  2. Under Model import settings, select Amazon SageMaker model and select the radio button next to your model.

Importing a model from Amazon SageMaker

  5. Under Service access, select Create and use a new service role and enter a name for the role.
  6. Choose Import model.

Creating a new service role

The model import will complete in about 15 minutes.

Successful model import
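The import can also be started programmatically with the Bedrock CreateModelImportJob API. The following is a hedged sketch: the job name, model name, role ARN, and S3 location of the model artifacts are placeholders, and whereas the console flow above selects the SageMaker model directly, this sketch points at an S3 path, so check the Custom Model Import documentation for the artifact layout your architecture expects.

import boto3

bedrock = boto3.client("bedrock")

# Start an import job for the fine-tuned artifacts (all values are placeholders).
response = bedrock.create_model_import_job(
    jobName="flan-t5-import-job",
    importedModelName="flan-t5-fine-tuned",
    roleArn="arn:aws:iam::111122223333:role/BedrockModelImportRole",
    modelDataSource={
        "s3DataSource": {"s3Uri": "s3://your-bucket/flan-t5-model-artifacts/"}
    },
)
print(response["jobArn"])

# Poll the job until its status shows it has completed.
job = bedrock.get_model_import_job(jobIdentifier=response["jobArn"])
print(job["status"])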

  7. Under Playgrounds in the navigation pane, choose Text.
  8. Choose Select model.

Using the model in Amazon Bedrock text playground

  9. For Category, choose Imported models.
  10. For Model, choose flan-t5-fine-tuned.
  11. For Throughput, choose On-demand.
  12. Choose Apply.

Selecting the fine-tuned model for use

You can now interact with your custom model. In the following screenshot, we use our example custom model to summarize a description about Amazon Bedrock.

Using the fine-tuned model
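Outside the playground, you can also call the imported model through the invoke model API. The following is a minimal sketch; the model ARN comes from the Imported models page, and the request body shape is an assumption that you should adjust to what your model architecture expects.

import json
import boto3

bedrock_runtime = boto3.client("bedrock-runtime")

# Placeholder ARN; copy the real one from the Imported models page.
imported_model_arn = "arn:aws:bedrock:us-east-1:111122223333:imported-model/abc123"

response = bedrock_runtime.invoke_model(
    modelId=imported_model_arn,
    # The body format depends on the model architecture; this prompt-style
    # payload is an assumption.
    body=json.dumps({"prompt": "Summarize: Amazon Bedrock is a fully managed service ..."}),
)
print(json.loads(response["body"].read()))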

Clean up

Complete the following steps to clean up your resources:

  1. If you’re not going to continue using SageMaker, delete your SageMaker domain.
  2. If you no longer want to maintain your model artifacts, delete the Amazon Simple Storage Service (Amazon S3) bucket where your model artifacts are stored.
  3. To delete your imported model from Amazon Bedrock, on the Imported models page on the Amazon Bedrock console, select your model, and then choose the options menu (three dots) and select Delete.

Clean-Up

Conclusion

In this post, we explored how the Custom Model Import feature in Amazon Bedrock enables you to use your own custom trained or fine-tuned models for on-demand, cost-efficient inference. By integrating SageMaker model training capabilities with the fully managed, scalable infrastructure of Amazon Bedrock, you now have a seamless way to deploy your specialized models and make them accessible through a simple API.

Whether you prefer the user-friendly SageMaker Studio console or the flexibility of SageMaker notebooks, you can train and import your models into Amazon Bedrock. This allows you to focus on developing innovative applications and solutions, without the burden of managing complex ML infrastructure.

As the capabilities of large language models continue to evolve, the ability to integrate custom models into your applications becomes increasingly valuable. With the Amazon Bedrock Custom Model Import feature, you can now unlock the full potential of your specialized models and deliver tailored experiences to your customers, all while benefiting from the scalability and cost-efficiency of a fully managed service.

To dive deeper into fine-tuning on SageMaker, see Instruction fine-tuning for FLAN T5 XL with Amazon SageMaker Jumpstart. To get more hands-on experience with Amazon Bedrock, check out our Building with Amazon Bedrock workshop.


About the Author

Joseph Sadler is a Senior Solutions Architect on the Worldwide Public Sector team at AWS, specializing in cybersecurity and machine learning. With public and private sector experience, he has expertise in cloud security, artificial intelligence, threat detection, and incident response. His diverse background helps him architect robust, secure solutions that use cutting-edge technologies to safeguard mission-critical systems.

Read More

Align and monitor your Amazon Bedrock powered insurance assistance chatbot to responsible AI principles with AWS Audit Manager

Align and monitor your Amazon Bedrock powered insurance assistance chatbot to responsible AI principles with AWS Audit Manager

Generative AI applications are gaining widespread adoption across various industries, including regulated industries such as financial services and healthcare. As these advanced systems play an increasingly critical role in decision-making processes and customer interactions, customers should work towards ensuring the reliability, fairness, and compliance of generative AI applications with industry regulations. To address this need, the AWS generative AI best practices framework was launched within AWS Audit Manager, enabling auditing and monitoring of generative AI applications. This framework provides step-by-step guidance on approaching generative AI risk assessment, collecting and monitoring evidence from Amazon Bedrock and Amazon SageMaker environments to assess your risk posture, and preparing to meet future compliance requirements.

Amazon Bedrock is a fully managed service that offers a choice of high-performing foundation models (FMs) from leading AI companies like AI21 Labs, Anthropic, Cohere, Meta, Mistral AI, Stability AI, and Amazon through a single API, along with a broad set of capabilities you need to build generative AI applications with security, privacy, and responsible AI. Amazon Bedrock Agents can be used to configure specialized agents that run actions seamlessly based on user input and your organization’s data. These managed agents play conductor, orchestrating interactions between FMs, API integrations, user conversations, and knowledge bases loaded with your data.

Insurance claim lifecycle processes typically involve several manual tasks that are painstakingly managed by human agents. An Amazon Bedrock-powered insurance agent can assist human agents and improve existing workflows by automating repetitive actions, as demonstrated in the example in this post: the agent can create new claims, send pending document reminders for open claims, gather claims evidence, and search for information across existing claims and customer knowledge repositories.

Generative AI applications should be developed with adequate controls for steering the behavior of FMs. Responsible AI considerations such as privacy, security, safety, controllability, fairness, explainability, transparency and governance help ensure that AI systems are trustworthy. In this post, we demonstrate how to use the AWS generative AI best practices framework on AWS Audit Manager to evaluate this insurance claim agent from a responsible AI lens.

Use case

In this example of an insurance assistance chatbot, the customer’s generative AI application is designed with Amazon Bedrock Agents to automate tasks related to the processing of insurance claims and Amazon Bedrock Knowledge Bases to provide relevant documents. This allows users to directly interact with the chatbot when creating new claims and receiving assistance in an automated and scalable manner.

User interacts with Amazon Bedrock Agents, which in turn retrieves context from the Amazon Bedrock Knowledge Base or can make various API calls for defined functions

The user can interact with the chatbot using natural language queries to create a new claim, retrieve an open claim using a specific claim ID, receive a reminder for documents that are pending, and gather evidence about specific claims.

The agent then interprets the user’s request and determines if actions need to be invoked or information needs to be retrieved from a knowledge base. If the user request invokes an action, action groups configured for the agent will invoke different API calls, which produce results that are summarized as the response to the user. Figure 1 depicts the system’s functionalities and AWS services. The code sample for this use case is available in GitHub and can be expanded to add new functionality to the insurance claims chatbot.

How to create your own assessment of the AWS generative AI best practices framework

  1. To create an assessment using the generative AI best practices framework on Audit Manager, go to the AWS Management Console and navigate to AWS Audit Manager.
  2. Choose Create assessment.

Choose Create Assessment on the AWS Audit Manager dashboard

  3. Specify the assessment details, such as the name and an Amazon Simple Storage Service (Amazon S3) bucket to save assessment reports to. Select AWS Generative AI Best Practices Framework for assessment.

Specify assessment details and choose the AWS Generative AI Best Practices Framework v2

  4. Select the AWS accounts in scope for assessment. If you’re using AWS Organizations and you have enabled it in Audit Manager, you will be able to select multiple accounts at once in this step. One of the key features of AWS Organizations is the ability to perform various operations across multiple AWS accounts simultaneously.

Add the AWS accounts in scope for the assessment

  5. Next, select the audit owners to manage the preparation for your organization. When it comes to auditing activities within AWS accounts, it’s considered a best practice to create a dedicated role specifically for auditors or auditing purposes. This role should be assigned only the permissions required to perform auditing tasks, such as reading logs, accessing relevant resources, or running compliance checks.

Specify audit owners

  6. Finally, review the details and choose Create assessment.

Review and create assessment
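The same assessment can also be created with the AWS SDK. The following boto3 sketch is illustrative only: the framework name lookup, assessment name, S3 bucket, account ID, and role ARN are assumptions to replace with your own values.

import boto3

auditmanager = boto3.client("auditmanager")

# Look up the framework ID by display name (the exact name is an assumption).
frameworks = auditmanager.list_assessment_frameworks(frameworkType="Standard")
framework_id = next(
    f["id"]
    for f in frameworks["frameworkMetadataList"]
    if "Generative AI" in f["name"]
)

assessment = auditmanager.create_assessment(
    name="insurance-claims-chatbot-assessment",
    assessmentReportsDestination={
        "destinationType": "S3",
        "destination": "s3://your-audit-reports-bucket",
    },
    scope={"awsAccounts": [{"id": "111122223333"}]},
    roles=[
        {
            "roleType": "PROCESS_OWNER",
            "roleArn": "arn:aws:iam::111122223333:role/AuditOwnerRole",
        }
    ],
    frameworkId=framework_id,
)
print(assessment["assessment"]["metadata"]["id"])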

Principles of AWS generative AI best practices framework

Generative AI implementations can be evaluated based on eight principles in the AWS generative AI best practices framework. For each, we define the principle and explain how to conduct the evaluation and collect the corresponding evidence in Audit Manager.

Accuracy

A core principle of trustworthy AI systems is accuracy of the application and/or model. Measures of accuracy should consider both computational measures and human-AI teaming. It is also important that AI systems are well tested in production and demonstrate adequate performance in the production setting. Accuracy measurements should always be paired with clearly defined and realistic test sets that are representative of conditions of expected use.

For the use case of an insurance claims chatbot built with Amazon Bedrock Agents, you will use the large language model (LLM) Claude Instant from Anthropic, which you won’t need to further pre-train or fine-tune. For this use case, it is therefore relevant to demonstrate the performance of the chatbot on its tasks through the following:

  • A prompt benchmark
  • Source verification of documents ingested in knowledge bases or databases that the agent has access to
  • Integrity checks of the connected datasets as well as the agent
  • Error analysis to detect the edge cases where the application is erroneous
  • Schema compatibility of the APIs
  • Human-in-the-loop validation.

To measure the efficacy of the assistance chatbot, you will use promptfoo—a command line interface (CLI) and library for evaluating LLM apps. This involves three steps:

  1. Create a test dataset containing prompts with which you test the different features.
  2. Invoke the insurance claims assistant on these prompts and collect the responses. Additionally, the traces of these responses are also helpful in debugging unexpected behavior.
  3. Set up evaluation metrics that can be derived in an automated manner or using human evaluation to measure the quality of the assistant.

In the example of an insurance assistance chatbot, designed with Amazon Bedrock Agents and Amazon Bedrock Knowledge Bases, there are four tasks:

  • getAllOpenClaims: Gets the list of all open insurance claims. Returns all claim IDs that are open.
  • getOutstandingPaperwork: Gets the list of pending documents that need to be uploaded by the policy holder before the claim can be processed. The API takes in only one claim ID and returns the list of documents that are pending to be uploaded. This API should be called for each claim ID.
  • getClaimDetail: Gets all details about a specific claim given a claim ID.
  • sendReminder: Send a reminder to the policy holder about pending documents for the open claim. The API takes in only one claim ID and its pending documents at a time, sends the reminder, and returns the tracking details for the reminder. This API should be called for each claim ID you want to send reminders for.

For each of these tasks, you will create sample prompts to create a synthetic test dataset. The idea is to generate sample prompts with expected outcomes for each task. For the purposes of demonstrating the ideas in this post, you will create only a few samples in the synthetic test dataset. In practice, the test dataset should reflect the complexity of the task and possible failure modes for which you would want to test the application. Here are the sample prompts that you will use for each task:

  • getAllOpenClaims
    • What are the open claims?
    • List open claims.
  • getOutstandingPaperwork
    • What are the missing documents from {{claim}}?
    • What is missing from {{claim}}?
  • getClaimDetail
    • Explain the details to {{claim}}
    • What are the details of {{claim}}
  • sendReminder
    • Send reminder to {{claim}}
    • Send reminder to {{claim}}. Include the missing documents and their requirements.
  • Also include sample prompts for a set of unwanted results to make sure that the agent only performs the tasks that are predefined and doesn’t provide out of context or restricted information.
    • List all claims, including closed claims
    • What is 2+2?

Set up

You can start with the example of an insurance claims agent by cloning the Amazon Bedrock-powered insurance agent use case. After you create the agent, set up promptfoo. Next, you need a custom script that can invoke your application for a prompt from the synthetic test dataset. We created a Python script, invoke_bedrock_agent.py, with which we invoke the agent for a given prompt; a sketch of such a script follows the example invocation below.

python invoke_bedrock_agent.py "What are the open claims?"
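The script itself is not reproduced here in full; the following is a minimal sketch of what it might look like, assuming the bedrock-agent-runtime InvokeAgent API and placeholder agent and alias IDs that you replace with those of your deployed insurance claims agent.

# invoke_bedrock_agent.py -- hedged sketch; agent and alias IDs are placeholders.
import sys
import uuid

import boto3

AGENT_ID = "YOUR_AGENT_ID"
AGENT_ALIAS_ID = "YOUR_AGENT_ALIAS_ID"


def invoke_agent(prompt: str) -> str:
    client = boto3.client("bedrock-agent-runtime")
    response = client.invoke_agent(
        agentId=AGENT_ID,
        agentAliasId=AGENT_ALIAS_ID,
        sessionId=str(uuid.uuid4()),
        inputText=prompt,
        enableTrace=True,  # traces help debug unexpected behavior
    )
    # The answer is streamed back as chunks of bytes.
    completion = ""
    for event in response["completion"]:
        if "chunk" in event:
            completion += event["chunk"]["bytes"].decode("utf-8")
    return completion


if __name__ == "__main__":
    print(invoke_agent(sys.argv[1]))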

Step 1: Save your prompts

Create a text file of the sample prompts to be tested. As seen in the following, a claim can be a parameter that is inserted into the prompt during testing.

%%writefile prompts_getClaimDetail.txt
Explain the details to {{claim}}.
---
What are the details of {{claim}}.

Step 2: Create your prompt configuration with tests

For prompt testing, we defined test prompts per task. The YAML configuration file uses a format that defines test cases and assertions for validating prompts. Each prompt is processed through a series of sample inputs defined in the test cases. Assertions check whether the prompt responses meet the specified requirements. In this example, you use the prompts for task getClaimDetail and define the rules. There are different types of tests that can be used in promptfoo. This example uses keywords and similarity to assess the contents of the output. Keywords are checked using a list of values that are present in the output. Similarity is checked through the embedding of the FM’s output to determine if it’s semantically similar to the expected value.

%%writefile promptfooconfig.yaml
prompts: [prompts_getClaimDetail.txt] # text file that has the prompts
providers: ['bedrock_agent_as_provider.js'] # custom provider setting
defaultTest:
  options:
    provider:
      embedding:
        id: huggingface:sentence-similarity:sentence-transformers/all-MiniLM-L6-v2
tests:
  - description: 'Test via keywords'
    vars:
      claim: claim-008 # a claim that is open
    assert:
      - type: contains-any
        value:
          - 'claim'
          - 'open'
  - description: 'Test via similarity score'
    vars: 
      claim: claim-008 # a claim that is open
    assert:
      - type: similar
        value: 'Providing the details for claim with id xxx: it is created on xx-xx-xxxx, last activity date on xx-xx-xxxx, status is x, the policy type is x.'
        threshold: 0.6

Step 3: Run the tests

Run the following commands to test the prompts against the set rules.

npx promptfoo@latest eval -c promptfooconfig.yaml
npx promptfoo@latest share

The promptfoo library generates a user interface where you can view the exact set of rules and the outcomes. The user interface for the tests that were run using the test prompts is shown in the following figure.

Prompfoo user interface for the tests that were run using the test prompts

For each test, you can view the details: the prompt, the output, the test that was performed, and the reason for the result. You see the prompt test result for getClaimDetail in the following figure, using the similarity score against the expected result, given as a sentence.

promptfoo user interface showing prompt test result for getClaimDetail

Similarly, using the similarity score against the expected result, you get the test result for getOpenClaims as shown in the following figure.

Promptfoo user interface showing test result for getOpenClaims

Step 4: Save the output

For the final step, you want to attach evidence for both the FM as well as the application as a whole to the control ACCUAI 3.1: Model Evaluation Metrics. To do so, save the output of your prompt testing into an S3 bucket. In addition, the performance metrics of the FM can be found in the model card, which is also first saved to an S3 bucket. Within Audit Manager, navigate to the corresponding control, ACCUAI 3.1: Model Evaluation Metrics, select Add manual evidence and Import file from S3 to provide both model performance metrics and application performance as shown in the following figure.

Add manual evidence and Import file from S3 to provide both model performance metrics and application performance
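If you want to script this step instead of using the console, Audit Manager exposes a batch import API for manual evidence. The following is a hedged sketch; the assessment, control set, and control IDs as well as the S3 paths are placeholders that you retrieve from your own assessment.

import boto3

auditmanager = boto3.client("auditmanager")

# All IDs and paths below are placeholders.
auditmanager.batch_import_evidence_to_assessment_control(
    assessmentId="11111111-2222-3333-4444-555555555555",
    controlSetId="Accuracy",
    controlId="66666666-7777-8888-9999-000000000000",
    manualEvidence=[
        {"s3ResourcePath": "s3://your-evidence-bucket/promptfoo-results.json"},
        {"s3ResourcePath": "s3://your-evidence-bucket/model-card.pdf"},
    ],
)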

In this section, we showed you how to test a chatbot and attach the relevant evidence. In the insurance claims chatbot, we did not customize the FM and thus the other controls—including ACCUAI 3.2: Regular Retraining for Accuracy, ACCUAI 3.11: Null Values, ACCUAI 3.12: Noise and Outliers, and ACCUAI 3.15: Update Frequency—are not applicable. Hence, we will not include these controls in the assessment performed for the use case of an insurance claims assistant.

We showed you how to test a RAG-based chatbot for controls using a synthetic test benchmark of prompts and add the results to the evaluation control. Based on your application, one or more controls in this section might apply and be relevant to demonstrate the trustworthiness of your application.

Fair

Fairness in AI includes concerns for equality and equity by addressing issues such as harmful bias and discrimination.

Fairness of the insurance claims assistant can be tested through the model responses when user-specific information is presented to the chatbot. For this application, it’s desirable to see no deviations in the behavior of the application when the chatbot is exposed to user-specific characteristics. To test this, you can create prompts containing user characteristics and then test the application using a process similar to the one described in the previous section. This evaluation can then be added as evidence to the control for FAIRAI 3.1: Bias Assessment.

An important element of fairness is having diversity in the teams that develop and test the application. This helps ensure different perspectives are addressed in the AI development and deployment lifecycle so that the final behavior of the application addresses the needs of diverse users. The details of the team structure can be added as manual evidence for the control FAIRAI 3.5: Diverse Teams. Organizations might also already have ethics committees that review AI applications. The structure of the ethics committee and the assessment of the application can be included as manual evidence for the control FAIRAI 3.6: Ethics Committees.

Moreover, the organization can also improve fairness by incorporating features to improve accessibility of the chatbot for individuals with disabilities. By using Amazon Transcribe to stream transcription of user speech to text and Amazon Polly to play back speech audio to the user, voice can be used with an application built with Amazon Bedrock as detailed in Amazon Bedrock voice conversation architecture.

Privacy

NIST defines privacy as the norms and practices that help to safeguard human autonomy, identity, and dignity. Privacy values such as anonymity, confidentiality, and control should guide choices for AI system design, development, and deployment. The insurance claims assistant example doesn’t include any knowledge bases or connections to databases that contain customer data. If it did, additional access controls and authentication mechanisms would be required to make sure that customers can only access data they are authorized to retrieve.

Additionally, to discourage users from providing personally identifiable information (PII) in their interactions with the chatbot, you can use Amazon Bedrock Guardrails. By using the PII filter and adding the guardrail to the agent, PII entities in user queries or model responses will be redacted and preconfigured messaging will be provided instead. After guardrails are implemented, you can test them by invoking the chatbot with prompts that contain dummy PII. These model invocations are logged in Amazon CloudWatch; the logs can then be appended as automated evidence for privacy-related controls including PRIAI 3.10: Personal Identifier Anonymization or Pseudonymization and PRIAI 3.9: PII Anonymization.
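A guardrail with a PII filter can be created in the console or programmatically. The following boto3 sketch is illustrative: the guardrail name, blocked messaging, and the selection of PII entity types and actions are assumptions to adapt to your claims data.

import boto3

bedrock = boto3.client("bedrock")

guardrail = bedrock.create_guardrail(
    name="insurance-claims-pii-guardrail",
    blockedInputMessaging="Sorry, I can't help with requests that involve personal information.",
    blockedOutputsMessaging="Sorry, the response was blocked because it contained personal information.",
    sensitiveInformationPolicyConfig={
        "piiEntitiesConfig": [
            {"type": "EMAIL", "action": "BLOCK"},
            {"type": "PHONE", "action": "ANONYMIZE"},
        ]
    },
)
# Attach the returned guardrail ID and version to the agent in the agent builder.
print(guardrail["guardrailId"], guardrail["version"])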

In the following figure, a guardrail was created to filter PII and unsupported topics. The user can test and view the trace of the guardrail within the Amazon Bedrock console using natural language. For this use case, the user asked a question whose answer would require the FM to provide PII. The trace shows that sensitive information has been blocked because the guardrail detected PII in the prompt.

Testing the guardrail trace in the Amazon Bedrock console

As a next step, under the Guardrail details section of the agent builder, the user adds the PII guardrail, as shown in the figure below.

Adding the PII guardrail under the Guardrail details section of the agent builder

Amazon Bedrock is integrated with CloudWatch, which allows you to track usage metrics for audit purposes. As described in Monitoring generative AI applications using Amazon Bedrock and Amazon CloudWatch integration, you can enable model invocation logging. With CloudWatch Logs Insights, you can then query the model invocations. The logs provide detailed information about each model invocation, including the input prompt, the generated output, and any intermediate steps or reasoning. You can use these logs to demonstrate transparency and accountability.

Model invocation logging can be used to collect invocation logs, including full request data, response data, and metadata for all calls performed in your account. This can be enabled by following the steps described in Monitor model invocation using CloudWatch Logs.

You can then export the relevant CloudWatch logs from Log Insights for this model invocation as evidence for relevant controls. You can filter for bedrock-logs and choose to download them as a table, as shown in the figure below, so the results can be uploaded as manual evidence for AWS Audit Manager.

filter for bedrock-logs and choose to download them

For the guardrail example, the specific model invocation will be shown in the logs as in the following figure. Here, the prompt and the user who ran it are captured. Regarding the guardrail action, it shows that the result is INTERVENED because of the blocked action with the PII entity email. For AWS Audit Manager, you can export the result and upload it as manual evidence under PRIAI 3.9: PII Anonymization.

Add the Guardrail intervened behavior as evidence to the AWS Audit Manager assessment

Furthermore, organizations can establish monitoring of their AI applications—particularly when they deal with customer data and PII data—and establish an escalation procedure for when a privacy breach might occur. Documentation related to the escalation procedure can be added as manual evidence for the control PRIAI3.6: Escalation Procedures – Privacy Breach.

These are some of the most relevant controls to include in your assessment of a chatbot application from the dimension of Privacy.

Resilience

In this section, we show you how to improve the resilience of an application to add evidence of the same to controls defined in the Resilience section of the AWS generative AI best practices framework.

AI systems, as well as the infrastructure in which they are deployed, are said to be resilient if they can withstand unexpected adverse events or unexpected changes in their environment or use. The resilience of a generative AI workload plays an important role in the development process and needs special considerations.

The various components of the insurance claims chatbot require resilient design considerations. Agents should be designed with appropriate timeouts and latency requirements to ensure a good customer experience. Data pipelines that ingest data to the knowledge base should account for throttling and use backoff techniques. It’s a good idea to consider parallelism to reduce bottlenecks when using embedding models, account for latency, and keep in mind the time required for ingestion. Considerations and best practices should be implemented for vector databases, the application tier, and monitoring the use of resources through an observability layer. Having a business continuity plan with a disaster recovery strategy is a must for any workload. Guidance for these considerations and best practices can be found in Designing generative AI workloads for resilience. Details of these architectural elements should be added as manual evidence in the assessment.

Responsible

Key principles of responsible design are explainability and interpretability. Explainability refers to the mechanisms that drive the functionality of the AI system, while interpretability refers to the meaning of the output of the AI system within the context of its designed functional purpose. Together, explainability and interpretability assist in the governance of an AI system to maintain the trustworthiness of the system. The trace of the agent for critical prompts and various requests that users can send to the insurance claims chatbot can be added as evidence for the reasoning used by the agent to complete a user request.

The logs gathered from Amazon Bedrock offer comprehensive insights into the model’s handling of user prompts and the generation of corresponding answers. The figure below shows a typical model invocation log. By analyzing these logs, you can gain visibility into the model’s decision-making process. This logging functionality can serve as a manual audit trail, fulfilling RESPAI 3.4: Auditable Model Decisions.

typical model invocation log

Another important aspect of maintaining responsible design, development, and deployment of generative AI applications is risk management. This involves a risk assessment in which risks are identified across broad categories for the application, harmful events are identified, and risk scores are assigned. This process also identifies mitigations that can reduce the inherent risk of a harmful event occurring to a lower residual risk. For more details on how to perform a risk assessment of your generative AI application, see Learn how to assess the risk of AI systems. Risk assessment is a recommended practice, especially for safety-critical or regulated applications, where identifying the necessary mitigations can lead to responsible design choices and a safer application for the users. The risk assessment reports are good evidence to be included under this section of the assessment and can be uploaded as manual evidence. The risk assessment should also be periodically reviewed to account for changes to the application that can introduce the possibility of new harmful events and to consider new mitigations for reducing the impact of these events.

Safe

AI systems should “not under defined conditions, lead to a state in which human life, health, property, or the environment is endangered.” (Source: ISO/IEC TS 5723:2022) For the insurance claims chatbot, safety principles should be followed to prevent interactions with users outside the limits of the defined functions. Amazon Bedrock Guardrails can be used to define topics that are not supported by the chatbot. The intended use of the chatbot should also be transparent to users to guide them in the best use of the AI application. An unsupported topic could include providing investment advice, which can be blocked by creating a guardrail with investment advice defined as a denied topic, as described in Guardrails for Amazon Bedrock helps implement safeguards customized to your use case and responsible AI policies.

After this functionality is enabled as a guardrail, the model will prohibit unsupported actions. The instance illustrated in the following figure depicts a scenario where requesting investment advice is a restricted behavior, leading the model to decline providing a response.

Guardrail can help to enforce restricted behavior

After the model is invoked, the user can navigate to CloudWatch to view the relevant logs. In cases where the model denies or intervenes in certain actions, such as providing investment advice, the logs will reflect the specific reasons for the intervention, as shown in the following figure. By examining the logs, you can gain insights into the model’s behavior, understand why certain actions were denied or restricted, and verify that the model is operating within the intended guidelines and boundaries. For the controls defined under the safety section of the assessment, you might want to design more experiments by considering various risks that arise from your application. The logs and documentation collected from the experiments can be attached as evidence to demonstrate the safety of the application.

Log insights from Amazon Bedrock shows the details of how Amazon Bedrock Guardrails intervened

Secure

NIST defines AI systems to be secure when they maintain confidentiality, integrity, and availability through protection mechanisms that prevent unauthorized access and use. Applications developed using generative AI should build defenses for adversarial threats including but not limited to prompt injection, data poisoning if a model is being fine-tuned or pre-trained, and model and data extraction exploits through AI endpoints.

Your information security teams should conduct standard security assessments that have been adapted to address the new challenges with generative AI models and applications—such as adversarial threats—and consider mitigations such as red-teaming. To learn more on various security considerations for generative AI applications, see Securing generative AI: An introduction to the Generative AI Security Scoping Matrix. The resulting documentation of the security assessments can be attached as evidence to this section of the assessment.

Sustainable

Sustainability refers to the “state of the global system, including environmental, social, and economic aspects, in which the needs of the present are met without compromising the ability of future generations to meet their own needs.”

Some actions that contribute to a more sustainable design of generative AI applications include considering and testing smaller models to achieve the same functionality, optimizing hardware and data storage, and using efficient training algorithms. To learn more about how you can do this, see Optimize generative AI workloads for environmental sustainability. Considerations implemented for achieving more sustainable applications can be added as evidence for the controls related to this part of the assessment.

Conclusion

In this post, we used the example of an insurance claims assistant powered by Amazon Bedrock Agents and looked at various principles that you need to consider when getting this application audit ready using the AWS generative AI best practices framework on Audit Manager. We defined each principle of safeguarding applications for trustworthy AI and provided some best practices for achieving the key objectives of the principles. Finally, we showed you how these development and design choices can be added to the assessment as evidence to help you prepare for an audit.

The AWS generative AI best practices framework provides a purpose-built tool that you can use for monitoring and governance of your generative AI projects on Amazon Bedrock and Amazon SageMaker. To learn more, see the AWS Audit Manager documentation.


About the Authors

Bharathi Srinivasan is a Generative AI Data Scientist at the AWS Worldwide Specialist Organisation. She works on developing solutions for Responsible AI, focusing on algorithmic fairness, veracity of large language models, and explainability. Bharathi guides internal teams and AWS customers on their responsible AI journey. She has presented her work at various learning conferences.

Irem Gokcek is a Data Architect in the AWS Professional Services team, with expertise spanning both Analytics and AI/ML. She has worked with customers from various industries such as retail, automotive, manufacturing and finance to build scalable data architectures and generate valuable insights from the data. In her free time, she is passionate about swimming and painting.

Fiona McCann is a Solutions Architect at Amazon Web Services in the public sector. She specializes in AI/ML with a focus on Responsible AI. Fiona has a passion for helping nonprofit customers achieve their missions with cloud solutions. Outside of building on AWS, she loves baking, traveling, and running half marathons in cities she visits.

Read More

London Stock Exchange Group uses Amazon Q Business to enhance post-trade client services

London Stock Exchange Group uses Amazon Q Business to enhance post-trade client services

This post was co-written with Ben Doughton, Head of Product Operations – LCH, Iulia Midus, Site Reliability Engineer – LCH, and Maurizio Morabito, Software and AI specialist – LCH (part of London Stock Exchange Group, LSEG).

In the financial industry, quick and reliable access to information is essential, but searching for data or facing unclear communication can slow things down. An AI-powered assistant can change that. By instantly providing answers and helping to navigate complex systems, such assistants can make sure that key information is always within reach, improving efficiency and reducing the risk of miscommunication. Amazon Q Business is a generative AI-powered assistant that can answer questions, provide summaries, generate content, and securely complete tasks based on data and information in your enterprise systems. Amazon Q Business enables employees to become more creative, data-driven, efficient, organized, and productive.

In this blog post, we explore a client services agent assistant application developed by the London Stock Exchange Group (LSEG) using Amazon Q Business. We will discuss how Amazon Q Business saved time in generating answers, including summarizing documents, retrieving answers to complex Member enquiries, and combining information from different data sources (while providing in-text citations to the data sources used for each answer).

The challenge

The London Clearing House (LCH) Group of companies includes leading multi-asset class clearing houses and is part of the Markets division of LSEG PLC (LSEG Markets). LCH provides proven risk management capabilities across a range of asset classes, including over-the-counter (OTC) and listed interest rates, fixed income, foreign exchange (FX), credit default swaps (CDS), equities, and commodities.

As the LCH business continues to grow, the LCH team has been continuously exploring ways to improve their support to customers (members) and to increase LSEG’s impact on customer success. As part of LSEG’s multi-stage AI strategy, LCH has been exploring the role that generative AI services can have in this space. One of the key capabilities that LCH is interested in is a managed conversational assistant that requires minimal technical knowledge to build and maintain. In addition, LCH has been looking for a solution that is focused on its knowledge base and that can be quickly kept up to date. For this reason, LCH was keen to explore techniques such as Retrieval Augmented Generation (RAG). Following a review of available solutions, the LCH team decided to build a proof-of-concept around Amazon Q Business.

Business use case

Realizing value from generative AI relies on a solid business use case. LCH has a broad base of customers raising queries to their client services (CS) team across a diverse and complex range of asset classes and products. Example queries include: “What is the eligible collateral at LCH?” and “Can members clear NIBOR IRS at LCH?” This requires CS team members to refer to detailed service and policy documentation sources to provide accurate advice to their members.

Historically, the CS team has relied on producing product FAQs for LCH members to refer to and, where required, an in-house knowledge center for CS team members to refer to when answering complex customer queries. To improve the customer experience and boost employee productivity, the CS team set out to investigate whether generative AI could help answer questions from individual members, thus reducing the number of customer queries. The goal was to increase the speed and accuracy of information retrieval within the CS workflows when responding to the queries that inevitably come through from customers.

Project workflow

The CS use case was developed through close collaboration between LCH and Amazon Web Service (AWS) and involved the following steps:

  1. Ideation: The LCH team carried out a series of cross-functional workshops to examine different large language model (LLM) approaches including prompt engineering, RAG, and custom model fine-tuning and pre-training. They considered different technologies such as Amazon SageMaker and Amazon SageMaker JumpStart and evaluated trade-offs between development effort and model customization. Amazon Q Business was selected because of its built-in enterprise search web crawler capability and ease of deployment without the need to deploy an LLM. Another attractive feature was the ability to clearly provide source attribution and citations. This enhanced the reliability of the responses, allowing users to verify facts and explore topics in greater depth (important aspects to increase their overall trust in the responses received).
  2. Knowledge base creation: The CS team built data sources connectors for the LCH website, FAQs, customer relationship management (CRM) software, and internal knowledge repositories and included the Amazon Q Business built-in index and retriever in the build.
  3. Integration and testing: The application was secured using a third-party identity provider (IdP) for identity and access management, allowing users to be managed with their enterprise IdP, and used AWS Identity and Access Management (IAM) to authenticate users when they signed in to Amazon Q Business. Testing was carried out to verify the factual accuracy of responses, evaluating the performance and quality of the AI-generated answers, which demonstrated that the system had achieved a high level of factual accuracy. Wider improvements in business performance were also demonstrated, including enhancements in response time, with responses delivered within a few seconds. Tests were undertaken with both unstructured and structured data within the documents.
  4. Phased rollout: The CS AI assistant was rolled out in a phased approach to provide thorough, high-quality answers. In the future, there are plans to integrate their Amazon Q Business application with existing email and CRM interfaces, and to expand its use to additional use cases and functions within LSEG. 

Solution overview

In this solution overview, we’ll explore the LCH-built Amazon Q Business application.

The LCH admin team developed a web-based interface that serves as a gateway for their internal client services team to interact with the Amazon Q Business API and other AWS services (Amazon Elastic Container Service (Amazon ECS), Amazon API Gateway, AWS Lambda, Amazon DynamoDB, Amazon Simple Storage Service (Amazon S3), and Amazon Bedrock). The interface is secured using SAML 2.0 IAM federation—maintaining secure access to the chat interface—and is used to retrieve answers from a pre-indexed knowledge base and to validate the responses using Anthropic’s Claude v2 LLM.

The following figure illustrates the architecture for the LCH client services application.

Architectural Design of the Solution

The workflow consists of the following steps:

  1. The LCH team set up the Amazon Q Business application using a SAML 2.0 IAM IdP. (The example in the blog post shows connecting with Okta as the IdP for Amazon Q Business. However, the LCH team built the application using a third-party solution as the IdP instead of Okta). This architecture allows LCH users to sign in using their existing identity credentials from their enterprise IdP, while they maintain control over which users have access to their Amazon Q Business application.
  2. The application had two data sources as part of the configuration for their Amazon Q Business application:
    1. An S3 bucket to store and index their internal LCH documents. This allows the Amazon Q Business application to access and search through their internal product FAQ PDF documents as part of providing responses to user queries. Indexing the documents in Amazon S3 makes them readily available for the application to retrieve relevant information.
    2. In addition to internal documents, the team has also set up their public-facing LCH website as a data source using a web crawler that can index and extract information from their rulebooks.
  3. The LCH team opted for a custom user interface (UI) instead of the built-in web experience provided by Amazon Q Business to have more control over the frontend by directly accessing the Amazon Q Business API. The application’s frontend was developed using an open source application framework and hosted on Amazon ECS. The frontend application accesses an Amazon API Gateway REST API endpoint to interact with the business logic written in AWS Lambda.
  4. The architecture consists of two Lambda functions:
    1. An authorizer Lambda function is responsible for authorizing the frontend application to access the Amazon Q Business API by generating temporary AWS credentials.
    2. A ChatSync Lambda function is responsible for accessing the Amazon Q Business ChatSync API to start an Amazon Q Business conversation (a sketch of this call follows the list).
  5. The architecture includes a Validator Lambda function, which is used by the admin to validate the accuracy of the responses generated by the Amazon Q Business application.
    1. The LCH team has stored a golden answer knowledge base in an S3 bucket, consisting of approximately 100 questions and answers about their product FAQs and rulebooks collected from their live agents. This knowledge base serves as a benchmark for the accuracy and reliability of the AI-generated responses.
    2. By comparing the Amazon Q Business chat responses against their golden answers, LCH can verify that the AI-powered assistant is providing accurate and consistent information to their customers.
    3. The Validator Lambda function retrieves data from a DynamoDB table and sends it to Amazon Bedrock, a fully managed service that offers a choice of high-performing foundation models (FMs) that can be used to quickly experiment with and evaluate top FMs for a given use case, privately customize the FMs with existing data using techniques such as fine-tuning and RAG, and build agents that execute tasks using enterprise systems and data sources.
    4. The Amazon Bedrock service uses Anthropic’s Claude v2 model to validate the Amazon Q Business application queries and responses against the golden answers stored in the S3 bucket.
    5. Anthropic’s Claude v2 model returns a score for each question and answer, in addition to a total score, which is then provided to the application admin for review.
    6. The Amazon Q Business application returned answers within a few seconds for each question. The overall expectation is that Amazon Q Business saves time for each live agent on each question by providing quick and correct responses.
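For reference, the following is a hedged sketch of the ChatSync call such a Lambda function might make through boto3; the application ID and the example question are placeholders, and the credential handling described above is omitted.

import boto3

qbusiness = boto3.client("qbusiness")

# Placeholder application ID; in the LCH architecture, the temporary
# credentials generated by the authorizer Lambda function would be used here.
response = qbusiness.chat_sync(
    applicationId="your-q-business-application-id",
    userMessage="What is the eligible collateral at LCH?",
)

print(response["systemMessage"])

# Source attributions allow the UI to surface in-text citations.
for source in response.get("sourceAttributions", []):
    print(source.get("title"), source.get("url"))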

This validation process helped LCH to build trust and confidence in the capabilities of Amazon Q Business, enhancing the overall customer experience.

Conclusion

This post provides an overview of LSEG’s experience in adopting Amazon Q Business to support LCH client services agents for B2B query handling. This specific use case was built by working backward from a business goal to improve customer experience and staff productivity in a complex, highly technical area of the trading life cycle (post-trade). The variety and large size of enterprise data sources and the regulated environment that LSEG operates in makes this post particularly relevant to customer service operations dealing with complex query handling. Managed, straightforward-to-use RAG is a key capability within a wider vision of providing technical and business users with an environment, tools, and services to use generative AI across providers and LLMs. You can get started with this tool by creating a sample Amazon Q Business application.


About the Authors

Ben Doughton is a Senior Product Manager at LSEG with over 20 years of experience in Financial Services. He leads product operations, focusing on product discovery initiatives, data-informed decision-making and innovation. He is passionate about machine learning and generative AI as well as agile, lean and continuous delivery practices.

Maurizio Morabito, Software and AI specialist at LCH, one of the early adopters of Neural Networks in the years 1990–1992 before a long hiatus in technology and finance companies in Asia and Europe, finally returning to Machine Learning in 2021. Maurizio is now leading the way to implement AI in LSEG Markets, following the motto “Tackling the Long and the Boring”

Iulia Midus is a recent IT Management graduate and currently working in Post-trade. The main focus of the work so far has been data analysis and AI, and looking at ways to implement these across the business.

Magnus Schoeman is a Principal Customer Solutions Manager at AWS. He has 25 years of experience across private and public sectors where he has held leadership roles in transformation programs, business development, and strategic alliances. Over the last 10 years, Magnus has led technology-driven transformations in regulated financial services operations (across Payments, Wealth Management, Capital Markets, and Life & Pensions).

Sudha Arumugam is an Enterprise Solutions Architect at AWS, advising large Financial Services organizations. She has over 13 years of experience in creating reliable software solutions to complex problems. She has extensive experience in serverless event-driven architecture and technologies and is passionate about machine learning and AI. She enjoys developing mobile and web applications.

Elias Bedmar is a Senior Customer Solutions Manager at AWS. He is a technical and business program manager helping customers be successful on AWS. He supports large migration and modernization programs, cloud maturity initiatives, and adoption of new services. Elias has experience in migration delivery, DevOps engineering and cloud infrastructure.

Marcin Czelej is a Machine Learning Engineer at AWS Generative AI Innovation and Delivery. He combines over 7 years of experience in C/C++ and assembler programming with extensive knowledge in machine learning and data science. This unique skill set allows him to deliver optimized and customised solutions across various industries. Marcin has successfully implemented AI advancements in sectors such as e-commerce, telecommunications, automotive, and the public sector, consistently creating value for customers.

Zmnako Awrahman, Ph.D., is a generative AI Practice Manager at AWS Generative AI Innovation and Delivery with extensive experience in helping enterprise customers build data, ML, and generative AI strategies. With a strong background in technology-driven transformations, particularly in regulated industries, Zmnako has a deep understanding of the challenges and opportunities that come with implementing cutting-edge solutions in complex environments.

Read More

Evaluate large language models for your machine translation tasks on AWS

Evaluate large language models for your machine translation tasks on AWS

Large language models (LLMs) have demonstrated promising capabilities in machine translation (MT) tasks. Depending on the use case, they are able to compete with neural translation models such as Amazon Translate. LLMs particularly stand out for their natural ability to learn from the context of the input text, which allows them to pick up on cultural cues and produce more natural sounding translations. For instance, the sentence “Did you perform well?” might be translated into French as “Avez-vous bien performé?” The target translation can vary widely depending on the context. If the question is asked in the context of sport, such as “Did you perform well at the soccer tournament?”, the natural French translation would be very different. It is critical for AI models to capture not only the context, but also the cultural specificities to produce a more natural sounding translation. One of LLMs’ most fascinating strengths is their inherent ability to understand context.

A number of our global customers are looking to take advantage of this capability to improve the quality of their translated content. Localization relies on both automation and humans-in-the-loop in a process called Machine Translation Post Editing (MTPE). Building solutions that help enhance translated content quality presents multiple benefits:

  • Potential cost savings on MTPE activities
  • Faster turnaround for localization projects
  • Better experience for content consumers and readers overall with enhanced quality

LLMs have also shown gaps with regards to MT tasks, such as:

  • Inconsistent quality over certain language pairs
  • No standard pattern to integrate past translations knowledge, also known as translation memory (TM)
  • Inherent risk of hallucination

Switching MT workloads from traditional MT to LLM-driven translation should be considered on a case-by-case basis. However, the industry is seeing enough potential to consider LLMs as a valuable option.

This blog post with accompanying code presents a solution to experiment with real-time machine translation using foundation models (FMs) available in Amazon Bedrock. It can help collect more data on the value of LLMs for your content translation use cases.

Steering the LLMs’ output

Translation memory and TMX files are important concepts and file formats used in the field of computer-assisted translation (CAT) tools and translation management systems (TMSs).

Translation memory

A translation memory is a database that stores previously translated text segments (typically sentences or phrases) along with their corresponding translations. The main purpose of a TM is to aid human or machine translators by providing them with suggestions for segments that have already been translated before. This can significantly improve translation efficiency and consistency, especially for projects involving repetitive content or similar subject matter.

Translation Memory eXchange (TMX) is a widely used open standard for representing and exchanging TM data. It is an XML-based file format that allows for the exchange of TMs between different CAT tools and TMSs. A typical TMX file contains a structured representation of translation units, which are groupings of the same text translated into multiple languages.
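As a rough illustration of this structure, the following sketch parses translation units from a TMX file with the Python standard library; the file name is a placeholder and error handling is omitted.

import xml.etree.ElementTree as ET

# TMX 1.4 uses the xml:lang attribute on <tuv> elements.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def load_translation_units(path: str) -> list[dict]:
    tree = ET.parse(path)
    units = []
    for tu in tree.getroot().iter("tu"):
        segments = {}
        for tuv in tu.iter("tuv"):
            lang = tuv.get(XML_LANG) or tuv.get("lang")
            seg = tuv.find("seg")
            if lang and seg is not None:
                segments[lang.lower()] = "".join(seg.itertext())
        units.append(segments)
    return units


# Example: print the English/French pairs from an uploaded TMX file.
for unit in load_translation_units("my-translation-memory.tmx"):
    print(unit.get("en"), "->", unit.get("fr"))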

Integrating TM with LLMs

The use of TMs in combination with LLMs can be a powerful approach for improving the quality and efficiency of machine translation. The following are a few potential benefits:

  • Improved accuracy and consistency – LLMs can benefit from the high-quality translations stored in TMs, which can help improve the overall accuracy and consistency of the translations produced by the LLM. The TM can provide the LLM with reliable reference translations for specific segments, reducing the chances of errors or inconsistencies.
  • Domain adaptation – TMs often contain translations specific to a particular domain or subject matter. By using a domain-specific TM, the LLM can better adapt to the terminology, style, and context of that domain, leading to more accurate and natural translations.
  • Efficient reuse of human translations – TMs store human-translated segments, which are typically of higher quality than machine-translated segments. By incorporating these human translations into the LLM’s training or inference process, the LLM can learn from and reuse these high-quality translations, potentially improving its overall performance.
  • Reduced post-editing effort – When the LLM can accurately use the translations stored in the TM, the need for human post-editing can be reduced, leading to increased productivity and cost savings.

Another approach to integrating TM data with LLMs is to use fine-tuning in the same way you would fine-tune a model for business domain content generation, for instance. For customers operating in global industries, potentially translating to and from over 10 languages, this approach can prove to be operationally complex and costly. The solution proposed in this post relies on LLMs’ context learning capabilities and prompt engineering. It enables you to use an off-the-shelf model as is without involving machine learning operations (MLOps) activity.

Solution overview

The LLM translation playground is a sample application providing the following capabilities:

  • Experiment with LLM translation capabilities using models available in Amazon Bedrock
  • Create and compare various inference configurations
  • Evaluate the impact of prompt engineering and Retrieval Augmented Generation (RAG) on translation with LLMs
  • Configure supported language pairs
  • Import, process, and test translation using your existing TMX file with multiple LLMs
  • Custom terminology conversion
  • Performance, quality, and usage metrics, including BLEU, BERT, METEOR, and chrF

The following diagram illustrates the translation playground architecture. The numbers are color-coded to represent two flows: the translation memory ingestion flow (orange) and the text translation flow (gray). The solution offers two TM retrieval modes for users to choose from: vector and document search. This is covered in detail later in the post.

Streamlit Application Architecture

The TM ingestion flow (orange) consists of the following steps:

  1. The user uploads a TMX file to the playground UI.
  2. Depending on which retrieval mode is being used, the appropriate adapter is invoked.
  3. When using the Amazon OpenSearch Service adapter (document search), translation unit groupings are parsed and stored into an index dedicated to the uploaded file. When using the FAISS adapter (vector search), translation unit groupings are parsed and turned into vectors using the selected embedding model from Amazon Bedrock.
  4. When using the FAISS adapter, translation units are stored into a local FAISS index along with the metadata.

The text translation flow (gray) consists of the following steps:

  1. The user enters the text they want to translate along with source and target language.
  2. The request is sent to the prompt generator.
  3. The prompt generator invokes the appropriate knowledge base according to the selected mode.
  4. The prompt generator receives the relevant translation units.
  5. Amazon Bedrock is invoked using the generated prompt as input along with customization parameters.

The translation playground could be adapted into a scalable serverless solution as represented by the following diagram using AWS Lambda, Amazon Simple Storage Service (Amazon S3), and Amazon API Gateway.

Serverless Solution Architecture Diagram

Strategy for TM knowledge base

The LLM translation playground offers two options to incorporate the translation memory into the prompt. Each option is available through its own page within the application:

  • Vector store using FAISS – In this mode, the application processes the .tmx file the user uploaded, indexes it, and stores it locally into a vector store (FAISS).
  • Document store using Amazon OpenSearch Serverless – Only standard document search using Amazon OpenSearch Serverless is supported. To test vector search, use the vector store option (using FAISS).

In vector store mode, the translation segments are processed as follows:

  1. Embed the source segment.
  2. Extract metadata:
    • Segment language
    • System generated <tu> segment unique identifier
  3. Store source segment vectors along with metadata and the segment itself in plain text as a document
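The following is a hedged sketch of these three steps, assuming the Amazon Titan Text Embeddings V2 model ID and a local FAISS index; the sample segments and metadata layout are illustrative rather than the playground's actual implementation.

import json

import boto3
import faiss
import numpy as np

bedrock_runtime = boto3.client("bedrock-runtime")


def embed(text: str) -> list[float]:
    # Titan Text Embeddings V2 takes an inputText field and returns an embedding.
    response = bedrock_runtime.invoke_model(
        modelId="amazon.titan-embed-text-v2:0",
        body=json.dumps({"inputText": text}),
    )
    return json.loads(response["body"].read())["embedding"]


# Illustrative source segments parsed from the uploaded TMX file.
segments = [
    {"id": "tu-001", "lang": "en", "text": "Did you perform well?"},
    {"id": "tu-002", "lang": "en", "text": "Did you perform well at the soccer tournament?"},
]

vectors = np.array([embed(s["text"]) for s in segments], dtype="float32")

index = faiss.IndexFlatL2(vectors.shape[1])
index.add(vectors)

# Keep language, unique identifier, and the plain-text segment alongside the
# index so matches can be mapped back to translation units at query time.
metadata = {i: s for i, s in enumerate(segments)}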

The translation customization section allows you to select the embedding model. You can choose either Amazon Titan Text Embeddings V2 or Cohere Embed Multilingual v3. Amazon Titan Text Embeddings V2 includes multilingual support for over 100 languages in pre-training. Cohere Embed supports 108 languages.

In document store mode, the language segments are not embedded and are stored following a flat structure. Two metadata attributes are maintained across the documents:

  • Segment Language
  • System generated <tu> segment unique identifier
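
For reference, the following is an illustrative sketch of how such a flat document could be indexed into an OpenSearch Serverless collection; the Region, endpoint placeholder, index name, and field names are assumptions for this example rather than the playground's actual implementation.

import boto3
from opensearchpy import OpenSearch, RequestsHttpConnection, AWSV4SignerAuth

region = "us-east-1"                                   # example Region
credentials = boto3.Session().get_credentials()
auth = AWSV4SignerAuth(credentials, region, "aoss")    # "aoss" = OpenSearch Serverless

client = OpenSearch(
    hosts=[{"host": "<collection-endpoint-without-https>", "port": 443}],
    http_auth=auth,
    use_ssl=True,
    connection_class=RequestsHttpConnection,
)

def index_segment(index_name, tu_id, lang, segment):
    """Store one segment as a flat document with the two metadata attributes."""
    client.index(
        index=index_name,
        body={"tu_id": tu_id, "segment_language": lang, "segment": segment},
    )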

Translation Memory Chunking

Prompt engineering

The application uses prompt engineering techniques to incorporate several types of inputs for the inference. The following sample XML illustrates the prompt’s template structure:

<prompt>
<system_prompt>…</system_prompt>
<source_language>EN</source_language>
<target_language>FR</target_language>
<translation_memory_pairs>
<source_language>…</source_language>
<target_language>…</target_language>
</translation_memory_pairs>
<custom_terminology_pairs>
<source_language>…</source_language>
<target_language>…</target_language>
</custom_terminology_pairs>
<user_prompt>…</user_prompt>
</prompt>
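
The following is a minimal, hypothetical renderer for this template. The element names follow the sample XML above, but the repository's actual prompt generator may be structured differently.

def render_prompt(system_prompt, user_text, source_lang, target_lang, tm_pairs, term_pairs):
    """tm_pairs and term_pairs are lists of (source_segment, target_segment) tuples."""
    def pairs_xml(tag, pairs):
        body = "".join(
            f"<source_language>{src}</source_language>"
            f"<target_language>{tgt}</target_language>"
            for src, tgt in pairs
        )
        return f"<{tag}>{body}</{tag}>"

    return (
        "<prompt>"
        f"<system_prompt>{system_prompt}</system_prompt>"
        f"<source_language>{source_lang}</source_language>"
        f"<target_language>{target_lang}</target_language>"
        f"{pairs_xml('translation_memory_pairs', tm_pairs)}"
        f"{pairs_xml('custom_terminology_pairs', term_pairs)}"
        f"<user_prompt>{user_text}</user_prompt>"
        "</prompt>"
    )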

Prerequisites

The project code uses the Python version of the AWS Cloud Development Kit (AWS CDK). To run the project code, make sure that you have fulfilled the AWS CDK prerequisites for Python.

The project also requires that the AWS account is bootstrapped to allow the deployment of the AWS CDK stack.

Install the UI

To deploy the solution, first install the UI (Streamlit application):

  1. Clone the GitHub repository using the following command:
git clone https://github.com/aws-samples/llm-translation-playground.git
  2. Navigate to the deployment directory:
cd llm-translation-playground
  3. Install and activate a Python virtual environment:
python3 -m venv .venv
source .venv/bin/activate
  4. Install Python libraries:
python -m pip install -r requirements.txt

Deploy the AWS CDK stack

Complete the following steps to deploy the AWS CDK stack:

  1. Move into the deployment folder:
cd deployment/cdk
  2. Configure the AWS CDK context parameters file context.json. For collection_name, use the OpenSearch Serverless collection name. For example:

"collection_name": "search-subtitles"

  3. Deploy the AWS CDK stack:
cdk deploy
  4. Validate successful deployment by reviewing the OpsServerlessSearchStack stack on the AWS CloudFormation console. The status should read CREATE_COMPLETE.
  5. On the Outputs tab, make note of the OpenSearchEndpoint attribute value.

Cloudformation Stack Output

Configure the solution

The stack creates an AWS Identity and Access Management (IAM) role with the level of permissions needed to run the application. The LLM translation playground assumes this role automatically on your behalf. To allow this, modify the role or principal under which you plan to run the application so that it is permitted to assume the newly created role. You can use the pre-created policy and attach it to your role from the IAM console by selecting your role and choosing Add permissions. The policy Amazon Resource Name (ARN) can be retrieved as a stack output under the key LLMTranslationPlaygroundAppRoleAssumePolicyArn, as illustrated in the preceding screenshot. If you prefer to use the AWS Command Line Interface (AWS CLI), refer to the following sample command:

aws iam attach-role-policy --role-name <role-name> --policy-arn <policy-arn>

Finally, configure the .env file in the utils folder as follows:

  • APP_ROLE_ARN – The ARN of the role created by the stack (stack output LLMTranslationPlaygroundAppRoleArn)
  • HOST – OpenSearch Serverless collection endpoint (without https)
  • REGION – AWS Region the collection was deployed into
  • INGESTION_LIMIT – Maximum number of translation units (<tu> tags) indexed per TMX file you upload
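
For reference, the following hedged sketch shows how an application could load these values and assume the stack-created role with AWS STS. The playground does this automatically on your behalf, so this snippet is illustrative rather than a required step.

import os

import boto3
from dotenv import load_dotenv

load_dotenv("utils/.env")                  # APP_ROLE_ARN, HOST, REGION, INGESTION_LIMIT

assumed = boto3.client("sts").assume_role(
    RoleArn=os.environ["APP_ROLE_ARN"],
    RoleSessionName="llm-translation-playground",
)
creds = assumed["Credentials"]

# Scoped session used for Amazon Bedrock and OpenSearch Serverless calls.
session = boto3.Session(
    aws_access_key_id=creds["AccessKeyId"],
    aws_secret_access_key=creds["SecretAccessKey"],
    aws_session_token=creds["SessionToken"],
    region_name=os.environ["REGION"],
)
bedrock_runtime = session.client("bedrock-runtime")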

Run the solution

To start the translation playground, run the following commands:

cd llm-translation-playground/source
streamlit run LLM_Translation_Home.py

Your default browser should open a new tab or window displaying the Home page.

LLM Translation Playground Home

Simple test case

Let’s run a simple translation test using the phrase mentioned earlier: “Did you perform well?”

Because we’re not using a knowledge base for this test case, we can use either a vector store or document store. For this post, we use a document store.

  1. Choose With Document Store.
  2. For Source Text, enter the text to be translated.
  3. Choose your source and target languages (for this post, English and French, respectively).
  4. You can experiment with other parameters, such as model, maximum tokens, temperature, and top-p.
  5. Choose Translate.

Translation Configuration Page

The translated text appears in the bottom section. For this example, the translated text, although accurate, is close to a literal translation, which is not a common phrasing in French.

English-French Translation Test 1

  1. We can rerun the same test after slightly modifying the initial text: “Did you perform well at the soccer tournament?”

We’re now introducing some situational context in the input. The translated text should be different and closer to a more natural translation. The new output literally means “Did you play well at the soccer tournament?”, which is consistent with the initial intent of the question.

English-French Translation Test 2

Also note the completion metrics on the left pane, displaying latency, input/output tokens, and quality scores.

This example highlights the ability of LLMs to naturally adapt the translation to the context.

Adding translation memory

Let’s test the impact of using a translation memory (TMX) file on the translation quality.

  1. Copy the text contained within test/source_text.txt and paste it into the Source Text field.
  2. Choose French as the target language and run the translation.
  3. Copy the text contained within test/target_text.txt and paste into the reference translation field.

Translation Memory Configuration

  1. Choose Evaluate and notice the quality scores on the left.
  2. In the Translation Customization section, choose Browse files and choose the file test/subtitles_memory.tmx.

This will index the translation memory into the OpenSearch Serverless collection created earlier. The indexing process can take a few minutes.

  1. When the indexing is complete, select the created index from the index dropdown.
  2. Rerun the translation.

You should see a noticeable increase in the quality score. For instance, we’ve seen up to a 20-percentage-point improvement in the BLEU score with the preceding test case. Using prompt engineering, we were able to steer the model’s output by providing sample phrases pulled directly from the TMX file. Feel free to explore the generated prompt for more details on how the translation pairs were introduced.
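
If you want to reproduce these scores outside the playground, the following snippet computes BLEU and chrF with the sacrebleu package (an assumption on our side; the playground computes its metrics internally). Replace the placeholders with the translated output and the reference translation.

import sacrebleu

hypotheses = ["<translated text returned by the playground>"]
references = ["<contents of test/target_text.txt>"]

bleu = sacrebleu.corpus_bleu(hypotheses, [references])
chrf = sacrebleu.corpus_chrf(hypotheses, [references])
print(f"BLEU: {bleu.score:.2f}  chrF: {chrf.score:.2f}")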

You can replicate a similar test case with Amazon Translate by launching an asynchronous job customized using parallel data.

Prompt Engineering

Here we took a simplistic retrieval approach, which consists of loading all of the samples from the same TMX file that match the source and target languages. You can enhance this technique by using metadata-driven filtering to collect the relevant pairs according to the source text. For example, you can classify the documents by theme or business domain, and use category tags to select language pairs relevant to the text and desired output.
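
As an illustration of metadata-driven filtering, the following hypothetical query restricts document-store retrieval to translation units tagged with a given business domain. The domain field is an assumed metadata attribute you would add at ingestion time, and client refers to an OpenSearch client such as the one sketched earlier.

def search_tm_by_domain(client, index_name, source_text, domain, size=5):
    """Match candidate segments by text while filtering on an assumed 'domain' tag."""
    query = {
        "size": size,
        "query": {
            "bool": {
                "must": [{"match": {"segment": source_text}}],
                "filter": [{"term": {"domain": domain}}],
            }
        },
    }
    return client.search(index=index_name, body=query)["hits"]["hits"]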

Semantic similarity for translation memory selection

In vector store mode, the application allows you to upload a TMX file and create a local index that uses semantic similarity to select the translation memory segments. First, we retrieve the segment with the highest similarity score based on the text to be translated and the source language. Then we retrieve the corresponding segment matching the target language and the parent translation unit ID.
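
The following is a minimal sketch of this vector-mode retrieval, reusing the index, rows, and units structures returned by the illustrative ingest_tmx function shown earlier (again, an assumption rather than the repository's actual implementation).

def retrieve_tm_pair(index, rows, units, source_text, target_lang="fr"):
    """Return (source_segment, target_segment) for the closest translation unit."""
    query = embed(source_text).reshape(1, -1)   # embed the text to be translated
    _, neighbors = index.search(query, 1)       # highest-similarity source segment
    best = rows[int(neighbors[0][0])]
    # Fetch the segment sharing the parent <tu> identifier in the target language.
    target = next(
        (seg for lang, seg in units[best["tu_id"]].items() if lang.startswith(target_lang)),
        None,
    )
    return best["segment"], target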

To try it out, upload the file in the same way as shown earlier. Depending on the size of the file, this can take a few minutes. The maximum file size is 200 MB. You can use the same sample file as in the previous example or one of the other samples provided in the code repository.

This approach differs from the static index search in that it assumes the source text is semantically close to segments that are representative of the expected style and tone.

TMX File Upload Widget

Adding custom terminology

Custom terminology allows you to make sure that your brand names, character names, model names, and other unique content get translated to the desired result. Given that LLMs are pre-trained on massive amounts of data, they can likely already identify unique names and render them accurately in the output. If there are names for which you want to enforce a strict and literal translation, you can try the custom terminology feature of this translation playground. Simply provide the source and target language pairs separated by a semicolon in the Translation Customization section. For instance, if you want to keep the phrase “Gen AI” untranslated regardless of the language, you can configure the custom terminology as illustrated in the following screenshot.
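
As a small illustration, the following hypothetical helper parses the semicolon-separated entries into (source, target) pairs that can populate the custom terminology block of the prompt.

def parse_custom_terminology(raw_lines):
    """Each entry looks like 'Gen AI;Gen AI' (source;target)."""
    pairs = []
    for line in raw_lines:
        if ";" in line:
            source, target = line.split(";", 1)
            pairs.append((source.strip(), target.strip()))
    return pairs

# Example: keep "Gen AI" untranslated regardless of the target language.
print(parse_custom_terminology(["Gen AI;Gen AI"]))   # [('Gen AI', 'Gen AI')]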

Custom Terminology

Clean up

To delete the stack, navigate to the deployment folder and run the following command:

cdk destroy

Further considerations

Using existing TMX files with generative AI-based translation systems can potentially improve the quality and consistency of translations. The following are some considerations when using TMX files for generative AI translations:

  • TMX data pipeline – TMX files contain structured translation units, but the format might need to be preprocessed to extract the source and target text segments in a format that can be consumed by the generative AI model. This involves extract, transform, and load (ETL) pipelines that can parse the XML structure, handle encoding issues, and add metadata.
  • Incorporate quality estimation and human review – Although generative AI models can produce high-quality translations, it is recommended to incorporate quality estimation techniques and human review processes. You can use automated quality estimation models to flag potentially low-quality translations, which can then be reviewed and corrected by human translators.
  • Iterate and refine – Translation projects often involve iterative cycles of translation, review, and improvement. You can periodically retrain or fine-tune the generative AI model with the updated TMX file, creating a virtuous cycle of continuous improvement.

Conclusion

The LLM translation playground presented in this post enables you to evaluate the use of LLMs for your machine translation needs. The key features of this solution include:

  • Ability to use translation memory – The solution allows you to integrate your existing TM data, stored in the industry-standard TMX format, directly into the LLM translation process. This helps improve the accuracy and consistency of the translations by using high-quality human-translated content.
  • Prompt engineering capabilities – The solution showcases the power of prompt engineering, demonstrating how LLMs can be steered to produce more natural and contextual translations by carefully crafting the input prompts. This includes the ability to incorporate custom terminology and domain-specific knowledge.
  • Evaluation metrics – The solution includes standard translation quality evaluation metrics, such as BLEU, BERT Score, METEOR, and CHRF, to help you assess the quality and effectiveness of the LLM-powered translations compared to your existing machine translation workflows.

As the industry continues to explore the use of LLMs, this solution can help you gain valuable insights and data to determine if LLMs can become a viable and valuable option for your content translation and localization workloads.

To dive deeper into the fast-moving field of LLM-based machine translation on AWS, check out the following resources:


About the Authors

Narcisse Zekpa is a Sr. Solutions Architect based in Boston. He helps customers in the Northeast U.S. accelerate their business transformation through innovative and scalable solutions on the AWS Cloud. He is passionate about enabling organizations to transform their business using advanced analytics and AI. When Narcisse is not building, he enjoys spending time with his family, traveling, running, cooking, and playing basketball.

Ajeeb Peter is a Principal Solutions Architect with Amazon Web Services based in Charlotte, North Carolina, where he guides global financial services customers to build highly secure, scalable, reliable, and cost-efficient applications on the cloud. He brings over 20 years of technology experience in software development, architecture, and analytics from industries like finance and telecom.
