Using transcription confidence scores to improve slot filling in Amazon Lex

When building voice-enabled chatbots with Amazon Lex, one of the biggest challenges is accurately capturing user speech input for slot values. For example, when a user needs to provide their account number or confirmation code, speech recognition accuracy becomes crucial. This is where transcription confidence scores come in to help ensure reliable slot filling.

What Are Transcription Confidence Scores?

Transcription confidence scores indicate how confident Amazon Lex is in converting speech to text for slot values. These scores range from 0 to 1 and are separate from intent and entity recognition confidence scores. For each spoken slot value, Lex provides a confidence score that you can use to:

  • Validate if a spoken slot value was correctly understood
  • Decide whether to ask for confirmation or re-prompt
  • Branch conversation flows based on recognition confidence

Here are some ways to leverage confidence scores for better slot handling:

  1. Progressive Confirmation
    • High confidence (>0.9): Accept the slot value and continue
    • Medium confidence (0.6-0.9): Ask user to confirm (“Did you say 12345?”)
    • Low confidence (<0.6): Re-prompt for the slot value
  2. Adaptive re-prompting
    • Customize re-prompt messages based on confidence level
    • Provide more specific guidance for low confidence inputs
    • Offer alternative input methods when needed
  3. Branching Logic
    • Route to human agent if multiple low confidence attempts
    • Skip confirmation for consistently high confidence inputs
    • Adjust validation rules based on confidence thresholds

To demonstrate these patterns, this post includes an AWS CloudFormation template that you can deploy in your AWS account.

Using confidence scores strategically helps create more robust slot filling experiences that:

  • Reduce errors in capturing critical information
  • Improve containment rates for self-service
  • Provide better handling of challenging audio conditions
  • Enable smarter conversation flows

By implementing these patterns, you can significantly improve the accuracy and reliability of slot filling in your Lex voice bots while maintaining a natural conversation flow.

Solution overview

The following diagram illustrates the overall architecture for the solution.

architecture diagram

We use an Amazon Lex bot integrated with an Amazon Connect contact flow to deliver the conversational experience.

Amazon Lex asks the caller, “In a few words, how can I help you today?” and then reviews the transcription confidence scores of the captured slot values to influence the confirmation routine.

Interaction model

The following are different customer experiences a caller may have depending on what they say and the transcription confidence scores returned by Amazon Lex:

  • Conversation 1:

Virtual Agent:
Thank you for calling Acme Travel.
Virtual Agent:
…In a few words, what is the reason for your call today?
Customer:
I’d like to check my booking.
Virtual Agent:
That’s great, please say your 5 character booking reference, you will find it at the top of the information pack we sent.
Customer:
1 9 A K B (Recognized with low transcription confidence: 0.4)
Virtual Agent:
Ok, let’s try again. What is your booking reference?
Customer:
1 9 A A B (More clearly pronounced: recognized with 0.8)
Virtual Agent:
Your booking 1 9 A A B is currently being progressed.

  • Conversation 2:

Virtual Agent:
Thank you for calling Acme Travel.
Virtual Agent:
…In a few words, what is the reason for your call today?
Customer:
I’d like to check my booking 1 9 A Z Y.
Virtual Agent:
Your booking 1 9 A Z Y is currently being progressed.

  • Conversation 3:

Virtual Agent:
Thank you for calling Acme Travel.
Virtual Agent:
…In a few words, what is the reason for your call today?
Customer:
I’d like to check my booking.
Virtual Agent:
That’s great, please say your 5 character booking reference, you will find it at the top of the information pack we sent.
Customer:
1 9 A Z Y (recognized with a transcription confidence of 0.6)
Virtual Agent:
Did you say 1 9 A Z Y? Please say yes or no.
Customer:
Yes
Virtual Agent:
Your booking 1 9 A Z Y is currently being progressed.

In the example conversations, the IVR requests the booking reference from the customer. Once received, the transcription confidence score is evaluated using conditional branching in Amazon Lex based on speech confidence scores. These conditions check the score against specific thresholds. If the transcription confidence score exceeds the high threshold (for example, greater than 0.7), the conversation progresses to the next state. If the score falls in the medium confidence range (for example, between 0.4–0.7), the user is asked to confirm the interpreted input. Finally, if the score falls below a minimum threshold (for example, lower than 0.4), the user is prompted to retry and provide the information again. This approach optimizes the conversation flow based on the quality of the input captured and prevents erroneous or redundant slot capturing, leading to an improved user experience while increasing self-service containment rates.
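
The sample in this post implements these checks with the conditional branching built into the Amazon Lex visual conversation builder, so no code is required. If you prefer to apply the same thresholds in a dialog code hook, the following is a minimal AWS Lambda sketch. It assumes the Lex V2 Lambda event exposes the audio transcription candidates in a transcriptions list with a transcriptionConfidence value, and it reuses the BookingRef slot and the 0.4/0.7 thresholds from this example; treat it as an illustration rather than the deployed solution.

HIGH_CONFIDENCE = 0.7
LOW_CONFIDENCE = 0.4

def lambda_handler(event, context):
    # Assumption: the Lex V2 event carries the audio transcription candidates in
    # event["transcriptions"], each with a transcriptionConfidence between 0 and 1.
    intent = event["sessionState"]["intent"]
    transcriptions = event.get("transcriptions", [])
    confidence = transcriptions[0].get("transcriptionConfidence", 1.0) if transcriptions else 1.0

    if confidence > HIGH_CONFIDENCE:
        # High confidence: accept the captured value and let the bot continue.
        dialog_action = {"type": "Delegate"}
    elif confidence >= LOW_CONFIDENCE:
        # Medium confidence: ask the caller to confirm the interpreted booking reference.
        # (A confirmation prompt would normally be returned in the response messages.)
        dialog_action = {"type": "ConfirmIntent"}
    else:
        # Low confidence: clear the slot and re-prompt for it.
        intent["slots"]["BookingRef"] = None
        dialog_action = {"type": "ElicitSlot", "slotToElicit": "BookingRef"}

    return {"sessionState": {"dialogAction": dialog_action, "intent": intent}}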

Prerequisites

You need to have an AWS account and an AWS Identity and Access Management (IAM) role and user with permissions to create and manage the necessary resources and components for this application. If you don’t have an AWS account, see How do I create and activate a new Amazon Web Services account?

Additionally, you need an Amazon Connect instance—you use the instance Amazon Resource Name (ARN) in a later step.

Deploy the Amazon Lex bot and Amazon Connect flow

To create the sample bot and configure the runtime phrase hints, perform the following steps. For this example, we create an Amazon Lex bot called disambiguation-bot, one intent (CheckBooking), and one slot type (BookingRef).

  1. Sign in to your AWS account, then choose Launch Stack to deploy the CloudFormation template:

stack launch button

  2. For Stack Name, enter a name, for example contact-center-transcription-confidence-scores.
  3. Choose Next.
  4. Provide the following parameters:
    1. For BotName, enter disambiguation-bot.
    2. For ConnectInstanceARN, enter the ARN of your Amazon Connect instance.
    3. For ContactFlowName, enter a name for your Amazon Connect contact flow (for example, lex-check-booking-sample-flow).
    4. For LogGroupName, enter the name of the Amazon CloudWatch log group where the conversation logs are stored.
  5. Choose Next.

CFN stack parameters

  6. Leave all remaining settings as default and choose Next.
  7. Select I acknowledge that AWS CloudFormation might create IAM resources.
  8. Choose Submit.

CFN acknowledge

  9. Wait for the CloudFormation stack to successfully deploy.
  10. On the Amazon Connect console, assign the contact flow to an Amazon Connect claimed number.

Configure the transcript confidence score logic

After you create your intent (CheckBooking), you can use the Visual conversation builder to configure your transcription confidence score logic.

The following figure is an example of how we add logic to the intent. Highlighted in red is the branch condition where we use the transcription confidence score to dynamically change the customer experience and improve accuracy.

Lex Visual Builder

If you choose the node, you’re presented with the following configuration options, where you can configure the branch condition.

Lex condition

Test the solution

To test the solution, we examine a conversation with words that might not be clearly understood.

  1. Assign the Amazon Lex bot to an Amazon Connect workflow.
  2. Make a call.

Amazon Connect will ask, “Thank you for calling Acme Travel. In a few words, what is the reason for your call today?”

  3. Respond “I want to check my booking.”
  4. When asked for the booking reference, speak any two numbers followed by three letters (for example, “1 9 A Z Y”).

Depending on the transcription confidence score, the bot will either say “Your booking 1 9 A Z Y is currently being progressed” or ask you to confirm “1 9 A Z Y”.

Limitations

Audio transcription confidence scores are available only in the English (GB) (en_GB) and English (US) (en_US) languages. Confidence scores are supported only for 8 kHz audio input. Transcription confidence scores aren’t provided for audio input from the test window on the Amazon Lex V2 console because it uses 16 kHz audio input.

Clean up

To remove the infrastructure created by the CloudFormation template, open the AWS CloudFormation console and delete the stack. This will remove the services and configuration installed as part of this deployment process.

Conclusion

Optimizing the user experience is at the forefront of any Amazon Lex conversational designer’s priority list, and so is capturing information accurately. This new feature empowers designers to have choices around confirmation routines that drive a more natural dialog between the customer and the bot. Although confirming each input can slow down the user experience and cause frustration, failing to confirm when transcription confidence is low can risk accuracy. These improvements enable you to create a more natural and performant experience.

For more information about how to build effective conversations on Amazon Lex with intent confidence scores, see Build more effective conversations on Amazon Lex with confidence scores and increased accuracy.


About the Authors

Alex Buckhurst is a Senior Amazon Connect consultant at Amazon Web Services with a focus on innovation and building customer-centric designs. In his downtime, Alex enjoys playing squash, perfecting his BBQ skills, and cherishing moments with his family.

Kai Loreck is a Senior professional services Amazon Connect consultant. He works on designing and implementing scalable customer experience solutions. In his spare time, he can be found playing sports, snowboarding, or hiking in the mountains.

Neel Kapadia is a Senior Software Engineer at AWS where he works on designing and building scalable AI/ML services using Large Language Models and Natural Language Processing. He has been with Amazon for over 5 years and has worked on Amazon Lex and Amazon Bedrock. In his spare time, he enjoys cooking, reading, and traveling.

Anand Jumnani is a DevOps Consultant at Amazon Web Services based in United Kingdom. Outside of work, he is passionate about club cricket and enjoys spending quality time with family and friends.

Read More

Improving Retrieval Augmented Generation accuracy with GraphRAG

Customers need better accuracy to take generative AI applications into production. In a world where decisions are increasingly data-driven, the integrity and reliability of information are paramount. To address this, customers often begin by enhancing generative AI accuracy through vector-based retrieval systems and the Retrieval Augmented Generation (RAG) architectural pattern, which integrates dense embeddings to ground AI outputs in relevant context. When even greater precision and contextual fidelity are required, the solution evolves to graph-enhanced RAG (GraphRAG), where graph structures provide enhanced reasoning and relationship modeling capabilities.

Lettria, an AWS Partner, demonstrated that integrating graph-based structures into RAG workflows improves answer precision by up to 35% compared to vector-only retrieval methods. This enhancement is achieved by using the graph’s ability to model complex relationships and dependencies between data points, providing a more nuanced and contextually accurate foundation for generative AI outputs.

In this post, we explore why GraphRAG is more comprehensive and explainable than vector RAG alone, and how you can use this approach using AWS services and Lettria.

How graphs make RAG more accurate

In this section, we discuss the ways in which graphs make RAG more accurate.

Capturing complex human queries with graphs

Human questions are inherently complex, often requiring the connection of multiple pieces of information. Traditional data representations struggle to accommodate this complexity without losing context. Graphs, however, are designed to mirror the way humans naturally think and ask questions. They represent data in a machine-readable format that preserves the rich relationships between entities.

By modeling data as a graph, you capture more of the context and intent. This means your RAG application can access and interpret data in a way that aligns closely with human thought processes. The result is a more accurate and relevant answer to complex queries.

Avoiding loss of context in data representation

When you rely solely on vector similarity for information retrieval, you miss out on the nuanced relationships that exist within the data. Translating natural language into vectors reduces the richness of the information, potentially leading to less accurate answers. Also, end-user queries are not always semantically aligned with the useful information in the provided documents, so vector search can exclude key data points needed to build an accurate answer.

Graphs maintain the natural structure of the data, allowing for a more precise mapping between questions and answers. They enable the RAG system to understand and navigate the intricate connections within the data, leading to improved accuracy.

Lettria demonstrated an improvement in answer correctness from 50% with traditional RAG to more than 80% using GraphRAG within a hybrid approach. The testing covered datasets from finance (Amazon financial reports), healthcare (scientific studies on COVID-19 vaccines), industry (technical specifications for aeronautical construction materials), and law (European Union directives on environmental regulations).

Proving that graphs are more accurate

To substantiate the accuracy improvements of graph-enhanced RAG, Lettria conducted a series of benchmarks comparing their GraphRAG solution—a hybrid RAG using both vector and graph stores—with a baseline vector-only RAG reference.

Lettria’s hybrid methodology to RAG

Lettria’s hybrid approach to question answering combines the best of vector similarity and graph searches to optimize performance of RAG applications on complex documents. By integrating these two retrieval systems, Lettria uses both structured precision and semantic flexibility in handling intricate queries.

GraphRAG specializes in using fine-grained, contextual data, ideal for answering questions that require explicit connections between entities. In contrast, vector RAG excels at retrieving semantically relevant information, offering broader contextual insights. This dual system is further reinforced by a fallback mechanism: when one system struggles to provide relevant data, the other compensates. For example, GraphRAG pinpoints explicit relationships when available, whereas vector RAG fills in relational gaps or enhances context when structure is missing.
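
As a simplified illustration of this fallback pattern, the following Python sketch combines two retrievers. The graph_search and vector_search helpers are hypothetical placeholders standing in for Lettria's graph and vector stores, not a published API; only the fallback logic is the point.

from typing import List

def graph_search(question: str) -> List[str]:
    # Hypothetical placeholder: return passages built from explicit graph relationships.
    return []

def vector_search(question: str) -> List[str]:
    # Hypothetical placeholder: return semantically similar passages from a vector store.
    return []

def hybrid_retrieve(question: str, min_results: int = 3) -> List[str]:
    # Try the graph first for questions that hinge on explicit relationships.
    graph_hits = graph_search(question)
    # Fall back to (or top up with) vector search when the graph is too sparse.
    vector_hits = vector_search(question) if len(graph_hits) < min_results else []
    # Merge while preserving order and dropping duplicates.
    seen, context = set(), []
    for passage in graph_hits + vector_hits:
        if passage not in seen:
            seen.add(passage)
            context.append(passage)
    return context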

The benchmarking process

To demonstrate the value of this hybrid method, Lettria conducted extensive benchmarks across datasets from various industries. Using their solution, they compared GraphRAG’s hybrid pipeline against a leading open source RAG package, Verba by Weaviate, a baseline RAG reference reliant solely on vector stores. The datasets included Amazon financial reports, scientific texts on COVID-19 vaccines, technical specifications from aeronautics, and European environmental directives—providing a diverse and representative test bed.

The evaluation tackled real-world complexity by focusing on six distinct question types, including fact-based, multi-hop, numerical, tabular, temporal, and multi-constraint queries. The questions ranged from simple fact-finding, like identifying vaccine formulas, to multi-layered reasoning tasks, such as comparing revenue figures across different timeframes. An example multi-hop query in finance is “Compare the oldest booked Amazon revenue to the most recent.”

Lettria’s in-house team manually assessed the answers with a detailed evaluation grid, categorizing results as correct, partially correct (acceptable or not), or incorrect. This process measured how the hybrid GraphRAG approach outperformed the baseline, particularly in handling multi-dimensional queries that required combining structured relationships with semantic breadth. By using the strengths of both vector and graph-based retrieval, Lettria’s system demonstrated its ability to navigate the nuanced demands of diverse industries with precision and flexibility.

The benchmarking results

The results were significant and compelling. GraphRAG achieved 80% correct answers, compared to 50.83% with traditional RAG. When including acceptable answers, GraphRAG’s accuracy rose to nearly 90%, whereas the vector approach reached 67.5%.

The following graph shows the results for vector RAG and GraphRAG.

In the industry sector, dealing with complex technical specifications, GraphRAG provided 90.63% correct answers, almost doubling vector RAG’s 46.88%. These figures highlight how GraphRAG offers substantial advantages over the vector-only approach, particularly for clients focused on structuring complex data.

GraphRAG’s overall reliability and superior handling of intricate queries allow customers to make more informed decisions with confidence. By delivering up to 35% more accurate answers, it significantly boosts efficiency and reduces the time spent sifting through unstructured data. These compelling results demonstrate that incorporating graphs into the RAG workflow not only enhances accuracy, but is essential for tackling the complexity of real-world questions.

Using AWS and Lettria for enhanced RAG applications

In this section, we discuss how you can use AWS and Lettria for enhanced RAG applications.

AWS: A robust foundation for generative AI

AWS offers a comprehensive suite of tools and services to build and deploy generative AI applications. With AWS, you have access to scalable infrastructure and advanced services like Amazon Neptune, a fully managed graph database service. Neptune allows you to efficiently model and navigate complex relationships within your data, making it an ideal choice for implementing graph-based RAG systems.

Implementing GraphRAG from scratch usually requires a process similar to the following diagram.

The process can be broken down as follows:

  1. Based on the domain definition, the large language model (LLM) identifies the entities and relationships contained in the unstructured data, which are then stored in a graph database such as Neptune.
  2. At query time, user intent is turned into an efficient graph query based on the domain definition to retrieve the relevant entities and relationships.
  3. Results are then used to augment the prompt and generate a more accurate response compared to standard vector-based RAG, as illustrated in the sketch that follows this list.
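
The following minimal Python sketch ties these three steps together. Every helper here (extract_graph, graph_lookup, and the in-memory GRAPH list) is a hypothetical stand-in for an LLM extraction step and a graph database such as Neptune, so the end-to-end flow can be read in one place.

GRAPH = []  # stand-in for a graph database such as Amazon Neptune

def extract_graph(document: str, domain: str):
    # Step 1 (placeholder): an LLM guided by the domain definition would return
    # (entity, relation, entity) triples extracted from the unstructured document.
    return [("Amazon", "reported", "2023 revenue")]

def graph_lookup(question: str):
    # Step 2 (placeholder): translate the user intent into a graph query and return matches.
    words = question.lower().split()
    return [t for t in GRAPH if any(w in " ".join(t).lower() for w in words)]

def build_prompt(question: str) -> str:
    # Step 3: augment the prompt with the retrieved relationships before calling the LLM.
    context = "\n".join(" ".join(t) for t in graph_lookup(question))
    return f"Context:\n{context}\n\nQuestion: {question}"

GRAPH.extend(extract_graph("Amazon 2023 annual report ...", domain="finance"))
print(build_prompt("What revenue did Amazon report?"))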

Implementing such a process requires teams to develop specific skills in topics such as graph modeling, graph queries, prompt engineering, or LLM workflow maintenance. AWS released an open source GraphRAG Toolkit to make it simple for customers who want to build and customize their GraphRAG workflows. Iterations on the extraction process and graph lookup are to be expected in order to improve accuracy.

Managed GraphRAG implementations

There are two solutions for managed GraphRAG with AWS: Lettria’s solution, soon available on AWS Marketplace, and Amazon Bedrock integrated GraphRAG support with Neptune. Lettria provides an accessible way to integrate GraphRAG into your applications. By combining Lettria’s expertise in natural language processing (NLP) and graph technology with the scalable and managed AWS infrastructure, you can develop RAG solutions that deliver more accurate and reliable results.

The following are key benefits of Lettria on AWS:

  • Simple integration – Lettria’s solution simplifies the ingestion and processing of complex datasets
  • Improved accuracy – You can achieve up to 35% better performance in question-answering tasks
  • Scalability – You can use scalable AWS services to handle growing data volumes and user demands
  • Flexibility – The hybrid approach combines the strengths of vector and graph representations

In addition to Lettria’s solution, Amazon Bedrock introduced managed GraphRAG support on December 4, 2024, integrating directly with Neptune. GraphRAG with Neptune is built into Amazon Bedrock Knowledge Bases, offering an integrated experience with no additional setup or additional charges beyond the underlying services. GraphRAG is available in AWS Regions where Amazon Bedrock Knowledge Bases and Amazon Neptune Analytics are both available (see the current list of supported Regions). To learn more, see Retrieve data and generate AI responses with Amazon Bedrock Knowledge Bases.
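
After GraphRAG is enabled on a knowledge base, querying it uses the same Amazon Bedrock Knowledge Bases RetrieveAndGenerate API as any other knowledge base. The following is a minimal boto3 sketch; the knowledge base ID and model ARN are placeholders you would replace with your own values.

import boto3

client = boto3.client("bedrock-agent-runtime", region_name="us-east-1")

response = client.retrieve_and_generate(
    input={"text": "Compare the oldest booked Amazon revenue to the most recent."},
    retrieveAndGenerateConfiguration={
        "type": "KNOWLEDGE_BASE",
        "knowledgeBaseConfiguration": {
            "knowledgeBaseId": "<your-knowledge-base-id>",  # placeholder
            "modelArn": "arn:aws:bedrock:us-east-1::foundation-model/anthropic.claude-3-sonnet-20240229-v1:0",
        },
    },
)
print(response["output"]["text"])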

Conclusion

Data accuracy is a critical concern for enterprises adopting generative AI applications. By incorporating graphs into your RAG workflow, you can significantly enhance the accuracy of your systems. Graphs provide a richer, more nuanced representation of data, capturing the complexity of human queries and preserving context.

GraphRAG is a key option to consider for organizations seeking to unlock the full potential of their data. With the combined power of AWS and Lettria, you can build advanced RAG applications that help meet the demanding needs of today’s data-driven enterprises and achieve up to 35% improvement in accuracy.

Explore how you can implement GraphRAG on AWS in your generative AI application.


About the Authors

Denise Gosnell is a Principal Product Manager for Amazon Neptune, focusing on generative AI infrastructure and graph data applications that enable scalable, cutting-edge solutions across industry verticals.

Vivien de Saint Pern is a Startup Solutions Architect working with AI/ML startups in France, focusing on generative AI workloads.

Read More

Add a generative AI experience to your website or web application with Amazon Q embedded

Generative AI offers many benefits for both you, as a software provider, and your end-users. AI assistants can help users generate insights, get help, and find information that may be hard to surface using traditional means. In addition, they can help your employees reduce repetitive tasks and focus on high-value work. However, adding generative AI assistants to your website or web application requires significant domain knowledge and the technical expertise to build, deploy, and maintain the infrastructure and end-user experience. These challenges fall outside of some software providers’ core domain, creating barriers to offering AI assistants to users.

Amazon Q Business is a generative AI assistant that can answer questions, provide summaries, generate content, and securely complete tasks based on data and information in your enterprise systems. Amazon Q Business securely unites disparate data with over 40 built-in connectors to popular enterprise applications, document repositories, chat applications, and knowledge management systems. You can use natural language to request information or assistance to generate content. Amazon Q Business handles the complexity of deploying and maintaining the infrastructure required for generative AI assistants so you can focus on creating a delightful end-user experience.

Amazon Q embedded is a feature that lets you embed a hosted Amazon Q Business assistant on your website or application to create more personalized experiences that boost end-users’ productivity. You can configure the assistant with guardrails to define global and topic-level controls for your environment. With an embedded Amazon Q Business assistant, end-users can receive immediate, permission-aware responses from your data sources, with citations.

In this post, we demonstrate how to use the Amazon Q embedded feature to add an Amazon Q Business assistant to your website or web application using basic HTML or React. We also show you how to use the feature with content management systems like WordPress and Drupal. This post includes a sample webpage for Amazon Q Business that allows you to quickly test and demonstrate your AI assistant, so you can develop changes to your website or application in parallel while refining your Amazon Q Business configuration.

Solution overview

Embedding Amazon Q Business gives your users access to a generative AI assistant without leaving your website or web application. Integrating the assistant involves creating an Amazon Q Business application, adding users or groups, connecting relevant data sources, allowlisting your domain, and finally adding an HTML inline frame (iframe) element to your website or web application.

Prerequisites

In this section, we walk through how to set up an Amazon Q Business application, permissions, and user access.

Amazon Q Business application

The Amazon Q embedded feature requires an Amazon Q Business application. If you don’t have an existing application, you can create an application integrated with AWS IAM Identity Center or AWS Identity and Access Management (IAM) identity federation. Refer to Configuring an Amazon Q Business application using AWS IAM Identity Center, or Creating an Amazon Q Business application using Identity Federation through IAM if you need to make a new application.

Permissions

Configuring the Amazon Q embedded feature requires IAM permissions that allow you to use and manage Amazon Q Business. Your permission policy must at least allow the Amazon Q Business CreateWebExperience and UpdateWebExperience actions:

"Action": "qbusiness:CreateWebExperience",
"Action": "qbusiness:UpdateWebExperience",

When creating the IAM permission policy, the IAM visual policy creator is a great way to see the available options. Following a least privilege approach, you can restrict the resources to which the permissions grant access, scoping them to a specific AWS Region, account ID, application ID, and web experience ID.

"Resource": "arn:aws:qbusiness:us-east-1:123456789012:application/<replace-with-id>"
"Resource": "arn:aws:qbusiness:us-east-1:123456789012:application/<replace-with-id>/web-experience/<replace-with-id>"

You can find your application ID on the Amazon Q Business console under Application settings or from the list-applications command in the AWS Command Line Interface (AWS CLI). You can find your web experience ID with the list-web-experiences AWS CLI command. For example:

aws qbusiness list-applications
aws qbusiness list-web-experiences --application-id a1b2c3d4-5678-90ab-cdef-EXAMPLE11111

User access

Amazon Q Business requires authentication before users can engage with the assistant. If you use AWS IAM Identity Center, you can grant users access to the assistant by adding the users or groups to your Amazon Q Business application. If you use IAM identity federation, Amazon Q Business automatically subscribes users to the subscription type you select when you create the application. For more information on managing users, refer to Managing user subscriptions for IAM Identity Center-integrated applications, or see Updating and cancelling user subscriptions for applications using IAM Federation.

Allowlisting your website or web application

To embed Amazon Q Business on your website or web application, you must first allowlist your domain. This restricts your assistant to only sites you trust and stops others from embedding your assistant. You can add multiple domains for different services or development instances used for testing. Complete the following steps:

  1. Open the Amazon Q Business console.
  2. Next, select your Amazon Q Business application.
  3. From the menu, choose Amazon Q embedded under the Enhancements section, then choose Add allowed website.
  4. For Enter website URL, enter the base URL of the website or web application you want to allowlist for Amazon Q Business, for example https://www.example.com (trailing / not required), and choose Add.

Amazon Q Business hosts the web experience on an AWS domain. To find the URL, navigate to the main page of your Amazon Q Business application and copy the value for Deployed URL, for example https://1234abcdef5678.chat.qbusiness.example.on.aws/, in the Web experience settings section. Now you can embed this assistant into the website or web application hosted at the domain you allowlisted.

Customizing the user experience

You can customize the user experience look and feel for your organization. Customization options include the assistant title, subtitle, welcome message, font, color, and logo. You can also enable sample prompts. Refer to Customizing an Amazon Q Business web experience to see the available customization options.

The following screenshots show the default Amazon Q Business user experience (left) and an Amazon Q Business user experience with a custom title, subtitle, and welcome message (right).

Add Amazon Q Business to your website or web application

Before continuing, make sure you have allowlisted your domain as described earlier in this post.

You can choose from the following embedding options:

  • Using an HTML iframe element
  • Using a React component
  • Using a content management system

Embed Amazon Q Business using an HTML iframe element

You can embed Amazon Q Business on your website or web application using an iframe element, which is an HTML element that you can use to insert another HTML page into the current one. Other embedding options build upon this foundational HTML element. The following is a sample iframe element:

<iframe src="https://1234abcdef5678.chat.qbusiness.example.on.aws/"></iframe>

You can customize the iframe element with various attributes such as the width, height, and title. Setting the Amazon Q Business deployed URL as the value for the src attribute will display the Amazon Q Business web experience within the iframe. The following code shows an example iframe element with the id, title, width, height, and src attributes set to example values:

<iframe
    id="inlineFrameExample"
    title="Inline Frame Example"
    width="600"
    height="650"
    src="https://1234abcdef5678.chat.qbusiness.example.on.aws/">
</iframe>

Refer to <iframe>: The Inline Frame element to learn more about the iframe element.

Embed Amazon Q Business using a React component

You can embed Amazon Q Business on your website or web application using a React component. React components offer more customizations and modularity than a standard iframe. In this post, we’ve included a sample React component that wraps an iframe element and adds abilities such as an expanding and collapsing chat interface and showing a loading spinner when the page first loads.

To use this React component, download the sample code from the Embed GenAI chat into React GitHub repo and add it to your React source code. Then you can import the component into your website or web application and add the Chat element with at least the embedUrl attribute set to the deployed URL of your Amazon Q Business application. The following example code shows the options of the sample React component:

import Chat from "../components/embed";
...
<Chat
    embedUrl="https://1234abcdef5678.chat.qbusiness.example.on.aws/"
    embedWidth={600}          // Optional
    embedHeight={650}         // Optional
    embedOffsetRightPc={5}    // Optional
    headerText="Chat"         // Optional
    headerInfo="Chat with us" // Optional
/>

Embed Amazon Q Business using a content management system

You can embed Amazon Q Business on a website published by a content management system that allows you to add HTML elements to the content. We’ve included examples for WordPress and Drupal, both of which you can deploy with Amazon Lightsail.

Embedding on a WordPress site

To embed Amazon Q Business on your WordPress site, first access the WordPress admin page. Optionally, add a block group wrapper to constrain iframe sizing with the values of your choosing. For example, you could set the layout content height to 650px, width to 620px, a width of 100% in the iframe to fill the container, and select a full-size block item. Finally, add a custom HTML block and insert the iframe code. The following code is a sample iframe element:

<iframe
    id="inlineFrameExample"
    title="Inline Frame Example"
    width="100%"
    height="650"
    src="https://021345abcdef.chat.qbusiness.example.on.aws/">
</iframe>

The following screenshot shows an example of adding a block to a WordPress site.

The following screenshot shows an example of adding an iframe to the block.

The following screenshot shows an example of Amazon Q Business in a WordPress site.

Embedding on a Drupal site

To embed Amazon Q Business on your Drupal site, complete the following steps:

  1. Open the Drupal admin page.
  2. Choose Content, Blocks, and Add content block.
  3. Give your content block a description and change the text format to HTML.
  4. Choose the Source option.
  5. Add your iframe to the Body section of the block, then choose Save and configure.
  6. When configuring your content block, the visibility options are optional and can be left with the default values.
  7. Choose a Region to display this block, such as Content Above or Sidebar, then choose Save block.

The following screenshot shows an example of Amazon Q Business embedded with the Content Above option.

The following screenshot shows an example of Amazon Q Business embedded with the Sidebar option.

Sample website

To help you get started embedding Amazon Q Business, we have included a sample website that you can deploy on AWS Amplify with an AWS CloudFormation stack. The sample website contains an HTML iframe element with your Amazon Q Business assistant. To use the website, complete the following steps:

  1. First, collect your Amazon Q Business application ID and make a note of it. You can find your application ID on the Amazon Q Business console as described earlier in this post.
  2. Download our YAML sample CloudFormation template to your workstation.
  3. Deploy the stack either using the AWS CloudFormation console or using the AWS CLI.

  4. After uploading the sample CloudFormation template, enter a stack name, a web page name, and your Amazon Q Business application ID in the Application ID input field.
  5. You can leave all other settings at their default values.
  6. After the stack fully deploys, navigate to the Outputs tab on the AWS CloudFormation console and copy the Amplify URL.
  7. Return to the Amazon Q Business console, select your Amazon Q Business application, and choose Amazon Q Embedded to add your Amplify URL to the Allowed websites list as described earlier in this post.
  8. Navigate to your Amplify URL in your web browser to see your sample website with Amazon Q Business. You may need to Sign in to Q Business.

Clean up

To avoid future charges in your account from Amplify, delete the resources you created in the preceding sample website walkthrough.

  1. On the CloudFormation console, in the navigation pane, choose Stacks.
  2. Select the stack you launched in the previous step, then choose Delete.

Conclusion

In this post, we showed you various methods of embedding Amazon Q Business, which enables users to have natural language conversations and get meaningful assistance directly on your website or web application. We discussed creating an Amazon Q Business application and how to allowlist your URL. We then walked through adding Amazon Q Business with a standard HTML iframe or a React component, and how to embed it in a WordPress or Drupal site.

To get started, refer to Getting started with Amazon Q Business to create an Amazon Q Business application. For more information on the Amazon Q embedded feature, see Amazon Q embedded. Refer to Enhancing an Amazon Q Business application environment for guidance on integrating your data sources, which can include your website content, to enrich the answers Amazon Q Business can provide your website or web application users.


About the authors

Bobby Williams is a Senior Solutions Architect at AWS. He has decades of experience designing, building, and supporting enterprise software solutions that scale globally. He works on solutions across industry verticals and horizontals and is driven to create a delightful experience for every customer.

David Girling is a Senior AI/ML Solutions Architect with over 20 years of experience in designing, leading, and developing enterprise systems. David is part of a specialist team that focuses on helping customers learn, innovate, and utilize these highly capable services with their data for their use cases.

Philip Whiteside is a Solutions Architect (SA) at Amazon Web Services. Philip is passionate about overcoming barriers by utilizing technology.

Read More

An introduction to preparing your own dataset for LLM training

Large language models (LLMs) have demonstrated remarkable capabilities in a wide range of linguistic tasks. However, the performance of these models is heavily influenced by the data used during the training process.

In this blog post, we provide an introduction to preparing your own dataset for LLM training. Whether your goal is to fine-tune a pre-trained model for a specific task or to continue pre-training for domain-specific applications, having a well-curated dataset is crucial for achieving optimal performance.

Data preprocessing

Text data can come from diverse sources and exist in a wide variety of formats such as PDF, HTML, JSON, and Microsoft Office documents such as Word, Excel, and PowerPoint. It’s rare to already have access to text data that can be readily processed and fed into an LLM for training. Thus, the first step in an LLM data preparation pipeline is to extract and collate data from these various sources and formats. During this step, you read data from multiple sources, extract the text using tools such as optical character recognition (OCR) for scanned PDFs, HTML parsers for web documents, and bespoke libraries for proprietary formats such as Microsoft Office files. Non-textual elements such as HTML tags and non-UTF-8 characters are typically removed or normalized.

The next step is to filter out low-quality or undesirable documents. Common patterns for filtering data include the following (a small code sketch of a few of these heuristics follows the list):

  • Filtering on metadata such as the document name or URL.
  • Content-based filtering such as excluding any toxic or harmful content or personally identifiable information (PII).
  • Regex filters to identify specific character patterns present in the text.
  • Filtering documents with excessive repetitive sentences or n-grams.
  • Filters for specific languages such as English.
  • Other quality filters such as the number of words in the document, average word length, ratio of words comprised of alphabetic characters versus non-alphabetic characters, and others.
  • Model-based quality filtering using lightweight text classifiers to identify low-quality documents. For example, the FineWeb-Edu classifier is used to classify the educational value of web pages.
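
The following is a minimal Python sketch of a few of the heuristics above (word count, average word length, alphabetic ratio, and repeated n-grams). The thresholds are illustrative placeholders rather than recommended values.

import re
from collections import Counter

def passes_quality_filters(
    text: str,
    min_words: int = 50,
    min_avg_word_len: float = 3.0,
    max_avg_word_len: float = 10.0,
    min_alpha_ratio: float = 0.8,
    max_repeated_trigram_ratio: float = 0.2,
) -> bool:
    """Return True if a document passes a few simple heuristic quality checks."""
    words = re.findall(r"\S+", text)
    if len(words) < min_words:
        return False

    # Very short or very long average word lengths often indicate noisy text.
    avg_word_len = sum(len(w) for w in words) / len(words)
    if not (min_avg_word_len <= avg_word_len <= max_avg_word_len):
        return False

    # Ratio of words made up entirely of alphabetic characters.
    alpha_ratio = sum(w.isalpha() for w in words) / len(words)
    if alpha_ratio < min_alpha_ratio:
        return False

    # Excessive repetition of trigrams suggests boilerplate or templated text.
    trigrams = [tuple(words[i : i + 3]) for i in range(len(words) - 2)]
    if trigrams:
        most_common_count = Counter(trigrams).most_common(1)[0][1]
        if most_common_count / len(trigrams) > max_repeated_trigram_ratio:
            return False

    return True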

Extracting text from various file formats can be a non-trivial task. Fortunately, many high-level libraries exist that can significantly simplify this process. We will use a few examples to demonstrate extracting text and review how to scale this to large collections of documents further down.

HTML preprocessing

When processing HTML documents, remove non-text data such as the document mark-up tags, inline CSS styles, and inline JavaScript. Furthermore, translate structured objects such as lists, tables, and sample code blocks into markdown format. The trafilatura library provides a command-line interface (CLI) and Python SDK for translating HTML documents in this fashion. The following code snippet demonstrates the library’s usage by extracting and preprocessing the HTML data from the Fine-tune Meta Llama 3.1 models using torchtune on Amazon SageMaker blog post.

from trafilatura import fetch_url, extract, html2txt

url = "https://aws.amazon.com/blogs/machine-learning/fine-tune-meta-llama-3-1-models-using-torchtune-on-amazon-sagemaker/"

downloaded = fetch_url(url)
print("RAW HTMLn", downloaded[:250])

all_text = html2txt(downloaded)
print("nALL TEXTn", all_text[:250])

main_text = extract(downloaded)
print("nMAIN TEXTn", main_text[:250])

trafilatura provides numerous functions for dealing with HTML. In the preceding example, fetch_url fetches the raw HTML, html2txt extracts the text content which includes the navigation links, related content links, and other text content. Finally, the extract method extracts the content of the main body which is the blog post itself. The output of the preceding code should look like the following:

RAW HTML
<!doctype html> <html lang="en-US" class="no-js aws-lng-en_US" xmlns="http://www.w3.org/1999/xhtml" data-aws-assets="https://a0.awsstatic.com" data-js-version="1.0.681" data-css-version="1.0.538" data-static-assets="https://a0.awsstatic.com" prefix="

ALL TEXT
Skip to Main Content Click here to return to Amazon Web Services homepage About AWS Contact Us Support English My Account Sign In Create an AWS Account Products Solutions Pricing Documentation Learn Partner Network AWS Marketplace Customer Enablement

MAIN TEXT
AWS Machine Learning Blog Fine-tune Meta Llama 3.1 models using torchtune on Amazon SageMaker This post is co-written with Meta’s PyTorch team. In today’s rapidly evolving AI landscape, businesses are constantly seeking ways to use advanced large lan

PDF processing

PDF is a common format for storing and distributing documents within organizations. Extracting clean text from PDFs can be challenging for several reasons. PDFs may use complex layouts that include text columns, images, tables, and figures. They can also contain embedded fonts and graphics that cannot be parsed by standard libraries. Unlike HTML, there is no structural information to work with such as headings, paragraphs, lists, and others, which makes parsing PDF documents significantly more difficult. If possible, PDF parsing should be avoided if an alternative format for the document exists, such as HTML, markdown, or even a DOCX file. In cases where an alternative format is not available, you can use libraries such as pdfplumber, pypdf, and pdfminer to help with the extraction of text and tabular data from the PDF. The following is an example of using pdfplumber to parse the first page of the 2023 Amazon annual report in PDF format.

import pdfplumber

pdf_file = "Amazon-com-Inc-2023-Annual-Report.pdf"

with pdfplumber.open(pdf_file) as pdf:
    # pdfplumber pages are zero-indexed; pages[1] is the second page of the PDF
    page = pdf.pages[1]
    print(page.extract_text(x_tolerance=1)[:300])

pdfplumber provides bounding box information, which can be used to remove superfluous text such as page headers and footers. However, the library only works with PDFs that have text present, such as digitally authored PDFs. For PDF documents that require OCR, such as scanned documents, you can use services such as Amazon Textract.

Office document processing

Documents authored with Microsoft Office or other compatible productivity software are another common format within an organization. Such documents can include DOCX, PPTX, and XLSX files, and there are libraries available to work with these formats. The following code snippet uses the python-docx library to extract text from a Word document. The code iterates through the document paragraphs and concatenates them into a single string.

from docx import Document
doc_file = "SampleDoc.docx"

doc = Document(doc_file)

full_text = []
for paragraph in doc.paragraphs:
  full_text.append(paragraph.text)

document_text = '\n'.join(full_text)

Deduplication

After the preprocessing step, it is important to process the data further to remove duplicates (deduplication) and filter out low-quality content.

Deduplication is a critical aspect for preparing high-quality pretraining datasets. According to CCNet, duplicated training examples are pervasive in common natural language processing (NLP) datasets. This issue is not only a frequent source of bias in datasets originating from public domains such as the internet, but it can also be a potential problem when curating your own training dataset. When organizations attempt to create their own training dataset, they often use various data sources such as internal emails, memos, internal employee chat logs, support tickets, conversations, and internal wiki pages. The same chunk of text might appear across multiple sources or can repeat excessively in a single data source such as an email thread. Duplicated data extends the training time and potentially biases the model towards more frequently repeated examples.

A commonly used processing pipeline is the CCNet pipeline. The following section will describe deduplication and filtering employed in the CCNet pipeline.

Break documents into shards. In the CCNet paper, the authors divided 30 TB of data into 1,600 shards. In that example, the shards are documents that have been grouped together. Each shard contains 5 GB of data and 1.6 million documents. Organizations can determine the number of shards and size of each shard based on their data size and compute environment. The main purpose of creating shards is to parallelize the deduplication process across a cluster of compute nodes.

Compute a hash code for each paragraph of the document. Each shard contains many documents and each document contains multiple paragraphs. For each paragraph, we compute a hash code and save it into a binary file. The authors of the CCNet paper use the first 64 bits of the SHA-1 digest of the normalized paragraphs as the key. Deduplication is done by comparing these keys. If the same key appears multiple times, the paragraphs that these keys link to are considered duplicates. You can compare the keys within one shard, in which case there might still be duplicated paragraphs across different shards. If you compare the keys across all shards, you can verify that no duplicated paragraph exists in your whole dataset. However, this can be computationally expensive.
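
The following is a minimal Python sketch of this keying step: the first 64 bits (8 bytes) of the SHA-1 digest of each normalized paragraph serve as the deduplication key. The normalization here is simplified to lowercasing and whitespace collapsing, a stand-in for the paper's own normalization.

import hashlib
import re

def paragraph_key(paragraph: str) -> bytes:
    # Simplified normalization (lowercase, collapse whitespace), then keep the
    # first 64 bits of the SHA-1 digest as the deduplication key.
    normalized = re.sub(r"\s+", " ", paragraph.lower()).strip()
    return hashlib.sha1(normalized.encode("utf-8")).digest()[:8]

def deduplicate(paragraphs):
    seen, unique = set(), []
    for p in paragraphs:
        key = paragraph_key(p)
        if key not in seen:  # an already-seen key marks a duplicate paragraph
            seen.add(key)
            unique.append(p)
    return unique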

MinHash is another popular method for estimating the similarities between two paragraphs. This technique is particularly useful for large datasets because it provides an efficient approximation of the Jaccard similarity. Paragraphs are broken down into shingles, which are overlapping sequences of words or characters of a fixed length. Multiple hashing functions are applied to each shingle. For each hash function, we find the minimum hash value across all the shingles and use that as the signature of the paragraph, called the MinHash signature. Using the MinHash signatures, we can calculate the similarity of the paragraphs. The MinHash technique can also be applied to words, sentences, or entire documents. This flexibility makes MinHash a powerful tool for a wide range of text similarity tasks. The following example shows the pseudo-code for this technique:

function MinHash_similarity(text1, text2, shingle_length, num_hash_functions):
    # Preprocess texts
    shingles1 = create_shingles(text1, shingle_length)
    shingles2 = create_shingles(text2, shingle_length)

    # Initialize MinHash signatures
    minhash_signatures = []

    # Compute MinHash signatures
    for i from 1 to num_hash_functions:
        hash_function = generate_hash_function()
        minhash1 = minimum_hash(shingles1, hash_function)
        minhash2 = minimum_hash(shingles2, hash_function)
        minhash_signatures.append((minhash1, minhash2))

    # Estimate Jaccard similarity
    common_minhashes = count_common_minhashes(minhash_signatures)
    jaccard_similarity = common_minhashes / num_hash_functions
    return jaccard_similarity

The complete steps of using MinHash for deduplication are:

  1. Break down documents into paragraphs.
  2. Apply the MinHash algorithm as shown in the preceding example and calculate the similarity scores between paragraphs.
  3. Use the similarity between paragraphs to identify duplicate pairs.
  4. Combine duplicate pairs into clusters. From each cluster, select one representative paragraph to minimize duplicates.

To enhance the efficiency of similarity searches, especially when dealing with large datasets, MinHash is often used in conjunction with additional techniques such as Locality Sensitive Hashing (LSH). LSH complements MinHash by providing a way to quickly identify potential matches through bucketing and hashing techniques without having to compare every pair of items in the dataset. This combination allows for efficient similarity searches even in massive collections of documents or data points, significantly reducing the computational overhead typically associated with such operations.
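
As a concrete example, the open source datasketch library (an illustrative choice, not something prescribed by this post) implements both MinHash and LSH, so candidate near-duplicates can be found without comparing every pair of paragraphs.

from datasketch import MinHash, MinHashLSH

def minhash_signature(text: str, num_perm: int = 128, shingle_length: int = 5) -> MinHash:
    m = MinHash(num_perm=num_perm)
    # Character shingles of a fixed length, as described above.
    for i in range(max(len(text) - shingle_length + 1, 1)):
        m.update(text[i : i + shingle_length].encode("utf-8"))
    return m

paragraphs = {
    "p1": "Amazon reported strong revenue growth in 2023.",
    "p2": "Amazon reported strong revenue growth in 2023!",
    "p3": "COVID-19 vaccines were evaluated in several scientific studies.",
}

# LSH buckets candidate near-duplicates so we avoid comparing every pair.
lsh = MinHashLSH(threshold=0.8, num_perm=128)
signatures = {pid: minhash_signature(text) for pid, text in paragraphs.items()}
for pid, sig in signatures.items():
    lsh.insert(pid, sig)

print(lsh.query(signatures["p1"]))  # likely returns ['p1', 'p2']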

It’s important to note that paragraph-level deduplication is not the only choice of granularity. As shown in Meta’s Llama 3 paper, you can also use sentence-level deduplication. The authors also applied document-level deduplication to remove near duplicate documents. The computation cost for sentence-level deduplication is even higher compared to paragraph-level deduplication. However, this approach offers more fine-grained control over duplicate content. At the same time, removing duplicated sentences might result in an incomplete paragraph, potentially affecting the coherence and context of the remaining text. Thus, the trade-off between granularity and context preservation needs to be carefully considered based on the nature of the dataset.

Creating a dataset for model fine-tuning

Fine-tuning a pre-trained LLM involves adapting it to a specific task or domain by training it on an annotated dataset in a supervised manner or through reinforcement learning techniques. The dataset considerations for fine-tuning are crucial because they directly impact the model’s performance, accuracy, and generalization capabilities. Top considerations include:

  1. Relevance and domain-specificity: The dataset should closely match the task or domain the model is being fine-tuned for. Make sure that the dataset includes diverse examples and edge cases that the model is likely to encounter. This helps improve the robustness and generalizability of the model across a range of real-world scenarios. For example, when fine-tuning a model for financial sentiment analysis, the dataset should contain financial news articles, analyst reports, stock market commentary, and corporate earnings announcements.
  2. Annotation quality: The dataset must be free of noise, errors, and irrelevant information. Annotated datasets must maintain consistency in labeling. The dataset should accurately reflect the correct answers, human preferences, or other target outcomes that the fine-tuning process aims to achieve.
  3. Dataset size and distribution: Although fine-tuning generally requires fewer tokens than pretraining (thousands compared to millions), the dataset should still be large enough to cover the breadth of the task requirements. The dataset should include a diverse set of examples that reflect the variations in language, context, and style that the model is expected to handle.
  4. Ethical considerations: Analyze and mitigate biases present in the dataset, such as gender, racial, or cultural biases. These biases can be amplified during fine-tuning, leading to unfair or discriminatory model outputs. Make sure that the dataset aligns with ethical standards and represents diverse groups and perspectives fairly.
  5. Sensible data cutoffs: While preparing the dataset, one of the considerations to understand is choosing a cutoff date for the data. Generally, depending on the speed of changes in the information, you can choose an early or late cutoff. For example, for fine-tuning an LLM for brand adherence, you can have a distant cutoff date because the brand language remains consistent for many years, whereas a dataset for generating audit and compliance letters needs an earlier cutoff date because new compliance regulations are created and updated quite often.
  6. Modalities: In the case of multi-modal models, the dataset must include various supported data types. Each data type must follow the other considerations mentioned here around annotation quality, ethical considerations, relevance, domain specificity, and so on.
  7. Synthetic data augmentation: Consider generating synthetic data to supplement real-world data, especially to help fill gaps in the dataset to make sure that it’s realistic and representative. Employing these techniques can help overcome the challenges of limited data availability, enhance model robustness, and provide better generalization across diverse tasks.

Dataset format for fine tuning

Instruction tuning is a process of further training a pretrained model on a diverse set of tasks framed as natural language instructions. This approach aims to enhance the model’s ability to understand and follow explicit instructions, improving its performance on a wide range of downstream tasks without the need for task-specific fine-tuning. A technique called preference tuning is used to align large models with human preferences and values. This approach aims to improve the quality, safety, and desirability of AI-generated outputs by incorporating human feedback into the training process. The data requirements for both of these techniques are very different, so we’ll go into further detail in the following sections.

Instruction fine tuning dataset format

The columns in the table that follows represent the key components of the instruction-tuning paradigm: Type categorizes the task or instruction type. Instruction contains the specific directive for the model. Input provides the context or data to work with. Output shows the expected or desired response. This structure is crucial for training models to understand and execute a wide variety of tasks based on natural language instructions.

Type: antonym_relation
Instruction: What is the relationship between the given pairs?
Input: Night : Day :: Right : Left
Output: The relationship between the given pairs is that they are opposites.

Type: email_subject_generation
Instruction: Generate an appropriate subject title for the following email:
Input: Hi [person name],\n\nI’m writing to ask you if you are willing to be a panelist of our workshop on multi-modality at CVPR. The workshop will be held on June 20, 2023.\n\nBest,\n[my name]
Output: Invitation to be a panelist for CVPR 2023 workshop on multi-modality

Type: analogy_completion
Instruction: To make the pairs have the same analogy, write the fourth word.
Input: Instability : Turmoil :: Change :
Output: Instability : Turmoil :: Change : Revolution

Type: question_answering_generation_from_facts
Instruction: Using the facts given, write a question-answer pair.
Input: Name: Bill Gates\nBorn: October 28, 1955 (age 66)\nEducation: Harvard University (dropped out)\nOccupation: Software developer, investor, entrepreneur\nSource: Wikipedia\nTime: August 2022
Output: Question: What is Bill Gates’s occupation?\nAnswer: He is a software developer, investor, and entrepreneur.
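
In practice, records like these are commonly stored as JSON Lines, with one example per line. The following Python sketch is illustrative only; the field names simply mirror the columns above, and your dataset format may differ.

import json

record = {
    "type": "antonym_relation",
    "instruction": "What is the relationship between the given pairs?",
    "input": "Night : Day :: Right : Left",
    "output": "The relationship between the given pairs is that they are opposites.",
}

# One JSON object per line (JSON Lines) is a common on-disk layout for instruction data.
with open("instruction_dataset.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(record) + "\n")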


Preference-tuning dataset format

For a preference-tuning dataset, the dataset typically consists of entries, each containing an input or prompt, a chosen (preferred) response, and a rejected (less preferred) response. The following is an example of a Direct Preference Optimization (DPO) dataset.

Prompt:
Artists Gustav Klimt and Carl Moll were born in which European country?\nWhat is the answer?

Chosen:
[{'content': 'You are a helpful assistant who always provide explanation. Think like you are answering a five year old.', 'role': 'system'},
{'content': 'Artists Gustav Klimt and Carl Moll were born in which European country?\nWhat is the answer?', 'role': 'user'},
{'content': 'Gustav Klimt and Carl Moll were both born in a European country called Austria. Austria is a beautiful place with mountains, forests, and many people who speak German. These two artists created lots of pretty paintings while they lived in Austria.', 'role': 'assistant'}]

Rejected:
[{'content': 'You are a helpful assistant, who always provides an explanation. Think like you are answering a five year old.', 'role': 'system'},
{'content': 'Artists Gustav Klimt and Carl Moll were born in which European country?\nWhat is the answer?', 'role': 'user'},
{'content': " Ooh ooh! *giggle* You wanna know where Gustav Klimt and Carl Moll were born? *bounce bounce* Well, they were born in a special place called Austria! *twirl* That’s right, Austria! It’s a beautiful country in Europe, and lots of talented artists like Gustav and Carl were born there! *smile* Does that make you wanna dance? *spin spin* Hehehe!", 'role': 'assistant'}]

The following is an example using the Ultrachat-feedback dataset format, which includes the following elements: prompt, chosen, rejected, messages, score_chosen, and score_rejected. This type of dataset is often used in DPO or reinforcement learning from human feedback (RLHF) to improve AI model outputs. By providing examples of preferred and non-preferred responses along with their respective scores, such a dataset can be used to train models to generate more desirable outputs.

prompt:
Let’s play a game. I say a sentence, then you make a sentence that follows up my sentence, then I give a continuation to yours and so on. You ready?

chosen:
[{'content': "Let's play a game. I say a sentence, then you make a sentence that follows up my sentence, then I give a continuation to yours and so on. you ready?", 'role': 'user'},
{'content': "I'm ready! Let's begin. Please provide your first sentence.", 'role': 'assistant'}]

rejected:
[{'content': "Let's play a game. I say a sentence, then you make a sentence that follows up my sentence, then I give a continuation to yours and so on. you ready?", 'role': 'user'},
{'content': 'Sure, I would love to play.', 'role': 'assistant'}]

messages:
[{'content': "Let's play a game. I say a sentence, then you make a sentence that follows up my sentence, then I give a continuation to yours and so on. you ready?", 'role': 'user'},
{'content': "I'm ready! Let's begin. Please provide your first sentence.", 'role': 'assistant'}]

score_chosen: 7
score_rejected: 6

In the case of Meta Llama 3, instruction-tuned models go through an iterative process of DPO preference alignment, and the dataset typically consists of triplets—a user prompt and two model responses, with one response preferred over the other. In advanced implementations, this format can be extended to include a third, edited response that’s considered superior to both original responses. The preference between responses is quantified using a multi-level rating system, ranging from marginally better to significantly better. This granular approach to preference annotation allows for a more nuanced training of the model, enabling it to distinguish between slight improvements and significant enhancements in response quality.

prompt:
Let’s play a game. I say a sentence, then you make a sentence that follows up my sentence, then I give a continuation to yours and so on. You ready?

chosen:
[{'content': "Let's play a game. I say a sentence, then you make a sentence that follows up my sentence, then I give a continuation to yours and so on. You ready?", 'role': 'user'},
{'content': "I'm ready! Let's begin. Please provide your first sentence.", 'role': 'assistant'}]

rejected:
[{'content': "Let's play a game. I say a sentence, then you make a sentence that follows up my sentence, then I give a continuation to yours and so on. You ready?", 'role': 'user'},
{'content': 'Sure, I would love to play.', 'role': 'assistant'}]

edited:
[{'content': "Let's play a game. I say a sentence, then you make a sentence that follows up my sentence, then I give a continuation to yours and so on. You ready?", 'role': 'user'},
{'content': "I'm ready! Let's begin. Please provide your first sentence.", 'role': 'assistant'}]

alignment rating: significantly better


Synthetic data creation approach for the instruction-tuning dataset format using the Self-Instruct technique

Synthetic data creation using the Self-Instruct technique is one of the most well-known approaches for generating instruction fine-tuning datasets. This method uses the capabilities of LLMs to bootstrap a diverse and extensive collection of instruction-tuning examples, significantly reducing the need for manual annotation. The following figure shows the process of the Self-Instruct technique, which is described in the following sections.


Seed data and tasks

The process begins with a small set of human-written instruction-output pairs that serve as seed data. The seed dataset serves as the foundation for building a robust collection of tasks used across various domains, with a focus on promoting task diversity. In some cases, the input field provides context to support the instruction, especially in classification tasks where output labels are limited. For non-classification tasks, the instruction alone might be self-contained without needing additional input. This dataset encourages task variety through different data formats and solutions, making it a critical step in defining the final task pool, which supports the development of diverse AI applications.

The following is an example of a seed task that identifies financial entities (companies, government institutions, or assets) and assigns a part of speech tag or entity classification based on the given sentence.

{
    "id": "finance_task_001",
    "name": "financial_entity_classification",
    "instruction": "Identify the type of financial entity in the given sentence.",
    "instances": [
      {
        "input": "Entity: Federal Reserve\nSentence: The Federal Reserve raised interest rates by 0.25% to combat inflation.",
        "output": "Government Institution, ORG"
      }
    ],
    "is_classification": true
  }

The following example requests an explanation of a financial concept, and because it isn’t a classification task, the output is more open-ended.

{
    "id": "finance_task_002",
    "name": "explain_financial_concept",
    "instruction": "Explain the concept of compound interest in two sentences.",
    "instances": [
      {
        "input": "",
        "output": "Compound interest is the interest on a loan or deposit calculated based on both the initial principal and the accumulated interest from previous periods. It allows investments to grow at a faster rate compared to simple interest, where interest is only calculated on the principal."
      }
    ],
    "is_classification": false
  }

Instruction generation

Using the seed data as a foundation, an LLM is prompted to generate new instructions. The process uses existing human-written instructions as examples to help a model (such as Anthropic’s Claude 3.5 or Meta Llama 405B) generate new instructions, which are then checked and filtered for quality before being added to the final output list. A minimal code sketch of this step follows the example tasks below.

Come up with a series of tasks:
1. Suggest a diversified investment portfolio for someone with a moderate risk tolerance.
2. What is the relation between the following financial ratios and company performance?
3. Generate a one-sentence description for each of the following economic terms.
4. Describe a situation in which market volatility can negatively impact retirement planning.
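
The following is a minimal sketch of this instruction-generation step using the Amazon Bedrock Converse API; the model ID, prompt wording, and seed instructions are illustrative rather than the original Self-Instruct implementation, and the returned candidates would still go through the filtering described later.

import boto3

# Minimal sketch: prompt an LLM on Amazon Bedrock to propose new task instructions
# from a few seed instructions. Model ID and prompt wording are illustrative.
bedrock_runtime = boto3.client("bedrock-runtime")

seed_instructions = [
    "Suggest a diversified investment portfolio for someone with a moderate risk tolerance.",
    "Generate a one-sentence description for each of the following economic terms.",
]

prompt = (
    "You are helping build an instruction-tuning dataset for the finance domain.\n"
    "Here are example tasks:\n"
    + "\n".join(f"{i + 1}. {s}" for i, s in enumerate(seed_instructions))
    + "\nCome up with 10 new, diverse tasks in the same style, one per line."
)

response = bedrock_runtime.converse(
    modelId="anthropic.claude-3-5-sonnet-20240620-v1:0",  # any Bedrock text model you have access to
    messages=[{"role": "user", "content": [{"text": prompt}]}],
    inferenceConfig={"maxTokens": 1024, "temperature": 0.7},
)
new_instructions = response["output"]["message"]["content"][0]["text"].splitlines()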

Instance generation

For each generated instruction, the model creates corresponding input-output pairs. This step produces concrete examples of how to follow the instructions. The Input-First Approach for non-classification tasks asks the model to first generate the input values, which will then be used to generate the corresponding output. This approach is especially useful for tasks such as financial calculations, where the output directly depends on specific inputs.

input_first_template = '''Come up with examples for the following tasks.
Try to generate multiple examples when possible.
If the task doesn't require additional input, you can generate the output directly.
Task: Calculate the compound interest for the given principal, rate, and time period.
Example 1
Principal: $10,000, Rate: 5%, Time: 2 years
Output: $1,025 (Compound interest using annual compounding)
Example 2
Principal: $5,000, Rate: 3%, Time: 5 years
Output: $796.25 (Compound interest using annual compounding)
...
Task: {instruction}'''

The Output-First Approach for classification tasks is designed to first define the output (class label), and then condition the input generation based on the output. This approach verifies that inputs are created in such a way that they correspond to the pre-defined class labels.

output_first_template = '''Given the classification task definition and the class labels,
generate an input that corresponds to each of the class labels.
If the task doesn't require input, just generate possible class labels.
Task: Identify whether the following financial transaction is categorized as "Income" or "Expense."
Class Label: Income
Transaction: Payment received from client for consulting services - $5,000.
Class Label: Expense
Transaction: Payment made for office rent - $1,200.
...
Task: {instruction}'''

Post-processing filters

The filtering and quality control step verifies the dataset quality by applying various mechanisms to remove low-quality or redundant examples. After generating tasks, instances are extracted and formatted, followed by filtering based on rules such as removing instances where the input and output are identical, the output is empty, or the instance is already in the task pool. Additional heuristic checks, such as incomplete generations or formatting issues, are also applied to maintain the integrity of the final dataset.
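
The following is a minimal sketch of such rule-based filters in Python, assuming instances are dicts with instruction, input, and output keys as in the seed-task format above; the incomplete-generation check is an illustrative heuristic.

def filter_instances(candidates, task_pool):
    """Apply rule-based post-processing to generated instances.

    Both arguments are lists of dicts with "instruction", "input", and "output" keys,
    matching the seed-task instance format shown earlier.
    """
    seen = {(t["instruction"], t["input"]) for t in task_pool}
    kept = []
    for instance in candidates:
        instruction = instance.get("instruction", "")
        model_input = instance.get("input", "")
        model_output = instance.get("output", "")
        if not model_output.strip():                          # drop empty outputs
            continue
        if model_input.strip() == model_output.strip():       # drop instances where input and output are identical
            continue
        if (instruction, model_input) in seen:                 # drop instances already in the task pool
            continue
        if model_output.rstrip().endswith(("...", ":")):       # crude check for incomplete generations
            continue
        seen.add((instruction, model_input))
        kept.append(instance)
    return kept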

For more details on self-instruct synthetic data creation, see Alpaca: A Strong, Replicable Instruction-Following Model for information about the data creation approach and instruction fine-tuning with the dataset. You can follow a similar approach for various fine-tuning tasks including instruction fine-tuning and direct preference optimization.

Data labeling for different downstream tasks (such as coding, summarization, and so on)

When it comes to preparing data for training an LLM, data labeling plays a crucial role because it directly controls and impacts the quality of responses a model produces. Generally, there are a variety of approaches you can take, depending on the task at hand, because we expect the LLM to work across a variety of use cases. The reason base foundation models excel at a variety of instructions and tasks is that during the pre-training process they were provided with such instructions and examples, so they can understand the instructions and perform the tasks, for example, generating code or performing named entity extraction. Training the LLM for each type of task requires task-specific labeled datasets. Let’s explore some of the common data-labeling approaches:

  • Human labelers: The most common method for data labeling is to use human labelers. In this approach, a team of human labelers annotates data for various tasks, such as general question-answering, sentiment analysis, summarization, comparing various text for similarity and differences, and so on. For each category of task, you prepare a dataset for the various tasks and ask the human labelers to provide the answers. To mitigate individual bias, you can collect multiple responses for the same question by sourcing answers from multiple human labelers and then consolidate responses into an aggregate label. Human labeling is regarded as the gold standard for collecting high-quality data at scale. However, the process of labeling by hand tends to be tedious, time-consuming, and expensive for labeling tasks that involve millions of data points, which has motivated the study of AI-assisted data annotation tools—such as Snapper—that interactively reduce the burden of manual annotation.
  • LLM-assisted labeling: Another common approach is to use another LLM to label the data and speed up the labeling process. In this approach, you use an LLM to generate responses for the various tasks, such as sentiment analysis, summarization, coding, and so on. This can be achieved in different ways; in some cases, N-shot learning approaches can improve the quality of the labels. To mitigate bias, you use the human-in-the-loop (HITL) approach to review certain responses and verify that the labels are high quality. The benefit of this approach is that it’s faster than human labeling because you can scale the LLM endpoint and serve multiple requests in parallel. However, the downside is that you have to keep iterating on the acceptance threshold for the confidence of the model’s responses. For example, if you’re preparing a dataset for financial crime detection, you have to lower the tolerance for false negatives and accept slightly higher false positives. A minimal sketch of this pattern follows this list.
  • Cohort-based labeling: Cohort-based labeling is an emerging approach where more than two LLMs are asked to generate the label for the same data. The models are then asked whether they agree with the other model’s response. The label is accepted if both models agree with each other’s response. There is another variation of this approach where instead of asking the models to agree with each other’s responses, you use a third LLM to rate the quality of the output of the other two models. It produces high quality outputs, but the cost of labeling rises exponentially because you need to make at least three LLM invocation calls for each data point to produce the final label. This approach is under active research, and we expect more orchestration tools for this in the near future.
  • RLHF-based data labeling: This approach is inspired by the RLHF fine-tuning process. Based on the task at hand, you first take a sample of unlabeled data points and have them labeled by a human labeler. You then use the labeled dataset to fine-tune an LLM. The next step is to use the fine-tuned LLM to produce multiple outputs for another subset of unlabeled data points. A human labeler ranks the outputs from best to worst, and you use this data to train a reward model. The remaining unlabeled data points are then sent through a reinforcement learning loop: a proximal policy optimization (PPO) policy, initialized from the supervised fine-tuned model, generates the label, the reward model calculates a reward for that label, and the reward is used to further update the PPO policy. For further reading on this topic, see Improving your LLMs with RLHF on Amazon SageMaker.
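
The following sketch illustrates the LLM-assisted labeling pattern with a HITL fallback; get_llm_label is a hypothetical helper standing in for your labeling model call, and the 0.85 acceptance threshold is illustrative and task-dependent.

# Minimal sketch of LLM-assisted labeling with a human-in-the-loop (HITL) fallback.
# get_llm_label is a hypothetical helper that calls your labeling LLM and returns
# (label, confidence); the 0.85 threshold is illustrative and task-dependent.

ACCEPTANCE_THRESHOLD = 0.85

def label_dataset(records, get_llm_label):
    auto_labeled, needs_human_review = [], []
    for record in records:
        label, confidence = get_llm_label(record["text"])
        if confidence >= ACCEPTANCE_THRESHOLD:
            auto_labeled.append({**record, "label": label, "source": "llm"})
        else:
            # Low-confidence items are routed to human labelers for review
            needs_human_review.append({**record, "proposed_label": label})
    return auto_labeled, needs_human_review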

Data processing architecture

The entire data processing pipeline can be implemented as a series of jobs, as illustrated in the following architecture diagram. Amazon SageMaker is used as a job facility to filter, deduplicate, and tokenize the data. The intermediate outputs of each job can be stored on Amazon Simple Storage Service (Amazon S3). Depending on the size of the final datasets, either Amazon S3 or Amazon FSx for Lustre can be used for storing the final dataset. For larger datasets, FSx for Lustre can provide significant improvements in training throughput by eliminating the need to copy or stream data directly from Amazon S3. An example pipeline using the Hugging Face DataTrove library is provided in this repo.
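
As an illustration, a single stage of this pipeline (for example, deduplication) could be launched as a SageMaker Processing job along the following lines; the container image, IAM role, script name, and S3 locations are placeholders for your own values.

from sagemaker.processing import ScriptProcessor, ProcessingInput, ProcessingOutput

# Sketch of one pipeline stage run as a SageMaker Processing job.
processor = ScriptProcessor(
    image_uri="<processing-container-image-uri>",
    command=["python3"],
    role="<execution-role-arn>",
    instance_type="ml.m5.4xlarge",
    instance_count=2,
)

processor.run(
    code="deduplicate.py",  # your filtering/deduplication script
    inputs=[ProcessingInput(source="s3://<bucket>/raw/", destination="/opt/ml/processing/input")],
    outputs=[ProcessingOutput(source="/opt/ml/processing/output", destination="s3://<bucket>/deduplicated/")],
)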

Pipeline for fine-tuning

As previously discussed, fine-tuning data typically consists of an input instruction and the desired output. This data can be sourced using manual human annotation, synthetic generation, or a combination of the two. The following architecture diagram outlines an example pipeline where fine-tuning data is generated from an existing corpus of domain-specific documents. An example of a fine-tuning dataset would take a source document as input or context and generate task-specific responses such as a summary of the document, key information extracted from the document, or answers to questions about the document.

Models provided by Amazon Bedrock can be used to generate the synthetic data, which can then be validated and modified by a human reviewer using Amazon SageMaker Ground Truth. SageMaker Ground Truth can also be used to create human-labeled fine-tuning data from scratch. For synthetic data generation, be sure to review the model provider’s acceptable usage terms to verify compliance.

Pipeline for DPO

After a model is fine-tuned, it can be deployed on model hosting services such as Amazon SageMaker. The hosted model can then be used to generate candidate responses to various prompts. Through SageMaker Ground Truth, users can then provide feedback on which responses they prefer, resulting in a preference dataset. This flow is outlined in the following architecture diagram and can be repeated multiple times as the model is tuned using the latest preference data.
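
The following sketch shows how candidate responses might be collected from the hosted model for preference labeling; the endpoint name and request payload schema are placeholders that depend on how the model was deployed.

import json
import boto3

# Sketch: collect two candidate responses from the hosted fine-tuned model so that human
# reviewers can pick the preferred one. Endpoint name and payload schema are placeholders.
sagemaker_runtime = boto3.client("sagemaker-runtime")

def generate_candidates(prompt, endpoint_name="<fine-tuned-endpoint-name>", n=2):
    candidates = []
    for _ in range(n):
        response = sagemaker_runtime.invoke_endpoint(
            EndpointName=endpoint_name,
            ContentType="application/json",
            Body=json.dumps({"inputs": prompt, "parameters": {"temperature": 0.9, "max_new_tokens": 256}}),
        )
        candidates.append(json.loads(response["Body"].read()))
    return candidates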

Conclusion

Preparing high-quality datasets for LLM training is a critical yet complex process that requires careful consideration of various factors. From extracting and cleaning data from diverse sources to deduplicating content and maintaining ethical standards, each step plays a crucial role in shaping the model’s performance. By following the guidelines outlined in this post, organizations can curate well-rounded datasets that capture the nuances of their domain, leading to more accurate and reliable LLMs.


About the Authors

Simon Zamarin is an AI/ML Solutions Architect whose main focus is helping customers extract value from their data assets. In his spare time, Simon enjoys spending time with family, reading sci-fi, and working on various DIY house projects.

Vikram Elango is an AI/ML Specialist Solutions Architect at Amazon Web Services, based in Virginia USA. Vikram helps financial and insurance industry customers with design, thought leadership to build and deploy machine learning applications at scale. He is currently focused on natural language processing, responsible AI, inference optimization and scaling ML across the enterprise. In his spare time, he enjoys traveling, hiking, cooking and camping with his family.

Qingwei Li is a Machine Learning Specialist at Amazon Web Services. He received his Ph.D. in Operations Research after he broke his advisor’s research grant account and failed to deliver the Nobel Prize he promised. Currently he helps customers in the financial service and insurance industry build machine learning solutions on AWS. In his spare time, he likes reading and teaching.

Vinayak Arannil is a Sr. Applied Scientist from the AWS Bedrock team. With several years of experience, he has worked on various domains of AI like computer vision, natural language processing etc. Vinayak led the data processing for the Amazon Titan model training. Currently, Vinayak helps build new features on the Bedrock platform enabling customers to build cutting-edge AI applications with ease and efficiency.

Vikesh Pandey is a Principal GenAI/ML Specialist Solutions Architect at AWS, helping customers from financial industries design, build and scale their GenAI/ML workloads on AWS. He carries an experience of more than a decade and a half working on entire ML and software engineering stack. Outside of work, Vikesh enjoys trying out different cuisines and playing outdoor sports.

David Ping is a Sr. Manager of AI/ML Solutions Architecture at Amazon Web Services. He helps enterprise customers build and operate machine learning solutions on AWS. David enjoys hiking and following the latest machine learning advancement.

Graham Horwood is Sr. Manager of Data Science from the AWS Bedrock team.

Read More

Design multi-agent orchestration with reasoning using Amazon Bedrock and open source frameworks

Design multi-agent orchestration with reasoning using Amazon Bedrock and open source frameworks

As generative AI capabilities evolve, successful business adoptions hinge on the development of robust problem-solving capabilities. At the forefront of this transformation are agentic systems, which harness the power of foundation models (FMs) to tackle complex, real-world challenges. By seamlessly integrating multiple agents, these innovative solutions enable autonomous collaboration, decision-making, and efficient problem-solving in diverse environments. Empirical research conducted by Amazon Web Services (AWS) scientists in conjunction with academic researchers has demonstrated the significant strides made in enhancing the reasoning capabilities through agent collaboration on competitive tasks.

This post provides step-by-step instructions for creating a collaborative multi-agent framework with reasoning capabilities to decouple business applications from FMs. It demonstrates how to combine Amazon Bedrock Agents with open source multi-agent frameworks, enabling collaborations and reasoning among agents to dynamically execute various tasks. The exercise will guide you through the process of building a reasoning orchestration system using Amazon Bedrock, Amazon Bedrock Knowledge Bases, Amazon Bedrock Agents, and FMs. We also explore the integration of Amazon Bedrock Agents with open source orchestration frameworks LangGraph and CrewAI for dispatching and reasoning.

AWS has introduced a multi-agent collaboration capability for Amazon Bedrock, enabling developers to build, deploy, and manage multiple AI agents working together on complex tasks. This feature allows for the creation of specialized agents that handle different aspects of a process, coordinated by a supervisor agent that breaks down requests, delegates tasks, and consolidates outputs. This approach improves task success rates, accuracy, and productivity, especially for complex, multi-step tasks.

For the example code and demonstration discussed in this post, refer to the agentic-orchestration GitHub repository and this AWS Workshop. You can also refer to the GitHub repo for Amazon Bedrock multi-agent collaboration code samples.

Key characteristics of an agentic service

In the context of generative AI, “agent” refers to an autonomous function that can interact with its environment, gather data, and make decisions to execute complex tasks to achieve predefined goals. Generative AI agents are autonomous, goal-oriented systems that use FMs, such as large language models (LLMs), to interact with and adapt to their environments. These agents excel in planning, problem-solving, and decision-making, using techniques such as chain-of-thought prompting to break down complex tasks. They can self-reflect, improve their processes, and expand their capabilities through tool use and collaborations with other AI models. These agents can operate independently or collaboratively, executing tasks across various domains while continuously adapting to new information and changing circumstances. Agents can lead to increased creativity and produce content at scale, automating repetitive tasks so humans can focus on strategic work, leading to cost savings. The following diagram shows the high-level architecture of the solution.

To implement an agent on AWS, you can use the Amazon Bedrock Agents Boto3 client, as demonstrated in the following code example. After the required AWS Identity and Access Management (IAM) role is created for the agent, use the create_agent API. This API requires an agent name, an FM identifier, and an instruction string; optionally, you can also provide an agent description. The created agent is not yet prepared for use; preparing the agent and then using it to invoke actions and interact with other APIs is sketched after the creation example below. Use the following code example to obtain your agent ID, which is needed for subsequent operations with the agent.

# Use the Python boto3 SDK to interact with the Amazon Bedrock Agents service
import boto3

bedrock_agent_client = boto3.client('bedrock-agent')

# Create a new Bedrock Agent
response = bedrock_agent_client.create_agent(
    agentName=<agent_name>, #customized text string
    agentResourceRoleArn=<agent_role['Role']['Arn']>, #IAM role assigned to the agent
    description=<agent_description>, #customized text string
    idleSessionTTLInSeconds=1800, 
    foundationModel=<agent_foundation_model>, #e.g. "anthropic.claude-3-sonnet-20240229-v1:0"
    instruction=<agent_instruction>, #agent instruction text string
)
agent_id = response['agent']['agentId']
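
After creation, the agent must be prepared before it can be invoked. The following minimal sketch prepares the agent and then calls it through the Amazon Bedrock Agents runtime; the agent alias ID is a placeholder for an alias you create separately (for example, with create_agent_alias), and the sample question is illustrative.

# Prepare the agent so the latest configuration becomes invocable
bedrock_agent_client.prepare_agent(agentId=agent_id)   # agent_id from the creation step above

# Invoke the agent through the runtime client; the alias ID is a placeholder
bedrock_agent_runtime = boto3.client('bedrock-agent-runtime')
response = bedrock_agent_runtime.invoke_agent(
    agentId=agent_id,
    agentAliasId='<agent_alias_id>',
    sessionId='session-001',
    inputText='What can you help me with?',
)

# The response is an event stream; concatenate the returned text chunks
completion = ''
for event in response['completion']:
    if 'chunk' in event:
        completion += event['chunk']['bytes'].decode('utf-8')
print(completion)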

Multi-agent pipelines for intra-agent collaboration

Multi-agent pipelines are orchestrated processes within AI systems that involve multiple specialized agents working together to accomplish complex tasks. Within pipelines, agents are organized in a sequential structure, with different agents handling specific subtasks or roles within the overall workflow. Agents interact with each other, often through a shared “scratchpad” or messaging system, allowing them to exchange information and build upon each other’s work. Each agent maintains its own state, which can be updated with new information as the flow progresses. Complex projects are broken down into manageable subtasks, which are then distributed among the specialized agents. The workflow includes clearly defined processes for how tasks should be orchestrated, facilitating efficient task distribution and alignment with objectives. These processes can govern both inter-agent interactions and intra-agent operations (such as how an agent interacts with tools or processes outputs). Agents can be assigned specific roles (for example, retriever or injector) to tackle different aspects of a problem.

As a practical example, consider a multi-agent pipeline for blog writing, implemented with the multi-agent framework CrewAI. To create a multi-agent pipeline with CrewAI, first define the individual agents that will participate in the pipeline. The agents in the following example are the Planner Agent, a Writer Agent, and an Editor Agent. Next, arrange these agents into a pipeline, specifying the order of task execution and how the data flows between them. CrewAI provides mechanisms for agents to pass information to each other and coordinate their actions. The modular and scalable design of CrewAI makes it well-suited for developing both simple and sophisticated multi-agent AI applications. The following diagram shows this multi-agent pipeline.

from crewai import Agent, Task, Crew, Process

# Create a blog writing multi-agent pipeline, which is comprised of a planner, a writer, and an editor agent
# This code snippet shows only the planner agent, which calls web search tools 
# and Amazon Bedrock for the LLM 
class blogAgents():
   def __init__(self, topic, model_id):
       self.topic = topic
       self.model_id = model_id
    
   def planner(self, topic, model_id):
       return Agent(
           role="Content Planner",
           goal=f"""Plan engaging and factually accurate content on {topic}.""", 
           backstory=f"""You're working on planning a blog article about the topic: {topic}.\n
                     You collect information by searching the web for the latest developments that directly relate to the {topic}.\n
                     You help the audience learn something to make informed decisions regarding {topic}.\n
                     Your work is the basis for the Content Writer to write an article on this {topic}.""",
           allow_delegation=False,
           tools=<tools_to_use>,
           llm=<Bedrock_foundation_model>,
           verbose=True
       )
......

# Create the associated blog agent tasks which are comprised of a planner, writer, and editor tasks.
# This code snippet shows only the planner task.
class blogTasks():
   def __init__(self, topic, model_id):
       self.topic = topic
       self.model_id = model_id

   def plan(self, planner, topic, model_id):  
       return Task(
           description=(
                 f"""1. Prioritize the latest trends, key players, and noteworthy news on {topic}.\n
                 2. Identify the target audience, considering their interests and pain points.\n
                 3. Develop a detailed content outline including an introduction, key points, and a call to action.\n
                 4. Include SEO keywords and relevant data or sources."""
           ),
           expected_output=f"""Convey the latest developments on the {topic} with sufficient depth as a domain expert.\n
               Create a comprehensive content plan document with an outline, audience analysis,
SEO keywords, and resources.""",
           agent=planner
       )
......

# Define planner agent and planning tasks
planner_agent = agents.planner(self.topic, self.model_id)
plan_task = tasks.plan(planner_agent, self.topic, self.model_id)
......
 
# Define an agentic pipeline to chain the agent and associated tasks
# with service components, embedding engine, and execution process
crew = Crew(
        agents=[planner_agent, writer_agent, editor_agent],
        tasks=[plan_task, write_task, edit_task],
        verbose=True,
        memory=True,
        embedder={
            "provider": "huggingface",
            "config": {"model": "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2"},
        },
        cache=True,
        process=Process.sequential # Sequential process will have tasks executed one after the other
       )
result = crew.kickoff()

As demonstrated in this code example, multi-agent pipelines are generally simple linear structures that are easy to set up and understand. They have a clear sequential flow of tasks from one agent to the next and can work well for straightforward workflows with a defined order of operations. Meanwhile, the pipeline structure can be less flexible for complex, nonlinear agent interactions, which makes it less able to handle branching logic or cycles. This might be less efficient for problems that require back-and-forth between agents. The next section addresses a graph framework for multi-agent systems, which lends itself better to more complex scenarios.

Multi-agent graph framework for asynchronous orchestration and reasoning

A multi-agent framework offers significant potential for intelligent, dynamic problem-solving that enables collaborative, specialized task execution. While these systems can enhance inference accuracy and response efficiency by dynamically activating and coordinating agents, they also present critical challenges including potential bias, limited reasoning capabilities, and the need for robust oversight. Effective multi-agent frameworks require careful design considerations such as clear leadership, dynamic team construction, effective information sharing, planning mechanisms like chain-of-thought prompting, memory systems for contextual learning, and strategic orchestration of specialized language models. As the technology evolves, balancing agent autonomy with human oversight and ethical safeguards will be crucial to unlocking the full potential of these intelligent systems while mitigating potential risks.

A multi-agent graph framework is a system that models the interactions and relationships between multiple autonomous agents using a graph-based representation. In this type of framework, agents are represented as nodes in the graph, with each agent having its own set of capabilities, goals, and decision-making processes. The edges in the graph represent the interactions, communications, or dependencies between the agents. These can include things like information sharing, task delegation, negotiation, or coordination. The graph structure allows for the modeling of complex, dynamic relationships between agents, including cycles, feedback loops, and hierarchies. The following diagram shows this architecture.

The graph-based approach provides a flexible and scalable way to represent the structure of multi-agent systems, making it easier to analyze, simulate, and reason about the emergent behaviors that arise from agent interactions. The following code snippet illustrates the process of building a graph framework designed for multi-agent orchestration using LangGraph. This framework is essential for managing and coordinating the interactions between multiple agents within a system, promoting efficient and effective communication and collaboration. Notably, it emphasizes the plug-and-play feature, which allows for dynamic changes and the flexibility to accommodate third-party agents. Frameworks with this capability can seamlessly adapt to new requirements and integrate with external systems, enhancing their overall versatility and usability.

from langgraph.graph import StateGraph, END
......
# Create a graph to orchestrate multiple agents (i.e. nodes) 
orch = StateGraph(MultiAgentState)
orch.add_node("rewrite_agent", rewrite_node)
orch.add_node('booking_assistant', bedrock_agent_node)
orch.add_node('blog_writer', blog_writer_node)
orch.add_node("router_agent", router_node)
orch.add_node('search_expert', search_expert_node)
....

# Create edges to connect agents to form a graph
orch.set_entry_point("rewrite_agent")
orch.add_edge('rewrite_agent', 'router_agent')
orch.add_conditional_edges(
    "RAG_agent",
    decide_to_search,
    {
        "to_human": "human",
        "do_search": "search_expert",
    },
)
orch.add_edge('blog_writer', 'text2image_generation')
......

# Compile the graph for agentic orchestration
graph = orch.compile(checkpointer=memory, interrupt_before = ['human'])
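
Once compiled, the graph can be run like any LangGraph application. The following usage sketch is illustrative; the state key and thread ID depend on how MultiAgentState and the checkpointer were defined earlier.

# Minimal usage sketch: run the compiled graph for one request.
config = {"configurable": {"thread_id": "session-001"}}
final_state = graph.invoke({"question": "Plan a blog post about agentic AI on AWS"}, config=config)

# Because the graph was compiled with interrupt_before=['human'], execution pauses before
# the human node; after collecting the human input, the run can be resumed on the same
# thread, for example with graph.invoke(None, config=config).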

The multi-agent graph approach is particularly useful for domains where complex, dynamic interactions between autonomous entities need to be modeled and analyzed, such as in robotics, logistics, social networks, and more. There are multiple advantages and disadvantages to the multi-agent graph-based approach over the linear multi-agent pipelines approach, which are captured below.

Advantages and limitations

The emergence of agentic services represents a transformative approach to system design. Unlike conventional AI models that adhere to fixed, predetermined workflows, agentic systems are characterized by their capacity to collaborate, adapt, and make decisions in real time. This transition from passive to active AI opens up exciting opportunities and presents unique design challenges for developers and architects. Central to agentic services is the notion of agentic reasoning, which embodies a flexible, iterative problem-solving methodology that reflects human cognitive processes. By integrating design patterns such as reflection, self-improvement, and tool utilization, we can develop AI agents that are capable of ongoing enhancement and broader functionality across various domains.

Agentic services, although promising, face several limitations that must be addressed for their successful production implementation. The complexity of managing multiple autonomous agents, especially as their numbers and scope increase, poses a significant challenge in maintaining system coherence and stability. Additionally, the emergent behaviors of these systems can be difficult to predict and understand, hindering transparency and interpretability, which are crucial for building trust and accountability. Safety and robustness are paramount concerns because unintended behaviors or failures could have far-reaching consequences, necessitating robust safeguards and error-handling mechanisms. As agentic services scale up, maintaining efficient performance becomes increasingly challenging, requiring optimized resource utilization and load balancing. Finally, the lack of widely adopted standards and protocols for agent-based systems creates interoperability issues, making it difficult to integrate these services with existing infrastructure. Addressing these limitations is essential for the widespread adoption and success of agentic services in various domains.

Advantages:

  • More flexible representation of agent interactions using a graph structure
  • Better suited for complex workflows with nonlinear agent communication
  • Can more easily represent cycles and branching logic between agents
  • Potentially more scalable for large multi-agent systems
  • Clearer visualization of overall agent system structure

Disadvantages:

  • More complex initial setup compared to linear pipelines
  • Can require more upfront planning to design the graph structure
  • Can require extra resource usage and longer response times

Next steps

In the next phase of multi-agent orchestration, our focus will be on enhancing the reasoning, reflection, and self-correction capabilities of our agents. This involves developing advanced algorithms (such as tree-of-thoughts (ToT) prompting, Monte Carlo tree search (MCTS), and others) that allow agents to learn from their peer interactions, adapt to new situations, and correct their behaviors based on feedback. Additionally, we’re working on creating a production-ready framework that can accommodate a variety of agentic services. This framework will be designed to be flexible and scalable, enabling seamless integration of different types of agents and services. These efforts are currently underway, and we’ll provide a detailed update on our progress in the next blog post. Stay tuned for more insights into our innovative approach to multi-agent orchestration.

Conclusion

Multi-agent orchestration and reasoning represent a significant leap forward in generative AI production adoption, offering unprecedented potential for complex problem-solving and decision-making while decoupling your applications from individual FMs. It’s also crucial to acknowledge and address the limitations, including scalability challenges, high latency, and potential incompatibility among different agents. As we look to the future, enhancing the self- and intra-agent reasoning, reflection, and self-correction capabilities of our agents will be paramount. This will involve developing more sophisticated algorithms for metacognition, improving inter-agent communication protocols, and implementing robust error detection and correction mechanisms.

For the example code and demonstration discussed in this post, refer to the agentic-orchestration GitHub repository and this AWS Workshop. You can also refer to the GitHub repo for Amazon Bedrock multi-agent collaboration code samples.

The authors wish to express their gratitude to Mark Roy, Maria Laderia Tanke, and Max Iguer for their insightful contributions, as well as to Nausheen Sayed for her relentless coordination.


About the authors

Alfred Shen is a Senior GenAI Specialist at AWS. He has been working in Silicon Valley, holding technical and managerial positions in diverse sectors including healthcare, finance, and high-tech. He is a dedicated applied AI/ML researcher, concentrating on agentic solutions and multimodality.

Anya Derbakova is a Senior Startup Solutions Architect at AWS, specializing in Healthcare and Life Science technologies. A University of North Carolina graduate, she previously worked as a Principal Developer at Blue Cross Blue Shield Association. Anya is recognized for her contributions to AWS professional development, having been featured on the AWS Developer Podcast and participating in multiple educational series. She co-hosted a six-part mini-series on AWS Certification Exam Prep, focusing on cost-optimized cloud architecture strategies. Additionally, she was instrumental in the “Get Schooled on…Architecting” podcast, which provided comprehensive preparation for the AWS Solutions Architect Exam.

Read More

How Fastweb fine-tuned the Mistral model using Amazon SageMaker HyperPod as a first step to build an Italian large language model

How Fastweb fine-tuned the Mistral model using Amazon SageMaker HyperPod as a first step to build an Italian large language model

This post is co-written with Marta Cavalleri and Giovanni Germani from Fastweb, and Claudia Sacco and Andrea Policarpi from BIP xTech.

AI’s transformative impact extends throughout the modern business landscape, with telecommunications emerging as a key area of innovation. Fastweb, one of Italy’s leading telecommunications operators, recognized the immense potential of AI technologies early on and began investing in this area in 2019. With a vision to build a large language model (LLM) trained on Italian data, Fastweb embarked on a journey to make this powerful AI capability available to third parties.

Training an LLM is a compute-intensive and complex process, which is why Fastweb, as a first step in their AI journey, used AWS generative AI and machine learning (ML) services such as Amazon SageMaker HyperPod.

SageMaker HyperPod can provision and maintain large-scale, resilient compute clusters powered by thousands of accelerators such as AWS Trainium and NVIDIA H200 and H100 graphics processing units (GPUs). Its flexibility also allowed Fastweb to deploy a small, agile, on-demand cluster, enabling efficient resource utilization and cost management that aligned well with the project’s requirements.

In this post, we explore how Fastweb used cutting-edge AI and ML services to embark on their LLM journey, overcoming challenges and unlocking new opportunities along the way.

Fine-tuning Mistral 7B on AWS

Fastweb recognized the importance of developing language models tailored to the Italian language and culture. To achieve this, the team built an extensive Italian language dataset by combining public sources and acquiring licensed data from publishers and media companies. Using this data, Fastweb fine-tuned the Mistral 7B model, a state-of-the-art LLM, in their first experiment with LLM training, successfully adapting it to handle tasks such as summarization, question answering, and creative writing in Italian. The fine-tuned model applies a nuanced understanding of Italian culture to its responses, providing contextually appropriate and culturally sensitive output.

The team opted for fine-tuning on AWS. This strategic decision was driven by several factors:

  • Efficient data preparation – Building a high-quality pre-training dataset is a complex task, involving assembling and preprocessing text data from various sources, including web sources and partner companies. Because the final, comprehensive pre-training dataset was still under construction, it was essential to begin with an approach that could adapt existing models to Italian.
  • Early results and insights – Fine-tuning allowed the team to achieve early results in training models on the Italian language, providing valuable insights and preliminary Italian language models. This enabled the engineers to iteratively improve the approach based on initial outcomes.
  • Computational efficiency – Fine-tuning requires significantly less computational power and less time to complete compared to a complete model pre-training. This approach streamlined the development process and allowed for a higher volume of experiments within a shorter time frame on AWS.

To facilitate the process, the team created a comprehensive dataset encompassing a wide range of tasks, constructed by translating existing English datasets and generating synthetic elements. The dataset was stored in an Amazon Simple Storage Service (Amazon S3) bucket, which served as a centralized data repository. During the training process, our SageMaker HyperPod cluster was connected to this S3 bucket, enabling effortless retrieval of the dataset elements as needed.

The integration of Amazon S3 and the SageMaker HyperPod cluster exemplifies the power of the AWS ecosystem, where various services work together seamlessly to support complex workflows.

Overcoming data scarcity with translation and synthetic data generation

When fine-tuning a custom version of the Mistral 7B LLM for the Italian language, Fastweb faced a major obstacle: high-quality Italian datasets were extremely limited or unavailable. To tackle this data scarcity challenge, Fastweb had to build a comprehensive training dataset from scratch to enable effective model fine-tuning.

While establishing strategic agreements to acquire licensed data from publishers and media companies, Fastweb employed two main strategies to create a diverse and well-rounded dataset: translating open source English training data into Italian and generating synthetic Italian data using AI models.

To use the wealth of information available in English, Fastweb translated open source English training datasets into Italian. This approach made valuable data accessible and relevant for Italian language training. Both LLMs and open source translation tools were used for this process.

The open source Argos Translate tool was used for bulk translation of datasets with simpler content. Although LLMs offer superior translation quality, Argos Translate is free, extremely fast, and well-suited for efficiently handling large volumes of straightforward data. For complex datasets where accuracy was critical, LLMs were employed to provide high-quality translations.

To further enrich the dataset, Fastweb generated synthetic Italian data using LLMs. This involved creating a variety of text samples covering a wide range of topics and tasks relevant to the Italian language. High-quality Italian web articles, books, and other texts served as the basis for training the LLMs to generate authentic-sounding synthetic content that captured the nuances of the language.

The resulting sub-datasets spanned diverse subjects, including medical information, question-answer pairs, conversations, web articles, science topics, and more. The tasks covered were also highly varied, encompassing question answering, summarization, creative writing, and others.

Each subset generated through translation or synthetic data creation underwent meticulous filtering to maintain quality and diversity. A similarity check was performed to deduplicate the data; if two elements were found to be too similar, one was removed. This step was crucial in maintaining variability and preventing bias from repetitive or overly similar content.

The deduplication process involved embedding dataset elements using a text embedder, then computing cosine similarity between the embeddings to identify similar elements. Meta’s FAISS library, renowned for its efficiency in similarity search and clustering of dense vectors, was used as the underlying vector database due to its ability to handle large-scale datasets effectively.
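
The following is a minimal sketch of this deduplication step; embed() and load_dataset_elements() are placeholders for the team’s text embedder and data loader, and the 0.9 cosine-similarity threshold is illustrative.

import faiss
import numpy as np

# Sketch of embedding-based deduplication with FAISS.
texts = load_dataset_elements()
embeddings = np.array([embed(t) for t in texts], dtype="float32")
faiss.normalize_L2(embeddings)                    # after L2 normalization, inner product equals cosine similarity

index = faiss.IndexFlatIP(embeddings.shape[1])    # flat inner-product index
index.add(embeddings)

threshold = 0.9
deduplicated = []
for i, vector in enumerate(embeddings):
    scores, ids = index.search(vector.reshape(1, -1), 2)                # top-2: the element itself plus its nearest neighbor
    neighbors = [(s, j) for s, j in zip(scores[0], ids[0]) if j != i]
    if not any(s >= threshold and j < i for s, j in neighbors):         # keep only the first occurrence of near-duplicates
        deduplicated.append(texts[i])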

After filtering and deduplication, the remaining subsets were postprocessed and combined to form the final fine-tuning dataset, comprising 300,000 training elements. This comprehensive dataset enabled Fastweb to effectively fine-tune their custom version of the Mistral 7B model, achieving high performance and diversity across a wide range of tasks and topics.

All data generation and processing steps were run in parallel directly on the SageMaker HyperPod cluster nodes, using a unique working environment and highlighting the cluster’s versatility for various tasks beyond just training models.

The following diagram illustrates two distinct data pipelines for creating the final dataset: the upper pipeline uses translations of existing English datasets into Italian, and the lower pipeline employs custom generated synthetic data.

Dataset creation pipelines

The computational cost of training an LLM

The computational cost of training LLMs scales approximately with the number of parameters and the amount of training data. As a general rule, for each model parameter being trained, approximately 24 bytes of memory are required. This means that to fully fine-tune a 7 billion parameter model like Mistral 7B, at least 156 GB of hardware memory is necessary, not including the additional overhead of loading training data.

The following table provides additional examples.

LLM Model Size vs. Training Memory
Number of Parameters Memory Requirement
500 million 12 GB
1 billion 23 GB
2 billion 45 GB
3 billion 67 GB
5 billion 112 GB
7 billion 156 GB
10 billion 224 GB
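
The estimates in the table follow directly from the 24-bytes-per-parameter rule of thumb, as the following short calculation sketch shows (values in binary gigabytes; rounding differs slightly from the table for the smallest models).

BYTES_PER_PARAM = 24  # rule of thumb: weights, gradients, and optimizer states per trainable parameter

def training_memory_gib(num_params):
    """Approximate GPU memory (GiB) for full fine-tuning, excluding space for data batches."""
    return num_params * BYTES_PER_PARAM / 2**30

for params in [0.5e9, 1e9, 2e9, 3e9, 5e9, 7e9, 10e9]:
    print(f"{params / 1e9:g}B parameters -> ~{training_memory_gib(params):.0f} GiB")

# 7e9 parameters x 24 bytes is roughly 156 GiB, in line with the figure quoted above for a 7B model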

Parameter-efficient fine-tuning (PEFT) methods minimize the number of trainable parameters, whereas quantization reduces the number of bits per parameter, often with minimal negative impact on the final training results.
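
As an illustration of PEFT, the following sketch applies LoRA adapters to a causal language model using the Hugging Face peft library; the base model ID, target modules, and hyperparameters are illustrative choices, not the configuration Fastweb used.

from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Sketch of parameter-efficient fine-tuning (PEFT) with LoRA adapters.
model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")

lora_config = LoraConfig(
    r=16,                                  # rank of the low-rank update matrices
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],   # attention projections to adapt
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()         # typically well under 1% of the 7B parameters are trainable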

Despite these memory-saving techniques, fine-tuning large models still demands substantial GPU memory and extended training times. This makes distributed training essential, allowing the workload to be shared across multiple GPUs, thereby enabling the efficient handling of such large-scale computational tasks.

The following table and figure illustrate the allocation of GPU memory during each phase of LLM training.

Training requirements

Solution overview

Training LLMs often requires significant computational resources that can exceed the capabilities of a single GPU. Distributed training is a powerful technique that addresses this challenge by distributing the workload across multiple GPUs and nodes, enabling parallel processing and reducing training time. SageMaker HyperPod simplifies the process of setting up and running distributed training jobs, providing preconfigured environments and libraries specifically designed for this purpose.

There are two main techniques for distributed training: data parallelization and model parallelization. Data parallelization involves distributing the training data across multiple GPUs, whereas model parallelization splits the model itself across different GPUs.

To take advantage of distributed training, a cluster of interconnected GPUs, often spread across multiple physical nodes, is required. SageMaker HyperPod allows for both data and model parallelization techniques to be employed simultaneously, maximizing the available computational resources. Also, SageMaker HyperPod provides resilience through features like automatic fault detection and recovery, which are crucial for long-running training jobs. SageMaker HyperPod allows for the creation of personalized Conda environments, enabling the installation of necessary libraries and tools for distributed training.

One popular library for implementing distributed training is DeepSpeed, a Python optimization library that handles distributed training and makes it memory-efficient and fast by enabling both data and model parallelization. The choice to use DeepSpeed was driven by the availability of an extensive, already-developed code base, ready to be employed for training experiments. The high flexibility and environment customization capabilities of SageMaker HyperPod made it possible to create a personalized Conda environment with all the necessary libraries installed, including DeepSpeed.
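
The following is a minimal sketch of enabling DeepSpeed in a training script; the ZeRO stage, batch sizes, and precision settings are illustrative rather than Fastweb’s actual configuration.

import deepspeed
from transformers import AutoModelForCausalLM

# Minimal sketch of wrapping a model with DeepSpeed for distributed fine-tuning.
model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")

ds_config = {
    "train_micro_batch_size_per_gpu": 2,
    "gradient_accumulation_steps": 8,
    "bf16": {"enabled": True},
    "zero_optimization": {"stage": 2},     # shard optimizer states and gradients across the data-parallel GPUs
}

model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)
# The training script is then launched on each node with the deepspeed launcher (or srun under Slurm).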

The following diagram illustrates the two key parallelization strategies offered by DeepSpeed: data parallelism and model parallelism. Data parallelism involves replicating the entire model across multiple devices, with each device processing a distinct batch of training data. In contrast, model parallelism distributes different parts of a single model across multiple devices, enabling the training of large models that exceed the memory capacity of a single device.

Data parallelization and model parallelization

To help meet the demanding computational requirements of training LLMs, we used the power and flexibility of SageMaker HyperPod clusters, orchestrated with Slurm. While HyperPod also supports orchestration with Amazon EKS, our research team had prior expertise with Slurm. The cluster configuration was tailored to our specific training needs, providing optimal resource utilization and cost-effectiveness.

The SageMaker HyperPod cluster architecture consisted of a controller machine to orchestrate the training job’s coordination and resource allocation. The training tasks were run by two compute nodes, which were g5.12xlarge instances equipped with high-performance GPUs. These compute nodes handled the bulk of the computational workload, using their GPUs to accelerate the training process.

The AWS managed high-performance Lustre file system (Amazon FSx for Lustre) mounted on the nodes provided high-speed data access and transfer rates, which are essential for efficient training operations.

SageMaker HyperPod can launch large clusters for pre-training LLMs with thousands of GPUs, but one of its key advantages is its flexibility: it also allows for the creation of small, agile, on-demand clusters. This versatility made it possible to use resources only when needed, avoiding unnecessary costs.

For the DeepSpeed configuration, we followed the standard recommended setup, enabling data and model parallelism across the two g5.12xlarge nodes of the cluster, for a total of 8 GPUs.

Although more advanced techniques were available, such as offloading some computation to the CPU during training, our cluster was sized with a sufficiently high GPU memory margin. With 192 GiB (206 GB) of overall GPU memory available, even accounting for the additional memory needed to keep dataset batches on the GPUs during training, we had ample resources to train a 7B parameter model without the need for these advanced techniques. The following figure describes the infrastructure setup of our training solution.

Architecture diagram

Training results and output examples

After completing the training process, Fastweb’s fine-tuned language model demonstrated a significant performance improvement on Italian language tasks compared to the base model. Evaluated on an internal benchmark dataset, the fine-tuned model achieved an average accuracy increase of 20% across a range of tasks designed to assess its general understanding of the Italian language.

The benchmark tasks focused on three key areas: question answering, common sense reasoning, and next word prediction. Question answering tasks tested the model’s ability to comprehend and provide accurate responses to queries in Italian. Common sense reasoning evaluated the model’s grasp of common sense knowledge and its capacity to make logical inferences based on real-world scenarios. Next word prediction assessed the model’s understanding of language patterns and its ability to predict the most likely word to follow in a given context.

To evaluate the fine-tuned model’s performance, we initiated our interaction by inquiring about its capabilities. The model responded by enumerating its primary functions, emphasizing its ability to address Fastweb-specific topics. The response was formulated in correct Italian with a very natural syntax, as illustrated in the following example.

Dialog 1 - How can you help me?

Afterwards, we asked the model to generate five titles for a presentation on the topic of AI.

Generate titles for a slide deck about AI

Just for fun, we asked what the most famous sandwich is. The model responded with a combination of typical Italian ingredients and added that there is a wide variety of choices.

What is the most famous panini in Italy?

Lastly, we asked the model to provide us with a useful link to understand the recent EU AI Act. The model provided a working link, along with a helpful description.

Tell me something about EU AI Act

Conclusion

Using SageMaker HyperPod, Fastweb successfully fine-tuned the Mistral 7B model as a first step in their generative AI journey, significantly improving its performance on tasks involving the Italian language.

Looking ahead, Fastweb plans to deploy their next models also on Amazon Bedrock using the Custom Model Import feature. This strategic move will enable Fastweb to quickly build and scale new generative AI solutions for their customers, using the broad set of capabilities available on Amazon Bedrock.

By harnessing Amazon Bedrock, Fastweb can further enhance their offerings and drive digital transformation for their customers. This initiative aligns with Fastweb’s commitment to staying at the forefront of AI technology and fostering innovation across various industries.

With their fine-tuned language model running on Amazon Bedrock, Fastweb will be well-positioned to deliver cutting-edge generative AI solutions tailored to the unique needs of their customers. This will empower businesses to unlock new opportunities, streamline processes, and gain valuable insights, ultimately driving growth and competitiveness in the digital age.

Fastweb’s decision to use the Custom Model Import feature in Amazon Bedrock underscores the company’s forward-thinking approach and their dedication to providing their customers with the latest and most advanced AI technologies. This collaboration with AWS further solidifies Fastweb’s position as a leader in digital transformation and a driving force behind the adoption of innovative AI solutions across industries.

To learn more about SageMaker HyperPod, refer to Amazon SageMaker HyperPod and the Amazon SageMaker HyperPod workshop.


About the authors

Marta Cavalleri is the Manager of the Artificial Intelligence Center of Excellence (CoE) at Fastweb, where she leads teams of data scientists and engineers in implementing enterprise AI solutions. She specializes in AI operations, data governance, and cloud architecture on AWS.

Giovanni Germani is the Manager of Architecture & Artificial Intelligence CoE at Fastweb, where he leverages his extensive experience in Enterprise Architecture and digital transformation. With over 12 years in Management Consulting, Giovanni specializes in technology-driven projects across telecommunications, media, and insurance industries. He brings deep expertise in IT strategy, cybersecurity, and artificial intelligence to drive complex transformation programs.

Claudia Sacco is an AWS Professional Solutions Architect at BIP xTech, collaborating with Fastweb’s AI CoE and specialized in architecting advanced cloud and data platforms that drive innovation and operational excellence. With a sharp focus on delivering scalable, secure, and future-ready solutions, she collaborates with organizations to unlock the full potential of cloud technologies. Beyond her professional expertise, Claudia finds inspiration in the outdoors, embracing challenges through climbing and trekking adventures with her family.

Andrea Policarpi is a Data Scientist at BIP xTech, collaborating with Fastweb’s AI CoE. With a strong foundation in computer vision and natural language processing, he is currently exploring the world of Generative AI and leveraging its powerful tools to craft innovative solutions for emerging challenges. In his free time, Andrea is an avid reader and enjoys playing the piano to relax.

Giuseppe Angelo Porcelli is a Principal Machine Learning Specialist Solutions Architect for Amazon Web Services. With several years of software engineering and an ML background, he works with customers of any size to understand their business and technical needs and design AI and ML solutions that make the best use of the AWS Cloud and the Amazon Machine Learning stack. He has worked on projects in different domains, including MLOps, computer vision, and NLP, involving a broad set of AWS services. In his free time, Giuseppe enjoys playing football.

Adolfo Pica has a strong background in cloud computing, with over 20 years of experience in designing, implementing, and optimizing complex IT systems and architectures and with a keen interest and hands-on experience in the rapidly evolving field of generative AI and foundation models. He has expertise in AWS cloud services, DevOps practices, security, data analytics and generative AI. In his free time, Adolfo enjoys following his two sons in their sporting adventures in taekwondo and football.

Maurizio Pinto is a Senior Solutions Architect at AWS, specialized in cloud solutions for telecommunications. With extensive experience in software architecture and AWS services, he helps organizations navigate their cloud journey while pursuing his passion for AI’s transformative impact on technology and society.


Using natural language in Amazon Q Business: From searching and creating ServiceNow incidents and knowledge articles to generating insights

Using natural language in Amazon Q Business: From searching and creating ServiceNow incidents and knowledge articles to generating insights

Many enterprise customers across various industries are looking to adopt generative AI to drive innovation, boost user productivity, and enhance customer experience. Generative AI–powered assistants such as Amazon Q Business can be configured to answer questions, provide summaries, generate content, and securely complete tasks based on data and information in your enterprise systems. Amazon Q Business understands natural language and allows users to receive immediate, permissions-aware responses from enterprise data sources with citations. This capability supports use cases such as IT, HR, and help desk.

With custom plugins for Amazon Q Business, you can enhance the application environment to enable your users to use natural language to perform specific tasks related to third-party applications — such as Jira, Salesforce, and ServiceNow — directly from within their web experience chat.

Enterprises that have adopted ServiceNow can improve their operations and boost user productivity by using Amazon Q Business for various use cases, including incident and knowledge management. Users can search ServiceNow knowledge base (KB) articles and incidents in addition to being able to create, manage, and track incidents and KB articles, all from within their web experience chat.

In this post, we’ll demonstrate how to configure an Amazon Q Business application and add a custom plugin that gives users the ability to use a natural language interface provided by Amazon Q Business to query real-time data and take actions in ServiceNow. By the end of this hands-on session, you should be able to:

  • Create an Amazon Q Business application and integrate it with ServiceNow using a custom plugin.
  • Use natural language in your Amazon Q web experience chat to perform read and write actions in ServiceNow such as querying and creating incidents and KB articles in a secure and governed fashion.

Prerequisites

Before proceeding, make sure that you have the necessary AWS account permissions and services enabled, along with access to a ServiceNow environment with the required privileges for configuration.

AWS

ServiceNow

  • Obtain a ServiceNow Personal Developer Instance or use a clean ServiceNow developer environment. You will need an account that has admin privileges to perform the configuration steps in ServiceNow.

Solution overview

The following architecture diagram illustrates the workflow for the Amazon Q Business web experience, enhanced to integrate seamlessly with ServiceNow.

Solution Overview

The implementation includes the following steps:

  1. The solution begins with configuring Amazon Q Business using the AWS Management Console. This includes setting up the application environment, adding users to AWS IAM Identity Center, selecting the appropriate subscription tier, and configuring the web experience for users to interact with. The environment can optionally be configured to provide real-time data retrieval using a native retriever, which pulls information from indexed data sources, such as Amazon Simple Storage Service (Amazon S3), during interactions.
  2. The next step involves adjusting the global controls and response settings for the application environment guardrails to allow Amazon Q Business to use its large language model (LLM) knowledge to generate responses when it cannot find responses from your connected data sources.
  3. Integration with ServiceNow is achieved by setting up an OAuth Inbound application endpoint in ServiceNow, which authenticates and authorizes interactions between Amazon Q Business and ServiceNow. This involves creating an OAuth API endpoint in ServiceNow and using the web experience URL from Amazon Q Business as the callback URL. The setup makes sure that Amazon Q Business can securely perform actions in ServiceNow with the same scoped permissions as the user signing in to ServiceNow.
  4. The final step of the solution involves enhancing the application environment with a custom plugin for ServiceNow using APIs defined in an OpenAPI schema. The plugin allows Amazon Q Business to securely interact with ServiceNow’s REST APIs, enabling operations such as querying, creating, and updating records dynamically and in real time.

Configuring the Amazon Q Business application

To create an Amazon Q Business application, sign in to the Amazon Q Business console.
As a prerequisite to creating an Amazon Q Business application, follow the instructions in the Configuring an IAM Identity Center instance section. Amazon Q Business integrates with IAM Identity Center to manage user access to your Amazon Q Business application. This is the recommended method for managing human access to AWS resources and the method used for this post.

Amazon Q Business also supports identity federation through IAM. When you use identity federation, you can manage users with your enterprise identity provider (IdP) and use IAM to authenticate users when they sign in to Amazon Q Business.

Create and configure the Amazon Q Business application:

  1. In the Amazon Q Business console, choose Application from the navigation pane and then choose Create application.
  2. Enter the following information for your Amazon Q Business application:
    • Application name: Enter a name for quick identification, such as my-demo-application.
    • Service access: Select Create and use a new service-linked role (SLR). A service-linked role is a unique type of IAM role that is linked directly to Amazon Q Business. Service-linked roles are predefined by Amazon Q Business and include the permissions that the service requires to call other AWS services on your behalf.
    • Choose Create.
  3.  After creating your Amazon Q Business application environment, create and select the retriever and provision the index that will power your generative AI web experience. The retriever pulls data from the index in real time during a conversation. On the Select Retriever page:
    • Retrievers: Select Use native retriever.
    • Index provisioning: Select Starter, which is ideal for proof-of-concept or developer workloads. See Index types for more information.
    • Number of units: Enter 1. This indicates the capacity units that you want to provision for your index. Each unit is 20,000 documents.
    • Choose Next.

Select Retriever

  4. After you select a retriever for your Amazon Q Business application environment, you can optionally connect other data sources to it. Because a data source isn’t required for this session, we won’t configure one. For more information on connecting data sources to an Amazon Q Business application, see connecting data sources.
    • Choose Next.
  5. As an account admin, you can add users to your IAM Identity Center instance from the Amazon Q Business console. After you add users or groups to an application environment, you can then choose the Amazon Q Business tier for each user or group. On the Add groups and users page:
    • Choose Add groups and users.
    • In the Add new users dialog box that opens, enter the details of the user. The details you must enter for a single user include: Username, First name, Last name, Email address, Confirm email address, and Display name.
    • Choose Next and then Add. The user is automatically added to an IAM Identity Center directory and an email invitation to join Identity Center is sent to the email address provided.
    • After adding a user or group, choose the Amazon Q Business subscription tier for each user or group. From the Current subscription dropdown menu, select Q Business Pro.
    • For the Web experience service access, select Create and use a new service role.
    • Choose Create application.

    Add groups and users

Upon successful completion, Amazon Q Business returns a web experience URL that you can share with the users you added to your application environment. The Web experience URL (in this case: https://xxxxxxxx.chat.qbusiness.us-east-1.on.aws/) will be used when creating an OAuth application endpoint in ServiceNow. Note that your web experience URL will be different from the one shown here.

Application Created
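The console steps above can also be scripted. The following is a minimal sketch using the boto3 qbusiness client, assuming an IAM Identity Center instance is already available; all identifiers are placeholders, and the parameter names should be verified against the current boto3 documentation.

import boto3

qbusiness = boto3.client("qbusiness")

# Create the application (placeholder ARN; verify the parameters in the boto3 docs).
app = qbusiness.create_application(
    displayName="my-demo-application",
    identityCenterInstanceArn="arn:aws:sso:::instance/ssoins-EXAMPLE",
)
app_id = app["applicationId"]

# Provision a Starter index with one capacity unit (20,000 documents).
index = qbusiness.create_index(
    applicationId=app_id,
    displayName="my-demo-index",
    type="STARTER",
    capacityConfiguration={"units": 1},
)

# Attach a native retriever that pulls from the index during conversations.
qbusiness.create_retriever(
    applicationId=app_id,
    type="NATIVE_INDEX",
    displayName="my-demo-retriever",
    configuration={"nativeIndexConfiguration": {"indexId": index["indexId"]}},
)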

Enhancing an Amazon Q Business application with guardrails

By default, an Amazon Q Business application is configured to respond to user chat queries using only enterprise data. Because we didn’t configure a data source for the purpose of this post, you will use Admin controls and guardrails to allow Amazon Q to use its LLM world knowledge to generate responses when it cannot find responses from your connected data sources.

Configure the application guardrails to allow fallback to LLM knowledge (a programmatic sketch follows these steps):

  1. From the Amazon Q Business console, choose Applications in the navigation pane. Select the name of your application from the list of applications.
  2. From the navigation pane, choose Enhancements, and then choose Admin Controls and guardrails.
  3. In Global Controls, choose Edit.
  4. In Response settings under Application guardrails, select Allow Amazon Q to fall back to LLM knowledge.

create guardrails
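The same fallback setting can be changed programmatically. The following is a minimal sketch, assuming the qbusiness UpdateChatControlsConfiguration API’s responseScope setting controls this behavior; confirm the field names and values in the current boto3 documentation.

import boto3

qbusiness = boto3.client("qbusiness")

# Allow responses from LLM world knowledge in addition to connected data sources.
# (ENTERPRISE_CONTENT_ONLY restricts answers to enterprise data only.)
qbusiness.update_chat_controls_configuration(
    applicationId="<application-id>",           # placeholder
    responseScope="EXTENDED_KNOWLEDGE_ENABLED",
)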

Configuring ServiceNow

To allow Amazon Q Business to connect to your ServiceNow instance, you need to create an OAuth inbound application endpoint. OAuth-based authentication validates the identity of the client that attempts to establish trust with the system by using an authentication protocol. For more information, see OAuth Inbound and Outbound authentication.

Create an OAuth application endpoint for external client applications to access the ServiceNow instance:

  1. In the ServiceNow console, navigate to All, then System OAuth, then Application Registry and then choose New. On the interceptor page, select Create an OAuth API endpoint for external clients and then fill in the form with details for Name and Redirect URL. The other fields are automatically generated by the ServiceNow OAuth server.
    • The Redirect URL is the callback URL that the authorization server redirects to. Enter the web experience URL of your Amazon Q Business application environment (which is the client requesting access to the resource), with /oauth/callback appended.
    • For this example, the URL is: https://xxxxxxxx.chat.qbusiness.us-east-1.on.aws/oauth/callback
  2. For Auth Scope, set the value to useraccount. The scope API response parameter defines the amount of access granted by the access token, which means that the access token has the same rights as the user account that authorized the token. For example, if Abel Tuter authorizes an application by providing login credentials, then the resulting access token grants the token bearer the same access privileges as Abel Tuter.
  3. Choose Submit.

This creates an OAuth client application record and generates a client ID and client secret, which Amazon Q Business needs to access the restricted resources on the instance. You will need this authentication information (client ID and client secret) in the following custom plugin configuration process.

ServiceNow App Registry OAuth
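Amazon Q Business drives this OAuth flow for you, but it can help to see how the pieces you just configured fit together. The following sketch builds the authorization request that sends a user to ServiceNow’s oauth_auth.do endpoint; the instance URL, client ID, and redirect URL are placeholders from the preceding steps.

from urllib.parse import urlencode

instance = "https://devxxxxxx.service-now.com"   # your ServiceNow instance (placeholder)
client_id = "<client-id-from-application-registry>"
redirect_uri = "https://xxxxxxxx.chat.qbusiness.us-east-1.on.aws/oauth/callback"

params = {
    "response_type": "code",        # authorization code grant
    "client_id": client_id,
    "redirect_uri": redirect_uri,   # must match the Redirect URL registered above
    "scope": "useraccount",         # the Auth Scope configured in step 2
}
print(f"{instance}/oauth_auth.do?{urlencode(params)}")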

Enhancing the Amazon Q Business application environment with custom plugins for ServiceNow

To integrate with external applications, Amazon Q Business uses APIs, which are configured as part of the custom plugins.

Before creating a custom plugin, you need to create or edit an OpenAPI schema, outlining the different API operations that you want to enable for your custom plugin. Amazon Q Business uses the configured third-party OpenAPI specifications to dynamically determine which API operations to perform to fulfill a user request. Therefore, the OpenAPI schema definition has a big impact on API selection accuracy and might require design optimizations. In order to maximize accuracy and improve efficiency with an Amazon Q Business custom plugin, follow the best practices for configuring OpenAPI schema definitions.

To configure a custom plugin, you must define at least one and a maximum of eight API operations that can be invoked. To define the API operations, create an OpenAPI schema in JSON or YAML format. You can create OpenAPI schema files and upload them to Amazon S3. Alternatively, you can use the OpenAPI text editor in the console, which will validate your schema.

For this post, a working sample of an OpenAPI schema for ServiceNow is provided in JSON format. Before using it, edit the template file and replace YOUR_SERVICENOW_INSTANCE_URL in the servers and securitySchemes sections with the URL of your ServiceNow instance.

You can use the REST API Explorer to browse available APIs, API versions, and methods for each API. The explorer enables you to test REST API requests straight from the user interface. The Table API provides endpoints that allow you to perform create, read, update, and delete (CRUD) operations on existing tables. The calling user must have sufficient roles to access the data in the table specified in the request. For additional information on assigning roles, see Managing roles.

{
  "openapi": "3.0.1",
  "info": {
    "title": "Table API",
    "description": "Allows you to perform create, read, update and delete (CRUD) operations on existing tables",
    "version": "latest"
  },
  "externalDocs": {
    "url": "https://docs.servicenow.com/?context=CSHelp:REST-Table-API"
  },
  "servers": [
    {
      "url": "YOUR_SERVICENOW_INSTANCE_URL"
    }
  ],
  "paths": {
    "/api/now/table/{tableName}": {
      "get": {
        "description": "Retrieve records from a table",
        "parameters": [
          {
            "name": "tableName",
            "in": "path",
            "description": "Table Name",
            "required": true,
            "schema": {
              "type": "string"
            }
          },
          {
            "name": "sysparm_query",
            "in": "query",
            "description": "An encoded query string used to filter the results like Incidents Numbers or Knowledge Base IDs etc",
            "required": true,
            "schema": {
              "type": "string"
            }
          },
          {
            "name": "sysparm_fields",
            "in": "query",
            "description": "A comma-separated list of fields to return in the response",
            "required": false,
            "schema": {
              "type": "string"
            }
          },
          {
            "name": "sysparm_limit",
            "in": "query",
            "description": "The maximum number of results returned per page",
            "required": false,
            "schema": {
              "type": "string"
            }
          }
        ],
        "responses": {
          "200": {
            "description": "ok",
            "content": {
              "application/json": {
                "schema": {
                  "$ref": "#/components/schemas/incident"
                }
              }
            }
          }
        }
      },
      "post": {
        "description": "Create a record",
        "parameters": [
          {
            "name": "tableName",
            "in": "path",
            "description": "Table Name",
            "required": true,
            "schema": {
              "type": "string"
            }
          }
        ],
        "requestBody": {
          "content": {
            "application/json": {
              "schema": {
                "type": "object",
                "properties": {
                  "short_description": {
                    "type": "string",
                    "description": "Short Description"
                  },
                  "description": {
                    "type": "string",
                    "description": "Full Description for Incidents only"
                  },
                  "caller_id": {
                    "type": "string",
                    "description": "Caller Email"
                  },
                  "state": {
                    "type": "string",
                    "description": "State of the incident",
                    "enum": [
                      "new",
                      "in_progress",
                      "resolved",
                      "closed"
                    ]
                  },
                  "text": {
                    "type": "string",
                    "description": "Article Body Text for Knowledge Bases Only (KB)"
                  }
                },
                "required": [
                  "short_description",
                  "caller_id"
                ]
              }
            }
          },
          "required": true
        },
        "responses": {
          "200": {
            "description": "ok",
            "content": {
              "application/json": {}
            }
          }
        }
      }
    },
    "/api/now/table/{tableName}/{sys_id}": {
      "get": {
        "description": "Retrieve a record",
        "parameters": [
          {
            "name": "tableName",
            "in": "path",
            "description": "Table Name",
            "required": true,
            "schema": {
              "type": "string"
            }
          },
          {
            "name": "sys_id",
            "in": "path",
            "description": "Sys ID",
            "required": true,
            "schema": {
              "type": "string"
            }
          },
          {
            "name": "sysparm_fields",
            "in": "query",
            "description": "A comma-separated list of fields to return in the response",
            "required": false,
            "schema": {
              "type": "string"
            }
          }
        ],
        "responses": {
          "200": {
            "description": "ok",
            "content": {
              "application/json": {},
              "application/xml": {},
              "text/xml": {}
            }
          }
        }
      },
      "delete": {
        "description": "Delete a record",
        "parameters": [
          {
            "name": "tableName",
            "in": "path",
            "description": "Table Name",
            "required": true,
            "schema": {
              "type": "string"
            }
          },
          {
            "name": "sys_id",
            "in": "path",
            "description": "Sys ID",
            "required": true,
            "schema": {
              "type": "string"
            }
          }
        ],
        "responses": {
          "200": {
            "description": "ok",
            "content": {
              "application/json": {},
              "application/xml": {},
              "text/xml": {}
            }
          }
        }
      },
      "patch": {
        "description": "Update or modify a record",
        "parameters": [
          {
            "name": "tableName",
            "in": "path",
            "description": "Table Name",
            "required": true,
            "schema": {
              "type": "string"
            }
          },
          {
            "name": "sys_id",
            "in": "path",
            "description": "Sys ID",
            "required": true,
            "schema": {
              "type": "string"
            }
          }
        ],
        "requestBody": {
          "content": {
            "application/json": {
              "schema": {
                "type": "object",
                "properties": {
                  "short_description": {
                    "type": "string",
                    "description": "Short Description"
                  },
                  "description": {
                    "type": "string",
                    "description": "Full Description for Incidents only"
                  },
                  "caller_id": {
                    "type": "string",
                    "description": "Caller Email"
                  },
                  "state": {
                    "type": "string",
                    "description": "State of the incident",
                    "enum": [
                      "new",
                      "in_progress",
                      "resolved",
                      "closed"
                    ]
                  },
                  "text": {
                    "type": "string",
                    "description": "Article Body Text for Knowledge Bases Only (KB)"
                  }
                },
                "required": [
                  "short_description",
                  "caller_id"
                ]
              }
            }
          },
          "required": true
        },
        "responses": {
          "200": {
            "description": "ok",
            "content": {
              "application/json": {},
              "application/xml": {},
              "text/xml": {}
            }
          }
        }
      }
    }
  },
  "components": {
    "schemas": {
      "incident": {
        "type": "object",
        "properties": {
          "sys_id": {
            "type": "string",
            "description": "Unique identifier for the incident"
          },
          "number": {
            "type": "string",
            "description": "Incident number"
          },
          "short_description": {
            "type": "string",
            "description": "Brief description of the incident"
          }
        }
      }
    },
    "securitySchemes": {
      "oauth2": {
        "type": "oauth2",
        "flows": {
          "authorizationCode": {
            "authorizationUrl": "YOUR_SERVICENOW_INSTANCE_URL/oauth_auth.do",
            "tokenUrl": "YOUR_SERVICENOW_INSTANCE_URL/oauth_token.do",
            "scopes": {
            "useraccount": "Access equivalent to the user's account"
            }
          }
        }
      }
    }
  },
  "security": [
    {
      "oauth2": [
        "useraccount"
      ]
    }
  ]
}

The URL for the ServiceNow instance used in this post is https://devxxxxxx.service-now.com/. After replacing the placeholder with this URL, the servers and securitySchemes sections of the JSON should look like the following:

  "servers": [
    {
      "url": "https://devxxxxxx.service-now.com/"
    }
    "securitySchemes": {
      "oauth2": {
        "type": "oauth2",
        "flows": {
          "authorizationCode": {
            "authorizationUrl": "https://devxxxxxx.service-now.com/oauth_auth.do",
            "tokenUrl": "https://devxxxxxx.service-now.com/oauth_token.do",
            "scopes": {
              "useraccount": "Access equivalent to the user's account"
            }
          }
        }
      }
    }

To create a custom plugin for ServiceNow:

    1. Sign in to the Amazon Q Business console.
    2. Choose Applications in the navigation pane, and then select your application from the list of applications.
    3. In the navigation pane, choose Enhancements, and then choose Plugins.
    4. In Plugins, choose Add plugin.
    5. In Add plugins, choose Custom plugin.
      Create Custom Plugin
    6. In Custom plugin, enter the following information:
      • In Name and description, for Plugin name: Enter a name for your Amazon Q plugin.
      • In API schema, for API schema source, select Define with in-line OpenAPI schema editor.
      • Select JSON as the format for the schema.
      • Remove any sample schema that appears in the inline OpenAPI schema editor and replace it with the text from the provided sample JSON template, updated with your ServiceNow instance URL.

      Enter Custom Plugin Details

    7. In Authentication: Select Authentication required.
    8. For AWS Secrets Manager secret, choose Create and add a new secret. You need to store the ServiceNow OAuth authentication credentials in a Secrets Manager secret to connect your third-party application to Amazon Q. In the window that opens, enter the details in the form:
      • Secret name: A name for your Secrets Manager secret.
      • Client ID: The Client ID from ServiceNow OAuth configuration in the previous section.
      • Client secret: The Client Secret from ServiceNow OAuth configuration in the previous section.
      • OAuth callback URL: The URL the user needs to be redirected to after authentication. This will be your web experience URL. For this example, it’s: https://xxxxxxxx.chat.qbusiness.us-east-1.on.aws/oauth/callback. Amazon Q Business will handle OAuth tokens in this URL.

Create AWS Secrets Manager secret

  9. In Choose a method to authorize Amazon Q Business: Select Create and add a new service role. The console will generate a service role name. To connect Amazon Q Business to third-party applications that require authentication, you need to give the Amazon Q role permissions to access your Secrets Manager secret. This will enable an Amazon Q Business custom plugin to access the credentials needed to sign in to the third-party service.
    Custom Plugin Authentication
  10. Choose Add plugin to add your plugin.

Upon successful completion, the plugin will appear under Plugins with a Build status of Ready and a Plugin status of Active.
Custom Plugin Active
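The same custom plugin can also be created programmatically. The following is a minimal sketch using the qbusiness CreatePlugin API, assuming the Secrets Manager secret and IAM role from the previous steps already exist; the file name and identifiers are placeholders, and the parameter shapes should be verified against the current boto3 documentation.

import boto3

qbusiness = boto3.client("qbusiness")

# Read the OpenAPI schema shown earlier (hypothetical local file name).
with open("servicenow-table-api.json") as f:
    schema_payload = f.read()

qbusiness.create_plugin(
    applicationId="<application-id>",                     # placeholder
    displayName="servicenow-table-api",
    type="CUSTOM",
    authConfiguration={
        "oAuth2ClientCredentialConfiguration": {
            "secretArn": "<secrets-manager-secret-arn>",  # holds the ServiceNow client ID and secret
            "roleArn": "<role-that-can-read-the-secret>",
        }
    },
    customPluginConfiguration={
        "description": "CRUD operations on ServiceNow tables",
        "apiSchemaType": "OPEN_API_V3",
        "apiSchema": {"payload": schema_payload},
    },
)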

Using Amazon Q Business web experience chat to take actions in ServiceNow

Users can launch your Amazon Q Business web experience in two ways:

  • AWS access portal URL provided in an invitation email sent to the user to join AWS IAM Identity Center.
  • Web experience URL shared by the admin.

Navigate to the deployed web experience URL and sign in with your AWS IAM Identity Center credentials.
After signing in, choose the New conversation icon in the left-hand menu to start a conversation.

Example: Search ServiceNow knowledge base articles for a user issue and create an incident

The following chat conversation example illustrates a typical use case of Amazon Q Business integrated with custom plugins for ServiceNow. These features allow you to perform a wide range of tasks tailored to your organization’s needs.

In this example, we initiate a conversation in the web experience chat to search for KB articles related to “log in issues” in ServiceNow by invoking a plugin action. After the user submits a prompt, Amazon Q Business queries ServiceNow through the appropriate API to retrieve the results and provides a response with related KB articles. We then proceed by asking Amazon Q Business for more details to see if any of the KB articles directly addresses the user’s issue. When no relevant KB articles pertaining to the user’s issue are found, we ask Amazon Q Business to summarize the conversation and create a new incident in ServiceNow, making sure the issue is logged for resolution.

User prompt 1 – I am having issues logging in to the intranet and want to know if there are any ServiceNow KB articles on log-in issues. Perform the search on both Short Description and Text field using LIKE operator

Before submitting the preceding prompt for a plugin action in ServiceNow, choose the vertical ellipsis to open Conversation settings, then choose Use a Plugin to select the corresponding custom plugin for ServiceNow.
Web Experience Chat conversation with Amazon Q Business with Custom Plugin
If this is the first time a user is accessing the custom plugin or if their past sign-in has expired, the user will need to authenticate. After authenticating successfully, Amazon Q Business will perform the requested task.

Choose Authorize.
Amazon Q Business Authorization for ServiceNow Interaction

If the user isn’t already signed in to ServiceNow, they will be prompted to enter their credentials. For this example, the user signing in to ServiceNow is the admin user and API actions performed in ServiceNow by Amazon Q Business on behalf of the user will have the same level of access as the user within ServiceNow.
ServiceNow Login

Choose Allow for Amazon Q Business to connect to ServiceNow and perform the requested task on your behalf.

Allow Access to Amazon Q Business

After verifying that the user is authorized and executing the request, Amazon Q Business responds with the information that it retrieved. We then proceed to retrieve additional details with the following prompt.

User prompt 2 – Can you list the KB number and short description in a tabular form?

Conversation with Amazon Q Business to search for KB articles in ServiceNow
Because no KB articles related to the user’s issue were found, we will ask Amazon Q to summarize the conversation context and create an incident with the following prompt.

User prompt 3 – The error I get is "Unable to Login After System Upgrade". Summarize my issue and create an incident with detailed description and add a note that this needs to be resolved asap.

In response to your prompt for an action, Amazon Q displays a review form where you can modify or fill in the necessary information.

To successfully complete the action, choose Submit.

Note: The caller_id value entered in the following example is a valid ServiceNow user.

Amazon Q Business Create Service Now Incident
Your web experience will display a success message if the action succeeds, or an error message if the action fails. In this case, the action succeeded and Amazon Q Business responded accordingly.

Amazon Q Business - Success message after incident Creation

The following screenshot shows that the incident was created successfully in ServiceNow.

Shows ServiceNow Incident Created from Amazon Q Business

Troubleshooting common errors

To have a seamless experience with third-party application integrations, it’s essential to thoroughly test, identify, and troubleshoot unexpected behavior.

A common error encountered in Amazon Q Business is API Response too large, which occurs when an API response size exceeds the current limit of 100 KB. While prompting techniques are essential for obtaining accurate and relevant answers, optimizing API responses to include only the necessary and relevant data is crucial for better response times and enhanced user experience.

The REST API Explorer (shown in the following figure) in ServiceNow is a tool that allows developers and administrators to interact with and test the ServiceNow REST APIs directly from within the ServiceNow environment. It provides a user-friendly interface for making API requests, viewing responses, and understanding the available endpoints and data structures. Using this tool simplifies the process of testing and integrating with ServiceNow.
Rest API Explorer in ServiceNow
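Outside of the REST API Explorer, you can exercise the same Table API from a short script to see how large a trimmed response actually is. The following sketch queries knowledge base articles with sysparm_query, sysparm_fields, and sysparm_limit so only a handful of small records come back; the instance URL, credentials, and table name are placeholders.

import requests

instance = "https://devxxxxxx.service-now.com"   # placeholder instance URL
auth = ("admin", "<password>")                   # or an OAuth bearer token

# Request only the fields you need and cap the record count so the payload
# stays well under the 100 KB response limit.
params = {
    "sysparm_query": "short_descriptionLIKElogin^ORtextLIKElogin",
    "sysparm_fields": "number,short_description",
    "sysparm_limit": "5",
}
resp = requests.get(f"{instance}/api/now/table/kb_knowledge", params=params, auth=auth)
resp.raise_for_status()

print("response size:", len(resp.content), "bytes")
for record in resp.json()["result"]:
    print(record["number"], record["short_description"])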

Clean up

To clean up AWS configurations, sign in to the Amazon Q Business console.

  1. From the Amazon Q Business console, in Applications, select the application that you want to delete.
  2. Choose Actions and select Delete.
  3. To confirm deletion, enter Delete.

This will take a few minutes to finish. When completed, the application and the configured custom plugin will be deleted.
Delete Amazon Q Business App

When you delete the Amazon Q Business application, the users created as part of the configuration are not automatically deleted from IAM Identity Center. Use the instructions in Delete users in IAM Identity Center to delete the users created for this post.

To clean up in ServiceNow, release the Personal Developer Instance provisioned for this post by following the instructions in the ServiceNow Documentation.

Conclusion

The integration of generative AI-powered assistants such as Amazon Q Business with enterprise systems such as ServiceNow offers significant benefits for organizations. By using natural language processing capabilities, enterprises can streamline operations, enhance user productivity, and deliver better customer experiences. The ability to query real-time data and create incidents and knowledge articles through a secure and governed chat interface transforms how users interact with enterprise data and applications. As demonstrated in this post, enhancing Amazon Q Business to integrate with ServiceNow using custom plugins empowers users to perform complex tasks effortlessly, driving efficiency across various business functions. Adopting this technology not only modernizes workflows, but also positions enterprises at the forefront of innovation.

Learn more


About the Author

Siddhartha Angara is a Senior Solutions Architect at Amazon Web Services. He helps enterprise customers design and build well-architected solutions in the cloud, accelerate cloud adoption, and build Machine Learning and Generative AI applications. He enjoys playing the guitar, reading and family time!
