Add a generative AI experience to your website or web application with Amazon Q embedded

Generative AI offers many benefits for both you, as a software provider, and your end-users. AI assistants can help users generate insights, get help, and find information that may be hard to surface using traditional means. In addition, they can help your employees reduce repetitive tasks and focus on high-value work. However, adding generative AI assistants to your website or web application requires significant domain knowledge and the technical expertise to build, deploy, and maintain the infrastructure and end-user experience. These challenges fall outside of some software providers’ core domain, creating barriers to offering AI assistants to users.

Amazon Q Business is a generative AI assistant that can answer questions, provide summaries, generate content, and securely complete tasks based on data and information in your enterprise systems. Amazon Q Business securely unites disparate data with over 40 built-in connectors to popular enterprise applications, document repositories, chat applications, and knowledge management systems. You can use natural language to request information or assistance to generate content. Amazon Q Business handles the complexity of deploying and maintaining the infrastructure required for generative AI assistants so you can focus on creating a delightful end-user experience.

Amazon Q embedded is a feature that lets you embed a hosted Amazon Q Business assistant on your website or application to create more personalized experiences that boost end-users’ productivity. You can configure the assistant with guardrails to define global and topic-level controls for your environment. With an embedded Amazon Q Business assistant, end-users can receive immediate, permission-aware responses from your data sources, with citations.

In this post, we demonstrate how to use the Amazon Q embedded feature to add an Amazon Q Business assistant to your website or web application using basic HTML or React. We also show you how to use the feature with content management systems like WordPress and Drupal. This post includes a sample webpage for Amazon Q Business that allows you to quickly test and demonstrate your AI assistant. This allows you to develop the changes on your website or application in parallel while refining your Amazon Q Business configurations.

Solution overview

Embedding Amazon Q Business gives your users access to a generative AI assistant without leaving your website or web application. Integrating the assistant involves creating an Amazon Q Business application, adding users or groups, connecting relevant data sources, allowlisting your domain, and finally adding an HTML inline frame (iframe) element to your website or web application.

Prerequisites

In this section, we walk through how to set up an Amazon Q Business application, permissions, and user access.

Amazon Q Business application

The Amazon Q embedded feature requires an Amazon Q Business application. If you don’t have an existing application, you can create an application integrated with AWS IAM Identity Center or AWS Identity and Access Management (IAM) identity federation. Refer to Configuring an Amazon Q Business application using AWS IAM Identity Center, or Creating an Amazon Q Business application using Identity Federation through IAM if you need to make a new application.

Permissions

Configuring the Amazon Q embedded feature requires IAM permissions that allow you to use and manage Amazon Q Business. At a minimum, your permission policy must allow the Amazon Q Business CreateWebExperience and UpdateWebExperience actions:

"Action": [
    "qbusiness:CreateWebExperience",
    "qbusiness:UpdateWebExperience"
],

When creating the IAM permission policy, the IAM visual policy creator is a helpful way to see the available options. Following a least-privilege approach, you can restrict the Resource element so that the permission grants access only to a specific AWS Region, account ID, application ID, and web experience ID.

"Resource": "arn:aws:qbusiness:us-east-1:123456789012:application/<replace-with-id>"
"Resource": "arn:aws:qbusiness:us-east-1:123456789012:application/<replace-with-id>/web-experience/<replace-with-id>"
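Putting these pieces together, a complete policy might look like the following (the Region, account ID, and resource IDs are placeholders that you would replace with your own values):

```json
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "qbusiness:CreateWebExperience",
                "qbusiness:UpdateWebExperience"
            ],
            "Resource": [
                "arn:aws:qbusiness:us-east-1:123456789012:application/<replace-with-id>",
                "arn:aws:qbusiness:us-east-1:123456789012:application/<replace-with-id>/web-experience/<replace-with-id>"
            ]
        }
    ]
}
```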

You can find your application ID on the Amazon Q Business console under Application settings or from the list-applications command in the AWS Command Line Interface (AWS CLI). You can find your web experience ID with the list-web-experiences AWS CLI command. For example:

aws qbusiness list-applications
aws qbusiness list-web-experiences --application-id a1b2c3d4-5678-90ab-cdef-EXAMPLE11111

User access

Amazon Q Business requires authentication before users can engage with the assistant. If you use AWS IAM Identity Center, you can grant users access to the assistant by adding the users or groups to your Amazon Q Business application. If you use IAM identity federation, Amazon Q Business automatically subscribes users to the subscription type you select when you create the application. For more information on managing users, refer to Managing user subscriptions for IAM Identity Center-integrated applications, or see Updating and cancelling user subscriptions for applications using IAM Federation.

Allowlisting your website or web application

To embed Amazon Q Business on your website or web application, you must first allowlist your domain. This restricts your assistant to only sites you trust and stops others from embedding your assistant. You can add multiple domains for different services or development instances used for testing. Complete the following steps:

  1. Open the Amazon Q Business console.
  2. Next, select your Amazon Q Business application.
  3. From the menu, choose Amazon Q embedded under the Enhancements section, then choose Add allowed website.
  4. For Enter website URL, enter the base URL of the website or web application you want to allowlist for Amazon Q Business, for example https://www.example.com (trailing / not required), and choose Add.

Amazon Q Business hosts the web experience on an AWS domain. To find the URL, navigate to the main page of your Amazon Q Business application and copy the value for Deployed URL, for example https://1234abcdef5678.chat.qbusiness.example.on.aws/, in the Web experience settings section. Now you can embed this assistant into the website or web application hosted at the domain you allowlisted.

Customizing the user experience

You can customize the user experience look and feel for your organization. Customization options include the assistant title, subtitle, welcome message, font, color, and logo. You can also enable sample prompts. Refer to Customizing an Amazon Q Business web experience to see the available customization options.

The following screenshots show the default Amazon Q Business user experience (left) and an Amazon Q Business user experience with a custom title, subtitle, and welcome message (right).

Add Amazon Q Business to your website or web application

Before continuing, make sure you have allowlisted your domain as described earlier in this post.

You can choose from the following embedding options:

  • Using an HTML iframe element
  • Using a React component
  • Using a content management system

Embed Amazon Q Business using an HTML iframe element

You can embed Amazon Q Business on your website or web application using an iframe element, which is an HTML element that you can use to insert another HTML page into the current one. Other embedding options build upon this foundational HTML element. The following is a sample iframe element:

<iframe src="https://1234abcdef5678.chat.qbusiness.example.on.aws/"></iframe>

You can customize the iframe element with various attributes such as the width, height, and title. Setting the Amazon Q Business deployed URL as the value for the src attribute will display the Amazon Q Business web experience within the iframe. The following code shows an example iframe element with the id, title, width, height, and src attributes set to example values:

<iframe
    id="inlineFrameExample"
    title="Inline Frame Example"
    width="600"
    height="650"
    src="https://1234abcdef5678.chat.qbusiness.example.on.aws/">
</iframe>

Refer to <iframe>: The Inline Frame element to learn more about the iframe element.

Embed Amazon Q Business using a React component

You can embed Amazon Q Business on your website or web application using a React component. React components offer more customization and modularity than a standard iframe. In this post, we’ve included a sample React component that wraps an iframe element and adds capabilities such as an expanding and collapsing chat interface and a loading spinner when the page first loads.

To use this React component, download the sample code from the Embed GenAI chat into React GitHub repo and add it to your React source code. Then you can import the component into your website or web application and add the Chat element with at least the embedUrl attribute set to the deployed URL of your Amazon Q Business application. The following example code shows the options of the sample React component:

import Chat from "../components/embed";
...
<Chat
    embedUrl="https://1234abcdef5678.chat.qbusiness.example.on.aws/"
    embedWidth={600}          // Optional
    embedHeight={650}         // Optional
    embedOffsetRightPc={5}    // Optional
    headerText="Chat"         // Optional
    headerInfo="Chat with us" // Optional
/>

Embed Amazon Q Business using a content management system

You can embed Amazon Q Business on a website published by a content management system that allows you to add HTML elements to the content. We’ve included examples for WordPress and Drupal, both of which you can deploy with Amazon Lightsail.

Embedding on a WordPress site

To embed Amazon Q Business on your WordPress site, first access the WordPress admin page. Optionally, add a block group wrapper to constrain iframe sizing with the values of your choosing. For example, you could set the layout content height to 650px, width to 620px, a width of 100% in the iframe to fill the container, and select a full-size block item. Finally, add a custom HTML block and insert the iframe code. The following code is a sample iframe element:

<iframe
    id="inlineFrameExample"
    title="Inline Frame Example"
    width="100%"
    height="650"
    src="https://021345abcdef.chat.qbusiness.example.on.aws/">
</iframe>

The following screenshot shows an example of adding a block to a WordPress site.

The following screenshot shows an example of adding an iframe to the block.

The following screenshot shows an example of Amazon Q Business in a WordPress site.

Embedding on a Drupal site

To embed Amazon Q Business on your Drupal site, complete the following steps:

  1. Open the Drupal admin page.
  2. Choose Content, Blocks, and Add content block.
  3. Give your content block a description and change the text format to HTML.
  4. Choose the Source option.
  5. Add your iframe to the Body section of the block, then choose Save and configure.
  6. When configuring your content block, the visibility options are optional and can be left with the default values.
  7. Choose a Region to display this block, such as Content Above or Sidebar, then choose Save block.

The following screenshot shows an example of Amazon Q Business embedded with the Content Above option.

The following screenshot shows an example of Amazon Q Business embedded with the Sidebar option.

Sample website

To help you get started embedding Amazon Q Business, we have included a sample website that you can deploy on AWS Amplify with an AWS CloudFormation stack. The sample website contains an HTML iframe element with your Amazon Q Business assistant. To use the website, complete the following steps:

  1. Collect your Amazon Q Business application ID and make a note of it. You can find your application ID on the Amazon Q Business console as described earlier in this post.
  2. Download our YAML sample CloudFormation template to your workstation.
  3. Deploy the stack either using the AWS CloudFormation console or using the AWS CLI.

  4. After uploading the sample CloudFormation template, enter a stack name, a web page name, and your Amazon Q Business application ID in the Application ID input field.
  5. You can leave all other settings at their default values.
  6. After the stack fully deploys, navigate to the Outputs tab on the AWS CloudFormation console and copy the Amplify URL.
  7. Return to the Amazon Q Business console, select your Amazon Q Business application, and choose Amazon Q Embedded to add your Amplify URL to the Allowed websites list as described earlier in this post.
  8. Navigate to your Amplify URL in your web browser to see your sample website with Amazon Q Business. You may need to Sign in to Q Business.

Clean up

To avoid future charges in your account from Amplify, delete the resources you created in the preceding sample website walkthrough:

  1. On the CloudFormation console, in the navigation pane, choose Stacks.
  2. Select the stack you launched in the previous step, then choose Delete.

Conclusion

In this post, we showed you various methods of embedding Amazon Q Business, which enables users to have natural language conversations and get meaningful assistance directly on your website or web application. We discussed creating an Amazon Q Business application and how to allowlist your URL. We then walked through adding Amazon Q Business with a standard HTML iframe, a React component, and how to update a WordPress or Drupal site.

To get started, refer to Getting started with Amazon Q Business to create an Amazon Q Business application. For more information on the Amazon Q embedded feature, see Amazon Q embedded. Refer to Enhancing an Amazon Q Business application environment for guidance on integrating your data sources, which can include your website content, to enrich the answers Amazon Q Business can provide your website or web application users.


About the authors

Bobby Williams is a Senior Solutions Architect at AWS. He has decades of experience designing, building, and supporting enterprise software solutions that scale globally. He works on solutions across industry verticals and horizontals and is driven to create a delightful experience for every customer.

David Girling is a Senior AI/ML Solutions Architect with over 20 years of experience in designing, leading, and developing enterprise systems. David is part of a specialist team that focuses on helping customers learn, innovate, and utilize these highly capable services with their data for their use cases.

Philip Whiteside is a Solutions Architect (SA) at Amazon Web Services. Philip is passionate about overcoming barriers by utilizing technology.

An introduction to preparing your own dataset for LLM training

Large language models (LLMs) have demonstrated remarkable capabilities in a wide range of linguistic tasks. However, the performance of these models is heavily influenced by the data used during the training process.

In this blog post, we provide an introduction to preparing your own dataset for LLM training. Whether your goal is to fine-tune a pre-trained model for a specific task or to continue pre-training for domain-specific applications, having a well-curated dataset is crucial for achieving optimal performance.

Data preprocessing

Text data can come from diverse sources and exist in a wide variety of formats such as PDF, HTML, JSON, and Microsoft Office documents such as Word, Excel, and PowerPoint. It’s rare to already have access to text data that can be readily processed and fed into an LLM for training. Thus, the first step in an LLM data preparation pipeline is to extract and collate data from these various sources and formats. During this step, you read data from multiple sources, extract the text using tools such as optical character recognition (OCR) for scanned PDFs, HTML parsers for web documents, and bespoke libraries for proprietary formats such as Microsoft Office files. Non-textual elements such as HTML tags and non-UTF-8 characters are typically removed or normalized.

The next step is to filter out low-quality or undesirable documents. Common patterns for filtering data include:

  • Filtering on metadata such as the document name or URL.
  • Content-based filtering such as excluding any toxic or harmful content or personally identifiable information (PII).
  • Regex filters to identify specific character patterns present in the text.
  • Filtering documents with excessive repetitive sentences or n-grams.
  • Filters for specific languages such as English.
  • Other quality filters such as the number of words in the document, average word length, ratio of words comprised of alphabetic characters versus non-alphabetic characters, and others.
  • Model based quality filtering using lightweight text classifiers to identify low quality documents. For example, the FineWeb-Edu classifier is used to classify the education value of web pages.
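Several of the heuristics above can be sketched in a few lines of Python. The thresholds below are arbitrary illustrations and would need tuning for your own corpus:

```python
import re

def passes_quality_filters(text: str,
                           min_words: int = 50,
                           max_avg_word_len: float = 10.0,
                           min_alpha_ratio: float = 0.6) -> bool:
    """Return True if the document passes simple heuristic quality checks."""
    words = text.split()
    if len(words) < min_words:
        return False  # too short to be a useful training document
    avg_len = sum(len(w) for w in words) / len(words)
    if avg_len > max_avg_word_len:
        return False  # suspiciously long "words" often indicate markup or noise
    alpha_words = [w for w in words if re.fullmatch(r"[A-Za-z']+", w)]
    if len(alpha_words) / len(words) < min_alpha_ratio:
        return False  # mostly non-alphabetic tokens (tables, debris, boilerplate)
    return True

print(passes_quality_filters("word " * 100))  # True
print(passes_quality_filters("too short"))    # False
```

In practice, each filter would be applied across the corpus in a data processing framework, and documents failing any check would be dropped or flagged for review.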

Extracting text from various file formats can be a non-trivial task. Fortunately, many high-level libraries exist that can significantly simplify this process. We will use a few examples to demonstrate extracting text and review how to scale this to large collections of documents further down.

HTML preprocessing

When processing HTML documents, remove non-text data such as the document mark-up tags, inline CSS styles, and inline JavaScript. Furthermore, translate structured objects such as lists, tables, and sample code blocks into markdown format. The trafilatura library provides a command-line interface (CLI) and Python SDK for translating HTML documents in this fashion. The following code snippet demonstrates the library’s usage by extracting and preprocessing the HTML data from the Fine-tune Meta Llama 3.1 models using torchtune on Amazon SageMaker blog post.

from trafilatura import fetch_url, extract, html2txt

url = "https://aws.amazon.com/blogs/machine-learning/fine-tune-meta-llama-3-1-models-using-torchtune-on-amazon-sagemaker/"

downloaded = fetch_url(url)
print("RAW HTML\n", downloaded[:250])

all_text = html2txt(downloaded)
print("\nALL TEXT\n", all_text[:250])

main_text = extract(downloaded)
print("\nMAIN TEXT\n", main_text[:250])

trafilatura provides numerous functions for dealing with HTML. In the preceding example, fetch_url fetches the raw HTML, html2txt extracts the text content which includes the navigation links, related content links, and other text content. Finally, the extract method extracts the content of the main body which is the blog post itself. The output of the preceding code should look like the following:

RAW HTML
<!doctype html> <html lang="en-US" class="no-js aws-lng-en_US" xmlns="http://www.w3.org/1999/xhtml" data-aws-assets="https://a0.awsstatic.com" data-js-version="1.0.681" data-css-version="1.0.538" data-static-assets="https://a0.awsstatic.com" prefix="

ALL TEXT
Skip to Main Content Click here to return to Amazon Web Services homepage About AWS Contact Us Support English My Account Sign In Create an AWS Account Products Solutions Pricing Documentation Learn Partner Network AWS Marketplace Customer Enablement

MAIN TEXT
AWS Machine Learning Blog Fine-tune Meta Llama 3.1 models using torchtune on Amazon SageMaker This post is co-written with Meta’s PyTorch team. In today’s rapidly evolving AI landscape, businesses are constantly seeking ways to use advanced large lan

PDF processing

PDF is a common format for storing and distributing documents within organizations. Extracting clean text from PDFs can be challenging for several reasons. PDFs may use complex layouts that include text columns, images, tables, and figures. They can also contain embedded fonts and graphics that cannot be parsed by standard libraries. Unlike HTML, there is no structural information to work with such as headings, paragraphs, and lists, which makes parsing PDF documents significantly more difficult. If possible, PDF parsing should be avoided if an alternative format for the document exists, such as HTML, Markdown, or even a DOCX file. In cases where an alternative format is not available, you can use libraries such as pdfplumber, pypdf, and pdfminer to help with the extraction of text and tabular data from the PDF. The following is an example of using pdfplumber to parse the first page of the 2023 Amazon annual report in PDF format.

import pdfplumber

pdf_file = "Amazon-com-Inc-2023-Annual-Report.pdf"

with pdfplumber.open(pdf_file) as pdf:
    page = pdf.pages[0]  # pages are zero-indexed, so this is the first page
    print(page.extract_text(x_tolerance=1)[:300])

pdfplumber provides bounding box information, which can be used to remove superfluous text such as page headers and footers. However, the library only works with PDFs that have text present, such as digitally authored PDFs. For PDF documents that require OCR, such as scanned documents, you can use services such as Amazon Textract.

Office document processing

Documents authored with Microsoft Office or other compatible productivity software are another common format within an organization. Such documents can include DOCX, PPTX, and XLSX files, and there are libraries available to work with these formats. The following code snippet uses the python-docx library to extract text from a Word document. The code iterates through the document paragraphs and concatenates them into a single string.

from docx import Document
doc_file = "SampleDoc.docx"

doc = Document(doc_file)

full_text = []
for paragraph in doc.paragraphs:
  full_text.append(paragraph.text)

document_text = '\n'.join(full_text)

Deduplication

After the preprocessing step, it is important to process the data further to remove duplicates (deduplication) and filter out low-quality content.

Deduplication is a critical aspect for preparing high-quality pretraining datasets. According to CCNet, duplicated training examples are pervasive in common natural language processing (NLP) datasets. This issue is not only a frequent source of bias in datasets originating from public domains such as the internet, but it can also be a potential problem when curating your own training dataset. When organizations attempt to create their own training dataset, they often use various data sources such as internal emails, memos, internal employee chat logs, support tickets, conversations, and internal wiki pages. The same chunk of text might appear across multiple sources or can repeat excessively in a single data source such as an email thread. Duplicated data extends the training time and potentially biases the model towards more frequently repeated examples.

A commonly used processing pipeline is the CCNet pipeline. The following section will describe deduplication and filtering employed in the CCNet pipeline.

Break documents into shards. In the CCNet paper, the authors divided 30 TB of data into 1,600 shards. In that example, the shards are documents that have been grouped together. Each shard contains 5 GB of data and 1.6 million documents. Organizations can determine the number of shards and the size of each shard based on their data size and compute environment. The main purpose of creating shards is to parallelize the deduplication process across a cluster of compute nodes.

Compute a hash code for each paragraph of the document. Each shard contains many documents, and each document contains multiple paragraphs. For each paragraph, we compute a hash code and save it into a binary file. The authors of the CCNet paper use the first 64 bits of the SHA-1 digest of the normalized paragraph as the key. Deduplication is done by comparing these keys. If the same key appears multiple times, the paragraphs that these keys link to are considered duplicates. You can compare the keys within one shard, in which case there might still be duplicated paragraphs across different shards. If you compare the keys across all shards, you can verify that no duplicated paragraph exists in your whole dataset. However, this can be computationally expensive.
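A minimal sketch of this idea follows. The normalization here is simplified to lowercasing and whitespace collapsing; the CCNet pipeline's actual normalization is more involved:

```python
import hashlib
import re

def paragraph_key(paragraph: str) -> bytes:
    """First 64 bits (8 bytes) of the SHA-1 digest of a normalized paragraph."""
    normalized = re.sub(r"\s+", " ", paragraph.lower()).strip()
    return hashlib.sha1(normalized.encode("utf-8")).digest()[:8]

def deduplicate(paragraphs):
    """Keep the first occurrence of each paragraph; drop later duplicates."""
    seen = set()
    unique = []
    for p in paragraphs:
        key = paragraph_key(p)
        if key not in seen:
            seen.add(key)
            unique.append(p)
    return unique

docs = ["Hello world.", "hello   WORLD.", "Something else."]
print(deduplicate(docs))  # ['Hello world.', 'Something else.']
```

In a sharded pipeline, each worker would write its keys to a binary file, and a merge step would compare keys across shards.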

MinHash is another popular method for estimating the similarities between two paragraphs. This technique is particularly useful for large datasets because it provides an efficient approximation of the Jaccard similarity. Paragraphs are broken down into shingles, which are overlapping sequences of words or characters of a fixed length. Multiple hashing functions are applied to each shingle. For each hash function, we find the minimum hash value across all the shingles and use that as the signature of the paragraph, called the MinHash signature. Using the MinHash signatures, we can calculate the similarity of the paragraphs. The MinHash technique can also be applied to words, sentences, or entire documents. This flexibility makes MinHash a powerful tool for a wide range of text similarity tasks. The following example shows the pseudo-code for this technique:

function MinHash_similarity(text1, text2, shingle_length, num_hash_functions):
    # Preprocess texts
    shingles1 = create_shingles(text1, shingle_length)
    shingles2 = create_shingles(text2, shingle_length)

    # Initialize MinHash signatures
    minhash_signatures = []

    # Compute MinHash signatures
    for i from 1 to num_hash_functions:
        hash_function = generate_hash_function()
        minhash1 = minimum_hash(shingles1, hash_function)
        minhash2 = minimum_hash(shingles2, hash_function)
        minhash_signatures.append((minhash1, minhash2))

    # Estimate Jaccard similarity
    common_minhashes = count_common_minhashes(minhash_signatures)
    jaccard_similarity = common_minhashes / num_hash_functions
    return jaccard_similarity
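The pseudo-code above can be made concrete in a short, self-contained Python sketch using seeded hash functions over word shingles. This is a toy implementation for illustration; production pipelines typically use an optimized library:

```python
import hashlib

def create_shingles(text, shingle_length=3):
    """Overlapping word shingles of a fixed length."""
    words = text.lower().split()
    return {" ".join(words[i:i + shingle_length])
            for i in range(len(words) - shingle_length + 1)}

def minhash_signature(shingles, num_hashes=128):
    """For each seeded hash function, keep the minimum hash over all shingles."""
    signature = []
    for seed in range(num_hashes):
        min_val = min(
            int.from_bytes(hashlib.sha1(f"{seed}:{s}".encode()).digest()[:8], "big")
            for s in shingles
        )
        signature.append(min_val)
    return signature

def minhash_similarity(text1, text2, shingle_length=3, num_hashes=128):
    sig1 = minhash_signature(create_shingles(text1, shingle_length), num_hashes)
    sig2 = minhash_signature(create_shingles(text2, shingle_length), num_hashes)
    # The fraction of hash functions whose minimum values agree
    # approximates the Jaccard similarity of the two shingle sets.
    return sum(a == b for a, b in zip(sig1, sig2)) / num_hashes

a = "the quick brown fox jumps over the lazy dog"
b = "the quick brown fox jumps over the sleepy dog"
print(minhash_similarity(a, a))  # 1.0
print(minhash_similarity(a, b))  # an estimate of the Jaccard similarity
```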

The complete steps of using MinHash for deduplication are:

  1. Break down documents into paragraphs.
  2. Apply the MinHash algorithm as shown in the preceding example and calculate the similarity scores between paragraphs.
  3. Use the similarity between paragraphs to identify duplicate pairs.
  4. Combine duplicate pairs into clusters. From each cluster, select one representative paragraph to minimize duplicates.

To enhance the efficiency of similarity searches, especially when dealing with large datasets, MinHash is often used in conjunction with additional techniques such as Locality Sensitive Hashing (LSH). LSH complements MinHash by providing a way to quickly identify potential matches through bucketing and hashing techniques without having to compare every pair of items in the dataset. This combination allows for efficient similarity searches even in massive collections of documents or data points, significantly reducing the computational overhead typically associated with such operations.
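The bucketing idea can be sketched by splitting each MinHash signature into bands and hashing each band; items that collide in any band become candidate pairs. This toy example uses short, hand-written signatures purely to show the mechanics:

```python
from collections import defaultdict
from itertools import combinations

def lsh_candidate_pairs(signatures, num_bands=32):
    """Return pairs of items whose signatures collide in at least one band.

    signatures: dict mapping item id -> list of MinHash values.
    Each signature is split into num_bands bands; items that land in the
    same bucket for any band are candidate duplicate pairs.
    """
    rows_per_band = len(next(iter(signatures.values()))) // num_bands
    buckets = defaultdict(set)
    for item_id, sig in signatures.items():
        for band in range(num_bands):
            chunk = tuple(sig[band * rows_per_band:(band + 1) * rows_per_band])
            buckets[(band, chunk)].add(item_id)
    candidates = set()
    for members in buckets.values():
        for pair in combinations(sorted(members), 2):
            candidates.add(pair)
    return candidates

# Two near-identical signatures and one unrelated signature.
sigs = {
    "doc1": [1, 2, 3, 4, 5, 6, 7, 8],
    "doc2": [1, 2, 3, 4, 5, 6, 7, 9],   # differs only in the last value
    "doc3": [10, 11, 12, 13, 14, 15, 16, 17],
}
print(lsh_candidate_pairs(sigs, num_bands=4))  # {('doc1', 'doc2')}
```

Only the candidate pairs then need an exact (or MinHash-estimated) similarity comparison, avoiding the quadratic cost of comparing every pair in the corpus.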

It’s important to note that paragraph-level deduplication is not the only choice of granularity. As shown in Meta’s Llama 3 paper, you can also use sentence-level deduplication. The authors also applied document-level deduplication to remove near duplicate documents. The computation cost for sentence-level deduplication is even higher compared to paragraph-level deduplication. However, this approach offers more fine-grained control over duplicate content. At the same time, removing duplicated sentences might result in an incomplete paragraph, potentially affecting the coherence and context of the remaining text. Thus, the trade-off between granularity and context preservation needs to be carefully considered based on the nature of the dataset.

Creating a dataset for model fine-tuning

Fine-tuning a pre-trained LLM involves adapting it to a specific task or domain by training it on an annotated dataset in a supervised manner or through reinforcement learning techniques. The dataset considerations for fine-tuning are crucial because they directly impact the model’s performance, accuracy, and generalization capabilities. Top considerations include:

  1. Relevance and domain-specificity: The dataset should closely match the task or domain the model is being fine-tuned for. Make sure that the dataset includes diverse examples and edge cases that the model is likely to encounter. This helps improve the robustness and generalizability of the model across a range of real-world scenarios. For example, when fine-tuning a model for financial sentiment analysis, the dataset should contain financial news articles, analyst reports, stock market commentary, and corporate earnings announcements.
  2. Annotation quality: The dataset must be free of noise, errors, and irrelevant information. Annotated datasets must maintain consistency in labeling. The dataset should accurately reflect the correct answers, human preferences, or other target outcomes that the fine-tuning process aims to achieve.
  3. Dataset size and distribution: Although fine-tuning generally requires fewer tokens than pretraining (thousands compared to millions), the dataset should still be large enough to cover the breadth of the task requirements. The dataset should include a diverse set of examples that reflect the variations in language, context, and style that the model is expected to handle.
  4. Ethical considerations: Analyze and mitigate biases present in the dataset, such as gender, racial, or cultural biases. These biases can be amplified during fine-tuning, leading to unfair or discriminatory model outputs. Make sure that the dataset aligns with ethical standards and represents diverse groups and perspectives fairly.
  5. Sensible data cutoffs: While preparing the dataset, one of the considerations to understand is choosing a cutoff date for the data. Generally, depending on the speed of changes in the information, you can choose an early or late cutoff. For example, for fine-tuning an LLM for brand adherence, you can have a distant cutoff date because the brand language remains consistent for many years. Whereas preparing the dataset for generating audit and compliance letters needs an earlier cutoff date because new compliance regulations are created and updated quite often.
  6. Modalities: In the case of multi-modal models, the dataset must include the various supported data types. Each data type must follow the other considerations mentioned here around annotation quality, ethical considerations, relevance, domain specificity, and so on.
  7. Synthetic data augmentation: Consider generating synthetic data to supplement real-world data, especially to help fill gaps in the dataset and make sure that it’s realistic and representative. Employing these techniques can help overcome the challenges of limited data availability, enhance model robustness, and provide better generalization across diverse tasks.

Dataset format for fine-tuning

Instruction tuning is a process of further training a pretrained model on a diverse set of tasks framed as natural language instructions. This approach aims to enhance the model’s ability to understand and follow explicit instructions, improving its performance on a wide range of downstream tasks without the need for task-specific fine-tuning. A technique called preference tuning is used to align large models with human preferences and values. This approach aims to improve the quality, safety, and desirability of AI-generated outputs by incorporating human feedback into the training process. The data requirements for these two techniques are very different, so we’ll go into further detail in the following sections.

Instruction fine-tuning dataset format

The fields in the examples that follow represent the key components of the instruction-tuning paradigm: Type categorizes the task or instruction type, Instruction contains the specific directive for the model, Input provides the context or data to work with, and Output shows the expected or desired response. This structure is crucial for training models to understand and execute a wide variety of tasks based on natural language instructions.

Type: antonym_relation
Instruction: What is the relationship between the given pairs?
Input: Night : Day :: Right : Left
Output: The relationship between the given pairs is that they are opposites.

Type: email_subject_generation
Instruction: Generate an appropriate subject title for the following email:
Input: Hi [person name],\n\nI'm writing to ask you if you are willing to be a panelist of our workshop on multi-modality at CVPR. The workshop will be held on June 20, 2023.\n\nBest,\n[my name]
Output: Invitation to be a panelist for CVPR 2023 workshop on multi-modality

Type: analogy_completion
Instruction: To make the pairs have the same analogy, write the fourth word.
Input: Instability : Turmoil :: Change :
Output: Instability : Turmoil :: Change : Revolution

Type: question_answering_generation_from_facts
Instruction: Using the facts given, write a question-answer pair.
Input: Name: Bill Gates\nBorn: October 28, 1955 (age 66)\nEducation: Harvard University (dropped out)\nOccupation: Software developer, investor, entrepreneur\nSource: Wikipedia\nTime: August 2022
Output: Question: What is Bill Gates's occupation?\nAnswer: He is a software developer, investor, and entrepreneur.
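In practice, instruction-tuning examples are usually stored one record per line in JSON Lines. The short sketch below serializes one example in this shape; the field names follow the type/instruction/input/output convention described above and are illustrative rather than a requirement of any specific trainer:

```python
import json

# One instruction-tuning record (content taken from the antonym example).
record = {
    "type": "antonym_relation",
    "instruction": "What is the relationship between the given pairs?",
    "input": "Night : Day :: Right : Left",
    "output": "The relationship between the given pairs is that they are opposites.",
}

# Serialize to a single JSONL line, then parse it back to verify round-tripping.
line = json.dumps(record)
parsed = json.loads(line)
```

Writing one such line per example yields a `.jsonl` file that most instruction-tuning pipelines can consume directly.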

 

Preference-tuning dataset format

A preference-tuning dataset typically consists of entries, each containing an input or prompt, a chosen (preferred) response, and a rejected (less preferred) response. The following is an example of a Direct Preference Optimization (DPO) dataset.

Prompt:
Artists Gustav Klimt and Carl Moll were born in which European country?
What is the answer?

Chosen:
[{'content': 'You are a helpful assistant who always provide explanation. Think like you are answering a five year old.', 'role': 'system'},
{'content': 'Artists Gustav Klimt and Carl Moll were born in which European country?\nWhat is the answer?', 'role': 'user'},
{'content': 'Gustav Klimt and Carl Moll were both born in a European country called Austria. Austria is a beautiful place with mountains, forests, and many people who speak German. These two artists created lots of pretty paintings while they lived in Austria.', 'role': 'assistant'}]

Rejected:
[{'content': 'You are a helpful assistant, who always provides an explanation. Think like you are answering a five year old.', 'role': 'system'},
{'content': 'Artists Gustav Klimt and Carl Moll were born in which European country?\nWhat is the answer?', 'role': 'user'},
{'content': "Ooh ooh! *giggle* You wanna know where Gustav Klimt and Carl Moll were born? *bounce bounce* Well, they were born in a special place called Austria! *twirl* That's right, Austria! It's a beautiful country in Europe, and lots of talented artists like Gustav and Carl were born there! *smile* Does that make you wanna dance? *spin spin* Hehehe!", 'role': 'assistant'}]
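Assembling such a DPO record programmatically is mostly a matter of pairing the same prompt with the two candidate responses. The sketch below shows one way to do this; the helper name and field layout follow the common prompt/chosen/rejected convention and are illustrative:

```python
# Build one DPO example from a prompt plus a preferred and a
# less-preferred assistant reply (chat-message format).
def make_dpo_example(prompt, chosen_reply, rejected_reply, system=None):
    def conversation(reply):
        msgs = [{"role": "system", "content": system}] if system else []
        msgs.append({"role": "user", "content": prompt})
        msgs.append({"role": "assistant", "content": reply})
        return msgs

    return {
        "prompt": prompt,
        "chosen": conversation(chosen_reply),
        "rejected": conversation(rejected_reply),
    }

example = make_dpo_example(
    "Artists Gustav Klimt and Carl Moll were born in which European country?",
    "They were both born in Austria.",
    "Ooh ooh! *giggle* Austria!",
    system="You are a helpful assistant who always provides an explanation.",
)
```

Both conversations share the identical prompt and system message, so the only signal the trainer receives is the difference between the two assistant turns.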

The following is an example using the Ultrachat-feedback dataset format, which includes the following elements: prompt, chosen, rejected, message, score_chosen, and score_rejected. This type of dataset is often used in DPO or reinforcement learning from human feedback (RLHF) to improve AI model outputs. By providing examples of your preferred and non-preferred responses along with their respective scores, the dataset can be used to train models to generate more desirable outputs.

prompt: Let's play a game. I say a sentence, then you make a sentence that follows up my sentence, then I give a continuation to yours and so on. You ready?

chosen:
[{'content': "Let's play a game. I say a sentence, then you make a sentence that follows up my sentence, then I give a continuation to yours and so on. you ready?", 'role': 'user'},
{'content': "I'm ready! Let's begin. Please provide your first sentence.", 'role': 'assistant'}]

rejected:
[{'content': "Let's play a game. I say a sentence, then you make a sentence that follows up my sentence, then I give a continuation to yours and so on. you ready?", 'role': 'user'},
{'content': 'Sure, I would love to play.', 'role': 'assistant'}]

messages:
[{'content': "Let's play a game. I say a sentence, then you make a sentence that follows up my sentence, then I give a continuation to yours and so on. you ready?", 'role': 'user'},
{'content': "I'm ready! Let's begin. Please provide your first sentence.", 'role': 'assistant'}]

score_chosen: 7
score_rejected: 6

In the case of Meta Llama 3, instruction-tuned models go through an iterative process of DPO preference alignment, and the dataset typically consists of triplets—a user prompt and two model responses, with one response preferred over the other. In advanced implementations, this format can be extended to include a third, edited response that’s considered superior to both original responses. The preference between responses is quantified using a multi-level rating system, ranging from marginally better to significantly better. This granular approach to preference annotation allows for a more nuanced training of the model, enabling it to distinguish between slight improvements and significant enhancements in response quality.

prompt: Let's play a game. I say a sentence, then you make a sentence that follows up my sentence, then I give a continuation to yours and so on. You ready?

chosen:
[{'content': "Let's play a game. I say a sentence, then you make a sentence that follows up my sentence, then I give a continuation to yours and so on. You ready?", 'role': 'user'},
{'content': "I'm ready! Let's begin. Please provide your first sentence.", 'role': 'assistant'}]

rejected:
[{'content': "Let's play a game. I say a sentence, then you make a sentence that follows up my sentence, then I give a continuation to yours and so on. You ready?", 'role': 'user'},
{'content': 'Sure, I would love to play.', 'role': 'assistant'}]

edited:
[{'content': "Let's play a game. I say a sentence, then you make a sentence that follows up my sentence, then I give a continuation to yours and so on. You ready?", 'role': 'user'},
{'content': "I'm ready! Let's begin. Please provide your first sentence.", 'role': 'assistant'}]

alignment rating: significantly better


Synthetic data creation approach for the instruction-tuning dataset format using the Self-Instruct technique

Synthetic data creation using the Self-Instruct technique is one of the most well-known approaches for generating instruction fine-tuning datasets. This method uses the capabilities of LLMs to bootstrap a diverse and extensive collection of instruction-tuning examples, significantly reducing the need for manual annotation. The following figure shows the process of the Self-Instruct technique, which is described in the following sections.

 

Seed data and tasks

The process begins with a small set of human-written instruction-output pairs that serve as seed data. The seed dataset is the foundation for building a robust collection of tasks across various domains, with a focus on promoting task diversity. In some cases, the input field provides context to support the instruction, especially in classification tasks where output labels are limited. For non-classification tasks, the instruction alone might be self-contained without needing additional input. This dataset encourages task variety through different data formats and solutions, making it a critical step in defining the final task pool, which supports the development of diverse AI applications.

The following is an example of a seed task that identifies financial entities (companies, government institutions, or assets) and assigns a part of speech tag or entity classification based on the given sentence.

{
    "id": "finance_task_001",
    "name": "financial_entity_classification",
    "instruction": "Identify the type of financial entity in the given sentence.",
    "instances": [
      {
        "input": "Entity: Federal Reserve\nSentence: The Federal Reserve raised interest rates by 0.25% to combat inflation.",
        "output": "Government Institution, ORG"
      }
    ],
    "is_classification": true
}

The following example requests an explanation of a financial concept, and because it isn’t a classification task, the output is more open-ended.

{
    "id": "finance_task_002",
    "name": "explain_financial_concept",
    "instruction": "Explain the concept of compound interest in two sentences.",
    "instances": [
      {
        "input": "",
        "output": "Compound interest is the interest on a loan or deposit calculated based on both the initial principal and the accumulated interest from previous periods. It allows investments to grow at a faster rate compared to simple interest, where interest is only calculated on the principal."
      }
    ],
    "is_classification": false
  }

Instruction generation

Using the seed data as a foundation, an LLM is prompted to generate new instructions. The process uses existing human-written instructions as examples to help a model (such as Anthropic’s Claude 3.5 or Meta Llama 405B) to generate new instructions, which are then checked and filtered for quality before being added to the final output list.

Come up with a series of tasks:
1. Suggest a diversified investment portfolio for someone with a moderate risk tolerance.
2. What is the relation between the following financial ratios and company performance?
3. Generate a one-sentence description for each of the following economic terms.
4. Describe a situation in which market volatility can negatively impact retirement planning.
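Before the LLM is called, the seed instructions are packed into a few-shot prompt like the one above, with the model asked to continue the numbered list. The sketch below builds such a prompt; the seed list and exact wording are illustrative, and the resulting string would be sent to a model such as Anthropic's Claude or Meta Llama via your preferred inference API:

```python
# Seed instructions (illustrative, taken from the finance examples above).
seed_instructions = [
    "Suggest a diversified investment portfolio for someone with a moderate risk tolerance.",
    "Generate a one-sentence description for each of the following economic terms.",
]

def build_generation_prompt(seeds):
    """Pack seed instructions into a few-shot instruction-generation prompt."""
    lines = ["Come up with a series of tasks:"]
    lines += [f"{i + 1}. {s}" for i, s in enumerate(seeds)]
    # Ending with the next list number invites the model to continue the list.
    lines.append(f"{len(seeds) + 1}.")
    return "\n".join(lines)

prompt = build_generation_prompt(seed_instructions)
```

The model's completion (new numbered instructions) is then parsed, filtered, and appended to the task pool.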

Instance generation

For each generated instruction, the model creates corresponding input-output pairs. This step produces concrete examples of how to follow the instructions. The Input-First Approach for non-classification tasks asks the model to first generate the input values, which will then be used to generate the corresponding output. This approach is especially useful for tasks such as financial calculations, where the output directly depends on specific inputs.

input_first_template = '''Come up with examples for the following tasks.
Try to generate multiple examples when possible.
If the task doesn't require additional input, you can generate the output directly.
Task: Calculate the compound interest for the given principal, rate, and time period.
Example 1
Principal: $10,000, Rate: 5%, Time: 2 years
Output: $1,025 (Compound interest using annual compounding)
Example 2
Principal: $5,000, Rate: 3%, Time: 5 years
Output: $796.37 (Compound interest using annual compounding)
...
Task: {instruction}'''

The Output-First Approach for classification tasks is designed to first define the output (class label), and then condition the input generation based on the output. This approach verifies that inputs are created in such a way that they correspond to the pre-defined class labels.

output_first_template = '''Given the classification task definition and the class labels,
generate an input that corresponds to each of the class labels.
If the task doesn't require input, just generate possible class labels.
Task: Identify whether the following financial transaction is categorized as "Income" or "Expense."
Class Label: Income
Transaction: Payment received from client for consulting services - $5,000.
Class Label: Expense
Transaction: Payment made for office rent - $1,200.
...
Task: {instruction}'''

Post-processing filters

The filtering and quality control step verifies the dataset quality by applying various mechanisms to remove low-quality or redundant examples. After generating tasks, instances are extracted and formatted, followed by filtering based on rules such as removing instances where the input and output are identical, the output is empty, or the instance is already in the task pool. Additional heuristic checks, such as incomplete generations or formatting issues, are also applied to maintain the integrity of the final dataset.
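The rule-based filters described above can be sketched as a simple predicate applied to each generated instance. The field names and rules below are illustrative and mirror the checks in the text (identical input and output, empty output, duplicate of the task pool):

```python
def keep_instance(instance, existing_pool):
    """Return True if the generated instance passes the rule-based filters."""
    inp, out = instance["input"].strip(), instance["output"].strip()
    if not out:                      # empty output
        return False
    if inp == out:                   # input identical to output
        return False
    if instance in existing_pool:    # already in the task pool
        return False
    return True

pool = [{"input": "a", "output": "b"}]
candidates = [
    {"input": "a", "output": "b"},   # duplicate of the pool -> dropped
    {"input": "x", "output": ""},    # empty output -> dropped
    {"input": "y", "output": "y"},   # input equals output -> dropped
    {"input": "q", "output": "r"},   # passes all checks -> kept
]
filtered = [c for c in candidates if keep_instance(c, pool)]
```

Real pipelines typically add heuristic checks (truncated generations, formatting problems) and similarity-based deduplication, for example ROUGE-L overlap against the pool, on top of these exact-match rules.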

For more details on self-instruct synthetic data creation, see Alpaca: A Strong, Replicable Instruction-Following Model for information about the data creation approach and instruction fine-tuning with the dataset. You can follow a similar approach for various fine-tuning tasks including instruction fine-tuning and direct preference optimization.

Data labeling for different downstream tasks (such as code generation, summarization, and so on)

When it comes to preparing data for training an LLM, data labeling plays a crucial role because it directly controls and impacts the quality of responses a model produces. Generally, for training an LLM, there are a variety of approaches that you can take, depending on the task at hand, because we expect the LLM to work on a variety of use cases. The reason base foundation models excel at a variety of instructions and tasks is that during the pre-training process, we provided such instructions and examples to the model so it could understand the instructions and perform the tasks, such as generating code or performing named entity extraction. Training the LLM for each type of task requires task-specific labeled datasets. Let's explore some of the common data-labeling approaches:

  • Human labelers: The most common method for data labeling is to use human labelers. In this approach, a team of human labelers annotates data for various tasks, such as general question-answering, sentiment analysis, summarization, comparing various texts for similarity and differences, and so on. For each category of task, you prepare a dataset and ask the human labelers to provide the answers. To mitigate individual bias, you can collect multiple responses for the same question by sourcing answers from multiple human labelers and then consolidate the responses into an aggregate label. Human labeling is regarded as the gold standard for collecting high-quality data at scale. However, the process of labeling by hand tends to be tedious, time-consuming, and expensive for labeling tasks that involve millions of data points, which has motivated the study of AI-assisted data annotation tools—such as Snapper—that interactively reduce the burden of manual annotation.
  • LLM-assisted labeling: Another common approach to labeling is to use another LLM to label the data to speed up the labeling process. In this approach, you use another LLM to generate the responses for the various tasks such as sentiment analysis, summarization, coding, and so on. This can be achieved in different ways. In some cases, we can use N-shot learning approaches to improve the quality of the label. To mitigate bias, we use the human-in-the-loop (HITL) approach to review certain responses to verify that the labels are high quality. The benefit of this approach is that it’s faster than human labeling because you can scale the LLM endpoint and serve multiple requests in parallel. However, the downside is that you have to keep iterating and changing the acceptance threshold of confidence of the model’s response. For example, if you’re preparing the dataset for financial crime, you have to lower the tolerance for false negatives and accept slightly higher false positives.
  • Cohort-based labeling: Cohort-based labeling is an emerging approach where more than two LLMs are asked to generate the label for the same data. The models are then asked whether they agree with the other model’s response. The label is accepted if both models agree with each other’s response. There is another variation of this approach where instead of asking the models to agree with each other’s responses, you use a third LLM to rate the quality of the output of the other two models. It produces high quality outputs, but the cost of labeling rises exponentially because you need to make at least three LLM invocation calls for each data point to produce the final label. This approach is under active research, and we expect more orchestration tools for this in the near future.
  • RLHF-based data labeling: This approach is inspired by the RLHF fine-tuning process. Based on the task at hand, you first take a sample of unlabeled data points and have them labeled by a human labeler. You then use the labeled dataset to fine-tune an LLM. The next step is to use the fine-tuned LLM to produce multiple outputs for another subset of unlabeled data points. A human labeler ranks the outputs from best to worst, and you use this data to train a reward model. You then send the rest of the unlabeled data points through a proximal policy optimization (PPO) policy initialized from the supervised model. The policy generates the label, the reward model calculates a reward for that label, and the reward is used to update the PPO policy. For further reading on this topic, see Improving your LLMs with RLHF on Amazon SageMaker.
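The cohort-based approach in the list above reduces to a simple agreement check between independent labelers. The sketch below stubs the two models as plain functions (the lexicon-based labelers are illustrative stand-ins for LLM calls) and routes disagreements to human review:

```python
# Cohort-based labeling: accept a label only when two independent
# models agree; otherwise defer to a human (or a third judge model).
def cohort_label(text, labeler_a, labeler_b):
    label_a, label_b = labeler_a(text), labeler_b(text)
    if label_a == label_b:
        return label_a   # consensus: accept the label
    return None          # disagreement: route to human review

# Stand-in "models" for the sketch; real systems would invoke two LLMs.
model_a = lambda t: "positive" if "great" in t else "negative"
model_b = lambda t: "positive" if "great" in t or "good" in t else "negative"

agreed = cohort_label("great earnings report", model_a, model_b)
disputed = cohort_label("good quarter overall", model_a, model_b)
```

The third-judge variant replaces the equality check with a call to another model that scores both candidate labels, at the cost of an extra invocation per data point.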

Data processing architecture

The entire data processing pipeline can be implemented as a series of jobs, as illustrated in the following architecture diagram. Amazon SageMaker is used as a job facility to filter, deduplicate, and tokenize the data. The intermediate outputs of each job can be stored on Amazon Simple Storage Service (Amazon S3). Depending on the size of the final datasets, either Amazon S3 or Amazon FSx for Lustre can be used to store the final dataset. For larger datasets, FSx for Lustre can significantly improve training throughput by eliminating the need to copy or stream data directly from S3. An example pipeline using the Hugging Face DataTrove library is provided in this repo.
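Conceptually, the job sequence is a chain of transformations over the corpus. The toy sketch below chains filter, dedupe, and tokenize stages as composable functions; in a real deployment each stage would run as a SageMaker job with its intermediate output written to Amazon S3, and the whitespace "tokenizer" here is an illustrative stand-in:

```python
def filter_step(docs):
    """Drop documents that are too short (illustrative quality rule)."""
    return [d for d in docs if len(d.split()) >= 3]

def dedupe_step(docs):
    """Remove exact duplicates while preserving order."""
    seen, out = set(), []
    for d in docs:
        if d not in seen:
            seen.add(d)
            out.append(d)
    return out

def tokenize_step(docs):
    """Whitespace split as a stand-in for a real subword tokenizer."""
    return [d.split() for d in docs]

def run_pipeline(docs, steps):
    # Each step's output feeds the next, mirroring chained batch jobs.
    for step in steps:
        docs = step(docs)
    return docs

result = run_pipeline(
    ["too short", "a clean sample document", "a clean sample document"],
    [filter_step, dedupe_step, tokenize_step],
)
```

Because each stage takes and returns the same shape of data until the final tokenization, stages can be reordered, swapped, or scaled out independently.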

Pipeline for fine-tuning

As previously discussed, fine-tuning data typically comprises an input instruction and the desired output. This data can be sourced using manual human annotation, synthetic generation, or a combination of the two. The following architecture diagram outlines an example pipeline where fine-tuning data is generated from an existing corpus of domain-specific documents. An example of a fine-tuning dataset would take a source document as input or context and generate task-specific responses such as a summary of the document, key information extracted from the document, or answers to questions about the document.

Models provided by Amazon Bedrock can be used to generate the synthetic data, which can then be validated and modified by a human reviewer using Amazon SageMaker Ground Truth. SageMaker Ground Truth can also be used to create human-labeled fine-tuning data from scratch. For synthetic data generation, be sure to review the model provider's acceptable usage terms to verify compliance.

Pipeline for DPO

After a model is fine-tuned, it can be deployed on model hosting services such as Amazon SageMaker. The hosted model can then be used to generate candidate responses to various prompts. Through SageMaker Ground Truth, users can then provide feedback on which responses they prefer, resulting in a preference dataset. This flow is outlined in the following architecture diagram and can be repeated multiple times as the model is tuned using the latest preference data.

Conclusion

Preparing high-quality datasets for LLM training is a critical yet complex process that requires careful consideration of various factors. From extracting and cleaning data from diverse sources to deduplicating content and maintaining ethical standards, each step plays a crucial role in shaping the model’s performance. By following the guidelines outlined in this post, organizations can curate well-rounded datasets that capture the nuances of their domain, leading to more accurate and reliable LLMs.


About the Authors

Simon Zamarin is an AI/ML Solutions Architect whose main focus is helping customers extract value from their data assets. In his spare time, Simon enjoys spending time with family, reading sci-fi, and working on various DIY house projects.

Vikram Elango is an AI/ML Specialist Solutions Architect at Amazon Web Services, based in Virginia USA. Vikram helps financial and insurance industry customers with design, thought leadership to build and deploy machine learning applications at scale. He is currently focused on natural language processing, responsible AI, inference optimization and scaling ML across the enterprise. In his spare time, he enjoys traveling, hiking, cooking and camping with his family.

Qingwei Li is a Machine Learning Specialist at Amazon Web Services. He received his Ph.D. in Operations Research after he broke his advisor’s research grant account and failed to deliver the Nobel Prize he promised. Currently he helps customers in the financial service and insurance industry build machine learning solutions on AWS. In his spare time, he likes reading and teaching.

Vinayak Arannil is a Sr. Applied Scientist from the AWS Bedrock team. With several years of experience, he has worked on various domains of AI like computer vision, natural language processing etc. Vinayak led the data processing for the Amazon Titan model training. Currently, Vinayak helps build new features on the Bedrock platform enabling customers to build cutting-edge AI applications with ease and efficiency.

Vikesh Pandey is a Principal GenAI/ML Specialist Solutions Architect at AWS, helping customers from financial industries design, build and scale their GenAI/ML workloads on AWS. He carries an experience of more than a decade and a half working on entire ML and software engineering stack. Outside of work, Vikesh enjoys trying out different cuisines and playing outdoor sports.

David Ping is a Sr. Manager of AI/ML Solutions Architecture at Amazon Web Services. He helps enterprise customers build and operate machine learning solutions on AWS. David enjoys hiking and following the latest machine learning advancement.

Graham Horwood is Sr. Manager of Data Science from the AWS Bedrock team.

Read More

Design multi-agent orchestration with reasoning using Amazon Bedrock and open source frameworks


As generative AI capabilities evolve, successful business adoptions hinge on the development of robust problem-solving capabilities. At the forefront of this transformation are agentic systems, which harness the power of foundation models (FMs) to tackle complex, real-world challenges. By seamlessly integrating multiple agents, these innovative solutions enable autonomous collaboration, decision-making, and efficient problem-solving in diverse environments. Empirical research conducted by Amazon Web Services (AWS) scientists in conjunction with academic researchers has demonstrated the significant strides made in enhancing the reasoning capabilities through agent collaboration on competitive tasks.

This post provides step-by-step instructions for creating a collaborative multi-agent framework with reasoning capabilities to decouple business applications from FMs. It demonstrates how to combine Amazon Bedrock Agents with open source multi-agent frameworks, enabling collaborations and reasoning among agents to dynamically execute various tasks. The exercise will guide you through the process of building a reasoning orchestration system using Amazon Bedrock, Amazon Bedrock Knowledge Bases, Amazon Bedrock Agents, and FMs. We also explore the integration of Amazon Bedrock Agents with open source orchestration frameworks LangGraph and CrewAI for dispatching and reasoning.

AWS has introduced a multi-agent collaboration capability for Amazon Bedrock, enabling developers to build, deploy, and manage multiple AI agents working together on complex tasks. This feature allows for the creation of specialized agents that handle different aspects of a process, coordinated by a supervisor agent that breaks down requests, delegates tasks, and consolidates outputs. This approach improves task success rates, accuracy, and productivity, especially for complex, multi-step tasks.

For the example code and demonstration discussed in this post, refer to the agentic-orchestration GitHub repository and this AWS Workshop. You can also refer to the GitHub repo for Amazon Bedrock multi-agent collaboration code samples.

Key characteristics of an agentic service

In the context of generative AI, “agent” refers to an autonomous function that can interact with its environment, gather data, and make decisions to execute complex tasks to achieve predefined goals. Generative AI agents are autonomous, goal-oriented systems that use FMs, such as large language models (LLMs), to interact with and adapt to their environments. These agents excel in planning, problem-solving, and decision-making, using techniques such as chain-of-thought prompting to break down complex tasks. They can self-reflect, improve their processes, and expand their capabilities through tool use and collaboration with other AI models. These agents can operate independently or collaboratively, executing tasks across various domains while continuously adapting to new information and changing circumstances. Agents can increase creativity and produce content at scale, automating repetitive tasks so humans can focus on strategic, high-value work, which leads to cost savings. The following diagram shows the high-level architecture of the solution.

To implement an agent on AWS, you can use the Amazon Bedrock Agents Boto3 client as demonstrated in the following code example. After the required AWS Identity and Access Management (IAM) role is created for the agent, use the create_agent API. This API requires an agent name, an FM identifier, and an instruction string. Optionally, you can also provide an agent description. The created agent is not yet prepared for use. We focus on preparing the agent and then using it to invoke actions and interact with other APIs. Use the following code example to obtain your agent ID; it will be crucial for performing operations with the agent.

# Use the Python boto3 SDK to interact with the Amazon Bedrock Agents service
import boto3

bedrock_agent_client = boto3.client('bedrock-agent')

# Create a new Bedrock Agent
response = bedrock_agent_client.create_agent(
    agentName=<agent_name>, #customized text string
    agentResourceRoleArn=<agent_role['Role']['Arn']>, #IAM role assigned to the agent
    description=<agent_description>, #customized text string
    idleSessionTTLInSeconds=1800, 
    foundationModel=<agent_foundation_model>, #e.g. "anthropic.claude-3-sonnet-20240229-v1:0"
    instruction=<agent_instruction>, #agent instruction text string
)
agent_id = response['agent']['agentId']
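A newly created agent is in a draft state and must be prepared before it can serve requests. The hedged sketch below assembles the parameters for the runtime invocation (the agent, alias, and session IDs are placeholders) and notes the prepare and invoke calls, which require AWS credentials and a real agent to execute:

```python
# Assemble the parameters for bedrock-agent-runtime's invoke_agent call.
# All IDs below are illustrative placeholders.
def build_invoke_request(agent_id, alias_id, session_id, prompt):
    """Return the keyword arguments expected by invoke_agent."""
    return {
        "agentId": agent_id,
        "agentAliasId": alias_id,
        "sessionId": session_id,
        "inputText": prompt,
    }

request = build_invoke_request("AGENT123", "TSTALIASID", "session-1",
                               "Summarize today's tasks.")

# With credentials configured, the calls would be:
# boto3.client('bedrock-agent').prepare_agent(agentId="AGENT123")
# boto3.client('bedrock-agent-runtime').invoke_agent(**request)
```

The invoke_agent response is streamed; the session ID lets the agent keep conversational context across successive calls that reuse the same value.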

Multi-agent pipelines for intra-agent collaboration

Multi-agent pipelines are orchestrated processes within AI systems that involve multiple specialized agents working together to accomplish complex tasks. Within pipelines, agents are organized in a sequential order structure, with different agents handling specific subtasks or roles within the overall workflow. Agents interact with each other, often through a shared “scratchpad” or messaging system, allowing them to exchange information and build upon each other’s work. Each agent maintains its own state, which can be updated with new information as the flow progresses. Complex projects are broken down into manageable subtasks, which are then distributed among the specialized agents. The workflow includes clearly defined processes for how tasks should be orchestrated, facilitating efficient task distribution and alignment with objectives. These processes can govern both inter-agent interactions and intra-agent operations (such as how an agent interacts with tools or processes outputs). Agents can be assigned specific roles (for example, retriever or injector) to tackle different aspects of a problem.

As a practical example, consider a multi-agent pipeline for blog writing, implemented with the multi-agent framework CrewAI. To create a multi-agent pipeline with CrewAI, first define the individual agents that will participate in the pipeline. The agents in the following example are the Planner Agent, a Writer Agent, and an Editor Agent. Next, arrange these agents into a pipeline, specifying the order of task execution and how the data flows between them. CrewAI provides mechanisms for agents to pass information to each other and coordinate their actions. The modular and scalable design of CrewAI makes it well-suited for developing both simple and sophisticated multi-agent AI applications. The following diagram shows this multi-agent pipeline.

from crewai import Agent, Task, Crew, Process

# Create a blog writing multi-agent pipeline, which is comprised of a planner, a writer, and an editor agent
# This code snippet shows only the planner agent, which calls web search tools 
# and Amazon Bedrock for the LLM 
class blogAgents():
    def __init__(self, topic, model_id):
        self.topic = topic
        self.model_id = model_id

    def planner(self, topic, model_id):
        return Agent(
            role="Content Planner",
            goal=f"""Plan engaging and factually accurate content on {topic}.""",
            backstory=f"""You're working on planning a blog article about the topic: {topic}. \n
                      You collect information by searching the web for the latest developments that directly relate to the {topic}. \n
                      You help the audience learn something to make informed decisions regarding {topic}. \n
                      Your work is the basis for the Content Writer to write an article on this {topic}.""",
            allow_delegation=False,
            tools=<tools_to_use>,
            llm=<Bedrock_foundation_model>,
            verbose=True
        )
......

# Create the associated blog agent tasks which are comprised of a planner, writer, and editor tasks.
# This code snippet shows only the planner task.
class blogTasks():
    def __init__(self, topic, model_id):
        self.topic = topic
        self.model_id = model_id

    def plan(self, planner, topic, model_id):
        return Task(
            description=(
                f"""1. Prioritize the latest trends, key players, and noteworthy news on {topic}.\n
                2. Identify the target audience, considering their interests and pain points.\n
                3. Develop a detailed content outline including an introduction, key points, and a call to action.\n
                4. Include SEO keywords and relevant data or sources."""
            ),
            expected_output=f"""Convey the latest developments on the {topic} with sufficient depth as a domain expert.\n
                Create a comprehensive content plan document with an outline, audience analysis, SEO keywords, and resources.""",
            agent=planner
        )
......

# Define planner agent and planning tasks
planner_agent = agents.planner(self.topic, self.model_id)
plan_task = tasks.plan(planner_agent, self.topic, self.model_id)
......
 
# Define an agentic pipeline to chain the agent and associated tasks
# with service components, embedding engine, and execution process
crew = Crew(
        agents=[planner_agent, writer_agent, editor_agent],
        tasks=[plan_task, write_task, edit_task],
        verbose=True,
        memory=True,
        embedder={
            "provider": "huggingface",
            "config": {"model": "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2"},
        },
        cache=True,
        process=Process.sequential # Sequential process will have tasks executed one after the other
       )
result = crew.kickoff()

As demonstrated in this code example, multi-agent pipelines are generally simple linear structures that are easy to set up and understand. They have a clear sequential flow of tasks from one agent to the next and work well for straightforward workflows with a defined order of operations. However, the pipeline structure is less flexible for complex, nonlinear agent interactions, making it harder to express branching logic or cycles, and it can be less efficient for problems that require back-and-forth between agents. The next section addresses a graph framework for multi-agent systems, which lends itself better to more complex scenarios.

Multi-agent graph framework for asynchronous orchestration and reasoning

A multi-agent framework offers significant potential for intelligent, dynamic problem-solving, enabling collaborative, specialized task execution. While these systems can enhance inference accuracy and response efficiency by dynamically activating and coordinating agents, they also present critical challenges, including potential bias, limited reasoning capabilities, and the need for robust oversight. Effective multi-agent frameworks require careful design considerations, such as clear leadership, dynamic team construction, effective information sharing, planning mechanisms like chain-of-thought prompting, memory systems for contextual learning, and strategic orchestration of specialized language models. As the technology evolves, balancing agent autonomy with human oversight and ethical safeguards will be crucial to unlocking the full potential of these intelligent systems while mitigating potential risks.

A multi-agent graph framework is a system that models the interactions and relationships between multiple autonomous agents using a graph-based representation. In this type of framework, agents are represented as nodes in the graph, with each agent having its own set of capabilities, goals, and decision-making processes. The edges in the graph represent the interactions, communications, or dependencies between the agents. These can include things like information sharing, task delegation, negotiation, or coordination. The graph structure allows for the modeling of complex, dynamic relationships between agents, including cycles, feedback loops, and hierarchies. The following diagram shows this architecture.

The graph-based approach provides a flexible and scalable way to represent the structure of multi-agent systems, making it easier to analyze, simulate, and reason about the emergent behaviors that arise from agent interactions. The following code snippet illustrates the process of building a graph framework designed for multi-agent orchestration using LangGraph. This framework is essential for managing and coordinating the interactions between multiple agents within a system, promoting efficient and effective communication and collaboration. Notably, it emphasizes the plug-and-play feature, which allows for dynamic changes and the flexibility to accommodate third-party agents. Frameworks with this capability can seamlessly adapt to new requirements and integrate with external systems, enhancing their overall versatility and usability.

from langgraph.graph import StateGraph, END
......
# Create a graph to orchestrate multiple agents (i.e., nodes)
orch = StateGraph(MultiAgentState)
orch.add_node("rewrite_agent", rewrite_node)
orch.add_node("booking_assistant", bedrock_agent_node)
orch.add_node("blog_writer", blog_writer_node)
orch.add_node("router_agent", router_node)
orch.add_node("search_expert", search_expert_node)
......

# Create edges to connect agents to form a graph
orch.set_entry_point("rewrite_agent")
orch.add_edge("rewrite_agent", "router_agent")
orch.add_conditional_edges(
    "RAG_agent",
    decide_to_search,
    {
        "to_human": "human",
        "do_search": "search_expert",
    },
)
orch.add_edge("blog_writer", "text2image_generation")
......

# Compile the graph for agentic orchestration
graph = orch.compile(checkpointer=memory, interrupt_before=["human"])

The multi-agent graph approach is particularly useful for domains where complex, dynamic interactions between autonomous entities need to be modeled and analyzed, such as in robotics, logistics, social networks, and more. There are multiple advantages and disadvantages to the multi-agent graph-based approach over the linear multi-agent pipelines approach, which are captured below.
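Stripped of any specific library, the mechanics that make the graph approach work (named nodes over shared state, fixed edges, and conditional routing) fit in a short sketch. The following toy orchestrator is illustrative only; it mirrors the shape of the LangGraph example above but is not its implementation, and all node names and behaviors are made up for the example:

```python
# Toy graph orchestrator: nodes transform a shared state dict, edges decide
# the next node, and conditional edges branch on the current state.
# Illustrative only; real frameworks such as LangGraph add checkpointing,
# interrupts, and async execution on top of this idea.

class MiniGraph:
    def __init__(self):
        self.nodes = {}   # name -> callable(state) -> state
        self.edges = {}   # name -> callable(state) -> next node name or "END"
        self.entry = None

    def add_node(self, name, fn):
        self.nodes[name] = fn

    def add_edge(self, src, dst):
        self.edges[src] = lambda state, dst=dst: dst

    def add_conditional_edges(self, src, router):
        self.edges[src] = router

    def set_entry_point(self, name):
        self.entry = name

    def run(self, state, max_steps=20):
        current = self.entry
        for _ in range(max_steps):
            state = self.nodes[current](state)
            nxt = self.edges.get(current, lambda s: "END")(state)
            if nxt == "END":
                return state
            current = nxt
        raise RuntimeError("cycle did not terminate")

# Wire up a two-branch flow resembling the LangGraph example above.
g = MiniGraph()
g.add_node("rewrite_agent", lambda s: {**s, "query": s["query"].strip().lower()})
g.add_node("search_expert", lambda s: {**s, "answer": f"searched: {s['query']}"})
g.add_node("human", lambda s: {**s, "answer": "escalated to human"})
g.set_entry_point("rewrite_agent")
g.add_conditional_edges(
    "rewrite_agent",
    lambda s: "search_expert" if "?" in s["query"] else "human",
)
g.add_edge("search_expert", "END")
g.add_edge("human", "END")

print(g.run({"query": "  What is RAG?  "})["answer"])  # prints "searched: what is rag?"
```

The cycle guard in `run` hints at why graphs need more care than pipelines: once branching and loops are allowed, termination is no longer guaranteed by construction.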

Advantages and limitations

The emergence of agentic services represents a transformative approach to system design. Unlike conventional AI models that adhere to fixed, predetermined workflows, agentic systems are characterized by their capacity to collaborate, adapt, and make decisions in real time. This transition from passive to active AI opens up exciting opportunities and presents unique design challenges for developers and architects. Central to agentic services is the notion of agentic reasoning, which embodies a flexible, iterative problem-solving methodology that reflects human cognitive processes. By integrating design patterns such as reflection, self-improvement, and tool utilization, we can develop AI agents that are capable of ongoing enhancement and broader functionality across various domains.

Agentic services, although promising, face several limitations that must be addressed for their successful production implementation. The complexity of managing multiple autonomous agents, especially as their numbers and scope increase, poses a significant challenge in maintaining system coherence and stability. Additionally, the emergent behaviors of these systems can be difficult to predict and understand, hindering transparency and interpretability, which are crucial for building trust and accountability. Safety and robustness are paramount concerns because unintended behaviors or failures could have far-reaching consequences, necessitating robust safeguards and error-handling mechanisms. As agentic services scale up, maintaining efficient performance becomes increasingly challenging, requiring optimized resource utilization and load balancing. Finally, the lack of widely adopted standards and protocols for agent-based systems creates interoperability issues, making it difficult to integrate these services with existing infrastructure. Addressing these limitations is essential for the widespread adoption and success of agentic services in various domains.

Advantages:

  • More flexible representation of agent interactions using a graph structure
  • Better suited for complex workflows with nonlinear agent communication
  • Can more easily represent cycles and branching logic between agents
  • Potentially more scalable for large multi-agent systems
  • Clearer visualization of overall agent system structure

Disadvantages:

  • More complex initial setup compared to linear pipelines
  • Can require more upfront planning to design the graph structure
  • Can require more resource usage and longer response times

Next steps

In the next phase of multi-agent orchestration, our focus will be on enhancing the reasoning, reflection, and self-correction capabilities of our agents. This involves developing advanced algorithms (such as tree-of-thoughts (ToT) prompting, Monte Carlo tree search (MCTS), and others) that allow agents to learn from their peer interactions, adapt to new situations, and correct their behaviors based on feedback. Additionally, we’re working on creating a production-ready framework that can accommodate a variety of agentic services. This framework will be designed to be flexible and scalable, enabling seamless integration of different types of agents and services. These efforts are currently underway, and we’ll provide a detailed update on our progress in the next blog post. Stay tuned for more insights into our innovative approach to multi-agent orchestration.

Conclusion

Multi-agent orchestration and reasoning represent a significant leap forward in generative AI production adoption, offering unprecedented potential for complex problem-solving and decision-making while decoupling your applications from individual FMs. It’s also crucial to acknowledge and address the limitations, including scalability challenges, high latency, and potential incompatibility among different agents. As we look to the future, enhancing the self- and intra-agent reasoning, reflection, and self-correction capabilities of our agents will be paramount. This will involve developing more sophisticated algorithms for metacognition, improving inter-agent communication protocols, and implementing robust error detection and correction mechanisms.

For the example code and demonstration discussed in this post, refer to the agentic-orchestration GitHub repository and this AWS Workshop. You can also refer to the GitHub repo for Amazon Bedrock multi-agent collaboration code samples.

The authors wish to express their gratitude to Mark Roy, Maria Laderia Tanke, and Max Iguer for their insightful contributions, as well as to Nausheen Sayed for her relentless coordination.


About the authors

Alfred Shen is a Senior GenAI Specialist at AWS. He has been working in Silicon Valley, holding technical and managerial positions in diverse sectors including healthcare, finance, and high-tech. He is a dedicated applied AI/ML researcher, concentrating on agentic solutions and multimodality.

Anya Derbakova is a Senior Startup Solutions Architect at AWS, specializing in Healthcare and Life Science technologies. A University of North Carolina graduate, she previously worked as a Principal Developer at Blue Cross Blue Shield Association. Anya is recognized for her contributions to AWS professional development, having been featured on the AWS Developer Podcast and participating in multiple educational series. She co-hosted a six-part mini-series on AWS Certification Exam Prep, focusing on cost-optimized cloud architecture strategies. Additionally, she was instrumental in the “Get Schooled on…Architecting” podcast, which provided comprehensive preparation for the AWS Solutions Architect Exam.

Read More

How Fastweb fine-tuned the Mistral model using Amazon SageMaker HyperPod as a first step to build an Italian large language model

This post is co-written with Marta Cavalleri and Giovanni Germani from Fastweb, and Claudia Sacco and Andrea Policarpi from BIP xTech.

AI’s transformative impact extends throughout the modern business landscape, with telecommunications emerging as a key area of innovation. Fastweb, one of Italy’s leading telecommunications operators, recognized the immense potential of AI technologies early on and began investing in this area in 2019. With a vision to build a large language model (LLM) trained on Italian data, Fastweb embarked on a journey to make this powerful AI capability available to third parties.

Training an LLM is a compute-intensive and complex process, which is why Fastweb, as a first step in their AI journey, used AWS generative AI and machine learning (ML) services such as Amazon SageMaker HyperPod.

SageMaker HyperPod can provision and maintain large-scale, resilient compute clusters powered by thousands of accelerators such as AWS Trainium chips and NVIDIA H200 and H100 graphics processing units (GPUs). Its flexibility, however, allowed Fastweb to deploy a small, agile, on-demand cluster, enabling efficient resource utilization and cost management that aligned well with the project’s requirements.

In this post, we explore how Fastweb used cutting-edge AI and ML services to embark on their LLM journey, overcoming challenges and unlocking new opportunities along the way.

Fine-tuning Mistral 7B on AWS

Fastweb recognized the importance of developing language models tailored to the Italian language and culture. To achieve this, the team built an extensive Italian language dataset by combining public sources and acquiring licensed data from publishers and media companies. Using this data, in their first experiment with LLM training, Fastweb fine-tuned Mistral 7B, a state-of-the-art LLM, successfully adapting it to tasks such as summarization, question answering, and creative writing in Italian. The fine-tuned model applies a nuanced understanding of Italian culture to its responses, providing contextually appropriate and culturally sensitive output.

The team opted for fine-tuning on AWS. This strategic decision was driven by several factors:

  • Efficient data preparation – Building a high-quality pre-training dataset is a complex task, involving assembling and preprocessing text data from various sources, including web sources and partner companies. Because the final, comprehensive pre-training dataset was still under construction, it was essential to begin with an approach that could adapt existing models to Italian.
  • Early results and insights – Fine-tuning allowed the team to achieve early results in training models on the Italian language, providing valuable insights and preliminary Italian language models. This enabled the engineers to iteratively improve the approach based on initial outcomes.
  • Computational efficiency – Fine-tuning requires significantly less computational power and less time to complete compared to a complete model pre-training. This approach streamlined the development process and allowed for a higher volume of experiments within a shorter time frame on AWS.

To facilitate the process, the team created a comprehensive dataset encompassing a wide range of tasks, constructed by translating existing English datasets and generating synthetic elements. The dataset was stored in an Amazon Simple Storage Service (Amazon S3) bucket, which served as a centralized data repository. During the training process, our SageMaker HyperPod cluster was connected to this S3 bucket, enabling effortless retrieval of the dataset elements as needed.

The integration of Amazon S3 and the SageMaker HyperPod cluster exemplifies the power of the AWS ecosystem, where various services work together seamlessly to support complex workflows.

Overcoming data scarcity with translation and synthetic data generation

When fine-tuning a custom version of the Mistral 7B LLM for the Italian language, Fastweb faced a major obstacle: high-quality Italian datasets were extremely limited or unavailable. To tackle this data scarcity challenge, Fastweb had to build a comprehensive training dataset from scratch to enable effective model fine-tuning.

While establishing strategic agreements to acquire licensed data from publishers and media companies, Fastweb employed two main strategies to create a diverse and well-rounded dataset: translating open source English training data into Italian and generating synthetic Italian data using AI models.

To use the wealth of information available in English, Fastweb translated open source English training datasets into Italian. This approach made valuable data accessible and relevant for Italian language training. Both LLMs and open source translation tools were used for this process.

The open source Argos Translate tool was used for bulk translation of datasets with simpler content. Although LLMs offer superior translation quality, Argos Translate is free, extremely fast, and well-suited for efficiently handling large volumes of straightforward data. For complex datasets where accuracy was critical, LLMs were employed to provide high-quality translations.
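The routing rule described above (bulk translator for simple content, LLM translator for complex content) can be sketched as follows. Both translator functions are placeholders rather than real API calls, and the length-and-punctuation complexity heuristic is an assumption for illustration, not Fastweb’s actual criterion:

```python
# Illustrative routing of dataset items between a fast bulk translator and a
# higher-quality (but slower) LLM translator. The two translator functions
# are stand-ins, not real Argos Translate or LLM invocations.

def bulk_translate(text: str) -> str:          # stand-in for Argos Translate
    return f"[argos-it] {text}"

def llm_translate(text: str) -> str:           # stand-in for an LLM call
    return f"[llm-it] {text}"

def route_translation(text: str, complexity_threshold: int = 120) -> str:
    # Toy heuristic: short, plain sentences go to the bulk translator; long
    # or structurally complex text goes to the LLM. A real pipeline would
    # use a stronger complexity signal than length and punctuation counts.
    complex_markers = sum(text.count(c) for c in ";:()")
    if len(text) > complexity_threshold or complex_markers >= 2:
        return llm_translate(text)
    return bulk_translate(text)

print(route_translation("The cat sleeps."))
print(route_translation("A long clause (with nesting); another clause: more."))
```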

To further enrich the dataset, Fastweb generated synthetic Italian data using LLMs. This involved creating a variety of text samples covering a wide range of topics and tasks relevant to the Italian language. High-quality Italian web articles, books, and other texts served as the basis for training the LLMs to generate authentic-sounding synthetic content that captured the nuances of the language.

The resulting sub-datasets spanned diverse subjects, including medical information, question-answer pairs, conversations, web articles, science topics, and more. The tasks covered were also highly varied, encompassing question answering, summarization, creative writing, and others.

Each subset generated through translation or synthetic data creation underwent meticulous filtering to maintain quality and diversity. A similarity check was performed to deduplicate the data; if two elements were found to be too similar, one was removed. This step was crucial in maintaining variability and preventing bias from repetitive or overly similar content.

The deduplication process involved embedding dataset elements using a text embedder, then computing cosine similarity between the embeddings to identify similar elements. Meta’s FAISS library, renowned for its efficiency in similarity search and clustering of dense vectors, was used as the underlying vector database due to its ability to handle large-scale datasets effectively.
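A scaled-down sketch of this embed-then-compare step follows, using a toy hashing “embedder” in place of a real embedding model and a linear scan in place of a FAISS index (which is what makes the real approach tractable at scale):

```python
import math

# Sketch of the deduplication step: embed each item, then drop any item whose
# cosine similarity to an already-kept item meets a threshold. The hashing
# "embedder" is a stand-in for a real text embedding model.

def toy_embed(text: str, dim: int = 64) -> list[float]:
    vec = [0.0] * dim
    for i, ch in enumerate(text.lower()):
        vec[(ord(ch) * 31 + i) % dim] += 1.0
    return vec

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def deduplicate(texts, threshold=0.95):
    kept, kept_vecs = [], []
    for t in texts:
        v = toy_embed(t)
        if all(cosine(v, kv) < threshold for kv in kept_vecs):
            kept.append(t)
            kept_vecs.append(v)
    return kept

items = ["Roma è la capitale d'Italia.",
         "ROMA È LA CAPITALE D'ITALIA.",          # duplicate up to casing
         "Il Po è il fiume più lungo d'Italia."]
print(deduplicate(items))  # the casing-only duplicate is dropped
```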

After filtering and deduplication, the remaining subsets were postprocessed and combined to form the final fine-tuning dataset, comprising 300,000 training elements. This comprehensive dataset enabled Fastweb to effectively fine-tune their custom version of the Mistral 7B model, achieving high performance and diversity across a wide range of tasks and topics.

All data generation and processing steps were run in parallel directly on the SageMaker HyperPod cluster nodes, using a unique working environment and highlighting the cluster’s versatility for various tasks beyond just training models.

The following diagram illustrates two distinct data pipelines for creating the final dataset: the upper pipeline uses translations of existing English datasets into Italian, and the lower pipeline employs custom generated synthetic data.

Dataset creation pipelines

The computational cost of training an LLM

The computational cost of training LLMs scales approximately with the number of parameters and the amount of training data. As a general rule, roughly 24 bytes of memory are required for each model parameter being trained. This means that fully fine-tuning a 7-billion-parameter model like Mistral 7B requires at least 156 GB of accelerator memory, not including the additional overhead of loading training data.

The following table provides additional examples.

LLM Model Size vs. Training Memory

Number of parameters    Memory requirement
500 million             12 GB
1 billion               23 GB
2 billion               45 GB
3 billion               67 GB
5 billion               112 GB
7 billion               156 GB
10 billion              224 GB

Parameter-efficient fine-tuning (PEFT) methods minimize the number of trainable parameters, whereas quantization reduces the number of bits per parameter, often with minimal negative impact on the final training results.

Despite these memory-saving techniques, fine-tuning large models still demands substantial GPU memory and extended training times. This makes distributed training essential, allowing the workload to be shared across multiple GPUs, thereby enabling the efficient handling of such large-scale computational tasks.

The following table and figure illustrate the allocation of GPU memory during each phase of LLM training.

Training requirements

Solution overview

Training LLMs often requires significant computational resources that can exceed the capabilities of a single GPU. Distributed training is a powerful technique that addresses this challenge by distributing the workload across multiple GPUs and nodes, enabling parallel processing and reducing training time. SageMaker HyperPod simplifies the process of setting up and running distributed training jobs, providing preconfigured environments and libraries specifically designed for this purpose.

There are two main techniques for distributed training: data parallelization and model parallelization. Data parallelization involves distributing the training data across multiple GPUs, whereas model parallelization splits the model itself across different GPUs.
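The data-parallel case rests on a simple identity: with equal-sized shards, the average of the per-worker gradients equals the full-batch gradient. A toy illustration with a one-parameter model follows (plain Python, no ML framework; this shows the principle, not how a real distributed job is launched):

```python
# Data parallelism in miniature: two "workers" each compute the gradient of a
# mean-squared-error loss on half the batch; averaging their gradients gives
# exactly the full-batch gradient. Model parallelism would instead split the
# model's parameters across devices.

def grad_mse(w, shard):
    # d/dw of mean((w*x - y)^2) over the shard = mean(2*(w*x - y)*x)
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)

w = 0.5
batch = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
shard_a, shard_b = batch[:2], batch[2:]

g_full = grad_mse(w, batch)
g_avg = (grad_mse(w, shard_a) + grad_mse(w, shard_b)) / 2
print(g_full == g_avg)  # True: equal-sized shards make the average exact
```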

To take advantage of distributed training, a cluster of interconnected GPUs, often spread across multiple physical nodes, is required. SageMaker HyperPod allows data and model parallelization techniques to be employed simultaneously, maximizing the available computational resources. It also provides resilience through features like automatic fault detection and recovery, which are crucial for long-running training jobs, and supports personalized Conda environments, enabling the installation of the necessary libraries and tools for distributed training.

One popular library for implementing distributed training is DeepSpeed, a Python optimization library that handles distributed training and makes it memory-efficient and fast by enabling both data and model parallelization. The choice to use DeepSpeed was driven by the availability of an extensive, already-developed code base, ready to be employed for training experiments. The high flexibility and environment customization capabilities of SageMaker HyperPod made it possible to create a personalized Conda environment with all the necessary libraries installed, including DeepSpeed.

The following diagram illustrates the two key parallelization strategies offered by DeepSpeed: data parallelism and model parallelism. Data parallelism involves replicating the entire model across multiple devices, with each device processing a distinct batch of training data. In contrast, model parallelism distributes different parts of a single model across multiple devices, enabling the training of large models that exceed the memory capacity of a single device.

Data parallelization and model parallelization

To help meet the demanding computational requirements of training LLMs, we used the power and flexibility of SageMaker HyperPod clusters, orchestrated with Slurm. While HyperPod also supports orchestration with Amazon EKS, our research team had prior expertise with Slurm. The cluster configuration was tailored to our specific training needs, providing optimal resource utilization and cost-effectiveness.

The SageMaker HyperPod cluster architecture consisted of a controller machine to orchestrate the training job’s coordination and resource allocation. The training tasks were run by two compute nodes, which were g5.12xlarge instances equipped with high-performance GPUs. These compute nodes handled the bulk of the computational workload, using their GPUs to accelerate the training process.

The AWS managed high-performance Lustre file system (Amazon FSx for Lustre) mounted on the nodes provided high-speed data access and transfer rates, which are essential for efficient training operations.

SageMaker HyperPod is often used to launch large clusters with thousands of GPUs for pre-training large language models (LLMs), but one of its key advantages is its flexibility: it also allows for the creation of small, agile, on-demand clusters. This versatility made it possible to use resources only when needed, avoiding unnecessary costs.

For the DeepSpeed configuration, we followed the standard recommended setup, enabling data and model parallelism across the two g5.12xlarge nodes of the cluster, for a total of 8 GPUs.
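DeepSpeed jobs are driven by a JSON configuration file. The fragment below sketches what such a file can contain; the batch sizes and ZeRO stage are illustrative assumptions for a small 8-GPU cluster, not Fastweb’s actual settings:

```python
import json

# Illustrative DeepSpeed configuration (values are assumptions): ZeRO stage 2
# partitions optimizer state and gradients across the GPUs, and fp16 halves
# the memory footprint of weights and activations.
ds_config = {
    "train_micro_batch_size_per_gpu": 2,
    "gradient_accumulation_steps": 8,
    "fp16": {"enabled": True},
    "zero_optimization": {"stage": 2},
}

# DeepSpeed consumes this as a JSON file passed on the launch command line.
print(json.dumps(ds_config, indent=2))
```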

Although more advanced techniques, such as offloading some computation to the CPU during training, were available, our cluster was sized with a sufficiently large GPU memory margin. With 192 GiB (206 GB) of overall GPU memory available, even accounting for the additional GPU memory needed to keep dataset batches in memory during training, we had ample resources to train a 7B-parameter model without these advanced techniques. The following figure describes the infrastructure setup of our training solution.

Architecture diagram

Training results and output examples

After completing the training process, Fastweb’s fine-tuned language model demonstrated a significant performance improvement on Italian language tasks compared to the base model. Evaluated on an internal benchmark dataset, the fine-tuned model achieved an average accuracy increase of 20% across a range of tasks designed to assess its general understanding of the Italian language.

The benchmark tasks focused on three key areas: question answering, common sense reasoning, and next word prediction. Question answering tasks tested the model’s ability to comprehend and provide accurate responses to queries in Italian. Common sense reasoning evaluated the model’s grasp of common sense knowledge and its capacity to make logical inferences based on real-world scenarios. Next word prediction assessed the model’s understanding of language patterns and its ability to predict the most likely word to follow in a given context.

To evaluate the fine-tuned model’s performance, we initiated our interaction by inquiring about its capabilities. The model responded by enumerating its primary functions, emphasizing its ability to address Fastweb-specific topics. The response was formulated in correct Italian with a very natural syntax, as illustrated in the following example.

Dialog 1 - How can you help me?

Afterwards, we asked the model to generate five titles for a presentation on the topic of AI.

Generate titles for a slide deck about AI

Just for fun, we asked what the most famous sandwich is. The model responded with a combination of typical Italian ingredients and added that there is a wide variety of choices.

What is the most famous panini in Italy?

Lastly, we asked the model to provide us with a useful link to understand the recent EU AI Act. The model provided a working link, along with a helpful description.

Tell me something about EU AI Act

Conclusion

Using SageMaker HyperPod, Fastweb successfully fine-tuned the Mistral 7B model as a first step in their generative AI journey, significantly improving its performance on tasks involving the Italian language.

Looking ahead, Fastweb plans to deploy their next models also on Amazon Bedrock using the Custom Model Import feature. This strategic move will enable Fastweb to quickly build and scale new generative AI solutions for their customers, using the broad set of capabilities available on Amazon Bedrock.

By harnessing Amazon Bedrock, Fastweb can further enhance their offerings and drive digital transformation for their customers. This initiative aligns with Fastweb’s commitment to staying at the forefront of AI technology and fostering innovation across various industries.

With their fine-tuned language model running on Amazon Bedrock, Fastweb will be well-positioned to deliver cutting-edge generative AI solutions tailored to the unique needs of their customers. This will empower businesses to unlock new opportunities, streamline processes, and gain valuable insights, ultimately driving growth and competitiveness in the digital age.

Fastweb’s decision to use the Custom Model Import feature in Amazon Bedrock underscores the company’s forward-thinking approach and their dedication to providing their customers with the latest and most advanced AI technologies. This collaboration with AWS further solidifies Fastweb’s position as a leader in digital transformation and a driving force behind the adoption of innovative AI solutions across industries.

To learn more about SageMaker HyperPod, refer to Amazon SageMaker HyperPod and the Amazon SageMaker HyperPod workshop.


About the authors

Marta Cavalleri is the Manager of the Artificial Intelligence Center of Excellence (CoE) at Fastweb, where she leads teams of data scientists and engineers in implementing enterprise AI solutions. She specializes in AI operations, data governance, and cloud architecture on AWS.

Giovanni Germani is the Manager of Architecture & Artificial Intelligence CoE at Fastweb, where he leverages his extensive experience in Enterprise Architecture and digital transformation. With over 12 years in Management Consulting, Giovanni specializes in technology-driven projects across telecommunications, media, and insurance industries. He brings deep expertise in IT strategy, cybersecurity, and artificial intelligence to drive complex transformation programs.

Claudia Sacco is an AWS Professional Solutions Architect at BIP xTech, collaborating with Fastweb’s AI CoE and specialized in architecting advanced cloud and data platforms that drive innovation and operational excellence. With a sharp focus on delivering scalable, secure, and future-ready solutions, she collaborates with organizations to unlock the full potential of cloud technologies. Beyond her professional expertise, Claudia finds inspiration in the outdoors, embracing challenges through climbing and trekking adventures with her family.

Andrea Policarpi is a Data Scientist at BIP xTech, collaborating with Fastweb’s AI CoE. With a strong foundation in computer vision and natural language processing, he is currently exploring the world of Generative AI and leveraging its powerful tools to craft innovative solutions for emerging challenges. In his free time, Andrea is an avid reader and enjoys playing the piano to relax.

Giuseppe Angelo Porcelli is a Principal Machine Learning Specialist Solutions Architect for Amazon Web Services. With several years of software engineering and an ML background, he works with customers of any size to understand their business and technical needs and design AI and ML solutions that make the best use of the AWS Cloud and the Amazon Machine Learning stack. He has worked on projects in different domains, including MLOps, computer vision, and NLP, involving a broad set of AWS services. In his free time, Giuseppe enjoys playing football.

Adolfo Pica has a strong background in cloud computing, with over 20 years of experience in designing, implementing, and optimizing complex IT systems and architectures and with a keen interest and hands-on experience in the rapidly evolving field of generative AI and foundation models. He has expertise in AWS cloud services, DevOps practices, security, data analytics and generative AI. In his free time, Adolfo enjoys following his two sons in their sporting adventures in taekwondo and football.

Maurizio Pinto is a Senior Solutions Architect at AWS, specialized in cloud solutions for telecommunications. With extensive experience in software architecture and AWS services, he helps organizations navigate their cloud journey while pursuing his passion for AI’s transformative impact on technology and society.

Read More

Using natural language in Amazon Q Business: From searching and creating ServiceNow incidents and knowledge articles to generating insights

Many enterprise customers across various industries are looking to adopt generative AI to drive innovation, improve user productivity, and enhance customer experience. Generative AI–powered assistants such as Amazon Q Business can be configured to answer questions, provide summaries, generate content, and securely complete tasks based on data and information in your enterprise systems. Amazon Q Business understands natural language and allows users to receive immediate, permissions-aware responses from enterprise data sources with citations. This capability supports various use cases such as IT, HR, and help desk.

With custom plugins for Amazon Q Business, you can enhance the application environment to enable your users to use natural language to perform specific tasks related to third-party applications, such as Jira, Salesforce, and ServiceNow, directly from within their web experience chat.

Enterprises that have adopted ServiceNow can improve their operations and boost user productivity by using Amazon Q Business for various use cases, including incident and knowledge management. Users can search ServiceNow knowledge base (KB) articles and incidents, and can also create, manage, and track incidents and KB articles, all from within their web experience chat.

In this post, we’ll demonstrate how to configure an Amazon Q Business application and add a custom plugin that gives users the ability to use a natural language interface provided by Amazon Q Business to query real-time data and take actions in ServiceNow. By the end of this hands-on session, you should be able to:

  • Create an Amazon Q Business application and integrate it with ServiceNow using a custom plugin.
  • Use natural language in your Amazon Q web experience chat to perform read and write actions in ServiceNow such as querying and creating incidents and KB articles in a secure and governed fashion.

Prerequisites

Before proceeding, make sure that you have the necessary AWS account permissions and services enabled, along with access to a ServiceNow environment with the required privileges for configuration.

AWS

  • An AWS account with permissions to create an Amazon Q Business application and to use AWS IAM Identity Center.
ServiceNow

  • Obtain a ServiceNow Personal Developer Instance or use a clean ServiceNow developer environment. You will need an account that has admin privileges to perform the configuration steps in ServiceNow.

Solution overview

The following architecture diagram illustrates the workflow for Amazon Q Business web experience with enhanced capabilities to integrate it seamlessly with ServiceNow.

Solution Overview

The implementation includes the following steps:

  1. The solution begins with configuring Amazon Q Business using the AWS Management Console. This includes setting up the application environment, adding users to AWS IAM Identity Center, selecting the appropriate subscription tier, and configuring the web experience for users to interact with. The environment can optionally be configured to provide real-time data retrieval using a native retriever, which pulls information from indexed data sources, such as Amazon Simple Storage Service (Amazon S3), during interactions.
  2. The next step involves adjusting the global controls and response settings for the application environment guardrails to allow Amazon Q Business to use its large language model (LLM) knowledge to generate responses when it cannot find responses from your connected data sources.
  3. Integration with ServiceNow is achieved by setting up an OAuth Inbound application endpoint in ServiceNow, which authenticates and authorizes interactions between Amazon Q Business and ServiceNow. This involves creating an OAuth API endpoint in ServiceNow and using the web experience URL from Amazon Q Business as the callback URL. The setup makes sure that Amazon Q Business can securely perform actions in ServiceNow with the same scoped permissions as the user signing in to ServiceNow.
  4. The final step of the solution involves enhancing the application environment with a custom plugin for ServiceNow using APIs defined in an OpenAPI schema. The plugin allows Amazon Q Business to securely interact with ServiceNow’s REST APIs, enabling operations such as querying, creating, and updating records dynamically and in real time.

Configuring the Amazon Q Business application

To create an Amazon Q Business application, sign in to the Amazon Q Business console.
As a prerequisite to creating an Amazon Q Business application, follow the instructions in the Configuring an IAM Identity Center instance section. Amazon Q Business integrates with IAM Identity Center to manage user access to your Amazon Q Business application. This is the recommended method for managing human access to AWS resources and the method used in this post.

Amazon Q Business also supports identity federation through IAM. When you use identity federation, you can manage users with your enterprise identity provider (IdP) and use IAM to authenticate users when they sign in to Amazon Q Business.

Create and configure the Amazon Q Business application:

  1. In the Amazon Q Business console, choose Applications in the navigation pane and then choose Create application.
  2. Enter the following information for your Amazon Q Business application:
    • Application name: Enter a name for quick identification, such as my-demo-application.
    • Service access: Select Create and use a new service-linked role (SLR). A service-linked role is a unique type of IAM role that is linked directly to Amazon Q Business. Service-linked roles are predefined by Amazon Q Business and include the permissions that the service requires to call other AWS services on your behalf.
    • Choose Create.
  3. After creating your Amazon Q Business application environment, create and select the retriever and provision the index that will power your generative AI web experience. The retriever pulls data from the index in real time during a conversation. On the Select Retriever page:
    • Retrievers: Select Use native retriever.
    • Index provisioning: Select Starter, which is ideal for proof-of-concept or developer workloads. See Index types for more information.
    • Number of units: Enter 1. This indicates the number of capacity units to provision for your index. Each unit supports up to 20,000 documents.
    • Choose Next.
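The console steps above map onto the Amazon Q Business API operations CreateApplication, CreateIndex, and CreateRetriever. The following sketch builds the request payloads as plain dictionaries, with the boto3 calls shown in comments; the display names are placeholders, not values from this post:

```python
# Sketch: the console walkthrough above expressed as Amazon Q Business
# API payloads. The boto3 calls are commented out because they require
# AWS credentials; display names are placeholder assumptions.

def application_params(name):
    """Payload for CreateApplication using a new service-linked role."""
    return {"displayName": name}

def index_params(application_id, units=1):
    """Payload for a Starter index; each capacity unit covers 20,000 documents."""
    return {
        "applicationId": application_id,
        "displayName": "my-demo-index",
        "type": "STARTER",
        "capacityConfiguration": {"units": units},
    }

def retriever_params(application_id, index_id):
    """Payload for a native retriever backed by the index above."""
    return {
        "applicationId": application_id,
        "displayName": "my-demo-retriever",
        "type": "NATIVE_INDEX",
        "configuration": {"nativeIndexConfiguration": {"indexId": index_id}},
    }

# qbusiness = boto3.client("qbusiness")
# app = qbusiness.create_application(**application_params("my-demo-application"))
# index = qbusiness.create_index(**index_params(app["applicationId"]))
# qbusiness.create_retriever(**retriever_params(app["applicationId"], index["indexId"]))
```

The payload builders keep the console choices (Starter index, one capacity unit, native retriever) in one place, so you can review them before making any API calls.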

Select Retriever

  4. After you select a retriever for your Amazon Q Business application environment, you can optionally connect other data sources to it. Because a data source isn’t required for this session, we won’t configure one. For more information on connecting data sources to an Amazon Q Business application, see connecting data sources.
    • Choose Next.
  5. As an account admin, you can add users to your IAM Identity Center instance from the Amazon Q Business console. After you add users or groups to an application environment, you can then choose the Amazon Q Business tier for each user or group. On the Add groups and users page:
    • Choose Add groups and users.
    • In the Add new users dialog box that opens, enter the details of the user. For a single user, these include: Username, First name, Last name, Email address, Confirm email address, and Display name.
    • Choose Next and then Add. The user is automatically added to an IAM Identity Center directory, and an email invitation to join IAM Identity Center is sent to the email address provided.
    • After adding a user or group, choose the Amazon Q Business subscription tier for each user or group. From the Current subscription dropdown menu, select Q Business Pro.
    • For Web experience service access, select Create and use a new service role.
    • Choose Create application.

    Add groups and users

Upon successful completion, Amazon Q Business returns a web experience URL that you can share with the users you added to your application environment. The Web experience URL (in this case: https://xxxxxxxx.chat.qbusiness.us-east-1.on.aws/) will be used when creating an OAuth application endpoint in ServiceNow. Note that your web experience URL will be different from the one shown here.

Application Created

Enhancing an Amazon Q Business application with guardrails

By default, an Amazon Q Business application is configured to respond to user chat queries using only enterprise data. Because we didn’t configure a data source for the purpose of this post, you will use Admin controls and guardrails to allow Amazon Q to use its LLM world knowledge to generate responses when it cannot find responses from your connected data sources.

Configure the application guardrails to allow fallback to LLM knowledge:

  1. From the Amazon Q Business console, choose Applications in the navigation pane. Select the name of your application from the list of applications.
  2. From the navigation pane, choose Enhancements, and then choose Admin controls and guardrails.
  3. In Global controls, choose Edit.
  4. In Response settings under Application guardrails, select Allow Amazon Q to fall back to LLM knowledge.
  5. Choose Save.
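The same guardrail change can be made with the UpdateChatControlsConfiguration API, whose responseScope field switches between enterprise-content-only answers and LLM fallback. A minimal payload sketch (the boto3 call is commented out because it requires credentials):

```python
# Sketch: the guardrail edit above as an UpdateChatControlsConfiguration
# payload. EXTENDED_KNOWLEDGE_ENABLED allows Amazon Q Business to fall
# back to LLM knowledge; ENTERPRISE_CONTENT_ONLY restricts answers to
# connected data sources.

def chat_controls_params(application_id, allow_llm_fallback=True):
    """Payload toggling the response scope for an application."""
    scope = (
        "EXTENDED_KNOWLEDGE_ENABLED"
        if allow_llm_fallback
        else "ENTERPRISE_CONTENT_ONLY"
    )
    return {"applicationId": application_id, "responseScope": scope}

# boto3.client("qbusiness").update_chat_controls_configuration(
#     **chat_controls_params("app-123"))
```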

create guardrails

Configuring ServiceNow

To allow Amazon Q Business to connect to your ServiceNow instance, you need to create an OAuth inbound application endpoint. OAuth-based authentication validates the identity of the client that attempts to establish a trust on the system by using an authentication protocol. For more information, see OAuth Inbound and Outbound authentication.

Create an OAuth application endpoint for external client applications to access the ServiceNow instance:

  1. In the ServiceNow console, navigate to All, then System OAuth, then Application Registry and then choose New. On the interceptor page, select Create an OAuth API endpoint for external clients and then fill in the form with details for Name and Redirect URL. The other fields are automatically generated by the ServiceNow OAuth server.
    • The Redirect URL is the callback URL that the authorization server redirects to. Enter the web experience URL of your Amazon Q Business application environment (which is the client requesting access to the resource), appended with /oauth/callback.
    • For this example, the URL is: https://xxxxxxxx.chat.qbusiness.us-east-1.on.aws/oauth/callback
  2. For Auth Scope, set the value to useraccount. The scope API response parameter defines the amount of access granted by the access token, which means that the access token has the same rights as the user account that authorized the token. For example, if Abel Tuter authorizes an application by providing login credentials, then the resulting access token grants the token bearer the same access privileges as Abel Tuter.
  3. Choose Submit.

This creates an OAuth client application record and generates a client ID and client secret, which Amazon Q Business needs to access the restricted resources on the instance. You will need this authentication information (client ID and client secret) in the following custom plugin configuration process.
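Amazon Q Business performs the OAuth authorization-code exchange for you. Purely for illustration, the token request it sends to ServiceNow's /oauth_token.do endpoint has the standard OAuth 2.0 shape below; all values are placeholders:

```python
# Illustration only: the standard OAuth 2.0 authorization-code exchange
# that Amazon Q Business performs against ServiceNow on the user's behalf.
from urllib.parse import urlencode

def token_request(instance_url, client_id, client_secret, code, redirect_uri):
    """Build the POST target and form body for the token exchange."""
    return {
        "url": f"{instance_url}/oauth_token.do",
        "body": urlencode({
            "grant_type": "authorization_code",
            "code": code,                    # short-lived code from oauth_auth.do
            "client_id": client_id,          # from the Application Registry record
            "client_secret": client_secret,  # from the Application Registry record
            "redirect_uri": redirect_uri,    # must match the registered Redirect URL
        }),
    }

req = token_request(
    "https://devxxxxxx.service-now.com",
    "my-client-id", "my-client-secret", "auth-code",
    "https://xxxxxxxx.chat.qbusiness.us-east-1.on.aws/oauth/callback",
)
```

Because the resulting access token carries the useraccount scope, every API call made with it is limited to what the signed-in ServiceNow user can already do.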

ServiceNow App Registry OAuth

Enhancing the Amazon Q Business application environment with custom plugins for ServiceNow

To integrate with external applications, Amazon Q Business uses APIs, which are configured as part of the custom plugins.

Before creating a custom plugin, you need to create or edit an OpenAPI schema outlining the API operations that you want to enable for your custom plugin. Amazon Q Business uses the configured third-party OpenAPI specifications to dynamically determine which API operations to perform to fulfill a user request. The OpenAPI schema definition therefore has a large impact on API selection accuracy and might require design optimizations. To maximize accuracy and efficiency with an Amazon Q Business custom plugin, follow the best practices for configuring OpenAPI schema definitions.

To configure a custom plugin, you must define at least one and a maximum of eight API operations that can be invoked. To define the API operations, create an OpenAPI schema in JSON or YAML format. You can create OpenAPI schema files and upload them to Amazon S3. Alternatively, you can use the OpenAPI text editor in the console, which will validate your schema.
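Because a custom plugin accepts between one and eight API operations, a quick sanity check before uploading your schema can count the operations it defines. A stdlib-only sketch, run here against an abridged version of the sample schema from this post:

```python
# Count the HTTP operations defined across all paths of an OpenAPI
# schema, and check the count fits the custom-plugin limit of 1-8.
import json

HTTP_METHODS = {"get", "post", "put", "patch", "delete", "head", "options"}

def count_operations(schema):
    """Return the number of HTTP operations across all paths."""
    return sum(
        1
        for path_item in schema.get("paths", {}).values()
        for method in path_item
        if method.lower() in HTTP_METHODS
    )

# Abridged version of the ServiceNow Table API schema used in this post:
# GET/POST on the table path, GET/DELETE/PATCH on the record path.
schema = json.loads("""{
  "openapi": "3.0.1",
  "paths": {
    "/api/now/table/{tableName}": {"get": {}, "post": {}},
    "/api/now/table/{tableName}/{sys_id}": {"get": {}, "delete": {}, "patch": {}}
  }
}""")
ops = count_operations(schema)
assert 1 <= ops <= 8, f"custom plugins support 1-8 operations, found {ops}"
```

The full sample schema in this post defines five operations, comfortably within the limit.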

For this post, a working sample of an OpenAPI schema for ServiceNow is provided in JSON format. Before using it, edit the template file and replace YOUR_SERVICENOW_INSTANCE_URL in the servers and securitySchemes sections with the URL of your ServiceNow instance.

You can use the REST API Explorer to browse available APIs, API versions, and methods for each API. The explorer enables you to test REST API requests straight from the user interface. The Table API provides endpoints that allow you to perform create, read, update, and delete (CRUD) operations on existing tables. The calling user must have sufficient roles to access the data in the table specified in the request. For additional information on assigning roles, see Managing roles.
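To see the kind of request the schema's GET operation produces, the following sketch builds a Table API query URL using ServiceNow's encoded-query syntax, matching "login" against both the Short description and Text fields with the LIKE operator (as in user prompt 1 later in this post). The instance URL is a placeholder:

```python
# Build a ServiceNow Table API GET URL with an encoded query
# (sysparm_query), a field projection, and a result cap.
from urllib.parse import urlencode

def table_query_url(instance_url, table, query, fields=None, limit=10):
    """Return a Table API GET URL for the given table and encoded query."""
    params = {"sysparm_query": query, "sysparm_limit": limit}
    if fields:
        params["sysparm_fields"] = ",".join(fields)
    return f"{instance_url}/api/now/table/{table}?{urlencode(params)}"

# LIKE search on short_description OR text, querying KB articles:
url = table_query_url(
    "https://devxxxxxx.service-now.com",
    "kb_knowledge",
    "short_descriptionLIKElogin^ORtextLIKElogin",
    fields=["number", "short_description"],
)
```

You can paste the same table, query, and field values into the REST API Explorer to confirm the request shape before relying on the plugin.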

{
  "openapi": "3.0.1",
  "info": {
    "title": "Table API",
    "description": "Allows you to perform create, read, update and delete (CRUD) operations on existing tables",
    "version": "latest"
  },
  "externalDocs": {
    "url": "https://docs.servicenow.com/?context=CSHelp:REST-Table-API"
  },
  "servers": [
    {
      "url": "YOUR_SERVICENOW_INSTANCE_URL"
    }
  ],
  "paths": {
    "/api/now/table/{tableName}": {
      "get": {
        "description": "Retrieve records from a table",
        "parameters": [
          {
            "name": "tableName",
            "in": "path",
            "description": "Table Name",
            "required": true,
            "schema": {
              "type": "string"
            }
          },
          {
            "name": "sysparm_query",
            "in": "query",
            "description": "An encoded query string used to filter the results like Incidents Numbers or Knowledge Base IDs etc",
            "required": true,
            "schema": {
              "type": "string"
            }
          },
          {
            "name": "sysparm_fields",
            "in": "query",
            "description": "A comma-separated list of fields to return in the response",
            "required": false,
            "schema": {
              "type": "string"
            }
          },
          {
            "name": "sysparm_limit",
            "in": "query",
            "description": "The maximum number of results returned per page",
            "required": false,
            "schema": {
              "type": "string"
            }
          }
        ],
        "responses": {
          "200": {
            "description": "ok",
            "content": {
              "application/json": {
                "schema": {
                  "$ref": "#/components/schemas/incident"
                }
              }
            }
          }
        }
      },
      "post": {
        "description": "Create a record",
        "parameters": [
          {
            "name": "tableName",
            "in": "path",
            "description": "Table Name",
            "required": true,
            "schema": {
              "type": "string"
            }
          }
        ],
        "requestBody": {
          "content": {
            "application/json": {
              "schema": {
                "type": "object",
                "properties": {
                  "short_description": {
                    "type": "string",
                    "description": "Short Description"
                  },
                  "description": {
                    "type": "string",
                    "description": "Full Description for Incidents only"
                  },
                  "caller_id": {
                    "type": "string",
                    "description": "Caller Email"
                  },
                  "state": {
                    "type": "string",
                    "description": "State of the incident",
                    "enum": [
                      "new",
                      "in_progress",
                      "resolved",
                      "closed"
                    ]
                  },
                  "text": {
                    "type": "string",
                    "description": "Article Body Text for Knowledge Bases Only (KB)"
                  }
                },
                "required": [
                  "short_description",
                  "caller_id"
                ]
              }
            }
          },
          "required": true
        },
        "responses": {
          "200": {
            "description": "ok",
            "content": {
              "application/json": {}
            }
          }
        }
      }
    },
    "/api/now/table/{tableName}/{sys_id}": {
      "get": {
        "description": "Retrieve a record",
        "parameters": [
          {
            "name": "tableName",
            "in": "path",
            "description": "Table Name",
            "required": true,
            "schema": {
              "type": "string"
            }
          },
          {
            "name": "sys_id",
            "in": "path",
            "description": "Sys ID",
            "required": true,
            "schema": {
              "type": "string"
            }
          },
          {
            "name": "sysparm_fields",
            "in": "query",
            "description": "A comma-separated list of fields to return in the response",
            "required": false,
            "schema": {
              "type": "string"
            }
          }
        ],
        "responses": {
          "200": {
            "description": "ok",
            "content": {
              "application/json": {},
              "application/xml": {},
              "text/xml": {}
            }
          }
        }
      },
      "delete": {
        "description": "Delete a record",
        "parameters": [
          {
            "name": "tableName",
            "in": "path",
            "description": "Table Name",
            "required": true,
            "schema": {
              "type": "string"
            }
          },
          {
            "name": "sys_id",
            "in": "path",
            "description": "Sys ID",
            "required": true,
            "schema": {
              "type": "string"
            }
          }
        ],
        "responses": {
          "200": {
            "description": "ok",
            "content": {
              "application/json": {},
              "application/xml": {},
              "text/xml": {}
            }
          }
        }
      },
      "patch": {
        "description": "Update or modify a record",
        "parameters": [
          {
            "name": "tableName",
            "in": "path",
            "description": "Table Name",
            "required": true,
            "schema": {
              "type": "string"
            }
          },
          {
            "name": "sys_id",
            "in": "path",
            "description": "Sys ID",
            "required": true,
            "schema": {
              "type": "string"
            }
          }
        ],
        "requestBody": {
          "content": {
            "application/json": {
              "schema": {
                "type": "object",
                "properties": {
                  "short_description": {
                    "type": "string",
                    "description": "Short Description"
                  },
                  "description": {
                    "type": "string",
                    "description": "Full Description for Incidents only"
                  },
                  "caller_id": {
                    "type": "string",
                    "description": "Caller Email"
                  },
                  "state": {
                    "type": "string",
                    "description": "State of the incident",
                    "enum": [
                      "new",
                      "in_progress",
                      "resolved",
                      "closed"
                    ]
                  },
                  "text": {
                    "type": "string",
                    "description": "Article Body Text for Knowledge Bases Only (KB)"
                  }
                },
                "required": [
                  "short_description",
                  "caller_id"
                ]
              }
            }
          },
          "required": true
        },
        "responses": {
          "200": {
            "description": "ok",
            "content": {
              "application/json": {},
              "application/xml": {},
              "text/xml": {}
            }
          }
        }
      }
    }
  },
  "components": {
    "schemas": {
      "incident": {
        "type": "object",
        "properties": {
          "sys_id": {
            "type": "string",
            "description": "Unique identifier for the incident"
          },
          "number": {
            "type": "string",
            "description": "Incident number"
          },
          "short_description": {
            "type": "string",
            "description": "Brief description of the incident"
          }
        }
      }
    },
    "securitySchemes": {
      "oauth2": {
        "type": "oauth2",
        "flows": {
          "authorizationCode": {
            "authorizationUrl": "YOUR_SERVICENOW_INSTANCE_URL/oauth_auth.do",
            "tokenUrl": "YOUR_SERVICENOW_INSTANCE_URL/oauth_token.do",
            "scopes": {
            "useraccount": "Access equivalent to the user's account"
            }
          }
        }
      }
    }
  },
  "security": [
    {
      "oauth2": [
        "useraccount"
      ]
    }
  ]
}

The URL for the ServiceNow instance used in this post is: https://devxxxxxx.service-now.com/. After updating the sections of the template with the URL for this specific instance, the JSON should look like the following:

  "servers": [
    {
      "url": "https://devxxxxxx.service-now.com/"
    }
    "securitySchemes": {
      "oauth2": {
        "type": "oauth2",
        "flows": {
          "authorizationCode": {
            "authorizationUrl": "https://devxxxxxx.service-now.com/oauth_auth.do",
            "tokenUrl": "https://devxxxxxx.service-now.com/oauth_token.do",
            "scopes": {
              "useraccount": "Access equivalent to the user's account"
            }
          }
        }
      }
    }
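If you prefer to patch the template programmatically, a short script can perform the same substitution and confirm the result is still valid JSON. The template string below is an abridged stand-in for the full schema file:

```python
# Replace the YOUR_SERVICENOW_INSTANCE_URL placeholder in the schema
# template and validate that the filled-in result is well-formed JSON.
import json

def fill_instance_url(template_text, instance_url):
    """Substitute the placeholder and parse the result as a sanity check."""
    filled = template_text.replace(
        "YOUR_SERVICENOW_INSTANCE_URL", instance_url.rstrip("/")
    )
    json.loads(filled)  # raises ValueError if the schema is not valid JSON
    return filled

# Abridged template for demonstration; apply the same call to the full file.
template = '{"servers": [{"url": "YOUR_SERVICENOW_INSTANCE_URL"}]}'
print(fill_instance_url(template, "https://devxxxxxx.service-now.com/"))
```

Stripping the trailing slash keeps derived values such as the authorizationUrl (placeholder plus /oauth_auth.do) free of double slashes.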

To create a custom plugin for ServiceNow:

    1. Sign in to the Amazon Q Business console.
    2. Choose Applications in the navigation pane, and then select your application from the list of applications.
    3. In the navigation pane, choose Enhancements, and then choose Plugins.
    4. In Plugins, choose Add plugin.
    5. In Add plugins, choose Custom plugin.
      Create Custom Plugin
    6. In Custom plugin, enter the following information:
      • In Name and description, for Plugin name: Enter a name for your Amazon Q plugin.
      • In API schema, for API schema source, select Define with in-line OpenAPI schema editor.
      • Select JSON as the format for the schema.
      • Remove any sample schema that appears in the inline OpenAPI schema editor and replace it with the text from the provided sample JSON template, updated with your ServiceNow instance URL.

      Enter Custom Plugin Details

    7. In Authentication: Select Authentication required.
    8. For AWS Secrets Manager secret, choose Create and add a new secret. You need to store the ServiceNow OAuth authentication credentials in a Secrets Manager secret to connect your third-party application to Amazon Q. In the window that opens, enter the details in the form:
      • Secret name: A name for your Secrets Manager secret.
      • Client ID: The Client ID from ServiceNow OAuth configuration in the previous section.
      • Client secret: The Client Secret from ServiceNow OAuth configuration in the previous section.
      • OAuth callback URL: The URL the user needs to be redirected to after authentication. This will be your web experience URL. For this example, it’s: https://xxxxxxxx.chat.qbusiness.us-east-1.on.aws/oauth/callback. Amazon Q Business will handle OAuth tokens in this URL.
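The console creates the secret for you. If you script this step instead, the secret is a JSON string holding the OAuth credentials; the key names below are assumptions based on the console fields, and all values are placeholders:

```python
# Sketch: build the JSON secret string that backs the custom plugin's
# authentication. Key names are assumptions based on the console form;
# the Secrets Manager call is commented out (requires credentials).
import json

def plugin_secret_string(client_id, client_secret, redirect_uri):
    """JSON payload for the Secrets Manager secret used by the plugin."""
    return json.dumps({
        "client_id": client_id,
        "client_secret": client_secret,
        "redirect_uri": redirect_uri,
    })

secret = plugin_secret_string(
    "my-client-id",
    "my-client-secret",
    "https://xxxxxxxx.chat.qbusiness.us-east-1.on.aws/oauth/callback",
)
# boto3.client("secretsmanager").create_secret(
#     Name="qbusiness-servicenow-plugin", SecretString=secret)
```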

Create AWS Secrets Manager secret

  9. In Choose a method to authorize Amazon Q Business: Select Create and add a new service role. The console will generate a service role name. To connect Amazon Q Business to third-party applications that require authentication, you need to give the Amazon Q role permissions to access your Secrets Manager secret. This will enable an Amazon Q Business custom plugin to access the credentials needed to sign in to the third-party service.
    Custom Plugin Authentication
  10. Choose Add plugin to add your plugin.

Upon successful completion, the plugin will appear under Plugins with Build status of Ready and Plugin status Active.
Custom Plugin Active

Using Amazon Q Business web experience chat to take actions in ServiceNow

Users can launch your Amazon Q Business web experience in two ways:

  • AWS access portal URL provided in an invitation email sent to the user to join AWS IAM Identity Center.
  • Web experience URL shared by the admin.

Navigate to the deployed web experience URL and sign in with your AWS IAM Identity Center credentials.
After signing in, choose the New conversation icon in the left-hand menu to start a conversation.

Example: Search Knowledge Base Articles in ServiceNow for user issue and create an incident

The following chat conversation example illustrates a typical use case of Amazon Q Business integrated with custom plugins for ServiceNow. These features allow you to perform a wide range of tasks tailored to your organization’s needs.

In this example, we initiate a conversation in the web experience chat to search for KB articles related to ”log in issues” in ServiceNow by invoking a plugin action. After the user submits a prompt, Amazon Q Business queries ServiceNow through the appropriate API to retrieve the results and provides a response with related KB articles. We then proceed by asking Amazon Q Business for more details to see if any of the KB articles directly addresses the user’s issue. When no relevant KB articles pertaining to the user’s issue are found, we ask Amazon Q Business to summarize the conversation and create a new incident in ServiceNow, making sure the issue is logged for resolution.

User prompt 1 – I am having issues logging in to the intranet and want to know if there are any ServiceNow KB articles on log-in issues. Perform the search on both Short Description and Text field using LIKE operator

Before submitting the preceding prompt, choose the vertical ellipsis to open Conversation settings, then choose Use a Plugin to select the corresponding custom plugin for ServiceNow.
Web Experience Chat conversation with Amazon Q Business with Custom Plugin
If this is the first time a user is accessing the custom plugin or if their past sign-in has expired, the user will need to authenticate. After authenticating successfully, Amazon Q Business will perform the requested task.

Choose Authorize.
Amazon Q Business Authorization for ServiceNow Interaction

If the user isn’t already signed in to ServiceNow, they will be prompted to enter their credentials. For this example, the user signing in to ServiceNow is the admin user and API actions performed in ServiceNow by Amazon Q Business on behalf of the user will have the same level of access as the user within ServiceNow.
ServiceNow Login

Choose Allow for Amazon Q Business to connect to ServiceNow and perform the requested task on your behalf.

Allow Access to Amazon Q Business

After verifying that the user is authorized, Amazon Q Business executes the request and responds with the information that it retrieved. We then retrieve additional details with the following prompt.

User prompt 2 – Can you list the KB number and short description in a tabular form?

Conversation with Amazon Q Business to search for KB articles in ServiceNow
Because no KB articles related to the user’s issue were found, we ask Amazon Q to summarize the conversation context and create an incident with the following prompt.

User prompt 3 – The error I get is "Unable to Login After System Upgrade". Summarize my issue and create an incident with detailed description and add a note that this needs to be resolved asap.

In response to your prompt for an action, Amazon Q displays a review form where you can modify or fill in the necessary information.

To complete the action, choose Submit.

Note: The caller_id value entered in the following example is a valid ServiceNow user.

Amazon Q Business Create Service Now Incident
Your web experience will display a success message if the action succeeds, or an error message if the action fails. In this case, the action succeeded and Amazon Q Business responded accordingly.

Amazon Q Business - Success message after incident Creation

The following screenshot shows that the incident was created successfully in ServiceNow.

Shows ServiceNow Incident Created from Amazon Q Business

Troubleshooting common errors

To have a seamless experience with third-party application integrations, it’s essential to thoroughly test, identify, and troubleshoot unexpected behavior.

A common error encountered in Amazon Q Business is API Response too large, which occurs when an API response size exceeds the current limit of 100 KB. While prompting techniques are essential for obtaining accurate and relevant answers, optimizing API responses to include only the necessary and relevant data is crucial for better response times and enhanced user experience.
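One practical mitigation is to constrain responses at the API level with the sysparm_fields and sysparm_limit parameters that the sample schema already exposes. A small sketch of a trimmed query string (the encoded query value is illustrative):

```python
# Keep Table API responses small: request only the fields you need and
# cap the page size, so payloads stay under the 100 KB plugin limit.
from urllib.parse import urlencode

def trimmed_query(query, fields, limit=5):
    """Build a query string that projects fields and caps result count."""
    return urlencode({
        "sysparm_query": query,
        "sysparm_fields": ",".join(fields),  # omit large fields such as 'text'
        "sysparm_limit": limit,              # cap the number of records
    })

qs = trimmed_query("active=true", ["number", "short_description"])
```

Omitting large fields (such as a KB article body) from sysparm_fields is often enough on its own to bring a response back under the limit.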

The REST API Explorer (shown in the following figure) in ServiceNow is a tool that allows developers and administrators to interact with and test the ServiceNow REST APIs directly from within the ServiceNow environment. It provides a user-friendly interface for making API requests, viewing responses, and understanding the available endpoints and data structures. Using this tool simplifies the process of testing and integrating with ServiceNow.
Rest API Explorer in ServiceNow

Clean up

To clean up AWS configurations, sign in to the Amazon Q Business console.

  1. From the Amazon Q Business console, in Applications, select the application that you want to delete.
  2. Choose Actions and select Delete.
  3. To confirm deletion, enter Delete.

This will take a few minutes to finish. When completed, the application and the configured custom plugin will be deleted.
Delete Amazon Q Business App

When you delete the Amazon Q Business application, the users created as part of the configuration are not automatically deleted from IAM Identity Center. Use the instructions in Delete users in IAM Identity Center to delete the users created for this post.

To clean up in ServiceNow, release the Personal Developer Instance provisioned for this post by following the instructions in the ServiceNow Documentation.

Conclusion

The integration of generative AI-powered assistants such as Amazon Q Business with enterprise systems such as ServiceNow offers significant benefits for organizations. By using natural language processing capabilities, enterprises can streamline operations, enhance user productivity, and deliver better customer experiences. The ability to query real-time data and create incidents and knowledge articles through a secure and governed chat interface transforms how users interact with enterprise data and applications. As demonstrated in this post, enhancing Amazon Q Business to integrate with ServiceNow using custom plugins empowers users to perform complex tasks effortlessly, driving efficiency across various business functions. Adopting this technology not only modernizes workflows, but also positions enterprises at the forefront of innovation.


About the Author

Siddhartha Angara is a Senior Solutions Architect at Amazon Web Services. He helps enterprise customers design and build well-architected solutions in the cloud, accelerate cloud adoption, and build Machine Learning and Generative AI applications. He enjoys playing the guitar, reading and family time!

Read More

Simplify multimodal generative AI with Amazon Bedrock Data Automation

Developers face significant challenges when using foundation models (FMs) to extract data from unstructured assets. This data extraction process requires carefully identifying models that meet the developer’s specific accuracy, cost, and feature requirements. Additionally, developers must invest considerable time optimizing price performance through fine-tuning and extensive prompt engineering. Managing multiple models, implementing safety guardrails, and adapting outputs to align with downstream system requirements can be difficult and time consuming.

Amazon Bedrock Data Automation, now in public preview, helps address these and other challenges. This new capability from Amazon Bedrock offers a unified experience for developers of all skill sets to easily automate the extraction, transformation, and generation of relevant insights from documents, images, audio, and videos to build generative AI–powered applications. With Amazon Bedrock Data Automation, customers can fully utilize their data by extracting insights from their unstructured multimodal content in a format compatible with their applications. Amazon Bedrock Data Automation’s managed experience, ease of use, and customization capabilities help customers deliver business value faster, eliminating the need to spend time and effort orchestrating multiple models, engineering prompts, or stitching together outputs.

In this post, we demonstrate how to use Amazon Bedrock Data Automation in the AWS Management Console and the AWS SDK for Python (Boto3) for media analysis and intelligent document processing (IDP) workflows.

Amazon Bedrock Data Automation overview

You can use Amazon Bedrock Data Automation to generate standard outputs and custom outputs. Standard outputs are modality-specific default insights, such as video summaries that capture key moments, visual and audible toxic content, explanations of document charts, graph figure data, and more. Custom outputs use customer-defined blueprints that specify output requirements using natural language or a schema editor. The blueprint includes a list of fields to extract, data format for each field, and other instructions, such as data transformations and normalizations. This gives customers full control of the output, making it easy to integrate Amazon Bedrock Data Automation into existing applications.

Using Amazon Bedrock Data Automation, you can build powerful generative AI applications and automate use cases such as media analysis and IDP. Amazon Bedrock Data Automation is also integrated with Amazon Bedrock Knowledge Bases, making it easier for developers to generate meaningful information from their unstructured multimodal content to provide more relevant responses for Retrieval Augmented Generation (RAG).

Customers can get started with standard outputs for all four modalities (documents, images, videos, and audio) and with custom outputs for documents and images. Custom outputs for video and audio will be supported when the capability is generally available.

Amazon Bedrock Data Automation for images, audio, and video

To take a media analysis example, suppose that customers in the media and entertainment industry are looking to monetize long-form content, such as TV shows and movies, through contextual ad placement. To deliver the right ads at the right video moments, you need to derive meaningful insights from both the ads and the video content. Amazon Bedrock Data Automation enables your contextual ad placement application by generating these insights. For instance, you can extract valuable information such as video summaries, scene-level summaries, content moderation concepts, and scene classifications based on the Interactive Advertising Bureau (IAB) taxonomy.

To get started with deriving insights using Amazon Bedrock Data Automation, create a project where you specify your output configuration using the AWS Management Console, the AWS Command Line Interface (AWS CLI), or the API.

To create a project on the Amazon Bedrock console, follow these steps:

  1. Expand the Data Automation dropdown menu in the navigation pane and select Projects, as shown in the following screenshot.
  2. From the Projects console, create a new project and provide a project name, as shown in the following screenshot.
  3. From within the project, choose Edit, as shown in the following screenshot, to specify or modify an output configuration. Standard output is the default way of interacting with Amazon Bedrock Data Automation, and it can be used with audio, documents, images, and videos. You can have one standard output configuration per data type for each project.
  4. For customers who want to analyze images and videos for media analysis, standard output can be used to generate insights such as image summary, video scene summary, and scene classifications with IAB taxonomy. You can select the image summarization, video scene summarization, and IAB taxonomy checkboxes from the Standard output tab and then choose Save changes to finish configuring your project, as shown in the following screenshot.
  5. To test the standard output configuration using your media assets, choose Test, as shown in the following screenshot.

The next example uses the project to generate insights for a travel ad.

  1. Upload an image, then choose Generate results, as shown in the following screenshot, for Amazon Bedrock Data Automation to invoke an inference request.
  2. Amazon Bedrock Data Automation will process the uploaded file based on the project’s configuration, automatically detecting that the file is an image and then generating a summary and IAB categories for the travel ad.
  3. After you have generated insights for the ad image, you can generate video insights to determine the best video scene for effective ad placement. In the same project, upload a video file and choose Generate results, as shown in the following screenshot.

Amazon Bedrock Data Automation will detect that the file is a video and will generate insights for the video based on the standard output configuration specified in the project, as shown in the following screenshot.

These insights from Amazon Bedrock Data Automation can help you effectively place relevant ads in your video content, which can help improve content monetization.
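
As a hypothetical illustration of how you might consume these insights for contextual ad placement, the sketch below matches an ad's IAB category against per-scene categories in a parsed insights payload. The field names (`scenes`, `start_time`, `iab_categories`) are illustrative assumptions, not the exact Amazon Bedrock Data Automation output schema:

```python
# Sketch: find video scenes whose IAB categories match an ad's category.
# The 'scenes', 'start_time', and 'iab_categories' fields are illustrative
# assumptions, not the exact Amazon Bedrock Data Automation output schema.

def matching_scenes(video_insights, ad_iab_category):
    """Return start times (seconds) of scenes sharing the ad's IAB category."""
    return [
        scene["start_time"]
        for scene in video_insights["scenes"]
        if ad_iab_category in scene.get("iab_categories", [])
    ]

sample_insights = {
    "scenes": [
        {"start_time": 0, "iab_categories": ["Travel"]},
        {"start_time": 120, "iab_categories": ["Food & Drink"]},
        {"start_time": 300, "iab_categories": ["Travel", "Shopping"]},
    ]
}

print(matching_scenes(sample_insights, "Travel"))  # [0, 300]
```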

Intelligent document processing with Amazon Bedrock Data Automation

You can use Amazon Bedrock Data Automation to automate IDP workflows at scale, without needing to orchestrate complex document processing tasks such as classification, extraction, normalization, or validation.

To take a mortgage example, a lender wants to automate the processing of a mortgage lending packet to streamline their IDP pipeline and improve the accuracy of loan processing. Amazon Bedrock Data Automation simplifies the automation of complex IDP tasks such as document splitting, classification, data extraction, output format normalization, and data validation. Amazon Bedrock Data Automation also incorporates confidence scores and visual grounding of the output data to mitigate hallucinations and help improve result reliability.

For example, you can generate custom output by defining blueprints, which specify output requirements using natural language or a schema editor, to process multiple file types in a single, streamlined API. Blueprints can be created using the console or the API, and you can use a catalog blueprint or create a custom blueprint for documents and images.

For all modalities, this workflow consists of three main steps: creating a project, invoking the analysis, and retrieving the results.

The following solution walks you through a simplified mortgage lending process with Amazon Bedrock Data Automation using the AWS SDK for Python (Boto3), which is straightforward to integrate into an existing IDP workflow.

Prerequisites

Before you invoke the Amazon Bedrock API, make sure you have the following:

Create custom blueprint

In this example, you have the lending packet, as shown in the following image, which contains three documents: a pay stub, a W-2 form, and a driver’s license.

Amazon Bedrock Data Automation has sample blueprints for these three documents that define commonly extracted fields. However, you can also customize Amazon Bedrock Data Automation to extract specific fields from each document. For example, you can extract only the gross pay and net pay from the pay stub by creating a custom blueprint.

To create a custom blueprint using the API, you can use the CreateBlueprint operation using the Amazon Bedrock Data Automation Client. The following example shows the gross pay and net pay being defined as properties passed to CreateBlueprint, to be extracted from the lending packet:

import json

import boto3

# Control-plane client for Amazon Bedrock Data Automation
bedrock_data_automation_client = boto3.client('bedrock-data-automation')

bda_create_blueprint_response = bedrock_data_automation_client.create_blueprint(
    blueprintName='CUSTOM_PAYSLIP_BLUEPRINT',
    type='DOCUMENT',
    blueprintStage='LIVE',
    schema=json.dumps({
        '$schema': 'http://json-schema.org/draft-07/schema#',
        'description': 'default',
        'documentClass': 'default',
        'type': 'object',
        'properties': {
            'gross_pay_this_period': {
                'type': 'number',
                'inferenceType': 'extractive',
                'description': 'The gross pay for this pay period from the Earnings table'
            },
            'net_pay': {
                'type': 'number',
                'inferenceType': 'extractive',
                'description': 'The net pay for this pay period from the bottom of the document'
            }
        }
    }),
)

The CreateBlueprint response returns the blueprintARN for the pay stub’s custom blueprint:

'blueprintArn: arn:aws:bedrock:us-west-2:<AWS_ACCOUNT_ID>:blueprint/<BLUEPRINT_ID>'

Configure Amazon Bedrock Data Automation project

To begin processing files using blueprints with Amazon Bedrock Data Automation, you first need to create a data automation project. A single project can apply multiple document blueprints, so you can process different types of documents in the same project, each with its own custom extraction logic. To process a multiple-page document containing different file types, configure the project with a blueprint for each file type.

When using the API to create a project, you invoke the CreateDataAutomationProject operation. The following is an example of how you can configure custom output using the custom blueprint for the pay stub and the sample blueprints for the W-2 and driver’s license:

bda_bedrock_automation_create_project_response = bedrock_data_automation_client.create_data_automation_project(
    projectName='TEST_PROJECT',
    projectDescription='test BDA project',
    projectStage='LIVE',
    standardOutputConfiguration={
        'document': {
            'outputFormat': {
                'textFormat': {
                    'types': ['PLAIN_TEXT']
                },
                'additionalFileFormat': {
                    'state': 'ENABLED',
                }
            }
        },
    },
    customOutputConfiguration={
        'blueprints': [
          {
              'blueprintArn': 'arn:aws:bedrock:us-west-2:<AWS_ACCOUNT_ID>:blueprint/<BLUEPRINT_ID>'
          },
          {
              'blueprintArn': 'arn:aws:bedrock:us-west-2:aws:blueprint/bedrock-data-automation-public-w2-form'
          },
          {
              'blueprintArn': 'arn:aws:bedrock:us-west-2:aws:blueprint/bedrock-data-automation-public-us-driver-license'
          },
        ],
    },
    overrideConfiguration={
        'document': {
            'splitter': {
                'state': 'ENABLED'
            }
        }
    },
)

The CreateProject response returns the projectARN for the project:

'arn:aws:bedrock:us-west-2:<AWS_ACCOUNT_ID>:data-automation-project/<PROJECT_ID>'

To process different types of documents using multiple document blueprints in a single project, Amazon Bedrock Data Automation uses a splitter configuration, which must be enabled through the API. The following is the override configuration for the splitter; refer to the Boto3 documentation for more information:

overrideConfiguration={
    'document': {
        'splitter': {
            'state': 'ENABLED' | 'DISABLED'
        }
    }
},


Test the solution

Now that the blueprint and project setup is complete, the InvokeDataAutomationAsync operation from the Amazon Bedrock Data Automation runtime can be used to start processing files. This API call initiates the asynchronous processing of files in an S3 bucket (in this case, the lending packet) using the configuration defined in the project by passing the project’s ARN:

bda_invoke_data_automation_async_response = bedrock_data_automation_runtime_client.invoke_data_automation_async(
    inputConfiguration={'s3Uri': '<S3_URI>'},
    outputConfiguration={'s3Uri': '<S3_URI>'},
    dataAutomationConfiguration={
        'dataAutomationArn': 'arn:aws:bedrock:us-west-2:<AWS_ACCOUNT_ID>:data-automation-project/<PROJECT_ID>',
        'stage': 'LIVE'
    }
)

InvokeDataAutomationAsync returns the invocationARN:

'arn:aws:bedrock:us-west-2:<AWS_ACCOUNT_ID>:data-automation-invocation/<INVOCATION_ID>'

GetDataAutomationStatus can be used to view the status of the invocation, using the InvocationARN from the previous response:

bda_get_data_automation_status_response = bedrock_data_automation_runtime_client.get_data_automation_status(
    invocationArn='arn:aws:bedrock:us-west-2:<AWS_ACCOUNT_ID>:data-automation-invocation/<INVOCATION_ID>'
)
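
You typically call GetDataAutomationStatus in a loop until the invocation finishes. The sketch below is a minimal illustration; the set of terminal status values ('Success', 'ServiceError', 'ClientError') is an assumption to verify against the Boto3 reference:

```python
import time

# Sketch: poll GetDataAutomationStatus until the invocation finishes.
# The terminal status values below are assumptions; confirm them against
# the Boto3 reference for bedrock-data-automation-runtime.
TERMINAL_STATUSES = {"Success", "ServiceError", "ClientError"}

def is_terminal(status: str) -> bool:
    """Return True when an invocation has finished, successfully or not."""
    return status in TERMINAL_STATUSES

def wait_for_invocation(runtime_client, invocation_arn, delay_seconds=10):
    """Poll the runtime client until the invocation reaches a terminal state."""
    while True:
        response = runtime_client.get_data_automation_status(
            invocationArn=invocation_arn
        )
        if is_terminal(response["status"]):
            return response
        time.sleep(delay_seconds)
```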

When the job is complete, view the results in the S3 bucket used in the outputConfiguration by navigating to the ~/JOB_ID/0/custom_output/ folder.

From the following sample output, Amazon Bedrock Data Automation associated the pay stub file with the custom pay stub blueprint with a high level of confidence:

'matched_blueprint': {
    'arn': '<BLUEPRINT_ARN>', 'name': 'CUSTOM_PAYSLIP_BLUEPRINT', 'confidence': 0.99959725
}

Using the matched blueprint, Amazon Bedrock Data Automation was able to accurately extract each field defined in the blueprint:

'inference_result': {
    'net_pay': 291.9, 'gross_pay_this_period': 452.43
}

Additionally, Amazon Bedrock Data Automation returns confidence intervals and bounding box information for each field:

'explainability_info': [{
    'net_pay': {'success': true, 'confidence': 0.96484375, 'geometry': [{'boundingBox': ...
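
Because every extracted field carries a confidence score, you can gate downstream automation on it. The sketch below mirrors the sample payload above; the 0.9 threshold and the idea of routing low-confidence fields to human review are illustrative assumptions:

```python
# Sketch: flag low-confidence extractions for human review.
# The payload shape mirrors the sample explainability output above;
# the 0.9 threshold is an arbitrary illustration, not a recommended value.

def fields_needing_review(explainability_info, threshold=0.9):
    """Return the names of fields whose confidence falls below the threshold."""
    flagged = []
    for page in explainability_info:
        for field_name, details in page.items():
            if details.get("confidence", 0.0) < threshold:
                flagged.append(field_name)
    return flagged

sample = [{
    "net_pay": {"success": True, "confidence": 0.96484375},
    "gross_pay_this_period": {"success": True, "confidence": 0.71},
}]

print(fields_needing_review(sample))  # ['gross_pay_this_period']
```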

This example demonstrates how customers can use Amazon Bedrock Data Automation to streamline and automate an IDP workflow. By automating complex document processing tasks such as data extraction, normalization, and validation, Amazon Bedrock Data Automation helps reduce operational complexity and improve processing efficiency, so you can handle higher loan processing volumes, minimize errors, and drive operational excellence.

Cleanup

When you’re finished evaluating this feature, delete the S3 bucket and any objects to avoid any further charges.

Summary

Customers can get started with Amazon Bedrock Data Automation, which is available in public preview in the US West (Oregon) AWS Region. Learn more about Amazon Bedrock Data Automation and how to automate the generation of accurate information from unstructured content for building generative AI–based applications.


About the authors

Ian Lodge is a Solutions Architect at AWS, helping ISV customers in solving their architectural, operational, and cost optimization challenges. Outside of work he enjoys spending time with his family, ice hockey and woodworking.

Alex Pieri is a Solutions Architect at AWS that works with retail customers to plan, build, and optimize their AWS cloud environments. He specializes in helping customers build enterprise-ready generative AI solutions on AWS.

Raj Pathak is a Principal Solutions Architect and Technical advisor to Fortune 50 and Mid-Sized FSI (Banking, Insurance, Capital Markets) customers across Canada and the United States. Raj specializes in Machine Learning with applications in Generative AI, Natural Language Processing, Intelligent Document Processing, and MLOps.


How TUI uses Amazon Bedrock to scale content creation and enhance hotel descriptions in under 10 seconds


TUI Group is one of the world’s leading global tourism services, providing 21 million customers with an unmatched holiday experience in 180 regions. TUI Group covers the end-to-end tourism chain with over 400 owned hotels, 16 cruise ships, 1,200 travel agencies, and 5 airlines covering all major holiday destinations around the globe. At TUI, crafting high-quality content is a crucial component of its promotional strategy.

The TUI content teams are tasked with producing high-quality content for its websites, including product details, hotel information, and travel guides, often using descriptions written by hotel and third-party partners. This content needs to adhere to TUI’s tone of voice, which is essential to communicating the brand’s distinct personality. But as its portfolio expands with more hotels and offerings, scaling content creation has proven challenging. This presents an opportunity to augment and automate the existing content creation process using generative AI.

In this post, we discuss how we used Amazon SageMaker and Amazon Bedrock to build a content generator that rewrites marketing content following specific brand and style guidelines. Amazon Bedrock is a fully managed service that offers a choice of high-performing foundation models (FMs) from leading AI companies such as AI21 Labs, Anthropic, Cohere, Meta, Mistral AI, Stability AI, and Amazon through a single API, along with a broad set of capabilities you need to build generative AI applications with security, privacy, and responsible AI. Amazon SageMaker helps data scientists and machine learning (ML) engineers build FMs from scratch, evaluate and customize FMs with advanced techniques, and deploy FMs with fine-grain controls for generative AI use cases that have stringent requirements on accuracy, latency, and cost.

Through experimentation, we found that a two-phased approach worked best to make sure that the output aligned with TUI’s tone of voice requirements. In the first phase, we fine-tuned a smaller large language model (LLM) on a large corpus of data. In the second phase, we used a different LLM for post-processing. Through fine-tuning on static data, we generate content that mimics the TUI brand voice in ways that could not be captured through prompt engineering. Employing a second model with few-shot examples helped verify that the output adhered to specific formatting and grammatical rules. The latter uses a more dynamic dataset, which we can use to adjust the output quickly in the future for different brand requirements. Overall, this approach resulted in higher-quality content and allowed TUI to improve content quality at a higher velocity.

Solution overview

The architecture consists of a few key components:

  • LLMs – We evaluated different approaches and found that a two-model solution performed the best. This consists of a fine-tuned Meta Llama 2 model to generate a description for the given hotel and Anthropic’s Claude model to reformat its output. Fine-tuning and hosting of the Meta Llama 2 model were done on Amazon SageMaker, and Anthropic’s Claude 2 was consumed from Amazon Bedrock through API calls.
  • Orchestration – We created a state machine using AWS Step Functions to make calls in a batch format to the two LLMs and fetch the search engine optimization (SEO) score for the generated content from a third-party API. If the SEO content score is above a defined threshold (80%), the generated content is stored in an Amazon DynamoDB table and can later be reviewed by the content team directly in the front-end UI. Through this process, we maintain and monitor content quality at scale.
  • Human in the loop feedback – We developed a custom React front-end application to gather feedback from the content team to facilitate continuous improvement and future model fine-tuning. You can use the feedback to fine-tune a base model on SageMaker using reinforcement learning from human feedback (RLHF) to improve performance.
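
The SEO quality gate in this workflow reduces to a simple routing decision: content scoring above the 80% threshold is stored for review, and anything else is regenerated. The following sketch is illustrative; the function name and percentage scale are assumptions:

```python
# Sketch of the Step Functions quality gate described above.
# Scores are assumed to be percentages; 80 matches the threshold in the post.
SEO_THRESHOLD = 80

def route_content(seo_score: float) -> str:
    """Route generated content based on its SEO score."""
    if seo_score > SEO_THRESHOLD:
        return "store"       # persist to DynamoDB for content-team review
    return "regenerate"      # send back through the LLM chain

print(route_content(86))  # store
print(route_content(42))  # regenerate
```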

The following diagram is the high-level architecture of the solution.

Architecture Diagram

Prompt engineering

We started by crafting prompts that include the most common issues identified by the TUI content team, including tone of voice, accuracy, length, and grammar. We provided the LLM with a few examples of curated content within the context window. Although the generated output followed the guidance, the writing style didn’t meet TUI’s tone of voice requirements.

Example prompt:

You are an experienced British copywriter for TUI. TUI is a world-leading travel company. You are an expert in generating hotel descriptions, based on TUI’s tone of voice. TUI's tone of voice can be described as upbeat, enthusiastic, and promotional. Avoid all the words in the following list: {banned words}
Write at most 100 words.
Your hotel descriptions must follow TUI's tone of voice and apply SEO guidelines.
These are some good examples. You should mimic below.
{examples}.
Human: {input}
Assistant:
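
The placeholders in the prompt above ({banned words}, {examples}, {input}) can be filled programmatically. The sketch below is a hypothetical illustration with placeholder values; the placeholder names are adjusted to valid Python identifiers:

```python
# Sketch: fill the copywriting prompt template shown above.
# The banned words and example descriptions here are placeholders, and the
# template placeholders are renamed to valid str.format() field names.
PROMPT_TEMPLATE = (
    "You are an experienced British copywriter for TUI. TUI is a world-leading "
    "travel company. You are an expert in generating hotel descriptions, based "
    "on TUI's tone of voice. TUI's tone of voice can be described as upbeat, "
    "enthusiastic, and promotional. Avoid all the words in the following list: "
    "{banned_words}\n"
    "Write at most 100 words.\n"
    "Your hotel descriptions must follow TUI's tone of voice and apply SEO "
    "guidelines.\n"
    "These are some good examples. You should mimic below.\n"
    "{examples}.\n"
    "Human: {hotel_text}\n"
    "Assistant:"
)

def build_prompt(banned_words, examples, hotel_text):
    """Assemble the final prompt from the template and its inputs."""
    return PROMPT_TEMPLATE.format(
        banned_words=", ".join(banned_words),
        examples="\n".join(examples),
        hotel_text=hotel_text,
    )

prompt = build_prompt(["nice", "very"], ["Sample description"], "Rewrite this text")
print(prompt.endswith("Assistant:"))  # True
```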

Here is an example of the results we received.

Original text:

Boutique interiors and a great location combine at this small but perfectly formed hotel. It may be small but this 15-room hotel packs a punch with its interior design. The immaculate pool is set in an open atrium area. And there’s funky cocktail bar and lounge, too. The location is every bit as impressive as the styling. It’s about a 10-minute walk to the beach and less than half that to the centre of Thassos Town.

Generated text:

This petite hotel brings boutique style to Thassos Town. Its immaculate pool and funky cocktail bar create an oasis in the heart of the action. The beach is just a 10-minute walk away, so you can easily split your time between sunbathing and sightseeing. With only 15 rooms, the experience feels intimate yet worldly. The interiors marry modern design with traditional touches for a unique look. And the central location means popular attractions like the Archaeological Museum and shopping street are right on your doorstep.

To evaluate these results, we generated a dataset of 150 hotel descriptions using two different prompt versions and asked our business teams to review it. However, they flagged 98% of the output with significant issues. Firstly, Anthropic’s Claude 2 struggled to adhere to TUI’s tone of voice and branding guidelines, frequently employed abbreviations, and favored American English over British English. It also failed to follow the SEO guidelines provided and sometimes generated false information regarding hotel facilities and locations. The following image shows a list of these challenges and how the LLM handled them. Of the six challenges, the LLM met only one.

Main Challenges

Fine-tuning Llama 2 using PEFT on Amazon SageMaker JumpStart

These issues and poor feedback led us to conclude that prompt engineering alone would not adequately address the problem. As a result, we decided to pursue an alternative approach: fine-tuning a smaller large language model to rewrite the text in accordance with TUI’s tone of voice. We used a curated set of hotel descriptions written by TUI copywriters so that the model would have better alignment with our guidelines.

We selected Meta Llama 2, one of the top open source LLMs available at the time, through Amazon SageMaker JumpStart, and chose the 13B parameter version to apply parameter-efficient fine-tuning (PEFT), specifically quantized low-rank adaptation (QLoRA). This technique quantizes the pre-trained model to 4 bits and adds small low-rank adapters for fine-tuning. We fine-tuned the model on a single ml.g5.4xlarge instance in about 20 hours using a relatively small dataset of around 4,500 hotels. We also tested the Llama 2 7B and 70B models. We found that the 7B model didn’t perform well enough, and the 70B model had much higher costs without significant improvement.

We used common natural language processing (NLP) evaluation metrics, such as perplexity for evaluation and monitoring during training, and established daily feedback loops with the content team to refine the test set. The output from the fine-tuned Meta Llama 2 13B model effectively mimicked the TUI tone of voice and adhered to most SEO guidelines, even those not specified in the prompt.
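
Perplexity, used here for monitoring, is the exponential of the average per-token negative log-likelihood. A minimal computation with made-up token probabilities:

```python
import math

# Perplexity = exp(mean negative log-likelihood per token).
# The token probabilities below are made-up illustrative values.

def perplexity(token_probs):
    """Compute perplexity from the probabilities a model assigned each token."""
    nll = [-math.log(p) for p in token_probs]
    return math.exp(sum(nll) / len(nll))

print(round(perplexity([0.25, 0.25, 0.25, 0.25]), 2))  # 4.0 (uniform over 4)
print(perplexity([0.9, 0.8, 0.95]) < 4.0)              # True (more confident)
```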

For more information, refer to Fine-tune Llama 2 for text generation on Amazon SageMaker JumpStart.

Original text:

The NH München City Süd hotel, formerly known as NH München am Ring, is located southeast of Munich. The location is ideal for reaching the highway towards Salzburg and provides good access to the airport and Exhibition Center. You can also get to the city center in 20 minutes by hopping on the S-Bahn at the Karl-Preis-Platz station. The peaceful residential location offers an excellent opportunity to relax in one of the 163 rooms. These are decorated in white with red accents. Rooms feature free Wi-Fi and electronic safes that are large enough for your laptop. If you need to work, there are spacious desks in every room and apartments with seperate living rooms are available for longer term guests. The suites also come with separate living rooms and foldaway beds are available upon request. The breakfast will get you ready for a busy day with a large spread of cold cuts, cheeses, fresh pastries, and fruit. These come accompanied with freshly brewed coffee, tea, and fresh fruit juices. For lunch and dinner, you will find many restaurant options in close proximity to the hotel. There is a sauna for relaxation and you can park your car in the garage.

Generated text:

Make yourself at home in Munich at the München City Süd. This former NH München am Ring sits just southeast of the Bavarian capital, putting both the highway to Salzburg and the airport within easy reach. The S-Bahn station Karl-Preis-Platz is a quick walk away too, so you can be in the city center in just 20 minutes.

The neighborhood here is peaceful, giving you the chance to properly relax during your stay. The hotel’s 163 rooms are decorated in calming white and red hues. Free Wi-Fi and laptop-sized safes come as standard, while spacious desks make working easy. For longer visits, apartments with separate living areas are available. The suites take things up a notch with their own living spaces and foldaway beds on request.

Fuel up each morning with a vast breakfast spread of cold cuts, cheeses, pastries and fruit. All washed down with coffee, tea and fresh juices. You’ll find many dining options close by for other meals. After a busy day, unwind in the sauna or park up in the garage.

The following image shows a list of the challenges and how the LLM handled them. Of the six challenges, the LLM met four.

Improvements

Integrating Anthropic’s Claude 2 for further improvement

To further improve the output, we introduced Anthropic’s Claude 2 using Amazon Bedrock as a final refinement step. This included converting American spelling to British spelling, writing numbers one through nine in words and larger numbers in digits, correcting typos and capitalization errors, minimizing banned words, incorporating essential TUI branding words, and adding missing hotel information. We also implemented a feedback mechanism in the UI to use data for ongoing fine-tuning in production. By using Anthropic Claude 2, we make sure that the final output applies the remaining formatting rules.

Original text:

As its name suggests, Vienna Sporthotel has loads lined up for sports fans.

There are 10 tennis courts and four golf simulators here, plus, you can book a session with a coach to brush up on your skills. On top of that, there’s an outdoor pool, a well-stocked fitness room and a big sauna. To fuel up for all that exercise, you can grab a plate at the breakfast buffet.

Location-wise, this place is set up for sightseeing. Hop on a bus and you can reach the likes of The Hofburg, Belvedere Palace and Karlskirche in less than half an hour. You’re only a 10-minute stroll away from the nearest restaurants and cafes, too.

Generated text:

With a name like Vienna Sporthotel, you can expect this place to serve up plenty for fans of sport and activity. Brush up your skills on one of 10 tennis courts or in the four golf simulators, then recharge in the outdoor pool or well-equipped gym. There’s even a coach on hand to help perfect your technique.

When you’re ready to refuel, tuck into the breakfast buffet. Then get set for sightseeing – a bus ride whisks you to top Vienna attractions like The Hofburg, Belvedere Palace and Karlskirche in under 30 minutes. You’re also just a short stroll from local eateries and coffee shops.

The following image shows a list of the challenges and how the LLM handled them. The LLM met all six challenges.

Success in Target Outcomes

Key outcomes

The final architecture consists of a fine-tuned Meta Llama 2 13B model and Anthropic’s Claude 2, using the strengths of each model. In a blind test, these dynamically generated hotel descriptions were rated higher than those written by humans in 75% of a sample of 50 hotels. We also integrated a third-party API to calculate SEO scores for the generated content, and we observed up to a 4% uplift in SEO scores compared to human-written descriptions. Most significantly, the content generation process is now five times faster, enhancing our team’s productivity without compromising quality or consistency. We can generate a vast number of hotel descriptions in just a few hours, a task that previously took months.

Takeaways

Moving forward, we plan to explore how this technology can address current inefficiencies and quality gaps, especially for hotels that our team hasn’t had the capacity to curate. We plan to expand this solution to more brands and regions within the TUI portfolio, including producing content in various languages and tailoring it to meet the specific needs of different audiences.

Throughout this project, we learned a few valuable lessons:

  • Few-shot prompting is cost-effective and sufficient when you have limited examples and specific guidelines for responses. Fine-tuning can help significantly improve model performance when you need to tailor content to match a brand’s tone of voice, but can be resource intensive and is based on static data sources that can get outdated.
  • Fine-tuning the Llama 70B model was much more expensive than Llama 13B and did not result in significant improvement.
  • Incorporating human feedback and maintaining a human-in-the-loop approach is essential for protecting brand integrity and continuously improving the solution. The collaboration between TUI engineering, content, and SEO teams was crucial to the success of this project.

Although Meta Llama 2 and Anthropic’s Claude 2 were the latest state-of-the-art models available at the time of our experiment, since then we have seen the launch of Meta Llama 3 and Anthropic’s Claude 3.5, which we expect can significantly improve the quality of our outputs. Amazon Bedrock also now supports fine-tuning for Meta Llama 2, Cohere Command Light, and Amazon Titan models, making it simpler and faster to test models without managing infrastructure.


About the Authors

Nikolaos Zavitsanos is a Data Scientist at TUI, specialized in developing customer-facing Generative AI applications using AWS services. With a strong background in Computer Science and Artificial Intelligence, he leverages advanced technologies to enhance user experiences and drive innovation. Outside of work, Nikolaos plays water polo and competes at a national level. Connect with Nikolaos on LinkedIn.

Hin Yee Liu is a Senior Prototyping Engagement Manager at Amazon Web Services. She helps AWS customers to bring their big ideas to life and accelerate the adoption of emerging technologies. Hin Yee works closely with customer stakeholders to identify, shape and deliver impactful use cases leveraging Generative AI, AI/ML, Big Data, and Serverless technologies using agile methodologies. In her free time, she enjoys knitting, travelling and strength training. Connect with Hin Yee on LinkedIn.


Llama 3.3 70B now available in Amazon SageMaker JumpStart


Today, we are excited to announce that Llama 3.3 70B from Meta is available in Amazon SageMaker JumpStart. Llama 3.3 70B marks an exciting advancement in large language model (LLM) development, offering comparable performance to larger Llama versions with fewer computational resources.

In this post, we explore how to deploy this model efficiently on Amazon SageMaker AI, using advanced SageMaker AI features for optimal performance and cost management.

Overview of the Llama 3.3 70B model

Llama 3.3 70B represents a significant breakthrough in model efficiency and performance optimization. This new model delivers output quality comparable to Llama 3.1 405B while requiring only a fraction of the computational resources. According to Meta, this efficiency gain translates to nearly five times more cost-effective inference operations, making it an attractive option for production deployments.

The model’s sophisticated architecture builds upon Meta’s optimized version of the transformer design, featuring an enhanced attention mechanism that can help substantially reduce inference costs. During its development, Meta’s engineering team trained the model on an extensive dataset comprising approximately 15 trillion tokens, incorporating both web-sourced content and over 25 million synthetic examples specifically created for LLM development. This comprehensive training approach results in the model’s robust understanding and generation capabilities across diverse tasks.

What sets Llama 3.3 70B apart is its refined training methodology. The model underwent an extensive supervised fine-tuning process, complemented by Reinforcement Learning from Human Feedback (RLHF). This dual-approach training strategy helps align the model’s outputs more closely with human preferences while maintaining high performance standards. In benchmark evaluations against its larger counterpart, Llama 3.3 70B demonstrated remarkable consistency, trailing Llama 3.1 405B by less than 2% in 6 out of 10 standard AI benchmarks and actually outperforming it in three categories. This performance profile makes it an ideal candidate for organizations seeking to balance model capabilities with operational efficiency.

The following figure summarizes the benchmark results (source).

Getting started with SageMaker JumpStart

SageMaker JumpStart is a machine learning (ML) hub that can help accelerate your ML journey. With SageMaker JumpStart, you can evaluate, compare, and select pre-trained foundation models (FMs), including Llama 3 models. These models are fully customizable for your use case with your data, and you can deploy them into production using either the UI or SDK.

Deploying Llama 3.3 70B through SageMaker JumpStart offers two convenient approaches: using the intuitive SageMaker JumpStart UI or implementing programmatically through the SageMaker Python SDK. Let’s explore both methods to help you choose the approach that best suits your needs.

Deploy Llama 3.3 70B through the SageMaker JumpStart UI

You can access the SageMaker JumpStart UI through either Amazon SageMaker Unified Studio or Amazon SageMaker Studio. To deploy Llama 3.3 70B using the SageMaker JumpStart UI, complete the following steps:

  1. In SageMaker Unified Studio, on the Build menu, choose JumpStart models. Alternatively, on the SageMaker Studio console, choose JumpStart in the navigation pane.
  2. Search for Meta Llama 3.3 70B.
  3. Choose the Meta Llama 3.3 70B model.
  4. Choose Deploy.
  5. Accept the end-user license agreement (EULA).
  6. For Instance type, choose an instance (ml.g5.48xlarge or ml.p4d.24xlarge).
  7. Choose Deploy.

Wait until the endpoint status shows as InService. You can now run inference using the model.
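After the endpoint is InService, your application can call it with the AWS SDK. The following sketch assumes a hypothetical endpoint name (substitute the name shown in the console) and the TGI-style request body that JumpStart text-generation endpoints expect:

```python
import json

# Hypothetical endpoint name; replace with the name shown in the SageMaker console.
ENDPOINT_NAME = "jumpstart-dft-meta-textgeneration-llama-3-3-70b-instruct"

def build_payload(prompt, max_new_tokens=128, temperature=0.6, top_p=0.9):
    """Build the TGI-style request body for a JumpStart text-generation endpoint."""
    return {
        "inputs": prompt,
        "parameters": {
            "max_new_tokens": max_new_tokens,
            "temperature": temperature,
            "top_p": top_p,
        },
    }

def invoke(prompt):
    """Send a prompt to the deployed endpoint and parse the JSON response."""
    import boto3  # imported lazily so the sketch loads without AWS credentials
    runtime = boto3.client("sagemaker-runtime")
    response = runtime.invoke_endpoint(
        EndpointName=ENDPOINT_NAME,
        ContentType="application/json",
        Body=json.dumps(build_payload(prompt)),
    )
    return json.loads(response["Body"].read())
```

The generation parameters mirror the sample request used later in this post; tune them for your workload.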

Deploy Llama 3.3 70B using the SageMaker Python SDK

For teams looking to automate deployment or integrate with existing MLOps pipelines, you can use the following code to deploy the model using the SageMaker Python SDK:

from sagemaker.serve.builder.model_builder import ModelBuilder
from sagemaker.serve.builder.schema_builder import SchemaBuilder
from sagemaker.jumpstart.model import ModelAccessConfig
from sagemaker.session import Session
import logging

sagemaker_session = Session()

artifacts_bucket_name = sagemaker_session.default_bucket()
execution_role_arn = sagemaker_session.get_caller_identity_arn()

# JumpStart model ID for Llama 3.3 70B Instruct
js_model_id = "meta-textgeneration-llama-3-3-70b-instruct"

gpu_instance_type = "ml.p4d.24xlarge"

# Sample request and response pair; SchemaBuilder uses these to infer
# the endpoint's serialization and deserialization format
response = "Hello, I'm a language model, and I'm here to help you with your English."

sample_input = {
    "inputs": "Hello, I'm a language model,",
    "parameters": {"max_new_tokens": 128, "top_p": 0.9, "temperature": 0.6},
}

sample_output = [{"generated_text": response}]

schema_builder = SchemaBuilder(sample_input, sample_output)

model_builder = ModelBuilder(
    model=js_model_id,
    schema_builder=schema_builder,
    sagemaker_session=sagemaker_session,
    role_arn=execution_role_arn,
    log_level=logging.ERROR,
)

model = model_builder.build()

# Deployment requires accepting the Llama EULA
predictor = model.deploy(
    model_access_configs={js_model_id: ModelAccessConfig(accept_eula=True)},
    accept_eula=True,
)
predictor.predict(sample_input)

Set up auto scaling and scale down to zero

You can optionally set up auto scaling to scale down to zero after deployment. For more information, refer to Unlock cost savings with the new scale down to zero feature in SageMaker Inference.
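As a sketch of what that setup can look like with Application Auto Scaling, the snippet below registers a scalable target whose minimum capacity is zero. It assumes an endpoint that uses inference components, which the scale-to-zero feature requires; the component name is a placeholder:

```python
def scale_to_zero_config(inference_component_name, max_copies=2):
    """Application Auto Scaling target allowing an inference component
    to scale between 0 and max_copies copies."""
    return {
        "ServiceNamespace": "sagemaker",
        "ResourceId": f"inference-component/{inference_component_name}",
        "ScalableDimension": "sagemaker:inference-component:DesiredCopyCount",
        "MinCapacity": 0,  # allows scaling down to zero during inactivity
        "MaxCapacity": max_copies,
    }

def register_scale_to_zero(inference_component_name):
    """Register the scalable target with Application Auto Scaling."""
    import boto3  # imported lazily so the sketch loads without AWS credentials
    client = boto3.client("application-autoscaling")
    client.register_scalable_target(**scale_to_zero_config(inference_component_name))
```

You would still attach a scaling policy (for example, on concurrent requests) to drive scale-up when traffic returns; see the linked post for the full walkthrough.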

Optimize deployment with SageMaker AI

SageMaker AI simplifies the deployment of sophisticated models like Llama 3.3 70B, offering a range of features designed to optimize both performance and cost efficiency. With the advanced capabilities of SageMaker AI, organizations can deploy and manage LLMs in production environments, taking full advantage of Llama 3.3 70B’s efficiency while benefiting from the streamlined deployment process and optimization tools of SageMaker AI. Default deployment through SageMaker JumpStart uses accelerated deployment, which uses speculative decoding to improve throughput. For more information on how speculative decoding works with SageMaker AI, see Amazon SageMaker launches the updated inference optimization toolkit for generative AI.

First, Fast Model Loader revolutionizes the model initialization process by implementing an innovative weight streaming mechanism. This feature fundamentally changes how model weights are loaded onto accelerators, dramatically reducing the time required to get the model ready for inference. Instead of the traditional approach of loading the entire model into memory before beginning operations, Fast Model Loader streams weights directly from Amazon Simple Storage Service (Amazon S3) to the accelerator, enabling faster startup and scaling times.

One SageMaker inference capability is Container Caching, which transforms how model containers are managed during scaling operations. This feature eliminates one of the major bottlenecks in deployment scaling by pre-caching container images, removing the need for time-consuming downloads when adding new instances. For large models like Llama 3.3 70B, where container images can be substantial in size, this optimization significantly reduces scaling latency and improves overall system responsiveness.

Another key capability is Scale to Zero. It introduces intelligent resource management that automatically adjusts compute capacity based on actual usage patterns. This feature represents a paradigm shift in cost optimization for model deployments, allowing endpoints to scale down completely during periods of inactivity while maintaining the ability to scale up quickly when demand returns. This capability is particularly valuable for organizations running multiple models or dealing with variable workload patterns.

Together, these features create a powerful deployment environment that maximizes the benefits of Llama 3.3 70B’s efficient architecture while providing robust tools for managing operational costs and performance.

Conclusion

The combination of Llama 3.3 70B with the advanced inference features of SageMaker AI provides an optimal solution for production deployments. By using Fast Model Loader, Container Caching, and Scale to Zero capabilities, organizations can achieve both high performance and cost-efficiency in their LLM deployments.

We encourage you to try this implementation and share your experiences.


About the authors

Marc Karp is an ML Architect with the Amazon SageMaker Service team. He focuses on helping customers design, deploy, and manage ML workloads at scale. In his spare time, he enjoys traveling and exploring new places.

Saurabh Trikande is a Senior Product Manager for Amazon Bedrock and SageMaker Inference. He is passionate about working with customers and partners, motivated by the goal of democratizing AI. He focuses on core challenges related to deploying complex AI applications, inference with multi-tenant models, cost optimizations, and making the deployment of Generative AI models more accessible. In his spare time, Saurabh enjoys hiking, learning about innovative technologies, following TechCrunch, and spending time with his family.

Melanie Li, PhD, is a Senior Generative AI Specialist Solutions Architect at AWS based in Sydney, Australia, where her focus is on working with customers to build solutions leveraging state-of-the-art AI and machine learning tools. She has been actively involved in multiple Generative AI initiatives across APJ, harnessing the power of Large Language Models (LLMs). Prior to joining AWS, Dr. Li held data science roles in the financial and retail industries.

Adriana Simmons is a Senior Product Marketing Manager at AWS.

Lokeshwaran Ravi is a Senior Deep Learning Compiler Engineer at AWS, specializing in ML optimization, model acceleration, and AI security. He focuses on enhancing efficiency, reducing costs, and building secure ecosystems to democratize AI technologies, making cutting-edge ML accessible and impactful across industries.

Yotam Moss is a Software Development Manager for Inference at AWS AI.

AWS re:Invent 2024 Highlights: Top takeaways from Swami Sivasubramanian to help customers manage generative AI at scale

We spoke with Dr. Swami Sivasubramanian, Vice President of Data and AI, shortly after AWS re:Invent 2024 to hear his impressions—and to get insights on how the latest AWS innovations help meet the real-world needs of customers as they build and scale transformative generative AI applications.

Q: What made this re:Invent different?

Swami Sivasubramanian: The theme I spoke about in my re:Invent keynote was simple but powerful—convergence. I believe that we’re at an inflection point unlike any other in the evolution of AI. We’re seeing a remarkable convergence of data, analytics, and generative AI. It’s a combination that enables next-level generative AI applications that are far more capable. And it lets our customers move faster in a really significant way, getting more value, more quickly. Companies like Rocket Mortgage are building on an AI-driven platform powered by Amazon Bedrock to create AI agents and automate tasks—working to give their employees access to generative AI with no-code tools. Canva uses AWS to power 1.2 million requests a day and sees 450 new designs created every second. There’s also a human side to convergence, as people across organizations are working together in new ways, requiring a deeper level of collaboration between groups, like science and engineering teams. And this isn’t just a one-time collaboration. It’s an ongoing process.

People’s expectations for applications and customer experiences are changing again with generative AI. Increasingly, I think generative AI inference is going to be a core building block for every application. To realize this future, organizations need more than just a chatbot or a single powerful large language model (LLM). At re:Invent, we made some exciting announcements about the future of generative AI, of course. But we also launched a remarkable portfolio of new products, capabilities, and features that will help our customers manage generative AI at scale—making it easier to control costs, build trust, increase productivity, and deliver ROI.

Q: Are there key innovations that build on the experience and lessons learned at Amazon in adopting generative AI? How are you bringing those capabilities to your customers?

Swami Sivasubramanian: Yes. We announced Amazon Nova, a new generation of foundation models (FMs) with state-of-the-art intelligence across a wide range of tasks and industry-leading price performance. Amazon Nova models expand the growing selection of the broadest and most capable FMs in Amazon Bedrock for enterprise customers. Amazon Nova Micro, Lite, and Pro demonstrate exceptional intelligence and speed—and perform quite competitively against the best models in their respective categories. Amazon Nova Canvas, our state-of-the-art image generation model, creates professional-grade images from text and image inputs, democratizing access to production-grade visual content for advertising, training, social media, and more. Finally, Amazon Nova Reel offers state-of-the-art video generation that allows customers to create high-quality video from text or images. With about 1,000 generative AI applications in motion inside Amazon, groups like Amazon Ads are using Amazon Nova to remove barriers for sellers and advertisers, enabling new levels of creativity and innovation. New capabilities like image and video generation are helping Amazon Ads customers promote more products in their catalogs, and experiment with new strategies like keyword-level creative to increase engagement and drive sales.

But there’s more ahead, and here’s where an important shift is happening. We’re working on an even more capable any-to-any model where you can provide text, images, audio, and video as input and the model can generate outputs in any of these modalities. And we think this multi-modal approach is how models are going to evolve, moving ahead where one model can accept any kind of input and generate any kind of output. Over time, I think this is what state-of-the-art models will look like.

Q: Speaking of announcements like Amazon Nova, you’ve been a key innovator in AI for many years. What continues to inspire you?

Swami Sivasubramanian: It’s fascinating to think about what LLMs are capable of. What inspires me most though is how can we help our customers unblock the challenges they are facing and realize that potential. Consider hallucinations. As highly capable as today’s models are, they still have a tendency to get things wrong occasionally. It’s a challenge that many of our customers struggle with when integrating generative AI into their businesses and moving to production. We explored the problem and asked ourselves if we could do more to help. We looked inward, and leveraged Automated Reasoning, an innovation that Amazon has been using as a behind-the-scenes technology in many of our services like identity and access management.

I like to think of this situation as yin and yang. Automated Reasoning is all about certainty and being able to mathematically prove that something is correct. Generative AI is all about creativity and open-ended responses. Though they might seem like opposites, they’re actually complementary—with Automated Reasoning completing and strengthening generative AI. We’ve found that Automated Reasoning works really well when you have a huge surface area of a problem, a corpus of knowledge about that problem area, and when it’s critical that you get the correct answer—which makes Automated Reasoning a good fit for addressing hallucinations.

At re:Invent, we announced Amazon Bedrock Guardrails Automated Reasoning checks—the first and only generative AI safeguard that helps prevent factual errors due to hallucinations, using logically accurate and verifiable reasoning that explains why generative AI responses are correct. I think it's an innovation that will have significant impact across organizations and industries, helping build trust and accelerate generative AI adoption.

Q: Controlling costs is important to all organizations, large and small, particularly as they take generative AI applications into production. How do the announcements at re:Invent answer this need?

Swami Sivasubramanian: Like our customers, here at Amazon we’re increasing our investment in generative AI development, with multiple projects in process—all requiring timely access to accelerated compute resources. But allocating optimal compute capacity to each project can create a supply/demand challenge. To address this challenge, we created an internal service that helped Amazon drive utilization of compute resources to more than 90% across all our projects. This service enabled us to smooth out demand across projects and achieve higher capacity utilization, speeding development.

As with Automated Reasoning, we realized that our customers would also benefit from these capabilities. So, at re:Invent, I announced the new task governance capability in Amazon SageMaker HyperPod, which helps our customers optimize compute resource utilization and reduce time to market by up to 40%. With this capability, users can dynamically run tasks across the end-to-end FM workflow—accelerating time to market for AI innovations while avoiding cost overruns due to underutilized compute resources.

Our customers also tell me that the trade-off between cost and accuracy for models is real. We're answering this need by making it super-easy to evaluate models on Amazon Bedrock, so they don't have to spend months researching and making comparisons. We're also lowering costs with game-changing capabilities such as Amazon Bedrock Model Distillation, which transfers knowledge from larger models to smaller, more cost-effective ones; Amazon Bedrock Intelligent Prompt Routing, which manages prompts more efficiently, at scale; and prompt caching, which reduces repeated processing without compromising on accuracy.

Q: Higher productivity is one of the core promises of generative AI. How is AWS helping employees at all levels be more productive?

Swami Sivasubramanian: I like to point out that using generative AI becomes irresistible when it makes employees 10 times more productive. In short, not an incremental increase, but a major leap in productivity. And we’re helping employees get there. For example, Amazon Q Developer is transforming code development by taking care of the time-consuming chores that developers don’t want to deal with, like software upgrades. And it also helps them move much faster by automating code reviews and dealing with mainframe modernization. Consider Novacomp, a leading IT company in Latin America, which leveraged Amazon Q Developer to upgrade a project with over 10,000 lines of Java code in just 50 minutes, a task that would have typically taken an estimated 3 weeks. The company also simplified everyday tasks for developers, reducing its technical debt by 60% on average.

On the business side, Amazon Q Business is bridging the gap between unstructured and structured data, recognizing that most businesses need to draw from a mix of data. With Amazon Q in QuickSight, non-technical users can leverage natural language to build, discover, and share meaningful insights in seconds. Now they can access databases and data warehouses, as well as unstructured business data, like emails, reports, charts, graphs, and images.

And looking ahead, we announced advanced agentic capabilities for Amazon Q Business, coming in 2025, which will use agents to automate complex tasks that stretch across multiple teams and applications. Agents give generative AI applications next-level capabilities, and we’re bringing them to our customers via Amazon Q Business, as well as Amazon Bedrock multi-agent collaboration, which improves successful task completion by 40% over popular solutions. This major improvement translates to more accurate and human-like outcomes in use cases like automating customer support, analyzing financial data for risk management, or optimizing supply-chain logistics.

It’s all part of how we’re enabling greater productivity today, with even more on the horizon.

Q: To get employees and customers adopting generative AI and benefiting from that increased productivity, it has to be trusted. What steps is AWS taking to help build that trust?

Swami Sivasubramanian: I think that lack of trust is a big obstacle to moving from proof of concept to production. Business leaders are about to hit go and they hesitate because they don’t want to lose the trust of their customers. As generative AI continues to drive innovation across industries and our daily life, the need for responsible AI has become increasingly acute. And we’re helping meet that need with innovations like Amazon Bedrock Automated Reasoning, which I mentioned earlier, that works to prevent hallucinations—and increases trust. We also announced new LLM-as-a-judge capabilities with Amazon Bedrock Model Evaluation so you can now perform tests and evaluate other models with humanlike quality at a fraction of the cost and time of running human evaluations. These evaluations assess multiple quality dimensions, including correctness, helpfulness, and responsible AI criteria such as answer refusal and harmfulness.

I should also mention that AWS recently became the first major cloud provider to announce ISO/IEC 42001 accredited certification for AI services, covering Amazon Bedrock, Amazon Q Business, Amazon Textract, and Amazon Transcribe. This international management system standard outlines requirements and controls for organizations to promote the responsible development and use of AI systems. Technical standards like ISO/IEC 42001 are significant because they provide a much-needed common framework for responsible AI development and deployment.

Q: Data remains central to building more personalized experiences applicable to your business. How do the re:Invent launches help AWS customers get their data ready for generative AI?

Swami Sivasubramanian: Generative AI isn’t going to be useful for organizations unless it can seamlessly access and deeply understand the organization’s data. With these insights, our customers can create customized experiences, such as highly personalized customer service agents that can help service representatives resolve issues faster. For AWS customers, getting data ready for generative AI isn’t just a technical challenge—it’s a strategic imperative. Proprietary, high-quality data is the key differentiator in transforming generic AI into powerful, business-specific applications. To prepare for this AI-driven future, we’re helping our customers build a robust, cloud-based data foundation, with built-in security and privacy. That’s the backbone of AI readiness.

With the next generation of Amazon SageMaker announced at re:Invent, we’re introducing an integrated experience to access, govern, and act on all your data by bringing together widely adopted AWS data, analytics, and AI capabilities. Collaborate and build faster from a unified studio using familiar AWS tools for model development, generative AI, data processing, and SQL analytics—with Amazon Q Developer assisting you along the way. Access all your data whether it’s stored in data lakes, data warehouses, third-party or federated data sources. And move with confidence and trust, thanks to built-in governance to address enterprise security needs.

At re:Invent, we also launched key Amazon Bedrock capabilities that help our customers maximize the value of their data. Amazon Bedrock Knowledge Bases now offers the only managed, out-of-the-box Retrieval Augmented Generation (RAG) solution, which enables our customers to natively query their structured data where it resides, accelerating development. Support for GraphRAG generates more relevant responses by modeling and storing relationships between data. And Amazon Bedrock Data Automation transforms unstructured, multimodal data into structured data for generative AI—automatically extracting, transforming, and generating usable data from multimodal content, at scale. These capabilities and more help our customers leverage their data to create powerful, insightful generative AI applications.

Q: What did you take away from your customer conversations at re:Invent?

Swami Sivasubramanian: I continue to be amazed and inspired by our customers and the important work they’re doing. We continue to offer our customers the choice and specialization they need to power their unique use cases. With Amazon Bedrock Marketplace, customers now have access to more than 100 popular, emerging, and specialized models.

At re:Invent, I heard a lot about the new efficiency and transformative experiences customers are creating. I also heard about innovations that are changing people’s lives. Like Exact Sciences, a molecular diagnostic company, which developed an AI-powered solution using Amazon Bedrock to accelerate genetic testing and analysis by 50%. Behind that metric there’s a real human value—enabling earlier cancer detection and personalized treatment planning. And that’s just one story among thousands, as our customers reach higher and build faster, achieving impressive results that change industries and improve lives.

I get excited when I think about how we can help educate the next wave of innovators building these experiences. With the launch of the new Education Equity Initiative, Amazon is committing up to $100 million in cloud technology and technical resources to help existing, dedicated learning organizations reach more learners by creating new and innovative digital learning solutions. That’s truly inspiring to me.

In fact, the pace of change, the remarkable innovations we introduced at re:Invent, and the enthusiasm of our customers all reminded me of the early days of AWS, when anything seemed possible. And now, it still is.


About the author

Swami Sivasubramanian is VP, AWS AI & Data. In this role, Swami oversees all AWS Database, Analytics, and AI & Machine Learning services. His team's mission is to help organizations put their data to work with a complete, end-to-end data solution to store, access, analyze, visualize, and predict.
