Introducing guardrails in Knowledge Bases for Amazon Bedrock

Introducing guardrails in Knowledge Bases for Amazon Bedrock

Knowledge Bases for Amazon Bedrock is a fully managed capability that helps you securely connect foundation models (FMs) in Amazon Bedrock to your company data using Retrieval Augmented Generation (RAG). This feature streamlines the entire RAG workflow, from ingestion to retrieval and prompt augmentation, eliminating the need for custom data source integrations and data flow management.

We recently announced the general availability of Guardrails for Amazon Bedrock, which allows you to implement safeguards in your generative artificial intelligence (AI) applications that are customized to your use cases and responsible AI policies. You can create multiple guardrails tailored to various use cases and apply them across multiple FMs, standardizing safety controls across generative AI applications.

Today’s launch of guardrails in Knowledge Bases for Amazon Bedrock brings enhanced safety and compliance to your generative AI RAG applications. This new functionality offers industry-leading safety measures that filter harmful content and protect sensitive information in your documents, improving user experience and aligning with organizational standards.

Solution overview

Knowledge Bases for Amazon Bedrock allows you to configure your RAG applications to query your knowledge base using the RetrieveAndGenerate API, generating responses from the retrieved information.

By default, knowledge bases allow your RAG applications to query the entire vector database, accessing all records and retrieving relevant results. This may lead to the generation of inappropriate or undesirable content or provide sensitive information, which could potentially violate certain policies or guidelines set by your company. Integrating guardrails with your knowledge base provides a mechanism to filter and control the generated output, complying with predefined rules and regulations.

The following diagram illustrates an example workflow.

guardrails in knowledge bases workflow

When you test the knowledge base using the Amazon Bedrock console or call the RetrieveAndGenerate API using one of the AWS SDKs, the system generates a query embedding and performs a semantic search to retrieve similar documents from the vector store.

The query is then augmented to have the retrieved document chunks, prompt, and guardrails configuration. Guardrails are applied to check for denied topics and filter out harmful content before the augmented query is sent to the InvokeModel API. Finally, the InvokeModel API generates a response from the large language model (LLM), making sure the output is free of any undesirable content.

In the following sections, we demonstrate how to create a knowledge base with guardrails. We also compare query results using the same knowledge base with and without guardrails.

Use cases for guardrails with Knowledge Bases for Amazon Bedrock

The following are common use cases for integrating guardrails in the knowledge base:

  • Internal knowledge management for a legal firm — This helps legal professionals search through case files, legal precedents, and client communications. Guardrails can prevent the retrieval of confidential client information and filter out inappropriate language. For instance, a lawyer might ask, “What are the key points from the latest case law on intellectual property?” and guardrails will make sure no confidential client details or inappropriate language are included in the response, maintaining the integrity and confidentiality of the information.
  • Conversational search for financial services — This enables financial advisors to search through investment portfolios, transaction histories, and market analyses. Guardrails can prevent the retrieval of unauthorized investment advice and filter out content that violates regulatory compliance. An example query could be, “What are the recent performance metrics for our high-net-worth clients?” with guardrails making sure only permissible information is shared.
  • Customer support for an ecommerce platform — This allows customer service representatives to access order histories, customer queries, and product details. Guardrails can block sensitive customer data (like names, emails, or addresses) from being exposed in responses. For example, when a representative asks, “Can you summarize the recent complaints about our new product line?” guardrails will redact any personally identifiable information (PII), enforcing privacy and compliance with data protection regulations.

Prepare a dataset for Knowledge Bases for Amazon Bedrock

For this post, we use a sample dataset containing multiple fictional emergency room reports, such as detailed procedural notes, preoperative and postoperative diagnoses, and patient histories. These records illustrate how to integrate knowledge bases with guardrails and query them effectively.

  1. If you want to follow along in your AWS account, download the file. Each medical record is a Word document.
  2. We store the dataset in an Amazon Simple Storage Service (Amazon S3) bucket. For instructions to create a bucket, see Creating a bucket.
  3. Upload the unzipped files to this S3 bucket.

Create a knowledge base for Amazon Bedrock

For instructions to create a new knowledge base, see Create a knowledge base. For this example, we use the following settings:

  1. On the Configure data source page, under Amazon S3, choose the S3 bucket with your dataset.
  2. Under Chunking strategy, select No chunking because the documents in the dataset are preprocessed to be within a certain length.
  3. In the Embeddings model section, choose model Titan G1 Embeddings – Text.
  4. In the Vector database section, choose Quick create a new vector store.

Create knowledge bases

Synchronize the dataset with the knowledge base

After you create the knowledge base, and your data files are in an S3 bucket, you can start the incremental ingestion. For instructions, see Sync to ingest your data sources into the knowledge base.

While you wait for the sync job to finish, you can move on to the next section, where you create guardrails.

Create a guardrail on the Amazon Bedrock console

Complete the following steps to create a guardrail:

  1. On the Amazon Bedrock console, choose Guardrails in the navigation pane.
  2. Choose Create guardrail.
  3. On the Provide guardrail details page, under Guardrail details, provide a name and optional description for the guardrail.
  4. In the Denied topics section, add the information for two topics as shown in the following screenshot.
  5. In the Add sensitive information filters section, under PII types, add all the PII types.
  6. Choose Create guardrail.

Create guardrails

Query the knowledge base on the Amazon Bedrock console

Let’s now test our knowledge base with guardrails:

  1. On the Amazon Bedrock console, choose Knowledge bases in the navigation pane.
  2. Choose the knowledge base you created.
  3. Choose Test knowledge base.
  4. Choose the Configurations icon, then scroll down to Guardrails.

The following screenshots show some side-by-side comparisons of querying a knowledge base without (left) and with (right) guardrails.

The first example illustrates querying against denied topics.

Query with and without guardrails knowledge bases

Next, we query data that contains PII.

Query with and without guardrails knowledge bases

Finally, we query about another denied topic.

Query with and without guardrails knowledge bases

Query the knowledge base with using the AWS SDK

You can use the following sample code to query the knowledge base with guardrails using the AWS SDK for Python (Boto3):

import boto3
client = boto3.client('bedrock-agent-runtime')
response = client.retrieve_and_generate(
    input={
        'text': 'Example input text'
    },
    retrieveAndGenerateConfiguration={
        'knowledgeBaseConfiguration': {
            'generationConfiguration': {
                'guardrailConfiguration': {
                    'guardrailId': 'your-guardrail-id',
                    'guardrailVersion': 'your-guardrail-version'
                }
            },
            'knowledgeBaseId': 'your-knowledge-base-id',
            'modelArn': 'your-model-arn'
        },
        'type': 'KNOWLEDGE_BASE'
    },
    sessionId='your-session-id'
)

Clean up

To clean up your resources, complete the following steps:

  1. Delete the knowledge base:
    1. On the Amazon Bedrock console, choose Knowledge bases under Orchestration in the navigation pane.
    2. Choose the knowledge base you created.
    3. Take note of the AWS Identity and Access Management (IAM) service role name in the Knowledge base overview
    4. In the Vector database section, take note of the Amazon OpenSearch Serverless collection ARN.
    5. Choose Delete, then enter delete to confirm.
  2. Delete the vector database:
    1. On the Amazon OpenSearch Service console, choose Collections under Serverless in the navigation pane.
    2. Enter the collection ARN you saved in the search bar.
    3. Select the collection and chose Delete.
    4. Enter confirm in the confirmation prompt, then choose Delete.
  3. Delete the IAM service role:
    1. On the IAM console, choose Roles in the navigation pane.
    2. Search for the role name you noted earlier.
    3. Select the role and choose Delete.
    4. Enter the role name in the confirmation prompt and delete the role.
  4. Delete the sample dataset:
    1. On the Amazon S3 console, navigate to the S3 bucket you used.
    2. Select the prefix and files, then choose Delete.
    3. Enter permanently delete in the confirmation prompt to delete.

Conclusion

In this post, we covered the integration of guardrails with Knowledge Bases for Amazon Bedrock. With this, you can benefit from a robust and customizable safety framework that aligns with your application’s unique requirements and responsible AI practices. This integration aims to enhance the overall security, compliance, and responsible usage of foundation models within the knowledge base ecosystem, providing you with greater control and confidence in your AI-driven applications.

For pricing information, visit Amazon Bedrock Pricing. To get started using Knowledge Bases for Amazon Bedrock, refer to Create a knowledge base. For deep-dive technical content and to learn how our Builder communities are using Amazon Bedrock in their solutions, visit our community.aws website.


About the Authors

Hardik Vasa is a Senior Solutions Architect at AWS. He focuses on Generative AI and Serverless technologies, helping customers make the best use of AWS services. Hardik shares his knowledge at various conferences and workshops. In his free time, he enjoys learning about new tech, playing video games, and spending time with his family.

Bani Sharma is a Sr Solutions Architect with Amazon Web Services (AWS), based out of Denver, Colorado. As a Solutions Architect, she works with a large number of Small and Medium businesses, and provides technical guidance and solutions on AWS. She has an area of depth in Containers, modernization and currently working on gaining depth in Generative AI. Prior to AWS, Bani worked in various technical roles for a large Telecom provider and worked as a Senior Developer for a multi-national bank.

Read More

Prompt engineering techniques and best practices: Learn by doing with Anthropic’s Claude 3 on Amazon Bedrock

Prompt engineering techniques and best practices: Learn by doing with Anthropic’s Claude 3 on Amazon Bedrock

You have likely already had the opportunity to interact with generative artificial intelligence (AI) tools (such as virtual assistants and chatbot applications) and noticed that you don’t always get the answer you are looking for, and that achieving it may not be straightforward. Large language models (LLMs), the models behind the generative AI revolution, receive instructions on what to do, how to do it, and a set of expectations for their response by means of a natural language text called a prompt. The way prompts are crafted greatly impacts the results generated by the LLM. Poorly written prompts will often lead to hallucinations, sub-optimal results, and overall poor quality of the generated response, whereas good-quality prompts will steer the output of the LLM to the output we want.

In this post, we show how to build efficient prompts for your applications. We use the simplicity of Amazon Bedrock playgrounds and the state-of-the-art Anthropic’s Claude 3 family of models to demonstrate how you can build efficient prompts by applying simple techniques.

Prompt engineering

Prompt engineering is the process of carefully designing the prompts or instructions given to generative AI models to produce the desired outputs. Prompts act as guides that provide context and set expectations for the AI. With well-engineered prompts, developers can take advantage of LLMs to generate high-quality, relevant outputs. For instance, we use the following prompt to generate an image with the Amazon Titan Image Generation model:

An illustration of a person talking to a robot. The person looks visibly confused because he can not instruct the robot to do what he wants.

We get the following generated image.

Let’s look at another example. All the examples in this post are run using Claude 3 Haiku in an Amazon Bedrock playground. Although the prompts can be run using any LLM, we discuss best practices for the Claude 3 family of models. In order to get access to the Claude 3 Haiku LLM on Amazon Bedrock, refer to Model access.

We use the following prompt:

What is 10 + 10?

Claude 3 Haiku’s response:


10 + 10 is 20

The request prompt is actually very ambiguous. 10 + 10 may have several valid answers; in this case, Claude 3 Haiku, using its internal knowledge, determined that 10 + 10 is 20. Let’s change the prompt to get a different answer for the same question:

1 + 1 is an addition
1 - 1 is a substraction
1 * 1 is multiplication
1 / 1 is a division

What is 10 + 10?

Claude 3 Haiku’s response:

10 + 10 is an addition. The answer is 20.

The response changed accordingly by specifying that 10 + 10 is an addition. Additionally, although we didn’t request it, the model also provided the result of the operation. Let’s see how, through a very simple prompting technique, we can obtain an even more succinct result:

1 + 1 is an addition
1 - 1 is a substraction
1 * 1 is multiplication
1 / 1 is a division

What is 10 + 10?

Answer only as in the examples provided and 
provide no additional information.

Claude 3 Haiku response:

10 + 10 is an addition.

Well-designed prompts can improve user experience by making AI responses more coherent, accurate, and useful, thereby making generative AI applications more efficient and effective.

The Claude 3 model family

The Claude 3 family is a set of LLMs developed by Anthropic. These models are built upon the latest advancements in natural language processing (NLP) and machine learning (ML), allowing them to understand and generate human-like text with remarkable fluency and coherence. The family is comprised of three models: Haiku, Sonnet, and Opus.

Haiku is the fastest and most cost-effective model on the market. It is a fast, compact model for near-instant responsiveness. For the vast majority of workloads, Sonnet is two times faster than Claude 2 and Claude 2.1, with higher levels of intelligence, and it strikes the ideal balance between intelligence and speed—qualities especially critical for enterprise use cases. Opus is the most advanced, capable, state-of-the-art foundation model (FM) with deep reasoning, advanced math, and coding abilities, with top-level performance on highly complex tasks.

Among the key features of the model’s family are:

  • Vision capabilities – Claude 3 models have been trained to not only understand text but also images, charts, diagrams, and more.
  • Best-in-class benchmarks – Claude 3 exceeds existing models on standardized evaluations such as math problems, programming exercises, and scientific reasoning. Specifically, Opus outperforms its peers on most of the common evaluation benchmarks for AI systems, including undergraduate level expert knowledge (MMLU), graduate level expert reasoning (GPQA), basic mathematics (GSM8K), and more. It exhibits high levels of comprehension and fluency on complex tasks, leading the frontier of general intelligence.
  • Reduced hallucination – Claude 3 models mitigate hallucination through constitutional AI techniques that provide transparency into the model’s reasoning, as well as improved accuracy. Claude 3 Opus shows an estimated twofold gain in accuracy over Claude 2.1 on difficult open-ended questions, reducing the likelihood of faulty responses.
  • Long context window – Claude 3 models excel at real-world retrieval tasks with a 200,000-token context window, the equivalent of 500 pages of information.

To learn more about the Claude 3 family, see Unlocking Innovation: AWS and Anthropic push the boundaries of generative AI together, Anthropic’s Claude 3 Sonnet foundation model is now available in Amazon Bedrock, and Anthropic’s Claude 3 Haiku model is now available on Amazon Bedrock.

The anatomy of a prompt

As prompts become more complex, it’s important to identify its various parts. In this section, we present the components that make up a prompt and the recommended order in which they should appear:

  1. Task context: Assign the LLM a role or persona and broadly define the task it is expected to perform.
  2. Tone context: Set a tone for the conversation in this section.
  3. Background data (documents and images): Also known as context. Use this section to provide all the necessary information for the LLM to complete its task.
  4. Detailed task description and rules: Provide detailed rules about the LLM’s interaction with its users.
  5. Examples: Provide examples of the task resolution for the LLM to learn from them.
  6. Conversation history: Provide any past interactions between the user and the LLM, if any.
  7. Immediate task description or request: Describe the specific task to fulfill within the LLMs assigned roles and tasks.
  8. Think step-by-step: If necessary, ask the LLM to take some time to think or think step by step.
  9. Output formatting: Provide any details about the format of the output.
  10. Prefilled response: If necessary, prefill the LLMs response to make it more succinct.

The following is an example of a prompt that incorporates all the aforementioned elements:

Human: You are a solutions architect working at Amazon Web Services (AWS) 
named John Doe.

Your goal is to answer customers' questions regarding AWS best architectural
practices and principles.
Customers may be confused if you don't respond in the character of John. You should maintain a friendly customer service tone. Answer the customers' questions using the information provided below <context>{{CONTEXT}}</context> Here are some important rules for the interaction: - Always stay in character, as John, a solutions architect that
work at AWS.
- If you are unsure how to respond, say "Sorry, I didn't understand that.
Could you repeat the question?"
- If someone asks something irrelevant, say, "Sorry, I am John and I give AWS
architectural advise. Do you have an AWS architecture question today I can
help you with?"
Here is an example of how to respond in a standard interaction:
<example>
User: Hi, what do you do?
John: Hello! My name is John, and I can answer your questions about best
architectural practices on AWS. What can I help you with today?
</example>

Here is the conversation history (between the user and you) prior to the
question. It could be empty if there is no history:
<history>{{HISTORY}}</history>

Here is the user's question: <question>{{QUESTION}}</question>
How do you respond to the user's question?

Think about your answer first before you respond. Put your response in <response></responses> Assistant: <response>

Best prompting practices with Claude 3

In the following sections, we dive deep into Claude 3 best practices for prompt engineering.

Text-only prompts

For prompts that deal only with text, follow this set of best practices to achieve better results:

  • Mark parts of the prompt with XLM tags – Claude has been fine-tuned to pay special attention to XML tags. You can take advantage of this characteristic to clearly separate sections of the prompt (instructions, context, examples, and so on). You can use any names you prefer for these tags; the main idea is to delineate in a clear way the content of your prompt. Make sure you include <> and </> for the tags.
  • Always provide good task descriptions – Claude responds well to clear, direct, and detailed instructions. When you give an instruction that can be interpreted in different ways, make sure that you explain to Claude what exactly you mean.
  • Help Claude learn by example – One way to enhance Claude’s performance is by providing examples. Examples serve as demonstrations that allow Claude to learn patterns and generalize appropriate behaviors, much like how humans learn by observation and imitation. Well-crafted examples significantly improve accuracy by clarifying exactly what is expected, increase consistency by providing a template to follow, and boost performance on complex or nuanced tasks. To maximize effectiveness, examples should be relevant, diverse, clear, and provided in sufficient quantity (start with three to five examples and experiment based on your use case).
  • Keep the responses aligned to your desired format – To get Claude to produce output in the format you want, give clear directions, telling it exactly what format to use (like JSON, XML, or markdown).
  • Prefill Claude’s response – Claude tends to be chatty in its answers, and might add some extra sentences at the beginning of the answer despite being instructed in the prompt to respond with a specific format. To improve this behavior, you can use the assistant message to provide the beginning of the output.
  • Always define a persona to set the tone of the response – The responses given by Claude can vary greatly depending on which persona is provided as context for the model. Setting a persona helps Claude set the proper tone and vocabulary that will be used to provide a response to the user. The persona guides how the model will communicate and respond, making the conversation more realistic and tuned to a particular personality. This is especially important when using Claude as the AI behind a chat interface.
  • Give Claude time to think – As recommended by Anthropic’s research team, giving Claude time to think through its response before producing the final answer leads to better performance. The simplest way to encourage this is to include the phrase “Think step by step” in your prompt. You can also capture Claude’s step-by-step thought process by instructing it to “please think about it step-by-step within <thinking></thinking> tags.”
  • Break a complex task into subtasks – When dealing with complex tasks, it’s a good idea to break them down and use prompt chaining with LLMs like Claude. Prompt chaining involves using the output from one prompt as the input for the next, guiding Claude through a series of smaller, more manageable tasks. This improves accuracy and consistency for each step, makes troubleshooting less complicated, and makes sure Claude can fully focus on one subtask at a time. To implement prompt chaining, identify the distinct steps or subtasks in your complex process, create separate prompts for each, and feed the output of one prompt into the next.
  • Take advantage of the long context window – Working with long documents and large amounts of text can be challenging, but Claude’s extended context window of over 200,000 tokens enables it to handle complex tasks that require processing extensive information. This feature is particularly useful with Claude Haiku because it can help provide high-quality responses with a cost-effective model. To take full advantage of this capability, it’s important to structure your prompts effectively.
  • Allow Claude to say “I don’t know” – By explicitly giving Claude permission to acknowledge when it’s unsure or lacks sufficient information, it’s less likely to generate inaccurate responses. This can be achieved by adding a preface to the prompt, such as, “If you are unsure or don’t have enough information to provide a confident answer, simply say ‘I don’t know’ or ‘I’m not sure.’”

Prompts with images

The Claude 3 family offers vision capabilities that can process images and return text outputs. It’s capable of analyzing and understanding charts, graphs, technical diagrams, reports, and other visual assets. The following are best practices when working with images with Claude 3:

  • Image placement and size matters – For optimal performance, when working with Claude 3’s vision capabilities, the ideal placement for images is at the very start of the prompt. Anthropic also recommends resizing an image before uploading and striking a balance between image clarity and image size. For more information, refer to Anthropic’s guidance on image sizing.
  • Apply traditional techniques – When working with images, you can apply the same techniques used for text-only prompts (such as giving Claude time to think or defining a role) to help Claude improve its responses.

Consider the following example, which is an extraction of the picture “a fine gathering” (Author: Ian Kirck, https://en.m.wikipedia.org/wiki/File:A_fine_gathering_(8591897243).jpg).

We ask Claude 3 to count how many birds are in the image:

You are a bird expert who has perfect vision and pay a lot of attention to details. 
How many birds are in this picture? Before providing the number of birds in <answer>
tags, take the time to think in <thinking> tags analyzing each part of the picture 
from left to right.

Claude 3 Haiku’s response:

<thinking>
Alright, let me carefully examine this image as a bird expert with keen visual acuity.
Starting from the left, I see three birds with mottled brown and white plumage, 
which appear to be some type of shorebird like a sandpiper or plover. 
Next, there is a group of four more birds with similar mottled plumage. 
In the center, there is a striking black-headed gull or tern with a white body. 
To the right of the gull, there is another bird with the same distinctive black head and white body. 
Finally, at the far right, I can make out the silhouette of another small bird.
</thinking>

<answer>9 birds</answer>

In this example, we asked Claude to take some time to think and put its
reasoning in an XML tag and the final answer in another. Also, we gave Claude time to think and clear instructions to pay attention to details, which helped Claude to provide the correct response.

  • Take advantage of visual prompts – The ability to use images also enables you to add prompts directly within the image itself instead of providing a separate prompt.

Let’s see an example with the following image:

In this case, the image itself is the prompt:
Claude 3 Haiku’s response:

To solve for the missing side length X, we can use the Pythagorean theorem. 
The Pythagorean theorem states that in a right-angled triangle, the square of the 
length of the hypotenuse (the longest side) is equal to the sum of the squares 
of the lengths of the other two sides.
 
Given:
- One side is 12 centimeters.
- The other side is 10 centimeters.

Let's denote the missing side length as X.

Using the Pythagorean theorem:

(12)^2 + (10)^2 = X^2
144 + 100 = X^2
244 = X^2
X = √244
X ≈ 15.63 centimeters

Therefore, the length of the missing side X is approximately 15.63 centimeters.
  • Examples are also valid using images – You can provide multiple images in the same prompt and take advantage of Claude’s vision capabilities to provide examples and additional valuable information using the images. Make sure you use image tags to clearly identify the different images. Because this question is a reasoning and mathematical question, set the temperature to 0 for a more deterministic response.

Let’s look at the following example:

Prompt:

You are an expert at logical reasoning and counting figures. When you identify figures 
in a given image, you always pay special attention to all the lines and vertices that can 
form distinct smaller figures within the overall large shape.

Using the following images, image 1 and image 2

<image1>

</image1>

<image2>

</image2>

Please tell me how many triangles do you see in <image1> using as reference <image2> 
to identify the vertices of the triangles.

Here is an example of how to list out the identified triangles using the vertices:

Large triangle: 1 (ABC)

Please take the time to analyze the image in detail in <thinking> tags and put your 
response in <response> tags.

Claude 3 Haiku’s response:

<thinking>

Analyzing the image 1 in the tags <image1>, and using the image 2 in the tags <image2> as 
a reference for identifying the vertices, I can see the following triangles:

1. The large triangle formed by the outer lines (vertices A, B, C)
2. The smaller triangle formed by the lines within the larger triangle (vertices C, D, A)
3. The second smaller triangle formed by the lines within the larger triangle (vertices D, B, C)
4. The third smaller triangle formed by the lines, with vertices A, B, and D

So in total, there are 4 distinct triangles visible in the image 1.
</thinking>

<response>

In the image 1 in the tags <image1>, there are 4 triangles:

1. The large triangle ABC
2. The smaller triangle ACD
3. The second smaller triangle DBC
4. The third smaller triangle ABD

</response>
  • Use detailed descriptions when working with complicated charts or graphics – Working with charts or graphics is a relatively straightforward task when using Claude’s models. We simply take advantage of Claude’s vision capabilities, pass the charts or graphics in image format, and then ask questions about the provided images. However, when working with complicated charts that have lots of colors (which look very similar) or a lot of data points, it’s a good practice to help Claude better understand the information with the following methods:
    • Ask Claude to describe in detail each data point that it sees in the image.
    • Ask Claude to first identify the HEX codes of the colors in the graphics to clearly see the difference in colors.

Let’s see an example. We pass to Claude the following map chart in image format (source: https://ourworldindata.org/co2-and-greenhouse-gas-emissions), then we ask about Japan’s greenhouse gas emissions.

Prompt:

What are the greenhouse gas emissions for Japan in 2021? Prior to provide your response, 
please take your time to think in <thinking> tags, identifying the nine color used and 
their corresponding legend using HEX code. Then, provide your response in <response> tags.

Claude 3 Haiku’s response:

<thinking>

Analyzing the distinct colors used in the color map and their corresponding legend:

The color scale on the map uses the following 9 distinct colors:

1. Light yellow: HEX code #FFFFD0, corresponding to "0 t" emissions range
2. Pale orange: HEX code #FFD580, corresponding to "10 million t" emissions range
3. Slightly darker orange: HEX code #FFA852, corresponding to "30 million t" emissions range
4. Bright orange: HEX code #FFA040, corresponding to "100 million t" to "300 million t" emissions range
5. Darker orange: HEX code #FF7F2A, corresponding to "300 million t" to "1 billion t" emissions range
6. Reddish-orange: HEX code #FF6018, corresponding to "1 billion t" to "3 billion t" emissions range
7. Darker reddish-orange: HEX code #E64700, corresponding to "3 billion t" to "10 billion t" emissions range
8. Maroon: HEX code #8B0000, corresponding to "10 billion t" emissions range
9. No data: Shown as a gray color

The color used for Japan is the darker orange shade with HEX code #FF7F2A, which 
corresponds to the "300 million t" to "1 billion t" emissions range.

</thinking>

<response>

Based on the color analysis and the corresponding legend, the greenhouse gas emissions for 
Japan in 2021 are in the range of 300 million to 1 billion tonnes of carbon dioxide-equivalents.

</response>
  • Use narrations when working with slide decks – Similar to the use case with charts and graphics, when you want to work with slide decks, the first step is to convert the original files that you have into images, with the consideration that you need to create one image per slide. Then you’re ready to ask questions. You can also increase the accuracy of the responses or ask questions about the entire content of the slide deck. It’s a good practice to first narrate each slide and consolidate the distinct narration in a final narration. This way, you can have a narration that covers the entire content of the slide deck, which can later serve to respond to questions using a regular text workflow. This includes vector search.

Let’s see an example of narration with the following image (source: Sustainable Development Goals Report 2023, https://unstats.un.org/sdgs/report/2023/The-Sustainable-Development-Goals-Report-2023.pdf):

Prompt:
<image>

</image>

You are presenting the 2023 sustainable development goals report from OECD 
and you are currently in slide 9.

Please narrate the slide 9 in <image> tags that provides information about the ninth 
goal as if you were the presenter.

Avoid discussing any topics, particularly acronyms, unless you are completely confident 
in your understanding of their meaning and do not introduce information that is not 
explicitly presented on the current slide. Make sure you narrate every single number 
showed in the slide and all the details. If you fail to mention every number in the slide, 
some viewers who are visual impaired won’t be able to read the information by themselves.

Put your narration in <narration> tags.

Claude 3 Haiku’s response:

<narration>

This slide focuses on the 9th Sustainable Development Goal, which is to "Build Resilient 
Infrastructure, Promote Inclusive and Sustainable Industrialization and Foster Innovation".

The slide presents information on global manufacturing. It shows that global manufacturing 
growth slowed from 7.4% in 2021 to 3.3% in 2022, due to factors like inflation, energy 
price shocks, supply chain disruptions, and global economic deceleration.

The slide also highlights that CO2 emissions from energy-related sources reached a record 
high of 36.8 billion metric tons in 2022.

Furthermore, the slide indicates that less developed countries, or LDCs, are likely to miss 
their 2030 target of doubling their manufacturing share of GDP. In 2015, this share was 12.1%, 
rising to 14% in 2022, but the 2030 target is 24.2%.

The regional breakdown shows that sub-Saharan Africa has the lowest manufacturing share at 
21.7%, Europe and North America has the highest at 47.1%, and Eastern Asia is in the middle 
at 47.7%.

</narration>

In this example, we were careful to control the content of the narration. We made sure Claude didn’t mention any extra information or discuss anything it wasn’t completely confident about. We also made sure Claude covered all the key details and numbers presented in the slide. This is very important because the information from the narration in text format needs to be precise and accurate in order to be used to respond to questions.

An in-depth prompt example for information extraction

Information extraction is the process of automating the retrieval of specific information related to a specific topic from a collection of texts or documents. LLMs can extract information regarding attributes given a context and a schema. The kinds of documents that can be better analyzed with LLMs are resumes, legal contracts, leases, newspaper articles, and other documents with unstructured text.

The following prompt instructs Claude 3 Haiku to extract information from short text like posts on social media, although it can be used for much longer pieces of text like legal documents or manuals. In the following example, we use the color code defined earlier to highlight the prompt sections:

Human: You are an information extraction system. Your task is to extract key information 
from the text enclosed between <post></post> and put it in JSON.

Here are some basic rules for the task:
- Do not output your reasoning for the extraction
- Always produce complete and valid JSON objects
- If no information can be extracted or you can not produce a valid JSON object output
an empty json object "{}"
Here are some examples of how to extract information from text:
<examples>
<example_1>
<post>
"""Six months ago, Wall Street Journal reporter Evan Gershkovich was detained in Russia
during a reporting trip. He remains in a Moscow prison. We’re offering resources for
those who want to show their support for him. #IStandWithEvan https://wsj.com/Evan"""
</post>
<json>
{
"topic": "detention of a reporter",
"location": "Moscow"
"entities": ["Evan Gershkovich", "Wall Street Journal"],
"keyphrases": ["reporter", "detained", "prison"],
"sentiment": "negative",
"links": ["https://wsj.com/Evan"],
}
</json>
</example_1>

<example_2>
<post>
"""'We’re living an internal war': Once-peaceful Ecuador has become engulfed in the
cocaine trade, and the bodies are piling up."""
</post>
<json>
{
"topic": "drug war",
"location": "Ecuador",
"entities": ["Ecuador"],
"keyphrases": ["drug war", "cocaine trade"],
"sentiment": "negative",
"links": [],
}
</json>
</example_2>

</examples>

Extract information from the following post. Generate only a complete JSON object and put
it in <json></json>.
<post>
"""A postpandemic hiring spree has left airports vulnerable to security gaps as new staff
gain access to secure areas, creating an opening for criminal groups."""
</post>

Use the following JSON object definition to write your answer
<json_definition>
{
"type": "object",
"properties": {
"topic": {
"description": "the main topic of the post",
"type": "string",
"default": ""
},
"location": {
"description": "the location, if exists, where the events occur",
"type": "string",
"default": ""
},
"entities": {
"description": "the entities involved in the post",
"type": "list",
"default": []
},
"keyphrases": {
"description": "the keyphrases in the post",
"type": "list",
"default": []
},
"sentiment": {
"description": "the sentiment of the post",
"type": "string",
"default": ""
},
"links": {
"description": "any links found within the post",
"type": "list",
"default": []
}
}
}
</json_definition>

Assistant:<json>

Claude 3 Haiku’s response:

{
"topic": "airport security gaps",
"location": "",
"entities": ["airports"],
"keyphrases": ["postpandemic hiring spree", "security gaps", 
			"new staff", "secure areas", "criminal groups"],
"sentiment": "negative",
"links": []
}
</json>

The prompt incorporates the following best practices:

  • Define a persona and tone for the LLM – In this case, we specified that the LLM is an information extraction system.
  • Provide clear task descriptions – We were as specific as possible when describing the task to the LLM.
  • Specify the data you want to extract using JSON objects to define the expected output – We provided a full definition of the JSON object we want to obtain.
  • Use few-shot prompting – We showed the LLM pairs of unstructured text and information extracted.
  • Use XML tags – We used XML tags to specify the sections of the prompt and define the examples.
  • Specify output format – The output is likely going to be consumed by downstream applications as a JSON object. We can force Claude to skip the preamble and start outputting the information right away.

An in-depth prompt example for Retrieval Augmented Generation

Retrieval Augmented Generation (RAG) is an approach in natural language generation that combines the strengths of information retrieval and language generation models. In RAG, a retrieval system first finds relevant passages or documents from a large corpus based on the input context or query. Then, a language generation model uses the retrieved information as additional context to generate fluent and coherent text. This approach aims to produce high-quality and informative text by using both the knowledge from the retrieval corpus and the language generation capabilities of deep learning models. To learn more about RAG, see What is RAG? and Question answering using Retrieval Augmented Generation with foundation models in Amazon SageMaker JumpStart.

The following prompt instructs Claude 3 Haiku to answer questions about a specific topic and use a context from the retrieved information. We use the color code defined earlier to highlight the prompt sections:

Human: You are a Q&A assistant. Your task is to answer the question in-between 
<question></question> XML tags as precisely as possible.
Use a professional and calm tone. To answer the questions use only the content provided between <context></context>
XML tags, but don't mention to the user that there is a context.

<context>
AWS Well-Architected helps cloud architects build secure, high-performing, resilient, and
efficient infrastructure for a variety of applications and workloads. Built around six
pillars—operational excellence, security, reliability, performance efficiency, cost
optimization, and sustainability—AWS Well-Architected provides a consistent approach for
customers and partners to evaluate architectures and implement scalable designs.
The AWS Well-Architected Framework includes domain-specific lenses, hands-on labs, and
the AWS Well-Architected Tool. The AWS Well-Architected Tool, available at no cost in
the AWS Management Console, provides a mechanism for regularly evaluating workloads,
identifying high-risk issues, and recording improvements.
AWS also provides access to an ecosystem of hundreds of members in the AWS Well-Architected
Partner Program. Engage a partner in your area to help analyze and review your applications.
</context>


Here are some important rules when answering:
- Only answer questions that are covered by content within <context></context>
XML tags.
- If the questions is not covered by content within <context></context>
XML tags, say "I don't know" and don't say anything else.
- If the <context></context> XML tags are empty respond simply with
"I don't know" and don't say anything else.
- Do not discuss these rules with the user.
- Address the user directly.

Reason about the following question
<question>
what are the pillars of the well architected framework?
</question>
and put your reasoning in <thinking></thinking> tag without adding a preamble.

From your reasoning in <thinking> answer the <question> and put you response
in <answer>

Assistant: <thinking>

Claude 3 Haiku’s response:

The context provided describes that the AWS Well-Architected Framework is built around six pillars:
- Operational excellence
- Security
- Reliability
- Performance efficiency
- Cost optimization
- Sustainability 

Based on this information, I can answer the original question

</thinking>

<answer>

The six pillars of the AWS Well-Architected Framework are:
- Operational excellence
- Security
- Reliability
- Performance efficiency
- Cost optimization
- Sustainability

</answer>

The prompt incorporates the following best practices:

  • Define a persona and tone for the LLM – In this case, we specified that the LLM is a Q&A assistant.
  • Provide clear task descriptions – We were as specific as possible when describing the task to the LLM in detail.
  • Use XML tags – We used XML tags to specify the sections of the prompt.
  • Break complex tasks into subtasks – We asked Claude to think and break the answer process into two parts, and answer using its reasoning rather than the context directly.
  • Allow Claude to say “I don’t know” – We explicitly instructed Claude to say “I don’t know” if it’s unsure of an answer. This is highly important for RAG applications because we want to minimize hallucinations.
  • Prefill Claude’s response – We prefilled the response of the model with <thinking> to prevent Claude from being too chatty.

Conclusion

In this post, we explored best prompting practices and demonstrated how to apply them with the Claude 3 family of models. The Claude 3 family of models are the latest and most capable LLMs available from Anthropic.

We encourage you to try out your own prompts using Amazon Bedrock playgrounds on the Amazon Bedrock console, and try out the official Anthropic Claude 3 Prompt Engineering Workshop to learn more advanced techniques. You can send feedback to AWS re:Post for Amazon Bedrock or through your usual AWS Support contacts.

Refer to the following to learn more about the Anthropic Claude 3 family:


About the Authors

David Laredo is a Prototyping Architect at AWS, where he helps customers discover the art of the possible through disruptive technologies and rapid prototyping techniques. He is passionate about AI/ML and generative AI, for which he writes blog posts and participates in public speaking sessions all over LATAM. He currently leads the AI/ML experts community in LATAM.

Claudia Cortes is a Partner Solutions Architect at AWS, focused on serving Latin American Partners. She is passionate about helping partners understand the transformative potential of innovative technologies like AI/ML and generative AI, and loves to help partners achieve practical use cases. She is responsible for programs such as AWS Latam Black Belt, which aims to empower partners in the Region by equipping them with the necessary knowledge and resources.

Simón Córdova is a Senior Solutions Architect at AWS, focused on bridging the gap between AWS services and customer needs. Driven by an insatiable curiosity and passion for generative AI and AI/ML, he tirelessly explores ways to leverage these cutting-edge technologies to enhance solutions offered to customers.

Gabriel Velazquez is a Sr Generative AI Solutions Architect at AWS, he currently focuses on supporting Anthropic on go-to-market strategy. Prior to working in AI, Gabriel built deep expertise in the telecom industry where he supported the launch of Canada’s first 4G wireless network. He now combines his expertise in connecting a nation with knowledge of generative AI to help customers innovate and scale.

Read More

Improve productivity when processing scanned PDFs using Amazon Q Business

Improve productivity when processing scanned PDFs using Amazon Q Business

Amazon Q Business is a generative AI-powered assistant that can answer questions, provide summaries, generate content, and extract insights directly from the content in digital as well as scanned PDF documents in your enterprise data sources without needing to extract the text first.

Customers across industries such as finance, insurance, healthcare life sciences, and more need to derive insights from various document types, such as receipts, healthcare plans, or tax statements, which are frequently in scanned PDF format. These document types often have a semi-structured or unstructured format, which requires processing to extract text before indexing with Amazon Q Business.

The launch of scanned PDF document support with Amazon Q Business can help you seamlessly process a variety of multi-modal document types through the AWS Management Console and APIs, across all supported Amazon Q Business AWS Regions. You can ingest documents, including scanned PDFs, from your data sources using supported connectors, index them, and then use the documents to answer questions, provide summaries, and generate content securely and accurately from your enterprise systems. This feature eliminates the development effort required to extract text from scanned PDF documents outside of Amazon Q Business, and improves the document processing pipeline for building your generative artificial intelligence (AI) assistant with Amazon Q Business.

In this post, we show how to asynchronously index and run real-time queries with scanned PDF documents using Amazon Q Business.

Solution overview

You can use Amazon Q Business for scanned PDF documents from the console, AWS SDKs, or AWS Command Line Interface (AWS CLI).

Amazon Q Business provides a versatile suite of data connectors that can integrate with a wide range of enterprise data sources, empowering you to develop generative AI solutions with minimal setup and configuration. To learn more, visit Amazon Q Business, now generally available, helps boost workforce productivity with generative AI.

After your Amazon Q Business application is ready to use, you can directly upload the scanned PDFs into an Amazon Q Business index using either the console or the APIs. Amazon Q Business offers multiple data source connectors that can integrate and synchronize data from multiple data repositories into single index. For this post, we demonstrate two scenarios to use documents: one with the direct document upload option, and another using the Amazon Simple Storage Service (Amazon S3) connector. If you need to ingest documents from other data sources, refer to Supported connectors for details on connecting additional data sources.

Index the documents

In this post, we use three scanned PDF documents as examples: an invoice, a health plan summary, and an employment verification form, along with some text documents.

The first step is to index these documents. Complete the following steps to index documents using the direct upload feature of Amazon Q Business. For this example, we upload the scanned PDFs.

  1. On the Amazon Q Business console, choose Applications in the navigation pane and open your application.
  2. Choose Add data source.
  3. Choose Upload Files.
  4. Upload the scanned PDF files.

You can monitor the uploaded files on the Data sources tab. The Upload status changes from Received to Processing to Indexed or Updated, as which point the file has been successfully indexed into the Amazon Q Business data store. The following screenshot shows the successfully indexed PDFs.

Indexed documents in uploaded files section.

The following steps demonstrate how to integrate and synchronize documents using an Amazon S3 connector with Amazon Q Business. For this example, we index the text documents.

  1. On the Amazon Q Business console, choose Applications in the navigation pane and open your application.
  2. Choose Add data source.
  3. Choose Amazon S3 for the connector.
  4. Enter the information for Name, VPC and security group settings, IAM role, and Sync mode.
  5. To finish connecting your data source to Amazon Q Business, choose Add data source.
  6. In the Data source details section of your connector details page, choose Sync now to allow Amazon Q Business to begin syncing (crawling and ingesting) data from your data source.

When the sync job is complete, your data source is ready to use. The following screenshot shows all five documents (scanned and digital PDFs, and text files) are successfully indexed.

Amazon S3 connector

The following screenshot shows a comprehensive view of the two data sources: the directly uploaded documents and the documents ingested through the Amazon S3 connector.

Amazon Q Business data sources.

Now let’s run some queries with Amazon Q Business on our data sources.

Queries on dense, unstructured, scanned PDF documents

Your documents might be dense, unstructured, scanned PDF document types. Amazon Q Business can identify and extract the most salient information-dense text from it. In this example, we use the multi-page health plan summary PDF we indexed earlier. The following screenshot shows an example page.

Health plan summary document.

This is an example of a health plan summary document.

In the Amazon Q Business web UI, we ask “What is the annual total out-of-pocket maximum, mentioned in the health plan summary?”

Amazon Q Business searches the indexed document, retrieves the relevant information, and generates an answer while citing the source for its information. The following screenshot shows the sample output.

Amazon Q Business output

Queries on structured, tabular, scanned PDF documents

Documents might also contain structured data elements in tabular format. Amazon Q Business can automatically identify, extract, and linearize structured data from scanned PDFs to accurately resolve any user queries. In the following example, we use the invoice PDF we indexed earlier. The following screenshot shows an example.

Invoice

This is an example of an invoice.

In the Amazon Q Business web UI, we ask “How much were the headphones charged in the invoice?”

Amazon Q Business searches the indexed document and retrieves the answer with reference to the source document. The following screenshot shows that Amazon Q Business is able to extract bill information from the invoice.

Amazon Q Business output

Queries on semi-structured forms

Your documents might also contain semi-structured data elements in a form, such as key-value pairs. Amazon Q Business can accurately satisfy queries related to these data elements by extracting specific fields or attributes that are meaningful for the queries. In this example, we use the employment verification PDF. The following screenshot shows an example.

Employment verification sample

This is an example of an employment verification form.

In the Amazon Q Business web UI, we ask “What is the applicant’s date of employment in the employment verification form?” Amazon Q Business searches the indexed employment verification document and retrieves the answer with reference to the source document.

Amazon Q Business output

Index documents using the AWS CLI

In this section, we show you how to use the AWS CLI to ingest structured and unstructured documents stored in an S3 bucket into an Amazon Q Business index. You can quickly retrieve detailed information about your documents, including their statuses and any errors occurred during indexing. If you’re an existing Amazon Q Business user and have indexed documents in various formats, such as scanned PDFs and other supported types, and you now want to reindex the scanned documents, complete the following steps:

  1.  Check the status of each document to filter failed documents according to the status "DOCUMENT_FAILED_TO_INDEX". You can filter the documents based on this error message:

"errorMessage": "Document cannot be indexed since it contains no text to index and search on. Document must contain some text."

If you’re a new user and haven’t indexed any documents, you can skip this step.

The following is an example of using the ListDocuments API to filter documents with a specific status and their error messages:

aws qbusiness list-documents --region <region> 
--application-id <application-id> 
--index-id <index-id> 
--query "documentDetailList[?status=='DOCUMENT_FAILED_TO_INDEX'].{DocumentId:documentId, ErrorMessage:error.errorMessage}"
--output json

The following screenshot shows the AWS CLI output with a list of failed documents with error messages.

List of failed documents

Now you batch-process the documents. Amazon Q Business supports adding one or more documents to an Amazon Q Business index.

  1. Use the BatchPutDocument API to ingest multiple scanned documents stored in an S3 bucket into the index:
    aws qbusiness batch-put-document —region <region> 
    --documents '[{ "id":"s3://<your-bucket-path>/<scanned-pdf-document1>","content":{"s3":{"bucket":"<your-bucket> ","key":"<scanned-pdf-document1>"}}}, { "id":"s3://<your-bucket-path>/<scanned-pdf-document2>","content":{"s3":{"bucket":" <your-bucket>","key":"<scanned-pdf-document2>"}}}]' 
    --application-id <application-id> 
    --index-id <index-id> 
    --endpoint-url <application-endpoint-url> 
    --role-arn <role-arn> 
    --no-verify-ssl

The following screenshot shows the AWS CLI output. You should see failed documents as an empty list.

List of failed documents

  1. Finally, use the ListDocuments API again to review if all documents were indexed properly:
    aws qbusiness list-documents --region <region> 
    --application-id <application-id> 
    --index-id <index-id> 
    --endpoint-url <application-endpoint-url> 
    --no-verify-ssl

The following screenshot shows that the documents are indexed in the data source.

List of indexed documents

Clean up

If you created a new Amazon Q Business application and don’t plan to use it further, unsubscribe and remove assigned users from the application and delete it so that your AWS account doesn’t accumulate costs. Moreover, if you don’t need to use the indexed data sources further, refer to Managing Amazon Q Business data sources for instructions to delete your indexed data sources.

Conclusion

This post demonstrated the support for scanned PDF document types with Amazon Q Business. We highlighted the steps to sync, index, and query supported document types—now including scanned PDF documents—using generative AI with Amazon Q Business. We also showed examples of queries on structured, unstructured, or semi-structured multi-modal scanned documents using the Amazon Q Business web UI and AWS CLI.

To learn more about this feature, refer to Supported document formats in Amazon Q Business. Give it a try on the Amazon Q Business console today! For more information, visit Amazon Q Business and the Amazon Q Business User Guide. You can send feedback to AWS re:Post for Amazon Q or through your usual AWS support contacts.


About the Authors

Sonali Sahu is leading the Generative AI Specialist Solutions Architecture team in AWS. She is an author, thought leader, and passionate technologist. Her core area of focus is AI and ML, and she frequently speaks at AI and ML conferences and meetups around the world. She has both breadth and depth of experience in technology and the technology industry, with industry expertise in healthcare, the financial sector, and insurance.

Chinmayee Rane is a Generative AI Specialist Solutions Architect at AWS. She is passionate about applied mathematics and machine learning. She focuses on designing intelligent document processing and generative AI solutions for AWS customers. Outside of work, she enjoys salsa and bachata dancing.

Himesh Kumar is a seasoned Senior Software Engineer, currently working at Amazon Q Business in AWS. He is passionate about building distributed systems in the generative AI/ML space. His expertise extends to developing scalable and efficient systems, ensuring high availability, performance, and reliability. Beyond his technical skills, he is dedicated to continuous learning and staying at the forefront of technological advancements in AI and machine learning.

Qing Wei is a Senior Software Developer on the Amazon Q Business team in AWS who is passionate about building modern applications using AWS technologies. He loves community-driven learning and sharing of technology, especially for machine learning hosting and inference-related topics. His main focus right now is on building serverless and event-driven architectures for RAG data ingestion.


Accelerated PyTorch inference with torch.compile on AWS Graviton processors

Accelerated PyTorch inference with torch.compile on AWS Graviton processors

Originally, PyTorch used an eager mode where each PyTorch operation that forms the model is run independently as soon as it's reached. PyTorch 2.0 introduced torch.compile to speed up PyTorch code over the default eager mode. In contrast to eager mode, torch.compile pre-compiles the entire model into a single graph in a manner that's optimal for running on a given hardware platform. AWS optimized the PyTorch torch.compile feature for AWS Graviton3 processors. This optimization results in up to 2x better performance for Hugging Face model inference (based on geomean of performance improvement for 33 models) and up to 1.35x better performance for TorchBench model inference (geomean of performance improvement for 45 models) compared to the default eager mode inference across several natural language processing (NLP), computer vision (CV), and recommendation models on AWS Graviton3-based Amazon EC2 instances. Starting with PyTorch 2.3.1, the optimizations are available in torch Python wheels and AWS Graviton PyTorch deep learning container (DLC).

In this blog post, we show how we optimized torch.compile performance on AWS Graviton3-based EC2 instances, how to use the optimizations to improve inference performance, and the resulting speedups.

Why torch.compile and what’s the goal?

In eager mode, operators in a model are run immediately as they are encountered. It's easier to use and more suitable for machine learning (ML) researchers, and hence is the default mode. However, eager mode incurs runtime overhead because of redundant kernel launches and memory reads. In torch.compile mode, by contrast, operators are first synthesized into a graph, wherein one operator is merged with another to reduce and localize memory reads and total kernel launch overhead.
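
To make the contrast concrete, the following minimal sketch uses a small torchvision model to run the same inference once in eager mode and once after wrapping the model with torch.compile. Compilation happens on the first call to the compiled model; subsequent calls reuse the generated graph.

import torch
import torchvision.models as models

model = models.resnet18(weights=None).eval()  # small example model with random weights
x = torch.randn(1, 3, 224, 224)

# Eager mode: each operator runs as soon as it is reached
with torch.no_grad():
    eager_out = model(x)

# torch.compile mode: the model is compiled into a graph on the first call,
# and later calls reuse the compiled graph
compiled_model = torch.compile(model)
with torch.no_grad():
    compiled_out = compiled_model(x)

# The two paths produce (numerically close) identical results
print(torch.allclose(eager_out, compiled_out, atol=1e-4))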

The goal for the AWS Graviton team was to optimize the torch.compile backend for Graviton3 processors. PyTorch eager mode was already optimized for Graviton3 processors with Arm Compute Library (ACL) kernels using oneDNN (also known as MKLDNN). So, the question was how to reuse those kernels in torch.compile mode to get the best of both graph compilation and optimized kernel performance.

Results

The AWS Graviton team extended the torch inductor and oneDNN primitives to reuse the ACL kernels and optimize compile mode performance on Graviton3 processors. Starting with PyTorch 2.3.1, the optimizations are available in the torch Python wheels and AWS Graviton DLC. Please see the Running an inference section that follows for instructions on installation, runtime configuration, and how to run the tests.

To demonstrate the performance improvements, we used NLP, CV, and recommendation models from TorchBench and the most downloaded NLP models from Hugging Face across Question Answering, Text Classification, Token Classification, Translation, Zero-Shot Classification, Summarization, Feature Extraction, Text Generation, Text2Text Generation, Fill-Mask, and Sentence Similarity tasks to cover a wide variety of customer use cases.

We started with measuring TorchBench model inference latency, in milliseconds (msec), for the eager mode, which is marked 1.0 with a red dotted line in the following graph. Then we compared the improvements from torch.compile for the same model inference; the normalized results are plotted in the graph. You can see that for the 45 models we benchmarked, there is a 1.35x latency improvement (geomean for the 45 models).

Image 1: PyTorch model inference performance improvement with torch.compile on AWS Graviton3-based c7g instance using TorchBench framework. The reference eager mode performance is marked as 1.0. (higher is better)

Similar to the preceding TorchBench inference performance graph, we started with measuring the Hugging Face NLP model inference latency, in msec, for the eager mode, which is marked 1.0 with a red dotted line in the following graph. Then we compared the improvements from torch.compile for the same model inference; the normalized results are plotted in the graph. You can see that for the 33 models we benchmarked, there is around 2x performance improvement (geomean for the 33 models).

Image 2: Hugging Face NLP model inference performance improvement with torch.compile on AWS Graviton3-based c7g instance using Hugging Face example scripts. The reference eager mode performance is marked as 1.0. (higher is better)

Running an inference

Starting with PyTorch 2.3.1, the optimizations are available in the torch Python wheel and in AWS Graviton PyTorch DLC. This section shows how to run inference in eager and torch.compile modes using torch Python wheels and benchmarking scripts from Hugging Face and TorchBench repos.

To successfully run the scripts and reproduce the speedup numbers mentioned in this post, you need an instance from the Graviton3 family (c7g/r7g/m7g/hpc7g) of hardware. For this post, we used the c7g.4xl (16 vcpu) instance. The instance, the AMI details, and the required torch library versions are mentioned in the following snippet.

Instance: c7g.4xl instance
Region: us-west-2
AMI: ami-05cc25bfa725a144a (Ubuntu 22.04/Jammy with 6.5.0-1017-aws kernel)

# Install Python
sudo apt-get update
sudo apt-get install -y python3 python3-pip

# Upgrade pip3 to the latest version
python3 -m pip install --upgrade pip

# Install PyTorch and extensions
python3 -m pip install torch==2.3.1 torchvision==0.18.1 torchaudio==2.3.1

The generic runtime tunings implemented for eager mode inference are equally applicable for the torch.compile mode, so, we set the following environment variables to further improve the torch.compile performance on AWS Graviton3 processors.

# Enable the fast math GEMM kernels, to accelerate fp32 inference with bfloat16 gemm
export DNNL_DEFAULT_FPMATH_MODE=BF16

# Enable Linux Transparent Huge Page (THP) allocations,
# to reduce the tensor memory allocation latency
export THP_MEM_ALLOC_ENABLE=1

# Set LRU Cache capacity to cache the primitives and avoid redundant
# memory allocations
export LRU_CACHE_CAPACITY=1024

TorchBench benchmarking scripts

TorchBench is a collection of open source benchmarks used to evaluate PyTorch performance. We benchmarked 45 models using the scripts from the TorchBench repo. The following code shows how to run the scripts for the eager mode and the compile mode with the inductor backend.

# Set OMP_NUM_THREADS to number of vcpus, 16 for c7g.4xl instance
export OMP_NUM_THREADS=16

# Install the dependencies
sudo apt-get install -y libgl1-mesa-glx
sudo apt-get install -y libpangocairo-1.0-0
python3 -m pip install psutil numpy transformers pynvml numba onnx onnxruntime scikit-learn timm effdet gym doctr opencv-python h5py==3.10.0 python-doctr

# Clone pytorch benchmark repo
git clone https://github.com/pytorch/benchmark.git
cd benchmark
# PyTorch benchmark repo doesn't have any release tags. So,
# listing the commit we used for collecting the performance numbers
git checkout 9a5e4137299741e1b6fb7aa7f5a6a853e5dd2295

# Setup the models
python3 install.py

# Collect eager mode performance using the following command. The results will be
# stored at .userbenchmark/cpu/metric-<timestamp>.json.
python3 run_benchmark.py cpu --model BERT_pytorch,hf_Bert,hf_Bert_large,hf_GPT2,hf_Albert,hf_Bart,hf_BigBird,hf_DistilBert,hf_GPT2_large,dlrm,hf_T5,mnasnet1_0,mobilenet_v2,mobilenet_v3_large,squeezenet1_1,timm_efficientnet,shufflenet_v2_x1_0,timm_regnet,resnet50,soft_actor_critic,phlippe_densenet,resnet152,resnet18,resnext50_32x4d,densenet121,phlippe_resnet,doctr_det_predictor,timm_vovnet,alexnet,doctr_reco_predictor,vgg16,dcgan,yolov3,pytorch_stargan,hf_Longformer,timm_nfnet,timm_vision_transformer,timm_vision_transformer_large,nvidia_deeprecommender,demucs,tts_angular,hf_Reformer,pytorch_CycleGAN_and_pix2pix,functorch_dp_cifar10,pytorch_unet --test eval --metrics="latencies,cpu_peak_mem"

# Collect torch.compile mode performance with inductor backend
# and weights pre-packing enabled. The results will be stored at
# .userbenchmark/cpu/metric-<timestamp>.json
python3 run_benchmark.py cpu --model BERT_pytorch,hf_Bert,hf_Bert_large,hf_GPT2,hf_Albert,hf_Bart,hf_BigBird,hf_DistilBert,hf_GPT2_large,dlrm,hf_T5,mnasnet1_0,mobilenet_v2,mobilenet_v3_large,squeezenet1_1,timm_efficientnet,shufflenet_v2_x1_0,timm_regnet,resnet50,soft_actor_critic,phlippe_densenet,resnet152,resnet18,resnext50_32x4d,densenet121,phlippe_resnet,doctr_det_predictor,timm_vovnet,alexnet,doctr_reco_predictor,vgg16,dcgan,yolov3,pytorch_stargan,hf_Longformer,timm_nfnet,timm_vision_transformer,timm_vision_transformer_large,nvidia_deeprecommender,demucs,tts_angular,hf_Reformer,pytorch_CycleGAN_and_pix2pix,functorch_dp_cifar10,pytorch_unet --test eval --torchdynamo inductor --freeze_prepack_weights --metrics="latencies,cpu_peak_mem"

On successful completion of the inference runs, the script stores the results in JSON format. The following is the sample output:

{
    "name": "cpu",
    "environ": {
        "pytorch_git_version": "d44533f9d073df13895333e70b66f81c513c1889"
    },
    "metrics": {
        "BERT_pytorch-eval_latency": 56.3769865,
        "BERT_pytorch-eval_cmem": 0.4169921875
    }
}
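
If you want to compare the two runs programmatically, a short script along the following lines can compute per-model speedups from the metric files. The file names here are placeholders; point them at your actual .userbenchmark/cpu/metric-<timestamp>.json outputs from the eager and compile runs.

import json

# Placeholder file names; replace with your actual metric JSON outputs
with open("eager_metrics.json") as f:
    eager = json.load(f)["metrics"]
with open("compile_metrics.json") as f:
    compiled = json.load(f)["metrics"]

# Compare latencies for the models present in both runs
for key, eager_latency in eager.items():
    if key.endswith("-eval_latency") and key in compiled:
        speedup = eager_latency / compiled[key]
        print(f"{key.removesuffix('-eval_latency')}: {speedup:.2f}x")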

Hugging Face benchmarking scripts

The Google T5 Small text translation model is one of the roughly 30 Hugging Face models we benchmarked. We're using it as a sample model to demonstrate how to run inference in eager and compile modes. The additional configurations and APIs required to run it in compile mode are the torch._inductor.config settings and the torch.compile call in the script. Save the following script as google_t5_small_text_translation.py.

import argparse
from transformers import T5Tokenizer, T5Model
import torch
from torch.profiler import profile, record_function, ProfilerActivity
import torch._inductor.config as config
config.cpp.weight_prepack = True
config.freezing = True

def test_inference(mode, num_iter):
    tokenizer = T5Tokenizer.from_pretrained("t5-small")
    model = T5Model.from_pretrained("t5-small")

    input_ids = tokenizer(
        "Studies have been shown that owning a dog is good for you", return_tensors="pt"
    ).input_ids  # Batch size 1
    decoder_input_ids = tokenizer("Studies show that", return_tensors="pt").input_ids  # Batch size 1

    if mode == 'compile':
        model = torch.compile(model)

    with torch.no_grad():
        for _ in range(50):
            outputs = model(input_ids=input_ids, decoder_input_ids=decoder_input_ids)

        with profile(activities=[ProfilerActivity.CPU]) as prof:
            with record_function("model_inference"):
                for _ in range(num_iter):
                    outputs = model(input_ids=input_ids, decoder_input_ids=decoder_input_ids)

    print(prof.key_averages().table(sort_by="self_cpu_time_total"))

def main() -> None:
    global m, args
    parser = argparse.ArgumentParser(__doc__)
    parser.add_argument(
        "-m",
        "--mode",
        choices=["eager", "compile"],
        default="eager",
        help="Which test to run.",
    )
    parser.add_argument(
        "-n",
        "--number",
        type=int,
        default=100,
        help="how many iterations to run.",
    )
    args = parser.parse_args()
    test_inference(args.mode, args.number)

if __name__ == "__main__":
    main()

Run the script with the following steps.

# Set OMP_NUM_THREADS to 4 because the scripts run
# inference sequentially and don't need a large
# number of vcpus
export OMP_NUM_THREADS=4

# Install the dependencies
python3 -m pip install transformers

# Run the inference script in Eager mode
# using number of iterations as 1 just to show the torch profiler output
# but for the benchmarking, we used 1000 iterations.
python3 google_t5_small_text_translation.py -n 1 -m eager

# Run the inference script in torch compile mode
python3 google_t5_small_text_translation.py -n 1 -m compile

On successful completion of the inference runs, the script prints the torch profiler output with the latency breakdown for the torch operators. The following is the sample output from torch profiler:


# Torch profiler output for the eager mode run on c7g.xl (4vcpu)
---------------    ------------  -----------  ------------  -----------  ------------  ------------
Name                 Self CPU %   Self CPU     CPU total %   CPU total   CPU time avg    # of Calls
---------------    ------------  -----------  ------------  -----------  ------------  ------------
aten::mm            40.71%         12.502ms       40.71%      12.502ms     130.229us            96
model_inference     26.44%         8.118ms       100.00%      30.708ms      30.708ms             1
aten::bmm            6.85%         2.102ms         9.47%       2.908ms      80.778us            36
aten::matmul         3.73%         1.146ms        57.26%      17.583ms     133.205us           132
aten::select         1.88%       576.000us         1.90%     583.000us       0.998us           584
aten::transpose      1.51%       464.000us         1.83%     563.000us       3.027us           186
---------------    ------------  -----------  ------------  -----------  ------------  -------------
Self CPU time total: 30.708ms

# Torch profiler output for the compile mode run for the same model on the same instance
------------------------- ----------  -----------  ------------  ------------  ------------  ------------
Name                      Self CPU %    Self CPU    CPU total %    CPU total   CPU time avg   # of Calls
------------------------- ----------  -----------  ------------  ------------  ------------  ------------
mkldnn::_linear_pointwise   37.98%       5.461ms        45.91%       6.602ms      68.771us            96
Torch-Compiled Region       29.56%       4.251ms        98.53%      14.168ms      14.168ms             1
aten::bmm                   14.90%       2.143ms        21.73%       3.124ms      86.778us            36
aten::select                 4.51%     648.000us         4.62%     665.000us       1.155us           576
aten::view                   3.29%     473.000us         3.29%     473.000us       1.642us           288
aten::empty                  2.53%     364.000us         2.53%     364.000us       3.165us           115
-------------------------  ---------  -----------  ------------  ------------  ------------ -------------
Self CPU time total: 14.379ms

What’s next

Next, we’re extending the torch inductor CPU backend support to compile the Llama model, and adding support for fused GEMM kernels to enable the torch inductor operator fusion optimization on AWS Graviton3 processors.

Conclusion

In this tutorial, we covered how we optimized torch.compile performance on AWS Graviton3-based EC2 instances, how to use the optimizations to improve PyTorch model inference performance, and demonstrated the resulting speedups. We hope that you will give it a try! If you need any support with ML software on Graviton, please open an issue on the AWS Graviton Technical Guide GitHub.


About the Author

Sunita Nadampalli is a Software Development Manager and AI/ML expert at AWS. She leads AWS Graviton software performance optimizations for AI/ML and HPC workloads. She is passionate about open source software development and delivering high-performance and sustainable software solutions for SoCs based on the Arm ISA.


Access control for vector stores using metadata filtering with Knowledge Bases for Amazon Bedrock

Access control for vector stores using metadata filtering with Knowledge Bases for Amazon Bedrock

In November 2023, we announced Knowledge Bases for Amazon Bedrock as generally available.

Knowledge bases allow Amazon Bedrock users to unlock the full potential of Retrieval Augmented Generation (RAG) by seamlessly integrating their company data into the language model’s generation process. This feature allows organizations to harness the power of large language models (LLMs) while making sure that the generated responses are tailored to their specific domain knowledge, regulations, and business requirements. By incorporating their unique data sources, such as internal documentation, product catalogs, or transcribed media, organizations can enhance the relevance, accuracy, and contextual awareness of the language model’s outputs.

Knowledge bases effectively bridge the gap between the broad knowledge encapsulated within foundation models and the specialized, domain-specific information that businesses possess, enabling a truly customized and valuable generative artificial intelligence (AI) experience.

With metadata filtering now available in Knowledge Bases for Amazon Bedrock, you can define and use metadata fields to filter the source data used for retrieving relevant context during RAG. For example, if your data contains documents from different products, departments, or time periods, you can use metadata filtering to limit retrieval to only the most relevant subset of data for a given query or conversation. This helps improve the relevance and quality of retrieved context while reducing potential hallucinations or noise from irrelevant data. Metadata filtering gives you more control over the RAG process for better results tailored to your specific use case needs.

In this post, we discuss how to implement metadata filtering within Knowledge Bases for Amazon Bedrock by implementing access control and ensuring data privacy and security in RAG applications.

Access control with metadata filters

Metadata filtering in knowledge bases enables access control for your data. By defining metadata fields based on attributes such as user roles, departments, or data sensitivity levels, you can ensure that the retrieval only fetches and uses information that a particular user or application is authorized to access. This helps maintain data privacy and security, preventing sensitive or restricted information from being inadvertently surfaced or used in generated responses. With this access control capability, you can safely use retrieval across different user groups or scenarios while complying with company specific data governance policies and regulations.

During retrieval of contextually relevant chunks, metadata filters add an additional layer of selection to those vectors that are returned to the LLM for response generation. In addition, metadata filtering requires fewer computation resources, thereby improving the overall performance and reducing costs associated with the search.

Let’s explore some practical applications of metadata filtering in Knowledge Bases for Amazon Bedrock. Here are a few examples and use cases across different domains:

  • A company uses a chatbot to help HR personnel navigate employee files. There is sensitive information present in the documents and only certain employees should be able to have access and converse with them. With metadata filters on access IDs, a user can only chat with documents that have metadata associated with their access ID. The access ID associated with their authentication when the chat is initiated can be passed as a filter.
  • A business-to-business (B2B) platform is developed for companies to allow their end-users to access all their uploaded documents, search over them conversationally, and complete various tasks using those documents. To ensure that end-users can only chat with their data, metadata filters on user access tokens—such as those obtained through an authentication service—can enable secure access to their information. This provides customers with peace of mind while maintaining compliance with various data security standards.
  • A work organization application has a conversational search feature. Documents, kanban boards, meeting recording transcripts, and other assets can be searched with more intent and more granular control. The app uses single sign-on (SSO) functionality that allows users to access company-wide resources and other services while following the company’s data-level access protocol. With metadata filters on work groups and a privilege level (for example, Limited, Standard, or Admin) derived from their SSO authentication, you can enforce data security while personalizing the chat experience to streamline a user’s work and collaboration with others.

Access control with metadata filtering in the healthcare domain

To demonstrate the access-control capabilities enabled by metadata filtering in knowledge bases, let’s consider a use case where a healthcare provider has a knowledge base that contains transcripts of conversations between doctors and patients. In this scenario, it is crucial that each doctor can only access transcripts from their own patient interactions during the search, and not have access to transcripts from other doctors’ patient interactions.

By defining a metadata field for patient_id and associating each transcript with the corresponding patient’s identifier, the healthcare provider can implement access control within their search application. When a doctor initiates a conversation, the knowledge base can filter the vector store to retrieve context only from transcripts where the patient_id metadata matches either a specific patient ID or the list of patient IDs associated with the authenticated doctor. This way, the generated responses will be augmented solely with information from that doctor’s past patient interactions, maintaining patient privacy and confidentiality.

This access control approach can be extended to other relevant metadata fields, such as year or department, further refining the subset of data accessible to each user or application. By using metadata filtering in knowledge bases, the healthcare provider can achieve compliance with data governance policies and regulations while enabling doctors to have personalized, contextually relevant conversations tailored to their specific patient histories and needs.
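
As an illustrative sketch (not part of the solution code that follows), a combined filter passed in the vectorSearchConfiguration could look like the following. The operator names follow the Knowledge Bases metadata filter syntax (equals, greaterThanOrEquals, andAll), and year is a hypothetical metadata field used here only for illustration; you can only filter on fields you actually define in your metadata files.

# A sketch of a combined metadata filter; "year" is a hypothetical field
combined_filter = {
    "andAll": [
        {"equals": {"key": "patient_id", "value": 669}},
        {"greaterThanOrEquals": {"key": "year", "value": 2023}},
    ]
}

vector_search_configuration = {
    "numberOfResults": 5,
    "filter": combined_filter,
}

Here, andAll requires every condition to match before a chunk can be returned to the model for response generation.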

Solution overview

Let’s walk through the high-level steps to implement access control with Knowledge Bases for Amazon Bedrock. The following GitHub repository provides a guided notebook that you can follow to deploy this example in your own account.

The following diagram illustrates the solution architecture.

Figure 1: Solution architecture

The workflow for the solution is as follows:

  1. The doctor interacts with the Streamlit frontend, which serves as the application interface. Amazon Cognito handles user authentication and access control, ensuring only authorized doctors can access the application. For production use, it is recommended to use a more robust frontend framework such as AWS Amplify, which provides a comprehensive set of tools and services for building scalable and secure web applications.
  2. After the doctor has successfully signed in, the application retrieves the list of patients associated with the doctor’s ID from the Amazon DynamoDB database. The doctor is then presented with this list of patients, from which they can select one or more patients to filter their search.
  3. When the doctor interacts with the Streamlit frontend, it sends a request to an AWS Lambda function, which acts as the application backend. The request includes the doctor’s ID, a list of patient IDs to filter by, and the text query.
  4. Before querying the knowledge base, the Lambda function retrieves data from the DynamoDB database, which stores doctor-patient associations. This step validates that the doctor is authorized to access the requested patient or list of patient’s information.
  5. If the validation is successful, the Lambda function queries the knowledge base using the provided patient or list of patient’s IDs. The knowledge base is pre-populated with transcript and metadata files stored in Amazon Simple Storage Service (Amazon S3).
  6. The knowledge base returns the relevant results, which are then sent back to the Streamlit application and displayed to the doctor.

User authentication with Amazon Cognito

To implement the access control solution for the healthcare provider use case, you can use Amazon Cognito user pools to manage the authentication and user identities of the doctors.

To start, you will create an Amazon Cognito user pool that will store the doctor user accounts. During the user pool setup, you define the necessary attributes for each doctor, including their name and a unique identifier (sub or custom attribute). For patients, their identifier will be used as the patient_id metadata field. This unique identifier will be associated with each patient’s account and used for metadata filtering in the knowledge base retrieval process.

Figure 2: User information

Doctor and patient association in DynamoDB

To facilitate the access control mechanism based on the doctor-patient relationship, the healthcare provider can create a DynamoDB table to store these associations. This table will act as a centralized repository, allowing efficient retrieval of the patient IDs associated with each authenticated doctor during the knowledge base search process. When a doctor authenticates through Amazon Cognito, their unique identifier can be used to query the doctor_patient_list_associations table and retrieve the list of patient_id values associated with that doctor.

Figure 3: Items retrieved based on the doctor_ID and patient relationships
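
The following minimal sketch shows what that lookup could look like with Boto3, assuming doctor_ID is the table’s partition key and each item stores a patient_id attribute; adjust the table and attribute names to your actual schema.

import boto3
from boto3.dynamodb.conditions import Key

dynamodb = boto3.resource("dynamodb")
# Table and attribute names follow the example in this post; adjust to your schema
table = dynamodb.Table("doctor_patient_list_associations")

def get_patient_ids_for_doctor(doctor_id):
    """Return the patient IDs associated with an authenticated doctor."""
    response = table.query(KeyConditionExpression=Key("doctor_ID").eq(doctor_id))
    return [item["patient_id"] for item in response["Items"]]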

This approach offers flexibility in managing doctor-patient associations. If a doctor’s patient assignments change over time, only the corresponding entries in the DynamoDB table need to be updated. This update does not require modifying the metadata files of the transcripts themselves.

Now that you have your doctor and patients set up with their relationships defined, let’s examine the dataset format required for effective metadata filtering.

Dataset format

When working with Knowledge Bases for Amazon Bedrock, the dataset format plays a crucial role in providing seamless integration and effective metadata filtering. This example uses a series of PDF files containing transcripts of doctor-patient conversations.

These files need to be uploaded to an S3 bucket for processing. To use metadata filtering, you need to create a separate metadata JSON file for each transcript file. The metadata file should share the same name as the corresponding PDF file (including the extension). For instance, if the transcript file is named transcript_001.pdf, the metadata file should be named transcript_001.pdf.metadata.json. This nomenclature is crucial for the knowledge base to identify the metadata for specific files during the ingestion process.

The metadata JSON file will contain key-value pairs representing the relevant metadata fields associated with the transcript. In the healthcare provider use case, the most important metadata field is patient_id, which will be used to implement access control. You assign each transcript to a specific patient by including their unique identifier from the Amazon Cognito user pool in the patient_id field of the metadata file, as in the following example:

{"metadataAttributes": {"patient_id": 669}}

By structuring the dataset with transcript PDF files accompanied by their corresponding metadata JSON files, you can effectively use the metadata filtering capabilities of Knowledge Bases for Amazon Bedrock. This approach enables you to implement access control, so each doctor can only retrieve and use content from their own patient transcripts during the retrieval process. For customers processing thousands of files, automating the generation of the metadata files using Lambda functions or a similar solution could be a more efficient approach to scale.
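
The following is a minimal sketch of such an automation, assuming you already have a mapping from each transcript key to its patient_id (for example, exported from your patient management system). The bucket name and the mapping are placeholders.

import json
import boto3

s3 = boto3.client("s3")
BUCKET = "your-transcripts-bucket"  # placeholder: replace with your bucket

# Placeholder for illustration: a lookup from transcript key to patient_id
patient_id_by_key = {"transcript_001.pdf": 669}

paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=BUCKET):
    for obj in page.get("Contents", []):
        key = obj["Key"]
        if not key.endswith(".pdf") or key not in patient_id_by_key:
            continue
        metadata = {"metadataAttributes": {"patient_id": patient_id_by_key[key]}}
        # The metadata file must share the transcript's name plus .metadata.json
        s3.put_object(
            Bucket=BUCKET,
            Key=f"{key}.metadata.json",
            Body=json.dumps(metadata).encode("utf-8"),
        )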

Knowledge base creation

With the dataset properly structured and organized, you can now create the knowledge base in Amazon Bedrock. The process is straightforward, thanks to the user-friendly interface and step-by-step guidance provided by the AWS Management Console. See Knowledge Bases now delivers fully managed RAG experience in Amazon Bedrock for instructions to create a new knowledge base, upload your dataset, and configure the necessary settings to achieve optimal performance. Alternatively, you can create a knowledge base using the AWS SDK, API, or AWS CloudFormation template, which provides programmatic and automated ways to set up and manage your knowledge bases.

Figure 4: Using the console to create a knowledge base

After you create the knowledge base and sync it with your dataset, you can immediately experience the power of metadata filtering.

In the test pane, navigate to the settings section and locate the filters option. Here, you can define specific filter conditions by specifying the patient_id field along with the unique IDs or list of identifiers of the patients you wish to test. By applying this filter, the retrieval process will fetch and incorporate only the relevant context from transcripts associated with the specified patient or patients. This filter-based retrieval approach means that the generated responses are tailored to the doctor’s individual patient interactions, maintaining data privacy and confidentiality.

Figure 5: Knowledge Bases console test configuration panel

Figure 6: Knowledge Bases console test panel

Querying the knowledge base programmatically

You have seen how to implement access control with metadata filtering through the console, but what if you want to integrate knowledge bases directly into your applications? AWS provides SDKs that allow you to programmatically interact with Amazon Bedrock features, including knowledge bases.

The following code snippet demonstrates how to call the retrieve_and_generate API using the Boto3 library in Python. It includes metadata filtering capabilities within the vectorSearchConfiguration, where you can now add filter conditions. For this specific use case, you first need to retrieve the list of patient_ids associated with a doctor from the DynamoDB table. This allows you to filter the search results based on the authenticated user’s identity.

import boto3
import json

bedrock_agent = boto3.client('bedrock-agent-runtime')

# Retrieve and generate API

response = bedrock_agent.retrieve_and_generate(
    input={
        "text": "Who is Kelly?"
    },
    retrieveAndGenerateConfiguration={
        "type": "KNOWLEDGE_BASE",
        "knowledgeBaseConfiguration": {
             'knowledgeBaseId': <<KnowledgeBase id>>,
            "modelArn": "arn:aws:bedrock:us-west-2::foundation-model/anthropic.claude-v2:1",
            "retrievalConfiguration": {
                "vectorSearchConfiguration": {
                    "numberOfResults":5,
                    "filter": {
                        "in": {
                            "key": "patient_id",
                            "value": <<patient_ids>> # Amazon Cognito Id once the doctor is authenticated.
                        }
                    }
                } 
            }
        }
    }
)

print(response['output']['text'], end='\n' * 2)

You can create a Lambda function that serves as the backend for the application. This Lambda function uses the Boto3 library to interact with Amazon Bedrock, specifically to retrieve relevant information from the knowledge base using the retrieve_and_generate API.
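
The following is a minimal sketch of what such a Lambda handler could look like, assuming an API Gateway proxy integration and a doctor-to-patient lookup like the one sketched earlier. The knowledge base ID, model ARN, and table and attribute names are placeholders rather than the exact implementation used in this solution.

import json
import boto3
from boto3.dynamodb.conditions import Key

bedrock_agent = boto3.client("bedrock-agent-runtime")
dynamodb = boto3.resource("dynamodb")

# Placeholders: replace with your knowledge base ID, model ARN, and table name
KNOWLEDGE_BASE_ID = "<knowledge-base-id>"
MODEL_ARN = "arn:aws:bedrock:us-west-2::foundation-model/anthropic.claude-v2:1"
TABLE = dynamodb.Table("doctor_patient_list_associations")

def lambda_handler(event, context):
    body = json.loads(event["body"])  # assumes an API Gateway proxy integration
    doctor_id = body["doctor_id"]
    requested_ids = body["patient_ids"]

    # Validate that the doctor is authorized for every requested patient
    items = TABLE.query(KeyConditionExpression=Key("doctor_ID").eq(doctor_id))["Items"]
    allowed_ids = {int(item["patient_id"]) for item in items}
    if not {int(p) for p in requested_ids}.issubset(allowed_ids):
        return {"statusCode": 403, "body": json.dumps({"error": "Not authorized"})}

    # Query the knowledge base, filtering retrieval to the authorized patients
    response = bedrock_agent.retrieve_and_generate(
        input={"text": body["query"]},
        retrieveAndGenerateConfiguration={
            "type": "KNOWLEDGE_BASE",
            "knowledgeBaseConfiguration": {
                "knowledgeBaseId": KNOWLEDGE_BASE_ID,
                "modelArn": MODEL_ARN,
                "retrievalConfiguration": {
                    "vectorSearchConfiguration": {
                        "numberOfResults": 5,
                        "filter": {"in": {"key": "patient_id", "value": requested_ids}},
                    }
                },
            },
        },
    )
    return {"statusCode": 200, "body": json.dumps({"answer": response["output"]["text"]})}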

Now that the architectural components are in place, you can create a visual interface to display the results.

Streamlit sample app

To showcase the interaction between doctors and the knowledge base, we developed a user-friendly web application using Streamlit, a popular open source Python library for building interactive data apps. Streamlit provides a simple and intuitive way to create custom interfaces that can seamlessly integrate with the various AWS services involved in this solution.

The Streamlit application acts as the frontend for doctors to initiate conversations and interact with the knowledge base. It uses Amazon Cognito for user authentication, so only authorized doctors can access the application and the corresponding patient data. Upon successful authentication, the application interacts with Lambda to handle the RAG workflow using the Amazon Cognito user ID.

Figure 7: Demo

Clean up

It’s important to clean up and delete the resources created during this solution deployment to avoid unnecessary costs. In the provided GitHub repository, you’ll find a section at the end of the notebook dedicated to deleting all the resources created as part of this solution to ensure that you don’t incur any ongoing charges for resources that are no longer needed.

Conclusion

This post has demonstrated the powerful capabilities of metadata filtering within Knowledge Bases for Amazon Bedrock by implementing access control and ensuring data privacy and security in RAG applications. By using metadata fields, organizations can precisely control the subset of data accessible to different users or applications during the RAG process while also improving the relevancy and performance of the search.

Get started with Knowledge Bases for Amazon Bedrock, and let us know your thoughts in the comments section.


About the Authors

Dani Mitchell is a Generative AI Specialist Solutions Architect at Amazon Web Services. He is focused on computer vision use cases and helping customers across EMEA accelerate their ML journey.

Chris Pecora is a Generative AI Data Scientist at Amazon Web Services. He is passionate about building innovative products and solutions while also focused on customer-obsessed science. When not running experiments and keeping up with the latest developments in generative AI, he loves spending time with his kids.

Kshitiz Agarwal is an Engineering Leader at Amazon Web Services (AWS), where he leads the development of Knowledge Bases for Amazon Bedrock. With a decade of experience at Amazon, having joined in 2012, Kshitiz has gained deep insights into the cloud computing landscape. His passion lies in engaging with customers and understanding the innovative ways they leverage AWS to drive their business success. Through his work, Kshitiz aims to contribute to the continuous improvement of AWS services, enabling customers to unlock the full potential of the cloud.


Accenture creates a custom memory-persistent conversational user experience using Amazon Q Business

Accenture creates a custom memory-persistent conversational user experience using Amazon Q Business

Traditionally, finding relevant information from documents has been a time-consuming and often frustrating process. Manually sifting through pages upon pages of text, searching for specific details, and synthesizing the information into coherent summaries can be a daunting task. This inefficiency not only hinders productivity but also increases the risk of overlooking critical insights buried within the document’s depths.

Imagine a scenario where a call center agent needs to quickly analyze multiple documents to provide summaries for clients. Previously, this process would involve painstakingly navigating through each document, a task that is both time-consuming and prone to human error.

With the advent of chatbots in the conversational artificial intelligence (AI) domain, you can now upload your documents through an intuitive interface and initiate a conversation by asking specific questions related to your inquiries. The chatbot then analyzes the uploaded documents, using advanced natural language processing (NLP) and machine learning (ML) technologies to provide comprehensive summaries tailored to your questions.

However, the true power lies in the chatbot’s ability to preserve context throughout the conversation. As you navigate through the discussion, the chatbot should maintain a memory of previous interactions, allowing you to review past discussions and retrieve specific details as needed. This seamless experience makes sure you can effortlessly explore the depths of your documents without losing track of the conversation’s flow.

Amazon Q Business is a generative AI-powered assistant that can answer questions, provide summaries, generate content, and securely complete tasks based on data and information in your enterprise systems. It empowers employees to be more creative, data-driven, efficient, prepared, and productive.

This post demonstrates how Accenture used Amazon Q Business to implement a chatbot application that offers straightforward attachment and conversation ID management. This solution can speed up your development workflow, and you can use it without crowding your application code.

“Amazon Q Business distinguishes itself by delivering personalized AI assistance through seamless integration with diverse data sources. It offers accurate, context-specific responses, contrasting with foundation models that typically require complex setup for similar levels of personalization. Amazon Q Business real-time, tailored solutions drive enhanced decision-making and operational efficiency in enterprise settings, making it superior for immediate, actionable insights”

– Dominik Juran, Cloud Architect, Accenture

Solution overview

In this use case, an insurance provider uses a Retrieval Augmented Generation (RAG) based large language model (LLM) implementation to upload and compare policy documents efficiently. Policy documents are preprocessed and stored, allowing the system to retrieve relevant sections based on input queries. This enhances the accuracy, transparency, and speed of policy comparison, making sure clients receive the best coverage options.

This solution augments an Amazon Q Business application with persistent memory and context tracking throughout conversations. As users pose follow-up questions, Amazon Q Business can continually refine responses while recalling previous interactions. This preserves conversational flow when navigating in-depth inquiries.

At the core of this use case lies the creation of a custom Python class for Amazon Q Business, which streamlines the development workflow for this solution. This class offers robust document management capabilities, keeping track of attachments already shared within a conversation as well as new uploads to the Streamlit application. Additionally, it maintains an internal state to persist conversation IDs for future interactions, providing a seamless user experience.

The solution involves developing a web application using Streamlit, Python, and AWS services, featuring a chat interface where users can interact with an AI assistant to ask questions or upload PDF documents for analysis. Behind the scenes, the application uses Amazon Q Business for conversation history management, vectorizing the knowledge base, context creation, and NLP. The integration of these technologies allows for seamless communication between the user and the AI assistant, enabling tasks such as document summarization, question answering, and comparison of multiple documents based on the documents attached in real time.

The code uses Amazon Q Business APIs to interact with Amazon Q Business and send and receive messages within a conversation, specifically the qbusiness client from the boto3 library.

In this use case, we used the German language to test our RAG LLM implementation on 10 different documents and 10 different use cases. Policy documents were preprocessed and stored, enabling accurate retrieval of relevant sections based on input queries. This testing demonstrated the system’s accuracy and effectiveness in handling German language policy comparisons.

The following is a code snippet:

import boto3
import json
from botocore.exceptions import ClientError
from os import environ

class AmazonQHandler:
    def __init__(self, application_id, user_id, conversation_id, system_message_id):
        self.application_id = application_id
        self.user_id = user_id
        self.qbusiness = boto3.client('qbusiness')
        self.prompt_engineering_instruction = "Ansage: Auf Deutsch, und nur mit den nötigsten Wörter ohne ganze Sätze antworten, bitte"
        self.parent_message_id = system_message_id
        self.conversation_id = conversation_id

    def process_message(self, initial_message, input_text):
        print('Please ask as many questions as you want. At the end of the session write exit\n')
        
        message = f'{self.prompt_engineering_instruction}: {input_text}'
            
        return message

    def send_message(self, input_text, uploaded_file_names=[]):
        attachments = []
        message = f'{self.prompt_engineering_instruction}: {input_text}'
        if len(uploaded_file_names) > 0:
            for file_name in uploaded_file_names:
                in_file = open(file_name, "rb")
                data = in_file.read()
                attachments.append({
                    'data': data,
                    'name': file_name
                })

        if self.conversation_id:
            print("we are in if part of send_message")
            if len(attachments) > 0:
                resp = self.qbusiness.chat_sync(
                    applicationId=self.application_id,
                    userId=self.user_id,
                    userMessage=message,
                    conversationId=self.conversation_id,
                    parentMessageId=self.parent_message_id,
                    attachments=attachments,
                )
            else:
                resp = self.qbusiness.chat_sync(
                    applicationId=self.application_id,
                    userId=self.user_id,
                    userMessage=message,
                    conversationId=self.conversation_id,
                    parentMessageId=self.parent_message_id,
                )
        else:
            if len(attachments) > 0:
                resp = self.qbusiness.chat_sync(
                    applicationId=self.application_id,
                    userId=self.user_id,
                    userMessage=message,
                    attachments=attachments,
                )
            else: 
                resp = self.qbusiness.chat_sync(
                    applicationId=self.application_id,
                    userId=self.user_id,
                    userMessage=message,
                )
            self.conversation_id = resp.get("conversationId")

        print(f'Amazon Q: "{resp.get("systemMessage")}"\n')
        print(json.dumps(resp))
        self.parent_message_id = resp.get("systemMessageId")
        return resp.get("systemMessage")

if __name__ == '__main__':
    application_id = environ.get("APPLICATION_ID", "a392f5e9-50ed-4f93-bcad-6f8a26a8212d")
    user_id = environ.get("USER_ID", "AmazonQ-Administrator")

    amazon_q_handler = AmazonQHandler(application_id, user_id, None, None)
    amazon_q_handler.process_message(None, '')
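
To show how such a wrapper might be wired into the Streamlit frontend, the following is a minimal sketch; the widget layout, module name, and session-state keys are illustrative rather than the exact application code.

import os
import streamlit as st
# Assumes the AmazonQHandler class above is saved as amazon_q_handler.py
from amazon_q_handler import AmazonQHandler

st.title("Document assistant")

# Keep one handler per browser session so the conversation ID persists
if "q_handler" not in st.session_state:
    st.session_state.q_handler = AmazonQHandler(
        application_id=os.environ["APPLICATION_ID"],
        user_id=os.environ["USER_ID"],
        conversation_id=None,
        system_message_id=None,
    )

uploaded_files = st.file_uploader("Attach PDFs", type="pdf", accept_multiple_files=True)
file_names = []
for uploaded in uploaded_files or []:
    # Persist uploads to disk because the wrapper reads attachments by file name
    with open(uploaded.name, "wb") as f:
        f.write(uploaded.getbuffer())
    file_names.append(uploaded.name)

if question := st.chat_input("Ask a question about your documents"):
    answer = st.session_state.q_handler.send_message(question, file_names)
    with st.chat_message("assistant"):
        st.write(answer)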

The architectural flow of this solution is shown in the following diagram.

Q business

The workflow consists of the following steps:

  1. The LLM wrapper application code is containerized using AWS CodePipeline, a fully managed continuous delivery service that automates the build, test, and deploy phases of the software release process.
  2. The application is deployed to Amazon Elastic Container Service (Amazon ECS), a highly scalable and reliable container orchestration service that provides optimal resource utilization and high availability. Because we were making the calls from a Flask-based ECS task running Streamlit to Amazon Q Business, we used Amazon Cognito user pools rather than AWS IAM Identity Center to authenticate users for simplicity, and we hadn’t experimented with IAM Identity Center on Amazon Q Business at the time. For instructions to set up IAM Identity Center integration with Amazon Q Business, refer to Setting up Amazon Q Business with IAM Identity Center as identity provider.
  3. Users authenticate through an Amazon Cognito UI, a secure user directory that scales to millions of users and integrates with various identity providers.
  4. A Streamlit application running on Amazon ECS receives the authenticated user’s request.
  5. An instance of the custom AmazonQ class is initiated. If an ongoing Amazon Q Business conversation is present, the correct conversation ID is persisted, providing continuity. If no existing conversation is found, a new conversation is initiated.
  6. Documents attached to the Streamlit state are passed to the instance of the AmazonQ class, which keeps track of the delta between the documents already attached to the conversation ID and the documents yet to be shared (a minimal sketch of this delta tracking follows the list). This approach respects and optimizes the five-attachment limit imposed by Amazon Q Business. To simplify and avoid repetitions in the middleware library code we are maintaining on the Streamlit application, we decided to write a custom wrapper class for the Amazon Q Business calls, which keeps attachment and conversation history management internal as class variables (as opposed to state-based management on the Streamlit level).
  7. Our wrapper Python class encapsulating the Amazon Q Business instance parses and returns the answers based on the conversation ID and the dynamically provided context derived from the user’s question.
  8. Amazon ECS serves the answer to the authenticated user, providing a secure and scalable delivery of the response.
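
The following is a minimal sketch of the attachment bookkeeping described in step 6; the class and method names are illustrative rather than the actual wrapper implementation.

class AttachmentTracker:
    """Minimal sketch of the attachment delta tracking described in step 6.

    Amazon Q Business accepts a limited number of attachments per request,
    so only files not yet shared in the conversation are sent again.
    """

    MAX_ATTACHMENTS_PER_REQUEST = 5

    def __init__(self):
        self.sent_file_names = set()

    def new_attachments(self, uploaded_file_names):
        # Delta between what the Streamlit app holds and what was already sent
        delta = [name for name in uploaded_file_names if name not in self.sent_file_names]
        batch = delta[: self.MAX_ATTACHMENTS_PER_REQUEST]
        self.sent_file_names.update(batch)
        return batch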

Prerequisites

This solution has the following prerequisites:

  • You must have an AWS account where you will be able to create access keys and configure services like Amazon Simple Storage Service (Amazon S3) and Amazon Q Business
  • Python must be installed in the environment, along with the necessary libraries such as boto3
  • It is assumed that you have the Streamlit library for Python installed, along with all the necessary settings

Deploy the solution

The deployment process entails provisioning the required AWS infrastructure, configuring environment variables, and deploying the application code. This is accomplished by using AWS services such as CodePipeline and Amazon ECS for container orchestration and Amazon Q Business for NLP.

Additionally, Amazon Cognito is integrated with Amazon ECS using the AWS Cloud Development Kit (AWS CDK) and user pools are used for user authentication and management. After deployment, you can access the application through a web browser. Amazon Q Business is called from the ECS task. It is crucial to establish proper access permissions and security measures to safeguard user data and uphold the application’s integrity.

We use AWS CDK to deploy a web application using Amazon ECS with AWS Fargate, Amazon Cognito for user authentication, and AWS Certificate Manager for SSL/TLS certificates.

To deploy the infrastructure, run the following commands:

  • npm install to install dependencies
  • npm run build to build the TypeScript code
  • npx cdk synth to synthesize the AWS CloudFormation template
  • npx cdk deploy to deploy the infrastructure

The following screenshot shows our deployed CloudFormation stack.

UI demonstration

The following screenshot shows the home page when a user opens the application in a web browser.

The following screenshot shows an example response from Amazon Q Business when no file was uploaded and no relevant answer to the question was found.

The following screenshot illustrates the entire application flow, where the user asked a question before a file was uploaded, then uploaded a file, and asked the same question again. The response from Amazon Q Business after uploading the file is different from the first query (for testing purposes, we used a very simple file with randomly generated text in PDF format).

Solution benefits

This solution offers the following benefits:

  • Efficiency – Automation enhances productivity by streamlining document analysis, saving time, and optimizing resources
  • Accuracy – Advanced techniques provide precise data extraction and interpretation, reducing errors and improving reliability
  • User-friendly experience – The intuitive interface and conversational design make it accessible to all users, encouraging adoption and straightforward integration into workflows

This containerized architecture allows the solution to scale seamlessly while optimizing request throughput. Persisting the conversation state enhances precision by continuously expanding dialog context. Overall, this solution can help you balance performance with the fidelity of a persistent, context-aware AI assistant through Amazon Q Business.

Clean up

After deployment, you should implement a thorough cleanup plan to maintain efficient resource management and mitigate unnecessary costs, particularly concerning the AWS services used in the deployment process. This plan should include the following key steps:

  • Delete AWS resources – Identify and delete any unused AWS resources, such as EC2 instances, ECS clusters, and other infrastructure provisioned for the application deployment. This can be achieved through the AWS Management Console or AWS Command Line Interface (AWS CLI).
  • Delete CodeCommit repositories – Remove any CodeCommit repositories created for storing the application’s source code. This helps declutter the repository list and prevents additional charges for unused repositories.
  • Review and adjust CodePipeline configuration – Review the configuration of CodePipeline and make sure there are no active pipelines associated with the deployed application. If pipelines are no longer required, consider deleting them to prevent unnecessary runs and associated costs.
  • Evaluate Amazon Cognito user pools – Evaluate the user pools configured in Amazon Cognito and remove any unnecessary pools or configurations. Adjust the settings to optimize costs and adhere to the application’s user management requirements.

By diligently implementing these cleanup procedures, you can effectively minimize expenses, optimize resource usage, and maintain a tidy environment for future development iterations or deployments. Additionally, regular review and adjustment of AWS services and configurations is recommended to provide ongoing cost-effectiveness and operational efficiency.

If the solution runs in AWS Amplify or is provisioned by the AWS CDK, you don’t need to remove everything described in this section; deleting the Amplify application or AWS CDK stack is enough to remove all of the resources associated with the application.

Conclusion

In this post, we showcased how Accenture created a custom memory-persistent conversational assistant using AWS generative AI services. The solution can cater to clients developing end-to-end conversational persistent chatbot applications at a large scale following the provided architectural practices and guidelines.

The joint effort between Accenture and AWS builds on the 15-year strategic relationship between the companies and uses the same proven mechanisms and accelerators built by the Accenture AWS Business Group (AABG). Connect with the AABG team at accentureaws@amazon.com to drive business outcomes by transforming to an intelligent data enterprise on AWS.

For further information about generative AI on AWS using Amazon Bedrock or Amazon Q Business, we recommend the following resources:

You can also sign up for the AWS generative AI newsletter, which includes educational resources, blog posts, and service updates.


About the Authors

Dominik Juran works as a full stack developer at Accenture with a focus on AWS technologies and AI. He also has a passion for ice hockey.

Milica Bozic works as a Cloud Engineer at Accenture, specializing in AWS Cloud solutions for the specific needs of clients with a background in telecommunications, particularly 4G and 5G technologies. Mili is passionate about art, books, and movement training, finding inspiration in creative expression and physical activity.

Zdenko Estok works as a cloud architect and DevOps engineer at Accenture. He works with AABG to develop and implement innovative cloud solutions, and specializes in infrastructure as code and cloud security. Zdenko likes to bike to the office and enjoys pleasant walks in nature.

Selimcan “Can” Sakar is a cloud first developer and solution architect at Accenture with a focus on artificial intelligence and a passion for watching models converge.

Shikhar Kwatra is a Sr. AI/ML Specialist Solutions Architect at Amazon Web Services, working with leading Global System Integrators. He has earned the title of one of the Youngest Indian Master Inventors with over 500 patents in the AI/ML and IoT domains. Shikhar aids in architecting, building, and maintaining cost-efficient, scalable cloud environments for the organization, and supports the GSI partner in building strategic industry solutions on AWS. Shikhar enjoys playing guitar, composing music, and practicing mindfulness in his spare time.


Create an end-to-end serverless digital assistant for semantic search with Amazon Bedrock

Create an end-to-end serverless digital assistant for semantic search with Amazon Bedrock

With the rise of generative artificial intelligence (AI), an increasing number of organizations use digital assistants to have their end-users ask domain-specific questions, using Retrieval Augmented Generation (RAG) over their enterprise data sources.

As organizations transition from proofs of concept to production workloads, they establish objectives to run and scale their workloads with minimal operational overhead, while optimizing on costs. Organizations also require the implementation of common security practices such as identity and access management, to make sure that only authorized and authenticated users are allowed to perform specific actions or access specific resources.

This post covers a solution to create an end-to-end digital assistant as a web application using a serverless architecture to address these requirements. Because the solution components primarily use serverless technologies, it provides several benefits, such as automatic scaling, built-in high availability, and a pay-per-use billing model to optimize on costs. The solution also includes an authentication layer and an authorization layer to manage identities and permissions.

This solution also uses the hybrid search feature of Knowledge Bases for Amazon Bedrock to increase the relevancy of retrieved results using RAG. When receiving a query from an end-user, hybrid search performs both a semantic search and a keyword search:

  • A semantic search provides results based on the meaning and intent within the query
  • A keyword search provides results based on specific entities in a query such as product codes or acronyms

For example, if a user submits a prompt that includes keywords, a text-based search may provide better results than a semantic search. This is why hybrid search combines the two approaches: the precision of semantic search and coverage of keywords. For more information about hybrid search, see Knowledge Bases for Amazon Bedrock now supports hybrid search.
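
When querying the knowledge base programmatically, you can also request hybrid search explicitly. The following minimal sketch uses the Retrieve API with the overrideSearchType setting; the knowledge base ID and query text are placeholders.

import boto3

bedrock_agent = boto3.client("bedrock-agent-runtime")

# Placeholder knowledge base ID and query
response = bedrock_agent.retrieve(
    knowledgeBaseId="<knowledge-base-id>",
    retrievalQuery={"text": "What is the coverage limit for product code AB-123?"},
    retrievalConfiguration={
        "vectorSearchConfiguration": {
            "numberOfResults": 5,
            # Request hybrid (semantic + keyword) search instead of the default
            "overrideSearchType": "HYBRID",
        }
    },
)

for result in response["retrievalResults"]:
    print(result["content"]["text"][:200])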

In this post, we provide an operational overview of the solution, and then describe how to set it up with the following services:

  • Amazon Bedrock and a knowledge base to generate responses from user questions based on enterprise data sources. Amazon Bedrock is a fully managed service that makes a wide range of foundation models (FMs) available through an API without having to manage any infrastructure. Refer to the Amazon Bedrock FAQs for further details.
  • An Amazon OpenSearch Serverless vector engine to store enterprise data as vectors to perform semantic search.
  • AWS Amplify to create and deploy the web application.
  • Amazon API Gateway and AWS Lambda to create an API with an authentication layer and integrate with Amazon Bedrock.
  • Amazon Cognito to implement an identity platform (user directory and authorization management) for the web application.
  • Amazon Simple Storage Service (Amazon S3) to store the enterprise data used by the solution and web application-related assets.

Solution overview

The solution architecture involves the following steps:

  1. The user authenticates to the web application (the digital assistant UI).
  2. Amazon Cognito validates the authentication details.
  3. The user submits a request using the web application.
  4. The request is sent by the web application to the API.
  5. The API calls a Lambda authorizer to confirm that the user is authorized to perform the operation.
  6. The request is sent from the API to a Lambda function.
  7. The Lambda function submits the request as a prompt to a knowledge base (Knowledge Bases for Amazon Bedrock), and explicitly requests a hybrid search to be performed using the Amazon Bedrock API.
  8. Amazon Bedrock retrieves relevant data from the vector store (using the vector engine for OpenSearch Serverless) using hybrid search.
  9. Amazon Bedrock submits a prompt to a foundation model.

After step 9, the foundation model generates a response, which is returned to the user in the web application’s digital assistant.

The following diagram illustrates this workflow.
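Step 7 is implemented by the Lambda function in the repository’s lambda-knowledgebase folder, which is not reproduced in this post. As a hedged illustration only, a handler of that kind could look roughly like the following sketch, which calls the RetrieveAndGenerate API with hybrid search enabled; the environment variable names, model ARN, and event shape are assumptions.

import json
import os
import boto3

bedrock_agent_runtime = boto3.client("bedrock-agent-runtime")

# Assumed configuration; the repository's function may be wired differently
KNOWLEDGE_BASE_ID = os.environ.get("KNOWLEDGE_BASE_ID", "#knowledge-base-id#")
MODEL_ARN = os.environ.get(
    "MODEL_ARN",
    "arn:aws:bedrock:us-east-1::foundation-model/anthropic.claude-instant-v1",
)

def lambda_handler(event, context):
    # Assumes the web application sends the user prompt in the request body
    prompt = json.loads(event["body"])["prompt"]

    response = bedrock_agent_runtime.retrieve_and_generate(
        input={"text": prompt},
        retrieveAndGenerateConfiguration={
            "type": "KNOWLEDGE_BASE",
            "knowledgeBaseConfiguration": {
                "knowledgeBaseId": KNOWLEDGE_BASE_ID,
                "modelArn": MODEL_ARN,
                "retrievalConfiguration": {
                    "vectorSearchConfiguration": {"overrideSearchType": "HYBRID"}
                },
            },
        },
    )

    return {
        "statusCode": 200,
        "body": json.dumps({"answer": response["output"]["text"]}),
    }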

Prerequisites

To follow along and set up this solution, you must have the following:

  • An AWS account
  • A device with access to your AWS account, with the AWS Command Line Interface (AWS CLI) and the Amplify CLI installed and configured
  • Model access to the following models in Amazon Bedrock: Titan Embeddings G1 – Text and Claude Instant

Upload documents and create a knowledge base

In this section, we create a knowledge base in Amazon Bedrock. The knowledge base will enrich the prompt submitted to an Amazon Bedrock foundation model with contextual information derived from our data source (in our case, documents uploaded to an S3 bucket).

During the creation of the knowledge base, a vector store will also be created to ingest documents encoded as vectors, using an embeddings model. An embeddings model encodes data as vectors in order to capture the meaning and context of our sample documents. This allows us to find data relevant to our end-user prompts.

For our use case, we use the vector engine for OpenSearch Serverless as a vector store and the Titan Embeddings G1 – Text model as the embeddings model.

Complete the following steps to create an S3 bucket to upload documents, and synchronize them with a knowledge base in Amazon Bedrock:

  1. Create an S3 bucket in your account.
  2. Upload the following documents to the S3 bucket:
  3. Create a knowledge base with the following configuration:
    • For Knowledge base name, enter assistant-knowledgebase.
    • For Knowledge base description, enter Knowledge base for digital assistant.
    • For IAM permissions, select Create and use a new service role.
    • For Data source name, enter assistant-knowledgebase-datasource.
    • For S3 URI, enter the URI of the previously created S3 bucket (for example, s3://#s3-bucket-name#).
    • For Embeddings model, choose Titan Embeddings G1 – Text.
    • For Vector database, select Quick create a new vector store.
  4. Ingest and synchronize the documents in the knowledge base.

Create the API and backend

In this section, we create the following resources:

  • A user directory for web authentication and authorization, created with an Amazon Cognito user pool.
  • An API created with Amazon API Gateway. This exposes a single entry point to our digital assistant’s web application.
  • An authorization layer in our API, to protect our backend from unauthorized users. This is implemented with a Lambda authorizer function that validates that incoming requests include valid authorization details (a simplified sketch follows this list).
  • A Lambda function behind the API, which will submit prompts to a knowledge base and return responses back to the API.
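The actual authorizer code ships in the repository’s lambda-auth folder; the following simplified, hypothetical sketch only shows the general shape of a Lambda authorizer that returns an IAM policy to API Gateway. Real validation of the Amazon Cognito JSON Web Token (signature against the user pool JWKS, expiry, issuer, and client ID) is reduced to a placeholder check here.

def generate_policy(principal_id, effect, method_arn):
    # Response document that API Gateway expects from a Lambda authorizer
    return {
        "principalId": principal_id,
        "policyDocument": {
            "Version": "2012-10-17",
            "Statement": [{
                "Action": "execute-api:Invoke",
                "Effect": effect,
                "Resource": method_arn,
            }],
        },
    }

def lambda_handler(event, context):
    # Assumes a TOKEN authorizer: API Gateway passes the Authorization header value
    token = event.get("authorizationToken", "")

    # Placeholder only: a real implementation must verify the Cognito JWT
    is_valid = token.startswith("Bearer ")

    effect = "Allow" if is_valid else "Deny"
    return generate_policy("user", effect, event["methodArn"])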

Complete the following steps to create the API and the backend of the digital assistant’s web application, using AWS CloudFormation templates:

  1. Clone the GitHub repository.
  2. Navigate to the api folder, which includes the following content:
    • A template named webapp-userpool-stack.yml for the Amazon Cognito user pool
    • A template named webapp-lambda-stack.yml for the Lambda function calling a knowledge base
    • A template named webapp-api-stack.yml for the API and the Lambda authorizer function
    • A subfolder named lambda-auth for the Lambda authorizer function code
    • A subfolder named lambda-knowledgebase for the Lambda function calling a knowledge base
    • A script named cognito-create-testuser.sh to create a test user in the Amazon Cognito user pool
  3.  Create the Amazon Cognito user pool of the web application using the following AWS Command Line Interface (AWS CLI) command:
    aws cloudformation create-stack --stack-name webapp-userpool-stack --template-body file://webapp-userpool-stack.yml

  4. Go to the lambda-knowledgebase folder and download the dependencies with the following command:
    pip install -r requirements.txt -t .

  5. Create a .zip file named lambda-knowledgebase.zip with the Lambda code and its dependencies (the .zip file’s root directory must include the Lambda code and its dependencies).
  6. From the api folder, go to the lambda-auth folder and download the dependencies with the following command:
    pip install -r requirements.txt -t .

  7. Create a .zip file named lambda-auth.zip with the Lambda code and its dependencies (the .zip file’s root directory must include the Lambda code and its dependencies).
  8. Create an S3 bucket in your account.
  9. Upload both .zip files (lambda-auth.zip and lambda-knowledgebase.zip) to the S3 bucket.
  10. Go back to the api folder and create the Lambda function of the web application using the following AWS CLI command (provide your S3 bucket and knowledge base ID):
aws cloudformation create-stack \
--stack-name webapp-lambda-knowledgebase-stack \
--capabilities "CAPABILITY_IAM" \
--template-body file://webapp-lambda-knowledgebase-stack.yml \
--parameters ParameterKey=BedrockKnowledgeBaseId,ParameterValue=#bedrock-knowledgebase-id# \
ParameterKey=BedrockLambdaS3Bucket,ParameterValue=#lambdacode-s3-bucket-name# \
ParameterKey=BedrockLambdaS3Key,ParameterValue=lambda-knowledgebase.zip

You can retrieve the knowledge base ID by running the following AWS CLI command:

aws bedrock-agent list-knowledge-bases \
--output text \
--query 'knowledgeBaseSummaries[?name==`assistant-knowledgebase`].knowledgeBaseId'
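If you prefer Python over the AWS CLI for this lookup, a roughly equivalent Boto3 call (a sketch, not part of the repository) is:

import boto3

bedrock_agent = boto3.client("bedrock-agent")

# Find the ID of the knowledge base named assistant-knowledgebase
response = bedrock_agent.list_knowledge_bases()
knowledge_base_id = next(
    kb["knowledgeBaseId"]
    for kb in response["knowledgeBaseSummaries"]
    if kb["name"] == "assistant-knowledgebase"
)
print(knowledge_base_id)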

  11. Create the API of the web application using the following AWS CLI command (provide your bucket name):
aws cloudformation create-stack \
--stack-name webapp-api-stack \
--capabilities "CAPABILITY_IAM" \
--template-body file://webapp-api-stack.yml \
--parameters ParameterKey=LambdaAuthorizerS3Bucket,ParameterValue=#lambdacode-s3-bucket-name# \
ParameterKey=LambdaAuthorizerS3Key,ParameterValue=lambda-auth.zip

Configure the Amazon Cognito user pool

In this section, we create a user in our Amazon Cognito user pool. This user will be used to log in to our web application.

Complete the following steps to configure the Amazon Cognito user pool created in the previous section:

  1. On the Amazon Cognito console, access the user pool named webapp-userpool.
  2. On the Users tab, choose Create a user.
  3. For Invitation message, select Send an email invitation.
  4. In the Email address section, enter your email address and select Mark email address as verified.
  5. For Temporary password, select Generate a password.
  6. Choose Create user.


You can also complete these steps by running the script cognito-create-testuser.sh available in the api folder as follows (provide your email address):

./cognito-create-testuser.sh #your-email-address#

After you create the user, you should receive an email with a temporary password in this format: “Your username is #your-email-address# and temporary password is #temporary-password#.”

Keep note of these login details (email address and temporary password) to use later when testing the web application.
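If you would rather create the test user with Boto3 than with the provided script, a roughly equivalent call is sketched below; the user pool ID and email address placeholders are assumptions you would replace with your own values.

import boto3

cognito_idp = boto3.client("cognito-idp")

# Create a user with a verified email address; Cognito emails a temporary password
cognito_idp.admin_create_user(
    UserPoolId="#user-pool-id#",
    Username="#your-email-address#",
    UserAttributes=[
        {"Name": "email", "Value": "#your-email-address#"},
        {"Name": "email_verified", "Value": "true"},
    ],
    DesiredDeliveryMediums=["EMAIL"],
)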

Create the web application

In this section, we build a web application using Amplify and publish it to make it accessible through an endpoint URL. To complete this section, you must first install and set up the Amplify CLI, as discussed in the prerequisites.

Complete the following steps to create the web application of the digital assistant:

  1. Go back to the root folder of the repository and open the frontend folder.
  2. Run the script amplify-setup.sh to create the Amplify application:
    ./amplify-setup.sh

The amplify-setup.sh script creates an Amplify application and configures it to integrate with resources you created in the previous modules:

    • The Amazon Cognito user pool to authenticate our user through the web application’s login page
    • The Amazon API Gateway to process prompts submitted using the web application’s chat interface
  3. Configure the hosting of the Amplify application using the following command:
    amplify add hosting

  4. Choose the following options:
    • For Select the plugin module to execute, choose Hosting with Amplify Console (Managed hosting with custom domains, Continuous deployment).
    • For Choose a type, choose Manual deployment.

In this step, we configure how the web application will be deployed and hosted:

    • The web application will be hosted using the Amplify console, which offers fully managed hosting
    • The web application will be deployed using manual deployment, which allows us to publish our web application to the Amplify console without connecting a Git provider
  5. Publish the Amplify application using the following command:
    amplify publish --yes

The web application is now available for testing and a URL should be displayed, as shown in the following screenshot. Take note of the URL to use in the following section.

Test the digital assistant

In this section, you test the web application of the digital assistant:

  1. Open the URL of the Amplify application in your browser.
  2. Enter your login information (your email and the temporary password you received earlier while configuring the user pool in Amazon Cognito) and choose Sign in.
  3. When prompted, enter a new password and choose Change Password.
  4. You should now be able to see a chat interface.
  5. Ask a question to test the assistant. For example, “What is the OPS number related to health of operations in the Well Architected framework?”

You should receive a response along with sources, as shown in the following screenshot.

Clean up

To make sure that no additional cost is incurred, remove the resources provisioned in your account. Make sure you’re in the correct AWS account before deleting the following resources.

  1. Delete the knowledge base.
  2. Delete the CloudFormation stacks (provide the AWS Region where you created your resources):
    aws cloudformation delete-stack --stack-name webapp-api-stack --region #region#
    aws cloudformation delete-stack --stack-name webapp-lambda-knowledgebase-stack --region #region#
    aws cloudformation delete-stack --stack-name webapp-userpool-stack --region #region#

  3. Delete the Amplify application with the following AWS CLI command (provide your application ID and the Region where it was created):
    aws amplify delete-app --app-id #app-id# --region #region#

  4. You can retrieve the app id by running the following AWS CLI command:
    aws amplify list-apps --query 'apps[?name==`frontend`].appId'

  5. Delete the S3 buckets.

You should exercise caution when performing the preceding steps. Make sure you are deleting the resources in the correct AWS account.

Conclusion

In this post, we walked through a solution to create a digital assistant using serverless services. First, we created a knowledge base and ingested documents into it from an S3 bucket. Then we created an API and a Lambda function to submit prompts to the knowledge base. We also configured a user pool to grant a user access to the digital assistant’s web application. Finally, we created the frontend of the web application in Amplify.

For further information on the services used, consult the Amazon Bedrock, Security in Amazon Bedrock, Amazon OpenSearch Serverless, AWS Amplify, Amazon API Gateway, AWS Lambda, Amazon Cognito, and Amazon S3 product pages.

To dive deeper into this solution, a self-paced workshop is available in AWS Workshop Studio.


About the author

Mehdi Amrane is a Senior Solutions Architect at Amazon Web Services. He supports customers on their initiatives and provides them prescriptive guidance to achieve their goals, and accelerate their cloud journey. He is passionate about creating content on application architecture, DevOps and Serverless technologies.

Read More

Build a self-service digital assistant using Amazon Lex and Knowledge Bases for Amazon Bedrock

Build a self-service digital assistant using Amazon Lex and Knowledge Bases for Amazon Bedrock

Organizations strive to implement efficient, scalable, cost-effective, and automated customer support solutions without compromising the customer experience. Generative artificial intelligence (AI)-powered chatbots play a crucial role in delivering human-like interactions by providing responses from a knowledge base without the involvement of live agents. These chatbots can be efficiently utilized for handling generic inquiries, freeing up live agents to focus on more complex tasks.

Amazon Lex provides advanced conversational interfaces using voice and text channels. Its natural language understanding capabilities enable more accurate identification of user intent and faster fulfillment of that intent.

Amazon Bedrock simplifies the process of developing and scaling generative AI applications powered by large language models (LLMs) and other foundation models (FMs). It offers access to a diverse range of FMs from leading providers such as Anthropic, AI21 Labs, Cohere, and Stability AI, as well as Amazon’s proprietary Amazon Titan models. Additionally, Knowledge Bases for Amazon Bedrock empowers you to develop applications that harness the power of Retrieval Augmented Generation (RAG), an approach where retrieving relevant information from data sources enhances the model’s ability to generate contextually appropriate and informed responses.

The generative AI capability of QnAIntent in Amazon Lex lets you securely connect FMs to company data for RAG. QnAIntent provides an interface to use enterprise data and FMs on Amazon Bedrock to generate relevant, accurate, and contextual responses. You can use QnAIntent with new or existing Amazon Lex bots to automate FAQs through text and voice channels, such as Amazon Connect.

With this capability, you no longer need to create variations of intents, sample utterances, slots, and prompts to predict and handle a wide range of FAQs. You can simply connect QnAIntent to company knowledge sources and the bot can immediately handle questions using the allowed content.

In this post, we demonstrate how you can build chatbots with QnAIntent that connects to a knowledge base in Amazon Bedrock (powered by Amazon OpenSearch Serverless as a vector database) and build rich, self-service, conversational experiences for your customers.

Solution overview

The solution uses Amazon Lex, Amazon Simple Storage Service (Amazon S3), and Amazon Bedrock in the following steps:

  1. Users interact with the chatbot through a prebuilt Amazon Lex web UI.
  2. Each user request is processed by Amazon Lex to determine user intent through a process called intent recognition.
  3. Amazon Lex provides the built-in generative AI feature QnAIntent, which can be directly attached to a knowledge base to fulfill user requests.
  4. Knowledge Bases for Amazon Bedrock uses the Amazon Titan embeddings model to convert the user query to a vector and queries the knowledge base to find the chunks that are semantically similar to the user query. The user prompt is then augmented with the results returned from the knowledge base as additional context and sent to the LLM to generate a response.
  5. The generated response is returned through QnAIntent and sent back to the user in the chat application through Amazon Lex.

The following diagram illustrates the solution architecture and workflow.

In the following sections, we look at the key components of the solution in more detail and the high-level steps to implement the solution:

  1. Create a knowledge base in Amazon Bedrock for OpenSearch Serverless.
  2. Create an Amazon Lex bot.
  3. Create a new generative AI-powered intent in Amazon Lex using the built-in QnAIntent and point it to the knowledge base.
  4. Deploy the sample Amazon Lex web UI available in the GitHub repo. Use the provided AWS CloudFormation template in your preferred AWS Region and configure the bot.

Prerequisites

To implement this solution, you need the following:

  1. An AWS account with privileges to create AWS Identity and Access Management (IAM) roles and policies. For more information, see Overview of access management: Permissions and policies.
  2. Familiarity with AWS services such as Amazon S3, Amazon Lex, Amazon OpenSearch Service, and Amazon Bedrock.
  3. Access enabled for the Amazon Titan Embeddings G1 – Text model and Anthropic Claude 3 Haiku on Amazon Bedrock. For instructions, see Model access.
  4. A data source in Amazon S3. For this post, we use Amazon shareholder docs (Amazon Shareholder letters – 2023 & 2022) as a data source to hydrate the knowledge base.

Create a knowledge base

To create a new knowledge base in Amazon Bedrock, complete the following steps. For more information, refer to Create a knowledge base.

  1. On the Amazon Bedrock console, choose Knowledge bases in the navigation pane.
  2. Choose Create knowledge base.
  3. On the Provide knowledge base details page, enter a knowledge base name, IAM permissions, and tags.
  4. Choose Next.
  5. For Data source name, Amazon Bedrock prepopulates the auto-generated data source name; however, you can change it to meet your requirements.
  6. Keep the data source location as the same AWS account and choose Browse S3.
  7. Select the S3 bucket where you uploaded the Amazon shareholder documents and choose Choose.
    This will populate the S3 URI, as shown in the following screenshot.
  8. Choose Next.
  9. Select the embedding model to vectorize the documents. For this post, we select Titan Embeddings G1 – Text v1.2.
  10. Select Quick create a new vector store to create a default vector store with OpenSearch Serverless.
  11. Choose Next.
  12. Review the configurations and create your knowledge base.
    After the knowledge base is successfully created, you should see a knowledge base ID, which you need when creating the Amazon Lex bot.
  13. Choose Sync to index the documents (a programmatic alternative using the AWS SDK for Python is sketched after these steps).
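If you prefer to trigger the synchronization programmatically rather than from the console, a sketch using the bedrock-agent client follows; the knowledge base ID and data source ID are placeholders.

import boto3

bedrock_agent = boto3.client("bedrock-agent")

# Start an ingestion job to sync the S3 data source into the knowledge base
response = bedrock_agent.start_ingestion_job(
    knowledgeBaseId="#knowledge-base-id#",
    dataSourceId="#data-source-id#",
)
print(response["ingestionJob"]["status"])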

Create an Amazon Lex bot

Complete the following steps to create your bot:

  1. On the Amazon Lex console, choose Bots in the navigation pane.
  2. Choose Create bot.
  3. For Creation method, select Create a blank bot.
  4. For Bot name, enter a name (for example, FAQBot).
  5. For Runtime role, select Create a new IAM role with basic Amazon Lex permissions to access other services on your behalf.
  6. Configure the remaining settings based on your requirements and choose Next.
  7. On the Add language to bot page, you can choose from different languages supported.
    For this post, we choose English (US).
  8. Choose Done.

    After the bot is successfully created, you’re redirected to create a new intent.
  9. Add utterances for the new intent and choose Save intent.

Add QnAIntent to your intent

Complete the following steps to add QnAIntent:

  1. On the Amazon Lex console, navigate to the intent you created.
  2. On the Add intent dropdown menu, choose Use built-in intent.
  3. For Built-in intent, choose AMAZON.QnAIntent – GenAI feature.
  4. For Intent name, enter a name (for example, QnABotIntent).
  5. Choose Add.

    After you add the QnAIntent, you’re redirected to configure the knowledge base.
  6. For Select model, choose Anthropic and Claude 3 Haiku.
  7. For Choose a knowledge store, select Knowledge base for Amazon Bedrock and enter your knowledge base ID.
  8. Choose Save intent.
  9. After you save the intent, choose Build to build the bot.
    You should see a Successfully built message when the build is complete.
    You can now test the bot on the Amazon Lex console.
  10. Choose Test to launch a draft version of your bot in a chat window within the console.
  11. Enter questions to get responses. (You can also send test utterances programmatically; see the sketch after these steps.)
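Outside the console, you can send test utterances to a published bot alias with the Amazon Lex Runtime V2 API. The following sketch assumes placeholder bot and alias IDs and an en_US locale.

import uuid
import boto3

lex_runtime = boto3.client("lexv2-runtime")

# Send a test utterance to the bot alias and print the returned messages
response = lex_runtime.recognize_text(
    botId="#bot-id#",
    botAliasId="#bot-alias-id#",
    localeId="en_US",
    sessionId=str(uuid.uuid4()),
    text="What did the shareholder letter say about generative AI?",
)

for message in response.get("messages", []):
    print(message["content"])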

Deploy the Amazon Lex web UI

The Amazon Lex web UI is a prebuilt fully featured web client for Amazon Lex chatbots. It eliminates the heavy lifting of recreating a chat UI from scratch. You can quickly deploy its features and minimize time to value for your chatbot-powered applications. Complete the following steps to deploy the UI:

  1. Follow the instructions in the GitHub repo.
  2. Before you deploy the CloudFormation template, update the LexV2BotId and LexV2BotAliasId values in the template based on the chatbot you created in your account.
  3. After the CloudFormation stack is deployed successfully, copy the WebAppUrl value from the stack Outputs tab.
  4. Navigate to the web UI to test the solution in your browser.

Clean up

To avoid incurring unnecessary future charges, clean up the resources you created as part of this solution:

  1. Delete the Amazon Bedrock knowledge base and the data in the S3 bucket if you created one specifically for this solution.
  2. Delete the Amazon Lex bot you created.
  3. Delete the CloudFormation stack.

Conclusion

In this post, we discussed the significance of generative AI-powered chatbots in customer support systems. We then provided an overview of the new Amazon Lex feature, QnAIntent, designed to connect FMs to your company data. Finally, we demonstrated a practical use case of setting up a Q&A chatbot to analyze Amazon shareholder documents. This implementation not only provides prompt and consistent customer service, but also empowers live agents to dedicate their expertise to resolving more complex issues.

Stay up to date with the latest advancements in generative AI and start building on AWS. If you’re seeking assistance on how to begin, check out the Generative AI Innovation Center.


About the Authors

Supriya Puragundla is a Senior Solutions Architect at AWS. She has over 15 years of IT experience in software development, design and architecture. She helps key customer accounts on their data, generative AI and AI/ML journeys. She is passionate about data-driven AI and the area of depth in ML and generative AI.

Manjula Nagineni is a Senior Solutions Architect with AWS based in New York. She works with major financial service institutions, architecting and modernizing their large-scale applications while adopting AWS Cloud services. She is passionate about designing cloud-centered big data workloads. She has over 20 years of IT experience in software development, analytics, and architecture across multiple domains such as finance, retail, and telecom.

Mani Khanuja is a Tech Lead – Generative AI Specialists, author of the book Applied Machine Learning and High Performance Computing on AWS, and a member of the Board of Directors for the Women in Manufacturing Education Foundation. She leads machine learning projects in various domains such as computer vision, natural language processing, and generative AI. She speaks at internal and external conferences such as AWS re:Invent, Women in Manufacturing West, YouTube webinars, and GHC 23. In her free time, she likes to go for long runs along the beach.

Read More

Identify idle endpoints in Amazon SageMaker

Identify idle endpoints in Amazon SageMaker

Amazon SageMaker is a machine learning (ML) platform designed to simplify the process of building, training, deploying, and managing ML models at scale. With a comprehensive suite of tools and services, SageMaker offers developers and data scientists the resources they need to accelerate the development and deployment of ML solutions.

In today’s fast-paced technological landscape, efficiency and agility are essential for businesses and developers striving to innovate. AWS plays a critical role in enabling this innovation by providing a range of services that abstract away the complexities of infrastructure management. By handling tasks such as provisioning, scaling, and managing resources, AWS allows developers to focus more on their core business logic and iterate quickly on new ideas.

As developers deploy and scale applications, unused resources such as idle SageMaker endpoints can accumulate unnoticed, leading to higher operational costs. This post addresses the issue of identifying and managing idle endpoints in SageMaker. We explore methods to monitor SageMaker endpoints effectively and distinguish between active and idle ones. Additionally, we walk through a Python script that automates the identification of idle endpoints using Amazon CloudWatch metrics.

Identify idle endpoints with a Python script

To effectively manage SageMaker endpoints and optimize resource utilization, we use a Python script that uses the AWS SDK for Python (Boto3) to interact with SageMaker and CloudWatch. This script automates the process of querying CloudWatch metrics to determine endpoint activity and identifies idle endpoints based on the number of invocations over a specified time period.

Let’s break down the key components of the Python script and explain how each part contributes to the identification of idle endpoints:

  • Global variables and AWS client initialization – The script begins by importing necessary modules and initializing global variables such as NAMESPACE, METRIC, LOOKBACK, and PERIOD. These variables define parameters for querying CloudWatch metrics and SageMaker endpoints. Additionally, AWS clients for interacting with SageMaker and CloudWatch services are initialized using Boto3.
from datetime import datetime, timedelta
import boto3
import logging

# AWS clients initialization
cloudwatch = boto3.client("cloudwatch")
sagemaker = boto3.client("sagemaker")

# Global variables
NAMESPACE = "AWS/SageMaker"
METRIC = "Invocations"
LOOKBACK = 1  # Number of days to look back for activity
PERIOD = 86400  # We opt for a granularity of 1 Day to reduce the volume of metrics retrieved while maintaining accuracy.

# Calculate time range for querying CloudWatch metrics
ago = datetime.utcnow() - timedelta(days=LOOKBACK)
now = datetime.utcnow()
  • Identify idle endpoints – Based on the CloudWatch metrics data, the script determines whether an endpoint is idle or active. If an endpoint has received no invocations over the defined period, it’s flagged as idle. In this case, we select a cautious default threshold of zero invocations over the analyzed period. However, depending on your specific use case, you can adjust this threshold to suit your requirements.
# Helper function to extract endpoint name from CloudWatch metric

def get_endpoint_name_from_metric(metric):
    for d in metric["Dimensions"]:
        if d["Name"] == "EndpointName" or d["Name"] == "InferenceComponentName" :
            yield d["Value"]

# Helper function to list the CloudWatch metrics for the Invocations metric in the AWS/SageMaker namespace. Each returned metric corresponds to an endpoint (or inference component) whose activity we check.

def list_metrics():
    paginator = cloudwatch.get_paginator("list_metrics")
    response_iterator = paginator.paginate(Namespace=NAMESPACE, MetricName=METRIC)
    return [m for r in response_iterator for m in r["Metrics"]]


# Helper function to check if endpoint is in use based on CloudWatch metrics

def is_endpoint_busy(metric):
    metric_values = cloudwatch.get_metric_data(
        MetricDataQueries=[{
            "Id": "metricname",
            "MetricStat": {
                "Metric": {
                    "Namespace": metric["Namespace"],
                    "MetricName": metric["MetricName"],
                    "Dimensions": metric["Dimensions"],
                },
                "Period": PERIOD,
                "Stat": "Sum",
                "Unit": "None",
            },
        }],
        StartTime=ago,
        EndTime=now,
        ScanBy="TimestampAscending",
        MaxDatapoints=24 * (LOOKBACK + 1),
    )
    return sum(metric_values.get("MetricDataResults", [{}])[0].get("Values", [])) > 0

# Helper function to log endpoint activity

def log_endpoint_activity(endpoint_name, is_busy):
    status = "BUSY" if is_busy else "IDLE"
    log_message = f"{datetime.utcnow()} - Endpoint {endpoint_name} {status}"
    print(log_message)
  • Main function – The main() function serves as the entry point to run the script. It orchestrates the process of retrieving SageMaker endpoints, querying CloudWatch metrics, and logging endpoint activity.
# Main function to identify idle endpoints and log their activity status
def main():
    endpoints = sagemaker.list_endpoints()["Endpoints"]
    
    if not endpoints:
        print("No endpoints found")
        return

    existing_endpoints_name = []
    for endpoint in endpoints:
        existing_endpoints_name.append(endpoint["EndpointName"])
    
    for metric in list_metrics():
        for endpoint_name in get_endpoint_name_from_metric(metric):
            if endpoint_name in existing_endpoints_name:
                is_busy = is_endpoint_busy(metric)
                log_endpoint_activity(endpoint_name, is_busy)
            else:
                print(f"Endpoint {endpoint_name} not active")

if __name__ == "__main__":
    main()

By following along with the explanation of the script, you’ll gain a deeper understanding of how to automate the identification of idle endpoints in SageMaker, paving the way for more efficient resource management and cost optimization.

Permissions required to run the script

Before you run the provided Python script to identify idle endpoints in SageMaker, make sure your AWS Identity and Access Management (IAM) user or role has the necessary permissions. The permissions required for the script include the following (a sketch of a matching inline policy follows this list):

  • CloudWatch permissions – The IAM entity running the script must have permissions for the CloudWatch actions cloudwatch:GetMetricData and cloudwatch:ListMetrics
  • SageMaker permissions – The IAM entity must have permissions to list SageMaker endpoints using the sagemaker:ListEndpoints action
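The following is a minimal sketch of an inline policy granting exactly these actions, attached to a hypothetical execution role with Boto3; the role and policy names are placeholders, and your organization may prefer tighter resource scoping.

import json
import boto3

iam = boto3.client("iam")

# Minimal inline policy covering only the actions the script calls
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "cloudwatch:GetMetricData",
                "cloudwatch:ListMetrics",
                "sagemaker:ListEndpoints",
            ],
            "Resource": "*",
        }
    ],
}

iam.put_role_policy(
    RoleName="#script-execution-role#",
    PolicyName="idle-endpoint-detector",
    PolicyDocument=json.dumps(policy),
)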

Run the Python script

You can run the Python script using various methods, including:

  • The AWS CLI – Make sure the AWS Command Line Interface (AWS CLI) is installed and configured with the appropriate credentials.
  • AWS Cloud9 – If you prefer a cloud-based integrated development environment (IDE), AWS Cloud9 provides an IDE with preconfigured settings for AWS development. Simply create a new environment, clone the script repository, and run the script within the Cloud9 environment.

In this post, we demonstrate running the Python script through the AWS CLI.

Actions to take after identifying idle endpoints

After you’ve successfully identified idle endpoints in your SageMaker environment using the Python script, you can take proactive steps to optimize resource utilization and reduce operational costs. The following are some actionable measures you can implement:

  • Delete or scale down endpoints – For endpoints that consistently show no activity over an extended period, consider deleting or scaling them down to minimize resource wastage. SageMaker allows you to delete idle endpoints through the AWS Management Console or programmatically using the AWS SDK (a sketch follows this list, together with an auto scaling example).
  • Review and refine the model deployment strategy – Evaluate the deployment strategy for your ML models and assess whether all deployed endpoints are necessary. Sometimes, endpoints may become idle due to changes in business requirements or model updates. By reviewing your deployment strategy, you can identify opportunities to consolidate or optimize endpoints for better efficiency.
  • Implement auto scaling policies – Configure auto scaling policies for active endpoints to dynamically adjust the compute capacity based on workload demand. SageMaker supports auto scaling, allowing you to automatically increase or decrease the number of instances serving predictions based on predefined metrics such as CPU utilization or inference latency.
  • Explore serverless inference options – Consider using SageMaker serverless inference as an alternative to traditional endpoint provisioning. Serverless inference eliminates the need for manual endpoint management by automatically scaling compute resources based on incoming prediction requests. This can significantly reduce idle capacity and optimize costs for intermittent or unpredictable workloads.
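As an illustration of the first and third measures, the following sketch deletes an endpoint that the script flagged as idle and registers a target-tracking scaling policy for an endpoint you keep. The endpoint and variant names are placeholders, and you should apply your own review process before deleting anything.

import boto3

sagemaker = boto3.client("sagemaker")
autoscaling = boto3.client("application-autoscaling")

# Delete an idle endpoint (its endpoint configuration and model are separate
# resources and are not removed by this call)
sagemaker.delete_endpoint(EndpointName="#idle-endpoint-name#")

# Register the production variant of an active endpoint as a scalable target...
resource_id = "endpoint/#active-endpoint-name#/variant/#variant-name#"
autoscaling.register_scalable_target(
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    MinCapacity=1,
    MaxCapacity=4,
)

# ...and attach a target-tracking policy on invocations per instance
autoscaling.put_scaling_policy(
    PolicyName="invocations-target-tracking",
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 100.0,
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
        },
    },
)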

Conclusion

In this post, we discussed the importance of identifying idle endpoints in SageMaker and provided a Python script to help automate this process. By implementing proactive monitoring solutions and optimizing resource utilization, SageMaker users can effectively manage their endpoints, reduce operational costs, and maximize the efficiency of their machine learning workflows.

Get started with the techniques demonstrated in this post to automate cost monitoring for SageMaker inference. Explore AWS re:Post for valuable resources on optimizing your cloud infrastructure and maximizing AWS services.

Resources

For more information about the features and services used in this post, refer to the following:


About the authors

Pablo Colazurdo is a Principal Solutions Architect at AWS where he enjoys helping customers to launch successful projects in the Cloud. He has many years of experience working on varied technologies and is passionate about learning new things. Pablo grew up in Argentina but now enjoys the rain in Ireland while listening to music, reading or playing D&D with his kids.

Ozgur Canibeyaz is a Senior Technical Account Manager at AWS with 8 years of experience. Ozgur helps customers optimize their AWS usage by navigating technical challenges, exploring cost-saving opportunities, achieving operational excellence, and building innovative services using AWS products.

Read More

Indian language RAG with Cohere multilingual embeddings and Anthropic Claude 3 on Amazon Bedrock

Indian language RAG with Cohere multilingual embeddings and Anthropic Claude 3 on Amazon Bedrock

Media and entertainment companies serve multilingual audiences with a wide range of content catering to diverse audience segments. These enterprises have access to massive amounts of data collected over their many years of operations. Much of this data is unstructured text and images. Conventional approaches to analyzing unstructured data for generating new content rely on the use of keyword or synonym matching. These approaches don’t capture the full semantic context of a document, making them less effective for users’ search, content creation, and several other downstream tasks.

Text embeddings use machine learning (ML) capabilities to capture the essence of unstructured data. These embeddings are generated by language models that map natural language text into their numerical representations and, in the process, encode contextual information in the natural language document. Generating text embeddings is the first step to many natural language processing (NLP) applications powered by large language models (LLMs) such as Retrieval Augmented Generation (RAG), text generation, entity extraction, and several other downstream business processes.

Converting text to embeddings using the Cohere multilingual embedding model

Despite the rising popularity and capabilities of LLMs, the language most often used to converse with the LLM, often through a chat-like interface, is English. And although progress has been made in adapting open source models to comprehend and respond in Indian languages, such efforts fall short of the English language capabilities displayed among larger, state-of-the-art LLMs. This makes it difficult to adopt such models for RAG applications based on Indian languages.

In this post, we showcase a RAG application that can search and query across multiple Indian languages using the Cohere Embed – Multilingual model and Anthropic Claude 3 on Amazon Bedrock. This post focuses on Indian languages, but you can use the approach with other languages that are supported by the LLM.

Solution overview

We use the Flores dataset [1], a benchmark dataset for machine translation between English and low-resource languages. This also serves as a parallel corpus, which is a collection of texts that have been translated into one or more languages.

With the Flores dataset, we can demonstrate that the embeddings and, subsequently, the documents retrieved from the retriever, are relevant for the same question being asked in multiple languages. However, given the sparsity of the dataset (approximately 1,000 lines per language from more than 200 languages), the nature and number of questions that can be asked against the dataset is limited.

After you have downloaded the data, load it into a pandas DataFrame for processing. For this demo, we restrict ourselves to Bengali, Kannada, Malayalam, Tamil, Telugu, Hindi, Marathi, and English. If you are looking to adopt this approach for other languages, make sure the language is supported by both the embedding model and the LLM that’s being used in the RAG setup.

Load the data with the following code:

import pandas as pd

df_ben = pd.read_csv('./data/Flores/dev/dev.ben_Beng', sep='\t')
df_kan = pd.read_csv('./data/Flores/dev/dev.kan_Knda', sep='\t')
df_mal = pd.read_csv('./data/Flores/dev/dev.mal_Mlym', sep='\t')
df_tam = pd.read_csv('./data/Flores/dev/dev.tam_Taml', sep='\t')
df_tel = pd.read_csv('./data/Flores/dev/dev.tel_Telu', sep='\t')
df_hin = pd.read_csv('./data/Flores/dev/dev.hin_Deva', sep='\t')
df_mar = pd.read_csv('./data/Flores/dev/dev.mar_Deva', sep='\t')
df_eng = pd.read_csv('./data/Flores/dev/dev.eng_Latn', sep='\t')
# Choose fewer/more languages if needed

df_all_Langs = pd.concat([df_ben, df_kan, df_mal, df_tam, df_tel, df_hin, df_mar,df_eng], axis=1)
df_all_Langs.columns = ['Bengali', 'Kannada', 'Malayalam', 'Tamil', 'Telugu', 'Hindi', 'Marathi','English']

df_all_Langs.shape #(996,8)


df = df_all_Langs
stacked_df = df.stack().reset_index() # for ease of handling

# select only the required columns, rename them
stacked_df = stacked_df.iloc[:,[1,2]]
stacked_df.columns = ['language','text'] 

The Cohere multilingual embedding model

Cohere is a leading enterprise artificial intelligence (AI) platform that builds world-class LLMs and LLM-powered solutions that allow computers to search, capture meaning, and converse in text. They provide ease of use and strong security and privacy controls.

The Cohere Embed – Multilingual model generates vector representations of documents for over 100 languages and is available on Amazon Bedrock. With Amazon Bedrock, you can access the embedding model through an API call, which eliminates the need to manage the underlying infrastructure and makes sure sensitive information remains securely managed and protected.

The multilingual embedding model groups text with similar meanings by assigning them positions in the semantic vector space that are close to each other. Developers can process text in multiple languages without switching between different models. This makes processing more efficient and improves performance for multilingual applications.

Text embeddings turn unstructured data into a structured form. This allows you to objectively compare, dissect, and derive insights from all these documents. Cohere’s new embedding models have a new required input parameter, input_type, which must be set for every API call to one of the following four values, which align with the most frequent use cases for text embeddings:

  • input_type=”search_document” – Use this for texts (documents) you want to store in your vector database
  • input_type=”search_query” – Use this for search queries to find the most relevant documents in your vector database
  • input_type=”classification” – Use this if you use the embeddings as input for a classification system
  • input_type=”clustering” – Use this if you use the embeddings for text clustering

Using these input types provides the highest possible quality for the respective tasks. If you want to use the embeddings for multiple use cases, we recommend using input_type="search_document".

Prerequisites

To use the Claude 3 Sonnet LLM and the Cohere multilingual embeddings model on this dataset, ensure that you have enabled access to both models in the Model access section of the Amazon Bedrock console, then install the following packages. The following code has been tested to work with the Amazon SageMaker Data Science 3.0 image, backed by an ml.t3.medium instance.

! apt-get update 
! apt-get install build-essential -y # for the hnswlib package below
! pip install hnswlib

Create a search index

With all of the prerequisites in place, you can now convert the multilingual corpus into embeddings and store those in hnswlib, a header-only C++ Hierarchical Navigable Small World (HNSW) implementation with Python bindings that supports insertions and updates. HNSWLib is an in-memory vector store that can be saved to a file, which should be sufficient for the small dataset we are working with. Use the following code:

import hnswlib
import os
import json
import botocore
import boto3

boto3_bedrock = boto3.client('bedrock')
bedrock_runtime = boto3.client('bedrock-runtime')

# Create a search index
index = hnswlib.Index(space='ip', dim=1024)
index.init_index(max_elements=10000, ef_construction=512, M=64)

all_text = stacked_df['text'].to_list()
all_text_lang = stacked_df['language'].to_list()

Embed and index documents

To embed and store the small multilingual dataset, use the Cohere embed-multilingual-v3.0 model, which creates embeddings with 1,024 dimensions, using the Amazon Bedrock runtime API:

modelId="cohere.embed-multilingual-v3"
contentType= "application/json"
accept = "*/*"


df_chunk_size = 80
chunk_embeddings = []
for i in range(0,len(all_text), df_chunk_size):
    chunk = all_text[i:i+df_chunk_size]
    body=json.dumps(
            {"texts":chunk,"input_type":"search_document"} # search documents
    ) 
    response = bedrock_runtime.invoke_model(body=body, 
                                            modelId=modelId,
                                            accept=accept,
                                            contentType=contentType)
    response_body = json.loads(response.get('body').read())
    index.add_items(response_body['embeddings'])
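Because HNSWLib keeps the index in memory, you may want to persist it between sessions, as mentioned earlier. A small sketch, assuming a local file name of your choosing:

# Persist the populated index to disk and reload it into a fresh index object
index.save_index("flores_multilingual.idx")

restored_index = hnswlib.Index(space='ip', dim=1024)
restored_index.load_index("flores_multilingual.idx", max_elements=10000)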

Verify that the embeddings work

To test the solution, write a function that takes a query as input, embeds it, and finds the top N documents most closely related to it:

# Retrieval of closest N docs to query
def retrieval(query, num_docs_to_return=10):
    modelId="cohere.embed-multilingual-v3"
    contentType= "application/json"
    accept = "*/*"
    body=json.dumps(
            {"texts":[query],"input_type":"search_query"} # search query
    ) 
    response = bedrock_runtime.invoke_model(body=body, 
                                            modelId=modelId,
                                            accept=accept,
                                            contentType=contentType)
    response_body = json.loads(response.get('body').read())
    doc_ids = index.knn_query(response_body['embeddings'], 
                              k=num_docs_to_return)[0][0] 
    print(f"Query: {query} n")
    retrieved_docs = []

    for doc_id in doc_ids:
        # Append results
        retrieved_docs.append(all_text[doc_id]) # original vernacular language docs

        # Print results
        print(f"Original Flores Text {all_text[doc_id]}")
        print("-"*30)

    print("END OF RESULTS nn")
    return retrieved_docs   

You can explore what the RAG stack does with a couple of queries in different languages, such as Hindi:

queries = [
    "मुझे सिंधु नदी घाटी सभ्यता के बारे में बताइए",
]
# translation: tell me about Indus Valley Civilization
for query in queries:
    retrieval(query)

The index returns documents relevant to the search query from across languages:

Query: मुझे सिंधु नदी घाटी सभ्यता के बारे में बताइए 

Original Flores Text सिंधु घाटी सभ्यता उत्तर-पश्चिम भारतीय उपमहाद्वीप में कांस्य युग की सभ्यता थी जिसमें आस-पास के आधुनिक पाकिस्तान और उत्तर पश्चिम भारत और उत्तर-पूर्व अफ़गानिस्तान के कुछ क्षेत्र शामिल थे.
------------------------------
Original Flores Text सिंधु नदी के घाटों में पनपी सभ्यता के कारण यह इसके नाम पर बनी है.
------------------------------
Original Flores Text यद्यपि कुछ विद्वानों का अनुमान है कि चूंकि सभ्यता अब सूख चुकी सरस्वती नदी के घाटियों में विद्यमान थी, इसलिए इसे सिंधु-सरस्वती सभ्यता कहा जाना चाहिए, जबकि 1920 के दशक में हड़प्पा की पहली खुदाई के बाद से कुछ इसे हड़प्पा सभ्यता कहते हैं।
------------------------------
Original Flores Text సింధు నది పరీవాహక ప్రాంతాల్లో నాగరికత విలసిల్లింది.
------------------------------
Original Flores Text सिंधू संस्कृती ही वायव्य भारतीय उपखंडातील कांस्य युग संस्कृती होती ज्यामध्ये  आधुनिक काळातील पाकिस्तान, वायव्य भारत आणि ईशान्य अफगाणिस्तानातील काही प्रदेशांचा समावेश होता.
------------------------------
Original Flores Text সিন্ধু সভ্যতা হল উত্তর-পশ্চিম ভারতীয় উপমহাদেশের একটি তাম্রযুগের সভ্যতা যা আধুনিক-পাকিস্তানের অধিকাংশ ও উত্তর-পশ্চিম ভারত এবং উত্তর-পূর্ব আফগানিস্তানের কিছু অঞ্চলকে ঘিরে রয়েছে।
-------------------------
 .....

You can now use these documents retrieved from the index as context while calling the Anthropic Claude 3 Sonnet model on Amazon Bedrock. In production settings with datasets that are several orders of magnitude larger than the Flores dataset, we can make the search results from the index even more relevant by using Cohere’s Rerank models.

Use the system prompt to outline how you want the LLM to process your query:

# Retrieval of docs relevant to the query
def context_retrieval(query, num_docs_to_return=10):

    modelId="cohere.embed-multilingual-v3"
    contentType= "application/json"
    accept = "*/*"
    body=json.dumps(
            {"texts":[query],"input_type":"search_query"} # search query
    ) 
    response = bedrock_runtime.invoke_model(body=body, 
                                            modelId=modelId,
                                            accept=accept,
                                            contentType=contentType)
    response_body = json.loads(response.get('body').read())
    doc_ids = index.knn_query(response_body['embeddings'], 
                              k=num_docs_to_return)[0][0] 
    retrieved_docs = []
    
    for doc_id in doc_ids:
        retrieved_docs.append(all_text[doc_id])
    return " ".join(retrieved_docs)

def query_rag_bedrock(query, model_id = 'anthropic.claude-3-sonnet-20240229-v1:0'):

    system_prompt = '''
    You are a helpful, empathetic multilingual assistant.
    Identify the language of the user query, and respond to the user query in the same language. 

    For example 
    if the user query is in English your response will be in English, 
    if the user query is in Malayalam, your response will be in Malayalam, 
    if the user query is in Tamil, your response will be in Tamil
    and so on...

    if you cannot identify the language: Say you cannot identify the language

    You will use only the data provided within the <context> </context> tags, that matches the user's query's language, to answer the user's query
    If there is no data provided within the <context> </context> tags, Say that you do not have enough information to answer the question
    
    Restrict your response to a paragraph of less than 400 words; avoid bullet points
    '''
    max_tokens = 1000

    messages  = [{"role": "user", "content": f'''
                    query : {query}
                    <context>
                    {context_retrieval(query)}
                    </context>
                '''}]

    body=json.dumps(
            {
                "anthropic_version": "bedrock-2023-05-31",
                "max_tokens": max_tokens,
                "system": system_prompt,
                "messages": messages
            }  
        )  


    response = bedrock_runtime.invoke_model(body=body, modelId=model_id)
    response_body = json.loads(response.get('body').read())
    return response_body['content'][0]['text']

Let’s pass in the same query in multiple Indian languages:

queries = ["tell me about the indus river valley civilization",
           "मुझे सिंधु नदी घाटी सभ्यता के बारे में बताइए",
           "मला सिंधू नदीच्या संस्कृतीबद्दल सांगा",
           "సింధు నది నాగరికత గురించి చెప్పండి",
           "ಸಿಂಧೂ ನದಿ ಕಣಿವೆ ನಾಗರಿಕತೆಯ ಬಗ್ಗೆ ಹೇಳಿ", 
           "সিন্ধু নদী উপত্যকা সভ্যতা সম্পর্কে বলুন",
           "சிந்து நதி பள்ளத்தாக்கு நாகரிகத்தைப் பற்றி சொல்",
           "സിന്ധു നദീതാഴ്വര നാഗരികതയെക്കുറിച്ച് പറയുക"] 

for query in queries:
    print(query_rag_bedrock(query))
    print('_'*20)

The query is in English, so I will respond in English.

The Indus Valley Civilization, also known as the Harappan Civilization, was a Bronze Age civilization that flourished in the northwestern regions of the Indian subcontinent, primarily in the basins of the Indus River and its tributaries. It encompassed parts of modern-day Pakistan, northwest India, and northeast Afghanistan. While some scholars suggest calling it the Indus-Sarasvati Civilization due to its presence in the now-dried-up Sarasvati River basin, the name "Indus Valley Civilization" is derived from its development along the Indus River valley. This ancient civilization dates back to around 3300–1300 BCE and was one of the earliest urban civilizations in the world. It was known for its well-planned cities, advanced drainage systems, and a writing system that has not yet been deciphered.
____________________
सिंधु घाटी सभ्यता एक प्राचीन नगर सभ्यता थी जो उत्तर-पश्चिम भारतीय उपमहाद्वीप में फैली हुई थी। यह लगभग 3300 से 1300 ईसा पूर्व की अवधि तक विकसित रही। इस सभ्यता के केंद्र वर्तमान पाकिस्तान के सिंध और पंजाब प्रांतों में स्थित थे, लेकिन इसके अवशेष भारत के राजस्थान, गुजरात, मध्य प्रदेश, महाराष्ट्र और उत्तर प्रदेश में भी मिले हैं। सभ्यता का नाम सिंधु नदी से लिया गया है क्योंकि इसके प्रमुख स्थल इस नदी के किनारे स्थित थे। हालांकि, कुछ विद्वानों का अनुमान है कि सरस्वती नदी के किनारे भी इस सभ्यता के स्थल विद्यमान थे इसलिए इसे सिंधु-सरस्वती सभ्यता भी कहा जाता है। यह एक महत्वपूर्ण शहरी समाज था जिसमें विकसित योजना बनाने की क्षमता, नगरीय संरचना और स्वच्छ जलापूर्ति आदि प्रमुख विशेषताएं थीं।
____________________
सिंधू संस्कृती म्हणजे सिंधू नदीच्या पट्टीकेतील प्राचीन संस्कृती होती. ही संस्कृती सुमारे ई.पू. ३३०० ते ई.पू. १३०० या कालखंडात फुलणारी होती. ती भारतातील कांस्ययुगीन संस्कृतींपैकी एक मोठी होती. या संस्कृतीचे अवशेष आजच्या पाकिस्तान, भारत आणि अफगाणिस्तानमध्ये आढळून आले आहेत. या संस्कृतीत नगररचना, नागरी सोयी सुविधांचा विकास झाला होता. जलवाहिनी, नगरदेवालय इत्यादी अद्भुत बाबी या संस्कृतीत होत्या. सिंधू संस्कृतीत लिपीसुद्धा विकसित झाली होती परंतु ती अजूनही वाचण्यास आलेली नाही. सिंधू संस्कृती ही भारतातील पहिली शहरी संस्कृती मानली जाते.
____________________
సింధు నది నాగరికత గురించి చెప్పుతూ, ఈ నాగరికత సింధు నది పరిసర ప్రాంతాల్లో ఉన్నదని చెప్పవచ్చు. దీనిని సింధు-సరస్వతి నాగరికత అనీ, హరప్ప నాగరికత అనీ కూడా పిలుస్తారు. ఇది ఉత్తర-ఆర్య భారతదేశం, ఆధునిక పాకిస్తాన్, ఉత్తర-పశ్చిమ భారతదేశం మరియు ఉత్తర-ఆర్థిక అఫ్గానిస్తాన్ కు చెందిన తామ్రయుగపు నాగరికత. సరస్వతి నది పరీవాహక ప్రాంతాల్లోనూ నాగరికత ఉందని కొందరు పండితులు అభిప్రాయపడ్డారు. దీని మొదటి స్థలాన్ని 1920లలో హరప్పాలో త్రవ్వారు. ఈ నాగరికతలో ప్రశస్తమైన బస్తీలు, నగరాలు, మలిచ్చి రంగులతో నిర్మించిన భవనాలు, పట్టణ నిర్మాణాలు ఉన్నాయి.
____________________
ಸಿಂಧೂ ಕಣಿವೆ ನಾಗರಿಕತೆಯು ವಾಯುವ್ಯ ಭಾರತದ ಉಪಖಂಡದಲ್ಲಿ ಕಂಚಿನ ಯುಗದ ನಾಗರಿಕತೆಯಾಗಿದ್ದು, ಪ್ರಾಚೀನ ಭಾರತದ ಇತಿಹಾಸದಲ್ಲಿ ಮುಖ್ಯವಾದ ಪಾತ್ರವನ್ನು ವಹಿಸಿದೆ. ಈ ನಾಗರಿಕತೆಯು ಆಧುನಿಕ-ದಿನದ ಪಾಕಿಸ್ತಾನ ಮತ್ತು ವಾಯುವ್ಯ ಭಾರತದ ಭೂಪ್ರದೇಶಗಳನ್ನು ಹಾಗೂ ಈಶಾನ್ಯ ಅಫ್ಘಾನಿಸ್ತಾನದ ಕೆಲವು ಪ್ರದೇಶಗಳನ್ನು ಒಳಗೊಂಡಿರುವುದರಿಂದ ಅದಕ್ಕೆ ಸಿಂಧೂ ನಾಗರಿಕತೆ ಎಂದು ಹೆಸರಿಸಲಾಗಿದೆ. ಸಿಂಧೂ ನದಿಯ ಪ್ರದೇಶಗಳಲ್ಲಿ ಈ ನಾಗರಿಕತೆಯು ವಿಕಸಿತಗೊಂಡಿದ್ದರಿಂದ ಅದಕ್ಕೆ ಸಿಂಧೂ ನಾಗರಿಕತೆ ಎಂದು ಹೆಸರಿಸಲಾಗಿದೆ. ಈಗ ಬತ್ತಿ ಹೋದ ಸರಸ್ವತಿ ನದಿಯ ಪ್ರದೇಶಗಳಲ್ಲಿ ಸಹ ನಾಗರೀಕತೆಯ ಅಸ್ತಿತ್ವವಿದ್ದಿರಬಹುದೆಂದು ಕೆಲವು ಪ್ರಾಜ್ಞರು ಶಂಕಿಸುತ್ತಾರೆ. ಆದ್ದರಿಂದ ಈ ನಾಗರಿಕತೆಯನ್ನು ಸಿಂಧೂ-ಸರಸ್ವತಿ ನಾಗರಿಕತೆ ಎಂದು ಸೂಕ್ತವಾಗಿ ಕರೆ
____________________
সিন্ধু নদী উপত্যকা সভ্যতা ছিল একটি প্রাচীন তাম্রযুগীয় সভ্যতা যা বর্তমান পাকিস্তান এবং উত্তর-পশ্চিম ভারত ও উত্তর-পূর্ব আফগানিস্তানের কিছু অঞ্চলকে নিয়ে গঠিত ছিল। এই সভ্যতার নাম সিন্ধু নদীর অববাহিকা অঞ্চলে এটির বিকাশের কারণে এরকম দেওয়া হয়েছে। কিছু পণ্ডিত মনে করেন যে সরস্বতী নদীর ভূমি-প্রদেশেও এই সভ্যতা বিদ্যমান ছিল, তাই এটিকে সিন্ধু-সরস্বতী সভ্যতা বলা উচিত। আবার কেউ কেউ এই সভ্যতাকে হরপ্পা পরবর্তী হরপ্পান সভ্যতা নামেও অবিহিত করেন। যাই হোক, সিন্ধু সভ্যতা ছিল প্রাচীন তাম্রযুগের এক উল্লেখযোগ্য সভ্যতা যা সিন্ধু নদী উপত্যকার এলাকায় বিকশিত হয়েছিল।
____________________
சிந்து நதிப் பள்ளத்தாக்கில் தோன்றிய நாகரிகம் சிந்து நாகரிகம் என்றழைக்கப்படுகிறது. சிந்து நதியின் படுகைகளில் இந்த நாகரிகம் மலர்ந்ததால் இப்பெயர் வழங்கப்பட்டது. ஆனால், தற்போது வறண்டுபோன சரஸ்வதி நதிப் பகுதியிலும் இந்நாகரிகம் இருந்திருக்கலாம் என சில அறிஞர்கள் கருதுவதால், சிந்து சரஸ்வதி நாகரிகம் என்று அழைக்கப்பட வேண்டும் என்று வாதிடுகின்றனர். மேலும், இந்நாகரிகத்தின் முதல் தளமான ஹரப்பாவின் பெயரால் ஹரப்பா நாகரிகம் என்றும் அழைக்கப்படுகிறது. இந்த நாகரிகம் வெண்கலயுக நாகரிகமாக கருதப்படுகிறது. இது தற்கால பாகிஸ்தானின் பெரும்பகுதி, வடமேற்கு இந்தியா மற்றும் வடகிழக்கு ஆப்கானிஸ்தானின் சில பகுதிகளை உள்ளடக்கியது.
____________________
സിന്ധു നദീതട സംസ്കാരം അഥവാ ഹാരപ്പൻ സംസ്കാരം ആധുനിക പാകിസ്ഥാൻ, വടക്ക് പടിഞ്ഞാറൻ ഇന്ത്യ, വടക്ക് കിഴക്കൻ അഫ്ഗാനിസ്ഥാൻ എന്നിവിടങ്ങളിൽ നിലനിന്ന ഒരു വെങ്കല യുഗ സംസ്കാരമായിരുന്നു. ഈ സംസ്കാരത്തിന്റെ അടിസ്ഥാനം സിന്ധു നദിയുടെ തടങ്ങളായതിനാലാണ് ഇതിന് സിന്ധു നദീതട സംസ്കാരം എന്ന പേര് ലഭിച്ചത്. ചില പണ്ഡിതർ ഇപ്പോൾ വറ്റിപ്പോയ സരസ്വതി നദിയുടെ തടങ്ങളിലും ഈ സംസ്കാരം നിലനിന്നിരുന്നതിനാൽ സിന്ധു-സരസ്വതി നദീതട സംസ്കാരമെന്ന് വിളിക്കുന്നത് ശരിയായിരിക്കുമെന്ന് അഭിപ്രായപ്പെടുന്നു. എന്നാൽ ചിലർ 1920കളിൽ ആദ്യമായി ഉത്ഖനനം നടത്തിയ ഹാരപ്പ എന്ന സ്ഥലത്തെ പേര് പ്രകാരം ഈ സംസ്കാരത്തെ ഹാരപ്പൻ സംസ്കാരമെന്ന് വിളിക്കുന്നു.

Conclusion

This post presented a walkthrough for using Cohere’s multilingual embedding model along with Anthropic Claude 3 Sonnet on Amazon Bedrock. In particular, we showed how the same question, asked in multiple Indian languages, is answered using relevant documents retrieved from a vector store.

Cohere’s multilingual embedding model supports over 100 languages. It removes the complexity of building applications that require working with a corpus of documents in different languages. The Cohere Embed model is trained to deliver results in real-world applications. It handles noisy data as inputs, adapts to complex RAG systems, and delivers cost-efficiency from its compression-aware training method.

Start building with Cohere’s multilingual embedding model and Anthropic Claude 3 Sonnet on Amazon Bedrock today.

References

[1] Flores Dataset: https://github.com/facebookresearch/flores/tree/main/flores200


About the Author


Rony K Roy is a Sr. Specialist Solutions Architect, Specializing in AI/ML. Rony helps partners build AI/ML solutions on AWS.

Read More