Accelerated PyTorch inference with torch.compile on AWS Graviton processors


Originally, PyTorch used an eager mode in which each PyTorch operation that forms the model runs independently as soon as it’s reached. PyTorch 2.0 introduced torch.compile to speed up PyTorch code over the default eager mode. In contrast to eager mode, torch.compile pre-compiles the entire model into a single graph in a manner that’s optimal for running on a given hardware platform. AWS optimized the PyTorch torch.compile feature for AWS Graviton3 processors. This optimization results in up to 2x better performance for Hugging Face model inference (based on the geomean of the performance improvement for 33 models) and up to 1.35x better performance for TorchBench model inference (geomean of the performance improvement for 45 models) compared to the default eager mode inference across several natural language processing (NLP), computer vision (CV), and recommendation models on AWS Graviton3-based Amazon EC2 instances. Starting with PyTorch 2.3.1, the optimizations are available in torch Python wheels and the AWS Graviton PyTorch deep learning container (DLC).

In this blog post, we show how we optimized torch.compile performance on AWS Graviton3-based EC2 instances, how to use the optimizations to improve inference performance, and the resulting speedups.

Why torch.compile and what’s the goal?

In eager mode, operators in a model are run immediately as they are encountered. It’s easier to use, more suitable for machine learning (ML) researchers, and hence is the default mode. However, eager mode incurs runtime overhead because of redundant kernel launches and memory reads. In torch.compile mode, by contrast, operators are first synthesized into a graph in which operators are merged with one another to reduce and localize memory reads and to lower the total kernel launch overhead.
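As a toy illustration of why merging operators helps, the following sketch uses plain Python lists in place of tensors. It applies three elementwise steps first as separate passes that each materialize an intermediate result, then as one fused pass. The function names are made up for this example, and this is not how the torch inductor backend is implemented; it only shows the idea of trading several memory passes for one.

```python
# Toy model of eager vs. fused execution. Plain Python lists stand in for
# tensors; the function names are illustrative, not PyTorch APIs.

def eager_style(xs):
    """Run each operator as its own pass, materializing every intermediate."""
    passes = 0
    scaled = [x * 2.0 for x in xs]            # pass 1: read xs, write scaled
    passes += 1
    shifted = [s + 1.0 for s in scaled]       # pass 2: read scaled, write shifted
    passes += 1
    clipped = [max(v, 0.0) for v in shifted]  # pass 3: ReLU-like clamp
    passes += 1
    return clipped, passes


def fused_style(xs):
    """Apply the same three operators in a single pass over the input."""
    return [max(x * 2.0 + 1.0, 0.0) for x in xs], 1


data = [-2.0, -0.5, 0.0, 1.5]
eager_out, eager_passes = eager_style(data)
fused_out, fused_passes = fused_style(data)
print(eager_out == fused_out, eager_passes, fused_passes)
```

Both variants compute the same values; the fused one reads the input once and writes the output once, which is the kind of saving graph compilation aims for.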

The goal for the AWS Graviton team was to optimize torch.compile backend for Graviton3 processors. PyTorch eager mode was already optimized for Graviton3 processors with Arm Compute Library (ACL) kernels using oneDNN (also known as MKLDNN). So, the question was, how to reuse those kernels in torch.compile mode to get the best of graph compilation and the optimized kernel performance together?

Results

The AWS Graviton team extended the torch inductor and oneDNN primitives that reused the ACL kernels and optimized compile mode performance on Graviton3 processors. Starting with PyTorch 2.3.1, the optimizations are available in the torch Python wheels and AWS Graviton DLC. Please see the Running an inference section that follows for the instructions on installation, runtime configuration, and how to run the tests.

To demonstrate the performance improvements, we used NLP, CV, and recommendation models from TorchBench and the most downloaded NLP models from Hugging Face across Question Answering, Text Classification, Token Classification, Translation, Zero-Shot Classification, Summarization, Feature Extraction, Text Generation, Text2Text Generation, Fill-Mask, and Sentence Similarity tasks to cover a wide variety of customer use cases.

We started by measuring TorchBench model inference latency, in milliseconds (msec), for the eager mode, which is marked 1.0 with a red dotted line in the following graph. We then measured the latency of the same model inference with torch.compile and plotted the results normalized to eager mode. For the 45 models we benchmarked, there is a 1.35x latency improvement (geomean for the 45 models).

Image 1: PyTorch model inference performance improvement with torch.compile on AWS Graviton3-based c7g instance using TorchBench framework. The reference eager mode performance is marked as 1.0. (higher is better)

Similar to the preceding TorchBench inference performance graph, we started by measuring the Hugging Face NLP model inference latency, in msec, for the eager mode, which is marked 1.0 with a red dotted line in the following graph. We then measured the latency of the same model inference with torch.compile and plotted the results normalized to eager mode. For the 33 models we benchmarked, there is around a 2x performance improvement (geomean for the 33 models).

Image 2: Hugging Face NLP model inference performance improvement with torch.compile on AWS Graviton3-based c7g instance using Hugging Face example scripts. The reference eager mode performance is marked as 1.0. (higher is better)
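To make the normalization concrete, here is a minimal sketch of how a per-model speedup and a geomean across models can be computed. The model names and latencies are made up for illustration and are not taken from our benchmark data.

```python
from statistics import geometric_mean


def speedups(eager_ms, compiled_ms):
    """Per-model speedup = eager latency / compiled latency (higher is better)."""
    return {name: eager_ms[name] / compiled_ms[name] for name in eager_ms}


# Made-up latencies in milliseconds, for illustration only.
eager_ms = {"model_a": 40.0, "model_b": 90.0, "model_c": 10.0}
compiled_ms = {"model_a": 20.0, "model_b": 30.0, "model_c": 10.0}

per_model = speedups(eager_ms, compiled_ms)
# The geometric mean is the usual way to average ratios across models.
overall = geometric_mean(list(per_model.values()))
print(per_model, round(overall, 3))
```

A speedup of 1.0 corresponds to the eager mode reference line in the graphs.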

Running an inference

Starting with PyTorch 2.3.1, the optimizations are available in the torch Python wheel and in AWS Graviton PyTorch DLC. This section shows how to run inference in eager and torch.compile modes using torch Python wheels and benchmarking scripts from Hugging Face and TorchBench repos.

To successfully run the scripts and reproduce the speedup numbers mentioned in this post, you need an instance from the Graviton3 family (c7g/r7g/m7g/hpc7g) of hardware. For this post, we used the c7g.4xl (16 vcpu) instance. The instance, the AMI details, and the required torch library versions are mentioned in the following snippet.

Instance: c7g.4xl instance
Region: us-west-2
AMI: ami-05cc25bfa725a144a (Ubuntu 22.04/Jammy with 6.5.0-1017-aws kernel)

# Install Python
sudo apt-get update
sudo apt-get install -y python3 python3-pip

# Upgrade pip3 to the latest version
python3 -m pip install --upgrade pip

# Install PyTorch and extensions
python3 -m pip install torch==2.3.1 torchvision==0.18.1 torchaudio==2.3.1

The generic runtime tunings implemented for eager mode inference are equally applicable to the torch.compile mode, so we set the following environment variables to further improve the torch.compile performance on AWS Graviton3 processors.

# Enable the fast math GEMM kernels, to accelerate fp32 inference with bfloat16 gemm
export DNNL_DEFAULT_FPMATH_MODE=BF16

# Enable Linux Transparent Huge Page (THP) allocations,
# to reduce the tensor memory allocation latency
export THP_MEM_ALLOC_ENABLE=1

# Set LRU Cache capacity to cache the primitives and avoid redundant
# memory allocations
export LRU_CACHE_CAPACITY=1024
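If you launch the benchmark from Python rather than from a shell, the same settings can be passed to the child process. The following is a minimal sketch; the helper name and the example base environment are hypothetical, and it assumes the variables should be in place before the process that imports torch starts.

```python
import os

# Values copied from the settings above; the helper name is illustrative.
GRAVITON_TUNINGS = {
    "DNNL_DEFAULT_FPMATH_MODE": "BF16",
    "THP_MEM_ALLOC_ENABLE": "1",
    "LRU_CACHE_CAPACITY": "1024",
}


def tuned_environment(base=None, overrides=None):
    """Return a copy of the environment with the tunings applied."""
    env = dict(os.environ if base is None else base)
    env.update(GRAVITON_TUNINGS)
    env.update(overrides or {})
    return env


# Example with a tiny base environment so the result is easy to inspect.
env = tuned_environment(base={"PATH": "/usr/bin"}, overrides={"OMP_NUM_THREADS": "16"})
print(env)
```

The returned dictionary can be passed as the env argument of subprocess.run when starting the inference script.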

TorchBench benchmarking scripts

TorchBench is a collection of open source benchmarks used to evaluate PyTorch performance. We benchmarked 45 models using the scripts from the TorchBench repo. The following code shows how to run the scripts for the eager mode and the compile mode with the inductor backend.

# Set OMP_NUM_THREADS to number of vcpus, 16 for c7g.4xl instance
export OMP_NUM_THREADS=16

# Install the dependencies
sudo apt-get install -y libgl1-mesa-glx
sudo apt-get install -y libpangocairo-1.0-0
python3 -m pip install psutil numpy transformers pynvml numba onnx onnxruntime scikit-learn timm effdet gym doctr opencv-python h5py==3.10.0 python-doctr

# Clone pytorch benchmark repo
git clone https://github.com/pytorch/benchmark.git
cd benchmark
# PyTorch benchmark repo doesn't have any release tags. So,
# listing the commit we used for collecting the performance numbers
git checkout 9a5e4137299741e1b6fb7aa7f5a6a853e5dd2295

# Setup the models
python3 install.py

# Collect eager mode performance using the following command. The results will be
# stored at .userbenchmark/cpu/metric-<timestamp>.json.
python3 run_benchmark.py cpu --model BERT_pytorch,hf_Bert,hf_Bert_large,hf_GPT2,hf_Albert,hf_Bart,hf_BigBird,hf_DistilBert,hf_GPT2_large,dlrm,hf_T5,mnasnet1_0,mobilenet_v2,mobilenet_v3_large,squeezenet1_1,timm_efficientnet,shufflenet_v2_x1_0,timm_regnet,resnet50,soft_actor_critic,phlippe_densenet,resnet152,resnet18,resnext50_32x4d,densenet121,phlippe_resnet,doctr_det_predictor,timm_vovnet,alexnet,doctr_reco_predictor,vgg16,dcgan,yolov3,pytorch_stargan,hf_Longformer,timm_nfnet,timm_vision_transformer,timm_vision_transformer_large,nvidia_deeprecommender,demucs,tts_angular,hf_Reformer,pytorch_CycleGAN_and_pix2pix,functorch_dp_cifar10,pytorch_unet --test eval --metrics="latencies,cpu_peak_mem"

# Collect torch.compile mode performance with inductor backend
# and weights pre-packing enabled. The results will be stored at
# .userbenchmark/cpu/metric-<timestamp>.json
python3 run_benchmark.py cpu --model BERT_pytorch,hf_Bert,hf_Bert_large,hf_GPT2,hf_Albert,hf_Bart,hf_BigBird,hf_DistilBert,hf_GPT2_large,dlrm,hf_T5,mnasnet1_0,mobilenet_v2,mobilenet_v3_large,squeezenet1_1,timm_efficientnet,shufflenet_v2_x1_0,timm_regnet,resnet50,soft_actor_critic,phlippe_densenet,resnet152,resnet18,resnext50_32x4d,densenet121,phlippe_resnet,doctr_det_predictor,timm_vovnet,alexnet,doctr_reco_predictor,vgg16,dcgan,yolov3,pytorch_stargan,hf_Longformer,timm_nfnet,timm_vision_transformer,timm_vision_transformer_large,nvidia_deeprecommender,demucs,tts_angular,hf_Reformer,pytorch_CycleGAN_and_pix2pix,functorch_dp_cifar10,pytorch_unet --test eval --torchdynamo inductor --freeze_prepack_weights --metrics="latencies,cpu_peak_mem"

On successful completion of the inference runs, the script stores the results in JSON format. The following is the sample output:

{
    "name": "cpu",
    "environ": {
        "pytorch_git_version": "d44533f9d073df13895333e70b66f81c513c1889"
    },
    "metrics": {
        "BERT_pytorch-eval_latency": 56.3769865,
        "BERT_pytorch-eval_cmem": 0.4169921875
    }
}
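To compare an eager run with a compile run, the two result files can be read and the latencies divided per model. The following is a minimal sketch that assumes every latency key follows the <model>-eval_latency pattern shown in the sample output; the function names and the numbers in the example reports are made up for illustration.

```python
import json


def latency_by_model(report):
    """Extract {model: latency_ms} from a parsed metrics report."""
    suffix = "-eval_latency"
    return {
        key[: -len(suffix)]: value
        for key, value in report["metrics"].items()
        if key.endswith(suffix)
    }


def compare(eager_report, compiled_report):
    """Return {model: eager latency / compiled latency} for shared models."""
    eager = latency_by_model(eager_report)
    compiled = latency_by_model(compiled_report)
    return {m: eager[m] / compiled[m] for m in eager if m in compiled}


# Made-up reports in the same shape as the sample output.
eager_report = json.loads(
    '{"name": "cpu", "metrics": {"BERT_pytorch-eval_latency": 50.0,'
    ' "BERT_pytorch-eval_cmem": 0.4}}'
)
compiled_report = json.loads(
    '{"name": "cpu", "metrics": {"BERT_pytorch-eval_latency": 25.0,'
    ' "BERT_pytorch-eval_cmem": 0.5}}'
)
result = compare(eager_report, compiled_report)
print(result)
```

In practice, you would load the two .userbenchmark/cpu/metric-<timestamp>.json files with json.load instead of the inline strings.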

Hugging Face benchmarking scripts

Google T5 Small Text Translation model is one of the 33 Hugging Face models we benchmarked. We’re using it as a sample model to demonstrate how to run inference in eager and compile modes. The additional configurations and APIs required to run it in compile mode are the torch._inductor.config settings near the top of the script and the torch.compile call. Save the following script as google_t5_small_text_translation.py.

import argparse
from transformers import T5Tokenizer, T5Model
import torch
from torch.profiler import profile, record_function, ProfilerActivity

# Additional configuration for compile mode: enable weight
# pre-packing and freezing in the inductor backend
import torch._inductor.config as config
config.cpp.weight_prepack = True
config.freezing = True


def test_inference(mode, num_iter):
    tokenizer = T5Tokenizer.from_pretrained("t5-small")
    model = T5Model.from_pretrained("t5-small")

    input_ids = tokenizer(
        "Studies have been shown that owning a dog is good for you", return_tensors="pt"
    ).input_ids  # Batch size 1
    decoder_input_ids = tokenizer("Studies show that", return_tensors="pt").input_ids  # Batch size 1

    # Additional API call for compile mode
    if mode == "compile":
        model = torch.compile(model)

    with torch.no_grad():
        # Warm-up iterations (these also trigger compilation in compile mode)
        for _ in range(50):
            outputs = model(input_ids=input_ids, decoder_input_ids=decoder_input_ids)

        # Profiled iterations
        with profile(activities=[ProfilerActivity.CPU]) as prof:
            with record_function("model_inference"):
                for _ in range(num_iter):
                    outputs = model(input_ids=input_ids, decoder_input_ids=decoder_input_ids)

    print(prof.key_averages().table(sort_by="self_cpu_time_total"))


def main() -> None:
    parser = argparse.ArgumentParser(__doc__)
    parser.add_argument(
        "-m",
        "--mode",
        choices=["eager", "compile"],
        default="eager",
        help="Which test to run.",
    )
    parser.add_argument(
        "-n",
        "--number",
        type=int,
        default=100,
        help="how many iterations to run.",
    )
    args = parser.parse_args()
    test_inference(args.mode, args.number)


if __name__ == "__main__":
    main()

Run the script with the following steps.

# Set OMP_NUM_THREADS to 4 because the scripts run
# inference in sequence and don't need a large
# number of vcpus
export OMP_NUM_THREADS=4

# Install the dependencies
python3 -m pip install transformers

# Run the inference script in Eager mode
# using number of iterations as 1 just to show the torch profiler output
# but for the benchmarking, we used 1000 iterations.
python3 google_t5_small_text_translation.py -n 1 -m eager

# Run the inference script in torch compile mode
python3 google_t5_small_text_translation.py -n 1 -m compile

On successful completion of the inference runs, the script prints the torch profiler output with the latency breakdown for the torch operators. The following is the sample output from torch profiler:


# Torch profiler output for the eager mode run on c7g.xl (4vcpu)
---------------    ------------  -----------  ------------  -----------  ------------  ------------
Name                 Self CPU %   Self CPU     CPU total %   CPU total   CPU time avg    # of Calls
---------------    ------------  -----------  ------------  -----------  ------------  ------------
aten::mm            40.71%         12.502ms       40.71%      12.502ms     130.229us            96
model_inference     26.44%         8.118ms       100.00%      30.708ms      30.708ms             1
aten::bmm            6.85%         2.102ms         9.47%       2.908ms      80.778us            36
aten::matmul         3.73%         1.146ms        57.26%      17.583ms     133.205us           132
aten::select         1.88%       576.000us         1.90%     583.000us       0.998us           584
aten::transpose      1.51%       464.000us         1.83%     563.000us       3.027us           186
---------------    ------------  -----------  ------------  -----------  ------------  -------------
Self CPU time total: 30.708ms

# Torch profiler output for the compile mode run for the same model on the same instance
------------------------- ----------  -----------  ------------  ------------  ------------  ------------
Name                      Self CPU %    Self CPU    CPU total %    CPU total   CPU time avg   # of Calls
------------------------- ----------  -----------  ------------  ------------  ------------  ------------
mkldnn::_linear_pointwise   37.98%       5.461ms        45.91%       6.602ms      68.771us            96
Torch-Compiled Region       29.56%       4.251ms        98.53%      14.168ms      14.168ms             1
aten::bmm                   14.90%       2.143ms        21.73%       3.124ms      86.778us            36
aten::select                 4.51%     648.000us         4.62%     665.000us       1.155us           576
aten::view                   3.29%     473.000us         3.29%     473.000us       1.642us           288
aten::empty                  2.53%     364.000us         2.53%     364.000us       3.165us           115
-------------------------  ---------  -----------  ------------  ------------  ------------ -------------
Self CPU time total: 14.379ms
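To turn the two profiler summaries into a single ratio, the total line of each table can be parsed and divided. The following is a minimal sketch; the helper name is hypothetical, and it assumes the total is reported in milliseconds as in the preceding output.

```python
import re


def self_cpu_total_ms(profiler_table):
    """Return the 'Self CPU time total' value, assuming it is printed in ms."""
    match = re.search(r"Self CPU time total:\s*([\d.]+)ms", profiler_table)
    if match is None:
        raise ValueError("no 'Self CPU time total' line in milliseconds found")
    return float(match.group(1))


# Total lines copied from the two profiler outputs above.
eager_ms = self_cpu_total_ms("Self CPU time total: 30.708ms")
compiled_ms = self_cpu_total_ms("Self CPU time total: 14.379ms")
print(f"speedup: {eager_ms / compiled_ms:.2f}x")  # prints "speedup: 2.14x"
```

This ratio comes from a single profiled iteration, so treat it as a quick check rather than a benchmark result.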

What’s next

Next, we’re extending the torch inductor CPU backend support to compile Llama model, and adding support for fused GEMM kernels to enable torch inductor operator fusion optimization on AWS Graviton3 processors.

Conclusion

In this tutorial, we covered how we optimized torch.compile performance on AWS Graviton3-based EC2 instances, how to use the optimizations to improve PyTorch model inference performance, and demonstrated the resulting speedups. We hope that you will give it a try! If you need any support with ML software on Graviton, please open an issue on the AWS Graviton Technical Guide GitHub.


About the Author

Sunita Nadampalli is a Software Development Manager and AI/ML expert at AWS. She leads AWS Graviton software performance optimizations for AI/ML and HPC workloads. She is passionate about open source software development and delivering high-performance and sustainable software solutions for SoCs based on the Arm ISA.


Access control for vector stores using metadata filtering with Knowledge Bases for Amazon Bedrock


In November 2023, we announced Knowledge Bases for Amazon Bedrock as generally available.

Knowledge bases allow Amazon Bedrock users to unlock the full potential of Retrieval Augmented Generation (RAG) by seamlessly integrating their company data into the language model’s generation process. This feature allows organizations to harness the power of large language models (LLMs) while making sure that the generated responses are tailored to their specific domain knowledge, regulations, and business requirements. By incorporating their unique data sources, such as internal documentation, product catalogs, or transcribed media, organizations can enhance the relevance, accuracy, and contextual awareness of the language model’s outputs.

Knowledge bases effectively bridge the gap between the broad knowledge encapsulated within foundation models and the specialized, domain-specific information that businesses possess, enabling a truly customized and valuable generative artificial intelligence (AI) experience.

With metadata filtering now available in Knowledge Bases for Amazon Bedrock, you can define and use metadata fields to filter the source data used for retrieving relevant context during RAG. For example, if your data contains documents from different products, departments, or time periods, you can use metadata filtering to limit retrieval to only the most relevant subset of data for a given query or conversation. This helps improve the relevance and quality of retrieved context while reducing potential hallucinations or noise from irrelevant data. Metadata filtering gives you more control over the RAG process for better results tailored to your specific use case needs.

In this post, we discuss how to implement metadata filtering within Knowledge Bases for Amazon Bedrock by implementing access control and ensuring data privacy and security in RAG applications.

Access control with metadata filters

Metadata filtering in knowledge bases enables access control for your data. By defining metadata fields based on attributes such as user roles, departments, or data sensitivity levels, you can ensure that the retrieval only fetches and uses information that a particular user or application is authorized to access. This helps maintain data privacy and security, preventing sensitive or restricted information from being inadvertently surfaced or used in generated responses. With this access control capability, you can safely use retrieval across different user groups or scenarios while complying with company specific data governance policies and regulations.

During retrieval of contextually relevant chunks, metadata filters add an additional layer of selection to those vectors that are returned to the LLM for response generation. In addition, metadata filtering requires fewer computation resources, thereby improving the overall performance and reducing costs associated with the search.

Let’s explore some practical applications of metadata filtering in Knowledge Bases for Amazon Bedrock. Here are a few examples and use cases across different domains:

  • A company uses a chatbot to help HR personnel navigate employee files. There is sensitive information present in the documents and only certain employees should be able to have access and converse with them. With metadata filters on access IDs, a user can only chat with documents that have metadata associated with their access ID. The access ID associated with their authentication when the chat is initiated can be passed as a filter.
  • A business-to-business (B2B) platform is developed for companies to allow their end-users to access all their uploaded documents, search over them conversationally, and complete various tasks using those documents. To ensure that end-users can only chat with their data, metadata filters on user access tokens—such as those obtained through an authentication service—can enable secure access to their information. This provides customers with peace of mind while maintaining compliance with various data security standards.
  • A work organization application has a conversational search feature. Documents, kanbans, meeting recording transcripts, and other assets can be searched more intently and with more granular control. The app uses a single sign-on (SSO) functionality that allows them to access company-wide resources and other services and follows a company’s data level access protocol. With metadata filters on work groups and a privilege level (for example Limited, Standard, or Admin) derived from their SSO authentication, you can enforce data security while personalizing the chat experience to streamline a user’s work and collaboration with others.

Access control with metadata filtering in the healthcare domain

To demonstrate the access-control capabilities enabled by metadata filtering in knowledge bases, let’s consider a use case where a healthcare provider has a knowledge base that contains transcripts of conversations between doctors and patients. In this scenario, it is crucial that each doctor can only access transcripts from their own patient interactions during the search, and not have access to transcripts from other doctors’ patient interactions.

By defining a metadata field for patient_id and associating each transcript with the corresponding patient’s identifier, the healthcare provider can implement access control within their search application. When a doctor initiates a conversation, the knowledge base can filter the vector store to retrieve context only from transcripts where the patient_id metadata matches either a specific patient ID or the list of patient IDs associated with the authenticated doctor. This way, the generated responses will be augmented solely with information from that doctor’s past patient interactions, maintaining patient privacy and confidentiality.
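The two cases described above, a single patient or a list of patients, map to two filter shapes. The following is a minimal sketch of a helper that builds the filter dictionary, using the equals and in operators of the knowledge base retrieval filter; the helper name is hypothetical.

```python
def build_patient_filter(patient_ids, key="patient_id"):
    """Build a metadata filter for one patient ID or a list of patient IDs."""
    ids = list(patient_ids)
    if not ids:
        # Refuse to build an unfiltered query: no IDs means no access.
        raise ValueError("at least one patient ID is required")
    if len(ids) == 1:
        return {"equals": {"key": key, "value": ids[0]}}
    return {"in": {"key": key, "value": ids}}


single = build_patient_filter([669])
several = build_patient_filter([669, 670])
print(single)
print(several)
```

Rejecting an empty list is a deliberate choice in this sketch, so that a doctor with no associated patients never triggers an unfiltered search.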

This access control approach can be extended to other relevant metadata fields, such as year or department, further refining the subset of data accessible to each user or application. By using metadata filtering in knowledge bases, the healthcare provider can achieve compliance with data governance policies and regulations while enabling doctors to have personalized, contextually relevant conversations tailored to their specific patient histories and needs.

Solution overview

Let’s walk through the high-level steps to implement access control with Knowledge Bases for Amazon Bedrock. The following GitHub repository provides a guided notebook that you can follow to deploy this example in your own account.

The following diagram illustrates the solution architecture.

Figure 1: Solution architecture

The workflow for the solution is as follows:

  1. The doctor interacts with the Streamlit frontend, which serves as the application interface. Amazon Cognito handles user authentication and access control, ensuring only authorized doctors can access the application. For production use, it is recommended to use a more robust frontend framework such as AWS Amplify, which provides a comprehensive set of tools and services for building scalable and secure web applications.
  2. After the doctor has successfully signed in, the application retrieves the list of patients associated with the doctor’s ID from the Amazon DynamoDB database. The doctor is then presented with this list of patients, from which they can select one or more patients to filter their search.
  3. When the doctor interacts with the Streamlit frontend, it sends a request to an AWS Lambda function, which acts as the application backend. The request includes the doctor’s ID, a list of patient IDs to filter by, and the text query.
  4. Before querying the knowledge base, the Lambda function retrieves data from the DynamoDB database, which stores doctor-patient associations. This step validates that the doctor is authorized to access the information of the requested patient or list of patients.
  5. If the validation is successful, the Lambda function queries the knowledge base using the provided patient ID or list of patient IDs. The knowledge base is pre-populated with transcript and metadata files stored in Amazon Simple Storage Service (Amazon S3).
  6. The knowledge base returns the relevant results, which are then sent back to the Streamlit application and displayed to the doctor.
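Steps 3 through 5 of the workflow can be sketched as follows. This is a simplified illustration, not the code from the notebook: an in-memory dictionary stands in for the DynamoDB table, the knowledge base query is passed in as a function, and the event field names are assumptions.

```python
class NotAuthorized(Exception):
    """Raised when a doctor requests patients they are not associated with."""


def authorize(doctor_id, requested_ids, associations):
    """Return the requested IDs if the doctor may access all of them."""
    allowed = set(associations.get(doctor_id, ()))
    requested = set(requested_ids)
    if not requested or not requested <= allowed:
        raise NotAuthorized(f"doctor {doctor_id!r} may not access the requested patients")
    return sorted(requested)


def handle_request(event, associations, query_knowledge_base):
    """Validate the request first, then query with the approved patient IDs."""
    patient_ids = authorize(event["doctor_id"], event["patient_ids"], associations)
    return query_knowledge_base(event["query"], patient_ids)


# Hypothetical data and a fake query function, for illustration only.
associations = {"doctor-1": [669, 670], "doctor-2": [671]}


def fake_query(text, patient_ids):
    return {"query": text, "filtered_to": patient_ids}


ok = handle_request(
    {"doctor_id": "doctor-1", "patient_ids": [670, 669], "query": "Who is Kelly?"},
    associations,
    fake_query,
)
try:
    handle_request(
        {"doctor_id": "doctor-1", "patient_ids": [671], "query": "Who is Kelly?"},
        associations,
        fake_query,
    )
    denied = False
except NotAuthorized:
    denied = True
print(ok, denied)
```

Validating on the backend matters because the patient IDs arrive in the request and should not be trusted just because the frontend only displays the doctor’s own patients.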

User authentication with Amazon Cognito

To implement the access control solution for the healthcare provider use case, you can use Amazon Cognito user pools to manage the authentication and user identities of the doctors.

To start, you will create an Amazon Cognito user pool that will store the doctor user accounts. During the user pool setup, you define the necessary attributes for each doctor, including their name and a unique identifier (sub or custom attribute). For patients, their identifier will be used as the patient_id metadata field. This unique identifier will be associated with each patient’s account and used for metadata filtering in the knowledge base retrieval process.

Figure 2: User information

Doctor and patient association in DynamoDB

To facilitate the access control mechanism based on the doctor-patient relationship, the healthcare provider can create a DynamoDB table to store these associations. This table will act as a centralized repository, allowing efficient retrieval of the patient IDs associated with each authenticated doctor during the knowledge base search process. When a doctor authenticates through Amazon Cognito, their unique identifier can be used to query the doctor_patient_list_associations table and retrieve the list of patient_id values associated with that doctor.

Figure 3: Items retrieved based on the doctor_ID and patient relationships

This approach offers flexibility in managing doctor-patient associations. If a patient’s assigned doctor changes over time, only the corresponding entries in the DynamoDB table need to be updated. This update does not require modifying the metadata files of the transcripts themselves.
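The lookup can be sketched as a small function around the table’s get_item call. The attribute names doctor_id and patient_id_list are assumptions about the table schema, and the FakeTable class only mimics the response shape so that the example runs without AWS access.

```python
def get_patient_ids(table, doctor_id):
    """Look up the patient IDs for one doctor; returns [] if none are stored."""
    # Assumed schema: partition key "doctor_id", list attribute "patient_id_list".
    response = table.get_item(Key={"doctor_id": doctor_id})
    item = response.get("Item", {})
    return list(item.get("patient_id_list", []))


class FakeTable:
    """Minimal stand-in that mimics the get_item response shape."""

    def __init__(self, items):
        self._items = items

    def get_item(self, Key):
        item = self._items.get(Key["doctor_id"])
        return {"Item": item} if item is not None else {}


table = FakeTable({"doctor-1": {"doctor_id": "doctor-1", "patient_id_list": [669, 670]}})
known = get_patient_ids(table, "doctor-1")
unknown = get_patient_ids(table, "doctor-9")
print(known, unknown)
```

With Boto3, the table object would come from boto3.resource("dynamodb").Table("doctor_patient_list_associations").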

Now that you have your doctor and patients set up with their relationships defined, let’s examine the dataset format required for effective metadata filtering.

Dataset format

When working with Knowledge Bases for Amazon Bedrock, the dataset format plays a crucial role in providing seamless integration and effective metadata filtering. This example uses a series of PDF files containing transcripts of doctor-patient conversations.

These files need to be uploaded to an S3 bucket for processing. To use metadata filtering, you need to create a separate metadata JSON file for each transcript file. The metadata file should share the same name as the corresponding PDF file (including the extension). For instance, if the transcript file is named transcript_001.pdf, the metadata file should be named transcript_001.pdf.metadata.json. This nomenclature is crucial for the knowledge base to identify the metadata for specific files during the ingestion process.

The metadata JSON file will contain key-value pairs representing the relevant metadata fields associated with the transcript. In the healthcare provider use case, the most important metadata field is patient_id, which will be used to implement access control. You assign each transcript to a specific patient by including their unique identifier from the Amazon Cognito user pool in the patient_id field of the metadata file, as in the following example:

{"metadataAttributes": {"patient_id": 669}}

By structuring the dataset with transcript PDF files accompanied by their corresponding metadata JSON files, you can effectively use the metadata filtering capabilities of Knowledge Bases for Amazon Bedrock. This approach enables you to implement access control, so each doctor can only retrieve and use content from their own patient transcripts during the retrieval process. For customers processing thousands of files, automating the generation of the metadata files using Lambda functions or a similar solution could be a more efficient approach to scale.
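As a starting point for that automation, the following sketch writes one metadata file next to each PDF in a folder, following the naming rule described earlier. The function name and the mapping from file name to patient ID are hypothetical; in a real pipeline the mapping would come from your own records.

```python
import json
import tempfile
from pathlib import Path


def write_metadata_files(folder, patient_id_for):
    """Create <name>.pdf.metadata.json next to each PDF in the folder."""
    written = []
    for pdf in sorted(Path(folder).glob("*.pdf")):
        sidecar = pdf.with_name(pdf.name + ".metadata.json")
        payload = {"metadataAttributes": {"patient_id": patient_id_for[pdf.name]}}
        sidecar.write_text(json.dumps(payload))
        written.append(sidecar.name)
    return written


# Demonstration in a temporary folder with a placeholder PDF.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "transcript_001.pdf").write_bytes(b"%PDF-1.4 placeholder")
    names = write_metadata_files(tmp, {"transcript_001.pdf": 669})
    content = json.loads((Path(tmp) / names[0]).read_text())
print(names, content)
```

The same logic could run inside a Lambda function triggered when a transcript is uploaded, writing the metadata file to the same S3 prefix.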

Knowledge base creation

With the dataset properly structured and organized, you can now create the knowledge base in Amazon Bedrock. The process is straightforward, thanks to the user-friendly interface and step-by-step guidance provided by the AWS Management Console. See Knowledge Bases now delivers fully managed RAG experience in Amazon Bedrock for instructions to create a new knowledge base, upload your dataset, and configure the necessary settings to achieve optimal performance. Alternatively, you can create a knowledge base using the AWS SDK, API, or AWS CloudFormation template, which provides programmatic and automated ways to set up and manage your knowledge bases.

Figure 4: Using the console to create a knowledge base

After you create the knowledge base and sync it with your dataset, you can immediately experience the power of metadata filtering.

In the test pane, navigate to the settings section and locate the filters option. Here, you can define specific filter conditions by specifying the patient_id field along with the unique IDs or list of identifiers of the patients you wish to test. By applying this filter, the retrieval process will fetch and incorporate only the relevant context from transcripts associated with the specified patient or patients. This filter-based retrieval approach means that the generated responses are tailored to the doctor’s individual patient interactions, maintaining data privacy and confidentiality.

Figure 5: Knowledge Bases console test configuration panel

Figure 6: Knowledge Bases console test panel

Querying the knowledge base programmatically

You have seen how to implement access control with metadata filtering through the console, but what if you want to integrate knowledge bases directly into your applications? AWS provides SDKs that allow you to programmatically interact with Amazon Bedrock features, including knowledge bases.

The following code snippet demonstrates how to call the retrieve_and_generate API using the Boto3 library in Python. It includes metadata filtering capabilities within the vectorSearchConfiguration, where you can now add filter conditions. For this specific use case, you first need to retrieve the list of patient_ids associated with a doctor from the DynamoDB table. This allows you to filter the search results based on the authenticated user’s identity.

import boto3

bedrock_agent = boto3.client("bedrock-agent-runtime")

# Placeholders: replace with your own values before running.
knowledge_base_id = "<<KnowledgeBase id>>"
# List of patient IDs (their Amazon Cognito IDs) retrieved from
# DynamoDB once the doctor is authenticated.
patient_ids = ["<<patient_ids>>"]

# Retrieve and generate API
response = bedrock_agent.retrieve_and_generate(
    input={
        "text": "Who is Kelly?"
    },
    retrieveAndGenerateConfiguration={
        "type": "KNOWLEDGE_BASE",
        "knowledgeBaseConfiguration": {
            "knowledgeBaseId": knowledge_base_id,
            "modelArn": "arn:aws:bedrock:us-west-2::foundation-model/anthropic.claude-v2:1",
            "retrievalConfiguration": {
                "vectorSearchConfiguration": {
                    "numberOfResults": 5,
                    # Only retrieve chunks whose patient_id is in the list
                    "filter": {
                        "in": {
                            "key": "patient_id",
                            "value": patient_ids
                        }
                    }
                }
            }
        }
    }
)

print(response["output"]["text"], end="\n" * 2)

You can create a Lambda function that serves as the backend for the application. This Lambda function uses the Boto3 library to interact with Amazon Bedrock, specifically to retrieve relevant information from the knowledge base using the retrieve_and_generate API.

Now that the architectural components are in place, you can create a visual interface to display the results.

Streamlit sample app

To showcase the interaction between doctors and the knowledge base, we developed a user-friendly web application using Streamlit, a popular open source Python library for building interactive data apps. Streamlit provides a simple and intuitive way to create custom interfaces that can seamlessly integrate with the various AWS services involved in this solution.

The Streamlit application acts as the frontend for doctors to initiate conversations and interact with the knowledge base. It uses Amazon Cognito for user authentication, so only authorized doctors can access the application and the corresponding patient data. Upon successful authentication, the application interacts with Lambda to handle the RAG workflow using the Amazon Cognito user ID.

Figure 7: Demo

Clean up

It’s important to clean up and delete the resources created during this solution deployment to avoid unnecessary costs. In the provided GitHub repository, you’ll find a section at the end of the notebook dedicated to deleting all the resources created as part of this solution to ensure that you don’t incur any ongoing charges for resources that are no longer needed.

Conclusion

This post has demonstrated the powerful capabilities of metadata filtering within Knowledge Bases for Amazon Bedrock by implementing access control and ensuring data privacy and security in RAG applications. By using metadata fields, organizations can precisely control the subset of data accessible to different users or applications during the RAG process while also improving the relevancy and performance of the search.

Get started with Knowledge Bases for Amazon Bedrock, and let us know your thoughts in the comments section.


About the Authors

Dani Mitchell is a Generative AI Specialist Solutions Architect at Amazon Web Services. He is focused on computer vision use cases and helping customers across EMEA accelerate their ML journey.

Chris Pecora is a Generative AI Data Scientist at Amazon Web Services. He is passionate about building innovative products and solutions while also focused on customer-obsessed science. When not running experiments and keeping up with the latest developments in generative AI, he loves spending time with his kids.

Kshitiz Agarwal is an Engineering Leader at Amazon Web Services (AWS), where he leads the development of Knowledge Bases for Amazon Bedrock. With a decade of experience at Amazon, having joined in 2012, Kshitiz has gained deep insights into the cloud computing landscape. His passion lies in engaging with customers and understanding the innovative ways they leverage AWS to drive their business success. Through his work, Kshitiz aims to contribute to the continuous improvement of AWS services, enabling customers to unlock the full potential of the cloud.

Accenture creates a custom memory-persistent conversational user experience using Amazon Q Business

Traditionally, finding relevant information from documents has been a time-consuming and often frustrating process. Manually sifting through pages upon pages of text, searching for specific details, and synthesizing the information into coherent summaries can be a daunting task. This inefficiency not only hinders productivity but also increases the risk of overlooking critical insights buried within the document’s depths.

Imagine a scenario where a call center agent needs to quickly analyze multiple documents to provide summaries for clients. Previously, this process would involve painstakingly navigating through each document, a task that is both time-consuming and prone to human error.

With the advent of chatbots in the conversational artificial intelligence (AI) domain, you can now upload your documents through an intuitive interface and initiate a conversation by asking specific questions related to your inquiries. The chatbot then analyzes the uploaded documents, using advanced natural language processing (NLP) and machine learning (ML) technologies to provide comprehensive summaries tailored to your questions.

However, the true power lies in the chatbot’s ability to preserve context throughout the conversation. As you navigate through the discussion, the chatbot should maintain a memory of previous interactions, allowing you to review past discussions and retrieve specific details as needed. This seamless experience makes sure you can effortlessly explore the depths of your documents without losing track of the conversation’s flow.

Amazon Q Business is a generative AI-powered assistant that can answer questions, provide summaries, generate content, and securely complete tasks based on data and information in your enterprise systems. It empowers employees to be more creative, data-driven, efficient, prepared, and productive.

This post demonstrates how Accenture used Amazon Q Business to implement a chatbot application that offers straightforward attachment and conversation ID management. This solution can speed up your development workflow, and you can use it without crowding your application code.

“Amazon Q Business distinguishes itself by delivering personalized AI assistance through seamless integration with diverse data sources. It offers accurate, context-specific responses, contrasting with foundation models that typically require complex setup for similar levels of personalization. Amazon Q Business real-time, tailored solutions drive enhanced decision-making and operational efficiency in enterprise settings, making it superior for immediate, actionable insights”

– Dominik Juran, Cloud Architect, Accenture

Solution overview

In this use case, an insurance provider uses a Retrieval Augmented Generation (RAG) based large language model (LLM) implementation to upload and compare policy documents efficiently. Policy documents are preprocessed and stored, allowing the system to retrieve relevant sections based on input queries. This enhances the accuracy, transparency, and speed of policy comparison, making sure clients receive the best coverage options.

This solution augments an Amazon Q Business application with persistent memory and context tracking throughout conversations. As users pose follow-up questions, Amazon Q Business can continually refine responses while recalling previous interactions. This preserves conversational flow when navigating in-depth inquiries.

At the core of this use case lies the creation of a custom Python class for Amazon Q Business, which streamlines the development workflow for this solution. This class offers robust document management capabilities, keeping track of attachments already shared within a conversation as well as new uploads to the Streamlit application. Additionally, it maintains an internal state to persist conversation IDs for future interactions, providing a seamless user experience.

The solution involves developing a web application using Streamlit, Python, and AWS services, featuring a chat interface where users can interact with an AI assistant to ask questions or upload PDF documents for analysis. Behind the scenes, the application uses Amazon Q Business for conversation history management, vectorizing the knowledge base, context creation, and NLP. The integration of these technologies allows for seamless communication between the user and the AI assistant, enabling tasks such as document summarization, question answering, and comparison of multiple documents based on the documents attached in real time.

The code uses Amazon Q Business APIs to interact with Amazon Q Business and send and receive messages within a conversation, specifically the qbusiness client from the boto3 library.

In this use case, we used the German language to test our RAG LLM implementation on 10 different documents and 10 different use cases. Policy documents were preprocessed and stored, enabling accurate retrieval of relevant sections based on input queries. This testing demonstrated the system’s accuracy and effectiveness in handling German language policy comparisons.

The following is a code snippet:

import boto3
import json
from os import environ

class AmazonQHandler:
    def __init__(self, application_id, user_id, conversation_id=None, system_message_id=None):
        self.application_id = application_id
        self.user_id = user_id
        self.qbusiness = boto3.client('qbusiness')
        self.prompt_engineering_instruction = "Ansage: Auf Deutsch, und nur mit den nötigsten Wörter ohne ganze Sätze antworten, bitte"
        # Both IDs stay None until the first response arrives, which starts a new conversation.
        self.parent_message_id = system_message_id
        self.conversation_id = conversation_id

    def process_message(self, initial_message, input_text):
        print('Please ask as many questions as you want. At the end of the session write exit\n')

        message = f'{self.prompt_engineering_instruction}: {input_text}'

        return message

    def send_message(self, input_text, uploaded_file_names=None):
        attachments = []
        message = f'{self.prompt_engineering_instruction}: {input_text}'
        for file_name in uploaded_file_names or []:
            # Read each upload as bytes; the context manager closes the file.
            with open(file_name, "rb") as in_file:
                attachments.append({
                    'data': in_file.read(),
                    'name': file_name
                })

        # Only pass the optional parameters that have a value.
        request = {
            'applicationId': self.application_id,
            'userId': self.user_id,
            'userMessage': message,
        }
        if attachments:
            request['attachments'] = attachments
        if self.conversation_id:
            # Continue the existing conversation from the last system message.
            request['conversationId'] = self.conversation_id
            if self.parent_message_id:
                request['parentMessageId'] = self.parent_message_id

        resp = self.qbusiness.chat_sync(**request)
        if not self.conversation_id:
            # First message: remember the conversation ID for follow-up questions.
            self.conversation_id = resp.get("conversationId")

        print(f'Amazon Q: "{resp.get("systemMessage")}"\n')
        # default=str keeps json.dumps from failing on values such as datetimes.
        print(json.dumps(resp, default=str))
        self.parent_message_id = resp.get("systemMessageId")
        return resp.get("systemMessage")

if __name__ == '__main__':
    application_id = environ.get("APPLICATION_ID", "a392f5e9-50ed-4f93-bcad-6f8a26a8212d")
    user_id = environ.get("USER_ID", "AmazonQ-Administrator")

    amazon_q_handler = AmazonQHandler(application_id, user_id)
    # process_message only builds the prompt; the question here is a placeholder.
    print(amazon_q_handler.process_message(None, "Beispielfrage"))

The architectural flow of this solution is shown in the following diagram.

Q business

The workflow consists of the following steps:

  1. The LLM wrapper application code is containerized using AWS CodePipeline, a fully managed continuous delivery service that automates the build, test, and deploy phases of the software release process.
  2. The application is deployed to Amazon Elastic Container Service (Amazon ECS), a highly scalable and reliable container orchestration service that provides optimal resource utilization and high availability. Because the calls to Amazon Q Business were made from a Flask-based ECS task running Streamlit, we authenticated users with Amazon Cognito user pools rather than AWS IAM Identity Center for simplicity; we had not yet experimented with IAM Identity Center on Amazon Q Business at the time. For instructions to set up IAM Identity Center integration with Amazon Q Business, refer to Setting up Amazon Q Business with IAM Identity Center as identity provider.
  3. Users authenticate through an Amazon Cognito UI, a secure user directory that scales to millions of users and integrates with various identity providers.
  4. A Streamlit application running on Amazon ECS receives the authenticated user’s request.
  5. An instance of the custom AmazonQ class is initiated. If an ongoing Amazon Q Business conversation is present, the correct conversation ID is persisted, providing continuity. If no existing conversation is found, a new conversation is initiated.
  6. Documents attached to the Streamlit state are passed to the instance of the AmazonQ class, which keeps track of the delta between the documents already attached to the conversation ID and the documents yet to be shared. This approach respects and optimizes the five-attachment limit imposed by Amazon Q Business. To avoid repetition in the middleware code we maintain in the Streamlit application, we wrote a custom wrapper class for the Amazon Q Business calls that manages attachments and conversation history internally (as opposed to state-based management at the Streamlit level).
  7. Our wrapper Python class encapsulating the Amazon Q Business instance parses and returns the answers based on the conversation ID and the dynamically provided context derived from the user’s question.
  8. Amazon ECS serves the answer to the authenticated user, providing a secure and scalable delivery of the response.
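The attachment tracking described in step 6 can be sketched as follows. This is an illustrative helper, not the wrapper class from the post: the `AttachmentTracker` name and its methods are hypothetical, and the sketch assumes the five-attachment limit is counted across the whole conversation. If the limit applies per request instead, the check would move to the size of each delta.

```python
MAX_ATTACHMENTS = 5  # the limit mentioned in step 6; its scope is an assumption


class AttachmentTracker:
    """Remember which files were already shared in the current conversation."""

    def __init__(self):
        self.sent = set()

    def pending(self, uploaded_file_names):
        """Return uploads not yet shared, in upload order and without duplicates."""
        seen = set(self.sent)
        delta = []
        for name in uploaded_file_names:
            if name not in seen:
                seen.add(name)
                delta.append(name)
        return delta

    def mark_sent(self, file_names):
        """Record files as shared, refusing to exceed the attachment limit."""
        new = set(file_names) - self.sent
        if len(self.sent) + len(new) > MAX_ATTACHMENTS:
            raise ValueError("attachment limit exceeded for this conversation")
        self.sent.update(new)
```

On each Streamlit rerun, the application would pass only `tracker.pending(uploads)` to the wrapper and call `mark_sent` after a successful response, so files are never re-sent within the same conversation.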

Prerequisites

This solution has the following prerequisites:

  • You must have an AWS account where you will be able to create access keys and configure services like Amazon Simple Storage Service (Amazon S3) and Amazon Q Business
  • Python must be installed in your environment, along with the necessary libraries such as boto3
  • The Streamlit library for Python must be installed and configured

Deploy the solution

The deployment process entails provisioning the required AWS infrastructure, configuring environment variables, and deploying the application code. This is accomplished by using AWS services such as CodePipeline and Amazon ECS for container orchestration and Amazon Q Business for NLP.

Additionally, Amazon Cognito is integrated with Amazon ECS using the AWS Cloud Development Kit (AWS CDK) and user pools are used for user authentication and management. After deployment, you can access the application through a web browser. Amazon Q Business is called from the ECS task. It is crucial to establish proper access permissions and security measures to safeguard user data and uphold the application’s integrity.

We use AWS CDK to deploy a web application using Amazon ECS with AWS Fargate, Amazon Cognito for user authentication, and AWS Certificate Manager for SSL/TLS certificates.

To deploy the infrastructure, run the following commands:

  • npm install to install dependencies
  • npm run build to build the TypeScript code
  • npx cdk synth to synthesize the AWS CloudFormation template
  • npx cdk deploy to deploy the infrastructure

The following screenshot shows our deployed CloudFormation stack.

UI demonstration

The following screenshot shows the home page when a user opens the application in a web browser.

The following screenshot shows an example response from Amazon Q Business when no file was uploaded and no relevant answer to the question was found.

The following screenshot illustrates the entire application flow, where the user asked a question before a file was uploaded, then uploaded a file, and asked the same question again. The response from Amazon Q Business after uploading the file is different from the first query (for testing purposes, we used a very simple file with randomly generated text in PDF format).

Solution benefits

This solution offers the following benefits:

  • Efficiency – Automation enhances productivity by streamlining document analysis, saving time, and optimizing resources
  • Accuracy – Advanced techniques provide precise data extraction and interpretation, reducing errors and improving reliability
  • User-friendly experience – The intuitive interface and conversational design make it accessible to all users, encouraging adoption and straightforward integration into workflows

This containerized architecture allows the solution to scale seamlessly while optimizing request throughput. Persisting the conversation state enhances precision by continuously expanding dialog context. Overall, this solution can help you balance performance with the fidelity of a persistent, context-aware AI assistant through Amazon Q Business.

Clean up

After deployment, you should implement a thorough cleanup plan to maintain efficient resource management and mitigate unnecessary costs, particularly concerning the AWS services used in the deployment process. This plan should include the following key steps:

  • Delete AWS resources – Identify and delete any unused AWS resources, such as EC2 instances, ECS clusters, and other infrastructure provisioned for the application deployment. This can be achieved through the AWS Management Console or AWS Command Line Interface (AWS CLI).
  • Delete CodeCommit repositories – Remove any CodeCommit repositories created for storing the application’s source code. This helps declutter the repository list and prevents additional charges for unused repositories.
  • Review and adjust CodePipeline configuration – Review the configuration of CodePipeline and make sure there are no active pipelines associated with the deployed application. If pipelines are no longer required, consider deleting them to prevent unnecessary runs and associated costs.
  • Evaluate Amazon Cognito user pools – Evaluate the user pools configured in Amazon Cognito and remove any unnecessary pools or configurations. Adjust the settings to optimize costs and adhere to the application’s user management requirements.

By diligently implementing these cleanup procedures, you can effectively minimize expenses, optimize resource usage, and maintain a tidy environment for future development iterations or deployments. Additionally, regular review and adjustment of AWS services and configurations is recommended to provide ongoing cost-effectiveness and operational efficiency.

If the solution runs in AWS Amplify or is provisioned by the AWS CDK, you don’t need to remove each resource described in this section individually; deleting the Amplify application or AWS CDK stack is enough to get rid of all the resources associated with the application.

Conclusion

In this post, we showcased how Accenture created a custom memory-persistent conversational assistant using AWS generative AI services. The solution can cater to clients developing end-to-end conversational persistent chatbot applications at a large scale following the provided architectural practices and guidelines.

The joint effort between Accenture and AWS builds on the 15-year strategic relationship between the companies and uses the same proven mechanisms and accelerators built by the Accenture AWS Business Group (AABG). Connect with the AABG team at accentureaws@amazon.com to drive business outcomes by transforming to an intelligent data enterprise on AWS.

For further information about generative AI on AWS using Amazon Bedrock or Amazon Q Business, we recommend the following resources:

You can also sign up for the AWS generative AI newsletter, which includes educational resources, blog posts, and service updates.


About the Authors

Dominik Juran works as a full stack developer at Accenture with a focus on AWS technologies and AI. He also has a passion for ice hockey.

Milica Bozic works as a Cloud Engineer at Accenture, specializing in AWS Cloud solutions for the specific needs of clients, with a background in telecommunications, particularly 4G and 5G technologies. Mili is passionate about art, books, and movement training, finding inspiration in creative expression and physical activity.

Zdenko Estok works as a cloud architect and DevOps engineer at Accenture. He works with AABG to develop and implement innovative cloud solutions, and specializes in infrastructure as code and cloud security. Zdenko likes to bike to the office and enjoys pleasant walks in nature.

Selimcan “Can” Sakar is a cloud first developer and solution architect at Accenture with a focus on artificial intelligence and a passion for watching models converge.

Shikhar Kwatra is a Sr. AI/ML Specialist Solutions Architect at Amazon Web Services, working with leading Global System Integrators. He has earned the title of one of the Youngest Indian Master Inventors with over 500 patents in the AI/ML and IoT domains. Shikhar aids in architecting, building, and maintaining cost-efficient, scalable cloud environments for the organization, and supports the GSI partner in building strategic industry solutions on AWS. Shikhar enjoys playing guitar, composing music, and practicing mindfulness in his spare time.

Create an end-to-end serverless digital assistant for semantic search with Amazon Bedrock

With the rise of generative artificial intelligence (AI), an increasing number of organizations use digital assistants to have their end-users ask domain-specific questions, using Retrieval Augmented Generation (RAG) over their enterprise data sources.

As organizations transition from proofs of concept to production workloads, they establish objectives to run and scale their workloads with minimal operational overhead, while optimizing on costs. Organizations also require the implementation of common security practices such as identity and access management, to make sure that only authorized and authenticated users are allowed to perform specific actions or access specific resources.

This post covers a solution to create an end-to-end digital assistant as a web application using a serverless architecture to address these requirements. Because the solution components primarily use serverless technologies, it provides several benefits, such as automatic scaling, built-in high availability, and a pay-per-use billing model to optimize on costs. The solution also includes an authentication layer and an authorization layer to manage identities and permissions.

This solution also uses the hybrid search feature of Knowledge Bases for Amazon Bedrock to increase the relevancy of retrieved results using RAG. When receiving a query from an end-user, hybrid search performs both a semantic search and a keyword search:

  • A semantic search provides results based on the meaning and intent within the query
  • A keyword search provides results based on specific entities in a query such as product codes or acronyms

For example, if a user submits a prompt that includes keywords, a text-based search may provide better results than a semantic search. This is why hybrid search combines the two approaches: the precision of semantic search and coverage of keywords. For more information about hybrid search, see Knowledge Bases for Amazon Bedrock now supports hybrid search.

In this post, we provide an operational overview of the solution, and then describe how to set it up with the following services:

  • Amazon Bedrock and a knowledge base to generate responses from user questions based on enterprise data sources. Amazon Bedrock is a fully managed service that makes a wide range of foundation models (FMs) available through an API without having to manage any infrastructure. Refer to the Amazon Bedrock FAQs for further details.
  • An Amazon OpenSearch Serverless vector engine to store enterprise data as vectors to perform semantic search.
  • AWS Amplify to create and deploy the web application.
  • Amazon API Gateway and AWS Lambda to create an API with an authentication layer and integrate with Amazon Bedrock.
  • Amazon Cognito to implement an identity platform (user directory and authorization management) for the web application.
  • Amazon Simple Storage Service (Amazon S3) to store the enterprise data used by the solution and web application-related assets.

Solution overview

The solution architecture involves the following steps:

  1. The user authenticates to the web application (the digital assistant UI).
  2. Amazon Cognito validates the authentication details.
  3. The user submits a request using the web application.
  4. The request is sent by the web application to the API.
  5. The API calls a Lambda authorizer to confirm that the user is authorized to perform the operation.
  6. The request is sent from the API to a Lambda function.
  7. The Lambda function submits the request as a prompt to a knowledge base (Knowledge Bases for Amazon Bedrock), and explicitly requests a hybrid search to be performed using the Amazon Bedrock API.
  8. Amazon Bedrock retrieves relevant data from the vector store (using the vector engine for OpenSearch Serverless) using hybrid search.
  9. Amazon Bedrock submits a prompt to a foundation model.

After Step 9, the foundation model generates a response, which is returned to the user in the web application’s digital assistant.
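Steps 6 and 7 can be sketched as a Lambda handler like the one below. This is a minimal illustration rather than the function shipped in the repository: it assumes an API Gateway proxy event whose JSON body has a `prompt` field, and `KNOWLEDGE_BASE_ID` and `MODEL_ARN` environment variables, all of which are hypothetical names. The hybrid search is requested with the `overrideSearchType` setting of the `retrieve_and_generate` API.

```python
import json
import os


def build_hybrid_request(question, knowledge_base_id, model_arn, number_of_results=5):
    """Build retrieve_and_generate arguments that explicitly ask for hybrid search."""
    return {
        "input": {"text": question},
        "retrieveAndGenerateConfiguration": {
            "type": "KNOWLEDGE_BASE",
            "knowledgeBaseConfiguration": {
                "knowledgeBaseId": knowledge_base_id,
                "modelArn": model_arn,
                "retrievalConfiguration": {
                    "vectorSearchConfiguration": {
                        "numberOfResults": number_of_results,
                        # HYBRID combines semantic and keyword search.
                        "overrideSearchType": "HYBRID",
                    }
                },
            },
        },
    }


def lambda_handler(event, context, client=None):
    # The event shape and environment variable names are assumptions.
    if client is None:
        import boto3  # imported lazily so the request builder can be tested offline

        client = boto3.client("bedrock-agent-runtime")
    prompt = json.loads(event["body"])["prompt"]
    request = build_hybrid_request(
        prompt, os.environ["KNOWLEDGE_BASE_ID"], os.environ["MODEL_ARN"]
    )
    response = client.retrieve_and_generate(**request)
    return {
        "statusCode": 200,
        "body": json.dumps({"answer": response["output"]["text"]}),
    }
```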

The following diagram illustrates this workflow.

Prerequisites

To follow along and set up this solution, you must have the following:

  • An AWS account
  • A device with access to your AWS account with the following:
  • Model access to the following models in Amazon Bedrock: Titan Embeddings G1 – Text and Claude Instant

Upload documents and create a knowledge base

In this section, we create a knowledge base in Amazon Bedrock. The knowledge base will enrich the prompt submitted to an Amazon Bedrock foundation model with contextual information derived from our data source (in our case, documents uploaded to an S3 bucket).

During the creation of the knowledge base, a vector store will also be created to ingest documents encoded as vectors, using an embeddings model. An embeddings model encodes data as vectors in order to capture the meaning and context of our sample documents. This allows us to find data relevant to our end-user prompts.
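To make the idea of finding relevant data through vectors concrete, here is a toy sketch of similarity ranking. The three-dimensional vectors and document labels are made up for illustration; real embeddings from an embeddings model have many more dimensions, and the vector store performs this comparison for you.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; higher means more similar."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


# Hypothetical embeddings of two document chunks (not produced by a real model).
documents = {
    "doc_security": [0.9, 0.1, 0.0],
    "doc_billing": [0.1, 0.9, 0.2],
}

# Hypothetical embedding of the user's question.
query = [0.8, 0.2, 0.1]

# Rank documents from most to least similar to the query.
ranked = sorted(
    documents,
    key=lambda name: cosine_similarity(query, documents[name]),
    reverse=True,
)
print(ranked)
```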

For our use case, we use the vector engine for OpenSearch Serverless as a vector store and the Titan Embeddings G1 – Text model as the embeddings model.

Complete the following steps to create an S3 bucket to upload documents, and synchronize them with a knowledge base in Amazon Bedrock:

  1. Create an S3 bucket in your account.
  2. Upload the following documents in the S3 bucket:
  3. Create a knowledge base with the following configuration:
    • For Knowledge base name, enter assistant-knowledgebase.
    • For Knowledge base description, enter Knowledge base for digital assistant.
    • For IAM permissions, select Create and use a new service role.
    • For Data source name, enter assistant-knowledgebase-datasource.
    • For S3 URI, enter the URI of the previously created S3 bucket (for example, s3://#s3-bucket-name#).
    • For Embeddings model, choose Titan G1 Embeddings – Text.
    • For Vector database, select Quick create a new vector store.
  4. Ingest and synchronize the documents in the knowledge base.
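Step 4 can also be scripted. The following sketch starts an ingestion job with the boto3 `bedrock-agent` client and polls until it finishes. The helper name and its parameters are illustrative, and it assumes you already know the knowledge base ID and data source ID.

```python
import time


def sync_data_source(client, knowledge_base_id, data_source_id,
                     poll_seconds=5, sleep=time.sleep):
    """Start an ingestion job and wait until it leaves the in-progress states."""
    job = client.start_ingestion_job(
        knowledgeBaseId=knowledge_base_id,
        dataSourceId=data_source_id,
    )["ingestionJob"]
    while job["status"] in ("STARTING", "IN_PROGRESS"):
        sleep(poll_seconds)
        job = client.get_ingestion_job(
            knowledgeBaseId=knowledge_base_id,
            dataSourceId=data_source_id,
            ingestionJobId=job["ingestionJobId"],
        )["ingestionJob"]
    # Typically COMPLETE or FAILED at this point.
    return job["status"]
```

With AWS credentials configured, you would call it as `sync_data_source(boto3.client("bedrock-agent"), kb_id, ds_id)`.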

Create the API and backend

In this section, we create the following resources:

  • A user directory for web authentication and authorization, created with an Amazon Cognito user pool.
  • An API created with Amazon API Gateway. This will expose a single-entry door interface to our digital assistant’s web application.
  • An authorization layer in our API, to protect our backend from unauthorized users. This will be implemented with a Lambda authorizer function to validate that incoming requests include valid authorization details.
  • A Lambda function behind the API, which will submit prompts to a knowledge base and return responses back to the API.
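As a rough illustration of the authorization layer, the sketch below shows the general shape of a token-based Lambda authorizer for API Gateway. It is not the code in the lambda-auth folder. In particular, `is_token_valid` is a placeholder: a real authorizer would verify the signature, expiry, and audience of the Amazon Cognito token.

```python
def build_policy(principal_id, effect, resource):
    """Return the response format API Gateway expects from a Lambda authorizer."""
    return {
        "principalId": principal_id,
        "policyDocument": {
            "Version": "2012-10-17",
            "Statement": [
                {
                    "Action": "execute-api:Invoke",
                    "Effect": effect,
                    "Resource": resource,
                }
            ],
        },
    }


def is_token_valid(token):
    # PLACEHOLDER (assumption): this only checks the header shape and is NOT a
    # real validation. Replace it with proper verification of the Cognito JWT.
    return bool(token) and token.startswith("Bearer ")


def lambda_handler(event, context):
    # TOKEN authorizers receive the header value and the ARN of the called method.
    token = event.get("authorizationToken", "")
    effect = "Allow" if is_token_valid(token) else "Deny"
    return build_policy("user", effect, event["methodArn"])
```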

Complete the following steps to create the API and the backend of the digital assistant’s web application, using AWS CloudFormation templates:

  1. Clone the GitHub repository.
  2. Navigate to the api folder, which includes the following content:
    • A template named webapp-userpool-stack.yml for the Amazon Cognito user pool
    • A template named webapp-lambda-stack.yml for the Lambda function calling a knowledge base
    • A template named webapp-api-stack.yml for the API and the Lambda authorizer function
    • A subfolder named lambda-auth for the Lambda authorizer function code
    • A subfolder named lambda-knowledgebase for the Lambda function calling a knowledge base
    • A script named cognito-create-testuser.sh to create a test user in the Amazon Cognito user pool
  3.  Create the Amazon Cognito user pool of the web application using the following AWS Command Line Interface (AWS CLI) command:
    aws cloudformation create-stack --stack-name webapp-userpool-stack --template-body file://webapp-userpool-stack.yml

  4. Go to the lambda-knowledgebase folder and download the dependencies with the following command:
    pip install -r requirements.txt -t .

  5. Create a .zip file named lambda-knowledgebase.zip with the Lambda code and its dependencies (the .zip file’s root directory must include the Lambda code and its dependencies).
  6. From the api folder, go to the lambda-auth folder and download the dependencies with the following command:
    pip install -r requirements.txt -t .

  7. Create a .zip file named lambda-auth.zip with the Lambda code and its dependencies (the .zip file’s root directory must include the Lambda code and its dependencies).
  8. Create an S3 bucket in your account.
  9. Upload both .zip files (lambda-auth.zip and lambda-knowledgebase.zip) to the S3 bucket.
  10. Go back to the api folder and create the Lambda function of the web application using the following AWS CLI command (provide your S3 bucket and knowledge base ID):
aws cloudformation create-stack \
--stack-name webapp-lambda-knowledgebase-stack \
--capabilities "CAPABILITY_IAM" \
--template-body file://webapp-lambda-knowledgebase-stack.yml \
--parameters ParameterKey=BedrockKnowledgeBaseId,ParameterValue=#bedrock-knowledgebase-id# \
ParameterKey=BedrockLambdaS3Bucket,ParameterValue=#lambdacode-s3-bucket-name# \
ParameterKey=BedrockLambdaS3Key,ParameterValue=lambda-knowledgebase.zip

You can retrieve the knowledge base ID by running the following AWS CLI command:

aws bedrock-agent list-knowledge-bases \
--output text \
--query 'knowledgeBaseSummaries[?name==`assistant-knowledgebase`].knowledgeBaseId'

  11. Create the API of the web application using the following AWS CLI command (provide your bucket name):
aws cloudformation create-stack \
--stack-name webapp-api-stack \
--capabilities "CAPABILITY_IAM" \
--template-body file://webapp-api-stack.yml \
--parameters ParameterKey=LambdaAuthorizerS3Bucket,ParameterValue=#lambdacode-s3-bucket-name# \
ParameterKey=LambdaAuthorizerS3Key,ParameterValue=lambda-auth.zip
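The knowledge base ID lookup used when creating the Lambda stack can also be done from Python. The following sketch mirrors the AWS CLI query with the boto3 `bedrock-agent` client; the helper name is illustrative.

```python
def find_knowledge_base_id(client, name):
    """Return the ID of the knowledge base with the given name, or None."""
    kwargs = {}
    while True:
        page = client.list_knowledge_bases(**kwargs)
        for summary in page.get("knowledgeBaseSummaries", []):
            if summary["name"] == name:
                return summary["knowledgeBaseId"]
        token = page.get("nextToken")
        if not token:
            return None
        # Fetch the next page of results.
        kwargs["nextToken"] = token
```

With AWS credentials configured, you would call it as `find_knowledge_base_id(boto3.client("bedrock-agent"), "assistant-knowledgebase")`.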

Configure the Amazon Cognito user pool

In this section, we create a user in our Amazon Cognito user pool. This user will be used to log in to our web application.

Complete the following steps to configure the Amazon Cognito user pool created in the previous section:

  1. On the Amazon Cognito console, access the user pool named webapp-userpool.
  2. On the Users tab, choose Create a user.
  3. For Invitation message, select Send an email invitation.
  4. For Email address section, enter your email address and select Mark email address as verified.
  5. For Temporary password, select Generate a password.
  6. Choose Create user.


You can also complete these steps by running the script cognito-create-testuser.sh available in the api folder as follows (provide your email address):

./cognito-create-testuser.sh #your-email-address#

After you create the user, you should receive an email with a temporary password in this format: “Your username is #your-email-address# and temporary password is #temporary-password#.”

Keep note of these login details (email address and temporary password) to use later when testing the web application.
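For reference, the console steps above roughly correspond to a single `admin_create_user` call on the boto3 `cognito-idp` client. The sketch below only builds the request and does not reproduce the cognito-create-testuser.sh script; the helper name is illustrative, and you must supply the user pool ID yourself.

```python
def build_create_user_request(user_pool_id, email):
    """Build admin_create_user arguments for a test user invited by email."""
    return {
        "UserPoolId": user_pool_id,
        "Username": email,
        "UserAttributes": [
            {"Name": "email", "Value": email},
            # Matches the "Mark email address as verified" option.
            {"Name": "email_verified", "Value": "true"},
        ],
        # Send the invitation, including the temporary password, by email.
        # Omitting TemporaryPassword lets Amazon Cognito generate one.
        "DesiredDeliveryMediums": ["EMAIL"],
    }
```

With AWS credentials configured, you would pass the result to `boto3.client("cognito-idp").admin_create_user(**request)`.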

Create the web application

In this section, we build a web application using Amplify and publish it to make it accessible through an endpoint URL. To complete this section, you must first install and set up the Amplify CLI, as discussed in the prerequisites.

Complete the following steps to create the web application of the digital assistant:

  1. Go back to the root folder of the repository and open the frontend folder.
  2. Run the script amplify-setup.sh to create the Amplify application:
    ./amplify-setup.sh

The amplify-setup.sh script creates an Amplify application and configures it to integrate with the resources you created in the previous sections:

    • The Amazon Cognito user pool to authenticate our user through the web application’s login page
    • The Amazon API Gateway to process prompts submitted using the web application’s chat interface
  3. Configure the hosting of the Amplify application using the following command:
    amplify add hosting

  4. Choose the following options:
    • For Select the plugin module to execute, choose Hosting with Amplify Console (Managed hosting with custom domains, Continuous deployment).
    • For Choose a type, choose Manual deployment.

In this step, we configure how the web application will be deployed and hosted:

    • The web application will be hosted using the Amplify console, which offers fully managed hosting
    • The web application will be deployed using manual deployment, which allows us to publish our web application to the Amplify console without connecting a Git provider
  5. Publish the Amplify application using the following command:
    amplify publish --yes

The web application is now available for testing and a URL should be displayed, as shown in the following screenshot. Take note of the URL to use in the following section.

Test the digital assistant

In this section, you test the web application of the digital assistant:

  1. Open the URL of the Amplify application in your browser.
  2. Enter your login information (your email and the temporary password you received earlier while configuring the user pool in Amazon Cognito) and choose Sign in.
  3. When prompted, enter a new password and choose Change Password.
  4. You should now be able to see a chat interface.
  5. Ask a question to test the assistant. For example, “What is the OPS number related to health of operations in the Well-Architected Framework?”

You should receive a response along with sources, as shown in the following screenshot.

Clean up

To make sure that no additional cost is incurred, remove the resources provisioned in your account. Make sure you’re in the correct AWS account before deleting the following resources.

  1. Delete the knowledge base.
  2. Delete the CloudFormation stacks (provide the AWS Region where you created your resources):
    aws cloudformation delete-stack --stack-name webapp-api-stack --region #region#
    aws cloudformation delete-stack --stack-name webapp-lambda-knowledgebase-stack --region #region#
    aws cloudformation delete-stack --stack-name webapp-userpool-stack --region #region#

  3. Delete the Amplify application with the following AWS CLI command (provide your application ID and the Region where it was created):
    aws amplify delete-app --app-id #app-id# --region #region#

  4. You can retrieve the app id by running the following AWS CLI command:
    aws amplify list-apps --query 'apps[?name==`frontend`].appId'

  5. Delete the S3 buckets.

You should exercise caution when performing the preceding steps. Make sure you are deleting the resources in the correct AWS account.
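For readers who prefer to script the cleanup, the CloudFormation and Amplify commands above can be sketched with Boto3. This is an illustrative sketch under assumptions: the function names are hypothetical, the stack names and the frontend app name are taken from the commands above, and the clients are expected to be created for the Region where you deployed the resources. Stack deletion is asynchronous, so the calls return before the stacks are gone.

```python
# Stack names as used in the AWS CLI commands above
STACK_NAMES = [
    "webapp-api-stack",
    "webapp-lambda-knowledgebase-stack",
    "webapp-userpool-stack",
]


def find_app_ids(apps, name="frontend"):
    """Return the IDs of Amplify apps whose name matches exactly."""
    return [app["appId"] for app in apps if app["name"] == name]


def clean_up(cloudformation_client, amplify_client):
    """Request deletion of the stacks and of the Amplify app named 'frontend'."""
    for stack_name in STACK_NAMES:
        cloudformation_client.delete_stack(StackName=stack_name)

    deleted_app_ids = []
    apps = amplify_client.list_apps()["apps"]
    for app_id in find_app_ids(apps):
        amplify_client.delete_app(appId=app_id)
        deleted_app_ids.append(app_id)
    return deleted_app_ids


# Usage (requires AWS credentials; double-check the account and Region first):
#   import boto3
#   clean_up(boto3.client("cloudformation", region_name="<region>"),
#            boto3.client("amplify", region_name="<region>"))
```

The knowledge base and the S3 buckets are not covered by this sketch and still need to be deleted separately.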

Conclusion

In this post, we walked through a solution to create a digital assistant using serverless services. First, we created a knowledge base and ingested documents into it from an S3 bucket. Then we created an API and a Lambda function to submit prompts to the knowledge base. We also configured a user pool to grant a user access to the digital assistant’s web application. Finally, we created the frontend of the web application in Amplify.

For further information on the services used, consult the Amazon Bedrock, Security in Amazon Bedrock, Amazon OpenSearch Serverless, AWS Amplify, Amazon API Gateway, AWS Lambda, Amazon Cognito, and Amazon S3 product pages.

To dive deeper into this solution, a self-paced workshop is available in AWS Workshop Studio, at this location.


About the author

Mehdi Amrane is a Senior Solutions Architect at Amazon Web Services. He supports customers on their initiatives and provides them prescriptive guidance to achieve their goals, and accelerate their cloud journey. He is passionate about creating content on application architecture, DevOps and Serverless technologies.


Build a self-service digital assistant using Amazon Lex and Knowledge Bases for Amazon Bedrock


Organizations strive to implement efficient, scalable, cost-effective, and automated customer support solutions without compromising the customer experience. Generative artificial intelligence (AI)-powered chatbots play a crucial role in delivering human-like interactions by providing responses from a knowledge base without the involvement of live agents. These chatbots can be efficiently utilized for handling generic inquiries, freeing up live agents to focus on more complex tasks.

Amazon Lex provides advanced conversational interfaces using voice and text channels. Its natural language understanding capabilities identify user intent more accurately and fulfill it faster.

Amazon Bedrock simplifies the process of developing and scaling generative AI applications powered by large language models (LLMs) and other foundation models (FMs). It offers access to a diverse range of FMs from leading providers such as Anthropic Claude, AI21 Labs, Cohere, and Stability AI, as well as Amazon’s proprietary Amazon Titan models. Additionally, Knowledge Bases for Amazon Bedrock empowers you to develop applications that harness the power of Retrieval Augmented Generation (RAG), an approach where retrieving relevant information from data sources enhances the model’s ability to generate contextually appropriate and informed responses.

The generative AI capability of QnAIntent in Amazon Lex lets you securely connect FMs to company data for RAG. QnAIntent provides an interface to use enterprise data and FMs on Amazon Bedrock to generate relevant, accurate, and contextual responses. You can use QnAIntent with new or existing Amazon Lex bots to automate FAQs through text and voice channels, such as Amazon Connect.

With this capability, you no longer need to create variations of intents, sample utterances, slots, and prompts to predict and handle a wide range of FAQs. You can simply connect QnAIntent to company knowledge sources and the bot can immediately handle questions using the allowed content.

In this post, we demonstrate how you can build chatbots with QnAIntent that connects to a knowledge base in Amazon Bedrock (powered by Amazon OpenSearch Serverless as a vector database) and build rich, self-service, conversational experiences for your customers.

Solution overview

The solution uses Amazon Lex, Amazon Simple Storage Service (Amazon S3), and Amazon Bedrock in the following steps:

  1. Users interact with the chatbot through a prebuilt Amazon Lex web UI.
  2. Each user request is processed by Amazon Lex to determine user intent through a process called intent recognition.
  3. Amazon Lex provides the built-in generative AI feature QnAIntent, which can be directly attached to a knowledge base to fulfill user requests.
  4. Knowledge Bases for Amazon Bedrock uses the Amazon Titan embeddings model to convert the user query to a vector and queries the knowledge base to find the chunks that are semantically similar to the user query. The user prompt is augmented along with the results returned from the knowledge base as an additional context and sent to the LLM to generate a response.
  5. The generated response is returned through QnAIntent and sent back to the user in the chat application through Amazon Lex.

The following diagram illustrates the solution architecture and workflow.
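To make step 4 more concrete, the retrieve-then-generate flow can be exercised directly against Knowledge Bases for Amazon Bedrock with Boto3. This is a sketch for illustration only: QnAIntent handles this for you, and the source does not say which API calls Amazon Lex makes internally. The function names are hypothetical, and the knowledge base ID and model ARN are values you supply.

```python
def build_rag_request(question, knowledge_base_id, model_arn):
    """Build the arguments for the RetrieveAndGenerate operation."""
    return {
        "input": {"text": question},
        "retrieveAndGenerateConfiguration": {
            "type": "KNOWLEDGE_BASE",
            "knowledgeBaseConfiguration": {
                "knowledgeBaseId": knowledge_base_id,
                "modelArn": model_arn,
            },
        },
    }


def ask_knowledge_base(agent_runtime_client, question, knowledge_base_id, model_arn):
    """Retrieve relevant chunks, generate an answer, and return its text."""
    response = agent_runtime_client.retrieve_and_generate(
        **build_rag_request(question, knowledge_base_id, model_arn)
    )
    return response["output"]["text"]


# Usage (requires AWS credentials and model access in Amazon Bedrock):
#   import boto3
#   client = boto3.client("bedrock-agent-runtime")
#   ask_knowledge_base(client, "<your question>", "<knowledge-base-id>", "<model-arn>")
```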

In the following sections, we look at the key components of the solution in more detail and the high-level steps to implement the solution:

  1. Create a knowledge base in Amazon Bedrock for OpenSearch Serverless.
  2. Create an Amazon Lex bot.
  3. Create a new generative AI-powered intent in Amazon Lex using the built-in QnAIntent and point it to the knowledge base.
  4. Deploy the sample Amazon Lex web UI available in the GitHub repo. Use the provided AWS CloudFormation template in your preferred AWS Region and configure the bot.

Prerequisites

To implement this solution, you need the following:

  1. An AWS account with privileges to create AWS Identity and Access Management (IAM) roles and policies. For more information, see Overview of access management: Permissions and policies.
  2. Familiarity with AWS services such as Amazon S3, Amazon Lex, Amazon OpenSearch Service, and Amazon Bedrock.
  3. Access enabled for the Amazon Titan Embeddings G1 – Text model and Anthropic Claude 3 Haiku on Amazon Bedrock. For instructions, see Model access.
  4. A data source in Amazon S3. For this post, we use Amazon shareholder docs (Amazon Shareholder letters – 2023 & 2022) as a data source to hydrate the knowledge base.

Create a knowledge base

To create a new knowledge base in Amazon Bedrock, complete the following steps. For more information, refer to Create a knowledge base.

  1. On the Amazon Bedrock console, choose Knowledge bases in the navigation pane.
  2. Choose Create knowledge base.
  3. On the Provide knowledge base details page, enter a knowledge base name, IAM permissions, and tags.
  4. Choose Next.
  5. For Data source name, Amazon Bedrock prepopulates the auto-generated data source name; however, you can change it to your requirements.
  6. Keep the data source location as the same AWS account and choose Browse S3.
  7. Select the S3 bucket where you uploaded the Amazon shareholder documents and choose Choose.
    This will populate the S3 URI, as shown in the following screenshot.
  8. Choose Next.
  9. Select the embedding model to vectorize the documents. For this post, we select Titan embedding G1 – Text v1.2.
  10. Select Quick create a new vector store to create a default vector store with OpenSearch Serverless.
  11. Choose Next.
  12. Review the configurations and create your knowledge base.
    After the knowledge base is successfully created, you should see a knowledge base ID, which you need when creating the Amazon Lex bot.
  13. Choose Sync to index the documents.
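
The Sync action in the last step can also be started programmatically. The following is a minimal sketch, assuming you already know the knowledge base ID and the data source ID (both shown on the knowledge base details page); start_sync is a hypothetical helper name.

```python
def start_sync(bedrock_agent_client, knowledge_base_id, data_source_id):
    """Start an ingestion job for a data source and return (job ID, status)."""
    response = bedrock_agent_client.start_ingestion_job(
        knowledgeBaseId=knowledge_base_id,
        dataSourceId=data_source_id,
    )
    job = response["ingestionJob"]
    return job["ingestionJobId"], job["status"]


# Usage (requires AWS credentials):
#   import boto3
#   start_sync(boto3.client("bedrock-agent"), "<knowledge-base-id>", "<data-source-id>")
```

The job runs asynchronously, so the returned status describes the job at the moment of the call, not the final result.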

Create an Amazon Lex bot

Complete the following steps to create your bot:

  1. On the Amazon Lex console, choose Bots in the navigation pane.
  2. Choose Create bot.
  3. For Creation method, select Create a blank bot.
  4. For Bot name, enter a name (for example, FAQBot).
  5. For Runtime role, select Create a new IAM role with basic Amazon Lex permissions to access other services on your behalf.
  6. Configure the remaining settings based on your requirements and choose Next.
  7. On the Add language to bot page, you can choose from different languages supported.
    For this post, we choose English (US).
  8. Choose Done.

    After the bot is successfully created, you’re redirected to create a new intent.
  9. Add utterances for the new intent and choose Save intent.

Add QnAIntent to your intent

Complete the following steps to add QnAIntent:

  1. On the Amazon Lex console, navigate to the intent you created.
  2. On the Add intent dropdown menu, choose Use built-in intent.
  3. For Built-in intent, choose AMAZON.QnAIntent – GenAI feature.
  4. For Intent name, enter a name (for example, QnABotIntent).
  5. Choose Add.

    After you add the QnAIntent, you’re redirected to configure the knowledge base.
  6. For Select model, choose Anthropic and Claude 3 Haiku.
  7. For Choose a knowledge store, select Knowledge base for Amazon Bedrock and enter your knowledge base ID.
  8. Choose Save intent.
  9. After you save the intent, choose Build to build the bot.
    You should see a Successfully built message when the build is complete.
    You can now test the bot on the Amazon Lex console.
  10. Choose Test to launch a draft version of your bot in a chat window within the console.
  11. Enter questions to get responses.
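
The same test can be run outside the console with the Amazon Lex V2 runtime API. This is a sketch under assumptions: ask_bot is a hypothetical helper, the bot ID is a value you supply, and the default alias ID TSTALIASID refers to the built-in test alias that points at the draft version.

```python
import uuid


def ask_bot(lex_runtime_client, bot_id, text,
            bot_alias_id="TSTALIASID", locale_id="en_US", session_id=None):
    """Send one utterance to the bot and return the text of its replies."""
    response = lex_runtime_client.recognize_text(
        botId=bot_id,
        botAliasId=bot_alias_id,
        localeId=locale_id,
        # A random session ID starts a fresh conversation
        sessionId=session_id or str(uuid.uuid4()),
        text=text,
    )
    return [m["content"] for m in response.get("messages", []) if "content" in m]


# Usage (requires AWS credentials and a built bot):
#   import boto3
#   ask_bot(boto3.client("lexv2-runtime"), "<bot-id>", "<your question>")
```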

Deploy the Amazon Lex web UI

The Amazon Lex web UI is a prebuilt fully featured web client for Amazon Lex chatbots. It eliminates the heavy lifting of recreating a chat UI from scratch. You can quickly deploy its features and minimize time to value for your chatbot-powered applications. Complete the following steps to deploy the UI:

  1. Follow the instructions in the GitHub repo.
  2. Before you deploy the CloudFormation template, update the LexV2BotId and LexV2BotAliasId values in the template based on the chatbot you created in your account.
  3. After the CloudFormation stack is deployed successfully, copy the WebAppUrl value from the stack Outputs tab.
  4. Navigate to the web UI to test the solution in your browser.
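
If you prefer to read the WebAppUrl output with Boto3 instead of the console, the lookup can be sketched as follows. The helper name get_stack_output is hypothetical, and the stack name is whatever you chose when deploying the template.

```python
def get_stack_output(describe_stacks_response, output_key="WebAppUrl"):
    """Return the value of one stack output, or None if it is not present."""
    outputs = describe_stacks_response["Stacks"][0].get("Outputs", [])
    for output in outputs:
        if output["OutputKey"] == output_key:
            return output["OutputValue"]
    return None


# Usage (requires AWS credentials):
#   import boto3
#   response = boto3.client("cloudformation").describe_stacks(StackName="<your-stack-name>")
#   print(get_stack_output(response))
```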

Clean up

To avoid incurring unnecessary future charges, clean up the resources you created as part of this solution:

  1. Delete the Amazon Bedrock knowledge base and the data in the S3 bucket if you created one specifically for this solution.
  2. Delete the Amazon Lex bot you created.
  3. Delete the CloudFormation stack.

Conclusion

In this post, we discussed the significance of generative AI-powered chatbots in customer support systems. We then provided an overview of the new Amazon Lex feature, QnAIntent, designed to connect FMs to your company data. Finally, we demonstrated a practical use case of setting up a Q&A chatbot to analyze Amazon shareholder documents. This implementation not only provides prompt and consistent customer service, but also empowers live agents to dedicate their expertise to resolving more complex issues.

Stay up to date with the latest advancements in generative AI and start building on AWS. If you’re seeking assistance on how to begin, check out the Generative AI Innovation Center.


About the Authors

Supriya Puragundla is a Senior Solutions Architect at AWS. She has over 15 years of IT experience in software development, design and architecture. She helps key customer accounts on their data, generative AI and AI/ML journeys. She is passionate about data-driven AI and the area of depth in ML and generative AI.

Manjula Nagineni is a Senior Solutions Architect with AWS based in New York. She works with major financial service institutions, architecting and modernizing their large-scale applications while adopting AWS Cloud services. She is passionate about designing cloud-centered big data workloads. She has over 20 years of IT experience in software development, analytics, and architecture across multiple domains such as finance, retail, and telecom.

Mani Khanuja is a Tech Lead – Generative AI Specialists, author of the book Applied Machine Learning and High Performance Computing on AWS, and a member of the Board of Directors for Women in Manufacturing Education Foundation Board. She leads machine learning projects in various domains such as computer vision, natural language processing, and generative AI. She speaks at internal and external conferences such as AWS re:Invent, Women in Manufacturing West, YouTube webinars, and GHC 23. In her free time, she likes to go for long runs along the beach.


Identify idle endpoints in Amazon SageMaker


Amazon SageMaker is a machine learning (ML) platform designed to simplify the process of building, training, deploying, and managing ML models at scale. With a comprehensive suite of tools and services, SageMaker offers developers and data scientists the resources they need to accelerate the development and deployment of ML solutions.

In today’s fast-paced technological landscape, efficiency and agility are essential for businesses and developers striving to innovate. AWS plays a critical role in enabling this innovation by providing a range of services that abstract away the complexities of infrastructure management. By handling tasks such as provisioning, scaling, and managing resources, AWS allows developers to focus more on their core business logic and iterate quickly on new ideas.

As developers deploy and scale applications, unused resources such as idle SageMaker endpoints can accumulate unnoticed, leading to higher operational costs. This post addresses the issue of identifying and managing idle endpoints in SageMaker. We explore methods to monitor SageMaker endpoints effectively and distinguish between active and idle ones. Additionally, we walk through a Python script that automates the identification of idle endpoints using Amazon CloudWatch metrics.

Identify idle endpoints with a Python script

To effectively manage SageMaker endpoints and optimize resource utilization, we use a Python script that uses the AWS SDK for Python (Boto3) to interact with SageMaker and CloudWatch. This script automates the process of querying CloudWatch metrics to determine endpoint activity and identifies idle endpoints based on the number of invocations over a specified time period.

Let’s break down the key components of the Python script and explain how each part contributes to the identification of idle endpoints:

  • Global variables and AWS client initialization – The script begins by importing necessary modules and initializing global variables such as NAMESPACE, METRIC, LOOKBACK, and PERIOD. These variables define parameters for querying CloudWatch metrics and SageMaker endpoints. Additionally, AWS clients for interacting with SageMaker and CloudWatch services are initialized using Boto3.
from datetime import datetime, timedelta
import boto3
import logging

# AWS clients initialization
cloudwatch = boto3.client("cloudwatch")
sagemaker = boto3.client("sagemaker")

# Global variables
NAMESPACE = "AWS/SageMaker"
METRIC = "Invocations"
LOOKBACK = 1  # Number of days to look back for activity
PERIOD = 86400  # We opt for a granularity of 1 Day to reduce the volume of metrics retrieved while maintaining accuracy.

# Calculate time range for querying CloudWatch metrics
now = datetime.utcnow()
ago = now - timedelta(days=LOOKBACK)  # Derive the start from the same timestamp so the window is exactly LOOKBACK days
  • Identify idle endpoints – Based on the CloudWatch metrics data, the script determines whether an endpoint is idle or active. If an endpoint has received no invocations over the defined period, it’s flagged as idle. In this case, we select a cautious default threshold of zero invocations over the analyzed period. However, depending on your specific use case, you can adjust this threshold to suit your requirements.
# Helper function to extract endpoint name from CloudWatch metric

def get_endpoint_name_from_metric(metric):
    for d in metric["Dimensions"]:
        if d["Name"] in ("EndpointName", "InferenceComponentName"):
            yield d["Value"]

# Helper function to list every CloudWatch metric that matches NAMESPACE and METRIC, following pagination

def list_metrics():
    paginator = cloudwatch.get_paginator("list_metrics")
    response_iterator = paginator.paginate(Namespace=NAMESPACE, MetricName=METRIC)
    return [m for r in response_iterator for m in r["Metrics"]]


# Helper function to check if an endpoint is in use: it sums the metric's datapoints over the lookback window and treats a total of zero as idle

def is_endpoint_busy(metric):
    metric_values = cloudwatch.get_metric_data(
        MetricDataQueries=[{
            "Id": "metricname",
            "MetricStat": {
                "Metric": {
                    "Namespace": metric["Namespace"],
                    "MetricName": metric["MetricName"],
                    "Dimensions": metric["Dimensions"],
                },
                "Period": PERIOD,
                "Stat": "Sum",
                "Unit": "None",
            },
        }],
        StartTime=ago,
        EndTime=now,
        ScanBy="TimestampAscending",
        MaxDatapoints=24 * (LOOKBACK + 1),
    )
    return sum(metric_values.get("MetricDataResults", [{}])[0].get("Values", [])) > 0

# Helper function to log endpoint activity

def log_endpoint_activity(endpoint_name, is_busy):
    status = "BUSY" if is_busy else "IDLE"
    log_message = f"{datetime.utcnow()} - Endpoint {endpoint_name} {status}"
    print(log_message)
  • Main function – The main() function serves as the entry point to run the script. It orchestrates the process of retrieving SageMaker endpoints, querying CloudWatch metrics, and logging endpoint activity.
# Main function to identify idle endpoints and log their activity status
def main():
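    # Page through all endpoints; a single list_endpoints call returns only the first page of results
    paginator = sagemaker.get_paginator("list_endpoints")
    existing_endpoints_name = [
        endpoint["EndpointName"]
        for page in paginator.paginate()
        for endpoint in page["Endpoints"]
    ]

    if not existing_endpoints_name:
        print("No endpoints found")
        return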
    
    for metric in list_metrics():
        for endpoint_name in get_endpoint_name_from_metric(metric):
            if endpoint_name in existing_endpoints_name:
                is_busy = is_endpoint_busy(metric)
                log_endpoint_activity(endpoint_name, is_busy)
            else:
                print(f"Endpoint {endpoint_name} not active")

if __name__ == "__main__":
    main()

By following along with the explanation of the script, you’ll gain a deeper understanding of how to automate the identification of idle endpoints in SageMaker, paving the way for more efficient resource management and cost optimization.
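
The script hard-codes a threshold of zero invocations. As a minimal sketch of how that threshold could be made adjustable, the decision can be isolated in a small function; classify_endpoint and idle_threshold are hypothetical names that do not appear in the script above.

```python
def classify_endpoint(invocation_sums, idle_threshold=0):
    """Label an endpoint IDLE when its total invocations do not exceed the threshold.

    invocation_sums is the list of per-period Sum values returned by CloudWatch.
    With idle_threshold=0 the result matches the script's "sum > 0 means busy" rule.
    """
    total = sum(invocation_sums)
    return "IDLE" if total <= idle_threshold else "BUSY"


print(classify_endpoint([0.0, 0.0]))               # no traffic in the window
print(classify_endpoint([3.0]))                    # some traffic, default threshold
print(classify_endpoint([3.0], idle_threshold=5))  # same traffic, looser threshold
```

A higher threshold can be useful when health checks or occasional test calls generate a handful of invocations on endpoints that are otherwise unused.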

Permissions required to run the script

Before you run the provided Python script to identify idle endpoints in SageMaker, make sure your AWS Identity and Access Management (IAM) user or role has the necessary permissions. The permissions required for the script include:

  • CloudWatch permissions – The IAM entity running the script must have permissions for the CloudWatch actions cloudwatch:GetMetricData and cloudwatch:ListMetrics
  • SageMaker permissions – The IAM entity must have permissions to list SageMaker endpoints using the sagemaker:ListEndpoints action
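
As an illustration, the three actions listed above can be expressed as an IAM policy document. This is a sketch, not a policy published with the post: the function name and the statement IDs are made up for the example, and "*" is used as the resource because these list and read actions are not scoped to individual resources.

```python
import json


def build_idle_endpoint_policy():
    """Return a least-privilege policy document for the idle-endpoint script."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "ReadInvocationMetrics",
                "Effect": "Allow",
                "Action": ["cloudwatch:GetMetricData", "cloudwatch:ListMetrics"],
                "Resource": "*",
            },
            {
                "Sid": "ListSageMakerEndpoints",
                "Effect": "Allow",
                "Action": "sagemaker:ListEndpoints",
                "Resource": "*",
            },
        ],
    }


# Print the JSON so it can be pasted into the IAM policy editor
print(json.dumps(build_idle_endpoint_policy(), indent=2))
```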

Run the Python script

You can run the Python script using various methods, including:

  • The AWS CLI – Make sure the AWS Command Line Interface (AWS CLI) is installed and configured with the appropriate credentials.
  • AWS Cloud9 – If you prefer a cloud-based integrated development environment (IDE), AWS Cloud9 provides an IDE with preconfigured settings for AWS development. Simply create a new environment, clone the script repository, and run the script within the Cloud9 environment.

In this post, we demonstrate running the Python script through the AWS CLI.

Actions to take after identifying idle endpoints

After you’ve successfully identified idle endpoints in your SageMaker environment using the Python script, you can take proactive steps to optimize resource utilization and reduce operational costs. The following are some actionable measures you can implement:

  • Delete or scale down endpoints – For endpoints that consistently show no activity over an extended period, consider deleting or scaling them down to minimize resource wastage. SageMaker allows you to delete idle endpoints through the AWS Management Console or programmatically using the AWS SDK.
  • Review and refine the model deployment strategy – Evaluate the deployment strategy for your ML models and assess whether all deployed endpoints are necessary. Sometimes, endpoints may become idle due to changes in business requirements or model updates. By reviewing your deployment strategy, you can identify opportunities to consolidate or optimize endpoints for better efficiency.
  • Implement auto scaling policies – Configure auto scaling policies for active endpoints to dynamically adjust the compute capacity based on workload demand. SageMaker supports auto scaling, allowing you to automatically increase or decrease the number of instances serving predictions based on the predefined invocations-per-instance metric or on custom metrics such as CPU utilization or inference latency.
  • Explore serverless inference options – Consider using SageMaker serverless inference as an alternative to traditional endpoint provisioning. Serverless inference eliminates the need for manual endpoint management by automatically scaling compute resources based on incoming prediction requests. This can significantly reduce idle capacity and optimize costs for intermittent or unpredictable workloads.
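
As a sketch of the first option, endpoints that the script reported as idle could be deleted with Boto3. The helper below is hypothetical and defaults to a dry run, because deleting an endpoint is not reversible; it also assumes you have reviewed the list of names yourself. Note that delete_endpoint removes the endpoint only, not its endpoint configuration or model.

```python
def delete_idle_endpoints(sagemaker_client, idle_endpoint_names, dry_run=True):
    """Delete the given endpoints and return the names actually deleted.

    With dry_run=True (the default) nothing is deleted; the names are only printed.
    """
    deleted = []
    for name in idle_endpoint_names:
        if dry_run:
            print(f"Would delete endpoint {name}")
            continue
        sagemaker_client.delete_endpoint(EndpointName=name)
        deleted.append(name)
    return deleted


# Usage (requires AWS credentials and the sagemaker:DeleteEndpoint permission):
#   import boto3
#   delete_idle_endpoints(boto3.client("sagemaker"), ["<endpoint-name>"], dry_run=False)
```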

Conclusion

In this post, we discussed the importance of identifying idle endpoints in SageMaker and provided a Python script to help automate this process. By implementing proactive monitoring solutions and optimizing resource utilization, SageMaker users can effectively manage their endpoints, reduce operational costs, and maximize the efficiency of their machine learning workflows.

Get started with the techniques demonstrated in this post to automate cost monitoring for SageMaker inference. Explore AWS re:Post for valuable resources on optimizing your cloud infrastructure and maximizing AWS services.

Resources

For more information about the features and services used in this post, refer to the following:


About the authors

Pablo Colazurdo is a Principal Solutions Architect at AWS where he enjoys helping customers to launch successful projects in the Cloud. He has many years of experience working on varied technologies and is passionate about learning new things. Pablo grew up in Argentina but now enjoys the rain in Ireland while listening to music, reading or playing D&D with his kids.

Ozgur Canibeyaz is a Senior Technical Account Manager at AWS with 8 years of experience. Ozgur helps customers optimize their AWS usage by navigating technical challenges, exploring cost-saving opportunities, achieving operational excellence, and building innovative services using AWS products.


Indian language RAG with Cohere multilingual embeddings and Anthropic Claude 3 on Amazon Bedrock


Media and entertainment companies serve multilingual audiences with a wide range of content catering to diverse audience segments. These enterprises have access to massive amounts of data collected over their many years of operations. Much of this data is unstructured text and images. Conventional approaches to analyzing unstructured data for generating new content rely on the use of keyword or synonym matching. These approaches don’t capture the full semantic context of a document, making them less effective for users’ search, content creation, and several other downstream tasks.

Text embeddings use machine learning (ML) capabilities to capture the essence of unstructured data. These embeddings are generated by language models that map natural language text into their numerical representations and, in the process, encode contextual information in the natural language document. Generating text embeddings is the first step to many natural language processing (NLP) applications powered by large language models (LLMs) such as Retrieval Augmented Generation (RAG), text generation, entity extraction, and several other downstream business processes.

Figure: Converting text to embeddings using the Cohere multilingual embedding model

Despite the rising popularity and capabilities of LLMs, the language most often used to converse with the LLM, often through a chat-like interface, is English. And although progress has been made in adapting open source models to comprehend and respond in Indian languages, such efforts fall short of the English language capabilities displayed among larger, state-of-the-art LLMs. This makes it difficult to adopt such models for RAG applications based on Indian languages.

In this post, we showcase a RAG application that can search and query across multiple Indian languages using the Cohere Embed – Multilingual model and Anthropic Claude 3 on Amazon Bedrock. This post focuses on Indian languages, but you can use the approach with other languages that are supported by the LLM.

Solution overview

We use the Flores dataset [1], a benchmark dataset for machine translation between English and low-resource languages. This also serves as a parallel corpus, which is a collection of texts that have been translated into one or more languages.

With the Flores dataset, we can demonstrate that the embeddings and, subsequently, the documents retrieved from the retriever, are relevant for the same question being asked in multiple languages. However, given the sparsity of the dataset (approximately 1,000 lines per language from more than 200 languages), the nature and number of questions that can be asked against the dataset is limited.

After you have downloaded the data, load the data into the pandas data frame for processing. For this demo, we are restricting ourselves to Bengali, Kannada, Malayalam, Tamil, Telugu, Hindi, Marathi, and English. If you are looking to adopt this approach for other languages, make sure the language is supported by both the embedding model and the LLM that’s being used in the RAG setup.

Load the data with the following code:

import pandas as pd

df_ben = pd.read_csv('./data/Flores/dev/dev.ben_Beng', sep='\t')
df_kan = pd.read_csv('./data/Flores/dev/dev.kan_Knda', sep='\t')
df_mal = pd.read_csv('./data/Flores/dev/dev.mal_Mlym', sep='\t')
df_tam = pd.read_csv('./data/Flores/dev/dev.tam_Taml', sep='\t')
df_tel = pd.read_csv('./data/Flores/dev/dev.tel_Telu', sep='\t')
df_hin = pd.read_csv('./data/Flores/dev/dev.hin_Deva', sep='\t')
df_mar = pd.read_csv('./data/Flores/dev/dev.mar_Deva', sep='\t')
df_eng = pd.read_csv('./data/Flores/dev/dev.eng_Latn', sep='\t')
# Choose fewer/more languages if needed

df_all_Langs = pd.concat([df_ben, df_kan, df_mal, df_tam, df_tel, df_hin, df_mar,df_eng], axis=1)
df_all_Langs.columns = ['Bengali', 'Kannada', 'Malayalam', 'Tamil', 'Telugu', 'Hindi', 'Marathi','English']

df_all_Langs.shape #(996,8)


df = df_all_Langs
stacked_df = df.stack().reset_index() # for ease of handling

# select only the required columns, rename them
stacked_df = stacked_df.iloc[:,[1,2]]
stacked_df.columns = ['language','text'] 

The Cohere multilingual embedding model

Cohere is a leading enterprise artificial intelligence (AI) platform that builds world-class LLMs and LLM-powered solutions that allow computers to search, capture meaning, and converse in text. They provide ease of use and strong security and privacy controls.

The Cohere Embed – Multilingual model generates vector representations of documents for over 100 languages and is available on Amazon Bedrock. With Amazon Bedrock, you can access the embedding model through an API call, which eliminates the need to manage the underlying infrastructure and makes sure sensitive information remains securely managed and protected.

The multilingual embedding model groups text with similar meanings by assigning them positions in the semantic vector space that are close to each other. Developers can process text in multiple languages without switching between different models. This makes processing more efficient and improves performance for multilingual applications.
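
To illustrate what "close to each other" means, the following sketch compares vectors with cosine similarity. The three-dimensional vectors are made up for the example and are not output of the Cohere model, whose embeddings have far more dimensions; the variable names are likewise hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: near 1.0 means similar direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up toy vectors standing in for embeddings of three sentences
english_sentence = [0.9, 0.1, 0.2]
hindi_translation = [0.8, 0.2, 0.3]   # same meaning, different language
unrelated_sentence = [0.1, 0.9, 0.1]  # different meaning

print(cosine_similarity(english_sentence, hindi_translation))
print(cosine_similarity(english_sentence, unrelated_sentence))
```

In this toy setup the translation pair scores much higher than the unrelated pair, which is the property a multilingual retriever relies on.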

Text embeddings turn unstructured data into a structured form. This allows you to objectively compare, dissect, and derive insights from all these documents. Cohere’s new embedding models have a new required input parameter, input_type, which must be set for every API call and include one of the following four values, which align towards the most frequent use cases for text embeddings:

  • input_type="search_document" – Use this for texts (documents) you want to store in your vector database
  • input_type="search_query" – Use this for search queries to find the most relevant documents in your vector database
  • input_type="classification" – Use this if you use the embeddings as input for a classification system
  • input_type="clustering" – Use this if you use the embeddings for text clustering

Using these input types provides the highest possible quality for the respective tasks. If you want to use the embeddings for multiple use cases, we recommend using input_type="search_document".
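
As a small sketch of how input_type fits into a request, the JSON body for the embedding model can be built as follows. The helper name build_embed_body and the validation step are assumptions added for illustration; the texts and input_type fields are the ones described in this post.

```python
import json

# The four values described above
VALID_INPUT_TYPES = {"search_document", "search_query", "classification", "clustering"}


def build_embed_body(texts, input_type):
    """Return the JSON request body for the embedding model."""
    if input_type not in VALID_INPUT_TYPES:
        raise ValueError(f"Unsupported input_type: {input_type!r}")
    return json.dumps({"texts": list(texts), "input_type": input_type})


# Documents are embedded with one type when indexing ...
document_body = build_embed_body(["first passage", "second passage"], "search_document")
# ... and the user's question with another at query time
query_body = build_embed_body(["a user question"], "search_query")
print(document_body)
print(query_body)
```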

Prerequisites

To use the Claude 3 Sonnet LLM and the Cohere multilingual embeddings model on this dataset, make sure you have enabled access to both models in your AWS account on the Model access page of the Amazon Bedrock console, then install the following packages. The following code has been tested to work with the Amazon SageMaker Data Science 3.0 Image, backed by an ml.t3.medium instance.

! apt-get update 
! apt-get install build-essential -y # for the hnswlib package below
! pip install hnswlib

Create a search index

With all of the prerequisites in place, you can now convert the multilingual corpus into embeddings and store those in hnswlib, a header-only C++ Hierarchical Navigable Small Worlds (HNSW) implementation with Python bindings, insertions, and updates. HNSWLib is an in-memory vector store that can be saved to a file, which should be sufficient for the small dataset we are working with. Use the following code:

import hnswlib
import os
import json
import botocore
import boto3

boto3_bedrock = boto3.client('bedrock')
bedrock_runtime = boto3.client('bedrock-runtime')

# Create a search index
index = hnswlib.Index(space='ip', dim=1024)
index.init_index(max_elements=10000, ef_construction=512, M=64)

all_text = stacked_df['text'].to_list()
all_text_lang = stacked_df['language'].to_list()
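To make the role of space='ip' concrete, here is a small pure-Python sketch of exact inner-product search. It is not how hnswlib works internally (HNSW is an approximate graph search); it only shows the ranking that the index approximates. The toy 3-dimensional vectors stand in for the 1,024-dimensional embeddings and are made up for illustration.

```python
def inner_product(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def brute_force_top_k(query_vec, doc_vecs, k=3):
    """Exact top-k by inner product; returns (doc_id, score) pairs, best first."""
    scored = [(doc_id, inner_product(query_vec, vec)) for doc_id, vec in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy vectors; real embeddings would come from the embedding model.
docs = [[1.0, 0.0, 0.0], [0.6, 0.8, 0.0], [0.0, 0.0, 1.0]]
query = [0.8, 0.6, 0.0]
for doc_id, score in brute_force_top_k(query, docs, k=2):
    print(doc_id, round(score, 2))
```

For a corpus as small as the one in this post, a brute-force scan like this would also be fast enough; the HNSW index matters more as the number of documents grows.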

Embed and index documents

To embed and store the small multilingual dataset, use the Cohere embed-multilingual-v3.0 model, which creates embeddings with 1,024 dimensions, using the Amazon Bedrock runtime API:

modelId="cohere.embed-multilingual-v3"
contentType= "application/json"
accept = "*/*"


df_chunk_size = 80
chunk_embeddings = []
for i in range(0,len(all_text), df_chunk_size):
    chunk = all_text[i:i+df_chunk_size]
    body=json.dumps(
            {"texts":chunk,"input_type":"search_document"} # search documents
    ) 
    response = bedrock_runtime.invoke_model(body=body, 
                                            modelId=modelId,
                                            accept=accept,
                                            contentType=contentType)
    response_body = json.loads(response.get('body').read())
    index.add_items(response_body['embeddings'])
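The batching in the loop above can be factored into a small generator, sketched here. The name batched is our own. The sketch also makes one property explicit: batches are produced in order, which matters because the loop calls add_items without explicit ids and the retrieval code later looks documents up with all_text[doc_id]. We assume the chunk size of 80 was chosen to stay under the per-call limit on the number of texts; check the current Amazon Bedrock documentation for the exact cap.

```python
def batched(items, batch_size):
    """Yield consecutive slices of items, each at most batch_size long."""
    if batch_size < 1:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# 200 placeholder documents split into batches of at most 80.
sample = [f"doc {n}" for n in range(200)]
sizes = [len(chunk) for chunk in batched(sample, 80)]
print(sizes)  # [80, 80, 40]
```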

Verify that the embeddings work

To test the solution, write a function that takes a query as input, embeds it, and finds the top N documents most closely related to it:

# Retrieval of closest N docs to query
def retrieval(query, num_docs_to_return=10):
    modelId="cohere.embed-multilingual-v3"
    contentType= "application/json"
    accept = "*/*"
    body=json.dumps(
            {"texts":[query],"input_type":"search_query"} # search query
    ) 
    response = bedrock_runtime.invoke_model(body=body, 
                                            modelId=modelId,
                                            accept=accept,
                                            contentType=contentType)
    response_body = json.loads(response.get('body').read())
    doc_ids = index.knn_query(response_body['embeddings'], 
                              k=num_docs_to_return)[0][0] 
    print(f"Query: {query} \n")
    retrieved_docs = []

    for doc_id in doc_ids:
        # Append results
        retrieved_docs.append(all_text[doc_id]) # original vernacular language docs

        # Print results
        print(f"Original Flores Text {all_text[doc_id]}")
        print("-"*30)

    print("END OF RESULTS \n\n")
    return retrieved_docs   

You can explore what the RAG stack does with a couple of queries in different languages, such as Hindi:

queries = [
    "मुझे सिंधु नदी घाटी सभ्यता के बारे में बताइए",
]
# translation: tell me about Indus Valley Civilization
for query in queries:
    retrieval(query)

The index returns documents relevant to the search query from across languages:

Query: मुझे सिंधु नदी घाटी सभ्यता के बारे में बताइए 

Original Flores Text सिंधु घाटी सभ्यता उत्तर-पश्चिम भारतीय उपमहाद्वीप में कांस्य युग की सभ्यता थी जिसमें आस-पास के आधुनिक पाकिस्तान और उत्तर पश्चिम भारत और उत्तर-पूर्व अफ़गानिस्तान के कुछ क्षेत्र शामिल थे.
------------------------------
Original Flores Text सिंधु नदी के घाटों में पनपी सभ्यता के कारण यह इसके नाम पर बनी है.
------------------------------
Original Flores Text यद्यपि कुछ विद्वानों का अनुमान है कि चूंकि सभ्यता अब सूख चुकी सरस्वती नदी के घाटियों में विद्यमान थी, इसलिए इसे सिंधु-सरस्वती सभ्यता कहा जाना चाहिए, जबकि 1920 के दशक में हड़प्पा की पहली खुदाई के बाद से कुछ इसे हड़प्पा सभ्यता कहते हैं।
------------------------------
Original Flores Text సింధు నది పరీవాహక ప్రాంతాల్లో నాగరికత విలసిల్లింది.
------------------------------
Original Flores Text सिंधू संस्कृती ही वायव्य भारतीय उपखंडातील कांस्य युग संस्कृती होती ज्यामध्ये  आधुनिक काळातील पाकिस्तान, वायव्य भारत आणि ईशान्य अफगाणिस्तानातील काही प्रदेशांचा समावेश होता.
------------------------------
Original Flores Text সিন্ধু সভ্যতা হল উত্তর-পশ্চিম ভারতীয় উপমহাদেশের একটি তাম্রযুগের সভ্যতা যা আধুনিক-পাকিস্তানের অধিকাংশ ও উত্তর-পশ্চিম ভারত এবং উত্তর-পূর্ব আফগানিস্তানের কিছু অঞ্চলকে ঘিরে রয়েছে।
-------------------------
 .....
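The results above mix Hindi, Telugu, Marathi, and Bengali passages. One way to make that visible is to pair each hit with its language label, using the all_text_lang list created alongside the index. The helper below is a sketch under that assumption; label_hits is a hypothetical name, and the toy texts and language codes are placeholders rather than values from the dataset.

```python
def label_hits(doc_ids, texts, languages):
    """Pair each retrieved document id with its text and language label."""
    return [
        {"doc_id": int(doc_id), "language": languages[doc_id], "text": texts[doc_id]}
        for doc_id in doc_ids
    ]


# Toy stand-ins for all_text and all_text_lang; the language codes are hypothetical.
texts = ["doc in Hindi", "doc in Telugu", "doc in Bengali"]
languages = ["hin", "tel", "ben"]
for hit in label_hits([2, 0], texts, languages):
    print(f"[{hit['language']}] {hit['text']}")
```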

You can now use these documents retrieved from the index as context while calling the Anthropic Claude 3 Sonnet model on Amazon Bedrock. In production settings with datasets that are several orders of magnitude larger than the Flores dataset, we can make the search results from the index even more relevant by using Cohere’s Rerank models.

Use the system prompt to outline how you want the LLM to process your query:

# Retrieval of docs relevant to the query
def context_retrieval(query, num_docs_to_return=10):

    modelId="cohere.embed-multilingual-v3"
    contentType= "application/json"
    accept = "*/*"
    body=json.dumps(
            {"texts":[query],"input_type":"search_query"} # search query
    ) 
    response = bedrock_runtime.invoke_model(body=body, 
                                            modelId=modelId,
                                            accept=accept,
                                            contentType=contentType)
    response_body = json.loads(response.get('body').read())
    doc_ids = index.knn_query(response_body['embeddings'], 
                              k=num_docs_to_return)[0][0] 
    retrieved_docs = []
    
    for doc_id in doc_ids:
        retrieved_docs.append(all_text[doc_id])
    return " ".join(retrieved_docs)

def query_rag_bedrock(query, model_id = 'anthropic.claude-3-sonnet-20240229-v1:0'):

    system_prompt = '''
    You are a helpful empathetic multilingual assistant. 
    Identify the language of the user query, and respond to the user query in the same language. 

    For example 
    if the user query is in English your response will be in English, 
    if the user query is in Malayalam, your response will be in Malayalam, 
    if the user query is in Tamil, your response will be in Tamil
    and so on...

    if you cannot identify the language: Say you cannot identify the language

    You will use only the data provided within the <context> </context> tags, that matches the user's query's language, to answer the user's query
    If there is no data provided within the <context> </context> tags, Say that you do not have enough information to answer the question
    
    Restrict your response to a paragraph of less than 400 words avoid bullet points
    '''
    max_tokens = 1000

    messages  = [{"role": "user", "content": f'''
                    query : {query}
                    <context>
                    {context_retrieval(query)}
                    </context>
                '''}]

    body=json.dumps(
            {
                "anthropic_version": "bedrock-2023-05-31",
                "max_tokens": max_tokens,
                "system": system_prompt,
                "messages": messages
            }  
        )  


    response = bedrock_runtime.invoke_model(body=body, modelId=model_id)
    response_body = json.loads(response.get('body').read())
    return response_body['content'][0]['text']

Let’s pass in the same query in multiple Indian languages:

queries = ["tell me about the indus river valley civilization",
           "मुझे सिंधु नदी घाटी सभ्यता के बारे में बताइए",
           "मला सिंधू नदीच्या संस्कृतीबद्दल सांगा",
           "సింధు నది నాగరికత గురించి చెప్పండి",
           "ಸಿಂಧೂ ನದಿ ಕಣಿವೆ ನಾಗರಿಕತೆಯ ಬಗ್ಗೆ ಹೇಳಿ", 
           "সিন্ধু নদী উপত্যকা সভ্যতা সম্পর্কে বলুন",
           "சிந்து நதி பள்ளத்தாக்கு நாகரிகத்தைப் பற்றி சொல்",
           "സിന്ധു നദീതാഴ്വര നാഗരികതയെക്കുറിച്ച് പറയുക"] 

for query in queries:
    print(query_rag_bedrock(query))
    print('_'*20)

The query is in English, so I will respond in English.

The Indus Valley Civilization, also known as the Harappan Civilization, was a Bronze Age civilization that flourished in the northwestern regions of the Indian subcontinent, primarily in the basins of the Indus River and its tributaries. It encompassed parts of modern-day Pakistan, northwest India, and northeast Afghanistan. While some scholars suggest calling it the Indus-Sarasvati Civilization due to its presence in the now-dried-up Sarasvati River basin, the name "Indus Valley Civilization" is derived from its development along the Indus River valley. This ancient civilization dates back to around 3300–1300 BCE and was one of the earliest urban civilizations in the world. It was known for its well-planned cities, advanced drainage systems, and a writing system that has not yet been deciphered.
____________________
सिंधु घाटी सभ्यता एक प्राचीन नगर सभ्यता थी जो उत्तर-पश्चिम भारतीय उपमहाद्वीप में फैली हुई थी। यह लगभग 3300 से 1300 ईसा पूर्व की अवधि तक विकसित रही। इस सभ्यता के केंद्र वर्तमान पाकिस्तान के सिंध और पंजाब प्रांतों में स्थित थे, लेकिन इसके अवशेष भारत के राजस्थान, गुजरात, मध्य प्रदेश, महाराष्ट्र और उत्तर प्रदेश में भी मिले हैं। सभ्यता का नाम सिंधु नदी से लिया गया है क्योंकि इसके प्रमुख स्थल इस नदी के किनारे स्थित थे। हालांकि, कुछ विद्वानों का अनुमान है कि सरस्वती नदी के किनारे भी इस सभ्यता के स्थल विद्यमान थे इसलिए इसे सिंधु-सरस्वती सभ्यता भी कहा जाता है। यह एक महत्वपूर्ण शहरी समाज था जिसमें विकसित योजना बनाने की क्षमता, नगरीय संरचना और स्वच्छ जलापूर्ति आदि प्रमुख विशेषताएं थीं।
____________________
सिंधू संस्कृती म्हणजे सिंधू नदीच्या पट्टीकेतील प्राचीन संस्कृती होती. ही संस्कृती सुमारे ई.पू. ३३०० ते ई.पू. १३०० या कालखंडात फुलणारी होती. ती भारतातील कांस्ययुगीन संस्कृतींपैकी एक मोठी होती. या संस्कृतीचे अवशेष आजच्या पाकिस्तान, भारत आणि अफगाणिस्तानमध्ये आढळून आले आहेत. या संस्कृतीत नगररचना, नागरी सोयी सुविधांचा विकास झाला होता. जलवाहिनी, नगरदेवालय इत्यादी अद्भुत बाबी या संस्कृतीत होत्या. सिंधू संस्कृतीत लिपीसुद्धा विकसित झाली होती परंतु ती अजूनही वाचण्यास आलेली नाही. सिंधू संस्कृती ही भारतातील पहिली शहरी संस्कृती मानली जाते.
____________________
సింధు నది నాగరికత గురించి చెప్పుతూ, ఈ నాగరికత సింధు నది పరిసర ప్రాంతాల్లో ఉన్నదని చెప్పవచ్చు. దీనిని సింధు-సరస్వతి నాగరికత అనీ, హరప్ప నాగరికత అనీ కూడా పిలుస్తారు. ఇది ఉత్తర-ఆర్య భారతదేశం, ఆధునిక పాకిస్తాన్, ఉత్తర-పశ్చిమ భారతదేశం మరియు ఉత్తర-ఆర్థిక అఫ్గానిస్తాన్ కు చెందిన తామ్రయుగపు నాగరికత. సరస్వతి నది పరీవాహక ప్రాంతాల్లోనూ నాగరికత ఉందని కొందరు పండితులు అభిప్రాయపడ్డారు. దీని మొదటి స్థలాన్ని 1920లలో హరప్పాలో త్రవ్వారు. ఈ నాగరికతలో ప్రశస్తమైన బస్తీలు, నగరాలు, మలిచ్చి రంగులతో నిర్మించిన భవనాలు, పట్టణ నిర్మాణాలు ఉన్నాయి.
____________________
ಸಿಂಧೂ ಕಣಿವೆ ನಾಗರಿಕತೆಯು ವಾಯುವ್ಯ ಭಾರತದ ಉಪಖಂಡದಲ್ಲಿ ಕಂಚಿನ ಯುಗದ ನಾಗರಿಕತೆಯಾಗಿದ್ದು, ಪ್ರಾಚೀನ ಭಾರತದ ಇತಿಹಾಸದಲ್ಲಿ ಮುಖ್ಯವಾದ ಪಾತ್ರವನ್ನು ವಹಿಸಿದೆ. ಈ ನಾಗರಿಕತೆಯು ಆಧುನಿಕ-ದಿನದ ಪಾಕಿಸ್ತಾನ ಮತ್ತು ವಾಯುವ್ಯ ಭಾರತದ ಭೂಪ್ರದೇಶಗಳನ್ನು ಹಾಗೂ ಈಶಾನ್ಯ ಅಫ್ಘಾನಿಸ್ತಾನದ ಕೆಲವು ಪ್ರದೇಶಗಳನ್ನು ಒಳಗೊಂಡಿರುವುದರಿಂದ ಅದಕ್ಕೆ ಸಿಂಧೂ ನಾಗರಿಕತೆ ಎಂದು ಹೆಸರಿಸಲಾಗಿದೆ. ಸಿಂಧೂ ನದಿಯ ಪ್ರದೇಶಗಳಲ್ಲಿ ಈ ನಾಗರಿಕತೆಯು ವಿಕಸಿತಗೊಂಡಿದ್ದರಿಂದ ಅದಕ್ಕೆ ಸಿಂಧೂ ನಾಗರಿಕತೆ ಎಂದು ಹೆಸರಿಸಲಾಗಿದೆ. ಈಗ ಬತ್ತಿ ಹೋದ ಸರಸ್ವತಿ ನದಿಯ ಪ್ರದೇಶಗಳಲ್ಲಿ ಸಹ ನಾಗರೀಕತೆಯ ಅಸ್ತಿತ್ವವಿದ್ದಿರಬಹುದೆಂದು ಕೆಲವು ಪ್ರಾಜ್ಞರು ಶಂಕಿಸುತ್ತಾರೆ. ಆದ್ದರಿಂದ ಈ ನಾಗರಿಕತೆಯನ್ನು ಸಿಂಧೂ-ಸರಸ್ವತಿ ನಾಗರಿಕತೆ ಎಂದು ಸೂಕ್ತವಾಗಿ ಕರೆ
____________________
সিন্ধু নদী উপত্যকা সভ্যতা ছিল একটি প্রাচীন তাম্রযুগীয় সভ্যতা যা বর্তমান পাকিস্তান এবং উত্তর-পশ্চিম ভারত ও উত্তর-পূর্ব আফগানিস্তানের কিছু অঞ্চলকে নিয়ে গঠিত ছিল। এই সভ্যতার নাম সিন্ধু নদীর অববাহিকা অঞ্চলে এটির বিকাশের কারণে এরকম দেওয়া হয়েছে। কিছু পণ্ডিত মনে করেন যে সরস্বতী নদীর ভূমি-প্রদেশেও এই সভ্যতা বিদ্যমান ছিল, তাই এটিকে সিন্ধু-সরস্বতী সভ্যতা বলা উচিত। আবার কেউ কেউ এই সভ্যতাকে হরপ্পা পরবর্তী হরপ্পান সভ্যতা নামেও অবিহিত করেন। যাই হোক, সিন্ধু সভ্যতা ছিল প্রাচীন তাম্রযুগের এক উল্লেখযোগ্য সভ্যতা যা সিন্ধু নদী উপত্যকার এলাকায় বিকশিত হয়েছিল।
____________________
சிந்து நதிப் பள்ளத்தாக்கில் தோன்றிய நாகரிகம் சிந்து நாகரிகம் என்றழைக்கப்படுகிறது. சிந்து நதியின் படுகைகளில் இந்த நாகரிகம் மலர்ந்ததால் இப்பெயர் வழங்கப்பட்டது. ஆனால், தற்போது வறண்டுபோன சரஸ்வதி நதிப் பகுதியிலும் இந்நாகரிகம் இருந்திருக்கலாம் என சில அறிஞர்கள் கருதுவதால், சிந்து சரஸ்வதி நாகரிகம் என்று அழைக்கப்பட வேண்டும் என்று வாதிடுகின்றனர். மேலும், இந்நாகரிகத்தின் முதல் தளமான ஹரப்பாவின் பெயரால் ஹரப்பா நாகரிகம் என்றும் அழைக்கப்படுகிறது. இந்த நாகரிகம் வெண்கலயுக நாகரிகமாக கருதப்படுகிறது. இது தற்கால பாகிஸ்தானின் பெரும்பகுதி, வடமேற்கு இந்தியா மற்றும் வடகிழக்கு ஆப்கானிஸ்தானின் சில பகுதிகளை உள்ளடக்கியது.
____________________
സിന്ധു നദീതട സംസ്കാരം അഥവാ ഹാരപ്പൻ സംസ്കാരം ആധുനിക പാകിസ്ഥാൻ, വടക്ക് പടിഞ്ഞാറൻ ഇന്ത്യ, വടക്ക് കിഴക്കൻ അഫ്ഗാനിസ്ഥാൻ എന്നിവിടങ്ങളിൽ നിലനിന്ന ഒരു വെങ്കല യുഗ സംസ്കാരമായിരുന്നു. ഈ സംസ്കാരത്തിന്റെ അടിസ്ഥാനം സിന്ധു നദിയുടെ തടങ്ങളായതിനാലാണ് ഇതിന് സിന്ധു നദീതട സംസ്കാരം എന്ന പേര് ലഭിച്ചത്. ചില പണ്ഡിതർ ഇപ്പോൾ വറ്റിപ്പോയ സരസ്വതി നദിയുടെ തടങ്ങളിലും ഈ സംസ്കാരം നിലനിന്നിരുന്നതിനാൽ സിന്ധു-സരസ്വതി നദീതട സംസ്കാരമെന്ന് വിളിക്കുന്നത് ശരിയായിരിക്കുമെന്ന് അഭിപ്രായപ്പെടുന്നു. എന്നാൽ ചിലർ 1920കളിൽ ആദ്യമായി ഉത്ഖനനം നടത്തിയ ഹാരപ്പ എന്ന സ്ഥലത്തെ പേര് പ്രകാരം ഈ സംസ്കാരത്തെ ഹാരപ്പൻ സംസ്കാരമെന്ന് വിളിക്കുന്നു.
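The Kannada response above stops mid-sentence. A likely cause, though the post does not confirm it, is that the response reached the max_tokens limit of 1,000. The Anthropic Messages response body includes a stop_reason field, so a small wrapper can flag truncated answers. The following is a sketch; extract_text is a hypothetical helper and the sample response is hand-written, not real model output.

```python
def extract_text(response_body):
    """Join the text blocks of a Messages API response and flag truncation."""
    text = "".join(
        block.get("text", "")
        for block in response_body.get("content", [])
        if block.get("type") == "text"
    )
    # stop_reason is "max_tokens" when generation hit the token limit.
    truncated = response_body.get("stop_reason") == "max_tokens"
    return text, truncated


# Hand-written sample with the same shape as the parsed response body.
sample_body = {
    "content": [{"type": "text", "text": "A partial answer that was cut"}],
    "stop_reason": "max_tokens",
}
answer, was_truncated = extract_text(sample_body)
if was_truncated:
    print("Warning: response hit max_tokens; consider raising the limit.")
print(answer)
```

Some scripts can use more tokens than English for the same content, so a limit that works for English may be too tight for other languages.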

Conclusion

This post presented a walkthrough for using Cohere’s multilingual embedding model along with Anthropic Claude 3 Sonnet on Amazon Bedrock. In particular, we showed how the same question, asked in multiple Indian languages, is answered using relevant documents retrieved from a vector store.

Cohere’s multilingual embedding model supports over 100 languages. It removes the complexity of building applications that require working with a corpus of documents in different languages. The Cohere Embed model is trained to deliver results in real-world applications. It handles noisy data as inputs, adapts to complex RAG systems, and delivers cost-efficiency from its compression-aware training method.

Start building with Cohere’s multilingual embedding model and Anthropic Claude 3 Sonnet on Amazon Bedrock today.

References

[1] Flores Dataset: https://github.com/facebookresearch/flores/tree/main/flores200


About the Author

Rony K Roy is a Sr. Specialist Solutions Architect, specializing in AI/ML. Rony helps partners build AI/ML solutions on AWS.

The future of productivity agents with NinjaTech AI and AWS Trainium

This is a guest post by Arash Sadrieh, Tahir Azim, and Tengfei Xue from NinjaTech AI.

NinjaTech AI’s mission is to make everyone more productive by taking care of time-consuming complex tasks with fast and affordable artificial intelligence (AI) agents. We recently launched MyNinja.ai, one of the world’s first multi-agent personal AI assistants, to drive towards our mission. MyNinja.ai is built from the ground up using specialized agents that are capable of completing tasks on your behalf, including scheduling meetings, conducting deep research from the web, generating code, and helping with writing. These agents can break down complicated, multi-step tasks into branched solutions, and are capable of evaluating the generated solutions dynamically while continually learning from past experiences. All of these tasks are accomplished in a fully autonomous and asynchronous manner, freeing you up to continue your day while Ninja works on these tasks in the background, and engaging when your input is required.

Because no single large language model (LLM) is perfect for every task, we knew that building a personal AI assistant would require multiple LLMs optimized specifically for a variety of tasks. In order to deliver the accuracy and capabilities to delight our users, we also knew that we would require these multiple models to work together in tandem. Finally, we needed scalable and cost-effective methods for training these various models—an undertaking that has historically been costly to pursue for most startups. In this post, we describe how we built our cutting-edge productivity agent NinjaLLM, the backbone of MyNinja.ai, using AWS Trainium chips.

Building a dataset

We recognized early that to deliver on the mission of tackling tasks on a user’s behalf, we needed multiple models that were optimized for specific tasks. Examples include our Deep Researcher, Deep Coder, and Advisor models. After testing available open source models, we felt that prompt engineering alone could not bring their out-of-the-box capabilities and responses up to our needs. Specifically, we wanted to make sure each model was optimized for a ReAct/chain-of-thought style of prompting. Additionally, we wanted to make sure the model would, when deployed as part of a Retrieval Augmented Generation (RAG) system, accurately cite each source and be biased towards saying “I don’t know” rather than generating false answers. For that purpose, we chose to fine-tune the models for the various downstream tasks.

In constructing our training dataset, our goal was twofold: adapt each model for its suited downstream task and persona (Researcher, Advisor, Coder, and so on), and adapt the models to follow a specific output structure. To that end, we followed the LIMA approach for fine-tuning. We used a training sample size of roughly 20 million tokens, focusing on the format and tone of the output while using a diverse but relatively small sample size. To construct our supervised fine-tuning dataset, we began by creating initial seed tasks for each model. With these seed tasks, we generated an initial synthetic dataset using Meta’s Llama 2 model. We then used the synthetic dataset to perform an initial round of fine-tuning. To evaluate the performance of this fine-tuned model, we crowd-sourced user feedback to iteratively create more samples. We also used a series of benchmarks, both internal and public, to assess model performance and continued to iterate.

Fine-tuning on Trainium

We elected to start with the Llama models for a pre-trained base model for several reasons: most notably the great out-of-the-box performance, strong ecosystem support from various libraries, and the truly open source and permissive license. At the time, we began with Llama 2, testing across the various sizes (7B, 13B, and 70B). For training, we chose to use a cluster of trn1.32xlarge instances to take advantage of Trainium chips. We used a cluster of 32 instances in order to efficiently parallelize the training. We also used AWS ParallelCluster to manage cluster orchestration. By using a cluster of Trainium instances, each fine-tuning iteration took less than 3 hours, at a cost of less than $1,000. This quick iteration time and low cost allowed us to quickly tune and test our models and improve our model accuracy. To achieve the accuracies discussed in the following sections, we only had to spend around $30k, saving hundreds of thousands, if not millions, of dollars compared with training on traditional training accelerators.

The following diagram illustrates our training architecture.

After we had established our fine-tuning pipelines built on top of Trainium, we were able to fine-tune and refine our models thanks to the Neuron Distributed training libraries. This was exceptionally useful and timely, because leading up to the launch of MyNinja.ai, Meta’s Llama 3 models were released. Llama 3 and Llama 2 share similar architecture, so we were able to rapidly upgrade to the newer model. This velocity in switching allowed us to take advantage of the inherent gains in model accuracy, and very quickly run through another round of fine-tuning with the Llama 3 weights and prepare for launch.

Model evaluation

For evaluating the model, there were two objectives: evaluate the model’s ability to answer user questions, and evaluate the system’s ability to answer questions with provided sources, because this is our personal AI assistant’s primary interface. We selected the HotPotQA and Natural Questions (NQ) Open datasets, both of which are a good fit because of their open benchmarking datasets with public leaderboards.

We calculated accuracy by matching the model’s answer to the expected answer, using the top 10 passages retrieved from a Wikipedia corpus. We performed content filtering and ranking using ColBERTv2, a BERT-based retrieval model. We achieved accuracies of 62.22% on the NQ Open dataset and 58.84% on HotPotQA by using our enhanced Llama 3 RAG model, demonstrating notable improvements over other baseline models. The following figure summarizes our results.
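The post does not spell out the matching rule, so the following is only a sketch of one common convention for open-domain QA: normalize both strings (lowercase, strip punctuation and English articles, collapse whitespace) and count a prediction as correct if it exactly equals any of the accepted answers. The function names and the toy examples are our own; the real evaluation may have used a different rule.

```python
import re
import string


def normalize(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, gold_answers):
    """True if the normalized prediction equals any normalized gold answer."""
    pred = normalize(prediction)
    return any(pred == normalize(gold) for gold in gold_answers)


def accuracy(predictions, gold_answer_lists):
    """Fraction of predictions that match at least one accepted answer."""
    if not predictions:
        return 0.0
    hits = sum(exact_match(p, g) for p, g in zip(predictions, gold_answer_lists))
    return hits / len(predictions)


# Toy data: the first prediction matches after normalization, the second does not.
preds = ["The Eiffel Tower", "Paris, France"]
golds = [["Eiffel Tower"], ["Paris"]]
print(accuracy(preds, golds))  # 0.5
```

Exact match is strict for long-form answers; a containment check (gold answer appears inside the prediction) is a looser alternative that some RAG evaluations use instead.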

Future work

Looking ahead, we’re working on several developments to continue improving our model’s performance and user experience. First, we intend to use ORPO to fine-tune our models. ORPO combines traditional fine-tuning with preference alignment, while using a single preference alignment dataset for both. We believe this will allow us to better align models to achieve better results for users.

Additionally, we intend to build a custom ensemble model from the various models we have fine-tuned thus far. Inspired by Mixture of Experts (MoE) model architectures, we intend to introduce a routing layer to our various models. We believe this will radically simplify our model serving and scaling architecture, while maintaining the quality in various tasks that our users have come to expect from our personal AI assistant.

Conclusion

Building next-gen AI agents to make everyone more productive is NinjaTech AI’s pathway to achieving its mission. To democratize access to this transformative technology, it is critical to have access to high-powered compute, open source models, and an ecosystem of tools that make training each new agent affordable and fast. AWS’s purpose-built AI chips, access to the top open source models, and its training architecture make this possible.

To learn more about how we built NinjaTech AI’s multi-agent personal AI, you can read our whitepaper. You can also try these AI agents for free at MyNinja.ai.


About the authors

 Arash Sadrieh is the Co-Founder and Chief Science Officer at Ninjatech.ai. Arash co-founded Ninjatech.ai with a vision to make everyone more productive by taking care of time-consuming tasks with AI agents. This vision was shaped during his tenure as a Senior Applied Scientist at AWS, where he drove key research initiatives that significantly improved infrastructure efficiency over six years, earning him multiple patents for optimizing core infrastructure. His academic background includes a PhD in computer modeling and simulation, with collaborations with esteemed institutions such as Oxford University, Sydney University, and CSIRO. Prior to his industry tenure, Arash had a postdoctoral research tenure marked by publications in high-impact journals, including Nature Communications.

Tahir Azim is a Staff Software Engineer at NinjaTech. Tahir focuses on NinjaTech’s Inf2 and Trn1 based training and inference platforms, its unified gateway for accessing these platforms, and its RAG-based research skill. He previously worked at Amazon as a senior software engineer, building data-driven systems for optimal utilization of Amazon’s global Internet edge infrastructure, driving down cost, congestion and latency. Before moving to industry, Tahir earned an M.S. and Ph.D. in Computer Science from Stanford University, taught for three years as an assistant professor at NUST(Pakistan), and did a post-doc in fast data analytics systems at EPFL. Tahir has authored several publications presented at top-tier conferences such as VLDB, USENIX ATC, MobiCom and MobiHoc.

Tengfei Xue is an Applied Scientist at NinjaTech AI. His current research interests include natural language processing and multimodal learning, particularly using large language models and large multimodal models. Tengfei completed his PhD studies at the School of Computer Science, University of Sydney, where he focused on deep learning for healthcare using various modalities. He was also a visiting PhD candidate at the Laboratory of Mathematics in Imaging (LMI) at Harvard University, where he worked on 3D computer vision for complex geometric data.

Build generative AI applications on Amazon Bedrock — the secure, compliant, and responsible foundation

Generative AI has revolutionized industries by creating content, from text and images to audio and code. Although it can unlock numerous possibilities, integrating generative AI into applications demands meticulous planning. Amazon Bedrock is a fully managed service that provides access to large language models (LLMs) and other foundation models (FMs) from leading AI companies through a single API. It provides a broad set of tools and capabilities to help build generative AI applications.

Starting today, I’ll be writing a blog series to highlight some of the key factors driving customers to choose Amazon Bedrock. One of the most important reasons is that Bedrock enables customers to build a secure, compliant, and responsible foundation for generative AI applications. In this post, I explore how Amazon Bedrock helps address security and privacy concerns, enables secure model customization, accelerates auditability and incident response, and fosters trust through transparency and responsible AI. Plus, I’ll showcase real-world examples of companies building secure generative AI applications on Amazon Bedrock, demonstrating its practical applications across different industries.

Listening to what our customers are saying

During the past year, my colleague Jeff Barr, VP & Chief Evangelist at AWS, and I have had the opportunity to speak with numerous customers about generative AI. They mention compelling reasons for choosing Amazon Bedrock to build and scale their transformative generative AI applications. Jeff’s video highlights some of the key factors driving customers to choose Amazon Bedrock today.

As you build and operationalize generative AI, it’s important not to lose sight of critically important elements—security, compliance, and responsible AI—particularly for use cases involving sensitive data. The OWASP Top 10 For LLMs outlines the most common vulnerabilities, but addressing these may require additional efforts including stringent access controls, data encryption, preventing prompt injection attacks, and compliance with policies. You want to make sure your AI applications work reliably, as well as securely.

Making data security and privacy a priority

For many organizations starting their generative AI journey, the first concern is making sure the organization’s data remains secure and private when used for model tuning or Retrieval Augmented Generation (RAG). Amazon Bedrock provides a multi-layered approach to address this issue, helping you keep your data secure and private throughout the entire lifecycle of building generative AI applications:

  • Data isolation and encryption. Any customer content processed by Amazon Bedrock, such as customer inputs and model outputs, is not shared with any third-party model providers, and will not be used to train the underlying FMs. Furthermore, data is encrypted in-transit using TLS 1.2+ and at-rest through AWS Key Management Service (AWS KMS).
  • Secure connectivity options. Customers have flexibility with how they connect to Amazon Bedrock’s API endpoints. You can use public internet gateways, AWS PrivateLink (VPC endpoint) for private connectivity, and even backhaul traffic over AWS Direct Connect from your on-premises networks.
  • Model access controls. Amazon Bedrock provides robust access controls at multiple levels. Model access policies allow you to explicitly allow or deny enabling specific FMs for your account. AWS Identity and Access Management (IAM) policies let you further restrict which provisioned models your applications and roles can invoke, and which APIs on those models can be called.
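As an illustration of the IAM-level restriction described in the last item, the following sketch builds a policy document that allows invoking only selected foundation models. The Region, the Sid, and the helper name are assumptions for the example; the two model IDs are the ones used earlier in this document. Treat it as a starting point to adapt, not a complete or recommended policy.

```python
import json


def invoke_only_policy(region, model_ids):
    """Build an IAM policy allowing bedrock:InvokeModel on the given models only."""
    resources = [
        f"arn:aws:bedrock:{region}::foundation-model/{model_id}"
        for model_id in model_ids
    ]
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "AllowInvokeSelectedModels",  # hypothetical statement ID
                "Effect": "Allow",
                "Action": ["bedrock:InvokeModel"],
                "Resource": resources,
            }
        ],
    }


policy = invoke_only_policy(
    "us-east-1",  # assumed Region
    ["cohere.embed-multilingual-v3", "anthropic.claude-3-sonnet-20240229-v1:0"],
)
print(json.dumps(policy, indent=2))
```

Because IAM denies anything that is not explicitly allowed, a role carrying only this policy could not invoke other models in the account.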

Druva provides a data security software-as-a-service (SaaS) solution to enable cyber, data, and operational resilience for all businesses. They used Amazon Bedrock to rapidly experiment, evaluate, and implement different LLM components tailored to solve specific customer needs around data protection without worrying about the underlying infrastructure management.

“We built our new service Dru — an AI co-pilot that both IT and business teams can use to access critical information about their protection environments and perform actions in natural language — in Amazon Bedrock because it provides fully managed and secure access to an array of foundation models,”

– David Gildea, Vice President of Product, Generative AI at Druva.

Ensuring secure customization

A critical aspect of generative AI adoption for many organizations is the ability to securely customize the application to align with your specific use cases and requirements, including RAG or fine-tuning FMs. Amazon Bedrock offers a secure approach to model customization, so sensitive data remains protected throughout the entire process:

  • Model customization data security. When fine-tuning a model, Amazon Bedrock uses the encrypted training data from an Amazon Simple Storage Service (Amazon S3) bucket through a private VPC connection. Amazon Bedrock doesn’t use model customization data for any other purpose. Your training data isn’t used to train the base Amazon Titan models or distributed to third parties. Nor is other usage data, such as usage timestamps, logged account IDs, and other information logged by the service, used to train the models. In fact, none of the training or validation data you provide for fine-tuning or continued pre-training is stored by Amazon Bedrock. When the model customization work is complete, the customized model remains isolated and encrypted with your KMS keys.
  • Secure deployment of fine-tuned models. The pre-trained or fine-tuned models are deployed in isolated environments specifically for your account. You can further encrypt these models with your own KMS keys, preventing access without appropriate IAM permissions.
  • Centralized multi-account model access.  AWS Organizations provides you with the ability to centrally manage your environment across multiple accounts. You can create and organize accounts in an organization, consolidate costs, and apply policies for custom environments. For organizations with multiple AWS accounts or a distributed application architecture, Amazon Bedrock supports centralized governance and access to FMs – you can secure your environment, create and share resources, and centrally manage permissions. Using standard AWS cross-account IAM roles, administrators can grant secure access to models across different accounts, enabling controlled and auditable usage while maintaining a centralized point of control.

With seamless access to LLMs in Amazon Bedrock—and with data encrypted in-transit and at-rest—BMW Group securely delivers high-quality connected mobility solutions to motorists around the world.

“Using Amazon Bedrock, we’ve been able to scale our cloud governance, reduce costs and time to market, and provide a better service for our customers. All of this is helping us deliver the secure, first-class digital experiences that people across the world expect from BMW.”

– Dr. Jens Kohl, Head of Offboard Architecture, BMW Group.

Enabling auditability and visibility

In addition to the security controls around data isolation, encryption, and access, Amazon Bedrock provides capabilities to enable auditability and accelerate incident response when needed:

  • Compliance certifications. For customers with stringent regulatory requirements, you can use Amazon Bedrock in compliance with the General Data Protection Regulation (GDPR), Health Insurance Portability and Accountability Act (HIPAA), and more. In addition, AWS has successfully extended the registration status of Amazon Bedrock in Cloud Infrastructure Service Providers in Europe Data Protection Code of Conduct (CISPE CODE) Public Register. This declaration provides independent verification and an added level of assurance that Amazon Bedrock can be used in compliance with the GDPR. For Federal agencies and public sector organizations, Amazon Bedrock recently announced FedRAMP Moderate, approved for use in our US East and West AWS Regions. Amazon Bedrock is also under JAB review for FedRAMP High authorization in AWS GovCloud (US).
  • Monitoring and logging. Native integrations with Amazon CloudWatch and AWS CloudTrail provide comprehensive monitoring, logging, and visibility into API activity, model usage metrics, token consumption, and other performance data. These capabilities enable continuous monitoring for improvement, optimization, and auditing as needed – something we know is critical from working with customers in the cloud for the last 18 years. Amazon Bedrock allows you to enable detailed logging of all model inputs and outputs, including the IAM invocation role and metadata associated with all calls that are performed in your account. These logs help you monitor model responses so that they adhere to your organization’s AI policies and reputation guidelines. When you enable model invocation logging, you can use AWS KMS to encrypt your log data and use IAM policies to control who can access it. None of this data is stored within Amazon Bedrock; it is available only within your account.
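
The following is a minimal sketch of turning on model invocation logging from Python. It assumes a boto3 `bedrock` client is passed in, and that the log group, IAM role, and S3 bucket already exist; every resource name here is hypothetical.

```python
def build_logging_config(log_group_name, role_arn, bucket_name, key_prefix="bedrock-logs"):
    """Build the loggingConfig payload for model invocation logging."""
    return {
        "cloudWatchConfig": {"logGroupName": log_group_name, "roleArn": role_arn},
        "s3Config": {"bucketName": bucket_name, "keyPrefix": key_prefix},
        # Choose which kinds of model input and output data are delivered to the logs.
        "textDataDeliveryEnabled": True,
        "imageDataDeliveryEnabled": True,
        "embeddingDataDeliveryEnabled": True,
    }


def enable_invocation_logging(bedrock_client, logging_config):
    """Apply the configuration; bedrock_client is assumed to be boto3.client("bedrock")."""
    bedrock_client.put_model_invocation_logging_configuration(loggingConfig=logging_config)


# Hypothetical resource names, for illustration only.
logging_config = build_logging_config(
    log_group_name="/example/bedrock/invocations",
    role_arn="arn:aws:iam::111122223333:role/ExampleBedrockLoggingRole",
    bucket_name="example-bedrock-logs",
)
```

Encryption with AWS KMS is configured on the log group and bucket themselves, so it does not appear in this payload.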

Implementing responsible AI practices

AWS is committed to developing generative AI responsibly, taking a people-centric approach that prioritizes education, science, and our customers, to integrate responsible AI across the full AI lifecycle. With AWS’s comprehensive approach to responsible AI development and governance, Amazon Bedrock empowers you to build trustworthy generative AI systems in line with your responsible AI principles.

We give our customers the tools, guidance, and resources they need to get started with purpose-built services and features, including several in Amazon Bedrock:

  • Safeguard generative AI applications. Guardrails for Amazon Bedrock is the only responsible AI capability offered by a major cloud provider that enables customers to customize and apply safety, privacy, and truthfulness checks for their generative AI applications. Guardrails helps customers block as much as 85% more harmful content than protection natively provided by some FMs on Amazon Bedrock today. It works with all LLMs in Amazon Bedrock and with fine-tuned models, and it integrates with Agents and Knowledge Bases for Amazon Bedrock. Customers can define content filters with configurable thresholds to help filter harmful content across hate speech, insults, sexual language, violence, misconduct (including criminal activity), and prompt attacks (prompt injection and jailbreak). Using a short natural language description, Guardrails for Amazon Bedrock allows you to detect and block user inputs and FM responses that fall under restricted topics or sensitive content such as personally identifiable information (PII). You can combine multiple policy types to configure these safeguards for different scenarios and apply them across FMs on Amazon Bedrock. This helps your generative AI applications adhere to your organization’s responsible AI policies and provide a consistent and safe user experience.
  • Provenance tracking. Now available in preview, Model Evaluation on Amazon Bedrock helps customers evaluate, compare, and select the best FMs for their specific use case based on custom metrics, such as accuracy and safety, using either automatic or human evaluations. For automatic evaluations, customers pick criteria such as accuracy or toxicity, and use their own data or public datasets. For evaluations needing human judgment, customers can set up workflows for human review with a few clicks. After setup, Amazon Bedrock runs the evaluations and provides a report showing how well the model performed on important safety and accuracy measures. This report helps customers choose the best model for their needs, which is even more important when they are evaluating a migration from an existing model to a new model in Amazon Bedrock for an application.
  • Watermark detection. All Amazon Titan FMs are built with responsible AI in mind. Amazon Titan Image Generator creates images embedded with imperceptible digital watermarks. The watermark detection for Amazon Titan Image Generator allows you to identify images generated by Amazon Titan Image Generator, a foundation model that allows users to create realistic, studio-quality images in large volumes and at low cost, using natural language prompts. With this feature, you can increase transparency around AI-generated content by mitigating harmful content generation and reducing the spread of misinformation. It also provides a confidence score, allowing you to assess the reliability of the detection, even if the original image has been modified. Simply upload an image in the Amazon Bedrock console, and the API will detect watermarks embedded in images created by Titan Image Generator, including those generated by the base model and any customized versions.
  • AI Service Cards provide transparency and document the intended use cases and fairness considerations for our AWS AI services. Our latest service cards include Amazon Titan Text Premier, Amazon Titan Text Lite, and Amazon Titan Text Express, with more coming soon.
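
To make the Guardrails description above more concrete, here is a hedged sketch of a request that combines content filters, a denied topic, and PII handling. The payload is intended for the boto3 `bedrock` client's `create_guardrail` call; the guardrail name, topic, messages, and chosen strengths are illustrative assumptions.

```python
def build_guardrail_request(name: str) -> dict:
    """Build keyword arguments for a create_guardrail call."""
    harmful_categories = ["HATE", "INSULTS", "SEXUAL", "VIOLENCE", "MISCONDUCT"]
    filters = [
        {"type": category, "inputStrength": "HIGH", "outputStrength": "HIGH"}
        for category in harmful_categories
    ]
    # Prompt-attack detection applies to user input, so the output strength is NONE.
    filters.append({"type": "PROMPT_ATTACK", "inputStrength": "HIGH", "outputStrength": "NONE"})
    return {
        "name": name,
        "contentPolicyConfig": {"filtersConfig": filters},
        "topicPolicyConfig": {
            "topicsConfig": [
                {
                    # Hypothetical restricted topic, described in natural language.
                    "name": "InvestmentAdvice",
                    "definition": "Recommendations about specific financial investments.",
                    "type": "DENY",
                }
            ]
        },
        "sensitiveInformationPolicyConfig": {
            "piiEntitiesConfig": [{"type": "EMAIL", "action": "ANONYMIZE"}]
        },
        "blockedInputMessaging": "Sorry, this request cannot be processed.",
        "blockedOutputsMessaging": "Sorry, the response was blocked.",
    }


guardrail_request = build_guardrail_request("example-guardrail")
```

With a real client, the request would be sent as `boto3.client("bedrock").create_guardrail(**guardrail_request)`.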

Aha! is a software company that helps more than 1 million people bring their product strategy to life.

“Our customers depend on us every day to set goals, collect customer feedback, and create visual roadmaps. That is why we use Amazon Bedrock to power many of our generative AI capabilities. Amazon Bedrock provides responsible AI features, which enable us to have full control over our information through its data protection and privacy policies, and block harmful content through Guardrails for Bedrock.”

– Dr. Chris Waters, co-founder and Chief Technology Officer at Aha!

Building trust through transparency

By addressing security, compliance, and responsible AI holistically, Amazon Bedrock helps customers to unlock generative AI’s transformative potential. As generative AI capabilities continue to evolve so rapidly, building trust through transparency is crucial. Amazon Bedrock works continuously to help develop safe and secure applications and practices, helping build generative AI applications responsibly.

The bottom line? Amazon Bedrock makes it effortless for you to unlock sustained growth with generative AI and experience the power of LLMs. Get started today – Build AI applications or customize models securely using your data to start your generative AI journey with confidence.

About the author

Vasi Philomin is VP of Generative AI at AWS. He leads generative AI efforts, including Amazon Bedrock and Amazon Titan.

Build a conversational chatbot using different LLMs within single interface – Part 1

With the advent of generative artificial intelligence (AI), foundation models (FMs) can generate content such as answering questions, summarizing text, and providing highlights from the sourced document. However, for model selection, there is a wide choice from model providers, like Amazon, Anthropic, AI21 Labs, Cohere, and Meta, coupled with discrete real-world data formats in PDF, Word, text, CSV, image, audio, or video.

Amazon Bedrock is a fully managed service that makes it straightforward to build and scale generative AI applications. Amazon Bedrock offers a choice of high-performing FMs from leading AI companies, including AI21 Labs, Anthropic, Cohere, Meta, Stability AI, and Amazon, through a single API. It enables you to privately customize FMs with your data using techniques such as fine-tuning, prompt engineering, and Retrieval Augmented Generation (RAG), and build agents that run tasks using your enterprise systems and data sources while complying with security and privacy requirements.

In this post, we show you a solution for building a single interface conversational chatbot that allows end-users to choose between different large language models (LLMs) and inference parameters for varied input data formats. The solution uses Amazon Bedrock to create choice and flexibility to improve the user experience and compare the model outputs from different options.

The entire code base is available on GitHub, along with an AWS CloudFormation template.

What is RAG

Retrieval Augmented Generation (RAG) can enhance the generation process by using the benefits of retrieval, enabling a natural language generation model to produce more informed and contextually appropriate responses. By incorporating relevant information from retrieval into the generation process, RAG aims to improve the accuracy, coherence, and informativeness of the generated content.

Implementing an effective RAG system requires several key components working in harmony:

  • Foundation models – The foundation of a RAG architecture is a pre-trained language model that handles text generation. Amazon Bedrock encompasses models from leading AI companies like AI21 Labs, Anthropic, Cohere, Meta, Mistral AI, and Amazon that possess strong language comprehension and synthesis abilities to engage in conversational dialogue.
  • Vector store – At the heart of the retrieval functionality is a vector store database persisting document embeddings for similarity search. This allows rapid identification of relevant contextual information. AWS offers several services for your vector database requirements.
  • Retriever – The retriever module uses the vector store to efficiently find pertinent documents and passages to augment prompts.
  • Embedder – To populate the vector store, an embedding model encodes source documents into vector representations consumable by the retriever. Models like Amazon Titan Embeddings G1 – Text v1.2 are ideal for this text-to-vector abstraction.
  • Document ingestion – Robust pipelines ingest, preprocess, and tokenize source documents, chunking them into manageable passages for embedding and efficient lookup. For this solution, we use the LangChain framework for document preprocessing.

By orchestrating these core components using LangChain, RAG systems empower language models to access vast knowledge for grounded generation.
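
The way these components fit together can be sketched in a few lines of plain Python. This is a toy illustration, not the solution's implementation: the bag-of-words `embed` function stands in for a real embedding model, the in-memory `VectorStore` stands in for a vector database, and all names are hypothetical.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy embedder: a bag-of-words count vector (a stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[token] for token, count in a.items() if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class VectorStore:
    """In-memory vector store that keeps (embedding, passage) pairs."""

    def __init__(self):
        self.items = []

    def add(self, passage):
        self.items.append((embed(passage), passage))

    def search(self, query, k=2):
        """Retriever step: return the k passages most similar to the query."""
        query_vector = embed(query)
        ranked = sorted(self.items, key=lambda item: cosine(query_vector, item[0]), reverse=True)
        return [passage for _, passage in ranked[:k]]


def build_prompt(question, passages):
    """Augment the question with retrieved context before sending it to an FM."""
    context = "\n".join(f"- {passage}" for passage in passages)
    return f"Use the context to answer the question.\n\nContext:\n{context}\n\nQuestion: {question}"


store = VectorStore()
for passage in [
    "Refunds are issued within 30 days of purchase.",
    "The office is closed on public holidays.",
    "Shipping takes five business days.",
]:
    store.add(passage)

question = "When are refunds issued?"
prompt = build_prompt(question, store.search(question, k=1))
```

The resulting `prompt` is what would be passed to the foundation model, so the model answers from retrieved passages rather than from its training data alone.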

We have fully managed support for our end-to-end RAG workflow using Knowledge Bases for Amazon Bedrock. With Knowledge Bases for Amazon Bedrock, you can give FMs and agents contextual information from your company’s private data sources for RAG to deliver more relevant, accurate, and customized responses.

To equip FMs with up-to-date and proprietary information, organizations use RAG to fetch data from company data sources and enrich the prompt to provide more relevant and accurate responses. Knowledge Bases for Amazon Bedrock is a fully managed capability that helps you implement the entire RAG workflow, from ingestion to retrieval and prompt augmentation, without having to build custom integrations to data sources and manage data flows. Session context management is built in, so your app can readily support multi-turn conversations.
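
As a rough sketch of what a call to a knowledge base can look like from Python, the function below wraps the `retrieve_and_generate` operation of the boto3 `bedrock-agent-runtime` client. The client is passed in, and the knowledge base ID and model ARN are placeholders you would supply; the helper name is an assumption for illustration.

```python
def ask_knowledge_base(agent_runtime_client, knowledge_base_id, model_arn, question, session_id=None):
    """Query a knowledge base and return (answer_text, session_id).

    agent_runtime_client is assumed to be boto3.client("bedrock-agent-runtime").
    """
    request = {
        "input": {"text": question},
        "retrieveAndGenerateConfiguration": {
            "type": "KNOWLEDGE_BASE",
            "knowledgeBaseConfiguration": {
                "knowledgeBaseId": knowledge_base_id,
                "modelArn": model_arn,
            },
        },
    }
    # Reusing the session ID from an earlier response continues a multi-turn conversation.
    if session_id:
        request["sessionId"] = session_id
    response = agent_runtime_client.retrieve_and_generate(**request)
    return response["output"]["text"], response["sessionId"]
```

Passing the returned session ID into the next call is how the built-in session context is carried from one turn to the next.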

Solution overview

This chatbot is built using RAG, enabling it to provide versatile conversational abilities. The following figure illustrates a sample UI of the Q&A interface using Streamlit and the workflow.

This post provides a single UI with multiple choices for the following capabilities:

  • Leading FMs available through Amazon Bedrock
  • Inference parameters for each of these models
  • Source data input formats for RAG:
    • Text (PDF, CSV, Word)
    • Website link
    • YouTube video
    • Audio
    • Scanned image
    • PowerPoint
  • RAG operation using the LLM, inference parameter, and sources:
    • Q&A
    • Summary: summarize, get highlights, extract text
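
To illustrate how a single interface can switch between models and inference parameters, here is a minimal sketch that uses the Converse operation of the boto3 `bedrock-runtime` client, which accepts the same request shape for different models. This is an assumption for illustration; the post's code base may call the models differently, and the helper name and defaults are hypothetical.

```python
def ask_model(runtime_client, model_id, prompt, temperature=0.2, top_p=0.9, max_tokens=512):
    """Send one user message to the chosen model and return the reply text.

    runtime_client is assumed to be boto3.client("bedrock-runtime").
    """
    response = runtime_client.converse(
        modelId=model_id,
        messages=[{"role": "user", "content": [{"text": prompt}]}],
        # Inference parameters that the end-user would pick in the UI.
        inferenceConfig={
            "temperature": temperature,
            "topP": top_p,
            "maxTokens": max_tokens,
        },
    )
    return response["output"]["message"]["content"][0]["text"]
```

Because only `model_id` and the inference values change between calls, the UI can map its drop-down selections directly onto these arguments.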

We use one of LangChain’s many document loaders, YoutubeLoader. Its from_youtube_url function helps extract transcripts and metadata from the YouTube video.

The documents contain two attributes:

  • page_content with the transcripts
  • metadata with basic information about the video

Text is extracted from the transcript and loaded with the LangChain TextLoader. The document is then split into chunks, and embeddings are created for each chunk and stored in the vector store.
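
The split-and-chunk step can be sketched without any framework. The function below is a plain-Python stand-in for a LangChain text splitter, not the splitter the solution uses; it splits on words with a configurable overlap, and the chunk sizes and sample metadata are illustrative assumptions.

```python
def chunk_text(text, chunk_size=200, overlap=40):
    """Split text into chunks of up to chunk_size words, sharing overlap words between neighbors."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("chunk_size must be positive and overlap must be in [0, chunk_size)")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # The last chunk already reaches the end of the text.
    return chunks


def to_documents(chunks, metadata):
    """Pair each chunk with the source metadata, mirroring page_content and metadata."""
    return [{"page_content": chunk, "metadata": dict(metadata)} for chunk in chunks]


transcript = " ".join(f"w{i}" for i in range(10))
documents = to_documents(
    chunk_text(transcript, chunk_size=4, overlap=1),
    {"source": "example-video"},  # hypothetical metadata
)
```

The overlap keeps sentences that straddle a chunk boundary retrievable from either side, at the cost of embedding some words twice.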

The following diagram illustrates the solution architecture.

Prerequisites

To implement this solution, you should have the following prerequisites:

  • An AWS account with the required permissions to launch the stack using AWS CloudFormation.
  • The Amazon Elastic Compute Cloud (Amazon EC2) instance hosting the application should have internet access to download the necessary OS patches and application-related (Python) libraries.
  • A basic understanding of Amazon Bedrock and FMs.
  • This solution uses the Amazon Titan Text Embedding model. Make sure this model is enabled for use in Amazon Bedrock. On the Amazon Bedrock console, choose Model access in the navigation pane.
    • If Amazon Titan Text Embeddings is enabled, the access status will state Access granted.
    • If the model is not available, enable access to the model by choosing Manage model access, selecting Titan Multimodal Embeddings G1, and choosing Request model access. The model is enabled for use immediately.

Deploy the solution

The CloudFormation template deploys an Amazon Elastic Compute Cloud (Amazon EC2) instance to host the Streamlit application, along with other associated resources like an AWS Identity and Access Management (IAM) role and Amazon Simple Storage Service (Amazon S3) bucket. For more information about Amazon Bedrock and IAM, refer to How Amazon Bedrock Works with IAM.

In this post, we deploy the Streamlit application over an EC2 instance inside a VPC, but you can deploy it as a containerized application using a serverless solution with AWS Fargate. We discuss this in more detail in Part 2.

Complete the following steps to deploy the solution resources using AWS CloudFormation:

  1. Download the CloudFormation template StreamlitAppServer_Cfn.yml from the GitHub repo.
  2. On the AWS CloudFormation console, create a new stack.
  3. For Prepare template, select Template is ready.
  4. In the Specify template section, provide the following information:
    1. For Template source, select Upload a template file.
    2. Choose file and upload the template you downloaded.
  5. Choose Next.

  6. For Stack name, enter a name (for this post, StreamlitAppServer).
  7. In the Parameters section, provide the following information:
    1. For Specify the VPC ID where you want your app server deployed, enter the VPC ID where you want to deploy this application server.
    2. For VPCCidr, enter the CIDR of the VPC you’re using.
    3. For SubnetID, enter the subnet ID from the same VPC.
    4. For MYIPCidr, enter the IP address of your computer or workstation so you can open the Streamlit application in your local browser.

You can run the command curl https://api.ipify.org on your local terminal to get your IP address.

  8. Leave the remaining parameters at their default values.
  9. Choose Next.
  10. In the Capabilities section, select the acknowledgement check box.
  11. Choose Submit.

Wait until you see the stack status show as CREATE_COMPLETE.

  12. Choose the stack’s Resources tab to see the resources you launched as part of the stack deployment.

  13. Choose the link for S3Bucket to be redirected to the Amazon S3 console.
    1. Note the S3 bucket name to update the deployment script later.
    2. Choose Create folder to create a new folder.
    3. For Folder name, enter a name (for this post, gen-ai-qa).

Make sure to follow AWS security best practices for securing data in Amazon S3. For more details, see Top 10 security best practices for securing data in Amazon S3.

  14. Return to the stack Resources tab and choose the link to StreamlitAppServer to be redirected to the Amazon EC2 console.
    1. Select StreamlitApp_Sever and choose Connect.

This will open a new page with various ways to connect to the EC2 instance launched.

  15. For this solution, select Connect using EC2 Instance Connect, then choose Connect.

This will open an Amazon EC2 session in your browser.

  16. Run the following command to monitor the progress of all the Python-related libraries being installed as part of the user data:
tail -f /tmp/userData.log
  17. When you see the message Finished running user data..., you can stop following the log by pressing Ctrl + C.

This takes about 15 minutes to complete.

  18. Run the following commands to start the application:
cd $HOME/bedrock-qnachatbot
bucket_name=$(aws cloudformation describe-stacks --stack-name StreamlitAppServer --query "Stacks[0].Outputs[?starts_with(OutputKey, 'BucketName')].OutputValue" --output text)
TOKEN=$(curl -s -X PUT "http://169.254.169.254/latest/api/token" -H "X-aws-ec2-metadata-token-ttl-seconds: 21600")
aws_region_name=$(curl -s http://169.254.169.254/latest/meta-data/placement/region -H "X-aws-ec2-metadata-token: $TOKEN")
sed -i "s/<S3_Bucket_Name>/${bucket_name}/g" $HOME/bedrock-qnachatbot/src/utils.py
sed -i "s/<AWS_Region>/${aws_region_name}/g" $HOME/bedrock-qnachatbot/src/utils.py
export AWS_DEFAULT_REGION=${aws_region_name}
streamlit run src/1_🏠_Home.py

  19. Make a note of the External URL value.
  20. If you exit the session (or the application stops), you can restart the application by running the same commands shown in step 18.
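
The console steps for creating the stack can also be scripted. The following is a minimal sketch that assumes a boto3 `cloudformation` client is passed in and that the template body has already been read from StreamlitAppServer_Cfn.yml. The parameter keys VPCCidr, SubnetID, and MYIPCidr come from the steps above; any other key, and the capabilities the template actually requires, are assumptions to check against the template.

```python
def deploy_stack(cfn_client, template_body, parameters, stack_name="StreamlitAppServer"):
    """Create the stack, wait for CREATE_COMPLETE, and return its outputs as a dict.

    cfn_client is assumed to be boto3.client("cloudformation").
    """
    cfn_client.create_stack(
        StackName=stack_name,
        TemplateBody=template_body,
        Parameters=[
            {"ParameterKey": key, "ParameterValue": value}
            for key, value in parameters.items()
        ],
        # Assumption: the template creates IAM resources, matching the
        # acknowledgement check box in the console.
        Capabilities=["CAPABILITY_IAM", "CAPABILITY_NAMED_IAM"],
    )
    # Block until the stack reaches CREATE_COMPLETE (raises if creation fails).
    cfn_client.get_waiter("stack_create_complete").wait(StackName=stack_name)
    stacks = cfn_client.describe_stacks(StackName=stack_name)["Stacks"]
    return {
        output["OutputKey"]: output["OutputValue"]
        for output in stacks[0].get("Outputs", [])
    }
```

The returned outputs include the bucket name that the start-up commands in step 18 look up with `aws cloudformation describe-stacks`.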

Use the chatbot

Use the external URL you copied in the previous step to access the application.

You can upload your file to start using the chatbot for Q&A.

Clean up

To avoid incurring future charges, delete the resources that you created:

  1. Empty the contents of the S3 bucket you created as a part of this post.
  2. Delete the CloudFormation stack you created as part of this post.
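
The two cleanup steps above can be sketched in Python as follows. The function assumes boto3 `s3` and `cloudformation` clients are passed in and that the bucket is not versioned; a versioned bucket would also need its object versions removed. The helper name is hypothetical.

```python
def clean_up(s3_client, cfn_client, bucket_name, stack_name="StreamlitAppServer"):
    """Empty the bucket, delete the stack, and return the number of objects deleted."""
    deleted = 0
    paginator = s3_client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket_name):
        # Each page holds at most 1,000 keys, which is also the delete_objects limit.
        keys = [{"Key": obj["Key"]} for obj in page.get("Contents", [])]
        if keys:
            s3_client.delete_objects(Bucket=bucket_name, Delete={"Objects": keys})
            deleted += len(keys)
    cfn_client.delete_stack(StackName=stack_name)
    cfn_client.get_waiter("stack_delete_complete").wait(StackName=stack_name)
    return deleted
```

Emptying the bucket first matters because AWS CloudFormation cannot delete an S3 bucket that still contains objects.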

Conclusion

In this post, we showed you how to create a Q&A chatbot that can answer questions across an enterprise’s corpus of documents with choices of FM available within Amazon Bedrock—within a single interface.

In Part 2, we show you how to use Knowledge Bases for Amazon Bedrock with enterprise-grade vector databases like OpenSearch Service, Amazon Aurora PostgreSQL, MongoDB Atlas, Weaviate, and Pinecone with your Q&A chatbot.


About the Authors

Anand Mandilwar is an Enterprise Solutions Architect at AWS. He works with enterprise customers, helping them innovate and transform their business on AWS. He is passionate about automation around cloud operations, infrastructure provisioning, and cloud optimization. He also likes Python programming. In his spare time, he enjoys honing his photography skills, especially in portrait and landscape photography.

NagaBharathi Challa is a solutions architect in the US federal civilian team at Amazon Web Services (AWS). She works closely with customers to effectively use AWS services for their mission use cases, providing architectural best practices and guidance on a wide range of services. Outside of work, she enjoys spending time with family & spreading the power of meditation.