Build financial search applications using the Amazon Bedrock Cohere multilingual embedding model

Enterprises have access to massive amounts of data, much of which is difficult to discover because the data is unstructured. Conventional approaches to analyzing unstructured data use keyword or synonym matching. They don’t capture the full context of a document, making them less effective in dealing with unstructured data.

In contrast, text embeddings use machine learning (ML) capabilities to capture the meaning of unstructured data. Embeddings are generated by representational language models that translate text into numerical vectors and encode contextual information in a document. This enables applications such as semantic search, Retrieval Augmented Generation (RAG), topic modeling, and text classification.

For example, in the financial services industry, applications include extracting insights from earnings reports, searching for information from financial statements, and analyzing sentiment about stocks and markets found in financial news. Text embeddings enable industry professionals to extract insights from documents, minimize errors, and increase their performance.

In this post, we showcase an application that can search and query across financial news in different languages using Cohere’s Embed and Rerank models with Amazon Bedrock.

Cohere’s multilingual embedding model

Cohere is a leading enterprise AI platform that builds world-class large language models (LLMs) and LLM-powered solutions that allow computers to search, capture meaning, and converse in text. They provide ease of use and strong security and privacy controls.

Cohere’s multilingual embedding model generates vector representations of documents for over 100 languages and is available on Amazon Bedrock. This allows AWS customers to access it as an API, which eliminates the need to manage the underlying infrastructure and ensures that sensitive information remains securely managed and protected.

The multilingual model groups text with similar meanings by assigning them positions that are close to each other in a semantic vector space. With a multilingual embedding model, developers can process text in multiple languages without the need to switch between different models, as illustrated in the following figure. This makes processing more efficient and improves performance for multilingual applications.
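To make "close to each other in a semantic vector space" concrete, the following minimal sketch compares toy vectors using cosine similarity. The three-dimensional vectors and the sample headlines are invented for illustration and were not produced by any model.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy embeddings (illustrative values only)
toy_embeddings = {
    "Stock markets fall": [0.9, 0.1, 0.0],
    "Aktiemarkederne falder": [0.85, 0.15, 0.05],  # Danish headline with a similar meaning
    "New leasing standard": [0.1, 0.2, 0.95],      # unrelated topic
}

query_vec = toy_embeddings["Stock markets fall"]
similarities = {
    text: cosine_similarity(query_vec, vec)
    for text, vec in toy_embeddings.items()
}
```

In this toy setup, the Danish headline scores much closer to the English one than the unrelated headline does, which is the behavior a multilingual embedding model aims for.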

The following are some of the highlights of Cohere’s embedding model:

  • Focus on document quality – Typical embedding models are trained to measure similarity between documents, but Cohere’s model also measures document quality
  • Better retrieval for RAG applications – RAG applications require a good retrieval system, which Cohere’s embedding model excels at
  • Cost-efficient data compression – Cohere uses a special, compression-aware training method, resulting in substantial cost savings for your vector database

Use cases for text embedding

Text embeddings turn unstructured data into a structured form. This allows you to objectively compare, dissect, and derive insights from all of these documents. The following are example use cases that Cohere’s embedding model enables:

  • Semantic search – Enables powerful search applications when coupled with a vector database, with excellent relevance based on search phrase meaning
  • Search engine for a larger system – Finds and retrieves the most relevant information from connected enterprise data sources for RAG systems
  • Text classification – Supports intent recognition, sentiment analysis, and advanced document analysis
  • Topic modeling – Turns a collection of documents into distinct clusters to uncover emerging topics and themes

Enhanced search systems with Rerank

In enterprises where conventional keyword search systems are already present, how do you introduce modern semantic search capabilities? For such systems that have been part of a company’s information architecture for a long time, a complete migration to an embeddings-based approach is, in many cases, just not feasible.

Cohere’s Rerank endpoint is designed to bridge this gap. It acts as the second stage of a search flow to provide a ranking of relevant documents per a user’s query. Enterprises can retain an existing keyword (or even semantic) system for the first-stage retrieval and boost the quality of search results with the Rerank endpoint in the second-stage reranking.

Rerank provides a fast and straightforward option for improving search results by introducing semantic search technology into a user’s stack with a single line of code. The endpoint also comes with multilingual support. The following figure illustrates the retrieval and reranking workflow.
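The two-stage flow described above can be sketched as follows. This is a structural illustration only: `keyword_retrieve`, `rerank_candidates`, and `stand_in_scores` are hypothetical names, and the lookup table stands in for a call to a reranking model rather than reproducing how Rerank scores documents.

```python
def keyword_retrieve(query, documents, k=3):
    """First stage: rank documents by the number of terms shared with the query."""
    query_terms = set(query.lower().split())
    scored = [
        (len(query_terms & set(doc.lower().split())), i)
        for i, doc in enumerate(documents)
    ]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [documents[i] for score, i in scored[:k] if score > 0]

def rerank_candidates(query, candidates, score_fn, top_n=1):
    """Second stage: reorder first-stage candidates with a relevance scorer."""
    ranked = sorted(candidates, key=lambda doc: score_fn(query, doc), reverse=True)
    return ranked[:top_n]

documents = [
    "Stock value of listed companies falls 15%",
    "29% of companies have no crisis staff",
    "CFOs worry about the global economy",
    "New leasing standard guide",
]

# Hypothetical relevance scores standing in for a reranking model's output
stand_in_scores = {documents[0]: 0.2, documents[1]: 0.9}

def stand_in_scorer(query, doc):
    return stand_in_scores.get(doc, 0.0)

query = "are companies prepared for a downturn"
first_stage = keyword_retrieve(query, documents)
reranked = rerank_candidates(query, first_stage, stand_in_scorer, top_n=1)
```

The first stage can stay exactly as it is in an existing system; only the second function is new, which is why the approach suits legacy search stacks.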

Solution overview

Financial analysts need to digest a lot of content, such as financial publications and news media, in order to stay informed. According to the Association for Financial Professionals (AFP), financial analysts spend 75% of their time gathering data or administering the process instead of added-value analysis. Finding the answer to a question across a variety of sources and documents is time-intensive and tedious work. The Cohere embedding model helps analysts quickly search across numerous article titles in multiple languages to find and rank the articles that are most relevant to a particular query, saving an enormous amount of time and effort.

In the following use case example, we showcase how Cohere’s Embed model searches and queries across financial news in different languages in one unique pipeline. Then we demonstrate how adding Rerank to your embeddings retrieval (or adding it to a legacy lexical search) can further improve results.

The supporting notebook is available on GitHub.

The following diagram illustrates the workflow of the application.

Enable model access through Amazon Bedrock

Amazon Bedrock users need to request access to models to make them available for use. To request access to additional models, choose Model access in the navigation pane on the Amazon Bedrock console. For more information, see Model access. For this walkthrough, you need to request access to the Cohere Embed Multilingual model.

Install packages and import modules

First, we install the necessary packages and import the modules we’ll use in this example:

!pip install --upgrade cohere-aws hnswlib translate

import pandas as pd
import cohere_aws
import hnswlib
import os
import re
import boto3

Import documents

We use a dataset (MultiFIN) containing a list of real-world article headlines covering 15 languages (English, Turkish, Danish, Spanish, Polish, Greek, Finnish, Hebrew, Japanese, Hungarian, Norwegian, Russian, Italian, Icelandic, and Swedish). This is an open source dataset curated for financial natural language processing (NLP) and is available on a GitHub repository.

In our case, we’ve created a CSV file with MultiFIN’s data as well as a column with translations. We don’t use this column to feed the model; we use it to help us follow along when we print the results for those who don’t speak Danish or Spanish. We point to that CSV to create our dataframe:

url = ""
df = pd.read_csv(url)

# Inspect dataset
df.head()

Select a list of documents to query

MultiFIN has over 6,000 records in 15 different languages. For our example use case, we focus on three languages: English, Spanish, and Danish. We also sort the headers by length and pick the longest ones.

Because we’re picking the longest articles, we ensure the length is not due to repeated sequences. The following code shows an example where that is the case. We will clean that up.


'El 86% de las empresas españolas comprometidas con los Objetivos de Desarrollo 
Sostenible comprometidas con los Objetivos de Desarrollo Sostenible comprometidas 
con los Objetivos de Desarrollo Sostenible comprometidas con los Objetivos de 
Desarrollo Sostenible'
# Ensure there is no duplicated text in the headers
def remove_duplicates(text):
    return re.sub(r'((\b\w+\b.{1,2}\w+\b)+).+\1', r'\1', text, flags=re.I)

df['text'] = df['text'].apply(remove_duplicates)

# Keep only selected languages
languages = ['English', 'Spanish', 'Danish']
df = df.loc[df['lang'].isin(languages)].copy()  # copy so a new column can be added safely

# Pick the top 80 longest articles
df['text_length'] = df['text'].str.len()
df.sort_values(by=['text_length'], ascending=False, inplace=True)
top_80_df = df[:80]

# Language distribution
top_80_df['lang'].value_counts()

Our list of documents is nicely distributed across the three languages:

Spanish    33
English    29
Danish     18
Name: count, dtype: int64
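To see what the duplicate-removal step does, the following sketch applies the pattern to a shortened version of the repeated Spanish headline shown earlier. The pattern is written with the backslashes that word boundaries (`\b`), word characters (`\w`), and the backreference (`\1`) require; the shortened headline is an illustrative input, not a row from the dataset.

```python
import re

def remove_duplicates(text):
    # Capture a word pair, then match everything up to its last repetition
    # and replace the whole span with a single copy of the pair (\1).
    return re.sub(r'((\b\w+\b.{1,2}\w+\b)+).+\1', r'\1', text, flags=re.I)

headline = (
    "El 86% de las empresas españolas comprometidas con los Objetivos de Desarrollo "
    "Sostenible comprometidas con los Objetivos de Desarrollo Sostenible"
)
cleaned = remove_duplicates(headline)
```

Here the first word pair that recurs later in the string is "comprometidas con", so everything between its first and last occurrence is collapsed and the repeated clause is dropped.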

The following is the longest article header in our dataset:

"CFOdirect: Resultater fra PwC's Employee Engagement Landscape Survey, herunder hvordan 
man skaber mere engagement blandt medarbejdere. Læs desuden om de regnskabsmæssige 
konsekvenser for indkomstskat ifbm. Brexit"

Embed and index documents

Now, we want to embed our documents and store the embeddings. The embeddings are very large vectors that encapsulate the semantic meaning of our document. In particular, we use Cohere’s embed-multilingual-v3.0 model, which creates embeddings with 1,024 dimensions.

When a query is passed, we also embed the query and use the hnswlib library to find the closest neighbors.

It only takes a few lines of code to establish a Cohere client, embed the documents, and create the search index. We also keep track of the language and translation of the document to enrich the display of the results.

# Establish Cohere client
co = cohere_aws.Client(mode=cohere_aws.Mode.BEDROCK)
model_id = "cohere.embed-multilingual-v3"

# Embed documents
docs = top_80_df['text'].to_list()
docs_lang = top_80_df['lang'].to_list()
translated_docs = top_80_df['translated_text'].to_list() #for reference when returning non-English results
doc_embs = co.embed(texts=docs, model_id=model_id, input_type='search_document').embeddings

# Create a search index
index = hnswlib.Index(space='ip', dim=1024)
index.init_index(max_elements=len(doc_embs), ef_construction=512, M=64)
index.add_items(doc_embs, list(range(len(doc_embs))))
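The index above performs an approximate nearest-neighbor search. For intuition, the following sketch shows the exhaustive equivalent: score every document against the query by inner product and keep the best `k`. The function name `exact_knn` and the two-dimensional toy vectors are assumptions made for illustration; real embeddings have 1,024 dimensions.

```python
def inner_product(a, b):
    return sum(x * y for x, y in zip(a, b))

def exact_knn(query_vec, doc_vecs, k=3):
    """Return the ids of the k documents with the largest inner product."""
    scores = [
        (inner_product(query_vec, vec), doc_id)
        for doc_id, vec in enumerate(doc_vecs)
    ]
    scores.sort(key=lambda pair: pair[0], reverse=True)
    return [doc_id for _, doc_id in scores[:k]]

# Hypothetical toy document embeddings; list position is the document id
toy_doc_embs = [
    [0.1, 0.9],
    [0.8, 0.2],
    [0.7, 0.3],
    [0.0, 1.0],
]
toy_query = [1.0, 0.0]
nearest = exact_knn(toy_query, toy_doc_embs, k=2)
```

Exhaustive search is perfectly adequate for 80 documents; an approximate index such as the one built with hnswlib matters once the corpus is too large to score in full on every query.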

Build a retrieval system

Next, we build a function that takes a query as input, embeds it, and finds the three headers most closely related to it:

# Retrieval of 3 closest docs to query
def retrieval(query):
    # Embed query and retrieve results
    query_emb = co.embed(texts=[query], model_id=model_id, input_type="search_query").embeddings
    doc_ids = index.knn_query(query_emb, k=3)[0][0] # we will retrieve 3 closest neighbors
    # Print and append results
    print(f"QUERY: {query.upper()} \n")
    retrieved_docs, translated_retrieved_docs = [], []
    for doc_id in doc_ids:
        # Append results
        retrieved_docs.append(docs[doc_id])
        translated_retrieved_docs.append(translated_docs[doc_id])
        # Print results
        print(f"ORIGINAL ({docs_lang[doc_id]}): {docs[doc_id]}")
        if docs_lang[doc_id] != "English":
            print(f"TRANSLATION: {translated_docs[doc_id]} \n----")
    print("END OF RESULTS \n\n")
    return retrieved_docs, translated_retrieved_docs

Query the retrieval system

Let’s explore what our system does with a couple of different queries. We start with English:

queries = [
    "Are businesses meeting sustainability goals?",
    "Can data science help meet sustainability goals?"
]

for query in queries:
    retrieval(query)

The results are as follows:


ORIGINAL (English): Quality of business reporting on the Sustainable Development Goals 
improves, but has a long way to go to meet and drive targets.
ORIGINAL (English): Only 10 years to achieve Sustainable Development Goals but 
businesses remain on starting blocks for integration and progress
ORIGINAL (Spanish): Integrar los criterios ESG y el propósito en la estrategia 
principal reto de los Consejos de las empresas españolas en el mundo post-COVID 

TRANSLATION: Integrate ESG criteria and purpose into the main challenge strategy 
of the Boards of Spanish companies in the post-COVID world 


ORIGINAL (English): Using AI to better manage the environment could reduce greenhouse 
gas emissions, boost global GDP by up to 38m jobs by 2030
ORIGINAL (English): Quality of business reporting on the Sustainable Development Goals 
improves, but has a long way to go to meet and drive targets.
ORIGINAL (English): Only 10 years to achieve Sustainable Development Goals but 
businesses remain on starting blocks for integration and progress

Notice the following:

  • We’re asking related, but slightly different questions, and the model is nuanced enough to present the most relevant results at the top.
  • Our model does not perform keyword-based search, but semantic search. Even if we’re using a term like “data science” instead of “AI,” our model is able to understand what’s being asked and return the most relevant result at the top.

How about a query in Danish? Let’s look at the following query:

query = "Hvor kan jeg finde den seneste danske boligplan?" # "Where can I find the latest Danish property plan?"
retrieved_docs, translated_retrieved_docs = retrieval(query)

ORIGINAL (Danish): Nyt fra CFOdirect: Ny PP&E-guide, FAQs om den nye leasingstandard, 
podcast om udfordringerne ved implementering af leasingstandarden og meget mere

TRANSLATION: New from CFOdirect: New PP&E guide, FAQs on the new leasing standard, 
podcast on the challenges of implementing the leasing standard and much more 
ORIGINAL (Danish): Lovforslag fremlagt om rentefri lån, udskudt frist for 
lønsumsafgift, førtidig udbetaling af skattekredit og loft på indestående på 

TRANSLATION: Legislative proposal presented on interest-free loans, deferred payroll 
tax deadline, early payment of tax credit and ceiling on deposits in the tax account 
ORIGINAL (Danish): Nyt fra CFOdirect: Shareholder-spørgsmål til ledelsen, SEC 
cybersikkerhedsguide, den amerikanske skattereform og meget mere

TRANSLATION: New from CFOdirect: Shareholder questions for management, the SEC 
cybersecurity guide, US tax reform and more 

In the preceding example, the English acronym “PP&E” stands for “property, plant, and equipment,” and our model was able to connect it to our query.

In this case, all returned results are in Danish, but the model can return a document in a language other than the query if its semantic meaning is closer. We have complete flexibility, and with a few lines of code, we can specify whether the model should only look at documents in the language of the query, or whether it should look at all documents.
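One possible way to restrict a search to the query's language is to filter candidate documents by their language label before ranking them. The sketch below uses exact inner-product search on toy vectors; `search_with_language_filter` and all data values are hypothetical, and this is not necessarily how the accompanying notebook implements the option.

```python
def search_with_language_filter(query_vec, doc_vecs, doc_langs, allowed_langs=None, k=3):
    """Exact inner-product search over documents whose language is allowed.

    allowed_langs=None searches every document regardless of language.
    """
    candidates = [
        doc_id
        for doc_id, lang in enumerate(doc_langs)
        if allowed_langs is None or lang in allowed_langs
    ]
    candidates.sort(
        key=lambda doc_id: sum(q * d for q, d in zip(query_vec, doc_vecs[doc_id])),
        reverse=True,
    )
    return candidates[:k]

# Hypothetical toy embeddings and language labels
toy_vecs = [[0.9, 0.1], [0.8, 0.3], [0.2, 0.9], [0.7, 0.2]]
toy_langs = ["English", "Danish", "Danish", "Spanish"]
toy_query_vec = [1.0, 0.0]

all_hits = search_with_language_filter(toy_query_vec, toy_vecs, toy_langs, k=2)
danish_hits = search_with_language_filter(
    toy_query_vec, toy_vecs, toy_langs, allowed_langs={"Danish"}, k=2
)
```

Without a filter, the best matches can come from any language; with the filter, only Danish documents are considered even though an English document scores higher.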

Improve results with Cohere Rerank

Embeddings are very powerful. However, we’re now going to look at how to refine our results even further with Cohere’s Rerank endpoint, which has been trained to score the relevancy of documents against a query.

Another advantage of Rerank is that it can work on top of a legacy keyword search engine. You don’t have to change to a vector database or make drastic changes to your infrastructure, and it only takes a few lines of code. Rerank is available in Amazon SageMaker.

Let’s try a new query. We use SageMaker this time:

query = "Are companies ready for the next down market?"
retrieved_docs, translated_retrieved_docs = retrieval(query)

ORIGINAL (Spanish): El valor en bolsa de las 100 mayores empresas cotizadas cae un 15% 
entre enero y marzo pero aguanta el embate del COVID-19 

TRANSLATION: The stock market value of the 100 largest listed companies falls 15% 
between January and March but withstands the onslaught of COVID-19 
ORIGINAL (English): 69% of business leaders have experienced a corporate crisis in the 
last five years yet 29% of companies have no staff dedicated to crisis preparedness
ORIGINAL (English): As work sites slowly start to reopen, CFOs are concerned about the 
global economy and a potential new COVID-19 wave - PwC survey

In this case, a semantic search was able to retrieve our answer and display it in the results, but it’s not at the top. However, when we pass the query again to our Rerank endpoint with the list of docs retrieved, Rerank is able to surface the most relevant document at the top.

First, we create the client and the Rerank endpoint:

# map model package arn
import boto3
cohere_package = "cohere-rerank-multilingual-v2--8b26a507962f3adb98ea9ac44cb70be1" # replace this with your info

model_package_map = {
    "us-east-1": f"arn:aws:sagemaker:us-east-1:865070037744:model-package/{cohere_package}",
    "us-east-2": f"arn:aws:sagemaker:us-east-2:057799348421:model-package/{cohere_package}",
    "us-west-1": f"arn:aws:sagemaker:us-west-1:382657785993:model-package/{cohere_package}",
    "us-west-2": f"arn:aws:sagemaker:us-west-2:594846645681:model-package/{cohere_package}",
    "ca-central-1": f"arn:aws:sagemaker:ca-central-1:470592106596:model-package/{cohere_package}",
    "eu-central-1": f"arn:aws:sagemaker:eu-central-1:446921602837:model-package/{cohere_package}",
    "eu-west-1": f"arn:aws:sagemaker:eu-west-1:985815980388:model-package/{cohere_package}",
    "eu-west-2": f"arn:aws:sagemaker:eu-west-2:856760150666:model-package/{cohere_package}",
    "eu-west-3": f"arn:aws:sagemaker:eu-west-3:843114510376:model-package/{cohere_package}",
    "eu-north-1": f"arn:aws:sagemaker:eu-north-1:136758871317:model-package/{cohere_package}",
    "ap-southeast-1": f"arn:aws:sagemaker:ap-southeast-1:192199979996:model-package/{cohere_package}",
    "ap-southeast-2": f"arn:aws:sagemaker:ap-southeast-2:666831318237:model-package/{cohere_package}",
    "ap-northeast-2": f"arn:aws:sagemaker:ap-northeast-2:745090734665:model-package/{cohere_package}",
    "ap-northeast-1": f"arn:aws:sagemaker:ap-northeast-1:977537786026:model-package/{cohere_package}",
    "ap-south-1": f"arn:aws:sagemaker:ap-south-1:077584701553:model-package/{cohere_package}",
    "sa-east-1": f"arn:aws:sagemaker:sa-east-1:270155090741:model-package/{cohere_package}",
}

region = boto3.Session().region_name
if region not in model_package_map.keys():
    raise Exception(f"Current boto3 session region {region} is not supported.")

model_package_arn = model_package_map[region]

co = cohere_aws.Client(region_name=region)
co.create_endpoint(arn=model_package_arn, endpoint_name="cohere-rerank-multilingual", instance_type="ml.g4dn.xlarge", n_instances=1)

When we pass the documents to Rerank, the model is able to pick the most relevant one accurately:

results = co.rerank(query=query, documents=retrieved_docs, top_n=1)

for hit in results:
    print(hit.document['text'])

69% of business leaders have experienced a corporate crisis in the last five years yet 
29% of companies have no staff dedicated to crisis preparedness


Conclusion

This post presented a walkthrough of using Cohere’s multilingual embedding model in Amazon Bedrock in the financial services domain. In particular, we demonstrated an example of a multilingual financial articles search application. We saw how the embedding model enables efficient and accurate discovery of information, thereby boosting the productivity and output quality of an analyst.

Cohere’s multilingual embedding model supports over 100 languages. It removes the complexity of building applications that require working with a corpus of documents in different languages. The Cohere Embed model is trained to deliver results in real-world applications. It handles noisy data as inputs, adapts to complex RAG systems, and delivers cost-efficiency from its compression-aware training method.

Start building with Cohere’s multilingual embedding model in Amazon Bedrock today.

About the Authors

James Yi is a Senior AI/ML Partner Solutions Architect in the Technology Partners COE Tech team at Amazon Web Services. He is passionate about working with enterprise customers and partners to design, deploy, and scale AI/ML applications to derive business value. Outside of work, he enjoys playing soccer, traveling, and spending time with his family.

Gonzalo Betegon is a Solutions Architect at Cohere, a provider of cutting-edge natural language processing technology. He helps organizations address their business needs through the deployment of large language models.

Meor Amer is a Developer Advocate at Cohere, a provider of cutting-edge natural language processing (NLP) technology. He helps developers build cutting-edge applications with Cohere’s Large Language Models (LLMs).

Ball position tracking in the cloud with the PGA TOUR

The PGA TOUR continues to enhance the golf experience with real-time data that brings fans closer to the game. To deliver even richer experiences, they are pursuing the development of a next-generation ball position tracking system that automatically tracks the position of the ball on the green.

The TOUR currently uses ShotLink powered by CDW, a premier scoring system that uses a complex camera system with on-site compute, to closely track the start and end position of every shot. The TOUR wanted to explore computer vision and machine learning (ML) techniques to develop a next-generation cloud-based pipeline to locate golf balls on the putting green.

The Amazon Generative AI Innovation Center (GAIIC) demonstrated the effectiveness of these techniques in an example dataset from a recent PGA TOUR event. The GAIIC designed a modular pipeline cascading a series of deep convolutional neural networks that successfully localizes players within a camera’s field of view, determines which player is putting, and tracks the ball as it moves toward the cup.

In this post, we describe the development of this pipeline, the raw data, the design of the convolutional neural networks comprising the pipeline, and an evaluation of its performance.


The TOUR provided 3 days of continuous video from a recent tournament from three 4K cameras positioned around the green on one hole. The following figure shows a frame from one camera cropped and zoomed so that the player putting is easily visible. Note that despite the high resolution of the cameras, because of the distance from the green, the ball appears small (usually 3×3, 4×4 or 5×5 pixels), and targets of this size can be difficult to localize accurately.

In addition to the camera feeds, the TOUR provided the GAIIC with annotated scoring data on each shot, including the world location of its resting position and the timestamp. This allowed for visualizations of every putt on the green, as well as the ability to pull all of the video clips of players putting, which could be manually labeled and used to train the detection models that make up the pipeline. The following figure shows the three camera views with approximate putt path overlays, counterclockwise from top left. The pin is moved each day, where day 1 corresponds to blue, day 2 to red, and day 3 to orange.

Pipeline overview

The overall system consists of both a training pipeline and an inference pipeline. The following diagram illustrates the architecture of the training pipeline. The starting point is ingestion of video data, either from a streaming module like Amazon Kinesis for live video or placement directly into Amazon Simple Storage Service (Amazon S3) for historical video. The training pipeline requires video preprocessing and hand labeling of images with Amazon SageMaker Ground Truth. Models can be trained with Amazon SageMaker and their artifacts stored with Amazon S3.

The inference pipeline, shown in the following diagram, consists of a number of modules that successively extract information from the raw video and ultimately predict the world coordinates of the ball at rest. Initially, the green is cropped from the larger field of view from each camera, in order to cut down on the pixel area in which the models must search for players and balls. Next, a deep convolutional neural network (CNN) is used to find the locations of people in the field of view. Another CNN is used to predict which type of person has been found in order to determine whether anyone is about to putt. After a likely putter has been localized in the field of view, the same network is used to predict the location of the ball near the putter. A third CNN tracks the ball during its motion, and lastly, a transformation function from camera pixel position to GPS coordinates is applied.
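The post does not specify the form of the pixel-to-world transformation. One common choice for a roughly planar surface is a planar homography, a 3×3 matrix that maps image pixels to coordinates on the plane. The following sketch assumes that approach; `apply_homography` and the matrix values are hypothetical, and a real matrix would have to be calibrated from known reference points. Converting plane coordinates to GPS would need a further step that is not shown.

```python
def apply_homography(H, pixel_xy):
    """Map a pixel (x, y) to plane coordinates using a 3x3 homography H."""
    x, y = pixel_xy
    u = H[0][0] * x + H[0][1] * y + H[0][2]
    v = H[1][0] * x + H[1][1] * y + H[1][2]
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    # Divide by the homogeneous coordinate to get back to 2D
    return (u / w, v / w)

# Hypothetical matrix: scales pixels to meters and shifts the origin
H_example = [
    [0.01, 0.0, -5.0],
    [0.0, 0.01, -3.0],
    [0.0, 0.0, 1.0],
]
world_xy = apply_homography(H_example, (1000, 500))
```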

Player detection

Although it would be possible to run a CNN for ball detection over an entire 4K frame at a set interval, given the angular size of the ball at these camera distances, any small white object triggers a detection, resulting in many false alarms. To avoid searching the entire image frame for the ball, it’s possible to take advantage of correlations between player pose and ball location. A ball that is about to be putted must be next to a player, so finding the players in the field of view will greatly restrict the pixel area in which the detector must search for the ball.
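The effect of restricting the search area can be sketched as follows: expand a detected player's bounding box by a margin, clip it to the frame, and search for the ball only inside that region. The function name `ball_search_region`, the 50% margin, the box coordinates, and the 3840×2160 frame size are all assumptions made for illustration.

```python
def ball_search_region(player_box, frame_size, margin=0.5):
    """Expand a player bounding box by a margin and clip it to the frame.

    player_box is (x_min, y_min, x_max, y_max) in pixels.
    frame_size is (width, height) in pixels.
    """
    x_min, y_min, x_max, y_max = player_box
    width, height = frame_size
    pad_x = (x_max - x_min) * margin
    pad_y = (y_max - y_min) * margin
    return (
        max(0, int(x_min - pad_x)),
        max(0, int(y_min - pad_y)),
        min(width, int(x_max + pad_x)),
        min(height, int(y_max + pad_y)),
    )

frame_size_4k = (3840, 2160)  # assumed 4K resolution
region = ball_search_region((1800, 900, 1900, 1100), frame_size_4k)
region_area = (region[2] - region[0]) * (region[3] - region[1])
frame_area = frame_size_4k[0] * frame_size_4k[1]
```

In this toy case the search region covers less than 1% of the frame, which leaves far fewer small white objects that could be mistaken for the ball.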

We were able to use a CNN that was pre-trained to predict bounding boxes around all the people in a scene, as shown in the following figure. Unfortunately, there is frequently more than one ball on the green, so further logic is required beyond simply finding all people and searching for a ball. This requires another CNN to find the player who is currently putting.

Player classification and ball detection

To further narrow down where the ball could be, we fine-tuned a pre-trained object-detection CNN (YOLO v7) to classify all the people on the green. An important component of this process was manually labeling a set of images using SageMaker Ground Truth. The labels allowed the CNN to classify the player putting with high accuracy. In the labeling process, the ball was also outlined along with the player putting, so this CNN was able to perform ball detection as well, drawing an initial bounding box around the ball before a putt and feeding the position information into the downstream ball tracking CNN.

We use four different labels to annotate the objects in the images:

  • player-putting – The player holding a club and in the putting position
  • player-not-putting – The player not in the putting position (may also be holding a club)
  • other-person – Any other person who is not a player
  • golf-ball – The golf ball

The following figure shows how a CNN was fine-tuned using labels from SageMaker Ground Truth to classify each person in the field of view. This is difficult because of the wide range of visual appearances of players, caddies, and fans. After a player was classified as putting, a CNN fine-tuned for ball detection was applied to the small area immediately around that player.

Ball path tracking

A third CNN, a ResNet architecture pre-trained for motion tracking, was used for tracking the ball after it was putted. Motion tracking is a thoroughly researched problem, so this network performed well when integrated into the pipeline without further fine-tuning.

Pipeline output

The cascade of CNNs places bounding boxes around people, classifies people on the green, detects the initial ball position, and tracks the ball once it begins moving. The following figure shows the labeled video output of the pipeline. The pixel positions of the ball as it moves are tracked and recorded. Note that people on the green are being tracked and outlined by bounding boxes; the putter at the bottom is labeled correctly as “player putting,” and the moving ball is being tracked and outlined by a small blue bounding box.


To assess performance of components of the pipeline, it’s necessary to have labeled data. Although we were provided with the ground truth world position of the ball, we didn’t have intermediate points for ground truth, like the final pixel position of the ball or the pixel location of the player putting. With the labeling job that we carried out, we developed ground truth data for these intermediate outputs of the pipeline that allow us to measure performance.

Player classification and ball detection accuracy

For detection of the player putting and the initial ball location, we labeled a dataset and fine-tuned a YOLO v7 CNN model as described earlier. The model classified the output from the previous person detection module into four classes: a player putting, a player not putting, other people, and the golf ball, as shown in the following figure.

The performance of this module is assessed with a confusion matrix, shown in the following figure. The values in the diagonal boxes show how often the predicted class matched the actual class from the ground truth labels. The model has 89% recall or better for each person class, and 79% recall for golf balls (which is to be expected because the model is pre-trained on examples with people but not on examples with golf balls; this could be improved with more labeled golf balls in the training set).
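Per-class recall is read from a confusion matrix by dividing each diagonal entry by the sum of its row. The following sketch shows the calculation on invented counts; the numbers are not the results reported in this post.

```python
labels = ["player-putting", "player-not-putting", "other-person", "golf-ball"]

# Rows are actual classes, columns are predicted classes.
# Counts are invented for illustration only.
confusion = [
    [45, 3, 2, 0],
    [4, 90, 6, 0],
    [1, 9, 190, 0],
    [1, 0, 9, 40],
]

def recall_per_class(matrix, class_names):
    """Recall for each class: correct predictions divided by actual instances."""
    return {
        name: matrix[i][i] / sum(matrix[i])
        for i, name in enumerate(class_names)
    }

recall = recall_per_class(confusion, labels)
```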

The next step is to trigger the ball tracker. Because the ball detection output is a confidence probability, it’s also possible to set the threshold for “detected ball” and observe how that changes the results, summarized in the following figure. There is a trade-off in this method because a higher threshold will necessarily have fewer false alarms but also miss some of the less certain examples of balls. We tested thresholds of 20% and 50% confidence, and found ball detection at 78% and 61%, respectively. By this measure, the 20% threshold is better. The trade-off is apparent in that for the 20% confidence threshold, 80% of total detections were actually balls (20% false positive), whereas for the 50% confidence threshold, 90% were balls (10% false positive). For fewer false positives, the 50% confidence threshold is better. Both of these measures could be improved with more labeled data for a larger training set.
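The trade-off between the two thresholds can be reproduced on a small scale. The sketch below computes the detection rate (share of real balls found) and the precision (share of detections that are real balls) at two confidence thresholds; the detections and the ball count are invented, and `detection_metrics` is a hypothetical helper.

```python
def detection_metrics(detections, num_true_balls, threshold):
    """Return (detection_rate, precision) for detections at or above threshold.

    detections is a list of (confidence, is_ball) pairs.
    """
    kept = [is_ball for confidence, is_ball in detections if confidence >= threshold]
    true_positives = sum(kept)
    detection_rate = true_positives / num_true_balls
    precision = true_positives / len(kept) if kept else 0.0
    return detection_rate, precision

# Invented detections: (model confidence, whether it really was a ball)
toy_detections = [
    (0.95, True), (0.80, True), (0.65, True), (0.55, False),
    (0.45, True), (0.35, True), (0.30, False), (0.25, False),
    (0.22, True), (0.10, False),
]
num_true_balls = 8  # includes balls the model never detected

low_threshold = detection_metrics(toy_detections, num_true_balls, 0.20)
high_threshold = detection_metrics(toy_detections, num_true_balls, 0.50)
```

With these toy values, lowering the threshold finds more of the real balls but admits more false positives, which mirrors the pattern described above.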

The detection pipeline throughput is on the order of 10 frames per second, so in its current form, a single instance is not fast enough to run continuously on input at 50 frames per second. Achieving the 7-second mark for output after the ball stops would require further optimization for latency, for example, by running multiple instances of the pipeline in parallel and compressing the CNN models via quantization.
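As a rough back-of-the-envelope check, the gap between the two frame rates indicates how much parallelism would be needed. The sketch assumes ideal scaling (each instance processes its own share of frames with no coordination overhead), which a real deployment would not fully achieve; `min_parallel_instances` is a hypothetical helper.

```python
import math

def min_parallel_instances(input_fps, per_instance_fps):
    """Smallest number of instances whose combined throughput covers the input rate."""
    return math.ceil(input_fps / per_instance_fps)

input_fps = 50      # camera frame rate stated in the post
pipeline_fps = 10   # approximate single-instance throughput stated in the post

instances_needed = min_parallel_instances(input_fps, pipeline_fps)
# Seconds of processing required per second of video on a single instance
processing_ratio = input_fps / pipeline_fps
```

Under these assumptions, at least five instances would be needed to keep up with the live feed, or a single instance would fall five seconds behind for every second of video.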

Ball path tracking accuracy

The pre-trained CNN model from MMTracking works well, but there are interesting failure cases. The following figure shows a case where the tracker starts on the ball, expands its bounding box to include both the putter head and ball, and then unfortunately tracks the putter head and forgets the ball. In this case, the putter head appears white (possibly due to specular reflection), so the confusion is understandable; labeled data for tracking and fine-tuning of the tracking CNN could help improve this in the future.


Conclusion

In this post, we discussed the development of a modular pipeline that localizes players within a camera’s field of view, determines which player is putting, and tracks the ball as it moves toward the cup.

For more information about AWS collaboration with the PGA TOUR, refer to PGA TOUR tees up with AWS to reimagine the fan experience.

About the Authors

James Golden is an applied scientist at Amazon Bedrock with a background in machine learning and neuroscience.

Henry Wang is an applied scientist at Amazon Generative AI Innovation Center, where he researches and builds generative AI solutions for AWS customers. He focuses on sports and media & entertainment industries, and has worked with various sports leagues, teams and broadcasters in the past. During his spare time, he likes to play tennis and golf.

Tryambak Gangopadhyay is an Applied Scientist at the AWS Generative AI Innovation Center, where he collaborates with organizations across a diverse spectrum of industries. His role involves conducting research and developing Generative AI solutions to address crucial business challenges and accelerate AI adoption.

Build an Amazon SageMaker Model Registry approval and promotion workflow with human intervention

This post is co-written with Jayadeep Pabbisetty, Sr. Specialist Data Engineering at Merck, and Prabakaran Mathaiyan, Sr. ML Engineer at Tiger Analytics.

The large machine learning (ML) model development lifecycle requires a scalable model release process similar to that of software development. Model developers often work together in developing ML models and require a robust MLOps platform to work in. A scalable MLOps platform needs to include a process for handling the workflow of ML model registry, approval, and promotion to the next environment level (development, test, UAT, or production).

A model developer typically starts to work in an individual ML development environment within Amazon SageMaker. When a model is trained and ready to be used, it needs to be approved after being registered in the Amazon SageMaker Model Registry. In this post, we discuss how the AWS AI/ML team collaborated with the Merck Human Health IT MLOps team to build a solution that uses an automated workflow for ML model approval and promotion with human intervention in the middle.

Overview of solution

This post focuses on a workflow solution that the ML model development lifecycle can use between the training pipeline and inferencing pipeline. The solution provides a scalable workflow for MLOps in supporting the ML model approval and promotion process with human intervention. An ML model registered by a data scientist needs an approver to review and approve before it is used for an inference pipeline and in the next environment level (test, UAT, or production). The solution uses AWS Lambda, Amazon API Gateway, Amazon EventBridge, and SageMaker to automate the workflow with human approval intervention in the middle. The following architecture diagram shows the overall system design, the AWS services used, and the workflow for approving and promoting ML models with human intervention from development to production.

Model approver architecture

The workflow includes the following steps:

  1. The training pipeline develops and registers a model in the SageMaker model registry. At this point, the model status is PendingManualApproval.
  2. EventBridge monitors status change events to automatically take actions with simple rules.
  3. The EventBridge model registration event rule invokes a Lambda function that constructs an email with a link to approve or reject the registered model.
  4. The approver gets an email with the link to review and approve or reject the model.
  5. The approver approves the model by following the link in the email to an API Gateway endpoint.
  6. API Gateway invokes a Lambda function to initiate model updates.
  7. The model registry is updated for the model status (Approved for the dev environment, but PendingManualApproval for test, UAT, and production).
  8. The model details, including the model version, approved target environment, and model package, are stored in Parameter Store, a capability of AWS Systems Manager.
  9. The inference pipeline fetches the model approved for the target environment from Parameter Store.
  10. The post-inference notification Lambda function collects batch inference metrics and sends an email to the approver to promote the model to the next environment.


The workflow in this post assumes the environment for the training pipeline is set up in SageMaker, along with other resources. The input to the training pipeline is the features dataset. This post doesn't cover feature generation; it focuses on the registry, approval, and promotion of ML models after they are trained. The model is registered in the model registry and is governed by a monitoring framework in Amazon SageMaker Model Monitor, which detects drift and initiates retraining when drift occurs.

Workflow details

The approval workflow starts with a model developed from a training pipeline. When data scientists develop a model, they register it to the SageMaker Model Registry with the model status of PendingManualApproval. EventBridge monitors SageMaker for the model registration event and triggers an event rule that invokes a Lambda function. The Lambda function dynamically constructs an approval email containing a link to an API Gateway endpoint, which is backed by a second Lambda function. When the approver follows the link to approve the model, API Gateway forwards the approval action to that Lambda function, which updates the SageMaker Model Registry and the model attributes in Parameter Store. The approver must be authenticated and part of the approver group managed by Active Directory. The initial approval marks the model as Approved for dev but PendingManualApproval for test, UAT, and production. The model attributes saved in Parameter Store include the model version, model package, and approved target environment.
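The event rule described above can be sketched as an event pattern. The following is a minimal illustration, not the pattern used by the authors: the source and detail-type values are the ones SageMaker emits for model package state changes, while the group name my-model-group and the local matches helper are hypothetical and exist only so the pattern can be exercised without AWS access.

```python
import json

# Event pattern for model package state changes in SageMaker Model Registry.
# "my-model-group" is a hypothetical placeholder for your model package group.
EVENT_PATTERN = {
    "source": ["aws.sagemaker"],
    "detail-type": ["SageMaker Model Package State Change"],
    "detail": {
        "ModelPackageGroupName": ["my-model-group"],
        "ModelApprovalStatus": ["PendingManualApproval"],
    },
}


def matches(pattern, event):
    """Simplified local matcher for exact-value lists and nested dicts.

    This covers only a small subset of EventBridge matching rules and is
    meant for reasoning about the pattern, not for production use.
    """
    for key, expected in pattern.items():
        if key not in event:
            return False
        if isinstance(expected, dict):
            if not isinstance(event[key], dict) or not matches(expected, event[key]):
                return False
        elif event[key] not in expected:
            return False
    return True


pending_event = {
    "source": "aws.sagemaker",
    "detail-type": "SageMaker Model Package State Change",
    "detail": {
        "ModelPackageGroupName": "my-model-group",
        "ModelApprovalStatus": "PendingManualApproval",
    },
}
approved_event = json.loads(json.dumps(pending_event))  # deep copy
approved_event["detail"]["ModelApprovalStatus"] = "Approved"

print(matches(EVENT_PATTERN, pending_event))
print(matches(EVENT_PATTERN, approved_event))
```

Restricting the pattern to PendingManualApproval keeps the notification Lambda function from firing again when the approval itself changes the model status.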

When an inference pipeline needs to fetch a model, it checks Parameter Store for the latest model version approved for the target environment and gets the inference details. When the inference pipeline is complete, a post-inference notification email is sent to a stakeholder requesting an approval to promote the model to the next environment level. The email has the details about the model and metrics as well as an approval link to an API Gateway endpoint for a Lambda function that updates the model attributes.
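As a rough sketch of the lookup the inference pipeline performs, the following function reads the approved model details for one environment from Parameter Store. The parameter name layout (/mlops/<model group>/<environment>) and the JSON value format are assumptions made for illustration; get_parameter is the standard Boto3 Systems Manager call, and FakeSsmClient is a hypothetical in-memory stand-in so the example runs without AWS credentials.

```python
import json


def get_approved_model(ssm_client, model_group, environment):
    """Return the model details recorded for the given environment.

    Assumes one JSON-valued parameter per environment; the naming scheme
    is hypothetical and should be adapted to your own conventions.
    """
    name = f"/mlops/{model_group}/{environment}"
    response = ssm_client.get_parameter(Name=name)
    return json.loads(response["Parameter"]["Value"])


class FakeSsmClient:
    """In-memory stand-in for boto3.client('ssm'), for local testing only."""

    def __init__(self, values):
        self._values = values

    def get_parameter(self, Name):
        return {"Parameter": {"Name": Name, "Value": self._values[Name]}}


fake_ssm = FakeSsmClient(
    {
        "/mlops/demo-group/dev": json.dumps(
            {
                "model_version": 1,
                "model_package": "arn:aws:sagemaker:us-east-1:111122223333:model-package/demo-group/1",
                "environment": "dev",
            }
        )
    }
)
details = get_approved_model(fake_ssm, "demo-group", "dev")
print(details["model_version"], details["environment"])
```

In a real pipeline you would pass boto3.client("ssm") instead of the fake client.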

The following is the sequence of events and implementation steps for the ML model approval/promotion workflow from model creation to production. The model is promoted from development to test, UAT, and production environments with an explicit human approval in each step.

We start with the training pipeline, which is ready for model development. The model version starts as 0 in SageMaker Model Registry.

model registry version 0

  1. The SageMaker training pipeline develops and registers a model in SageMaker Model Registry. Model version 1 is registered and starts with PendingManualApproval status. The Model Registry metadata has four custom fields for the environments: dev, test, uat, and prod.
  2. EventBridge monitors the SageMaker Model Registry for the status change to automatically take action with simple rules.
  3. The model registration event rule invokes a Lambda function that constructs an email with the link to approve or reject the registered model.
  4. The approver gets an email with the link to review and approve (or reject) the model.
  5. The approver approves the model by following the link to the API Gateway endpoint in the email.
  6. API Gateway invokes the Lambda function to initiate model updates.
  7. The SageMaker Model Registry is updated with the model status.
  8. The model detail information is stored in Parameter Store, including the model version, approved target environment, and model package.
  9. The inference pipeline fetches the model approved for the target environment from Parameter Store.
  10. The post-inference notification Lambda function collects batch inference metrics and sends an email to the approver to promote the model to the next environment.
  11. The approver approves the model promotion to the next level by following the link to the API Gateway endpoint, which triggers the Lambda function to update the SageMaker Model Registry and Parameter Store.

The complete history of the model versioning and approval is saved for review in Parameter Store.
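The update performed by the approval Lambda function might look like the following sketch. It is not the authors' code: update_model_package and put_parameter are real Boto3 calls, but the choice to record the per-environment status in CustomerMetadataProperties, the parameter name, and the RecordingClient test double are assumptions made for illustration.

```python
import json


def approve_model(sm_client, ssm_client, model_package_arn, model_version, environment):
    """Approve a model package for one environment and record the decision."""
    # Update the registry entry; the environment key (dev, test, uat, prod)
    # mirrors the custom metadata fields described in this post.
    sm_client.update_model_package(
        ModelPackageArn=model_package_arn,
        ModelApprovalStatus="Approved",
        CustomerMetadataProperties={environment: "Approved"},
    )
    # Store the approved model details where the inference pipeline can find
    # them. The parameter name is a hypothetical layout.
    ssm_client.put_parameter(
        Name=f"/mlops/approved-model/{environment}",
        Value=json.dumps(
            {
                "model_version": model_version,
                "model_package": model_package_arn,
                "environment": environment,
            }
        ),
        Type="String",
        Overwrite=True,
    )


class RecordingClient:
    """Test double that records every call instead of contacting AWS."""

    def __init__(self):
        self.calls = []

    def __getattr__(self, name):
        def record(**kwargs):
            self.calls.append((name, kwargs))
            return {}

        return record


sm, ssm = RecordingClient(), RecordingClient()
approve_model(sm, ssm, "arn:aws:sagemaker:us-east-1:111122223333:model-package/demo/1", 1, "dev")
print(sm.calls[0][0], ssm.calls[0][0])
```

Promotion to test, UAT, or production would call the same function with a different environment value after the corresponding human approval.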



The large ML model development lifecycle requires a scalable ML model approval process. In this post, we shared an implementation of an ML model registry, approval, and promotion workflow with human intervention using SageMaker Model Registry, EventBridge, API Gateway, and Lambda. If you are considering a scalable ML model development process for your MLOps platform, you can follow the steps in this post to implement a similar workflow.

About the authors

Tom Kim is a Senior Solution Architect at AWS, where he helps his customers achieve their business objectives by developing solutions on AWS. He has extensive experience in enterprise systems architecture and operations across several industries – particularly in Health Care and Life Science. Tom is always learning new technologies that lead to desired business outcome for customers – e.g. AI/ML, GenAI and Data Analytics. He also enjoys traveling to new places and playing new golf courses whenever he can find time.

Shamika Ariyawansa is a Senior AI/ML Solutions Architect in the Healthcare and Life Sciences division at Amazon Web Services (AWS). He specializes in Generative AI, with a focus on Large Language Model (LLM) training, inference optimizations, and MLOps (Machine Learning Operations). He guides customers in embedding advanced Generative AI into their projects, ensuring robust training processes, efficient inference mechanisms, and streamlined MLOps practices for effective and scalable AI solutions. Beyond his professional commitments, Shamika passionately pursues skiing and off-roading adventures.

Jayadeep Pabbisetty is a Senior ML/Data Engineer at Merck, where he designs and develops ETL and MLOps solutions to unlock data science and analytics for the business. He is always enthusiastic about learning new technologies, exploring new avenues, and acquiring the skills necessary to evolve with the ever-changing IT industry. In his spare time, he follows his passion for sports and likes to travel and explore new places.

Prabakaran Mathaiyan is a Senior Machine Learning Engineer at Tiger Analytics LLC, where he helps his customers to achieve their business objectives by providing solutions for the model building, training, validation, monitoring, CICD and improvement of machine learning solutions on AWS. Prabakaran is always learning new technologies that lead to desired business outcome for customers – e.g. AI/ML, GenAI, GPT and LLM. He also enjoys playing cricket whenever he can find time.

Inference Llama 2 models with real-time response streaming using Amazon SageMaker

With the rapid adoption of generative AI applications, these applications need to respond quickly, reducing perceived latency while sustaining high throughput. Foundation models (FMs) are often pre-trained on vast corpora of data, with parameter counts ranging from millions to billions and beyond. Large language models (LLMs) are a type of FM that generates text in response to a user's inference request. Inferencing these models with varying configurations of inference parameters may lead to inconsistent latencies. The inconsistency could be because of the varying number of response tokens you are expecting from the model or the type of accelerator the model is deployed on.

In either case, rather than waiting for the full response, you can adopt the approach of response streaming for your inferences, which sends back chunks of information as soon as they are generated. This creates an interactive experience by allowing you to see partial responses streamed in real time instead of a delayed full response.
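To make the difference between time to first chunk and total response time concrete, here is a small sketch that consumes any iterable of text chunks and records both values. The function name consume_stream and the injectable clock are our own choices for illustration; the fake clock below simply makes the example deterministic.

```python
import time


def consume_stream(chunks, clock=time.perf_counter):
    """Join streamed chunks and measure first-chunk and total latency."""
    start = clock()
    first_chunk_latency = None
    parts = []
    for chunk in chunks:
        if first_chunk_latency is None:
            # The moment the first partial response becomes visible.
            first_chunk_latency = clock() - start
        parts.append(chunk)
    total_latency = clock() - start
    return "".join(parts), first_chunk_latency, total_latency


# Deterministic demo: the fake clock returns 0.0, then 0.2, then 1.5 seconds.
ticks = iter([0.0, 0.2, 1.5])
text, first_latency, total_latency = consume_stream(
    ["Hello", ", ", "world"], clock=lambda: next(ticks)
)
print(text, first_latency, total_latency)
```

With streaming, the user starts reading after the first-chunk latency rather than after the total latency, which is where the reduction in perceived latency comes from.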

Amazon SageMaker real-time inference now supports response streaming, so you can continuously stream inference responses back to the client. This helps you build interactive experiences for generative AI applications such as chatbots, virtual assistants, and music generators. This post shows you how to achieve faster response times, measured as Time to First Byte (TTFB), and reduce the overall perceived latency while inferencing Llama 2 models.

To implement the solution, we use SageMaker, a fully managed service to prepare data and build, train, and deploy machine learning (ML) models for any use case with fully managed infrastructure, tools, and workflows. For more information about the various deployment options SageMaker provides, refer to Amazon SageMaker Model Hosting FAQs. Let’s understand how we can address the latency issues using real-time inference with response streaming.

Solution overview

Because we want to address the aforementioned latencies associated with real-time inference with LLMs, let’s first understand how we can use the response streaming support for real-time inferencing for Llama 2. However, any LLM can take advantage of response streaming support with real-time inferencing.

Llama 2 is a collection of pre-trained and fine-tuned generative text models ranging in scale from 7 billion to 70 billion parameters. Llama 2 models are autoregressive models with a decoder-only architecture. When provided with a prompt and inference parameters, Llama 2 models are capable of generating text responses. These models can be used for translation, summarization, question answering, and chat.

For this post, we deploy the Llama 2 Chat model meta-llama/Llama-2-13b-chat-hf on SageMaker for real-time inferencing with response streaming.

When it comes to deploying models on SageMaker endpoints, you can containerize the models using specialized AWS Deep Learning Container (DLC) images available for popular open source libraries. Llama 2 models are text generation models; you can use either the Hugging Face LLM inference containers on SageMaker powered by Hugging Face Text Generation Inference (TGI) or AWS DLCs for Large Model Inference (LMI).

In this post, we deploy the Llama 2 13B Chat model using DLCs on SageMaker Hosting for real-time inference powered by G5 instances. G5 instances are high-performance GPU-based instances for graphics-intensive applications and ML inference. You can also use the supported instance types p4d, p3, g5, and g4dn with appropriate changes as per the instance configuration.


Prerequisites

To implement this solution, you should have the following:

  • An AWS account with an AWS Identity and Access Management (IAM) role with permissions to manage resources created as part of the solution.
  • If this is your first time working with Amazon SageMaker Studio, you first need to create a SageMaker domain.
  • A Hugging Face account. Sign up with your email if you don’t already have an account.
    • To access the models available on Hugging Face, especially gated models such as Llama, for fine-tuning and inferencing, you need a Hugging Face account to obtain a read access token. After you sign up for your Hugging Face account, log in and create a read access token.
  • Access to Llama 2, using the same email ID that you used to sign up for Hugging Face.
    • The Llama 2 models available via Hugging Face are gated models. The use of the Llama model is governed by the Meta license. To download the model weights and tokenizer, request access to Llama and accept their license.
    • After you’re granted access (typically in a couple of days), you will receive an email confirmation. For this example, we use the model Llama-2-13b-chat-hf, but you should be able to access other variants as well.

Approach 1: Hugging Face TGI

In this section, we show you how to deploy the meta-llama/Llama-2-13b-chat-hf model to a SageMaker real-time endpoint with response streaming using Hugging Face TGI. The following table outlines the specifications for this deployment.

Specification | Value
Container | Hugging Face TGI
Model Name | meta-llama/Llama-2-13b-chat-hf
ML Instance | ml.g5.12xlarge
Inference | Real-time with response streaming

Deploy the model

First, you retrieve the base image for the LLM to be deployed. You then build the model on the base image. Finally, you deploy the model to the ML instance for SageMaker Hosting for real-time inference.

Let’s observe how to achieve the deployment programmatically. For brevity, only the code that helps with the deployment steps is discussed in this section. The full source code for deployment is available in the notebook llama-2-hf-tgi/llama-2-13b-chat-hf/1-deploy-llama-2-13b-chat-hf-tgi-sagemaker.ipynb.

Retrieve the latest Hugging Face LLM DLC powered by TGI via pre-built SageMaker DLCs. You use this image to deploy the meta-llama/Llama-2-13b-chat-hf model on SageMaker. See the following code:

from sagemaker.huggingface import get_huggingface_llm_image_uri

# retrieve the llm image uri (no version is pinned here, so the SDK picks its default)
llm_image = get_huggingface_llm_image_uri("huggingface")

Define the environment for the model with the configuration parameters defined as follows:

import json

instance_type = "ml.g5.12xlarge"
number_of_gpu = 4
config = {
    'HF_MODEL_ID': "meta-llama/Llama-2-13b-chat-hf", # model ID on the Hugging Face Hub
    'SM_NUM_GPUS': json.dumps(number_of_gpu), # Number of GPU used per replica
    'MAX_INPUT_LENGTH': json.dumps(2048),  # Max length of input text
    'MAX_TOTAL_TOKENS': json.dumps(4096),  # Max length of the generation (including input text)
    'MAX_BATCH_TOTAL_TOKENS': json.dumps(8192),  # Limits the number of tokens that can be processed in parallel during the generation
    'HUGGING_FACE_HUB_TOKEN': "<YOUR_HUGGING_FACE_READ_ACCESS_TOKEN>"
}

Replace <YOUR_HUGGING_FACE_READ_ACCESS_TOKEN> for the config parameter HUGGING_FACE_HUB_TOKEN with the value of the token obtained from your Hugging Face profile as detailed in the prerequisites section of this post. In the configuration, you define the number of GPUs used per replica of a model as 4 for SM_NUM_GPUS. Then you can deploy the meta-llama/Llama-2-13b-chat-hf model on an ml.g5.12xlarge instance that comes with 4 GPUs.
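Before deploying, it can help to sanity-check the token limits in the configuration. The following sketch encodes two constraints as we understand them from TGI's behavior: the input limit must be smaller than the total token limit, and the batch budget must be at least the total token limit. The helper name check_tgi_limits is hypothetical and is not part of any SDK.

```python
def check_tgi_limits(max_input_length, max_total_tokens, max_batch_total_tokens):
    """Return a list of problems found in the token limit settings."""
    problems = []
    if max_input_length >= max_total_tokens:
        problems.append("MAX_INPUT_LENGTH must be smaller than MAX_TOTAL_TOKENS")
    if max_batch_total_tokens < max_total_tokens:
        problems.append("MAX_BATCH_TOTAL_TOKENS must be at least MAX_TOTAL_TOKENS")
    return problems


# Values used in this post: 2048 input tokens, 4096 total, 8192 per batch.
problems = check_tgi_limits(2048, 4096, 8192)
max_new_tokens_budget = 4096 - 2048  # room left for generated tokens
print(problems, max_new_tokens_budget)
```

With these settings, a prompt that uses the full input limit still leaves 2,048 tokens for generation, comfortably above the max_new_tokens value of 512 used later in this post.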

Now you can build the instance of HuggingFaceModel with the aforementioned environment configuration:

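from sagemaker.huggingface import HuggingFaceModel
llm_model = HuggingFaceModel(role=role, image_uri=llm_image, env=config)  # role: your SageMaker execution role, assumed to be defined earlier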

Finally, deploy the model by providing arguments to the deploy method available on the model with various parameter values such as endpoint_name, initial_instance_count, and instance_type:

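# initial_instance_count=1 is an assumed value; endpoint_name is assumed to be defined earlier
llm = llm_model.deploy(endpoint_name=endpoint_name, initial_instance_count=1, instance_type=instance_type)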

Perform inference

The Hugging Face TGI DLC comes with the ability to stream responses without any customizations or code changes to the model. You can use invoke_endpoint_with_response_stream if you are using Boto3 or InvokeEndpointWithResponseStream when programming with the SageMaker Python SDK.

The InvokeEndpointWithResponseStream API of SageMaker allows developers to stream responses back from SageMaker models, which can help improve customer satisfaction by reducing the perceived latency. This is especially important for applications built with generative AI models, where immediate processing is more important than waiting for the entire response.

For this example, we use Boto3 to infer the model and use the SageMaker API invoke_endpoint_with_response_stream as follows:

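def get_realtime_response_stream(sagemaker_runtime, endpoint_name, payload):
    response_stream = sagemaker_runtime.invoke_endpoint_with_response_stream(
        EndpointName=endpoint_name, Body=json.dumps(payload),
        ContentType="application/json", CustomAttributes='accept_eula=false'
    )
    return response_stream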

In this example, the CustomAttributes argument is set to accept_eula=false. You must change it to accept_eula=true to successfully obtain a response from the Llama 2 models. After a successful invocation, invoke_endpoint_with_response_stream returns a response stream of bytes.

The following diagram illustrates this workflow.

HF TGI Streaming Architectural Diagram

You need an iterator that loops over the stream of bytes and parses them into readable text. The LineIterator implementation can be found at llama-2-hf-tgi/llama-2-13b-chat-hf/utils/. Now you’re ready to prepare the prompt and instructions to use as a payload while inferencing the model.
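To illustrate what such an iterator has to do, here is a simplified sketch. It is not the LineIterator from the repository. It assumes each stream event has the shape {"PayloadPart": {"Bytes": ...}}, as returned by invoke_endpoint_with_response_stream, and that the TGI container emits lines of the form data:{...} with the generated text under token.text; the second assumption may vary by container version.

```python
import json


def iter_lines(event_stream):
    """Yield complete lines from a stream of PayloadPart events.

    A line can be split across several events, so bytes are buffered
    until a newline arrives.
    """
    buffer = b""
    for event in event_stream:
        part = event.get("PayloadPart")
        if part is None:
            continue  # skip events that carry no payload bytes
        buffer += part["Bytes"]
        while b"\n" in buffer:
            line, buffer = buffer.split(b"\n", 1)
            yield line
    if buffer:
        yield buffer  # trailing data without a final newline


def tgi_token_text(line):
    """Extract the token text from one 'data:{...}' line, if present."""
    if not line.startswith(b"data:"):
        return ""
    data = json.loads(line[len(b"data:"):])
    return data.get("token", {}).get("text", "")


# Fake events: the first JSON document is split across two payload parts.
fake_events = [
    {"PayloadPart": {"Bytes": b'data:{"token": {"text": "Hel'}},
    {"PayloadPart": {"Bytes": b'lo"}}\n\ndata:{"token": {"text": " world"}}\n'}},
]
generated = "".join(tgi_token_text(line) for line in iter_lines(fake_events))
print(generated)
```

With a real endpoint, you would pass the Body of the streaming response to iter_lines in place of fake_events.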

Prepare a prompt and instructions

In this step, you prepare the prompt and instructions for your LLM. To prompt Llama 2, use the following prompt template:

<s>[INST] <<SYS>>
{{ system_prompt }}
<</SYS>>

{{ user_message }} [/INST]

You build the prompt template programmatically defined in the method build_llama2_prompt, which aligns with the aforementioned prompt template. You then define the instructions as per the use case. In this case, we’re instructing the model to generate an email for a marketing campaign as covered in the get_instructions method. The code for these methods is in the llama-2-hf-tgi/llama-2-13b-chat-hf/2-sagemaker-realtime-inference-llama-2-13b-chat-hf-tgi-streaming-response.ipynb notebook. Build the instruction combined with the task to be performed as detailed in user_ask_1 as follows:

user_ask_1 = f'''
AnyCompany recently announced new service launch named AnyCloud Internet Service.
Write a short email about the product launch with Call to action to Alice Smith, whose email is
Mention the Coupon Code: EARLYB1RD to get 20% for 1st 3 months.
'''
instructions = get_instructions(user_ask_1)
prompt = build_llama2_prompt(instructions)

We pass the instructions to build the prompt as per the prompt template generated by build_llama2_prompt.
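For readers who want to see the template applied without opening the notebook, the following is a minimal single-turn sketch. It is not the build_llama2_prompt implementation from the repository; the function name build_single_turn_prompt and the example strings are hypothetical. It fills the standard Llama 2 chat template, in which the system prompt is wrapped in <<SYS>> and <</SYS>> markers inside the first [INST] block.

```python
def build_single_turn_prompt(system_prompt, user_message):
    """Fill the Llama 2 chat template for one system prompt and one user turn."""
    return (
        "<s>[INST] <<SYS>>\n"
        f"{system_prompt.strip()}\n"
        "<</SYS>>\n\n"
        f"{user_message.strip()} [/INST]"
    )


example_prompt = build_single_turn_prompt(
    "You are a helpful assistant that writes marketing emails.",
    "Write a short email announcing a new internet service.",
)
print(example_prompt)
```

Multi-turn conversations need additional handling, because each earlier assistant reply is closed with </s> and followed by a new [INST] block.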

inference_params = {
        "do_sample": True,
        "top_p": 0.6,
        "temperature": 0.9,
        "top_k": 50,
        "max_new_tokens": 512,
        "repetition_penalty": 1.03,
        "stop": ["</s>"],
        "return_full_text": False
}
payload = {
    "inputs":  prompt,
    "parameters": inference_params,
    "stream": True ## <-- to have response stream.
}

We combine the inference parameters with the prompt and set the key stream to True to form the final payload. Send the payload to get_realtime_response_stream, which invokes the endpoint with response streaming:

resp = get_realtime_response_stream(sagemaker_runtime, endpoint_name, payload)

The generated text from the LLM will be streamed to the output as shown in the following animation.

Llama 2 13B Chat Response Streaming - HF TGI

Approach 2: LMI with DJL Serving

In this section, we demonstrate how to deploy the meta-llama/Llama-2-13b-chat-hf model to a SageMaker real-time endpoint with response streaming using LMI with DJL Serving. The following table outlines the specifications for this deployment.

Specification | Value
Container | LMI container image with DJL Serving
Model Name | meta-llama/Llama-2-13b-chat-hf
ML Instance | ml.g5.12xlarge
Inference | Real-time with response streaming

You first download the model and store it in Amazon Simple Storage Service (Amazon S3). You then specify the S3 URI indicating the S3 prefix of the model in the DJL Serving configuration file. Next, you retrieve the base image for the LLM to be deployed. You then build the model on the base image. Finally, you deploy the model to the ML instance for SageMaker Hosting for real-time inference.

Let’s observe how to achieve the aforementioned deployment steps programmatically. For brevity, only the code that helps with the deployment steps is detailed in this section. The full source code for this deployment is available in the notebook llama-2-lmi/llama-2-13b-chat/1-deploy-llama-2-13b-chat-lmi-response-streaming.ipynb.

Download the model snapshot from Hugging Face and upload the model artifacts on Amazon S3

With the aforementioned prerequisites, download the model on the SageMaker notebook instance and then upload it to the S3 bucket for further deployment:

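from huggingface_hub import snapshot_download

model_name = 'meta-llama/Llama-2-13b-chat-hf'
# Only download pytorch checkpoint files
allow_patterns = ["*.json", "*.txt", "*.model", "*.safetensors", "*.bin", "*.chk", "*.pth"]

# Download the model snapshot (local_model_path is assumed to be defined earlier in the notebook)
model_download_path = snapshot_download(
    repo_id=model_name, cache_dir=local_model_path,
    allow_patterns=allow_patterns, token='<YOUR_HUGGING_FACE_READ_ACCESS_TOKEN>'
)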

Note that the model downloads even if you don’t provide a valid access token, but model serving won’t succeed when you deploy such a model. Therefore, we recommend replacing <YOUR_HUGGING_FACE_READ_ACCESS_TOKEN> for the argument token with the value of the token obtained from your Hugging Face profile as detailed in the prerequisites. For this post, we specify the official model name for Llama 2 as identified on Hugging Face, meta-llama/Llama-2-13b-chat-hf. Running the preceding code downloads the uncompressed model to local_model_path.

Upload the files to Amazon S3 and obtain the URI, which is used later in the DJL Serving configuration file.

You package the meta-llama/Llama-2-13b-chat-hf model on the LMI container image with DJL Serving using the configuration specified in that file. Then you deploy the model along with model artifacts packaged on the container image on the SageMaker ML instance ml.g5.12xlarge. You then use this ML instance for SageMaker Hosting for real-time inferencing.

Prepare model artifacts for DJL Serving

Prepare your model artifacts by creating a configuration file:

%%writefile chat_llama2_13b_hf/
engine = MPI

We use the following settings in this configuration file:

  • engine – This specifies the runtime engine for DJL to use. The possible values include Python, DeepSpeed, FasterTransformer, and MPI. In this case, we set it to MPI. Model Parallelization and Inference (MPI) facilitates partitioning the model across all the available GPUs and therefore accelerates inference.
  • option.entryPoint – This option specifies which handler offered by DJL Serving you would like to use. The possible values are djl_python.huggingface, djl_python.deepspeed, and djl_python.stable-diffusion. We use djl_python.huggingface for Hugging Face Accelerate.
  • option.tensor_parallel_degree – This option specifies the number of tensor parallel partitions performed on the model. Set it to the number of GPU devices over which Accelerate needs to partition the model. This parameter also controls the number of workers per model that are started when DJL Serving runs. For example, if you have a 4 GPU machine and create four partitions, you have one worker per model to serve the requests.
  • option.low_cpu_mem_usage – This reduces CPU memory usage when loading models. We recommend that you set this to TRUE.
  • option.rolling_batch – This enables iteration-level batching using one of the supported strategies. Values include auto, scheduler, and lmi-dist. We use lmi-dist for turning on continuous batching for Llama 2.
  • option.max_rolling_batch_size – This limits the number of concurrent requests in the continuous batch. The value defaults to 32.
  • option.model_id – Replace {{model_id}} with the model ID of a pre-trained model hosted inside a model repository on Hugging Face, or with the S3 path to the model artifacts.

More configuration options can be found in Configurations and settings.
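The settings listed above can be assembled into configuration text with a few lines of Python. This is a sketch under stated assumptions: render_properties is a hypothetical helper, tensor_parallel_degree is set to 4 on the assumption of one partition per GPU on a 4 GPU instance, max_rolling_batch_size uses the default of 32 mentioned above, and {{model_id}} is left as the placeholder to replace.

```python
def render_properties(engine, options):
    """Render engine and option.* settings as key=value lines."""
    lines = [f"engine={engine}"]
    lines += [f"option.{key}={value}" for key, value in options.items()]
    return "\n".join(lines) + "\n"


settings = {
    "entryPoint": "djl_python.huggingface",
    "tensor_parallel_degree": 4,       # assumed: one partition per GPU
    "low_cpu_mem_usage": "TRUE",
    "rolling_batch": "lmi-dist",       # continuous batching for Llama 2
    "max_rolling_batch_size": 32,      # default value noted above
    "model_id": "{{model_id}}",        # replace with a Hub ID or S3 path
}
properties_text = render_properties("MPI", settings)
print(properties_text)
```

Generating the text from a dictionary makes it easier to vary a single setting, such as the tensor parallel degree, when you move to a different instance type.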

Because DJL Serving expects the model artifacts to be packaged and formatted in a .tar file, run the following code snippet to compress and upload the .tar file to Amazon S3:

s3_code_prefix = f"{s3_prefix}/code" # folder within bucket where code artifact will go
s3_code_artifact = sess.upload_data("model.tar.gz", bucket, s3_code_prefix)
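The compression step that produces model.tar.gz can be done from Python with the standard tarfile module. The following sketch is one way to do it and is not taken from the notebook: package_model_dir is a hypothetical helper, and the demo uses a temporary directory with a placeholder file named settings.txt instead of real model artifacts.

```python
import tarfile
import tempfile
from pathlib import Path


def package_model_dir(source_dir, archive_path="model.tar.gz"):
    """Create a gzip-compressed tar archive containing source_dir."""
    source = Path(source_dir)
    with tarfile.open(archive_path, "w:gz") as archive:
        # arcname keeps only the directory name, not the full local path
        archive.add(source, arcname=source.name)
    return Path(archive_path)


# Demo with a throwaway directory standing in for the model artifact folder.
with tempfile.TemporaryDirectory() as tmp:
    model_dir = Path(tmp) / "chat_llama2_13b_hf"
    model_dir.mkdir()
    (model_dir / "settings.txt").write_text("engine = MPI\n")
    archive_file = package_model_dir(model_dir, Path(tmp) / "model.tar.gz")
    with tarfile.open(archive_file) as tar:
        members = sorted(tar.getnames())

print(members)
```

After packaging, the archive is what gets uploaded with sess.upload_data.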

Retrieve the latest LMI container image with DJL Serving

Next, you use the DLCs available with SageMaker for LMI to deploy the model. Retrieve the SageMaker image URI for the djl-deepspeed container programmatically using the following code:

from sagemaker import image_uris
inference_image_uri = image_uris.retrieve(
    framework="djl-deepspeed", region=region, version="0.25.0"
)

You can use the aforementioned image to deploy the meta-llama/Llama-2-13b-chat-hf model on SageMaker. Now you can proceed to create the model.

Create the model

You can create the model whose container is built using the inference_image_uri and the model serving code located at the S3 URI indicated by s3_code_artifact:

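from sagemaker.utils import name_from_base

model_name = name_from_base("Llama-2-13b-chat-lmi-streaming")

# role is assumed to be the SageMaker execution role ARN defined earlier
create_model_response = sm_client.create_model(
    ModelName=model_name,
    ExecutionRoleArn=role,
    PrimaryContainer={
        "Image": inference_image_uri,
        "ModelDataUrl": s3_code_artifact,
        "Environment": {"MODEL_LOADING_TIMEOUT": "3600"},
    },
)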

Now you can create the model config with all the details for the endpoint configuration.

Create the model config

Use the following code to create a model config for the model identified by model_name:

endpoint_config_name = f"{model_name}-config"

endpoint_name = name_from_base(model_name)

endpoint_config_response = sm_client.create_endpoint_config(
    EndpointConfigName=endpoint_config_name,
    ProductionVariants=[
        {
            "VariantName": "variant1",
            "ModelName": model_name,
            "InstanceType": "ml.g5.12xlarge",
            "InitialInstanceCount": 1,
            "ModelDataDownloadTimeoutInSeconds": 3600,
            "ContainerStartupHealthCheckTimeoutInSeconds": 3600,
        },
    ],
)

The model config is defined for the ProductionVariants parameter InstanceType for the ML instance ml.g5.12xlarge. You also provide the ModelName using the same name that you used to create the model in the earlier step, thereby establishing a relation between the model and endpoint configuration.

Now that you have defined the model and model config, you can create the SageMaker endpoint.

Create the SageMaker endpoint

Create the endpoint to deploy the model using the following code snippet:

create_endpoint_response = sm_client.create_endpoint(
    EndpointName=endpoint_name, EndpointConfigName=endpoint_config_name
)

You can view the progress of the deployment using the following code snippet:

resp = sm_client.describe_endpoint(EndpointName=endpoint_name)
status = resp["EndpointStatus"]

After the deployment is successful, the endpoint status will be InService. Now that the endpoint is ready, let’s perform inference with response streaming.
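Because endpoint creation takes several minutes, a small polling loop is a convenient way to wait for it. The sketch below is one possible approach: wait_for_endpoint is a hypothetical helper built on the describe_endpoint call shown above, the 30-second interval is an arbitrary choice, and FakeSageMakerClient only simulates status changes so the example runs locally.

```python
import time


def wait_for_endpoint(sm_client, endpoint_name, poll_seconds=30, sleep=time.sleep):
    """Poll the endpoint until it leaves the Creating state; return the final status."""
    while True:
        response = sm_client.describe_endpoint(EndpointName=endpoint_name)
        status = response["EndpointStatus"]
        if status != "Creating":
            return status  # for example InService or Failed
        sleep(poll_seconds)


class FakeSageMakerClient:
    """Simulates an endpoint that becomes InService after two polls."""

    def __init__(self, statuses):
        self._statuses = iter(statuses)

    def describe_endpoint(self, EndpointName):
        return {"EndpointStatus": next(self._statuses)}


fake_client = FakeSageMakerClient(["Creating", "Creating", "InService"])
final_status = wait_for_endpoint(fake_client, "demo-endpoint", sleep=lambda seconds: None)
print(final_status)
```

Boto3 also provides a built-in waiter, sm_client.get_waiter("endpoint_in_service"), which you may prefer if you don't need custom handling between polls.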

Real-time inference with response streaming

As we covered in the earlier approach for Hugging Face TGI, you can use the same method get_realtime_response_stream to invoke response streaming from the SageMaker endpoint. The code for inferencing using the LMI approach is in the llama-2-lmi/llama-2-13b-chat/2-inference-llama-2-13b-chat-lmi-response-streaming.ipynb notebook. The LineIterator implementation is located in llama-2-lmi/utils/. Note that the LineIterator for the Llama 2 Chat model deployed on the LMI container is different from the LineIterator referenced in the Hugging Face TGI section. This LineIterator loops over the byte stream from Llama 2 Chat models inferenced with the LMI container with djl-deepspeed version 0.25.0. The following helper function parses the response stream received from the inference request made via the invoke_endpoint_with_response_stream API:

from utils.LineIterator import LineIterator

def print_response_stream(response_stream):
    event_stream = response_stream.get('Body')
    for line in LineIterator(event_stream):
        print(line, end='')

The preceding method prints the stream of data read by the LineIterator in a human-readable format.

Let’s explore how to prepare the prompt and instructions to use them as a payload while inferencing the model.

Because you’re inferencing the same model in both Hugging Face TGI and LMI, the process of preparing the prompt and instructions is the same. Therefore, you can use the methods get_instructions and build_llama2_prompt for inferencing.

The get_instructions method returns the instructions. Build the instructions combined with the task to be performed as detailed in user_ask_2 as follows:

user_ask_2 = f'''
AnyCompany recently announced new service launch named AnyCloud Streaming Service.
Write a short email about the product launch with Call to action to Alice Smith, whose email is
Mention the Coupon Code: STREAM2DREAM to get 15% for 1st 6 months.
'''

instructions = get_instructions(user_ask_2)
prompt = build_llama2_prompt(instructions)

Pass the instructions to build the prompt as per the prompt template generated by build_llama2_prompt:

inference_params = {
        "do_sample": True,
        "top_p": 0.6,
        "temperature": 0.9,
        "top_k": 50,
        "max_new_tokens": 512,
        "return_full_text": False,
}

payload = {
    "inputs":  prompt,
    "parameters": inference_params
}

We combine the inference parameters with the prompt to form the final payload. Then you send the payload to get_realtime_response_stream, which invokes the endpoint with response streaming:

resp = get_realtime_response_stream(sagemaker_runtime, endpoint_name, payload)

The generated text from the LLM will be streamed to the output as shown in the following animation.

Llama 2 13B Chat Response Streaming - LMI

Clean up

To avoid incurring unnecessary charges, delete the endpoints and their associated resources that were created while running the approaches mentioned in this post. You can do this on the AWS Management Console or, for both deployment approaches, with the following cleanup routine:

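import boto3

sm_client = boto3.client('sagemaker')

# Replace the placeholder with the name of your endpoint
endpoint_name = '<SageMaker_Real-time_Endpoint_Name>'
endpoint = sm_client.describe_endpoint(EndpointName=endpoint_name)
endpoint_config_name = endpoint['EndpointConfigName']
endpoint_config = sm_client.describe_endpoint_config(EndpointConfigName=endpoint_config_name)
model_name = endpoint_config['ProductionVariants'][0]['ModelName']

print(f"""
About to delete the following sagemaker resources:
Endpoint: {endpoint_name}
Endpoint Config: {endpoint_config_name}
Model: {model_name}
""")

# delete endpoint
sm_client.delete_endpoint(EndpointName=endpoint_name)
# delete endpoint config
sm_client.delete_endpoint_config(EndpointConfigName=endpoint_config_name)
# delete model
sm_client.delete_model(ModelName=model_name)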

Replace the <SageMaker_Real-time_Endpoint_Name> placeholder for the variable endpoint_name with the name of your actual endpoint.

For the second approach, we stored the model and code artifacts on Amazon S3. You can clean up the S3 bucket using the following code:

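s3 = boto3.resource('s3')
s3_bucket = s3.Bucket(bucket)
# Assumed completion of the truncated snippet: delete every object in the bucket
s3_bucket.objects.all().delete()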

Conclusion

In this post, we discussed how a varying number of response tokens or a different set of inference parameters can affect the latencies associated with LLMs. We showed how to address the problem with the help of response streaming. We then identified two approaches for deploying and inferencing Llama 2 Chat models using AWS DLCs—LMI and Hugging Face TGI.

You should now understand the importance of response streaming and how it can reduce perceived latency. Without streaming, users must wait until the LLM builds the whole response; with streaming, they see output as soon as the first tokens are generated. Deploying Llama 2 Chat models with response streaming therefore improves the user experience for your customers.

You can refer to the official aws-samples repository amazon-sagemaker-llama2-response-streaming-recipes, which covers deployment for other Llama 2 model variants.


About the Authors

Pavan Kumar Rao Navule is a Solutions Architect at Amazon Web Services. He works with ISVs in India to help them innovate on AWS. He is the published author of the book “Getting Started with V Programming.” He pursued an Executive M.Tech in Data Science from the Indian Institute of Technology (IIT), Hyderabad. He also pursued an Executive MBA in IT specialization from the Indian School of Business Management and Administration, and holds a B.Tech in Electronics and Communication Engineering from the Vaagdevi Institute of Technology and Science. Pavan is an AWS Certified Solutions Architect Professional and holds other certifications such as AWS Certified Machine Learning Specialty, Microsoft Certified Professional (MCP), and Microsoft Certified Technology Specialist (MCTS). He is also an open-source enthusiast. In his free time, he loves to listen to the great magical voices of Sia and Rihanna.

Sudhanshu Hate is a principal AI/ML specialist with AWS and works with clients to advise them on their MLOps and generative AI journey. In his previous role before Amazon, he conceptualized, created, and led teams to build ground-up open source-based AI and gamification platforms, and successfully commercialized them with over 100 clients. Sudhanshu has a couple of patents to his credit, has written two books and several papers and blogs, and has presented his points of view in various technical forums. He has been a thought leader and speaker, and has been in the industry for nearly 25 years. He has worked with Fortune 1000 clients across the globe and most recently with digital native clients in India.

Deploy a Slack gateway for Amazon Q, your business expert

Amazon Q is a new generative AI-powered application that helps users get work done. Amazon Q can become your tailored business expert and let you discover content, brainstorm ideas, or create summaries using your company’s data safely and securely. You can use Amazon Q to have conversations, solve problems, generate content, gain insights, and take action by connecting to your company’s information repositories, code, data, and enterprise systems. For more information, see Introducing Amazon Q, a new generative AI-powered assistant (preview).

In this post, we show you how to bring Amazon Q, your business expert, to users in Slack.

You’ll be able to converse with Amazon Q using Slack direct messages (DMs) to ask questions and get answers based on company data, get help creating new content such as email drafts, summarize attached files, and perform tasks.

You can also invite Amazon Q to participate in your team channels. In a channel, users can ask it questions in a new message, or tag it in an existing thread at any point, to provide additional data points, resolve a debate, or summarize the conversation and capture the next steps.

Solution overview

Amazon Q is amazingly powerful. Check out the following demo—seeing is believing!

In the demo, our Amazon Q application is populated with a set of AWS whitepapers. You can populate your own Amazon Q business expert application with your own company’s documents and knowledge base articles, so it will be able to answer your questions!

Everything you need is provided as open source in our GitHub repo.

In this post, we walk you through the process to deploy Amazon Q in your AWS account and add it to your Slack workspace. When you’re done, you’ll wonder how you ever managed without it!

The following are some of the things it can do:

  • Respond to messages – In DMs, it responds to all messages. In channels, it responds only to @mentions and responds in a conversation thread.
  • Render answers containing markdown – This includes headings, lists, bold, italics, tables, and more.
  • Track sentiment – It provides thumbs up and thumbs down buttons to track user sentiment.
  • Provide source attribution – It provides references and hyperlinks to sources used by Amazon Q.
  • Understand conversation context – It tracks the conversation and responds based on the context.
  • Stay aware of multiple users – When it’s tagged in a thread, it knows who said what, and when, so it can contribute in context and accurately summarize the thread when asked.
  • Process attached files – It can process up to five attached files for document question answering, summaries, and more.
  • Start new conversations – You can reset and start new conversations in DM channels by using /new_conversation.
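The first behavior in the preceding list (answer every DM, but only @mentions in channels) can be sketched as a small routing check. This is an illustration, not the code from the GitHub repo: the function names are hypothetical, and it assumes the handler receives Slack Events API event dictionaries with the standard type, channel_type, bot_id, ts, and thread_ts fields.

```python
def should_respond(event):
    """Return True when the bot should answer this Slack event."""
    if event.get("bot_id"):
        # Ignore messages posted by bots, including the bot's own replies.
        return False
    if event.get("type") == "app_mention":
        # In channels, respond only when the bot is @mentioned.
        return True
    # In direct messages, respond to every message.
    return event.get("type") == "message" and event.get("channel_type") == "im"


def reply_thread_ts(event):
    """Pick the thread to reply in: the existing thread, else start one."""
    return event.get("thread_ts") or event.get("ts")
```

Replying with the thread timestamp keeps channel answers inside a conversation thread rather than in the main channel.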

Slack example

In the following sections, we show how to deploy the project to your own AWS account and Slack workspace, and start experimenting!

Prerequisites

You need to have an AWS account and an AWS Identity and Access Management (IAM) role and user with permissions to create and manage the necessary resources and components for this application. If you don’t have an AWS account, see How do I create and activate a new Amazon Web Services account?

You also need to have an existing, working Amazon Q business expert application. If you haven’t set one up yet, see Creating an Amazon Q application.

Lastly, you need a Slack account and access to create and publish apps to your Slack organization. If you don’t have one, see if your company can create a Slack sandbox organization for you to experiment, or create a free Slack account and workspace.

Deploy the solution resources

We’ve provided pre-built AWS CloudFormation templates that deploy everything you need in your AWS account.

If you’re a developer and you want to build, deploy, or publish the solution from code, refer to the Developer README.

Complete the following steps to launch the CloudFormation stack:

  1. Log in to the AWS Management Console.
  2. Choose one of the following Launch Stack buttons for your desired AWS Region to open the AWS CloudFormation console and create a new stack.
Region Launch Stack
N. Virginia (us-east-1)
Oregon (us-west-2)
  3. For Stack name, enter a name for your app (for example, AMAZON-Q-SLACK-GATEWAY).
  4. For AmazonQAppId, enter your existing Amazon Q application ID (for example, 80xxxxx9-7xx3-4xx0-bxx4-5baxxxxx2af5). You can copy it from the Amazon Q console.
  5. For AmazonQRegion, choose the Region where you created your Amazon Q application (us-east-1 or us-west-2).
  6. For AmazonQUserId, enter an Amazon Q user ID email address (leave blank to use a Slack user email as the user ID).
  7. For ContextDaysToLive, enter the length of time to keep conversation metadata cached in Amazon DynamoDB (you can leave this as the default).

When your CloudFormation stack status is CREATE_COMPLETE, choose the Outputs tab, and keep it open—you’ll need it in later steps.

Create your app

Now you can create your app in Slack. Complete the following steps:

  1. Create a Slack app from the generated manifest: copy and paste the value of the SlackAppManifest stack output.
  2. Choose App Home in the navigation pane and scroll down to the section Show Tabs.
  3. Enable Messages Tab.
  4. Select Allow users to send Slash commands and messages from the messages tab.

This is a required step to enable your user to send messages to your app.

Slack enable messages

Add your app in your workspace

Now you can add your app in your workspace. This is required to generate the bot user OAuth token value that is needed in the next step.

  1. Go to OAuth & Permissions and choose Install to Workspace to generate the OAuth token.
  2. In Slack, go to your workspace.
  3. Choose your workspace name, Settings & administration, and Manage apps.
  4. Choose your newly created app.
  5. In the right pane, choose Open in App Directory.
  6. Choose Open in Slack.

Configure Slack secrets in AWS Secrets Manager

Let’s configure your Slack secrets in order to verify the signature of each request and post on behalf of your Amazon Q bot.

In this example, we are not enabling Slack token rotation. You can enable it for a production app by implementing rotation via AWS Secrets Manager. Create an issue (or, better yet, a pull request) in the GitHub repo if you want this feature added to a future version.

Complete the following steps to configure a secret in Secrets Manager:

  1. On the AWS CloudFormation console, navigate to your stack Outputs tab and choose the link for SlackSecretConsoleUrl to be redirected to the Secrets Manager console.
  2. Choose Retrieve secret value.
  3. Choose Edit.
  4. Replace the values of SlackSigningSecret and SlackBotUserOAuthToken using the values in the Slack application configuration under Basic Information and OAuth & Permissions.

Be careful you don’t accidentally copy Client Secret instead of Signing Secret.

Edit secrets
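To show why the signing secret matters, the following is a minimal sketch of how a request signature can be verified using Slack's documented v0 signing scheme (an HMAC-SHA256 over the version, timestamp, and raw request body). The function name and the five-minute tolerance are assumptions for illustration, and this is not necessarily how the project's own Lambda function implements the check.

```python
import hashlib
import hmac
import time


def verify_slack_signature(signing_secret, timestamp, body, signature,
                           now=None, tolerance_seconds=300):
    """Return True if the signature matches the request and is recent."""
    now = time.time() if now is None else now
    try:
        request_time = int(timestamp)
    except (TypeError, ValueError):
        return False
    # Reject old requests to limit replay attacks.
    if abs(now - request_time) > tolerance_seconds:
        return False
    base_string = f"v0:{timestamp}:{body}".encode("utf-8")
    digest = hmac.new(signing_secret.encode("utf-8"), base_string,
                      hashlib.sha256).hexdigest()
    # Constant-time comparison avoids leaking information through timing.
    return hmac.compare_digest("v0=" + digest, signature)
```

In a request handler, timestamp and signature would come from the X-Slack-Request-Timestamp and X-Slack-Signature headers, and signing_secret from the SlackSigningSecret value stored in Secrets Manager.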

Start using Amazon Q

Complete the following steps to start using Amazon Q in Slack:

  1. Open your Slack workspace.
  2. Under Apps, Manage, add your new Amazon Q app.
  3. Optionally, add your Amazon Q app to team channels.
  4. In the app DM channel, enter Hello.

Say hello

You have now deployed a powerful new AI assistant into your sandbox Slack environment.

Play with it, try all the features discussed in this post, and copy the things you saw in the demo video. Most importantly, you can ask about topics related to the documents that you have ingested into your own Amazon Q business expert application. But don’t stop there. You can find additional ways to make it useful, and when you do, let us know by posting a comment.

Once you are convinced how useful it is, talk to your Slack admins (and show them this post) and work with them to deploy it in your company’s Slack workspaces. Your fellow employees will thank you!

Clean up

When you’re finished experimenting with this solution, delete your app in Slack and clean up your AWS resources by opening the AWS CloudFormation console and deleting the AMAZON-Q-SLACK-GATEWAY stack that you deployed. This deletes the resources that you created by deploying the solution.

Conclusion

The sample Amazon Q Slack application discussed in this post is provided as open source. You can use it as a starting point for your own solution, and help us make it better by contributing back fixes and features via GitHub pull requests. Explore the code, choose Watch in the GitHub repo to be notified of new releases, and check back for the latest updates. We’d also love to hear your suggestions for improvements and features.

For more information on Amazon Q, refer to What is Amazon Q (For Business Use)?

About the Authors

Gary Benattar is a Senior Software Development Manager in AWS HR. Gary started at Amazon in 2012 as an intern, focusing on building scalable, real-time outlier detection systems. He worked in Seattle and Luxembourg and is now based in Tel Aviv, Israel, where he dedicates his time to building software to revolutionize the future of Human Resources. He co-founded a startup, Zengo, with a focus on making digital wallets secure through multi-party computation. He received his MSc in Software Engineering from Sorbonne University in Paris.

Bob Strahan is a Principal Solutions Architect in the AWS Language AI Services team.

Create a document lake using large-scale text extraction from documents with Amazon Textract

AWS customers in healthcare, financial services, the public sector, and other industries store billions of documents as images or PDFs in Amazon Simple Storage Service (Amazon S3). However, they can’t use the information locked in those documents, for example to feed large language models (LLMs) or power search, until they extract the text, forms, tables, and other structured data. With AWS intelligent document processing (IDP) using AI services such as Amazon Textract, you can take advantage of industry-leading machine learning (ML) technology to quickly and accurately process data from PDFs or document images (TIFF, JPEG, PNG). After the text is extracted from the documents, you can use it to fine-tune a foundation model, summarize the data using a foundation model, or send it to a database.

In this post, we focus on processing a large collection of documents into raw text files and storing them in Amazon S3. We provide you with two different solutions for this use case. The first allows you to run a Python script from any server or instance including a Jupyter notebook; this is the quickest way to get started. The second approach is a turnkey deployment of various infrastructure components using AWS Cloud Development Kit (AWS CDK) constructs. The AWS CDK construct provides a resilient and flexible framework to process your documents and build an end-to-end IDP pipeline. Through the use of the AWS CDK, you can extend its functionality to include redaction, store the output in Amazon OpenSearch, or add a custom AWS Lambda function with your own business logic.

Both of these solutions allow you to quickly process many millions of pages. Before running either of these solutions at scale, we recommend testing with a subset of your documents to make sure the results meet your expectations. In the following sections, we first describe the script solution, followed by the AWS CDK construct solution.

Solution 1: Use a Python script

This solution processes documents for raw text through Amazon Textract as quickly as the service will allow with the expectation that if there is a failure in the script, the process will pick up from where it left off. The solution utilizes three different services: Amazon S3, Amazon DynamoDB, and Amazon Textract.

The following diagram illustrates the sequence of events within the script. When the script ends, a completion status along with the time taken is returned to the SageMaker Studio console.


We have packaged this solution as both an .ipynb notebook and a .py script. You can use either one, depending on your requirements.


To run this script from a Jupyter notebook, the AWS Identity and Access Management (IAM) role assigned to the notebook must have permissions that allow it to interact with DynamoDB, Amazon S3, and Amazon Textract. The general guidance is to provide least-privilege permissions for each of these services to your AmazonSageMaker-ExecutionRole role. To learn more, refer to Get started with AWS managed policies and move toward least-privilege permissions.

Alternatively, you can run this script from other environments such as an Amazon Elastic Compute Cloud (Amazon EC2) instance or a container that you manage, provided that Python, Pip3, and the AWS SDK for Python (Boto3) are installed. Again, the same IAM policies need to be applied to allow the script to interact with the various managed services.


To implement this solution, you first need to clone the GitHub repository.

You need to set the following variables in the script before you can run it:

  • _tracking_table – This is the name of the DynamoDB table that will be created.
  • _input_bucket – This is your source location in Amazon S3 that contains the documents that you want to send to Amazon Textract for text detection. For this variable, provide the name of the bucket, such as mybucket.
  • _output_bucket – This is the location where you want Amazon Textract to write the results. For this variable, provide the name of the bucket, such as myoutputbucket.
  • _input_prefix (optional) – If you want to select certain files from within a folder in your S3 bucket, you can specify this folder name as the input prefix. Otherwise, leave the default as empty to select all.

The script is as follows:

_tracking_table = "Table_Name_for_storing_s3ObjectNames"
_input_bucket = "your_files_are_here"
_output_bucket = "Amazon Textract_writes_JSON_containing_raw_text_to_here"

The following DynamoDB table schema gets created when the script is run:

Table              Table_Name_for_storing_s3ObjectNames
Partition Key       objectName (String)
                    bucketName (String)
                    createdDate (Decimal)
                    outputbucketName (String)
                    txJobId (String)

When the script is run for the first time, it checks whether the DynamoDB table exists and automatically creates it if needed. After the table is created, it must be populated with a list of document object references from Amazon S3 that you want to process. By design, the script enumerates the objects in the specified _input_bucket and automatically populates the table with their names when run. It takes approximately 10 minutes to enumerate 100,000 documents and populate those names into the DynamoDB table. If you have millions of objects in a bucket, you could alternatively use the inventory feature of Amazon S3, which generates a CSV file of names, populate the DynamoDB table from this list in advance with your own script, and skip the fetchAllObjectsInBucketandStoreName function by commenting it out. To learn more, refer to Configuring Amazon S3 Inventory.
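The enumeration step described above can be sketched as follows. This is an illustration under stated assumptions rather than the script's own fetchAllObjectsInBucketandStoreName function: the helper names iter_object_rows and populate_table are hypothetical, and the S3 client and DynamoDB table are passed in so the logic stays easy to test.

```python
def iter_object_rows(s3_client, bucket, prefix=""):
    """Yield one tracking row per object found under the bucket and prefix."""
    # list_objects_v2 returns at most 1,000 keys per call, so paginate.
    paginator = s3_client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            yield {"objectName": obj["Key"], "bucketName": bucket}


def populate_table(table, rows):
    """Write rows to the tracking table in batches and return the row count."""
    count = 0
    # batch_writer groups put requests and retries unprocessed items.
    with table.batch_writer() as batch:
        for row in rows:
            batch.put_item(Item=row)
            count += 1
    return count


# Example wiring (requires Boto3 and AWS credentials):
#   import boto3
#   s3_client = boto3.client("s3")
#   table = boto3.resource("dynamodb").Table("Table_Name_for_storing_s3ObjectNames")
#   populate_table(table, iter_object_rows(s3_client, "your_files_are_here"))
```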

As mentioned earlier, there is both a notebook version and a Python script version. The notebook is the most straightforward way to get started; simply run each cell from start to finish.

If you decide to run the Python script from a CLI, it is recommended that you use a terminal multiplexer such as tmux. This is to prevent the script from stopping should your SSH session finish. For example: tmux new -d ‘python3’.

The following is the script’s entry point; from here you can comment out methods not needed:

"""Main entry point into script --- Start Here"""
if __name__ == "__main__":    
    now = time.perf_counter()

The following fields are set when the script is populating the DynamoDB table:

  • objectName – The name of the document located in Amazon S3 that will be sent to Amazon Textract
  • bucketName – The bucket where the document object is stored

These two fields must be populated if you decide to use a CSV file from the S3 inventory report and skip the auto populating that happens within the script.

Now that the table is created and populated with the document object references, the script is ready to start calling the Amazon Textract StartDocumentTextDetection API. Amazon Textract, like other managed services, has a default quota on API calls, measured in transactions per second (TPS). If required, you can request a quota increase from the Amazon Textract console. The code uses multiple threads concurrently when calling Amazon Textract to maximize throughput with the service. You can change the number of threads by modifying the threadCountforTextractAPICall variable, which is set to 20 by default.

The script initially reads 200 rows from the DynamoDB table and stores them in an in-memory list that is wrapped with a class for thread safety. Each caller thread is then started and runs within its own swim lane. The Amazon Textract caller thread retrieves an item containing an object reference from the in-memory list, calls the asynchronous start_document_text_detection API, and waits for the acknowledgement with the job ID. The job ID is then written back to the DynamoDB row for that object, and the thread repeats by retrieving the next item from the list.

The following is an excerpt of the main orchestration code:

# Excerpt of the orchestration loop; elided steps are marked with "..."
while len(results) > 0:
    for record in results:
        ...  # put these records into our thread-safe list
    # create our threads for processing Amazon Textract
    for i in range(threadCountforTextractAPICall):
        threadsforTextractAPI = threading.Thread(name="Thread - " + str(i), target=procestTextractFunction, args=(fileList,))
        ...

The caller threads continue repeating until there are no more items in the list, at which point each thread stops. When all threads operating within their swim lanes have stopped, the next 200 rows from DynamoDB are retrieved and a new set of 20 threads is started. The whole process repeats until every row that doesn’t contain a job ID has been retrieved from DynamoDB and updated. If the script crashes due to an unexpected problem, you can run it again from the orchestrate() method, which makes sure that the threads continue processing rows that contain empty job IDs. Note that when rerunning the orchestrate() method after the script has stopped, a few documents might get sent to Amazon Textract again. This number will be equal to or less than the number of threads that were running at the time of the crash.
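The batch-and-threads pattern described above can be sketched in a self-contained form. This is a simplified illustration, not the script's code: ThreadSafeList and run_batch are hypothetical names, and start_job stands in for whatever function submits a document to Amazon Textract and returns its job ID.

```python
import threading


class ThreadSafeList:
    """A list whose pop operation is guarded by a lock."""

    def __init__(self, items=()):
        self._items = list(items)
        self._lock = threading.Lock()

    def pop(self):
        """Remove and return one item, or None when the list is empty."""
        with self._lock:
            return self._items.pop() if self._items else None


def run_batch(records, start_job, thread_count=20):
    """Process one batch of records with a fixed number of caller threads."""
    work = ThreadSafeList(records)
    job_ids = {}
    job_ids_lock = threading.Lock()

    def caller():
        # Each thread keeps pulling items until the shared list is empty.
        while (record := work.pop()) is not None:
            job_id = start_job(record)
            with job_ids_lock:
                job_ids[record["objectName"]] = job_id

    threads = [threading.Thread(name=f"Thread - {i}", target=caller)
               for i in range(thread_count)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()  # wait for the whole batch before fetching the next rows
    return job_ids
```

With Boto3, start_job could wrap the Amazon Textract start_document_text_detection call and return the JobId from its response. The caller would then write each job ID back to DynamoDB before requesting the next batch of rows.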

When there are no more rows containing a blank job ID in the DynamoDB table, the script stops. The JSON output from Amazon Textract for all the objects is written to the _output_bucket, by default under the textract_output folder. Each subfolder within textract_output is named with the job ID that was stored in the DynamoDB table for that object. Within the job ID folder, you will find the JSON output, which is numerically named starting at 1 and can span additional JSON files labeled 2, 3, and so on. Output spans multiple JSON files for dense or multi-page documents, where the amount of content extracted exceeds the Amazon Textract default JSON size of 1,000 blocks. Refer to Block for more information on blocks. These JSON files contain all the Amazon Textract metadata, including the text that was extracted from the documents.
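As a sketch of how a downstream step might consume this output, the following helpers order the numbered JSON files of one job and pull the detected lines of text out of their blocks. The function names are hypothetical. The sketch assumes each file has already been downloaded and parsed into a dictionary with a Blocks list, and that any non-numeric file in the job folder can be ignored.

```python
def sort_part_keys(keys):
    """Order S3 keys such as 'textract_output/<job-id>/2' numerically."""
    # Keep only keys whose file name is a number, then sort by that number
    # so that 10 comes after 2 (a plain string sort would not do this).
    numbered = [key for key in keys if key.rsplit("/", 1)[-1].isdigit()]
    return sorted(numbered, key=lambda key: int(key.rsplit("/", 1)[-1]))


def extract_lines(parts):
    """Join the text of all LINE blocks across the parsed JSON parts."""
    lines = []
    for part in parts:
        for block in part.get("Blocks", []):
            if block.get("BlockType") == "LINE":
                lines.append(block["Text"])
    return "\n".join(lines)
```

Reading LINE blocks rather than WORD blocks keeps the text grouped roughly as it appears on the page, which is usually what you want for raw text files.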

You can find the Python code notebook version and script for this solution in GitHub.

Clean up

When the Python script is complete, you can save costs by shutting down or stopping the Amazon SageMaker Studio notebook or container that you spun up.

Now on to our second solution for documents at scale.

Solution 2: Use a serverless AWS CDK construct

This solution uses AWS Step Functions and Lambda functions to orchestrate the IDP pipeline. We use the IDP AWS CDK constructs, which make it straightforward to work with Amazon Textract at scale. Additionally, we use a Step Functions distributed map to iterate over all the files in the S3 bucket and initiate processing. The first Lambda function determines how many pages each document has. This enables the pipeline to automatically use either the synchronous (for single-page documents) or asynchronous (for multi-page documents) API. When using the asynchronous API, an additional Lambda function is called to combine all the JSON files that Amazon Textract produces for your pages into one JSON file, which makes it straightforward for your downstream applications to work with the information.
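The two decisions described above, choosing an API by page count and merging the asynchronous output, can be sketched as plain functions. These are illustrative stand-ins with hypothetical names, not the Lambda functions shipped with the CDK constructs, and the merge assumes each part carries the document's total page count in DocumentMetadata.

```python
def choose_textract_api(page_count):
    """Pick the Amazon Textract text detection API for a document."""
    if page_count < 1:
        raise ValueError("page_count must be at least 1")
    # Single-page documents can use the synchronous API; multi-page
    # documents go through the asynchronous one.
    if page_count == 1:
        return "DetectDocumentText"
    return "StartDocumentTextDetection"


def merge_textract_parts(parts):
    """Combine paginated result parts into a single result dictionary."""
    merged = {"DocumentMetadata": {"Pages": 0}, "Blocks": []}
    for part in parts:
        merged["Blocks"].extend(part.get("Blocks", []))
        pages = part.get("DocumentMetadata", {}).get("Pages", 0)
        # Assumption: every part reports the total page count, so keep the max.
        merged["DocumentMetadata"]["Pages"] = max(
            merged["DocumentMetadata"]["Pages"], pages)
    return merged
```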

This solution also contains two additional Lambda functions. The first function parses the text from the JSON and saves it as a text file in Amazon S3. The second function analyzes the JSON and stores metrics on the workload.

The following diagram illustrates the Step Functions workflow.



This code base uses the AWS CDK and requires Docker. You can deploy this from an AWS Cloud9 instance, which has the AWS CDK and Docker already set up.


To implement this solution, you first need to clone the repository.

After you clone the repository, install the dependencies:

pip install -r requirements.txt

Then use the following code to deploy the AWS CDK stack:

cdk bootstrap
cdk deploy --parameters SourceBucket=<Source Bucket> SourcePrefix=<Source Prefix>

You must provide both the source bucket and source prefix (the location of the files you want to process) for this solution.

When the deployment is complete, navigate to the Step Functions console, where you should see the state machine ServerlessIDPArchivePipeline.


Open the state machine details page and on the Executions tab, choose Start execution.


Choose Start execution again to run the state machine.


After you start the state machine, you can monitor the pipeline by looking at the map run. You will see an Item processing status section like the following screenshot. As you can see, this is built to run and track what was successful and what failed. This process will continue to run until all documents have been read.


With this solution, you should be able to process millions of files in your AWS account without worrying about how to properly determine which files to send to which API or corrupt files failing your pipeline. Through the Step Functions console, you will be able to watch and monitor your files in real time.

Clean up

After your pipeline is finished running, to clean up, you can go back into your project and enter the following command:

cdk destroy

This will delete any services that were deployed for this project.

Conclusion

In this post, we presented a solution that makes it straightforward to convert your document images and PDFs to text files. This is a key prerequisite to using your documents for generative AI and search. To learn more about using text to train or fine-tune your foundation models, refer to Fine-tune Llama 2 for text generation on Amazon SageMaker JumpStart. To use with search, refer to Implement smart document search index with Amazon Textract and Amazon OpenSearch. To learn more about advanced document processing capabilities offered by AWS AI services, refer to Guidance for Intelligent Document Processing on AWS.

About the Authors

Tim Condello is a senior artificial intelligence (AI) and machine learning (ML) specialist solutions architect at Amazon Web Services (AWS). His focus is natural language processing and computer vision. Tim enjoys taking customer ideas and turning them into scalable solutions.

David Girling is a senior AI/ML solutions architect with over twenty years of experience in designing, leading and developing enterprise systems. David is part of a specialist team that focuses on helping customers learn, innovate and utilize these highly capable services with their data for their use cases.

Modernizing data science lifecycle management with AWS and Wipro

This post was written in collaboration with Bhajandeep Singh and Ajay Vishwakarma from Wipro’s AWS AI/ML Practice.

Many organizations have been using a combination of on-premises and open source data science solutions to create and manage machine learning (ML) models.

Data science and DevOps teams may face challenges managing these isolated tool stacks and systems. Integrating multiple tool stacks to build a compact solution might involve building custom connectors or workflows. Managing different dependencies based on the current version of each stack and maintaining those dependencies with the release of new updates of each stack complicates the solution. This increases the cost of infrastructure maintenance and hampers productivity.

Artificial intelligence (AI) and machine learning (ML) offerings from Amazon Web Services (AWS), along with integrated monitoring and notification services, help organizations achieve the required level of automation, scalability, and model quality at optimal cost. AWS also helps data science and DevOps teams to collaborate and streamlines the overall model lifecycle process.

The AWS portfolio of ML services includes a robust set of services that you can use to accelerate the development, training, and deployment of machine learning applications. The suite of services can be used to support the complete model lifecycle including monitoring and retraining ML models.

In this post, we discuss model development and MLOps framework implementation for one of Wipro’s customers that uses Amazon SageMaker and other AWS services.

Wipro is an AWS Premier Tier Services Partner and Managed Service Provider (MSP). Its AI/ML solutions drive enhanced operational efficiency, productivity, and customer experience for many of their enterprise clients.

Current challenges

Let’s first understand a few of the challenges the customer’s data science and DevOps teams faced with their current setup. We can then examine how the integrated SageMaker AI/ML offerings helped solve those challenges.

  • Collaboration – Data scientists each worked on their own local Jupyter notebooks to create and train ML models. They lacked an effective method for sharing and collaborating with other data scientists.
  • Scalability – Training and re-training ML models was taking more and more time as models became more complex while the allocated infrastructure capacity remained static.
  • MLOps – Model monitoring and ongoing governance weren’t tightly integrated and automated with the ML models. Integrating third-party tools into the MLOps pipeline introduced dependencies and complexities.
  • Reusability – Without reusable MLOps frameworks, each model had to be developed and governed separately, which added to the overall effort and delayed model operationalization.

This diagram summarizes the challenges and how Wipro’s implementation on SageMaker addressed them with built-in SageMaker services and offerings.

Figure 1 – SageMaker offerings for ML workload migration

Wipro defined an architecture that addresses the challenges in a cost-optimized and fully automated way.

The following is the use case and model used to build the solution:

  • Use case: Price prediction based on the used car dataset
  • Problem type: Regression
  • Models used: XGBoost and Linear Learner (SageMaker built-in algorithms)

Solution architecture

Wipro consultants conducted a deep-dive discovery workshop with the customer’s data science, DevOps, and data engineering teams to understand the current environment as well as their requirements and expectations for a modern solution on AWS. By the end of the consulting engagement, the team had implemented the following architecture that effectively addressed the core requirements of the customer team, including:

Code Sharing – SageMaker notebooks enable data scientists to experiment and share code with other team members. Wipro further accelerated their ML model journey by implementing Wipro’s code accelerators and snippets to expedite feature engineering, model training, model deployment, and pipeline creation.

Continuous integration and continuous delivery (CI/CD) pipeline – Using the customer’s GitHub repository enabled code versioning and automated scripts to launch pipeline deployment whenever new versions of the code are committed.

MLOps – The architecture implements a SageMaker model monitoring pipeline for continuous model quality governance by validating data and model drift as required by the defined schedule. Whenever drift is detected, an event is launched to notify the respective teams to take action or initiate model retraining.

Event-driven architecture – The pipelines for model training, model deployment, and model monitoring are integrated using Amazon EventBridge, a serverless event bus. When defined events occur, EventBridge can invoke a pipeline to run in response. This provides a loosely coupled set of pipelines that can run as needed in response to the environment.

Figure 2 – Event Driven MLOps architecture with SageMaker
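As a minimal sketch of the event-driven wiring described above, the following Python builds an EventBridge event pattern that matches new objects landing under an S3 prefix. The bucket name and prefix are hypothetical, and the sketch assumes that EventBridge notifications are enabled on the bucket; attaching the rule to a Step Functions target is left out.

```python
import json


def s3_object_created_pattern(bucket: str, key_prefix: str) -> dict:
    """Build an EventBridge event pattern for new S3 objects under a prefix."""
    return {
        "source": ["aws.s3"],
        "detail-type": ["Object Created"],
        "detail": {
            "bucket": {"name": [bucket]},
            "object": {"key": [{"prefix": key_prefix}]},
        },
    }


# Hypothetical bucket and prefix, used only for illustration.
pattern = s3_object_created_pattern("used-car-training-data", "incoming/")
print(json.dumps(pattern, indent=2))
```

A rule created with this pattern would fire once per uploaded object, so a pipeline that expects a complete batch may need a marker file or a narrower prefix.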

Solution components

This section describes the various solution components of the architecture.

Experiment notebooks

  • Purpose: The customer’s data science team wanted to experiment with various datasets and multiple models to come up with the optimal features, using those as further inputs to the automated pipeline.
  • Solution: Wipro created SageMaker experiment notebooks with code snippets for each reusable step, such as reading and writing data, model feature engineering, model training, and hyperparameter tuning. Feature engineering tasks can also be prepared in Data Wrangler, but the client specifically asked for SageMaker processing jobs and AWS Step Functions because they were more comfortable using those technologies. We used the AWS Step Functions Data Science SDK to create a step function (for flow testing) directly from the notebook instance to enable well-defined inputs for the pipelines. This has helped the data science team create and test pipelines at a much faster pace.

Automated training pipeline

  • Purpose: To enable an automated training and re-training pipeline with configurable parameters such as instance type, hyperparameters, and an Amazon Simple Storage Service (Amazon S3) bucket location. The pipeline should also be launched by the data push event to S3.
  • Solution: Wipro implemented a reusable training pipeline using the Step Functions SDK, SageMaker processing, training jobs, a SageMaker model monitor container for baseline generation, AWS Lambda, and EventBridge services. Using AWS event-driven architecture, the pipeline is configured to launch automatically based on a new data event being pushed to the mapped S3 bucket. Notifications are configured to be sent to the defined email addresses. At a high level, the training flow looks like the following diagram:

Figure 3 – Training pipeline step machine.

Flow description for the automated training pipeline

The preceding diagram shows an automated training pipeline built using Step Functions, Lambda, and SageMaker. It’s a reusable pipeline for setting up automated model training, generating predictions, creating a baseline for model monitoring and data monitoring, and creating and updating an endpoint based on the previous model threshold value.

  1. Pre-processing: This step takes data from an Amazon S3 location as input and uses the SageMaker SKLearn container to perform necessary feature engineering and data pre-processing tasks, such as the train, test, and validate split.
  2. Model training: Using the SageMaker SDK, this step runs training code with the respective model image and trains datasets from pre-processing scripts while generating the trained model artifacts.
  3. Save model: This step creates a model from the trained model artifacts. The model name is stored for reference in another pipeline using the AWS Systems Manager Parameter Store.
  4. Query training results: This step calls the Lambda function to fetch the metrics of the completed training job from the earlier model training step.
  5. RMSE threshold: This step verifies the trained model metric (RMSE) against a defined threshold to decide whether to proceed towards endpoint deployment or reject this model.
  6. Model accuracy too low: At this step, the model accuracy is checked against the previous best model. If the model fails this metric validation, a Lambda function sends a notification to the target topic registered in Amazon Simple Notification Service (Amazon SNS), and the flow exits because the newly trained model didn’t meet the defined threshold.
  7. Baseline job data drift: If the trained model passes the validation steps, baseline stats are generated for this trained model version to enable monitoring and the parallel branch steps are run to generate the baseline for the model quality check.
  8. Create model endpoint configuration: This step creates an endpoint configuration for the model evaluated in the previous step, with data capture enabled.
  9. Check endpoint: This step checks if the endpoint exists or needs to be created. Based on the output, the next step is to create or update the endpoint.
  10. Export configuration: This step exports the model name, endpoint name, and endpoint configuration parameters to the AWS Systems Manager Parameter Store.

Alerts and notifications are configured to be sent to the configured SNS topic email on the failure or success of state machine status change. The same pipeline configuration is reused for the XGBoost model.
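The RMSE threshold step in the flow above can be pictured as a Choice state in the Amazon States Language. The following is a minimal sketch, not the pipeline's actual definition: the `$.rmse` input field and the threshold value are assumptions, and the state names are borrowed from the step names in the list. The small `next_state` helper only mimics this single rule locally so the branching can be checked without deploying anything.

```python
def rmse_gate(threshold: float, deploy_state: str, reject_state: str) -> dict:
    """Build a Choice state that continues only when RMSE is below the threshold."""
    return {
        "Type": "Choice",
        "Choices": [
            {
                "Variable": "$.rmse",  # assumed location of the metric in the state input
                "NumericLessThan": threshold,
                "Next": deploy_state,
            }
        ],
        "Default": reject_state,
    }


def next_state(gate: dict, metrics: dict) -> str:
    """Evaluate the single rule of the gate locally (lower RMSE is better)."""
    rule = gate["Choices"][0]
    field = rule["Variable"][2:]  # strip the "$." JSONPath prefix
    if metrics[field] < rule["NumericLessThan"]:
        return rule["Next"]
    return gate["Default"]


gate = rmse_gate(1.0, "Baseline job data drift", "Model accuracy too low")
print(next_state(gate, {"rmse": 0.8}))
print(next_state(gate, {"rmse": 1.5}))
```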

Automated batch scoring pipeline

  • Purpose: Launch batch scoring as soon as scoring input batch data is available in the respective Amazon S3 location. The batch scoring should use the latest registered model to do the scoring.
  • Solution: Wipro implemented a reusable scoring pipeline using the Step Functions SDK, SageMaker batch transformation jobs, Lambda, and EventBridge. The pipeline is launched automatically when new scoring batch data becomes available in the respective S3 location.

Figure 4 – Scoring pipeline step machine for linear learner and XGBoost model

Flow description for the automated batch scoring pipeline:

  1. Pre-processing: This step takes a data file from the respective S3 location as input and performs the required pre-processing before calling the SageMaker batch transformation job.
  2. Scoring: This step runs the batch transformation job to generate inferences, calling the latest version of the registered model and storing the scoring output in an S3 bucket. Wipro used the input filter and join functionality of the SageMaker batch transformation API, which helped enrich the scoring data for better decision making.

Figure 5 – Input filter and join flow for batch transformation

  3. In this step, the state machine pipeline is launched by a new data file in the S3 bucket.

The notification is configured to be sent to the configured SNS topic email on the failure/success of the state machine status change.
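To make the input filter and join behavior concrete, the following sketch pairs a `DataProcessing` fragment of the kind accepted by a batch transform job with a plain Python simulation of what those three settings do to one CSV row. The column layout (an ID in the first column, features after it) and the `predict` stand-in are assumptions for illustration, not the customer's data or model.

```python
# DataProcessing settings for a batch transform job (JSONPath expressions).
DATA_PROCESSING = {
    "InputFilter": "$[1:]",    # send every column except the first (the ID) to the model
    "JoinSource": "Input",     # append the prediction to the original input row
    "OutputFilter": "$[0,-1]", # keep only the ID and the prediction in the output
}


def score_with_join(rows, predict):
    """Simulate filter, join, and output filter locally for CSV-like rows."""
    results = []
    for row in rows:
        features = row[1:]                   # InputFilter "$[1:]"
        joined = row + [predict(features)]   # JoinSource "Input"
        results.append([joined[0], joined[-1]])  # OutputFilter "$[0,-1]"
    return results


# Hypothetical rows and a stand-in model that just sums the features.
rows = [["car-1", 2.0, 3.0], ["car-2", 1.0, 1.5]]
print(score_with_join(rows, predict=sum))
```

Keeping the ID out of the model input while carrying it through to the output is what lets each prediction be matched back to its source record.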

Real-time inference pipeline

  • Purpose: To enable real-time inferences from both the models’ (Linear Learner and XGBoost) endpoints and get the maximum predicted value (or by using any other custom logic that can be written as a Lambda function) to be returned to the application.
  • Solution: The Wipro team has implemented reusable architecture using Amazon API Gateway, Lambda, and SageMaker endpoint as shown in Figure 6:

Figure 6 – Real-time inference pipeline

Flow description for the real-time inference pipeline shown in Figure 6:

  1. The payload is sent from the application to Amazon API Gateway, which routes it to the respective Lambda function.
  2. A Lambda function (with an integrated SageMaker custom layer) does the required pre-processing, JSON or CSV payload formatting, and invokes the respective endpoints.
  3. The response is returned to Lambda and sent back to the application through API Gateway.
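The flow above can be sketched as a Lambda handler that sends the same payload to both endpoints and returns the larger prediction. This is a sketch under stated assumptions: the endpoint names are hypothetical, the request body is assumed to be a CSV string, and each endpoint is assumed to return a single number as plain text. The selection logic takes the invoke function as a parameter so it can be exercised without AWS access.

```python
import json

# Hypothetical endpoint names, for illustration only.
ENDPOINTS = ["linear-learner-endpoint", "xgboost-endpoint"]


def max_prediction(payload: str, invoke) -> float:
    """Call every endpoint with the same payload and return the largest prediction."""
    return max(float(invoke(name, payload)) for name in ENDPOINTS)


def invoke_sagemaker(endpoint_name: str, payload: str) -> str:
    """Invoke a SageMaker real-time endpoint with a CSV payload."""
    import boto3  # imported lazily so max_prediction can be tested offline

    runtime = boto3.client("sagemaker-runtime")
    response = runtime.invoke_endpoint(
        EndpointName=endpoint_name, ContentType="text/csv", Body=payload
    )
    return response["Body"].read().decode("utf-8")


def lambda_handler(event, context):
    """API Gateway proxy handler: the CSV payload is assumed to be in event['body']."""
    prediction = max_prediction(event["body"], invoke_sagemaker)
    return {"statusCode": 200, "body": json.dumps({"prediction": prediction})}
```

Replacing `max` with any other aggregation changes the custom logic without touching the API Gateway or endpoint setup.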

The customer used this pipeline for small and medium scale models, which included using various types of open-source algorithms. One of the key benefits of SageMaker is that various types of algorithms can be brought into SageMaker and deployed using a bring your own container (BYOC) technique. BYOC involves containerizing the algorithm and registering the image in Amazon Elastic Container Registry (Amazon ECR), and then using the same image to create a container to do training and inference.

Scaling is one of the biggest issues in the machine learning cycle. SageMaker comes with the necessary tools for scaling a model during inference. In the preceding architecture, users need to enable auto-scaling of SageMaker, which eventually handles the workload. To enable auto-scaling, users must provide an auto-scaling policy that specifies the target throughput per instance and the maximum and minimum number of instances. With the policy in place, SageMaker automatically handles the workload for real-time endpoints and scales the number of instances when needed.
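As an illustration of such a policy, the following sketch builds the two request payloads that Application Auto Scaling expects for a SageMaker endpoint variant: one to register the scalable target and one for a target tracking policy on invocations per instance. The endpoint name, variant name, policy name, and numbers are hypothetical; the dictionaries would be passed as keyword arguments to the `register_scalable_target` and `put_scaling_policy` calls of the `application-autoscaling` client.

```python
def scaling_requests(endpoint, variant, min_instances, max_instances, invocations_per_instance):
    """Build the scalable target and target tracking policy for an endpoint variant."""
    resource_id = f"endpoint/{endpoint}/variant/{variant}"
    dimension = "sagemaker:variant:DesiredInstanceCount"
    target = {
        "ServiceNamespace": "sagemaker",
        "ResourceId": resource_id,
        "ScalableDimension": dimension,
        "MinCapacity": min_instances,
        "MaxCapacity": max_instances,
    }
    policy = {
        "PolicyName": f"{endpoint}-invocations-target",  # hypothetical naming scheme
        "ServiceNamespace": "sagemaker",
        "ResourceId": resource_id,
        "ScalableDimension": dimension,
        "PolicyType": "TargetTrackingScaling",
        "TargetTrackingScalingPolicyConfiguration": {
            "TargetValue": float(invocations_per_instance),
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
            },
        },
    }
    return target, policy


# Hypothetical values for illustration.
target, policy = scaling_requests("xgboost-endpoint", "AllTraffic", 1, 4, 200)
print(target["ResourceId"], policy["PolicyType"])
```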

Custom model monitor pipeline

  • Purpose: The customer team wanted to have automated model monitoring to capture both data drift and model drift. The Wipro team used SageMaker model monitoring to enable both data drift and model drift detection with a reusable pipeline for real-time inferences and batch transformation. Note that during the development of this solution, SageMaker model monitoring didn’t provide a provision for detecting data or model drift for batch transformation. We implemented customizations to use the model monitor container for the batch transformation payload.
  • Solution: The Wipro team implemented a reusable model-monitoring pipeline for real-time and batch inference payloads using AWS Glue to capture the incremental payload and invoke the model monitoring job according to the defined schedule.

Figure 7 – Model monitor step machine

Flow description for the custom model monitor pipeline:
The pipeline runs according to the defined schedule configured through EventBridge.

  1. CSV consolidation – It uses the AWS Glue bookmark feature to detect the presence of incremental payload in the defined S3 bucket of real-time data capture and response and batch data response. It then aggregates that data for further processing.
  2. Evaluate payload – If there is incremental data or payload present for the current run, it invokes the monitoring branch. Otherwise, it bypasses without processing and exits the job.
  3. Post processing – The monitoring branch is designed to have two parallel sub branches—one for data drift and another for model drift.
  4. Monitoring (data drift) – The data drift branch runs whenever there is a payload present. It uses the latest trained model baseline constraints and statistics files generated through the training pipeline for the data features and runs the model monitoring job.
  5. Monitoring (model drift) – The model drift branch runs only when ground truth data is supplied, along with the inference payload. It uses trained model baseline constraints and statistics files generated through the training pipeline for the model quality features and runs the model monitoring job.
  6. Evaluate drift – The outcome of both data and model drift is a constraint violation file that’s evaluated by the evaluate drift Lambda function which sends notification to the respective Amazon SNS topics with details of the drift. Drift data is enriched further with the addition of attributes for reporting purposes. The drift notification emails will look similar to the examples in Figure 8.

Figure 8 – Data and model drift notification message

Figure 9 – Data and model drift notification message
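The evaluate drift step described above can be sketched as a function that reads a constraint violations report and prepares a notification. This is an illustrative sketch, not Wipro's Lambda function: it assumes the report follows the Model Monitor layout of a top-level `violations` list whose entries carry `feature_name`, `constraint_check_type`, and `description`, and the sample report and message wording are made up.

```python
import json


def summarize_violations(report_json: str) -> dict:
    """Summarize a constraint violations report into a small drift record."""
    violations = json.loads(report_json).get("violations", [])
    features = sorted({v["feature_name"] for v in violations})
    return {
        "drift_detected": bool(violations),
        "count": len(violations),
        "features": features,
    }


def build_message(summary: dict, job_name: str) -> str:
    """Format the text that would be published to the SNS topic."""
    if not summary["drift_detected"]:
        return f"{job_name}: no drift detected"
    return f"{job_name}: {summary['count']} violation(s) on {', '.join(summary['features'])}"


# Made-up report for illustration.
sample = json.dumps({
    "violations": [
        {"feature_name": "mileage", "constraint_check_type": "baseline_drift_check",
         "description": "Baseline drift distance exceeds threshold"},
        {"feature_name": "year", "constraint_check_type": "data_type_check",
         "description": "Data type mismatch"},
    ]
})
summary = summarize_violations(sample)
print(build_message(summary, "data-drift-monitor"))
```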

Insights with Amazon QuickSight visualization:

  • Purpose: The customer wanted to have insights about the data and model drift, relate the drift data to the respective model monitoring jobs, and identify trends in the inference data.
  • Solution: The Wipro team enriched the drift data by connecting input data with the drift result, which enables triage from drift to monitoring and respective scoring data. Visualizations and dashboards were created using Amazon QuickSight with Amazon Athena as the data source (using the Amazon S3 CSV scoring and drift data).

Figure 10 – Model monitoring visualization architecture

Design considerations:

  1. Use a QuickSight SPICE dataset for better in-memory performance.
  2. Use the QuickSight dataset refresh APIs to automate the SPICE data refresh.
  3. Implement group-based security for dashboard and analysis access control.
  4. Across accounts, automate deployment using export and import dataset, data source, and analysis API calls provided by QuickSight.
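For the second design consideration, a SPICE refresh can be started by creating an ingestion for the dataset. The sketch below takes the QuickSight client as a parameter, in practice `boto3.client("quicksight")`, so the call shape can be checked without AWS access. The account ID and dataset ID shown in the usage comment are placeholders.

```python
import uuid


def refresh_spice(client, account_id: str, dataset_id: str) -> str:
    """Start a SPICE refresh for one dataset and return the ingestion ID used."""
    ingestion_id = str(uuid.uuid4())  # each refresh needs its own ingestion ID
    client.create_ingestion(
        AwsAccountId=account_id,
        DataSetId=dataset_id,
        IngestionId=ingestion_id,
    )
    return ingestion_id


# Usage (placeholders, requires AWS credentials):
#   import boto3
#   refresh_spice(boto3.client("quicksight"), "111122223333", "drift-dataset")
```

Running this from a scheduled Lambda function, or at the end of the monitoring pipeline, would keep the dashboards aligned with the latest drift data.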

Model monitoring dashboard:

To enable an effective outcome and meaningful insights of the model monitoring jobs, custom dashboards were created for the model monitoring data. The input data points are combined in parallel with inference request data, jobs data, and monitoring output to create a visualization of trends revealed by the model monitoring.

This helped the customer team visualize various data features along with the predicted outcome of each batch of inference requests.

Figure 11 – Model monitor dashboard with selection prompts

Figure 12 – Model monitor drift analysis


The implementation explained in this post enabled Wipro to effectively migrate their on-premises models to AWS and build a scalable, automated model development framework.

The use of reusable framework components empowers the data science team to effectively package their work as deployable AWS Step Functions JSON components. Simultaneously, the DevOps teams used and enhanced the automated CI/CD pipeline to facilitate the seamless promotion and retraining of models in higher environments.

The model monitoring component has enabled continuous monitoring of the model performance, and users receive alerts and notifications whenever data or model drift is detected.

The customer’s team is using this MLOps framework to migrate or develop more models and increase their SageMaker adoption.

By harnessing the comprehensive suite of SageMaker services in conjunction with our meticulously designed architecture, customers can seamlessly onboard multiple models, significantly reducing deployment time and mitigating complexities associated with code sharing. Moreover, our architecture simplifies code versioning maintenance, ensuring a streamlined development process.

This architecture handles the entire machine learning cycle, encompassing automated model training, real-time and batch inference, proactive model monitoring, and drift analysis. This end-to-end solution empowers customers to achieve optimal model performance while maintaining rigorous monitoring and analysis capabilities to ensure ongoing accuracy and reliability.

To create this architecture, begin by creating essential resources like Amazon Virtual Private Cloud (Amazon VPC), SageMaker notebooks, and Lambda functions. Make sure to set up appropriate AWS Identity and Access Management (IAM) policies for these resources.

Next, focus on building the components of the architecture—such as training and preprocessing scripts—within SageMaker Studio or Jupyter Notebook. This step involves developing the necessary code and configurations to enable the desired functionalities.

After the architecture’s components are defined, you can proceed with building the Lambda functions for generating inferences or performing post-processing steps on the data.

At the end, use Step Functions to connect the components and establish a smooth workflow that coordinates the running of each step.

About the Authors

Stephen Randolph is a Senior Partner Solutions Architect at Amazon Web Services (AWS). He enables and supports Global Systems Integrator (GSI) partners on the latest AWS technology as they develop industry solutions to solve business challenges. Stephen is especially passionate about Security and Generative AI, and helping customers and partners architect secure, efficient, and innovative solutions on AWS.

Bhajandeep Singh has served as the AWS AI/ML Center of Excellence Head at Wipro Technologies, leading customer engagements to deliver data analytics and AI solutions. He holds the AWS AI/ML Specialty certification and authors technical blogs on AI/ML services and solutions. With experience of leading AWS AI/ML solutions across industries, Bhajandeep has enabled clients to maximize the value of AWS AI/ML services through his expertise and leadership.

Ajay Vishwakarma is an ML engineer for the AWS wing of Wipro’s AI solution practice. He has experience in building BYOM solutions for custom algorithms in SageMaker, end-to-end ETL pipeline deployment, building chatbots using Lex, cross-account QuickSight resource sharing, and building CloudFormation templates for deployments. He likes exploring AWS, taking every customer’s problem as a challenge to explore more and provide solutions.

Generating value from enterprise data: Best practices for Text2SQL and generative AI

Generative AI has opened up a lot of potential in the field of AI. We are seeing numerous uses, including text generation, code generation, summarization, translation, chatbots, and more. One such area that is evolving is using natural language processing (NLP) to unlock new opportunities for accessing data through intuitive SQL queries. Instead of dealing with complex technical code, business users and data analysts can ask questions related to data and insights in plain language. The primary goal is to automatically generate SQL queries from natural language text. To do this, the text input is transformed into a structured representation, and from this representation, a SQL query that can be used to access a database is created.

In this post, we provide an introduction to text to SQL (Text2SQL) and explore use cases, challenges, design patterns, and best practices. Specifically, we discuss the following:

  • Why do we need Text2SQL
  • Key components for Text to SQL
  • Prompt engineering considerations for natural language or Text to SQL
  • Optimizations and best practices
  • Architecture patterns

Why do we need Text2SQL?

Today, a large amount of data is available in traditional data analytics, data warehousing, and databases, which may not be easy for most members of an organization to query or understand. The primary goal of Text2SQL is to make querying databases more accessible to non-technical users, who can provide their queries in natural language.

Text2SQL enables business users to analyze data and get answers by typing or speaking questions in natural language, such as the following:

  • “Show total sales for each product last month”
  • “Which products generated more revenue?”
  • “What percentage of customers are from each region?”

Amazon Bedrock is a fully managed service that offers a choice of high-performing foundation models (FMs) via a single API, enabling you to easily build and scale generative AI applications. It can be used to generate SQL queries from questions similar to the ones listed above, query an organization’s structured data, and generate natural language responses from the query results.

Key components for text to SQL

Text-to-SQL systems involve several stages to convert natural language queries into runnable SQL:

  • Natural language processing:
    • Analyze the user’s input query
    • Extract key elements and intent
    • Convert to a structured format
  • SQL generation:
    • Map extracted details into SQL syntax
    • Generate a valid SQL query
  • Database query:
    • Run the AI-generated SQL query on the database
    • Retrieve results
    • Return results to the user
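The three stages above can be sketched end to end with an in-memory SQLite database. In this sketch the SQL generation stage is a hypothetical stand-in (a lookup of one canned translation) where a real system would call an LLM; the table, data, and question are invented for illustration.

```python
import sqlite3


def generate_sql(question: str) -> str:
    """Stand-in for the LLM call: a hypothetical lookup of canned translations."""
    canned = {
        "show total sales for each product":
            "SELECT product, SUM(amount) AS total FROM sales "
            "GROUP BY product ORDER BY product",
    }
    return canned[question.strip().lower()]


def answer(question: str, conn: sqlite3.Connection) -> list:
    """Run the pipeline: question -> SQL -> database query -> rows."""
    sql = generate_sql(question)
    return conn.execute(sql).fetchall()


# Invented sample data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (product TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("bike", 100.0), ("bike", 50.0), ("car", 900.0)],
)
rows = answer("Show total sales for each product", conn)
print(rows)
```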

One remarkable capability of large language models (LLMs) is the generation of code, including Structured Query Language (SQL) for databases. These LLMs can be used to understand a natural language question and generate a corresponding SQL query as output. LLMs benefit from in-context learning and fine-tuning as more data is provided.

The following diagram illustrates a basic Text2SQL flow.

Text 2 SQL high level process flow

Prompt engineering considerations for natural language to SQL

The prompt is crucial when using LLMs to translate natural language into SQL queries, and there are several important considerations for prompt engineering.

Effective prompt engineering is key to developing natural language to SQL systems. Clear, straightforward prompts provide better instructions for the language model. Providing context that the user is requesting a SQL query along with relevant database schema details enables the model to translate the intent accurately. Including a few annotated examples of natural language prompts and corresponding SQL queries helps guide the model to produce syntax-compliant output. Additionally, incorporating Retrieval Augmented Generation (RAG), where the model retrieves similar examples during processing, further improves the mapping accuracy. Well-designed prompts that give the model sufficient instruction, context, examples, and retrieval augmentation are crucial for reliably translating natural language into SQL queries.

The following is an example of a baseline prompt with code representation of the database from the whitepaper Enhancing Few-shot Text-to-SQL Capabilities of Large Language Models: A Study on Prompt Design Strategies.

/* Given the following database schema : */
CREATE TABLE IF NOT EXISTS " gymnast " (
" Gymnast_ID " int ,
" Floor_Exercise_Points " real ,
" Pommel_Horse_Points " real ,
" Rings_Points " real ,
" Vault_Points " real ,
" Parallel_Bars_Points " real ,
" Horizontal_Bar_Points " real ,
" Total_Points " real ,
PRIMARY KEY ( " Gymnast_ID " ) ,
FOREIGN KEY ( " Gymnast_ID " ) REFERENCES " people " ( " People_ID " )
) ;
CREATE TABLE IF NOT EXISTS " people " (
" People_ID " int ,
" Name " text ,
" Age " real ,
" Height " real ,
" Hometown " text ,
PRIMARY KEY ( " People_ID " )
) ;

/* Answer the following : Return the total points of the gymnast with the lowest age . */

select t1 . total_points from gymnast as t1 join people as t2 on t1 . gymnast_id = t2 . people_id order by t2 . age asc limit 1
As illustrated in this example, prompt-based few-shot learning provides the model with a handful of annotated examples in the prompt itself. This demonstrates the target mapping between natural language and SQL for the model. Typically, the prompt would contain around 2–3 pairs showing a natural language query and the equivalent SQL statement. These few examples guide the model to generate syntax-compliant SQL queries from natural language without requiring extensive training data.
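A few-shot prompt in the comment style of the example above can be assembled programmatically. The following is a minimal sketch: the function name, the shortened schema, and the example pair are hypothetical, and the layout simply mirrors the schema-then-question format shown earlier rather than any required template.

```python
def build_few_shot_prompt(schema: str, examples: list, question: str) -> str:
    """Assemble schema, annotated (question, SQL) pairs, and the new question."""
    parts = ["/* Given the following database schema : */", schema.strip(), ""]
    for natural_language, sql in examples:
        parts += [f"/* Answer the following : {natural_language} */", sql, ""]
    # The prompt ends with the open question so the model completes the SQL.
    parts.append(f"/* Answer the following : {question} */")
    return "\n".join(parts)


# Hypothetical, shortened schema and example pair.
schema = 'CREATE TABLE "people" ("People_ID" int, "Name" text, "Age" real);'
examples = [
    ("Return the name of the oldest person.",
     "select name from people order by age desc limit 1"),
]
prompt = build_few_shot_prompt(schema, examples, "How many people are there?")
print(prompt)
```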

Fine-tuning vs. prompt engineering

When building natural language to SQL systems, a common question is whether fine-tuning the model is the right technique or whether effective prompt engineering is the way to go. Both approaches can be considered, and the choice depends on your requirements:

    • Fine-tuning – The baseline model is pre-trained on a large general text corpus and then can use instruction-based fine-tuning, which uses labeled examples to improve the performance of a pre-trained foundation model on text-SQL. This adapts the model to the target task. Fine-tuning directly trains the model on the end task but requires many text-SQL examples. You can use supervised fine-tuning based on your LLM to improve the effectiveness of text-to-SQL. For this, you can use several datasets like Spider, WikiSQL, CHASE, BIRD-SQL, or CoSQL.
    • Prompt engineering – The model is given prompts designed to elicit the target SQL syntax. When generating SQL from natural language using LLMs, providing clear instructions in the prompt is important for controlling the model’s output. In the prompt, annotate the different components, such as the tables and columns of the schema, and then state which type of SQL to create. These annotations act like instructions that tell the model how to format the SQL output. The following prompt shows an example where you point to table columns and instruct the model to create a MySQL query:
Table offices, columns = [OfficeId, OfficeName]
Table employees, columns = [OfficeId, EmployeeId,EmployeeName]
Create a MySQL query for all employees in the Machine Learning Department

An effective approach for text-to-SQL models is to first start with a baseline LLM without any task-specific fine-tuning. Well-crafted prompts can then be used to adapt and drive the base model to handle the text-to-SQL mapping. This prompt engineering allows you to develop the capability without needing to do fine-tuning. If prompt engineering on the base model doesn’t achieve sufficient accuracy, fine-tuning on a small set of text-SQL examples can then be explored along with further prompt engineering.

The combination of fine-tuning and prompt engineering may be required if prompt engineering on the raw pre-trained model alone doesn’t meet requirements. However, it’s best to initially attempt prompt engineering without fine-tuning, because this allows rapid iteration without data collection. If this fails to provide adequate performance, fine-tuning alongside prompt engineering is a viable next step. This overall approach maximizes efficiency while still allowing customization if purely prompt-based methods are insufficient.

Optimization and best practices

Optimization and best practices are essential for enhancing effectiveness and ensuring resources are used optimally and the right results are achieved in the best way possible. The techniques help in improving performance, controlling costs, and achieving a better-quality outcome.

When developing text-to-SQL systems using LLMs, optimization techniques can improve performance and efficiency. The following are some key areas to consider:

  • Caching – To improve latency, cost control, and standardization, you can cache the parsed SQL and recognized query prompts from the text-to-SQL LLM. This avoids reprocessing repeated queries.
  • Monitoring – Logs and metrics around query parsing, prompt recognition, SQL generation, and SQL results should be collected to monitor the text-to-SQL LLM system. This provides visibility for optimization, for example updating the prompt or revisiting the fine-tuning with an updated dataset.
  • Materialized views vs. tables – Materialized views can simplify SQL generation and improve performance for common text-to-SQL queries. Querying tables directly may result in complex SQL and performance issues, including the need to keep creating performance aids like indexes. Materialized views also avoid performance issues when the same table is used by other parts of the application at the same time.
  • Refreshing data – Materialized views need to be refreshed on a schedule to keep data current for text-to-SQL queries. You can use batch or incremental refresh approaches to balance overhead.
  • Central data catalog – Creating a centralized data catalog provides a single pane of glass view to an organization’s data sources and will help LLMs select appropriate tables and schemas in order to provide more accurate responses. Vector embeddings created from a central data catalog can be supplied to an LLM along with information requested to generate relevant and precise SQL responses.

By applying optimization best practices like caching, monitoring, materialized views, scheduled refreshing, and a central catalog, you can significantly improve the performance and efficiency of text-to-SQL systems using LLMs.
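The caching practice can be sketched as a small wrapper around the SQL generator. This is a minimal in-memory illustration, assuming that questions differing only in case or whitespace should share one cache entry; the class name and the normalization rule are choices made for this sketch, and a production system would more likely use a shared store with expiry.

```python
import re


class SqlCache:
    """Cache generated SQL keyed on a normalized form of the question."""

    def __init__(self, generate):
        self._generate = generate  # callable that turns a question into SQL
        self._store = {}
        self.misses = 0

    @staticmethod
    def _key(question: str) -> str:
        # Lowercase and collapse whitespace so trivial variants hit the same entry.
        return re.sub(r"\s+", " ", question.strip().lower())

    def get_sql(self, question: str) -> str:
        key = self._key(question)
        if key not in self._store:
            self.misses += 1
            self._store[key] = self._generate(question)
        return self._store[key]


# Hypothetical generator standing in for the LLM call.
cache = SqlCache(lambda q: "SELECT COUNT(*) FROM customers")
first = cache.get_sql("How many customers do we have?")
second = cache.get_sql("  how many   customers do we have? ")
print(cache.misses)
```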

Architecture patterns

Let’s look at some architecture patterns that can be implemented for a text to SQL workflow.

Prompt engineering

The following diagram illustrates the architecture for generating queries with an LLM using prompt engineering.

illustrates the architecture for generating queries with an LLM using prompt engineering

In this pattern, the user builds a few-shot prompt that provides the model with annotated examples in the prompt itself, including the table and schema details and some sample queries with their results. The LLM uses the provided prompt to return the AI-generated SQL, which is validated and then run against the database to get the results. This is the most straightforward pattern to get started using prompt engineering. For this, you can use Amazon Bedrock or foundation models in Amazon SageMaker JumpStart.
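The validation step mentioned in this pattern can take many forms. One conservative sketch, using SQLite for illustration, accepts only a single SELECT statement and asks the database to plan it without running it. The function name and the rules are assumptions for this example; the semicolon check is deliberately strict (it also rejects semicolons inside string literals), and queries starting with WITH are rejected as well.

```python
import sqlite3


def validate_select(sql: str, conn: sqlite3.Connection) -> bool:
    """Accept only a single SELECT statement that the database can plan."""
    statement = sql.strip().rstrip(";")
    if ";" in statement or not statement.lower().startswith("select"):
        return False
    try:
        # EXPLAIN QUERY PLAN compiles the statement without executing the query,
        # so unknown tables or columns and syntax errors surface here.
        conn.execute(f"EXPLAIN QUERY PLAN {statement}")
    except sqlite3.Error:
        return False
    return True


# Invented table for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (product TEXT, amount REAL)")
print(validate_select("SELECT product FROM sales", conn))
print(validate_select("DROP TABLE sales", conn))
```

Running generated SQL under a read-only database role adds a second safeguard that does not depend on string checks.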

Prompt engineering and fine-tuning

The following diagram illustrates the architecture for generating queries with an LLM using prompt engineering and fine-tuning.

illustrates the architecture for generating queries with an LLM using prompt engineering and fine-tuning

This flow is similar to the previous pattern, which mostly relies on prompt engineering, but with an additional flow of fine-tuning on the domain-specific dataset. The fine-tuned LLM is used to generate the SQL queries with minimal in-context examples in the prompt. For this, you can use SageMaker JumpStart to fine-tune an LLM on a domain-specific dataset in the same way you would train and deploy any model on Amazon SageMaker.

Prompt engineering and RAG

The following diagram illustrates the architecture for generating queries with an LLM using prompt engineering and RAG.

illustrates the architecture for generating queries with an LLM using prompt engineering and RAG

In this pattern, we use Retrieval Augmented Generation with vector embeddings, generated by models like Amazon Titan Embeddings or Cohere Embed on Amazon Bedrock, from a central data catalog, like AWS Glue Data Catalog, of databases within an organization. The vector embeddings are stored in vector databases like Vector Engine for Amazon OpenSearch Serverless, Amazon Relational Database Service (Amazon RDS) for PostgreSQL with the pgvector extension, or Amazon Kendra. LLMs use the vector embeddings to select the right database, tables, and columns faster when creating SQL queries. Using RAG is helpful when the data and relevant information that LLMs need to retrieve are stored in multiple separate database systems and the LLM needs to be able to search or query data from all of them. In that case, providing vector embeddings of a centralized or unified data catalog to the LLMs results in more accurate and comprehensive information returned by the LLMs.
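The retrieval step can be illustrated with a toy example that ranks catalog entries by cosine similarity to the question. Everything here is a stand-in: the `embed` function is a simple bag-of-words count rather than an embedding model such as Amazon Titan Embeddings or Cohere Embed, the catalog is a hard-coded dictionary rather than AWS Glue Data Catalog, and the table descriptions are invented. Only the ranking logic carries over to a real system.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: word counts (a real system would call an embedding model)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


# Invented catalog: table name -> description.
CATALOG = {
    "sales": "sales orders with product amount and order date",
    "customers": "customers with name region and signup date",
    "employees": "employees with office and department",
}


def top_tables(question: str, k: int = 1) -> list:
    """Return the k catalog tables whose descriptions best match the question."""
    query = embed(question)
    ranked = sorted(CATALOG, key=lambda name: cosine(query, embed(CATALOG[name])), reverse=True)
    return ranked[:k]


print(top_tables("which region are customers from"))
print(top_tables("total amount by product"))
```

The selected table names and their schemas would then be placed in the prompt, so the LLM only sees the part of the catalog that is relevant to the question.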


In this post, we discussed how to generate value from enterprise data using natural language to SQL generation. We looked into key components, optimization, and best practices. We also covered architecture patterns ranging from basic prompt engineering to fine-tuning and RAG. To learn more, refer to Amazon Bedrock to easily build and scale generative AI applications with foundation models.

About the Authors

Randy DeFauw is a Senior Principal Solutions Architect at AWS. He holds an MSEE from the University of Michigan, where he worked on computer vision for autonomous vehicles. He also holds an MBA from Colorado State University. Randy has held a variety of positions in the technology space, ranging from software engineering to product management. He entered the Big Data space in 2013 and continues to explore that area. He is actively working on projects in the ML space and has presented at numerous conferences including Strata and GlueCon.

Nitin Eusebius is a Sr. Enterprise Solutions Architect at AWS, experienced in Software Engineering, Enterprise Architecture, and AI/ML. He is deeply passionate about exploring the possibilities of generative AI. He collaborates with customers to help them build well-architected applications on the AWS platform, and is dedicated to solving technology challenges and assisting with their cloud journey.

Arghya Banerjee is a Sr. Solutions Architect at AWS in the San Francisco Bay Area focused on helping customers adopt and use AWS Cloud. Arghya is focused on Big Data, Data Lakes, Streaming, Batch Analytics and AI/ML services and technologies.
