November 2023 – Page 8

AI-Powered Tech Company Helps Grocers Start Afresh in Supply Chain Management

Talk about going after low-hanging fruit. Afresh is an AI startup that helps grocery stores and retailers reduce food waste by making supply chains more efficient.

In the latest episode of NVIDIA’s AI Podcast, host Noah Kravitz spoke with the company’s cofounder and president, Nathan Fenner, about its mission, offerings and the greater challenge of eliminating food waste.

Most supply chain and inventory management offerings targeting grocers and retailers are outdated. Fenner and his team noticed those solutions, built for the nonperishable side of the business, didn’t work as well on the fresh side — creating enormous amounts of food waste and causing billions in lost profits.

The AI Podcast · AI-Powered Tech Company Helps Grocers Start Afresh in Supply Chain Management

The team first sought to solve the store-replenishment challenge by developing a platform to help grocers decide how much fresh produce to order to optimize costs while meeting demand.

They created machine learning and AI models that could effectively use the data generated by fresh produce, which is messier than data generated by nonperishable goods because of factors like time to decay, greater demand fluctuation and unreliability caused by lack of barcodes, leading to incorrect scans at self-checkout registers.

The result was a fully integrated, machine learning-based platform that helps grocers make informed decisions at each node of the operations process.

The company also recently launched inventory management software that allows grocers to save time and increase data accuracy by intelligently tracking inventory. That information can be inputted back into the platform’s ordering solution, further refining the accuracy of inventory data.

It’s all part of Afresh’s greater mission to tackle climate change.

“The most impactful thing we can do is reduce food waste to mitigate climate change,” Fenner said. “It’s really one of the key things that brought me into the business: I think I’ve always had a keen eye to work in the climate space. It’s really motivating for a lot of our team, and it’s a key part of our mission.”

Subscribe to the AI Podcast: Now Available on Amazon Music

The AI Podcast is now available through Amazon Music.

In addition, get the AI Podcast through iTunes, Google Podcasts, Google Play, Castbox, DoggCatcher, Overcast, PlayerFM, Pocket Casts, Podbay, PodBean, PodCruncher, PodKicker, Soundcloud, Spotify, Stitcher and TuneIn.

Make the AI Podcast better. Have a few minutes to spare? Fill out this listener survey.

Maya Cakmak builds teachable robots to help humans

Through the UW + Amazon Science Hub, the UW associate professor and Science Hub advisory board member is helping to realize a future where robots and people collaborate on tasks.Read More

How Amazon Music uses SageMaker with NVIDIA to optimize ML training and inference performance and cost

In the dynamic world of streaming on Amazon Music, every search for a song, podcast, or playlist holds a story, a mood, or a flood of emotions waiting to be unveiled. These searches serve as a gateway to new discoveries, cherished experiences, and lasting memories. The search bar is not just about finding a song; it’s about the millions of active users starting their personal journey into the rich and diverse world that Amazon Music has to offer.

Delivering a superior customer experience to instantly find the music that users search for requires a platform that is both smart and responsive. Amazon Music uses the power of AI to accomplish this. However, optimizing the customer experience while managing cost of training and inference of AI models that power the search bar’s capabilities, like real-time spellcheck and vector search, is difficult during peak traffic times.

Amazon SageMaker provides an end-to-end set of services that allow Amazon Music to build, train, and deploy on the AWS Cloud with minimal effort. By taking care of the undifferentiated heavy lifting, SageMaker allows you to focus on working on your machine learning (ML) models, and not worry about things such as infrastructure. As part of the shared responsibility model, SageMaker makes sure that the services they provide are reliable, performant, and scalable, while you make sure the application of the ML models makes the best use of the capabilities that SageMaker provides.

In this post, we walk through the journey Amazon Music took to optimize performance and cost using SageMaker and NVIDIA Triton Inference Server and TensorRT. We dive deep into showing how that seemingly simple, yet intricate, search bar works, ensuring an unbroken journey into the universe of Amazon Music with little-to-zero frustrating typo delays and relevant real-time search results.

Amazon SageMaker and NVIDIA: Delivering fast and accurate vector search and spellcheck capabilities

Amazon Music offers a vast library of over 100 million songs and millions of podcast episodes. However, finding the right song or podcast can be challenging, especially if you don’t know the exact title, artist, or album name, or the searched query is very broad, such as “news podcasts.”

Amazon Music has taken a two-pronged approach to improve the search and retrieval process. The first step is to introduce vector search (also known as embedding-based retrieval), an ML technique that can help users find the most relevant content they’re looking for by using semantics of the content. The second step involves introducing a Transformer-based Spell Correction model in the search stack. This can be especially helpful when searching for music, because users may not always know the exact spelling of a song title or artist name. Spell correction can help users find the music they’re looking for even if they make a spelling mistake in their search query.

Introducing Transformer models in a search and retrieval pipeline (in query embedding generation needed for vector search and the generative Seq2Seq Transformer model in Spell Correction) may lead to significant increase in overall latency, affecting customer experience negatively. Therefore, it became a top priority for us to optimize the real-time inference latency for vector search and spell correction models.

Amazon Music and NVIDIA have come together to bring the best possible customer experience to the search bar, using SageMaker to implement both fast and accurate spellcheck capabilities and real-time semantic search suggestions using vector search-based techniques. The solution includes using SageMaker hosting powered by G5 instances that uses NVIDIA A10G Tensor Core GPUs, SageMaker-supported NVIDIA Triton Inference Server Container, and the NVIDIA TensorRT model format. By reducing the inference latency of the spellcheck model to 25 milliseconds at peak traffic, and reducing search query embedding generation latency by 63% on average and cost by 73% compared to CPU based inference, Amazon Music has elevated the search bar’s performance.

Additionally, when training the AI model to deliver accurate results, Amazon Music achieved a whopping 12 fold acceleration in training time for their BART sequence-to-sequence spell corrector transformer model, saving them both time and money, by optimizing their GPU utilization.

Amazon Music partnered with NVIDIA to prioritize the customer search experience and craft a search bar with well-optimized spellcheck and vector search functionalities. In the following sections, we share more about how these optimizations were orchestrated.

Optimizing training with NVIDIA Tensor Core GPUs

Gaining access to an NVIDIA Tensor Core GPU for large language model training is not enough to capture its true potential. There are key optimization steps that must happen during training in order to fully maximize the GPU’s utilization. However, an underutilized GPU will undoubtedly lead to inefficient use of resources, prolonged training durations, and increased operational costs.

During the initial phases of training the spell corrector BART (bart-base) transformer model on a SageMaker ml.p3.24xlarge instance (8 NVIDIA V100 Tensor Core GPUs), Amazon Music’s GPU utilization was around 35%. To maximize the benefits of NVIDIA GPU-accelerated training, AWS and NVIDIA solution architects supported Amazon Music in identifying areas for optimizations, particularly around the batch size and precision parameters. These two crucial parameters influence the efficiency, speed, and accuracy of training deep learning models.

The resulting optimizations yielded a new and improved V100 GPU utilization, steady at around 89%, drastically reducing Amazon Music’s training time from 3 days to 5–6 hours. By switching the batch size from 32 to 256 and using optimization techniques like running automatic mixed precision training instead of only using FP32 precision, Amazon Music was able to save both time and money.

The following chart illustrates the 54% percentage point increase in GPU utilization after optimizations.

The following figure illustrates the acceleration in training time.

This increase in batch size enabled the NVIDIA GPU to process significantly more data concurrently across multiple Tensor Cores, resulting in accelerated training time. However, it’s important to maintain a delicate balance with memory, because larger batch sizes demand more memory. Both increasing batch size and employing mixed precision can be critical in unlocking the power of NVIDIA Tensor Core GPUs.

After the model was trained to convergence, it was time to optimize for inference deployment on Amazon Music’s search bar.

Spell Correction: BART model inferencing

With the help of SageMaker G5 instances and NVIDIA Triton Inference Server (an open source inference serving software), as well as NVIDIA TensorRT, an SDK for high-performance deep learning inference that includes an inference optimizer and runtime, Amazon Music limits their spellcheck BART (bart-base) model server inference latency to just 25 milliseconds at peak traffic. This includes overheads like load balancing, preprocessing, model inferencing, and postprocessing times.

NVIDIA Triton Inference Server provides two different kind backends: one for hosting models on GPU, and a Python backend where you can bring your own custom code to be used in preprocessing and postprocessing steps. The following figure illustrates the model ensemble scheme.

Amazon Music built its BART inference pipeline by running both preprocessing (text tokenization) and postprocessing (tokens to text) steps on CPUs, whereas the model execution step runs on NVIDIA A10G Tensor Core GPUs. A Python backend sits in the middle of the preprocessing and postprocessing steps, and is responsible for communicating with the TensorRT-converted BART models as well as the encoder/decoder networks. TensorRT boosts inference performance with precision calibration, layer and tensor fusion, kernel auto-tuning, dynamic tensor memory, multi-stream execution, and time fusion.

The following figure illustrates the high-level design of the key modules that make up the spell corrector BART model inferencing pipeline.

Vector search: Query embedding generation sentence BERT model inferencing

The following chart illustrates the 60% improvement in latency (serving p90 800–900 TPS) when using the NVIDIA AI Inference Platform compared to a CPU-based baseline.

The following chart shows a 70% improvement in cost when using the NVIDIA AI Inference Platform compared to a CPU-based baseline.

The following figure illustrates an SDK for high-performance deep learning inference. It includes a deep learning inference optimizer and runtime that delivers low latency and high throughput for inference applications.

To achieve these results, Amazon Music experimented with several different Triton deployment parameters using Triton Model Analyzer, a tool that helps find the best NVIDIA Triton model configuration to deploy efficient inference. To optimize model inference, Triton offers features like dynamic batching and concurrent model execution, and has framework support for other flexibility capabilities. The dynamic batching gathers inference requests, seamlessly grouping them together into cohorts in order to maximize throughput, all while ensuring real-time responses for Amazon Music users. The concurrent model execution capability further enhances inference performance by hosting multiple copies of the model on the same GPU. Finally, by utilizing Triton Model Analyzer, Amazon Music was able to carefully fine-tune the dynamic batching and model concurrency inference hosting parameters to find optimal settings that maximize inference performance using simulated traffic.

Conclusion

Optimizing configurations with Triton Inference Server and TensorRT on SageMaker allowed Amazon Music to achieve outstanding results for both training and inference pipelines. The SageMaker platform is the end-to-end open platform for production AI, providing quick time to value and the versatility to support all major AI use cases across both hardware and software. By optimizing V100 GPU utilization for training and switching from CPUs to G5 instances using NVIDIA A10G Tensor Core GPUs, as well as by using optimized NVIDIA software like Triton Inference Server and TensorRT, companies like Amazon Music can save time and money while boosting performance in both training and inference, directly translating to a better customer experience and lower operating costs.

SageMaker handles the undifferentiated heavy lifting for ML training and hosting, allowing Amazon Music to deliver reliable, scalable ML operations across both hardware and software.

We encourage you to check that your workloads are optimized using SageMaker by always evaluating your hardware and software choices to see if there are ways you can achieve better performance with decreased costs.

To learn more about NVIDIA AI in AWS, refer to the following:

About the authors

Siddharth Sharma is a Machine Learning Tech Lead at Science & Modeling team at Amazon Music. He specializes in Search, Retrieval, Ranking and NLP related modeling problems. Siddharth has a rich back-ground working on large scale machine learning problems that are latency sensitive e.g. Ads Targeting, Multi Modal Retrieval, Search Query Understanding etc. Prior to working at Amazon Music, Siddharth was working at companies like Meta, Walmart Labs, Rakuten on E-Commerce centric ML Problems. Siddharth spent early part of his career working with bay area ad-tech startups.

Tarun Sharma is a Software Development Manager leading Amazon Music Search Relevance. His team of scientists and ML engineers is responsible for providing contextually relevant and personalized search results to Amazon Music customers.

James Park is a Solutions Architect at Amazon Web Services. He works with Amazon.com to design, build, and deploy technology solutions on AWS, and has a particular interest in AI and machine learning. In h is spare time he enjoys seeking out new cultures, new experiences, and staying up to date with the latest technology trends.You can find him on LinkedIn.

Kshitiz Gupta is a Solutions Architect at NVIDIA. He enjoys educating cloud customers about the GPU AI technologies NVIDIA has to offer and assisting them with accelerating their machine learning and deep learning applications. Outside of work, he enjoys running, hiking and wildlife watching.

Jiahong Liu is a Solution Architect on the Cloud Service Provider team at NVIDIA. He assists clients in adopting machine learning and AI solutions that leverage NVIDIA accelerated computing to address their training and inference challenges. In his leisure time, he enjoys origami, DIY projects, and playing basketball.

Tugrul Konuk is a Senior Solution Architect at NVIDIA, specializing at large-scale training, multimodal deep learning, and high-performance scientific computing. Prior to NVIDIA, he worked at the energy industry, focusing on developing algorithms for computational imaging. As part of his PhD, he worked on physics-based deep learning for numerical simulations at scale. In his leisure time, he enjoys reading, playing the guitar and the piano.

Rohil Bhargava is a Product Marketing Manager at NVIDIA, focused on deploying NVIDIA application frameworks and SDKs on specific CSP platforms.

Eliuth Triana Isaza is a Developer Relations Manager at NVIDIA empowering Amazon’s AI MLOps, DevOps, Scientists and AWS technical experts to master the NVIDIA computing stack for accelerating and optimizing Generative AI Foundation models spanning from data curation, GPU training, model inference and production deployment on AWS GPU instances. In addition, Eliuth is a passionate mountain biker, skier, tennis and poker player.

Machine Learning with MATLAB and Amazon SageMaker

This post is written in collaboration with Brad Duncan, Rachel Johnson and Richard Alcock from MathWorks.

MATLAB  is a popular programming tool for a wide range of applications, such as data processing, parallel computing, automation, simulation, machine learning, and artificial intelligence. It’s heavily used in many industries such as automotive, aerospace, communication, and manufacturing. In recent years, MathWorks has brought many product offerings into the cloud, especially on Amazon Web Services (AWS). For more details about MathWorks cloud products, see MATLAB and Simulink in the Cloud or email Mathworks.

In this post, we bring MATLAB’s machine learning capabilities into Amazon SageMaker, which has several significant benefits:

Compute resources: Using the high-performance computing environment offered by SageMaker can speed up machine learning training.
Collaboration: MATLAB and SageMaker together provide a robust platform that t teams can use to collaborate effectively on building, testing, and deploying machine learning models.
Deployment and accessibility: Models can be deployed as SageMaker real-time endpoints, making them readily accessible for other applications to process live streaming data.

We show you how to train a MATLAB machine learning model as a SageMaker training job and then deploy the model as a SageMaker real-time endpoint so it can process live, streaming data.

To do this, we’ll use a predictive maintenance example where we classify faults in an operational pump that’s streaming live sensor data. We have access to a large repository of labeled data generated from a Simulink simulation that has three possible fault types in various possible combinations (for example, one healthy and seven faulty states). Because we have a model of the system and faults are rare in operation, we can take advantage of simulated data to train our algorithm. The model can be tuned to match operational data from our real pump using parameter estimation techniques in MATLAB and Simulink.

Our objective is to demonstrate the combined power of MATLAB and Amazon SageMaker using this fault classification example.

We start by training a classifier model on our desktop with MATLAB. First, we extract features from a subset of the full dataset using the Diagnostic Feature Designer app, and then run the model training locally with a MATLAB decision tree model. Once we’re satisfied with the parameter settings, we can generate a MATLAB function and send the job along with the dataset to SageMaker. This allows us to scale up the training process to accommodate much larger datasets. After training our model, we deploy it as a live endpoint which can be integrated into a downstream app or dashboard, such as a MATLAB Web App.

This example will summarize each step, providing a practical understanding of how to leverage MATLAB and Amazon SageMaker for machine learning tasks. The full code and description for the example is available in this repository.

Prerequisites

Working environment of MATLAB 2023a or later with MATLAB Compiler and the Statistics and Machine Learning Toolbox on Linux. Here is a quick guide on how to run MATLAB on AWS.
Docker set up in an Amazon Elastic Compute Cloud (Amazon EC2) instance where MATLAB is running. Either Ubuntu or Linux.
Installation of AWS Command-Line Interface (AWS CLI), AWS Configure, and Python3.
1. AWS CLI, should be already installed if you followed the installation guide from step 1.
2. Set up AWS Configure to interact with AWS resources.
3. Verify your python3 installation by running python -V or python --version command on your terminal. Install Python if necessary.

Copy this repo to a folder in your Linux machine by running:

git clone https://github.com/mathworks/Machine-Learning-with-MATLAB-and-Amazon-Sagemaker-Demo.git

Check the permission on the repo folder. If it does not have write permission, run the following shell command:
```
sudo chmod -R 777
```

Build the MATLAB training container and push it to the Amazon Elastic Container Registry (Amazon ECR).

Navigate to folder docker

Create an Amazon ECR repo using the AWS CLI (replace REGION with your preferred AWS region)

aws ecr create-repository  
--repository-name sagemaker-matlab-training  
--image-scanning-configuration scanOnPush=true  
--region

Run the following docker command:

docker build -t sagemaker-matlab-training-r2023a . 
 
docker tag sagemaker-matlab-training-r2023a ACCOUNT.dkr.ecr.REGION.amazonaws.com/sagemaker-matlab-training-r2023a:latest 
 
aws ecr get-login-password --region REGION | docker login --username AWS --password-stdin ACCOUNT.dkr.ecr.us-east-1.amazonaws.com 
 
docker push ACCOUNT.dkr.ecr. REGION.amazonaws.com/sagemaker-matlab-training-r2023a:latest

Open MATLAB and open the live script called PumpFaultClassificationMATLABSageMaker.mlx in folder examples/PumpFaultClassification. Make this folder your current working folder in MATLAB.

Part 1: Data preparation & feature extraction

The first step in any machine learning project is to prepare your data. MATLAB provides a wide range of tools for importing, cleaning, and extracting features from your data.:

load SensorData.mat

The SensorData.mat dataset contains 240 records. Each record has two timetables: flow and pressure. The target column is faultcode, which is a binary representation of three possible fault combinations in the pump. For those time series tables, each table has 1,201 rows which mimic 1.2 seconds of pump flow and pressure measurement with 0.001 seconds increment.

Next, the Diagnostic Feature Designer app allows you to extract, visualize, and rank a variety of features from the data. Here, you use Auto Features, which quickly extracts a broad set of time and frequency domain features from the dataset and ranks the top candidates for model training. You can then export a MATLAB function that will recompute the top 15 ranked features from new input data. Let’s call this function extractFeaturesTraining. This function can be configured to take in data all in one batch or as streaming data.

This function produces a table of features with associated fault codes, as shown in the following figure:

Part 2: Organize data for SageMaker

Next, you need to organize the data in a way that SageMaker can use for machine learning training. Typically, this involves splitting the data into training and validation sets and splitting the predictor data from the target response.

In this stage, other more complex data cleaning and filtering operations might be required. In this example, the data is already clean. Potentially, if the data processing is very complex and time consuming, SageMaker processing jobs can be used to run these jobs apart from SageMaker training so that they can be separated into two steps.

trainPredictors = trainingData(:,2:end);

trainResponse = trainingData(:,1);

Part 3: Train and test a machine learning model in MATLAB

Before moving to SageMaker, it’s a good idea to build and test the machine learning model locally in MATLAB. This allows you to quickly iterate and debug the model. You can set up and train a simple decision tree classifier locally.

classifierModel = fitctree(...
trainPredictors,...
trainResponse,...
OptimizeHyperparameters='auto');

The training job here should take less than a minute to finish and generates some graphs to indicate the training progress. After the training is finished, a MATLAB machine learning model is produced. The Classification Learner app can be used to try many types of classification models and tune them for best performance, then produce the needed code to replace the model training code above.

After checking the accuracy metrics for the locally-trained model, we can move the training into Amazon SageMaker.

Part 4: Train the model in Amazon SageMaker

After you’re satisfied with the model, you can train it at scale using SageMaker. To begin calling SageMaker SDKs, you need to initiate a SageMaker session.

session = sagemaker.Session();

Specify a SageMaker execution IAM role that training jobs and endpoint hosting will use.

role = "arn:aws:iam::ACCOUNT:role/service-role/AmazonSageMaker-ExecutionRole-XXXXXXXXXXXXXXX";

From MATLAB, save the training data as a .csv file to an Amazon Simple Storage Service (Amazon S3) bucket.

writetable(trainingData,'pump_training_data.csv');

trainingDataLocation = "s3:// "+session.DefaultBucket+ +"/cooling_system/input/pump_training";

copyfile("pump_training_data.csv", trainingDataLocation);

Create a SageMaker Estimator

Next, you need to create a SageMaker estimator and pass all the necessary parameters to it, such as a training docker image, training function, environment variables, training instance size, and so on. The training image URI should be the Amazon ECR URI you created in the prerequisite step with the format ACCOUNT.dkr.ecr.us-east-1.amazonaws.com/sagemaker-matlab-training-r2023a:latest. The training function should be provided at the bottom of the MATLAB live script.

trainingImage = "ACCOUNT.dkr.ecr.us-east-1.amazonaws.com/sagemaker-matlab-training-r2023a:latest"; 
 
est = sagemaker.MATLABEstimator(... 
    role, ... 
    Image=trainingImage, ... 
    Session=session, ... 
    BaseJobName="PumpDecisionTreeMatlab", ... 
    Environment = loadenv(fullfile(rootFolder, "training.env")), ... 
    TrainingFunction = @trainingFunction, ... 
    HyperParameters = struct(), ... % named args to train_decision_tree 
    InstanceType="ml.m5.large", ... 
    MaxRunTime=minutes(10), ...     
    MaxWaitTime=minutes(20), ... 
    UseSpotInstances=true);

Submit SageMaker training job

Calling the fit method from the estimator submits the training job into SageMaker.

est.fit(training=struct(Location=trainingDataLocation, ContentType="text/csv"))

You can also check the training job status from the SageMaker console:

After the training jobs finishes, selecting the job link takes you to the job description page where you can see the MATLAB model saved in the dedicated S3 bucket:

Part 5: Deploy the model as a real-time SageMaker endpoint

After training, you can deploy the model as a real-time SageMaker endpoint, which you can use to make predictions in real time. To do this, call the deploy method from the estimator. This is where you can set up the desired instance size for hosting depending on the workload.

predictor = est.deploy(role, "ClassificationTreeInferenceHandler", uint8(1), "ml.m5.large")

Behind the scenes, this step builds an inference docker image and pushes it to the Amazon ECR repository, nothing is required from the user to build the inference container. The image contains all the necessary information to serve the inference request, such as model location, MATLAB authentication information, and algorithms. After that, Amazon SageMaker creates a SageMaker endpoint configuration and finally deploys the real-time endpoint. The endpoint can be monitored in the SageMaker console and can be terminated anytime if it’s no longer used.

Part 6: Test the endpoint

Now that the endpoint is up and running, you can test the endpoint by giving it a few records to predict. Use the following code to select 10 records from the training data and send them to the endpoint for prediction. The prediction result is sent back from the endpoint and shown in the following image.

input = trainPredictors(10:19,:) 
prediction = predictor.predict(input)

Part 7: Dashboard integration

The SageMaker endpoint can be called by many native AWS services. It can also be used as a standard REST API if deployed together with an AWS Lambda function and API gateway, which can be integrated with any web applications. For this particular use case, you can use streaming ingestion with Amazon SageMaker Feature Store and Amazon Managed Streaming for Apache Kafka, MSK, to make machine learning-backed decisions in near real-time. Another possible integration is using a combination of Amazon Kinesis, SageMaker, and Apache Flink to build a managed, reliable, scalable, and highly available application that’s capable of real-time inferencing on a data stream.

After algorithms are deployed to a SageMaker endpoint, you might want to visualize them using a dashboard that displays streaming predictions in real time. In the custom MATLAB web app that follows, you can see pressure and flow data by pump, and live fault predictions from the deployed model.

In this dashboard includes a remaining useful life (RUL) model to predict the time to failure for each pump in question. To learn how to train RUL algorithms, see Predictive Maintenance Toolbox.

Clean Up

After you run this solution, make sure you clean up any unneeded AWS resources to avoid unexpected costs. You can clean up these resources using the SageMaker Python SDK or the AWS Management Console for the specific services used here (SageMaker, Amazon ECR, and Amazon S3). By deleting these resources, you prevent further charges for resources you’re no longer using.

Conclusion

We’ve demonstrated how you can bring MATLAB to SageMaker for a pump predictive maintenance use case with the entire machine learning lifecycle. SageMaker provides a fully managed environment for running machine learning workloads and deploying models with a great selection of compute instances serving various needs.

Disclaimer: The code used in this post is owned and maintained by MathWorks. Refer to the license terms in the GitHub repo. For any issues with the code or feature requests, please open a GitHub issue in the repository

References

About the Authors

Brad Duncan is the product manager for machine learning capabilities in the Statistics and Machine Learning Toolbox at MathWorks. He works with customers to apply AI in new areas of engineering such as incorporating virtual sensors in engineered systems, building explainable machine learning models, and standardizing AI workflows using MATLAB and Simulink. Before coming to MathWorks he led teams for 3D simulation and optimization of vehicle aerodynamics, user experience for 3D simulation, and product management for simulation software. Brad is also a guest lecturer at Tufts University in the area of vehicle aerodynamics.

Richard Alcock is the senior development manager for Cloud Platform Integrations at MathWorks. In this role, he is instrumental in seamlessly integrating MathWorks products into cloud and container platforms. He creates solutions that enable engineers and scientists to harness the full potential of MATLAB and Simulink in cloud-based environments. He was previously a software engineering at MathWorks, developing solutions to support parallel and distributed computing workflows.

Rachel Johnson is the product manager for predictive maintenance at MathWorks, and is responsible for overall product strategy and marketing. She was previously an application engineer directly supporting the aerospace industry on predictive maintenance projects. Prior to MathWorks, Rachel was an aerodynamics and propulsion simulation engineer for the US Navy. She also spent several years teaching math, physics, and engineering.

Shun Mao is a Senior AI/ML Partner Solutions Architect in the Emerging Technologies team at Amazon Web Services. He is passionate about working with enterprise customers and partners to design, deploy and scale AI/ML applications to derive their business values. Outside of work, he enjoys fishing, traveling and playing Ping-Pong.

Ramesh Jatiya is a Solutions Architect in the Independent Software Vendor (ISV) team at Amazon Web Services. He is passionate about working with ISV customers to design, deploy and scale their applications in cloud to derive their business values. He is also pursuing an MBA in Machine Learning and Business Analytics from Babson College, Boston. Outside of work, he enjoys running, playing tennis and cooking.

Text embedding and sentence similarity retrieval at scale with Amazon SageMaker JumpStart

Text vectors or embeddings are numerical vector representations of text that are generated by large language models (LLMs). After LLMs are fully pre-trained on a large dataset or fine-tuned from different tasks, including text completion, question answering, and translations, text embeddings capture semantic information of the input text. Different downstream applications are made possible by text embeddings, including similarity searching, information retrieval, recommendations and personalization, multilingual translations, and more.

Before intelligent applications could be built from embeddings, enterprises and organizations had to embed their existing documents, which can be expensive and technically complicated. Amazon SageMaker JumpStart is a machine learning (ML) hub that helps accelerate this journey. With SageMaker JumpStart, you can access pre-trained, cutting-edge text embedding models from various model providers, including Hugging Face, AI 21 Labs, Cohere, and Meta AI. You can seamlessly deploy these models into production with the SageMaker JumpStart user interface or SDK. In addition, none of your data is used to train the underlying models. Because all data is encrypted and doesn’t leave its own VPC, you can trust your data remains private and confidential.

In this post, we demonstrate how to use the SageMaker Python SDK for text embedding and sentence similarity. Sentence similarity involves assessing the likeness between two pieces of text after they are converted into embeddings by the LLM, which is a foundation step for applications like Retrieval Augmented Generation (RAG). We demonstrate how to do the following:

Run inference on a text embedding model deployed from SageMaker JumpStart
Find the nearest neighbors for an input sentence with your own dataset
Run the batch transform on large documents to minimize costs

All the code is available on GitHub.

Deploy a text embedding model via SageMaker JumpStart

To host a model on Amazon SageMaker, the first step is to set up and authenticate the use of AWS services. In Amazon SageMaker Studio, we use the execution role associated with the notebook instance. See the following code:

import sagemaker, boto3, json
from sagemaker.session import Session
sagemaker_session = Session()
aws_role = sagemaker_session.get_caller_identity_arn()
aws_region = boto3.Session().region_name
sess = sagemaker.Session()

On Hugging Face, the Massive Text Embedding Benchmark (MTEB) is provided as a leaderboard for diverse text embedding tasks. It currently provides 129 benchmarking datasets across 8 different tasks on 113 languages. The top text embedding models from the MTEB leaderboard are made available from SageMaker JumpStart, including bge, gte, e5, and more. In this post, we use huggingface-sentencesimilarity-bge-large-en as an example. We can use the SageMaker SDK to deploy this state-of-the-art text embedding model:

from sagemaker.jumpstart.model import JumpStartModel

model_id = "huggingface-sentencesimilarity-bge-large-en"
text_embedding_model = JumpStartModel(model_id=model_id)
predictor = text_embedding_model.deploy()

Text embedding model query

Let’s look at the text embedding model query in more detail.

Text to embedding

If you have already deployed a SageMaker endpoint before, the predictor can be restored as follows:

from sagemaker.predictor import Predictor
from sagemaker.deserializers import JSONDeserializer
from sagemaker.serializers import IdentitySerializer

predictor = Predictor(
    endpoint_name=<YOUR_ENDPOINT_NAME>,
    deserializer=JSONDeserializer(),
    serializer=IdentitySerializer(),
)
predictor.content_type = "application/x-text"

After the model is successfully deployed, you can query the endpoint with a batch of input texts within a JSON payload:

sentences = [
    # Pets
    "Your dog is so cute.",
    "How cute your dog is!",
    "You have such a cute dog!",
    # Cities
    "Sydney is the place where I work.",
    "I work in Sydney.",
    # Color
    "What colour do you like the most?",
    "What is your favourite colour?",
]

predictor.predict(json.dumps(sentences).encode('utf-8'))

The correlation of the embeddings of these sentences is plotted in the following figure.

As shown in the preceding figure, same subjects are highly correlated within themselves, including Pets, Cities, and Color; different subjects are much dissimilar. This indicates the embedding generated by the LLMs (in this case, bge) can represent the semantic information accurately.

For this post, we used the preceding sample and compared the latency across different sentence embedding models currently available from SageMaker JumpStart. Latency is the amount of time from the moment that a user sends a request until the time that the application indicates that the request has been completed. The numbers in the following table represent the average latency for a total of 100 requests using the same batch of input texts on the ml.g5.2xlarge and ml.c6i.xlarge instances.

Model	g5.2xlarge Average Latency (ms)	c6i.xlarge Average Latency(ms)	Language Support
all-MiniLM-L6-v2	19.5	27.9	English
BGE Base En	21.2	114	English
BGE Small En	28.3	45.6	English
BGE Large En	34.7	337	English
Multilingual E5 Base	22.1	118	Multilingual
Multilingual E5 Large	39.8	360	Multilingual
E5 Base	25.6	117	English
E5 Base V2	25.2	123	English
E5 Large	32.2	339	English
E5 Large V2	32.5	331	English
GTE Base	22.2	112	English
GTE Small	19.7	46	English
GTE Large	39.7	347	English

Get the nearest neighbors

The deployed model from SageMaker JumpStart can also facilitate the process of identifying the nearest neighbors to queries within the corpus. When provided with queries and a corpus, the model will produce the corpus_id, which denotes the position of the relevant corpus entry in the input corpus list, and a score indicating the degree of proximity to the query. It uses the following parameters:

corpus – Provides the list of inputs from which to find the nearest neighbor
queries – Provides the list of inputs for which to find the nearest neighbor from the corpus
top_k – The number of nearest neighbors to find from the corpus
mode – Set as nn_corpus for getting the nearest neighbors to input queries within the corpus

See the following code:

corpus = [
    "Amazon SageMaker is a fully managed service to prepare data and build, train, and deploy machine learning (ML) models for any use case with fully managed infrastructure, tools, and workflows.",
    "Amazon SageMaker stores code in ML storage volumes, secured by security groups and optionally encrypted at rest.",
    "Amazon SageMaker provides a full end-to-end workflow, but you can continue to use your existing tools with SageMaker. You can easily transfer the results of each stage in and out of SageMaker as your business requirements dictate."
]
queries = [
    "What is Amazon SageMaker?",
    "How does Amazon SageMaker secure my code?",
    "What if I have my own notebook, training, or hosting environment in my own business environment?"
]

payload_nearest_neighbor = {"corpus": corpus, "queries": queries, "top_k": 3, "mode": "nn_corpus"}
query_response = predictor.predict(payload_nearest_neighbor)

We get the following output:

[
    [
        {'corpus_id': 0, 'score': 0.8992230892181396},
        {'corpus_id': 2, 'score': 0.8664969205856323},
        {'corpus_id': 1, 'score': 0.8456423282623291}
    ],
    [
        {'corpus_id': 1, 'score': 0.8919335603713989},
        {'corpus_id': 0, 'score': 0.840064525604248},
        {'corpus_id': 2, 'score': 0.8145401477813721}
    ],
    [
        {'corpus_id': 2, 'score': 0.7712811231613159},
        {'corpus_id': 1, 'score': 0.7564010620117188},
        {'corpus_id': 0, 'score': 0.7525666356086731}
    ]
]

This result means the first query is most similar to the first corpus, the second is closer to the second corpus, and so on. This is a correct match in this example.

We also took the preceding sample and compared the latency across different sentence embedding models currently available from SageMaker JumpStart. The numbers in the following table represent the average latency for a total of 100 requests using the same payload on the ml.g5.2xlarge and ml.c6i.xlarge instances.

Model	g5.2xlarge Average Latency (ms)	c6i.xlarge Average Latency(ms)	Language Support
all-MiniLM-L6-v2	21.7	69.1	English
BGE Base En	29.1	372	English
BGE Small En	29.2	124	English
BGE Large En	47.2	1240	English
Multilingual E5 Base	30	389	Multilingual
Multilingual E5 Large	47.1	1380	Multilingual
E5 Base	30.4	373	English
E5 Base V2	31	409	English
E5 Large	45.9	1230	English
E5 Large V2	49.6	1220	English
GTE Base	30.3	375	English
GTE Small	28.5	129	English
GTE Large	46.6	1320	English

Get the nearest neighbors on a large dataset

When making requests to the SageMaker invoke endpoint, payloads are restricted to approximately 5 MB, and the request timeout is set to 1 minute. If corpus size exceeds these limits, you could use a SageMaker training job, which generates embeddings for your large dataset and persists them alongside the model inside the SageMaker endpoint. Therefore, they don’t have to be passed as part of the invocation payload. The process of finding the nearest neighbors is carried out using SentenceTransformer and its utility function. The nearest neighbor is based on the cosine similarity between the input sentence embedding and the precomputed sentence embeddings during the training job.

In the following example, we fetch and prepare the Amazon_SageMaker_FAQs dataset to use it in finding the nearest neighbor to an input question:

!aws s3 cp s3://jumpstart-cache-prod-us-west-2/training-datasets/Amazon_SageMaker_FAQs/Amazon_SageMaker_FAQs.csv Amazon_SageMaker_FAQs.csv

import pandas as pd

data = pd.read_csv("Amazon_SageMaker_FAQs.csv", names=["Questions", "Answers"])
data["id"] = data.index
data_req = data[["id", "Answers"]]
data_req.to_csv("data.csv", index=False, header=False)

output_bucket = sess.default_bucket()
output_prefix = "jumpstart-example-ss-training"

s3_output_location = f"s3://{output_bucket}/{output_prefix}/output"
training_dataset_s3_path = f"s3://{output_bucket}/{output_prefix}/data/data.csv"

!aws s3 cp data.csv {training_dataset_s3_path}

For algorithm-specific training hyperparameters, the SageMaker SDK can be fetched or overwritten:

from sagemaker import hyperparameters

hyperparameters = hyperparameters.retrieve_default(model_id=model_id, model_version = "*")
hyperparameters["batch_size"] = "64"
print(hyperparameters)
>>> {'max_seq_length': 'None', 'batch_size': '64', 'store_text_with_embedding': 'True'}

The SageMaker training consists of two steps: create the estimator object and launch the training job. The output is a model prepackaged with embeddings of your large dataset used as training data, which can be deployed for inference to get the nearest neighbor for any input sentence. See the following code:

from sagemaker.jumpstart.estimator import JumpStartEstimator

estimator = JumpStartEstimator(
    model_id=model_id,
    hyperparameters=hyperparameters,
    output_path=s3_output_location
)

estimator.fit(
    {"training": f"s3://{output_bucket}/{output_prefix}/data"}
)
predictor = estimator.deploy()

The query syntax to convert text into embeddings is the same as before. The code to get the nearest neighbor, however, can be simplified as follows:

payload_nearest_neighbour = {
    "queries": ["Is R supported with Amazon SageMaker?"],
    "top_k": 1,
    "mode": "nn_train_data",
}

response = predictor.predict(payload_nearest_neighbour)
>>> [[{'id': '9', 'score': 0.9240573048591614}]]

data["Answers"].iloc[int(response[0][0]["id"])]
>>> "Yes, R is supported with Amazon SageMaker. You can use R within SageMaker notebook instances, which include a preinstalled R kernel and the reticulate library. Reticulate offers an R interface for the Amazon SageMaker Python SDK, enabling ML practitioners to build, train, tune, and deploy R models."

We can also query the endpoint with questions in the Amazon_SageMaker_FAQs dataset and compare how many of the correct corresponding answers are returned. In the following example, we measure the top-3 accuracy, given there could be similar question answer pairs. This means if the correct answer is returned as one of the top-3 returns, it’s treated as a correct query.

total_correct_answers = 0

for i in range(len(data)):
    question = data["Questions"].iloc[i]
    payload_nearest_neighbor = {
        "queries": [question],
        "top_k": 3,
        "mode": "nn_train_data",
    }
    response = predictor.predict(payload_nearest_neighbor)
    response_ids = [int(res["id"]) for res in response[0]]

    if i in response_ids:
        total_correct_answers += 1
    else:
        pred_answer = [data["Answers"].iloc[response_id] for response_id in response_ids]

print(total_correct_answers*100/len(data))
>>>
81.16883116883118

Run a batch transform to get embeddings on large datasets

For enterprises and organizations with a large volume of historical documents that exceed the memory of a single endpoint instance, you can use SageMaker batch transform to save cost. When you start a batch transform job, SageMaker launches the necessary compute resources to process the data. During the job, SageMaker automatically provisions and manage the compute resources. When the batch transform job is complete, those resources are automatically cleaned up, which minimizes costs. By dividing a large dataset into smaller chunks and using more instances, you can scale out the compute for faster inference with similar cost, without managing infrastructure. The maximum payload for batch transform is 100 MB and timeout is 1 hour.

The input format for our batch transform job is a JSONL file, with entries as a line of JSON, which consists of id and text_inputs. See the following code:

test_data_file_name = "test.jsonl"
test_data = []

for i in range(len(data)):
    answer = data.loc[i, "Answers"]
    payload = {"id": i, "text_inputs": answer}
    test_data.append(payload)

with open(test_data_file_name, "w") as outfile:
    for entry in test_data:
        outfile.write(f"{json.dumps(entry)}n")

s3 = boto3.client("s3")
s3.upload_file(test_data_file_name, output_bucket, f"{output_prefix}/batch_input/test.jsonl")

When the data is ready in Amazon Simple Storage Service (Amazon S3), you can create the batch transform object from the SageMaker JumpStart model, which triggers the transform job:

s3_input_data_path = f"s3://{output_bucket}/{output_prefix}/batch_input/"
s3_output_data_path = f"s3://{output_bucket}/{output_prefix}/batch_output/"

batch_transformer = text_embedding_model.transformer(
    instance_count=1,
    instance_type="ml.p3.2xlarge",
    output_path=s3_output_data_path,
    assemble_with="Line",
    accept="text/csv",
    max_payload=1,
)

batch_transformer.transform(
    s3_input_data_path,
    content_type="application/jsonlines",
    split_type="Line"
)

batch_transformer.wait()

After the batch transform job is complete, you can download the result from Amazon S3:

s3 = boto3.client("s3")
s3.download_file(
    output_bucket, output_prefix + "/batch_output/" + "test.jsonl.out", "predict.jsonl"
)

with open("predict.jsonl", "r") as json_file:
    json_list = list(json_file)

Conclusion

SageMaker JumpStart provides a straightforward way to use state-of-the-art large language foundation models for text embedding and semantic search. With the user interface or just a few lines of code, you can deploy a highly accurate text embedding model and find semantic matches across large datasets, at scale and cost-efficiently. SageMaker JumpStart removes the barriers to implement semantic search by providing instant access to cutting-edge models like the ones benchmarked on the MTEB leaderboard. Businesses and developers can build intelligent search and recommendation systems faster.

This post demonstrated how to find semantically similar questions and answers, which could be applied to RAG use cases, recommendations and personalization, multilingual translations, and more. With continued advances in language models and the simplicity of SageMaker JumpStart, more organizations can infuse generative AI capabilities into their products. As the next step, you can try text-embedding models from SageMaker JumpStart on your own dataset to test and benchmark the results for your RAG use cases.

About the Authors

Dr. Baichuan Sun, currently serving as a Sr. AI/ML Solution Architect at AWS, focuses on generative AI and applies his knowledge in data science and machine learning to provide practical, cloud-based business solutions. With experience in management consulting and AI solution architecture, he addresses a range of complex challenges, including robotics computer vision, time series forecasting, and predictive maintenance, among others. His work is grounded in a solid background of project management, software R&D, and academic pursuits. Outside of work, Dr. Sun enjoys the balance of traveling and spending time with family and friends, reflecting a commitment to both his professional growth and personal well-being.

Hemant Singh is an Applied Scientist with experience in Amazon SageMaker JumpStart. He got his masters from Courant Institute of Mathematical Sciences and B.Tech from IIT Delhi. He has experience in working on a diverse range of machine learning problems within the domain of natural language processing, computer vision, and time series analysis.

Dr. Ashish Khetan is a Senior Applied Scientist with Amazon SageMaker built-in algorithms and helps develop machine learning algorithms. He got his PhD from University of Illinois Urbana-Champaign. He is an active researcher in machine learning and statistical inference, and has published many papers in NeurIPS, ICML, ICLR, JMLR, ACL, and EMNLP conferences.

Open sourcing Project Guideline: A platform for computer vision accessibility technology

Posted by Dave Hawkey, Software Engineer, Google Research

Two years ago we announced Project Guideline, a collaboration between Google Research and Guiding Eyes for the Blind that enabled people with visual impairments (e.g., blindness and low-vision) to walk, jog, and run independently. Using only a Google Pixel phone and headphones, Project Guideline leverages on-device machine learning (ML) to navigate users along outdoor paths marked with a painted line. The technology has been tested all over the world and even demonstrated during the opening ceremony at the Tokyo 2020 Paralympic Games.

Since the original announcement, we set out to improve Project Guideline by embedding new features, such as obstacle detection and advanced path planning, to safely and reliably navigate users through more complex scenarios (such as sharp turns and nearby pedestrians). The early version featured a simple frame-by-frame image segmentation that detected the position of the path line relative to the image frame. This was sufficient for orienting the user to the line, but provided limited information about the surrounding environment. Improving the navigation signals, such as alerts for obstacles and upcoming turns, required a much better understanding and mapping of the users’ environment. To solve these challenges, we built a platform that can be utilized for a variety of spatially-aware applications in the accessibility space and beyond.

Today, we announce the open source release of Project Guideline, making it available for anyone to use to improve upon and build new accessibility experiences. The release includes source code for the core platform, an Android application, pre-trained ML models, and a 3D simulation framework.

System design

The primary use-case is an Android application, however we wanted to be able to run, test, and debug the core logic in a variety of environments in a reproducible way. This led us to design and build the system using C++ for close integration with MediaPipe and other core libraries, while still being able to integrate with Android using the Android NDK.

Under the hood, Project Guideline uses ARCore to estimate the position and orientation of the user as they navigate the course. A segmentation model, built on the DeepLabV3+ framework, processes each camera frame to generate a binary mask of the guideline (see the previous blog post for more details). Points on the segmented guideline are then projected from image-space coordinates onto a world-space ground plane using the camera pose and lens parameters (intrinsics) provided by ARCore. Since each frame contributes a different view of the line, the world-space points are aggregated over multiple frames to build a virtual mapping of the real-world guideline. The system performs piecewise curve approximation of the guideline world-space coordinates to build a spatio-temporally consistent trajectory. This allows refinement of the estimated line as the user progresses along the path.

Project Guideline builds a 2D map of the guideline, aggregating detected points in each frame (red) to build a stateful representation (blue) as the runner progresses along the path.

A control system dynamically selects a target point on the line some distance ahead based on the user’s current position, velocity, and direction. An audio feedback signal is then given to the user to adjust their heading to coincide with the upcoming line segment. By using the runner’s velocity vector instead of camera orientation to compute the navigation signal, we eliminate noise caused by irregular camera movements common during running. We can even navigate the user back to the line while it’s out of camera view, for example if the user overshot a turn. This is possible because ARCore continues to track the pose of the camera, which can be compared to the stateful line map inferred from previous camera images.

Project Guideline also includes obstacle detection and avoidance features. An ML model is used to estimate depth from single images. To train this monocular depth model, we used SANPO, a large dataset of outdoor imagery from urban, park, and suburban environments that was curated in-house. The model is capable of detecting the depth of various obstacles, including people, vehicles, posts, and more. The depth maps are converted into 3D point clouds, similar to the line segmentation process, and used to detect the presence of obstacles along the user’s path and then alert the user through an audio signal.

Using a monocular depth ML model, Project Guideline constructs a 3D point cloud of the environment to detect and alert the user of potential obstacles along the path.

A low-latency audio system based on the AAudio API was implemented to provide the navigational sounds and cues to the user. Several sound packs are available in Project Guideline, including a spatial sound implementation using the Resonance Audio API. The sound packs were developed by a team of sound researchers and engineers at Google who designed and tested many different sound models. The sounds use a combination of panning, pitch, and spatialization to guide the user along the line. For example, a user veering to the right may hear a beeping sound in the left ear to indicate the line is to the left, with increasing frequency for a larger course correction. If the user veers further, a high-pitched warning sound may be heard to indicate the edge of the path is approaching. In addition, a clear “stop” audio cue is always available in the event the user veers too far from the line, an anomaly is detected, or the system fails to provide a navigational signal.

Project Guideline has been built specifically for Google Pixel phones with the Google Tensor chip. The Google Tensor chip enables the optimized ML models to run on-device with higher performance and lower power consumption. This is critical for providing real-time navigation instructions to the user with minimal delay. On a Pixel 8 there is a 28x latency improvement when running the depth model on the Tensor Processing Unit (TPU) instead of CPU, and 9x improvement compared to GPU.

Testing and simulation

Project Guideline includes a simulator that enables rapid testing and prototyping of the system in a virtual environment. Everything from the ML models to the audio feedback system runs natively within the simulator, giving the full Project Guideline experience without needing all the hardware and physical environment set up.

Screenshot of Project Guideline simulator.

Future direction

To launch the technology forward, WearWorks has become an early adopter and teamed up with Project Guideline to integrate their patented haptic navigation experience, utilizing haptic feedback in addition to sound to guide runners. WearWorks has been developing haptics for over 8 years, and previously empowered the first blind marathon runner to complete the NYC Marathon without sighted assistance. We hope that integrations like these will lead to new innovations and make the world a more accessible place.

The Project Guideline team is also working towards removing the painted line completely, using the latest advancements in mobile ML technology, such as the ARCore Scene Semantics API, which can identify sidewalks, buildings, and other objects in outdoor scenes. We invite the accessibility community to build upon and improve this technology while exploring new use cases in other fields.

Acknowledgements

Many people were involved in the development of Project Guideline and the technologies behind it. We’d like to thank Project Guideline team members: Dror Avalon, Phil Bayer, Ryan Burke, Lori Dooley, Song Chun Fan, Matt Hall, Amélie Jean-aimée, Dave Hawkey, Amit Pitaru, Alvin Shi, Mikhail Sirotenko, Sagar Waghmare, John Watkinson, Kimberly Wilber, Matthew Willson, Xuan Yang, Mark Zarich, Steven Clark, Jim Coursey, Josh Ellis, Tom Hoddes, Dick Lyon, Chris Mitchell, Satoru Arao, Yoojin Chung, Joe Fry, Kazuto Furuichi, Ikumi Kobayashi, Kathy Maruyama, Minh Nguyen, Alto Okamura, Yosuke Suzuki, and Bryan Tanaka. Thanks to ARCore contributors: Ryan DuToit, Abhishek Kar, and Eric Turner. Thanks to Alec Go, Jing Li, Liviu Panait, Stefano Pellegrini, Abdullah Rashwan, Lu Wang, Qifei Wang, and Fan Yang for providing ML platform support. We’d also like to thank Hartwig Adam, Tomas Izo, Rahul Sukthankar, Blaise Aguera y Arcas, and Huisheng Wang for their leadership support. Special thanks to our partners Guiding Eyes for the Blind and Achilles International.

Amazon Textract’s new Layout feature introduces efficiencies in general purpose and generative AI document processing tasks

Amazon Textract is a machine learning (ML) service that automatically extracts text, handwriting, and data from any document or image. AnalyzeDocument Layout is a new feature that allows customers to automatically extract layout elements such as paragraphs, titles, subtitles, headers, footers, and more from documents. Layout extends Amazon Textract’s word and line detection by automatically grouping the text into these layout elements and sequencing them according to human reading patterns. (That is, reading order from left to right and top to bottom.).

Building document processing and understanding solutions for financial and research reports, medical transcriptions, contracts, media articles, and so on requires extraction of information present in titles, headers, paragraphs, and so on. For example, when cataloging financial reports in a document database, extracting and storing the title as a catalog index enables easy retrieval. Prior to the introduction of this feature, customers had to construct these elements using post-processing code and the words and lines response from Amazon Textract.

The complexity of implementing this code is amplified with documents with multiple columns and complex layouts. With this announcement, extraction of commonly occurring layout elements from documents becomes easier and allows customers to build efficient document processing solutions faster with less code.

In Sept 2023, Amazon Textract launched the Layout feature that automatically extracts layout elements such as paragraphs, titles, lists, headers, and footers and orders the text and elements as a human would read. We also released the updated version of the open source postprocessing toolkit, purpose-built for Amazon Textract, known as Amazon Textract Textractor.

In this post, we discuss how customers can take advantage of this feature for document processing workloads. We also discuss a qualitative study demonstrating how Layout improves generative artificial intelligence (AI) task accuracy for both abstractive and extractive tasks for document processing workloads involving large language models (LLMs).

Layout elements

Central to the Layout feature of Amazon Textract are the new Layout elements. The LAYOUT feature of AnalyzeDocument API can now detect up to ten different layout elements in a document’s page. These layout elements are represented as block type in the response JSON and contain the confidence, geometry (that is, bounding box and polygon information), and Relationships, which is a list of IDs corresponding to the LINE block type.

Title – The main title of the document. Returned as LAYOUT_TITLE block type.
Header – Text located in the top margin of the document. Returned as LAYOUT_HEADER block type.
Footer – Text located in the bottom margin of the document. Returned as LAYOUT_FOOTER block type.
Section Title – The titles below the main title that represent sections in the document. Returned as LAYOUT_SECTION_HEADER block type.
Page Number – The page number of the documents. Returned as LAYOUT_PAGE_NUMBER block type.
List – Any information grouped together in list form. Returned as LAYOUT_LIST block type.
Figure – Indicates the location of an image in a document. Returned as LAYOUT_FIGURE block type.
Table – Indicates the location of a table in the document. Returned as LAYOUT_TABLE block type.
Key Value – Indicates the location of form key-value pairs in a document. Returned as LAYOUT_KEY_VALUE block type.
Text – Text that is present typically as a part of paragraphs in documents. It is a catch all for text that is not present in other elements. Returned as LAYOUT_TEXT block type.

Each layout element may contain one or more LINE relationships, and these lines constitute the actual textual content of the layout element (for example, LAYOUT_TEXT is typically a paragraph of text containing multiple LINEs). It is important to note that layout elements appear in the correct reading order in the API response as the reading order in the document, which makes it easy to construct the layout text from the API’s JSON response.

Use cases of layout-aware extraction

Following are some of the common use cases for the new AnalyzeDocument LAYOUT feature.

Extracting layout elements for search indexing and cataloging purposes. The contents of the LAYOUT_TITLE or LAYOUT_SECTION_HEADER, along with the reading order, can be used to appropriately tag or enrich metadata. This improves the context of a document in a document repository to improve search capabilities or organize documents.
Summarize the entire document or parts of a document by extracting text in proper reading order and using the layout elements.
Extracting specific parts of the document. For example, a document may contain a mix of images with text within it and other plaintext sections or paragraphs. You can now isolate the text sections using the LAYOUT_TEXT element.
Better performance and accurate answers for in-context document Q&A and entity extractions using an LLM.

There are other possible document automation use cases where Layout can be useful. However, in this post we explain how to extract layout elements in order to help understand how to use the feature for traditional documentation automation solutions. We discuss the benefits of using Layout for a document Q&A use case with LLMs using a common method known as Retrieval Augmented Generation (RAG), and for entity extraction use-case. For the outcomes of both of these use-cases, we present comparative scores that helps differentiate the benefits of layout aware text as opposed to just plaintext.

To highlight the benefits, we ran tests to compare how plaintext extracted using raster scans with DetectDocumentText and layout-aware linearized text extracted using AnalyzeDocument with LAYOUT feature impacts the outcome of in-context Q&A outputs by an LLM. For this test, we used Anthropic’s Claude Instant model with Amazon Bedrock. However, for complex document layouts, the generation of text in proper reading order and subsequently chunking them appropriately may be challenging, depending on how complex the document layout is. In the following sections, we discuss how to extract layout elements, and linearize the text to build an LLM-based application. Specifically, we discuss the comparative evaluation of the responses generated by the LLM for document Q&A application using raster scan–based plaintext and layout-aware linearized text.

Extracting layout elements from a page

The Amazon Textract Textractor toolkit can process a document through the AnalyzeDocument API with LAYOUT feature and subsequently exposes the detected layout elements through the page’s PAGE_LAYOUT property and its own subproperty TITLES, HEADERS, FOOTERS, TABLES, KEY_VALUES, PAGE_NUMBERS, LISTS, and FIGURES. Each element has its own visualization function, allowing you to see exactly what was detected. To get started, you start by installing Textractor using

pip install amazon-textract-textractor

As demonstrated in the following code snippet, the document news_article.pdf is processed with the AnalyzeDocument API with LAYOUT feature. The response results in a variable document that contains each of the detected Layout blocks from the properties.

from textractor import Textractor
from textractor.data.constants import TextractFeatures

extractor = Textractor(profile_name="default")

input_document = "./news_article.pdf"

document = extractor.analyze_document(
                   file_source=input_document,
                   features=[TextractFeatures.LAYOUT],
                   save_image=True)

document.pages[0].visualize()
document.pages[0].page_layout.titles.visualize()
document.pages[0].page_layout.headers.visualize()

document.pages[0].page_layout.section_headers.visualize()
document.pages[0].page_layout.footers.visualize()
document.pages[0].page_layout.tables.visualize()
document.pages[0].page_layout.key_values.visualize()
document.pages[0].page_layout.page_numbers.visualize()
document.pages[0].page_layout.lists.visualize()
document.pages[0].page_layout.figures.visualize()

See a more in-depth example in the official Textractor documentation.

Linearizing text from the layout response

To use the layout capabilities, Amazon Textract Textractor was extensively reworked for the 1.4 release to provide linearization with over 40 configuration options, allowing you to tailor the linearized text output to your downstream use case with little effort. The new linearizer supports all currently available AnalyzeDocument APIs, including forms and signatures, which lets you add selection items to the resulting text without making any code changes.

from textractor import Textractor
from textractor.data.constants import TextractFeatures
from textractor.data.text_linearization_config import TextLinearizationConfig

extractor = Textractor(profile_name="default")

config = TextLinearizationConfig(
                         hide_figure_layout=True,
                         title_prefix="# ",
                         section_header_prefix="## ")

document = extractor.analyze_document(
                                 file_source=input_document,
                                 features=[TextractFeatures.LAYOUT],
                                 save_image=True)

print(document.get_text(config=config))

See this example and more in the official Textractor documentation.

We have also added a layout pretty printer to the library that allows you to call a single function by passing in the layout API response in JSON format and get the linearized text (by page) in return.

python -m pip install -q amazon-textract-prettyprinter

You have the option to format the text in markdown format, exclude text from within figures in the document, and exclude page header, footer, and page number extractions from the linearized output. You can also store the linearized output in plaintext format in your local file system or in an Amazon S3 location by passing the save_txt_path parameter. The following code snippet demonstrates a sample usage –

from textractcaller.t_call import call_textract, Textract_Features
from textractprettyprinter.t_pretty_print import get_text_from_layout_json

textract_json = call_textract(input_document=input_document,
                      features=[Textract_Features.LAYOUT,
                      Textract_Features.TABLES])
layout = get_text_from_layout_json(textract_json=textract_json,
exclude_figure_text=True, # optional
exclude_page_header=True, # optional
exclude_page_footer=True, # optional
exclude_page_number=True, # optional
save_txt_path="s3://bucket/prefix") # optional

full_text = layout[1]
print(full_text)

Evaluating LLM performing metrics for abstractive and extractive tasks

Layout-aware text is found to improve the performance and quality of text generated by LLMs. In particular, we evaluate two types of LLM tasks—abstractive and extractive tasks.

Abstractive tasks refer to assignments that require the AI to generate new text that is not directly found in the source material. Some examples of abstractive task include summarization and question answering. For these tasks, we use the Recall-Oriented Understudy for Gisting Evaluation (ROUGE) metric to evaluate the performance of an LLM on question-answering tasks with respect to a set of ground truth data.

Extractive tasks refer to activities where the model identifies and extracts specific portions of the input text to construct a response. In these tasks, the model is focused on selecting relevant segments (such as sentences, phrases, or keywords) from the source material rather than generating new content. Some examples are named entity recognition (NER) and keyword extraction. For these tasks, we use Average Normalized Levenshtein Similarity (ANLS) on named entity recognition tasks based on the layout-linearized text extracted by Amazon Textract.

ROUGE score analysis on abstractive question-answering task

Our test is set up to perform in-context Q&A on a multicolumn document by extracting the text and then performing RAG to get answer responses from the LLM. We perform Q&A on a set of questions using the raster scan–based raw text and layout-aware linearized text. We then evaluate ROUGE metrics for each question by comparing the machine-generated response to the corresponding ground truth answer. In this case, the ground truth is the same set of questions answered by a human, which is considered as a control group.

In-context Q&A with RAG requires extracting text from the document, creating smaller chunks of the text, generating vector embeddings of the chunks, and subsequently storing them in a vector database. This is done so that the system can perform a relevance search with the question on the vector database to return chunks of text that are most relevant to the question being asked. These relevant chunks are then used to build the overall context and provided to the LLM so that it can accurately answer the question.

The following document, taken from the DocUNet: Document Image Unwarping via a Stacked U-Net dataset, is used for the test. This document is a multicolumn document with headers, titles, paragraphs, and images. We also defined a set of 20 questions answered by a human as a control group or ground truth. The same set of 20 questions was then used to generate responses from the LLM.

In the next step, we extract the text from this document using DetectDocumentText API and AnalyzeDocument API with LAYOUT feature. Since most LLMs have a limited token context window, we kept the chunk size small, about 250 characters with a chunk overlap of 50 characters, using LangChain’s RecursiveCharacterTextSplitter. This resulted in two separate sets of document chunks—one generated using the raw text and the other using the layout-aware linearized text. Both sets of chunks were stored in a vector database by generating vector embeddings using the Amazon Titan Embeddings G1 Text embedding model.

The following code snippet generates the raw text from the document.

import textractcaller as tc
from textractcaller.t_call import call_textract
from textractprettyprinter.t_pretty_print import get_lines_string

plain_textract_json = call_textract(input_document = input_document)
plain_text = get_lines_string(textract_json = plain_textract_json)

print(plain_text)

The output (trimmed for brevity) looks like the following. The text reading order is incorrect due to the lack of layout awareness of the API, and the extracted text spans the text columns.

PHOTONICS FOR A BETTER WORLD
UNESCO ENDORSES
INTERNATIONAL DAY OF LIGHT
First celebration in 2018 will become an annual
reminder of photonics-enabled technologies
T he executive board of the United Nations Educational,
in areas such as science, culture, education, sustainable development,
Scientific, and Cultural Organization (UNESCO) has endorsed
medicine, communications, and energy.
a proposal to establish an annual International Day of Light
The final report of IYL 2015 was delivered to UNESCO in Paris
(IDL) as an extension of the highly successful International Year of
during a special meeting in October 2016. At this event, SPIE member
Light and Light-based Technologies (IYL 2015).
...

The visual of the reading order for raw text extracted by DetectDocumentText can be seen in the following image.

The following code snippet generates the layout-linearized text from the document. You can use either method to generate the linearized text from the document using the latest version of Amazon Textract Textractor Python library.

import textractcaller as tc
from textractcaller.t_call import call_textract, Textract_Features
from textractprettyprinter.t_pretty_print import get_text_from_layout_json

layout_textract_json = call_textract(input_document = input_document,
                                     features = [Textract_Features.LAYOUT])
layout_text = get_text_from_layout_json(textract_json = layout_textract_json)[1]
print(layout_text)

The output (trimmed for brevity) looks like the following. The text reading order is preserved since we used the LAYOUT feature, and the text makes more sense.

PHOTONICS FOR A BETTER WORLD

UNESCO ENDORSES INTERNATIONAL DAY OF LIGHT

First celebration in 2018 will become an annual
reminder of photonics-enabled technologies

T he executive board of the United Nations Educational,
Scientific, and Cultural Organization (UNESCO) has endorsed
a proposal to establish an annual International Day of Light
(IDL) as an extension of the highly successful International Year of
Light and Light-based Technologies (IYL 2015).
The endorsement for a Day of Light has been
embraced by SPIE and other founding partners of
IYL 2015.
...

The visual of the reading order for raw text extracted by AnalyzeDocument with LAYOUT feature can be seen in the following image.

We performed chunking on both the extracted text separately, with a chunk size of 250 and an overlap of 50.

Next, we generate vector embeddings for the chunks and load them into a vector database in two separate collections. We used open source ChromaDB as our in-memory vector database and used topK value of 3 for the relevance search. This means that for every question, our relevance search query with ChromaDB returns 3 relevant chunks of text of size 250 each. These three chunks are then used to build a context for the LLM. We intentionally chose a smaller chunk size and smaller topK to build the context for the following specific reasons.

Shorten the overall size of our context since research suggests that LLMs tend to perform better with shorter context, even though the model supports longer context (through a larger token context window).
Smaller overall prompt size results in lower overall text generation model latency. The larger the overall prompt size (which includes the context), the longer it may take the model to generate a response.
Comply with the model’s limited token context window, as is the case with most LLMs.
Cost efficiency since using fewer tokens means lower cost per question for input and output tokens combined.

Note that Anthropic Claude Instant v1 does support a 100,000 token context window via Amazon Bedrock. We intentionally limited ourselves to a smaller chunk size since that also makes the test relevant to models with fewer parameters and overall shorter context windows.

We used ROUGE metrics to evaluate machine-generated text against a reference text (or ground truth), measuring various aspects like the overlap of n-grams, word sequences, and word pairs between the two texts. We chose three ROUGE metrics for evaluation.

ROUGE-1: Compares the overlap of unigrams (single words) between the generated text and a reference text.
ROUGE-2: Compares the overlap of bigrams (two-word sequences) between the generated text and a reference text.
ROUGE-L: Measures the longest common subsequence (LCS) between the generated text and a reference text, focusing on the longest sequence of words that appear in both texts, albeit not necessarily consecutively.

For our 20 sample questions relevant to the document, we ran Q&A with the raw text and linearized text, respectively, and then ran the ROUGE score analysis. We noticed almost 50 percent average improvement in precision overall. And there was significant improvement in F1-scores when layout-linearized text was compared to ground truth as opposed to when raw text was compared to ground truth.

This suggests that the model became better at generating correct responses with the help of linearized text and smaller chunking. This led to an increase in precision, and the balance between precision and recall shifted favorably towards precision, leading to an increase in the F1 score. The increased F1 score, which balances precision and recall, suggests an improvement. It’s essential to consider the practical implications of these metric changes. For instance, in a scenario where false positives are costly, the increase in precision is highly beneficial.

ANLS score analysis on extractive tasks over academic datasets

We measure the ANLS or the Average Normalized Levenshtein Similarity, which is an edit distance metric that was introduced by the paper Scene Text Visual Question Answering and aims to softly penalize minor OCR imperfections while considering the model’s reasoning abilities at the same time. This metric is a derivative version of traditional Levenshtein distance, which is a measure of the difference between two sequences (such as strings). It is defined as the minimum number of single-character edits (insertions, deletions, or substitutions) required to change one word into the other.

For our ANLS tests, we performed an NER task where the LLM was prompted to extract the exact value from the OCR-extracted text. The two academic datasets used for the tests are DocVQA and InfographicVQA. We used zero-shot prompting to attempt extraction of key entities. The prompt used for the LLMs is of the following structure.

template = """You are asked to answer a question using only the provided Document.

The answer to the question should be taken as-is from the document and as short as possible.

Document:n{document}

Question: {question}

Extract the answer from the document with as few words as possible."""

Accuracy improvements were observed in all document question-answering datasets tested with the open source FlanT5-XL model when using layout-aware linearized text, as opposed to raw text (raster scan), in response to zero-shot prompts. In the InfographicVQA dataset, using layout-aware linearized text enables the smaller 3B parameter FlanT5-XL model to match the performance of the larger FlanT5-XXL model (on raw text), which has nearly four times as many parameters (11B).

Dataset	ANLS*
	FlanT5-XL (3B)			FlanT5-XXL (11B)
	Not Layout-aware (Raster)	Layout-aware	Δ	Not Layout- aware (Raster)	Layout-aware	Δ
DocVQA	66.03%	68.46%	1.43%	70.71%	72.05%	1.34%
InfographicsVQA	29.47%	35.76%	6.29%	37.82%	45.61%	7.79%

* ANLS is measured on text extracted by Amazon Textract, not the provided document transcription

Conclusion

The launch of Layout marks a significant advancement in using Amazon Textract to build document automation solutions. As discussed in this post, Layout uses traditional and generative AI methods to improve efficiencies when building a wide variety of document automation solutions such as document search, contextual Q&A, summarization, key-entities extraction, and more. As we continue to embrace the power of AI in building document processing and understanding systems, these enhancements will no doubt pave the way for more streamlined workflows, higher productivity, and more insightful data analysis.

For more information on the Layout feature and how to take advantage of the feature for document automation solutions, refer to AnalyzeDocument, Layout analysis, and Text linearization for generative AI applications documentation.

About the Authors

Anjan Biswas is a Senior AI Services Solutions Architect who focuses on computer vision, NLP, and generative AI. Anjan is part of the worldwide AI services specialist team and works with customers to help them understand and develop solutions to business problems with AWS AI Services and generative AI.

Lalita Reddi is a Senior Technical Product Manager with the Amazon Textract team. She is focused on building machine learning–based services for AWS customers. In her spare time, Lalita likes to play board games and go on hikes.

Edouard Belval is a Research Engineer in the computer vision team at AWS. He is the main contributor behind the Amazon Textract Textractor library.

NVIDIA Collaborates With Genentech to Accelerate Drug Discovery Using Generative AI

Genentech, a member of the Roche Group, is pioneering the use of generative AI to discover and develop new therapeutics and deliver treatments to patients more efficiently.

A new collaboration between Genentech, the biotechnology pioneer, and NVIDIA aims to transform the discovery and development of new medicines by bringing together experts from each company to optimize and accelerate Genentech’s proprietary algorithms.

NVIDIA will work with Genentech to accelerate these models on NVIDIA DGX Cloud, which provides dedicated instances of AI supercomputing and software hosted by NVIDIA cloud service provider partners.

Genentech plans to use NVIDIA BioNeMo, which enables biotech companies to customize models at scale, and integrate BioNeMo cloud application programming interfaces directly into computational drug discovery workflows.

BioNeMo, now generally available as a training service, is a domain-specific platform that simplifies, accelerates and scales generative AI applications for computational drug discovery. It allows researchers to pretrain or fine-tune state-of-the-art models on DGX Cloud.

The collaboration will initially focus on optimizing Genentech’s drug discovery AI models in its “lab in a loop” framework. The goal: To allow its researchers to understand complex biomolecular patterns and relationships to truly disrupt drug development and improve the success rate of R&D, and to empower scientists to deliver multiplicative, rather than linear or additive, benefits for patients and the broader healthcare ecosystem.

“Our collaboration with NVIDIA builds on our long history of successfully inventing and deploying technology in ways that were not initially apparent to others,” said Aviv Regev, executive vice president and head of Genentech Research & Early Development (gRED). “We were the first biotech company to leverage molecular biology for drug discovery and development, which changed the world. We pioneered antibody therapeutics that became the paradigm of treatment. And now, we have brought AI, the lab and the clinic together to uncover otherwise inaccessible patterns in vast quantities of data, and to design experiments to test those patterns. Collaborating with NVIDIA, and introducing generative AI, has the power to turbocharge the discovery and design of therapeutics that will improve the lives of patients across the world.”

Streamlining Drug Discovery With Computation

Drug discovery and development is currently a lengthy, complicated and costly process. Drug targets for novel medicines are difficult to predict, as is successfully developing a molecule as a potential therapeutic. AI can play a transformational role because generative and other AI models can help scientists rapidly identify potential drug molecules and interactions by training on large-scale datasets.

For Genentech, using AI helps bridge the gap between lab experiments and computational algorithms.

The company’s R&D group, gRED, has already done significant work using AI — across multiple modalities — to discover and develop novel therapeutics while learning more about the building blocks of biology and diseases.

Teams from Genentech and NVIDIA will now work together to optimize Genentech’s custom-developed models to shorten this time-consuming process of drug discovery and development and lead to greater success.

Putting AI in a Loop

Genentech’s “lab in a loop” is an iterative framework for generating and exploring molecular designs with predicted properties. It aims to use experimental data to inform generative computational models and better optimize future molecular designs. NVIDIA will help Genentech optimize its framework by accelerating training and inference of Genentech’s drug discovery models.

Through this collaboration, NVIDIA AI experts will gain insights into AI-related challenges in drug discovery and development. NVIDIA plans to use these insights to improve its BioNeMo platform and others to further accommodate the requirements of models used by the biotech industry.

“AI can play a transformational role in accelerating drug discovery and development — as it has across many parts of healthcare and life sciences,” said Kimberly Powell, vice president of healthcare at NVIDIA. “Together, NVIDIA and Genentech are unlocking scientific innovation by developing and implementing AI models and algorithms that enable us to rapidly iterate and unearth insights.”

Subscribe to NVIDIA healthcare news.

3D Artist Cooks Up Stunningly Photorealistic Food Renders This Week ‘In the NVIDIA Studio’

Editor’s note: This post is part of our weekly In the NVIDIA Studio series, which celebrates featured artists, offers creative tips and tricks, and demonstrates how NVIDIA Studio technology improves creative workflows.

It’s the season of gratitude: that time of year to give thanks for the people and small moments that make life so special.

This week’s featured In the NVIDIA Studio artist, Ravissen Carpenen, is serving up a feast of mouthwateringly photorealistic 3D food renders to the dinner table.

His delectable time-lapse videos are featured on his YouTube channel, CG Realism — presented with a side of upbeat music and a pinch of style.

Carpenen was one of several contributors to the food-themed Studio Standout video contest, alongside Roger Roque (@rogerroqueid), Nicole Morena (@nicky.blender), Heloise Cart (@isoheell) and Kris Theroin (@kristheorin).

Finally, livestreamers using OBS Studio — a free, open-source software for video recording and livestreaming — can download the latest update with HDR10 capture support, WHIP and WebRTC output and more. Learn more details.

All About That Baste

Carpenen’s wife, a pastry chef, inspired his photorealistic, food-centered works.

“My aim, one day, is to be able to create ultra-realistic renders that will be used in film and movies,” he said.

His projects begin with online research and reference gathering, mainly on Pinterest and Behance, which he then compiles into mood boards using the stand-alone image tracking program PureRef.

Before any modeling takes place, Carpenen lights the scene — but without textures.

“This is to tell the story of the artwork, as light intends to give artwork an emotional flow, alongside as well as color and materials,” he said.

Carpenen initially sculpts his models in ZBrush using customizable brushes to shape, texture and paint his virtual clay in a real-time environment with instant feedback.

He then browses the Quixel Megascans library for models that can further add realism, such as garlic cloves and rosemary garnishes for his turkey project.

Rare-in for More

Carpenen uses Marmoset Toolbag’s ambient occlusion, curvature, normal and thickness features to bake the ZBrush meshes from high-poly to low-poly models as 32-bit textures.

The process saves memory space, minimizing lag time while allowing greater flexibility in the modeling stage.

Bake ZBrush meshes from high-poly to low-poly models as 32-bit textures in Marmoset Toolbag.

Carpenen’s GeForce RTX 3070 GPU-powered system with RTX acceleration instantly optimized his meshes. RTX-accelerated ray tracing and OptiX AI-powered denoising also enabled smoother viewport movement.

Baking a Berry Good Pie

Once the renders are ready, Carpenen imports them into Adobe Substance 3D Painter to apply custom colors and textures.

There, Carpenen uses RTX-accelerated light and ambient occlusion baking — though not in the oven — to optimize his assets, such as this berry pie, in mere seconds.

He also had the option to set up a live link connection between 3D Painter and NVIDIA Omniverse, a development platform for connecting and building Universal Scene Description (OpenUSD)-based tools and applications, via the USD Composer foundation app.

The connection would allow Carpenen’s texture work in Substance 3D Painter to directly translate to USD Composer — eliminating the need for numerous file imports, exports and reformatting.

Donut Hole in One

Carpenen uses Blender to bring his scenes together with advanced model sculpting, animations and further lighting refinement.

RTX GPUs allow smoother movement in the viewport thanks to Blender Cycles’ RTX-accelerated OptiX ray tracing.

Beautifully rendered donuts make us go nuts.

And for exporting final files, RTX-accelerated OptiX ray tracing in Blender Cycles delivers the fastest final-frame render.

Thanks to AI, This Work Is Toast

Carpenen uses Adobe Photoshop Lightroom to put the finishing touches on his food scenes.

GPU-accelerated image processing enables dramatically more responsive adjustments on his 4K-resolution display.

Carpenen had even more RTX-accelerated AI tools at his disposal in Lightroom. The “Raw Details” feature refines the fine color details of high-resolution RAW images. And “Super Resolution” uses AI to upscale images with higher quality than traditional methods.

According to Carpenen, putting in the work is key.

“It’s equivalent to practicing football — if you don’t get enough time daily to practice, you can’t hone skills,” he said. “It’s important to know how to tackle obstacles — and that knowledge can only be gained by experience.”

Digital 3D artist Ravissen Carpenen’s logo and signature work.

Check out Carpenen’s YouTube channel, CG Realism, and ArtStation to check out more food projects.

Follow NVIDIA Studio on Instagram, Twitter and Facebook. Access tutorials on the Studio YouTube channel and get updates directly in your inbox by subscribing to the Studio newsletter.

Use Amazon SageMaker Studio to build a RAG question answering solution with Llama 2, LangChain, and Pinecone for fast experimentation

Retrieval Augmented Generation (RAG) allows you to provide a large language model (LLM) with access to data from external knowledge sources such as repositories, databases, and APIs without the need to fine-tune it. When using generative AI for question answering, RAG enables LLMs to answer questions with the most relevant, up-to-date information and optionally cite their data sources for verification.

A typical RAG solution for knowledge retrieval from documents uses an embeddings model to convert the data from the data sources to embeddings and stores these embeddings in a vector database. When a user asks a question, it searches the vector database and retrieves documents that are most similar to the user’s query. Next, it combines the retrieved documents and the user’s query in an augmented prompt that is sent to the LLM for text generation. There are two models in this implementation: the embeddings model and the LLM that generates the final response.

In this post, we demonstrate how to use Amazon SageMaker Studio to build a RAG question answering solution.

Using notebooks for RAG-based question answering

Implementing RAG typically entails experimenting with various embedding models, vector databases, text generation models, and prompts, while also debugging your code until you achieve a functional prototype. Amazon SageMaker offers managed Jupyter notebooks equipped with GPU instances, enabling you to rapidly experiment during this initial phase without spinning up additional infrastructure. There are two options for using notebooks in SageMaker. The first option is fast launch notebooks available through SageMaker Studio. In SageMaker Studio, the integrated development environment (IDE) purpose-built for ML, you can launch notebooks that run on different instance types and with different configurations, collaborate with colleagues, and access additional purpose-built features for machine learning (ML). The second option is using a SageMaker notebook instance, which is a fully managed ML compute instance running the Jupyter Notebook app.

In this post, we present a RAG solution that augments the model’s knowledge with additional data from external knowledge sources to provide more accurate responses specific to a custom domain. We use a single SageMaker Studio notebook running on an ml.g5.2xlarge instance (1 A10G GPU) and Llama 2 7b chat hf, the fine-tuned version of Llama 2 7b, which is optimized for dialog use cases from Hugging Face Hub. We use two AWS Media & Entertainment Blog posts as the sample external data, which we convert into embeddings with the BAAI/bge-small-en-v1.5 embeddings. We store the embeddings in Pinecone, a vector-based database that offers high-performance search and similarity matching. We also discuss how to transition from experimenting in the notebook to deploying your models to SageMaker endpoints for real-time inference when you complete your prototyping. The same approach can be used with different models and vector databases.

Solution overview

The following diagram illustrates the solution architecture.

Implementing the solution consists of two high-level steps: developing the solution using SageMaker Studio notebooks, and deploying the models for inference.

Develop the solution using SageMaker Studio notebooks

Complete the following steps to start developing the solution:

Load the Llama-2 7b chat model from Hugging Face Hub in the notebook.
Create a PromptTemplate with LangChain and use it to create prompts for your use case.
For 1–2 example prompts, add relevant static text from external documents as prompt context and assess if the quality of the responses improves.
Assuming that the quality improves, implement the RAG question answering workflow:
- Gather the external documents that can help the model better answer the questions in your use case.
- Load the BGE embeddings model and use it to generate embeddings of these documents.
- Store these embeddings in a Pinecone index.
- When a user asks a question, perform a similarity search in Pinecone and add the content from the most similar documents to the prompt’s context.

Deploy the models to SageMaker for inference at scale

When you hit your performance goals, you can deploy the models to SageMaker to be used by generative AI applications:

Deploy the Llama-2 7b chat model to a SageMaker real-time endpoint.
Deploy the BAAI/bge-small-en-v1.5 embeddings model to a SageMaker real-time endpoint.
Use the deployed models in your question answering generative AI applications.

In the following sections, we walk you through the steps of implementing this solution in SageMaker Studio notebooks.

Prerequisites

To follow the steps in this post, you need to have an AWS account and an AWS Identity and Access Management (IAM) role with permissions to create and access the solution resources. If you are new to AWS, see Create a standalone AWS account.

To use SageMaker Studio notebooks in your AWS account, you need a SageMaker domain with a user profile that has permissions to launch the SageMaker Studio app. If you are new to SageMaker Studio, the Quick Studio setup is the fastest way to get started. With a single click, SageMaker provisions the SageMaker domain with default presets, including setting up the user profile, IAM role, IAM authentication, and public internet access. The notebook for this post assumes an ml.g5.2xlarge instance type. To review or increase your quota, open the AWS Service Quotas console, choose AWS Services in the navigation pane, choose Amazon SageMaker, and refer to the value for Studio KernelGateway apps running on ml.g5.2xlarge instances.

After confirming your quota limit, you need to complete the dependencies to use Llama 2 7b chat.

Llama 2 7b chat is available under the Llama 2 license. To access Llama 2 on Hugging Face, you need to complete a few steps first:

Create a Hugging Face account if you don’t have one already.
Complete the form “Request access to the next version of Llama” on the Meta website.
Request access to Llama 2 7b chat on Hugging Face.

After you have been granted access, you can create a new access token to access models. To create an access token, navigate to the Settings page on the Hugging Face website.

You need to have an account with Pinecone to use it as a vector database. Pinecone is available on AWS via the AWS Marketplace. The Pinecone website also offers the option to create a free account that comes with permissions to create a single index, which is sufficient for the purposes of this post. To retrieve your Pinecone keys, open the Pinecone console and choose API Keys.

Set up the notebook and environment

To follow the code in this post, open SageMaker Studio and clone the following GitHub repository. Next, open the notebook studio-local-gen-ai/rag/RAG-with-Llama-2-on-Studio.ipynb and choose the PyTorch 2.0.0 Python 3.10 GPU Optimized image, Python 3 kernel, and ml.g5.2xlarge as the instance type. If this is your first time using SageMaker Studio notebooks, refer to Create or Open an Amazon SageMaker Studio Notebook.

To set up the development environment, you need to install the necessary Python libraries, as demonstrated in the following code:

%%writefile requirements.txt
sagemaker>=2.175.0
transformers==4.33.0
accelerate==0.21.0
datasets==2.13.0
langchain==0.0.297
pypdf>=3.16.3
pinecone-client
sentence_transformers
safetensors>=0.3.3

!pip install -U -r requirements.txt

Load the pre-trained model and tokenizer

After you have imported the required libraries, you can load the Llama-2 7b chat model along with its corresponding tokenizers from Hugging Face. These loaded model artifacts are stored in the local directory within SageMaker Studio. This enables you to swiftly reload them into memory whenever you need to resume your work at a different time.

import torch

from transformers import (
	AutoTokenizer,
	LlamaTokenizer,
	LlamaForCausalLM,
	GenerationConfig,
	AutoModelForCausalLM
)
import transformers

tg_model_id = "meta-llama/Llama-2-7b-chat-hf" #the model id in Hugging Face
tg_model_path = f"./tg_model/{tg_model_id}" #the local directory where the model will be saved

tg_model = AutoModelForCausalLM.from_pretrained(tg_model_id, token=hf_access_token,do_sample=True, use_safetensors=True, device_map="auto", torch_dtype=torch.float16
tg_tokenizer = AutoTokenizer.from_pretrained(tg_model_id, token=hf_access_token)

tg_model.save_pretrained(save_directory=tg_model_path, from_pt=True)
tg_tokenizer.save_pretrained(save_directory=tg_model_path, from_pt=True)

Ask a question that requires up-to-date information

You can now start using the model and ask questions. Llama-2 chat models expect the prompt to adhere to the following format:

<s>[INST] <<SYS>>
system_prompt
<<SYS>>
{{ user_message }} [/INST]

You can use the PromptTemplate from LangChain to create a recipe based on the prompt format, so that you can easily create prompts going forward:

from langchain import PromptTemplate

template = """<s>[INST] <<SYS>>nYou are an assistant for question-answering tasks. You are helpful and friendly. Use the following pieces of retrieved context to answer the query. If you don't know the answer, you just say I don't know. Use three sentences maximum and keep the answer concise.
<<SYS>>n
{context}n
{question} [/INST]
"""
prompt_template = PromptTemplate( template=template, input_variables=['context','question'] )

Let’s ask the model a question that needs recent information from 2023. You can use LangChain and specifically the LLMChain type of chain and pass as parameters the LLM, the prompt template you created earlier, and the question:

question = "When can I visit the AWS M&E Customer Experience Center in New York City?"

tg_tokenizer.add_special_tokens( {"pad_token": "[PAD]"} )
tg_tokenizer.padding_side = "left"

tg_pipe = transformers.pipeline(task='text-generation', model=tg_model, tokenizer=tg_tokenizer, num_return_sequences=1, eos_token_id=tg_tokenizer.eos_token_id, pad_token_id=tg_tokenizer.eos_token_id, max_new_tokens=400, temperature=0.7)

from langchain.chains import LLMChain
from langchain.llms import HuggingFacePipeline

llm=HuggingFacePipeline(pipeline=tg_pipe, model_kwargs={'temperature':0.7})
llm_chain = LLMChain(llm=llm, prompt=prompt_template)
no_context_response = llm_chain.predict(context="", question=question)
print(no_context_response)

We get the following generated answer:

Thank you for reaching out! The AWS M&E Customer Experience Center in New York City is currently closed for visits due to the COVID-19 pandemic. However, you can check their official website or social media accounts for any updates on when the center will reopen. In the meantime, you can explore their virtual tours and resources available online.

Improve the answer by adding context to the prompt

The answer we generated is not entirely true. Let’s see if we can improve it by providing some context. You can add an extract from the post AWS announces new M&E Customer Experience Center in New York, which includes updates on the topic from 2023:

context = """Media and entertainment (M&E) customers continue to face challenges in creating more content, more quickly, and distributing it to more endpoints than ever before in their quest to delight viewers globally. Amazon Web Services (AWS), along with AWS Partners, have showcased the rapid evolution of M&E solutions for years at industry events like the National Association of Broadcasters (NAB) Show and the International Broadcast Convention (IBC). Until now, AWS for M&E technology demonstrations were accessible in this way just a few weeks out of the year. Customers are more engaged than ever before; they want to have higher quality conversations regarding user experience and media tooling. These conversations are best supported by having an interconnected solution architecture for reference. Scheduling a visit of the M&E Customer Experience Center will be available starting November 13th, please send an email to AWS-MediaEnt-CXC@amazon.com."""

Use the LLMChain again and pass the preceding text as context:

context_response = llm_chain.predict(context=context, question=question)
print(context_response)

The new response answers the question with up-to-date information:

You can visit the AWS M&E Customer Experience Center in New York City starting from November 13th. Please send an email to AWS-MediaEnt-CXC@amazon.com to schedule a visit.

We have confirmed that by adding the right context, the model’s performance is improved. Now you can focus your efforts on finding and adding the right context for the question asked. In other words, implement RAG.

Implement RAG question answering with BGE embeddings and Pinecone

At this juncture, you must decide on the sources of information to enhance the model’s knowledge. These sources could be internal webpages or documents within your organization, or publicly available data sources. For the purposes of this post and for the sake of simplicity, we have chosen two AWS Blog posts published in 2023:

These posts are already available as PDF documents in the data project directory in SageMaker Studio for quick access. To divide the documents into manageable chunks, you can employ the RecursiveCharacterTextSplitter method from LangChain:

from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.document_loaders import PyPDFDirectoryLoader

loader = PyPDFDirectoryLoader("./data/")

documents = loader.load()

text_splitter=RecursiveCharacterTextSplitter(
     chunk_size=1000,
     chunk_overlap=5
)
docs = text_splitter.split_documents(documents)

Next, use the BGE embeddings model bge-small-en created by the Beijing Academy of Artificial Intelligence (BAAI) that is available on Hugging Face to generate the embeddings of these chunks. Download and save the model in the local directory in Studio. We use fp32 so that it can run on the instance’s CPU.

em_model_name = "BAAI/bge-small-en"
em_model_path = f"./em-model"

from transformers import AutoModel
# Load model from HuggingFace Hub
em_model = AutoModel.from_pretrained(em_model_name,torch_dtype=torch.float32)
em_tokenizer = AutoTokenizer.from_pretrained(em_model_name,device="cuda")

# save model to disk
em_tokenizer.save_pretrained(save_directory=f"{em_model_path}/model",from_pt=True)
em_model.save_pretrained(save_directory=f"{em_model_path}/model",from_pt=True)
em_model.eval()

Use the following code to create an embedding_generator function, which takes the document chunks as input and generates the embeddings using the BGE model:

# Tokenize sentences
def tokenize_text(_input, device):
    return em_tokenizer(
        [_input], 
        padding=True, 
        truncation=True, 
        return_tensors='pt'
    ).to(device)

# Run embedding task as a function with model and text sentences as input
def embedding_generator(_input, normalize=True):
    # Compute token embeddings
    with torch.no_grad():
        embedded_output = em_model(
            **tokenize_text(
                _input, 
                em_model.device
            )
        )
        sentence_embeddings = embedded_output[0][:, 0]
        # normalize embeddings
        if normalize:
            sentence_embeddings = torch.nn.functional.normalize(
                sentence_embeddings, 
                p=2, 
                dim=1
            )
    
    return sentence_embeddings[0, :].tolist()
    
sample_sentence_embedding = embedding_generator(docs[0].page_content)
print(f"Embedding size of the document --->", len(sample_sentence_embedding))

In this post, we demonstrate a RAG workflow using Pinecone, a managed, cloud-native vector database that also offers an API for similarity search. You are free to rewrite the following code to use your preferred vector database.

We initialize a Pinecone python client and create a new vector search index using the embedding model’s output length. We use LangChain’s built-in Pinecone class to ingest the embeddings we created in the previous step. It needs three parameters: the documents to ingest, the embeddings generator function, and the name of the Pinecone index.

import pinecone
pinecone.init(
    api_key = os.environ["PINECONE_API_KEY"],
    environment = os.environ["PINECONE_ENV"]
)
#check if index already exists, if not we create it
index_name = "rag-index"
if index_name not in pinecone.list_indexes():
    pinecone.create_index(
        name=index_name,
        dimension=len(sample_sentence_embedding), ## 384 for bge-small-en 
        metric='cosine'
    )

#insert the embeddings
from langchain.vectorstores import Pinecone
vector_store = Pinecone.from_documents(
    docs,
    embedding_generator,
    index_name=index_name
)

With the Llama-2 7B chat model loaded into memory and the embeddings integrated into the Pinecone index, you can now combine these elements to enhance Llama 2’s responses for our question-answering use case. To achieve this, you can employ the LangChain RetrievalQA, which augments the initial prompt with the most similar documents from the vector store. By setting return_source_documents=True, you gain visibility into the exact documents used to generate the answer as part of the response, allowing you to verify the accuracy of the answer.

from langchain.chains import RetrievalQA
import textwrap

#helper method to improve the readability of the response
def print_response(llm_response):
    temp = [textwrap.fill(line, width=100) for line in llm_response['result'].split('n')]
    response = 'n'.join(temp)
    print(f"{llm_response['query']}n n{response}'n n Source Documents:")
    for source in llm_response["source_documents"]:
        print(source.metadata)

llm_qa_chain = RetrievalQA.from_chain_type(
    llm=llm, #the Llama-2 7b chat model
    chain_type='stuff',
    retriever=vector_store.as_retriever(search_kwargs={"k": 2}), # perform similarity search in Pinecone
    return_source_documents=True, #show the documents that were used to answer the question
    chain_type_kwargs={"prompt": prompt_template}
)
print_response(llm_qa_chain(question))

We get the following answer:

Q: When can I visit the AWS M&E Customer Experience Center in New York City?

A: I’m happy to help! According to the context, the AWS M&E Customer Experience Center in New York City will be available for visits starting on November 13th. You can send an email to AWS-MediaEnt-CXC@amazon.com to schedule a visit.’

Source Documents:

{‘page’: 4.0, ‘source’: ‘data/AWS announces new M&E Customer Experience Center in New York City _ AWS for M&E Blog.pdf’}

{‘page’: 2.0, ‘source’: ‘data/AWS announces new M&E Customer Experience Center in New York City _ AWS for M&E Blog.pdf’}

Let’s try a different question:

question2=" How many awards have AWS Media Services won in 2023?"
print_response(llm_qa_chain(question2))

We get the following answer:

Q: How many awards have AWS Media Services won in 2023?

A: According to the blog post, AWS Media Services have won five industry awards in 2023.’

Source Documents:

{‘page’: 0.0, ‘source’: ‘data/AWS Media Services awarded industry accolades _ AWS for M&E Blog.pdf’}

{‘page’: 1.0, ‘source’: ‘data/AWS Media Services awarded industry accolades _ AWS for M&E Blog.pdf’}

After you have established a sufficient level of confidence, you can deploy the models to SageMaker endpoints for real-time inference. These endpoints are fully managed and offer support for auto scaling.

SageMaker offers large model inference using Large Model Inference containers (LMIs), which we can utilize to deploy our models. These containers come equipped with pre-installed open source libraries like DeepSpeed, facilitating the implementation of performance-enhancing techniques such as tensor parallelism during inference. Additionally, they use DJLServing as a pre-built integrated model server. DJLServing is a high-performance, universal model-serving solution that offers support for dynamic batching and worker auto scaling, thereby increasing throughput.

In our approach, we use the SageMaker LMI with DJLServing and DeepSpeed Inference to deploy the Llama-2-chat 7b and BGE models to SageMaker endpoints running on ml.g5.2xlarge instances, enabling real-time inference. If you want to follow these steps yourself, refer to the accompanying notebook for detailed instructions.

You will require two ml.g5.2xlarge instances for deployment. To review or increase your quota, open the AWS Service Quotas console, choose AWS Services in the navigation pane, choose Amazon SageMaker, and refer to the value for ml.g5.2xlarge for endpoint usage.

The following steps outline the process of deploying custom models for the RAG workflow on a SageMaker endpoint:

Deploy the Llama-2 7b chat model to a SageMaker real-time endpoint running on an ml.g5.2xlarge instance for fast text generation.
Deploy the BAAI/bge-small-en-v1.5 embeddings model to a SageMaker real-time endpoint running on an ml.g5.2xlarge instance. Alternatively, you can deploy your own embedding model.
Ask a question and use the LangChain RetrievalQA to augment the prompt with the most similar documents from Pinecone, this time using the model deployed in the SageMaker real-time endpoint:

# convert your local LLM into SageMaker endpoint LLM
llm_sm_ep = SagemakerEndpoint(
    endpoint_name=tg_sm_model.endpoint_name, # <--- Your text-gen model endpoint name
    region_name=region,
    model_kwargs={
        "temperature": 0.05, 
        "max_new_tokens": 512
    },
    content_handler=content_handler,
)

llm_qa_smep_chain = RetrievalQA.from_chain_type(
    llm=llm_sm_ep,  # <--- This uses SageMaker Endpoint model for inference
    chain_type='stuff',
    retriever=vector_store.as_retriever(search_kwargs={"k": 2}),
    return_source_documents=True,
    chain_type_kwargs={"prompt": prompt_template}
)

Use LangChain to verify that the SageMaker endpoint with the embedding model works as expected so that it can be used for future document ingestion:

response_model = smr_client.invoke_endpoint(
    EndpointName=em_sm_model.endpoint_name, <--- Your embedding model endpoint name
    Body=json.dumps({
        "text": "This is a sample text"
    }),
    ContentType="application/json",
)

outputs = json.loads(response_model["Body"].read().decode("utf8"))['outputs']

Clean up

Complete the following steps to clean up your resources:

When you have finished working in your SageMaker Studio notebook, make sure you shut down the ml.g5.2xlarge instance to avoid any charges by choosing the stop icon. You can also set up lifecycle configuration scripts to automatically shut down resources when they are not used.

If you deployed the models to SageMaker endpoints, run the following code at the end of the notebook to delete the endpoints:

#delete your text generation endpoint
sm_client.delete_endpoint(
     EndpointName=tg_sm_model.endpoint_name
)
# delete your text embedding endpoint
sm_client.delete_endpoint(
      EndpointName=em_sm_model.endpoint_name
)

Finally, run the following line to delete the Pinecone index:

pinecone.delete_index(index_name)

Conclusion

SageMaker notebooks provide a straightforward way to kickstart your journey with Retrieval Augmented Generation. They allow you to experiment interactively with various models, configurations, and questions without spinning up additional infrastructure. In this post, we showed how to enhance the performance of Llama 2 7b chat in a question answering use case using LangChain, the BGE embeddings model, and Pinecone. To get started, launch SageMaker Studio and run the notebook available in the following GitHub repo. Please share your thoughts in the comments section!

About the authors

Anastasia Tzeveleka is a Machine Learning and AI Specialist Solutions Architect at AWS. She works with customers in EMEA and helps them architect machine learning solutions at scale using AWS services. She has worked on projects in different domains including Natural Language Processing (NLP), MLOps and Low Code No Code tools.

Pranav Murthy is an AI/ML Specialist Solutions Architect at AWS. He focuses on helping customers build, train, deploy and migrate machine learning (ML) workloads to SageMaker. He previously worked in the semiconductor industry developing large computer vision (CV) and natural language processing (NLP) models to improve semiconductor processes. In his free time, he enjoys playing chess and traveling.

Subscribe to the AI Podcast: Now Available on Amazon Music

Amazon SageMaker and NVIDIA: Delivering fast and accurate vector search and spellcheck capabilities

Optimizing training with NVIDIA Tensor Core GPUs

Spell Correction: BART model inferencing

Vector search: Query embedding generation sentence BERT model inferencing

Conclusion

About the authors

Prerequisites

Part 1: Data preparation & feature extraction

Part 2: Organize data for SageMaker

Part 3: Train and test a machine learning model in MATLAB

Part 4: Train the model in Amazon SageMaker

Create a SageMaker Estimator

Submit SageMaker training job

Part 5: Deploy the model as a real-time SageMaker endpoint

Part 6: Test the endpoint

Part 7: Dashboard integration

Clean Up

Conclusion

References

About the Authors

Deploy a text embedding model via SageMaker JumpStart

Text embedding model query

Text to embedding

Get the nearest neighbors

Get the nearest neighbors on a large dataset

Run a batch transform to get embeddings on large datasets

Conclusion

About the Authors

System design

Testing and simulation

Future direction

Acknowledgements

Layout elements

Use cases of layout-aware extraction

Extracting layout elements from a page

Linearizing text from the layout response

Evaluating LLM performing metrics for abstractive and extractive tasks

ROUGE score analysis on abstractive question-answering task

ANLS score analysis on extractive tasks over academic datasets

Conclusion

About the Authors

Streamlining Drug Discovery With Computation

Putting AI in a Loop

All About That Baste

Rare-in for More

Baking a Berry Good Pie

Donut Hole in One

Thanks to AI, This Work Is Toast

Using notebooks for RAG-based question answering

Solution overview

Develop the solution using SageMaker Studio notebooks

Deploy the models to SageMaker for inference at scale

Prerequisites

Set up the notebook and environment

Load the pre-trained model and tokenizer

Ask a question that requires up-to-date information

Improve the answer by adding context to the prompt

Implement RAG question answering with BGE embeddings and Pinecone

Clean up

Conclusion

About the authors

Navigation

GenAI Vision Endless Possibilities

"I'm interested in things that change the world or that affect the future and wondrous, new technology where you see it, and you're like, 'Wow, how did that even happen? How is that possible?'" -- Elon Musk

Copyright © 2019-2025 Vedere AI. All Rights Reserved.