Build a movie chatbot for TV/OTT platforms using Retrieval Augmented Generation in Amazon Bedrock

Build a movie chatbot for TV/OTT platforms using Retrieval Augmented Generation in Amazon Bedrock

Improving how users discover new content is critical to increase user engagement and satisfaction on media platforms. Keyword search alone has challenges capturing semantics and user intent, leading to results that lack relevant context; for example, finding date night or Christmas-themed movies. This can drive lower retention rates if users can’t reliably find the content they want. However, with large language models (LLMs), there is an opportunity to solve these semantic and user intent challenges. By combining embeddings that capture semantics with a technique called Retrieval Augmented Generation (RAG), you can generate more relevant answers based on retrieved context from your own data sources.

In this post, we show you how to securely create a movie chatbot by implementing RAG with your own data using Knowledge Bases for Amazon Bedrock. We use the IMDb and Box Office Mojo dataset to simulate a catalog for media and entertainment customers and showcase how you can build your own RAG solution in just a couple of steps.

Solution overview

The IMDb and Box Office Mojo Movies/TV/OTT licensable data package provides a wide range of entertainment metadata, including over 1.6 billion user ratings; credits for more than 13 million cast and crew members; 10 million movie, TV, and entertainment titles; and global box office reporting data from more than 60 countries. Many AWS media and entertainment customers license IMDb data through AWS Data Exchange to improve content discovery and increase customer engagement and retention.

Introduction to Knowledge Bases for Amazon Bedrock

To equip an LLM with up-to-date proprietary information, organizations use RAG, a technique that involves fetching data from company data sources and enriching the prompt with that data to deliver more relevant and accurate responses. Knowledge Bases for Amazon Bedrock enable a fully managed RAG capability that allows you to customize LLM responses with contextual and relevant company data. Knowledge Bases automate the end-to-end RAG workflow, including ingestion, retrieval, prompt augmentation, and citations, eliminating the need for you to write custom code to integrate data sources and manage queries. Knowledge Bases for Amazon Bedrock also enable multi-turn conversations so that the LLM can answer complex user queries with the correct answer.

We use the following services as part of this solution:

We walk through the following high-level steps:

  1. Preprocess the IMDb data to create documents from every movie record and upload the data into an Amazon Simple Storage Service (Amazon S3) bucket.
  2. Create a knowledge base.
  3. Sync your knowledge base with your data source.
  4. Use the knowledge base to answer semantic queries about the movie catalog.

Prerequisites

The IMDb data used in this post requires a commercial content license and paid subscription to IMDb and Box Office Mojo Movies/TV/OTT licensing package on AWS Data Exchange. To inquire about a license and access sample data, visit developer.imdb.com. To access the dataset, refer to Power recommendation and search using an IMDb knowledge graph – Part 1 and follow the Access the IMDb data section.

Preprocess the IMDb data

Before we create a knowledge base, we need to preprocess the IMDb dataset into text files and upload them to an S3 bucket. In this post, we simulate a customer catalog using the IMDb dataset. We take 10,000 popular movies from the IMDb dataset for the catalog and build the dataset.

Use the following notebook to create the dataset with additional info like actors, director, and producer names. We use the following code to create a single file for a movie with all the information stored in the file in an unstructured text that can be understood by LLMs:

def create_txt_files_imdb(row):
    full_text = ""
    full_text += f"{row['originalTitle']} ({row['titleId']}) was shot in year {int(row['year'])} with rating {row['rating']} and poster url {row['poster_url']}.nn"
    full_text += f"{row['originalTitle']} has genres {', '.join(row['genres'])}.nn"
    full_text += f"{row['originalTitle']} has actors {', '.join(row['Actors'])}.nn"   
    full_text += f"{row['originalTitle']} has directors {', '.join(row['Directors'])}.nn"
    full_text += f"{row['originalTitle']} has producers {', '.join(row['Producers'])}.nn"
    full_text += f"{row['originalTitle']} has keyword {', '.join([x.replace('-',' ') for x in row['keyword']])}.nn"
    full_text += f"{row['originalTitle']} has location {', '.join(row['location'])}.nn"
    full_text += f"{row['originalTitle']} has plot {row['plot']}.nn"
    with open(f"<path>/data/imdb_data/{row['titleId']}.txt","w") as f:
        f.write(full_text)
    return full_text

After you have the data in .txt format, you can upload the data into Amazon S3 using the following command:

aws s3 cp <path to local data> s3://<bucket-name>/<path>/ --recursive

Create the IMDb Knowledge Base

Complete the following steps to create your knowledge base:

  1. On the Amazon Bedrock console, choose Knowledge base in the navigation pane.
  2. Choose Create knowledge base.
  3. For Knowledge base name, enter imdb.
  4. For Knowledge base description, enter an optional description, such as Knowledge base for ingesting and storing imdb data.
  5. For IAM permissions, select Create and use a new service role, then enter a name for your new service role.
  6. Choose Next.

knowledge base details console page

  1. For Data source name, enter imdb-s3.
  2. For S3 URI, enter the S3 URI that you uploaded the data to.
  3. In the Advanced settings – optional section, for Chunking strategy, choose No chunking.
  4. Choose Next.

Knowledge bases enable you to chunk your documents in smaller segments to make it straightforward for you to process large documents. In our case, we have already chunked the data into a smaller size document (one per movie).

knowledge base console 2

  1. In the Vector database section, select Quick create a new vector store.

Amazon Bedrock will automatically create a fully managed OpenSearch Serverless vector search collection and configure the settings for embedding your data sources using the chosen Titan Embedding G1 – Text embedding model.

knowledge base vector store page

  1. Choose Next.

  1. Review your settings and choose Create knowledge base.

Sync your data with the knowledge base

Now that you have created your knowledge base, you can sync the knowledge base with your data.

  1. On the Amazon Bedrock console, navigate to your knowledge base.
  2. In the Data source section, choose Sync.

knowledge base sync

After the data source is synced, you’re ready to query the data.

Improve search using semantic results

Complete the following steps to test the solution and improve your search using semantic results:

  1. On the Amazon Bedrock console, navigate to your knowledge base.
  2. Select your knowledge base and choose Test knowledge base.
  3. Choose Select model, and choose Anthropic Claude v2.1.
  4. Choose Apply.

Now you are ready to query the data.

We can ask some semantic questions, such as “Recommend me some Christmas themed movies.”

query Recommend me some Christmas themed movies.

Knowledge base responses contain citations that you can explore for response correctness and factuality.

knowledge base citations

You can also drill down on any information that you need from these movies. In the following example, we ask “who directed nightmare before christmas?”

“who directed nightmare before christmas?”

You can also ask more specific questions related to the genres and ratings, such as “show me classic animated movies with ratings greater than 7?”

show me classic animated movies with ratings greater than 7?

Augment your knowledge base with agents

Agents for Amazon Bedrock help you automate complex tasks. Agents can break down the user query into smaller tasks and call custom APIs or knowledge bases to supplement information for running actions. With Agents for Amazon Bedrock, developers can integrate intelligent agents into their apps, accelerating the delivery of AI-powered applications and saving weeks of development time. With agents, you can augment your knowledge base by adding more functionality like recommendations from Amazon Personalize for user-specific recommendations or performing actions such as filtering movies based on user needs.

Conclusion

In this post, we showed how to build a conversational movie chatbot using Amazon Bedrock in a few steps to answer semantic search and conversational experiences based on your own data and the IMDb and Box Office Mojo Movies/TV/OTT licensed dataset. In the next post, we go through the process of adding more functionality to your solution using Agents for Amazon Bedrock. To get started with knowledge bases on Amazon Bedrock, refer to Knowledge Bases for Amazon Bedrock.


About the Authors

Gaurav Rele is a Senior Data Scientist at the Generative AI Innovation Center, where he works with AWS customers across different verticals to accelerate their use of generative AI and AWS Cloud services to solve their business challenges.

Divya Bhargavi is a Senior Applied Scientist Lead at the Generative AI Innovation Center, where she solves high-value business problems for AWS customers using generative AI methods. She works on image/video understanding & retrieval, knowledge graph augmented large language models and personalized advertising use cases.

Suren Gunturu is a Data Scientist working in the Generative AI Innovation Center, where he works with various AWS customers to solve high-value business problems. He specializes in building ML pipelines using Large Language Models, primarily through Amazon Bedrock and other AWS Cloud services.

Vidya Sagar Ravipati is a Science Manager at the Generative AI Innovation Center, where he leverages his vast experience in large-scale distributed systems and his passion for machine learning to help AWS customers across different industry verticals accelerate their AI and cloud adoption.

Read More

How Mendix is transforming customer experiences with generative AI and Amazon Bedrock

How Mendix is transforming customer experiences with generative AI and Amazon Bedrock

This post was co-written with Ricardo Perdigao, Solution Architecture Manager at Mendix, a Siemens business.

Mendix, a Siemens business, offers the low-code platform with the vision and execution designed for today’s complex software development challenges. Since 2005, we’ve helped thousands of organizations worldwide reimagine how they develop applications with our platform’s cutting-edge capabilities. Mendix allows enterprises to turn ideas into outcomes by delivering sophisticated applications to market faster than ever. Mendix has been named an industry leader by Gartner and Forrester.

In a world where customer-centric innovation is the key to staying ahead of the competition, businesses increasingly turn to advanced technology to revolutionize product offerings. For Mendix, integrating the cutting-edge generative AI capabilities of Amazon Bedrock has been a game changer in redefining our customer experience landscape.

In this post, we share how Mendix is revolutionizing customer experiences using Amazon Bedrock and generative AI.

Overview

In 2016, a new era of innovation began when Mendix announced a strategic collaboration with AWS. By taking advantage of the robust cloud infrastructure of AWS, Mendix was able to provide a secure, scalable, and reliable solution for enterprises across the globe. The primary objective of this partnership was to enable organizations to build, deploy, and manage enterprise-grade applications quickly and effectively.

This unique collaboration combines the agility of the Mendix low-code application development platform with the robustness of AWS Cloud services. This combination facilitates rapid application development and deployment, empowering businesses to respond swiftly to market changes and customer demands, achieving successful digital transformation at an accelerated pace.

However, the rapid evolution of technology brings new challenges. The rise of generative AI presents a unique opportunity to redefine how applications are developed and used. Integrating these advanced AI capabilities into a low-code environment is a complex task, requiring a solution that is innovative, scalable, secure, and easy to use, that nonetheless delivers significant value to users.

Recognizing the importance of staying competitive in this rapidly evolving technological landscape, Mendix is committed to enhancing its platform with seamless AI integrations. Mendix not only offers an AI-enabled low-code application development experience, but also seeks to equip our customers with user-friendly tools necessary for implementing generative AI in the solutions they build.

By combining the power of AI with the accessibility of low-code development, we are setting the stage for unprecedented innovation in the industry, empowering our customers to create AI-enhanced applications with ease and efficiency.

The solution: Integrating generative AI capabilities provided by Amazon Bedrock

As a first step on our journey, we embraced Amazon Bedrock to infuse our low-code platform with generative AI capabilities. Amazon Bedrock offers many ready-to-use AI models. These models excel at writing clear text, creating images from just descriptions, and translating between different languages.

The Mendix AWS Bedrock Connector in the Mendix Marketplace eliminates traditional complexities and streamlines the integration of generative AI within our ecosystem. This pivotal service, accompanied by a wealth of shared knowledge via samples and blog posts, has paved a frictionless path for integration.

Generative AI from Amazon Bedrock is reshaping the landscape for low-code developers, offering a remarkable foundation for rapidly creating sophisticated applications that were previously only possible with extensive development time. These AI models are designed to equip developers with the ability to develop applications that not only learn and adapt but to do so with a previously unseen depth of understanding. As we integrate these technologies into the Mendix environment, we’re ushering in a new era of democratizing generative AI.

Imagine a world where applications are no longer static but can dynamically generate personalized content, such as images and interfaces, by analyzing individual user data like browsing habits, geographic location, and the time of day. This capability of generative AI to tailor experiences to each user promises to boost engagement and satisfaction dramatically. In today’s data-driven enterprise world, where the sheer volume of information can be overwhelming, generative AI stands as a powerful technology, turning complex data into accessible insights, streamlining reports for executives, identifying trends, and predicting future outcomes faster than ever before.

Taking a step further, generative AI also offers actionable recommendations, not just interpretations. This feature is set to transform sectors like customer service, where it can advise service agents on the best subsequent actions or automate routine responses based on a profound comprehension of customer data. By bringing these innovations to the Mendix platform, we’re moving towards a future where applications anticipate and meet user needs proactively, turning every interaction into an opportunity for innovation and a delightful customer experience.

But our vision extends beyond the horizons of today’s achievements. With our sight set on redefining the fabric of low-code application development, we’ve immersed ourselves in pioneering research. Using the Mendix Extensibility framework, we explore generative AI’s potential to transform our industry from the ground up. Our investigative forays have already yielded exciting prospects—from conjuring comprehensive domain models with simple narrative inputs to automating data mapping and sample generation with AI’s interpretative prowess. We’re even on the cusp of enabling Mendix UI setups through intuitive AI-powered dialogs. These nascent concepts—demonstrated in the following video—are still being experimented on. But they promise to herald a new dawn for low-code innovation in the future.

In the following sections, we discuss our needs when selecting the right platform to build generative AI, and how Amazon Bedrock exceeded our expectations.

Ease of implementation

At Mendix, our diverse applications span text generation, summarization, virtual assistance, and multimodal image generation. Amazon Bedrock is our go-to platform for building and implementing generative AI, providing seamless access to cutting-edge generative AI foundational models. Its unified API simplifies experimentation and upgrades, facilitating quick and efficient integration of various models into our systems. This streamlined approach significantly reduces development and deployment efforts, enhancing overall efficiency and quickly innovating on disruptive technologies like generative AI.

Security is our top priority

At Mendix, as we innovate, generative AI is vital for innovation, yet security is critical. With Amazon Bedrock, we customized models using labeled data in Amazon Simple Storage Service (Amazon S3). We also added encryption using AWS Key Management Service (AWS KMS), Amazon Virtual Private Cloud (Amazon VPC), and AWS Private Link to establish private connectivity from a VPC to Amazon Bedrock hardened data, security, and access, fulfilling our stringent enterprise security needs.

As we learned with Amazon Bedrock, a private copy of the base model is launched as we fine-tune a foundation model. Our data is not shared with foundation models from Amazon and other leading AI startups or used to improve the base models. Amazon Titan and other leading AI foundation models from AI21 Labs, Anthropic, Cohere, Meta, and Stability AI do not have access to our data (prompts, completion results), ensuring responsible development and exceeding our security requirements.

Thanks to AWS and Amazon Bedrock, balancing the power of generative AI with robust security measures ensures responsible and safe development, fostering technological advancement with confidence. Amazon Bedrock exceeded our security requirements.

Continual updates and support

With Amazon Bedrock, users can benefit from continual updates and support for the available models. Enterprises like us have access to the latest advancements and improvements in AI technology, allowing us to remain ahead of the curve and adjust to evolving market trends and demands.

We can’t wait to further experiment with the new features of Amazon Bedrock announced at AWS re:Invent 2023, including Agents for Amazon Bedrock and Knowledge Bases for Amazon Bedrock, to accelerate our generative AI innovation on AWS.

Cost-effectiveness

The diverse model offerings in Amazon Bedrock empower you to select cost-effective large language models based on your use case. This flexibility enables companies to optimize AI investments, allocate resources efficiently, and achieve the best return on investment by choosing the most suitable model for each use case.

Conclusion

Mendix’s strategic integration of generative AI using Amazon Bedrock represents a groundbreaking move in democratizing tech innovation. Aligning with our legacy of empowering businesses with agility, Amazon Bedrock enhances our platform with advanced AI, ensuring every Mendix solution is not just current but future-engineered. We’re not merely adapting to low-code development evolution, but pioneering it.

To learn more about how your business can use the power of Amazon Bedrock and AWS to drive innovation and revolutionize customer experiences, see Why Using Amazon Bedrock in Your Mendix Apps is a Must.


About the Authors

Ricardo Perdigao has worked in the Enterprise Software industry for nearly 30 years, bringing a wealth of experience and innovation to his role as a Solutions Architect at Mendix, a high-productivity application development platform that empowers the creation and continuous improvement of mobile and web applications at scale. In 2023, Ricardo was honored with the prestigious Mendix Innovation Award for his groundbreaking research in Generative AI, further cementing his reputation as a forward-thinking leader in technology.

Suresh Patnam is the principal GTM Specialist AI/ML and Generative AI at AWS. He is passionate about helping businesses of all sizes transform into fast-moving digital organizations focusing on data, AI/ML, and generative AI. Suresh holds an MBA from Duke University-Fuqua School of Business. In his spare time, Suresh enjoys playing tennis and spending time with his family.

Read More

Train and host a computer vision model for tampering detection on Amazon SageMaker: Part 2

Train and host a computer vision model for tampering detection on Amazon SageMaker: Part 2

In the first part of this three-part series, we presented a solution that demonstrates how you can automate detecting document tampering and fraud at scale using AWS AI and machine learning (ML) services for a mortgage underwriting use case.

In this post, we present an approach to develop a deep learning-based computer vision model to detect and highlight forged images in mortgage underwriting. We provide guidance on building, training, and deploying deep learning networks on Amazon SageMaker.

In Part 3, we demonstrate how to implement the solution on Amazon Fraud Detector.

Solution overview

To meet the objective of detecting document tampering in mortgage underwriting, we employ a computer vision model hosted on SageMaker for our image forgery detection solution. This model receives a testing image as input and generates a likelihood prediction of forgery as its output. The network architecture is as depicted in the following diagram.

Tampering detection model architecture

Image forgery mainly involves four techniques: splicing, copy-move, removal, and enhancement. Depending on the characteristics of the forgery, different clues can be used as the foundation for detection and localization. These clues include JPEG compression artifacts, edge inconsistencies, noise patterns, color consistency, visual similarity, EXIF consistency, and camera model.

Given the expansive realm of image forgery detection, we use the Error Level Analysis (ELA) algorithm as an illustrative method for detecting forgeries. We selected the ELA technique for this post for the following reasons:

  • It is quicker to implement and can easily catch tampering of images.
  • It works by analyzing the compression levels of different parts of an image. This allows it to detect inconsistencies that may indicate tampering—for example, if one area was copied and pasted from another image that had been saved at a different compression level.
  • It is good at detecting more subtle or seamless tampering that may be hard to spot with the naked eye. Even small changes to an image can introduce detectable compression anomalies.
  • It doesn’t rely on having the original unmodified image for comparison. ELA can identify tampering signs within only the questioned image itself. Other techniques often require the unmodified original to compare against.
  • It is a lightweight technique that only relies on analyzing compression artifacts in the digital image data. It doesn’t depend on specialized hardware or forensics expertise. This makes ELA accessible as a first-pass analysis tool.
  • The output ELA image can clearly highlight differences in compression levels, making tampered areas visibly obvious. This allows even a non-expert to recognize signs of possible manipulation.
  • It works on many image types (such as JPEG, PNG, and GIF) and requires only the image itself to analyze. Other forensic techniques may be more restricted in formats or original image requirements.

However, in real-world scenarios where you may have a combination of input documents (JPEG, PNG, GIF, TIFF, PDF), we recommend employing ELA in conjunction with various other methods, such as detecting inconsistencies in edges, noise patterns, color uniformity, EXIF data consistency, camera model identification, and font uniformity. We aim to update the code for this post with additional forgery detection techniques.

ELA’s underlying premise assumes that the input images are in JPEG format, known for its lossy compression. Nevertheless, the method can still be effective even if the input images were originally in a lossless format (such as PNG, GIF, or BMP) and later converted to JPEG during the tampering process. When ELA is applied to original lossless formats, it typically indicates consistent image quality without any deterioration, rendering it challenging to pinpoint altered areas. In JPEG images, the expected norm is for the entire picture to exhibit similar compression levels. However, if a particular section within the image displays a markedly different error level, it often suggests a digital alteration has been made.

ELA highlights differences in the JPEG compression rate. Regions with uniform coloring will likely have a lower ELA result (for example, a darker color compared to high-contrast edges). The things to look for to identify tampering or modification include the following:

  • Similar edges should have similar brightness in the ELA result. All high-contrast edges should look similar to each other, and all low-contrast edges should look similar. With an original photo, low-contrast edges should be almost as bright as high-contrast edges.
  • Similar textures should have similar coloring under ELA. Areas with more surface detail, such as a close-up of a basketball, will likely have a higher ELA result than a smooth surface.
  • Regardless of the actual color of the surface, all flat surfaces should have about the same coloring under ELA.

JPEG images use a lossy compression system. Each re-encoding (resave) of the image adds more quality loss to the image. Specifically, the JPEG algorithm operates on an 8×8 pixel grid. Each 8×8 square is compressed independently. If the image is completely unmodified, then all 8×8 squares should have similar error potentials. If the image is unmodified and resaved, then every square should degrade at approximately the same rate.

ELA saves the image at a specified JPEG quality level. This resave introduces a known amount of errors across the entire image. The resaved image is then compared against the original image. If an image is modified, then every 8×8 square that was touched by the modification should be at a higher error potential than the rest of the image.

The results from ELA are directly dependent on the image quality. You may want to know if something was added, but if the picture is copied multiple times, then ELA may only permit detecting the resaves. Try to find the best quality version of the picture.

With training and practice, ELA can also learn to identify image scaling, quality, cropping, and resave transformations. For example, if a non-JPEG image contains visible grid lines (1 pixel wide in 8×8 squares), then it means the picture started as a JPEG and was converted to non-JPEG format (such as PNG). If some areas of the picture lack grid lines or the grid lines shift, then it denotes a splice or drawn portion in the non-JPEG image.

In the following sections, we demonstrate the steps for configuring, training, and deploying the computer vision model.

Prerequisites

To follow along with this post, complete the following prerequisites:

  1. Have an AWS account.
  2. Set up Amazon SageMaker Studio. You can swiftly initiate SageMaker Studio using default presets, facilitating a rapid launch. For more information, refer to Amazon SageMaker simplifies the Amazon SageMaker Studio setup for individual users.
  3. Open SageMaker Studio and launch a system terminal.
    Setup system terminal
  4. Run the following command in the terminal:
    git clone https://github.com/aws-samples/document-tampering-detection.git
  5. The total cost of running SageMaker Studio for one user and the configurations of the notebook environment is $7.314 USD per hour.

Set up the model training notebook

Complete the following steps to set up your training notebook:

  1. Open the tampering_detection_training.ipynb file from the document-tampering-detection directory.
  2. Set up the notebook environment with the image TensorFlow 2.6 Python 3.8 CPU or GPU Optimized.
    You may run into issue of insufficient availability or hit the quota limit for GPU instances within your AWS account when selecting GPU optimized instances. To increase the quota, visit the Service Quotas console and increase the service limit for the specific instance type you need. You can also use a CPU optimized notebook environment in such cases.
  3. For Kernel, choose Python3.
  4. For Instance type, choose ml.m5d.24xlarge or any other large instance.

We selected a larger instance type to reduce the training time of the model. With an ml.m5d.24xlarge notebook environment, the cost per hour is $7.258 USD per hour.

Run the training notebook

Run each cell in the notebook tampering_detection_training.ipynb in order. We discuss some cells in more detail in the following sections.

Prepare the dataset with a list of original and tampered images

Before you run the following cell in the notebook, prepare a dataset of original and tampered documents based on your specific business requirements. For this post, we use a sample dataset of tampered paystubs, and bank statements. The dataset is available within the images directory of the GitHub repository.

Prepare dataset

The notebook reads the original and tampered images from the images/training directory.

The dataset for training is created using a CSV file with two columns: the path to the image file and the label for the image (0 for original image and 1 for tampered image).

Label dataset

Process the dataset by generating the ELA results of each training image

In this step, we generate the ELA result (at 90% quality) of the input training image. The function convert_to_ela_image takes two parameters: path, which is the path to an image file, and quality, representing the quality parameter for JPEG compression. The function performs the following steps:

  1. Convert the image to RGB format and resave the image as a JPEG file with the specified quality under the name tempresaved.jpg.
  2. Compute the difference between the original image and the resaved JPEG image (ELA) to determine the maximum difference in pixel values between the original and resaved images.
  3. Calculate a scale factor based on the maximum difference to adjust the brightness of the ELA image.
  4. Enhance the brightness of the ELA image using the calculated scale factor.
  5. Resize the ELA result to 128x128x3, where 3 represents the number of channels to reduce the input size for training.
  6. Return the ELA image.

In lossy image formats such as JPEG, the initial saving process leads to considerable color loss. However, when the image is loaded and subsequently re-encoded in the same lossy format, there’s generally less added color degradation. ELA outcomes emphasize the image areas most susceptible to color degradation upon resaving. Generally, alterations appear prominently in regions exhibiting higher potential for degradation compared to the rest of the image.

Next, the images are processed into a NumPy array for training. We then split the input dataset randomly into training and test or validation data (80/20). You can ignore any warnings when running these cells.

Convert to ELA for training

Depending on the size of dataset, running these cells could take time to complete. For the sample dataset we provided in this repository, it could take 5–10 minutes.

Configure the CNN model

In this step, we construct a minimal version of the VGG network with small convolutional filters. The VGG-16 consists of 13 convolutional layers and three fully connected layers. The following screenshot illustrates the architecture of our Convolutional Neural Network (CNN) model.

Tensorflow model architecture

Note the following configurations:

  • Input – The model takes in an image input size of 128x128x3.
  • Convolutional layers – The convolutional layers use a minimal receptive field (3×3), the smallest possible size that still captures up/down and left/right. This is followed by a rectified linear unit (ReLU) activation function that reduces training time. This is a linear function that will output the input if positive; otherwise, the output is zero. The convolution stride is fixed at the default (1 pixel) to keep the spatial resolution preserved after convolution (stride is the number of pixel shifts over the input matrix).
  • Fully connected layers – The network has two fully connected layers. The first dense layer uses ReLU activation, and the second uses softmax to classify the image as original or tampered.

You can ignore any warnings when running these cells.

Save the model artifacts

Save the trained model with a unique file name—for example, based on the current date and time—into a directory named model.

Save the tensorflow model artifacts

The model is saved in Keras format with the extension .keras. We also save the model artifacts as a directory named 1 containing serialized signatures and the state needed to run them, including variable values and vocabularies to deploy to a SageMaker runtime (which we discuss later in this post).

Measure model performance

The following loss curve shows the progression of the model’s loss over training epochs (iterations).

model accuracy plot

The loss function measures how well the model’s predictions match the actual targets. Lower values indicate better alignment between predictions and true values. Decreasing loss over epochs signifies that the model is improving. The accuracy curve illustrates the model’s accuracy over training epochs. Accuracy is the ratio of correct predictions to the total number of predictions. Higher accuracy indicates a better-performing model. Typically, accuracy increases during training as the model learns patterns and improves its predictive ability. These will help you determine if the model is overfitting (performing well on training data but poorly on unseen data) or underfitting (not learning enough from the training data).

The following confusion matrix visually represents how well the model accurately distinguishes between the positive (forged image, represented as value 1) and negative (untampered image, represented as value 0) classes.

Confusion matrix plot

Following the model training, our next step involves deploying the computer vision model as an API. This API will be integrated into business applications as a component of the underwriting workflow. To achieve this, we use Amazon SageMaker Inference, a fully managed service. This service seamlessly integrates with MLOps tools, enabling scalable model deployment, cost-efficient inference, enhanced model management in production, and reduced operational complexity. In this post, we deploy the model as a real-time inference endpoint. However, it’s important to note that, depending on the workflow of your business applications, the model deployment can also be tailored as batch processing, asynchronous handling, or through a serverless deployment architecture.

Set up the model deployment notebook

Complete the following steps to set up your model deployment notebook:

  1. Open the tampering_detection_model_deploy.ipynb file from document-tampering-detection directory.
  2. Set up the notebook environment with the image Data Science 3.0.
  3. For Kernel, choose Python3.
  4. For Instance type, choose ml.t3.medium.

With an ml.t3.medium notebook environment, the cost per hour is $0.056 USD.

Create a custom inline policy for the SageMaker role to allow all Amazon S3 actions

The AWS Identity and Access Management (IAM) role for SageMaker will be in the format AmazonSageMaker- ExecutionRole-<random numbers>. Make sure you’re using the correct role. The role name can be found under the user details within the SageMaker domain configurations.

Update the IAM role to include an inline policy to allow all Amazon Simple Storage Service (Amazon S3) actions. This will be required to automate the creation and deletion of S3 buckets that will store the model artifacts. You can limit the access to specific S3 buckets. Note that we used a wildcard for the S3 bucket name in the IAM policy (tamperingdetection*).

Run the deployment notebook

Run each cell in the notebook tampering_detection_model_deploy.ipynb in order. We discuss some cells in more detail in the following sections.

Create an S3 bucket

Run the cell to create an S3 bucket. The bucket will be named tamperingdetection<current date time> and in the same AWS Region as your SageMaker Studio environment.

Create Amazon S3 bucket

Create the model artifact archive and upload to Amazon S3

Create a tar.gz file from the model artifacts. We have saved the model artifacts as a directory named 1, containing serialized signatures and the state needed to run them, including variable values and vocabularies to deploy to the SageMaker runtime. You can also include a custom inference file called inference.py within the code folder in the model artifact. The custom inference can be used for preprocessing and postprocessing of the input image.

Tar file with model artifacts

Upload model artifacts to Amazon S3

Create a SageMaker inference endpoint

The cell to create a SageMaker inference endpoint may take a few minutes to complete.

Create Amazon SageMaker Inference endpoint

Test the inference endpoint

The function check_image preprocesses an image as an ELA image, sends it to a SageMaker endpoint for inference, retrieves and processes the model’s predictions, and prints the results. The model takes a NumPy array of the input image as an ELA image to provide predictions. The predictions are output as 0, representing an untampered image, and 1, representing a forged image.

Test Amazon SageMaker Inference endpoint

Let’s invoke the model with an untampered image of a paystub and check the result.

Test an original image

The model outputs the classification as 0, representing an untampered image.

Now let’s invoke the model with a tampered image of a paystub and check the result.

Test a forged image

The model outputs the classification as 1, representing a forged image.

Limitations

Although ELA is an excellent tool for helping detect modifications, there are a number of limitations, such as the following:

  • A single pixel change or minor color adjustment may not generate a noticeable change in the ELA because JPEG operates on a grid.
  • ELA only identifies what regions have different compression levels. If a lower-quality image is spliced into a higher-quality picture, then the lower-quality image may appear as a darker region.
  • Scaling, recoloring, or adding noise to an image will modify the entire image, creating a higher error level potential.
  • If an image is resaved multiple times, then it may be entirely at a minimum error level, where more resaves do not alter the image. In this case, the ELA will return a black image and no modifications can be identified using this algorithm.
  • With Photoshop, the simple act of saving the picture can auto-sharpen textures and edges, creating a higher error level potential. This artifact doesn’t identify intentional modification; it identifies that an Adobe product was used. Technically, ELA appears as a modification because Adobe automatically performed a modification, but the modification was not necessarily intentional by the user.

We recommend using ELA alongside other techniques previously discussed in the blog in order to detect a greater range of image manipulation cases. ELA can also serve as an independent tool for visually examining image disparities, especially when training a CNN-based model becomes challenging.

Clean up

To remove the resources you created as part of this solution, complete the following steps:

  1. Run the notebook cells under the Cleanup section. This will delete the following:
    1. SageMaker inference endpoint – The inference endpoint name will be tamperingdetection-<datetime>.
    2. Objects within the S3 bucket and the S3 bucket itself – The bucket name will be tamperingdetection<datetime>.
  2. Shut down the SageMaker Studio notebook resources.

Conclusion

In this post, we presented an end-to-end solution for detecting document tampering and fraud using deep learning and SageMaker. We used ELA to preprocess images and identify discrepancies in compression levels that may indicate manipulation. Then we trained a CNN model on this processed dataset to classify images as original or tampered.

The model can achieve strong performance, with an accuracy over 95% with a dataset (forged and original) suited for your business requirements. This indicates that it can reliably detect forged documents like paystubs and bank statements. The trained model is deployed to a SageMaker endpoint to enable low-latency inference at scale. By integrating this solution into mortgage workflows, institutions can automatically flag suspicious documents for further fraud investigation.

Although powerful, ELA has some limitations in identifying certain types of more subtle manipulation. As next steps, the model could be enhanced by incorporating additional forensic techniques into training and using larger, more diverse datasets. Overall, this solution demonstrates how you can use deep learning and AWS services to build impactful solutions that boost efficiency, reduce risk, and prevent fraud.

In Part 3, we demonstrate how to implement the solution on Amazon Fraud Detector.


About the authors


Anup Ravindranath
is a Senior Solutions Architect at Amazon Web Services (AWS) based in Toronto, Canada working with Financial Services organizations. He helps customers to transform their businesses and innovate on cloud.

Vinnie Saini is a Senior Solutions Architect at Amazon Web Services (AWS) based in Toronto, Canada. She has been helping Financial Services customers transform on cloud, with AI and ML driven solutions laid on strong foundational pillars of Architectural Excellence.

Read More

Talk to your slide deck using multimodal foundation models hosted on Amazon Bedrock and Amazon SageMaker – Part 1

Talk to your slide deck using multimodal foundation models hosted on Amazon Bedrock and Amazon SageMaker – Part 1

With the advent of generative AI, today’s foundation models (FMs), such as the large language models (LLMs) Claude 2 and Llama 2, can perform a range of generative tasks such as question answering, summarization, and content creation on text data. However, real-world data exists in multiple modalities, such as text, images, video, and audio. Take a PowerPoint slide deck, for example. It could contain information in the form of text, or embedded in graphs, tables, and pictures.

In this post, we present a solution that uses multimodal FMs such as the Amazon Titan Multimodal Embeddings model and LLaVA 1.5 and AWS services including Amazon Bedrock and Amazon SageMaker to perform similar generative tasks on multimodal data.

Solution overview

The solution provides an implementation for answering questions using information contained in the text and visual elements of a slide deck. The design relies on the concept of Retrieval Augmented Generation (RAG). Traditionally, RAG has been associated with textual data that can be processed by LLMs. In this post, we extend RAG to include images as well. This provides a powerful search capability to extract contextually relevant content from visual elements like tables and graphs along with text.

There are different ways to design a RAG solution that includes images. We have presented one approach here and will follow up with an alternate approach in the second post of this three-part series.

This solution includes the following components:

  • Amazon Titan Multimodal Embeddings model – This FM is used to generate embeddings for the content in the slide deck used in this post. As a multimodal model, this Titan model can process text, images, or a combination as input and generate embeddings. The Titan Multimodal Embeddings model generates vectors (embeddings) of 1,024 dimensions and is accessed via Amazon Bedrock.
  • Large Language and Vision Assistant (LLaVA) – LLaVA is an open source multimodal model for visual and language understanding and is used to interpret the data in the slides, including visual elements such as graphs and tables. We use the 7-billion parameter version LLaVA 1.5-7b in this solution.
  • Amazon SageMaker – The LLaVA model is deployed on a SageMaker endpoint using SageMaker hosting services, and we use the resulting endpoint to run inferences against the LLaVA model. We also use SageMaker notebooks to orchestrate and demonstrate this solution end to end.
  • Amazon OpenSearch Serverless – OpenSearch Serverless is an on-demand serverless configuration for Amazon OpenSearch Service. We use OpenSearch Serverless as a vector database for storing embeddings generated by the Titan Multimodal Embeddings model. An index created in the OpenSearch Serverless collection serves as the vector store for our RAG solution.
  • Amazon OpenSearch Ingestion (OSI) – OSI is a fully managed, serverless data collector that delivers data to OpenSearch Service domains and OpenSearch Serverless collections. In this post, we use an OSI pipeline to deliver data to the OpenSearch Serverless vector store.

Solution architecture

The solution design consists of two parts: ingestion and user interaction. During ingestion, we process the input slide deck by converting each slide into an image, generate embeddings for these images, and then populate the vector data store. These steps are completed prior to the user interaction steps.

In the user interaction phase, a question from the user is converted into embeddings and a similarity search is run on the vector database to find a slide that could potentially contain answers to user question. We then provide this slide (in the form of an image file) to the LLaVA model and the user question as a prompt to generate an answer to the query. All the code for this post is available in the GitHub repo.

The following diagram illustrates the ingestion architecture.

Ingestion architecture diagram

The workflow steps are as follows:

  1. Slides are converted to image files (one per slide) in JPG format and passed to the Titan Multimodal Embeddings model to generate embeddings. In this post, we use the slide deck titled Train and deploy Stable Diffusion using AWS Trainium & AWS Inferentia from the AWS Summit in Toronto, June 2023, to demonstrate the solution. The sample deck has 31 slides, so we generate 31 sets of vector embeddings, each with 1,024 dimensions. We add additional metadata fields to these generated vector embeddings and create a JSON file. These additional metadata fields can be used to perform rich search queries using OpenSearch’s powerful search capabilities.
  2. The generated embeddings are put together in a single JSON file that is uploaded to Amazon Simple Storage Service (Amazon S3).
  3. Via Amazon S3 Event Notifications, an event is put in an Amazon Simple Queue Service (Amazon SQS) queue.
  4. This event in the SQS queue acts as a trigger to run the OSI pipeline, which in turn ingests the data (JSON file) as documents into the OpenSearch Serverless index. Note that the OpenSearch Serverless index is configured as the sink for this pipeline and is created as part of the OpenSearch Serverless collection.

The following diagram illustrates the user interaction architecture.

User interaction architecture

The workflow steps are as follows:

  1. A user submits a question related to the slide deck that has been ingested.
  2. The user input is converted into embeddings using the Titan Multimodal Embeddings model accessed via Amazon Bedrock. An OpenSearch vector search is performed using these embeddings. We perform a k-nearest neighbor (k=1) search to retrieve the most relevant embedding matching the user query. Setting k=1 retrieves the most relevant slide to the user question.
  3. The metadata of the response from OpenSearch Serverless contains a path to the image corresponding to the most relevant slide.
  4. A prompt is created by combining the user question and the image path and provided to LLaVA hosted on SageMaker. The LLaVA model is able to understand the user question and answer it by examining the data in the image.
  5. The result of this inference is returned to the user.

These steps are discussed in detail in the following sections. See the Results section for screenshots and details on the output.

Prerequisites

To implement the solution provided in this post, you should have an AWS account and familiarity with FMs, Amazon Bedrock, SageMaker, and OpenSearch Service.

This solution uses the Titan Multimodal Embeddings model. Ensure that this model is enabled for use in Amazon Bedrock. On the Amazon Bedrock console, choose Model access in the navigation pane. If Titan Multimodal Embeddings is enabled, the access status will state Access granted.

Manage model access in Amazon Bedrock

If the model is not available, enable access to the model by choosing Manage Model Access, selecting Titan Multimodal Embeddings G1, and choosing Request model access. The model is enabled for use immediately.

Request model access in Amazon Bedrock

Use an AWS CloudFormation template to create the solution stack

Use one of the following AWS CloudFormation templates (depending on your Region) to launch the solution resources.

AWS Region Link
us-east-1 Link to CloudFormation template for us-east-1
us-west-2 Link to CloudFormation template for us-west-2

After the stack is created successfully, navigate to the stack’s Outputs tab on the AWS CloudFormation console and note the value for MultimodalCollectionEndpoint, which we use in subsequent steps.

Resources created by the CloudFormation tempalate

The CloudFormation template creates the following resources:

  • IAM roles – The following AWS Identity and Access Management (IAM) roles are created. Update these roles to apply least-privilege permissions.
    • SMExecutionRole with Amazon S3, SageMaker, OpenSearch Service, and Bedrock full access.
    • OSPipelineExecutionRole with access to specific Amazon SQS and OSI actions.
  • SageMaker notebook – All the code for this post is run via this notebook.
  • OpenSearch Serverless collection – This is the vector database for storing and retrieving embeddings.
  • OSI pipeline – This is the pipeline for ingesting data into OpenSearch Serverless.
  • S3 bucket – All data for this post is stored in this bucket.
  • SQS queue – The events for triggering the OSI pipeline run are put in this queue.

The CloudFormation template configures the OSI pipeline with Amazon S3 and Amazon SQS processing as source and an OpenSearch Serverless index as sink. Any objects created in the specified S3 bucket and prefix (multimodal/osi-embeddings-json) will trigger SQS notifications, which are used by the OSI pipeline to ingest data into OpenSearch Serverless.

The CloudFormation template also creates network, encryption, and data access policies required for the OpenSearch Serverless collection. Update these policies to apply least-privilege permissions.

Note that the CloudFormation template name is referenced in SageMaker notebooks. If the default template name is changed, make sure you update the same in globals.py

Test the solution

After the prerequisite steps are complete and the CloudFormation stack has been created successfully, you’re now ready to test the solution:

  1. On the SageMaker console, choose Notebooks in the navigation pane.
  2. Select the MultimodalNotebookInstance notebook instance and choose Open JupyterLab.
    Notebook instance in Amazon SageMaker
  3. In File Browser, traverse to the notebooks folder to see the notebooks and supporting files.

The notebooks are numbered in the sequence in which they’re run. Instructions and comments in each notebook describe the actions performed by that notebook. We run these notebooks one by one.

  1. Choose 0_deploy_llava.ipynb to open it in JupyterLab.
  2. On the Run menu, choose Run All Cells to run the code in this notebook.

This notebook deploys the LLaVA-v1.5-7B model to a SageMaker endpoint. In this notebook, we download the LLaVA-v1.5-7B model from HuggingFace Hub, replace the inference.py script with llava_inference.py, and create a model.tar.gz file for this model. The model.tar.gz file is uploaded to Amazon S3 and used for deploying the model on SageMaker endpoint. The llava_inference.py script has additional code to allow reading an image file from Amazon S3 and running inference on it.

  1. Choose 1_data_prep.ipynb to open it in JupyterLab.
  2. On the Run menu, choose Run All Cells to run the code in this notebook.

This notebook downloads the slide deck, converts each slide into JPG file format, and uploads these to the S3 bucket used for this post.

  1. Choose 2_data_ingestion.ipynb to open it in JupyterLab.
  2. On the Run menu, choose Run All Cells to run the code in this notebook.

We do the following in this notebook:

  • We create an index in the OpenSearch Serverless collection. This index stores the embeddings data for the slide deck. See the following code:
session = boto3.Session()
credentials = session.get_credentials()
auth = AWSV4SignerAuth(credentials, g.AWS_REGION, g.OS_SERVICE)

os_client = OpenSearch(
  hosts = [{'host': host, 'port': 443}],
  http_auth = auth,
  use_ssl = True,
  verify_certs = True,
  connection_class = RequestsHttpConnection,
  pool_maxsize = 20
)

index_body = """
{
  "settings": {
      "index.knn": true
  },
  "mappings": {
      "properties": {
          "vector_embedding": {
              "type": "knn_vector",
              "dimension": 1024,
              "method": {
                  "name": "hnsw",
                  "engine": "nmslib",
                  "parameters": {}
              }
          },
          "image_path": {
              "type": "text"
          },
          "metadata": {
              "properties": {
                  "slide_filename": {
                      "type": "text"
                  },
                  "model_id": {
                      "type": "text"
                  },
                  "slide_description": {
                      "type": "text"
                  }
              }
          }
      }
  }
}
"""
index_body = json.loads(index_body)
try:
  response = os_client.indices.create(index_name, body=index_body)
  logger.info(f"response received for the create index -> {response}")
except Exception as e:
  logger.error(f"error in creating index={index_name}, exception={e}")
  • We use Titan Multimodal Embeddings model to convert the JPG images created in the previous notebook into vector embeddings. These embeddings and additional metadata (such as the S3 path of the image file) are stored in a JSON file and uploaded to Amazon S3. Note that a single JSON file is created, which contains documents for all the slides (images) converted into embeddings. The following code snippet shows how an image (in the form of a Base64 encoded string) is converted into embeddings:
def get_multimodal_embeddings(bedrock: botocore.client, image: str) -> np.ndarray:
    body = json.dumps(dict(inputImage=image))
    try:
        response = bedrock.invoke_model(
            body=body, modelId=g.FMC_MODEL_ID, accept=g.ACCEPT_ENCODING, contentType=g.CONTENT_ENCODING
        )
        response_body = json.loads(response.get("body").read())
        embeddings = np.array([response_body.get("embedding")]).astype(np.float32)
    except Exception as e:
        logger.error(f"exception while image(truncated)={image[:10]}, exception={e}")
        embeddings = None

    return embeddings
  • This action triggers the OpenSearch Ingestion pipeline, which processes the file and ingests it into the OpenSearch Serverless index. The following is a sample of the JSON file created. (A vector with four dimensions is shown in the example code. The Titan Multimodal Embeddings model generates 1,024 dimensions.)
[
  {
    "image_path": "s3://<your-bucket-name>/path/to/file1.json",
    "metadata": {
      "slide_filename": "mypowerpoint1.pptx",
      "model_id": "amazon.titan-embed-image-v1",
      "slide_description": "This is a test slide deck"
    },
    "vector_embedding": [
      657.6052386529958,
      0.8865137233123771,
      763.870264592026
    ]
  }
] 
  1. Choose 3_rag_inference.ipynb to open it in JupyterLab.
  2. On the Run menu, choose Run All Cells to run the code in this notebook.

This notebook implements the RAG solution: we convert the user question into embeddings, find a similar image (slide) from the vector database, and provide the retrieved image to LLaVA to generate an answer to the user question. We use the following prompt template:

prompt_template: str = """Pretend that you are a helpful assistant that answers questions about content in a slide deck. 
  Using only the information in the provided slide image answer the following question.
  If you do not find the answer in the image then say I did not find the answer to this question in the slide deck.

  {question}
"""

The following code snippet provides the RAG workflow:

# create prompt and convert to embeddings
question: str = "As per the AI/ML flywheel, what do the AWS AI/ML services provide?"
prompt = prompt_template.format(question=question)
text_embeddings = get_text_embeddings(bedrock, question)

# vector db search
vector_db_response: Dict = find_similar_data(text_embeddings)

# download image for local notebook display
s3_img_path = vector_db_response.get('hits', {}).get('hits')[0].get('_source').get('image_path')
logger.info(f"going to answer the question="{question}" using the image "{s3_img_path}"")

!aws s3 cp {s3_img_path} .
local_img_path = os.path.basename(s3_img_path)
display(Image(filename=local_img_path))

# Ask LLaVA
data = {
    "image" : s3_img_path,
    "question" : prompt,
    "temperature" : 0.1,
}
output = predictor.predict(data)
logger.info(f"Image={s3_img_path}nQuestion: {question}nAnswer: {output}nn")

Results

The following is a compilation of some user questions and responses generated by our implementation. The Question column captures the user question, and the Answer column is the textual response generated by LLaVA. Image is the k-nearest (k=1) slide match returned by the OpenSearch Serverless vector search. This slide is interpreted by LLaVA to provide the answer.

Multimodal RAG results

Question Answer Image
How does Inf2 compare in performance to comparable EC2 instances? I need numbers. According to the slide deck, Inf2 instances by AWS Inferentia2 offer up to 4x higher throughput and 10x lower latency compared to comparable EC2 instances. Image for Question 1 in the Results table
As per the AI/ML flywheel, what do the AWS AI/ML services provide? The AWS AI/ML services provide better $/perfer capabilities, new capabilities, and investment in innovation. Image for Question 2 in the Results table
Compared to GPT-2, how many more parameters does GPT-3 have? What is the numerical difference between the parameter size of GPT-2 and GPT-3? According to the slide, GPT-3 has 175 billion parameters, while GPT-2 has 1.5 billion parameters. The numerical difference between the parameter size of GPT-2 and GPT-3 is 173.5 billion. Image for Question 3 in the Results table
What are quarks in particle physics? I did not find the answer to this question in the slide deck. Image for Question 4 in the Results table

Feel free to extend this solution to your slide decks. Simply update the SLIDE_DECK variable in globals.py with a URL to your slide deck and run the ingestion steps detailed in the previous section.

Tip

You can use OpenSearch Dashboards to interact with the OpenSearch API to run quick tests on your index and ingested data. The following screenshot shows an OpenSearch dashboard GET example.

View of OpenSearch Dashboards

Clean up

To avoid incurring future charges, delete the resources you created. You can do this by deleting the stack via the CloudFormation console.

Deletion of the CloudFormation stack

Additionally, delete the SageMaker inference endpoint created for LLaVA inferencing. You can do this by uncommenting the cleanup step in 3_rag_inference.ipynb and running the cell, or by deleting the endpoint via the SageMaker console: choose Inference and Endpoints in the navigation pane, then select the endpoint and delete it.

Conclusion

Enterprises generate new content all the time, and slide decks are a common mechanism used to share and disseminate information internally with the organization and externally with customers or at conferences. Over time, rich information can remain buried and hidden in non-text modalities like graphs and tables in these slide decks. You can use this solution and the power of multimodal FMs such as the Titan Multimodal Embeddings model and LLaVA to discover new information or uncover new perspectives on content in slide decks.

We encourage you to learn more by exploring Amazon SageMaker JumpStart, Amazon Titan models, Amazon Bedrock, and OpenSearch Service, and building a solution using the sample implementation provided in this post.

Look out for two additional posts as part of this series. Part 2 covers another approach you could take to talk to your slide deck. This approach generates and stores LLaVA inferences and uses those stored inferences to respond to user queries. Part 3 compares the two approaches.


About the authors

Amit Arora is an AI and ML Specialist Architect at Amazon Web Services, helping enterprise customers use cloud-based machine learning services to rapidly scale their innovations. He is also an adjunct lecturer in the MS data science and analytics program at Georgetown University in Washington D.C.

Manju Prasad is a Senior Solutions Architect within Strategic Accounts at Amazon Web Services. She focuses on providing technical guidance in a variety of domains, including AI/ML to a marquee M&E customer. Prior to joining AWS, she designed and built solutions for companies in the financial services sector and also for a startup.

Archana Inapudi is a Senior Solutions Architect at AWS supporting strategic customers. She has over a decade of experience helping customers design and build data analytics and database solutions. She is passionate about using technology to provide value to customers and achieve business outcomes.

Antara Raisa is an AI and ML Solutions Architect at Amazon Web Services supporting strategic customers based out of Dallas, Texas. She also has previous experience working with large enterprise partners at AWS, where she worked as a Partner Success Solutions Architect for digital native customers.

Read More

Benchmark and optimize endpoint deployment in Amazon SageMaker JumpStart 

Benchmark and optimize endpoint deployment in Amazon SageMaker JumpStart 

When deploying a large language model (LLM), machine learning (ML) practitioners typically care about two measurements for model serving performance: latency, defined by the time it takes to generate a single token, and throughput, defined by the number of tokens generated per second. Although a single request to the deployed endpoint would exhibit a throughput approximately equal to the inverse of model latency, this is not necessarily the case when multiple concurrent requests are simultaneously sent to the endpoint. Due to model serving techniques, such as client-side continuous batching of concurrent requests, latency and throughput have a complex relationship that varies significantly based on model architecture, serving configurations, instance type hardware, number of concurrent requests, and variations in input payloads such as number of input tokens and output tokens.

This post explores these relationships via a comprehensive benchmarking of LLMs available in Amazon SageMaker JumpStart, including Llama 2, Falcon, and Mistral variants. With SageMaker JumpStart, ML practitioners can choose from a broad selection of publicly available foundation models to deploy to dedicated Amazon SageMaker instances within a network-isolated environment. We provide theoretical principles on how accelerator specifications impact LLM benchmarking. We also demonstrate the impact of deploying multiple instances behind a single endpoint. Finally, we provide practical recommendations for tailoring the SageMaker JumpStart deployment process to align with your requirements on latency, throughput, cost, and constraints on available instance types. All the benchmarking results as well as recommendations are based on a versatile notebook that you can adapt to your use case.

Deployed endpoint benchmarking

The following figure shows the lowest latencies (left) and highest throughput (right) values for deployment configurations across a variety of model types and instance types. Importantly, each of these model deployments use default configurations as provided by SageMaker JumpStart given the desired model ID and instance type for deployment.

These latency and throughput values correspond to payloads with 256 input tokens and 256 output tokens. The lowest latency configuration limits model serving to a single concurrent request, and the highest throughput configuration maximizes the possible number of concurrent requests. As we can see in our benchmarking, increasing concurrent requests monotonically increases throughput with diminishing improvement for large concurrent requests. Additionally, models are fully sharded on the supported instance. For example, because the ml.g5.48xlarge instance has 8 GPUs, all SageMaker JumpStart models using this instance are sharded using tensor parallelism on all eight available accelerators.

We can note a few takeaways from this figure. First, not all models are supported on all instances; some smaller models, such as Falcon 7B, don’t support model sharding, whereas larger models have higher compute resource requirements. Second, as sharding increases, performance typically improves, but may not necessarily improve for small modelsThis is because small models such as 7B and 13B incur a substantial communication overhead when sharded across too many accelerators. We discuss this in more depth later. Finally, ml.p4d.24xlarge instances tend to have significantly better throughput due to memory bandwidth improvements of A100 over A10G GPUs. As we discuss later, the decision to use a particular instance type depends on your deployment requirements, including latency, throughput, and cost constraints.

How can you obtain these lowest latency and highest throughput configuration values? Let’s start by plotting latency vs. throughput for a Llama 2 7B endpoint on an ml.g5.12xlarge instance for a payload with 256 input tokens and 256 output tokens, as seen in the following curve. A similar curve exists for every deployed LLM endpoint.

As concurrency increases, throughput and latency also monotonically increase. Therefore, the lowest latency point occurs at a concurrent request value of 1, and you can cost-effectively increase system throughput by increasing concurrent requests. There exists a distinct “knee” in this curve, where it’s obvious that the throughput gains associated with additional concurrency don’t outweigh the associated increase in latency. The exact location of this knee is use case-specific; some practitioners may define the knee at the point where a pre-specified latency requirement is exceeded (for example, 100 ms/token), whereas others may use load test benchmarks and queueing theory methods like the half-latency rule, and others may use theoretical accelerator specifications.

We also note that the maximum number of concurrent requests is limited. In the preceding figure, the line trace ends with 192 concurrent requests. The source of this limitation is the SageMaker invocation timeout limit, where SageMaker endpoints timeout an invocation response after 60 seconds. This setting is account-specific and not configurable for an individual endpoint. For LLMs, generating a large number of output tokens can take seconds or even minutes. Therefore, large input or output payloads can cause the invocation requests to fail. Furthermore, if the number of concurrent requests is very large, then many requests will experience large queue times, driving this 60-second timeout limit. For the purpose of this study, we use the timeout limit to define the maximum throughput possible for a model deployment. Importantly, although a SageMaker endpoint may handle a large number of concurrent requests without observing an invocation response timeout, you may want to define maximum concurrent requests with respect to the knee in the latency-throughput curve. This is likely the point at which you start to consider horizontal scaling, where a single endpoint provisions multiple instances with model replicas and load balances incoming requests between the replicas, to support more concurrent requests.

Taking this one step further, the following table contains benchmarking results for different configurations for the Llama 2 7B model, including different number of input and output tokens, instance types, and number of concurrent requests. Note that the preceding figure only plots a single row of this table.

. Throughput (tokens/sec) Latency (ms/token)
Concurrent Requests 1 2 4 8 16 32 64 128 256 512 1 2 4 8 16 32 64 128 256 512
Number of total tokens: 512,    Number of output tokens: 256
ml.g5.2xlarge 30 54 115 208 343 475 486 33 33 35 39 48 97 159
ml.g5.12xlarge 59 117 223 406 616 866 1098 1214 17 17 18 20 27 38 60 112
ml.g5.48xlarge 56 108 202 366 522 660 707 804 18 18 19 22 32 50 101 171
ml.p4d.24xlarge 49 85 178 353 654 1079 1544 2312 2905 2944 21 23 22 23 26 31 44 58 92 165
Number of total tokens: 4096,    Number of output tokens: 256
ml.g5.2xlarge 20 36 48 49 48 57 104 170
ml.g5.12xlarge 33 58 90 123 142 31 34 48 73 132
ml.g5.48xlarge 31 48 66 82 31 43 68 120
ml.p4d.24xlarge 39 73 124 202 278 290 26 27 33 43 66 107

We observe some additional patterns in this data. When increasing context size, latency increases and throughput decreases. For instance, on ml.g5.2xlarge with a concurrency of 1, throughput is 30 tokens/sec when the number of total tokens is 512, vs. 20 tokens/sec if the number of total tokens is 4,096. This is because it takes more time to process the larger input. We can also see that increasing GPU capability and sharding impacts the maximum throughput and maximum supported concurrent requests. The table shows that Llama 2 7B has notably different maximum throughput values for different instance types, and these maximum throughput values occur at different values of concurrent requests. These characteristics would drive an ML practitioner to justify the cost of one instance over another. For example, given a low latency requirement, the practitioner might select an ml.g5.12xlarge instance (4 A10G GPUs) over an ml.g5.2xlarge instance (1 A10G GPU). If given a high throughput requirement, the use of an ml.p4d.24xlarge instance (8 A100 GPUs) with full sharding would only be justified under high concurrency. Note, however, that it’s often beneficial to instead load multiple inference components of a 7B model on a single ml.p4d.24xlarge instance; such multi-model support is discussed later in this post.

The preceding observations were made for the Llama 2 7B model. However, similar patterns remain true for other models as well. A primary takeaway is that latency and throughput performance numbers are dependent on payload, instance type, and number of concurrent requests, so you will need to find the ideal configuration for your specific application. To generate the preceding numbers for your use case, you can run the linked notebook, where you can configure this load test analysis for your model, instance type, and payload.

Making sense of accelerator specifications

Selecting suitable hardware for LLM inference relies heavily on specific use cases, user experience goals, and the chosen LLM. This section attempts to create an understanding of the knee in the latency-throughput curve with respect to high-level principles based on accelerator specifications. These principles alone don’t suffice to make a decision: real benchmarks are necessary. The term device is used here to encompass all ML hardware accelerators. We assert the knee in the latency-throughput curve is driven by one of two factors:

  • The accelerator has exhausted memory to cache KV matrices, so subsequent requests are queued
  • The accelerator still has spare memory for the KV cache, but is using a large enough batch size that processing time is driven by compute operation latency rather than memory bandwidth

We typically prefer to be limited by the second factor because this implies the accelerator resources are saturated. Basically, you are maximizing the resources you payed for. Let’s explore this assertion in greater detail.

KV caching and device memory

Standard transformer attention mechanisms compute attention for each new token against all previous tokens. Most modern ML servers cache attention keys and values in device memory (DRAM) to avoid re-computation at every step. This is called this the KV cache, and it grows with batch size and sequence length. It defines how many user requests can be served in parallel and will determine the knee in the latency-throughput curve if the compute-bound regime in the second scenario mentioned earlier is not yet met, given the available DRAM. The following formula is a rough approximation for the maximum KV cache size.

In this formula, B is batch size and N is number of accelerators. For example, the Llama 2 7B model in FP16 (2 bytes/parameter) served on an A10G GPU (24 GB DRAM) consumes approximately 14 GB, leaving 10 GB for the KV cache. Plugging in the model’s full context length (N = 4096) and remaining parameters (n_layers=32, n_kv_attention_heads=32, and d_attention_head=128), this expression shows we are limited to serving a batch size of four users in parallel due to DRAM constraints. If you observe the corresponding benchmarks in the previous table, this is a good approximation for the observed knee in this latency-throughput curve. Methods such as grouped query attention (GQA) can reduce the KV cache size, in GQA’s case by the same factor it reduces the number of KV heads.

Arithmetic intensity and device memory bandwidth

The growth in the computational power of ML accelerators has outpaced their memory bandwidth, meaning they can perform many more computations on each byte of data in the amount of time it takes to access that byte.

The arithmetic intensity, or the ratio of compute operations to memory accesses, for an operation determines if it is limited by memory bandwidth or compute capacity on the selected hardware. For example, an A10G GPU (g5 instance type family) with 70 TFLOPS FP16 and 600 GB/sec bandwidth can compute approximately 116 ops/byte. An A100 GPU (p4d instance type family) can compute approximately 208 ops/byte. If the arithmetic intensity for a transformer model is under that value, it is memory-bound; if it is above, it is compute-bound. The attention mechanism for Llama 2 7B requires 62 ops/byte for batch size 1 (for an explanation, see A guide to LLM inference and performance), which means it is memory-bound. When the attention mechanism is memory-bound, expensive FLOPS are left unutilized.

There are two ways to better utilize the accelerator and increase arithmetic intensity: reduce the required memory accesses for the operation (this is what FlashAttention focuses on) or increase the batch size. However, we might not be able to increase our batch size enough to reach a compute-bound regime if our DRAM is too small to hold the corresponding KV cache. A crude approximation of the critical batch size B* that separates compute-bound from memory-bound regimes for standard GPT decoder inference is described by the following expression, where A_mb is the accelerator memory bandwidth, A_f is accelerator FLOPS, and N is the number of accelerators. This critical batch size can be derived by finding where memory access time equals computation time. Refer to this blog post to understand Equation 2 and its assumptions in greater detail.

This is the same ops/byte ratio we previously calculated for A10G, so the critical batch size on this GPU is 116. One way to approach this theoretical, critical batch size is to increase model sharding and split the cache across more N accelerators. This effectively increases the KV cache capacity as well as the memory-bound batch size.

Another benefit of model sharding is splitting model parameter and data loading work across N accelerators. This type of sharding is a type of model parallelism also referred to as tensor parallelism. Naively, there is N times the memory bandwidth and compute power in aggregate. Assuming no overhead of any kind (communication, software, and so on), this would decrease decoding latency per token by N if we are memory-bound, because token decoding latency in this regime is bound by the time it takes to load the model weights and cache. In real life, however, increasing the degree of sharding results in increased communication between devices to share intermediate activations at every model layer. This communication speed is limited by the device interconnect bandwidth. It’s difficult to estimate its impact precisely (for details, see Model parallelism), but this can eventually stop yielding benefits or deteriorate performance — this is especially true for smaller models, because smaller data transfers lead to lower transfer rates.

To compare ML accelerators based on their specs, we recommend the following. First, calculate the approximate critical batch size for each accelerator type according to the second equation and the KV cache size for the critical batch size according to the first equation. You can then use the available DRAM on the accelerator to calculate the minimum number of accelerators required to fit the KV cache and model parameters. If deciding between multiple accelerators, prioritize accelerators in order of lowest cost per GB/sec of memory bandwidth. Finally, benchmark these configurations and verify what is the best cost/token for your upper bound of desired latency.

Select an endpoint deployment configuration

Many LLMs distributed by SageMaker JumpStart use the text-generation-inference (TGI) SageMaker container for model serving. The following table discusses how to adjust a variety of model serving parameters to either affect model serving which impacts the latency-throughput curve or protect the endpoint against requests that would overload the endpoint. These are the primary parameters you can use to configure your endpoint deployment for your use case. Unless otherwise specified, we use default text generation payload parameters and TGI environment variables.

Environment Variable Description SageMaker JumpStart Default Value
Model serving configurations . .
MAX_BATCH_PREFILL_TOKENS Limits the number of tokens in the prefill operation. This operation generates the KV cache for a new input prompt sequence. It is memory intensive and compute bound, so this value caps the number of tokens allowed in a single prefill operation. Decoding steps for other queries pause while prefill is occurring. 4096 (TGI default) or model-specific maximum supported context length (SageMaker JumpStart provided), whichever is greater.
MAX_BATCH_TOTAL_TOKENS Controls the maximum number of tokens to include within a batch during decoding, or a single forward pass through the model. Ideally, this is set to maximize the usage of all available hardware. Not specified (TGI default). TGI will set this value with respect to remaining CUDA memory during model warm up.
SM_NUM_GPUS The number of shards to use. That is, the number of GPUs used to run the model using tensor parallelism. Instance dependent (SageMaker JumpStart provided). For each supported instance for a given model, SageMaker JumpStart provides the best setting for tensor parallelism.
Configurations to guard your endpoint (set these for your use case) . .
MAX_TOTAL_TOKENS This caps the memory budget of a single client request by limiting the number of tokens in the input sequence plus the number of tokens in the output sequence (the max_new_tokens payload parameter). Model-specific maximum supported context length. For example, 4096 for Llama 2.
MAX_INPUT_LENGTH Identifies the maximum allowed number of tokens in the input sequence for a single client request. Things to consider when increasing this value include: longer input sequences require more memory, which affects continuous batching, and many models have a supported context length that should not be exceeded. Model-specific maximum supported context length. For example, 4095 for Llama 2.
MAX_CONCURRENT_REQUESTS The maximum number of concurrent requests allowed by the deployed endpoint. New requests beyond this limit will immediately raise a model overloaded error to prevent poor latency for the current processing requests. 128 (TGI default). This setting allows you to obtain high throughput for a variety of use cases, but you should pin as appropriate to mitigate SageMaker invocation timeout errors.

The TGI server uses continuous batching, which dynamically batches concurrent requests together to share a single model inference forward pass. There are two types of forward passes: prefill and decode. Each new request must run a single prefill forward pass to populate the KV cache for the input sequence tokens. After the KV cache is populated, a decode forward pass performs a single next-token prediction for all batched requests, which is iteratively repeated to produce the output sequence. As new requests are sent to the server, the next decode step must wait so the prefill step can run for the new requests. This must occur before those new requests are included in subsequent continuously batched decode steps. Due to hardware constraints, the continuous batching used for decoding may not include all requests. At this point, requests enter a processing queue and inference latency starts to significantly increase with only minor throughput gain.

It’s possible to separate LLM latency benchmarking analyses into prefill latency, decode latency, and queue latency. The time consumed by each of these components is fundamentally different in nature: prefill is a one-time computation, decoding occurs one time for each token in the output sequence, and queueing involves server batching processes. When multiple concurrent requests are being processed, it becomes difficult to disentangle the latencies from each of these components because the latency experienced by any given client request involves queue latencies driven by the need to prefill new concurrent requests as well as queue latencies driven by the inclusion of the request in batch decoding processes. For this reason, this post focuses on end-to-end processing latency. The knee in the latency-throughput curve occurs at the point of saturation where queue latencies start to significantly increase. This phenomenon occurs for any model inference server and is driven by accelerator specifications.

Common requirements during deployment include satisfying a minimum required throughput, maximum allowed latency, maximum cost per hour, and maximum cost to generate 1 million tokens. You should condition these requirements on payloads that represent end-user requests. A design to meet these requirements should consider many factors, including the specific model architecture, size of the model, instance types, and instance count (horizontal scaling). In the following sections, we focus on deploying endpoints to minimize latency, maximize throughput, and minimize cost. This analysis considers 512 total tokens and 256 output tokens.

Minimize latency

Latency is an important requirement in many real-time use cases. In the following table, we look at minimum latency for each model and each instance type. You can achieve minimum latency by setting MAX_CONCURRENT_REQUESTS = 1.

Minimum Latency (ms/token)
Model ID ml.g5.2xlarge ml.g5.12xlarge ml.g5.48xlarge ml.p4d.24xlarge ml.p4de.24xlarge
Llama 2 7B 33 17 18 20
Llama 2 7B Chat 33 17 18 20
Llama 2 13B 22 23 23
Llama 2 13B Chat 23 23 23
Llama 2 70B 57 43
Llama 2 70B Chat 57 45
Mistral 7B 35
Mistral 7B Instruct 35
Mixtral 8x7B 33 27
Falcon 7B 33
Falcon 7B Instruct 33
Falcon 40B 53 33 27
Falcon 40B Instruct 53 33 28
Falcon 180B 42
Falcon 180B Chat 42

To achieve minimum latency for a model, you can use the following code while substituting your desired model ID and instance type:

from sagemaker.jumpstart.model import JumpStartModel

model = JumpStartModel(
    model_id="meta-textgeneration-llama-2-7b",
    model_version="3.*",
    instance_type="ml.g5.12xlarge",
    env={
        "MAX_CONCURRENT_REQUESTS": "1",
        "MAX_INPUT_TOKENS": "256",
        "MAX_TOTAL_TOKENS": "512",
    },
)
predictor = model.deploy(accept_eula=False)  # Change EULA acceptance to True

Note that the latency numbers change depending on the number of input and output tokens. However, the deployment process remains the same except the environment variables MAX_INPUT_TOKENS and MAX_TOTAL_TOKENS. Here, these environment variables are set to help guarantee endpoint latency requirements because larger input sequences may violate the latency requirement. Note that SageMaker JumpStart already provides the other optimal environment variables when selecting instance type; for instance, using ml.g5.12xlarge will set SM_NUM_GPUS to 4 in the model environment.

Maximize throughput

In this section, we maximize the number of generated tokens per second. This is typically achieved at the maximum valid concurrent requests for the model and the instance type. In the following table, we report the throughput achieved at the largest concurrent request value achieved before encountering a SageMaker invocation timeout for any request.

Maximum Throughput (tokens/sec), Concurrent Requests
Model ID ml.g5.2xlarge ml.g5.12xlarge ml.g5.48xlarge ml.p4d.24xlarge ml.p4de.24xlarge
Llama 2 7B 486 (64) 1214 (128) 804 (128) 2945 (512)
Llama 2 7B Chat 493 (64) 1207 (128) 932 (128) 3012 (512)
Llama 2 13B 787 (128) 496 (64) 3245 (512)
Llama 2 13B Chat 782 (128) 505 (64) 3310 (512)
Llama 2 70B 124 (16) 1585 (256)
Llama 2 70B Chat 114 (16) 1546 (256)
Mistral 7B 947 (64)
Mistral 7B Instruct 986 (128)
Mixtral 8x7B 701 (128) 3196 (512)
Falcon 7B 1340 (128)
Falcon 7B Instruct 1313 (128)
Falcon 40B 244 (32) 382 (64) 2699 (512)
Falcon 40B Instruct 245 (32) 415 (64) 2675 (512)
Falcon 180B 1100 (128)
Falcon 180B Chat 1081 (128)

To achieve maximum throughput for a model, you can use the following code:

from sagemaker.jumpstart.model import JumpStartModel

model = JumpStartModel(
    model_id="meta-textgeneration-llama-2-7b",
    model_version="3.*",
    instance_type="ml.g5.12xlarge",
    env={
        "MAX_CONCURRENT_REQUESTS": "128",  # For your application, identify it from the benchmarking table with the maximum feasible concurrent requests.
        "MAX_INPUT_TOKENS": "256",
        "MAX_TOTAL_TOKENS": "512",
    },
)
predictor = model.deploy(accept_eula=False)  # Change EULA acceptance to True

Note that the maximum number of concurrent requests depends on the model type, instance type, maximum number of input tokens, and maximum number of output tokens. Therefore, you should set these parameters before setting MAX_CONCURRENT_REQUESTS.

Also note that a user interested in minimizing latency is often at odds with a user interested in maximizing throughput. The former is interested in real-time responses, whereas the latter is interested in batch processing such that the endpoint queue is always saturated, thereby minimizing processing downtime. Users who want to maximize throughput conditioned on latency requirements are often interested in operating at the knee in the latency-throughput curve.

Minimize cost

The first option to minimize cost involves minimizing cost per hour. With this, you can deploy a selected model on the SageMaker instance with the lowest cost per hour. For real-time pricing of SageMaker instances, refer to Amazon SageMaker pricing. In general, the default instance type for SageMaker JumpStart LLMs is the lowest-cost deployment option.

The second option to minimize cost involves minimizing the cost to generate 1 million tokens. This is a simple transformation of the table we discussed earlier to maximize throughput, where you can first compute the time it takes in hours to generate 1 million tokens (1e6 / throughput / 3600). You can then multiply this time to generate 1 million tokens with the price per hour of the specified SageMaker instance.

Note that instances with the lowest cost per hour aren’t the same as instances with the lowest cost to generate 1 million tokens. For instance, if the invocation requests are sporadic, an instance with the lowest cost per hour might be optimal, whereas in the throttling scenarios, the lowest cost to generate a million tokens might be more appropriate.

Tensor parallel vs. multi-model trade-off

In all previous analyses, we considered deploying a single model replica with a tensor parallel degree equal to the number of GPUs on the deployment instance type. This is the default SageMaker JumpStart behavior. However, as previously noted, sharding a model can improve model latency and throughput only up to a certain limit, beyond which inter-device communication requirements dominate computation time. This implies that it’s often beneficial to deploy multiple models with a lower tensor parallel degree on a single instance rather than a single model with a higher tensor parallel degree.

Here, we deploy Llama 2 7B and 13B endpoints on ml.p4d.24xlarge instances with tensor parallel (TP) degrees of 1, 2, 4, and 8. For clarity in model behavior, each of these endpoints only load a single model.

. Throughput (tokens/sec) Latency (ms/token)
Concurrent Requests 1 2 4 8 16 32 64 128 256 512 1 2 4 8 16 32 64 128 256 512
TP Degree Llama 2 13B
1 38 74 147 278 443 612 683 722 26 27 27 29 37 45 87 174
2 49 92 183 351 604 985 1435 1686 1726 21 22 22 22 25 32 46 91 159
4 46 94 181 343 655 1073 1796 2408 2764 2819 23 21 21 24 25 30 37 57 111 172
8 44 86 158 311 552 1015 1654 2450 3087 3180 22 24 26 26 29 36 42 57 95 152
. Llama 2 7B
1 62 121 237 439 778 1122 1569 1773 1775 16 16 17 18 22 28 43 88 151
2 62 122 239 458 780 1328 1773 2440 2730 2811 16 16 17 18 21 25 38 56 103 182
4 60 106 211 420 781 1230 2206 3040 3489 3752 17 19 20 18 22 27 31 45 82 132
8 49 97 179 333 612 1081 1652 2292 2963 3004 22 20 24 26 27 33 41 65 108 167

Our previous analyses already showed significant throughput advantages on ml.p4d.24xlarge instances, which often translates to better performance in terms of cost to generate 1 million tokens over the g5 instance family under high concurrent request load conditions. This analysis clearly demonstrates that you should consider the trade-off between model sharding and model replication within a single instance; that is, a fully sharded model is not typically the best use of  ml.p4d.24xlarge compute resources for 7B and 13B model families. In fact, for the 7B model family, you obtain the best throughput for a single model replica with a tensor parallel degree of 4 instead of 8.

From here, you can extrapolate that the highest throughput configuration for the 7B model involves a tensor parallel degree of 1 with eight model replicas, and the highest throughput configuration for the 13B model is likely a tensor parallel degree of 2 with four model replicas. To learn more about how to accomplish this, refer to Reduce model deployment costs by 50% on average using the latest features of Amazon SageMaker, which demonstrates the use of inference component-based endpoints. Due to load balancing techniques, server routing, and sharing of CPU resources, you might not fully achieve throughput improvements exactly equal to the number of replicas times the throughput for a single replica.

Horizontal scaling

As observed earlier, each endpoint deployment has a limitation on the number of concurrent requests depending on the number of input and output tokens as well as the instance type. If this doesn’t meet your throughput or concurrent request requirement, you can scale up to utilize more than one instance behind the deployed endpoint. SageMaker automatically performs load balancing of queries between instances. For example, the following code deploys an endpoint supported by three instances:

model = JumpStartModel(
    model_id="meta-textgeneration-llama-2-7b",
    model_version="3.*",
    instance_type="ml.g5.2xlarge",
)
predictor = model.deploy(
    accept_eula=False,  # Change EULA acceptance to True
    initial_instance_count = 3,
)

The following table shows the throughput gain as a factor of number of instances for the Llama 2 7B model.

. . Throughput (tokens/sec) Latency (ms/token)
. Concurrent Requests 1 2 4 8 16 32 64 128 1 2 4 8 16 32 64 128
Instance Count Instance Type Number of total tokens: 512, Number of output tokens: 256
1 ml.g5.2xlarge 30 60 115 210 351 484 492 32 33 34 37 45 93 160
2 ml.g5.2xlarge 30 60 115 221 400 642 922 949 32 33 34 37 42 53 94 167
3 ml.g5.2xlarge 30 60 118 228 421 731 1170 1400 32 33 34 36 39 47 57 110

Notably, the knee in the latency-throughput curve shifts to the right because higher instance counts can handle larger numbers of concurrent requests within the multi-instance endpoint. For this table, the concurrent request value is for the entire endpoint, not the number of concurrent requests that each individual instance receives.

You can also use autoscaling, a feature to monitor your workloads and dynamically adjust the capacity to maintain steady and predictable performance at the possible lowest cost. This is beyond the scope of this post. To learn more about autoscaling, refer to Configuring autoscaling inference endpoints in Amazon SageMaker.

Invoke endpoint with concurrent requests

Let’s suppose you have a large batch of queries that you would like to use to generate responses from a deployed model under high throughput conditions. For example, in the following code block, we compile a list of 1,000 payloads, with each payload requesting the generation of 100 tokens. In all, we are requesting the generation of 100,000 tokens.

payload = {
    "inputs": "I believe the meaning of life is to ",
    "parameters": {"max_new_tokens": 100, "details": True},
}
total_requests = 1000
payloads = [payload,] * total_requests

When sending a large number of requests to the SageMaker runtime API, you may experience throttling errors. To mitigate this, you can create a custom SageMaker runtime client that increases the number of retry attempts. You can provide the resulting SageMaker session object to either the JumpStartModel constructor or sagemaker.predictor.retrieve_default if you would like to attach a new predictor to an already deployed endpoint. In the following code, we use this session object when deploying a Llama 2 model with default SageMaker JumpStart configurations:

import boto3
from botocore.config import Config
from sagemaker.session import Session
from sagemaker.jumpstart.model import JumpStartModel

sagemaker_session = Session(
    sagemaker_runtime_client=boto3.client(
        "sagemaker-runtime",
        config=Config(connect_timeout=10, retries={"mode": "standard", "total_max_attempts": 20}),
    )
)
model = JumpStartModel(
    model_id="meta-textgeneration-llama-2-7b",
    model_version="3.*",
    sagemaker_session=sagemaker_session
)
predictor = model.deploy(accept_eula=False)  # Change EULA acceptance to True

This deployed endpoint has MAX_CONCURRENT_REQUESTS = 128 by default. In the following block, we use the concurrent futures library to iterate over invoking the endpoint for all payloads with 128 worker threads. At most, the endpoint will process 128 concurrent requests, and whenever a request returns a response, the executor will immediately send a new request to the endpoint.

import time
from concurrent import futures

concurrent_requests = 128

time_start = time.time()
with futures.ThreadPoolExecutor(max_workers=concurrent_requests) as executor:
    responses = list(executor.map(predictor.predict, payloads))

total_tokens = sum([response[0]["details"]["generated_tokens"] for response in responses])
token_throughput = total_tokens / (time.time() - time_start)

This results in generating 100,000 total tokens with a throughput of 1255 tokens/sec on a single ml.g5.2xlarge instance. This takes approximately 80 seconds to process.

Note that this throughput value is notably different than the maximum throughput for Llama 2 7B on ml.g5.2xlarge in the previous tables of this post (486 tokens/sec at 64 concurrent requests). This is because the input payload uses 8 tokens instead of 256, the output token count is 100 instead of 256, and the smaller token counts allow for 128 concurrent requests. This is a final reminder that all latency and throughput numbers are payload dependent! Changing payload token counts will affect batching processes during model serving, which will in turn affect the emergent prefill, decode, and queue times for your application.

Conclusion

In this post, we presented benchmarking of SageMaker JumpStart LLMs, including Llama 2, Mistral, and Falcon. We also presented a guide to optimize latency, throughput, and cost for your endpoint deployment configuration. You can get started by running the associated notebook to benchmark your use case.


About the Authors

 Dr. Kyle Ulrich is an Applied Scientist with the Amazon SageMaker JumpStart team. His research interests include scalable machine learning algorithms, computer vision, time series, Bayesian non-parametrics, and Gaussian processes. His PhD is from Duke University and he has published papers in NeurIPS, Cell, and Neuron.

Dr. Vivek Madan is an Applied Scientist with the Amazon SageMaker JumpStart team. He got his PhD from University of Illinois at Urbana-Champaign and was a Post Doctoral Researcher at Georgia Tech. He is an active researcher in machine learning and algorithm design and has published papers in EMNLP, ICLR, COLT, FOCS, and SODA conferences.

Dr. Ashish Khetan is a Senior Applied Scientist with Amazon SageMaker JumpStart and helps develop machine learning algorithms. He got his PhD from University of Illinois Urbana-Champaign. He is an active researcher in machine learning and statistical inference, and has published many papers in NeurIPS, ICML, ICLR, JMLR, ACL, and EMNLP conferences.

João Moura is a Senior AI/ML Specialist Solutions Architect at AWS. João helps AWS customers – from small startups to large enterprises – train and deploy large models efficiently, and more broadly build ML platforms on AWS.

Read More

Architect defense-in-depth security for generative AI applications using the OWASP Top 10 for LLMs

Architect defense-in-depth security for generative AI applications using the OWASP Top 10 for LLMs

Generative artificial intelligence (AI) applications built around large language models (LLMs) have demonstrated the potential to create and accelerate economic value for businesses. Examples of applications include conversational search, customer support agent assistance, customer support analytics, self-service virtual assistants, chatbots, rich media generation, content moderation, coding companions to accelerate secure, high-performance software development, deeper insights from multimodal content sources, acceleration of your organization’s security investigations and mitigations, and much more. Many customers are looking for guidance on how to manage security, privacy, and compliance as they develop generative AI applications. Understanding and addressing LLM vulnerabilities, threats, and risks during the design and architecture phases helps teams focus on maximizing the economic and productivity benefits generative AI can bring. Being aware of risks fosters transparency and trust in generative AI applications, encourages increased observability, helps to meet compliance requirements, and facilitates informed decision-making by leaders.

The goal of this post is to empower AI and machine learning (ML) engineers, data scientists, solutions architects, security teams, and other stakeholders to have a common mental model and framework to apply security best practices, allowing AI/ML teams to move fast without trading off security for speed. Specifically, this post seeks to help AI/ML and data scientists who may not have had previous exposure to security principles gain an understanding of core security and privacy best practices in the context of developing generative AI applications using LLMs. We also discuss common security concerns that can undermine trust in AI, as identified by the Open Worldwide Application Security Project (OWASP) Top 10 for LLM Applications, and show ways you can use AWS to increase your security posture and confidence while innovating with generative AI.

This post provides three guided steps to architect risk management strategies while developing generative AI applications using LLMs. We first delve into the vulnerabilities, threats, and risks that arise from the implementation, deployment, and use of LLM solutions, and provide guidance on how to start innovating with security in mind. We then discuss how building on a secure foundation is essential for generative AI. Lastly, we connect these together with an example LLM workload to describe an approach towards architecting with defense-in-depth security across trust boundaries.

By the end of this post, AI/ML engineers, data scientists, and security-minded technologists will be able to identify strategies to architect layered defenses for their generative AI applications, understand how to map OWASP Top 10 for LLMs security concerns to some corresponding controls, and build foundational knowledge towards answering the following top AWS customer question themes for their applications:

  • What are some of the common security and privacy risks with using generative AI based on LLMs in my applications that I can most impact with this guidance?
  • What are some ways to implement security and privacy controls in the development lifecycle for generative AI LLM applications on AWS?
  • What operational and technical best practices can I integrate into how my organization builds generative AI LLM applications to manage risk and increase confidence in generative AI applications using LLMs?

Improve security outcomes while developing generative AI

Innovation with generative AI using LLMs requires starting with security in mind to develop organizational resiliency, build on a secure foundation, and integrate security with a defense in depth security approach. Security is a shared responsibility between AWS and AWS customers. All the principles of the AWS Shared Responsibility Model are applicable to generative AI solutions. Refresh your understanding of the AWS Shared Responsibility Model as it applies to infrastructure, services, and data when you build LLM solutions.

Start with security in mind to develop organizational resiliency

Start with security in mind to develop organizational resiliency for developing generative AI applications that meet your security and compliance objectives. Organizational resiliency draws on and extends the definition of resiliency in the AWS Well-Architected Framework to include and prepare for the ability of an organization to recover from disruptions. Consider your security posture, governance, and operational excellence when assessing overall readiness to develop generative AI with LLMs and your organizational resiliency to any potential impacts. As your organization advances its use of emerging technologies such as generative AI and LLMs, overall organizational resiliency should be considered as a cornerstone of a layered defensive strategy to protect assets and lines of business from unintended consequences.

Organizational resiliency matters substantially for LLM applications

Although all risk management programs can benefit from resilience, organizational resiliency matters substantially for generative AI. Five of the OWASP-identified top 10 risks for LLM applications rely on defining architectural and operational controls and enforcing them at an organizational scale in order to manage risk. These five risks are insecure output handling, supply chain vulnerabilities, sensitive information disclosure, excessive agency, and overreliance. Begin increasing organizational resiliency by socializing your teams to consider AI, ML, and generative AI security a core business requirement and top priority throughout the whole lifecycle of the product, from inception of the idea, to research, to the application’s development, deployment, and use. In addition to awareness, your teams should take action to account for generative AI in governance, assurance, and compliance validation practices.

Build organizational resiliency around generative AI

Organizations can start adopting ways to build their capacity and capabilities for AI/ML and generative AI security within their organizations. You should begin by extending your existing security, assurance, compliance, and development programs to account for generative AI.

The following are the five key areas of interest for organizational AI, ML, and generative AI security:

  • Understand the AI/ML security landscape
  • Include diverse perspectives in security strategies
  • Take action proactively for securing research and development activities
  • Align incentives with organizational outcomes
  • Prepare for realistic security scenarios in AI/ML and generative AI

Develop a threat model throughout your generative AI Lifecycle

Organizations building with generative AI should focus on risk management, not risk elimination, and include threat modeling in and business continuity planning the planning, development, and operations of generative AI workloads. Work backward from production use of generative AI by developing a threat model for each application using traditional security risks as well as generative AI-specific risks. Some risks may be acceptable to your business, and a threat modeling exercise can help your company identify what your acceptable risk appetite is. For example, your business may not require 99.999% uptime on a generative AI application, so the additional recovery time associated to recovery using AWS Backup with Amazon S3 Glacier may be an acceptable risk. Conversely, the data in your model may be extremely sensitive and highly regulated, so deviation from AWS Key Management Service (AWS KMS) customer managed key (CMK) rotation and use of AWS Network Firewall to help enforce Transport Layer Security (TLS) for ingress and egress traffic to protect against data exfiltration may be an unacceptable risk.

Evaluate the risks (inherent vs. residual) of using the generative AI application in a production setting to identify the right foundational and application-level controls. Plan for rollback and recovery from production security events and service disruptions such as prompt injection, training data poisoning, model denial of service, and model theft early on, and define the mitigations you will use as you define application requirements. Learning about the risks and controls that need to be put in place will help define the best implementation approach for building a generative AI application, and provide stakeholders and decision-makers with information to make informed business decisions about risk. If you are unfamiliar with the overall AI and ML workflow, start by reviewing 7 ways to improve security of your machine learning workloads to increase familiarity with the security controls needed for traditional AI/ML systems.

Just like building any ML application, building a generative AI application involves going through a set of research and development lifecycle stages. You may want to review the AWS Generative AI Security Scoping Matrix to help build a mental model to understand the key security disciplines that you should consider depending on which generative AI solution you select.

Generative AI applications using LLMs are typically developed and operated following ordered steps:

  • Application requirements – Identify use case business objectives, requirements, and success criteria
  • Model selection – Select a foundation model that aligns with use case requirements
  • Model adaptation and fine-tuning – Prepare data, engineer prompts, and fine-tune the model
  • Model evaluation – Evaluate foundation models with use case-specific metrics and select the best-performing model
  • Deployment and integration – Deploy the selected foundation model on your optimized infrastructure and integrate with your generative AI application
  • Application monitoring – Monitor application and model performance to enable root cause analysis

Ensure teams understand the critical nature of security as part of the design and architecture phases of your software development lifecycle on Day 1. This means discussing security at each layer of your stack and lifecycle, and positioning security and privacy as enablers to achieving business objectives.Architect controls for threats before you launch your LLM application, and consider whether the data and information you will use for model adaptation and fine-tuning warrants controls implementation in the research, development, and training environments. As part of quality assurance tests, introduce synthetic security threats (such as attempting to poison training data, or attempting to extract sensitive data through malicious prompt engineering) to test out your defenses and security posture on a regular basis.

Additionally, stakeholders should establish a consistent review cadence for production AI, ML, and generative AI workloads and set organizational priority on understanding trade-offs between human and machine control and error prior to launch. Validating and assuring that these trade-offs are respected in the deployed LLM applications will increase the likelihood of risk mitigation success.

Build generative AI applications on secure cloud foundations

At AWS, security is our top priority. AWS is architected to be the most secure global cloud infrastructure on which to build, migrate, and manage applications and workloads. This is backed by our deep set of over 300 cloud security tools and the trust of our millions of customers, including the most security-sensitive organizations like government, healthcare, and financial services. When building generative AI applications using LLMs on AWS, you gain security benefits from the secure, reliable, and flexible AWS Cloud computing environment.

Use an AWS global infrastructure for security, privacy, and compliance

When you develop data-intensive applications on AWS, you can benefit from an AWS global Region infrastructure, architected to provide capabilities to meet your core security and compliance requirements. This is reinforced by our AWS Digital Sovereignty Pledge, our commitment to offering you the most advanced set of sovereignty controls and features available in the cloud. We are committed to expanding our capabilities to allow you to meet your digital sovereignty needs, without compromising on the performance, innovation, security, or scale of the AWS Cloud. To simplify implementation of security and privacy best practices, consider using reference designs and infrastructure as code resources such as the AWS Security Reference Architecture (AWS SRA) and the AWS Privacy Reference Architecture (AWS PRA). Read more about architecting privacy solutions, sovereignty by design, and compliance on AWS and use services such as AWS Config, AWS Artifact, and AWS Audit Manager to support your privacy, compliance, audit, and observability needs.

Understand your security posture using AWS Well-Architected and Cloud Adoption Frameworks

AWS offers best practice guidance developed from years of experience supporting customers in architecting their cloud environments with the AWS Well-Architected Framework and in evolving to realize business value from cloud technologies with the AWS Cloud Adoption Framework (AWS CAF). Understand the security posture of your AI, ML, and generative AI workloads by performing a Well-Architected Framework review. Reviews can be performed using tools like the AWS Well-Architected Tool, or with the help of your AWS team through AWS Enterprise Support. The AWS Well-Architected Tool automatically integrates insights from AWS Trusted Advisor to evaluate what best practices are in place and what opportunities exist to improve functionality and cost-optimization. The AWS Well-Architected Tool also offers customized lenses with specific best practices such as the Machine Learning Lens for you to regularly measure your architectures against best practices and identify areas for improvement. Checkpoint your journey on the path to value realization and cloud maturity by understanding how AWS customers adopt strategies to develop organizational capabilities in the AWS Cloud Adoption Framework for Artificial Intelligence, Machine Learning, and Generative AI. You might also find benefit in understanding your overall cloud readiness by participating in an AWS Cloud Readiness Assessment. AWS offers additional opportunities for engagement—ask your AWS account team for more information on how to get started with the Generative AI Innovation Center.

Accelerate your security and AI/ML learning with best practices guidance, training, and certification

AWS also curates recommendations from Best Practices for Security, Identity, & Compliance and AWS Security Documentation to help you identify ways to secure your training, development, testing, and operational environments. If you’re just getting started, dive deeper on security training and certification, consider starting with AWS Security Fundamentals and the AWS Security Learning Plan. You can also use the AWS Security Maturity Model to help guide you finding and prioritizing the best activities at different phases of maturity on AWS, starting with quick wins, through foundational, efficient, and optimized stages. After you and your teams have a basic understanding of security on AWS, we strongly recommend reviewing How to approach threat modeling and then leading a threat modeling exercise with your teams starting with the Threat Modeling For Builders Workshop training program. There are many other AWS Security training and certification resources available.

Apply a defense-in-depth approach to secure LLM applications

Applying a defense-in-depth security approach to your generative AI workloads, data, and information can help create the best conditions to achieve your business objectives. Defense-in-depth security best practices mitigate many of the common risks that any workload faces, helping you and your teams accelerate your generative AI innovation. A defense-in-depth security strategy uses multiple redundant defenses to protect your AWS accounts, workloads, data, and assets. It helps make sure that if any one security control is compromised or fails, additional layers exist to help isolate threats and prevent, detect, respond, and recover from security events. You can use a combination of strategies, including AWS services and solutions, at each layer to improve the security and resiliency of your generative AI workloads.

Diagram of defense-in-depth security layers

Many AWS customers align to industry standard frameworks, such as the NIST Cybersecurity Framework. This framework helps ensure that your security defenses have protection across the pillars of Identify, Protect, Detect, Respond, Recover, and most recently added, Govern. This framework can then easily map to AWS Security services and those from integrated third parties as well to help you validate adequate coverage and policies for any security event your organization encounters.

Diagram of defense-in-depth of AWS Security Services mapped to the NIST Cybersecurity Framework 2.0

Defense in depth: Secure your environment, then add enhanced AI/ML-specific security and privacy capabilities

A defense-in-depth strategy should start by protecting your accounts and organization first, and then layer on the additional built-in security and privacy enhanced features of services such as Amazon Bedrock and Amazon SageMaker. Amazon has over 30 services in the Security, Identity, and Compliance portfolio which are integrated with AWS AI/ML services, and can be used together to help secure your workloads, accounts, organization. To properly defend against the OWASP Top 10 for LLM, these should be used together with the AWS AI/ML services.

Start by implementing a policy of least privilege, using services like IAM Access Analyzer to look for overly permissive accounts, roles, and resources to restrict access using short-termed credentials. Next, make sure that all data at rest is encrypted with AWS KMS, including considering the use of CMKs, and all data and models are versioned and backed up using Amazon Simple Storage Service (Amazon S3) versioning and applying object-level immutability with Amazon S3 Object Lock. Protect all data in transit between services using AWS Certificate Manager and/or AWS Private CA, and keep it within VPCs using AWS PrivateLink. Define strict data ingress and egress rules to help protect against manipulation and exfiltration using VPCs with AWS Network Firewall policies. Consider inserting AWS Web Application Firewall (AWS WAF) in front to protect web applications and APIs from malicious bots, SQL injection attacks, cross-site scripting (XSS), and account takeovers with Fraud Control. Logging with AWS CloudTrail, Amazon Virtual Private Cloud (Amazon VPC) flow logs, and Amazon Elastic Kubernetes Service (Amazon EKS) audit logs will help provide forensic review of each transaction available to services such as Amazon Detective. You can use Amazon Inspector to automate vulnerability discovery and management for Amazon Elastic Compute Cloud (Amazon EC2) instances, containers, AWS Lambda functions, and identify the network reachability of your workloads. Protect your data and models from suspicious activity using Amazon GuardDuty’s ML-powered threat models and intelligence feeds, and enabling its additional features for EKS Protection, ECS Protection, S3 Protection, RDS Protection, Malware Protection, Lambda Protection, and more. You can use services like AWS Security Hub to centralize and automate your security checks to detect deviations from security best practices and accelerate investigation and automate remediation of security findings with playbooks. You can also consider implementing a zero trust architecture on AWS to further increase fine-grained authentication and authorization controls for what human users or machine-to-machine processes can access on a per-request basis. Also consider using Amazon Security Lake to automatically centralize security data from AWS environments, SaaS providers, on premises, and cloud sources into a purpose-built data lake stored in your account. With Security Lake, you can get a more complete understanding of your security data across your entire organization.

After your generative AI workload environment has been secured, you can layer in AI/ML-specific features, such as Amazon SageMaker Data Wrangler to identify potential bias during data preparation and Amazon SageMaker Clarify to detect bias in ML data and models. You can also use Amazon SageMaker Model Monitor to evaluate the quality of SageMaker ML models in production, and notify you when there is drift in data quality, model quality, and feature attribution. These AWS AI/ML services working together (including SageMaker working with Amazon Bedrock) with AWS Security services can help you identify potential sources of natural bias and protect against malicious data tampering. Repeat this process for each of the OWASP Top 10 for LLM vulnerabilities to ensure you’re maximizing the value of AWS services to implement defense in depth to protect your data and workloads.

As AWS Enterprise Strategist Clarke Rodgers wrote in his blog post “CISO Insight: Every AWS Service Is A Security Service”, “I would argue that virtually every service within the AWS cloud either enables a security outcome by itself, or can be used (alone or in conjunction with one or more services) by customers to achieve a security, risk, or compliance objective.” And “Customer Chief Information Security Officers (CISOs) (or their respective teams) may want to take the time to ensure that they are well versed with all AWS services because there may be a security, risk, or compliance objective that can be met, even if a service doesn’t fall into the ‘Security, Identity, and Compliance’ category.”

Layer defenses at trust boundaries in LLM applications

When developing generative AI-based systems and applications, you should consider the same concerns as with any other ML application, as mentioned in the MITRE ATLAS Machine Learning Threat Matrix, such as being mindful of software and data component origins (such as performing an open source software audit, reviewing software bill of materials (SBOMs), and analyzing data workflows and API integrations) and implementing necessary protections against LLM supply chain threats. Include insights from industry frameworks, and be aware of ways to use multiple sources of threat intelligence and risk information to adjust and extend your security defenses to account for AI, ML, and generative AI security risks that are emergent and not included in traditional frameworks. Seek out companion information on AI-specific risks from industry, defense, governmental, international, and academic sources, because new threats emerge and evolve in this space regularly and companion frameworks and guides are updated frequently. For example, when using a Retrieval Augmented Generation (RAG) model, if the model doesn’t include the data it needs, it may request it from an external data source for using during inferencing and fine-tuning. The source that it queries may be outside of your control, and can be a potential source of compromise in your supply chain. A defense-in-depth approach should be extended towards external sources to establish trust, authentication, authorization, access, security, privacy, and accuracy of the data it is accessing. To dive deeper, read “Build a secure enterprise application with Generative AI and RAG using Amazon SageMaker JumpStart

Analyze and mitigate risk in your LLM applications

In this section, we analyze and discuss some risk mitigation techniques based on trust boundaries and interactions, or distinct areas of the workload with similar appropriate controls scope and risk profile. In this sample architecture of a chatbot application, there are five trust boundaries where controls are demonstrated, based on how AWS customers commonly build their LLM applications. Your LLM application may have more or fewer definable trust boundaries. In the following sample architecture, these trust boundaries are defined as:

  1. User interface interactions (request and response)
  2. Application interactions
  3. Model interactions
  4. Data interactions
  5. Organizational interactions and use

Diagram of example workflow for securing an LLM-based application and it's integration points

User interface interactions: Develop request and response monitoring

Detect and respond to cyber incidents related to generative AI in a timely manner by evaluating a strategy to address risk from the inputs and outputs of the generative AI application. For example, additional monitoring for behaviors and data outflow may need to be instrumented to detect sensitive information disclosure outside your domain or organization, in the case that it is used in the LLM application.

Generative AI applications should still uphold the standard security best practices when it comes to protecting data. Establish a secure data perimeter and secure sensitive data stores. Encrypt data and information used for LLM applications at rest and in transit. Protect data used to train your model from training data poisoning by understanding and controlling which users, processes, and roles are allowed to contribute to the data stores, as well as how data flows in the application, monitor for bias deviations, and using versioning and immutable storage in storage services such as Amazon S3. Establish strict data ingress and egress controls using services like AWS Network Firewall and AWS VPCs to protect against suspicious input and the potential for data exfiltration.

During the training, retraining, or fine-tuning process, you should be aware of any sensitive data that is utilized. After data is used during one of these processes, you should plan for a scenario where any user of your model suddenly becomes able to extract the data or information back out by utilizing prompt injection techniques. Understand the risks and benefits of using sensitive data in your models and inferencing. Implement robust authentication and authorization mechanisms for establishing and managing fine-grained access permissions, which don’t rely on LLM application logic to prevent disclosure. User-controlled input to a generative AI application has been demonstrated under some conditions to be able to provide a vector to extract information from the model or any non-user-controlled parts of the input. This can occur via prompt injection, where the user provides input that causes the output of the model to deviate from the expected guardrails of the LLM application, including providing clues to the datasets that the model was originally trained on.

Implement user-level access quotas for users providing input and receiving output from a model. You should consider approaches that don’t allow anonymous access under conditions where the model training data and information is sensitive, or where there is risk from an adversary training a facsimile of your model based on their input and your aligned model output. In general, if part of the input to a model consists of arbitrary user-provided text, consider the output to be susceptible to prompt injection, and accordingly ensure use of the outputs includes implemented technical and organizational countermeasures to mitigate insecure output handling, excessive agency, and overreliance. In the example earlier related to filtering for malicious input using AWS WAF, consider building a filter in front of your application for such potential misuse of prompts, and develop a policy for how to handle and evolve those as your model and data grows. Also consider a filtered review of the output before it is returned to the user to ensure it meets quality, accuracy, or content moderation standards. You may want to further customize this for your organization’s needs with an additional layer of control on inputs and outputs in front of your models to mitigate suspicious traffic patterns.

Application interactions: Application security and observability

Review your LLM application with attention to how a user could utilize your model to bypass standard authorization to a downstream tool or toolchain that they don’t have authorization to access or use. Another concern at this layer involves accessing external data stores by using a model as an attack mechanism using unmitigated technical or organizational LLM risks. For example, if your model is trained to access certain data stores that could contain sensitive data, you should ensure that you have proper authorization checks between your model and the data stores. Use immutable attributes about users that don’t come from the model when performing authorization checks. Unmitigated insecure output handling, insecure plugin design, and excessive agency can create conditions where a threat actor may use a model to trick the authorization system into escalating effective privileges, leading to a downstream component believing the user is authorized to retrieve data or take a specific action.

When implementing any generative AI plugin or tool, it is imperative to examine and comprehend the level of access being granted, as well as scrutinize the access controls that have been configured. Using unmitigated insecure generative AI plugins may render your system susceptible to supply chain vulnerabilities and threats, potentially leading to malicious actions, including running remote code.

Model interactions: Model attack prevention

You should be aware of the origin of any models, plugins, tools, or data you use, in order to evaluate and mitigate against supply chain vulnerabilities. For example, some common model formats permit the embedding of arbitrary runnable code in the models themselves. Use package mirrors, scanning, and additional inspections as relevant to your organizations security goals.

The datasets you train and fine-tune your models on must also be reviewed. If you further automatically fine-tune a model based on user feedback (or other end-user-controllable information), you must consider if a malicious threat actor could change the model arbitrarily based on manipulating their responses and achieve training data poisoning.

Data interactions: Monitor data quality and usage

Generative AI models such as LLMs generally work well because they have been trained on a large amount of data. Although this data helps LLMs complete complex tasks, it also can expose your system to risk of training data poisoning, which occurs when inappropriate data is included or omitted inside a training dataset that can alter a model’s behavior. To mitigate this risk, you should look at your supply chain and understand the data review process for your system before it’s used inside your model. Although the training pipeline is a prime source for data poisoning, you should also look at how your model gets data, such as in a RAG model or data lake, and if the source of that data is trusted and protected. Use AWS Security services such as AWS Security Hub, Amazon GuardDuty, and Amazon Inspector to help continuously monitor for suspicious activity in Amazon EC2, Amazon EKS, Amazon S3, Amazon Relational Database Service (Amazon RDS), and network access that may be indicators of emerging threats, and use Detective to visualize security investigations. Also consider using services such as Amazon Security Lake to accelerate security investigations by creating a purpose-built data lake to automatically centralize security data from AWS environments, SaaS providers, on premises, and cloud sources which contribute to your AI/ML workloads.

Organizational interactions: Implement enterprise governance guardrails for generative AI

Identify risks associated with the use of generative AI for your businesses. You should build your organization’s risk taxonomy and conduct risk assessments to make informed decisions when deploying generative AI solutions. Develop a business continuity plan (BCP) that includes AI, ML, and generative AI workloads and that can be enacted quickly to replace the lost functionality of an impacted or offline LLM application to meet your SLAs.

Identify process and resource gaps, inefficiencies, and inconsistencies, and improve awareness and ownership across your business. Threat model all generative AI workloads to identify and mitigate potential security threats that may lead to business-impacting outcomes, including unauthorized access to data, denial of service, and resource misuse. Take advantage of the new AWS Threat Composer Modeling Tool to help reduce time-to-value when performing threat modeling. Later in your development cycles, consider including introducing security chaos engineering fault injection experiments to create real-world conditions to understand how your system will react to unknowns and build confidence in the system’s resiliency and security.

Include diverse perspectives in developing security strategies and risk management mechanisms to ensure adherence and coverage for AI/ML and generative security across all job roles and functions. Bring a security mindset to the table from the inception and research of any generative AI application to align on requirements. If you need extra assistance from AWS, ask your AWS account manager to make sure that there is equal support by requesting AWS Solutions Architects from AWS Security and AI/ML to help in tandem.

Ensure that your security organization routinely takes actions to foster communication around both risk awareness and risk management understanding among generative AI stakeholders such as product managers, software developers, data scientists, and executive leadership, allowing threat intelligence and controls guidance to reach the teams that may be impacted. Security organizations can support a culture of responsible disclosure and iterative improvement by participating in discussions and bringing new ideas and information to generative AI stakeholders that relate to their business objectives. Learn more about our commitment to Responsible AI and additional responsible AI resources to help our customers.

Gain advantage in enabling better organizational posture for generative AI by unblocking time to value in the existing security processes of your organization. Proactively evaluate where your organization may require processes that are overly burdensome given the generative AI security context and refine these to provide developers and scientists a clear path to launch with the correct controls in place.

Assess where there may be opportunities to align incentives, derisk, and provide a clear line of sight on the desired outcomes. Update controls guidance and defenses to meet the evolving needs of AI/ML and generative AI application development to reduce confusion and uncertainty that can cost development time, increase risk, and increase impact.

Ensure that stakeholders who are not security experts are able to both understand how organizational governance, policies, and risk management steps apply to their workloads, as well as apply risk management mechanisms. Prepare your organization to respond to realistic events and scenarios that may occur with generative AI applications, and ensure that generative AI builder roles and response teams are aware of escalation paths and actions in case of concern for any suspicious activity.

Conclusion

To successfully commercialize innovation with any new and emerging technology requires starting with a security-first mindset, building on a secure infrastructure foundation, and thinking about how to further integrate security at each level of the technology stack early with a defense-in-depth security approach. This includes interactions at multiple layers of your technology stack, and integration points within your digital supply chain, to ensure organizational resiliency. Although generative AI introduces some new security and privacy challenges, if you follow fundamental security best practices such as using defense-in-depth with layered security services, you can help protect your organization from many common issues and evolving threats. You should implement layered AWS Security services across your generative AI workloads and larger organization, and focus on integration points in your digital supply chains to secure your cloud environments. Then you can use the enhanced security and privacy capabilities in AWS AI/ML services such as Amazon SageMaker and Amazon Bedrock to add further layers of enhanced security and privacy controls to your generative AI applications. Embedding security from the start will make it faster, easier, and more cost-effective to innovate with generative AI, while simplifying compliance. This will help you increase controls, confidence, and observability to your generative AI applications for your employees, customers, partners, regulators, and other concerned stakeholders.

Additional references


About the authors

Christopher Rae is a Principal Worldwide Security GTM Specialist focused on developing and executing strategic initiatives that accelerate and scale adoption of AWS security services. He is passionate about the intersection of cybersecurity and emerging technologies, with 20+ years of experience in global strategic leadership roles delivering security solutions to media, entertainment, and telecom customers. He recharges through reading, traveling, food and wine, discovering new music, and advising early-stage startups.

Elijah Winter is a Senior Security Engineer in Amazon Security, holding a BS in Cyber Security Engineering and infused with a love for Harry Potter. Elijah excels in identifying and addressing vulnerabilities in AI systems, blending technical expertise with a touch of wizardry. Elijah designs tailored security protocols for AI ecosystems, bringing a magical flair to digital defenses. Integrity driven, Elijah has a security background in both public and commercial sector organizations focused on protecting trust.

Ram Vittal is a Principal ML Solutions Architect at AWS. He has over 3 decades of experience architecting and building distributed, hybrid, and cloud applications. He is passionate about building secure and scalable AI/ML and big data solutions to help enterprise customers with their cloud adoption and optimization journey to improve their business outcomes. In his spare time, he rides his motorcycle and walks with his 3-year-old Sheepadoodle!

Navneet Tuteja is a Data Specialist at Amazon Web Services. Before joining AWS, Navneet worked as a facilitator for organizations seeking to modernize their data architectures and implement comprehensive AI/ML solutions. She holds an engineering degree from Thapar University, as well as a master’s degree in statistics from Texas A&M University.

Emily Soward is a Data Scientist with AWS Professional Services. She holds a Master of Science with Distinction in Artificial Intelligence from the University of Edinburgh in Scotland, United Kingdom with emphasis on Natural Language Processing (NLP). Emily has served in applied scientific and engineering roles focused on AI-enabled product research and development, operational excellence, and governance for AI workloads running at organizations in the public and private sector. She contributes to customer guidance as an AWS Senior Speaker and recently, as an author for AWS Well-Architected in the Machine Learning Lens.

Read More

Deploy a Microsoft Teams gateway for Amazon Q, your business expert

Deploy a Microsoft Teams gateway for Amazon Q, your business expert

Amazon Q is a new generative AI-powered application that helps users get work done. Amazon Q can become your tailored business expert and let you discover content, brainstorm ideas, or create summaries using your company’s data safely and securely. You can use Amazon Q to have conversations, solve problems, generate content, gain insights, and take action by connecting to your company’s information repositories, code, data, and enterprise systems. For more information, see Introducing Amazon Q, a new generative AI-powered assistant (preview).

In this post, we show you how to bring Amazon Q, your business expert, to users in Microsoft Teams. (If you use Slack, refer to Deploy a Slack gateway for Amazon Q, your business expert.)

You’ll be able converse with Amazon Q business expert using Teams direct messages (DMs) to ask questions and get answers based on company data, get help creating new content such as email drafts, summarize attached files, and perform tasks.

You can also invite Amazon Q business expert to participate in your Teams channels. In a channel, users can ask Amazon Q business expert questions in a new message, or tag it in an existing thread at any point, to provide additional data points, resolve a debate, or summarize the conversation and capture the next steps.

Solution overview

Amazon Q business expert is amazingly powerful. Check out the following demo—seeing is believing!

In the demo, our Amazon Q business expert application is populated with some Wikipedia pages. You can populate your Amazon Q business expert application with your own company’s documents and knowledge base articles, so it will be able to answer your specific questions!

Everything you need is provided as open source in our GitHub repo.

In this post, we walk you through the process to deploy Amazon Q business expert in your AWS account and add it to Microsoft Teams. When you’re done, you’ll wonder how you ever managed without it!

The following are some of the things it can do:

  • Respond to messages – In DMs, it responds to all messages. In channels, it responds only to @mentions and responds in a conversation thread.
  • Render answers containing markdown – This includes headings, lists, bold, italics, tables, and more.
  • Track sentiment – It provides thumbs up and thumbs down buttons to track user sentiment.
  • Provide source attribution – It provides references and hyperlinks to sources used by Amazon Q business expert.
  • Understand conversation context – It tracks the conversation and responds based on the context.
  • Stay aware of multiple users – When it’s tagged in a thread, it knows who said what, and when, so it can contribute in context and accurately summarize the thread when asked.
  • Process attached files – It can process up to five attached files for document question answering, summaries, and more.
  • Start new conversations – You can reset and start new conversations in DM chats by using /new_conversation.

In the following sections, we show how to deploy the project to your own AWS account and Teams account, and start experimenting!

Prerequisites

You need to have an AWS account and an AWS Identity and Access Management (IAM) role and user with permissions to create and manage the necessary resources and components for this application. If you don’t have an AWS account, see How do I create and activate a new Amazon Web Services account?

You also need to have an existing, working Amazon Q business expert application. If you haven’t set one up yet, see Creating an Amazon Q application.

Lastly, you need a Microsoft account and a Microsoft Teams subscription to create and publish the app using the steps outlined in this post. If you don’t have these, see if your company can create sandboxes for you to experiment, or create a new account and trial subscription as needed to complete the steps.

Deploy the solution resources

We’ve provided pre-built AWS CloudFormation templates that deploy everything you need in your AWS account.

If you’re a developer and you want to build, deploy, or publish the solution from code, refer to the Developer README.

Complete the following steps to launch the CloudFormation stack:

  1. Log in to the AWS Management Console.
  2. Choose one of the following Launch Stack buttons for your desired AWS Region to open the AWS CloudFormation console and create a new stack.
Region Launch Stack
N. Virginia (us-east-1)
Oregon (us-west-2)
  1. For Stack name, enter a name for your app (for example, AMAZON-Q-TEAMS-GATEWAY).
  2. For AmazonQAppId, enter your existing Amazon Q business expert application ID (for example, 80xxxxx9-7xx3-4xx0-bxx4-5baxxxxx2af5). You can copy it from the Amazon Q business expert console.
  3. For AmazonQRegion, choose the Region where you created your Amazon Q business expert application (us-east-1 or us-west-2).
  4. For AmazonQUserId, enter an Amazon Q business expert user ID email address (leave blank to use a Teams user email as the user ID).
  5. For ContextDaysToLive, enter the length of time to keep conversation metadata cached in Amazon DynamoDB (you can leave this as the default).

When your CloudFormation stack status is CREATE_COMPLETE, choose the Outputs tab, and keep it open—you’ll need it in later steps.

Register a new app in the Microsoft Azure portal

Complete the following steps to register a new app in the Microsoft Azure portal:

  1. Go to the Azure Portal and log in with your Microsoft account.
  2. Choose New registration.
    1. For Name, provide the name for your app. You can keep things simple by using the stack name you used for the CloudFormation stack.
    2. For Who can use this application or access this API?, choose Accounts in this organizational directory only (AWS only – Single tenant).
    3. Choose Register.
    4. Note down the Application (client) ID value and the Directory (tenant) ID from the Overview page. You’ll need them later when asked for MicrosoftAppId and MicrosoftAppTenantId.
  3. Choose Select API permissions in the navigation pane.
    1. Choose Add a permission.
    2. Choose Microsoft Graph.
    3. Choose Application permissions.
    4. Select User.Read.All.
    5. Select ChannelMessage.Read.All.
    6. Select Team.ReadBasic.All.
    7. Select Files.Read.All.
    8. Choose Add permissions. This permission allows the app to read data in your organization’s directory about the signed-in user.
    9. Use the options menu (three dots) on the right to choose Remove permission.
    10. Remove the original User.Read – Delegated permission.
    11. Choose Grant admin consent for Default Directory.
  4. Choose Certificates & secrets in the navigation pane.
    1. Choose New client secret.
    2. For Description, provide a value, such as description of my client secret.
    3. Choose a value for Expires. Note that in production, you’ll need to manually rotate your secret before it expires.
    4. Choose Add.
    5. Note down the value for your new secret. You’ll need it later when asked for MicrosoftAppPassword.
  5. Optionally, choose Owners to add any additional owners for the application.

Register your new app in the Microsoft Bot Framework

Complete the following steps to register your app in the Microsoft Bot Framework:

  1. Go to the Microsoft Bot Framework and log in with your Microsoft account.
  2. Optionally, you can create and upload a custom icon for your new Amazon Q business expert bot. For example, we created the following using Amazon Bedrock image playground.
  1. Enter your preferred display name, bot handle, and description.
  2. For Messaging endpoint, copy and paste the value of TeamsEventHandlerApiEndpoint from your stack Outputs tab.
  3. Do not select Enable Streaming Endpoint.
  4. For App type, choose Single Tenant.
  5. For Paste your app ID below to continue, enter the MicrosoftAppId value you noted earlier.
  6. For App Tenant ID, enter the MicrosoftAppTenantId value you noted earlier.
  7. Leave the other values as they are, agree to the terms, and choose Register.
  8. On the Channels page, under Add a featured channel, choose Microsoft Teams.
  9. Choose Microsoft Teams Commercial (most common), then choose Save.
  10. Agree to the Terms of Service and choose Agree.

Configure your secrets in AWS

Let’s configure your Teams secrets in order to verify the signature of each request and post on behalf of your Amazon Q business expert bot.

In this example, we are not enabling Teams token rotation. You can enable it for a production app by implementing rotation via AWS Secrets Manager. Create an issue (or, better yet, a pull request) in the GitHub repo if you want this feature added to a future version.

Complete the following steps to configure a secret in Secrets Manager:

  1. On the AWS CloudFormation console, navigate to your stack Outputs tab and choose the link for TeamsSecretConsoleUrl to be redirected to the Secrets Manager console.
  2. Choose Retrieve secret value.
  3. Choose Edit.
  4. Replace the values of MicrosoftAppId, MicrosoftAppPassword, and MicrosoftAppTenantId with the values you noted in the previous steps.

Deploy your app into Microsoft Teams

Complete the following steps to deploy the app to Teams:

  1. Go to the Developer Portal for Teams and log in with your Microsoft Teams user account.
  2. Choose Apps in the navigation pane, then choose New app.
    1. For Name, enter your bot name.
    2. Enter a name for Full name and both short and full descriptions (you can use the bot name for them all if you want, just don’t leave them empty).
    3. Enter values for Developer information and App URLs. For testing, you can make up values, and URLs like https://www.anycompany.com/. Use real ones for production.
    4. For Application (client) ID*, enter the value of MicrosoftAppId from earlier.
    5. Choose Save.
  3. Under Branding, you can upload AI-generated icons, or different icons, or none at all, it’s up to you. The following are some examples:
    1. Color icon 192×192
    2. Outline icon 32×32
  4. Under App features, choose Bot.
    1. Select Enter a bot ID, and enter the MicrosoftAppId value from the earlier steps.
    2. Under What can your bot do?, select Upload and download files.
    3. Under Select the scopes in which people can use this command, select Personal, Team, and Group chat.
    4. Choose Save.
  5. Choose Publish.
  6. Choose Download the app package to download a .zip file to your computer.
  7. Choose Preview in Teams to launch Microsoft Teams (work or school) app.
    1. In the navigation pane, choose Apps, then Manage your apps, then Upload an app.
    2. Choose Upload an app to your orgs app catalog, and select the .zip file you downloaded. This adds the app to Teams.
    3. Select the card for your new app, choose Add, and wait for it to complete (10–20 seconds).

Add your bot to one or more teams

Complete the following step to add your bot to a team:

  1. In the Teams app, select your team and choose Manage team.
  2. On the Apps tab, choose the new Amazon Q business expert app, and choose Add.

Now you can test your bot in Microsoft Teams!

Start using Amazon Q business expert

Complete the following steps to start using Amazon Q business expert in Teams:

  1. Open your Teams client.
  2. Under Apps, add your new Amazon Q business expert app to a chat.
  3. Optionally, add your Amazon Q business expert app to one or more Teams channels.
  4. In the app DM chat, enter Hello.

You have now deployed a powerful new AI assistant into your sandbox Teams environment.

Play with it, try all the features discussed in this post, and copy the things you saw in the demo video. Most importantly, you can ask about topics related to the documents that you have ingested into your own Amazon Q business expert application. But don’t stop there. You can find additional ways to make it useful, and when you do, let us know by posting a comment.

Once you are convinced how useful it is, talk to your Teams admins (show them this post) and work with them to deploy it in your company’s Teams organizations. Your fellow employees will thank you!

Clean up

When you’re finished experimenting with this solution, delete your app in Microsoft Teams, Bot Framework, and Azure portal. Then clean up your AWS resources by opening the AWS CloudFormation console and deleting the AMAZON-Q-TEAMS-GATEWAY stack that you deployed. This deletes the resources that you created by deploying the solution.

Conclusions

The sample Amazon Q business expert Teams application discussed in this post is provided as open source—you can use it as a starting point for your own solution, and help us make it better by contributing back fixes and features via GitHub pull requests. Explore the code, choose Watch in the GitHub repo to be notified of new releases, and check back for the latest updates. We’d also love to hear your suggestions for improvements and features.

For more information on Amazon Q business expert, refer to the Amazon Q (For Business Use) Developer Guide.


About the Authors

Gary Benattar is a Senior Software Development Manager in AWS HR. Gary started at Amazon in 2012 as an intern, focusing on building scalable, real-time outlier detection systems. He worked in Seattle and Luxembourg and is now based in Tel Aviv, Israel, where he dedicates his time to building software to revolutionize the future of Human Resources. He co-founded a startup, Zengo, with a focus on making digital wallets secure through multi-party computation. He received his MSc in Software Engineering from Sorbonne University in Paris.


Bob Strahan

Bob Strahan is a Principal Solutions Architect in the AWS Language AI Services team.

Read More

Build enterprise-ready generative AI solutions with Cohere foundation models in Amazon Bedrock and Weaviate vector database on AWS Marketplace

Build enterprise-ready generative AI solutions with Cohere foundation models in Amazon Bedrock and Weaviate vector database on AWS Marketplace

Generative AI solutions have the potential to transform businesses by boosting productivity and improving customer experiences, and using large language models (LLMs) with these solutions has become increasingly popular. Building proofs of concept is relatively straightforward because cutting-edge foundation models are available from specialized providers through a simple API call. Therefore, organizations of various sizes and across different industries have begun to reimagine their products and processes using generative AI.

Despite their wealth of general knowledge, state-of-the-art LLMs only have access to the information they were trained on. This can lead to factual inaccuracies (hallucinations) when the LLM is prompted to generate text based on information they didn’t see during their training. Therefore, it’s crucial to bridge the gap between the LLM’s general knowledge and your proprietary data to help the model generate more accurate and contextual responses while reducing the risk of hallucinations. The traditional method of fine-tuning, although effective, can be compute-intensive, expensive, and requires technical expertise. Another option to consider is called Retrieval Augmented Generation (RAG), which provides LLMs with additional information from an external knowledge source that can be updated easily.

Additionally, enterprises must ensure data security when handling proprietary and sensitive data, such as personal data or intellectual property. This is particularly important for organizations operating in heavily regulated industries, such as financial services and healthcare and life sciences. Therefore, it’s important to understand and control the flow of your data through the generative AI application: Where is the model located? Where is the data processed? Who has access to the data? Will the data be used to train models, eventually risking the leak of sensitive data to public LLMs?

This post discusses how enterprises can build accurate, transparent, and secure generative AI applications while keeping full control over proprietary data. The proposed solution is a RAG pipeline using an AI-native technology stack, whose components are designed from the ground up with AI at their core, rather than having AI capabilities added as an afterthought. We demonstrate how to build an end-to-end RAG application using Cohere’s language models through Amazon Bedrock and a Weaviate vector database on AWS Marketplace. The accompanying source code is available in the related GitHub repository hosted by Weaviate. Although AWS will not be responsible for maintaining or updating the code in the partner’s repository, we encourage customers to connect with Weaviate directly regarding any desired updates.

Solution overview

The following high-level architecture diagram illustrates the proposed RAG pipeline with an AI-native technology stack for building accurate, transparent, and secure generative AI solutions.

Figure 1: RAG workflow using Cohere’s language models through Amazon Bedrock and a Weaviate vector database on AWS Marketplace

As a preparation step for the RAG workflow, a vector database, which serves as the external knowledge source, is ingested with the additional context from the proprietary data. The actual RAG workflow follows the four steps illustrated in the diagram:

  1. The user enters their query.
  2. The user query is used to retrieve relevant additional context from the vector database. This is done by generating the vector embeddings of the user query with an embedding model to perform a vector search to retrieve the most relevant context from the database.
  3. The retrieved context and the user query are used to augment a prompt template. The retrieval-augmented prompt helps the LLM generate a more relevant and accurate completion, minimizing hallucinations.
  4. The user receives a more accurate response based on their query.

The AI-native technology stack illustrated in the architecture diagram has two key components: Cohere language models and a Weaviate vector database.

Cohere language models in Amazon Bedrock

The Cohere Platform brings language models with state-of-the-art performance to enterprises and developers through a simple API call. There are two key types of language processing capabilities that the Cohere Platform provides—generative and embedding—and each is served by a different type of model:

  • Text generation with Command – Developers can access endpoints that power generative AI capabilities, enabling applications such as conversational, question answering, copywriting, summarization, information extraction, and more.
  • Text representation with Embed – Developers can access endpoints that capture the semantic meaning of text, enabling applications such as vector search engines, text classification and clustering, and more. Cohere Embed comes in two forms, an English language model and a multilingual model, both of which are now available on Amazon Bedrock.

The Cohere Platform empowers enterprises to customize their generative AI solution privately and securely through the Amazon Bedrock deployment. Amazon Bedrock is a fully managed cloud service that enables development teams to build and scale generative AI applications quickly while helping keep your data and applications secure and private. Your data is not used for service improvements, is never shared with third-party model providers, and remains in the Region where the API call is processed. The data is always encrypted in transit and at rest, and you can encrypt the data using your own keys. Amazon Bedrock supports security requirements, including U.S. Health Insurance Portability and Accountability Act (HIPAA) eligibility and General Data Protection Regulation (GDPR) compliance. Additionally, you can securely integrate and easily deploy your generative AI applications using the AWS tools you are already familiar with.

Weaviate vector database on AWS Marketplace

Weaviate is an AI-native vector database that makes it straightforward for development teams to build secure and transparent generative AI applications. Weaviate is used to store and search both vector data and source objects, which simplifies development by eliminating the need to host and integrate separate databases. Weaviate delivers subsecond semantic search performance and can scale to handle billions of vectors and millions of tenants. With a uniquely extensible architecture, Weaviate integrates natively with Cohere foundation models deployed in Amazon Bedrock to facilitate the convenient vectorization of data and use its generative capabilities from within the database.

The Weaviate AI-native vector database gives customers the flexibility to deploy it as a bring-your-own-cloud (BYOC) solution or as a managed service. This showcase uses the Weaviate Kubernetes Cluster on AWS Marketplace, part of Weaviate’s BYOC offering, which allows container-based scalable deployment inside your AWS tenant and VPC with just a few clicks using an AWS CloudFormation template. This approach ensures that your vector database is deployed in your specific Region close to the foundation models and proprietary data to minimize latency, support data locality, and protect sensitive data while addressing potential regulatory requirements, such as GDPR.

Use case overview

In the following sections, we demonstrate how to build a RAG solution using the AI-native technology stack with Cohere, AWS, and Weaviate, as illustrated in the solution overview.

The example use case generates targeted advertisements for vacation stay listings based on a target audience. The goal is to use the user query for the target audience (for example, “family with small children”) to retrieve the most relevant vacation stay listing (for example, a listing with playgrounds close by) and then to generate an advertisement for the retrieved listing tailored to the target audience.

Figure 2: First few rows of vacation stay listings available from Inside Airbnb.

The dataset is available from Inside Airbnb and is licensed under a Creative Commons Attribution 4.0 International License. You can find the accompanying code in the GitHub repository.

Prerequisites

To follow along and use any AWS services in the following tutorial, make sure you have an AWS account.

Enable components of the AI-native technology stack

First, you need to enable the relevant components discussed in the solution overview in your AWS account. Complete the following steps:

  1. In the left Amazon Bedrock console, choose Model access in the navigation pane.
  2. Choose Manage model access on the top right.
  3. Select the foundation models of your choice and request access.

Figure 3: Manage model access in Amazon Bedrock console.

Next, you set up a Weaviate cluster.

  1. Subscribe to the Weaviate Kubernetes Cluster on AWS Marketplace.
  2. Launch the software using a CloudFormation template according to your preferred Availability Zone.

The CloudFormation template is pre-populated with default values.

  1. For Stack name, enter a stack name.
  2. For helmauthenticationtype, it is recommended to enable authentication by setting helmauthenticationtype to apikey and defining a helmauthenticationapikey.
  3. For helmauthenticationapikey, enter your Weaviate API key.
  4. For helmchartversion, enter your version number. It must be at least v.16.8.0. Refer to the GitHub repo for the latest version.
  5. For helmenabledmodules, make sure tex2vec-aws and generative-aws are present in the list of enabled modules within Weaviate.

Figure 4: CloudFormation template.

This template takes about 30 minutes to complete.

Connect to Weaviate

Complete the following steps to connect to Weaviate:

  1. In the Amazon SageMaker console, navigate to Notebook instances in the navigation pane via Notebook > Notebook instances on the left.
  2. Create a new notebook instance.
  3. Install the Weaviate client package with the required dependencies:
$ pip install weaviate-client
  1. Connect to your Weaviate instance with the following code:
import weaviate

client = weaviate.Client(
  url = "http://<YOUR-WEAVIATE-URL>",
 auth_client_secret=weaviate.AuthApiKey(api_key="YOUR-WEAVIATE-API-KEY"),
    additional_headers={
        "X-AWS-Access-Key": "<YOUR-AWS-ACCESS-KEY>",
        "X-AWS-Secret-Key": "<YOUR-AWS-SECRET-ACCESS-KEY>"
    }
)

Provide the following information:

  • Weaviate URL – Access Weaviate via the load balancer URL. In the Amazon Elastic Compute Cloud (Amazon EC2) console, choose Load balancers in the navigation pane and find the load balancer. Look for the DNS name column and add http:// in front of it.
  • Weaviate API key – This is the key you set earlier in the CloudFormation template (helmauthenticationapikey).
  • AWS access key and secret access key – You can retrieve the access key and secret access key for your user in the AWS Identity and Access Management (IAM) console.

Figure 5: AWS Identity and Access Management (IAM) console to retrieve AWS access key and secret access key.

Configure the Amazon Bedrock module to enable Cohere models

Next, you define a data collection (class) called Listings to store the listings’ data objects, which is analogous to creating a table in a relational database. In this step, you configure the relevant modules to enable the usage of Cohere language models hosted on Amazon Bedrock natively from within the Weaviate vector database. The vectorizer (“text2vec-aws“) and generative module (“generative-aws“) are specified in the data collection definition. Both of these modules take three parameters:

  • “service” – Use “bedrock” for Amazon Bedrock (alternatively, use “sagemaker” for Amazon SageMaker JumpStart)
  • “Region” – Enter the Region where your model is deployed
  • “model” – Provide the foundation model’s name

See the following code:

collection_definition = {
    "class": "Listings",
    "moduleConfig": {
        "text2vec-aws": {
            "service": "bedrock",
            "region": "us-east-1",
            "model": "cohere.embed-english-v3",
        },
        "generative-aws": {
            "service": "bedrock",
            "region": "us-east-1",
            "model": "cohere.command-text-v14"
        }
    },
    "vectorizer": "text2vec-aws"
}

Ingest data into the Weaviate vector database

In this step, you define the structure of the data collection by configuring its properties. Aside from the property’s name and data type, you can also configure if only the data object will be stored or if it will be stored together with its vector embeddings. In this example, host_name and property_type are not vectorized:

collection_definition["properties"] = [
        { "name": "host_name", "dataType": ["text"], 
     "moduleConfig": {"text2vec-aws": {"skip": True}}
        },
        { "name": "property_type", "dataType": ["text"], 
     "moduleConfig": {"text2vec-aws": {"skip": True}}
        }
        { "name": "description", "dataType": ["text"] },
        {"name": "neighborhood_overview", "dataType": ["text"] },
]

Run the following code to create the collection in your Weaviate instance:

client.schema.create_class(collection_definition)

You can now add objects to Weaviate. You use a batch import process for maximum efficiency. Run the following code to import data. During the import, Weaviate will use the defined vectorizer to create a vector embedding for each object. The following code loads objects, initializes a batch process, and adds objects to the target collection one by one:

from weaviate.util import generate_uuid5
import pandas as pd

# Read CSV file
csv_file = './data/listings.csv'
df = pd.read_csv(csv_file, usecols = ['host_name', 
                                      'property_type',
                                      'description', 
                                      'neighborhood_overview', 
                                        ])

df.fillna('', inplace=True)

# Configure batch
client.batch.configure(batch_size=100) 

# Initialize batch process
with client.batch as batch:
    for _, row in df.iterrows():
        listing_object = {
            "host_name": row["host_name"],
            "property_type" : row["property_type"],
            "description": row["description"],
            "neighborhood_overview" : row["neighborhood_overview"],
        }
        batch.add_data_object(
            class_name = "Listings",
            data_object = listing_object,
            uuid = generate_uuid5(listing_object)
        )

Retrieval Augmented Generation

You can build a RAG pipeline by implementing a generative search query on your Weaviate instance. For this, you first define a prompt template in the form of an f-string that can take in the user query ({target_audience}) directly and the additional context ({{host_name}}, {{property_type}}, {{description}}, and {{neighborhood_overview}}) from the vector database at runtime:

   prompt_template = f"""You are a copywriter.
    Write a short advertisement for the following vacation stay.
    Host: {{host_name}}
    Property type: {{property_type}}
    Description: {{description}}
    Neighborhood: {{neighborhood_overview}}
    Target audience: {target_audience}
    """

Next, you run a generative search query. This prompts the defined generative model with a prompt that is comprised of the user query as well as the retrieved data. The following query retrieves one listing object (.with_limit(1)) from the Listings collection that is most similar to the user query (.with_near_text({"concepts": target_audience})). Then the user query (target_audience) and the retrieved listings properties (["description", "neighborhood", "host_name", "property_type"]) are fed into the prompt template. See the following code:

   result = client.query
                .get("Listings", 
            ["description", "neighborhood", "host_name", "property_type"])
                .with_near_text({"concepts": target_audience})
                .with_limit(1)
                .with_generate(single_prompt=prompt_template)
                .do()

In the following example, you can see that the preceding piece of code for target_audience = “Family with small children” retrieves a listing from the host Marre. The prompt template is augmented with Marre’s listing details and the target audience:

"You are a copywriter.
Write a short advertisement for the following vacation stay.
Host: Marre
Property type: Entire townhouse
Description: Welcome to our lovely home! You've come to the right place ...
Neighborhood: THE NEIGHBORHOOD:<br /><br />We are in the city centre ...
Target audience: Family with small children"

Based on the retrieval-augmented prompt, Cohere’s Command model generates the following targeted advertisement:

"Looking for a kid-friendly home away from home in one of the trendiest areas of 
Amsterdam? Look no further than this stylish townhouse in the heart of the city! 
Our 120ft² space is perfect for a family of four or a group of adults, with two 
bedrooms featuring elevated beds suitable for kids and one bedroom with a single 
bed. The ground floor features a spacious living room, a kitchen with a large 
dining table, and a half bath while heading upstairs leads you to a master bedroom 
and a full bathroom. Our central location means you're just steps away from the 
best cafes, restaurants, and bars that the city has to offer, and the Vondelpark 
and other attractions are only a short walk away! Supermarkets and paid parking 
are also conveniently located nearby. Experience the best of Amsterdam in a 
laid-back,local way and create unforgettable memories with your family at our 
cozy townhouse."

Alternative customizations

You can make alternative customizations to different components in the proposed solution, such as the following:

  • Cohere’s language models are also available through Amazon SageMaker JumpStart, which provides access to cutting-edge foundation models and enables developers to deploy LLMs to Amazon SageMaker, a fully managed service that brings together a broad set of tools to enable high-performance, low-cost machine learning for any use case. Weaviate is integrated with SageMaker as well.
  • A powerful addition to this solution is the Cohere Rerank endpoint, available through SageMaker JumpStart. Rerank can improve the relevance of search results from lexical or semantic search. Rerank works by computing semantic relevance scores for documents that are retrieved by a search system and ranking the documents based on these scores. Adding Rerank to an application requires only a single line of code change.
  • To cater to different deployment requirements of different production environments, Weaviate can be deployed in various additional ways. For example, it is available as a direct download from Weaviate website, which runs on Amazon Elastic Kubernetes Service (Amazon EKS) or locally via Docker or Kubernetes. It’s also available as a managed service that can run securely within a VPC or as a public cloud service hosted on AWS with a 14-day free trial.
  • You can serve your solution in a VPC using Amazon Virtual Private Cloud (Amazon VPC), which enables organizations to launch AWS services in a logically isolated virtual network, resembling a traditional network but with the benefits of AWS’s scalable infrastructure. Depending on the classified level of sensitivity of the data, organizations can also disable internet access in these VPCs.

Clean up

To prevent unexpected charges, delete all the resources you deployed as part of this post. If you launched the CloudFormation stack, you can delete it via the AWS CloudFormation console. Note that there may be some AWS resources, such as Amazon Elastic Block Store (Amazon EBS) volumes and AWS Key Management Service (AWS KMS) keys, that may not be deleted automatically when the CloudFormation stack is deleted.

Figure 6: Delete all resources via the AWS CloudFormation console.

Conclusion

This post discussed how enterprises can build accurate, transparent, and secure generative AI applications while still having full control over their data. The proposed solution is a RAG pipeline using an AI-native technology stack as a combination of Cohere foundation models in Amazon Bedrock and a Weaviate vector database on AWS Marketplace. The RAG approach enables enterprises to bridge the gap between the LLM’s general knowledge and the proprietary data while minimizing hallucinations. An AI-native technology stack enables fast development and scalable performance.

You can start experimenting with RAG proofs of concept for your enterprise-ready generative AI applications using the steps outlined in this post. The accompanying source code is available in the related GitHub repository. Thank you for reading. Feel free to provide comments or feedback in the comments section.


About the authors

James Yi is a Senior AI/ML Partner Solutions Architect in the Technology Partners COE Tech team at Amazon Web Services. He is passionate about working with enterprise customers and partners to design, deploy, and scale AI/ML applications to derive business value. Outside of work, he enjoys playing soccer, traveling, and spending time with his family.

Leonie Monigatti is a Developer Advocate at Weaviate. Her focus area is AI/ML, and she helps developers learn about generative AI. Outside of work, she also shares her learnings in data science and ML on her blog and on Kaggle.

Meor Amer is a Developer Advocate at Cohere, a provider of cutting-edge natural language processing (NLP) technology. He helps developers build cutting-edge applications with Cohere’s Large Language Models (LLMs).

Shun Mao is a Senior AI/ML Partner Solutions Architect in the Emerging Technologies team at Amazon Web Services. He is passionate about working with enterprise customers and partners to design, deploy and scale AI/ML applications to derive their business values. Outside of work, he enjoys fishing, traveling and playing Ping-Pong.

Read More

Build a vaccination verification solution using the Queries feature in Amazon Textract

Build a vaccination verification solution using the Queries feature in Amazon Textract

Amazon Textract is a machine learning (ML) service that enables automatic extraction of text, handwriting, and data from scanned documents, surpassing traditional optical character recognition (OCR). It can identify, understand, and extract data from tables and forms with remarkable accuracy. Presently, several companies rely on manual extraction methods or basic OCR software, which is tedious and time-consuming, and requires manual configuration that needs updating when the form changes. Amazon Textract helps solve these challenges by utilizing ML to automatically process different document types and accurately extract information with minimal manual intervention. This enables you to automate document processing and use the extracted data for different purposes, such as automating loans processing or gathering information from invoices and receipts.

As travel resumes post-pandemic, verifying a traveler’s vaccination status may be required in many cases. Hotels and travel agencies often need to review vaccination cards to gather important details like whether the traveler is fully vaccinated, vaccine dates, and the traveler’s name. Some agencies do this through manual verification of cards, which can be time-consuming for staff and leaves room for human error. Others have built custom solutions, but these can be costly and difficult to scale, and take significant time to implement. Moving forward, there may be opportunities to streamline the vaccination status verification process in a way that is efficient for businesses while respecting travelers’ privacy and convenience.

Amazon Textract Queries helps address these challenges. Amazon Textract Queries allows you to specify and extract only the piece of information that you need from the document. It gives you precise and accurate information from the document.

In this post, we walk you through a step-by-step implementation guide to build a vaccination status verification solution using Amazon Textract Queries. The solution showcases how to process vaccination cards using an Amazon Textract query, verify the vaccination status, and store the information for future use.

Solution overview

The following diagram illustrates the solution architecture.

The workflow includes the following steps:

  1. The user takes a photo of a vaccination card.
  2. The image is uploaded to an Amazon Simple Storage Service (Amazon S3) bucket.
  3. When the image gets saved in the S3 bucket, it invokes an AWS Step Functions workflow:
  4. The Queries-Decider AWS Lambda function examines the document passed in and adds information about the mime type, the number of pages, and the number of queries to the Step Functions workflow (for our example, we have four queries).
  5. NumberQueriesAndPagesChoice is a Choice state that adds conditional logic to a workflow. If there are between 15–31 queries and the number of pages is between 2–3,001, then Amazon Textract asynchronous processing is the only option, because synchronous APIs only support up to 15 queries and one-page documents. For all other cases, we route to the random selection of synchronous or asynchronous processing.
  6. The TextractSync Lambda function sends a request to Amazon Textract to analyze the document based on the following Amazon Textract queries:
    1. What is Vaccination Status?
    2. What is Name?
    3. What is Date of Birth?
    4. What is Document Number?
  7. Amazon Textract analyzes the image and sends the answers of these queries back to the Lambda function.
  8. The Lambda function verifies the customer’s vaccination status and stores the final result in CSV format in the same S3 bucket (demoqueries-textractxxx) in the csv-output folder.

Prerequisites

To complete this solution, you should have an AWS account and the appropriate permissions to create the resources required as part of the solution.

Download the deployment code and sample vaccination card from GitHub.

Use the Queries feature on the Amazon Textract console

Before you build the vaccination verification solution, let’s explore how you can use Amazon Textract Queries to extract vaccination status via the Amazon Textract console. You can use the vaccination card sample you downloaded from the GitHub repo.

  1. On the Amazon Textract console, choose Analyze Document in the navigation pane.
  2. Under Upload document, choose Choose document to upload the vaccination card from your local drive.
  3. After you upload the document, select Queries in the Configure Document section.
  4. You can then add queries in the form of natural language questions. Let’s add the following:
    • What is Vaccination Status?
    • What is Name?
    • What is Date of Birth?
    • What is Document Number?
  5. After you add all your queries, choose Apply configuration.
  6. Check the Queries tab to see the answers to the questions.

You can see Amazon Textract extracts the answer to your query from the document.

Deploy the vaccination verification solution

In this post, we use an AWS Cloud9 instance and install the necessary dependencies on the instance with the AWS Cloud Development Kit (AWS CDK) and Docker. AWS Cloud9 is a cloud-based integrated development environment (IDE) that lets you write, run, and debug your code with just a browser.

  1. In the terminal, choose Upload Local Files on the File menu.
  2. Choose Select folder and choose the vaccination_verification_solution folder you downloaded from GitHub.
  3. In the terminal, prepare your serverless application for subsequent steps in your development workflow in AWS Serverless Application Model (AWS SAM) using the following command:
    $ cd vaccination_verification_solution/
    $ pip install -r requirements.txt
    

  4. Deploy the application using the cdk deploy command:
    cdk deploy DemoQueries --outputs-file demo_queries.json --require-approval never

    Wait for the AWS CDK to deploy the model and create the resources mentioned in the template.

  5. When deployment is complete, you can check the deployed resources on the AWS CloudFormation console on the Resources tab of the stack details page.

Test the solution

Now it’s time to test the solution. To trigger the workflow, use aws s3 cp to upload the vac_card.jpg file to DemoQueries.DocumentUploadLocation inside the docs folder:

aws s3 cp docs/vac_card.JPG $(aws cloudformation list-exports --query 'Exports[?Name==`DemoQueries-DocumentUploadLocation`].Value' --output text)


The vaccination certificate file automatically gets uploaded to the S3 bucket demoqueries-textractxxx in the uploads folder.

The Step Functions workflow is triggered via a Lambda function as soon as the vaccination certificate file is uploaded to the S3 bucket.

The Queries-Decider Lambda function examines the document and adds information about the mime type, the number of pages, and the number of queries to the Step Functions workflow (for this example, we use four queries—document number, customer name, date of birth, and vaccination status).

The TextractSync function sends the input queries to Amazon Textract and synchronously returns the full result as part of the response. It supports 1-page documents (TIFF, PDF, JPG, PNG) and up to 15 queries. The GenerateCsvTask function takes the JSON output from Amazon Textract and converts it to a CSV file.

The final output is stored in the same S3 bucket in the csv-output folder as a CSV file.

You can download the file to your local machine using the following command:

aws s3 cp <paste the S3 URL from TextractOutputCSVPath>

The format of the result is timestamp, classification, filename, page number, key name, key_confidence, value, value_confidence, key_bb_top, key_bb_height, key_bb.width, key_bb_left, value_bb_top, value_bb_height, value_bb_width, value_bb_left.

You can scale the solution to hundreds of vaccination certificate documents for multiple customers by uploading their vaccination certificates to DemoQueries.DocumentUploadLocation. This automatically triggers multiple runs of the Step Functions state machine, and the final result is stored in the same S3 bucket in the csv-output folder.

To change the initial set of queries that are fed into Amazon Textract, you can go to your AWS Cloud9 instance and open the start_execution.py file. In the file view in the left pane, navigate to lambda, start_queries, app, start_execution.py. This Lambda function is invoked when a file is uploaded to DemoQueries.DocumentUploadLocation. The queries sent to the workflow are defined in start_execution.py; you can change those by updating the code as shown in the following screenshot.

Clean up

To avoid incurring ongoing charges, delete the resources created in this post using the following command:

cdk destroy DemoQueries

Answer the question Are you sure you want to delete: DemoQueries (y/n)? with y.

Conclusion

In this post, we showed you how to use Amazon Textract Queries to build a vaccination verification solution for the travel industry. You can use Amazon Textract Queries to build solutions in other industries like finance and healthcare, and retrieve information from documents such as paystubs, mortgage notes, and insurance cards based on natural language questions.

For more information, see Analyzing Documents, or check out the Amazon Textract console and try out this feature.


About the Authors

Dhiraj Thakur is a Solutions Architect with Amazon Web Services. He works with AWS customers and partners to provide guidance on enterprise cloud adoption, migration, and strategy. He is passionate about technology and enjoys building and experimenting in the analytics and AI/ML space.

Rishabh Yadav is a Partner Solutions architect at AWS with an extensive background in DevOps and Security offerings at AWS. He works with ASEAN partners to provide guidance on enterprise cloud adoption and architecture reviews along with building AWS practices through the implementation of the Well-Architected Framework. Outside of work, he likes to spend his time in the sports field and FPS gaming.

Read More