Monitor embedding drift for LLMs deployed from Amazon SageMaker JumpStart

Monitor embedding drift for LLMs deployed from Amazon SageMaker JumpStart

One of the most useful application patterns for generative AI workloads is Retrieval Augmented Generation (RAG). In the RAG pattern, we find pieces of reference content related to an input prompt by performing similarity searches on embeddings. Embeddings capture the information content in bodies of text, allowing natural language processing (NLP) models to work with language in a numeric form. Embeddings are just vectors of floating point numbers, so we can analyze them to help answer three important questions: Is our reference data changing over time? Are the questions users are asking changing over time? And finally, how well is our reference data covering the questions being asked?

In this post, you’ll learn about some of the considerations for embedding vector analysis and detecting signals of embedding drift. Because embeddings are an important source of data for NLP models in general and generative AI solutions in particular, we need a way to measure whether our embeddings are changing over time (drifting). In this post, you’ll see an example of performing drift detection on embedding vectors using a clustering technique with large language models (LLMS) deployed from Amazon SageMaker JumpStart. You’ll also be able to explore these concepts through two provided examples, including an end-to-end sample application or, optionally, a subset of the application.

Overview of RAG

The RAG pattern lets you retrieve knowledge from external sources, such as PDF documents, wiki articles, or call transcripts, and then use that knowledge to augment the instruction prompt sent to the LLM. This allows the LLM to reference more relevant information when generating a response. For example, if you ask an LLM how to make chocolate chip cookies, it can include information from your own recipe library. In this pattern, the recipe text is converted into embedding vectors using an embedding model, and stored in a vector database. Incoming questions are converted to embeddings, and then the vector database runs a similarity search to find related content. The question and the reference data then go into the prompt for the LLM.

Let’s take a closer look at the embedding vectors that get created and how to perform drift analysis on those vectors.

Analysis on embedding vectors

Embedding vectors are numeric representations of our data so analysis of these vectors can provide insight into our reference data that can later be used to detect potential signals of drift. Embedding vectors represent an item in n-dimensional space, where n is often large. For example, the GPT-J 6B model, used in this post, creates vectors of size 4096. To measure drift, assume that our application captures embedding vectors for both reference data and incoming prompts.

We start by performing dimension reduction using Principal Component Analysis (PCA). PCA tries to reduce the number of dimensions while preserving most of the variance in the data. In this case, we try to find the number of dimensions that preserves 95% of the variance, which should capture anything within two standard deviations.

Then we use K-Means to identify a set of cluster centers. K-Means tries to group points together into clusters such that each cluster is relatively compact and the clusters are as distant from each other as possible.

We calculate the following information based on the clustering output shown in the following figure:

  • The number of dimensions in PCA that explain 95% of the variance
  • The location of each cluster center, or centroid

Additionally, we look at the proportion (higher or lower) of samples in each cluster, as shown in the following figure.

Finally, we use this analysis to calculate the following:

  • Inertia – Inertia is the sum of squared distances to cluster centroids, which measures how well the data was clustered using K-Means.
  • Silhouette score – The silhouette score is a measure for the validation of the consistency within clusters, and ranges from -1 to 1. A value close to 1 means that the points in a cluster are close to the other points in the same cluster and far from the points of the other clusters. A visual representation of the silhouette score can be seen in the following figure.

We can periodically capture this information for snapshots of the embeddings for both the source reference data and the prompts. Capturing this data allows us to analyze potential signals of embedding drift.

Detecting embedding drift

Periodically, we can compare the clustering information through snapshots of the data, which includes the reference data embeddings and the prompt embeddings. First, we can compare the number of dimensions needed to explain 95% of the variation in the embedding data, the inertia, and the silhouette score from the clustering job. As you can see in the following table, compared to a baseline, the latest snapshot of embeddings requires 39 more dimensions to explain the variance, indicating that our data is more dispersed. The inertia has gone up, indicating that the samples are in aggregate farther away from their cluster centers. Additionally, the silhouette score has gone down, indicating that the clusters are not as well defined. For prompt data, that might indicate that the types of questions coming into the system are covering more topics.

Next, in the following figure, we can see how the proportion of samples in each cluster has changed over time. This can show us whether our newer reference data is broadly similar to the previous set, or covers new areas.

Finally, we can see if the cluster centers are moving, which would show drift in the information in the clusters, as shown in the following table.

Reference data coverage for incoming questions

We can also evaluate how well our reference data aligns to the incoming questions. To do this, we assign each prompt embedding to a reference data cluster. We compute the distance from each prompt to its corresponding center, and look at the mean, median, and standard deviation of those distances. We can store that information and see how it changes over time.

The following figure shows an example of analyzing the distance between the prompt embedding and reference data centers over time.

As you can see, the mean, median, and standard deviation distance statistics between prompt embeddings and reference data centers is decreasing between the initial baseline and the latest snapshot. Although the absolute value of the distance is difficult to interpret, we can use the trends to determine if the semantic overlap between reference data and incoming questions is getting better or worse over time.

Sample application

In order to gather the experimental results discussed in the previous section, we built a sample application that implements the RAG pattern using embedding and generation models deployed through SageMaker JumpStart and hosted on Amazon SageMaker real-time endpoints.

The application has three core components:

  • We use an interactive flow, which includes a user interface for capturing prompts, combined with a RAG orchestration layer, using LangChain.
  • The data processing flow extracts data from PDF documents and creates embeddings that get stored in Amazon OpenSearch Service. We also use these in the final embedding drift analysis component of the application.
  • The embeddings are captured in Amazon Simple Storage Service (Amazon S3) via Amazon Kinesis Data Firehose, and we run a combination of AWS Glue extract, transform, and load (ETL) jobs and Jupyter notebooks to perform the embedding analysis.

The following diagram illustrates the end-to-end architecture.

The full sample code is available on GitHub. The provided code is available in two different patterns:

  • Sample full-stack application with a Streamlit frontend – This provides an end-to-end application, including a user interface using Streamlit for capturing prompts, combined with the RAG orchestration layer, using LangChain running on Amazon Elastic Container Service (Amazon ECS) with AWS Fargate
  • Backend application – For those that don’t want to deploy the full application stack, you can optionally choose to only deploy the backend AWS Cloud Development Kit (AWS CDK) stack, and then use the Jupyter notebook provided to perform RAG orchestration using LangChain

To create the provided patterns, there are several prerequisites detailed in the following sections, starting with deploying the generative and text embedding models then moving on to the additional prerequisites.

Deploy models through SageMaker JumpStart

Both patterns assume the deployment of an embedding model and generative model. For this, you’ll deploy two models from SageMaker JumpStart. The first model, GPT-J 6B, is used as the embedding model and the second model, Falcon-40b, is used for text generation.

You can deploy each of these models through SageMaker JumpStart from the AWS Management Console, Amazon SageMaker Studio, or programmatically. For more information, refer to How to use JumpStart foundation models. To simplify the deployment, you can use the provided notebook derived from notebooks automatically created by SageMaker JumpStart. This notebook pulls the models from the SageMaker JumpStart ML hub and deploys them to two separate SageMaker real-time endpoints.

The sample notebook also has a cleanup section. Don’t run that section yet, because it will delete the endpoints just deployed. You will complete the cleanup at the end of the walkthrough.

After confirming successful deployment of the endpoints, you’re ready to deploy the full sample application. However, if you’re more interested in exploring only the backend and analysis notebooks, you can optionally deploy only that, which is covered in the next section.

Option 1: Deploy the backend application only

This pattern allows you to deploy the backend solution only and interact with the solution using a Jupyter notebook. Use this pattern if you don’t want to build out the full frontend interface.

Prerequisites

You should have the following prerequisites:

  • A SageMaker JumpStart model endpoint deployed – Deploy the models to SageMaker real-time endpoints using SageMaker JumpStart, as previously outlined
  • Deployment parameters – Record the following:
    • Text model endpoint name – The endpoint name of the text generation model deployed with SageMaker JumpStart
    • Embeddings model endpoint name – The endpoint name of the embedding model deployed with SageMaker JumpStart

Deploy the resources using the AWS CDK

Use the deployment parameters noted in the previous section to deploy the AWS CDK stack. For more information about AWS CDK installation, refer to Getting started with the AWS CDK.

Make sure that Docker is installed and running on the workstation that will be used for AWS CDK deployment. Refer to Get Docker for additional guidance.

$ cd pattern1-rag/cdk
$ cdk deploy BackendStack --exclusively
    -c textModelEndpointName=<Enter the SageMaker Endpoint Name for the Text generation model> 
    -c embeddingsModelEndpointName=<Enter the SageMaker Endpoint Name for the Text embeddings model>

Alternatively, you can enter the context values in a file called cdk.context.json in the pattern1-rag/cdk directory and run cdk deploy BackendStack --exclusively.

The deployment will print out outputs, some of which will be needed to run the notebook. Before you can start question and answering, embed the reference documents, as shown in the next section.

Embed reference documents

For this RAG approach, reference documents are first embedded with a text embedding model and stored in a vector database. In this solution, an ingestion pipeline has been built that intakes PDF documents.

An Amazon Elastic Compute Cloud (Amazon EC2) instance has been created for the PDF document ingestion and an Amazon Elastic File System (Amazon EFS) file system is mounted on the EC2 instance to save the PDF documents. An AWS DataSync task is run every hour to fetch PDF documents found in the EFS file system path and upload them to an S3 bucket to start the text embedding process. This process embeds the reference documents and saves the embeddings in OpenSearch Service. It also saves an embedding archive to an S3 bucket through Kinesis Data Firehose for later analysis.

To ingest the reference documents, complete the following steps:

  1. Retrieve the sample EC2 instance ID that was created (see the AWS CDK output JumpHostId) and connect using Session Manager, a capability of AWS Systems Manager. For instructions, refer to Connect to your Linux instance with AWS Systems Manager Session Manager.
  2. Go to the directory /mnt/efs/fs1, which is where the EFS file system is mounted, and create a folder called ingest:
    $ cd /mnt/efs/fs1
    $ mkdir ingest && cd ingest

  3. Add your reference PDF documents to the ingest directory.

The DataSync task is configured to upload all files found in this directory to Amazon S3 to start the embedding process.

The DataSync task runs on an hourly schedule; you can optionally start the task manually to start the embedding process immediately for the PDF documents you added.

  1. To start the task, locate the task ID from the AWS CDK output DataSyncTaskID and start the task with defaults.

After the embeddings are created, you can start the RAG question and answering through a Jupyter notebook, as shown in the next section.

Question and answering using a Jupyter notebook

Complete the following steps:

  1. Retrieve the SageMaker notebook instance name from the AWS CDK output NotebookInstanceName and connect to JupyterLab from the SageMaker console.
  2. Go to the directory fmops/full-stack/pattern1-rag/notebooks/.
  3. Open and run the notebook query-llm.ipynb in the notebook instance to perform question and answering using RAG.

Make sure to use the conda_python3 kernel for the notebook.

This pattern is useful to explore the backend solution without needing to provision additional prerequisites that are required for the full-stack application. The next section covers the implementation of a full-stack application, including both the frontend and backend components, to provide a user interface for interacting with your generative AI application.

Option 2: Deploy the full-stack sample application with a Streamlit frontend

This pattern allows you to deploy the solution with a user frontend interface for question and answering.

Prerequisites

To deploy the sample application, you must have the following prerequisites:

  • SageMaker JumpStart model endpoint deployed – Deploy the models to your SageMaker real-time endpoints using SageMaker JumpStart, as outlined in the previous section, using the provided notebooks.
  • Amazon Route 53 hosted zone – Create an Amazon Route 53 public hosted zone to use for this solution. You can also use an existing Route 53 public hosted zone, such as example.com.
  • AWS Certificate Manager certificate – Provision an AWS Certificate Manager (ACM) TLS certificate for the Route 53 hosted zone domain name and its applicable subdomains, such as example.com and *.example.com for all subdomains. For instructions, refer to Requesting a public certificate. This certificate is used to configure HTTPS on Amazon CloudFront and the origin load balancer.
  • Deployment parameters – Record the following:
    • Frontend application custom domain name – A custom domain name used to access the frontend sample application. The domain name provided is used to create a Route 53 DNS record pointing to the frontend CloudFront distribution; for example, app.example.com.
    • Load balancer origin custom domain name – A custom domain name used for the CloudFront distribution load balancer origin. The domain name provided is used to create a Route 53 DNS record pointing to the origin load balancer; for example, app-lb.example.com.
    • Route 53 hosted zone ID – The Route 53 hosted zone ID to host the custom domain names provided; for example, ZXXXXXXXXYYYYYYYYY.
    • Route 53 hosted zone name – The name of the Route 53 hosted zone to host the custom domain names provided; for example, example.com.
    • ACM certificate ARN – The ARN of the ACM certificate to be used with the custom domain provided.
    • Text model endpoint name – The endpoint name of the text generation model deployed with SageMaker JumpStart.
    • Embeddings model endpoint name – The endpoint name of the embedding model deployed with SageMaker JumpStart.

Deploy the resources using the AWS CDK

Use the deployment parameters you noted in the prerequisites to deploy the AWS CDK stack. For more information, refer to Getting started with the AWS CDK.

Make sure Docker is installed and running on the workstation that will be used for the AWS CDK deployment.

$ cd pattern1-rag/cdk
$ cdk deploy --all -c appCustomDomainName=<Enter Custom Domain Name to be used for Frontend App> 
    -c loadBalancerOriginCustomDomainName=<Enter Custom Domain Name to be used for Load Balancer Origin> 
    -c customDomainRoute53HostedZoneID=<Enter Route53 Hosted Zone ID for the Custom Domain being used> 
    -c customDomainRoute53HostedZoneName=<Enter Route53 Hostedzone Name> 
    -c customDomainCertificateArn=<Enter ACM Certificate ARN for Custom Domains provided> 
    -c textModelEndpointName=<Enter the SageMaker Endpoint Name for the Text generation model> 
    -c embeddingsModelEndpointName=<Enter the SageMaker Endpoint Name for the Text embeddings model>

In the preceding code, -c represents a context value, in the form of the required prerequisites, provided on input. Alternatively, you can enter the context values in a file called cdk.context.json in the pattern1-rag/cdk directory and run cdk deploy --all.

Note that we specify the Region in the file bin/cdk.ts. Configuring ALB access logs requires a specified Region. You can change this Region before deployment.

The deployment will print out the URL to access the Streamlit application. Before you can start question and answering, you need to embed the reference documents, as shown in the next section.

Embed the reference documents

For a RAG approach, reference documents are first embedded with a text embedding model and stored in a vector database. In this solution, an ingestion pipeline has been built that intakes PDF documents.

As we discussed in the first deployment option, an example EC2 instance has been created for the PDF document ingestion and an EFS file system is mounted on the EC2 instance to save the PDF documents. A DataSync task is run every hour to fetch PDF documents found in the EFS file system path and upload them to an S3 bucket to start the text embedding process. This process embeds the reference documents and saves the embeddings in OpenSearch Service. It also saves an embedding archive to an S3 bucket through Kinesis Data Firehose for later analysis.

To ingest the reference documents, complete the following steps:

  1. Retrieve the sample EC2 instance ID that was created (see the AWS CDK output JumpHostId) and connect using Session Manager.
  2. Go to the directory /mnt/efs/fs1, which is where the EFS file system is mounted, and create a folder called ingest:
    $ cd /mnt/efs/fs1
    $ mkdir ingest && cd ingest

  3. Add your reference PDF documents to the ingest directory.

The DataSync task is configured to upload all files found in this directory to Amazon S3 to start the embedding process.

The DataSync task runs on an hourly schedule. You can optionally start the task manually to start the embedding process immediately for the PDF documents you added.

  1. To start the task, locate the task ID from the AWS CDK output DataSyncTaskID and start the task with defaults.

Question and answering

After the reference documents have been embedded, you can start the RAG question and answering by visiting the URL to access the Streamlit application. An Amazon Cognito authentication layer is used, so it requires creating a user account in the Amazon Cognito user pool deployed via the AWS CDK (see the AWS CDK output for the user pool name) for first-time access to the application. For instructions on creating an Amazon Cognito user, refer to Creating a new user in the AWS Management Console.

Embed drift analysis

In this section, we show you how to perform drift analysis by first creating a baseline of the reference data embeddings and prompt embeddings, and then creating a snapshot of the embeddings over time. This allows you to compare the baseline embeddings to the snapshot embeddings.

Create an embedding baseline for the reference data and prompt

To create an embedding baseline of the reference data, open the AWS Glue console and select the ETL job embedding-drift-analysis. Set the parameters for the ETL job as follows and run the job:

  • Set --job_type to BASELINE.
  • Set --out_table to the Amazon DynamoDB table for reference embedding data. (See the AWS CDK output DriftTableReference for the table name.)
  • Set --centroid_table to the DynamoDB table for reference centroid data. (See the AWS CDK output CentroidTableReference for the table name.)
  • Set --data_path to the S3 bucket with the prefix; for example, s3://<REPLACE_WITH_BUCKET_NAME>/embeddingarchive/. (See the AWS CDK output BucketName for the bucket name.)

Similarly, using the ETL job embedding-drift-analysis, create an embedding baseline of the prompts. Set the parameters for the ETL job as follows and run the job:

  • Set --job_type to BASELINE
  • Set --out_table to the DynamoDB table for prompt embedding data. (See the AWS CDK output DriftTablePromptsName for the table name.)
  • Set --centroid_table to the DynamoDB table for prompt centroid data. (See the AWS CDK output CentroidTablePrompts for the table name.)
  • Set --data_path to the S3 bucket with the prefix; for example, s3://<REPLACE_WITH_BUCKET_NAME>/promptarchive/. (See the AWS CDK output BucketName for the bucket name.)

Create an embedding snapshot for the reference data and prompt

After you ingest additional information into OpenSearch Service, run the ETL job embedding-drift-analysis again to snapshot the reference data embeddings. The parameters will be the same as the ETL job that you ran to create the embedding baseline of the reference data as shown in the previous section, with the exception of setting the --job_type parameter to SNAPSHOT.

Similarly, to snapshot the prompt embeddings, run the ETL job embedding-drift-analysis again. The parameters will be the same as the ETL job that you ran to create the embedding baseline for the prompts as shown in the previous section, with the exception of setting the --job_type parameter to SNAPSHOT.

Compare the baseline to the snapshot

To compare the embedding baseline and snapshot for reference data and prompts, use the provided notebook pattern1-rag/notebooks/drift-analysis.ipynb.

To look at embedding comparison for reference data or prompts, change the DynamoDB table name variables (tbl and c_tbl) in the notebook to the appropriate DynamoDB table for each run of the notebook.

The notebook variable tbl should be changed to the appropriate drift table name. The following is an example of where to configure the variable in the notebook.

The table names can be retrieved as follows:

  • For the reference embedding data, retrieve the drift table name from the AWS CDK output DriftTableReference
  • For the prompt embedding data, retrieve the drift table name from the AWS CDK output DriftTablePromptsName

In addition, the notebook variable c_tbl should be changed to the appropriate centroid table name. The following is an example of where to configure the variable in the notebook.

The table names can be retrieved as follows:

  • For the reference embedding data, retrieve the centroid table name from the AWS CDK output CentroidTableReference
  • For the prompt embedding data, retrieve the centroid table name from the AWS CDK output CentroidTablePrompts

Analyze the prompt distance from the reference data

First, run the AWS Glue job embedding-distance-analysis. This job will find out which cluster, from the K-Means evaluation of the reference data embeddings, that each prompt belongs to. It then calculates the mean, median, and standard deviation of the distance from each prompt to the center of the corresponding cluster.

You can run the notebook pattern1-rag/notebooks/distance-analysis.ipynb to see the trends in the distance metrics over time. This will give you a sense of the overall trend in the distribution of the prompt embedding distances.

The notebook pattern1-rag/notebooks/prompt-distance-outliers.ipynb is an AWS Glue notebook that looks for outliers, which can help you identify whether you’re getting more prompts that are not related to the reference data.

Monitor similarity scores

All similarity scores from OpenSearch Service are logged in Amazon CloudWatch under the rag namespace. The dashboard RAG_Scores shows the average score and the total number of scores ingested.

Clean up

To avoid incurring future charges, delete all the resources that you created.

Delete the deployed SageMaker models

Reference the cleanup up section of the provided example notebook to delete the deployed SageMaker JumpStart models, or you can delete the models on the SageMaker console.

Delete the AWS CDK resources

If you entered your parameters in a cdk.context.json file, clean up as follows:

$ cd pattern1-rag/cdk
$ cdk destroy --all

If you entered your parameters on the command line and only deployed the backend application (the backend AWS CDK stack), clean up as follows:

$ cd pattern1-rag/cdk
$ cdk destroy --all
    -c textModelEndpointName=<Enter the SageMaker Endpoint Name for the Text generation model> 
    -c embeddingsModelEndpointName=<Enter the SageMaker Endpoint Name for the Text embeddings model>

If you entered your parameters on the command line and deployed the full solution (the frontend and backend AWS CDK stacks), clean up as follows:

$ cd pattern1-rag/cdk
$ cdk destroy --all -c appCustomDomainName=<Enter Custom Domain Name to be used for Frontend App> 
    -c loadBalancerOriginCustomDomainName=<Enter Custom Domain Name to be used for Load Balancer Origin> 
    -c customDomainRoute53HostedZoneID=<Enter Route53 Hosted Zone ID for the Custom Domain being used> 
    -c customDomainRoute53HostedZoneName=<Enter Route53 Hostedzone Name> 
    -c customDomainCertificateArn=<Enter ACM Certificate ARN for Custom Domains provided> 
    -c textModelEndpointName=<Enter the SageMaker Endpoint Name for the Text generation model> 
    -c embeddingsModelEndpointName=<Enter the SageMaker Endpoint Name for the Text embeddings model>

Conclusion

In this post, we provided a working example of an application that captures embedding vectors for both reference data and prompts in the RAG pattern for generative AI. We showed how to perform clustering analysis to determine whether reference or prompt data is drifting over time, and how well the reference data covers the types of questions users are asking. If you detect drift, it can provide a signal that the environment has changed and your model is getting new inputs that it may not be optimized to handle. This allows for proactive evaluation of the current model against changing inputs.


About the Authors

Abdullahi Olaoye is a Senior Solutions Architect at Amazon Web Services (AWS). Abdullahi holds a MSC in Computer Networking from Wichita State University and is a published author that has held roles across various technology domains such as DevOps, infrastructure modernization and AI. He is currently focused on Generative AI and plays a key role in assisting enterprises to architect and build cutting-edge solutions powered by Generative AI. Beyond the realm of technology, he finds joy in the art of exploration. When not crafting AI solutions, he enjoys traveling with his family to explore new places.

Randy DeFauw is a Senior Principal Solutions Architect at AWS. He holds an MSEE from the University of Michigan, where he worked on computer vision for autonomous vehicles. He also holds an MBA from Colorado State University. Randy has held a variety of positions in the technology space, ranging from software engineering to product management. In entered the Big Data space in 2013 and continues to explore that area. He is actively working on projects in the ML space and has presented at numerous conferences including Strata and GlueCon.

Shelbee Eigenbrode is a Principal AI and Machine Learning Specialist Solutions Architect at Amazon Web Services (AWS). She has been in technology for 24 years spanning multiple industries, technologies, and roles. She is currently focusing on combining her DevOps and ML background into the domain of MLOps to help customers deliver and manage ML workloads at scale. With over 35 patents granted across various technology domains, she has a passion for continuous innovation and using data to drive business outcomes. Shelbee is a co-creator and instructor of the Practical Data Science specialization on Coursera. She is also the Co-Director of Women In Big Data (WiBD), Denver chapter. In her spare time, she likes to spend time with her family, friends, and overactive dogs.

Read More

Designing generative AI workloads for resilience

Designing generative AI workloads for resilience

Resilience plays a pivotal role in the development of any workload, and generative AI workloads are no different. There are unique considerations when engineering generative AI workloads through a resilience lens. Understanding and prioritizing resilience is crucial for generative AI workloads to meet organizational availability and business continuity requirements. In this post, we discuss the different stacks of a generative AI workload and what those considerations should be.

Full stack generative AI

Although a lot of the excitement around generative AI focuses on the models, a complete solution involves people, skills, and tools from several domains. Consider the following picture, which is an AWS view of the a16z emerging application stack for large language models (LLMs).

Taxonomy of LLM App Stack on AWS

Compared to a more traditional solution built around AI and machine learning (ML), a generative AI solution now involves the following:

  • New roles – You have to consider model tuners as well as model builders and model integrators
  • New tools – The traditional MLOps stack doesn’t extend to cover the type of experiment tracking or observability necessary for prompt engineering or agents that invoke tools to interact with other systems

Agent reasoning

Unlike traditional AI models, Retrieval Augmented Generation (RAG) allows for more accurate and contextually relevant responses by integrating external knowledge sources. The following are some considerations when using RAG:

  • Setting appropriate timeouts is important to the customer experience. Nothing says bad user experience more than being in the middle of a chat and getting disconnected.
  • Make sure to validate prompt input data and prompt input size for allocated character limits that are defined by your model.
  • If you’re performing prompt engineering, you should persist your prompts to a reliable data store. That will safeguard your prompts in case of accidental loss or as part of your overall disaster recovery strategy.

Data pipelines

In cases where you need to provide contextual data to the foundation model using the RAG pattern, you need a data pipeline that can ingest the source data, convert it to embedding vectors, and store the embedding vectors in a vector database. This pipeline could be a batch pipeline if you prepare contextual data in advance, or a low-latency pipeline if you’re incorporating new contextual data on the fly. In the batch case, there are a couple challenges compared to typical data pipelines.

The data sources may be PDF documents on a file system, data from a software as a service (SaaS) system like a CRM tool, or data from an existing wiki or knowledge base. Ingesting from these sources is different from the typical data sources like log data in an Amazon Simple Storage Service (Amazon S3) bucket or structured data from a relational database. The level of parallelism you can achieve may be limited by the source system, so you need to account for throttling and use backoff techniques. Some of the source systems may be brittle, so you need to build in error handling and retry logic.

The embedding model could be a performance bottleneck, regardless of whether you run it locally in the pipeline or call an external model. Embedding models are foundation models that run on GPUs and do not have unlimited capacity. If the model runs locally, you need to assign work based on GPU capacity. If the model runs externally, you need to make sure you’re not saturating the external model. In either case, the level of parallelism you can achieve will be dictated by the embedding model rather than how much CPU and RAM you have available in the batch processing system.

In the low-latency case, you need to account for the time it takes to generate the embedding vectors. The calling application should invoke the pipeline asynchronously.

Vector databases

A vector database has two functions: store embedding vectors, and run a similarity search to find the closest k matches to a new vector. There are three general types of vector databases:

  • Dedicated SaaS options like Pinecone.
  • Vector database features built into other services. This includes native AWS services like Amazon OpenSearch Service and Amazon Aurora.
  • In-memory options that can be used for transient data in low-latency scenarios.

We don’t cover the similarity searching capabilities in detail in this post. Although they’re important, they are a functional aspect of the system and don’t directly affect resilience. Instead, we focus on the resilience aspects of a vector database as a storage system:

  • Latency – Can the vector database perform well against a high or unpredictable load? If not, the calling application needs to handle rate limiting and backoff and retry.
  • Scalability – How many vectors can the system hold? If you exceed the capacity of the vector database, you’ll need to look into sharding or other solutions.
  • High availability and disaster recovery – Embedding vectors are valuable data, and recreating them can be expensive. Is your vector database highly available in a single AWS Region? Does it have the ability to replicate data to another Region for disaster recovery purposes?

Application tier

There are three unique considerations for the application tier when integrating generative AI solutions:

  • Potentially high latency – Foundation models often run on large GPU instances and may have finite capacity. Make sure to use best practices for rate limiting, backoff and retry, and load shedding. Use asynchronous designs so that high latency doesn’t interfere with the application’s main interface.
  • Security posture – If you’re using agents, tools, plugins, or other methods of connecting a model to other systems, pay extra attention to your security posture. Models may try to interact with these systems in unexpected ways. Follow the normal practice of least-privilege access, for example restricting incoming prompts from other systems.
  • Rapidly evolving frameworks – Open source frameworks like LangChain are evolving rapidly. Use a microservices approach to isolate other components from these less mature frameworks.

Capacity

We can think about capacity in two contexts: inference and training model data pipelines. Capacity is a consideration when organizations are building their own pipelines. CPU and memory requirements are two of the biggest requirements when choosing instances to run your workloads.

Instances that can support generative AI workloads can be more difficult to obtain than your average general-purpose instance type. Instance flexibility can help with capacity and capacity planning. Depending on what AWS Region you are running your workload in, different instance types are available.

For the user journeys that are critical, organizations will want to consider either reserving or pre-provisioning instance types to ensure availability when needed. This pattern achieves a statically stable architecture, which is a resiliency best practice. To learn more about static stability in the AWS Well-Architected Framework reliability pillar, refer to Use static stability to prevent bimodal behavior.

Observability

Besides the resource metrics you typically collect, like CPU and RAM utilization, you need to closely monitor GPU utilization if you host a model on Amazon SageMaker or Amazon Elastic Compute Cloud (Amazon EC2). GPU utilization can change unexpectedly if the base model or the input data changes, and running out of GPU memory can put the system into an unstable state.

Higher up the stack, you will also want to trace the flow of calls through the system, capturing the interactions between agents and tools. Because the interface between agents and tools is less formally defined than an API contract, you should monitor these traces not only for performance but also to capture new error scenarios. To monitor the model or agent for any security risks and threats, you can use tools like Amazon GuardDuty.

You should also capture baselines of embedding vectors, prompts, context, and output, and the interactions between these. If these change over time, it may indicate that users are using the system in new ways, that the reference data is not covering the question space in the same way, or that the model’s output is suddenly different.

Disaster recovery

Having a business continuity plan with a disaster recovery strategy is a must for any workload. Generative AI workloads are no different. Understanding the failure modes that are applicable to your workload will help guide your strategy. If you are using AWS managed services for your workload, such as Amazon Bedrock and SageMaker, make sure the service is available in your recovery AWS Region. As of this writing, these AWS services don’t support replication of data across AWS Regions natively, so you need to think about your data management strategies for disaster recovery, and you also may need to fine-tune in multiple AWS Regions.

Conclusion

This post described how to take resilience into account when building generative AI solutions. Although generative AI applications have some interesting nuances, the existing resilience patterns and best practices still apply. It’s just a matter of evaluating each part of a generative AI application and applying the relevant best practices.

For more information about generative AI and using it with AWS services, refer to the following resources:


About the Authors

Jennifer Moran is an AWS Senior Resiliency Specialist Solutions Architect based out of New York City. She has a diverse background, having worked in many technical disciplines, including software development, agile leadership, and DevOps, and is an advocate for women in tech. She enjoys helping customers design resilient solutions to improve resilience posture and publicly speaks about all topics related to resilience.

Randy DeFauwRandy DeFauw is a Senior Principal Solutions Architect at AWS. He holds an MSEE from the University of Michigan, where he worked on computer vision for autonomous vehicles. He also holds an MBA from Colorado State University. Randy has held a variety of positions in the technology space, ranging from software engineering to product management. He entered the big data space in 2013 and continues to explore that area. He is actively working on projects in the ML space and has presented at numerous conferences, including Strata and GlueCon.

Read More

Analyze security findings faster with no-code data preparation using generative AI and Amazon SageMaker Canvas

Analyze security findings faster with no-code data preparation using generative AI and Amazon SageMaker Canvas

Data is the foundation to capturing the maximum value from AI technology and solving business problems quickly. To unlock the potential of generative AI technologies, however, there’s a key prerequisite: your data needs to be appropriately prepared. In this post, we describe how use generative AI to update and scale your data pipeline using Amazon SageMaker Canvas for data prep.

Typically, data pipeline work requires a specialized skill to prepare and organize data for security analysts to use to extract value, which can take time, increase risks, and increase time to value. With SageMaker Canvas, security analysts can effortlessly and securely access leading foundation models to prepare their data faster and remediate cyber security risks.

Data prep involves careful formatting and thoughtful contextualization, working backward from the customer problem. Now with the SageMaker Canvas chat for data prep capability, analysts with domain knowledge can quickly prepare, organize, and extract value from data using a chat-based experience.

Solution overview

Generative AI is revolutionizing the security domain by providing personalized and natural language experiences, enhancing risk identification and remediations, while boosting business productivity. For this use case, we use SageMaker Canvas, Amazon SageMaker Data Wrangler, Amazon Security Lake, and Amazon Simple Storage Service (Amazon S3). Amazon Security Lake allows you to aggregate and normalize security data for analysis to gain a better understanding of security across your organization. Amazon S3 enables you to store and retrieve any amount of data at any time or place. It offers industry-leading scalability, data availability, security, and performance.

SageMaker Canvas now supports comprehensive data preparation capabilities powered by SageMaker Data Wrangler. With this integration, SageMaker Canvas provides an end-to-end no-code workspace to prepare data, build, and use machine learning (ML) and Amazon Bedrock foundation models to accelerate the time from data to business insights. You can now discover and aggregate data from over 50 data sources and explore and prepare data using over 300 built-in analyses and transformations in the SageMaker Canvas visual interface. You’ll also see faster performance for transforms and analyses, and benefit from a natural language interface to explore and transform data for ML.

In this post, we demonstrate three key transformations; filtering, column renaming, and text extraction from a column on the security findings dataset. We also demonstrate using the chat for data prep feature in SageMaker Canvas to analyze the data and visualize your findings.

Prerequisites

Before starting, you need an AWS account. You also need to set up an Amazon SageMaker Studio domain. For instructions on setting up SageMaker Canvas, refer to Generate machine learning predictions without code.

Access the SageMaker Canvas chat interface

Complete the following steps to start using the SageMaker Canvas chat feature:

  1. On the SageMaker Canvas console, choose Data Wrangler.
  2. Under Datasets, choose Amazon S3 as your source and specify the security findings dataset from Amazon Security Lake.
  3. Choose your data flow and choose Chat for data prep, which will display a chat interface experience with guided prompts.

Filter data

For this post, we first want to filter for critical and high severity warnings, so we enter into the chat box instructions to remove findings that are not critical or high severity. Canvas removes the rows, displays a preview of transformed data, and provides the option to use the code. We can add it to the list of steps in the Steps pane.

Rename columns

Next, we want rename two columns, so we enter in the chat box the following prompt, to rename the desc and title columns to Finding and Remediation. SageMaker Canvas generates a preview, and if you’re happy with the results, you can add the transformed data to the data flow steps.

Extract text

To determine the source Regions of the findings, you can enter in chat instructions to Extract the Region text from the UID column based on the pattern arn:aws:security:securityhub:region:*  and create a new column called Region) to extract the Region text from the UID column based on a pattern. SageMaker Canvas then generates code to create a new region column. The data preview shows the findings originate from one Region: us-west-2. You can add this transformation to the data flow for downstream analysis.

Analyze the data

Finally, we want to analyze the data to determine if there is a correlation between time of day and number of critical findings. You can enter a request to summarize critical findings by time of day into the chat, and SageMaker Canvas returns insights that are useful for your investigation and analysis.

Visualize findings

Next, we visualize the findings by severity over time to include in a leadership report. You can ask SageMaker Canvas to generate a bar chart of severity compared to time of day. In seconds, SageMaker Canvas has created the chart grouped by severity. You can add this visualization to the analysis in the data flow and download it for your report. The data shows the findings originate from one Region and happen at specific times. This gives us confidence on where to focus our security findings investigation to determine root causes and corrective actions.

Clean up

To avoid incurring unintended charges, complete the following steps to clean up your resources:

  1. Empty the S3 bucket you used as a source.
  2. Log out of SageMaker Canvas.

Conclusion

In this post, we showed you how to use SageMaker Canvas as an end-to-end no-code workspace for data preparation to build and use Amazon Bedrock foundation models to accelerate time to gather business insights from data.

Note that this approach is not limited to security findings; you can apply this to any generative AI use case that uses data preparation at its core.

The future belongs to businesses that can effectively harness the power of generative AI and large language models. But to do so, we must first develop a solid data strategy and understand the art of data preparation. By using generative AI to structure our data intelligently, and working backward from the customer, we can solve business problems faster. With SageMaker Canvas chat for data preparation, it’s effortless for analysts to get started and capture immediate value from AI.


About the Authors

Sudeesh Sasidharan is a Senior Solutions Architect at AWS, within the Energy team. Sudeesh loves experimenting with new technologies and building innovative solutions that solve complex business challenges. When he is not designing solutions or tinkering with the latest technologies, you can find him on the tennis court working on his backhand.

John Klacynski is a Principal Customer Solution Manager within the AWS Independent Software Vendor (ISV) team. In this role, he programmatically helps ISV customers adopt AWS technologies and services to reach their business goals more quickly. Prior to joining AWS, John led Data Product Teams for large Consumer Package Goods companies, helping them leverage data insights to improve their operations and decision making.

Read More

Getting started with Amazon Titan Text Embeddings

Getting started with Amazon Titan Text Embeddings

Embeddings play a key role in natural language processing (NLP) and machine learning (ML). Text embedding refers to the process of transforming text into numerical representations that reside in a high-dimensional vector space. This technique is achieved through the use of ML algorithms that enable the understanding of the meaning and context of data (semantic relationships) and the learning of complex relationships and patterns within the data (syntactic relationships). You can use the resulting vector representations for a wide range of applications, such as information retrieval, text classification, natural language processing, and many others.

Amazon Titan Text Embeddings is a text embeddings model that converts natural language text—consisting of single words, phrases, or even large documents—into numerical representations that can be used to power use cases such as search, personalization, and clustering based on semantic similarity.

In this post, we discuss the Amazon Titan Text Embeddings model, its features, and example use cases.

Some key concepts include:

  • Numerical representation of text (vectors) captures semantics and relationships between words
  • Rich embeddings can be used to compare text similarity
  • Multilingual text embeddings can identify meaning in different languages

How is a piece of text converted into a vector?

There are multiple techniques to convert a sentence into a vector. One popular method is using word embeddings algorithms, such as Word2Vec, GloVe, or FastText, and then aggregating the word embeddings to form a sentence-level vector representation.

Another common approach is to use large language models (LLMs), like BERT or GPT, which can provide contextualized embeddings for entire sentences. These models are based on deep learning architectures such as Transformers, which can capture the contextual information and relationships between words in a sentence more effectively.

Why do we need an embeddings model?

Vector embeddings are fundamental for LLMs to understand the semantic degrees of language, and also enable LLMs to perform well on downstream NLP tasks like sentiment analysis, named entity recognition, and text classification.

In addition to semantic search, you can use embeddings to augment your prompts for more accurate results through Retrieval Augmented Generation (RAG)—but in order to use them, you’ll need to store them in a database with vector capabilities.

The Amazon Titan Text Embeddings model is optimized for text retrieval to enable RAG use cases. It enables you to first convert your text data into numerical representations or vectors, and then use those vectors to accurately search for relevant passages from a vector database, allowing you to make the most of your proprietary data in combination with other foundation models.

Because Amazon Titan Text Embeddings is a managed model on Amazon Bedrock, it’s offered as an entirely serverless experience. You can use it via either the Amazon Bedrock REST API or the AWS SDK. The required parameters are the text that you would like to generate the embeddings of and the modelID parameter, which represents the name of the Amazon Titan Text Embeddings model. The following code is an example using the AWS SDK for Python (Boto3):

import boto3
import json
 
#Create the connection to Bedrock
bedrock = boto3.client(
    service_name='bedrock',
    region_name='us-west-2', 
    
)
 
bedrock_runtime = boto3.client(
    service_name='bedrock-runtime',
    region_name='us-west-2', 
    
)
 
# Let's see all available Amazon Models
available_models = bedrock.list_foundation_models()
 
for model in available_models['modelSummaries']:
  if 'amazon' in model['modelId']:
    print(model)
 
# Define prompt and model parameters
prompt_data = """Write me a poem about apples"""
 
body = json.dumps({
    "inputText": prompt_data,
})
 
model_id = 'amazon.titan-embed-text-v1' #look for embeddings in the modelID
accept = 'application/json' 
content_type = 'application/json'
 
# Invoke model 
response = bedrock_runtime.invoke_model(
    body=body, 
    modelId=model_id, 
    accept=accept, 
    contentType=content_type
)
 
# Print response
response_body = json.loads(response['body'].read())
embedding = response_body.get('embedding')
 
#Print the Embedding
 
print(embedding)

The output will look something like the following:

[-0.057861328, -0.15039062, -0.4296875, 0.31054688, ..., -0.15625]

Refer to Amazon Bedrock boto3 Setup for more details on how to install the required packages, connect to Amazon Bedrock, and invoke models.

Features of Amazon Titan Text Embeddings

With Amazon Titan Text Embeddings, you can input up to 8,000 tokens, making it well suited to work with single words, phrases, or entire documents based on your use case. Amazon Titan returns output vectors of dimension 1536, giving it a high degree of accuracy, while also optimizing for low-latency, cost-effective results.

Amazon Titan Text Embeddings supports creating and querying embeddings for text in over 25 different languages. This means you can apply the model to your use cases without needing to create and maintain separate models for each language you want to support.

Having a single embeddings model trained on many languages provides the following key benefits:

  • Broader reach – By supporting over 25 languages out of the box, you can expand the reach of your applications to users and content in many international markets.
  • Consistent performance – With a unified model covering multiple languages, you get consistent results across languages instead of optimizing separately per language. The model is trained holistically so you get the advantage across languages.
  • Multilingual query support – Amazon Titan Text Embeddings allows querying text embeddings in any of the supported languages. This provides flexibility to retrieve semantically similar content across languages without being restricted to a single language. You can build applications that query and analyze multilingual data using the same unified embeddings space.

As of this writing, the following languages are supported:

  • Arabic
  • Chinese (Simplified)
  • Chinese (Traditional)
  • Czech
  • Dutch
  • English
  • French
  • German
  • Hebrew
  • Hindi
  • Italian
  • Japanese
  • Kannada
  • Korean
  • Malayalam
  • Marathi
  • Polish
  • Portuguese
  • Russian
  • Spanish
  • Swedish
  • Filipino Tagalog
  • Tamil
  • Telugu
  • Turkish

Using Amazon Titan Text Embeddings with LangChain

LangChain is a popular open source framework for working with generative AI models and supporting technologies. It includes a BedrockEmbeddings client that conveniently wraps the Boto3 SDK with an abstraction layer. The BedrockEmbeddings client allows you to work with text and embeddings directly, without knowing the details of the JSON request or response structures. The following is a simple example:

from langchain.embeddings import BedrockEmbeddings

#create an Amazon Titan Text Embeddings client
embeddings_client = BedrockEmbeddings() 

#Define the text from which to create embeddings
text = "Can you please tell me how to get to the bakery?"

#Invoke the model
embedding = embeddings_client.embed_query(text)

#Print response
print(embedding)

You can also use LangChain’s BedrockEmbeddings client alongside the Amazon Bedrock LLM client to simplify implementing RAG, semantic search, and other embeddings-related patterns.

Use cases for embeddings

Although RAG is currently the most popular use case for working with embeddings, there are many other use cases where embeddings can be applied. The following are some additional scenarios where you can use embeddings to solve specific problems, either on their own or in cooperation with an LLM:

  • Question and answer – Embeddings can help support question and answer interfaces through the RAG pattern. Embeddings generation paired with a vector database allow you to find close matches between questions and content in a knowledge repository.
  • Personalized recommendations – Similar to question and answer, you can use embeddings to find vacation destinations, colleges, vehicles, or other products based on the criteria provided by the user. This could take the form of a simple list of matches, or you could then use an LLM to process each recommendation and explain how it satisfies the user’s criteria. You could also use this approach to generate custom “10 best” articles for a user based on their specific needs.
  • Data management – When you have data sources that don’t map cleanly to each other, but you do have text content that describes the data record, you can use embeddings to identify potential duplicate records. For example, you could use embeddings to identify duplicate candidates that might use different formatting, abbreviations, or even have translated names.
  • Application portfolio rationalization – When looking to align application portfolios across a parent company and an acquisition, it’s not always obvious where to start finding potential overlap. The quality of configuration management data can be a limiting factor, and it can be difficult coordinating across teams to understand the application landscape. By using semantic matching with embeddings, we can do a quick analysis across application portfolios to identify high-potential candidate applications for rationalization.
  • Content grouping – You can use embeddings to help facilitate grouping similar content into categories that you might not know ahead of time. For example, let’s say you had a collection of customer emails or online product reviews. You could create embeddings for each item, then run those embeddings through k-means clustering to identify logical groupings of customer concerns, product praise or complaints, or other themes. You can then generate focused summaries from those groupings’ content using an LLM.

Semantic search example

In our example on GitHub, we demonstrate a simple embeddings search application with Amazon Titan Text Embeddings, LangChain, and Streamlit.

The example matches a user’s query to the closest entries in an in-memory vector database. We then display those matches directly in the user interface. This can be useful if you want to troubleshoot a RAG application, or directly evaluate an embeddings model.

For simplicity, we use the in-memory FAISS database to store and search for embeddings vectors. In a real-world scenario at scale, you will likely want to use a persistent data store like the vector engine for Amazon OpenSearch Serverless or the pgvector extension for PostgreSQL.

Try a few prompts from the web application in different languages, such as the following:

  • How can I monitor my usage?
  • How can I customize models?
  • Which programming languages can I use?
  • Comment mes données sont-elles sécurisées ?
  • 私のデータはどのように保護されていますか?
  • Quais fornecedores de modelos estão disponíveis por meio do Bedrock?
  • In welchen Regionen ist Amazon Bedrock verfügbar?
  • 有哪些级别的支持?

Note that even though the source material was in English, the queries in other languages were matched with relevant entries.

Conclusion

The text generation capabilities of foundation models are very exciting, but it’s important to remember that understanding text, finding relevant content from a body of knowledge, and making connections between passages are crucial to achieving the full value of generative AI. We will continue to see new and interesting use cases for embeddings emerge over the next years as these models continue to improve.

Next steps

You can find additional examples of embeddings as notebooks or demo applications in the following workshops:


About the Authors

Jason Stehle is a Senior Solutions Architect at AWS, based in the New England area. He works with customers to align AWS capabilities with their greatest business challenges. Outside of work, he spends his time building things and watching comic book movies with his family.

Nitin Eusebius is a Sr. Enterprise Solutions Architect at AWS, experienced in Software Engineering, Enterprise Architecture, and AI/ML. He is deeply passionate about exploring the possibilities of generative AI. He collaborates with customers to help them build well-architected applications on the AWS platform, and is dedicated to solving technology challenges and assisting with their cloud journey.

Raj Pathak is a Principal Solutions Architect and Technical Advisor to large Fortune 50 companies and mid-sized financial services institutions (FSI) across Canada and the United States. He specializes in machine learning applications such as generative AI, natural language processing, intelligent document processing, and MLOps.

Mani Khanuja is a Tech Lead – Generative AI Specialists, author of the book – Applied Machine Learning and High Performance Computing on AWS, and a member of the Board of Directors for Women in Manufacturing Education Foundation Board. She leads machine learning (ML) projects in various domains such as computer vision, natural language processing and generative AI. She helps customers to build, train and deploy large machine learning models at scale. She speaks in internal and external conferences such re:Invent, Women in Manufacturing West, YouTube webinars and GHC 23. In her free time, she likes to go for long runs along the beach.

Mark Roy is a Principal Machine Learning Architect for AWS, helping customers design and build AI/ML solutions. Mark’s work covers a wide range of ML use cases, with a primary interest in computer vision, deep learning, and scaling ML across the enterprise. He has helped companies in many industries, including insurance, financial services, media and entertainment, healthcare, utilities, and manufacturing. Mark holds six AWS Certifications, including the ML Specialty Certification. Prior to joining AWS, Mark was an architect, developer, and technology leader for over 25 years, including 19 years in financial services.

Read More

Build a movie chatbot for TV/OTT platforms using Retrieval Augmented Generation in Amazon Bedrock

Build a movie chatbot for TV/OTT platforms using Retrieval Augmented Generation in Amazon Bedrock

Improving how users discover new content is critical to increase user engagement and satisfaction on media platforms. Keyword search alone has challenges capturing semantics and user intent, leading to results that lack relevant context; for example, finding date night or Christmas-themed movies. This can drive lower retention rates if users can’t reliably find the content they want. However, with large language models (LLMs), there is an opportunity to solve these semantic and user intent challenges. By combining embeddings that capture semantics with a technique called Retrieval Augmented Generation (RAG), you can generate more relevant answers based on retrieved context from your own data sources.

In this post, we show you how to securely create a movie chatbot by implementing RAG with your own data using Knowledge Bases for Amazon Bedrock. We use the IMDb and Box Office Mojo dataset to simulate a catalog for media and entertainment customers and showcase how you can build your own RAG solution in just a couple of steps.

Solution overview

The IMDb and Box Office Mojo Movies/TV/OTT licensable data package provides a wide range of entertainment metadata, including over 1.6 billion user ratings; credits for more than 13 million cast and crew members; 10 million movie, TV, and entertainment titles; and global box office reporting data from more than 60 countries. Many AWS media and entertainment customers license IMDb data through AWS Data Exchange to improve content discovery and increase customer engagement and retention.

Introduction to Knowledge Bases for Amazon Bedrock

To equip an LLM with up-to-date proprietary information, organizations use RAG, a technique that involves fetching data from company data sources and enriching the prompt with that data to deliver more relevant and accurate responses. Knowledge Bases for Amazon Bedrock enable a fully managed RAG capability that allows you to customize LLM responses with contextual and relevant company data. Knowledge Bases automate the end-to-end RAG workflow, including ingestion, retrieval, prompt augmentation, and citations, eliminating the need for you to write custom code to integrate data sources and manage queries. Knowledge Bases for Amazon Bedrock also enable multi-turn conversations so that the LLM can answer complex user queries with the correct answer.

We use the following services as part of this solution:

We walk through the following high-level steps:

  1. Preprocess the IMDb data to create documents from every movie record and upload the data into an Amazon Simple Storage Service (Amazon S3) bucket.
  2. Create a knowledge base.
  3. Sync your knowledge base with your data source.
  4. Use the knowledge base to answer semantic queries about the movie catalog.

Prerequisites

The IMDb data used in this post requires a commercial content license and paid subscription to IMDb and Box Office Mojo Movies/TV/OTT licensing package on AWS Data Exchange. To inquire about a license and access sample data, visit developer.imdb.com. To access the dataset, refer to Power recommendation and search using an IMDb knowledge graph – Part 1 and follow the Access the IMDb data section.

Preprocess the IMDb data

Before we create a knowledge base, we need to preprocess the IMDb dataset into text files and upload them to an S3 bucket. In this post, we simulate a customer catalog using the IMDb dataset. We take 10,000 popular movies from the IMDb dataset for the catalog and build the dataset.

Use the following notebook to create the dataset with additional info like actors, director, and producer names. We use the following code to create a single file for a movie with all the information stored in the file in an unstructured text that can be understood by LLMs:

def create_txt_files_imdb(row):
    full_text = ""
    full_text += f"{row['originalTitle']} ({row['titleId']}) was shot in year {int(row['year'])} with rating {row['rating']} and poster url {row['poster_url']}.nn"
    full_text += f"{row['originalTitle']} has genres {', '.join(row['genres'])}.nn"
    full_text += f"{row['originalTitle']} has actors {', '.join(row['Actors'])}.nn"   
    full_text += f"{row['originalTitle']} has directors {', '.join(row['Directors'])}.nn"
    full_text += f"{row['originalTitle']} has producers {', '.join(row['Producers'])}.nn"
    full_text += f"{row['originalTitle']} has keyword {', '.join([x.replace('-',' ') for x in row['keyword']])}.nn"
    full_text += f"{row['originalTitle']} has location {', '.join(row['location'])}.nn"
    full_text += f"{row['originalTitle']} has plot {row['plot']}.nn"
    with open(f"<path>/data/imdb_data/{row['titleId']}.txt","w") as f:
        f.write(full_text)
    return full_text

After you have the data in .txt format, you can upload the data into Amazon S3 using the following command:

aws s3 cp <path to local data> s3://<bucket-name>/<path>/ --recursive

Create the IMDb Knowledge Base

Complete the following steps to create your knowledge base:

  1. On the Amazon Bedrock console, choose Knowledge base in the navigation pane.
  2. Choose Create knowledge base.
  3. For Knowledge base name, enter imdb.
  4. For Knowledge base description, enter an optional description, such as Knowledge base for ingesting and storing imdb data.
  5. For IAM permissions, select Create and use a new service role, then enter a name for your new service role.
  6. Choose Next.

knowledge base details console page

  1. For Data source name, enter imdb-s3.
  2. For S3 URI, enter the S3 URI that you uploaded the data to.
  3. In the Advanced settings – optional section, for Chunking strategy, choose No chunking.
  4. Choose Next.

Knowledge bases enable you to chunk your documents in smaller segments to make it straightforward for you to process large documents. In our case, we have already chunked the data into a smaller size document (one per movie).

knowledge base console 2

  1. In the Vector database section, select Quick create a new vector store.

Amazon Bedrock will automatically create a fully managed OpenSearch Serverless vector search collection and configure the settings for embedding your data sources using the chosen Titan Embedding G1 – Text embedding model.

knowledge base vector store page

  1. Choose Next.

  1. Review your settings and choose Create knowledge base.

Sync your data with the knowledge base

Now that you have created your knowledge base, you can sync the knowledge base with your data.

  1. On the Amazon Bedrock console, navigate to your knowledge base.
  2. In the Data source section, choose Sync.

knowledge base sync

After the data source is synced, you’re ready to query the data.

Improve search using semantic results

Complete the following steps to test the solution and improve your search using semantic results:

  1. On the Amazon Bedrock console, navigate to your knowledge base.
  2. Select your knowledge base and choose Test knowledge base.
  3. Choose Select model, and choose Anthropic Claude v2.1.
  4. Choose Apply.

Now you are ready to query the data.

We can ask some semantic questions, such as “Recommend me some Christmas themed movies.”

query Recommend me some Christmas themed movies.

Knowledge base responses contain citations that you can explore for response correctness and factuality.

knowledge base citations

You can also drill down on any information that you need from these movies. In the following example, we ask “who directed nightmare before christmas?”

“who directed nightmare before christmas?”

You can also ask more specific questions related to the genres and ratings, such as “show me classic animated movies with ratings greater than 7?”

show me classic animated movies with ratings greater than 7?

Augment your knowledge base with agents

Agents for Amazon Bedrock help you automate complex tasks. Agents can break down the user query into smaller tasks and call custom APIs or knowledge bases to supplement information for running actions. With Agents for Amazon Bedrock, developers can integrate intelligent agents into their apps, accelerating the delivery of AI-powered applications and saving weeks of development time. With agents, you can augment your knowledge base by adding more functionality like recommendations from Amazon Personalize for user-specific recommendations or performing actions such as filtering movies based on user needs.

Conclusion

In this post, we showed how to build a conversational movie chatbot using Amazon Bedrock in a few steps to answer semantic search and conversational experiences based on your own data and the IMDb and Box Office Mojo Movies/TV/OTT licensed dataset. In the next post, we go through the process of adding more functionality to your solution using Agents for Amazon Bedrock. To get started with knowledge bases on Amazon Bedrock, refer to Knowledge Bases for Amazon Bedrock.


About the Authors

Gaurav Rele is a Senior Data Scientist at the Generative AI Innovation Center, where he works with AWS customers across different verticals to accelerate their use of generative AI and AWS Cloud services to solve their business challenges.

Divya Bhargavi is a Senior Applied Scientist Lead at the Generative AI Innovation Center, where she solves high-value business problems for AWS customers using generative AI methods. She works on image/video understanding & retrieval, knowledge graph augmented large language models and personalized advertising use cases.

Suren Gunturu is a Data Scientist working in the Generative AI Innovation Center, where he works with various AWS customers to solve high-value business problems. He specializes in building ML pipelines using Large Language Models, primarily through Amazon Bedrock and other AWS Cloud services.

Vidya Sagar Ravipati is a Science Manager at the Generative AI Innovation Center, where he leverages his vast experience in large-scale distributed systems and his passion for machine learning to help AWS customers across different industry verticals accelerate their AI and cloud adoption.

Read More

How Mendix is transforming customer experiences with generative AI and Amazon Bedrock

How Mendix is transforming customer experiences with generative AI and Amazon Bedrock

This post was co-written with Ricardo Perdigao, Solution Architecture Manager at Mendix, a Siemens business.

Mendix, a Siemens business, offers the low-code platform with the vision and execution designed for today’s complex software development challenges. Since 2005, we’ve helped thousands of organizations worldwide reimagine how they develop applications with our platform’s cutting-edge capabilities. Mendix allows enterprises to turn ideas into outcomes by delivering sophisticated applications to market faster than ever. Mendix has been named an industry leader by Gartner and Forrester.

In a world where customer-centric innovation is the key to staying ahead of the competition, businesses increasingly turn to advanced technology to revolutionize product offerings. For Mendix, integrating the cutting-edge generative AI capabilities of Amazon Bedrock has been a game changer in redefining our customer experience landscape.

In this post, we share how Mendix is revolutionizing customer experiences using Amazon Bedrock and generative AI.

Overview

In 2016, a new era of innovation began when Mendix announced a strategic collaboration with AWS. By taking advantage of the robust cloud infrastructure of AWS, Mendix was able to provide a secure, scalable, and reliable solution for enterprises across the globe. The primary objective of this partnership was to enable organizations to build, deploy, and manage enterprise-grade applications quickly and effectively.

This unique collaboration combines the agility of the Mendix low-code application development platform with the robustness of AWS Cloud services. This combination facilitates rapid application development and deployment, empowering businesses to respond swiftly to market changes and customer demands, achieving successful digital transformation at an accelerated pace.

However, the rapid evolution of technology brings new challenges. The rise of generative AI presents a unique opportunity to redefine how applications are developed and used. Integrating these advanced AI capabilities into a low-code environment is a complex task, requiring a solution that is innovative, scalable, secure, and easy to use, that nonetheless delivers significant value to users.

Recognizing the importance of staying competitive in this rapidly evolving technological landscape, Mendix is committed to enhancing its platform with seamless AI integrations. Mendix not only offers an AI-enabled low-code application development experience, but also seeks to equip our customers with user-friendly tools necessary for implementing generative AI in the solutions they build.

By combining the power of AI with the accessibility of low-code development, we are setting the stage for unprecedented innovation in the industry, empowering our customers to create AI-enhanced applications with ease and efficiency.

The solution: Integrating generative AI capabilities provided by Amazon Bedrock

As a first step on our journey, we embraced Amazon Bedrock to infuse our low-code platform with generative AI capabilities. Amazon Bedrock offers many ready-to-use AI models. These models excel at writing clear text, creating images from just descriptions, and translating between different languages.

The Mendix AWS Bedrock Connector in the Mendix Marketplace eliminates traditional complexities and streamlines the integration of generative AI within our ecosystem. This pivotal service, accompanied by a wealth of shared knowledge via samples and blog posts, has paved a frictionless path for integration.

Generative AI from Amazon Bedrock is reshaping the landscape for low-code developers, offering a remarkable foundation for rapidly creating sophisticated applications that were previously only possible with extensive development time. These AI models are designed to equip developers with the ability to develop applications that not only learn and adapt but to do so with a previously unseen depth of understanding. As we integrate these technologies into the Mendix environment, we’re ushering in a new era of democratizing generative AI.

Imagine a world where applications are no longer static but can dynamically generate personalized content, such as images and interfaces, by analyzing individual user data like browsing habits, geographic location, and the time of day. This capability of generative AI to tailor experiences to each user promises to boost engagement and satisfaction dramatically. In today’s data-driven enterprise world, where the sheer volume of information can be overwhelming, generative AI stands as a powerful technology, turning complex data into accessible insights, streamlining reports for executives, identifying trends, and predicting future outcomes faster than ever before.

Taking a step further, generative AI also offers actionable recommendations, not just interpretations. This feature is set to transform sectors like customer service, where it can advise service agents on the best subsequent actions or automate routine responses based on a profound comprehension of customer data. By bringing these innovations to the Mendix platform, we’re moving towards a future where applications anticipate and meet user needs proactively, turning every interaction into an opportunity for innovation and a delightful customer experience.

But our vision extends beyond the horizons of today’s achievements. With our sight set on redefining the fabric of low-code application development, we’ve immersed ourselves in pioneering research. Using the Mendix Extensibility framework, we explore generative AI’s potential to transform our industry from the ground up. Our investigative forays have already yielded exciting prospects—from conjuring comprehensive domain models with simple narrative inputs to automating data mapping and sample generation with AI’s interpretative prowess. We’re even on the cusp of enabling Mendix UI setups through intuitive AI-powered dialogs. These nascent concepts—demonstrated in the following video—are still being experimented on. But they promise to herald a new dawn for low-code innovation in the future.

In the following sections, we discuss our needs when selecting the right platform to build generative AI, and how Amazon Bedrock exceeded our expectations.

Ease of implementation

At Mendix, our diverse applications span text generation, summarization, virtual assistance, and multimodal image generation. Amazon Bedrock is our go-to platform for building and implementing generative AI, providing seamless access to cutting-edge generative AI foundational models. Its unified API simplifies experimentation and upgrades, facilitating quick and efficient integration of various models into our systems. This streamlined approach significantly reduces development and deployment efforts, enhancing overall efficiency and quickly innovating on disruptive technologies like generative AI.

Security is our top priority

At Mendix, as we innovate, generative AI is vital for innovation, yet security is critical. With Amazon Bedrock, we customized models using labeled data in Amazon Simple Storage Service (Amazon S3). We also added encryption using AWS Key Management Service (AWS KMS), Amazon Virtual Private Cloud (Amazon VPC), and AWS Private Link to establish private connectivity from a VPC to Amazon Bedrock hardened data, security, and access, fulfilling our stringent enterprise security needs.

As we learned with Amazon Bedrock, a private copy of the base model is launched as we fine-tune a foundation model. Our data is not shared with foundation models from Amazon and other leading AI startups or used to improve the base models. Amazon Titan and other leading AI foundation models from AI21 Labs, Anthropic, Cohere, Meta, and Stability AI do not have access to our data (prompts, completion results), ensuring responsible development and exceeding our security requirements.

Thanks to AWS and Amazon Bedrock, balancing the power of generative AI with robust security measures ensures responsible and safe development, fostering technological advancement with confidence. Amazon Bedrock exceeded our security requirements.

Continual updates and support

With Amazon Bedrock, users can benefit from continual updates and support for the available models. Enterprises like us have access to the latest advancements and improvements in AI technology, allowing us to remain ahead of the curve and adjust to evolving market trends and demands.

We can’t wait to further experiment with the new features of Amazon Bedrock announced at AWS re:Invent 2023, including Agents for Amazon Bedrock and Knowledge Bases for Amazon Bedrock, to accelerate our generative AI innovation on AWS.

Cost-effectiveness

The diverse model offerings in Amazon Bedrock empower you to select cost-effective large language models based on your use case. This flexibility enables companies to optimize AI investments, allocate resources efficiently, and achieve the best return on investment by choosing the most suitable model for each use case.

Conclusion

Mendix’s strategic integration of generative AI using Amazon Bedrock represents a groundbreaking move in democratizing tech innovation. Aligning with our legacy of empowering businesses with agility, Amazon Bedrock enhances our platform with advanced AI, ensuring every Mendix solution is not just current but future-engineered. We’re not merely adapting to low-code development evolution, but pioneering it.

To learn more about how your business can use the power of Amazon Bedrock and AWS to drive innovation and revolutionize customer experiences, see Why Using Amazon Bedrock in Your Mendix Apps is a Must.


About the Authors

Ricardo Perdigao has worked in the Enterprise Software industry for nearly 30 years, bringing a wealth of experience and innovation to his role as a Solutions Architect at Mendix, a high-productivity application development platform that empowers the creation and continuous improvement of mobile and web applications at scale. In 2023, Ricardo was honored with the prestigious Mendix Innovation Award for his groundbreaking research in Generative AI, further cementing his reputation as a forward-thinking leader in technology.

Suresh Patnam is the principal GTM Specialist AI/ML and Generative AI at AWS. He is passionate about helping businesses of all sizes transform into fast-moving digital organizations focusing on data, AI/ML, and generative AI. Suresh holds an MBA from Duke University-Fuqua School of Business. In his spare time, Suresh enjoys playing tennis and spending time with his family.

Read More

Train and host a computer vision model for tampering detection on Amazon SageMaker: Part 2

Train and host a computer vision model for tampering detection on Amazon SageMaker: Part 2

In the first part of this three-part series, we presented a solution that demonstrates how you can automate detecting document tampering and fraud at scale using AWS AI and machine learning (ML) services for a mortgage underwriting use case.

In this post, we present an approach to develop a deep learning-based computer vision model to detect and highlight forged images in mortgage underwriting. We provide guidance on building, training, and deploying deep learning networks on Amazon SageMaker.

In Part 3, we demonstrate how to implement the solution on Amazon Fraud Detector.

Solution overview

To meet the objective of detecting document tampering in mortgage underwriting, we employ a computer vision model hosted on SageMaker for our image forgery detection solution. This model receives a testing image as input and generates a likelihood prediction of forgery as its output. The network architecture is as depicted in the following diagram.

Tampering detection model architecture

Image forgery mainly involves four techniques: splicing, copy-move, removal, and enhancement. Depending on the characteristics of the forgery, different clues can be used as the foundation for detection and localization. These clues include JPEG compression artifacts, edge inconsistencies, noise patterns, color consistency, visual similarity, EXIF consistency, and camera model.

Given the expansive realm of image forgery detection, we use the Error Level Analysis (ELA) algorithm as an illustrative method for detecting forgeries. We selected the ELA technique for this post for the following reasons:

  • It is quicker to implement and can easily catch tampering of images.
  • It works by analyzing the compression levels of different parts of an image. This allows it to detect inconsistencies that may indicate tampering—for example, if one area was copied and pasted from another image that had been saved at a different compression level.
  • It is good at detecting more subtle or seamless tampering that may be hard to spot with the naked eye. Even small changes to an image can introduce detectable compression anomalies.
  • It doesn’t rely on having the original unmodified image for comparison. ELA can identify tampering signs within only the questioned image itself. Other techniques often require the unmodified original to compare against.
  • It is a lightweight technique that only relies on analyzing compression artifacts in the digital image data. It doesn’t depend on specialized hardware or forensics expertise. This makes ELA accessible as a first-pass analysis tool.
  • The output ELA image can clearly highlight differences in compression levels, making tampered areas visibly obvious. This allows even a non-expert to recognize signs of possible manipulation.
  • It works on many image types (such as JPEG, PNG, and GIF) and requires only the image itself to analyze. Other forensic techniques may be more restricted in formats or original image requirements.

However, in real-world scenarios where you may have a combination of input documents (JPEG, PNG, GIF, TIFF, PDF), we recommend employing ELA in conjunction with various other methods, such as detecting inconsistencies in edges, noise patterns, color uniformity, EXIF data consistency, camera model identification, and font uniformity. We aim to update the code for this post with additional forgery detection techniques.

ELA’s underlying premise assumes that the input images are in JPEG format, known for its lossy compression. Nevertheless, the method can still be effective even if the input images were originally in a lossless format (such as PNG, GIF, or BMP) and later converted to JPEG during the tampering process. When ELA is applied to original lossless formats, it typically indicates consistent image quality without any deterioration, rendering it challenging to pinpoint altered areas. In JPEG images, the expected norm is for the entire picture to exhibit similar compression levels. However, if a particular section within the image displays a markedly different error level, it often suggests a digital alteration has been made.

ELA highlights differences in the JPEG compression rate. Regions with uniform coloring will likely have a lower ELA result (for example, a darker color compared to high-contrast edges). The things to look for to identify tampering or modification include the following:

  • Similar edges should have similar brightness in the ELA result. All high-contrast edges should look similar to each other, and all low-contrast edges should look similar. With an original photo, low-contrast edges should be almost as bright as high-contrast edges.
  • Similar textures should have similar coloring under ELA. Areas with more surface detail, such as a close-up of a basketball, will likely have a higher ELA result than a smooth surface.
  • Regardless of the actual color of the surface, all flat surfaces should have about the same coloring under ELA.

JPEG images use a lossy compression system. Each re-encoding (resave) of the image adds more quality loss to the image. Specifically, the JPEG algorithm operates on an 8×8 pixel grid. Each 8×8 square is compressed independently. If the image is completely unmodified, then all 8×8 squares should have similar error potentials. If the image is unmodified and resaved, then every square should degrade at approximately the same rate.

ELA saves the image at a specified JPEG quality level. This resave introduces a known amount of errors across the entire image. The resaved image is then compared against the original image. If an image is modified, then every 8×8 square that was touched by the modification should be at a higher error potential than the rest of the image.

The results from ELA are directly dependent on the image quality. You may want to know if something was added, but if the picture is copied multiple times, then ELA may only permit detecting the resaves. Try to find the best quality version of the picture.

With training and practice, ELA can also learn to identify image scaling, quality, cropping, and resave transformations. For example, if a non-JPEG image contains visible grid lines (1 pixel wide in 8×8 squares), then it means the picture started as a JPEG and was converted to non-JPEG format (such as PNG). If some areas of the picture lack grid lines or the grid lines shift, then it denotes a splice or drawn portion in the non-JPEG image.

In the following sections, we demonstrate the steps for configuring, training, and deploying the computer vision model.

Prerequisites

To follow along with this post, complete the following prerequisites:

  1. Have an AWS account.
  2. Set up Amazon SageMaker Studio. You can swiftly initiate SageMaker Studio using default presets, facilitating a rapid launch. For more information, refer to Amazon SageMaker simplifies the Amazon SageMaker Studio setup for individual users.
  3. Open SageMaker Studio and launch a system terminal.
    Setup system terminal
  4. Run the following command in the terminal:
    git clone https://github.com/aws-samples/document-tampering-detection.git
  5. The total cost of running SageMaker Studio for one user and the configurations of the notebook environment is $7.314 USD per hour.

Set up the model training notebook

Complete the following steps to set up your training notebook:

  1. Open the tampering_detection_training.ipynb file from the document-tampering-detection directory.
  2. Set up the notebook environment with the image TensorFlow 2.6 Python 3.8 CPU or GPU Optimized.
    You may run into issue of insufficient availability or hit the quota limit for GPU instances within your AWS account when selecting GPU optimized instances. To increase the quota, visit the Service Quotas console and increase the service limit for the specific instance type you need. You can also use a CPU optimized notebook environment in such cases.
  3. For Kernel, choose Python3.
  4. For Instance type, choose ml.m5d.24xlarge or any other large instance.

We selected a larger instance type to reduce the training time of the model. With an ml.m5d.24xlarge notebook environment, the cost per hour is $7.258 USD per hour.

Run the training notebook

Run each cell in the notebook tampering_detection_training.ipynb in order. We discuss some cells in more detail in the following sections.

Prepare the dataset with a list of original and tampered images

Before you run the following cell in the notebook, prepare a dataset of original and tampered documents based on your specific business requirements. For this post, we use a sample dataset of tampered paystubs, and bank statements. The dataset is available within the images directory of the GitHub repository.

Prepare dataset

The notebook reads the original and tampered images from the images/training directory.

The dataset for training is created using a CSV file with two columns: the path to the image file and the label for the image (0 for original image and 1 for tampered image).

Label dataset

Process the dataset by generating the ELA results of each training image

In this step, we generate the ELA result (at 90% quality) of the input training image. The function convert_to_ela_image takes two parameters: path, which is the path to an image file, and quality, representing the quality parameter for JPEG compression. The function performs the following steps:

  1. Convert the image to RGB format and resave the image as a JPEG file with the specified quality under the name tempresaved.jpg.
  2. Compute the difference between the original image and the resaved JPEG image (ELA) to determine the maximum difference in pixel values between the original and resaved images.
  3. Calculate a scale factor based on the maximum difference to adjust the brightness of the ELA image.
  4. Enhance the brightness of the ELA image using the calculated scale factor.
  5. Resize the ELA result to 128x128x3, where 3 represents the number of channels to reduce the input size for training.
  6. Return the ELA image.

In lossy image formats such as JPEG, the initial saving process leads to considerable color loss. However, when the image is loaded and subsequently re-encoded in the same lossy format, there’s generally less added color degradation. ELA outcomes emphasize the image areas most susceptible to color degradation upon resaving. Generally, alterations appear prominently in regions exhibiting higher potential for degradation compared to the rest of the image.

Next, the images are processed into a NumPy array for training. We then split the input dataset randomly into training and test or validation data (80/20). You can ignore any warnings when running these cells.

Convert to ELA for training

Depending on the size of dataset, running these cells could take time to complete. For the sample dataset we provided in this repository, it could take 5–10 minutes.

Configure the CNN model

In this step, we construct a minimal version of the VGG network with small convolutional filters. The VGG-16 consists of 13 convolutional layers and three fully connected layers. The following screenshot illustrates the architecture of our Convolutional Neural Network (CNN) model.

Tensorflow model architecture

Note the following configurations:

  • Input – The model takes in an image input size of 128x128x3.
  • Convolutional layers – The convolutional layers use a minimal receptive field (3×3), the smallest possible size that still captures up/down and left/right. This is followed by a rectified linear unit (ReLU) activation function that reduces training time. This is a linear function that will output the input if positive; otherwise, the output is zero. The convolution stride is fixed at the default (1 pixel) to keep the spatial resolution preserved after convolution (stride is the number of pixel shifts over the input matrix).
  • Fully connected layers – The network has two fully connected layers. The first dense layer uses ReLU activation, and the second uses softmax to classify the image as original or tampered.

You can ignore any warnings when running these cells.

Save the model artifacts

Save the trained model with a unique file name—for example, based on the current date and time—into a directory named model.

Save the tensorflow model artifacts

The model is saved in Keras format with the extension .keras. We also save the model artifacts as a directory named 1 containing serialized signatures and the state needed to run them, including variable values and vocabularies to deploy to a SageMaker runtime (which we discuss later in this post).

Measure model performance

The following loss curve shows the progression of the model’s loss over training epochs (iterations).

model accuracy plot

The loss function measures how well the model’s predictions match the actual targets. Lower values indicate better alignment between predictions and true values. Decreasing loss over epochs signifies that the model is improving. The accuracy curve illustrates the model’s accuracy over training epochs. Accuracy is the ratio of correct predictions to the total number of predictions. Higher accuracy indicates a better-performing model. Typically, accuracy increases during training as the model learns patterns and improves its predictive ability. These will help you determine if the model is overfitting (performing well on training data but poorly on unseen data) or underfitting (not learning enough from the training data).

The following confusion matrix visually represents how well the model accurately distinguishes between the positive (forged image, represented as value 1) and negative (untampered image, represented as value 0) classes.

Confusion matrix plot

Following the model training, our next step involves deploying the computer vision model as an API. This API will be integrated into business applications as a component of the underwriting workflow. To achieve this, we use Amazon SageMaker Inference, a fully managed service. This service seamlessly integrates with MLOps tools, enabling scalable model deployment, cost-efficient inference, enhanced model management in production, and reduced operational complexity. In this post, we deploy the model as a real-time inference endpoint. However, it’s important to note that, depending on the workflow of your business applications, the model deployment can also be tailored as batch processing, asynchronous handling, or through a serverless deployment architecture.

Set up the model deployment notebook

Complete the following steps to set up your model deployment notebook:

  1. Open the tampering_detection_model_deploy.ipynb file from document-tampering-detection directory.
  2. Set up the notebook environment with the image Data Science 3.0.
  3. For Kernel, choose Python3.
  4. For Instance type, choose ml.t3.medium.

With an ml.t3.medium notebook environment, the cost per hour is $0.056 USD.

Create a custom inline policy for the SageMaker role to allow all Amazon S3 actions

The AWS Identity and Access Management (IAM) role for SageMaker will be in the format AmazonSageMaker- ExecutionRole-<random numbers>. Make sure you’re using the correct role. The role name can be found under the user details within the SageMaker domain configurations.

Update the IAM role to include an inline policy to allow all Amazon Simple Storage Service (Amazon S3) actions. This will be required to automate the creation and deletion of S3 buckets that will store the model artifacts. You can limit the access to specific S3 buckets. Note that we used a wildcard for the S3 bucket name in the IAM policy (tamperingdetection*).

Run the deployment notebook

Run each cell in the notebook tampering_detection_model_deploy.ipynb in order. We discuss some cells in more detail in the following sections.

Create an S3 bucket

Run the cell to create an S3 bucket. The bucket will be named tamperingdetection<current date time> and in the same AWS Region as your SageMaker Studio environment.

Create Amazon S3 bucket

Create the model artifact archive and upload to Amazon S3

Create a tar.gz file from the model artifacts. We have saved the model artifacts as a directory named 1, containing serialized signatures and the state needed to run them, including variable values and vocabularies to deploy to the SageMaker runtime. You can also include a custom inference file called inference.py within the code folder in the model artifact. The custom inference can be used for preprocessing and postprocessing of the input image.

Tar file with model artifacts

Upload model artifacts to Amazon S3

Create a SageMaker inference endpoint

The cell to create a SageMaker inference endpoint may take a few minutes to complete.

Create Amazon SageMaker Inference endpoint

Test the inference endpoint

The function check_image preprocesses an image as an ELA image, sends it to a SageMaker endpoint for inference, retrieves and processes the model’s predictions, and prints the results. The model takes a NumPy array of the input image as an ELA image to provide predictions. The predictions are output as 0, representing an untampered image, and 1, representing a forged image.

Test Amazon SageMaker Inference endpoint

Let’s invoke the model with an untampered image of a paystub and check the result.

Test an original image

The model outputs the classification as 0, representing an untampered image.

Now let’s invoke the model with a tampered image of a paystub and check the result.

Test a forged image

The model outputs the classification as 1, representing a forged image.

Limitations

Although ELA is an excellent tool for helping detect modifications, there are a number of limitations, such as the following:

  • A single pixel change or minor color adjustment may not generate a noticeable change in the ELA because JPEG operates on a grid.
  • ELA only identifies what regions have different compression levels. If a lower-quality image is spliced into a higher-quality picture, then the lower-quality image may appear as a darker region.
  • Scaling, recoloring, or adding noise to an image will modify the entire image, creating a higher error level potential.
  • If an image is resaved multiple times, then it may be entirely at a minimum error level, where more resaves do not alter the image. In this case, the ELA will return a black image and no modifications can be identified using this algorithm.
  • With Photoshop, the simple act of saving the picture can auto-sharpen textures and edges, creating a higher error level potential. This artifact doesn’t identify intentional modification; it identifies that an Adobe product was used. Technically, ELA appears as a modification because Adobe automatically performed a modification, but the modification was not necessarily intentional by the user.

We recommend using ELA alongside other techniques previously discussed in the blog in order to detect a greater range of image manipulation cases. ELA can also serve as an independent tool for visually examining image disparities, especially when training a CNN-based model becomes challenging.

Clean up

To remove the resources you created as part of this solution, complete the following steps:

  1. Run the notebook cells under the Cleanup section. This will delete the following:
    1. SageMaker inference endpoint – The inference endpoint name will be tamperingdetection-<datetime>.
    2. Objects within the S3 bucket and the S3 bucket itself – The bucket name will be tamperingdetection<datetime>.
  2. Shut down the SageMaker Studio notebook resources.

Conclusion

In this post, we presented an end-to-end solution for detecting document tampering and fraud using deep learning and SageMaker. We used ELA to preprocess images and identify discrepancies in compression levels that may indicate manipulation. Then we trained a CNN model on this processed dataset to classify images as original or tampered.

The model can achieve strong performance, with an accuracy over 95% with a dataset (forged and original) suited for your business requirements. This indicates that it can reliably detect forged documents like paystubs and bank statements. The trained model is deployed to a SageMaker endpoint to enable low-latency inference at scale. By integrating this solution into mortgage workflows, institutions can automatically flag suspicious documents for further fraud investigation.

Although powerful, ELA has some limitations in identifying certain types of more subtle manipulation. As next steps, the model could be enhanced by incorporating additional forensic techniques into training and using larger, more diverse datasets. Overall, this solution demonstrates how you can use deep learning and AWS services to build impactful solutions that boost efficiency, reduce risk, and prevent fraud.

In Part 3, we demonstrate how to implement the solution on Amazon Fraud Detector.


About the authors


Anup Ravindranath
is a Senior Solutions Architect at Amazon Web Services (AWS) based in Toronto, Canada working with Financial Services organizations. He helps customers to transform their businesses and innovate on cloud.

Vinnie Saini is a Senior Solutions Architect at Amazon Web Services (AWS) based in Toronto, Canada. She has been helping Financial Services customers transform on cloud, with AI and ML driven solutions laid on strong foundational pillars of Architectural Excellence.

Read More

Talk to your slide deck using multimodal foundation models hosted on Amazon Bedrock and Amazon SageMaker – Part 1

Talk to your slide deck using multimodal foundation models hosted on Amazon Bedrock and Amazon SageMaker – Part 1

With the advent of generative AI, today’s foundation models (FMs), such as the large language models (LLMs) Claude 2 and Llama 2, can perform a range of generative tasks such as question answering, summarization, and content creation on text data. However, real-world data exists in multiple modalities, such as text, images, video, and audio. Take a PowerPoint slide deck, for example. It could contain information in the form of text, or embedded in graphs, tables, and pictures.

In this post, we present a solution that uses multimodal FMs such as the Amazon Titan Multimodal Embeddings model and LLaVA 1.5 and AWS services including Amazon Bedrock and Amazon SageMaker to perform similar generative tasks on multimodal data.

Solution overview

The solution provides an implementation for answering questions using information contained in the text and visual elements of a slide deck. The design relies on the concept of Retrieval Augmented Generation (RAG). Traditionally, RAG has been associated with textual data that can be processed by LLMs. In this post, we extend RAG to include images as well. This provides a powerful search capability to extract contextually relevant content from visual elements like tables and graphs along with text.

There are different ways to design a RAG solution that includes images. We have presented one approach here and will follow up with an alternate approach in the second post of this three-part series.

This solution includes the following components:

  • Amazon Titan Multimodal Embeddings model – This FM is used to generate embeddings for the content in the slide deck used in this post. As a multimodal model, this Titan model can process text, images, or a combination as input and generate embeddings. The Titan Multimodal Embeddings model generates vectors (embeddings) of 1,024 dimensions and is accessed via Amazon Bedrock.
  • Large Language and Vision Assistant (LLaVA) – LLaVA is an open source multimodal model for visual and language understanding and is used to interpret the data in the slides, including visual elements such as graphs and tables. We use the 7-billion parameter version LLaVA 1.5-7b in this solution.
  • Amazon SageMaker – The LLaVA model is deployed on a SageMaker endpoint using SageMaker hosting services, and we use the resulting endpoint to run inferences against the LLaVA model. We also use SageMaker notebooks to orchestrate and demonstrate this solution end to end.
  • Amazon OpenSearch Serverless – OpenSearch Serverless is an on-demand serverless configuration for Amazon OpenSearch Service. We use OpenSearch Serverless as a vector database for storing embeddings generated by the Titan Multimodal Embeddings model. An index created in the OpenSearch Serverless collection serves as the vector store for our RAG solution.
  • Amazon OpenSearch Ingestion (OSI) – OSI is a fully managed, serverless data collector that delivers data to OpenSearch Service domains and OpenSearch Serverless collections. In this post, we use an OSI pipeline to deliver data to the OpenSearch Serverless vector store.

Solution architecture

The solution design consists of two parts: ingestion and user interaction. During ingestion, we process the input slide deck by converting each slide into an image, generate embeddings for these images, and then populate the vector data store. These steps are completed prior to the user interaction steps.

In the user interaction phase, a question from the user is converted into embeddings and a similarity search is run on the vector database to find a slide that could potentially contain answers to user question. We then provide this slide (in the form of an image file) to the LLaVA model and the user question as a prompt to generate an answer to the query. All the code for this post is available in the GitHub repo.

The following diagram illustrates the ingestion architecture.

Ingestion architecture diagram

The workflow steps are as follows:

  1. Slides are converted to image files (one per slide) in JPG format and passed to the Titan Multimodal Embeddings model to generate embeddings. In this post, we use the slide deck titled Train and deploy Stable Diffusion using AWS Trainium & AWS Inferentia from the AWS Summit in Toronto, June 2023, to demonstrate the solution. The sample deck has 31 slides, so we generate 31 sets of vector embeddings, each with 1,024 dimensions. We add additional metadata fields to these generated vector embeddings and create a JSON file. These additional metadata fields can be used to perform rich search queries using OpenSearch’s powerful search capabilities.
  2. The generated embeddings are put together in a single JSON file that is uploaded to Amazon Simple Storage Service (Amazon S3).
  3. Via Amazon S3 Event Notifications, an event is put in an Amazon Simple Queue Service (Amazon SQS) queue.
  4. This event in the SQS queue acts as a trigger to run the OSI pipeline, which in turn ingests the data (JSON file) as documents into the OpenSearch Serverless index. Note that the OpenSearch Serverless index is configured as the sink for this pipeline and is created as part of the OpenSearch Serverless collection.

The following diagram illustrates the user interaction architecture.

User interaction architecture

The workflow steps are as follows:

  1. A user submits a question related to the slide deck that has been ingested.
  2. The user input is converted into embeddings using the Titan Multimodal Embeddings model accessed via Amazon Bedrock. An OpenSearch vector search is performed using these embeddings. We perform a k-nearest neighbor (k=1) search to retrieve the most relevant embedding matching the user query. Setting k=1 retrieves the most relevant slide to the user question.
  3. The metadata of the response from OpenSearch Serverless contains a path to the image corresponding to the most relevant slide.
  4. A prompt is created by combining the user question and the image path and provided to LLaVA hosted on SageMaker. The LLaVA model is able to understand the user question and answer it by examining the data in the image.
  5. The result of this inference is returned to the user.

These steps are discussed in detail in the following sections. See the Results section for screenshots and details on the output.

Prerequisites

To implement the solution provided in this post, you should have an AWS account and familiarity with FMs, Amazon Bedrock, SageMaker, and OpenSearch Service.

This solution uses the Titan Multimodal Embeddings model. Ensure that this model is enabled for use in Amazon Bedrock. On the Amazon Bedrock console, choose Model access in the navigation pane. If Titan Multimodal Embeddings is enabled, the access status will state Access granted.

Manage model access in Amazon Bedrock

If the model is not available, enable access to the model by choosing Manage Model Access, selecting Titan Multimodal Embeddings G1, and choosing Request model access. The model is enabled for use immediately.

Request model access in Amazon Bedrock

Use an AWS CloudFormation template to create the solution stack

Use one of the following AWS CloudFormation templates (depending on your Region) to launch the solution resources.

AWS Region Link
us-east-1 Link to CloudFormation template for us-east-1
us-west-2 Link to CloudFormation template for us-west-2

After the stack is created successfully, navigate to the stack’s Outputs tab on the AWS CloudFormation console and note the value for MultimodalCollectionEndpoint, which we use in subsequent steps.

Resources created by the CloudFormation tempalate

The CloudFormation template creates the following resources:

  • IAM roles – The following AWS Identity and Access Management (IAM) roles are created. Update these roles to apply least-privilege permissions.
    • SMExecutionRole with Amazon S3, SageMaker, OpenSearch Service, and Bedrock full access.
    • OSPipelineExecutionRole with access to specific Amazon SQS and OSI actions.
  • SageMaker notebook – All the code for this post is run via this notebook.
  • OpenSearch Serverless collection – This is the vector database for storing and retrieving embeddings.
  • OSI pipeline – This is the pipeline for ingesting data into OpenSearch Serverless.
  • S3 bucket – All data for this post is stored in this bucket.
  • SQS queue – The events for triggering the OSI pipeline run are put in this queue.

The CloudFormation template configures the OSI pipeline with Amazon S3 and Amazon SQS processing as source and an OpenSearch Serverless index as sink. Any objects created in the specified S3 bucket and prefix (multimodal/osi-embeddings-json) will trigger SQS notifications, which are used by the OSI pipeline to ingest data into OpenSearch Serverless.

The CloudFormation template also creates network, encryption, and data access policies required for the OpenSearch Serverless collection. Update these policies to apply least-privilege permissions.

Note that the CloudFormation template name is referenced in SageMaker notebooks. If the default template name is changed, make sure you update the same in globals.py

Test the solution

After the prerequisite steps are complete and the CloudFormation stack has been created successfully, you’re now ready to test the solution:

  1. On the SageMaker console, choose Notebooks in the navigation pane.
  2. Select the MultimodalNotebookInstance notebook instance and choose Open JupyterLab.
    Notebook instance in Amazon SageMaker
  3. In File Browser, traverse to the notebooks folder to see the notebooks and supporting files.

The notebooks are numbered in the sequence in which they’re run. Instructions and comments in each notebook describe the actions performed by that notebook. We run these notebooks one by one.

  1. Choose 0_deploy_llava.ipynb to open it in JupyterLab.
  2. On the Run menu, choose Run All Cells to run the code in this notebook.

This notebook deploys the LLaVA-v1.5-7B model to a SageMaker endpoint. In this notebook, we download the LLaVA-v1.5-7B model from HuggingFace Hub, replace the inference.py script with llava_inference.py, and create a model.tar.gz file for this model. The model.tar.gz file is uploaded to Amazon S3 and used for deploying the model on SageMaker endpoint. The llava_inference.py script has additional code to allow reading an image file from Amazon S3 and running inference on it.

  1. Choose 1_data_prep.ipynb to open it in JupyterLab.
  2. On the Run menu, choose Run All Cells to run the code in this notebook.

This notebook downloads the slide deck, converts each slide into JPG file format, and uploads these to the S3 bucket used for this post.

  1. Choose 2_data_ingestion.ipynb to open it in JupyterLab.
  2. On the Run menu, choose Run All Cells to run the code in this notebook.

We do the following in this notebook:

  • We create an index in the OpenSearch Serverless collection. This index stores the embeddings data for the slide deck. See the following code:
session = boto3.Session()
credentials = session.get_credentials()
auth = AWSV4SignerAuth(credentials, g.AWS_REGION, g.OS_SERVICE)

os_client = OpenSearch(
  hosts = [{'host': host, 'port': 443}],
  http_auth = auth,
  use_ssl = True,
  verify_certs = True,
  connection_class = RequestsHttpConnection,
  pool_maxsize = 20
)

index_body = """
{
  "settings": {
      "index.knn": true
  },
  "mappings": {
      "properties": {
          "vector_embedding": {
              "type": "knn_vector",
              "dimension": 1024,
              "method": {
                  "name": "hnsw",
                  "engine": "nmslib",
                  "parameters": {}
              }
          },
          "image_path": {
              "type": "text"
          },
          "metadata": {
              "properties": {
                  "slide_filename": {
                      "type": "text"
                  },
                  "model_id": {
                      "type": "text"
                  },
                  "slide_description": {
                      "type": "text"
                  }
              }
          }
      }
  }
}
"""
index_body = json.loads(index_body)
try:
  response = os_client.indices.create(index_name, body=index_body)
  logger.info(f"response received for the create index -> {response}")
except Exception as e:
  logger.error(f"error in creating index={index_name}, exception={e}")
  • We use Titan Multimodal Embeddings model to convert the JPG images created in the previous notebook into vector embeddings. These embeddings and additional metadata (such as the S3 path of the image file) are stored in a JSON file and uploaded to Amazon S3. Note that a single JSON file is created, which contains documents for all the slides (images) converted into embeddings. The following code snippet shows how an image (in the form of a Base64 encoded string) is converted into embeddings:
def get_multimodal_embeddings(bedrock: botocore.client, image: str) -> np.ndarray:
    body = json.dumps(dict(inputImage=image))
    try:
        response = bedrock.invoke_model(
            body=body, modelId=g.FMC_MODEL_ID, accept=g.ACCEPT_ENCODING, contentType=g.CONTENT_ENCODING
        )
        response_body = json.loads(response.get("body").read())
        embeddings = np.array([response_body.get("embedding")]).astype(np.float32)
    except Exception as e:
        logger.error(f"exception while image(truncated)={image[:10]}, exception={e}")
        embeddings = None

    return embeddings
  • This action triggers the OpenSearch Ingestion pipeline, which processes the file and ingests it into the OpenSearch Serverless index. The following is a sample of the JSON file created. (A vector with four dimensions is shown in the example code. The Titan Multimodal Embeddings model generates 1,024 dimensions.)
[
  {
    "image_path": "s3://<your-bucket-name>/path/to/file1.json",
    "metadata": {
      "slide_filename": "mypowerpoint1.pptx",
      "model_id": "amazon.titan-embed-image-v1",
      "slide_description": "This is a test slide deck"
    },
    "vector_embedding": [
      657.6052386529958,
      0.8865137233123771,
      763.870264592026
    ]
  }
] 
  1. Choose 3_rag_inference.ipynb to open it in JupyterLab.
  2. On the Run menu, choose Run All Cells to run the code in this notebook.

This notebook implements the RAG solution: we convert the user question into embeddings, find a similar image (slide) from the vector database, and provide the retrieved image to LLaVA to generate an answer to the user question. We use the following prompt template:

prompt_template: str = """Pretend that you are a helpful assistant that answers questions about content in a slide deck. 
  Using only the information in the provided slide image answer the following question.
  If you do not find the answer in the image then say I did not find the answer to this question in the slide deck.

  {question}
"""

The following code snippet provides the RAG workflow:

# create prompt and convert to embeddings
question: str = "As per the AI/ML flywheel, what do the AWS AI/ML services provide?"
prompt = prompt_template.format(question=question)
text_embeddings = get_text_embeddings(bedrock, question)

# vector db search
vector_db_response: Dict = find_similar_data(text_embeddings)

# download image for local notebook display
s3_img_path = vector_db_response.get('hits', {}).get('hits')[0].get('_source').get('image_path')
logger.info(f"going to answer the question="{question}" using the image "{s3_img_path}"")

!aws s3 cp {s3_img_path} .
local_img_path = os.path.basename(s3_img_path)
display(Image(filename=local_img_path))

# Ask LLaVA
data = {
    "image" : s3_img_path,
    "question" : prompt,
    "temperature" : 0.1,
}
output = predictor.predict(data)
logger.info(f"Image={s3_img_path}nQuestion: {question}nAnswer: {output}nn")

Results

The following is a compilation of some user questions and responses generated by our implementation. The Question column captures the user question, and the Answer column is the textual response generated by LLaVA. Image is the k-nearest (k=1) slide match returned by the OpenSearch Serverless vector search. This slide is interpreted by LLaVA to provide the answer.

Multimodal RAG results

Question Answer Image
How does Inf2 compare in performance to comparable EC2 instances? I need numbers. According to the slide deck, Inf2 instances by AWS Inferentia2 offer up to 4x higher throughput and 10x lower latency compared to comparable EC2 instances. Image for Question 1 in the Results table
As per the AI/ML flywheel, what do the AWS AI/ML services provide? The AWS AI/ML services provide better $/perfer capabilities, new capabilities, and investment in innovation. Image for Question 2 in the Results table
Compared to GPT-2, how many more parameters does GPT-3 have? What is the numerical difference between the parameter size of GPT-2 and GPT-3? According to the slide, GPT-3 has 175 billion parameters, while GPT-2 has 1.5 billion parameters. The numerical difference between the parameter size of GPT-2 and GPT-3 is 173.5 billion. Image for Question 3 in the Results table
What are quarks in particle physics? I did not find the answer to this question in the slide deck. Image for Question 4 in the Results table

Feel free to extend this solution to your slide decks. Simply update the SLIDE_DECK variable in globals.py with a URL to your slide deck and run the ingestion steps detailed in the previous section.

Tip

You can use OpenSearch Dashboards to interact with the OpenSearch API to run quick tests on your index and ingested data. The following screenshot shows an OpenSearch dashboard GET example.

View of OpenSearch Dashboards

Clean up

To avoid incurring future charges, delete the resources you created. You can do this by deleting the stack via the CloudFormation console.

Deletion of the CloudFormation stack

Additionally, delete the SageMaker inference endpoint created for LLaVA inferencing. You can do this by uncommenting the cleanup step in 3_rag_inference.ipynb and running the cell, or by deleting the endpoint via the SageMaker console: choose Inference and Endpoints in the navigation pane, then select the endpoint and delete it.

Conclusion

Enterprises generate new content all the time, and slide decks are a common mechanism used to share and disseminate information internally with the organization and externally with customers or at conferences. Over time, rich information can remain buried and hidden in non-text modalities like graphs and tables in these slide decks. You can use this solution and the power of multimodal FMs such as the Titan Multimodal Embeddings model and LLaVA to discover new information or uncover new perspectives on content in slide decks.

We encourage you to learn more by exploring Amazon SageMaker JumpStart, Amazon Titan models, Amazon Bedrock, and OpenSearch Service, and building a solution using the sample implementation provided in this post.

Look out for two additional posts as part of this series. Part 2 covers another approach you could take to talk to your slide deck. This approach generates and stores LLaVA inferences and uses those stored inferences to respond to user queries. Part 3 compares the two approaches.


About the authors

Amit Arora is an AI and ML Specialist Architect at Amazon Web Services, helping enterprise customers use cloud-based machine learning services to rapidly scale their innovations. He is also an adjunct lecturer in the MS data science and analytics program at Georgetown University in Washington D.C.

Manju Prasad is a Senior Solutions Architect within Strategic Accounts at Amazon Web Services. She focuses on providing technical guidance in a variety of domains, including AI/ML to a marquee M&E customer. Prior to joining AWS, she designed and built solutions for companies in the financial services sector and also for a startup.

Archana Inapudi is a Senior Solutions Architect at AWS supporting strategic customers. She has over a decade of experience helping customers design and build data analytics and database solutions. She is passionate about using technology to provide value to customers and achieve business outcomes.

Antara Raisa is an AI and ML Solutions Architect at Amazon Web Services supporting strategic customers based out of Dallas, Texas. She also has previous experience working with large enterprise partners at AWS, where she worked as a Partner Success Solutions Architect for digital native customers.

Read More

Benchmark and optimize endpoint deployment in Amazon SageMaker JumpStart 

Benchmark and optimize endpoint deployment in Amazon SageMaker JumpStart 

When deploying a large language model (LLM), machine learning (ML) practitioners typically care about two measurements for model serving performance: latency, defined by the time it takes to generate a single token, and throughput, defined by the number of tokens generated per second. Although a single request to the deployed endpoint would exhibit a throughput approximately equal to the inverse of model latency, this is not necessarily the case when multiple concurrent requests are simultaneously sent to the endpoint. Due to model serving techniques, such as client-side continuous batching of concurrent requests, latency and throughput have a complex relationship that varies significantly based on model architecture, serving configurations, instance type hardware, number of concurrent requests, and variations in input payloads such as number of input tokens and output tokens.

This post explores these relationships via a comprehensive benchmarking of LLMs available in Amazon SageMaker JumpStart, including Llama 2, Falcon, and Mistral variants. With SageMaker JumpStart, ML practitioners can choose from a broad selection of publicly available foundation models to deploy to dedicated Amazon SageMaker instances within a network-isolated environment. We provide theoretical principles on how accelerator specifications impact LLM benchmarking. We also demonstrate the impact of deploying multiple instances behind a single endpoint. Finally, we provide practical recommendations for tailoring the SageMaker JumpStart deployment process to align with your requirements on latency, throughput, cost, and constraints on available instance types. All the benchmarking results as well as recommendations are based on a versatile notebook that you can adapt to your use case.

Deployed endpoint benchmarking

The following figure shows the lowest latencies (left) and highest throughput (right) values for deployment configurations across a variety of model types and instance types. Importantly, each of these model deployments use default configurations as provided by SageMaker JumpStart given the desired model ID and instance type for deployment.

These latency and throughput values correspond to payloads with 256 input tokens and 256 output tokens. The lowest latency configuration limits model serving to a single concurrent request, and the highest throughput configuration maximizes the possible number of concurrent requests. As we can see in our benchmarking, increasing concurrent requests monotonically increases throughput with diminishing improvement for large concurrent requests. Additionally, models are fully sharded on the supported instance. For example, because the ml.g5.48xlarge instance has 8 GPUs, all SageMaker JumpStart models using this instance are sharded using tensor parallelism on all eight available accelerators.

We can note a few takeaways from this figure. First, not all models are supported on all instances; some smaller models, such as Falcon 7B, don’t support model sharding, whereas larger models have higher compute resource requirements. Second, as sharding increases, performance typically improves, but may not necessarily improve for small modelsThis is because small models such as 7B and 13B incur a substantial communication overhead when sharded across too many accelerators. We discuss this in more depth later. Finally, ml.p4d.24xlarge instances tend to have significantly better throughput due to memory bandwidth improvements of A100 over A10G GPUs. As we discuss later, the decision to use a particular instance type depends on your deployment requirements, including latency, throughput, and cost constraints.

How can you obtain these lowest latency and highest throughput configuration values? Let’s start by plotting latency vs. throughput for a Llama 2 7B endpoint on an ml.g5.12xlarge instance for a payload with 256 input tokens and 256 output tokens, as seen in the following curve. A similar curve exists for every deployed LLM endpoint.

As concurrency increases, throughput and latency also monotonically increase. Therefore, the lowest latency point occurs at a concurrent request value of 1, and you can cost-effectively increase system throughput by increasing concurrent requests. There exists a distinct “knee” in this curve, where it’s obvious that the throughput gains associated with additional concurrency don’t outweigh the associated increase in latency. The exact location of this knee is use case-specific; some practitioners may define the knee at the point where a pre-specified latency requirement is exceeded (for example, 100 ms/token), whereas others may use load test benchmarks and queueing theory methods like the half-latency rule, and others may use theoretical accelerator specifications.

We also note that the maximum number of concurrent requests is limited. In the preceding figure, the line trace ends with 192 concurrent requests. The source of this limitation is the SageMaker invocation timeout limit, where SageMaker endpoints timeout an invocation response after 60 seconds. This setting is account-specific and not configurable for an individual endpoint. For LLMs, generating a large number of output tokens can take seconds or even minutes. Therefore, large input or output payloads can cause the invocation requests to fail. Furthermore, if the number of concurrent requests is very large, then many requests will experience large queue times, driving this 60-second timeout limit. For the purpose of this study, we use the timeout limit to define the maximum throughput possible for a model deployment. Importantly, although a SageMaker endpoint may handle a large number of concurrent requests without observing an invocation response timeout, you may want to define maximum concurrent requests with respect to the knee in the latency-throughput curve. This is likely the point at which you start to consider horizontal scaling, where a single endpoint provisions multiple instances with model replicas and load balances incoming requests between the replicas, to support more concurrent requests.

Taking this one step further, the following table contains benchmarking results for different configurations for the Llama 2 7B model, including different number of input and output tokens, instance types, and number of concurrent requests. Note that the preceding figure only plots a single row of this table.

. Throughput (tokens/sec) Latency (ms/token)
Concurrent Requests 1 2 4 8 16 32 64 128 256 512 1 2 4 8 16 32 64 128 256 512
Number of total tokens: 512,    Number of output tokens: 256
ml.g5.2xlarge 30 54 115 208 343 475 486 33 33 35 39 48 97 159
ml.g5.12xlarge 59 117 223 406 616 866 1098 1214 17 17 18 20 27 38 60 112
ml.g5.48xlarge 56 108 202 366 522 660 707 804 18 18 19 22 32 50 101 171
ml.p4d.24xlarge 49 85 178 353 654 1079 1544 2312 2905 2944 21 23 22 23 26 31 44 58 92 165
Number of total tokens: 4096,    Number of output tokens: 256
ml.g5.2xlarge 20 36 48 49 48 57 104 170
ml.g5.12xlarge 33 58 90 123 142 31 34 48 73 132
ml.g5.48xlarge 31 48 66 82 31 43 68 120
ml.p4d.24xlarge 39 73 124 202 278 290 26 27 33 43 66 107

We observe some additional patterns in this data. When increasing context size, latency increases and throughput decreases. For instance, on ml.g5.2xlarge with a concurrency of 1, throughput is 30 tokens/sec when the number of total tokens is 512, vs. 20 tokens/sec if the number of total tokens is 4,096. This is because it takes more time to process the larger input. We can also see that increasing GPU capability and sharding impacts the maximum throughput and maximum supported concurrent requests. The table shows that Llama 2 7B has notably different maximum throughput values for different instance types, and these maximum throughput values occur at different values of concurrent requests. These characteristics would drive an ML practitioner to justify the cost of one instance over another. For example, given a low latency requirement, the practitioner might select an ml.g5.12xlarge instance (4 A10G GPUs) over an ml.g5.2xlarge instance (1 A10G GPU). If given a high throughput requirement, the use of an ml.p4d.24xlarge instance (8 A100 GPUs) with full sharding would only be justified under high concurrency. Note, however, that it’s often beneficial to instead load multiple inference components of a 7B model on a single ml.p4d.24xlarge instance; such multi-model support is discussed later in this post.

The preceding observations were made for the Llama 2 7B model. However, similar patterns remain true for other models as well. A primary takeaway is that latency and throughput performance numbers are dependent on payload, instance type, and number of concurrent requests, so you will need to find the ideal configuration for your specific application. To generate the preceding numbers for your use case, you can run the linked notebook, where you can configure this load test analysis for your model, instance type, and payload.

Making sense of accelerator specifications

Selecting suitable hardware for LLM inference relies heavily on specific use cases, user experience goals, and the chosen LLM. This section attempts to create an understanding of the knee in the latency-throughput curve with respect to high-level principles based on accelerator specifications. These principles alone don’t suffice to make a decision: real benchmarks are necessary. The term device is used here to encompass all ML hardware accelerators. We assert the knee in the latency-throughput curve is driven by one of two factors:

  • The accelerator has exhausted memory to cache KV matrices, so subsequent requests are queued
  • The accelerator still has spare memory for the KV cache, but is using a large enough batch size that processing time is driven by compute operation latency rather than memory bandwidth

We typically prefer to be limited by the second factor because this implies the accelerator resources are saturated. Basically, you are maximizing the resources you payed for. Let’s explore this assertion in greater detail.

KV caching and device memory

Standard transformer attention mechanisms compute attention for each new token against all previous tokens. Most modern ML servers cache attention keys and values in device memory (DRAM) to avoid re-computation at every step. This is called this the KV cache, and it grows with batch size and sequence length. It defines how many user requests can be served in parallel and will determine the knee in the latency-throughput curve if the compute-bound regime in the second scenario mentioned earlier is not yet met, given the available DRAM. The following formula is a rough approximation for the maximum KV cache size.

In this formula, B is batch size and N is number of accelerators. For example, the Llama 2 7B model in FP16 (2 bytes/parameter) served on an A10G GPU (24 GB DRAM) consumes approximately 14 GB, leaving 10 GB for the KV cache. Plugging in the model’s full context length (N = 4096) and remaining parameters (n_layers=32, n_kv_attention_heads=32, and d_attention_head=128), this expression shows we are limited to serving a batch size of four users in parallel due to DRAM constraints. If you observe the corresponding benchmarks in the previous table, this is a good approximation for the observed knee in this latency-throughput curve. Methods such as grouped query attention (GQA) can reduce the KV cache size, in GQA’s case by the same factor it reduces the number of KV heads.

Arithmetic intensity and device memory bandwidth

The growth in the computational power of ML accelerators has outpaced their memory bandwidth, meaning they can perform many more computations on each byte of data in the amount of time it takes to access that byte.

The arithmetic intensity, or the ratio of compute operations to memory accesses, for an operation determines if it is limited by memory bandwidth or compute capacity on the selected hardware. For example, an A10G GPU (g5 instance type family) with 70 TFLOPS FP16 and 600 GB/sec bandwidth can compute approximately 116 ops/byte. An A100 GPU (p4d instance type family) can compute approximately 208 ops/byte. If the arithmetic intensity for a transformer model is under that value, it is memory-bound; if it is above, it is compute-bound. The attention mechanism for Llama 2 7B requires 62 ops/byte for batch size 1 (for an explanation, see A guide to LLM inference and performance), which means it is memory-bound. When the attention mechanism is memory-bound, expensive FLOPS are left unutilized.

There are two ways to better utilize the accelerator and increase arithmetic intensity: reduce the required memory accesses for the operation (this is what FlashAttention focuses on) or increase the batch size. However, we might not be able to increase our batch size enough to reach a compute-bound regime if our DRAM is too small to hold the corresponding KV cache. A crude approximation of the critical batch size B* that separates compute-bound from memory-bound regimes for standard GPT decoder inference is described by the following expression, where A_mb is the accelerator memory bandwidth, A_f is accelerator FLOPS, and N is the number of accelerators. This critical batch size can be derived by finding where memory access time equals computation time. Refer to this blog post to understand Equation 2 and its assumptions in greater detail.

This is the same ops/byte ratio we previously calculated for A10G, so the critical batch size on this GPU is 116. One way to approach this theoretical, critical batch size is to increase model sharding and split the cache across more N accelerators. This effectively increases the KV cache capacity as well as the memory-bound batch size.

Another benefit of model sharding is splitting model parameter and data loading work across N accelerators. This type of sharding is a type of model parallelism also referred to as tensor parallelism. Naively, there is N times the memory bandwidth and compute power in aggregate. Assuming no overhead of any kind (communication, software, and so on), this would decrease decoding latency per token by N if we are memory-bound, because token decoding latency in this regime is bound by the time it takes to load the model weights and cache. In real life, however, increasing the degree of sharding results in increased communication between devices to share intermediate activations at every model layer. This communication speed is limited by the device interconnect bandwidth. It’s difficult to estimate its impact precisely (for details, see Model parallelism), but this can eventually stop yielding benefits or deteriorate performance — this is especially true for smaller models, because smaller data transfers lead to lower transfer rates.

To compare ML accelerators based on their specs, we recommend the following. First, calculate the approximate critical batch size for each accelerator type according to the second equation and the KV cache size for the critical batch size according to the first equation. You can then use the available DRAM on the accelerator to calculate the minimum number of accelerators required to fit the KV cache and model parameters. If deciding between multiple accelerators, prioritize accelerators in order of lowest cost per GB/sec of memory bandwidth. Finally, benchmark these configurations and verify what is the best cost/token for your upper bound of desired latency.

Select an endpoint deployment configuration

Many LLMs distributed by SageMaker JumpStart use the text-generation-inference (TGI) SageMaker container for model serving. The following table discusses how to adjust a variety of model serving parameters to either affect model serving which impacts the latency-throughput curve or protect the endpoint against requests that would overload the endpoint. These are the primary parameters you can use to configure your endpoint deployment for your use case. Unless otherwise specified, we use default text generation payload parameters and TGI environment variables.

Environment Variable Description SageMaker JumpStart Default Value
Model serving configurations . .
MAX_BATCH_PREFILL_TOKENS Limits the number of tokens in the prefill operation. This operation generates the KV cache for a new input prompt sequence. It is memory intensive and compute bound, so this value caps the number of tokens allowed in a single prefill operation. Decoding steps for other queries pause while prefill is occurring. 4096 (TGI default) or model-specific maximum supported context length (SageMaker JumpStart provided), whichever is greater.
MAX_BATCH_TOTAL_TOKENS Controls the maximum number of tokens to include within a batch during decoding, or a single forward pass through the model. Ideally, this is set to maximize the usage of all available hardware. Not specified (TGI default). TGI will set this value with respect to remaining CUDA memory during model warm up.
SM_NUM_GPUS The number of shards to use. That is, the number of GPUs used to run the model using tensor parallelism. Instance dependent (SageMaker JumpStart provided). For each supported instance for a given model, SageMaker JumpStart provides the best setting for tensor parallelism.
Configurations to guard your endpoint (set these for your use case) . .
MAX_TOTAL_TOKENS This caps the memory budget of a single client request by limiting the number of tokens in the input sequence plus the number of tokens in the output sequence (the max_new_tokens payload parameter). Model-specific maximum supported context length. For example, 4096 for Llama 2.
MAX_INPUT_LENGTH Identifies the maximum allowed number of tokens in the input sequence for a single client request. Things to consider when increasing this value include: longer input sequences require more memory, which affects continuous batching, and many models have a supported context length that should not be exceeded. Model-specific maximum supported context length. For example, 4095 for Llama 2.
MAX_CONCURRENT_REQUESTS The maximum number of concurrent requests allowed by the deployed endpoint. New requests beyond this limit will immediately raise a model overloaded error to prevent poor latency for the current processing requests. 128 (TGI default). This setting allows you to obtain high throughput for a variety of use cases, but you should pin as appropriate to mitigate SageMaker invocation timeout errors.

The TGI server uses continuous batching, which dynamically batches concurrent requests together to share a single model inference forward pass. There are two types of forward passes: prefill and decode. Each new request must run a single prefill forward pass to populate the KV cache for the input sequence tokens. After the KV cache is populated, a decode forward pass performs a single next-token prediction for all batched requests, which is iteratively repeated to produce the output sequence. As new requests are sent to the server, the next decode step must wait so the prefill step can run for the new requests. This must occur before those new requests are included in subsequent continuously batched decode steps. Due to hardware constraints, the continuous batching used for decoding may not include all requests. At this point, requests enter a processing queue and inference latency starts to significantly increase with only minor throughput gain.

It’s possible to separate LLM latency benchmarking analyses into prefill latency, decode latency, and queue latency. The time consumed by each of these components is fundamentally different in nature: prefill is a one-time computation, decoding occurs one time for each token in the output sequence, and queueing involves server batching processes. When multiple concurrent requests are being processed, it becomes difficult to disentangle the latencies from each of these components because the latency experienced by any given client request involves queue latencies driven by the need to prefill new concurrent requests as well as queue latencies driven by the inclusion of the request in batch decoding processes. For this reason, this post focuses on end-to-end processing latency. The knee in the latency-throughput curve occurs at the point of saturation where queue latencies start to significantly increase. This phenomenon occurs for any model inference server and is driven by accelerator specifications.

Common requirements during deployment include satisfying a minimum required throughput, maximum allowed latency, maximum cost per hour, and maximum cost to generate 1 million tokens. You should condition these requirements on payloads that represent end-user requests. A design to meet these requirements should consider many factors, including the specific model architecture, size of the model, instance types, and instance count (horizontal scaling). In the following sections, we focus on deploying endpoints to minimize latency, maximize throughput, and minimize cost. This analysis considers 512 total tokens and 256 output tokens.

Minimize latency

Latency is an important requirement in many real-time use cases. In the following table, we look at minimum latency for each model and each instance type. You can achieve minimum latency by setting MAX_CONCURRENT_REQUESTS = 1.

Minimum Latency (ms/token)
Model ID ml.g5.2xlarge ml.g5.12xlarge ml.g5.48xlarge ml.p4d.24xlarge ml.p4de.24xlarge
Llama 2 7B 33 17 18 20
Llama 2 7B Chat 33 17 18 20
Llama 2 13B 22 23 23
Llama 2 13B Chat 23 23 23
Llama 2 70B 57 43
Llama 2 70B Chat 57 45
Mistral 7B 35
Mistral 7B Instruct 35
Mixtral 8x7B 33 27
Falcon 7B 33
Falcon 7B Instruct 33
Falcon 40B 53 33 27
Falcon 40B Instruct 53 33 28
Falcon 180B 42
Falcon 180B Chat 42

To achieve minimum latency for a model, you can use the following code while substituting your desired model ID and instance type:

from sagemaker.jumpstart.model import JumpStartModel

model = JumpStartModel(
    model_id="meta-textgeneration-llama-2-7b",
    model_version="3.*",
    instance_type="ml.g5.12xlarge",
    env={
        "MAX_CONCURRENT_REQUESTS": "1",
        "MAX_INPUT_TOKENS": "256",
        "MAX_TOTAL_TOKENS": "512",
    },
)
predictor = model.deploy(accept_eula=False)  # Change EULA acceptance to True

Note that the latency numbers change depending on the number of input and output tokens. However, the deployment process remains the same except the environment variables MAX_INPUT_TOKENS and MAX_TOTAL_TOKENS. Here, these environment variables are set to help guarantee endpoint latency requirements because larger input sequences may violate the latency requirement. Note that SageMaker JumpStart already provides the other optimal environment variables when selecting instance type; for instance, using ml.g5.12xlarge will set SM_NUM_GPUS to 4 in the model environment.

Maximize throughput

In this section, we maximize the number of generated tokens per second. This is typically achieved at the maximum valid concurrent requests for the model and the instance type. In the following table, we report the throughput achieved at the largest concurrent request value achieved before encountering a SageMaker invocation timeout for any request.

Maximum Throughput (tokens/sec), Concurrent Requests
Model ID ml.g5.2xlarge ml.g5.12xlarge ml.g5.48xlarge ml.p4d.24xlarge ml.p4de.24xlarge
Llama 2 7B 486 (64) 1214 (128) 804 (128) 2945 (512)
Llama 2 7B Chat 493 (64) 1207 (128) 932 (128) 3012 (512)
Llama 2 13B 787 (128) 496 (64) 3245 (512)
Llama 2 13B Chat 782 (128) 505 (64) 3310 (512)
Llama 2 70B 124 (16) 1585 (256)
Llama 2 70B Chat 114 (16) 1546 (256)
Mistral 7B 947 (64)
Mistral 7B Instruct 986 (128)
Mixtral 8x7B 701 (128) 3196 (512)
Falcon 7B 1340 (128)
Falcon 7B Instruct 1313 (128)
Falcon 40B 244 (32) 382 (64) 2699 (512)
Falcon 40B Instruct 245 (32) 415 (64) 2675 (512)
Falcon 180B 1100 (128)
Falcon 180B Chat 1081 (128)

To achieve maximum throughput for a model, you can use the following code:

from sagemaker.jumpstart.model import JumpStartModel

model = JumpStartModel(
    model_id="meta-textgeneration-llama-2-7b",
    model_version="3.*",
    instance_type="ml.g5.12xlarge",
    env={
        "MAX_CONCURRENT_REQUESTS": "128",  # For your application, identify it from the benchmarking table with the maximum feasible concurrent requests.
        "MAX_INPUT_TOKENS": "256",
        "MAX_TOTAL_TOKENS": "512",
    },
)
predictor = model.deploy(accept_eula=False)  # Change EULA acceptance to True

Note that the maximum number of concurrent requests depends on the model type, instance type, maximum number of input tokens, and maximum number of output tokens. Therefore, you should set these parameters before setting MAX_CONCURRENT_REQUESTS.

Also note that a user interested in minimizing latency is often at odds with a user interested in maximizing throughput. The former is interested in real-time responses, whereas the latter is interested in batch processing such that the endpoint queue is always saturated, thereby minimizing processing downtime. Users who want to maximize throughput conditioned on latency requirements are often interested in operating at the knee in the latency-throughput curve.

Minimize cost

The first option to minimize cost involves minimizing cost per hour. With this, you can deploy a selected model on the SageMaker instance with the lowest cost per hour. For real-time pricing of SageMaker instances, refer to Amazon SageMaker pricing. In general, the default instance type for SageMaker JumpStart LLMs is the lowest-cost deployment option.

The second option to minimize cost involves minimizing the cost to generate 1 million tokens. This is a simple transformation of the table we discussed earlier to maximize throughput, where you can first compute the time it takes in hours to generate 1 million tokens (1e6 / throughput / 3600). You can then multiply this time to generate 1 million tokens with the price per hour of the specified SageMaker instance.

Note that instances with the lowest cost per hour aren’t the same as instances with the lowest cost to generate 1 million tokens. For instance, if the invocation requests are sporadic, an instance with the lowest cost per hour might be optimal, whereas in the throttling scenarios, the lowest cost to generate a million tokens might be more appropriate.

Tensor parallel vs. multi-model trade-off

In all previous analyses, we considered deploying a single model replica with a tensor parallel degree equal to the number of GPUs on the deployment instance type. This is the default SageMaker JumpStart behavior. However, as previously noted, sharding a model can improve model latency and throughput only up to a certain limit, beyond which inter-device communication requirements dominate computation time. This implies that it’s often beneficial to deploy multiple models with a lower tensor parallel degree on a single instance rather than a single model with a higher tensor parallel degree.

Here, we deploy Llama 2 7B and 13B endpoints on ml.p4d.24xlarge instances with tensor parallel (TP) degrees of 1, 2, 4, and 8. For clarity in model behavior, each of these endpoints only load a single model.

. Throughput (tokens/sec) Latency (ms/token)
Concurrent Requests 1 2 4 8 16 32 64 128 256 512 1 2 4 8 16 32 64 128 256 512
TP Degree Llama 2 13B
1 38 74 147 278 443 612 683 722 26 27 27 29 37 45 87 174
2 49 92 183 351 604 985 1435 1686 1726 21 22 22 22 25 32 46 91 159
4 46 94 181 343 655 1073 1796 2408 2764 2819 23 21 21 24 25 30 37 57 111 172
8 44 86 158 311 552 1015 1654 2450 3087 3180 22 24 26 26 29 36 42 57 95 152
. Llama 2 7B
1 62 121 237 439 778 1122 1569 1773 1775 16 16 17 18 22 28 43 88 151
2 62 122 239 458 780 1328 1773 2440 2730 2811 16 16 17 18 21 25 38 56 103 182
4 60 106 211 420 781 1230 2206 3040 3489 3752 17 19 20 18 22 27 31 45 82 132
8 49 97 179 333 612 1081 1652 2292 2963 3004 22 20 24 26 27 33 41 65 108 167

Our previous analyses already showed significant throughput advantages on ml.p4d.24xlarge instances, which often translates to better performance in terms of cost to generate 1 million tokens over the g5 instance family under high concurrent request load conditions. This analysis clearly demonstrates that you should consider the trade-off between model sharding and model replication within a single instance; that is, a fully sharded model is not typically the best use of  ml.p4d.24xlarge compute resources for 7B and 13B model families. In fact, for the 7B model family, you obtain the best throughput for a single model replica with a tensor parallel degree of 4 instead of 8.

From here, you can extrapolate that the highest throughput configuration for the 7B model involves a tensor parallel degree of 1 with eight model replicas, and the highest throughput configuration for the 13B model is likely a tensor parallel degree of 2 with four model replicas. To learn more about how to accomplish this, refer to Reduce model deployment costs by 50% on average using the latest features of Amazon SageMaker, which demonstrates the use of inference component-based endpoints. Due to load balancing techniques, server routing, and sharing of CPU resources, you might not fully achieve throughput improvements exactly equal to the number of replicas times the throughput for a single replica.

Horizontal scaling

As observed earlier, each endpoint deployment has a limitation on the number of concurrent requests depending on the number of input and output tokens as well as the instance type. If this doesn’t meet your throughput or concurrent request requirement, you can scale up to utilize more than one instance behind the deployed endpoint. SageMaker automatically performs load balancing of queries between instances. For example, the following code deploys an endpoint supported by three instances:

model = JumpStartModel(
    model_id="meta-textgeneration-llama-2-7b",
    model_version="3.*",
    instance_type="ml.g5.2xlarge",
)
predictor = model.deploy(
    accept_eula=False,  # Change EULA acceptance to True
    initial_instance_count = 3,
)

The following table shows the throughput gain as a factor of number of instances for the Llama 2 7B model.

. . Throughput (tokens/sec) Latency (ms/token)
. Concurrent Requests 1 2 4 8 16 32 64 128 1 2 4 8 16 32 64 128
Instance Count Instance Type Number of total tokens: 512, Number of output tokens: 256
1 ml.g5.2xlarge 30 60 115 210 351 484 492 32 33 34 37 45 93 160
2 ml.g5.2xlarge 30 60 115 221 400 642 922 949 32 33 34 37 42 53 94 167
3 ml.g5.2xlarge 30 60 118 228 421 731 1170 1400 32 33 34 36 39 47 57 110

Notably, the knee in the latency-throughput curve shifts to the right because higher instance counts can handle larger numbers of concurrent requests within the multi-instance endpoint. For this table, the concurrent request value is for the entire endpoint, not the number of concurrent requests that each individual instance receives.

You can also use autoscaling, a feature to monitor your workloads and dynamically adjust the capacity to maintain steady and predictable performance at the possible lowest cost. This is beyond the scope of this post. To learn more about autoscaling, refer to Configuring autoscaling inference endpoints in Amazon SageMaker.

Invoke endpoint with concurrent requests

Let’s suppose you have a large batch of queries that you would like to use to generate responses from a deployed model under high throughput conditions. For example, in the following code block, we compile a list of 1,000 payloads, with each payload requesting the generation of 100 tokens. In all, we are requesting the generation of 100,000 tokens.

payload = {
    "inputs": "I believe the meaning of life is to ",
    "parameters": {"max_new_tokens": 100, "details": True},
}
total_requests = 1000
payloads = [payload,] * total_requests

When sending a large number of requests to the SageMaker runtime API, you may experience throttling errors. To mitigate this, you can create a custom SageMaker runtime client that increases the number of retry attempts. You can provide the resulting SageMaker session object to either the JumpStartModel constructor or sagemaker.predictor.retrieve_default if you would like to attach a new predictor to an already deployed endpoint. In the following code, we use this session object when deploying a Llama 2 model with default SageMaker JumpStart configurations:

import boto3
from botocore.config import Config
from sagemaker.session import Session
from sagemaker.jumpstart.model import JumpStartModel

sagemaker_session = Session(
    sagemaker_runtime_client=boto3.client(
        "sagemaker-runtime",
        config=Config(connect_timeout=10, retries={"mode": "standard", "total_max_attempts": 20}),
    )
)
model = JumpStartModel(
    model_id="meta-textgeneration-llama-2-7b",
    model_version="3.*",
    sagemaker_session=sagemaker_session
)
predictor = model.deploy(accept_eula=False)  # Change EULA acceptance to True

This deployed endpoint has MAX_CONCURRENT_REQUESTS = 128 by default. In the following block, we use the concurrent futures library to iterate over invoking the endpoint for all payloads with 128 worker threads. At most, the endpoint will process 128 concurrent requests, and whenever a request returns a response, the executor will immediately send a new request to the endpoint.

import time
from concurrent import futures

concurrent_requests = 128

time_start = time.time()
with futures.ThreadPoolExecutor(max_workers=concurrent_requests) as executor:
    responses = list(executor.map(predictor.predict, payloads))

total_tokens = sum([response[0]["details"]["generated_tokens"] for response in responses])
token_throughput = total_tokens / (time.time() - time_start)

This results in generating 100,000 total tokens with a throughput of 1255 tokens/sec on a single ml.g5.2xlarge instance. This takes approximately 80 seconds to process.

Note that this throughput value is notably different than the maximum throughput for Llama 2 7B on ml.g5.2xlarge in the previous tables of this post (486 tokens/sec at 64 concurrent requests). This is because the input payload uses 8 tokens instead of 256, the output token count is 100 instead of 256, and the smaller token counts allow for 128 concurrent requests. This is a final reminder that all latency and throughput numbers are payload dependent! Changing payload token counts will affect batching processes during model serving, which will in turn affect the emergent prefill, decode, and queue times for your application.

Conclusion

In this post, we presented benchmarking of SageMaker JumpStart LLMs, including Llama 2, Mistral, and Falcon. We also presented a guide to optimize latency, throughput, and cost for your endpoint deployment configuration. You can get started by running the associated notebook to benchmark your use case.


About the Authors

 Dr. Kyle Ulrich is an Applied Scientist with the Amazon SageMaker JumpStart team. His research interests include scalable machine learning algorithms, computer vision, time series, Bayesian non-parametrics, and Gaussian processes. His PhD is from Duke University and he has published papers in NeurIPS, Cell, and Neuron.

Dr. Vivek Madan is an Applied Scientist with the Amazon SageMaker JumpStart team. He got his PhD from University of Illinois at Urbana-Champaign and was a Post Doctoral Researcher at Georgia Tech. He is an active researcher in machine learning and algorithm design and has published papers in EMNLP, ICLR, COLT, FOCS, and SODA conferences.

Dr. Ashish Khetan is a Senior Applied Scientist with Amazon SageMaker JumpStart and helps develop machine learning algorithms. He got his PhD from University of Illinois Urbana-Champaign. He is an active researcher in machine learning and statistical inference, and has published many papers in NeurIPS, ICML, ICLR, JMLR, ACL, and EMNLP conferences.

João Moura is a Senior AI/ML Specialist Solutions Architect at AWS. João helps AWS customers – from small startups to large enterprises – train and deploy large models efficiently, and more broadly build ML platforms on AWS.

Read More

Architect defense-in-depth security for generative AI applications using the OWASP Top 10 for LLMs

Architect defense-in-depth security for generative AI applications using the OWASP Top 10 for LLMs

Generative artificial intelligence (AI) applications built around large language models (LLMs) have demonstrated the potential to create and accelerate economic value for businesses. Examples of applications include conversational search, customer support agent assistance, customer support analytics, self-service virtual assistants, chatbots, rich media generation, content moderation, coding companions to accelerate secure, high-performance software development, deeper insights from multimodal content sources, acceleration of your organization’s security investigations and mitigations, and much more. Many customers are looking for guidance on how to manage security, privacy, and compliance as they develop generative AI applications. Understanding and addressing LLM vulnerabilities, threats, and risks during the design and architecture phases helps teams focus on maximizing the economic and productivity benefits generative AI can bring. Being aware of risks fosters transparency and trust in generative AI applications, encourages increased observability, helps to meet compliance requirements, and facilitates informed decision-making by leaders.

The goal of this post is to empower AI and machine learning (ML) engineers, data scientists, solutions architects, security teams, and other stakeholders to have a common mental model and framework to apply security best practices, allowing AI/ML teams to move fast without trading off security for speed. Specifically, this post seeks to help AI/ML and data scientists who may not have had previous exposure to security principles gain an understanding of core security and privacy best practices in the context of developing generative AI applications using LLMs. We also discuss common security concerns that can undermine trust in AI, as identified by the Open Worldwide Application Security Project (OWASP) Top 10 for LLM Applications, and show ways you can use AWS to increase your security posture and confidence while innovating with generative AI.

This post provides three guided steps to architect risk management strategies while developing generative AI applications using LLMs. We first delve into the vulnerabilities, threats, and risks that arise from the implementation, deployment, and use of LLM solutions, and provide guidance on how to start innovating with security in mind. We then discuss how building on a secure foundation is essential for generative AI. Lastly, we connect these together with an example LLM workload to describe an approach towards architecting with defense-in-depth security across trust boundaries.

By the end of this post, AI/ML engineers, data scientists, and security-minded technologists will be able to identify strategies to architect layered defenses for their generative AI applications, understand how to map OWASP Top 10 for LLMs security concerns to some corresponding controls, and build foundational knowledge towards answering the following top AWS customer question themes for their applications:

  • What are some of the common security and privacy risks with using generative AI based on LLMs in my applications that I can most impact with this guidance?
  • What are some ways to implement security and privacy controls in the development lifecycle for generative AI LLM applications on AWS?
  • What operational and technical best practices can I integrate into how my organization builds generative AI LLM applications to manage risk and increase confidence in generative AI applications using LLMs?

Improve security outcomes while developing generative AI

Innovation with generative AI using LLMs requires starting with security in mind to develop organizational resiliency, build on a secure foundation, and integrate security with a defense in depth security approach. Security is a shared responsibility between AWS and AWS customers. All the principles of the AWS Shared Responsibility Model are applicable to generative AI solutions. Refresh your understanding of the AWS Shared Responsibility Model as it applies to infrastructure, services, and data when you build LLM solutions.

Start with security in mind to develop organizational resiliency

Start with security in mind to develop organizational resiliency for developing generative AI applications that meet your security and compliance objectives. Organizational resiliency draws on and extends the definition of resiliency in the AWS Well-Architected Framework to include and prepare for the ability of an organization to recover from disruptions. Consider your security posture, governance, and operational excellence when assessing overall readiness to develop generative AI with LLMs and your organizational resiliency to any potential impacts. As your organization advances its use of emerging technologies such as generative AI and LLMs, overall organizational resiliency should be considered as a cornerstone of a layered defensive strategy to protect assets and lines of business from unintended consequences.

Organizational resiliency matters substantially for LLM applications

Although all risk management programs can benefit from resilience, organizational resiliency matters substantially for generative AI. Five of the OWASP-identified top 10 risks for LLM applications rely on defining architectural and operational controls and enforcing them at an organizational scale in order to manage risk. These five risks are insecure output handling, supply chain vulnerabilities, sensitive information disclosure, excessive agency, and overreliance. Begin increasing organizational resiliency by socializing your teams to consider AI, ML, and generative AI security a core business requirement and top priority throughout the whole lifecycle of the product, from inception of the idea, to research, to the application’s development, deployment, and use. In addition to awareness, your teams should take action to account for generative AI in governance, assurance, and compliance validation practices.

Build organizational resiliency around generative AI

Organizations can start adopting ways to build their capacity and capabilities for AI/ML and generative AI security within their organizations. You should begin by extending your existing security, assurance, compliance, and development programs to account for generative AI.

The following are the five key areas of interest for organizational AI, ML, and generative AI security:

  • Understand the AI/ML security landscape
  • Include diverse perspectives in security strategies
  • Take action proactively for securing research and development activities
  • Align incentives with organizational outcomes
  • Prepare for realistic security scenarios in AI/ML and generative AI

Develop a threat model throughout your generative AI Lifecycle

Organizations building with generative AI should focus on risk management, not risk elimination, and include threat modeling in and business continuity planning the planning, development, and operations of generative AI workloads. Work backward from production use of generative AI by developing a threat model for each application using traditional security risks as well as generative AI-specific risks. Some risks may be acceptable to your business, and a threat modeling exercise can help your company identify what your acceptable risk appetite is. For example, your business may not require 99.999% uptime on a generative AI application, so the additional recovery time associated to recovery using AWS Backup with Amazon S3 Glacier may be an acceptable risk. Conversely, the data in your model may be extremely sensitive and highly regulated, so deviation from AWS Key Management Service (AWS KMS) customer managed key (CMK) rotation and use of AWS Network Firewall to help enforce Transport Layer Security (TLS) for ingress and egress traffic to protect against data exfiltration may be an unacceptable risk.

Evaluate the risks (inherent vs. residual) of using the generative AI application in a production setting to identify the right foundational and application-level controls. Plan for rollback and recovery from production security events and service disruptions such as prompt injection, training data poisoning, model denial of service, and model theft early on, and define the mitigations you will use as you define application requirements. Learning about the risks and controls that need to be put in place will help define the best implementation approach for building a generative AI application, and provide stakeholders and decision-makers with information to make informed business decisions about risk. If you are unfamiliar with the overall AI and ML workflow, start by reviewing 7 ways to improve security of your machine learning workloads to increase familiarity with the security controls needed for traditional AI/ML systems.

Just like building any ML application, building a generative AI application involves going through a set of research and development lifecycle stages. You may want to review the AWS Generative AI Security Scoping Matrix to help build a mental model to understand the key security disciplines that you should consider depending on which generative AI solution you select.

Generative AI applications using LLMs are typically developed and operated following ordered steps:

  • Application requirements – Identify use case business objectives, requirements, and success criteria
  • Model selection – Select a foundation model that aligns with use case requirements
  • Model adaptation and fine-tuning – Prepare data, engineer prompts, and fine-tune the model
  • Model evaluation – Evaluate foundation models with use case-specific metrics and select the best-performing model
  • Deployment and integration – Deploy the selected foundation model on your optimized infrastructure and integrate with your generative AI application
  • Application monitoring – Monitor application and model performance to enable root cause analysis

Ensure teams understand the critical nature of security as part of the design and architecture phases of your software development lifecycle on Day 1. This means discussing security at each layer of your stack and lifecycle, and positioning security and privacy as enablers to achieving business objectives.Architect controls for threats before you launch your LLM application, and consider whether the data and information you will use for model adaptation and fine-tuning warrants controls implementation in the research, development, and training environments. As part of quality assurance tests, introduce synthetic security threats (such as attempting to poison training data, or attempting to extract sensitive data through malicious prompt engineering) to test out your defenses and security posture on a regular basis.

Additionally, stakeholders should establish a consistent review cadence for production AI, ML, and generative AI workloads and set organizational priority on understanding trade-offs between human and machine control and error prior to launch. Validating and assuring that these trade-offs are respected in the deployed LLM applications will increase the likelihood of risk mitigation success.

Build generative AI applications on secure cloud foundations

At AWS, security is our top priority. AWS is architected to be the most secure global cloud infrastructure on which to build, migrate, and manage applications and workloads. This is backed by our deep set of over 300 cloud security tools and the trust of our millions of customers, including the most security-sensitive organizations like government, healthcare, and financial services. When building generative AI applications using LLMs on AWS, you gain security benefits from the secure, reliable, and flexible AWS Cloud computing environment.

Use an AWS global infrastructure for security, privacy, and compliance

When you develop data-intensive applications on AWS, you can benefit from an AWS global Region infrastructure, architected to provide capabilities to meet your core security and compliance requirements. This is reinforced by our AWS Digital Sovereignty Pledge, our commitment to offering you the most advanced set of sovereignty controls and features available in the cloud. We are committed to expanding our capabilities to allow you to meet your digital sovereignty needs, without compromising on the performance, innovation, security, or scale of the AWS Cloud. To simplify implementation of security and privacy best practices, consider using reference designs and infrastructure as code resources such as the AWS Security Reference Architecture (AWS SRA) and the AWS Privacy Reference Architecture (AWS PRA). Read more about architecting privacy solutions, sovereignty by design, and compliance on AWS and use services such as AWS Config, AWS Artifact, and AWS Audit Manager to support your privacy, compliance, audit, and observability needs.

Understand your security posture using AWS Well-Architected and Cloud Adoption Frameworks

AWS offers best practice guidance developed from years of experience supporting customers in architecting their cloud environments with the AWS Well-Architected Framework and in evolving to realize business value from cloud technologies with the AWS Cloud Adoption Framework (AWS CAF). Understand the security posture of your AI, ML, and generative AI workloads by performing a Well-Architected Framework review. Reviews can be performed using tools like the AWS Well-Architected Tool, or with the help of your AWS team through AWS Enterprise Support. The AWS Well-Architected Tool automatically integrates insights from AWS Trusted Advisor to evaluate what best practices are in place and what opportunities exist to improve functionality and cost-optimization. The AWS Well-Architected Tool also offers customized lenses with specific best practices such as the Machine Learning Lens for you to regularly measure your architectures against best practices and identify areas for improvement. Checkpoint your journey on the path to value realization and cloud maturity by understanding how AWS customers adopt strategies to develop organizational capabilities in the AWS Cloud Adoption Framework for Artificial Intelligence, Machine Learning, and Generative AI. You might also find benefit in understanding your overall cloud readiness by participating in an AWS Cloud Readiness Assessment. AWS offers additional opportunities for engagement—ask your AWS account team for more information on how to get started with the Generative AI Innovation Center.

Accelerate your security and AI/ML learning with best practices guidance, training, and certification

AWS also curates recommendations from Best Practices for Security, Identity, & Compliance and AWS Security Documentation to help you identify ways to secure your training, development, testing, and operational environments. If you’re just getting started, dive deeper on security training and certification, consider starting with AWS Security Fundamentals and the AWS Security Learning Plan. You can also use the AWS Security Maturity Model to help guide you finding and prioritizing the best activities at different phases of maturity on AWS, starting with quick wins, through foundational, efficient, and optimized stages. After you and your teams have a basic understanding of security on AWS, we strongly recommend reviewing How to approach threat modeling and then leading a threat modeling exercise with your teams starting with the Threat Modeling For Builders Workshop training program. There are many other AWS Security training and certification resources available.

Apply a defense-in-depth approach to secure LLM applications

Applying a defense-in-depth security approach to your generative AI workloads, data, and information can help create the best conditions to achieve your business objectives. Defense-in-depth security best practices mitigate many of the common risks that any workload faces, helping you and your teams accelerate your generative AI innovation. A defense-in-depth security strategy uses multiple redundant defenses to protect your AWS accounts, workloads, data, and assets. It helps make sure that if any one security control is compromised or fails, additional layers exist to help isolate threats and prevent, detect, respond, and recover from security events. You can use a combination of strategies, including AWS services and solutions, at each layer to improve the security and resiliency of your generative AI workloads.

Diagram of defense-in-depth security layers

Many AWS customers align to industry standard frameworks, such as the NIST Cybersecurity Framework. This framework helps ensure that your security defenses have protection across the pillars of Identify, Protect, Detect, Respond, Recover, and most recently added, Govern. This framework can then easily map to AWS Security services and those from integrated third parties as well to help you validate adequate coverage and policies for any security event your organization encounters.

Diagram of defense-in-depth of AWS Security Services mapped to the NIST Cybersecurity Framework 2.0

Defense in depth: Secure your environment, then add enhanced AI/ML-specific security and privacy capabilities

A defense-in-depth strategy should start by protecting your accounts and organization first, and then layer on the additional built-in security and privacy enhanced features of services such as Amazon Bedrock and Amazon SageMaker. Amazon has over 30 services in the Security, Identity, and Compliance portfolio which are integrated with AWS AI/ML services, and can be used together to help secure your workloads, accounts, organization. To properly defend against the OWASP Top 10 for LLM, these should be used together with the AWS AI/ML services.

Start by implementing a policy of least privilege, using services like IAM Access Analyzer to look for overly permissive accounts, roles, and resources to restrict access using short-termed credentials. Next, make sure that all data at rest is encrypted with AWS KMS, including considering the use of CMKs, and all data and models are versioned and backed up using Amazon Simple Storage Service (Amazon S3) versioning and applying object-level immutability with Amazon S3 Object Lock. Protect all data in transit between services using AWS Certificate Manager and/or AWS Private CA, and keep it within VPCs using AWS PrivateLink. Define strict data ingress and egress rules to help protect against manipulation and exfiltration using VPCs with AWS Network Firewall policies. Consider inserting AWS Web Application Firewall (AWS WAF) in front to protect web applications and APIs from malicious bots, SQL injection attacks, cross-site scripting (XSS), and account takeovers with Fraud Control. Logging with AWS CloudTrail, Amazon Virtual Private Cloud (Amazon VPC) flow logs, and Amazon Elastic Kubernetes Service (Amazon EKS) audit logs will help provide forensic review of each transaction available to services such as Amazon Detective. You can use Amazon Inspector to automate vulnerability discovery and management for Amazon Elastic Compute Cloud (Amazon EC2) instances, containers, AWS Lambda functions, and identify the network reachability of your workloads. Protect your data and models from suspicious activity using Amazon GuardDuty’s ML-powered threat models and intelligence feeds, and enabling its additional features for EKS Protection, ECS Protection, S3 Protection, RDS Protection, Malware Protection, Lambda Protection, and more. You can use services like AWS Security Hub to centralize and automate your security checks to detect deviations from security best practices and accelerate investigation and automate remediation of security findings with playbooks. You can also consider implementing a zero trust architecture on AWS to further increase fine-grained authentication and authorization controls for what human users or machine-to-machine processes can access on a per-request basis. Also consider using Amazon Security Lake to automatically centralize security data from AWS environments, SaaS providers, on premises, and cloud sources into a purpose-built data lake stored in your account. With Security Lake, you can get a more complete understanding of your security data across your entire organization.

After your generative AI workload environment has been secured, you can layer in AI/ML-specific features, such as Amazon SageMaker Data Wrangler to identify potential bias during data preparation and Amazon SageMaker Clarify to detect bias in ML data and models. You can also use Amazon SageMaker Model Monitor to evaluate the quality of SageMaker ML models in production, and notify you when there is drift in data quality, model quality, and feature attribution. These AWS AI/ML services working together (including SageMaker working with Amazon Bedrock) with AWS Security services can help you identify potential sources of natural bias and protect against malicious data tampering. Repeat this process for each of the OWASP Top 10 for LLM vulnerabilities to ensure you’re maximizing the value of AWS services to implement defense in depth to protect your data and workloads.

As AWS Enterprise Strategist Clarke Rodgers wrote in his blog post “CISO Insight: Every AWS Service Is A Security Service”, “I would argue that virtually every service within the AWS cloud either enables a security outcome by itself, or can be used (alone or in conjunction with one or more services) by customers to achieve a security, risk, or compliance objective.” And “Customer Chief Information Security Officers (CISOs) (or their respective teams) may want to take the time to ensure that they are well versed with all AWS services because there may be a security, risk, or compliance objective that can be met, even if a service doesn’t fall into the ‘Security, Identity, and Compliance’ category.”

Layer defenses at trust boundaries in LLM applications

When developing generative AI-based systems and applications, you should consider the same concerns as with any other ML application, as mentioned in the MITRE ATLAS Machine Learning Threat Matrix, such as being mindful of software and data component origins (such as performing an open source software audit, reviewing software bill of materials (SBOMs), and analyzing data workflows and API integrations) and implementing necessary protections against LLM supply chain threats. Include insights from industry frameworks, and be aware of ways to use multiple sources of threat intelligence and risk information to adjust and extend your security defenses to account for AI, ML, and generative AI security risks that are emergent and not included in traditional frameworks. Seek out companion information on AI-specific risks from industry, defense, governmental, international, and academic sources, because new threats emerge and evolve in this space regularly and companion frameworks and guides are updated frequently. For example, when using a Retrieval Augmented Generation (RAG) model, if the model doesn’t include the data it needs, it may request it from an external data source for using during inferencing and fine-tuning. The source that it queries may be outside of your control, and can be a potential source of compromise in your supply chain. A defense-in-depth approach should be extended towards external sources to establish trust, authentication, authorization, access, security, privacy, and accuracy of the data it is accessing. To dive deeper, read “Build a secure enterprise application with Generative AI and RAG using Amazon SageMaker JumpStart

Analyze and mitigate risk in your LLM applications

In this section, we analyze and discuss some risk mitigation techniques based on trust boundaries and interactions, or distinct areas of the workload with similar appropriate controls scope and risk profile. In this sample architecture of a chatbot application, there are five trust boundaries where controls are demonstrated, based on how AWS customers commonly build their LLM applications. Your LLM application may have more or fewer definable trust boundaries. In the following sample architecture, these trust boundaries are defined as:

  1. User interface interactions (request and response)
  2. Application interactions
  3. Model interactions
  4. Data interactions
  5. Organizational interactions and use

Diagram of example workflow for securing an LLM-based application and it's integration points

User interface interactions: Develop request and response monitoring

Detect and respond to cyber incidents related to generative AI in a timely manner by evaluating a strategy to address risk from the inputs and outputs of the generative AI application. For example, additional monitoring for behaviors and data outflow may need to be instrumented to detect sensitive information disclosure outside your domain or organization, in the case that it is used in the LLM application.

Generative AI applications should still uphold the standard security best practices when it comes to protecting data. Establish a secure data perimeter and secure sensitive data stores. Encrypt data and information used for LLM applications at rest and in transit. Protect data used to train your model from training data poisoning by understanding and controlling which users, processes, and roles are allowed to contribute to the data stores, as well as how data flows in the application, monitor for bias deviations, and using versioning and immutable storage in storage services such as Amazon S3. Establish strict data ingress and egress controls using services like AWS Network Firewall and AWS VPCs to protect against suspicious input and the potential for data exfiltration.

During the training, retraining, or fine-tuning process, you should be aware of any sensitive data that is utilized. After data is used during one of these processes, you should plan for a scenario where any user of your model suddenly becomes able to extract the data or information back out by utilizing prompt injection techniques. Understand the risks and benefits of using sensitive data in your models and inferencing. Implement robust authentication and authorization mechanisms for establishing and managing fine-grained access permissions, which don’t rely on LLM application logic to prevent disclosure. User-controlled input to a generative AI application has been demonstrated under some conditions to be able to provide a vector to extract information from the model or any non-user-controlled parts of the input. This can occur via prompt injection, where the user provides input that causes the output of the model to deviate from the expected guardrails of the LLM application, including providing clues to the datasets that the model was originally trained on.

Implement user-level access quotas for users providing input and receiving output from a model. You should consider approaches that don’t allow anonymous access under conditions where the model training data and information is sensitive, or where there is risk from an adversary training a facsimile of your model based on their input and your aligned model output. In general, if part of the input to a model consists of arbitrary user-provided text, consider the output to be susceptible to prompt injection, and accordingly ensure use of the outputs includes implemented technical and organizational countermeasures to mitigate insecure output handling, excessive agency, and overreliance. In the example earlier related to filtering for malicious input using AWS WAF, consider building a filter in front of your application for such potential misuse of prompts, and develop a policy for how to handle and evolve those as your model and data grows. Also consider a filtered review of the output before it is returned to the user to ensure it meets quality, accuracy, or content moderation standards. You may want to further customize this for your organization’s needs with an additional layer of control on inputs and outputs in front of your models to mitigate suspicious traffic patterns.

Application interactions: Application security and observability

Review your LLM application with attention to how a user could utilize your model to bypass standard authorization to a downstream tool or toolchain that they don’t have authorization to access or use. Another concern at this layer involves accessing external data stores by using a model as an attack mechanism using unmitigated technical or organizational LLM risks. For example, if your model is trained to access certain data stores that could contain sensitive data, you should ensure that you have proper authorization checks between your model and the data stores. Use immutable attributes about users that don’t come from the model when performing authorization checks. Unmitigated insecure output handling, insecure plugin design, and excessive agency can create conditions where a threat actor may use a model to trick the authorization system into escalating effective privileges, leading to a downstream component believing the user is authorized to retrieve data or take a specific action.

When implementing any generative AI plugin or tool, it is imperative to examine and comprehend the level of access being granted, as well as scrutinize the access controls that have been configured. Using unmitigated insecure generative AI plugins may render your system susceptible to supply chain vulnerabilities and threats, potentially leading to malicious actions, including running remote code.

Model interactions: Model attack prevention

You should be aware of the origin of any models, plugins, tools, or data you use, in order to evaluate and mitigate against supply chain vulnerabilities. For example, some common model formats permit the embedding of arbitrary runnable code in the models themselves. Use package mirrors, scanning, and additional inspections as relevant to your organizations security goals.

The datasets you train and fine-tune your models on must also be reviewed. If you further automatically fine-tune a model based on user feedback (or other end-user-controllable information), you must consider if a malicious threat actor could change the model arbitrarily based on manipulating their responses and achieve training data poisoning.

Data interactions: Monitor data quality and usage

Generative AI models such as LLMs generally work well because they have been trained on a large amount of data. Although this data helps LLMs complete complex tasks, it also can expose your system to risk of training data poisoning, which occurs when inappropriate data is included or omitted inside a training dataset that can alter a model’s behavior. To mitigate this risk, you should look at your supply chain and understand the data review process for your system before it’s used inside your model. Although the training pipeline is a prime source for data poisoning, you should also look at how your model gets data, such as in a RAG model or data lake, and if the source of that data is trusted and protected. Use AWS Security services such as AWS Security Hub, Amazon GuardDuty, and Amazon Inspector to help continuously monitor for suspicious activity in Amazon EC2, Amazon EKS, Amazon S3, Amazon Relational Database Service (Amazon RDS), and network access that may be indicators of emerging threats, and use Detective to visualize security investigations. Also consider using services such as Amazon Security Lake to accelerate security investigations by creating a purpose-built data lake to automatically centralize security data from AWS environments, SaaS providers, on premises, and cloud sources which contribute to your AI/ML workloads.

Organizational interactions: Implement enterprise governance guardrails for generative AI

Identify risks associated with the use of generative AI for your businesses. You should build your organization’s risk taxonomy and conduct risk assessments to make informed decisions when deploying generative AI solutions. Develop a business continuity plan (BCP) that includes AI, ML, and generative AI workloads and that can be enacted quickly to replace the lost functionality of an impacted or offline LLM application to meet your SLAs.

Identify process and resource gaps, inefficiencies, and inconsistencies, and improve awareness and ownership across your business. Threat model all generative AI workloads to identify and mitigate potential security threats that may lead to business-impacting outcomes, including unauthorized access to data, denial of service, and resource misuse. Take advantage of the new AWS Threat Composer Modeling Tool to help reduce time-to-value when performing threat modeling. Later in your development cycles, consider including introducing security chaos engineering fault injection experiments to create real-world conditions to understand how your system will react to unknowns and build confidence in the system’s resiliency and security.

Include diverse perspectives in developing security strategies and risk management mechanisms to ensure adherence and coverage for AI/ML and generative security across all job roles and functions. Bring a security mindset to the table from the inception and research of any generative AI application to align on requirements. If you need extra assistance from AWS, ask your AWS account manager to make sure that there is equal support by requesting AWS Solutions Architects from AWS Security and AI/ML to help in tandem.

Ensure that your security organization routinely takes actions to foster communication around both risk awareness and risk management understanding among generative AI stakeholders such as product managers, software developers, data scientists, and executive leadership, allowing threat intelligence and controls guidance to reach the teams that may be impacted. Security organizations can support a culture of responsible disclosure and iterative improvement by participating in discussions and bringing new ideas and information to generative AI stakeholders that relate to their business objectives. Learn more about our commitment to Responsible AI and additional responsible AI resources to help our customers.

Gain advantage in enabling better organizational posture for generative AI by unblocking time to value in the existing security processes of your organization. Proactively evaluate where your organization may require processes that are overly burdensome given the generative AI security context and refine these to provide developers and scientists a clear path to launch with the correct controls in place.

Assess where there may be opportunities to align incentives, derisk, and provide a clear line of sight on the desired outcomes. Update controls guidance and defenses to meet the evolving needs of AI/ML and generative AI application development to reduce confusion and uncertainty that can cost development time, increase risk, and increase impact.

Ensure that stakeholders who are not security experts are able to both understand how organizational governance, policies, and risk management steps apply to their workloads, as well as apply risk management mechanisms. Prepare your organization to respond to realistic events and scenarios that may occur with generative AI applications, and ensure that generative AI builder roles and response teams are aware of escalation paths and actions in case of concern for any suspicious activity.

Conclusion

To successfully commercialize innovation with any new and emerging technology requires starting with a security-first mindset, building on a secure infrastructure foundation, and thinking about how to further integrate security at each level of the technology stack early with a defense-in-depth security approach. This includes interactions at multiple layers of your technology stack, and integration points within your digital supply chain, to ensure organizational resiliency. Although generative AI introduces some new security and privacy challenges, if you follow fundamental security best practices such as using defense-in-depth with layered security services, you can help protect your organization from many common issues and evolving threats. You should implement layered AWS Security services across your generative AI workloads and larger organization, and focus on integration points in your digital supply chains to secure your cloud environments. Then you can use the enhanced security and privacy capabilities in AWS AI/ML services such as Amazon SageMaker and Amazon Bedrock to add further layers of enhanced security and privacy controls to your generative AI applications. Embedding security from the start will make it faster, easier, and more cost-effective to innovate with generative AI, while simplifying compliance. This will help you increase controls, confidence, and observability to your generative AI applications for your employees, customers, partners, regulators, and other concerned stakeholders.

Additional references


About the authors

Christopher Rae is a Principal Worldwide Security GTM Specialist focused on developing and executing strategic initiatives that accelerate and scale adoption of AWS security services. He is passionate about the intersection of cybersecurity and emerging technologies, with 20+ years of experience in global strategic leadership roles delivering security solutions to media, entertainment, and telecom customers. He recharges through reading, traveling, food and wine, discovering new music, and advising early-stage startups.

Elijah Winter is a Senior Security Engineer in Amazon Security, holding a BS in Cyber Security Engineering and infused with a love for Harry Potter. Elijah excels in identifying and addressing vulnerabilities in AI systems, blending technical expertise with a touch of wizardry. Elijah designs tailored security protocols for AI ecosystems, bringing a magical flair to digital defenses. Integrity driven, Elijah has a security background in both public and commercial sector organizations focused on protecting trust.

Ram Vittal is a Principal ML Solutions Architect at AWS. He has over 3 decades of experience architecting and building distributed, hybrid, and cloud applications. He is passionate about building secure and scalable AI/ML and big data solutions to help enterprise customers with their cloud adoption and optimization journey to improve their business outcomes. In his spare time, he rides his motorcycle and walks with his 3-year-old Sheepadoodle!

Navneet Tuteja is a Data Specialist at Amazon Web Services. Before joining AWS, Navneet worked as a facilitator for organizations seeking to modernize their data architectures and implement comprehensive AI/ML solutions. She holds an engineering degree from Thapar University, as well as a master’s degree in statistics from Texas A&M University.

Emily Soward is a Data Scientist with AWS Professional Services. She holds a Master of Science with Distinction in Artificial Intelligence from the University of Edinburgh in Scotland, United Kingdom with emphasis on Natural Language Processing (NLP). Emily has served in applied scientific and engineering roles focused on AI-enabled product research and development, operational excellence, and governance for AI workloads running at organizations in the public and private sector. She contributes to customer guidance as an AWS Senior Speaker and recently, as an author for AWS Well-Architected in the Machine Learning Lens.

Read More