Distributed training and efficient scaling with the Amazon SageMaker Model Parallel and Data Parallel Libraries
There has been tremendous progress in the field of distributed deep learning for large language models (LLMs), especially after the release of ChatGPT in December 2022. LLMs continue to grow in size with billions or even trillions of parameters, and they often won't fit into a single accelerator device such as a GPU, or even a single node such as ml.p5.48xlarge, because of memory limitations. Customers training LLMs often must distribute their workload across hundreds or even thousands of GPUs. Enabling training at such scale remains a challenge, and training efficiently in such a large system is an equally important problem. Over the past few years, the distributed training community has introduced 3D parallelism (data parallelism, pipeline parallelism, and tensor parallelism) and other techniques (such as sequence parallelism and expert parallelism) to address these challenges.
In December 2023, Amazon announced the release of the SageMaker model parallel library 2.0 (SMP), which achieves state-of-the-art efficiency in large model training, together with the SageMaker distributed data parallelism library (SMDDP). This release is a significant update from 1.x: SMP is now integrated with open source PyTorch Fully Sharded Data Parallel (FSDP) APIs, which allows you to use a familiar interface when training large models, and is compatible with Transformer Engine (TE), unlocking tensor parallelism techniques alongside FSDP for the first time. To learn more about the release, refer to Amazon SageMaker model parallel library now accelerates PyTorch FSDP workloads by up to 20%.
In this post, we explore the performance benefits of Amazon SageMaker (including SMP and SMDDP), and show how you can use the library to train large models efficiently on SageMaker. We demonstrate the performance of SageMaker with benchmarks on ml.p4d.24xlarge clusters of up to 128 instances, using FSDP mixed precision with bfloat16 for the Llama 2 model. We start with a demonstration of near-linear scaling efficiencies for SageMaker, follow with an analysis of the contribution of each feature to optimal throughput, and end with efficient training at sequence lengths up to 32,768 through tensor parallelism.
Near-linear scaling with SageMaker
To reduce the overall training time for LLMs, preserving high throughput when scaling to large clusters (thousands of GPUs) is crucial given the inter-node communication overhead. In this post, we demonstrate robust, near-linear scaling efficiencies (obtained by varying the number of GPUs for a fixed total problem size) on p4d instances using both SMP and SMDDP.
In this section, we demonstrate SMP's near-linear scaling performance. Here we train Llama 2 models of various sizes (7B, 13B, and 70B parameters) using a fixed sequence length of 4,096, the SMDDP backend for collective communication, TE enabled, and a global batch size of 4 million tokens, on 16 to 128 p4d nodes. The following table summarizes our optimal configurations and training performance (model TFLOPs per second).
| Model size | Number of nodes | TFLOPs* | sdp* | tp* | offload* | Scaling efficiency |
| --- | --- | --- | --- | --- | --- | --- |
| 7B | 16 | 136.76 | 32 | 1 | N | 100.0% |
| 7B | 32 | 132.65 | 64 | 1 | N | 97.0% |
| 7B | 64 | 125.31 | 64 | 1 | N | 91.6% |
| 7B | 128 | 115.01 | 64 | 1 | N | 84.1% |
| 13B | 16 | 141.43 | 32 | 1 | Y | 100.0% |
| 13B | 32 | 139.46 | 256 | 1 | N | 98.6% |
| 13B | 64 | 132.17 | 128 | 1 | N | 93.5% |
| 13B | 128 | 120.75 | 128 | 1 | N | 85.4% |
| 70B | 32 | 154.33 | 256 | 1 | Y | 100.0% |
| 70B | 64 | 149.60 | 256 | 1 | N | 96.9% |
| 70B | 128 | 136.52 | 64 | 2 | N | 88.5% |
*At the given model size, sequence length, and number of nodes, we show the globally optimal throughput and configurations after exploring various sdp, tp, and activation offloading combinations.
The preceding table summarizes the optimal throughput numbers across sharded data parallel (sdp) degree (typically using FSDP hybrid sharding instead of full sharding, with more details in the next section), tensor parallel (tp) degree, and activation offloading settings, demonstrating near-linear scaling for SMP together with SMDDP. For example, for the Llama 2 7B model at sequence length 4,096, the library achieves scaling efficiencies of 97.0%, 91.6%, and 84.1% (relative to 16 nodes) at 32, 64, and 128 nodes, respectively. The scaling efficiencies are stable across different model sizes and increase slightly as the model size gets larger.
SMP and SMDDP also demonstrate similar scaling efficiencies for other sequence lengths such as 2,048 and 8,192.
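To make the configuration knobs above (sdp, tp, and activation offloading) concrete, the following is a minimal sketch of how such a training job could be launched with the SageMaker Python SDK. It is not the exact setup used for these benchmarks; the script name, dataset path, framework versions, and SMP parameter values are assumptions, and the parameter names should be verified against the SMP v2 documentation.

```python
import sagemaker
from sagemaker.pytorch import PyTorch

# Hedged sketch: launch an FSDP training script with SMP v2 and the SMDDP collectives.
# Inside the training script, the SMDDP backend is typically selected with
# `import smdistributed.dataparallel.torch.torch_smddp` followed by
# `torch.distributed.init_process_group(backend="smddp")`.
estimator = PyTorch(
    entry_point="train_llama2.py",           # hypothetical FSDP training script
    role=sagemaker.get_execution_role(),     # or an explicit IAM role ARN
    instance_type="ml.p4d.24xlarge",
    instance_count=32,
    framework_version="2.2",                 # assumed; use an SMP-supported PyTorch version
    py_version="py310",
    distribution={
        "torch_distributed": {"enabled": True},
        "smdistributed": {
            "modelparallel": {
                "enabled": True,
                "parameters": {
                    "hybrid_shard_degree": 256,        # sdp: size of each sharding group
                    "tensor_parallel_degree": 1,       # tp: raise for very long sequences
                    "sm_activation_offloading": False, # offload: consider at low node counts
                },
            }
        },
    },
)
estimator.fit({"train": "s3://my-bucket/llama2-tokenized/"})  # placeholder dataset location
```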
SageMaker model parallel library 2.0 performance: Llama 2 70B
Model sizes have continued to grow over the past years, along with frequent state-of-the-art performance updates in the LLM community. In this section, we illustrate performance in SageMaker for the Llama 2 model using a fixed model size 70B, sequence length of 4,096, and a global batch size of 4 million. To compare with the previous table’s globally optimal configuration and throughput (with SMDDP backend, typically FSDP hybrid sharding and TE), the following table extends to other optimal throughputs (potentially with tensor parallelism) with extra specifications on the distributed backend (NCCL and SMDDP), FSDP sharding strategies (full sharding and hybrid sharding), and enabling TE or not (default).
| Model size | Number of nodes | NCCL full sharding: #0 (TFLOPs) | SMDDP full sharding: #1 (TFLOPs) | SMDDP hybrid sharding: #2 (TFLOPs) | SMDDP hybrid sharding with TE: #3 (TFLOPs) | sdp* (#3 config) | tp* (#3 config) | offload* (#3 config) | #0 → #1 | #1 → #2 | #2 → #3 | #0 → #3 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 70B | 32 | 150.82 | 149.90 | 150.05 | 154.33 | 256 | 1 | Y | -0.6% | 0.1% | 2.9% | 2.3% |
| 70B | 64 | 144.38 | 144.38 | 145.42 | 149.60 | 256 | 1 | N | 0.0% | 0.7% | 2.9% | 3.6% |
| 70B | 128 | 68.53 | 103.06 | 130.66 | 136.52 | 64 | 2 | N | 50.4% | 26.8% | 4.5% | 99.2% |
*At the given model size, sequence length, and number of nodes, we show the globally optimal throughput and configuration after exploring various sdp, tp, and activation offloading combinations.
The latest release of SMP and SMDDP supports multiple features including native PyTorch FSDP, extended and more flexible hybrid sharding, transformer engine integration, tensor parallelism, and optimized all gather collective operation. To better understand how SageMaker achieves efficient distributed training for LLMs, we explore incremental contributions from SMDDP and the following SMP core features:
- SMDDP enhancement over NCCL with FSDP full sharding
- Replacing FSDP full sharding with hybrid sharding, which reduces communication cost to improve throughput
- A further boost to throughput with TE, even when tensor parallelism is disabled
- At lower resource settings, activation offloading, which can enable training that would otherwise be infeasible or very slow due to high memory pressure
FSDP full sharding: SMDDP enhancement over NCCL
As shown in the previous table, when models are fully sharded with FSDP, although NCCL (TFLOPs #0) and SMDDP (TFLOPs #1) throughputs are comparable at 32 or 64 nodes, there is a huge improvement of 50.4% from NCCL to SMDDP at 128 nodes.
At smaller model sizes, we observe consistent and significant improvements with SMDDP over NCCL, starting at smaller cluster sizes, because SMDDP is able to mitigate the communication bottleneck effectively.
FSDP hybrid sharding to reduce communication cost
In SMP 1.0, we launched sharded data parallelism, a distributed training technique powered by Amazon in-house MiCS technology. In SMP 2.0, we introduce SMP hybrid sharding, an extensible and more flexible hybrid sharding technique that allows models to be sharded among a subset of GPUs, instead of all training GPUs, which is the case for FSDP full sharding. It’s useful for medium-sized models that don’t need to be sharded across the entire cluster in order to satisfy per-GPU memory constraints. This leads to clusters having more than one model replica and each GPU communicating with fewer peers at runtime.
SMP's hybrid sharding enables efficient model sharding over a wider range, from the smallest shard degree that avoids out-of-memory issues up to the whole cluster size (which is equivalent to full sharding).
The following figure illustrates the throughput dependence on sdp at tp = 1 for simplicity. Although it’s not necessarily the same as the optimal tp value for NCCL or SMDDP full sharding in the previous table, the numbers are quite close. It clearly validates the value of switching from full sharding to hybrid sharding at a large cluster size of 128 nodes, which is applicable to both NCCL and SMDDP. For smaller model sizes, significant improvements with hybrid sharding start at smaller cluster sizes, and the difference keeps increasing with cluster size.
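For reference, the open source building block that SMP's hybrid sharding generalizes is PyTorch FSDP's HYBRID_SHARD strategy, which shards parameters within a node and replicates them across nodes. The following is a minimal sketch under assumed conditions (a toy model stands in for Llama 2, and SMP additionally lets you choose the shard degree freely rather than fixing it to the node size):

```python
import torch
import torch.distributed as dist
from torch import nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, ShardingStrategy


def build_hybrid_sharded_model() -> FSDP:
    # Assumes the script is launched with torchrun and runs one process per GPU.
    dist.init_process_group(backend="nccl")  # or "smddp" after importing the SMDDP package
    torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

    model = nn.Transformer(d_model=1024, num_encoder_layers=4).cuda()  # toy stand-in
    return FSDP(
        model,
        # Shard within a node and replicate across nodes, so each GPU communicates with
        # fewer peers than under ShardingStrategy.FULL_SHARD.
        sharding_strategy=ShardingStrategy.HYBRID_SHARD,
        device_id=torch.cuda.current_device(),
    )
```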
Improvements with TE
TE is designed to accelerate LLM training on NVIDIA GPUs. Although we don't use FP8 because it's unsupported on p4d instances, we still see significant speedup with TE on p4d.
On top of sharded data parallelism (MiCS) with the SMDDP backend, TE delivers a consistent throughput boost across all cluster sizes (the only exception is full sharding at 128 nodes), even when tensor parallelism is disabled (tensor parallel degree of 1).
For smaller model sizes or various sequence lengths, the TE boost is stable and non-trivial, in the range of approximately 3–7.6%.
Activation offloading at low resource settings
At low resource settings (given a small number of nodes), FSDP might experience high memory pressure (or even run out of memory in the worst case) when activation checkpointing is enabled. For such memory-bound scenarios, turning on activation offloading is an option that can improve performance.
For example, as we saw previously, although Llama 2 13B at sequence length 4,096 trains optimally with activation checkpointing and without activation offloading on at least 32 nodes, it achieves the best throughput with activation offloading when limited to 16 nodes.
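As an illustration of the offloading technique itself (SMP exposes it as a configuration parameter, so you normally wouldn't wire this up by hand), the following sketch uses the open source PyTorch checkpoint wrappers; it assumes a recent PyTorch version, and the layer class used in check_fn is a placeholder for your model's transformer block:

```python
from torch import nn
from torch.distributed.algorithms._checkpoint.checkpoint_wrapper import (
    apply_activation_checkpointing,
    checkpoint_wrapper,
    offload_wrapper,
)


def enable_activation_memory_savings(model: nn.Module, offload_to_cpu: bool = False) -> None:
    # Either recompute activations during the backward pass (checkpointing), or save them
    # to host memory instead of GPU memory (offloading) when memory pressure is the bottleneck.
    wrapper_fn = offload_wrapper if offload_to_cpu else checkpoint_wrapper
    apply_activation_checkpointing(
        model,
        checkpoint_wrapper_fn=wrapper_fn,
        # Placeholder: match your model's decoder/transformer block class here.
        check_fn=lambda module: isinstance(module, nn.TransformerDecoderLayer),
    )
```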
Enable training with long sequences: SMP tensor parallelism
Longer sequence lengths are desired for long conversations and context, and are getting more attention in the LLM community. Therefore, we report various long sequence throughputs in the following table. The table shows optimal throughputs for Llama 2 training on SageMaker, with various sequence lengths from 2,048 up to 32,768. At sequence length 32,768, native FSDP training is infeasible with 32 nodes at a global batch size of 4 million.
| Model size | Sequence length | Number of nodes | Native FSDP and NCCL (TFLOPs) | SMP and SMDDP (TFLOPs) | SMP improvement* |
| --- | --- | --- | --- | --- | --- |
| 7B | 2,048 | 32 | 129.25 | 138.17 | 6.9% |
| 7B | 4,096 | 32 | 124.38 | 132.65 | 6.6% |
| 7B | 8,192 | 32 | 115.25 | 123.11 | 6.8% |
| 7B | 16,384 | 32 | 100.73 | 109.11 | 8.3% |
| 7B | 32,768 | 32 | N.A. | 82.87 | N.A. |
| 13B | 2,048 | 32 | 137.75 | 144.28 | 4.7% |
| 13B | 4,096 | 32 | 133.30 | 139.46 | 4.6% |
| 13B | 8,192 | 32 | 125.04 | 130.08 | 4.0% |
| 13B | 16,384 | 32 | 111.58 | 117.01 | 4.9% |
| 13B | 32,768 | 32 | N.A. | 92.38 | N.A. |

*Across the rows above, the maximum SMP improvement is 8.3% and the median is 5.8%.
When the cluster size is large and the global batch size is fixed, some model training might be infeasible with native PyTorch FSDP, which lacks built-in pipeline or tensor parallelism support. In the preceding table, given a global batch size of 4 million tokens, 32 nodes, and sequence length 32,768, the effective batch size per GPU is about 0.5 samples (for example, tp = 2 with a batch size of 1), which would be infeasible without introducing tensor parallelism.
Conclusion
In this post, we demonstrated efficient LLM training with SMP and SMDDP on p4d instances, attributing the gains to several key features: the SMDDP enhancement over NCCL, flexible FSDP hybrid sharding instead of full sharding, TE integration, and tensor parallelism to enable long sequence lengths. After testing over a wide range of settings with various models, model sizes, and sequence lengths, the library exhibits robust, near-linear scaling efficiencies up to 128 p4d instances on SageMaker. In summary, SageMaker continues to be a powerful tool for LLM researchers and practitioners.
To learn more, refer to SageMaker model parallelism library v2, or contact the SMP team at sm-model-parallel-feedback@amazon.com.
Acknowledgements
We’d like to thank Robert Van Dusen, Ben Snyder, Gautam Kumar, and Luis Quintela for their constructive feedback and discussions.
About the Authors
Xinle Sheila Liu is an SDE in Amazon SageMaker. In her spare time, she enjoys reading and outdoor sports.
Suhit Kodgule is a Software Development Engineer with the AWS Artificial Intelligence group working on deep learning frameworks. In his spare time, he enjoys hiking, traveling, and cooking.
Victor Zhu is a Software Engineer in Distributed Deep Learning at Amazon Web Services. He can be found enjoying hiking and board games around the SF Bay Area.
Derya Cavdar works as a software engineer at AWS. Her interests include deep learning and distributed training optimization.
Teng Xu is a Software Development Engineer in the Distributed Training group in AWS AI. He enjoys reading.
Manage your Amazon Lex bot via AWS CloudFormation templates
Amazon Lex is a fully managed artificial intelligence (AI) service with advanced natural language models to design, build, test, and deploy conversational interfaces in applications. It employs advanced deep learning technologies to understand user input, enabling developers to create chatbots, virtual assistants, and other applications that can interact with users in natural language.
Managing your Amazon Lex bots using AWS CloudFormation allows you to create templates defining the bot and all the AWS resources it depends on. AWS CloudFormation provisions and configures those resources on your behalf, removing the risk of human error when deploying bots to new environments. The benefits of using CloudFormation include:
- Consistency – A CloudFormation template provides a more consistent and automated way to deploy and manage the resources associated with an Amazon Lex bot.
- Version control – With AWS CloudFormation, you can use version control systems like Git to manage your CloudFormation templates. This allows you to maintain different versions of your bot and roll back to previous versions if needed.
- Reusability – You can reuse CloudFormation templates across multiple environments, such as development, staging, and production. This saves time and effort in defining the same bot across different environments.
- Expandability – As your Amazon Lex bot grows in complexity, managing it through the AWS Management Console becomes more challenging. AWS CloudFormation allows for a more streamlined and efficient approach to managing the bot’s definition and resources.
- Automation – Using a CloudFormation template allows you to automate the deployment process. You can use AWS services like AWS CodePipeline and AWS CodeBuild to build, test, and deploy your Amazon Lex bot automatically.
In this post, we guide you through the steps involved in creating a CloudFormation template for an Amazon Lex V2 bot.
Solution overview
We have chosen the Book Trip bot as our starting point for this exercise. We use a CloudFormation template to create a new bot from scratch, including defining intents, slots, and other required components. Additionally, we explore topics such as version control, aliases, integrating AWS Lambda functions, creating conditional branches, and enabling logging.
Prerequisites
You should have the following prerequisites:
- An AWS account to create and deploy a CloudFormation template
- The necessary AWS Identity and Access Management (IAM) permissions to deploy AWS CloudFormation and the resources used in the template
- Basic knowledge of Amazon Lex, Lambda functions, and associated services
- Basic knowledge of creating and deploying CloudFormation templates
Create an IAM role
To begin, you need to create an IAM role that the bot will use. You can achieve this by initializing a CloudFormation template and adding the IAM role as a resource. You can use the following template to create the role. If you download the example template and deploy it, you should see that an IAM role has been created. We provide examples of templates as we go through this post and merge them as we get further along.
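The IAM role template itself isn't reproduced here. As a quick illustration of the deploy-and-update workflow used throughout this post, you could drive the stack with the AWS SDK for Python instead of the console; the stack name and file name below are placeholders:

```python
import boto3

cfn = boto3.client("cloudformation")

# Read the YAML template you are building in this post (hypothetical file name).
with open("lex-bot-template.yaml") as f:
    template_body = f.read()

cfn.create_stack(
    StackName="book-trip-bot-stack",
    TemplateBody=template_body,
    Capabilities=["CAPABILITY_NAMED_IAM"],  # required because the template creates an IAM role
)
cfn.get_waiter("stack_create_complete").wait(StackName="book-trip-bot-stack")

# For the later steps in this post, apply changes to the same stack with update_stack:
# cfn.update_stack(StackName="book-trip-bot-stack", TemplateBody=..., Capabilities=[...])
```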
Configure the Amazon Lex bot
Next, you need to add the bot definition. The following is the YAML template for the Amazon Lex bot definition; you construct the necessary components one by one:
To create a bot that only includes the bot definition without any intent, you can use the following template. Here, you specify the bot’s name, the ARN of the role that you previously created, data privacy settings, and more:
You can download the updated template. Deploying the updated template allows you to create both the role and the bot definition. Note that you’re updating the stack you created in the previous step.
The final step entails defining the BotLocales, which form the majority of the bot's functionality. This includes, for example, Intents and Slot types. The following is the YAML template:
In this case, you build the BookHotel intent, which requires a custom slot type for room types. You set the LocaleId, then the VoiceSettings. Then you add the SlotTypes and their corresponding values.
The next step is to define the Intents, starting with the first intent, BookHotel, which involves adding utterances, slots, and slot priorities. The details of these nodes are demonstrated in the provided template. Finally, you add the second intent, which is the FallbackIntent. See the following code:
You can download the CloudFormation template for the work done until now. After you update your stack with this template, a functional bot will be deployed. On the Amazon Lex console, you can confirm that there is a draft version of the bot, and a default alias named TestBotAlias has been created.
Create a new bot version and alias
Amazon Lex supports publishing versions of bots, intents, and slot types so that you can control your client applications’ implementation. A version is a numbered snapshot of your bot definition that you can publish for use in different parts of your workflow, such as development, beta deployment, and production. Amazon Lex bots also support aliases. An alias is a pointer to a specific version of a bot. With an alias, you can update your client applications’ version. In practical scenarios, bot aliases are used for blue/green deployments and managing environment-specific configurations like development and production environments.
To illustrate, let’s say you point an alias to version 1 of your bot. When it’s time to update the bot, you can publish version 2 and change the alias to point to the new version. Because your applications use the alias instead of a specific version, all clients receive the new functionality without requiring updates.
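As a hedged illustration of that pattern outside of CloudFormation, the following sketch publishes a new numbered version from the draft and repoints an existing alias with the AWS SDK for Python; the bot ID, alias ID, and locale are placeholders (in this post, the same effect is achieved by updating the template):

```python
import boto3

lex = boto3.client("lexv2-models")

# Publish a new numbered version from the draft of the bot (placeholder IDs).
version = lex.create_bot_version(
    botId="BOTID12345",
    botVersionLocaleSpecification={"en_US": {"sourceBotVersion": "DRAFT"}},
)["botVersion"]

# In practice, wait for the version to finish building before repointing the alias.
lex.update_bot_alias(
    botId="BOTID12345",
    botAliasId="ALIASID123",
    botAliasName="BookHotelDemoAlias",
    botVersion=version,
)
```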
Keep in mind that when you modify the CloudFormation template and initiate deployment, the changes are implemented within the draft version, primarily meant for testing. After you complete your testing phase, you can establish a new version to finalize the changes you’ve incorporated so far.
Next, you create a new bot version based on your draft, set up a new alias, and link the version to this alias. The following are the two new resources to add to your template:
You can download the new version of the template and deploy it by updating your stack. You can see on the Amazon Lex console that a new version is created and associated with a new alias called BookHotelDemoAlias.
When you create a new version of an Amazon Lex bot, it typically increments the version number sequentially, starting from 1. To discern a specific version, you can refer to its description.
Add a Lambda function
To initialize values or validate user input for your bot, you can add a Lambda function as a code hook to your bot. Similarly, you can use a Lambda function for fulfillment, for example to write data to databases or call APIs to save the collected information. For more information, refer to Enabling custom logic with AWS Lambda functions.
Let’s add a new resource for the Lambda function to the CloudFormation template. Although it’s generally not advised to embed code in CloudFormation templates, we do so here solely for the sake of making the demo deployment less complicated. See the following code:
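The template resource and its embedded function code aren't reproduced here. For orientation, the following is a minimal sketch of the kind of Lex V2 fulfillment handler such a function might contain; the confirmation message and intent handling are illustrative assumptions:

```python
# Minimal Lex V2 fulfillment handler sketch: close the conversation and mark the intent fulfilled.
def lambda_handler(event, context):
    intent = event["sessionState"]["intent"]
    intent["state"] = "Fulfilled"
    return {
        "sessionState": {
            "dialogAction": {"type": "Close"},
            "intent": intent,
        },
        "messages": [
            {
                "contentType": "PlainText",
                "content": "Thanks, your hotel booking request has been recorded.",
            }
        ],
    }
```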
To use this Lambda function for the fulfillment, enable the code hook settings in your intent:
Because you made changes to your bot, you can create a new version of the bot by adding a new resource named BookHotelVersionWithLambda in the template:
The Lambda function is associated with a bot alias. Amazon Lex V2 can use one Lambda function per bot alias per language. Therefore, you must update your alias in the template to add the Lambda function resource. You can do so in the BotAliasLocaleSettings section. You also need to point the alias to the new version you created. The following code is the modified alias configuration:
Up until now, you have only linked the Lambda function with the alias. However, you need to grant permission to allow the alias to invoke the Lambda function. In the following code, you add the Lambda invoke permission for Amazon Lex and specify the alias ARN as the source ARN:
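The AWS::Lambda::Permission resource from the template isn't shown here. As an illustrative equivalent, the same permission could be granted with the AWS SDK for Python; the function name, account ID, and ARNs below are placeholders:

```python
import boto3

# Allow the Lex V2 bot alias (the SourceArn) to invoke the fulfillment Lambda function.
boto3.client("lambda").add_permission(
    FunctionName="BookHotelFulfillmentFunction",
    StatementId="AllowLexV2AliasInvoke",
    Action="lambda:InvokeFunction",
    Principal="lexv2.amazonaws.com",
    SourceArn="arn:aws:lex:us-east-1:111122223333:bot-alias/BOTID12345/ALIASID123",
)
```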
You can download the latest version of the template. After updating your stack with this version, you will have an Amazon Lex bot integrated with a Lambda function.
Conditional branches
Now let's explore the conditional branch feature of the Amazon Lex bot and consider a scenario where booking more than five nights in Seattle is not allowed for the next week. As per the business requirement, the conversation should end with an appropriate message if the user attempts to book more than five nights in Seattle. The conditional branch for that is represented in the CloudFormation template under the SlotCaptureSetting:
Because you changed the bot definition, you need to create a new version in the template and link it with the alias. This is a temporary modification because the business plans to allow large bookings in Seattle soon. The following are the two new resources you add to the template:
You can download the updated template. After you update your stack with this template version, the alias will be directed to the version incorporating the conditional branching feature. To undo this modification, you can update the alias to revert back to the previous version.
Logs
You can also enable logs for your Amazon Lex bot. To do so, you must update the bot’s role to grant permissions for writing Amazon CloudWatch logs. The following is an example of adding a CloudWatch policy to the role:
To ensure consistent and predictable behavior, you should be as specific as possible when defining resource names and properties in CloudFormation templates. This is because the use of the wildcard character (*) in CloudFormation templates can pose potential security risks and lead to unintended consequences. Therefore, it’s recommended to avoid using wildcards and instead use explicit values wherever possible.
Next, you create a CloudWatch log group resource, as shown in the following code, to direct your logs to this group:
Finally, you update your alias to enable conversation log settings:
When you update the stack with this template, you enable the conversation logs for your bot. A new version is not created in this step because there are no changes to your bot resource. You can download the latest version of the template.
Clean up
To prevent incurring charges in the future, delete the CloudFormation stack you created.
Conclusion
In this post, we discussed the step-by-step process to create a CloudFormation template for an Amazon Lex V2 bot. Initially, we deployed a basic bot, then we explored the potential of aliases and versions and how to use them efficiently with templates. Next, we learned how to integrate a Lambda function with an Amazon Lex V2 bot and implemented conditional branching in the bot’s conversation flow to accommodate business requirements. Finally, we added logging features by creating a CloudWatch log group resource and updating the bot’s role with the necessary permissions.
The template allows for the straightforward deployment and management of the bot, with the ability to revert changes as necessary. Overall, the CloudFormation template is useful for managing and optimizing an Amazon Lex V2 bot.
As the next step, you can explore sample Amazon Lex bots and apply the techniques discussed in this post to convert them into CloudFormation templates. This hands-on practice will solidify your understanding of managing Amazon Lex V2 bots through infrastructure as code.
About the Authors
Thomas Rindfuss is a Sr. Solutions Architect on the Amazon Lex team. He invents, develops, prototypes, and evangelizes new technical features and solutions for Language AI services that improve the customer experience and ease adoption.
Rijeesh Akkambeth Chathoth is a Professional Services Consultant at AWS. He helps customers achieve their desired business outcomes in the contact center space by using Amazon Connect, Amazon Lex, and generative AI features.
A secure approach to generative AI with AWS
Generative artificial intelligence (AI) is transforming the customer experience in industries across the globe. Customers are building generative AI applications using large language models (LLMs) and other foundation models (FMs), which enhance customer experiences, transform operations, improve employee productivity, and create new revenue channels.
FMs and the applications built around them represent extremely valuable investments for our customers. They're often used with highly sensitive business data, like personal data, compliance data, operational data, and financial information, to optimize the model's output. The biggest concern we hear from customers as they explore the advantages of generative AI is how to protect their highly sensitive data and investments. Because their data and model weights are incredibly valuable, customers require them to stay protected, secure, and private, whether that's from their own administrators' accounts, their customers, vulnerabilities in software running in their own environments, or even their cloud service provider having access.
At AWS, our top priority is safeguarding the security and confidentiality of our customers’ workloads. We think about security across the three layers of our generative AI stack:
- Bottom layer – Provides the tools for building and training LLMs and other FMs
- Middle layer – Provides access to all the models along with tools you need to build and scale generative AI applications
- Top layer – Includes applications that use LLMs and other FMs to make work stress-free by writing and debugging code, generating content, deriving insights, and taking action
Each layer is important to making generative AI pervasive and transformative.
With the AWS Nitro System, we delivered a first-of-its-kind innovation on behalf of our customers. The Nitro System is an unparalleled computing backbone for AWS, with security and performance at its core. Its specialized hardware and associated firmware are designed to enforce restrictions so that nobody, including anyone in AWS, can access your workloads or data running on your Amazon Elastic Compute Cloud (Amazon EC2) instances. Customers have benefited from this confidentiality and isolation from AWS operators on all Nitro-based EC2 instances since 2017.
By design, there is no mechanism for any Amazon employee to access a Nitro EC2 instance that customers use to run their workloads, or to access data that customers send to a machine learning (ML) accelerator or GPU. This protection applies to all Nitro-based instances, including instances with ML accelerators like AWS Inferentia and AWS Trainium, and instances with GPUs like P4, P5, G5, and G6.
The Nitro System enables Elastic Fabric Adapter (EFA), which uses the AWS-built AWS Scalable Reliable Datagram (SRD) communication protocol for cloud-scale elastic and large-scale distributed training, enabling the only always-encrypted Remote Direct Memory Access (RDMA) capable network. All communication through EFA is encrypted with VPC encryption without incurring any performance penalty.
The design of the Nitro System has been validated by the NCC Group, an independent cybersecurity firm. AWS delivers a high level of protection for customer workloads, and we believe this is the level of security and confidentiality that customers should expect from their cloud provider. This level of protection is so critical that we’ve added it in our AWS Service Terms to provide an additional assurance to all of our customers.
Innovating secure generative AI workloads using AWS industry-leading security capabilities
From day one, AWS AI infrastructure and services have had built-in security and privacy features to give you control over your data. As customers move quickly to implement generative AI in their organizations, you need to know that your data is being handled securely across the AI lifecycle, including data preparation, training, and inferencing. The security of model weights—the parameters that a model learns during training that are critical for its ability to make predictions—is paramount to protecting your data and maintaining model integrity.
This is why it is critical for AWS to continue to innovate on behalf of our customers to raise the bar on security across each layer of the generative AI stack. To do this, we believe that you must have security and confidentiality built in across each layer of the generative AI stack. You need to be able to secure the infrastructure to train LLMs and other FMs, build securely with tools to run LLMs and other FMs, and run applications that use FMs with built-in security and privacy that you can trust.
At AWS, securing AI infrastructure refers to zero access to sensitive AI data, such as AI model weights and data processed with those models, by any unauthorized person, whether at the infrastructure operator or at the customer. It comprises three key principles:
- Complete isolation of the AI data from the infrastructure operator – The infrastructure operator must have no ability to access customer content and AI data, such as AI model weights and data processed with models.
- Ability for customers to isolate AI data from themselves – The infrastructure must provide a mechanism to allow model weights and data to be loaded into hardware, while remaining isolated and inaccessible from customers’ own users and software.
- Protected infrastructure communications – The communication between devices in the ML accelerator infrastructure must be protected. All externally accessible links between the devices must be encrypted.
The Nitro System fulfills the first principle of Secure AI Infrastructure by isolating your AI data from AWS operators. The second principle provides you with a way to remove administrative access of your own users and software to your AI data. AWS not only offers you a way to achieve that, but we also made it straightforward and practical by investing in building an integrated solution between AWS Nitro Enclaves and AWS Key Management Service (AWS KMS). With Nitro Enclaves and AWS KMS, you can encrypt your sensitive AI data using keys that you own and control, store that data in a location of your choice, and securely transfer the encrypted data to an isolated compute environment for inferencing. Throughout this entire process, the sensitive AI data is encrypted and isolated from your own users and software on your EC2 instance, and AWS operators cannot access this data. Use cases that have benefited from this flow include running LLM inferencing in an enclave. To date, Nitro Enclaves operate only on the CPU, limiting the potential for larger generative AI models and more complex processing.
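As a rough sketch of the client-side half of that pattern (the attestation-based decryption inside the enclave is not shown), you could envelope-encrypt model weights under a customer-managed KMS key before storing them; the key alias, file names, and bucket below are placeholders:

```python
import os

import boto3
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

# Generate a data key under a customer-managed KMS key; the plaintext key is used locally
# and the wrapped (encrypted) key is stored alongside the ciphertext.
kms = boto3.client("kms")
data_key = kms.generate_data_key(KeyId="alias/my-model-weights-key", KeySpec="AES_256")

# Encrypt the model weights file locally with AES-GCM before it ever leaves the instance.
nonce = os.urandom(12)
with open("model_weights.safetensors", "rb") as f:
    ciphertext = AESGCM(data_key["Plaintext"]).encrypt(nonce, f.read(), None)

# Store the encrypted weights in a location of your choice, keeping the wrapped key with them.
boto3.client("s3").put_object(
    Bucket="my-protected-weights-bucket",
    Key="llm/weights.enc",
    Body=nonce + ciphertext,
    Metadata={"wrapped-data-key": data_key["CiphertextBlob"].hex()},
)
```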
We announced our plans to extend this Nitro end-to-end encrypted flow to include first-class integration with ML accelerators and GPUs, fulfilling the third principle. You will be able to decrypt and load sensitive AI data into an ML accelerator for processing while providing isolation from your own operators and verified authenticity of the application used for processing the AI data. Through the Nitro System, you can cryptographically validate your applications to AWS KMS and decrypt data only when the necessary checks pass. This enhancement allows AWS to offer end-to-end encryption for your data as it flows through generative AI workloads.
We plan to offer this end-to-end encrypted flow in the upcoming AWS-designed Trainium2 as well as GPU instances based on NVIDIA's upcoming Blackwell architecture, both of which offer secure communications between devices, the third principle of Secure AI Infrastructure. AWS and NVIDIA are collaborating closely to bring a joint solution to market, including NVIDIA's new Blackwell GPU platform, which couples NVIDIA's GB200 NVL72 solution with the Nitro System and EFA technologies to provide an industry-leading solution for securely building and deploying next-generation generative AI applications.
Advancing the future of generative AI security
Today, tens of thousands of customers are using AWS to experiment with and move transformative generative AI applications into production. Generative AI workloads contain highly valuable and sensitive data that needs this level of protection from your own operators and the cloud service provider. Customers using AWS Nitro-based EC2 instances have received this level of protection and isolation from AWS operators since 2017, when we launched our innovative Nitro System.
At AWS, we're continuing that innovation as we invest in building performant and accessible capabilities to make it practical for our customers to secure their generative AI workloads across the three layers of the generative AI stack, so that you can focus on what you do best: building and extending the uses of generative AI to more areas. Learn more here.
About the authors
Anthony Liguori is an AWS VP and Distinguished Engineer for EC2
Colm MacCárthaigh is an AWS VP and Distinguished Engineer for EC2
Cost-effective document classification using the Amazon Titan Multimodal Embeddings Model
Organizations across industries want to categorize and extract insights from high volumes of documents of different formats. Manually processing these documents to classify and extract information remains expensive, error prone, and difficult to scale. Advances in generative artificial intelligence (AI) have given rise to intelligent document processing (IDP) solutions that can automate document classification and create a cost-effective classification layer capable of handling diverse, unstructured enterprise documents.
Categorizing documents is an important first step in IDP systems. It helps you determine the next set of actions to take depending on the type of document. For example, during the claims adjudication process, the accounts payable team receives the invoice, whereas the claims department manages the contract or policy documents. Traditional rule engines or ML-based classification can classify the documents, but often reach a limit on the document formats they support and don't allow the dynamic addition of new document classes. For more information, see Amazon Comprehend document classifier adds layout support for higher accuracy.
In this post, we discuss document classification using the Amazon Titan Multimodal Embeddings model to classify any document types without the need for training.
Amazon Titan Multimodal Embeddings
Amazon recently introduced Titan Multimodal Embeddings in Amazon Bedrock. This model can create embeddings for images and text, enabling the creation of document embeddings to be used in new document classification workflows.
It generates optimized vector representations of documents scanned as images. By encoding both visual and textual components into unified numerical vectors that encapsulate semantic meaning, it enables rapid indexing, powerful contextual search, and accurate classification of documents.
As new document templates and types emerge in business workflows, you can simply invoke the Amazon Bedrock API to dynamically vectorize them and append them to your IDP system to rapidly enhance its document classification capabilities.
Solution overview
Let’s examine the following document classification solution with the Amazon Titan Multimodal Embeddings model. For optimal performance, you should customize the solution to your specific use case and existing IDP pipeline setup.
This solution classifies documents using vector embedding semantic search by matching an input document to an already indexed gallery of documents. We use the following key components:
- Embeddings – Embeddings are numerical representations of real-world objects that machine learning (ML) and AI systems use to understand complex knowledge domains like humans do.
- Vector databases – Vector databases are used to store embeddings. Vector databases efficiently index and organize the embeddings, enabling fast retrieval of similar vectors based on distance metrics like Euclidean distance or cosine similarity.
- Semantic search – Semantic search works by considering the context and meaning of the input query and its relevance to the content being searched. Vector embeddings are an effective way to capture and retain the contextual meaning of text and images. In our solution, when an application wants to perform a semantic search, the search document is first converted into an embedding. The vector database with relevant content is then queried to find the most similar embeddings.
In the labeling process, a sample set of business documents like invoices, bank statements, or prescriptions are converted into embeddings using the Amazon Titan Multimodal Embeddings model and stored in a vector database against predefined labels. The Amazon Titan Multimodal Embeddings model was trained using the Euclidean L2 algorithm, so for best results the vector database you use should support this algorithm.
The following architecture diagram illustrates how you can use the Amazon Titan Multimodal Embeddings model with documents in an Amazon Simple Storage Service (Amazon S3) bucket for image gallery creation.
The workflow consists of the following steps:
- A user or application uploads a sample document image with classification metadata to a document image gallery. An S3 prefix or S3 object metadata can be used to classify gallery images.
- An Amazon S3 object notification event invokes the embedding AWS Lambda function.
- The Lambda function reads the document image and translates the image into embeddings by calling Amazon Bedrock and using the Amazon Titan Multimodal Embeddings model.
- Image embeddings, along with document classification, are stored in the vector database.
When a new document needs classification, the same embedding model is used to convert the query document into an embedding. Then, a semantic similarity search is performed on the vector database using the query embedding. The label retrieved against the top embedding match will be the classification label for the query document.
The following architecture diagram illustrates how to use the Amazon Titan Multimodal Embeddings model with documents in an S3 bucket for image classification.
The workflow consists of the following steps:
- Documents that require classification are uploaded to an input S3 bucket.
- The classification Lambda function receives the Amazon S3 object notification.
- The Lambda function translates the image to an embedding by calling the Amazon Bedrock API.
- The vector database is searched for a matching document using semantic search. Classification of the matching document is used to classify the input document.
- The input document is moved to the target S3 directory or prefix using the classification retrieved from the vector database search.
To help you test the solution with your own documents, we have created an example Python Jupyter notebook, which is available on GitHub.
Prerequisites
To run the notebook, you need an AWS account with appropriate AWS Identity and Access Management (IAM) permissions to call Amazon Bedrock. Additionally, on the Model access page of the Amazon Bedrock console, make sure that access is granted for the Amazon Titan Multimodal Embeddings model.
Implementation
In the following steps, replace each user input placeholder with your own information (a consolidated code sketch covering these steps follows the list):
- Create the vector database. In this solution, we use an in-memory FAISS database, but you could use an alternative vector database. Amazon Titan’s default dimension size is 1024.
- After the vector database is created, iterate over the sample documents, create embeddings for each, and store them in the vector database.
- Test with your documents. Replace the folders in the following code with your own folders that contain known document types:
- Using the Boto3 library, call Amazon Bedrock. The variable inputImageB64 is a base64-encoded byte array representing your document. The response from Amazon Bedrock contains the embeddings.
- Add the embeddings to the vector database, with a class ID that represents a known document type:
- With the vector database populated with images (representing our gallery), you can uncover similarities with new documents. For example, the following is the syntax used for search; k=1 tells FAISS to return the single best match.
In addition, the Euclidean L2 distance between the image on hand and the found image is also returned. If the image is an exact match, this value would be 0. The larger this value is, the less similar the images are.
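The notebook's code isn't reproduced in this post, so the following is a consolidated sketch, under assumed file paths and labels, of the steps above: create a FAISS index with L2 distance, embed gallery images through Amazon Bedrock, and classify a new document with a top-1 search. Verify the Titan Multimodal Embeddings model ID and request fields for your Region before relying on it.

```python
import base64
import json

import boto3
import faiss
import numpy as np

bedrock = boto3.client("bedrock-runtime")
EMBEDDING_DIM = 1024  # Amazon Titan Multimodal Embeddings default dimension


def embed_image(image_path: str) -> np.ndarray:
    """Embed a document image with the Amazon Titan Multimodal Embeddings model."""
    with open(image_path, "rb") as f:
        input_image_b64 = base64.b64encode(f.read()).decode("utf-8")
    response = bedrock.invoke_model(
        modelId="amazon.titan-embed-image-v1",  # assumed model ID; confirm in your Region
        body=json.dumps({"inputImage": input_image_b64}),
        contentType="application/json",
        accept="application/json",
    )
    embedding = json.loads(response["body"].read())["embedding"]
    return np.array(embedding, dtype=np.float32)


# Step 1: create an in-memory FAISS index using Euclidean (L2) distance.
index = faiss.IndexIDMap(faiss.IndexFlatL2(EMBEDDING_DIM))

# Step 2: index a small gallery of labeled sample documents (paths and labels are placeholders).
gallery = {0: "samples/invoice.png", 1: "samples/bank_statement.png"}  # class_id -> image path
labels = {0: "invoice", 1: "bank_statement"}
for class_id, path in gallery.items():
    index.add_with_ids(embed_image(path).reshape(1, -1), np.array([class_id], dtype=np.int64))

# Step 3: classify a new document by semantic similarity search (k=1 returns the top match).
query = embed_image("incoming/unknown_document.png").reshape(1, -1)
distances, ids = index.search(query, 1)
print(f"Predicted class: {labels[int(ids[0][0])]}, L2 distance: {distances[0][0]:.4f}")
```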
Additional considerations
In this section, we discuss additional considerations for using the solution effectively. This includes data privacy, security, integration with existing systems, and cost estimates.
Data privacy and security
The AWS shared responsibility model applies to data protection in Amazon Bedrock. As described in this model, AWS is responsible for protecting the global infrastructure that runs all of the AWS Cloud. Customers are responsible for maintaining control over their content that is hosted on this infrastructure. As a customer, you are responsible for the security configuration and management tasks for the AWS services that you use.
Data protection in Amazon Bedrock
Amazon Bedrock doesn't use customer prompts and continuations to train AWS models or share them with third parties. Amazon Bedrock doesn't store or log customer data in its service logs. Model providers don't have access to Amazon Bedrock logs or to customer prompts and continuations. As a result, the images used for generating embeddings through the Amazon Titan Multimodal Embeddings model are not stored, used to train AWS models, or distributed externally. Additionally, other usage data, such as timestamps and logged account IDs, is excluded from model training.
Integration with existing systems
The Amazon Titan Multimodal Embeddings model underwent training with the Euclidean L2 algorithm, so the vector database being used should be compatible with this algorithm.
Cost estimate
At the time of writing this post, as per Amazon Bedrock Pricing for the Amazon Titan Multimodal Embeddings model, the following are the estimated costs using on-demand pricing for this solution:
- One-time indexing cost – $0.06 for a single run of indexing, assuming a gallery of 1,000 images
- Classification cost – $6 for 100,000 input images per month
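These figures imply a per-image price of roughly $0.00006 at the time of writing; the following quick check reproduces them (consult current Amazon Bedrock pricing before relying on this):

```python
# Back-of-the-envelope check of the cost figures above, using the per-image price they imply.
price_per_image = 0.00006

indexing_cost = 1_000 * price_per_image                    # one-time gallery indexing
monthly_classification_cost = 100_000 * price_per_image    # classification volume per month

print(f"Indexing: ${indexing_cost:.2f}, classification: ${monthly_classification_cost:.2f}/month")
# Indexing: $0.06, classification: $6.00/month
```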
Clean up
To avoid incurring future charges, delete the resources you created, such as the Amazon SageMaker notebook instance, when not in use.
Conclusion
In this post, we explored how you can use the Amazon Titan Multimodal Embeddings model to build an inexpensive solution for document classification in the IDP workflow. We demonstrated how to create an image gallery of known documents and perform similarity searches with new documents to classify them. We also discussed the benefits of using multimodal image embeddings for document classification, including their ability to handle diverse document types, scalability, and low latency.
As new document templates and types emerge in business workflows, developers can invoke the Amazon Bedrock API to vectorize them dynamically and append to their IDP systems to rapidly enhance document classification capabilities. This creates an inexpensive, infinitely scalable classification layer that can handle even the most diverse, unstructured enterprise documents.
Overall, this post provides a roadmap for building an inexpensive solution for document classification in the IDP workflow using Amazon Titan Multimodal Embeddings.
As next steps, check out What is Amazon Bedrock to start using the service. And follow Amazon Bedrock on the AWS Machine Learning Blog to keep up to date with new capabilities and use cases for Amazon Bedrock.
About the Authors
Sumit Bhati is a Senior Customer Solutions Manager at AWS who specializes in expediting the cloud journey for enterprise customers. Sumit is dedicated to assisting customers through every phase of their cloud adoption, from accelerating migrations to modernizing workloads and facilitating the integration of innovative practices.
David Girling is a Senior AI/ML Solutions Architect with over 20 years of experience in designing, leading, and developing enterprise systems. David is part of a specialist team that focuses on helping customers learn, innovate, and utilize these highly capable services with their data for their use cases.
Ravi Avula is a Senior Solutions Architect in AWS focusing on Enterprise Architecture. Ravi has 20 years of experience in software engineering and has held several leadership roles in software engineering and software architecture working in the payments industry.
George Belsian is a Senior Cloud Application Architect at AWS. He is passionate about helping customers accelerate their modernization and cloud adoption journey. In his current role, George works alongside customer teams to strategize, architect, and develop innovative, scalable solutions.
A quick guide to Amazon’s 20+ papers at ICASSP 2024
This year's papers address topics such as speech enhancement, spoken-language understanding, dialogue, paralinguistics, and pitch estimation.
AWS at NVIDIA GTC 2024: Accelerate innovation with generative AI on AWS
AWS was delighted to present to and connect with over 18,000 in-person and 267,000 virtual attendees at NVIDIA GTC, a global artificial intelligence (AI) conference that took place in March 2024 in San Jose, California, returning to a hybrid, in-person experience for the first time since 2019.
AWS has had a long-standing collaboration with NVIDIA for over 13 years. AWS was the first Cloud Service Provider (CSP) to offer NVIDIA GPUs in the public cloud, and remains among the first to deploy NVIDIA’s latest technologies.
Looking back at AWS re:Invent 2023, Jensen Huang, founder and CEO of NVIDIA, chatted with AWS CEO Adam Selipsky on stage, discussing how NVIDIA and AWS are working together to enable millions of developers to access powerful technologies needed to rapidly innovate with generative AI. NVIDIA is known for its cutting-edge accelerators and full-stack solutions that contribute to advancements in AI. The company is combining this expertise with the highly scalable, reliable, and secure AWS Cloud infrastructure to help customers run advanced graphics, machine learning, and generative AI workloads at an accelerated pace.
The collaboration between AWS and NVIDIA further expanded at GTC 2024, with the CEOs from both companies sharing their perspectives on the collaboration and state of AI in a press release:
“The deep collaboration between our two organizations goes back more than 13 years, when together we launched the world’s first GPU cloud instance on AWS, and today we offer the widest range of NVIDIA GPU solutions for customers,” says Adam Selipsky, CEO of AWS. “NVIDIA’s next-generation Grace Blackwell processor marks a significant step forward in generative AI and GPU computing. When combined with AWS’s powerful Elastic Fabric Adapter networking, Amazon EC2 UltraClusters’ hyper-scale clustering, and our unique AWS Nitro System’s advanced virtualization and security capabilities, we make it possible for customers to build and run multi-trillion parameter large language models faster, at massive scale, and more securely than anywhere else. Together, we continue to innovate to make AWS the best place to run NVIDIA GPUs in the cloud.”
“AI is driving breakthroughs at an unprecedented pace, leading to new applications, business models, and innovation across industries,” says Jensen Huang, founder and CEO of NVIDIA. “Our collaboration with AWS is accelerating new generative AI capabilities and providing customers with unprecedented computing power to push the boundaries of what’s possible.”
Joint announcements and keynote
On the first day of the NVIDIA GTC, AWS and NVIDIA made a joint announcement focused on their strategic collaboration to advance generative AI. Huang included the AWS and NVIDIA collaboration on a slide during his keynote, highlighting the following announcements. The GTC keynote had over 21 million views within the first 72 hours.
- AWS will offer the new NVIDIA Blackwell platform as Amazon Elastic Compute Cloud (Amazon EC2) instances and NVIDIA DGX Cloud to accelerate performance of building and running inference on multi-trillion parameter large language models (LLMs). Blackwell’s secure AI capabilities integrated with the AWS Nitro System and AWS Key Management Service (AWS KMS) will provide customers end-to-end control of their training data and model weights.
- AWS will provide the cloud infrastructure for Project Ceiba, an AI supercomputer built exclusively on AWS with NVIDIA DGX Cloud, which will feature 20,736 NVIDIA GB200 Grace Blackwell Superchips capable of 414 exaflops for NVIDIA’s own AI R&D.
- The Amazon SageMaker integration with NVIDIA NIM inference microservices will help customers further optimize price-performance of foundation models running on GPUs. (To learn more, see Optimize price-performance of LLM inference on NVIDIA GPUs using the Amazon SageMaker integration with NVIDIA NIM Microservices.)
- AWS HealthOmics with the NVIDIA BioNeMo platform will accelerate generative AI in biology and drug discovery. (To learn more, refer to NVIDIA BioNeMo Expands Computer-Aided Drug Discovery With New Foundation Models, Protein language model training with NVIDIA BioNeMo framework on AWS ParallelCluster, and Find the Next Blockbuster with NVIDIA BioNeMo Framework on Amazon SageMaker.)
- Amazon Robotics and NVIDIA’s long-standing collaboration regarding innovations in advanced simulations was also highlighted.
Media coverage
By March 22, AWS's announcement with NVIDIA had generated 104 articles mentioning AWS and Amazon. The vast majority of coverage mentioned AWS's plans to offer Blackwell-based instances. Adam Selipsky appeared on CNBC's Mad Money to discuss the long-standing collaboration between AWS and NVIDIA and the many other ways AWS is innovating in generative AI, stating that AWS has been the first to bring many of NVIDIA's GPUs to the cloud to drive efficiency and scalability for customers.
Project Ceiba has also been a focus in media coverage. Forbes referred to Project Ceiba as the “most exciting” project by AWS and NVIDIA, stating that it “should accelerate the pace of innovation in AI, making it possible to tackle more complex problems, develop more sophisticated models, and achieve previously unattainable breakthroughs.” The Next Platform ran an in-depth piece on Ceiba, stating that “the size and the aggregate compute of Ceiba cluster are both being radically expanded, which will give AWS a very large supercomputer in one of its data centers” and NVIDIA will use it to do AI research, among other things.
Live from GTC
“Live from GTC” was an on-site studio at GTC for invited speakers to have a fireside chat with tech influencers like VentureBeat. Chetan Kapoor, Director of Product Management for Amazon EC2 at AWS, was interviewed by VentureBeat at the Live from GTC studio, where he discussed AWS’s presence and highlighted key announcements at GTC.
The AWS booth and sessions
The AWS booth showcased generative AI services, like LLMs from Anthropic and Cohere on Amazon Bedrock, PartyRock, Amazon Q, Amazon SageMaker JumpStart, and more. Highlights included:
- AWS AI Chess Robots – Two robotic arms playing chess against each other, with each move generated in the cloud with LLMs on Amazon Bedrock and powered by the NVIDIA Jetson platform and NVIDIA GPUs
- Wormhole – An alien robot from Media.Monks that held intelligent conversations with booth visitors, powered by NVIDIA and a serverless Retrieval Augmented Generation (RAG) model using Claude 3 on Amazon Bedrock, along with other AWS services, including SageMaker, Amazon Polly, and more
Additionally, AWS had 10 GTC sessions showcasing how the latest technologies from AWS and NVIDIA can drive business outcomes using generative AI. Some highlights include:
- How Genius Sports Transforms NFL Game Viewing with Accelerated Computing on AWS (Presented by Amazon Web Services)
- Accelerate Time to Train Your Largest Generative AI Models With SageMaker HyperPod (Presented by Amazon Web Services)
AWS presence with partners and customers
During GTC, AWS invited 23 partner and customer solution demos to join its booth with either a dedicated demo kiosk or a 30-minute in-booth session. Such partners and customers included Ansys, Anthropic, Articul8, Bria.ai, Cohere, Deci, Deepbrain.AI, Denali Advanced Integration, Ganit, Hugging Face, Lilt, Linker Vision, Mavenir, MCE, Media.Monks, Modular, NVIDIA, Perplexity, Quantiphi, Run.ai, Salesforce, Second Spectrum, and Slalom.
Among them, high-potential early-stage startups in generative AI across the globe were showcased with a dedicated kiosk at the AWS booth. The AWS Startups team works closely with these companies by investing and supporting their growth, offering resources through programs like AWS Activate.
AWS Generative AI Competency
NVIDIA was one of the 45 launch partners for the new AWS Generative AI Competency program. The Generative AI Center of Excellence for AWS Partners team members were on site at the AWS booth, presenting this program for both existing and potential AWS partners. The program offers valuable resources along with best practices for all AWS partners to build, market, and sell generative AI solutions jointly with AWS.
Additional resources
Watch a video recap of the AWS presence at NVIDIA GTC 2024. For additional resources about the AWS and NVIDIA collaboration, refer to the AWS at NVIDIA GTC 2024 resource hub.
About the Author
Julie Tang is the Senior Global Partner Marketing Manager for Generative AI at Amazon Web Services (AWS), where she collaborates closely with NVIDIA to plan and execute partner marketing initiatives focused on generative AI. Throughout her tenure at AWS, she has held various partner marketing roles, including Global IoT Solutions, AWS Partner Solution Factory, and Sr. Campaign Manager in Americas Field Marketing. Prior to AWS, Julie served as the Marketing Director at Segway. She holds a Master’s degree in Communications Management with a focus on marketing and entertainment management from the University of Southern California, and dual Bachelor’s degrees in Law and Broadcast Journalism from Fudan University.
Using Amazon web traffic to track the eclipse
An animation that projects traffic fluctuations onto the U.S. map offers an example of how the Supply Chain Optimization Technologies team uses data visualization to glean insights.
Build an active learning pipeline for automatic annotation of images with AWS services
This blog post is co-written with Caroline Chung from Veoneer.
Veoneer is a global automotive electronics company and a world leader in automotive electronic safety systems. They offer best-in-class restraint control systems and have delivered over 1 billion electronic control units and crash sensors to car manufacturers globally. The company continues to build on a 70-year history of automotive safety development, specializing in cutting-edge hardware and systems that prevent traffic incidents and mitigate accidents.
Automotive in-cabin sensing (ICS) is an emerging space that uses a combination of several types of sensors, such as cameras and radar, together with artificial intelligence (AI) and machine learning (ML) based algorithms to enhance safety and improve the riding experience. Building such a system can be a complex task. Developers have to manually annotate large volumes of images for training and testing purposes, which is very time-consuming and resource-intensive, with turnaround times of several weeks. Furthermore, companies have to deal with issues such as inconsistent labels due to human error.
AWS is focused on helping you increase your development speed and lower your costs for building such systems through advanced analytics like ML. Our vision is to use ML for automated annotation, enabling retraining of safety models and ensuring consistent and reliable performance metrics. In this post, we share how, by collaborating with Amazon’s Worldwide Specialist Organization and the Generative AI Innovation Center, we developed an active learning pipeline for in-cabin image head bounding box and key points annotation. The solution reduces annotation cost by over 90%, shortens the turnaround time from weeks to hours, and is reusable for similar ML data labeling tasks.
Solution overview
Active learning is an ML approach that involves an iterative process of selecting and annotating the most informative data to train a model. Given a small set of labeled data and a large set of unlabeled data, active learning improves model performance, reduces labeling effort, and integrates human expertise for robust results. In this post, we build an active learning pipeline for image annotations with AWS services.
The following diagram demonstrates the overall framework for our active learning pipeline. The labeling pipeline takes images from an Amazon Simple Storage Service (Amazon S3) bucket and outputs annotated images with the cooperation of ML models and human expertise. The training pipeline preprocesses data and uses them to train ML models. The initial model is set up and trained on a small set of manually labeled data, and will be used in the labeling pipeline. The labeling pipeline and training pipeline can be iterated gradually with more labeled data to enhance the model’s performance.
In the labeling pipeline, an Amazon S3 Event Notification is invoked when a new batch of images comes into the Unlabeled Datastore S3 bucket, activating the labeling pipeline. The model produces the inference results on the new images. A customized judgement function selects parts of the data based on the inference confidence score or other user-defined functions. This data, with its inference results, is sent for a human labeling job on Amazon SageMaker Ground Truth created by the pipeline. The human labeling process helps annotate the data, and the modified results are combined with the remaining auto annotated data, which can be used later by the training pipeline.
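As an illustration, a judgement function of this kind might look like the following minimal sketch; the confidence threshold, record structure, and routing logic are assumptions rather than the exact implementation used in this solution.

```python
# Minimal sketch of a confidence-based judgement function (illustrative assumptions only)
CONFIDENCE_THRESHOLD = 0.8  # assumed value; tune per task


def split_by_confidence(records, threshold=CONFIDENCE_THRESHOLD):
    """Route low-confidence predictions to human labeling; keep the rest as auto annotations."""
    auto_labeled, needs_human = [], []
    for record in records:
        scores = record.get("confidences", [])
        # Use the weakest prediction in the image as the deciding score
        min_score = min(scores) if scores else 0.0
        (auto_labeled if min_score >= threshold else needs_human).append(record)
    return auto_labeled, needs_human
```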
Model retraining happens in the training pipeline, where we use the dataset containing the human-labeled data to retrain the model. A manifest file is produced to describe where the files are stored, and the same initial model is retrained on the new data. After retraining, the new model replaces the initial model, and the next iteration of the active learning pipeline starts.
Model deployment
Both the labeling pipeline and the training pipeline are deployed on AWS CodePipeline. AWS CodeBuild is used for implementation, which is flexible and fast for small amounts of data. When more speed is needed, we use Amazon SageMaker endpoints backed by GPU instances to allocate more resources and accelerate the process.
The model retraining pipeline can be invoked when there is a new dataset or when the model’s performance needs improvement. One critical task in the retraining pipeline is to have a version control system for both the training data and the model. Although AWS services such as Amazon Rekognition have an integrated version control feature, which makes the pipeline straightforward to implement, customized models require metadata logging or additional version control tools.
The entire workflow is implemented using the AWS Cloud Development Kit (AWS CDK) to create the necessary AWS components (a minimal CDK sketch follows the list), including the following:
- Two roles for CodePipeline and SageMaker jobs
- Two CodePipeline jobs, which orchestrate the workflow
- Two S3 buckets for the code artifacts of the pipelines
- One S3 bucket for labeling the job manifest, datasets, and models
- Preprocessing and postprocessing AWS Lambda functions for the SageMaker Ground Truth labeling jobs
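As a rough illustration of how these pieces could be declared, the following AWS CDK (Python) sketch creates a subset of the resources above; construct names, the Lambda asset path, and the omitted CodePipeline wiring are assumptions, not the project’s actual stack.

```python
# Illustrative AWS CDK v2 (Python) sketch; resource names and paths are placeholders
from aws_cdk import Stack, aws_iam as iam, aws_lambda as lambda_, aws_s3 as s3
from constructs import Construct


class ActiveLearningStack(Stack):
    def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)

        # Buckets for pipeline code artifacts and for manifests, datasets, and models
        artifact_bucket = s3.Bucket(self, "PipelineArtifacts")
        data_bucket = s3.Bucket(self, "LabelingData")

        # Role assumed by SageMaker training jobs and Ground Truth labeling jobs
        sagemaker_role = iam.Role(
            self, "SageMakerRole",
            assumed_by=iam.ServicePrincipal("sagemaker.amazonaws.com"),
        )
        data_bucket.grant_read_write(sagemaker_role)

        # Pre/post-processing Lambda functions for the Ground Truth labeling job
        pre_annotation_fn = lambda_.Function(
            self, "GroundTruthPreProcess",
            runtime=lambda_.Runtime.PYTHON_3_9,
            handler="pre.handler",
            code=lambda_.Code.from_asset("lambda/pre"),  # placeholder asset path
        )
        # CodePipeline definitions for the labeling and training pipelines are omitted here
```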
The AWS CDK stacks are highly modularized and reusable across different tasks. The training, inference code, and SageMaker Ground Truth template can be replaced for any similar active learning scenarios.
Model training
Model training includes two tasks: head bounding box annotation and human key points annotation. We introduce them both in this section.
Head bounding box annotation
Head bounding box annotation is a task to predict the location of a bounding box of the human head in an image. We use an Amazon Rekognition Custom Labels model for head bounding box annotations. The following sample notebook provides a step-by-step tutorial on how to train a Rekognition Custom Labels model via SageMaker.
We first need to prepare the data to start the training. We generate a manifest file for the training and a manifest file for the test dataset. A manifest file contains multiple items, each of which is for an image. The following is an example of the manifest file, which includes the image path, size, and annotation information:
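The original example is not reproduced here; as an illustration, a single manifest entry (one JSON line per image) in the Ground Truth object-detection format might be generated as follows, with the bucket name, coordinates, and job name as placeholders.

```python
import json

# Illustrative manifest entry; paths, sizes, and coordinates are placeholders
entry = {
    "source-ref": "s3://my-bucket/images/frame_0001.jpg",
    "bounding-box": {
        "image_size": [{"width": 640, "height": 480, "depth": 3}],
        "annotations": [{"class_id": 0, "left": 210, "top": 95, "width": 130, "height": 140}],
    },
    "bounding-box-metadata": {
        "class-map": {"0": "head"},
        "objects": [{"confidence": 1}],
        "type": "groundtruth/object-detection",
        "human-annotated": "yes",
        "creation-date": "2023-01-01T00:00:00",
        "job-name": "head-bbox-labeling",
    },
}

# Append the entry as one line of the training manifest
with open("train.manifest", "a") as f:
    f.write(json.dumps(entry) + "\n")
```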
Using the manifest files, we can load datasets into a Rekognition Custom Labels model for training and testing. We iterated the model with different amounts of training data and tested it on the same 239 unseen images. In this test, the mAP_50 score increased from 0.33 with 114 training images to 0.95 with 957 training images. The following screenshot shows the performance metrics of the final Rekognition Custom Labels model, which yields strong performance in terms of F1 score, precision, and recall.
We further tested the model on a withheld dataset of 1,128 images. The model consistently produces accurate bounding boxes on the unseen data, yielding a high mAP_50 of 94.9%. The following example shows an auto-annotated image with a head bounding box.
Key points annotation
Key points annotation produces locations of key points, including eyes, ears, nose, mouth, neck, shoulders, elbows, wrists, hips, and ankles. In addition to predicting each point’s location, this task also requires predicting its visibility, for which we designed a novel method.
For key points annotation, we use a YOLOv8 pose model on SageMaker as the initial model. We first prepare the data for training, including generating label files and a configuration .yaml file following YOLO’s requirements. After preparing the data, we train the model and save the artifacts, including the model weights file. With the trained model weights file, we can annotate new images.
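As a rough sketch, fine-tuning and running the pose model with the Ultralytics package could look like the following; the dataset configuration file, hyperparameters, and image path are assumptions.

```python
# Minimal YOLOv8 pose fine-tuning and inference sketch (Ultralytics API); values are illustrative
from ultralytics import YOLO

# Start from a pretrained pose checkpoint and fine-tune on the in-cabin dataset
model = YOLO("yolov8n-pose.pt")
model.train(data="incabin_pose.yaml", epochs=100, imgsz=640)  # placeholder dataset config

# Annotate a new image with the trained weights
results = model.predict("frame_0001.jpg", conf=0.25)
for result in results:
    print(result.keypoints.xy)    # predicted key point locations
    print(result.keypoints.conf)  # per-point confidence scores
```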
In the training stage, all labeled points with locations, including visible points and occluded points, are used for training. Therefore, this model provides the location and confidence of each prediction by default. As shown in the following figure, a confidence threshold (the main threshold) near 0.6 separates points that are visible or occluded from points outside the camera’s view. However, occluded and visible points are not separated by this confidence, which means the predicted confidence alone is not useful for predicting visibility.
To get the visibility prediction, we introduce an additional model trained on a dataset containing only visible points, excluding both occluded points and points outside the camera’s view. The following figure shows the distribution of points with different visibility. Visible points and other points can be separated by the additional model using a threshold (the additional threshold) near 0.6. By combining these two models, we design a method to predict both location and visibility.
A key point is first predicted by the main model, which provides its location and main confidence; we then obtain an additional confidence prediction from the additional model. The point’s visibility is classified as follows (a minimal sketch of this rule appears after the list):
- Visible, if its main confidence is greater than its main threshold, and its additional confidence is greater than the additional threshold
- Occluded, if its main confidence is greater than its main threshold, and its additional confidence is less than or equal to the additional threshold
- Outside of the camera’s view, otherwise
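The rule above can be expressed in a few lines; in the following minimal sketch, the two thresholds near 0.6 follow the earlier discussion, and the function signature is illustrative.

```python
# Sketch of the visibility rule described above; thresholds follow the discussion (near 0.6)
MAIN_THRESHOLD = 0.6
ADDITIONAL_THRESHOLD = 0.6


def classify_visibility(main_conf: float, additional_conf: float) -> str:
    """Classify a predicted key point as visible, occluded, or outside the camera's view."""
    if main_conf > MAIN_THRESHOLD and additional_conf > ADDITIONAL_THRESHOLD:
        return "visible"
    if main_conf > MAIN_THRESHOLD:
        return "occluded"
    return "outside"
```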
An example of key points annotation is shown in the following image, where solid marks are visible points and hollow marks are occluded points. Points outside of the camera’s view are not shown.
Based on the standard OKS definition on the MS-COCO dataset, our method is able to achieve mAP_50 of 98.4% on the unseen test dataset. In terms of visibility, the method yields a 79.2% classification accuracy on the same dataset.
Human labeling and retraining
Although the models achieve strong performance on test data, they can still make mistakes on new real-world data. Human labeling is the process of correcting these mistakes to enhance model performance through retraining. We designed a judgement function that combines the confidence values output by the ML models for all head bounding boxes or key points into a final score. We use this score to identify poorly labeled images, which are sent to the human labeling process.
In addition to poorly labeled images, a small portion of images is randomly chosen for human labeling. These human-labeled images are added to the current version of the training set for retraining, enhancing model performance and overall annotation accuracy.
In the implementation, we use SageMaker Ground Truth for the human labeling process. SageMaker Ground Truth provides a user-friendly and intuitive UI for data labeling. The following screenshot demonstrates a SageMaker Ground Truth labeling job for head bounding box annotation.
The following screenshot demonstrates a SageMaker Ground Truth labeling job for key points annotation.
Cost, speed, and reusability
Cost and speed are the key advantages of our solution compared to human labeling, as shown in the following tables. Using the GPU-accelerated SageMaker instance ml.g4dn.xlarge, the total training and inference cost for 100,000 images is 99% lower than the cost of human labeling, and the process is 10–10,000 times faster, depending on the task.
The first table summarizes the cost performance metrics.
Model | mAP_50 (based on 1,128 test images) | Training cost (100,000 images) | Inference cost (100,000 images) | Cost reduction compared to human annotation | Inference time (100,000 images) | Time acceleration compared to human annotation |
Rekognition head bounding box | 0.949 | $4 | $22 | 99% less | 5.5 hours | Days faster |
YOLOv8 key points | 0.984 | $27.20 | $10 | 99.9% less | 5 minutes | Weeks faster |
The following table summarizes performance metrics.
Annotation Task | mAP_50 (%) | Training Cost ($) | Inference Cost ($) | Inference Time |
Head Bounding Box | 94.9 | 4 | 22 | 5.5 hours |
Key Points | 98.4 | 27 | 10 | 5 minutes |
Moreover, our solution is reusable for similar tasks. Camera perception development for other systems, such as advanced driver assistance systems (ADAS) and other in-cabin systems, can also adopt our solution.
Summary
In this post, we showed how to build an active learning pipeline for automatic annotation of in-cabin images using AWS services. We demonstrated the power of ML, which enables you to automate and expedite the annotation process, and the flexibility of a framework that uses models either supported by AWS services or customized on SageMaker. With Amazon S3, SageMaker, Lambda, and SageMaker Ground Truth, you can streamline data storage, annotation, training, and deployment, and achieve reusability while reducing costs significantly. By implementing this solution, automotive companies can become more agile and cost-efficient by using ML-based advanced analytics such as automated image annotation.
Get started today and unlock the power of AWS services and machine learning for your automotive in-cabin sensing use cases!
About the Authors
Yanxiang Yu is an Applied Scientist at the Amazon Generative AI Innovation Center. With over 9 years of experience building AI and machine learning solutions for industrial applications, he specializes in generative AI, computer vision, and time series modeling.
Tianyi Mao is an Applied Scientist at AWS based out of the Chicago area. He has 5+ years of experience in building machine learning and deep learning solutions and focuses on computer vision and reinforcement learning with human feedback. He enjoys working with customers to understand their challenges and solve them by creating innovative solutions using AWS services.
Yanru Xiao is an Applied Scientist at the Amazon Generative AI Innovation Center, where he builds AI/ML solutions for customers’ real-world business problems. He has worked in several fields, including manufacturing, energy, and agriculture. Yanru obtained his Ph.D. in Computer Science from Old Dominion University.
Paul George is an accomplished product leader with over 15 years of experience in automotive technologies. He is adept at leading product management, strategy, Go-to-Market and systems engineering teams. He has incubated and launched several new sensing and perception products globally. At AWS, he is leading strategy and go-to-market for autonomous vehicle workloads.
Caroline Chung is an engineering manager at Veoneer (acquired by Magna International) with over 14 years of experience developing sensing and perception systems. She currently leads interior sensing pre-development programs at Magna International, managing a team of computer vision engineers and data scientists.
Knowledge Bases for Amazon Bedrock now supports custom prompts for the RetrieveAndGenerate API and configuration of the maximum number of retrieved results
With Knowledge Bases for Amazon Bedrock, you can securely connect foundation models (FMs) in Amazon Bedrock to your company data for Retrieval Augmented Generation (RAG). Access to additional data helps the model generate more relevant, context-specific, and accurate responses without retraining the FMs.
In this post, we discuss two new features of Knowledge Bases for Amazon Bedrock specific to the RetrieveAndGenerate API: configuring the maximum number of results and creating custom prompts with a knowledge base prompt template. You can now choose these as query options alongside the search type.
Overview and benefits of new features
The maximum number of results option gives you control over the number of search results retrieved from the vector store and passed to the FM for generating the answer. This lets you customize the amount of background information provided for generation, giving more context for complex questions and less for simpler ones. You can fetch up to 100 results. This option helps improve the likelihood of retrieving relevant context, thereby improving accuracy and reducing hallucination in the generated response.
The custom knowledge base prompt template allows you to replace the default prompt template with your own to customize the prompt that’s sent to the model for response generation. This allows you to customize the tone, output format, and behavior of the FM when it responds to a user’s question. With this option, you can fine-tune terminology to better match your industry or domain (such as healthcare or legal). Additionally, you can add custom instructions and examples tailored to your specific workflows.
In the following sections, we explain how you can use these features with either the AWS Management Console or SDK.
Prerequisites
To follow along with these examples, you need to have an existing knowledge base. For instructions to create one, see Create a knowledge base.
Configure the maximum number of results using the console
To use the maximum number of results option using the console, complete the following steps:
- On the Amazon Bedrock console, choose Knowledge bases in the left navigation pane.
- Select the knowledge base you created.
- Choose Test knowledge base.
- Choose the configuration icon.
- Choose Sync data source before you start testing your knowledge base.
- Under Configurations, for Search Type, select a search type based on your use case.
For this post, we use hybrid search because it combines semantic and text search to provide greater accuracy. To learn more about hybrid search, see Knowledge Bases for Amazon Bedrock now supports hybrid search.
- Expand Maximum number of source chunks and set your maximum number of results.
To demonstrate the value of the new feature, we show examples of how you can increase the accuracy of the generated response. We used Amazon’s 10-K filing for 2023 as the source data for creating the knowledge base. We use the following query for experimentation: “In what year did Amazon’s annual revenue increase from $245B to $434B?”
The correct response for this query is “Amazon’s annual revenue increased from $245B in 2019 to $434B in 2022,” based on the documents in the knowledge base. We used Claude v2 as the FM to generate the final response based on the contextual information retrieved from the knowledge base. Claude 3 Sonnet and Claude 3 Haiku are also supported as the generation FMs.
We ran another query to demonstrate the comparison of retrieval with different configurations. We used the same input query (“In what year did Amazon’s annual revenue increase from $245B to $434B?”) and set the maximum number of results to 5.
As shown in the following screenshot, the generated response was “Sorry, I am unable to assist you with this request.”
Next, we set the maximum results to 12 and ask the same question. The generated response is “Amazon’s annual revenue increase from $245B in 2019 to $434B in 2022.”
As shown in this example, we are able to retrieve the correct answer based on the number of retrieved results. If you want to learn more about the source attribution that constitutes the final output, choose Show source details to validate the generated answer based on the knowledge base.
Customize a knowledge base prompt template using the console
You can also customize the default prompt with your own prompt based on the use case. To do so on the console, complete the following steps:
- Repeat the steps in the previous section to start testing your knowledge base.
- Enable Generate responses.
- Select the model of your choice for response generation.
We use the Claude v2 model as an example in this post. The Claude 3 Sonnet and Claude 3 Haiku models are also available for generation.
- Choose Apply to proceed.
After you choose the model, a new section called Knowledge base prompt template appears under Configurations.
- Choose Edit to start customizing the prompt.
- Adjust the prompt template to customize how you want to use the retrieved results and generate content.
For this post, we gave a few examples for creating a “Financial Advisor AI system” using Amazon financial reports with custom prompts. For best practices on prompt engineering, refer to Prompt engineering guidelines.
We now customize the default prompt template in several different ways, and observe the responses.
Let’s first try a query with the default prompt. We ask “What was the Amazon’s revenue in 2019 and 2021?” The following shows our results.
From the output, we can see that the model generates a free-form response based on the retrieved knowledge. The citations are also listed for reference.
Let’s say we want to give extra instructions on how to format the generated response, like standardizing it as JSON. We can add these instructions as a separate step after retrieving the information, as part of the prompt template:
The final response has the required structure.
By customizing the prompt, you can also change the language of the generated response. In the following example, we instruct the model to provide an answer in Spanish.
After removing $output_format_instructions$ from the default prompt, the citation is removed from the generated response.
In the following sections, we explain how you can use these features with the SDK.
Configure the maximum number of results using the SDK
To change the maximum number of results with the SDK, use the following syntax. For this example, the query is “In what year did Amazon’s annual revenue increase from $245B to $434B?” The correct response is “Amazon’s annual revenue increased from $245B in 2019 to $434B in 2022.”
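The following is a minimal boto3 sketch of such a call; the knowledge base ID, model ARN, and Region are placeholders to replace with your own values.

```python
import boto3

# Placeholders: supply your own knowledge base ID and model ARN
KB_ID = "YOUR_KB_ID"
MODEL_ARN = "arn:aws:bedrock:us-east-1::foundation-model/anthropic.claude-v2"

client = boto3.client("bedrock-agent-runtime", region_name="us-east-1")

response = client.retrieve_and_generate(
    input={"text": "In what year did Amazon's annual revenue increase from $245B to $434B?"},
    retrieveAndGenerateConfiguration={
        "type": "KNOWLEDGE_BASE",
        "knowledgeBaseConfiguration": {
            "knowledgeBaseId": KB_ID,
            "modelArn": MODEL_ARN,
            "retrievalConfiguration": {
                "vectorSearchConfiguration": {
                    "numberOfResults": 5  # try 5, then 12, to compare the answers
                }
            },
        },
    },
)

print(response["output"]["text"])  # generated answer
print(response["citations"])       # source attribution and retrieved text chunks
```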
The numberOfResults option under retrievalConfiguration allows you to select the number of results you want to retrieve. The output of the RetrieveAndGenerate API includes the generated response, source attribution, and the retrieved text chunks.
The following are the results for different values of the numberOfResults parameter. First, we set numberOfResults = 5.
Then we set numberOfResults = 12.
Customize the knowledge base prompt template using the SDK
To customize the prompt using the SDK, we use the following query with different prompt templates. For this example, the query is “What was the Amazon’s revenue in 2019 and 2021?”
The following is the default prompt template:
The following is the customized prompt template:
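The exact templates are not reproduced here; the following illustrative custom template (expressed as a Python string for use with the SDK) adds output-format instructions while keeping the $search_results$ and $output_format_instructions$ placeholders that the service substitutes at request time.

```python
# Illustrative custom prompt template; not the service's default text
CUSTOM_PROMPT_TEMPLATE = """
You are a financial advisor AI system that answers questions using only the
provided search results. If the search results do not contain the answer,
say that you don't know.

Here are the search results:
$search_results$

Format the final answer as a JSON object with the keys "answer" and "sources".

$output_format_instructions$
"""
```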
With the default prompt template, we get the following response:
If you want to provide additional instructions around the output format of the response generation, like standardizing the response in a specific format (like JSON), you can customize the existing prompt by providing more guidance. With our custom prompt template, we get the following response.
The promptTemplate option in generationConfiguration allows you to customize the prompt for better control over answer generation.
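Continuing the earlier boto3 sketch (reusing client, KB_ID, MODEL_ARN, and CUSTOM_PROMPT_TEMPLATE), the custom template can be passed through generationConfiguration as follows; identifiers remain placeholders.

```python
# Pass the custom template via generationConfiguration (sketch; reuses earlier placeholders)
response = client.retrieve_and_generate(
    input={"text": "What was the Amazon's revenue in 2019 and 2021?"},
    retrieveAndGenerateConfiguration={
        "type": "KNOWLEDGE_BASE",
        "knowledgeBaseConfiguration": {
            "knowledgeBaseId": KB_ID,
            "modelArn": MODEL_ARN,
            "generationConfiguration": {
                "promptTemplate": {"textPromptTemplate": CUSTOM_PROMPT_TEMPLATE}
            },
        },
    },
)
print(response["output"]["text"])
```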
Conclusion
In this post, we introduced two new features in Knowledge Bases for Amazon Bedrock: adjusting the maximum number of search results and customizing the default prompt template for the RetrieveAndGenerate
API. We demonstrated how to configure these features on the console and via SDK to improve performance and accuracy of the generated response. Increasing the maximum results provides more comprehensive information, whereas customizing the prompt template allows you to fine-tune instructions for the foundation model to better align with specific use cases. These enhancements offer greater flexibility and control, enabling you to deliver tailored experiences for RAG-based applications.
For additional resources to start implementing in your AWS environment, refer to the following:
- User guide: Knowledge bases for Amazon Bedrock
- YouTube video: Use RAG to improve responses in generative AI application
- GitHub repo code samples: Amazon Bedrock Knowledge Base – Samples for building RAG workflows
About the authors
Sandeep Singh is a Senior Generative AI Data Scientist at Amazon Web Services, helping businesses innovate with generative AI. He specializes in Generative AI, Artificial Intelligence, Machine Learning, and System Design. He is passionate about developing state-of-the-art AI/ML-powered solutions to solve complex business problems for diverse industries, optimizing efficiency and scalability.
Suyin Wang is an AI/ML Specialist Solutions Architect at AWS. She has an interdisciplinary education background in Machine Learning, Financial Information Service and Economics, along with years of experience in building Data Science and Machine Learning applications that solved real-world business problems. She enjoys helping customers identify the right business questions and building the right AI/ML solutions. In her spare time, she loves singing and cooking.
Sherry Ding is a senior artificial intelligence (AI) and machine learning (ML) specialist solutions architect at Amazon Web Services (AWS). She has extensive experience in machine learning with a PhD degree in computer science. She mainly works with public sector customers on various AI/ML related business challenges, helping them accelerate their machine learning journey on the AWS Cloud. When not helping customers, she enjoys outdoor activities.