Automating product description generation with Amazon Bedrock


In today’s ever-evolving world of ecommerce, the influence of a compelling product description cannot be overstated. It can be the decisive factor that turns a potential visitor into a paying customer or sends them clicking off to a competitor’s site. Manually creating these descriptions across a vast array of products is a labor-intensive process, and it can slow the pace of innovation. This is where Amazon Bedrock, with its generative AI capabilities, steps in to reshape the game. In this post, we dive into how Amazon Bedrock is transforming the product description generation process, empowering e-retailers to efficiently scale their businesses while conserving valuable time and resources.

Unlocking the power of generative AI in retail

Generative AI has captured the attention of boards and CEOs worldwide, prompting them to ask, “How can we leverage generative AI for our business?” One of the most promising applications of generative AI in ecommerce is using it to craft product descriptions. Retailers and brands have invested significant resources in testing and evaluating the most effective descriptions, and generative AI excels in this area.

Creating engaging and informative product descriptions for a vast catalog is a monumental task, especially for global ecommerce platforms. Manual translation and adaptation of product descriptions for each market consumes time and resources. This results in generic or incomplete descriptions, leading to reduced sales and customer satisfaction.

The power of Amazon Bedrock: AI-generated product descriptions

Amazon Bedrock is a fully managed service that simplifies generative AI development, offering high-performing foundation models (FMs) from leading AI companies like AI21 Labs, Anthropic, Cohere, Meta, Stability AI, and Amazon through a single API. It provides a comprehensive set of capabilities for building generative AI applications while ensuring privacy and security are maintained. With Amazon Bedrock, you can experiment with various FMs and customize them privately using techniques like fine-tuning and Retrieval Augmented Generation (RAG). The platform also enables you to create managed agents that carry out complex business tasks, such as booking travel, processing insurance claims, creating ad campaigns, and managing inventory, without writing any code.

For example, ecommerce platforms can initially generate basic product descriptions that include size, color, and price. However, Amazon Bedrock’s flexibility allows these descriptions to be fine-tuned to incorporate customer reviews, integrate brand-specific language, and highlight specific product features, resulting in tailored descriptions that resonate with the target audience. Moreover, Amazon Bedrock offers access to foundation models from Amazon and leading AI startups through an intuitive API, making the entire process seamless and efficient.

Using AI can have the following impact on the product description process:

  • Faster approvals – Vendors experience a streamlined process, moving from product listing to approval in under an hour, eliminating frustrating delays
  • Improved product listing velocity – When automated, your ecommerce marketplace sees a surge in product listings, offering consumers access to the latest merchandise nearly instantaneously
  • Future-proofing – By embracing cutting-edge AI, you secure your position as a forward-looking platform ready to meet evolving market demands
  • Innovation – This solution liberates teams from mundane tasks, allowing them to focus on higher-value work and fostering a culture of innovation

Solution overview

Before we dive into the technical details, let’s take a high-level look at what this solution offers. This solution allows you to create and manage product descriptions for your ecommerce platform. It empowers your platform to:

  • Generate descriptions from text – With the power of generative AI, Amazon Bedrock can convert plain text descriptions into vivid, informative, and captivating product descriptions.
  • Craft images – Beyond text, it can also craft images that align perfectly with the product descriptions, enhancing the visual appeal of your listings.
  • Enhance existing content – Do you have existing product descriptions that need a fresh perspective? Amazon Bedrock can take your current content and make it even more compelling and engaging.

This solution is available in the AWS Solutions Library. We’ve provided detailed instructions in the accompanying README file. The README file contains all the information you need to get started, from requirements to deployment guidelines.

The system architecture comprises several core components:

  • UI portal – This is the user interface (UI) designed for vendors to upload product images.
  • Amazon Rekognition – Amazon Rekognition is an image analysis service that detects objects, text, and labels in images.
  • Amazon Bedrock – Foundation models in Amazon Bedrock use the labels detected by Amazon Rekognition to generate product descriptions.
  • AWS Lambda – AWS Lambda provides serverless compute for processing.
  • Product database – The central repository stores vendor products, images, labels, and generated descriptions. This could be any database of your choice. Note that in this sample solution, storage is handled within the UI rather than in a separate database.
  • Admin portal – This portal provides oversight of the system and product listings, ensuring smooth operation. This is not part of the solution; we’ve added it for understanding.

The following diagram illustrates the flow of data and interactions within the system.


The workflow includes the following steps:

  1. The client initiates a request to the Amazon API Gateway REST API.
  2. Amazon API Gateway passes the request to AWS Lambda through a proxy integration.
  3. When operating on product image inputs, AWS Lambda calls Amazon Rekognition to detect objects in the image.
  4. AWS Lambda calls LLMs hosted by Amazon Bedrock, such as the Amazon Titan language models, to generate product descriptions.
  5. The response is passed back from AWS Lambda to Amazon API Gateway.
  6. Finally, the HTTP response from Amazon API Gateway is returned to the client.
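
To make the flow concrete, the following minimal Python sketch shows how steps 3 and 4 might be wired together inside the Lambda function. The function name, bucket and key parameters, prompt, model ID, and inference parameters are illustrative assumptions, not the exact code shipped with the solution in the AWS Solutions Library.

import json
import boto3

rekognition = boto3.client("rekognition")
bedrock = boto3.client("bedrock-runtime")

def generate_description(bucket, image_key):
    # Step 3: detect labels in the uploaded product image
    labels_response = rekognition.detect_labels(
        Image={"S3Object": {"Bucket": bucket, "Name": image_key}},
        MaxLabels=10,
        MinConfidence=80,
    )
    labels = [label["Name"] for label in labels_response["Labels"]]

    # Step 4: ask an Amazon Titan text model to draft a description from the labels
    prompt = (
        "Write an engaging, accurate product description for an item "
        f"with these attributes: {', '.join(labels)}."
    )
    body = json.dumps({
        "inputText": prompt,
        "textGenerationConfig": {"maxTokenCount": 512, "temperature": 0.7},
    })
    response = bedrock.invoke_model(body=body, modelId="amazon.titan-text-express-v1")
    return json.loads(response["body"].read())["results"][0]["outputText"]

In the deployed solution, this logic sits behind the API Gateway proxy integration, so the generated text is returned to the UI portal in the HTTP response.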

Example use case

Imagine a vendor uploads a product image of shoes, and Amazon Rekognition identifies key attributes like “white shoes,” “sneaker,” and “durable.” The Amazon Bedrock Titan AI takes this information and generates a product description like, “Here is a draft product description for a canvas running shoe based on the product photo: Introducing the Canvas Runner, the perfect lightweight sneaker for your active lifestyle. This running shoe features a breathable canvas upper with leather accents for a stylish, classic look. The lace-up design provides a secure fit, while the padded tongue and collar add comfort. Inside, a removable cushioned insole supports and comforts your feet. The EVA midsole absorbs shock with each step, reducing fatigue. Flex grooves in the rubber outsole ensure flexibility and traction. With its simple, retro-inspired style, the Canvas Runner seamlessly transitions from workouts to everyday wear. Whether you’re running errands or running miles, this versatile sneaker will keep you moving in comfort and style.”

Design details

Let’s explore the components in more detail:

  • User interface:
    • Front end – The front end of the vendor portal allows vendors to upload product images and displays product listings.
    • API calls – The portal communicates with the backend through APIs to process images and generate descriptions.
  • Amazon Rekognition:
    • Image analysis – Triggered by API calls, Amazon Rekognition analyzes images and detects objects, text, and labels.
    • Label output – It outputs label data derived from the analysis.
  • Amazon Bedrock:
    • NLP text generation – Amazon Bedrock uses the Amazon Titan natural language processing (NLP) model to generate textual descriptions.
    • Label integration – It takes the labels detected by Amazon Rekognition as input to generate product descriptions.
    • Style matching – Amazon Bedrock provides fine-tuning capabilities for Amazon Titan models to ensure that the generated descriptions match the style of the platform.
  • AWS Lambda:
    • Processing – Lambda handles the API calls to services.
  • Product database:
    • Flexible database – The product database is chosen based on customer preferences and requirements. Note this is not provided as part of the solution.

Additional capabilities

This solution goes beyond just generating product descriptions. It offers two more incredible options:

  • Image and description generation from text – With the power of generative AI, Amazon Bedrock can take text descriptions and create corresponding images along with detailed product descriptions. Consider the potential:
    • Instantly visualizing products from text.
    • Automating image creation for large catalogs.
    • Enhancing customer experience with rich visuals.
    • Reducing content creation time and costs.
  • Description enhancement – If you already have existing product descriptions, Amazon Bedrock can enhance them. Simply supply the text and the prompt, and Amazon Bedrock will skillfully enhance and enrich the content, rendering it highly captivating and engaging for your customers.
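
As an illustration of the image generation capability, the following sketch calls a Stability AI model on Amazon Bedrock to render a product image from a plain-text description. The model ID, prompt, and request parameters are assumptions for demonstration purposes; check the model documentation in the Amazon Bedrock console for the exact schema your chosen model expects.

import base64
import json
import boto3

bedrock = boto3.client("bedrock-runtime")

# Generate a product image from a text description (illustrative parameters)
body = json.dumps({
    "text_prompts": [{"text": "Studio photo of a lightweight white canvas running shoe on a plain background"}],
    "cfg_scale": 10,
    "steps": 30,
})
response = bedrock.invoke_model(body=body, modelId="stability.stable-diffusion-xl-v1")
image_bytes = base64.b64decode(json.loads(response["body"].read())["artifacts"][0]["base64"])

with open("product_image.png", "wb") as f:
    f.write(image_bytes)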

Conclusion

In the fiercely competitive world of ecommerce, staying at the forefront of innovation is imperative. Amazon Bedrock offers a transformative capability for e-retailers looking to enhance their product content, optimize their listing process, and drive sales. With the power of AI-generated product descriptions, businesses can create compelling, informative, and culturally relevant content that resonates deeply with customers. The future of ecommerce has arrived, and it’s driven by machine learning with Amazon Bedrock.

Are you ready to unlock the full potential of AI-powered product descriptions? Take the next step in revolutionizing your ecommerce platform. Visit the AWS Solutions Library and explore how Amazon Bedrock can transform your product descriptions, streamline your processes, and boost your sales. It’s time to supercharge your ecommerce with Amazon Bedrock!


About the Authors

Dhaval Shah is a Senior Solutions Architect at AWS, specializing in Machine Learning. With a strong focus on digital native businesses, he empowers customers to leverage AWS and drive their business growth. As an ML enthusiast, Dhaval is driven by his passion for creating impactful solutions that bring positive change. In his leisure time, he indulges in his love for travel and cherishes quality moments with his family.

Doug Tiffan is the Head of World Wide Solution Strategy for Fashion & Apparel at AWS. In his role, Doug works with Fashion & Apparel executives to understand their goals and align with them on the best solutions. Doug has over 30 years of experience in retail, holding several merchandising and technology leadership roles. Doug holds a BBA from Texas A&M University and is based in Houston, Texas.

Nikhil Sharma is a Solutions Architecture Leader at Amazon Web Services (AWS) where he and his team of Solutions Architects help AWS customers solve critical business challenges using AWS cloud technologies and services.

Kevin Bell is a Sr. Solutions Architect at AWS based in Seattle. He has been building things in the cloud for about 10 years. You can find him online as @bellkev on GitHub.

Nipun Chagari is a Principal Solutions Architect based in the Bay Area, CA. Nipun is passionate about helping customers adopt Serverless technology to modernize applications and achieve their business objectives. His recent focus has been on assisting organizations in adopting modern technologies to enable digital transformation. Apart from work, Nipun finds joy in playing volleyball, cooking and traveling with his family.

Marshall Bunch is a Solutions Architect at AWS helping North American customers design secure, scalable and cost-effective workloads in the cloud. His passion lies in solving age-old business problems where data and the newest technologies enable novel solutions. Beyond his professional pursuits, Marshall enjoys hiking and camping in Colorado’s beautiful Rocky Mountains.

Altaaf Dawoodjee is a Solutions Architect Leader that supports AdTech customers in the Digital Native Business (DNB) segment at Amazon Web Service (AWS). He has over 20 years of experience in Technology and has deep expertise in Analytics. He is passionate about helping drive successful business outcomes for his customers leveraging the AWS cloud.

Scott Bell is a dynamic leader and innovator with 25+ years of technology management experience. He is passionate about leading and developing teams in providing technology to meet the challenges of global users and businesses. He has extensive experience in leading technology teams which provide global technology solutions supporting 35+ languages. He is also passionate about the way AI and generative AI transform businesses and support customers’ current unmet needs.

Sachin Shetti is a Principal Customer Solution Manager at AWS. He is passionate about helping enterprises succeed and realize significant benefits from cloud adoption, driving everything from basic migration to large-scale cloud transformation across people, processes, and technology. Prior to joining AWS, Sachin worked as a software developer for over 12 years and held multiple senior leadership positions leading technology delivery and transformation in healthcare, financial services, retail, and insurance. He has an Executive MBA and a Bachelor’s degree in Mechanical Engineering.


Optimizing costs for Amazon SageMaker Canvas with automatic shutdown of idle apps


Amazon SageMaker Canvas is a rich, no-code machine learning (ML) and generative AI workspace that has allowed customers all over the world to more easily adopt ML technologies to solve old and new challenges thanks to its visual, no-code interface. It covers the ML workflow end to end: whether you’re looking for powerful data preparation and AutoML, managed endpoint deployment, simplified MLOps capabilities, or ready-to-use models powered by AWS AI services and generative AI, SageMaker Canvas can help you achieve your goals.

As companies of all sizes adopt SageMaker Canvas, customers asked for ways to optimize cost. As defined in the AWS Well-Architected Framework, a cost-optimized workload fully uses all resources, meets your functional requirements, and achieves an outcome at the lowest possible price point.

Today, we’re introducing a new way to further optimize costs for SageMaker Canvas applications. SageMaker Canvas now collects Amazon CloudWatch metrics that provide insight into app usage and idleness. Customers can use this information to automatically shut down idle SageMaker Canvas applications and avoid incurring unintended costs.

In this post, we’ll show you how to automatically shut down idle SageMaker Canvas apps to control costs by using a simple serverless architecture. Templates used in this post are available in GitHub.

Understanding and tracking costs

Education is always the first step into understanding and controlling costs for any workload, either on-premises or in the cloud. Let’s start by reviewing the SageMaker Canvas pricing model. In a nutshell, SageMaker Canvas has a pay-as-you-go pricing model, based on two dimensions:

  • Workspace instance – Formerly known as session time, this is the cost associated with running the SageMaker Canvas app
  • AWS service charges – Costs associated with training models, deploying endpoints, and generating inferences (the resources spun up by SageMaker Canvas)

Customers always have full control over the resources that are launched by SageMaker Canvas and can keep track of costs associated with the SageMaker Canvas app by using the AWS Billing and Cost Management service. For more information, refer to Manage billing and cost in SageMaker Canvas.

To limit the cost associated with the workspace instances, as a best practice, you must log out rather than simply closing the browser tab. To log out, choose the Log out button on the left panel of the SageMaker Canvas app.

Automatically shutting down SageMaker Canvas applications

For IT administrators who are looking to provide automated controls for shutting down SageMaker Canvas applications and keeping costs under control, there are two approaches:

  1. Shut down applications on a schedule (for example, every day at 19:00 or every Friday at 18:00)
  2. Automatically shut down idle applications (for example, when the application hasn’t been used for two hours)

Shut down applications on a schedule

Canvas Scheduled Shutdown Architecture

Scheduled shutdown of SageMaker Canvas applications can be achieved with very little effort by combining a cron expression (an Amazon EventBridge cron rule) with a compute component (an AWS Lambda function) that calls the Amazon SageMaker DeleteApp API. This approach has been discussed in the Provision and manage ML environments with Amazon SageMaker Canvas using AWS CDK and AWS Service Catalog post, and implemented in the associated GitHub repository.

One of the advantages of this architecture is that it is very simple to duplicate to achieve scheduled creation of the SageMaker Canvas app. By using a combination of scheduled creation and scheduled deletion, a cloud administrator can make sure that the SageMaker Canvas application is ready to be used whenever users start their business day (for example, 9 AM on a work day), and that the app also automatically shuts down at the end of the business day (for example, 7 PM on a work day, and always shut down during weekends). All that is needed is to change the line of code calling the DeleteApp API into CreateApp, as well as updating the cron expression to reflect the desired app creation time. A minimal sketch of the shutdown function follows.
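
For illustration, a minimal version of the scheduled-shutdown Lambda function might look like the following. The domain ID and user profile name are placeholders, and the Canvas app name default reflects the default naming; the implementation in the linked GitHub repository may differ.

import boto3

sagemaker = boto3.client("sagemaker")

def lambda_handler(event, context):
    # Delete the SageMaker Canvas app for a given domain and user profile
    sagemaker.delete_app(
        DomainId="d-xxxxxxxxxxxx",        # placeholder SageMaker domain ID
        UserProfileName="example-user",   # placeholder user profile name
        AppType="Canvas",
        AppName="default",                # Canvas apps are created with the name "default"
    )
    # For scheduled creation instead, call create_app with the same identifiers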

While this approach is very easy to implement and test, a drawback of the suggested architecture is that it does not take into account whether an application is currently being used, shutting it down regardless of its activity status. Depending on the situation, this might cause friction with active users, who might suddenly see their session terminated.

You can retrieve the template associated with this architecture from the following GitHub repository:

Automatically shut down idle applications

Canvas Shutdown on Idle Architecture

Starting today, Amazon SageMaker Canvas emits CloudWatch metrics that provide insight into app usage and idleness. This allows an administrator to define a solution that reads the idleness metric, compares it against a threshold, and defines a specific logic for automatic shutdown. A more detailed overview of the idleness metric emitted by SageMaker Canvas is shown in the following paragraph.

To achieve automatic shutdown of SageMaker Canvas applications based on the idleness metrics, we provide an AWS CloudFormation template. This template consists of three main components:

  1. An Amazon CloudWatch alarm, which runs a query to check the MAX value of the TimeSinceLastActive metric. If this value is greater than a threshold provided as input to the CloudFormation template, it triggers the rest of the automation. This query can be run on a single user profile, on a single domain, or across all domains. Depending on the level of control that you wish to have, you can use:
    1. the all-domains-all-users template, which checks this across all users and all domains in the region where the template is deployed
    2. the one-domain-all-users template, which checks this across all users in one domain in the region where the template is deployed
    3. the one-domain-one-user template, which checks this for one user profile, in one domain, in the region where the template is deployed
  2. The alarm state change creates an event on the default event bus in Amazon EventBridge, which has an Amazon EventBridge Rule set up to trigger an AWS Lambda function
  3. The AWS Lambda function identifies which SageMaker Canvas apps have been idle for longer than the specified threshold and deletes them with the DeleteApp API, as sketched in the following example.
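
The following simplified sketch illustrates the logic of step 3: the function lists running Canvas apps, checks each one’s TimeSinceLastActive metric, and deletes any app that has been idle longer than the threshold. It is an illustration of the approach, not the exact Lambda code contained in the CloudFormation templates.

import datetime
import boto3

cw = boto3.client("cloudwatch")
sagemaker = boto3.client("sagemaker")

IDLE_THRESHOLD_SECONDS = 7200  # matches the template's default idle timeout of 2 hours

def lambda_handler(event, context):
    now = datetime.datetime.utcnow()
    for app in sagemaker.list_apps()["Apps"]:
        if app["AppType"] != "Canvas" or app["Status"] != "InService":
            continue
        # Read the idleness metric for this domain and user profile
        stats = cw.get_metric_statistics(
            Namespace="/aws/sagemaker/Canvas/AppActivity",
            MetricName="TimeSinceLastActive",
            Dimensions=[
                {"Name": "DomainId", "Value": app["DomainId"]},
                {"Name": "UserProfileName", "Value": app["UserProfileName"]},
            ],
            StartTime=now - datetime.timedelta(minutes=30),
            EndTime=now,
            Period=900,
            Statistics=["Maximum"],
        )
        datapoints = stats["Datapoints"]
        if datapoints and max(dp["Maximum"] for dp in datapoints) > IDLE_THRESHOLD_SECONDS:
            sagemaker.delete_app(
                DomainId=app["DomainId"],
                UserProfileName=app["UserProfileName"],
                AppType="Canvas",
                AppName=app["AppName"],
            )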

You can retrieve the AWS CloudFormation templates associated with this architecture from the following GitHub repository:

How the SageMaker Canvas idleness metric works

SageMaker Canvas emits a TimeSinceLastActive metric in the /aws/sagemaker/Canvas/AppActivity namespace, which shows the number of seconds that the app has been idle with no user activity. We can use this new metric to trigger an automatic shutdown of the SageMaker Canvas app when it has been idle for a defined period. SageMaker Canvas exposes the TimeSinceLastActive with the following schema:

{
    "Namespace": "/aws/sagemaker/Canvas/AppActivity",
    "Dimensions": [
        [
            "DomainId",
            "UserProfileName"
        ]
    ],
    "Metrics": [
        {
            "Name": "TimeSinceLastActive",
            "Unit": "Seconds",
            "Value": 12345
        }
    ]
}

The key components of this metric are as follows:

  • Dimensions, in particular DomainID and UserProfileName, that allow an administrator to pinpoint which applications are idle across all domains and users
  • Value of the metric, which indicates the number of seconds since the last activity in the SageMaker Canvas applications. SageMaker Canvas considers the following as activity:
    • Any action taken in the SageMaker Canvas application (clicking a button, transforming a dataset, generating an in-app inference, deploying a model);
    • Using a ready-to-use model or interacting with the Generative AI models using chat interface;
    • A batch inference scheduled to run at a specific time; for more information, refer to  Manage automations.

This metric can be read via the Amazon CloudWatch API, for example with the get_metric_data operation. The following example uses the AWS SDK for Python (Boto3):

import boto3, datetime

cw = boto3.client('cloudwatch')
metric_data_results = cw.get_metric_data(
    MetricDataQueries=[
        {
            "Id": "q1",
            "Expression": 'SELECT MAX(TimeSinceLastActive) FROM "/aws/sagemaker/Canvas/AppActivity" GROUP BY DomainId, UserProfileName',
            "Period": 900
        }
    ],
    StartTime=datetime.datetime(2023, 1, 1),
    EndTime=datetime.datetime.now(),
    ScanBy='TimestampAscending'
)

The Python query extracts the MAX value of TimeSinceLastActive from the namespace associated to SageMaker Canvas after grouping these values by DomainID and UserProfileName.

Deploying and testing the auto-shutdown solution

To deploy the auto-shutdown stack, do the following:

  1. Download the AWS CloudFormation template that refers to the solution you want to implement from the above GitHub repository. Choose whether you want to implement a solution for all SageMaker Domains, for a single SageMaker Domain, or for a single user;
  2. Update template parameters:
    1. The idle timeout – The time (in seconds) that the SageMaker Canvas app is allowed to stay idle before it gets shut down; the default value is 2 hours
    2. The alarm period – The aggregation time (in seconds) used by the CloudWatch alarm to compute the idle timeout; the default value is 20 minutes
    3. (optional) SageMaker Domain ID and user profile name
  3. Deploy the CloudFormation stack to create the resources

Once deployed (this should take less than two minutes), the AWS Lambda function and Amazon CloudWatch alarm are configured to automatically shut down the Canvas app when idle. To test the auto-shutdown script, do the following:

  1. Make sure that the SageMaker Canvas app is running within the right domain and with the right user profile (if you have configured them).
  2. Stop using the SageMaker Canvas app and wait for the idle timeout period (default, 2 hours)
  3. Check that the app is stopped after being idle for the threshold time by checking that the CloudWatch alarm has been triggered and, after triggering the automation, it has gone back to the normal state.

In our test, we have set the idle timeout period to two hours (7200 seconds). In the following graph plotted by Amazon CloudWatch Metrics, you can see that the SageMaker Canvas app has been emitting the TimeSinceLastActive metric until the threshold was met (1), which triggered the alarm. Once the alarm was triggered, the AWS Lambda function was executed, which deleted the app and brought the metric back below the threshold (2).

Canvas Auto-shutdown Metrics Plot

Conclusion

In this post, we implemented an automated shutdown solution for idle SageMaker Canvas apps using AWS Lambda, a CloudWatch alarm, and the newly emitted idleness metric from SageMaker Canvas. Thanks to this solution, customers can not only optimize costs for their ML workloads but also avoid unintended charges for applications that they forgot were running in their SageMaker domain.

We’re looking forward to seeing what new use cases and workloads customers can solve with the peace of mind brought by this solution. For more examples of how SageMaker Canvas can help you achieve your business goals, refer to the following posts:

To learn how you can run production-level workloads with Amazon SageMaker Canvas, refer to the following posts:


About the authors


Davide Gallitelli is a Senior Specialist Solutions Architect for AI/ML. He is based in Brussels and works closely with customers all around the globe that are looking to adopt Low-Code/No-Code Machine Learning technologies, and Generative AI. He has been a developer since he was very young, starting to code at the age of 7. He started learning AI/ML at university, and has fallen in love with it since then.


Huong Nguyen is a Sr. Product Manager at AWS. She is leading the data ecosystem integration for SageMaker, with 14 years of experience building customer-centric and data-driven products for both enterprise and consumer spaces.


Gunjan Garg is a Principal Engineer at Amazon SageMaker team in AWS, providing technical leadership for the product. She has worked in several roles in the AI/ML org for last 5 years and is currently focused on Amazon SageMaker Canvas.


Ziyao Huang is a Software Development Engineer with Amazon SageMaker Data Wrangler. He is passionate about building great product that makes ML easy for the customers. Outside of work, Ziyao likes to read, and hang out with his friends.


How SnapLogic built a text-to-pipeline application with Amazon Bedrock to translate business intent into action


This post was co-written with Greg Benson, Chief Scientist; Aaron Kesler, Sr. Product Manager; and Rich Dill, Enterprise Solutions Architect from SnapLogic.

Many customers are building generative AI apps on Amazon Bedrock and Amazon CodeWhisperer to create code artifacts based on natural language. This use case highlights how large language models (LLMs) are able to become a translator between human languages (English, Spanish, Arabic, and more) and machine interpretable languages (Python, Java, Scala, SQL, and so on) along with sophisticated internal reasoning. This emergent ability in LLMs has compelled software developers to use LLMs as an automation and UX enhancement tool that transforms natural language to a domain-specific language (DSL): system instructions, API requests, code artifacts, and more. In this post, we show you how SnapLogic, an AWS customer, used Amazon Bedrock to power their SnapGPT product through automated creation of these complex DSL artifacts from human language.

When customers create DSL objects from LLMs, the resulting DSL is either an exact replica or a derivative of the existing interface data and schema that forms the contract between the UI and the business logic in the backing service. This pattern is particularly trending with independent software vendors (ISVs) and software as a service (SaaS) ISVs because of their unique way of representing configurations through code and their desire to simplify the user experience for their customers. Example use cases include:

The most straightforward way to build and scale text-to-pipeline applications with LLMs on AWS is using Amazon Bedrock. Amazon Bedrock is the easiest way to build and scale generative AI applications with foundation models (FMs). It is a fully managed service that offers access to a choice of high-performing FMs from leading AI companies via a single API, along with a broad set of capabilities you need to build generative AI applications with privacy and security. Anthropic, an AI safety and research lab that builds reliable, interpretable, and steerable AI systems, is one of the leading AI companies that offers access to its state-of-the-art LLM, Claude, on Amazon Bedrock. Claude is an LLM that excels at a wide range of tasks, from thoughtful dialogue and content creation to complex reasoning, creativity, and coding. Anthropic offers both Claude and Claude Instant models, all of which are available through Amazon Bedrock. Claude has quickly gained popularity in these text-to-pipeline applications because of its improved reasoning ability, which allows it to excel in ambiguous technical problem solving. Claude 2 on Amazon Bedrock supports a 100,000-token context window, which is equivalent to about 200 pages of English text. This is a particularly important feature that you can rely on when building text-to-pipeline applications that require complex reasoning, detailed instructions, and comprehensive examples.

SnapLogic background

SnapLogic is an AWS customer on a mission to bring enterprise automation to the world. The SnapLogic Intelligent Integration Platform (IIP) enables organizations to realize enterprise-wide automation by connecting their entire ecosystem of applications, databases, big data, machines and devices, APIs, and more with pre-built, intelligent connectors called Snaps. SnapLogic recently released a feature called SnapGPT, which provides a text interface where you can type the desired integration pipeline you want to create in simple human language. SnapGPT uses Anthropic’s Claude model through Amazon Bedrock to automate the creation of these integration pipelines as code, which are then used through SnapLogic’s flagship integration solution. However, SnapLogic’s journey to SnapGPT has been a culmination of many years operating in the AI space.

SnapLogic’s AI journey

In the realm of integration platforms, SnapLogic has consistently been at the forefront, harnessing the transformative power of artificial intelligence. Over the years, the company’s commitment to innovating with AI has become evident, especially when we trace the journey from Iris to AutoLink.

The humble beginnings with Iris

In 2017, SnapLogic unveiled Iris, an industry-first AI-powered integration assistant. Iris was designed to use machine learning (ML) algorithms to predict the next steps in building a data pipeline. By analyzing millions of metadata elements and data flows, Iris could make intelligent suggestions to users, democratizing data integration and allowing even those without a deep technical background to create complex workflows.

AutoLink: Building momentum

Building on the success and learnings from Iris, SnapLogic introduced AutoLink, a feature aimed at further simplifying the data mapping process. The tedious task of manually mapping fields between source and target systems became a breeze with AutoLink. Using AI, AutoLink automatically identified and suggested potential matches. Integrations that once took hours could be run in mere minutes.

The generative leap with SnapGPT

SnapLogic’s latest foray in AI brings us SnapGPT, which aims to revolutionize integration even further. With SnapGPT, SnapLogic introduces the world’s first generative integration solution. This is not just about simplifying existing processes, but entirely reimagining how integrations are designed. The power of generative AI can create entire integration pipelines from scratch, optimizing the workflow based on the desired outcome and data characteristics.

SnapGPT is extremely impactful to SnapLogic’s customers because they are able to drastically decrease the amount of time required to generate their first SnapLogic pipeline. Traditionally, SnapLogic customers would need to spend days or weeks configuring integration pipelines from scratch. Now, these customers are able to simply ask SnapGPT to, for example, “create a pipeline which will move all of my active SFDC customers to WorkDay.” A working first draft of a pipeline is automatically created for this customer, drastically cutting down the development time required for creation of the base of their integration pipeline. This allows the end customer to spend more time focusing on what has true business impact to them instead of working on configurations of an integration pipeline. The following example shows how a SnapLogic customer can enter a description into the SnapGPT feature to quickly generate a pipeline, using natural language.

AWS and SnapLogic have collaborated closely throughout this product build and have learned a lot along the way. The rest of this post will focus on the technical learnings AWS and SnapLogic have had around using LLMs for text-to-pipeline applications.

Solution overview

To solve this text-to-pipeline problem, AWS and SnapLogic designed a comprehensive solution shown in the following architecture.

A request to SnapGPT goes through the following workflow:

  1. A user submits a description to the application.
  2. SnapLogic uses a Retrieval Augmented Generation (RAG) approach to retrieve relevant examples of SnapLogic pipelines that are similar to the user’s request.
  3. These extracted relevant examples are combined with the user input and go through some text preprocessing before they’re sent to Claude on Amazon Bedrock.
  4. Claude produces a JSON artifact that represents a SnapLogic pipeline.
  5. The JSON artifact is directly integrated to the core SnapLogic integration platform.
  6. The SnapLogic pipeline is rendered to the user in a visually friendly manner.
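
As an illustration of steps 2 and 3, the following sketch retrieves the most similar stored pipeline examples for a user request by comparing Amazon Titan text embeddings. This is a simplified illustration of the RAG idea, not SnapLogic’s actual retrieval implementation; the helper names and the in-memory example store are assumptions.

import json
import boto3
import numpy as np

bedrock = boto3.client("bedrock-runtime")

def embed(text):
    # Create an embedding vector for a piece of text with Amazon Titan Embeddings
    body = json.dumps({"inputText": text})
    response = bedrock.invoke_model(body=body, modelId="amazon.titan-embed-text-v1")
    return np.array(json.loads(response["body"].read())["embedding"])

def retrieve_examples(user_request, example_pipelines, k=2):
    # Rank stored (description, pipeline JSON) pairs by cosine similarity to the request
    query = embed(user_request)
    scored = []
    for example in example_pipelines:
        candidate = embed(example["description"])
        similarity = float(query @ candidate / (np.linalg.norm(query) * np.linalg.norm(candidate)))
        scored.append((similarity, example))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [example for _, example in scored[:k]]

The retrieved examples are then combined with the user input into a few-shot prompt, as described in the prompt engineering techniques that follow.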

Through various experimentation between AWS and SnapLogic, we have found the prompt engineering step of the solution to be extremely important for generating high-quality outputs in these text-to-pipeline applications. The next section goes further into some specific techniques used with Claude in this space.

Prompt experimentation

Throughout the development phase of SnapGPT, AWS and SnapLogic found that rapid iteration on prompts being sent to Claude was a critical development task for improving the accuracy and relevancy of SnapLogic’s text-to-pipeline outputs. By using Amazon SageMaker Studio interactive notebooks, the AWS and SnapLogic teams were able to quickly work through different versions of prompts by using the Boto3 SDK connection to Amazon Bedrock. Notebook-based development allowed the teams to quickly create client-side connections to Amazon Bedrock, include text-based descriptions alongside Python code for sending prompts to Amazon Bedrock, and hold joint prompt engineering sessions where iterations were made quickly between multiple personas.

Anthropic Claude prompt engineering methods

In this section, we describe some of the iterative techniques we used to create a high-performing prompt based on an illustrative user request: “Make a pipeline which uses the ExampleCompany database which retrieves all active customers.” Note that this example is not the schema that SnapGPT is powered by, and is only used to illustrate a text-to-pipeline application.

To baseline our prompt engineering, we use the following original prompt:

Make a pipeline which uses the ExampleCompany database which retrieves all active customers

The expected output is as follows:

{
  "database": "ExampleCompany",
  "query": "SELECT * FROM ec_prod.customers WHERE status = 'active'"
}

Improvement #1: Using the Human and Assistant annotations

Claude’s training procedure teaches the FM to understand dialogue between a human and an assistant in its prompt structure. Claude users can take advantage of this structure by ending their prompt in Assistant:, which will trigger Claude to start generating the response to a query based on what the human has said. Note that because Claude will continue to generate text until it is told to stop, make sure you use a stop sequence of \n\nHuman: in the API request to Amazon Bedrock when using Claude. This will ensure you only receive the response from the assistant and not the predicted next response from the human in the dialogue chain. These \n\nHuman: and \n\nAssistant: indicators not only help in limiting Claude’s response to relevant information, but also help in providing clear demarcation between primary human inputs and assistant responses from other interactions. This especially helps in a few-shot prompt setting where you can have multiple queries and responses with queries tagged with \n\nHuman: and previous responses tagged with \n\nAssistant:. Let’s look at the improved prompt and associated illustrative API call to Amazon Bedrock via the Boto3 SDK.

We use the following prompt:

Human: Make a pipeline which uses the ExampleCompany database which retrieves all active customers

Assistant:
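
An illustrative Boto3 call for sending this prompt to Claude on Amazon Bedrock is shown below; the model ID and token limit are assumptions for demonstration purposes.

import json
import boto3

bedrock = boto3.client("bedrock-runtime")

prompt = ("\n\nHuman: Make a pipeline which uses the ExampleCompany database "
          "which retrieves all active customers\n\nAssistant:")

body = json.dumps({
    "prompt": prompt,
    "max_tokens_to_sample": 500,       # illustrative generation limit
    "stop_sequences": ["\n\nHuman:"],  # stop before Claude predicts the next human turn
})
response = bedrock.invoke_model(body=body, modelId="anthropic.claude-v2")
print(json.loads(response["body"].read())["completion"])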

The output is as follows:

Here is a MongoDB pipeline to retrieve all active customers from the ExampleCompany database:

pipeline = [
    {
        '$match': {
            'database': 'ExampleCompany', 
            'status': 'active'
        }
    }
]
db.customers.aggregate(pipeline)

The key steps are:

– `$match` to filter the documents to only those in the ExampleCompany database and with status equal to “active”.

– `db.customers.aggregate()` runs the pipeline on the customers collection and returns the filtered results.

So this will return all customer documents from the ExampleCompany database where the status field is set to “active”.

Notice how in the output, Claude makes up an answer to our question without any context. The answer is potentially helpful, but it’s not specific to the problem we’re trying to solve because it is not provided with enough context. It’s also important to note that Anthropic requires this human and assistant format when using the Claude models; if you don’t include this step, Amazon Bedrock will return an error.

Improvement #2: Include hyper-specific context for Claude before it’s told what to do

Claude performs best when it’s given good context about what it needs to accomplish as well as how it is supposed to accomplish that task. Here are a few examples of things to include as context for Claude before you tell it the task:

  • Describe the job function that you want Claude to fulfill. By doing this, you are making Claude think in the role of someone working in that job, or more like someone just being hired in that specific job role. This will help Claude formulate the answer from that specific job role’s perspective, reducing hallucination and randomness in the answer.
  • Include technical details regarding the output format. Foundation models are generally good at following instructions, so asking Claude to generate the output in a specific format or schema with one or a few examples will help Claude provide the answer in the right format, reducing the need for filtering and postprocessing of the output. The example in the following prompt shows a fictitious schema that matches the expected output.

Let’s combine these suggestions to improve upon our original prompt:

Human: Your job is to act as an expert on ETL pipelines. Specifically, your job is to create a JSON representation of an ETL pipeline which will solve the user request provided to you.

The JSON output should follow the following format:

{
  "database": "the name of the database you will be using here",
  "query": "place a sql representation for solving the task at hand here"
}

Here is your task: make a pipeline which uses the ExampleCompany database which retrieves all active customers

Assistant:

The output is as follows:

Here is a JSON representation of an ETL pipeline to retrieve all active customers from the ExampleCompany database:

{
  "database": "ExampleCompany",
  "query": "SELECT * FROM customers WHERE status = 'active'"
}

This pipeline uses the ExampleCompany database and a SQL query to select all records from the customers table where the status is ‘active’.

With the addition of context, the model is now able to get most of the way to our answer. It now knows how to structure the output and it understands that it needs to write a SQL statement to solve the task. However, you will notice that the model doesn’t correctly use the ec_prod.customers nomenclature we are expecting. Room for improvement! Also note, in practice, you will need to include significantly more context than what we provided in this example for high-quality results in text-to-pipeline use cases.

Improvement #3: Enclosing examples in XML tags

Few-shot learning is a popular way to increase accuracy with FMs when you have relevant examples to guide the FM in its task. This is particularly important when working in text-to-pipeline applications because these hyper-specific pipeline outputs are usually not included in any public dataset, given that they are specific to an individual customer’s data representation. When using Claude, you can take advantage of its innate knowledge of arbitrary XML tags in order to provide these examples in a few-shot format. In the following improved prompt, notice the addition of examples enclosed in <example> XML tags. Also, you can use abbreviations for human and assistant with H and A to provide context to Claude without making it think these were specific pieces of the dialogue that have already occurred.

We use the following prompt:

Human: Your job is to act as an expert on ETL pipelines. Specifically, your job is to create a JSON representation of an ETL pipeline which will solve the user request provided to you.

The JSON output should follow the following format:

{
  "database": "the name of the database you will be using here"
  "query": "place a sql representation for solving the task at hand here"
}

Here are some examples of user requests and associated output JSON artifacts which correctly solve the task provided. Make sure to use the same nomenclature which is provided in the examples if you are using the same databases.

<example-1>

H: query the ExampleCompany database for any active orders for customer #1234

A:

{
  "database": "ExampleCompany",
  "query": "SELECT * FROM ec_prod.orders WHERE customer_id = 1234 AND status  = 'active'"
}

The reasoning for this query is that the customer #1234 is filtered and we are searching the “orders” table for any existing records after the “active” filter.

</example-1>

<example-2>

H: remove all customers in the ExampleCompany database which have not been active in the last month

A:

{
  "database": "ExampleCompany",
  "query": "SELECT * FROM ec_prod.customers WHERE status != 'active'"
}

The reasoning for this query is that the “ec_prod.customers” table is the only table in ExampleCompany database which contains customer records.

</example-2>

Here is your task: make a pipeline which uses the ExampleCompany database which retrieves all active customers.

Assistant:

The output is as follows:

Here is the JSON representation for an ETL pipeline to retrieve all active customers from the ExampleCompany database:

{
  "database": "ExampleCompany",
  "query": "SELECT * FROM ec_prod.customers WHERE status = 'active'"
}

The reasoning is:

The ExampleCompany database is specified as the database to use

The ec_prod.customers table contains the customer data

The status field indicates if a customer is active so the query filters for only active customers

This should retrieve all active customer records from the ExampleCompany database. Let me know if you need any clarification or have additional requirements for the ETL pipeline!

The JSON output now has the correct database nomenclature, which Claude was able to use because of the examples that we provided. Coupling human and assistant indicators with XML tags makes it easier for Claude to understand what is the task and what was the answer in similar examples with very clear separation between multiple few-shot examples. The easier it is for Claude to understand, the better and more relevant the answer will be, further reducing the chance for the model to hallucinate and provide random irrelevant answers.

Improvement #4: Triggering Claude to begin JSON generation with XML tags

A small challenge with text-to-pipeline applications using FMs is the need to exactly parse an output from resulting text so it can be interpreted as code in a downstream application. One way to solve this with Claude is to take advantage of its XML tag understanding and combine this with a custom stop sequence. In the following prompt, we have instructed Claude to enclose the output in <json></json> XML tags. Then, we have added the <json> tag to the end of the prompt. This ensures that the first text that comes out of Claude will be the start of the JSON output. If you don’t do this, Claude often responds with some conversational text, then the true code response. By instructing Claude to immediately start generating the output, you can easily stop generation when you see the closing </json> tag. This is shown in the updated Boto3 API call. The benefits of this technique are twofold. First, you are able to exactly parse the code response from Claude. Second, you are able to reduce cost because Claude only generates code outputs and no extra text. This reduces cost on Amazon Bedrock because you are charged for each token that is produced as output from all FMs.

We use the following prompt:

Human: Your job is to act as an expert on ETL pipelines. Specifically, your job is to create a JSON representation of an ETL pipeline which will solve the user request provided to you.

The JSON output should follow the following format:

{
    "database": "the name of the database you will be using here"
    "query": "place a sql representation for solving the task at hand here"
}

Here are some examples of user requests and associated output JSON artifacts which correctly solve the task provided. Make sure to use the same nomenclature which is provided in the examples if you are using the same databases.

<example-1>

H: query the ExampleCompany database for any active orders for customer #1234

A:

<json>

{
  "database": "ExampleCompany",
  "query": "SELECT * FROM ec_prod.orders WHERE customer_id = 1234 AND status  = 'active'"
}

</json>

The reasoning for this query is that the customer #1234 is filtered and we are searching the “orders” table for any existing records after the “active” filter.

</example-1>

<example-2>

H: remove all customers in the ExampleCompany database which have not been active in the last month

A:

<json>

{
  "database": "ExampleCompany",
  "query": "SELECT * FROM ec_prod.customers WHERE status != 'active'"
}

</json>

The reasoning for this query is that the “ec_prod.customers” table is the only table in ExampleCompany database which contains customer records.

</example-2>

Always remember to enclose your JSON outputs in <json></json> tags.

Here is your task: make a pipeline which uses the ExampleCompany database which retrieves all active customers.

Assistant:

<json>

We use the following code:

import json, boto3

bedrock = boto3.client('bedrock-runtime')
body = json.dumps({
    "prompt": prompt,  # the prompt shown above
    "max_tokens_to_sample": 500,  # generation limit required by the Claude text completions API
    "stop_sequences": ['\n\nHuman:', '</json>']  # stop at the closing tag or the next human turn
})
response = bedrock.invoke_model(body=body, modelId='anthropic.claude-v2')

The output is as follows:

{
  "database": "ExampleCompany",
  "query": "SELECT * FROM ec_prod.customers WHERE status = 'active'"
}

Now we have arrived at the expected output with only the JSON object returned! By using this method, we are able to generate an immediately usable technical artifact as well as reduce the cost of generation by reducing output tokens.
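
Because generation stops at the closing tag, the returned completion is the JSON artifact itself and can be loaded directly into a downstream application. A brief sketch, assuming the response object from the preceding call:

import json

# The stop sequence truncates generation at "</json>", so the completion is pure JSON
completion = json.loads(response["body"].read())["completion"]
pipeline_spec = json.loads(completion.strip())
print(pipeline_spec["query"])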

Conclusion

To get started today with SnapGPT, request a free trial of SnapLogic or request a demo of the product. If you would like to use these concepts for building applications today, we recommend experimenting hands-on with the prompt engineering section in this post, using the same flow on a different DSL generation use case that suits your business, and diving deeper into the RAG features that are available through Amazon Bedrock.

SnapLogic and AWS have been able to partner effectively to build an advanced translator between human language and the complex schema of SnapLogic integration pipelines powered by Amazon Bedrock. Throughout this journey, we have seen how the output generated with Claude can be improved in text-to-pipeline applications using specific prompt engineering techniques. AWS and SnapLogic are excited to continue this partnership in Generative AI and look forward to future collaboration and innovation in this fast-moving space.


About the Authors

Greg Benson is a Professor of Computer Science at the University of San Francisco and Chief Scientist at SnapLogic. He joined the USF Department of Computer Science in 1998 and has taught undergraduate and graduate courses including operating systems, computer architecture, programming languages, distributed systems, and introductory programming. Greg has published research in the areas of operating systems, parallel computing, and distributed systems. Since joining SnapLogic in 2010, Greg has helped design and implement several key platform features including cluster processing, big data processing, the cloud architecture, and machine learning. He currently is working on Generative AI for data integration.

Aaron Kesler is the Senior Product Manager for AI products and services at SnapLogic, where he applies over ten years of product management expertise to pioneer AI/ML product development and evangelize services across the organization. He is the author of the upcoming book “What’s Your Problem?” aimed at guiding new product managers through the product management career. His entrepreneurial journey began with his college startup, STAK, which was later acquired by Carvertise with Aaron contributing significantly to their recognition as Tech Startup of the Year 2015 in Delaware. Beyond his professional pursuits, Aaron finds joy in golfing with his father, exploring new cultures and foods on his travels, and practicing the ukulele.

Rich Dill is a Principal Solutions Architect with experience cutting broadly across multiple areas of specialization. A track record of success spanning multi-platform enterprise software and SaaS. Well known for turning customer advocacy (serving as the voice of the customer) into revenue-generating new features and products. Proven ability to drive cutting-edge products to market and projects to completion on schedule and under budget in fast-paced onshore and offshore environments. A simple way to describe me: the mind of a scientist, the heart of an explorer and the soul of an artist.

Clay Elmore is an AI/ML Specialist Solutions Architect at AWS. After spending many hours in a materials research lab, his background in chemical engineering was quickly left behind to pursue his interest in machine learning. He has worked on ML applications in many different industries ranging from energy trading to hospitality marketing. Clay’s current work at AWS centers around helping customers bring software development practices to ML and generative AI workloads, allowing customers to build repeatable, scalable solutions in these complex environments. In his spare time, Clay enjoys skiing, solving Rubik’s cubes, reading, and cooking.

Sina Sojoodi is a technology executive, systems engineer, product leader, ex-founder and startup advisor. He joined AWS in March 2021 as a Principal Solutions Architect. Sina is currently the US-West ISV area lead Solutions Architect. He works with SaaS and B2B software companies to build and grow their businesses on AWS. Previous to his role at Amazon, Sina was a technology executive at VMware, and Pivotal Software (IPO in 2018, VMware M&A in 2020) and served multiple leadership roles including founding engineer at Xtreme Labs (Pivotal acquisition in 2013). Sina has dedicated the past 15 years of his work experience to building software platforms and practices for enterprises, software businesses and the public sector. He is an industry leader with a passion for innovation. Sina holds a BA from the University of Waterloo where he studied Electrical Engineering and Psychology.

Sandeep Rohilla is a Senior Solutions Architect at AWS, supporting ISV customers in the US West region. He focuses on helping customers architect solutions leveraging containers and generative AI on the AWS cloud. Sandeep is passionate about understanding customers’ business problems and helping them achieve their goals through technology. He joined AWS after working more than a decade as a solutions architect, bringing his 17 years of experience to bear. Sandeep holds an MSc. in Software Engineering from the University of the West of England in Bristol, UK.

Dr. Farooq Sabir is a Senior Artificial Intelligence and Machine Learning Specialist Solutions Architect at AWS. He holds PhD and MS degrees in Electrical Engineering from the University of Texas at Austin and an MS in Computer Science from Georgia Institute of Technology. He has over 15 years of work experience and also likes to teach and mentor college students. At AWS, he helps customers formulate and solve their business problems in data science, machine learning, computer vision, artificial intelligence, numerical optimization, and related domains. Based in Dallas, Texas, he and his family love to travel and go on long road trips.


Amazon EC2 DL2q instance for cost-efficient, high-performance AI inference is now generally available


This is a guest post by A.K Roy from Qualcomm AI.

Amazon Elastic Compute Cloud (Amazon EC2) DL2q instances, powered by Qualcomm AI 100 Standard accelerators, can be used to cost-efficiently deploy deep learning (DL) workloads in the cloud. They can also be used to develop and validate performance and accuracy of DL workloads that will be deployed on Qualcomm devices. DL2q instances are the first instances to bring Qualcomm’s artificial intelligence (AI) technology to the cloud.

With eight Qualcomm AI 100 Standard accelerators and 128 GiB of total accelerator memory, customers can also use DL2q instances to run popular generative AI applications, such as content generation, text summarization, and virtual assistants, as well as classic AI applications for natural language processing and computer vision. Additionally, Qualcomm AI 100 accelerators feature the same AI technology used across smartphones, autonomous driving, personal computers, and extended reality headsets, so DL2q instances can be used to develop and validate these AI workloads before deployment.

New DL2q instance highlights

Each DL2q instance incorporates eight Qualcomm Cloud AI100 accelerators, with an aggregated performance of over 2.8 PetaOps of Int8 inference performance and 1.4 PetaFlops of FP16 inference performance. The instance has an aggregate of 112 AI cores, an accelerator memory capacity of 128 GB, and a memory bandwidth of 1.1 TB per second.

Each DL2q instance has 96 vCPUs and a system memory capacity of 768 GB, and supports a networking bandwidth of 100 Gbps as well as Amazon Elastic Block Store (Amazon EBS) bandwidth of 19 Gbps.

  • Instance name – DL2q.24xlarge
  • vCPUs – 96
  • Cloud AI100 accelerators – 8
  • Accelerator memory – 128 GB
  • Accelerator memory bandwidth (aggregated) – 1.088 TB/s
  • Instance memory – 768 GB
  • Instance networking – 100 Gbps
  • Storage (Amazon EBS) bandwidth – 19 Gbps

Qualcomm Cloud AI100 accelerator innovation

The Cloud AI100 accelerator system-on-chip (SoC) is a purpose-built, scalable multi-core architecture, supporting a wide range of deep-learning use-cases spanning from the datacenter to the edge. The SoC employs scalar, vector, and tensor compute cores with an industry-leading on-die SRAM capacity of 126 MB. The cores are interconnected with a high-bandwidth low-latency network-on-chip (NoC) mesh.

The AI100 accelerator supports a broad and comprehensive range of models and use-cases. The table below highlights the range of the model support.

| Model category | Number of models | Examples |
|---|---|---|
| NLP | 157 | BERT, BART, FasterTransformer, T5, Z-code MOE |
| Generative AI – NLP | 40 | LLaMA, CodeGen, GPT, OPT, BLOOM, Jais, Luminous, StarCoder, XGen |
| Generative AI – Image | 3 | Stable diffusion v1.5 and v2.1, OpenAI CLIP |
| CV – Image classification | 45 | ViT, ResNet, ResNext, MobileNet, EfficientNet |
| CV – Object detection | 23 | YOLO v2, v3, v4, v5, and v7, SSD-ResNet, RetinaNet |
| CV – Other | 15 | LPRNet, Super-resolution/SRGAN, ByteTrack |
| Automotive networks* | 53 | Perception and LIDAR, pedestrian, lane, and traffic light detection |
| Total | >300 | |

* Most automotive networks are composite networks consisting of a fusion of individual networks.

The large on-die SRAM on the DL2q accelerator enables efficient implementation of advanced performance techniques such as MX6 micro-exponent precision for storing the weights and MX9 micro-exponent precision for accelerator-to-accelerator communication. The micro-exponent technology is described in the following Open Compute Project (OCP) industry announcement: AMD, Arm, Intel, Meta, Microsoft, NVIDIA, and Qualcomm Standardize Next-Generation Narrow Precision Data Formats for AI » Open Compute Project.

The instance user can apply the following strategy to maximize performance per cost:

  • Store weights using MX6 micro-exponent precision in the on-accelerator DDR memory. Using MX6 precision maximizes the utilization of the available memory capacity and memory bandwidth to deliver best-in-class throughput and latency.
  • Compute in FP16 to deliver the required use case accuracy, while using the large on-chip SRAM and spare TOPs on the card to implement high-performance, low-latency MX6-to-FP16 kernels.
  • Use an optimized batching strategy and a higher batch size, made possible by the large on-chip SRAM, to maximize the reuse of weights while keeping activations on-chip as much as possible.

DL2q AI Stack and toolchain

The DL2q instance is accompanied by the Qualcomm AI Stack that delivers a consistent developer experience across Qualcomm AI in the cloud and other Qualcomm products. The same Qualcomm AI stack and base AI technology runs on the DL2q instances and Qualcomm edge devices, providing customers a consistent developer experience, with a unified API across their cloud, automotive, personal computer, extended reality, and smartphone development environments.

The toolchain enables the instance user to quickly onboard a previously trained model, compile and optimize the model for the instance capabilities, and subsequently deploy the compiled models for production inference use-cases in three steps shown in the following figure.

To learn more about tuning the performance of a model, see the Cloud AI 100 Key Performance Parameters Documentation.

Get started with DL2q instances

In this example, you compile and deploy a pre-trained BERT model from Hugging Face on an EC2 DL2q instance using a pre-built available DL2q AMI, in four steps.

You can use either a pre-built Qualcomm DLAMI on the instance or start with an Amazon Linux 2 AMI and build your own DL2q AMI with the Cloud AI 100 Platform and Apps SDK available in this Amazon Simple Storage Service (Amazon S3) bucket: s3://ec2-linux-qualcomm-ai100-sdks/latest/.

The steps that follow use the pre-built DL2q AMI, Qualcomm Base AL2 DLAMI.

Use SSH to access your DL2q instance with the Qualcomm Base AL2 DLAMI and follow steps 1 through 4.

Step 1. Set up the environment and install required packages

  1. Install Python 3.8.
    sudo amazon-linux-extras install python3.8

  2. Set up the Python 3.8 virtual environment.
    python3.8 -m venv /home/ec2-user/userA/pyenv

  3. Activate the Python 3.8 virtual environment.
    source /home/ec2-user/userA/pyenv/bin/activate

  4. Install the required packages, listed in the requirements.txt file available on the Qualcomm public GitHub site.
    pip3 install -r requirements.txt

  5. Import the necessary libraries.
    import transformers 
    from transformers import AutoTokenizer, AutoModelForMaskedLM
    import sys
    import qaic
    import os
    import torch
    import onnx
    from onnxsim import simplify
    import argparse
    import numpy as np

Step 2. Import the model

  1. Import and tokenize the model.
    model_card = 'bert-base-cased'
    model = AutoModelForMaskedLM.from_pretrained(model_card)
    tokenizer = AutoTokenizer.from_pretrained(model_card)

  2. Define a sample input and extract the inputIds and attentionMask.
    sentence = "The dog [MASK] on the mat."
    encodings = tokenizer(sentence, max_length=128, truncation=True, padding="max_length", return_tensors='pt')
    inputIds = encodings["input_ids"]
    attentionMask = encodings["attention_mask"]

  3. Convert the model to ONNX, which can then be passed to the compiler.
    # Set dynamic dims and axes.
    dynamic_dims = {0: 'batch', 1 : 'sequence'}
    dynamic_axes = {
        "input_ids" : dynamic_dims,
        "attention_mask" : dynamic_dims,
        "logits" : dynamic_dims
    }
    input_names = ["input_ids", "attention_mask"]
    inputList = [inputIds, attentionMask]
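    # Note: gen_models_path and model_base_name are not defined in the snippets above.
    # The values below are illustrative and match the paths used in the compile step
    # later in this post; adjust them to your own directory layout.
    gen_models_path = "bert-base-cased/generatedModels"
    model_base_name = model_card  # "bert-base-cased"
    os.makedirs(gen_models_path, exist_ok=True)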
    
    torch.onnx.export(
        model,
        args=tuple(inputList),
        f=f"{gen_models_path}/{model_base_name}.onnx",
        verbose=False,
        input_names=input_names,
        output_names=["logits"],
        dynamic_axes=dynamic_axes,
        opset_version=11,
    )

  4. You’ll run the model in FP16 precision, so you need to check whether the model contains any constants beyond the FP16 range. Pass the model to the fix_onnx_fp16 function to generate a new ONNX file with the required fixes.
    from onnx import numpy_helper
            
    def fix_onnx_fp16(
        gen_models_path: str,
        model_base_name: str,
    ) -> str:
        finfo = np.finfo(np.float16)
        fp16_max = finfo.max
        fp16_min = finfo.min
        model = onnx.load(f"{gen_models_path}/{model_base_name}.onnx")
        fp16_fix = False
        for tensor in onnx.external_data_helper._get_all_tensors(model):
            nptensor = numpy_helper.to_array(tensor, gen_models_path)
            if nptensor.dtype == np.float32 and (
                np.any(nptensor > fp16_max) or np.any(nptensor < fp16_min)
            ):
                # print(f'tensor value : {nptensor} above {fp16_max} or below {fp16_min}')
                nptensor = np.clip(nptensor, fp16_min, fp16_max)
                new_tensor = numpy_helper.from_array(nptensor, tensor.name)
                tensor.CopyFrom(new_tensor)
                fp16_fix = True
                
        if fp16_fix:
            # Save FP16 model
            print("Found constants out of FP16 range, clipped to FP16 range")
            model_base_name += "_fix_outofrange_fp16"
            onnx.save(model, f=f"{gen_models_path}/{model_base_name}.onnx")
            print(f"Saving modified onnx file at {gen_models_path}/{model_base_name}.onnx")
        return model_base_name
    
    fp16_model_name = fix_onnx_fp16(gen_models_path=gen_models_path, model_base_name=model_base_name)

Step 3. Compile the model

The qaic-exec command line interface (CLI) compiler tool is used to compile the model. The input to this compiler is the ONNX file generated in step 2. The compiler produces a binary file (called QPC, for Qualcomm program container) in the path defined by the -aic-binary-dir argument.

In the compile command below, you use four AI compute cores and a batch size of one to compile the model.

/opt/qti-aic/exec/qaic-exec \
-m=bert-base-cased/generatedModels/bert-base-cased_fix_outofrange_fp16.onnx \
-aic-num-cores=4 \
-convert-to-fp16 \
-onnx-define-symbol=batch,1 -onnx-define-symbol=sequence,128 \
-aic-binary-dir=bert-base-cased/generatedModels/bert-base-cased_fix_outofrange_fp16_qpc \
-aic-hw -aic-hw-version=2.0 \
-compile-only

The QPC is generated in the bert-base-cased/generatedModels/bert-base-cased_fix_outofrange_fp16_qpc folder.

Step 4. Run the model

Set up a session to run inference on a Qualcomm Cloud AI100 accelerator in the DL2q instance.

The Qualcomm qaic Python library is a set of APIs that provides support for running inference on the Cloud AI100 accelerator.

  1. Use the Session API call to create an instance of session. The Session API call is the entry point to using the qaic Python library.
    qpcPath = 'bert-base-cased/generatedModels/bert-base-cased_fix_outofrange_fp16_qpc'
    
    bert_sess = qaic.Session(model_path=qpcPath + '/programqpc.bin', num_activations=1)
    bert_sess.setup() # Loads the network to the device. 
    
    # Here we are reading out all the input and output shapes/types
    input_shape, input_type = bert_sess.model_input_shape_dict['input_ids']
    attn_shape, attn_type = bert_sess.model_input_shape_dict['attention_mask']
    output_shape, output_type = bert_sess.model_output_shape_dict['logits']
    
    #create the input dictionary for given input sentence
    input_dict = {"input_ids": inputIds.numpy().astype(input_type), "attention_mask" : attentionMask.numpy().astype(attn_type)}
    
    #run inference on Cloud AI 100
    output = bert_sess.run(input_dict)

  2. Restructure the data from the output buffer using output_shape and output_type.
    token_logits = np.frombuffer(output['logits'], dtype=output_type).reshape(output_shape)

  3. Decode the output produced.
    # mask_token_index identifies the position of the [MASK] token in the input; it is
    # needed for decoding but was not computed earlier, so derive it from the tokenized input.
    mask_token_index = torch.where(inputIds[0] == tokenizer.mask_token_id)[0].item()
    mask_token_logits = torch.from_numpy(token_logits[0, mask_token_index, :]).unsqueeze(0)
    top_5_results = torch.topk(mask_token_logits, 5, dim=1)
    print("Model output (top5) from Qualcomm Cloud AI 100:")
    for i in range(5):
        idx = top_5_results.indices[0].tolist()[i]
        val = top_5_results.values[0].tolist()[i]
        word = tokenizer.decode([idx])
        print(f"{i+1} :(word={word}, index={idx}, logit={round(val,2)})")

Here are the outputs for the input sentence “The dog [MASK] on the mat.”

1 :(word=sat, index=2068, logit=11.46)
2 :(word=landed, index=4860, logit=11.11)
3 :(word=spat, index=15732, logit=10.95)
4 :(word=settled, index=3035, logit=10.84)
5 :(word=was, index=1108, logit=10.75)

That’s it. With just a few steps, you compiled and ran a PyTorch model on an Amazon EC2 DL2q instance. To learn more about onboarding and compiling models on the DL2q instance, see the Cloud AI100 Tutorial Documentation.

To learn more about which DL model architectures are a good fit for AWS DL2q instances and the current model support matrix, see the Qualcomm Cloud AI100 documentation.

Available now

You can launch DL2q instances today in the US West (Oregon) and Europe (Frankfurt) AWS Regions as On-Demand, Reserved, and Spot Instances, or as part of a Savings Plan. As usual with Amazon EC2, you pay only for what you use. For more information, see Amazon EC2 pricing.

DL2q instances can be deployed using AWS Deep Learning AMIs (DLAMI), and container images are available through managed services such as Amazon SageMaker, Amazon Elastic Kubernetes Service (Amazon EKS), Amazon Elastic Container Service (Amazon ECS), and AWS ParallelCluster.

To learn more, visit the Amazon EC2 DL2q instance page, and send feedback to AWS re:Post for EC2 or through your usual AWS Support contacts.


About the authors

A.K Roy is a Director of Product Management at Qualcomm for Cloud and Datacenter AI products and solutions. He has over 20 years of experience in product strategy and development, with a current focus on best-in-class performance and performance-per-dollar end-to-end solutions for AI inference in the cloud, across a broad range of use cases including generative AI, LLMs, automotive, and hybrid AI.

Jianying Lang is a Principal Solutions Architect at AWS Worldwide Specialist Organization (WWSO). She has over 15 years of working experience in the HPC and AI fields. At AWS, she focuses on helping customers deploy, optimize, and scale their AI/ML workloads on accelerated computing instances. She is passionate about combining techniques from the HPC and AI fields. Jianying holds a PhD in Computational Physics from the University of Colorado at Boulder.

Your guide to generative AI and ML at AWS re:Invent 2023

Yes, the AWS re:Invent season is upon us and as always, the place to be is Las Vegas! You marked your calendars, you booked your hotel, and you even purchased the airfare. Now all you need is some guidance on generative AI and machine learning (ML) sessions to attend at this twelfth edition of re:Invent. And although generative AI has appeared in previous events, this year we’re taking it to the next level. In addition to several exciting announcements during keynotes, most of the sessions in our track will feature generative AI in one form or another, so we can truly call our track “Generative AI and ML.” In this post, we give you a sense of how the track is organized and highlight a few sessions we think you’ll like. And although our track focuses on generative AI, many other tracks have related sessions. Use the “Generative AI” tag as you are browsing the session catalog to find them.

The technical sessions in our track are divided into five areas. First, we’ll have a few foundational sessions related to various aspects of Amazon Bedrock, a fully managed generative AI service we launched earlier this year. These will help you understand the building blocks of your generative AI applications. Second, we’ll have sessions covering the common generative AI use cases and applications. Here you’ll also have a chance to discover novel use cases and techniques. Third, a number of sessions will be of interest to ML practitioners who build, deploy, and operationalize both traditional and generative AI models. This year, learn about LLMOps, not just MLOps! Then, as we started doing last re:Invent, we’ll be offering several sessions on how to build AI responsibly. The greater the power of the latest transformer-based models, the greater the responsibility of all ML practitioners to do this right. Be sure to check out the session on the just-launched PartyRock, an educational tool that gives any builder low-friction access to learn through experimentation in a foundation model playground built on Amazon Bedrock. And last but not least (and always fun!) are the sessions dedicated to AWS DeepRacer!

Generative AI is at the heart of the AWS Village this year. Interact with several demos that feature new applications, including a competition that involves using generative AI tech to pilot a drone around an obstacle course. Talk with AWS experts in 14 different industries and explore industry-specific generative AI use cases, including demos from advertising and marketing, aerospace and satellite, manufacturing, and more. The Emerging Tech Zone within the Expo features innovative startups that were selected into the AWS Generative AI Accelerator and the NVIDIA Inception 100 programs.

If you’re new to re:Invent, you can attend sessions of the following types:

  • Keynotes – Join in person or virtually and learn about all the exciting announcements.
  • Innovation Talks – Learn about the latest cloud technology from AWS technology leaders and discover how these advancements can help you push your business forward. These sessions will be livestreamed, recorded, and published to YouTube.
  • Breakout sessions – These 60-minute sessions are expected to have broad appeal, are delivered to larger audiences, and will be recorded. If you miss them, you can watch them on demand after re:Invent.
  • Chalk talks – Enjoy 60 minutes of content delivered to smaller audiences with an interactive whiteboarding session. Chalk talks are where discussions happen, and these offer you the greatest opportunity to ask questions or share your opinion.
  • Workshops – In these hands-on learning opportunities, in the course of 2 hours, you’ll be able to build a solution to a problem, and understand the inner workings of the resulting infrastructure and cross-service interaction. Bring your laptop and be ready to learn!
  • Builders’ sessions – These highly interactive 60-minute mini-workshops are conducted in small groups of fewer than 10 attendees. Some of these appeal to beginners, and others are on specialized topics.
  • NEW! Code talks – In this new session type for re:Invent 2023, code talks are similar to our popular chalk talk format, but instead of focusing on an architecture solution with whiteboarding, the speakers lead an interactive discussion featuring live coding or code samples. These 60-minute sessions focus on the actual code that goes into building a solution. Attendees are encouraged to ask questions and follow along.

If you have reserved your seat at any of the sessions, great! If not, we always set aside some spots for walk-ins, so make a plan and come to the session early.

To help you plan your agenda for this year’s re:Invent, here are some highlights of the generative AI and ML track. So buckle up, and start registering for your favorite sessions.

Visit the session catalog to learn about all our generative AI and ML sessions.

Keynotes

Adam Selipsky, Chief Executive Officer, Amazon Web Services – Keynote

Tuesday November 28 | 8:30 AM – 10:30 AM (PST) | The Venetian

Join Adam Selipsky, CEO of Amazon Web Services, as he shares his perspective on cloud transformation. He highlights innovations in data, infrastructure, and artificial intelligence and machine learning that are helping AWS customers achieve their goals faster, mine untapped potential, and create a better future.

Swami Sivasubramanian, Vice President of AWS Data and Machine Learning – Keynote

Wednesday November 29 | 8:30 AM – 10:30 AM (PST) | The Venetian

A powerful relationship between humans, data, and AI is unfolding right before us. Generative AI is augmenting our productivity and creativity in new ways, while also being fueled by massive amounts of enterprise data and human intelligence. Join Swami Sivasubramanian, Vice President of Data and AI at AWS, to discover how you can use your company data to build differentiated generative AI applications and accelerate productivity for employees across your organization. Also hear from customer speakers with real-world examples of how they’ve used their data to support their generative AI use cases and create new experiences for their customers.

Innovation Talks

Dr. Bratin Saha, VP of AWS AI and ML Services | AIM245-INT | Innovate faster with generative AI

Wednesday November 29 | 1:00 PM – 2:00 PM (PST) | Venetian | Level 5 | Palazzo Ballroom B

With the emergence of generative AI, we are at a tipping point in the widespread adoption of machine learning. Join Dr. Bratin Saha, VP of AWS AI and ML Services, to hear how customers across industries are transforming their business with the latest breakthroughs in AI and ML, including generative AI. Discover the latest AWS innovations, hear from top customers, and explore where AI/ML is headed.

Francessca Vasquez, Vice President of Professional Services | ARC217-INT | From hype to impact: Building a generative AI architecture

Wednesday November 29 | 11:30 AM – 12:30 PM (PST) | Venetian | Level 5 | Palazzo Ballroom B

Generative AI represents a paradigm shift for how companies operate today. Generative AI is empowering developers to reimagine customer experiences and applications while transforming virtually every industry. Organizations are rapidly innovating to create the right architecture for scaling generative AI securely, economically, and responsibly to deliver business value. In this talk, learn how leaders are modernizing their data foundation, selecting industry-leading foundation models, and deploying purpose-built accelerators to unlock the possibilities of generative AI.

Shaown Nandi, AWS Director of Technology for Industries and Strategic Accounts | AIM248-INT | Unlocking the industry potential of generative AI

Wednesday November 29 | 4:00 PM – 5:00 PM (PST) | Venetian | Level 5 | Palazzo Ballroom B

Generative AI has captured the imagination of many industries and is poised to bring in the next wave of technological advancements. In this innovation talk, hear how the largest industries, from healthcare and financial services to automotive and media and entertainment, are using generative AI to drive outcomes for their customers. Join Shaown Nandi, AWS Director of Technology for Industries and Strategic Accounts, and industry leaders to hear how generative AI is accelerating content creation and helping organizations reimagine customer experiences.

Mai-Lan Tomsen Bukovec, Vice President, Technology | AIM250-INT | Putting your data to work with generative AI

Thursday November 30 | 12:30 PM – 1:30 PM (PST) | Venetian | Level 5 | Palazzo Ballroom B

How can you turn your data lake into a business advantage with generative AI? In this talk, explore strategies for putting your proprietary datasets to work when building unique, differentiated generative AI solutions. Learn how to utilize your datasets using Amazon SageMaker and Amazon Bedrock as well as popular frameworks like PyTorch with AWS compute, storage, and analytics. Hear best practices for using unstructured (video, image, PDF), semi-structured (Parquet), and table-formatted (Iceberg) data for training, fine-tuning, checkpointing, and prompt engineering. Also hear different architectural patterns that customers use today to harness their business data for customized generative AI solutions.

Breakout sessions

AIM218 (LVL 200) | Build your first generative AI application with Amazon Bedrock

Monday November 27 | 2:30 PM – 3:30 PM (PST)

We are truly at an exciting inflection point in the widespread adoption of ML with the growth of generative AI applications. In this session, learn how to build your first generative AI application with key services such as Amazon Bedrock. Get hints and tips for getting started fast, and see example reference architectures for common use cases built with AWS AI and ML such as self-service customer support, text analysis, report generation, post-call analysis, and forecasting trends.

Reserve your seat now!

AIM225 (LVL 200) | Drive personalized CX using generative AI and Amazon Personalize

Tuesday November 28 | 5:00 PM – 6:00 PM (PST)

Delivering the best experience is critical to capture and retain customers today. With generative AI, it is possible to hyper-personalize targeted recommendations for shopping and streaming. While standard taglines like “People who bought this also bought . . .” or “Because you watched . . .” entice some, they don’t fully address individual interests. Companies must find ways to dynamically generate compelling, highly customized content. Amazon Personalize delivers capabilities powered by ML and generative AI to help brands create meaningful experiences. Join this session to hear from powerhouse AWS media customer FOX and learn how hyper-personalized experiences can be used to build engagement and drive revenue.

Reserve your seat now!

AIM327 (LVL 300) | Scaling FM inference to hundreds of models with Amazon SageMaker

Wednesday November 29 | 4:30 PM – 5:30 PM (PST)

Companies need robust and cost-effective solutions to deploy foundation models at scale. Additionally, SaaS providers need scalable and cost-effective ways to serve hundreds of models to their customers. This session explores how to use Amazon SageMaker to roll out hundreds of FMs cost-effectively at scale. Get a detailed overview of deployment strategies to support large-scale generative AI inferencing for SaaS, and learn how to architect solutions that maximize scaling capabilities for performance and cost.

Reserve your seat now!

AIM333 (LVL 300) | Explore text-generation FMs for top use cases with Amazon Bedrock

Tuesday November 28 | 2:00 PM – 3:00 PM (PST)

Foundation models can be used for natural language processing tasks such as summarization, text generation, classification, open-ended Q&A, and information extraction. With Amazon Bedrock, you can choose powerful FMs from AI21 Labs, Anthropic, and Cohere to find the right FM for your use case such as the Jurassic-2, Claude, and Command families of text-generation FMs. Join this session to learn which FM is best suited for your use case.

Reserve your seat now!

AIM332 (LVL 300) | Explore image generation and search with FMs on Amazon Bedrock

Thursday November 30 | 11:00 AM – 12:00 PM (PST)

Foundation models understand multiple forms of input, such as images and texts. Join this session to learn how to build transformational experiences using images in Amazon Bedrock.

Reserve your seat now!

AIM377 (LVL 300) | Prompt engineering best practices for LLMs on Amazon Bedrock

Monday November 27 | 9:00 AM – 10:00 AM (PST)

Prompt engineering is the process of guiding large language models to produce desired outputs. In this session, get an overview of prompt engineering best practices and learn how to choose the most appropriate formats, phrases, words, and symbols to get the most out of generative AI solutions while improving accuracy and performance. This session uses the Claude 2 LLM as an example of how prompt engineering helps to solve complex customer use cases. Also learn how prompts can be integrated with your architecture and how to use API parameters for tuning the model parameters using Amazon Bedrock.

Reserve your seat now!

Chalk talks

AIM341 (LVL 300) | Deliver customized search capabilities using Amazon Bedrock

Wednesday November 29 | 5:30 PM – 6:30 PM (PST)

Vector embeddings are numerical representations of your text, image, audio, and video data that can be used to understand the relationship between sentences or words to find more relevant and contextual information in response to a user query. Embeddings can be stored in a database and are used to enable streamlined and more accurate searches. You can use an embeddings model in Amazon Bedrock to create vectors of your organization’s data, which can then be used to enable semantic search. Join this hands-on chalk talk to learn how.

Reserve your seat now!

AIM340-R (LVL 300) | Customize your FMs securely to deliver differentiated experiences

Wednesday November 29 | 6:00 PM – 7:00 PM (PST)

Foundation model customizations help you build differentiated generative AI applications using your own data. It’s easy to securely customize models in Amazon Bedrock. You can point Amazon Bedrock at a few labeled examples in Amazon S3, and the service can fine-tune the FM for a particular task without having to annotate large volumes of data; none of your data is used to train the original base FMs. Join this chalk talk for a deep dive on FM customizations through an interactive demo.

Reserve your seat now!

This session will be repeated Thursday, November 30 11:00 AM – 12:00 PM (PST), and Friday, December 1 8:30 AM – 9:30 AM PST.

AIM342 (LVL 300) | Advancing responsible AI: Assessing and mitigating risk

Wednesday November 29 | 4:30 PM – 5:30 PM (PST)

Risk assessment is an essential part of developing AI solutions responsibly, especially with emerging industry standards and laws regarding AI risk, such as ISO 42001 and the EU AI Act. This chalk talk provides an introduction to best practices for risk assessment related to fairness, robustness, explainability, privacy and security, transparency, and governance. Explore examples to estimate the severity and likelihood of potential events that could be harmful. Learn about Amazon SageMaker tooling for model governance, bias, explainability, and monitoring, and about transparency in the form of service cards as potential risk mitigation strategies.

Reserve your seat now!

AIM347-R (LVL 300) | Next-generation ML builder experience

Thursday November 30 | 4:00 PM – 5:00 PM (PST)

Amazon SageMaker offers different integrated development environments (IDEs) that are purpose-built for machine learning. In this chalk talk, learn how to select and use your preferred environment to perform end-to-end ML development steps, from preparing data to building, training, and deploying your ML models. Discover how you can quickly upload data, create new notebooks, train and tune models, move back and forth between steps to adjust experiments, collaborate seamlessly within your organization, and deploy models to production all in one place.

Reserve your seat now!

This session will be repeated Friday, December 1 9:00 AM – 10:00 AM (PST), and Friday, December 1 11:30 AM – 12:00 PM (PST).

AIM352-R (LVL 300) | Securely build generative AI apps and control data with Amazon Bedrock

Monday November 27 | 11:30 AM – 12:30 PM (PST)

Generative AI applications have captured widespread attention; however, they have also introduced new security challenges, especially around the handling of customer data. Organizations want to ensure that their data remains safe and secure while working with foundation models and don’t want to worry about their data being used to train an FM. Amazon Bedrock provides comprehensive data protection and privacy. In this chalk talk, explore architectures, data flows, and security-related aspects of model fine-tuning, as well as prompting and inference, while you learn about Amazon Bedrock’s security capabilities.

Reserve your seat now!

This session will be repeated Wednesday, November 29 6:00 PM – 7:00 PM (PST), and Thursday, November 30 4:00 PM – 5:00 PM (PST).

AIM404 (LVL 400) | Train and deploy FMs on Amazon EC2 and Amazon SageMaker, feat. Flip AI

Wednesday November 29 | 2:30 PM – 3:30 PM (PST)

Organizations that are running machine learning systems and generative AI applications on their local laptops/servers want to take advantage of the scalability and performance of the AWS Cloud. In this chalk talk, hear about compute and ML services from self-managed Amazon EC2 to fully managed Amazon SageMaker that you can use to build, train, and deploy foundation models. See a demo of how you can fine-tune a Stable Diffusion model on Amazon EC2 and then deploy it on SageMaker using the AWS Deep Learning AMIs (DLAMI) and AWS Deep Learning Containers. Also, hear how Flip AI built their own models using these AWS services.

Reserve your seat now!

Workshops

AIM302 (LVL 300) | Use generative AI to extract insights from contact center recordings

Monday November 27 | 8:30 AM – 10:30 AM (PST)

Learn how to derive insights from contact center recordings and other media using Amazon Transcribe and generative AI. In this workshop, learn how to combine automatic call recording, transcription, post-call analysis, sentiment analysis, issue detection, and call summarization from your own telephony recordings (Cisco, Genesys, Talkdesk, Avaya, and more) using AWS Contact Center Intelligence (CCI) solutions and generative AI. See demos on how to build analytics dashboards and integrations between LLMs and Amazon QuickSight to visualize your key metrics. You must bring your laptop to participate.

Reserve your seat now!

AIM307 (LVL 300) | Retrieval Augmented Generation with Amazon Bedrock

Wednesday November 29 | 8:30 AM – 10:30 AM (PST)

Large language models are often limited by the data they were trained on and don’t always provide up-to-date responses—or worse, they make things up. To overcome this limitation, you can supplement prompts with up-to-date information using embeddings stored in vector databases, a process known as Retrieval Augmented Generation (RAG). With supplemental information in the prompt providing more context, the LLM can respond more accurately and is less likely to hallucinate. In this workshop, learn how to use vector databases with Amazon Bedrock, a service that makes foundation models from Amazon and leading AI companies available via a single API. You must bring your laptop to participate.

Reserve your seat now!

AIM304 (LVL 300) | How to generate text responsibly using foundation models on AWS

Wednesday November 29 | 5:30 PM – 7:30 PM (PST)

Foundation models such as Claude are commonly used to create new pieces of original content, such as short stories, essays, social media posts, and webpage copy, and also to summarize text from articles, blog posts, books, and more. In this workshop, learn how you can generate text in minutes using foundation models available through Amazon Bedrock in a responsible way. You must bring your laptop to participate.

Reserve your seat now!

Code talks

AIM364-R (LVL 300) | Boost ML development with Amazon SageMaker Studio notebooks

Tuesday November 28 | 4:00 PM – 5:00 PM (PST)

Amazon SageMaker Studio notebooks are collaborative notebooks that you can launch quickly and that can help you integrate with purpose-built ML tools in SageMaker and other AWS services for complete ML development. In this code talk, learn how to prepare data at scale using built-in data preparation assistance, co-edit the same notebook in real time, and automate conversion of notebook code to production-ready jobs. This talk also introduces the new generative AI-powered features that can help you maximize productivity, write higher-quality code, and improve security.

Reserve your seat now!

This session will be repeated Wednesday, November 29 12:00 PM – 1:00 PM (PST).

Builders’ sessions

AIM219-R (LVL 200) | Learn and experiment with LLMs in Amazon SageMaker Studio Lab

Monday November 27 | 10:00 AM – 11:00 AM (PST)

Machine learning can sound complicated, but Amazon SageMaker Studio Lab makes it easier for anyone to get started at no cost. In this hands-on builders’ session, be guided through the basics of experimenting with large language models in Amazon SageMaker Studio Lab. No prior machine learning experience is required. You must bring your laptop to participate.

This session will be repeated Monday, November 27 4:00 PM – 5:00 PM (PST), Tuesday, November 28 3:30 PM – 4:30 PM (PST), Wednesday, November 29, 12:00 PM – 1:00 PM (PST), and Thursday, November 30 11:30 AM – 12:30 PM (PST).

Reserve your seat now!

AWS DeepRacer

Get ready to race with AWS DeepRacer at re:Invent 2023!

Developers, fasten your seatbelts—AWS DeepRacer is bringing ML to everyone at re:Invent! Whether you’re looking to get started with ML or improve your skills, AWS DeepRacer offers an exciting way to get hands-on with ML.

Watch the world’s top 72 racers of the AWS DeepRacer 2023 League battle it out Monday through Wednesday at our Championship Stadium at the Venetian Expo. It will all come down to the finale on Wednesday (November 29) at 2:30 PM (PST) as the eight finalists compete for the cup and $44,000 in prize money. You can also get behind the wheel yourself on November 30, when the track opens for the 2024 Open Racing. Post the fastest time and you’ll win a ticket back to Vegas for the 2024 Championship!

Dive into 10 not-to-miss workshops where you’ll learn to train reinforcement learning models, solve business problems with generative AI, and more. Want to learn tips and tricks from the best racers in the world? Be sure to check out our DPR301 workshop featuring five of our top AWS DeepRacer League Champions who will be sharing their approaches for training their AWS DeepRacer models and answering questions during an open Q&A.

Don’t forget to check out the rest of the AWS DeepRacer workshops before they fill up to reserve your spot! Whether you take a workshop, take a spin in our gamified virtual racing simulator, catch the global competition, or test your own ML model on the track, AWS DeepRacer brings the thrill of high-speed racing to hands-on machine learning at re:Invent. Let the countdown begin. We can’t wait to see you in Las Vegas!

See you at re:Invent!

Make sure to check out the re:Invent content catalog and the generative AI at re:Invent guide for more gen AI and ML content at re:Invent. We’ll see you there!


About the authors

Denis V. Batalov is a 17-year Amazon veteran with a PhD in Machine Learning. Denis worked on such exciting projects as Search Inside the Book, Amazon Mobile apps, and Kindle Direct Publishing. Since 2013 he has helped AWS customers adopt AI/ML technology as a Solutions Architect. Currently, Denis is a Worldwide Tech Leader for AI/ML, responsible for the functioning of AWS ML Specialist Solutions Architects globally. Denis is a frequent public speaker; you can follow him on Twitter @dbatalov.

Paxton Hall is a Marketing Program Manager for the AWS AI/ML Community on the AI/ML Education team at AWS. He has worked in retail and experiential marketing for the past 7 years, focused on developing communities and marketing campaigns. Out of the office, he’s passionate about public lands access and conservation, and enjoys backcountry skiing, climbing, biking, and hiking throughout Washington’s Cascade mountains.

Build a contextual chatbot for financial services using Amazon SageMaker JumpStart, Llama 2 and Amazon OpenSearch Serverless with Vector Engine

The financial service (FinServ) industry has unique generative AI requirements related to domain-specific data, data security, regulatory controls, and industry compliance standards. In addition, customers are looking for choices to select the most performant and cost-effective machine learning (ML) model and the ability to perform necessary customization (fine-tuning) to fit their business use cases. Amazon SageMaker JumpStart is ideally suited for generative AI use cases for FinServ customers because it provides the necessary data security controls and meets compliance standards requirements.

In this post, we demonstrate question answering tasks using a Retrieval Augmented Generation (RAG)-based approach with large language models (LLMs) in SageMaker JumpStart using a simple financial domain use case. RAG is a framework for improving the quality of text generation by combining an LLM with an information retrieval (IR) system. The LLM generates the text, and the IR system retrieves relevant information from a knowledge base. The retrieved information is then used to augment the LLM’s input, which can help improve the accuracy and relevance of the model-generated text. RAG has been shown to be effective for a variety of text generation tasks, such as question answering and summarization. It is a promising approach for improving the quality and accuracy of text generation models.

Advantages of using SageMaker JumpStart

With SageMaker JumpStart, ML practitioners can choose from a broad selection of state-of-the-art models for use cases such as content writing, image generation, code generation, question answering, copywriting, summarization, classification, information retrieval, and more. ML practitioners can deploy foundation models to dedicated Amazon SageMaker instances from a network isolated environment and customize models using SageMaker for model training and deployment.

SageMaker JumpStart is ideally suited for generative AI use cases for FinServ customers because it offers the following:

  • Customization capabilities – SageMaker JumpStart provides example notebooks and detailed posts for step-by-step guidance on domain adaptation of foundation models. You can follow these resources for fine-tuning, domain adaptation, and instruction tuning of foundation models, or to build RAG-based applications.
  • Data security – Ensuring the security of inference payload data is paramount. With SageMaker JumpStart, you can deploy models in network isolation with single-tenancy endpoint provision. Furthermore, you can manage access control to selected models through the private model hub capability, aligning with individual security requirements.
  • Regulatory controls and compliance – Compliance with standards such as HIPAA BAA, SOC 1/2/3, PCI, and HITRUST CSF is a core feature of SageMaker, ensuring alignment with the rigorous regulatory landscape of the financial sector.
  • Model choices – SageMaker JumpStart offers a selection of state-of-the-art ML models that consistently rank among the top in industry-recognized HELM benchmarks. These include, but are not limited to, Llama 2, Falcon 40B, AI21 J2 Ultra, AI21 Summarize, Hugging Face MiniLM, and BGE models.

In this post, we explore building a contextual chatbot for financial services organizations using a RAG architecture with the Llama 2 foundation model and the Hugging Face GPTJ-6B-FP16 embeddings model, both available in SageMaker JumpStart. We also use Vector Engine for Amazon OpenSearch Serverless (currently in preview) as the vector data store to store embeddings.

Limitations of large language models

LLMs have been trained on vast volumes of unstructured data and excel in general text generation. Through this training, LLMs acquire and store factual knowledge. However, off-the-shelf LLMs present limitations:

  • Their offline training renders them unaware of up-to-date information.
  • Their training on predominantly generalized data diminishes their efficacy in domain-specific tasks. For instance, a financial firm might prefer its Q&A bot to source answers from its latest internal documents, ensuring accuracy and compliance with its business rules.
  • Their reliance on embedded information compromises interpretability.

To use specific data in LLMs, three prevalent methods exist:

  • Embedding data within the model prompts, allowing it to utilize this context during output generation. This can be zero-shot (no examples), few-shot (limited examples), or many-shot (abundant examples). Such contextual prompting steers models towards more nuanced results.
  • Fine-tuning the model using pairs of prompts and completions.
  • RAG, which retrieves external data (non-parametric) and integrates this data into the prompts, enriching the context.

However, the first method grapples with model constraints on context size, making it tough to input lengthy documents and possibly increasing costs. The fine-tuning approach, while potent, is resource-intensive, particularly with ever-evolving external data, leading to delayed deployments and increased costs. RAG combined with LLMs offers a solution to the previously mentioned limitations.

Retrieval Augmented Generation

RAG retrieves external data (non-parametric) and integrates this data into ML prompts, enriching the context. Lewis et al. introduced RAG models in 2020, conceptualizing them as a fusion of a pre-trained sequence-to-sequence model (parametric memory) and a dense vector index of Wikipedia (non-parametric memory) accessed via a neural retriever.

Here’s how RAG operates:

  • Data sources – RAG can draw from varied data sources, including document repositories, databases, or APIs.
  • Data formatting – Both the user’s query and the documents are transformed into a format suitable for relevancy comparisons.
  • Embeddings – To facilitate this comparison, the query and the document collection (or knowledge library) are transformed into numerical embeddings using language models. These embeddings numerically encapsulate textual concepts.
  • Relevancy search – The user query’s embedding is compared to the document collection’s embeddings, identifying relevant text through a similarity search in the embedding space.
  • Context enrichment – The identified relevant text is appended to the user’s original prompt, thereby enhancing its context.
  • LLM processing – With the enriched context, the prompt is fed to the LLM, which, due to the inclusion of pertinent external data, produces relevant and precise outputs.
  • Asynchronous updates – To ensure the reference documents remain current, they can be updated asynchronously along with their embedding representations. This ensures that future model responses are grounded in the latest information, guaranteeing accuracy.

In essence, RAG offers a dynamic method to infuse LLMs with real-time, relevant information, ensuring the generation of precise and timely outputs.

The following diagram shows the conceptual flow of using RAG with LLMs.

Solution overview

The following steps are required to create a contextual question answering chatbot for a financial services application:

  1. Use the SageMaker JumpStart GPT-J-6B embedding model to generate embeddings for each PDF document in the Amazon Simple Storage Service (Amazon S3) upload directory.
  2. Identify relevant documents using the following steps:
    • Generate an embedding for the user’s query using the same model.
    • Use OpenSearch Serverless with the vector engine feature to search for the top K most relevant document indexes in the embedding space.
    • Retrieve the corresponding documents using the identified indexes.
  3. Combine the retrieved documents as context with the user’s prompt and question. Forward this to the SageMaker LLM for response generation.

We employ LangChain, a popular framework, to orchestrate this process. LangChain is specifically designed to bolster applications powered by LLMs, offering a universal interface for various LLMs. It streamlines the integration of multiple LLMs, ensuring seamless state persistence between calls. Moreover, it boosts developer efficiency with features like customizable prompt templates, comprehensive application-building agents, and specialized indexes for search and retrieval. For an in-depth understanding, refer to the LangChain documentation.

Prerequisites

You need the following prerequisites to build our context-aware chatbot:

For instructions on how to set up an OpenSearch Serverless vector engine, refer to Introducing the vector engine for Amazon OpenSearch Serverless, now in preview.

For a comprehensive walkthrough of the following solution, clone the GitHub repo and refer to the Jupyter notebook.

Deploy the ML models using SageMaker JumpStart

To deploy the ML models, complete the following steps:

  1. Deploy the Llama 2 LLM from SageMaker JumpStart:
    from sagemaker.jumpstart.model import JumpStartModel
    llm_model = JumpStartModel(model_id = "meta-textgeneration-llama-2-7b-f")
    llm_predictor = llm_model.deploy()
    llm_endpoint_name = llm_predictor.endpoint_name

  2. Deploy the GPT-J embeddings model:
    embeddings_model = JumpStartModel(model_id = "huggingface-textembedding-gpt-j-6b-fp16")
    embed_predictor = embeddings_model.deploy()
    embeddings_endpoint_name = embed_predictor.endpoint_name
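
    # Optional sanity check (an assumption, not part of the original walkthrough): the
    # JumpStart GPT-J embedding endpoint accepts a {"text_inputs": ...} payload and
    # returns an "embedding" field, matching the ContentHandler defined later in this post.
    test_response = embed_predictor.predict({"text_inputs": "hello world"})
    print(test_response.keys())  # expect an "embedding" field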
    

Chunk data and create a document embeddings object

In this section, you chunk the data into smaller documents. Chunking is a technique for splitting large texts into smaller chunks. It’s an essential step because it optimizes the relevance of the search query for our RAG model, which in turn improves the quality of the chatbot. The chunk size depends on factors such as the document type and the model used. A chunk size of 1,600 characters (chunk_size=1600) has been selected because this is approximately the size of a paragraph. As models improve, their context window size will increase, allowing for larger chunk sizes.
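
As a minimal illustration of this step, the following sketch loads PDF files from a local directory and splits them into chunks with LangChain. The directory path, loader, splitter, and chunk_overlap value are assumptions for illustration; the notebook in the repo may use a different approach.

# Minimal chunking sketch (assumed layout: PDFs downloaded to ./data; requires the pypdf package).
from langchain.document_loaders import PyPDFDirectoryLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

loader = PyPDFDirectoryLoader("./data")
documents = loader.load()

# Split into roughly paragraph-sized chunks; chunk_overlap is an illustrative value.
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1600, chunk_overlap=200)
docs = text_splitter.split_documents(documents)
print(f"Split {len(documents)} pages into {len(docs)} chunks")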

Refer to the Jupyter notebook in the GitHub repo for the complete solution.

  1. Extend the LangChain SagemakerEndpointEmbeddings class to create a custom embeddings function that uses the gpt-j-6b-fp16 SageMaker endpoint you created earlier when deploying the embeddings model:
    from langchain.embeddings import SagemakerEndpointEmbeddings
    from langchain.embeddings.sagemaker_endpoint import EmbeddingsContentHandler
    import json
    import logging
    import time
    from typing import List
    
    logger = logging.getLogger(__name__)
    
    # extend the SagemakerEndpointEmbeddings class from langchain to provide a custom embedding function
    class SagemakerEndpointEmbeddingsJumpStart(SagemakerEndpointEmbeddings):
        def embed_documents(
            self, texts: List[str], chunk_size: int = 1
        ) -> List[List[float]]:
            """Compute doc embeddings using a SageMaker Inference Endpoint.
     
            Args:
                texts: The list of texts to embed.
                chunk_size: The chunk size defines how many input texts will
                    be grouped together as request. If None, will use the
                    chunk size specified by the class.
    
            Returns:
                List of embeddings, one for each text.
            """
            results = []
            _chunk_size = len(texts) if chunk_size > len(texts) else chunk_size
            st = time.time()
            for i in range(0, len(texts), _chunk_size):
                response = self._embedding_func(texts[i : i + _chunk_size])
                results.extend(response)
            time_taken = time.time() - st
            logger.info(
                f"got results for {len(texts)} in {time_taken}s, length of embeddings list is {len(results)}"
            )
            print(
                f"got results for {len(texts)} in {time_taken}s, length of embeddings list is {len(results)}"
            )
            return results
    
    # class for serializing/deserializing requests/responses to/from the embeddings model
    class ContentHandler(EmbeddingsContentHandler):
        content_type = "application/json"
        accepts = "application/json"
     
        def transform_input(self, prompt: str, model_kwargs={}) -> bytes:
     
            input_str = json.dumps({"text_inputs": prompt, **model_kwargs})
            return input_str.encode("utf-8")
     
        def transform_output(self, output: bytes) -> str:
     
            response_json = json.loads(output.read().decode("utf-8"))
            embeddings = response_json["embedding"]
            if len(embeddings) == 1:
                return [embeddings[0]]
            return embeddings
    
    def create_sagemaker_embeddings_from_js_model(
        embeddings_endpoint_name: str, aws_region: str
    ) -> SagemakerEndpointEmbeddingsJumpStart:
     
        content_handler = ContentHandler()
        embeddings = SagemakerEndpointEmbeddingsJumpStart(
            endpoint_name=embeddings_endpoint_name,
            region_name=aws_region,
            content_handler=content_handler,
        )
        return embeddings
    
    

  2. Create the embeddings object and batch the creation of the document embeddings:
    embeddings = create_sagemaker_embeddings_from_js_model(embeddings_endpoint_name, aws_region)

  3. Store the document embeddings in OpenSearch Serverless using the LangChain OpenSearchVectorSearch integration. You’re now ready to iterate over the chunked documents, create the embeddings, and store them in the OpenSearch Serverless vector index created in your vector search collection. See the following code:
    docsearch = OpenSearchVectorSearch.from_texts(
        texts=[d.page_content for d in docs],
        embedding=embeddings,
        opensearch_url=[{'host': _aoss_host, 'port': 443}],
        http_auth=awsauth,
        timeout=300,
        use_ssl=True,
        verify_certs=True,
        connection_class=RequestsHttpConnection,
        index_name=_aos_index,
    )

Question and answering over documents

So far, you have chunked a large document into smaller ones, created vector embeddings, and stored them in a vector engine. Now you can answer questions regarding this document data. Because you created an index over the data, you can do a semantic search; this way, only the most relevant documents required to answer the question are passed via the prompt to the LLM. This allows you to save time and money by only passing relevant documents to the LLM. For more details on using document chains, refer to Documents.

Complete the following steps to answer questions using the documents:

  1. To use the SageMaker LLM endpoint with LangChain, you use langchain.llms.sagemaker_endpoint.SagemakerEndpoint, which abstracts the endpoint behind a common LLM interface. You perform a transformation of the request and response payloads, as shown in the following code for the LangChain SageMaker integration. Note that you may need to adjust the code in ContentHandler based on the content_type and accepts format of the LLM model you choose to use.
    from langchain.llms.sagemaker_endpoint import LLMContentHandler, SagemakerEndpoint
    
    class ContentHandler(LLMContentHandler):
        content_type = "application/json"
        accepts = "application/json"
    
        def transform_input(self, prompt: str, model_kwargs: dict) -> bytes:
            payload = {
                "inputs": [
                    [
                        {
                            "role": "system",
                            "content": prompt,
                        },
                        {"role": "user", "content": prompt},
                    ],
                ],
                "parameters": {
                    "max_new_tokens": 1000,
                    "top_p": 0.9,
                    "temperature": 0.6,
                },
            }
            input_str = json.dumps(payload)
            return input_str.encode("utf-8")
    
        def transform_output(self, output: bytes) -> str:
            response_json = json.loads(output.read().decode("utf-8"))
            content = response_json[0]["generation"]["content"]
            return content
    
    content_handler = ContentHandler()
    
    sm_jumpstart_llm = SagemakerEndpoint(
        endpoint_name=llm_endpoint_name,
        region_name=aws_region,
        model_kwargs={"max_new_tokens": 300},
        endpoint_kwargs={"CustomAttributes": "accept_eula=true"},
        content_handler=content_handler,
    )
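
    # Optional sanity check (not in the original walkthrough): LangChain LLM objects are
    # callable, so this sends a simple test prompt to the deployed Llama 2 endpoint.
    print(sm_jumpstart_llm("Briefly explain what a balance sheet is."))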

Now you’re ready to interact with the financial document.

  2. Use the following query and prompt template to ask questions regarding the document:
    from langchain import PromptTemplate, SagemakerEndpoint
    from langchain.llms.sagemaker_endpoint import LLMContentHandler
    from langchain.chains.question_answering import load_qa_chain
    
    query = "Summarize the earnings report and also what year is the report for"
    prompt_template = """Only use context to answer the question at the end.
     
    {context}
     
    Question: {question}
    Answer:"""
    
    prompt = PromptTemplate(
        template=prompt_template, input_variables=["context", "question"]
    )
     
     
    class ContentHandler(LLMContentHandler):
        content_type = "application/json"
        accepts = "application/json"
    
        def transform_input(self, prompt: str, model_kwargs: dict) -> bytes:
            payload = {
                "inputs": [
                    [
                        {
                            "role": "system",
                            "content": prompt,
                        },
                        {"role": "user", "content": prompt},
                    ],
                ],
                "parameters": {
                    "max_new_tokens": 1000,
                    "top_p": 0.9,
                    "temperature": 0.6,
                },
            }
            input_str = json.dumps(
                payload,
            )
            return input_str.encode("utf-8")
     
        def transform_output(self, output: bytes) -> str:
            response_json = json.loads(output.read().decode("utf-8"))
            content = response_json[0]["generation"]["content"]
            return content
    
    content_handler = ContentHandler()
     
    chain = load_qa_chain(
        llm=SagemakerEndpoint(
            endpoint_name=llm_endpoint_name,
            region_name=aws_region,
            model_kwargs={"max_new_tokens": 300},
            endpoint_kwargs={"CustomAttributes": "accept_eula=true"},
            content_handler=content_handler,
        ),
        prompt=prompt,
    )
    sim_docs = docsearch.similarity_search(query, include_metadata=False)
    chain({"input_documents": sim_docs, "question": query}, return_only_outputs=True)
    

Cleanup

To avoid incurring future costs, delete the SageMaker inference endpoints that you created in this notebook. You can do so by running the following in your SageMaker Studio notebook:

# Delete LLM
llm_predictor.delete_model()
llm_predictor.delete_predictor(delete_endpoint_config=True)

# Delete Embeddings Model
embed_predictor.delete_model()
embed_predictor.delete_predictor(delete_endpoint_config=True)

If you created an OpenSearch Serverless collection for this example and no longer require it, you can delete it via the OpenSearch Serverless console.
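
Alternatively, you can delete the collection programmatically. The following sketch uses the OpenSearch Serverless API through boto3; the collection name is a placeholder for the name you chose when creating the collection.

import boto3

aoss = boto3.client("opensearchserverless")

# Placeholder name; replace with the collection you created for this example.
details = aoss.batch_get_collection(names=["<your-collection-name>"])["collectionDetails"]
if details:
    aoss.delete_collection(id=details[0]["id"])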

Conclusion

In this post, we discussed using RAG as an approach to provide domain-specific context to LLMs. We showed how to use SageMaker JumpStart to build a RAG-based contextual chatbot for a financial services organization using Llama 2 and OpenSearch Serverless with a vector engine as the vector data store. This method refines text generation using Llama 2 by dynamically sourcing relevant context. We’re excited to see you bring your custom data and innovate with this RAG-based strategy on SageMaker JumpStart!


About the authors

Sunil Padmanabhan is a Startup Solutions Architect at AWS. As a former startup founder and CTO, he is passionate about machine learning and focuses on helping startups leverage AI/ML for their business outcomes and design and deploy ML/AI solutions at scale.

Suleman Patel is a Senior Solutions Architect at Amazon Web Services (AWS), with a special focus on Machine Learning and Modernization. Leveraging his expertise in both business and technology, Suleman helps customers design and build solutions that tackle real-world business problems. When he’s not immersed in his work, Suleman loves exploring the outdoors, taking road trips, and cooking up delicious dishes in the kitchen.

Read More

Build well-architected IDP solutions with a custom lens – Part 1: Operational excellence

Build well-architected IDP solutions with a custom lens – Part 1: Operational excellence

The IDP Well-Architected Lens is intended for all AWS customers who use AWS to run intelligent document processing (IDP) solutions and are searching for guidance on how to build secure, efficient, and reliable IDP solutions on AWS.

Building a production-ready solution in the cloud involves a series of trade-offs between resources, time, customer expectation, and business outcome. The AWS Well-Architected Framework helps you understand the benefits and risks of decisions you make while building workloads on AWS. By using the Framework, you will learn operational and architectural best practices for designing and operating reliable, secure, efficient, cost-effective, and sustainable workloads in the cloud.

An IDP pipeline usually combines optical character recognition (OCR) and natural language processing (NLP) to read and understand a document and extract specific terms or words. The IDP Well-Architected Custom Lens outlines the steps for an AWS Well-Architected review, which allows you to evaluate and identify technical risks within your IDP workloads. This custom lens integrates best practices and guidance to effectively navigate and overcome common challenges in the management of IDP workloads.

This post focuses on the Operational Excellence pillar of the IDP solution. Operational excellence in IDP means applying the principles of robust software development and maintaining a high-quality customer experience to the field of document processing, while consistently meeting or surpassing service level agreements (SLAs). It involves organizing teams effectively, designing IDP systems to handle workloads efficiently, operating these systems at scale, and continuously evolving them to meet customer needs.

In this post, we start with the introduction of the Operational Excellence pillar and design principles, and then deep dive into four focus areas: organizational culture, workload design, build and release optimization, and observability. By reading this post, you will learn about the Operational Excellence pillar in the Well-Architected Framework with the IDP case study.

Design principles

For IDP workloads, operational excellence translates to the following:

  • High accuracy and low error rates in document data extraction – Precision in extracting data from documents is paramount, which minimizes errors and ensures that the information used for decision-making is trustworthy
  • Fast processing of high document volumes with low latency – Efficiency in handling large volumes of documents swiftly allows organizations to keep pace with business demands, reducing bottlenecks
  • Continuous monitoring for swift diagnosis and resolution of issues – Proactive monitoring and maintenance help in quickly identifying and resolving any interruptions in the document processing pipeline, maintaining a smooth operational flow
  • Rapid iteration to improve models and workflows – Implementing a feedback loop that facilitates constant refinement of algorithms and processes ensures the system evolves to meet emerging challenges and efficiency standards
  • Cost optimization to ensure resources align with workload demands – Strategic resource management ensures that financial investment into IDP systems yields maximum value, adjusting resources dynamically in line with fluctuating document processing demands
  • Adherence to SLAs – Meeting or exceeding the standards and turnaround times promised to customers is crucial for maintaining trust and satisfaction

Effective design strategies must be aligned with these objectives, ensuring that the IDP systems are not only technically capable but also optimized for real-world challenges. This elevates operational excellence from a backend goal to a strategic asset, one that is integral to the success of the entire enterprise. Based on the design principles of the Operational Excellence pillar, we propose the following design principles for this custom lens.

  • Align IDP SLAs with overall document workflow objectives – IDP typically functions as an integral component of the broader document workflow managed by business teams. Therefore, it is essential that the SLAs for IDP are carefully crafted as subsets of the overall document workflow SLAs. This approach ensures that the IDP’s performance expectations are in harmony with the larger workflow objectives, providing a clear and consistent standard for processing speed, accuracy, and reliability. By doing so, businesses can create a cohesive and efficient document management system that aligns with the overarching business goals and stakeholder expectations, fostering trust and dependability in the system’s capabilities.
  • Codify operations for efficiency and reproducibility – By performing operations as code and incorporating automated deployment methodologies, organizations can achieve scalable, repeatable, and consistent processes. This not only minimizes the potential for human error but also paves the way for seamless integration of new data sources and processing techniques.
  • Proactively anticipate and plan for system failures – Because IDP systems process a vast array of documents with varied complexities, potential issues can emerge at any stage of the document processing pipeline. You should conduct “pre-mortem” exercises to pre-emptively identify potential sources of failure so that they can be removed or mitigated. Regularly simulate failure scenarios and validate your understanding of their impact. Test your response procedures to ensure they are effective and that teams are familiar with their process. Set up regular game days to test workload and team responses to simulated events.
  • Iterate frequently with feedback mechanisms – As your document processing workload evolves, ensure your operational strategies adapt in sync and look for opportunities to improve them:
    • Make frequent, small, reversible changes – Design workloads to allow components to be updated regularly to increase the flow of beneficial changes into your workload. Make changes in small increments that can be reversed if they fail to aid in the identification and resolution of issues introduced to your environment.
    • Learn from all operational failures – Drive improvement through lessons learned from all operational events and failures. Share what is learned across teams and through the entire organization.
  • Monitor operational health – Ensure a shift from mere monitoring to advanced observability within your IDP framework. This entails a comprehensive understanding of the system’s health. By effectively collecting and correlating telemetry data, you can glean actionable insights, facilitating pre-emptive detection and mitigation of issues.
  • Pursue metrics-driven quality and continuous improvement – In IDP, what gets measured gets improved. Define and track key metrics related to document accuracy, processing times, and model efficacy. It is crucial to pursue a metrics-driven strategy that emphasizes the quality of data extraction at the field level, particularly for high-impact fields. Harness a flywheel approach, wherein continuous data feedback is utilized to routinely orchestrate and evaluate enhancements to your models and processes.
  • Integrate human oversight for process effectiveness – Although automation and ML algorithms significantly advance the efficiency of IDP, there are scenarios where human reviewers can augment and enhance the outcomes, especially in situations with regulatory demands or when encountering low-quality scans. Human oversight based on confidence score thresholds can be a valuable addition.

Focus areas

The design principles and best practices for the Operational Excellence pillar come from what we have learned from our customers and our IDP experts. Use these as a guide when making design choices, making sure they fit well with what your business needs from the IDP solution. Applying the IDP Well-Architected Lens also helps you validate that these choices are aimed at achieving operational excellence, ensuring they meet your specific operational goals.

The following are the key focus areas for operational excellence of an IDP solution in the cloud:

  • Organizational culture – Organizational culture is pivotal in shaping how IDP projects are implemented and managed. This culture is sustained by clear SLAs that set definitive expectations for processing times and accuracy, ensuring all team members are oriented towards common goals. This is complemented by a centralized function that acts as the hub for operational excellence, consolidating best practices and steering IDP projects towards success.
  • Workload design – This involves creating a system capable of flexibly handling varying demands, optimizing for quality and accuracy in document processing, and efficiently integrating with external systems.
  • Build and release optimization – This area emphasizes the implementation of standardized DevSecOps processes. The goal is to streamline the development lifecycle and use automation to ensure smooth and rapid deployment of updates or new features. This approach aims to enhance the efficiency, security, and reliability of the IDP system development and deployment.
  • Observability – In IDP, observability is focused on comprehensive monitoring, alerting, and logging capabilities, along with managing service quotas. This involves keeping a vigilant eye on the system’s performance, setting up effective alert mechanisms for potential issues, maintaining detailed logs for analysis, and ensuring the system operates within its resource allocations.

Organizational culture

To achieve operational excellence in IDP, organizations must embed certain best practices into their culture and daily operations. The following are a few critical areas that can guide organizations in optimizing their IDP workflows:

  • Culture and operating model – Cultivate a culture that champions the strategic design, deployment, and management of IDP workloads. This should be a cultural norm, integrated into the operating model to support agility and responsiveness in document processing.
  • Business and SLA alignment – Align IDP initiatives with business objectives and SLAs. This practice ensures that document processing supports the overall business strategy and meets the performance metrics valued by stakeholders.
  • Continuous AWS training – Commit to regular training and upskilling in AWS services to enhance IDP capabilities. A well-trained team can use AWS’s evolving features for improved document processing efficiency and innovation.
  • Change management – Establish robust change management processes to navigate the IDP landscape’s dynamic nature. Effective change management supports smooth transitions and helps maintain uninterrupted IDP operations during upgrades or shifts in strategy.
  • Defined metrics for IDP success – Establish and monitor clear metrics to measure the success and impact of the IDP operations. For example, with Amazon CloudWatch you can monitor the number of documents processed through Amazon Textract (see the sketch after this list). Similarly, monitoring the volume and size of documents being uploaded into Amazon Simple Storage Service (Amazon S3) can give insights into the rate at which processing demand is increasing. Furthermore, with AWS Step Functions, you can use the built-in metrics to track the processing job success rate, offering insights into the effectiveness of the workflow orchestration.
  • Iterative improvements – Encourage a culture of feedback and iterative development to refine IDP processes. By regularly analyzing performance data and user feedback, the organization can make informed, incremental improvements to the IDP system.
  • Feedback loop from human review – Integrate a feedback loop from human review into the IDP system. This provides valuable insights that you can use to continuously improve the accuracy and effectiveness of the automated processes.
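
As an illustration of the metrics-driven approach described in the preceding list, the following sketch pulls the daily count of successful Amazon Textract requests from CloudWatch with Boto3; the operation dimension value and time window are assumptions you would adjust to your workload:

import boto3
from datetime import datetime, timedelta

cloudwatch = boto3.client("cloudwatch")

# Sum of successful AnalyzeDocument calls over the last 7 days, one data point per day
response = cloudwatch.get_metric_statistics(
    Namespace="AWS/Textract",
    MetricName="SuccessfulRequestCount",
    Dimensions=[{"Name": "Operation", "Value": "AnalyzeDocument"}],
    StartTime=datetime.utcnow() - timedelta(days=7),
    EndTime=datetime.utcnow(),
    Period=86400,
    Statistics=["Sum"],
)

for point in sorted(response["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"].date(), int(point["Sum"]))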

Workload design

An effective workload design is essential for successful management of intelligent document processing systems. This design must be adaptable enough to handle varying demands, maintain high quality and accuracy, and integrate seamlessly with other systems. The following best practices can help achieve these goals:

  • Utilizing IDP workflow stages – When designing an architecture for IDP, it is important to consider the typical stages of an IDP workflow, which may vary based on specific use cases and business needs. Common stages include data capture, document classification, document text extraction, content enrichment, document review and validation, and data consumption. By clearly defining and separating these stages in your architecture, you create a more resilient system. This approach helps in isolating different components in the event of a failure, leading to smoother operations and easier maintenance.
  • Flexible demand handling – Create a document processing system that can easily adapt to changes in demand. This ensures that as business needs shift, the system can scale up or down accordingly and continue to operate smoothly.
    • For example, when interfacing with Amazon Textract, ensure you manage throttling and dropped connections by setting the config parameter when creating the Amazon Textract client. It is recommended to set a retry count of 5, because the AWS SDK retries an operation this specified number of times before considering it a failure. Incorporating this mechanism can handle throttling more effectively by using the SDK’s built-in exponential backoff strategy.
    • AWS might periodically update the service limits based on various factors. Stay updated with the latest documentation and adjust your throttling management strategies accordingly. For example, you can use the Amazon Textract Service Quotas Calculator to estimate the quota values that will satisfy your use case. If your application consistently runs into throttling limits, consider requesting AWS to increase your service quotas for Amazon Textract and Amazon Comprehend.
  • Quality and accuracy optimization – Maximize the precision of data extraction with Amazon Textract by preparing documents in a format conducive to high accuracy, as outlined in the Amazon Textract best practices. Take advantage of Textract’s Layout feature, which is pre-trained on a diverse array of documents from various industries, including financial services and insurance. This feature simplifies data extraction by reducing the need for complex post-processing code, ultimately improving both the quality and efficiency of document processing operations.
  • Seamless external integrations – Ensure that your IDP system can integrate efficiently with external services and systems. This provides a cohesive workflow and allows for broader functionality within the document processing pipeline. For example, review the existing architecture for modularity, identify components that handle external system integrations, and break down integration logic into smaller, granular functions using AWS Lambda for flexibility and scalability. Continuously seek feedback from developers and integration partners to refine and optimize the architecture. Employ strategies for decoupled operations, such as event-driven processing, where services like Amazon EventBridge can be used to capture and route events from external systems (see the sketch after this list).
  • Transparent and adaptable processing – Set up clear, traceable paths for each piece of data from its origin to extraction, which builds trust in the system. Keep documentation of processing rules thorough and up to date, fostering a transparent environment for all stakeholders.
  • Enhance IDP with Amazon Comprehend Flywheel and Amazon Textract Custom Queries
    • Leverage the Amazon Comprehend flywheel for a streamlined ML process, from data ingestion to deployment. By centralizing datasets within the flywheel’s dedicated Amazon S3 data lake, you ensure efficient data management. Regular flywheel iterations guarantee models are trained with the latest data and evaluated for optimal performance. Always promote the highest-performing models to active status, and deploy endpoints synchronized with the active model, reducing manual interventions. This systematic approach, grounded in MLOps principles, drives operational excellence and assures superior model quality.
    • Additionally, with the recent introduction of the Amazon Textract Custom Queries feature, you can refine the extraction process to meet unique business requirements by using natural language questions, thereby improving accuracy for specific document types. Custom Queries simplifies the adaptation of the Amazon Textract Queries feature, eliminating the need for deep ML expertise and facilitating a more intuitive way to extract valuable information from documents.
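
To illustrate the event-driven integration pattern mentioned in the preceding list, the following is a minimal sketch that publishes a document-processing event to the default EventBridge event bus; the source, detail type, and detail fields are hypothetical names you would replace with your own conventions:

import json
import boto3

events = boto3.client("events")

# Publish an event that downstream consumers (Lambda, Step Functions, external systems) can subscribe to
response = events.put_events(
    Entries=[
        {
            "Source": "idp.pipeline",                     # hypothetical source name
            "DetailType": "DocumentExtractionCompleted",  # hypothetical detail type
            "Detail": json.dumps(
                {"documentId": "doc-12345", "bucket": "my-idp-bucket", "status": "EXTRACTED"}
            ),
            "EventBusName": "default",
        }
    ]
)
print(response["FailedEntryCount"])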

Build and release optimization

Streamlining the build and release processes is vital for the agility and security of IDP solutions. The following are key practices in build and release optimization, focusing on automation, continuous integration and continuous delivery (CI/CD), and security:

  • Automated deployment – Design your IDP solution using infrastructure-as-code (IaC) principles for consistent and repeatable deployments; the serverless infrastructure can be deployed with the AWS Cloud Development Kit (AWS CDK) and orchestrated with a low-code visual workflow service like AWS Step Functions (see the sketch after this list).
  • CI/CD pipelines – Leverage tools like AWS CodePipeline, AWS CodeBuild, and AWS CodeDeploy for the automation of build, test, and release phases of IDP components and models. Set up automated rollbacks to mitigate deployment risks, and integrate change tracking and governance for thorough validation before production deployment.
  • Security with AWS KMS – Operational excellence isn’t solely about efficiency; security plays an integral role as well. Specifically, for Amazon Comprehend endpoints where customer-managed keys encrypt underlying models, maintaining integrity through AWS Key Management Service (AWS KMS) key permissions becomes vital. Utilize AWS Trusted Advisor to check endpoint access risks and manage KMS key permissions.
  • Seamless integration with diverse external systems – Tailor build and release pipelines to emphasize seamless integration with diverse external systems. Use AWS services and best practices to design document processing workflows to easily interface and adapt to various external requirements. This ensures consistency and agility in deployments, prioritizing operational excellence even in complex integration scenarios.
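
The following is a minimal AWS CDK (Python) sketch of the IaC approach described in the preceding list, assuming aws-cdk-lib v2; the Pass states are placeholders standing in for your actual classification, extraction, and enrichment tasks, and the stack and construct names are assumptions:

import aws_cdk as cdk
from aws_cdk import aws_stepfunctions as sfn


class IdpPipelineStack(cdk.Stack):
    def __init__(self, scope, construct_id, **kwargs):
        super().__init__(scope, construct_id, **kwargs)

        # Placeholder states standing in for the IDP workflow stages
        definition = (
            sfn.Pass(self, "ClassifyDocument")
            .next(sfn.Pass(self, "ExtractText"))
            .next(sfn.Pass(self, "EnrichContent"))
        )

        sfn.StateMachine(
            self,
            "IdpWorkflow",
            definition_body=sfn.DefinitionBody.from_chainable(definition),
        )


app = cdk.App()
IdpPipelineStack(app, "IdpPipelineStack")
app.synth()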

Observability

Achieving operational excellence in IDP necessitates an integrated approach where monitoring and observability play pivotal roles. Below are the key practices to ensure clarity, insight, and continuous improvement within an AWS environment:

  • Comprehensive observability – Implement a thorough monitoring and observability solution with tools like Amazon CloudWatch Logs for services such as Amazon Textract and Amazon Comprehend. This approach provides clear operational insights for all stakeholders, fostering efficient operation, responsive event handling, and a cycle of continuous improvement.
  • Amazon Comprehend Endpoint monitoring and auto scaling – Employ Trusted Advisor for diligent monitoring of Amazon Comprehend endpoints to optimize resource utilization. Adjust throughput configurations or use AWS Application Auto Scaling to align resources with demand, enhancing efficiency and cost-effectiveness.
  • Amazon Textract monitoring strategy – For operational excellence in utilizing Amazon Textract, adopt a holistic approach:
    • Utilize CloudWatch to diligently monitor Amazon Textract operations, drawing insights from key metrics like SuccessfulRequestCount, ThrottledCount, ResponseTime, ServerErrorCount, and UserErrorCount.
    • Set precise alarms based on these metrics, and integrate them with Amazon Simple Notification Service (Amazon SNS) for real-time anomaly detection (see the sketch after this list).
    • Act swiftly on these notifications, ensuring prompt issue rectification and consistent document processing efficiency. This strategy combines meticulous monitoring with proactive intervention, setting the gold standard for operational excellence.
  • Logging API calls with AWS CloudTrail – With AWS CloudTrail, you can gain visibility into API call history and user activity, which is crucial for operational monitoring and swift incident response. Amazon Textract and Amazon Comprehend are integrated with AWS CloudTrail.
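
As a concrete example of the Amazon Textract monitoring strategy above, the following sketch creates a CloudWatch alarm on Textract throttling that notifies an SNS topic; the alarm name, threshold, and topic ARN are assumptions to adapt to your workload:

import boto3

cloudwatch = boto3.client("cloudwatch")

# Alert when any Textract calls are throttled within a 5-minute window
cloudwatch.put_metric_alarm(
    AlarmName="idp-textract-throttling",          # hypothetical alarm name
    Namespace="AWS/Textract",
    MetricName="ThrottledCount",
    Statistic="Sum",
    Period=300,
    EvaluationPeriods=1,
    Threshold=0,
    ComparisonOperator="GreaterThanThreshold",
    TreatMissingData="notBreaching",
    AlarmActions=["arn:aws:sns:us-east-1:111122223333:idp-alerts"],  # placeholder topic ARN
)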

Conclusion

In this post, we shared design principles, focus areas, and best practices for achieving operational excellence in your IDP solution. By adopting the Well-Architected Framework principles covered in this post, you can optimize your IDP workloads for operational excellence. Focusing on key areas like IaC, instrumentation, observability, and continuous improvement will help ensure your IDP systems deliver business value at scale in a secure and compliant manner.

To learn more about the IDP Well-Architected Custom Lens, explore the following posts in this series:

AWS is committed to the IDP Well-Architected Lens as a living tool. As the IDP solutions and related AWS AI services evolve and new AWS services become available, we will update the IDP Well-Architected Lens accordingly.

If you want to learn more about the AWS Well-Architected Framework, refer to AWS Well-Architected.

If you require additional expert guidance, contact your AWS account team to engage an IDP Specialist Solutions Architect.


About the Authors

Brijesh Pati is an Enterprise Solutions Architect at AWS. His primary focus is helping enterprise customers adopt cloud technologies for their workloads. He has a background in application development and enterprise architecture and has worked with customers from various industries such as sports, finance, energy and professional services. His interests include serverless architectures and AI/ML.

Mia Chang is a ML Specialist Solutions Architect for Amazon Web Services. She works with customers in EMEA and shares best practices for running AI/ML workloads on the cloud with her background in applied mathematics, computer science, and AI/ML. She focuses on NLP-specific workloads, and shares her experience as a conference speaker and a book author. In her free time, she enjoys hiking, board games, and brewing coffee.

Rui Cardoso is a partner solutions architect at Amazon Web Services (AWS). He is focusing on AI/ML and IoT. He works with AWS Partners and support them in developing solutions in AWS. When not working, he enjoys cycling, hiking and learning new things.

Tim Condello is a senior artificial intelligence (AI) and machine learning (ML) specialist solutions architect at Amazon Web Services (AWS). His focus is natural language processing and computer vision. Tim enjoys taking customer ideas and turning them into scalable solutions.

Sherry Ding is a senior artificial intelligence (AI) and machine learning (ML) specialist solutions architect at Amazon Web Services (AWS). She has extensive experience in machine learning with a PhD degree in computer science. She mainly works with public sector customers on various AI/ML related business challenges, helping them accelerate their machine learning journey on the AWS Cloud. When not helping customers, she enjoys outdoor activities.

Suyin Wang is an AI/ML Specialist Solutions Architect at AWS. She has an interdisciplinary education background in Machine Learning, Financial Information Service and Economics, along with years of experience in building Data Science and Machine Learning applications that solved real-world business problems. She enjoys helping customers identify the right business questions and building the right AI/ML solutions. In her spare time, she loves singing and cooking.

Read More

Build well-architected IDP solutions with a custom lens – Part 2: Security

Build well-architected IDP solutions with a custom lens – Part 2: Security

Building a production-ready solution in AWS involves a series of trade-offs between resources, time, customer expectation, and business outcome. The AWS Well-Architected Framework helps you understand the benefits and risks of decisions you make while building workloads on AWS. By using the Framework, you will learn current operational and architectural recommendations for designing and operating reliable, secure, efficient, cost-effective, and sustainable workloads in AWS.

An intelligent document processing (IDP) project usually combines optical character recognition (OCR) and natural language processing (NLP) to read and understand a document and extract specific entities or phrases. This IDP Well-Architected Custom Lens provides you the guidance to tackle the common challenges we see in the field. By answering a series of questions in this custom lens, you will identify the potential risks and be able to resolve them by following the improvement plan.

This post focuses on the Security pillar of the IDP solution. Starting from the introduction of the Security Pillar and design principles, we then examine the solution design and implementation with four focus areas: access control, data protection, key and secret management, and workload configuration. By reading this post, you will learn about the Security Pillar in the Well-Architected Framework, and its application to the IDP solutions.

Design principles

The Security Pillar encompasses the ability of an IDP solution to protect input documents, document processing systems, and output assets, taking advantage of AWS technologies to improve security while processing documents intelligently.

All of the AWS AI services (for example, Amazon Textract, Amazon Comprehend, or Amazon Comprehend Medical) used in IDP solutions are fully managed AI services where AWS secures their physical infrastructure, API endpoints, OS, and application code, and handles service resilience and failover within a given region. As an AWS customer, you can therefore focus on using these services to accomplish your IDP tasks, rather than on securing these elements. There are a number of design principles that can help you strengthen your IDP workload security:

  • Implement a strong identity foundation – Implement the principle of least privilege and enforce separation of duties with appropriate authorization for each interaction with your AWS resources in IDP applications. Centralize identity management, and aim to eliminate reliance on long-term static credentials.
  • Maintain traceability – AI services used in IDP are integrated with AWS CloudTrail, which enables you to monitor, alert on, and audit actions and changes to your IDP environment with low latency. Their integration with Amazon CloudWatch allows you to integrate log and metric collection with your IDP system to automatically investigate and take action.
  • Automate current security recommendations – Automated software-based security mechanisms improve your ability to securely scale more rapidly and cost-effectively. Create secured IDP architectures, including the implementation of controls that are defined and managed as code in version-controlled templates by using AWS CloudFormation.
  • Protect data in transit and at rest – Encryption in transit is supported by default for all of the AI services required for IDP. Pay attention to protection of data at rest and data produced in IDP outputs. Classify your data into sensitivity levels and use mechanisms, such as encryption, tokenization, and access control where appropriate.
  • Grant least privilege permissions to people – IDP largely reduces the need for direct access and manual processing of documents. Only involving necessary people to do case validation or augmentation tasks reduces the risk of document mishandling and human error when dealing with sensitive data.
  • Prepare for security events – Prepare for an incident by having incident management and investigation policy and processes in place that align to your organizational requirements. Run incident response simulations and use tools with automation to increase your speed for detection, investigation, and recovery.

Focus areas

Before you architect an IDP workload, you need to put practices in place to meet your security requirements. This post focuses on the Security pillar with four focus areas:

  • Access control – In an IDP application, access control is the key part to ensure information security. It’s not only related to ensuring that only authorized users are able to access the application, but also about ensuring that other services are only able to access the environment and interact with each other in a suitably secure manner.
  • Data protection – Because encrypting data in transit is supported by default for all of the AI services required for IDP, data protection in an IDP application focuses more on encrypting data at rest and managing sensitive information such as personally identifiable information (PII).
  • Key and secret management – The encryption approach that you use to secure your IDP workflow may include different keys to encrypt data and authorize users across multiple services and related systems. Applying a comprehensive key and secret management system provides durable and secure mechanisms to further protect your IDP application and data.
  • Workload configuration – Workload configuration involves multiple design principles, including using monitoring and auditing services to maintain traceability of transactions and data in your IDP workload, setting up incident response procedures, and separating different IDP workloads from each other.

Access control

In the focus area of access control, consider the following current recommendations:

  • Use VPC endpoints to establish a private connection with IDP-related services – You can use Amazon Textract, Amazon Comprehend, and Amazon Simple Storage Service (Amazon S3) APIs through a world-routable network or keep your network traffic within the AWS network by using VPC endpoints. To follow current security recommendations, you should keep your IDP traffic within your VPCs and establish a private connection between your VPC and Amazon Textract or Amazon Comprehend by creating interface VPC endpoints (see the sketch after this list). You can also access Amazon S3 from your VPC using gateway VPC endpoints.
  • Set up a centralized identity provider – For authenticating users and systems to your IDP application, setting up a centralized identity provider makes it easier to manage access across multiple IDP applications and services. This reduces the need for multiple sets of credentials and provides an opportunity to integrate with existing human resources (HR) processes.
  • Use IAM roles to control access and enforce least privilege access – To manage user access to IDP services, you should create IAM roles for user access to services in the IDP application and attach the appropriate policies and tags to achieve least privilege access. Roles should then be assigned to appropriate groups as managed in your identity provider. You can also use IAM roles for assigning service usage permissions, thereby employing ephemeral AWS Security Token Service (STS) credentials for calling service APIs. For circumstances where AWS services need to be called for IDP purposes from systems not running on AWS, use AWS IAM Roles Anywhere to obtain temporary security credentials in IAM for workloads running outside of AWS.
  • Protect Amazon Textract and Amazon Comprehend in your account from cross-service impersonation – An IDP application usually employs multiple AWS services, such that one service may call another service. Therefore, you need to prevent cross-service “confused deputy” scenarios. We recommend using the aws:SourceArn and aws:SourceAccount global condition context keys in resource policies to limit the permissions that Amazon Textract or Amazon Comprehend gives another service to the resource.
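
The following sketch shows one way to create an interface VPC endpoint for Amazon Textract with Boto3, as described in the first recommendation in this list; the VPC, subnet, and security group IDs are placeholders, and the service name assumes the us-east-1 Region:

import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# An interface endpoint keeps Textract traffic on the AWS network instead of the public internet
response = ec2.create_vpc_endpoint(
    VpcEndpointType="Interface",
    VpcId="vpc-0123456789abcdef0",                  # placeholder
    ServiceName="com.amazonaws.us-east-1.textract",
    SubnetIds=["subnet-0123456789abcdef0"],         # placeholder
    SecurityGroupIds=["sg-0123456789abcdef0"],      # placeholder
    PrivateDnsEnabled=True,
)
print(response["VpcEndpoint"]["VpcEndpointId"])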

Data protection

The following are some current recommendations to consider for data protection:

  • Follow current recommendations to secure sensitive data in data stores – IDP usually involves multiple data stores. Sensitive data in these data stores needs to be secured. Current security recommendations in this area involve defining IAM controls, multiple ways to implement detective controls on databases, strengthening infrastructure security surrounding your data via network flow control, and data protection through encryption and tokenization.
  • Encrypt data at rest in Amazon Textract – Amazon Textract uses Transport Layer Security (TLS) and VPC endpoints to encrypt data in transit. The method of encrypting data at rest for use by Amazon Textract is server-side encryption. You can choose from the following options:
    • Server-side encryption with Amazon S3 (SSE-S3) – When you use Amazon S3 managed keys, each object is encrypted with a unique key. As an additional safeguard, this method encrypts the key itself with a primary key that it regularly rotates.
    • Server-side encryption with AWS KMS (SSE-KMS) – There are separate permissions for the use of an AWS Key Management Service (AWS KMS) key that provide protection against unauthorized access of your objects in Amazon S3. SSE-KMS also provides you with an audit trail in CloudTrail that shows when your KMS key was used, and by whom. Additionally, you can create and manage KMS keys that are unique to you, your service, and your Region.
  • Encrypt the output from Amazon Textract asynchronous API in a custom S3 bucket – When you start an asynchronous Amazon Textract job by calling StartDocumentTextDetection or StartDocumentAnalysis, an optional parameter in the API action is called OutputConfig. This parameter allows you to specify the S3 bucket for storing the output. Another optional input parameter KMSKeyId allows you to specify the KMS customer managed key (CMK) to use to encrypt the output.
  • Use AWS KMS encryption in Amazon Comprehend – Amazon Comprehend works with AWS KMS to provide enhanced encryption for your data. Integration with AWS KMS enables you to encrypt the data in the storage volume for Start* and Create* jobs, and it encrypts the output results of Start* jobs using your own KMS key.
    • For use via the AWS Management Console, Amazon Comprehend encrypts custom models with its own KMS key.
    • For use via the AWS Command Line Interface (AWS CLI), Amazon Comprehend can encrypt custom models using either its own KMS key or a provided CMK, and we recommend the latter.
  • Protect PII in IDP output – For documents including PII, any PII in IDP output also needs to be protected. You can either secure the output PII in your data store or redact the PII in your IDP output.
    • If you need to store the PII in your IDP downstream, look into defining IAM controls, implementing protective and detective controls on databases, strengthening infrastructure security surrounding your data via network flow control, and implementing data protection through encryption and tokenization.
    • If you don’t need to store the PII in your IDP downstream, consider redacting the PII in your IDP output. You can design a PII redaction step using Amazon Comprehend in your IDP workflow, as shown in the sketch after this list.
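
The following is a minimal sketch of such a redaction step using the Amazon Comprehend DetectPiiEntities API; the sample text is hypothetical, and in a real workflow the input would come from your Amazon Textract output:

import boto3

comprehend = boto3.client("comprehend")

text = "Customer John Doe, card number 4111-1111-1111-1111, called about invoice 784."  # hypothetical sample

response = comprehend.detect_pii_entities(Text=text, LanguageCode="en")

# Replace detected PII spans with their entity type, working from the end so offsets stay valid
redacted = text
for entity in sorted(response["Entities"], key=lambda e: e["BeginOffset"], reverse=True):
    redacted = (
        redacted[: entity["BeginOffset"]]
        + f"[{entity['Type']}]"
        + redacted[entity["EndOffset"]:]
    )

print(redacted)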

Key and secret management

Consider the following current recommendations for managing keys and secrets:

  • Use AWS KMS to implement secure key management for cryptographic keys – You need to define an encryption approach that includes the storage, rotation, and access control of keys, which helps provide protection for your content. AWS KMS helps you manage encryption keys and integrates with many AWS services. It provides durable, secure, and redundant storage for your KMS keys.
  • Use AWS Secrets Manager to implement secret management – An IDP workflow may have secrets such as database credentials in multiple services or stages. You need a tool to store, manage, retrieve, and potentially rotate these secrets. AWS Secrets Manager helps you manage, retrieve, and rotate database credentials, application credentials, and other secrets throughout their lifecycles. Storing the credentials in Secrets Manager helps mitigate the risk of possible credential exfiltration by anyone who can inspect your application code (see the sketch after this list).
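
The following sketch retrieves a database credential from Secrets Manager at runtime instead of hardcoding it; the secret name is a placeholder, and it assumes the secret stores a JSON object with username and password keys:

import json
import boto3

secrets = boto3.client("secretsmanager")

# Fetch the secret value by name or ARN (placeholder name shown)
response = secrets.get_secret_value(SecretId="idp/database-credentials")
credentials = json.loads(response["SecretString"])

db_user = credentials["username"]
db_password = credentials["password"]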

Workload configuration

To configure your workload, follow these current recommendations:

  • Separate multiple IDP workloads using different AWS accounts – We recommend establishing common guardrails and isolation between environments (such as production, development, and test) and workloads through a multi-account strategy. AWS provides tools to manage your workloads at scale through a multi-account strategy to establish this isolation boundary. When you have multiple AWS accounts under central management, your accounts should be organized into a hierarchy defined by groupings of organizational units (OUs). Security controls can then be organized and applied to the OUs and member accounts, establishing consistent preventative controls on member accounts in the organization.
  • Log Amazon Textract and Amazon Comprehend API calls with CloudTrail – Amazon Textract and Amazon Comprehend are integrated with CloudTrail. The calls captured include calls from the service console and calls from your own code to the services’ API endpoints (see the sketch after this list).
  • Establish incident response procedures – Even with comprehensive preventative and detective controls, your organization should still have processes in place to respond to and mitigate the potential impact of security incidents. Putting the tools and controls in place ahead of a security incident, then routinely practicing incident response through simulations, will help you verify that your environment can support timely investigation and recovery.
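
To illustrate the CloudTrail integration in the first item of this list, the following sketch looks up recent Amazon Textract API activity in the CloudTrail event history:

import boto3

cloudtrail = boto3.client("cloudtrail")

# Recent Amazon Textract API calls recorded in the CloudTrail event history
response = cloudtrail.lookup_events(
    LookupAttributes=[
        {"AttributeKey": "EventSource", "AttributeValue": "textract.amazonaws.com"}
    ],
    MaxResults=10,
)

for event in response["Events"]:
    print(event["EventTime"], event["EventName"], event.get("Username", "-"))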

Conclusion

In this post, we shared design principles and current recommendations for the Security pillar when building well-architected IDP solutions.

To learn more about the IDP Well-Architected Custom Lens, explore the following posts in this series:

For next steps, you can read more about the AWS Well-Architected Framework and refer to our Guidance for Intelligent Document Processing on AWS to design and build your IDP application. Please also reach out to your account team for a Well-Architected review for your IDP workload. If you require additional expert guidance, contact your AWS account team to engage an IDP Specialist Solutions Architect.

AWS is committed to the IDP Well-Architected Lens as a living tool. As the IDP solutions and related AWS AI services evolve, we will update the IDP Well-Architected Lens accordingly.


About the Authors

Sherry Ding is a senior artificial intelligence (AI) and machine learning (ML) specialist solutions architect at Amazon Web Services (AWS). She has extensive experience in machine learning with a PhD degree in computer science. She mainly works with public sector customers on various AI/ML related business challenges, helping them accelerate their machine learning journey on the AWS Cloud. When not helping customers, she enjoys outdoor activities.

Brijesh Pati is an Enterprise Solutions Architect at AWS. His primary focus is helping enterprise customers adopt cloud technologies for their workloads. He has a background in application development and enterprise architecture and has worked with customers from various industries such as sports, finance, energy and professional services. His interests include serverless architectures and AI/ML.

Rui Cardoso is a partner solutions architect at Amazon Web Services (AWS). He is focusing on AI/ML and IoT. He works with AWS Partners and support them in developing solutions in AWS. When not working, he enjoys cycling, hiking and learning new things.

Mia Chang is a ML Specialist Solutions Architect for Amazon Web Services. She works with customers in EMEA and shares best practices for running AI/ML workloads on the cloud with her background in applied mathematics, computer science, and AI/ML. She focuses on NLP-specific workloads, and shares her experience as a conference speaker and a book author. In her free time, she enjoys hiking, board games, and brewing coffee.

Suyin Wang is an AI/ML Specialist Solutions Architect at AWS. She has an interdisciplinary education background in Machine Learning, Financial Information Service and Economics, along with years of experience in building Data Science and Machine Learning applications that solved real-world business problems. She enjoys helping customers identify the right business questions and building the right AI/ML solutions. In her spare time, she loves singing and cooking.

Tim Condello is a senior artificial intelligence (AI) and machine learning (ML) specialist solutions architect at Amazon Web Services (AWS). His focus is natural language processing and computer vision. Tim enjoys taking customer ideas and turning them into scalable solutions.

Read More

Build well-architected IDP solutions with a custom lens – Part 3: Reliability

Build well-architected IDP solutions with a custom lens – Part 3: Reliability

The IDP Well-Architected Custom Lens is intended for all AWS customers who use AWS to run intelligent document processing (IDP) solutions and are searching for guidance on how to build a secure, efficient, and reliable IDP solution on AWS.

Building a production-ready solution in the cloud involves a series of trade-offs between resources, time, customer expectation, and business outcome. The AWS Well-Architected Framework helps you understand the benefits and risks of decisions you make while building workloads on AWS. By using the Framework, you will learn operational and architectural best practices for designing and operating reliable, secure, efficient, cost-effective, and sustainable workloads in the cloud.

An IDP project usually combines optical character recognition (OCR) and natural language processing (NLP) to read and understand a document and extract specific terms or words. The IDP Well-Architected Custom Lens outlines the steps for performing an AWS Well-Architected review that allows you to assess and identify technical risks of your IDP workloads. It provides guidance to tackle the common challenges we see in the field, helping you architect your IDP workloads according to best practices.

This post focuses on the Reliability pillar of the IDP solution. Starting from the introduction of the Reliability pillar and design principles, we then dive deep into the solution design and implementation with three focus areas: foundations, change management, and failure management. By reading this post, you will learn about the Reliability pillar in the Well-Architected Framework with the IDP case study.

Design principles

The reliability pillar encompasses the ability of an IDP solution to perform document processing correctly and consistently when it’s expected and according to the defined business rules. This includes the ability to operate and test the full IDP workflow and its total lifecycle.

There are a number of principles that can help you to increase reliability. Keep these in mind as we discuss best practices:

  • Automatically recover from failure – By monitoring your IDP workflow for key performance indicators (KPIs), you can run automation when a threshold is breached. This allows you to track and be notified automatically if any failure occurs and trigger automated recovery processes that work around or repair the failure. Based on KPI measures, you can also anticipate failures and apply remediation actions before they occur.
  • Test recovery procedures – Test how your IDP workflow fails, and validate recovery procedures. Use automation to simulate different scenarios or recreate scenarios that led to failure before.
  • Scale and adjust service capacity – Monitor IDP workflow demand and usage, and automatically adjust AWS service capacity, to maintain the optimal level to satisfy demand without over- or under-provisioning. Control and be aware of service quotas, limits, and constraints of your IDP components services, such as Amazon Textract and Amazon Comprehend.
  • Automate changes – Use automation when applying changes to your IDP workflow infrastructure. Manage changes through automation, which then can be tracked and reviewed.

Focus areas

The design principles and best practices of the reliability pillar are based on insights gathered from our customers and our IDP technical specialist communities. Use them as guidance and support for your design decisions, and align them with the business requirements of your IDP solution. Applying the IDP Well-Architected Lens helps you validate the resilience and efficiency of your IDP solution design, and provides recommendations to address any gaps you might identify.

The following are best practice areas for reliability of an IDP solution in the cloud:

  • Foundations – AWS AI services such as Amazon Textract and Amazon Comprehend provide a set of soft and hard limits for different dimensions of usage. It’s important to review these limits and ensure your IDP solution adheres to any soft limits, while not exceeding any hard limits.
  • Change management – Treat your IDP solution as infrastructure as code (IaC), allowing you to automate monitoring and change management. Use version control across components such as infrastructure and Amazon Comprehend custom models, and track changes back to point-in-time release.
  • Failure management – Because an IDP workflow is an event-driven solution, your application must be resilient to handling known and unknown errors. A well-architected IDP solution has the ability to prevent failures and withstand failures when they occur by using logging and retry mechanisms. It’s important to design resilience into your IDP workflow architecture and plan for disaster recovery.

Foundations

AWS AI services provide ready-made intelligence, such as automated data extraction and analysis, using Amazon Textract, Amazon Comprehend, and Amazon Augmented AI (Amazon A2I), for your IDP workflows. There are service limits (or quotas) for these services to avoid over-provisioning and to limit request rates on API operations, protecting the services from abuse.

When planning and designing your IDP solution architecture, consider the following best practices:

  • Be aware of unchangeable Amazon Textract and Amazon Comprehend service quotas, limits, and constraints – Accepted file formats, size and page count, languages, document rotations, and image size are some examples of these hard limits for Amazon Textract that can’t be changed.
    • Accepted file formats include JPEG, PNG, PDF, and TIFF files. (JPEG 2000-encoded images within PDFs are supported). Document preprocessing is required before using Amazon Textract if the file format is not supported (for example, Microsoft Word or Excel). In this case, you must convert unsupported document formats to PDF or image format.
    • Amazon Comprehend has different quotas for built-in models, custom models, and flywheels. Make sure that your use case is aligned with Amazon Comprehend quotas.
  • Adjust Amazon Textract and Amazon Comprehend service quotas to meet your needs – The Amazon Textract Service Quotas Calculator can help you estimate the quota values that will cover your use case, and you can check the quotas currently applied to your account with the Service Quotas API (see the sketch after this list). You should manage your service quotas across accounts or Regions if you’re planning a disaster recovery failover between accounts or Regions for your solution. When requesting an increase of Amazon Textract quotas, make sure to follow these recommendations:
    • Use the Amazon Textract Service Quotas Calculator to estimate your optimal quota value.
    • Changes in requests can cause spiky network traffic, affecting throughput. Use a queueing serverless architecture or other mechanism to smooth traffic and get the most out of your allocated transactions per second (TPS).
    • Implement retry logic to handle throttled calls and dropped connections.
    • Configure exponential backoff and jitter to improve throughput.
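
As a complement to the quota guidance in the preceding list, the following sketch lists the Amazon Textract quotas applied to your account and Region with the Service Quotas API, which can help you verify current values before requesting an increase:

import boto3

quotas = boto3.client("service-quotas")

# List the quotas currently applied to Amazon Textract in this account and Region
paginator = quotas.get_paginator("list_service_quotas")
for page in paginator.paginate(ServiceCode="textract"):
    for quota in page["Quotas"]:
        print(f'{quota["QuotaName"]}: {quota["Value"]}')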

Change management

Changes to your IDP workflow or its environment, such as spikes in demand or a corrupted document file, must be anticipated and accommodated to achieve a higher reliability of the solution. Some of these changes are covered by the foundations best practices described in the previous section, but those alone are not enough to accommodate changes. The following best practices must also be considered:

  • Use Amazon CloudWatch to monitor your IDP workflow components, such as Amazon Textract and Amazon Comprehend. Collect metrics from the IDP workflow, automate responses to alarms, and send notifications as required to your workflow and business objectives.
  • Deploy your IDP workflow solution and all infrastructure changes with automation using IaC, such as the AWS Cloud Development Kit (AWS CDK) and pre-built IDP AWS CDK constructs. This removes the potential for introducing human error and enables you to test before changing to your production environment.
  • If your use case requires an Amazon Comprehend custom model, consider using a flywheel to simplify the process of improving the custom model over time. A flywheel orchestrates the tasks associated with training and evaluating a new custom model version (see the sketch after this list).
  • If your use case requires it, customize the output of the Amazon Textract pre-trained Queries feature by training and using an adapter for the Amazon Textract base model. Consider the following best practices when creating queries for your adapters:
    • Adapter quotas define the preceding limits for adapter training. Consider these limits and raise a service quota increase request, if required:
      • Maximum number of adapters – Number of adapters allowed (you can have several adapter versions under a single adapter).
      • Maximum adapter versions created per month – Number of successful adapter versions that can be created per AWS account per month.
      • Maximum in-progress adapter versions – Number of in-progress adapter versions (adapter training) per account.
    • Make sure to use a set of documents representative of your use case (minimum five training docs and five testing docs).
    • Provide as many documents for training as possible (up to 2,500 pages of training documents and 1,000 for test documents).
    • Annotate queries using a variety of answers. For example, if the answer to a query is “Yes” or “No,” the annotated samples should have occurrences of both “Yes” and “No.”
    • Maintain a consistent annotation style, including when annotating fields that contain spaces.
    • Use the exact query used in training for inference.
    • After each round of adapter training, review the performance metrics to determine if you need to further improve your adapter to achieve your goals. Upload a new document set for training or review document annotations that have low accuracy scores before you start a new training to create an improved version of the adapter.
    • Use the AutoUpdate feature for custom adapters. This feature attempts automated retraining if the AutoUpdate flag is enabled on an adapter.
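
To illustrate the flywheel practice mentioned in the preceding list, the following sketch starts a new flywheel iteration and checks which model version is active; the flywheel ARN is a placeholder, and it assumes a flywheel has already been created:

import boto3

comprehend = boto3.client("comprehend")

flywheel_arn = "arn:aws:comprehend:us-east-1:111122223333:flywheel/idp-classifier-flywheel"  # placeholder

# Kick off a new iteration to retrain and evaluate the custom model with the latest data
iteration = comprehend.start_flywheel_iteration(FlywheelArn=flywheel_arn)
print(iteration["FlywheelIterationId"])

# Inspect the flywheel to see which model version is currently active
flywheel = comprehend.describe_flywheel(FlywheelArn=flywheel_arn)
print(flywheel["FlywheelProperties"].get("ActiveModelArn"))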

Failure management

When designing an IDP solution, one important aspect to consider is its resilience: how to handle the known and unknown errors that can occur. The IDP solution should be able to log errors and retry failed operations during the different stages of the IDP workflow. In this section, we discuss how to design your IDP workflow to handle failures.

Prepare your IDP workflow to manage and withstand failures

“Everything fails, all the time” is a famous quote from AWS CTO Werner Vogels. Your IDP solution, like everything else, will eventually fail. The question is how it can withstand failures without impacting your IDP solution users. Your IDP architecture must detect failures as they occur and take action to avoid impact on availability. This must be done automatically, and without user impact. Consider the following best practices:

  • Use Amazon Simple Storage Service (Amazon S3) as your scalable data store for IDP workflow documents to process. Amazon S3 provides a highly durable storage infrastructure designed for mission-critical and primary data storage.
  • Back up all your IDP workflow data according to your business requirements. Implement a strategy to recover or reproduce data in case of data loss. Align this strategy with a defined Recovery Point Objective (RPO) and Recovery Time Objective (RTO) that meet your business requirements.
  • If required, plan and implement a disaster recovery failover strategy of your IDP solution across AWS accounts and Regions.
  • Use the Amazon Textract OutputConfig feature and Amazon Comprehend OutputDataConfig feature to store the results of asynchronous processing from Amazon Textract or Amazon Comprehend to a designated S3 bucket. This allows the workflow to continue from that point rather than repeat the Amazon Textract or Amazon Comprehend invocation. The following code shows how to start an Amazon Textract asynchronous API job to analyze a document and store encrypted inference output in a defined S3 bucket. For additional information, refer to the Amazon Textract client documentation.
import boto3
client = boto3.client('textract')

response = client.start_document_analysis(
    DocumentLocation={
        'S3Object': {
            'Bucket': 'string',
            'Name': 'string',
            'Version': 'string'
        }
    },
    # Choose one or more of 'TABLES', 'FORMS', 'QUERIES', 'SIGNATURES', 'LAYOUT'
    FeatureTypes=[
        'TABLES',
    ],
    # ... other optional parameters ...
    OutputConfig={
        'S3Bucket': 'string',
        'S3Prefix': 'string'
    },
    KMSKeyId='string',
    # ... other optional parameters ...
)

Design your IDP workflow to prevent failures

The reliability of a workload starts with upfront design decisions. Architecture choices will impact your workload behavior and its resilience. To improve the reliability of your IDP solution, follow these best practices.

Firstly, design your architecture following the IDP workflow. Although the stages in an IDP workflow may vary and be influenced by use case and business requirements, the stages of data capture, document classification, text extraction, content enrichment, review and validation, and consumption are typically part of an IDP workflow. These well-defined stages can be used to separate functionalities and isolate them in case of failure.

You can use Amazon Simple Queue Service (Amazon SQS) to decouple IDP workflow stages. A decoupling pattern helps isolate behavior of architecture components from other components that depend on it, increasing resiliency and agility.
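
The following sketch shows the decoupling pattern described above: a producer stage publishes a message to an SQS queue and a consumer stage polls it independently, so a failure in one stage does not cascade to the other. The queue URL, bucket, and object key are placeholders:

import json
import boto3

sqs = boto3.client("sqs")
queue_url = "https://sqs.us-east-1.amazonaws.com/111122223333/idp-extraction-queue"  # placeholder

# Producer: the classification stage hands the document off to the extraction stage
sqs.send_message(
    QueueUrl=queue_url,
    MessageBody=json.dumps({"bucket": "my-idp-bucket", "key": "incoming/doc-001.pdf"}),
)

# Consumer: the extraction stage pulls work at its own pace
messages = sqs.receive_message(QueueUrl=queue_url, MaxNumberOfMessages=1, WaitTimeSeconds=10)
for message in messages.get("Messages", []):
    payload = json.loads(message["Body"])
    # ... run Amazon Textract on payload["bucket"]/payload["key"] ...
    sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=message["ReceiptHandle"])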

Secondly, control and limit retry calls. AWS services such as Amazon Textract can fail if the maximum number of TPS allotted is exceeded, causing the service to throttle your application or drop your connection.

You should manage throttling and dropped connections by automatically retrying the operation (both synchronous and asynchronous operations). However, you should also specify a limited number of retries, after which the operation fails and throws an exception. If you make too many calls to Amazon Textract in a short period of time, it throttles your calls and sends a ProvisionedThroughputExceededException error in the operation response.

In addition, use exponential backoff and jitter for retries to improve throughput. For example, using Amazon Textract, specify the number of retries by including the config parameter when you create the Amazon Textract client. We recommend a retry count of five. In the following example code, we use the config parameter to automatically retry an operation using adaptive mode and a maximum of five retries:

import boto3
from botocore.client import Config

documents = ['doc-img-1.png', 'doc-img-2.png', 'doc-img-3.png',
             'doc-img-4.png', 'doc-img-5.png']

# Retry throttled or failed calls up to five times using the adaptive retry mode
config = Config(
    retries = {
        'max_attempts': 5,
        'mode': 'adaptive'
    }
)

client = boto3.client('textract', config=config)

for documentName in documents:
    # Synchronous text detection; the SDK retries throttled calls automatically
    response = client.detect_document_text(
        Document={
            'S3Object': {
                'Bucket': 'string',
                'Name': documentName
            }
        })

    ...

Take advantage of AWS SDKs, such as the AWS SDK for Python (Boto3), to assist in retrying client calls to AWS services such as Amazon Textract and Amazon Comprehend. There are three retry modes available:

  • Legacy mode – Retries calls for a limited set of errors and exceptions and includes exponential backoff with a base factor of 2.
  • Standard mode – Standardizes the retry logic and behavior to be consistent with other AWS SDKs and extends the functionality of retries over that found in legacy mode. Any retry attempt includes exponential backoff with a base factor of 2 and a maximum backoff time of 20 seconds (a configuration sketch using this mode follows the error list below).
  • Adaptive mode – Includes all the features of standard mode and introduces client-side rate limiting through the use of a token bucket and rate limit variables that are dynamically updated with each retry attempt. It offers flexibility in client-side retries that adapts to the error or exception state returned by an AWS service. With each new retry attempt, adaptive mode modifies the rate limit variables based on the error, exception, or HTTP status code presented in the response from the AWS service. These rate limit variables are then used to calculate a new call rate for the client. Each exception, error, or non-success HTTP response from an AWS service updates the rate limit variables as retries occur until a success is reached, the token bucket is exhausted, or the configured maximum attempts value is reached. Examples of exceptions, errors, and non-success HTTP responses:
# Transient errors/exceptions
RequestTimeout
RequestTimeoutException
PriorRequestNotComplete
ConnectionError
HTTPClientError

# Service-side throttling/limit errors and exceptions
Throttling
ThrottlingException
ThrottledException
RequestThrottledException
TooManyRequestsException
ProvisionedThroughputExceededException
TransactionInProgressException
RequestLimitExceeded
BandwidthLimitExceeded
LimitExceededException
RequestThrottled
SlowDown
EC2ThrottledException

# Retries are also attempted on nondescriptive, transient error codes; specifically, these HTTP status codes: 500, 502, 503, and 504.
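
As a complement to the adaptive mode example earlier, the following sketch configures standard mode and then inspects the error code of a botocore ClientError once the SDK has exhausted its retries, which can be useful when you want to log or requeue specifically throttled documents. The retry count, bucket, and document name are illustrative placeholders.

import boto3
from botocore.client import Config
from botocore.exceptions import ClientError

# Standard retry mode with an explicit maximum number of attempts
config = Config(retries={'max_attempts': 5, 'mode': 'standard'})
client = boto3.client('textract', config=config)

try:
    response = client.detect_document_text(
        Document={'S3Object': {'Bucket': 'string', 'Name': 'doc-img-1.png'}}
    )
except ClientError as error:
    # Inspect the error code after the SDK has exhausted its retries
    code = error.response['Error']['Code']
    if code in ('ThrottlingException', 'ProvisionedThroughputExceededException'):
        # Log, alert, or requeue the document for later processing
        print(f'Throttled after retries: {code}')
    else:
        raise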

Conclusion

In this post, we shared design principles, focus areas, foundations, and best practices for reliability in your IDP solution.

To learn more about the IDP Well-Architected Custom Lens, explore the other posts in this series.

AWS is committed to the IDP Well-Architected Lens as a living tool. As IDP solutions and related AWS AI services evolve and new AWS services become available, we will update the IDP Well-Architected Lens accordingly.

If you want to learn more about the AWS Well-Architected Framework, refer to AWS Well-Architected.

If you require additional expert guidance, contact your AWS account team to engage an IDP Specialist Solutions Architect.


About the Authors

Rui Cardoso is a partner solutions architect at Amazon Web Services (AWS). He focuses on AI/ML and IoT. He works with AWS Partners and supports them in developing solutions on AWS. When not working, he enjoys cycling, hiking, and learning new things.

Brijesh Pati is an Enterprise Solutions Architect at AWS. His primary focus is helping enterprise customers adopt cloud technologies for their workloads. He has a background in application development and enterprise architecture and has worked with customers from various industries such as sports, finance, energy and professional services. His interests include serverless architectures and AI/ML.

Mia Chang is an ML Specialist Solutions Architect for Amazon Web Services. She works with customers in EMEA and shares best practices for running AI/ML workloads on the cloud with her background in applied mathematics, computer science, and AI/ML. She focuses on NLP-specific workloads, and shares her experience as a conference speaker and a book author. In her free time, she enjoys hiking, board games, and brewing coffee.

Tim Condello is a senior artificial intelligence (AI) and machine learning (ML) specialist solutions architect at Amazon Web Services (AWS). His focus is natural language processing and computer vision. Tim enjoys taking customer ideas and turning them into scalable solutions.

Sherry Ding is a senior artificial intelligence (AI) and machine learning (ML) specialist solutions architect at Amazon Web Services (AWS). She has extensive experience in machine learning with a PhD degree in computer science. She mainly works with public sector customers on various AI/ML related business challenges, helping them accelerate their machine learning journey on the AWS Cloud. When not helping customers, she enjoys outdoor activities.

Suyin Wang is an AI/ML Specialist Solutions Architect at AWS. She has an interdisciplinary education background in Machine Learning, Financial Information Service and Economics, along with years of experience in building Data Science and Machine Learning applications that solved real-world business problems. She enjoys helping customers identify the right business questions and building the right AI/ML solutions. In her spare time, she loves singing and cooking.

Read More

Build well-architected IDP solutions with a custom lens – Part 4: Performance efficiency

Build well-architected IDP solutions with a custom lens – Part 4: Performance efficiency

When a customer has a production-ready intelligent document processing (IDP) workload, we often receive requests for a Well-Architected review. To build an enterprise solution, developer resources, cost, time, and user experience have to be balanced to achieve the desired business outcome. The AWS Well-Architected Framework provides a systematic way for organizations to learn operational and architectural best practices for designing and operating reliable, secure, efficient, cost-effective, and sustainable workloads in the cloud.

The IDP Well-Architected Custom Lens follows the AWS Well-Architected Framework, reviewing the solution across the six pillars at the granularity of a specific AI or machine learning (ML) use case and providing guidance to tackle common challenges. The IDP Well-Architected Custom Lens in the Well-Architected Tool contains questions regarding each of the pillars. By answering these questions, you can identify potential risks and resolve them by following your improvement plan.

This post focuses on the Performance Efficiency pillar of the IDP workload. We dive deep into designing and implementing the solution to optimize for throughput, latency, and overall performance. We start by discussing common indicators that you should conduct a Well-Architected review and introduce the fundamental approaches with design principles. Then we go through each focus area from a technical perspective.

To follow along with this post, you should be familiar with the previous posts in this series (Part 1 and Part 2) and the guidelines in Guidance for Intelligent Document Processing on AWS. These resources introduce common AWS services for IDP workloads and suggested workflows. With this knowledge, you’re now ready to learn more about productionizing your workload.

Common indicators

The following are common indicators that you should conduct a Well-Architected Framework review for the Performance Efficiency pillar:

  • High latency – When the latency of optical character recognition (OCR), entity recognition, or the end-to-end workflow takes longer than your previous benchmark, this may be an indicator that the architecture design doesn’t cover load testing or error handling.
  • Frequent throttling – You may experience throttling by AWS services like Amazon Textract due to request limits. This means that the architecture needs to be adjusted by reviewing the architecture workflow, synchronous and asynchronous implementation, transactions per second (TPS) calculation, and more.
  • Debugging difficulties – When there’s a document process failure, you may not have an effective way to identify where the error is located in the workflow, which service it’s related to, and why the failure occurred. This means the system lacks visibility into logs and failures. Consider revisiting the logging design of the telemetry data and adding infrastructure as code (IaC), such as document processing pipelines, to the solution.
Indicator | Description | Architectural Gap
High latency | OCR, entity recognition, or end-to-end workflow latency exceeds the previous benchmark | Load testing; error handling
Frequent throttling | Throttling by AWS services like Amazon Textract due to request limits | Synchronous vs. asynchronous implementation; TPS calculation
Hard to debug | No visibility into the location, cause, and reason for document processing failures | Logging design; document processing pipelines

Design principles

In this post, we discuss three design principles: delegating complex AI tasks, IaC architectures, and serverless architectures. When you encounter a trade-off between two implementations, you can revisit the design principles with the business priorities of your organization so that you can make decisions effectively.

  • Delegating complex AI tasks – You can enable faster AI adoption in your organization by offloading the ML model development lifecycle to managed services and taking advantage of the model development and infrastructure provided by AWS. Rather than requiring your data science and IT teams to build and maintain AI models, you can use pre-trained AI services that can automate tasks for you. This allows your teams to focus on higher-value work that differentiates your business, while the cloud provider handles the complexity of training, deploying, and scaling the AI models.
  • IaC architectures – An IDP solution typically chains multiple AI services to perform the end-to-end workflow in sequence. You can architect the solution as workflow pipelines using AWS Step Functions to enhance fault tolerance, parallel processing, visibility, and scalability. These advantages can enable you to optimize the usage and cost of the underlying AI services (a minimal pipeline-trigger sketch follows this list).
  • Serverless architectures – IDP is often an event-driven solution, initiated by user uploads or scheduled jobs. The solution can be horizontally scaled out by increasing the call rates for the AI services, AWS Lambda, and other services involved. A serverless approach provides scalability without over-provisioning resources, preventing unnecessary expenses. The monitoring behind the serverless design assists in detecting performance issues.
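
To make the pipeline idea concrete, the following sketch starts an execution of a hypothetical Step Functions state machine for each uploaded document; the state machine ARN and input fields are placeholders, and in practice this call would often run inside a Lambda function triggered by an S3 upload event.

import json
import boto3

sfn = boto3.client('stepfunctions')

# Hypothetical state machine that orchestrates the IDP stages
STATE_MACHINE_ARN = 'arn:aws:states:us-east-1:123456789012:stateMachine:idp-pipeline'

def start_idp_pipeline(bucket, key):
    # Each document gets its own execution, so a failure is visible and isolated per document
    return sfn.start_execution(
        stateMachineArn=STATE_MACHINE_ARN,
        input=json.dumps({'bucket': bucket, 'key': key})
    )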

Figure 1. The benefit of applying design principles.

With these three design principles in mind, organizations can establish an effective foundation for AI/ML adoption on cloud platforms. By delegating complexity, implementing resilient infrastructure, and designing for scale, organizations can optimize their AI/ML solutions.

In the following sections, we discuss how to address common challenges in regards to technical focus areas.

Focus areas

When reviewing performance efficiency, we review the solution from five focus areas: architecture design, data management, error handling, system monitoring, and model monitoring. With these focus areas, you can conduct an architecture review from different aspects to enhance the effectiveness, observability, and scalability of the three components of an AI/ML project: data, model, and business goal.

Architecture design

By going through the questions in this focus area, you will review the existing workflow to see if it follows best practices. The suggested workflow provides a common pattern that organizations can follow and prevents trial-and-error costs.

Based on the proposed architecture, the workflow follows the six stages of data capture, classification, extraction, enrichment, review and validation, and consumption. Of the common indicators we discussed earlier, two out of three come from architecture design problems. This is because when you start a project with an improvised approach, you may meet project constraints when trying to align your infrastructure to your solution. With the architecture design review, the improvised design can be decoupled into stages, and each of them can be reevaluated and reordered.

You can save time, money, and labor by implementing classification in your workflow so that documents are routed to downstream applications and APIs based on document type. This enhances the observability of the document process and makes the solution straightforward to maintain when adding new document types.
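
A hedged sketch of such routing might call a custom Amazon Comprehend classifier and branch on the highest-scoring class; the endpoint ARN below is a hypothetical placeholder for a classifier trained on your own document types.

import boto3

comprehend = boto3.client('comprehend')

# Hypothetical custom classification endpoint trained on your document types
CLASSIFIER_ENDPOINT_ARN = 'arn:aws:comprehend:us-east-1:123456789012:document-classifier-endpoint/idp-doc-types'

def route_document(document_text):
    response = comprehend.classify_document(
        Text=document_text,
        EndpointArn=CLASSIFIER_ENDPOINT_ARN
    )
    # Pick the class with the highest confidence score
    top_class = max(response['Classes'], key=lambda c: c['Score'])
    # Route to a downstream queue, API, or workflow based on the document type
    return top_class['Name']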

Data management

Performance of an IDP solution includes latency, throughput, and the end-to-end user experience. How to manage the document and its extracted information in the solution is the key to data consistency, security, and privacy. Additionally, the solution must handle high data volumes with low latency and high throughput.

When going through the questions of this focus area, you will review the document workflow. This includes data ingestion, data preprocessing, converting documents to document types accepted by Amazon Textract, handling incoming document streams, routing documents by type, and implementing access control and retention policies.
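
For the retention aspect, here is a minimal sketch, assuming intermediate Amazon Textract output is written under a dedicated prefix in a single workflow bucket; the bucket name, prefix, and 90-day window are illustrative assumptions to adapt to your own policies.

import boto3

s3 = boto3.client('s3')

# Expire intermediate OCR output after 90 days, scoped to a single prefix
s3.put_bucket_lifecycle_configuration(
    Bucket='idp-workflow-bucket',
    LifecycleConfiguration={
        'Rules': [
            {
                'ID': 'expire-intermediate-output',
                'Filter': {'Prefix': 'textract-output/'},
                'Status': 'Enabled',
                'Expiration': {'Days': 90}
            }
        ]
    }
)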

For example, by storing a document at each processing phase, you can revert processing to the previous step if needed. The data lifecycle ensures reliability and compliance for the workload. By using the Amazon Textract Service Quotas Calculator (see the following screenshot), the asynchronous features of Amazon Textract, Lambda, Step Functions, Amazon Simple Queue Service (Amazon SQS), and Amazon Simple Notification Service (Amazon SNS), organizations can automate and scale document processing tasks to meet specific workload needs.

Figure 2. Amazon Textract Service Quotas Calculator.

Error handling

Robust error handling is critical for tracking the document process status, and it gives the operations team time to react to any abnormal behaviors, such as unexpected document volumes, new document types, or other unplanned issues from third-party services. From the organization's perspective, proper error handling can enhance system uptime and performance.

You can break down error handling into two key aspects:

  • AWS service configuration – You can implement retry logic with exponential backoff to handle transient errors like throttling. When you start processing by calling an asynchronous Start* operation, such as StartDocumentTextDetection, you can specify that the completion status of the request is published to an SNS topic in the NotificationChannel configuration (see the sketch after this list). This helps you avoid hitting throttling limits on API calls caused by polling the Get* APIs. You can also implement Amazon CloudWatch alarms and triggers to alert you when unusual error spikes occur.
  • Error report enhancement – This includes detailed messages with an appropriate level of detail by error type and descriptions of error handling responses. With a proper error handling setup, systems can be more resilient by implementing common patterns such as automatically retrying intermittent errors, using circuit breakers to handle cascading failures, and monitoring services to gain insight into errors. This allows the solution to balance retry limits and prevent never-ending retry loops.
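
The following sketch shows the asynchronous pattern described above: the Amazon Textract job publishes its completion status to an SNS topic instead of the application polling the Get* API. The bucket, topic ARN, and role ARN are placeholders for your own resources.

import boto3

client = boto3.client('textract')

# Start asynchronous text detection and publish the completion status to SNS
response = client.start_document_text_detection(
    DocumentLocation={
        'S3Object': {'Bucket': 'idp-workflow-bucket', 'Name': 'doc-img-1.png'}
    },
    NotificationChannel={
        'SNSTopicArn': 'arn:aws:sns:us-east-1:123456789012:textract-job-status',
        'RoleArn': 'arn:aws:iam::123456789012:role/TextractSNSPublishRole'
    }
)

# Keep the JobId so the SNS-triggered consumer can fetch results with get_document_text_detection
job_id = response['JobId']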

Model monitoring

The performance of ML models is monitored for degradation over time. As data and system conditions change, the model performance and efficiency metrics are tracked to ensure retraining is performed when needed.

The ML model in an IDP workflow can be an OCR model, entity recognition model, or classification model. The model can come from an AWS AI service, an open source model on Amazon SageMaker, Amazon Bedrock, or other third-party services. You must understand the limitations and use cases of each service in order to identify ways to improve the model with human feedback and enhance service performance over time.

A common approach is using service logs to understand different levels of accuracy. These logs can help the data science team identify and understand any need for model retraining. Your organization can choose the retraining mechanism—it can be quarterly, monthly, or based on science metrics, such as when accuracy drops below a given threshold.
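
As an illustration of a threshold-based trigger, the following sketch compares a logged accuracy metric against a threshold and notifies the data science team through SNS when retraining may be needed; the metric source, threshold value, and topic ARN are assumptions made for the example.

import boto3

sns = boto3.client('sns')

ACCURACY_THRESHOLD = 0.90
ALERT_TOPIC_ARN = 'arn:aws:sns:us-east-1:123456789012:idp-model-alerts'

def check_model_accuracy(current_accuracy):
    # current_accuracy would typically come from service logs or human review samples
    if current_accuracy < ACCURACY_THRESHOLD:
        sns.publish(
            TopicArn=ALERT_TOPIC_ARN,
            Subject='IDP model accuracy below threshold',
            Message=f'Observed accuracy {current_accuracy:.2%} is below {ACCURACY_THRESHOLD:.0%}; consider retraining.'
        )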

The goal of monitoring is not just detecting issues, but closing the loop to continuously refine models and keep the IDP solution performing as the external environment evolves.

System monitoring

After you deploy the IDP solution in production, it’s important to monitor key metrics and automation performance to identify areas for improvement. The metrics should include both business metrics and technical metrics. This allows the company to evaluate the system’s performance, identify issues, and make improvements to models, rules, and workflows over time to increase the automation rate and understand the operational impact.

On the business side, metrics like extraction accuracy for important fields, overall automation rate indicating the percentage of documents processed without human intervention, and average processing time per document are paramount. These business metrics help quantify the end-user experience and operational efficiency gains.

Technical metrics, including error and exception rates occurring throughout the workflow, are essential to track from an engineering perspective. Technical metrics can also be monitored at each level from end to end, providing a comprehensive view of a complex workload. You can break the metrics down into different levels, such as solution level, end-to-end workflow level, document type level, document level, entity recognition level, and OCR level.
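
One way to capture these levels is to publish custom CloudWatch metrics with dimensions for document type and workflow stage, as in the following sketch; the namespace, metric names, and dimensions are illustrative assumptions rather than a prescribed schema.

import boto3

cloudwatch = boto3.client('cloudwatch')

def publish_processing_metrics(document_type, stage, latency_seconds, succeeded):
    # One call per processed document, dimensioned by document type and workflow stage
    cloudwatch.put_metric_data(
        Namespace='IDP/Workflow',
        MetricData=[
            {
                'MetricName': 'ProcessingLatency',
                'Dimensions': [
                    {'Name': 'DocumentType', 'Value': document_type},
                    {'Name': 'Stage', 'Value': stage}
                ],
                'Value': latency_seconds,
                'Unit': 'Seconds'
            },
            {
                'MetricName': 'ProcessingErrors',
                'Dimensions': [
                    {'Name': 'DocumentType', 'Value': document_type},
                    {'Name': 'Stage', 'Value': stage}
                ],
                'Value': 0 if succeeded else 1,
                'Unit': 'Count'
            }
        ]
    )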

Now that you have reviewed all the questions in this pillar, you can assess the other pillars and develop an improvement plan for your IDP workload.

Conclusion

In this post, we discussed common indicators that you may need to perform a Well-Architected Framework review of the Performance Efficiency pillar for your IDP workload. We then walked through the design principles to provide a high-level overview and discuss the solution goals. By following these suggestions in reference to the IDP Well-Architected Custom Lens and by reviewing the questions by focus area, you should now have a project improvement plan.

To learn more about the IDP Well-Architected Custom Lens, explore the other posts in this series.


About the Authors

Mia Chang is an ML Specialist Solutions Architect for Amazon Web Services. She works with customers in EMEA and shares best practices for running AI/ML workloads on the cloud with her background in applied mathematics, computer science, and AI/ML. She focuses on NLP-specific workloads, and shares her experience as a conference speaker and a book author. In her free time, she enjoys hiking, board games, and brewing coffee.

Brijesh Pati is an Enterprise Solutions Architect at AWS. His primary focus is helping enterprise customers adopt cloud technologies for their workloads. He has a background in application development and enterprise architecture and has worked with customers from various industries such as sports, finance, energy and professional services. His interests include serverless architectures and AI/ML.

Rui Cardoso is a partner solutions architect at Amazon Web Services (AWS). He focuses on AI/ML and IoT. He works with AWS Partners and supports them in developing solutions on AWS. When not working, he enjoys cycling, hiking, and learning new things.

Tim Condello is a senior artificial intelligence (AI) and machine learning (ML) specialist solutions architect at Amazon Web Services (AWS). His focus is natural language processing and computer vision. Tim enjoys taking customer ideas and turning them into scalable solutions.

Sherry Ding is a senior artificial intelligence (AI) and machine learning (ML) specialist solutions architect at Amazon Web Services (AWS). She has extensive experience in machine learning with a PhD degree in computer science. She mainly works with public sector customers on various AI/ML related business challenges, helping them accelerate their machine learning journey on the AWS Cloud. When not helping customers, she enjoys outdoor activities.

Suyin Wang is an AI/ML Specialist Solutions Architect at AWS. She has an interdisciplinary education background in Machine Learning, Financial Information Service and Economics, along with years of experience in building Data Science and Machine Learning applications that solved real-world business problems. She enjoys helping customers identify the right business questions and building the right AI/ML solutions. In her spare time, she loves singing and cooking.

Read More