Amazon AWS – Page 114

Your guide to generative AI and ML at AWS re:Invent 2023

November 22, 2023

by Denis V. Batalov Amazon AWS

Yes, the AWS re:Invent season is upon us and as always, the place to be is Las Vegas! You marked your calendars, you booked your hotel, and you even purchased the airfare. Now all you need is some guidance on generative AI and machine learning (ML) sessions to attend at this twelfth edition of re:Invent. And although generative AI has appeared in previous events, this year we’re taking it to the next level. In addition to several exciting announcements during keynotes, most of the sessions in our track will feature generative AI in one form or another, so we can truly call our track “Generative AI and ML.” In this post, we give you a sense of how the track is organized and highlight a few sessions we think you’ll like. And although our track focuses on generative AI, many other tracks have related sessions. Use the “Generative AI” tag as you are browsing the session catalog to find them.

The technical sessions in our track are divided into five areas. First, we’ll have a few foundational sessions related to various aspects of Amazon Bedrock—a fully managed generative AI service we launched earlier this year. These will help you understand the building blocks of your generative AI applications. Second, we’ll have sessions covering the common generative AI use cases and applications. Here you’ll also have a chance to discover novel use cases and techniques. Third, a number of sessions will be of interest to ML practitioners who build, deploy, and operationalize both traditional and generative AI models. This year, learn about LLMOps, not just MLOps! Then, as we started doing last re:Invent, we’ll be offering several sessions on how to build AI responsibly. The greater the power of latest transformer-based models, the greater the responsibility of all ML practitioners to do this right. Be sure to check out the session on the just launched PartyRock, an educational tool for providing any builder with low-friction access to learn through experimentation in a foundation model playground built on Amazon Bedrock. And last but not least (and always fun!) are the sessions dedicated to AWS DeepRacer!

Generative AI is at the heart of the AWS Village this year. Interact with several demos that feature new applications, including a competition that involves using generative AI tech to pilot a drone around an obstacle course. Talk with AWS experts in 14 different industries and explore industry-specific generative AI use cases, including demos from advertising and marketing, aerospace and satellite, manufacturing, and more. The Emerging Tech Zone within the Expo features innovative startups that were selected into the AWS Generative AI Accelerator and the NVIDIA Inception 100 programs.

If you’re new to re:Invent, you can attend sessions of the following types:

Keynotes – Join in person or virtually and learn about all the exciting announcements.
Innovation Talks – Learn about the latest cloud technology from AWS technology leaders and discover how these advancements can help you push your business forward. These sessions will be livestreamed, recorded, and published to YouTube.
Breakout sessions – These 60-minute sessions are expected to have broad appeal, are delivered to larger audiences, and will be recorded. If you miss them, you can watch them on demand after re:Invent.
Chalk talks – Enjoy 60 minutes of content delivered to smaller audiences with an interactive whiteboarding session. Chalk talks are where discussions happen, and these offer you the greatest opportunity to ask questions or share your opinion.
Workshops – In these hands-on learning opportunities, in the course of 2 hours, you’ll be able to build a solution to a problem, and understand the inner workings of the resulting infrastructure and cross-service interaction. Bring your laptop and be ready to learn!
Builders’ sessions – These highly interactive 60-minute mini-workshops are conducted in small groups of fewer than 10 attendees. Some of these appeal to beginners, and others are on specialized topics.
NEW! Code talks – In this new session type for re:Invent 2023, code talks are similar to our popular chalk talk format, but instead of focusing on an architecture solution with whiteboarding, the speakers lead an interactive discussion featuring live coding or code samples. These 60-minute sessions focus on the actual code that goes into building a solution. Attendees are encouraged to ask questions and follow along.

If you have reserved your seat at any of the sessions, great! If not, we always set aside some spots for walk-ins, so make a plan and come to the session early.

To help you plan your agenda for this year’s re:Invent, here are some highlights of the generative AI and ML track. So buckle up, and start registering for your favorite sessions.

Visit the session catalog to learn about all our generative AI and ML sessions.

Keynotes

Adam Selipsky, Chief Executive Officer, Amazon Web Services – Keynote

Tuesday November 28 | 8:30 AM – 10:30 AM (PST) | The Venetian

Join Adam Selipsky, CEO of Amazon Web Services, as he shares his perspective on cloud transformation. He highlights innovations in data, infrastructure, and artificial intelligence and machine learning that are helping AWS customers achieve their goals faster, mine untapped potential, and create a better future.

Swami Sivasubramanian, Vice President of AWS Data and Machine Learning – Keynote

Wednesday November 29 | 8:30 AM – 10:30 AM (PST) | The Venetian

A powerful relationship between humans, data, and AI is unfolding right before us. Generative AI is augmenting our productivity and creativity in new ways, while also being fueled by massive amounts of enterprise data and human intelligence. Join Swami Sivasubramanian, Vice President of Data and AI at AWS, to discover how you can use your company data to build differentiated generative AI applications and accelerate productivity for employees across your organization. Also hear from customer speakers with real-world examples of how they’ve used their data to support their generative AI use cases and create new experiences for their customers.

Innovation Talks

Dr. Bratin Saha, VP of AWS AI and ML Services | AIM245-INT | Innovate faster with generative AI

Wednesday November 29 | 1:00 PM – 2:00 PM (PST) | Venetian | Level 5 | Palazzo Ballroom B

With the emergence of generative AI, we are at a tipping point in the widespread adoption of machine learning. Join Dr. Bratin Saha, VP of AWS AI and ML Services, to hear how customers across industries are transforming their business with the latest breakthroughs in AI and ML, including generative AI. Discover the latest AWS innovations, hear from top customers, and explore where AI/ML is headed.

Francessca Vasquez, Vice President of Professional Services | ARC217-INT | From hype to impact: Building a generative AI architecture

Wednesday November 29 | 11:30 AM – 12:30 PM (PST) | Venetian | Level 5 | Palazzo Ballroom B

Generative AI represents a paradigm shift for how companies operate today. Generative AI is empowering developers to reimagine customer experiences and applications while transforming virtually every industry. Organizations are rapidly innovating to create the right architecture for scaling generative AI securely, economically, and responsibly to deliver business value. In this talk, learn how leaders are modernizing their data foundation, selecting industry-leading foundation models, and deploying purpose-built accelerators to unlock the possibilities of generative AI.

Shaown Nandi, AWS Director of Technology for Industries and Strategic Accounts | AIM248-INT | Unlocking the industry potential of generative AI

Wednesday November 29 | 4:00 PM – 5:00 PM (PST) | Venetian | Level 5 | Palazzo Ballroom B

Generative AI has captured the imagination of many industries and is poised to bring in the next wave of technological advancements. In this innovation talk, hear how the largest industries, from healthcare and financial services to automotive and media and entertainment, are using generative AI to drive outcomes for their customers. Join Shaown Nandi, AWS Director of Technology for Industries and Strategic Accounts, and industry leaders to hear how generative AI is accelerating content creation and helping organizations reimagine customer experiences.

Mai-Lan Tomsen Bukovec, Vice President, Technology | AIM250-INT | Putting your data to work with generative AI

Thursday November 30 | 12:30 PM – 1:30 PM (PST) | Venetian | Level 5 | Palazzo Ballroom B

How can you turn your data lake into a business advantage with generative AI? In this talk, explore strategies for putting your proprietary datasets to work when building unique, differentiated generative AI solutions. Learn how to utilize your datasets using Amazon SageMaker and Amazon Bedrock as well as popular frameworks like PyTorch with AWS compute, storage, and analytics. Hear best practices for using unstructured (video, image, PDF), semi-structured (Parquet), and table-formatted (Iceberg) data for training, fine-tuning, checkpointing, and prompt engineering. Also hear different architectural patterns that customers use today to harness their business data for customized generative AI solutions.

Breakout sessions

AIM218 (LVL 200) | Build your first generative AI application with Amazon Bedrock

Monday November 27 | 2:30 PM – 3:30 PM (PST)

We are truly at an exciting inflection point in the widespread adoption of ML with the growth of generative AI applications. In this session, learn how to build your first generative AI application with key services such as Amazon Bedrock. Get hints and tips for getting started fast, and see example reference architectures for common use cases built with AWS AI and ML such as self-service customer support, text analysis, report generation, post-call analysis, and forecasting trends.

Reserve your seat now!

AIM225 (LVL 200) | Drive personalized CX using generative AI and Amazon Personalize

Tuesday November 28 | 5:00 PM – 6:00 PM (PST)

Delivering the best experience is critical to capture and retain customers today. With generative AI, it is possible to hyper-personalize targeted recommendations for shopping and streaming. While standard taglines like “People who bought this also bought . . .” or “Because you watched . . .” entice some, they don’t fully address individual interests. Companies must find ways to dynamically generate compelling, highly customized content. Amazon Personalize delivers capabilities powered by ML and generative AI to help brands create meaningful experiences. Join this session to hear from powerhouse AWS media customer FOX and learn how hyper-personalized experiences can be used to build engagement and drive revenue.

Reserve your seat now!

AIM327 (LVL 300) | Scaling FM inference to hundreds of models with Amazon SageMaker

Wednesday November 29 | 4:30 PM – 5:30 PM (PST)

Companies need robust and cost-effective solutions to deploy foundation models at scale. Additionally, SaaS providers need scalable and cost-effective ways to serve hundreds of models to their customers. This session explores how to use Amazon SageMaker to roll out hundreds of FMs cost-effectively at scale. Get a detailed overview of deployment strategies to support large-scale generative AI inferencing for SaaS, and learn how to architect solutions that maximize scaling capabilities for performance and cost.

Reserve your seat now!

AIM333 (LVL 300) | Explore text-generation FMs for top use cases with Amazon Bedrock

Tuesday November 28| 2:00 PM – 3:00 PM (PST)

Foundation models can be used for natural language processing tasks such as summarization, text generation, classification, open-ended Q&A, and information extraction. With Amazon Bedrock, you can choose powerful FMs from AI21 Labs, Anthropic, and Cohere to find the right FM for your use case such as the Jurassic-2, Claude, and Command families of text-generation FMs. Join this session to learn which FM is best suited for your use case.

Reserve your seat now!

AIM332 (LVL 300) | Explore image generation and search with FMs on Amazon Bedrock

Thursday November 30 | 11:00 AM – 12:00 PM (PST)

Foundation models understand multiple forms of input, such as images and texts. Join this session to learn how to build transformational experiences using images in Amazon Bedrock.

Reserve your seat now!

AIM377 (LVL 300) | Prompt engineering best practices for LLMs on Amazon Bedrock

Monday November 27 | 9:00 AM – 10:00 AM (PST)

Prompt engineering is the process of guiding large language models to produce desired outputs. In this session, get an overview of prompt engineering best practices and learn how to choose the most appropriate formats, phrases, words, and symbols to get the most out of generative AI solutions while improving accuracy and performance. This session uses the Claude 2 LLM as an example of how prompt engineering helps to solve complex customer use cases. Also learn how prompts can be integrated with your architecture and how to use API parameters for tuning the model parameters using Amazon Bedrock.

Reserve your seat now!

Chalk talks

AIM341 (LVL 300) | Deliver customized search capabilities using Amazon Bedrock

Wednesday November 29 | 5:30 PM – 6:30 PM (PST)

Vector embeddings are numerical representations of your text, image, audio, and video data that can be used to understand the relationship between sentences or words to find more relevant and contextual information in response to a user query. Embeddings can be stored in a database and are used to enable streamlined and more accurate searches. You can use an embeddings model in Amazon Bedrock to create vectors of your organization’s data, which can then be used to enable semantic search. Join this hands-on chalk talk to learn how.

Reserve your seat now!

AIM340-R (LVL 300) | Customize your FMs securely to deliver differentiated experiences

Wednesday November 29 | 6:00 PM – 7:00 PM (PST)

Foundation model customizations help you build differentiated generative AI applications using your own data. It’s easy to securely customize models in Amazon Bedrock. You can point Amazon Bedrock at a few labeled examples in Amazon S3, and the service can fine-tune the FM for a particular task without having to annotate large volumes of data; none of your data is used to train the original base FMs. Join this chalk talk for a deep dive on FM customizations through an interactive demo.

Reserve your seat now!

This session will be repeated Thursday, November 30 11:00 AM – 12:00 PM (PST), and Friday, December 1 8:30 AM – 9:30 AM PST.

AIM342 (LVL 300) | Advancing responsible AI: Assessing and mitigating risk

Wednesday November 29 | 4:30 PM – 5:30 PM (PST)

Risk assessment is an essential part of developing AI solutions responsibly, especially with emerging industry standards and laws regarding AI risk, such as ISO 42001 and the EU AI Act. This chalk talk provides an introduction to best practices for risk assessment related to fairness, robustness, explainability, privacy and security, transparency, and governance. Explore examples to estimate the severity and likelihood of potential events that could be harmful. Learn about Amazon SageMaker tooling for model governance, bias, explainability, and monitoring, and about transparency in the form of service cards as potential risk mitigation strategies.

Reserve your seat now!

AIM347-R (LVL 300) | Next-generation ML builder experience

Thursday November 30 | 4:00 PM – 5:00 PM (PST)

Amazon SageMaker offers different integrated development environments (IDEs) that are purpose-built for machine learning. In this chalk talk, learn how to select and use your preferred environment to perform end-to-end ML development steps, from preparing data to building, training, and deploying your ML models. Discover how you can quickly upload data, create new notebooks, train and tune models, move back and forth between steps to adjust experiments, collaborate seamlessly within your organization, and deploy models to production all in one place.

Reserve your seat now!

This session will be repeated Friday, December 1 9:00 AM – 10:00 AM (PST), and Friday, December 1 11:30 AM – 12:00 PM (PST).

AIM352-R (LVL 300) | Securely build generative AI apps and control data with Amazon Bedrock

Monday November 27 | 11:30 AM – 12:30 PM (PST)

Generative AI applications have captured widespread attention; however, they have also introduced new security challenges, especially around the handling of customer data. Organizations want to ensure that their data remains safe and secure while working with foundation models and don’t want to worry about their data being used to train an FM. Amazon Bedrock provides comprehensive data protection and privacy. In this chalk talk, explore architectures, data flows, and security-related aspects of model fine-tuning, as well as prompting and inference, while you learn about Amazon Bedrock’s security capabilities.

Reserve your seat now!

This session will be repeated Wednesday, November 29 6:00 PM – 7:00 PM (PST), and Thursday, November 30 4:00 PM – 5:00 PM (PST).

AIM404 (LVL 400) | Train and deploy FMs on Amazon EC2 and Amazon SageMaker, feat. Flip AI

Wednesday November 29 | 2:30 PM – 3:30 PM (PST)

Organizations that are running machine learning systems and generative AI applications on their local laptops/servers want to take advantage of the scalability and performance of the AWS Cloud. In this chalk talk, hear about compute and ML services from self-managed Amazon EC2 to fully managed Amazon SageMaker that you can use to build, train, and deploy foundation models. See a demo of how you can fine-tune a Stable Diffusion model on Amazon EC2 and then deploy it on SageMaker using the AWS Deep Learning AMIs (DLAMI) and AWS Deep Learning Containers. Also, hear how Flip AI built their own models using these AWS services.

Reserve your seat now!

Workshops

AIM302 (LVL 300) | Use generative AI to extract insights from contact center recordings

Monday November 27 | 8:30 AM – 10:30 AM (PST)

Learn how to derive insights from contact center recordings and other media using Amazon Transcribe and generative AI. In this workshop, learn how to combine automatic call recording, transcription, post-call analysis, sentiment analysis, issue detection, and call summarization from your own telephony recordings (Cisco, Genesys, Talkdesk, Avaya, and more) using AWS Contact Center Intelligence (CCI) solutions and generative AI. See demos on how to build analytics dashboards and integrations between LLMs and Amazon QuickSight to visualize your key metrics. You must bring your laptop to participate.

Reserve your seat now!

AIM307 (LVL 300) | Retrieval Augmented Generation with Amazon Bedrock

Wednesday November 29 | 8:30 AM – 10:30 AM (PST)

Large language models are often limited by the data they were trained on and don’t always provide up-to-date responses—or worse, they make things up. To overcome this limitation, you can supplement prompts with up-to-date information using embeddings stored in vector databases, a process known as Retrieval Augmented Generation (RAG). With supplemental information in the prompt providing more context, the LLM can respond more accurately and is less likely to hallucinate. In this workshop, learn how to use vector databases with Amazon Bedrock, a service that makes foundation models from Amazon and leading AI companies available via a single API. You must bring your laptop to participate.

Reserve your seat now!

AIM304 (LVL 300) | How to generate text responsibly using foundation models on AWS

Wednesday November 29 | 5:30 PM – 7:30 PM (PST)

Foundation models such as Claude are commonly used to create new pieces of original content, such as short stories, essays, social media posts, and webpage copy, and also to summarize text from articles, blog posts, books, and more. In this workshop, learn how you can generate text in minutes using foundation models available through Amazon Bedrock in a responsible way. You must bring your laptop to participate.

Reserve your seat now!

Code talks

AIM364-R (LVL 300) | Boost ML development with Amazon SageMaker Studio notebooks

Tuesday November 28 | 4:00 PM – 5:00 PM (PST)

Amazon SageMaker Studio notebooks are collaborative notebooks that you can launch quickly and that can help you integrate with purpose-built ML tools in SageMaker and other AWS services for complete ML development. In this code talk, learn how to prepare data at scale using built-in data preparation assistance, co-edit the same notebook in real time, and automate conversion of notebook code to production-ready jobs. This talk also introduces the new generative AI-powered features that can help you maximize productivity, write higher-quality code, and improve security.

Reserve your seat now!

This session will be repeated Wednesday, November 29 12:00 PM – 1:00 PM (PST).

Builders’ sessions

AIM219-R (LVL 200) | Learn and experiment with LLMs in Amazon SageMaker Studio Lab

Monday November 27 | 10:00 AM – 11:00 AM (PST)

Machine learning can sound complicated, but Amazon SageMaker Studio Lab makes it easier for anyone to get started at no cost. In this hands-on builders’ session, be guided through the basics of experimenting with large language models in Amazon SageMaker Studio Lab. No prior machine learning experience is required. You must bring your laptop to participate.

This session will be repeated Monday, November 27 4:00 PM – 5:00 PM (PST), Tuesday, November 28 3:30 PM – 4:30 PM (PST), Wednesday, November 29, 12:00 PM – 1:00 PM (PST), and Thursday, November 30 11:30 AM – 12:30 PM (PST).

Reserve your seat now!

AWS DeepRacer

Get ready to race with AWS DeepRacer at re:Invent 2023!

Developers, fasten your seatbelts—AWS DeepRacer is bringing ML to everyone at re:Invent! Whether you’re looking to get started with ML or improve your skills, AWS DeepRacer offers an exciting way to get hands-on with ML.

Watch the world’s top 72 racers of the AWS DeepRacer 2023 League battle it out Monday through Wednesday at our Championship Stadium at the Venetian Expo. It will all come down to the finale on Wednesday (November 29) at 2:30 PM (PST) as the eight finalists compete for the cup and $44,000 in prize money. You can also get behind the wheel yourself on November 30, when the track opens for the 2024 Open Racing. Post the fastest time and you’ll win a ticket back to Vegas for the 2024 Championship!

Dive into 10 not-to-miss workshops where you’ll learn to train reinforcement learning models, solve business problems with generative AI, and more. Want to learn tips and tricks from the best racers in the world? Be sure to check out our DPR301 workshop featuring five of our top AWS DeepRacer League Champions who will be sharing their approaches for training their AWS DeepRacer models and answering questions during an open Q&A.

Don’t forget to check out the rest of the AWS DeepRacer workshops before they fill up to reserve your spot! Whether you take a workshop, take a spin in our gamified virtual racing simulator, catch the global competition, or test your own ML model on the track, AWS DeepRacer brings the thrill of high-speed racing to hands-on machine learning at re:Invent. Let the countdown begin. We can’t wait to see you in Las Vegas!

See you at re:Invent!

Make sure to check out the re:Invent content catalog and the generative AI at re:Invent guide for more gen AI and ML content at re:Invent. We’ll see you there!

About the authors

Denis V. Batalov is a 17-year Amazon veteran and a PhD in Machine Learning, Denis worked on such exciting projects as Search Inside the Book, Amazon Mobile apps and Kindle Direct Publishing. Since 2013 he has helped AWS customers adopt AI/ML technology as a Solutions Architect. Currently, Denis is a Worldwide Tech Leader for AI/ML responsible for the functioning of AWS ML Specialist Solutions Architects globally. Denis is a frequent public speaker, you can follow him on Twitter @dbatalov.

Paxton Hall is a Marketing Program Manager for the AWS AI/ML Community on the AI/ML Education team at AWS. He has worked in retail and experiential marketing for the past 7 years, focused on developing communities and marketing campaigns. Out of the office, he’s passionate about public lands access and conservation, and enjoys backcountry skiing, climbing, biking, and hiking throughout Washington’s Cascade mountains.

Build a contextual chatbot for financial services using Amazon SageMaker JumpStart, Llama 2 and Amazon OpenSearch Serverless with Vector Engine

November 22, 2023

by Sunil Padmanabhan Amazon AWS

The financial service (FinServ) industry has unique generative AI requirements related to domain-specific data, data security, regulatory controls, and industry compliance standards. In addition, customers are looking for choices to select the most performant and cost-effective machine learning (ML) model and the ability to perform necessary customization (fine-tuning) to fit their business use cases. Amazon SageMaker JumpStart is ideally suited for generative AI use cases for FinServ customers because it provides the necessary data security controls and meets compliance standards requirements.

In this post, we demonstrate question answering tasks using a Retrieval Augmented Generation (RAG)-based approach with large language models (LLMs) in SageMaker JumpStart using a simple financial domain use case. RAG is a framework for improving the quality of text generation by combining an LLM with an information retrieval (IR) system. The LLM generated text, and the IR system retrieves relevant information from a knowledge base. The retrieved information is then used to augment the LLM’s input, which can help improve the accuracy and relevance of the model generated text. RAG has been shown to be effective for a variety of text generation tasks, such as question answering and summarization. It is a promising approach for improving the quality and accuracy of text generation models.

Advantages of using SageMaker JumpStart

With SageMaker JumpStart, ML practitioners can choose from a broad selection of state-of-the-art models for use cases such as content writing, image generation, code generation, question answering, copywriting, summarization, classification, information retrieval, and more. ML practitioners can deploy foundation models to dedicated Amazon SageMaker instances from a network isolated environment and customize models using SageMaker for model training and deployment.

SageMaker JumpStart is ideally suited for generative AI use cases for FinServ customers because it offers the following:

Customization capabilities – SageMaker JumpStart provides example notebooks and detailed posts for step-by-step guidance on domain adaptation of foundation models. You can follow these resources for fine-tuning, domain adaptation, and instruction of foundation models or to build RAG-based applications.
Data security – Ensuring the security of inference payload data is paramount. With SageMaker JumpStart, you can deploy models in network isolation with single-tenancy endpoint provision. Furthermore, you can manage access control to selected models through the private model hub capability, aligning with individual security requirements.
Regulatory controls and compliances – Compliance with standards such as HIPAA BAA, SOC123, PCI, and HITRUST CSF is a core feature of SageMaker, ensuring alignment with the rigorous regulatory landscape of the financial sector.
Model choices – SageMaker JumpStart offers a selection of state-of-the-art ML models that consistently rank among the top in industry-recognized HELM benchmarks. These include, but are not limited to, Llama 2, Falcon 40B, AI21 J2 Ultra, AI21 Summarize, Hugging Face MiniLM, and BGE models.

In this post, we explore building a contextual chatbot for financial services organizations using a RAG architecture with the Llama 2 foundation model and the Hugging Face GPTJ-6B-FP16 embeddings model, both available in SageMaker JumpStart. We also use Vector Engine for Amazon OpenSearch Serverless (currently in preview) as the vector data store to store embeddings.

Limitations of large language models

LLMs have been trained on vast volumes of unstructured data and excel in general text generation. Through this training, LLMs acquire and store factual knowledge. However, off-the-shelf LLMs present limitations:

Their offline training renders them unaware of up-to-date information.
Their training on predominantly generalized data diminishes their efficacy in domain-specific tasks. For instance, a financial firm might prefer its Q&A bot to source answers from its latest internal documents, ensuring accuracy and compliance with its business rules.
Their reliance on embedded information compromises interpretability.

To use specific data in LLMs, three prevalent methods exist:

Embedding data within the model prompts, allowing it to utilize this context during output generation. This can be zero-shot (no examples), few-shot (limited examples), or many-shot (abundant examples). Such contextual prompting steers models towards more nuanced results.
Fine-tuning the model using pairs of prompts and completions.
RAG, which retrieves external data (non-parametric) and integrates this data into the prompts, enriching the context.

However, the first method grapples with model constraints on context size, making it tough to input lengthy documents and possibly increasing costs. The fine-tuning approach, while potent, is resource-intensive, particularly with ever-evolving external data, leading to delayed deployments and increased costs. RAG combined with LLMs offers a solution to the previously mentioned limitations.

Retrieval Augmented Generation

RAG retrieves external data (non-parametric) and integrates this data into ML prompts, enriching the context. Lewis et al. introduced RAG models in 2020, conceptualizing them as a fusion of a pre-trained sequence-to-sequence model (parametric memory) and a dense vector index of Wikipedia (non-parametric memory) accessed via a neural retriever.

Here’s how RAG operates:

Data sources – RAG can draw from varied data sources, including document repositories, databases, or APIs.
Data formatting – Both the user’s query and the documents are transformed into a format suitable for relevancy comparisons.
Embeddings – To facilitate this comparison, the query and the document collection (or knowledge library) are transformed into numerical embeddings using language models. These embeddings numerically encapsulate textual concepts.
Relevancy search – The user query’s embedding is compared to the document collection’s embeddings, identifying relevant text through a similarity search in the embedding space.
Context enrichment – The identified relevant text is appended to the user’s original prompt, thereby enhancing its context.
LLM processing – With the enriched context, the prompt is fed to the LLM, which, due to the inclusion of pertinent external data, produces relevant and precise outputs.
Asynchronous updates – To ensure the reference documents remain current, they can be updated asynchronously along with their embedding representations. This ensures that future model responses are grounded in the latest information, guaranteeing accuracy.

In essence, RAG offers a dynamic method to infuse LLMs with real-time, relevant information, ensuring the generation of precise and timely outputs.

The following diagram shows the conceptual flow of using RAG with LLMs.

Solution overview

The following steps are required to create a contextual question answering chatbot for a financial services application:

Use the SageMaker JumpStart GPT-J-6B embedding model to generate embeddings for each PDF document in the Amazon Simple Storage Service (Amazon S3) upload directory.
Identify relevant documents using the following steps:
- Generate an embedding for the user’s query using the same model.
- Use OpenSearch Serverless with the vector engine feature to search for the top K most relevant document indexes in the embedding space.
- Retrieve the corresponding documents using the identified indexes.
Combine the retrieved documents as context with the user’s prompt and question. Forward this to the SageMaker LLM for response generation.

We employ LangChain, a popular framework, to orchestrate this process. LangChain is specifically designed to bolster applications powered by LLMs, offering a universal interface for various LLMs. It streamlines the integration of multiple LLMs, ensuring seamless state persistence between calls. Moreover, it boosts developer efficiency with features like customizable prompt templates, comprehensive application-building agents, and specialized indexes for search and retrieval. For an in-depth understanding, refer to the LangChain documentation.

Prerequisites

You need the following prerequisites to build our context-aware chatbot:

An AWS account with appropriate AWS Identity and Access Management (IAM) permissions.
An Amazon SageMaker Studio domain and user. For setup instructions, refer to Onboard to Amazon SageMaker Domain using Quick setup.
An OpenSearch Serverless collection.
A SageMaker execution role with access to OpenSearch Serverless.

For instructions on how to set up an OpenSearch Serverless vector engine, refer to Introducing the vector engine for Amazon OpenSearch Serverless, now in preview.

For a comprehensive walkthrough of the following solution, clone the GitHub repo and refer to the Jupyter notebook.

Deploy the ML models using SageMaker JumpStart

To deploy the ML models, complete the following steps:

Deploy the Llama 2 LLM from SageMaker JumpStart:

from sagemaker.jumpstart.model import JumpStartModel
llm_model = JumpStartModel(model_id = "meta-textgeneration-llama-2-7b-f")
llm_predictor = llm_model.deploy()
llm_endpoint_name = llm_predictor.endpoint_name

Deploy the GPT-J embeddings model:

embeddings_model = JumpStartModel(model_id = "huggingface-textembedding-gpt-j-6b-fp16")
embed_predictor = embeddings_model.deploy()
embeddings_endpoint_name = embed_predictor.endpoint_name

Chunk data and create a document embeddings object

In this section, you chunk the data into smaller documents. Chunking is a technique for splitting large texts into smaller chunks. It’s an essential step because it optimizes the relevance of the search query for our RAG model, which in turn improves the quality of the chatbot. The chunk size depends on factors such as the document type and the model used. A chunk chunk_size=1600 has been selected because this is the approximate size of a paragraph. As models improve, their context window size will increase, allowing for larger chunk sizes.

Refer to the Jupyter notebook in the GitHub repo for the complete solution.

Extend the LangChain SageMakerEndpointEmbeddings class to create a custom embeddings function that uses the gpt-j-6b-fp16 SageMaker endpoint you created earlier (as part of employing the embeddings model):

from langchain.embeddings import SagemakerEndpointEmbeddings
from langchain.embeddings.sagemaker_endpoint import EmbeddingsContentHandler

logger = logging.getLogger(__name__)

# extend the SagemakerEndpointEmbeddings class from langchain to provide a custom embedding function
class SagemakerEndpointEmbeddingsJumpStart(SagemakerEndpointEmbeddings):
    def embed_documents(
        self, texts: List[str], chunk_size: int = 1
    ) → List[List[float]]:
        """Compute doc embeddings using a SageMaker Inference Endpoint.
 
        Args:
            texts: The list of texts to embed.
            chunk_size: The chunk size defines how many input texts will
                be grouped together as request. If None, will use the
                chunk size specified by the class.

        Returns:
            List of embeddings, one for each text.
        """
        results = []
        _chunk_size = len(texts) if chunk_size > len(texts) else chunk_size
        st = time.time()
        for i in range(0, len(texts), _chunk_size):
            response = self._embedding_func(texts[i : i + _chunk_size])
            results.extend(response)
        time_taken = time.time() - st
        logger.info(
            f"got results for {len(texts)} in {time_taken}s, length of embeddings list is {len(results)}"
        )
        print(
            f"got results for {len(texts)} in {time_taken}s, length of embeddings list is {len(results)}"
        )
        return results

# class for serializing/deserializing requests/responses to/from the embeddings model
class ContentHandler(EmbeddingsContentHandler):
    content_type = "application/json"
    accepts = "application/json"
 
    def transform_input(self, prompt: str, model_kwargs={}) → bytes:
 
        input_str = json.dumps({"text_inputs": prompt, **model_kwargs})
        return input_str.encode("utf-8")
 
    def transform_output(self, output: bytes) → str:
 
        response_json = json.loads(output.read().decode("utf-8"))
        embeddings = response_json["embedding"]
        if len(embeddings) == 1:
            return [embeddings[0]]
        return embeddings

def create_sagemaker_embeddings_from_js_model(
    embeddings_endpoint_name: str, aws_region: str
) → SagemakerEndpointEmbeddingsJumpStart:
 
    content_handler = ContentHandler()
    embeddings = SagemakerEndpointEmbeddingsJumpStart(
        endpoint_name=embeddings_endpoint_name,
        region_name=aws_region,
        content_handler=content_handler,
    )
    return embeddings

Create the embeddings object and batch the creation of the document embeddings:

embeddings = create_sagemaker_embeddings_from_js_model(embeddings_endpoint_name, aws_region)

These embeddings are stored in the vector engine using LangChain OpenSearchVectorSearch. You store these embeddings in the next section. Store the document embedding in OpenSearch Serverless. You’re now ready to iterate over the chunked documents, create the embeddings, and store these embeddings in the OpenSearch Serverless vector index created in vector search collections. See the following code:
```
docsearch = OpenSearchVectorSearch.from_texts(
texts = [d.page_content for d in docs],
embedding=embeddings,
opensearch_url=[{'host': _aoss_host, 'port': 443}],
http_auth=awsauth,
timeout = 300,
use_ssl = True,
verify_certs = True,
connection_class = RequestsHttpConnection,
index_name=_aos_index
)
```

Question and answering over documents

So far, you have chunked a large document into smaller ones, created vector embeddings, and stored them in a vector engine. Now you can answer questions regarding this document data. Because you created an index over the data, you can do a semantic search; this way, only the most relevant documents required to answer the question are passed via the prompt to the LLM. This allows you to save time and money by only passing relevant documents to the LLM. For more details on using document chains, refer to Documents.

Complete the following steps to answer questions using the documents:

To use the SageMaker LLM endpoint with LangChain, you use langchain.llms.sagemaker_endpoint.SagemakerEndpoint, which abstracts the SageMaker LLM endpoint. You perform a transformation for the request and response payload as shown in the following code for the LangChain SageMaker integration. Note that you may need to adjust the code in ContentHandler based on the content_type and accepts format of the LLM model you choose to use.

content_type = "application/json"
accepts = "application/json"
def transform_input(self, prompt: str, model_kwargs: dict) → bytes:
        payload = {
            "inputs": [
                [
                    {
                        "role": "system",
                        "content": prompt,
                    },
                    {"role": "user", "content": prompt},
                ],
            ],
            "parameters": {
                "max_new_tokens": 1000,
                "top_p": 0.9,
                "temperature": 0.6,
            },
        }
        input_str = json.dumps(
            payload,
        )
        return input_str.encode("utf-8")

def transform_output(self, output: bytes) → str:
    response_json = json.loads(output.read().decode("utf-8"))
    content = response_json[0]["generation"]["content"]

    return content

content_handler = ContentHandler()

sm_jumpstart_llm=SagemakerEndpoint(
        endpoint_name=llm_endpoint_name,
        region_name=aws_region,
        model_kwargs={"max_new_tokens": 300},
        endpoint_kwargs={"CustomAttributes": "accept_eula=true"},
        content_handler=content_handler,
    )

Now you’re ready to interact with the financial document.

Use the following query and prompt template to ask questions regarding the document:

from langchain import PromptTemplate, SagemakerEndpoint
from langchain.llms.sagemaker_endpoint import LLMContentHandler

query = "Summarize the earnings report and also what year is the report for"
prompt_template = """Only use context to answer the question at the end.
 
{context}
 
Question: {question}
Answer:"""

prompt = PromptTemplate(
    template=prompt_template, input_variables=["context", "question"]
)
 
 
class ContentHandler(LLMContentHandler):
    content_type = "application/json"
    accepts = "application/json"

    def transform_input(self, prompt: str, model_kwargs: dict) → bytes:
        payload = {
            "inputs": [
                [
                    {
                        "role": "system",
                        "content": prompt,
                    },
                    {"role": "user", "content": prompt},
                ],
            ],
            "parameters": {
                "max_new_tokens": 1000,
                "top_p": 0.9,
                "temperature": 0.6,
            },
        }
        input_str = json.dumps(
            payload,
        )
        return input_str.encode("utf-8")
 
    def transform_output(self, output: bytes) → str:
        response_json = json.loads(output.read().decode("utf-8"))
        content = response_json[0]["generation"]["content"]
        return content

content_handler = ContentHandler()
 
chain = load_qa_chain(
    llm=SagemakerEndpoint(
        endpoint_name=llm_endpoint_name,
        region_name=aws_region,
        model_kwargs={"max_new_tokens": 300},
        endpoint_kwargs={"CustomAttributes": "accept_eula=true"},
        content_handler=content_handler,
    ),
    prompt=prompt,
)
sim_docs = docsearch.similarity_search(query, include_metadata=False)
chain({"input_documents": sim_docs, "question": query}, return_only_outputs=True)

Cleanup

To avoid incurring future costs, delete the SageMaker inference endpoints that you created in this notebook. You can do so by running the following in your SageMaker Studio notebook:

# Delete LLM
llm_predictor.delete_model()
llm_predictor.delete_predictor(delete_endpoint_config=True)

# Delete Embeddings Model
embed_predictor.delete_model()
embed_predictor.delete_predictor(delete_endpoint_config=True)

If you created an OpenSearch Serverless collection for this example and no longer require it, you can delete it via the OpenSearch Serverless console.

Conclusion

In this post, we discussed using RAG as an approach to provide domain-specific context to LLMs. We showed how to use SageMaker JumpStart to build a RAG-based contextual chatbot for a financial services organization using Llama 2 and OpenSearch Serverless with a vector engine as the vector data store. This method refines text generation using Llama 2 by dynamically sourcing relevant context. We’re excited to see you bring your custom data and innovate with this RAG-based strategy on SageMaker JumpStart!

About the authors

Sunil Padmanabhan is a Startup Solutions Architect at AWS. As a former startup founder and CTO, he is passionate about machine learning and focuses on helping startups leverage AI/ML for their business outcomes and design and deploy ML/AI solutions at scale.

Suleman Patel is a Senior Solutions Architect at Amazon Web Services (AWS), with a special focus on Machine Learning and Modernization. Leveraging his expertise in both business and technology, Suleman helps customers design and build solutions that tackle real-world business problems. When he’s not immersed in his work, Suleman loves exploring the outdoors, taking road trips, and cooking up delicious dishes in the kitchen.

Build well-architected IDP solutions with a custom lens – Part 1: Operational excellence

November 22, 2023

by Brijesh Pati Amazon AWS

The IDP Well-Architected Lens is intended for all AWS customers who use AWS to run intelligent document processing (IDP) solutions and are searching for guidance on how to build secure, efficient, and reliable IDP solutions on AWS.

Building a production-ready solution in the cloud involves a series of trade-offs between resources, time, customer expectation, and business outcome. The AWS Well-Architected Framework helps you understand the benefits and risks of decisions you make while building workloads on AWS. By using the Framework, you will learn operational and architectural best practices for designing and operating reliable, secure, efficient, cost-effective, and sustainable workloads in the cloud.

An IDP pipeline usually combines optical character recognition (OCR) and natural language processing (NLP) to read and understand a document and extract specific terms or words. The IDP Well-Architected Custom Lens outlines the steps for an AWS Well-Architected review, which allows you to evaluate and identify technical risks within your IDP workloads. This custom lens integrates best practices and guidance to effectively navigate and overcome common challenges in the management of IDP workloads.

This post focuses on the Operational Excellence pillar of the IDP solution. Operational excellence in IDP means applying the principles of robust software development and maintaining a high-quality customer experience to the field of document processing, while consistently meeting or surpassing service level agreements (SLAs). It involves organizing teams effectively, designing IDP systems to handle workloads efficiently, operating these systems at scale, and continuously evolving them to meet customer needs.

In this post, we start with the introduction of the Operational Excellence pillar and design principles, and then deep dive into four focus areas: organizational culture, workload design, build and release optimization, and observability. By reading this post, you will learn about the Operational Excellence pillar in the Well-Architected Framework with the IDP case study.

Design principles

For IDP workloads, operational excellence translates to the following:

High accuracy and low error rates in document data extraction – Precision in extracting data from documents is paramount, which minimizes errors and ensures that the information used for decision-making is trustworthy
Fast processing of high document volumes with low latency – Efficiency in handling large volumes of documents swiftly allows organizations to keep pace with business demands, reducing bottlenecks
Continuous monitoring for swift diagnosis and resolution of issues – Proactive monitoring and maintenance help in quickly identifying and resolving any interruptions in the document processing pipeline, maintaining a smooth operational flow
Rapid iteration to improve models and workflows – Implementing a feedback loop that facilitates constant refinement of algorithms and processes ensures the system evolves to meet emerging challenges and efficiency standards
Cost optimization to ensure resources align with workload demands – Strategic resource management ensures that financial investment into IDP systems yields maximum value, adjusting resources dynamically in line with fluctuating document processing demands
Adherence to SLAs – Meeting or exceeding the standards and turnaround times promised to customers is crucial for maintaining trust and satisfaction

Effective design strategies must be aligned with these objectives, ensuring that the IDP systems are not only technically capable but also optimized for real-world challenges. This elevates operational excellence from a backend goal to a strategic asset, one that is integral to the success of the entire enterprise. Based on the design principles of the Operational Excellence pillar, we propose the following design principles for this custom lens.

Design Principles	Description
Align IDP SLAs with Overall Document Workflow Objectives	IDP typically functions as an integral component of the broader document workflow managed by business teams. Therefore, it is essential that the SLAs for IDP are carefully crafted as subsets of the overall document workflow SLAs. This approach ensures that the IDP’s performance expectations are in harmony with the larger workflow objectives, providing a clear and consistent standard for processing speed, accuracy, and reliability. By doing so, businesses can create a cohesive and efficient document management system that aligns with the overarching business goals and stakeholder expectations, fostering trust and dependability in the system’s capabilities.
Codify Operations for Efficiency and Reproducibility	By performing operations as code and incorporating automated deployment methodologies, organizations can achieve scalable, repeatable, and consistent processes. This not only minimizes the potential for human error but also paves the way for seamless integration of new data sources and processing techniques.
Proactively Anticipate and Plan for System Failures	Because IDP systems process vast array of documents with varied complexities, potential issues can emerge at any stage of the document processing pipeline. You should conduct “pre-mortem” exercises to pre-emptively identify potential sources of failure so that they can be removed or mitigated. Regularly simulate failure scenarios and validate your understanding of their impact. Test your response procedures to ensure they are effective and that teams are familiar with their process. Set up regular game days to test workload and team responses to simulated events.
Iterate Frequently with Feedback Mechanisms	As your document processing workload evolves, ensure your operational strategies adapt in sync and look for opportunities to improve them: Make frequent, small, reversible changes – Design workloads to allow components to be updated regularly to increase the flow of beneficial changes into your workload. Make changes in small increments that can be reversed if they fail to aid in the identification and resolution of issues introduced to your environment. Learn from all operational failures – Drive improvement through lessons learned from all operational events and failures. Share what is learned across teams and through the entire organization.
Monitor Operational Health	Ensure a shift from mere monitoring to advanced observability within your IDP framework. This entails a comprehensive understanding of the system’s health. By effectively collecting and correlating telemetry data, you can glean actionable insights, facilitating pre-emptive detection and mitigation of issues.
Pursue Metrics-Driven Quality and Continuous Improvement	In IDP, what gets measured gets improved. Define and track key metrics related to document accuracy, processing times, and model efficacy. It is crucial to pursue a metrics-driven strategy that emphasizes the quality of data extraction at the field level, particularly for high-impact fields. Harness a flywheel approach, wherein continuous data feedback is utilized to routinely orchestrate and evaluate enhancements to your models and processes.
Integrate Human Oversight for Process Effectiveness	Although automation and ML algorithms significantly advance the efficiency of IDP, there are scenarios where human reviewers can augment and enhance the outcomes, especially in situations with regulatory demands or when encountering low-quality scans. Human oversight based on confidence score thresholds can be a valuable addition.

Focus areas

The design principles and best practices for the Operational Excellence pillar come from what we have learned from our customers and our IDP experts. Use these as a guide when making design choices, making sure they fit well with what your business needs from the IDP solution. Applying the IDP Well-Architected Lens also helps you validate that these choices are aimed at achieving operational excellence, ensuring they meet your specific operational goals.

The following are the key focus areas for operational excellence of IDP solution in the cloud:

Organizational culture – Organizational culture is pivotal in shaping how IDP projects are implemented and managed. This culture is sustained by clear SLAs that set definitive expectations for processing times and accuracy, ensuring all team members are oriented towards common goals. This is complemented by a centralized function that acts as the hub for operational excellence, consolidating best practices and steering IDP projects towards success.
Workload design – This involves creating a system capable of flexibly handling varying demands, optimizing for quality and accuracy in document processing, and efficiently integrating with external systems.
Build and release optimization – This area emphasizes the implementation of standardized DevSecOps processes. The goal is to streamline the development lifecycle and use automation to ensure smooth and rapid deployment of updates or new features. This approach aims to enhance the efficiency, security, and reliability of the IDP system development and deployment.
Observability – In IDP, observability is focused on comprehensive monitoring, alerting, and logging capabilities, along with managing service quotas. This involves keeping a vigilant eye on the system’s performance, setting up effective alert mechanisms for potential issues, maintaining detailed logs for analysis, and ensuring the system operates within its resource allocations.

Organizational culture

To achieve operational excellence in IDP, organizations must embed certain best practices into their culture and daily operations. The following are a few critical areas that can guide organizations in optimizing their IDP workflows:

Culture and operating model – Cultivate a culture that champions the strategic design, deployment, and management of IDP workloads. This should be a cultural norm, integrated into the operating model to support agility and responsiveness in document processing.
Business and SLA alignment – Align IDP initiatives with business objectives and SLAs. This practice ensures that document processing supports the overall business strategy and meets the performance metrics valued by stakeholders.
Continuous AWS training – Commit to regular training and upskilling in AWS services to enhance IDP capabilities. A well-trained team can use AWS’s evolving features for improved document processing efficiency and innovation.
Change management – Establish robust change management processes to navigate the IDP landscape’s dynamic nature. Effective change management supports smooth transitions and helps maintain uninterrupted IDP operations during upgrades or shifts in strategy.
Defined metrics for IDP success – Establish and monitor clear metrics to measure the success and impact of the IDP operations. For example: With Amazon CloudWatch, you could monitor the number of documents processed through Amazon Textract. Similarly, monitoring the volume and size of documents being uploaded into Amazon Simple Storage Service (Amazon S3) can give insights into the rate at which processing demand is increasing. Furthermore, with AWS Step Functions, you can use the built-in metrics to track the processing job success rate, offering insights into the effectiveness of the workflow orchestration.
Iterative improvements – Encourage a culture of feedback and iterative development to refine IDP processes. By regularly analyzing performance data and user feedback, the organization can make informed, incremental improvements to the IDP system.
Feedback loop from human review – Integrate a feedback loop from human review into the IDP system. This provides valuable insights that you can use to continuously improve the accuracy and effectiveness of the automated processes.

Workload design

An effective workload design is essential for successful management of intelligent document processing systems. This design must be adaptable to meet diverse demands to handle varying demands, maintaining high quality and accuracy, and achieving seamless integration with other systems. The following are the best practices that can help achieve these goals:

Utilizing IDP Workflow stages– When designing an architecture for IDP, it is important to consider the typical stages of an IDP workflow, which may vary based on specific use cases and business needs. Common stages include data capture, document classification, document text extraction, content enrichment, document review and validation , and data consumption. By clearly defining and separating these stages in your architecture, you create a more resilient system. This approach helps in isolating different components in the event of a failure, leading to smoother operations and easier maintenance.
Flexible demand handling – Create a document processing system that can easily adapt to changes in demand. This ensures that as business needs shift, the system can scale up or down accordingly and continue to operate smoothly.
- For example, when interfacing with Amazon Textract, ensure you manage throttling and dropped connections by setting the config parameter when creating the Amazon Textract client. It is recommended to set a retry count of 5, because the AWS SDK retries an operation this specified number of times before considering it a failure. Incorporating this mechanism can handle throttling more effectively by using the SDK’s built-in exponential backoff strategy.
- AWS might periodically update the service limits based on various factors. Stay updated with the latest documentation and adjust your throttling management strategies accordingly. For example, you can use the Amazon Textract Service Quotas Calculator to estimate the quota values that will satisfy your use case. If your application consistently runs into throttling limits, consider requesting AWS to increase your service quotas for Amazon Textract and Amazon Comprehend.
Quality and accuracy optimization – Maximize the precision of data extraction with Amazon Textract by preparing documents in a format conducive to high accuracy, as outlined in the AWS Textract Best Practices. Take advantage of Textract’s Layout feature, which is pre-trained on a diverse array of documents from various industries, including financial services and insurance. This feature simplifies data extraction by reducing the need for complex post-processing code, enhancing efficiency in document processing operations, ultimately enhancing both quality and efficiency in their document processing operations.
Seamless external integrations – Ensure that your IDP system can integrate efficiently with external services and systems. This provides a cohesive workflow and allows for broader functionality within the document processing pipeline. For example, review the existing architecture for modularity and identify components that handle external system integrations and break down integration logic into smaller, granular functions using AWS Lambda for flexibility and scalability. Continuously seek feedback from developers and integration partners to refine and optimize the architecture. Employ strategies for decoupled operations, such as event-driven processing, where services like Amazon EventBridge can be utilized for capturing and routing events from external systems.
Transparent and adaptable processing – Set up clear, traceable paths for each piece of data from its origin to extraction, which builds trust in the system. Keep documentation of processing rules thorough and up to date, fostering a transparent environment for all stakeholders.
Enhance IDP with Amazon Comprehend Flywheel and Amazon Textract Custom Queries
- Leverage the Amazon Comprehend flywheel for a streamlined ML process, from data ingestion to deployment. By centralizing datasets within the flywheel’s dedicated Amazon S3 data lake, you ensure efficient data management. Regular flywheel iterations guarantee models are trained with the latest data and evaluated for optimal performance. Always promote the highest-performing models to active status, and deploy endpoints synchronized with the active model, reducing manual interventions. This systematic approach, grounded in MLOps principles, drives operational excellence and assures superior model quality.
- Additionally, with the recent introduction of the Amazon Textract Custom Queries feature, you can refine the extraction process to meet unique business requirements by using natural language questions, thereby improving accuracy for specific document types. Custom Queries simplifies the adaptation of the Amazon Textract Queries feature, eliminating the need for deep ML expertise and facilitating a more intuitive way to extract valuable information from documents.

Build and release optimization

Streamlining the build and release processes is vital for the agility and security of IDP solutions. The following are key practices in build and release optimization, focusing on automation, continuous integration and continuous delivery (CI/CD), and security:

Automated deployment – Design your IDP solution using infrastructure-as-code (IaC) principles for consistent and repeatable deployments; the serverless infrastructure can be deployed with AWS Cloud Development Kit (AWS CDK) and orchestrated with low-code visual workflow service like AWS Step Functions.
CI/CD pipelines – Leverage tools like AWS CodePipeline, AWS CodeBuild, AWS CodeDeploy for the automation of build, test, and release phases of IDP components and models. Set up automated rollbacks to mitigate deployment risks, and integrate change tracking and governance for thorough validation before production deployment.
Security with AWS KMS – Operational excellence isn’t solely about efficiency; security plays an integral role as well. Specifically, for Amazon Comprehend endpoints where customer-managed keys encrypt underlying models, maintaining the integrity using AWS Key Management Service (AWS KMS) key permissions become vital. Utilize AWS Trusted Advisor to check endpoint access risks and manage KMS key permissions.
Seamless integration with diverse external systems – Tailor build and release pipelines to emphasize seamless integration with diverse external systems. Use AWS services and best practices to design document processing workflows to easily interface and adapt to various external requirements. This ensures consistency and agility in deployments, prioritizing operational excellence even in complex integration scenarios.

Observability

Achieving operational excellence in IDP necessitates an integrated approach where monitoring and observability play pivotal roles. Below are the key practices to ensure clarity, insight, and continuous improvement within an AWS environment:

Comprehensive observability – Implement a thorough monitoring and observability solution with tools like Amazon CloudWatch Logs for services such as Amazon Textract and Amazon Comprehend. This approach provides clear operational insights for all stakeholders, fostering efficient operation, responsive event handling, and a cycle of continuous improvement.
Amazon Comprehend Endpoint monitoring and auto scaling – Employ Trusted Advisor for diligent monitoring of Amazon Comprehend endpoints to optimize resource utilization. Adjust throughput configurations or use AWS Application Auto Scaling to align resources with demand, enhancing efficiency and cost-effectiveness.
Amazon Textract monitoring strategy – For operational excellence in utilizing Amazon Textract, adopt a holistic approach:
- Utilize CloudWatch to diligently monitor Amazon Textract operations, drawing insights from key metrics like SuccessfulRequestCount, ThrottledCount, ResponseTime, ServerErrorCount, UserErrorCount
- Set precise alarms based on these metrics, and integrate them with Amazon Simple Notification Service (Amazon SNS) for real-time anomaly detection.
- Act swiftly on these notifications, ensuring prompt issue rectification and consistent document processing efficiency. This strategy combines meticulous monitoring with proactive intervention, setting the gold standard for operational excellence.
Logging API calls with AWS CloudTrail – With AWS CloudTrail , you can gain visibility into API call history and user activity, crucial for operational monitoring and swift incident response. Amazon Textract and Amazon Comprehend services are integrated with AWS CloudTrail.

Conclusion

In this post, we shared design principles, focus areas, foundations and best practices for achieving operational excellence in your IDP solution. By adopting the Well-Architected Framework principles covered in this post, you can optimize your IDP workloads for operational excellence. Focus on key areas like IaC, instrumentation, observability, and continuous improvement, which will help you achieve operational excellence and ensure your IDP systems deliver business value at scale in a secure and compliant manner.

To learn more about the IDP Well-Architected Custom Lens, explore the following posts in this series:

Build well-architected IDP solutions with a custom lens – Part 1: Operational excellence
Build well-architected IDP solutions with a custom lens – Part 2: Security
Build well-architected IDP solutions with a custom lens – Part 3: Reliability
Build well-architected IDP solutions with a custom lens – Part 4: Performance efficiency
Build well-architected IDP solutions with a custom lens – Part 5: Cost optimization
Build well-architected IDP solutions with a custom lens – Part 6: Sustainability

AWS is committed to the IDP Well-Architected Lens as a living tool. As the IDP solutions and related AWS AI services evolve and new AWS services become available, we will update the IDP Lens Well-Architected accordingly.

If you want to learn more about the AWS Well-Architected Framework, refer to AWS Well-Architected.

If you require additional expert guidance, contact your AWS account team to engage an IDP Specialist Solutions Architect.

About the Authors

Brijesh Pati is an Enterprise Solutions Architect at AWS. His primary focus is helping enterprise customers adopt cloud technologies for their workloads. He has a background in application development and enterprise architecture and has worked with customers from various industries such as sports, finance, energy and professional services. His interests include serverless architectures and AI/ML.

Mia Chang is a ML Specialist Solutions Architect for Amazon Web Services. She works with customers in EMEA and shares best practices for running AI/ML workloads on the cloud with her background in applied mathematics, computer science, and AI/ML. She focuses on NLP-specific workloads, and shares her experience as a conference speaker and a book author. In her free time, she enjoys hiking, board games, and brewing coffee.

Rui Cardoso is a partner solutions architect at Amazon Web Services (AWS). He is focusing on AI/ML and IoT. He works with AWS Partners and support them in developing solutions in AWS. When not working, he enjoys cycling, hiking and learning new things.

Tim Condello is a senior artificial intelligence (AI) and machine learning (ML) specialist solutions architect at Amazon Web Services (AWS). His focus is natural language processing and computer vision. Tim enjoys taking customer ideas and turning them into scalable solutions.

Sherry Ding is a senior artificial intelligence (AI) and machine learning (ML) specialist solutions architect at Amazon Web Services (AWS). She has extensive experience in machine learning with a PhD degree in computer science. She mainly works with public sector customers on various AI/ML related business challenges, helping them accelerate their machine learning journey on the AWS Cloud. When not helping customers, she enjoys outdoor activities.

Suyin Wang is an AI/ML Specialist Solutions Architect at AWS. She has an interdisciplinary education background in Machine Learning, Financial Information Service and Economics, along with years of experience in building Data Science and Machine Learning applications that solved real-world business problems. She enjoys helping customers identify the right business questions and building the right AI/ML solutions. In her spare time, she loves singing and cooking.

Build well-architected IDP solutions with a custom lens – Part 2: Security

November 22, 2023

by Sherry Ding Amazon AWS

Building a production-ready solution in AWS involves a series of trade-offs between resources, time, customer expectation, and business outcome. The AWS Well-Architected Framework helps you understand the benefits and risks of decisions you make while building workloads on AWS. By using the Framework, you will learn current operational and architectural recommendations for designing and operating reliable, secure, efficient, cost-effective, and sustainable workloads in AWS.

An intelligent document processing (IDP) project usually combines optical character recognition (OCR) and natural language processing (NLP) to read and understand a document and extract specific entities or phrases. This IDP Well-Architected Custom Lens provides you the guidance to tackle the common challenges we see in the field. By answering a series of questions in this custom lens, you will identify the potential risks and be able to resolve them by following the improvement plan.

This post focuses on the Security pillar of the IDP solution. Starting from the introduction of the Security Pillar and design principles, we then examine the solution design and implementation with four focus areas: access control, data protection, key and secret management, and workload configuration. By reading this post, you will learn about the Security Pillar in the Well-Architected Framework, and its application to the IDP solutions.

Design principles

The Security Pillar encompasses the ability of an IDP solution to protect input documents, document processing systems, and output assets, taking advantage of AWS technologies to improve security while processing documents intelligently.

All of the AWS AI services (for example, Amazon Textract, Amazon Comprehend, or Amazon Comprehend Medical) used in IDP solutions are fully managed AI services where AWS secures their physical infrastructure, API endpoints, OS, and application code, and handles service resilience and failover within a given region. As an AWS customer, you can therefore focus on using these services to accomplish your IDP tasks, rather than on securing these elements. There are a number of design principles that can help you strengthen your IDP workload security:

Implement a strong identity foundation – Implement the principle of least privilege and enforce separation of duties with appropriate authorization for each interaction with your AWS resources in IDP applications. Centralize identity management, and aim to eliminate reliance on long-term static credentials.
Maintain traceability – AI services used in IDP are integrated with AWS CloudTrail, which enables you to monitor, alert on, and audit actions and changes to your IDP environment with low latency. Their integration with Amazon CloudWatch allows you to integrate log and metric collection with your IDP system to automatically investigate and take action.
Automate current security recommendations – Automated software-based security mechanisms improve your ability to securely scale more rapidly and cost-effectively. Create secured IDP architectures, including the implementation of controls that are defined and managed as code in version-controlled templates by using AWS CloudFormation.
Protect data in transit and at rest – Encryption in transit is supported by default for all of the AI services required for IDP. Pay attention to protection of data at rest and data produced in IDP outputs. Classify your data into sensitivity levels and use mechanisms, such as encryption, tokenization, and access control where appropriate.
Grant least privilege permissions to people – IDP largely reduces the need for direct access and manual processing of documents. Only involving necessary people to do case validation or augmentation tasks reduces the risk of document mishandling and human error when dealing with sensitive data.
Prepare for security events – Prepare for an incident by having incident management and investigation policy and processes in place that align to your organizational requirements. Run incident response simulations and use tools with automation to increase your speed for detection, investigation, and recovery.

Focus areas

Before you architect an IDP workload, you need to put practices in place to meet your security requirements. This post focuses on the Security pillar with four focus areas:

Access control – In an IDP application, access control is the key part to ensure information security. It’s not only related to ensuring that only authorized users are able to access the application, but also about ensuring that other services are only able to access the environment and interact with each other in a suitably secure manner.
Data protection – Because encrypting data in transit is supported by default for all of the AI services required for IDP, data protection in an IDP application focuses more on encrypting data at rest and managing sensitive information such as personally identifiable information (PII).
Key and secret management – The encryption approach that you use to secure your IDP workflow may include different keys to encrypt data and authorize users across multiple services and related systems. Applying a comprehensive key and secret management system provides durable and secure mechanisms to further protect your IDP application and data.
Workload configuration – Workload configuration involves multiple design principles, including using monitoring and auditing services to maintain traceability of transactions and data in your IDP workload, setting up incident response procedures, and separating different IDP workloads from each other.

Access control

In focus area of access control, consider the following current recommendations:

Use VPC endpoints to a establish private connection with IDP related services – You can use Amazon Textract, Amazon Comprehend, and Amazon Simple Storage Service (Amazon S3) APIs through a world-routable network or keep your network traffic within the AWS network by using VPC endpoints. To follow current security recommnedations, you should keep your IDP traffic within your VPCs, and establish a private connection between your VPC and Amazon Textract or Amazon Comprehend by creating interface VPC endpoints. You can also access Amazon S3 from your VPC using gateway VPC endpoints.
Set up a centralized identity provider – For authenticating users and systems to your IDP application, setting up a centralized identity provider makes it easier to manage access across multiple IDP applications and services. This reduces the need for multiple sets of credentials and provides an opportunity to integrate with existing human resources (HR) processes.
- For federation with individual AWS accounts, you can use centralized identities for AWS with a SAML 2.0-based provider with AWS Identity and Access Management (IAM).
- For federation to multiple accounts in your AWS Organizations, you can configure your identity source in AWS IAM Identity Center and specify where your users and groups are managed.
Use IAM roles to control access and enforce least privilege access – To manage user access to IDP services, you should create IAM roles for user access to services in the IDP application and attach the appropriate policies and tags to achieve least privilege access. Roles should then be assigned to appropriate groups as managed in your identity provider. You can also use IAM roles for assigning service usage permissions, thereby employing ephemeral AWS Security Token Service (STS) credentials for calling service APIs. For circumstances where AWS services need to be called for IDP purposes from systems not running on AWS, use AWS IAM Roles Anywhere to obtain temporary security credentials in IAM for workloads running outside of AWS.
Protect Amazon Textract and Amazon Comprehend in your account from cross-service impersonation – An IDP application usually employs multiple AWS services, such that one service may call another service. Therefore, you need to prevent cross-service “confused deputy” scenarios. We recommend using the aws:SourceArn and aws:SourceAccount global condition context keys in resource policies to limit the permissions that Amazon Textract or Amazon Comprehend gives another service to the resource.

Data protection

The following are some current recommendations to consider for data protection:

Follow current recommendations to secure sensitive data in data stores – IDP usually involves multiple data stores. Sensitive data in these data stores needs to be secured. Current security recommendations in this area involve defining IAM controls, multiple ways to implement detective controls on databases, strengthening infrastructure security surrounding your data via network flow control, and data protection through encryption and tokenization.
Encrypt data at rest in Amazon Textract – Amazon Textract uses Transport Layer Security (TLS) and VPC endpoints to encrypt data in transit. The method of encrypting data at rest for use by Amazon Textract is server-side encryption. You can choose from the following options:
- Server-side encryption with Amazon S3 (SSE-S3) – When you use Amazon S3 managed keys, each object is encrypted with a unique key. As an additional safeguard, this method encrypts the key itself with a primary key that it regularly rotates.
- Server-side encryption with AWS KMS (SSE-KMS) – There are separate permissions for the use of an AWS Key Management Service (AWS KMS) key that provide protection against unauthorized access of your objects in Amazon S3. SSE-KMS also provides you with an audit trail in CloudTrail that shows when your KMS key was used, and by whom. Additionally, you can create and manage KMS keys that are unique to you, your service, and your Region.
Encrypt the output from Amazon Textract asynchronous API in a custom S3 bucket – When you start an asynchronous Amazon Textract job by calling StartDocumentTextDetection or StartDocumentAnalysis, an optional parameter in the API action is called OutputConfig. This parameter allows you to specify the S3 bucket for storing the output. Another optional input parameter KMSKeyId allows you to specify the KMS customer managed key (CMK) to use to encrypt the output.
Use AWS KMS encryption in Amazon Comprehend – Amazon Comprehend works with AWS KMS to provide enhanced encryption for your data. Integration with AWS KMS enables you to encrypt the data in the storage volume for Start* and Create* jobs, and it encrypts the output results of Start* jobs using your own KMS key.
- For use via the AWS Management Console, Amazon Comprehend encrypts custom models with its own KMS key.
- For use via the AWS Command Line Interface (AWS CLI), Amazon Comprehend can encrypt custom models using either its own KMS key or a provided CMK, and we recommend the latter.
Protect PII in IDP output – For documents including PII, any PII in IDP output also needs to be protected. You can either secure the output PII in your data store or redact the PII in your IDP output.
- If you need to store the PII in your IDP downstream, look into defining IAM controls, implementing protective and detective controls on databases, strengthening infrastructure security surrounding your data via network flow control, and implementing data protection through encryption and tokenization.
- If you don’t need to store the PII in your IDP downstream, consider redacting the PII in your IDP output. You can design a PII redaction step using Amazon Comprehend in your IDP workflow.

Key and secret management

Consider the following current recommendations for managing keys and secrets:

Use AWS KMS to implement secure key management for cryptographic keys – You need to define an encryption approach that includes the storage, rotation, and access control of keys, which helps provide protection for your content. AWS KMS helps you manage encryption keys and integrates with many AWS services. It provides durable, secure, and redundant storage for your KMS keys.
Use AWS Secrets Manager to implement secret management – An IDP workflow may have secrets such as database credentials in multiple services or stages. You need a tool to store, manage, retrieve, and potentially rotate these secrets. AWS Secrets Manager helps you manage, retrieve, and rotate database credentials, application credentials, and other secrets throughout their lifecycles. Storing the credentials in Secrets Manager helps mitigate the risk of possible credential exfiltration by anyone who can inspect your application code.

Workload configuration

To configure workload, follow these current recommendations:

Separate multiple IDP workloads using different AWS accounts – We recommend establishing common guardrails and isolation between environments (such as production, development, and test) and workloads through a multi-account strategy. AWS provides tools to manage your workloads at scale through a multi-account strategy to establish this isolation boundary. When you have multiple AWS accounts under central management, your accounts should be organized into a hierarchy defined by groupings of organizational units (OUs). Security controls can then be organized and applied to the OUs and member accounts, establishing consistent preventative controls on member accounts in the organization.
Log Amazon Textract and Amazon Comprehend API calls with CloudTrail – Amazon Textract and Amazon Comprehend are integrated with CloudTrail. The calls captured include calls from the service console and calls from your own code to the services’ API endpoints.
Establish incident response procedures – Even with comprehensive, preventative and detective controls, your organization should still have processes in place to respond to and mitigate the potential impact of security incidents. Putting the tools and controls in place ahead of a security incident, then routinely practicing incident response through simulations, will help you verify that your environment can support timely investigation and recovery.

Conclusion

In this post we shared design principles and current recommendations for Security Pillar in building well-architected IDP solutions.

To learn more about the IDP Well-Architected Custom Lens, explore the following posts in this series:

Build well-architected IDP solutions with a custom lens – Part 1: Operational excellence
Build well-architected IDP solutions with a custom lens – Part 2: Security
Build well-architected IDP solutions with a custom lens – Part 3: Reliability
Build well-architected IDP solutions with a custom lens – Part 4: Performance efficiency
Build well-architected IDP solutions with a custom lens – Part 5: Cost optimization
Build well-architected IDP solutions with a custom lens – Part 6: Sustainability

For next steps, you can read more about the AWS Well-Architected Framework and refer to our Guidance for Intelligent Document Processing on AWS to design and build your IDP application. Please also reach out to your account team for a Well-Architected review for your IDP workload. If you require additional expert guidance, contact your AWS account team to engage an IDP Specialist Solutions Architect.

AWS is committed to the IDP Well-Architected Lens as a living tool. As the IDP solutions and related AWS AI services evolve, we will update the IDP Well-Architected Lens accordingly.

About the Authors

Build well-architected IDP solutions with a custom lens – Part 3: Reliability

November 22, 2023

by Rui Cardoso Amazon AWS

The IDP Well-Architected Custom Lens is intended for all AWS customers who use AWS to run intelligent document processing (IDP) solutions and are searching for guidance on how to build a secure, efficient, and reliable IDP solution on AWS.

An IDP project usually combines optical character recognition (OCR) and natural language processing (NLP) to read and understand a document and extract specific terms or words. The IDP Well-Architected Custom Lens outlines the steps for performing an AWS Well-Architected review that allows you to assess and identify technical risks of your IDP workloads. It provides guidance to tackle the common challenges we see among the field, supporting you to architect your IDP workloads according to best practices.

This post focuses on the Reliability pillar of the IDP solution. Starting from the introduction of the Reliability pillar and design principles, we then dive deep into the solution design and implementation with three focus areas: foundations, change management, and failure management. By reading this post, you will learn about the Reliability pillar in the Well-Architected Framework with the IDP case study.

Design principles

The reliability pillar encompasses the ability of an IDP solution to perform document processing correctly and consistently when it’s expected and according to the defined business rules. This includes the ability to operate and test the full IDP workflow and its total lifecycle.

There are a number of principles that can help you to increase reliability. Keep these in mind as we discuss best practices:

Automatically recover from failure – By monitoring your IDP workflow for key performance indicators (KPIs), you can run automation when a threshold is breached. This allows you to track and be notified automatically if any failure occurs and trigger automated recovery processes that work around or repair the failure. Based on KPI measures, you can also anticipate failures and apply remediation actions before they occur.
Test recovery procedures – Test how your IDP workflow fails, and validate recovery procedures. Use automation to simulate different scenarios or recreate scenarios that led to failure before.
Scale and adjust service capacity – Monitor IDP workflow demand and usage, and automatically adjust AWS service capacity, to maintain the optimal level to satisfy demand without over- or under-provisioning. Control and be aware of service quotas, limits, and constraints of your IDP components services, such as Amazon Textract and Amazon Comprehend.
Automate changes – Use automation when applying changes to your IDP workflow infrastructure. Manage changes through automation, which then can be tracked and reviewed.

Focus areas

The design principles and best practices of the reliability pillar are based on insights gathered from our customers and our IDP technical specialist communities. Use them as guidance and support for your design decisions and align them with your business requirements of your IDP solution. Applying the IDP Well-Architected Lens helps you validate the resilience and efficiency of your IDP solution design, and provides recommendations to address any gaps you might identify.

The following are best practice areas for reliability of an IDP solution in the cloud:

Foundations – AWS AI services such as Amazon Textract and Amazon Comprehend provide a set of soft and hard limits for different dimensions of usage. It’s important to review these limits and ensure your IDP solution adheres to any soft limits, while not exceeding any hard limits.
Change management – Treat your IDP solution as infrastructure as code (IaC), allowing you to automate monitoring and change management. Use version control across components such as infrastructure and Amazon Comprehend custom models, and track changes back to point-in-time release.
Failure management – Because an IDP workflow is an event-driven solution, your application must be resilient to handling known and unknown errors. A well-architected IDP solution has the ability to prevent failures and withstand failures when they occur by using logging and retry mechanisms. It’s important to design resilience into your IDP workflow architecture and plan for disaster recovery.

Foundations

AWS AI services provide ready-made intelligence, such as automated data extraction and analysis, using Amazon Textract, Amazon Comprehend, and Amazon Augmented AI (Amazon A2I), for your IDP workflows. There are service limits (or quotas) for these services to avoid over-provisioning and to limit request rates on API operations, protecting the services from abuse.

When planning and designing your IDP solution architecture, consider the following best practices:

Be aware of unchangeable Amazon Textract and Amazon Comprehend service quotas, limits, and constraints – Accepted file formats, size and page count, languages, document rotations, and image size are some examples of these hard limits for Amazon Textract that can’t be changed.
- Accepted file formats include JPEG, PNG, PDF, and TIFF files. (JPEG 2000-encoded images within PDFs are supported). Document preprocessing is required before using Amazon Textract if the file format is not supported (for example, Microsoft Word or Excel). In this case, you must convert unsupported document formats to PDF or image format.
- Amazon Comprehend has different quotas for built-in models, custom models, and flywheels. Make sure that your use case is aligned with Amazon Comprehend quotas.
Adjust Amazon Textract and Amazon Comprehend service quotas to meet your needs – The Amazon Textract Service Quotas Calculator can help you estimate the quota values that will cover your use case. You should manage your service quotas across accounts or Regions if you’re planning a disaster recovery failover between accounts or Regions for your solution. When requesting an increase of Amazon Textract quotas, make sure to follow these recommendations:
- Use the Amazon Textract Service Quotas Calculator to estimate your optimal quota value.
- Changes in requests can cause spiky network traffic, affecting throughput. Use a queueing serverless architecture or other mechanism to smooth traffic and get the most out of your allocated transactions per second (TPS).
- Implement retry logic to handle throttled calls and dropped connections.
- Configure exponential backoff and jitter to improve throughput.

Change management

Changes to your IDP workflow or its environment, such as spikes in demand or a corrupted document file, must be anticipated and accommodated to achieve a higher reliability of the solution. Some of these changes are covered by the foundations best practices described in the previous section, but those alone are not enough to accommodate changes. The following best practices must also be considered:

Use Amazon CloudWatch to monitor your IDP workflow components, such as Amazon Textract and Amazon Comprehend. Collect metrics from the IDP workflow, automate responses to alarms, and send notifications as required to your workflow and business objectives.
Deploy your IDP workflow solution and all infrastructure changes with automation using IaC, such as the AWS Cloud Development Kit (AWS CDK) and pre-built IDP AWS CDK constructs. This removes the potential for introducing human error and enables you to test before changing to your production environment.
If your use case requires an Amazon Comprehend custom model, consider using a flywheel to simplify the process of improving the custom model over time. A flywheel orchestrates the tasks associated with training and evaluating a new custom model version.
If your use case requires it, customize the output of the Amazon Textract pre-trained Queries feature by training and using an adapter for the Amazon Textract base model. Consider the following best practices when creating queries for your adapters:
- Adapter quotas define the preceding limits for adapter training. Consider these limits and raise a service quota increase request, if required:
  - Maximum number of adapters – Number of adapters allowed (you can have several adapter versions under a single adapter).
  - Maximum adapter versions created per month – Number of successful adapter versions that can be created per AWS account per month.
  - Maximum in-progress adapter versions – Number of in-progress adapter versions (adapter training) per account.
- Make sure to use a set of documents representative of your use case (minimum five training docs and five testing docs).
- Provide as many documents for training as possible (up to 2,500 pages of training documents and 1,000 for test documents).
- Annotate queries using a variety of answers. For example, if the answer to a query is “Yes” or “No,” the annotated samples should have occurrences of both “Yes” and “No.”
- Maintain consistency in annotation style and while annotating fields with spaces.
- Use the exact query used in training for inference.
- After each round of adapter training, review the performance metrics to determine if you need to further improve your adapter to achieve your goals. Upload a new document set for training or review document annotations that have low accuracy scores before you start a new training to create an improved version of the adapter.
- Use the AutoUpdate feature for custom adapters. This feature attempts automated retraining if the AutoUpdate flag is enabled on an adapter.

Failure management

When designing an IDP solution, one important aspect to consider is its resilience, how to handle known and unknown errors that can occur. The IDP solution should have the capabilities of logging errors and retry failed operations, during the different stages of the IDP workflow. In this section, we discuss the details on how to design your IDP workflow to handle failures.

Prepare your IDP workflow to manage and withstand failures

“Everything fails, all the time,” is a famous quote from AWS CTO Werner Vogels. Your IDP solution, like everything else, will eventually fail. The question is how can it withstand failures without impacting your IDP solution users. Your IDP architecture design must be aware of failures as they occur and take action to avoid impact on availability. This must be done automatically, and without user impact. Consider the following best practices:

Use Amazon Simple Storage Service (Amazon S3) as your scalable data store for IDP workflow documents to process. Amazon S3 provides a highly durable storage infrastructure designed for mission-critical and primary data storage.
Back up all your IDP workflow data according to your business requirements. Implement a strategy to recover or reproduce data in case of data loss. Align this strategy with a defined Recovery Point Objective (RPO) and Recovery Time Objective (RTO) that meet your business requirements.
If required, plan and implement a disaster recovery failover strategy of your IDP solution across AWS accounts and Regions.
Use the Amazon Textract OutputConfig feature and Amazon Comprehend OutputDataConfig feature to store the results of asynchronous processing from Amazon Textract or Amazon Comprehend to a designated S3 bucket. This allows the workflow to continue from that point rather than repeat the Amazon Textract or Amazon Comprehend invocation. The following code shows how to start an Amazon Textract asynchronous API job to analyze a document and store encrypted inference output in a defined S3 bucket. For additional information, refer to the Amazon Textract client documentation.

import boto3
client = boto3.client('textract')

response = client.start_document_analysis(
    DocumentLocation={
        'S3Object': {
            'Bucket': 'string',
            'Name': 'string',
            'Version': 'string'
        }
    },
    FeatureTypes=[
        'TABLES'|'FORMS'|'QUERIES'|'SIGNATURES'|'LAYOUT',
    ],
    …
    OutputConfig={
        'S3Bucket': 'string',
        'S3Prefix': 'string'
    },
    KMSKeyId='string'
    …
)

Design your IDP workflow to prevent failures

The reliability of a workload starts with upfront design decisions. Architecture choices will impact your workload behavior and its resilience. To improve the reliability of your IDP solution, follow these best practices.

Firstly, design your architecture following the IDP workflow. Although the stages in an IDP workflow may vary and be influenced by use case and business requirements, the stages of data capture, document classification, text extraction, content enrichment, review and validation, and consumption are typically parts of IDP workflow. These well-defined stages can be used to separate functionalities and isolate them in case of failure.

You can use Amazon Simple Queue Service (Amazon SQS) to decouple IDP workflow stages. A decoupling pattern helps isolate behavior of architecture components from other components that depend on it, increasing resiliency and agility.

Secondly, control and limit retry calls. AWS services such as Amazon Textract can fail if the maximum number of TPS allotted is exceeded, causing the service to throttle your application or drop your connection.

You should manage throttling and dropped connections by automatically retrying the operation (both synchronous and asynchronous operations). However, you should also specify a limited number of retries, after which the operation fails and throws an exception. If you make too many calls to Amazon Textract in a short period of time, it throttles your calls and sends a ProvisionedThroughputExceededExceptionerror in the operation response.

In addition, use exponential backoff and jitter for retries to improve throughput. For example, using Amazon Textract, specify the number of retries by including the config parameter when you create the Amazon Textract client. We recommend a retry count of five. In the following example code, we use the config parameter to automatically retry an operation using adaptive mode and a maximum of five retries:

import boto3
from botocore.client import Config

documents = ['doc-img-1.png','doc-img-2.png', 'doc-img-3.png',
             'doc-img-4.png', 'doc-img-5.png']

config = Config(
    retries = {
        'max_attempts': 5,
        'mode': 'adaptive'
        }
)

client = boto3.client('textract', config=config)

for documentName in documents:
    response = client.detect_document_text(
        DocumentLocation = {
            'S3Object': {
                'Bucket': 'string',
                'Name': documentName
                }
                })
    
    ...

Take advantage of AWS SDKs, such as the AWS SDK for Python (Boto3), to assist in retrying client calls to AWS services such as Amazon Textract and Amazon Comprehend. There are three retry modes available:

Legacy mode – Retries calls for a limited number of errors and exceptions and include an exponential backoff by a base factor of 2.
Standard mode – Standardizes the retry logic and behavior consistent with other AWS SDKs and extends the functionality of retries over that found in legacy mode. Any retry attempt will include an exponential backoff by a base factor of 2 for a maximum backoff time of 20 seconds.
Adaptive mode – Includes all the features of standard mode and it introduces a client-side rate limiting through the use of a token bucket and rate limit variables that are dynamically updated with each retry attempt. It offers flexibility in client-side retries that adapts to the error or exception state response from an AWS service. With each new retry attempt, adaptive mode modifies the rate limit variables based on the error, exception, or HTTP status code presented in the response from the AWS service. These rate limit variables are then used to calculate a new call rate for the client. Each exception, error, or non-success HTTP response from an AWS service updates the rate limit variables as retries occur until a success is reached, the token bucket is exhausted, or the configured maximum attempts value is reached. Examples of exceptions, errors, or non-success HTTP responses:

# Transient errors/exceptions
RequestTimeout
RequestTimeoutException
PriorRequestNotComplete
ConnectionError
HTTPClientError

# Service-side throttling/limit errors and exceptions
Throttling
ThrottlingException
ThrottledException
RequestThrottledException
TooManyRequestsException
ProvisionedThroughputExceededException
TransactionInProgressException
RequestLimitExceeded
BandwidthLimitExceeded
LimitExceededException
RequestThrottled
SlowDown
EC2ThrottledException

#Retry attempts on nondescriptive, transient error codes. Specifically, these HTTP status codes: 500, 502, 503, 504.

Conclusion

In this post, we shared design principles, focus areas, foundations and best practices for reliability in your IDP solution.

To learn more about the IDP Well-Architected Custom Lens, explore the following posts in this series:

Build well-architected IDP solutions with a custom lens – Part 1: Operational excellence
Build well-architected IDP solutions with a custom lens – Part 2: Security
Build well-architected IDP solutions with a custom lens – Part 3: Reliability
Build well-architected IDP solutions with a custom lens – Part 4: Performance efficiency
Build well-architected IDP solutions with a custom lens – Part 5: Cost optimization
Build well-architected IDP solutions with a custom lens – Part 6: Sustainability

If you want to learn more about the AWS Well-Architected Framework, refer to AWS Well-Architected.

If you require additional expert guidance, contact your AWS account team to engage an IDP Specialist Solutions Architect.

About the Authors

Build well-architected IDP solutions with a custom lens – Part 4: Performance efficiency

November 22, 2023

by Mia Chang Amazon AWS

When a customer has a production-ready intelligent document processing (IDP) workload, we often receive requests for a Well-Architected review. To build an enterprise solution, developer resources, cost, time and user-experience have to be balanced to achieve the desired business outcome. The AWS Well-Architected Framework provides a systematic way for organizations to learn operational and architectural best practices for designing and operating reliable, secure, efficient, cost-effective, and sustainable workloads in the cloud.

The IDP Well-Architected Custom Lens follows the AWS Well-Architected Framework, reviewing the solution with six pillars with the granularity of a specific AI or machine learning (ML) use case, and providing the guidance to tackle common challenges. The IDP Well-Architected Custom Lens in the Well-Architected Tool contains questions regarding each of the pillars. By answering these questions, you can identify potential risks and resolve them by following your improvement plan.

This post focuses on the Performance Efficiency pillar of the IDP workload. We dive deep into designing and implementing the solution to optimize for throughput, latency, and overall performance. We start with discussing some common indicators that you should conduct a Well-Architected review, and introduce the fundamental approaches with design principles. Then we go through each focus area from a technical perspective.

To follow along with this post, you should be familiar with the previous posts in this series (Part 1 and Part 2) and the guidelines in Guidance for Intelligent Document Processing on AWS. These resources introduce common AWS services for IDP workloads and suggested workflows. With this knowledge, you’re now ready to learn more about productionizing your workload.

Common indicators

The following are common indicators that you should conduct a Well-Architected Framework review for the Performance Efficiency pillar:

High latency – When the latency of optical character recognition (OCR), entity recognition, or the end-to-end workflow takes longer than your previous benchmark, this may be an indicator that the architecture design doesn’t cover load testing or error handling.
Frequent throttling – You may experience throttling by AWS services like Amazon Textract due to request limits. This means that the architecture needs to be adjusted by reviewing the architecture workflow, synchronous and asynchronous implementation, transactions per second (TPS) calculation, and more.
Debugging difficulties – When there’s a document process failure, you may not have an effective way to identify where the error is located in the workflow, which service it’s related to, and why the failure occurred. This means the system lacks visibility into logs and failures. Consider revisiting the logging design of the telemetry data and adding infrastructure as code (IaC), such as document processing pipelines, to the solution.

Indicators	Description	Architectural Gap
High Latency	OCR, entity recognition, or end-to-end workflow latency exceeds previous benchmark	Load Testing Error Handling
Frequent Throttling	Throttling by AWS services like Amazon Textract due to request limits	Sync vs Async TPS calculation
Hard to Debug	No visibility into location, cause, and reason for document processing failures	Logging Design Document Processing Pipelines

Design principles

In this post, we discuss three design principles: delegating complex AI tasks, IaC architectures, and serverless architectures. When you encounter a trade-off between two implementations, you can revisit the design principles with the business priorities of your organization so that you can make decisions effectively.

Delegating complex AI tasks – You can enable faster AI adoption in your organization by offloading the ML model development lifecycle to managed services and taking advantage of the model development and infrastructure provided by AWS. Rather than requiring your data science and IT teams to build and maintain AI models, you can use pre-trained AI services that can automate tasks for you. This allows your teams to focus on higher-value work that differentiates your business, while the cloud provider handles the complexity of training, deploying, and scaling the AI models.
IaC architectures – When running an IDP solution, the solution includes multiple AI services to perform the end-to-end workflow chronologically. You can architect the solution with workflow pipelines using AWS Step Functions to enhance fault tolerance, parallel processing, visibility, and scalability. These advantages can enable you to optimize the usage and cost of underlying AI services.
Serverless architectures – IDP is often an event-driven solution, initiated by user uploads or scheduled jobs. The solution can be horizontally scaled out by increasing the call rates for the AI services, AWS Lambda, and other services involved. A serverless approach provides scalability without over-provisioning resources, preventing unnecessary expenses. The monitoring behind the serverless design assists in detecting performance issues.

Figure 1.The benefit when applying design principles.

With these three design principles in mind, organizations can establish an effective foundation for AI/ML adoption on cloud platforms. By delegating complexity, implementing resilient infrastructure, and designing for scale, organizations can optimize their AI/ML solutions.

In the following sections, we discuss how to address common challenges in regards to technical focus areas.

Focus areas

When reviewing performance efficiency, we review the solution from five focus areas: architecture design, data management, error handling, system monitoring, and model monitoring. With these focus areas, you can conduct an architecture review from different aspects to enhance the effectivity, observability, and scalability of the three components of an AI/ML project, data, model, or business goal.

Architecture design

By going through the questions in this focus area, you will review the existing workflow to see if it follows best practices. The suggested workflow provides a common pattern that organizations can follow and prevents trial-and-error costs.

Based on the proposed architecture, the workflow follows the six stages of data capture, classification, extraction, enrichment, review and validation, and consumption. In the common indicators we discussed earlier, two out of three come from architecture design problems. This is because when you start a project with an improvised approach, you may meet project restraints when trying to align your infrastructure to your solution. With the architecture design review, the improvised design can be decoupled as stages, and each of them can be reevaluated and reordered.

You can save time, money, and labor by implementing classifications in your workflow, and documents go to downstream applications and APIs based on document type. This enhances the observability of the document process and makes the solution straightforward to maintain when adding new document types.

Data management

Performance of an IDP solution includes latency, throughput, and the end-to-end user experience. How to manage the document and its extracted information in the solution is the key to data consistency, security, and privacy. Additionally, the solution must handle high data volumes with low latency and high throughput.

When going through the questions of this focus area, you will review the document workflow. This includes data ingestion, data preprocessing, converting documents to document types accepted by Amazon Textract, handling incoming document streams, routing documents by type, and implementing access control and retention policies.

For example, by storing a document in the different processed phases, you can reverse processing to the previous step if needed. The data lifecycle ensures the reliability and compliance for the workload. By using the Amazon Textract Service Quotas Calculator (see the following screenshot), asynchronous features on Amazon Textract, Lambda, Step Functions, Amazon Simple Queue Service (Amazon SQS), and Amazon Simple Notification Service (Amazon SNS), organizations can automate and scale document processing tasks to meet specific workload needs.

Figure 2. Amazon Textract Service Quota Calculator.

Error handling

Robust error handling is critical for tracking the document process status, and it provides the operation team time to react to any abnormal behaviors, such as unexpected document volumes, new document types, or other unplanned issues from third-party services. From the organization’s perspective, proper error handling can enhance system uptime and performance.

You can break down error handling into two key aspects:

AWS service configuration – You can implement retry logic with exponential backoff to handle transient errors like throttling. When you start processing by calling an asynchronous Start* operation, such as StartDocumentTextDetection, you can specify that the completion status of the request is published to an SNS topic in the NotificationChannel configuration. This helps you avoid throttling limits on API calls due to polling the Get* APIs. You can also implement alarms in Amazon CloudWatch and triggers to alert when unusual error spikes occur.
Error report enhancement – This includes detailed messages with an appropriate level of detail by error type and descriptions of error handling responses. With the proper error handling setup, systems can be more resilient by implementing common patterns like automatically retrying intermittent errors, using circuit breakers to handle cascading failures, and monitoring services to gain insight into errors. This allows the solution to balance between retry limits and prevents never-ending circuit loops.

Model monitoring

The performance of ML models is monitored for degradation over time. As data and system conditions change, the model performance and efficiency metrics are tracked to ensure retraining is performed when needed.

The ML model in an IDP workflow can be an OCR model, entity recognition model, or classification model. The model can come from an AWS AI service, an open source model on Amazon SageMaker, Amazon Bedrock, or other third-party services. You must understand the limitations and use cases of each service in order to identify ways to improve the model with human feedback and enhance service performance over time.

A common approach is using service logs to understand different levels of accuracy. These logs can help the data science team identify and understand any need for model retraining. Your organization can choose the retraining mechanism—it can be quarterly, monthly, or based on science metrics, such as when accuracy drops below a given threshold.

The goal of monitoring is not just detecting issues, but closing the loop to continuously refine models and keep the IDP solution performing as the external environment evolves.

System monitoring

After you deploy the IDP solution in production, it’s important to monitor key metrics and automation performance to identify areas for improvement. The metrics should include business metrics and technical metrics. This allows the company to evaluate the system’s performance, identify issues, and make improvements to models, rules, and workflows over time to increase the automation rate to understand the operational impact.

On the business side, metrics like extraction accuracy for important fields, overall automation rate indicating the percentage of documents processed without human intervention, and average processing time per document are paramount. These business metrics help quantify the end-user experience and operational efficiency gains.

Technical metrics including error and exception rates occurring throughout the workflow are essential to track from an engineering perspective. The technical metrics can also monitor at each level from end to end and provide a comprehensive view of a complex workload. You can break the metrics down into different levels, such as solution level, end-to-end workflow level, document type level, document level, entity recognition level, and OCR level.

Now that you have reviewed all the questions in this pillar, you can assess the other pillars and develop an improvement plan for your IDP workload.

Conclusion

In this post, we discussed common indicators that you may need to perform a Well-Architected Framework review for the Performance Efficiency pillar for your IDP workload. We then walked through design principles to provide a high-level overview and discuss the solution goal. By following these suggestions in reference to the IDP Well-Architected Custom Lens and by reviewing the questions by focus area, you should now have a project improvement plan.

To learn more about the IDP Well-Architected Custom Lens, explore the following posts in this series:

Build well-architected IDP solutions with a custom lens – Part 1: Operational excellence
Build well-architected IDP solutions with a custom lens – Part 2: Security
Build well-architected IDP solutions with a custom lens – Part 3: Reliability
Build well-architected IDP solutions with a custom lens – Part 4: Performance efficiency
Build well-architected IDP solutions with a custom lens – Part 5: Cost optimization
Build well-architected IDP solutions with a custom lens – Part 6: Sustainability

About the Authors

Build well-architected IDP solutions with a custom lens – Part 5: Cost optimization

November 22, 2023

by Suyin Wang Amazon AWS

Building a production-ready solution in the cloud involves a series of trade-off between resources, time, customer expectation, and business outcome. The AWS Well-Architected Framework helps you understand the benefits and risks of decisions you make while building workloads on AWS.

An intelligent document processing (IDP) project usually combines optical character recognition (OCR) and natural language processing (NLP) to read and understand a document and extract specific terms or words. The IDP Well-Architected Custom Lens outlines the steps for performing an AWS Well-Architected review, and helps you assess and identify the risks in your IDP workloads. It also provides guidance to tackle common challenges, enabling you to architect your IDP workloads according to best practices.

This post focuses on the Cost Optimization pillar of the IDP solution. A cost-optimized workload fully utilizes all resources, achieves an outcome at the lowest possible price point, and meets your functional requirements. We start with an introduction of the Cost Optimization pillar and design principles, and then dive deep into the four focus areas: financial management, resource provision, data management, and cost monitoring. By reading this post, you will learn about the Cost Optimization pillar in the Well-Architected Framework with the IDP case study.

Design principles

Cost optimization is a continual process of refinement and improvement over the span of a workload’s lifecycle. The practices in this post can help you build and operate cost-aware IDP workloads that achieve business outcomes while minimizing costs and allowing your organization to maximize its return on investment.

Several principles can help you to improve cost optimization. Let’s consider different project phases. For example, during the project planning phase, you should invest in cloud financial management skills and tools, and align finance and tech teams to incorporate both business and technology perspectives. In the project development phase, we recommend adopting a consumption model and adjusting usage dynamically. When you’re ready for production, always monitor and analyze the spending.

Keep the following in mind as we discuss best practices:

Implement cloud financial management – To achieve financial success and accelerate business value realization with your IDP solution, you must invest in cloud financial management. Your organization must dedicate the necessary time and resources for building capability in this new domain of technology and usage management.
Cultivate a partnership between technology and finance – Involve finance and technology teams in cost and usage discussions while building your IDP solution and at all stages of your cloud journey. Teams should regularly meet and discuss topics such as organizational goals and targets with your IDP solution, current state of cost and usage, and financial and accounting practices.
Adopt a consumption model and adjust dynamically – Provision resources and manage data with cost awareness, and manage your project stage and environment with cost optimization over time. Pay only for the resources you consume, and increase or decrease usage depending on business requirements. For example, development and test environments for your IDP solution are typically only used for 8 hours a day during the work week. By stopping development and test environment resources when not in use, such as outside of the 40 working hours per week, you can reduce costs by 75% compared to running them continuously for 168 hours per week.
Monitor, attribute, and analyze expenditure – Measure the business output of the workload and the costs associated with delivery. Use this data to understand the gains you make from increasing output, increasing functionality, and reducing cost with your IDP workflow. AWS provides tools such as Amazon CloudWatch, tags, and AWS CloudTrail to make it straightforward to accurately identify the cost and usage of workloads, make sure you utilize resources to measure return on investment (ROI), and enable workload owners to optimize their resources and reduce costs.

Focus areas

The design principles and best practices of the Cost Optimization pillar are based on insights gathered from our customers and our IDP technical specialist communities. Use them as guidance and support for your design decisions, and align these with the business requirements of your IDP solution. Applying the IDP Well-Architected Custom Lens helps you validate the resilience and efficiency of your IDP solution, and provides recommendations to address any gaps you might identify.

You might have encountered cases when the financial team independently performs financial planning for your cloud usage, which turned out to be disrupted by the technical complexity. It’s also possible to ignore resource and data management while provisioning services, thereby creating unexpected cost items on your billings. In this post, we help you navigate through these situations and provide guidelines for cost optimization with your IDP solution, so you don’t have learn these lessons in a costly way. The following are four best practices areas for cost optimization of an IDP solution in the cloud: financial management, resource provisioning, data management, and cost monitoring.

Financial management

Establishing a team that can take responsibility for cost optimization is critical for successful adoption of cloud technology, and this is true for building an IDP solution as well. Relevant teams in both technology and finance within your organization must be involved in cost and usage discussions at all stages when building your IDP solution and along your cloud journey. The following are some key implementation steps to establish a dedicated cloud financial management team:

Define key members – Make sure that all relevant parts of your organization contribute and have a stake in cost management. Most importantly, you need to establish collaboration between finance and technology. Consider the following general groups, and include members with domain expertise in financial and business areas, as well as in technology, to integrate the knowledge for better financial management:
- Financial leads – CFOs, financial controllers, financial planners, business analysts, procurement, sourcing, and accounts payable must understand the cloud model of consumption, purchasing options, and the monthly invoicing process. Finance needs to partner with technology teams to create and socialize an IT value story, helping business teams understand how technology spend is linked to business outcomes.
- Technology leads – Technology leads (including product and application owners) must be aware of financial requirements (for example, budget constraints) as well as business requirements (for example, service level agreements). This allows the workload to be implemented to achieve the desired goals of the organization.
Define goals and metrics – The function needs to deliver value to the organization in different ways. These goals are defined and will continually evolve as the organization evolves. This function also needs to regularly report to the organization on the organization’s cost optimization capability.
Establish regular cadence – The group should come together regularly to review their goals and metrics. A typical cadence involves reviewing the state of the organization, any programs or services currently running, and overall financial and optimization metrics.

Resource provisioning

Given the various configurations and pricing models of AWS services as part of the IDP solution, you should only provision resources based on what you need and adjust your provisioning over time to align with your business requirement or development stage. Additionally, make sure you take advantage of free services offered by AWS to lower your overall cost. When provisioning resources for your IDP solution, consider the following best practices:

Decide between asynchronous inference or synchronous inference – You should adopt synchronous inference for real-time processing of a single document. Choose asynchronous jobs to analyze large documents or multiple documents in one batch, because asynchronous jobs handle large batches more cost-effectively.
Manage Amazon Comprehend endpoint inference units – Depending on your needs, you can adjust the throughput of your Amazon Comprehend endpoint after creating it. This can be achieved by updating the endpoint’s inference units (IUs). If you’re not actively using the endpoint for an extended period, you should set up an auto scaling policy to reduce your costs. If you’re no longer using an endpoint, you can delete the endpoint to avoid incurring additional cost.
Manage Amazon SageMaker endpoints – Similarly, for organizations that aim for inference type selection and endpoints running time management, you can deploy open source models on Amazon SageMaker. SageMaker provides different options for model inferences, and you can delete endpoints that aren’t being used or set up an auto scaling policy to reduce your costs on model endpoints.

Data management

Data plays a key role throughout your IDP solution, from building and delivering. Starting with the initial ingestion, data is pushed across different stages of processing, and eventually is returned as output to end-users. It’s important to understand how your choice of data management will impact the overall IDP solution cost. Consider the following best practices:

Adopt Amazon S3 Intelligent-Tiering – The Amazon S3 Intelligent-Tiering storage class is designed to optimize storage costs in Amazon Simple Storage Service (Amazon S3) by automatically moving data to the most cost-effective access tier when access patterns change, without operational overhead or impact on performance. There are two ways to move data into S3 Intelligent-Tiering:
- Directly PUT data into S3 Intelligent-Tiering by specifying INTELLIGENT_TIERING in the x-amz-storage-class header.
- Define S3 Lifecycle configurations to transition objects from S3 Standard or S3 Standard-Infrequent Access to S3 Intelligent-Tiering.
Enforce data retention policies throughout the IDP workflow – Use S3 Lifecycle configurations on an S3 bucket to define actions for Amazon S3 to take during an object’s lifecycle, as well as deletion at the end of the object’s lifecycle, based on your business requirements.
Split documents into single pages for specific FeatureType processing – FeatureType is a parameter for the Document Analysis API calls (both synchronous and asynchronous) in Amazon Textract. As of this writing, it includes the following values: TABLES, FORMS, QUERIES, SIGNATURES, and LAYOUT. Amazon Textract charges based on the number of pages and images processed. Not all pages might include the information you need to extract. Splitting documents into single pages and only focusing on the pages with the FeatureType you need can help avoid unnecessary processing, thereby reducing your overall cost.

So far, we’ve discussed best practices on the implementation and deployment of your IDP solution. When your IDP solution is deployed and ready for production, cost monitoring is an important area for you to observe and control the cost directly. In the following section, we discuss how to best perform cost monitoring with your IDP solution.

Cost monitoring

Cost optimization begins with a granular understanding of the breakdown in cost and usage; the ability to model and forecast future spend, usage, and features; and the implementation of sufficient mechanisms to align cost and usage to your organization’s objectives. To improve the cost optimization of your IDP solution, follow these best practices.

Design cost monitoring for the lifetime of IDP workflow

Define and implement a method to track resources and their associations with the IDP system over their lifetime. You can use tagging to identify the workload or function of the resource:

Implement a tagging scheme – Implement a tagging scheme that identifies the workload the resource belongs to, verifying that all resources within the workload are tagged accordingly. Tagging helps you categorize resources by purpose, team, environment, or other criteria relevant to your business. For more detail on tagging use cases, strategies, and techniques, see Best Practices for Tagging AWS Resources.
- Tagging at the service level allows for more granular monitoring and control of your cost. For example, with Amazon Comprehend in an IDP workflow, you can use tags on Amazon Comprehend analysis jobs, custom classification models, custom entity recognition models, and endpoints to organize your Amazon Comprehend resources and providing tag-based cost monitoring and control.
- When tagging at the service level isn’t applicable, you can navigate to other resources for cost allocation reporting. For example, because Amazon Textract charges on a one-page basis, you can track the number of synchronous API calls to Amazon Textract for cost calculations (each synchronous API call maps to one page of the document). If you have large documents and want to utilize asynchronous APIs, you can use open source libraries to count the number of pages, or use Amazon Athena to write queries and extract the information from your CloudTrail logs to extract the page information for cost tracking.
Implement workload throughput or output monitoring – Implement workload throughput monitoring or alarming, initiating on either input requests or output completions. Configure it to provide notifications when workload requests or outputs drop to zero, indicating the workload resources are no longer used. Incorporate a time factor if the workload periodically drops to zero under normal conditions.
Group AWS resources – Create groups for AWS resources. You can use AWS resource groups to organize and manage your AWS resources that are in the same Region. You can add tags to most of your resources to help identify and sort your resources within your organization. Use Tag Editor to add tags to supported resources in bulk. Consider using AWS Service Catalog to create, manage, and distribute portfolios of approved products to end-users and manage the product lifecycle.

Use monitoring tools

AWS offers a variety of tools and resources to monitor the cost and usage of your IDP solution. The following is a list of AWS tools that help with cost monitoring and control:

AWS Budgets – Configure AWS Budgets on all accounts for your workload. Set budgets for the overall account spend and budgets for the workloads by using tags. Configure notifications in AWS Budgets to receive alerts for when you exceed your budgeted amounts or when your estimated costs exceed your budgets.
AWS Cost Explorer – Configure AWS Cost Explorer for your workload and accounts to visualize your cost data for further analysis. Create a dashboard for the workload that tracks overall spend, key usage metrics for the workload, and forecasts of future costs based on your historical cost data.
AWS Cost Anomaly Detection – Use AWS Cost Anomaly Detection for your accounts, core services, or cost categories you created to monitor your cost and usage and detect unusual spends. You can receive alerts individually in aggregated reports, and receive alerts in an email or an Amazon Simple Notification Service (Amazon SNS) topic, which allows you to analyze and determine the root cause of the anomaly and identify the factor that is driving the cost increase.
Advanced tools – Optionally, you can create custom tools for your organization that provide additional detail and granularity. You can implement advanced analysis capabilities using Athena and dashboards using Amazon QuickSight. Consider using Cloud Intelligence Dashboards for preconfigured, advanced dashboards. You can also work with AWS Partners and adopt their cloud management solutions to activate cloud bill monitoring and optimization in one convenient location.

Cost attribution and analysis

The process of categorizing costs is crucial in budgeting, accounting, financial reporting, decision-making, benchmarking, and project management. By classifying and categorizing expenses, teams can gain a better understanding of the types of costs they will incur throughout their cloud journey, helping them make informed decisions and manage budgets effectively. To improve the cost attribution and analysis of your IDP solution, follow these best practices:

Define your organization’s categories – Meet with stakeholders to define categories that reflect your organization’s structure and requirements. These will directly map to the structure of existing financial categories, such as business unit, budget, cost center, or department.
Define your functional categories – Meet with stakeholders to define categories that reflect the functions within your business. This may be your IDP workload or application names and the type of environment, such as production, testing, or development.
Define AWS cost categories – You can create cost categories to organize your cost and usage information. Use AWS Cost Categories to map your AWS costs and usage into meaningful categories. With cost categories, you can organize your costs using a rule-based engine.

Conclusion

In this post, we shared design principles, focus areas, and best practices for cost optimization in your IDP workflow.

To learn more about the IDP Well-Architected Custom Lens, explore the following posts in this series:

Build well-architected IDP solutions with a custom lens – Part 1: Operational excellence
Build well-architected IDP solutions with a custom lens – Part 2: Security
Build well-architected IDP solutions with a custom lens – Part 3: Reliability
Build well-architected IDP solutions with a custom lens – Part 4: Performance efficiency
Build well-architected IDP solutions with a custom lens – Part 5: Cost optimization
Build well-architected IDP solutions with a custom lens – Part 6: Sustainability

AWS is committed to the IDP Well-Architected Lens as a living tool. As IDP solutions and related AWS AI services evolve, and as new AWS services become available, we will update the IDP Well-Architected Lens accordingly.

To get started with IDP on AWS, refer to Guidance for Intelligent Document Processing on AWS to design and build your IDP application. For a deeper dive into end-to-end solutions that cover data ingestion, classification, extraction, enrichment, verification and validation, and consumption, refer to Intelligent document processing with AWS AI services: Part 1 and Part 2. Additionally, Intelligent document processing with Amazon Textract, Amazon Bedrock, and LangChain covers how to extend a new or existing IDP architecture with large language models (LLMs). You’ll learn you can integrate Amazon Textract with LangChain as a document loader, use Amazon Bedrock to extract data from documents, and use generative AI capabilities within the various IDP phases.

If you require additional expert guidance, contact your AWS account team to engage an IDP Specialist Solutions Architect.

About the Authors

Build well-architected IDP solutions with a custom lens – Part 6: Sustainability

November 22, 2023

by Christian Denich Amazon AWS

An intelligent document processing (IDP) project typically combines optical character recognition (OCR) and natural language processing (NLP) to automatically read and understand documents. Customers across all industries run IDP workloads on AWS to deliver business value by automating use cases such as KYC forms, tax documents, invoices, insurance claims, delivery reports, inventory reports, and more. IDP workflows on AWS can help you extract business insights from your documents, reduce manual effort, and process documents faster and with higher accuracy.

Building a production-ready IDP solution in the cloud requires a series of trade-offs between cost, availability, processing speed, and sustainability. This post provides guidance and best practices on how to improve the sustainability of your IDP workflow using Amazon Textract, Amazon Comprehend, and the IDP Well-Architected Custom Lens.

The AWS Well-Architected Framework helps you understand the benefits and risks of decisions made while building workloads on AWS. The AWS Well-Architected Custom Lenses complement the Well-Architected Framework with more industry-, domain-, or workflow-specific content. By using the Well-Architected Framework and the IDP Well-Architected Custom Lens, you will learn about operational and architectural best practices for designing and operating reliable, secure, efficient, cost-effective, and sustainable workloads in the cloud.

The IDP Well-Architected Custom Lens provides you with guidance on how to address common challenges in IDP workflows that we see in the field. By answering a series of questions in the Well-Architected Tool, you will be able to identify the potential risks and address them by following the improvement plan.

This post focuses on the Sustainability pillar of the IDP custom lens. The Sustainability pillar focuses on designing and implementing the solution to minimize the environmental impact of your workload and minimize waste by adhering to the following design principles: understand your impact, maximize resource utilization and use managed services, and anticipate change and prepare for improvements. These principles help you stay focused as you dive into the focus areas: achieving business results with sustainability in mind, effectively managing your data and its lifecycle, and being ready for and driving continuous improvement.

Design principles

The Sustainability pillar focuses on designing and implementing the solution through the following design principles:

Understand your impact – Measure the sustainability impact of your IDP workload and model the future impact of your workload. Include all sources of impact, including the impact of customer use of your products. This also includes the impact of IDP that enables digitization and allows your company or customers to complete paperless processes. Establish key performance indicators (KPIs) for your IDP workload to evaluate ways to improve productivity and efficiency while reducing environmental impact.
Maximize resource utilization and use managed services – Minimize idle resources, processing, and storage to reduce the total energy required to run your IDP workload. AWS operates at scale, so sharing services across a broad customer base helps maximize resource utilization, which maximizes energy efficiency and reduces the amount of infrastructure needed to support IDP workloads. With AWS managed services, you can minimize the impact of your IDP workload on compute, networking, and storage.
Anticipate change and prepare for improvements – Anticipate change and support the upstream improvements your partners and suppliers make to help you reduce the impact of your IDP workloads. Continuously monitor and evaluate new, more efficient hardware and software offerings. Design for flexibility to lower barriers for introducing changes and allow for the rapid adoption of new efficient technologies.

Focus areas

The design principles and best practices of the Sustainability pillar are based on insights gathered from our customers and our IDP technical specialist communities. You can use them as guidance to support your design decisions and align your IDP solution with your business and sustainability requirements.

The following are the focus areas for sustainability of IDP solutions in the cloud: achieve business results with sustainability in mind, effectively manage your data and its lifecycle, and be ready for and drive continuous improvement.

Achieve business results with sustainability in mind

To determine the best Regions for your business needs and sustainability goals, we recommend the following steps:

Evaluate and shortlist potential Regions – Start by shortlisting potential Regions for your workload based on your business requirements, including compliance, cost, and latency. Newer services and features are deployed to Regions gradually. Refer to List of AWS Services Available by Region to check which Regions have the services and features you need to run your IDP workload.
Choose a Region powered by 100% renewable energy – From your shortlist, identify Regions close to Amazon’s renewable energy projects and Regions where, in 2022, the electricity consumed was attributable to 100% renewable energy. Based on the Greenhouse Gas (GHG) Protocol, there are two methods for tracking emissions from electricity production: market-based and location-based. Companies can choose one of these methods based on their sustainability policies to track and compare their emissions from year to year. Amazon uses the market-based model to report our emissions. To reduce your carbon footprint, select a Region where, in 2022, the electricity consumed was attributable to 100% renewable energy.

Effectively manage your data and its lifecycle

Data plays a key role throughout your IDP solution. Starting with the initial data ingestion, data is pushed through various stages of processing, and finally returned as output to end-users. It’s important to understand how data management choices will affect the overall IDP solution and its sustainability. Storing and accessing data efficiently, in addition to reducing idle storage resources, results in a more efficient and sustainable architecture. When considering different storage mechanisms, remember that you’re making tradeoffs between resource efficiency, access latency, and reliability. This means you’ll need to select your management pattern accordingly. In this section, we discuss some best practices for data management.

Create and ingest only relevant data

To optimize your storage footprint for sustainability, evaluate what data is needed to meet your business objectives and create and ingest only relevant data along your IDP workflow.

Store only relevant data

When designing your IDP workflow, consider for each step in your workflow which intermediate data outputs need to be stored. In most IDP workflows, it’s not necessary to store the data used or created in each intermediate step because it can be easily reproduced. To improve sustainability, only store data that is not easily reproducible. If you need to store intermediate results, consider whether they qualify for a lifecycle rule that archives and deletes them more quickly than data with stricter retention requirements.

Preserve data across computing environments such as development and staging. Implement mechanisms to enforce a data lifecycle management process including archiving and deletion and continuously identify unused data and delete it.

To optimize your data ingest and storage, consider the optimal data resolution that satisfies the use case. Amazon Textract requires at least 150 DPI. If your document isn’t in a supported Amazon Textract format (PDF, TIFF, JPEG, and PNG) and you need to convert it, experiment to find the optimal resolution for best results rather than choosing the maximum resolution.

Use the right technology to store data

For IDP workflows, most of the data is likely to be documents. Amazon Simple Storage Service (Amazon S3) is an object storage built to store and retrieve any amount of data from anywhere, making it well suited for IDP workflows. Using different Amazon S3 storage tiers is a key component of optimizing storage for sustainability.

When considering different storage mechanisms, remember that you’re making trade-offs between resource efficiency, access latency, and reliability. That means you’ll need to select your management pattern accordingly. By storing less volatile data on technologies designed for efficient long-term storage, you can optimize your storage footprint. For archiving data or storing data that changes slowly, Amazon S3 Glacier and Amazon S3 Glacier Deep Archive are available. Depending on your data classification and workflow, you can choose Amazon S3 One Zone-IA, which reduces power and server capacity by storing data within a single Availability Zone.

Actively manage your data lifecycle according to your sustainability goals

Managing your data lifecycle means optimizing your storage footprint. For IDP workflows, first identify your data retention requirements. Based on to your retention requirements, create Amazon S3 Lifecycle configurations that automatically transfer objects to a different storage class based on your predefined rules. For data with no retention requirements and unknown or changing access patterns, use Amazon S3 Intelligent-Tiering to monitor access patterns and automatically move objects between tiers.

Continuously optimize your storage footprint by using the right tools

Over time, the data usage and access pattern in your IDP workflow may change. Tools like Amazon S3 Storage Lens deliver visibility into storage usage and activity trends, and even make recommendations for improvements. You can use this information to further lower the environmental impact of storing data.

Enable data and compute proximity

As you make your IDP workflow available to more customers, the amount of data traveling over the network will increase. Similarly, the larger the size of the data and the greater the distance a packet must travel, the more resources are required to transmit it.

Reducing the amount of data sent over the network and optimizing the path a packet takes will result in more efficient data transfer. Setting up data storage close to data processing helps optimize sustainability at the network layer. Ensure that the Region used to store the data is the same Region where you have deployed your IDP workflow. This approach helps minimize the time and cost of transferring data to the computing environment.

Be ready for and drive continuous improvement

Improving sustainability for your IDP workflow is a continuous process that requires flexible architectures and automation to support smaller, frequent improvements. When your architecture is loosely coupled and uses serverless and managed services, you can enable new features without difficulty and replace components to improve sustainability and gain performance efficiencies. In this section, we share some best practices.

Improve safely and continuously through automation

Using automation to deploy all changes reduces the potential for human error and enables you to test before making production changes to ensure your plans are complete. Automate your software delivery process using continuous integration and continuous delivery (CI/CD) pipelines to test and deploy potential improvements to reduce effort and limit errors caused by manual processes. Define changes using infrastructure as code (IaC): all configurations should be defined declaratively and stored in a source control system like AWS CodeCommit, just like application code. Infrastructure provisioning, orchestration, and deployment should also support IaC.

Use serverless services for workflow orchestration

IDP workflows are typically characterized by high peaks and periods of inactivity (such as outside of business hours), and are mostly driven by events (for example, when a new document is uploaded). This makes them a good fit for serverless solutions. AWS serverless services can help you build a scalable solution for IDP workflows quickly and sustainably. Services such as AWS Lambda, AWS Step Functions, and Amazon EventBridge help orchestrate your workflow driven by events and minimize idle resources to improve sustainability.

Use an event-driven architecture

Using AWS serverless services to implement an event-driven approach will allow you to build scalable, fault-tolerant IDP workflows and minimize idle resources.

For example, you can configure Amazon S3 to start a new workflow when a new document is uploaded. Amazon S3 can trigger EventBridge or call a Lambda function to start an Amazon Textract detection job. You can use Amazon Simple Notification Service (Amazon SNS) topics for event fanout or to send job completion messages. You can use Amazon Simple Queue Service (Amazon SQS) for reliable and durable communication between microservices, such as invoking a Lambda function to read Amazon Textract output and then calling a custom Amazon Comprehend classifier to classify a document.

Use managed services like Amazon Textract and Amazon Comprehend

You can perform IDP using a self-hosted custom model or managed services such as Amazon Textract and Amazon Comprehend. By using managed services instead of your custom model, you can reduce the effort required to develop, train, and retrain your custom model. Managed services use shared resources, reducing the energy required to build and maintain an IDP solution and improving sustainability.

Review AWS blog posts to stay informed about feature updates

There are several blog posts and resources available to help you stay on top of AWS announcements and learn about new features that may improve your IDP workload.
AWS re:Post is a community-driven Q&A service designed to help AWS customers remove technical roadblocks, accelerate innovation, and enhance operations. AWS re:Post has over 40 topics, including a community dedicated to AWS Well-Architected. AWS also has service-specific blogs to help you to stay up to date for Amazon Textract and Amazon Comprehend.

Conclusion

In this post, we shared design principles, focus areas, and best practices for optimizing sustainability in your IDP workflow. To learn more about sustainability in the cloud, refer to the following series on Optimizing your AWS Infrastructure for Sustainability, Part I: Compute, Part II: Storage, and Part III: Networking.

To learn more about the IDP Well-Architected Custom Lens, explore the following posts in this series:

Build well-architected IDP solutions with a custom lens – Part 1: Operational excellence
Build well-architected IDP solutions with a custom lens – Part 2: Security
Build well-architected IDP solutions with a custom lens – Part 3: Reliability
Build well-architected IDP solutions with a custom lens – Part 4: Performance efficiency
Build well-architected IDP solutions with a custom lens – Part 5: Cost optimization
Build well-architected IDP solutions with a custom lens – Part 6: Sustainability

To get started with IDP on AWS, refer to Guidance for Intelligent Document Processing on AWS to design and build your IDP application. For a deeper dive into end-to-end solutions that cover data ingestion, classification, extraction, enrichment, verification and validation, and consumption, refer to Intelligent document processing with AWS AI services: Part 1 and Part 2. Additionally, Intelligent document processing with Amazon Textract, Amazon Bedrock, and LangChain covers how to extend a new or existing IDP architecture with large language models (LLMs). You’ll learn how you can integrate Amazon Textract with LangChain as a document loader, use Amazon Bedrock to extract data from documents, and use generative AI capabilities within the various IDP phases.

If you require additional expert guidance, contact your AWS account team to engage an IDP Specialist Solutions Architect.

About the Author

Christian Denich is a Global Customer Solutions Manager at AWS. He is passionate about automotive, AI/ML and developer productivity. He supports some the world’s largest automotive brands on their cloud journey, encompassing cloud and business strategy as well as technology. Before joining AWS, Christian worked at BMW Group in both hardware and software development in various projects including connected navigation.

How Amazon Search M5 saved 30% for LLM training cost by using AWS Trainium

November 22, 2023

by Jerry Mannil Amazon AWS

For decades, Amazon has pioneered and innovated machine learning (ML), bringing delightful experiences to its customers. From the earliest days, Amazon has used ML for various use cases such as book recommendations, search, and fraud detection. Similar to the rest of the industry, the advancements of accelerated hardware have allowed Amazon teams to pursue model architectures using neural networks and deep learning (DL).

The M5 program within Amazon Search owns the discovery learning strategy for Amazon and builds large-scale models across multi-lingual, multi-locale, multi-entity, multitask, and multi-modal such as text, image, and video. The M5 program has been serving universal embeddings and large-scale foundation models to hundreds of ML teams across Amazon while maintaining strict controls over cost optimization. In order to achieve this, the M5 team regularly evaluates new techniques to reduce cost.

Like many ML organizations, accelerators are largely used to accelerate DL training and inference. When AWS launched purpose-built accelerators with the first release of AWS Inferentia in 2020, the M5 team quickly began to utilize them to more efficiently deploy production workloads, saving both cost and reducing latency. Last year, AWS launched its AWS Trainium accelerators, which optimize performance per cost for developing and building next generation DL models. In this post, we discuss how M5 was able to reduce the cost to train their models by 30%, and share some of the best practices we learned along the way.

Trainium instances

With the advances in purpose-built accelerators, Amazon also provides compelling accelerators in the form of AWS Inferentia and Trainium. As their names imply, these chips are optimized to exceed the needs of inference and training workloads, respectively. For large-scale training of foundation models that reach billions of parameters in size, Trainium Trn1 and Trn1n instances are ideal choices due to their characteristics. Trn1 instances are powered by the state-of-the-art NeuronCore-v2, and have a copious amount of accelerator compute and memory. Trn1n instances can also be chosen for a greater amount of networking bandwidth (1,600 Gbs), so are ideally suited for performant training with cost optimization in mind.

To use accelerators, you need a software layer to support them. With Trn and Inf chips, the AWS Neuron SDK unlocks Amazon purpose-built accelerators with the help of PyTorch XLA. PyTorch XLA converts PyTorch’s eager mode to lazy mode graph-based implementation. These graphs are then used and further compiled to be used with the accelerator. PyTorch Neuron (part of the Neuron SDK) enables PyTorch users to train their models on Trainium NeuronCores with a few lines of code.

Model and workload

The M5 team trains and deploys foundational models and universal representations to assist various teams across Amazon in bringing delight to Amazon.com customers. One such model is a text encoder model followed by a multi-layer perceptron (MLP) with explicit or implicit feature interactions defined by the neural network architecture with hundreds of millions of trainable parameters. This model is trained on billions of tokens, and is used to generate millions of embeddings in an offline batch inference setting. These embeddings are inputs to a customer-facing tier-1 Amazon service.

The infrastructure for the production pipeline uses AWS Batch with fair share queuing strategies, using an EFA-enabled multi-node trn1.32xlarge cluster as the compute for model training. Functionally, the production pipeline performs incremental model training, evaluation of trained model, and offline batch inference on the trained model, all using PyTorch as the underlying DL library.

Goals

Delighting our customers is a foremost tenet. Given the customer-facing nature of the pipeline, it’s critical that all service-level agreements (SLAs) be met without regressions. We identified two critical acceptance criteria to adapt our existing GPU production pipeline and transition it to Trainium:

Model quality – The quality of our models directly impacts customer experience. We require that there should be less than 0.1% difference in model quality between GPU and Trainium.
Training throughput – We iteratively train our models periodically to provide the freshest experience to our customers. We require that model convergence must be achieved within a predefined period of time (such as 1 week) to meet our production SLAs.

In the following sections, we share our journey of working backward from this criteria, and our learnings to support Amazon-scale production workloads.

Training script

Before starting with model training, we need to make changes to the training script to make it XLA compliant. Given the size of the model, we use distributed data parallel (DDP) to train the model. DDP allows us to increase the throughput of model training by scaling up the number of machines used to run model training, without any code changes. We followed the instructions provided in the Neuron PyTorch MLP training tutorial to add XLA-specific constructs in our training scripts. These code changes are straightforward to implement. The following are some significant technical learnings from the exercise that greatly improved our model throughput:

Placement of xm.mark_step() – xm.mark_step() compiles and runs the lazily collected computation graphs. Invoking mark_step too many times will lead to a larger number of small graphs, whereas invoking it too few times will lead to few, but large graphs. Depending on your application, the throughput and implementation of your model training will vary based on your placement of xm.mark_step(). Our implementation places one xm.mark_step() after a forward and backward pass, and one after the optimizer step.
Data loader wrapping with XLA multiprocessing device loader – This is a critical step that can be easily missed. The multiprocessing device loader torch_xla.distributed.parallel_loader.MpDeviceLoader loads training data on each XLA device with options to preload and overlap data loading with device runs for improving throughput. The device loader also invokes xm.mark_step() and is therefore able to build graphs for data loading to device from host.

Compilation for Trainium

Traditionally, the model development cycle with GPUs involves making changes to the model or training script and directly running it on the GPU device. Accelerators such as Trainium that use XLA require an additional step before model training can be run on the accelerator. XLA computation graphs can only be run after they have been compiled. Generally, there are two ways to perform this compilation: Ahead of Time (AOT), where you trace and compile all graphs first and then run them, or Just In Time (JIT), where graphs are traced, compiled, and run as they are encountered. The Neuron SDK provides both of these out of the box. Typically, AOT compilation is performed first. Graphs are then run after this compilation. If new graphs are encountered, the Neuron runtime invokes a JIT compilation before running them. To perform AOT compilation, the Neuron SDK provides neuron_parallel_compile, a compilation utility that extracts graphs from a trial run of the training script and performs parallel AOT compilation.

An important aspect of AOT compilation is to ensure that no new computation graphs are created over the course of training. One source of new computation graphs (and therefore recompilations) is dynamic shapes of the training batches during model training. We found that using static shapes and fixed-size batches eliminates training time compilations and greatly improves training throughput without any effect on model accuracy. By enforcing such constraints on training, we observed that only 4–5 steps of model training, one step of model validation, and checkpointing the model one time is required for tracing all the graphs during AOT compilation. It’s important to note that the Neuron SDK is constantly evolving, and in the future will support dynamic shapes as well.

Furthermore, the compiled graphs are stored in the Neuron Persistent Cache on disk or in an Amazon Simple Storage Service (Amazon S3) bucket. This is especially useful for production workloads where model architecture and training configuration doesn’t change. Therefore, the overhead of compilation is incurred just one time. Using the cache is as simple as setting an environment flag:

export NEURON_COMPILE_CACHE_URL="s3://BUCKET/KEY"

The Neuron compiler also provides three compiler-level optimization options (O1, O2, O3) to balance compilation time and model run throughput. O1 enables core optimizations on the compute graph and minimizes compilation time, O3 provides improved model run throughput at the cost of higher compilation time, and O2 (default option) is a balance between the two. For our use case, we used the O1 optimization and observed an 86% reduction in compilation time with no change to model accuracy metrics, while observing approximately a 5–7% reduction in throughput compared to the default optimization (O2). Depending on the use case, you can choose different levels of optimization.

To summarize, we used the following flags for compilation:

NEURON_CC_FLAGS="--target trn1 --auto-cast all --auto-cast-type bf16 --model-type transformer --optlevel O1"

Checkpoint compatibility

When compilation is successfully complete, we can proceed to train our models on Trainium. As mentioned earlier, we incrementally train our models, meaning we load a previously trained model checkpoint and continue training with new data. PyTorch and PyTorch XLA allow seamless transitioning between accelerators through checkpoint interoperability. Having the flexibility of moving between GPU and Trainium enabled us to seamlessly load the previous GPU model and train on Trainium machines. This was critical to ensure that we can initialize our model with the best previously trained model without any production downtime or loss in model accuracy.

Because the GPU model was saved using standard PyTorch model saving utilities, we were able to use the PyTorch checkpoint loading utility to load the GPU model on Trainium devices.

For example, on GPU/CPU, you can save the model with the following code:

torch.save(model.state_dict(), PATH)

Then you load the model back on Trainium:

import torch_xla.core.xla_model as xm
xla_device = xm.xla_device()
model = MyModel(*args, **kwargs)
model.load_state_dict(torch.load(PATH))
model.to(xla_device)

Similarly, you can save the model on Trainium with the following code:

import torch_xla.core.xla_model as xm
# automatically moves the data to CPU for the master device
xm.save(model.state_dict(), PATH)

And load the model back on GPU/CPU:

model = MyModel(*args, **kwargs)
model.load_state_dict(torch.load(PATH))
model.to(device) # can be any device

In fact, because we use DDP for model training, the model loading is agnostic of the number of machines used to train the previous checkpoint. This allows us to horizontally scale the Trn1 fleet with no code changes or adverse effects to model training. These PyTorch-based checkpoints can be directly used or even torch-scripted for inference use cases on AWS Inferentia2 or other accelerators.

Operational stability

It cannot be emphasized enough that running workloads in production requires multiple SLAs to be met. For our use case, apart from the model quality and training throughput SLAs, it’s imperative that the production pipeline be operationally stable, meaning minimal downtime and disruptions during model training, evaluation, and inference.

As with the existing GPU based pipeline, we added numerous mechanisms to make the pipeline operationally stable. Before starting model training, we run multiple sanity tests to assess the health of the machines. These tests generally include simple tensor operations to verify the health of the accelerator devices. We have observed that for distributed training, it’s important to run tests to verify collective communication between instances as well. We used the NCCOM test suite from the Neuron SDK to achieve this, running a variety of operations such as all-gather, all-reduce, and reduce-scatter.

Even after following the suggestions we’ve mentioned, we have observed that transient issues are inevitable in any pipeline, irrespective of the underlying accelerator. To build resiliency in any training pipeline, we recommend building in retry mechanisms to resolve these potential issues. We use AWS Batch automated retries to retry jobs that encounter a transient failure during model training. These restarts can be costly if a failure is encountered towards the end of training. To counter this problem, we have adapted our training scripts to load a previously trained model checkpoint and continue training from that point. With this functionality, we are able to aggressively restart failed training jobs with minimal overhead.

With these resiliency mechanisms in place, we were able to achieve 98.5% success rates for our workloads on Trn1, comparable to our existing GPU pipeline success rates.

Results

To validate the accuracy of our models, we initialized two models from the same GPU checkpoint, and trained one on Trainium and the other on a comparable GPU. Both models were trained with the same training hyperparameters. The dataset used for metrics calculation is a holdout dataset, and we evaluate the model’s accuracy on this dataset every N global steps. X-axis is the global step, and Y-axis is the model accuracy. We observed less than 0.1% difference in the model accuracy at each point in the following graph.

Furthermore, to evaluate the cost-effectiveness of the model training, we prefer to compare the wall clock time taken to reach model convergence. We believe this provides a more practical view of cost savings compared to measures such as cost per token, achieved FLOPS/dollar, and other factors. Considering the training time of trn1.32xl and comparable Amazon Elastic Compute Cloud (Amazon EC2) instances, we have observed that Trainium offers up to 30% cheaper cost to model convergence.

Conclusion

There are many factors to consider when evaluating different accelerators for your DL workloads. Some of the most important are model quality, throughput, cost, and availability. It is paramount to ensure that your model quality and throughput are not sacrificed based on the accelerator you choose.

Thanks to our partnership and collaboration with the Annapurna Neuron team, the Amazon Search M5 team has been able to save up to 30% in cost by moving to Trainium. The team is able to use Trainium and achieve model quality and throughput parity with comparable accelerators in the market. Checkpoint interoperability and minimal code changes with support for XLA have allowed M5 to choose between multiple accelerators for their workloads. This has enabled the M5 team to take advantage of the large compute power of Trainium, and build accelerator agnostic solutions to delight Amazon.com customers. From an operational standpoint, Trainium has been proven capable of supporting tier-1 services at Amazon scale. The M5 team continues to move more workloads to Trainium to provide the best models for Amazon at the lowest costs.

In summary, the M5 team has been able to perform cost-effective, production-grade ML training by adding Trainium to the fleet of accelerators. We encourage you to take a look at Trainium and other Neuron devices like AWS Inferentia to reap the benefits of purpose-built Amazon silicon for ML workloads. Get started easily with one of the many tutorials featuring different models, like Llama 2, available on Trainium.

About the Authors

Jerry Mannil is a software engineer at Amazon Search. He works on improving the efficiency, robustness and scalibility of the distributed training infrastructure.

Ken Su is a software engineer at Amazon Search. He works on improving training efficiency and scalable distributed training workflow. Outside work, he likes hiking and tennis.

RJ is an Engineer within Amazon. He builds and optimizes systems for distributed systems for training and works on optimizing adopting systems to reduce latency for ML Inference. Outside work, he is exploring using Generative AI for building food recipes.

Abhinandan Patni is a Senior Software Engineer at Amazon Search. He focuses on building systems and tooling for scalable distributed deep learning training and real time inference.

James Park is a Solutions Architect at Amazon Web Services. He works with Amazon.com to design, build, and deploy technology solutions on AWS, and has a particular interest in AI and machine learning. In h is spare time he enjoys seeking out new cultures, new experiences, and staying up to date with the latest technology trends. You can find him on LinkedIn.

Geospatial generative AI with Amazon Bedrock and Amazon Location Service

November 22, 2023

by Jeff DeMuth Amazon AWS

Today, geospatial workflows typically consist of loading data, transforming it, and then producing visual insights like maps, text, or charts. Generative AI can automate these tasks through autonomous agents. In this post, we discuss how to use foundation models from Amazon Bedrock to power agents to complete geospatial tasks. These agents can perform various tasks and answer questions using location-based services like geocoding available through Amazon Location Service. We also share sample code that uses an agent to bridge the capabilities of Amazon Bedrock with Amazon Location. Additionally, we discuss the design considerations that went into building it.

Amazon Bedrock is a fully managed service that offers an easy-to-use API for accessing foundation models for text, image, and embedding. Amazon Location offers an API for maps, places, and routing with data provided by trusted third parties such as Esri, HERE, Grab, and OpenStreetMap. If you need full control of your infrastructure, you can use Amazon SageMaker JumpStart, which gives you the ability to deploy foundation models and has access to hundreds of models.

Solution overview

In the realm of large language models (LLMs), an agent is an entity that can autonomously reason and complete tasks with an LLM’s help. This allows LLMs to go beyond text generation to conduct conversations and complete domain-specific tasks. To guide this behavior, we employ reasoning patterns. According to the research paper Large Language Models are Zero-Shot Reasoners, LLMs excel at high-level reasoning, despite having a knowledge cutoff.

We selected Claude 2 as our foundational model from Amazon Bedrock with the aim of creating a geospatial agent capable of handling geospatial tasks. The overarching concept was straightforward: think like a geospatial data scientist. The task involved writing Python code to read data, transform it, and then visualize it in an interesting map. We utilized a prompting pattern known as Plan-and-Solve Prompting for this purpose.

Using a Plan-and-Solve strategy allows for multi-step reasoning and developing a high-level plan as the first task. This works well for our load, transform, and visualize workflow, and is the high-level plan our agent will use. Each of these subtasks are sent to Claude 2 to solve separately.

We devised an example task to create a price heatmap of Airbnb listings in New York. To plan a path to complete the task, the agent needs to understand the dataset. The agent needs to know the columns in the dataset and the type of data in those columns. We generate a summary from the dataset so the agent can plan for the task provided by the user, in this case, generating a heatmap.

Prerequisites

There are a few prerequisites to deploy the demo. You’ll need access to an AWS account with an access key or AWS Identity and Access Management (IAM) role with permissions to Amazon Bedrock and Amazon Location. You will need to create a map, a place index, and an Amazon Location API key using the Amazon Location console. You will also need access to either a local or virtual environment where Docker is installed. In our demonstration, we use an Amazon Elastic Compute Cloud (Amazon EC2) instance running Amazon Linux with Docker installed.

Read and summarize the data

To give the agent context about the dataset, we prompt Claude 2 to write Python code that reads the data and provides a summary relevant to our task. The following are a few of the prompts we included. The full list is available in the prompts.py file in the project.

Your Role: You are a geospatial data analyzer designed to analyze data schema from arbitrary geospatial datasets.
Your Task: You are provided a set of file URLs. You need to generate a Python function that meets the following:
    Read each file using geopandas. Each file could be CSV, shape file, or GeoJSON.
    Use built-in function resolved_file_url= get_data_file_url(file_url, session_id) to get downloadable URLs.

From these prompts, Claude 2 generated the following Python code:

import geopandas as gpd
import pandas as pd
import numpy as np
 
def analyze_geospatial_data(file_urls):
    dataframes = []
    for file_url in file_urls:
        resolved_file_url = get_data_file_url(file_url, 'a5b188fe-91f7-4f92-92fe-c4030721a427', 'local')
        if file_url.endswith('.csv'):
            df = pd.read_csv(resolved_file_url)
        elif file_url.endswith('.shp'):
            df = gpd.read_file(resolved_file_url)
        elif file_url.endswith('.parquet'):
            df = pd.read_parquet(resolved_file_url)
        elif file_url.endswith('.geojson'):
            df = gpd.read_file(resolved_file_url)
        else:
            raise ValueError('Unsupported file format')
 
        df = df.sample(n=3, random_state=42)
        dataframes.append({
            'file_url': file_url, 
            'resolved_file_url': resolved_file_url,
            'data_frame': df,
            'column_names': list(df.columns)
        })
    return dataframes
 
file_urls = ['agent://airbnb_listings_price.csv']
dataframes = analyze_geospatial_data(file_urls)

The agent:// prefix tells our CLI that this file is stored inside the session storage folder. Upon running the CLI, it will create a geospatial-agent-session-storage folder to store local data. The agent now recognizes the uploaded Airbnb data and its column names. Next, let’s ask Claude for some hints to generate a heatmap using these columns. We will pass in the following prompts.

Your Role: You are a geospatial data analyzer designed to analyze data schema from arbitrary geospatial data sets.
Your Task: You are provided with a table with some rows data. Your task is to generate a summary that describes the data in the table following the requirements below.
    The first sentence should be summary of the data in the table from the aspect of the user action.
    If there is no geometry column in the table, the second sentence should note column names that can be used to generate a geometry column in geopandas.

Claude 2 replies with a hint

Summary: The table contains columns for listing id, name, host id, host name, location details like neighborhood, latitude, and longitude that can be used to generate a geometry column for plotting. Price column contains listing price that can be used to generate a heatmap of listing prices.

Plan a solution

Now we can ask Claude 2 to draft a high-level plan. We can use a graph library for Python called NetworkX to map out the steps for solving our problem.

Similar to our last prompt, we pass in a few requirements to guide Claude 2:

Your Role: You are a geospatial data scientist and an expert Python developer.
Your Task: A series of consecutive steps.
Your Reply needs to meet these requirements:
    Create a single NetworkX graph.
    No disconnected components allowed.
    First operations are data loading.
    Use geopandas for spatial data.

From these prompts, Claude 2 generated the following Python code:

import networkx as nx
G = nx.DiGraph()
# Load the data
G.add_node("airbnb_data_url", node_type="data", data_path="agent://airbnb_listings_price.csv", 
           description="URL for Airbnb listing price data")
G.add_node("load_data", node_type="operation", operation_type="load",
           description="Load the Airbnb listing price data")
G.add_edge("airbnb_data_url", "load_data")
G.add_node("airbnb_df", node_type="data", 
           description="Pandas DataFrame containing Airbnb listing prices")
G.add_edge("load_data", "airbnb_df")
# Transform the data
G.add_node("add_geometry", node_type="operation", operation_type="transform",
           description="Create a geometry column using latitude and longitude")
G.add_edge("airbnb_df", "add_geometry")
G.add_node("airbnb_gdf", node_type="data",
           description="GeoDataFrame with geometry column added")
G.add_edge("add_geometry", "airbnb_gdf")
# Visualize the data
G.add_node("plot_heatmap", node_type="operation", operation_type="visualization",
           description="Generate a heatmap of listing prices")
G.add_edge("airbnb_gdf", "plot_heatmap")
G.add_node("price_heatmap", node_type="data", 
           description="The heatmap plot showing Airbnb listing prices")
G.add_edge("plot_heatmap", "price_heatmap")

In this NetworkX graph, Claude 2 breaks down the process into three main segments:

Loading data – Importing the Airbnb listing prices from a given URL into a Pandas DataFrame
Transforming data – Creating a geometry column based on the latitude and longitude coordinates
Visualizing data – Generating a heatmap to display the pricing of Airbnb listings

This approach allows for a clear and straightforward implementation of the geospatial task at hand. We can use GraphViz to visualize the following workflow.

Implement the plan

Now that Claude 2 has provided us with a plan, it’s time to bring it to life. For each step, we prompt Claude 2 to write the corresponding code. To keep Claude 2 focused, we supply high-level requirements for each task. Let’s dive into the code that Claude 2 generated for each individual phase.

Load the data

To load the Airbnb listing price data into a Pandas DataFrame, we create a prompt and pass in some parameters. The “Load Airbnb data” in the Operation_task is referencing the Load Data node in our graph we created earlier.

Your Role: You are a geospatial data scientist and an expert Python developer.
Operation_task: You need to generate a Python function to do: Load Airbnb data.
Your Reply needs to meet these requirements:
    The function definition is: 
    load_data(airbnb_data_url='agent://airbnb_listings_price.csv').
    When doing spatial joins, retain at least 1 geometry column.
    Point function requires importing shapely library.

From these prompts, Claude 2 generated the following Python code:

import pandas as pd
def load_data(airbnb_data_url='agent://airbnb_listings_price.csv'):
    """
    Load Airbnb listing price data
    """
    file_url = get_data_file_url(airbnb_data_url, session_id, storage_mode) 
    airbnb_df = pd.read_csv(file_url)
    return airbnb_df

Transform the data

Next, Claude 2 generates the code to add a geometry column to our DataFrame using latitude and longitude. For this prompt, we pass in the following requirements:

Your Role: You are a geospatial data scientist and an expert Python developer.
Operation_task: You need to generate a Python function to do: 
    Add geometry column from latitude/longitude.
    Your Reply needs to meet these requirements:
    The function definition is: add_geometry(filtered_df).
    The function return line is: return airbnb_gdf.

From these prompts, Claude 2 generated the following Python code:

import geopandas as gpd
from shapely.geometry import Point
def add_geometry(airbnb_df):
    """
    Add geometry column from latitude and longitude
    """
    airbnb_df['geometry'] = airbnb_df.apply(lambda row: Point(row['longitude'], row['latitude']), axis=1)
    airbnb_gdf = gpd.GeoDataFrame(airbnb_df, geometry='geometry')
 
    return airbnb_gdf

Visualize the data

Finally, Claude 2 builds a heatmap visualization using pydeck, which is a Python library for spatial rendering. For this prompt, we pass in the following requirements:

Your Role: You are a geospatial data scientist and an expert Python developer.
Operation_task: You need to generate a Python function to do: Generate heatmap from price.
Your Reply needs to meet these requirements:
    The function definition is: generate_heatmap(airbnb_gdf).
    The function return line is: return airbnb_heatmap.

From these prompts, Claude 2 generated the following Python code:

import pandas as pd
import geopandas as gpd
import pydeck as pdk
 
def plot_heatmap(airbnb_gdf):
    """
    Plot heatmap of listing prices
    """
    price_heatmap = pdk.Deck(
        map_style=location_map_style(),
        initial_view_state=pdk.ViewState(
            latitude=airbnb_gdf['latitude'].mean(),
            longitude=airbnb_gdf['longitude'].mean(),
            zoom=11,
            pitch=50,
        ),
        layers=[
            pdk.Layer(
                'HexagonLayer',
                data=airbnb_gdf,
                get_position=['longitude', 'latitude'],
                radius=100,
                elevation_scale=4,
                elevation_range=[0, 1000],
                pickable=True,
                extruded=True,
            ),
            pdk.Layer(
                'ScatterplotLayer',
                data=airbnb_gdf,
                get_position=['longitude', 'latitude'],
                get_color='[200, 30, 0, 160]',
                get_radius=200,
            ),
        ],
    )
 
    # Save heatmap HTML
    price_heatmap.to_html(get_local_file_path('airbnb_heatmap.html', session_id, task_name))
 
    return price_heatmap

When Claude 2 returns a response, it also includes some helpful notes explaining how each function meets the provided requirements. For example, for the heatmap visualization, Claude 2 noted the following:

"This function generates a heatmap of Airbnb listing prices using pydeck and saves the resulting HTML locally. It fulfills the requirements specified in the prompt."

Assemble the generated code

Now that Claude 2 has created the individual building blocks, it’s time to put it all together. The agent automatically assembles all these snippets into a single Python file. This script calls each of our functions in sequence, streamlining the entire process.

The final step looks like the following code:

session_id = "a5b188fe-91f7-4f92-92fe-c4030721a427"
task_name = "1694813661_airbnb_listings_price_heatmap"
storage_mode = "local"
# Sequentially invoke the functions
airbnb_df = load_data(airbnb_data_url='agent://airbnb_listings_price.csv')
airbnb_gdf = add_geometry(airbnb_df)
price_heatmap = plot_heatmap(airbnb_gdf)

After the script is complete, we can see that Claude 2 has created an HTML file with the code to visualize our heatmap. The following image shows New York on an Amazon Location basemap with a heatmap visualizing Airbnb listing prices.

Use Amazon Location with Amazon Bedrock

Although our Plan-and-Solve agent can handle this geospatial task, we need to take a slightly different approach for tasks like geocoding an address. For this, we can use a strategy called ReAct, where we combine reasoning and acting with our LLM.

In the ReAct pattern, the agent reasons and acts based on customer input and the tools at its disposal. To equip this Claude 2-powered agent with the capability to geocode, we developed a geocoding tool. This tool uses the Amazon Location Places API, specifically the SearchPlaceIndexForText method, to convert an address into its geographic coordinates.

Agent: Hi! I'm Agent Smith, your conversational geospatial assistant. How can I assist you today?
You: >? Hello, can you give me the coordinates for 112 E 11th St, Austin, TX 78701?
Agent: The coordinates for 112 E 11th St, Austin, TX 78701 are longitude -97.740590981087 and latitude 30.274118017533.

Within this brief exchange, the agent deciphers your intent to geocode an address, activates the geocoding tool, and returns the latitude and longitude.

Whether it’s plotting a heatmap or geocoding an address, Claude 2 combined with agents like ReAct and Plan and Solve can simplify geospatial workflows.

Deploy the demo

To get started, complete the following steps:

Clone the following repository either to your local machine or to an EC2 instance. You may need to run aws configure --profile <profilename> and set a default Region; this application was tested using us-east-1.

git clone https://github.com/aws-samples/amazon-location-geospatial-agent/

Now that we have the repository cloned, we configure our environment variables.

Change directories into the cloned project folder:

cd amazon-location-geospatial-agent

Edit the .env file using your preferred text editor:

vim .env

Add your map name, place index name, and API key:

API_KEY_NAME=AgentAPIKey
MAP_NAME=AgentMap
PLACE_INDEX_NAME=AgentPlaceIndex

Run the following command to build your container:

docker build -t agent .

Run the following command to run and connect to your Docker container:

docker run --rm -it -v ~/.aws:/root/.aws --entrypoint bash agent

Grab the Airbnb dataset:

apt install -y wget
wget http://data.insideairbnb.com/united-states/ny/new-york-city/2023-10-01/visualisations/listings.csv
cp listings.csv data/listings.csv

Run the following command to create a session. We use sessions to isolate unique chat environments.

SESSION_ID="3c18d48c-9c9b-488f-8229-e2e8016fa851" FILE_NAME="listings.csv" make create-session

Now you’re ready to start the application.

Run the following command to begin the chat application:

poetry run agent --session-id 3c18d48c-9c9b-488f-8229-e2e8016fa851 --profile <profilename>

You will be greeted with a chat prompt.

You can begin by asking the following question:

I've uploaded the file listings.csv. Draw a heatmap of Airbnb listing price.

The agent grabs the Airbnb_listings_price.csv file we have downloaded to the /data folder and parses it into a geospatial DataFrame. Then it generates the code to transform the data as well as the code for the visualization. Finally, it creates an HTML file that will be written in the /data folder, which you can open to visualize the heatmap in a browser.

Another example uses the Amazon Location Places API to geocode an address. If we ask the agent to geocode the address 112 E 11th St, Austin, TX 78701, we will get a response as shown in the following image.

Conclusion

In this post, we provided a brief overview of Amazon Bedrock and Amazon Location, and how you can use them together to analyze and visualize geospatial data. We also walked through Plan-and-Solve and ReAct and how we used them in our agent.

Our example only scratches the surface. Try downloading our sample code and adding your own agents and tools for your geospatial tasks.

About the authors

Jeff Demuth is a solutions architect who joined Amazon Web Services (AWS) in 2016. He focuses on the geospatial community and is passionate about geographic information systems (GIS) and technology. Outside of work, Jeff enjoys traveling, building Internet of Things (IoT) applications, and tinkering with the latest gadgets.

Swagata Prateek is a Senior Software Engineer working in Amazon Location Service at Amazon Web Services (AWS) where he focuses on Generative AI and geospatial.