Provide live agent assistance for your chatbot users with Amazon Lex and Talkdesk cloud contact center
Amazon Lex provides advanced conversational artificial intelligence (AI) capabilities to enable self-service support for your organization’s contact center. With Amazon Lex, you can implement an omnichannel strategy where customers engage via phone, websites, and messaging platforms. The bots can answer FAQs, provide self-service experiences, or triage customer requests before transferring to a human agent. Amazon Lex integrates with state-of-the-art contact centers including Amazon Connect, Genesys Cloud, and Amazon Chime SDK to facilitate a seamless omnichannel experience.
This is the second post of a two-part series. The integration of Amazon Lex with Talkdesk cloud contact center is inspired by WaFd Bank (WaFd)’s digital innovation journey to enhance customer experience. In our previous post, we described how Amazon Lex integrates with the Talkdesk cloud contact center for the voice channel. In this post, we are focusing on the chat channel to show how to use Amazon Lex and the Amazon Lex Web UI to enable live agents to interact with your customers in real time. For example, the following figure shows screenshots of a chatbot transitioning a customer to a live agent chat (courtesy of WaFd Bank).
Solution overview
The following diagram illustrates the solution architecture.
In the preceding architecture, the following sequence of steps takes place in a live customer/agent conversation:
- Using the Amazon Lex Web UI, a customer asks to be connected to an agent. The associated Amazon Lex chatbot is configured with an escalation intent to process the incoming agent assistance request.
- The Amazon Lex fulfillment AWS Lambda function retrieves the Talkdesk touchpoint ID and Talkdesk OAuth secrets from AWS Secrets Manager and initiates a request to Talkdesk Digital Connect using the Start a Conversation API. In the payload, the function includes information that may be useful to an agent, such as the customer sentiment or the history of previously traversed intents.
- If the request to the Talkdesk API is successful, a Talkdesk conversation ID is returned to Amazon Lex.
- The Amazon Lex fulfillment Lambda function stores the conversation ID in Amazon Lex session attributes, thus making the conversation ID accessible to the Amazon Lex Web UI.
- The Amazon Lex Web UI opens a communication session with agents on the Talkdesk contact center through a WebSocket API in Amazon API Gateway.
- The Lambda associated with the WebSocket API first stores the Talkdesk conversation ID to WebSocket client ID mappings in Amazon DynamoDB. Then, through the Talkdesk Send a Message API, the Lambda function sends the customer’s message to the agent on Talkdesk contact center.
- Your agent responds to the customer with a message sent through the callback Rest API in API Gateway. The payload includes the conversation ID of the active conversation.
- The callback Rest API is configured to support the agents’ incoming messages as well as the agent’s closing of the conversation. In order to send the agent’s message to the customer, the supporting Lambda function reads the WebSocket client ID associated to the conversation ID from the DynamoDB table. This makes sure the agent’s message is delivered to the appropriate WebSocket client ID.
- The agent’s response is displayed through the Amazon Lex Web UI and the customer responds or closes the chat as appropriate. Steps 6–9 are repeated as long as the conversation remains active. If the agent ends the conversation, the customer is notified and the WebSocket connection is closed.
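The sequence above can be sketched in code. The following Lambda handler outline covers steps 2 and 4 only; the payload fields, helper names, and IDs are illustrative assumptions rather than the exact Talkdesk Digital Connect schema, and the actual HTTP call and Secrets Manager lookups are omitted.

```python
import json

def build_start_conversation_payload(touchpoint_id, customer_message, session_attrs):
    """Assemble an illustrative Start a Conversation request body.

    The real field names are defined by the Talkdesk Digital Connect API;
    the ones below only show the kind of agent-facing context (sentiment,
    traversed intents) described in step 2.
    """
    return {
        "touchpoint_id": touchpoint_id,
        "message": {"value": customer_message},
        "context": {
            "sentiment": session_attrs.get("sentiment", "NEUTRAL"),
            "traversed_intents": session_attrs.get("intent_history", []),
        },
    }

def handler(event, context):
    # In the real function, the touchpoint ID and OAuth token come from
    # Secrets Manager and the payload is POSTed to Talkdesk; this sketch
    # stops at payload construction and the session-attribute update.
    session_attrs = event.get("sessionState", {}).get("sessionAttributes", {})
    payload = build_start_conversation_payload(
        "tp-0000", "I need to talk to an agent", session_attrs)
    # Step 4: make the conversation ID (a stand-in here) available to the
    # Amazon Lex Web UI through session attributes.
    session_attrs["talkdeskConversationId"] = "conv-0000"
    return {"payload": json.dumps(payload), "sessionAttributes": session_attrs}
```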
In the following sections, we walk you through the steps to build the solution architecture. Dependencies among each step are cross-referenced.
Prerequisites
To implement the solution presented in this post, you should first familiarize yourself with the following AWS services and features:
- Amazon API Gateway
- Amazon CloudWatch
- Amazon DynamoDB
- AWS Identity and Access Management (IAM)
- AWS Lambda
- Amazon Lex
- AWS Secrets Manager
Additionally, you should be familiar with the following Talkdesk services:
- Talkdesk cloud contact center account
- Talkdesk Digital Engagement with chat channel for agents
- Talkdesk Digital Connect API – To send messages to the agent, use the following Talkdesk Digital Connect API components:
- Start a Conversation – Create an identifier used to interact with an agent
- Send a Message – Send a message to the agent
Prepare your Talkdesk instance for the Amazon Lex Web UI chat with an agent
This section outlines the basic steps required to configure the Talkdesk chat with agent experience using the Talkdesk Digital Connect channel. Review the Talkdesk API documentation for further details on any additional tasks that may be required as part of your specific implementation.
Complete the following steps:
- Enable Talkdesk Digital Connect on your Talkdesk instance.
- Configure your agents’ accounts and assign them to the agents’ queues.
- Build a Talkdesk Studio flow.
This will be used to send chat users to an inbox for agents to assign. A sample is provided with this solution.
- To create an integration for your Amazon Lex Web UI instance, in the Talkdesk Builder navigation pane, select Integrations.
- On the Actions tab, configure three actions using the input and output schemas provided with this solution.
- Create a Talkdesk Digital Connect Touchpoint.
- Name the Touchpoint Lex Web UI Chat and record the Touchpoint ID.
This will be stored in Secrets Manager as dev/talkdesk/touchpoint/ids.
- In Talkdesk Builder, choose OAuth Clients in the navigation pane to set up OAuth credentials.
- Select Grant type for Client credentials and set Scope to digital-connect:write.
- Record the client ID and secret key from the Keys tab.
These will be stored in Secrets Manager as dev/talkdesk/client/keys and used to authenticate and communicate with the Talkdesk API.
- In your AWS account, store the two secrets in Secrets Manager.
The following screenshot shows the details of the Touchpoint ID as a Secrets Manager secret.
The following screenshot shows the details of the client ID as a Secrets Manager secret.
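At runtime, the Lambda functions read these secrets back with the Secrets Manager GetSecretValue API. The following is a minimal sketch; it assumes the secrets were stored as JSON key-value pairs, and the touchpoint_id key name is an illustrative choice.

```python
import json

def parse_secret(secret_string, key):
    """Pull one value out of a JSON-formatted Secrets Manager secret."""
    return json.loads(secret_string)[key]

def get_talkdesk_credentials():
    # boto3 ships with the Lambda Python runtime; imported here so the
    # pure helper above stays usable without it.
    import boto3
    client = boto3.client("secretsmanager")
    touchpoint = client.get_secret_value(SecretId="dev/talkdesk/touchpoint/ids")
    oauth = client.get_secret_value(SecretId="dev/talkdesk/client/keys")
    touchpoint_id = parse_secret(touchpoint["SecretString"], "touchpoint_id")
    oauth_keys = json.loads(oauth["SecretString"])  # client ID and secret key
    return touchpoint_id, oauth_keys
```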
Deploy the Talkdesk Amazon Lex CloudFormation template
The following AWS CloudFormation template creates all the resources of the solution architecture. This includes all necessary IAM roles to invoke API operations, run associated Lambda functions, access secrets on Secrets Manager, and store and retrieve conversation ID and WebSocket client ID pairs from DynamoDB.
To facilitate monitoring and debugging, a CloudWatch log group is created for each of the resources.
The CloudFormation template provides additional details for each of the resources.
Complete the following steps to deploy the template:
- Sign in to the AWS Management Console.
- Choose Launch Stack for your AWS Region to begin the CloudFormation stack creation process.
The template is available in the following Regions: US East (N. Virginia), US West (Oregon), Asia Pacific (Singapore), Asia Pacific (Sydney), Asia Pacific (Tokyo), Europe (Frankfurt), Europe (Ireland), and Europe (London).
- For Stack name, enter a name.
- For TDAUTHHOST, enter the URL of your Talkdesk instance.
- Leave the other parameters as default and choose Next.
- Select the acknowledgement check boxes and choose Create stack.
- After the CloudFormation stack creation is complete, record the values for the following keys on the Outputs tab to use in later steps:
  - APIGatewayApiKey
  - BotAliasId
  - BotId
  - CallbackRestAPI
  - WebSocketAPIEndpoint
Update the Talkdesk instance
Log in to your Talkdesk instance and complete the following steps to update your instance:
- In Talkdesk Builder, select Integrations in the navigation pane.
- On the Settings tab, locate Base path and enter the callback Rest API URL you recorded earlier.
- Under Other settings, set x-api-key to the value of the API Gateway key.
Deploy the Amazon Lex Web UI
The solution outlined in this post uses the Amazon Lex Web UI, a full-featured web client to deploy your Amazon Lex chatbot on your website. With the Amazon Lex Web UI, you can quickly bring your chatbot-powered application to life while minimizing time-to-value.
- Choose Launch Stack for the Region in which you will use your chatbot:
The Amazon Lex Web UI stack is available in the following Regions: US East (N. Virginia), US West (Oregon), Asia Pacific (Singapore), Asia Pacific (Sydney), Asia Pacific (Tokyo), Europe (Frankfurt), Europe (Ireland), and Europe (London).
- For LexV2BotId, enter the value for BotId.
- For LexV2BotAliasId, enter the value for BotAliasId.
- Launch the stack.
- When deployment is complete, locate the Amazon Simple Storage Service (Amazon S3) URL for WebAppBucket.
- Navigate to the S3 bucket on the Amazon S3 console and download the lex-web-ui-loader-config.json file.
- Open the file and modify or add the following parameters:
  - In the connect configuration section, add the new parameter talkDeskWebsocketEndpoint and set its value to the WebSocket endpoint.
  - In the UI configuration section, set enableLiveChat to true.
- Upload the modified lex-web-ui-loader-config.json file and overwrite the previous version of the file in the S3 bucket.
- Return to the CloudFormation stack Outputs tab and find the WebAppDomainName link.
This will redirect you to a full-page version of the Amazon Lex Web UI. From here, you can test the Talkdesk integration and confirm that the bot is able to connect to Talkdesk using the WebSocket connection.
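For reference, the relevant fragment of the modified lex-web-ui-loader-config.json might look like the following. The placeholder URL stands in for the WebSocketAPIEndpoint value from your CloudFormation outputs, and the surrounding keys of the existing file are omitted; keep the keys already present in your downloaded copy.

```json
{
  "connect": {
    "talkDeskWebsocketEndpoint": "wss://<your-websocket-api-id>.execute-api.<region>.amazonaws.com/dev"
  },
  "ui": {
    "enableLiveChat": true
  }
}
```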
Test the solution
Now you’re ready to try the Amazon Lex and Talkdesk chat interaction:
- Start your Banking Bot chat window using the WebAppUrl provided as output in the CloudFormation stack.
- Log in to your Talkdesk Digital Connect channel and navigate to Conversations.
- In the Banking Bot chat window, request to talk to an agent.
- Watch the customer’s message being delivered to the Talkdesk Conversations Inbox.
- The Talkdesk agent self-assigns the conversation and starts engaging with the customer.
The following video demonstrates the chat experience.
Clean up
To clean up your resources, complete the following steps:
- On the AWS CloudFormation console, select Stacks in the navigation pane.
- Select the LexTalkdesk stack (or the stack name you provided), and select Delete.
- Delete the stack resources by selecting Delete stack.
Conclusion
Amazon Lex brings the power of conversational self-service to your customer preferred channels, such as phone, web chat, and messaging applications. In this post, we demonstrated a solution that provides live agent assistance on your website with Amazon Lex, Amazon Lex Web UI, and Talkdesk cloud contact center. We provided a CloudFormation stack that includes DynamoDB and Lambda resources, and a Rest API and WebSocket API in API Gateway to maintain a communication session with agents in the Talkdesk contact center.
This solution is meant to be a reference architecture or a quick implementation guide that can be tailored to suit your organization’s requirements. If you need help setting up this solution, AWS Professional Services and Talkdesk are available to help you and your team through the process of selecting the right technologies for your cloud contact center.
About the authors
Grazia Russo Lassner is a Senior Consultant with the AWS Professional Services Natural Language AI team. She specialises in designing and developing conversational AI solutions using AWS technologies for customers in various industries. Outside of work, she enjoys beach weekends, reading the latest fiction books, and family time.
Austin Johnson is a Solutions Architect, helping to maintain the Lex Web UI open source library.
Chris Brown is a Principal Natural Language AI consultant at AWS focused on digital customer experiences – including mobile apps, websites, marketing campaigns, and most recently conversational AI applications. Chris is an award-winning strategist and product manager – working with the Fortune 100 to deliver the best experiences for their customers. In his free time, Chris enjoys traveling, music, art, and experiencing new cultures.
Bruno Mateus is a Principal Engineer at Talkdesk. With over 20 years of experience in the software industry, he specialises in large-scale distributed systems. When not working, he enjoys spending time outside with his family, trekking, mountain bike riding, and motorcycle riding.
Jonathan Diedrich is a Principal Solutions Consultant at Talkdesk. He works on enterprise and strategic projects to ensure technical execution and adoption. Outside of work, he enjoys ice hockey and games with the family.
Crispim Tribuna is a Senior Software Engineer at Talkdesk currently focusing on the AI-based virtual agent project. He has over 17 years of experience in computer science, with a focus on telecommunications, IPTV, and fraud prevention. In his free time, he enjoys spending time with his family, running (he has completed three marathons), and riding motorcycles.
Advanced RAG patterns on Amazon SageMaker
Today, customers of all industries—whether it’s financial services, healthcare and life sciences, travel and hospitality, media and entertainment, telecommunications, software as a service (SaaS), or even proprietary model providers—are using large language models (LLMs) to build applications like question answering (QnA) chatbots, search engines, and knowledge bases. These generative AI applications are not only used to automate existing business processes, but also have the ability to transform the experience for customers using these applications. With the advancements being made with LLMs like Mixtral-8x7B Instruct, a derivative of the mixture of experts (MoE) architecture, customers are continuously looking for ways to improve the performance and accuracy of generative AI applications while allowing them to effectively use a wider range of closed and open source models.
A number of techniques are typically used to improve the accuracy and performance of an LLM’s output, such as fine-tuning with parameter efficient fine-tuning (PEFT), reinforcement learning from human feedback (RLHF), and performing knowledge distillation. However, when building generative AI applications, you can use an alternative solution that allows for the dynamic incorporation of external knowledge and allows you to control the information used for generation without the need to fine-tune your existing foundational model. This is where Retrieval Augmented Generation (RAG) comes in, specifically for generative AI applications as opposed to the more expensive and robust fine-tuning alternatives we’ve discussed. If you’re implementing complex RAG applications into your daily tasks, you may encounter common challenges with your RAG systems such as inaccurate retrieval, increasing size and complexity of documents, and overflow of context, which can significantly impact the quality and reliability of generated answers.
This post discusses RAG patterns to improve response accuracy using LangChain and tools such as the parent document retriever in addition to techniques like contextual compression in order to enable developers to improve existing generative AI applications.
Solution overview
In this post, we demonstrate the use of Mixtral-8x7B Instruct text generation combined with the BGE Large En embedding model to efficiently construct a RAG QnA system on an Amazon SageMaker notebook using the parent document retriever tool and contextual compression technique. The following diagram illustrates the architecture of this solution.
You can deploy this solution with just a few clicks using Amazon SageMaker JumpStart, a fully managed platform that offers state-of-the-art foundation models for various use cases such as content writing, code generation, question answering, copywriting, summarization, classification, and information retrieval. It provides a collection of pre-trained models that you can deploy quickly and with ease, accelerating the development and deployment of machine learning (ML) applications. One of the key components of SageMaker JumpStart is the Model Hub, which offers a vast catalog of pre-trained models, such as the Mixtral-8x7B, for a variety of tasks.
Mixtral-8x7B uses an MoE architecture. This architecture allows different parts of a neural network to specialize in different tasks, effectively dividing the workload among multiple experts. This approach enables the efficient training and deployment of larger models compared to traditional architectures.
One of the main advantages of the MoE architecture is its scalability. By distributing the workload across multiple experts, MoE models can be trained on larger datasets and achieve better performance than traditional models of the same size. Additionally, MoE models can be more efficient during inference because only a subset of experts needs to be activated for a given input.
For more information on Mixtral-8x7B Instruct on AWS, refer to Mixtral-8x7B is now available in Amazon SageMaker JumpStart. The Mixtral-8x7B model is made available under the permissive Apache 2.0 license, for use without restrictions.
In this post, we discuss how you can use LangChain to create effective and more efficient RAG applications. LangChain is an open source Python library designed to build applications with LLMs. It provides a modular and flexible framework for combining LLMs with other components, such as knowledge bases, retrieval systems, and other AI tools, to create powerful and customizable applications.
We walk through constructing a RAG pipeline on SageMaker with Mixtral-8x7B. We use the Mixtral-8x7B Instruct text generation model with the BGE Large En embedding model to create an efficient QnA system using RAG on a SageMaker notebook. We use an ml.t3.medium instance to demonstrate deploying LLMs via SageMaker JumpStart, which can be accessed through a SageMaker-generated API endpoint. This setup allows for the exploration, experimentation, and optimization of advanced RAG techniques with LangChain. We also illustrate the integration of the FAISS Embedding store into the RAG workflow, highlighting its role in storing and retrieving embeddings to enhance the system’s performance.
We perform a brief walkthrough of the SageMaker notebook. For more detailed and step-by-step instructions, refer to the Advanced RAG Patterns with Mixtral on SageMaker Jumpstart GitHub repo.
The need for advanced RAG patterns
Advanced RAG patterns are essential to improve upon the current capabilities of LLMs in processing, understanding, and generating human-like text. As the size and complexity of documents increase, representing multiple facets of the document in a single embedding can lead to a loss of specificity. Although it’s essential to capture the general essence of a document, it’s equally crucial to recognize and represent the varied sub-contexts within. This is a challenge you are often faced with when working with larger documents. Another challenge with RAG is that with retrieval, you aren’t aware of the specific queries that your document storage system will deal with upon ingestion. This could lead to information most relevant to a query being buried under text (context overflow). To mitigate failure and improve upon the existing RAG architecture, you can use advanced RAG patterns (parent document retriever and contextual compression) to reduce retrieval errors, enhance answer quality, and enable complex question handling.
With the techniques discussed in this post, you can address key challenges associated with external knowledge retrieval and integration, enabling your application to deliver more precise and contextually aware responses.
In the following sections, we explore how parent document retrievers and contextual compression can help you deal with some of the problems we’ve discussed.
Parent document retriever
In the previous section, we highlighted challenges that RAG applications encounter when dealing with extensive documents. To address these challenges, parent document retrievers categorize and designate incoming documents as parent documents. These documents are recognized for their comprehensive nature but aren’t directly utilized in their original form for embeddings. Rather than compressing an entire document into a single embedding, parent document retrievers dissect these parent documents into child documents. Each child document captures distinct aspects or topics from the broader parent document. Following the identification of these child segments, individual embeddings are assigned to each, capturing their specific thematic essence (see the following diagram). During retrieval, the parent document is invoked. This technique provides targeted yet broad-ranging search capabilities, furnishing the LLM with a wider perspective. Parent document retrievers provide LLMs with a twofold advantage: the specificity of child document embeddings for precise and relevant information retrieval, coupled with the invocation of parent documents for response generation, which enriches the LLM’s outputs with a layered and thorough context.
Contextual compression
To address the issue of context overflow discussed earlier, you can use contextual compression to compress and filter the retrieved documents in alignment with the query’s context, so only pertinent information is kept and processed. This is achieved through a combination of a base retriever for initial document fetching and a document compressor for refining these documents by paring down their content or excluding them entirely based on relevance, as illustrated in the following diagram. This streamlined approach, facilitated by the contextual compression retriever, greatly enhances RAG application efficiency by providing a method to extract and utilize only what’s essential from a mass of information. It tackles the issue of information overload and irrelevant data processing head-on, leading to improved response quality, more cost-effective LLM operations, and a smoother overall retrieval process. Essentially, it’s a filter that tailors the information to the query at hand, making it a much-needed tool for developers aiming to optimize their RAG applications for better performance and user satisfaction.
Prerequisites
If you’re new to SageMaker, refer to the Amazon SageMaker Development Guide.
Before you get started with the solution, create an AWS account. When you create an AWS account, you get a single sign-on (SSO) identity that has complete access to all the AWS services and resources in the account. This identity is called the AWS account root user.
Signing in to the AWS Management Console using the email address and password that you used to create the account gives you complete access to all the AWS resources in your account. We strongly recommend that you do not use the root user for everyday tasks, even the administrative ones.
Instead, adhere to the security best practices in AWS Identity and Access Management (IAM), and create an administrative user and group. Then securely lock away the root user credentials and use them to perform only a few account and service management tasks.
The Mixtral-8x7b model requires an ml.g5.48xlarge instance. SageMaker JumpStart provides a simplified way to access and deploy over 100 different open source and third-party foundation models. In order to launch an endpoint to host Mixtral-8x7B from SageMaker JumpStart, you may need to request a service quota increase to access an ml.g5.48xlarge instance for endpoint usage. You can request service quota increases through the console, AWS Command Line Interface (AWS CLI), or API to allow access to those additional resources.
Set up a SageMaker notebook instance and install dependencies
To get started, create a SageMaker notebook instance and install the required dependencies. Refer to the GitHub repo to ensure a successful setup. After you set up the notebook instance, you can deploy the model.
You can also run the notebook locally in your preferred integrated development environment (IDE). Make sure that you have JupyterLab installed.
Deploy the model
Deploy the Mixtral-8X7B Instruct LLM model on SageMaker JumpStart:
Deploy the BGE Large En embedding model on SageMaker JumpStart:
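The two deployment steps above can be sketched with the SageMaker Python SDK. The model IDs and the embedding model’s instance type below are assumptions to verify against the current JumpStart catalog; Mixtral-8x7B requires the ml.g5.48xlarge instance noted in the prerequisites.

```python
# JumpStart model identifiers at the time of writing; verify them in the
# SageMaker JumpStart model hub before deploying.
MIXTRAL_MODEL_ID = "huggingface-llm-mixtral-8x7b-instruct"
BGE_MODEL_ID = "huggingface-sentencesimilarity-bge-large-en"

def deploy_jumpstart_model(model_id, instance_type):
    """Deploy a JumpStart model and return a predictor for its endpoint."""
    # Imported inside the function so the sketch can be read (and the
    # constants reused) without the SageMaker SDK installed.
    from sagemaker.jumpstart.model import JumpStartModel
    model = JumpStartModel(model_id=model_id)
    return model.deploy(instance_type=instance_type)

# Example usage from a SageMaker notebook with sufficient service quota:
# llm_predictor = deploy_jumpstart_model(MIXTRAL_MODEL_ID, "ml.g5.48xlarge")
# embedding_predictor = deploy_jumpstart_model(BGE_MODEL_ID, "ml.g5.2xlarge")
```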
Set up LangChain
After importing all the necessary libraries and deploying the Mixtral-8x7B model and BGE Large En embeddings model, you can now set up LangChain. For step-by-step instructions, refer to the GitHub repo.
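The core of that setup is a content handler that serializes prompts for the SageMaker endpoint and parses its responses. The sketch below is a plain-Python version; in the notebook it would subclass LLMContentHandler and be passed to LangChain’s SagemakerEndpoint wrapper, and the one-element generated_text response shape is an assumption to verify against your endpoint.

```python
import json

class MixtralContentHandler:
    """Serialize prompts and parse responses for the Mixtral endpoint.

    In the notebook this subclasses LLMContentHandler from
    langchain_community.llms.sagemaker_endpoint and is passed to
    SagemakerEndpoint(endpoint_name=..., content_handler=...).
    """
    content_type = "application/json"
    accepts = "application/json"

    def transform_input(self, prompt, model_kwargs):
        # Typical Hugging Face text-generation request body.
        return json.dumps({"inputs": prompt, "parameters": model_kwargs}).encode("utf-8")

    def transform_output(self, output_bytes):
        # Hugging Face LLM containers usually return a one-element list of
        # {"generated_text": ...}; adjust if your endpoint differs.
        response = json.loads(output_bytes.decode("utf-8"))
        return response[0]["generated_text"]
```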
Data preparation
In this post, we use several years of Amazon’s Letters to Shareholders as a text corpus to perform QnA on. For more detailed steps to prepare the data, refer to the GitHub repo.
Question answering
Once the data is prepared, you can use the wrapper provided by LangChain, which wraps around the vector store and takes input for the LLM. This wrapper performs the following steps:
- Take the input question.
- Create a question embedding.
- Fetch relevant documents.
- Incorporate the documents and the question into a prompt.
- Invoke the model with the prompt and generate the answer in a readable manner.
Now that the vector store is in place, you can start asking questions:
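Before looking at the concrete chains, the wrapper’s five steps can be sketched as a plain function. The embed, retrieve, and generate callables are stand-ins for the BGE embedding model, the FAISS vector store, and the Mixtral endpoint, and the prompt wording is illustrative:

```python
def answer_question(question, embed, retrieve, generate, k=3):
    """Minimal RAG loop mirroring the five wrapper steps.

    embed, retrieve, and generate are injected callables; in the notebook
    they map to the BGE embedding model, the FAISS vector store, and the
    Mixtral endpoint respectively.
    """
    query_vector = embed(question)            # step 2: embed the question
    documents = retrieve(query_vector, k)     # step 3: fetch relevant docs
    context = "\n\n".join(documents)          # step 4: build the prompt
    prompt = (
        "Use the following context to answer the question.\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )
    return generate(prompt)                   # step 5: invoke the model
```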
Regular retriever chain
In the preceding scenario, we explored the quick and straightforward way to get a context-aware answer to your question. Now let’s look at a more customizable option with the help of RetrievalQA, where you can customize how the documents fetched should be added to the prompt using the chain_type parameter. Also, in order to control how many relevant documents should be retrieved, you can change the k parameter in the following code to see different outputs. In many scenarios, you might want to know which source documents the LLM used to generate the answer. You can get those documents in the output using return_source_documents, which returns the documents that are added to the context of the LLM prompt. RetrievalQA also allows you to provide a custom prompt template that can be specific to the model.
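A sketch of this pattern, assuming llm and vectorstore objects built as described earlier; the prompt template wording is illustrative:

```python
# Illustrative prompt wording; {context} and {question} are the variables
# RetrievalQA fills in for the "stuff" chain type.
PROMPT_TEMPLATE = """Use the following pieces of context to answer the question.
If you don't know the answer, say that you don't know instead of guessing.

{context}

Question: {question}
Answer:"""

def build_qa_chain(llm, vectorstore, k=3):
    """Build a RetrievalQA chain that also returns its source documents."""
    from langchain.chains import RetrievalQA
    from langchain.prompts import PromptTemplate
    prompt = PromptTemplate(
        template=PROMPT_TEMPLATE, input_variables=["context", "question"])
    return RetrievalQA.from_chain_type(
        llm=llm,
        chain_type="stuff",  # stuff the k fetched documents into one prompt
        retriever=vectorstore.as_retriever(search_kwargs={"k": k}),
        return_source_documents=True,  # expose the retrieved documents
        chain_type_kwargs={"prompt": prompt},
    )
```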
Let’s ask a question:
Parent document retriever chain
Let’s look at a more advanced RAG option with the help of ParentDocumentRetriever. When working with document retrieval, you may encounter a trade-off between storing small chunks of a document for accurate embeddings and larger documents to preserve more context. The parent document retriever strikes that balance by splitting and storing small chunks of data.
We use a parent_splitter to divide the original documents into larger chunks called parent documents and a child_splitter to create smaller child documents from the original documents:
The child documents are then indexed in a vector store using embeddings. This enables efficient retrieval of relevant child documents based on similarity. To retrieve relevant information, the parent document retriever first fetches the child documents from the vector store. It then looks up the parent IDs for those child documents and returns the corresponding larger parent documents.
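A sketch of that construction, assuming an embeddings object from the earlier setup; the chunk sizes and the placeholder text used to seed the FAISS index are illustrative choices:

```python
# Illustrative chunk sizes; children must be smaller than parents.
PARENT_CHUNK_SIZE = 2000
CHILD_CHUNK_SIZE = 400

def build_parent_document_retriever(embeddings, documents):
    """Index small child chunks while storing their larger parents."""
    from langchain.retrievers import ParentDocumentRetriever
    from langchain.storage import InMemoryStore
    from langchain.text_splitter import RecursiveCharacterTextSplitter
    from langchain_community.vectorstores import FAISS
    parent_splitter = RecursiveCharacterTextSplitter(chunk_size=PARENT_CHUNK_SIZE)
    child_splitter = RecursiveCharacterTextSplitter(chunk_size=CHILD_CHUNK_SIZE)
    # FAISS needs at least one text to initialize its index.
    vectorstore = FAISS.from_texts(["placeholder"], embeddings)
    retriever = ParentDocumentRetriever(
        vectorstore=vectorstore,        # child embeddings live here
        docstore=InMemoryStore(),       # full parent documents live here
        parent_splitter=parent_splitter,
        child_splitter=child_splitter,
    )
    retriever.add_documents(documents)  # split, embed children, store parents
    return retriever
```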
Let’s ask a question:
Contextual compression chain
Let’s look at another advanced RAG option called contextual compression. One challenge with retrieval is that usually you don’t know the specific queries your document storage system will face when you ingest data into the system. This means that the information most relevant to a query may be buried in a document with a lot of irrelevant text. Passing that full document through your application can lead to more expensive LLM calls and poorer responses.
The contextual compression retriever addresses the challenge of retrieving relevant information from a document storage system, where the pertinent data may be buried within documents containing a lot of text. By compressing and filtering the retrieved documents based on the given query context, only the most relevant information is returned.
To use the contextual compression retriever, you’ll need:
- A base retriever – This is the initial retriever that fetches documents from the storage system based on the query
- A document compressor – This component takes the initially retrieved documents and shortens them by reducing the contents of individual documents or dropping irrelevant documents altogether, using the query context to determine relevance
Adding contextual compression with an LLM chain extractor
First, wrap your base retriever with a ContextualCompressionRetriever. You’ll add an LLMChainExtractor, which will iterate over the initially returned documents and extract from each only the content that is relevant to the query.
Initialize the chain using the ContextualCompressionRetriever with an LLMChainExtractor and pass the prompt in via the chain_type_kwargs argument.
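As a sketch, assuming the llm, base retriever, and prompt built earlier:

```python
def build_extractor_chain(llm, base_retriever, prompt):
    """RetrievalQA over a retriever whose documents are trimmed by the LLM."""
    from langchain.chains import RetrievalQA
    from langchain.retrievers import ContextualCompressionRetriever
    from langchain.retrievers.document_compressors import LLMChainExtractor
    # The extractor keeps only the query-relevant passages of each document.
    compression_retriever = ContextualCompressionRetriever(
        base_compressor=LLMChainExtractor.from_llm(llm),
        base_retriever=base_retriever,
    )
    return RetrievalQA.from_chain_type(
        llm=llm,
        chain_type="stuff",
        retriever=compression_retriever,
        return_source_documents=True,
        chain_type_kwargs={"prompt": prompt},
    )
```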
Let’s ask a question:
Filter documents with an LLM chain filter
The LLMChainFilter is a slightly simpler but more robust compressor that uses an LLM chain to decide which of the initially retrieved documents to filter out and which ones to return, without manipulating the document contents:
Initialize the chain using the ContextualCompressionRetriever with an LLMChainFilter and pass the prompt in via the chain_type_kwargs argument.
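A sketch of the filter variant, reusing the llm and base retriever from before:

```python
def build_filter_retriever(llm, base_retriever):
    """Keep or drop whole documents by LLM judgment, without editing them."""
    from langchain.retrievers import ContextualCompressionRetriever
    from langchain.retrievers.document_compressors import LLMChainFilter
    # Unlike LLMChainExtractor, the filter returns documents verbatim or
    # not at all, which avoids content-extraction mistakes.
    return ContextualCompressionRetriever(
        base_compressor=LLMChainFilter.from_llm(llm),
        base_retriever=base_retriever,
    )
```

The resulting retriever can be passed to RetrievalQA.from_chain_type in the same way as in the extractor example.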
Let’s ask a question:
Compare results
The following table compares results from different queries based on technique.
| Technique | Query 1: How did AWS evolve? | Query 2: Why is Amazon successful? | Comparison |
| --- | --- | --- | --- |
Regular Retriever Chain Output | AWS (Amazon Web Services) evolved from an initially unprofitable investment to an $85B annual revenue run rate business with strong profitability, offering a wide range of services and features, and becoming a significant part of Amazon’s portfolio. Despite facing skepticism and short-term headwinds, AWS continued to innovate, attract new customers, and migrate active customers, offering benefits such as agility, innovation, cost-efficiency, and security. AWS also expanded its long-term investments, including chip development, to provide new capabilities and change what’s possible for its customers. | Amazon is successful due to its continuous innovation and expansion into new areas such as technology infrastructure services, digital reading devices, voice-driven personal assistants, and new business models like the third-party marketplace. Its ability to scale operations quickly, as seen in the rapid expansion of its fulfillment and transportation networks, also contributes to its success. Additionally, Amazon’s focus on optimization and efficiency gains in its processes has resulted in productivity improvements and cost reductions. The example of Amazon Business highlights the company’s capability to leverage its e-commerce and logistics strengths in different sectors. | Based on the responses from the regular retriever chain, we notice that although it provides long answers, it suffers from context overflow and fails to mention any significant details from the corpus in regards to responding to the query provided. The regular retrieval chain is not able to capture the nuances with depth or contextual insight, potentially missing critical aspects of the document. |
| Parent Document Retriever Output | AWS (Amazon Web Services) started with a feature-poor initial launch of the Elastic Compute Cloud (EC2) service in 2006, providing only one instance size, in one data center, in one region of the world, with Linux operating system instances only, and without many key features like monitoring, load balancing, auto-scaling, or persistent storage. However, AWS’s success allowed them to quickly iterate and add the missing capabilities, eventually expanding to offer various flavors, sizes, and optimizations of compute, storage, and networking, as well as developing their own chips (Graviton) to push price and performance further. AWS’s iterative innovation process required significant investments in financial and people resources over 20 years, often well in advance of when it would pay out, to meet customer needs and improve long-term customer experiences, loyalty, and returns for shareholders. | Amazon is successful due to its ability to constantly innovate, adapt to changing market conditions, and meet customer needs in various market segments. This is evident in the success of Amazon Business, which has grown to drive roughly $35B in annualized gross sales by delivering selection, value, and convenience to business customers. Amazon’s investments in ecommerce and logistics capabilities have also enabled the creation of services like Buy with Prime, which helps merchants with direct-to-consumer websites drive conversion from views to purchases. | The parent document retriever delves deeper into the specifics of AWS’s growth strategy, including the iterative process of adding new features based on customer feedback and the detailed journey from a feature-poor initial launch to a dominant market position, while providing a context-rich response. Responses cover a wide range of aspects, from technical innovations and market strategy to organizational efficiency and customer focus, providing a holistic view of the factors contributing to success along with examples. This can be attributed to the parent document retriever’s targeted yet broad-ranging search capabilities. |
LLM Chain Extractor: Contextual Compression Output | AWS evolved by starting as a small project inside Amazon, requiring significant capital investment and facing skepticism from both inside and outside the company. However, AWS had a head start on potential competitors and believed in the value it could bring to customers and Amazon. AWS made a long-term commitment to continue investing, resulting in over 3,300 new features and services launched in 2022. AWS has transformed how customers manage their technology infrastructure and has become an $85B annual revenue run rate business with strong profitability. AWS has also continuously improved its offerings, such as enhancing EC2 with additional features and services after its initial launch. | Based on the provided context, Amazon’s success can be attributed to its strategic expansion from a book-selling platform to a global marketplace with a vibrant third-party seller ecosystem, early investment in AWS, innovation in introducing the Kindle and Alexa, and substantial growth in annual revenue from 2019 to 2022. This growth led to the expansion of the fulfillment center footprint, creation of a last-mile transportation network, and building a new sortation center network, which were optimized for productivity and cost reductions. | The LLM chain extractor maintains a balance between covering key points comprehensively and avoiding unnecessary depth. It dynamically adjusts to the query’s context, so the output is directly relevant and comprehensive. |
LLM Chain Filter: Contextual Compression Output | AWS (Amazon Web Services) evolved by initially launching feature-poor but iterating quickly based on customer feedback to add necessary capabilities. This approach allowed AWS to launch EC2 in 2006 with limited features and then continuously add new functionalities, such as additional instance sizes, data centers, regions, operating system options, monitoring tools, load balancing, auto-scaling, and persistent storage. Over time, AWS transformed from a feature-poor service to a multi-billion-dollar business by focusing on customer needs, agility, innovation, cost-efficiency, and security. AWS now has an $85B annual revenue run rate and offers over 3,300 new features and services each year, catering to a wide range of customers from start-ups to multinational companies and public sector organizations. | Amazon is successful due to its innovative business models, continuous technological advancements, and strategic organizational changes. The company has consistently disrupted traditional industries by introducing new ideas, such as an ecommerce platform for various products and services, a third-party marketplace, cloud infrastructure services (AWS), the Kindle e-reader, and the Alexa voice-driven personal assistant. Additionally, Amazon has made structural changes to improve its efficiency, such as reorganizing its US fulfillment network to decrease costs and delivery times, further contributing to its success. | Similar to the LLM chain extractor, the LLM chain filter makes sure that although the key points are covered, the output is efficient for customers looking for concise and contextual answers. |
Upon comparing these different techniques, we can see that in contexts like detailing AWS’s transition from a simple service to a complex, multi-billion-dollar entity, or explaining Amazon’s strategic successes, the regular retriever chain lacks the precision the more sophisticated techniques offer, leading to less targeted information. Although very few differences are visible between the advanced techniques discussed, they are far more informative than regular retriever chains.
For customers in industries such as healthcare, telecommunications, and financial services who are looking to implement RAG in their applications, the limitations of the regular retriever chain in providing precision, avoiding redundancy, and effectively compressing information make it less suited to fulfilling these needs compared to the more advanced parent document retriever and contextual compression techniques. These techniques are able to distill vast amounts of information into the concentrated, impactful insights that you need, while helping improve price-performance.
Clean up
When you’re done running the notebook, delete the resources you created to avoid accruing charges for the resources in use.
Conclusion
In this post, we presented a solution that allows you to implement the parent document retriever and contextual compression chain techniques to enhance the ability of LLMs to process and generate information. We tested out these advanced RAG techniques with the Mixtral-8x7B Instruct and BGE Large En models available with SageMaker JumpStart. We also explored using persistent storage for embeddings and document chunks and integration with enterprise data stores.
The techniques we performed not only refine the way LLM models access and incorporate external knowledge, but also significantly improve the quality, relevance, and efficiency of their outputs. By combining retrieval from large text corpora with language generation capabilities, these advanced RAG techniques enable LLMs to produce more factual, coherent, and context-appropriate responses, enhancing their performance across various natural language processing tasks.
SageMaker JumpStart is at the center of this solution. With SageMaker JumpStart, you gain access to an extensive assortment of open and closed source models, streamlining the process of getting started with ML and enabling rapid experimentation and deployment. To get started deploying this solution, navigate to the notebook in the GitHub repo.
About the Authors
Niithiyn Vijeaswaran is a Solutions Architect at AWS. His area of focus is generative AI and AWS AI Accelerators. He holds a Bachelor’s degree in Computer Science and Bioinformatics. Niithiyn works closely with the Generative AI GTM team to enable AWS customers on multiple fronts and accelerate their adoption of generative AI. He’s an avid fan of the Dallas Mavericks and enjoys collecting sneakers.
Sebastian Bustillo is a Solutions Architect at AWS. He focuses on AI/ML technologies with a profound passion for generative AI and compute accelerators. At AWS, he helps customers unlock business value through generative AI. When he’s not at work, he enjoys brewing a perfect cup of specialty coffee and exploring the world with his wife.
Armando Diaz is a Solutions Architect at AWS. He focuses on generative AI, AI/ML, and Data Analytics. At AWS, Armando helps customers integrating cutting-edge generative AI capabilities into their systems, fostering innovation and competitive advantage. When he’s not at work, he enjoys spending time with his wife and family, hiking, and traveling the world.
Dr. Farooq Sabir is a Senior Artificial Intelligence and Machine Learning Specialist Solutions Architect at AWS. He holds PhD and MS degrees in Electrical Engineering from the University of Texas at Austin and an MS in Computer Science from Georgia Institute of Technology. He has over 15 years of work experience and also likes to teach and mentor college students. At AWS, he helps customers formulate and solve their business problems in data science, machine learning, computer vision, artificial intelligence, numerical optimization, and related domains. Based in Dallas, Texas, he and his family love to travel and go on long road trips.
Marco Punio is a Solutions Architect focused on generative AI strategy, applied AI solutions and conducting research to help customers hyper-scale on AWS. Marco is a digital native cloud advisor with experience in the FinTech, Healthcare & Life Sciences, Software-as-a-service, and most recently, in Telecommunications industries. He is a qualified technologist with a passion for machine learning, artificial intelligence, and mergers & acquisitions. Marco is based in Seattle, WA and enjoys writing, reading, exercising, and building applications in his free time.
AJ Dhimine is a Solutions Architect at AWS. He specializes in generative AI, serverless computing and data analytics. He is an active member/mentor in Machine Learning Technical Field Community and has published several scientific papers on various AI/ML topics. He works with customers, ranging from start-ups to enterprises, to develop AWSome generative AI solutions. He is particularly passionate about leveraging Large Language Models for advanced data analytics and exploring practical applications that address real-world challenges. Outside of work, AJ enjoys traveling, and is currently at 53 countries with a goal of visiting every country in the world.
Efficient continual pre-training LLMs for financial domains
Large language models (LLMs) are generally trained on large publicly available datasets that are domain agnostic. For example, Meta’s Llama models are trained on datasets such as CommonCrawl, C4, Wikipedia, and ArXiv. These datasets encompass a broad range of topics and domains. Although the resulting models yield amazingly good results for general tasks, such as text generation and entity recognition, there is evidence that models trained with domain-specific datasets can further improve LLM performance. For example, the training data used for BloombergGPT is 51% domain-specific documents, including financial news, filings, and other financial materials. The resulting LLM outperforms LLMs trained on non-domain-specific datasets when tested on finance-specific tasks. The authors of BloombergGPT concluded that their model outperforms all other models tested for four of the five financial tasks. The model provided even better performance when tested for Bloomberg’s internal financial tasks by a wide margin—as much as 60 points better (out of 100). Although you can learn more about the comprehensive evaluation results in the paper, the following sample captured from the BloombergGPT paper can give you a glimpse of the benefit of training LLMs using financial domain-specific data. As shown in the example, the BloombergGPT model provided correct answers while other non-domain-specific models struggled:
This post provides a guide to training LLMs specifically for the financial domain. We cover the following key areas:
- Data collection and preparation – Guidance on sourcing and curating relevant financial data for effective model training
- Continual pre-training vs. fine-tuning – When to use each technique to optimize your LLM’s performance
- Efficient continual pre-training – Strategies to streamline the continual pre-training process, saving time and resources
This post brings together the expertise of the applied science research team within Amazon Finance Technology and the AWS Worldwide Specialist team for the Global Financial Industry. Some of the content is based on the paper Efficient Continual Pre-training for Building Domain Specific Large Language Models.
Collecting and preparing finance data
Domain continual pre-training necessitates a large-scale, high-quality, domain-specific dataset. The following are the main steps for domain dataset curation:
- Identify data sources – Potential data sources for domain corpus include open web, Wikipedia, books, social media, and internal documents.
- Domain data filters – Because the ultimate goal is to curate a domain corpus, you might need to apply additional steps to filter out samples that are irrelevant to the target domain. This keeps irrelevant text out of the corpus used for continual pre-training and reduces training cost.
- Preprocessing – You might consider a series of preprocessing steps to improve data quality and training efficiency. For example, certain data sources can contain a fair number of noisy tokens; deduplication is considered a useful step to improve data quality and reduce training cost.
To develop financial LLMs, you can use two important data sources: News CommonCrawl and SEC filings. An SEC filing is a financial statement or other formal document submitted to the US Securities and Exchange Commission (SEC). Publicly listed companies are required to file various documents regularly. This creates a large number of documents over the years. News CommonCrawl is a dataset released by CommonCrawl in 2016. It contains news articles from news sites all over the world.
News CommonCrawl is available on Amazon Simple Storage Service (Amazon S3) in the commoncrawl
bucket at crawl-data/CC-NEWS/
. You can get the listings of files using the AWS Command Line Interface (AWS CLI) and the following command:
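The exact command isn’t shown here, but assuming the standard AWS CLI, listing the files would look something like the following (the bucket is public, so the `--no-sign-request` flag allows anonymous access without AWS credentials):

```shell
# Path to the News CommonCrawl archives in the public commoncrawl bucket
CC_NEWS_PATH="s3://commoncrawl/crawl-data/CC-NEWS/"

# List the crawl files; --no-sign-request skips credential signing for
# this public dataset
aws s3 ls --no-sign-request "$CC_NEWS_PATH"
```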
In Efficient Continual Pre-training for Building Domain Specific Large Language Models, the authors use a URL and keyword-based approach to filter financial news articles from generic news. Specifically, the authors maintain a list of important financial news outlets and a set of keywords related to financial news. They identify an article as financial news if either it comes from a financial news outlet or any of the keywords show up in the URL. This simple yet effective approach enables you to identify financial news from not only financial news outlets but also finance sections of generic news outlets.
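A minimal sketch of this URL and keyword-based filter follows; the outlet list and keyword set are illustrative placeholders, not the authors’ actual curated lists:

```python
# Hypothetical outlet and keyword lists; the paper's real lists are larger
FINANCIAL_OUTLETS = {"reuters.com/finance", "ft.com", "bloomberg.com"}
FINANCIAL_KEYWORDS = {"stock", "market", "earnings", "economy", "bank"}

def is_financial_news(url: str) -> bool:
    url = url.lower()
    # Rule 1: the article comes from a known financial news outlet
    if any(outlet in url for outlet in FINANCIAL_OUTLETS):
        return True
    # Rule 2: any financial keyword appears in the URL (catches finance
    # sections of generic news outlets)
    return any(keyword in url for keyword in FINANCIAL_KEYWORDS)

print(is_financial_news("https://www.ft.com/content/some-article"))         # True
print(is_financial_news("https://example.com/news/stock-rally-continues"))  # True
print(is_financial_news("https://example.com/sports/match-report"))         # False
```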
SEC filings are available online through the SEC’s EDGAR (Electronic Data Gathering, Analysis, and Retrieval) database, which provides open data access. You can scrape the filings from EDGAR directly, or use APIs in Amazon SageMaker with a few lines of code, for any period of time and for a large number of tickers (the SEC-assigned identifier). To learn more, refer to SEC Filing Retrieval.
The following table summarizes the key details of both data sources.
|  | News CommonCrawl | SEC Filing |
| --- | --- | --- |
| Coverage | 2016-2022 | 1993-2022 |
| Size | 25.8 billion words | 5.1 billion words |
The authors go through a few extra preprocessing steps before the data is fed into a training algorithm. First, because SEC filings contain noisy text due to the removal of tables and figures, the authors remove short sentences that are deemed to be table or figure labels. Second, they apply a locality sensitive hashing algorithm to deduplicate the news articles and filings. For SEC filings, they deduplicate at the section level instead of the document level. Lastly, they concatenate documents into a long string, tokenize it, and chunk the tokenization into pieces of the maximum input length supported by the model to be trained. This improves the throughput of continual pre-training and reduces the training cost.
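The final packing step can be sketched as follows; a whitespace tokenizer stands in for the model’s real tokenizer, and the separator token and maximum length are illustrative:

```python
def pack_documents(documents, max_length=2048, sep_token="<|endoftext|>"):
    # Concatenate documents into one long token stream, separated by sep_token
    tokens = []
    for doc in documents:
        tokens.extend(doc.split())  # toy tokenizer: split on whitespace
        tokens.append(sep_token)
    # Chunk the stream into pieces of at most max_length tokens
    return [tokens[i:i + max_length] for i in range(0, len(tokens), max_length)]

docs = ["quarterly earnings beat estimates",
        "the filing discloses new risk factors"]
chunks = pack_documents(docs, max_length=5)
print(len(chunks))  # 12 tokens packed into 3 chunks of at most 5
```

Packing many short documents into fixed-length chunks keeps every training batch full, which is what improves throughput and reduces cost.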
Continual pre-training vs. fine-tuning
Most available LLMs are general purpose and lack domain-specific abilities. Domain-specific LLMs have shown considerably better performance in medical, finance, and scientific domains. For an LLM to acquire domain-specific knowledge, there are four methods: training from scratch, continual pre-training, instruction fine-tuning on domain tasks, and Retrieval Augmented Generation (RAG).
In traditional models, fine-tuning is usually used to create task-specific models for a domain. This means maintaining multiple models for tasks like entity extraction, intent classification, sentiment analysis, or question answering. With the advent of LLMs, techniques like in-context learning and prompting have made maintaining separate models unnecessary, saving the effort required to maintain a stack of models for related but distinct tasks.
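This shift can be sketched as a set of prompt templates over a single base model, where each template replaces what used to be a separately fine-tuned model; the template texts are illustrative:

```python
# One prompt template per task replaces a stack of task-specific models;
# the wording of each template is a hypothetical example
TASK_PROMPTS = {
    "sentiment": "Classify the sentiment of this financial headline as "
                 "positive, negative, or neutral:\n{text}",
    "entities": "List the company names mentioned in this text:\n{text}",
}

def build_prompt(task: str, text: str) -> str:
    # The same base LLM serves every task; only the prompt changes
    return TASK_PROMPTS[task].format(text=text)

print(build_prompt("sentiment", "Shares surge after earnings beat"))
```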
Intuitively, you can train LLMs from scratch with domain-specific data. Although most of the work to create domain LLMs has focused on training from scratch, doing so is prohibitively expensive; for example, the GPT-4 model cost over $100 million to train. Such models are trained on a mix of open domain data and domain data. Continual pre-training can help models acquire domain-specific knowledge without incurring the cost of pre-training from scratch, because you pre-train an existing open domain LLM on only the domain data.
Instruction fine-tuning on a task can’t make the model acquire broad domain knowledge, because the LLM only learns the domain information contained in the instruction fine-tuning dataset. Unless that dataset is very large, it is not enough to acquire domain knowledge. Sourcing high-quality instruction datasets is usually challenging, and is the reason to use LLMs in the first place. Also, instruction fine-tuning on one task can affect performance on other tasks (as seen in this paper). However, instruction fine-tuning is more cost-effective than either of the pre-training alternatives.
The following figure compares traditional task-specific fine-tuning vs. the in-context learning paradigm with LLMs.
RAG is the most effective way of guiding an LLM to generate responses grounded in a domain. Although it can guide a model to generate responses by providing facts from the domain as auxiliary information, it doesn’t acquire the domain-specific language because the LLM is still relying on non-domain language style to generate the responses.
Continual pre-training is a middle ground between pre-training and instruction fine-tuning in terms of cost, while being a strong alternative for gaining domain-specific knowledge and style. It can provide a general model over which further instruction fine-tuning on limited instruction data can be performed. Continual pre-training can be a cost-effective strategy for specialized domains where the set of downstream tasks is large or unknown and labeled instruction tuning data is limited. In other scenarios, instruction fine-tuning or RAG might be more suitable.
To learn more about fine-tuning, RAG, and model training, refer to Fine-tune a foundation model, Retrieval Augmented Generation (RAG), and Train a Model with Amazon SageMaker, respectively. For this post, we focus on efficient continual pre-training.
Methodology of efficient continual pre-training
Continual pre-training consists of the following methodology:
- Domain-Adaptive Continual Pre-training (DACP) – In the paper Efficient Continual Pre-training for Building Domain Specific Large Language Models, the authors continually pre-train the Pythia language model suite on the financial corpus to adapt it to the finance domain. The objective is to create financial LLMs by feeding data from the whole financial domain into an open-sourced model. Because the training corpus contains all the curated datasets in the domain, the resultant model should acquire finance-specific knowledge, thereby becoming a versatile model for various financial tasks. This results in FinPythia models.
- Task-Adaptive Continual Pre-training (TACP) – The authors pre-train the models further on labeled and unlabeled task data to tailor them for specific tasks. In certain circumstances, developers may prefer models delivering better performance on a group of in-domain tasks rather than a domain-generic model. TACP is designed as continual pre-training aiming to enhance performance on targeted tasks, without requirements for labeled data. Specifically, the authors continually pre-train the open-source models on the task tokens (without labels). The primary limitation of TACP lies in constructing task-specific LLMs instead of foundation LLMs, owing to the sole use of unlabeled task data for training. Although DACP uses a much larger corpus, it is prohibitively expensive. To balance these limitations, the authors propose two approaches that aim to build domain-specific foundation LLMs while preserving superior performance on target tasks:
- Efficient Task-Similar DACP (ETS-DACP) – The authors propose selecting a subset of the financial corpus that is highly similar to the task data using embedding similarity. This subset is used for continual pre-training to make it more efficient. Specifically, the authors continually pre-train the open-source LLM on a small corpus extracted from the financial corpus that is close to the target tasks in distribution. This can help improve task performance because the model is adapted to the distribution of the task tokens, despite labeled data not being required.
- Efficient Task-Agnostic DACP (ETA-DACP) – The authors propose using metrics like perplexity and token type entropy that don’t require task data to select samples from financial corpus for efficient continual pre-training. This approach is designed to deal with scenarios where task data is unavailable or more versatile domain models for the broader domain are preferred. The authors adopt two dimensions to select data samples that are important for obtaining domain information from a subset of pre-training domain data: novelty and diversity. Novelty, measured by the perplexity recorded by the target model, refers to the information that was unseen by the LLM before. Data with high novelty indicates novel knowledge for the LLM, and such data is viewed as more difficult to learn. This updates generic LLMs with intensive domain knowledge during continual pre-training. Diversity, on the other hand, captures the diversity of distributions of token types in the domain corpus, which has been documented as a useful feature in the research of curriculum learning on language modeling.
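The novelty metric above can be illustrated with a toy example in which a unigram model trained on generic text stands in for the target LLM; unseen domain text yields higher perplexity and so counts as more "novel". The corpus and sample texts are invented for illustration:

```python
import math
from collections import Counter

def unigram_perplexity(text, counts, total, vocab_size):
    # Perplexity under a unigram model with add-one smoothing
    log_prob = 0.0
    tokens = text.split()
    for tok in tokens:
        p = (counts.get(tok, 0) + 1) / (total + vocab_size)
        log_prob += math.log(p)
    return math.exp(-log_prob / len(tokens))

# "Generic" training corpus standing in for the LLM's pre-training data
generic_corpus = "the cat sat on the mat the dog ran in the park".split()
counts, total = Counter(generic_corpus), len(generic_corpus)
vocab = len(counts)

ppl_generic = unigram_perplexity("the dog sat on the mat", counts, total, vocab)
ppl_domain = unigram_perplexity("collateralized debt obligations repriced",
                                counts, total, vocab)
print(ppl_domain > ppl_generic)  # True: unseen finance text scores as more novel
```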
The following figure compares an example of ETS-DACP (left) vs. ETA-DACP (right).
The authors adopt two sampling schemes to actively select data points from the curated financial corpus: hard sampling and soft sampling. The former is done by first ranking the financial corpus by the corresponding metrics and then selecting the top-k samples, where k is predetermined according to the training budget. For the latter, the authors assign sampling weights to each data point according to the metric values, and then randomly sample k data points to meet the training budget.
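The two sampling schemes can be sketched as follows, where `scores` stands for a per-sample metric such as perplexity (novelty) or token-type entropy (diversity); the score values are illustrative:

```python
import numpy as np

rng = np.random.default_rng(seed=0)

def hard_sample(scores, k):
    # Rank by the metric and keep the top-k samples
    return np.argsort(scores)[::-1][:k]

def soft_sample(scores, k):
    # Turn metric values into sampling weights, then draw k samples
    # without replacement
    weights = scores / scores.sum()
    return rng.choice(len(scores), size=k, replace=False, p=weights)

scores = np.array([0.9, 0.1, 0.5, 0.7, 0.3])
print(hard_sample(scores, 2))  # indices of the two highest-scoring samples: [0 3]
print(soft_sample(scores, 2))  # random draw, biased toward high scores
```

Hard sampling is deterministic given the scores, while soft sampling keeps some lower-scoring samples in the mix, which can preserve diversity under the same training budget.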
Result and analysis
The authors evaluate the resulting financial LLMs on an array of financial tasks to investigate the efficacy of continual pre-training:
- Financial Phrase Bank – A sentiment classification task on financial news.
- FiQA SA – An aspect-based sentiment classification task based on financial news and headlines.
- Headline – A binary classification task on whether a headline on a financial entity contains certain information.
- NER – A financial named entity extraction task based on the credit risk assessment sections of SEC reports. Words in this task are annotated with PER, LOC, ORG, and MISC.
Because the financial LLMs are not instruction fine-tuned, the authors evaluate the models in a 5-shot setting for each task for the sake of robustness. On average, FinPythia 6.9B outperforms Pythia 6.9B by 10% across the four tasks, which demonstrates the efficacy of domain-specific continual pre-training. For the 1B model, the improvement is less pronounced, but performance still improves 2% on average.
The following figure illustrates the performance difference before and after DACP on both models.
The following figure showcases two qualitative examples generated by Pythia 6.9B and FinPythia 6.9B. For two finance-related questions regarding an investor manager and a financial term, Pythia 6.9B doesn’t understand the term or recognize the name, whereas FinPythia 6.9B generates detailed answers correctly. The qualitative examples demonstrate that continual pre-training enables the LLMs to acquire domain knowledge during the process.
The following table compares various efficient continual pre-training approaches. ETA-DACP-ppl is ETA-DACP based on perplexity (novelty), and ETA-DACP-ent is based on entropy (diversity). ETS-DACP-com is similar to DACP with data selection by averaging all three metrics. The following are a few takeaways from the results:
- Data selection methods are efficient – They surpass standard continual pre-training with just 10% of the training data. Efficient continual pre-training, including Task-Similar DACP (ETS-DACP), Task-Agnostic DACP based on entropy (ETA-DACP-ent), and Task-Similar DACP based on all three metrics (ETS-DACP-com), outperforms standard DACP on average despite being trained on only 10% of the financial corpus.
- Task-aware data selection works best, in line with research on small language models – ETS-DACP records the best average performance among all the methods, and ETS-DACP-com, which is based on all three metrics, records the second-best task performance. This suggests that using unlabeled task data is still an effective approach to boost task performance in the case of LLMs.
- Task-agnostic data selection is a close second – ETA-DACP-ent follows the performance of the task-aware data selection approach, implying that we could still boost task performance by actively selecting high-quality samples not tied to specific tasks. This paves the way to build financial LLMs for the whole domain while achieving superior task performance.
One critical question regarding continual pre-training is whether it negatively affects the performance on non-domain tasks. The authors also evaluate the continually pre-trained model on four widely used generic tasks: ARC, MMLU, TruthfulQA, and HellaSwag, which measure question answering, reasoning, and completion abilities. The authors find that continual pre-training does not adversely affect non-domain performance. For more details, refer to Efficient Continual Pre-training for Building Domain Specific Large Language Models.
Conclusion
This post offered insights into data collection and continual pre-training strategies for training LLMs for the financial domain. You can start training your own LLMs for financial tasks using Amazon SageMaker Training or Amazon Bedrock today.
About the Authors
Yong Xie is an applied scientist in Amazon FinTech. He focuses on developing large language models and Generative AI applications for finance.
Karan Aggarwal is a Senior Applied Scientist with Amazon FinTech with a focus on Generative AI for finance use-cases. Karan has extensive experience in time-series analysis and NLP, with particular interest in learning from limited labeled data.
Aitzaz Ahmad is an Applied Science Manager at Amazon where he leads a team of scientists building various applications of Machine Learning and Generative AI in Finance. His research interests are in NLP, Generative AI, and LLM Agents. He received his PhD in Electrical Engineering from Texas A&M University.
Qingwei Li is a Machine Learning Specialist at Amazon Web Services. He received his Ph.D. in Operations Research after he broke his advisor’s research grant account and failed to deliver the Nobel Prize he promised. Currently he helps customers in financial service build machine learning solutions on AWS.
Raghvender Arni leads the Customer Acceleration Team (CAT) within AWS Industries. The CAT is a global cross-functional team of customer facing cloud architects, software engineers, data scientists, and AI/ML experts and designers that drives innovation via advanced prototyping, and drives cloud operational excellence via specialized technical expertise.
Achieve DevOps maturity with BMC AMI zAdviser Enterprise and Amazon Bedrock
In software engineering, there is a direct correlation between team performance and building robust, stable applications. The data community aims to adopt the rigorous engineering principles commonly used in software development into their own practices, which includes systematic approaches to design, development, testing, and maintenance. This requires carefully combining applications and metrics to provide complete awareness, accuracy, and control. It means evaluating all aspects of a team’s performance, with a focus on continuous improvement, and it applies just as much to mainframe as it does to distributed and cloud environments—maybe more.
This is achieved through practices like infrastructure as code (IaC) for deployments, automated testing, application observability, and complete application lifecycle ownership. Through years of research, the DevOps Research and Assessment (DORA) team has identified four key metrics that indicate the performance of a software development team:
- Deployment frequency – How often an organization successfully releases to production
- Lead time for changes – The amount of time it takes a commit to get into production
- Change failure rate – The percentage of deployments causing a failure in production
- Time to restore service – How long it takes an organization to recover from a failure in production
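The four metrics above can be computed directly from deployment records, as in the following sketch; the record shape and values are a hypothetical example, not the zAdviser data model:

```python
from datetime import datetime, timedelta

# Hypothetical deployment records: commit time, deploy time, and failure info
deployments = [
    {"committed": datetime(2024, 1, 1, 9), "deployed": datetime(2024, 1, 2, 9),
     "failed": False},
    {"committed": datetime(2024, 1, 3, 9), "deployed": datetime(2024, 1, 5, 9),
     "failed": True, "restored": datetime(2024, 1, 5, 12)},
    {"committed": datetime(2024, 1, 8, 9), "deployed": datetime(2024, 1, 9, 9),
     "failed": False},
]

days_observed = 14
# Deployment frequency: successful releases per day over the window
deployment_frequency = len(deployments) / days_observed
# Lead time for changes: average commit-to-production time
lead_time = sum((d["deployed"] - d["committed"] for d in deployments),
                timedelta()) / len(deployments)
# Change failure rate: fraction of deployments causing a production failure
failures = [d for d in deployments if d["failed"]]
change_failure_rate = len(failures) / len(deployments)
# Time to restore service: average failure-to-recovery time
mttr = sum((d["restored"] - d["deployed"] for d in failures),
           timedelta()) / len(failures)

print(round(change_failure_rate, 2))  # 0.33
print(mttr)                           # 3:00:00
```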
These metrics provide a quantitative way to measure the effectiveness and efficiency of DevOps practices. Although much of the focus around analysis of DevOps is on distributed and cloud technologies, the mainframe still maintains a unique and powerful position, and it can use the DORA 4 metrics to further its reputation as the engine of commerce.
This blog post discusses how BMC Software added AWS generative AI capabilities to its product BMC AMI zAdviser Enterprise. zAdviser uses Amazon Bedrock to provide summarization, analysis, and recommendations for improvement based on the DORA metrics data.
Challenges of tracking DORA 4 metrics
Tracking DORA 4 metrics means putting the numbers together and placing them on a dashboard. However, measuring productivity is essentially measuring the performance of individuals, which can make them feel scrutinized. This situation might necessitate a shift in organizational culture to focus on collective achievements and emphasize that automation tools enhance the developer experience.
It’s also vital to avoid focusing on irrelevant metrics or excessively tracking data. The essence of DORA metrics is to distill information into a core set of key performance indicators (KPIs) for evaluation. Mean time to restore (MTTR) is often the simplest KPI to track, because most organizations use tools like BMC Helix ITSM that record incidents and track issues.
Capturing lead time for changes and change failure rate can be more challenging, especially on mainframes. Lead time for changes and change failure rate KPIs aggregate data from code commits, log files, and automated test results. Using a Git-based SCM pulls these insights together seamlessly. Mainframe teams using BMC’s Git-based DevOps platform, AMI DevX, can collect this data as easily as distributed teams can.
Solution overview
Amazon Bedrock is a fully managed service that offers a choice of high-performing foundation models (FMs) from leading AI companies like AI21 Labs, Anthropic, Cohere, Meta, Stability AI, and Amazon via a single API, along with a broad set of capabilities you need to build generative AI applications with security, privacy, and responsible AI.
BMC AMI zAdviser Enterprise provides a wide range of DevOps KPIs to optimize mainframe development and enable teams to proactively identify and resolve issues. Using machine learning, AMI zAdviser monitors mainframe build, test, and deploy functions across DevOps tool chains and then offers AI-led recommendations for continuous improvement. In addition to capturing and reporting on development KPIs, zAdviser captures data on how the BMC DevX products are adopted and used. This includes the number of programs that were debugged, the outcome of testing efforts using the DevX testing tools, and many other data points. These additional data points can provide deeper insight into the development KPIs, including the DORA metrics, and may be used in future generative AI efforts with Amazon Bedrock.
The following architecture diagram shows the final implementation of zAdviser Enterprise utilizing generative AI to provide summarization, analysis, and recommendations for improvement based on the DORA metrics KPI data.
The solution workflow includes the following steps:
- Create the aggregation query to retrieve the metrics from Elasticsearch.
- Extract the stored mainframe metrics data from zAdviser, which is hosted in Amazon Elastic Compute Cloud (Amazon EC2) and deployed in AWS.
- Aggregate the data retrieved from Elasticsearch and form the prompt for the generative AI Amazon Bedrock API call.
- Pass the generative AI prompt to Amazon Bedrock (using Anthropic’s Claude2 model on Amazon Bedrock).
- Store the response from Amazon Bedrock (an HTML-formatted document) in Amazon Simple Storage Service (Amazon S3).
- Trigger the KPI email process via AWS Lambda:
- The HTML-formatted email is extracted from Amazon S3 and added to the body of the email.
- The PDF for customer KPIs is extracted from zAdviser and attached to the email.
- The email is sent to subscribers.
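The Amazon Bedrock call in the workflow above can be sketched with boto3. The metric names and prompt wording below are illustrative placeholders; the request body follows the Anthropic Claude text-completion format that Amazon Bedrock expects for the `anthropic.claude-v2` model ID.

```python
import json


def build_claude_request(dora_metrics: dict) -> dict:
    """Assemble a Claude 2 text-completion body from aggregated DORA metrics.

    The instructions and metric names are illustrative; the body shape
    (prompt / max_tokens_to_sample) is the Claude text-completion format
    used on Amazon Bedrock for anthropic.claude-v2.
    """
    metric_lines = "\n".join(f"- {name}: {value}" for name, value in dora_metrics.items())
    prompt = (
        "\n\nHuman: Summarize the following DORA metrics and recommend "
        f"improvements, formatted as an HTML document:\n{metric_lines}"
        "\n\nAssistant:"
    )
    return {"prompt": prompt, "max_tokens_to_sample": 2048, "temperature": 0.2}


def summarize_metrics(dora_metrics: dict) -> str:
    """Invoke Claude 2 on Amazon Bedrock and return the generated HTML.

    Requires AWS credentials and Amazon Bedrock model access.
    """
    import boto3

    bedrock = boto3.client("bedrock-runtime")
    response = bedrock.invoke_model(
        modelId="anthropic.claude-v2",
        body=json.dumps(build_claude_request(dora_metrics)),
    )
    return json.loads(response["body"].read())["completion"]
```

The returned HTML document can then be stored in Amazon S3 and attached to the KPI email, as in steps 5 and 6.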
The following screenshot shows the LLM summarization of DORA metrics generated using Amazon Bedrock and sent as an email to the customer, with a PDF attachment that contains the DORA metrics KPI dashboard report by zAdviser.
Key takeaways
In this solution, you don’t need to worry about your data being exposed on the internet when sent to an AI client. The API call to Amazon Bedrock doesn’t contain any personally identifiable information (PII) or any data that could identify a customer. The only data transmitted consists of numerical values in the form of the DORA metric KPIs and instructions for the generative AI’s operations. Importantly, the generative AI client does not retain, learn from, or cache this data.
The zAdviser engineering team was successful in rapidly implementing this feature within a short time span. The rapid progress was facilitated by zAdviser’s substantial investment in AWS services and, importantly, the ease of using Amazon Bedrock via API calls. This underscores the transformative power of generative AI technology embodied in the Amazon Bedrock API. This API, equipped with the industry-specific knowledge repository zAdviser Enterprise and customized with continuously collected organization-specific DevOps metrics, demonstrates the potential of AI in this field.
Generative AI has the potential to lower the barrier to entry to build AI-driven organizations. Large language models (LLMs) in particular can bring tremendous value to enterprises seeking to explore and use unstructured data. Beyond chatbots, LLMs can be used in a variety of tasks, such as classification, editing, and summarization.
Conclusion
This post discussed the transformational impact of generative AI technology in the form of Amazon Bedrock APIs equipped with the industry-specific knowledge that BMC zAdviser possesses, tailored with organization-specific DevOps metrics collected on an ongoing basis.
Check out the BMC website to learn more and set up a demo.
About the Authors
Sunil Bemarkar is a Sr. Partner Solutions Architect at Amazon Web Services. He works with various Independent Software Vendors (ISVs) and Strategic customers across industries to accelerate their digital transformation journey and cloud adoption.
Vij Balakrishna is a Senior Partner Development manager at Amazon Web Services. She helps independent software vendors (ISVs) across industries to accelerate their digital transformation journey.
Spencer Hallman is the Lead Product Manager for the BMC AMI zAdviser Enterprise. Previously, he was the Product Manager for BMC AMI Strobe and BMC AMI Ops Automation for Batch Thruput. Prior to Product Management, Spencer was the Subject Matter Expert for Mainframe Performance. His diverse experience over the years has also included programming on multiple platforms and languages as well as working in the Operations Research field. He has a Master of Business Administration with a concentration in Operations Research from Temple University and a Bachelor of Science in Computer Science from the University of Vermont. He lives in Devon, PA and when he’s not attending virtual meetings, enjoys walking his dogs, riding his bike and spending time with his family.
Fine-tune your Amazon Titan Image Generator G1 model using Amazon Bedrock model customization
Amazon Titan Image Generator G1 is a cutting-edge text-to-image model, available via Amazon Bedrock, that is able to understand prompts describing multiple objects in various contexts and captures these relevant details in the images it generates. It is available in the US East (N. Virginia) and US West (Oregon) AWS Regions and can perform advanced image editing tasks such as smart cropping, in-painting, and background changes. However, users may want to adapt the model to unique characteristics in custom datasets that the model is not already trained on. Custom datasets can include highly proprietary data that is consistent with your brand guidelines or specific styles such as a previous campaign. To address these use cases and generate fully personalized images, you can fine-tune Amazon Titan Image Generator with your own data using custom models for Amazon Bedrock.
From generating images to editing them, text-to-image models have broad applications across industries. They can enhance employee creativity and provide the ability to imagine new possibilities simply with textual descriptions. For example, it can aid design and floor planning for architects and allow faster innovation by providing the ability to visualize various designs without the manual process of creating them. Similarly, it can aid in design across various industries such as manufacturing, fashion design in retail, and game design by streamlining the generation of graphics and illustrations. Text-to-image models also enhance your customer experience by allowing for personalized advertising as well as interactive and immersive visual chatbots in media and entertainment use cases.
In this post, we guide you through the process of fine-tuning the Amazon Titan Image Generator model to learn two new categories: Ron the dog and Smila the cat, our favorite pets. We discuss how to prepare your data for the model fine-tuning task and how to create a model customization job in Amazon Bedrock. Finally, we show you how to test and deploy your fine-tuned model with Provisioned Throughput.
Ron the dog | Smila the cat |
Evaluating model capabilities before starting a fine-tuning job
Foundation models are trained on large amounts of data, so it is possible that your model will work well enough out of the box. That’s why it’s good practice to check if you actually need to fine-tune your model for your use case or if prompt engineering is sufficient. Let’s try to generate some images of Ron the dog and Smila the cat with the base Amazon Titan Image Generator model, as shown in the following screenshots.
As expected, the out-of-the-box model does not know Ron and Smila yet, and the generated outputs show different dogs and cats. With some prompt engineering, we can provide more details to get closer to the look of our favorite pets.
Although the generated images are more similar to Ron and Smila, we see that the model is not able to reproduce the full likeness of them. Let’s now start a fine-tuning job with the photos from Ron and Smila to get consistent, personalized outputs.
Fine-tuning Amazon Titan Image Generator
Amazon Bedrock provides you with a serverless experience for fine-tuning your Amazon Titan Image Generator model. You only need to prepare your data and select your hyperparameters, and AWS will handle the heavy lifting for you.
When you fine-tune the Amazon Titan Image Generator model, a copy of the model is created in an AWS model development account, owned and managed by AWS, and a model customization job is created. The job accesses the fine-tuning data from a VPC, and the Amazon Titan model’s weights are updated. The new model is then saved to an Amazon Simple Storage Service (Amazon S3) bucket located in the same model development account as the pre-trained model. It can be used for inference only by your account and is not shared with any other AWS account. When running inference, you access this model either through Provisioned Throughput capacity or directly, using batch inference for Amazon Bedrock. Regardless of the inference modality chosen, your data remains in your account and is not copied to any AWS-owned account or used to improve the Amazon Titan Image Generator model.
The following diagram illustrates this workflow.
Data privacy and network security
Your data used for fine-tuning, including prompts, as well as the custom models, remains private in your AWS account. It is not shared or used for model training or service improvements, and isn’t shared with third-party model providers. All the data used for fine-tuning is encrypted in transit and at rest. The data remains in the same Region where the API call is processed. You can also use AWS PrivateLink to create a private connection between the AWS account where your data resides and the VPC.
Data preparation
Before you can create a model customization job, you need to prepare your training dataset. The format of your training dataset depends on the type of customization job you are creating (fine-tuning or continued pre-training) and the modality of your data (text-to-text, text-to-image, or image-to-embedding). For the Amazon Titan Image Generator model, you need to provide the images that you want to use for the fine-tuning and a caption for each image. Amazon Bedrock expects your images to be stored on Amazon S3 and the pairs of images and captions to be provided in JSON Lines (JSONL) format, with one JSON object per line.
Each JSON line is a sample containing an image-ref, the S3 URI for an image, and a caption that includes a textual prompt for the image. Your images must be in JPEG or PNG format. The following code shows an example of the format:
{"image-ref": "s3://bucket/path/to/image001.png", "caption": "<prompt text>"} {"image-ref": "s3://bucket/path/to/image002.png", "caption": "<prompt text>"} {"image-ref": "s3://bucket/path/to/image003.png", "caption": "<prompt text>"}
Because “Ron” and “Smila” are names that could also be used in other contexts, such as a person’s name, we add the identifiers “Ron the dog” and “Smila the cat” when creating the prompts to fine-tune our model. Although it’s not a requirement for the fine-tuning workflow, this additional information provides more contextual clarity for the model as it is customized for the new classes and avoids confusing “Ron the dog” with a person named Ron or “Smila the cat” with the city of Smila in Ukraine. Using this logic, the following images show a sample of our training dataset.
Ron the dog laying on a white dog bed | Ron the dog sitting on a tile floor | Ron the dog laying on a car seat |
Smila the cat lying on a couch | Smila the cat staring at the camera laying on a couch | Smila the cat laying in a pet carrier |
When transforming our data to the format expected by the customization job, we get the following sample structure:
{"image-ref": "<S3_BUCKET_URL>/ron_01.jpg", "caption": "Ron the dog laying on a white dog bed"} {"image-ref": "<S3_BUCKET_URL>/ron_02.jpg", "caption": "Ron the dog sitting on a tile floor"} {"image-ref": "<S3_BUCKET_URL>/ron_03.jpg", "caption": "Ron the dog laying on a car seat"} {"image-ref": "<S3_BUCKET_URL>/smila_01.jpg", "caption": "Smila the cat lying on a couch"} {"image-ref": "<S3_BUCKET_URL>/smila_02.jpg", "caption": "Smila the cat sitting next to the window next to a statue cat"} {"image-ref": "<S3_BUCKET_URL>/smila_03.jpg", "caption": "Smila the cat lying on a pet carrier"}
After we have created our JSONL file, we need to store it in an S3 bucket to start our customization job. Amazon Titan Image Generator G1 fine-tuning jobs work with 5–10,000 images. For the example discussed in this post, we use 60 images: 30 of Ron the dog and 30 of Smila the cat. In general, providing more varieties of the style or class you are trying to learn will improve the accuracy of your fine-tuned model. However, the more images you use for fine-tuning, the more time the fine-tuning job will take to complete. The number of images used also influences the pricing of your fine-tuning job. Refer to Amazon Bedrock Pricing for more information.
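Building the manifest from image/caption pairs can be automated with a few lines of Python. The bucket path below is a placeholder; the record shape (`image-ref` / `caption`) matches the format shown above.

```python
import json


def manifest_lines(samples):
    """Serialize (S3 image URI, caption) pairs into the JSON Lines records
    expected by the Amazon Bedrock customization job."""
    return [json.dumps({"image-ref": ref, "caption": cap}) for ref, cap in samples]


def write_training_manifest(samples, path):
    """Write the JSONL manifest file, which is then uploaded to Amazon S3."""
    with open(path, "w") as f:
        f.write("\n".join(manifest_lines(samples)) + "\n")
```

For example, `write_training_manifest([("s3://bucket/ron_01.jpg", "Ron the dog laying on a white dog bed")], "manifest.jsonl")` produces one JSON object per line, ready for upload.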
Running the fine-tuning job
Now that we have our training data ready, we can begin a new customization job. This process can be done both via the Amazon Bedrock console or APIs. To use the Amazon Bedrock console, complete the following steps:
- On the Amazon Bedrock console, choose Custom models in the navigation pane.
- On the Customize model menu, choose Create fine-tuning job.
- For Fine-tuned model name, enter a name for your new model.
- For Job configuration, enter a name for the training job.
- For Input data, enter the S3 path of the input data.
- In the Hyperparameters section, provide values for the following:
- Number of steps – The number of times the model is exposed to each batch.
- Batch size – The number of samples processed before updating the model parameters.
- Learning rate – The rate at which the model parameters are updated after each batch.

The choice of these parameters depends on a given dataset. As a general guideline, we recommend you start by fixing the batch size to 8, the learning rate to 1e-5, and set the number of steps according to the number of images used, as detailed in the following table.
Number of images provided | 8 | 32 | 64 | 1,000 | 10,000 |
Number of steps recommended | 1,000 | 4,000 | 8,000 | 10,000 | 12,000 |
If the results of your fine-tuning job are not satisfactory, consider increasing the number of steps if you don’t observe any signs of the style in generated images, and decreasing the number of steps if you observe the style in the generated images but with artifacts or blurriness. If the fine-tuned model fails to learn the unique style in your dataset even after 40,000 steps, consider increasing the batch size or the learning rate.
- In the Output data section, enter the S3 output path where the validation outputs, including the periodically recorded validation loss and accuracy metrics, are stored.
- In the Service access section, generate a new AWS Identity and Access Management (IAM) role or choose an existing IAM role with the necessary permissions to access your S3 buckets.
This authorization enables Amazon Bedrock to retrieve input and validation datasets from your designated bucket and store validation outputs seamlessly in your S3 bucket.
- Choose Fine-tune model.
With the correct configurations set, Amazon Bedrock will now train your custom model.
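The same job can be created programmatically. The sketch below encodes the step-count guidance table from above and submits the job via the Amazon Bedrock control-plane API; the hyperparameter key names (`stepCount`, `batchSize`, `learningRate`) and the base-model identifier are assumptions to verify against the current Amazon Bedrock documentation.

```python
# (number of images, recommended steps) from the guidance table above
_STEP_GUIDANCE = [(8, 1_000), (32, 4_000), (64, 8_000), (1_000, 10_000), (10_000, 12_000)]


def recommended_hyperparameters(num_images: int) -> dict:
    """Start from batch size 8 and learning rate 1e-5, with the step count
    taken from the smallest bracket in the guidance table that covers
    num_images (capped at 12,000 steps)."""
    steps = next((s for n, s in _STEP_GUIDANCE if num_images <= n), 12_000)
    return {"stepCount": str(steps), "batchSize": "8", "learningRate": "0.00001"}


def create_fine_tuning_job(job_name, model_name, role_arn, train_s3_uri, output_s3_uri, num_images):
    """Submit the customization job (requires AWS credentials and an IAM role
    with access to the S3 buckets)."""
    import boto3

    bedrock = boto3.client("bedrock")
    return bedrock.create_model_customization_job(
        jobName=job_name,
        customModelName=model_name,
        roleArn=role_arn,
        baseModelIdentifier="amazon.titan-image-generator-v1",
        hyperParameters=recommended_hyperparameters(num_images),
        trainingDataConfig={"s3Uri": train_s3_uri},
        outputDataConfig={"s3Uri": output_s3_uri},
    )
```

For the 60-image dataset in this post, `recommended_hyperparameters(60)` resolves to 8,000 steps, matching the 64-image column of the table.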
Deploy the fine-tuned Amazon Titan Image Generator with Provisioned Throughput
After you create a custom model, Provisioned Throughput allows you to allocate a predetermined, fixed rate of processing capacity to it. This allocation provides a consistent level of performance and capacity for handling workloads, which results in better performance in production. The second advantage of Provisioned Throughput is cost control, because standard token-based pricing with on-demand inference can be difficult to predict at large scales.
When the fine-tuning of your model is complete, the model will appear on the Custom models page on the Amazon Bedrock console.
To purchase Provisioned Throughput, select the custom model that you just fine-tuned and choose Purchase Provisioned Throughput.
This prepopulates the selected model for which you want to purchase Provisioned Throughput. To test your fine-tuned model before deployment, set model units to 1 and set the commitment term to No commitment. This lets you quickly start testing your model with custom prompts and check whether the training is adequate. Moreover, when new fine-tuned models or versions become available, you can update the Provisioned Throughput, as long as you update it with other versions of the same model.
Fine-tuning results
For our task of customizing the model on Ron the dog and Smila the cat, experiments showed that the best hyperparameters were 5,000 steps with a batch size of 8 and a learning rate of 1e-5.
The following are some examples of the images generated by the customized model.
Ron the dog wearing a superhero cape | Ron the dog on the moon | Ron the dog in a swimming pool with sunglasses |
Smila the cat on the snow | Smila the cat in black and white staring at the camera | Smila the cat wearing a Christmas hat |
Conclusion
In this post, we discussed when to use fine-tuning instead of engineering your prompts for better-quality image generation. We showed how to fine-tune the Amazon Titan Image Generator model and deploy the custom model on Amazon Bedrock. We also provided general guidelines on how to prepare your data for fine-tuning and set optimal hyperparameters for more accurate model customization.
As a next step, you can adapt the following example to your use case to generate hyper-personalized images using Amazon Titan Image Generator.
About the Authors
Maira Ladeira Tanke is a Senior Generative AI Data Scientist at AWS. With a background in machine learning, she has over 10 years of experience architecting and building AI applications with customers across industries. As a technical lead, she helps customers accelerate their achievement of business value through generative AI solutions on Amazon Bedrock. In her free time, Maira enjoys traveling, playing with her cat Smila, and spending time with her family someplace warm.
Dani Mitchell is an AI/ML Specialist Solutions Architect at Amazon Web Services. He is focused on computer vision use cases and helping customers across EMEA accelerate their ML journey.
Bharathi Srinivasan is a Data Scientist at AWS Professional Services, where she loves to build cool things on Amazon Bedrock. She is passionate about driving business value from machine learning applications, with a focus on responsible AI. Outside of building new AI experiences for customers, Bharathi loves to write science fiction and challenge herself with endurance sports.
Achin Jain is an Applied Scientist with the Amazon Artificial General Intelligence (AGI) team. He has expertise in text-to-image models and is focused on building the Amazon Titan Image Generator.
Sustainability call for proposals — Spring 2024
Addressing the challenges associated with generating consistent, transparent, and accurate carbon measurements.Read More
Build a receipt and invoice processing pipeline with Amazon Textract
In today’s business landscape, organizations are constantly seeking ways to optimize their financial processes, enhance efficiency, and drive cost savings. One area that holds significant potential for improvement is accounts payable. On a high level, the accounts payable process includes receiving and scanning invoices, extraction of the relevant data from scanned invoices, validation, approval, and archival. The second step (extraction) can be complex. Each invoice and receipt looks different. The labels are imperfect and inconsistent. The most important pieces of information such as price, vendor name, vendor address, and payment terms are often not explicitly labeled and have to be interpreted based on context. The traditional approach of using human reviewers to extract the data is time-consuming, error-prone, and not scalable.
In this post, we show how to automate the accounts payable process using Amazon Textract for data extraction. We also provide a reference architecture to build an invoice automation pipeline that enables extraction, verification, archival, and intelligent search.
Solution overview
The following architecture diagram shows the stages of a receipt and invoice processing workflow. It starts with a document capture stage to securely collect and store scanned invoices and receipts. The next stage is the extraction phase, where you pass the collected invoices and receipts to the Amazon Textract AnalyzeExpense
API to extract financially related relationships between text such as vendor name, invoice receipt date, order date, amount due, amount paid, and so on. In the next stage, you use predefined expense rules to determine if you should automatically approve or reject the receipt. Approved and rejected documents go to their respective folders within the Amazon Simple Storage Service (Amazon S3) bucket. For approved documents, you can search all the extracted fields and values using Amazon OpenSearch Service. You can visualize the indexed metadata using OpenSearch Dashboards. Approved documents are also set up to be moved to Amazon S3 Intelligent-Tiering for long-term retention and archival using S3 lifecycle policies.
The following sections take you through the process of creating the solution.
Prerequisites
To deploy this solution, you must have the following:
- An AWS account.
- An AWS Cloud9 environment. AWS Cloud9 is a cloud-based integrated development environment (IDE) that lets you write, run, and debug your code with just a browser. It includes a code editor, debugger, and terminal.
To create the AWS Cloud9 environment, provide a name and description. Keep everything else as default. Choose the IDE link on the AWS Cloud9 console to navigate to the IDE. You’re now ready to use the AWS Cloud9 environment.
Deploy the solution
To set up the solution, you use the AWS Cloud Development Kit (AWS CDK) to deploy an AWS CloudFormation stack.
- In your AWS Cloud9 IDE terminal, clone the GitHub repository and install the dependencies. Run the following commands to deploy the
InvoiceProcessor
stack:
The deployment takes around 25 minutes with the default configuration settings from the GitHub repo. Additional output information is also available on the AWS CloudFormation console.
- After the AWS CDK deployment is complete, create expense validation rules in an Amazon DynamoDB table. You can use the same AWS Cloud9 terminal to run the following commands:
- In the S3 bucket that starts with
invoiceprocessorworkflow-invoiceprocessorbucketf1-*
, create an uploads folder.
In Amazon Cognito, you should already have an existing user pool called OpenSearchResourcesCognitoUserPool*
. We use this user pool to create a new user.
- On the Amazon Cognito console, navigate to the user pool
OpenSearchResourcesCognitoUserPool*
. - Create a new Amazon Cognito user.
- Provide a user name and password of your choice and note them for later use.
- Upload the documents random_invoice1 and random_invoice2 to the S3
uploads
folder to start the workflows.
Now let’s dive into each of the document processing steps.
Document Capture
Customers handle invoices and receipts in a multitude of formats from different vendors. These documents are received through channels like hard copies, scanned copies uploaded to file storage, or shared storage devices. In the document capture stage, you store all scanned copies of receipts and invoices in a highly scalable storage such as in an S3 bucket.
Extraction
The next stage is the extraction phase, where you pass the collected invoices and receipts to the Amazon Textract AnalyzeExpense
API to extract financially related relationships between text such as Vendor Name, Invoice Receipt Date, Order Date, Amount Due/Paid, etc.
AnalyzeExpense is an API dedicated to processing invoice and receipt documents. It is available as both a synchronous and an asynchronous API. The synchronous API allows you to send images in bytes format, and the asynchronous API allows you to send files in JPG, PNG, TIFF, and PDF formats. The AnalyzeExpense
API response consists of three distinct sections:
- Summary fields – This section includes both normalized keys and the explicitly mentioned keys along with their values.
AnalyzeExpense
normalizes the keys for contact-related information such as vendor name and vendor address, tax ID-related keys such as tax payer ID, payment-related keys such as amount due and discount, and general keys such as invoice ID, delivery date, and account number. Keys that are not normalized still appear in the summary fields as key-value pairs. For a complete list of supported expense fields, refer to Analyzing Invoices and Receipts. - Line items – This section includes normalized line item keys such as item description, unit price, quantity, and product code.
- OCR block – The block contains the raw text extract from the invoice page. The raw text extract can be used for postprocessing and identifying information that is not covered as part of the summary and line item fields.
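A minimal sketch of calling the synchronous API and flattening the summary fields follows. The bucket and key are placeholders; the response shape (`ExpenseDocuments`, `SummaryFields`, `Type`/`ValueDetection`) is the documented AnalyzeExpense output format.

```python
def summary_fields(analyze_expense_response: dict) -> dict:
    """Flatten the SummaryFields of an AnalyzeExpense response into a
    {field type: value} dict (first occurrence of each type wins)."""
    fields = {}
    for doc in analyze_expense_response.get("ExpenseDocuments", []):
        for field in doc.get("SummaryFields", []):
            ftype = field.get("Type", {}).get("Text")
            value = field.get("ValueDetection", {}).get("Text")
            if ftype and ftype not in fields:
                fields[ftype] = value
    return fields


def analyze_invoice(bucket: str, key: str) -> dict:
    """Run the synchronous AnalyzeExpense API on an S3 object
    (requires AWS credentials)."""
    import boto3

    textract = boto3.client("textract")
    response = textract.analyze_expense(
        Document={"S3Object": {"Bucket": bucket, "Name": key}}
    )
    return summary_fields(response)
```

The flattened dict gives direct access to normalized keys such as `VENDOR_NAME` and `INVOICE_RECEIPT_ID`, which the verification stage later checks against the expense rules.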
This post uses the Amazon Textract IDP CDK constructs (AWS CDK components to define infrastructure for intelligent document processing (IDP) workflows), which allow you to build use case-specific, customizable IDP workflows. The constructs and samples are a collection of components that enable the definition of IDP processes on AWS and are published to GitHub. The main concepts used are the AWS CDK constructs, the actual AWS CDK stacks, and AWS Step Functions.
The following figure shows the Step Functions workflow.
The extraction workflow includes the following steps:
- InvoiceProcessor-Decider – An AWS Lambda function that verifies if the input document format is supported by Amazon Textract. For more details about supported formats, refer to Input Documents.
- DocumentSplitter – A Lambda function that splits documents into chunks of up to 2,500 pages so that large multi-page documents can be processed.
- Map State – A Step Functions Map state that processes each chunk in parallel.
- TextractAsync – This task calls Amazon Textract using the asynchronous API following best practices with Amazon Simple Notification Service (Amazon SNS) notifications and uses
OutputConfig
to store the Amazon Textract JSON output to the S3 bucket you created earlier. It consists of two Lambda functions: one to submit the document for processing and one that is triggered on the SNS notification. - TextractAsyncToJSON2 – Because the
TextractAsync
task can produce multiple paginated output files, theTextractAsyncToJSON2
process combines them into one JSON file.
We discuss the details of the next three steps in the following sections.
Verification and approval
For the verification stage, the SetMetaData
Lambda function verifies whether the uploaded file is a valid expense as per the rules configured previously in DynamoDB table. For this post, you use the following sample rules:
- Verification is successful if
INVOICE_RECEIPT_ID
is present and matches the regex(?i)[0-9]{3}[a-z]{3}[0-9]{3}$
and ifPO_NUMBER
is present and matches the regex(?i)[a-z0-9]+$
- Verification is unsuccessful if either
PO_NUMBER
orINVOICE_RECEIPT_ID
is incorrect or missing in the document.
After the files are processed, the expense verification function moves the input files to either approved
or declined
folders in the same S3 bucket.
For the purposes of this solution, we use DynamoDB to store the expense validation rules. However, you can modify this solution to integrate with your own or commercial expense validation or management solutions.
Intelligent index and search
With the OpenSearchPushInvoke
Lambda function, the extracted expense metadata is pushed to an OpenSearch Service index and is available for search.
The final TaskOpenSearchMapping
step clears the context, which otherwise could exceed the Step Functions quota of maximum input or output size for a task, state, or workflow run.
After the OpenSearch Service index is created, you can search for keywords from the extracted text via OpenSearch Dashboards.
Archival, audit, and analytics
To manage the lifecycle and archival of invoices and receipts, you can configure S3 lifecycle rules to transition S3 objects from Standard to Intelligent-Tiering storage classes. S3 Intelligent-Tiering monitors access patterns and automatically moves objects to the Infrequent Access tier when they haven’t been accessed for 30 consecutive days. After 90 days of no access, the objects are moved to the Archive Instant Access tier without performance impact or operational overhead.
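Such a lifecycle rule can be expressed as a small configuration and attached with boto3. The prefix and rule ID below are assumptions for illustration; the configuration shape follows the S3 `PutBucketLifecycleConfiguration` API.

```python
def archival_lifecycle_rule(prefix: str = "approved/") -> dict:
    """Lifecycle rule transitioning objects under `prefix` to S3
    Intelligent-Tiering immediately; the storage class then handles the
    Infrequent Access (30 days) and Archive Instant Access (90 days)
    tiering automatically based on access patterns."""
    return {
        "Rules": [
            {
                "ID": "invoices-to-intelligent-tiering",
                "Status": "Enabled",
                "Filter": {"Prefix": prefix},
                "Transitions": [{"Days": 0, "StorageClass": "INTELLIGENT_TIERING"}],
            }
        ]
    }


def apply_lifecycle(bucket: str) -> None:
    """Attach the rule to the bucket (requires AWS credentials)."""
    import boto3

    s3 = boto3.client("s3")
    s3.put_bucket_lifecycle_configuration(
        Bucket=bucket, LifecycleConfiguration=archival_lifecycle_rule()
    )
```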
For auditing and analytics, this solution uses OpenSearch Service for running analytics on invoice requests. OpenSearch Service enables you to effortlessly ingest, secure, search, aggregate, view, and analyze data for a number of use cases, such as log analytics, application search, enterprise search, and more.
Log in to OpenSearch Dashboards and navigate to Stack Management, Saved objects, then choose Import. Choose the invoices.ndjson file from the cloned repository and choose Import. This prepopulates indexes and builds the visualization.
Refresh the page and navigate to Home, Dashboard, and open Invoices. You can now select and apply filters and expand the time window to explore past invoices.
Clean up
When you’re finished evaluating Amazon Textract for processing receipts and invoices, we recommend cleaning up any resources that you might have created. Complete the following steps:
- Delete all content from the S3 bucket
invoiceprocessorworkflow-invoiceprocessorbucketf1-*
. - In AWS Cloud9, run the following commands to delete Amazon Cognito resources and CloudFormation stacks:
- Delete the AWS Cloud9 environment that you created from the AWS Cloud9 console.
Conclusion
In this post, we provided an overview of how we can build an invoice automation pipeline using Amazon Textract for data extraction and create a workflow for validation, archival, and search. We provided code samples on how to use the AnalyzeExpense
API for extraction of critical fields from an invoice.
To get started, sign in to the Amazon Textract console to try this feature. To learn more about Amazon Textract capabilities, refer to the Amazon Textract Developer Guide or Textract Resources. To learn more about IDP, refer to the IDP with AWS AI services Part 1 and Part 2 posts.
About the Authors
Sushant Pradhan is a Sr. Solutions Architect at Amazon Web Services, helping enterprise customers. His interests and experience include containers, serverless technology, and DevOps. In his spare time, Sushant enjoys spending time outdoors with his family.
Shibin Michaelraj is a Sr. Product Manager with the AWS Textract team. He is focused on building AI/ML-based products for AWS customers.
Suprakash Dutta is a Sr. Solutions Architect at Amazon Web Services. He focuses on digital transformation strategy, application modernization and migration, data analytics, and machine learning. He is part of the AI/ML community at AWS and designs intelligent document processing solutions.
Maran Chandrasekaran is a Senior Solutions Architect at Amazon Web Services, working with our enterprise customers. Outside of work, he loves to travel and ride his motorcycle in Texas Hill Country.
Updating large language models by directly editing network layers
Automated method that uses gradients to identify salient layers prevents regression on previously seen data.Read More
Best practices for building secure applications with Amazon Transcribe
Amazon Transcribe is an AWS service that allows customers to convert speech to text in either batch or streaming mode. It uses machine learning–powered automatic speech recognition (ASR), automatic language identification, and post-processing technologies. Amazon Transcribe can be used for transcription of customer care calls, multiparty conference calls, and voicemail messages, as well as subtitle generation for recorded and live videos, to name just a few examples. In this blog post, you will learn how to power your applications with Amazon Transcribe capabilities in a way that meets your security requirements.
Some customers entrust Amazon Transcribe with data that is confidential and proprietary to their business. In other cases, audio content processed by Amazon Transcribe may contain sensitive data that needs to be protected to comply with local laws and regulations. Examples of such information are personally identifiable information (PII), personal health information (PHI), and payment card industry (PCI) data. In the following sections of the blog, we cover different mechanisms Amazon Transcribe has to protect customer data both in transit and at rest. We share the following seven security best practices to build applications with Amazon Transcribe that meet your security and compliance requirements:
- Use data protection with Amazon Transcribe
- Communicate over a private network path
- Redact sensitive data if needed
- Use IAM roles for applications and AWS services that require Amazon Transcribe access
- Use tag-based access control
- Use AWS monitoring tools
- Enable AWS Config
The following best practices are general guidelines and don’t represent a complete security solution. Because these best practices might not be appropriate or sufficient for your environment, use them as helpful considerations rather than prescriptions.
Best practice 1 – Use data protection with Amazon Transcribe
Amazon Transcribe conforms to the AWS shared responsibility model, which differentiates AWS responsibility for security of the cloud from customer responsibility for security in the cloud.
AWS is responsible for protecting the global infrastructure that runs all of the AWS Cloud. As the customer, you are responsible for maintaining control over your content that is hosted on this infrastructure. This content includes the security configuration and management tasks for the AWS services that you use. For more information about data privacy, see the Data Privacy FAQ.
Protecting data in transit
Data encryption is used to make sure that data communication between your application and Amazon Transcribe remains confidential. The use of strong cryptographic algorithms protects data while it is being transmitted.
Amazon Transcribe can operate in one of two modes:
- Streaming transcriptions transcribe a media stream in real time
- Batch transcription jobs transcribe audio files asynchronously
In streaming transcription mode, client applications open a bidirectional streaming connection over HTTP/2 or WebSockets. An application sends an audio stream to Amazon Transcribe, and the service responds with a stream of text in real time. Both HTTP/2 and WebSockets streaming connections are established over Transport Layer Security (TLS), which is a widely accepted cryptographic protocol. TLS provides authentication and encryption of data in transit using AWS certificates. We recommend using TLS 1.2 or later.
In batch transcription mode, an audio file first needs to be put in an Amazon Simple Storage Service (Amazon S3) bucket. Then a batch transcription job referencing the S3 URI of this file is created in Amazon Transcribe. Both Amazon Transcribe in batch mode and Amazon S3 use HTTP/1.1 over TLS to protect data in transit.
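As a concrete sketch of the batch flow, the snippet below builds the request for the StartTranscriptionJob API (boto3's `start_transcription_job`). The bucket, key, and job name are hypothetical placeholders, and the actual call (commented out) requires valid AWS credentials; the SDK signs the HTTPS request and transmits it over TLS.

```python
# Request parameters for Amazon Transcribe's StartTranscriptionJob API.
# The bucket, key, and job name below are hypothetical placeholders.
params = {
    "TranscriptionJobName": "example-batch-job",
    "Media": {"MediaFileUri": "s3://my-audio-bucket/calls/example-call.wav"},
    "MediaFormat": "wav",
    "LanguageCode": "en-US",
}

# With valid credentials, the job would be started like this:
# import boto3
# transcribe = boto3.client("transcribe", region_name="us-east-1")
# transcribe.start_transcription_job(**params)
print(params["Media"]["MediaFileUri"])
```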
All requests to Amazon Transcribe over HTTP and WebSockets must be authenticated using AWS Signature Version 4. We recommend using Signature Version 4 to authenticate HTTP requests to Amazon S3 as well, although the older Signature Version 2 is still accepted in some AWS Regions. Applications must have valid credentials to sign API requests to AWS services.
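To make the signing process concrete, here is a simplified, illustrative sketch of the Signature Version 4 algorithm in pure Python. In practice the AWS SDKs do this for you; the credential values below are dummies, and query-string and session-token handling are omitted.

```python
import hashlib
import hmac

def sign_v4(method, host, path, payload, region, service,
            access_key, secret_key, amz_date):
    """Build a SigV4 Authorization header (simplified sketch: only the
    host and x-amz-date headers are signed, no query string)."""
    # Step 1: canonical request
    payload_hash = hashlib.sha256(payload.encode()).hexdigest()
    canonical_headers = f"host:{host}\nx-amz-date:{amz_date}\n"
    signed_headers = "host;x-amz-date"
    canonical_request = "\n".join(
        [method, path, "", canonical_headers, signed_headers, payload_hash])
    # Step 2: string to sign, scoped to date/region/service
    date_stamp = amz_date[:8]
    scope = f"{date_stamp}/{region}/{service}/aws4_request"
    string_to_sign = "\n".join(
        ["AWS4-HMAC-SHA256", amz_date, scope,
         hashlib.sha256(canonical_request.encode()).hexdigest()])
    # Step 3: derive the signing key with chained HMACs
    def _hmac(key, msg):
        return hmac.new(key, msg.encode(), hashlib.sha256).digest()
    key = _hmac(("AWS4" + secret_key).encode(), date_stamp)
    for part in (region, service, "aws4_request"):
        key = _hmac(key, part)
    signature = hmac.new(key, string_to_sign.encode(),
                         hashlib.sha256).hexdigest()
    # Step 4: assemble the Authorization header
    return (f"AWS4-HMAC-SHA256 Credential={access_key}/{scope}, "
            f"SignedHeaders={signed_headers}, Signature={signature}")
```

The key derivation in step 3 is why leaked signed requests don't expose your secret key: the signature is scoped to a single date, Region, and service.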
Protecting data at rest
Amazon Transcribe in batch mode uses S3 buckets to store both the input audio file and the output transcription file. Customers use an S3 bucket to store the input audio file, and it is highly recommended to enable encryption on this bucket. Amazon Transcribe supports the following S3 encryption methods:
- Server-Side Encryption with Amazon S3 Managed Keys (SSE-S3)
- Server-Side Encryption with KMS keys Stored in AWS Key Management Service (SSE-KMS)
Both methods encrypt customer data as it is written to disks and decrypt it when you access it, using one of the strongest block ciphers available: 256-bit Advanced Encryption Standard (AES-256) GCM. When using SSE-S3, encryption keys are managed and regularly rotated by the Amazon S3 service. For additional security and compliance, SSE-KMS gives customers control over encryption keys via AWS Key Management Service (AWS KMS). AWS KMS adds another layer of access control because you must have permission to use the appropriate KMS keys in order to encrypt and decrypt objects in S3 buckets configured with SSE-KMS. SSE-KMS also provides an audit trail that records who used your KMS keys and when.
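As an illustration, the configuration below enables SSE-KMS default encryption on the input bucket via the S3 PutBucketEncryption API (boto3's `put_bucket_encryption`); the bucket name and KMS key ARN are hypothetical placeholders.

```python
# Default-encryption configuration for the input S3 bucket (SSE-KMS).
# The bucket name and KMS key ARN below are hypothetical placeholders.
encryption_config = {
    "Rules": [
        {
            "ApplyServerSideEncryptionByDefault": {
                "SSEAlgorithm": "aws:kms",
                "KMSMasterKeyID": "arn:aws:kms:us-east-1:111122223333:key/EXAMPLE-KEY-ID",
            },
            # Bucket Keys reduce the number of KMS requests (and cost)
            "BucketKeyEnabled": True,
        }
    ]
}

# With valid credentials, the configuration would be applied like this:
# import boto3
# boto3.client("s3").put_bucket_encryption(
#     Bucket="my-audio-bucket",
#     ServerSideEncryptionConfiguration=encryption_config,
# )
```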
The output transcription can be stored in the same or a different customer-owned S3 bucket. In this case, the same SSE-S3 and SSE-KMS encryption options apply. Another option for Amazon Transcribe output in batch mode is using a service-managed S3 bucket. Then output data is put in a secure S3 bucket managed by the Amazon Transcribe service, and you are provided with a temporary URI that can be used to download your transcript.
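The output-side options above map to parameters of the same StartTranscriptionJob API; the sketch below shows them (all names and the key ARN are hypothetical placeholders).

```python
# StartTranscriptionJob parameters directing output to a customer-owned
# bucket and encrypting the transcript with a customer-managed KMS key.
# All names and ARNs below are hypothetical placeholders.
params = {
    "TranscriptionJobName": "example-encrypted-output-job",
    "Media": {"MediaFileUri": "s3://my-audio-bucket/calls/example-call.wav"},
    "LanguageCode": "en-US",
    "OutputBucketName": "my-transcripts-bucket",
    "OutputEncryptionKMSKeyId": "arn:aws:kms:us-east-1:111122223333:key/EXAMPLE-KEY-ID",
}
# Omit OutputBucketName to use the service-managed bucket instead; the
# job result then includes a temporary URI for downloading the transcript.
```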
Amazon Transcribe uses encrypted Amazon Elastic Block Store (Amazon EBS) volumes to temporarily store customer data during media processing. The data is cleaned up whether the job completes or fails.
Best practice 2 – Communicate over a private network path
Many customers rely on encryption in transit to securely communicate with Amazon Transcribe over the internet. However, for some applications, encryption in transit may not be sufficient to meet security requirements. In some cases, data must not traverse public networks such as the internet, or the application must be deployed in a private environment with no internet connectivity. To meet these requirements, use interface VPC endpoints powered by AWS PrivateLink.
The following architectural diagram demonstrates a use case where an application is deployed on Amazon EC2. The EC2 instance that is running the application does not have access to the internet and is communicating with Amazon Transcribe and Amazon S3 via interface VPC endpoints.
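To sketch this setup, the parameters below create an interface VPC endpoint for Amazon Transcribe with the EC2 CreateVpcEndpoint API (boto3's `create_vpc_endpoint`); all resource IDs are hypothetical placeholders, and batch mode uses a separate endpoint service name from streaming.

```python
# Parameters for an interface VPC endpoint for Amazon Transcribe batch
# APIs. All resource IDs below are hypothetical placeholders. Streaming
# uses a separate service name (…transcribestreaming).
endpoint_params = {
    "VpcEndpointType": "Interface",
    "VpcId": "vpc-0123456789abcdef0",
    "ServiceName": "com.amazonaws.us-east-1.transcribe",
    "SubnetIds": ["subnet-0123456789abcdef0"],
    "SecurityGroupIds": ["sg-0123456789abcdef0"],
    # Resolve the default API endpoint name to private IPs inside the VPC
    "PrivateDnsEnabled": True,
}

# With valid credentials, the endpoint would be created like this:
# import boto3
# boto3.client("ec2").create_vpc_endpoint(**endpoint_params)
```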
In some scenarios, the application that is communicating with Amazon Transcribe may be deployed in an on-premises data center. There may be additional security or compliance requirements that mandate that data exchanged with Amazon Transcribe must not transit public networks such as the internet. In this case, private connectivity via AWS Direct Connect can be used. The following diagram shows an architecture that allows an on-premises application to communicate with Amazon Transcribe without any connectivity to the internet.
Best practice 3 – Redact sensitive data if needed
Some use cases and regulatory environments may require the removal of sensitive data from transcripts and audio files. Amazon Transcribe supports identifying and redacting personally identifiable information (PII) such as names, addresses, Social Security numbers, and so on. This capability can help customers achieve payment card industry (PCI) compliance by redacting PII such as credit and debit card numbers, expiration dates, and three-digit card verification codes (CVV). In redacted transcripts, PII is replaced with placeholders in square brackets indicating the type of PII that was redacted. Streaming transcriptions additionally support identifying and labeling PII without redacting it. The types of PII redacted by Amazon Transcribe differ between batch and streaming transcriptions. Refer to Redacting PII in your batch job and Redacting or identifying PII in a real-time stream for more details.
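Redaction is enabled per batch job through the ContentRedaction parameter of StartTranscriptionJob; the sketch below restricts redaction to the card-related PII entity types mentioned above (the bucket and job name are hypothetical placeholders).

```python
# Batch transcription job with PII redaction enabled. The bucket and
# job name below are hypothetical placeholders.
params = {
    "TranscriptionJobName": "example-redaction-job",
    "Media": {"MediaFileUri": "s3://my-audio-bucket/calls/example-call.wav"},
    "LanguageCode": "en-US",
    "ContentRedaction": {
        "RedactionType": "PII",
        # "redacted_and_unredacted" also keeps an unredacted copy
        "RedactionOutput": "redacted",
        # Redact only card data; use ["ALL"] to redact every PII type
        "PiiEntityTypes": [
            "CREDIT_DEBIT_NUMBER",
            "CREDIT_DEBIT_EXPIRY",
            "CREDIT_DEBIT_CVV",
        ],
    },
}
# With valid credentials:
# import boto3
# boto3.client("transcribe").start_transcription_job(**params)
```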
The specialized Amazon Transcribe Call Analytics APIs have a built-in capability to redact PII in both text transcripts and audio files. This API uses specialized speech-to-text and natural language processing (NLP) models trained specifically to understand customer service and sales calls. For other use cases, you can use this solution to redact PII from audio files with Amazon Transcribe.
Additional Amazon Transcribe security best practices
Best practice 4 – Use IAM roles for applications and AWS services that require Amazon Transcribe access. When you use a role, you don’t have to distribute long-term credentials, such as passwords or access keys, to an EC2 instance or AWS service. IAM roles can supply temporary permissions that applications can use when they make requests to AWS resources.
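As an illustrative sketch, the two policy documents below define such a role: a trust policy that lets EC2 instances assume it, and a permissions policy scoped to only the Transcribe actions the application needs. The action list is a hypothetical example; grant only what your application requires.

```python
import json

# Trust policy: allows EC2 instances to assume the role and receive
# temporary credentials (no long-term keys on the instance).
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Service": "ec2.amazonaws.com"},
        "Action": "sts:AssumeRole",
    }],
}

# Permissions policy: least-privilege example granting only the
# Transcribe actions this hypothetical application needs.
permissions_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": [
            "transcribe:StartTranscriptionJob",
            "transcribe:GetTranscriptionJob",
        ],
        "Resource": "*",
    }],
}

# Both documents are passed to IAM as JSON, e.g.:
# iam.create_role(RoleName="...", AssumeRolePolicyDocument=json.dumps(trust_policy))
print(json.dumps(trust_policy, indent=2))
```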
Best practice 5 – Use tag-based access control. You can use tags to control access within your AWS accounts. In Amazon Transcribe, tags can be added to transcription jobs, custom vocabularies, custom vocabulary filters, and custom language models.
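For example, an IAM policy can grant access only to Transcribe resources carrying a particular tag via the aws:ResourceTag condition key; the tag key and value below are hypothetical.

```python
# IAM policy allowing access only to Transcribe resources tagged
# Department=Support (a hypothetical tag key/value pair).
tag_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": "transcribe:GetTranscriptionJob",
        "Resource": "*",
        "Condition": {
            "StringEquals": {"aws:ResourceTag/Department": "Support"}
        },
    }],
}

# The matching tag is attached when the job is created, e.g.:
# start_transcription_job(..., Tags=[{"Key": "Department", "Value": "Support"}])
```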
Best practice 6 – Use AWS monitoring tools. Monitoring is an important part of maintaining the reliability, security, availability, and performance of Amazon Transcribe and your AWS solutions. You can monitor Amazon Transcribe using AWS CloudTrail and Amazon CloudWatch.
Best practice 7 – Enable AWS Config. AWS Config enables you to assess, audit, and evaluate the configurations of your AWS resources. Using AWS Config, you can review changes in configurations and relationships between AWS resources, investigate detailed resource configuration histories, and determine your overall compliance against the configurations specified in your internal guidelines. This can help you simplify compliance auditing, security analysis, change management, and operational troubleshooting.
Compliance validation for Amazon Transcribe
Applications that you build on AWS may be subject to compliance programs, such as SOC, PCI, FedRAMP, and HIPAA. AWS uses third-party auditors to evaluate its services for compliance with various programs. AWS Artifact allows you to download third-party audit reports.
To find out if an AWS service is within the scope of specific compliance programs, refer to AWS Services in Scope by Compliance Program. For additional information and resources that AWS provides to help customers with compliance, refer to Compliance validation for Amazon Transcribe and AWS compliance resources.
Conclusion
In this post, you learned about the security mechanisms, best practices, and architectural patterns available for building secure applications with Amazon Transcribe. You can protect sensitive data both in transit and at rest with strong encryption. PII redaction lets you remove personal information from transcripts when you do not want to process or store it. VPC endpoints and Direct Connect let you establish private connectivity between your application and the Amazon Transcribe service. We also provided references to help you validate the compliance of applications that use Amazon Transcribe with programs such as SOC, PCI, FedRAMP, and HIPAA.
As next steps, check out Getting started with Amazon Transcribe to quickly start using the service. Refer to Amazon Transcribe documentation to dive deeper into the service details. And follow Amazon Transcribe on the AWS Machine Learning Blog to keep up to date with new capabilities and use cases for Amazon Transcribe.
About the Author
Alex Bulatkin is a Solutions Architect at AWS. He enjoys helping communication service providers build innovative solutions in AWS that are redefining the telecom industry. He is passionate about working with customers on bringing the power of AWS AI services into their applications. Alex is based in the Denver metropolitan area and likes to hike, ski, and snowboard.