Improving inclusion and accessibility through automated document translation with an open source app using Amazon Translate

Organizations often offer support in multiple languages, saying “contact us for translations.” However, customers who don’t speak the predominant language often don’t know that translations are available or how to request them. This can lead to poor customer experience and lost business. A better approach is proactively providing information in multiple languages so customers can access it directly. This leads to more informed, satisfied, and included customers.

In this post, we share how we identified these challenges and overcame them through our work with Swindon Borough Council. To address them, we developed the Document Translation app, a self-serve translation portal for business users built on Amazon Translate. The app was created in partnership with Swindon Borough Council and is released as open source code, freely available for your organization to use.

Translation challenges

We identified three key challenges:

  • Accuracy and quality
  • Cost to translate
  • Time to translate

Accuracy and quality

Translation accuracy and quality are critical: a translation is only useful if readers can trust and understand it. As quoted in the Swindon Borough Council case study:

“The council ran small-scale trials with the main digital translation providers that can support the different languages spoken by Swindon’s citizens. It recruited local bilingual volunteers to assess the quality of the machine translations against their first languages, and Amazon Translate came out on top.”

The Document Translation app performs translations with Amazon Translate, which provides contextual, accurate, and fluent document translations. It supports many languages and dialects, giving broad coverage for customers worldwide. The app workflow also applies the custom terminology feature of Amazon Translate dynamically whenever matching custom terminology is available for a language.
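To illustrate how a workflow step might invoke this, the following sketch starts an asynchronous batch translation job with custom terminology applied, using the AWS SDK for JavaScript v3. The bucket prefixes, role ARN, and terminology handling are hypothetical placeholders for illustration; the open source app’s actual workflow may be structured differently.

import { randomUUID } from "node:crypto";
import { TranslateClient, StartTextTranslationJobCommand } from "@aws-sdk/client-translate";

const translate = new TranslateClient({ region: "eu-west-2" });

// Start an asynchronous batch translation job for documents stored in Amazon S3.
// TerminologyNames is only set when custom terminology exists for the target language.
async function startDocumentTranslation(targetLanguage: string, terminologyName?: string) {
  const command = new StartTextTranslationJobCommand({
    JobName: `doc-translation-${Date.now()}`,
    ClientToken: randomUUID(),
    SourceLanguageCode: "en",
    TargetLanguageCodes: [targetLanguage],
    InputDataConfig: {
      S3Uri: "s3://example-upload-bucket/uploads/", // hypothetical input prefix
      ContentType: "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    },
    OutputDataConfig: {
      S3Uri: "s3://example-output-bucket/translations/", // hypothetical output prefix
    },
    DataAccessRoleArn: "arn:aws:iam::111122223333:role/TranslateDataAccessRole", // hypothetical role
    TerminologyNames: terminologyName ? [terminologyName] : undefined,
  });
  const response = await translate.send(command);
  return response.JobId;
}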

Cost to translate

High costs of manual translation can prohibit organizations from supporting multiple languages, straining already tight budgets. Balancing language inclusivity and budget limitations poses a significant challenge when relying solely on traditional translation methods.

Swindon Borough Council paid around £159.81 ($194.32 USD) per single-page document, limiting them to providing translation only where legally required. As discussed in the case study, Swindon Borough Council slashed 99.96% of translation costs using Amazon Translate:

“Such dramatic savings mean that it’s no longer limited to translating only documents it is legally required to provide—it can offer citizens wider access to content for minimal extra cost.”

Customers report third-party translation services fees as a major cost. The neural machine translation technology of Amazon Translate dramatically lowers these costs.

Following the Cost Optimization pillar of the AWS Well-Architected Framework, the app runs its AWS Lambda functions on AWS Graviton (ARM64) and uses the Amazon DynamoDB Standard-Infrequent Access table class. With no servers to manage and no continually running systems, this keeps ongoing costs low.
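As an illustration, the following AWS CDK (TypeScript) snippet shows how those cost choices can be expressed: the Lambda function runs on the ARM64 (Graviton) architecture and the DynamoDB table uses the Standard-Infrequent Access table class. The code path and table design are assumptions for this sketch, not taken from the app’s source.

import * as cdk from "aws-cdk-lib";
import * as lambda from "aws-cdk-lib/aws-lambda";
import * as dynamodb from "aws-cdk-lib/aws-dynamodb";
import { Construct } from "constructs";

export class CostOptimizedStack extends cdk.Stack {
  constructor(scope: Construct, id: string, props?: cdk.StackProps) {
    super(scope, id, props);

    // Run the function on AWS Graviton (ARM64) for better price-performance.
    new lambda.Function(this, "TranslationWorker", {
      runtime: lambda.Runtime.NODEJS_18_X,
      architecture: lambda.Architecture.ARM_64,
      handler: "index.handler",
      code: lambda.Code.fromAsset("lambda/translation-worker"), // hypothetical path
    });

    // Job metadata is read infrequently, so the cheaper Standard-IA table class fits.
    new dynamodb.Table(this, "JobsTable", {
      partitionKey: { name: "jobId", type: dynamodb.AttributeType.STRING },
      billingMode: dynamodb.BillingMode.PAY_PER_REQUEST,
      tableClass: dynamodb.TableClass.STANDARD_INFREQUENT_ACCESS,
    });
  }
}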

Time to translate

Manual translation is slow, and the delays that lower customer satisfaction come not only from the translation itself but also from the internal processes, approvals, and logistics arrangements put in place to control costs and protect sensitive and private content. Swindon Borough Council stated that turnaround times could take up to 17 days:

“First, it was slow. The internal process required manual inputs from many different people. On average, that process took up to 12 days, and the time required by the translation agency was 3–5 days. That meant total translation time for a document was up to 17 days.”

The app offers business users a self-serve portal for document translations: they can upload documents and download translations for sharing, without slow manual intervention. Amazon Translate can perform a translation in about 10 minutes.

Solution overview

The app’s business user portal is a browser-based UI that has been translated into all languages and dialects supported by Amazon Translate. The dynamic React UI doesn’t require server software. To accelerate development, UI components such as buttons and input boxes come from the Cloudscape Design System. For interacting with AWS services, the AWS Amplify JS library for React simplifies authentication, security, and API requests.
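As a rough illustration of that client pattern, the following sketch uses Amplify JS v6 APIs to sign a user in and then call a GraphQL operation against the backend API. The configuration file, query name, and fields are hypothetical and will differ from the app’s actual schema.

import { Amplify } from "aws-amplify";
import { signIn } from "aws-amplify/auth";
import { generateClient } from "aws-amplify/api";
import config from "./amplifyconfiguration.json"; // hypothetical generated config

Amplify.configure(config);
const client = generateClient();

// Hypothetical query listing the signed-in user's translation jobs.
const listJobs = /* GraphQL */ `
  query ListJobs {
    listJobs {
      items { id name jobStatus }
    }
  }
`;

export async function loadJobs(username: string, password: string) {
  await signIn({ username, password }); // Amplify Auth backed by Amazon Cognito
  return client.graphql({ query: listJobs }); // GraphQL request to AWS AppSync
}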

Fig.1 – Translating a document.
Fig.2 – Localized user interface.
Fig.3 – Client architecture overview.

The backend uses several serverless and event-driven AWS services, including AWS Step Functions for low-code workflows, AWS AppSync for a GraphQL API, and Amazon Translate. This architecture enables fast development and reduces ongoing management overhead, as shown in the following diagram.
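To give a flavor of the low-code workflow approach, the following AWS CDK (TypeScript) sketch defines a Step Functions task that calls Amazon Translate directly through the CallAwsService integration, so no custom Lambda code is needed for that step. The single-task state machine and its input fields are assumptions for illustration, not the app’s actual workflow definition.

import * as sfn from "aws-cdk-lib/aws-stepfunctions";
import * as tasks from "aws-cdk-lib/aws-stepfunctions-tasks";
import { Construct } from "constructs";

// A single task that starts a batch translation job without any custom Lambda code.
// In practice, the execution role also needs iam:PassRole on the Translate data access role.
export function buildTranslationWorkflow(scope: Construct): sfn.StateMachine {
  const startTranslation = new tasks.CallAwsService(scope, "StartTranslationJob", {
    service: "translate",
    action: "startTextTranslationJob",
    parameters: {
      "JobName.$": "$.jobName",
      "SourceLanguageCode.$": "$.sourceLanguage",
      "TargetLanguageCodes.$": "$.targetLanguages",
      "InputDataConfig.$": "$.inputDataConfig",
      "OutputDataConfig.$": "$.outputDataConfig",
      "DataAccessRoleArn.$": "$.dataAccessRoleArn",
    },
    iamResources: ["*"],
  });

  return new sfn.StateMachine(scope, "TranslationWorkflow", {
    definitionBody: sfn.DefinitionBody.fromChainable(startTranslation),
  });
}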

Fig.4 – Translation architecture overview.

The app is built with Infrastructure as Code (IaC) using the AWS Cloud Development Kit (AWS CDK). The AWS CDK is an open source software development framework used to model and provision cloud applications. Using the TypeScript CDK provides a reliable, repeatable, and extensible foundation for deployments, and pairing it with a consistent continuous integration and delivery (CI/CD) pipeline makes those deployments predictable. Reusable components are extracted into constructs and imported where needed, providing consistency and best practices such as AWS Identity and Access Management (IAM) roles, Amazon CloudWatch logging, and AWS X-Ray tracing for all Lambda functions.
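The construct pattern might look like the following sketch, which wraps a Lambda function so that X-Ray tracing and CloudWatch log retention are applied consistently wherever it is imported. The defaults shown are assumptions for illustration, not the app’s actual construct.

import * as lambda from "aws-cdk-lib/aws-lambda";
import * as logs from "aws-cdk-lib/aws-logs";
import { Construct } from "constructs";

export interface AppFunctionProps {
  handler: string;
  codePath: string;
}

// Reusable construct: every function gets the same observability defaults.
export class AppFunction extends Construct {
  public readonly fn: lambda.Function;

  constructor(scope: Construct, id: string, props: AppFunctionProps) {
    super(scope, id);

    this.fn = new lambda.Function(this, "Fn", {
      runtime: lambda.Runtime.NODEJS_18_X,
      architecture: lambda.Architecture.ARM_64,
      handler: props.handler,
      code: lambda.Code.fromAsset(props.codePath),
      tracing: lambda.Tracing.ACTIVE,             // AWS X-Ray tracing
      logRetention: logs.RetentionDays.ONE_MONTH, // Amazon CloudWatch log retention
    });
  }
}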

Fig.5 – Continuous integration and continuous delivery pipeline overview.

App deployment

The app is effortless to deploy using the AWS CDK, which allows the entire stack, including frontend React code, backend functions and workflows, and cloud infrastructure definitions, to be modeled and packaged together.

Before deployment, review the prerequisites you may want to use, such as connecting the app to your organization’s single sign-on through your SAML provider.

The installation wizard provides the necessary commands, and AWS CloudShell allows you to run them without installing anything locally. The app documentation covers all the advanced options available. Installation takes 30–60 minutes and can be monitored from AWS CodePipeline.

Fig.6 – Installation wizard.

A self-paced Immersion Day is available for your technical teams to get hands-on experience with the services and build core components. Alternatively, your AWS account team can provide personalized guidance through the workshop.

Additional feature: Simply Readable

This app is designed with multiple features (as of this writing, Document Translation and Simply Readable). Simply Readable enables you to create Easy Read documents with generative artificial intelligence (AI) using Amazon Bedrock. The app can be installed with or without this feature.

Conclusion

The Document Translation app provides translations in your customers’ native languages. Amazon Translate enables accurate translation at scale. Communicating in customers’ languages shows respect, improves understanding, and builds trust.

Translation capabilities should be core to any growth strategy, building loyalty and revenue through superior localized experiences.

Business leaders should evaluate solutions like Amazon Translate to overcome language barriers and share their brand. Enabling multilingual communication conveys “We value you, we hear you, and we want your experience with us to be positive.”

To learn more about the app, see the FAQ.


About the Author

Philip Whiteside is a Solutions Architect (SA) at Amazon Web Services. Philip is passionate about overcoming barriers by utilizing technology.

Read More

Automate chatbot for document and data retrieval using Agents and Knowledge Bases for Amazon Bedrock

Numerous customers face challenges in managing diverse data sources and seek a chatbot solution capable of orchestrating these sources to offer comprehensive answers. This post presents a solution for developing a chatbot capable of answering queries from both documentation and databases, with straightforward deployment.

Amazon Bedrock is a fully managed service that offers a choice of high-performing foundation models (FMs) from leading AI companies like AI21 Labs, Anthropic, Cohere, Meta, Stability AI, and Amazon through a single API, along with a broad set of capabilities to build generative AI applications with security, privacy, and responsible AI. For documentation retrieval, Retrieval Augmented Generation (RAG) stands out as a key tool. It allows you to retrieve data from sources beyond the foundation model, enhancing prompts by integrating contextually relevant retrieved data. You can use prompt engineering to prevent hallucination and make sure that the answer is grounded in the source documentation. To retrieve data from a database, you can use the FMs offered by Amazon Bedrock to convert text into SQL queries with specified constraints. This process enables the extraction of data from Amazon Athena tables, effectively addressing data-related inquiries.

For handling more intricate queries, achieving comprehensive answers demands information sourced from both documentation and databases. Agents for Amazon Bedrock is a generative AI tool offered through Amazon Bedrock that enables generative AI applications to execute multistep tasks across company systems and data sources. This integration allows for the synthesis of combined information, resulting in detailed and exhaustive answers.

This post demonstrates how to build a chatbot using Amazon Bedrock including Agents for Amazon Bedrock and Knowledge Bases for Amazon Bedrock, within an automated solution. The code used in this solution is available in the GitHub repo.

Solution overview

In this post, we use publicly available data, encompassing both unstructured and structured formats, to showcase our entirely automated chatbot system. Our unstructured data comes from the Amazon EC2 User Guide for Linux Instances and Amazon EC2 Instance Types documentation, and the structured data is derived from the EC2 Instance On-Demand Pricing for the US East (N. Virginia) AWS Region.

The following diagram illustrates the solution architecture.

The diagram details a comprehensive AWS Cloud-based setup within a specific Region, using multiple AWS services. The primary interface for the chatbot is a Streamlit application hosted on an Amazon Elastic Container Service (Amazon ECS) cluster, with accessibility managed by an Application Load Balancer. Queries made through this interface activate the AWS Lambda Invocation function, which interfaces with an agent. This agent responds to user inquiries by either consulting the knowledge base or by invoking an Agent Executor Lambda function. This function invokes a set of actions associated with the agent, following a predefined API schema. The knowledge base uses an Amazon OpenSearch Serverless index as its vector database foundation. Additionally, the Agent Executor function generates SQL queries that are run against the AWS Glue database through Athena.
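As an illustration of how the invocation Lambda function could call the agent, the following sketch uses the Bedrock Agent Runtime client from the AWS SDK for JavaScript v3 and collects the streamed response. The agent and alias IDs are placeholders, and the actual function in the repository may be structured differently.

import {
  BedrockAgentRuntimeClient,
  InvokeAgentCommand,
} from "@aws-sdk/client-bedrock-agent-runtime";

const client = new BedrockAgentRuntimeClient({ region: "us-east-1" });

// Send a user question to the agent and collect the streamed response text.
export async function askAgent(question: string, sessionId: string): Promise<string> {
  const response = await client.send(
    new InvokeAgentCommand({
      agentId: "AGENT_ID_PLACEHOLDER",         // hypothetical agent ID
      agentAliasId: "AGENT_ALIAS_PLACEHOLDER", // hypothetical alias ID
      sessionId,
      inputText: question,
    })
  );

  let answer = "";
  const decoder = new TextDecoder();
  for await (const event of response.completion ?? []) {
    if (event.chunk?.bytes) {
      answer += decoder.decode(event.chunk.bytes);
    }
  }
  return answer;
}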

Deploy the solution with the AWS CDK

The AWS Cloud Development Kit (AWS CDK) is an open source software development framework for defining cloud infrastructure in code and provisioning it through AWS CloudFormation. Our AWS CDK stack deploys resources from several AWS services.

Refer to the instructions provided in the README.md file for deploying the solution using the AWS CDK. After you have completed all the necessary setup, you can deploy the stack with the following command:

cdk deploy

Amazon Bedrock features

Amazon Bedrock is a fully managed service that offers a choice of high-performing FMs from leading AI companies like AI21 Labs, Anthropic, Cohere, Meta, Mistral AI, Stability AI, and Amazon through a single API, along with a broad set of capabilities you need to build generative AI applications with security, privacy, and responsible AI.

In this section, we discuss some of the key Amazon Bedrock features and resources that we use in our solution.

Agents for Amazon Bedrock

Agents for Amazon Bedrock enables you to build and configure autonomous agents in your application. An agent helps your end-users complete actions based on organization data and user input. Agents orchestrate interactions between FMs, data sources, software applications, and user conversations.

For example, you can ask a question such as, “Which EC2 instance is recommended for machine learning model inference and how much does it cost?” The agent can help break down complex tasks into smaller tasks:

“To answer this question, I will:

1. Call the GET::ChatBotBedrockAgentActionGroup::getUc2Response function to get the latest EC2 instance type and pricing details for machine learning inference.

2. Call the GET::x_amz_knowledgebase_WSGY5QH4ZD::Search function to get recommendations on which EC2 instance type is best for machine learning inference.

3. Return the recommended EC2 instance type and the pricing details to the user.

I have checked that I have access to the GET::ChatBotBedrockAgentActionGroup::getUc2Response and GET::x_amz_knowledgebase_WSGY5QH4ZD::Search functions.”

Then the agent will gather information through these three steps and integrate them into a final answer:

“The AWS Inferentia and Trainium instances are well-suited for machine learning model inference workloads. The Inf1 and Trn1 instances deliver high performance inference using dedicated ML chips like Inferentia and Trainium at lower costs compared to GPU-based instances. According to the latest information, AWS Inferentia and Trainium instance prices range from $0.228 per hour for an inf1.xlarge instance to $24.78 per hour for a trn1n.32xlarge instance. The cheapest Inferentia instance is the inf1.xlarge at $0.228 per hour, while the most expensive is the inf2.48xlarge at $12.9813 per hour. The cheapest Trainium instance is the trn1.2xlarge at $1.3438 per hour, while the most expensive is the trn1n.32xlarge at $24.78 per hour.”

Before you create your agent, you should set up action groups and knowledge bases that you want to add to your agent:

  • Action groups define the tasks that you want your agent to help customers carry out.
  • Knowledge bases provide a repository of information that the agent can query to answer customer queries and improve its generated responses. For more information, see Knowledge bases for Amazon Bedrock.

After you complete the AWS CDK deployment, you can verify your agent along with its corresponding knowledge base and action group by completing the following steps:

  1. On the Amazon Bedrock console, choose Agents in the navigation pane.
  2. Choose the name of your agent.
  3. Choose the working draft.

You can review the action group and knowledge base in the working draft.

Knowledge Bases for Amazon Bedrock

Knowledge Bases for Amazon Bedrock is a fully managed capability that helps you implement the entire RAG workflow (managed RAG), from ingestion to retrieval and prompt augmentation, without having to build custom integrations to data sources and manage data flows. For this post, we created a knowledge base for Amazon Bedrock using the AWS CDK; it’s based on the EC2 instance documentation stored in an S3 bucket.
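If you want to query the knowledge base directly, outside the agent, the Bedrock Agent Runtime client also exposes a retrieval API, roughly as sketched below. The knowledge base ID and result count are placeholders.

import {
  BedrockAgentRuntimeClient,
  RetrieveCommand,
} from "@aws-sdk/client-bedrock-agent-runtime";

const client = new BedrockAgentRuntimeClient({ region: "us-east-1" });

// Retrieve the most relevant passages from the knowledge base for a query.
export async function retrievePassages(query: string) {
  const response = await client.send(
    new RetrieveCommand({
      knowledgeBaseId: "KB_ID_PLACEHOLDER", // hypothetical knowledge base ID
      retrievalQuery: { text: query },
      retrievalConfiguration: {
        vectorSearchConfiguration: { numberOfResults: 5 },
      },
    })
  );
  return response.retrievalResults ?? [];
}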

Action groups

An action group consists of the following components that you set up:

  • An OpenAPI schema that defines the APIs that your action group should call. Your agent uses the API schema to determine the fields it needs to elicit from the customer to populate for the API request.
  • A Lambda function that defines the business logic for the action that your agent will carry out.

For each action group in an agent, you define a Lambda function to program the business logic for carrying out an action group and customize how you want the API response to be returned. You use the variables from the input event to define your functions and return a response to the agent. In our use case, we used Amazon Bedrock FMs to convert text into SQL queries with specified constraints, and then ran those queries against Athena tables to answer data-related questions.
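The action Lambda function in the repository is implemented differently, but the following TypeScript sketch shows the general shape of such a handler: it reads the action group details from the agent’s input event, runs a SQL query with Athena, and returns the result in the response format that Agents for Amazon Bedrock expects for API schema action groups. The database, table, output bucket, and hard-coded SQL are assumptions for illustration; a real handler would generate the SQL from the event parameters using an FM.

import {
  AthenaClient,
  StartQueryExecutionCommand,
  GetQueryExecutionCommand,
  GetQueryResultsCommand,
} from "@aws-sdk/client-athena";

const athena = new AthenaClient({});

// Run a SQL query in Athena and wait for the results (simplified polling).
async function runQuery(sql: string) {
  const start = await athena.send(
    new StartQueryExecutionCommand({
      QueryString: sql,
      QueryExecutionContext: { Database: "ec2_pricing_db" },                   // hypothetical database
      ResultConfiguration: { OutputLocation: "s3://example-athena-results/" }, // hypothetical bucket
    })
  );
  const id = start.QueryExecutionId!;
  while (true) {
    const status = await athena.send(new GetQueryExecutionCommand({ QueryExecutionId: id }));
    const state = status.QueryExecution?.Status?.State;
    if (state === "SUCCEEDED") break;
    if (state === "FAILED" || state === "CANCELLED") throw new Error(`Query ${state}`);
    await new Promise((resolve) => setTimeout(resolve, 1000));
  }
  return athena.send(new GetQueryResultsCommand({ QueryExecutionId: id }));
}

// Lambda handler invoked by the agent's action group.
export const handler = async (event: any) => {
  // A real handler would generate SQL from event.parameters using a foundation model;
  // here the query is hard-coded to keep the sketch short.
  const sql = "SELECT instance_type, price_per_hour FROM ec2_pricing LIMIT 10"; // hypothetical table
  const results = await runQuery(sql);

  return {
    messageVersion: "1.0",
    response: {
      actionGroup: event.actionGroup,
      apiPath: event.apiPath,
      httpMethod: event.httpMethod,
      httpStatusCode: 200,
      responseBody: {
        "application/json": { body: JSON.stringify(results.ResultSet?.Rows ?? []) },
      },
    },
  };
};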

The following screenshot shows an Athena table and sample query.

Sample questions and answers

After the AWS CDK deployment is complete, you can either test the agent on the Amazon Bedrock console or through the Streamlit app URL listed in the outputs of the chatbot stack on the AWS CloudFormation console, as shown in the following screenshot.

In the UI of the chatbot, you can view the source of the response. If the response comes from the knowledge base, you will see a link related to the documentation. If the response is sourced from the Amazon EC2 pricing table, you will see the SQL query text converted from the relevant table. The chatbot is also capable of answering questions that require information from both data sources. The following screenshots show some sample questions and answers with different data sources.

Each response from an Amazon Bedrock agent is accompanied by a trace that details the steps being orchestrated by the agent. The trace helps you follow the agent’s reasoning process that leads it to the response it gives at that point in the conversation.

When you show the trace in the test window in the console, a window appears showing a trace for each step in the reasoning process. You can view each step of the trace in real time as your agent performs orchestration. Each step can be one of the following traces:

  • PreProcessingTrace – Traces the input and output of the preprocessing step, in which the agent contextualizes and categorizes user input and determines if it is valid
  • OrchestrationTrace – Traces the input and output of the orchestration step, in which the agent interprets the input, invokes APIs and queries knowledge bases, and returns output to either continue orchestration or respond to the user
  • PostProcessingTrace – Traces the input and output of the postprocessing step, in which the agent handles the final output of the orchestration and determines how to return the response to the user
  • FailureTrace – Traces the reason that a step failed

Customizations for your own dataset

To integrate your custom data into the solution, follow the structured guidelines in this section and tailor them to your requirements. These steps are designed to provide a seamless and efficient integration process, enabling you to deploy the solution effectively with your own data.

Integrate knowledge base data

To prepare your data for integration, locate the assets/knowledgebase_data_source/ directory and place your dataset within this folder.

To make configuration adjustments, access the cdk.json file. Navigate to the context/config/paths/knowledgebase_file_name field and update it accordingly. Furthermore, modify the context/config/bedrock_instructions/knowledgebase_instruction field in the cdk.json file to accurately reflect the nuances and context of your new dataset.

Integrate structural data

To organize your structural data, within the assets/data_query_data_source/ directory, create a subdirectory (for example, tabular_data). Deposit your structured dataset (acceptable formats include CSV, JSON, ORC, and Parquet) into this newly created subfolder.

For configuration and code updates, make the following changes:

  • Update the cdk.json file’s context/config/paths/athena_table_data_prefix field to align with the new data path
  • Revise code/action-lambda/dynamic_examples.csv by incorporating new text to SQL examples that correspond with your dataset
  • Revise code/action-lambda/prompt_templates.py to mirror the attributes of your new tabular data
  • Modify the cdk.json file’s context/config/bedrock_instructions/action_group_description field to elucidate the purpose and functionality of the action Lambda function tailored for your dataset
  • Reflect the new functionalities of your action Lambda function in the assets/agent_api_schema/artifacts_schema.json file

General updates

In the cdk.json file, under the context/config/bedrock_instructions/agent_instruction section, provide a comprehensive description of the intended functionality and design purpose for your agents, taking into account the newly integrated data.

Clean up

To delete your resources when you’re finished using the solution and to avoid future costs, you can either delete the stack on the AWS CloudFormation console or run the following command in the terminal:

cdk destroy

Conclusion

In this post, we illustrated the process of using the AWS CDK to establish and oversee a set of AWS resources designed to construct a chatbot on Amazon Bedrock. If you’re interested in connecting to your data source and developing your own chatbot, you can begin exploring with Amazon Bedrock.


About the Authors

Jundong Qiao is a Machine Learning Engineer at AWS Professional Service, where he specializes in implementing and enhancing AI/ML capabilities across various sectors. His expertise encompasses building next-generation AI solutions, including chatbots and predictive models that drive efficiency and innovation. Prior to AWS, Jundong was an Engineering Manager in Machine Learning at ACV Auctions, where he led initiatives that leveraged AI/ML to address intricate issues within the automotive industry.

Kara Yang is a data scientist at AWS Professional Services, adept at leveraging cloud computing, machine learning, and Generative AI to tackle diverse industry challenges. Passionately dedicated to innovation, she consistently pursues new technologies, refines solutions, and delights in sharing her expertise through writing and presentations.

Kiowa Jackson is a Machine Learning Engineer at AWS ProServe, dedicated to helping customers leverage Generative AI for creating and deploying novel applications. He is passionate about placing the benefits of GenAI in the hands of users through real-world use cases.

Praveen Kumar Jeyarajan is a Principal DevOps Consultant at AWS, supporting Enterprise customers and their journey to the cloud. He has 13+ years of DevOps experience and is skilled in solving myriad technical challenges using the latest technologies. He holds a Masters degree in Software Engineering. Outside of work, he enjoys watching movies and playing tennis.

Shuai Cao is a Senior Data Science Manager focused on Generative AI at Amazon Web Services. He leads teams of data scientists, machine learning engineers, and application architects to deliver AI/ML solutions for customers. Outside of work, he enjoys composing and arranging music.

Read More

Build private and secure enterprise generative AI apps with Amazon Q Business and AWS IAM Identity Center

As of April 30, 2024, Amazon Q Business is generally available. Amazon Q Business is a conversational assistant powered by generative artificial intelligence (AI) that enhances workforce productivity by answering questions and completing tasks based on information in your enterprise systems. Your employees can access enterprise content securely and privately using web applications built with Amazon Q Business. The success of these applications depends on two key factors: first, that an end-user of the application is only able to see responses generated from documents they have been granted access to, and second, that each user’s conversation history is private, secure, and accessible only to the user.

Amazon Q Business operationalizes this by validating the identity of the user every time they access the application so that the application can use the end-user’s identity to restrict tasks and answers to documents that the user has access to. This outcome is achieved with a combination of AWS IAM Identity Center and Amazon Q Business. IAM Identity Center stores the user identity, is the authoritative source of identity information for Amazon Q Business applications, and validates the user’s identity when they access an Amazon Q Business application. You can configure IAM Identity Center to use your enterprise identity provider (IdP)—such as Okta or Microsoft Entra ID—as the identity source. Amazon Q Business makes sure that access control lists (ACLs) for enterprise documents being indexed are matched to the user identities provided by IAM Identity Center, and that these ACLs are honored every time the application calls Amazon Q Business APIs to respond to user queries.

In this post, we show how IAM Identity Center acts as a gateway that passes user identities created by your enterprise IdP, configured as the identity source, to Amazon Q Business, and how Amazon Q Business uses these identities to respond securely and confidentially to user queries. We use the example of a generative AI employee assistant built with Amazon Q Business, demonstrate how to set it up to only respond using enterprise content that each employee has permissions to access, and show how employees are able to converse securely and privately with this assistant.

Solution overview

The following diagram shows a high-level architecture of how the enterprise IdP, IAM Identity Center instance, and Amazon Q Business application interact with each other to enable an authenticated user to securely and privately interact with an Amazon Q Business application using an Amazon Q Business web experience from their web browser.

When using an external IdP such as Okta, users and groups are first provisioned in the IdP and then automatically synchronized with the IAM Identity Center instance using the SCIM protocol. When a user starts the Amazon Q Business web experience, they are authenticated with their IdP using single sign-on, and the tokens obtained from the IdP are used by Amazon Q Business to validate the user with IAM Identity Center. After validation, a chat session is started with the user.

The sample use case in this post uses an IAM Identity Center account instance with its identity source configured as Okta, which is used as the IdP. Then we ingest content from Atlassian Confluence. The Amazon Q Business built-in connector for Confluence ingests the local users and groups configured in Confluence, as well as ACLs for the spaces and documents, to the Amazon Q Business application index. These users from the data source are matched with the users configured in the IAM Identity Center instance, and aliases are created in Amazon Q Business User Store for correct ACL enforcement.

Prerequisites

To implement this solution for the sample use case of this post, you need an IAM Identity Center instance and the Okta identity provider configured as its identity source. We provide more information about these resources in this section.

IAM Identity Center instance

An Amazon Q Business application requires an IAM Identity Center instance to be associated with it. There are two types of IAM Identity Center instances: an organization instance and an account instance. Amazon Q Business applications can work with either type of instance. These instances store the user identities that are created by an IdP, as well as the groups to which the users belong.

For production use cases, an IAM Identity Center organization instance is recommended. The advantage of an organization instance is that it can be used by an Amazon Q Business application in any AWS account in AWS Organizations, and if you have multiple Amazon Q Business applications spread across several AWS accounts, you pay only once for each user in your company. Many AWS enterprise customers use Organizations, and have IAM Identity Center organization instances associated with them.

For proof of concept and departmental use cases, or in situations when an AWS account is not part of an AWS Organization and you don’t want to create a new AWS organization, you can use an IAM Identity Center account instance to enable an Amazon Q Business application. In this case, only the Amazon Q Business application configured in the AWS account in which the account instance is created will be able to use that instance.

Amazon Q Business implements a per-user subscription fee. A user is billed only one time if they are uniquely identifiable across different accounts and different Amazon Q Business applications. For example, if multiple Amazon Q Business applications are within a single AWS account, a user that is uniquely identified by an IAM Identity Center instance tied to this account will only be billed one time for using these applications. If your organization has two accounts, and you have an organization-level IAM Identity Center instance, a user who is uniquely identified in the organization-level instance will be billed only one time even though they access applications in both accounts. However, if you have two account-level IAM Identity Center instances, a user in one account can’t be identified as the same user in another account because there is no central identity. This means that the same user will be billed twice. We therefore recommend using organization-level IAM Identity Center instances for production use cases to optimize costs.

In both these cases, the Amazon Q Business application needs to be in the same AWS Region as the IAM Identity Center instance.

Identity source

If you already use an IdP such as Okta or Entra ID, you can continue to use your preferred IdP with Amazon Q Business applications. In this case, the IAM Identity Center instance is configured to use the IdP as its identity source. The users and user groups from the IdP can be automatically synced to the IAM Identity Center instance using SCIM. Many AWS enterprise customers already have this configured for their IAM Identity Center organization instance. For more information about all the supported IdPs, see Getting started tutorials. The process is similar for IAM Identity Center organization instances and account instances.

AWS IAM Identity Center instance configured with Okta as the identity source

The following screenshot shows the IAM Identity Center application configured in Okta, and the users and groups from the Okta configuration assigned to this application.

The following screenshot shows the IAM Identity Center instance user store after configuring Okta as the identity source. Here the user and group information is automatically provisioned (synchronized) from Okta into IAM Identity Center using the System for Cross-domain Identity Management (SCIM) v2.0 protocol.

Configure an Amazon Q Business application with IAM Identity Center enabled

Complete the following steps to create an Amazon Q Business application and enable IAM Identity Center:

  1. On the Amazon Q Business console, choose Create application.
  2. For Application name, enter a name.
  3. Unless you need to change the AWS Identity and Access Management (IAM) role for the application or customize encryption settings, keep the default settings.
  4. Choose Create.
  5. On the Select retriever page, unless you want to configure a preexisting Amazon Kendra index as a retriever, or you need to configure storage units for more than 20,000 documents, you can continue with the default settings.
  6. Choose Next.

For more information about Amazon Q Business retrievers, refer to Creating and selecting a retriever for an Amazon Q Business application.

  1. On the Connect data sources page, for Data sources, choose Confluence.

The following instructions demonstrate how to configure the Confluence data source. These may differ for other data sources.

  1. For Data source name, enter a name.
  2. For Source, select Confluence Cloud.
  3. For Confluence URL, enter the Confluence URL.
  4. For Authentication, select Basic authentication.
  5. For AWS Secrets Manager secret, choose an AWS Secrets Manager secret.
  6. For Virtual Private Cloud, choose No VPC.
  7. For IAM role, choose Create a new service role.
  8. For Role name, either go with the provided name or edit it for your new role.
  9. For Sync scope, select the contents to sync.
  10. For Sync mode, select Full sync.
  11. For Frequency, choose Run on demand.
  12. For Field mappings, leave the defaults.
  13. Choose Add data source.
  14. Choose Next.
  15. On the Add groups and users page, choose Add groups and users.
  16. In the pop-up window, choose Get started.
  17. Search for users based on their display name or groups, then choose the user or group you want to add to the application.
  18. Add more users as needed.
  19. Choose Assign.
  20. You will see the following screen:
  21. Choose a subscription for each user by opening the Choose subscription dropdown and selecting the check mark.
  22. After choosing subscriptions for all the users, your screen will look like the following. Unless you want to change the service role, choose Create application.

After the application is created, you will see the application settings page, as shown in the following screenshot.

Employee AI assistant use case

To illustrate how you can build a secure and private generative AI assistant for your employees using Amazon Q Business applications, let’s take a sample use case of an employee AI assistant in an enterprise corporation. Two new employees, Mateo Jackson and Mary Major, have joined the company on two different projects, and have finished their employee orientation. They have been given corporate laptops, and their accounts are provisioned in the corporate IdP. They have been told to get help from the employee AI assistant for any questions related to their new team member activities and their benefits.

The company uses Confluence to manage their enterprise content. The sample Amazon Q application used to run the scenarios for this post is configured with a data source using the built-in connector for Confluence to index the enterprise Confluence spaces used by employees. The example uses three Confluence spaces: AnyOrgApp Project, ACME Project Space, and AJ-DEMO-HR-SPACE. The access permissions for these spaces are as follows:

  • AJ-DEMO-HR-SPACE – All employees, including Mateo and Mary
  • AnyOrgApp Project – Employees assigned to the project including Mateo
  • ACME Project Space – Employees assigned to the project including Mary

Let’s look at how Mateo and Mary experience their employee AI assistant.

Both are provided with the URL of the employee AI assistant web experience. They use the URL and sign in to the IdP from the browsers of their laptops. Mateo and Mary both want to know about their new team member activities and their fellow team members. They ask the same questions to the employee AI assistant but get different responses, because each has access to separate projects. In the following screenshots, the browser window on the left is for Mateo Jackson and the one on the right is for Mary Major. Mateo gets information about the AnyOrgApp project and Mary gets information about the ACME project.

Mateo chooses Sources under the question about team members to take a closer look at the team member information, and Mary chooses Sources under the question about new team member onboarding activities. The following screenshots show their updated views.

Mateo and Mary want to find out more about the benefits their new job offers and how the benefits are applicable to their personal and family situations.

The following screenshot shows that Mary asks the employee AI assistant questions about her benefits and eligibility.

Mary can also refer to the source documents.

The following screenshot shows that Mateo asks the employee AI assistant different questions about his eligibility.

Mateo looks at the following source documents.

Both Mary and Mateo first want to know their eligibility for benefits. But after that, they have different questions to ask. Even though the benefits-related documents are accessible by both Mary and Mateo, their conversations with employee AI assistant are private and personal. The assurance that their conversation history is private and can’t be seen by any other user is critical for the success of a generative AI employee productivity assistant.

Clean up

If you created a new Amazon Q Business application to try out the integration with IAM Identity Center and don’t plan to use it further, unsubscribe and remove the assigned users from the application, and then delete it so that your AWS account does not accumulate costs.

To unsubscribe and remove users, go to the application details page and choose Manage access and subscriptions.

Select all the users, then use the Edit button to choose Unsubscribe and remove, as shown below.

After removing the users, delete the application by going back to the application details page and choosing Delete.

Conclusion

For enterprise generative AI assistants such as the one shown in this post to be successful, they must respect access control as well as assure the privacy and confidentiality of every employee. Amazon Q Business and IAM Identity Center provide a solution that authenticates each user and validates the user identity at each step to enforce access control along with privacy and confidentiality.

To achieve this, IAM Identity Center acts as a gateway to sync user and group identities from an IdP (such as Okta), and Amazon Q Business uses IAM Identity Center-provided identities to uniquely identify a user of an Amazon Q Business application (in this case, an employee AI assistant). Document ACLs and local users set up in the data source (such as Confluence) are matched up with the user and group identities provided by IAM Identity Center. At query time, Amazon Q Business answers questions from users using only the documents that the document ACLs grant them access to.

If you want to know more, take a look at the Amazon Q Business launch blog post on AWS News Blog, and refer to Amazon Q Business User Guide. For more information on IAM Identity Center, refer to the AWS IAM Identity Center User Guide.


About the Authors

Abhinav Jawadekar is a Principal Solutions Architect in the Amazon Q Business service team at AWS. Abhinav works with AWS customers and partners to help them build generative AI solutions on AWS.

Venky Nagapudi is a Senior Manager of Product Management for Q Business, Amazon Comprehend and Amazon Translate. His focus areas on Q Business include user identity management, and using offline intelligence from documents to improve Q Business accuracy and helpfulness.

Read More

Enhance customer service efficiency with AI-powered summarization using Amazon Transcribe Call Analytics

In the fast-paced world of customer service, efficiency and accuracy are paramount. After each call, contact center agents often spend up to a third of the total call time summarizing the customer conversation. Additionally, manual summarization can lead to inconsistencies in the style and level of detail due to varying interpretations of note-taking guidelines. This post-contact work not only adds to customer wait times, but can also pressure some agents into skipping note-taking altogether. Supervisors also spend a considerable amount of time listening to call recordings or reading transcripts to understand the gist of a customer conversation when investigating customer issues or evaluating an agent’s performance. This can make it challenging to scale quality management within the contact center.

To address these issues, we launched a generative artificial intelligence (AI) call summarization feature in Amazon Transcribe Call Analytics. Transcribe Call Analytics is a generative AI-powered API for generating highly accurate call transcripts and extracting conversation insights to improve customer experience, agent productivity, and supervisor productivity. Powered by Amazon Bedrock, a fully managed service that offers a choice of high-performing foundation models (FMs) through a single API, generative call summarization in Transcribe Call Analytics produces call summaries that reduce the time agents spend capturing and summarizing notes after each conversation. This reduces customer wait times and improves agent productivity. Generative call summarization also provides supervisors with quick insight into a conversation without the need to listen to the entire call recording or read the entire transcript.

As Praphul Kumar, Chief Product Officer at SuccessKPI, noted,

“Generative call summarization in the Amazon Transcribe Call Analytics API has enabled us to add generative AI capabilities to our platform faster. With this feature, we are able to improve productivity in our customer’s contact center by automatically summarizing calls and removing the need for agents to write after call notes. We are looking forward to bringing this valuable capability into the hands of many more large enterprises.”

We previously published Use generative AI to increase agent productivity through automated call summarization. This new generative call summarization feature automatically integrates with multiple services and handles necessary configurations, making it simple and seamless to start using and realizing the benefits. You don’t need to manually integrate with services or perform additional configurations. Simply turn the feature on from the Amazon Transcribe console or using the start_call_analytics_job API. You can also use generative call summarization through Amazon Transcribe Post Call Analytics Solution for post-call summaries.
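For reference, the following sketch shows how starting a job with the feature enabled might look using the AWS SDK for JavaScript v3, assuming the Summarization setting exposed by the call analytics API. The S3 locations are hypothetical placeholders; the job name and role name mirror the console walkthrough later in this post.

import {
  TranscribeClient,
  StartCallAnalyticsJobCommand,
} from "@aws-sdk/client-transcribe";

const transcribe = new TranscribeClient({ region: "us-east-1" });

// Start a post-call analytics job with generative call summarization enabled.
export async function startSummaryJob() {
  await transcribe.send(
    new StartCallAnalyticsJobCommand({
      CallAnalyticsJobName: "summarysample",
      Media: { MediaFileUri: "s3://example-bucket/calls/sample-call.wav" },  // hypothetical object
      OutputLocation: "s3://example-bucket/analytics-output/",               // hypothetical prefix
      DataAccessRoleArn: "arn:aws:iam::111122223333:role/summarysamplerole", // role from the walkthrough
      ChannelDefinitions: [
        { ChannelId: 0, ParticipantRole: "AGENT" },
        { ChannelId: 1, ParticipantRole: "CUSTOMER" },
      ],
      Settings: {
        Summarization: { GenerateAbstractiveSummary: true },
      },
    })
  );
}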

In this post, we show you how to use the new generative call summarization feature.

Solution overview

The following diagram illustrates the solution architecture.

You can upload a call recording to Amazon S3 and start a Transcribe Call Analytics job. The summary is generated and uploaded back to Amazon S3 along with the transcript and analytics as a single JSON file.

We show you how to use the generative call summarization feature with a call sample inquiring about a used car through the following high-level steps:

  1. Create a new Post Call Analytics job and turn on the generative call summarization feature.
  2. Review the generative call summarization results.

Prerequisites

To get started, upload your recorded file or the sample file provided to an Amazon Simple Storage Service (Amazon S3) bucket.

Create a new Post call analytics job

Complete the following steps to create a new Post call analytics job:

  1. On the Amazon Transcribe console, choose Post-call Analytics in the navigation pane under Amazon Transcribe Call Analytics.
  2. Choose Create job.
  3. For Name, enter summarysample.
  4. In the Language settings and Model type sections, leave the default settings.
  5. For Input file location on S3, browse to the S3 bucket containing the uploaded audio file and choose Choose.
  6. In the Output data section, leave as default.
  7. Create a new AWS Identity and Access Management (IAM) role named summarysamplerole that provides Amazon Transcribe service permissions to read the audio files from the S3 bucket.
  8. In the Role permissions details section, leave as default and choose Next.
  9. Toggle Generative call summarization on and choose Create job.

Review the transcription and summary

When the status of the job is Complete, you can review the transcription and summary by choosing the job name summarysample. The Text tab shows the Agent and Customer sentences clearly separated.

The Generative call summarization tab provides a concise summary of the call.

Choose Download transcript for the JSON output containing the transcript and summary.

Conclusion

The world of customer service is constantly evolving, and organizations must adapt to meet the growing demands of their clients. Amazon Transcribe Call Analytics introduces an innovative solution to streamline the post-call process and enhance productivity. With generative call summarization, contact center agents can devote more time to engage with customers, and supervisors can gain insights quickly without extensive call reviews. This feature improves efficiency and empowers enterprises to scale their quality management efforts, enabling them to deliver exceptional customer experiences.

Generative call summarization in Amazon Transcribe Call Analytics is generally available today in English in US East (N. Virginia) and US West (Oregon). We invite you to share your thoughts and questions in the comments section.

Learn more:


About the Authors

Ami Dani is a Senior Technical Program Manager at AWS focusing on AI/ML services. During her career, she has focused on delivering transformative software development projects for the federal government and large companies in industries as diverse as advertising, entertainment, and finance. Ami has experience driving business growth, implementing innovative training programs and successfully managing complex, high-impact projects. She is a strategic problem-solver and collaborative partner, consistently delivering results that exceed expectations.

Gopikrishnan Anilkumar is a Senior Technical Product Manager on the Amazon Transcribe team. He has 10 years of product management experience across a variety of domains and is passionate about AI/ML. Outside of work, Gopikrishnan loves to travel and enjoys playing cricket.

Read More

Accelerate software development and leverage your business data with generative AI assistance from Amazon Q

We believe generative artificial intelligence (AI) has the potential to transform virtually every customer experience. To make this possible, we’re rapidly innovating to provide the most comprehensive set of capabilities across the three layers of the generative AI stack. This includes the bottom layer with infrastructure to train Large Language Models (LLMs) and other Foundation Models (FMs) and produce inferences or predictions, the middle layer with tools to easily and rapidly build generative AI applications, and the top layer, where we’re investing in game-changing applications. While all of these layers are important for the advancement of generative AI, I’m excited today to share more on our investments in the top application layer.

With the assistance of generative AI, everyone from developers and business analysts to employees in specialized areas like customer service or supply chain operations, can be more productive, creative, and data-driven than ever before. But for generative AI apps and assistants to be truly useful at work, they must know an organization’s data, their customers, their operations, and their business. Many of today’s assistants can’t be easily personalized and they weren’t designed to meet the data privacy and security requirements companies need.

That’s why we invented Amazon Q, the most capable generative AI-powered assistant for accelerating software development and leveraging business data. I’m excited to share that Amazon Q Developer, Amazon Q Business, and Amazon Q in QuickSight are available today along with several new features. You can see a quick breakdown of today’s announcement and demos in this video from Dr. Matt Wood.

Amazon Q is the most capable work assistant available today, and we built it with security and privacy in mind from the start: if an employee can’t access a data source normally, they can’t access it through Q either.

We are bringing you a ton of great capabilities and experiences today, and I thought I’d call out just a few.

Amazon Q Developer – your assistant for the entire software development lifecycle

Amazon Q Developer helps developers and IT professionals (IT pros) with all of their tasks—from coding, testing, and upgrading, to troubleshooting, performing security scanning and fixes, optimizing AWS resources, and creating data engineering pipelines. Customers are already seeing the value in Amazon Q Developer: Eviden, a digital transformation services company, is experiencing 20-40% in productivity gains; Switchboard MD, a healthcare company, has reduced their time to deploy new features for products by 25%; and Datapel Systems, a warehouse management and inventory stock solutions company, is achieving a remarkable efficiency improvement of at least 70%.

Amazon Q Developer helps developers build faster and more securely by generating code suggestions and recommendations in near real time. In fact, Amazon Q Developer has the highest reported code acceptance rates in the industry for assistants that perform multi-line code suggestions, with BT Group recently reporting they accepted 37% of Q’s code suggestions and National Australia Bank reporting a 50% acceptance rate. Previously, some of these coding assistant capabilities were provided by Amazon CodeWhisperer, and are now part of Amazon Q Developer.

Amazon Q Developer agent capabilities can autonomously perform a range of tasks–everything from implementing features, documenting, and refactoring code, to performing software upgrades. You can ask Amazon Q Developer to add a new checkout feature to your e-commerce app, and it will analyze your existing codebase, map out the implementation plan spanning multiple files, and upon your approval, execute all the required code changes and tests in minutes. Check out this feature in action, Implement an API with Amazon Q Developer Agent for Software Development. Carrying out these tasks, the agent for software development achieved the highest scores of 13.4% on the SWE-Bench Leaderboard and 20.5% on the SWE-bench Leaderboard (Lite), a dataset that benchmarks coding capabilities. These updates will be available to customers in the coming days.

Q can also automate app upgrades, reducing days of work to minutes. Recently, a five-person Amazon team used the Amazon Q Code Transformation agent to upgrade over 1,000 production applications from Java 8 to Java 17 in just two days (the average time per application was less than 10 minutes), saving months of time and improving application performance. Today, Amazon Q performs Java language upgrades, and cross-platform .NET upgrades are coming soon to accelerate migrations to Linux, saving customers millions in licensing fees. Check out the code transformation agent in action, Upgrade a Java App with Amazon Q Developer Agent for Code Transformation.

Starting today, you can also ask Q Developer questions about your AWS account like “What instances are currently running in US East 1?” or “What’s my S3 bucket encryption?” or “What were my EC2 costs by region last month?” and Amazon Q Developer will list the resources and details, in a summarized answer with links to learn more.

Learn more about Amazon Q Developer at the AWS News Blog.

Amazon Q Business empowers employees to be more creative, data-driven, efficient, prepared, and productive

Our vision for Amazon Q Business is to make the power of generative AI accessible to every business to get insights from all their data (unstructured and structured), take actions, and build applications.

Most companies have troves of valuable data that is hard to access and parse through. With Amazon Q Business, employees can get answers to questions across business data such as company policies, product information, business results, code base, people, and many other topics by connecting to enterprise data repositories to summarize the data logically, analyze trends, and engage in dialog about the data. To make this possible, Amazon Q Business has more built-in, managed and secure data connectors to connect your enterprise data than any other generative AI assistant. This includes commonly used business tools, such as wikis, intranets, Atlassian, Gmail, Microsoft Exchange, Salesforce, ServiceNow, Slack, and Amazon Simple Storage Service (Amazon S3). You can even build custom plugins enabling Amazon Q Business to directly take actions like submitting time-off requests. Need the latest sales figures summarized? Looking for competitive intel on a prospective client? Amazon Q Business’s advanced language capabilities will quickly synthesize relevant info from scattered documents, databases and chat logs, into a coherent response.

One of the most exciting announcements today is a feature of Amazon Q Business called Amazon Q Apps. Amazon Q Apps helps every employee go from conversation to building a generative AI-powered app in seconds, making it much easier to streamline and automate daily tasks. Creating an application with Amazon Q Apps is straightforward—employees can describe the type of app they want in natural language, or just tell Amazon Q Apps to do it from a conversation where Amazon Q helped solve a problem. For instance, a marketer could ask Q Apps to create an app that generates compelling customer stories by just inputting the customer’s name, products used, business challenge, and business impact. In seconds, Q creates the app that is then shareable to other marketers throughout the organization. Check out more examples at Introducing Amazon Q Apps (Preview).

Recently, I sat down with Praerit Garg, President of Product & Innovation at Smartsheet, to discuss how Amazon Q Business is helping their employees get answers faster and ultimately improve productivity, as well as onboard new hires more quickly.

Learn more about Amazon Q Business at the Amazon Machine Learning Blog.

Generative BI allows analysts to build detailed dashboards in minutes and business users to get insights fast

Historically, businesses have stored large amounts of their valuable structured data in their databases and data warehouses, which are usually accessible only through business intelligence (BI) tools. When business executives needed to extract information from the data, they had to rely on over-taxed business analysts to build dashboards, which could often take days or weeks. Even when dashboards were created it was difficult to extract and share important insights from these dashboards. Now, Amazon Q brings its advanced generative AI technology to Amazon QuickSight, AWS’s unified BI service built for the cloud. With Amazon Q in QuickSight, customers get a generative BI assistant that allows business analysts to use natural language to reduce the time to build a BI dashboard from hours to minutes. And it helps everyone in an organization become more data-driven by making data more accessible. It is the only BI product where business users can get AI-driven executive summaries of dashboards, ask questions of data beyond what is presented in the dashboards–for instant answers, and create detailed and customizable data stories highlighting key insights, trends, and drivers. All you have to do is say what you want in natural language.

Showpad, a leading provider of sales enablement solutions, is able to give customers the ability to query data without the need for a complex user interface or the need to know SQL. Integration took just a little over a week, and Showpad was able to customize the experience so it blends seamlessly into their own experience.

Clinigence Health is leveraging Amazon Q in QuickSight to identify insights and trends within data in minutes, a process that previously took hours.

See more about Amazon Q in QuickSight at the Business Intelligence Blog.

New pricing including Amazon Q Developer Free Tier

Along with these new features, we’re also announcing new pricing tiers that make it even easier to use Amazon Q. We are offering an Amazon Q Developer Free Tier, which provides free coding to individuals in the IDE and command line, as well as free limited usage of most advanced capabilities, like Amazon Q Developer Agents. For customers who want organizational license management, the ability to customize Amazon Q Developer to their codebase for more relevant coding suggestions, as well as higher limits on advanced capabilities, we are offering the Amazon Q Developer Pro tier for $19 per user per month. The Amazon Q Business Pro subscription at $20 per user per month provides users access to the full suite of Amazon Q Business capabilities, including access to Amazon Q Apps and Amazon Q in QuickSight (Reader Pro). Finally, you can access a free trial of Amazon Q Developer Pro and Amazon Q Business Pro until June 30, 2024.

We’re excited about bringing you these new capabilities across Amazon Q Developer, Amazon Q Business, and Amazon Q in QuickSight. We are just getting started in helping you be more productive, better leverage your organizational data, and create new ways of working.

Check out the following resources to learn more about this announcement:


About the author

Swami Sivasubramanian is Vice President of Data and Machine Learning at AWS. In this role, Swami oversees all AWS Database, Analytics, and AI & Machine Learning services. His team’s mission is to help organizations put their data to work with a complete, end-to-end data solution to store, access, analyze, visualize, and predict.

Read More

Amazon Q Business and Amazon Q in QuickSight empowers employees to be more data-driven and make better, faster decisions using company knowledge

Amazon Q Business and Amazon Q in QuickSight empowers employees to be more data-driven and make better, faster decisions using company knowledge

Today, we announced the General Availability of Amazon Q, the most capable generative AI-powered assistant for accelerating software development and leveraging companies’ internal data. “During the preview, early indications signaled Amazon Q could help our customers’ employees become more than 80% more productive at their jobs; and with the new features we’re planning on introducing in the future, we think this will only continue to grow,” shared Dr. Swami Sivasubramanian, vice president of Artificial Intelligence and Data at AWS. Employees across every organization collectively spend hours every week searching internal sources for information, piecing together analyses, writing reports, building presentations, creating and searching for insights in dashboards, or adapting content for different customers or audiences. We built Amazon Q Business and Amazon Q in QuickSight to make this much simpler.

Amazon Q Business is a generative AI–powered assistant that can answer questions, provide summaries, generate content, and securely complete tasks based on data and information in your enterprise systems. It empowers employees to be more creative, data-driven, efficient, prepared, and productive.

Amazon Q Business unites more data sources than any other generative AI assistant available today

Amazon Q Business easily and securely connects to 40+ commonly used business tools, such as wikis, intranets, Atlassian, Gmail, Microsoft Exchange, Salesforce, ServiceNow, Slack, and Amazon Simple Storage Service (Amazon S3)–more than any other generative AI assistant available today. Simply point Q at your enterprise data repositories, and it will search all of your data, summarize logically, analyze trends, and engage in dialog with end users about the data. This helps business users access all of their data, no matter where it resides in their organization. Watch use cases of Amazon Q Business through its simple web-based interface.

Built from the ground up with security and privacy in mind

Amazon Q Business is built to be secure and private by design. It seamlessly integrates with a customer’s existing identities, roles, and access permissions to personalize the interactions for each individual user, while maintaining the highest levels of security. It generates accurate responses based on enterprise information; customers can restrict sensitive topics, block keywords, and filter out inappropriate content; and it does not use customer content to train the underlying model for anybody else. If you want to learn more about how to set up and administer Q Business, check out the News Blog: Amazon Q Business.

Generative BI allows analysts and business users to build detailed dashboards in minutes

Amazon QuickSight is AWS’s unified Business Intelligence (BI) service built for the cloud. With Amazon Q in QuickSight, customers get a Generative BI assistant that allows business analysts to use natural language to build BI dashboards in minutes and easily create visualizations and complex calculations. It is also the only BI product where business users can get AI-driven executive summaries of dashboards, ask questions of data beyond what is presented in the dashboards and get instant answers, and create detailed and customizable data stories highlighting key insights, trends, and drivers. Business users can ask to “build a story about how the business has changed over the last month for a business review with leadership,” and in seconds Amazon Q creates a story in multiple parts explaining different aspects of their data with specific insights and supporting visuals, including specific ideas of how to improve the business. Users can choose to lay out content in an easy-to-share document or presentation where they can customize text, images, and themes, and use Amazon Q to rewrite and improve the text. You can read more about all the updates to Amazon QuickSight at the AWS Business Intelligence Blog, and watch Unlock the power of Generative BI with Amazon Q in QuickSight to learn about creating and sharing content based on your own data.

First-of-its-kind capability that helps every employee go from conversation to generative AI-powered app in seconds

Today, we announced a new capability of Amazon Q Business, called Amazon Q Apps (in preview), that allows employees to easily and quickly create generative AI-powered apps based on their company data, without requiring any prior coding experience. With Amazon Q Apps, employees simply describe the app they want in natural language, or they can take an existing conversation where Amazon Q Business helped them solve a problem and, with one click, Amazon Q will instantly generate an app that accomplishes the desired task and can be easily shared across their organization.

For example, generating employee onboarding plans for new recruits can be a long and laborious process. It requires many hours of searching through different data stores and documents to find the appropriate content for the new employee, and oftentimes the content is out of date or not specific enough to their role. With Amazon Q, an HR professional can simply describe an app that could pull together a personalized onboarding plan for a new employee, simply by inputting their name and employee ID. In a matter of seconds, Amazon Q Apps will build an app that can automatically generate a personalized onboarding plan tailored to the employee, their role, and the department using the latest data. The HR professional can then share the app with hiring managers across the company to instantly build personalized onboarding plans for their own teams. Now, with Amazon Q Apps, business users can easily, quickly, and securely build an app based on enterprise information to improve their work productivity. Watch Introducing Amazon Q Apps to see how easy it is to implement.

Bani Bedi, senior vice president, Corporate Development and Strategy at Smartsheet, said:

“Amazon Q Business is streamlining knowledge management and accelerating employee productivity at Smartsheet. Previously, it was too difficult for our 3,300 employees to find the information they needed across public help documents, training courses, and hundreds of all-employee Slack help channels. We have consolidated our organizational knowledge into a single AI engine to give our workforce immediate answers, significantly boosting employee productivity.”

You can hear more in the interview AWS Fireside Chat with Smartsheet.

We’re really excited to share Amazon Q Business and Amazon Q in QuickSight with you. If you want more information on generative AI at AWS, you can find it at AWS Generative AI.


About the Authors

Mukesh Karki is GM of Amazon Q Business.

Tracy Daugherty is GM of Amazon QuickSight.

Read More

Develop and train large models cost-efficiently with Metaflow and AWS Trainium

Develop and train large models cost-efficiently with Metaflow and AWS Trainium

This is a guest post co-authored with Ville Tuulos (Co-founder and CEO) and Eddie Mattia (Data Scientist) of Outerbounds.

To build a production-grade AI system today (for example, to do multilingual sentiment analysis of customer support conversations), what are the primary technical challenges? Historically, natural language processing (NLP) would be a primary research and development expense. In 2024, however, organizations are using large language models (LLMs), which require relatively little focus on NLP, shifting research and development from modeling to the infrastructure needed to support LLM workflows.

For AWS and Outerbounds customers, the goal is to build a differentiated machine learning and artificial intelligence (ML/AI) system and reliably improve it over time. This often means that relying on a third-party LLM API won’t do for security, control, and scale reasons. Owning the infrastructure and the know-how to run the workflows that power AI systems is a requirement.

Returning to the original question, three MLOps challenges may arise:

  • You need high-quality data to train and fine-tune models
  • You need a diverse cloud infrastructure for experimentation, training, tracking, and orchestrating the production system
  • You need a significant amount of compute to power the system

In this post, we highlight a collaboration between Outerbounds and AWS that takes a step towards addressing the last two challenges. First, the AWS Trainium accelerator provides a high-performance, cost-effective, and readily available solution for training and fine-tuning large models. Second, open source Metaflow provides the necessary software infrastructure to build production-grade ML/AI systems in a developer-friendly manner. It provides an approachable, robust Python API for the full infrastructure stack of ML/AI, from data and compute to workflows and observability.

In the following sections, we first introduce Metaflow and the Trainium integration. We then show how to set up the infrastructure stack you need to take your own data assets and pre-train or fine-tune a state-of-the-art Llama2 model on Trainium hardware.

Metaflow overview

Metaflow was originally developed at Netflix to enable data scientists and ML engineers to build ML/AI systems quickly and deploy them on production-grade infrastructure. Netflix open sourced the framework in 2019 with integrations to AWS services like AWS Batch, AWS Step Functions (see Unbundling Data Science Workflows with Metaflow and AWS Step Functions), Kubernetes, and throughput-optimized Amazon Simple Storage Service (Amazon S3), so you can build your own Netflix-scale ML/AI environment in your AWS account.

The key motivation of Metaflow is to address the typical needs of all ML/AI projects with a straightforward, human-centric API, from prototype to production (and back). The following figure illustrates this workflow.

Typical workflow with Metaflow and AWS Trainium

Metaflow’s coherent APIs simplify the process of building real-world ML/AI systems in teams. Metaflow helps scientists and engineers access, move, and manipulate data efficiently; track and version experiments and models; orchestrate and integrate workflows to surrounding systems; and scale compute to the cloud easily. Moreover, it has first-class support for teams, such as namespacing and deploying workflows in versioned production branches.

Now, with today’s announcement, you have another straightforward compute option for workflows that need to train or fine-tune demanding deep learning models: running them on Trainium.

How Metaflow integrates with Trainium

From a Metaflow developer perspective, using Trainium is similar to other accelerators. After a Metaflow deployment is configured to access Trainium chips through the compute platform customers use with Metaflow (which we discuss later in this post), ML engineers and data scientists can operate autonomously in the land of deep learning code. Scientists can write PyTorch and Hugging Face code, and use the AWS Neuron SDK along with the NeuronX Distributed SDK to optimize these frameworks for Trainium devices, while Metaflow integrates with the underlying AWS services to separate concerns about how to actually run the code at scale.

As illustrated by the following figure, you can declare the following in a few lines of Python code:

  • How many nodes to launch
  • How many Trainium devices to use per node
  • How the nodes are interconnected (Elastic Fabric Adapter)
  • How often to check the resource utilization
  • What training script the torchrun process should run on each node

Configuring a training job using a Metaflow FlowSpec

You can initialize the training process in the start step, which directs the next train step to run on two parallel instances (num_parallel=2). The decorators of the train step configure your desired training setup:

  • @torchrun – Sets up PyTorch Distributed across two instances
  • @batch – Configures the Trainium nodes, managed by AWS Batch
  • @neuron_monitor – Activates the monitoring UI that allows you to monitor the utilization of the Trainium cores
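
For orientation, here is a minimal sketch of how these pieces might fit together in a flow. It assumes the @torchrun and @neuron_monitor decorators from the Metaflow Trainium integration are installed and importable from the metaflow package; the flow name, image, and training body are illustrative rather than taken from the repository:

from metaflow import FlowSpec, step, batch, torchrun, neuron_monitor


class TrainiumSketchFlow(FlowSpec):
    """Illustrative two-node Trainium training flow (not the repository's full example)."""

    @step
    def start(self):
        # Fan out to two parallel instances that form a single training cluster.
        self.next(self.train, num_parallel=2)

    @neuron_monitor                                        # monitoring UI for NeuronCore utilization
    @torchrun                                              # launches the training script under torchrun on each node
    @batch(trainium=16, efa=8, image="YOUR_IMAGE_IN_ECR")  # 16 Trainium devices and 8 EFA interfaces per node
    @step
    def train(self):
        # The Neuron distributed training entrypoint is invoked here;
        # see the metaflow-trainium examples for the real training code.
        self.next(self.join)

    @step
    def join(self, inputs):
        # Gather results from the parallel training nodes.
        self.next(self.end)

    @step
    def end(self):
        pass


if __name__ == "__main__":
    TrainiumSketchFlow()

Running the flow with python flow.py run would then submit the train step as a multi-node AWS Batch job on the Trainium compute environment described later in this post.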

Metaflow allows you to configure all this functionality in a few lines of code. However, the main benefit is that you can embed Trainium-based training code inside a larger production system, using the scaffolding provided by Metaflow.

Benefits of using Trainium with Metaflow

Trainium and Metaflow work together to solve problems like what we discussed earlier in this post. The Trainium devices and Neuron software stack make it straightforward for teams to access and effectively use the high-performance hardware needed for cutting-edge AI.

Trainium provides a few key benefits for building real-world AI systems:

  • Trainium instances can help reduce generative AI model training and fine-tuning costs by up to 50% over comparable instances on AWS
  • Trainium instances are readily available in many AWS Regions, are often easier to obtain than GPU-based instance types, and scale in the most popular Regions worldwide
  • The hardware and software are mature and actively developed by AWS

If you have been struggling with GPU availability and cost, you’ll surely appreciate these benefits. Using Trainium effectively can require a bit of infrastructure effort and knowledge, which is a key motivation for this integration. Through Metaflow and the deployment scripts provided in this post, you should be able to get started with Trainium with ease.

Besides easy access, using Trainium with Metaflow brings a few additional benefits:

Infrastructure accessibility

Metaflow is known for its developer-friendly APIs that allow ML/AI developers to focus on developing models and applications, and not worry about infrastructure. Metaflow helps engineers manage the infrastructure, making sure it integrates with existing systems and policies effortlessly.

Data, model, and configuration management

Metaflow provides built-in, seamless artifact persistence, tracking, and versioning, which covers the full state of the workflows, making sure you’ll follow MLOps best practices. Thanks to Metaflow’s high-throughput S3 client, you can load and save datasets and model checkpoints very quickly, without having to worry about extra infrastructure such as shared file systems. You can use artifacts to manage configuration, so everything from hyperparameters to cluster sizing can be managed in a single file, tracked alongside the results.

Observability

Metaflow comes with a convenient UI, which you can customize to observe metrics and data that matter to your use cases in real time. In the case of Trainium, we provide a custom visualization that allows you to monitor utilization of the NeuronCores inside Trainium instances, making sure that resources are used efficiently. The following screenshot shows an example of the visualization for core (top) and memory (bottom) utilization.

Visualizing NeuronCore and memory utilization

Multi-node compute

Finally, a huge benefit of Metaflow is that you can use it to manage advanced multi-instance training clusters, which would take a lot of involved engineering otherwise. For instance, you can train a large PyTorch model, sharded across Trainium instances, using Metaflow’s @torchrun and @batch decorators.

Behind the scenes, the decorators set up a training cluster using AWS Batch multi-node with a specified number of Trainium instances, configured to train a PyTorch model across the instances. By using the launch template we provide in this post, the setup can benefit from low-latency, high-throughput networking via Elastic Fabric Adapter (EFA) networking interfaces.

Solution overview

As a practical example, let’s set up the complete stack required to pre-train Llama2 for a few epochs on Trainium using Metaflow. The same recipe applies to the fine-tuning examples in the repository.

Deploy and configure Metaflow

If you already use a Metaflow deployment, you can skip to the next step to deploy the Trainium compute environment.

Deployment

To deploy a Metaflow stack using AWS CloudFormation, complete the following steps:

  1. Download the CloudFormation template.
  2. On the CloudFormation console, choose Stacks in the navigation pane.
  3. Choose Create new stack.
  4. For Prepare template¸ select Template is ready.
  5. For Template source, select Upload a template file.
  6. Upload the template.
  7. Choose Next.

Deploy Metaflow stack using CloudFormation

  8. If you are brand new to Metaflow, or are trying this recipe as a proof of concept, we suggest you change the APIBasicAuth parameter to false and leave all other parameters at their default settings.
  9. Complete the stack creation process.

Specify stack details

After you create the CloudFormation stack and configure Metaflow to use the stack resources, there is no additional setup required. For more information about the Metaflow components that AWS CloudFormation deploys, see AWS Managed with CloudFormation.

Configuration

To use the stack you just deployed from your laptop or cloud workstation, complete the following steps:

  1. Prepare a Python environment and install Metaflow in it:
pip install metaflow
  2. Run the Metaflow configuration wizard in a terminal:
metaflow configure aws

After the CloudFormation stack deployment is complete, the Outputs on the stack details page will contain a list of resource names and their values, which you can use in the Metaflow AWS configuration prompts.

Deploy a Trainium compute environment

The default Metaflow deployment from the previous step has an AWS Batch compute environment, but it will not be able to schedule jobs to run on Amazon Elastic Compute Cloud (Amazon EC2) instances with Trainium devices. To deploy an AWS Batch compute environment for use with Trainium accelerators, you can use the following CloudFormation template. Complete the following steps:

  1. Download the CloudFormation template.
  2. On the CloudFormation console, choose Stacks in the navigation pane.
  3. Choose Create new stack.
  4. For Prepare template¸ select Template is ready.
  5. For Template source, select Upload a template file.
  6. Upload the template.
  7. Choose Next.
  8. Complete the stack creation process.

Take note of the name of the AWS Batch job queue that you created to use in a later step.

Prepare a base Docker image to run Metaflow tasks

Metaflow tasks run inside Docker containers when AWS Batch is used as a compute backend. To run Trainium jobs, developers need to build a custom image and specify it in the @batch decorator Metaflow developers use to declare task resources:

@batch(trainium=16, efa=8, image="YOUR_IMAGE_IN_ECR")
@step
def train_llama2(self):
    # Neuron distributed training code goes here
    ...

To make the image, complete the following steps:

  1. Create an Amazon Elastic Container Registry (Amazon ECR) registry to store your image in.
  2. Create and log in to an EC2 instance with sufficient memory. For this post, we used Ubuntu x86 OS on a C5.4xlarge instance.
  3. Install Docker.
  4. Copy the following Dockerfile to your instance.
  5. Authenticate with the upstream base image provider:
aws ecr get-login-password \
  --region $REGION | docker login \
  --username AWS \
  --password-stdin 763104351884.dkr.ecr.$REGION.amazonaws.com
  6. Build the image:
docker build . -t $YOUR_IMAGE_NAME:$YOUR_IMAGE_TAG
  7. On the Amazon ECR console, navigate to the ECR registry you created, and you will find the commands needed to authenticate from the EC2 instance and push your image.

Clone the repository on your workstation

Now you’re ready to verify the infrastructure is working properly, after which you can run complex distributed training code like Llama2 training. To get started, clone the examples repository to the workstation where you configured Metaflow with AWS:

git clone https://github.com/outerbounds/metaflow-trainium

Verify the infrastructure with an allreduce example

To validate your infrastructure configuration, complete the following steps:

  1. Navigate to the allreduce example:
cd allreduce-trn
  2. Open the flow.py file and make sure to set the job queue and image to the name of the queue you deployed with AWS CloudFormation and the image you pushed to Amazon ECR, respectively.
  3. To run the allreduce code, run the following Metaflow command:
python flow.py --package-suffixes=.sh run

You can find the logs (truncated in the following code snippet for readability) in the Metaflow UI:

Task is starting (status SUBMITTED)...
Task is starting (status RUNNABLE)... (parallel node status: [SUBMITTED:3])
Task is starting (status STARTING)... (parallel node status: [SUBMITTED:3])
Task is starting (status RUNNING)... (parallel node status: [SUBMITTED:3])
Setting up task environment.
Downloading code package...
Code package downloaded.
Task is starting.
...
Compiler status PASS
result OK step 0: tensor([[64., 64., 64.],
[64., 64., 64.]], device='xla:1')
...
result OK step 900: tensor([[64., 64., 64.],
[64., 64., 64.]], device='xla:1')
Before final rendezvous
Waiting for batch secondary tasks to finish

Configure and run any Neuron distributed code

If the allreduce test runs successfully, you are ready to move on to meaningful workloads. To complete this onboarding, complete the following steps:

  1. Navigate to the llama2-7b-pretrain-trn directory.
  2. Similar to the allreduce example, before using this code, you need to modify the config.py file so that it matches the AWS Batch job queue and ECR image that you created. Open the file, find these lines, and modify them to your values:
class BatchJobConfig:
    # <snip>
    image: str = "YOUR_IMAGE"
    job_queue: str = "YOUR_QUEUE"
  3. After modifying these values, and any others you want to experiment with, run the following command:
python config.py
  4. Then run the workflow to pre-train your own Llama2 model from scratch:
python flow.py run --config-file config.yaml

This will train the model on however many nodes you specify in the config.py file, and will push the trained model result to Amazon S3 storage, versioned by Metaflow’s data store using the flow name and run ID.
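
Because the trained result is tracked as a Metaflow artifact, you can retrieve it later from any Python session configured against the same Metaflow deployment. The following small sketch uses the standard Metaflow client API; the flow name and artifact name are hypothetical placeholders for whatever your flow actually stores:

from metaflow import Flow

# Hypothetical names: substitute your flow's class name and the artifact
# (for example, an S3 prefix to the saved checkpoints) that the train step sets on self.
run = Flow("Llama2PretrainFlow").latest_successful_run
print("Run:", run.pathspec)
checkpoint_location = run.data.checkpoint_path
print("Trained model checkpoints:", checkpoint_location)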

Logs will appear like the following (truncated from a sample run of five steps for readability):

Task is starting (status SUBMITTED)...
Task is starting (status RUNNABLE)... (parallel node status: [SUBMITTED:3])
Task is starting (status STARTING)... (parallel node status: [SUBMITTED:3])
Task is starting (status RUNNING)... (parallel node status: [SUBMITTED:3])
Setting up task environment.
Downloading code package...
Code package downloaded.
Task is starting.
...
initializing tensor model parallel with size 8
initializing pipeline model parallel with size 1
initializing data parallel with size 16
...
Epoch 0 begin Fri Mar 15 21:19:10 2024
...
Compiler status PASS
...
(0, 3) step_loss : 15.4375 learning_rate : 3.00e-04 throughput : 4.38
(0, 4) step_loss : 12.1250 learning_rate : 1.50e-04 throughput : 5.47
(0, 5) step_loss : 11.8750 learning_rate : 0.00e+00 throughput : 6.44
...
Writing data to the provided results file: /metaflow/metaflow/metrics.json
...
Waiting for batch secondary tasks to finish

Clean up

To clean up resources, delete the CloudFormation stacks for your Metaflow deployment and Trainium compute environment:

aws cloudformation delete-stack --stack-name metaflow
aws cloudformation delete-stack --stack-name trn1-batch

Conclusion

You can get started experimenting with the solution presented in this post in your environment today. Follow the instructions in the GitHub repository to pre-train a Llama2 model on Trainium devices. Additionally, we have prepared examples for fine-tuning Llama2 and BERT models, demonstrating how you can use the Optimum Neuron package to use the integration from this post with any Hugging Face model.

We are happy to help you get started. Join the Metaflow community Slack for support, to provide feedback, and share experiences!


About the authors

Ville Tuulos is a co-founder and CEO of Outerbounds, a developer-friendly ML/AI platform. He has been developing infrastructure for ML and AI for over two decades in academia and as a leader at a number of companies. At Netflix, he led the ML infrastructure team that created Metaflow, a popular open-source, human-centric foundation for ML/AI systems. He is also the author of a book, Effective Data Science Infrastructure, published by Manning.

Eddie Mattia is in scientific computing and more recently building machine learning developer tools. He has worked as a researcher in academia, in customer-facing and engineering roles at MLOps startups, and as a product manager at Intel. Currently, Eddie is working to improve the open-source Metaflow project and is building tools for AI researchers and MLOps developers at Outerbounds.

Vidyasagar specializes in high performance computing, numerical simulations, optimization techniques and software development across industrial and academic environments. At AWS, Vidyasagar is a Senior Solutions Architect developing predictive models, generative AI and simulation technologies. Vidyasagar has a PhD from the California Institute of Technology.

Diwakar Bansal is an AWS Senior Specialist focused on business development and go-to-market for GenAI and Machine Learning accelerated computing services. Diwakar has led product definition, global business development, and marketing of technology products in the fields of IOT, Edge Computing, and Autonomous Driving focusing on bringing AI and Machine leaning to these domains. Diwakar is passionate about public speaking and thought leadership in the Cloud and GenAI space.

Sadaf Rasool is a Machine Learning Engineer with the Annapurna ML Accelerator team at AWS. As an enthusiastic and optimistic AI/ML professional, he holds firm to the belief that the ethical and responsible application of AI has the potential to enhance society in the years to come, fostering both economic growth and social well-being.

Scott Perry is a Solutions Architect on the Annapurna ML accelerator team at AWS. Based in Canada, he helps customers deploy and optimize deep learning training and inference workloads using AWS Inferentia and AWS Trainium. His interests include large language models, deep reinforcement learning, IoT, and genomics.

Read More

Cohere Command R and R+ are now available in Amazon SageMaker JumpStart

Cohere Command R and R+ are now available in Amazon SageMaker JumpStart

This blog post is co-written with Pradeep Prabhakaran from Cohere. 

Today, we are excited to announce that Cohere Command R and R+ foundation models are available through Amazon SageMaker JumpStart to deploy and run inference. Command R/R+ are the state-of-the-art retrieval augmented generation (RAG)-optimized models designed to tackle enterprise-grade workloads.

In this post, we walk through how to discover and deploy Cohere Command R/R+ via SageMaker JumpStart.

What are Cohere Command R and Command R+?

Cohere Command R is a family of highly scalable language models that balance high performance with strong accuracy. The Command R family – which includes the Command R and Command R+ models – is optimized for RAG-based workflows such as conversational interaction and long-context tasks, enabling companies to move beyond proof of concept and into production. These powerful models are designed to handle complex tasks with high performance and strong accuracy, making them suitable for real-world applications.

Command R boasts high precision on RAG and tool use tasks, low latency and high throughput, a long 128,000-token context length, and strong capabilities across 10 key languages: English, French, Spanish, Italian, German, Portuguese, Japanese, Korean, Arabic, and Chinese.

Command R+ is the newest model, optimized for extremely performant conversational interaction and long-context tasks. It is recommended for workflows that lean on complex RAG functionality and multi-step tool use (agents), while Command R is well-suited for simpler RAG and single-step tool use tasks, as well as applications where price is a major consideration.

What is SageMaker JumpStart?

With SageMaker JumpStart, you can choose from a broad selection of publicly available foundation models. ML practitioners can deploy foundation models to dedicated SageMaker instances from a network-isolated environment and customize models using SageMaker for model training and deployment. You can now discover and deploy Cohere Command R/R+ models with a few choices in Amazon SageMaker Studio or programmatically through the SageMaker Python SDK. Doing so enables you to derive model performance and machine learning operations (MLOps) controls with SageMaker features such as SageMaker Pipelines, SageMaker Debugger, or container logs.

The model is deployed in an AWS secure environment and under your virtual private cloud (VPC) controls, helping provide data security. Cohere Command R/R+ models are available today for deployment and inferencing in Amazon SageMaker Studio in us-east-1 (N. Virginia), us-east-2 (Ohio), us-west-1 (N. California), us-west-2 (Oregon), Canada (Central), eu-central-1 (Frankfurt), eu-west-1 (Ireland), eu-west-2 (London), eu-west-3 (Paris), eu-north-1 (Stockholm), ap-southeast-1 (Singapore), ap-southeast-2 (Sydney), ap-northeast-1 (Tokyo), ap-northeast-2 (Seoul), ap-south-1 (Mumbai), and sa-east-1 (Sao Paulo).

Discover models

You can access the foundation models through SageMaker JumpStart in the SageMaker Studio UI and the SageMaker Python SDK. In this section, we go over how to discover the models in SageMaker Studio.

From the SageMaker JumpStart landing page, you can easily discover various models by browsing through different hubs, which are named after model providers. The Cohere Command R and R+ models are available in the Cohere hub. If you don’t see these models, ensure you have the latest SageMaker Studio version by shutting down and restarting Studio Classic Apps.

To find the Command R and R+ models, search for “Command R” in the search box located at the top left of the SageMaker JumpStart landing page. Each model can be deployed on Amazon Elastic Compute Cloud (EC2) P5 instances powered by NVIDIA H100 Tensor Core GPUs (p5.48xlarge) and Amazon EC2 P4de instances powered by NVIDIA A100 Tensor Core GPUs (ml.p4de.24xlarge).

Deploy a model

To illustrate model deployment, we’ll deploy Cohere Command R+ on NVIDIA H100. Choose the model card to open the corresponding model detail page.

When you choose Deploy, a window appears prompting you to subscribe to the model on AWS Marketplace. Choose Subscribe, which redirects you to the AWS Marketplace listing for Cohere Command R+ (H100). Follow the on-screen instructions to complete the subscription process.

Once subscribed, return to the model detail page and choose Deploy in the window. The deployment process initiates.

Alternatively, you can choose Notebooks on the model card and open the example notebook in JupyterLab. This notebook provides end-to-end guidance on deploying the model for inference and cleaning up resources. You can also find this example notebook in the Cohere SageMaker GitHub repository. To help secure the endpoint, you can configure an AWS Key Management Service (AWS KMS) key for the SageMaker endpoint configuration.

If an endpoint has already been created, you can simply connect to it:

# The Cohere AWS SDK is assumed to be installed (pip install cohere-aws)
from cohere_aws import Client

co = Client(region_name=region)
co.connect_to_endpoint(endpoint_name="cohere-command-r-plus")

Real-time inference

Once your endpoint has been connected, you can perform real-time inference using the co.chat endpoint.

message = "Write a LinkedIn post about starting a career in tech:"
response = co.chat(message=message, stream=False)

Multilingual capabilities

Command R/R+ is optimized to perform well in 10 key languages, as listed in the introduction. Additionally, pre-training data have been included for the following 13 languages: Russian, Polish, Turkish, Vietnamese, Dutch, Czech, Indonesian, Ukrainian, Romanian, Greek, Hindi, Hebrew, Persian.

The model has been trained to respond in the language of the user. Here’s an example in French:

co.chat(
  message="Écris une description de produit pour une voiture électrique en 50 à 75 mots"
)

Here’s what the response might look like:

Découvrez la voiture électrique qui va révolutionner votre façon de conduire.
Avec son design élégant, cette voiture offre une expérience de conduit unique avec une accélération puissante et une autonomie impressionnante. Sa technologie avancée vous garantit une charge rapide et une fiabilité inégalée. Avec sa conception innovante et durable, cette voiture est parfaite pour les trajets urbains et les longues distances. Profitez d'une conduite silencieuse et vivez l'expérience de la voiture électrique!

Command R/R+ can also perform cross-lingual tasks, such as translation or answering questions about content in other languages.

Chat with documents (RAG)

Command R/R+ can ground its generations. This means that it can generate responses based on a list of supplied document snippets, and it includes citations in its response indicating the source of the information.

For example, the following code snippet produces an answer to “How deep is the Mariana Trench” along with inline citations based on the provided documents.

Request:

message="How deep is the Mariana Trench"
documents = [
    {
       "id": "national_geographic_everest",
       "title": "Height of Mount Everest",
       "snippet": "The height of Mount Everest is 29,035 feet",
       "url": "https://education.nationalgeographic.org/resource/mount-everest/",
    },
    {
        "id": "national_geographic_mariana",
        "title": "Depth of the Mariana Trench",
        "snippet": "The depth of the Mariana Trench is 36,070 feet",
        "url": "https://www.nationalgeographic.org/activity/mariana-trench-deepest-place-earth",
    }
]

response = co.chat(message=message, documents=documents, stream=False)

Response:

{
   text: "The depth of the Mariana Trench is 36,070 feet.",
   citations: [
      {'start': 35, 'end': 47, 'text': '36,070 feet.', 'document_ids': ['national_geographic_mariana']}
   ],
   documents: [
      {'id': 'national_geographic_mariana',
       'snippet': 'The depth of the Mariana Trench is 36,070 feet',
       'title': 'Depth of the Mariana Trench',
       'url': 'https://www.nationalgeographic.org/activity/mariana-trench-deepest-place-earth'}
   ]
}

Single-Step & Multi-Step Tool Use

Command R/R+ comes with a Tool Use API that enables the language model to interact with user-defined tools to automate highly sophisticated tasks. In Tool Use mode, Command R/R+ creates API payloads (JSONs with specific parameters) based on user interactions and conversational history. These can be used to instruct any other application or tool.

For example, an application can be instructed to automatically categorize and route support tickets to the appropriate individual, change a status in customer relationship management software (CRM), or retrieve relevant snippets from a vector database. Tool use comes in two variants, single-step and multi-step:

  • Single-step tool use enables a richer set of behaviors by leveraging data stored in tools, taking actions through APIs, interacting with a vector database, querying a search engine, etc.
  • Multi-step tool use is an extension of this basic idea and allows the model to call more than one tool in a sequence of steps, using the results from one tool call in a subsequent step. This process allows the language model to reason, perform dynamic actions, and quickly adapt based on information coming from external sources.
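
To make the single-step flow concrete, the following sketch shows the general shape of a tool-use request against the already-connected co client from earlier. The tool name and parameters are hypothetical, the schema follows Cohere's documented Chat tool-use format, and you should confirm against the linked notebook that your version of the cohere-aws client exposes the tools parameter:

# Hypothetical tool: a ticket-routing function the model can ask your application to call.
tools = [
    {
        "name": "route_support_ticket",
        "description": "Assigns a support ticket to the team best suited to handle it.",
        "parameter_definitions": {
            "ticket_id": {
                "description": "Identifier of the support ticket.",
                "type": "str",
                "required": True,
            },
            "team": {
                "description": "Team to route the ticket to, such as 'billing' or 'technical'.",
                "type": "str",
                "required": True,
            },
        },
    }
]

response = co.chat(
    message="Ticket 4521: the customer reports being charged twice last month.",
    tools=tools,
    stream=False,
)

# The model returns structured tool calls (a tool name plus parameters) for your
# application to execute; it does not run the tool itself.
for tool_call in response.tool_calls or []:
    print(tool_call.name, tool_call.parameters)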

To explore these capabilities further, you can refer to the provided Jupyter notebook and Cohere’s AWS GitHub repository, which offer additional examples showcasing various use cases and applications.

Clean Up

After you’ve finished running the notebook and exploring the Cohere Command R and R+ models, it’s essential to clean up the resources you’ve created to avoid incurring unnecessary charges. Follow these steps to delete the resources and stop the billing:

co.delete_endpoint()
co.close()

Conclusion

In this post, we explored how to leverage the powerful capabilities of Cohere’s Command R and R+ models on Amazon SageMaker JumpStart. These state-of-the-art large language models are specifically designed to excel at real-world enterprise use cases, offering unparalleled performance and scalability. With their availability on SageMaker JumpStart and AWS Marketplace, you now have seamless access to these cutting-edge models, enabling you to unlock new levels of productivity and innovation in your natural language processing projects.


About the authors

Pradeep Prabhakaran is a Customer Solutions Architect at Cohere. In his current role at Cohere, Pradeep acts as a trusted technical advisor to customers and partners, providing guidance and strategies to help them realize the full potential of Cohere’s cutting-edge Generative AI platform. Prior to joining Cohere, Pradeep was a Principal Customer Solutions Manager at Amazon Web Services, where he led Enterprise Cloud transformation programs for large enterprises. Prior to AWS, Pradeep has held various leadership positions at consulting companies such as Slalom, Deloitte, and Wipro. Pradeep holds a Bachelor’s degree in Engineering and is based in Dallas, TX.

James Yi is a Senior AI/ML Partner Solutions Architect at Amazon Web Services. He spearheads AWS’s strategic partnerships in Emerging Technologies, guiding engineering teams to design and develop cutting-edge joint solutions in GenAI. He enables field and technical teams to seamlessly deploy, operate, secure, and integrate partner solutions on AWS. James collaborates closely with business leaders to define and execute joint Go-To-Market strategies, driving cloud-based business growth. Outside of work, he enjoys playing soccer, traveling, and spending time with his family.

Read More

Revolutionizing large language model training with Arcee and AWS Trainium

Revolutionizing large language model training with Arcee and AWS Trainium

This is a guest post by Mark McQuade, Malikeh Ehghaghi, and Shamane Siri from Arcee.

In recent years, large language models (LLMs) have gained attention for their effectiveness, leading various industries to adapt general LLMs to their data for improved results, making efficient training and hardware availability crucial. At Arcee, we focus primarily on enhancing the domain adaptation of LLMs in a client-centric manner. Arcee’s innovative continual pre-training (CPT) and model merging techniques have brought a significant leap forward in the efficient training of LLMs, with particularly strong evaluations in the medical, legal, and financial verticals. Close collaboration with the AWS Trainium team has also played a major role in making the Arcee platform extremely performant, not only accelerating model training but also reducing overall costs and enforcing compliance and data integrity in the secure AWS environment. In this post, we show you how we make our continual pre-training more efficient by using Trainium chips.

Understanding continual pre-training

Arcee recognizes the critical importance of CPT [1] in tailoring models to specific domains, as evidenced by previous studies such as PMC-LLaMA [2] and ChipNeMo [3]. These projects showcase the power of domain adaptation pre-training in enhancing model performance across diverse fields, from medical applications to industrial chip design. Inspired by these endeavors, our approach to CPT involves extending the training of base models like Llama 2 using domain-specific datasets, allowing us to fine-tune models to the nuances of specialized fields. To further amplify the efficiency of our CPT process, we collaborated with the Trainium team, using their cutting-edge technology to enhance a Llama 2 [4] model using a PubMed dataset [2] comprising 88 billion tokens. This collaboration represents a significant milestone in our quest for innovation, and through this post, we’re excited to share the transformative insights we’ve gained. Join us as we unveil the future of domain-specific model adaptation and the potential of CPT with Trainium in optimizing model performance for real-world applications.

Dataset collection

We followed the methodology outlined in the PMC-Llama paper [6] to assemble our dataset, which includes PubMed papers sourced from the Semantic Scholar API and various medical texts cited within the paper, culminating in a comprehensive collection of 88 billion tokens. For further details on the dataset, the original paper offers in-depth information.

To prepare this dataset for training, we used the Llama 2 tokenizer within an AWS Glue pipeline for efficient processing. We then organized the data so that each row contained 4,096 tokens, adhering to recommendations from the Neuron Distributed tutorials.
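
As a rough illustration of that preparation step (outside the actual AWS Glue pipeline), the packing logic looks roughly like the following sketch, assuming access to the gated Llama 2 tokenizer on Hugging Face and an in-memory list of documents:

from transformers import AutoTokenizer

SEQ_LEN = 4096  # tokens per training row, per the Neuron Distributed tutorials

# Requires access to the gated Llama 2 repository on Hugging Face.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")


def pack_documents(documents):
    """Tokenize documents into one stream and cut it into fixed-length rows."""
    stream = []
    for text in documents:
        stream.extend(tokenizer(text, add_special_tokens=True)["input_ids"])
    # Drop the trailing partial row so every example is exactly SEQ_LEN tokens long.
    n_rows = len(stream) // SEQ_LEN
    return [stream[i * SEQ_LEN:(i + 1) * SEQ_LEN] for i in range(n_rows)]


rows = pack_documents(["First PubMed paper text ...", "Second PubMed paper text ..."])
print(f"{len(rows)} rows of {SEQ_LEN} tokens each")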

Why Trainium?

Continual pre-training techniques like the ones described in this post require access to high-performance compute instances, which has become more difficult to get as more developers are using generative artificial intelligence (AI) and LLMs for their applications. Traditionally, these workloads have been deployed to GPUs; however, in recent years, the cost and availability of GPUs has stifled model building innovations. With the introduction of Trainium, we are able to unlock new techniques that enable us to continue model innovations that will allow us to build models more efficiently and most importantly, at lower costs. Trainium is the second-generation machine learning (ML) accelerator that AWS purpose built to help developers access high-performance model training accelerators to help lower training costs by up to 50% over comparable Amazon Elastic Compute Cloud (Amazon EC2) instances. With Trainium available in AWS Regions worldwide, developers don’t have to take expensive, long-term compute reservations just to get access to clusters of GPUs to build their models. Trainium instances offer developers the performance they need with the elasticity they want to optimize both for training efficiency and lowering model building costs.

Setting up the Trainium cluster

We used AWS ParallelCluster to build a High Performance Computing (HPC) compute environment that uses Trn1 compute nodes to run our distributed ML training job (see the GitHub tutorial). You can also use developer flows like Amazon SageMaker, Amazon Elastic Kubernetes Service (Amazon EKS), Ray, or others (to learn more, see Developer Flows). After the nodes were launched, we ran a training task to confirm that the nodes were working, and used Slurm commands to check the job status. In this part, we used the AWS ParallelCluster pcluster CLI with a .yaml configuration file to generate the cluster. Our cluster consisted of 16 nodes, each a trn1n.32xlarge instance with 16 Trainium accelerators (32 GB of accelerator memory each).

We set up our ParallelCluster infrastructure as shown in the following diagram (source).

As shown in the preceding figure, inside a VPC, there are two subnets, a public one and a private one. The head node resides in the public subnet, and the compute fleet (in this case, Trn1 instances) is in the private subnet. A NAT gateway is also needed in order for nodes in the private subnet to connect to clients outside the VPC. In the following section, we describe how to set up the necessary infrastructure for Trn1 ParallelCluster.

Set up the environment

To set up your environment, complete the following steps:

  1. Install the VPC and necessary components for ParallelCluster. For instructions, see VPC setup for ParallelCluster with Trn1.
  2. Create and launch ParallelCluster in the VPC. For instructions, see Create ParallelCluster.

Now you can launch a training job to submit a model training script as a slurm job.

Deploy to Trainium

Trainium-based EC2 Trn1 instances use the AWS Neuron SDK and support common ML frameworks like PyTorch and TensorFlow. Neuron allows for effortless distributed training and has integrations with NeMo Megatron and Neuron Distributed.

When engaging with Trainium, it’s crucial to understand several key parameters:

  • Tensor parallel size – This determines the level of tensor parallelization, particularly in self-attention computations within transformers, and is crucial for optimizing memory usage (not computational time efficiency) during model loading
  • NeuronCores – Each Trainium device has two NeuronCores, and an eight-node setup equates to a substantial 256 cores
  • Mini batch – This reflects the number of examples processed in each batch as determined by the data loader
  • World size – This is the total count of nodes involved in the training operation

A deep understanding of these parameters is vital for anyone looking to harness the power of Trainium devices effectively.
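
As a quick back-of-the-envelope check, the relationships between these parameters can be worked out in a few lines. The numbers below mirror the eight-node example above and are illustrative rather than Arcee's exact configuration:

# Eight Trn1 nodes, 16 Trainium devices per node, 2 NeuronCores per device.
world_size = 8                 # total count of nodes in the training operation
devices_per_node = 16
cores_per_device = 2

total_cores = world_size * devices_per_node * cores_per_device   # 256 NeuronCores
tensor_parallel_size = 8                                          # cores sharing one shard of each layer
data_parallel_size = total_cores // tensor_parallel_size          # 32 model replicas (pipeline parallel = 1)

mini_batch = 1                                                    # examples per replica per step (data loader)
global_batch = mini_batch * data_parallel_size                    # examples consumed per optimizer step

print(total_cores, data_parallel_size, global_batch)              # 256 32 32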

Train the model

For this post, we train a Llama 2 7B model with tensor parallelism. For a streamlined and effective training process, we adhered to the following steps:

  1. Download the Llama 2 full checkpoints (model weights and tokenizer) from Hugging Face.
  2. Convert these checkpoints to a format compatible with the Neuron Distributed setup, so they can be efficiently utilized in our training infrastructure.
  3. Determine the number of steps required per epoch, incorporating the effective batch size and dataset size to tailor the training process to our specific needs.
  4. Launch the training job, carefully monitoring its progress and performance.
  5. Periodically save training checkpoints. Initially, this process may be slow due to its synchronous nature, but improvements are anticipated as the NeuronX team works on enhancements.
  6. Finally, convert the saved checkpoints back to a standard format for subsequent use, employing scripts for seamless conversion.

For more details, you can find the full implementation of the training steps in the following GitHub repository.

Clean up

Don’t forget to tear down any resources you set up in this post.

Results

Our study focused on evaluating the quality of the CPT-enhanced checkpoints. We monitored the perplexity of a held-out PubMed dataset [6] across various checkpoints obtained during training, which provided valuable insights into the model’s performance improvements over time.
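
For reference, held-out perplexity is simply the exponential of the average per-token cross-entropy loss. The following minimal sketch shows one way to compute it for a Hugging Face causal LM checkpoint; it is illustrative and not Arcee's evaluation harness:

import math

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumes access to the checkpoint under evaluation (base or CPT-enhanced).
model_id = "meta-llama/Llama-2-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)
model.eval()


def perplexity(text: str) -> float:
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        # Passing labels makes the model return the mean cross-entropy loss over the sequence.
        loss = model(**enc, labels=enc["input_ids"]).loss
    return math.exp(loss.item())


print(perplexity("Mitochondria regulate apoptosis through the intrinsic pathway."))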

Through this journey, we’ve advanced our model’s capabilities, and hope to contribute to the broader community’s understanding of effective model adaptation strategies.

The following figure shows the perplexity of the baseline Llama 2 7B checkpoint vs. its CPT-enhanced checkpoint on the PMC test dataset. Based on these findings, continual pre-training on domain-specific raw data, specifically PubMed papers in our study, enhanced the Llama 2 7B checkpoint, leading to lower (improved) perplexity on the PMC test set.

The following figure shows the perplexity of the CPT-enhanced checkpoints of the Llama 2 7B model across varying numbers of trained tokens. The increasing number of trained tokens correlated with enhanced model performance, as measured by the perplexity metric.

The following figure shows the perplexity comparison between the baseline Llama 2 7B model and its CPT-enhanced checkpoints, with and without data mixing. It underscores the significance of data mixing, where we added 1% of general-domain tokens to the domain-specific dataset: the CPT-enhanced checkpoint trained with data mixing performed better than both the baseline Llama 2 7B model and the CPT-enhanced checkpoint trained solely on PubMed data.

Conclusion

Arcee’s innovative approach to CPT and model merging, as demonstrated through our collaboration with Trainium, signifies a transformative advancement in the training of LLMs, particularly in specialized domains such as medical research. By using the extensive capabilities of Trainium, we have not only accelerated the model training process, but also significantly reduced costs, with an emphasis on security and compliance that provides data integrity within a secure AWS environment.

The results from our training experiments, as seen in the improved perplexity scores of domain-specific models, underscore the effectiveness of our method in enhancing the performance and applicability of LLMs across various fields. This is particularly evident from the direct comparisons of time-to-train metrics between Trainium and traditional GPU setups, where Trainium’s efficiency and cost-effectiveness shine.

Furthermore, our case study using PubMed data for domain-specific training highlights the potential of Arcee’s CPT strategies to fine-tune models to the nuances of highly specialized datasets, thereby creating more accurate and reliable tools for professionals in those fields.

As we continue to push the boundaries of what’s possible in LLM training, we encourage researchers, developers, and enterprises to take advantage of the scalability, efficiency, and enhanced security features of Trainium and Arcee’s methodologies. These technologies not only facilitate more effective model training, but also open up new avenues for innovation and practical application in AI-driven industries.

The integration of Trainium’s advanced ML capabilities with Arcee’s pioneering strategies in model training and adaptation is poised to revolutionize the landscape of LLM development, making it more accessible, economical, and tailored to meet the evolving demands of diverse industries.

To learn more about Arcee.ai, visit Arcee.ai or reach out to our team.

Additional resources

References

  1. Gupta, Kshitij, et al. “Continual Pre-Training of Large Language Models: How to (re) warm your model?.” arXiv preprint arXiv:2308.04014 (2023).
  2. Wu, Chaoyi, et al. “Pmc-LLaMA: Towards building open-source language models for medicine.” arXiv preprint arXiv:2305.10415 6 (2023).
  3. Liu, Mingjie, et al. “Chipnemo: Domain-adapted llms for chip design.” arXiv preprint arXiv:2311.00176 (2023).
  4. Touvron, Hugo, et al. “Llama 2: Open foundation and fine-tuned chat models.” arXiv preprint arXiv:2307.09288 (2023).
  5. https://aws.amazon.com/ec2/instance-types/trn1/
  6. Wu, C., Zhang, X., Zhang, Y., Wang, Y., & Xie, W. (2023). Pmc-llama: Further fine tuning llama on medical papers. arXiv preprint arXiv:2304.14454.

About the Authors

Mark McQuade is the CEO/Co-Founder at Arcee. Mark co-founded Arcee with a vision to empower enterprises with industry-specific AI solutions. This idea emerged from his time at Hugging Face, where he helped spearhead the Monetization team, collaborating with high-profile enterprises. This frontline experience exposed him to critical industry pain points: the reluctance to rely on closed source APIs and the challenges of training open source models without compromising data security.

Shamane Siri Ph.D. is the Head of Applied NLP Research at Arcee. Before joining Arcee, Shamane worked in both industry and academia, developing recommendation systems using language models to address the cold start problem, and focusing on information retrieval, multi-modal emotion recognition, and summarization. Shamane has also collaborated with the Hugging Face Transformers crew and Meta Reality Labs on cutting-edge projects. He holds a PhD from the University of Auckland, where he specialized in domain adaptation of foundational language models.

Malikeh Ehghaghi is an Applied NLP Research Engineer at Arcee. Malikeh’s research interests are NLP, domain-adaptation of LLMs, ML for healthcare, and responsible AI. She earned an MScAC degree in Computer Science from the University of Toronto. She previously collaborated with Lavita AI as a Machine Learning Consultant, developing healthcare chatbots in partnership with Dartmouth Center for Precision Health and Artificial Intelligence. She also worked as a Machine Learning Research Scientist at Cambridge Cognition Inc. and Winterlight Labs, with a focus on monitoring and detection of mental health disorders through speech and language. Malikeh has authored several publications presented at top-tier conferences such as ACL, COLING, AAAI, NAACL, IEEE-BHI, and MICCAI.

Read More