Discover insights from Box with the Amazon Q Box connector

Seamless access to content and insights is crucial for delivering exceptional customer experiences and driving successful business outcomes. Box, a leading cloud content management platform, serves as a central repository for diverse digital assets and documents in many organizations. An enterprise Box account typically contains a wealth of materials, including documents, presentations, knowledge articles, and more. However, extracting meaningful information from the vast amount of Box data can be challenging without the right tools and capabilities. Employees in roles such as customer support, project management, and product management require the ability to effortlessly query Box content, uncover relevant insights, and make informed decisions that address customer needs effectively.

Building a generative artificial intelligence (AI)-powered conversational application that is seamlessly integrated with your enterprise’s relevant data sources requires time, money, and people. First, you need to develop connectors to those data sources. Next, you need to index this data to make it available for a Retrieval Augmented Generation (RAG) approach where relevant passages are delivered with high accuracy to a large language model (LLM). To do this, you need to select an index that provides the capabilities to index the content for semantic and vector search, build the infrastructure to retrieve and rank the answers, and build a feature-rich web application. You also need to hire and staff a large team to build, maintain, and manage such a system.

Amazon Q Business is a fully managed generative AI-powered assistant that can answer questions, provide summaries, generate content, and securely complete tasks based on data and information in your enterprise systems. Amazon Q Business can help you get fast, relevant answers to pressing questions, solve problems, generate content, and take action using the data and expertise found in your company’s information repositories, code, and enterprise systems (such as Box, among others). Amazon Q provides out-of-the-box native data source connectors that can index content into a built-in retriever and uses an LLM to provide accurate, well-written answers. A data source connector is a component of Amazon Q that helps integrate and synchronize data from multiple repositories into one index.

Amazon Q Business offers multiple prebuilt connectors to a large number of data sources, including Box Content Cloud, Atlassian Confluence, Amazon Simple Storage Service (Amazon S3), Microsoft SharePoint, Salesforce, and many more, and helps you create your generative AI solution with minimal configuration. For a full list of Amazon Q Business supported data source connectors, see Amazon Q Business connectors.

In this post, we guide you through configuring Amazon Q Business and integrating it with your Box Content Cloud. This enables your support, project management, product management, leadership, and other teams to quickly obtain accurate answers to their questions from the documents stored in your Box account.

Find accurate answers from Box documents using Amazon Q Business

After you integrate Amazon Q Business with Box, you can ask questions based on the documents stored in your Box account. For example:

  • Natural language search – You can search for information within documents located in any folder by using conversational language, simplifying the process of finding desired data without the need to remember specific keywords or filters.
  • Summarization – You can ask Amazon Q Business to summarize contents of documents to meet your needs. This enables you to quickly understand the main points and find relevant information in your documents without having to scan through individual document descriptions manually.

Overview of the Box connector for Amazon Q Business

To crawl and index contents in Box, you can configure the Amazon Q Business Box connector as a data source in your Amazon Q Business application. When you connect Amazon Q Business to a data source and initiate the sync process, Amazon Q Business crawls and indexes documents from the data source into its index.

Types of documents

Let’s look at what counts as a document in the context of the Amazon Q Business Box connector. A document is a collection of information that consists of a title, the content (or the body), metadata (data about the document), and access control list (ACL) information to make sure answers are provided from documents that the user has access to.

The Amazon Q Business Box connector supports crawling of the following entities in Box:

  • Files – Each file is considered a single document
  • Comments – Each comment is considered a single document
  • Tasks – Each task is considered a single document
  • Web links – Each web link is considered a single document

Additionally, Box users can create custom objects and custom metadata fields. Amazon Q supports the crawling and indexing of these custom objects and custom metadata.

The Amazon Q Business Box connector also supports the indexing of a rich set of metadata from the various entities in Box. It further provides the ability to map these source metadata fields to Amazon Q index fields for indexing this metadata. These field mappings allow you to map Box field names to Amazon Q index field names. There are two types of metadata fields that Amazon Q connectors support:

  • Reserved or default fields – These are required with each document, such as the title, creation date, or author
  • Custom metadata fields – These are fields created in the data source in addition to what the data source already provides

Refer to Box data source connector field mappings for more information.
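To make the idea of field mappings concrete, the following is a minimal sketch of how source fields could be translated into index fields. It is an illustration only: the Box-side field names (`name`, `created_at`, `created_by`, `department`) and the custom index field `box_department` are assumptions chosen for the example, and the connector performs the real mapping for you based on the settings in the console.

```python
# Illustrative only: these source field names are assumptions, not the
# connector's documented list. The leading-underscore names follow the
# style of Amazon Q reserved index fields.
RESERVED_MAPPINGS = {
    "name": "_document_title",
    "created_at": "_created_at",
    "created_by": "_authors",
}


def map_to_index_fields(box_item, reserved, custom):
    """Return index-field -> value pairs for every mapped field present on the item.

    Fields on the item that have no mapping are ignored, which mirrors the idea
    that only mapped metadata ends up in the index.
    """
    mapped = {}
    for source_field, index_field in {**reserved, **custom}.items():
        if source_field in box_item:
            mapped[index_field] = box_item[source_field]
    return mapped


if __name__ == "__main__":
    item = {"name": "guide.pdf", "created_at": "2024-01-01T00:00:00Z", "department": "ML"}
    # "department" -> "box_department" is a hypothetical custom mapping.
    print(map_to_index_fields(item, RESERVED_MAPPINGS, {"department": "box_department"}))
```

Keeping reserved and custom mappings in separate dictionaries mirrors the two field types described above.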

Authentication

Before you index content from Box, you first need to establish a secure connection between the Amazon Q Business connector for Box and your Box cloud instance. To do so, you authenticate with the data source. Let’s look at the authentication mechanism the Box connector supports.

The Amazon Q Box connector supports Box’s token-based JWT authentication as its authentication method. This approach requires several parameters: the Box client ID, client secret, public key ID, private key, and passphrase. With JWT authentication in place, Amazon Q Business can securely connect to and interact with data stored in Box on behalf of your organization.

Refer to JWT Auth in the Box Developer documentation for more information on setting up and managing JWT tokens in Box.
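The connector handles the JWT exchange for you, but it can help to see where each parameter ends up. The following sketch builds the unsigned header and claims of a Box JWT assertion, based on our reading of the Box JWT documentation. Treat the claim layout, the 45-second lifetime, and the function name `build_box_jwt` as assumptions to verify against the Box documentation. Signing with the private key and passphrase is omitted because it requires a third-party cryptography library.

```python
import secrets
import time

# Token endpoint used as the JWT audience (per the Box JWT documentation).
BOX_TOKEN_URL = "https://api.box.com/oauth2/token"


def build_box_jwt(client_id, enterprise_id, public_key_id, now=None):
    """Build the unsigned header and claims for a Box enterprise JWT assertion.

    The private key and passphrase from the Box app would be used to sign
    these with an RSA algorithm; that step is intentionally left out here.
    """
    issued = int(time.time() if now is None else now)
    header = {"alg": "RS256", "typ": "JWT", "kid": public_key_id}
    claims = {
        "iss": client_id,              # the app's client ID
        "sub": enterprise_id,          # the enterprise the token is requested for
        "box_sub_type": "enterprise",
        "aud": BOX_TOKEN_URL,
        "jti": secrets.token_hex(16),  # unique identifier for this assertion
        "exp": issued + 45,            # short-lived; assumed to be under Box's limit
    }
    return header, claims


if __name__ == "__main__":
    # Placeholder values, not real credentials.
    header, claims = build_box_jwt("my-client-id", "1234567", "abcd1234")
    print(header)
    print(claims)
```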

Supported Box subscriptions

To integrate Amazon Q Business with Box using the Box connector, you need a Box Enterprise or Box Enterprise Plus plan. Both plans provide the necessary capabilities to create a custom application, download a JWT token as an administrator, and then configure the connector to ingest relevant data from Box.

Secure querying with ACL crawling, identity crawling, and User Store

The success of Amazon Q Business applications hinges on two key factors: making sure end-users only see responses generated from documents they have access to, and maintaining the privacy and security of each user’s conversation history. Amazon Q Business achieves this by validating the user’s identity every time they access the application, and using this to restrict tasks and answers to the user’s authorized documents. This is accomplished through the integration of AWS IAM Identity Center, which serves as the authoritative identity source and validates users. You can configure IAM Identity Center to use your enterprise identity provider (IdP)—such as Okta or Microsoft Entra ID—as the identity source.

ACLs and identity crawling are enabled by default and can’t be disabled. The Box connector automatically retrieves user identities and ACLs from the connected data sources. This allows Amazon Q Business to filter chat responses based on the end-user’s document access level, so they only see the information they are authorized to view. If you need to index documents without ACLs, you must explicitly mark them as public in your data source. For more information on how the Amazon Q Business connector crawls Box ACLs, refer to How Amazon Q Business connector crawls Box ACLs.

In the Box platform, an administrative user can provision additional user accounts and assign varying permission levels, such as viewer, editor, or co-owner, to files or folders. Fine-grained access is further enhanced through the Amazon Q User Store, which is an Amazon Q data source connector feature that streamlines user and group management across all the data sources attached to your application. This granular permission mapping enables Amazon Q Business to efficiently enforce access controls based on the user’s identity and permissions within the Box environment. For more information on the Amazon Q Business User store, refer to Understanding Amazon Q Business User Store.

Solution overview

In this post, we walk through the steps to configure a Box connector for an Amazon Q Business application. We use an existing Amazon Q application and configure the Box connector to sync data from specific Box folders, map relevant Box fields to the Amazon Q index, initiate the data sync, and then query the ingested Box data using the Amazon Q web experience.

As part of querying the Amazon Q Business application, we cover how to ask natural language questions on documents present in your Box folders and get back relevant results and insights using Amazon Q Business.

Prerequisites

For this walkthrough, you need the following:

Create users in IAM Identity Center

For this post, you need to create three sample users in IAM Identity Center. One user will act as the admin user; the other two will serve as department-specific users. This is to simulate the configuration of user-level access control on distinct folders within your Box account. Make sure to use the same email addresses when creating the users in your Box account.

Complete the following steps to create the users in IAM Identity Center:

  1. On the IAM Identity Center console, choose Users in the navigation pane.
  2. Choose Add user.
  3. For Username, enter a user name. For example, john_doe.
  4. For Password, select Send an email to this user with password setup instructions.
  5. For Email address and Confirm email address, enter your email address.
  6. For First name and Last name, enter John and Doe, respectively. You can also provide your preferred first and last names if necessary.
  7. Keep all other fields as default and choose Next.

  8. On the Add user to groups page, keep everything as default and choose Next.
  9. Verify the details on the Review and add user page, then choose Add user.

The user will get an email containing a link to join IAM Identity Center.

  10. Choose Accept Invitation and set up a password for your user. Remember to note it down for testing the Amazon Q Business application later.
  11. If required by your organization, complete the multi-factor authentication (MFA) setup for this user to enhance security during sign-in.
  12. Confirm that you can log in as the first user using the credentials you created in the previous step.
  13. Repeat the previous steps to create your second department-specific user. Use a different email address for this user. For example, set Username as mary_major, First name as Mary, and Last name as Major. Alternatively, you can use your own values if preferred.
  14. Verify that you can log in as the second user using the credentials you created in the previous step.
  15. Repeat the previous steps to create the third user, who will serve as the admin. Use your Box admin user’s email address for this account, and choose your preferred user name, first name, and last name. For this example, saanvi_sarkar will act as the admin user.
  16. Confirm that you can log in as the admin user using the credentials you created in the previous step.

This concludes the setup of all three users in the IAM Identity Center, each with unique email addresses.
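If you prefer to script the user creation instead of using the console, the following is a minimal sketch using the IAM Identity Center identity store API (`create_user` on the boto3 `identitystore` client). The identity store ID and the `example.com` email addresses are placeholders, and the helper names are hypothetical. Users created through the API may not receive the password setup email described above, so you might still need to trigger password setup from the console.

```python
# Placeholder demo users; replace the emails with the addresses used in Box.
DEMO_USERS = [
    {"user_name": "john_doe", "given_name": "John", "family_name": "Doe",
     "email": "john.doe@example.com"},
    {"user_name": "mary_major", "given_name": "Mary", "family_name": "Major",
     "email": "mary.major@example.com"},
    {"user_name": "saanvi_sarkar", "given_name": "Saanvi", "family_name": "Sarkar",
     "email": "saanvi.sarkar@example.com"},
]


def build_create_user_request(identity_store_id, user_name, given_name, family_name, email):
    """Build the keyword arguments for identitystore.create_user."""
    return {
        "IdentityStoreId": identity_store_id,
        "UserName": user_name,
        "DisplayName": f"{given_name} {family_name}",
        "Name": {"GivenName": given_name, "FamilyName": family_name},
        "Emails": [{"Value": email, "Type": "work", "Primary": True}],
    }


def create_users(identity_store_id, users, client=None):
    """Create each user and return the new user IDs in order."""
    if client is None:
        import boto3  # imported lazily so the request builder works without the SDK
        client = boto3.client("identitystore")
    user_ids = []
    for user in users:
        request = build_create_user_request(identity_store_id, **user)
        user_ids.append(client.create_user(**request)["UserId"])
    return user_ids


if __name__ == "__main__":
    # "d-1234567890" is a placeholder identity store ID.
    print(build_create_user_request("d-1234567890", **DEMO_USERS[0]))
```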

Create two users in your Box account

For this example, you need two demo users in your Box account in addition to the admin user. Complete the following steps to create these two demo users, using the same email addresses you used when setting up these users in IAM Identity Center:

  1. Log in to your Box Enterprise Admin Console as an admin user.
  2. Choose Users & Groups in the navigation pane.

On the Managed Users tab, the admin user is listed by default.

  3. To create your first department-specific user, choose Add Users, then choose Add Users Manually.
  4. Enter the same name and email address that you used while creating this first department-specific user in IAM Identity Center. For example, use John Doe for Name and his email address for Email. You don’t need to specify groups or folders.
  5. Select the acknowledgement check box to agree to the payment method for adding this new user to your Box account.
  6. Choose Next.
  7. On the Add Users page, choose Complete to agree and add this new user to your Box account.
  8. To create your second department-specific user, choose Add Users, then choose Add Users Manually.
  9. Enter the same name and email address that you used while creating this second department-specific user in IAM Identity Center. For example, use Mary Major for Name and her email address for Email. You don’t need to specify groups or folders.

You now have all three users provisioned in your Box account.

Create a custom Box application for Amazon Q

Before you configure the Box data source connector in Amazon Q Business, you create a custom Box application in your Box account.

Complete the following steps to create an application and configure its authentication method:

  1. Log in to your Box Enterprise Developer Console as an admin user.
  2. Choose My Apps in the navigation pane.
  3. Choose Create New App.
  4. Choose Custom App.

  5. For App name, enter a name for your app. For example, AmazonQConnector.
  6. For Purpose, choose Other.
  7. For Please specify, enter Other.
  8. Leave the other options blank and choose Next.
  9. For Authentication Method, select Server Authentication (with JWT).
  10. Choose Create App.
  11. In My Apps, choose your created app and go to the Configuration tab.
  12. In the App Access Level section, choose App + Enterprise Access.
  13. In the Application Scopes section, select the following permissions:
    1. Write all files and folders stored in Box
    2. Manage users
    3. Manage groups
    4. Manage enterprise properties
  14. In the Advanced Features section, select Make API calls using the as-user header.
  15. In the Add and Manage Public Keys section, choose Generate a Public/Private Keypair.
  16. Complete the two-step verification process and choose OK to download the JSON file to your computer.
  17. Choose Save Changes.
  18. On the Authorization tab, choose Review and Submit.
  19. In the Review App Authorization Submission pop-up, for App description, enter AmazonQConnector and choose Submit.

Your Box Enterprise owner needs to approve the application before you can use it. Complete the following steps to complete the authorization:

  1. Log in to your Box Enterprise Admin Console as the admin user.
  2. Choose Apps in the navigation pane and choose the Custom Apps Manager tab to view the apps that need to be authorized.
  3. Choose the AmazonQConnector app that says Pending Authorization.
  4. Choose the options menu (three dots) and choose Authorize App.
  5. Choose Authorize in the Authorize App pop-up.

This will authorize your AmazonQConnector application and change the status to Authorized.

You can review the downloaded JSON file in your computer’s downloads directory. It contains the client ID, client secret, public key ID, private key, passphrase, and enterprise ID, which you’ll need when creating the Box data source in a later step.
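As a convenience, the following sketch pulls those six values out of the downloaded file. It assumes the file uses the nested layout Box typically generates (`boxAppSettings`, `appAuth`, and a top-level `enterpriseID`), and that the secret is a flat JSON object with the five credential keys. Both the layout and the function name are assumptions to check against your own file and the connector documentation.

```python
import json


def box_config_to_secret(config):
    """Split a parsed Box app config into (enterprise ID, secret JSON string).

    The enterprise ID goes into the data source settings, while the other
    five values are the credentials stored in the secret.
    """
    app = config["boxAppSettings"]
    auth = app["appAuth"]
    secret = {
        "clientID": app["clientID"],
        "clientSecret": app["clientSecret"],
        "publicKeyID": auth["publicKeyID"],
        "privateKey": auth["privateKey"],
        "passphrase": auth["passphrase"],
    }
    return config["enterpriseID"], json.dumps(secret)


if __name__ == "__main__":
    # Dummy values standing in for the downloaded file's contents.
    sample = {
        "boxAppSettings": {
            "clientID": "abc",
            "clientSecret": "def",
            "appAuth": {"publicKeyID": "k1", "privateKey": "PEM...", "passphrase": "pw"},
        },
        "enterpriseID": "1234567",
    }
    enterprise_id, secret_string = box_config_to_secret(sample)
    print(enterprise_id)
    print(sorted(json.loads(secret_string)))
```

If you create the secret programmatically rather than in the console, the returned string could be passed as `SecretString` to the Secrets Manager `create_secret` call. Avoid printing or logging the real secret values.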

Add sample documents to your Box account

In this step, upload sample documents to your Box account. Later, you use the Amazon Q Box data source connector to crawl and index these documents.

  1. Download the zip file to your computer.
  2. Extract the files to a folder called AWS_Whitepapers.

  3. Log in to your Box Enterprise account as an admin user.
  4. Upload the AWS_Whitepapers folder to your Box account.

At the time of writing, this folder contains 6 folders and 60 files within them.

Set user-specific permissions on folders in your Box account

In this step, you set up user-level access control for two users on two separate folders in your Box account.

For this ACL simulation, consider the two department-specific users created earlier. Assume John is part of the machine learning (ML) team, so he needs access only to the Machine_Learning folder contents, whereas Mary belongs to the database team, so she needs access only to the Databases folder contents.

Log in to your Box account as an admin and grant viewer access to each user for their respective folder. This restricts each user to seeing only the contents of their assigned folder.

The Machine_Learning folder is accessible to the owner and user John Doe only.

The Databases folder is accessible to the owner and user Mary Major only.
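The effect of these folder permissions on retrieval can be pictured with a small sketch. It is a conceptual illustration rather than how Amazon Q Business is implemented: the document titles, the `example.com` addresses, and the `IndexedDocument` type are all hypothetical. The point is that each indexed document carries the set of users allowed to see it, and only matching documents are eligible to contribute to an answer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IndexedDocument:
    title: str
    folder: str
    allowed_users: frozenset  # email addresses taken from the crawled ACL


def visible_documents(documents, user_email):
    """Return only the documents whose ACL includes the given user."""
    return [doc for doc in documents if user_email in doc.allowed_users]


# Hypothetical index contents mirroring the folder permissions above.
ADMIN = "saanvi.sarkar@example.com"
JOHN = "john.doe@example.com"
MARY = "mary.major@example.com"

INDEX = [
    IndexedDocument("translate-overview.pdf", "Machine_Learning", frozenset({ADMIN, JOHN})),
    IndexedDocument("rds-read-replicas.pdf", "Databases", frozenset({ADMIN, MARY})),
]

if __name__ == "__main__":
    for user in (JOHN, MARY, ADMIN):
        print(user, [doc.folder for doc in visible_documents(INDEX, user)])
```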

Configure the Box connector for your Amazon Q Business application

Complete the following steps to configure your Box connector for Amazon Q Business:

  1. On the Amazon Q Business console, choose Applications in the navigation pane.
  2. Select the application you want to add the Box connector to.
  3. On the Actions menu, choose Edit.

  4. On the Update application page, leave all values unchanged and choose Update.
  5. On the Update retriever page, leave all values unchanged and choose Next.
  6. On the Connect data sources page, on the All tab, search for Box.
  7. Choose the plus sign next to the Box connector.
  8. On the Add data source page, for Data source name, enter a name, for example, box-data-source.
  9. Open the JSON file you downloaded from the Box Developer Console.

The file contains values for clientID, clientSecret, publicKeyID, privateKey, passphrase, and enterpriseID.

  10. In the Source section, for Box enterprise ID, enter the value of the enterpriseID key from the JSON file.
  11. For Authorization, no change is needed because by default the ACLs are set to ON for the Box data source connector.
  12. In the Authentication section, under AWS Secrets Manager secret, choose Create and add a new secret.
  13. For Secret name, enter a name for the secret, for example, connector. The prefix QBusiness-Box- is automatically added for you.
  14. For the remaining fields, enter the corresponding values from the downloaded JSON file.
  15. Choose Save to add the secret.
  16. In the Configure VPC and Security group section, use the default setting (No VPC) for this post.
  17. Identity crawling is enabled by default, so no changes are necessary.
  18. In the IAM role section, choose Create a new role (Recommended) and enter a role name, for example, box-role.

For more information on the required permissions to include in the IAM role, see IAM roles for data sources.

  19. In the Sync scope section, in addition to file contents, you can include Box web links, comments, and tasks in your index. Use the default setting (unchecked) for this post.
  20. In the Additional configuration section, you can choose to include or exclude regular expression (regex) patterns. These regex patterns can be applied based on the file name, file type, or file path. For this demo, we skip the regex patterns configuration.
  21. In the Sync mode section, select New, modified, or deleted content sync.
  22. In the Sync run schedule section, choose Run on demand.
  23. In the Field Mappings section, keep the default settings.

After the data source is created, you can modify field mappings and add custom field attributes. You can access field mappings by editing the data source.

  24. Choose Add data source and wait for the data source to be created.

It can take a few seconds for the required roles and the connector to be created.

After the data source is created, you’re redirected to the Connect data sources page to add more data sources as needed.

  25. For this walkthrough, choose Next.
  26. In the Update groups and users section, choose Add groups and users to add the groups and users from IAM Identity Center set up by your administrator.
  27. In the Add or assign users and groups pop-up, select Assign existing users and groups to add existing users configured in your connected IAM Identity Center and choose Next.

Optionally, if you have permissions to add users to connected IAM Identity Center, you can select Add new users.

  28. On the Assign users and groups page, choose Get Started.
  29. In the search box, enter John Doe and choose his user name.
  30. Add the second user, Mary Major, by entering her name in the search box.
  31. Optionally, you can add the admin user to this application.
  32. Choose Assign to add these users to this Amazon Q app.
  33. In the Groups and users section, choose the Users tab, where you will see no subscriptions configured currently.
  34. Choose Manage access and subscriptions to configure the subscription.
  35. On the Manage access and subscriptions page, choose the Users tab.
  36. Select your users.
  37. Choose Change subscription and choose Update subscription tier.
  38. On the Confirm subscription change page, for New subscription, choose Business Pro.
  39. Choose Confirm.
  40. Verify the changed subscription for all three users, then choose Done.
  41. Choose Update application to complete adding and setting up the Box data connector for Amazon Q Business.

Configure Box field mappings

To help you structure data for retrieval and chat filtering, Amazon Q Business crawls data source document attributes or metadata and maps them to fields in your Amazon Q index. Amazon Q has reserved fields that it uses when querying your application. When possible, Amazon Q automatically maps these built-in fields to attributes in your data source.

If a built-in field doesn’t have a default mapping, or if you want to map additional index fields, use the custom field mappings to specify how a data source attribute maps to your Amazon Q application.

  1. On the Amazon Q Business console, choose your application.
  2. Under Data sources, select your data source.
  3. On the Actions menu, choose Edit.

  4. In the Field mappings section, select the fields you want to crawl under Files and folders, Comments, Tasks, and Web Links, then choose Update.

When selecting all items, make sure you navigate through each page by choosing the page numbers and selecting Select All on every page to include all mapped items.

Index sample documents from the Box account

The Box connector setup for Amazon Q is now complete. Because you configured the data source sync schedule to run on demand, you need to start it manually.

In the Data sources section, choose the data source box-data-source and choose Sync now.

The Current sync state changes to Syncing – crawling, then to Syncing – indexing.

After a few minutes, the Current sync state changes to Idle, the Last sync status changes to Successful, and the Sync run history section shows more details, including the number of documents added.
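The same on-demand sync can be started and monitored programmatically. The following sketch uses the `start_data_source_sync_job` and `list_data_source_sync_jobs` operations of the boto3 `qbusiness` client. The client is passed in, so the sketch makes no assumptions about your credentials or Region. The status names come from the API and differ from the labels shown in the console. The set of statuses treated as terminal, the polling interval, and the function name are assumptions for this example, and pagination of the job history is ignored for brevity.

```python
import time

# API statuses treated as "finished" in this sketch (an assumption).
TERMINAL_STATUSES = {"SUCCEEDED", "FAILED", "INCOMPLETE", "ABORTED"}


def run_sync_and_wait(client, application_id, index_id, data_source_id,
                      poll_seconds=30, max_polls=60, sleep=time.sleep):
    """Start a sync job and poll its history entry until it reaches a terminal status."""
    ids = {
        "applicationId": application_id,
        "indexId": index_id,
        "dataSourceId": data_source_id,
    }
    execution_id = client.start_data_source_sync_job(**ids)["executionId"]
    for _ in range(max_polls):
        history = client.list_data_source_sync_jobs(**ids).get("history", [])
        # Match on the execution ID rather than relying on the history order.
        job = next((j for j in history if j.get("executionId") == execution_id), None)
        if job is not None and job.get("status") in TERMINAL_STATUSES:
            return job
        sleep(poll_seconds)
    raise TimeoutError(f"Sync job {execution_id} did not finish after {max_polls} polls")


# Example usage (requires AWS credentials and real resource IDs):
#   import boto3
#   job = run_sync_and_wait(boto3.client("qbusiness"), "app-id", "index-id", "data-source-id")
#   print(job["status"], job.get("metrics", {}))
```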

In this walkthrough, Amazon Q successfully scanned and added all 60 files from the AWS_Whitepapers Box folder.

Query Box data using the Amazon Q web experience

Now that the data synchronization is complete, you can start exploring insights from Amazon Q. In the newly created Amazon Q application, choose Customize web experience to open a new tab with a preview of the UI and options to customize according to your needs.

You can customize the Title, Subtitle, and Welcome message as needed, which will be reflected in the UI.

For this walkthrough, we use the defaults and choose View web experience to be redirected to the login page for the Amazon Q application.

  1. Log in to the application as your first department-specific user, John Doe, using the credentials for the user that were added to the Amazon Q application.

When the login is successful, you’ll be redirected to the Amazon Q assistant UI, where you can start asking questions using natural language and get insights from your Box index.

  2. Enter a prompt in the Amazon Q Business AI assistant at the bottom, such as “What AWS AI/ML service can I use to convert text from one language to another?” Press Enter or choose the arrow icon to generate the response. You can also try your own prompts.

Because John Doe has access to the Machine_Learning folder, Amazon Q Business successfully processed his ML-related query and displayed the response. You can choose Sources to view the source files that contributed to the response, which helps you verify it.

  3. Let’s attempt a different prompt related to the Databases folder, which John doesn’t have access to. Enter the prompt “How to reduce the amount of read traffic and connections to my Amazon RDS database?” or choose your own database-related prompt. Press Enter or choose the arrow icon to generate the response.

As anticipated, the Amazon Q Business application indicates that it couldn’t generate a reply from the documents John can access, because John lacks access to the Databases folder.

  4. Go back to the Amazon Q Business Applications page and choose your application again.
  5. This time, open the web experience URL in private mode to initiate a new session, avoiding interference with the previous session.
  6. Log in as Mary Major, the second department-specific user. Use her user name, password, and any MFA you set up initially.
  7. Enter a prompt in the Amazon Q Business AI assistant at the bottom, such as “How to reduce the amount of read traffic and connections to my Amazon RDS database?” Press Enter or choose the arrow icon to generate the response. You can also try your own prompts.

Because Mary has access to the Databases folder, Amazon Q Business successfully processed her database-related query and displayed the response. You can choose Sources to view the source files that contributed to the response.

  8. Now, let’s attempt a prompt that contains information from the Machine_Learning folder, which Mary isn’t authorized to access. Enter the prompt “What AWS AI/ML service can I use to convert text from one language to another?” or choose your own ML-related prompt.

As anticipated, the Amazon Q Business application will indicate it couldn’t generate a response because Mary lacks access to the Machine_Learning folder.

The preceding test scenarios illustrate the functionality of the Amazon Q Box connector in crawling and indexing documents along with their associated ACLs. With this mechanism, only users with the relevant permissions can access the respective folders and files within the linked Box account.

Congratulations! You’ve effectively utilized Amazon Q to unveil answers and insights derived from the content indexed from your Box account.

Frequently asked questions

In this section, we answer frequently asked questions.

Amazon Q Business is unable to answer your questions

If you get the response “Sorry, I could not find relevant information to complete your request,” this may be due to a few reasons:

  • No permissions – ACLs applied to your Box account don’t allow you to query certain data sources. If this is the case, reach out to your application administrator to make sure your ACLs are configured to access the data sources.
  • Data connector sync failed – Your data connector may have failed to sync information from the source to the Amazon Q Business application. Verify the data connector’s sync run schedule and sync history to confirm the sync is successful.
  • Incorrect regex pattern – Validate the correct definition of the regex include or exclude pattern when setting up the Box data source.
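One way to sanity-check include and exclude patterns before a sync is to try them against a few representative file paths. The following sketch does that with Python's `re` module. It assumes that an exclude match takes precedence over an include match and that an empty include list means "include everything", so confirm both against the connector documentation. The connector's regex dialect may also differ from Python's in edge cases, and the file paths are hypothetical.

```python
import re


def should_crawl(path, include_patterns=(), exclude_patterns=()):
    """Decide whether a file path would be crawled under the given patterns.

    Assumption: exclusion wins when a path matches both kinds of pattern.
    """
    if any(re.search(pattern, path) for pattern in exclude_patterns):
        return False
    if include_patterns:
        return any(re.search(pattern, path) for pattern in include_patterns)
    return True


if __name__ == "__main__":
    include = [r"\.pdf$"]
    exclude = [r"/Databases/"]
    for candidate in (
        "AWS_Whitepapers/Machine_Learning/overview.pdf",
        "AWS_Whitepapers/Machine_Learning/notes.txt",
        "AWS_Whitepapers/Databases/rds.pdf",
    ):
        print(candidate, should_crawl(candidate, include, exclude))
```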

If none of these reasons apply to your use case, open a support case and work with your technical account manager to get this resolved.

How to generate responses from authoritative data sources

If you want Amazon Q Business to only generate responses from authoritative data sources, the use of guardrails can be highly beneficial. Within the application settings, you can specify the authorized data repositories, such as content management systems and knowledge bases, from which the assistant is permitted to retrieve and synthesize information. By defining these approved data sources as guardrails, you can instruct Amazon Q Business to only use reliable, up-to-date, and trustworthy information, eliminating the risk of incorporating data from non-authoritative or potentially unreliable sources.

Additionally, Amazon Q Business offers the capability to define content filters as part of Guardrails for Amazon Bedrock. These filters can specify the types of content, topics, or keywords deemed appropriate and aligned with your organization’s policies and standards. By incorporating these content-based guardrails, you can further refine the assistant’s responses to make sure they align with your authoritative information and messaging. The integration of Amazon Q Business with IAM Identity Center also serves as a critical guardrail, allowing you to validate user identities and align ACLs to make sure end-users only receive responses based on their authorized data access.

Amazon Q Business responds using old (stale) data even though your data source is updated

If you find that Amazon Q Business is responding with outdated or stale data, you can use the relevance tuning and boosting features to surface the latest documents. The relevance tuning functionality allows you to adjust the weightings assigned to various document attributes, such as recency, to prioritize the most recent information. Boosting can also be used to explicitly elevate the ranking of the latest documents, making sure they are prominently displayed in the assistant’s responses. For more information on relevance tuning, refer to Boosting chat responses using relevance tuning.

Additionally, it’s important to review the sync schedule and status for your data connectors. Verifying the sync frequency and the last successful sync run can help identify any issues with data freshness. Adjusting the sync schedule or running manual syncs, as needed, can help keep the data up to date and improve the relevance of the Amazon Q Business responses. For more information, refer to Sync run schedule.

Clean up

To avoid incurring additional costs, clean up the resources you created for this solution. Deleting the Amazon Q application also removes the associated index and data connectors. However, any IAM roles and secrets created during the Amazon Q application setup need to be removed separately.

Complete the following steps to delete the Amazon Q application, secret, and IAM role:

  1. On the Amazon Q Business console, select the application that you created.
  2. On the Actions menu, choose Delete and confirm the deletion.
  3. On the Secrets Manager console, select the secret that was created for the Box connector.
  4. On the Actions menu, choose Delete.
  5. Select the waiting period as 7 days and choose Schedule deletion.
  6. On the IAM console, select the role that was created during the Amazon Q application creation.
  7. Choose Delete and confirm the deletion.
  8. Delete the AWS_Whitepapers folder and its contents from your Box account.
  9. Delete the two demo users that you created in your Box Enterprise account.
  10. On the IAM Identity Center console, choose Users in the navigation pane.
  11. Select the three demo users that you created and choose Delete users to remove these users.

Conclusion

The Amazon Q Box connector allows organizations to seamlessly integrate their Box files into the powerful generative AI capabilities of Amazon Q. By following the steps outlined in this post, you can quickly configure the Box connector as a data source for Amazon Q and initiate synchronization of your Box information. The native field mapping options enable you to customize exactly which Box data to include in Amazon Q’s index.

Amazon Q can serve as a powerful assistant capable of providing rich insights and summaries about your Box files directly from natural language queries.

The Amazon Q Box integration represents a valuable tool for software teams to gain AI-driven visibility into their organization’s document repository. By bridging Box’s industry-leading content management with Amazon’s cutting-edge generative AI, teams can drive productivity, make better informed decisions, and unlock deeper insights into their organization’s knowledge base. As generative AI continues advancing, integrations like this will become critical for organizations aiming to deliver streamlined, data-driven software development lifecycles.

To learn more about the Amazon Q connector for Box, refer to Connecting Box to Amazon Q.


About the Authors

Maran Chandrasekaran is a Senior Solutions Architect at Amazon Web Services, working with our enterprise customers. Outside of work, he loves to travel and ride his motorcycle in Texas Hill Country.

Senthil Kamala Rathinam is a Solutions Architect at Amazon Web Services specializing in data and analytics. He is passionate about helping customers design and build modern data platforms. In his free time, Senthil loves to spend time with his family and play badminton.

Vijai Gandikota is a Principal Product Manager in the Amazon Q and Amazon Kendra organization of Amazon Web Services. He is responsible for the Amazon Q and Amazon Kendra connectors, ingestion, security, and other aspects of the Amazon Q and Amazon Kendra services.

How Twilio generated SQL using Looker Modeling Language data with Amazon Bedrock

This post is co-written with Aishwarya Gupta, Apurva Gawad, and Oliver Cody from Twilio.

Today’s leading companies trust Twilio’s Customer Engagement Platform (CEP) to build direct, personalized relationships with their customers everywhere in the world. Twilio enables companies to use communications and data to add intelligence and security to every step of the customer journey, from sales and marketing to growth, customer service, and many more engagement use cases in a flexible, programmatic way. Across 180 countries, millions of developers and hundreds of thousands of businesses use Twilio to create personalized experiences for their customers. As one of the largest AWS customers, Twilio engages with data, artificial intelligence (AI), and machine learning (ML) services to run their daily workloads.

Data is the foundational layer for all generative AI and ML applications. Managing and retrieving the right information can be complex, especially for data analysts working with large data lakes and complex SQL queries. To address this, Twilio partnered with AWS to develop a virtual assistant that helps their data analysts find and retrieve relevant data from Twilio’s data lake by converting user questions asked in natural language to SQL queries. This virtual assistant tool uses Amazon Bedrock, a fully managed generative AI service that provides access to high-performing foundation models (FMs) and capabilities like Retrieval Augmented Generation (RAG). RAG optimizes language model outputs by extending the models’ capabilities to specific domains or an organization’s internal data for tailored responses.

This post highlights how Twilio enabled natural language-driven data exploration of business intelligence (BI) data with RAG and Amazon Bedrock.

Twilio’s use case

Twilio wanted to provide an AI assistant to help their data analysts find data in their data lake. They used the metadata layer (schema information) over their data lake consisting of views (tables) and models (relationships) from their data reporting tool, Looker, as the source of truth. Looker is an enterprise platform for BI and data applications that helps data analysts explore and share insights in real time.

Twilio implemented RAG using Anthropic Claude 3 on Amazon Bedrock to develop a virtual assistant tool called AskData for their data analysts. This tool converts questions from data analysts asked in natural language (such as “Which table contains customer address information?”) into a SQL query using the schema information available in Looker Modeling Language (LookML) models and views. The analysts can run this generated SQL directly, saving them the time to first identify the tables containing relevant information and then write a SQL query to retrieve the information.

The AskData tool provides ease of use and efficiency to its users:

  • Users need accurate information about the data in a quick and accessible manner to make business decisions. Providing a tool to minimize their time spent finding tables and writing SQL queries allows them to focus more on business outcomes and less on logistical tasks.
  • Users typically reach out to the engineering support channel when they have questions about data that is deeply embedded in the data lake or if they can’t access it using various queries. Having an AI assistant can reduce the engineering time spent in responding to these queries and provide answers more quickly.

Solution overview

In this post, we show you a step-by-step implementation and design of the AskData tool designed to serve as an AI assistant for Twilio’s data analysts. We discuss the following:

  • How to use a RAG approach to retrieve the relevant LookML metadata corresponding to users’ questions with the help of efficient data chunking and indexing and generate SQL queries from natural language
  • How to select the optimal large language model (LLM) for your use case from Amazon Bedrock
  • How analysts can query the data using natural language questions
  • The benefits of using RAG for data analysis, including increased productivity and reduced engineering overhead of finding the data (tables) and writing SQL queries.

This solution uses Amazon Bedrock, Amazon Relational Database Service (Amazon RDS), Amazon DynamoDB, and Amazon Simple Storage Service (Amazon S3). The following diagram illustrates the solution architecture.

The workflow consists of the following steps:

  1. An end-user (data analyst) asks a question in natural language about the data that resides within a data lake.
  2. The question is combined with metadata (schema information) stored in Amazon RDS and conversation history stored in DynamoDB so that retrieval is personalized to the user:
    • The RDS database (PostgreSQL with pgvector) stores the LookML tables and views as embeddings that are retrieved through a vector similarity search.
    • The DynamoDB table stores the previous conversation history with this user.
  3. The context and natural language question are passed to Amazon Bedrock, where an FM (in this case, Anthropic Claude 3 Haiku) responds with a personalized SQL query that the user can use to retrieve accurate information from the data lake. The following is the prompt template that is used for generating the SQL query:
Human: The context information below represents the LookML data for Looker views and models. 
Using this context data, please generate a presto SQL query that will return the correct result for the user's question. 
Please provide a SQL query with the correct syntax, table names, and column names based on the provided LookML data.

<instructions>

1. Use the correct underlying SQL table names (table name in sql_table_name) 
and column names (use column names from the dimensions of the view as they are the correct column names). 
Use the following as an example:

{{example redacted}}

2. Join tables as necessary to get the correct result. 
- Avoid unnecessary joins if not explicitly requested by the user.

3. Avoid unnecessary filters if not explicitly requested by the user.

4. If the view has a derived table, use the derived query to answer question 
using table names and column names from derived query. Use the following as an example:

{{example redacted}}

5. The schema name is represented as <schema>.<table_name> within the LookML views. 
Use the existing schema name or "public" as the schema name if no schema is specified.

</instructions>

This is the chat history from previous messages:

<chat_history>

{chat_history}

</chat_history>

<context>

{context}

</context>

This is the user question:

<question>

{question}

</question>

Assistant: Here is a SQL query for the user question:
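
The placeholders in the template above ({chat_history}, {context}, and {question}) can be filled with ordinary string formatting. The following Python sketch shows one way to do that; it uses an abbreviated stand-in for the full template, and the function name and the format of the history turns are assumptions made for illustration.

```python
# Abbreviated stand-in for the full template shown above (an assumption for brevity).
PROMPT_TEMPLATE = """Human: Using the LookML context, generate a presto SQL query.

<chat_history>
{chat_history}
</chat_history>

<context>
{context}
</context>

<question>
{question}
</question>

A: Here is a SQL query for the user question:"""


def build_prompt(question, lookml_documents, history_turns):
    """Fill the template with retrieved LookML text and prior conversation turns."""
    # history_turns is a list of (role, text) pairs; this shape is hypothetical.
    chat_history = "\n".join(f"{role}: {text}" for role, text in history_turns)
    # Retrieved LookML files are concatenated with a blank line between them.
    context = "\n\n".join(lookml_documents)
    return PROMPT_TEMPLATE.format(
        chat_history=chat_history, context=context, question=question
    )
```

Because `str.format` only interprets braces in the template itself, LookML text that contains braces can be passed as context without escaping.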

The solution comprises four main steps:

  1. Use semantic search on LookML metadata to retrieve the relevant tables and views corresponding to the user questions.
  2. Use FMs on Amazon Bedrock to generate accurate SQL queries based on the retrieved table and view information.
  3. Create a simple web application using LangChain and Streamlit.
  4. Refine your existing application using strategic methods such as prompt engineering, optimizing inference parameters and other LookML content.

Prerequisites

To implement the solution, you should have an AWS account, model access to your choice of FM on Amazon Bedrock, and familiarity with DynamoDB, Amazon RDS, and Amazon S3.

Access to Amazon Bedrock FMs isn’t granted by default. To gain access to an FM, an AWS Identity and Access Management (IAM) user with sufficient permissions needs to request access to it through the Amazon Bedrock console. After access is provided to a model, it is available for the users in the account.

To manage model access, choose Model access in the navigation pane on the Amazon Bedrock console. The model access page lets you view a list of available models, the output modality of the model, whether you have been granted access to it, and the End User License Agreement (EULA). You should review the EULA for terms and conditions of using a model before requesting access to it. For information about model pricing, refer to Amazon Bedrock pricing.

Model access

Structure and index the data

In this solution, we use the RAG approach to retrieve the relevant schema information from LookML metadata corresponding to users’ questions and then generate a SQL query using this information.

This solution uses two separate collections that are created in our vector store: one for Looker views and another for Looker models. We used the sentence-transformers/all-mpnet-base-v2 model for creating vector embeddings and PostgreSQL with pgvector as our vector database. As long as the LookML file doesn’t exceed the context window of the LLM used to generate the final response, we don’t split the file into chunks and instead pass the file in its entirety to the embeddings model. The vector similarity search is able to find the correct files that contain the LookML tables and views relevant to the user’s question. We can pass the entire LookML file contents to the LLM, taking advantage of its large context window, and the LLM is able to pick the schemas for the relevant tables and views to generate the SQL query.

The two subsets of LookML metadata provide distinct types of information about the data lake. Views represent individual tables, and models define the relationships between those tables. By separating these components, we can first retrieve the relevant views based on the user’s question, and then use those results to identify the associated models that capture the relationships between the retrieved views.

This two-step procedure provides a more comprehensive understanding of the relevant tables and their relationships to the user question. The following diagram shows how both subsets of metadata are chunked and stored as embeddings in separate vector collections for enhanced retrieval. The LookML view and model information is brought into Amazon S3 through a separate data pipeline (not shown).

Content ingestion into vector db
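
The two-step procedure described above can be sketched in a few lines of Python. This is a toy illustration, not Twilio's implementation: the embeddings are hand-written three-dimensional vectors, the view and model names are hypothetical, and the second step filters models by the view names they relate instead of running a second similarity search against the models collection.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Toy stand-ins: real embeddings would come from an embeddings model.
VIEWS = {
    "accounts": [0.9, 0.1, 0.0],
    "mailing_addresses": [0.8, 0.3, 0.1],
    "invoices": [0.0, 0.2, 0.9],
}
# Each model lists the views it relates (hypothetical names).
MODELS = {
    "customer_model": {"accounts", "mailing_addresses"},
    "billing_model": {"accounts", "invoices"},
}


def retrieve_views(query_embedding, k=2):
    """Step 1: rank views by similarity to the question and keep the top k."""
    ranked = sorted(
        VIEWS, key=lambda name: cosine(query_embedding, VIEWS[name]), reverse=True
    )
    return ranked[:k]


def retrieve_models(view_names):
    """Step 2: keep models that relate at least two of the retrieved views."""
    wanted = set(view_names)
    return [name for name, views in MODELS.items() if len(views & wanted) >= 2]


def two_step_retrieve(query_embedding, k=2):
    views = retrieve_views(query_embedding, k)
    return views, retrieve_models(views)
```

For a question embedding close to the two customer-related views, the sketch returns those views together with the one model that joins them.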

Select the optimal LLM for your use case

Selecting the right LLM for any use case is essential. Every use case has different requirements for context length, token size, and the ability to handle various tasks like summarization, task completion, chatbot applications, and so on. Amazon Bedrock is a fully managed service that offers a choice of high-performing FMs from leading AI companies like AI21 Labs, Anthropic, Cohere, Meta, Mistral, Stability AI, and Amazon within a single API, along with a broad set of capabilities to build generative AI applications with security, privacy, and responsible AI.

This solution is implemented using Anthropic Claude 3, available through Amazon Bedrock. Anthropic Claude 3 is chosen for two main reasons:

  • Increased context window – Anthropic Claude 3 can handle up to 200,000 tokens in its context, allowing for processing larger LookML queries and tables. This expanded capacity is crucial when dealing with complex or extensive data, so the LLM has access to the necessary information for accurate and informed responses to the user.
  • Enhanced reasoning abilities – Anthropic Claude 3 demonstrates enhanced performance when working with larger contexts, enabling it to better understand and respond to user queries that require a deeper comprehension of the views, models, and their relationships. You can gain granular control over the reasoning capabilities using several prompt engineering techniques.

Build a web application

This solution uses LangChain and Streamlit to build a web application and integrate Amazon Bedrock into it. LangChain is a framework specifically designed to simplify the creation of applications using LLMs, and it’s straightforward to use Amazon Bedrock through LangChain using the Amazon Bedrock component available in LangChain. We use Streamlit to develop the frontend for this web application.

For data analysts to effortlessly interact with and get queries to extract relevant data from their data lake, this solution implements a chat engine using the ConversationalRetrievalChain mechanism, which enables you to pass a custom vector store retriever, prompt, and conversation history to the LLM and generate personalized answers to user questions. To store the chat history, we use DynamoDB with the user session ID as the primary key. DynamoDB is a highly scalable and durable NoSQL database service, enabling you to efficiently store and retrieve chat histories for multiple user sessions concurrently. The following screenshot shows an example of the chat interface developed using Streamlit.

AskData user interface

The following is an example scenario to get an idea of the user workflow when interacting with AskData:

User Question - how do I find a customer address?

AskData Response - Based on the provided LookML data, 
you can find the customer address by querying the raw_mailing_addresses table, 
which contains address-related information. 
Here is the SQL query to retrieve the customer address:

SELECT
    account,
    address1,
    address2,
    city,
    province,
    postal_code,
    iso_country
FROM
    raw_mailing_addresses

This query will return the account SID (which can be used to identify the customer), the first line of the address, the second line of the address (if any), the city, the province or state, the postal code, and the ISO country code for all entries in the raw_mailing_addresses table. If you have a specific customer’s account and you want to retrieve the address for that customer, you can add a WHERE clause to the query:

SELECT
    account,
    address1,
    address2,
    city,
    province,
    postal_code,
    iso_country
FROM
    raw_mailing_addresses
WHERE
    account = 'YourCustomerAccount' -- Replace with the actual account

Replace 'YourCustomerAccount' with the actual account of the customer whose address you want to find.

Optimize the application

Although using an LLM to answer user questions about data is efficient, it comes with recognized limitations, such as the tendency of the LLM to generate inaccurate responses, often due to hallucinated information. To enhance the accuracy of our application and reduce hallucinations, we did the following:

  • Set the temperature for the LLM to 0.1 to reduce the LLM’s propensity for overly creative responses.
  • Added instructions in the prompt to only generate the SQL query based on the context (schema, chat history) being provided in the prompt.
  • Meticulously removed duplicate and redundant entries from the LookML data before it was ingested into the vector database.
  • Added a user experience feedback (a rating from 1–5 with an optional text input for comments) as part of the UI of AskData. We used the feedback to improve the quality of our data, prompts, and inference parameter settings.

Based on user feedback, the application achieved a net promoter score (NPS) of 40, surpassing the initial target score of 35. We set this target due to the following key factors: the lack of relevant information for specific user questions within the LookML data, specific rules related to the structure of SQL queries that might need to be added, and the expectation that sometimes the LLM would make a mistake in spite of all the measures we put in place.

Conclusion

In this post, we illustrated how to use generative AI to significantly enhance the efficiency of data analysts. By using LookML as metadata for our data lake, we constructed vector stores for views (tables) and models (relationships). With the RAG framework, we efficiently retrieved pertinent information from these stores and provided it as context to the LLM alongside user queries and any previous chat history. The LLM then seamlessly generated SQL queries in response.

Our development process was streamlined thanks to various AWS services, particularly Amazon Bedrock, which facilitated the integration of the LLM for query responses, and Amazon RDS, which served as our vector store.

Get started with Amazon Bedrock today, and leave your feedback and questions in the comments section.


About the Authors

Apurva Gawad is a Senior Data Engineer at Twilio specializing in building scalable systems for data ingestion and empowering business teams to derive valuable insights from data. She has a keen interest in AI exploration, blending technical expertise with a passion for innovation. Outside of work, she enjoys traveling to new places, always seeking fresh experiences and perspectives.

Aishwarya Gupta is a Senior Data Engineer at Twilio focused on building data systems to empower business teams to derive insights. She enjoys traveling and exploring new places, foods, and cultures.

Oliver Cody is a Senior Data Engineering Manager at Twilio with over 28 years of professional experience, leading multidisciplinary teams across EMEA, NAMER, and India. His experience spans all things data across various domains and sectors. He has focused on developing innovative data solutions, significantly optimizing performance and reducing costs.

Amit Arora is an AI and ML specialist architect at Amazon Web Services, helping enterprise customers use cloud-based machine learning services to rapidly scale their innovations. He is also an adjunct lecturer in the MS data science and analytics program at Georgetown University in Washington D.C.

Johnny Chivers is a Senior Solutions Architect working within the Strategic Accounts team at AWS. With over 10 years of experience helping customers adopt new technologies, he guides them through architecting end-to-end solutions spanning infrastructure, big data, and AI.

Improve AI assistant response accuracy using Knowledge Bases for Amazon Bedrock and a reranking model

AI chatbots and virtual assistants have become increasingly popular in recent years thanks to the breakthroughs of large language models (LLMs). Trained on large volumes of data, these models incorporate memory components in their architectural design, allowing them to understand textual context.

Most common use cases for chatbot assistants focus on a few key areas, including enhancing customer experiences, boosting employee productivity and creativity, or optimizing business processes. Examples include customer support, troubleshooting, and internal and external knowledge-based search.

Despite these capabilities, a key challenge with chatbots is generating high-quality and accurate responses. One way of solving this challenge is to use Retrieval Augmented Generation (RAG). RAG is the process of optimizing the output of an LLM so it references an authoritative knowledge base outside of its training data sources before generating a response. Reranking seeks to improve search relevance by reordering the result set returned by a retriever with a different model. In this post, we explain how two techniques—RAG and reranking—can help improve chatbot responses using Knowledge Bases for Amazon Bedrock.

Solution overview

RAG is a technique that combines the strengths of knowledge base retrieval and generative models for text generation. It works by first retrieving relevant responses from a database, then using those responses as context to feed the generative model to produce a final output. Using a RAG approach for building a chatbot has many advantages. For example, retrieving responses from its database before generating a response could provide more relevant and coherent responses. This helps improve the conversational flow. RAG also scales better with more data compared to pure generative models, and it doesn’t require fine-tuning of the model when new data is added to the knowledge base. Additionally, the retrieval component enables the model to incorporate external knowledge by retrieving relevant background information from its database. This approach helps provide factual, in-depth, and knowledgeable responses.

To find an answer, RAG takes an approach that uses vector search across the documents. The advantage of using vector search is speed and scalability. Rather than scanning every single document to find the answer, with the RAG approach, you turn the texts (knowledge base) into embeddings and store these embeddings in the database. The embeddings are a compressed version of the documents, represented by an array of numerical values. After the embeddings are stored, the vector search queries the vector database to find the similarity based on the vectors associated with the documents. Typically, a vector search returns the top k most relevant documents for the user question. However, because the similarity algorithm in a vector database works on vectors and not documents, vector search doesn’t always return the most relevant information in the top k results. This directly impacts the accuracy of the response if the most relevant contexts aren’t available to the LLM.

Reranking is a technique that can further improve the responses by selecting the best option out of several candidate responses. The following architecture illustrates how a reranking solution could work.

Architecture diagram for Reranking model integration with Knowledge Bases for Bedrock
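
To make the second stage concrete, the following Python sketch reorders the passages returned by a retriever using a separate scoring function. The `token_overlap` scorer is a deliberately simple stand-in chosen so the example runs without any model; in practice the score would come from a dedicated reranking model, and both function names are hypothetical.

```python
def rerank(question, candidates, score_fn, top_n=3):
    """Reorder retrieved passages by a second-stage relevance score.

    Ties keep the retriever's original order because Python's sort is stable.
    """
    scored = [(score_fn(question, text), text) for text in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:top_n]]


def token_overlap(question, passage):
    """Toy relevance score: fraction of question tokens present in the passage."""
    q_tokens = set(question.lower().split())
    p_tokens = set(passage.lower().split())
    return len(q_tokens & p_tokens) / len(q_tokens) if q_tokens else 0.0
```

The retriever can fetch more candidates than the LLM needs, and `top_n` then trims the reordered list to the passages that are passed on as context.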

Let’s create a question answering solution, where we ingest The Great Gatsby, a 1925 novel by American writer F. Scott Fitzgerald. This book is publicly available through Project Gutenberg. We use Knowledge Bases for Amazon Bedrock to implement the end-to-end RAG workflow and ingest the embeddings into an Amazon OpenSearch Serverless vector search collection. We then retrieve answers using standard RAG and a two-stage RAG, which involves a reranking API. We then compare results from these two methods.

The code sample is available in this GitHub repo.

In the following sections, we walk through the high-level steps:

  1. Prepare the dataset.
  2. Generate questions from the document using an Amazon Bedrock LLM.
  3. Create a knowledge base that contains this book.
  4. Retrieve answers using the knowledge base retrieve API.
  5. Evaluate the response using the RAGAS framework.
  6. Retrieve answers again by running a two-stage RAG, using the knowledge base retrieve API and then applying reranking on the context.
  7. Evaluate the two-stage RAG response using the RAGAS framework.
  8. Compare the results and the performance of each RAG approach.

For efficiency purposes, we provide sample code in a notebook that generates a set of questions and answers. These Q&A pairs are used in the RAG evaluation process. We highly recommend having a human validate each question and answer for accuracy.

The following sections explain the major steps with the help of code blocks.

Prerequisites

To clone the GitHub repository to your local machine, open a terminal window and run the following commands:

git clone https://github.com/aws-samples/amazon-bedrock-samples
cd amazon-bedrock-samples/knowledge-bases/features-examples/03-advanced-concepts/reranking

Prepare the dataset

Download the book from the Project Gutenberg website. For this post, we create 10 large documents from this book and upload them to Amazon Simple Storage Service (Amazon S3):

import math
import urllib.request

import boto3
import sagemaker

target_url = "https://www.gutenberg.org/ebooks/64317.txt.utf-8" # the great gatsby
data = urllib.request.urlopen(target_url)
my_texts = []
for line in data:
    my_texts.append(line.decode())

doc_size = 700 # size of the document to determine number of batches
batches = math.ceil(len(my_texts) / doc_size)

sagemaker_session = sagemaker.Session()
default_bucket = sagemaker_session.default_bucket()
s3_prefix = "bedrock/knowledgebase/datasource"

start = 0
s3 = boto3.client("s3")
for batch in range(batches):
    batch_text_arr = my_texts[start:start+doc_size]
    batch_text = "".join(batch_text_arr)
    s3.put_object(
        Body=batch_text,
        Bucket=default_bucket,
        Key=f"{s3_prefix}/{start}.txt"
    )
    start += doc_size  

Create Knowledge Base for Bedrock

If you’re new to using Knowledge Bases for Amazon Bedrock, refer to Knowledge Bases for Amazon Bedrock now supports Amazon Aurora PostgreSQL and Cohere embedding models, where we described how Knowledge Bases for Amazon Bedrock manages the end-to-end RAG workflow.

In this step, you create a knowledge base using a Boto3 client. You use Amazon Titan Text Embeddings V2 to convert the documents into embeddings (embeddingModelArn) and point to the S3 bucket you created earlier as the data source (dataSourceConfiguration):

bedrock_agent = boto3.client("bedrock-agent")
response = bedrock_agent.create_knowledge_base(
    name=knowledge_base_name,
    description='Knowledge Base for Bedrock',
    roleArn=role_arn,
    knowledgeBaseConfiguration={
        'type': 'VECTOR',
        'vectorKnowledgeBaseConfiguration': {
            'embeddingModelArn': embedding_model_arn
        }
    },
    storageConfiguration={
        'type': 'OPENSEARCH_SERVERLESS',
        'opensearchServerlessConfiguration': {
            'collectionArn': collection_arn,
            'vectorIndexName': index_name,
            'fieldMapping': {
                'vectorField':  "bedrock-knowledge-base-default-vector",
                'textField': 'AMAZON_BEDROCK_TEXT_CHUNK',
                'metadataField': 'AMAZON_BEDROCK_METADATA'
            }
        }
    }
)
knowledge_base_id = response['knowledgeBase']['knowledgeBaseId']
knowledge_base_name = response['knowledgeBase']['name']

response = bedrock_agent.create_data_source(
    knowledgeBaseId=knowledge_base_id,
    name=f"{knowledge_base_name}-ds",
    dataSourceConfiguration={
        'type': 'S3',
        's3Configuration': {
            'bucketArn': f"arn:aws:s3:::{default_bucket}",
            'inclusionPrefixes': [
                f"{s3_prefix}/",
            ]
        }
    },
    vectorIngestionConfiguration={
        'chunkingConfiguration': {
            'chunkingStrategy': 'FIXED_SIZE',
            'fixedSizeChunkingConfiguration': {
                'maxTokens': 300,
                'overlapPercentage': 10
            }
        }
    }
)
data_source_id = response['dataSource']['dataSourceId']

response = bedrock_agent.start_ingestion_job(
    knowledgeBaseId=knowledge_base_id,
    dataSourceId=data_source_id,
)
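
Ingestion runs asynchronously, so the knowledge base is not ready to query until the job finishes. The following sketch polls the job status with the `get_ingestion_job` call of the same `bedrock-agent` client; the helper name, polling interval, and retry limit are assumptions for illustration.

```python
import time


def wait_for_ingestion(client, knowledge_base_id, data_source_id, ingestion_job_id,
                       poll_seconds=10, max_polls=60):
    """Poll an ingestion job until it reaches a terminal status."""
    for _ in range(max_polls):
        job = client.get_ingestion_job(
            knowledgeBaseId=knowledge_base_id,
            dataSourceId=data_source_id,
            ingestionJobId=ingestion_job_id,
        )["ingestionJob"]
        # COMPLETE, FAILED, and STOPPED are treated as terminal states here.
        if job["status"] in ("COMPLETE", "FAILED", "STOPPED"):
            return job["status"]
        time.sleep(poll_seconds)
    raise TimeoutError("ingestion job did not finish in time")
```

With the code above, the job ID would come from `response['ingestionJob']['ingestionJobId']`, and the call would be `wait_for_ingestion(bedrock_agent, knowledge_base_id, data_source_id, job_id)`.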

Generate questions from the document

We use Anthropic Claude on Amazon Bedrock to generate a list of 10 questions and the corresponding answers. The Q&A data serves as the foundation for the RAG evaluation based on the approaches that we are going to implement. We define the generated answers from this step as ground truth data. See the following code:

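import re

import boto3

# model_id and documents are assumed to be defined earlier in the notebook
bedrock_runtime = boto3.client("bedrock-runtime")

prompt_template = """The question should be diverse in nature 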
across the document. The question should not contain options, not start with Q1/ Q2. 
Restrict the question to the context information provided.

<document>
{{document}}
</document>

Think step by step and pay attention to the number of question to create.

Your response should follow the format as followed:

Question: question
Answer: answer

"""
system_prompt = """You are a professor. Your task is to setup 1 question for an upcoming 
quiz/examination based on the given document wrapped in <document></document> XML tag."""

prompt = prompt_template.replace("{{document}}", documents)
temperature = 0.9
top_k = 250
messages = [{"role": "user", "content": [{"text": prompt}]}]
# Base inference parameters to use.
inference_config = {"temperature": temperature, "maxTokens": 512, "topP": 1.0}
# Additional inference parameters to use.
additional_model_fields = {"top_k": top_k}

# Send the message.
response = bedrock_runtime.converse(
    modelId=model_id,
    messages=messages,
    system=[{"text": system_prompt}],
    inferenceConfig=inference_config,
    additionalModelRequestFields=additional_model_fields
)
print(response['output']['message']['content'][0]['text'])
result = response['output']['message']['content'][0]['text']
q_pos = [(a.start(), a.end()) for a in list(re.finditer("Question:", result))]
a_pos = [(a.start(), a.end()) for a in list(re.finditer("Answer:", result))]
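
The positions computed above mark where each question and answer begins in the model output. As an alternative sketch, the pairs can be extracted directly with a single regular expression, assuming the model follows the "Question: ... Answer: ..." format requested in the prompt; the pattern and function name below are illustrative and not taken from the sample notebook.

```python
import re

# Lazily capture the question up to "Answer:", then the answer up to the next
# "Question:" or the end of the text.
QA_PATTERN = re.compile(
    r"Question:\s*(?P<question>.+?)\s*Answer:\s*(?P<answer>.+?)(?=\s*(?:Question:|\Z))",
    re.DOTALL,
)


def parse_qa_pairs(text):
    """Return a list of (question, answer) tuples found in the model output."""
    return [
        (match.group("question"), match.group("answer"))
        for match in QA_PATTERN.finditer(text)
    ]
```

Output that deviates from the requested format yields an empty list, which is a useful signal to retry the generation or review it manually.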

Retrieve answers using the knowledge base APIs

We use the generated questions and retrieve answers from the knowledge base using the retrieve and converse APIs:

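# Runtime client used to query the knowledge base
agent_runtime = boto3.client("bedrock-agent-runtime")

contexts = []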
answers = []

for question in questions:
    response = agent_runtime.retrieve(
        knowledgeBaseId=knowledge_base_id,
        retrievalQuery={
            'text': question
        },
        retrievalConfiguration={
            'vectorSearchConfiguration': {
                'numberOfResults': topk
            }
        }
    )
    
    retrieval_results = response['retrievalResults']
    local_contexts = []
    for result in retrieval_results:
        local_contexts.append(result['content']['text'])
    contexts.append(local_contexts)
    combined_docs = "\n".join(local_contexts)
    prompt = llm_prompt_template.replace("{{documents}}", combined_docs)
    prompt = prompt.replace("{{query}}", question)
    temperature = 0.9
    top_k = 250
    messages = [{"role": "user", "content": [{"text": prompt}]}]
    # Base inference parameters to use.
    inference_config = {"temperature": temperature, "maxTokens": 512, "topP": 1.0}
    # Additional inference parameters to use.
    additional_model_fields = {"top_k": top_k}

    # Send the message.
    response = bedrock_runtime.converse(
        modelId=model_id,
        messages=messages,
        inferenceConfig=inference_config,
        additionalModelRequestFields=additional_model_fields
    )
    answers.append(response['output']['message']['content'][0]['text'])

Evaluate the RAG response using the RAGAS framework

We now evaluate the effectiveness of the RAG using a framework called RAGAS. The framework provides a suite of metrics to evaluate different dimensions. In our example, we evaluate responses based on the following dimensions:

  • Answer relevancy – This metric focuses on assessing how pertinent the generated answer is to the given prompt. A lower score is assigned to answers that are incomplete or contain redundant information. This metric is computed using the question and the answer, with values ranging between 0–1, where higher scores indicate better relevancy.
  • Answer similarity – This assesses the semantic resemblance between the generated answer and the ground truth. This evaluation is based on the ground truth and the answer, with values falling within the range of 0–1. A higher score signifies a better alignment between the generated answer and the ground truth.
  • Context relevancy – This metric gauges the relevancy of the retrieved context, calculated based on both the question and contexts. The values fall within the range of 0–1, with higher values indicating better relevancy.
  • Answer correctness – The assessment of answer correctness involves gauging the accuracy of the generated answer when compared to the ground truth. This evaluation relies on the ground truth and the answer, with scores ranging from 0–1. A higher score indicates a closer alignment between the generated answer and the ground truth, signifying better correctness.

The following is a summarized report for the standard RAG approach based on the RAGAS evaluation:

answer_relevancy: 0.9006225160334027

answer_similarity: 0.7400904157096762

answer_correctness: 0.32703043056663855

context_relevancy: 0.024797687553157175

Two-stage RAG: Retrieve and rerank

Now that you have the results with the retrieve_and_generate API, let’s explore the two-stage retrieval approach by extending the standard RAG approach to integrate with a reranking model. In the context of RAG, a reranking model is applied after the retriever has returned an initial set of contexts. The reranking model takes in the list of results and reranks each one based on the similarity between the context and the user query. In our example, we use a powerful reranking model called bge-reranker-large. The model is available in the Hugging Face Hub and is also free for commercial use. In the following code, we use the knowledge base’s retrieve API so we can get a handle on the context, and rerank it using the reranking model deployed as an Amazon SageMaker endpoint. We provide the sample code for deploying the reranking model in SageMaker in the GitHub repository. Here’s a code snippet that demonstrates the two-stage retrieval process:

def generate_two_stage_context_answers(bedrock_runtime, 
                                       agent_runtime, 
                                       model_id, 
                                       knowledge_base_id, 
                                       retrieval_topk, 
                                       reranking_model, 
                                       questions, 
                                       rerank_top_k=3):
    contexts = []
    answers = []
    predictor = Predictor(endpoint_name=reranking_model, serializer=JSONSerializer(), deserializer=JSONDeserializer())
    for question in questions:
        retrieval_results = two_stage_retrieval(agent_runtime, knowledge_base_id, question, retrieval_topk, predictor, rerank_top_k)
        local_contexts = []
        documents = []
        for result in retrieval_results:
            local_contexts.append(result)

        contexts.append(local_contexts)
        combined_docs = "\n".join(local_contexts)
        prompt = llm_prompt_template.replace("{{documents}}", combined_docs)
        prompt = prompt.replace("{{query}}", question)
        temperature = 0.9
        top_k = 250
        messages = [{"role": "user", "content": [{"text": prompt}]}]
        inference_config = {"temperature": temperature, "maxTokens": 512, "topP": 1.0}
        additional_model_fields = {"top_k": top_k}
        
        response = bedrock_runtime.converse(
            modelId=model_id,
            messages=messages,
            inferenceConfig=inference_config,
            additionalModelRequestFields=additional_model_fields
        )
        answers.append(response['output']['message']['content'][0]['text'])
    return contexts, answers
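The two_stage_retrieval helper called above lives in the GitHub repository and is not reproduced here. As a rough sketch of the reranking idea only, the following function scores each retrieved context against the question and keeps the highest-scoring ones. The names `rerank_contexts` and `word_overlap` are hypothetical, and the word-overlap scorer is a stand-in for the call to the reranking endpoint, not how bge-reranker-large computes relevance.

```python
from typing import Callable, List

def rerank_contexts(
    question: str,
    contexts: List[str],
    score_fn: Callable[[str, str], float],
    top_k: int = 3,
) -> List[str]:
    """Score each context against the question and return the top_k best."""
    scored = [(score_fn(question, context), context) for context in contexts]
    # Highest score first; Python's sort is stable, so ties keep retrieval order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [context for _, context in scored[:top_k]]

def word_overlap(question: str, context: str) -> float:
    """Toy relevance score (assumption): number of shared lowercase words."""
    return float(len(set(question.lower().split()) & set(context.lower().split())))

retrieved = [
    "reranking reorders results",
    "bananas are yellow",
    "what is reranking in RAG",
]
top_contexts = rerank_contexts("what is reranking", retrieved, word_overlap, top_k=2)
```

In a real setup, `score_fn` would wrap the SageMaker predictor so that the first stage retrieves a larger candidate set (retrieval_topk) and the second stage narrows it to rerank_top_k contexts.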

Evaluate the two-stage RAG response using the RAGAS framework

We evaluate the answers generated by the two-stage retrieval process. The following is a summarized report based on RAGAS evaluation:

answer_relevancy: 0.841581671275458

answer_similarity: 0.7961827348349313

answer_correctness: 0.43361356731293665

context_relevancy: 0.06049484724216884

Compare the results

Let’s compare the results from our tests. As shown in the following figure, adding the reranking step improves context relevancy, answer correctness, and answer similarity, which are important for improving the accuracy of the RAG process. Answer relevancy decreases slightly, from about 0.90 to 0.84.


RAG vs Two Stage Retrieval evaluation metrics
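To make the comparison concrete, the following sketch computes the relative change of each metric between the two reports above. The scores are copied from the summarized reports; the helper name `relative_change` is our own.

```python
# Scores copied from the two summarized RAGAS reports above.
standard = {
    "answer_relevancy": 0.9006225160334027,
    "answer_similarity": 0.7400904157096762,
    "answer_correctness": 0.32703043056663855,
    "context_relevancy": 0.024797687553157175,
}
two_stage = {
    "answer_relevancy": 0.841581671275458,
    "answer_similarity": 0.7961827348349313,
    "answer_correctness": 0.43361356731293665,
    "context_relevancy": 0.06049484724216884,
}

def relative_change(before: float, after: float) -> float:
    """Return the change from before to after as a percentage of before."""
    return (after - before) / before * 100

changes = {name: relative_change(standard[name], two_stage[name]) for name in standard}
for name, pct in changes.items():
    print(f"{name}: {pct:+.1f}%")
```

With these numbers, answer correctness rises by roughly a third and context relevancy more than doubles, while answer relevancy drops by a few percent. Because context relevancy starts from a very small absolute value, its large relative change should be read with care.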

We also measured the RAG latency for both approaches. The results are shown in the following metrics and the corresponding chart:

Standard RAG latency: 76.59s

Two Stage Retrieval latency: 312.12s


Latency metric for RAG and Two Stage Retrieval process

In summary, using a reranking model (bge-reranker-large) hosted on an ml.m5.xlarge instance yields approximately four times the latency compared to the standard RAG approach. We recommend testing with different reranking model variants and instance types to obtain the optimal performance for your use case.

Conclusion

In this post, we demonstrated how to implement a two-stage retrieval process by integrating a reranking model. We explored how integrating a reranking model with Knowledge Bases for Amazon Bedrock can provide better performance. Finally, we used RAGAS, an open source framework, to provide context relevancy, answer relevancy, answer similarity, and answer correctness metrics for both approaches.

Try out this retrieval process today, and share your feedback in the comments.


About the Authors

Wei Teh is a Machine Learning Solutions Architect at AWS. He is passionate about helping customers achieve their business objectives using cutting-edge machine learning solutions. Outside of work, he enjoys outdoor activities like camping, fishing, and hiking with his family.

Pallavi Nargund is a Principal Solutions Architect at AWS. In her role as a cloud technology enabler, she works with customers to understand their goals and challenges, and give prescriptive guidance to achieve their objective with AWS offerings. She is passionate about women in technology and is a core member of Women in AI/ML at Amazon. She speaks at internal and external conferences such as AWS re:Invent, AWS Summits, and webinars. Outside of work she enjoys volunteering, gardening, cycling and hiking.

Qingwei Li is a Machine Learning Specialist at Amazon Web Services. He received his Ph.D. in Operations Research after he broke his advisor’s research grant account and failed to deliver the Nobel Prize he promised. Currently he helps customers in the financial service and insurance industry build machine learning solutions on AWS. In his spare time, he likes reading and teaching.

Mani Khanuja is a Tech Lead – Generative AI Specialists, author of the book Applied Machine Learning and High Performance Computing on AWS, and a member of the Board of Directors for Women in Manufacturing Education Foundation Board. She leads machine learning projects in various domains such as computer vision, natural language processing, and generative AI. She speaks at internal and external conferences such as AWS re:Invent, Women in Manufacturing West, YouTube webinars, and GHC 23. In her free time, she likes to go for long runs along the beach.


Automate the machine learning model approval process with Amazon SageMaker Model Registry and Amazon SageMaker Pipelines


Innovations in artificial intelligence (AI) and machine learning (ML) are causing organizations to take a fresh look at the possibilities these technologies can offer. As you aim to bring your proofs of concept to production at an enterprise scale, you may experience challenges aligning with the strict security compliance requirements of your organization. In the face of these challenges, MLOps offers an important path to shorten your time to production while increasing confidence in the quality of deployed workloads by automating governance processes.

ML models in production are not static artifacts. They reflect the environment where they are deployed and, therefore, require comprehensive monitoring mechanisms for model quality, bias, and feature importance. Organizations often want to introduce additional compliance checks that validate that the model aligns with their organizational standards before it is deployed. These frequent manual checks can create long lead times to deliver value to customers. Automating these checks allows them to be repeated regularly and consistently, so organizations don’t have to rely on infrequent manual point-in-time checks.

This post illustrates how to use common architecture principles to transition from a manual monitoring process to one that is automated. You can use these principles and existing AWS services such as Amazon SageMaker Model Registry and Amazon SageMaker Pipelines to deliver innovative solutions to your customers while maintaining compliance for your ML workloads.

Challenge

As AI becomes ubiquitous, it’s increasingly used to process information and interact with customers in a sensitive context. Suppose a tax agency is interacting with its users through a chatbot. It’s important that this new system aligns with organizational guidelines by allowing developers to have a high degree of confidence that it responds accurately and without bias. At maturity, an organization may have tens or even hundreds of models in production. How can you make sure every model is properly vetted before it’s deployed and on each deployment?

Traditionally, organizations have created manual review processes to keep updated code from becoming available to the public through mechanisms such as an Enterprise Review Committee (ERC), Enterprise Review Board (ERB), or a Change Advisory Board (CAB).

Just as mechanisms have evolved with the rise of continuous integration and continuous delivery (CI/CD), MLOps can reduce the need for manual processes while increasing the frequency and thoroughness of quality checks. Through automation, you can scale in-demand skillsets, such as model and data analysis, introducing and enforcing in-depth analysis of your models at scale across diverse product teams.

In this post, we use SageMaker Pipelines to define the required compliance checks as code. This allows you to introduce analysis of arbitrary complexity while not being limited by the busy schedules of highly technical individuals. Because the automation takes care of repetitive analytics tasks, technical resources can focus on relentlessly improving the quality and thoroughness of the MLOps pipeline to improve compliance posture, and make sure checks are performing as expected.

Deployment of an ML model to production generally requires at least two artifacts to be approved: the model and the endpoint. In our example, the organization is willing to approve a model for deployment if it passes their checks for model quality, bias, and feature importance prior to deployment. Secondly, the endpoint can be approved for production if it performs as expected when deployed into a production-like environment. In a subsequent post, we walk you through how to deploy a model and implement sample compliance checks. In this post, we discuss how you can extend this process to large language models (LLMs), which produce a varied set of outputs and introduce complexities regarding automated quality assurance checks.

Aligning with AWS multi-account best practices

The solution outlined in this post spans across several accounts in a given AWS organization. For a deeper look at the various components required for an AWS organization multi-account enterprise ML environment, see MLOps foundation roadmap for enterprises with Amazon SageMaker. In this post, we refer to the advanced analytics governance account as the AI/ML governance account. We focus on the development of the enforcement mechanism for the centralized automated model approval within this account.

This account houses centralized components such as a model registry on SageMaker Model Registry, ML project templates on SageMaker Projects, model cards on Amazon SageMaker Model Cards, and container images on Amazon Elastic Container Registry (Amazon ECR).

We use an isolated environment (in this case, a separate AWS environment) to deploy and promote across various environments. You can modify the strategies discussed in this post along the spectrum of centralized vs. decentralized depending on the posture of your organization. For this example, we provide a centralized model. You can also extend this model to align with strict compliance requirements. For example, the AI/ML governance team trusts that the development teams are sending the correct bias and explainability reports for a given model. Additional checks could be included to “trust but verify” and further bolster the posture of this organization. Additional complexities such as this are not addressed in this post. To dive further into the topic of MLOps secure implementations, refer to Amazon SageMaker MLOps: from idea to production in six steps.

Solution overview

The following diagram illustrates the solution architecture using SageMaker Pipelines to automate model approval.

A node, RegisteredModelValidationStep on top pointing to the second node below, UpdateModelStatusStep

The workflow comprises a comprehensive process for model building, training, evaluation, and approval within an organization containing different AWS accounts, integrating various AWS services. The detailed steps are as follows:

  1. Data scientists from the product team use Amazon SageMaker Studio to create Jupyter notebooks used to facilitate data preprocessing and model pre-building. The code is committed to AWS CodeCommit, a managed source control service. Optionally, you can commit to third-party version control systems such as GitHub, GitLab, or Enterprise Git.
  2. The commit to CodeCommit invokes the SageMaker pipeline, which runs several steps, including model building and training, and running processing jobs using Amazon SageMaker Clarify to generate bias and explainability reports.
    • SageMaker Clarify processes and stores its outputs, including model artifacts and reports in JSON format, in an Amazon Simple Storage Service (Amazon S3) bucket.
    • A model is registered in the SageMaker model registry with a model version.
  3. The Amazon S3 PUT action invokes an AWS Lambda function.
  4. This Lambda function copies all the artifacts from the S3 bucket in the development account to another S3 bucket in the AI/ML governance account, providing restricted access and data integrity. This post assumes your accounts and S3 buckets are in the same AWS Region. For cross-Region copying, see Copy data from an S3 bucket to another account and Region by using the AWS CLI.
  5. Registering the model invokes a default Amazon CloudWatch event associated with SageMaker model registry actions.
  6. The CloudWatch event is consumed by Amazon EventBridge, which invokes another Lambda function.
  7. This Lambda function is tasked with starting the SageMaker approval pipeline.
  8. The SageMaker approval pipeline evaluates the artifacts against predefined benchmarks to determine if they meet the approval criteria.
  9. Based on the evaluation, the pipeline updates the model status to approved or rejected accordingly.

This workflow provides a robust, automated process for model approval using AWS’s secure, scalable infrastructure and services. Each step is designed to make sure that only models meeting the set criteria are approved, maintaining high standards for model performance and fairness.
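As an illustration of steps 3 and 4, the following is a minimal sketch of the copy logic such a Lambda function might contain. It assumes the standard S3 event notification shape (`Records[].s3.bucket.name` and `Records[].s3.object.key`). The destination bucket name and the function name `copy_artifacts` are hypothetical, and the S3 client is passed in as a parameter so the logic can be exercised without AWS access. In a deployed function, you would pass `boto3.client("s3")` and grant the function cross-account permissions on both buckets.

```python
from urllib.parse import unquote_plus

# Hypothetical bucket in the AI/ML governance account (assumption).
DEST_BUCKET = "governance-artifacts-bucket"

def copy_artifacts(event, s3_client, dest_bucket=DEST_BUCKET):
    """Copy every object referenced in an S3 PUT event to dest_bucket."""
    copied = []
    for record in event.get("Records", []):
        src_bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 event notifications.
        key = unquote_plus(record["s3"]["object"]["key"])
        s3_client.copy_object(
            Bucket=dest_bucket,
            Key=key,
            CopySource={"Bucket": src_bucket, "Key": key},
        )
        copied.append(key)
    return copied
```

Keeping the same key in both buckets means the report locations recorded in the model package metadata can be resolved in the governance account by swapping only the bucket name.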

Prerequisites

To implement this solution, you need to first create and register an ML model in the SageMaker model registry with the necessary SageMaker Clarify artifacts. You can create and run the pipeline by following the example provided in the following GitHub repository.

The following sections assume that a model package version has been registered with status Pending Manual Approval. This status allows you to build an approval workflow. You can either have a manual approver or set up an automated approval workflow based on metrics checks in the aforementioned reports.

Build your pipeline

SageMaker Pipelines allows you to define a series of interconnected steps defined as code using the Pipelines SDK. You can extend the pipeline to help meet your organizational needs with both automated and manual approval steps. In this example, we build the pipeline to include two major steps. The first step evaluates artifacts uploaded to the AI/ML governance account by the model build pipeline against threshold values set by model registry administrators for model quality, bias, and feature importance. The second step receives the evaluation and updates the model’s status and metadata based on the values received. The pipeline is represented in SageMaker Pipelines by the following DAG.

A node, RegisteredModelValidationStep on top pointing to the second node below, UpdateModelStatusStep

Next, we dive into the code required for the pipeline and its steps. First, we define a pipeline session to help manage AWS service integration as we define our pipeline. This can be done as follows:

from sagemaker.workflow.pipeline_context import PipelineSession
pipeline_session = PipelineSession()

Each step runs as a SageMaker Processor for which we specify a small instance type due to the minimal compute requirements of our pipeline. The processor can be defined as follows:

from sagemaker.processing import Processor

step_processor = Processor(
    image_uri=image_uri,
    role=role,
    instance_type="ml.t3.medium",
    base_job_name=base_job_name,
    instance_count=1,
    sagemaker_session=pipeline_session,
)

We then define the pipeline steps using step_processor.run(…) as the input parameter to run our custom script inside the defined environment.

Validate model package artifacts

The first step takes two arguments: default_bucket and model_package_group_name. It outputs the results of the checks in JSON format stored in Amazon S3. The step is defined as follows:

from sagemaker.processing import ProcessingOutput
from sagemaker.workflow.steps import ProcessingStep

process_step = ProcessingStep(
    name="RegisteredModelValidationStep",
    step_args=step_processor.run(
        code="automated-model-approval/model-approval-checks.py",
        inputs=[],
        outputs=[
            # The checks file written by the script is uploaded to this S3 prefix.
            ProcessingOutput(
                output_name="checks",
                destination=f"s3://{default_bucket}/governance-pipeline/processor/",
                source="/opt/ml/processing/output",
            )
        ],
        arguments=[
            "--default_bucket", default_bucket_s3,
            "--model_package_group_name", model_package_group_name,
        ],
    ),
)

This step runs the custom script passed to the code parameter. We now explore this script in more detail.

Values passed to arguments can be parsed using standard methods like argparse and will be used throughout the script. We use these values to retrieve the model package. We then parse the model package’s metadata to find the location of the model quality, bias, and explainability reports. See the following code:

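# Take the first model package returned for the group.
model_package_arn = client.list_model_packages(ModelPackageGroupName=model_package_group_name)[
    "ModelPackageSummaryList"
][0]["ModelPackageArn"]
# Retrieve the metrics metadata that points to the reports in Amazon S3.
model_package_metrics = client.describe_model_package(ModelPackageName=model_package_arn)["ModelMetrics"]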
model_quality_s3_key = model_package_metrics["ModelQuality"]["Statistics"]["S3Uri"].split(f"{default_bucket}/")[1]
model_quality_bias = model_package_metrics["Bias"]
model_quality_pretrain_bias_key = model_quality_bias["PreTrainingReport"]["S3Uri"].split(f"{default_bucket}/")[1]
model_quality__post_train_bias_key = model_quality_bias["PostTrainingReport"]["S3Uri"].split(f"{default_bucket}/")[1]
model_explainability_s3_key = model_package_metrics["Explainability"]["Report"]["S3Uri"].split(f"{default_bucket}/")[1]
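The argument parsing mentioned earlier is not shown in the excerpt. A minimal sketch using argparse, assuming the two argument names passed in the step definition, could look like the following. The function name `parse_args` and the sample values are illustrative.

```python
import argparse

def parse_args(argv=None):
    """Parse the arguments that the processing step passes to the script."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--default_bucket", type=str, required=True)
    parser.add_argument("--model_package_group_name", type=str, required=True)
    # argv=None makes argparse read from sys.argv, as it would inside the container.
    return parser.parse_args(argv)

# Illustrative invocation with placeholder values.
args = parse_args(
    ["--default_bucket", "my-bucket", "--model_package_group_name", "my-group"]
)
default_bucket = args.default_bucket
model_package_group_name = args.model_package_group_name
```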

The reports retrieved are simple JSON files we can then parse. In the following example, we retrieve the treatment equity and compare to our threshold in order to return a True or False result. Treatment equity is defined as the difference in the ratio of false negatives to false positives for the advantaged vs. disadvantaged group. We arbitrarily set the optimal threshold to be 0.8.

s3_obj = s3_client.get_object(Bucket=default_bucket, Key=model_quality__post_train_bias_key)
s3_obj_data = s3_obj['Body'].read().decode('utf-8')
model_quality__post_train_bias_json = json.loads(s3_obj_data)
treatment_equity = model_quality__post_train_bias_json["post_training_bias_metrics"][
        "facets"]["column_8"][0]["metrics"][-1]["value"]
treatment_equity_check_threshold = 0.8
treatment_equity_check = treatment_equity < treatment_equity_check_threshold
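For intuition about the metric itself, the following sketch computes treatment equity from confusion-matrix counts, following the definition given above. The order of subtraction (disadvantaged group minus advantaged group) is our assumption, and the counts are made up for illustration. The bias report already contains the computed value, so this is not needed in the pipeline.

```python
def treatment_equity(fn_a: int, fp_a: int, fn_d: int, fp_d: int) -> float:
    """Difference in the FN/FP ratio between the two groups.

    fn_a, fp_a: false negatives and false positives for the advantaged group.
    fn_d, fp_d: the same counts for the disadvantaged group.
    Assumes both false positive counts are nonzero.
    """
    return fn_d / fp_d - fn_a / fp_a

# Made-up counts: ratios of 0.5 (advantaged) and 0.75 (disadvantaged).
te = treatment_equity(fn_a=10, fp_a=20, fn_d=15, fp_d=20)
passes_check = te < 0.8  # same threshold as in the example above
```

A value near zero indicates that the two groups see a similar balance of false negatives and false positives.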

After running through the measures of interest, we return the true/false checks to a JSON file that will be copied to Amazon S3 as per the output variable of the ProcessingStep.
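Writing the results could be sketched as follows. The output directory matches the source path of the ProcessingOutput defined earlier, and the file name checks.json matches what the second step reads. The function name `write_checks` and the `model_quality_check` key are hypothetical.

```python
import json
import os
import tempfile

def write_checks(checks: dict, output_dir: str = "/opt/ml/processing/output") -> str:
    """Write the name -> True/False check results to checks.json in output_dir."""
    os.makedirs(output_dir, exist_ok=True)
    path = os.path.join(output_dir, "checks.json")
    with open(path, "w") as f:
        json.dump(checks, f)
    return path

# Local demonstration using a temporary directory instead of the container path.
demo_dir = tempfile.mkdtemp()
demo_path = write_checks(
    {"treatment_equity_check": True, "model_quality_check": False},
    demo_dir,
)
```

Inside the processing container, calling `write_checks(checks)` with the default directory is enough, because SageMaker uploads the contents of that directory to the configured S3 destination when the job finishes.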

Update the model package status in the model registry

When the initial step is complete, we use the JSON file created in Amazon S3 as input to update the model package’s status and metadata. See the following code:

update_model_status_step = ProcessingStep(
    name="UpdateModelStatusStep",
    step_args=step_processor.run(
        code="automated-model-approval/validate-model.py",
        inputs=[
            ProcessingInput(
                source=process_step.properties.ProcessingOutputConfig.Outputs[
                    "checks"
                ].S3Output.S3Uri,
                destination="/opt/ml/processing/input",
            ),
        ],
        outputs=[],
        arguments=[
            "--model_package_group_name", model_package_group_name
        ]
    ),
)

This step runs the custom script passed to the code parameter. We now explore this script in more detail. First, parse the values in checks.json to evaluate if the model passed all checks or review the reasons for failure:

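import json

is_approved = True
reasons = []
# Load the check results written by the validation step.
with open('/opt/ml/processing/input/checks.json') as checks_file:
    checks = json.load(checks_file)
print(f"checks: {checks}")
# Any check that evaluated to False blocks approval and is recorded as a reason.
for key, value in checks.items():
    if not value:
        is_approved = False
        reasons.append(key)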

After we know if the model should be approved or rejected, we update the model status and metadata as follows:

if is_approved:
    approval_description = "Model package meets organisational guidelines"
else:
    approval_description = "Model values for the following checks do not meet threshold: "

# Append the name of each failed check (the list is empty when approved).
for reason in reasons:
    approval_description += f"{reason} "

model_package_update_input_dict = {
    "ModelPackageArn": model_package_arn,
    "ApprovalDescription": approval_description,
    "ModelApprovalStatus": "Approved" if is_approved else "Rejected",
}

model_package_update_response = client.update_model_package(**model_package_update_input_dict)

This step produces a model with a status of Approved or Rejected based on the set of checks specified in the first step.

Orchestrate the steps as a SageMaker pipeline

We orchestrate the previous steps as a SageMaker pipeline with two parameter inputs passed as arguments to the various steps:

from sagemaker.workflow.pipeline import Pipeline
from sagemaker.workflow.parameters import ParameterString

model_package_group_name = ParameterString(
    name="ModelPackageGroupName", default_value="ModelPackageGroupName is required variable."
)

default_bucket_s3 = ParameterString(
    name="Bucket", default_value="Bucket is required variable"
)

pipeline = Pipeline(
    name=pipeline_name,
    parameters=[model_package_group_name, default_bucket_s3],
    steps=[process_step, update_model_status_step],
)

It’s straightforward to extend this pipeline by adding elements into the list passed to the steps parameter. In the next section, we explore how to run this pipeline as new model packages are registered to our model registry.

Run the event-driven pipeline

In this section, we outline how to invoke the pipeline using an EventBridge rule and Lambda function.

Create a Lambda function and select the Python 3.9 runtime. The following function retrieves the model package ARN, the model package group name, and the S3 bucket where the artifacts are stored based on the event. It then starts running the pipeline using these values:

import json
import boto3
sagemaker_client = boto3.client('sagemaker')

def lambda_handler(event, context):
    model_arn = event.get('detail', {}).get('ModelPackageArn', 'Unknown')
    model_package_group_name = event.get('detail', {}).get('ModelPackageGroupName', 'Unknown') 
    model_package_name = event.get('detail', {}).get('ModelPackageName', 'Unknown') 
    model_data_url = event.get('InferenceSpecification', {}).get('ModelDataUrl', 'Unknown')        
    
    # Specify the name of your SageMaker pipeline
    pipeline_name = 'model-governance-pipeline'
    
    # Define multiple parameters
    pipeline_parameters = [
        {'Name': "ModelPackageGroupName", 'Value': model_package_group_name},
        {'Name': "Bucket", 'Value': model_data_url},
    ]
    # Start the pipeline execution
    response = sagemaker_client.start_pipeline_execution(
        PipelineName=pipeline_name,
        PipelineExecutionDisplayName=pipeline_name,
        PipelineParameters=pipeline_parameters
    )
    
    # Return the response
    return response

After defining the Lambda function, we create the EventBridge rule to automatically invoke the function when a new model package is registered with PendingManualApproval into the model registry. You can use AWS CloudFormation and the following template to create the rule:

{
  "AWSTemplateFormatVersion": "2010-09-09",
  "Description": "CloudFormation template for EventBridge rule 'invoke-model-approval-checks'",
  "Resources": {
    "EventRule0": {
      "Type": "AWS::Events::Rule",
      "Properties": {
        "EventBusName": "default",
        "EventPattern": {
          "source": ["aws.sagemaker"],
          "detail-type": ["SageMaker Model Package State Change"],
          "detail": {
            "ModelApprovalStatus": ["PendingManualApproval"]
          }
        },
        "Name": "invoke-model-approval-checks",
        "State": "ENABLED",
        "Targets": [{
          "Id": "Id403a084c-2837-4408-940f-b808389653d1",
          "Arn": "<Your Lambda function ARN>"
        }]
      }
    }
  }
}
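To reason about which events this rule forwards to the Lambda function, the following sketch approximates the matching logic locally. It covers only the exact-value matching used in the preceding template (every key in the pattern must be present, and a leaf value must appear in the pattern's list). It is not a full implementation of EventBridge pattern semantics, and the sample event is trimmed to a few illustrative fields.

```python
def matches(pattern: dict, event: dict) -> bool:
    """Simplified local approximation of exact-value event pattern matching."""
    for key, expected in pattern.items():
        if key not in event:
            return False
        actual = event[key]
        if isinstance(expected, dict):
            # Nested pattern: recurse into the corresponding event object.
            if not isinstance(actual, dict) or not matches(expected, actual):
                return False
        elif actual not in expected:
            # Leaf pattern: a list of accepted values.
            return False
    return True

# The EventPattern from the preceding template.
event_pattern = {
    "source": ["aws.sagemaker"],
    "detail-type": ["SageMaker Model Package State Change"],
    "detail": {"ModelApprovalStatus": ["PendingManualApproval"]},
}

# Trimmed sample event (illustrative values).
sample_event = {
    "source": "aws.sagemaker",
    "detail-type": "SageMaker Model Package State Change",
    "detail": {
        "ModelPackageGroupName": "my-group",
        "ModelApprovalStatus": "PendingManualApproval",
    },
}

should_invoke = matches(event_pattern, sample_event)
```

Because the pipeline later sets the status to Approved or Rejected, those follow-up state changes don't match the pattern, so the rule does not invoke the pipeline a second time for its own update.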

We now have a SageMaker pipeline consisting of two steps being invoked when a new model is registered to evaluate model quality, bias, and feature importance metrics and update the model status accordingly.

Applying this approach to generative AI models

In this section, we explore how the complexities introduced by LLMs change the automated monitoring workflow.

Traditional ML models typically produce concise outputs with obvious ground truths in their training dataset. In contrast, LLMs can generate long, nuanced sequences that may have little to no ground truth due to the autoregressive nature of training this class of models. This strongly influences various components of the governance pipeline we’ve described.

For instance, in traditional ML models, bias is detected by looking at the distributions of labels over different population subsets (for example, male vs. female). The labels (often a single number or a few numbers) are a clear and simple signal used to measure bias. In contrast, generative models produce lengthy and complex answers, which don’t provide an obvious signal to be used for monitoring. HELM (a holistic framework for evaluating foundation models) allows you to simplify monitoring by untangling the evaluation process into metrics of concern. This includes accuracy, calibration and uncertainty, robustness, fairness, bias and stereotypes, toxicity, and efficiency. We then apply downstream processes to measure for these metrics independently. This is generally done using standardized datasets composed of examples and a variety of accepted responses.

We concretely evaluate four metrics of interest to any governance pipelines for LLMs: memorization and copyright, disinformation, bias, and toxicity, as described in HELM. This is done by collecting inference results from the model pushed to the model registry. The benchmarks include:

  • Memorization and copyright with books from bookscorpus, which uses popular books from a bestseller list and source code of the Linux kernel. This can be quickly extended to include a number of copyrighted works.
  • Disinformation with headlines from the MisinfoReactionFrames dataset, which has false headlines across a number of topics.
  • Bias with Bias Benchmark for Question Answering (BBQ). This QA dataset works to highlight biases affecting various social groups.
  • Toxicity with Bias in Open-ended Language Generation Dataset (BOLD), which benchmarks across profession, gender, race, religion, and political ideology.

Each of these datasets is publicly available. They each allow complex aspects of a generative model’s behavior to be isolated and distilled down to a single number. This flow is described in the following architecture.

A flow from the benchmark data set going into the Large Language Model, which gets saved into Requests & Responds, which gets sent to the processing job. The processing job has additional input from benchmark datasets from Ground Truth. Ultimately, metrics are sent to a Metrics & Results bucket.

For a detailed view of this topic along with important mechanisms to scale in production, refer to Operationalize LLM Evaluation at Scale using Amazon SageMaker Clarify and MLOps services.
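Once each benchmark is reduced to a single number, the results can feed the same true/false checks used earlier in the pipeline. The following is a minimal sketch under several assumptions: the per-example scores are made up, lower scores are treated as better for every metric, and the threshold values and the function name `summarize_checks` are hypothetical.

```python
from statistics import mean

def summarize_checks(per_example_scores: dict, thresholds: dict) -> dict:
    """Average each metric's per-example scores and compare to its threshold."""
    checks = {}
    for metric, scores in per_example_scores.items():
        # Assumption: lower is better, so the check passes at or below the threshold.
        checks[f"{metric}_check"] = mean(scores) <= thresholds[metric]
    return checks

# Made-up per-example scores and thresholds for illustration only.
llm_scores = {
    "toxicity": [0.1, 0.2, 0.3],
    "memorization": [0.0, 0.5, 1.0],
}
llm_thresholds = {"toxicity": 0.25, "memorization": 0.4}

llm_checks = summarize_checks(llm_scores, llm_thresholds)
```

The resulting dictionary has the same shape as the checks file consumed by the status update step, so the approval logic would not need to change.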

Conclusion

In this post, we discussed a sample solution to begin automating your compliance checks for models going into production. As AI/ML becomes increasingly common, organizations require new tools to codify the expertise of their highly skilled employees in the AI/ML space. By embedding your expertise as code and running these automated checks against models using event-driven architectures, you can increase both the speed and quality of models by empowering yourself to run these checks as needed rather than relying on the availability of individuals for manual compliance or quality assurance reviews. By using well-known CI/CD techniques in the application development lifecycle and applying them to the ML modeling lifecycle, organizations can scale in the era of generative AI.

If you have any thoughts or questions, please leave them in the comments section.


About the Authors

Jayson Sizer McIntosh is a Senior Solutions Architect at Amazon Web Services (AWS) in the World Wide Public Sector (WWPS) based in Ottawa (Canada) where he primarily works with public sector customers as an IT generalist with a focus on Dev(Sec)Ops/CICD. Bringing his experience implementing cloud solutions in high compliance environments, he is passionate about helping customers successfully deliver modern cloud-based services to their users.

Nicolas Bernier is an AI/ML Solutions Architect, part of the Canadian Public Sector team at AWS. He is currently conducting research in Federated Learning and holds five AWS certifications, including the ML Specialty Certification. Nicolas is passionate about helping customers deepen their knowledge of AWS by working with them to translate their business challenges into technical solutions.

Pooja Ayre is a seasoned IT professional with over 9 years of experience in product development, having worn multiple hats throughout her career. For the past two years, she has been with AWS as a Solutions Architect, specializing in AI/ML. Pooja is passionate about technology and dedicated to finding innovative solutions that help customers overcome their roadblocks and achieve their business goals through the strategic use of technology. Her deep expertise and commitment to excellence make her a trusted advisor in the IT industry.


Build custom generative AI applications powered by Amazon Bedrock


With last month’s blog, I started a series of posts that highlight the key factors that are driving customers to choose Amazon Bedrock. I explored how Bedrock enables customers to build a secure, compliant foundation for generative AI applications. Now I’d like to turn to a slightly more technical, but equally important differentiator for Bedrock—the multiple techniques that you can use to customize models and meet your specific business needs.

As we’ve all heard, large language models (LLMs) are transforming the way we leverage artificial intelligence (AI) and enabling businesses to rethink core processes. Trained on massive datasets, these models can rapidly comprehend data and generate relevant responses across diverse domains, from summarizing content to answering questions. The wide applicability of LLMs explains why customers across healthcare, financial services, and media and entertainment are moving quickly to adopt them. However, our customers tell us that while pre-trained LLMs excel at analyzing vast amounts of data, they often lack the specialized knowledge necessary to tackle specific business challenges.

Customization unlocks the transformative potential of large language models. Amazon Bedrock equips you with a powerful and comprehensive toolset to transform your generative AI from a one-size-fits-all solution into one that is finely tailored to your unique needs. Customization includes varied techniques such as Prompt Engineering, Retrieval Augmented Generation (RAG), and fine-tuning and continued pre-training. Prompt Engineering involves carefully crafting prompts to get a desired response from LLMs. RAG combines knowledge retrieved from external sources with language generation to provide more contextual and accurate responses. Model Customization techniques, including fine-tuning and continued pre-training, involve further training a pre-trained language model on specific tasks or domains for improved performance. These techniques can be used in combination with each other to train base models in Amazon Bedrock with your data to deliver contextual and accurate outputs. Read the following examples to understand how customers are using customization in Amazon Bedrock to deliver on their use cases.

Thomson Reuters, a global content and technology company, has seen positive results with Claude 3 Haiku, but anticipates even better results with customization. The company—which serves professionals in legal, tax, accounting, compliance, government, and media—expects that it will see even faster and more relevant AI results by fine-tuning Claude with their industry expertise.

“We’re excited to fine-tune Anthropic’s Claude 3 Haiku model in Amazon Bedrock to further enhance our Claude-powered solutions. Thomson Reuters aims to provide accurate, fast, and consistent user experiences. By optimizing Claude around our industry expertise and specific requirements, we anticipate measurable improvements that deliver high-quality results at even faster speeds. We’ve already seen positive results with Claude 3 Haiku, and fine-tuning will enable us to tailor our AI assistance more precisely.”

– Joel Hron, Chief Technology Officer at Thomson Reuters.

At Amazon, we see Buy with Prime using Amazon Bedrock’s cutting-edge RAG-based customization capabilities to drive greater efficiency. Orders on merchants’ sites are covered by Buy with Prime Assist, a 24/7 live chat customer service. The team recently launched a chatbot solution in beta capable of handling product support queries. The solution is powered by Amazon Bedrock and customized with data to go beyond traditional email-based systems. My colleague Amit Nandy, Product Manager at Buy with Prime, says,

“By indexing merchant websites, including subdomains and PDF manuals, we constructed tailored knowledge bases that provided relevant and comprehensive support for each merchant’s unique offerings. Combined with Claude’s state-of-the-art foundation models and Guardrails for Amazon Bedrock, our chatbot solution delivers a highly capable, secure, and trustworthy customer experience. Shoppers can now receive accurate, timely, and personalized assistance for their queries, fostering increased satisfaction and strengthening the reputation of Buy with Prime and its participating merchants.”

Stories like these are the reason why we continue to double down on our customization capabilities for generative AI applications powered by Amazon Bedrock.

In this blog, we’ll explore the three major techniques for customizing LLMs in Amazon Bedrock. And, we’ll cover related announcements from the recent AWS New York Summit.

Prompt Engineering: Guiding your application toward desired answers

Prompts are the primary inputs that drive LLMs to generate answers. Prompt engineering is the practice of carefully crafting these prompts to guide LLMs effectively. Well-designed prompts can significantly boost a model’s performance by providing clear instructions, context, and examples tailored to the task at hand. Amazon Bedrock supports multiple prompt engineering techniques. For example, few-shot prompting provides examples with desired outputs to help models better understand tasks, such as sentiment analysis samples labeled “positive” or “negative.” Zero-shot prompting provides task descriptions without examples. And chain-of-thought prompting enhances multi-step reasoning by asking models to break down complex problems, which is useful for arithmetic, logic, and deductive tasks.

Our Prompt Engineering Guidelines outline various prompting strategies and best practices for optimizing LLM performance across applications. Leveraging these techniques can help practitioners achieve their desired outcomes more effectively. However, developing optimal prompts that elicit the best responses from foundational models is a challenging and iterative process, often requiring weeks of refinement by developers.
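The three prompting styles described above can be sketched as plain string templates. This is only an illustration: the function names and prompt wording below are assumptions made for the example and are not part of any Amazon Bedrock API.

```python
# Illustrative prompt builders; names and wording are assumptions,
# not part of any Amazon Bedrock API.

def zero_shot_prompt(review: str) -> str:
    """Zero-shot: a task description only, with no examples."""
    return (
        "Classify the sentiment of the following review as positive or negative.\n"
        f"Review: {review}\nSentiment:"
    )


def few_shot_prompt(review: str, examples: list[tuple[str, str]]) -> str:
    """Few-shot: the task description plus labeled examples of the desired output."""
    blocks = ["Classify the sentiment of each review as positive or negative."]
    for text, label in examples:
        blocks.append(f"Review: {text}\nSentiment: {label}")
    # The final block is left unlabeled for the model to complete.
    blocks.append(f"Review: {review}\nSentiment:")
    return "\n\n".join(blocks)


def chain_of_thought_prompt(question: str) -> str:
    """Chain-of-thought: ask the model to reason step by step before answering."""
    return (
        f"Question: {question}\n"
        "Think through the problem step by step, then give the final answer "
        "on a line starting with 'Answer:'."
    )


examples = [
    ("Great battery life.", "positive"),
    ("Stopped working in a week.", "negative"),
]
prompt = few_shot_prompt("The screen is gorgeous.", examples)
print(prompt)
```

Keeping the templates as small functions makes it easy to compare the three styles on the same input before committing to one.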

[Figures: zero-shot prompting and few-shot prompting examples; chain-of-thought prompting with the Prompt Flows visual builder]

Retrieval-Augmented Generation: Augmenting results with retrieved data

LLMs generally lack specialized knowledge, jargon, context, or up-to-date information needed for specific tasks. For instance, legal professionals seeking reliable, current, and accurate information within their domain may find interactions with generalist LLMs inadequate. Retrieval-Augmented Generation (RAG) is the process of allowing a language model to consult an authoritative knowledge base outside of its training data sources before generating a response.

The RAG process involves three main steps:

  • Retrieval: Given an input prompt, a retrieval system identifies and fetches relevant passages or documents from a knowledge base or corpus.
  • Augmentation: The retrieved information is combined with the original prompt to create an augmented input.
  • Generation: The LLM generates a response based on the augmented input, leveraging the retrieved information to produce more accurate and informed outputs.
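The three steps above can be sketched end to end in a few lines. This is a toy illustration under stated assumptions: the in-memory `KNOWLEDGE_BASE`, the word-overlap ranking, and the placeholder `generate` function are all hypothetical stand-ins for a real vector index and a real LLM call.

```python
import re

# Hypothetical in-memory knowledge base; a real system would use a vector index.
KNOWLEDGE_BASE = [
    "Refunds are processed within 5 business days of receiving the returned item.",
    "Standard shipping takes 3 to 7 business days.",
    "Support is available by live chat 24 hours a day.",
]


def tokenize(text: str) -> set[str]:
    """Lowercase the text and split it into a set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query: str, documents: list[str], top_k: int = 1) -> list[str]:
    """Retrieval: rank documents by word overlap with the query."""
    query_tokens = tokenize(query)
    ranked = sorted(
        documents,
        key=lambda doc: len(query_tokens & tokenize(doc)),
        reverse=True,
    )
    return ranked[:top_k]


def augment(query: str, passages: list[str]) -> str:
    """Augmentation: combine the retrieved passages with the original prompt."""
    context = "\n".join(f"- {passage}" for passage in passages)
    return (
        "Use only the context below to answer.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )


def generate(augmented_prompt: str) -> str:
    """Generation: placeholder for an LLM call; it only reports the prompt size."""
    return f"[model response to a {len(augmented_prompt)}-character prompt]"


query = "How long do refunds take?"
passages = retrieve(query, KNOWLEDGE_BASE)
prompt = augment(query, passages)
answer = generate(prompt)
print(prompt)
```

Word overlap is used here only to keep the sketch dependency-free; semantic retrieval would normally rely on embeddings.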

Amazon Bedrock’s Knowledge Bases is a fully managed RAG feature that allows you to connect LLMs to internal company data sources—delivering relevant, accurate, and customized responses. To offer greater flexibility and accuracy in building RAG-based applications, we announced multiple new capabilities at the AWS New York Summit. For example, now you can securely access data from new sources like the web (in preview), allowing you to index public web pages, or access enterprise data from Confluence, SharePoint, and Salesforce (all in preview). Advanced chunking options are another exciting new feature, enabling you to create custom chunking algorithms tailored to your specific needs, as well as leverage built-in semantic and hierarchical chunking options. You now have the capability to extract information with precision from complex data formats (e.g., complex tables within PDFs), thanks to advanced parsing techniques. Plus, the query reformulation feature allows you to deconstruct complex queries into simpler sub-queries, enhancing retrieval accuracy. All these new features help you reduce the time and cost associated with data access and construct highly accurate and relevant knowledge resources—all tailored to your specific enterprise use cases.

Model Customization: Enhancing performance for specific tasks or domains

Model customization in Amazon Bedrock is a process to customize pre-trained language models for specific tasks or domains. It involves taking a large, pre-trained model and further training it on a smaller, specialized dataset related to your use case. This approach leverages the knowledge acquired during the initial pre-training phase while adapting the model to your requirements, without losing the original capabilities. The fine-tuning process in Amazon Bedrock is designed to be efficient, scalable, and cost-effective, enabling you to tailor language models to your unique needs, without the need for extensive computational resources or data. In Amazon Bedrock, model fine-tuning can be combined with prompt engineering or the Retrieval-Augmented Generation (RAG) approach to further enhance the performance and capabilities of language models. Model customization can be implemented both for labeled and unlabeled data.

Fine-Tuning with labeled data involves providing labeled training data to improve the model’s performance on specific tasks. The model learns to associate appropriate outputs with certain inputs, adjusting its parameters for better task accuracy. For instance, if you have a dataset of customer reviews labeled as positive or negative, you can fine-tune a pre-trained model within Bedrock on this data to create a sentiment analysis model tailored to your domain. At the AWS New York Summit, we announced Fine-tuning for Anthropic’s Claude 3 Haiku. By providing task-specific training datasets, users can fine-tune and customize Claude 3 Haiku, boosting its accuracy, quality, and consistency for their business applications.
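As a rough sketch of what labeled training data for the sentiment example might look like, the following snippet serializes review and label pairs as JSON Lines. The `prompt` and `completion` field names are an assumption made for illustration; the required record format depends on the model being fine-tuned, so check the Amazon Bedrock documentation for the model you choose.

```python
import json

# Hypothetical labeled reviews; in practice these come from your own dataset.
labeled_reviews = [
    ("The checkout flow was fast and painless.", "positive"),
    ("The package arrived damaged and late.", "negative"),
]


def to_jsonl(records: list[tuple[str, str]]) -> str:
    """Serialize (text, label) pairs as one JSON object per line."""
    lines = []
    for text, label in records:
        lines.append(json.dumps({
            # Field names are assumed for this sketch; verify them for your model.
            "prompt": f"Classify the sentiment of this review: {text}",
            "completion": label,
        }))
    return "\n".join(lines)


training_data = to_jsonl(labeled_reviews)
print(training_data)
```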

Continued Pre-training with unlabeled data, also known as domain adaptation, allows you to further train the LLMs on your company’s proprietary, unlabeled data. It exposes the model to your domain-specific knowledge and language patterns, enhancing its understanding and performance for specific tasks.

Customization holds the key to unlocking the true power of generative AI

Large language models are revolutionizing AI applications across industries, but tailoring these general models with specialized knowledge is key to unlocking their full business impact. Amazon Bedrock empowers organizations to customize LLMs through Prompt Engineering techniques, such as Prompt Management and Prompt Flows, that help craft effective prompts. Retrieval-Augmented Generation, powered by Amazon Bedrock’s Knowledge Bases, lets you integrate LLMs with proprietary data sources to generate accurate, domain-specific responses. And Model Customization techniques, including fine-tuning with labeled data and continued pre-training with unlabeled data, help optimize LLM behavior for your unique needs. After taking a close look at these three main customization methods, it’s clear that while they may take different approaches, they all share a common goal: to help you address your specific business problems.

Resources

For more information on customization with Amazon Bedrock, see the following resources:

  1. Learn more about Amazon Bedrock
  2. Learn more about Amazon Bedrock Knowledge Bases
  3. Read announcement blog on additional data connectors in Knowledge Bases for Amazon Bedrock
  4. Read blog on advanced chunking and parsing options in Knowledge Bases for Amazon Bedrock
  5. Learn more about Prompt Engineering
  6. Learn more about Prompt Engineering techniques and best practices
  7. Read announcement blog on Prompt Management and Prompt Flows
  8. Learn more about fine-tuning and continued pre-training
  9. Read the announcement blog on fine-tuning Anthropic’s Claude 3 Haiku

About the author

Vasi Philomin is VP of Generative AI at AWS. He leads generative AI efforts, including Amazon Bedrock and Amazon Titan.


Use Amazon Bedrock to generate, evaluate, and understand code in your software development pipeline


Generative artificial intelligence (AI) models have opened up new possibilities for automating and enhancing software development workflows. Specifically, the emergent capability for generative models to produce code based on natural language prompts has opened many doors to how developers and DevOps professionals approach their work and improve their efficiency. In this post, we provide an overview of how to take advantage of the advancements of large language models (LLMs) using Amazon Bedrock to assist developers at various stages of the software development lifecycle (SDLC).

Amazon Bedrock is a fully managed service that offers a choice of high-performing foundation models (FMs) from leading AI companies like AI21 Labs, Anthropic, Cohere, Meta, Mistral AI, Stability AI, and Amazon through a single API, along with a broad set of capabilities to build generative AI applications with security, privacy, and responsible AI.

The following process architecture proposes an example SDLC flow that incorporates generative AI in key areas to improve the efficiency and speed of development.

The intent of this post is to focus on how developers can create their own systems to augment, write, and audit code by using models within Amazon Bedrock instead of relying on out-of-the-box coding assistants. We discuss the following topics:

  • A coding assistant use case to help developers write code faster by providing suggestions
  • How to use the code understanding capabilities of LLMs to surface insights and recommendations
  • An automated application generation use case to generate functioning code and automatically deploy changes into a working environment

Considerations

It’s important to consider some technical options when choosing your model and approach to implementing this functionality at each step. One such option is the base model to use for the task. Because each model has been trained on a different corpus of data, task performance will inherently differ from model to model. Anthropic’s Claude 3 models on Amazon Bedrock write code effectively out of the box in many common coding languages, for example, whereas others may not be able to reach that performance without further customization. Customization, however, is another technical choice to make. For instance, if your use case includes a less common language or framework, customizing the model through fine-tuning or using Retrieval Augmented Generation (RAG) may be necessary to achieve production-quality performance, but involves more complexity and engineering effort to implement effectively.

There is an abundance of literature breaking down these trade-offs; in this post, we only note them as topics that should be explored in their own right. We are simply laying out the context that informs a builder’s initial steps in their generative AI-powered SDLC journey.

Coding assistant

Coding assistants are a very popular use case, with an abundance of examples from which to choose. AWS offers several services that can be applied to assist developers, either through in-line completion from tools like Amazon CodeWhisperer, or through natural language interaction using Amazon Q. Amazon Q for builders has several implementations of this functionality.

In nearly all the use cases described, there can be an integration with the chat interface and assistants. The use cases here are focused on more direct code generation use cases using natural language prompts. This is not to be confused with in-line generation tools that focus on autocompleting a coding task.

The key benefit of an assistant over in-line generation is that you can start new projects based on simple descriptions. For instance, you can describe that you want a serverless website that will allow users to post in blog fashion, and Amazon Q can start building the project by providing sample code and making recommendations on which frameworks to use to do this. This natural language entry point can give you a template and framework to operate within so you can spend more time on the differentiating logic of your application rather than the setup of repeatable and commoditized components.

Code understanding

Companies that begin by using generative AI to augment the productivity of individual developers commonly go on to use LLMs to infer the meaning and functionality of code, with the aim of improving the reliability, efficiency, security, and speed of the development process. Code understanding by humans is a central part of the SDLC: creating documentation, performing code reviews, and applying best practices. Onboarding new developers can be a challenge even for mature teams. Instead of a more senior developer taking time to respond to questions, an LLM with awareness of the code base and the team’s coding standards could be used to explain sections of code and design decisions to the new team member. The onboarding developer has everything they need with a rapid response time and the senior developer can focus on building. In addition to user-facing behaviors, this same mechanism can be repurposed to work completely behind the scenes to augment existing continuous integration and continuous delivery (CI/CD) processes as an additional reviewer.

For instance, you can use prompt engineering techniques to guide and automate the application of coding standards, or include the existing code base as referential material to use custom APIs. You can also take proactive measures by prefixing each prompt with a reminder to follow the coding standards and make a call to get them from document storage, passing them to the model as context with the prompt. As a retroactive measure, you can add a step during the review process to check the written code against the standards to enforce adherence, similar to how a team code review would work. For example, let’s say that one of the team’s standards is to reuse components. During the review step, the model can read over a new code submission, note that the component already exists in the code base, and suggest to the reviewer to reuse the existing component instead of recreating it.
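A minimal sketch of the proactive and retroactive measures described above might look like the following. The standards list, function names, and prompt wording are hypothetical; in practice the standards would be fetched from document storage and the resulting prompts sent to a model.

```python
# Hypothetical standards; in practice these would be fetched from document storage.
CODING_STANDARDS = [
    "Reuse existing components instead of recreating them.",
    "Every public function must have a docstring.",
]


def standards_block(standards: list[str]) -> str:
    """Render the standards as a numbered list for inclusion in a prompt."""
    return "\n".join(f"{i}. {rule}" for i, rule in enumerate(standards, start=1))


def proactive_prompt(task: str, standards: list[str]) -> str:
    """Proactive measure: prefix a code generation request with the standards."""
    return (
        "Follow these coding standards:\n"
        f"{standards_block(standards)}\n\n"
        f"Task: {task}"
    )


def review_prompt(
    submission: str, existing_components: list[str], standards: list[str]
) -> str:
    """Retroactive measure: ask the model to check a submission against the standards."""
    return (
        "Review the submission against these coding standards:\n"
        f"{standards_block(standards)}\n\n"
        f"Components already in the code base: {', '.join(existing_components)}\n\n"
        f"Submission:\n{submission}\n\n"
        "List each violation and suggest a fix."
    )


prompt = review_prompt(
    "def format_date(d): ...",
    existing_components=["date_utils.format_date"],
    standards=CODING_STANDARDS,
)
print(prompt)
```

Listing the existing components in the review prompt is what gives the model a chance to notice that a submitted component already exists and to suggest reusing it.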

The following diagram illustrates this type of workflow.

Application generation

You can extend the concepts from the use cases described in this post to create a full application generation implementation. In the traditional SDLC, a human creates a set of requirements, makes a design for the application, writes some code to implement that design, builds tests, and receives feedback on the system from external sources or people, and then the process repeats. The bottleneck in this cycle typically comes at the implementation and testing phases. An application builder needs to have substantive technical skills to write code effectively, and there are typically numerous iterations required to debug and perfect code, even for the most skilled builders. In addition, a foundational knowledge of a company’s existing code base, APIs, and IP is fundamental to implementing an effective solution, and it can take humans a long time to acquire. This can slow down the time to innovation for new teammates or teams with technical skills gaps. As mentioned earlier, if models can be used with the capability to both create and interpret code, pipelines can be created that perform the developer iterations of the SDLC by feeding outputs of the model back in as input.

The following diagram illustrates this type of workflow.

For example, you can use natural language to ask a model to write an application that prints all the prime numbers between 1–100. It returns a block of code that can be run with applicable tests defined. If the program doesn’t run or some tests fail, the error and failing code can be fed back into the model, asking it to diagnose the problem and suggest a solution. The next step in the pipeline would be to take the original code, along with the diagnosis and suggested solution, and stitch the code snippets together to form a new program. The SDLC restarts in the testing phase to get new results, and either iterates again or a working application is produced. With this basic framework, an increasing number of components can be added in the same manner as in a traditional human-based workflow. This modular approach can be continuously improved until there is a robust and powerful application generation pipeline that simply takes in a natural language prompt and outputs a functioning application, handling all of the error correction and best practice adherence behind the scenes.
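The generate, test, and feed-back loop described above can be sketched as follows. This is a simplified illustration: `fake_model` is a hypothetical stand-in for a real LLM call that returns a buggy program first and a corrected one second, and running generated code with `exec` is only acceptable here because the code is hard-coded. A real pipeline would run model output in a sandbox.

```python
import traceback
from typing import Optional


def fake_model(prompt: str, attempt: int) -> str:
    """Stand-in for an LLM call: the first attempt is buggy, the second is fixed."""
    start = 1 if attempt == 0 else 2  # starting at 1 wrongly reports 1 as prime
    return (
        "def primes(limit):\n"
        f"    return [n for n in range({start}, limit + 1)\n"
        "            if all(n % d for d in range(2, n))]\n"
    )


def run_tests(code: str) -> Optional[str]:
    """Run the candidate code and its tests; return an error report or None."""
    namespace: dict = {}
    try:
        exec(code, namespace)  # hard-coded demo code only; sandbox real output
        result = namespace["primes"](100)
        assert result[:4] == [2, 3, 5, 7], f"unexpected start: {result[:4]}"
        assert len(result) == 25, f"expected 25 primes, got {len(result)}"
    except Exception:
        return traceback.format_exc()
    return None


def generate_application(task: str, max_attempts: int = 3) -> tuple[str, int]:
    """Iterate until the tests pass, feeding failures back into the prompt."""
    prompt = task
    for attempt in range(max_attempts):
        code = fake_model(prompt, attempt)
        error = run_tests(code)
        if error is None:
            return code, attempt + 1
        # Feed the failing code and the error back in for diagnosis.
        prompt = f"{task}\n\nPrevious code:\n{code}\nError:\n{error}\nFix the code."
    raise RuntimeError("no working program within the attempt limit")


code, attempts = generate_application("Print all prime numbers between 1 and 100.")
print(f"Working program found after {attempts} attempt(s)")
```

The attempt limit matters in practice: without it, a model that never converges would keep the pipeline looping indefinitely.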

The following diagram illustrates this advanced workflow.

Conclusion

We are at the point in the adoption curve of generative AI where teams are able to get real productivity gains from using the variety of techniques and tools available. In the near future, it will be imperative to take advantage of these productivity gains to stay competitive. One thing we do know is that the landscape will continue to rapidly progress and change, so building a flexible system that tolerates change is key. Developing your components in a modular fashion allows for stability in the face of an ever-changing technical landscape while being ready to adopt the latest technology at each step of the way.

For more information about how to get started building with LLMs, see these resources:


About the Authors

Ian Lenora is an experienced software development leader who focuses on building high-quality cloud native software, and exploring the potential of artificial intelligence. He has successfully led teams in delivering complex projects across various industries, optimizing efficiency and scalability. With a strong understanding of the software development lifecycle and a passion for innovation, Ian seeks to leverage AI technologies to solve complex problems and create intelligent, adaptive software solutions that drive business value.

Cody Collins is a New York-based Solutions Architect at Amazon Web Services, where he collaborates with ISV customers to build cutting-edge solutions in the cloud. He has extensive experience in delivering complex projects across diverse industries, optimizing for efficiency and scalability. Cody specializes in AI/ML technologies, enabling customers to develop ML capabilities and integrate AI into their cloud applications.

Samit Kumbhani is an AWS Senior Solutions Architect in the New York City area with over 18 years of experience. He currently collaborates with Independent Software Vendors (ISVs) to build highly scalable, innovative, and secure cloud solutions. Outside of work, Samit enjoys playing cricket, traveling, and biking.


Inference AudioCraft MusicGen models using Amazon SageMaker


Music generation models have emerged as powerful tools that transform natural language text into musical compositions. Originating from advancements in artificial intelligence (AI) and deep learning, these models are designed to understand and translate descriptive text into coherent, aesthetically pleasing music. Their ability to democratize music production allows individuals without formal training to create high-quality music by simply describing their desired outcomes.

Generative AI models are revolutionizing music creation and consumption. Companies can take advantage of this technology to develop new products, streamline processes, and explore untapped potential, yielding significant business impact. Such music generation models enable diverse applications, from personalized soundtracks for multimedia and gaming to educational resources for students exploring musical styles and structures. They assist artists and composers by providing new ideas and compositions, fostering creativity and collaboration.

One prominent example of a music generation model is AudioCraft MusicGen by Meta. The MusicGen code is released under the MIT license, and the model weights are released under CC-BY-NC 4.0. MusicGen can create music based on text or melody inputs, giving you better control over the output. The following diagram shows how MusicGen, a single-stage auto-regressive Transformer model, can generate high-quality music based on text descriptions or audio prompts.

Music Generation Models - MusicGen Input Output flow

MusicGen uses cutting-edge AI technology to generate diverse musical styles and genres, catering to various creative needs. Unlike traditional methods that cascade several models, either hierarchically or through upsampling, MusicGen is a single language model that operates over several streams of compressed discrete music representation (tokens). This streamlined approach empowers users with precise control over generating high-quality mono and stereo samples tailored to their preferences, revolutionizing AI-driven music composition.

MusicGen models can be used across education, content creation, and music composition. They can enable students to experiment with diverse musical styles, generate custom soundtracks for multimedia projects, and create personalized music compositions. Additionally, MusicGen can assist musicians and composers, fostering creativity and innovation.

This post demonstrates how to deploy MusicGen, a music generation model, on Amazon SageMaker using asynchronous inference. We specifically focus on text-conditioned generation of music samples using MusicGen models.

Solution overview

With the ability to generate audio, music, or video, generative AI models can be computationally intensive and time-consuming. Generative AI models with audio, music, and video output can use asynchronous inference, which queues incoming requests and processes them asynchronously. Our solution involves deploying the AudioCraft MusicGen model on SageMaker using SageMaker endpoints for asynchronous inference. This entails deploying AudioCraft MusicGen models sourced from the Hugging Face Model Hub onto a SageMaker infrastructure.

The following solution architecture diagram shows how a user can generate music using natural language text as an input prompt by using AudioCraft MusicGen models deployed on SageMaker.

MusicGen on Amazon SageMaker Asynchronous Inference

The following steps detail the sequence happening in the workflow from the moment the user enters the input to the point where music is generated as output:

  1. The user invokes the SageMaker asynchronous endpoint using an Amazon SageMaker Studio notebook.
  2. The input payload is uploaded to an Amazon Simple Storage Service (Amazon S3) bucket for inference. The payload consists of both the prompt and the music generation parameters. The generated music will be downloaded from the S3 bucket.
  3. The facebook/musicgen-large model is deployed to a SageMaker asynchronous endpoint. This endpoint is used to run inference for music generation.
  4. The HuggingFace Inference Containers image is used as a base image. We use an image that supports PyTorch 2.1.0 with a Hugging Face Transformers framework.
  5. The SageMaker HuggingFaceModel is deployed to a SageMaker asynchronous endpoint.
  6. The Hugging Face model (facebook/musicgen-large) is uploaded to Amazon S3 during deployment. Also, during inference, the generated outputs are uploaded to Amazon S3.
  7. We use Amazon Simple Notification Service (Amazon SNS) topics to send success and failure notifications, as defined in the SageMaker asynchronous inference configuration.
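The first two steps of this workflow (uploading the payload to Amazon S3 and invoking the asynchronous endpoint) can be sketched as follows. The payload key names (`texts`, `generation_params`) and the S3 key prefix are assumptions made for illustration. The clients are passed in as arguments; in a notebook they would typically be created with `boto3.client("s3")` and `boto3.client("sagemaker-runtime")`.

```python
import json
import uuid


def build_payload(prompt: str, generation_params: dict) -> bytes:
    """Serialize the prompt and generation parameters (key names are assumed)."""
    body = {"texts": [prompt], "generation_params": generation_params}
    return json.dumps(body).encode("utf-8")


def invoke_async(
    s3_client, runtime_client, endpoint_name: str, bucket: str, payload: bytes
) -> str:
    """Upload the payload to S3, then queue it on the asynchronous endpoint."""
    key = f"musicgen/input/{uuid.uuid4()}.json"  # hypothetical key prefix
    s3_client.put_object(Bucket=bucket, Key=key, Body=payload)
    response = runtime_client.invoke_endpoint_async(
        EndpointName=endpoint_name,
        InputLocation=f"s3://{bucket}/{key}",
        ContentType="application/json",
    )
    # The call returns immediately; the result is written to this S3 URI later.
    return response["OutputLocation"]
```

Because the call returns before inference finishes, the caller either polls the returned output location or waits for the SNS notification mentioned in the last step.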

Prerequisites

Make sure you have the following prerequisites in place:

  1. Confirm you have access to the AWS Management Console to create and manage resources in SageMaker, AWS Identity and Access Management (IAM), and other AWS services.
  2. If you’re using SageMaker Studio for the first time, create a SageMaker domain. Refer to Quick setup to Amazon SageMaker to create a SageMaker domain with default settings.
  3. Obtain the AWS Deep Learning Containers for Large Model Inference from pre-built HuggingFace Inference Containers.

Deploy the solution

To deploy the AudioCraft MusicGen model to a SageMaker asynchronous inference endpoint, complete the following steps:

  1. Create a model serving package for MusicGen.
  2. Create a Hugging Face model.
  3. Define asynchronous inference configuration.
  4. Deploy the model on SageMaker.

We detail each of the steps and show how we can deploy the MusicGen model onto SageMaker. For the sake of brevity, only significant code snippets are included. The full source code for deploying the MusicGen model is available in the GitHub repo.

Create a model serving package for MusicGen

To deploy MusicGen, we first create a model serving package. The model package contains a requirements.txt file that lists the necessary Python packages to be installed to serve the MusicGen model. The model package also contains an inference.py script that holds the logic for serving the MusicGen model.

Let’s look at the key functions used in serving the MusicGen model for inference on SageMaker:

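from transformers import MusicgenForConditionalGeneration

def model_fn(model_dir):
    '''Loads the MusicGen model from the Hugging Face Model Hub.'''
    model = MusicgenForConditionalGeneration.from_pretrained("facebook/musicgen-large")
    return model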

The model_fn function loads the MusicGen model facebook/musicgen-large from the Hugging Face Model Hub. We rely on the MusicgenForConditionalGeneration Transformers module to load the pre-trained MusicGen model.

You can also refer to musicgen-large-load-from-s3/deploy-musicgen-large-from-s3.ipynb, which demonstrates the best practice of downloading the model from the Hugging Face Hub to Amazon S3 and reusing the model artifacts for future deployments. Instead of downloading the model every time from Hugging Face when we deploy or when scaling happens, we download the model to Amazon S3 and reuse it for deployment and during scaling activities. Doing so can improve the download speed, especially for large models, and avoids downloading over the internet from a website outside of AWS. This best practice also maintains consistency, which means the same model from Amazon S3 can be deployed across various staging and production environments.

The predict_fn function uses the data provided during the inference request and the model loaded through model_fn:

texts, generation_params = _process_input(data)
processor = AutoProcessor.from_pretrained("facebook/musicgen-large")
inputs = processor(
    text=texts,
    padding=True,
    return_tensors="pt",
)

Using the information available in the data dictionary, we process the input data to obtain the prompt and generation parameters used to generate the music. We discuss the generation parameters in more detail later in this post.
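
The _process_input helper itself is not shown in this post. The following is a minimal sketch of what such a helper might look like, assuming the payload carries the texts and generation_params keys used later in this post and falls back to the deployment defaults listed there. The names parse_request and DEFAULT_GENERATION_PARAMS are hypothetical, and the version in the GitHub repo may differ.

```python
# Hypothetical stand-in for the _process_input helper; the repo's version may differ.
DEFAULT_GENERATION_PARAMS = {
    "guidance_scale": 3,
    "max_new_tokens": 256,
    "do_sample": True,
    "temperature": 1,
}


def parse_request(data):
    """Return (texts, generation_params) from a request payload dictionary."""
    texts = data.get("texts")
    if isinstance(texts, str):
        texts = [texts]  # accept a single prompt as well as a list of prompts
    if not texts:
        raise ValueError("payload must contain at least one prompt under 'texts'")
    # Caller-supplied values override the defaults key by key.
    params = {**DEFAULT_GENERATION_PARAMS, **(data.get("generation_params") or {})}
    return list(texts), params
```

Merging over a dictionary of defaults lets a request override only the parameters it cares about, such as max_new_tokens.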

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model.to(device)
audio_values = model.generate(**inputs.to(device),
                                **generation_params)

We load the model to the device and then send the inputs and generation parameters as inputs to the model. This process generates the music in the form of a three-dimensional Torch tensor of shape (batch_size, num_channels, sequence_length).

sampling_rate = model.config.audio_encoder.sampling_rate
disk_wav_locations = _write_wavs_to_disk(sampling_rate, audio_values)
# Upload wavs to S3
result_dict["generated_outputs_s3"] = _upload_wav_files(disk_wav_locations, bucket_name)
# Clean up disk
for wav_on_disk in disk_wav_locations:
    _delete_file_on_disk(wav_on_disk)

We then use the tensor to generate .wav music files, upload these files to Amazon S3, and clean up the copies saved on disk. Finally, we obtain the S3 URIs of the .wav files and send their locations in the response.
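
The _write_wavs_to_disk helper is not shown here. To illustrate the write step only, the following sketch uses Python's standard wave module to store one mono channel of float samples as 16-bit PCM. The function name write_wav_sketch is hypothetical, it takes a plain list of floats instead of a Torch tensor, and the deployed code may rely on a different audio library.

```python
import struct
import wave


def write_wav_sketch(path, samples, sampling_rate):
    """Write mono float samples in [-1.0, 1.0] to `path` as 16-bit PCM."""
    frames = b"".join(
        # Clamp each sample, scale it to the int16 range, pack little-endian.
        struct.pack("<h", int(max(-1.0, min(1.0, s)) * 32767))
        for s in samples
    )
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)            # mono
        wav.setsampwidth(2)            # 2 bytes per sample (16-bit)
        wav.setframerate(sampling_rate)
        wav.writeframes(frames)
    return path
```

In the deployed code, the sampling rate comes from model.config.audio_encoder.sampling_rate, and each item in the batch would be written to its own file.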

We now create the archive of the inference scripts and upload those to the S3 bucket:

musicgen_prefix = 'musicgen_large'
s3_model_key = f'{musicgen_prefix}/model/model.tar.gz'
s3_model_location = f"s3://{sagemaker_session_bucket}/{s3_model_key}"
s3 = boto3.resource("s3")
s3.Bucket(sagemaker_session_bucket).upload_file("model.tar.gz", s3_model_key)

The uploaded URI of this object on Amazon S3 will later be used to create the Hugging Face model.
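
The archive creation step is not shown in the preceding snippet. The following is a minimal sketch of how model.tar.gz might be built, assuming the serving scripts are placed under a code/ directory inside the archive. That layout and the function name build_model_archive are assumptions, so check them against the GitHub repo.

```python
import tarfile
from pathlib import Path


def build_model_archive(source_dir, archive_path="model.tar.gz"):
    """Pack inference.py and requirements.txt under code/ in a gzip tarball."""
    source_dir = Path(source_dir)
    with tarfile.open(archive_path, "w:gz") as tar:
        for name in ("inference.py", "requirements.txt"):
            # arcname controls the path stored inside the archive.
            tar.add(str(source_dir / name), arcname=f"code/{name}")
    return archive_path
```

The resulting file is the model.tar.gz that the upload_file call above sends to the S3 bucket.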

Create the Hugging Face model

Now we initialize HuggingFaceModel with the necessary arguments. During deployment, the model serving artifacts, stored in s3_model_location, will be deployed. Before the model serving, the MusicGen model will be downloaded from Hugging Face as per the logic in model_fn.

huggingface_model = HuggingFaceModel(
    name=async_endpoint_name,
    model_data=s3_model_location,  # path to your model artifacts 
    role=role,  # iam role with permissions to create an Endpoint
    env={
        'TS_MAX_REQUEST_SIZE': '100000000',
        'TS_MAX_RESPONSE_SIZE': '100000000',
        'TS_DEFAULT_RESPONSE_TIMEOUT': '3600'
    },
    transformers_version="4.37",  # transformers version used
    pytorch_version="2.1",  # pytorch version used
    py_version="py310",  # python version used
)

The env argument accepts a dictionary of parameters such as TS_MAX_REQUEST_SIZE and TS_MAX_RESPONSE_SIZE, which define the maximum size in bytes of the request and response payloads for the asynchronous inference endpoint (here 100,000,000 bytes, or 100 MB, each). The TS_DEFAULT_RESPONSE_TIMEOUT key in the env dictionary represents the timeout in seconds (here 3,600, or 1 hour) after which the asynchronous inference endpoint stops responding.

You can run MusicGen with the Hugging Face Transformers library from version 4.31.0 onwards. Here we set transformers_version to 4.37. MusicGen requires PyTorch version 2.1 or later, and we have set pytorch_version to 2.1.

Define asynchronous inference configuration

Music generation using a text prompt as input can be both computationally intensive and time-consuming. Asynchronous inference in SageMaker is designed to address these demands. When working with music generation models, it’s important to note that the process can often take more than 60 seconds to complete.

SageMaker asynchronous inference queues incoming requests and processes them asynchronously, making it ideal for requests with large payload sizes (up to 1 GB), long processing times (up to 1 hour), and near real-time latency requirements. By queuing incoming requests and processing them asynchronously, this capability efficiently handles the extended processing times inherent in music generation tasks. Moreover, asynchronous inference enables seamless auto scaling, making sure that resources are allocated only when needed, leading to cost savings.

Before we proceed with the asynchronous inference configuration, we create SNS topics for success and failure that can be used to perform downstream tasks:

from utils.sns_client import SnsClient
import time
sns_client = SnsClient(boto3.client("sns"))
timestamp = time.time_ns()
topic_names = [f"musicgen-large-topic-SuccessTopic-{timestamp}", f"musicgen-large-topic-ErrorTopic-{timestamp}"]

topic_arns = []
for topic_name in topic_names:
    print(f"Creating topic {topic_name}.")
    response = sns_client.create_topic(topic_name)
    topic_arns.append(response.get('TopicArn'))

We now create an asynchronous inference endpoint configuration by specifying the AsyncInferenceConfig object:

# create async endpoint configuration
async_config = AsyncInferenceConfig(
    output_path=s3_path_join(
        "s3://", sagemaker_session_bucket, "musicgen_large/async_inference/output"
    ),  # Where our results will be stored
    # Add SNS notification topics if needed
    notification_config={
        "SuccessTopic": topic_arns[0],
        "ErrorTopic": topic_arns[1],
    },  #  Notification configuration
)

The arguments to the AsyncInferenceConfig are detailed as follows:

  • output_path – The location where the output of the asynchronous inference endpoint will be stored. The files in this location will have an .out extension and will contain the details of the asynchronous inference performed by the MusicGen model.
  • notification_config – Optionally, you can associate success and error SNS topics. Dependent workflows can subscribe to these topics to make informed decisions based on the inference outcomes.

Deploy the model on SageMaker

With the asynchronous inference configuration defined, we can deploy the Hugging Face model, setting initial_instance_count to 1:

# deploy the endpoint
async_predictor = huggingface_model.deploy(
    initial_instance_count=1,
    instance_type=instance_type,
    async_inference_config=async_config,
    endpoint_name=async_endpoint_name,
)

After successfully deploying, you can optionally configure automatic scaling to the asynchronous endpoint. With asynchronous inference, you can also scale down your asynchronous endpoint’s instances to zero.
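
As a sketch of what that scaling configuration might look like, the following helper builds the request dictionaries for an Application Auto Scaling target-tracking policy on the endpoint's backlog metric. The helper name, policy name, capacity limits, target value, and cooldowns are illustrative assumptions, and AllTraffic is assumed to be the variant name of the deployed endpoint.

```python
def build_scaling_requests(endpoint_name, variant_name="AllTraffic",
                           max_capacity=2, backlog_per_instance=5.0):
    """Build kwargs for register_scalable_target and put_scaling_policy."""
    resource_id = f"endpoint/{endpoint_name}/variant/{variant_name}"
    dimension = "sagemaker:variant:DesiredInstanceCount"
    target = {
        "ServiceNamespace": "sagemaker",
        "ResourceId": resource_id,
        "ScalableDimension": dimension,
        "MinCapacity": 0,  # allows the endpoint to scale down to zero instances
        "MaxCapacity": max_capacity,
    }
    policy = {
        "PolicyName": f"{endpoint_name}-backlog-tracking",  # illustrative name
        "ServiceNamespace": "sagemaker",
        "ResourceId": resource_id,
        "ScalableDimension": dimension,
        "PolicyType": "TargetTrackingScaling",
        "TargetTrackingScalingPolicyConfiguration": {
            "TargetValue": backlog_per_instance,  # queued requests per instance
            "CustomizedMetricSpecification": {
                "MetricName": "ApproximateBacklogSizePerInstance",
                "Namespace": "AWS/SageMaker",
                "Dimensions": [{"Name": "EndpointName", "Value": endpoint_name}],
                "Statistic": "Average",
            },
            "ScaleInCooldown": 600,   # seconds; illustrative
            "ScaleOutCooldown": 300,  # seconds; illustrative
        },
    }
    return target, policy
```

The two dictionaries can then be passed to boto3.client("application-autoscaling") as register_scalable_target(**target) and put_scaling_policy(**policy).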

We now dive into inferencing the asynchronous endpoint for music generation.

Inference

In this section, we show how to perform inference using an asynchronous inference endpoint with the MusicGen model. For the sake of brevity, only significant code snippets are included. The full source code for inferencing the MusicGen model is available in the GitHub repo. The following diagram explains the sequence of steps to invoke the asynchronous inference endpoint.

MusicGen - Amazon SageMaker Async Inference Sequence Diagram

We detail the steps to invoke the SageMaker asynchronous inference endpoint for MusicGen by prompting a desired mood in natural language using English. We then demonstrate how to download and play the .wav files generated from the user prompt. Finally, we cover the process of cleaning up the resources created as part of this deployment.

Prepare prompt and instructions

For controlled music generation using MusicGen models, it’s important to understand various generation parameters:

generation_params = { 
    'guidance_scale': 3,
    'max_new_tokens': 1200, 
    'do_sample': True, 
    'temperature': 1 
}

From the preceding code, let’s understand the generation parameters:

  • guidance_scale – The guidance_scale is used in classifier-free guidance (CFG), setting the weighting between the conditional logits (predicted from the text prompts) and the unconditional logits (predicted from an unconditional or ‘null’ prompt). A higher guidance scale encourages the model to generate samples that are more closely linked to the input prompt, usually at the expense of poorer audio quality. CFG is enabled by setting guidance_scale > 1. For best results, use guidance_scale = 3. Our deployment defaults to 3.
  • max_new_tokens – The max_new_tokens parameter specifies the number of new tokens to generate. Generation is limited by the sinusoidal positional embeddings to 30-second inputs, meaning MusicGen can’t generate more than 30 seconds of audio (1,503 tokens). Our deployment defaults to 256.
  • do_sample – This parameter controls whether the model samples from the predicted token distribution (True) or uses greedy decoding (False). The model generates audio conditioned on a text prompt: the MusicgenProcessor preprocesses the inputs, which are then passed to the .generate method to generate text-conditional audio samples. Our deployment defaults to True.
  • temperature – This is the softmax temperature parameter. A higher temperature increases the randomness of the output, making it more diverse. Our deployment defaults to 1.
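
To pick max_new_tokens for a target clip length, you can use the limit quoted above (1,503 tokens for 30 seconds) as a conversion factor. The following sketch assumes that audio length grows linearly with the token count. The helper names are hypothetical, and the results are estimates only.

```python
MAX_TOKENS = 1503   # upper bound quoted above
MAX_SECONDS = 30    # audio length that bound corresponds to
TOKENS_PER_SECOND = MAX_TOKENS / MAX_SECONDS  # roughly 50 tokens per second


def tokens_for_duration(seconds):
    """Estimate max_new_tokens for a clip length, capped at the model limit."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return min(MAX_TOKENS, round(seconds * TOKENS_PER_SECOND))


def duration_for_tokens(max_new_tokens):
    """Estimate the clip length in seconds for a given token budget."""
    return min(max_new_tokens, MAX_TOKENS) / TOKENS_PER_SECOND
```

Under this assumption, the 1,200 tokens used in the preceding generation_params correspond to roughly 24 seconds of audio, and the default of 256 tokens to roughly 5 seconds.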

Let’s look at how to build a prompt to run inference with the MusicGen model:

data = {
    "texts": [
        "Warm and vibrant weather on a sunny day, feeling the vibes of hip hop and synth",
    ],
    "bucket_name": sagemaker_session_bucket,
    "generation_params": generation_params
}

The preceding code is the payload, which will be saved as a JSON file and uploaded to an S3 bucket. We then provide the URI of the input payload during the asynchronous inference endpoint invocation along with other arguments as follows.

The texts key accepts an array of texts, which may contain the mood you want to reflect in your generated music. You can include musical instruments in the text prompt to the MusicGen model to generate music featuring those instruments.

The response from the invoke_endpoint_async is a dictionary of various parameters:

response = sagemaker_runtime.invoke_endpoint_async(
    EndpointName=endpoint_name,
    InputLocation=input_s3_location,
    ContentType="application/json",
    InvocationTimeoutSeconds=3600
)

OutputLocation in the response metadata represents the Amazon S3 URI where the inference response payload is stored.
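
To read that object with the S3 API, you need the bucket and key separately. The following is a minimal sketch of splitting an S3 URI with the standard library; the function name split_s3_uri is hypothetical.

```python
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split 's3://bucket/some/key' into ('bucket', 'some/key')."""
    parsed = urlparse(uri)
    key = parsed.path.lstrip("/")
    if parsed.scheme != "s3" or not parsed.netloc or not key:
        raise ValueError(f"not a valid S3 object URI: {uri!r}")
    return parsed.netloc, key
```

For example, bucket, key = split_s3_uri(response["OutputLocation"]) yields values that can be passed to S3 calls that take Bucket and Key arguments.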

Asynchronous music generation

As soon as the response metadata is sent to the client, the asynchronous inference begins the music generation. The music generation happens on the instance chosen during the deployment of the MusicGen model on the SageMaker asynchronous inference endpoint, as detailed in the deployment section.

Continuous polling and obtaining music files

While the music generation is in progress, we continuously poll for the response metadata parameter OutputLocation:

from utils.inference_utils import get_output
output = get_output(sm_session, response.get('OutputLocation'))

The get_output function keeps polling for the presence of OutputLocation and returns the S3 URI of the .wav music file.
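
The implementation of get_output lives in the repo's utils module and is not reproduced here. The following sketch shows the general shape of such a polling loop, with the existence check injected as a callable so the loop stays independent of any AWS client. The function name wait_for_output and its parameters are hypothetical.

```python
import time


def wait_for_output(output_exists, timeout_seconds=3600, poll_interval=5.0,
                    sleep=time.sleep, clock=time.monotonic):
    """Poll output_exists() until it returns True; return the attempt count.

    Raises TimeoutError if the output has not appeared once the timeout
    has elapsed.
    """
    deadline = clock() + timeout_seconds
    attempts = 0
    while True:
        attempts += 1
        if output_exists():
            return attempts
        if clock() >= deadline:
            raise TimeoutError(f"no output after {timeout_seconds} seconds")
        sleep(poll_interval)
```

In practice, output_exists would wrap a check against the OutputLocation object in Amazon S3, for example an S3 head_object call that returns False when the object is not yet present.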

Audio output

Lastly, we download the files from Amazon S3 and play the output using the following logic:

from utils.inference_utils import play_output_audios
music_files = []
for s3_url in output.get('generated_outputs_s3'):
    if s3_url is not None:
        music_files.append(download_from_s3(s3_url))
play_output_audios(music_files, data.get('texts'))

You now have access to the .wav files and can try changing the generation parameters to experiment with various text prompts.

The following is another music sample based on the following generation parameters:

generation_params = { 'guidance_scale': 5, 'max_new_tokens': 1503, 'do_sample': True, 'temperature': 0.9 }
data = {
    "texts": [
        "Catchy funky beats with drums and bass, synthesized pop for an upbeat pop game",
    ],
    "bucket_name": sagemaker_session_bucket,
    "generation_params": generation_params
}

Clean up

To avoid incurring unnecessary charges, you can clean up using the following code:

import boto3
sagemaker_runtime = boto3.client('sagemaker-runtime')

cleanup = False # < - Set this to True to clean up resources.
endpoint_name = '<Endpoint_Name>'  # replace with the name of your endpoint

sm_client = boto3.client('sagemaker')
endpoint = sm_client.describe_endpoint(EndpointName=endpoint_name)
endpoint_config_name = endpoint['EndpointConfigName']
endpoint_config = sm_client.describe_endpoint_config(EndpointConfigName=endpoint_config_name)
model_name = endpoint_config['ProductionVariants'][0]['ModelName']
notification_config = endpoint_config['AsyncInferenceConfig']['OutputConfig'].get('NotificationConfig', None)
print(f"""
About to delete the following sagemaker resources:
Endpoint: {endpoint_name}
Endpoint Config: {endpoint_config_name}
Model: {model_name}
""")
# notification_config is None when no SNS topics were configured
for k, v in (notification_config or {}).items():
    print(f'About to delete SNS topics for {k} with ARN: {v}')

if cleanup:
    # delete endpoint
    sm_client.delete_endpoint(EndpointName=endpoint_name)
    # delete endpoint config
    sm_client.delete_endpoint_config(EndpointConfigName=endpoint_config_name)
    # delete model
    sm_client.delete_model(ModelName=model_name)
    print('deleted model, config and endpoint')

The preceding cleanup routine deletes the SageMaker endpoint, endpoint configuration, and model associated with the MusicGen model, so that you avoid incurring unnecessary charges. Make sure to set the cleanup variable to True, and replace <Endpoint_Name> with the actual endpoint name of the MusicGen model deployed on SageMaker. Alternatively, you can use the console to delete the endpoint and its associated resources that were created while running the code in this post.

Conclusion

In this post, we learned how to use SageMaker asynchronous inference to deploy the AudioCraft MusicGen model. We started by exploring how the MusicGen models work and covered various use cases for deploying MusicGen models. We also explored how you can benefit from capabilities such as auto scaling and the integration of asynchronous endpoints with Amazon SNS to power downstream tasks. We then took a deep dive into the deployment and inference workflow of MusicGen models on SageMaker, using the AWS Deep Learning Containers for Hugging Face inference and the MusicGen model sourced from the Hugging Face Hub.

Get started with generating music using your creative prompts by signing up for AWS. The full source code is available on the official GitHub repository.


About the Authors

Pavan Kumar Rao Navule is a Solutions Architect at Amazon Web Services, where he works with ISVs in India to help them innovate on the AWS platform. He specializes in architecting AI/ML and generative AI services at AWS. Pavan is a published author of the book “Getting Started with V Programming.” In his free time, Pavan enjoys listening to the great magical voices of Sia and Rihanna.

David John Chakram is a Principal Solutions Architect at AWS. He specializes in building data platforms and architecting seamless data ecosystems. With a profound passion for databases, data analytics, and machine learning, he excels at transforming complex data challenges into innovative solutions and driving businesses forward with data-driven insights.

Sudhanshu Hate is a principal AI/ML specialist with AWS and works with clients to advise them on their MLOps and generative AI journey. In his previous role before Amazon, he conceptualized, created, and led teams to build ground-up open source-based AI and gamification platforms, and successfully commercialized it with over 100 clients. Sudhanshu has to his credit a couple of patents, has written two books and several papers and blogs, and has presented his points of view in various technical forums. He has been a thought leader and speaker, and has been in the industry for nearly 25 years. He has worked with Fortune 1000 clients across the globe and most recently with digital native clients in India.

Rupesh Bajaj is a Solutions Architect at Amazon Web Services, where he collaborates with ISVs in India to help them leverage AWS for innovation. He specializes in providing guidance on cloud adoption through well-architected solutions and holds seven AWS certifications. With 5 years of AWS experience, Rupesh is also a Gen AI Ambassador. In his free time, he enjoys playing chess.

Build an end-to-end RAG solution using Knowledge Bases for Amazon Bedrock and AWS CloudFormation

Retrieval Augmented Generation (RAG) is a state-of-the-art approach to building question answering systems that combines the strengths of retrieval and foundation models (FMs). RAG models first retrieve relevant information from a large corpus of text and then use a FM to synthesize an answer based on the retrieved information.

An end-to-end RAG solution involves several components, including a knowledge base, a retrieval system, and a generation system. Building and deploying these components can be complex and error-prone, especially when dealing with large-scale data and models.
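
To make the retrieve-then-generate flow concrete, the following toy sketch ranks passages against a question and assembles a prompt for an FM. It uses simple term-count cosine similarity as a stand-in for the embedding model and vector index that a real knowledge base would use, and every function name in it is hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, documents, top_k=2):
    """Return the top_k documents most similar to the query."""
    query_vec = Counter(tokenize(query))
    ranked = sorted(
        documents,
        key=lambda doc: cosine(query_vec, Counter(tokenize(doc))),
        reverse=True,
    )
    return ranked[:top_k]


def build_prompt(query, passages):
    """Assemble the retrieved passages and the question into one prompt."""
    context = "\n".join(f"- {p}" for p in passages)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )
```

The generation step, which sends the prompt to an FM, is omitted because it depends on the model you choose.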

This post demonstrates how to seamlessly automate the deployment of an end-to-end RAG solution using Knowledge Bases for Amazon Bedrock and AWS CloudFormation, enabling organizations to quickly and effortlessly set up a powerful RAG system.

Solution overview

The solution provides an automated end-to-end deployment of a RAG workflow using Knowledge Bases for Amazon Bedrock. We use AWS CloudFormation to set up the necessary resources, including:

  1. An AWS Identity and Access Management (IAM) role
  2. An Amazon OpenSearch Serverless collection and index
  3. A knowledge base with its associated data source

The RAG workflow enables you to use your document data stored in an Amazon Simple Storage Service (Amazon S3) bucket and integrate it with the powerful natural language processing capabilities of FMs provided in Amazon Bedrock. The solution simplifies the setup process, allowing you to quickly deploy and start querying your data using the selected FM.

Prerequisites

To implement the solution provided in this post, you should have the following:

  • An active AWS account and familiarity with FMs, Amazon Bedrock, and OpenSearch Serverless.
  • An S3 bucket where your documents are stored in a supported format (.txt, .md, .html, .doc/docx, .csv, .xls/.xlsx, .pdf).
  • The Amazon Titan Embeddings G1-Text model enabled in Amazon Bedrock. You can confirm it’s enabled on the Model access page of the Amazon Bedrock console. If the Amazon Titan Embeddings G1-Text model is enabled, the access status will show as Access granted, as shown in the following screenshot.

Set up the solution

When the prerequisite steps are complete, you’re ready to set up the solution:

  1. Clone the GitHub repository containing the solution files:
     git clone https://github.com/aws-samples/amazon-bedrock-samples.git
  2. Navigate to the solution directory:
     cd knowledge-bases/features-examples/04-infrastructure/e2e-rag-deployment-using-bedrock-kb-cfn
  3. Run the deploy.sh script, which creates the deployment bucket, prepares the CloudFormation templates, and uploads the prepared templates and required artifacts to the deployment bucket:
     bash deploy.sh

While running deploy.sh, if you provide a bucket name as an argument to the script, it will create a deployment bucket with the specified name. Otherwise, it will use the default name format: e2e-rag-deployment-${ACCOUNT_ID}-${AWS_REGION}

As shown in the following screenshot, if you complete the preceding steps in an Amazon SageMaker notebook instance, you can run bash deploy.sh at the terminal, which creates the deployment bucket in your account (account number has been redacted).

  4. After the script is complete, note the S3 URL of the main-template-out.yml.
  5. On the AWS CloudFormation console, create a new stack.
  6. For Template source, select Amazon S3 URL and enter the URL you copied earlier.
  7. Choose Next.
  8. Provide a stack name and specify the RAG workflow details according to your use case and then choose Next.
  9. Leave everything else as default and choose Next on the following pages.
  10. Review the stack details and select the acknowledgement check boxes.
  11. Choose Submit to start the deployment process.

You can monitor the stack deployment progress on the AWS CloudFormation console.

Test the solution

When the deployment is successful (which may take 7–10 minutes to complete), you can start testing the solution.

  1. On the Amazon Bedrock console, navigate to the created knowledge base.
  2. Choose Sync to initiate the data ingestion job.
  3. After data synchronization is complete, select the desired FM to use for retrieval and generation. Model access to this FM must be granted in Amazon Bedrock before you can use it.
  4. Start querying your data using natural language queries.

That’s it! You can now interact with your documents using the RAG workflow powered by Amazon Bedrock.

Clean up

To avoid incurring future charges, delete the resources used in this solution:

  1. On the Amazon S3 console, manually delete the contents inside the bucket you created for template deployment, then delete the bucket.
  2. On the AWS CloudFormation console, choose Stacks in the navigation pane, select the main stack, and choose Delete.

Your created knowledge base will be deleted when you delete the stack.

Conclusion

In this post, we introduced an automated solution for deploying an end-to-end RAG workflow using Knowledge Bases for Amazon Bedrock and AWS CloudFormation. By using the power of AWS services and the preconfigured CloudFormation templates, you can quickly set up a powerful question answering system without the complexities of building and deploying individual components for RAG applications. This automated deployment approach not only saves time and effort, but also provides a consistent and reproducible setup, enabling you to focus on utilizing the RAG workflow to extract valuable insights from your data.

Try it out and see firsthand how it can streamline your RAG workflow deployment and enhance efficiency. Please share your feedback with us!


About the Authors

Sandeep Singh is a Senior Generative AI Data Scientist at Amazon Web Services, helping businesses innovate with generative AI. He specializes in generative AI, machine learning, and system design. He has successfully delivered state-of-the-art AI/ML-powered solutions to solve complex business problems for diverse industries, optimizing efficiency and scalability.

Yanyan Zhang is a Senior Generative AI Data Scientist at Amazon Web Services, where she has been working on cutting-edge AI/ML technologies as a Generative AI Specialist, helping customers use generative AI to achieve their desired outcomes. With a keen interest in exploring new frontiers in the field, she continuously strives to push boundaries. Outside of work, she loves traveling, working out, and exploring new things.

Mani Khanuja is a Tech Lead – Generative AI Specialists, author of the book Applied Machine Learning and High Performance Computing on AWS, and a member of the Board of Directors for Women in Manufacturing Education Foundation Board. She leads machine learning projects in various domains such as computer vision, natural language processing, and generative AI. She speaks at internal and external conferences such as AWS re:Invent, Women in Manufacturing West, YouTube webinars, and GHC 23. In her free time, she likes to go for long runs along the beach.

Faster LLMs with speculative decoding and AWS Inferentia2

In recent years, we have seen a big increase in the size of large language models (LLMs) used to solve natural language processing (NLP) tasks such as question answering and text summarization. Larger models with more parameters, which are in the order of hundreds of billions at the time of writing, tend to produce better results. For example, Llama-3-70B scores better than its smaller 8B-parameter version on metrics like reading comprehension (SQuAD 85.6 compared to 76.4). Thus, customers often experiment with larger and newer models to build ML-based products that bring value.

However, the larger the model, the more computationally demanding it is, and the higher the cost to deploy. For example, on AWS Trainium, Llama-3-70B has a median per-token latency of 21.4 ms, while Llama-3-8B takes 4.7 ms. Similarly, Llama-2-70B has a median per-token latency of 20.6 ms, while Llama-2-7B takes 3.7 ms. Customers have to consider performance to ensure they meet their users’ needs. In this blog post, we will explore how speculative sampling can help make large language model inference more compute efficient and cost-effective on AWS Inferentia and Trainium. This technique improves LLM inference throughput and time per output token (TPOT).

Introduction

Modern language models are based on the transformer architecture. The input prompts are processed first using a technique called context encoding, which runs fast because it is parallelizable. Next, we perform auto-regressive token generation where the output tokens are generated sequentially. Note that we cannot generate the next token until we know the previous one, as depicted in Figure 1. Therefore, to generate N output tokens we need N serial runs through the decoder. A run takes longer through a larger model, like Llama-3-70B, than through a smaller model, like Llama-3-8B.

Figure 1: Sequential token generation in LLMs
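
Because each output token needs one serial decoder run, a rough estimate of token generation time is the number of tokens multiplied by the per-token latency. The following sketch applies that estimate to the median Llama-3 latencies quoted earlier. It ignores context encoding time and assumes a constant per-token latency, and the 256-token response length is an arbitrary illustration.

```python
def sequential_decode_ms(num_tokens, per_token_latency_ms):
    """Estimate generation time when every token needs one serial decoder run."""
    return num_tokens * per_token_latency_ms


# Median per-token latencies quoted earlier in this post.
llama3_70b_ms = sequential_decode_ms(256, 21.4)
llama3_8b_ms = sequential_decode_ms(256, 4.7)
print(f"70B: {llama3_70b_ms / 1000:.2f} s, 8B: {llama3_8b_ms / 1000:.2f} s")
```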

From a computational perspective, token generation in LLMs is a memory bandwidth-bound process. The larger the model, the more likely it is that we will wait on memory transfers. This results in underutilizing the compute units and not fully benefiting from the floating-point operations (FLOPS) available.

Speculative sampling

Speculative sampling is a technique that improves the computational efficiency of running inference with LLMs, while maintaining accuracy. It works by using a smaller, faster draft model to generate multiple tokens, which are then verified by a larger, slower target model. This verification step processes multiple tokens in a single pass, which is more compute efficient than processing them sequentially. Increasing the number of tokens processed in parallel increases the compute intensity because a larger number of tokens can be multiplied with the same weight tensor. This provides better performance compared with the non-speculative run, which is usually memory bandwidth-bound, and thus leads to better hardware resource utilization.

The speculative process involves an adjustable window k, where the target model provides one guaranteed correct token, and the draft model speculates on the next k-1 tokens. If the draft model’s tokens are accepted, the process speeds up. If not, the target model takes over, ensuring accuracy.

Figure 2: Case when all speculated tokens are accepted

Figure 2 illustrates a case where all speculated tokens are accepted, resulting in faster processing. The target model provides a guaranteed output token, and the draft model runs multiple times to produce a sequence of possible output tokens. These are verified by the target model and subsequently accepted by a probabilistic method.

Figure 3: Case when some speculated tokens are rejected

On the other hand, Figure 3 shows a case where some of the tokens are rejected. The time it takes to run this speculative sampling loop is the same as in Figure 2, but we obtain fewer output tokens. This means we will be repeating this process more times to complete the response, resulting in slower overall processing.

By adjusting the window size k and understanding when the draft and target models are likely to produce similar results, we can maximize the benefits of speculative sampling.
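
The loop described above can be sketched with toy models. The following example is not the transformers-neuronx implementation. It assumes the commonly described acceptance rule: accept a drafted token with probability min(1, p_target / p_draft), and on rejection resample from the positive part of the difference between the two distributions. The four-token vocabulary and both toy distributions are made up for illustration.

```python
import random

VOCAB_SIZE = 4


def target_probs(context):
    """Toy 'large' model: strongly prefers (last token + 1) mod VOCAB_SIZE."""
    probs = [0.1] * VOCAB_SIZE
    probs[(context[-1] + 1) % VOCAB_SIZE] = 0.7
    return probs


def draft_probs(context):
    """Toy 'small' model: same preference as the target, but less peaked."""
    probs = [0.2] * VOCAB_SIZE
    probs[(context[-1] + 1) % VOCAB_SIZE] = 0.4
    return probs


def sample(weights, rng):
    """Draw one token index in proportion to the given weights."""
    return rng.choices(range(len(weights)), weights=weights, k=1)[0]


def speculative_step(context, k, draft, target, rng):
    """Run one loop with window k and return between 1 and k new tokens."""
    proposed, ctx = [], list(context)
    for _ in range(k - 1):  # the draft model speculates on k-1 tokens
        q = draft(ctx)
        token = sample(q, rng)
        proposed.append((token, q))
        ctx.append(token)
    accepted, ctx = [], list(context)
    for token, q in proposed:
        p = target(ctx)  # a real system scores all positions in one pass
        if rng.random() < min(1.0, p[token] / q[token]):
            accepted.append(token)
            ctx.append(token)
        else:
            # Rejected: the target supplies a corrected token and the loop ends.
            residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
            accepted.append(sample(residual, rng))
            return accepted
    accepted.append(sample(target(ctx), rng))  # guaranteed token from the target
    return accepted


def generate(prompt, num_tokens, k, draft, target, seed=0):
    """Generate num_tokens tokens; also return how many loops were needed."""
    rng = random.Random(seed)
    out, loops = list(prompt), 0
    while len(out) - len(prompt) < num_tokens:
        out.extend(speculative_step(out, k, draft, target, rng))
        loops += 1
    return out[len(prompt):len(prompt) + num_tokens], loops
```

When the draft and target distributions are identical, every drafted token is accepted and each loop yields k tokens. The less the two models agree, the more loops are needed for the same output length, which mirrors the difference between Figures 2 and 3.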

A Llama-2-70B/7B demonstration

We will show how speculative sampling works on Inferentia2-powered Amazon EC2 Inf2 instances and Trainium-powered EC2 Trn1 instances. We will be using a sample where we generate text faster with Llama-2-70B by using a Llama-2-7B model as a draft model. The example walk-through is based on Llama-2 models, but you can follow a similar process for Llama-3 models as well.

Loading models

You can load the Llama-2 models using data type bfloat16. The draft model needs to be loaded in a standard way like in the example below. The parameter n_positions is adjustable and represents the maximum sequence length you want to allow for generation. The only batch_size we support for speculative sampling at the time of writing is 1. We will explain tp_degree later in this section.

draft_model = LlamaForSampling.from_pretrained('Llama-2-7b', n_positions=128, batch_size=1, tp_degree=32, amp='bf16')

The target model should be loaded in a similar way, but with speculative sampling functionality enabled. The value k was described previously.

target_model = LlamaForSampling.from_pretrained('Llama-2-70b', n_positions=128, batch_size=1, tp_degree=32, amp='bf16')
target_model.enable_speculative_decoder(k)

Combined, the two models need almost 200 GB of device memory for the weights with additional memory in the order of GBs needed for key-value (KV) caches. If you prefer to use the models with float32 parameters, they will need around 360 GB of device memory. Note that the KV caches grow linearly with sequence length (input tokens + tokens yet to be generated). Use neuron-top to see the memory utilization live. To accommodate these memory requirements, we’ll need either the largest Inf2 instance (inf2.48xlarge) or largest Trn1 instance (trn1.32xlarge).

Because of the size of the models, their weights need to be distributed amongst the NeuronCores using a technique called tensor parallelism. Notice that in the sample provided, tp_degree is used per model to specify how many NeuronCores that model should use. This, in turn, affects the memory bandwidth utilization, which is critical for token generation performance. A higher tp_degree can lead to better bandwidth utilization and improved throughput. The topology for Trn1 requires that tp_degree is set to 1, 2, 8, 16 or a multiple of 32. For Inf2, it needs to be 1 or multiples of 2.

The order in which you load the models also matters. After a set of NeuronCores has been initialized and allocated for one model, you cannot use the same NeuronCores for another model unless it’s the exact same set. If you try to use only some of the NeuronCores that were previously initialized, you will get an nrt_load_collectives - global nec_comm is already init'd error.

Let’s go through two examples on trn1.32xlarge (32 NeuronCores) to understand this better. We will calculate how many NeuronCores we need per model. The formula used is the observed model size in memory, using neuron-top, divided by 16 GB, which is the device memory per NeuronCore.

  1. If we run the models using bfloat16, we need more than 10 NeuronCores for Llama-2-70B and more than 2 NeuronCores for Llama-2-7B. Because of topology constraints, it means we need at least tp_degree=16 for Llama-2-70B. We can use the remaining 16 NeuronCores for Llama-2-7B. However, because both models fit in memory across 32 NeuronCores, we should set tp_degree=32 for both, to speed up the model inference for each.
  2. If we run the models using float32, we need more than 18 NeuronCores for Llama-2-70B and more than 3 NeuronCores for Llama-2-7B. Because of topology constraints, we have to set tp_degree=32 for Llama-2-70B. That means Llama-2-7B needs to re-use the same set of NeuronCores, so you need to set tp_degree=32 for Llama-2-7B too.
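
The sizing rule in these two examples can be sketched as follows. The allowed tp_degree values are the Trn1 values listed earlier, limited to the 32 NeuronCores of a trn1.32xlarge. The helper names and the 170 GB figure in the usage note are hypothetical.

```python
import math

TRN1_ALLOWED_TP_DEGREES = (1, 2, 8, 16, 32)  # Trn1 values, capped at 32 cores
DEVICE_MEMORY_PER_CORE_GB = 16


def cores_needed(model_memory_gb):
    """Minimum NeuronCores whose combined device memory holds the model."""
    return math.ceil(model_memory_gb / DEVICE_MEMORY_PER_CORE_GB)


def smallest_tp_degree(min_cores, allowed=TRN1_ALLOWED_TP_DEGREES):
    """Smallest allowed tp_degree that provides at least min_cores cores."""
    for degree in sorted(allowed):
        if degree >= min_cores:
            return degree
    raise ValueError(f"{min_cores} NeuronCores exceed the largest allowed tp_degree")
```

For instance, a hypothetical observed size of 170 GB gives cores_needed(170) = 11, and smallest_tp_degree(11) returns 16, matching the first example. A model that needs 19 NeuronCores gets 32, as in the second example.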

Walkthrough

The decoder we’ll use from transformers-neuronx is LlamaForSampling, which is suitable for loading and running Llama models. You can also use NeuronAutoModelForCausalLM which will attempt to auto-detect which decoder to use. To perform speculative sampling, we need to create a speculative generator first which takes two models and the value k described previously.

spec_gen = SpeculativeGenerator(draft_model, target_model, k)

We then run inference by calling the generator’s sample method:

spec_gen.sample(input_ids=input_token_ids, sequence_length=total_output_length)

During sampling, several hyperparameters (for example, temperature, top_p, and top_k) affect whether the output is deterministic across multiple runs. At the time of writing, the speculative sampling implementation sets default values for these hyperparameters. With these values, expect randomness in the results when you run a model multiple times, even with the same prompt. This is expected behavior for LLMs because it improves the quality of their responses.
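To illustrate how these hyperparameters relate to determinism, here is a minimal standard-library sketch of temperature and top_k sampling over a toy list of logits. It is not the transformers-neuronx implementation: the sample_token function and the logit values are assumptions made for illustration, and top_p is left out for brevity.

```python
import math
import random


def sample_token(logits, temperature=1.0, top_k=None, rng=random):
    """Sample a token index from raw logits (hypothetical helper).

    top_k keeps only the k highest-scoring tokens; temperature rescales
    the logits before the softmax.
    """
    # Rank token indices from highest to lowest logit, then keep the top_k.
    indices = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    if top_k is not None:
        indices = indices[:top_k]
    # Temperature-scaled softmax weights over the surviving tokens.
    scaled = [logits[i] / temperature for i in indices]
    peak = max(scaled)
    weights = [math.exp(s - peak) for s in scaled]
    return rng.choices(indices, weights=weights, k=1)[0]


toy_logits = [0.1, 2.0, 1.5]  # made-up values for three tokens

# With top_k=1 only the highest-scoring token survives, so every run agrees.
print({sample_token(toy_logits, top_k=1) for _ in range(20)})

# With a wider top_k, repeated runs can return different tokens.
print({sample_token(toy_logits, top_k=3) for _ in range(20)})
```

Restricting sampling to a single candidate removes the randomness, which is why settings of this kind determine whether repeated runs produce the same text.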

When you run the sample, you use the default token acceptor, which follows the DeepMind paper that introduced speculative sampling and accepts tokens using a probabilistic method. You can also implement a custom token acceptor and pass it through the acceptor parameter when you initialize the SpeculativeGenerator, for example if you want more deterministic responses. See the implementation of the DefaultTokenAcceptor class in transformers-neuronx to understand how to write your own.
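For intuition, the probabilistic acceptance rule described in the speculative sampling paper can be sketched in a few lines of plain Python. This is an illustration of the rule for a single drafted token, not the DefaultTokenAcceptor code from transformers-neuronx; the function name accept_or_resample and the toy probability lists are assumptions.

```python
import random


def accept_or_resample(draft_token, draft_probs, target_probs, rng=random):
    """Apply the acceptance rule to one drafted token (hypothetical helper).

    draft_probs and target_probs are each model's probabilities over the
    same vocabulary at this position. Returns (token, accepted).
    """
    q = draft_probs[draft_token]   # draft model's probability of its own token
    p = target_probs[draft_token]  # target model's probability of that token
    # Accept the drafted token with probability min(1, p / q).
    if rng.random() < min(1.0, p / q):
        return draft_token, True
    # Rejected: resample from the renormalized residual max(0, p - q).
    residual = [max(0.0, pt - qd) for pt, qd in zip(target_probs, draft_probs)]
    token = rng.choices(range(len(residual)), weights=residual, k=1)[0]
    return token, False


# The target model agrees with the draft model: the token is always accepted.
print(accept_or_resample(0, [0.7, 0.3], [0.7, 0.3]))

# The target model gives the drafted token zero probability: it is replaced.
print(accept_or_resample(0, [0.9, 0.1], [0.0, 1.0]))
```

A more deterministic acceptor might, for instance, accept a drafted token only when it matches the target model’s most likely token. That is one possible design choice, not something the post prescribes.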

Conclusion

As more developers look to incorporate LLMs into their applications, they face a choice: use larger, slower, more costly models that deliver higher-quality results, or use smaller, faster, less expensive models that might reduce the quality of answers. Now, with AWS artificial intelligence (AI) chips and speculative sampling, developers don’t have to make that choice. They can take advantage of the high-quality outputs of larger models and the speed and responsiveness of smaller models.

In this blog post, we have shown that we can accelerate the inference of large models, such as Llama-2-70B, by using a new feature called speculative sampling.

To try it yourself, check out the speculative sampling example, and tweak the input prompt and k parameter to see the results you get. For more advanced use cases, you can develop your own token acceptor implementation. To learn more about running your models on Inferentia and Trainium instances, see the AWS Neuron documentation. You can also visit the AWS Neuron channel on repost.aws to discuss your experiments with the AWS Neuron community and share ideas.


About the Authors

Syl Taylor is a Specialist Solutions Architect for Efficient Compute. She advises customers across EMEA on Amazon EC2 cost optimization and improving application performance using AWS-designed chips. Syl previously worked in software development and AI/ML for AWS Professional Services, designing and implementing cloud native solutions. She’s based in the UK and loves spending time in nature.

Emir Ayar is a Senior Tech Lead Solutions Architect with the AWS Prototyping team. He specializes in assisting customers with building ML and generative AI solutions, and implementing architectural best practices. He supports customers in experimenting with solution architectures to achieve their business objectives, emphasizing agile innovation and prototyping. He lives in Luxembourg and enjoys playing synthesizers.
