Using natural language in Amazon Q Business: From searching and creating ServiceNow incidents and knowledge articles to generating insights

Many enterprise customers across various industries are looking to adopt generative AI to drive innovation, boost user productivity, and enhance customer experience. Generative AI–powered assistants such as Amazon Q Business can be configured to answer questions, provide summaries, generate content, and securely complete tasks based on data and information in your enterprise systems. Amazon Q Business understands natural language and allows users to receive immediate, permissions-aware responses from enterprise data sources with citations. This capability supports various use cases such as IT, HR, and help desk.

With custom plugins for Amazon Q Business, you can enhance the application environment to enable your users to use natural language to perform specific tasks related to third-party applications — such as Jira, Salesforce, and ServiceNow — directly from within their web experience chat.

Enterprises that have adopted ServiceNow can improve their operations and boost user productivity by using Amazon Q Business for various use cases, including incident and knowledge management. Users can search ServiceNow knowledge base (KB) articles and incidents in addition to being able to create, manage, and track incidents and KB articles, all from within their web experience chat.

In this post, we’ll demonstrate how to configure an Amazon Q Business application and add a custom plugin that gives users the ability to use a natural language interface provided by Amazon Q Business to query real-time data and take actions in ServiceNow. By the end of this hands-on session, you should be able to:

  • Create an Amazon Q Business application and integrate it with ServiceNow using a custom plugin.
  • Use natural language in your Amazon Q web experience chat to perform read and write actions in ServiceNow such as querying and creating incidents and KB articles in a secure and governed fashion.

Prerequisites

Before proceeding, make sure that you have the necessary AWS account permissions and services enabled, along with access to a ServiceNow environment with the required privileges for configuration.

AWS

ServiceNow

  • Obtain a ServiceNow Personal Developer Instance or use a clean ServiceNow developer environment. You will need an account that has admin privileges to perform the configuration steps in ServiceNow.

Solution overview

The following architecture diagram illustrates the workflow for Amazon Q Business web experience with enhanced capabilities to integrate it seamlessly with ServiceNow.

Solution Overview

The implementation includes the following steps:

  1. The solution begins with configuring Amazon Q Business using the AWS Management Console. This includes setting up the application environment, adding users to AWS IAM Identity Center, selecting the appropriate subscription tier, and configuring the web experience for users to interact with. The environment can optionally be configured to provide real-time data retrieval using a native retriever, which pulls information from indexed data sources, such as Amazon Simple Storage Service (Amazon S3), during interactions.
  2. The next step involves adjusting the global controls and response settings for the application environment guardrails to allow Amazon Q Business to use its large language model (LLM) knowledge to generate responses when it cannot find responses from your connected data sources.
  3. Integration with ServiceNow is achieved by setting up an OAuth Inbound application endpoint in ServiceNow, which authenticates and authorizes interactions between Amazon Q Business and ServiceNow. This involves creating an OAuth API endpoint in ServiceNow and using the web experience URL from Amazon Q Business as the callback URL. The setup makes sure that Amazon Q Business can securely perform actions in ServiceNow with the same scoped permissions as the user signing in to ServiceNow.
  4. The final step of the solution involves enhancing the application environment with a custom plugin for ServiceNow using APIs defined in an OpenAPI schema. The plugin allows Amazon Q Business to securely interact with ServiceNow’s REST APIs, enabling operations such as querying, creating, and updating records dynamically and in real time.

Configuring the Amazon Q Business application

To create an Amazon Q Business application, sign in to the Amazon Q Business console.
As a prerequisite to creating an Amazon Q Business application, follow the instructions in the Configuring an IAM Identity Center instance section. Amazon Q Business integrates with IAM Identity Center to enable managing user access to your Amazon Q Business application. This is the recommended method for managing human access to AWS resources and the method used in this post.

Amazon Q Business also supports identity federation through IAM. When you use identity federation, you can manage users with your enterprise identity provider (IdP) and use IAM to authenticate users when they sign in to Amazon Q Business.

Create and configure the Amazon Q Business application:

  1. In the Amazon Q Business console, choose Application from the navigation pane and then choose Create application.
  2. Enter the following information for your Amazon Q Business application:
    • Application name: Enter a name for quick identification, such as my-demo-application.
    • Service access: Select Create and use a new service-linked role (SLR). A service-linked role is a unique type of IAM role that is linked directly to Amazon Q Business. Service-linked roles are predefined by Amazon Q Business and include the permissions that the service requires to call other AWS services on your behalf.
    • Choose Create.
  3. After creating your Amazon Q Business application environment, create and select the retriever and provision the index that will power your generative AI web experience. The retriever pulls data from the index in real time during a conversation. On the Select Retriever page:
    • Retrievers: Select Use native retriever.
    • Index provisioning: Select Starter, which is ideal for proof-of-concept or developer workloads. See Index types for more information.
    • Number of units: Enter 1. This indicates the capacity units that you want to provision for your index. Each unit is 20,000 documents.
    • Choose Next.

Select Retriever

  4. After you select a retriever for your Amazon Q Business application environment, you can optionally connect other data sources to it. Because a data source isn’t required for this session, we won’t configure one. For more information on connecting data sources to an Amazon Q Business application, see connecting data sources.
    • Choose Next.
  5. As an account admin, you can add users to your IAM Identity Center instance from the Amazon Q Business console. After you add users or groups to an application environment, you can then choose the Amazon Q Business tier for each user or group. On the Add groups and users page:
    • Choose Add groups and users.
    • In the Add new users dialog box that opens, enter the details of the user. The details you must enter for a single user include: Username, First name, Last name, email address, Confirm email address, and Display name.
    • Choose Next and then Add. The user is automatically added to an IAM Identity Center directory and an email invitation to join Identity Center is sent to the email address provided.
    • After adding a user or group, choose the Amazon Q Business subscription tier for each user or group. From the Current subscription dropdown menu, select Q Business Pro.
    • For the Web experience service access, select Create and use a new service role.
    • Choose Create application.

    Add groups and users

Upon successful completion, Amazon Q Business returns a web experience URL that you can share with the users you added to your application environment. The Web experience URL (in this case: https://xxxxxxxx.chat.qbusiness.us-east-1.on.aws/) will be used when creating an OAuth application endpoint in ServiceNow. Note that your web experience URL will be different from the one shown here.

Application Created
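
The index and retriever choices made in the console steps above can also be written down as request parameters. The following is a minimal sketch, not the method used in this post: it assumes you would pass these dictionaries to the create_index and create_retriever operations of a boto3 "qbusiness" client, and the display names are hypothetical. The only figure taken from this post is the capacity of 20,000 documents per unit.

```python
import math

DOCS_PER_UNIT = 20_000  # from this post: each index unit is 20,000 documents


def units_needed(document_count: int) -> int:
    """Return the number of capacity units for a document count (at least 1)."""
    if document_count < 0:
        raise ValueError("document_count must be non-negative")
    return max(1, math.ceil(document_count / DOCS_PER_UNIT))


def index_request(application_id: str, document_count: int) -> dict:
    """Build keyword arguments for a Starter index (assumed create_index shape)."""
    return {
        "applicationId": application_id,
        "displayName": "my-demo-index",  # hypothetical name
        "type": "STARTER",
        "capacityConfiguration": {"units": units_needed(document_count)},
    }


def retriever_request(application_id: str, index_id: str) -> dict:
    """Build keyword arguments for a native retriever (assumed create_retriever shape)."""
    return {
        "applicationId": application_id,
        "displayName": "my-demo-retriever",  # hypothetical name
        "type": "NATIVE_INDEX",
        "configuration": {"nativeIndexConfiguration": {"indexId": index_id}},
    }
```

For the proof-of-concept setup in this post, units_needed returns 1 for any count up to 20,000 documents, which matches the value entered in the console.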

Enhancing an Amazon Q Business application with guardrails

By default, an Amazon Q Business application is configured to respond to user chat queries using only enterprise data. Because we didn’t configure a data source for the purpose of this post, you will use Admin controls and guardrails to allow Amazon Q to use its LLM world knowledge to generate responses when it cannot find responses from your connected data sources.

Configure admin controls and guardrails:

  1. From the Amazon Q Business console, choose Applications in the navigation pane. Select the name of your application from the list of applications.
  2. From the navigation pane, choose Enhancements, and then choose Admin Controls and guardrails.
  3. In Global Controls, choose Edit.
  4. In Response settings under Application guardrails, select Allow Amazon Q to fall back to LLM knowledge.

create guardrails

Configuring ServiceNow

To allow Amazon Q Business to connect to your ServiceNow instance, you need to create an OAuth inbound application endpoint. OAuth-based authentication validates the identity of the client that attempts to establish a trust on the system by using an authentication protocol. For more information, see OAuth Inbound and Outbound authentication.

Create an OAuth application endpoint for external client applications to access the ServiceNow instance:

  1. In the ServiceNow console, navigate to All, then System OAuth, then Application Registry and then choose New. On the interceptor page, select Create an OAuth API endpoint for external clients and then fill in the form with details for Name and Redirect URL. The other fields are automatically generated by the ServiceNow OAuth server.
    • The Redirect URL is the callback URL that the authorization server redirects to. Enter the web experience URL of your Amazon Q Business application environment (which is the client requesting access to the resource), appended by oauth/callback.
    • For this example, the URL is: https://xxxxxxxx.chat.qbusiness.us-east-1.on.aws/oauth/callback
  2. For Auth Scope, set the value to useraccount. The scope API response parameter defines the amount of access granted by the access token, which means that the access token has the same rights as the user account that authorized the token. For example, if Abel Tuter authorizes an application by providing login credentials, then the resulting access token grants the token bearer the same access privileges as Abel Tuter.
  3. Choose Submit.

This creates an OAuth client application record and generates a client ID and client secret, which Amazon Q Business needs to access the restricted resources on the instance. You will need this authentication information (client ID and client secret) in the following custom plugin configuration process.
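
To make the redirect URL rule concrete, the sketch below derives the callback URL from a web experience URL and builds the kind of authorization request that a standard OAuth 2.0 authorization code flow sends to the instance. It is an illustration under assumptions: the oauth_auth.do path is the authorization endpoint used in the OpenAPI schema later in this post, while the client ID and state values are hypothetical placeholders. Amazon Q Business performs this exchange for you, so you would not normally run it yourself.

```python
from urllib.parse import urlencode


def callback_url(web_experience_url: str) -> str:
    """Append oauth/callback to the web experience URL with a single slash."""
    return web_experience_url.rstrip("/") + "/oauth/callback"


def authorization_url(instance_url: str, client_id: str,
                      redirect_uri: str, state: str) -> str:
    """Build an authorization code request URL for the ServiceNow instance."""
    query = urlencode({
        "response_type": "code",       # authorization code grant
        "client_id": client_id,        # generated by the Application Registry record
        "redirect_uri": redirect_uri,  # must match the registered Redirect URL
        "state": state,                # opaque value echoed back to the client
    })
    return f"{instance_url.rstrip('/')}/oauth_auth.do?{query}"


redirect = callback_url("https://xxxxxxxx.chat.qbusiness.us-east-1.on.aws/")
print(redirect)
print(authorization_url("https://devxxxxxx.service-now.com/",
                        "hypothetical-client-id", redirect, "abc123"))
```

The redirect URI in the request has to match the Redirect URL registered in ServiceNow exactly, which is why the helper normalizes the trailing slash before appending the path.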

ServiceNow App Registry OAuth

Enhancing the Amazon Q Business application environment with custom plugins for ServiceNow

To integrate with external applications, Amazon Q Business uses APIs, which are configured as part of the custom plugins.

Before creating a custom plugin, you need to create or edit an OpenAPI schema, outlining the different API operations that you want to enable for your custom plugin. Amazon Q Business uses the configured third-party OpenAPI specifications to dynamically determine which API operations to perform to fulfill a user request. Therefore, the OpenAPI schema definition has a big impact on API selection accuracy and might require design optimizations. In order to maximize accuracy and improve efficiency with an Amazon Q Business custom plugin, follow the best practices for configuring OpenAPI schema definitions.

To configure a custom plugin, you must define at least one and a maximum of eight API operations that can be invoked. To define the API operations, create an OpenAPI schema in JSON or YAML format. You can create OpenAPI schema files and upload them to Amazon S3. Alternatively, you can use the OpenAPI text editor in the console, which will validate your schema.
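
Before pasting a schema into the console, it can help to count its operations against the limit of one to eight. The following is a small sketch that assumes the schema is JSON and that every HTTP method key under a path counts as one operation; the function names are illustrative and not part of any AWS tooling.

```python
import json

# HTTP method keys that OpenAPI 3.0 allows on a path item.
HTTP_METHODS = {"get", "put", "post", "delete", "options", "head", "patch", "trace"}
MIN_OPERATIONS, MAX_OPERATIONS = 1, 8  # limits stated in this post


def count_operations(schema: dict) -> int:
    """Count method entries across all paths, ignoring keys such as 'parameters'."""
    return sum(
        1
        for path_item in schema.get("paths", {}).values()
        for key in path_item
        if key.lower() in HTTP_METHODS
    )


def check_operation_limit(schema_text: str) -> int:
    """Parse a JSON schema and raise ValueError if it is outside the limit."""
    count = count_operations(json.loads(schema_text))
    if not MIN_OPERATIONS <= count <= MAX_OPERATIONS:
        raise ValueError(
            f"schema defines {count} operations; expected "
            f"{MIN_OPERATIONS} to {MAX_OPERATIONS}"
        )
    return count
```

Applied to the sample schema in this post, which defines get and post on one path and get, delete, and patch on another, this check would count five operations.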

For this post, a working sample of an OpenAPI Schema for ServiceNow is provided in JSON format. Before using it, edit the template file and replace YOUR_SERVICENOW_INSTANCE_URL in the servers and securitySchemes sections with the URL of your ServiceNow instance.

You can use the REST API Explorer to browse available APIs, API versions, and methods for each API. The explorer enables you to test REST API requests straight from the user interface. The Table API provides endpoints that allow you to perform create, read, update, and delete (CRUD) operations on existing tables. The calling user must have sufficient roles to access the data in the table specified in the request. For additional information on assigning roles, see Managing roles.

{
  "openapi": "3.0.1",
  "info": {
    "title": "Table API",
    "description": "Allows you to perform create, read, update and delete (CRUD) operations on existing tables",
    "version": "latest"
  },
  "externalDocs": {
    "url": "https://docs.servicenow.com/?context=CSHelp:REST-Table-API"
  },
  "servers": [
    {
      "url": "YOUR_SERVICENOW_INSTANCE_URL"
    }
  ],
  "paths": {
    "/api/now/table/{tableName}": {
      "get": {
        "description": "Retrieve records from a table",
        "parameters": [
          {
            "name": "tableName",
            "in": "path",
            "description": "Table Name",
            "required": true,
            "schema": {
              "type": "string"
            }
          },
          {
            "name": "sysparm_query",
            "in": "query",
            "description": "An encoded query string used to filter the results like Incidents Numbers or Knowledge Base IDs etc",
            "required": true,
            "schema": {
              "type": "string"
            }
          },
          {
            "name": "sysparm_fields",
            "in": "query",
            "description": "A comma-separated list of fields to return in the response",
            "required": false,
            "schema": {
              "type": "string"
            }
          },
          {
            "name": "sysparm_limit",
            "in": "query",
            "description": "The maximum number of results returned per page",
            "required": false,
            "schema": {
              "type": "string"
            }
          }
        ],
        "responses": {
          "200": {
            "description": "ok",
            "content": {
              "application/json": {
                "schema": {
                  "$ref": "#/components/schemas/incident"
                }
              }
            }
          }
        }
      },
      "post": {
        "description": "Create a record",
        "parameters": [
          {
            "name": "tableName",
            "in": "path",
            "description": "Table Name",
            "required": true,
            "schema": {
              "type": "string"
            }
          }
        ],
        "requestBody": {
          "content": {
            "application/json": {
              "schema": {
                "type": "object",
                "properties": {
                  "short_description": {
                    "type": "string",
                    "description": "Short Description"
                  },
                  "description": {
                    "type": "string",
                    "description": "Full Description for Incidents only"
                  },
                  "caller_id": {
                    "type": "string",
                    "description": "Caller Email"
                  },
                  "state": {
                    "type": "string",
                    "description": "State of the incident",
                    "enum": [
                      "new",
                      "in_progress",
                      "resolved",
                      "closed"
                    ]
                  },
                  "text": {
                    "type": "string",
                    "description": "Article Body Text for Knowledge Bases Only (KB)"
                  }
                },
                "required": [
                  "short_description",
                  "caller_id"
                ]
              }
            }
          },
          "required": true
        },
        "responses": {
          "200": {
            "description": "ok",
            "content": {
              "application/json": {}
            }
          }
        }
      }
    },
    "/api/now/table/{tableName}/{sys_id}": {
      "get": {
        "description": "Retrieve a record",
        "parameters": [
          {
            "name": "tableName",
            "in": "path",
            "description": "Table Name",
            "required": true,
            "schema": {
              "type": "string"
            }
          },
          {
            "name": "sys_id",
            "in": "path",
            "description": "Sys ID",
            "required": true,
            "schema": {
              "type": "string"
            }
          },
          {
            "name": "sysparm_fields",
            "in": "query",
            "description": "A comma-separated list of fields to return in the response",
            "required": false,
            "schema": {
              "type": "string"
            }
          }
        ],
        "responses": {
          "200": {
            "description": "ok",
            "content": {
              "application/json": {},
              "application/xml": {},
              "text/xml": {}
            }
          }
        }
      },
      "delete": {
        "description": "Delete a record",
        "parameters": [
          {
            "name": "tableName",
            "in": "path",
            "description": "Table Name",
            "required": true,
            "schema": {
              "type": "string"
            }
          },
          {
            "name": "sys_id",
            "in": "path",
            "description": "Sys ID",
            "required": true,
            "schema": {
              "type": "string"
            }
          }
        ],
        "responses": {
          "200": {
            "description": "ok",
            "content": {
              "application/json": {},
              "application/xml": {},
              "text/xml": {}
            }
          }
        }
      },
      "patch": {
        "description": "Update or modify a record",
        "parameters": [
          {
            "name": "tableName",
            "in": "path",
            "description": "Table Name",
            "required": true,
            "schema": {
              "type": "string"
            }
          },
          {
            "name": "sys_id",
            "in": "path",
            "description": "Sys ID",
            "required": true,
            "schema": {
              "type": "string"
            }
          }
        ],
        "requestBody": {
          "content": {
            "application/json": {
              "schema": {
                "type": "object",
                "properties": {
                  "short_description": {
                    "type": "string",
                    "description": "Short Description"
                  },
                  "description": {
                    "type": "string",
                    "description": "Full Description for Incidents only"
                  },
                  "caller_id": {
                    "type": "string",
                    "description": "Caller Email"
                  },
                  "state": {
                    "type": "string",
                    "description": "State of the incident",
                    "enum": [
                      "new",
                      "in_progress",
                      "resolved",
                      "closed"
                    ]
                  },
                  "text": {
                    "type": "string",
                    "description": "Article Body Text for Knowledge Bases Only (KB)"
                  }
                },
                "required": [
                  "short_description",
                  "caller_id"
                ]
              }
            }
          },
          "required": true
        },
        "responses": {
          "200": {
            "description": "ok",
            "content": {
              "application/json": {},
              "application/xml": {},
              "text/xml": {}
            }
          }
        }
      }
    }
  },
  "components": {
    "schemas": {
      "incident": {
        "type": "object",
        "properties": {
          "sys_id": {
            "type": "string",
            "description": "Unique identifier for the incident"
          },
          "number": {
            "type": "string",
            "description": "Incident number"
          },
          "short_description": {
            "type": "string",
            "description": "Brief description of the incident"
          }
        }
      }
    },
    "securitySchemes": {
      "oauth2": {
        "type": "oauth2",
        "flows": {
          "authorizationCode": {
            "authorizationUrl": "YOUR_SERVICENOW_INSTANCE_URL/oauth_auth.do",
            "tokenUrl": "YOUR_SERVICENOW_INSTANCE_URL/oauth_token.do",
            "scopes": {
              "useraccount": "Access equivalent to the user's account"
            }
          }
        }
      }
    }
  },
  "security": [
    {
      "oauth2": [
        "useraccount"
      ]
    }
  ]
}
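
To show what the get operation on /api/now/table/{tableName} in the schema above looks like as an actual request, here is a sketch that builds the URL with its three query parameters. It assumes ServiceNow's encoded query syntax, in which LIKE means "contains" and ^OR joins alternative conditions, and it assumes kb_knowledge as the table name for knowledge articles. The helper names are illustrative.

```python
from typing import List, Optional
from urllib.parse import parse_qs, urlencode, urlsplit


def like_any(term: str, fields: List[str]) -> str:
    """Encoded query matching term in any of the fields, joined with ^OR."""
    return "^OR".join(f"{field}LIKE{term}" for field in fields)


def table_query_url(instance_url: str, table: str, query: str,
                    fields: Optional[List[str]] = None,
                    limit: Optional[int] = None) -> str:
    """Build a Table API GET URL with sysparm_query, sysparm_fields, sysparm_limit."""
    params = {"sysparm_query": query}
    if fields:
        params["sysparm_fields"] = ",".join(fields)
    if limit is not None:
        params["sysparm_limit"] = str(limit)
    return f"{instance_url.rstrip('/')}/api/now/table/{table}?{urlencode(params)}"


def query_params(url: str) -> dict:
    """Decode the query string back into a dict, useful for inspection."""
    return {k: v[0] for k, v in parse_qs(urlsplit(url).query).items()}


url = table_query_url(
    "https://devxxxxxx.service-now.com/",
    "kb_knowledge",  # assumed table name for KB articles
    like_any("login", ["short_description", "text"]),
    fields=["number", "short_description"],
    limit=10,
)
print(url)
```

Limiting the returned fields and the number of rows keeps responses small, which matters for the response size limit discussed in the troubleshooting section.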

The URL for the ServiceNow instance used in this post is: https://devxxxxxx.service-now.com/. After updating the sections of the template with the URL for this specific instance, the JSON should look like the following:

  "servers": [
    {
      "url": "https://devxxxxxx.service-now.com/"
    }
  ],

    "securitySchemes": {
      "oauth2": {
        "type": "oauth2",
        "flows": {
          "authorizationCode": {
            "authorizationUrl": "https://devxxxxxx.service-now.com/oauth_auth.do",
            "tokenUrl": "https://devxxxxxx.service-now.com/oauth_token.do",
            "scopes": {
              "useraccount": "Access equivalent to the user's account"
            }
          }
        }
      }
    }
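
If you prefer to script the substitution instead of editing the template by hand, the following sketch replaces the placeholder and confirms that the result is still valid JSON. It makes one choice of its own: it strips any trailing slash from the instance URL so that paths such as /oauth_auth.do join with a single slash. The function name is illustrative.

```python
import json

PLACEHOLDER = "YOUR_SERVICENOW_INSTANCE_URL"  # placeholder used in the template


def fill_template(template_text: str, instance_url: str) -> dict:
    """Replace the placeholder with the instance URL and parse the result."""
    if not instance_url.startswith("https://"):
        raise ValueError("instance_url should start with https://")
    base = instance_url.rstrip("/")  # avoid a double slash before appended paths
    filled = template_text.replace(PLACEHOLDER, base)
    return json.loads(filled)  # raises json.JSONDecodeError if the edit broke the JSON


template = """
{
  "servers": [{"url": "YOUR_SERVICENOW_INSTANCE_URL"}],
  "components": {"securitySchemes": {"oauth2": {"flows": {"authorizationCode": {
    "authorizationUrl": "YOUR_SERVICENOW_INSTANCE_URL/oauth_auth.do",
    "tokenUrl": "YOUR_SERVICENOW_INSTANCE_URL/oauth_token.do"
  }}}}}
}
"""
schema = fill_template(template, "https://devxxxxxx.service-now.com/")
print(schema["servers"][0]["url"])
```

The shortened template above stands in for the full sample schema; the same call works on the complete file contents.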

To create a custom plugin for ServiceNow:

    1. Sign in to the Amazon Q Business console.
    2. Choose Applications in the navigation pane, and then select your application from the list of applications.
    3. In the navigation pane, choose Enhancements, and then choose Plugins.
    4. In Plugins, choose Add plugin.
    5. In Add plugins, choose Custom plugin.
      Create Custom Plugin
    6. In Custom plugin, enter the following information:
      • In Name and description, for Plugin name: Enter a name for your Amazon Q plugin.
      • In API schema, for API schema source, select Define with in-line OpenAPI schema editor.
      • Select JSON as the format for the schema.
      • Remove any sample schema that appears in the inline OpenAPI schema editor and replace it with the text from the provided sample JSON template, updated with your ServiceNow instance URL.

      Enter Custom Plugin Details

    7. In Authentication: Select Authentication required.
    8. For AWS Secrets Manager secret, choose Create and add a new secret. You need to store the ServiceNow OAuth authentication credentials in a Secrets Manager secret to connect your third-party application to Amazon Q. In the window that opens, enter the details in the form:
      • Secret name: A name for your Secrets Manager secret.
      • Client ID: The Client ID from ServiceNow OAuth configuration in the previous section.
      • Client secret: The Client Secret from ServiceNow OAuth configuration in the previous section.
      • OAuth callback URL: The URL the user needs to be redirected to after authentication. This is your web experience URL followed by oauth/callback. For this example, it’s: https://xxxxxxxx.chat.qbusiness.us-east-1.on.aws/oauth/callback. Amazon Q Business handles OAuth tokens at this URL.

Create AWS Secrets Manager secret

    9. In Choose a method to authorize Amazon Q Business: Select Create and add a new service role. The console will generate a service role name. To connect Amazon Q Business to third-party applications that require authentication, you need to give the Amazon Q role permissions to access your Secrets Manager secret. This will enable an Amazon Q Business custom plugin to access the credentials needed to sign in to the third-party service.
      Custom Plugin Authentication
    10. Choose Add plugin to add your plugin.

Upon successful completion, the plugin will appear under Plugins with Build status of Ready and Plugin status Active.
Custom Plugin Active

Using Amazon Q Business web experience chat to take actions in ServiceNow

Users can launch your Amazon Q Business web experience in two ways:

  • AWS access portal URL provided in an invitation email sent to the user to join AWS IAM Identity Center.
  • Web experience URL shared by the admin.

Navigate to the deployed web experience URL and sign in with your AWS IAM Identity Center credentials.
After signing in, choose the New conversation icon in the left-hand menu to start a conversation.

Example: Search Knowledge Base Articles in ServiceNow for user issue and create an incident

The following chat conversation example illustrates a typical use case of Amazon Q Business integrated with custom plugins for ServiceNow. These features allow you to perform a wide range of tasks tailored to your organization’s needs.

In this example, we initiate a conversation in the web experience chat to search for KB articles related to “log in issues” in ServiceNow by invoking a plugin action. After the user submits a prompt, Amazon Q Business queries ServiceNow through the appropriate API to retrieve the results and provides a response with related KB articles. We then proceed by asking Amazon Q Business for more details to see if any of the KB articles directly addresses the user’s issue. When no relevant KB articles pertaining to the user’s issue are found, we ask Amazon Q Business to summarize the conversation and create a new incident in ServiceNow, making sure the issue is logged for resolution.

User prompt 1 – I am having issues logging in to the intranet and want to know if there are any ServiceNow KB articles on log-in issues. Perform the search on both Short Description and Text field using LIKE operator

Before submitting the preceding prompt for an action to create an incident in ServiceNow, choose the vertical ellipsis to open Conversation settings, then choose Use a Plugin to select the corresponding custom plugin for ServiceNow.
Web Experience Chat conversation with Amazon Q Business with Custom Plugin
If this is the first time a user is accessing the custom plugin or if their past sign-in has expired, the user will need to authenticate. After authenticating successfully, Amazon Q Business will perform the requested task.

Choose Authorize.
Amazon Q Business Authorization for ServiceNow Interaction

If the user isn’t already signed in to ServiceNow, they will be prompted to enter their credentials. For this example, the user signing in to ServiceNow is the admin user and API actions performed in ServiceNow by Amazon Q Business on behalf of the user will have the same level of access as the user within ServiceNow.
ServiceNow Login

Choose Allow for Amazon Q Business to connect to ServiceNow and perform the requested task on your behalf.

Allow Access to Amazon Q Business

Upon executing the user’s request after verifying that they are authorized, Amazon Q Business responds with the information that it retrieved. We then proceed to retrieve additional details with the following prompt.

User prompt 2 – Can you list the KB number and short description in a tabular form?

Conversation with Amazon Q Business to search for KB articles in ServiceNow
Because no KB articles related to the user’s issue were found, we will ask Amazon Q to summarize the conversation context and create an incident with the following prompt.

User prompt 3 – The error I get is "Unable to Login After System Upgrade". Summarize my issue and create an incident with detailed description and add a note that this needs to be resolved asap.

In response to your prompt for an action, Amazon Q displays a review form where you can modify or fill in the necessary information.

To successfully complete the action, choose Submit.

Note: The caller_id value entered in the following example is a valid ServiceNow user.

Amazon Q Business Create Service Now Incident
Your web experience will display a success message if the action succeeds, or an error message if the action fails. In this case, the action succeeded and Amazon Q Business responded accordingly.

Amazon Q Business - Success message after incident Creation

The following screenshot shows that the incident was created successfully in ServiceNow.

Shows ServiceNow Incident Created from Amazon Q Business
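
The fields in the review form correspond to the request body of the post operation in the sample schema, where short_description and caller_id are required and state is limited to four values. The sketch below builds such a body and applies the same checks locally. The function name and the example email address are hypothetical, and the schema describes caller_id as the caller's email.

```python
from typing import Optional

# Values copied from the "state" enum in the sample OpenAPI schema.
ALLOWED_STATES = {"new", "in_progress", "resolved", "closed"}


def incident_body(short_description: str, caller_id: str,
                  description: Optional[str] = None,
                  state: Optional[str] = None) -> dict:
    """Build a JSON body for creating an incident, enforcing the schema's rules."""
    if not short_description or not caller_id:
        raise ValueError("short_description and caller_id are required")
    body = {"short_description": short_description, "caller_id": caller_id}
    if description is not None:
        body["description"] = description
    if state is not None:
        if state not in ALLOWED_STATES:
            raise ValueError(f"state must be one of {sorted(ALLOWED_STATES)}")
        body["state"] = state
    return body


body = incident_body(
    "Unable to Login After System Upgrade",
    "abel.tuter@example.com",  # hypothetical email for a valid ServiceNow user
    description="User cannot log in to the intranet. Needs to be resolved asap.",
    state="new",
)
print(body)
```

Checking the body before submission mirrors what the review form does interactively: it surfaces a missing required field or an invalid state before the request reaches ServiceNow.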

Troubleshooting common errors

To have a seamless experience with third-party application integrations, it’s essential to thoroughly test, identify, and troubleshoot unexpected behavior.

A common error encountered in Amazon Q Business is API Response too large, which occurs when an API response size exceeds the current limit of 100 KB. While prompting techniques are essential for obtaining accurate and relevant answers, optimizing API responses to include only the necessary and relevant data is crucial for better response times and enhanced user experience.
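
The effect of returning only the necessary fields can be estimated offline. The following sketch measures the serialized size of a Table API style response, which wraps records in a "result" list, and compares it with the limit. It assumes that 100 KB means 100 × 1024 bytes, and it simulates field selection locally; against a real instance you would request fewer fields with sysparm_fields and fewer rows with sysparm_limit instead. The sample records are made up for illustration.

```python
import json
from typing import Dict, List

MAX_RESPONSE_BYTES = 100 * 1024  # assumption: "100 KB" interpreted as 102,400 bytes


def response_size(payload: dict) -> int:
    """Size in bytes of the payload when serialized as UTF-8 JSON."""
    return len(json.dumps(payload).encode("utf-8"))


def fits_limit(payload: dict) -> bool:
    """True if the serialized payload is within the assumed limit."""
    return response_size(payload) <= MAX_RESPONSE_BYTES


def select_fields(records: List[Dict], fields: List[str]) -> List[Dict]:
    """Keep only the named fields, mimicking what sysparm_fields does server-side."""
    return [{k: r[k] for k in fields if k in r} for r in records]


# Made-up records: 100 articles, each with a 2,000-character body.
records = [
    {"number": f"KB{i:07d}", "short_description": "Login issue", "text": "x" * 2000}
    for i in range(100)
]
full = {"result": records}
trimmed = {"result": select_fields(records, ["number", "short_description"])}
print(fits_limit(full), fits_limit(trimmed))
```

In this made-up case the full response exceeds the limit because of the article bodies, while the trimmed response with only number and short_description is well within it.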

The REST API Explorer (shown in the following figure) in ServiceNow is a tool that allows developers and administrators to interact with and test the ServiceNow REST APIs directly from within the ServiceNow environment. It provides a user-friendly interface for making API requests, viewing responses, and understanding the available endpoints and data structures. Using this tool simplifies the process of testing and integrating with ServiceNow.
Rest API Explorer in ServiceNow

Clean up

To clean up AWS configurations, sign in to the Amazon Q Business console.

  1. From the Amazon Q Business console, in Applications, select the application that you want to delete.
  2. Choose Actions and select Delete.
  3. To confirm deletion, enter Delete.

This will take a few minutes to finish. When completed, the application and the configured custom plugin will be deleted.
Delete Amazon Q Business App

When you delete the Amazon Q Business application, the users created as part of the configuration are not automatically deleted from IAM Identity Center. Use the instructions in Delete users in IAM Identity Center to delete the users created for this post.

To clean up in ServiceNow, release the Personal Developer Instance provisioned for this post by following the instructions in the ServiceNow Documentation.

Conclusion

The integration of generative AI-powered assistants such as Amazon Q Business with enterprise systems such as ServiceNow offers significant benefits for organizations. By using natural language processing capabilities, enterprises can streamline operations, enhance user productivity, and deliver better customer experiences. The ability to query real-time data and create incidents and knowledge articles through a secure and governed chat interface transforms how users interact with enterprise data and applications. As demonstrated in this post, enhancing Amazon Q Business to integrate with ServiceNow using custom plugins empowers users to perform complex tasks effortlessly, driving efficiency across various business functions. Adopting this technology not only modernizes workflows, but also positions enterprises at the forefront of innovation.

About the Author

Siddhartha Angara is a Senior Solutions Architect at Amazon Web Services. He helps enterprise customers design and build well-architected solutions in the cloud, accelerate cloud adoption, and build Machine Learning and Generative AI applications. He enjoys playing the guitar, reading and family time!

Simplify multimodal generative AI with Amazon Bedrock Data Automation

Developers face significant challenges when using foundation models (FMs) to extract data from unstructured assets. This data extraction process requires carefully identifying models that meet the developer’s specific accuracy, cost, and feature requirements. Additionally, developers must invest considerable time optimizing price performance through fine-tuning and extensive prompt engineering. Managing multiple models, implementing safety guardrails, and adapting outputs to align with downstream system requirements can be difficult and time consuming.

Amazon Bedrock Data Automation in public preview helps address these and other challenges. This new capability from Amazon Bedrock offers a unified experience for developers of all skillsets to easily automate the extraction, transformation, and generation of relevant insights from documents, images, audio, and videos to build generative AI–powered applications. With Amazon Bedrock Data Automation, customers can fully utilize their data by extracting insights from their unstructured multimodal content in a format compatible with their applications. Amazon Bedrock Data Automation’s managed experience, ease of use, and customization capabilities help customers deliver business value faster, eliminating the need to spend time and effort orchestrating multiple models, engineering prompts, or stitching together outputs.

In this post, we demonstrate how to use Amazon Bedrock Data Automation in the AWS Management Console and the AWS SDK for Python (Boto3) for media analysis and intelligent document processing (IDP) workflows.

Amazon Bedrock Data Automation overview

You can use Amazon Bedrock Data Automation to generate standard outputs and custom outputs. Standard outputs are modality-specific default insights, such as video summaries that capture key moments, visual and audible toxic content, explanations of document charts, graph figure data, and more. Custom outputs use customer-defined blueprints that specify output requirements using natural language or a schema editor. The blueprint includes a list of fields to extract, data format for each field, and other instructions, such as data transformations and normalizations. This gives customers full control of the output, making it easy to integrate Amazon Bedrock Data Automation into existing applications.

Using Amazon Bedrock Data Automation, you can build powerful generative AI applications and automate use cases such as media analysis and IDP. Amazon Bedrock Data Automation is also integrated with Amazon Bedrock Knowledge Bases, making it easier for developers to generate meaningful information from their unstructured multimodal content to provide more relevant responses for Retrieval Augmented Generation (RAG).

Customers can get started with standard outputs for all four modalities (documents, images, videos, and audio) and with custom outputs for documents and images. Custom outputs for video and audio will be supported when the capability is generally available.

Amazon Bedrock Data Automation for images, audio, and video

To take a media analysis example, suppose that you work in the media and entertainment industry and want to monetize long-form content, such as TV shows and movies, through contextual ad placement. To deliver the right ads at the right video moments, you need to derive meaningful insights from both the ads and the video content. Amazon Bedrock Data Automation enables your contextual ad placement application by generating these insights. For instance, you can extract valuable information such as video summaries, scene-level summaries, content moderation concepts, and scene classifications based on the Interactive Advertising Bureau (IAB) taxonomy.

To get started with deriving insights with Amazon Bedrock Data Automation, you can create a project where you can specify your output configuration using the AWS console, AWS Command Line Interface (AWS CLI) or API.

To create a project on the Amazon Bedrock console, follow these steps:

  1. Expand the Data Automation dropdown menu in the navigation pane and select Projects, as shown in the following screenshot.
  2. From the Projects console, create a new project and provide a project name, as shown in the following screenshot.
  3. From within the project, choose Edit, as shown in the following screenshot, to specify or modify an output configuration. Standard output is the default way of interacting with Amazon Bedrock Data Automation, and it can be used with audio, documents, images and videos, where you can have one standard output configuration per data type for each project.
  4. For customers who want to analyze images and videos for media analysis, standard output can be used to generate insights such as image summary, video scene summary, and scene classifications with IAB taxonomy. You can select the image summarization, video scene summarization, and IAB taxonomy checkboxes from the Standard output tab and then choose Save changes to finish configuring your project, as shown in the following screenshot.
  5. To test the standard output configuration using your media assets, choose Test, as shown in the following screenshot.

The next example uses the project to generate insights for a travel ad.

  1. Upload an image, then choose Generate results, as shown in the following screenshot, for Amazon Bedrock Data Automation to invoke an inference request.
  2. Amazon Bedrock Data Automation will process the uploaded file based on the project’s configuration, automatically detecting that the file is an image and then generating a summary and IAB categories for the travel ad.
  3. After you have generated insights for the ad image, you can generate video insights to determine the best video scene for effective ad placement. In the same project, upload a video file and choose Generate results, as shown in the following screenshot.

Amazon Bedrock Data Automation will detect that the file is a video and will generate insights for the video based on the standard output configuration specified in the project, as shown in the following screenshot.

These insights from Amazon Bedrock Data Automation can help you place relevant ads in your video content, which can help improve content monetization.

Intelligent document processing with Amazon Bedrock Data Automation

You can use Amazon Bedrock Data Automation to automate IDP workflows at scale, without needing to orchestrate complex document processing tasks such as classification, extraction, normalization, or validation.

To take a mortgage example, a lender wants to automate the processing of a mortgage lending packet to streamline their IDP pipeline and improve the accuracy of loan processing. Amazon Bedrock Data Automation simplifies the automation of complex IDP tasks such as document splitting, classification, data extraction, output format normalization, and data validation. Amazon Bedrock Data Automation also incorporates confidence scores and visual grounding of the output data to mitigate hallucinations and help improve result reliability.

For example, you can generate custom output by defining blueprints, which specify output requirements using natural language or a schema editor, to process multiple file types in a single, streamlined API. Blueprints can be created using the console or the API, and you can use a catalog blueprint or create a custom blueprint for documents and images.

For all modalities, this workflow consists of three main steps: creating a project, invoking the analysis, and retrieving the results.

The following solution walks you through a simplified mortgage lending process with Amazon Bedrock Data Automation using the AWS SDK for Python (Boto3), which is straightforward to integrate into an existing IDP workflow.

Prerequisites

Before you invoke the Amazon Bedrock API, make sure you have the following:

Create custom blueprint

In this example, you have the lending packet, as shown in the following image, which contains three documents: a pay stub, a W-2 form, and a driver’s license.

Amazon Bedrock Data Automation has sample blueprints for these three documents that define commonly extracted fields. However, you can also customize Amazon Bedrock Data Automation to extract specific fields from each document. For example, you can extract only the gross pay and net pay from the pay stub by creating a custom blueprint.

To create a custom blueprint using the API, you can use the CreateBlueprint operation using the Amazon Bedrock Data Automation Client. The following example shows the gross pay and net pay being defined as properties passed to CreateBlueprint, to be extracted from the lending packet:

bda_create_blueprint_response = bedrock_data_automation_client.create_blueprint(
    blueprintName='CUSTOM_PAYSLIP_BLUEPRINT',
    type='DOCUMENT',
    blueprintStage='LIVE',
    schema=json.dumps({
        '$schema': 'http://json-schema.org/draft-07/schema#',
        'description': 'default',
        'documentClass': 'default',
        'type': 'object',
        'properties': {
            'gross_pay_this_period': {
                'type': 'number',
                'inferenceType': 'extractive',
                'description': 'The gross pay for this pay period from the Earnings table'
            },
            'net_pay': {
                'type': 'number',
                'inferenceType': 'extractive',
                'description': 'The net pay for this pay period from the bottom of the document'
            }
        }
    }),
)

The CreateBlueprint response returns the blueprintArn for the pay stub’s custom blueprint:

'blueprintArn': 'arn:aws:bedrock:us-west-2:<AWS_ACCOUNT_ID>:blueprint/<BLUEPRINT_ID>'
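When a blueprint has more than a couple of fields, it can be easier to generate the schema string from a compact field list than to write the nested dictionary by hand. The following sketch reproduces the pay stub schema shown earlier. The helper name `build_blueprint_schema` and its argument layout are assumptions made for this example, not part of the Amazon Bedrock Data Automation API.

```python
import json


def build_blueprint_schema(fields, description="default", document_class="default"):
    """Return a JSON schema string for a blueprint with extractive fields.

    `fields` maps a field name to a (json_type, description) tuple.
    """
    properties = {
        name: {
            "type": field_type,
            "inferenceType": "extractive",
            "description": text,
        }
        for name, (field_type, text) in fields.items()
    }
    return json.dumps({
        "$schema": "http://json-schema.org/draft-07/schema#",
        "description": description,
        "documentClass": document_class,
        "type": "object",
        "properties": properties,
    })


# Same two fields as the pay stub blueprint in this post.
payslip_schema = build_blueprint_schema({
    "gross_pay_this_period": (
        "number", "The gross pay for this pay period from the Earnings table"),
    "net_pay": (
        "number", "The net pay for this pay period from the bottom of the document"),
})
```

The resulting string can be passed as the `schema` argument of `create_blueprint` in place of the inline `json.dumps(...)` call.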

Configure Amazon Bedrock Data Automation project

To begin processing files using blueprints with Amazon Bedrock Data Automation, you first need to create a data automation project. To process a multiple-page document containing different file types, you can configure a project with different blueprints for each file type.

Use Amazon Bedrock Data Automation to apply multiple document blueprints within one project so you can process different types of documents within the same project, each with its own custom extraction logic.

When using the API to create a project, you invoke the CreateDataAutomationProject operation. The following is an example of how you can configure custom output using the custom blueprint for the pay stub and the sample blueprints for the W-2 and driver’s license:

bda_bedrock_automation_create_project_response = bedrock_data_automation_client.create_data_automation_project(
    projectName='TEST_PROJECT',
    projectDescription='test BDA project',
    projectStage='LIVE',
    standardOutputConfiguration={
        'document': {
            'outputFormat': {
                'textFormat': {
                    'types': ['PLAIN_TEXT']
                },
                'additionalFileFormat': {
                    'state': 'ENABLED',
                }
            }
        },
    },
    customOutputConfiguration={
        'blueprints': [
          {
              'blueprintArn': 'arn:aws:bedrock:us-west-2:<AWS_ACCOUNT_ID>:blueprint/<BLUEPRINT_ID>'
          },
          {
              'blueprintArn': 'arn:aws:bedrock:us-west-2:aws:blueprint/bedrock-data-automation-public-w2-form'
          },
          {
              'blueprintArn': 'arn:aws:bedrock:us-west-2:aws:blueprint/bedrock-data-automation-public-us-driver-license'
          },
        ],
    },
    overrideConfiguration={
        'document': {
            'splitter': {
                'state': 'ENABLED'
            }
        }
    },
)

The CreateDataAutomationProject response returns the projectArn for the project:

'arn:aws:bedrock:us-west-2:<AWS_ACCOUNT_ID>:data-automation-project/<PROJECT_ID>'

To process different types of documents using multiple document blueprints in a single project, Amazon Bedrock Data Automation uses a splitter configuration, which must be enabled through the API. The following is the override configuration for the splitter, and you can refer to the Boto3 documentation for more information:

overrideConfiguration={
    'document': {
        'splitter': {
            'state': 'ENABLED' | 'DISABLED'
        }
    }
},

Upon creation, the API validates the input configuration and creates a new project, returning the projectArn, as shown in the following example.

'arn:aws:bedrock:us-west-2:<AWS_ACCOUNT_ID>:data-automation-project/<PROJECT_ID>'

Test the solution

Now that the blueprint and project setup is complete, the InvokeDataAutomationAsync operation from the Amazon Bedrock Data Automation runtime can be used to start processing files. This API call initiates the asynchronous processing of files in an S3 bucket, in this case the lending packet, using the configuration defined in the project by passing the project’s ARN:

bda_invoke_data_automation_async_response = bedrock_data_automation_runtime_client.invoke_data_automation_async(
    inputConfiguration={'s3Uri': '<S3_URI>'},
    outputConfiguration={'s3Uri': '<S3_URI>'},
    dataAutomationConfiguration={
        'dataAutomationArn': 'arn:aws:bedrock:us-west-2:<AWS_ACCOUNT_ID>:data-automation-project/<PROJECT_ID>',
        'stage': 'LIVE'
    }
)

InvokeDataAutomationAsync returns the invocationArn:

'arn:aws:bedrock:us-west-2:<AWS_ACCOUNT_ID>:data-automation-invocation/<INVOCATION_ID>'

GetDataAutomationStatus can be used to view the status of the invocation, using the invocationArn from the previous response:

bda_get_data_automation_status_response = bedrock_data_automation_runtime_client.get_data_automation_status(
    invocationArn='arn:aws:bedrock:us-west-2:<AWS_ACCOUNT_ID>:data-automation-invocation/<INVOCATION_ID>'
)
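Because the invocation is asynchronous, a single status call is rarely enough. The following is a minimal polling sketch that repeats the status call until the job reaches a terminal state. The status strings (`Success`, `ServiceError`, `ClientError`) are an assumption about the values the service returns, so check them against the current Boto3 documentation. The helper name `wait_for_invocation` and its defaults are hypothetical.

```python
import time

# Assumed terminal status values; verify against the Boto3 documentation.
TERMINAL_STATUSES = {"Success", "ServiceError", "ClientError"}


def wait_for_invocation(runtime_client, invocation_arn, poll_seconds=10,
                        max_attempts=60, sleep=time.sleep):
    """Poll get_data_automation_status until the invocation finishes.

    Returns the last status response, or raises TimeoutError if the job
    is still running after max_attempts polls.
    """
    for _ in range(max_attempts):
        response = runtime_client.get_data_automation_status(
            invocationArn=invocation_arn
        )
        if response["status"] in TERMINAL_STATUSES:
            return response
        sleep(poll_seconds)
    raise TimeoutError(
        f"Invocation {invocation_arn} did not finish after {max_attempts} polls"
    )
```

The `sleep` parameter is injectable so the loop can be exercised in tests without waiting. In normal use you would pass only the runtime client and the invocation ARN.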

When the job is complete, view the results in the S3 bucket used in the outputConfiguration by navigating to the ~/JOB_ID/0/custom_output/ folder.
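To locate the results programmatically, you can derive the bucket name and key prefix from the output S3 URI. The sketch below assumes the folder layout described above (`JOB_ID/0/custom_output/` under the output location). The helper name `custom_output_prefix` is invented for this example, and `JOB_ID` is left as a placeholder.

```python
from urllib.parse import urlparse


def custom_output_prefix(output_s3_uri, job_id):
    """Split an s3:// URI and return (bucket, key prefix of the custom output)."""
    parsed = urlparse(output_s3_uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"Not an S3 URI: {output_s3_uri!r}")
    base = parsed.path.strip("/")
    # Skip the base prefix when the URI points at the bucket root.
    parts = [part for part in (base, job_id, "0", "custom_output") if part]
    return parsed.netloc, "/".join(parts) + "/"


# "my-bucket" and "bda-output" are placeholder names.
bucket, prefix = custom_output_prefix("s3://my-bucket/bda-output/", "JOB_ID")
```

The returned pair can be passed to the Amazon S3 `list_objects_v2` operation as `Bucket` and `Prefix` to enumerate the result files.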

From the following sample output, Amazon Bedrock Data Automation associated the pay stub file with the custom pay stub blueprint with a high level of confidence:

'matched_blueprint': {
    'arn': '<BLUEPRINT_ARN>', 'name': 'CUSTOM_PAYSLIP_BLUEPRINT', 'confidence': 0.99959725
}

Using the matched blueprint, Amazon Bedrock Data Automation was able to accurately extract each field defined in the blueprint:

'inference_result': {
    'net_pay': 291.9, 'gross_pay_this_period': 452.43
}

Additionally, Amazon Bedrock Data Automation returns confidence scores and bounding box information for each field:

'explainability_info': [{
    'net_pay': {'success': true, 'confidence': 0.96484375, 'geometry': [{'boundingBox': ...
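One way to use these per-field confidence scores is to route low-confidence fields to a human reviewer. The following sketch assumes the result has already been loaded into a Python dictionary with the `explainability_info` structure shown above. The 0.9 threshold, the helper name `fields_needing_review`, and the confidence value for `gross_pay_this_period` are invented for illustration.

```python
def fields_needing_review(result, threshold=0.9):
    """Return {field: confidence} for fields below the confidence threshold.

    Fields with no confidence value at all are also flagged.
    """
    flagged = {}
    for entry in result.get("explainability_info", []):
        for field, details in entry.items():
            confidence = details.get("confidence")
            if confidence is None or confidence < threshold:
                flagged[field] = confidence
    return flagged


sample_result = {
    "inference_result": {"net_pay": 291.9, "gross_pay_this_period": 452.43},
    "explainability_info": [{
        "net_pay": {"success": True, "confidence": 0.96484375},
        # Illustrative value only; not taken from the sample output.
        "gross_pay_this_period": {"success": True, "confidence": 0.62},
    }],
}

needs_review = fields_needing_review(sample_result)
```

With the sample above, only `gross_pay_this_period` falls below the threshold and would be sent for review.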

This example demonstrates how customers can use Amazon Bedrock Data Automation to streamline and automate an IDP workflow. Amazon Bedrock Data Automation automates complex document processing tasks such as data extraction, normalization, and validation from documents. Amazon Bedrock Data Automation helps to reduce operational complexity and improves processing efficiency to handle higher loan processing volumes, minimize errors, and drive operational excellence.

Cleanup

When you’re finished evaluating this feature, delete the S3 bucket and any objects to avoid any further charges.

Summary

Customers can get started with Amazon Bedrock Data Automation, which is available in public preview in the US West (Oregon) AWS Region. Learn more about Amazon Bedrock Data Automation and how to automate the generation of accurate information from unstructured content for building generative AI–based applications.


About the authors

Ian Lodge is a Solutions Architect at AWS, helping ISV customers in solving their architectural, operational, and cost optimization challenges. Outside of work he enjoys spending time with his family, ice hockey and woodworking.

Alex Pieri is a Solutions Architect at AWS who works with retail customers to plan, build, and optimize their AWS cloud environments. He specializes in helping customers build enterprise-ready generative AI solutions on AWS.

Raj Pathak is a Principal Solutions Architect and Technical advisor to Fortune 50 and Mid-Sized FSI (Banking, Insurance, Capital Markets) customers across Canada and the United States. Raj specializes in Machine Learning with applications in Generative AI, Natural Language Processing, Intelligent Document Processing, and MLOps.

How TUI uses Amazon Bedrock to scale content creation and enhance hotel descriptions in under 10 seconds

TUI Group is one of the world’s leading global tourism services, providing 21 million customers with an unmatched holiday experience in 180 regions. TUI Group covers the end-to-end tourism chain with over 400 owned hotels, 16 cruise ships, 1,200 travel agencies, and 5 airlines covering all major holiday destinations around the globe. At TUI, crafting high-quality content is a crucial component of its promotional strategy.

The TUI content teams are tasked with producing high-quality content for its websites, including product details, hotel information, and travel guides, often using descriptions written by hotel and third-party partners. This content needs to adhere to TUI’s tone of voice, which is essential to communicating the brand’s distinct personality. But as its portfolio expands with more hotels and offerings, scaling content creation has proven challenging. This presents an opportunity to augment and automate the existing content creation process using generative AI.

In this post, we discuss how we used Amazon SageMaker and Amazon Bedrock to build a content generator that rewrites marketing content following specific brand and style guidelines. Amazon Bedrock is a fully managed service that offers a choice of high-performing foundation models (FMs) from leading AI companies such as AI21 Labs, Anthropic, Cohere, Meta, Mistral AI, Stability AI, and Amazon through a single API, along with a broad set of capabilities you need to build generative AI applications with security, privacy, and responsible AI. Amazon SageMaker helps data scientists and machine learning (ML) engineers build FMs from scratch, evaluate and customize FMs with advanced techniques, and deploy FMs with fine-grain controls for generative AI use cases that have stringent requirements on accuracy, latency, and cost.

Through experimentation, we found that a two-phased approach worked best to make sure that the output aligned with TUI’s tone of voice requirements. The first phase was to fine-tune a smaller large language model (LLM) on a large corpus of data. The second phase used a different LLM for post-processing. Through fine-tuning on static data, we generate content that mimics the TUI brand voice in a way that could not be captured through prompt engineering. Employing a second model with few-shot examples helped verify that the output adhered to specific formatting and grammatical rules. The second phase uses a more dynamic dataset, which we can use to adjust the output quickly in the future for different brand requirements. Overall, this approach resulted in higher quality content and allowed TUI to improve content quality at a higher velocity.

Solution overview

The architecture consists of a few key components:

  • LLMs – We evaluated different approaches and found that a two-model solution performed the best. This consists of a fine-tuned Meta Llama model to generate a description for the given hotel and Anthropic’s Claude model to reformat its output. Fine-tuning and hosting of the Meta Llama 2 model were done on Amazon SageMaker, and Anthropic’s Claude 2 was consumed from Amazon Bedrock through API calls.
  • Orchestration – We created a state machine using AWS Step Functions to make calls in a batch format to the two LLMs and fetch the search engine optimization (SEO) score for the generated content from a third-party API. If the SEO content score is above a defined threshold (80%), the generated content is stored in an Amazon DynamoDB table and can later be reviewed by the content team directly in the front-end UI. Through this process, we maintain and monitor content quality at scale.
  • Human in the loop feedback – We developed a custom React front-end application to gather feedback from the content team to facilitate continuous improvement and future model fine-tuning. You can use the feedback to fine-tune a base model on SageMaker using reinforcement learning from human feedback (RLHF) to improve performance.
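The threshold check in the orchestration step can be sketched as a small routing function. This is an illustration of the decision logic only, not TUI's Step Functions definition. The status labels, the helper name `route_description`, and the handling of content at or below the threshold are assumptions, because the post only describes what happens to content that scores above 80%.

```python
SEO_THRESHOLD = 0.80  # the 80% threshold mentioned above


def route_description(hotel_id, description, seo_score, threshold=SEO_THRESHOLD):
    """Decide whether a generated description is stored for content-team review."""
    # Content is stored only when the score is strictly above the threshold.
    status = "STORE_FOR_REVIEW" if seo_score > threshold else "BELOW_THRESHOLD"
    return {
        "hotel_id": hotel_id,
        "description": description,
        "seo_score": seo_score,
        "status": status,
    }


# Hypothetical batch with made-up IDs and scores.
batch = [
    ("hotel-001", "A boutique stay in Thassos Town...", 0.86),
    ("hotel-002", "A sporty base in Vienna...", 0.74),
]
routed = [route_description(*item) for item in batch]
```

In a state machine, the same comparison would typically live in a choice state, with the storing branch writing the item to the DynamoDB table.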

The following diagram is the high-level architecture of the solution.

Architecture Diagram

Prompt engineering

We started by crafting prompts that include the most common issues identified by the TUI content team, including tone of voice, accuracy, length, and grammar. We provided the LLM with a few examples of curated content within the context window. Although the generated output followed the guidance, the writing style didn’t meet TUI’s tone of voice requirements.

 Example prompt:

You are an experienced British copywriter for TUI. TUI is a world-leading travel company. You are an expert in generating hotel descriptions, based on TUI’s tone of voice. TUI's tone of voice can be described as upbeat, enthusiastic, and promotional. Avoid all the words in the following list: {banned words}
Write at most 100 words.
Your hotel descriptions must follow TUI's tone of voice and apply SEO guidelines.
These are some good examples. You should mimic below.
{examples}.
Human: {input}
Assistant:
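A prompt like this is usually filled in programmatically for each hotel. The following sketch shows one way to do that with Python string formatting. The template is an abbreviated version of the prompt above, the placeholder is renamed to `{banned_words}` so that it is a valid identifier, and the banned words and example text are invented for illustration.

```python
PROMPT_TEMPLATE = (
    "You are an experienced British copywriter for TUI. "
    "Avoid all the words in the following list: {banned_words}\n"
    "Write at most 100 words.\n"
    "These are some good examples. You should mimic below.\n"
    "{examples}\n"
    "Human: {input}\n"
    "Assistant:"
)


def build_prompt(source_text, banned_words, examples):
    """Fill the template with banned words, few-shot examples, and the input."""
    return PROMPT_TEMPLATE.format(
        banned_words=", ".join(banned_words),
        examples="\n\n".join(examples),
        input=source_text,
    )


# Illustrative values only.
prompt = build_prompt(
    "Boutique interiors and a great location combine at this small hotel.",
    banned_words=["nestled", "boasts"],
    examples=["Example description one.", "Example description two."],
)
```

Keeping the banned words and examples as arguments makes it straightforward to try different prompt versions against the same set of hotel descriptions.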

Here is an example of the results we received.

Original text:

Boutique interiors and a great location combine at this small but perfectly formed hotel. It may be small but this 15-room hotel packs a punch with its interior design. The immaculate pool is set in an open atrium area. And there’s funky cocktail bar and lounge, too. The location is every bit as impressive as the styling. It’s about a 10-minute walk to the beach and less than half that to the centre of Thassos Town.

Generated text:

This petite hotel brings boutique style to Thassos Town. Its immaculate pool and funky cocktail bar create an oasis in the heart of the action. The beach is just a 10-minute walk away, so you can easily split your time between sunbathing and sightseeing. With only 15 rooms, the experience feels intimate yet worldly. The interiors marry modern design with traditional touches for a unique look. And the central location means popular attractions like the Archaeological Museum and shopping street are right on your doorstep.

To evaluate these results, we generated a dataset of 150 hotel descriptions using two different prompt versions and asked our business teams to review it. They flagged significant issues in 98% of the output. Anthropic’s Claude 2 struggled to adhere to TUI’s tone of voice and branding guidelines, frequently employed abbreviations, and favored American English over British English. It also failed to follow the SEO guidelines provided and sometimes generated false information regarding hotel facilities and locations. The following image shows a list of these challenges and how the LLM handled them. Of the six challenges, the LLM met only one.

Main Challenges

Fine-tuning Llama 2 using PEFT on Amazon SageMaker JumpStart

These issues and poor feedback led us to conclude that prompt engineering alone would not adequately address the problem. As a result, we decided to pursue an alternative approach: fine-tuning a smaller large language model to rewrite the text in accordance with TUI’s tone of voice. We used a curated set of hotel descriptions written by TUI copywriters so that the model would have better alignment with our guidelines.

Using Amazon SageMaker JumpStart, we selected Meta Llama 2, one of the top open source LLMs available at the time, and chose the 13B parameter version to apply parameter-efficient fine-tuning (PEFT), specifically quantized low-rank adaptation (QLoRA). This technique quantizes the pre-trained model to 4 bits and adds small low-rank adapters for fine-tuning. We fine-tuned the model on a single ml.g5.4xlarge instance in about 20 hours using a relatively small dataset of around 4,500 hotels. We also tested the Llama 2 7B and 70B models. We found that the 7B model didn’t perform well enough, and the 70B model had much higher costs without significant improvement.
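To give a sense of why QLoRA fits on a single GPU instance, the following back-of-the-envelope sketch estimates the 4-bit weight footprint and the number of trainable adapter parameters. The hidden size (5,120) and layer count (40) are the published Llama 2 13B dimensions. The adapter rank of 8 and the choice to adapt two square projection matrices per layer are assumptions for illustration, because the post does not state the configuration that was used.

```python
def quantized_weight_gb(num_params, bits):
    """Approximate weight storage in GB for a model quantized to `bits` bits."""
    return num_params * bits / 8 / 1e9


def lora_trainable_params(hidden_size, num_layers, rank, matrices_per_layer):
    """Trainable parameters when adapting square hidden x hidden matrices.

    Each adapted matrix gains two low-rank factors: (hidden x rank) and
    (rank x hidden), which is rank * (hidden + hidden) parameters.
    """
    per_matrix = rank * (hidden_size + hidden_size)
    return per_matrix * matrices_per_layer * num_layers


# Llama 2 13B dimensions; rank and adapted matrices are assumed values.
weights_gb = quantized_weight_gb(13e9, bits=4)
adapter_params = lora_trainable_params(
    hidden_size=5120, num_layers=40, rank=8, matrices_per_layer=2
)
print(f"4-bit weights: ~{weights_gb:.1f} GB, trainable adapter params: {adapter_params:,}")
```

Under these assumptions, the frozen weights need roughly 6.5 GB and the adapters add about 6.6 million trainable parameters, a small fraction of the 13 billion in the base model. Activations, optimizer state, and quantization overhead add to the real memory footprint.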

We used common natural language processing (NLP) evaluation metrics, such as perplexity for evaluation and monitoring during training, and established daily feedback loops with the content team to refine the test set. The output from the fine-tuned Meta Llama 2 13B model effectively mimicked the TUI tone of voice and adhered to most SEO guidelines, even those not specified in the prompt.
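For reference, perplexity is the exponential of the average negative log-likelihood per token, so lower values mean the model finds the held-out text less surprising. The following is a minimal sketch of that calculation from a list of per-token natural-log probabilities. It is a generic illustration, not the evaluation code used in this project.

```python
import math


def perplexity(token_log_probs):
    """Compute perplexity from natural-log probabilities of each token."""
    if not token_log_probs:
        raise ValueError("At least one token log-probability is required")
    mean_negative_log_likelihood = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_negative_log_likelihood)


# A model that assigns probability 0.25 to every token is as uncertain
# as a uniform choice among four options.
uniform_guess = perplexity([math.log(0.25)] * 4)
```

Tracking this value on a held-out set during training gives an early signal of overfitting, which complements the qualitative feedback from the content team.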

For more information, refer to Fine-tune Llama 2 for text generation on Amazon SageMaker JumpStart.

Original text:

The NH München City Süd hotel, formerly known as NH München am Ring, is located southeast of Munich. The location is ideal for reaching the highway towards Salzburg and provides good access to the airport and Exhibition Center. You can also get to the city center in 20 minutes by hopping on the S-Bahn at the Karl-Preis-Platz station. The peaceful residential location offers an excellent opportunity to relax in one of the 163 rooms. These are decorated in white with red accents. Rooms feature free Wi-Fi and electronic safes that are large enough for your laptop. If you need to work, there are spacious desks in every room and apartments with seperate living rooms are available for longer term guests. The suites also come with separate living rooms and foldaway beds are available upon request. The breakfast will get you ready for a busy day with a large spread of cold cuts, cheeses, fresh pastries, and fruit. These come accompanied with freshly brewed coffee, tea, and fresh fruit juices. For lunch and dinner, you will find many restaurant options in close proximity to the hotel. There is a sauna for relaxation and you can park your car in the garage.

Generated text:

Make yourself at home in Munich at the München City Süd. This former NH München am Ring sits just southeast of the Bavarian capital, putting both the highway to Salzburg and the airport within easy reach. The S-Bahn station Karl-Preis-Platz is a quick walk away too, so you can be in the city center in just 20 minutes.

The neighborhood here is peaceful, giving you the chance to properly relax during your stay. The hotel’s 163 rooms are decorated in calming white and red hues. Free Wi-Fi and laptop-sized safes come as standard, while spacious desks make working easy. For longer visits, apartments with separate living areas are available. The suites take things up a notch with their own living spaces and foldaway beds on request.

Fuel up each morning with a vast breakfast spread of cold cuts, cheeses, pastries and fruit. All washed down with coffee, tea and fresh juices. You’ll find many dining options close by for other meals. After a busy day, unwind in the sauna or park up in the garage.

The following image shows a list of the challenges and how the LLM handled them. Of the six challenges, the LLM met four.

Improvements

Integrating Anthropic’s Claude 2 for further improvement

To further improve the output, we introduced Anthropic’s Claude 2 using Amazon Bedrock as a final refinement step. This included converting American spelling to British spelling, writing numbers one through nine in words and larger numbers in digits, correcting typos and capitalization errors, minimizing banned words, incorporating essential TUI branding words, and adding missing hotel information. We also implemented a feedback mechanism in the UI to use data for ongoing fine-tuning in production. By using Anthropic Claude 2, we make sure that the final output applies the remaining formatting rules.
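Some of these formatting rules are simple enough to check automatically before or after the refinement step. The following sketch flags two of them: standalone digits from 1 through 9 that should be written as words, and banned words. It is a hypothetical validator rather than part of the solution described here, and the banned word list is invented for illustration.

```python
import re

# A single digit 1-9 that is not part of a larger number or a decimal.
STANDALONE_DIGIT = re.compile(r"(?<!\w)(?<!\d\.)[1-9](?!\w|\.\d)")


def find_rule_violations(text, banned_words):
    """Return standalone digits and banned words found in the text."""
    digits = STANDALONE_DIGIT.findall(text)
    banned = [
        word for word in banned_words
        if re.search(rf"\b{re.escape(word)}\b", text, flags=re.IGNORECASE)
    ]
    return {"digits_to_spell_out": digits, "banned_words": banned}


report = find_rule_violations(
    "Nestled by the park, it has 4 golf simulators and 10 tennis courts, "
    "rated 3.5 stars.",
    banned_words=["nestled", "boasts"],
)
```

A check like this could complement the LLM-based refinement by catching cases where a rule was missed, without replacing human review.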

Original text:

As its name suggests, Vienna Sporthotel has loads lined up for sports fans.

There are 10 tennis courts and four golf simulators here, plus, you can book a session with a coach to brush up on your skills. On top of that, there’s an outdoor pool, a well-stocked fitness room and a big sauna. To fuel up for all that exercise, you can grab a plate at the breakfast buffet.

Location-wise, this place is set up for sightseeing. Hop on a bus and you can reach the likes of The Hofburg, Belvedere Palace and Karlskirche in less than half an hour. You’re only a 10-minute stroll away from the nearest restaurants and cafes, too.

Generated text:

With a name like Vienna Sporthotel, you can expect this place to serve up plenty for fans of sport and activity. Brush up your skills on one of 10 tennis courts or in the four golf simulators, then recharge in the outdoor pool or well-equipped gym. There’s even a coach on hand to help perfect your technique.

When you’re ready to refuel, tuck into the breakfast buffet. Then get set for sightseeing – a bus ride whisks you to top Vienna attractions like The Hofburg, Belvedere Palace and Karlskirche in under 30 minutes. You’re also just a short stroll from local eateries and coffee shops.

The following image shows a list of the challenges and how the LLM handled them. The LLM met all six challenges.

Success in Target Outcomes

Key outcomes

The final architecture consists of a fine-tuned Meta Llama 2 13B model and Anthropic’s Claude 2, using the strengths of each model. In a blind test, these dynamically generated hotel descriptions were rated higher than those written by humans in 75% of a sample of 50 hotels. We also integrated a third-party API to calculate SEO scores for the generated content, and we observed up to 4% uplift in SEO scores for the generated content compared to human-written descriptions. Most significantly, the content generation process is now five times faster, enhancing our team’s productivity without compromising quality or consistency. We can generate a vast number of hotel descriptions in just a few hours, a task that previously took months.

Takeaways

Moving forward, we plan to explore how this technology can address current inefficiencies and quality gaps, especially for hotels that our team hasn’t had the capacity to curate. We plan to expand this solution to more brands and regions within the TUI portfolio, including producing content in various languages and tailoring it to meet the specific needs of different audiences.

Throughout this project, we learned a few valuable lessons:

  • Few-shot prompting is cost-effective and sufficient when you have limited examples and specific guidelines for responses. Fine-tuning can help significantly improve model performance when you need to tailor content to match a brand’s tone of voice, but can be resource intensive and is based on static data sources that can get outdated.
  • Fine-tuning the Llama 2 70B model was much more expensive than fine-tuning Llama 2 13B and did not result in significant improvement.
  • Incorporating human feedback and maintaining a human-in-the-loop approach is essential for protecting brand integrity and continuously improving the solution. The collaboration between TUI engineering, content, and SEO teams was crucial to the success of this project.

Although Meta Llama 2 and Anthropic’s Claude 2 were the latest state-of-the-art models available at the time of our experiment, since then we have seen the launch of Meta Llama 3 and Anthropic’s Claude 3.5, which we expect can significantly improve the quality of our outputs. Amazon Bedrock also now supports fine-tuning for Meta Llama 2, Cohere Command Light, and Amazon Titan models, making it simpler and faster to test models without managing infrastructure.


About the Authors

Nikolaos Zavitsanos is a Data Scientist at TUI, specialized in developing customer-facing Generative AI applications using AWS services. With a strong background in Computer Science and Artificial Intelligence, he leverages advanced technologies to enhance user experiences and drive innovation. Outside of work, Nikolaos plays water polo and competes at a national level. Connect with Nikolaos on LinkedIn.

Hin Yee Liu is a Senior Prototyping Engagement Manager at Amazon Web Services. She helps AWS customers to bring their big ideas to life and accelerate the adoption of emerging technologies. Hin Yee works closely with customer stakeholders to identify, shape and deliver impactful use cases leveraging Generative AI, AI/ML, Big Data, and Serverless technologies using agile methodologies. In her free time, she enjoys knitting, travelling and strength training. Connect with Hin Yee on LinkedIn.

Llama 3.3 70B now available in Amazon SageMaker JumpStart

Today, we are excited to announce that Llama 3.3 70B from Meta is available in Amazon SageMaker JumpStart. Llama 3.3 70B marks an exciting advancement in large language model (LLM) development, offering comparable performance to larger Llama versions with fewer computational resources.

In this post, we explore how to deploy this model efficiently on Amazon SageMaker AI, using advanced SageMaker AI features for optimal performance and cost management.

Overview of the Llama 3.3 70B model

Llama 3.3 70B represents a significant breakthrough in model efficiency and performance optimization. This new model delivers output quality comparable to Llama 3.1 405B while requiring only a fraction of the computational resources. According to Meta, this efficiency gain translates to nearly five times more cost-effective inference operations, making it an attractive option for production deployments.

The model’s sophisticated architecture builds upon Meta’s optimized version of the transformer design, featuring an enhanced attention mechanism that can help substantially reduce inference costs. During its development, Meta’s engineering team trained the model on an extensive dataset comprising approximately 15 trillion tokens, incorporating both web-sourced content and over 25 million synthetic examples specifically created for LLM development. This comprehensive training approach results in the model’s robust understanding and generation capabilities across diverse tasks.

What sets Llama 3.3 70B apart is its refined training methodology. The model underwent an extensive supervised fine-tuning process, complemented by Reinforcement Learning from Human Feedback (RLHF). This dual-approach training strategy helps align the model’s outputs more closely with human preferences while maintaining high performance standards. In benchmark evaluations against its larger counterpart, Llama 3.3 70B demonstrated remarkable consistency, trailing Llama 3.1 405B by less than 2% in 6 out of 10 standard AI benchmarks and actually outperforming it in three categories. This performance profile makes it an ideal candidate for organizations seeking to balance model capabilities with operational efficiency.

The following figure summarizes the benchmark results (source).

Getting started with SageMaker JumpStart

SageMaker JumpStart is a machine learning (ML) hub that can help accelerate your ML journey. With SageMaker JumpStart, you can evaluate, compare, and select pre-trained foundation models (FMs), including Llama 3 models. These models are fully customizable for your use case with your data, and you can deploy them into production using either the UI or SDK.

Deploying Llama 3.3 70B through SageMaker JumpStart offers two convenient approaches: using the intuitive SageMaker JumpStart UI or implementing programmatically through the SageMaker Python SDK. Let’s explore both methods to help you choose the approach that best suits your needs.

Deploy Llama 3.3 70B through the SageMaker JumpStart UI

You can access the SageMaker JumpStart UI through either Amazon SageMaker Unified Studio or Amazon SageMaker Studio. To deploy Llama 3.3 70B using the SageMaker JumpStart UI, complete the following steps:

  1. In SageMaker Unified Studio, on the Build menu, choose JumpStart models.

Alternatively, on the SageMaker Studio console, choose JumpStart in the navigation pane.

  2. Search for Meta Llama 3.3 70B.
  3. Choose the Meta Llama 3.3 70B model.
  4. Choose Deploy.
  5. Accept the end-user license agreement (EULA).
  6. For Instance type, choose an instance (ml.g5.48xlarge or ml.p4d.24xlarge).
  7. Choose Deploy.

Wait until the endpoint status shows as InService. You can now run inference using the model.
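If you prefer to check the status and send a first request from code, the following is a minimal sketch using boto3-style clients. It assumes a `sagemaker` client for `describe_endpoint` and a `sagemaker-runtime` client for `invoke_endpoint`; the endpoint name and the helper function names are hypothetical, and the payload shape mirrors the sample input used in the SDK example in this post. If the endpoint hosts the model as an inference component, `invoke_endpoint` also needs the `InferenceComponentName` argument, which this sketch omits.

```python
import json
import time


def wait_until_in_service(sm_client, endpoint_name, poll_seconds=30, max_polls=120):
    """Poll DescribeEndpoint until the endpoint is InService or has failed."""
    for _ in range(max_polls):
        status = sm_client.describe_endpoint(EndpointName=endpoint_name)["EndpointStatus"]
        if status == "InService":
            return status
        if status == "Failed":
            raise RuntimeError(f"Endpoint {endpoint_name} failed to deploy")
        time.sleep(poll_seconds)
    raise TimeoutError(f"Endpoint {endpoint_name} not InService after {max_polls} polls")


def invoke(runtime_client, endpoint_name, prompt):
    """Send a JSON text-generation request and return the parsed response."""
    payload = {"inputs": prompt, "parameters": {"max_new_tokens": 128}}
    response = runtime_client.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="application/json",
        Body=json.dumps(payload),
    )
    return json.loads(response["Body"].read())


def run_example():
    """Example wiring; needs boto3, AWS credentials, and a deployed endpoint."""
    import boto3

    endpoint = "my-llama-3-3-70b-endpoint"  # hypothetical endpoint name
    wait_until_in_service(boto3.client("sagemaker"), endpoint)
    return invoke(boto3.client("sagemaker-runtime"), endpoint, "Hello, I'm a language model,")
```

Passing the clients in as arguments keeps the helpers easy to exercise with stand-in objects before pointing them at a real endpoint.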

Deploy Llama 3.3 70B using the SageMaker Python SDK

For teams looking to automate deployment or integrate with existing MLOps pipelines, you can use the following code to deploy the model using the SageMaker Python SDK:

from sagemaker.serve.builder.model_builder import ModelBuilder
from sagemaker.serve.builder.schema_builder import SchemaBuilder
from sagemaker.jumpstart.model import ModelAccessConfig
from sagemaker.session import Session
import logging

sagemaker_session = Session()

artifacts_bucket_name = sagemaker_session.default_bucket()
execution_role_arn = sagemaker_session.get_caller_identity_arn()

js_model_id = "meta-textgeneration-llama-3-3-70b-instruct"

gpu_instance_type = "ml.p4d.24xlarge"

response = "Hello, I'm a language model, and I'm here to help you with your English."

sample_input = {
    "inputs": "Hello, I'm a language model,",
    "parameters": {"max_new_tokens": 128, "top_p": 0.9, "temperature": 0.6},
}

sample_output = [{"generated_text": response}]

schema_builder = SchemaBuilder(sample_input, sample_output)

model_builder = ModelBuilder(
    model=js_model_id,
    schema_builder=schema_builder,
    sagemaker_session=sagemaker_session,
    role_arn=execution_role_arn,
    log_level=logging.ERROR
)

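# Build the deployable model from the JumpStart model ID and the schema.
model = model_builder.build()

# Deploy to a real-time endpoint; the EULA must be accepted explicitly.
predictor = model.deploy(
    model_access_configs={js_model_id: ModelAccessConfig(accept_eula=True)},
    accept_eula=True,
)

# Send the sample request to the new endpoint.
predictor.predict(sample_input)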

Set up auto scaling and scale down to zero

You can optionally set up auto scaling to scale down to zero after deployment. For more information, refer to Unlock cost savings with the new scale down to zero feature in SageMaker Inference.
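As a rough illustration, the following sketch registers a scalable target with a minimum capacity of zero and attaches a target tracking policy through an Application Auto Scaling client. It assumes the model is hosted as an inference component on the endpoint; the component name, the policy name, the target value, and the `configure_scale_to_zero` helper are hypothetical. Scaling back out from zero needs additional configuration that this sketch doesn't cover, so treat the referenced post as the authoritative walkthrough.

```python
def configure_scale_to_zero(aas_client, inference_component_name,
                            max_copies=2, target_invocations_per_copy=5.0):
    """Allow an inference component to scale between 0 and max_copies copies."""
    resource_id = f"inference-component/{inference_component_name}"
    dimension = "sagemaker:inference-component:DesiredCopyCount"

    # MinCapacity=0 is what permits scaling in to zero copies.
    aas_client.register_scalable_target(
        ServiceNamespace="sagemaker",
        ResourceId=resource_id,
        ScalableDimension=dimension,
        MinCapacity=0,
        MaxCapacity=max_copies,
    )

    # Target tracking adjusts the copy count based on invocations per copy.
    aas_client.put_scaling_policy(
        PolicyName=f"{inference_component_name}-target-tracking",  # hypothetical name
        PolicyType="TargetTrackingScaling",
        ServiceNamespace="sagemaker",
        ResourceId=resource_id,
        ScalableDimension=dimension,
        TargetTrackingScalingPolicyConfiguration={
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "SageMakerInferenceComponentInvocationsPerCopy"
            },
            "TargetValue": target_invocations_per_copy,
        },
    )
    return resource_id
```

With real credentials, the client would typically come from `boto3.client("application-autoscaling")`.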

Optimize deployment with SageMaker AI

SageMaker AI simplifies the deployment of sophisticated models like Llama 3.3 70B, offering a range of features designed to optimize both performance and cost efficiency. With the advanced capabilities of SageMaker AI, organizations can deploy and manage LLMs in production environments, taking full advantage of Llama 3.3 70B’s efficiency while benefiting from the streamlined deployment process and optimization tools of SageMaker AI. Default deployment through SageMaker JumpStart uses accelerated deployment, which uses speculative decoding to improve throughput. For more information on how speculative decoding works with SageMaker AI, see Amazon SageMaker launches the updated inference optimization toolkit for generative AI.

First, Fast Model Loader revolutionizes the model initialization process by implementing an innovative weight streaming mechanism. This feature fundamentally changes how model weights are loaded onto accelerators, dramatically reducing the time required to get the model ready for inference. Instead of the traditional approach of loading the entire model into memory before beginning operations, Fast Model Loader streams weights directly from Amazon Simple Storage Service (Amazon S3) to the accelerator, enabling faster startup and scaling times.

A second SageMaker inference capability is Container Caching, which transforms how model containers are managed during scaling operations. This feature eliminates one of the major bottlenecks in deployment scaling by pre-caching container images, removing the need for time-consuming downloads when adding new instances. For large models like Llama 3.3 70B, where container images can be substantial in size, this optimization significantly reduces scaling latency and improves overall system responsiveness.

Another key capability is Scale to Zero. It introduces intelligent resource management that automatically adjusts compute capacity based on actual usage patterns. This feature represents a paradigm shift in cost optimization for model deployments, allowing endpoints to scale down completely during periods of inactivity while maintaining the ability to scale up quickly when demand returns. This capability is particularly valuable for organizations running multiple models or dealing with variable workload patterns.

Together, these features create a powerful deployment environment that maximizes the benefits of Llama 3.3 70B’s efficient architecture while providing robust tools for managing operational costs and performance.

Conclusion

The combination of Llama 3.3 70B with the advanced inference features of SageMaker AI provides an optimal solution for production deployments. By using Fast Model Loader, Container Caching, and Scale to Zero capabilities, organizations can achieve both high performance and cost-efficiency in their LLM deployments.

We encourage you to try this implementation and share your experiences.


About the authors

Marc Karp is an ML Architect with the Amazon SageMaker Service team. He focuses on helping customers design, deploy, and manage ML workloads at scale. In his spare time, he enjoys traveling and exploring new places.

Saurabh Trikande is a Senior Product Manager for Amazon Bedrock and SageMaker Inference. He is passionate about working with customers and partners, motivated by the goal of democratizing AI. He focuses on core challenges related to deploying complex AI applications, inference with multi-tenant models, cost optimizations, and making the deployment of Generative AI models more accessible. In his spare time, Saurabh enjoys hiking, learning about innovative technologies, following TechCrunch, and spending time with his family.

Melanie Li, PhD, is a Senior Generative AI Specialist Solutions Architect at AWS based in Sydney, Australia, where her focus is on working with customers to build solutions leveraging state-of-the-art AI and machine learning tools. She has been actively involved in multiple Generative AI initiatives across APJ, harnessing the power of Large Language Models (LLMs). Prior to joining AWS, Dr. Li held data science roles in the financial and retail industries.

Adriana Simmons is a Senior Product Marketing Manager at AWS.

Lokeshwaran Ravi is a Senior Deep Learning Compiler Engineer at AWS, specializing in ML optimization, model acceleration, and AI security. He focuses on enhancing efficiency, reducing costs, and building secure ecosystems to democratize AI technologies, making cutting-edge ML accessible and impactful across industries.

Yotam Moss is a Software Development Manager for Inference at AWS AI.

AWS re:Invent 2024 Highlights: Top takeaways from Swami Sivasubramanian to help customers manage generative AI at scale

We spoke with Dr. Swami Sivasubramanian, Vice President of Data and AI, shortly after AWS re:Invent 2024 to hear his impressions—and to get insights on how the latest AWS innovations help meet the real-world needs of customers as they build and scale transformative generative AI applications.

Q: What made this re:Invent different?

Swami Sivasubramanian: The theme I spoke about in my re:Invent keynote was simple but powerful—convergence. I believe that we’re at an inflection point unlike any other in the evolution of AI. We’re seeing a remarkable convergence of data, analytics, and generative AI. It’s a combination that enables next-level generative AI applications that are far more capable. And it lets our customers move faster in a really significant way, getting more value, more quickly. Companies like Rocket Mortgage are building on an AI-driven platform powered by Amazon Bedrock to create AI agents and automate tasks—working to give their employees access to generative AI with no-code tools. Canva uses AWS to power 1.2 million requests a day and sees 450 new designs created every second. There’s also a human side to convergence, as people across organizations are working together in new ways, requiring a deeper level of collaboration between groups, like science and engineering teams. And this isn’t just a one-time collaboration. It’s an ongoing process.

People’s expectations for applications and customer experiences are changing again with generative AI. Increasingly, I think generative AI inference is going to be a core building block for every application. To realize this future, organizations need more than just a chatbot or a single powerful large language model (LLM). At re:Invent, we made some exciting announcements about the future of generative AI, of course. But we also launched a remarkable portfolio of new products, capabilities, and features that will help our customers manage generative AI at scale—making it easier to control costs, build trust, increase productivity, and deliver ROI.

Q: Are there key innovations that build on the experience and lessons learned at Amazon in adopting generative AI? How are you bringing those capabilities to your customers?

Swami Sivasubramanian: Yes, our announcement of Amazon Nova, a new generation of foundation models (FMs), has state-of-the-art intelligence across a wide range of tasks and industry-leading price performance. Amazon Nova models expand the growing selection of the broadest and most capable FMs in Amazon Bedrock for enterprise customers. The specific capabilities of Amazon Nova Micro, Lite, and Pro demonstrate exceptional intelligence, capabilities, and speed—and perform quite competitively against the best models in their respective categories. Amazon Nova Canvas, our state-of-the-art image generation model, creates professional grade images from text and image inputs, democratizing access to production-grade visual content for advertising, training, social media, and more. Finally, Amazon Nova Reel offers state-of-the-art video generation that allows customers to create high-quality video from text or images. With about 1,000 generative AI applications in motion inside Amazon, groups like Amazon Ads are using Amazon Nova to remove barriers for sellers and advertisers, enabling new levels of creativity and innovation. New capabilities like image and video generation are helping Amazon Ads customers promote more products in their catalogs, and experiment with new strategies like keyword-level creative to increase engagement and drive sales.

But there’s more ahead, and here’s where an important shift is happening. We’re working on an even more capable any-to-any model where you can provide text, images, audio, and video as input and the model can generate outputs in any of these modalities. And we think this multi-modal approach is how models are going to evolve, moving ahead where one model can accept any kind of input and generate any kind of output. Over time, I think this is what state-of-the-art models will look like.

Q: Speaking of announcements like Amazon Nova, you’ve been a key innovator in AI for many years. What continues to inspire you?

Swami Sivasubramanian: It’s fascinating to think about what LLMs are capable of. What inspires me most though is how we can help our customers unblock the challenges they are facing and realize that potential. Consider hallucinations. As highly capable as today’s models are, they still have a tendency to get things wrong occasionally. It’s a challenge that many of our customers struggle with when integrating generative AI into their businesses and moving to production. We explored the problem and asked ourselves if we could do more to help. We looked inward, and leveraged Automated Reasoning, an innovation that Amazon has been using as a behind-the-scenes technology in many of our services like identity and access management.

I like to think of this situation as yin and yang. Automated Reasoning is all about certainty and being able to mathematically prove that something is correct. Generative AI is all about creativity and open-ended responses. Though they might seem like opposites, they’re actually complementary—with Automated Reasoning completing and strengthening generative AI. We’ve found that Automated Reasoning works really well when you have a huge surface area of a problem, a corpus of knowledge about that problem area, and when it’s critical that you get the correct answer—which makes Automated Reasoning a good fit for addressing hallucinations.

At re:Invent, we announced Amazon Bedrock Guardrails Automated Reasoning checks, the first and only generative AI safeguard that helps prevent factual errors due to hallucinations, all by using logically accurate and verifiable reasoning that explains why generative AI responses are correct. I think that it’s an innovation that will have significant impact across organizations and industries, helping build trust and accelerate generative AI adoption.

Q: Controlling costs is important to all organizations, large and small, particularly as they take generative AI applications into production. How do the announcements at re:Invent answer this need?

Swami Sivasubramanian: Like our customers, here at Amazon we’re increasing our investment in generative AI development, with multiple projects in process—all requiring timely access to accelerated compute resources. But allocating optimal compute capacity to each project can create a supply/demand challenge. To address this challenge, we created an internal service that helped Amazon drive utilization of compute resources to more than 90% across all our projects. This service enabled us to smooth out demand across projects and achieve higher capacity utilization, speeding development.

As with Automated Reasoning, we realized that our customers would also benefit from these capabilities. So, at re:Invent, I announced the new task governance capability in Amazon SageMaker HyperPod, which helps our customers optimize compute resource utilization and reduce time to market by up to 40%. With this capability, users can dynamically run tasks across the end-to-end FM workflow, accelerating time to market for AI innovations while avoiding cost overruns due to underutilized compute resources.

Our customers also tell me that the trade-off between cost and accuracy for models is real. We’re answering this need by making it super-easy to evaluate models on Amazon Bedrock, so they don’t have to spend months researching and making comparisons. We’re also lowering costs with game-changing capabilities such as Amazon Bedrock Model Distillation, which pairs models for lower costs; Amazon Bedrock Intelligent Prompt Routing, which manages prompts more efficiently, at scale; and prompt caching, which reduces repeated processing without compromising on accuracy.

Q: Higher productivity is one of the core promises of generative AI. How is AWS helping employees at all levels be more productive?

Swami Sivasubramanian: I like to point out that using generative AI becomes irresistible when it makes employees 10 times more productive. In short, not an incremental increase, but a major leap in productivity. And we’re helping employees get there. For example, Amazon Q Developer is transforming code development by taking care of the time-consuming chores that developers don’t want to deal with, like software upgrades. And it also helps them move much faster by automating code reviews and dealing with mainframe modernization. Consider Novacomp, a leading IT company in Latin America, which leveraged Amazon Q Developer to upgrade a project with over 10,000 lines of Java code in just 50 minutes, a task that would have typically taken an estimated 3 weeks. The company also simplified everyday tasks for developers, reducing its technical debt by 60% on average.

On the business side, Amazon Q Business is bridging the gap between unstructured and structured data, recognizing that most businesses need to draw from a mix of data. With Amazon Q in QuickSight, non-technical users can leverage natural language to build, discover, and share meaningful insights in seconds. Now they can access databases and data warehouses, as well as unstructured business data, like emails, reports, charts, graphs, and images.

And looking ahead, we announced advanced agentic capabilities for Amazon Q Business, coming in 2025, which will use agents to automate complex tasks that stretch across multiple teams and applications. Agents give generative AI applications next-level capabilities, and we’re bringing them to our customers via Amazon Q Business, as well as Amazon Bedrock multi-agent collaboration, which improves successful task completion by 40% over popular solutions. This major improvement translates to more accurate and human-like outcomes in use cases like automating customer support, analyzing financial data for risk management, or optimizing supply-chain logistics.

It’s all part of how we’re enabling greater productivity today, with even more on the horizon.

Q: To get employees and customers adopting generative AI and benefiting from that increased productivity, it has to be trusted. What steps is AWS taking to help build that trust?

Swami Sivasubramanian: I think that lack of trust is a big obstacle to moving from proof of concept to production. Business leaders are about to hit go and they hesitate because they don’t want to lose the trust of their customers. As generative AI continues to drive innovation across industries and our daily life, the need for responsible AI has become increasingly acute. And we’re helping meet that need with innovations like Amazon Bedrock Automated Reasoning, which I mentioned earlier, that works to prevent hallucinations—and increases trust. We also announced new LLM-as-a-judge capabilities with Amazon Bedrock Model Evaluation so you can now perform tests and evaluate other models with humanlike quality at a fraction of the cost and time of running human evaluations. These evaluations assess multiple quality dimensions, including correctness, helpfulness, and responsible AI criteria such as answer refusal and harmfulness.

I should also mention that AWS recently became the first major cloud provider to announce ISO/IEC 42001 accredited certification for AI services, covering Amazon Bedrock, Amazon Q Business, Amazon Textract, and Amazon Transcribe. This international management system standard outlines requirements and controls for organizations to promote the responsible development and use of AI systems. Technical standards like ISO/IEC 42001 are significant because they provide a much-needed common framework for responsible AI development and deployment.

Q: Data remains central to building more personalized experiences applicable to your business. How do the re:Invent launches help AWS customers get their data ready for generative AI?

Swami Sivasubramanian: Generative AI isn’t going to be useful for organizations unless it can seamlessly access and deeply understand the organization’s data. With these insights, our customers can create customized experiences, such as highly personalized customer service agents that can help service representatives resolve issues faster. For AWS customers, getting data ready for generative AI isn’t just a technical challenge—it’s a strategic imperative. Proprietary, high-quality data is the key differentiator in transforming generic AI into powerful, business-specific applications. To prepare for this AI-driven future, we’re helping our customers build a robust, cloud-based data foundation, with built-in security and privacy. That’s the backbone of AI readiness.

With the next generation of Amazon SageMaker announced at re:Invent, we’re introducing an integrated experience to access, govern, and act on all your data by bringing together widely adopted AWS data, analytics, and AI capabilities. Collaborate and build faster from a unified studio using familiar AWS tools for model development, generative AI, data processing, and SQL analytics—with Amazon Q Developer assisting you along the way. Access all your data whether it’s stored in data lakes, data warehouses, third-party or federated data sources. And move with confidence and trust, thanks to built-in governance to address enterprise security needs.

At re:Invent, we also launched key Amazon Bedrock capabilities that help our customers maximize the value of their data. Amazon Bedrock Knowledge Bases now offers the only managed, out-of-the-box Retrieval Augmented Generation (RAG) solution, which enables our customers to natively query their structured data where it resides, accelerating development. Support for GraphRAG generates more relevant responses by modeling and storing relationships between data. And Amazon Bedrock Data Automation transforms unstructured, multimodal data into structured data for generative AI—automatically extracting, transforming, and generating usable data from multimodal content, at scale. These capabilities and more help our customers leverage their data to create powerful, insightful generative AI applications.

Q: What did you take away from your customer conversations at re:Invent?

Swami Sivasubramanian: I continue to be amazed and inspired by our customers and the important work they’re doing. We continue to offer our customers the choice and specialization they need to power their unique use cases. With Amazon Bedrock Marketplace, customers now have access to more than 100 popular, emerging, and specialized models.

At re:Invent, I heard a lot about the new efficiency and transformative experiences customers are creating. I also heard about innovations that are changing people’s lives. Like Exact Sciences, a molecular diagnostic company, which developed an AI-powered solution using Amazon Bedrock to accelerate genetic testing and analysis by 50%. Behind that metric there’s a real human value—enabling earlier cancer detection and personalized treatment planning. And that’s just one story among thousands, as our customers reach higher and build faster, achieving impressive results that change industries and improve lives.

I get excited when I think about how we can help educate the next wave of innovators building these experiences. With the launch of the new Education Equity Initiative, Amazon is committing up to $100 million in cloud technology and technical resources to help existing, dedicated learning organizations reach more learners by creating new and innovative digital learning solutions. That’s truly inspiring to me.

In fact, the pace of change, the remarkable innovations we introduced at re:Invent, and the enthusiasm of our customers all reminded me of the early days of AWS, when anything seemed possible. And now, it still is.


About the author

Swami Sivasubramanian is VP, AWS AI & Data. In this role, Swami oversees all AWS Database, Analytics, and AI & Machine Learning services. His team’s mission is to help organizations put their data to work with a complete, end-to-end data solution to store, access, analyze, visualize, and predict.

Multi-tenant RAG with Amazon Bedrock Knowledge Bases

Organizations are continuously seeking ways to use their proprietary knowledge and domain expertise to gain a competitive edge. With the advent of foundation models (FMs) and their remarkable natural language processing capabilities, a new opportunity has emerged to unlock the value of their data assets.

As organizations strive to deliver personalized experiences to customers using generative AI, it becomes paramount to specialize the behavior of FMs using their own—and their customers’—data. Retrieval Augmented Generation (RAG) has emerged as a simple yet effective approach to achieve a desired level of specialization.

Amazon Bedrock Knowledge Bases is a fully managed capability that simplifies the management of the entire RAG workflow, empowering organizations to give FMs and agents contextual information from a company’s private data sources to deliver more relevant and accurate responses tailored to their specific needs.

For organizations developing multi-tenant products, such as independent software vendors (ISVs) creating software as a service (SaaS) offerings, the ability to personalize experiences for each of their customers (tenants in their SaaS application) is particularly significant. This personalization can be achieved by implementing a RAG approach that selectively uses tenant-specific data.

In this post, we discuss and provide examples of how to achieve personalization using Amazon Bedrock Knowledge Bases. We focus particularly on addressing the multi-tenancy challenges that ISVs face, including data isolation, security, tenant management, and cost management. We focus on scenarios where the RAG architecture is integrated into the ISV application and not directly exposed to tenants. Although the specific implementations presented in this post use Amazon OpenSearch Service as a vector database to store tenants’ data, the challenges and architecture solutions proposed can be extended and tailored to other vector store implementations.

Multi-tenancy design considerations

When architecting a multi-tenanted RAG system, organizations need to take several considerations into account:

  • Tenant isolation – One crucial consideration in designing multi-tenanted systems is the level of isolation between the data and resources related to each tenant. These resources include data sources, ingestion pipelines, vector databases, and RAG client applications. The level of isolation is typically governed by the security, performance, and scalability requirements of your solution, together with your regulatory requirements. For example, you may need to encrypt the data related to each of your tenants using a different encryption key. You may also need to make sure that high activity generated by one of the tenants doesn’t affect other tenants.
  • Tenant variability – A similar yet distinct consideration is the level of variability of the features provided to each tenant. In the context of RAG systems, tenants might have varying requirements for data ingestion frequency, document chunking strategy, or vector search configuration.
  • Tenant management simplicity – Multi-tenant solutions need a mechanism for onboarding and offboarding tenants. This dimension determines the degree of complexity for this process, which might involve provisioning or tearing down tenant-specific infrastructure, such as data sources, ingestion pipelines, vector databases, and RAG client applications. This process could also involve adding or deleting tenant-specific data in its data sources.
  • Cost-efficiency – The operating costs of a multi-tenant solution depend on the way it provides the isolation mechanism for tenants, so designing a cost-efficient architecture for the solution is crucial.

These four considerations need to be carefully balanced and weighted to suit the needs of the specific solution. In this post, we present a model to simplify the decision-making process. Using the core isolation concepts of silo, pool, and bridge defined in the SaaS Tenant Isolation Strategies whitepaper, we propose three patterns for implementing a multi-tenant RAG solution using Amazon Bedrock Knowledge Bases, Amazon Simple Storage Service (Amazon S3), and OpenSearch Service.

A typical RAG solution using Amazon Bedrock Knowledge Bases is composed of several components, as shown in the following figure:

A Typical RAG Solution Architecture

The main challenge in adapting this architecture for multi-tenancy is determining how to provide isolation between tenants for each of the components. We propose three prescriptive patterns that cater to different use cases and offer varying levels of isolation, variability, management simplicity, and cost-efficiency. The following figure illustrates the trade-offs between these three architectural patterns in terms of achieving tenant isolation, variability, cost-efficiency, and ease of tenant management.

Trade offs of the three RAG architectural patterns

Multi-tenancy patterns

In this section, we describe the implementation of these three different multi-tenancy patterns in a RAG architecture based on Amazon Bedrock Knowledge Bases, discussing their use cases as well as their pros and cons.

Silo

The silo pattern, illustrated in the following figure, offers the highest level of tenant isolation, because the entire stack is deployed and managed independently for each single tenant.

Solution architecture for the Silo pattern

In the context of the RAG architecture implemented by Amazon Bedrock Knowledge Bases, this pattern prescribes the following:

  • A separate data source per tenant – In this post, we consider the scenario in which tenant documents to be vectorized are stored in Amazon S3, therefore a separate S3 bucket is provisioned per tenant. This allows for per-tenant AWS Key Management Service (AWS KMS) encryption keys, as well as per-tenant S3 lifecycle policies to manage object expiration, and object versioning policies to maintain multiple versions of objects. Having separate buckets per tenant provides isolation and allows for customized configurations based on tenant requirements.
  • A separate knowledge base per tenant – This allows for a separate chunking strategy per tenant, which is particularly useful if you expect your tenants’ document bases to differ in nature. For example, one of your tenants might have a document base composed of flat text documents, which can be treated with fixed-size chunking, whereas another tenant might have a document base with explicit sections, for which semantic chunking would be better suited. Having a different knowledge base per tenant also lets you choose different embedding models, and therefore different vector dimensions, balancing accuracy, cost, and latency. You can choose a different KMS key per tenant for the transient data stores, which Amazon Bedrock uses for end-to-end per-tenant encryption. You can also choose per-tenant data deletion policies to control whether your vectors are deleted from the vector database when a knowledge base is deleted. Separate knowledge bases also mean that you can have different ingestion schedules per tenant, allowing you to agree on different data freshness standards with your customers.
  • A separate OpenSearch Serverless collection per tenant – Having a separate OpenSearch Serverless collection per tenant allows you to have a separate KMS encryption key per tenant, maintaining per-tenant end-to-end encryption. For each tenant-specific collection, you can create a separate vector index, therefore choosing for each tenant the distance metric between Euclidean and dot product, so that you can choose how much importance to give to the document length. You can also choose the specific settings for the HNSW algorithm per tenant to control memory consumption, cost, and indexing time. Each vector index, in conjunction with the setup of metadata mappings in your knowledge base, can have a different metadata set per tenant, which can be used to perform filtered searches. Metadata filtering can be used in the silo pattern to restrict the search to a subset of documents with a specific characteristic. For example, one of your tenants might be uploading dated documents and wants to filter documents pertaining to a specific year, whereas another tenant might be uploading documents coming from different company divisions and wants to filter over the documentation of a specific company division.

Because the silo pattern offers tenant architectural independence, onboarding and offboarding a tenant means creating and destroying the RAG stack for that tenant, composed of the S3 bucket, knowledge base, and OpenSearch Serverless collection. You would typically do this using infrastructure as code (IaC). Depending on your application architecture, you may also need to update the log sinks and monitoring systems for each tenant.

Although the silo pattern offers the highest level of tenant isolation, it is also the most expensive to implement, mainly due to creating a separate OpenSearch Serverless collection per tenant for the following reasons:

  • Minimum capacity charges – Each OpenSearch Serverless collection encrypted with a separate KMS key has a minimum of 2 OpenSearch Compute Units (OCUs) charged hourly. These OCUs are charged independently from usage, meaning that you will incur charges for dormant tenants if you choose to have a separate KMS encryption key per tenant.
  • Scalability overhead – Each collection separately scales OCUs depending on usage, in steps of 6 GB of memory, and associated vCPUs and fast access storage. This means that resources might not be fully and optimally utilized across tenants.

When choosing the silo pattern, note that a maximum of 100 knowledge bases are supported in each AWS account. This makes the silo pattern favorable for your largest tenants with specific isolation requirements. Having a separate knowledge base per tenant also reduces the impact of quotas on concurrent ingestion jobs (maximum one concurrent job per KB, five per account), job size (100 GB per job), and data sources (maximum of 5 million documents per data source). It also improves the performance fairness as perceived by your tenants.
Deleting a knowledge base while offboarding a tenant might be time-consuming, depending on the size of the data sources and the synchronization process. To mitigate this, you can set the data deletion policy in your tenants’ knowledge bases to RETAIN. This way, the knowledge base deletion process will not delete your tenants’ data from the OpenSearch Service index. You can delete the index by deleting the OpenSearch Serverless collection.
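A per-tenant data source in the silo pattern could be described along the following lines. This is a minimal sketch, not a definitive implementation: the helper name, the bucket naming scheme, the `<TENANT_KMS_KEY_ARN>` placeholder, and the chunking values are assumptions made for illustration, and the request keys follow the CreateDataSource operation of the `bedrock-agent` API as we understand it, so verify them against the current API reference before use.

```python
def build_tenant_data_source_request(tenant_id, knowledge_base_id, kms_key_arn,
                                     max_tokens=300, overlap_percentage=20):
    """Build keyword arguments for a per-tenant S3 data source (silo pattern)."""
    # Assumed naming scheme: one bucket per tenant; S3 bucket names cannot contain underscores
    bucket_name = "rag-docs-" + tenant_id.replace("_", "-")
    return {
        "knowledgeBaseId": knowledge_base_id,
        "name": f"{tenant_id}-documents",
        "dataSourceConfiguration": {
            "type": "S3",
            "s3Configuration": {"bucketArn": f"arn:aws:s3:::{bucket_name}"},
        },
        # Per-tenant KMS key for the transient data created during ingestion
        "serverSideEncryptionConfiguration": {"kmsKeyArn": kms_key_arn},
        # Per-tenant chunking strategy (fixed-size chosen here for illustration)
        "vectorIngestionConfiguration": {
            "chunkingConfiguration": {
                "chunkingStrategy": "FIXED_SIZE",
                "fixedSizeChunkingConfiguration": {
                    "maxTokens": max_tokens,
                    "overlapPercentage": overlap_percentage,
                },
            }
        },
        # RETAIN keeps the vectors in the index when the knowledge base is deleted
        "dataDeletionPolicy": "RETAIN",
    }


request = build_tenant_data_source_request(
    tenant_id="tenant_1",
    knowledge_base_id="<YOUR_KNOWLEDGEBASE_ID>",
    kms_key_arn="<TENANT_KMS_KEY_ARN>",
)
print(request["dataSourceConfiguration"]["s3Configuration"]["bucketArn"])
```

The resulting dictionary could then be passed to `boto3.client("bedrock-agent").create_data_source(**request)` as part of your IaC or onboarding automation.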

Pool

In contrast with the silo pattern, in the pool pattern, illustrated in the following figure, the whole end-to-end RAG architecture is shared by your tenants, making it particularly suitable to accommodate many small tenants.

Solution architecture for the pool pattern

The pool pattern prescribes the following:

  • Single data source – The tenants’ data is stored within the same S3 bucket. This implies that the pool model supports a shared KMS key for encryption at rest, not offering the possibility of per-tenant encryption keys. To identify tenant ownership downstream for each document uploaded to Amazon S3, a corresponding JSON metadata file has to be generated and uploaded. The metadata file generation process can be asynchronous, or even batched for multiple files, because Amazon Bedrock Knowledge Bases requires an explicit triggering of the ingestion job. The metadata file must use the same name as its associated source document file, with .metadata.json appended to the end of the file name, and must be stored in the same folder or location as the source file in the S3 bucket. The following code is an example of the format:
{
  "metadataAttributes" : {
    "tenantId" : "tenant_1",
  ...
  }
}

In the preceding JSON structure, the key tenantId has been deliberately chosen, and can be changed to a key you want to use to express tenancy. The tenancy field will be used at runtime to filter documents belonging to a specific tenant, therefore the filtering key at runtime must match the metadata key in the JSON used to index the documents. Additionally, you can include other metadata keys to perform further filtering that isn’t based on tenancy. If you don’t upload the object.metadata.json file, the client application won’t be able to find the document using metadata filtering.
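The metadata file convention described above can be sketched as follows. This is an illustrative example under stated assumptions: the helper names and the example object key are hypothetical, and `upload_with_metadata` expects a Boto3 S3 client that you create yourself.

```python
import json


def metadata_key(document_key):
    """Return the S3 key of the metadata file that accompanies a document."""
    return f"{document_key}.metadata.json"


def metadata_body(tenant_id, extra_attributes=None):
    """Serialize the metadata attributes, always enforcing the tenant identifier."""
    attributes = dict(extra_attributes or {})
    # Set tenantId last so that extra attributes can never override tenancy
    attributes["tenantId"] = tenant_id
    return json.dumps({"metadataAttributes": attributes})


def upload_with_metadata(s3_client, bucket, document_key, document_bytes, tenant_id):
    """Upload a document and its metadata file to the same location in the bucket."""
    s3_client.put_object(Bucket=bucket, Key=document_key, Body=document_bytes)
    s3_client.put_object(
        Bucket=bucket,
        Key=metadata_key(document_key),
        Body=metadata_body(tenant_id).encode("utf-8"),
    )


print(metadata_key("contracts/report.pdf"))
print(metadata_body("tenant_1", {"year": 2023}))
```

Writing the tenant identifier last is a deliberate choice in this sketch: it keeps callers from accidentally tagging a document with another tenant's identifier through the additional attributes.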

  • Single knowledge base – A single knowledge base is created to handle the data ingestion for your tenants. This means that your tenants will share the same chunking strategy and embedding model, and share the same encryption at-rest KMS key. Moreover, because ingestion jobs are triggered per data source per KB, you will be restricted to offer to your tenants the same data freshness standards.
  • Single OpenSearch Serverless collection and index – Your tenant data is pooled in a single OpenSearch Service vector index, therefore your tenants share the same KMS encryption key for vector data, and the same HNSW parameters for indexing and querying. Because tenant data isn’t physically segregated, it’s crucial that the query client be able to filter results for a single tenant. This can be efficiently achieved using either the Amazon Bedrock Knowledge Bases Retrieve or RetrieveAndGenerate API, expressing the tenant filtering condition as part of the retrievalConfiguration (for more details, see Amazon Bedrock Knowledge Bases now supports metadata filtering to improve retrieval accuracy). If you want to restrict the vector search to return results for tenant_1, the following is an example client implementation performing RetrieveAndGenerate based on the AWS SDK for Python (Boto3):

import boto3

bedrock_agent_runtime = boto3.client(
    service_name = "bedrock-agent-runtime"
)

tenant_filter = {
    "equals": {
        "key": "tenantId",
        "value": "tenant_1"
    }
}

retrievalConfiguration = {
    "vectorSearchConfiguration": {
        "filter": tenant_filter
    }
}

bedrock_agent_runtime.retrieve_and_generate(
    input = {
        'text': 'The original user query'
    },
    retrieveAndGenerateConfiguration = {
        'type': 'KNOWLEDGE_BASE',
        'knowledgeBaseConfiguration': {
            'knowledgeBaseId': '<YOUR_KNOWLEDGEBASE_ID>',
            'modelArn': '<FM_ARN>',
            'retrievalConfiguration': retrievalConfiguration
        }
    }
)

text contains the original user query that needs to be answered taking into account the document base. <YOUR_KNOWLEDGEBASE_ID> needs to be substituted with the identifier of the knowledge base used to pool your tenants, and <FM_ARN> needs to be substituted with the Amazon Bedrock model Amazon Resource Name (ARN) you want to use to reply to the user query. The client presented in the preceding code has been streamlined to present the tenant filtering functionality. In a production case, we recommend implementing session and error handling, logging and retry logic, and separating the tenant filtering logic from the client invocation to make it inaccessible to developers.
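One way to keep the tenant filter separate from the client invocation is to build the retrievalConfiguration in a single helper that always injects the tenant condition. The following is a minimal sketch under stated assumptions: the function name is hypothetical, and combining conditions with `andAll` follows the metadata filter syntax of Amazon Bedrock Knowledge Bases as we understand it.

```python
def tenant_retrieval_configuration(tenant_id, extra_filters=None, tenant_key="tenantId"):
    """Build a retrievalConfiguration that always restricts results to one tenant."""
    if not tenant_id:
        raise ValueError("tenant_id is required for every query")
    tenant_filter = {"equals": {"key": tenant_key, "value": tenant_id}}
    filters = [tenant_filter] + list(extra_filters or [])
    # andAll needs at least two conditions, so use the plain filter when it stands alone
    combined = tenant_filter if len(filters) == 1 else {"andAll": filters}
    return {"vectorSearchConfiguration": {"filter": combined}}


# Example: restrict to tenant_1 and, additionally, to documents tagged with a year
config = tenant_retrieval_configuration(
    "tenant_1",
    extra_filters=[{"equals": {"key": "year", "value": 2023}}],
)
print(config)
```

The returned dictionary can be passed as the retrievalConfiguration value in the RetrieveAndGenerate call, so application code never constructs the tenant condition itself.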

Because the end-to-end architecture is pooled in this pattern, onboarding and offboarding a tenant doesn’t require you to create new physical or logical constructs, and it’s as simple as starting or stopping the upload of tenant-specific documents to Amazon S3. This also means that there is no AWS managed API that can be used to offboard and end-to-end forget a specific tenant. To delete the historical documents belonging to a specific tenant, you can just delete the relevant objects in Amazon S3. Typically, customers will have an external application that maintains the list of available tenants and their status, facilitating the onboarding and offboarding process.

Sharing the monitoring system and logging capabilities in this pattern reduces the complexity of operations with a large number of tenants. However, it requires you to collect the tenant-specific metrics from the client side to perform specific tenant attribution.

The pool pattern optimizes the end-to-end cost of your RAG architecture, because sharing OCUs across tenants maximizes the use of each OCU and minimizes the tenants’ idle time. Sharing the same pool of OCUs across tenants means that this pattern doesn’t offer performance isolation at the vector store level, so the largest and most active tenants might impact the experience of other tenants.

When choosing the pool pattern for your RAG architecture, you should be aware that a single ingestion job can ingest or delete a maximum of 100 GB. Additionally, the data source can have a maximum of 5 million documents. If the solution has many tenants that are geographically distributed, consider triggering the ingestion job multiple times a day so you don’t hit the ingestion job size limit. Also, depending on the number and size of your documents to be synchronized, the time for ingestion will be determined by the embedding model invocation rate. For example, consider the following scenario:

  • Number of tenants to be synchronized = 10
  • Average number of documents per tenant = 100
  • Average size per document = 2 MB, containing roughly 200,000 tokens divided in 220 chunks of 1,000 tokens to allow for overlap
  • Using Amazon Titan Embeddings v2 on demand, allowing for 2,000 RPM and 300,000 TPM

This would result in the following:

  • Total embeddings requests = 10*100*220 = 220,000
  • Total tokens to process = 10*100*220*1,000 = 220,000,000
  • Time bound from the request quota = 220,000/2,000 = 110 minutes (1 hour, 50 minutes)
  • Time bound from the token quota = 220,000,000/300,000 ≈ 733 minutes (about 12 hours, 13 minutes)

Under these assumptions, the token quota (TPM) rather than the request quota (RPM) dominates, so embedding the whole batch takes roughly 12 hours and you could complete at most about two such synchronizations per day. This calculation is a best-case scenario and doesn’t account for the latency introduced by the FM when creating the vector from the chunk. If you expect having to synchronize a large number of tenants at the same time, consider using provisioned throughput to decrease the time it takes to create vector embeddings. This approach will also help distribute the load on the embedding models, limiting throttling of the Amazon Bedrock runtime API calls.
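The arithmetic for this scenario can be reproduced with a small helper that computes the time bound imposed by each quota. This is a sketch, assuming one embedding request per chunk and that every chunk carries the full chunk size in tokens; the function and parameter names are illustrative.

```python
def embedding_time_bounds(tenants, docs_per_tenant, chunks_per_doc,
                          tokens_per_chunk, rpm_quota, tpm_quota):
    """Estimate best-case embedding time from request and token quotas."""
    requests = tenants * docs_per_tenant * chunks_per_doc
    tokens = requests * tokens_per_chunk
    minutes_by_requests = requests / rpm_quota
    minutes_by_tokens = tokens / tpm_quota
    return {
        "requests": requests,
        "tokens": tokens,
        "minutes_by_requests": minutes_by_requests,
        "minutes_by_tokens": minutes_by_tokens,
        # The slower of the two limits determines the best-case duration
        "best_case_minutes": max(minutes_by_requests, minutes_by_tokens),
    }


# Scenario from the text: 10 tenants, 100 documents each, 220 chunks of 1,000 tokens,
# with quotas of 2,000 requests per minute and 300,000 tokens per minute
estimate = embedding_time_bounds(10, 100, 220, 1_000, 2_000, 300_000)
for name, value in estimate.items():
    print(f"{name}: {value:,.0f}")
```

Whichever bound is larger determines the best-case embedding time; rerunning the helper with your own document sizes and quotas shows which limit constrains your ingestion schedule.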

Bridge

The bridge pattern, illustrated in the following figure, strikes a balance between the silo and pool patterns, offering a middle ground for tenant data isolation and security.

Solution architecture for the bridge pattern

The bridge pattern delivers the following characteristics:

  • Separate data source per tenant in a common S3 bucket – Tenant data is stored in the same S3 bucket, but prefixed by a tenant identifier. Although having a different prefix per tenant doesn’t offer the possibility of using per-tenant encryption keys, it does create a logical separation that can be used to segregate data downstream in the knowledge bases.
  • Separate knowledge base per tenant – This pattern prescribes creating a separate knowledge base per tenant similar to the silo pattern. Therefore, the considerations in the silo pattern apply. Applications built using the bridge pattern usually share query clients across tenants, so they need to identify the specific tenant’s knowledge base to query. They can identify the knowledge base by storing the tenant-to-knowledge base mapping in an external database, which manages tenant-specific configurations. The following example shows how to store this tenant-specific information in an Amazon DynamoDB table:
    import boto3
    # Create a DynamoDB resource
    dynamodb = boto3.resource('dynamodb')
    
    table_name = 'tenantKbConfig'
    attribute_definitions = [
        {'AttributeName': 'tenantId', 'AttributeType': 'S'}
    ]
    
    key_schema = [
        {'AttributeName': 'tenantId', 'KeyType': 'HASH'}
    ]
    
    # Create the table holding KB tenant configurations
    tenant_kb_config_table = dynamodb.create_table(
        TableName=table_name,
        AttributeDefinitions=attribute_definitions,
        KeySchema=key_schema,
        BillingMode='PAY_PER_REQUEST' # Use on-demand billing mode for illustration
    )
    
    # Table creation is asynchronous; wait until the table exists before writing to it
    tenant_kb_config_table.wait_until_exists()
    
    # Create a tenant
    tenant_kb_config_table.put_item(
        Item={
            'tenantId': 'tenant_1',
            'knowledgebaseId': '<YOUR_KNOWLEDGEBASE_ID>',
            'modelArn': '<FM_ARN>'
        }
    )

    In a production setting, your application will store tenant-specific parameters belonging to other functionality in your data stores. Depending on your application architecture, you might choose to store knowledgebaseId and modelArn alongside the other tenant-specific parameters, or create a separate data store (for example, the tenantKbConfig table) specifically for your RAG architecture.

    The client application can then use this mapping when invoking the RetrieveAndGenerate API. The following is an example implementation:

    import boto3
    
    # Create a DynamoDB resource
    dynamodb = boto3.resource('dynamodb')
    
    # Create an Amazon Bedrock Agent Runtime client
    bedrock_runtime = boto3.client('bedrock-agent-runtime')
    
    # Define the table name
    table_name = 'tenantKbConfig'
    
    # Define function returning tenant config
    def get_tenant_config(tenant_id):
        table = dynamodb.Table(table_name)
        response = table.get_item(
            Key = {
                'tenantId': tenant_id
            }
        )
        # Return the tenant's knowledge base settings, or None for an unknown tenant
        if 'Item' in response:
            return { 'knowledgebaseId':response['Item'].get('knowledgebaseId'), 'modelArn': response['Item'].get('modelArn')}
        else:
            return None
    
    # Retrieve the tenant configurations from DynamoDB
    
    tenant_config = get_tenant_config('tenant_1')
    
    #Invoke the Retrieve and Generate API
    bedrock_runtime.retrieve_and_generate(
        input = {
            'text': 'What type of info do your documents contain?'
        },
        retrieveAndGenerateConfiguration = {
            'type': 'KNOWLEDGE_BASE',
            'knowledgeBaseConfiguration': {
                'knowledgeBaseId': tenant_config['knowledgebaseId'],
                'modelArn': tenant_config['modelArn']
            }
        }
    )

  • Separate OpenSearch Service index per tenant – You store data within the same OpenSearch Serverless collection, but you create a vector index per tenant. This implies your tenants share the same KMS encryption key and the same pool of OCUs, optimizing the OpenSearch Service resources usage for indexing and querying. The separation in vector indexes gives you the flexibility of choosing different HNSW parameters per tenant, letting you tailor the performance of your k-NN indexing and querying for your different tenants.

The bridge pattern supports up to 100 tenants, and onboarding and offboarding a tenant requires the creation and deletion of a knowledge base and OpenSearch Service vector index. To delete the data pertaining to a particular tenant, you can delete the created resources and use the tenant-specific prefix as a logical parameter in your Amazon S3 API calls. Unlike the silo pattern, the bridge pattern doesn’t allow for per-tenant end-to-end encryption. Otherwise, it offers the same level of tenant customization as the silo pattern while optimizing costs.
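Using the tenant prefix as the logical parameter for offboarding could look like the following sketch. The helper names are hypothetical and the prefix layout (`<tenantId>/`) is an assumption; `delete_tenant_documents` expects a Boto3 S3 client that you create yourself.

```python
def tenant_prefix(tenant_id):
    """Return the S3 prefix under which a tenant's documents are stored."""
    # The trailing slash prevents tenant_1 from also matching tenant_10
    return f"{tenant_id}/"


def batched(items, size=1000):
    """Yield successive slices of at most `size` items (DeleteObjects accepts up to 1,000 keys)."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def delete_tenant_documents(s3_client, bucket, tenant_id):
    """Delete every object stored under the tenant's prefix and return the number of keys sent."""
    paginator = s3_client.get_paginator("list_objects_v2")
    deleted = 0
    for page in paginator.paginate(Bucket=bucket, Prefix=tenant_prefix(tenant_id)):
        keys = [{"Key": obj["Key"]} for obj in page.get("Contents", [])]
        for batch in batched(keys):
            s3_client.delete_objects(Bucket=bucket, Delete={"Objects": batch})
            deleted += len(batch)
    return deleted


print(tenant_prefix("tenant_1"))
```

This covers only the Amazon S3 side of offboarding; the tenant's knowledge base and vector index still need to be deleted separately, as described above.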

Summary of differences

The following figure and table provide a consolidated view for comparing the characteristics of the different multi-tenant RAG architecture patterns. This comprehensive overview highlights the key attributes and trade-offs associated with the pool, bridge, and silo patterns, enabling informed decision-making based on specific requirements.

The following figure illustrates the mapping of design characteristics to components of the RAG architecture.

The following table summarizes the characteristics of the multi-tenant RAG architecture patterns.

Characteristic | Attribute of | Pool | Bridge | Silo
Per-tenant chunking strategy | Amazon Bedrock Knowledge Base Data Source | No | Yes | Yes
Customer managed key for encryption of transient data and at rest | Amazon Bedrock Knowledge Base Data Source | No | No | Yes
Per-tenant distance measure | Amazon OpenSearch Service Index | No | Yes | Yes
Per-tenant ANN index configuration | Amazon OpenSearch Service Index | No | Yes | Yes
Per-tenant data deletion policies | Amazon Bedrock Knowledge Base Data Source | No | Yes | Yes
Per-tenant vector size | Amazon Bedrock Knowledge Base Data Source | No | Yes | Yes
Tenant performance isolation | Vector database | No | No | Yes
Tenant onboarding and offboarding complexity | Overall solution | Simplest, requires management of new tenants in existing infrastructure | Medium, requires minimal management of end-to-end infrastructure | Hardest, requires management of end-to-end infrastructure
Query client implementation | Original Data Source | Medium, requires dynamic filtering | Hardest, requires external tenant mapping table | Simplest, same as single-tenant implementation
Amazon S3 tenant management complexity | Amazon S3 buckets and objects | Hardest, need to maintain tenant specific metadata files for each object | Medium, each tenant needs a different S3 path | Simplest, each tenant requires a different S3 bucket
Cost | Vector database | Lowest | Medium | Highest
Per-tenant FM used to create vector embeddings | Amazon Bedrock Knowledge Base | No | Yes | Yes

Conclusion

This post explored three distinct patterns for implementing a multi-tenant RAG architecture using Amazon Bedrock Knowledge Bases and OpenSearch Service. The silo, pool, and bridge patterns offer varying levels of tenant isolation, variability, management simplicity, and cost-efficiency, catering to different use cases and requirements. By understanding the trade-offs and considerations associated with each pattern, organizations can make informed decisions and choose the approach that best aligns with their needs.

Get started with Amazon Bedrock Knowledge Bases today.


About the Authors

Emanuele Levi is a Solutions Architect in the Enterprise Software and SaaS team, based in London. Emanuele helps UK customers on their journey to refactor monolithic applications into modern microservices SaaS architectures. Emanuele is mainly interested in event-driven patterns and designs, especially when applied to analytics and AI, where he has expertise in the fraud-detection industry.

Mehran Nikoo is a Generative AI Go-To-Market Specialist at AWS. He leads the generative AI go-to-market strategy for UK and Ireland.

Dani Mitchell is a Generative AI Specialist Solutions Architect at AWS. He is focused on computer vision use cases and helps AWS customers in EMEA accelerate their machine learning and generative AI journeys with Amazon SageMaker and Amazon Bedrock.

How Amazon trains sequential ensemble models at scale with Amazon SageMaker Pipelines

Amazon SageMaker Pipelines includes features that allow you to streamline and automate machine learning (ML) workflows. This allows scientists and model developers to focus on model development and rapid experimentation rather than infrastructure management.

Pipelines lets you orchestrate complex ML workflows with a simple Python SDK and visualize those workflows through SageMaker Studio. This helps with data preparation and feature engineering tasks, as well as with automating model training and deployment. Pipelines also integrates with Amazon SageMaker Automatic Model Tuning, which can automatically find the hyperparameter values that result in the best-performing model, as determined by your chosen metric.

Ensemble models are becoming popular within the ML community. They generate more accurate predictions by combining the predictions of multiple models. Pipelines can quickly be used to create an end-to-end ML pipeline for ensemble models. This enables developers to build highly accurate models while maintaining efficiency and reproducibility.

In this post, we provide an example of an ensemble model that was trained and deployed using Pipelines.

Use case overview

Sales representatives generate new leads and create opportunities within Salesforce to track them. The following application is an ML approach using unsupervised learning to automatically identify use cases in each opportunity based on various text information, such as name, description, details, and product service group.

Preliminary analysis showed that use cases vary by industry and that different use cases have very different distributions of annualized revenue, which can help with segmentation. Hence, a use case is an important predictive feature that can optimize analytics and improve sales recommendation models.

We can treat use case identification as a topic identification problem, and we explored different topic identification models such as Latent Semantic Analysis (LSA), Latent Dirichlet Allocation (LDA), and BERTopic. In both LSA and LDA, each document is treated as a collection of words only, and the order of the words or their grammatical role does not matter, which may cause some information loss in determining the topic. Moreover, they require a predetermined number of topics, which was hard to determine in our data set. Because BERTopic overcomes these problems, we used it to identify the use case.

The approach uses three sequential BERTopic models to generate the final clustering in a hierarchical method.

Each BERTopic model consists of four parts:

  • Embedding – Different embedding methods can be used in BERTopic. In this scenario, input data comes from various areas and is usually inputted manually. As a result, we use sentence embedding to ensure scalability and fast processing.
  • Dimension reduction – We use Uniform Manifold Approximation and Projection (UMAP), which is an unsupervised and nonlinear dimension reduction method, to reduce high dimension text vectors.
  • Clustering – We use the Balanced Iterative Reducing and Clustering using Hierarchies (BIRCH) method to form different use case clusters.
  • Keyword identification – We use class-based TF-IDF to extract the most representative words from each cluster.
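To make the keyword identification step more concrete, the following is a simplified, standard-library-only sketch of class-based TF-IDF weighting, in which each term is scored by its normalized frequency within a cluster multiplied by log(1 + A / f), where A is the average number of words per cluster and f is the term's frequency across all clusters. This is our reading of the weighting commonly described for BERTopic, not the library's implementation, and the cluster names and tokens are invented for illustration.

```python
import math
from collections import Counter


def class_based_tfidf(cluster_docs):
    """Score terms per cluster; cluster_docs maps a cluster name to a list of token lists."""
    term_counts = {
        cluster: Counter(token for doc in docs for token in doc)
        for cluster, docs in cluster_docs.items()
    }
    corpus_counts = Counter()
    for counts in term_counts.values():
        corpus_counts.update(counts)
    avg_words_per_cluster = sum(corpus_counts.values()) / len(term_counts)

    scores = {}
    for cluster, counts in term_counts.items():
        cluster_size = sum(counts.values())
        scores[cluster] = {
            term: (count / cluster_size)
            * math.log(1 + avg_words_per_cluster / corpus_counts[term])
            for term, count in counts.items()
        }
    return scores


def top_terms(scores, k=2):
    """Return the k highest-scoring terms per cluster, breaking ties alphabetically."""
    return {
        cluster: [
            term
            for term, _ in sorted(weights.items(), key=lambda item: (-item[1], item[0]))[:k]
        ]
        for cluster, weights in scores.items()
    }


# Hypothetical, already-tokenized opportunity texts grouped by cluster
clusters = {
    "migration": [["database", "migration", "cloud"], ["cloud", "migration", "cost"]],
    "analytics": [["dashboard", "analytics", "cloud"], ["analytics", "reporting", "dashboard"]],
}
scores = class_based_tfidf(clusters)
print(top_terms(scores))
```

Terms that appear in every cluster, such as "cloud" in this toy data, are down-weighted, so the most representative words are those concentrated in a single cluster.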

Sequential ensemble model

There is no predetermined number of topics, so we set an input for the number of clusters to be 15–25 topics. Upon observation, some of the topics are wide and general. Therefore, another layer of the BERTopic model is applied individually to them. After combining all of the newly identified topics in the second-layer model and together with the original topics from first-layer results, postprocessing is performed manually to finalize topic identification. Lastly, a third layer is used for some of the clusters to create sub-topics.

To enable the second- and third-layer models to work effectively, you need a mapping file to map results from previous models to specific words or phrases. This helps make sure that the clustering is accurate and relevant.
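The role of the mapping file between layers can be illustrated with a short sketch. Everything here is hypothetical, including the JSON layout, the labels, and the `refine` flag: it only shows how first-layer topic IDs might be mapped to phrases and how documents in broad topics might be routed to the next layer.

```python
import json

# Hypothetical mapping file content: first-layer topic ID -> label and whether to refine it
MAPPING_JSON = """
{
  "0": {"label": "data migration", "refine": false},
  "1": {"label": "general cloud adoption", "refine": true},
  "2": {"label": "analytics", "refine": false}
}
"""


def load_mapping(text):
    """Parse the mapping and convert topic IDs from JSON strings to integers."""
    return {int(topic_id): entry for topic_id, entry in json.loads(text).items()}


def split_for_next_layer(documents, topic_ids, mapping):
    """Separate documents with final labels from those that need another model layer."""
    final, to_refine = [], []
    for doc, topic_id in zip(documents, topic_ids):
        entry = mapping[topic_id]  # raises KeyError if a topic is missing from the mapping
        target = to_refine if entry["refine"] else final
        target.append((doc, entry["label"]))
    return final, to_refine


mapping = load_mapping(MAPPING_JSON)
docs = ["move oracle db to cloud", "start using cloud", "build sales dashboard"]
final, to_refine = split_for_next_layer(docs, [0, 1, 2], mapping)
print(final)
print(to_refine)
```

In this sketch, only the documents assigned to a topic flagged as broad are passed to the second-layer model, while the others keep their first-layer label.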

We use Bayesian optimization for hyperparameter tuning and cross-validation to reduce overfitting. The data set contains features such as opportunity name, opportunity details, needs, associated product name, product details, and product groups. The models are evaluated using a customized loss function, and the best embedding model is selected.

Challenges and considerations

Here are some of the challenges and considerations of this solution:

  • The pipeline’s data preprocessing capability is crucial for enhancing model performance. With the ability to preprocess incoming data prior to training, we can make sure that our models are fed with high-quality data. Some of the preprocessing and data cleaning steps include converting all text columns to lowercase; removing template elements, contractions, URLs, and emails; removing non-relevant NER labels; and lemmatizing the combined text. The result is more accurate and reliable predictions.
  • We need a compute environment that is highly scalable so that we can effortlessly handle and train millions of rows of data. This allows us to perform large-scale data processing and modeling tasks with ease and reduces development time and costs.
  • Because each step of the ML workflow has different resource requirements, a flexible and adaptable pipeline is essential for efficient resource allocation. We can reduce the overall processing time, resulting in faster model development and deployment, by optimizing resource usage for each step.
  • Running custom scripts for data processing and model training requires the availability of required frameworks and dependencies.
  • Coordinating the training of multiple models can be challenging, especially when each subsequent model depends on the output of the previous one. The process of orchestrating the workflow between these models can be complex and time-consuming.
  • Following each training layer, it’s necessary to revise a mapping that reflects the topics produced by the model and use it as an input for the subsequent model layer.

Solution overview

In this solution, the entry point is Amazon SageMaker Studio, which is a web-based integrated development environment (IDE) provided by AWS that enables data scientists and ML developers to build, train, and deploy ML models at scale in a collaborative and efficient manner.

The following diagram illustrates the high-level architecture of the solution.

As part of the architecture, we’re using the following SageMaker pipeline steps:

  • SageMaker Processing – This step allows you to preprocess and transform data before training. One benefit of this step is the ability to use built-in algorithms for common data transformations and automatic scaling of resources. You can also use custom code for complex data preprocessing, and it allows you to use custom container images.
  • SageMaker Training – This step allows you to train ML models using SageMaker built-in algorithms or custom code. You can use distributed training to accelerate model training.
  • SageMaker Callback – This step allows you to run custom code during the ML workflow, such as sending notifications or triggering additional processing steps. You can run external processes and resume the pipeline workflow on completion in this step.
  • SageMaker Model – This step allows you to create or register a model in Amazon SageMaker.

Implementation walkthrough

First, we set up the SageMaker pipeline:

import boto3
import sagemaker

# create a Session with custom region (e.g. us-east-1), will be None if not specified
region = "<your-region-name>"

# allocate default S3 bucket for SageMaker session, will be None if not specified
default_bucket = "<your-s3-bucket>"
boto_session = boto3.Session(region_name=region)
sagemaker_client = boto_session.client("sagemaker")

Initialize a SageMaker Session

sagemaker_session = sagemaker.session.Session(boto_session=boto_session, sagemaker_client=sagemaker_client, default_bucket= default_bucket,) 

Set SageMaker execution role for the session

role = sagemaker.session.get_execution_role(sagemaker_session)

Manage interactions under Pipeline Context

from sagemaker.workflow.pipeline_context import PipelineSession

pipeline_session = PipelineSession(boto_session=boto_session, sagemaker_client=sagemaker_client, default_bucket=default_bucket,)

Define base image for scripts to run on

account_id = role.split(":")[4]
# create a base image that takes care of dependencies
ecr_repository_name = "<your-base-image-to-run-script>"
tag = "latest"
container_image_uri = "{0}.dkr.ecr.{1}.amazonaws.com/{2}:{3}".format(account_id, region, ecr_repository_name, tag)

The following is a detailed explanation of the workflow steps:

  • Preprocess the data – This involves cleaning and preparing the data for feature engineering and splitting the data into train, test, and validation sets.
import os
BASE_DIR = os.path.dirname(os.path.realpath(__file__))

from sagemaker.workflow.parameters import ParameterString
from sagemaker.workflow.steps import ProcessingStep

from sagemaker.processing import (
    ProcessingInput,
    ProcessingOutput,
    ScriptProcessor,
)

processing_instance_type = ParameterString(
    name="ProcessingInstanceType",
    # choose an instance type suitable for the job
    default_value="ml.m5.4xlarge"           
)

script_processor = ScriptProcessor(
    image_uri=container_image_uri,
    command=["python"],
    instance_type=processing_instance_type,
    instance_count=1,
    role=role,
)
 
# define the data preprocess job 
step_preprocess = ProcessingStep(
    name="DataPreprocessing",
    processor=script_processor,
    inputs=[
        ProcessingInput(source=BASE_DIR, destination="/opt/ml/processing/input/code/")  
    ],
    outputs=[
        ProcessingOutput(output_name="data_train", source="/opt/ml/processing/data_train"),  # output data and dictionaries etc for later steps
    ],
    code=os.path.join(BASE_DIR, "preprocess.py"),
)
  • Train layer 1 BERTopic model – A SageMaker training step is used to train the first layer of the BERTopic model using an Amazon Elastic Container Registry (Amazon ECR) image and a custom training script.
base_job_prefix="OppUseCase"

from sagemaker.workflow.steps import TrainingStep
from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput

training_instance_type = ParameterString(
    name="TrainingInstanceType",
    default_value="ml.m5.4xlarge"
)

# create an estimator for training job
estimator_first_layer = Estimator(
    image_uri=container_image_uri,
    instance_type=training_instance_type,
    instance_count=1,
    output_path=f"s3://{default_bucket}/{base_job_prefix}/train_first_layer",       # S3 location where the training output will be stored
    role=role,
    entry_point = "train_first_layer.py"
)

# create training job for the estimator based on inputs from data-preprocess step 
step_train_first_layer = TrainingStep(
    name="TrainFirstLayerModel",
    estimator = estimator_first_layer,
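    inputs={
        # inputs must map channel names to TrainingInput objects;
        # the channel name "train" is illustrative, use the name your training script expects
        "train": TrainingInput(
            s3_data=step_preprocess.properties.ProcessingOutputConfig.Outputs["data_train"].S3Output.S3Uri,
        ),
    },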
)
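  • Use a callback step – This involves sending a message to an Amazon SQS queue, which triggers a Lambda function. The Lambda function updates the mapping file in Amazon S3 and sends a success token back to the pipeline to resume its run.
from sagemaker.workflow.callback_step import CallbackStep, CallbackOutput, CallbackOutputTypeEnum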

first_sqs_queue_to_use = ParameterString(
    name="FirstSQSQueue",
    default_value="<first_queue_url>",  # add queue url
)

first_callback_output = CallbackOutput(output_name="s3_mapping_first_update", output_type=CallbackOutputTypeEnum.String)

step_first_mapping_update = CallbackStep(
    name="FirstMappingUpdate",
    sqs_queue_url= first_sqs_queue_to_use,

    # Input arguments that will be provided in the SQS message
    inputs={
        "input_location": f"s3://{default_bucket}/{base_job_prefix}/mapping",
        "output_location": f"s3://{default_bucket}/{base_job_prefix}/mapping_first_update"
    },
    outputs=[
        first_callback_output,
    ],
)

step_first_mapping_update.add_depends_on([step_train_first_layer])       # call back is run after the step_train_first_layer
  • Train layer 2 BERTopic model – Another SageMaker TrainingStep is used to train the second layer of the BERTopic model using an ECR image and a custom training script.
estimator_second_layer = Estimator(
    image_uri=container_image_uri,
    instance_type=training_instance_type,    # same type as of first train layer
    instance_count=1,
    output_path=f"s3://{default_bucket}/{base_job_prefix}/train_second_layer",     # S3 location where the training output will be stored
    role=role,
    entry_point = "train_second_layer.py"
)

# create training job for the estimator based on inputs from preprocessing, output of previous call back step and first train layer step
step_train_second_layer = TrainingStep(
    name="TrainSecondLayerModel",
    estimator = estimator_second_layer,
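    inputs={
        # inputs must map channel names to TrainingInput objects;
        # the channel names are illustrative, use the names your training script expects
        "train": TrainingInput(
            s3_data=step_preprocess.properties.ProcessingOutputConfig.Outputs["data_train"].S3Output.S3Uri,
        ),
        "mapping": TrainingInput(
            # Output of the previous callback step
            s3_data=step_first_mapping_update.properties.Outputs["s3_mapping_first_update"],
        ),
        "first_layer": TrainingInput(
            s3_data=f"s3://{default_bucket}/{base_job_prefix}/train_first_layer"
        ),
    },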
)
  • Use a callback step – Similar to Step 3, this involves sending a message to an SQS queue which triggers a Lambda function. The Lambda function updates the mapping file in Amazon S3 and sends a success token back to the pipeline to resume its run.
second_sqs_queue_to_use = ParameterString(
    name="SecondSQSQueue",
    default_value="<second_queue_url>",           # add queue url
)

second_callback_output = CallbackOutput(output_name="s3_mapping_second_update", output_type=CallbackOutputTypeEnum.String)

step_second_mapping_update = CallbackStep(
    name="SecondMappingUpdate",
    sqs_queue_url= second_sqs_queue_to_use,

    # Input arguments that will be provided in the SQS message
    inputs={
        "input_location": f"s3://{default_bucket}/{base_job_prefix}/mapping_first_update",
        "output_location": f"s3://{default_bucket}/{base_job_prefix}/mapping_second_update"
    },
    outputs=[
        second_callback_output,
    ],
)

step_second_mapping_update.add_depends_on([step_train_second_layer])       # call back is run after the step_train_second_layer   
  • Train layer 3 BERTopic model – This involves fetching the mapping file from Amazon S3 and training the third layer of the BERTopic model using an ECR image and a custom training script.
estimator_third_layer = Estimator(
    image_uri=container_image_uri,
    instance_type=training_instance_type,                   # same type as the previous two train layers
    instance_count=1,
    output_path=f"s3://{default_bucket}/{base_job_prefix}/train_third_layer",      # S3 location where the training output will be stored
    role=role,
    entry_point = "train_third_layer.py"
)

# create training job for the estimator based on inputs from preprocess step, second callback step and outputs of previous two train layers
step_train_third_layer = TrainingStep(
    name="TrainThirdLayerModel",
    estimator = estimator_third_layer,
    inputs={
        # inputs must map channel names to TrainingInput objects;
        # the channel names are illustrative, use the names your training script expects
        "train": TrainingInput(
            s3_data=step_preprocess.properties.ProcessingOutputConfig.Outputs["data_train"].S3Output.S3Uri,
        ),
        "mapping": TrainingInput(
            # Output of the previous callback step
            s3_data=step_second_mapping_update.properties.Outputs["s3_mapping_second_update"],
        ),
        "first_layer": TrainingInput(
            s3_data=f"s3://{default_bucket}/{base_job_prefix}/train_first_layer"
        ),
        "second_layer": TrainingInput(
            s3_data=f"s3://{default_bucket}/{base_job_prefix}/train_second_layer"
        ),
    },
)
  • Register the model – A SageMaker model step is used to register the model in the SageMaker model registry. When the model is registered, you can use the model through a SageMaker inference pipeline.
from sagemaker.model import Model
from sagemaker.workflow.model_step import ModelStep

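# placeholders (not defined in the original snippet): set these to match your model registry setup
model_package_group_name = "<your-model-package-group-name>"
model_approval_status = "PendingManualApproval"

model = Model(
    image_uri=container_image_uri,
    model_data=step_train_third_layer.properties.ModelArtifacts.S3ModelArtifacts,
    sagemaker_session=pipeline_session,     # a PipelineSession defers registration until the pipeline runs
    role=role,
)

register_args = model.register(
    content_types=["text/csv"],
    response_types=["text/csv"],
    inference_instances=["ml.c5.9xlarge", "ml.m5.xlarge"],
    model_package_group_name=model_package_group_name,
    approval_status=model_approval_status,
)
step_register = ModelStep(name="OppUseCaseRegisterModel", step_args=register_args)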
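
The two callback steps above depend on a Lambda function that consumes the SQS message and reports back to the pipeline. The original post doesn't show that function, so the following is only a minimal sketch of what its core could look like. It assumes the message body carries the token and arguments fields that SageMaker Pipelines sends for a callback step; update_mapping is a hypothetical callable standing in for the mapping-file update, and sagemaker_client is expected to be a boto3 SageMaker client.

```python
import json


def handle_callback_records(event, sagemaker_client, update_mapping, output_name):
    """Process SQS records produced by a SageMaker Pipelines callback step.

    update_mapping is a hypothetical callable that receives the step's input
    arguments and returns the S3 URI of the updated mapping file.
    """
    for record in event["Records"]:
        body = json.loads(record["body"])
        token = body["token"]          # callback token issued by the pipeline
        arguments = body["arguments"]  # the inputs dict of the CallbackStep
        try:
            output_uri = update_mapping(arguments)
            # Resume the pipeline and expose the result as the step output.
            sagemaker_client.send_pipeline_execution_step_success(
                CallbackToken=token,
                OutputParameters=[{"Name": output_name, "Value": output_uri}],
            )
        except Exception as exc:
            # Fail the step so the pipeline does not wait indefinitely.
            sagemaker_client.send_pipeline_execution_step_failure(
                CallbackToken=token,
                FailureReason=str(exc)[:256],
            )
```

In a Lambda handler, you would call this function with boto3.client("sagemaker") and the output name declared in the matching CallbackOutput, for example s3_mapping_first_update.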

To train a BERTopic model with the BIRCH and UMAP methods effectively, you need a custom training image that provides the additional dependencies and frameworks required to run the algorithm. For a working sample of a custom Docker image, refer to Create a custom Docker container Image for SageMaker.

Conclusion

In this post, we explained how you can use the wide range of steps offered by SageMaker Pipelines with custom images to train an ensemble model. For more information on how to get started with Pipelines using an existing ML Operations (MLOps) template, refer to Building, automating, managing, and scaling ML workflows using Amazon SageMaker Pipelines.


About the Authors

Bikramjeet Singh is an Applied Scientist on the AWS Sales Insights, Analytics and Data Science (SIADS) team, responsible for building the GenAI platform and AI/ML infrastructure solutions for ML scientists within SIADS. Prior to working as an Applied Scientist, Bikram worked as a Software Development Engineer within SIADS and Alexa AI.

Rahul Sharma is a Senior Specialist Solutions Architect at AWS, helping AWS customers build ML and Generative AI solutions. Prior to joining AWS, Rahul spent several years in the finance and insurance industries, helping customers build data and analytics platforms.

Sachin Mishra is a seasoned professional with 16 years of industry experience in technology consulting and software leadership roles. Sachin led the Sales Strategy Science and Engineering function at AWS. In this role, he was responsible for scaling cognitive analytics for sales strategy, leveraging advanced AI/ML technologies to drive insights and optimize business outcomes.

Nada Abdalla is a research scientist at AWS. Her work and expertise span multiple science areas in statistics and ML, including text analytics, recommendation systems, Bayesian modeling, and forecasting. She previously worked in academia and obtained her M.Sc and PhD in Biostatistics from UCLA. Through her work in academia and industry, she has published multiple papers in esteemed statistics journals and at applied ML conferences. In her spare time, she enjoys running and spending time with her family.

Implementing login node load balancing in SageMaker HyperPod for enhanced multi-user experience

Amazon SageMaker HyperPod is designed to support large-scale machine learning (ML) operations, providing a robust environment for training foundation models (FMs) over extended periods. Multiple users — such as ML researchers, software engineers, data scientists, and cluster administrators — can work concurrently on the same cluster, each managing their own jobs and files without interfering with others.

When using HyperPod, you can use familiar orchestration options such as Slurm or Amazon Elastic Kubernetes Service (Amazon EKS). This blog post specifically applies to HyperPod clusters using Slurm as the orchestrator. In these clusters, the concept of login nodes is available, which cluster administrators can add to facilitate user access. These login nodes serve as the entry point through which users interact with the cluster’s computational resources. By using login nodes, users can separate their interactive activities, such as browsing files, submitting jobs, and compiling code, from the cluster’s head node. This separation helps prevent any single user’s activities from affecting the overall performance of the cluster.

However, although HyperPod provides the capability to use login nodes, it doesn’t provide an integrated mechanism for load balancing user activity across these nodes. As a result, users manually select a login node, which can lead to imbalances where some nodes are overutilized while others remain underutilized. This not only affects the efficiency of resource usage but can also lead to uneven performance experiences for different users.

In this post, we explore a solution for implementing load balancing across login nodes in Slurm-based HyperPod clusters. By distributing user activity evenly across all available nodes, this approach provides more consistent performance, better resource utilization, and a smoother experience for all users. We guide you through the setup process, providing practical steps to achieve effective load balancing in your HyperPod clusters.

Solution overview

In HyperPod, login nodes serve as access points for users to interact with the cluster’s computational resources so they can manage their tasks without impacting the head node. Although the default method for accessing these login nodes is through AWS Systems Manager, there are cases where direct Secure Shell (SSH) access is more suitable. SSH provides a more traditional and flexible way of managing interactions, especially for users who require specific networking configurations or need features such as TCP load balancing, which Systems Manager doesn’t support.

Given that HyperPod is typically deployed in a virtual private cloud (VPC) using private subnets, direct SSH access to the login nodes requires secure network connectivity into the private subnet. There are several options to achieve this:

  1. AWS Site-to-Site VPN – Establishes a secure connection between your on-premises network and your VPC, suitable for enterprise environments
  2. AWS Direct Connect – Offers a dedicated network connection for high-throughput and low-latency needs
  3. AWS Client VPN – A software-based solution that remote users can use to securely connect to the VPC, providing flexible and easy access to the login nodes

This post demonstrates how to use AWS Client VPN to establish a secure connection to the VPC. We set up a Network Load Balancer (NLB) within the private subnet to evenly distribute SSH traffic across the available login nodes and use the VPN connection to connect to the NLB in the VPC. The NLB ensures that user sessions are balanced across the nodes, preventing any single node from becoming a bottleneck and thereby improving overall performance and resource utilization.

For environments where VPN connectivity might not be feasible, an alternative option is to deploy the NLB in a public subnet to allow direct SSH access from the internet. In this configuration, the NLB can be secured by restricting access through a security group that allows SSH traffic only from specified, trusted IP addresses. As a result, authorized users can connect directly to the login nodes while maintaining some level of control over access to the cluster. However, this public-facing method is outside the scope of this post and isn’t recommended for production environments, as exposing SSH access to the internet can introduce additional security risks.

The following diagram provides an overview of the solution architecture.

Solution overview

Prerequisites

Before following the steps in this post, make sure you have the foundational components of a HyperPod cluster setup in place. This includes the core infrastructure for the HyperPod cluster and the network configuration required for secure access. Specifically, you need:

  • HyperPod cluster – This post assumes you already have a HyperPod cluster deployed. If not, refer to Getting started with SageMaker HyperPod and the HyperPod workshop for guidance on creating and configuring your cluster.
  • VPC, subnets, and security group – Your HyperPod cluster should reside within a VPC with associated subnets. To deploy a new VPC and subnets, follow the instructions in the Own Account section of the HyperPod workshop. This process includes deploying an AWS CloudFormation stack to create essential resources such as the VPC, subnets, security group, and an Amazon FSx for Lustre volume for shared storage.

Setting up login nodes for cluster access

Login nodes are dedicated access points that users can use to interact with the HyperPod cluster’s computational resources without impacting the head node. By connecting through login nodes, users can browse files, submit jobs, and compile code independently, promoting a more organized and efficient use of the cluster’s resources.

If you haven’t set up login nodes yet, refer to the Login Node section of the HyperPod Workshop, which provides detailed instructions on adding these nodes to your cluster configuration.

Each login node in a HyperPod cluster has an associated network interface within your VPC. A network interface, also known as an elastic network interface, represents a virtual network card that connects each login node to your VPC, allowing it to communicate over the network. These interfaces have assigned IPv4 addresses, which are essential for routing traffic from the NLB to the login nodes.

To proceed with the load balancer setup, you need to obtain the IPv4 addresses of each login node. You can obtain these addresses from the AWS Management Console or by invoking a command on your HyperPod cluster’s head node.

Using the AWS Management Console

To set up login nodes for cluster access using the AWS Management Console, follow these steps:

  1. On the Amazon EC2 console, select Network interfaces in the navigation pane
  2. In the Search bar, select VPC ID = (Equals) and choose the VPC ID of the VPC containing the HyperPod cluster
  3. In the Search bar, select Description : (Contains) and enter the name of the instance group that includes your login nodes (typically, this is login-group)

For each login node, you will find an entry in the list, as shown in the following screenshot. Note down the IPv4 addresses for all login nodes of your cluster.

Search network interfaces

Using the HyperPod head node

Alternatively, you can also retrieve the IPv4 addresses by entering the following command on your HyperPod cluster’s head node:

sudo cat /opt/ml/config/resource_config.json \
    | jq '.InstanceGroups[] | select(.Name=="login-group").Instances[].CustomerIpAddress'
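
If jq isn't available, the same lookup can be done in Python. The following is a minimal sketch that mirrors the jq filter above. The function name login_node_ips and the sample data are hypothetical, and the sample contains only the fields that the filter reads (InstanceGroups, Name, Instances, and CustomerIpAddress).

```python
def login_node_ips(resource_config, group_name="login-group"):
    """Return the CustomerIpAddress of every instance in the named instance group."""
    return [
        instance["CustomerIpAddress"]
        for group in resource_config.get("InstanceGroups", [])
        if group.get("Name") == group_name
        for instance in group.get("Instances", [])
    ]


# Hypothetical sample with made-up addresses, for illustration only
sample_config = {
    "InstanceGroups": [
        {"Name": "controller-machine", "Instances": [{"CustomerIpAddress": "10.0.1.10"}]},
        {
            "Name": "login-group",
            "Instances": [
                {"CustomerIpAddress": "10.0.1.21"},
                {"CustomerIpAddress": "10.0.1.22"},
            ],
        },
    ]
}
print(login_node_ips(sample_config))
```

On the head node, you would load the real file with json.load and run the script with sufficient privileges to read /opt/ml/config/resource_config.json.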

Create a Network Load Balancer

The next step is to create an NLB to manage traffic across your cluster’s login nodes.

For the NLB deployment, you need the IPv4 addresses of the login nodes collected earlier and the appropriate security group configurations. If you deployed your cluster using the HyperPod workshop instructions, a security group that permits communication between all cluster nodes should already be in place.

This security group can be applied to the load balancer, as demonstrated in the following instructions. Alternatively, you can opt to create a dedicated security group that grants access specifically to the login nodes.

Create target group

First, we create the target group that will be used by the NLB.

  1. On the Amazon EC2 console, select Target groups in the navigation pane
  2. Choose Create target group
  3. Create a target group with the following parameters:
    1. For Choose a target type, choose IP addresses
    2. For Target group name, enter smhp-login-node-tg
    3. For Protocol : Port, choose TCP and enter 22
    4. For IP address type, choose IPv4
    5. For VPC, choose SageMaker HyperPod VPC (which was created with the CloudFormation template)
    6. For Health check protocol, choose TCP
  4. Choose Next, as shown in the following screenshot

Create NLB target group - Step 1

  5. In the Register targets section, register the login node IP addresses as the targets
  6. For Ports, enter 22 and choose Include as pending below, as shown in the following screenshot

Create NLB target group - Step 2

  7. The login node IPs will appear as targets with Pending health status. Choose Create target group, as shown in the following screenshot

Create NLB target group - Step 3
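
The console steps above can also be scripted. The following is a minimal sketch using the boto3 Elastic Load Balancing v2 client, not a definitive implementation: the function name is hypothetical, elbv2_client is assumed to come from boto3.client("elbv2"), and the VPC ID and IP addresses are values you supply.

```python
def create_login_target_group(elbv2_client, vpc_id, login_node_ips, name="smhp-login-node-tg"):
    """Create a TCP:22 target group of type IP and register the login nodes in it."""
    response = elbv2_client.create_target_group(
        Name=name,
        Protocol="TCP",
        Port=22,
        VpcId=vpc_id,
        TargetType="ip",
        IpAddressType="ipv4",
        HealthCheckProtocol="TCP",
    )
    target_group_arn = response["TargetGroups"][0]["TargetGroupArn"]
    # Register each login node IPv4 address as a target on the SSH port.
    elbv2_client.register_targets(
        TargetGroupArn=target_group_arn,
        Targets=[{"Id": ip, "Port": 22} for ip in login_node_ips],
    )
    return target_group_arn
```

The function returns the target group ARN, which is needed later when the load balancer listener is created.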

Create load balancer

To create the load balancer, follow these steps:

  1. On the Amazon EC2 console, select Load Balancers in the navigation pane
  2. Choose Create load balancer
  3. Choose Network Load Balancer and choose Create, as shown in the following screenshot

Create load balancer selection dialog

  4. Provide a name (for example, smhp-login-node-lb) and choose Internal as Scheme

Create NLB - Step 1

  5. For network mapping, select the VPC that contains your HyperPod cluster and an associated private subnet, as shown in the following screenshot

Create NLB - Step 2

  6. Select a security group that allows access on port 22 to the login nodes. If you deployed your cluster using the HyperPod workshop instructions, you can use the security group from this deployment.
  7. Select the Target group that you created before and choose TCP as Protocol and 22 for Port, as shown in the following screenshot

Create NLB - Step 3

  8. Choose Create load balancer

After the load balancer has been created, you can find its DNS name on the load balancer’s detail page, as shown in the following screenshot. 

Find DNS name after NLB creation
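
As an alternative to the console, the load balancer and its listener can be created with boto3. This is a minimal sketch under stated assumptions: the function name is hypothetical, elbv2_client is assumed to come from boto3.client("elbv2"), and the subnet IDs, security group IDs, and target group ARN are values from your own deployment.

```python
def create_login_nlb(elbv2_client, subnet_ids, security_group_ids, target_group_arn,
                     name="smhp-login-node-lb"):
    """Create an internal Network Load Balancer with a TCP:22 listener."""
    load_balancer = elbv2_client.create_load_balancer(
        Name=name,
        Type="network",
        Scheme="internal",  # reachable only from within the VPC
        Subnets=list(subnet_ids),
        SecurityGroups=list(security_group_ids),
        IpAddressType="ipv4",
    )["LoadBalancers"][0]
    # Forward SSH traffic arriving on port 22 to the login node target group.
    elbv2_client.create_listener(
        LoadBalancerArn=load_balancer["LoadBalancerArn"],
        Protocol="TCP",
        Port=22,
        DefaultActions=[{"Type": "forward", "TargetGroupArn": target_group_arn}],
    )
    return load_balancer["DNSName"]
```

The returned DNS name is the same value shown on the load balancer’s detail page and is what users pass to their SSH client.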

Making sure host keys are consistent across login nodes

When using multiple login nodes in a load-balanced environment, it’s crucial to maintain consistent SSH host keys across all nodes. SSH host keys are unique identifiers that each server uses to prove its identity to connecting clients. If each login node has a different host key, users will encounter “WARNING: SSH HOST KEY CHANGED” messages whenever they connect to a different node, causing confusion and potentially leading users to question the security of the connection.

To avoid these warnings, configure the same SSH host keys on all login nodes in the load balancing rotation. This setup makes sure that users won’t receive host key mismatch alerts when routed to a different node by the load balancer.

You can enter the following script on the cluster’s head node to copy the SSH host keys from the first login node to the other login nodes in your HyperPod cluster:

#!/bin/bash

SUDOER_USER="ubuntu"

login_nodes=($(sudo cat /opt/ml/config/resource_config.json | jq '.InstanceGroups[] | select(.Name=="login-group").Instances[].CustomerIpAddress' | tr '\n' ' ' | tr -d '"'))
source_node="${login_nodes[0]}"
key_paths=("/etc/ssh/ssh_host_rsa_key"
           "/etc/ssh/ssh_host_rsa_key.pub"
           "/etc/ssh/ssh_host_ecdsa_key"
           "/etc/ssh/ssh_host_ecdsa_key.pub"
           "/etc/ssh/ssh_host_ed25519_key"
           "/etc/ssh/ssh_host_ed25519_key.pub")

tmp_dir="/tmp/ssh_host_keys_$(uuidgen)"

copy_cmd=""
for key_path in "${key_paths[@]}"; do
  copy_cmd="sudo cp $key_path $tmp_dir/;$copy_cmd"
done

ssh $source_node "mkdir -p $tmp_dir;${copy_cmd} sudo chown -R $SUDOER_USER $tmp_dir;"

for node in "${login_nodes[@]:1}"; do
  echo "Copying SSH host keys from $source_node to $node..."
  scp -r $source_node:$tmp_dir $node:$tmp_dir
  ssh $node "sudo chown -R root:root $tmp_dir; sudo mv $tmp_dir/ssh_host_* /etc/ssh/;"
done

for node in "${login_nodes[@]}"; do
  echo "Cleaning up tmp dir $tmp_dir on $node..."
  ssh $node "sudo rm -r $tmp_dir"
done
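
After copying the keys, it can be useful to confirm that every login node presents the same host keys. The following is a minimal sketch of such a check and isn't part of the original script. It assumes you have collected the text output of ssh-keyscan for each node (one line per key, in the form host, key type, key data); the function name and the way the output is gathered are hypothetical.

```python
def host_keys_consistent(keyscan_output_by_node):
    """Return True when every node reports the same set of (key type, key data) pairs."""
    key_sets = []
    for output in keyscan_output_by_node.values():
        keys = set()
        for line in output.splitlines():
            fields = line.split()
            # Skip blank lines, comment lines, and anything that is not a key entry.
            if len(fields) < 3 or fields[0].startswith("#"):
                continue
            keys.add((fields[1], fields[2]))
        key_sets.append(keys)
    return all(keys == key_sets[0] for keys in key_sets)
```

One way to gather the input is to run ssh-keyscan against each login node IP address with subprocess.run and store the captured standard output in a dictionary keyed by IP address.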

Create AWS Client VPN endpoint

Because the NLB has been created with Internal scheme, it’s only accessible from within the HyperPod VPC. To access the VPC and send requests to the NLB, we use AWS Client VPN in this post.

AWS Client VPN is a managed client-based VPN service that enables secure access to your AWS resources and resources in your on-premises network.

We’ll set up an AWS Client VPN endpoint that provides clients with access to the HyperPod VPC and uses mutual authentication. With mutual authentication, Client VPN uses certificates to perform authentication between clients and the Client VPN endpoint.

To deploy a Client VPN endpoint with mutual authentication, you can follow the steps outlined in Get started with AWS Client VPN. When configuring the Client VPN endpoint to access the HyperPod VPC and the login nodes, keep the following adaptations to those steps in mind:

  • Step 2 (create a Client VPN endpoint) – By default, all client traffic is routed through the Client VPN tunnel. To allow internet access without routing traffic through the VPN, you can enable the option Enable split-tunnel when creating the endpoint. When this option is enabled, only traffic destined for networks matching a route in the Client VPN endpoint route table is routed through the VPN tunnel. For more details, refer to Split-tunnel on Client VPN endpoints.
  • Step 3 (target network associations) – Select the VPC and private subnet used by your HyperPod cluster, which contains the cluster login nodes.
  • Step 4 (authorization rules) – Choose the Classless Inter-Domain Routing (CIDR) range associated with the HyperPod VPC. If you followed the HyperPod workshop instructions, the CIDR range is 10.0.0.0/16.
  • Step 6 (security groups) – Select the security group that you previously used when creating the NLB.
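
To illustrate the split-tunnel behavior described in the steps above, the following minimal sketch checks whether a destination address matches one of the routes of the Client VPN endpoint. It only models the routing decision with Python’s standard ipaddress module; the function name and the sample addresses are hypothetical, and the CIDR range is the one used by the HyperPod workshop.

```python
import ipaddress


def routed_through_vpn(destination_ip, route_cidrs):
    """Return True if the destination matches any route of the Client VPN endpoint."""
    address = ipaddress.ip_address(destination_ip)
    return any(address in ipaddress.ip_network(cidr) for cidr in route_cidrs)


vpn_routes = ["10.0.0.0/16"]  # CIDR range of the HyperPod VPC from the workshop

print(routed_through_vpn("10.0.12.34", vpn_routes))     # hypothetical login node address
print(routed_through_vpn("93.184.216.34", vpn_routes))  # a public internet address
```

With split-tunnel enabled, only the first kind of destination would use the VPN tunnel; other traffic continues to use the client’s regular internet connection.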

Connecting to the login nodes

After the AWS Client VPN is configured, clients can establish a VPN connection to the HyperPod VPC. With the VPN connection in place, clients can use SSH to connect to the NLB, which will route them to one of the login nodes.

ssh -i /path/to/your/private-key.pem user@<NLB-IP-or-DNS>

To allow SSH access to the login nodes, you must create user accounts on the cluster and add their public keys to the authorized_keys file on each login node (or on all nodes, if necessary). For detailed instructions on managing multi-user access, refer to the Multi-User section of the HyperPod workshop.

In addition to using the AWS Client VPN, you can also access the NLB from other AWS services, such as Amazon Elastic Compute Cloud (Amazon EC2) instances, if they meet the following requirements:

  • VPC connectivity – The EC2 instances must be either in the same VPC as the NLB or able to access the HyperPod VPC through a peering connection or similar network setup.
  • Security group configuration – The EC2 instance’s security group must allow outbound connections on port 22 to the NLB security group. Likewise, the NLB security group should be configured to accept inbound SSH traffic on port 22 from the EC2 instance’s security group.
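
The inbound rule described in the security group requirement above can be added with the boto3 EC2 client. This is a minimal sketch under stated assumptions: the function name is hypothetical, ec2_client is assumed to come from boto3.client("ec2"), and both security group IDs are values from your own environment.

```python
def allow_ssh_from_group(ec2_client, nlb_security_group_id, source_security_group_id):
    """Allow inbound SSH (TCP port 22) on the NLB security group from another security group."""
    ec2_client.authorize_security_group_ingress(
        GroupId=nlb_security_group_id,
        IpPermissions=[
            {
                "IpProtocol": "tcp",
                "FromPort": 22,
                "ToPort": 22,
                # Reference the source security group instead of a CIDR range.
                "UserIdGroupPairs": [
                    {
                        "GroupId": source_security_group_id,
                        "Description": "SSH from EC2 client instances",
                    }
                ],
            }
        ],
    )
```

Referencing the source security group rather than IP ranges keeps the rule valid when the client instances are replaced or change their private addresses.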

Clean up

To remove deployed resources, you can clean them up in the following order:

  1. Delete the Client VPN endpoint
  2. Delete the Network Load Balancer
  3. Delete the target group associated with the load balancer

If you also want to delete the HyperPod cluster, follow these additional steps:

  1. Delete the HyperPod cluster
  2. Delete the CloudFormation stack, which includes the VPC, subnets, security group, and FSx for Lustre volume

Conclusion

In this post, we explored how to implement login node load balancing for SageMaker HyperPod clusters. By using a Network Load Balancer to distribute user traffic across login nodes, you can optimize resource utilization and enhance the overall multi-user experience, providing seamless access to cluster resources for each user.

This approach represents only one way to customize your HyperPod cluster. Because of the flexibility of SageMaker HyperPod, you can adapt configurations to your unique needs while benefiting from a managed, resilient environment. Whether you need to scale foundation model workloads, share compute resources across different tasks, or support long-running training jobs, SageMaker HyperPod offers a versatile solution that can evolve with your requirements.

For more details on making the most of SageMaker HyperPod, dive into the HyperPod workshop and explore further blog posts covering HyperPod.


About the Authors

Janosch Woschitz is a Senior Solutions Architect at AWS, specializing in AI/ML. With over 15 years of experience, he supports customers globally in leveraging AI and ML for innovative solutions and building ML platforms on AWS. His expertise spans machine learning, data engineering, and scalable distributed systems, augmented by a strong background in software engineering and industry expertise in domains such as autonomous driving.

Giuseppe Angelo Porcelli is a Principal Machine Learning Specialist Solutions Architect for Amazon Web Services. With several years of software engineering and an ML background, he works with customers of any size to understand their business and technical needs and design AI and ML solutions that make the best use of the AWS Cloud and the Amazon Machine Learning stack. He has worked on projects in different domains, including MLOps, computer vision, and NLP, involving a broad set of AWS services. In his free time, Giuseppe enjoys playing football.

How Clearwater Analytics is revolutionizing investment management with generative AI and Amazon SageMaker JumpStart

This post was written with Darrel Cherry, Dan Siddall, and Rany ElHousieny of Clearwater Analytics.

As global trading volumes rise rapidly each year, capital markets firms are facing the need to manage large and diverse datasets to stay ahead. These datasets aren’t just expansive in volume; they’re critical in driving strategy development, enhancing execution, and streamlining risk management. The explosion of data creation and utilization, paired with the increasing need for rapid decision-making, has intensified competition and unlocked opportunities within the industry. To remain competitive, capital markets firms are adopting Amazon Web Services (AWS) Cloud services across the trade lifecycle to rearchitect their infrastructure, remove capacity constraints, accelerate innovation, and optimize costs.

Generative AI, AI, and machine learning (ML) are playing a vital role for capital markets firms to speed up revenue generation, deliver new products, mitigate risk, and innovate on behalf of their customers. A great example of such innovation is our customer Clearwater Analytics and their use of large language models (LLMs) hosted on Amazon SageMaker JumpStart, which has propelled asset management productivity and delivered AI-powered investment management productivity solutions to their customers.

In this post, we explore Clearwater Analytics’ foray into generative AI, how they’ve architected their solution with Amazon SageMaker, and dive deep into how Clearwater Analytics is using LLMs to take advantage of more than 18 years of experience within the investment management domain while optimizing model cost and performance.

About Clearwater Analytics

Clearwater Analytics (NYSE: CWAN) stands at the forefront of investment management technology. Founded in 2004 in Boise, Idaho, Clearwater has grown into a global software-as-a-service (SaaS) powerhouse, providing automated investment data reconciliation and reporting for over $7.3 trillion in assets across thousands of accounts worldwide. With a team of more than 1,600 professionals and a long-standing relationship with AWS dating back to 2008, Clearwater has consistently pushed the boundaries of financial technology innovation.

In May 2023, Clearwater embarked on a journey into the realm of generative AI, starting with a private, secure generative AI chat-based assistant for their internal workforce, enhancing client inquiries through Retrieval Augmented Generation (RAG). As a result, Clearwater was able to increase assets under management (AUM) over 20% without increasing operational headcount. By September of the same year, Clearwater unveiled its generative AI customer offerings at the Clearwater Connect User Conference, marking a significant milestone in their AI-driven transformation.

About SageMaker JumpStart

Amazon SageMaker JumpStart is an ML hub that can help you accelerate your ML journey. With SageMaker JumpStart, you can evaluate, compare, and select foundation models (FMs) quickly based on predefined quality and responsibility metrics to perform tasks such as article summarization and image generation. Pre-trained models are fully customizable for your use case with your data, and you can effortlessly deploy them into production with the user interface or AWS SDK. You can also share artifacts, including models and notebooks, within your organization to accelerate model building and deployment, and admins can control which models are visible to users within their organization.

Clearwater’s generative AI solution architecture

Clearwater Analytics’ generative AI architecture supports a wide array of vertical solutions by merging extensive functional capabilities through the LangChain framework, domain knowledge through RAG, and customized LLMs hosted on Amazon SageMaker. This integration has resulted in a potent asset for both Clearwater customers and their internal teams.

The following image illustrates the solution architecture.

As of September 2024, the AI solution supports three core applications:

  1. Clearwater Intelligent Console (CWIC) – Clearwater’s customer-facing AI application. This assistant framework is built upon three pillars:
    • Knowledge awareness – Using RAG, CWIC compiles and delivers the comprehensive knowledge that customers rely on, from intricate calculations of book value to period-end reconciliation processes.
    • Application awareness – Transforming novice users into power users instantly, CWIC guides clients to inquire about Clearwater’s applications and receive direct links to relevant investment reports. For instance, if a client needs information on their yuan exposure, CWIC employs its tool framework to identify and provide links to the appropriate currency exposure reports.
    • Data awareness – Digging deep into portfolio data, CWIC adeptly manages complex queries, such as validating book yield tie-outs, by accessing customer-specific data and performing real-time calculations.

    The following image shows a snippet of the generative AI assistance within the CWIC.
  2. Crystal – Clearwater’s advanced AI assistant with expanded capabilities that empower internal teams’ operations. Crystal shares CWIC’s core functionalities but benefits from broader data sources and API access. Enhancements driven by Crystal have achieved efficiency gains between 25% and 43%, improving Clearwater’s ability to manage substantial increases in AUM without increases in staffing.
  3. CWIC Specialists – Clearwater’s most recent solution. CWIC Specialists are domain-specific generative AI agents equipped to handle nuanced investment tasks, from accounting to regulatory compliance. These agents can work in single or multi-agentic workflows to answer questions, perform complex operations, and collaborate to solve various investment-related tasks. These specialists assist both internal teams and customers in domain-specific areas, such as investment accounting, regulatory requirements, and compliance information. Each specialist is underpinned by thousands of pages of domain documentation, which feeds into the RAG system and is used to train smaller, specialized models with Amazon SageMaker JumpStart. This approach enhances cost-effectiveness and performance to promote high-quality interactions.

In the next sections, we dive deep into how Clearwater Analytics is using Amazon SageMaker JumpStart to fine-tune models for productivity improvement and to deliver new AI services.

Clearwater’s Use of LLMs hosted on Amazon SageMaker JumpStart

Clearwater employs a two-pronged strategy for using LLMs. This approach addresses both high-complexity scenarios requiring powerful language models and domain-specific applications demanding rapid response times.

  1. Advanced foundation models – For tasks involving intricate reasoning or creative output, Clearwater uses state-of-the-art pre-trained models such as Anthropic’s Claude or Meta’s Llama. These models excel in handling complex queries and generating innovative solutions.
  2. Fine-tuned models for specialized knowledge – In cases where domain-specific expertise or swift responses are crucial, Clearwater uses fine-tuned models. These customized LLMs are optimized for industries or tasks that require accuracy and efficiency.
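
As a rough illustration of the two-pronged strategy, the following Python sketch routes a request either to a general foundation model or to a domain specialist. The endpoint names, the Request fields, and the routing rule are all hypothetical and only meant to show the idea; the post does not describe how Clearwater actually routes requests.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical endpoint names, used only for illustration.
FOUNDATION_MODEL = "general-foundation-model"
SPECIALIST_MODELS = {
    "compliance": "compliance-fine-tuned-model",
    "accounting": "accounting-fine-tuned-model",
}


@dataclass(frozen=True)
class Request:
    text: str
    domain: Optional[str] = None          # None when no specialist domain applies
    needs_complex_reasoning: bool = False


def choose_model(request: Request) -> str:
    """Pick a model: foundation model for intricate reasoning, specialist otherwise."""
    if request.needs_complex_reasoning:
        return FOUNDATION_MODEL
    # Fall back to the foundation model when no specialist covers the domain.
    return SPECIALIST_MODELS.get(request.domain, FOUNDATION_MODEL)


if __name__ == "__main__":
    print(choose_model(Request("What does the Compliance module do?", domain="compliance")))
    print(choose_model(Request("Draft a scenario analysis.", needs_complex_reasoning=True)))
```

A real router would likely weigh latency and cost targets as well; the single boolean flag here is a deliberate simplification.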

Fine-tuned models through domain adaptation with Amazon SageMaker JumpStart

Although general LLMs are powerful, their accuracy can be put to the test in specialized domains. This is where domain adaptation, also known as continued pre-training, comes into play. Domain adaptation is a sophisticated form of transfer learning that allows a pre-trained model to be fine-tuned for optimal performance in a different, yet related, target domain. This approach is particularly valuable when there’s a scarcity of labeled data in the target domain but an abundance in a related source domain.

These are some of the key benefits for domain adaptation:

  • Cost-effectiveness – Creating a curated set of questions and answers for instruction fine-tuning can be prohibitively expensive and time-consuming. Domain adaptation eliminates the need for thousands of manually created Q&As.
  • Comprehensive learning – Unlike instruction tuning, which only learns from provided questions, domain adaptation extracts information from entire documents, resulting in a more thorough understanding of the subject matter.
  • Efficient use of expertise – Domain adaptation frees up human experts from the time-consuming task of generating questions so they can focus on their primary responsibilities.
  • Faster deployment – With domain adaptation, specialized AI models can be developed and deployed more quickly, accelerating time to market for AI-powered solutions.
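
To make the contrast with instruction tuning concrete, the following sketch prepares data for domain adaptation from raw documents alone, with no hand-written questions and answers. The function names and the paragraph-based validation split are assumptions made for illustration; check the documentation for your chosen JumpStart model for the exact input format its training script expects.

```python
from typing import List, Tuple


def build_corpus(documents: List[str]) -> str:
    """Join raw documents into one plain-text corpus (no Q&A labels required)."""
    # Collapse internal whitespace and skip empty documents.
    cleaned = [" ".join(doc.split()) for doc in documents if doc.strip()]
    return "\n\n".join(cleaned)


def split_corpus(corpus: str, validation_fraction: float = 0.1) -> Tuple[str, str]:
    """Hold out the trailing paragraphs as a validation set."""
    paragraphs = corpus.split("\n\n")
    if len(paragraphs) > 1:
        n_val = max(1, int(len(paragraphs) * validation_fraction))
    else:
        n_val = 0  # Nothing to hold out from a single paragraph.
    cut = len(paragraphs) - n_val
    return "\n\n".join(paragraphs[:cut]), "\n\n".join(paragraphs[cut:])


if __name__ == "__main__":
    docs = [
        "Book value  is\ncalculated daily.",
        "   ",
        "Period-end reconciliation closes the books.",
    ]
    train_text, validation_text = split_corpus(build_corpus(docs), validation_fraction=0.5)
    print(train_text)
    print(validation_text)
```

The resulting text could then be written to files and uploaded to Amazon S3 as the training and validation channels.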

AWS has been at the forefront of domain adaptation, building a framework for creating powerful, specialized AI models. Using this framework, Clearwater has been able to train smaller, faster models tailored to specific domains without the need for extensive labeled datasets. This innovative approach allows Clearwater to power digital specialists with a finely tuned model trained on a particular domain. The result? More responsive LLMs that form the backbone of their cutting-edge generative AI services.

The evolution of fine-tuning with Amazon SageMaker JumpStart

Clearwater is collaborating with AWS to enhance its fine-tuning processes. Amazon SageMaker JumpStart gave the team a framework for domain adaptation. Over the course of the year, Clearwater has seen significant improvements in the SageMaker JumpStart user interface and in the ease of fine-tuning.

For instance, the code required to set up and fine-tune a GPT-J-6B model has been drastically streamlined. Previously, a data scientist had to write over 100 lines of code within an Amazon SageMaker Notebook to identify and retrieve the proper image, set the right training script, and import the right hyperparameters. Now, using SageMaker JumpStart and advancements in the field, the process has been streamlined to a few lines of code:

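from sagemaker.jumpstart.estimator import JumpStartEstimator

# JumpStart model ID for GPT-J-6B
model_id = "huggingface-textgeneration1-gpt-j-6b"

estimator = JumpStartEstimator(
    model_id=model_id,
    hyperparameters={"epoch": "3", "per_device_train_batch_size": "4"},
)

# Start the training job. training_dataset_s3_path and validation_dataset_s3_path
# are the S3 URIs of your own training and validation data.
estimator.fit(
    {"train": training_dataset_s3_path, "validation": validation_dataset_s3_path}, logs=True
)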

A fine-tuning example: Clearwater’s approach

For Clearwater’s AI, the team successfully fine-tuned a GPT-J-6B model (huggingface-textgeneration1-gpt-j-6b) with domain adaptation using Amazon SageMaker JumpStart. The following are the concrete steps used for the fine-tuning process, which can serve as a blueprint for others implementing similar strategies. A detailed tutorial can be found in this amazon-sagemaker-examples repo.

  1. Document assembly – Gather all relevant documents that will be used for training. This includes help content, manuals, and other domain-specific text. The data Clearwater used for training this model is public help content, which contains no client data. Clearwater uses client data only with the client’s collaboration and approval, and only to fine-tune a model dedicated solely to that client. Curation, cleaning, and de-identification of data are necessary for training and subsequent tuning operations.
  2. Test set creation – Develop a set of questions and answers that will be used to evaluate the model’s performance before and after fine-tuning. Clearwater has implemented a sophisticated model evaluation system for additional assessment of performance for open source and commercial models. This is covered more in the Model evaluation and optimization section later in this post.
  3. Pre-trained model deployment – Deploy the original, pre-trained GPT-J-6B model.
  4. Baseline testing – Use the question set to test the pre-trained model, establishing a performance baseline.
  5. Pre-trained model teardown – Remove the pre-trained model to free up resources.
  6. Data preparation – Upload the assembled documents to an S3 bucket, making sure they’re in a format suitable for the fine-tuning process.
  7. Fine-tuning – Train the new model using the uploaded documents, adjusting hyperparameters as needed.
  8. Fine-tuned model testing – Evaluate the fine-tuned model using the same question set used for the baseline.
  9. Fine-tuned model teardown – If not immediately needed, tear down the fine-tuned model to optimize resource usage.
  10. RAG comparison – Test a RAG-based system using the same question set for an additional point of comparison.
  11. Performance evaluation – Analyze the results from all tests to assess the effectiveness of the fine-tuning process.
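
The testing steps above (baseline, fine-tuned, and RAG, all scored against the same question set) can be sketched as a small harness. This is a minimal illustration, not Clearwater’s internal evaluation framework: the token-overlap F1 metric is a stand-in chosen for simplicity, and the three "systems" are hypothetical stubs where real code would call deployed endpoints.

```python
from collections import Counter
from typing import Callable, Dict, List, Tuple


def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between a model answer and a reference answer."""
    pred = prediction.lower().split()
    ref = reference.lower().split()
    if not pred or not ref:
        return 0.0
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def evaluate(
    systems: Dict[str, Callable[[str], str]],
    test_set: List[Tuple[str, str]],
) -> Dict[str, float]:
    """Average the score of every system over the same (question, reference) pairs."""
    return {
        name: sum(token_f1(answer(q), ref) for q, ref in test_set) / len(test_set)
        for name, answer in systems.items()
    }


if __name__ == "__main__":
    test_set = [("What does the module do?", "it monitors investment policies")]
    # Hypothetical stubs standing in for real model endpoints.
    systems = {
        "pre-trained": lambda q: "it handles hazardous materials",
        "fine-tuned": lambda q: "it monitors compliance policies",
        "rag": lambda q: "it monitors investment policies",
    }
    for name, score in evaluate(systems, test_set).items():
        print(f"{name}: {score:.2f}")
```

Because every system is scored on an identical question set, the averages are directly comparable, which is the point of reusing the baseline questions in steps 8 and 10.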

This is how the results look before tuning an open source model (GPT-J-6B), after tuning, and using a commercial model with a larger context window (Claude-3.5-Sonnet).

Prompt

What does Clearwater Compliance module do?

Reference Answer

The Clearwater Compliance module supports client investment policies, providing daily updates, including compliance warnings and violations. Clearwater’s Compliance module monitors various portfolio characteristics, including asset and security type allocation, sector exposure, investment time horizon, and credit quality. The Compliance module enables users to establish and update investment policies, monitor changes to accounts, and provide audit reporting. Rules provide flexibility to evaluate data from the individual security-level to aggregate values at the portfolio-level. User-identified investments can also be prohibited. Multiple investment policies can be applied to individual accounts and aggregates of accounts.

GPT-J-6B pre-trained

The Clearwater Compliance module is an integrated set of controls you can use to manage your compliance with the federal and state requirements for the handling and transportation of hazardous materials.

GPT-J-6B fine-tuned

The Clearwater Compliance module is a suite of tools designed to help clients monitor and manage their compliance policies and procedures.

  • Compliance Overview dashboard – Provides a high-level view of compliance policies and a history of compliance policy changes.
  • Compliance Settings tool – Users can update compliance settings, such as the frequency at which compliance policies are checked and the level of compliance required to access a compliance-restricted area of the website.
  • Compliance Audit report – Provides a detailed list of policies and rules that weren’t met. The report includes a breakdown of rule violations and the actions required to bring the policies into compliance.
  • Compliance Notifications tool – Sends notifications to designated recipients, such as the client’s compliance officer, when a compliance rule is about to expire or has been violated.
  • Compliance Documents tool – Users can upload and manage compliance policies and rules.

Claude-3.5-Sonnet with RAG

The Clearwater Compliance module is a sophisticated tool designed to support and monitor client investment policies. Here’s what it does:

  • Daily Monitoring – Provides daily updates, including compliance warnings and violations, to foster ongoing adherence to investment policies.
  • Portfolio characteristic tracking – Monitors various aspects of portfolios, including asset and security type allocation, sector exposure, investment time horizon, and credit quality.
  • Policy management – Users can establish, update, and manage investment policies through the module.
  • Flexible rule application – Allows for the evaluation of compliance rules at both individual security and portfolio levels.
  • Custom policy creation – Clearwater works with clients to create compliance rules based on their specific investment policies and interpretations.
  • Multi-account support – Multiple investment policies can be applied to individual accounts or aggregates of accounts.
  • Reporting and auditing – Generates audit reports and provides tools for monitoring account changes.
  • Risk mitigation – Offers integrated, customized daily reports on the Clearwater Reporting Site, to proactively manage risk.
  • Regulatory compliance – Includes external and regulatory policies when provided by the client, providing broader compliance coverage.
  • Violation tracking – Identifies compliance rule violations, tracks the number of days in violation, and allows for resolution statements.

Although the Compliance module is comprehensive, it has some limitations. It only supports objective rules using data available within the Clearwater system and doesn’t cover areas such as pre-trade compliance, investments without publicly available data, or scenario testing.

Overall, the Clearwater Compliance module serves as a robust tool for maintaining investment policy compliance, offering daily monitoring, customization, and reporting features to support effective risk management and regulatory adherence.

Model evaluation and optimization

Clearwater employs a sophisticated evaluation system to assess the performance of new models available on Amazon SageMaker JumpStart. This means that only models demonstrating superior capabilities are integrated into the production environment.

Clearwater’s LLM operations (LLMOps) pipeline plays a crucial role in this process, automating the evaluation and integration of new models. Matching each task with the most effective LLM available is the cornerstone of Clearwater’s approach.

The evaluation phase is crucial for determining the success of the fine-tuning process. As you decide on the evaluation process and framework to use, make sure they fit the criteria for your domain. At Clearwater, we designed our own internal evaluation framework to meet the specific needs of our investment management and accounting domains.

Here are key considerations:

  • Performance comparison – The fine-tuned model should outperform the pre-trained model on domain-specific tasks. If it doesn’t, it might indicate that the pre-trained model already had significant knowledge in this area.
  • RAG benchmark – Compare the fine-tuned model’s performance against a RAG system using a pre-trained model. If the fine-tuned model doesn’t at least match RAG performance, troubleshooting is necessary.
  • Troubleshooting checklist:
    • Data format suitability for fine-tuning
    • Completeness of the training dataset
    • Hyperparameter optimization
    • Potential overfitting or underfitting
    • Cost-benefit analysis – Estimate the operational costs of using a RAG system with a pre-trained model (for example, Claude-3.5 Sonnet) compared with deploying the fine-tuned model at production scale.
  • Advanced considerations:
    • Iterative fine-tuning – Consider multiple rounds of fine-tuning, gradually introducing more specific or complex data.
    • Multi-task learning – If applicable, fine-tune the model on multiple related domains simultaneously to improve its versatility.
    • Continual learning – Implement strategies to update the model with new information over time without full retraining.
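
The cost-benefit comparison in the checklist can be approximated with simple arithmetic. The sketch below assumes a RAG system billed per token and a fine-tuned model billed per endpoint instance hour; the function names are hypothetical, and every price and volume in the example is an illustrative placeholder, not actual AWS or model-provider pricing.

```python
def rag_cost_per_request(
    input_tokens: int,
    output_tokens: int,
    price_per_1k_input: float,
    price_per_1k_output: float,
) -> float:
    """Token-based cost of one RAG request (the prompt includes retrieved context)."""
    return (input_tokens / 1000) * price_per_1k_input + (output_tokens / 1000) * price_per_1k_output


def monthly_endpoint_cost(
    instance_hourly_price: float,
    instance_count: int,
    hours_per_month: float = 730.0,
) -> float:
    """Cost of keeping a fine-tuned model endpoint running for a month."""
    return instance_hourly_price * instance_count * hours_per_month


def break_even_requests(per_request_cost: float, endpoint_monthly_cost: float) -> float:
    """Monthly request volume above which the dedicated endpoint is cheaper."""
    return endpoint_monthly_cost / per_request_cost


if __name__ == "__main__":
    # Placeholder numbers for illustration only.
    per_request = rag_cost_per_request(2000, 500, price_per_1k_input=0.003, price_per_1k_output=0.015)
    endpoint = monthly_endpoint_cost(instance_hourly_price=1.0, instance_count=2)
    print(f"RAG cost per request: ${per_request:.4f}")
    print(f"Endpoint cost per month: ${endpoint:.2f}")
    print(f"Break-even volume: {break_even_requests(per_request, endpoint):,.0f} requests/month")
```

This ignores retrieval infrastructure, training cost, and answer quality, all of which belong in a complete analysis.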

Conclusion

For businesses and organizations seeking to harness the power of AI in specialized domains, domain adaptation presents significant opportunities. Whether you’re in healthcare, finance, legal services, or any other specialized field, adapting LLMs to your specific needs can provide a significant competitive advantage.

By following this comprehensive approach with Amazon SageMaker, organizations can effectively adapt LLMs to their specific domains, achieving better performance and potentially more cost-effective solutions than generic models with RAG systems. However, the process requires careful monitoring, evaluation, and optimization to achieve the best results.

As we’ve observed with Clearwater’s success, partnering with an experienced AI company such as AWS can help navigate the complexities of domain adaptation and unlock its full potential. By embracing this technology, you can create AI solutions that are not just powerful, but also truly tailored to your unique requirements and expertise.

The future of AI isn’t just about bigger models, but smarter, more specialized ones. Domain adaptation is paving the way for this future, and those who harness its power will emerge as leaders in their respective industries.

Get started with Amazon SageMaker JumpStart on your fine-tuning LLM journey today.


About the Authors

Darrel Cherry is a Distinguished Engineer with over 25 years of experience leading organizations to create solutions for complex business problems. With a passion for emerging technologies, he has architected large cloud and data processing solutions, including machine learning and deep learning AI applications. Darrel holds 19 US patents and has contributed to various industry publications. In his current role at Clearwater Analytics, Darrel leads technology strategy for AI solutions, as well as Clearwater’s overall enterprise architecture. Outside the professional sphere, he enjoys traveling, auto racing, and motorcycling, while also spending quality time with his family.

Dan Siddall, a Staff Data Scientist at Clearwater Analytics, is a seasoned expert in generative AI and machine learning, with a comprehensive understanding of the entire ML lifecycle from development to production deployment. Recognized for his innovative problem-solving skills and ability to lead cross-functional teams, Dan leverages his extensive software engineering background and strong communication abilities to bridge the gap between complex AI concepts and practical business solutions.

Rany ElHousieny is an Engineering Leader at Clearwater Analytics with over 30 years of experience in software development, machine learning, and artificial intelligence. He has held leadership roles at Microsoft for two decades, where he led the NLP team at Microsoft Research and Azure AI, contributing to advancements in AI technologies. At Clearwater, Rany continues to leverage his extensive background to drive innovation in AI, helping teams solve complex challenges while maintaining a collaborative approach to leadership and problem-solving.

Pablo Redondo is a Principal Solutions Architect at Amazon Web Services. He is a data enthusiast with over 18 years of FinTech and healthcare industry experience and is a member of the AWS Analytics Technical Field Community (TFC). Pablo has been leading the AWS Gain Insights Program to help AWS customers achieve better insights and tangible business value from their data analytics and AI/ML initiatives. In his spare time, Pablo enjoys quality time with his family and plays pickleball in his hometown of Petaluma, CA.

Prashanth Ganapathy is a Senior Solutions Architect in the Small Medium Business (SMB) segment at AWS. He enjoys learning about AWS AI/ML services and helping customers meet their business outcomes by building solutions for them. Outside of work, Prashanth enjoys photography, travel, and trying out different cuisines.
