Theoretical analysis provides insight into the optimization process during model training and reveals that for some optimizations, the Gaussian attention kernel may work better than softmax.
How Fastweb fine-tuned the Mistral model using Amazon SageMaker HyperPod as a first step to build an Italian large language model
This post is co-written with Marta Cavalleri and Giovanni Germani from Fastweb, and Claudia Sacco and Andrea Policarpi from BIP xTech.
AI’s transformative impact extends throughout the modern business landscape, with telecommunications emerging as a key area of innovation. Fastweb, one of Italy’s leading telecommunications operators, recognized the immense potential of AI technologies early on and began investing in this area in 2019. With a vision to build a large language model (LLM) trained on Italian data, Fastweb embarked on a journey to make this powerful AI capability available to third parties.
Training an LLM is a compute-intensive and complex process, which is why Fastweb, as a first step in their AI journey, used AWS generative AI and machine learning (ML) services such as Amazon SageMaker HyperPod.
SageMaker HyperPod can provision and maintain large-scale, resilient compute clusters powered by thousands of accelerators such as AWS Trainium and NVIDIA H200 and H100 graphics processing units (GPUs). Its flexibility also allowed Fastweb to deploy a small, agile, on-demand cluster, which enabled efficient resource utilization and cost management and aligned well with the project’s requirements.
In this post, we explore how Fastweb used cutting-edge AI and ML services to embark on their LLM journey, overcoming challenges and unlocking new opportunities along the way.
Fine-tuning Mistral 7B on AWS
Fastweb recognized the importance of developing language models tailored to the Italian language and culture. To achieve this, the team built an extensive Italian language dataset by combining public sources and acquiring licensed data from publishers and media companies. Using this data, Fastweb, in their first experiment with LLM training, fine-tuned the Mistral 7B model, a state-of-the-art LLM. The team successfully adapted it to handle tasks such as summarization, question answering, and creative writing in the Italian language. The fine-tuned model applies a nuanced understanding of Italian culture to its responses and provides contextually appropriate and culturally sensitive output.
The team opted for fine-tuning on AWS. This strategic decision was driven by several factors:
- Efficient data preparation – Building a high-quality pre-training dataset is a complex task, involving assembling and preprocessing text data from various sources, including web sources and partner companies. Because the final, comprehensive pre-training dataset was still under construction, it was essential to begin with an approach that could adapt existing models to Italian.
- Early results and insights – Fine-tuning allowed the team to achieve early results in training models on the Italian language, providing valuable insights and preliminary Italian language models. This enabled the engineers to iteratively improve the approach based on initial outcomes.
- Computational efficiency – Fine-tuning requires significantly less computational power and less time to complete compared to a complete model pre-training. This approach streamlined the development process and allowed for a higher volume of experiments within a shorter time frame on AWS.
To facilitate the process, the team created a comprehensive dataset encompassing a wide range of tasks, constructed by translating existing English datasets and generating synthetic elements. The dataset was stored in an Amazon Simple Storage Service (Amazon S3) bucket, which served as a centralized data repository. During the training process, our SageMaker HyperPod cluster was connected to this S3 bucket, enabling effortless retrieval of the dataset elements as needed.
The integration of Amazon S3 and the SageMaker HyperPod cluster exemplifies the power of the AWS ecosystem, where various services work together seamlessly to support complex workflows.
Overcoming data scarcity with translation and synthetic data generation
When fine-tuning a custom version of the Mistral 7B LLM for the Italian language, Fastweb faced a major obstacle: high-quality Italian datasets were extremely limited or unavailable. To tackle this data scarcity challenge, Fastweb had to build a comprehensive training dataset from scratch to enable effective model fine-tuning.
While establishing strategic agreements to acquire licensed data from publishers and media companies, Fastweb employed two main strategies to create a diverse and well-rounded dataset: translating open source English training data into Italian and generating synthetic Italian data using AI models.
To use the wealth of information available in English, Fastweb translated open source English training datasets into Italian. This approach made valuable data accessible and relevant for Italian language training. Both LLMs and open source translation tools were used for this process.
The open source Argos Translate tool was used for bulk translation of datasets with simpler content. Although LLMs offer superior translation quality, Argos Translate is free, extremely fast, and well-suited for efficiently handling large volumes of straightforward data. For complex datasets where accuracy was critical, LLMs were employed to provide high-quality translations.
To further enrich the dataset, Fastweb generated synthetic Italian data using LLMs. This involved creating a variety of text samples covering a wide range of topics and tasks relevant to the Italian language. High-quality Italian web articles, books, and other texts served as the basis for training the LLMs to generate authentic-sounding synthetic content that captured the nuances of the language.
The resulting sub-datasets spanned diverse subjects, including medical information, question-answer pairs, conversations, web articles, science topics, and more. The tasks covered were also highly varied, encompassing question answering, summarization, creative writing, and others.
Each subset generated through translation or synthetic data creation underwent meticulous filtering to maintain quality and diversity. A similarity check was performed to deduplicate the data; if two elements were found to be too similar, one was removed. This step was crucial in maintaining variability and preventing bias from repetitive or overly similar content.
The deduplication process involved embedding dataset elements using a text embedder, then computing cosine similarity between the embeddings to identify similar elements. Meta’s FAISS library, renowned for its efficiency in similarity search and clustering of dense vectors, was used as the underlying vector database due to its ability to handle large-scale datasets effectively.
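The similarity check described above can be sketched as follows. This is a minimal illustration, not Fastweb's pipeline: it uses plain NumPy instead of FAISS so that it runs without extra dependencies, the embeddings are toy vectors standing in for the output of a text embedder, and the function name `deduplicate` and the 0.95 threshold are assumptions chosen for the example.

```python
import numpy as np


def deduplicate(embeddings: np.ndarray, threshold: float = 0.95) -> list[int]:
    """Return the indices of elements to keep, dropping near-duplicates.

    An element is dropped when its cosine similarity to an already kept
    element is at or above `threshold` (a hypothetical value).
    """
    # L2-normalize rows so that a dot product equals cosine similarity.
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.clip(norms, 1e-12, None)

    kept: list[int] = []
    for i in range(len(unit)):
        if kept:
            # Cosine similarity between element i and every kept element.
            sims = unit[kept] @ unit[i]
            if float(sims.max()) >= threshold:
                continue  # too similar to something already kept
        kept.append(i)
    return kept


# Toy embeddings: rows 0 and 1 point in almost the same direction.
toy_embeddings = np.array([[1.0, 0.0], [0.999, 0.01], [0.0, 1.0]])
kept_indices = deduplicate(toy_embeddings, threshold=0.95)
```

This greedy loop compares every element against all kept elements, which is quadratic in the dataset size. At the scale of hundreds of thousands of elements, a vector index such as FAISS is the more practical way to run the same nearest-neighbor lookup.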
After filtering and deduplication, the remaining subsets were postprocessed and combined to form the final fine-tuning dataset, comprising 300,000 training elements. This comprehensive dataset enabled Fastweb to effectively fine-tune their custom version of the Mistral 7B model, achieving high performance and diversity across a wide range of tasks and topics.
All data generation and processing steps were run in parallel directly on the SageMaker HyperPod cluster nodes, within a single working environment. This highlights the cluster’s versatility for tasks beyond model training.
The following diagram illustrates two distinct data pipelines for creating the final dataset: the upper pipeline uses translations of existing English datasets into Italian, and the lower pipeline employs custom generated synthetic data.
The computational cost of training an LLM
The computational cost of training LLMs scales approximately with the number of parameters and the amount of training data. As a general rule, for each model parameter being trained, approximately 24 bytes of memory are required. This means that to fully fine-tune a 7 billion parameter model like Mistral 7B, at least 156 GB of hardware memory is necessary, not including the additional overhead of loading training data.
The following table provides additional examples.
LLM Model Size vs. Training Memory | |
Number of Parameters | Memory Requirement |
500 million | 12 GB |
1 billion | 23 GB |
2 billion | 45 GB |
3 billion | 67 GB |
5 billion | 112 GB |
7 billion | 156 GB |
10 billion | 224 GB |
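The rule of thumb above is simple enough to check with a few lines of arithmetic. The sketch below assumes the 24 bytes per parameter quoted in the text; the function names are illustrative. Note that 7 billion parameters at 24 bytes each is 168 × 10^9 bytes, which is about 156 when expressed in binary gigabytes (GiB). The table figures therefore appear to be in GiB, although that reading is an inference from the arithmetic rather than something the source states, and the two smallest rows differ from this calculation by about one unit.

```python
BYTES_PER_PARAM = 24  # rule of thumb quoted in the text


def training_memory_bytes(num_params: int) -> int:
    """Estimated memory needed to fully fine-tune a model, in bytes."""
    return num_params * BYTES_PER_PARAM


def to_gib(num_bytes: int) -> float:
    """Convert bytes to binary gigabytes (1 GiB = 2**30 bytes)."""
    return num_bytes / 2**30


def to_gb(num_bytes: int) -> float:
    """Convert bytes to decimal gigabytes (1 GB = 10**9 bytes)."""
    return num_bytes / 1e9


mistral_7b_bytes = training_memory_bytes(7_000_000_000)
mistral_7b_gib = to_gib(mistral_7b_bytes)  # about 156.5
mistral_7b_gb = to_gb(mistral_7b_bytes)    # 168.0
```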
Parameter-efficient fine-tuning (PEFT) methods minimize the number of trainable parameters, whereas quantization reduces the number of bits per parameter, often with minimal negative impact on the final training results.
Despite these memory-saving techniques, fine-tuning large models still demands substantial GPU memory and extended training times. This makes distributed training essential, allowing the workload to be shared across multiple GPUs, thereby enabling the efficient handling of such large-scale computational tasks.
The following table and figure illustrate the allocation of GPU memory during each phase of LLM training.
Solution overview
Training LLMs often requires significant computational resources that can exceed the capabilities of a single GPU. Distributed training is a powerful technique that addresses this challenge by distributing the workload across multiple GPUs and nodes, enabling parallel processing and reducing training time. SageMaker HyperPod simplifies the process of setting up and running distributed training jobs, providing preconfigured environments and libraries specifically designed for this purpose.
There are two main techniques for distributed training: data parallelization and model parallelization. Data parallelization involves distributing the training data across multiple GPUs, whereas model parallelization splits the model itself across different GPUs.
To take advantage of distributed training, a cluster of interconnected GPUs, often spread across multiple physical nodes, is required. SageMaker HyperPod allows for both data and model parallelization techniques to be employed simultaneously, maximizing the available computational resources. Also, SageMaker HyperPod provides resilience through features like automatic fault detection and recovery, which are crucial for long-running training jobs. SageMaker HyperPod allows for the creation of personalized Conda environments, enabling the installation of necessary libraries and tools for distributed training.
One popular library for implementing distributed training is DeepSpeed, a Python optimization library that handles distributed training and makes it memory-efficient and fast by enabling both data and model parallelization. The choice to use DeepSpeed was driven by the availability of an extensive, already-developed code base, ready to be employed for training experiments. The high flexibility and environment customization capabilities of SageMaker HyperPod made it possible to create a personalized Conda environment with all the necessary libraries installed, including DeepSpeed.
The following diagram illustrates the two key parallelization strategies offered by DeepSpeed: data parallelism and model parallelism. Data parallelism involves replicating the entire model across multiple devices, with each device processing a distinct batch of training data. In contrast, model parallelism distributes different parts of a single model across multiple devices, enabling the training of large models that exceed the memory capacity of a single device.
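To make the distinction concrete, here is a toy NumPy sketch of both ideas on a linear model. It simulates devices with array slices and is not how DeepSpeed implements either strategy; the model, the shard counts, and all names are assumptions made for illustration.

```python
import numpy as np


def mse_gradient(w: np.ndarray, X: np.ndarray, y: np.ndarray) -> np.ndarray:
    """Gradient of 0.5 * mean((X @ w - y) ** 2) with respect to w."""
    return X.T @ (X @ w - y) / len(y)


rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))  # 8 training examples, 3 features
y = rng.normal(size=8)
w = rng.normal(size=3)       # the "model": a single weight vector

# Reference: gradient computed on the full batch by one device.
full_grad = mse_gradient(w, X, y)

# Data parallelism: 4 simulated devices each hold a full copy of w,
# process an equal-sized shard of the batch, and average their gradients.
shard_grads = [
    mse_gradient(w, X_shard, y_shard)
    for X_shard, y_shard in zip(np.split(X, 4), np.split(y, 4))
]
data_parallel_grad = np.mean(shard_grads, axis=0)

# Model parallelism: the weights are split across 2 simulated devices.
# Each device computes a partial output from its slice of the weights,
# and the partial outputs are summed to form the prediction.
partial_a = X[:, :2] @ w[:2]
partial_b = X[:, 2:] @ w[2:]
model_parallel_pred = partial_a + partial_b
```

With equal-sized shards, the averaged data-parallel gradient matches the full-batch gradient, and the summed partial outputs match the single-device prediction `X @ w`. The trade-off is that data parallelism needs every device to hold the whole model, whereas model parallelism needs communication between devices inside each forward and backward pass.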
To help meet the demanding computational requirements of training LLMs, we used the power and flexibility of SageMaker HyperPod clusters, orchestrated with Slurm. While HyperPod also supports orchestration with Amazon EKS, our research team had prior expertise with Slurm. The cluster configuration was tailored to our specific training needs, providing optimal resource utilization and cost-effectiveness.
The SageMaker HyperPod cluster architecture consisted of a controller machine to orchestrate the training job’s coordination and resource allocation. The training tasks were run by two compute nodes, which were g5.12xlarge instances equipped with high-performance GPUs. These compute nodes handled the bulk of the computational workload, using their GPUs to accelerate the training process.
The AWS managed high-performance Lustre file system (Amazon FSx for Lustre) mounted on the nodes provided high-speed data access and transfer rates, which are essential for efficient training operations.
SageMaker HyperPod is used to launch large clusters with thousands of GPUs for pre-training LLMs, but one of its key advantages is its flexibility: it also allows for the creation of small, agile, on-demand clusters. This versatility made it possible to use resources only when needed, avoiding unnecessary costs.
For the DeepSpeed configuration, we followed the standard recommended setup, enabling data and model parallelism across the two g5.12xlarge nodes of the cluster, for a total of 8 GPUs.
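The source does not list the configuration values, so the following is only a sketch of what a DeepSpeed-style configuration for this cluster shape could look like. The node and GPU counts come from the text; the micro-batch size, the gradient accumulation steps, the choice of bf16, and the choice of ZeRO stage 3 (which shards parameters, gradients, and optimizer states across workers) are all assumptions, not the settings Fastweb used.

```python
# Cluster shape taken from the text.
NUM_NODES = 2          # two g5.12xlarge nodes
GPUS_PER_NODE = 4      # 2 nodes x 4 GPUs = 8 GPUs in total

# Illustrative training hyperparameters (assumptions).
MICRO_BATCH_PER_GPU = 2
GRAD_ACCUM_STEPS = 8

world_size = NUM_NODES * GPUS_PER_NODE

# DeepSpeed expects train_batch_size to equal
# micro batch per GPU * gradient accumulation steps * number of GPUs.
ds_config = {
    "train_micro_batch_size_per_gpu": MICRO_BATCH_PER_GPU,
    "gradient_accumulation_steps": GRAD_ACCUM_STEPS,
    "train_batch_size": MICRO_BATCH_PER_GPU * GRAD_ACCUM_STEPS * world_size,
    "bf16": {"enabled": True},          # assumption: mixed precision with bf16
    "zero_optimization": {"stage": 3},  # assumption: no CPU offload configured
}
```

A dictionary like this is typically serialized to JSON and passed to the training script. Keeping the batch-size fields derived from the cluster shape avoids a mismatch if the number of nodes changes.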
Although more advanced techniques were available, such as offloading some computation to the CPU during training, our cluster was sized with a sufficiently high GPU memory margin. With 192 GiB (206 GB) of overall GPU memory available, even accounting for the additional GPU memory needed to keep dataset batches in memory during training, we had ample resources to train a 7B parameter model without these advanced techniques. The following figure describes the infrastructure setup of our training solution.
Training results and output examples
After completing the training process, Fastweb’s fine-tuned language model demonstrated a significant performance improvement on Italian language tasks compared to the base model. Evaluated on an internal benchmark dataset, the fine-tuned model achieved an average accuracy increase of 20% across a range of tasks designed to assess its general understanding of the Italian language.
The benchmark tasks focused on three key areas: question answering, common sense reasoning, and next word prediction. Question answering tasks tested the model’s ability to comprehend and provide accurate responses to queries in Italian. Common sense reasoning evaluated the model’s grasp of common sense knowledge and its capacity to make logical inferences based on real-world scenarios. Next word prediction assessed the model’s understanding of language patterns and its ability to predict the most likely word to follow in a given context.
To evaluate the fine-tuned model’s performance, we initiated our interaction by inquiring about its capabilities. The model responded by enumerating its primary functions, emphasizing its ability to address Fastweb-specific topics. The response was formulated in correct Italian with a very natural syntax, as illustrated in the following example.
Afterwards, we asked the model to generate five titles for a presentation on the topic of AI.
Just for fun, we asked what the most famous sandwich is. The model responded with a combination of typical Italian ingredients and added that there is a wide variety of choices.
Lastly, we asked the model to provide us with a useful link to understand the recent EU AI Act. The model provided a working link, along with a helpful description.
Conclusion
Using SageMaker HyperPod, Fastweb successfully fine-tuned the Mistral 7B model as a first step in their generative AI journey, significantly improving its performance on tasks involving the Italian language.
Looking ahead, Fastweb also plans to deploy their next models on Amazon Bedrock using the Custom Model Import feature. This strategic move will enable Fastweb to quickly build and scale new generative AI solutions for their customers, using the broad set of capabilities available on Amazon Bedrock.
By harnessing Amazon Bedrock, Fastweb can further enhance their offerings and drive digital transformation for their customers. This initiative aligns with Fastweb’s commitment to staying at the forefront of AI technology and fostering innovation across various industries.
With their fine-tuned language model running on Amazon Bedrock, Fastweb will be well-positioned to deliver cutting-edge generative AI solutions tailored to the unique needs of their customers. This will empower businesses to unlock new opportunities, streamline processes, and gain valuable insights, ultimately driving growth and competitiveness in the digital age.
Fastweb’s decision to use the Custom Model Import feature in Amazon Bedrock underscores the company’s forward-thinking approach and their dedication to providing their customers with the latest and most advanced AI technologies. This collaboration with AWS further solidifies Fastweb’s position as a leader in digital transformation and a driving force behind the adoption of innovative AI solutions across industries.
To learn more about SageMaker HyperPod, refer to Amazon SageMaker HyperPod and the Amazon SageMaker HyperPod workshop.
About the authors
Marta Cavalleri is the Manager of the Artificial Intelligence Center of Excellence (CoE) at Fastweb, where she leads teams of data scientists and engineers in implementing enterprise AI solutions. She specializes in AI operations, data governance, and cloud architecture on AWS.
Giovanni Germani is the Manager of Architecture & Artificial Intelligence CoE at Fastweb, where he leverages his extensive experience in Enterprise Architecture and digital transformation. With over 12 years in Management Consulting, Giovanni specializes in technology-driven projects across telecommunications, media, and insurance industries. He brings deep expertise in IT strategy, cybersecurity, and artificial intelligence to drive complex transformation programs.
Claudia Sacco is an AWS Professional Solutions Architect at BIP xTech, collaborating with Fastweb’s AI CoE and specialized in architecting advanced cloud and data platforms that drive innovation and operational excellence. With a sharp focus on delivering scalable, secure, and future-ready solutions, she collaborates with organizations to unlock the full potential of cloud technologies. Beyond her professional expertise, Claudia finds inspiration in the outdoors, embracing challenges through climbing and trekking adventures with her family.
Andrea Policarpi is a Data Scientist at BIP xTech, collaborating with Fastweb’s AI CoE. With a strong foundation in computer vision and natural language processing, he is currently exploring the world of Generative AI and leveraging its powerful tools to craft innovative solutions for emerging challenges. In his free time, Andrea is an avid reader and enjoys playing the piano to relax.
Giuseppe Angelo Porcelli is a Principal Machine Learning Specialist Solutions Architect for Amazon Web Services. With several years of software engineering and an ML background, he works with customers of any size to understand their business and technical needs and design AI and ML solutions that make the best use of the AWS Cloud and the Amazon Machine Learning stack. He has worked on projects in different domains, including MLOps, computer vision, and NLP, involving a broad set of AWS services. In his free time, Giuseppe enjoys playing football.
Adolfo Pica has a strong background in cloud computing, with over 20 years of experience in designing, implementing, and optimizing complex IT systems and architectures and with a keen interest and hands-on experience in the rapidly evolving field of generative AI and foundation models. He has expertise in AWS cloud services, DevOps practices, security, data analytics and generative AI. In his free time, Adolfo enjoys following his two sons in their sporting adventures in taekwondo and football.
Maurizio Pinto is a Senior Solutions Architect at AWS, specialized in cloud solutions for telecommunications. With extensive experience in software architecture and AWS services, he helps organizations navigate their cloud journey while pursuing his passion for AI’s transformative impact on technology and society.
Using natural language in Amazon Q Business: From searching and creating ServiceNow incidents and knowledge articles to generating insights
Many enterprise customers across various industries are looking to adopt Generative AI to drive innovation, user productivity, and enhance customer experience. Generative AI–powered assistants such as Amazon Q Business can be configured to answer questions, provide summaries, generate content, and securely complete tasks based on data and information in your enterprise systems. Amazon Q Business understands natural language and allows users to receive immediate, permissions-aware responses from enterprise data sources with citations. This capability supports various use cases such as IT, HR, and help desk.
With custom plugins for Amazon Q Business, you can enhance the application environment to enable your users to use natural language to perform specific tasks related to third-party applications — such as Jira, Salesforce, and ServiceNow — directly from within their web experience chat.
Enterprises that have adopted ServiceNow can improve their operations and boost user productivity by using Amazon Q Business for various use cases, including incident and knowledge management. Users can search ServiceNow knowledge base (KB) articles and incidents in addition to being able to create, manage, and track incidents and KB articles, all from within their web experience chat.
In this post, we’ll demonstrate how to configure an Amazon Q Business application and add a custom plugin that gives users the ability to use a natural language interface provided by Amazon Q Business to query real-time data and take actions in ServiceNow. By the end of this hands-on session, you should be able to:
- Create an Amazon Q Business application and integrate it with ServiceNow using a custom plugin.
- Use natural language in your Amazon Q web experience chat to perform read and write actions in ServiceNow such as querying and creating incidents and KB articles in a secure and governed fashion.
Prerequisites
Before proceeding, make sure that you have the necessary AWS account permissions and services enabled, along with access to a ServiceNow environment with the required privileges for configuration.
AWS
- Have an AWS account with administrative access. For more information, see Setting up for Amazon Q Business. For a complete list of AWS Identity and Access Management (IAM) roles for Amazon Q Business, see IAM roles for Amazon Q Business. Although we’re using admin privileges for the purpose of this post, it’s a security best practice to apply least privilege permissions and grant only the permissions required to perform a task.
- Subscribe to the Amazon Q Business Pro plan, which includes access to custom plugins that enable users to execute actions in third-party applications. For information on what is included in each user subscription tier, see the Amazon Q Business pricing document.
ServiceNow
- Obtain a ServiceNow Personal Developer Instance or use a clean ServiceNow developer environment. You will need an account that has admin privileges to perform the configuration steps in ServiceNow.
Solution overview
The following architecture diagram illustrates the workflow for Amazon Q Business web experience with enhanced capabilities to integrate it seamlessly with ServiceNow.
The implementation includes the following steps:
- The solution begins with configuring Amazon Q Business using the AWS Management Console. This includes setting up the application environment, adding users to AWS IAM Identity Center, selecting the appropriate subscription tier, and configuring the web experience for users to interact with. The environment can optionally be configured to provide real-time data retrieval using a native retriever, which pulls information from indexed data sources, such as Amazon Simple Storage Service (Amazon S3), during interactions.
- The next step involves adjusting the global controls and response settings for the application environment guardrails to allow Amazon Q Business to use its large language model (LLM) knowledge to generate responses when it cannot find responses from your connected data sources.
- Integration with ServiceNow is achieved by setting up an OAuth Inbound application endpoint in ServiceNow, which authenticates and authorizes interactions between Amazon Q Business and ServiceNow. This involves creating an OAuth API endpoint in ServiceNow and using the web experience URL from Amazon Q Business as the callback URL. The setup makes sure that Amazon Q Business can securely perform actions in ServiceNow with the same scoped permissions as the user signing in to ServiceNow.
- The final step of the solution involves enhancing the application environment with a custom plugin for ServiceNow using APIs defined in an OpenAPI schema. The plugin allows Amazon Q Business to securely interact with ServiceNow’s REST APIs, enabling operations such as querying, creating, and updating records dynamically and in real time.
Configuring the Amazon Q Business application
To create an Amazon Q Business application, sign in to the Amazon Q Business console.
As a prerequisite to creating an Amazon Q Business application, follow the instructions in the Configuring an IAM Identity Center instance section. Amazon Q Business integrates with IAM Identity Center to enable managing user access to your Amazon Q Business application. This is the recommended method for managing human access to AWS resources and the method used in this post.
Amazon Q Business also supports identity federation through IAM. When you use identity federation, you can manage users with your enterprise identity provider (IdP) and use IAM to authenticate users when they sign in to Amazon Q Business.
Create and configure the Amazon Q Business application:
- In the Amazon Q Business console, choose Applications from the navigation pane and then choose Create application.
- Enter the following information for your Amazon Q Business application:
- Application name: Enter a name for quick identification, such as my-demo-application.
- Service access: Select Create and use a new service-linked role (SLR). A service-linked role is a unique type of IAM role that is linked directly to Amazon Q Business. Service-linked roles are predefined by Amazon Q Business and include the permissions that the service requires to call other AWS services on your behalf.
- Choose Create.
- After creating your Amazon Q Business application environment, create and select the retriever and provision the index that will power your generative AI web experience. The retriever pulls data from the index in real time during a conversation. On the Select Retriever page:
- Retrievers: Select Use native retriever.
- Index provisioning: Select Starter, which is ideal for proof-of-concept or developer workloads. See Index types for more information.
- Number of units: Enter 1. This indicates the capacity units that you want to provision for your index. Each unit is 20,000 documents.
- Choose Next.
- After you select a retriever for your Amazon Q Business application environment, you can optionally connect other data sources to it. Because a data source isn’t required for this session, we won’t configure one. For more information on connecting data sources to an Amazon Q Business application, see connecting data sources.
- Choose Next.
- As an account admin, you can add users to your IAM Identity Center instance from the Amazon Q Business console. After you add users or groups to an application environment, you can then choose the Amazon Q Business tier for each user or group. On the Add groups and users page:
- Choose Add groups and users.
- In the Add new users dialog box that opens, enter the details of the user. The details you must enter for a single user include: Username, First name, Last name, email address, Confirm email address, and Display name.
- Choose Next and then Add. The user is automatically added to an IAM Identity Center directory and an email invitation to join Identity Center is sent to the email address provided.
- After adding a user or group, choose the Amazon Q Business subscription tier for each user or group. From the Current subscription dropdown menu, select Q Business Pro.
- For the Web experience service access, select Create and use a new service role.
- Choose Create application.
Upon successful completion, Amazon Q Business returns a web experience URL that you can share with the users you added to your application environment. The web experience URL (in this case: https://xxxxxxxx.chat.qbusiness.us-east-1.on.aws/) will be used when creating an OAuth application endpoint in ServiceNow. Note that your web experience URL will be different from the one shown here.
Enhancing an Amazon Q Business application with guardrails
By default, an Amazon Q Business application is configured to respond to user chat queries using only enterprise data. Because we didn’t configure a data source for the purpose of this post, you will use Admin controls and guardrails to allow Amazon Q to use its LLM world knowledge to generate responses when it cannot find responses from your connected data sources.
Configure the application guardrails:
- From the Amazon Q Business console, choose Applications in the navigation pane. Select the name of your application from the list of applications.
- From the navigation pane, choose Enhancements, and then choose Admin Controls and guardrails.
- In Global Controls, choose Edit.
- In Response settings under Application guardrails, select Allow Amazon Q to fall back to LLM knowledge.
Configuring ServiceNow
To allow Amazon Q Business to connect to your ServiceNow instance, you need to create an OAuth inbound application endpoint. OAuth-based authentication validates the identity of the client that attempts to establish a trust on the system by using an authentication protocol. For more information, see OAuth Inbound and Outbound authentication.
Create an OAuth application endpoint for external client applications to access the ServiceNow instance:
- In the ServiceNow console, navigate to All, then System OAuth, then Application Registry and then choose New. On the interceptor page, select Create an OAuth API endpoint for external clients and then fill in the form with details for Name and Redirect URL. The other fields are automatically generated by the ServiceNow OAuth server.
- The Redirect URL is the callback URL that the authorization server redirects to. Enter the web experience URL of your Amazon Q Business application environment (which is the client requesting access to the resource), appended by oauth/callback.
- For this example, the URL is: https://xxxxxxxx.chat.qbusiness.us-east-1.on.aws/oauth/callback
- For Auth Scope, set the value to useraccount. The scope API response parameter defines the amount of access granted by the access token, which means that the access token has the same rights as the user account that authorized the token. For example, if Abel Tuter authorizes an application by providing login credentials, then the resulting access token grants the token bearer the same access privileges as Abel Tuter.
- Choose Submit.
This creates an OAuth client application record and generates a client ID and client secret, which Amazon Q Business needs to access the restricted resources on the instance. You will need this authentication information (client ID and client secret) in the following custom plugin configuration process.
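Because the Redirect URL must match the web experience URL followed by oauth/callback, it can help to build it programmatically rather than by hand. The following is a minimal sketch; the helper name `build_redirect_url` is hypothetical, the placeholder host is the one used in this post, and the HTTPS check is a sanity check added for the example rather than a documented requirement.

```python
from urllib.parse import urlsplit


def build_redirect_url(web_experience_url: str) -> str:
    """Append the oauth/callback path to an Amazon Q Business web experience URL."""
    # Drop any trailing slash so the result never contains a double slash.
    base = web_experience_url.rstrip("/")
    if urlsplit(base).scheme != "https":
        raise ValueError("expected an https web experience URL")
    return f"{base}/oauth/callback"


# Placeholder URL from this post; yours will differ.
redirect_url = build_redirect_url("https://xxxxxxxx.chat.qbusiness.us-east-1.on.aws/")
```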
Enhancing the Amazon Q Business application environment with custom plugins for ServiceNow
To integrate with external applications, Amazon Q Business uses APIs, which are configured as part of the custom plugins.
Before creating a custom plugin, you need to create or edit an OpenAPI schema, outlining the different API operations that you want to enable for your custom plugin. Amazon Q Business uses the configured third-party OpenAPI specifications to dynamically determine which API operations to perform to fulfill a user request. Therefore, the OpenAPI schema definition has a big impact on API selection accuracy and might require design optimizations. In order to maximize accuracy and improve efficiency with an Amazon Q Business custom plugin, follow the best practices for configuring OpenAPI schema definitions.
To configure a custom plugin, you must define at least one and a maximum of eight API operations that can be invoked. To define the API operations, create an OpenAPI schema in JSON or YAML format. You can create OpenAPI schema files and upload them to Amazon S3. Alternatively, you can use the OpenAPI text editor in the console, which will validate your schema.
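The one-to-eight operation limit can be checked before a schema is uploaded. The following is a minimal sketch, assuming the schema has already been parsed into a Python dictionary; count_operations and check_operation_limit are hypothetical helper names used for illustration, not part of any AWS SDK.

```python
# Hypothetical helpers for illustration; not part of any AWS SDK.
HTTP_METHODS = {"get", "put", "post", "delete", "options", "head", "patch", "trace"}


def count_operations(schema: dict) -> int:
    """Count the API operations (path and HTTP method pairs) in an OpenAPI schema."""
    total = 0
    for path_item in schema.get("paths", {}).values():
        # A path item can also hold non-operation keys such as "parameters".
        total += sum(1 for key in path_item if key.lower() in HTTP_METHODS)
    return total


def check_operation_limit(schema: dict, minimum: int = 1, maximum: int = 8) -> None:
    """Raise ValueError if the operation count falls outside the stated limits."""
    count = count_operations(schema)
    if not minimum <= count <= maximum:
        raise ValueError(
            f"Schema defines {count} operations; expected between {minimum} and {maximum}."
        )
```

The sample ServiceNow schema in this post defines five operations (get and post on the table path; get, delete, and patch on the record path), so it would pass this check.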
For this post, a working sample of an OpenAPI Schema for ServiceNow is provided in JSON format. Before using it, edit the template file and replace YOUR_SERVICENOW_INSTANCE_URL in the following sections with the URL of your ServiceNow instance.
You can use the REST API Explorer to browse available APIs, API versions, and methods for each API. The explorer enables you to test REST API requests straight from the user interface. The Table API provides endpoints that allow you to perform create, read, update, and delete (CRUD) operations on existing tables. The calling user must have sufficient roles to access the data in the table specified in the request. For additional information on assigning roles, see Managing roles.
{
"openapi": "3.0.1",
"info": {
"title": "Table API",
"description": "Allows you to perform create, read, update and delete (CRUD) operations on existing tables",
"version": "latest"
},
"externalDocs": {
"url": "https://docs.servicenow.com/?context=CSHelp:REST-Table-API"
},
"servers": [
{
"url": "YOUR_SERVICENOW_INSTANCE_URL"
}
],
"paths": {
"/api/now/table/{tableName}": {
"get": {
"description": "Retrieve records from a table",
"parameters": [
{
"name": "tableName",
"in": "path",
"description": "Table Name",
"required": true,
"schema": {
"type": "string"
}
},
{
"name": "sysparm_query",
"in": "query",
"description": "An encoded query string used to filter the results like Incidents Numbers or Knowledge Base IDs etc",
"required": true,
"schema": {
"type": "string"
}
},
{
"name": "sysparm_fields",
"in": "query",
"description": "A comma-separated list of fields to return in the response",
"required": false,
"schema": {
"type": "string"
}
},
{
"name": "sysparm_limit",
"in": "query",
"description": "The maximum number of results returned per page",
"required": false,
"schema": {
"type": "string"
}
}
],
"responses": {
"200": {
"description": "ok",
"content": {
"application/json": {
"schema": {
"$ref": "#/components/schemas/incident"
}
}
}
}
}
},
"post": {
"description": "Create a record",
"parameters": [
{
"name": "tableName",
"in": "path",
"description": "Table Name",
"required": true,
"schema": {
"type": "string"
}
}
],
"requestBody": {
"content": {
"application/json": {
"schema": {
"type": "object",
"properties": {
"short_description": {
"type": "string",
"description": "Short Description"
},
"description": {
"type": "string",
"description": "Full Description for Incidents only"
},
"caller_id": {
"type": "string",
"description": "Caller Email"
},
"state": {
"type": "string",
"description": "State of the incident",
"enum": [
"new",
"in_progress",
"resolved",
"closed"
]
},
"text": {
"type": "string",
"description": "Article Body Text for Knowledge Bases Only (KB)"
}
},
"required": [
"short_description",
"caller_id"
]
}
}
},
"required": true
},
"responses": {
"200": {
"description": "ok",
"content": {
"application/json": {}
}
}
}
}
},
"/api/now/table/{tableName}/{sys_id}": {
"get": {
"description": "Retrieve a record",
"parameters": [
{
"name": "tableName",
"in": "path",
"description": "Table Name",
"required": true,
"schema": {
"type": "string"
}
},
{
"name": "sys_id",
"in": "path",
"description": "Sys ID",
"required": true,
"schema": {
"type": "string"
}
},
{
"name": "sysparm_fields",
"in": "query",
"description": "A comma-separated list of fields to return in the response",
"required": false,
"schema": {
"type": "string"
}
}
],
"responses": {
"200": {
"description": "ok",
"content": {
"application/json": {},
"application/xml": {},
"text/xml": {}
}
}
}
},
"delete": {
"description": "Delete a record",
"parameters": [
{
"name": "tableName",
"in": "path",
"description": "Table Name",
"required": true,
"schema": {
"type": "string"
}
},
{
"name": "sys_id",
"in": "path",
"description": "Sys ID",
"required": true,
"schema": {
"type": "string"
}
}
],
"responses": {
"200": {
"description": "ok",
"content": {
"application/json": {},
"application/xml": {},
"text/xml": {}
}
}
}
},
"patch": {
"description": "Update or modify a record",
"parameters": [
{
"name": "tableName",
"in": "path",
"description": "Table Name",
"required": true,
"schema": {
"type": "string"
}
},
{
"name": "sys_id",
"in": "path",
"description": "Sys ID",
"required": true,
"schema": {
"type": "string"
}
}
],
"requestBody": {
"content": {
"application/json": {
"schema": {
"type": "object",
"properties": {
"short_description": {
"type": "string",
"description": "Short Description"
},
"description": {
"type": "string",
"description": "Full Description for Incidents only"
},
"caller_id": {
"type": "string",
"description": "Caller Email"
},
"state": {
"type": "string",
"description": "State of the incident",
"enum": [
"new",
"in_progress",
"resolved",
"closed"
]
},
"text": {
"type": "string",
"description": "Article Body Text for Knowledge Bases Only (KB)"
}
},
"required": [
"short_description",
"caller_id"
]
}
}
},
"required": true
},
"responses": {
"200": {
"description": "ok",
"content": {
"application/json": {},
"application/xml": {},
"text/xml": {}
}
}
}
}
}
},
"components": {
"schemas": {
"incident": {
"type": "object",
"properties": {
"sys_id": {
"type": "string",
"description": "Unique identifier for the incident"
},
"number": {
"type": "string",
"description": "Incident number"
},
"short_description": {
"type": "string",
"description": "Brief description of the incident"
}
}
}
},
"securitySchemes": {
"oauth2": {
"type": "oauth2",
"flows": {
"authorizationCode": {
"authorizationUrl": "YOUR_SERVICENOW_INSTANCE_URL/oauth_auth.do",
"tokenUrl": "YOUR_SERVICENOW_INSTANCE_URL/oauth_token.do",
"scopes": {
"useraccount": "Access equivalent to the user's account"
}
}
}
}
}
},
"security": [
{
"oauth2": [
"useraccount"
]
}
]
}
The URL for the ServiceNow instance used in this post is: https://devxxxxxx.service-now.com/. After updating the sections of the template with the URL for this specific instance, the servers and securitySchemes sections of the JSON should look like the following:
"servers": [
{
"url": "https://devxxxxxx.service-now.com/"
}
]
"securitySchemes": {
"oauth2": {
"type": "oauth2",
"flows": {
"authorizationCode": {
"authorizationUrl": "https://devxxxxxx.service-now.com/oauth_auth.do",
"tokenUrl": "https://devxxxxxx.service-now.com/oauth_token.do",
"scopes": {
"useraccount": "Access equivalent to the user's account"
}
}
}
}
}
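The placeholder substitution can also be scripted rather than done by hand. The following is a minimal sketch, assuming the template is available as a string; fill_instance_url is a hypothetical helper name. It strips a trailing slash from the instance URL so that the OAuth URLs do not end up with a double slash, which means the servers URL is written without the trailing slash shown in the preceding excerpt.

```python
import json

PLACEHOLDER = "YOUR_SERVICENOW_INSTANCE_URL"


def fill_instance_url(template_text: str, instance_url: str) -> dict:
    """Replace the placeholder with the instance URL and parse the result.

    Handles the placeholder both with and without angle brackets.
    """
    url = instance_url.rstrip("/")
    text = template_text.replace(f"<{PLACEHOLDER}>", url).replace(PLACEHOLDER, url)
    # json.loads raises json.JSONDecodeError if the edit left invalid JSON.
    return json.loads(text)
```

Parsing the result catches accidental edits, such as a deleted quote or bracket, before the schema is pasted into the in-line editor.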
To create a custom plugin for ServiceNow:
- Sign in to the Amazon Q Business console.
- Choose Applications in the navigation pane, and then select your application from the list of applications.
- In the navigation pane, choose Enhancements, and then choose Plugins.
- In Plugins, choose Add plugin.
- In Add plugins, choose Custom plugin.
- In Custom plugin, enter the following information:
- In Name and description, for Plugin name: Enter a name for your Amazon Q plugin.
- In API schema, for API schema source, select Define with in-line OpenAPI schema editor.
- Select JSON as the format for the schema.
- Remove any sample schema that appears in the inline OpenAPI schema editor and replace it with the text from the provided sample JSON template, updated with your ServiceNow instance URL.
- In Authentication: Select Authentication required.
- For AWS Secrets Manager secret, choose Create and add a new secret. You need to store the ServiceNow OAuth authentication credentials in a Secrets Manager secret to connect your third-party application to Amazon Q. In the window that opens, enter the details in the form:
- Secret name: A name for your Secrets Manager secret.
- Client ID: The Client ID from ServiceNow OAuth configuration in the previous section.
- Client secret: The Client Secret from ServiceNow OAuth configuration in the previous section.
- OAuth callback URL: The URL the user needs to be redirected to after authentication. This will be your web experience URL. For this example, it’s: https://xxxxxxxx.chat.qbusiness.us-east-1.on.aws/oauth/callback. Amazon Q Business will handle OAuth tokens in this URL.
- In Choose a method to authorize Amazon Q Business: Select Create and add a new service role. The console will generate a service role name. To connect Amazon Q Business to third-party applications that require authentication, you need to give the Amazon Q role permissions to access your Secrets Manager secret. This will enable an Amazon Q Business custom plugin to access the credentials needed to sign in to the third-party service.
- Choose Add plugin to add your plugin.
Upon successful completion, the plugin will appear under Plugins with Build status of Ready and Plugin status Active.
Using Amazon Q Business web experience chat to take actions in ServiceNow
Users can launch your Amazon Q Business web experience in two ways:
- AWS access portal URL provided in an invitation email sent to the user to join AWS IAM Identity Center.
- Web experience URL shared by the admin.
Navigate to the deployed web experience URL and sign in with your AWS IAM Identity Center credentials.
After signing in, choose the New conversation icon in the left-hand menu to start a conversation.
Example: Search Knowledge Base Articles in ServiceNow for user issue and create an incident
The following chat conversation example illustrates a typical use case of Amazon Q Business integrated with custom plugins for ServiceNow. These features allow you to perform a wide range of tasks tailored to your organization’s needs.
In this example, we initiate a conversation in the web experience chat to search for KB articles related to “log in issues” in ServiceNow by invoking a plugin action. After the user submits a prompt, Amazon Q Business queries ServiceNow through the appropriate API to retrieve the results and provides a response with related KB articles. We then proceed by asking Amazon Q Business for more details to see if any of the KB articles directly addresses the user’s issue. When no relevant KB articles pertaining to the user’s issue are found, we ask Amazon Q Business to summarize the conversation and create a new incident in ServiceNow, making sure the issue is logged for resolution.
User prompt 1 – I am having issues logging in to the intranet and want to know if there are any ServiceNow KB articles on log-in issues. Perform the search on both Short Description and Text field using LIKE operator
Before submitting the preceding prompt for an action to create an incident in ServiceNow, choose the vertical ellipsis to open Conversation settings, then choose Use a Plugin to select the corresponding custom plugin for ServiceNow.
If this is the first time a user is accessing the custom plugin or if their past sign-in has expired, the user will need to authenticate. After authenticating successfully, Amazon Q Business will perform the requested task.
Choose Authorize.
If the user isn’t already signed in to ServiceNow, they will be prompted to enter their credentials. For this example, the user signing in to ServiceNow is the admin user and API actions performed in ServiceNow by Amazon Q Business on behalf of the user will have the same level of access as the user within ServiceNow.
Choose Allow for Amazon Q Business to connect to ServiceNow and perform the requested task on your behalf.
Upon executing the user’s request after verifying that they are authorized, Amazon Q Business responds with the information that it retrieved. We then proceed to retrieve additional details with the following prompt.
User prompt 2 – Can you list the KB number and short description in a tabular form?
Because no KB articles related to the user’s issue were found, we ask Amazon Q to summarize the conversation context and create an incident with the following prompt.
User prompt 3 – The error I get is "Unable to Login After System Upgrade". Summarize my issue and create an incident with detailed description and add a note that this needs to be resolved asap.
In response to your prompt for an action, Amazon Q displays a review form where you can modify or fill in the necessary information.
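The fields on the review form correspond to the request body of the post operation in the sample OpenAPI schema. The following is a minimal sketch of that body, assuming the required fields and state values defined in the schema; build_incident_body is a hypothetical helper name, and Amazon Q Business assembles the actual request itself.

```python
# Values taken from the sample OpenAPI schema in this post.
ALLOWED_STATES = {"new", "in_progress", "resolved", "closed"}


def build_incident_body(short_description, caller_id, description=None, state=None):
    """Build a request body that satisfies the sample schema's constraints."""
    # The schema marks short_description and caller_id as required.
    if not short_description or not caller_id:
        raise ValueError("short_description and caller_id are required")
    body = {"short_description": short_description, "caller_id": caller_id}
    if description:
        body["description"] = description
    if state is not None:
        if state not in ALLOWED_STATES:
            raise ValueError(f"state must be one of {sorted(ALLOWED_STATES)}")
        body["state"] = state
    return body
```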
To successfully complete the action, choose Submit.
Note: The caller_id value entered in the following example is a valid ServiceNow user.
Your web experience will display a success message if the action succeeds, or an error message if the action fails. In this case, the action succeeded and Amazon Q Business responded accordingly.
The following screenshot shows that the incident was created successfully in ServiceNow.
Troubleshooting common errors
To have a seamless experience with third-party application integrations, it’s essential to thoroughly test, identify, and troubleshoot unexpected behavior.
A common error encountered in Amazon Q Business is API Response too large, which occurs when an API response size exceeds the current limit of 100 KB. While prompting techniques are essential for obtaining accurate and relevant answers, optimizing API responses to include only the necessary and relevant data is crucial for better response times and enhanced user experience.
The REST API Explorer (shown in the following figure) in ServiceNow is a tool that allows developers and administrators to interact with and test the ServiceNow REST APIs directly from within the ServiceNow environment. It provides a user-friendly interface for making API requests, viewing responses, and understanding the available endpoints and data structures. Using this tool simplifies the process of testing and integrating with ServiceNow.
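One way to keep responses small is to limit the fields and the number of records requested through the sysparm_fields and sysparm_limit parameters defined in the sample schema. The following is a minimal sketch of such a request URL, assuming the knowledge base table is named kb_knowledge and using the ServiceNow encoded query syntax (LIKE for "contains", ^OR to join conditions); build_table_query_url is a hypothetical helper name, and the same request can be tried in the REST API Explorer.

```python
from urllib.parse import urlencode


def build_table_query_url(instance_url, table, search_term, fields, limit=5):
    """Build a Table API URL that searches two fields and trims the response."""
    # Match the search term in either the short description or the article text.
    query = f"short_descriptionLIKE{search_term}^ORtextLIKE{search_term}"
    params = {
        "sysparm_query": query,
        "sysparm_fields": ",".join(fields),  # return only the listed fields
        "sysparm_limit": str(limit),         # cap the number of records
    }
    return f"{instance_url.rstrip('/')}/api/now/table/{table}?{urlencode(params)}"
```

For example, calling build_table_query_url("https://devxxxxxx.service-now.com/", "kb_knowledge", "log in", ["number", "short_description"]) requests at most five articles with two fields each.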
Clean up
To clean up AWS configurations, sign in to the Amazon Q Business console.
- From the Amazon Q Business console, in Applications, select the application that you want to delete.
- Choose Actions and select Delete.
- To confirm deletion, enter Delete.
This will take a few minutes to finish. When completed, the application and the configured custom plugin will be deleted.
When you delete the Amazon Q Business application, the users created as part of the configuration are not automatically deleted from IAM Identity Center. Use the instructions in Delete users in IAM Identity Center to delete the users created for this post.
To clean up in ServiceNow, release the Personal Developer Instance provisioned for this post by following the instructions in the ServiceNow Documentation.
Conclusion
The integration of generative AI-powered assistants such as Amazon Q Business with enterprise systems such as ServiceNow offers significant benefits for organizations. By using natural language processing capabilities, enterprises can streamline operations, enhance user productivity, and deliver better customer experiences. The ability to query real-time data and create incidents and knowledge articles through a secure and governed chat interface transforms how users interact with enterprise data and applications. As demonstrated in this post, enhancing Amazon Q Business to integrate with ServiceNow using custom plugins empowers users to perform complex tasks effortlessly, driving efficiency across various business functions. Adopting this technology not only modernizes workflows, but also positions enterprises at the forefront of innovation.
Learn more
- Amazon Q main product page
- Get started with Amazon Q
- Introducing Amazon Q, a new generative AI-powered assistant (preview)
- Improve developer productivity with generative-AI powered Amazon Q in Amazon CodeCatalyst (preview)
- Upgrade your Java applications with Amazon Q Code Transformation (preview)
- New generative AI features in Amazon Connect, including Amazon Q, facilitate improved contact center service
- New Amazon Q in QuickSight uses generative AI assistance for quicker, easier data insights (preview)
- Amazon Q brings generative AI-powered assistance to IT pros and developers (preview)
About the Author
Siddhartha Angara is a Senior Solutions Architect at Amazon Web Services. He helps enterprise customers design and build well-architected solutions in the cloud, accelerate cloud adoption, and build Machine Learning and Generative AI applications. He enjoys playing the guitar, reading and family time!
Simplify multimodal generative AI with Amazon Bedrock Data Automation
Developers face significant challenges when using foundation models (FMs) to extract data from unstructured assets. This data extraction process requires carefully identifying models that meet the developer’s specific accuracy, cost, and feature requirements. Additionally, developers must invest considerable time optimizing price performance through fine-tuning and extensive prompt engineering. Managing multiple models, implementing safety guardrails, and adapting outputs to align with downstream system requirements can be difficult and time consuming.
Amazon Bedrock Data Automation in public preview helps address these and other challenges. This new capability from Amazon Bedrock offers a unified experience for developers of all skillsets to easily automate the extraction, transformation, and generation of relevant insights from documents, images, audio, and videos to build generative AI–powered applications. With Amazon Bedrock Data Automation, customers can fully utilize their data by extracting insights from their unstructured multimodal content in a format compatible with their applications. Amazon Bedrock Data Automation’s managed experience, ease of use, and customization capabilities help customers deliver business value faster, eliminating the need to spend time and effort orchestrating multiple models, engineering prompts, or stitching together outputs.
In this post, we demonstrate how to use Amazon Bedrock Data Automation in the AWS Management Console and the AWS SDK for Python (Boto3) for media analysis and intelligent document processing (IDP) workflows.
Amazon Bedrock Data Automation overview
You can use Amazon Bedrock Data Automation to generate standard outputs and custom outputs. Standard outputs are modality-specific default insights, such as video summaries that capture key moments, visual and audible toxic content, explanations of document charts, graph figure data, and more. Custom outputs use customer-defined blueprints that specify output requirements using natural language or a schema editor. The blueprint includes a list of fields to extract, data format for each field, and other instructions, such as data transformations and normalizations. This gives customers full control of the output, making it easy to integrate Amazon Bedrock Data Automation into existing applications.
Using Amazon Bedrock Data Automation, you can build powerful generative AI applications and automate use cases such as media analysis and IDP. Amazon Bedrock Data Automation is also integrated with Amazon Bedrock Knowledge Bases, making it easier for developers to generate meaningful information from their unstructured multimodal content to provide more relevant responses for Retrieval Augmented Generation (RAG).
Customers can get started with standard outputs for all four modalities (documents, images, videos, and audio) and with custom outputs for documents and images. Custom outputs for video and audio will be supported when the capability is generally available.
Amazon Bedrock Data Automation for images, audio, and video
To take a media analysis example, suppose that customers in the media and entertainment industry are looking to monetize long-form content, such as TV shows and movies, through contextual ad placement. To deliver the right ads at the right video moments, you need to derive meaningful insights from both the ads and the video content. Amazon Bedrock Data Automation enables your contextual ad placement application by generating these insights. For instance, you can extract valuable information such as video summaries, scene-level summaries, content moderation concepts, and scene classifications based on the Interactive Advertising Bureau (IAB) taxonomy.
To get started with deriving insights with Amazon Bedrock Data Automation, you can create a project where you can specify your output configuration using the AWS console, AWS Command Line Interface (AWS CLI) or API.
To create a project on the Amazon Bedrock console, follow these steps:
- Expand the Data Automation dropdown menu in the navigation pane and select Projects, as shown in the following screenshot.
- From the Projects console, create a new project and provide a project name, as shown in the following screenshot.
- From within the project, choose Edit, as shown in the following screenshot, to specify or modify an output configuration. Standard output is the default way of interacting with Amazon Bedrock Data Automation, and it can be used with audio, documents, images and videos, where you can have one standard output configuration per data type for each project.
- For customers who want to analyze images and videos for media analysis, standard output can be used to generate insights such as image summary, video scene summary, and scene classifications with IAB taxonomy. You can select the image summarization, video scene summarization, and IAB taxonomy checkboxes from the Standard output tab and then choose Save changes to finish configuring your project, as shown in the following screenshot.
- To test the standard output configuration using your media assets, choose Test, as shown in the following screenshot.
The next example uses the project to generate insights for a travel ad.
- Upload an image, then choose Generate results, as shown in the following screenshot, for Amazon Bedrock Data Automation to invoke an inference request.
- Amazon Bedrock Data Automation will process the uploaded file based on the project’s configuration, automatically detecting that the file is an image and then generating a summary and IAB categories for the travel ad.
- After you have generated insights for the ad image, you can generate video insights to determine the best video scene for effective ad placement. In the same project, upload a video file and choose Generate results, as shown in the following screenshot.
Amazon Bedrock Data Automation will detect that the file is a video and will generate insights for the video based on the standard output configuration specified in the project, as shown in the following screenshot.
These insights from Amazon Bedrock Data Automation can help you effectively place relevant ads in your video content, which can help improve content monetization.
Intelligent document processing with Amazon Bedrock Data Automation
You can use Amazon Bedrock Data Automation to automate IDP workflows at scale, without needing to orchestrate complex document processing tasks such as classification, extraction, normalization, or validation.
To take a mortgage example, a lender wants to automate the processing of a mortgage lending packet to streamline their IDP pipeline and improve the accuracy of loan processing. Amazon Bedrock Data Automation simplifies the automation of complex IDP tasks such as document splitting, classification, data extraction, output format normalization, and data validation. Amazon Bedrock Data Automation also incorporates confidence scores and visual grounding of the output data to mitigate hallucinations and help improve result reliability.
For example, you can generate custom output by defining blueprints, which specify output requirements using natural language or a schema editor, to process multiple file types in a single, streamlined API. Blueprints can be created using the console or the API, and you can use a catalog blueprint or create a custom blueprint for documents and images.
For all modalities, this workflow consists of three main steps: creating a project, invoking the analysis, and retrieving the results.
The following solution walks you through a simplified mortgage lending process with Amazon Bedrock Data Automation using the AWS SDK for Python (Boto3), which is straightforward to integrate into an existing IDP workflow.
Prerequisites
Before you invoke the Amazon Bedrock API, make sure you have the following:
- An AWS account that provides access to AWS services, including Amazon Bedrock Data Automation and Amazon Simple Storage Service (Amazon S3)
- The AWS CLI set up
- An AWS Identity and Access Management (IAM) user set up for the Amazon Bedrock Data Automation API and appropriate permissions added to the IAM user
- The IAM user access key and secret key to configure the AWS CLI and permissions
- The latest Boto3 library
- Python 3.8 or later configured with your integrated development environment (IDE)
- An S3 bucket
Create custom blueprint
In this example, you have the lending packet, as shown in the following image, which contains three documents: a pay stub, a W-2 form, and a driver’s license.
Amazon Bedrock Data Automation has sample blueprints for these three documents that define commonly extracted fields. However, you can also customize Amazon Bedrock Data Automation to extract specific fields from each document. For example, you can extract only the gross pay and net pay from the pay stub by creating a custom blueprint.
To create a custom blueprint using the API, you can use the CreateBlueprint operation using the Amazon Bedrock Data Automation Client. The following example shows the gross pay and net pay being defined as properties passed to CreateBlueprint, to be extracted from the lending packet:
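The following is a minimal sketch of such a call with Boto3, not the exact code from the original example. The blueprint name, field names, and instructions are illustrative assumptions, and the layout of the schema body (class, inferenceType, instruction) is an assumption about the blueprint schema format that should be checked against the current Amazon Bedrock Data Automation documentation.

```python
import json


def build_paystub_blueprint_request(name="CUSTOM_PAYSLIP_BLUEPRINT"):
    """Build the keyword arguments for create_blueprint (names are illustrative)."""
    schema = {
        "$schema": "http://json-schema.org/draft-07/schema#",
        "description": "Pay stub with gross pay and net pay",
        "class": "Pay stub",  # assumed key name; verify against the documentation
        "type": "object",
        "properties": {
            "gross_pay": {
                "type": "number",
                "inferenceType": "explicit",  # assumed value
                "instruction": "The gross pay for this pay period",
            },
            "net_pay": {
                "type": "number",
                "inferenceType": "explicit",  # assumed value
                "instruction": "The net pay for this pay period",
            },
        },
    }
    return {
        "blueprintName": name,
        "type": "DOCUMENT",
        "blueprintStage": "LIVE",
        "schema": json.dumps(schema),  # the API expects the schema as a string
    }


def create_blueprint(request, region_name="us-west-2"):
    """Call CreateBlueprint and return the new blueprint's ARN."""
    import boto3  # imported here so the builder above runs without AWS access

    client = boto3.client("bedrock-data-automation", region_name=region_name)
    response = client.create_blueprint(**request)
    return response["blueprint"]["blueprintArn"]
```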
The CreateBlueprint response returns the blueprintARN for the pay stub’s custom blueprint:
Configure Amazon Bedrock Data Automation project
To begin processing files using blueprints with Amazon Bedrock Data Automation, you first need to create a data automation project. To process a multiple-page document containing different file types, you can configure a project with different blueprints for each file type.
Use Amazon Bedrock Data Automation to apply multiple document blueprints within one project so you can process different types of documents within the same project, each with its own custom extraction logic.
When using the API to create a project, you invoke the CreateDataAutomationProject operation. The following is an example of how you can configure custom output using the custom blueprint for the pay stub and the sample blueprints for the W-2 and driver’s license:
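The following is a minimal sketch of that configuration with Boto3, not the exact code from the original example. The project name, description, and standard output settings are illustrative assumptions, and the blueprint ARNs are passed in by the caller because the sample blueprint ARNs are not given in this post.

```python
def build_project_request(project_name, blueprint_arns, stage="LIVE"):
    """Build the keyword arguments for create_data_automation_project."""
    return {
        "projectName": project_name,
        "projectDescription": "Lending packet processing",  # illustrative
        "projectStage": stage,
        # A standard output configuration is supplied alongside the custom one;
        # these settings are an assumption and can be adjusted.
        "standardOutputConfiguration": {
            "document": {
                "outputFormat": {
                    "textFormat": {"types": ["PLAIN_TEXT"]},
                    "additionalFileFormat": {"state": "ENABLED"},
                }
            }
        },
        # One entry per blueprint: the custom pay stub blueprint plus the
        # sample blueprints for the W-2 and driver's license.
        "customOutputConfiguration": {
            "blueprints": [
                {"blueprintArn": arn, "blueprintStage": stage}
                for arn in blueprint_arns
            ]
        },
    }


def create_project(request, region_name="us-west-2"):
    """Call CreateDataAutomationProject and return the project ARN."""
    import boto3  # imported here so the builder above runs without AWS access

    client = boto3.client("bedrock-data-automation", region_name=region_name)
    return client.create_data_automation_project(**request)["projectArn"]
```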
The CreateDataAutomationProject response returns the projectARN for the project:
To process different types of documents using multiple document blueprints in a single project, Amazon Bedrock Data Automation uses a splitter configuration, which must be enabled through the API. The following is the override configuration for the splitter, and you can refer to the Boto3 documentation for more information:
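The following is a minimal sketch of that override, assuming the overrideConfiguration parameter of the Boto3 create_data_automation_project call; with_splitter_enabled is a hypothetical helper that adds the override to an existing set of request arguments without modifying the original.

```python
import copy

# Override that turns on document splitting for the project.
SPLITTER_OVERRIDE = {"document": {"splitter": {"state": "ENABLED"}}}


def with_splitter_enabled(project_request):
    """Return a copy of the project request with the splitter override added."""
    updated = copy.deepcopy(project_request)
    updated["overrideConfiguration"] = copy.deepcopy(SPLITTER_OVERRIDE)
    return updated
```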
Upon creation, the API validates the input configuration and creates a new project, returning the projectARN, as shown in the following screenshot.
Test the solution
Now that the blueprint and project setup is complete, the InvokeDataAutomationAsync operation from the Amazon Bedrock Data Automation runtime can be used to start processing files. This API call initiates the asynchronous processing of files in an S3 bucket, in this case the lending packet, using the configuration defined in the project by passing the project’s ARN:
InvokeDataAutomationAsync returns the invocationARN:
GetDataAutomationStatus can be used to view the status of the invocation, using the invocationARN from the previous response:
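Because the invocation is asynchronous, the status usually has to be checked more than once. The following is a minimal polling sketch, not the original example. The set of terminal status values is an assumption to verify against the Boto3 documentation, and wait_for_invocation and make_status_getter are hypothetical helper names.

```python
import time

# Assumed terminal values of the "status" field; verify against the SDK docs.
TERMINAL_STATUSES = {"Success", "ServiceError", "ClientError"}


def wait_for_invocation(get_status, poll_seconds=5.0, max_polls=120):
    """Call get_status() until the invocation reaches a terminal status."""
    for _ in range(max_polls):
        response = get_status()
        if response.get("status") in TERMINAL_STATUSES:
            return response
        time.sleep(poll_seconds)
    raise TimeoutError("Invocation did not finish within the polling budget")


def make_status_getter(invocation_arn, region_name="us-west-2"):
    """Return a zero-argument function that calls GetDataAutomationStatus."""
    import boto3  # imported here so the polling loop can be tested offline

    client = boto3.client("bedrock-data-automation-runtime", region_name=region_name)
    return lambda: client.get_data_automation_status(invocationArn=invocation_arn)
```

Passing the status call in as a function keeps the polling loop independent of AWS, so it can be exercised with a stub before it is pointed at a real invocation.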
When the job is complete, view the results in the S3 bucket used in the outputConfiguration by navigating to the ~/JOB_ID/0/custom_output/ folder.
From the following sample output, Amazon Bedrock Data Automation associated the pay stub file with the custom pay stub blueprint with a high level of confidence:
Using the matched blueprint, Amazon Bedrock Data Automation was able to accurately extract each field defined in the blueprint:
Additionally, Amazon Bedrock Data Automation returns confidence scores and bounding box information for each field:
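The per-field confidence scores can be used to route uncertain extractions to a human reviewer. The following is a minimal sketch, assuming the custom output JSON has an explainability_info list whose entries map each field name to an object with a confidence value; that layout is an assumption about the output format, and low_confidence_fields and the 0.8 threshold are hypothetical choices.

```python
def low_confidence_fields(custom_output, threshold=0.8):
    """Return the names of fields whose confidence is below the threshold."""
    flagged = []
    for entry in custom_output.get("explainability_info", []):
        for field, details in entry.items():
            # Skip anything that does not look like a per-field details object.
            if isinstance(details, dict) and details.get("confidence", 0.0) < threshold:
                flagged.append(field)
    return sorted(flagged)
```

Fields returned by this function could be queued for manual review, while the remaining fields flow straight into the downstream loan processing system.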
This example demonstrates how customers can use Amazon Bedrock Data Automation to streamline and automate an IDP workflow. Amazon Bedrock Data Automation automates complex document processing tasks such as data extraction, normalization, and validation from documents. Amazon Bedrock Data Automation helps to reduce operational complexity and improves processing efficiency to handle higher loan processing volumes, minimize errors, and drive operational excellence.
Cleanup
When you’re finished evaluating this feature, delete the S3 bucket and any objects to avoid any further charges.
Summary
Customers can get started with Amazon Bedrock Data Automation, which is available in public preview in AWS Region US West 2 (Oregon). Learn more about Amazon Bedrock Data Automation and how to automate the generation of accurate information from unstructured content for building generative AI–based applications.
About the authors
Ian Lodge is a Solutions Architect at AWS, helping ISV customers in solving their architectural, operational, and cost optimization challenges. Outside of work he enjoys spending time with his family, ice hockey and woodworking.
Alex Pieri is a Solutions Architect at AWS who works with retail customers to plan, build, and optimize their AWS cloud environments. He specializes in helping customers build enterprise-ready generative AI solutions on AWS.
Raj Pathak is a Principal Solutions Architect and Technical advisor to Fortune 50 and Mid-Sized FSI (Banking, Insurance, Capital Markets) customers across Canada and the United States. Raj specializes in Machine Learning with applications in Generative AI, Natural Language Processing, Intelligent Document Processing, and MLOps.
How TUI uses Amazon Bedrock to scale content creation and enhance hotel descriptions in under 10 seconds
TUI Group is one of the world’s leading global tourism services, providing 21 million customers with an unmatched holiday experience in 180 regions. TUI Group covers the end-to-end tourism chain with over 400 owned hotels, 16 cruise ships, 1,200 travel agencies, and 5 airlines covering all major holiday destinations around the globe. At TUI, crafting high-quality content is a crucial component of its promotional strategy.
The TUI content teams are tasked with producing high-quality content for its websites, including product details, hotel information, and travel guides, often using descriptions written by hotel and third-party partners. This content needs to adhere to TUI’s tone of voice, which is essential to communicating the brand’s distinct personality. But as its portfolio expands with more hotels and offerings, scaling content creation has proven challenging. This presents an opportunity to augment and automate the existing content creation process using generative AI.
In this post, we discuss how we used Amazon SageMaker and Amazon Bedrock to build a content generator that rewrites marketing content following specific brand and style guidelines. Amazon Bedrock is a fully managed service that offers a choice of high-performing foundation models (FMs) from leading AI companies such as AI21 Labs, Anthropic, Cohere, Meta, Mistral AI, Stability AI, and Amazon through a single API, along with a broad set of capabilities you need to build generative AI applications with security, privacy, and responsible AI. Amazon SageMaker helps data scientists and machine learning (ML) engineers build FMs from scratch, evaluate and customize FMs with advanced techniques, and deploy FMs with fine-grain controls for generative AI use cases that have stringent requirements on accuracy, latency, and cost.
Through experimentation, we found that a two-phase approach worked best to make sure that the output aligned with TUI’s tone of voice requirements. The first phase fine-tuned a smaller large language model (LLM) on a large corpus of data. The second phase used a different LLM for post-processing. Fine-tuning lets us generate content that mimics the TUI brand voice from static data, something we could not capture through prompt engineering. Employing a second model with few-shot examples helped verify that the output adhered to specific formatting and grammatical rules. The second model relies on a more dynamic set of examples, which we can use to adjust the output quickly in the future for different brand requirements. Overall, this approach resulted in higher-quality content and allowed TUI to produce it at a higher velocity.
Solution overview
The architecture consists of a few key components:
- LLMs – We evaluated different approaches and found that a two-model solution performed best. It consists of a fine-tuned Meta Llama 2 model that generates a description for the given hotel and Anthropic’s Claude 2 model that reformats the output. We fine-tuned and hosted the Meta Llama 2 model on Amazon SageMaker and consumed Anthropic’s Claude 2 from Amazon Bedrock through API calls.
- Orchestration – We created a state machine using AWS Step Functions to make calls in a batch format to the two LLMs and fetch the search engine optimization (SEO) score for the generated content from a third-party API. If the SEO content score is above a defined threshold (80%), the generated content is stored in an Amazon DynamoDB table and can later be reviewed by the content team directly in the front-end UI. Through this process, we maintain and monitor content quality at scale.
- Human in the loop feedback – We developed a custom React front-end application to gather feedback from the content team to facilitate continuous improvement and future model fine-tuning. You can use the feedback to fine-tune a base model on SageMaker using reinforcement learning from human feedback (RLHF) to improve performance.
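The SEO gate in the orchestration step can be sketched as an Amazon States Language Choice state. This is a minimal illustration rather than the state machine TUI built: the state names, the `$.seo_score` path, and the default branch for descriptions at or below the threshold are assumptions. Only the 80% threshold comes from the description above.

```python
import json

SEO_THRESHOLD = 80  # percent, the threshold stated in the post


def build_seo_gate(threshold: float = SEO_THRESHOLD) -> dict:
    """Return a Choice state definition that branches on the SEO score.

    The state names and the "$.seo_score" JSON path are hypothetical.
    """
    return {
        "Type": "Choice",
        "Choices": [
            {
                "Variable": "$.seo_score",
                "NumericGreaterThan": threshold,
                "Next": "StoreInDynamoDB",
            }
        ],
        # Assumption: the post does not say what happens to low-scoring content.
        "Default": "SkipDescription",
    }


def next_state(seo_score: float, gate: dict) -> str:
    """Evaluate the gate locally, mirroring how the Choice state would branch."""
    for rule in gate["Choices"]:
        if seo_score > rule["NumericGreaterThan"]:
            return rule["Next"]
    return gate["Default"]


if __name__ == "__main__":
    gate = build_seo_gate()
    print(json.dumps(gate, indent=2))
    print(next_state(85.0, gate))
    print(next_state(80.0, gate))
```

Keeping the threshold in the state definition, instead of inside a Lambda function, would let the content team change it without redeploying code.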
The following diagram is the high-level architecture of the solution.
Prompt engineering
We started by crafting prompts that include the most common issues identified by the TUI content team, including tone of voice, accuracy, length, and grammar. We provided the LLM with a few examples of curated content within the context window. Although the generated output followed the guidance, the writing style didn’t meet TUI’s tone of voice requirements.
Example prompt:
Here is an example of the results we received.
Original text | Generated text |
Boutique interiors and a great location combine at this small but perfectly formed hotel. It may be small but this 15-room hotel packs a punch with its interior design. The immaculate pool is set in an open atrium area. And there’s funky cocktail bar and lounge, too. The location is every bit as impressive as the styling. It’s about a 10-minute walk to the beach and less than half that to the centre of Thassos Town. | This petite hotel brings boutique style to Thassos Town. Its immaculate pool and funky cocktail bar create an oasis in the heart of the action. The beach is just a 10-minute walk away, so you can easily split your time between sunbathing and sightseeing. With only 15 rooms, the experience feels intimate yet worldly. The interiors marry modern design with traditional touches for a unique look. And the central location means popular attractions like the Archaeological Museum and shopping street are right on your doorstep. |
To evaluate these results, we generated a dataset of 150 hotel descriptions using two different prompt versions and asked our business teams to review it. However, they flagged 98% of the output with significant issues. Firstly, Anthropic’s Claude 2 struggled to adhere to TUI’s tone of voice and branding guidelines, frequently employed abbreviations, and favored American English over British English. It also failed to follow the SEO guidelines provided and sometimes generated false information regarding hotel facilities and locations. The following image shows a list of these challenges and how the LLM handled them. Of the six challenges, the LLM met only one.
Fine-tuning Llama 2 using PEFT on Amazon SageMaker JumpStart
These issues and poor feedback led us to conclude that prompt engineering alone would not adequately address the problem. As a result, we decided to pursue an alternative approach: fine-tuning a smaller large language model to rewrite the text in accordance with TUI’s tone of voice. We used a curated set of hotel descriptions written by TUI copywriters so that the model would have better alignment with our guidelines.
We selected Meta Llama 2, one of the top open source LLMs available at the time, through Amazon SageMaker JumpStart, and chose the 13B parameter version for parameter-efficient fine-tuning (PEFT), specifically quantized low-rank adaptation (QLoRA). This technique quantizes the pre-trained model to 4 bits and adds small low-rank adapters for fine-tuning. We fine-tuned the model on a single ml.g5.4xlarge instance in about 20 hours using a relatively small dataset of around 4,500 hotels. We also tested the Llama 2 7B and 70B models. We found that the 7B model didn’t perform well enough, and the 70B model had much higher costs without delivering significant improvement.
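A rough back-of-the-envelope calculation shows why 4-bit quantization with low-rank adapters makes a 13B model trainable on one ml.g5.4xlarge, which has a single 24 GB GPU. The sketch below counts weight memory only (activations, optimizer state, and quantization constants are ignored). The adapter rank of 8 is an illustrative assumption, because the post does not state the rank that was used.

```python
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Approximate memory for the model weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


def lora_trainable_params(d_out: int, d_in: int, rank: int) -> int:
    """Parameters in one adapter pair: A (rank x d_in) plus B (d_out x rank)."""
    return rank * (d_in + d_out)


if __name__ == "__main__":
    n_params = 13e9
    print("16-bit weights (GB):", weight_memory_gb(n_params, 16))
    print("4-bit weights (GB):", weight_memory_gb(n_params, 4))

    hidden = 5120  # hidden size of Llama 2 13B
    rank = 8       # illustrative rank, not taken from the post
    full = hidden * hidden
    adapter = lora_trainable_params(hidden, hidden, rank)
    print("full projection params:", full)
    print("adapter params:", adapter)
    print("adapter share:", adapter / full)
```

Under these assumptions the frozen weights shrink from about 26 GB to about 6.5 GB, and each adapted projection matrix trains well under 1% of the parameters it would need with full fine-tuning.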
We used common natural language processing (NLP) evaluation metrics, such as perplexity, for evaluation and monitoring during training, and established daily feedback loops with the content team to refine the test set. The output from the fine-tuned Meta Llama 2 13B model effectively mimicked the TUI tone of voice and adhered to most SEO guidelines, even those not specified in the prompt.
For more information, refer to Fine-tune Llama 2 for text generation on Amazon SageMaker JumpStart.
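For reference, perplexity is the exponential of the average negative log-likelihood per token, so lower values mean the model finds the held-out text less surprising. The following is a minimal sketch of the metric itself, assuming you already have per-token natural-log probabilities from the model. It is not the evaluation code used in the project.

```python
import math
from typing import Sequence


def perplexity(token_log_probs: Sequence[float]) -> float:
    """Perplexity = exp(-mean natural-log probability per token)."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


if __name__ == "__main__":
    # A model that assigns probability 0.25 to every token is as uncertain
    # as a uniform choice among four options.
    uniform = [math.log(0.25)] * 10
    print(perplexity(uniform))
```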
Original text | Generated text |
The NH München City Süd hotel, formerly known as NH München am Ring, is located southeast of Munich. The location is ideal for reaching the highway towards Salzburg and provides good access to the airport and Exhibition Center. You can also get to the city center in 20 minutes by hopping on the S-Bahn at the Karl-Preis-Platz station. The peaceful residential location offers an excellent opportunity to relax in one of the 163 rooms. These are decorated in white with red accents. Rooms feature free Wi-Fi and electronic safes that are large enough for your laptop. If you need to work, there are spacious desks in every room and apartments with seperate living rooms are available for longer term guests. The suites also come with separate living rooms and foldaway beds are available upon request. The breakfast will get you ready for a busy day with a large spread of cold cuts, cheeses, fresh pastries, and fruit. These come accompanied with freshly brewed coffee, tea, and fresh fruit juices. For lunch and dinner, you will find many restaurant options in close proximity to the hotel. There is a sauna for relaxation and you can park your car in the garage. | Make yourself at home in Munich at the München City Süd. This former NH München am Ring sits just southeast of the Bavarian capital, putting both the highway to Salzburg and the airport within easy reach. The S-Bahn station Karl-Preis-Platz is a quick walk away too, so you can be in the city center in just 20 minutes. The neighborhood here is peaceful, giving you the chance to properly relax during your stay. The hotel’s 163 rooms are decorated in calming white and red hues. Free Wi-Fi and laptop-sized safes come as standard, while spacious desks make working easy. For longer visits, apartments with separate living areas are available. The suites take things up a notch with their own living spaces and foldaway beds on request. Fuel up each morning with a vast breakfast spread of cold cuts, cheeses, pastries and fruit. All washed down with coffee, tea and fresh juices. You’ll find many dining options close by for other meals. After a busy day, unwind in the sauna or park up in the garage. |
The following image shows a list of the challenges and how the LLM handled them. Of the six challenges, the LLM met four.
Integrating Anthropic’s Claude 2 for further improvement
To further improve the output, we introduced Anthropic’s Claude 2 using Amazon Bedrock as a final refinement step. This included converting American spelling to British spelling, writing numbers one through nine in words and larger numbers in digits, correcting typos and capitalization errors, minimizing banned words, incorporating essential TUI branding words, and adding missing hotel information. We also implemented a feedback mechanism in the UI to use data for ongoing fine-tuning in production. By using Anthropic Claude 2, we make sure that the final output applies the remaining formatting rules.
Original text | Generated text |
As its name suggests, Vienna Sporthotel has loads lined up for sports fans. There are 10 tennis courts and four golf simulators here, plus, you can book a session with a coach to brush up on your skills. On top of that, there’s an outdoor pool, a well-stocked fitness room and a big sauna. To fuel up for all that exercise, you can grab a plate at the breakfast buffet. Location-wise, this place is set up for sightseeing. Hop on a bus and you can reach the likes of The Hofburg, Belvedere Palace and Karlskirche in less than half an hour. You’re only a 10-minute stroll away from the nearest restaurants and cafes, too. | With a name like Vienna Sporthotel, you can expect this place to serve up plenty for fans of sport and activity. Brush up your skills on one of 10 tennis courts or in the four golf simulators, then recharge in the outdoor pool or well-equipped gym. There’s even a coach on hand to help perfect your technique. When you’re ready to refuel, tuck into the breakfast buffet. Then get set for sightseeing – a bus ride whisks you to top Vienna attractions like The Hofburg, Belvedere Palace and Karlskirche in under 30 minutes. You’re also just a short stroll from local eateries and coffee shops. |
The following image shows a list of the challenges and how the LLM handled them. The LLM met all six challenges.
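The refinement step described in this section can be sketched as a few-shot prompt sent to Claude 2 through the Amazon Bedrock runtime API. This is a hedged illustration, not the production prompt: the rule list is paraphrased from the text, and the XML-style tags, the example pairs, and the function names are hypothetical. The request body follows the Claude 2 text-completion format (`prompt`, `max_tokens_to_sample`).

```python
import json

MODEL_ID = "anthropic.claude-v2"  # Claude 2 on Amazon Bedrock

# Hypothetical rule list, paraphrased from the post.
RULES = [
    "Use British English spelling.",
    "Write the numbers one through nine in words and larger numbers in digits.",
    "Correct typos and capitalization errors.",
]


def build_request(draft: str, examples: list[tuple[str, str]]) -> str:
    """Build the JSON request body, including few-shot (draft, final) pairs."""
    shots = "\n\n".join(
        f"<example>\n<draft>{before}</draft>\n<final>{after}</final>\n</example>"
        for before, after in examples
    )
    rules = "\n".join(f"- {rule}" for rule in RULES)
    prompt = (
        "\n\nHuman: Rewrite the draft hotel description so that it follows these rules:\n"
        f"{rules}\n\n{shots}\n\n<draft>{draft}</draft>"
        "\n\nAssistant:"
    )
    return json.dumps(
        {"prompt": prompt, "max_tokens_to_sample": 1024, "temperature": 0.0}
    )


def refine(draft: str, examples: list[tuple[str, str]]) -> str:
    """Call Bedrock. Requires boto3, AWS credentials, and access to the model."""
    import boto3  # imported lazily so build_request works without AWS access

    client = boto3.client("bedrock-runtime")
    response = client.invoke_model(
        modelId=MODEL_ID, body=build_request(draft, examples)
    )
    return json.loads(response["body"].read())["completion"]
```

Because the examples are passed in at call time, the formatting rules can be adjusted by editing the example pairs instead of retraining the fine-tuned model.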
Key outcomes
The final architecture consists of a fine-tuned Meta Llama 2 13B model and Anthropic’s Claude 2, using the strengths of each model. In a blind test, these dynamically generated hotel descriptions were rated higher than those written by humans in 75% of a sample of 50 hotels. We also integrated a third-party API to calculate SEO scores for the generated content, and we observed up to 4% uplift in SEO scores for the generated content compared to human-written descriptions. Most significantly, the content generation process is now five times faster, enhancing our team’s productivity without compromising quality or consistency. We can generate a vast number of hotel descriptions in just a few hours, a task that previously took months.
Takeaways
Moving forward, we plan to explore how this technology can address current inefficiencies and quality gaps, especially for hotels that our team hasn’t had the capacity to curate. We plan to expand this solution to more brands and regions within the TUI portfolio, including producing content in various languages and tailoring it to meet the specific needs of different audiences.
Throughout this project, we learned a few valuable lessons:
- Few-shot prompting is cost-effective and sufficient when you have limited examples and specific guidelines for responses. Fine-tuning can help significantly improve model performance when you need to tailor content to match a brand’s tone of voice, but can be resource intensive and is based on static data sources that can get outdated.
- Fine-tuning the Llama 2 70B model was much more expensive than fine-tuning Llama 2 13B and did not result in significant improvement.
- Incorporating human feedback and maintaining a human-in-the-loop approach is essential for protecting brand integrity and continuously improving the solution. The collaboration between TUI engineering, content, and SEO teams was crucial to the success of this project.
Although Meta Llama 2 and Anthropic’s Claude 2 were the latest state-of-the-art models available at the time of our experiment, since then we have seen the launch of Meta Llama 3 and Anthropic’s Claude 3.5, which we expect can significantly improve the quality of our outputs. Amazon Bedrock also now supports fine-tuning for Meta Llama 2, Cohere Command Light, and Amazon Titan models, making it simpler and faster to test models without managing infrastructure.
About the Authors
Nikolaos Zavitsanos is a Data Scientist at TUI, specialized in developing customer-facing Generative AI applications using AWS services. With a strong background in Computer Science and Artificial Intelligence, he leverages advanced technologies to enhance user experiences and drive innovation. Outside of work, Nikolaos plays water polo and competes at a national level. Connect with Nikolaos on LinkedIn.
Hin Yee Liu is a Senior Prototyping Engagement Manager at Amazon Web Services. She helps AWS customers to bring their big ideas to life and accelerate the adoption of emerging technologies. Hin Yee works closely with customer stakeholders to identify, shape and deliver impactful use cases leveraging Generative AI, AI/ML, Big Data, and Serverless technologies using agile methodologies. In her free time, she enjoys knitting, travelling and strength training. Connect with Hin Yee on LinkedIn.
Llama 3.3 70B now available in Amazon SageMaker JumpStart
Today, we are excited to announce that the Llama 3.3 70B from Meta is available in Amazon SageMaker JumpStart. Llama 3.3 70B marks an exciting advancement in large language model (LLM) development, offering comparable performance to larger Llama versions with fewer computational resources.
In this post, we explore how to deploy this model efficiently on Amazon SageMaker AI, using advanced SageMaker AI features for optimal performance and cost management.
Overview of the Llama 3.3 70B model
Llama 3.3 70B represents a significant breakthrough in model efficiency and performance optimization. This new model delivers output quality comparable to Llama 3.1 405B while requiring only a fraction of the computational resources. According to Meta, this efficiency gain translates to nearly five times more cost-effective inference operations, making it an attractive option for production deployments.
The model’s sophisticated architecture builds upon Meta’s optimized version of the transformer design, featuring an enhanced attention mechanism that can help substantially reduce inference costs. During its development, Meta’s engineering team trained the model on an extensive dataset comprising approximately 15 trillion tokens, incorporating both web-sourced content and over 25 million synthetic examples specifically created for LLM development. This comprehensive training approach results in the model’s robust understanding and generation capabilities across diverse tasks.
What sets Llama 3.3 70B apart is its refined training methodology. The model underwent an extensive supervised fine-tuning process, complemented by Reinforcement Learning from Human Feedback (RLHF). This dual-approach training strategy helps align the model’s outputs more closely with human preferences while maintaining high performance standards. In benchmark evaluations against its larger counterpart, Llama 3.3 70B demonstrated remarkable consistency, trailing Llama 3.1 405B by less than 2% in 6 out of 10 standard AI benchmarks and actually outperforming it in three categories. This performance profile makes it an ideal candidate for organizations seeking to balance model capabilities with operational efficiency.
The following figure summarizes the benchmark results (source).
Getting started with SageMaker JumpStart
SageMaker JumpStart is a machine learning (ML) hub that can help accelerate your ML journey. With SageMaker JumpStart, you can evaluate, compare, and select pre-trained foundation models (FMs), including Llama 3 models. These models are fully customizable for your use case with your data, and you can deploy them into production using either the UI or SDK.
Deploying Llama 3.3 70B through SageMaker JumpStart offers two convenient approaches: using the intuitive SageMaker JumpStart UI or implementing programmatically through the SageMaker Python SDK. Let’s explore both methods to help you choose the approach that best suits your needs.
Deploy Llama 3.3 70B through the SageMaker JumpStart UI
You can access the SageMaker JumpStart UI through either Amazon SageMaker Unified Studio or Amazon SageMaker Studio. To deploy Llama 3.3 70B using the SageMaker JumpStart UI, complete the following steps:
- In SageMaker Unified Studio, on the Build menu, choose JumpStart models.
Alternatively, on the SageMaker Studio console, choose JumpStart in the navigation pane.
- Search for Meta Llama 3.3 70B.
- Choose the Meta Llama 3.3 70B model.
- Choose Deploy.
- Accept the end-user license agreement (EULA).
- For Instance type, choose an instance (ml.g5.48xlarge or ml.p4d.24xlarge).
- Choose Deploy.
Wait until the endpoint status shows as InService. You can now run inference using the model.
Deploy Llama 3.3 70B using the SageMaker Python SDK
For teams looking to automate deployment or integrate with existing MLOps pipelines, you can use the following code to deploy the model using the SageMaker Python SDK:
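A minimal sketch of such a deployment follows, using the `JumpStartModel` class from the SageMaker Python SDK. The model ID shown is an assumption about the JumpStart identifier and should be checked against the model card. The `inputs`/`parameters` payload shape and the generation settings are also assumptions based on how JumpStart text-generation endpoints are commonly invoked.

```python
def build_payload(prompt: str, max_new_tokens: int = 256) -> dict:
    """Request body in the inputs/parameters shape (assumed for this endpoint)."""
    return {
        "inputs": prompt,
        "parameters": {
            "max_new_tokens": max_new_tokens,
            "temperature": 0.6,  # illustrative sampling settings
            "top_p": 0.9,
        },
    }


def deploy_llama(instance_type: str = "ml.p4d.24xlarge"):
    """Deploy the model and return a predictor.

    Requires the sagemaker package, AWS credentials, and an execution role.
    """
    from sagemaker.jumpstart.model import JumpStartModel  # lazy import

    model = JumpStartModel(
        model_id="meta-textgeneration-llama-3-3-70b-instruct",  # assumed ID
        instance_type=instance_type,
    )
    # accept_eula=True is the SDK counterpart of accepting the EULA in the UI.
    return model.deploy(accept_eula=True)
```

After `predictor = deploy_llama()` returns, you can call `predictor.predict(build_payload("What is Amazon SageMaker JumpStart?"))` to run inference. Delete the endpoint when you are finished to avoid ongoing charges.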
Set up auto scaling and scale down to zero
You can optionally set up auto scaling to scale down to zero after deployment. For more information, refer to Unlock cost savings with the new scale down to zero feature in SageMaker Inference.
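As a rough illustration, scaling to zero is configured through Application Auto Scaling by registering the endpoint's inference component as a scalable target with a minimum capacity of 0. The sketch below shows only that registration step. The inference component name and the maximum copy count are hypothetical, and the scaling policies that decide when to scale in and out are covered in the referenced post.

```python
def scalable_target_args(inference_component_name: str, max_copies: int = 2) -> dict:
    """Arguments for Application Auto Scaling's register_scalable_target call."""
    return {
        "ServiceNamespace": "sagemaker",
        "ResourceId": f"inference-component/{inference_component_name}",
        "ScalableDimension": "sagemaker:inference-component:DesiredCopyCount",
        "MinCapacity": 0,  # allowing zero copies is what permits scale to zero
        "MaxCapacity": max_copies,
    }


def register(inference_component_name: str, max_copies: int = 2) -> None:
    """Register the target. Requires boto3 and AWS credentials."""
    import boto3  # imported lazily so the argument builder works offline

    client = boto3.client("application-autoscaling")
    client.register_scalable_target(
        **scalable_target_args(inference_component_name, max_copies)
    )
```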
Optimize deployment with SageMaker AI
SageMaker AI simplifies the deployment of sophisticated models like Llama 3.3 70B, offering a range of features designed to optimize both performance and cost efficiency. With the advanced capabilities of SageMaker AI, organizations can deploy and manage LLMs in production environments, taking full advantage of Llama 3.3 70B’s efficiency while benefiting from the streamlined deployment process and optimization tools of SageMaker AI. Default deployment through SageMaker JumpStart uses accelerated deployment, which uses speculative decoding to improve throughput. For more information on how speculative decoding works with SageMaker AI, see Amazon SageMaker launches the updated inference optimization toolkit for generative AI.
Firstly, the Fast Model Loader revolutionizes the model initialization process by implementing an innovative weight streaming mechanism. This feature fundamentally changes how model weights are loaded onto accelerators, dramatically reducing the time required to get the model ready for inference. Instead of the traditional approach of loading the entire model into memory before beginning operations, Fast Model Loader streams weights directly from Amazon Simple Storage Service (Amazon S3) to the accelerator, enabling faster startup and scaling times.
Secondly, Container Caching transforms how model containers are managed during scaling operations. This feature eliminates one of the major bottlenecks in deployment scaling by pre-caching container images, removing the need for time-consuming downloads when adding new instances. For large models like Llama 3.3 70B, where container images can be substantial in size, this optimization significantly reduces scaling latency and improves overall system responsiveness.
Another key capability is Scale to Zero. It introduces intelligent resource management that automatically adjusts compute capacity based on actual usage patterns. This feature represents a paradigm shift in cost optimization for model deployments, allowing endpoints to scale down completely during periods of inactivity while maintaining the ability to scale up quickly when demand returns. This capability is particularly valuable for organizations running multiple models or dealing with variable workload patterns.
Together, these features create a powerful deployment environment that maximizes the benefits of Llama 3.3 70B’s efficient architecture while providing robust tools for managing operational costs and performance.
Conclusion
The combination of Llama 3.3 70B with the advanced inference features of SageMaker AI provides an optimal solution for production deployments. By using Fast Model Loader, Container Caching, and Scale to Zero capabilities, organizations can achieve both high performance and cost-efficiency in their LLM deployments.
We encourage you to try this implementation and share your experiences.
About the authors
Marc Karp is an ML Architect with the Amazon SageMaker Service team. He focuses on helping customers design, deploy, and manage ML workloads at scale. In his spare time, he enjoys traveling and exploring new places.
Saurabh Trikande is a Senior Product Manager for Amazon Bedrock and SageMaker Inference. He is passionate about working with customers and partners, motivated by the goal of democratizing AI. He focuses on core challenges related to deploying complex AI applications, inference with multi-tenant models, cost optimizations, and making the deployment of Generative AI models more accessible. In his spare time, Saurabh enjoys hiking, learning about innovative technologies, following TechCrunch, and spending time with his family.
Melanie Li, PhD, is a Senior Generative AI Specialist Solutions Architect at AWS based in Sydney, Australia, where her focus is on working with customers to build solutions leveraging state-of-the-art AI and machine learning tools. She has been actively involved in multiple Generative AI initiatives across APJ, harnessing the power of Large Language Models (LLMs). Prior to joining AWS, Dr. Li held data science roles in the financial and retail industries.
Adriana Simmons is a Senior Product Marketing Manager at AWS.
Lokeshwaran Ravi is a Senior Deep Learning Compiler Engineer at AWS, specializing in ML optimization, model acceleration, and AI security. He focuses on enhancing efficiency, reducing costs, and building secure ecosystems to democratize AI technologies, making cutting-edge ML accessible and impactful across industries.
Yotam Moss is a Software Development Manager for Inference at AWS AI.
AWS re:Invent 2024 Highlights: Top takeaways from Swami Sivasubramanian to help customers manage generative AI at scale
We spoke with Dr. Swami Sivasubramanian, Vice President of Data and AI, shortly after AWS re:Invent 2024 to hear his impressions—and to get insights on how the latest AWS innovations help meet the real-world needs of customers as they build and scale transformative generative AI applications.
Q: What made this re:Invent different?
Swami Sivasubramanian: The theme I spoke about in my re:Invent keynote was simple but powerful—convergence. I believe that we’re at an inflection point unlike any other in the evolution of AI. We’re seeing a remarkable convergence of data, analytics, and generative AI. It’s a combination that enables next-level generative AI applications that are far more capable. And it lets our customers move faster in a really significant way, getting more value, more quickly. Companies like Rocket Mortgage are building on an AI-driven platform powered by Amazon Bedrock to create AI agents and automate tasks—working to give their employees access to generative AI with no-code tools. Canva uses AWS to power 1.2 million requests a day and sees 450 new designs created every second. There’s also a human side to convergence, as people across organizations are working together in new ways, requiring a deeper level of collaboration between groups, like science and engineering teams. And this isn’t just a one-time collaboration. It’s an ongoing process.
People’s expectations for applications and customer experiences are changing again with generative AI. Increasingly, I think generative AI inference is going to be a core building block for every application. To realize this future, organizations need more than just a chatbot or a single powerful large language model (LLM). At re:Invent, we made some exciting announcements about the future of generative AI, of course. But we also launched a remarkable portfolio of new products, capabilities, and features that will help our customers manage generative AI at scale—making it easier to control costs, build trust, increase productivity, and deliver ROI.
Q: Are there key innovations that build on the experience and lessons learned at Amazon in adopting generative AI? How are you bringing those capabilities to your customers?
Swami Sivasubramanian: Yes. Amazon Nova, our new generation of foundation models (FMs), has state-of-the-art intelligence across a wide range of tasks and industry-leading price performance. Amazon Nova models expand the growing selection of the broadest and most capable FMs in Amazon Bedrock for enterprise customers. Amazon Nova Micro, Lite, and Pro demonstrate exceptional intelligence, capabilities, and speed, and perform quite competitively against the best models in their respective categories. Amazon Nova Canvas, our state-of-the-art image generation model, creates professional grade images from text and image inputs, democratizing access to production-grade visual content for advertising, training, social media, and more. Finally, Amazon Nova Reel offers state-of-the-art video generation that allows customers to create high-quality video from text or images. With about 1,000 generative AI applications in motion inside Amazon, groups like Amazon Ads are using Amazon Nova to remove barriers for sellers and advertisers, enabling new levels of creativity and innovation. New capabilities like image and video generation are helping Amazon Ads customers promote more products in their catalogs, and experiment with new strategies like keyword-level creative to increase engagement and drive sales.
But there’s more ahead, and here’s where an important shift is happening. We’re working on an even more capable any-to-any model where you can provide text, images, audio, and video as input and the model can generate outputs in any of these modalities. And we think this multi-modal approach is how models are going to evolve, moving ahead where one model can accept any kind of input and generate any kind of output. Over time, I think this is what state-of-the-art models will look like.
Q: Speaking of announcements like Amazon Nova, you’ve been a key innovator in AI for many years. What continues to inspire you?
Swami Sivasubramanian: It’s fascinating to think about what LLMs are capable of. What inspires me most, though, is how we can help our customers unblock the challenges they are facing and realize that potential. Consider hallucinations. As highly capable as today’s models are, they still have a tendency to get things wrong occasionally. It’s a challenge that many of our customers struggle with when integrating generative AI into their businesses and moving to production. We explored the problem and asked ourselves if we could do more to help. We looked inward, and leveraged Automated Reasoning, an innovation that Amazon has been using as a behind-the-scenes technology in many of our services like identity and access management.
I like to think of this situation as yin and yang. Automated Reasoning is all about certainty and being able to mathematically prove that something is correct. Generative AI is all about creativity and open-ended responses. Though they might seem like opposites, they’re actually complementary—with Automated Reasoning completing and strengthening generative AI. We’ve found that Automated Reasoning works really well when you have a huge surface area of a problem, a corpus of knowledge about that problem area, and when it’s critical that you get the correct answer—which makes Automated Reasoning a good fit for addressing hallucinations.
At re:Invent, we announced Amazon Bedrock Guardrails Automated Reasoning checks, the first and only generative AI safeguard that helps prevent factual errors due to hallucinations, all by using logically accurate and verifiable reasoning that explains why generative AI responses are correct. I think that it’s an innovation that will have significant impact across organizations and industries, helping build trust and accelerate generative AI adoption.
Q: Controlling costs is important to all organizations, large and small, particularly as they take generative AI applications into production. How do the announcements at re:Invent answer this need?
Swami Sivasubramanian: Like our customers, here at Amazon we’re increasing our investment in generative AI development, with multiple projects in process—all requiring timely access to accelerated compute resources. But allocating optimal compute capacity to each project can create a supply/demand challenge. To address this challenge, we created an internal service that helped Amazon drive utilization of compute resources to more than 90% across all our projects. This service enabled us to smooth out demand across projects and achieve higher capacity utilization, speeding development.
As with Automated Reasoning, we realized that our customers would also benefit from these capabilities. So, at re:Invent, I announced the new task governance capability in Amazon SageMaker HyperPod, which helps our customers optimize compute resource utilization and reduce time to market by up to 40%. With this capability, users can dynamically run tasks across the end-to-end FM workflow, accelerating time to market for AI innovations while avoiding cost overruns due to underutilized compute resources.
Our customers also tell me that the trade-off between cost and accuracy for models is real. We’re answering this need by making it super-easy to evaluate models on Amazon Bedrock, so they don’t have to spend months researching and making comparisons. We’re also lowering costs with game-changing capabilities such as Amazon Bedrock Model Distillation, which pairs models for lower costs; Amazon Bedrock Intelligent Prompt Routing, which manages prompts more efficiently, at scale; and prompt caching, which reduces repeated processing without compromising on accuracy.
Q: Higher productivity is one of the core promises of generative AI. How is AWS helping employees at all levels be more productive?
Swami Sivasubramanian: I like to point out that using generative AI becomes irresistible when it makes employees 10 times more productive. In short, not an incremental increase, but a major leap in productivity. And we’re helping employees get there. For example, Amazon Q Developer is transforming code development by taking care of the time-consuming chores that developers don’t want to deal with, like software upgrades. And it also helps them move much faster by automating code reviews and dealing with mainframe modernization. Consider Novacomp, a leading IT company in Latin America, which leveraged Amazon Q Developer to upgrade a project with over 10,000 lines of Java code in just 50 minutes, a task that would have typically taken an estimated 3 weeks. The company also simplified everyday tasks for developers, reducing its technical debt by 60% on average.
On the business side, Amazon Q Business is bridging the gap between unstructured and structured data, recognizing that most businesses need to draw from a mix of data. With Amazon Q in QuickSight, non-technical users can leverage natural language to build, discover, and share meaningful insights in seconds. Now they can access databases and data warehouses, as well as unstructured business data, like emails, reports, charts, graphs, and images.
And looking ahead, we announced advanced agentic capabilities for Amazon Q Business, coming in 2025, which will use agents to automate complex tasks that stretch across multiple teams and applications. Agents give generative AI applications next-level capabilities, and we’re bringing them to our customers via Amazon Q Business, as well as Amazon Bedrock multi-agent collaboration, which improves successful task completion by 40% over popular solutions. This major improvement translates to more accurate and human-like outcomes in use cases like automating customer support, analyzing financial data for risk management, or optimizing supply-chain logistics.
It’s all part of how we’re enabling greater productivity today, with even more on the horizon.
Q: To get employees and customers adopting generative AI and benefiting from that increased productivity, it has to be trusted. What steps is AWS taking to help build that trust?
Swami Sivasubramanian: I think that lack of trust is a big obstacle to moving from proof of concept to production. Business leaders are about to hit go and they hesitate because they don’t want to lose the trust of their customers. As generative AI continues to drive innovation across industries and our daily life, the need for responsible AI has become increasingly acute. And we’re helping meet that need with innovations like Amazon Bedrock Automated Reasoning, which I mentioned earlier, that works to prevent hallucinations—and increases trust. We also announced new LLM-as-a-judge capabilities with Amazon Bedrock Model Evaluation so you can now perform tests and evaluate other models with humanlike quality at a fraction of the cost and time of running human evaluations. These evaluations assess multiple quality dimensions, including correctness, helpfulness, and responsible AI criteria such as answer refusal and harmfulness.
I should also mention that AWS recently became the first major cloud provider to announce ISO/IEC 42001 accredited certification for AI services, covering Amazon Bedrock, Amazon Q Business, Amazon Textract, and Amazon Transcribe. This international management system standard outlines requirements and controls for organizations to promote the responsible development and use of AI systems. Technical standards like ISO/IEC 42001 are significant because they provide a much-needed common framework for responsible AI development and deployment.
Q: Data remains central to building more personalized experiences applicable to your business. How do the re:Invent launches help AWS customers get their data ready for generative AI?
Swami Sivasubramanian: Generative AI isn’t going to be useful for organizations unless it can seamlessly access and deeply understand the organization’s data. With these insights, our customers can create customized experiences, such as highly personalized customer service agents that can help service representatives resolve issues faster. For AWS customers, getting data ready for generative AI isn’t just a technical challenge—it’s a strategic imperative. Proprietary, high-quality data is the key differentiator in transforming generic AI into powerful, business-specific applications. To prepare for this AI-driven future, we’re helping our customers build a robust, cloud-based data foundation, with built-in security and privacy. That’s the backbone of AI readiness.
With the next generation of Amazon SageMaker announced at re:Invent, we’re introducing an integrated experience to access, govern, and act on all your data by bringing together widely adopted AWS data, analytics, and AI capabilities. Collaborate and build faster from a unified studio using familiar AWS tools for model development, generative AI, data processing, and SQL analytics—with Amazon Q Developer assisting you along the way. Access all your data whether it’s stored in data lakes, data warehouses, third-party or federated data sources. And move with confidence and trust, thanks to built-in governance to address enterprise security needs.
At re:Invent, we also launched key Amazon Bedrock capabilities that help our customers maximize the value of their data. Amazon Bedrock Knowledge Bases now offers the only managed, out-of-the-box Retrieval Augmented Generation (RAG) solution, which enables our customers to natively query their structured data where it resides, accelerating development. Support for GraphRAG generates more relevant responses by modeling and storing relationships between data. And Amazon Bedrock Data Automation transforms unstructured, multimodal data into structured data for generative AI—automatically extracting, transforming, and generating usable data from multimodal content, at scale. These capabilities and more help our customers leverage their data to create powerful, insightful generative AI applications.
Q: What did you take away from your customer conversations at re:Invent?
Swami Sivasubramanian: I continue to be amazed and inspired by our customers and the important work they’re doing. We continue to offer our customers the choice and specialization they need to power their unique use cases. With Amazon Bedrock Marketplace, customers now have access to more than 100 popular, emerging, and specialized models.
At re:Invent, I heard a lot about the new efficiency and transformative experiences customers are creating. I also heard about innovations that are changing people’s lives. Like Exact Sciences, a molecular diagnostic company, which developed an AI-powered solution using Amazon Bedrock to accelerate genetic testing and analysis by 50%. Behind that metric there’s a real human value—enabling earlier cancer detection and personalized treatment planning. And that’s just one story among thousands, as our customers reach higher and build faster, achieving impressive results that change industries and improve lives.
I get excited when I think about how we can help educate the next wave of innovators building these experiences. With the launch of the new Education Equity Initiative, Amazon is committing up to $100 million in cloud technology and technical resources to help existing, dedicated learning organizations reach more learners by creating new and innovative digital learning solutions. That’s truly inspiring to me.
In fact, the pace of change, the remarkable innovations we introduced at re:Invent, and the enthusiasm of our customers all reminded me of the early days of AWS, when anything seemed possible. And now, it still is.
About the author
Swami Sivasubramanian is VP, AWS AI & Data. In this role, Swami oversees all AWS Database, Analytics, and AI & Machine Learning services. His team’s mission is to help organizations put their data to work with a complete, end-to-end data solution to store, access, analyze, visualize, and predict.
Economics Nobelist on causal inference
In a keynote address at the latest Amazon Machine Learning Conference, Amazon Visiting Academic, Stanford professor, and recent Nobel laureate Guido Imbens offered insights on the estimation of causal effects in “panel data” settings.
Multi-tenant RAG with Amazon Bedrock Knowledge Bases
Organizations are continuously seeking ways to use their proprietary knowledge and domain expertise to gain a competitive edge. With the advent of foundation models (FMs) and their remarkable natural language processing capabilities, a new opportunity has emerged to unlock the value of their data assets.
As organizations strive to deliver personalized experiences to customers using generative AI, it becomes paramount to specialize the behavior of FMs using their own—and their customers’—data. Retrieval Augmented Generation (RAG) has emerged as a simple yet effective approach to achieve a desired level of specialization.
Amazon Bedrock Knowledge Bases is a fully managed capability that simplifies the management of the entire RAG workflow, empowering organizations to give FMs and agents contextual information from the company’s private data sources to deliver more relevant and accurate responses tailored to their specific needs.
For organizations developing multi-tenant products, such as independent software vendors (ISVs) creating software as a service (SaaS) offerings, the ability to personalize experiences for each of their customers (tenants in their SaaS application) is particularly significant. This personalization can be achieved by implementing a RAG approach that selectively uses tenant-specific data.
In this post, we discuss and provide examples of how to achieve personalization using Amazon Bedrock Knowledge Bases. We focus particularly on addressing the multi-tenancy challenges that ISVs face, including data isolation, security, tenant management, and cost management. We focus on scenarios where the RAG architecture is integrated into the ISV application and not directly exposed to tenants. Although the specific implementations presented in this post use Amazon OpenSearch Service as a vector database to store tenants’ data, the challenges and architecture solutions proposed can be extended and tailored to other vector store implementations.
Multi-Tenancy design considerations
When architecting a multi-tenanted RAG system, organizations need to take several considerations into account:
- Tenant isolation – One crucial consideration in designing multi-tenanted systems is the level of isolation between the data and resources related to each tenant. These resources include data sources, ingestion pipelines, vector databases, and RAG client application. The level of isolation is typically governed by security, performance, and the scalability requirements of your solution, together with your regulatory requirements. For example, you may need to encrypt the data related to each of your tenants using a different encryption key. You may also need to make sure that high activity generated by one of the tenants doesn’t affect other tenants.
- Tenant variability – A similar yet distinct consideration is the level of variability of the features provided to each tenant. In the context of RAG systems, tenants might have varying requirements for data ingestion frequency, document chunking strategy, or vector search configuration.
- Tenant management simplicity – Multi-tenant solutions need a mechanism for onboarding and offboarding tenants. This dimension determines the degree of complexity for this process, which might involve provisioning or tearing down tenant-specific infrastructure, such as data sources, ingestion pipelines, vector databases, and RAG client applications. This process could also involve adding or deleting tenant-specific data in its data sources.
- Cost-efficiency – The operating costs of a multi-tenant solution depend on the way it provides the isolation mechanism for tenants, so designing a cost-efficient architecture for the solution is crucial.
These four considerations need to be carefully balanced and weighted to suit the needs of the specific solution. In this post, we present a model to simplify the decision-making process. Using the core isolation concepts of silo, pool, and bridge defined in the SaaS Tenant Isolation Strategies whitepaper, we propose three patterns for implementing a multi-tenant RAG solution using Amazon Bedrock Knowledge Bases, Amazon Simple Storage Service (Amazon S3), and OpenSearch Service.
A typical RAG solution using Amazon Bedrock Knowledge Bases is composed of several components, as shown in the following figure:
- A data source, such as an S3 bucket
- A knowledge base including a data source
- A vector database such as an Amazon OpenSearch Serverless collection and index or other supported vector databases
- A RAG client application
The main challenge in adapting this architecture for multi-tenancy is determining how to provide isolation between tenants for each of the components. We propose three prescriptive patterns that cater to different use cases and offer varying levels of isolation, variability, management simplicity, and cost-efficiency. The following figure illustrates the trade-offs between these three architectural patterns in terms of achieving tenant isolation, variability, cost-efficiency, and ease of tenant management.
Multi-tenancy patterns
In this section, we describe the implementation of these three different multi-tenancy patterns in a RAG architecture based on Amazon Bedrock Knowledge Bases, discussing their use cases as well as their pros and cons.
Silo
The silo pattern, illustrated in the following figure, offers the highest level of tenant isolation, because the entire stack is deployed and managed independently for each single tenant.
In the context of the RAG architecture implemented by Amazon Bedrock Knowledge Bases, this pattern prescribes the following:
- A separate data source per tenant – In this post, we consider the scenario in which tenant documents to be vectorized are stored in Amazon S3, therefore a separate S3 bucket is provisioned per tenant. This allows for per-tenant AWS Key Management Service (AWS KMS) encryption keys, as well as per-tenant S3 lifecycle policies to manage object expiration, and object versioning policies to maintain multiple versions of objects. Having separate buckets per tenant provides isolation and allows for customized configurations based on tenant requirements.
- A separate knowledge base per tenant – This allows for a separate chunking strategy per tenant, and it’s particularly useful if you expect your tenants’ document bases to be different in nature. For example, one of your tenants might have a document base composed of flat text documents, which can be treated with fixed-size chunking, whereas another tenant might have a document base with explicit sections, for which semantic chunking would be better suited. Having a different knowledge base per tenant also lets you decide on different embedding models, giving you the possibility to choose different vector dimensions, balancing accuracy, cost, and latency. You can choose a different KMS key per tenant for the transient data stores, which Amazon Bedrock uses for end-to-end per-tenant encryption. You can also choose per-tenant data deletion policies to control whether your vectors are deleted from the vector database when a knowledge base is deleted. Separate knowledge bases also mean that you can have different ingestion schedules per tenant, allowing you to agree to different data freshness standards with your customers.
- A separate OpenSearch Serverless collection per tenant – Having a separate OpenSearch Serverless collection per tenant allows you to have a separate KMS encryption key per tenant, maintaining per-tenant end-to-end encryption. For each tenant-specific collection, you can create a separate vector index, therefore choosing for each tenant the distance metric between Euclidean and dot product, so that you can choose how much importance to give to the document length. You can also choose the specific settings for the HNSW algorithm per tenant to control memory consumption, cost, and indexing time. Each vector index, in conjunction with the setup of metadata mappings in your knowledge base, can have a different metadata set per tenant, which can be used to perform filtered searches. Metadata filtering can be used in the silo pattern to restrict the search to a subset of documents with a specific characteristic. For example, one of your tenants might be uploading dated documents and wants to filter documents pertaining to a specific year, whereas another tenant might be uploading documents coming from different company divisions and wants to filter over the documentation of a specific company division.
Because the silo pattern offers tenant architectural independence, onboarding and offboarding a tenant means creating and destroying the RAG stack for that tenant, composed of the S3 bucket, knowledge base, and OpenSearch Serverless collection. You would typically do this using infrastructure as code (IaC). Depending on your application architecture, you may also need to update the log sinks and monitoring systems for each tenant.
Although the silo pattern offers the highest level of tenant isolation, it is also the most expensive to implement, mainly due to creating a separate OpenSearch Serverless collection per tenant for the following reasons:
- Minimum capacity charges – Each OpenSearch Serverless collection encrypted with a separate KMS key has a minimum of 2 OpenSearch Compute Units (OCUs) charged hourly. These OCUs are charged independently from usage, meaning that you will incur charges for dormant tenants if you choose to have a separate KMS encryption key per tenant.
- Scalability overhead – Each collection separately scales OCUs depending on usage, in steps of 6 GB of memory, and associated vCPUs and fast access storage. This means that resources might not be fully and optimally utilized across tenants.
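The minimum capacity charge described above can be turned into a rough cost floor. The following is a minimal sketch, not a pricing tool: the function name, the 730-hour month, and the example price per OCU-hour are assumptions for illustration, and actual prices vary by Region.

```python
HOURS_PER_MONTH = 730  # assumed average month length


def min_monthly_ocu_cost(collections, price_per_ocu_hour,
                         min_ocus=2, hours=HOURS_PER_MONTH):
    """Floor on monthly OCU charges, independent of usage.

    Each collection encrypted with its own KMS key is billed for at
    least `min_ocus` OCUs every hour, even when the tenant is dormant.
    """
    return collections * min_ocus * price_per_ocu_hour * hours


# Hypothetical price; check the current OpenSearch Serverless pricing.
EXAMPLE_PRICE = 0.24

silo_floor = min_monthly_ocu_cost(collections=10, price_per_ocu_hour=EXAMPLE_PRICE)
shared_floor = min_monthly_ocu_cost(collections=1, price_per_ocu_hour=EXAMPLE_PRICE)
print(f"10 per-tenant collections: {silo_floor:.2f} per month minimum")
print(f"1 shared collection:       {shared_floor:.2f} per month minimum")
```

Because the floor grows linearly with the number of collections, the gap between per-tenant and shared collections widens with every dormant tenant.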
When choosing the silo pattern, note that a maximum of 100 knowledge bases are supported in each AWS account. This makes the silo pattern favorable for your largest tenants with specific isolation requirements. Having a separate knowledge base per tenant also reduces the impact of quotas on concurrent ingestion jobs (maximum one concurrent job per KB, five per account), job size (100 GB per job), and data sources (maximum of 5 million documents per data source). It also improves the performance fairness as perceived by your tenants.
Deleting a knowledge base during offboarding a tenant might be time-consuming, depending on the size of the data sources and the synchronization process. To mitigate this, you can set the data deletion policy in your tenants’ knowledge bases to RETAIN. This way, the knowledge base deletion process will not delete your tenants’ data from the OpenSearch Service index. You can delete the index by deleting the OpenSearch Serverless collection.
Pool
In contrast with the silo pattern, in the pool pattern, illustrated in the following figure, the whole end-to-end RAG architecture is shared by your tenants, making it particularly suitable to accommodate many small tenants.
The pool pattern prescribes the following:
- Single data source – The tenants’ data is stored within the same S3 bucket. This implies that the pool model supports a shared KMS key for encryption at rest, not offering the possibility of per-tenant encryption keys. To identify tenant ownership downstream for each document uploaded to Amazon S3, a corresponding JSON metadata file has to be generated and uploaded. The metadata file generation process can be asynchronous, or even batched for multiple files, because Amazon Bedrock Knowledge Bases requires an explicit triggering of the ingestion job. The metadata file must use the same name as its associated source document file, with .metadata.json appended to the end of the file name, and must be stored in the same folder or location as the source file in the S3 bucket. The following code is an example of the format:
In the preceding JSON structure, the key tenantId has been deliberately chosen, and can be changed to a key you want to use to express tenancy. The tenancy field will be used at runtime to filter documents belonging to a specific tenant, therefore the filtering key at runtime must match the metadata key in the JSON used to index the documents. Additionally, you can include other metadata keys to perform further filtering that isn’t based on tenancy. If you don’t upload the object.metadata.json file, the client application won’t be able to find the document using metadata filtering.
- Single knowledge base – A single knowledge base is created to handle the data ingestion for your tenants. This means that your tenants will share the same chunking strategy and embedding model, and share the same encryption at-rest KMS key. Moreover, because ingestion jobs are triggered per data source per KB, you will be restricted to offer to your tenants the same data freshness standards.
- Single OpenSearch Serverless collection and index – Your tenant data is pooled in a single OpenSearch Service vector index, therefore your tenants share the same KMS encryption key for vector data, and the same HNSW parameters for indexing and query. Because tenant data isn’t physically segregated, it’s crucial that the query client be able to filter results for a single tenant. This can be efficiently achieved using either the Amazon Bedrock Knowledge Bases Retrieve or RetrieveAndGenerate API, expressing the tenant filtering condition as part of the retrievalConfiguration (for more details, see Amazon Bedrock Knowledge Bases now supports metadata filtering to improve retrieval accuracy). If you want to restrict the vector search to return results for tenant_1, the following is an example client implementation performing RetrieveAndGenerate based on the AWS SDK for Python (Boto3):
The text field contains the original user query that needs to be answered, taking into account the document base. <YOUR_KNOWLEDGEBASE_ID> needs to be substituted with the identifier of the knowledge base used to pool your tenants, and <FM_ARN> needs to be substituted with the Amazon Bedrock model Amazon Resource Name (ARN) you want to use to reply to the user query. The client presented in the preceding code has been streamlined to present the tenant filtering functionality. In a production case, we recommend implementing session and error handling, logging and retry logic, and separating the tenant filtering logic from the client invocation to make it inaccessible to developers.
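The two pool-pattern mechanics described above, the per-document metadata file and the tenant filter at query time, can be sketched together. This is a minimal illustration under stated assumptions, not the authors' client: the helper names are hypothetical, and the metadataAttributes wrapper and the equals filter shape reflect our reading of the Amazon Bedrock Knowledge Bases documentation, so verify them against the current API reference.

```python
import json

TENANT_KEY = "tenantId"  # must be the same key at ingestion and at query time


def metadata_sidecar(object_key, tenant_id):
    """Return (S3 key, JSON body) of the metadata file for a source document.

    The sidecar lives next to the document and reuses its name with
    '.metadata.json' appended.
    """
    body = json.dumps({"metadataAttributes": {TENANT_KEY: tenant_id}})
    return f"{object_key}.metadata.json", body


def build_pool_request(query, tenant_id, kb_id, model_arn):
    """Build RetrieveAndGenerate arguments restricted to a single tenant."""
    return {
        "input": {"text": query},
        "retrieveAndGenerateConfiguration": {
            "type": "KNOWLEDGE_BASE",
            "knowledgeBaseConfiguration": {
                "knowledgeBaseId": kb_id,
                "modelArn": model_arn,
                "retrievalConfiguration": {
                    "vectorSearchConfiguration": {
                        # Only chunks whose metadata matches the tenant are searched.
                        "filter": {"equals": {"key": TENANT_KEY, "value": tenant_id}}
                    }
                },
            },
        },
    }


sidecar_key, sidecar_body = metadata_sidecar("docs/report.pdf", "tenant_1")
request = build_pool_request(
    "What changed last quarter?", "tenant_1", "<YOUR_KNOWLEDGEBASE_ID>", "<FM_ARN>"
)
```

With Boto3, the request would be passed as boto3.client("bedrock-agent-runtime").retrieve_and_generate(**request). Keeping the filter construction in one server-side helper, with the tenant identifier taken from the authenticated session rather than from user input, is one way to follow the recommendation to separate tenant filtering from the client invocation.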
Because the end-to-end architecture is pooled in this pattern, onboarding and offboarding a tenant doesn’t require you to create new physical or logical constructs; it’s as simple as starting or stopping the upload of that tenant’s documents to Amazon S3. It also means that there is no AWS managed API that can be used to offboard and end-to-end forget a specific tenant. To delete the historical documents belonging to a specific tenant, you can delete the relevant objects in Amazon S3. Typically, customers will have an external application that maintains the list of available tenants and their status, facilitating the onboarding and offboarding process.
Sharing the monitoring system and logging capabilities in this pattern reduces the complexity of operations with a large number of tenants. However, it requires you to collect the tenant-specific metrics from the client side to perform specific tenant attribution.
The pool pattern optimizes the end-to-end cost of your RAG architecture, because sharing OCUs across tenants maximizes the use of each OCU and minimizes the tenants’ idle time. Sharing the same pool of OCUs across tenants means that this pattern doesn’t offer performance isolation at the vector store level, so the largest and most active tenants might impact the experience of other tenants.
When choosing the pool pattern for your RAG architecture, you should be aware that a single ingestion job can ingest or delete a maximum of 100 GB. Additionally, the data source can have a maximum of 5 million documents. If the solution has many tenants that are geographically distributed, consider triggering the ingestion job multiple times a day so you don’t hit the ingestion job size limit. Also, depending on the number and size of your documents to be synchronized, the time for ingestion will be determined by the embedding model invocation rate. For example, consider the following scenario:
- Number of tenants to be synchronized = 10
- Average number of documents per tenant = 100
- Average size per document = 2 MB, containing roughly 200,000 tokens divided into 220 chunks of 1,000 tokens to allow for overlap
- Using Amazon Titan Embeddings v2 on demand, allowing for 2,000 RPM and 300,000 TPM
This would result in the following:
- Total embeddings requests = 10*100*220 = 220,000
- Total tokens to process = 10*100*220*1,000 = 220,000,000
- Time bound by the request quota = 220,000/2,000 = 110 minutes (1 hour, 50 minutes)
- Time bound by the token quota = 220,000,000/300,000 ≈ 733 minutes (about 12 hours, 13 minutes)
- Total time taken to embed is dominated by the TPM, therefore about 12 hours
This means a synchronization of this size could run at most once or twice per day, so plan the time distribution of data to be ingested accordingly. This calculation is a best-case scenario and doesn’t account for the latency introduced by the FM when creating the vector from the chunk. If you expect having to synchronize a large number of tenants at the same time, consider using provisioned throughput to decrease the time it takes to create vector embeddings. This approach will also help distribute the load on the embedding models, limiting throttling of the Amazon Bedrock runtime API calls.
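The estimate above can be reproduced for other tenant counts and quotas with a small helper. This is a sketch under the scenario's assumptions: the function name and return shape are illustrative, and the result is a lower bound that ignores model latency. It computes the time implied by the request quota and by the token quota and reports the larger of the two.

```python
def embedding_time_minutes(tenants, docs_per_tenant, chunks_per_doc,
                           tokens_per_chunk, rpm, tpm):
    """Lower bound on embedding time implied by on-demand quotas."""
    requests = tenants * docs_per_tenant * chunks_per_doc  # one request per chunk
    tokens = requests * tokens_per_chunk
    by_requests = requests / rpm   # minutes if only the RPM quota applied
    by_tokens = tokens / tpm       # minutes if only the TPM quota applied
    return {
        "requests": requests,
        "tokens": tokens,
        "minutes_by_rpm": by_requests,
        "minutes_by_tpm": by_tokens,
        "minutes": max(by_requests, by_tokens),  # the binding quota wins
    }


estimate = embedding_time_minutes(
    tenants=10, docs_per_tenant=100, chunks_per_doc=220,
    tokens_per_chunk=1_000, rpm=2_000, tpm=300_000,
)
for name, value in estimate.items():
    print(f"{name}: {value:,.0f}")
```

Dividing 24 hours by the resulting duration gives an upper limit on how many such synchronizations fit in a day.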
Bridge
The bridge pattern, illustrated in the following figure, strikes a balance between the silo and pool patterns, offering a middle ground that balances tenant data isolation and security.
The bridge pattern delivers the following characteristics:
- Separate data source per tenant in a common S3 bucket – Tenant data is stored in the same S3 bucket, but prefixed by a tenant identifier. Although having a different prefix per tenant doesn’t offer the possibility of using per-tenant encryption keys, it does create a logical separation that can be used to segregate data downstream in the knowledge bases.
- Separate knowledge base per tenant – This pattern prescribes creating a separate knowledge base per tenant similar to the silo pattern. Therefore, the considerations in the silo pattern apply. Applications built using the bridge pattern usually share query clients across tenants, so they need to identify the specific tenant’s knowledge base to query. They can identify the knowledge base by storing the tenant-to-knowledge base mapping in an external database, which manages tenant-specific configurations. The following example shows how to store this tenant-specific information in an Amazon DynamoDB table:
In a production setting, your application will store tenant-specific parameters belonging to other functionality in your data stores. Depending on your application architecture, you might choose to store knowledgebaseId and modelARN alongside the other tenant-specific parameters, or create a separate data store (for example, the tenantKbConfig table) specifically for your RAG architecture.
This mapping can then be used by the client application when invoking the RetrieveAndGenerate API. The following is an example implementation:
- Separate OpenSearch Service index per tenant – You store data within the same OpenSearch Serverless collection, but you create a vector index per tenant. This implies your tenants share the same KMS encryption key and the same pool of OCUs, optimizing the OpenSearch Service resources usage for indexing and querying. The separation in vector indexes gives you the flexibility of choosing different HNSW parameters per tenant, letting you tailor the performance of your k-NN indexing and querying for your different tenants.
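The tenant-to-knowledge-base lookup described for the bridge pattern can be sketched as follows. This is a minimal illustration, not the authors' implementation: an in-memory dictionary stands in for the tenantKbConfig table, the helper names are hypothetical, and the placeholder identifiers must be replaced with real values.

```python
# Stand-in for the tenantKbConfig table; in practice this would be a
# DynamoDB lookup keyed by the tenant identifier.
TENANT_KB_CONFIG = {
    "tenant_1": {"knowledgebaseId": "<KB_ID_TENANT_1>", "modelARN": "<FM_ARN>"},
    "tenant_2": {"knowledgebaseId": "<KB_ID_TENANT_2>", "modelARN": "<FM_ARN>"},
}


def resolve_tenant_config(tenant_id, table=TENANT_KB_CONFIG):
    """Return the knowledge base settings for a tenant, or fail loudly."""
    try:
        return table[tenant_id]
    except KeyError:
        raise LookupError(f"unknown tenant: {tenant_id}") from None


def build_bridge_request(query, tenant_id, table=TENANT_KB_CONFIG):
    """Build RetrieveAndGenerate arguments for the tenant's own knowledge base."""
    config = resolve_tenant_config(tenant_id, table)
    return {
        "input": {"text": query},
        "retrieveAndGenerateConfiguration": {
            "type": "KNOWLEDGE_BASE",
            "knowledgeBaseConfiguration": {
                "knowledgeBaseId": config["knowledgebaseId"],
                "modelArn": config["modelARN"],
            },
        },
    }


request = build_bridge_request("Summarize our onboarding guide", "tenant_1")
```

No metadata filter is needed here, because each knowledge base only contains one tenant's data. Failing on an unknown tenant, rather than falling back to a default knowledge base, avoids accidentally querying another tenant's documents.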
The bridge pattern supports up to 100 tenants, because each tenant requires its own knowledge base and a maximum of 100 knowledge bases are supported in each AWS account. Onboarding and offboarding a tenant requires the creation and deletion of a knowledge base and OpenSearch Service vector index. To delete the data pertaining to a particular tenant, you can delete the created resources and use the tenant-specific prefix as a logical parameter in your Amazon S3 API calls. Unlike the silo pattern, the bridge pattern doesn’t allow for per-tenant end-to-end encryption; however, it offers the same level of tenant customization offered by the silo pattern while optimizing costs.
Summary of differences
The following figure and table provide a consolidated view for comparing the characteristics of the different multi-tenant RAG architecture patterns. This comprehensive overview highlights the key attributes and trade-offs associated with the pool, bridge, and silo patterns, enabling informed decision-making based on specific requirements.
The following figure illustrates the mapping of design characteristics to components of the RAG architecture.
The following table summarizes the characteristics of the multi-tenant RAG architecture patterns.
Characteristic | Attribute of | Pool | Bridge | Silo |
Per-tenant chunking strategy | Amazon Bedrock Knowledge Base Data Source | No | Yes | Yes |
Customer managed key for encryption of transient data and at rest | Amazon Bedrock Knowledge Base Data Source | No | No | Yes |
Per-tenant distance measure | Amazon OpenSearch Service Index | No | Yes | Yes |
Per-tenant ANN index configuration | Amazon OpenSearch Service Index | No | Yes | Yes |
Per-tenant data deletion policies | Amazon Bedrock Knowledge Base Data Source | No | Yes | Yes |
Per-tenant vector size | Amazon Bedrock Knowledge Base Data Source | No | Yes | Yes |
Tenant performance isolation | Vector database | No | No | Yes |
Tenant onboarding and offboarding complexity | Overall solution | Simplest, requires management of new tenants in existing infrastructure | Medium, requires minimal management of end-to-end infrastructure | Hardest, requires management of end-to-end infrastructure |
Query client implementation | Original Data Source | Medium, requires dynamic filtering | Hardest, requires external tenant mapping table | Simplest, same as single-tenant implementation |
Amazon S3 tenant management complexity | Amazon S3 buckets and objects | Hardest, need to maintain tenant specific metadata files for each object | Medium, each tenant needs a different S3 path | Simplest, each tenant requires a different S3 bucket |
Cost | Vector database | Lowest | Medium | Highest |
Per-tenant FM used to create vector embeddings | Amazon Bedrock Knowledge Base | No | Yes | Yes |
Conclusion
This post explored three distinct patterns for implementing a multi-tenant RAG architecture using Amazon Bedrock Knowledge Bases and OpenSearch Service. The silo, pool, and bridge patterns offer varying levels of tenant isolation, variability, management simplicity, and cost-efficiency, catering to different use cases and requirements. By understanding the trade-offs and considerations associated with each pattern, organizations can make informed decisions and choose the approach that best aligns with their needs.
Get started with Amazon Bedrock Knowledge Bases today.
About the Authors
Emanuele Levi is a Solutions Architect in the Enterprise Software and SaaS team, based in London. Emanuele helps UK customers on their journey to refactor monolithic applications into modern microservices SaaS architectures. Emanuele is mainly interested in event-driven patterns and designs, especially when applied to analytics and AI, where he has expertise in the fraud-detection industry.
Mehran Nikoo is a Generative AI Go-To-Market Specialist at AWS. He leads the generative AI go-to-market strategy for UK and Ireland.
Dani Mitchell is a Generative AI Specialist Solutions Architect at AWS. He is focused on computer vision use case and helps AWS customers in EMEA accelerate their machine learning and generative AI journeys with Amazon SageMaker and Amazon Bedrock.
How Amazon trains sequential ensemble models at scale with Amazon SageMaker Pipelines
Amazon SageMaker Pipelines includes features that allow you to streamline and automate machine learning (ML) workflows. This allows scientists and model developers to focus on model development and rapid experimentation rather than infrastructure management.
Pipelines lets you orchestrate complex ML workflows with a simple Python SDK and visualize those workflows through SageMaker Studio. This helps with data preparation and feature engineering tasks and model training and deployment automation. Pipelines also integrates with Amazon SageMaker Automatic Model Tuning, which can automatically find the hyperparameter values that result in the best performing model, as determined by your chosen metric.
Ensemble models are becoming popular within the ML communities. They generate more accurate predictions by combining the predictions of multiple models. Pipelines can quickly be used to create an end-to-end ML pipeline for ensemble models. This enables developers to build highly accurate models while maintaining efficiency and reproducibility.
In this post, we provide an example of an ensemble model that was trained and deployed using Pipelines.
Use case overview
Sales representatives generate new leads and create opportunities within Salesforce to track them. The following application is an ML approach using unsupervised learning to automatically identify use cases in each opportunity based on various text information, such as name, description, details, and product service group.
Preliminary analysis showed that use cases vary by industry and different use cases have a very different distribution of annualized revenue and can help with segmentation. Hence, a use case is an important predictive feature that can optimize analytics and improve sales recommendation models.
We can treat the use case identification as a topic identification problem and we explore different topic identification models such as Latent Semantic Analysis (LSA), Latent Dirichlet Allocation (LDA), and BERTopic. In both LSA and LDA, each document is treated as a collection of words only and the order of the words or grammatical role does not matter, which may cause some information loss in determining the topic. Moreover, they require a pre-determined number of topics, which was hard to determine in our data set. Since, BERTopic overcame the above problem, it was used in order to identify the use case.
The approach uses three sequential BERTopic models to generate the final clustering in a hierarchical method.
Each BERTopic model consists of four parts:
- Embedding – Different embedding methods can be used in BERTopic. In this scenario, input data comes from various areas and is usually inputted manually. As a result, we use sentence embedding to ensure scalability and fast processing.
- Dimension reduction – We use Uniform Manifold Approximation and Projection (UMAP), which is an unsupervised and nonlinear dimension reduction method, to reduce high dimension text vectors.
- Clustering – We use the Balanced Iterative Reducing and Clustering using Hierarchies (BIRCH) method to form different use case clusters.
- Keyword identification – We use class-based TF-IDF to extract the most representative words from each cluster.
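The keyword identification part can be illustrated with a small, dependency-free sketch of class-based TF-IDF. It assumes the weighting commonly described for BERTopic, W(t, c) = tf(t, c) · log(1 + A / f(t)), where each cluster is treated as one concatenated document, A is the average number of words per cluster, and f(t) is the frequency of term t across all clusters. The function names and the toy clusters are illustrative, and BERTopic's own implementation differs in details such as tokenization and normalization.

```python
import math
from collections import Counter


def class_tfidf(docs_by_cluster):
    """Score each term per cluster with tf(t, c) * log(1 + A / f(t))."""
    # Treat every cluster as a single concatenated document and count its terms.
    tf = {
        cluster: Counter(" ".join(docs).lower().split())
        for cluster, docs in docs_by_cluster.items()
    }
    # f(t): frequency of each term across all clusters.
    total = Counter()
    for counts in tf.values():
        total.update(counts)
    # A: average number of words per cluster.
    avg_words = sum(total.values()) / len(tf)
    return {
        cluster: {
            term: n * math.log(1 + avg_words / total[term])
            for term, n in counts.items()
        }
        for cluster, counts in tf.items()
    }


def top_terms(scores, k=3):
    """Return the k highest-scoring terms per cluster (ties broken alphabetically)."""
    return {
        cluster: [t for t, _ in sorted(s.items(), key=lambda kv: (-kv[1], kv[0]))[:k]]
        for cluster, s in scores.items()
    }


# Toy clusters, invented for illustration only.
clusters = {
    "migration": ["database migration to cloud", "migrate database workloads"],
    "analytics": ["sales analytics dashboard", "analytics for sales data"],
}
scores = class_tfidf(clusters)
keywords = top_terms(scores, k=2)
```

Terms that are frequent inside one cluster but rare overall receive the highest scores, which is what makes them useful as cluster labels.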
Sequential ensemble model
There is no predetermined number of topics, so we set the number of clusters as an input in the range of 15–25 topics. Upon observation, some of the topics are wide and general. Therefore, another layer of the BERTopic model is applied individually to them. After the newly identified topics from the second-layer model are combined with the original topics from the first-layer results, postprocessing is performed manually to finalize topic identification. Lastly, a third layer is used for some of the clusters to create sub-topics.
To enable the second- and third-layer models to work effectively, you need a mapping file to map results from previous models to specific words or phrases. This helps make sure that the clustering is accurate and relevant.
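As a rough illustration of how a mapping can connect one layer to the next, the following sketch routes documents from broad first-layer topics to a second layer and keeps the rest as final. The mapping structure (topic ID to label), the set of broad labels, and the function name are all assumptions made for this example; the post does not describe the actual file format.

```python
from collections import defaultdict


def route_to_next_layer(docs, topic_ids, mapping, broad_labels):
    """Split documents into groups for the next layer and finalized labels.

    mapping: hypothetical dict from a layer's topic ID to a label.
    broad_labels: labels judged too general, which need another clustering layer.
    """
    next_layer_groups = defaultdict(list)  # label -> documents to re-cluster
    finalized = {}                         # document index -> final label
    for index, (doc, topic_id) in enumerate(zip(docs, topic_ids)):
        label = mapping[topic_id]
        if label in broad_labels:
            next_layer_groups[label].append(doc)
        else:
            finalized[index] = label
    return dict(next_layer_groups), finalized


# Toy data, invented for illustration only.
docs = ["move workloads to cloud", "build a data lake", "general cloud adoption"]
topic_ids = [0, 1, 0]
mapping = {0: "general cloud", 1: "data lake"}
groups, finalized = route_to_next_layer(docs, topic_ids, mapping, {"general cloud"})
```

Each group in `groups` would then be clustered on its own by the next layer, and the mapping would be revised to include the new sub-topics.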
We’re using Bayesian optimization for hyperparameter tuning and cross-validation to reduce overfitting. The data set contains features such as opportunity name, opportunity details, needs, associated product name, product details, and product groups. The models are evaluated using a customized loss function, and the best embedding model is selected.
Challenges and considerations
Here are some of the challenges and considerations of this solution:
- The pipeline’s data preprocessing capability is crucial for enhancing model performance. With the ability to preprocess incoming data prior to training, we can make sure that our models are fed with high-quality data. The preprocessing and data cleaning steps include converting all text columns to lowercase; removing template elements, contractions, URLs, emails, and so on; removing non-relevant NER labels; and lemmatizing the combined text. The result is more accurate and reliable predictions.
- We need a compute environment that is highly scalable so that we can effortlessly handle and train millions of rows of data. This allows us to perform large-scale data processing and modeling tasks with ease and reduces development time and costs.
- Because every step of the ML workflow has different resource requirements, a flexible and adaptable pipeline is essential for efficient resource allocation. By optimizing resource usage for each step, we can reduce the overall processing time, resulting in faster model development and deployment.
- Running custom scripts for data processing and model training requires the availability of required frameworks and dependencies.
- Coordinating the training of multiple models can be challenging, especially when each subsequent model depends on the output of the previous one. The process of orchestrating the workflow between these models can be complex and time-consuming.
- Following each training layer, it’s necessary to revise a mapping that reflects the topics produced by the model and use it as an input for the subsequent model layer.
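The text cleaning mentioned in the first consideration can be sketched with the standard library alone. This is a minimal example under stated assumptions: the contraction table is a tiny illustrative subset, the regular expressions are simplified, and NER filtering and lemmatization are left out because they need an NLP library. The actual `preprocess.py` used in the solution is not shown in this post.

```python
import re

# Tiny illustrative contraction table; a real pipeline would use a fuller list.
CONTRACTIONS = {
    "can't": "cannot",
    "won't": "will not",
    "it's": "it is",
    "we're": "we are",
}

# Simplified patterns for URLs and email addresses.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")


def clean_text(text):
    """Lowercase, strip URLs and emails, expand contractions, collapse whitespace."""
    text = text.lower()
    text = URL_RE.sub(" ", text)
    text = EMAIL_RE.sub(" ", text)
    for short, full in CONTRACTIONS.items():
        text = text.replace(short, full)
    return " ".join(text.split())


cleaned = clean_text(
    "Contact Jane.Doe@example.com or visit https://example.com/x because it CAN'T wait"
)
```

In a processing job, a function like this would be applied to each text column before the columns are combined and lemmatized.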
Solution overview
In this solution, the entry point is Amazon SageMaker Studio, which is a web-based integrated development environment (IDE) provided by AWS that enables data scientists and ML developers to build, train, and deploy ML models at scale in a collaborative and efficient manner.
The following diagram illustrates the high-level architecture of the solution.
As part of the architecture, we’re using the following SageMaker pipeline steps:
- SageMaker Processing – This step allows you to preprocess and transform data before training. One benefit of this step is the ability to use built-in algorithms for common data transformations and automatic scaling of resources. You can also use custom code for complex data preprocessing, and it allows you to use custom container images.
- SageMaker Training – This step allows you to train ML models using SageMaker-built-in algorithms or custom code. You can use distributed training to accelerate model training.
- SageMaker Callback – This step allows you to run custom code during the ML workflow, such as sending notifications or triggering additional processing steps. You can run external processes and resume the pipeline workflow on completion in this step.
- SageMaker Model – This step allows you to create a model or register it with Amazon SageMaker.
Implementation Walkthrough
First, we set up the SageMaker pipeline:
import boto3
import sagemaker
from sagemaker.workflow.pipeline_context import PipelineSession
# create a Session with a custom Region (for example, us-east-1); will be None if not specified
region = "<your-region-name>"
# allocate the default S3 bucket for the SageMaker session; will be None if not specified
default_bucket = "<your-s3-bucket>"
boto_session = boto3.Session(region_name=region)
sagemaker_client = boto_session.client("sagemaker")
# Initialize a SageMaker session
sagemaker_session = sagemaker.session.Session(
    boto_session=boto_session,
    sagemaker_client=sagemaker_client,
    default_bucket=default_bucket,
)
# Set the SageMaker execution role for the session
role = sagemaker.session.get_execution_role(sagemaker_session)
# Manage interactions under the pipeline context
pipeline_session = PipelineSession(
    boto_session=boto_session,
    sagemaker_client=sagemaker_client,
    default_bucket=default_bucket,
)
# Define the base image for the scripts to run on
account_id = role.split(":")[4]
# create a base image that takes care of dependencies
ecr_repository_name = "<your-base-image-to-run-script>"
tag = "latest"
container_image_uri = "{0}.dkr.ecr.{1}.amazonaws.com/{2}:{3}".format(account_id, region, ecr_repository_name, tag)
The following is a detailed explanation of the workflow steps:
- Preprocess the data – This involves cleaning and preparing the data for feature engineering and splitting the data into train, test, and validation sets.
import os
from sagemaker.workflow.parameters import ParameterString
from sagemaker.workflow.steps import ProcessingStep
from sagemaker.processing import (
    ProcessingInput,
    ProcessingOutput,
    ScriptProcessor,
)
BASE_DIR = os.path.dirname(os.path.realpath(__file__))
processing_instance_type = ParameterString(
    name="ProcessingInstanceType",
    # choose an instance type suitable for the job
    default_value="ml.m5.4xlarge",
)
script_processor = ScriptProcessor(
    image_uri=container_image_uri,
    command=["python"],
    instance_type=processing_instance_type,
    instance_count=1,
    role=role,
)
# define the data preprocessing job
step_preprocess = ProcessingStep(
    name="DataPreprocessing",
    processor=script_processor,
    inputs=[
        ProcessingInput(source=BASE_DIR, destination="/opt/ml/processing/input/code/")
    ],
    outputs=[
        # output data, dictionaries, and so on for later steps
        ProcessingOutput(output_name="data_train", source="/opt/ml/processing/data_train"),
    ],
    code=os.path.join(BASE_DIR, "preprocess.py"),
)
- Train layer 1 BERTopic model – A SageMaker training step is used to train the first layer of the BERTopic model using an Amazon Elastic Container Registry (Amazon ECR) image and a custom training script.
from sagemaker.workflow.steps import TrainingStep
from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput
base_job_prefix = "OppUseCase"
training_instance_type = ParameterString(
    name="TrainingInstanceType",
    default_value="ml.m5.4xlarge",
)
# create an estimator for the training job
estimator_first_layer = Estimator(
    image_uri=container_image_uri,
    instance_type=training_instance_type,
    instance_count=1,
    # S3 location where the training output is stored
    output_path=f"s3://{default_bucket}/{base_job_prefix}/train_first_layer",
    role=role,
    entry_point="train_first_layer.py",
)
# create the training job for the estimator based on inputs from the data preprocessing step
step_train_first_layer = TrainingStep(
    name="TrainFirstLayerModel",
    estimator=estimator_first_layer,
    # inputs is a dict of channel name to TrainingInput; the channel name "train" is an assumed example
    inputs={
        "train": TrainingInput(
            s3_data=step_preprocess.properties.ProcessingOutputConfig.Outputs["data_train"].S3Output.S3Uri,
        ),
    },
)
- Use a callback step – This involves sending a message to an Amazon Simple Queue Service (Amazon SQS) queue, which triggers an AWS Lambda function. The Lambda function updates the mapping file in Amazon Simple Storage Service (Amazon S3) and sends a success token back to the pipeline to resume its run.
from sagemaker.workflow.callback_step import CallbackStep, CallbackOutput, CallbackOutputTypeEnum
first_sqs_queue_to_use = ParameterString(
    name="FirstSQSQueue",
    default_value="<first_queue_url>",  # add queue URL
)
first_callback_output = CallbackOutput(output_name="s3_mapping_first_update", output_type=CallbackOutputTypeEnum.String)
step_first_mapping_update = CallbackStep(
    name="FirstMappingUpdate",
    sqs_queue_url=first_sqs_queue_to_use,
    # Input arguments that will be provided in the SQS message
    inputs={
        "input_location": f"s3://{default_bucket}/{base_job_prefix}/mapping",
        "output_location": f"s3://{default_bucket}/{base_job_prefix}/mapping_first_update",
    },
    outputs=[
        first_callback_output,
    ],
)
# the callback step runs after step_train_first_layer
step_first_mapping_update.add_depends_on([step_train_first_layer])
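The Lambda side of this callback is not shown in the post, so the following is only a sketch of what such a handler might look like. It assumes the SQS message body carries the callback `token` and the step's `arguments`, and it resumes the pipeline with the SageMaker API calls `send_pipeline_execution_step_success` and `send_pipeline_execution_step_failure`. The `update_mapping` helper is a hypothetical placeholder for the real mapping logic, and the optional `sagemaker_client` argument exists only so the handler can be exercised with a stub.

```python
import json


def update_mapping(input_location, output_location):
    """Hypothetical placeholder: read the mapping, revise it, write it back to S3."""
    return output_location


def lambda_handler(event, context, sagemaker_client=None):
    """Process SQS records sent by a Pipelines callback step and resume the pipeline."""
    if sagemaker_client is None:
        import boto3  # imported lazily so the handler can be tested with a stub client

        sagemaker_client = boto3.client("sagemaker")

    for record in event["Records"]:
        message = json.loads(record["body"])
        token = message["token"]          # identifies the waiting callback step
        arguments = message["arguments"]  # the inputs defined on the CallbackStep
        try:
            updated_uri = update_mapping(
                arguments["input_location"], arguments["output_location"]
            )
            # The output name must match the CallbackOutput declared on the step.
            sagemaker_client.send_pipeline_execution_step_success(
                CallbackToken=token,
                OutputParameters=[
                    {"Name": "s3_mapping_first_update", "Value": updated_uri}
                ],
            )
        except Exception as exc:
            # Report the failure so the pipeline does not wait until it times out.
            sagemaker_client.send_pipeline_execution_step_failure(
                CallbackToken=token, FailureReason=str(exc)[:256]
            )
```

The value returned in `OutputParameters` is what later steps read through `step_first_mapping_update.properties.Outputs["s3_mapping_first_update"]`.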
- Train layer 2 BERTopic model – Another SageMaker TrainingStep is used to train the second layer of the BERTopic model using an ECR image and a custom training script.
estimator_second_layer = Estimator(
    image_uri=container_image_uri,
    instance_type=training_instance_type,  # same type as the first training layer
    instance_count=1,
    # S3 location where the training output is stored
    output_path=f"s3://{default_bucket}/{base_job_prefix}/train_second_layer",
    role=role,
    entry_point="train_second_layer.py",
)
# create the training job based on inputs from preprocessing, the previous callback step, and the first training layer
step_train_second_layer = TrainingStep(
    name="TrainSecondLayerModel",
    estimator=estimator_second_layer,
    # inputs is a dict of channel name to TrainingInput; the channel names are assumed examples
    inputs={
        "train": TrainingInput(
            s3_data=step_preprocess.properties.ProcessingOutputConfig.Outputs["data_train"].S3Output.S3Uri,
        ),
        "mapping": TrainingInput(
            # output of the previous callback step
            s3_data=step_first_mapping_update.properties.Outputs["s3_mapping_first_update"],
        ),
        "first_layer": TrainingInput(
            s3_data=f"s3://{default_bucket}/{base_job_prefix}/train_first_layer"
        ),
    },
)
- Use a callback step – Similar to Step 3, this involves sending a message to an SQS queue which triggers a Lambda function. The Lambda function updates the mapping file in Amazon S3 and sends a success token back to the pipeline to resume its run.
second_sqs_queue_to_use = ParameterString(
    name="SecondSQSQueue",
    default_value="<second_queue_url>",  # add queue URL
)
second_callback_output = CallbackOutput(output_name="s3_mapping_second_update", output_type=CallbackOutputTypeEnum.String)
step_second_mapping_update = CallbackStep(
    name="SecondMappingUpdate",
    sqs_queue_url=second_sqs_queue_to_use,
    # Input arguments that will be provided in the SQS message
    inputs={
        "input_location": f"s3://{default_bucket}/{base_job_prefix}/mapping_first_update",
        "output_location": f"s3://{default_bucket}/{base_job_prefix}/mapping_second_update",
    },
    outputs=[
        second_callback_output,
    ],
)
# the callback step runs after step_train_second_layer
step_second_mapping_update.add_depends_on([step_train_second_layer])
- Train layer 3 BERTopic model – This involves fetching the mapping file from Amazon S3 and training the third layer of the BERTopic model using an ECR image and a custom training script.
estimator_third_layer = Estimator(
    image_uri=container_image_uri,
    instance_type=training_instance_type,  # same type as the previous two training layers
    instance_count=1,
    # S3 location where the training output is stored
    output_path=f"s3://{default_bucket}/{base_job_prefix}/train_third_layer",
    role=role,
    entry_point="train_third_layer.py",
)
# create the training job based on inputs from preprocessing, the second callback step, and the outputs of the previous two training layers
step_train_third_layer = TrainingStep(
    name="TrainThirdLayerModel",
    estimator=estimator_third_layer,
    # inputs is a dict of channel name to TrainingInput; the channel names are assumed examples
    inputs={
        "train": TrainingInput(
            s3_data=step_preprocess.properties.ProcessingOutputConfig.Outputs["data_train"].S3Output.S3Uri,
        ),
        "mapping": TrainingInput(
            # output of the previous callback step
            s3_data=step_second_mapping_update.properties.Outputs["s3_mapping_second_update"],
        ),
        "first_layer": TrainingInput(
            s3_data=f"s3://{default_bucket}/{base_job_prefix}/train_first_layer"
        ),
        "second_layer": TrainingInput(
            s3_data=f"s3://{default_bucket}/{base_job_prefix}/train_second_layer"
        ),
    },
)
- Register the model – A SageMaker model step is used to register the model in the SageMaker model registry. When the model is registered, you can use the model through a SageMaker inference pipeline.
from sagemaker.model import Model
from sagemaker.workflow.model_step import ModelStep
model = Model(
    image_uri=container_image_uri,
    model_data=step_train_third_layer.properties.ModelArtifacts.S3ModelArtifacts,
    # use the pipeline session so that register() returns step arguments instead of running immediately
    sagemaker_session=pipeline_session,
    role=role,
)
# model_package_group_name and model_approval_status are assumed to be defined earlier (for example, as pipeline parameters)
register_args = model.register(
    content_types=["text/csv"],
    response_types=["text/csv"],
    inference_instances=["ml.c5.9xlarge", "ml.m5.xlarge"],
    model_package_group_name=model_package_group_name,
    approval_status=model_approval_status,
)
step_register = ModelStep(name="OppUseCaseRegisterModel", step_args=register_args)
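The steps in this walkthrough form a strictly sequential chain, because each one consumes the output of the step before it. The following sketch makes that implied dependency graph explicit with the standard library's `graphlib`. It is only an illustration: Pipelines derives the ordering itself from property references and `add_depends_on` calls, and the dependency table here is transcribed by hand from the step definitions above.

```python
from graphlib import TopologicalSorter

# Each step name maps to the set of steps it depends on.
DEPENDENCIES = {
    "DataPreprocessing": set(),
    "TrainFirstLayerModel": {"DataPreprocessing"},
    "FirstMappingUpdate": {"TrainFirstLayerModel"},
    "TrainSecondLayerModel": {"DataPreprocessing", "FirstMappingUpdate"},
    "SecondMappingUpdate": {"TrainSecondLayerModel"},
    "TrainThirdLayerModel": {"DataPreprocessing", "SecondMappingUpdate"},
    "OppUseCaseRegisterModel": {"TrainThirdLayerModel"},
}


def execution_order(dependencies):
    """Return the steps in an order that respects every dependency."""
    return list(TopologicalSorter(dependencies).static_order())


order = execution_order(DEPENDENCIES)
```

Because every step depends on its predecessor, there is only one valid order, which is why no two training layers can run in parallel in this design.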
To effectively train a BERTopic model with the BIRCH and UMAP methods, you need a custom training image that provides the additional dependencies and frameworks required to run the algorithm. For a working sample of a custom Docker image, refer to Create a custom Docker container Image for SageMaker.
Conclusion
In this post, we explained how you can use the wide range of steps offered by SageMaker Pipelines with custom images to train an ensemble model. For more information on how to get started with Pipelines using an existing ML Operations (MLOps) template, refer to Building, automating, managing, and scaling ML workflows using Amazon SageMaker Pipelines.
About the Authors
Bikramjeet Singh is an Applied Scientist on the AWS Sales Insights, Analytics and Data Science (SIADS) team, responsible for building the GenAI platform and AI/ML infrastructure solutions for ML scientists within SIADS. Prior to working as an Applied Scientist, Bikram worked as a Software Development Engineer within SIADS and Alexa AI.
Rahul Sharma is a Senior Specialist Solutions Architect at AWS, helping AWS customers build ML and Generative AI solutions. Prior to joining AWS, Rahul spent several years in the finance and insurance industries, helping customers build data and analytics platforms.
Sachin Mishra is a seasoned professional with 16 years of industry experience in technology consulting and software leadership roles. Sachin led the Sales Strategy Science and Engineering function at AWS. In this role, he was responsible for scaling cognitive analytics for sales strategy, leveraging advanced AI/ML technologies to drive insights and optimize business outcomes.
Nada Abdalla is a research scientist at AWS. Her work and expertise span multiple science areas in statistics and ML, including text analytics, recommendation systems, Bayesian modeling, and forecasting. She previously worked in academia and obtained her M.Sc and PhD from UCLA in Biostatistics. Through her work in academia and industry, she has published multiple papers in esteemed statistics journals and at applied ML conferences. In her spare time, she enjoys running and spending time with her family.