Amazon AWS – Page 99

Supercharge your AI team with Amazon SageMaker Studio: A comprehensive view of Deutsche Bahn’s AI platform transformation

February 29, 2024

by Prasanna Tuladhar Amazon AWS

AI’s growing influence in large organizations brings crucial challenges in managing AI platforms. These include developing a scalable and operationally efficient platform that adheres to organizational compliance and security standards. Amazon SageMaker Studio offers a comprehensive set of capabilities for machine learning (ML) practitioners and data scientists. These include a fully managed AI development environment with an integrated development environment (IDE), simplifying the end-to-end ML workflow. Its collaborative capabilities such as real-time coediting and sharing notebooks within the team ensures smooth teamwork, while the scalability and high-performance training caters to large datasets. With built-in security, cost-effectiveness, and a range of pre-built tools like Amazon SageMaker Autopilot, Amazon SageMaker JumpStart, and Amazon SageMaker Feature store, SageMaker Studio is a powerful platform for accelerating AI projects and empowering data scientists at every level of expertise.

Deutsche Bahn is a leading transportation organization in Germany with a revenue of 56.3 billion EUR (in 2022), a workforce of 336,884 employees (including 221,343 employees in Germany), and operations spanning 130 countries. They offer a wide range of services, including public and regional transport, freight services, and rail infrastructure. Through the integrated operation of traffic and railway infrastructure, as well as the economically and ecologically intelligent connection of all modes of transport, Deutsche Bahn moves people and goods. Deutsche Bahn has been at the forefront in adopting AI, using SageMaker Studio as a key AI platform. At Deutsche Bahn, a dedicated AI platform team manages and operates the SageMaker Studio platform, and multiple data analytics teams within the organization use the platform to develop, train, and run various analytics and ML activities.

The AI platform team’s key objective is to ensure seamless access to Workbench services and SageMaker Studio for all Deutsche Bahn teams and projects, with a primary focus on data scientists and ML engineers. This platform helps Deutsche Bahn realize a spectrum of use cases, ranging from railway maintenance, forecasting, and future applications in generative AI.

The AI platform managed service, built on SageMaker Studio, seamlessly aligns with Deutsche Bahn’s group-wide platform strategy. It meets the company’s compliance requirements, enables a swift project initiation for the team by provisioning a SageMaker domain, and reduces maintenance overhead due to an overarching operating model. Major benefits include high scalability of the service, in large part due to automation and a self-service model, and an attractive pricing model that’s primarily based on resource consumption.

“SageMaker Studio provided us a common platform that is scalable, security compliant, and addresses the development needs of data scientists from multiple data analytics teams within the DB organization. Before this, each team managed and operated their own JupyterLab notebooks, which was not efficient or cost-effective. Within 8 weeks, we onboarded over 120 developers, provisioned 25 SageMaker domains, and quickly got started using this platform.”

– Emmanuel Drosos, product owner at DB Systel.

In this post, we explore how Deutsche Bahn scaled and operated their AI platform using SageMaker Studio for multiple teams, while ensuring robust security and oversight.

Solution overview

The architecture at Deutsche Bahn consists of a central platform account managed by a platform team responsible for managing infrastructure and operations for SageMaker Studio. SageMaker Studio resources are grouped by SageMaker domains, each consisting of an associated Amazon Elastic File System (Amazon EFS) volume, a list of authorized users, and a variety of security, application, policy, and Amazon Virtual Private Cloud (Amazon VPC) configurations. At Deutsche Bahn, data scientists from various teams use SageMaker domains for their ML activities; each team has a dedicated SageMaker domain that they use for developing and testing ML models and collaborate using features such as notebook sharing.

From an infrastructure perspective, the VPC provisioned in the AI platform account as shown in the following figure has no outbound internet connectivity to ensure security and compliance. For high availability, multiple identical private isolated subnets are provisioned. The SageMaker Studio domains are deployed in VPC only mode, which creates an elastic network interface for communication between the SageMaker service account (AWS service account) and the platform account’s VPC. The endpoints like SageMaker API, SageMaker Studio, and SageMaker notebook facilitate secure and reliable communication between the platform account’s VPC and the SageMaker domain managed by AWS in the SageMaker service account.

Each data analytics team is able to request one or multiple SageMaker domains through the company’s internal self-service portal. This process of ordering a SageMaker domain is orchestrated through a separate workflow process (via AWS Step Functions). During this orchestration flow, an Azure Active Directory (AD) group for the data analytics team is provisioned with the AD group name corresponding to the domain name. The orchestration leads to a continuous integration and continuous deployment (CI/CD) pipeline deploying an AWS Cloud Development Kit (AWS CDK) app consisting of a SageMaker domain for the respective team.

In addition to the SageMaker domain, a customized AWS Identity and Access Management (IAM) role (SageMaker-execution-role), Amazon Simple Storage Service (Amazon S3) bucket (data-bucket), customer managed key (CMK), and other AWS resources are provisioned during the deployment process by the AWS CDK app, as illustrated in the following figure. The AD group contains scientists who needs access to their team’s SageMaker domain. The AD group name corresponds to the SageMaker domain’s name and is primarily used during the authorization process.

Client separation is implemented on the level of SageMaker domains by using IAM authentication mode. A domain-specific IAM role (SageMaker-execution-role) is attached to each domain that follows the principle of least privilege and is assumed by the data analytics team during the login process. This role grants data scientists in the team the ability to perform various activities, such as running processing jobs, hyperparameter tuning jobs, transformation jobs, and experiments, as well as creating models. These ML activities are run on behalf of the user by SageMaker using the IAM pass role permission. However, certain actions like creating S3 buckets, modifying IAM roles, updating SageMaker domains, and provisioning large instances are restricted for security, compliance, and cost control reasons. The associated IAM policy makes sure that the data analytics team only has access to the relevant S3 bucket and CMK for their authorized domain, as depicted in the following figure. Additionally, the role SageMaker-execution-role allows the team members to assume roles in other accounts within the Deutsche Bahn organization from SageMaker Studio, providing them with flexibility to access resources like Amazon Relational Database Service (Amazon S3), other S3 buckets, and Amazon Athena. The IAM policy uses aws:RequestTag and aws:ResourceTag for fine-grained access control during SageMaker activities, like processing jobs, training jobs, and create models. These tags also help track associated costs for the domain. For more information, refer to Actions, resources, and condition keys for Amazon SageMaker.

The CMK encrypts both the SageMaker domain’s file system contents stored in Amazon EFS and the contents of the S3 bucket (data-bucket) that is provisioned to store data for SageMaker processing and transformation jobs. In addition, resource-based policies, such as the bucket policy and CMK policy, provide an extra layer of security, restricting both access to only authorized AI team members and permitted actions on these resources.

The AI team does not have AWS Management Console access to the AI platform team’s account. To access SageMaker Studio, as illustrated in the following figure, the data scientists from the data analytics team use a generated presigned URL by authenticating through an Amazon Cognito based custom login application. After the user logs in to this custom application, they receive an OAuth access token that contains information such as AD group name. After they log in to the custom application, the user requests SageMaker domain access through the UI by triggering an Amazon API Gateway call to generate a presigned URL. API Gateway invokes the PreSignUrlGenerator AWS Lambda function and uses an Amazon Cognito authorizer to validate the OAuth access token in the request header. The PreSignUrlGenerator function validates user access permissions for the requested SageMaker domain by comparing the AD name in the access token against the requested SageMaker domain. Upon successful authorization, the PreSignUrlGenerator function creates a SageMaker user profile upon first login and generates a presigned URL response. The custom login application then redirects the users to the requested SageMaker domain.

AWS CDK

The solution at Deutsche Bahn uses AWS CDK as infrastructure as code (IaC) to provision a SageMaker domain along with resources like S3 buckets and a CMK. The following figure illustrates the stacks and associated resources used for SageMaker deployment. The infrastructure stack takes care of setting up essential resources like VPC, subnets, and multiple SageMaker endpoints. The resources such as VPC, subnets, and service control policies (SCPs) are managed by a central cloud team through a different stack (but is shown here for simplicity). The SageMakerStudioStack is primarily responsible for provisioning a SageMaker domain, a dedicated data bucket, a CMK, and the dedicated IAM role SageMaker-execution-role. Notably, each SageMaker domain is provisioned through its individual SageMakerStudioStack.

The solution uses a purpose-built L3 construct (SageMaker Studio domain), as shown in the following figure, for the SageMaker domain resource. SageMaker Studio has a lifecycle configuration feature that enables specific initializations during the startup of JupyterLab or KernelGateway apps.

Deutsch Bahn uses the lifecycle configuration as shown in the following figure to automatically detect and shut down idle instances in the SageMaker domain, reducing unnecessary costs. Due to restricted outbound connectivity, the data analytics team uses internally hosted images and third-party libraries from the company’s internal artifactory. The lifecycle configuration script for KernelGateway configures pip and conda package managers to redirect downloads to the internally hosted artifactory location. As of this writing, there is no AWS CDK construct for the lifecycle configuration resource; therefore, they use a custom CDK resource to provision and manage the LifeCycleConfig script. Custom resources in AWS CDK offer the ability to provision and manage resources not directly supported by AWS CloudFormation or AWS CDK constructs.

Installation

The sample AWS CDK application demonstrates how various components, including the SageMaker domain, lifecycle configuration, Amazon Cognito, and IAM role with the least privileges, function together. Within the application, the SagemakerStudioStack class handles the provisioning of a SageMaker domain, IAM role (sagemaker-execution-role) that users assume, CMK, lifecycle configuration, SageMaker user profile, S3 bucket for data processing, and Amazon Cognito user group. The demo AWS CDK application provides a concise overview of key components, such as the SageMaker domain, lifecycle configuration, authentication through Amazon Cognito, and IAM role with least privileges. The SagemakerLoginStack, on the other hand, is responsible for deploying the Amazon Cognito user pool, Lambda function, and API Gateway for generating presigned URLs. The CognitoUserStack primarily focuses on deploying a user within the Amazon Cognito user pool.

You can run the following commands to compile, synthesize, and deploy the application. You should adjust the account, user, and password in the sample code for your application. The password should be at least 8 characters, with uppercase characters and numbers. The user parameter is the SageMaker domain user that will be authenticated by Amazon Cognito.

Download the source code from the GitHub repo.
Bootstrap the AWS account. In the following code, adjust the account number and Region as needed:
```
cdk bootstrap aws://11111111111/eu-central-1
```
Install the packages and compile the code:
```
npm install
npm run build
```

Synthesize the AWS CDK application:

npx cdk synth -c account=11111111111 -c region='eu-central-1' -c domain-name=team1 -c user=demo-user -c password=<your password>

Deploy the application with all stacks into the account and Region of your choice:

npx cdk deploy --all -c account=11111111111 -c region='eu-central-1' -c domain-name=team1 -c user=demo-user -c password=<password>

Download the Postman app to make an API call.

If you don’t have a Postman account, create a free account with your email. If you already have an account, sign in to your account.

On the File menu, choose Import and import the Postman environment JSON file included in the GitHub repo.
On the Environments tab in Postman, locate the environment called SageMaker.

Add the following environment variables, which you see as part of the stack deployment output from SagemakerLoginStack:

..... output from the cdk deploy .....

//PreSignedURLApi

SageMaker-login-stack.PreSignedURLApiEndpointXXXX= https://xxxxxxx.execute-api.eu-central-1.amazonaws.com/prod/

//UserPoolClientId

SageMaker-login-stack.UserPoolUserPoolClientIdFXXXX = xxxxxxxxxxxxxxxx

//UserPoolClientSecret

SageMaker-login-stack.UserPoolUserPoolClientSecretC1D088A5 = xxxxxxxxxxxxxxx

//CognitoSigninDomain

SageMaker-login-stack.UserPoolCognitoSigninDomainD3B08161 = https://SageMaker-login-xxxxx.auth.eu-central-1.amazoncognito.com/oauth2

Use the following parameters (fetch the values from the output during cdk deploy):

- domainName – The domain name parameter you passed in cdk deploy, for example team1
- client-id – The Amazon Cognito client ID
- client-secret – The Amazon Cognito client secret.
- SageMaker-presigned-api – The URL of the API Gateway created by AWS CDK, which generates the presigned URL
- cognito-signin-endpoint – The endpoint URL of the Amazon Cognito domain where the client app (in this case, Postman) authenticates by providing credentials of the user (demo-user)

The next step is to generate an OAuth2 token.

1. On the Authorization tab, choose the SageMaker environment and choose Generate New Access Token.

All the values on this tab should be prefilled.

1. Update the environment variables and choose Get New Access Token.

In the pop-up window that opens, log in to Amazon Cognito with the user name (demo-user) and password you used earlier.

Upon successful authentication, a new access token is generated.

Choose Use Token.
Choose GeneratePresignedUrlDemo in the Postman SageMaker collections and choose Send.
Make sure you selected the right environment (SageMaker) on the drop-down list.

This makes a REST API call to API Gateway and generates a presigned URL to access the SageMaker domain. You can see this URL in the response body.

Copy this URL and enter it in the browser window.

A new SageMaker domain will be launched with your user profile.

This demo application supports SageMaker features like training jobs, processing jobs, and model endpoints. Note that features like Amazon SageMaker Canvas, SageMaker JumpStart, and SageMaker Feature Store are not activated.

Clean up

Complete the following steps to clean up your resources:

On the SageMaker console, in the navigation pane, choose Domain, User Profile, and Apps.
Delete all running apps (KernelGateway or JupyterLab) from this solution.
Delete all the SageMaker user profiles you created during the login step.
On the Amazon EFS console, delete the EFS file system created for this post.
Run the following command to delete the resources created with the AWS CDK:
```
npx cdk destroy --all
```

Conclusion

The post highlighted how Deutsche Bahn effectively used SageMaker Studio to revamp its AI platform, resulting in a scalable, automated, and manageable solution to support its diverse data analytics teams. This architecture features a central platform account, a self-service domain ordering process, and infrastructure provisioning using AWS CDK. The deployment process incorporates a CI/CD pipeline, ensuring the smooth delivery of SageMaker domains.

Overall, the transformation brought about by SageMaker Studio has empowered Deutsche Bahn to construct a robust platform for their AI initiatives, catering to over 100 developers and managing 20 SageMaker domains within a single AWS account.

Lastly, we extend our sincere appreciation to Nico Seegert (d-fine) and Philipp Vollmer (Deutsche Bahn), whose invaluable contributions were instrumental in shaping this architecture.

For further reading, refer to the following resources:

___________________________________________________________________________________________

About the authors

Prasanna Tuladhar is a Cloud Infrastructure Architect at AWS Professional Services in Munich, Germany. Specializing in cloud infrastructure, workload migration, and DevOps on the AWS platform, he empowers customers to achieve their business objectives. Outside of work, he enjoys jogging, hiking, and quality time with his family.

Emmanuel Drosos is a Product Owner for the AI platform at DBSystel, a subsidiary of Deutsche Bahn (DB) Germany. With a passion for innovation and technology, Emmanuel spearheads initiatives aimed at leveraging the power of the cloud to drive AI platform at DB (Deutsche Bahn). The AI.Platform is one of DB’s group-wide development platforms. It includes AI services and tools for the development of AI (machine learning) models and directly usable AI services. Simple, integrated and scalable.He works closely with other DB customers to unlock the full potential of AI platform, enabling them to achieve their business objectives efficiently and effectively. Outside of his professional activities, Emmanuel enjoys traveling and is an enthusiastic nature and hiking lover.

Vishwanath Bhat is a DevOps Architect at AWS Professional Services, based in Germany. He helps customers to get the full benefit of the cloud and achieve their business goals with AWS cloud. When not working, he likes to go swimming in alpine lakes, hiking, reading or play football.

Kumudhan Cherarajan is a DevOps Consultant at AWS Professional Services, based in Switzerland. He is passionate about helping customers adopt process and services that increase their efficiency in the cloud journey. When not working, he likes to play cricket and music.

Build a robust text-to-SQL solution generating complex queries, self-correcting, and querying diverse data sources

February 28, 2024

by Sanjeeb Panda Amazon AWS

Structured Query Language (SQL) is a complex language that requires an understanding of databases and metadata. Today, generative AI can enable people without SQL knowledge. This generative AI task is called text-to-SQL, which generates SQL queries from natural language processing (NLP) and converts text into semantically correct SQL. The solution in this post aims to bring enterprise analytics operations to the next level by shortening the path to your data using natural language.

With the emergence of large language models (LLMs), NLP-based SQL generation has undergone a significant transformation. Demonstrating exceptional performance, LLMs are now capable of generating accurate SQL queries from natural language descriptions. However, challenges still remain. First, human language is inherently ambiguous and context-dependent, whereas SQL is precise, mathematical, and structured. This gap may result in inaccurate conversion of the user’s needs into the SQL that’s generated. Second, you might need to build text-to-SQL features for every database because data is often not stored in a single target. You may have to recreate the capability for every database to enable users with NLP-based SQL generation. Third, despite the larger adoption of centralized analytics solutions like data lakes and warehouses, complexity rises with different table names and other metadata that is required to create the SQL for the desired sources. Therefore, collecting comprehensive and high-quality metadata also remains a challenge. To learn more about text-to-SQL best practices and design patterns, see Generating value from enterprise data: Best practices for Text2SQL and generative AI.

Our solution aims to address those challenges using Amazon Bedrock and AWS Analytics Services. We use Anthropic Claude v2.1 on Amazon Bedrock as our LLM. To address the challenges, our solution first incorporates the metadata of the data sources within the AWS Glue Data Catalog to increase the accuracy of the generated SQL query. The workflow also includes a final evaluation and correction loop, in case any SQL issues are identified by Amazon Athena, which is used downstream as the SQL engine. Athena also allows us to use a multitude of supported endpoints and connectors to cover a large set of data sources.

After we walk through the steps to build the solution, we present the results of some test scenarios with varying SQL complexity levels. Finally, we discuss how it is straightforward to incorporate different data sources to your SQL queries.

Solution overview

There are three critical components in our architecture: Retrieval Augmented Generation (RAG) with database metadata, a multi-step self-correction loop, and Athena as our SQL engine.

We use the RAG method to retrieve the table descriptions and schema descriptions (columns) from the AWS Glue metastore to ensure that the request is related to the right table and datasets. In our solution, we built the individual steps to run a RAG framework with the AWS Glue Data Catalog for demonstration purposes. However, you can also use knowledge bases in Amazon Bedrock to build RAG solutions quickly.

The multi-step component allows the LLM to correct the generated SQL query for accuracy. Here, the generated SQL is sent for syntax errors. We use Athena error messages to enrich our prompt for the LLM for more accurate and effective corrections in the generated SQL.

You can consider the error messages occasionally coming from Athena like feedback. The cost implications of an error correction step are negligible compared to the value delivered. You can even include these corrective steps as supervised reinforced learning examples to fine-tune your LLMs. However, we did not cover this flow in our post for simplicity purposes.

Note that there is always inherent risk of having inaccuracies, which naturally comes with generative AI solutions. Even if Athena error messages are highly effective to mitigate this risk, you can add more controls and views, such as human feedback or example queries for fine-tuning, to further minimize such risks.

Athena not only allows us to correct the SQL queries, but it also simplifies the overall problem for us because it serves as the hub, where the spokes are multiple data sources. Access management, SQL syntax, and more are all handled via Athena.

The following diagram illustrates the solution architecture.

The solution architecture and the process flow is shown.

Figure 1. The solution architecture and process flow.

The process flow includes the following steps:

Create the AWS Glue Data Catalog using an AWS Glue crawler (or a different method).
Using the Titan-Text-Embeddings model on Amazon Bedrock, convert the metadata into embeddings and store it in an Amazon OpenSearch Serverless vector store, which serves as our knowledge base in our RAG framework.

At this stage, the process is ready to receive the query in natural language. Steps 7–9 represent a correction loop, if applicable.

The user enters their query in natural language. You can use any web application to provide the chat UI. Therefore, we did not cover the UI details in our post.
The solution applies a RAG framework via similarity search, which adds the extra context from the metadata from the vector database. This table is used for finding the correct table, database, and attributes.
The query is merged with the context and sent to Anthropic Claude v2.1 on Amazon Bedrock.
The model gets the generated SQL query and connects to Athena to validate the syntax.
If Athena provides an error message that mentions the syntax is incorrect, the model uses the error text from Athena’s response.
The new prompt adds Athena’s response.
The model creates the corrected SQL and continues the process. This iteration can be performed multiple times.
Finally, we run the SQL using Athena and generate output. Here, the output is presented to the user. For the sake of architectural simplicity, we did not show this step.

Prerequisites

For this post, you should complete the following prerequisites:

Have an AWS account.
Install the AWS Command Line Interface (AWS CLI).
Set up the SDK for Python (Boto3).
Create the AWS Glue Data Catalog using an AWS Glue crawler (or a different method).
Using the Titan-Text-Embeddings model on Amazon Bedrock, convert the metadata into embeddings and store it in an OpenSearch Serverless vector store.

Implement the solution

You can use the following Jupyter notebook, which includes all the code snippets provided in this section, to build the solution. We recommend using Amazon SageMaker Studio to open this notebook with an ml.t3.medium instance with the Python 3 (Data Science) kernel. For instructions, refer to Train a Machine Learning Model. Complete the following steps to set up the solution:

Create the knowledge base in OpenSearch Service for the RAG framework:

def add_documnets(self,index_name: str,file_name:str):

documents = JSONLoader(file_path=file_name, jq_schema='.', text_content=False, json_lines=False).load()
docs = OpenSearchVectorSearch.from_documents(embedding=self.embeddings, opensearch_url=self.opensearch_domain_endpoint, http_auth=self.http_auth, documents=documents, index_name=index_name, engine="faiss")
index_exists = self.check_if_index_exists(index_name,aws_region,opensearch_domain_endpoint,http_auth)
if not index_exists :
logger.info(f'index :{index_name} is not existing ')
sys.exit(-1)
else:
logger.info(f'index :{index_name} Got created')

Build the prompt (final_question) by combining the user input in natural language (user_query), the relevant metadata from the vector store (vector_search_match), and our instructions (details):

def userinput(user_query):
logger.info(f'Searching metadata from vector store')

# vector_search_match=rqst.getEmbeddding(user_query)
vector_search_match = rqst.getOpenSearchEmbedding(index_name,user_query)

# print(vector_search_match)
details = "It is important that the SQL query complies with Athena syntax. 
During join if column name are same please use alias ex llm.customer_id 
in select statement. It is also important to respect the type of columns: 
if a column is string, the value should be enclosed in quotes. 
If you are writing CTEs then include all the required columns. 
While concatenating a non string column, make sure cast the column to string. 
For date columns comparing to string , please cast the string input."
final_question = "nnHuman:"+details + vector_search_match + user_query+ "nnAssistant:"
answer = rqst.generate_sql(final_question)
return answer

Invoke Amazon Bedrock for the LLM (Claude v2) and prompt it to generate the SQL query. In the following code, it makes multiple attempts in order to illustrate the self-correction step:x

try:
logger.info(f'we are in Try block to generate the sql and count is :{attempt + 1}')
generated_sql = self.llm.predict(prompt)
query_str = generated_sql.split("```")[1]
query_str = " ".join(query_str.split("n")).strip()
sql_query = query_str[3:] if query_str.startswith("sql") else query_str

# return sql_query
syntaxcheckmsg=rqstath.syntax_checker(sql_query)
if syntaxcheckmsg=='Passed':
logger.info(f'syntax checked for query passed in attempt number :{attempt + 1}')
return sql_query

If any issues are received with the generated SQL query ({sqlgenerated}) from the Athena response ({syntaxcheckmsg}), the new prompt (prompt) is generated based on the response and the model tries again to generate the new SQL:

else:
prompt = f"""{prompt} 
This is syntax error: {syntaxcheckmsg}.
To correct this, please generate an alternative SQL query which will correct the syntax error. The updated query should take care of all the syntax issues encountered. Follow the instructions mentioned above to remediate the error.
Update the below SQL query to resolve the issue:
{sqlgenerated}
Make sure the updated SQL query aligns with the requirements provided in the initial question."""
prompts.append(prompt)

After the SQL is generated, the Athena client is invoked to run and generate the output:

query_execution = self.athena_client.start_query_execution(
QueryString=query_string,
ResultConfiguration=result_config,
QueryExecutionContext=query_execution_context, )
execution_id = query_execution["QueryExecutionId"]

Test the solution

In this section, we run our solution with different example scenarios to test different complexity levels of SQL queries.

To test our text-to-SQL, we use two datasets available from IMDB. Subsets of IMDb data are available for personal and non-commercial use. You can download the datasets and store them in Amazon Simple Storage Service (Amazon S3). You can use the following Spark SQL snippet to create tables in AWS Glue. For this example, we use title_ratings and title:

source_title_ratings3_path = 's3://llm-athena-output/input_data/title.ratings.tsv'
target_title_s3_path='s3://llm-athena-output/output_data/imdb_stg/title_ratings'
source_titleratingdf=spark.read.csv(source_title_ratings3_path,sep="t",header=True)
source_titleratingdf.write.mode('overwrite').format('parquet').option('path', target_title_s3_path).saveAsTable('imdb_stg.title_ratings')

Store data in Amazon S3 and metadata in AWS Glue

In this scenario, our dataset is stored in an S3 bucket. Athena has an S3 connector that allows you to use Amazon S3 as a data source that can be queried.

For our first query, we provide the input “I am new to this. Can you help me see all the tables and columns in imdb schema?”

The following is the generated query:

WITH tables AS (
SELECT table_name
FROM information_schema.tables
WHERE table_schema = 'imdb_stg'),
columns AS (
SELECT
c.table_name,
c.column_name,
c.data_type,
c.is_nullable,
c.column_default,
c.ordinal_position
FROM information_schema.columns c
WHERE c.table_schema = 'imdb_stg')
SELECT
t.table_name,
c.column_name,
c.data_type,
c.is_nullable,
c.column_default,
c.ordinal_position
FROM tables t
INNER JOIN columns c
ON t.table_name = c.table_name
ORDER BY
t.table_name,
c.ordinal_position
LIMIT 10;

The following screenshot and code show our output.

table_name column_name data_type
0 title titleid varchar
1 title ordering integer
2 title title varchar
3 title region varchar
4 title language varchar

For our second query, we ask “Show me all the title and details in US region whose rating is more than 9.5.”

The following is our generated query:

WITH us_titles AS (
SELECT t.title, t.region, tr.averageRating, tr.numVotes
FROM imdb_stg.title t
INNER JOIN imdb_stg.title_ratings tr
ON t.titleId = tr.tconst
WHERE t.region = 'US' AND cast(tr.averageRating as varchar) > '9.5'
)
SELECT title, region, averageRating, numVotes
FROM us_titles
LIMIT 100;

The response is as follows.

title region averageRating numVotes
0 The Way You Saw Me US 9.7 8
1 The Brother Side of the Wake US 9.6 20
2 Ignis Fatuus US 9.6 11
3 Love and Hip Hop Atlanta US 9.9 11
4 ronny/lily US 9.7 14781

For our third query, we enter “Great Response! Now show me all the original type titles having ratings more than 7.5 and not in the US region.”

The following query is generated:

WITH titles AS (
SELECT t.titleId,
t.title,
t.types,
t.isOriginalTitle,
cast(tr.averageRating as decimal(3,1)) as averageRating,
tr.numVotes,
t.region
FROM imdb_stg.title t
INNER JOIN imdb_stg.title_ratings tr
ON t.titleId = tr.tconst
WHERE t.isOriginalTitle = '1'
AND cast(tr.averageRating as decimal(3,1)) > 7.5
AND t.region != 'US')
SELECT *
FROM titles
LIMIT 100;

We get the following results.

titleId title types isOriginalTitle averageRating numVotes region
0 tt0986264 Taare Zameen Par original 1 8.3 203760 XWW

Generate self-corrected SQL

This scenario simulates a SQL query that has syntax issues. Here, the generated SQL will be self-corrected based on the response from Athena. In the following response, Athena gave a COLUMN_NOT_FOUND error and mentioned that table_description can’t be resolved:

Status : {'State': 'FAILED', 'StateChangeReason': "COLUMN_NOT_FOUND: line 1:50: Column 'table_description' 
cannot be resolved or requester is not authorized to access requested resources",
'SubmissionDateTime': datetime.datetime(2024, 1, 14, 14, 38, 57, 501000, tzinfo=tzlocal()),
'CompletionDateTime': datetime.datetime(2024, 1, 14, 14, 38, 57, 778000, tzinfo=tzlocal()),
'AthenaError': {'ErrorCategory': 2, 'ErrorType': 1006, 'Retryable': False, 'ErrorMessage': "COLUMN_NOT_FOUND: 
line 1:50: Column 'table_description' cannot be resolved or requester is not authorized to 
access requested resources"}}
COLUMN_NOT_FOUND: line 1:50: Column 'table_description' cannot be resolved or requester is not authorized to access requested resources
Try Count: 2
2024-01-14 14:39:02,521,llm_execute,MainProcess,INFO,Try Count: 2
we are in Try block to generate the sql and count is :2
2024-01-14 14:39:02,521,llm_execute,MainProcess,INFO,we are in Try block to generate the sql and count is :2
Executing: Explain WITH tables AS ( SELECT table_name FROM information_schema.tables WHERE table_schema = 'imdb_stg' ), columns AS ( SELECT c.table_name, c.column_name, c.data_type, c.is_nullable, c.column_default, c.ordinal_position FROM information_schema.columns c WHERE c.table_schema = 'imdb_stg' ) SELECT t.table_name, c.column_name, c.data_type, c.is_nullable, c.column_default, c.ordinal_position FROM tables t INNER JOIN columns c ON t.table_name = c.table_name ORDER BY t.table_name, c.ordinal_position LIMIT 10;
I am checking the syntax here
execution_id: 904857c3-b7ac-47d0-8e7e-6b9d0456099b
Status : {'State': 'SUCCEEDED', 'SubmissionDateTime': datetime.datetime(2024, 1, 14, 14, 39, 29, 537000, tzinfo=tzlocal()), 'CompletionDateTime': datetime.datetime(2024, 1, 14, 14, 39, 30, 183000, tzinfo=tzlocal())}
syntax checked for query passed in tries number :2

Using the solution with other data sources

To use the solution with other data sources, Athena handles the job for you. To do this, Athena uses data source connectors that can be used with federated queries. You can consider a connector as an extension of the Athena query engine. Pre-built Athena data source connectors exist for data sources like Amazon CloudWatch Logs, Amazon DynamoDB, Amazon DocumentDB (with MongoDB compatibility), and Amazon Relational Database Service (Amazon RDS), and JDBC-compliant relational data sources such MySQL, and PostgreSQL under the Apache 2.0 license. After you set up a connection to any data source, you can use the preceding code base to extend the solution. For more information, refer to Query any data source with Amazon Athena’s new federated query.

Clean up

To clean up the resources, you can start by cleaning up your S3 bucket where the data resides. Unless your application invokes Amazon Bedrock, it will not incur any cost. For the sake of infrastructure management best practices, we recommend deleting the resources created in this demonstration.

Conclusion

In this post, we presented a solution that allows you to use NLP to generate complex SQL queries with a variety of resources enabled by Athena. We also increased the accuracy of the generated SQL queries via a multi-step evaluation loop based on error messages from downstream processes. Additionally, we used the metadata in the AWS Glue Data Catalog to consider the table names asked in the query through the RAG framework. We then tested the solution in various realistic scenarios with different query complexity levels. Finally, we discussed how to apply this solution to different data sources supported by Athena.

Amazon Bedrock is at the center of this solution. Amazon Bedrock can help you build many generative AI applications. To get started with Amazon Bedrock, we recommend following the quick start in the following GitHub repo and familiarizing yourself with building generative AI applications. You can also try knowledge bases in Amazon Bedrock to build such RAG solutions quickly.

About the Authors

Sanjeeb Panda is a Data and ML engineer at Amazon. With the background in AI/ML, Data Science and Big Data, Sanjeeb design and develop innovative data and ML solutions that solve complex technical challenges and achieve strategic goals for global 3P sellers managing their businesses on Amazon. Outside of his work as a Data and ML engineer at Amazon, Sanjeeb Panda is an avid foodie and music enthusiast.

Burak Gozluklu is a Principal AI/ML Specialist Solutions Architect located in Boston, MA. He helps strategic customers adopt AWS technologies and specifically Generative AI solutions to achieve their business objectives. Burak has a PhD in Aerospace Engineering from METU, an MS in Systems Engineering, and a post-doc in system dynamics from MIT in Cambridge, MA. Burak is still a research affiliate in MIT. Burak is passionate about yoga and meditation.

How Axfood enables accelerated machine learning throughout the organization using Amazon SageMaker

February 27, 2024

by Dr. Björn Blomqvist Amazon AWS

This is a guest post written by Axfood AB.

In this post, we share how Axfood, a large Swedish food retailer, improved operations and scalability of their existing artificial intelligence (AI) and machine learning (ML) operations by prototyping in close collaboration with AWS experts and using Amazon SageMaker.

Axfood is Sweden’s second largest food retailer, with over 13,000 employees and more than 300 stores. Axfood has a structure with multiple decentralized data science teams with different areas of responsibility. Together with a central data platform team, the data science teams bring innovation and digital transformation through AI and ML solutions to the organization. Axfood has been using Amazon SageMaker to cultivate their data using ML and has had models in production for many years. Lately, the level of sophistication and the sheer number of models in production is increasing exponentially. However, even though the pace of innovation is high, the different teams had developed their own ways of working and were in search of a new MLOps best practice.

Our challenge

To stay competitive in terms of cloud services and AI/ML, Axfood chose to partner with AWS and has been collaborating with them for many years.

During one of our recurring brainstorming sessions with AWS, we were discussing how to best collaborate across teams to increase the pace of innovation and efficiency of data science and ML practitioners. We decided to put in a joint effort to build a prototype on a best practice for MLOps. The aim of the prototype was to build a model template for all data science teams to build scalable and efficient ML models—the foundation to a new generation of AI and ML platforms for Axfood. The template should bridge and combine best practices from AWS ML experts and company-specific best practice models—the best of both worlds.

We decided to build a prototype from one of the currently most developed ML models within Axfood: forecasting sales in stores. More specifically, the forecast for fruits and vegetables of upcoming campaigns for food retail stores. Accurate daily forecasting supports the ordering process for the stores, increasing sustainability by minimizing food waste as a result of optimizing sales by accurately predicting the needed in-store stock levels. This was the perfect place to start for our prototype—not only would Axfood gain a new AI/ML platform, but we would also get a chance to benchmark our ML capabilities and learn from leading AWS experts.

Our solution: A new ML template on Amazon SageMaker Studio

Building a full ML pipeline that is designed for an actual business case can be challenging. In this case, we are developing a forecasting model, so there are two main steps to complete:

Train the model to make predictions using historical data.
Apply the trained model to make predictions of future events.

In Axfood’s case, a well-functioning pipeline for this purpose was already set up using SageMaker notebooks and orchestrated by the third-party workflow management platform Airflow. However, there are many clear benefits of modernizing our ML platform and moving to Amazon SageMaker Studio and Amazon SageMaker Pipelines. Moving to SageMaker Studio provides many predefined out-of-the-box features:

Monitoring model and data quality as well as model explainability
Built-in integrated development environment (IDE) tools such as debugging
Cost/performance monitoring
Model acceptance framework
Model registry

However, the most important incentive for Axfood is the ability to create custom project templates using Amazon SageMaker Projects to be used as a blueprint for all data science teams and ML practitioners. The Axfood team already had a robust and mature level of ML modeling, so the main focus was on building the new architecture.

Solution overview

Axfood’s proposed new ML framework is structured around two main pipelines: the model build pipeline and the batch inference pipeline:

These pipelines are versioned within two separate Git repositories: one build repository and one deploy (inference) repository. Together, they form a robust pipeline for forecasting fruits and vegetables.
The pipelines are packaged into a custom project template using SageMaker Projects in integration with a third-party Git repository (Bitbucket) and Bitbucket pipelines for continuous integration and continuous deployment (CI/CD) components.
The SageMaker project template includes seed code corresponding to each step of the build and deploy pipelines (we discuss these steps in more detail later in this post) as well as the pipeline definition—the recipe for how the steps should be run.
Automation of building new projects based on the template is streamlined through AWS Service Catalog, where a portfolio is created, serving as an abstraction for multiple products.
Each product translates into an AWS CloudFormation template, which is deployed when a data scientist creates a new SageMaker project with our MLOps blueprint as the foundation. This activates an AWS Lambda function that creates a Bitbucket project with two repositories—model build and model deploy—containing the seed code.

The following diagram illustrates the solution architecture. Workflow A depicts the intricate flow between the two model pipelines—build and inference. Workflow B shows the flow to create a new ML project.

Model build pipeline

The model build pipeline orchestrates the model’s lifecycle, beginning from preprocessing, moving through training, and culminating in being registered in the model registry:

Preprocessing – Here, the SageMaker ScriptProcessor class is employed for feature engineering, resulting in the dataset the model will be trained on.
Training and batch transform – Custom training and inference containers from SageMaker are harnessed to train the model on historical data and create predictions on the evaluation data using a SageMaker Estimator and Transformer for the respective tasks.
Evaluation – The trained model undergoes evaluation by comparing the generated predictions on the evaluation data to the ground truth using ScriptProcessor.
Baseline jobs – The pipeline creates baselines based on statistics in the input data. These are essential for monitoring data and model quality, as well as feature attributions.
Model registry – The trained model is registered for future use. The model will be approved by designated data scientists to deploy the model for use in production.

For production environments, data ingestion and trigger mechanisms are managed via a primary Airflow orchestration. Meanwhile, during development, the pipeline is activated each time a new commit is introduced to the model build Bitbucket repository. The following figure visualizes the model build pipeline.

Batch inference pipeline

The batch inference pipeline handles the inference phase, which consists of the following steps:

Preprocessing – Data is preprocessed using ScriptProcessor.
Batch transform – The model uses the custom inference container with a SageMaker Transformer and generates predictions given the input preprocessed data. The model used is the latest approved trained model in the model registry.
Postprocessing – The predictions undergo a series of postprocessing steps using ScriptProcessor.
Monitoring – Continuous surveillance completes checks for drifts related to data quality, model quality, and feature attribution.

If discrepancies arise, a business logic within the postprocessing script assesses whether retraining the model is necessary. The pipeline is scheduled to run at regular intervals.

The following diagram illustrates the batch inference pipeline. Workflow A corresponds to preprocessing, data quality and feature attribution drift checks, inference, and postprocessing. Workflow B corresponds to model quality drift checks. These pipelines are divided because the model quality drift check will only run if new ground truth data is available.

SageMaker Model Monitor

With Amazon SageMaker Model Monitor integrated, the pipelines benefit from real-time monitoring on the following:

Data quality – Monitors any drift or inconsistencies in data
Model quality – Watches for any fluctuations in model performance
Feature attribution – Checks for drift in feature attributions

Monitoring model quality requires access to ground truth data. Although obtaining ground truth can be challenging at times, using data or feature attribution drift monitoring serves as a competent proxy to model quality.

Specifically, in the case of data quality drift, the system watches out for the following:

Concept drift – This pertains to changes in the correlation between input and output, requiring ground truth
Covariate shift – Here, the emphasis is on alterations in the distribution of independent input variables

SageMaker Model Monitor’s data drift functionality meticulously captures and scrutinizes the input data, deploying rules and statistical checks. Alerts are raised whenever anomalies are detected.

In parallel to using data quality drift checks as a proxy for monitoring model degradation, the system also monitors feature attribution drift using the normalized discounted cumulative gain (NDCG) score. This score is sensitive to both changes in feature attribution ranking order as well as to the raw attribution scores of features. By monitoring drift in attribution for individual features and their relative importance, it’s straightforward to spot degradation in model quality.

Model explainability

Model explainability is a pivotal part of ML deployments, because it ensures transparency in predictions. For a detailed understanding, we use Amazon SageMaker Clarify.

It offers both global and local model explanations through a model-agnostic feature attribution technique based on the Shapley value concept. This is used to decode why a particular prediction was made during inference. Such explanations, which are inherently contrastive, can vary based on different baselines. SageMaker Clarify aids in determining this baseline using K-means or K-prototypes in the input dataset, which is then added to the model build pipeline. This functionality enables us to build generative AI applications in the future for increased understanding of how the model works.

Industrialization: From prototype to production

The MLOps project includes a high degree of automation and can serve as a blueprint for similar use cases:

The infrastructure can be reused entirely, whereas the seed code can be adapted for each task, with most changes limited to the pipeline definition and the business logic for preprocessing, training, inference, and postprocessing.
The training and inference scripts are hosted using SageMaker custom containers, so a variety of models can be accommodated without changes to the data and model monitoring or model explainability steps, as long as the data is in tabular format.

After finishing the work on the prototype, we turned to how we should use it in production. To do so, we felt the need to make some additional adjustments to the MLOps template:

The original seed code used in the prototype for the template included preprocessing and postprocessing steps run before and after the core ML steps (training and inference). However, when scaling up to use the template for multiple use cases in production, the built-in preprocessing and postprocessing steps may lead to decreased generality and reproduction of code.
To improve generality and minimize repetitive code, we chose to slim down the pipelines even further. Instead of running the preprocessing and postprocessing steps as part of the ML pipeline, we run these as part of the primary Airflow orchestration before and after triggering the ML pipeline.
This way, use case-specific processing tasks are abstracted from the template, and what is left is a core ML pipeline performing tasks that are general across multiple use cases with minimal repetition of code. Parameters that differ between use cases are supplied as input to the ML pipeline from the primary Airflow orchestration.

The result: A rapid & efficient approach to model build & deployment

The prototype in collaboration with AWS has resulted in an MLOps template following current best practices that is now available for use to all of Axfood’s data science teams. By creating a new SageMaker project within SageMaker Studio, data scientists can get started on new ML projects quickly and seamlessly transition to production, allowing for more efficient time management. This is made possible by automating tedious, repetitive MLOps tasks as part of the template.

Furthermore, several new functionalities have been added in an automated fashion to our ML setup. These gains include:

Model monitoring – We can perform drift checks for model and data quality as well as model explainability
Model and data lineage – It’s now possible to trace exactly which data has been used for which model
Model registry – This helps us catalog models for production and manage model versions

Conclusion

In this post, we discussed how Axfood improved operations and scalability of our existing AI and ML operations in collaboration with AWS experts and by using SageMaker and its related products.

These improvements will help Axfood’s data science teams building ML workflows in a more standardized way and will greatly simplify analysis and monitoring of models in production—ensuring the quality of ML models built and maintained by our teams.

Please leave any feedback or questions in the comments section.

About the Authors

Dr. Björn Blomqvist is the Head of AI Strategy at Axfood AB. Before joining Axfood AB he led a team of Data Scientists at Dagab, a part of Axfood, building innovative machine learning solutions with the mission to provide good and sustainable food to people all over Sweden. Born and raised in the north of Sweden, in his spare time Björn ventures to snowy mountains and open seas.

Oskar Klang is a Senior Data Scientist at the analytics department at Dagab, where he enjoys working with everything analytics and machine learning, e.g. optimizing supply chain operations, building forecasting models and, more recently, GenAI applications. He is committed to building more streamlined machine learning pipelines, enhancing efficiency and scalability.

Pavel Maslov is a Senior DevOps and ML engineer in the Analytic Platforms team. Pavel has extensive experience in the development of frameworks, infrastructure, and tools in the domains of DevOps and ML/AI on the AWS platform. Pavel has been one of the key players in building the foundational capability within ML at Axfood.

Joakim Berg is the Team Lead and Product Owner Analytic Platforms, based in Stockholm Sweden. He is leading a team of Data Platform end DevOps/MLOps engineers providing Data and ML platforms for the Data Science teams. Joakim has many years of experience leading senior development and architecture teams from different industries.

Adapting neural radiance fields (NeRFs) to dynamic scenes

February 26, 2024

by Amazon AWS

Representing light and density fields as weighted sums over basis functions, whose weights vary over time, improves motion capture, texture, and lighting.Read More

Techniques and approaches for monitoring large language models on AWS

February 26, 2024

by Bruno Klein Amazon AWS

Large Language Models (LLMs) have revolutionized the field of natural language processing (NLP), improving tasks such as language translation, text summarization, and sentiment analysis. However, as these models continue to grow in size and complexity, monitoring their performance and behavior has become increasingly challenging.

Monitoring the performance and behavior of LLMs is a critical task for ensuring their safety and effectiveness. Our proposed architecture provides a scalable and customizable solution for online LLM monitoring, enabling teams to tailor your monitoring solution to your specific use cases and requirements. By using AWS services, our architecture provides real-time visibility into LLM behavior and enables teams to quickly identify and address any issues or anomalies.

In this post, we demonstrate a few metrics for online LLM monitoring and their respective architecture for scale using AWS services such as Amazon CloudWatch and AWS Lambda. This offers a customizable solution beyond what is possible with model evaluation jobs with Amazon Bedrock.

Overview of solution

The first thing to consider is that different metrics require different computation considerations. A modular architecture, where each module can intake model inference data and produce its own metrics, is necessary.

We suggest that each module take incoming inference requests to the LLM, passing prompt and completion (response) pairs to metric compute modules. Each module is responsible for computing its own metrics with respect to the input prompt and completion (response). These metrics are passed to CloudWatch, which can aggregate them and work with CloudWatch alarms to send notifications on specific conditions. The following diagram illustrates this architecture.

Fig 1: Metric compute module – solution overview

The workflow includes the following steps:

A user makes a request to Amazon Bedrock as part of an application or user interface.
Amazon Bedrock saves the request and completion (response) in Amazon Simple Storage Service (Amazon S3) as the per configuration of invocation logging.
The file saved on Amazon S3 creates an event that triggers a Lambda function. The function invokes the modules.
The modules post their respective metrics to CloudWatch metrics.
Alarms can notify the development team of unexpected metric values.

The second thing to consider when implementing LLM monitoring is choosing the right metrics to track. Although there are many potential metrics that you can use to monitor LLM performance, we explain some of the broadest ones in this post.

In the following sections, we highlight a few of the relevant module metrics and their respective metric compute module architecture.

Semantic similarity between prompt and completion (response)

When running LLMs, you can intercept the prompt and completion (response) for each request and transform them into embeddings using an embedding model. Embeddings are high-dimensional vectors that represent the semantic meaning of the text. Amazon Titan provides such models through Titan Embeddings. By taking a distance such as cosine between these two vectors, you can quantify how semantically similar the prompt and completion (response) are. You can use SciPy or scikit-learn to compute the cosine distance between vectors. The following diagram illustrates the architecture of this metric compute module.

Fig 2: Metric compute module – semantic similarity

This workflow includes the following key steps:

A Lambda function receives a streamed message via Amazon Kinesis containing a prompt and completion (response) pair.
The function gets an embedding for both the prompt and completion (response), and computes the cosine distance between the two vectors.
The function sends that information to CloudWatch metrics.

Sentiment and toxicity

Monitoring sentiment allows you to gauge the overall tone and emotional impact of the responses, whereas toxicity analysis provides an important measure of the presence of offensive, disrespectful, or harmful language in LLM outputs. Any shifts in sentiment or toxicity should be closely monitored to ensure the model is behaving as expected. The following diagram illustrates the metric compute module.

Fig 3: Metric compute module – sentiment and toxicity

The workflow includes the following steps:

A Lambda function receives a prompt and completion (response) pair through Amazon Kinesis.
Through AWS Step Functions orchestration, the function calls Amazon Comprehend to detect the sentiment and toxicity.
The function saves the information to CloudWatch metrics.

For more information about detecting sentiment and toxicity with Amazon Comprehend, refer to Build a robust text-based toxicity predictor and Flag harmful content using Amazon Comprehend toxicity detection.

Ratio of refusals

An increase in refusals, such as when an LLM denies completion due to lack of information, could mean that either malicious users are trying to use the LLM in ways that are intended to jailbreak it, or that users’ expectations are not being met and they are getting low-value responses. One way to gauge how often this is happening is by comparing standard refusals from the LLM model being used with the actual responses from the LLM. For example, the following are some of Anthropic’s Claude v2 LLM common refusal phrases:

“Unfortunately, I do not have enough context to provide a substantive response. However, I am an AI assistant created by Anthropic to be helpful, harmless, and honest.”

“I apologize, but I cannot recommend ways to…”

“I'm an AI assistant created by Anthropic to be helpful, harmless, and honest.”

On a fixed set of prompts, an increase in these refusals can be a signal that the model has become overly cautious or sensitive. The inverse case should also be evaluated. It could be a signal that the model is now more prone to engage in toxic or harmful conversations.

To help model integrity and model refusal ratio, we can compare the response with a set of known refusal phrases from the LLM. This could be an actual classifier that can explain why the model refused the request. You can take the cosine distance between the response and known refusal responses from the model being monitored. The following diagram illustrates this metric compute module.

Fig 4: Metric compute module – ratio of refusals

The workflow consists of the following steps:

A Lambda function receives a prompt and completion (response) and gets an embedding from the response using Amazon Titan.
The function computes the cosine or Euclidian distance between the response and existing refusal prompts cached in memory.
The function sends that average to CloudWatch metrics.

Another option is to use fuzzy matching for a straightforward but less powerful approach to compare the known refusals to LLM output. Refer to the Python documentation for an example.

Summary

LLM observability is a critical practice for ensuring the reliable and trustworthy use of LLMs. Monitoring, understanding, and ensuring the accuracy and reliability of LLMs can help you mitigate the risks associated with these AI models. By monitoring hallucinations, bad completions (responses), and prompts, you can make sure your LLM stays on track and delivers the value you and your users are looking for. In this post, we discussed a few metrics to showcase examples.

For more information about evaluating foundation models, refer to Use SageMaker Clarify to evaluate foundation models, and browse additional example notebooks available in our GitHub repository. You can also explore ways to operationalize LLM evaluations at scale in Operationalize LLM Evaluation at Scale using Amazon SageMaker Clarify and MLOps services. Finally, we recommend referring to Evaluate large language models for quality and responsibility to learn more about evaluating LLMs.

About the Authors

Bruno Klein is a Senior Machine Learning Engineer with AWS Professional Services Analytics Practice. He helps customers implement big data and analytics solutions. Outside of work, he enjoys spending time with family, traveling, and trying new food.

Rushabh Lokhande is a Senior Data & ML Engineer with AWS Professional Services Analytics Practice. He helps customers implement big data, machine learning, and analytics solutions. Outside of work, he enjoys spending time with family, reading, running, and playing golf.

Knowledge distillation method for better vision-language models

February 22, 2024

by Amazon AWS

Method preserves knowledge encoded in teacher model’s attention heads even when student model has fewer of them.Read More

Streamline diarization using AI as an assistive technology: ZOO Digital’s story

February 20, 2024

by Ying Hou Amazon AWS

ZOO Digital provides end-to-end localization and media services to adapt original TV and movie content to different languages, regions, and cultures. It makes globalization easier for the world’s best content creators. Trusted by the biggest names in entertainment, ZOO Digital delivers high-quality localization and media services at scale, including dubbing, subtitling, scripting, and compliance.

Typical localization workflows require manual speaker diarization, wherein an audio stream is segmented based on the identity of the speaker. This time-consuming process must be completed before content can be dubbed into another language. With manual methods, a 30-minute episode can take between 1–3 hours to localize. Through automation, ZOO Digital aims to achieve localization in under 30 minutes.

In this post, we discuss deploying scalable machine learning (ML) models for diarizing media content using Amazon SageMaker, with a focus on the WhisperX model.

Background

ZOO Digital’s vision is to provide a faster turnaround of localized content. This goal is bottlenecked by the manually intensive nature of the exercise compounded by the small workforce of skilled people that can localize content manually. ZOO Digital works with over 11,000 freelancers and localized over 600 million words in 2022 alone. However, the supply of skilled people is being outstripped by the increasing demand for content, requiring automation to assist with localization workflows.

With an aim to accelerate the localization of content workflows through machine learning, ZOO Digital engaged AWS Prototyping, an investment program by AWS to co-build workloads with customers. The engagement focused on delivering a functional solution for the localization process, while providing hands-on training to ZOO Digital developers on SageMaker, Amazon Transcribe, and Amazon Translate.

Customer challenge

After a title (a movie or an episode of a TV series) has been transcribed, speakers must be assigned to each segment of speech so that they can be correctly assigned to the voice artists that are cast to play the characters. This process is called speaker diarization. ZOO Digital faces the challenge of diarizing content at scale while being economically viable.

Solution overview

In this prototype, we stored the original media files in a specified Amazon Simple Storage Service (Amazon S3) bucket. This S3 bucket was configured to emit an event when new files are detected within it, triggering an AWS Lambda function. For instructions on configuring this trigger, refer to the tutorial Using an Amazon S3 trigger to invoke a Lambda function. Subsequently, the Lambda function invoked the SageMaker endpoint for inference using the Boto3 SageMaker Runtime client.

The WhisperX model, based on OpenAI’s Whisper, performs transcriptions and diarization for media assets. It’s built upon the Faster Whisper reimplementation, offering up to four times faster transcription with improved word-level timestamp alignment compared to Whisper. Additionally, it introduces speaker diarization, not present in the original Whisper model. WhisperX utilizes the Whisper model for transcriptions, the Wav2Vec2 model to enhance timestamp alignment (ensuring synchronization of transcribed text with audio timestamps), and the pyannote model for diarization. FFmpeg is used for loading audio from source media, supporting various media formats. The transparent and modular model architecture allows flexibility, because each component of the model can be swapped out as needed in the future. However, it’s essential to note that WhisperX lacks full management features and isn’t an enterprise-level product. Without maintenance and support, it may not be suitable for production deployment.

In this collaboration, we deployed and evaluated WhisperX on SageMaker, using an asynchronous inference endpoint to host the model. SageMaker asynchronous endpoints support upload sizes up to 1 GB and incorporate auto scaling features that efficiently mitigate traffic spikes and save costs during off-peak times. Asynchronous endpoints are particularly well-suited for processing large files, such as movies and TV series in our use case.

The following diagram illustrates the core elements of the experiments we conducted in this collaboration.

In the following sections, we delve into the details of deploying the WhisperX model on SageMaker, and evaluate the diarization performance.

Download the model and its components

WhisperX is a system that includes multiple models for transcription, forced alignment, and diarization. For smooth SageMaker operation without the need to fetch model artifacts during inference, it’s essential to pre-download all model artifacts. These artifacts are then loaded into the SageMaker serving container during initiation. Because these models aren’t directly accessible, we offer descriptions and sample code from the WhisperX source, providing instructions on downloading the model and its components.

WhisperX uses six models:

A Faster Whisper model
A Voice Activity Detection (VAD) model
A Wav2Vec2 model
pyannote’s Speaker Diarization model
pyannote’s Segmentation model
SpeechBrain’s Speaker Embedding model

Most of these models can be obtained from Hugging Face using the huggingface_hub library. We use the following download_hf_model() function to retrieve these model artifacts. An access token from Hugging Face, generated after accepting the user agreements for the following pyannote models, is required:

import huggingface_hub
import yaml
import torchaudio
import urllib.request
import os

CONTAINER_MODEL_DIR = "/opt/ml/model"
WHISPERX_MODEL = "guillaumekln/faster-whisper-large-v2"
VAD_MODEL_URL = "https://whisperx.s3.eu-west-2.amazonaws.com/model_weights/segmentation/0b5b3216d60a2d32fc086b47ea8c67589aaeb26b7e07fcbe620d6d0b83e209ea/pytorch_model.bin"
WAV2VEC2_MODEL = "WAV2VEC2_ASR_BASE_960H"
DIARIZATION_MODEL = "pyannote/speaker-diarization"

def download_hf_model(model_name: str, hf_token: str, local_model_dir: str) -> str:
    """
    Fetches the provided model from HuggingFace and returns the subdirectory it is downloaded to
    :param model_name: HuggingFace model name (and an optional version, appended with @[version])
    :param hf_token: HuggingFace access token authorized to access the requested model
    :param local_model_dir: The local directory to download the model to
    :return: The subdirectory within local_modeL_dir that the model is downloaded to
    """
    model_subdir = model_name.split('@')[0]
    huggingface_hub.snapshot_download(model_subdir, token=hf_token, local_dir=f"{local_model_dir}/{model_subdir}", local_dir_use_symlinks=False)
    return model_subdir

The VAD model is fetched from Amazon S3, and the Wav2Vec2 model is retrieved from the torchaudio.pipelines module. Based on the following code, we can retrieve all the models’ artifacts, including those from Hugging Face, and save them to the specified local model directory:

def fetch_models(hf_token: str, local_model_dir="./models"):
    """
    Fetches all required models to run WhisperX locally without downloading models every time 
    :param hf_token: A huggingface access token to download the models
    :param local_model_dir: The directory to download the models to
    """
    # Fetch Faster Whisper's Large V2 model from HuggingFace
    download_hf_model(model_name=WHISPERX_MODEL, hf_token=hf_token, local_model_dir=local_model_dir)

    # Fetch WhisperX's VAD Segmentation model from S3
    vad_model_dir = "whisperx/vad"
    if not os.path.exists(f"{local_model_dir}/{vad_model_dir}"):
        os.makedirs(f"{local_model_dir}/{vad_model_dir}")

    urllib.request.urlretrieve(VAD_MODEL_URL, f"{local_model_dir}/{vad_model_dir}/pytorch_model.bin")

    # Fetch the Wav2Vec2 alignment model
    torchaudio.pipelines.__dict__[WAV2VEC2_MODEL].get_model(dl_kwargs={"model_dir": f"{local_model_dir}/wav2vec2/"})

    # Fetch pyannote's Speaker Diarization model from HuggingFace
    download_hf_model(model_name=DIARIZATION_MODEL,
                      hf_token=hf_token,
                      local_model_dir=local_model_dir)

    # Read in the Speaker Diarization model config to fetch models and update with their local paths
    with open(f"{local_model_dir}/{DIARIZATION_MODEL}/config.yaml", 'r') as file:
        diarization_config = yaml.safe_load(file)

    embedding_model = diarization_config['pipeline']['params']['embedding']
    embedding_model_dir = download_hf_model(model_name=embedding_model,
                                            hf_token=hf_token,
                                            local_model_dir=local_model_dir)
    diarization_config['pipeline']['params']['embedding'] = f"{CONTAINER_MODEL_DIR}/{embedding_model_dir}"

    segmentation_model = diarization_config['pipeline']['params']['segmentation']
    segmentation_model_dir = download_hf_model(model_name=segmentation_model,
                                               hf_token=hf_token,
                                               local_model_dir=local_model_dir)
    diarization_config['pipeline']['params']['segmentation'] = f"{CONTAINER_MODEL_DIR}/{segmentation_model_dir}/pytorch_model.bin"

    with open(f"{local_model_dir}/{DIARIZATION_MODEL}/config.yaml", 'w') as file:
        yaml.safe_dump(diarization_config, file)

    # Read in the Speaker Embedding model config to update it with its local path
    speechbrain_hyperparams_path = f"{local_model_dir}/{embedding_model_dir}/hyperparams.yaml"
    with open(speechbrain_hyperparams_path, 'r') as file:
        speechbrain_hyperparams = file.read()

    speechbrain_hyperparams = speechbrain_hyperparams.replace(embedding_model_dir, f"{CONTAINER_MODEL_DIR}/{embedding_model_dir}")

    with open(speechbrain_hyperparams_path, 'w') as file:
        file.write(speechbrain_hyperparams)

Select the appropriate AWS Deep Learning Container for serving the model

After the model artifacts are saved using the preceding sample code, you can choose pre-built AWS Deep Learning Containers (DLCs) from the following GitHub repo. When selecting the Docker image, consider the following settings: framework (Hugging Face), task (inference), Python version, and hardware (for example, GPU). We recommend using the following image: 763104351884.dkr.ecr.[REGION].amazonaws.com/huggingface-pytorch-inference:2.0.0-transformers4.28.1-gpu-py310-cu118-ubuntu20.04 This image has all the necessary system packages pre-installed, such as ffmpeg. Remember to replace [REGION] with the AWS Region you are using.

For other required Python packages, create a requirements.txt file with a list of packages and their versions. These packages will be installed when the AWS DLC is built. The following are the additional packages needed to host the WhisperX model on SageMaker:

faster-whisper==0.7.1 
git+https://github.com/m-bain/whisperx.git@1b092de19a1878a8f138f665b1467ca21b076e7e 
ffmpeg-python

Create an inference script to load the models and run inference

Next, we create a custom inference.py script to outline how the WhisperX model and its components are loaded into the container and how the inference process should be run. The script contains two functions: model_fn and transform_fn. The model_fn function is invoked to load the models from their respective locations. Subsequently, these models are passed to the transform_fn function during inference, where transcription, alignment, and diarization processes are performed. The following is a code sample for inference.py:

import io
import json
import logging
import tempfile
import time

import torch
import whisperx

DEVICE = 'cuda' if torch.cuda.is_available() else 'cpu'

def model_fn(model_dir: str) -> dict:
    """
    Deserialize and return the models
    """
    logging.info("Loading WhisperX model")
    model = whisperx.load_model(whisper_arch=f"{model_dir}/guillaumekln/faster-whisper-large-v2",
                                device=DEVICE,
                                language="en",
                                compute_type="float16",
                                vad_options={'model_fp': f"{model_dir}/whisperx/vad/pytorch_model.bin"})

    logging.info("Loading alignment model")
    align_model, metadata = whisperx.load_align_model(language_code="en",
                                                      device=DEVICE,
                                                      model_name="WAV2VEC2_ASR_BASE_960H",
                                                      model_dir=f"{model_dir}/wav2vec2")

    logging.info("Loading diarization model")
    diarization_model = whisperx.DiarizationPipeline(model_name=f"{model_dir}/pyannote/speaker-diarization/config.yaml",
                                                     device=DEVICE)

    return {
        'model': model,
        'align_model': align_model,
        'metadata': metadata,
        'diarization_model': diarization_model
    }

def transform_fn(model: dict, request_body: bytes, request_content_type: str, response_content_type="application/json") -> (str, str):
    """
    Load in audio from the request, transcribe and diarize, and return JSON output
    """

    # Start a timer so that we can log how long inference takes
    start_time = time.time()

    # Unpack the models
    whisperx_model = model['model']
    align_model = model['align_model']
    metadata = model['metadata']
    diarization_model = model['diarization_model']

    # Load the media file (the request_body as bytes) into a temporary file, then use WhisperX to load the audio from it
    logging.info("Loading audio")
    with io.BytesIO(request_body) as file:
        tfile = tempfile.NamedTemporaryFile(delete=False)
        tfile.write(file.read())
        audio = whisperx.load_audio(tfile.name)

    # Run transcription
    logging.info("Transcribing audio")
    result = whisperx_model.transcribe(audio, batch_size=16)

    # Align the outputs for better timings
    logging.info("Aligning outputs")
    result = whisperx.align(result["segments"], align_model, metadata, audio, DEVICE, return_char_alignments=False)

    # Run diarization
    logging.info("Running diarization")
    diarize_segments = diarization_model(audio)
    result = whisperx.assign_word_speakers(diarize_segments, result)

    # Calculate the time it took to perform the transcription and diarization
    end_time = time.time()
    elapsed_time = end_time - start_time
    logging.info(f"Transcription and Diarization took {int(elapsed_time)} seconds")

    # Return the results to be stored in S3
    return json.dumps(result), response_content_type

Within the model’s directory, alongside the requirements.txt file, ensure the presence of inference.py in a code subdirectory. The models directory should resemble the following:

models
├── code
│   ├── inference.py
│   └── requirements.txt
├── guillaumekln
│   └── faster-whisper-large-v2
├── pyannote
│   ├── segmentation
│   │   └── ...
│   └── speaker-diarization
│       └── ...
├── speechbrain
│   └── spkrec-ecapa-voxceleb
│       └── ...
├── wav2vec2
│   └── ...
└── whisperx
    └── vad
        └── ...

Create a tarball of the models

After you create the models and code directories, you can use the following command lines to compress the model into a tarball (.tar.gz file) and upload it to Amazon S3. At the time of writing, using the faster-whisper Large V2 model, the resulting tarball representing the SageMaker model is 3 GB in size. For more information, refer to Model hosting patterns in Amazon SageMaker, Part 2: Getting started with deploying real time models on SageMaker.

# Save the model artifacts to the 'model' directory and create a tarball
tar cvzf model.tar.gz -C model/ .
# Upload the model to S3
aws s3 cp model.tar.gz s3://<target_bucket>

Create a SageMaker model and deploy an endpoint with an asynchronous predictor

Now you can create the SageMaker model, endpoint config, and asynchronous endpoint with AsyncPredictor using the model tarball created in the previous step. For instructions, refer to Create an Asynchronous Inference Endpoint.

Evaluate diarization performance

To assess the diarization performance of the WhisperX model in various scenarios, we selected three episodes each from two English titles: one drama title consisting of 30-minute episodes, and one documentary title consisting of 45-minute episodes. We utilized pyannote’s metrics toolkit, pyannote.metrics, to calculate the diarization error rate (DER). In the evaluation, manually transcribed and diarized transcripts provided by ZOO served as the ground truth.

We defined the DER as follows:

Total is the length of the ground truth video. FA (False Alarm) is the length of segments that are considered as speech in predictions, but not in ground truth. Miss is the length of segments that are considered as speech in ground truth, but not in prediction. Error, also called Confusion, is the length of segments that are assigned to different speakers in prediction and ground truth. All the units are measured in seconds. The typical values for DER can vary depending on the specific application, dataset, and the quality of the diarization system. Note that DER can be larger than 1.0. A lower DER is better.

To be able to calculate the DER for a piece of media, a ground truth diarization is required as well as the WhisperX transcribed and diarized outputs. These must be parsed and result in lists of tuples containing a speaker label, speech segment start time, and speech segment end time for each segment of speech in the media. The speaker labels don’t need to match between the WhisperX and ground truth diarizations. The results are based mostly on the time of the segments. pyannote.metrics takes these tuples of ground truth diarizations and output diarizations (referred to in the pyannote.metrics documentation as reference and hypothesis) to calculate the DER. The following table summarizes our results.

Video Type	DER	Correct	Miss	Error	False Alarm
Drama	0.738	44.80%	21.80%	33.30%	18.70%
Documentary	1.29	94.50%	5.30%	0.20%	123.40%
Average	0.901	71.40%	13.50%	15.10%	61.50%

These results reveal a significant performance difference between the drama and documentary titles, with the model achieving notably better results (using DER as an aggregate metric) for the drama episodes compared to the documentary title. A closer analysis of the titles provides insights into potential factors contributing to this performance gap. One key factor could be the frequent presence of background music overlapping with speech in the documentary title. Although preprocessing media to enhance diarization accuracy, such as removing background noise to isolate speech, was beyond the scope of this prototype, it opens avenues for future work that could potentially enhance the performance of WhisperX.

Conclusion

In this post, we explored the collaborative partnership between AWS and ZOO Digital, employing machine learning techniques with SageMaker and the WhisperX model to enhance the diarization workflow. The AWS team played a pivotal role in assisting ZOO in prototyping, evaluating, and understanding the effective deployment of custom ML models, specifically designed for diarization. This included incorporating auto scaling for scalability using SageMaker.

Harnessing AI for diarization will lead to substantial savings in both cost and time when generating localized content for ZOO. By aiding transcribers in swiftly and precisely creating and identifying speakers, this technology addresses the traditionally time-consuming and error-prone nature of the task. The conventional process often involves multiple passes through the video and additional quality control steps to minimize errors. The adoption of AI for diarization enables a more targeted and efficient approach, thereby increasing productivity within a shorter timeframe.

We’ve outlined key steps to deploy the WhisperX model on the SageMaker asynchronous endpoint, and encourage you to try it yourself using the provided code. For further insights into ZOO Digital’s services and technology, visit ZOO Digital’s official site. For details on deploying the OpenAI Whisper model on SageMaker and various inference options, refer to Host the Whisper Model on Amazon SageMaker: exploring inference options. Feel free to share your thoughts in the comments.

About the Authors

Ying Hou, PhD, is a Machine Learning Prototyping Architect at AWS. Her primary areas of interest encompass Deep Learning, with a focus on GenAI, Computer Vision, NLP, and time series data prediction. In her spare time, she relishes spending quality moments with her family, immersing herself in novels, and hiking in the national parks of the UK.

Ethan Cumberland is an AI Research Engineer at ZOO Digital, where he works on using AI and Machine Learning as assistive technologies to improve workflows in speech, language, and localisation. He has a background in software engineering and research in the security and policing domain, focusing on extracting structured information from the web and leveraging open-source ML models for analysing and enriching collected data.

Gaurav Kaila leads the AWS Prototyping team for UK & Ireland. His team works with customers across diverse industries to ideate & co-develop business critical workloads with a mandate to accelerate adoption of AWS services.

Scenario Diffusion helps Zoox vehicles navigate safety-critical situations

February 20, 2024

by Amazon AWS

Generative AI supports the creation, at scale, of complex, realistic driving scenarios that can be directed to specific locations and environments.Read More

Run ML inference on unplanned and spiky traffic using Amazon SageMaker multi-model endpoints

February 19, 2024

by Ram Vegiraju Amazon AWS

Amazon SageMaker multi-model endpoints (MMEs) are a fully managed capability of SageMaker inference that allows you to deploy thousands of models on a single endpoint. Previously, MMEs pre-determinedly allocated CPU computing power to models statically regardless the model traffic load, using Multi Model Server (MMS) as its model server. In this post, we discuss a solution in which an MME can dynamically adjust the compute power assigned to each model based on the model’s traffic pattern. This solution enables you to use the underlying compute of MMEs more efficiently and save costs.

MMEs dynamically load and unload models based on incoming traffic to the endpoint. When utilizing MMS as the model server, MMEs allocate a fixed number of model workers for each model. For more information, refer to Model hosting patterns in Amazon SageMaker, Part 3: Run and optimize multi-model inference with Amazon SageMaker multi-model endpoints.

However, this can lead to a few issues when your traffic pattern is variable. Let’s say you have a singular or few models receiving a large amount of traffic. You can configure MMS to allocate a high number of workers for these models, but this gets assigned to all the models behind the MME because it’s a static configuration. This leads to a large number of workers using hardware compute—even the idle models. The opposite problem can happen if you set a small value for the number of workers. The popular models won’t have enough workers at the model server level to properly allocate enough hardware behind the endpoint for these models. The main issue is that it’s difficult to remain traffic pattern agnostic if you can’t dynamically scale your workers at the model server level to allocate the necessary amount of compute.

The solution we discuss in this post uses DJLServing as the model server, which can help mitigate some of the issues that we discussed and enable per-model scaling and enable MMEs to be traffic pattern agnostic.

MME architecture

SageMaker MMEs enable you to deploy multiple models behind a single inference endpoint that may contain one or more instances. Each instance is designed to load and serve multiple models up to its memory and CPU/GPU capacity. With this architecture, a software as a service (SaaS) business can break the linearly increasing cost of hosting multiple models and achieve reuse of infrastructure consistent with the multi-tenancy model applied elsewhere in the application stack. The following diagram illustrates this architecture.

A SageMaker MME dynamically loads models from Amazon Simple Storage Service (Amazon S3) when invoked, instead of downloading all the models when the endpoint is first created. As a result, an initial invocation to a model might see higher inference latency than the subsequent inferences, which are completed with low latency. If the model is already loaded on the container when invoked, then the download step is skipped and the model returns the inferences with low latency. For example, assume you have a model that is only used a few times a day. It’s automatically loaded on demand, whereas frequently accessed models are retained in memory and invoked with consistently low latency.

Behind each MME are model hosting instances, as depicted in the following diagram. These instances load and evict multiple models to and from memory based on the traffic patterns to the models.

SageMaker continues to route inference requests for a model to the instance where the model is already loaded such that the requests are served from a cached model copy (see the following diagram, which shows the request path for the first prediction request vs. the cached prediction request path). However, if the model receives many invocation requests, and there are additional instances for the MME, SageMaker routes some requests to another instance to accommodate the increase. To take advantage of automated model scaling in SageMaker, make sure you have instance auto scaling set up to provision additional instance capacity. Set up your endpoint-level scaling policy with either custom parameters or invocations per minute (recommended) to add more instances to the endpoint fleet.

Model server overview

A model server is a software component that provides a runtime environment for deploying and serving machine learning (ML) models. It acts as an interface between the trained models and client applications that want to make predictions using those models.

The primary purpose of a model server is to allow effortless integration and efficient deployment of ML models into production systems. Instead of embedding the model directly into an application or a specific framework, the model server provides a centralized platform where multiple models can be deployed, managed, and served.

Model servers typically offer the following functionalities:

Model loading – The server loads the trained ML models into memory, making them ready for serving predictions.
Inference API – The server exposes an API that allows client applications to send input data and receive predictions from the deployed models.
Scaling – Model servers are designed to handle concurrent requests from multiple clients. They provide mechanisms for parallel processing and managing resources efficiently to ensure high throughput and low latency.
Integration with backend engines – Model servers have integrations with backend frameworks like DeepSpeed and FasterTransformer to partition large models and run highly optimized inference.

DJL architecture

DJL Serving is an open source, high performance, universal model server. DJL Serving is built on top of DJL, a deep learning library written in the Java programming language. It can take a deep learning model, several models, or workflows and make them available through an HTTP endpoint. DJL Serving supports deploying models from multiple frameworks like PyTorch, TensorFlow, Apache MXNet, ONNX, TensorRT, Hugging Face Transformers, DeepSpeed, FasterTransformer, and more.

DJL Serving offers many features that allow you to deploy your models with high performance:

Ease of use – DJL Serving can serve most models out of the box. Just bring the model artifacts, and DJL Serving can host them.
Multiple device and accelerator support – DJL Serving supports deploying models on CPU, GPU, and AWS Inferentia.
Performance – DJL Serving runs multithreaded inference in a single JVM to boost throughput.
Dynamic batching – DJL Serving supports dynamic batching to increase throughput.
Auto scaling – DJL Serving will automatically scale workers up and down based on the traffic load.
Multi-engine support – DJL Serving can simultaneously host models using different frameworks (such as PyTorch and TensorFlow).
Ensemble and workflow models – DJL Serving supports deploying complex workflows comprised of multiple models, and runs parts of the workflow on CPU and parts on GPU. Models within a workflow can use different frameworks.

In particular, the auto scaling feature of DJL Serving makes it straightforward to ensure the models are scaled appropriately for the incoming traffic. By default, DJL Serving determines the maximum number of workers for a model that can be supported based on the hardware available (CPU cores, GPU devices). You can set lower and upper bounds for each model to make sure that a minimum traffic level can always be served, and that a single model doesn’t consume all available resources.

DJL Serving uses a Netty frontend on top of backend worker thread pools. The frontend uses a single Netty setup with multiple HttpRequestHandlers. Different request handlers will provide support for the Inference API, Management API, or other APIs available from various plugins.

The backend is based around the WorkLoadManager (WLM) module. The WLM takes care of multiple worker threads for each model along with the batching and request routing to them. When multiple models are served, WLM checks the inference request queue size of each model first. If the queue size is greater than two times a model’s batch size, WLM scales up the number of workers assigned to that model.

Solution overview

The implementation of DJL with an MME differs from the default MMS setup. For DJL Serving with an MME, we compress the following files in the model.tar.gz format that SageMaker Inference is expecting:

model.joblib – For this implementation, we directly push the model metadata into the tarball. In this case, we are working with a .joblib file, so we provide that file in our tarball for our inference script to read. If the artifact is too large, you can also push it to Amazon S3 and point towards that in the serving configuration you define for DJL.
serving.properties – Here you can configure any model server-related environment variables. The power of DJL here is that you can configure minWorkers and maxWorkers for each model tarball. This allows for each model to scale up and down at the model server level. For instance, if a singular model is receiving the majority of the traffic for an MME, the model server will scale the workers up dynamically. In this example, we don’t configure these variables and let DJL determine the necessary number of workers depending on our traffic pattern.
model.py – This is the inference script for any custom preprocessing or postprocessing you would like to implement. The model.py expects your logic to be encapsulated in a handle method by default.
requirements.txt (optional) – By default, DJL comes installed with PyTorch, but any additional dependencies you need can be pushed here.

For this example, we showcase the power of DJL with an MME by taking a sample SKLearn model. We run a training job with this model and then create 1,000 copies of this model artifact to back our MME. We then showcase how DJL can dynamically scale to handle any type of traffic pattern that your MME may receive. This can include an even distribution of traffic across all models or even a few popular models receiving the majority of the traffic. You can find all the code in the following GitHub repo.

Prerequisites

For this example, we use a SageMaker notebook instance with a conda_python3 kernel and ml.c5.xlarge instance. To perform the load tests, you can use an Amazon Elastic Compute Cloud (Amazon EC2) instance or a larger SageMaker notebook instance. In this example, we scale to over a thousand transactions per second (TPS), so we suggest testing on a heavier EC2 instance such as an ml.c5.18xlarge so that you have more compute to work with.

Create a model artifact

We first need to create our model artifact and data that we use in this example. For this case, we generate some artificial data with NumPy and train using an SKLearn linear regression model with the following code snippet:

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
import joblib

# Generate dummy data
np.random.seed(0)
X = np.random.rand(100, 1)
y = 2 * X + 1 + 0.1 * np.random.randn(100, 1)
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Create a Linear Regression model
model = LinearRegression()
# Train the model on the training data
model.fit(X_train, y_train)

# Create serialized model artifact
model_filename = "model.joblib"
joblib.dump(model, model_filename)

After you run the preceding code, you should have a model.joblib file created in your local environment.

Pull the DJL Docker image

The Docker image djl-inference:0.23.0-cpu-full-v1.0 is our DJL serving container used in this example. You can adjust the following URL depending on your Region:

inference_image_uri = "474422712127.dkr.ecr.us-east-1.amazonaws.com/djl-serving-cpu:latest"

Optionally, you can also use this image as a base image and extend it to build your own Docker image on Amazon Elastic Container Registry (Amazon ECR) with any other dependencies you need.

Create the model file

First, we create a file called serving.properties. This instructs DJLServing to use the Python engine. We also define the max_idle_time of a worker to be 600 seconds. This makes sure that we take longer to scale down the number of workers we have per model. We don’t adjust minWorkers and maxWorkers that we can define and we let DJL dynamically compute the number of workers needed depending on the traffic each model is receiving. The serving.properties is shown as follows. To see the complete list of configuration options, refer to Engine Configuration.

engine=Python
max_idle_time=600

Next, we create our model.py file, which defines the model loading and inference logic. For MMEs, each model.py file is specific to a model. Models are stored in their own paths under the model store (usually /opt/ml/model/). When loading models, they will be loaded under the model store path in their own directory. The full model.py example in this demo can be seen in the GitHub repo.

We create a model.tar.gz file that includes our model (model.joblib), model.py, and serving.properties:

#Build tar file with model data + inference code, replace this cell with your model.joblib
bashCommand = "tar -cvpzf model.tar.gz model.joblib requirements.txt model.py serving.properties"
process = subprocess.Popen(bashCommand.split(), stdout=subprocess.PIPE)
output, error = process.communicate()

For demonstration purposes, we make 1,000 copies of the same model.tar.gz file to represent the large number of models to be hosted. In production, you need to create a model.tar.gz file for each of your models.

Lastly, we upload these models to Amazon S3.

Create a SageMaker model

We now create a SageMaker model. We use the ECR image defined earlier and the model artifact from the previous step to create the SageMaker model. In the model setup, we configure Mode as MultiModel. This tells DJLServing that we’re creating an MME.

mme_model_name = "sklearn-djl-mme" + strftime("%Y-%m-%d-%H-%M-%S", gmtime())
print("Model name: " + mme_model_name)

create_model_response = sm_client.create_model(
ModelName=mme_model_name,
ExecutionRoleArn=role,
PrimaryContainer={"Image": inference_image_uri, "Mode": "MultiModel", "ModelDataUrl": mme_artifacts},
)

Create a SageMaker endpoint

In this demo, we use 20 ml.c5d.18xlarge instances to scale to a TPS in the thousands range. Make sure to get a limit increase on your instance type, if necessary, to achieve the TPS you are targeting.

mme_epc_name = "sklearn-djl-mme-epc" + strftime("%Y-%m-%d-%H-%M-%S", gmtime())
endpoint_config_response = sm_client.create_endpoint_config(
EndpointConfigName=mme_epc_name,
ProductionVariants=[
{
"VariantName": "sklearnvariant",
"ModelName": mme_model_name,
"InstanceType": "ml.c5d.18xlarge",
"InitialInstanceCount": 20
},],)

Load testing

At the time of writing, the SageMaker in-house load testing tool Amazon SageMaker Inference Recommender doesn’t natively support testing for MMEs. Therefore, we use the open source Python tool Locust. Locust is straightforward to set up and can track metrics such as TPS and end-to-end latency. For a full understanding of how to set it up with SageMaker, see Best practices for load testing Amazon SageMaker real-time inference endpoints.

In this use case, we have three different traffic patterns we want to simulate with MMEs, so we have the following three Python scripts that align with each pattern. Our goal here is to prove that, regardless of what our traffic pattern is, we can achieve the same target TPS and scale appropriately.

Evenly distributed traffic (locust_distributed.py)
90% traffic to 10 popular models (locust_top_n.py)
90% traffic to a single hot model (locust_script_one.py)

We can specify a weight in our Locust script to assign traffic across different portions of our models. For instance, with our single hot model, we implement two methods as follows:

# popular model
def sendPopular(self):

        request_meta = {
            "request_type": "InvokeEndpoint",
            "name": "SageMaker",
            "start_time": time.time(),
            "response_length": 0,
            "response": None,
            "context": {},
            "exception": None,
        }
        start_perf_counter = time.perf_counter()
        try:
            response = self.sagemaker_client.invoke_endpoint(
                EndpointName=self.endpoint_name,
                Body=self.payload,
                ContentType=self.content_type,
                TargetModel = "sklearn-0.tar.gz"
            )
  
# rest of model          
def sendRest(self):

        request_meta = {
            "request_type": "InvokeEndpoint",
            "name": "SageMaker",
            "start_time": time.time(),
            "response_length": 0,
            "response": None,
            "context": {},
            "exception": None,
        }
        start_perf_counter = time.perf_counter()
   
        try:
            response = self.sagemaker_client.invoke_endpoint(
                EndpointName=self.endpoint_name,
                Body=self.payload,
                ContentType=self.content_type,
                TargetModel = f'sklearn-{random.randint(1,989)}.tar.gz'
            )
            response_body = response["Body"].read()

We can then assign a certain weight to each method, which is when a certain method receives a specific percentage of the traffic:

# assign weights to models
class MyUser(BotoUser):

# 90% of traffic to singular model
@task(9)
def send_request(self):
self.client.sendPopular()

@task
def send_request_major(self):
self.client.sendRest()

For 20 ml.c5d.18xlarge instances, we see the following invocation metrics on the Amazon CloudWatch console. These values remain fairly consistent across all three traffic patterns. To understand CloudWatch metrics for SageMaker real-time inference and MMEs better, refer to SageMaker Endpoint Invocation Metrics.

You can find the rest of the Locust scripts in the locust-utils directory in the GitHub repository.

Summary

In this post, we discussed how an MME can dynamically adjust the compute power assigned to each model based on the model’s traffic pattern. This newly launched feature is available in all AWS Regions where SageMaker is available. Note that at the time of announcement, only CPU instances are supported. To learn more, refer to Supported algorithms, frameworks, and instances.

About the Authors

Ram Vegiraju is a ML Architect with the SageMaker Service team. He focuses on helping customers build and optimize their AI/ML solutions on Amazon SageMaker. In his spare time, he loves traveling and writing.

Qingwei Li is a Machine Learning Specialist at Amazon Web Services. He received his Ph.D. in Operations Research after he broke his advisor’s research grant account and failed to deliver the Nobel Prize he promised. Currently he helps customers in the financial service and insurance industry build machine learning solutions on AWS. In his spare time, he likes reading and teaching.

James Wu is a Senior AI/ML Specialist Solution Architect at AWS. helping customers design and build AI/ML solutions. James’s work covers a wide range of ML use cases, with a primary interest in computer vision, deep learning, and scaling ML across the enterprise. Prior to joining AWS, James was an architect, developer, and technology leader for over 10 years, including 6 years in engineering and 4 years in marketing & advertising industries.

Saurabh Trikande is a Senior Product Manager for Amazon SageMaker Inference. He is passionate about working with customers and is motivated by the goal of democratizing machine learning. He focuses on core challenges related to deploying complex ML applications, multi-tenant ML models, cost optimizations, and making deployment of deep learning models more accessible. In his spare time, Saurabh enjoys hiking, learning about innovative technologies, following TechCrunch and spending time with his family.

Xu Deng is a Software Engineer Manager with the SageMaker team. He focuses on helping customers build and optimize their AI/ML inference experience on Amazon SageMaker. In his spare time, he loves traveling and snowboarding.

Siddharth Venkatesan is a Software Engineer in AWS Deep Learning. He currently focusses on building solutions for large model inference. Prior to AWS he worked in the Amazon Grocery org building new payment features for customers world-wide. Outside of work, he enjoys skiing, the outdoors, and watching sports.

Rohith Nallamaddi is a Software Development Engineer at AWS. He works on optimizing deep learning workloads on GPUs, building high performance ML inference and serving solutions. Prior to this, he worked on building microservices based on AWS for Amazon F3 business. Outside of work he enjoys playing and watching sports.

Use Amazon Titan models for image generation, editing, and searching

February 19, 2024

by Rohit Mittal Amazon AWS

Amazon Bedrock provides a broad range of high-performing foundation models from Amazon and other leading AI companies, including Anthropic, AI21, Meta, Cohere, and Stability AI, and covers a wide range of use cases, including text and image generation, searching, chat, reasoning and acting agents, and more. The new Amazon Titan Image Generator model allows content creators to quickly generate high-quality, realistic images using simple English text prompts. The advanced AI model understands complex instructions with multiple objects and returns studio-quality images suitable for advertising, ecommerce, and entertainment. Key features include the ability to refine images by iterating on prompts, automatic background editing, and generating multiple variations of the same scene. Creators can also customize the model with their own data to output on-brand images in a specific style. Importantly, Titan Image Generator has in-built safeguards, like invisible watermarks on all AI-generated images, to encourage responsible use and mitigate the spread of disinformation. This innovative technology makes producing custom images in large volume for any industry more accessible and efficient.

The new Amazon Titan Multimodal Embeddings model helps build more accurate search and recommendations by understanding text, images, or both. It converts images and English text into semantic vectors, capturing meaning and relationships in your data. You can combine text and images like product descriptions and photos to identify items more effectively. The vectors power speedy, accurate search experiences. Titan Multimodal Embeddings is flexible in vector dimensions, enabling optimization for performance needs. An asynchronous API and Amazon OpenSearch Service connector make it easy to integrate the model into your neural search applications.

In this post, we walk through how to use the Titan Image Generator and Titan Multimodal Embeddings models via the AWS Python SDK.

Image generation and editing

In this section, we demonstrate the basic coding patterns for using the AWS SDK to generate new images and perform AI-powered edits on existing images. Code examples are provided in Python, and JavaScript (Node.js) is also available in this GitHub repository.

Before you can write scripts that use the Amazon Bedrock API, you need to install the appropriate version of the AWS SDK in your environment. For Python scripts, you can use the AWS SDK for Python (Boto3). Python users may also want to install the Pillow module, which facilitates image operations like loading and saving images. For setup instructions, refer to the GitHub repository.

Additionally, enable access to the Amazon Titan Image Generator and Titan Multimodal Embeddings models. For more information, refer to Model access.

Helper functions

The following function sets up the Amazon Bedrock Boto3 runtime client and generates images by taking payloads of different configurations (which we discuss later in this post):

import boto3
import json, base64, io
from random import randint
from PIL import Image

bedrock_runtime_client = boto3.client("bedrock-runtime")


def titan_image(
    payload: dict,
    num_image: int = 2,
    cfg: float = 10.0,
    seed: int = None,
    modelId: str = "amazon.titan-image-generator-v1",
) -> list:
    #   ImageGenerationConfig Options:
    #   - numberOfImages: Number of images to be generated
    #   - quality: Quality of generated images, can be standard or premium
    #   - height: Height of output image(s)
    #   - width: Width of output image(s)
    #   - cfgScale: Scale for classifier-free guidance
    #   - seed: The seed to use for reproducibility
    seed = seed if seed is not None else randint(0, 214783647)
    body = json.dumps(
        {
            **payload,
            "imageGenerationConfig": {
                "numberOfImages": num_image,  # Range: 1 to 5
                "quality": "premium",  # Options: standard/premium
                "height": 1024,  # Supported height list above
                "width": 1024,  # Supported width list above
                "cfgScale": cfg,  # Range: 1.0 (exclusive) to 10.0
                "seed": seed,  # Range: 0 to 214783647
            },
        }
    )

    response = bedrock_runtime_client.invoke_model(
        body=body,
        modelId=modelId,
        accept="application/json",
        contentType="application/json",
    )

    response_body = json.loads(response.get("body").read())
    images = [
        Image.open(io.BytesIO(base64.b64decode(base64_image)))
        for base64_image in response_body.get("images")
    ]
    return images

Generate images from text

Scripts that generate a new image from a text prompt follow this implementation pattern:

Configure a text prompt and optional negative text prompt.
Use the BedrockRuntime client to invoke the Titan Image Generator model.
Parse and decode the response.
Save the resulting images to disk.

Text-to-image

The following is a typical image generation script for the Titan Image Generator model:

# Text Variation
# textToImageParams Options:
#   text: prompt to guide the model on how to generate variations
#   negativeText: prompts to guide the model on what you don't want in image
images = titan_image(
    {
        "taskType": "TEXT_IMAGE",
        "textToImageParams": {
            "text": "two dogs walking down an urban street, facing the camera",  # Required
            "negativeText": "cars",  # Optional
        },
    }
)

This will produce images similar to the following.

Response Image 1	Response Image 2

Image variants

Image variation provides a way to generate subtle variants of an existing image. The following code snippet uses one of the images generated in the previous example to create variant images:

# Import an input image like this (only PNG/JPEG supported):
with open("<YOUR_IMAGE_FILE_PATH>", "rb") as image_file:
    input_image = base64.b64encode(image_file.read()).decode("utf8")

# Image Variation
# ImageVariationParams Options:
#   text: prompt to guide the model on how to generate variations
#   negativeText: prompts to guide the model on what you don't want in image
#   images: base64 string representation of the input image, only 1 is supported
images = titan_image(
    {
        "taskType": "IMAGE_VARIATION",
        "imageVariationParams": {
            "text": "two dogs walking down an urban street, facing the camera",  # Required
            "images": [input_image],  # One image is required
            "negativeText": "cars",  # Optional
        },
    },
)

This will produce images similar to the following.

Original Image	Response Image 1	Response Image 2

Edit an existing image

The Titan Image Generator model allows you to add, remove, or replace elements or areas within an existing image. You specify which area to affect by providing one of the following:

Mask image – A mask image is a binary image in which the 0-value pixels represent the area you want to affect and the 255-value pixels represent the area that should remain unchanged.
Mask prompt – A mask prompt is a natural language text description of the elements you want to affect, that uses an in-house text-to-segmentation model.

For more information, refer to Prompt Engineering Guidelines.

Scripts that apply an edit to an image follow this implementation pattern:

Load the image to be edited from disk.
Convert the image to a base64-encoded string.
Configure the mask through one of the following methods:
1. Load a mask image from disk, encoding it as base64 and setting it as the maskImage parameter.
2. Set the maskText parameter to a text description of the elements to affect.
Specify the new content to be generated using one of the following options:
1. To add or replace an element, set the text parameter to a description of the new content.
2. To remove an element, omit the text parameter completely.
Use the BedrockRuntime client to invoke the Titan Image Generator model.
Parse and decode the response.
Save the resulting images to disk.

Object editing: Inpainting with a mask image

The following is a typical image editing script for the Titan Image Generator model using maskImage. We take one of the images generated earlier and provide a mask image, where 0-value pixels are rendered as black and 255-value pixels as white. We also replace one of the dogs in the image with a cat using a text prompt.

with open("<YOUR_MASK_IMAGE_FILE_PATH>", "rb") as image_file:
    mask_image = base64.b64encode(image_file.read()).decode("utf8")

# Import an input image like this (only PNG/JPEG supported):
with open("<YOUR_ORIGINAL_IMAGE_FILE_PATH>", "rb") as image_file:
    input_image = base64.b64encode(image_file.read()).decode("utf8")

# Inpainting
# inPaintingParams Options:
#   text: prompt to guide inpainting
#   negativeText: prompts to guide the model on what you don't want in image
#   image: base64 string representation of the input image
#   maskImage: base64 string representation of the input mask image
#   maskPrompt: prompt used for auto editing to generate mask

images = titan_image(
    {
        "taskType": "INPAINTING",
        "inPaintingParams": {
            "text": "a cat",  # Optional
            "negativeText": "bad quality, low res",  # Optional
            "image": input_image,  # Required
            "maskImage": mask_image,
        },
    },
    num_image=3,
)

This will produce images similar to the following.

Original Image	Mask Image	Edited Image

Object removal: Inpainting with a mask prompt

In another example, we use maskPrompt to specify an object in the image, taken from the earlier steps, to edit. By omitting the text prompt, the object will be removed:

# Import an input image like this (only PNG/JPEG supported):
with open("<YOUR_IMAGE_FILE_PATH>", "rb") as image_file:
    input_image = base64.b64encode(image_file.read()).decode("utf8")

images = titan_image(
    {
        "taskType": "INPAINTING",
        "inPaintingParams": {
            "negativeText": "bad quality, low res",  # Optional
            "image": input_image,  # Required
            "maskPrompt": "white dog",  # One of "maskImage" or "maskPrompt" is required
        },
    },
)

This will produce images similar to the following.

Original Image	Response Image

Background editing: Outpainting

Outpainting is useful when you want to replace the background of an image. You can also extend the bounds of an image for a zoom-out effect. In the following example script, we use maskPrompt to specify which object to keep; you can also use maskImage. The parameter outPaintingMode specifies whether to allow modification of the pixels inside the mask. If set as DEFAULT, pixels inside of the mask are allowed to be modified so that the reconstructed image will be consistent overall. This option is recommended if the maskImage provided doesn’t represent the object with pixel-level precision. If set as PRECISE, the modification of pixels inside of the mask is prevented. This option is recommended if using a maskPrompt or a maskImage that represents the object with pixel-level precision.

# Import an input image like this (only PNG/JPEG supported):
with open("<YOUR_IMAGE_FILE_PATH>", "rb") as image_file:
    input_image = base64.b64encode(image_file.read()).decode("utf8")

# OutPaintingParams Options:
#   text: prompt to guide outpainting
#   negativeText: prompts to guide the model on what you don't want in image
#   image: base64 string representation of the input image
#   maskImage: base64 string representation of the input mask image
#   maskPrompt: prompt used for auto editing to generate mask
#   outPaintingMode: DEFAULT | PRECISE
images = titan_image(
    {
        "taskType": "OUTPAINTING",
        "outPaintingParams": {
            "text": "forest",  # Required
            "image": input_image,  # Required
            "maskPrompt": "dogs",  # One of "maskImage" or "maskPrompt" is required
            "outPaintingMode": "PRECISE",  # One of "PRECISE" or "DEFAULT"
        },
    },
    num_image=3,
)

This will produce images similar to the following.

Original Image	Text	Response Image
	“beach”
	“forest”

In addition, the effects of different values for outPaintingMode, with a maskImage that doesn’t outline the object with pixel-level precision, are as follows.

Original Image	Mask Image	Text	outPaintingMode	Response Image
		“forest”	DEFAULT
		“forest”	PRECISE

This section has given you an overview of the operations you can perform with the Titan Image Generator model. Specifically, these scripts demonstrate text-to-image, image variation, inpainting, and outpainting tasks. You should be able to adapt the patterns for your own applications by referencing the parameter details for those task types detailed in Amazon Titan Image Generator documentation.

Multimodal embedding and searching

You can use the Amazon Titan Multimodal Embeddings model for enterprise tasks such as image search and similarity-based recommendation, and it has built-in mitigation that helps reduce bias in searching results. There are multiple embedding dimension sizes for best latency/accuracy trade-offs for different needs, and all can be customized with a simple API to adapt to your own data while persisting data security and privacy. Amazon Titan Multimodal Embeddings is provided as simple APIs for real-time or asynchronous batch transform searching and recommendation applications, and can be connected to different vector databases, including Amazon OpenSearch Service.

Helper functions

The following function converts an image, and optionally text, into multimodal embeddings:

def titan_multimodal_embedding(
    image_path: str = None,  # maximum 2048 x 2048 pixels
    description: str = None,  # English only and max input tokens 128
    dimension: int = 1024,  # 1,024 (default), 384, 256
    model_id: str = "amazon.titan-embed-image-v1",
):
    payload_body = {}
    embedding_config: dict = {"embeddingConfig": {"outputEmbeddingLength": dimension}}

    # You can specify either text or image or both
    if image_path:
        # Maximum image size supported is 2048 x 2048 pixels
        with open(image_path, "rb") as image_file:
            payload_body["inputImage"] = base64.b64encode(image_file.read()).decode(
                "utf8"
            )
    if description:
        payload_body["inputText"] = description

    assert payload_body, "please provide either an image and/or a text description"
    print("n".join(payload_body.keys()))

    response = bedrock_runtime_client.invoke_model(
        body=json.dumps({**payload_body, **embedding_config}),
        modelId=model_id,
        accept="application/json",
        contentType="application/json",
    )

    return json.loads(response.get("body").read())

The following function returns the top similar multimodal embeddings given a query multimodal embeddings. Note that in practice, you can use a managed vector database, such as OpenSearch Service. The following example is for illustration purposes:

from scipy.spatial.distance import cdist
import numpy as np

def search(query_emb: np.array, indexes: np.array, top_k: int = 1):
    dist = cdist(query_emb, indexes, metric="cosine")
    return dist.argsort(axis=-1)[0, :top_k], np.sort(dist, axis=-1)[:top_k]

Synthetic dataset

For illustration purposes, we use Anthropic’s Claude 2.1 model in Amazon Bedrock to randomly generate seven different products, each with three variants, using the following prompt:

Generate a list of 7 items description for an online e-commerce shop, each comes with 3 variants of color or type. All with separate full sentence description.

The following is the list of returned outputs:

1. T-shirt
- A red cotton t-shirt with a crew neck and short sleeves.
- A blue cotton t-shirt with a v-neck and short sleeves.
- A black polyester t-shirt with a scoop neck and cap sleeves.

2. Jeans
- Classic blue relaxed fit denim jeans with a mid-rise waist.
- Black skinny fit denim jeans with a high-rise waist and ripped details at the knees.
- Stonewash straight leg denim jeans with a standard waist and front pockets.

3. Sneakers
- White leather low-top sneakers with an almond toe cap and thick rubber outsole.
- Gray mesh high-top sneakers with neon green laces and a padded ankle collar.
- Tan suede mid-top sneakers with a round toe and ivory rubber cupsole.

4. Backpack
- A purple nylon backpack with padded shoulder straps, front zipper pocket and laptop sleeve.
- A gray canvas backpack with brown leather trims, side water bottle pockets and drawstring top closure.
- A black leather backpack with multiple interior pockets, top carry handle and adjustable padded straps.

5. Smartwatch
- A silver stainless steel smartwatch with heart rate monitor, GPS tracker and sleep analysis.
- A space gray aluminum smartwatch with step counter, phone notifications and calendar syncing.
- A rose gold smartwatch with activity tracking, music controls and customizable watch faces.

6. Coffee maker
- A 12-cup programmable coffee maker in brushed steel with removable water tank and keep warm plate.
- A compact 5-cup single serve coffee maker in matt black with travel mug auto-dispensing feature.
- A retro style stovetop percolator coffee pot in speckled enamel with stay-cool handle and glass knob lid.

7. Yoga mat
- A teal 4mm thick yoga mat made of natural tree rubber with moisture-wicking microfiber top.
- A purple 6mm thick yoga mat made of eco-friendly TPE material with integrated carrying strap.
- A patterned 5mm thick yoga mat made of PVC-free material with towel cover included.

Assign the above response to variable response_cat. Then we use the Titan Image Generator model to create product images for each item:

import re

def extract_text(input_string):
    pattern = r"- (.*?)($|n)"
    matches = re.findall(pattern, input_string)
    extracted_texts = [match[0] for match in matches]
    return extracted_texts

product_description = extract_text(response_cat)

titles = []
for prompt in product_description:
    images = titan_image(
        {
            "taskType": "TEXT_IMAGE",
            "textToImageParams": {
                "text": prompt,  # Required
            },
        },
        num_image=1,
    )
    title = "_".join(prompt.split()[:4]).lower()
    titles.append(title)
    images[0].save(f"{title}.png", format="png")

All the generated images can be found in the appendix at the end of this post.

Multimodal dataset indexing

Use the following code for multimodal dataset indexing:

multimodal_embeddings = []
for image_filename, description in zip(titles, product_description):
    embedding = titan_multimodal_embedding(f"{image_filename}.png", dimension=1024)["embedding"]
    multimodal_embeddings.append(embedding)

Multimodal searching

Use the following code for multimodal searching:

query_prompt = "<YOUR_QUERY_TEXT>"
query_embedding = titan_multimodal_embedding(description=query_prompt, dimension=1024)["embedding"]
# If searching via Image
# query_image_filename = "<YOUR_QUERY_IMAGE>"
# query_emb = titan_multimodal_embedding(image_path=query_image_filename, dimension=1024)["embedding"]
idx_returned, dist = search(np.array(query_embedding)[None], np.array(multimodal_embeddings))

The following are some search results.

Query	Results
“sneaker”
“white sneaker”
“leather backpack”
“purple backpack”

Conclusion

The post introduces the Amazon Titan Image Generator and Amazon Titan Multimodal Embeddings models. Titan Image Generator enables you to create custom, high-quality images from text prompts. Key features include iterating on prompts, automatic background editing, and data customization. It has safeguards like invisible watermarks to encourage responsible use. Titan Multimodal Embeddings converts text, images, or both into semantic vectors to power accurate search and recommendations. We then provided Python code samples for using these services, and demonstrated generating images from text prompts and iterating on those images; editing existing images by adding, removing, or replacing elements specified by mask images or mask text; creating multimodal embeddings from text, images, or both; and searching for similar multimodal embeddings to a query. We also demonstrated using a synthetic e-commerce dataset indexed and searched using Titan Multimodal Embeddings. The aim of this post is to enable developers to start using these new AI services in their applications. The code patterns can serve as templates for custom implementations.

All the code is available on the GitHub repository. For more information, refer to the Amazon Bedrock User Guide.

About the Authors

Rohit Mittal is a Principal Product Manager at Amazon AI building multi-modal foundation models. He recently led the launch of Amazon Titan Image Generator model as part of Amazon Bedrock service. Experienced in AI/ML, NLP, and Search, he is interested in building products that solves customer pain points with innovative technology.

Dr. Ashwin Swaminathan is a Computer Vision and Machine Learning researcher, engineer, and manager with 12+ years of industry experience and 5+ years of academic research experience. Strong fundamentals and proven ability to quickly gain knowledge and contribute to newer and emerging areas.

Dr. Yusheng Xie is a Principal Applied Scientist at Amazon AGI. His work focuses building multi-modal foundation models. Before joining AGI, he was leading various multi-modal AI development at AWS such as Amazon Titan Image Generator and Amazon Textract Queries.

Dr. Hao Yang is a Principal Applied Scientist at Amazon. His main research interests are object detection and learning with limited annotations. Outside work, Hao enjoys watching films, photography, and outdoor activities.

Dr. Davide Modolo is an Applied Science Manager at Amazon AGI, working on building large multimodal foundational models. Before joining Amazon AGI, he was a manager/lead for 7 years in AWS AI Labs (Amazon Bedrock and Amazon Rekognition). Outside of work, he enjoys traveling and playing any kind of sport, especially soccer.

Dr. Baichuan Sun, is currently serving as a Sr. AI/ML Solutions Architect at AWS, focusing on generative AI and applies his knowledge in data science and machine learning to provide practical, cloud-based business solutions. With experience in management consulting and AI solution architecture, he addresses a range of complex challenges, including robotics computer vision, time series forecasting, and predictive maintenance, among others. His work is grounded in a solid background of project management, software R&D, and academic pursuits. Outside of work, Dr. Sun enjoys the balance of traveling and spending time with family and friends.

Dr. Kai Zhu currently works as Cloud Support Engineer at AWS, helping customers with issues in AI/ML related services like SageMaker, Bedrock, etc. He is a SageMaker Subject Matter Expert. Experienced in data science and data engineering, he is interested in building generative AI powered projects.

Kris Schultz has spent over 25 years bringing engaging user experiences to life by combining emerging technologies with world class design. In his role as Senior Product Manager, Kris helps design and build AWS services to power Media & Entertainment, Gaming, and Spatial Computing.

Appendix

In the following sections, we demonstrate challenging sample use cases like text insertion, hands, and reflections to highlight the capabilities of the Titan Image Generator model. We also include the sample output images produced in earlier examples.

Text

The Titan Image Generator model excels at complex workflows like inserting readable text into images. This example demonstrates Titan’s ability to clearly render uppercase and lowercase letters in a consistent style within an image.

a corgi wearing a baseball cap with text “genai”	a happy boy giving a thumbs up, wearing a tshirt with text “generative AI”

Hands

The Titan Image Generator model also has the ability to generate detailed AI images. The image shows realistic hands and fingers with visible detail, going beyond more basic AI image generation that may lack such specificity. In the following examples, notice the precise depiction of the pose and anatomy.

a person’s hand viewed from above	a close look at a person’s hands holding a coffee mug

Mirror

The images generated by the Titan Image Generator model spatially arrange objects and accurately reflect mirror effects, as demonstrated in the following examples.

A cute fluffy white cat stands on its hind legs, peering curiously into an ornate golden mirror. In the reflection the cat sees itself	beautiful sky lake with reflections on the water

Synthetic product images

The following are the product images generated earlier in this post for the Titan Multimodal Embeddings model.