Intelligently search Adobe Experience Manager content using Amazon Kendra

September 6, 2023

by Praveen Edem Amazon AWS

Amazon Kendra is an intelligent search service powered by machine learning (ML). With Amazon Kendra, you can easily aggregate content from a variety of content repositories into an index that lets you quickly search all your enterprise data and find the most accurate answer. Adobe Experience Manager (AEM) is a content management system that’s used for creating website or mobile app content. Many organizations use Adobe Experience Manager (On-Premise) or Adobe Experience Manager (Cloud Service) as their content management platform. Enterprise users need to search easily and securely for accurate answers across content from multiple enterprise data sources, including AEM assets and pages.

Amazon Kendra customers can now use the Amazon Kendra AEM connector to index pages and assets from AEM. Amazon Kendra supports AEM as a Cloud Service author instances and AEM On-Premise author and publish instances. You can index AEM content and filter the types of content you want to index with the Amazon Kendra AEM On-Premise or Cloud Service connector, and search your data from AEM with Amazon Kendra intelligent search.

This post shows you how to configure the Amazon Kendra AEM connector to index your content and search your AEM assets and pages. The connector also ingests the access control list (ACL) information for each document. The ACL information is used to show search results filtered by what a user has access to.

Solution overview

In our solution, we configure AEM as a data source for an Amazon Kendra search index using the Amazon Kendra AEM connector. Based on the configuration, when the data source is synchronized, the connector crawls and indexes all the content from AEM that was created on or before a specific date. The connector also indexes the access control list (ACL) information for each document. When access control or user context filtering is enabled, the search results of a query made by a user include results only from those documents that the user is authorized to read.

The Amazon Kendra AEM connector can integrate with AWS IAM Identity Center (Successor to AWS Single Sign-On). You first must enable IAM Identity Center and create an organization to sync users and groups from your active directory. The connector will use the user name and group lookup for the user context of the search queries.

Prerequisites

To try out the Amazon Kendra connector for AEM using this post as a reference, you need the following:

An AWS account with privileges to create AWS Identity and Access Management (IAM) roles and policies. For more information, see Overview of access management: Permissions and policies.
Basic knowledge of AWS and working knowledge of AEM. For more information, refer to User administration and Security.
AEM On-Premise set up (Version 6.5 and above). We store the admin user name and password in AWS Secrets Manager.

Set up OAuth2.0

If you are using AEM On-Premise, set up OAuth 2.0 to generate an SSL certificate in order to complete the configuration of the Amazon Kendra AEM connector.

The Adobe Granite OAuth 2.0 server implementation (com.adobe.granite.oauth.server) provides the support for OAuth 2.0 server functionalities in AEM.

Enable the OAuth Server authentication handler

By default, AEM won’t enable the OAuth Server authentication handler. To enable it, complete the following steps:

Start the AEM local instance and go to http://localhost:<port>/system/console/configMgr/com.adobe.granite.oauth.server.auth.impl.OAuth2ServerAuthenticationHandler
Change the jaas.ranking.name value to 1100 in the Adobe Granite OAuth Server Authentication Handler section and save the configuration.

The OAuth Server authentication handler is now enabled.

Register the OAuth client

Every external application that requires OAuth authentication must be registered as an OAuth client in AEM. To register the OAuth client, complete the following steps:

On the AEM start page, choose Security and OAuth client.
Enter a name and redirect URI.
Choose Save.

After a successful authorization of an application, the OAuth server will redirect you back to the application with an authorization code to the configured redirect URL.

Copy the client ID and client secret and keep them safe.

The Granite OAuth Server supports the following grant types:

Authorization code
Refresh token
JWT bearer token

For this post, we use OAuth2.0 with the JWT grant type.

The JWT bearer token is mainly used for server-to-server integration. This will help us enable the server-to-server integration without the resource owner interaction; for example, to retrieve or upload files without user interaction.

Generate the JWT token

Complete the following steps to generate the JWT token:

Navigate to localhost and the OAuth client.
Choose Download Private Key.
Choose Download.

Generate the public certificate

Now generate the public certificate from the downloaded private key. Run the following command and enter the private key password when prompted.

Use the openssl command to generate the public certificate:

openssl pkcs12 -in store.p12 -out store.crt.pem -clcerts -nokeys

Extract the private key:

openssl pkcs12 -in store.p12 -passin pass:notasecret -nocerts -nodes -out store.private.key.txt

Make sure to install openssl and add it to the environment path beforehand.

Before using the private key while configuring the Amazon Kendra data source, make sure not to copy the “-----BEGIN PRIVATE KEY-----” and “-----END PRIVATE KEY-----” lines. Additionally, remove any whitespace, including line breaks, from the private key.
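The cleanup described above can be scripted. The following is a minimal sketch, assuming the key was exported as PEM text (for example, the store.private.key.txt file from the previous command). The function name strip_pem_armor is hypothetical and is not part of any AWS or Adobe tooling.

```python
def strip_pem_armor(pem_text: str) -> str:
    """Return the base64 body of a PEM private key as one unbroken string.

    Only lines between the "-----BEGIN ...-----" and "-----END ...-----"
    markers are kept, so extra header lines (such as the "Bag Attributes"
    lines that openssl pkcs12 may write) are skipped. All whitespace,
    including line breaks, is removed from the kept lines.
    """
    body_parts = []
    inside_key = False
    for line in pem_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("-----BEGIN") and stripped.endswith("-----"):
            inside_key = True
            continue
        if stripped.startswith("-----END") and stripped.endswith("-----"):
            inside_key = False
            continue
        if inside_key:
            # Remove any spaces or tabs inside the line as well.
            body_parts.append("".join(stripped.split()))
    return "".join(body_parts)


if __name__ == "__main__":
    # Dummy text that only mimics the layout of a PEM file; not a real key.
    sample = (
        "Bag Attributes\n"
        "-----BEGIN PRIVATE KEY-----\n"
        "QUJD REVG\n"
        "R0hJ\n"
        "-----END PRIVATE KEY-----\n"
    )
    print(strip_pem_armor(sample))  # prints QUJDREVGR0hJ
```

Paste the resulting single-line string into the private key field of the secret. Treat it like any other credential and avoid writing it to logs.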

Use the generated ClientId, ClientSecret, and private key to configure the Amazon Kendra AEM data source.

For OAuth client registration, navigate to http://localhost:<port>/libs/granite/oauth/content/clients.html.

Set up SSL

Complete the following steps to set up SSL:

Create the key:

openssl genrsa -aes256 -out <keyFileName>.key 4096

Create a certificate signing request (CSR) for the key:

openssl req -sha256 -new -key <keyFileName>.key -out <keyFileName>.csr -subj '/CN=<keyFileName>'

Create a self-signed certificate from the CSR:

openssl x509 -req -days 365 -in <keyFileName>.csr -signkey <keyFileName>.key -out <keyFileName>.crt

Encode the private key in DER format:

openssl pkcs8 -topk8 -inform PEM -outform DER -in <keyFileName>.key -out <keyFileName>.der -nocrypt

Four files will be generated with file names starting with <keyFileName>. We use <keyFileName>.crt and <keyFileName>.der in later steps.

Next, log in to AEM at http://localhost:<port>/aem/start.html.
Choose Tools, Security, and SSL Configuration.
In the Store Credentials section, enter the key store and trust store password.

In the Keys and Certificate section, specify the .der file for Private Key and the .crt file for Certificate.

In the next section, enter the domain (localhost), and leave the port as is.
Choose Done.

AEM will open in the specified new port. For example, https://localhost:8443.

Log in to AEM using HTTPS, select the padlock icon in the browser’s address bar to export the certificate, and save it as privateKey.crt.

Now, let’s import the certificate into the keystore path using the key tool.

Open a terminal and go to the folder location where privateKey.crt is present and run the following command:

keytool -import -trustcacerts -keystore <JAVA_HOME>/lib/security/cacerts -storepass changeit -noprompt -alias yourAliasName -file privateKey.crt

Be sure to open ports 8443 and 80 in your firewall settings.

Add the certificate privateKey.crt to an Amazon Simple Storage Service (Amazon S3) bucket.

Configure the data source using the Amazon Kendra connector for AEM

You can use an existing index or create a new index to index documents from AEM using the AEM connector. Then complete the following steps. For more information, refer to the Amazon Kendra Developer Guide.

On the Amazon Kendra console, open your index and choose Data sources in the navigation pane.
Choose Add data source.
Under Adobe Experience Manager, choose Add connector.

In the Specify data source details section, enter a name and optionally a description, then choose Next.

In the Define access and security section, select either the AEM On-Premise or AEM as a Cloud Service source type and enter the AEM host URL. You can find the URL in your AEM settings.

If using AEM On-Premise, enter the host URL of the AEM On-Premise server. Then choose Browse S3 and choose the S3 bucket with the SSL certificate.

If using AEM as a Cloud Service, you can use the author URL https://author-xxxxxx-xxxxxxx.adobeaemcloud.com.

Under Authentication, you have two options, Basic authentication and OAuth 2.0 authentication.

If you select Basic authentication, for AWS Secrets Manager secret, choose Create and add a new secret. Then enter a name for the secret, the AEM site user name, and password. The user must have admin permission or be an admin user.

If you select OAuth 2.0 authentication, for AWS Secrets Manager secret, choose Create and add a new secret. Enter a name for the secret, client ID, client secret, and private key. If you use AEM as a Cloud Service, enter a name for the secret, client ID, client secret, private key, organization ID, technical account ID, and Adobe Identity Management System (IMS) host.

Choose Save or Add Secret.
In the Configure VPC and security group section, you can optionally choose to use a VPC. If so, you must add subnets and VPC security groups.
In the Identity crawler section, choose to crawl identity information on users and groups with access to certain documents and store this in the Amazon Kendra principal or identity store.

This is useful for filtering search results based on the user or their group access to documents.

In the IAM section, create a new IAM role or choose an existing IAM role to access repository credentials and index content.
Choose Next.

In the Configure sync settings section, provide information about your sync scope.

You can include the files to be crawled using inclusion patterns or exclude them using exclusion patterns. When you provide a pattern in the Include patterns section, only documents matching that pattern will be crawled. When you provide a pattern in the Exclude patterns section, documents matching that pattern will not be crawled.
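The effect of these patterns can be illustrated with a few lines of Python. This is a sketch of the semantics only, not the connector's implementation. It assumes that patterns are regular expressions matched anywhere in the path, that an empty include list means everything is included, and that an exclusion wins when both lists match. The function name is_crawled and the sample paths are hypothetical.

```python
import re
from typing import Iterable


def is_crawled(path: str,
               include_patterns: Iterable[str] = (),
               exclude_patterns: Iterable[str] = ()) -> bool:
    """Decide whether one document path would be crawled.

    Assumptions (not taken from the connector's source code):
    - patterns are regular expressions matched anywhere in the path;
    - an empty include list means "include everything";
    - when both an include and an exclude pattern match, the exclusion wins.
    """
    include_patterns = list(include_patterns)
    if include_patterns and not any(re.search(p, path) for p in include_patterns):
        return False
    return not any(re.search(p, path) for p in exclude_patterns)


if __name__ == "__main__":
    include = [r"^/content/site/"]
    exclude = [r"/drafts/"]
    for path in (
        "/content/site/en/home",         # matches include only -> True
        "/content/dam/site/report.pdf",  # matches no include pattern -> False
        "/content/site/en/drafts/wip",   # matches include and exclude -> False
    ):
        print(path, is_crawled(path, include, exclude))
```

Checking a handful of representative paths this way before the first sync can catch a pattern that is too broad or too narrow.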

If you use AEM On-Premise and the time zone of your server is different than the time zone of the Amazon Kendra AEM connector or index, you can specify the server time zone to align with the AEM connector or index in the Timezone ID section.

The default time zone for AEM On-Premise is the time zone of the Amazon Kendra AEM connector or index. The default time zone for AEM as a Cloud Service is Greenwich Mean Time.

Choose the Sync mode (for this post, select Full sync).

With the Full sync option, every time the sync runs, Amazon Kendra will crawl all documents and ingest each document even if ingested earlier. The full refresh enables you to reset your Amazon Kendra index without the need to delete and create a new data source. If you choose New or modified content sync or New, modified, or deleted content sync, every time the sync job runs, it will process only objects added, modified, or deleted since the last crawl. Incremental crawls can help reduce runtime and cost when used with datasets that append new objects to existing data sources on a regular basis.

For Sync run schedule, choose Run on demand.
Choose Next.

In the Set field mappings section, you can optionally select from the Amazon Kendra generated default data source fields you want to map to your index. To add custom data source fields, choose Add Field to create an index field name to map to and the field data type. Specify the AEM field name, index field name, and data type.

Choose Next.

Review your settings and choose Add data source.

After the data source is added, choose Data sources in the navigation pane, select the newly added data source, and choose Sync now to start data source synchronization with the Amazon Kendra index.

The duration of the sync process depends on the amount of data to be crawled.

Now let’s enable access control for the Amazon Kendra index.

In the navigation pane, choose your index.
On the User access control tab, choose Edit settings.

Change the settings to look like the following screenshot.
Choose Next.

Choose Update.

Wait a few minutes for the index to get updated by the changes. Now let’s see how you can perform intelligent search with Amazon Kendra.

Perform intelligent search with Amazon Kendra

Before you try searching on the Amazon Kendra console or using the API, make sure that the data source sync is complete. To check, view the data sources and verify if the last sync was successful.

Now we’re ready to search our index.

On the Amazon Kendra console, navigate to the index and choose Search indexed content in the navigation pane.
Let’s query the index using “What was the impact of Siberian heat wave?” without providing an access token.

Based on our access control settings in the index, a valid access token is needed to access content the user is allowed to see; therefore, when we use this search query without setting any user name or group, no results are returned.

Next, choose Apply Token and set the user name or user email ID (for example, user-dev@company.com) that has access to AEM content.

While crawling the AEM data source, the connector sets the user email ID as the principal. If the user’s email ID is not available, the user name is set as the principal.

The following screenshot shows an example with the user email ID user-dev-2@amazon.com set as principal.

The following example uses user name user-dev-2 set as principal.

Now, let’s try to search the same content with the token of user user-dev@amazon.com, who is not authorized to view this specific document that appeared in the preceding query results.

This confirms that documents ingested by the Amazon Kendra connector for AEM honor the ACLs set by and within AEM, and that these same ACLs are enforced on the search results based on the applied token.
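The same user context filtering can be exercised through the Amazon Kendra Query API instead of the console. The following is a minimal sketch that only builds the request. The helper name build_query_request and the index ID are placeholders, and the user ID is the example principal used earlier in this post. Sending the request requires boto3 and AWS credentials, so that part is shown as comments.

```python
from typing import Optional, Sequence


def build_query_request(index_id: str,
                        query_text: str,
                        user_id: Optional[str] = None,
                        groups: Sequence[str] = ()) -> dict:
    """Build keyword arguments for the Amazon Kendra Query API.

    UserContext is added only when a user ID or at least one group is given.
    With access control enabled on the index, a query without user context
    is expected to return no ACL-protected documents.
    """
    request = {"IndexId": index_id, "QueryText": query_text}
    user_context = {}
    if user_id:
        user_context["UserId"] = user_id
    if groups:
        user_context["Groups"] = list(groups)
    if user_context:
        request["UserContext"] = user_context
    return request


if __name__ == "__main__":
    request = build_query_request(
        index_id="<your-index-id>",  # placeholder, replace with your index ID
        query_text="What was the impact of Siberian heat wave?",
        user_id="user-dev-2@amazon.com",
    )
    print(request)
    # To send the query (requires boto3 and AWS credentials):
    #   import boto3
    #   response = boto3.client("kendra").query(**request)
    #   for item in response["ResultItems"]:
    #       print(item["DocumentTitle"]["Text"])
```

Running the same query with two different user IDs is a quick way to confirm that each user sees only the documents their AEM ACLs allow.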

Clean up

To avoid incurring future costs, clean up the resources you created as part of this solution. If you created a new Amazon Kendra index while testing this solution, delete it. If you only added a new data source using the Amazon Kendra connector for AEM, delete that data source.

Conclusion

With the Amazon Kendra Adobe Experience Manager connector, your organization can search pages and assets securely using intelligent search powered by Amazon Kendra.

To learn more about the Amazon Kendra connector for AEM, refer to Adobe Experience Manager.

For more information on other Amazon Kendra built-in connectors to popular data sources, refer to Amazon Kendra native connectors.

About the Authors

Praveen Edem is a Senior Solutions Architect at Amazon Web Services. He works with major financial services customers, architecting and modernizing their critical large-scale applications while adopting AWS services. He specializes in serverless and container-based workloads. He has over 20 years of IT experience in application development and software architecture.

Manjula Nagineni is a Senior Solutions Architect with AWS based in New York. She works with major financial service institutions, architecting and modernizing their large-scale applications while adopting AWS Cloud services. She is passionate about designing big data workloads cloud-natively. She has over 20 years of IT experience in software development, analytics, and architecture across multiple domains such as finance, manufacturing, and telecom.

Omkar Phadtare is a Software Development Engineer at Amazon Web Services, with a deep-rooted passion for cloud computing. Leveraging his technical expertise and strong understanding of the domain, he designs, develops, and implements cutting-edge, highly scalable, and resilient cloud-based solutions for a diverse range of modern businesses and organizations.

Vijai Gandikota is a Senior Product Manager for Amazon Kendra at Amazon Web Services, responsible for launching Amazon Kendra connectors, Principal Store, Search Analytics Dashboard, and other features of Amazon Kendra. He has over 20 years of experience in designing, developing, and launching products in AI and analytics.

Fine-tune Llama 2 for text generation on Amazon SageMaker JumpStart

September 6, 2023

by Vivek Madan Amazon AWS

Today, we are excited to announce the capability to fine-tune Llama 2 models by Meta using Amazon SageMaker JumpStart. The Llama 2 family of large language models (LLMs) is a collection of pre-trained and fine-tuned generative text models ranging in scale from 7 billion to 70 billion parameters. Fine-tuned LLMs, called Llama-2-chat, are optimized for dialogue use cases. You can easily try out these models and use them with SageMaker JumpStart, which is a machine learning (ML) hub that provides access to algorithms, models, and ML solutions so you can quickly get started with ML. Now you can also fine-tune the 7 billion, 13 billion, and 70 billion parameter Llama 2 text generation models on SageMaker JumpStart using the Amazon SageMaker Studio UI with a few clicks or using the SageMaker Python SDK.

Generative AI foundation models have been the focus of most of the ML and artificial intelligence research and use cases for over a year now. These foundation models perform very well with generative tasks, such as text generation, summarization, question answering, image and video generation, and more, because of their large size and also because they are trained on several large datasets and hundreds of tasks. Despite the great generalization capabilities of these models, there are often use cases that have very specific domain data (such as healthcare or financial services), because of which these models may not be able to provide good results for these use cases. This results in a need for further fine-tuning of these generative AI models over the use case-specific and domain-specific data.

In this post, we walk through how to fine-tune Llama 2 pre-trained text generation models via SageMaker JumpStart.

What is Llama 2

Llama 2 is an auto-regressive language model that uses an optimized transformer architecture. Llama 2 is intended for commercial and research use in English. It comes in a range of parameter sizes—7 billion, 13 billion, and 70 billion—as well as pre-trained and fine-tuned variations. According to Meta, the tuned versions use supervised fine-tuning (SFT) and reinforcement learning with human feedback (RLHF) to align to human preferences for helpfulness and safety. Llama 2 was pre-trained on 2 trillion tokens of data from publicly available sources. The tuned models are intended for assistant-like chat, whereas pre-trained models can be adapted for a variety of natural language generation tasks. Regardless of which version of the model a developer uses, the responsible use guide from Meta can assist in guiding additional fine-tuning that may be necessary to customize and optimize the models with appropriate safety mitigations.

Currently, Llama 2 is available in the following regions:

Deploy pre-trained model available: "us-west-2", "us-east-1", "us-east-2", "eu-west-1", "ap-southeast-1", "ap-southeast-2"
Fine-tune and deploy the fine-tuned model: "us-east-1", "us-west-2", "eu-west-1"

What is SageMaker JumpStart

With SageMaker JumpStart, ML practitioners can choose from a broad selection of publicly available foundation models. ML practitioners can deploy foundation models to dedicated Amazon SageMaker instances from a network isolated environment and customize models using SageMaker for model training and deployment. You can now discover and deploy Llama 2 with a few clicks in SageMaker Studio or programmatically through the SageMaker Python SDK, enabling you to derive model performance and MLOps controls with SageMaker features such as Amazon SageMaker Pipelines, Amazon SageMaker Debugger, or container logs. The model is deployed in an AWS secure environment and under your VPC controls, helping ensure data security. In addition, you can fine-tune Llama 2 7B, 13B, and 70B pre-trained text generation models via SageMaker JumpStart.

Fine-tune Llama 2 models

You can fine-tune the models using either the SageMaker Studio UI or SageMaker Python SDK. We discuss both methods in this section.

No-code fine-tuning via the SageMaker Studio UI

In SageMaker Studio, you can access Llama 2 models via SageMaker JumpStart under Models, notebooks, and solutions, as shown in the following screenshot.

If you don’t see Llama 2 models, update your SageMaker Studio version by shutting down and restarting. For more information about version updates, refer to Shut down and Update Studio Apps.

You can also find the other four model variants by choosing Explore all Text Generation Models or searching for llama in the search box.

On this page, you can point to the Amazon Simple Storage Service (Amazon S3) bucket containing the training and validation datasets for fine-tuning. In addition, you can configure deployment configuration, hyperparameters, and security settings for fine-tuning. You can then choose Train to start the training job on a SageMaker ML instance. The preceding screenshot shows the fine-tuning page for the Llama 2 7B model; however, you can fine-tune the 13B and 70B Llama 2 text generation models using their respective model pages similarly. To use Llama 2 models, you need to accept the End User License Agreement (EULA). It will show up when you choose Train, as shown in the following screenshot. Choose I have read and accept EULA and AUP to start the fine-tuning job.

Deploy the model

After the model is fine-tuned, you can deploy it using the model page on SageMaker JumpStart. The option to deploy the fine-tuned model will appear when fine-tuning is finished, as shown in the following screenshot.

Fine-tune via the SageMaker Python SDK

You can also fine-tune Llama 2 models using the SageMaker Python SDK. The following is sample code to fine-tune Llama 2 7B on your dataset:

from sagemaker.jumpstart.estimator import JumpStartEstimator

# To fine-tune the 13B/70B model, change model_id to
# `meta-textgeneration-llama-2-13b`/`meta-textgeneration-llama-2-70b`.
model_id = "meta-textgeneration-llama-2-7b"

# S3 URI of the folder that holds your training data.
# This is a placeholder; replace it with your own location.
train_data_location = "s3://<your-bucket>/<your-prefix>/"

# Setting accept_eula to "true" accepts the Llama 2 EULA on your behalf.
estimator = JumpStartEstimator(
    model_id=model_id, environment={"accept_eula": "true"}
)
# Instruction tuning is off by default. Set instruction_tuned="True"
# to train on a dataset in the instruction tuning format.
estimator.set_hyperparameters(instruction_tuned="True", epoch="5")
estimator.fit({"training": train_data_location})

You can deploy the fine-tuned model directly from the estimator:

finetuned_predictor = estimator.deploy()

You can also find the code in Fine-tune LLaMA 2 models on SageMaker JumpStart. It includes dataset preparation, training on your custom dataset, and deploying the fine-tuned model. It demonstrates fine-tuning on a subset of the Dolly dataset with examples from the summarization task. The following is an example input with responses from the fine-tuned and non-fine-tuned models, along with the ground truth response:

Input to the model:

Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.\n\n### Instruction:\nWhen did Felix Luna die?\n\n### Input:\nFélix César Luna (30 September 1925 – 5 November 2009) was an Argentine writer, lyricist and historian.\n\n\n\n### Response:\n

Ground Truth response:

Felix Luna died on November 5th, 2009

Response from the non-fine-tuned model:

Félix César Luna (30 September 1925 – 5 November 2009) was an Argentine\n\nWhen did Luna die?\n\n\n### Explanation:\nWe answer the question with the input's date of birth and the date of death.\n\n\n### Solution: 1102\n

Response from the fine-tuned model:

Félix Luna died on November 5th, 2009.\n\n

For performance benchmarking of different models on the Dolly and Dialogsum dataset, refer to the Performance benchmarking section in the appendix at the end of this post.

Fine-tuning technique

Language models such as Llama are more than 10 GB or even 100 GB in size. Fine-tuning such large models requires instances with significantly high CUDA memory. Furthermore, training these models can be very slow due to the size of the model. Therefore, for efficient fine-tuning, we use the following optimizations:

Low-Rank Adaptation (LoRA) – This is a type of parameter-efficient fine-tuning (PEFT) for large models. With LoRA, we freeze the whole model and add only a small set of adjustable parameters or layers to it. For instance, instead of training all 7 billion parameters for Llama 2 7B, we can fine-tune less than 1% of the parameters. This significantly reduces the memory requirement because we need to store gradients, optimizer states, and other training-related information for only that small fraction of the parameters. It also reduces training time and cost. For more details on this method, refer to LoRA: Low-Rank Adaptation of Large Language Models.
Int8 quantization – Even with optimizations such as LoRA, models such as Llama 2 70B are still too big to train. To decrease the memory footprint during training, we can use Int8 quantization. Quantization typically reduces the precision of the floating point data types. Although this decreases the memory required to store model weights, it can degrade performance due to loss of information. Int8 quantization uses only a quarter of the precision but doesn’t incur degradation of performance because it doesn’t simply drop the bits; it rounds the data from one type to another. To learn about Int8 quantization, refer to LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale.
Fully Sharded Data Parallel (FSDP) – This is a type of data-parallel training algorithm that shards the model’s parameters across data parallel workers and can optionally offload part of the training computation to the CPUs. Although the parameters are sharded across different GPUs, computation of each microbatch is local to the GPU worker. It shards parameters more uniformly and achieves optimized performance via communication and computation overlapping during training.
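To make the savings from LoRA and Int8 quantization concrete, the following back-of-the-envelope sketch counts the parameters a LoRA adapter adds and estimates the memory needed for the weights alone. It uses the published Llama 2 7B dimensions (hidden size 4096, 32 layers) and the default lora_r of 8. Applying adapters only to the query and value projections is an assumption made for illustration; the modules that the JumpStart training script actually adapts may differ. The function names are hypothetical.

```python
def lora_trainable_params(d_in: int, d_out: int, rank: int) -> int:
    """Parameters added by one LoRA adapter.

    LoRA learns two low-rank matrices, A (rank x d_in) and B (d_out x rank),
    next to a frozen weight matrix of shape (d_out x d_in).
    """
    return rank * (d_in + d_out)


def weight_memory_gib(num_params: float, bytes_per_param: int) -> float:
    """Memory needed to hold the weights alone, in GiB."""
    return num_params * bytes_per_param / 2**30


if __name__ == "__main__":
    hidden_size = 4096     # Llama 2 7B hidden dimension
    num_layers = 32        # Llama 2 7B transformer layers
    adapted_per_layer = 2  # assumption: query and value projections only
    rank = 8               # default lora_r

    added = num_layers * adapted_per_layer * lora_trainable_params(
        hidden_size, hidden_size, rank
    )
    base = 7e9  # nominal parameter count of the 7B model
    print(f"LoRA parameters: {added:,} ({100 * added / base:.3f}% of 7B)")
    print(f"fp32 weights: {weight_memory_gib(base, 4):.1f} GiB")
    print(f"int8 weights: {weight_memory_gib(base, 1):.1f} GiB")
```

Under these assumptions the adapters add about 4.2 million parameters, roughly 0.06% of the base model, and storing the weights in 8 bits takes a quarter of the memory of 32-bit floats. Gradients, optimizer states, and activations come on top of these figures.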

The following table compares different methods with the three Llama 2 models.

Model	Default Instance Type	Supported Instance Types with Default configuration	Default Setting	LORA + FSDP	LORA + No FSDP	Int8 Quantization + LORA + No FSDP
Llama 2 7B	ml.g5.12xlarge	ml.g5.12xlarge, ml.g5.24xlarge, ml.g5.48xlarge	LORA + FSDP	Yes	Yes	Yes
Llama 2 13B	ml.g5.12xlarge	ml.g5.24xlarge, ml.g5.48xlarge	LORA + FSDP	Yes	Yes	Yes
Llama 2 70B	ml.g5.48xlarge	ml.g5.48xlarge	INT8 + LORA + NO FSDP	No	No	Yes

Note that fine-tuning of Llama models is based on scripts provided by the following GitHub repo.

Training dataset format

SageMaker JumpStart currently supports datasets in both domain adaptation format and instruction tuning format. In this section, we specify an example dataset in both formats. For more details, refer to the Dataset formatting section in the appendix.

Domain adaptation format

The text generation Llama 2 model can be fine-tuned on any domain-specific dataset. After it’s fine-tuned on the domain-specific dataset, the model is expected to generate domain-specific text and solve various NLP tasks in that specific domain with few-shot prompting. With this dataset, input consists of a CSV, JSON, or TXT file. For instance, input data may be SEC filings of Amazon as a text file:

This report includes estimates, projections, statements relating to our
business plans, objectives, and expected operating results that are “forward-
looking statements” within the meaning of the Private Securities Litigation
Reform Act of 1995, Section 27A of the Securities Act of 1933, and Section 21E
of the Securities Exchange Act of 1934. Forward-looking statements may appear
throughout this report, including the following sections: “Business” (Part I,
Item 1 of this Form 10-K), “Risk Factors” (Part I, Item 1A of this Form 10-K),
and “Management’s Discussion and Analysis of Financial Condition and Results
of Operations” (Part II, Item 7 of this Form 10-K). These forward-looking
statements generally are identified by the words “believe,” “project,”
“expect,” “anticipate,” “estimate,” “intend,” “strategy,” “future,”
“opportunity,” “plan,” “may,” “should,” “will,” “would,” “will be,” “will
continue,” “will likely result,” and similar expressions.

Instruction tuning format

In instruction fine-tuning, the model is fine-tuned for a set of natural language processing (NLP) tasks described using instructions. This helps improve the model’s performance for unseen tasks with zero-shot prompts. In instruction tuning dataset format, you specify the template.json file describing the input and the output formats. For instance, each line in the file train.jsonl looks like the following:

{"instruction": "What is a dispersive prism?", 
"context": "In optics, a dispersive prism is an optical prism that is used to disperse light, that is, to separate light into its spectral components (the colors of the rainbow). Different wavelengths (colors) of light will be deflected by the prism at different angles. This is a result of the prism material's index of refraction varying with wavelength (dispersion). Generally, longer wavelengths (red) undergo a smaller deviation than shorter wavelengths (blue). The dispersion of white light into colors by a prism led Sir Isaac Newton to conclude that white light consisted of a mixture of different colors.", 
"response": "A dispersive prism is an optical prism that disperses the light's different wavelengths at different angles. When white light is shined through a dispersive prism it will separate into the different colors of the rainbow."}

The additional file template.json looks like the following:

{
    "prompt": "Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.\n\n### Instruction:\n{instruction}\n\n### Input:\n{context}\n\n",
    "completion": " {response}"
}
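To see how a template and a training record combine, the following sketch fills the placeholders with Python's str.format. How the JumpStart training script performs the substitution internally is not shown in this post, so treat this as an illustration of the format rather than the actual implementation. The function name render_example is hypothetical, and the record text is shortened from the example above.

```python
import json


def render_example(template: dict, record: dict) -> tuple:
    """Fill the prompt and completion templates with fields from one record.

    Placeholders such as {instruction} are replaced by the record's values.
    Extra keys in the record are ignored; a missing key raises KeyError.
    """
    prompt = template["prompt"].format(**record)
    completion = template["completion"].format(**record)
    return prompt, completion


if __name__ == "__main__":
    template = {
        "prompt": (
            "Below is an instruction that describes a task, paired with an "
            "input that provides further context. Write a response that "
            "appropriately completes the request.\n\n"
            "### Instruction:\n{instruction}\n\n### Input:\n{context}\n\n"
        ),
        "completion": " {response}",
    }
    # One line of train.jsonl, shortened for readability.
    line = json.dumps({
        "instruction": "What is a dispersive prism?",
        "context": "In optics, a dispersive prism is an optical prism that is used to disperse light.",
        "response": "A dispersive prism is an optical prism that disperses light.",
    })
    prompt, completion = render_example(template, json.loads(line))
    print(prompt + completion)
```

Rendering a few records this way before training is a quick check that every placeholder in template.json has a matching key in train.jsonl.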

Supported hyperparameters for training

Llama 2 fine-tuning supports a number of hyperparameters, each of which can impact the memory requirement, training speed, and performance of the fine-tuned model:

epoch – The number of passes that the fine-tuning algorithm takes through the training dataset. Must be an integer greater than 1. Default is 5.
learning_rate – The rate at which the model weights are updated after working through each batch of training examples. Must be a positive float. Default is 1e-4.
instruction_tuned – Whether to instruction-train the model or not. Must be 'True' or 'False'. Default is 'False'.
per_device_train_batch_size – The batch size per GPU core/CPU for training. Must be a positive integer. Default is 4.
per_device_eval_batch_size – The batch size per GPU core/CPU for evaluation. Must be a positive integer. Default is 1.
max_train_samples – For debugging purposes or quicker training, truncate the number of training examples to this value. Value -1 means using all of the training samples. Must be a positive integer or -1. Default is -1.
max_val_samples – For debugging purposes or quicker training, truncate the number of validation examples to this value. Value -1 means using all of the validation samples. Must be a positive integer or -1. Default is -1.
max_input_length – Maximum total input sequence length after tokenization. Sequences longer than this will be truncated. If -1, max_input_length is set to the minimum of 1024 and the maximum model length defined by the tokenizer. If set to a positive value, max_input_length is set to the minimum of the provided value and the model_max_length defined by the tokenizer. Must be a positive integer or -1. Default is -1.
validation_split_ratio – If the validation channel is none, the ratio of the train-validation split from the training data. Must be between 0–1. Default is 0.2.
train_data_split_seed – If validation data is not present, this fixes the random splitting of the input training data to training and validation data used by the algorithm. Must be an integer. Default is 0.
preprocessing_num_workers – The number of processes to use for preprocessing. If None, the main process is used for preprocessing. Default is None.
lora_r – The LoRA rank (r). Must be a positive integer. Default is 8.
lora_alpha – The LoRA scaling factor (alpha). Must be a positive integer. Default is 32.
lora_dropout – The dropout probability for the LoRA layers. Must be a positive float between 0 and 1. Default is 0.05.
int8_quantization – If True, the model is loaded with 8-bit precision for training. Default for 7B and 13B is False. Default for 70B is True.
enable_fsdp – If True, training uses FSDP. Default for 7B and 13B is True. Default for 70B is False. Note that int8_quantization is not supported with FSDP.
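Two of the rules above are easy to get wrong: how `max_input_length` is resolved, and the incompatibility between `int8_quantization` and `enable_fsdp`. The following is a minimal sketch of both checks. The helper names are hypothetical, the hyperparameters are represented as plain Python values rather than the strings passed to a training job, and the tokenizer limit of 4096 used in the example calls is an assumed value.

```python
def resolve_max_input_length(max_input_length: int, model_max_length: int) -> int:
    """Apply the rule described in the max_input_length entry above."""
    if max_input_length == -1:
        return min(1024, model_max_length)
    if max_input_length <= 0:
        raise ValueError("max_input_length must be a positive integer or -1")
    return min(max_input_length, model_max_length)


def check_hyperparameters(hp: dict) -> list[str]:
    """Return a list of problems found; an empty list means none were detected."""
    problems = []
    # int8_quantization is not supported together with FSDP.
    if hp.get("int8_quantization") and hp.get("enable_fsdp"):
        problems.append("int8_quantization is not supported with FSDP")
    # These hyperparameters must be positive integers when provided.
    for name in (
        "per_device_train_batch_size",
        "per_device_eval_batch_size",
        "lora_r",
        "lora_alpha",
    ):
        value = hp.get(name, 1)
        if not (isinstance(value, int) and value > 0):
            problems.append(f"{name} must be a positive integer")
    # These accept a positive integer or -1 (use all samples).
    for name in ("max_train_samples", "max_val_samples"):
        value = hp.get(name, -1)
        if not (isinstance(value, int) and (value > 0 or value == -1)):
            problems.append(f"{name} must be a positive integer or -1")
    return problems


# Assumed tokenizer limit of 4096 for illustration only.
print(resolve_max_input_length(-1, 4096))
print(resolve_max_input_length(2048, 4096))
print(check_hyperparameters({"int8_quantization": True, "enable_fsdp": True}))
```

Running a check like this before launching a job can catch an invalid combination early, instead of after an instance has been provisioned.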

Instance types and compatible hyperparameters

The memory requirement during fine-tuning may vary based on several factors:

Model type – The 7B model has the least GPU memory requirement and 70B has the largest memory requirement
Max input length – A higher value of input length leads to processing more tokens at a time and as such requires more CUDA memory
Batch size – A larger batch size requires larger CUDA memory and therefore requires larger instance types
Int8 quantization – If using Int8 quantization, the model is loaded into low precision and therefore requires less CUDA memory

To help you get started, we provide a set of combinations of different instance types, hyperparameters, and model types that can be successfully fine-tuned. You can select a configuration as per your requirements and availability of instance types. We fine-tune all three models on a variety of settings with three epochs on a subset of the Dolly dataset with summarization examples.

7B model

The following table summarizes the fine-tuning options on the 7B model.

Instance Type	Max Input Len	Per Device Batch Size	Int8 Quantization	Enable FSDP	Time Taken (mins)
ml.g4dn.12xlarge	1024	8	TRUE	FALSE	166
ml.g4dn.12xlarge	2048	2	TRUE	FALSE	178
ml.g4dn.12xlarge	1024	4	FALSE	TRUE	120
ml.g4dn.12xlarge	2048	2	FALSE	TRUE	143
ml.g5.2xlarge	1024	4	TRUE	FALSE	61
ml.g5.2xlarge	2048	2	TRUE	FALSE	68
ml.g5.2xlarge	1024	4	FALSE	TRUE	43
ml.g5.2xlarge	2048	2	FALSE	TRUE	49
ml.g5.4xlarge	1024	4	FALSE	TRUE	39
ml.g5.4xlarge	2048	2	FALSE	TRUE	50
ml.g5.12xlarge	1024	16	TRUE	FALSE	57
ml.g5.12xlarge	2048	4	TRUE	FALSE	64
ml.g5.12xlarge	1024	4	FALSE	TRUE	26
ml.g5.12xlarge	2048	4	FALSE	TRUE	23
ml.g5.48xlarge	1024	16	TRUE	FALSE	59
ml.g5.48xlarge	2048	4	TRUE	FALSE	67
ml.g5.48xlarge	1024	8	FALSE	TRUE	22
ml.g5.48xlarge	2048	4	FALSE	TRUE	21

13B model

The following table summarizes the fine-tuning options on the 13B model.

Instance Type	Max Input Len	Per Device Batch Size	Int8 Quantization	Enable FSDP	Time Taken (mins)
ml.g4dn.12xlarge	1024	4	TRUE	FALSE	283
ml.g4dn.12xlarge	2048	2	TRUE	FALSE	328
ml.g5.12xlarge	1024	8	TRUE	FALSE	92
ml.g5.12xlarge	2048	4	TRUE	FALSE	104
ml.g5.48xlarge	1024	8	TRUE	FALSE	95
ml.g5.48xlarge	2048	4	TRUE	FALSE	107
ml.g5.48xlarge	1024	8	FALSE	TRUE	35
ml.g5.48xlarge	2048	2	FALSE	TRUE	41

70B model

The following table summarizes the fine-tuning options on the 70B model.

Instance Type	Max Input Len	Per Device Batch Size	Int8 Quantization	Enable FSDP	Time Taken (mins)
ml.g5.48xlarge	1024	4	TRUE	FALSE	396
ml.g5.48xlarge	2048	1	TRUE	FALSE	454
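One way to use these tables is to encode the tested configurations as data and pick the fastest one among the instance types you have available. The following sketch does this for the 13B and 70B rows above; the `Config` type and the `fastest` helper are hypothetical names introduced for illustration, and the timings are the ones reported in the tables.

```python
from typing import NamedTuple, Optional


class Config(NamedTuple):
    model: str
    instance_type: str
    max_input_len: int
    batch_size: int
    int8_quantization: bool
    enable_fsdp: bool
    minutes: int


# Rows copied from the 13B and 70B tables above.
CONFIGS = [
    Config("13B", "ml.g4dn.12xlarge", 1024, 4, True, False, 283),
    Config("13B", "ml.g4dn.12xlarge", 2048, 2, True, False, 328),
    Config("13B", "ml.g5.12xlarge", 1024, 8, True, False, 92),
    Config("13B", "ml.g5.12xlarge", 2048, 4, True, False, 104),
    Config("13B", "ml.g5.48xlarge", 1024, 8, True, False, 95),
    Config("13B", "ml.g5.48xlarge", 2048, 4, True, False, 107),
    Config("13B", "ml.g5.48xlarge", 1024, 8, False, True, 35),
    Config("13B", "ml.g5.48xlarge", 2048, 2, False, True, 41),
    Config("70B", "ml.g5.48xlarge", 1024, 4, True, False, 396),
    Config("70B", "ml.g5.48xlarge", 2048, 1, True, False, 454),
]


def fastest(model: str, max_input_len: int, available: set[str]) -> Optional[Config]:
    """Return the tested configuration with the shortest training time, if any."""
    candidates = [
        c
        for c in CONFIGS
        if c.model == model
        and c.max_input_len == max_input_len
        and c.instance_type in available
    ]
    return min(candidates, key=lambda c: c.minutes, default=None)


print(fastest("13B", 2048, {"ml.g5.12xlarge", "ml.g5.48xlarge"}))
```

The function returns `None` when no tested configuration matches, which signals that the combination was not covered by these benchmarks rather than that it cannot work.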

Recommendations on instance types and hyperparameters

Regarding the accuracy of the fine-tuned model, keep in mind the following:

Larger models such as 70B provide better performance than 7B
Performance without Int8 quantization is better than performance with Int8 quantization

Note the following training time and CUDA memory requirements:

Setting int8_quantization=True decreases the memory requirement and leads to faster training.
Decreasing per_device_train_batch_size and max_input_length reduces the memory requirement and therefore can be run on smaller instances. However, setting very low values may increase the training time.
If you’re not using Int8 quantization (int8_quantization=False), use FSDP (enable_fsdp=True) for faster and more efficient training.

When choosing the instance type, consider the following:

G5 instances provide the most efficient training among the instance types supported. Therefore, if you have G5 instances available, you should use them.
Training time largely depends on the number of GPUs and the amount of CUDA memory available. Therefore, training time on instances with the same number of GPUs (for example, ml.g5.2xlarge and ml.g5.4xlarge) is roughly the same, and you can use the cheaper instance for training (ml.g5.2xlarge).
When using p3 instances, training will be done with 32-bit precision because bfloat16 is not supported on these instances. Therefore, the training job will consume double the amount of CUDA memory when training on p3 instances compared to g5 instances.
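The effect of numeric precision on memory can be illustrated with simple arithmetic: the memory needed for the weights alone is the parameter count multiplied by the bytes per parameter. The following sketch assumes nominal parameter counts (7, 13, and 70 billion) and covers only the weights; activations, gradients, optimizer state, and LoRA adapters add to the total, so the real requirement is higher.

```python
# Bytes used to store one parameter at each precision.
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2, "int8": 1}


def weight_memory_gb(num_params_billion: float, dtype: str) -> float:
    """Memory for the model weights alone, in GB (10^9 bytes)."""
    return num_params_billion * BYTES_PER_PARAM[dtype]


for size in (7, 13, 70):
    estimates = {dtype: weight_memory_gb(size, dtype) for dtype in BYTES_PER_PARAM}
    print(f"{size}B: {estimates}")
```

For example, the 7B weights take about 28 GB in 32-bit precision, 14 GB in bfloat16, and 7 GB in Int8, which matches the doubling described for p3 instances and the savings from Int8 quantization.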

To learn about the cost of training per instance, refer to Amazon EC2 G5 Instances.

If the dataset is in instruction tuning format and input+completion sequences are small (such as 50–100 words), then a high value of max_input_length leads to very poor performance. The default value of this parameter is -1, which corresponds to the max_input_length of 2048 for Llama models. Therefore, if your dataset contains small samples, we recommend using a small value for max_input_length (such as 200–400).

Lastly, due to high demand for G5 instances, you may find these instances unavailable in your Region, with the error “CapacityError: Unable to provision requested ML compute capacity. Please retry using a different ML instance type.” If you experience this error, retry the training job or try a different Region.
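The retry advice can be sketched as a small loop that tries each Region in turn. This is an illustration under stated assumptions: `launch` is a hypothetical callable that starts the training job in the given Region and returns the job name, and it is assumed to surface capacity problems as a `RuntimeError` whose message contains "CapacityError". Adapt the exception handling to whatever your launch code actually raises.

```python
import time
from typing import Callable, Optional, Sequence


def run_with_fallback(
    launch: Callable[[str], str],
    regions: Sequence[str],
    retries_per_region: int = 2,
    wait_seconds: float = 0.0,
) -> str:
    """Try each Region in order, retrying on capacity errors; return the job name."""
    last_error: Optional[Exception] = None
    for region in regions:
        for _ in range(retries_per_region):
            try:
                return launch(region)
            except RuntimeError as err:
                # Assumption: capacity problems are reported as a RuntimeError
                # whose message contains "CapacityError". Anything else is re-raised.
                if "CapacityError" not in str(err):
                    raise
                last_error = err
                time.sleep(wait_seconds)
    raise RuntimeError(f"No capacity in any of {list(regions)}") from last_error
```

In practice you would set `wait_seconds` to a non-zero value so that retries in the same Region are spaced out.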

Issues when fine-tuning very large models

In this section, we discuss two issues when fine-tuning very large models.

Disable output compression

By default, the output of a training job is a trained model that is compressed in a .tar.gz format before it’s uploaded to Amazon S3. However, due to the large size of the model, this step can take a long time. For example, compressing and uploading the 70B model can take more than 4 hours. To avoid this issue, you can use the disable output compression feature supported by the SageMaker training platform. In this case, the model is uploaded without any compression, and the uncompressed artifacts are then used for deployment:

# Skip .tar.gz compression so the model artifacts upload to Amazon S3 as-is
estimator = JumpStartEstimator(
    model_id=model_id,
    environment={"accept_eula": "true"},
    disable_output_compression=True,
)

SageMaker Studio kernel timeout issue

Due to the size of the Llama 70B model, the training job may take several hours and the SageMaker Studio kernel may die during the training phase. However, during this time, training is still running in SageMaker. If this happens, you can still deploy the endpoint using the training job name with the following code:

from sagemaker.jumpstart.estimator import JumpStartEstimator
training_job_name = "<<<INSERT_TRAINING_JOB_NAME>>>"  # replace with your training job name

attached_estimator = JumpStartEstimator.attach(training_job_name, model_id)
attached_estimator.logs()
attached_estimator.deploy()

To find the training job name, navigate to the SageMaker console and under Training in the navigation pane, choose Training jobs. Identify the training job name and substitute it in the preceding code.
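As an alternative to the console, the job name can be looked up programmatically. The following sketch uses the `list_training_jobs` call of the boto3 SageMaker client; the helper name and the name fragment in the usage comment are hypothetical, and the client is passed in as an argument so the function can be exercised without AWS credentials.

```python
def latest_training_job_name(sm_client, name_contains: str) -> str:
    """Return the most recently created training job whose name contains the text."""
    response = sm_client.list_training_jobs(
        NameContains=name_contains,
        SortBy="CreationTime",
        SortOrder="Descending",
        MaxResults=1,
    )
    summaries = response["TrainingJobSummaries"]
    if not summaries:
        raise LookupError(f"No training job found matching {name_contains!r}")
    return summaries[0]["TrainingJobName"]


# Usage (requires AWS credentials; "llama-2-70b" is a hypothetical name fragment):
# import boto3
# training_job_name = latest_training_job_name(boto3.client("sagemaker"), "llama-2-70b")
```

The returned name can then be passed to `JumpStartEstimator.attach` as in the preceding code.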

Conclusion

In this post, we discussed fine-tuning Meta’s Llama 2 models using SageMaker JumpStart. We showed that you can use the SageMaker JumpStart console in SageMaker Studio or the SageMaker Python SDK to fine-tune and deploy these models. We also discussed the fine-tuning technique, instance types, and supported hyperparameters. In addition, we outlined recommendations for optimized training based on various tests we carried out. The results for fine-tuning the three models over two datasets are shown in the appendix at the end of this post. As we can see from these results, fine-tuning improves summarization compared to non-fine-tuned models. As a next step, you can try fine-tuning these models on your own dataset using the code provided in the GitHub repository to test and benchmark the results for your use cases.

The authors would like to acknowledge the technical contributions of Christopher Whitten, Xin Huang, Kyle Ulrich, Sifei Li, Amy You, Adam Kozdrowicz, Evan Kravitz, Benjamin Crabtree, Haotian An, Manan Shah, Tony Cruz, Ernev Sharma, Jonathan Guinegagne and June Won.

About the Authors

Dr. Vivek Madan is an Applied Scientist with the Amazon SageMaker JumpStart team. He got his PhD from University of Illinois at Urbana-Champaign and was a Post Doctoral Researcher at Georgia Tech. He is an active researcher in machine learning and algorithm design and has published papers in EMNLP, ICLR, COLT, FOCS, and SODA conferences.

Dr. Farooq Sabir is a Senior Artificial Intelligence and Machine Learning Specialist Solutions Architect at AWS. He holds PhD and MS degrees in Electrical Engineering from the University of Texas at Austin and an MS in Computer Science from Georgia Institute of Technology. He has over 15 years of work experience and also likes to teach and mentor college students. At AWS, he helps customers formulate and solve their business problems in data science, machine learning, computer vision, artificial intelligence, numerical optimization, and related domains. Based in Dallas, Texas, he and his family love to travel and go on long road trips.

Dr. Ashish Khetan is a Senior Applied Scientist with Amazon SageMaker JumpStart and helps develop machine learning algorithms. He got his PhD from University of Illinois Urbana-Champaign. He is an active researcher in machine learning and statistical inference, and has published many papers in NeurIPS, ICML, ICLR, JMLR, ACL, and EMNLP conferences.

Appendix

This appendix provides additional information about performance benchmarking and dataset formatting.

Performance benchmarking

In this section, we provide results for fine-tuning the three Llama 2 models (7B, 13B, and 70B) on two different datasets: Dolly and Dialogsum. For the Dolly dataset, our task is to summarize a paragraph of text, whereas for Dialogsum, we are fine-tuning the model to summarize a discussion between two people. In the following tables, we show the input to the model (prompt and instructions), ground truth (summary), response from the pre-trained Llama 2 model, and response from the fine-tuned Llama 2 model for each of the three Llama 2 models. We show inference results for five data points. As the following tables show, the summaries improve for both datasets when we fine-tune the models.

Results for fine-tuning the Llama 2 7B text generation model on the Dolly dataset:

Inputs	Ground Truth	Response from Non-Fine-Tuned Model	Response from Fine-Tuned Model
Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.\n\n### Instruction:\nWhere is Allauch\n\n### Input:\nAllauch (French pronunciation: ; Occitan: Alaug) is a French commune situated east of Marseille in the department of Bouches-du-Rhône in the Provence-Alpes-Côte d’Azur region of France.\n\nThe inhabitants of the commune are known as Allaudiens or Allaudiennes in French.\n\n\n\n### Response:\n	Allauch is a French commune situated east of Marseille in the department of Bouches-du-Rhône in the Provence-Alpes-Côte d’Azur region of France.\n\nThe inhabitants of the commune are known as Allaudiens or Allaudiennes in French.	Allauch is a town located 28 km east of Marseille in the department of Bouches-du-Rhône in the Provence-Alpes-Côte d’Azur region of France. It is considered to be one of the most beautiful villages in France. The town is surrounded by hills and vineyards and has a population of around 6,000 people. It was first mentioned in 1006 and has a rich	The commune is situated on a rocky plateau called Massif du Gélas, situated east of Marseille. It has a beautiful view of Marseille and its coastline. Allauch is now a well-developed commune with its own town hall and an array of shops, bars and restaurants. Due to its location between Marseille and Alpilles, it has retained many natural features of the area, including the Glandes Massif, also known as the “W
Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.\n\n### Instruction:\nWhat is an egg laying mammal?\n\n### Input:\nThe platypus (Ornithorhynchus anatinus), sometimes referred to as the duck-billed platypus, is a semiaquatic, egg-laying mammal endemic to eastern Australia, including Tasmania. The platypus is the sole living representative or monotypic taxon of its family (Ornithorhynchidae) and genus (Ornithorhynchus), though a number of related species appear in the fossil record.\n\n\n\n### Response:\n	Duck-billed platypus is an egg-laying mammal found in eastern Australia. It is the sole living representative in its genus Ornithorhynchus.	\nI did quite a bit of research on your question. I read all the information that you provided in your second paragraph. I also read a bit about platypus, including [this page](https://phys.org/news/2019-04-australia-platypus-predator-species.html). I think I found some answers to your question. Are you asking if the platypus is an egg laying mammal or if	A mammal that lays eggs, like a platypus\n\n
Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.\n\n### Instruction:\nWhen did Felix Luna die?\n\n### Input:\nFélix César Luna (30 September 1925 – 5 November 2009) was an Argentine writer, lyricist and historian.\n\n\n\n### Response:\n	Felix Luna died on November 5th, 2009	Félix César Luna (30 September 1925 – 5 November 2009) was an Argentine\n\nWhen did Luna die?\n\n\n### Explanation:\nWe answer the question with the input’s date of birth and the date of death.\n\n\n### Solution: 1102\n	Félix Luna died on November 5th, 2009.\n\n
Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.\n\n### Instruction:\nJot down some important points about optical illusion from the given passage.\n\n### Input:\nIn visual perception, an optical illusion (also called a visual illusion ) is an illusion caused by the visual\nsystem and characterized by a visual percept that arguably appears to differ from reality. Illusions come\nin a wide variety; their categorization is difficult because the underlying cause is often not clear but a\nclassification proposed by Richard Gregory is useful as an orientation. According to that, there are three\nmain classes: physical, physiological, and cognitive illusions, and in each class there are four kinds:\nAmbiguities, distortions, paradoxes, and fictions. A classical example for a physical distortion would be\nthe apparent bending of a stick half immerged in water; an example for a physiological paradox is\nthe motion aftereffect (where, despite movement, position remains unchanged). An example for a\nphysiological fiction is an afterimage. Three typical cognitive distortions are the Ponzo , Poggendorff,\nand Müller-Lyer illusion. Physical illusions are caused by the physical environment, e.g. by the optical\nproperties of water. Physiological illusions arise in the eye or the visual pathway, e.g. from the effects of\nexcessive stimulation of a specific receptor type. Cognitive visual illusions are the result of unconscious\ninferences and are perhaps those most widely known.\n\n\n\n### Response:\n	1. Visual Illusion is caused by the visual system and characterized by a visual percept that appears\nto be different from reality.\n2. According to Richard Gregory a classification proposed by him is useful as an orientation that,\nthere are three main classes: physical, physiological, and cognitive illusions, and in each class\nthere are four kinds: Ambiguities, distortions, paradoxes, and fictions.\n3. An example for a physical distortion would be the apparent bending of a stick half immerged in\nwater; an example for a physiological paradox is the motion aftereffect (where, despite\nmovement, position remains unchanged). \n4. An example for a physiological fiction is an afterimage\n5. Three cognitive distortions are the Ponzo , Poggendorff, and Müller-Lyer illusion.\n6. Physical illusions are caused by the physical environment, e.g. by the optical properties of\nwater. Physiological illusions arise in the eye or the visual pathway.\n7. The most widely know illusion is visual illusions which are the result of unconscious inferences.	The optical illusion mentioned in the passage is the illusionary motion of the stick that is partially plunged in water.\n\n## 50.\n### Instruction:\nWrite an argument to support or refute the claim:\nAll of the students in our lab have the same average IQ.\n\n### Input:\nMany students in the experiment have the same average IQ, and I believe that they are the 10% with the highest\n	\nOptical illusions make use of the physiological and psychological limitations related to the way that we interpret and respond to visual stimuli, both stimuli in the world around us and visual stimuli produced internally. They provide a common stimulus to the two systems (physical and psychological/perceptual) which allow us to compare both, and understand the way that each works. There are a variety of different forms of optical illusions, that include psychological,
Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.\n\n### Instruction:\nSummarize The Innovator’s Dilemma in a few sentence\n\n### Input:\nIt describes how large incumbent companies lose market share by listening to their customers and providing what appears to be the highest-value products, but new companies that serve low-value customers with poorly developed technology can improve that technology incrementally until it is good enough\n\n\n\n### Response:\n	The Innovator’s Dilemma: When New Technologies Cause Great Firms to Fail, first published in 1997, is the best-known work of the Harvard professor and businessman Clayton Christensen. It expands on the concept of disruptive technologies, a term he coined in a 1995 article “Disruptive Technologies: Catching the Wave”. It describes how large incumbent companies lose market share by listening to their customers and providing what appears to be the highest-value products, but new companies that serve low-value customers with poorly developed technology can improve that technology incrementally until it is good enough to quickly take market share from established business. Christensen recommends that large companies maintain small, nimble divisions that attempt to replicate this phenomenon internally to avoid being blindsided and overtaken by startup competitors.	Hey there,\n \nSorry for the late response!! I was on vacation this past week :)\n\nThe Innovator’s Dilemma by Clayton Christensen explains that “the innovator’s dilemma is that a company’s pursuit of sustained profits by introducing or improving its products to provide more value to existing customers will result in those same customers abandoning the company to buy innovations that offer at least some value	The Innovator’s Dilemma is a book that explains how large incumbent companies lose market share by listening to their customers and providing what appears to be the highest-value products, but new companies that serve low-value customers with poorly developed technology can improve that technology incrementally until it is good enough \n\n

Results for fine-tuning the Llama 2 13B text generation model on the Dolly dataset:

Inputs	Ground Truth	Response from Non-Fine-Tuned Model	Response from Fine-Tuned Model
Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.\n\n### Instruction:\nWho is the next coach at Fc Bayern München to take over from Julian Nagelsmann?\n\n### Input:\nNagelsmann era (2021–2023)\nUnder new coach Julian Nagelsmann, Bayern have completed the feat of winning 10 consecutive Bundesliga titles following a 3–1 Der Klassiker win. However, the team unexpectedly lost to Villarreal in the Champions League quarter-finals, going out at that stage for the second year in a row. On 24 March 2023, Nagelsmann was released by Bayern and replaced by Thomas Tuchel, who received a contract until 30 June 2025.\n\n\n\n### Response:\n	Thomas Tuchel replaced Julian Nagelsmann in March 2023 and is now the new coach at FC Bayern Munich, and was previously the coach at Chelsea.	\n\n“`python\n{\n “instruction”: “Who is the next coach at Fc Bayern München to take over from Julian Nagelsmann?”, \n “action”: “next coach of FC Bayern”, \n “context”:{\n “Nagelsmann era (2021–2023)”: \n ” Under new coach Julian Nagelsmann, Bayern have completed the feat of winning 10 consecutive Bundesliga titles	The next coach at Fc Bayern München will be Thomas Tuchel. Tuchel will take over from Julian Nagelsmann upon Nagelsmann’s departure from the team on 24 March, 2023.\n\n
Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.\n\n### Instruction:\nTell me about the life of Adolphe Cohn\n\n### Input:\nAdolphe Cohn (May 29, 1851– 1930) was a Franco-American educator, born in Paris. He was graduated “bachelier ès lettres” from the University of Paris in 1868, and studied law, historical criticism, and philology at various institutions of higher learning in Paris, receiving the degrees of LL.B. in 1873. A pupil of the École des Chartes, his thesis was called Vues sur l’histoire de l’organisation judiciaire en France du IXe au XIIIe siècle considérée au point de vue des juridictions extraordinaires and he got the diploma of “archiviste paléographe” in 1874. At the commencement of the Franco-Prussian War in 1870, he enlisted and served in the French army throughout the struggle.\n\nCohn emigrated to New York City on May 13, 1875, and from 1876 to 1884 was the American correspondent of La République française, then edited by Léon Gambetta, whom he had known in France, and whose political views he had adopted. In March, 1882, Cohn was appointed tutor in French at Columbia College, and soon afterward made an instructor. By a popular vote of the French residents of New York he was chosen to deliver the funeral oration on Gambetta in 1883 at Tammany Hall, and in 1885 was called from Cambridge, Massachusetts, for a similar purpose, upon the death of Victor Hugo.\n\n\n\n### Response:\n	Adolphe Cohn (May 29, 1851 – 1930) was a Franco-American educator, born in Paris. He was graduated “bachelier es lettres” from the University of Paris in 1868, and studied law, historical criticism, and philology at various institutions of higher learning in Paris, receiving the degrees of LL.B. in 1873. \n\nA pupil of the École des Chartes, his thesis was called Vues sur l’histoire de l’organisation judiciaire en France du IXe au XIIIe siècle considérée au point de vue des juridictions extraordinaires and he got the diploma of “archiviste paléographe” in 1874. At the commencement of the Franco-Prussian War in 1870, he enlisted and served in the French army throughout the struggle.\n\nCohn emigrated to New York City on May 13, 1875, and from 1876 to 1884 was the American correspondent of La République française, then edited by Léon Gambetta, whom he had known in France, and whose political views he had adopted. In March, 1882, Cohn was appointed tutor in French at Columbia College, and soon afterward made an instructor. By a popular vote of the French residents of New York he was chosen to deliver the funeral oration on Gambetta in 1883 at Tammany Hall, and in 1885 was called from Cambridge, Massachusetts, for a similar purpose, upon the death of Victor Hugo.	\nAdolphe Cohn (May 29, 1851– 1930) was a Franco-American educator, born in Paris. He was graduated “bachelier ès lettres” from the University of Paris in 1868, and studied law, historical criticism, and philology at various institutions of higher learning in Paris, receiving the degrees of LL.B. in 1873. A pupil of the É	Adolphe Cohn was a Franco-American educator who was born in Paris on May 29, 1851. He attended the University of Paris and earned a law degree in 1873. He also worked at the École des Chartes and earned his “archiviste paléographe” degree in 1874. At the start of the Franco-Prussian War, he joined the French army and served them for the whole
Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.\n\n### Instruction:\nHow has McDonald’s target audience changed over time?\n\n### Input:\nFast food restaurants often aim some of their advertising directly at the youth population. McDonald’s Happy Meals, which include a toy often tied in with a newly released family film, is a significant example. Ronald McDonald, a clown advertising mascot introduced in 1963 and designed to appeal to young children, is another. In addition, in 1987 McDonald’s incorporated a Play Place in their restaurants to further advertise to children, making their restaurants a more appealing environment for children. Additionally, from 1996 to 2006, Disney was an exclusive partner with McDonald’s, linking their products together. They announced the end of this deal in May 2006, with some reports saying that Disney was worried about childhood obesity. Other than Disney, McDonald’s has also been partnered with Nintendo since 1985, when the Nintendo Entertainment System was first introduced. In 1987, McDonald’s also created a Nintendo play land in their restaurants where kids could play on Nintendo consoles while waiting for their food. During 2007, McDonald’s began to provide WiFi from the Nintendo consoles; giving children the ability to play wherever they were. With also a target audience of children releasing these new films and incorporating toys, it uses the kids to reach out to their parents pockets because of their large investment.\n\n\n\n### Response:\n	McDonald’s has consistently aimed their advertising directly at the youth population.\n\nStarting in 1963, they introduced Ronald McDonald, a clown advertising mascot designed to appeal to young children. Over the years, McDonald’s partnered with multiple companies that have products targeted at children, such as Nintendo and Disney.	Fast food restaurants have always targeted the youth population. Happy meals are also a huge factor when it comes to advertsing to youths.\nAs many movies are realeased Happy Meals also come with a toy to promote that movie in your local McDonald’s\n\n—\n## 400-999\n\n### Comment:\nAs a young company with a relatively narrow vision of the world, McDonald’s didn	Since launch, McDonald’s has targeted its audience to young children and teens with its kid oriented fast food restaurants and food products, as well as its advertising, which has largely targeted the youth. While the children are being driven to McDonalds by their parents, the restaurant is in essence a parental advertisement for their children: McDonalds appeals to children with their brightly colored restaurants and toys found in their food that appeal to
Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.\n\n### Instruction:\nWhat is Anandapuram mandal\n\n### Input:\nAnandapuram mandal is one of the 46 mandals in Visakhapatnam district of the Indian state of Andhra Pradesh. It is under the administration of Visakhapatnam revenue division and the headquarters are located at Anandapuram. Anandapuram mandal is bounded by Bheemunipatnam mandal to the east, Chinagadila to the south, Padmanabham to the north and Kothavalasa to the west.\n\n\n\n### Response:\n	Anandapuram mandal is one of the 46 mandals in Visakhapatnam district of the Indian state of Andhra Pradesh. It is under the administration of Visakhapatnam revenue division and the headquarters are located at Anandapuram.	\n\n\n\n	Anandapuram mandal is one of the 46 mandals in Visakhapatnam district of the Indian state of Andhra Pradesh. It is under the administration of Visakhapatnam revenue division and the headquarters are located at Anandapuram. Anandapuram mandal is bounded by Bheemunipatnam mandal to the east, Chinagadila to the south, Padmanabham to the north and Koth
Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.\n\n### Instruction:\nWhat happened in Woodstock\n\n### Input:\nWoodstock Music and Art Fair, commonly referred to as Woodstock, was a music festival held during August 15–18, 1969, on Max Yasgur’s dairy farm in Bethel, New York, United States, 40 miles (65 km) southwest of the town of Woodstock. Billed as “an Aquarian Exposition: 3 Days of Peace & Music” and alternatively referred to as the Woodstock Rock Festival, it attracted an audience of more than 400,000 attendees. Thirty-two acts performed outdoors despite sporadic rain. It was one of the largest music festivals held in history.\n\nThe festival has become widely regarded as a pivotal moment in popular music history as well as a defining event for the counterculture generation. The event’s significance was reinforced by a 1970 documentary film, an accompanying soundtrack album, and a song written by Joni Mitchell that became a major hit for both Crosby, Stills, Nash & Young and Matthews Southern Comfort. Musical events bearing the Woodstock name were planned for anniversaries, which included the tenth, twentieth, twenty-fifth, thirtieth, fortieth, and fiftieth. In 2004, Rolling Stone magazine listed it as number 19 of the 50 Moments That Changed the History of Rock and Roll. In 2017, the festival site became listed on the National Register of Historic Places.\n\n\n\n### Response:\n	The Woodstock Music and Art Fair, commonly referred to as Woodstock, was a music festival held during August 15–18, 1969, on Max Yasgur’s dairy farm in Bethel, New York, United States, 40 miles (65 km) southwest of the town of Woodstock. Billed as “an Aquarian Exposition: 3 Days of Peace & Music” and alternatively referred to as the Woodstock Rock Festival, it attracted an audience of more than 400,000 attendees. Thirty-two acts performed outdoors despite sporadic rain. It was one of the largest music festivals held in history.\n\nThe festival has become widely regarded as a pivotal moment in popular music history as well as a defining event for the counterculture generation. The event’s significance was reinforced by a 1970 documentary film, an accompanying soundtrack album, and a song written by Joni Mitchell that became a major hit for both Crosby, Stills, Nash & Young and Matthews Southern Comfort. Musical events bearing the Woodstock name were planned for anniversaries, which included the tenth, twentieth, twenty-fifth, thirtieth, fortieth, and fiftieth. In 2004, Rolling Stone magazine listed it as number 19 of the 50 Moments That Changed the History of Rock and Roll. In 2017, the festival site became listed on the National Register of Historic Places.	\nAnswer:\n\n“`js\ndocument.getElementById(‘_0x1000’).innerHTML = “https://commons.wikimedia.org/wiki/File:BethelMeetingHouse.JPG#/media/File:BethelMeetingHouse.JPG”;\n\n“`\n\n</details>\n\n## Solution:\n\n<details>\n<summary>To View Solution</summary>\n\n“`js\n\n// Solution required	Woodstock Music and Art Fair held during the latter part of August 15–18, was a historic event in Bethel, New York by the counterculture generation. The crowd at the festival was more than 400,000 strong in attendance. About 32 acts performed outdoors in spite of sporadic rain. It was regarded as one of the biggest music festivals that had ever been in history.\n\n

Results for fine-tuning the Llama 2 70B text generation model on the Dolly dataset:

Inputs	Ground Truth	Response from Non-Fine-Tuned Model	Response from Fine-Tuned Model
Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.nn### Instruction:nWho is the company ABBnn### Input:nABB Ltd. is a Swedish-Swiss multinational corporation headquartered in Zürich, Switzerland. The company was formed in 1988 when Sweden’s Allmänna Svenska Elektriska Aktiebolaget (ASEA) and Switzerland’s Brown, Boveri & Cie merged to create ASEA Brown Boveri, later simplified to the initials ABB. Both companies were established in the late 1800s and were major electrical equipment manufacturers, a business that ABB remains active in today. The company has also since expanded to robotics and automation technology.nnnn### Response:n	ABB Ltd. is a Swedish-Swiss multinational corporation headquartered in Zürich, Switzerland. The company was formed in 1988 when Sweden’s Allmanna Svenska Elektriska Aktiebolaget (ASEA) and Switzerland’s Brown, Boveri & Cie merged to create ASEA Brown Boveri, later simplified to the initials ABB. nnBoth companies were established in the late 1800s and were major electrical equipment manufacturers, a business that ABB remains active in today. The company has also since expanded to robotics and automation technology.	* ABB is a swedish-swiss multinational corporationnn### LognEND I/O DATA —n	ABB is a Swedish-Swiss multinational corporation headquartered in Zürich, Switzerland. The company was formed in 1988 when Sweden’s Allmänna Svenska Elektriska Aktiebolaget (ASEA) and Switzerland’s Brown, Boveri & Cie merged to create ASEA Brown Boveri, later simplified to the initials ABB. Both companies were established in the late 1800s and were
Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.nn### Instruction:nWhy did a William the Conqueror, a Frenchman, become the King of England?nn### Input:nIn the 1050s and early 1060s, William became a contender for the throne of England held by the childless Edward the Confessor, his first cousin once removed. There were other potential claimants, including the powerful English earl Harold Godwinson, whom Edward named as king on his deathbed in January 1066. Arguing that Edward had previously promised the throne to him and that Harold had sworn to support his claim, William built a large fleet and invaded England in September 1066. He decisively defeated and killed Harold at the Battle of Hastings on 14 October 1066. After further military efforts, William was crowned king on Christmas Day, 1066, in London. He made arrangements for the governance of England in early 1067 before returning to Normandy. Several unsuccessful rebellions followed, but William’s hold was mostly secure on England by 1075, allowing him to spend the greater part of his reign in continental Europe.nnnn### Response:n	William the Conqueror was from Normandy, located in northwest France. The King of England in the late 1050s and early 1060s was Edward the Confessor. Edward was William’s cousin. Since Edward was childless, there were multiple people who claimed that they were the rightful heir and should become king. These conflicting claims gave rise to armed conflict. William invaded England, where he won the Battle of Hastings, defeating earl Harold Godwinson. A number of battles followed, but William became King of England late in 1066. He dealt with multiple rebellions over the next decade, but retained his power. Ultimately he was able to return to Normandy and rule England from there.	
William the Conqueror became King of England as in the 1050s and early 1060s, he became a contender for the throne of England held by the childless Edward the Confessor, his first cousin once removed.nnnResponse can be empty but must fit within 256 characters long.n	After the death of King Harold in the Battle of Hastings in October 14, 1066, William the Conqueror was crowned the King of England on Christmas Day, 1066, establishing the Norman conquest of England.nn
Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.nn### Instruction:nTell me about the RMS Adriaticnn### Input:nRMS Adriatic was a British ocean liner of the White Star Line. She was the fourth of a quartet of ships of more than 20,000 GRT, dubbed The Big Four. The ship was the only one of the four which was never the world’s largest ship. However, she was the largest, the fastest, and the most luxurious of the Big Four, being the first ocean liner to have an indoor swimming pool and a Turkish bath.nnShe began her career on the brand new route from Southampton to New York before joining, from 1911, her sister ships on the secondary route from Liverpool. They were in fact slow liners intended to provide a service at moderate prices. When World War I broke out, the Adriatic was among the ships that continued their civilian transatlantic service, while carrying many provisions. In 1917, she was requisitioned and served as a troop transport.nnAfter the war, she was refitted several times, and was gradually used for cruises, which became her main services in the 1930s. When Cunard Line and White Star Line merged in 1934, she was quickly deemed unnecessary and sold at the end of the year. She was scrapped in Osaka in 1935.nnnn### Response:n	The RMS Adriatic was a British ocean liner of the White Star Line. She was the fourth of a quartet of ships of more than 20,000 GRT, dubbed The Big Four. The ship was the only one of the four which was never the world’s largest ship. However, she was the largest, the fastest, and the most luxurious of the Big Four, being the first ocean liner to have an indoor swimming pool and a Turkish bath.	RMS Adriatic was a British ocean liner of the White Star Line. She was the fourth of a quartet of ships of more than 20,000 GRT, dubbed The Big Four. The ship was the only one of the four which was never the world’s largest ship. 
nn### ContextnA question of this type is commonly used to query a search engine. The input is an input of the query, and the response is the public	The RMS Adriatic was a british ocean liner of the white star line. She was launched in 1906 but her first commercial sailing was in 1907. She was the fourth of a quartet of ships over 20,000 tonnes and dubbed the Big Four. The other three were the RMS Celtic in 1901, the RMS Baltic in 1903, and the R
Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.nn### Instruction:nWhat is the CGRO that detects photons?nn### Input:nThe Compton Gamma Ray Observatory (CGRO) was a space observatory detecting photons with energies from 20 keV to 30 GeV, in Earth orbit from 1991 to 2000. The observatory featured four main telescopes in one spacecraft, covering X-rays and gamma rays, including various specialized sub-instruments and detectors. Following 14 years of effort, the observatory was launched from Space Shuttle Atlantis during STS-37 on April 5, 1991, and operated until its deorbit on June 4, 2000. It was deployed in low Earth orbit at 450 km (280 mi) to avoid the Van Allen radiation belt. It was the heaviest astrophysical payload ever flown at that time at 17,000 kilograms (37,000 lb).nnnn### Response:n	The CGRO is the Compton Gamma Ray Observatory (CGRO) and was a space observatory detecting photons with energies from 20 keV to 30GeV in earth from 1991 to 2000.	Transcriber: The EGRET instrument detects photons.nn<!–nThe following comments are intended to provide “tips” that can be used to apply the concepts from the course to this problem. Some of these comments are “correct” (specifically the explanations of the reasoning for each “correct” response), and some are “incorrect” (those explain what is wrong with the incorrect responsese).nnThe “Correct” way to	The Compton Gamma Ray Observatory (CGRO) measures photons with energies from 20 keV to 30 GeV.nn
Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.nn### Instruction:nHow good as a pianist is Yeol Eum Sonnn### Input:nYeol Eum Son (born May 2, 1986, in Wonju, South Korea) is a South Korean classical pianist. She first drew international attention in October 2004 at age 18 when she appeared as a soloist performing Liszt Piano Concerto No. 1 with the New York Philharmonic under the baton of Lorin Maazel on their Asia tour in Seoul, Daejeon, and Tokyo. Son again performed with Maazel and the New York Philharmonic when they returned to the Seoul Arts Center in February 2008, this time as soloist for Beethoven Piano Concerto No. 2.nnHer awards include Silver Medals at both the Thirteenth Van Cliburn International Piano Competition (2009) and 2011 International Tchaikovsky Competition, where she also received the Best Chamber Concerto Performance and the Best Performance of the Commissioned Work prizes. She also won the Bronze Medal at the Arthur Rubinstein International Piano Master Competition (2005) nnnn### Response:n	Yeol Eum Son (born May 2, 1986, in Wonju, South Korea) is a South Korean classical pianist. She first drew international attention in October 2004 at age 18 when she appeared as a soloist performing Liszt Piano Concerto No. 1 with the New York Philharmonic under the baton of Lorin Maazel on their Asia tour in Seoul, Daejeon, and Tokyo. Son again performed with Maazel and the New York Philharmonic when they returned to the Seoul Arts Center in February 2008, this time as soloist for Beethoven Piano Concerto No. 2.nnHer awards include Silver Medals at both the Thirteenth Van Cliburn International Piano Competition (2009) and 2011 International Tchaikovsky Competition, where she also received the Best Chamber Concerto Performance and the Best Performance of the Commissioned Work prizes. 
She also won the Bronze Medal at the Arthur Rubinstein International Piano Master Competition (2005)	Yeol Eum Son is an accomplished pianist, who has performed with some of the world’s most prestigious orchestras. In 2009, she won silver at the Thirteenth Van Cliburn International Piano Competition and in 2005 she won bronze at the Arthur Rubinstein International Piano Master Competition. She continues to perform with great success all over the world.	Yeol Eum Son is a very good pianist. She began her international career by joining some of the greatest musicians in the world at age 18. And was already a Silver Medalist in two international competitions by the time she turned 25.nn

Results for fine-tuning the Llama 2 7B text generation model on the Dialogsum dataset:

Inputs	Ground Truth	Response from Non-Fine-Tuned Model	Response from Fine-Tuned Model
Summarize this dialog:n#Person1#: This is a great meal!n#Person2#: Thank you. I am glad that everything turned out so well.n#Person1#: Who taught you how to cook this well?n#Person2#: I took a cooking class at the Culinary Institute.n#Person1#: Wow, what is in this chicken dish?n#Person2#: Isn’t that great? That is coconut ginger chicken with rice.n#Person1#: Is the soup a shrimp soup?n#Person2#: Yes, there is shrimp in there along with sea vegetables and lemon grass.n#Person1#: I think I got lucky in choosing a wine that blends with this meal.n#Person2#: I appreciate you bringing the wine. It goes well with the chicken.n—nSummary:nnn### Response:n	#Person1# thinks the meal prepared by #Person2# is great. #Person2# introduces the ingredients of the dishes.	After the dialog is complete, the person will provide an appropriate response to the information gathered. For example, person 1 can say,nn#Person1#: This is great! Thank you.nn#Person2#: You are very welcome. I thought that you would like your meal. I’m happy that you enjoyed it.nn—n— — — — –nnn### Summary:nnn### Response:nAfter the dialog	#Person1# and #Person2# eat at #Person2#’s place. They talk about the food and the wine that blend with the chicken.nn
Summarize this dialog:n#Person1#: Excuse me, I am writing an essay about ancient Greece. Are there any books on this topic that you can recommend?n#Person2#: Geese? You want to write about ancient geese? I’ve never heard of such a thing. Anyway, it would be with a book of some animals, that’s row J. Or you want to try science, that’s in Row G.n#Person1#: Animals? But Greece is a country, not an animal.n#Person2#: Oh, Greece. Well, you know, today must be your lucky day. We have just received a new collection of books on that very topic. There are writers over here in row C.n#Person1#: I see. That’s a lot of books, which do you suggest?n#Person2#: Well, the one by Robin Smith is excellent but it is very long. It’s the one with the statue on the cover. There is also one by Milton Pike. It has a painting of a soldier on the front. As you can see, it’s not very thick, so it shouldn’t take too much time to read. But probably the best book is the one with the picture of ancient Olympic Stadium. It has won several history works rewards.n#Person1#: Well, it’s hard to decide, but I’m afraid I don’t have enough time for a lot of reading. I think I’ll get the book by Milton Pike.n#Person2#: OK. Could I have your library card please?n#Person1#: Sure, here you are.n—nSummary:nnn### Response:n	#Person1# asks #Person2# to recommend some books on ancient Greece. #Person2# suggests the one by Robin Smith which is very long, the one by Milton Pike which is not very thick, and the one winning several history works rewards. Since #Person1# doesn’t have enough time, #Person1#’ll get the book by Milton Pike.	n—nnSummary:n	#Person1#’s writing an essay about ancient Greece. #Person2# recommends the books about ancient Greece to #Person1# which are all newly received and suggests getting the book by Milton Pike.nn
Summarize this dialog:n#Person1#: Well, I see you’ve got your books, George. But are you reading? Looks like you’re just coloring.n#Person2#: I’m marking the important parts of the book. The colors are for different things.n#Person1#: Like what?n#Person2#: Main ideas are yellow. Important details to remember are pink and the light green is for difficult ideas.n#Person1#: This seems like a lot of activities. Aren’t you wasting your time?n#Person2#: I don’t think so, mom. I’ll read it all tonight. I’ll listen to it tomorrow. And then when it’s time to study, I will already know most of the information.n#Person1#: In my day, we just read the book and tried to remember the information.n#Person2#: The teacher told us the importance of active reading. And it’s the best way to learn at whatever speed you choose.n#Person1#: I see. And what’s this notebook for?n#Person2#: That’s for taking notes. What kind of things do you write down?n#Person1#: Sometimes I write a summary. On this page I write down questions. And of course I have another page where I write down the new words.n—nSummary:nnn### Response:n	George tells his mother that marking the important parts of the book in different colors is his way of learning. He also takes notes on the notebook and sometimes writes a summary.	#Person1#: Ohhh! Active reading, reading questions, vocabulary notes, summaries, and highlighting. I can see now what a difference you have when it comes to learning information. I’m impressed.n#Person2#: Thanks. Reading is the best way to learn and remember new information. Are you using active reading right now?n#Person1#: Hmmm….. I’ll have to think about that.nnn#Person	George describes the activities he’s doing in order to read his books fast. George shows the notebook to mom and explains how to take notes.nn
Summarize this dialog:n#Person1#: Good morning, sir. May I see your passport?n#Person2#: Yes, of course. Here it is.n#Person1#: Thank you. Are all these yours?n#Person2#: No, some of them are my friend’s. Do you want her passport, too?n#Person1#: Yes, please. Do you have any carry-on items?n#Person2#: Do you want me to put it on the belt, too?n#Person1#: Yes, please. Would you please open this big suitcase?n#Person2#: OK. Just a moment. It’s just some clothes and toiletries.n#Person1#: OK. Do you have anything to declare?n#Person2#: I think there is nothing.n#Person1#: Any fresh food, live plants, or animals?n#Person2#: Oh, no. None of that.n#Person1#: Thank you. You may lock up your suitcase. Here are your passports.n#Person2#: Thanks a lot.n—nSummary:nnn### Response:n	#Person1# sees #Person2#’s and #Person2#’s friend’s passport and checks their luggage. #Person2# tells #Person1# they’ve nothing to declare. #Person1# lets them go.	> I think there is nothing.nnn—nCategories:nn- [Conversation (32)](../conversation/32.md)n – [TSA (4)](../conversation/32/4.md)nn### Key:nn :information_source: 31 (TSA)n:information_source: 32 (TSA)nn—-n* Any copyright	The immigration official asks #Person2# to show his passport and open his big suitcase. #Person2# has nothing to declare and he may lock up his suitcase.nn
Summarize this dialog:n#Person1#: Chinese tea is famous all over the world. I want to buy several boxes.n#Person2#: There are various kinds of tea in China. It depends on the taste and effect.n#Person1#: Well, I do not know Chinese tea at all.n#Person2#: This is green tea, you can drink it in summer. It can relieve summer-heat and it is good for maintaining your lungs.n#Person1#: Sounds interesting. What about this?n#Person2#: It is black tea, it has a strong taste and promotes digestion.n#Person1#: What’s this? It smells fragrant.n#Person2#: It is jasmine tea of high quality. It is fragrant and has a pleasant taste.n#Person1#: I will take the jasmine tea. Could you pack it with the wrapping paper and ribbon? It is a gift for my friend.n#Person2#: No problem. Wait for a moment.n—nSummary:nnn### Response:n	#Person2# recommends green tea, black tee, and jasmine tee to #Person1# who wants to buy Chinese tea and #Person1# decides to take the jasmine tea.	#[English]#: Chinese tea is famous all over the world. I want to buy several boxes.n#[French]#: Il y a différentes sorts de thé chinois au pays. C’est selon le goût, et les effets qu’on veut obtenir.n#[Spanish]#: El té chino es muy famosos, tengo que comprar varias cajas.n#[German]#: Ich mö	#Person1# wants to buy several boxes of Chinese tea, with different kinds of tea. #Person2# helps #Person1# to select and will pack it up.nn

Results for fine-tuning the Llama 2 13B model on the Dialogsum dataset:

Inputs	Ground Truth	Response from Non-Fine-Tuned Model	Response from Fine-Tuned Model
Summarize this dialog:n#Person1#: Mrs. Thompson once ditheist, Jason.n#Person2#: Why? We shouldn’t be in trouble or anything? In fact we were super active during the entire class discussion this morning.n#Person1#: That’s just it. I don’t think it’s a bad thing at all. I think she is going to ask us to run for class monitor.n#Person2#: Whoa! Really? That’s a lot of responsibility.n#Person1#: Well, I think you’d be great at it. Besides, it’s a big honor to be asked to do it. It means that Mrs. Thompson sees us as leaders.n#Person2#: I’m not sure if I can do it by myself, though. Could we do it together?n#Person1#: Let’s ask Mrs. Thompson. I hear her coming in from the hallway now.n—nSummary:nnn### Response:n	#Person1# tells Jason Mrs. Thompson once ditheist and #Person1# guesses she wants them to run for class monitors. They’ll ask Mrs. Thompson whether they can do this together.	n* When the teacher asked Jason and his classmate if they wanted to run for class monitor, he and his classmate agreed it together. Both boys are Christian, but they are not friends in class.n* Their faith in Jesus may not be publicly shared, but it is evident through prayer and the example they follow together to help others and show love.nn—n## 1654nn[Situation: Students are talking about the recent	Mrs. Thompson asks Jason and James to run for the class monitor because she thinks they are leaders.nn
Summarize this dialog:n#Person1#: Hey, let’s pull over and look at that house over there.n#Person2#: Yes, that looks great! Is it on our Open House list?n#Person1#: Yes, it is on our list.n#Person2#: Well, it looks great from the curb. Let’s go on inside.n#Person1#: Let’s sign the registry. The realtor looks kind of busy right now.n#Person2#: OK, let’s go check out the kitchen next.n#Person1#: Oh look, they have some refreshments in here.n#Person2#: Make sure that you pick up a flyer with the information about the home.n#Person1#: Does it say how much they are asking for this house?n#Person2#: Yes, it seems to have been discounted from two hundred and thirty-five thousand to two hundred and twenty-five thousandn—nSummary:nnn### Response:n	#Person1# and #Person2# come to see a house. They check out the inside and talk about its price.	This is an AMI response to the following stimulus.nnPerson 2: nYes, that looks great! Is it on our Open House list? nYes, it is on our list.nWell, it looks great from the curb. Let’s go on inside.nLet’s sign the registry. The realtor looks kind of busy right now.nOK, let’s go check out the kitchen next.nOh look, they have some	#Person1# and #Person2# are looking around a house which is on their Open House list. They sign a registry and #Person2# tells #Person1# to take a flyer with the information.nn
Summarize this dialog:n#Person1#: How are you, Mr. Wilson? I am Tina.n#Person2#: Glad to meet you, Mrs. Tina.n#Person1#: Please have a seat. I know our employment of forcer has already given you the information about your employee contract, but I’d like to go over the main details again before signing. First, you’ll be getting a monthly salary, with one month paid vacation after one year of service.n#Person2#: That’s what I understand.n#Person1#: You’ll be covered by our medical plan while on duty. Since you’ll be middle management, you’re expected to be available up to 1. 5 hours past normal working hours. Any approved time over that will be paid at time and a half, which you can take as salary or time off.n#Person2#: Exactly my understanding.n#Person1#: A reasonable number of sick days will be covered by the company. Any extended illness will be covered by insurance. Have you read the other terms of the contract?n#Person2#: Yes, I have.n#Person1#: Do you have any other questions?n#Person2#: Just one. I noticed an item about flex-time. Is that a possibility for me?n#Person1#: Yes, it is, but you’ll have to discuss the details with your manager.n#Person2#: That’s acceptable.n#Person1#: Good. Now, if you’ll just sign here, you can start work in 3 days.n—nSummary:nnn### Response:n	Mr.Wilson describes the elementary information about the employment contract to Tina. He mentions monthly salary, one month paid vacation, medical plans, extended working hours and flex-time.	nNo, the summaries aren’t right. The #Person1 dialogs don’t say the name is Tina, they don’t introduce themselves. I think they both know each others’ names, so that’s unneeded.nnI think if everything else is right, the Summary won’t change, unless #Person1 ends, and that’s ok.nn	Tina meets with Mr. Wilson to go over the main details of Tina’s employee contract. The two discuss the monthly salary and medical plan. Tina has no questions and will start work in 3 days.nn
Summarize this dialog:n#Person1#: Jack, what is your favorite subject?n#Person2#: My favorite subject? Well, I think I prefer science. Yes, science, I find it very interesting. The science teacher makes the subject very exciting. What about you?n#Person1#: I always do very well in the maths tests. I also like geography. I like learning about people and the places where they live. Are there any subjects that you don’t like Jack?n#Person2#: I don’t like physics very much. It’s not easy to understand, most of the children are bored with it.n—nSummary:nnn### Response:n	Jack’s favorite subject is science while he doesn’t like physics very much. #Person1# likes maths and geography.	n—nn### The Word: “favorite” (adj)nnMeaning: Loved, appreciated most of the others nExamples: My favorite book is the Bible; What’s your favorite sport?; My favorite color is blue.nn—nnnn### What is your ____ subject? (formal)nnQuestion and Answer Dialog:nn#Person1#: What’s your favorite subject?n#	Jack and Jack are talking about their favorite and the least liked subjects.nn
Summarize this dialog:n#Person1#: Hi, Maggie. I saw your painting in the end of year school art exhibition that all the parents went to. It was great.n#Person2#: Well, our art teacher seemed really pleased with it, but I’m not sure what the parents thought about.n#Person1#: Oh, don’t worry. I only heard them saying good things about it.n#Person2#: Hmm…He probably didn’t notice that I hadn’t used the right colors.n#Person1#: I don’t think anyone noticed.n#Person2#: No, but I did. But it was too late to do anything about it. Well, next time I’ll do something better hopefully.n—nSummary:nnn### Response:n	#Person1# saw Maggie’s painting and thinks it was great, but Maggie worries that she didn’t use the right colors and hopes she’ll do something better next time.	This summary should highlight the key points and conclusions from the discussion.nn# Respond to Person1:n[Response summary]nn—n### Response:nThis summary should highlight the key points and conclusions from the discussion.nn# Respond to Person2:n[Response summary]nn### Person1:n[Person1 summary]n—n### Person2:n[Person2 summary]nn### Personn—n	#Person1# saw Maggie’s painting and told Maggie that a lot of parents said great things about Maggie’s painting. Maggie does not think so.nn

Results for fine-tuning the Llama 2 70B model on the Dialogsum dataset:

Inputs	Ground Truth	Response from Non-Fine-Tuned Model	Response from Fine-Tuned Model
Summarize this dialog:n#Person1#: Have you received any degrees?n#Person2#: Yes. In 1996 I received my Bachelor of Science degree from Hebes University, and in 2001 I received my MBA degree from Peking University.n#Person1#: How about your academic records at college?n#Person2#: In fact my records were excellent. My overall GPA was 9 on a 10 scale, which was the highest in my class.n#Person1#: That’s very impressive. Which course did you like best?n#Person2#: English. It was both interesting and useful, so I showed a great interest in it.n#Person1#: Can you tell me why you changed your major when you were a graduate student?n#Person2#: Because I am very interested in administration and I did take some courses in it. I also performed well in the subjects. Hence I believe that I can do a good job in this position.n#Person1#: Did you get any honors and awards at college?n#Person2#: Yes. I was awarded a scholarship from the university every year. In 1995 I participated in the National Contest of Maths Models and I won the prize.n—nSummary:nnn### Response:n	#Person1# interviews #Person2# and asks #Person2# some questions, including #Person2#’s degrees, academic records, the favorite course, and awards in college. #Person2# also tells #Person1# why #Person2# changed the major into administration.	The candidate summarized that the main idea of the dialogue was that he received two degrees, 0. And the majors did well. <br/>nHowever his first sentence doesn’t provide much emotion above besides narcissism, nor does it state why his experience is relevant to the company. He could make a better connection and perhaps end the summary with an underlying emotion/attitude other than “professional” (as he does in the dialog).<br/>	#Person2# tells #Person1# #Person2# received a bachelor’s degree and an MBA. #Person2#’s GPA is the highest in #Person2#’s class and #Person2# is interested in administration. #Person2# also got some honors and awards at college.nn
Summarize this dialog:n#Person1#: Excuse me, what material is it? n#Person2#: Rhaki. Do you want some? n#Person1#: No, where are the cotton cloths? n#Person2#: I will show you. n#Person1#: What is the shrink rate after washing? n#Person2#: Less than 5%. It’s quite durable. n—nSummary:nnn### Response:n	#Person2# tells #Person1# the cotton clothes are durable.	For route QA41913, this section should contain a paragraph summary of the dialog so that, for example, an automatic pronunciation and speech generation system could use this text for output synthesis.	#Person1# wants some cotton cloth. #Person2# shows some rhaki.nn
Summarize this dialog:n#Person1#: Sorry, I’m late, Tom.n#Person2#: It’s all right, Rita. Where have you been?n#Person1#: At the police station. I’ve lost my handbag. Or perhaps someone’s taken it. I don’t know.n#Person2#: Oh, no. What happened? Sit down. Was there anything important in it?n#Person1#: Important! My checkbook, all the papers I need for work.n#Person2#: Oh, that’s terrible. But how did you lose it?n#Person1#: Well, as you know, I was with a friend all morning and we had lunch together. After I had lunch, I went shopping. And when I wanted to buy something, I couldn’t find my checkbook. Then I remembered that it was in my handbag. And my handbag was in my car.n#Person2#: So you went back to your car.n#Person1#: But I didn’t find it there.n#Person2#: And you went to the police station?n#Person1#: Not immediately. Before I went to the police station I called my friend’s office. No luck.n#Person2#: You should go to the restaurant where you had lunch and look for it.n#Person1#: Oh, I should have done that.n#Person2#: Now you’d better telephone the manager right away.n—nSummary:nnn### Response:n	Rita tells Tom she’s late because she’s lost her handbag and her efforts on searching for the handbag. Tom suggests Rita go to the restaurant where she had lunch and look for it.	Translate each sentence into English, using the word combination you like the best:n### Value:n	Rita lost her handbag and Tom advises her to go to the restaurant where she had lunch to look for it.nn
Summarize this dialog:n#Person1#: Morning, Mum!n#Person2#: Morning, Meg. You look not well today? Are you ill?n#Person1#: No, I am not ill.n#Person2#: Then, What’s the matter with you my child?n#Person1#: Nothing.n#Person2#: Oh, come on, baby. Tell me what happened.n#Person1#: I. . . I failed to pass the examination. How I wish I had studied hard.n#Person2#: Oh. Take it easy. You can set your aim from today.n#Person1#: Ok, Mum. I will not fail the examination next time.n—nSummary:nnn### Response:n	Meg isn’t in the mood because she failed the examination. Her mom encourages her to set her aim from today.	Aide sans action verbnAide et direct object apres le verbe amountn	#Person1# failed to pass the examination and tells Mum about it. Mum encourages #Person1# to study harder.nn
Summarize this dialog:n#Person1#: Everyone wants to be financial lose kill. You must invest your savings if you planed to retire rich. I’m thinking about buying stocks. It can be a good investment if I can manage well. What do you say?n#Person2#: I partly agree with you. Buying stocks is not difficult, but making money consistently from buying stock is very difficult. Even professionals don’t find it easy, so be careful. Anyway, people like us need help with investing. Why not do some online research? There are data, reports and discussion forums to educate people on how to invest. In addition, you need to visit an established website that offers stock charts.n#Person1#: Wow, you sound very professional.n#Person2#: I just watched the investment TV program last weekend. It has become a very hot topic these days. Actually I am not even a beginner.n—nSummary:nnn### Response:n	#Person1# and #Person2# are talking about buying stocks. #Person2# thinks it is hard to make money consistently in this way and suggests doing online research to learn about investing.	nI agree with the investment strategy recommended earlier. It took me a great deal of time and energy to come up with such plan. It is great that trust has been put it my judgement to this extent. Indeed, my colleague can feel secure, because I will do everything in order to lead this team to bright and safe retirement future. I look forward to your role as an active member of the team.	#Person1# wants to buy stocks but #Person2# tells #Person1# to be careful because it is difficult to invest. #Person2# suggests #Person1# do online research, visit established websites, and watch investment programs.nn

Dataset formatting

We currently offer two types of fine-tuning: instruction fine-tuning and domain adaptation fine-tuning. You can switch between the two training methods by setting the parameter instruction_tuned to 'True' or 'False'.

Domain adaptation format

The text generation model can also be fine-tuned on any domain-specific dataset. After it’s fine-tuned on the domain-specific dataset, the model is expected to generate domain-specific text and solve various NLP tasks in that specific domain with few-shot prompting.

For input to the model, use a training directory and an optional validation directory. Each directory contains a CSV, JSON, or TXT file. For CSV and JSON files, the training or validation data is taken from the column called text, or from the first column if no column called text is found. The train directory and the validation directory (if provided) must each contain exactly one file.
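The column-selection rule for CSV input can be sketched as follows. This is a minimal illustration using only the Python standard library, not the JumpStart implementation; the helper name load_training_text is hypothetical, and the sketch assumes the CSV file has a header row.

```python
import csv
import io


def load_training_text(csv_source: str) -> list[str]:
    """Return the training texts found in CSV content.

    Uses the column named "text" if it exists; otherwise falls back
    to the first column, mirroring the rule described above.
    """
    reader = csv.DictReader(io.StringIO(csv_source))
    fieldnames = reader.fieldnames or []
    if not fieldnames:
        return []
    column = "text" if "text" in fieldnames else fieldnames[0]
    return [row[column] for row in reader]


# A file with a "text" column, and one without (first column is used).
with_text = "id,text\n1,first passage\n2,second passage\n"
without_text = "body,source\nalpha,10-K\nbeta,10-Q\n"

print(load_training_text(with_text))
print(load_training_text(without_text))
```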

The output is a trained model that can be deployed for inference.

The following is an example of a TXT file for fine-tuning the text generation model. The TXT file contains Amazon's SEC filings from 2021–2022:

This report includes estimates, projections, statements relating to our
business plans, objectives, and expected operating results that are “forward-
looking statements” within the meaning of the Private Securities Litigation
Reform Act of 1995, Section 27A of the Securities Act of 1933, and Section 21E
of the Securities Exchange Act of 1934. Forward-looking statements may appear
throughout this report, including the following sections: “Business” (Part I,
Item 1 of this Form 10-K), “Risk Factors” (Part I, Item 1A of this Form 10-K),
and “Management’s Discussion and Analysis of Financial Condition and Results
of Operations” (Part II, Item 7 of this Form 10-K). These forward-looking
statements generally are identified by the words “believe,” “project,”
“expect,” “anticipate,” “estimate,” “intend,” “strategy,” “future,”
“opportunity,” “plan,” “may,” “should,” “will,” “would,” “will be,” “will
continue,” “will likely result,” and similar expressions. Forward-looking
statements are based on current expectations and assumptions that are subject
to risks and uncertainties that may cause actual results to differ materially.
We describe risks and uncertainties that could cause actual results and events
to differ materially in “Risk Factors,” “Management’s Discussion and Analysis
of Financial Condition and Results of Operations,” and “Quantitative and
Qualitative Disclosures about Market Risk” (Part II, Item 7A of this Form
10-K). Readers are cautioned not to place undue reliance on forward-looking
statements, which speak only as of the date they are made. We undertake no
obligation to update or revise publicly any forward-looking statements,
whether because of new information, future events, or otherwise.

GENERAL

Embracing Our Future ...

Instruction fine-tuning

The text generation model can be instruction-tuned on any text data provided that the data is in the expected format. The instruction-tuned model can be further deployed for inference.

For input, use a training and optional validation directory. The train and validation directories should contain one or multiple JSON lines (.jsonl) formatted files. In particular, the train directory can also contain an optional *.json file describing the input and output formats.

The best model is selected according to the validation loss, calculated at the end of each epoch. If a validation set is not given, an (adjustable) percentage of the training data is automatically split and used for validation.
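As a rough illustration of these two ideas (holding out part of the training data when no validation set is supplied, and picking the epoch with the lowest validation loss), consider the following sketch. The function names, the default fraction, and the shuffling strategy are assumptions made for this example; the actual JumpStart logic may differ.

```python
import random


def split_train_validation(samples, validation_fraction=0.2, seed=0):
    """Hold out a fraction of the samples for validation (hypothetical helper)."""
    if not 0.0 < validation_fraction < 1.0:
        raise ValueError("validation_fraction must be between 0 and 1")
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_validation = max(1, int(len(shuffled) * validation_fraction))
    # First n_validation shuffled items become validation data, the rest training data.
    return shuffled[n_validation:], shuffled[:n_validation]


def best_epoch(validation_losses):
    """Return the index of the epoch with the lowest validation loss."""
    return min(range(len(validation_losses)), key=validation_losses.__getitem__)


train, validation = split_train_validation(range(10), validation_fraction=0.2)
print(len(train), len(validation))
print(best_epoch([1.9, 1.4, 1.6]))
```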

The training data must be formatted in a JSON lines (.jsonl) format, where each line is a dictionary representing a single data sample. All training data must be in a single folder; however, it can be saved in multiple .jsonl files. The .jsonl file extension is mandatory. The training folder can also contain a template.json file describing the input and output formats. If no template file is given, the following template will be used:

{
    "prompt": "Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.\n\n### Instruction:\n{instruction}\n\n### Input:\n{context}",
    "completion": "{response}"
}

In this case, the data in the JSON lines entries must include prompt and completion fields. If a custom template is provided, it must also use prompt and completion keys to define the input and output templates. The following is a sample custom template:

{
  "prompt": "question: {question} context: {context}",
  "completion": "{answer}"
}

Here, the data in the JSON lines entries must include the question, context, and answer fields.
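To make the relationship between the template and the JSON lines entries concrete, the following sketch fills the sample custom template from each record. It is an illustration only: the render_samples helper and the example records are hypothetical, and the real training script may apply templates differently.

```python
import json

# The sample custom template shown above.
custom_template = {
    "prompt": "question: {question} context: {context}",
    "completion": "{answer}",
}

# Two made-up JSON lines records with the fields the template expects.
jsonl_data = "\n".join(
    [
        json.dumps({"question": "Who filed the report?", "context": "Form 10-K", "answer": "Amazon"}),
        json.dumps({"question": "Which year?", "context": "Fiscal 2021", "answer": "2021"}),
    ]
)


def render_samples(jsonl_text, template):
    """Fill the prompt and completion templates from each JSON lines record."""
    rendered = []
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        record = json.loads(line)
        rendered.append(
            {
                "prompt": template["prompt"].format(**record),
                "completion": template["completion"].format(**record),
            }
        )
    return rendered


for sample in render_samples(jsonl_data, custom_template):
    print(sample)
```

A record that lacks one of the placeholder fields raises a KeyError in this sketch, which is one way to catch malformed entries before training.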

The output is a trained model that can be deployed for inference.

We provide a subset of Amazon's SEC filings data, downloaded from the publicly available EDGAR system. For instructions on accessing the data, refer to Accessing EDGAR Data.

License: Creative Commons Attribution-ShareAlike License (CC BY-SA 4.0)

Run multiple generative AI models on GPU using Amazon SageMaker multi-model endpoints with TorchServe and save up to 75% in inference costs

September 6, 2023

by James Wu Amazon AWS

Multi-model endpoints (MMEs) are a powerful feature of Amazon SageMaker designed to simplify the deployment and operation of machine learning (ML) models. With MMEs, you can host multiple models on a single serving container and host all the models behind a single endpoint. The SageMaker platform automatically manages the loading and unloading of models and scales resources based on traffic patterns, reducing the operational burden of managing a large quantity of models. This feature is particularly beneficial for deep learning and generative AI models that require accelerated compute. The cost savings achieved through resource sharing and simplified model management make SageMaker MMEs an excellent choice for you to host models at scale on AWS.

Recently, generative AI applications have captured widespread attention and imagination. Customers want to deploy generative AI models on GPUs but at the same time are conscious of costs. SageMaker MMEs support GPU instances and are a great option for these types of applications. Today, we are excited to announce TorchServe support for SageMaker MMEs. This new model server support gives you all the benefits of MMEs while still using the serving stack that TorchServe customers are most familiar with. In this post, we demonstrate how to host generative AI models, such as Stable Diffusion and Segment Anything Model, on SageMaker MMEs using TorchServe and build a language-guided editing solution that can help artists and content creators develop and iterate their artwork faster.

Solution overview

Language-guided editing is a common cross-industry generative AI use case. It can help artists and content creators work more efficiently to meet content demand by automating repetitive tasks, optimizing campaigns, and providing a hyper-personalized experience for the end customer. Businesses can benefit from increased content output, cost savings, improved personalization, and enhanced customer experience. In this post, we demonstrate how you can build language-assisted editing features using MME TorchServe that allow you to erase any unwanted object from an image and modify or replace any object in an image by supplying a text instruction.

The user experience flow for each use case is as follows:

To remove an unwanted object, select the object from the image to highlight it. This action sends the pixel coordinates and the original image to a generative AI model, which generates a segmentation mask for the object. After confirming the correct object selection, you can send the original and mask images to a second model for removal. This user flow is illustrated below.


Step 1: Select an object (“dog”) from the image	Step 2: Confirm the correct object is highlighted	Step 3: Erase the object from the image

To modify or replace an object, select and highlight the desired object, following the same process as described above. Once you confirm the correct object selection, you can modify the object by supplying the original image, the mask, and a text prompt. The model will then change the highlighted object based on the provided instructions. A detailed illustration of this second user flow is as follows.


Step 1: Select an object (“vase”) from the image	Step 2: Confirm the correct object is highlighted	Step 3: Provide a text prompt (“futuristic vase”) to modify the object

To power this solution, we use three generative AI models: Segment Anything Model (SAM), Large Mask Inpainting Model (LaMa), and Stable Diffusion Inpaint (SD). Here is how these models are used in the user experience workflow:

To remove an unwanted object	To modify or replace an object

Segment Anything Model (SAM) is used to generate a segment mask of the object of interest. Developed by Meta Research, SAM is an open-source model that can segment any object in an image. This model has been trained on a massive dataset known as SA-1B, which comprises over 11 million images and 1.1 billion segmentation masks. For more information on SAM, refer to their website and research paper.
LaMa is used to remove any undesired objects from an image. LaMa is a Generative Adversarial Network (GAN) model that specializes in filling in missing parts of images using irregular masks. The model architecture incorporates image-wide global context and a single-step architecture that uses Fourier convolutions, enabling it to achieve state-of-the-art results at a faster speed. For more details on LaMa, visit their website and research paper.
SD 2 inpaint model from Stability AI is used to modify or replace objects in an image. This model allows us to edit the object in the mask area by providing a text prompt. The inpaint model is based on the text-to-image SD model, which can create high-quality images with a simple text prompt. It provides additional arguments such as original and mask images, allowing for quick modification and restoration of existing content. To learn more about Stable Diffusion models on AWS, refer to Create high-quality images with Stable Diffusion models and deploy them cost-efficiently with Amazon SageMaker.

All three models are hosted on SageMaker MMEs, which reduces the operational burden from managing multiple endpoints. In addition to that, using MME eliminates concerns about certain models being underutilized because resources are shared. You can observe the benefit from improved instance saturation, which ultimately leads to cost savings. The following architecture diagram illustrates how all three models are served using SageMaker MMEs with TorchServe.

We have published the code to implement this solution architecture in our GitHub repository. To follow along with the rest of the post, use the notebook file. It is recommended to run this example on a SageMaker notebook instance using the conda_python3 (Python 3.10.10) kernel.

Extend the TorchServe container

The first step is to prepare the model hosting container. SageMaker provides a managed PyTorch Deep Learning Container (DLC) that you can retrieve using the following code snippet:

import sagemaker

# Use SageMaker PyTorch DLC as base image
# (`region` is assumed to be defined earlier in the notebook)
baseimage = sagemaker.image_uris.retrieve(
    framework="pytorch",
    region=region,
    py_version="py310",
    image_scope="inference",
    version="2.0.0",
    instance_type="ml.g5.2xlarge",
)
print(baseimage)

Because the models require resources and additional packages that are not in the base PyTorch DLC, you need to build a Docker image. This image is then uploaded to Amazon Elastic Container Registry (Amazon ECR) so that SageMaker can access it directly. The custom installed libraries are listed in the Dockerfile:

ARG BASE_IMAGE

FROM $BASE_IMAGE

#Install any additional libraries
RUN pip install segment-anything-py==1.0
RUN pip install opencv-python-headless==4.7.0.68
RUN pip install matplotlib==3.6.3
RUN pip install diffusers
RUN pip install tqdm
RUN pip install easydict
RUN pip install scikit-image
RUN pip install xformers
RUN pip install tensorflow
RUN pip install joblib
RUN pip install matplotlib
RUN pip install albumentations==0.5.2
RUN pip install hydra-core==1.1.0
RUN pip install pytorch-lightning
RUN pip install tabulate
RUN pip install kornia==0.5.0
RUN pip install webdataset
RUN pip install omegaconf==2.1.2
RUN pip install transformers==4.28.1
RUN pip install accelerate
RUN pip install ftfy

Run the shell command file to build the custom image locally and push it to Amazon ECR:

%%capture build_output

reponame = "torchserve-mme-demo"
versiontag = "genai-0.1"

# Build our own docker image
!cd workspace/docker && ./build_and_push.sh {reponame} {versiontag} {baseimage} {region} {account}

Prepare the model artifacts

The main difference for the new MMEs with TorchServe support is how you prepare your model artifacts. The code repo provides a skeleton folder for each model (models folder) to house the required files for TorchServe. We follow the same four-step process to prepare each model .tar file. The following code is an example of the skeleton folder for the SD model:

workspace
|--sd
   |-- custom_handler.py
   |-- model-config.yaml

The first step is to download the pre-trained model checkpoints in the models folder:

import diffusers
import torch
import transformers

pipeline = diffusers.StableDiffusionInpaintPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-inpainting", torch_dtype=torch.float16
)

sd_dir = "workspace/sd/model"
pipeline.save_pretrained(sd_dir)

The next step is to define a custom_handler.py file. This is required to define the behavior of the model when it receives a request, such as loading the model, preprocessing the input, and postprocessing the output. The handle method is the main entry point for requests, and it accepts a request object and returns a response object. It loads the pre-trained model checkpoints and applies the preprocess and postprocess methods to the input and output data. The following code snippet illustrates a simple structure of the custom_handler.py file. For more detail, refer to the TorchServe handler API.

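from ts.context import Context


# These methods belong to the custom handler class; bodies are elided here.
def initialize(self, ctx: Context):
    ...  # load the pre-trained model checkpoints

def preprocess(self, data):
    ...  # decode and prepare the request payload

def inference(self, data):
    ...  # run the model on the preprocessed input

def handle(self, data, context):
    # Main entry point: preprocess the request, then run inference
    requests = self.preprocess(data)
    responses = self.inference(requests)

    return responses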

The last required file for TorchServe is model-config.yaml. The file defines the configuration of the model server, such as number of workers and batch size. The configuration is at a per-model level, and an example config file is shown in the following code. For a complete list of parameters, refer to the GitHub repo.

minWorkers: 1
maxWorkers: 1
batchSize: 1
maxBatchDelay: 200
responseTimeout: 300

The final step is to package all the model artifacts into a single .tar.gz file using the torch-model-archiver module:

!torch-model-archiver --model-name sd --version 1.0 --handler workspace/sd/custom_handler.py --extra-files workspace/sd/model --config-file workspace/sd/model-config.yaml --archive-format no-archive
!cd sd && tar cvzf sd.tar.gz .

Create the multi-model endpoint

The steps to create a SageMaker MME are the same as before. In this particular example, you spin up an endpoint using the SageMaker SDK. Start by defining an Amazon Simple Storage Service (Amazon S3) location and the hosting container. This S3 location is where SageMaker will dynamically load the models based on invocation patterns. The hosting container is the custom container you built and pushed to Amazon ECR in the earlier step. See the following code:

# This is where our MME will read models from on S3.
multi_model_s3uri = output_path

Next, define a MultiDataModel that captures all the attributes, such as model location, hosting container, and access permissions:

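from datetime import datetime

from sagemaker.model import Model
from sagemaker.multidatamodel import MultiDataModel

# `container`, `role`, and `smsess` are assumed to be defined earlier in the notebook
print(multi_model_s3uri)
model = Model(
    model_data=f"{multi_model_s3uri}/sam.tar.gz",
    image_uri=container,
    role=role,
    sagemaker_session=smsess,
    env={"TF_ENABLE_ONEDNN_OPTS": "0"},
)

mme = MultiDataModel(
    name="torchserve-mme-genai-" + datetime.now().strftime("%Y-%m-%d-%H-%M-%S"),
    model_data_prefix=multi_model_s3uri,
    model=model,
    sagemaker_session=smsess,
)
print(mme)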

The deploy() function creates an endpoint configuration and hosts the endpoint:

mme.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.2xlarge",
    serializer=sagemaker.serializers.JSONSerializer(),
    deserializer=sagemaker.deserializers.JSONDeserializer(),
)

In the example we provided, we also show how you can list models and dynamically add new models using the SDK. The add_model() function copies your local model .tar files into the MME S3 location:

# Only sam.tar.gz visible!
list(mme.list_models())

models = ["sd/sd.tar.gz", "lama/lama.tar.gz"]
for model in models:
    mme.add_model(model_data_source=model)

Invoke the models

Now that we have all three models hosted on an MME, we can invoke each model in sequence to build our language-assisted editing features. To invoke each model, provide a target_model parameter in the predictor.predict() function. The model name is just the name of the model .tar file we uploaded. The following is an example code snippet for the SAM model that takes in a pixel coordinate, a point label, and dilate kernel size, and generates a segmentation mask of the object in the pixel location:

import base64
import io
import json

from PIL import Image

# `predictor` and `encode_image` are assumed to be defined earlier in the notebook
img_file = "workspace/test_data/sample1.png"
img_bytes = None

with Image.open(img_file) as f:
    img_bytes = encode_image(f)

gen_args = json.dumps(dict(point_coords=[750, 500], point_labels=1, dilate_kernel_size=15))

payload = json.dumps({"image": img_bytes, "gen_args": gen_args}).encode("utf-8")

# target_model selects which model archive on the MME handles the request
response = predictor.predict(data=payload, target_model="/sam.tar.gz")
encoded_masks_string = json.loads(response.decode("utf-8"))["generated_image"]
base64_bytes_masks = base64.b64decode(encoded_masks_string)

with Image.open(io.BytesIO(base64_bytes_masks)) as f:
    generated_image_rgb = f.convert("RGB")
    generated_image_rgb.show()
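The request and response handling in the preceding snippet can be factored into two small helpers, sketched below with only the standard library. The helper names build_payload and extract_generated_image are hypothetical, and the sketch assumes that encode_image produces a base64 string of the image bytes; the notebook's actual helper may differ.

```python
import base64
import json


def build_payload(image_bytes: bytes, gen_args: dict) -> bytes:
    """Build a request body with a base64 image and JSON-encoded generation arguments."""
    return json.dumps(
        {
            "image": base64.b64encode(image_bytes).decode("utf-8"),
            "gen_args": json.dumps(gen_args),
        }
    ).encode("utf-8")


def extract_generated_image(response: bytes) -> bytes:
    """Decode the base64 image stored under the "generated_image" key of a response."""
    return base64.b64decode(json.loads(response.decode("utf-8"))["generated_image"])


# Round trip with placeholder bytes standing in for real image data.
payload = build_payload(
    b"raw-image-bytes",
    dict(point_coords=[750, 500], point_labels=1, dilate_kernel_size=15),
)
fake_response = json.dumps(
    {"generated_image": base64.b64encode(b"mask-bytes").decode("utf-8")}
).encode("utf-8")

print(json.loads(payload)["gen_args"])
print(extract_generated_image(fake_response))
```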

To remove an unwanted object from an image, take the segmentation mask generated from SAM and feed that into the LaMa model with the original image. The following images show an example.


Sample image	Segmentation mask from SAM	Erase the dog using LaMa

To modify or replace any object in an image with a text prompt, take the segmentation mask from SAM and feed it into the SD model with the original image and text prompt, as shown in the following example.

Sample image	Segmentation mask from SAM	Replace using SD model with text prompt “a hamster on a bench”

Cost savings

The benefits of SageMaker MMEs increase based on the scale of model consolidation. The following table shows the GPU memory usage of the three models in this post. They are deployed on one g5.2xlarge instance by using one SageMaker MME.

Model	GPU Memory (MiB)
Segment Anything Model	3,362
Stable Diffusion Inpaint	3,910
LaMa	852

You can see cost savings when hosting the three models with one endpoint, and for use cases with hundreds or thousands of models, the savings are much greater.

For example, consider 100 Stable Diffusion models. Each of the models on its own could be served by an ml.g5.2xlarge endpoint (4 GiB memory), costing $1.52 per instance hour in the US East (N. Virginia) Region. Serving all 100 models on their own endpoints, with two instances per endpoint, would cost $218,880 per month. With a SageMaker MME, a single endpoint using ml.g5.2xlarge instances can host four models simultaneously, so 25 endpoints are enough. This reduces production inference costs by 75%, to $54,720 per month. The following table summarizes the differences between single-model and multi-model endpoints for this example. Given an endpoint configuration with sufficient memory for your target models, steady state invocation latency after all models have been loaded will be similar to that of a single-model endpoint.

.	Single-model endpoint	Multi-model endpoint
Total endpoint price per month	$218,880	$54,720
Endpoint instance type	ml.g5.2xlarge	ml.g5.2xlarge
CPU Memory capacity (GiB)	32	32
GPU Memory capacity (GiB)	24	24
Endpoint price per hour	$1.52	$1.52
Number of instances per endpoint	2	2
Endpoints needed for 100 models	100	25
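The monthly totals in this example can be reproduced with a few lines of arithmetic. The sketch below assumes a 30-day month (720 hours), which is the assumption that matches the figures in the table; the other inputs are taken directly from the table.

```python
import math

HOURS_PER_MONTH = 24 * 30  # assumption: a 30-day month
PRICE_PER_INSTANCE_HOUR = 1.52  # ml.g5.2xlarge, US East (N. Virginia)
INSTANCES_PER_ENDPOINT = 2
NUM_MODELS = 100
MODELS_PER_MME_ENDPOINT = 4


def monthly_cost(num_endpoints: int) -> float:
    """Monthly cost of running the given number of endpoints around the clock."""
    return num_endpoints * INSTANCES_PER_ENDPOINT * PRICE_PER_INSTANCE_HOUR * HOURS_PER_MONTH


# One endpoint per model versus four models per multi-model endpoint.
single_model_cost = monthly_cost(NUM_MODELS)
mme_endpoints = math.ceil(NUM_MODELS / MODELS_PER_MME_ENDPOINT)
mme_cost = monthly_cost(mme_endpoints)
savings = 1 - mme_cost / single_model_cost

print(f"Single-model endpoints: ${single_model_cost:,.0f} per month")
print(f"Multi-model endpoints ({mme_endpoints}): ${mme_cost:,.0f} per month")
print(f"Savings: {savings:.0%}")
```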

Clean up

After you are done, please follow the instructions in the cleanup section of the notebook to delete the resources provisioned in this post to avoid unnecessary charges. Refer to Amazon SageMaker Pricing for details on the cost of the inference instances.

Conclusion

This post demonstrates the language-assisted editing capabilities made possible through the use of generative AI models hosted on SageMaker MMEs with TorchServe. The example we shared illustrates how we can use resource sharing and simplified model management with SageMaker MMEs while still utilizing TorchServe as our model serving stack. We utilized three deep learning foundation models: SAM, SD 2 Inpainting, and LaMa. These models enable us to build powerful capabilities, such as erasing any unwanted object from an image and modifying or replacing any object in an image by supplying a text instruction. These features can help artists and content creators work more efficiently and meet their content demands by automating repetitive tasks, optimizing campaigns, and providing a hyper-personalized experience. We invite you to explore the example provided in this post and build your own UI experience using TorchServe on a SageMaker MME.

To get started, see Supported algorithms, frameworks, and instances for multi-model endpoints using GPU backed instances.

About the authors

James Wu is a Senior AI/ML Specialist Solution Architect at AWS, helping customers design and build AI/ML solutions. James’s work covers a wide range of ML use cases, with a primary interest in computer vision, deep learning, and scaling ML across the enterprise. Prior to joining AWS, James was an architect, developer, and technology leader for over 10 years, including 6 years in engineering and 4 years in marketing & advertising industries.

Li Ning is a senior software engineer at AWS with a specialization in building large-scale AI solutions. As a tech lead for TorchServe, a project jointly developed by AWS and Meta, her passion lies in leveraging PyTorch and AWS SageMaker to help customers embrace AI for the greater good. Outside of her professional endeavors, Li enjoys swimming, traveling, following the latest advancements in technology, and spending quality time with her family.

Ankith Gunapal is an AI Partner Engineer at Meta (PyTorch). He is passionate about model optimization and model serving, with experience ranging from RTL verification, embedded software, computer vision, to PyTorch. He holds a Master’s in Data Science and a Master’s in Telecommunications. Outside of work, Ankith is also an electronic dance music producer.

Saurabh Trikande is a Senior Product Manager for Amazon SageMaker Inference. He is passionate about working with customers and is motivated by the goal of democratizing machine learning. He focuses on core challenges related to deploying complex ML applications, multi-tenant ML models, cost optimizations, and making deployment of deep learning models more accessible. In his spare time, Saurabh enjoys hiking, learning about innovative technologies, following TechCrunch and spending time with his family.

Subhash Talluri is a Lead AI/ML solutions architect of the Telecom Industry business unit at Amazon Web Services. He’s been leading development of innovative AI/ML solutions for Telecom customers and partners worldwide. He brings interdisciplinary expertise in engineering and computer science to help build scalable, secure, and compliant AI/ML solutions via cloud-optimized architectures on AWS.

Build a generative AI-based content moderation solution on Amazon SageMaker JumpStart

September 5, 2023

by Gordon Wang Amazon AWS

Content moderation plays a pivotal role in maintaining online safety and upholding the values and standards of websites and social media platforms. Its significance is underscored by the protection it provides users from exposure to inappropriate content, safeguarding their well-being in digital spaces. For example, in the advertising industry, content moderation serves to shield brands from unfavorable associations, thereby contributing to brand elevation and revenue growth. Advertisers prioritize their brand’s alignment with appropriate content to uphold their reputation and avert negative publicity. Content moderation also assumes critical importance in the finance and healthcare sectors, where it serves multiple functions. It plays an important role in identifying and safeguarding sensitive personally identifiable information (PII) and protected health information (PHI). By adhering to internal standards and practices and complying with external regulations, content moderation enhances digital security for users. This way, it prevents the inadvertent sharing of confidential data on public platforms, ensuring the preservation of user privacy and data security.

In this post, we introduce a novel method to perform content moderation on image data with multi-modal pre-training and a large language model (LLM). With multi-modal pre-training, we can directly query the image content based on a set of questions of interest and the model will be able to answer these questions. This enables users to chat with the image to confirm if it contains any inappropriate content that violates the organization’s policies. We use the powerful generating capability of LLMs to generate the final decision including safe/unsafe labels and category type. In addition, by designing a prompt, we can make an LLM generate the defined output format, such as JSON format. The designed prompt template allows the LLM to determine if the image violates the moderation policy, identify the category of violation, explain why, and provide the output in a structured JSON format.
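The idea of steering an LLM toward a structured JSON verdict can be sketched as follows. Everything in this block is illustrative: the prompt wording, the field names flag, category, and reason, and both helper functions are assumptions made for this example, not the prompt template used in the solution.

```python
import json

# Hypothetical prompt template; doubled braces are literal braces for str.format.
PROMPT_TEMPLATE = (
    "Moderation policy:\n{policy}\n\n"
    "Answers extracted from the image:\n{qa}\n\n"
    "Decide whether the image violates the policy. Respond only with JSON of the form "
    '{{"flag": "safe" or "unsafe", "category": "<policy category>", "reason": "<short explanation>"}}.'
)


def build_prompt(policy: str, image_answers: dict) -> str:
    """Combine the policy text and the question-answer pairs into one prompt."""
    qa = "\n".join(f"Q: {q}\nA: {a}" for q, a in image_answers.items())
    return PROMPT_TEMPLATE.format(policy=policy, qa=qa)


def parse_decision(llm_output: str) -> dict:
    """Extract the JSON object from the model output, tolerating surrounding text."""
    start = llm_output.find("{")
    end = llm_output.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in model output")
    return json.loads(llm_output[start : end + 1])


prompt = build_prompt(
    "No depictions of weapons.",
    {"Does the image contain a weapon?": "yes, a knife"},
)
example_output = 'Here is the result: {"flag": "unsafe", "category": "weapons", "reason": "a knife is visible"}'
decision = parse_decision(example_output)
print(decision["flag"], decision["category"])
```

Parsing defensively like this is a design choice: models sometimes wrap the JSON in extra text, so the sketch extracts the outermost braces before decoding.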

We use BLIP-2 as the multi-modal pre-training method. BLIP-2 is one of the state-of-the-art models in multi-modal pre-training and outperforms most of the existing methods in visual question answering, image captioning, and image text retrieval. For our LLM, we use Llama 2, the next generation open-source LLM, which outperforms existing open-source language models on many benchmarks, including reasoning, coding, proficiency, and knowledge tests. The following diagram illustrates the solution components.

Challenges in content moderation

Traditional content moderation methods, such as human-based moderation, can’t keep up with the growing volume of user-generated content (UGC). As the volume of UGC increases, human moderators can become overwhelmed and struggle to moderate content effectively. This results in a poor user experience, high moderation costs, and brand risk. Human-based moderation is also prone to errors, which can result in inconsistent moderation and biased decisions. To address these challenges, content moderation powered by machine learning (ML) has emerged as a solution. ML algorithms can analyze large volumes of UGC and identify content that violates the organization’s policies. ML models can be trained to recognize patterns and identify problematic content, such as hate speech, spam, and inappropriate material. According to the study Protect your users, brand, and budget with AI-powered content moderation, ML-powered content moderation can help organizations reclaim up to 95% of the time their teams spend moderating content manually. This allows organizations to focus their resources on more strategic tasks, such as community building and content creation. ML-powered content moderation can also reduce moderation costs because it’s more efficient than human-based moderation.

Despite the advantages of ML-powered content moderation, it still has room for improvement. The effectiveness of ML algorithms heavily relies on the quality of the data they are trained on. When models are trained using biased or incomplete data, they can make erroneous moderation decisions, exposing organizations to brand risks and potential legal liabilities. The adoption of ML-based approaches for content moderation brings several challenges that necessitate careful consideration. These challenges include:

Acquiring labeled data – This can be a costly process, especially for complex content moderation tasks that require training labelers. This cost can make it challenging to gather large enough datasets to train a supervised ML model with ease. Additionally, the accuracy of the model heavily relies on the quality of the training data, and biased or incomplete data can result in inaccurate moderation decisions, leading to brand risk and legal liabilities.
Model generalization – This is critical to adopting ML-based approaches. A model trained on one dataset may not generalize well to another dataset, particularly if the datasets have different distributions. Therefore, it is essential to ensure that the model is trained on a diverse and representative dataset to ensure it generalizes well to new data.
Operational efficiency – This is another challenge when using conventional ML-based approaches for content moderation. Constantly adding new labels and retraining the model when new classes are added can be time-consuming and costly. Additionally, it is essential to ensure that the model is regularly updated to keep up with changes in the content being moderated.
Explainability – End users may perceive the platform as biased or unjust if content gets flagged or removed without justification, resulting in a poor user experience. Similarly, the absence of clear explanations can render the content moderation process inefficient, time-consuming, and costly for moderators.
Adversarial nature – The adversarial nature of image-based content moderation presents a unique challenge to conventional ML-based approaches. Bad actors can attempt to evade content moderation mechanisms by altering the content in various ways, such as using synonyms of images or embedding their actual content within a larger body of non-offending content. This requires constant monitoring and updating of the model to detect and respond to such adversarial tactics.

Multi-modal reasoning with BLIP-2

Multi-modality ML models refer to models that can handle and integrate data from multiple sources or modalities, such as images, text, audio, video, and other forms of structured or unstructured data. One popular class of multi-modality models is vision-language models such as BLIP-2, which combine computer vision and natural language processing (NLP) to understand and generate both visual and textual information. These models enable computers to interpret the meaning of images and text in a way that mimics human understanding. Vision-language models can tackle a variety of tasks, including image captioning, image text retrieval, visual question answering, and more. For example, an image captioning model can generate a natural language description of an image, and an image text retrieval model can search for images based on a text query. Visual question answering models can respond to natural language questions about images, and multi-modal chatbots can use visual and textual inputs to generate responses. In terms of content moderation, you can use this capability to query an image against a list of questions.

BLIP-2 contains three parts. The first component is a frozen image encoder, ViT-L/14 from CLIP, which takes image data as input. The second component is a frozen LLM, FlanT5, which outputs text. The third component is a trainable module called Q-Former, a lightweight transformer that connects the frozen image encoder with the frozen LLM. Q-Former employs learnable query vectors to extract visual features from the frozen image encoder and feeds the most useful visual feature to the LLM to output the desired text.

The pre-training process involves two stages. In the first stage, vision-language representation learning is performed to teach Q-Former to learn the most relevant visual representation for the text. In the second stage, vision-to-language generative learning is performed by connecting the output of Q-Former to a frozen LLM and training Q-Former to output visual representations that can be interpreted by the LLM.

BLIP-2 achieves state-of-the-art performance on various vision-language tasks despite having significantly fewer trainable parameters than existing methods. The model also demonstrates emerging capabilities of zero-shot image-to-text generation that can follow natural language instructions. The following illustration is modified from the original research paper.

Solution overview

The following diagram illustrates the solution architecture.

In the following sections, we demonstrate how to deploy BLIP-2 to an Amazon SageMaker endpoint, and use BLIP-2 and an LLM for content moderation.

Prerequisites

You need an AWS account with an AWS Identity and Access Management (IAM) role with permissions to manage resources created as part of the solution. For details, refer to Create a standalone AWS account.

If this is your first time working with Amazon SageMaker Studio, you first need to create a SageMaker domain. Additionally, you may need to request a service quota increase for the corresponding SageMaker hosting instances. For the BLIP-2 model, we use an ml.g5.2xlarge SageMaker hosting instance. For the Llama 2 13B model, we use an ml.g5.12xlarge SageMaker hosting instance.

Deploy BLIP-2 to a SageMaker endpoint

You can host an LLM on SageMaker using the Large Model Inference (LMI) container that is optimized for hosting large models using DJLServing. DJLServing is a high-performance universal model serving solution powered by the Deep Java Library (DJL) that is programming language agnostic. To learn more about DJL and DJLServing, refer to Deploy large models on Amazon SageMaker using DJLServing and DeepSpeed model parallel inference. With the help of the SageMaker LMI container, the BLIP-2 model can be easily implemented with the Hugging Face library and hosted on SageMaker. You can run blip2-sagemaker.ipynb for this step.

To prepare the Docker image and model file, you need to retrieve the Docker image of DJLServing, package the inference script and configuration files as a model.tar.gz file, and upload it to an Amazon Simple Storage Service (Amazon S3) bucket. You can refer to the inference script and configuration file for more details.

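from sagemaker import image_uris

# sess, bucket, and s3_code_prefix are assumed to be defined earlier in the notebook
# Retrieve the DJLServing (LMI) container image for the current Region
inference_image_uri = image_uris.retrieve(
    framework="djl-deepspeed", region=sess.boto_session.region_name, version="0.22.1"
)
# Package the inference script and configuration files, then upload the archive to Amazon S3
! tar czvf model.tar.gz blip2/
s3_code_artifact = sess.upload_data("model.tar.gz", bucket, s3_code_prefix)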

When the Docker image and inference related files are ready, you create the model, the configuration for the endpoint, and the endpoint:

from sagemaker.model import Model
from sagemaker.utils import name_from_base

blip_model_version = "blip2-flan-t5-xl"
model_name = name_from_base(blip_model_version)
endpoint_name = model_name  # reused later when invoking the endpoint
model = Model(
    image_uri=inference_image_uri,
    model_data=s3_code_artifact,
    role=role,
    name=model_name,
)
model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.2xlarge",
    endpoint_name=endpoint_name
)

When the endpoint status changes to InService, you can invoke the endpoint for image captioning and the instructed zero-shot vision-to-language generation task. For the image captioning task, you only need to pass an image to the endpoint:

import base64
import json

import boto3
from PIL import Image

smr_client = boto3.client("sagemaker-runtime")

def encode_image(img_file):
    # Read the image file and return its contents as a base64-encoded string
    with open(img_file, "rb") as image_file:
        img_str = base64.b64encode(image_file.read())
        base64_string = img_str.decode("latin1")
    return base64_string

def run_inference(endpoint_name, inputs):
    # Invoke the endpoint, print the decoded response, and return it to the caller
    response = smr_client.invoke_endpoint(
        EndpointName=endpoint_name, Body=json.dumps(inputs)
    )
    result = response["Body"].read().decode("utf-8")
    print(result)
    return result

test_image = "carcrash-ai.jpeg"
base64_string = encode_image(test_image)
inputs = {"image": base64_string}
run_inference(endpoint_name, inputs)

For the instructed zero-shot vision-to-language generation task, in addition to the input image, you need to define the question as a prompt:

base64_string = encode_image(test_image)
inputs = {"prompt": "Question: what happened in this photo? Answer:", "image": base64_string}
run_inference(endpoint_name, inputs)
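
The two request shapes used above differ only in whether a prompt field is present. The following is a minimal sketch of that payload construction that runs without an endpoint; the helper names encode_image_bytes and build_payload and the placeholder image bytes are illustrative assumptions, not part of the original notebook.

```python
import base64
import json
from typing import Optional


def encode_image_bytes(image_bytes: bytes) -> str:
    """Return the base64 text form of raw image bytes."""
    return base64.b64encode(image_bytes).decode("ascii")


def build_payload(image_b64: str, question: Optional[str] = None) -> str:
    """Build the JSON body: image only for captioning, image plus prompt for Q&A."""
    body = {"image": image_b64}
    if question is not None:
        # Same prompt layout as the example above
        body["prompt"] = f"Question: {question} Answer:"
    return json.dumps(body)


# Placeholder bytes standing in for a real image file (assumption for illustration)
fake_image = b"\x89PNG placeholder bytes"
image_b64 = encode_image_bytes(fake_image)

caption_body = build_payload(image_b64)
vqa_body = build_payload(image_b64, "what happened in this photo?")
```

Keeping payload construction separate from the network call makes it easy to check the request body before sending it to the endpoint.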

Use BLIP-2 and LLM for content moderation

In this stage, you can make queries on the given image and retrieve hidden information. With the LLM, you organize the queries and the retrieved information to generate a result in JSON format. You can roughly split this task into the following two sub-tasks:

Extract information from the image with the BLIP-2 model.
Generate the final result and explanation with the LLM.

Extract information from the image with the BLIP-2 model

To retrieve enough useful hidden information from the given image, you need to define queries. Because each query invokes the endpoint once, many queries lead to longer processing time. Therefore, we suggest writing high-quality queries that cover all policies without duplicating one another. In our sample code, we define the queries as follows:

check_list = [
    "Does this photo contain complete naked person?",
    "Does this photo contain topless person?",
    "Does this photo contain weapon?",
    "Does this photo contain contact information?",
    "Does this photo contain a smoker?",
    "Does this photo contain blood?",
    "Are there persons fighting in this photo?",
    "Does this photo contain harassment words?"
]

With the preceding queries, invoke the endpoint of BLIP-2 to retrieve the information with the following code:

test_image = "./surf_swimwear.png"
raw_image = Image.open(test_image).convert('RGB')

base64_string = encode_image(test_image)
conversations = ""
for question in check_list:
    # Each entry in check_list already ends with a question mark
    inputs = {"prompt": f"Question: {question} Answer:", "image": base64_string}
    response = run_inference(endpoint_name, inputs)
    conversations += f"""
Question: {question}
Answer: {response}.
"""

In addition to the information retrieved by queries, you can get information with the image captioning task by invoking the endpoint without the prompt field in the payload:

inputs = {"image": base64_string}
response = smr_client.invoke_endpoint(
    EndpointName=endpoint_name, Body=json.dumps(inputs)
)
image_caption = response["Body"].read().decode('utf-8')

You can combine the queries and answers with the image caption and use this retrieved information for the downstream task, described in the next section.
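
One way to combine the caption with the question-and-answer pairs is sketched below. It is an illustration only: build_image_information is a hypothetical helper, and fake_ask is a stub standing in for the BLIP-2 endpoint call so the example runs offline.

```python
from typing import Callable, List


def build_image_information(
    questions: List[str],
    ask: Callable[[str], str],
    caption: str,
) -> str:
    """Merge the image caption and one answer per question into a single text block."""
    lines = [f"Caption: {caption}"]
    for question in questions:
        answer = ask(question).strip()
        lines.append(f"Question: {question}")
        lines.append(f"Answer: {answer}")
    return "\n".join(lines)


def fake_ask(question: str) -> str:
    """Stub for the endpoint call (assumption): answers 'yes' only to the topless query."""
    return "yes" if "topless" in question else "no"


image_information = build_image_information(
    ["Does this photo contain topless person?", "Does this photo contain weapon?"],
    fake_ask,
    "a man is standing on the beach with a surfboard",
)
print(image_information)
```

Passing the inference function as an argument keeps the formatting logic testable without a deployed endpoint; in practice, ask would wrap the endpoint invocation.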

Generate the final result and explanation with the LLM

Large language models (LLMs) such as Llama 2 can generate high-quality results with the right prompt template. Using Amazon SageMaker JumpStart, ML practitioners can choose from a broad selection of publicly available foundation models. With just a few clicks in SageMaker Studio, you can now discover and deploy Llama 2.

The final results rely on an LLM with a specific prompt template. The prompt consists of the moderation policy, organized by categories of inappropriate or offensive content; the image information extracted with BLIP-2; a question asking the LLM whether the photo contains unsafe content and, if so, requesting its category and reason; and instructions to output the results in JSON format. The designed prompt template allows the LLM to determine if the image violates the moderation policy, identify the category of violation, explain why, and provide the output in a structured JSON format.

The core source code is as follows:

prompt_template = f"""
The following is our company's content moderation policy. Based on the moderation policy, we gathered image information from the user-uploaded image. Please answer the question in JSON format.
        
###### moderation policy ######
{moderation_policy}
        
###### Image information ######
{conversations}
        
###### Question ######
Based on the preceding moderation policy and QA, tell me if the photo contains unsafe content, also give its category and reason if it's unsafe. Please answer the question with the following format and only put explanation into the reason field:
"""

prompt_template += """
{
    "flag": "xxx",
    "category": "xxx",
    "reason": "the reason is ..."
}
"""

dialog = [
    {"role": "user", "content": prompt_template}
]

You can customize the prompt based on your own use case. Refer to the notebook for more details. When the prompt is ready, you can invoke the LLM endpoint to generate results:

endpoint_name = "jumpstart-dft-meta-textgeneration-llama-2-70b-f"

def query_endpoint(payload):
    client = boto3.client("sagemaker-runtime")
    response = client.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="application/json",
        Body=json.dumps(payload),
        CustomAttributes="accept_eula=true",
    )
    response = response["Body"].read().decode("utf8")
    response = json.loads(response)
    return response
    
payload = {
    "inputs": [dialog], 
    "parameters": {"max_new_tokens": 256, "top_p": 0.9, "temperature": 0.5}
}
result = query_endpoint(payload)[0]

Part of the generated output is as follows:

> Assistant:  {
    "flag": "unsafe",
    "category": "Suggestive",
    "reason": "The photo contains a topless person, which is considered suggestive content."
}

Explanation:
The photo contains a topless person, which violates the moderation policy's rule number 2, which states that suggestive content includes "Female Swimwear Or Underwear, Male Swimwear Or Underwear, Partial Nudity, Barechested Male, Revealing Clothes and Sexual Situations." Therefore, the photo is considered unsafe and falls under the category of Suggestive.

Occasionally, Llama 2 appends an additional explanation after the assistant's answer. You can use the following parsing code to extract the JSON data from the generated results:

answer = result['generation']['content'].split('}')[0]+'}'
json.loads(answer)
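
Splitting on the first closing brace works for flat output like the preceding example, but it would truncate the object if a field value ever contained a brace. As a hedged alternative, the following sketch uses the standard library's JSONDecoder.raw_decode to parse the first complete JSON object and ignore any trailing explanation. The function name extract_first_json_object and the shortened sample text are illustrative assumptions.

```python
import json


def extract_first_json_object(text: str) -> dict:
    """Return the first JSON object found in text, ignoring anything after it."""
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value starting at index `start`
            obj, _end = decoder.raw_decode(text, start)
            if isinstance(obj, dict):
                return obj
        except json.JSONDecodeError:
            pass  # not valid JSON from this brace; try the next one
        start = text.find("{", start + 1)
    raise ValueError("no JSON object found in model output")


generated = """ {
    "flag": "unsafe",
    "category": "Suggestive",
    "reason": "The photo contains a topless person {see rule 2}."
}

Explanation:
The photo contains a topless person, which violates the moderation policy.
"""
parsed = extract_first_json_object(generated)
```

Raising an error when no object is found makes it explicit that the caller should decide how to handle malformed model output, for example by retrying the request.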

Advantages of generative approaches

The preceding sections showed how to implement the core part of model inference. In this section, we cover various aspects of generative approaches, including comparisons with conventional approaches and perspectives.

The following table compares each approach.

Aspect	Generative Approach	Classification Approach
Acquiring labeled data	Pre-trained model on a large number of images, zero-shot inference	Requires data from all types of categories
Model generalization	Pre-trained model with various types of images	Requires a large volume of content moderation related data to improve model generalization
Operational efficiency	Zero-shot capabilities	Requires training the model for recognizing different patterns, and retraining when labels are added
Explainability	Reasoning as the text output, great user experience	Hard to achieve reasoning, hard to explain and interpret
Adversarial nature	Robust	High frequency retraining

Potential use cases of multi-modal reasoning beyond content moderation

The BLIP-2 models can be applied to fit multiple purposes with or without fine-tuning, which includes the following:

Image captioning – This asks the model to generate a text description for the image’s visual content. As illustrated in the following example image (left), we can have “a man is standing on the beach with a surfboard” as the image description.
Visual question answering – As the example image in the middle shows, we can ask “Is it commercial related content” and we have “yes” as the answer. In addition, BLIP-2 supports multi-round conversation, so we can follow up with the question “Why do you think so?” Based on the visual cue and LLM capabilities, BLIP-2 outputs “it’s a sign for amazon.”
Image text retrieval – Given the question as “Text on the image”, we can extract the image text “it’s monday but keep smiling” as demonstrated in the image on the right.

The following images show examples to demonstrate the zero-shot image-to-text capability of visual knowledge reasoning.

As we can see from various examples above, multi-modality models open up new opportunities for solving complex problems that traditional single-modality models would struggle to address.

Clean up

To avoid incurring future charges, delete the resources created as part of this post. You can do this by following the instructions in the notebook cleanup section, or delete the created endpoints via the SageMaker console and resources stored in the S3 bucket.

Conclusion

In this post, we discussed the importance of content moderation in the digital world and highlighted its challenges. We proposed a new method to help improve content moderation with image data and perform question answering against the images to automatically extract useful information. We also provided further discussion on the advantages of using a generative AI-based approach compared to the traditional classification-based approach. Lastly, we illustrated the potential use cases of visual-language models beyond content moderation.

We encourage you to learn more by exploring SageMaker and building a solution using the multi-modality solution provided in this post and a dataset relevant to your business.

About the Authors

Gordon Wang is a Senior AI/ML Specialist TAM at AWS. He supports strategic customers with AI/ML best practices across many industries. He is passionate about computer vision, NLP, generative AI, and MLOps. In his spare time, he loves running and hiking.

Yanwei Cui, PhD, is a Senior Machine Learning Specialist Solutions Architect at AWS. He started machine learning research at IRISA (Research Institute of Computer Science and Random Systems), and has several years of experience building AI-powered industrial applications in computer vision, natural language processing, and online user behavior prediction. At AWS, he shares his domain expertise and helps customers unlock business potentials and drive actionable outcomes with machine learning at scale. Outside of work, he enjoys reading and traveling.

Melanie Li, PhD, is a Senior AI/ML Specialist TAM at AWS based in Sydney, Australia. She helps enterprise customers build solutions using state-of-the-art AI/ML tools on AWS and provides guidance on architecting and implementing ML solutions with best practices. In her spare time, she loves to explore nature and spend time with family and friends.

How Carrier predicts HVAC faults using AWS Glue and Amazon SageMaker

September 5, 2023

by Ravi Patankar Amazon AWS

In their own words, “In 1902, Willis Carrier solved one of mankind’s most elusive challenges of controlling the indoor environment through modern air conditioning. Today, Carrier products create comfortable environments, safeguard the global food supply, and enable safe transport of vital medical supplies under exacting conditions.”

At Carrier, the foundation of our success is making products our customers can trust to keep them comfortable and safe year-round. High reliability and low equipment downtime are increasingly important as extreme temperatures become more common due to climate change. We have historically relied on threshold-based systems that alert us to abnormal equipment behavior, using parameters defined by our engineering team. Although such systems are effective, they are intended to identify and diagnose equipment issues rather than predict them. Predicting faults before they occur allows our HVAC dealers to proactively address issues and improve the customer experience.

In order to improve our equipment reliability, we partnered with the Amazon Machine Learning Solutions Lab to develop a custom machine learning (ML) model capable of predicting equipment issues prior to failure. Our teams developed a framework for processing over 50 TB of historical sensor data and predicting faults with 91% precision. We can now notify dealers of impending equipment failure, so that they can schedule inspections and minimize unit downtime. The solution framework is scalable as more equipment is installed and can be reused for a variety of downstream modeling tasks.

In this post, we show how the Carrier and AWS teams applied ML to predict faults across large fleets of equipment using a single model. We first highlight how we use AWS Glue for highly parallel data processing. We then discuss how Amazon SageMaker helps us with feature engineering and building a scalable supervised deep learning model.

Overview of use case, goals, and risks

The main goal of this project is to reduce downtime by predicting impending equipment failures and notifying dealers. This allows dealers to schedule maintenance proactively and provide exceptional customer service. We faced three primary challenges when working on this solution:

Data scalability – Data processing and feature extraction need to scale across a large and growing volume of historical sensor data
Model scalability – The modeling approach needs to be capable of scaling across over 10,000 units
Model precision – Low false positive rates are needed to avoid unnecessary maintenance inspections

Scalability, both from a data and modeling perspective, is a key requirement for this solution. We have over 50 TB of historical equipment data and expect this data to grow quickly as more HVAC units are connected to the cloud. Data processing and model inference need to scale as our data grows. In order for our modeling approach to scale across over 10,000 units, we need a model that can learn from a fleet of equipment rather than relying on anomalous readings for a single unit. This will allow for generalization across units and reduce the cost of inference by hosting a single model.

The other concern for this use case is triggering false alarms. This means that a dealer or technician will go on-site to inspect the customer’s equipment and find everything to be operating appropriately. The solution requires a high precision model to ensure that when a dealer is alerted, the equipment is likely to fail. This helps earn the trust of dealers, technicians, and homeowners alike, and reduces the costs associated with unnecessary on-site inspections.

We partnered with the AI/ML experts at the Amazon ML Solutions Lab for a 14-week development effort. In the end, our solution includes two primary components. The first is a data processing module built with AWS Glue that summarizes equipment behavior and reduces the size of our training data for efficient downstream processing. The second is a model training interface managed through SageMaker, which allows us to train, tune, and evaluate our model before it is deployed to a production endpoint.

Data processing

Each HVAC unit we install generates data from 90 different sensors with readings for RPMs, temperature, and pressures throughout the system. This amounts to roughly 8 million data points generated per unit per day, with tens of thousands of units installed. As more HVAC systems are connected to the cloud, we anticipate the volume of data to grow quickly, making it critical for us to manage its size and complexity for use in downstream tasks. The length of the sensor data history also presents a modeling challenge. A unit may start displaying signs of impending failure months before a fault is actually triggered. This creates a significant lag between the predictive signal and the actual failure. A method for compressing the length of the input data becomes critical for ML modeling.

To address the size and complexity of the sensor data, we compress it into cycle features as shown in Figure 1. This dramatically reduces the size of data while capturing features that characterize the equipment’s behavior.

Figure 1: Sample of HVAC sensor data

AWS Glue is a serverless data integration service for processing large quantities of data at scale. AWS Glue allowed us to easily run parallel data preprocessing and feature extraction. We used AWS Glue to detect cycles and summarize unit behavior using key features identified by our engineering team. This dramatically reduced the size of our dataset from over 8 million data points per day per unit down to roughly 1,200. Crucially, this approach preserves predictive information about unit behavior with a much smaller data footprint.
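
To make the idea of cycle features concrete, the following is a minimal single-machine sketch of collapsing raw readings into one summary row per cycle. It is not Carrier's AWS Glue job: the definition of a cycle as a contiguous run of readings with a running flag set, the field names (t, running, temp), and the chosen summary statistics are all assumptions made for illustration.

```python
from itertools import groupby
from statistics import mean


def summarize_cycles(readings):
    """Collapse contiguous 'running' readings into one summary row per cycle."""
    cycles = []
    for running, group in groupby(readings, key=lambda r: r["running"]):
        if not running:
            continue  # idle stretch between cycles
        rows = list(group)
        temps = [r["temp"] for r in rows]
        cycles.append({
            "start": rows[0]["t"],
            "duration": rows[-1]["t"] - rows[0]["t"],
            "mean_temp": mean(temps),
            "max_temp": max(temps),
        })
    return cycles


# Hypothetical readings: timestamp in seconds, on/off flag, one sensor value
readings = [
    {"t": 0, "running": False, "temp": 20.0},
    {"t": 1, "running": True, "temp": 30.0},
    {"t": 2, "running": True, "temp": 34.0},
    {"t": 3, "running": False, "temp": 25.0},
    {"t": 4, "running": True, "temp": 31.0},
]
cycles = summarize_cycles(readings)
```

Five raw readings become two summary rows here; the same kind of aggregation, applied per unit and in parallel, is what lets the dataset shrink while retaining a description of each cycle.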

The output of the AWS Glue job is a summary of unit behavior for each cycle. We then use an Amazon SageMaker Processing job to calculate features across cycles and label our data. We formulate the ML problem as a binary classification task with a goal of predicting equipment faults in the next 60 days. This allows our dealer network to address potential equipment failures in a timely manner. It’s important to note that not all units fail within 60 days. A unit experiencing slow performance degradation could take more time to fail. We address this during the model evaluation step. We focused our modeling on summertime because those months are when most HVAC systems in the US are in consistent operation and under more extreme conditions.
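
The binary labeling rule can be sketched as follows: a cycle is labeled positive if a fault occurs within the 60 days after it. The function name label_cycle and the dates are hypothetical, and treating the window as exclusive of the cycle date and inclusive of day 60 is an assumption.

```python
from datetime import date, timedelta

HORIZON = timedelta(days=60)  # prediction window from the text


def label_cycle(cycle_date, fault_dates):
    """Return 1 if any fault falls within HORIZON after cycle_date, else 0."""
    return int(any(timedelta(0) < f - cycle_date <= HORIZON for f in fault_dates))


# Hypothetical unit with a single fault on 2022-08-15
faults = [date(2022, 8, 15)]
cycle_dates = [date(2022, 6, 1), date(2022, 7, 1), date(2022, 8, 20)]
labels = [label_cycle(d, faults) for d in cycle_dates]
```

The first cycle is 75 days before the fault and is labeled negative, which is exactly the situation that motivates the looser 90-day view in the evaluation section.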

Modeling

Transformer architectures have become the state-of-the-art approach for handling temporal data. They can use long sequences of historical data at each time step without suffering from vanishing gradients. The input to our model at a given point in time is composed of the features for the previous 128 equipment cycles, which is roughly one week of unit operation. This is processed by a three-layer encoder whose output is averaged and fed into a multi-layered perceptron (MLP) classifier. The MLP classifier is composed of three linear layers with ReLU activation functions and a final layer with LogSoftMax activation. We use weighted negative log-likelihood loss with a different weight on the positive class for our loss function. This biases our model towards high precision and avoids costly false alarms. It also incorporates our business objectives directly into the model training process. Figure 2 illustrates the transformer architecture.

Figure 2: Temporal transformer architecture
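
The weighted negative log-likelihood loss can be illustrated in plain Python. This is a sketch under stated assumptions, not the training code: the class weights are placeholders (the post does not give the actual values), and normalizing by the sum of the weights used is one common convention for a weighted mean.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_sum for x in logits]


def weighted_nll(batch_logits, targets, class_weights):
    """Weighted mean of -log p(target), normalized by the sum of the weights used."""
    total, weight_sum = 0.0, 0.0
    for logits, y in zip(batch_logits, targets):
        w = class_weights[y]
        total += -w * log_softmax(logits)[y]
        weight_sum += w
    return total / weight_sum


# Placeholder weights (assumption): class 0 = no fault, class 1 = fault
class_weights = [1.0, 0.5]
batch_logits = [[2.0, 0.0], [0.0, 2.0]]  # two confident, correct predictions
targets = [0, 1]
loss = weighted_nll(batch_logits, targets, class_weights)
```

Changing the weight on the positive class changes how much each kind of mistake contributes to the loss, which is how the business preference for few false alarms can be expressed during training.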

Training

One challenge when training this temporal learning model is data imbalance. Some units have a longer operational history than others and therefore have more cycles in our dataset. Because they are overrepresented in the dataset, these units will have more influence on our model. We solve this by randomly sampling 100 cycles in a unit’s history where we assess the probability of a failure at that time. This ensures that each unit is equally represented during the training process. While removing the imbalanced data problem, this approach has the added benefit of replicating a batch processing approach that will be used in production. This sampling approach was applied to the training, validation, and test sets.
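
The per-unit sampling step might look like the following sketch. The function name, the fixed seed, and the choice to keep every cycle when a unit has fewer than 100 are assumptions; the post does not say how short histories are handled.

```python
import random


def sample_cycles_per_unit(cycles_by_unit, n_samples=100, seed=0):
    """Draw up to n_samples cycles per unit so every unit is equally represented."""
    rng = random.Random(seed)  # seeded for reproducibility
    sampled = {}
    for unit_id, cycle_ids in cycles_by_unit.items():
        k = min(n_samples, len(cycle_ids))  # assumption: keep all if fewer exist
        sampled[unit_id] = sorted(rng.sample(cycle_ids, k))
    return sampled


# Hypothetical units with very different history lengths
history = {
    "unit-a": list(range(5000)),
    "unit-b": list(range(300)),
    "unit-c": list(range(40)),
}
sampled = sample_cycles_per_unit(history)
```

After sampling, a unit with 5,000 cycles and a unit with 300 cycles each contribute 100 evaluation points, so long-running units no longer dominate training.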

Training was performed using a GPU-accelerated instance on SageMaker. Monitoring the loss shows that it achieves the best results after 180 training epochs, as shown in Figure 3. Figure 4 shows that the area under the ROC curve for the resulting temporal classification model is 81%.

Figure 3: Training loss over epochs

Figure 4: ROC-AUC for 60-day lockout

Evaluation

While our model is trained at the cycle level, evaluation needs to take place at the unit level. In this way, one unit with multiple true positive detections is still only counted as a single true positive at the unit level. To do this, we analyze the overlap between the predicted outcomes and the 60-day window preceding a fault. This is illustrated in the following figures, which show four cases of prediction outcomes:

True negative – All the prediction results are negative (purple) (Figure 5.1)
False positive – The positive predictions are false alarms (Figure 5.2)
False negative – Although the predictions are all negative, the actual labels could be positive (green) (Figure 5.3)
True positive – Some of the predictions could be negative (green), and at least one prediction is positive (yellow) (Figure 5.4)

Figure 5.1: True negative case	Figure 5.2: False positive case
Figure 5.3: False negative case	Figure 5.4: True positive case
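
The four cases can be expressed as a small aggregation from cycle-level predictions to one outcome per unit. This is a sketch of one plausible rule set, not the project's evaluation code: the function name is hypothetical, and counting a unit as a false positive when its only alerts fall outside the pre-fault window is an assumption.

```python
from collections import Counter


def unit_outcome(alerts, in_fault_window):
    """Classify one unit from per-cycle alerts and per-cycle pre-fault-window flags."""
    hit = any(a and w for a, w in zip(alerts, in_fault_window))
    if hit:
        return "TP"  # at least one alert inside the 60-day window; counted once
    if any(alerts):
        return "FP"  # alerts raised, but none inside a pre-fault window
    return "FN" if any(in_fault_window) else "TN"


# Hypothetical units: (cycle-level alerts, cycle falls in the 60 days before a fault)
units = {
    "u1": ([0, 0, 0], [0, 0, 0]),
    "u2": ([0, 1, 0], [0, 0, 0]),
    "u3": ([0, 0, 0], [0, 1, 1]),
    "u4": ([0, 1, 1], [0, 1, 1]),
}
counts = Counter(unit_outcome(a, w) for a, w in units.values())
precision = counts["TP"] / (counts["TP"] + counts["FP"])
```

Unit u4 has two correct alerts but contributes a single true positive, which matches the unit-level counting described above.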

After training, we use the evaluation set to tune the threshold for sending an alert. Setting the model confidence threshold at 0.99 yields a precision of roughly 81%. This falls short of our initial 90% criterion for success. However, we found that a good portion of units failed just outside the 60-day evaluation window. This makes sense, because a unit may actively display faulty behavior but take longer than 60 days to fail. To handle this, we defined a metric called effective precision, which is a combination of the true positive precision (81%) with the added precision of lockouts that occurred in the 30 days beyond our target 60-day window.

For an HVAC dealer, what is most important is that an onsite inspection helps prevent future HVAC issues for the customer. Using this model, we estimate that 81.2% of the time the inspection will prevent a lockout from occurring in the next 60 days. Additionally, 10.4% of the time the lockout would have occurred within 90 days of inspection. The remaining 8.4% will be a false alarm. The effective precision of the trained model is 91.6%.
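
The effective precision figure follows directly from the three shares quoted above. The short calculation below uses only those numbers; the variable names are illustrative.

```python
# Shares of alerts, as reported in the text
within_60_days = 0.812        # lockout would occur inside the 60-day window
within_61_to_90_days = 0.104  # lockout would occur in the following 30 days
false_alarm = 0.084           # no lockout within 90 days

# Effective precision counts both kinds of useful alert
effective_precision = within_60_days + within_61_to_90_days
total = within_60_days + within_61_to_90_days + false_alarm

print(f"effective precision: {effective_precision:.1%}")
```

The three shares sum to 100%, so the effective precision is also one minus the false alarm rate.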

Conclusion

In this post, we showed how our team used AWS Glue and SageMaker to create a scalable supervised learning solution for predictive maintenance. Our model is capable of capturing trends across long-term histories of sensor data and accurately detecting hundreds of equipment failures weeks in advance. Predicting faults in advance will reduce curb-to-curb time, allowing our dealers to provide more timely technical assistance and improving the overall customer experience. The impacts of this approach will grow over time as more cloud-connected HVAC units are installed every year.

Our next step is to integrate these insights into the upcoming release of Carrier’s Connected Dealer Portal. The portal combines these predictive alerts with other insights we derive from our AWS-based data lake in order to give our dealers more clarity into equipment health across their entire client base. We will continue to improve our model by integrating data from additional sources and extracting more advanced features from our sensor data. The methods employed in this project provide a strong foundation for our team to start answering other key questions that can help us reduce warranty claims and improve equipment efficiency in the field.

If you’d like help accelerating the use of ML in your products and services, please contact the Amazon ML Solutions Lab. To learn more about the services used in this project, refer to the AWS Glue Developer Guide and the Amazon SageMaker Developer Guide.

About the Authors

Ravi Patankar is a technical leader for IoT related analytics at Carrier’s Residential HVAC Unit. He formulates analytics problems related to diagnostics and prognostics and provides direction for ML/deep learning-based analytics solutions and architecture.

Dan Volk is a Data Scientist at the AWS Generative AI Innovation Center. He has ten years of experience in machine learning, deep learning and time-series analysis and holds a Master’s in Data Science from UC Berkeley. He is passionate about transforming complex business challenges into opportunities by leveraging cutting-edge AI technologies.

Yingwei Yu is an Applied Scientist at AWS Generative AI Innovation Center. He has experience working with several organizations across industries on various proof-of-concepts in machine learning, including NLP, time-series analysis, and generative AI technologies. Yingwei received his PhD in computer science from Texas A&M University.

Yanxiang Yu is an Applied Scientist at Amazon Web Services, working on the Generative AI Innovation Center. With over 8 years of experience building AI and machine learning models for industrial applications, he specializes in generative AI, computer vision, and time series modeling. His work focuses on finding innovative ways to apply advanced generative techniques to real-world problems.

Diego Socolinsky is a Senior Applied Science Manager with the AWS Generative AI Innovation Center, where he leads the delivery team for the Eastern US and Latin America regions. He has over twenty years of experience in machine learning and computer vision, and holds a PhD degree in mathematics from The Johns Hopkins University.

Kexin Ding is a fifth-year Ph.D. candidate in computer science at UNC-Charlotte. Her research focuses on applying deep learning methods for analyzing multi-modal data, including medical image and genomics sequencing data.

Optimize deployment cost of Amazon SageMaker JumpStart foundation models with Amazon SageMaker asynchronous endpoints

September 5, 2023

by Davide Gallitelli Amazon AWS

The success of generative AI applications across a wide range of industries has attracted the attention and interest of companies worldwide who are looking to reproduce and surpass the achievements of competitors or solve new and exciting use cases. These customers are looking into foundation models, such as TII Falcon, Stable Diffusion XL, or OpenAI’s GPT-3.5, as the engines that power the generative AI innovation.

Foundation models are a class of generative AI models that are capable of understanding and generating human-like content, thanks to the vast amounts of unstructured data they have been trained on. These models have revolutionized various computer vision (CV) and natural language processing (NLP) tasks, including image generation, translation, and question answering. They serve as the building blocks for many AI applications and have become a crucial component in the development of advanced intelligent systems.

However, the deployment of foundation models can come with significant challenges, particularly in terms of cost and resource requirements. These models are known for their size, often ranging from hundreds of millions to billions of parameters. Their large size demands extensive computational resources, including powerful hardware and significant memory capacity. In fact, deploying foundation models usually requires one or more GPUs to handle the computational load efficiently. For example, the TII Falcon-40B Instruct model requires at least an ml.g5.12xlarge instance to be loaded into memory successfully, but performs best with bigger instances. As a result, the return on investment (ROI) of deploying and maintaining these models can be too low to prove business value, especially during development cycles or for spiky workloads. This is due to the running costs of having GPU-powered instances for long sessions, potentially 24/7.

Earlier this year, we announced Amazon Bedrock, a serverless API to access foundation models from Amazon and our generative AI partners. Although it’s currently in Private Preview, its serverless API allows you to use foundation models from Amazon, Anthropic, Stability AI, and AI21, without having to deploy any endpoints yourself. However, open-source models from communities such as Hugging Face have been growing a lot, and not every one of them has been made available through Amazon Bedrock.

In this post, we target these situations and solve the problem of risking high costs by deploying large foundation models to Amazon SageMaker asynchronous endpoints from Amazon SageMaker JumpStart. This can help cut costs of the architecture, allowing the endpoint to run only when requests are in the queue and for a short time-to-live, while scaling down to zero when no requests are waiting to be serviced. This sounds great for a lot of use cases; however, an endpoint that has scaled down to zero will introduce a cold start time before being able to serve inferences.

Solution overview

The following diagram illustrates our solution architecture.

The architecture we deploy is very straightforward:

The user interface is a notebook, which can be replaced by a web UI built on Streamlit or similar technology. In our case, the notebook is an Amazon SageMaker Studio notebook, running on an ml.m5.large instance with the PyTorch 2.0 Python 3.10 CPU kernel.
The notebook queries the endpoint in three ways: the SageMaker Python SDK, the AWS SDK for Python (Boto3), and LangChain.
The endpoint is running asynchronously on SageMaker, and on the endpoint, we deploy the Falcon-40B Instruct model. It’s currently the state of the art in terms of instruct models and available in SageMaker JumpStart. A single API call allows us to deploy the model on the endpoint.

What is SageMaker asynchronous inference

SageMaker asynchronous inference is one of the four deployment options in SageMaker, together with real-time endpoints, batch inference, and serverless inference. To learn more about the different deployment options, refer to Deploy models for Inference.

SageMaker asynchronous inference queues incoming requests and processes them asynchronously, making this option ideal for requests with large payload sizes up to 1 GB, long processing times, and near-real-time latency requirements. However, the main advantage that it provides when dealing with large foundation models, especially during a proof of concept (POC) or during development, is the capability to configure asynchronous inference to scale in to an instance count of zero when there are no requests to process, thereby saving costs. For more information about SageMaker asynchronous inference, refer to Asynchronous inference. The following diagram illustrates this architecture.

To deploy an asynchronous inference endpoint, you need to create an AsyncInferenceConfig object. If you create AsyncInferenceConfig without specifying its arguments, the default S3OutputPath will be s3://sagemaker-{REGION}-{ACCOUNTID}/async-endpoint-outputs/{UNIQUE-JOB-NAME} and S3FailurePath will be s3://sagemaker-{REGION}-{ACCOUNTID}/async-endpoint-failures/{UNIQUE-JOB-NAME}.
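As a small illustration of the default locations described above, the following sketch fills in the two path templates. The function name, Region, account ID, and job name are made up for the example; the SDK computes these values for you.

```python
def default_async_paths(region: str, account_id: str, job_name: str) -> dict:
    """Fill in the default S3 path templates quoted in the text above."""
    bucket = f"sagemaker-{region}-{account_id}"
    return {
        "S3OutputPath": f"s3://{bucket}/async-endpoint-outputs/{job_name}",
        "S3FailurePath": f"s3://{bucket}/async-endpoint-failures/{job_name}",
    }


# Hypothetical Region, account ID, and job name
paths = default_async_paths("eu-west-1", "123456789012", "falcon-40b-demo")
print(paths["S3OutputPath"])
print(paths["S3FailurePath"])
```

If you prefer your own locations, AsyncInferenceConfig accepts them as arguments (for example, output_path) instead of relying on these defaults.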

What is SageMaker JumpStart

Our model comes from SageMaker JumpStart, a feature of SageMaker that accelerates the machine learning (ML) journey by offering pre-trained models, solution templates, and example notebooks. It provides access to a wide range of pre-trained models for different problem types, allowing you to start your ML tasks with a solid foundation. SageMaker JumpStart also offers solution templates for common use cases and example notebooks for learning. With SageMaker JumpStart, you can reduce the time and effort required to start your ML projects with one-click solution launches and comprehensive resources for practical ML experience.

The following screenshot shows an example of just some of the models available on the SageMaker JumpStart UI.

Deploy the model

Our first step is to deploy the model to SageMaker. To do that, we can use the UI for SageMaker JumpStart or the SageMaker Python SDK, which provides an API that we can use to deploy the model to the asynchronous endpoint:

%%time
from sagemaker.async_inference import AsyncInferenceConfig
from sagemaker.jumpstart.model import JumpStartModel
from sagemaker.serializers import JSONSerializer
from sagemaker.deserializers import JSONDeserializer

model_id, model_version = "huggingface-llm-falcon-40b-instruct-bf16", "*"
my_model = JumpStartModel(model_id=model_id, model_version=model_version)
predictor = my_model.deploy(
    # An endpoint must be created with at least one instance;
    # auto scaling (configured next) is what takes it down to zero
    initial_instance_count=1,
    instance_type="ml.g5.12xlarge",
    async_inference_config=AsyncInferenceConfig(),
)

This call can take approximately 10 minutes to complete. During this time, the endpoint is spun up, the container and the model artifacts are downloaded to the endpoint, the model configuration is loaded from SageMaker JumpStart, and then the asynchronous endpoint is exposed via a DNS endpoint.

To make sure that our endpoint can scale down to zero, we need to configure auto scaling on the asynchronous endpoint using Application Auto Scaling. You first register your endpoint variant with Application Auto Scaling, then define a scaling policy and apply it. In this configuration, we track a custom metric called ApproximateBacklogSizePerInstance, specified through CustomizedMetricSpecification, as shown in the following code. For a detailed list of Amazon CloudWatch metrics available with your asynchronous inference endpoint, refer to Monitoring with CloudWatch.

import boto3

client = boto3.client("application-autoscaling")
resource_id = "endpoint/" + my_model.endpoint_name + "/variant/" + "AllTraffic"

# Configure Autoscaling on asynchronous endpoint down to zero instances
response = client.register_scalable_target(
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    MinCapacity=0,  # Minimum number of instances to scale in to; 0 stops instance charges while idle
    MaxCapacity=1,  # Maximum number of instances to scale out to; 1 is enough for development
)

response = client.put_scaling_policy(
    PolicyName="Invocations-ScalingPolicy",
    ServiceNamespace="sagemaker",  # The namespace of the AWS service that provides the resource.
    ResourceId=resource_id,  # Endpoint name
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",  # SageMaker supports only Instance Count
    PolicyType="TargetTrackingScaling",  # 'StepScaling'|'TargetTrackingScaling'
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 5.0,  # The target value for the metric; here the metric is ApproximateBacklogSizePerInstance
        "CustomizedMetricSpecification": {
            "MetricName": "ApproximateBacklogSizePerInstance",
            "Namespace": "AWS/SageMaker",
            "Dimensions": [{"Name": "EndpointName", "Value": my_model.endpoint_name}],
            "Statistic": "Average",
        },
        "ScaleInCooldown": 600,  # The amount of time, in seconds, after a scale in activity completes before another scale in activity can start.
        "ScaleOutCooldown": 300,  # ScaleOutCooldown - The amount of time, in seconds, after a scale out activity completes before another scale out activity can start.
        # 'DisableScaleIn': True|False - indicates whether scale in by the target tracking policy is disabled.
        # If the value is true, scale in is disabled and the target tracking policy won't remove capacity from the scalable resource.
    },
)

You can verify that this policy has been set successfully by navigating to the SageMaker console, choosing Endpoints under Inference in the navigation pane, and looking for the endpoint we just deployed.
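The same check can be done programmatically. The following is a minimal sketch that lists the policy names attached to the endpoint variant through the Application Auto Scaling describe_scaling_policies call. The helper name is hypothetical, and a stand-in client is used so the snippet runs without AWS credentials; in the notebook you would pass boto3.client("application-autoscaling") and the resource_id defined earlier.

```python
def scaling_policy_names(client, resource_id: str) -> list:
    """Return the names of the scaling policies attached to an endpoint variant.

    `client` is expected to behave like boto3.client("application-autoscaling").
    """
    response = client.describe_scaling_policies(
        ServiceNamespace="sagemaker",
        ResourceId=resource_id,
    )
    return [policy["PolicyName"] for policy in response["ScalingPolicies"]]


# Stand-in client for illustration only; it mimics the shape of the real response.
class FakeAutoScalingClient:
    def describe_scaling_policies(self, ServiceNamespace, ResourceId):
        return {"ScalingPolicies": [{"PolicyName": "Invocations-ScalingPolicy"}]}


names = scaling_policy_names(
    FakeAutoScalingClient(), "endpoint/my-endpoint/variant/AllTraffic"
)
print(names)
```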

Invoke the asynchronous endpoint

To invoke the endpoint, you need to place the request payload in Amazon Simple Storage Service (Amazon S3) and provide a pointer to this payload as a part of the InvokeEndpointAsync request. Upon invocation, SageMaker queues the request for processing and returns an identifier and output location as a response. Upon processing, SageMaker places the result in the Amazon S3 location. You can optionally choose to receive success or error notifications with Amazon Simple Notification Service (Amazon SNS).
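The first step of this flow, placing the request payload in Amazon S3, might look like the following sketch. The helper name, bucket, key, and prompt are assumptions for illustration, and a stand-in client replaces boto3.client("s3") so the snippet runs without AWS access.

```python
import json


def upload_payload(s3_client, bucket: str, key: str, payload: dict) -> str:
    """Serialize the payload as JSON, store it in S3, and return its S3 URI.

    `s3_client` is expected to behave like boto3.client("s3").
    """
    s3_client.put_object(
        Bucket=bucket,
        Key=key,
        Body=json.dumps(payload).encode("utf-8"),
        ContentType="application/json",
    )
    return f"s3://{bucket}/{key}"


# Example request body; the prompt and parameter values are illustrative.
payload = {
    "inputs": "What is the purpose of asynchronous inference?",
    "parameters": {"max_new_tokens": 100, "do_sample": False},
}


# Stand-in client for illustration only; it records what would be uploaded.
class FakeS3Client:
    def __init__(self):
        self.objects = {}

    def put_object(self, Bucket, Key, Body, ContentType):
        self.objects[(Bucket, Key)] = Body


fake_s3 = FakeS3Client()
input_location = upload_payload(fake_s3, "my-bucket", "async-inputs/request.json", payload)
print(input_location)
```

The returned URI is the pointer that the InvokeEndpointAsync request expects as its input location.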

SageMaker Python SDK

After deployment is complete, deploy() returns an AsyncPredictor object. To perform asynchronous inference, you upload data to Amazon S3 and call the predict_async() method with the S3 URI as the input. It returns an AsyncInferenceResponse object, and you can check the result using the get_result() method.

Alternatively, if you would like to check for a result periodically and return it upon generation, use the predict() method. In the following code, we call predict_async() and poll get_result() ourselves so that we can print a status message while waiting:

import time
from sagemaker.exceptions import ObjectNotExistedError

# Invoking the asynchronous endpoint with the SageMaker Python SDK
# Assumes payload (the request dict), bucket, and prefix are already defined in the notebook
def query_endpoint(payload):
    """Query endpoint and print the response"""
    response = predictor.predict_async(
        data=payload,
        input_path="s3://{}/{}".format(bucket, prefix),
    )
    while True:
        try:
            result = response.get_result()
            break
        except ObjectNotExistedError:
            # The output object has not been written to Amazon S3 yet
            print("Inference is not ready ...")
            time.sleep(5)
    print(f"\033[1m Input:\033[0m {payload['inputs']}")
    print(f"\033[1m Output:\033[0m {result[0]['generated_text']}")

query_endpoint(payload)

Boto3

Let’s now explore the invoke_endpoint_async method from Boto3’s sagemaker-runtime client. It enables developers to asynchronously invoke a SageMaker endpoint, providing a token for progress tracking and retrieval of the response later. Boto3 doesn’t offer a way to wait for the asynchronous inference to be completed like the SageMaker Python SDK’s get_result() operation. Therefore, we take advantage of the fact that SageMaker stores the inference output in Amazon S3 at the location returned in response["OutputLocation"]. We can use the following function to wait for the inference file to be written to Amazon S3:

import json
import time
import boto3
from botocore.exceptions import ClientError

s3_client = boto3.client("s3")

# Wait until the prediction is generated
def wait_inference_file(bucket, prefix):
    while True:
        try:
            response = s3_client.get_object(Bucket=bucket, Key=prefix)
            break
        except ClientError as ex:
            if ex.response['Error']['Code'] == 'NoSuchKey':
                print("Waiting for file to be generated...")
                time.sleep(5)
                continue
            else:
                raise
        except Exception as e:
            print(e.__dict__)
            raise
    return response

With this function, we can now query the endpoint:

# Invoking the asynchronous endpoint with the Boto3 SDK
import boto3

sagemaker_client = boto3.client("sagemaker-runtime")

# Query the endpoint function
def query_endpoint_boto3(payload):
    """Query endpoint and print the response"""
    response = sagemaker_client.invoke_endpoint_async(
        EndpointName=my_model.endpoint_name,
        InputLocation="s3://{}/{}".format(bucket, prefix),
        ContentType="application/json",
        Accept="application/json"
    )
    output_url = response["OutputLocation"]
    # OutputLocation has the form s3://<bucket>/<key>; the output bucket may differ from the input bucket
    output_bucket, output_key = output_url.replace("s3://", "", 1).split("/", 1)
    # Read the bytes of the file from S3 in output_url with Boto3
    output = wait_inference_file(output_bucket, output_key)
    output = json.loads(output['Body'].read())[0]['generated_text']
    # Emit output
    print(f"\033[1m Input:\033[0m {payload['inputs']}")
    print(f"\033[1m Output:\033[0m {output}")

query_endpoint_boto3(payload)
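The output location returned by the endpoint is an S3 URI, so the bucket and key have to be separated before the object can be fetched. A slightly more defensive way to do this than string splitting is sketched below with the standard library. The helper name and the sample URI are made up for the example.

```python
from urllib.parse import urlparse


def split_s3_uri(uri: str) -> tuple:
    """Split s3://bucket/key into (bucket, key), rejecting anything else."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"Not an S3 URI: {uri}")
    return parsed.netloc, parsed.path.lstrip("/")


# Hypothetical output location
bucket_name, key = split_s3_uri(
    "s3://sagemaker-eu-west-1-123456789012/async-endpoint-outputs/job/abc.out"
)
print(bucket_name)
print(key)
```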

LangChain

LangChain is an open-source framework launched in October 2022 by Harrison Chase. It simplifies the development of applications using large language models (LLMs) by providing integrations with various systems and data sources. LangChain allows for document analysis, summarization, chatbot creation, code analysis, and more. It has gained popularity, with contributions from hundreds of developers and significant funding from venture firms. LangChain enables the connection of LLMs with external sources, making it possible to create dynamic, data-responsive applications. It offers libraries, APIs, and documentation to streamline the development process.

LangChain provides libraries and examples for using SageMaker endpoints with its framework, making it easier to use ML models hosted on SageMaker as the “brain” of the chain. To learn more about how LangChain integrates with SageMaker, refer to the SageMaker Endpoint in the LangChain documentation.

One of the limits of the current implementation of LangChain is that it doesn’t support asynchronous endpoints natively. To use an asynchronous endpoint with LangChain, we have to define a new class, SagemakerAsyncEndpoint, that extends the SagemakerEndpoint class already available in LangChain. Additionally, we provide the following information:

The S3 bucket and prefix where asynchronous inference will store the inputs (and outputs)
A maximum number of seconds to wait before timing out
An updated _call() function to query the endpoint with invoke_endpoint_async() instead of invoke_endpoint()
A way to wake up the asynchronous endpoint if it’s in cold start (scaled down to zero)
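The maximum wait time in the list above can be sketched as a small polling helper with a deadline. This is not the code from sagemaker_async_endpoint.py; wait_for and its parameters are hypothetical names, and the check function here is a stand-in for "look for the output object in Amazon S3".

```python
import time


def wait_for(check, max_wait_seconds: float, poll_interval: float = 5.0,
             sleep=time.sleep, clock=time.monotonic):
    """Call check() until it returns a non-None value or the deadline passes."""
    deadline = clock() + max_wait_seconds
    while True:
        result = check()
        if result is not None:
            return result
        if clock() >= deadline:
            raise TimeoutError(f"No result after {max_wait_seconds} seconds")
        sleep(poll_interval)


# Stand-in check that succeeds on the third attempt
attempts = {"count": 0}


def fake_check():
    attempts["count"] += 1
    return "done" if attempts["count"] >= 3 else None


result = wait_for(fake_check, max_wait_seconds=60, poll_interval=0, sleep=lambda s: None)
print(result, attempts["count"])
```

Bounding the wait matters more for an endpoint that has scaled down to zero, because the first request also has to absorb the cold start.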

To review the newly created SagemakerAsyncEndpoint, you can check out the sagemaker_async_endpoint.py file available on GitHub.

import json
from typing import Dict

import sagemaker
from langchain import PromptTemplate
from langchain.llms.sagemaker_endpoint import LLMContentHandler
from langchain.chains import LLMChain
from sagemaker_async_endpoint import SagemakerAsyncEndpoint

class ContentHandler(LLMContentHandler):
    content_type:str = "application/json"
    accepts:str = "application/json"
    len_prompt:int = 0

    def transform_input(self, prompt: str, model_kwargs: Dict) -> bytes:
        self.len_prompt = len(prompt)
        input_str = json.dumps({"inputs": prompt, "parameters": {"max_new_tokens": 100, "do_sample": False, "repetition_penalty": 1.1}})
        return input_str.encode('utf-8')

    def transform_output(self, output: bytes) -> str:
        response_json = output.read()
        res = json.loads(response_json)
        ans = res[0]['generated_text']
        return ans

chain = LLMChain(
    llm=SagemakerAsyncEndpoint(
        input_bucket=bucket,
        input_prefix=prefix,
        endpoint_name=my_model.endpoint_name,
        region_name=sagemaker.Session().boto_region_name,
        content_handler=ContentHandler(),
    ),
    prompt=PromptTemplate(
        input_variables=["query"],
        template="{query}",
    ),
)

print(chain.run(payload['inputs']))

Clean up

When you’re done testing the generation of inferences from the endpoint, remember to delete the endpoint to avoid incurring extra charges:

predictor.delete_endpoint()

Conclusion

When deploying large foundation models like TII Falcon, optimizing cost is crucial. These models require powerful hardware and substantial memory capacity, leading to high infrastructure costs. SageMaker asynchronous inference, a deployment option that processes requests asynchronously, reduces expenses by scaling the instance count to zero when there are no pending requests. In this post, we demonstrated how to deploy large SageMaker JumpStart foundation models to SageMaker asynchronous endpoints. We provided code examples using the SageMaker Python SDK, Boto3, and LangChain to illustrate different methods for invoking asynchronous endpoints and retrieving results. These techniques enable developers and researchers to optimize costs while using the capabilities of foundation models for advanced language understanding systems.

To learn more about asynchronous inference and SageMaker JumpStart, check out the following posts:

About the author

Davide Gallitelli is a Specialist Solutions Architect for AI/ML in the EMEA region. He is based in Brussels and works closely with customers throughout Benelux. He has been a developer since he was very young, starting to code at the age of 7. He started learning AI/ML at university, and has fallen in love with it since then.

Making automated visual-inspection systems practical

September 5, 2023

by Amazon AWS

Benchmarking framework that includes a product-agnostic public dataset, guidelines for model selection, and an evaluation approach helps bridge the gap between research and real-world implementation.

Elevating the generative AI experience: Introducing streaming support in Amazon SageMaker hosting

September 1, 2023

by Raghu Ramesha Amazon AWS

We’re excited to announce the availability of response streaming through Amazon SageMaker real-time inference. Now you can continuously stream inference responses back to the client when using SageMaker real-time inference to help you build interactive experiences for generative AI applications such as chatbots, virtual assistants, and music generators. With this new feature, you can start streaming the responses immediately when they’re available instead of waiting for the entire response to be generated. This lowers the time-to-first-byte for your generative AI applications.

In this post, we’ll show how to build a streaming web application using SageMaker real-time endpoints with the new response streaming feature for an interactive chat use case. We use Streamlit for the sample demo application UI.

Solution overview

To get responses streamed back from SageMaker, you can use our new InvokeEndpointWithResponseStream API. It helps enhance customer satisfaction by delivering a faster time-to-first-response-byte. This reduction in customer-perceived latency is particularly crucial for applications built with generative AI models, where immediate processing is valued over waiting for the entire payload. Moreover, it introduces a sticky session that will enable continuity in interactions, benefiting use cases such as chatbots, to create more natural and efficient user experiences.

The implementation of response streaming in SageMaker real-time endpoints is achieved through HTTP 1.1 chunked encoding, which is a mechanism for sending a response as a series of chunks. This is an HTTP standard that supports binary content and is supported by most client/server frameworks. HTTP chunked encoding supports both text and image data streaming, which means the models hosted on SageMaker endpoints, such as Falcon, Llama 2, and Stable Diffusion models, can send back streamed responses as text or image. In terms of security, both the input and output are secured using TLS with AWS SigV4 authentication. Other streaming techniques like Server-Sent Events (SSE) are also implemented using the same HTTP chunked encoding mechanism. To take advantage of the new streaming API, you need to make sure the model container returns the streamed response as chunked encoded data.
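To illustrate what chunked encoding looks like on the wire, the following sketch splits a chunked body into its chunks. It is a simplified decoder for illustration only (it ignores trailers), and the function name and sample bytes are made up; when you call the endpoint through an AWS SDK, this decoding is handled for you.

```python
def decode_chunked(body: bytes) -> list:
    """Split an HTTP/1.1 chunked body into its data chunks (trailers are ignored)."""
    chunks = []
    pos = 0
    while True:
        line_end = body.index(b"\r\n", pos)
        # The size line is hexadecimal and may carry extensions after ';'
        size_field = body[pos:line_end].split(b";", 1)[0]
        size = int(size_field.decode("ascii"), 16)
        if size == 0:
            break  # a zero-length chunk marks the end of the body
        start = line_end + 2
        chunks.append(body[start:start + size])
        pos = start + size + 2  # skip the chunk data and its trailing CRLF
    return chunks


# Two 6-byte chunks followed by the terminating zero-length chunk
wire = b"6\r\n Hello\r\n6\r\n world\r\n0\r\n\r\n"
chunks = decode_chunked(wire)
print(chunks)
```

Each chunk can be handed to the client as soon as it arrives, which is what makes a low time-to-first-byte possible.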

The following diagram illustrates the high-level architecture for response streaming with a SageMaker inference endpoint.

One of the use cases that will benefit from streaming response is generative AI model-powered chatbots. Traditionally, users send a query and wait for the entire response to be generated before receiving an answer. This could take precious seconds or even longer, which can potentially degrade the performance of the application. With response streaming, the chatbot can begin sending back partial inference results as they are generated. This means that users can see the initial response almost instantaneously, even as the AI continues refining its answer in the background. This creates a seamless and engaging conversation flow, where users feel like they’re chatting with an AI that understands and responds in real time.

In this post, we showcase two container options to create a SageMaker endpoint with response streaming: an AWS Large Model Inference (LMI) container and a Hugging Face Text Generation Inference (TGI) container. In the following sections, we walk you through the detailed implementation steps to deploy and test the Falcon-7B-Instruct model using both LMI and TGI containers on SageMaker. We chose Falcon 7B as an example, but any model can take advantage of this new streaming feature.

Prerequisites

You need an AWS account with an AWS Identity and Access Management (IAM) role with permissions to manage resources created as part of the solution. For details, refer to Creating an AWS account. If this is your first time working with Amazon SageMaker Studio, you first need to create a SageMaker domain. Additionally, you may need to request a service quota increase for the corresponding SageMaker hosting instances. For the Falcon-7B-Instruct model, we use an ml.g5.2xlarge SageMaker hosting instance. For hosting a Falcon-40B-Instruct model, we use an ml.g5.48xlarge SageMaker hosting instance. You can request a quota increase from the Service Quotas UI. For more information, refer to Requesting a quota increase.

Option 1: Deploy a real-time streaming endpoint using an LMI container

The LMI container is one of the Deep Learning Containers for large model inference hosted by SageMaker to facilitate hosting large language models (LLMs) on AWS infrastructure for low-latency inference use cases. The LMI container uses Deep Java Library (DJL) Serving, which is an open-source, high-level, engine-agnostic Java framework for deep learning. With these containers, you can use corresponding open-source libraries such as DeepSpeed, Accelerate, Transformers-neuronx, and FasterTransformer to partition model parameters using model parallelism techniques to use the memory of multiple GPUs or accelerators for inference. For more details on the benefits of using the LMI container to deploy large models on SageMaker, refer to Deploy large models at high performance using FasterTransformer on Amazon SageMaker and Deploy large models on Amazon SageMaker using DJLServing and DeepSpeed model parallel inference. You can also find more examples of hosting open-source LLMs on SageMaker using the LMI containers in this GitHub repo.

For the LMI container, we expect the following artifacts to help set up the model for inference:

serving.properties (required) – Defines the model server settings
model.py (optional) – A Python file to define the core inference logic
requirements.txt (optional) – Any additional pip packages that need to be installed

LMI containers can be used to host models without providing your own inference code. This is extremely useful when there is no custom preprocessing of the input data or postprocessing of the model’s predictions. We use the following configuration:

For this example, we host the Falcon-7B-Instruct model. We need to create a serving.properties configuration file with our desired hosting options and package it up into a tar.gz artifact. Response streaming can be enabled in DJL Serving by setting the enable_streaming option in the serving.properties file. For all the supported parameters, refer to Streaming Python configuration.
In this example, we use the default handlers in DJL Serving to stream responses, so we only care about sending requests and parsing the output response. You can also provide an entrypoint code with a custom handler in a model.py file to customize input and output handlers. For more details on the custom handler, refer to Custom model.py handler.
Because we’re hosting the Falcon-7B-Instruct model on a single GPU instance (ml.g5.2xlarge), we set option.tensor_parallel_degree to 1. If you plan to run on multiple GPUs, use this to set the number of GPUs per worker.
We use option.output_formatter to control the output content type. The default output content type is application/json, so if your application requires a different output, you can overwrite this value. For more information on the available options, refer to Configurations and settings and All DJL configuration options.

%%writefile serving.properties
engine=MPI 
option.model_id=tiiuae/falcon-7b-instruct
option.trust_remote_code=true
option.tensor_parallel_degree=1
option.max_rolling_batch_size=32
option.rolling_batch=auto
option.output_formatter=jsonlines
option.paged_attention=false
option.enable_streaming=true
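Packaging the configuration into a tar.gz artifact, as mentioned earlier, could be done along the following lines. This is a minimal sketch using only the standard library; the directory name mymodel, the archive name, and the shortened properties text are assumptions for illustration.

```python
import tarfile
import tempfile
from pathlib import Path


def package_model_config(properties_text: str, work_dir: Path) -> Path:
    """Write serving.properties into a folder and bundle it as mymodel.tar.gz."""
    model_dir = work_dir / "mymodel"  # hypothetical folder name
    model_dir.mkdir(parents=True, exist_ok=True)
    (model_dir / "serving.properties").write_text(properties_text)
    archive_path = work_dir / "mymodel.tar.gz"
    with tarfile.open(archive_path, "w:gz") as tar:
        tar.add(model_dir, arcname="mymodel")
    return archive_path


# Shortened configuration, for the example only
properties = (
    "engine=MPI\n"
    "option.model_id=tiiuae/falcon-7b-instruct\n"
    "option.enable_streaming=true\n"
)
archive = package_model_config(properties, Path(tempfile.mkdtemp()))
with tarfile.open(archive) as tar:
    members = tar.getnames()
print(members)
```

The archive then needs to be uploaded to Amazon S3 (for example, with the upload_data method of a SageMaker session); the resulting S3 URI is what the model creation step expects as its model data.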

To create the SageMaker model, retrieve the container image URI:

import sagemaker
from sagemaker import image_uris

sess = sagemaker.Session()
image_uri = image_uris.retrieve(
    framework="djl-deepspeed",
    region=sess.boto_session.region_name,
    version="0.23.0"
)

Use the SageMaker Python SDK to create the SageMaker model and deploy it to a SageMaker real-time endpoint using the deploy method:

import sagemaker
from sagemaker.model import Model

instance_type = "ml.g5.2xlarge"
endpoint_name = sagemaker.utils.name_from_base("lmi-model-falcon-7b")

# code_artifact is the S3 URI of the packaged tar.gz; role is the SageMaker execution role ARN
model = Model(
    sagemaker_session=sess,
    image_uri=image_uri,
    model_data=code_artifact,
    role=role,
)

model.deploy(
    initial_instance_count=1,
    instance_type=instance_type,
    endpoint_name=endpoint_name,
    container_startup_health_check_timeout=900
)

When the endpoint is in service, you can use the InvokeEndpointWithResponseStream API call to invoke the model. This API allows the model to respond as a stream of parts of the full response payload. This enables models to return larger responses and provides a faster time-to-first-byte for models where there is a significant difference between the generation of the first and last byte of the response.

The response content type shown in x-amzn-sagemaker-content-type for the LMI container is application/jsonlines as specified in the model properties configuration. Because it’s part of the common data formats supported for inference, we can use the default deserializer provided by the SageMaker Python SDK to deserialize the JSON lines data. We create a helper LineIterator class to parse the response stream received from the inference request:

import io

class LineIterator:
    """
    A helper class for parsing the byte stream input. 
    
    The output of the model will be in the following format:
    ```
    b'{"outputs": [" a"]}\n'
    b'{"outputs": [" challenging"]}\n'
    b'{"outputs": [" problem"]}\n'
    ...
    ```
    
    While usually each PayloadPart event from the event stream will contain a byte array 
    with a full json, this is not guaranteed and some of the json objects may be split across
    PayloadPart events. For example:
    ```
    {'PayloadPart': {'Bytes': b'{"outputs": '}}
    {'PayloadPart': {'Bytes': b'[" problem"]}\n'}}
    ```
    
    This class accounts for this by concatenating bytes written via the 'write' function
    and then exposing a method which will return lines (ending with a '\n' character) within
    the buffer via the 'scan_lines' function. It maintains the position of the last read 
    position to ensure that previous bytes are not exposed again. 
    """
    
    def __init__(self, stream):
        self.byte_iterator = iter(stream)
        self.buffer = io.BytesIO()
        self.read_pos = 0

    def __iter__(self):
        return self

    def __next__(self):
        while True:
            self.buffer.seek(self.read_pos)
            line = self.buffer.readline()
            if line and line[-1] == ord('\n'):
                self.read_pos += len(line)
                return line[:-1]
            try:
                chunk = next(self.byte_iterator)
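            except StopIteration:
                if self.read_pos < self.buffer.getbuffer().nbytes:
                    # Flush a final line that did not end with a newline instead of looping forever
                    self.read_pos = self.buffer.getbuffer().nbytes
                    return line
                raise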
            if 'PayloadPart' not in chunk:
                print('Unknown event type: ' + str(chunk))
                continue
            self.buffer.seek(0, io.SEEK_END)
            self.buffer.write(chunk['PayloadPart']['Bytes'])

With the class in the preceding code, each time a response is streamed, it returns a binary string (for example, b'{"outputs": [" a"]}\n') that can be deserialized into a Python dictionary using the json package. We can use the following code to iterate through each streamed line of text and print the text response:

import json
import boto3

smr = boto3.client("sagemaker-runtime")

body = {"inputs": "what is life", "parameters": {"max_new_tokens":400}}
resp = smr.invoke_endpoint_with_response_stream(EndpointName=endpoint_name, Body=json.dumps(body), ContentType="application/json")
event_stream = resp['Body']

for line in LineIterator(event_stream):
    resp = json.loads(line)
    print(resp.get("outputs")[0], end='')
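To see how lines split across PayloadPart events are reassembled without calling an endpoint, the following sketch runs a compact generator-based variant of the same buffering idea over a hand-written event stream. The function name and the fake events are assumptions for illustration; this is an alternative to the LineIterator class, not a replacement provided by the SDK.

```python
import json


def iter_json_lines(event_stream):
    """Yield one decoded JSON object per newline-terminated line in the stream."""
    buffer = b""
    for event in event_stream:
        part = event.get("PayloadPart")
        if part is None:
            continue  # skip events that carry no payload bytes
        buffer += part["Bytes"]
        while b"\n" in buffer:
            line, buffer = buffer.split(b"\n", 1)
            if line:
                yield json.loads(line)
    if buffer.strip():
        yield json.loads(buffer)  # trailing data without a final newline


# Hand-written events; the second JSON object is split across two parts
fake_stream = [
    {"PayloadPart": {"Bytes": b'{"outputs": [" a"]}\n{"outp'}},
    {"PayloadPart": {"Bytes": b'uts": [" challenging"]}\n'}},
    {"PayloadPart": {"Bytes": b'{"outputs": [" problem"]}\n'}},
]
text = "".join(obj["outputs"][0] for obj in iter_json_lines(fake_stream))
print(text)
```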

The following screenshot shows what it would look like if you invoked the model through the SageMaker notebook using an LMI container.

Option 2: Implement a chatbot using a Hugging Face TGI container

In the previous section, you saw how to deploy the Falcon-7B-Instruct model using an LMI container. In this section, we show how to do the same using a Hugging Face Text Generation Inference (TGI) container on SageMaker. TGI is an open source, purpose-built solution for deploying LLMs. It incorporates optimizations including tensor parallelism for faster multi-GPU inference, dynamic batching to boost overall throughput, and optimized transformers code using flash-attention for popular model architectures including BLOOM, T5, GPT-NeoX, StarCoder, and LLaMa.

TGI deep learning containers support token streaming using Server-Sent Events (SSE). With token streaming, the server can start answering after the first prefill pass directly, without waiting for all the generation to be done. For extremely long queries, this means clients can start to see something happening orders of magnitude before the work is done. The following diagram shows a high-level end-to-end request/response workflow for hosting LLMs on a SageMaker endpoint using the TGI container.

To deploy the Falcon-7B-Instruct model on a SageMaker endpoint, we use the HuggingFaceModel class from the SageMaker Python SDK. We start by setting our parameters as follows:

hf_model_id = "tiiuae/falcon-7b-instruct" # model id from huggingface.co/models
number_of_gpus = 1 # number of gpus to use for inference and tensor parallelism
health_check_timeout = 300 # Increase the timeout for the health check to 5 minutes for downloading the model
instance_type = "ml.g5.2xlarge" # instance type to use for deployment

Compared to deploying regular Hugging Face models, we first need to retrieve the container URI and provide it to our HuggingFaceModel model class with image_uri pointing to the image. To retrieve the new Hugging Face LLM DLC in SageMaker, we can use the get_huggingface_llm_image_uri method provided by the SageMaker SDK. This method allows us to retrieve the URI for the desired Hugging Face LLM DLC based on the specified backend, session, Region, and version. For more details on the available versions, refer to HuggingFace Text Generation Inference Containers.

from sagemaker.huggingface import get_huggingface_llm_image_uri

llm_image = get_huggingface_llm_image_uri(
    "huggingface",
    version="0.9.3"
)

We then create the HuggingFaceModel and deploy it to SageMaker using the deploy method:

from sagemaker.huggingface import HuggingFaceModel

endpoint_name = sagemaker.utils.name_from_base("tgi-model-falcon-7b")
llm_model = HuggingFaceModel(
    role=role,
    image_uri=llm_image,
    env={
        'HF_MODEL_ID': hf_model_id,
        # 'HF_MODEL_QUANTIZE': "bitsandbytes",  # uncomment to quantize
        'SM_NUM_GPUS': str(number_of_gpus),
        'MAX_INPUT_LENGTH': "1900",  # Max length of input text
        'MAX_TOTAL_TOKENS': "2048",  # Max length of the generation (including input text)
    }
)

llm = llm_model.deploy(
    initial_instance_count=1,
    instance_type=instance_type,
    container_startup_health_check_timeout=health_check_timeout,
    endpoint_name=endpoint_name,
)

The main difference compared to the LMI container is that you enable response streaming when you invoke the endpoint by supplying stream=true as part of the invocation request payload. The following code is an example of the payload used to invoke the TGI container with streaming:

body = {
    "inputs":"tell me one sentence",
    "parameters":{
        "max_new_tokens":400,
        "return_full_text": False
    },
    "stream": True
}

Then you can invoke the endpoint and receive a streamed response using the following command:

from sagemaker.base_deserializers import StreamDeserializer

llm.deserializer=StreamDeserializer()
resp = smr.invoke_endpoint_with_response_stream(EndpointName=llm.endpoint_name, Body=json.dumps(body), ContentType='application/json')

The response content type shown in x-amzn-sagemaker-content-type for the TGI container is text/event-stream. We use StreamDeserializer to deserialize the response into the EventStream class and parse the response body using the same LineIterator class as that used in the LMI container section.

Note that the streamed response from the TGI containers will return a binary string (for example, b'data:{"token": {"text": " sometext"}}'), which can be deserialized again into a Python dictionary using a JSON package. We can use the following code to iterate through each streamed line of text and return a text response:

event_stream = resp['Body']
start_json = b'{'
for line in LineIterator(event_stream):
    if line != b'' and start_json in line:
        data = json.loads(line[line.find(start_json):].decode('utf-8'))
        if data['token']['text'] != stop_token:
            print(data['token']['text'],end='')
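
The per-line parsing can also be tried offline, without a deployed endpoint. The following is a minimal sketch that assumes every streamed line has the data:{...} shape described above. The helper name extract_token_text, the sample lines, and the default stop token value are illustrative assumptions rather than part of the original example.

```python
import json

START_JSON = b"{"


def extract_token_text(line: bytes, stop_token: str = "<|endoftext|>"):
    """Return the token text carried by one streamed line, or None.

    None is returned for empty lines, lines without a JSON payload,
    and lines whose token equals the (assumed) stop token.
    """
    start = line.find(START_JSON)
    if start == -1:
        return None
    data = json.loads(line[start:].decode("utf-8"))
    text = data["token"]["text"]
    return None if text == stop_token else text


# Hand-written sample lines standing in for a real event stream.
sample_lines = [
    b'data:{"token": {"text": " some"}}',
    b"",
    b'data:{"token": {"text": "text"}}',
    b'data:{"token": {"text": "<|endoftext|>"}}',
]

pieces = [extract_token_text(line) for line in sample_lines]
response_text = "".join(p for p in pieces if p is not None)
print(response_text)
```

Keeping the parsing in a small pure function like this makes it possible to unit test the streaming logic separately from the endpoint invocation.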

The following screenshot shows what it would look like if you invoked the model through the SageMaker notebook using a TGI container.

Run the chatbot app on SageMaker Studio

In this use case, we build a dynamic chatbot on SageMaker Studio using Streamlit, which invokes the Falcon-7B-Instruct model hosted on a SageMaker real-time endpoint to provide streaming responses. First, you can test that the streaming responses work in the notebook as shown in the previous section. Then, you can set up the Streamlit application in the SageMaker Studio JupyterServer terminal and access the chatbot UI from your browser by completing the following steps:

Open a system terminal in SageMaker Studio.
On the top menu of the SageMaker Studio console, choose File, then New, then Terminal.
Install the required Python packages that are specified in the requirements.txt file:
```
$ pip install -r requirements.txt
```

Set up the environment variable with the endpoint name deployed in your account:

$ export endpoint_name=<Falcon-7B-instruct endpoint name deployed in your account>

Launch the Streamlit app from the streamlit_chatbot_<LMI or TGI>.py file, which will automatically update the endpoint names in the script based on the environment variable that was set up earlier:
```
$ streamlit run streamlit_chatbot_LMI.py --server.port 6006
```
To access the Streamlit UI, copy your SageMaker Studio URL to another tab in your browser and replace lab? with proxy/[PORT NUMBER]/. Because we specified the server port to 6006, the URL should look as follows:
```
https://<domain ID>.studio.<region>.sagemaker.aws/jupyter/default/proxy/6006/
```

Replace the domain ID and Region in the preceding URL with your account and Region to access the chatbot UI. You can find some suggested prompts in the left pane to get started.
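
If you prefer to derive the proxy URL programmatically, the substitution described above can be sketched as follows. This is only an illustration: the function name studio_proxy_url and the domain ID in the usage example are hypothetical, and the sketch assumes the Studio URL path ends with /lab.

```python
from urllib.parse import urlsplit, urlunsplit


def studio_proxy_url(studio_url: str, port: int) -> str:
    """Swap the trailing 'lab' path segment for 'proxy/<port>/'."""
    parts = urlsplit(studio_url)
    path = parts.path.rstrip("/")
    if not path.endswith("/lab"):
        raise ValueError("expected a Studio URL whose path ends with /lab")
    new_path = path[: -len("lab")] + f"proxy/{port}/"
    # Drop the query string and fragment; the proxy URL does not need them.
    return urlunsplit((parts.scheme, parts.netloc, new_path, "", ""))


# Hypothetical domain ID and Region, for illustration only.
example = "https://d-example123.studio.us-east-1.sagemaker.aws/jupyter/default/lab?"
print(studio_proxy_url(example, 6006))
```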

The following demo shows how response streaming revolutionizes the user experience. It can make interactions feel fluid and responsive, ultimately enhancing user satisfaction and engagement. Refer to the GitHub repo for more details of the chatbot implementation.

Clean up

When you’re done testing the models, as a best practice, delete the endpoint to save costs if the endpoint is no longer required:

# Delete the endpoint
sm_client.delete_endpoint(EndpointName=endpoint_name)

Conclusion

In this post, we provided an overview of building applications with generative AI, the challenges, and how SageMaker real-time response streaming helps you address these challenges. We showcased how to build a chatbot application that deploys the Falcon-7B-Instruct model with response streaming on both SageMaker LMI and HuggingFace TGI containers, using an example available on GitHub.

Start building your own cutting-edge streaming applications with LLMs and SageMaker today! Reach out to us for expert guidance and unlock the potential of large model streaming for your projects.

About the Authors

Raghu Ramesha is a Senior ML Solutions Architect with the Amazon SageMaker Service team. He focuses on helping customers build, deploy, and migrate ML production workloads to SageMaker at scale. He specializes in machine learning, AI, and computer vision domains, and holds a master’s degree in Computer Science from UT Dallas. In his free time, he enjoys traveling and photography.

Abhi Shivaditya is a Senior Solutions Architect at AWS, working with strategic global enterprise organizations to facilitate the adoption of AWS services in areas such as artificial intelligence, distributed computing, networking, and storage. His expertise lies in deep learning in the domains of natural language processing (NLP) and computer vision. Abhi assists customers in deploying high-performance machine learning models efficiently within the AWS ecosystem.

Alan Tan is a Senior Product Manager with SageMaker, leading efforts on large model inference. He’s passionate about applying machine learning to the area of analytics. Outside of work, he enjoys the outdoors.

Melanie Li, PhD, is a Senior AI/ML Specialist TAM at AWS based in Sydney, Australia. She helps enterprise customers build solutions using state-of-the-art AI/ML tools on AWS and provides guidance on architecting and implementing ML solutions with best practices. In her spare time, she loves to explore nature and spend time with family and friends.

Sam Edwards is a Cloud Engineer (AI/ML) at AWS Sydney specializing in machine learning and Amazon SageMaker. He is passionate about helping customers solve issues related to machine learning workflows and creating new solutions for them. Outside of work, he enjoys playing racquet sports and traveling.

James Sanders is a Senior Software Engineer at Amazon Web Services. He works on the real-time inference platform for Amazon SageMaker.

FMOps/LLMOps: Operationalize generative AI and differences with MLOps

September 1, 2023

by Sokratis Kartakis Amazon AWS

Nowadays, the majority of our customers are excited about large language models (LLMs) and are thinking about how generative AI could transform their business. However, bringing such solutions and models into business-as-usual operations is not an easy task. In this post, we discuss how to operationalize generative AI applications using MLOps principles, leading to foundation model operations (FMOps). Furthermore, we dive deep into the most common generative AI use case of text-to-text applications and LLM operations (LLMOps), a subset of FMOps. The following figure illustrates the topics we discuss.

Specifically, we briefly introduce MLOps principles and focus on the main differentiators compared to FMOps and LLMOps regarding processes, people, model selection and evaluation, data privacy, and model deployment. This applies to customers that use them out of the box, create foundation models from scratch, or fine-tune them. Our approach applies to both open-source and proprietary models equally.

ML operationalization summary

As defined in the post MLOps foundation roadmap for enterprises with Amazon SageMaker, ML and operations (MLOps) is the combination of people, processes, and technology to productionize machine learning (ML) solutions efficiently. To achieve this, a combination of teams and personas need to collaborate, as illustrated in the following figure.

These teams are as follows:

Advanced analytics team (data lake and data mesh) – Data engineers are responsible for preparing and ingesting data from multiple sources, building ETL (extract, transform, and load) pipelines to curate and catalog the data, and preparing the necessary historical data for the ML use cases. These data owners are focused on providing access to their data to multiple business units or teams.
Data science team – Data scientists need to focus on creating the best model based on predefined key performance indicators (KPIs) working in notebooks. After the completion of the research phase, the data scientists need to collaborate with ML engineers to create automations for building (ML pipelines) and deploying models into production using CI/CD pipelines.
Business team – A product owner is responsible for defining the business case, requirements, and KPIs to be used to evaluate model performance. The ML consumers are other business stakeholders who use the inference results (predictions) to drive decisions.
Platform team – Architects are responsible for the overall cloud architecture of the business and how all the different services are connected together. Security SMEs review the architecture based on business security policies and needs. MLOps engineers are responsible for providing a secure environment for data scientists and ML engineers to productionize the ML use cases. Specifically, they are responsible for standardizing CI/CD pipelines, user and service roles and container creation, model consumption, testing, and deployment methodology based on business and security requirements.
Risk and compliance team – For more restrictive environments, auditors are responsible for assessing the data, code, and model artifacts and making sure that the business is compliant with regulations, such as data privacy.

Note that multiple personas can be covered by the same person depending on the scaling and MLOps maturity of the business.

These personas need dedicated environments to perform the different processes, as illustrated in the following figure.

The environments are as follows:

Platform administration – The platform administration environment is the place where the platform team has access to create AWS accounts and link the right users and data
Data – The data layer, often known as the data lake or data mesh, is the environment that data engineers or owners and business stakeholders use to prepare, interact with, and visualize the data
Experimentation – The data scientists use a sandbox or experimentation environment to test new libraries and ML techniques to prove that their proof of concept can solve business problems
Model build, model test, model deployment – The model build, test, and deployment environment is the layer of MLOps, where data scientists and ML engineers collaborate to automate and move the research to production
ML governance – The last piece of the puzzle is the ML governance environment, where all the model and code artifacts are stored, reviewed, and audited by the corresponding personas

The following diagram illustrates the reference architecture, which has already been discussed in MLOps foundation roadmap for enterprises with Amazon SageMaker.

Each business unit has its own set of development (automated model training and building), preproduction (automatic testing), and production (model deployment and serving) accounts to productionize ML use cases, which retrieve data from a centralized or decentralized data lake or data mesh, respectively. All the produced models and code automation are stored in a centralized tooling account using the capability of a model registry. The infrastructure code for all these accounts is versioned in a shared service account (advanced analytics governance account) that the platform team can abstract, templatize, maintain, and reuse for the onboarding to the MLOps platform of every new team.

Generative AI definitions and differences to MLOps

In classic ML, the preceding combination of people, processes, and technology can help you productionize your ML use cases. However, in generative AI, the nature of the use cases requires either an extension of those capabilities or new capabilities. One of these new notions is the foundation model (FM). FMs are so named because they can be used to create a wide range of other AI models, as illustrated in the following figure.

FMs have been trained on terabytes of data and have hundreds of billions of parameters to be able to predict the next best answer across three main categories of generative AI use cases:

Text-to-text – The FMs (LLMs) have been trained based on unlabeled data (such as free text) and are able to predict the next best word or sequence of words (paragraphs or long essays). Main use cases are around human-like chatbots, summarization, or other content creation such as programming code.
Text-to-image – Labeled data, such as pairs of <text, image>, has been used to train FMs, which are able to predict the best combination of pixels. Example use cases are clothing design generation or imaginary personalized images.
Text-to-audio or video – Both labeled and unlabeled data can be used for FM training. One main generative AI use case example is music composition.

To productionize those generative AI use cases, we need to borrow and extend the MLOps domain to include the following:

FM operations (FMOps) – This can productionize generative AI solutions, including any use case type
LLM operations (LLMOps) – This is a subset of FMOps focusing on productionizing LLM-based solutions, such as text-to-text

The following figure illustrates the overlap of these use cases.

Compared to classic ML and MLOps, FMOps and LLMOps differ along the following main categories, which we cover in the following sections: people and process, selection and adaptation of FM, evaluation and monitoring of FM, data privacy and model deployment, and technology needs. We will cover monitoring in a separate post.

Operationalization journey per generative AI user type

To simplify the description of the processes, we need to categorize the main generative AI user types, as shown in the following figure.

The user types are as follows:

Providers – Users who build FMs from scratch and provide them as a product to other users (fine-tuner and consumer). They have deep end-to-end ML and natural language processing (NLP) expertise and data science skills, and massive data labeler and editor teams.
Fine-tuners – Users who retrain (fine-tune) FMs from providers to fit custom requirements. They orchestrate the deployment of the model as a service for use by consumers. These users need strong end-to-end ML and data science expertise and knowledge of model deployment and inference. Strong domain knowledge for tuning, including prompt engineering, is required as well.
Consumers – Users who interact with generative AI services from providers or fine-tuners by text prompting or a visual interface to complete desired actions. No ML expertise is required; these users are mostly application developers or end-users who understand the service capabilities. Only prompt engineering is necessary for better results.

As per the definition and the required ML expertise, MLOps is required mostly for providers and fine-tuners, while consumers can use application productionization principles, such as DevOps and AppDev to create the generative AI applications. Furthermore, we have observed a movement among the user types, where providers might become fine-tuners to support use cases based on a specific vertical (such as the financial sector) or consumers might become fine-tuners to achieve more accurate results. But let’s observe the main processes per user type.

The journey of consumers

The following figure illustrates the consumer journey.

As previously mentioned, consumers are required to select, test, and use an FM, interacting with it by providing specific inputs, otherwise known as prompts. Prompts, in the context of computer programming and AI, refer to the input that is given to a model or system to generate a response. This can be in the form of a text, command, or a question, which the system uses to process and generate an output. The output generated by the FM can then be utilized by end-users, who should also be able to rate these outputs to enhance the model’s future responses.

Beyond these fundamental processes, we’ve noticed consumers expressing a desire to fine-tune a model by harnessing the functionality offered by fine-tuners. Take, for instance, a website that generates images. Here, end-users can set up private accounts, upload personal photos, and subsequently generate content related to those images (for example, generating an image depicting the end-user on a motorbike wielding a sword or located in an exotic location). In this scenario, the generative AI application, designed by the consumer, must interact with the fine-tuner backend via APIs to deliver this functionality to the end-users.

However, before we delve into that, let’s first concentrate on the journey of model selection, testing, usage, input and output interaction, and rating, as shown in the following figure.

*15K available FM reference

Step 1. Understand top FM capabilities

There are many dimensions that need to be considered when selecting foundation models, depending on the use case, the data available, regulations, and so on. A good checklist, although not comprehensive, might be the following:

Proprietary or open-source FM – Proprietary models often come at a financial cost, but they typically offer better performance (in terms of quality of the generated text or image), often being developed and maintained by dedicated teams of model providers who ensure optimal performance and reliability. On the other hand, we also see adoption of open-source models that, other than being free, offer additional benefits of being accessible and flexible (for example, every open-source model is fine-tunable). An example of a proprietary model is Anthropic’s Claude model, and an example of a high performing open-source model is Falcon-40B, as of July 2023.
Commercial license – Licensing considerations are crucial when deciding on an FM. It’s important to note that some models are open-source but can’t be used for commercial purposes, due to licensing restrictions or conditions. The differences can be subtle: The newly released xgen-7b-8k-base model, for example, is open source and commercially usable (Apache-2.0 license), whereas the instruction fine-tuned version of the model, xgen-7b-8k-inst, is released for research purposes only. When selecting an FM for a commercial application, it’s essential to verify the license agreement, understand its limitations, and ensure it aligns with the intended use of the project.
Parameters – The number of parameters, which consist of the weights and biases in the neural network, is another key factor. More parameters generally means a more complex and potentially powerful model, because it can capture more intricate patterns and correlations in the data. However, the trade-off is that it requires more computational resources and, therefore, costs more to run. Additionally, we do see a trend towards smaller models, especially in the open-source space (models ranging from 7–40 billion parameters), that perform well, especially when fine-tuned.
Speed – The speed of a model is influenced by its size. Larger models tend to process data more slowly (higher latency) due to the increased computational complexity. Therefore, it’s crucial to balance the need for a model with high predictive power (often larger models) with the practical requirements for speed, especially in applications, like chatbots, that demand real-time or near-real-time responses.
Context window size (number of tokens) – The context window, defined by the maximum number of tokens that can be input or output per prompt, is crucial in determining how much context the model can consider at a time (a token roughly translates to 0.75 words for English). Models with larger context windows can understand and generate longer sequences of text, which can be useful for tasks involving longer conversations or documents.
Training dataset – It’s also important to understand what kind of data the FM was trained on. Some models may be trained on diverse text datasets like internet data, coding scripts, instructions, or human feedback. Others may also be trained on multimodal datasets, like combinations of text and image data. This can influence the model’s suitability for different tasks. In addition, an organization might have copyright concerns depending on the exact sources a model has been trained on—therefore, it’s mandatory to inspect the training dataset closely.
Quality – The quality of an FM can vary based on its type (proprietary vs. open source), size, and what it was trained on. Quality is context-dependent, meaning what is considered high-quality for one application might not be for another. For example, a model trained on internet data might be considered high quality for generating conversational text, but less so for technical or specialized tasks.
Fine-tunable – The ability to fine-tune an FM by adjusting its model weights or layers can be a crucial factor. Fine-tuning allows for the model to better adapt to the specific context of the application, improving performance on the specific task at hand. However, fine-tuning requires additional computational resources and technical expertise, and not all models support this feature. Open-source models are (in general) always fine-tunable because the model artifacts are available for downloading and the users are able to extend and use them at will. Proprietary models might sometimes offer the option of fine-tuning.
Existing customer skills – The selection of an FM can also be influenced by the skills and familiarity of the customer or the development team. If an organization has no AI/ML experts in their team, then an API service might be better suited for them. Also, if a team has extensive experience with a specific FM, it might be more efficient to continue using it rather than investing time and resources to learn and adapt to a new one.
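
To make the context window criterion from the preceding checklist more concrete, the following minimal sketch estimates token counts from the rule of thumb that one token corresponds to roughly 0.75 English words. It is only an approximation under that assumption; an exact count requires the model's own tokenizer. The function names are hypothetical.

```python
import math

WORDS_PER_TOKEN = 0.75  # rule of thumb for English text


def estimate_tokens(text: str) -> int:
    """Approximate the number of tokens from a whitespace word count."""
    return math.ceil(len(text.split()) / WORDS_PER_TOKEN)


def fits_context_window(prompt: str, max_new_tokens: int, context_window: int) -> bool:
    """Check whether the prompt plus the requested output fits the window."""
    return estimate_tokens(prompt) + max_new_tokens <= context_window


short_prompt = "Summarize the following paragraph in one sentence."
print(estimate_tokens(short_prompt), fits_context_window(short_prompt, 400, 2048))
```

A check like this can help rule out models early when the expected documents or conversations are clearly longer than the context window.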

The following is an example of two shortlists, one for proprietary models and one for open-source models. You might compile similar tables based on your specific needs to get a quick overview of the available options. Note that the performance and parameters of those models change rapidly and might be outdated by the time of reading, while other capabilities might be important for specific customers, such as the supported language.

The following is an example of notable proprietary FMs available in AWS (July 2023).

The following is an example of notable open-source FMs available in AWS (July 2023).

After you have compiled an overview of 10–20 potential candidate models, it becomes necessary to further refine this shortlist. In this section, we propose a swift mechanism that will yield two or three viable final models as candidates for the next round.

The following diagram illustrates the initial shortlisting process.

Typically, prompt engineers, who are experts in creating high-quality prompts that allow AI models to understand and process user inputs, experiment with various methods to perform the same task (such as summarization) on a model. We suggest that these prompts are not created on the fly, but are systematically extracted from a prompt catalog. This prompt catalog is a central location for storing prompts to avoid replications, enable version control, and share prompts within the team to ensure consistency between different prompt testers in the different development stages, which we introduce in the next section. This prompt catalog is analogous to a Git repository of a feature store. The generative AI developer, who could potentially be the same person as the prompt engineer, then needs to evaluate the output to determine if it would be suitable for the generative AI application they are seeking to develop.
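
As an illustration of the prompt catalog idea, the following is a minimal in-memory sketch that stores versioned prompt templates per task and skips an addition that repeats the latest version. The class name, method names, and storage layout are hypothetical; a real catalog would more likely be backed by a Git repository or a database.

```python
from dataclasses import dataclass, field


@dataclass
class PromptCatalog:
    # Maps (task, name) to the ordered list of template versions.
    _entries: dict = field(default_factory=dict)

    def add(self, task: str, name: str, template: str) -> int:
        """Store a template and return its 1-based version number.

        If the template equals the latest stored version, nothing is added.
        """
        versions = self._entries.setdefault((task, name), [])
        if not versions or versions[-1] != template:
            versions.append(template)
        return len(versions)

    def get(self, task: str, name: str, version=None) -> str:
        """Return a specific version, or the latest one if version is None."""
        versions = self._entries[(task, name)]
        return versions[-1] if version is None else versions[version - 1]

    def list_prompts(self, task: str) -> list:
        """Return the names of all prompts registered for a task."""
        return sorted(name for (t, name) in self._entries if t == task)


catalog = PromptCatalog()
catalog.add("summarization", "short", "Summarize the text: {text}")
catalog.add("summarization", "short", "Summarize the text in one sentence: {text}")
print(catalog.get("summarization", "short").format(text="..."))
```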

Step 2. Test and evaluate the top FM

After the shortlist is reduced to approximately three FMs, we recommend an evaluation step to further test the FMs’ capabilities and suitability for the use case. Depending on the availability and nature of evaluation data, we suggest different methods, as illustrated in the following figure.

The method to use first depends on whether you have labeled test data or not.

If you have labeled data, you can use it to conduct a model evaluation, as we do with the traditional ML models (input some samples and compare the output with the labels). Depending on whether the test data has discrete labels (such as positive, negative, or neutral sentiment analysis) or is unstructured text (such as summarization), we propose different methods for evaluation:

Accuracy metrics – In case of discrete outputs (such as sentiment analysis), we can use standard accuracy metrics such as precision, recall, and F1 score
Similarity metrics – If the output is unstructured (such as a summary), we suggest similarity metrics like ROUGE and cosine similarity
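
The two families of metrics can be sketched in plain Python as follows. Note the assumptions: the ROUGE function is a simplified unigram-recall variant (ROUGE-1 recall without stemming), and the cosine similarity is computed over bag-of-words counts rather than embeddings. In practice, you would more likely rely on established evaluation libraries.

```python
import math
from collections import Counter


def precision_recall_f1(y_true, y_pred, positive):
    """Precision, recall, and F1 score for one class of a discrete task."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if p == positive and t == positive)
    fp = sum(1 for t, p in pairs if p == positive and t != positive)
    fn = sum(1 for t, p in pairs if p != positive and t == positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def rouge1_recall(reference: str, candidate: str) -> float:
    """Share of reference unigrams that also appear in the candidate."""
    ref = Counter(reference.lower().split())
    cand = Counter(candidate.lower().split())
    overlap = sum((ref & cand).values())
    total = sum(ref.values())
    return overlap / total if total else 0.0


def cosine_similarity(a: str, b: str) -> float:
    """Cosine similarity between bag-of-words count vectors."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    norm = math.sqrt(sum(c * c for c in va.values()) * sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0


labels = ["positive", "negative", "positive", "neutral"]
preds = ["positive", "positive", "negative", "neutral"]
print(precision_recall_f1(labels, preds, positive="positive"))
print(rouge1_recall("the cat sat on the mat", "the cat sat"))
```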

Some use cases don’t lend themselves to having one true answer (for example, “Create a short children’s story for my 5-year-old daughter”). In such cases, it becomes more challenging to evaluate the models because you don’t have labeled test data. We propose two approaches, depending on the importance of human review of the model versus automated evaluation:

Human-in-the-Loop (HIL) – In this case, a team of prompt testers will review the responses from a model. Depending on how critical the application is, the prompt testers might review 100% of the model outputs or just a sample.
LLM-powered evaluation – In this scenario, the prompt testers are replaced by an LLM, ideally one that is more powerful (although perhaps slower and more costly) than the ones being tested. The LLM will review all model-generated responses and score them. This method may result in lower-quality evaluations, but it’s a cheaper and faster evaluation option that might provide a good initial gauge on the models’ performance.

For example, we can use the following example prompt to an LLM, which includes the input query, the LLM response, and instructions: “We would like to request your feedback on the performance of an AI assistant in response to the user question displayed above. Please rate the helpfulness, relevance, accuracy, level of details of the assistant’s response. The assistant shall receive an overall score on a scale of 1 to 10, where a higher score indicates better overall performance.”

This prompt will be sent to another LLM (the EvalLLM) for evaluation. The following is an example where the response of the Falcon-40B model has been evaluated by an EvalLLM (in this case Claude). We use the question “What if the Suez Canal had never been constructed?” The response is as follows:

Question: What if the Suez Canal had never been constructed?

falcon-40b-instruct

If the Suez Canal had never been constructed, ships would have to travel around Africa to navigate between the Mediterranean and the Red Sea. This would add significant time and distance to the voyage, making it less efficient and more expensive. Additionally, without the Suez Canal, many countries in the Middle East and North Africa would have been much less connected to the rest of the world, hindering economic and cultural development.

Review (Score: 9/10)

The assistant’s response is highly relevant, accurate, and detailed. It provides an excellent answer to the user’s hypothetical question about the non-existence of the Suez Canal. The assistant correctly points out the implications on maritime travel and the economic and cultural impact on the Middle East and North Africa. However, it could have further elaborated on the geopolitical implications or the impact on global trade patterns for a more comprehensive response.
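
The mechanics of this kind of LLM-powered evaluation can be sketched without calling any model. The following minimal example assembles the evaluation prompt and extracts the numeric score from a review. The section markers in the prompt layout and the helper names are assumptions made for illustration, and the score parser assumes the EvalLLM reports its rating in an "N/10" form.

```python
import re

EVAL_INSTRUCTIONS = (
    "We would like to request your feedback on the performance of an AI "
    "assistant in response to the user question displayed above. Please rate "
    "the helpfulness, relevance, accuracy, level of details of the assistant's "
    "response. The assistant shall receive an overall score on a scale of 1 to "
    "10, where a higher score indicates better overall performance."
)


def build_eval_prompt(question: str, answer: str) -> str:
    """Combine the input query, the FM response, and the instructions."""
    return (
        f"[Question]\n{question}\n\n"
        f"[Assistant response]\n{answer}\n\n"
        f"[Instructions]\n{EVAL_INSTRUCTIONS}"
    )


def parse_score(review: str):
    """Return the first 'N/10' score found in the review, or None."""
    match = re.search(r"(\d+(?:\.\d+)?)\s*/\s*10", review)
    return float(match.group(1)) if match else None


prompt = build_eval_prompt(
    "What if the Suez Canal had never been constructed?",
    "If the Suez Canal had never been constructed, ships would have to travel around Africa...",
)
print(parse_score("Review (Score: 9/10)"))
```

Parsing the score into a number is what allows the reviews to be aggregated across many prompts and compared between models.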

The following figure illustrates the end-to-end evaluation process example.

Based on this example, to perform evaluation, we need to provide the example prompts, which we store in the prompt catalog, and a labeled or unlabeled evaluation dataset based on our specific applications. For example, with a labeled evaluation dataset, we can provide prompts (input and query) such as “Give me the full name of the UK PM in 2023” and outputs and answers, such as “Rishi Sunak.” With an unlabeled dataset, we provide just the question or instruction, such as “Generate the source code for a retail website.” We call the combination of prompt catalog and evaluation dataset the evaluation prompt catalog. We differentiate the prompt catalog from the evaluation prompt catalog because the latter is dedicated to a specific use case, whereas the prompt catalog contains generic prompts and instructions (such as question answering).

With this evaluation prompt catalog, the next step is to feed the evaluation prompts to the top FMs. The result is an evaluation result dataset that contains the prompts, outputs of each FM, and the labeled output together with a score (if it exists). In the case of an unlabeled evaluation prompt catalog, there is an additional step for an HIL or LLM to review the results and provide a score and feedback (as we described earlier). The final outcome will be aggregated results that combine the scores of all the outputs (calculate the average precision or human rating) and allow the users to benchmark the quality of the models.
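
The flow from evaluation prompt catalog to aggregated results can be sketched as follows. This is a simplified illustration for the labeled case: the FMs are represented by plain Python callables standing in for real endpoints, the scorer is an exact-match check, and all names and the second catalog entry are hypothetical.

```python
from statistics import mean


def exact_match(output: str, label: str) -> float:
    """Score 1.0 when the output equals the label (ignoring case and spaces)."""
    return 1.0 if output.strip().lower() == label.strip().lower() else 0.0


def evaluate_models(models: dict, eval_catalog: list, scorer) -> list:
    """Feed every evaluation prompt to every FM and record scored outputs."""
    results = []
    for name, generate in models.items():
        for item in eval_catalog:
            output = generate(item["prompt"])
            results.append({
                "model": name,
                "prompt": item["prompt"],
                "output": output,
                "label": item["label"],
                "score": scorer(output, item["label"]),
            })
    return results


def aggregate(results: list) -> dict:
    """Average the scores per model to benchmark the candidates."""
    by_model = {}
    for row in results:
        by_model.setdefault(row["model"], []).append(row["score"])
    return {model: mean(scores) for model, scores in by_model.items()}


eval_catalog = [
    {"prompt": "Give me the full name of the UK PM in 2023", "label": "Rishi Sunak"},
    {"prompt": "What is 2 + 2?", "label": "4"},
]
answers = {item["prompt"]: item["label"] for item in eval_catalog}
models = {
    "FM1": lambda prompt: answers[prompt],  # stub that always answers correctly
    "FM2": lambda prompt: "Rishi Sunak",    # stub that always gives the same answer
}
summary = aggregate(evaluate_models(models, eval_catalog, exact_match))
print(summary)
```

For an unlabeled catalog, the scorer would be replaced by the human or LLM review step, but the aggregation stays the same.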

After the evaluation results have been collected, we propose choosing a model based on several dimensions. These typically come down to factors such as precision, speed, and cost. The following figure shows an example.

Each model will possess strengths and certain trade-offs along these dimensions. Depending on the use case, we should assign varying priorities to these dimensions. In the preceding example, we elected to prioritize cost as the most important factor, followed by precision, and then speed. Even though FM2 is slower and not as efficient as FM1, it remains sufficiently effective and significantly cheaper to host. Consequently, we might select FM2 as the top choice.
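
One simple way to turn such priorities into a ranking is a weighted sum over normalized scores. The following sketch uses made-up numbers purely for illustration: each dimension is assumed to be normalized to the range 0–1 with higher values being better (so a higher cost score means cheaper to host), and the weights encode the priority order of cost, then precision, then speed.

```python
# Hypothetical normalized scores (0-1, higher is better on every dimension).
candidates = {
    "FM1": {"precision": 0.9, "speed": 0.8, "cost": 0.3},
    "FM2": {"precision": 0.8, "speed": 0.6, "cost": 0.9},
    "FM3": {"precision": 0.6, "speed": 0.9, "cost": 0.7},
}

# Assumed weights reflecting the priorities: cost > precision > speed.
weights = {"cost": 0.5, "precision": 0.3, "speed": 0.2}


def weighted_score(scores: dict, weights: dict) -> float:
    """Weighted sum of the per-dimension scores."""
    return sum(weights[dim] * scores[dim] for dim in weights)


ranking = sorted(
    candidates,
    key=lambda name: weighted_score(candidates[name], weights),
    reverse=True,
)
best = ranking[0]
print(ranking)
```

With different weights, for example when latency matters most, the same scores can lead to a different top choice, which is why the priorities should be agreed on before comparing models.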

Step 3. Develop the generative AI application backend and frontend

At this point, the generative AI developers have selected the right FM for the specific application together with the help of prompt engineers and testers. The next step is to start developing the generative AI application. We have separated the development of the generative AI application into two layers, a backend and a frontend, as shown in the following figure.

On the backend, the generative AI developers incorporate the selected FM into the solutions and work together with the prompt engineers to create the automation to transform the end-user input to appropriate FM prompts. The prompt testers create the necessary entries to the prompt catalog for automatic or manual (HIL or LLM) testing. Then, the generative AI developers create the prompt chaining and application mechanism to provide the final output. Prompt chaining, in this context, is a technique to create more dynamic and contextually-aware LLM applications. It works by breaking down a complex task into a series of smaller, more manageable sub-tasks. For example, if we ask an LLM the question “Where was the prime minister of the UK born and how far is that place from London,” the task can be broken down into individual prompts, where a prompt might be built based on the answer of a previous prompt evaluation, such as “Who is the prime minister of the UK,” “What is their birthplace,” and “How far is that place from London?” To ensure a certain input and output quality, the generative AI developers also need to create the mechanism to monitor and filter the end-user inputs and application outputs. If, for example, the LLM application is supposed to avoid toxic requests and responses, they could apply a toxicity detector for input and output and filter those out. Lastly, they need to provide a rating mechanism, which will support the augmentation of the evaluation prompt catalog with good and bad examples. A more detailed representation of those mechanisms will be presented in future posts.
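
The prompt chaining technique described above can be sketched as a loop in which each prompt is built from the answer to the previous one. In this minimal example, the llm argument is any callable that maps a prompt to a response; here it is a stub with canned placeholder answers rather than a real model call, and the function name and the {previous} placeholder convention are assumptions made for illustration.

```python
def run_chain(llm, templates: list):
    """Run prompts in sequence, feeding each answer into the next prompt.

    Each template may contain a {previous} placeholder that is filled with
    the answer to the preceding prompt.
    """
    answer = ""
    trace = []
    for template in templates:
        prompt = template.format(previous=answer)
        answer = llm(prompt)
        trace.append((prompt, answer))
    return answer, trace


templates = [
    "Who is the prime minister of the UK?",
    "What is the birthplace of {previous}?",
    "How far is {previous} from London?",
]

# Stub LLM with placeholder answers, standing in for a real FM endpoint.
canned = {
    "Who is the prime minister of the UK?": "PERSON_A",
    "What is the birthplace of PERSON_A?": "CITY_B",
    "How far is CITY_B from London?": "N km",
}
final_answer, trace = run_chain(canned.__getitem__, templates)
for prompt, answer in trace:
    print(f"{prompt} -> {answer}")
```

Keeping the trace of intermediate prompts and answers is useful for the monitoring, filtering, and rating mechanisms, because each step can be inspected separately.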

To provide the functionality to the generative AI end-user, the development of a frontend website that interacts with the backend is necessary. Therefore, DevOps and AppDevs (application developers on the cloud) personas need to follow best development practices to implement the functionality of input/output and rating.

In addition to this basic functionality, the frontend and backend need to incorporate the feature of creating personal user accounts, uploading data, initiating fine-tuning as a black box, and using the personalized model instead of the basic FM. The productionization of a generative AI application is similar to that of a normal application. The following figure depicts an example architecture.

In this architecture, the generative AI developers, prompt engineers, and DevOps or AppDevs create and test the application manually by deploying it via CI/CD to a development environment (generative AI App Dev in the preceding figure) using dedicated code repositories and merging with the dev branch. At this stage, the generative AI developers use the corresponding FM by calling the API provided by the FM providers or fine-tuners. Then, to test the application extensively, they need to promote the code to the test branch, which triggers the deployment via CI/CD to the preproduction environment (generative AI App Pre-prod). In this environment, the prompt testers need to try a large number of prompt combinations and review the results. The combinations of prompts, outputs, and reviews need to be moved to the evaluation prompt catalog to automate the testing process in the future. After this extensive test, the last step is to promote the generative AI application to production via CI/CD by merging with the main branch (generative AI App Prod). Note that all the data, including the prompt catalog, evaluation data and results, end-user data and metadata, and fine-tuned model metadata, needs to be stored in the data lake or data mesh layer. The CI/CD pipelines and repositories need to be stored in a separate tooling account (similar to the one described for MLOps).

The journey of providers

FM providers need to train FMs, such as deep learning models. For them, the end-to-end MLOps lifecycle and infrastructure are necessary, with additions in historical data preparation, model evaluation, and monitoring. The following figure illustrates their journey.

In classic ML, the historical data is most often created by feeding the ground truth via ETL pipelines. For example, in a churn prediction use case, an automation updates a database table to churn/not churn based on the new status of a customer. FMs, in contrast, need billions of either labeled or unlabeled data points. In text-to-image use cases, a team of data labelers needs to label <text, image> pairs manually. This is an expensive exercise that requires a large number of people. Amazon SageMaker Ground Truth Plus can provide a team of labelers to perform this activity for you. For some use cases, this process can also be partially automated, for example by using CLIP-like models. In the case of an LLM, such as text-to-text, the data is unlabeled. However, it needs to be prepared to follow the format of the existing historical unlabeled data. Therefore, data editors are needed to perform the necessary data preparation and ensure consistency.

With the historical data prepared, the next step is the training and productionization of the model. Note that the same evaluation techniques as we described for consumers can be used.

The journey of fine-tuners

Fine-tuners aim to adapt an existing FM to their specific context. For example, an FM can summarize a general-purpose text but can’t accurately summarize a financial report, or it can’t generate source code for an uncommon programming language. In those cases, the fine-tuners need to label data, fine-tune a model by running a training job, deploy the model, test it based on the consumer processes, and monitor the model. The following diagram illustrates this process.

For the time being, there are two fine-tuning mechanisms:

Fine-tuning – By using an FM and labeled data, a training job recalculates the weights and biases of the deep learning model layers. This process can be computationally intensive and requires a representative amount of data but can generate accurate results.
Parameter-efficient fine-tuning (PEFT) – Instead of recalculating all the weights and biases, researchers have shown that by adding small layers to the deep learning models, they can achieve satisfactory results (for example, LoRA). PEFT requires less computational power than deep fine-tuning and a training job with less input data. The drawback is potentially lower accuracy.
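To give a feel for why PEFT is cheaper, the following sketch compares the number of trainable parameters for a single weight matrix under full fine-tuning and under LoRA, which keeps the original matrix frozen and trains two low-rank factors. The layer size (4096 x 4096) and the rank (8) are assumptions chosen for illustration, not values from this post.

```python
def full_finetune_params(d_out: int, d_in: int) -> int:
    """Trainable parameters when the whole d_out x d_in weight matrix is updated."""
    return d_out * d_in


def lora_params(d_out: int, d_in: int, rank: int) -> int:
    """Trainable parameters for LoRA on the same matrix.

    LoRA freezes the original matrix and trains two low-rank factors,
    B (d_out x rank) and A (rank x d_in).
    """
    return rank * (d_out + d_in)


if __name__ == "__main__":
    # Hypothetical layer size and rank, for illustration only.
    full = full_finetune_params(4096, 4096)
    lora = lora_params(4096, 4096, rank=8)
    print(f"full fine-tuning: {full:,} trainable parameters")
    print(f"LoRA (rank 8):    {lora:,} trainable parameters")
    print(f"ratio:            {lora / full:.4%}")
```

Under these assumed sizes, LoRA trains well under 1% of the parameters of that layer, which is the source of the lower compute and data requirements mentioned above.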

The following diagram illustrates these mechanisms.

Now that we have defined the two main fine-tuning methods, the next step is to determine how we can deploy and use open-source and proprietary FMs.

With open-source FMs, the fine-tuners can download the model artifact and the source code from the web, for example, by using the Hugging Face Model Hub. This gives you the flexibility to deep fine-tune the model, store it in a local model registry, and deploy it to an Amazon SageMaker endpoint. This process requires an internet connection. To support more secure environments (such as for customers in the financial sector), you can download the model on premises, run all the necessary security checks, and upload it to a local bucket in an AWS account. Then, the fine-tuners use the FM from the local bucket without an internet connection. This ensures data privacy, and the data doesn’t travel over the internet. The following diagram illustrates this method.

With proprietary FMs, the deployment process is different because the fine-tuners don’t have access to the model artifact or source code. The models are stored in proprietary FM provider AWS accounts and model registries. To deploy such a model to a SageMaker endpoint, the fine-tuners can request only the model package that will be deployed directly to an endpoint. This process requires customer data to be used in the proprietary FM providers’ accounts, which raises questions regarding customer-sensitive data being used in a remote account to perform fine-tuning, and models being hosted in a model registry that is shared among multiple customers. This leads to a multi-tenancy problem that becomes more challenging if the proprietary FM providers need to serve these models. If the fine-tuners use Amazon Bedrock, these challenges are resolved—the data doesn’t travel over the internet and the FM providers don’t have access to fine-tuners’ data. The same challenges hold for the open-source models if the fine-tuners want to serve models from multiple customers, such as the example we gave earlier with the website that thousands of customers will upload personalized images to. However, these scenarios can be considered controllable because only the fine-tuner is involved. The following diagram illustrates this method.

From a technology perspective, the architecture that a fine-tuner needs to support is like the one for MLOps (see the following figure). The fine-tuning needs to be conducted in dev by creating ML pipelines, such as using Amazon SageMaker Pipelines; performing preprocessing, fine-tuning (training job), and postprocessing; and sending the fine-tuned models to a local model registry in the case of an open-source FM (otherwise, the new model will be stored in the proprietary FM provider’s environment). Then, in pre-production, we need to test the model as we describe for the consumers’ scenario. Finally, the model will be served and monitored in prod. Note that current (fine-tuned) FMs require GPU instance endpoints. If we deploy each fine-tuned model to a separate endpoint, the cost can grow quickly when there are hundreds of models. Therefore, we need to use multi-model endpoints and resolve the multi-tenancy challenge.
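The cost argument for multi-model endpoints comes down to simple arithmetic on the number of endpoints. The sketch below uses made-up figures (300 fine-tuned models, 25 models per endpoint); how many models actually fit on one endpoint depends on model size and instance memory, which this post doesn't specify.

```python
import math


def endpoints_needed(num_models: int, models_per_endpoint: int) -> int:
    """Number of endpoints required to host num_models fine-tuned models."""
    if models_per_endpoint < 1:
        raise ValueError("models_per_endpoint must be at least 1")
    return math.ceil(num_models / models_per_endpoint)


if __name__ == "__main__":
    # Hypothetical numbers, for illustration only.
    dedicated = endpoints_needed(300, models_per_endpoint=1)
    shared = endpoints_needed(300, models_per_endpoint=25)
    print(f"one endpoint per model: {dedicated} GPU endpoints")
    print(f"multi-model endpoints:  {shared} GPU endpoints")
```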

The fine-tuners adapt an FM to a specific context to use it for their business purpose. That means that most of the time, the fine-tuners are also consumers who need to support all the layers described in the previous sections, including generative AI application development, data lake and data mesh, and MLOps.

The following figure illustrates the complete FM fine-tuning lifecycle that the fine-tuners need to provide the generative AI end-user.

The following figure illustrates the key steps.

The key steps are the following:

The end-user creates a personal account and uploads private data.
The data is stored in the data lake and is preprocessed to follow the format that the FM expects.
This triggers a fine-tuning ML pipeline that adds the model to the model registry.
From there, either the model is deployed to production with minimum testing or the model goes through extensive testing with HIL and manual approval gates.
The fine-tuned model is made available for end-users.
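The key steps above can be sketched as a small control flow. Everything here is hypothetical (the class, the function, and the step labels are made up for illustration); a real implementation would call the data lake, ML pipeline, and model registry services rather than append strings.

```python
from dataclasses import dataclass, field


@dataclass
class FineTuningRequest:
    """Hypothetical record of one end-user's fine-tuning request."""
    user_id: str
    extensive_testing: bool
    steps: list = field(default_factory=list)


def run_lifecycle(request: FineTuningRequest, approved_by_reviewer: bool = True) -> bool:
    """Walk through the key steps; return True if the model reaches end-users."""
    request.steps.append("store and preprocess data in the data lake")
    request.steps.append("run fine-tuning pipeline and register the model")
    if request.extensive_testing:
        request.steps.append("extensive testing with HIL")
        if not approved_by_reviewer:
            request.steps.append("rejected at manual approval gate")
            return False
        request.steps.append("manual approval")
    else:
        request.steps.append("minimum testing")
    request.steps.append("deploy to production")
    return True


if __name__ == "__main__":
    req = FineTuningRequest(user_id="user-1", extensive_testing=True)
    print(run_lifecycle(req), req.steps)
```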

Because this infrastructure is complex for non-enterprise customers, AWS released Amazon Bedrock to offload the effort of creating such architectures and bringing fine-tuned FMs closer to production.

FMOps and LLMOps personas and processes differentiators

Based on the preceding user type journeys (consumer, provider, and fine-tuner), new personas with specific skills are required, as illustrated in the following figure.

The new personas are as follows:

Data labelers and editors – These users label data, such as <text, image> pairs, or prepare unlabeled data, such as free text, and extend the advanced analytics team and data lake environments.
Fine-tuners – These users have deep knowledge of FMs and know how to tune them, extending the data science team that will focus on classic ML.
Generative AI developers – They have deep knowledge of selecting FMs, chaining prompts and applications, and filtering inputs and outputs. They belong to a new team: the generative AI application team.
Prompt engineers – These users design the input and output prompts to adapt the solution to the context, and they test and create the initial version of the prompt catalog. Their team is the generative AI application team.
Prompt testers – They test the generative AI solution (backend and frontend) at scale and feed their results into the prompt catalog and evaluation dataset. Their team is the generative AI application team.
AppDev and DevOps – They develop the front end (such as a website) of the generative AI application. Their team is the generative AI application team.
Generative AI end-users – These users consume generative AI applications as black boxes, share data, and rate the quality of the output.

The extended version of the MLOps process map to incorporate generative AI can be illustrated with the following figure.

A new application layer is the environment where generative AI developers, prompt engineers, prompt testers, and AppDevs create the backend and front end of generative AI applications. The generative AI end-users interact with the front end of generative AI applications via the internet (such as a web UI). On the other side, data labelers and editors need to preprocess the data without accessing the backend of the data lake or data mesh. Therefore, a web UI (website) with an editor is necessary for interacting securely with the data. SageMaker Ground Truth provides this functionality out of the box.

Conclusion

MLOps can help us productionize ML models efficiently. However, to operationalize generative AI applications, you need additional skills, processes, and technologies, leading to FMOps and LLMOps. In this post, we defined the main concepts of FMOps and LLMOps and described the key differentiators compared to MLOps capabilities in terms of people, processes, technology, FM selection, and evaluation. Furthermore, we illustrated the thought process of a generative AI developer and the development lifecycle of a generative AI application.

In the future, we will focus on providing solutions for each domain we discussed, and will provide more details on how to integrate FM monitoring (such as toxicity, bias, and hallucination) and third-party or private data source architectural patterns, such as Retrieval Augmented Generation (RAG), into FMOps/LLMOps.

To learn more, refer to MLOps foundation roadmap for enterprises with Amazon SageMaker and try out the end-to-end solution in Implementing MLOps practices with Amazon SageMaker JumpStart pre-trained models.

If you have any comments or questions, please leave them in the comments section.

About the Authors

Dr. Sokratis Kartakis is a Senior Machine Learning and Operations Specialist Solutions Architect for Amazon Web Services. Sokratis focuses on enabling enterprise customers to industrialize their Machine Learning (ML) solutions by exploiting AWS services and shaping their operating model, i.e. MLOps foundation, and transformation roadmap leveraging best development practices. He has spent 15+ years on inventing, designing, leading, and implementing innovative end-to-end production-level ML and Internet of Things (IoT) solutions in the domains of energy, retail, health, finance/banking, motorsports etc. Sokratis likes to spend his spare time with family and friends, or riding motorbikes.

Heiko Hotz is a Senior Solutions Architect for AI & Machine Learning with a special focus on natural language processing, large language models, and generative AI. Prior to this role, he was the Head of Data Science for Amazon’s EU Customer Service. Heiko helps our customers be successful in their AI/ML journey on AWS and has worked with organizations in many industries, including insurance, financial services, media and entertainment, healthcare, utilities, and manufacturing. In his spare time, Heiko travels as much as possible.

Use Amazon SageMaker Model Card sharing to improve model governance

August 31, 2023

by Vishal Naik Amazon AWS

As Artificial Intelligence (AI) and Machine Learning (ML) technologies have become mainstream, many enterprises have been successful in building critical business applications powered by ML models at scale in production. However, because these ML models make critical business decisions, it’s important for enterprises to add proper guardrails throughout their ML lifecycle. Guardrails ensure that the security, privacy, and quality of the code, configuration, data, and model configuration used in the model lifecycle are versioned and preserved.

Implementing these guardrails is getting harder for enterprises because the ML processes and activities within enterprises are becoming more complex due to the inclusion of deeply involved processes that require contributions from multiple stakeholders and personas. In addition to data engineers and data scientists, operational processes have been introduced to automate and streamline the ML lifecycle. Additionally, the growing number of business stakeholders, and in some cases legal and compliance reviewers, need capabilities that add transparency for managing access control, activity tracking, and reporting across the ML lifecycle.

The framework that gives systematic visibility into ML model development, validation, and usage is called ML governance. During AWS re:Invent 2022, AWS introduced new ML governance tools for Amazon SageMaker that simplify access control and enhance transparency over your ML projects. One of the tools available as part of ML governance is Amazon SageMaker Model Cards, which lets you create a single source of truth for model information by centralizing and standardizing documentation throughout the model lifecycle.

SageMaker model cards enable you to standardize how models are documented, thereby achieving visibility into the lifecycle of a model, from design and build through training and evaluation. Model cards are intended to be a single source of truth for business and technical metadata about the model that can reliably be used for auditing and documentation purposes. They provide a fact sheet of the model that is important for model governance.

As you scale your models, projects, and teams, as a best practice we recommend that you adopt a multi-account strategy that provides project and team isolation for ML model development and deployment. For more information about improving governance of your ML models, refer to Improve governance of your machine learning models with Amazon SageMaker.

Architecture overview

The architecture is implemented as follows:

Data Science Account – Data Scientists conduct their experiments in SageMaker Studio and build an MLOps setup to deploy models to staging/production environments using SageMaker Projects.
ML Shared Services Account – The MLOps setup from the Data Science account will trigger continuous integration and continuous delivery (CI/CD) pipelines using AWS CodeCommit and AWS CodePipeline.
Dev Account – The CI/CD pipelines will further trigger ML pipelines in this account covering data preprocessing, model training, and postprocessing such as model evaluation and registration. The output of these pipelines will deploy the model to SageMaker endpoints to be consumed for inference purposes. Depending on your governance requirements, the Data Science and Dev accounts can be merged into a single AWS account.
Data Account – The ML pipelines running in the Dev Account will pull the data from this account.
Test and Prod Accounts – The CI/CD pipelines will continue the deployment after the Dev Account to set up SageMaker endpoint configuration in these accounts.
Security and Governance – Services like AWS Identity and Access Management (IAM), AWS IAM Identity Center, AWS CloudTrail, AWS Key Management Service (AWS KMS), Amazon CloudWatch, and AWS Security Hub will be used across these accounts as part of security and governance.

The following diagram illustrates this architecture.

For more information about setting up a scalable multi-account ML architecture, refer to MLOps foundation for enterprises with Amazon SageMaker.

Our customers need the capability to share model cards across accounts to improve visibility and governance of their models through information shared in the model card. Now, with cross-account model card sharing, customers can enjoy the benefits of a multi-account strategy while having access to the available model cards in their organization, so they can accelerate collaboration and ensure governance.

In this post, we show how to set up and access model cards across Model Development Lifecycle (MDLC) accounts using the new cross-account sharing feature of the model card. First, we will describe a scenario and architecture for setting up the cross-account sharing feature of the model card, and then dive deep into each component of how to set up and access shared model cards across accounts to improve visibility and model governance.

Solution overview

When building ML models, we recommend setting up a multi-account architecture to provide workload isolation, which improves security, reliability, and scalability. For this post, we assume that we are building and deploying a model for a customer churn use case. The architecture diagram that follows shows one of the recommended approaches, a centralized model card, for managing a model card in a multi-account Machine Learning Model-Development Lifecycle (MDLC) architecture. However, you can also adopt another approach, a hub-and-spoke model card. In this post, we focus only on the centralized model card approach, but the same principles can be extended to a hub-and-spoke approach. The main difference is that each spoke account maintains its own version of the model card and has processes to aggregate and copy it to a centralized account.

The following diagram illustrates this architecture.

The architecture is implemented as follows:

The Lead Data Scientist is notified to solve the Customer Churn use case using ML, and starts the ML project by creating a model card for the Customer Churn V1 model in Draft status in the ML Shared Services account.
Through automation, that model card is shared with the ML Dev account.
The Data Scientist builds the model and starts to populate information into the model card via APIs based on their experimentation results, and the model card status is set to Pending Review.
Through automation, that model card is shared with the ML Test account.
The ML Engineer (MLE) runs integration and validation tests in the ML Test account, and the model in the central registry is marked Pending Approval.
The Model Approver reviews the model results with the supporting documentation provided in the central model card and approves the model card for production deployment.
Through automation, that model card is shared with the ML Prod account in read-only mode.
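As a rough sketch of this workflow, the following snippet tracks the model card status progression and the account that receives the share after each status is reached. The status strings follow the spelling that the SageMaker API uses for model card status (`Draft`, `PendingReview`, `Approved`); the account labels come from the steps above, and the function and dictionary names are hypothetical. The API also defines an `Archived` status, which this walkthrough doesn't use.

```python
# Status progression of the centralized model card, as described in the steps.
CARD_STATUS_ORDER = ["Draft", "PendingReview", "Approved"]

# Account that the automation shares the card with once a status is reached.
SHARE_AFTER_STATUS = {
    "Draft": "ML Dev",
    "PendingReview": "ML Test",
    "Approved": "ML Prod (read-only)",
}


def next_status(current: str) -> str:
    """Return the status that follows `current` in this workflow."""
    index = CARD_STATUS_ORDER.index(current)  # raises ValueError if unknown
    if index == len(CARD_STATUS_ORDER) - 1:
        raise ValueError(f"{current!r} is the final status in this workflow")
    return CARD_STATUS_ORDER[index + 1]


if __name__ == "__main__":
    for status in CARD_STATUS_ORDER:
        print(f"{status}: share with {SHARE_AFTER_STATUS[status]}")
```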

Prerequisites

Before you get started, make sure you have the following prerequisites:

Two AWS accounts.
In both AWS accounts, an IAM federation role with administrator access to do the following:
- Create, edit, view, and delete model cards within Amazon SageMaker.
- Create, edit, view, and delete resource shares within AWS RAM.

For more information, refer to Example IAM policies for AWS RAM.

Setting up model card sharing

The account where the model cards are created is the model card account. Users in the model card account share them with the shared accounts where they can be updated. Users in the model card account can share their model cards through AWS Resource Access Manager (AWS RAM). AWS RAM helps you share resources across AWS accounts.

In the following section, we show how to share model cards.

First, create a model card for a Customer Churn use case as previously described. On the Amazon SageMaker console, expand the Governance section and choose Model cards.

We create the model card in Draft status with the name Customer-Churn-Model-Card. For more information, refer to Create a model card. In this demonstration, you can leave the remainder of the fields blank and create the model card.

Alternatively, you can use the following AWS CLI command to create the model card:

aws sagemaker create-model-card --model-card-name Customer-Churn-Model-Card --content '{"model_overview": {"model_owner": "model-owner","problem_type": "Customer Churn Model"}}' --model-card-status Draft
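Embedding JSON in a shell argument is easy to get wrong because of nested quotes. One option, sketched here with only the Python standard library, is to build the `--content` payload with `json.dumps` and let `shlex.join` handle the shell quoting. The snippet prints the command rather than running it, and the variable names are illustrative.

```python
import json
import shlex

# Same content as in the command above, built as a Python dictionary.
content = json.dumps(
    {
        "model_overview": {
            "model_owner": "model-owner",
            "problem_type": "Customer Churn Model",
        }
    }
)

command = [
    "aws", "sagemaker", "create-model-card",
    "--model-card-name", "Customer-Churn-Model-Card",
    "--content", content,
    "--model-card-status", "Draft",
]

if __name__ == "__main__":
    # shlex.join quotes each argument so the JSON survives the shell intact.
    print(shlex.join(command))
```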

Now, create the cross-account share using AWS RAM. In the AWS RAM console, select Create a resource share.

Enter a name for the resource share, for example “Customer-Churn-Model-Card-Share”. In the Resources – optional section, select the resource type as SageMaker Model Cards. The model card we created in the previous step will appear in the listing.

Select that model card and it will appear in the Selected resources section. Select that resource again as shown in the following steps and choose Next.

On the next page, you can select the Managed permissions. You can create custom permissions or use the default option “AWSRAMPermissionSageMakerModelCards” and select Next. For more information, refer to Managing permissions in AWS RAM.

On the next page, you can select Principals. Under Select principal type, choose AWS Account and enter the ID of the account that you want to share the model card with. Select Add and continue to the next page.

On the last page, review the information and select “Create resource share”. Alternatively, you can use the following AWS CLI command to create a resource share:

aws ram create-resource-share --name <Name of the resource share>

aws ram associate-resource-share --resource-share-arn <ARN of the resource share created by the previous command> --resource-arns <ARN of the model card> --principals <ID of the AWS account to share with>

On the AWS RAM console, you see the attributes of the resource share. Make sure that Shared resources, Managed permissions, and Shared principals are in the “Associated” status.

After you use AWS RAM to create a resource share, the principals specified in the resource share can be granted access to the share’s resources.

If you turn on AWS RAM sharing with AWS Organizations, and the principals that you share with are in the same organization as the sharing account, those principals can receive access as soon as their account administrator grants them permissions.
If you don’t turn on AWS RAM sharing with Organizations, you can still share resources with individual AWS accounts that are in your organization. The administrator in the consuming account receives an invitation to join the resource share, and they must accept the invitation before the principals specified in the resource share can access the shared resources.
You can also share with accounts outside of your organization if the resource type supports it. The administrator in the consuming account receives an invitation to join the resource share, and they must accept the invitation before the principals specified in the resource share can access the shared resources.
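The three cases above reduce to one rule: an invitation is skipped only when sharing with AWS Organizations is turned on and the consuming account is in the same organization. The function below is a hypothetical helper that encodes that rule as stated in this section; it doesn't call AWS RAM.

```python
def invitation_required(org_sharing_enabled: bool, same_organization: bool) -> bool:
    """Return True if the consuming account must accept an invitation.

    Encodes the rule described in the text: access without an invitation
    needs both AWS RAM sharing with AWS Organizations turned on and a
    principal in the same organization as the sharing account.
    """
    return not (org_sharing_enabled and same_organization)


if __name__ == "__main__":
    cases = [
        ("org sharing on, same organization", True, True),
        ("org sharing off, same organization", False, True),
        ("account outside the organization", True, False),
    ]
    for label, enabled, same_org in cases:
        print(f"{label}: invitation required = {invitation_required(enabled, same_org)}")
```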

For more information about AWS RAM, refer to Terms and concepts for AWS RAM.

Accessing shared model cards

Now we can log in to the shared AWS account to access the model card. Make sure that you access the AWS console using an IAM role whose permissions allow access to AWS RAM.

With AWS RAM, you can view the resource shares to which you have been added, the shared resources that you can access, and the AWS accounts that have shared resources with you. You can also leave a resource share when you no longer require access to its shared resources.

To view the model card in the shared AWS account:

Navigate to the Shared with me: Shared resources page in the AWS RAM console.
Make sure that you are operating in the same AWS Region where the share was created.
The model card shared from the model card account will be available in the listing. If there is a long list of resources, then you can apply a filter to find specific shared resources. You can apply multiple filters to narrow your search.
The following information is available:
1. Resource ID – The ID of the resource. This is the name of the model card that we created earlier in the model card account.
2. Resource type – The type of resource.
3. Last share date – The date on which the resource was shared with you.
4. Resource shares – The number of resource shares in which the resource is included. Choose the value to view the resource shares.
5. Owner ID – The ID of the principal who owns the resource.

You can also access the model card using the AWS CLI. Make sure that the IAM identity whose credentials the AWS CLI uses has permissions to create, edit, and delete model cards within Amazon SageMaker. For more information, refer to Configure the AWS CLI.

You can use the following AWS IAM permissions policy as a template:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "sagemaker:DescribeModelCard",
                "sagemaker:UpdateModelCard",
                "sagemaker:CreateModelCardExportJob",
                "sagemaker:ListModelCardVersions",
                "sagemaker:DescribeModelCardExportJob"
            ],
            "Resource": [
                "arn:aws:sagemaker:AWS-Region:AWS-model-card-account-id:model-card/example-model-card-name-0",
                "arn:aws:sagemaker:AWS-Region:AWS-model-card-account-id:model-card/example-model-card-name-1/*"
            ]
        },
        {
            "Effect": "Allow",
            "Action": "s3:PutObject",
            "Resource": "arn:aws:s3:::Amazon-S3-bucket-storing-the-pdf-of-the-model-card/model-card-name/*"
        }
    ]
}
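If you need this policy for several accounts or model cards, filling in the placeholders by hand gets error-prone. The following sketch generates the same policy structure from parameters using only the Python standard library. The function names are hypothetical, and the Region, account ID, card names, and bucket used in the usage example are placeholder values, not real resources.

```python
import json

MODEL_CARD_ACTIONS = [
    "sagemaker:DescribeModelCard",
    "sagemaker:UpdateModelCard",
    "sagemaker:CreateModelCardExportJob",
    "sagemaker:ListModelCardVersions",
    "sagemaker:DescribeModelCardExportJob",
]


def model_card_arn(region: str, account_id: str, card_pattern: str) -> str:
    """Build a model card ARN in the format used by the policy template."""
    return f"arn:aws:sagemaker:{region}:{account_id}:model-card/{card_pattern}"


def build_policy(region, account_id, card_patterns, export_bucket, export_prefix):
    """Return the policy template with its placeholders filled in."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": MODEL_CARD_ACTIONS,
                "Resource": [
                    model_card_arn(region, account_id, pattern)
                    for pattern in card_patterns
                ],
            },
            {
                "Effect": "Allow",
                "Action": "s3:PutObject",
                "Resource": f"arn:aws:s3:::{export_bucket}/{export_prefix}/*",
            },
        ],
    }


if __name__ == "__main__":
    # Placeholder values for illustration only.
    policy = build_policy(
        region="us-east-1",
        account_id="111122223333",
        card_patterns=["Customer-Churn-Model-Card", "Customer-Churn-Model-Card/*"],
        export_bucket="example-model-card-exports",
        export_prefix="Customer-Churn-Model-Card",
    )
    print(json.dumps(policy, indent=4))
```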

You can run the following AWS CLI command to access the details of the shared model card.

aws sagemaker describe-model-card --model-card-name <ARN of the model card>
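When you process the output of this command in a script, note that the model card content comes back as a JSON string inside the JSON response, so it has to be decoded a second time. The sketch below shows this on a hypothetical, heavily trimmed sample response; a real response carries additional fields, and the helper name is made up for illustration.

```python
import json

# Hypothetical, trimmed sample of a describe-model-card response.
sample_response = json.dumps(
    {
        "ModelCardName": "Customer-Churn-Model-Card",
        "ModelCardStatus": "Draft",
        "Content": json.dumps(
            {
                "model_overview": {
                    "model_owner": "model-owner",
                    "problem_type": "Customer Churn Model",
                }
            }
        ),
    }
)


def problem_type(raw_response: str) -> str:
    """Extract the problem type from the CLI's JSON output."""
    response = json.loads(raw_response)
    # The Content field is itself a JSON-encoded string.
    content = json.loads(response["Content"])
    return content["model_overview"]["problem_type"]


if __name__ == "__main__":
    print(problem_type(sample_response))
```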

Now you can make changes to this model card from this account.

aws sagemaker update-model-card --model-card-name <ARN of the Model Card> --content '{"model_overview": {"model_owner": "model-owner","problem_type": "Customer Churn Model"}}'

After you make changes, go back to the model card account to see the changes that we made in this shared account.

The problem type has been updated to “Customer Churn Model”, as provided in the AWS CLI command input.

Clean up

You can now delete the model card you created. Make sure that you delete the AWS RAM resource share that you created to share the model card.

Conclusion

In this post, we provided an overview of multi-account architecture for scaling and governing your ML workloads securely and reliably. We discussed the architecture patterns for setting up model card sharing and illustrated how centralized model card sharing patterns work. Finally, we set up model card sharing across multiple accounts to improve visibility and governance in your model development lifecycle. We encourage you to try out the new model card sharing feature and let us know your feedback.

About the authors

Vishal Naik is a Sr. Solutions Architect at Amazon Web Services (AWS). He is a builder who enjoys helping customers accomplish their business needs and solve complex challenges with AWS solutions and best practices. His core area of focus includes Machine Learning, DevOps, and Containers. In his spare time, Vishal loves making short films on time travel and alternate universe themes.

Ram Vittal is a Principal ML Solutions Architect at AWS. He has over 20 years of experience architecting and building distributed, hybrid, and cloud applications. He is passionate about building secure and scalable AI/ML and big data solutions to help enterprise customers with their cloud adoption and optimization journey to improve their business outcomes. In his spare time, he rides his motorcycle and walks with his 2-year-old sheep-a-doodle!