Configure and use defaults for Amazon SageMaker resources with the SageMaker Python SDK

Configure and use defaults for Amazon SageMaker resources with the SageMaker Python SDK

The Amazon SageMaker Python SDK is an open-source library for training and deploying machine learning (ML) models on Amazon SageMaker. Enterprise customers in tightly controlled industries such as healthcare and finance set up security guardrails to ensure their data is encrypted and traffic doesn’t traverse the internet. To ensure the SageMaker training and deployment of ML models follow these guardrails, it’s a common practice to set restrictions at the account or AWS Organizations level through service control policies and AWS Identity and Access Management (IAM) policies to enforce the usage of specific IAM roles, Amazon Virtual Private Cloud (Amazon VPC) configurations, and AWS Key Management Service (AWS KMS) keys. In such cases, data scientists have to provide these parameters to their ML model training and deployment code manually, by noting down subnets, security groups, and KMS keys. This puts the onus on the data scientists to remember to specify these configurations, to successfully run their jobs, and avoid getting Access Denied errors.

Starting with SageMaker Python SDK version 2.148.0, you can now configure default values for parameters such as IAM roles, VPCs, and KMS keys. Administrators and end-users can initialize AWS infrastructure primitives with defaults specified in a configuration file in YAML format. Once configured, the Python SDK automatically inherits these values and propagates them to the underlying SageMaker API calls such as CreateProcessingJob(), CreateTrainingJob(), and CreateEndpointConfig(), with no additional actions needed. The SDK also supports multiple configuration files, allowing admins to set a configuration file for all users, and users can override it via a user-level configuration that can be stored in Amazon Simple Storage Service (Amazon S3), Amazon Elastic File System (Amazon EFS) for Amazon SageMaker Studio, or the user’s local file system.

In this post, we show you how to create and store the default configuration file in Studio and use the SDK defaults feature to create your SageMaker resources.

Solution overview

We demonstrate this new feature with an end-to-end AWS CloudFormation template that creates the required infrastructure, and creates a Studio domain in the deployed VPC. In addition, we create KMS keys for encrypting the volumes used in training and processing jobs. The steps are as follows:

  1. Launch the CloudFormation stack in your account. Alternatively, if you want to explore this feature on an existing SageMaker domain or notebook, skip this step.
  2. Populate the config.yaml file and save the file in the default location.
  3. Run a sample notebook with an end-to-end ML use case, including data processing, model training, and inference.
  4. Override the default configuration values.

Prerequisites

Before you get started, make sure you have an AWS account and an IAM user or role with administrator privileges. If you are a data scientist currently passing infrastructure parameters to resources in your notebook, you can skip the next step of setting up your environment and start creating the configuration file.

To use this feature, make sure to upgrade your SageMaker SDK version by running pip install --upgrade sagemaker.

Set up the environment

To deploy a complete infrastructure including networking and a Studio domain, complete the following steps:

  1. Clone the GitHub repository.
  2. Log in to your AWS account and open the AWS CloudFormation console.
  3. To deploy the networking resources, choose Create stack.
  4. Upload the template under setup/vpc_mode/01_networking.yaml.
  5. Provide a name for the stack (for example, networking-stack), and complete the remaining steps to create the stack.
  6. To deploy the Studio domain, choose Create stack again.
  7. Upload the template under setup/vpc_mode/02_sagemaker_studio.yaml.
  8. Provide a name for the stack (for example, sagemaker-stack), and provide the name of the networking stack when prompted for the CoreNetworkingStackName parameter.
  9. Proceed with the remaining steps, select the acknowledgements for IAM resources, and create the stack.

When the status of both stacks update to CREATE_COMPLETE, proceed to the next step.

Create the configuration file

To use the default configuration for the SageMaker Python SDK, you create a config.yaml file in the format that the SDK expects. For the format for the config.yaml file, refer to Configuration file structure. Depending on your work environment, such as Studio notebooks, SageMaker notebook instances, or your local IDE, you can either save the configuration file at the default location or override the defaults by passing a config file location. For the default locations for other environments, refer to Configuration file locations. The following steps showcase the setup for a Studio notebook environment.

To easily create the config.yaml file, run the following cells in your Studio system terminal, replacing the placeholders with the CloudFormation stack names from the previous step:

git clone https://github.com/aws-samples/amazon-sagemaker-build-train-deploy.git
cd amazon-sagemaker-build-train-deploy
pip install boto3
python generate-defaults.py --networking-stack <network-stack-name> 
--sagemaker-stack <sagemaker-stack-name>

# save the file to the default location
mkdir .config/sagemaker
cp user-configs.yaml ~/.config/sagemaker/config.yaml

This script automatically populates the YAML file, replacing the placeholders with the infrastructure defaults, and saves the file in the home folder. Then it copies the file into the default location for Studio notebooks. The resulting config file should look similar to the following format:

SageMaker:
  Model:
    EnableNetworkIsolation: false
    VpcConfig:
      SecurityGroupIds:
      - sg-xxxx
      Subnets:
      - subnet-xxxx
      - subnet-xxxx
  ProcessingJob:
    NetworkConfig:
      EnableNetworkIsolation: false
      VpcConfig:
        SecurityGroupIds:
        - sg-xxxx
        Subnets:
        - subnet-xxxx
        - subnet-xxxx
    ProcessingOutputConfig:
      KmsKeyId: arn:aws:kms:us-east-2:0123456789:alias/kms-defaults
    RoleArn: arn:aws:iam::0123456789:role/service-role/AmazonSageMakerExecutionRole-xxx
  TrainingJob:
    EnableNetworkIsolation: false
    VpcConfig:
      SecurityGroupIds:
      - sg-xxxx
      Subnets:
      - subnet-xxxx
      - subnet-xxxx
SchemaVersion: '1.0'
something: '1.0'

If you have an existing domain and networking configuration set up, create the config.yaml file in the required format and save it in the default location for Studio notebooks.

Note that these defaults simply auto-populate the configuration values for the appropriate SageMaker SDK calls, and don’t enforce the user to any specific VPC, subnet, or role. As an administrator, if you want your users to use a specific configuration or role, use IAM condition keys to enforce the default values.

Additionally, each API call can have its own configurations. For example, in the preceding config file sample, you can specify vpc-a and subnet-a for training jobs, and specify vpc-b and subnet-c, subnet-d for processing jobs.

Run a sample notebook

Now that you have set the configuration file, you can start running your model building and training notebooks as usual, without the need to explicitly set networking and encryption parameters, for most SDK functions. See Supported APIs and parameters for a complete list of supported API calls and parameters.

In Studio, choose the File Explorer icon in the navigation pane and open 03_feature_engineering/03_feature_engineering.ipynb, as shown in the following screenshot.

studio-file-explorer

Run the notebook cells one by one, and notice that you are not specifying any additional configuration. When you create the processor object, you will see the cell outputs like the following example.

configs-applied

As you can see in the output, the default configuration is automatically applied to the processing job, without needing any additional input from the user.

When you run the next cell to run the processor, you can also verify the defaults are set by viewing the job on the SageMaker console. Choose Processing jobs under Processing in the navigation pane, as shown in the following screenshot.

console-processing-jobs

Choose the processing job with the prefix end-to-end-ml-sm-proc, and you should be able to view the networking and encryption already configured.

console-job-configs

You can continue running the remaining notebooks to train and deploy the model, and you will notice that the infrastructure defaults are automatically applied for both training jobs and models.

Override the default configuration file

There could be cases where a user needs to override the default configuration, for example, to experiment with public internet access, or update the networking configuration if the subnet runs out of IP addresses. In such cases, the Python SDK also allows you to provide a custom location for the configuration file, either on local storage, or you can point to a location in Amazon S3. In this section, we explore an example.

Open the user-configs.yaml file on your home directory and update the EnableNetworkIsolation value to True, under the TrainingJob section.

Now, open the same notebook, and add the following cell to the beginning of the notebook:

import os
os.environ["SAGEMAKER_USER_CONFIG_OVERRIDE"] = "~/config.yaml"

With this cell, you point the location of the config file to the SDK. Now, when you create the processor object, you’ll notice that the default config has been overridden to enable network isolation, and the processing job will fail in network isolation mode.

You can use the same override environment variable to set the location of the configuration file if you’re using your local environment such as VSCode.

Debug and retrieve defaults

For quick troubleshooting if you run into any errors when running API calls from your notebook, the cell output displays the applied default configurations as shown in the previous section. To view the exact Boto3 call created to view the attribute values passed from default config file, you can debug by turning on Boto3 logging. To turn on logging, run the following cell at the top of the notebook:

import boto3
import logging
boto3.set_stream_logger(name='botocore.endpoint', level=logging.DEBUG)

Any subsequent Boto3 calls will be logged with the complete request, visible under the body section in the log.

You can also view the collection of default configurations using the session.sagemaker_config value as shown in the following example.

session-config-values

Finally, if you’re using Boto3 to create your SageMaker resources, you can retrieve the default configuration values using the sagemaker_config variable. For example, to run the processing job in 03_feature_engineering.ipynb using Boto3, you can enter the contents of the following cell in the same notebook and run the cell:

import boto3
import sagemaker
session = sagemaker.Session()
client = boto3.client('sagemaker')

# get the default values
subnet_ids = session.sagemaker_config["SageMaker"]["ProcessingJob"]['NetworkConfig']["VpcConfig"]["Subnets"]
security_groups = session.sagemaker_config["SageMaker"]["ProcessingJob"]['NetworkConfig']["VpcConfig"]["SecurityGroupIds"]
kms_key = session.sagemaker_config["SageMaker"]["ProcessingJob"]["ProcessingOutputConfig"]["KmsKeyId"]
role_arn = session.sagemaker_config["SageMaker"]["ProcessingJob"]["RoleArn"]

# upload processing code
code_location = sagemaker_session.upload_data('./source_dir/preprocessor.py', 
                              bucket=s3_bucket_name, 
                              key_prefix=f'{s3_key_prefix}/code')
code_location = ('/').join(code_location.split('/')[:-1])

# create a processing job
response = client.create_processing_job(
    ProcessingJobName='end-to-end-ml-sm-proc-boto3',
    ProcessingInputs=[
        {
            'InputName': 'raw_data',
            "S3Input": {
                "S3Uri": raw_data_path,
                "LocalPath": "/opt/ml/processing/input",
                "S3DataType": "S3Prefix",
                "S3InputMode": "File",
            }
        },
        {
            "InputName": "code",
            "S3Input": {
                "S3Uri": code_location,
                "LocalPath": "/opt/ml/processing/input/code",
                "S3DataType": "S3Prefix",
                "S3InputMode": "File",
            }
        }
    ],
    ProcessingOutputConfig={
        'Outputs': [
            {
                'OutputName': 'train_data',
                'S3Output': {
                    'S3Uri': train_data_path,
                    'LocalPath': "/opt/ml/processing/train",
                    'S3UploadMode': 'EndOfJob'
                },
            },
            {
                'OutputName': 'val_data',
                'S3Output': {
                    'S3Uri': val_data_path,
                    'LocalPath': "/opt/ml/processing/val",
                    'S3UploadMode': 'EndOfJob'
                },
            },
            {
                'OutputName': 'test_data',
                'S3Output': {
                    'S3Uri': test_data_path,
                    'LocalPath': "/opt/ml/processing/test",
                    'S3UploadMode': 'EndOfJob'
                },
            },
            {
                'OutputName': 'model',
                'S3Output': {
                    'S3Uri': model_path,
                    'LocalPath': "/opt/ml/processing/model",
                    'S3UploadMode': 'EndOfJob'
                },
            },
        ],
        'KmsKeyId': kms_key
    },
    ProcessingResources={
        'ClusterConfig': {
            'InstanceCount': 1,
            'InstanceType': 'ml.m5.large',
            'VolumeSizeInGB': 30,
        }
    },
    AppSpecification={
        "ImageUri": "257758044811.dkr.ecr.us-east-2.amazonaws.com/sagemaker-scikit-learn:0.20.0-cpu-py3",
        "ContainerArguments": [
            "--train-test-split-ratio",
            "0.2"
        ],
        "ContainerEntrypoint": [
            "python3",
            "/opt/ml/processing/input/code/preprocessor.py"
        ]
    },
    NetworkConfig={
        'EnableNetworkIsolation': False,
        'VpcConfig': {
            'SecurityGroupIds': security_groups,
            'Subnets': subnet_ids
        }
    },
    RoleArn=role_arn,
)

Automate config file creation

For administrators, having to create the config file and save the file to each SageMaker notebook instance or Studio user profile can be a daunting task. Although you can recommend that users use a common file stored in a default S3 location, it puts the additional overhead of specifying the override on the data scientists.

To automate this, administrators can use SageMaker Lifecycle Configurations (LCC). For Studio user profiles or notebook instances, you can attach the following sample LCC script as a default LCC for the user’s default Jupyter Server app:

# sample LCC script to set default config files on the user's folder
#!/bin/bash

set -eux

# add --endpoint-url [S3 Interface Endpoint] if using an S3 interface endpoint
aws s3 cp <s3-location-of-config-file> ~/.config/sagemaker/config.yaml
echo "config file saved in the default location"

See Use Lifecycle Configurations for Amazon SageMaker Studio or Customize a Notebook Instance for instructions on creating and setting a default lifecycle script.

Clean up

When you’re done experimenting with this feature, clean up your resources to avoid paying additional costs. If you have provisioned new resources as specified in this post, complete the following steps to clean up your resources:

  1. Shut down your Studio apps for the user profile. See Shut Down and Update SageMaker Studio and Studio Apps for instructions. Ensure that all apps are deleted before deleting the stack.
  2. Delete the EFS volume created for the Studio domain. You can view the EFS volume attached with the domain by using a DescribeDomain API call.
  3. Delete the Studio domain stack.
  4. Delete the security groups created for the Studio domain. You can find them on the Amazon Elastic Compute Cloud (Amazon EC2) console, with the names security-group-for-inbound-nfs-d-xxx and security-group-for-outbound-nfs-d-xxx
  5. Delete the networking stack.

Conclusion

In this post, we discussed configuring and using default values for key infrastructure parameters using the SageMaker Python SDK. This allows administrators to set default configurations for data scientists, thereby saving time for users and admins, eliminating the burden of repetitively specifying parameters, and resulting in leaner and more manageable code. For the full list of supported parameters and APIs, see Configuring and using defaults with the SageMaker Python SDK. For any questions and discussions, join the Machine Learning & AI community.


About the Authors

Giuseppe Angelo Porcelli is a Principal Machine Learning Specialist Solutions Architect for Amazon Web Services. With several years software engineering an ML background, he works with customers of any size to deeply understand their business and technical needs and design AI and Machine Learning solutions that make the best use of the AWS Cloud and the Amazon Machine Learning stack. He has worked on projects in different domains, including MLOps, Computer Vision, NLP, and involving a broad set of AWS services. In his free time, Giuseppe enjoys playing football.

Bruno Pistone is an AI/ML Specialist Solutions Architect for AWS based in Milan. He works with customers of any size on helping them to deeply understand their technical needs and design AI and Machine Learning solutions that make the best use of the AWS Cloud and the Amazon Machine Learning stack. His field of expertise are Machine Learning end to end, Machine Learning Industrialization and MLOps. He enjoys spending time with his friends and exploring new places, as well as travelling to new destinations.

Durga Sury is an ML Solutions Architect on the Amazon SageMaker Service SA team. She is passionate about making machine learning accessible to everyone. In her 4 years at AWS, she has helped set up AI/ML platforms for enterprise customers. When she isn’t working, she loves motorcycle rides, mystery novels, and long walks with her 5-year-old husky.

Read More

Accelerate your learning towards AWS Certification exams with automated quiz generation using Amazon SageMaker foundations models

Accelerate your learning towards AWS Certification exams with automated quiz generation using Amazon SageMaker foundations models

Getting AWS Certified can help you propel your career, whether you’re looking to find a new role, showcase your skills to take on a new project, or become your team’s go-to expert. And because AWS Certification exams are created by experts in the relevant role or technical area, preparing for one of these exams helps you build the required skills identified by skilled practitioners in the field.

Reading the FAQ page of the AWS services relevant for your certification exam is important in order to acquire a deeper understanding of the service. However, this could take quite some time. Reading FAQs of even one service can take half a day to read and understand. For example, the Amazon SageMaker FAQ contains about 33 pages (printed) of content just on SageMaker.

Wouldn’t it be an easier and more fun learning experience if you could use a system to test yourself on the AWS service FAQ pages? Actually, you can develop such a system using state-of-the-art language models and a few lines of Python.

In this post, we present a comprehensive guide of deploying a multiple-choice quiz solution for the FAQ pages of any AWS service, based on the AI21 Jurassic-2 Jumbo Instruct foundation model on Amazon SageMaker Jumpstart.

Large language models

In recent years, language models have seen a huge surge in size and popularity. In 2018, BERT-large made its debut with its 340 million parameters and innovative transformer architecture, setting the benchmark for performance on NLP tasks. In a few short years, the state-of-the-art in terms of model size has ballooned by over 500 times; OpenAI’s GPT-3 and Bloom 176 B, both with 175 billion parameters, and AI21 Jurassic-2 Jumbo Instruct with 178 billion parameters are just three examples of large language models (LLMs) raising the bar on natural language processing (NLP) accuracy.

SageMaker foundation models

SageMaker provides a range of models from popular model hubs including Hugging Face, PyTorch Hub, and TensorFlow Hub, and propriety ones from AI21, Cohere, and LightOn, which you can access within your machine learning (ML) development workflow in SageMaker. Recent advances in ML have given rise to a new class of models known as foundation models, which have billions of parameters and are trained on massive amounts of data. Those foundation models can be adapted to a wide range of use cases, such as text summarization, generating digital art, and language translation. Because these models can be expensive to train, customers want to use existing pre-trained foundation models and fine-tune them as needed, rather than train these models themselves. SageMaker provides a curated list of models that you can choose from on the SageMaker console.

With JumpStart, you can find foundation models from different providers, enabling you to get started with foundation models quickly. You can review model characteristics and usage terms, and try out these models using a test UI widget. When you’re ready to use a foundation model at scale, you can do so easily without leaving SageMaker by using pre-built notebooks from model providers. Your data, whether used for evaluating or using the model at scale, is never shared with third parties because the models are hosted and deployed on AWS.

AI21 Jurassic-2 Jumbo Instruct

Jurassic-2 Jumbo Instruct is an LLM by AI21 Labs that can be applied to any language comprehension or generation task. It’s optimized to follow natural language instructions and context, so there is no need to provide it with any examples. The endpoint comes pre-loaded with the model and ready to serve queries via an easy-to-use API and Python SDK, so you can hit the ground running. Jurassic-2 Jumbo Instruct is a top performer at HELM, particularly in tasks related to reading and writing.

Solution overview

In the following sections, we go through the steps to test the Jurassic-2 Jumbo instruct model in SageMaker:

  1. Choose the Jurassic-2 Jumbo instruct model on the SageMaker console.
  2. Evaluate the model using the playground.
  3. Use a notebook associated with the foundation model to deploy it in your environment.

Access Jurassic-2 Jumbo Instruct through the SageMaker console

The first step is to log in to the SageMaker console. Under JumpStart in the navigation pane, choose Foundation models to request access to the model list.

SageMaker Foundation Models

After your account is allow listed, you can see a list of models on this page and search for the Jurassic-2 Jumbo Instruct model.

Evaluate the Jurassic-2 Jumbo Instruct model in the model playground

On the AI21 Jurassic-2 Jumbo Instruct listing, choose View Model. You will see a description of the model and the tasks that you can perform. Read through the EULA for the model before proceeding.

Let’s first try out the model to generate a test based on the SageMaker FAQ page. Navigate to the Playground tab.

On the Playground tab, you can provide sample prompts to the Jurassic-2 Jumbo Instruct model and view the output.

AI21 Jurassic-2 Jumbo Instruct - choose playground

Note that you can use a maximum of 500 tokens. We set the Max length to 500, which is the maximum number of tokens to generate. This model has an 8,192-token context window (the length of the prompt plus completion should be at most 8,192 tokens).

To make it easier to see the prompt, you can enlarge the Prompt box.

AI21 Jurassic-2 Jumbo Instruct - configure playground

Because we can use a maximum of 500 tokens, we take a small portion of the Amazon SageMaker FAQs page, the Low-code ML section, for our test prompt.

We use the following prompt:

Below is SageMaker Low-code ML FAQ:

##
Q: Will my data (from inference or training) be used or shared to update the base model that is offered to customers using Amazon SageMaker JumpStart?
No. Your inference and training data will not be used nor shared to update or train the base model that SageMaker JumpStart surfaces to customers.

Q: Can I see the model weights and scripts of proprietary models in preview with Amazon SageMaker JumpStart?
No. Proprietary models do not allow customers to view model weights and scripts.

Q: Which open-source models are supported with Amazon SageMaker JumpStart?
Amazon SageMaker JumpStart includes 150+ pre-trained open-source models from PyTorch Hub and TensorFlow Hub. For vision tasks such as image classification and object detection, you can use models such as ResNet, MobileNet, and Single-Shot Detector (SSD). For text tasks such as sentence classification, text classification, and question answering, you can use models such as BERT, RoBERTa, and DistilBERT.

Q: What solutions come pre-built with Amazon SageMaker JumpStart?
SageMaker JumpStart includes solutions that are preconfigured with all necessary AWS services to launch a solution into production. Solutions are fully customizable so you can easily modify them to fit your specific use case and dataset. You can use solutions for over 15 use cases including demand forecasting, fraud detection, and predictive maintenance, and readily deploy solutions with just a few clicks. For more information about all solutions available, visit the SageMaker getting started page. 

Q: What built-in algorithms are supported in Amazon SageMaker Autopilot?
Amazon SageMaker Autopilot supports 2 built-in algorithms: XGBoost and Linear Learner.

Q: Can I stop an Amazon SageMaker Autopilot job manually?
Yes. You can stop a job at any time. When an Amazon SageMaker Autopilot job is stopped, all ongoing trials will be stopped and no new trial will be started.
##

Create a multiple choice quiz on the topic of SageMaker Low-code ML FAQ consisting of 4 questions. Each question should have 4 options. Also include the correct answer for each question using the starting string 'Correct Answer:`

Prompt engineering is an iterative process. You should be clear and specific, and give the model time to think.

Here we specified the context with ## as stop sequences, which signals the model to stop generating after this character or string is generated. It’s useful when using a few-shot prompt.

Below is SageMaker Low-code ML FAQ:

##
<SageMaker Low-code ML FAQ content>
##

Next, we are clear and very specific in our prompt, asking for a multiple-choice quiz, consisting of four questions with four options. We ask the model to include the correct answer for each question using the starting string 'Correct Answer:' so we can parse it later using Python:

Create a multiple choice quiz on the topic of SageMaker Low-code ML FAQ consisting of 4 questions. Each question should have 4 options. Also include the correct answer for each question using the starting string 'Correct Answer:`

A well-designed prompt can make the model more creative and generalized so that it can easily adapt to new tasks. Prompts can also help incorporate domain knowledge on specific tasks and improve interpretability. Prompt engineering can greatly improve the performance of zero-shot and few-shot learning models. Creating high-quality prompts requires careful consideration of the task at hand, as well as a deep understanding of the model’s strengths and limitations.

In the scope of this post, we don’t cover this wide area further.

Copy the prompt and enter it in the Prompt box, then choose Generate text.

AI21 Jurassic-2 Jumbo Instruct - prompt input

This sends the prompt to the Jurassic-2 Jumbo Instruct model for inference. Note that experimenting in the playground is free.

Also keep in mind that despite the cutting-edge nature of LLMs, they are still prone to biases, errors, and hallucinations.

After reading the model output thoroughly and carefully, we can see that the model generated quite a good quiz!

After you have played with the model, it’s time to use the notebook and deploy it as an endpoint in your environment. We use a small Python function to parse the output and simulate an interactive test.

Deploy the Jurassic-2 Jumbo Instruct foundation model from a notebook

You can use the following sample notebook to deploy Jurassic-2 Jumbo Instruct using SageMaker. Note that this example uses an ml.p4d.24xlarge instance. If your default limit for your AWS account is 0, you need to request a limit increase for this GPU instance.

Let’s create the endpoint using SageMaker inference. First, we set the necessary variables, then we deploy the model from the model package:

endpoint_name = "j2-jumbo-instruct"

content_type = "application/json"

real_time_inference_instance_type = (
"ml.p4d.24xlarge"
)

# create a deployable model from the model package.
model = ModelPackage(
role=role, model_package_arn=model_package_arn, sagemaker_session=sagemaker_session
)

# Deploy the model
predictor = model.deploy(1, real_time_inference_instance_type, endpoint_name=endpoint_name,
model_data_download_timeout=3600,
container_startup_health_check_timeout=600,
)

After the endpoint is deployed, you can run inference queries against the model.

After the model is deployed, you can interact with the deployed endpoint using the following code snippet:

response = ai21.Completion.execute(sm_endpoint=endpoint_name,
prompt=instruction,
maxTokens=2048,
temperature=0.7,
numResults=1,
stopSequences=['##'])

output = response['completions'][0]['data']['text']

With the Jurassic-2 Jumbo Instruct foundation model deployed on an ml.p4d.24xlarge instance SageMaker endpoint, you can use a prompt with 4,096 tokens. You can take the same prompt we used in the playground and add many more questions. In this example, we added the FAQ’s entire Low-code ML section as context into the prompt.

AI21 Jurassic-2 Jumbo Instruct endpoint prompt output

We can see the output of the model, which generated a multiple-choice quiz with four questions and four options for each question.

Now you can develop a Python function to parse the output and create an interactive multiple-choice quiz.

It’s quite straightforward to develop such a function with a few lines of code. You can parse the answer easily because the model created a line with “Correct Answer: ” for each question, exactly as we requested in the prompt. We don’t provide the Python code for the quiz generation in the scope of this post.

Run the quiz in the notebook

Using the Python function we created earlier and the output from the Jurassic-2 Jumbo Instruct foundation model, we run the interactive quiz in the notebook.

AI21 Jurassic-2 Jumbo Instruct endpoint - take a test

You can see I answered three out of four questions correctly and got a 75% grade. Perhaps I need to read the SageMaker FAQ a few more times!

Clean up

After you have tried out the endpoint, make sure to remove the SageMaker inference endpoint and the model to prevent any charges:

model.sagemaker_session.delete_endpoint(endpoint_name)
model.sagemaker_session.delete_endpoint_config(endpoint_name)

model.delete_model()

Conclusion

In this post, we showed you how you can test and use AI21’s Jurassic-2 Jumbo Instruct model using SageMaker to build an automated quiz generation system. This was achieved using a rather simple prompt with a publicly available SageMaker FAQ page’s text embedded and a few lines of Python code.

Similar to this example mentioned in the post, you can customize a foundation model for your business with just a few labeled examples. Because all the data is encrypted and doesn’t leave your AWS account, you can trust that your data will remain private and confidential.

Request access to try out the foundation model in SageMaker today, and let us know your feedback!


About the Author

Eitan SelaEitan Sela is a Machine Learning Specialist Solutions Architect with Amazon Web Services. He works with AWS customers to provide guidance and technical assistance, helping them build and operate machine learning solutions on AWS. In his spare time, Eitan enjoys jogging and reading the latest machine learning articles.

Read More

Amazon SageMaker XGBoost now offers fully distributed GPU training

Amazon SageMaker XGBoost now offers fully distributed GPU training

Amazon SageMaker provides a suite of built-in algorithms, pre-trained models, and pre-built solution templates to help data scientists and machine learning (ML) practitioners get started on training and deploying ML models quickly. You can use these algorithms and models for both supervised and unsupervised learning. They can process various types of input data, including tabular, image, and text.

The SageMaker XGBoost algorithm allows you to easily run XGBoost training and inference on SageMaker. XGBoost (eXtreme Gradient Boosting) is a popular and efficient open-source implementation of the gradient boosted trees algorithm. Gradient boosting is a supervised learning algorithm that attempts to accurately predict a target variable by combining an ensemble of estimates from a set of simpler and weaker models. The XGBoost algorithm performs well in ML competitions because of its robust handling of a variety of data types, relationships, distributions, and the variety of hyperparameters that you can fine-tune. You can use XGBoost for regression, classification (binary and multiclass), and ranking problems. You can use GPUs to accelerate training on large datasets.

Today, we are happy to announce that SageMaker XGBoost now offers fully distributed GPU training.

Starting with version 1.5-1 and above, you can now utilize all GPUs when using multi-GPU instances. The new feature addresses your needs to use fully distributed GPU training when dealing with large datasets. This means being able to use multiple Amazon Elastic Compute Cloud (Amazon EC2) instances (GPU) and using all GPUs per instance.

Distributed GPU training with multi-GPU instances

With SageMaker XGBoost version 1.2-2 or later, you can use one or more single-GPU instances for training. The hyperparameter tree_method needs to be set to gpu_hist. When using more than one instance (distributed setup), the data needs to be divided among instances as follows (the same as the non-GPU distributed training steps mentioned in XGBoost Algorithm). Although this option is performant and can be used in various training setups, it doesn’t extend to using all GPUs when choosing multi-GPU instances such as g5.12xlarge.

With SageMaker XGBoost version 1.5-1 and above, you can now use all GPUs on each instance when using multi-GPU instances. The ability to use all GPUs in multi-GPU instance is offered by integrating the Dask framework.

You can use this setup to complete training quickly. Apart from saving time, this option will also be useful to work around blockers such as maximum usable instance (soft) limits, or if the training job is unable to provision a large number of single-GPU instances for some reason.

The configurations to use this option are the same as the previous option, except for the following differences:

  • Add the new hyperparameter use_dask_gpu_training with string value true.
  • When creating TrainingInput, set the distribution parameter to FullyReplicated, whether using single or multiple instances. The underlying Dask framework will carry out the data load and split the data among Dask workers. This is different from the data distribution setting for all other distributed training with SageMaker XGBoost.

Note that splitting data into smaller files still applies for Parquet, where Dask will read each file as a partition. Because you’ll have a Dask worker per GPU, the number of files should be greater than instance count * GPU count per instance. Also, making each file too small and having a very large number of files can degrade performance. For more information, see Avoid Very Large Graphs. For CSV, we still recommend splitting up large files into smaller ones to reduce data download time and enable quicker reads. However, it’s not a requirement.

Currently, the supported input formats with this option are:

  • text/csv
  • application/x-parquet

The following input mode is supported:

  • File mode

The code will look similar to the following:

import os
import boto3
import re
import sagemaker
from sagemaker.session import Session
from sagemaker.inputs import TrainingInput
from sagemaker.xgboost.estimator import XGBoost

role = sagemaker.get_execution_role()
region = sagemaker.Session().boto_region_name
session = Session()

bucket = "<Specify S3 Bucket>"
prefix = "<Specify S3 prefix>"

hyperparams = {
    "objective": "reg:squarederror",
    "num_round": "500",
    "verbosity": "3",
    "tree_method": "gpu_hist",
    "eval_metric": "rmse",
    "use_dask_gpu_training": "true"
}


output_path = "s3://{}/{}/output".format(bucket, prefix)

content_type = "application/x-parquet"
instance_type = "ml.g4dn.2xlarge"

xgboost_container = sagemaker.image_uris.retrieve("xgboost", region, "1.5-1")
xgb_script_mode_estimator = sagemaker.estimator.Estimator(
    image_uri=xgboost_container,
    hyperparameters=hyperparams,
    role=role,
    instance_count=1,
    instance_type=instance_type,
    output_path=output_path,
    max_run=7200,

)

test_data_uri = " <specify the S3 uri for training dataset>"
validation_data_uri = “<specify the S3 uri for validation dataset>”

train_input = TrainingInput(
    test_data_uri, content_type=content_type
)

validation_input = TrainingInput(
    validation_data_uri, content_type=content_type
)

xgb_script_mode_estimator.fit({"train": train_input, "validation": validation_input})

The following screenshots show a successful training job log from the notebook.

Benchmarks

We benchmarked evaluation metrics to ensure that the model quality didn’t deteriorate with the multi-GPU training path compared to single-GPU training. We also benchmarked on large datasets to ensure that our distributed GPU setups were performant and scalable.

Billable time refers to the absolute wall-clock time. Training time is only the XGBoost training time, measured from the train() call until the model is saved to Amazon Simple Storage Service (Amazon S3).

Performance benchmarks on large datasets

The use of multi-GPU is usually appropriate for large datasets with complex training. We created a dummy dataset with 2,497,248,278 rows and 28 features for testing. The dataset was 150 GB and composed of 1,419 files. Each file was sized between 105–115 MB. We saved the data in Parquet format in an S3 bucket. To simulate somewhat complex training, we used this dataset for a binary classification task, with 1,000 rounds, to compare performance between the single-GPU training path and the multi-GPU training path.

The following table contains the billable training time and performance comparison between the single-GPU training path and the multi-GPU training path.

Single-GPU Training Path
Instance Type Instance Count Billable Time / Instance(s) Training Time(s)
g4dn.xlarge 20 Out of Memory
g4dn.2xlarge 20 Out of Memory
g4dn.4xlarge 15 1710 1551.9
16 1592 1412.2
17 1542 1352.2
18 1423 1281.2
19 1346 1220.3
Multi-GPU Training Path (with Dask)
Instance Type Instance Count Billable Time / Instance(s) Training Time(s)
. . . .
g4dn.12xlarge 7 Out of Memory
8 1143 784.7
9 1039 710.73
10 978 676.7
12 940 614.35

We can see that using multi-GPU instances results in low training time and low overall time. The single-GPU training path still has some advantage in downloading and reading only part of the data in each instance, and therefore low data download time. It also doesn’t suffer from Dask’s overhead. Therefore, the difference between training time and total time is smaller. However, due to using more GPUs, multi-GPU setup can cut training time significantly.

You should use an EC2 instance that has enough compute power to avoid out of memory errors when dealing with large datasets.

It’s possible to reduce total time further with the single-GPU setup by using more instances or more powerful instances. However, in terms of cost, it might be more expensive. For example, the following table shows the training time and cost comparison with a single-GPU instance g4dn.8xlarge.

Single-GPU Training Path
Instance Type Instance Count Billable Time / Instance(s) Cost ($)
g4dn.8xlarge 15 1679 15.22
17 1509 15.51
19 1326 15.22
Multi-GPU Training Path (with Dask)
Instance Type Instance Count Billable Time / Instance(s) Cost ($)
g4dn.12xlarge 8 1143 9.93
10 978 10.63
12 940 12.26

Cost calculation is based on the On-Demand price for each instance. For more information, refer to Amazon EC2 G4 Instances.

Model quality benchmarks

For model quality, we compared evaluation metrics between the Dask GPU option and the single-GPU option, and ran training on various instance types and instance counts. For different tasks, we used different datasets and hyperparameters, with each dataset split into training, validation, and test sets.

For a binary classification (binary:logistic) task, we used the HIGGS dataset in CSV format. The training split of the dataset has 9,348,181 rows and 28 features. The number of rounds used was 1,000. The following table summarizes the results.

Multi-GPU Training with Dask
Instance Type Num GPUs / Instance Instance Count Billable Time / Instance(s) Accuracy % F1 % ROC AUC %
g4dn.2xlarge 1 1 343 75.97 77.61 84.34
g4dn.4xlarge 1 1 413 76.16 77.75 84.51
g4dn.8xlarge 1 1 413 76.16 77.75 84.51
g4dn.12xlarge 4 1 157 76.16 77.74 84.52

For regression (reg:squarederror), we used the NYC green cab dataset (with some modifications) in Parquet format. The training split of the dataset has 72,921,051 rows and 8 features. The number of rounds used was 500. The following table shows the results.

Multi-GPU Training with Dask
Instance Type Num GPUs / Instance Instance Count Billable Time / Instance(s) MSE R2 MAE
g4dn.2xlarge 1 1 775 21.92 0.7787 2.43
g4dn.4xlarge 1 1 770 21.92 0.7787 2.43
g4dn.8xlarge 1 1 705 21.92 0.7787 2.43
g4dn.12xlarge 4 1 253 21.93 0.7787 2.44

Model quality metrics are similar between the multi-GPU (Dask) training option and the existing training option. Model quality remains consistent when using a distributed setup with multiple instances or GPUs.

Conclusion

In this post, we gave an overview of how you can use different instance type and instance count combinations for distributed GPU training with SageMaker XGBoost. For most use cases, you can use single-GPU instances. This option provides a wide range of instances to use and is very performant. You can use multi-GPU instances for training with large datasets and lots of rounds. It can provide quick training with a smaller number of instances. Overall, you can use SageMaker XGBoost’s distributed GPU setup to immensely speed up your XGBoost training.

To learn more about SageMaker and distributed training using Dask, check out Amazon SageMaker built-in LightGBM now offers distributed training using Dask


About the Authors

Dhiraj Thakur is a Solutions Architect with Amazon Web Services. He works with AWS customers and partners to provide guidance on enterprise cloud adoption, migration, and strategy. He is passionate about technology and enjoys building and experimenting in the analytics and AI/ML space.

Dewan Choudhury is a Software Development Engineer with Amazon Web Services. He works on Amazon SageMaker’s algorithms and JumpStart offerings. Apart from building AI/ML infrastructures, he is also passionate about building scalable distributed systems.

Xin HuangDr. Xin Huang is an Applied Scientist for Amazon SageMaker JumpStart and Amazon SageMaker built-in algorithms. He focuses on developing scalable machine learning algorithms. His research interests are in the area of natural language processing, explainable deep learning on tabular data, and robust analysis of non-parametric space-time clustering. He has published many papers in ACL, ICDM, KDD conferences, and Royal Statistical Society: Series A journal.

Tony Cruz

Read More

Analyze Amazon SageMaker spend and determine cost optimization opportunities based on usage, Part 5: Hosting

Analyze Amazon SageMaker spend and determine cost optimization opportunities based on usage, Part 5: Hosting

In 2021, we launched AWS Support Proactive Services as part of the AWS Enterprise Support plan. Since its introduction, we have helped hundreds of customers optimize their workloads, set guardrails, and improve visibility of their machine learning (ML) workloads’ cost and usage.

In this series of posts, we share lessons learned about optimizing costs in Amazon SageMaker. In Part 1, we showed how to get started using AWS Cost Explorer to identify cost optimization opportunities in SageMaker. In this post, we focus on SageMaker inference environments: real-time inference, batch transform, asynchronous inference, and serverless inference.

Analyze Amazon SageMaker spend and determine cost optimization opportunities based on usage:

SageMaker offers multiple inference options for you to pick from based on your workload requirements:

  • Real-time inference for online, low latency, or high throughput requirements
  • Batch transform for offline, scheduled processing and when you don’t need a persistent endpoint
  • Asynchronous inference for when you have large payloads with long processing times and want to queue requests
  • Serverless inference for when you have intermittent or unpredictable traffic patterns and can tolerate cold starts

In the following sections, we discuss each inference option in more detail.

SageMaker real-time inference

When you create an endpoint, SageMaker attaches an Amazon Elastic Block Store (Amazon EBS) storage volume to the Amazon Elastic Compute Cloud (Amazon EC2) instance that hosts the endpoint. This is true for all instance types that don’t come with a SSD storage. Because the d* instance types come with an NVMe SSD storage, SageMaker doesn’t attach an EBS storage volume to these ML compute instances. Refer to Host instance storage volumes for the size of the storage volumes that SageMaker attaches for each instance type for a single endpoint and for a multi-model endpoint.

The cost of SageMaker real-time endpoints is based on the per instance-hour consumed for each instance while the endpoint is running, the cost of GB-month of provisioned storage (EBS volume), as well as the GB data processed in and out of the endpoint instance, as outlined in Amazon SageMaker Pricing. In Cost Explorer, you can view real-time endpoint costs by applying a filter on the usage type. The names of these usage types are structured as follows:

  • REGION-Host:instanceType (for example, USE1-Host:ml.c5.9xlarge)
  • REGION-Host:VolumeUsage.gp2 (for example, USE1-Host:VolumeUsage.gp2)
  • REGION-Hst:Data-Bytes-Out (for example, USE2-Hst:Data-Bytes-In)
  • REGION-Hst:Data-Bytes-Out (for example, USW2-Hst:Data-Bytes-Out)

As shown in the following screenshot, filtering by the usage type Host: will show a list of real-time hosting usage types in an account.

You can either select specific usage types or select Select All and choose Apply to display the cost breakdown of SageMaker real-time hosting usage. To see the cost and usage breakdown by instance hours, you need to de-select all the REGION-Host:VolumeUsage.gp2 usage types before applying the usage type filter. You can also apply additional filters such as account number, EC2 instance type, cost allocation tag, Region, and more. The following screenshot shows cost and usage graphs for the selected hosting usage types.

Additionally, you can explore the cost associated with one or more hosting instances by using the Instance type filter. The following screenshot shows cost and usage breakdown for hosting instance ml.p2.xlarge.

Similarly, the cost for GB data processed in and processed out can be displayed by selecting the associated usage types as an applied filter, as shown in the following screenshot.

After you have achieved your desired results with filters and groupings, you can either download your results by choosing Download as CSV or save the report by choosing Save to report library. For general guidance on using Cost Explorer, refer to AWS Cost Explorer’s New Look and Common Use Cases.

Optionally, you can enable AWS Cost and Usage Reports (AWS CUR) to gain insights into the cost and usage data for your accounts. AWS CUR contains hourly AWS consumption details. It’s stored in Amazon Simple Storage Service (Amazon S3) in the payer account, which consolidates data for all the linked accounts. You can run queries to analyze trends in your usage and take appropriate action to optimize cost. Amazon Athena is a serverless query service that you can use to analyze the data from AWS CUR in Amazon S3 using standard SQL. More information and example queries can be found in the AWS CUR Query Library.

You can also feed AWS CUR data into Amazon QuickSight, where you can slice and dice it any way you’d like for reporting or visualization purposes. For instructions, see How do I ingest and visualize the AWS Cost and Usage Report (CUR) into Amazon QuickSight.

You can obtain resource-level information such as endpoint ARN, endpoint instance types, hourly instance rate, daily usage hours, and more from AWS CUR. You can also include cost-allocation tags in your query for an additional level of granularity. The following example query returns real-time hosting resource usage for the last 3 months for the given payer account:

SELECT
      bill_payer_account_id,
      line_item_usage_account_id,
      line_item_resource_id AS endpoint_arn,
      line_item_usage_type,
      DATE_FORMAT((line_item_usage_start_date),'%Y-%m-%d') AS day_line_item_usage_start_date,
      SUM(CAST(line_item_usage_amount AS DOUBLE)) AS sum_line_item_usage_amount,
      line_item_unblended_rate,
      SUM(CAST(line_item_unblended_cost AS DECIMAL(16,8))) AS sum_line_item_unblended_cost,
      line_item_blended_rate,
      SUM(CAST(line_item_blended_cost AS DECIMAL(16,8))) AS sum_line_item_blended_cost,
      line_item_line_item_description,
      line_item_line_item_type
    FROM 
      customer_all
    WHERE
      line_item_usage_start_date >= date_trunc('month',current_date - interval '3' month)
      AND line_item_product_code = 'AmazonSageMaker'
      AND line_item_line_item_type  IN ('DiscountedUsage', 'Usage', 'SavingsPlanCoveredUsage')
      AND line_item_usage_type like '%Host%'
        AND line_item_operation = 'RunInstance'
        AND bill_payer_account_id = 'xxxxxxxxxxxx'
    GROUP BY
      bill_payer_account_id, 
      line_item_usage_account_id,
      line_item_resource_id,
      line_item_usage_type,
      line_item_unblended_rate,
      line_item_blended_rate,
      line_item_line_item_type,
      DATE_FORMAT((line_item_usage_start_date),'%Y-%m-%d'),
      line_item_line_item_description
      ORDER BY 
      line_item_resource_id, day_line_item_usage_start_date

The following screenshot shows the results obtained from running the query using Athena. For more information, refer to Querying Cost and Usage Reports using Amazon Athena.

The result of the query shows that endpoint mme-xgboost-housing with ml.x4.xlarge instance is reporting 24 hours of runtime for multiple consecutive days. The instance rate is $0.24/hour and the daily cost for running for 24 hours is $5.76.

AWS CUR results can help you identify patterns of endpoints running for consecutive days in each of the linked accounts, as well as endpoints with the highest monthly cost. This can also help you decide whether the endpoints in non-production accounts can be deleted to save cost.

Optimize costs for real-time endpoints

From a cost management perspective, it’s important to identify under-utilized (or over-sized) instances and bring the instance size and counts, if required, in line with workload requirements. Common system metrics like CPU/GPU utilization and memory utilization are written to Amazon CloudWatch for all hosting instances. For real-time endpoints, SageMaker makes several additional metrics available in CloudWatch. Some of the commonly monitored metrics include invocation counts and invocation 4xx/5xx errors. For a full list of metrics, refer to Monitor Amazon SageMaker with Amazon CloudWatch.

The metric CPUUtilization provides the sum of each individual CPU core’s utilization. The CPU utilization of each core range is 0–100. For example, if there are four CPUs, the CPUUtilization range is 0–400%. The metric MemoryUtilization is the percentage of memory that is used by the containers on an instance. This value range is 0–100%. The following screenshot shows an example of CloudWatch metrics CPUUtilization and MemoryUtilization for an endpoint instance ml.m4.10xlarge that comes with 40 vCPUs and 160 GiB memory.

These metrics graphs show maximum CPU utilization of approximately 3,000%, which is the equivalent of 30 vCPUs. This means that this endpoint isn’t utilizing more than 30 vCPUs out of the total capacity of 40 vCPUs. Similarly, the memory utilization is below 6%. Using this information, you can possibly experiment with a smaller instance that can match this resource need. Furthermore, the CPUUtilization metric shows a classic pattern of periodic high and low CPU demand, which makes this endpoint a good candidate for auto scaling. You can start with a smaller instance and scale out first as your compute demand changes. For information, see Automatically Scale Amazon SageMaker Models.

SageMaker is great for testing new models because you can easily deploy them into an A/B testing environment using production variants, and you only pay for what you use. Each production variant runs on its own compute instance and you’re charged per instance-hour consumed for each instance while the variant is running.

SageMaker also supports shadow variants, which have the same components as a production variant and run on their own compute instance. With shadow variants, SageMaker automatically deploys the model in a test environment, routes a copy of the inference requests received by the production model to the test model in real time, and collects performance metrics such as latency and throughput. This enables you to validate any new candidate component of your model serving stack before promoting it to production.

When you’re done with your tests and aren’t using the endpoint or the variants extensively anymore, you should delete it to save cost. Because the model is stored in Amazon S3, you can recreate it as needed. You can automatically detect these endpoints and take corrective actions (such as deleting them) by using Amazon CloudWatch Events and AWS Lambda functions. For example, you can use the Invocations metric to get the total number of requests sent to a model endpoint and then detect if the endpoints have been idle for the past number of hours (with no invocations over a certain period, such as 24 hours).

If you have several under-utilized endpoint instances, consider hosting options such as multi-model endpoints (MMEs), multi-container endpoints (MCEs), and serial inference pipelines to consolidate usage to fewer endpoint instances.

For real-time and asynchronous inference model deployment, you can optimize cost and performance by deploying models on SageMaker using AWS Graviton. AWS Graviton is a family of processors designed by AWS that provide the best price performance and are more energy efficient than their x86 counterparts. For guidance on deploying an ML model to AWS Graviton-based instances and details on the price performance benefit, refer to Run machine learning inference workloads on AWS Graviton-based instances with Amazon SageMaker. SageMaker also supports AWS Inferentia accelerators through the ml.inf2 family of instances for deploying ML models for real-time and asynchronous inference. You can use these instances on SageMaker to achieve high performance at a low cost for generative artificial intelligence (AI) models, including large language models (LLMs) and vision transformers.

In addition, you can use Amazon SageMaker Inference Recommender to run load tests and evaluate the price performance benefits of deploying your model on these instances. For additional guidance on automatically detecting idle SageMaker endpoints, as well as instance right-sizing and auto scaling for SageMaker endpoints, refer to Ensure efficient compute resources on Amazon SageMaker.

SageMaker batch transform

Batch inference, or offline inference, is the process of generating predictions on a batch of observations. Offline predictions are suitable for larger datasets and in cases where you can afford to wait several minutes or hours for a response.

The cost for SageMaker batch transform is based on the per instance-hour consumed for each instance while the batch transform job is running, as outlined in Amazon SageMaker Pricing. In Cost Explorer, you can explore batch transform costs by applying a filter on the usage type. The name of this usage type is structured as REGION-Tsform:instanceType (for example, USE1-Tsform:ml.c5.9xlarge).

As shown in the following screenshot, filtering by usage type Tsform: will show a list of SageMaker batch transform usage types in an account.

You can either select specific usage types or select Select All and choose Apply to display the cost breakdown of batch transform instance usage for the selected types. As mentioned earlier, you can also apply additional filters. The following screenshot shows cost and usage graphs for the selected batch transform usage types.

Optimize costs for batch transform

SageMaker batch transform only charges you for the instances used while your jobs are running. If your data is already in Amazon S3, then there is no cost for reading input data from Amazon S3 and writing output data to Amazon S3. All output objects are attempted to be uploaded to Amazon S3. If all are successful, then the batch transform job is marked as complete. If one or more objects fail, the batch transform job is marked as failed.

Charges for batch transform jobs apply in the following scenarios:

  • The job is successful
  • Failure due to ClientError and the model container is SageMaker or a SageMaker managed framework
  • Failure due to AlgorithmError or ClientError and the model container is your own custom container (BYOC)

The following are some of the best practices for optimizing a SageMaker batch transform job. These recommendations can reduce the total runtime of your batch transform job, thereby lowering costs:

  • Set BatchStrategy to MultiRecord and SplitType to Line if you need the batch transform job to make mini batches from the input file. If it can’t automatically split the dataset into mini batches, you can divide it into mini batches by putting each batch in a separate input file, placed in the data source S3 bucket.
  • Make sure that the batch size fits into the memory. SageMaker usually handles this automatically; however, when dividing batches manually, this needs to be tuned based on the memory.
  • Batch transform partitions the S3 objects in the input by key and maps those objects to instances. When you have multiples files, one instance might process input1.csv, and another instance might process input2.csv. If you have one input file but initialize multiple compute instances, only one instance processes the input file and the rest of the instances are idle. Make sure the number of files is equal to or greater than the number of instances.
  • If you have a large number of small files, it may be beneficial to combine multiple files into a small number of bigger files to reduce Amazon S3 interaction time.
  • If you’re using the CreateTransformJob API, you can reduce the time it takes to complete batch transform jobs by using optimal values for parameters such as MaxPayloadInMB, MaxConcurrentTransforms, or BatchStrategy:
    • MaxConcurrentTransforms indicates the maximum number of parallel requests that can be sent to each instance in a transform job. The ideal value for MaxConcurrentTransforms is equal to the number of vCPU cores in an instance.
    • MaxPayloadInMB is the maximum allowed size of the payload, in MB. The value in MaxPayloadInMB must be greater than or equal to the size of a single record. To estimate the size of a record in MB, divide the size of your dataset by the number of records. To ensure that the records fit within the maximum payload size, we recommend using a slightly larger value. The default value is 6 MB.
    • MaxPayloadInMB must not be greater than 100 MB. If you specify the optional MaxConcurrentTransforms parameter, then the value of (MaxConcurrentTransforms * MaxPayloadInMB) must also not exceed 100 MB.
    • For cases where the payload might be arbitrarily large and is transmitted using HTTP chunked encoding, set the MaxPayloadInMB value to 0. This feature works only in supported algorithms. Currently, SageMaker built-in algorithms do not support HTTP chunked encoding.
  • Batch inference tasks are usually good candidates for horizontal scaling. Each worker within a cluster can operate on a different subset of data without the need to exchange information with other workers. AWS offers multiple storage and compute options that enable horizontal scaling. If a single instance is not sufficient to meet your performance requirements, consider using multiple instances in parallel to distribute the workload. For key considerations when architecting batch transform jobs, refer to Batch Inference at Scale with Amazon SageMaker.
  • Continuously monitor the performance metrics of your SageMaker batch transform jobs using CloudWatch. Look for bottlenecks, such as high CPU or GPU utilization, memory usage, or network throughput, to determine if you need to adjust instance sizes or configurations.
  • SageMaker uses the Amazon S3 multipart upload API to upload results from a batch transform job to Amazon S3. If an error occurs, the uploaded results are removed from Amazon S3. In some cases, such as when a network outage occurs, an incomplete multipart upload might remain in Amazon S3. To avoid incurring storage charges, we recommend that you add the S3 bucket policy to the S3 bucket lifecycle rules. This policy deletes incomplete multipart uploads that might be stored in the S3 bucket. For more information, see Managing your storage lifecycle.

SageMaker asynchronous inference

Asynchronous inference is a great choice for cost-sensitive workloads with large payloads and burst traffic. Requests can take up to 1 hour to process and have payload sizes of up to 1 GB, so it’s more suitable for workloads that have relaxed latency requirements.

Invocation of asynchronous endpoints differs from real-time endpoints. Rather than passing a request payload synchronously with the request, you upload the payload to Amazon S3 and pass an S3 URI as a part of the request. Internally, SageMaker maintains a queue with these requests and processes them. During endpoint creation, you can optionally specify an Amazon Simple Notification Service (Amazon SNS) topic to receive success or error notifications. When you receive the notification that your inference request has been successfully processed, you can access the result in the output Amazon S3 location.

The cost for asynchronous inference is based on the per instance-hour consumed for each instance while the endpoint is running, cost of GB-month of provisioned storage, as well as GB data processed in and out of the endpoint instance, as outlined in Amazon SageMaker Pricing. In Cost Explorer, you can filter asynchronous inference costs by applying a filter on the usage type. The name of this usage type is structured as REGION-AsyncInf:instanceType (for example, USE1-AsyncInf:ml.c5.9xlarge). Note that GB volume and GB data processed usage types are the same as real-time endpoints, as mentioned earlier in this post.

As shown in the following screenshot, filtering by the usage type AsyncInf: in Cost Explorer displays a cost breakdown by asynchronous endpoint usage types.

To see the cost and usage breakdown by instance hours, you need to de-select all the REGION-Host:VolumeUsage.gp2 usage types before applying the usage type filter. You can also apply additional filters. Resource-level information such as endpoint ARN, endpoint instance types, hourly instance rate, and daily usage hours can be obtained from AWS CUR. The following is an example of an AWS CUR query to obtain asynchronous hosting resource usage for the last 3 months:

SELECT
      bill_payer_account_id,
      line_item_usage_account_id,
      line_item_resource_id AS endpoint_arn,
      line_item_usage_type,
      DATE_FORMAT((line_item_usage_start_date),'%Y-%m-%d') AS day_line_item_usage_start_date,
      SUM(CAST(line_item_usage_amount AS DOUBLE)) AS sum_line_item_usage_amount,
      line_item_unblended_rate,
      SUM(CAST(line_item_unblended_cost AS DECIMAL(16,8))) AS sum_line_item_unblended_cost,
      line_item_blended_rate,
      SUM(CAST(line_item_blended_cost AS DECIMAL(16,8))) AS sum_line_item_blended_cost,
      line_item_line_item_description,
      line_item_line_item_type
    FROM 
      customer_all
    WHERE
      line_item_usage_start_date >= date_trunc('month',current_date - interval '3' month)
      AND line_item_product_code = 'AmazonSageMaker'
      AND line_item_line_item_type  IN ('DiscountedUsage', 'Usage', 'SavingsPlanCoveredUsage')
      AND line_item_usage_type like '%AsyncInf%'
      AND line_item_operation = 'RunInstance'
    GROUP BY
      bill_payer_account_id, 
      line_item_usage_account_id,
      line_item_resource_id,
      line_item_usage_type,
      line_item_unblended_rate,
      line_item_blended_rate,
      line_item_line_item_type,
      DATE_FORMAT((line_item_usage_start_date),'%Y-%m-%d'),
      line_item_line_item_description
      ORDER BY 
      line_item_resource_id, day_line_item_usage_start_date

The following screenshot shows the results obtained from running the AWS CUR query using Athena.

The result of the query shows that endpoint sagemaker-abc-model-5 with ml.m5.xlarge instance is reporting 24 hours of runtime for multiple consecutive days. The instance rate is $0.23/hour and the daily cost for running for 24 hours is $5.52.

As mentioned earlier, AWS CUR results can help you identify patterns of endpoints running for consecutive days, as well as endpoints with the highest monthly cost. This can also help you decide whether the endpoints in non-production accounts can be deleted to save cost.

Optimize costs for asynchronous inference

Just like the real-time endpoints, the cost for asynchronous endpoints is based on the instance type usage. Therefore, it’s important to identify under-utilized instances and resize them based on the workload requirements. In order to monitor asynchronous endpoints, SageMaker makes several metrics such as ApproximateBacklogSize, HasBacklogWithoutCapacity, and more available in CloudWatch. These metrics can show requests in the queue for an instance and can be used for auto scaling an endpoint. SageMaker asynchronous inference also includes host-level metrics. For information on host-level metrics, see SageMaker Jobs and Endpoint Metrics. These metrics can show resource utilization that can help you right-size the instance.

SageMaker supports auto scaling for asynchronous endpoints. Unlike real-time hosted endpoints, asynchronous inference endpoints support scaling down instances to zero by setting the minimum capacity to zero. For asynchronous endpoints, SageMaker strongly recommends that you create a policy configuration for target-tracking scaling for a deployed model (variant). You need to define the scaling policy that scaled on the ApproximateBacklogPerInstance custom metric and set the MinCapacity value to zero.

Asynchronous inference enables you to save on costs by auto scaling the instance count to zero when there are no requests to process, so you only pay when your endpoint is processing requests. Requests that are received when there are zero instances are queued for processing after the endpoint scales up. Therefore, for use cases that can tolerate a cold start penalty of a few minutes, you can optionally scale down the endpoint instance count to zero when there are no outstanding requests and scale back up as new requests arrive. Cold start time depends on the time required to launch a new endpoint from scratch. Also, if the model itself is big, then the time can be longer. If your job is expected to take longer than the 1-hour processing time, you may want to consider SageMaker batch transform.

Additionally, you may also consider your request’s queued time combined with the processing time to choose the instance type. For example, if your use case can tolerate hours of wait time, you can choose a smaller instance to save cost.

For additional guidance on instance right-sizing and auto scaling for SageMaker endpoints, refer to Ensure efficient compute resources on Amazon SageMaker.

Serverless inference

Serverless inference allows you to deploy ML models for inference without having to configure or manage the underlying infrastructure. Based on the volume of inference requests your model receives, SageMaker serverless inference automatically provisions, scales, and turns off compute capacity. As a result, you pay for only the compute time to run your inference code and the amount of data processed, not for idle time. For serverless endpoints, instance provisioning is not necessary. You need to provide the memory size and maximum concurrency. Because serverless endpoints provision compute resources on demand, your endpoint may experience a few extra seconds of latency (cold start) for the first invocation after an idle period. You pay for the compute capacity used to process inference requests, billed by the millisecond, GB-month of provisioned storage, and the amount of data processed. The compute charge depends on the memory configuration you choose.

In Cost Explorer, you can filter serverless endpoints costs by applying a filter on the usage type. The name of this usage type is structured as REGION-ServerlessInf:Mem-MemorySize (for example, USE2-ServerlessInf:Mem-4GB). Note that GB volume and GB data processed usage types are the same as real-time endpoints.

You can see the cost breakdown by applying additional filters such as account number, instance type, Region, and more. The following screenshot shows the cost breakdown by applying filters for the serverless inference usage type.

Optimize cost for serverless inference

When configuring your serverless endpoint, you can specify the memory size and maximum number of concurrent invocations. SageMaker serverless inference auto-assigns compute resources proportional to the memory you select. If you choose a larger memory size, your container has access to more vCPUs. With serverless inference, you only pay for the compute capacity used to process inference requests, billed by the millisecond, and the amount of data processed. The compute charge depends on the memory configuration you choose. The memory sizes you can choose are 1024 MB, 2048 MB, 3072 MB, 4096 MB, 5120 MB, and 6144 MB. The pricing increases with the memory size increments, as explained in Amazon SageMaker Pricing, so it’s important to select the correct memory size. As a general rule, the memory size should be at least as large as your model size. However, it’s a good practice to refer to memory utilization when deciding the endpoint memory size, in addition to the model size itself.

General best practices for optimizing SageMaker inference costs

Optimizing hosting costs isn’t a one-time event. It’s a continuous process of monitoring deployed infrastructure, usage patterns, and performance, and also keeping a keen eye on new innovative solutions that AWS releases that could impact cost. Consider the following best practices:

  • Choose an appropriate instance type – SageMaker supports multiple instance types, each with varying combinations of CPU, GPU, memory, and storage capacities. Based on your model’s resource requirements, choose an instance type that provides the necessary resources without over-provisioning. For information about available SageMaker instance types, their specifications, and guidance on selecting the right instance, refer to Ensure efficient compute resources on Amazon SageMaker.
  • Test using local mode – In order to detect failures and debug faster, it’s recommended to test the code and container (in case of BYOC) in local mode before running the inference workload on the remote SageMaker instance. Local mode is a great way to test your scripts before running them in a SageMaker managed hosting environment.
  • Optimize models to be more performant – Unoptimized models can lead to longer runtimes and use more resources. You can choose to use more or bigger instances to improve performance; however, this leads to higher costs. By optimizing your models to be more performant, you may be able to lower costs by using fewer or smaller instances while keeping the same or better performance characteristics. You can use Amazon SageMaker Neo with SageMaker inference to automatically optimize models. For more details and samples, see Optimize model performance using Neo.
  • Use tags and cost management tools – To maintain visibility into your inference workloads, it’s recommended to use tags as well as AWS cost management tools such as AWS Budgets, the AWS Billing console, and the forecasting feature of Cost Explorer. You can also explore SageMaker Savings Plans as a flexible pricing model. For more information about these options, refer to Part 1 of this series.

Conclusion

In this post, we provided guidance on cost analysis and best practices when using SageMaker inference options. As machine learning establishes itself as a powerful tool across industries, training and running ML models needs to remain cost-effective. SageMaker offers a wide and deep feature set for facilitating each step in the ML pipeline and provides cost optimization opportunities without impacting performance or agility. Reach out to your AWS team for cost guidance on your SageMaker workloads.

Refer to the following posts in this series for more information about optimizing cost for SageMaker:


About the Authors

Deepali Rajale is a Senior AI/ML Specialist at AWS. She works with enterprise customers providing technical guidance with best practices for deploying and maintaining AI/ML solutions in the AWS ecosystem. She has worked with a wide range of organizations on various deep learning use cases involving NLP and computer vision. She is passionate about empowering organizations to leverage generative AI to enhance their use experience. In her spare time, she enjoys movies, music, and literature.

Uri Rosenberg is the AI & ML Specialist Technical Manager for Europe, Middle East, and Africa. Based out of Israel, Uri works to empower enterprise customers on all things ML to design, build, and operate at scale. In his spare time, he enjoys cycling, hiking, and rock and roll climbing.

Read More

Analyze Amazon SageMaker spend and determine cost optimization opportunities based on usage, Part 4: Training jobs

Analyze Amazon SageMaker spend and determine cost optimization opportunities based on usage, Part 4: Training jobs

In 2021, we launched AWS Support Proactive Services as part of the AWS Enterprise Support plan. Since its introduction, we’ve helped hundreds of customers optimize their workloads, set guardrails, and improve the visibility of their machine learning (ML) workloads’ cost and usage.

In this series of posts, we share lessons learned about optimizing costs in Amazon SageMaker. In this post, we focus on SageMaker training jobs.

Analyze Amazon SageMaker spend and determine cost optimization opportunities based on usage:

SageMaker training jobs

SageMaker training jobs are asynchronous batch processes with built-in features for ML model training and optimization.

With SageMaker training jobs, you can bring your own algorithm or choose from more than 25 built-in algorithms. SageMaker supports various data sources and access patterns, distributed training including heterogenous clusters, as well as experiment management features and automatic model tuning.

The cost of a training job is based on the resources you use (instances and storage) for the duration (in seconds) that those instances are running. This includes the time training takes place and, if you’re using the warm pool feature, the keep alive period you configure. In Part 1, we showed how to get started using AWS Cost Explorer to identify cost optimization opportunities in SageMaker. You can filter training costs by applying a filter on the usage type. The names of these usage types are as follows:

  • REGION-Train:instanceType (for example, USE1-Train:ml.m5.large)
  • REGION-Train:VolumeUsage.gp2 (for example, USE1-Train:VolumeUsage.gp2)

To view a breakdown of your training costs in Cost Explorer, you can enter train: as a prefix for Usage type. If you filter only for hours used (see the following screenshot), Cost Explorer will generate two graphs: Cost and Usage. This view will help you prioritize your optimization opportunities and identify which instances are long-running and costly.

Before optimizing an existing training job, we recommend following the best practices covered in Optimizing costs for machine learning with Amazon SageMaker: test your code locally and use local mode for testing, use pre-trained models where possible, and consider managed spot training (which can optimize cost up to 90% over On-Demand instances).

When an On-Demand job is launched, it goes through five phases: Starting, Downloading, Training, Uploading, and Completed. You can see those phases and descriptions on the training job’s page on the SageMaker console.

From a pricing perspective, you are charged for Downloading, Training, and Uploading phases.

Reviewing these phases is a first step in diagnosing where to optimize your training costs. In this post, we discuss the Downloading and Training phases.

Downloading phase

In the preceding example, the Downloading phase took less than a minute. However, if data downloading is a big factor of your training cost, you should consider the data source you are using and access methods. SageMaker training jobs support three data sources natively: Amazon Elastic File System (Amazon EFS), Amazon Simple Storage Service (Amazon S3), and Amazon FSx for Lustre. For Amazon S3, SageMaker offers three managed ways that your algorithm can access the training: File mode (where data is downloaded to the instance block storage), Pipe mode (data is streamed to the instance, thereby eliminating the duration of the Downloading phase) and Fast File mode (combines the ease of use of the existing File mode with the performance of Pipe mode). For detailed guidance on choosing the right data source and access methods, refer to Choose the best data source for your Amazon SageMaker training job.

When using managed spot training, any repeated Downloading phases that occurred due to interruption are not charged (so you’re only charged for the duration of the data download one time).

It’s important to note that although SageMaker training jobs support the data sources we mentioned, they are not mandatory. In your training code, you can implement any method for downloading the training data from any source (provided that the training instance can access it). There are additional ways to speed up download time, such as using the Boto3 API with multiprocessing to download files concurrently, or using third-party libraries such as WebDataset or s5cmd for faster download from Amazon S3. For more information, refer to Parallelizing S3 Workloads with s5cmd.

Training phase

Optimizing the Training phase cost consists of optimizing two vectors: choosing the right infrastructure (instance family and size), and optimizing the training itself. We can roughly divide training instances into two categories: accelerated GPU-based, mostly for deep-learning models, and CPU-based for common ML frameworks. For guidance on selecting the right instance family for training, refer to Ensure efficient compute resources on Amazon SageMaker. If your training requires GPUs instances, we recommend referring to the video How to select Amazon EC2 GPU instances for deep learning.

As a general guidance, if your workload does require an NVIDIA GPU, we found customers gain significant cost savings with two Amazon Elastic Compute Cloud (Amazon EC2) instance types: ml.g4dn and ml.g5. The ml.g4dn is equipped with NVIDIA T4 and offers a particularly low cost per memory. The ml.g5 instance is equipped with NVIDIA A10g Tensor Core and has the lowest cost-per-CUDA flop (fp32).

AWS offers specific cost saving features for deep learning training:

In order to right-size and optimize your instance, you should first look at the Amazon CloudWatch metrics the training jobs are generating. For more information, refer to SageMaker Jobs and Endpoint Metrics. You can further use CloudWatch custom algorithm metrics to monitor the training performance.

These metrics can indicate bottlenecks or over-provisioning of resources. For example, if you’re observing high CPU with low GPU utilizations, you can address the issue by using heterogeneous clusters. Another example can be seeing consistent low CPU utilization throughout the job duration—this can lead to reducing the size of the instance.

If you’re using distributed training, you should test different distribution methods (tower, Ring-AllReduce, mirrored, and so on) to validate maximum utilization and fine-tune your framework parameters accordingly (for an example, see Best practices for TensorFlow 1.x acceleration training on Amazon SageMaker). It’s important to highlight that you can use the SageMaker distribution API and libraries like SageMaker Distributed Data Parallel, SageMaker Model Parallel, and SageMaker Sharded Data Parallel, which are optimized for AWS infrastructure and help reduce training costs.

Note that distributed training doesn’t necessarily scale linearly and might introduce some overhead, which will affect the overall runtime.

For deep learning models, another optimization technique is using mixed precision. Mixed precision can speed up training, thereby reducing both training time and memory usage with minimal to no impact on model accuracy. For more information, see the Train with Data Parallel and Model Parallel section in Distributed Training in Amazon SageMaker.

Finally, optimizing framework-specific parameters can have a significant impact in optimizing the training process. SageMaker automatic model tuning finds hyperparameters that perform the best, as measured by an objective metric that you choose. Setting the training time as an objective metric and framework configuration as hyperparameters can help remove bottlenecks and reduce overall training time. For an example of optimizing the default TensorFlow settings and removing a CPU bottleneck, refer to Aerobotics improves training speed by 24 times per sample with Amazon SageMaker and TensorFlow.

Another opportunity for optimizing both download and processing time is to consider training on a subset of your data. If your data consists of multiple duplicate entries or features with low information gain, you might be able to train on a subset of data and reduce downloading and training time as well as use a smaller instance and Amazon Elastic Block Store (Amazon EBS) volume. For an example, refer to Use a data-centric approach to minimize the amount of data required to train Amazon SageMaker models. Also, Amazon SageMaker Data Wrangler can simplify the analysis and creation of training samples. For more information, refer to Create random and stratified samples of data with Amazon SageMaker Data Wrangler.

SageMaker Debugger

To ensure efficient training and resource utilization, SageMaker can profile your training job using Amazon SageMaker Debugger. Debugger offers built-in rules to alert on common issues that are affecting your training like CPU bottleneck, GPU memory increase, or I/O bottleneck, or you can create your own rules. You can access and analyze the generated report in Amazon SageMaker Studio. For more information, refer to Amazon SageMaker Debugger UI in Amazon SageMaker Studio Experiments. The following screenshot shows the Debugger view in Studio.

You can drill down into the Python operators and functions (the Top operations on GPU section) that are run to perform the training job. The Debugger built-in rules for profiling watch framework operation-related issues, including excessive training initialization time due to data downloading before training starts and step duration outliers in training loops. You should note that although using the built-in rules are free, costs for custom rules apply based on the instance that you configure for the duration of the training job and storage that is attached to it.

Conclusion

In this post, we provided guidance on cost analysis and best practices when training ML models using SageMaker training jobs. As machine learning establishes itself as a powerful tool across industries, training and running ML models needs to remain cost-effective. SageMaker offers a wide and deep feature set for facilitating each step in the ML pipeline and provides cost optimization opportunities without impacting performance or agility.

Refer to the following posts in this series for more information about optimizing cost for SageMaker:


About the Authors

Deepali Rajale is a Senior AI/ML Specialist at AWS. She works with enterprise customers providing technical guidance with best practices for deploying and maintaining AI/ML solutions in the AWS ecosystem. She has worked with a wide range of organizations on various deep learning use cases involving NLP and computer vision. She is passionate about empowering organizations to leverage generative AI to enhance their use experience. In her spare time, she enjoys movies, music, and literature.

Uri Rosenberg is the AI & ML Specialist Technical Manager for Europe, Middle East, and Africa. Based out of Israel, Uri works to empower enterprise customers on all things ML to design, build, and operate at scale. In his spare time, he enjoys cycling, hiking, and increasing entropy.

Read More

Analyze Amazon SageMaker spend and determine cost optimization opportunities based on usage, Part 3: Processing and Data Wrangler jobs

Analyze Amazon SageMaker spend and determine cost optimization opportunities based on usage, Part 3: Processing and Data Wrangler jobs

In 2021, we launched AWS Support Proactive Services as part of the AWS Enterprise Support plan. Since its introduction, we’ve helped hundreds of customers optimize their workloads, set guardrails, and improve the visibility of their machine learning (ML) workloads’ cost and usage.

In this series of posts, we share lessons learned about optimizing costs in Amazon SageMaker. In this post, we focus on data preprocessing using Amazon SageMaker Processing and Amazon SageMaker Data Wrangler jobs.

Data preprocessing holds a pivotal role in a data-centric AI approach. However, preparing raw data for ML training and evaluation is often a tedious and demanding task in terms of compute resources, time, and human effort. Data preparation commonly needs to be integrated from different sources and deal with missing or noisy values, outliers, and so on.

Furthermore, in addition to common extract, transform, and load (ETL) tasks, ML teams occasionally require more advanced capabilities like creating quick models to evaluate data and produce feature importance scores or post-training model evaluation as part of an MLOps pipeline.

SageMaker offers two features specifically designed to help with those issues: SageMaker Processing and Data Wrangler. SageMaker Processing enables you to easily run preprocessing, postprocessing, and model evaluation on a fully managed infrastructure. Data Wrangler reduces the time it takes to aggregate and prepare data by simplifying the process of data source integration and feature engineering using a single visual interface and a fully distributed data processing environment.

Both SageMaker features provide great flexibility with several options for I/O, storage, and computation. However, setting those options incorrectly may lead to unnecessary cost, especially when dealing with large datasets.

In this post, we analyze the pricing factors and provide cost optimization guidance for SageMaker Processing and Data Wrangler jobs.

Analyze Amazon SageMaker spend and determine cost optimization opportunities based on usage:

SageMaker Processing

SageMaker Processing is a managed solution to run data processing and model evaluation workloads. You can use it in data processing steps such as feature engineering, data validation, model evaluation, and model interpretation in ML workflows. With SageMaker Processing, you can bring your own custom processing scripts and choose to build a custom container or use a SageMaker managed container with common frameworks like scikit-learn, Lime, Spark and more.

SageMaker Processing charges you for the instance type you choose, based on the duration of use and provisioned storage that is attached to that instance. In Part 1, we showed how to get started using AWS Cost Explorer to identify cost optimization opportunities in SageMaker.

You can filter processing costs by applying a filter on the usage type. The names of these usage types are as follows:

  • REGION-Processing:instanceType (for example, USE1-Processing:ml.m5.large)
  • REGION-Processing:VolumeUsage.gp2 (for example, USE1-Processing:VolumeUsage.gp2)

To review your SageMaker Processing cost in Cost Explorer, start by filtering with SageMaker for Service, and for Usage type, you can select all processing instances running hours by entering the processing:ml prefix and selecting the list on the menu.

Avoid cost in processing and pipeline development

Before right-sizing and optimizing a SageMaker Processing job’s run duration, we check for high-level metrics about historic job runs. You can choose from two methods to do this.

First, you can access the Processing page on the SageMaker console.

Alternatively, you can use the list_processing_jobs API.

A Processing job status can be InProgress, Completed, Failed, Stopping, or Stopped.

A high number of failed jobs is common when developing new MLOps pipelines. However, you should always test and make every effort to validate jobs before launching them on SageMaker because there are charges for resources used. For that purpose, you can use SageMaker Processing in local mode. Local mode is a SageMaker SDK feature that allows you to create estimators, processors, and pipelines, and deploy them to your local development environment. This is a great way to test your scripts before running them in a SageMaker managed environment. Local mode is supported by SageMaker managed containers and the ones you supply yourself. To learn more about how to use local mode with Amazon SageMaker Pipelines, refer to Local Mode.

Optimize I/O-related cost

SageMaker Processing jobs offer access to three data sources as part of the managed processing input: Amazon Simple Storage Service (Amazon S3), Amazon Athena, and Amazon Redshift. For more information, refer to ProcessingS3Input, AthenaDatasetDefinition, and RedshiftDatasetDefinition, respectively.

Before looking into optimization, it’s important to note that although SageMaker Processing jobs support these data sources, they are not mandatory. In your processing code, you can implement any method for downloading the accessing data from any source (provided that the processing instance can access it).

To gain better insights into processing performance and detecting optimization opportunities, we recommend following logging best practices in your processing script. SageMaker publishes your processing logs to Amazon CloudWatch.

In the following example job log, we see that the script processing took 15 minutes (between Start custom script and End custom script).

However, on the SageMaker console, we see that the job took 4 additional minutes (almost 25% of the job’s total runtime).

This is due to the fact that in addition to the time our processing script took, SageMaker-managed data downloading and uploading also took time (4 minutes). If this proves to be a big part of the cost, consider alternate ways to speed up downloading time, such as using the Boto3 API with multiprocessing to download files concurrently, or using third-party libraries as WebDataset or s5cmd for faster download from Amazon S3. For more information, refer to Parallelizing S3 Workloads with s5cmd. Note that such methods might introduce charges in Amazon S3 due to data transfer.

Processing jobs also support Pipe mode. With this method, SageMaker streams input data from the source directly to your processing container into named pipes without using the ML storage volume, thereby eliminating the data download time and a smaller disk volume. However, this requires a more complicated programming model than simply reading from files on a disk.

As mentioned earlier, SageMaker Processing also supports Athena and Amazon Redshift as data sources. When setting up a Processing job with these sources, SageMaker automatically copies the data to Amazon S3, and the processing instance fetches the data from the Amazon S3 location. However, when the job is finished, there is no managed cleanup process and the data copied will still remain in Amazon S3 and might incur unwanted storage charges. Therefore, when using Athena and Amazon Redshift data sources, make sure to implement a cleanup procedure, such as a Lambda function that runs on a schedule or in a Lambda Step as part of a SageMaker pipeline.

Like downloading, uploading processing artifacts can also be an opportunity for optimization. When a Processing job’s output is configured using the ProcessingS3Output parameter, you can specify which S3UploadMode to use. The S3UploadMode parameter default value is EndOfJob, which will get SageMaker to upload the results after the job completes. However, if your Processing job produces multiple files, you can set S3UploadMode to Continuous, thereby enabling the upload of artifacts simultaneously as processing continues, and decreasing the job runtime.

Right-size processing job instances

Choosing the right instance type and size is a major factor in optimizing the cost of SageMaker Processing jobs. You can right-size an instance by migrating to a different version within the same instance family or by migrating to another instance family. When migrating within the same instance family, you only need to consider CPU/GPU and memory. For more information and general guidance on choosing the right processing resources, refer to Ensure efficient compute resources on Amazon SageMaker.

To fine-tune instance selection, we start by analyzing Processing job metrics in CloudWatch. For more information, refer to Monitor Amazon SageMaker with Amazon CloudWatch.

CloudWatch collects raw data from SageMaker and processes it into readable, near-real-time metrics. Although these statistics are kept for 15 months, the CloudWatch console limits the search to metrics that were updated in the last 2 weeks (this ensures that only current jobs are shown). Processing jobs metrics can be found in the /aws/sagemaker/ProcessingJobs namespace and the metrics collected are CPUUtilization, MemoryUtilization, GPUUtilization, GPUMemoryUtilization, and DiskUtilization.

The following screenshot shows an example in CloudWatch of the Processing job we saw earlier.

In this example, we see the averaged CPU and memory values (which is the default in CloudWatch): the average CPU usage is 0.04%, memory 1.84%, and disk usage 13.7%. In order to right-size, always consider the maximum CPU and memory usage (in this example, the maximum CPU utilization was 98% in the first 3 minutes). As a general rule, if your maximum CPU and memory usage is consistently less than 40%, you can safely cut the machine in half. For example, if you were using an ml.c5.4xlarge instance, you could move to an ml.c5.2xlarge, which could reduce your cost by 50%.

Data Wrangler jobs

Data Wrangler is a feature of Amazon SageMaker Studio that provides a repeatable and scalable solution for data exploration and processing. You use the Data Wrangler interface to interactively import, analyze, transform, and featurize your data. Those steps are captured in a recipe (a .flow file) that you can then use in a Data Wrangler job. This helps you reapply the same data transformations on your data and also scale to a distributed batch data processing job, either as part of an ML pipeline or independently.

For guidance on optimizing your Data Wrangler app in Studio, refer to Part 2 in this series.

In this section, we focus on optimizing Data Wrangler jobs.

Data Wrangler uses SageMaker Spark processing jobs with a Data Wrangler-managed container. This container runs the directions from the .flow file in the job. Like any processing jobs, Data Wrangler charges you for the instances you choose, based on the duration of use and provisioned storage that is attached to that instance.

In Cost Explorer, you can filter Data Wrangler jobs costs by applying a filter on the usage type. The names of these usage types are:

  • REGION-processing_DW:instanceType (for example, USE1-processing_DW:ml.m5.large)
  • REGION-processing_DW:VolumeUsage.gp2 (for example, USE1-processing_DW:VolumeUsage.gp2)

To view your Data Wrangler cost in Cost Explorer, filter the service to use SageMaker, and for Usage type, choose the processing_DW prefix and select the list on the menu. This will show you both instance usage (hours) and storage volume (GB) related costs. (If you want to see Studio Data Wrangler costs you can filter the usage type by the Studio_DW prefix.)

Right-size and schedule Data Wrangler job instances

At the moment, Data Wrangler supports only m5 instances with following instance sizes: ml.m5.4xlarge, ml.m5.12xlarge, and ml.m5.24xlarge. You can use the distributed job feature to fine-tune your job cost. For example, suppose you need to process a dataset that requires 350 GiB in RAM. The 4xlarge (128 GiB) and 12xlarge (256 GiB) might not be able to process and will lead you to use the m5.24xlarge instance (768 GiB). However, you could use two m5.12xlarge instances (2 * 256 GiB = 512 GiB) and reduce the cost by 40% or three m5.4xlarge instances (3 * 128 GiB = 384 GiB) and save 50% of the m5.24xlarge instance cost. You should note that these are estimates and that distributed processing might introduce some overhead that will affect the overall runtime.

When changing the instance type, make sure you update the Spark config accordingly. For example, if you have an initial ml.m5.4xlarge instance job configured with properties spark.driver.memory set to 2048 and spark.executor.memory set to 55742, and later scale up to ml.m5.12xlarge, those configuration values need to be increased, otherwise they will be the bottleneck in the processing job. You can update these variables in the Data Wrangler GUI or in a configuration file appended to the config path (see the following examples).

Another compelling feature in Data Wrangler is the ability to set a scheduled job. If you’re processing data periodically, you can create a schedule to run the processing job automatically. For example, you can create a schedule that runs a processing job automatically when you get new data (for examples, see Export to Amazon S3 or Export to Amazon SageMaker Feature Store). However, you should note that when you create a schedule, Data Wrangler creates an eventRule in EventBridge. This means you also be charged for the event rules that you create (as well as the instances used to run the processing job). For more information, see Amazon EventBridge pricing.

Conclusion

In this post, we provided guidance on cost analysis and best practices when preprocessing

data using SageMaker Processing and Data Wrangler jobs. Similar to preprocessing, there are many options and configuration settings in building, training, and running ML models that may lead to unnecessary costs. Therefore, as machine learning establishes itself as a powerful tool across industries, ML workloads needs to remain cost-effective.

SageMaker offers a wide and deep feature set for facilitating each step in the ML pipeline.

This robustness also provides continuous cost optimization opportunities without compromising performance or agility.

Refer to the following posts in this series for more information about optimizing cost for SageMaker:


About the Authors

Deepali Rajale is a Senior AI/ML Specialist at AWS. She works with enterprise customers providing technical guidance with best practices for deploying and maintaining AI/ML solutions in the AWS ecosystem. She has worked with a wide range of organizations on various deep learning use cases involving NLP and computer vision. She is passionate about empowering organizations to leverage generative AI to enhance their use experience. In her spare time, she enjoys movies, music, and literature.

Uri Rosenberg is the AI & ML Specialist Technical Manager for Europe, Middle East, and Africa. Based out of Israel, Uri works to empower enterprise customers on all things ML to design, build, and operate at scale. In his spare time, he enjoys cycling, hiking, and watching sunsets (at minimum once a day).

Read More

Analyze Amazon SageMaker spend and determine cost optimization opportunities based on usage, Part 2: SageMaker notebooks and Studio

Analyze Amazon SageMaker spend and determine cost optimization opportunities based on usage, Part 2: SageMaker notebooks and Studio

In 2021, we launched AWS Support Proactive Services as part of the AWS Enterprise Support offering. Since its introduction, we have helped hundreds of customers optimize their workloads, set guardrails, and improve the visibility of their machine learning (ML) workloads’ cost and usage.

In this series of posts, we share lessons learned about optimizing costs in Amazon SageMaker. In Part 1, we showed how to get started using AWS Cost Explorer to identify cost optimization opportunities in SageMaker. In this post, we focus on various ways to analyze SageMaker usage and identify cost optimization opportunities for SageMaker notebook instances and Amazon SageMaker Studio.

Analyze Amazon SageMaker spend and determine cost optimization opportunities based on usage:

SageMaker notebook instances

A SageMaker notebook instance is a fully managed compute instance running the Jupyter Notebook app. SageMaker manages creating the instance and related resources. Notebooks contain everything needed to run or recreate an ML workflow. You can use Jupyter notebooks in your notebook instance to prepare and process data, write code to train models, deploy models to SageMaker Hosting, and test or validate your models. SageMaker notebook instances’ cost is based on the instance-hours consumed while the notebook instance is running, as well as the cost of GB-month of provisioned storage, as outlined in Amazon SageMaker Pricing.

In Cost Explorer, you can filter notebook costs by applying a filter on Usage type. The names of these usage types are structured as follows:

  • REGION-Notebk:instanceType (for example, USE1-Notebk:ml.g4dn.8xlarge)
  • REGION-Notebk:VolumeUsage.gp2 (for example, USE2-Notebk:VolumeUsage.gp2)

Filtering by the usage type Notebk: will show you a list of notebook usage types in an account. As shown in the following screenshot, you can select Select All and choose Apply to display the cost breakdown of your notebook usage.

To see the cost breakdown of the selected notebook usage type by the number of usage hours, you need to de-select all the REGION-Notebk:VolumeUsage.gp2 usage types from the preceding list and choose Apply to apply the filter. The following screenshot shows the cost and usage graphs for the selected notebook usage types.

You can also apply additional filters such as account number, Amazon Elastic Compute Cloud (Amazon EC2) instance type, cost allocation tag, Region, and more. Changing the granularity to Daily gives you daily cost and usage charts based on the selected usage types and dimension, as shown in the following screenshot.

In the preceding example, the notebook instance of type ml.t2.medium in the USE2 Region is reporting a daily usage of 24 hours between the period of July 2 and September 26. Similarly, the notebook instance of type ml.t3.medium in the USE1 Region is reporting a daily usage of 24 hours between August 3 and September 26, and a daily usage of 48 hours between September 26 and December 31. Daily usage of 24 hours or more for multiple consecutive days could indicate that a notebook instance has been left running for multiple days but is not in active use. This type of pattern could benefit from applying cost control guardrails such as manual or auto-shutdown of notebook instances to prevent idle runtime.

Although Cost Explorer helps you understand cost and usage data at the granularity of the instance type, you can use AWS Cost and Usage Reports (AWS CUR) to get data at the granularity of a resource such as notebook ARN. You can build custom queries to look up AWS CUR data using standard SQL. You can also include cost-allocation tags in your query for an additional level of granularity. The following query returns notebook resource usage for the last 3 months from your AWS CUR data:

SELECT
      bill_payer_account_id,
      line_item_usage_account_id,
      line_item_resource_id AS notebook_arn,
      line_item_usage_type,
      DATE_FORMAT((line_item_usage_start_date),'%Y-%m-%d') AS day_line_item_usage_start_date,
      SUM(CAST(line_item_usage_amount AS DOUBLE)) AS sum_line_item_usage_amount,
      line_item_unblended_rate,
      SUM(CAST(line_item_unblended_cost AS DECIMAL(16,8))) AS sum_line_item_unblended_cost,
      line_item_blended_rate,
      SUM(CAST(line_item_blended_cost AS DECIMAL(16,8))) AS sum_line_item_blended_cost,
      line_item_line_item_description,
      line_item_line_item_type
    FROM 
      {$table_name}
    WHERE
      line_item_usage_start_date >= date_trunc('month',current_date - interval '3' month)
      AND line_item_product_code = 'AmazonSageMaker'
      AND line_item_line_item_type  IN ('DiscountedUsage', 'Usage', 'SavingsPlanCoveredUsage')
      AND line_item_usage_type like '%Notebk%'
        AND line_item_operation = 'RunInstance'
        AND bill_payer_account_id = 'xxxxxxxxxxxx'
    GROUP BY
      bill_payer_account_id, 
      line_item_usage_account_id,
      line_item_resource_id,
      line_item_usage_type,
      line_item_unblended_rate,
      line_item_blended_rate,
      line_item_line_item_type,
      DATE_FORMAT((line_item_usage_start_date),'%Y-%m-%d'),
      line_item_line_item_description
      ORDER BY 
      line_item_resource_id, day_line_item_usage_start_date

The following screenshot shows the results obtained from running the AWS CUR query using Amazon Athena. For more information about using Athena, refer to Querying Cost and Usage Reports using Amazon Athena.

The result of the query shows that notebook dev-notebook running on an ml.t2.medium instance is reporting 24 hours of usage for multiple consecutive days. The instance rate is $0.0464/hour and the daily cost for running for 24 hours is $1.1136.

AWS CUR query results can help you identify patterns of notebooks running for consecutive days, which can be analyzed for cost optimization. More information and example queries can be found in the AWS CUR Query Library.

You can also feed AWS CUR data into Amazon QuickSight, where you can slice and dice it any way you’d like for reporting or visualization purposes. For instructions on ingesting AWS CUR data into QuickSight, see How do I ingest and visualize the AWS Cost and Usage Report (CUR) into Amazon QuickSight.

Optimize notebook instance cost

SageMaker notebooks are suitable for ML model development, which includes interactive data exploration, script writing, prototyping of feature engineering, and modeling. Each of these tasks may have varying computing resource requirements. Estimating the right type of computing resources to serve various workloads is challenging, and may lead to over-provisioning of resources, resulting in increased cost.

For ML model development, the size of a SageMaker notebook instance depends on the amount of data you need to load in-memory for meaningful exploratory data analyses (EDA) and the amount of computation required. We recommend starting small with general-purpose instances (such as T or M families) and scaling up as needed. For example, ml.t2.medium is sufficient for most basic data processing, feature engineering, and EDA that deals with small datasets that can be held within 4 GB memory. If your model development involves heavy computational work (such as image processing), you can stop your smaller notebook instance and change the instance type to the desired larger instance, such as ml.c5.xlarge. You can switch back to the smaller instance when you no longer need a larger instance. This will help keep the compute costs down.

Consider the following best practices to help reduce the cost of your notebook instances.

CPU vs. GPU

Considering CPU vs. GPU notebook instances is important for instance right-sizing. CPUs are best at handling single, more complex calculations sequentially, whereas GPUs are better at handling multiple but simple calculations in parallel. For many use cases, a standard current generation instance type from an instance family such as M provides enough computing power, memory, and network performance for notebooks to perform well.

GPUs provide a great price/performance ratio if you take advantage of them effectively. For example, if you are training your deep learning model on a SageMaker notebook and your neural network is relatively big, performing a large number of calculations involving hundreds of thousands of parameters, then your model can take advantage of the accelerated compute and hardware parallelism offered by GPU instances such as P instance families. However, it’s recommended to use GPU instances only when you really need them because they’re expensive and GPU communication overhead might even degrade performance if your notebook doesn’t need them. We recommend using notebooks with instances that are smaller in compute for interactive building and leaving the heavy lifting to ephemeral training, tuning, and processing jobs with larger instances, including GPU-enabled instances. This way, you don’t keep a large instance (or a GPU) constantly running with your notebook. If you need accelerated computing in your notebook environment, you can stop your m* family notebook instance, switch to a GPU-enabled P* family instance, and start it again. Don’t forget to switch it back when you no longer need that extra boost in your development environment.

Restrict user access to specific instance types

Administrators can restrict users from creating notebooks that are too large through AWS Identity and Access Management (IAM) policies. For example, the following sample policy only allows users to create smaller t3 SageMaker notebook instances:

{
    "Action": [
        "sagemaker:CreateNotebookInstances"
    ],
    "Resource": [
        "*"
    ],
    "Effect": "Deny",
    "Sid": "BlockLargeNotebookInstances",
    "Condition": {
        "ForAnyValue:StringNotLike": {
            "sagemaker:InstanceTypes": [
                "ml.t3.medium",
                "ml.t3.large"
            ]
        }
    }
}

Administrators can also use AWS Service Catalog to allow for self-service of SageMaker notebooks. This allows you to restrict the instance types that are available to users when creating a notebook. For more information, see Enable self-service, secured data science using Amazon SageMaker notebooks and AWS Service Catalog and Launch Amazon SageMaker Studio using AWS Service Catalog and AWS SSO in AWS Control Tower Environment.

Stop idle notebook instances

To keep your costs down, we recommend stopping your notebook instances when you don’t need them and starting them when you do need them. Consider auto-detecting idle notebook instances and managing their lifecycle using a lifecycle configuration script. For example, auto-stop-idle is a sample shell script that stops a SageMaker notebook when it’s idle for more than 1 hour.

AWS maintains a public repository of notebook lifecycle configuration scripts that address common use cases for customizing notebook instances, including a sample bash script for stopping idle notebooks.

Schedule automatic start and stop of notebook instances

Another approach to save on notebooks cost is to automatically start and stop your notebooks at specific times. You can accomplish this by using Amazon EventBridge rules and AWS Lambda functions. For more information about configuring your Lambda functions, see Configuring Lambda function options. After you have created the functions, you can create rules to trigger these functions on a specific schedule, for example, start the notebooks every weekday at 7:00 AM. See Creating an Amazon EventBridge rule that runs on a schedule for instructions. For the scripts to start and stop notebooks with a Lambda function, refer to Ensure efficient compute resources on Amazon SageMaker.

SageMaker Studio

Studio provides a fully managed solution for data scientists to interactively build, train, and deploy ML models. Studio notebooks are one-click collaborative Jupyter notebooks that can be spun up quickly because you don’t need to set up compute instances and file storage beforehand. You are charged for the compute instance type you choose to run your notebooks on, based on the duration of use. There is no additional charge for using Studio. The costs incurred for running Studio notebooks, interactive shells, consoles, and terminals are based on ML compute instance usage.

When launched, the resource is run on an ML compute instance of the chosen instance type. If an instance of that type was previously launched and is available, the resource is run on that instance. For CPU-based images, the default suggested instance type is ml.t3.medium. For GPU-based images, the default suggested instance type is ml.g4dn.xlarge. Billing occurs per instance and starts when the first instance of a given instance type is launched.

If you want to create or open a notebook without the risk of incurring charges, open the notebook from the File menu and choose No Kernel from the Select Kernel dialog. You can read and edit a notebook without a running kernel, but you can’t run cells. You are billed separately for each instance. Billing ends when all the KernelGateway apps on the instance are shut down, or the instance is shut down. For information about billing along with pricing examples, see Amazon SageMaker Pricing.

In Cost Explorer, you can filter Studio notebook costs by applying a filter on Usage type. The name of this usage types is structured as: REGION-studio:KernelGateway-instanceType (for example, USE1-Studio:KernelGateway-ml.m5.large)

Filtering by the usage type studio: in Cost Explorer will show you the list of Studio usage types in an account. You can select the necessary usage types, or select Select All and choose Apply to display the cost breakdown of Studio app usage. The following screenshot shows the selection all the studio usage types for cost analysis.

You can also apply additional filters such as Region, linked account, or instance type for more granular cost analysis. Changing the granularity to Daily gives you daily cost and usage charts based selected usage types and dimension, as shown in the following screenshot.

In the preceding example, the Studio KernelGateway instance of type ml.t3.medium in the USE1 Region is reporting a daily usage of 48 hours between the period of January 1 and January 24, followed by a daily usage of 24 hours until February 11. Similarly, Studio KernelGateway instance of type ml.m5.large in USE1 Region is reporting 24 hours of daily usage of between January 1 and January 23. A daily usage of 24 hours or more for multiple consecutive days indicates a possibility of Studio notebook instances running continuously for multiple days. This type of pattern could benefit from applying cost control guardrails such as manual or automatic shutdown of Studio apps when not in use.

As mentioned earlier, you can use AWS CUR to get data at the granularity of a resource and build custom queries to look up AWS CUR data using standard SQL. You can also include cost-allocation tags in your query for an additional level of granularity. The following query returns Studio KernelGateway resource usage for the last 3 months from your AWS CUR data:

SELECT
      bill_payer_account_id,
      line_item_usage_account_id,
      line_item_resource_id AS studio_notebook_arn,
      line_item_usage_type,
      DATE_FORMAT((line_item_usage_start_date),'%Y-%m-%d') AS day_line_item_usage_start_date,
      SUM(CAST(line_item_usage_amount AS DOUBLE)) AS sum_line_item_usage_amount,
      line_item_unblended_rate,
      SUM(CAST(line_item_unblended_cost AS DECIMAL(16,8))) AS sum_line_item_unblended_cost,
      line_item_blended_rate,
      SUM(CAST(line_item_blended_cost AS DECIMAL(16,8))) AS sum_line_item_blended_cost,
      line_item_line_item_description,
      line_item_line_item_type
    FROM 
      customer_all
    WHERE
      line_item_usage_start_date >= date_trunc('month',current_date - interval '3' month)
      AND line_item_product_code = 'AmazonSageMaker'
      AND line_item_line_item_type  IN ('DiscountedUsage', 'Usage', 'SavingsPlanCoveredUsage')
      AND line_item_usage_type like '%Studio:KernelGateway%'
        AND line_item_operation = 'RunInstance'
        AND bill_payer_account_id = 'xxxxxxxxxxxx'
    GROUP BY
      bill_payer_account_id, 
      line_item_usage_account_id,
      line_item_resource_id,
      line_item_usage_type,
      line_item_unblended_rate,
      line_item_blended_rate,
      line_item_line_item_type,
      DATE_FORMAT((line_item_usage_start_date),'%Y-%m-%d'),
      line_item_line_item_description
      ORDER BY 
      line_item_resource_id, day_line_item_usage_start_date

The following screenshot shows the results obtained from running the AWS CUR query using Athena.

The result of the query shows that the Studio KernelGateway app named datascience-1-0-ml-t3-medium-1abf3407f667f989be9d86559395 running in account 111111111111, Studio domain d-domain1234, and user profile user1 on an ml.t3.medium instance is reporting 24 hours of usage for multiple consecutive days. The instance rate is $0.05/hour and the daily cost for running for 24 hours is $1.20.

AWS CUR query results can help you identify patterns of resources running for consecutive days at a granular level of hourly or daily usage, which can be analyzed for cost optimization. As with SageMaker notebooks, you can also feed AWS CUR data into QuickSight for reporting or visualization purposes.

SageMaker Data Wrangler

Amazon SageMaker Data Wrangler is a feature of Studio that helps you simplify the process of data preparation and feature engineering from a low-code visual interface. The usage type name for a Studio Data Wrangler app is structured as REGION-Studio_DW:KernelGateway-instanceType (for example, USE1-Studio_DW:KernelGateway-ml.m5.4xlarge).

Filtering by the usage type studio_DW: in Cost Explorer will show you the list of Studio Data Wrangler usage types in an account. You can select the necessary usage types, or select Select All and choose Apply to display the cost breakdown of Studio Data Wrangler app usage. The following screenshot shows the selection all the studio_DW usage types for cost analysis.

As noted earlier, you can also apply additional filters for more granular cost analysis. For example, the following screenshot shows 24 hours of daily usage of the Studio Data Wrangler instance type ml.m5.4xlarge in the USE1 Region for multiple days and its associated cost. Insights like this can be used to apply cost control measures such as shutting down Studio apps when not in use.

You can obtain resource-level information from AWS CUR, and build custom queries to look up AWS CUR data using standard SQL. The following query returns Studio Data Wrangler app resource usage and associated cost for the last 3 months from your AWS CUR data:

SELECT
      bill_payer_account_id,
      line_item_usage_account_id,
      line_item_resource_id AS studio_notebook_arn,
      line_item_usage_type,
      DATE_FORMAT((line_item_usage_start_date),'%Y-%m-%d') AS day_line_item_usage_start_date,
      SUM(CAST(line_item_usage_amount AS DOUBLE)) AS sum_line_item_usage_amount,
      line_item_unblended_rate,
      SUM(CAST(line_item_unblended_cost AS DECIMAL(16,8))) AS sum_line_item_unblended_cost,
      line_item_blended_rate,
      SUM(CAST(line_item_blended_cost AS DECIMAL(16,8))) AS sum_line_item_blended_cost,
      line_item_line_item_description,
      line_item_line_item_type
    FROM 
      {$table_name}
    WHERE
      line_item_usage_start_date >= date_trunc('month',current_date - interval '3' month)
      AND line_item_product_code = 'AmazonSageMaker'
      AND line_item_line_item_type  IN ('DiscountedUsage', 'Usage', 'SavingsPlanCoveredUsage')
      AND line_item_usage_type like '%Studio_DW:KernelGateway%'
        AND line_item_operation = 'RunInstance'
        AND bill_payer_account_id = 'xxxxxxxxxxxx'
    GROUP BY
      bill_payer_account_id, 
      line_item_usage_account_id,
      line_item_resource_id,
      line_item_usage_type,
      line_item_unblended_rate,
      line_item_blended_rate,
      line_item_line_item_type,
      DATE_FORMAT((line_item_usage_start_date),'%Y-%m-%d'),
      line_item_line_item_description
      ORDER BY 
      line_item_resource_id, day_line_item_usage_start_date

The following screenshot shows the results obtained from running the AWS CUR query using Athena.

The result of the query shows that the Studio Data Wrangler app named sagemaker-data-wrang-ml-m5-4xlarge-b741c1a025d542c78bb538373f2d running in account 111111111111, Studio domain d-domain1234, and user profile user1 on an ml.m5.4xlarge instance is reporting 24 hours of usage for multiple consecutive days. The instance rate is $0.922/hour and the daily cost for running for 24 hours is $22.128.

Optimize Studio cost

Studio notebooks are charged for the instance type you choose, based on the duration of use. You must shut down the instance to stop incurring charges. If you shut down the notebook running on the instance but don’t shut down the instance, you will still incur charges. When you shut down the Studio notebook instances, any additional resources, such as SageMaker endpoints, Amazon EMR clusters, and Amazon Simple Storage Service (Amazon S3) buckets created from Studio are not deleted. Delete those resources if they are no longer needed to stop accrual of charges. For more details about shutting down Studio resources, refer to Shut Down Resources. If you’re using Data Wrangler, it’s important to shut it down after your work is done to save cost. For details, refer to Shut Down Data Wrangler.

Consider the following best practices to help reduce the cost of your Studio notebooks.

Automatically stop idle Studio notebook instances

You can automatically stop idle Studio notebook resources with lifecycle configurations in Studio. You can also install and use a JupyterLab extension available on GitHub as a Studio lifecycle configuration. For detailed instructions on the Studio architecture and adding the extension, see Save costs by automatically shutting down idle resources within Amazon SageMaker Studio.

Resize on the fly

The benefit of Studio notebooks over notebook instances is that with Studio, the underlying compute resources are fully elastic and you can change the instance on the fly. This allows you to scale the compute up and down as your compute demand changes, for example from ml.t3.medium to ml.m5.4xlarge, without interrupting your work or managing infrastructure. Moving from one instance to another is seamless, and you can continue working while the instance launches. With on-demand notebook instances, you need to stop the instance, update the setting, and restart with the new instance type. For more information, see Learn how to select ML instances on the fly in Amazon SageMaker Studio.

Restrict user access to specific instance types

Administrators can use IAM condition keys as an effective way to restrict certain instance types, such as GPU instances, for specific users, thereby controlling costs. For example, in the following sample policy, access is denied for all instances except ml.t3.medium and ml.g4dn.xlarge. Note that you need to allow the system instance for the default Jupyter Server apps.

{
    "Action": [
        "sagemaker:CreateApp"
    ],
    "Resource": [
        "*"
    ],
    "Effect": "Deny",
    "Sid": "BlockSagemakerLargeInstances",
    "Condition": {
        "ForAnyValue:StringNotLike": {
            "sagemaker:InstanceTypes": [
                "ml.t3.medium",
                "ml.g4dn.xlarge",
                "system"
            ]
        }
    }
}

For comprehensive guidance on best practices to optimize Studio cost, refer to Ensure efficient compute resources on Amazon SageMaker.

Use tags to keep track of Studio cost

In Studio, you can assign custom tags to your Studio domain as well as users who are provisioned access to the domain. Studio will automatically copy and assign these tags to the Studio notebooks created by the users, so you can easily track and categorize the cost of Studio notebooks and create cost chargeback models for your organization.

By default, SageMaker automatically tags new SageMaker resources such as training jobs, processing jobs, experiments, pipelines, and model registry entries with their respective sagemaker:domain-arn. SageMaker also tags the resource with the sagemaker:user-profile-arn or sagemaker:space-arn to designate the resource creation at an even more granular level.

Administrators can use automated tagging to easily monitor costs associated with their line of business, teams, individual users, or individual business problems by using tools such as AWS Budgets and Cost Explorer. For example, you can attach a cost allocation tag for the sagemaker:domain-arn tag.

This allows you to utilize Cost Explorer to visualize the Studio notebook spend for a given domain.

Consider storage costs

When the first member of your team onboards to Studio, SageMaker creates an Amazon Elastic File System (Amazon EFS) volume for the team. When this member, or any member of the team, opens Studio, a home directory is created in the volume for the member. A storage charge is incurred for this directory. Subsequently, additional storage charges are incurred for the notebooks and data files stored in the member’s home directory. For more information, see Amazon EFS Pricing.

Conclusion

In this post, we provided guidance on cost analysis and best practices when building ML models using notebook instances and Studio. As machine learning establishes itself as a powerful tool across industries, training and running ML models needs to remain cost-effective. SageMaker offers a wide and deep feature set for facilitating each step in the ML pipeline and provides cost optimization opportunities without impacting performance or agility.

Refer to the following posts in this series for more information about optimizing cost for SageMaker:


About the Authors

Deepali Rajale is a Senior AI/ML Specialist at AWS. She works with enterprise customers providing technical guidance with best practices for deploying and maintaining AI/ML solutions in the AWS ecosystem. She has worked with a wide range of organizations on various deep learning use cases involving NLP and computer vision. She is passionate about empowering organizations to leverage generative AI to enhance their use experience. In her spare time, she enjoys movies, music, and literature.

Uri Rosenberg is the AI & ML Specialist Technical Manager for Europe, Middle East, and Africa. Based out of Israel, Uri works to empower enterprise customers on all things ML to design, build, and operate at scale. In his spare time, he enjoys cycling, hiking, breakfast, lunch and dinner.

Read More

Analyze Amazon SageMaker spend and determine cost optimization opportunities based on usage, Part 1

Analyze Amazon SageMaker spend and determine cost optimization opportunities based on usage, Part 1

Cost optimization is one of the pillars of the AWS Well-Architected Framework, and it’s a continual process of refinement and improvement over the span of a workload’s lifecycle. It enables building and operating cost-aware systems that minimize costs, maximize return on investment, and achieve business outcomes.

Amazon SageMaker is a fully managed machine learning (ML) service that offers a variety of cost optimization options and capabilities like managed spot training, multi-model endpoints, AWS Inferentia, ML Savings Plans, and many others that help reduce the total cost of ownership (TCO) of ML workloads compared to other cloud-based options, such as self-managed Amazon Elastic Compute Cloud (Amazon EC2) and AWS-managed Amazon Elastic Kubernetes Service (Amazon EKS).

AWS is dedicated to helping you achieve the highest savings by offering extensive service and pricing options. We provide tools for flexible cost management and improved visibility of detailed cost and usage of your workloads.

In 2021, we launched AWS Support Proactive Services as part of the AWS Enterprise Support plan. Since its introduction, we’ve helped hundreds of customers optimize their workloads, set guardrails, and improve the visibility of their ML workloads’ cost and usage.

In this post, we share lessons learned and walk you through the various ways to analyze your SageMaker usage and identify opportunities for cost optimization.

Analyze Amazon SageMaker spend and determine cost optimization opportunities based on usage:

Analyze SageMaker cost using AWS Cost Explorer

AWS Cost Explorer provides preconfigured views that display information about your cost trends and give you a head start on understanding your cost history and trends. It allows you to filter and group by values such as AWS service, usage type, cost allocation tags, EC2 instance type, and more. If you use consolidated billing, you can also filter by linked account. In addition, you can set time intervals and granularity, as well as forecast future costs based on your historical cost and usage data.

Let’s start by using Cost Explorer to identify cost optimization opportunities in SageMaker.

  1. On the Cost Explorer console, choose SageMaker for Service and choose Apply filters.
  2. You can set the desired time interval and granularity, as well as the Group by parameter.
  3. You can display the chart data in bar, line, or stack plot format.
  4. After you have achieved your desired results with filters and groupings, you can either download your results by choosing Download as CSV or save the report by choosing Save to report library.

The following screenshot shows SageMaker costs per month for the selected date range, grouped by Region.

For general guidance on using Cost Explorer, refer to AWS Cost Explorer’s New Look and Common Use Cases.

Optionally, you can enable AWS Cost and Usage Reports (AWS CUR) to gain insights into the cost and usage data for your accounts. The report contains hourly AWS consumption details. It is stored in Amazon Simple Storage Service (Amazon S3) in the payer account, which consolidates data for all the linked accounts. You can query the report to analyze trends in your usage and take appropriate action to optimize cost. Amazon Athena is a serverless query service you can use to analyze the data from your report in Amazon S3 using standard SQL. For more information and example queries, refer to the AWS CUR Query Library.

The following code is an example of an AWS CUR query to obtain SageMaker costs for the last 3 months of usage:

SELECT *
FROM {$table_name}
WHERE 
    line_item_usage_start_date >= date_trunc('month',current_date - interval '3' month)
    AND line_item_product_code = 'AmazonSageMaker'
    AND line_item_line_item_type  IN ('DiscountedUsage', 'Usage', 'SavingsPlanCoveredUsage')

You can also feed AWS CUR data into Amazon QuickSight, where you can slice and dice it any way you’d like for reporting or visualization purposes. For instructions on ingesting CUR data into QuickSight, see How do I ingest and visualize the AWS Cost and Usage Report (CUR) into Amazon QuickSight.

Analyze cost for SageMaker usage types

Your monthly SageMaker cost comes from different SageMaker usage types such as notebook instances, hosting, training, and processing, among others. Selecting the SageMaker service filter and grouping by the Usage type dimension in Cost Explorer gives you a general idea of cost distribution based on SageMaker usage type. The usage type is displayed in the format

REGION-UsageType:instanceType (for example, USE1-Notebk:ml.g4dn.8xlarge)

The following screenshot shows cost distribution grouped by SageMaker usage types when an account has reported usage on notebooks and Amazon SageMaker Studio KernelGateway apps.

General best practices for optimizing SageMaker cost

In this section, we share general recommendations to save on costs while using SageMaker.

Tagging

A tag is a label that you assign to an AWS resource. You can use tags to organize your resources by users, departments, or cost centers, and track your costs on a detailed level. Cost allocation tags can be used for categorizing costs in Cost Explorer or Cost and Usage Reports. For tips and best practices regarding cost allocation for your SageMaker environment and workloads, refer to Set up enterprise-level cost allocation for ML environments and workloads using resource tagging in Amazon SageMaker

AWS Budgets

AWS Budgets gives you visibility into your ML cost on AWS and helps you track your SageMaker cost, including development, training, and hosting. It lets you set custom budgets to track your cost and usage from the simplest to the most complex use cases. AWS Budgets also supports email or Amazon Simple Notification Service (Amazon SNS) notification when actual or forecasted cost and usage exceeds your budget threshold, or when your actual Savings Plans’ utilization or coverage drops below your desired threshold.

AWS Budgets is also integrated with Cost Explorer, so you can easily view and analyze your cost and usage drivers, AWS Chatbot, so you can receive AWS Budget alerts in your designated Slack channel or Amazon Chime room, and AWS Service Catalog, so you can track cost on your approved AWS portfolios and products. You can also set alerts and get a notification when your cost or usage exceeds (or is forecasted to exceed) your budgeted amount. After you create your budget, you can track the progress on the AWS Budgets console. For more information, see Managing your costs with AWS Budgets.

AWS Billing console

The AWS Billing console allows you to easily understand your AWS spending, view and pay invoices, manage billing preferences and tax settings, and access additional cloud financial management services. You can quickly evaluate whether your monthly spend is in line with prior periods, forecast, or budget, and investigate and take corrective actions in a timely manner. You can use the dashboard page of the AWS Billing console to gain a general view of your AWS spending. You can also use it to identify your highest cost service or Region and view trends in your spending over the past few months as well as to see various breakdowns of your AWS usage.

The AWS summary section of the page gives an overview of your AWS costs across all accounts, Regions, service providers, and services, and other KPIs. It also provides a comparison to your total forecasted costs for the current month. The Highest cost section shows your top service, account, or Region by estimated month-to-date (MTD) spend. The Cost trend by top five services section shows the cost trend for your top five services for the most recent three to six closed billing periods.

Planning and forecasting

Forecasting is an essential part of staying on top of your cloud costs and usage, and becomes even more important as your business scales.

AWS has multiple options to help you forecast your costs. The forecasting feature of Cost Explorer gives you the ability to create custom usage forecasts to gain a line of sight into your expected future costs. The built-in ML-powered forecasting of QuickSight allows you to forecast your key business metrics with point-and-click simplicity. It offers a straightforward way to use ML to make predictions on any time series data with minimal setup time and no ML experience required.

You can also use Amazon Forecast, a fully managed service that uses ML to deliver highly accurate forecasts, to generate forecasts for specific AWS services with data collected from AWS CUR. For more information, see Forecasting AWS spend using the AWS Cost and Usage Reports, AWS Glue DataBrew, and Amazon Forecast.

For additional information about cost forecasting options, see Using the right tools for your cloud cost forecasting.

Instance right-sizing

You can optimize SageMaker cost and only pay for what you really need by selecting the right resources. You should right-size the SageMaker compute instances before purchasing a Savings Plan in order to provide a proper commitment and obtain maximum cost savings. SageMaker currently offers ML compute instances on the various instance families. Machine learning is an iterative process with varying compute requirements for different stages of the ML lifecycle, from data preprocessing to model training and model hosting. Identifying the right type of compute instance is challenging, and may lead to over-provisioning of resources and therefore increased cost. The modular architecture of SageMaker allows you to optimize the scalability, performance, and pricing of your ML workloads based on the stage of the ML lifecycle. For more details, refer to the Right-sizing compute resources for Amazon SageMaker notebooks, processing jobs, training, and deployment section of the post Ensure efficient compute resources on Amazon SageMaker.

Amazon SageMaker Savings Plans

Amazon SageMaker Savings Plans is a flexible pricing model for SageMaker. It offers discounted rates in exchange for a commitment to a consistent amount of usage (measured in $/hour) for a 1-year or 3-year term. Savings Plans provide flexibility due to their usage-based model and help reduce your costs by up to 64%. These rates automatically apply to eligible SageMaker ML instance usages including Studio notebooks, SageMaker notebook instances, SageMaker Processing, SageMaker Data Wrangler, SageMaker training, SageMaker real-time inference, and SageMaker batch transform regardless of instance family, size, or Region. This makes it easy for you to maximize savings regardless of how your use cases and consumption evolve over time, and you can save up to 64% compared to the On-Demand price.

For example, you could start with small instances to experiment with different algorithms on a fraction of your dataset. Then, you could move to larger instances to prepare data and train at scale against your full dataset. Finally, you could deploy your models in several Regions to serve low-latency predictions to your users. All the instance size modifications and deployments across new Regions would be covered by the same Savings Plan, without any management effort required on your part.

Every type of SageMaker usage that is eligible for SageMaker Savings Plans has a Savings Plans rate and an On-Demand rate. When you sign up for the SageMaker Savings Plans, you will be charged the Savings Plan rate for your usage up to your commitment. Any usage beyond the commitment will be charged at On-Demand rates. The AWS Cost Management console provides you with recommendations that make it easy to find the right commitment level for a Savings Plan. These recommendations are based on the following:

  • Your SageMaker usage in the last 7, 30, or 60 days. You should select the time period that best represents your future usage.
  • The term of your plan: 1-year or 3-year.
  • Your payment option: No Upfront, Partial Upfront (50% or more), or All Upfront. Some customers prefer (or must use) this last option, because it gives them a clear and predictable view of their SageMaker bill.

The recommendations are based on your historical usage over the selected lookback period and don’t forecast your usage. Be sure to select a lookback period that reflects your future usage. A 3-year term plan provides the highest discount rate; similarly, an All Upfront payment option offers the highest discount rate compared to No Upfront or Partial Upfront payment options. Workloads and usage typically change over time and a consistent, steady-state usage pattern makes a good candidate for a savings plan. If you have a lot of short-lived or one-off workloads, selecting the right commitment for compute usage (measured per hour) could be difficult. It’s recommended to continually purchase small amounts of Savings Plans commitment over time. This ensures that you maintain high levels of coverage to maximize your discounts, and your plans closely match your workload and organization requirements at all times.

To understand Savings Plan recommendations, refer to Decrease Your Machine Learning Costs with Instance Price Reductions and Savings Plans for Amazon SageMaker.

Utilization report

For active Savings Plans, utilization reports are available on the Savings Plans console to see the percentage of the commitment that you’ve actually used. You can use your Savings Plans utilization report to visually understand how much of your Savings Plans commitment you are using over the configured time period, as well as your savings as compared to On-Demand prices. For example, if you have a $10/hour commitment, and your usage billed with Savings Plans rates totals to $9.80 for the hour, your utilization for that hour is 98%. You can see your Savings Plans utilization at an hourly, daily, or monthly granularity, based on your lookback period. You can apply filters by Savings Plans type, member account, Region, and instance family in the Filters section. If you’re a user in a management account, you can see the aggregated utilization for the entire Consolidated Billing Family.

The following screenshot shows an example of a utilization report. You can see that even though Savings Plans coverage is not 100% on many consecutive days, the total net savings is still positive. Without Savings Plans, you would be charged at On-Demand rates for the usage. To realize maximum savings and avoid over-committing, it’s recommended to select the right commitment based on consistent, optimized usage of your SageMaker workloads.

Coverage report

Likewise, coverage reports show you how much of your eligible spend has been covered by the plan. To understand how the coverage is calculated, refer to Using your coverage report.

The following screenshot shows an example of a coverage report. You can see that the average coverage for the selected time period is 92%, along with the On-Demand spend that was not covered by the plan. Based on the On-Demand spend not covered by the plan, you can optionally buy an additional Savings Plan to obtain maximum savings. Also, it’s recommended to right-size the SageMaker compute instances before purchasing a Savings Plan and understand the workload size to avoid over- or under-committing the Savings Plan usage.

For more details on how Savings Plans apply to your AWS usage, refer to Understanding how Savings Plans apply to your AWS usage.

Conclusion

Machine learning has established itself as a powerful tool across industries, but training new models and running ML models for inference can be costly. One of the advantages of running ML on SageMaker is the wide and deep feature set offering cost optimization strategies without impacting performance or agility. This post highlighted the AWS tools and options to analyze your SageMaker costs, identify trends, and implement proactive alerts and optimization best practices.

Refer to the following posts in this series for more information about optimizing cost for SageMaker:


About the Authors

Deepali Rajale is a Senior AI/ML Specialist at AWS. She works with enterprise customers providing technical guidance with best practices for deploying and maintaining AI/ML solutions in the AWS ecosystem. She has worked with a wide range of organizations on various deep learning use cases involving NLP and computer vision. She is passionate about empowering organizations to leverage generative AI to enhance their use experience. In her spare time, she enjoys movies, music, and literature.

Uri Rosenberg is the AI & ML Specialist Technical Manager for Europe, Middle East, and Africa. Based out of Israel, Uri works to empower enterprise customers on all things ML to design, build, and operate at scale. In his spare time, he enjoys cycling, hiking, and time traveling.

Read More

High-quality human feedback for your generative AI applications from Amazon SageMaker Ground Truth Plus

High-quality human feedback for your generative AI applications from Amazon SageMaker Ground Truth Plus

Amazon SageMaker Ground Truth Plus helps you prepare high-quality training datasets by removing the undifferentiated heavy lifting associated with building data labeling applications and managing the labeling workforce. All you do is share data along with labeling requirements, and Ground Truth Plus sets up and manages your data labeling workflow based on these requirements. From there, an expert workforce that is trained on a variety of machine learning (ML) tasks labels your data. You don’t even need deep ML expertise or knowledge of workflow design and quality management to use Ground Truth Plus. Now, Ground Truth Plus is serving customers who need data labeling and human feedback for fine-tuning foundation models for generative AI applications.

In this post, you will learn about recent advancements in human feedback for generative AI available through SageMaker Ground Truth Plus. This includes new workflows and user interfaces (UIs) available for preparing demonstration datasets used in supervised fine-tuning, gathering high-quality human feedback to make preference datasets for aligning generative AI foundation models with human preferences, as well as customizing models to application builders’ requirements for style, substance, and voice.

Challenges of getting started with generative AI

Generative AI applications around the world incorporate both single-mode and multi-modal foundation models to solve for many different use cases. Common among them are chatbots, image generators, and video generators. Large language models (LLMs) are being used in chatbots for creative pursuits, academic and personal assistants, business intelligence tools, and productivity tools. You can use text-to-image models to generate abstract or realistic AI art and marketing assets. Text-to-video models are being used to generate videos for art projects, highly engaging advertisements, video game development, and even film development.

Two of the most important problems to solve for both model producers who create foundation models and application builders who use existing generative foundation models to build their own tools and applications are:

  • Fine-tuning these foundation models to be able to perform specific tasks
  • Aligning them with human preferences to ensure they output helpful, accurate, and harmless information

Foundation models are typically pre-trained on large corpora of unlabeled data, and therefore don’t perform well following natural language instructions. For an LLM, that means that they may be able to parse and generate language in general, but they may not be able to answer questions coherently or summarize text up to a user’s required quality. For example, when a user requests a summary of a text in a prompt, a model that hasn’t been fine-tuned how to summarize text may just recite the prompt text back to the user or respond with something irrelevant. If a user asks a question about a topic, the response from a model could just be a recitation of the question. For multi-modal models, such as text-to-image or text-to-video models, the models may output content unrelated to the prompt. For example, if a corporate graphic designer prompts a text-to-image model to create a new logo or an image for an advertisement, the model may not generate a relevant graphic related to the prompt if it has only a general concept of an image and elements of an image. In some cases, a model may output a harmful image or video, risking user confidence or company reputation.

Even if models are fine-tuned to perform specific tasks, they may not be aligned with human preferences with respect to the meaning, style, or substance of their output content. In an LLM, this could manifest itself as inaccurate or even harmful content being generated by the model. For example, a model that isn’t aligned with human preferences through fine-tuning may output dangerous, unethical, or even illegal instructions when prompted by a user. No care will have been taken to limit the content being generated by the model to ensure it is aligned with human preferences to be accurate, relevant, and useful. This misalignment can be a problem for companies that rely on generative AI models for their applications, such as chatbots and multimedia creation. For multi-modal models, this may take the form of toxic, dangerous, or abusive images or video being generated. This is a risk when prompts are input to the model without the intention of generating sensitive content, and also if the model producer or application builder had not intended to allow the model to generate that kind of content, but it was generated anyway.

To solve the issues of task-specific capability and aligning generative foundation models with human preferences, model producers and application builders must fine-tune the models with data using human-directed demonstrations and human feedback of model outputs.

Data and training types

There are several types of fine-tuning methods with different types of labeled data that are categorized as instruction tuning – or teaching a model how to follow instructions. Among them are supervised fine-tuning (SFT) using demonstration data, and reinforcement learning from human feedback (RLHF) using preference data.

Demonstration data for supervised fine-tuning

To fine-tune foundation models to perform specific tasks such as answering questions or summarizing text with high quality, the models undergo SFT with demonstration data. The purpose of demonstration data is to guide the model by providing it with labeled examples (demonstrations) of completed tasks being done by humans. For example, to teach an LLM how to answer questions, a human annotator will create a labeled dataset of human-generated question and answer pairs to demonstrate how a question and answer interaction works linguistically and what the content means semantically. This kind of SFT trains the model to recognize patterns of behavior demonstrated by the humans in the demonstration training data. Model producers need to do this type of fine-tuning to show that their models are capable of performing such tasks for downstream adopters. Application builders who use existing foundation models for their generative AI applications may need to fine-tune their models with demonstration data on these tasks with industry-specific or company-specific data to improve the relevancy and accuracy of their applications.

Preference data for instruction tuning such as RLHF

To further align foundation models with human preferences, model producers—and especially application builders—need to generate preference datasets to perform instruction tuning. Preference data in the context of instruction tuning is labeled data that captures human feedback with respect to a set of options output by a generative foundation model. It typically includes rating or ranking several inferences or pairwise comparing two inferences from a foundation model according to some specific attribute. For LLMs, these attributes may be helpfulness, accuracy, and harmlessness. For text-to-image models, it may be an aesthetic quality or text-image alignment. This preference data based on human feedback can then be used in various instruction tuning methods—including RLHF—in order to further fine-tune a model to align with human preferences.

Instruction tuning using preference data plays a crucial role in enhancing the personalization and effectiveness of foundation models. This is a key step in building custom applications on top of pre-trained foundation models and is a powerful method to ensure models are generating helpful, accurate, and harmless content. A common example of instruction tuning is to instruct a chatbot to generate three responses to a query, and have a human read and rank all three according to some specified dimension, such as toxicity, factual accuracy, or readability. For example, a company may use a chatbot for its marketing department and wants to make sure that content is aligned to its brand message, doesn’t exhibit biases, and is clearly readable. The company would prompt the chatbot during instruction tuning to produce three examples, and have their internal experts select the ones that most align to their goal. Over time, they build a dataset used to teach the model what style of content humans prefer through reinforcement learning. This enables the chatbot application to output more relevant, readable, and safe content.

SageMaker Ground Truth Plus

Ground Truth Plus helps you address both challenges—generating demonstration datasets with task-specific capabilities, as well as gathering preference datasets from human feedback to align models with human preferences. You can request projects for LLMs and multi-modal models such as text-to-image and text-to-video. For LLMs, key demonstration datasets include generating questions and answers (Q&A), text summarization, text generation, and text reworking for the purposes of content moderation, style change, or length change. Key LLM preference datasets include ranking and classifying text outputs. For multi-modal models, key task types include captioning images or videos as well as logging timestamps of events in videos. Therefore, Ground Truth Plus can help both model producers and application builders on their generative AI journey.

In this post, we dive deeper into the human annotator and feedback journey on four cases covering both demonstration data and preference data for both LLMs and multi-modal models: question and answer pair generation and text ranking for LLMs, as well as image captioning and video captioning for multi-modal models.

Large language models

In this section, we discuss question and answer pairs and text ranking for LLMs, along with customizations you may want for your use case.

Question and answer pairs

The following screenshot shows a labeling UI in which a human annotator will read a text passage and generate both questions and answers in the process of building a Q&A demonstration dataset.

Let’s walk through a tour of the UI in the annotator’s shoes. On the left side of the UI, the job requester’s specific instructions are presented to the annotator. In this case, the annotator is supposed to read the passage of text presented in the center of the UI and create questions and answers based on the text. On the right side, the questions and answers that the annotator has written are shown. The text passage as well as type, length, and number of questions and answers can all be customized by the job requester during the project setup with the Ground Truth Plus team. In this case, the annotator has created a question that requires understanding the whole text passage to answer and is marked with a References entire passage check box. The other two questions and answers are based on specific parts of the text passage, as shown by the annotator highlights with color-coded matching. Optionally, you may want to request that questions and answers are generated without a provided text passage, and provide other guidelines for human annotators—this is also supported by Ground Truth Plus.

After the questions and answers are submitted, they can flow to an optional quality control loop workflow where other human reviewers will confirm that customer-defined distribution and types of questions and answers have been created. If there is a mismatch between the customer requirements and what the human annotator has produced, the work will get funneled back to a human for rework before being exported as part of the dataset to deliver to the customer. When the dataset is delivered back to you, it’s ready to incorporate into the supervised fine-tuning workflow at your discretion.

Text ranking

The following screenshot shows a UI for ranking the outputs from an LLM based on a prompt.

You can simply write the instructions for the human reviewer, and bring prompts and pre-generated responses to the Ground Truth Plus project team to start the job. In this case, we have requested for a human reviewer to rank three responses per prompt from an LLM on the dimension of writing clarity (readability). Again, the left pane shows the instructions given to the reviewer by the job requester. In the center, the prompt is at the top of the page, and the three pre-generated responses are the main body for ease of use. On the right side of the UI, the human reviewer will rank them in order of most to least clear writing.

Customers wanting to generate this type of preference dataset include application builders interested in building human-like chatbots, and therefore want to customize the instructions for their own use. The length of the prompt, number of responses, and ranking dimension can all be customized. For example, you may want to rank five responses in order of most to least factually accurate, biased, or toxic, or even rank and classify multiple dimensions simultaneously. These customizations are supported in Ground Truth Plus.

Multi-modal models

In this section, we discuss image and video captioning for training multi-modal models such as text-to-image and text-to-video models, as well as customizations you may want to make for your particular use case.

Image captioning

The following screenshot shows a labeling UI for image captioning. You can request a project with image captioning to gather data to train a text-to-image model or an image-to-text model.

In this case, we have requested to train a text-to-image model and have set specific requirements on the caption in terms of length and detail. The UI is designed to walk the human annotators through the cognitive process of generating rich captions by providing a mental framework through assistive and descriptive tools. We have found that providing this mental framework for annotators results in more descriptive and accurate captions than simply providing an editable text box alone.

The first step in the framework is for the human annotator to identify key objects in the image. When the annotator chooses an object in the image, a color-coded dot appears on the object. In this case, the annotator has chosen both the dog and the cat, creating two editable fields on the right side of the UI wherein the annotator will enter the names of the objects—cat and dog—along with a detailed description of each object. Next, the annotator is guided to identify all the relationships between all the objects in the image. In this case, the cat is relaxing next to the dog. Next, the annotator is asked to identify specific attributes about the image, such as the setting, background, or environment. Finally, in the caption input text box, the annotator is instructed to combine all of what they wrote in the objects, relationships, and image setting fields into a complete single descriptive caption of the image.

Optionally, you can configure this image caption to be passed through a human-based quality check loop with specific instructions to ensure that the caption meets the requirements. If there is an issue identified, such as a missing key object, that caption can be sent back for a human to correct the issue before exporting as part of the training dataset.

Video captioning

The following screenshot shows a video captioning UI to generate rich video captions with timestamp tags. You can request a video caption project to gather data to build text-to-video or video-to-text models.

In this labeling UI, we have built a similar mental framework to ensure high-quality captions are written. The human annotator can control the video on the left side and create descriptions and timestamps for each activity shown in the video on the right side with the UI elements. Similar to the image captioning UI, there is also a place for the annotator to write a detailed description of the video setting, background, and environment. Finally, the annotator is instructed to combine all the elements into a coherent video caption.

Similar to the image caption case, the video captions may optionally flow through a human-based quality control workflow to determine if your requirements are met. If there is an issue with the video captions, it will be sent for rework by the human annotator workforce.

Conclusion

Ground Truth Plus can help you prepare high-quality datasets to fine-tune foundation models for generative AI tasks, from answering questions to generating images and videos. It also allows skilled human workforces to review model outputs to ensure that they are aligned with human preferences. Additionally, it enables application builders to customize models using their industry or company data to ensure their application represents their preferred voice and style. These are the first of many innovations in Ground Truth Plus, and more are in development. Stay tuned for future posts.

Interested in starting a project to build or improve your generative AI models and applications? Get started with Ground Truth Plus by connecting with our team today.


About the authors

Jesse Manders is a Senior Product Manager in the AWS AI/ML human in the loop services team. He works at the intersection of AI and human interaction with the goal of creating and improving AI/ML products and services to meet our needs. Previously, Jesse held leadership roles in engineering at Apple and Lumileds, and was a senior scientist in a Silicon Valley startup. He has an M.S. and Ph.D. from the University of Florida, and an MBA from the University of California, Berkeley, Haas School of Business.

Romi DattaRomi Datta is a Senior Manager of Product Management in the Amazon SageMaker team responsible for Human in the Loop services. He has been in AWS for over 4 years, holding several product management leadership roles in SageMaker, S3 and IoT. Prior to AWS he worked in various product management, engineering and operational leadership roles at IBM, Texas Instruments and Nvidia. He has an M.S. and Ph.D. in Electrical and Computer Engineering from the University of Texas at Austin, and an MBA from the University of Chicago Booth School of Business.

Jonathan Buck is a Software Engineer at Amazon Web Services working at the intersection of machine learning and distributed systems. His work involves productionizing machine learning models and developing novel software applications powered by machine learning to put the latest capabilities in the hands of customers.

Alex Williams is an applied scientist in the human-in-the-loop science team at AWS AI where he conducts interactive systems research at the intersection of human-computer interaction (HCI) and machine learning. Before joining Amazon, he was a professor in the Department of Electrical Engineering and Computer Science at the University of Tennessee where he co-directed the People, Agents, Interactions, and Systems (PAIRS) research laboratory. He has also held research positions at Microsoft Research, Mozilla Research, and the University of Oxford. He regularly publishes his work at premier publication venues for HCI, such as CHI, CSCW, and UIST. He holds a PhD from the University of Waterloo.

Sarah Gao is a Software Development Manager in Amazon SageMaker Human In the Loop (HIL) responsible for building the ML based labeling platform. Sarah has been in AWS for over 4 years, holding several software management leadership roles in EC2 security and SageMaker. Prior to AWS she worked in various engineering management roles at Oracle and Sun Microsystem.

Erran Li is the applied science manager at human-in-the-loop services, AWS AI, Amazon. His research interests are 3D deep learning, and vision and language representation learning. Previously he was a senior scientist at Alexa AI, the head of machine learning at Scale AI and the chief scientist at Pony.ai. Before that, he was with the perception team at Uber ATG and the machine learning platform team at Uber working on machine learning for autonomous driving, machine learning systems and strategic initiatives of AI. He started his career at Bell Labs and was adjunct professor at Columbia University. He co-taught tutorials at ICML’17 and ICCV’19, and co-organized several workshops at NeurIPS, ICML, CVPR, ICCV on machine learning for autonomous driving, 3D vision and robotics, machine learning systems and adversarial machine learning. He has a PhD in computer science at Cornell University. He is an ACM Fellow and IEEE Fellow.

Read More

Create high-quality images with Stable Diffusion models and deploy them cost-efficiently with Amazon SageMaker

Create high-quality images with Stable Diffusion models and deploy them cost-efficiently with Amazon SageMaker

Text-to-image generation is a task in which a machine learning (ML) model generates an image from a textual description. The goal is to generate an image that closely matches the description, capturing the details and nuances of the text. This task is challenging because it requires the model to understand the semantics and syntax of the text and to generate photorealistic images. There are many practical applications of text-to-image generation in AI photography, concept art, building architecture, fashion, video games, graphic design, and much more.

Stable Diffusion is a text-to-image model that empowers you to create high-quality images within seconds. When real-time interaction with this type of model is the goal, ensuring a smooth user experience depends on the use of accelerated hardware for inference, such as GPUs or AWS Inferentia2, Amazon’s own ML inference accelerator. The steep costs involved in using GPUs typically requires optimizing the utilization of the underlying compute, even more so when you need to deploy different architectures or personalized (fine-tuned) models. Amazon SageMaker multi-model endpoints (MMEs) help you address this problem by helping you scale thousands of models into one endpoint. By using a shared serving container, you can host multiple models in a cost-effective, scalable manner within the same endpoint, and even the same GPU.

In this post, you will learn about Stable Diffusion model architectures, different types of Stable Diffusion models, and techniques to enhance image quality. We also show you how to deploy Stable Diffusion models cost-effectively using SageMaker MMEs and NVIDIA Triton Inference Server.

Prompt: portrait of a cute bernese dog, art by elke Vogelsang, 8k ultra realistic, trending on artstation, 4 k Prompt: architecture design of living room, 8 k ultra-realistic, 4 k, hyperrealistic, focused, extreme details Prompt: New York skyline at night, 8k, long shot photography, unreal engine 5, cinematic, masterpiece

Stable Diffusion architecture

Stable Diffusion is a text-to-image open-source model that you can use to create images of different styles and content simply by providing a text prompt. In the context of text-to-image generation, a diffusion model is a generative model that you can use to generate high-quality images from textual descriptions. Diffusion models are a type of generative model that can capture the complex dependencies between the input and output modalities text and images.

The following diagram shows a high-level architecture of a Stable Diffusion model.

It consists of the following key elements:

  • Text encoder – CLIP is a transformers-based text encoder model that takes input prompt text and converts it into token embeddings that represent each word in the text. CLIP is trained on a dataset of images and their captions, a combination of image encoder and text encoder.
  • U-Net – A U-Net model takes token embeddings from CLIP along with an array of noisy inputs and produces a denoised output. This happens though a series of iterative steps, where each step processes an input latent tensor and produces a new latent space tensor that better represents the input text.
  • Auto encoder-decoder – This model creates the final images. It takes the final denoised latent output from the U-Net model and converts it into images that represents the text input.

Types of Stable Diffusion models

In this post, we explore the following pre-trained Stable Diffusion models by Stability AI from the Hugging Face model hub.

stable-diffusion-2-1-base

Use this model to generate images based on a text prompt. This is a base version of the model that was trained on LAION-5B. The model was trained on a subset of the large-scale dataset LAION-5B, and mainly with English captions. We use StableDiffusionPipeline from the diffusers library to generate images from text prompts. This model can create images of dimension 512 x 512. It uses the following parameters:

  • prompt – A prompt can be a text word, phrase, sentences, or paragraphs.
  • negative_prompt – You can also pass a negative prompt to exclude specified elements from the image generation process and to enhance the quality of the generated images.
  • guidance_scale – A higher guidance scale results in an image more closely related to the prompt, at the expense of image quality. If specified, it must be a float.

stable-diffusion-2-depth

This model is used to generate new images from existing ones while preserving the shape and depth of the objects in the original image. This stable-diffusion-2-depth model is fine-tuned from stable-diffusion-2-base, an extra input channel to process the (relative) depth prediction. We use StableDiffusionDepth2ImgPipeline from the diffusers library to load the pipeline and generate depth images. The following are the additional parameters specific to the depth model:

  • image – The initial image to condition the generation of new images.
  • num_inference_steps (optional) – The number of denoising steps. More denoising steps usually leads to a higher-quality image at the expense of slower inference. This parameter is modulated by strength.
  • strength (optional) – Conceptually, this indicates how much to transform the reference image. The value must be between 0–1. image is used as a starting point, adding more noise to it the larger the strength. The number of denoising steps depends on the amount of noise initially added. When strength is 1, the added noise will be maximum and the denoising process will run for the full number of iterations specified in num_inference_steps. A value of 1, therefore, essentially ignores image. For more details, refer to the following code.

stable-diffusion-2-inpainting

You can use this model for AI image restoration use cases. You can also use it to create novel designs and images from the prompts and additional arguments. This model is also derived from the base model and has a mask generation strategy. It specifies the mask of the original image to represent segments to be changed and segments to leave unchanged. We use StableDiffusionUpscalePipeline from the diffusers library to apply inpaint changes on original image. The following additional parameter is specific to the depth model:

  • mask_input – An image where the blacked-out portion remains unchanged during image generation and the white portion is replaced

stable-diffusion-x4-upscaler

This model is also derived from the base model, additionally trained on the 10M subset of LAION containing 2048 x 2048 images. As the name implies, it can be used to upscale lower-resolution images to higher resolutions

Use case overview

For this post, we deploy an AI image service with multiple capabilities, including generating novel images from text, changing the styles of existing images, removing unwanted objects from images, and upscaling low-resolution images to higher resolutions. Using several variations of Stable Diffusion models, you can address all of these use cases within a single SageMaker endpoint. This means that you’ll need to host large number of models in a performant, scalable, and cost-efficient way. In this post, we show how to deploy multiple Stable Diffusion models cost-effectively using SageMaker MMEs and NVIDIA Triton Inference Server. You will learn about the implementation details, optimization techniques, and best practices to work with text-to-image models.

The following table summarizes the Stable Diffusion models that we deploy to a SageMaker MME.

Model Name Model Size in GB
stabilityai/stable-diffusion-2-1-base 2.5
stabilityai/stable-diffusion-2-depth 2.7
stabilityai/stable-diffusion-2-inpainting 2.5
stabilityai/stable-diffusion-x4-upscaler 7

Solution overview

The following steps are involved in deploying Stable Diffusion models to SageMaker MMEs:

  1. Use the Hugging Face hub to download the Stable Diffusion models to a local directory. This will download scheduler, text_encoder, tokenizer, unet, and vae for each Stable Diffusion model into its corresponding local directory. We use the revision="fp16" version of the model.
  2. Set up the NVIDIA Triton model repository, model configurations, and model serving logic model.py. Triton uses these artifacts to serve predictions.
  3. Package the conda environment with additional dependencies and the package model repository to be deployed to the SageMaker MME.
  4. Package the model artifacts in an NVIDIA Triton-specific format and upload model.tar.gz to Amazon Simple Storage Service (Amazon S3). The model will be used for generating images.
  5. Configure a SageMaker model, endpoint configuration, and deploy the SageMaker MME.
  6. Run inference and send prompts to the SageMaker endpoint to generate images using the Stable Diffusion model. We specify the TargetModel variable and invoke different Stable Diffusion models to compare the results visually.

We have published the code to implement this solution architecture in the GitHub repo. Follow the README instructions to get started.

Serve models with an NVIDIA Triton Inference Server Python backend

We use a Triton Python backend to deploy the Stable Diffusion pipeline model to a SageMaker MME. The Python backend lets you serve models written in Python by Triton Inference Server. To use the Python backend, you need to create a Python file model.py that has the following structure: Every Python backend can implement four main functions in the TritonPythonModel class:

import triton_python_backend_utils as pb_utils
class TritonPythonModel:
"""Your Python model must use the same class name. Every Python model
that is created must have "TritonPythonModel" as the class name.
"""
def auto_complete_config(auto_complete_model_config):
def initialize(self, args):
def execute(self, requests):
def finalize(self):

Every Python backend can implement four main functions in the TritonPythonModel class: auto_complete_config, initialize, execute, and finalize.

initialize is called when the model is being loaded. Implementing initialize is optional. initialize allows you to do any necessary initializations before running inference. In the initialize function, we create a pipeline and load the pipelines using from_pretrained checkpoints. We configure schedulers from the pipeline scheduler config pipe.scheduler.config. Finally, we specify xformers optimizations to enable the xformer memory efficient parameter enable_xformers_memory_efficient_attention. We provide more details on xformers later in this post. You can refer to model.py of each model to understand the different pipeline details. This file can be found in the model repository.

The execute function is called whenever an inference request is made. Every Python model must implement the execute function. In the execute function, you are given a list of InferenceRequest objects. We pass the input text prompt to the pipeline to get an image from the model. Images are decoded and the generated image is returned from this function call.

We get the input tensor from the name defined in the model configuration config.pbtxt file. From the inference request, we get prompt, negative_prompt, and gen_args, and decode them. We pass all the arguments to the model pipeline object. Encode the image to return the generated image predictions. You can refer to the config.pbtxt file of each model to understand the different pipeline details. This file can be found in the model repository. Finally, we wrap the generated image in InferenceResponse and return the response.

Implementing finalize is optional. This function allows you to do any cleanups necessary before the model is unloaded from Triton Inference Server.

When working with the Python backend, it’s the user’s responsibility to ensure that the inputs are processed in a batched manner and that responses are sent back accordingly. To achieve this, we recommend following these steps:

  1. Loop through all requests in the requests object to form a batched_input.
  2. Run inference on the batched_input.
  3. Split the results into multiple InferenceResponse objects and concatenate them as the responses.

Refer to the Triton Python backend documentation or Host ML models on Amazon SageMaker using Triton: Python backend for more details.

NVIDIA Triton model repository and configuration

The model repository contains the model serving script, model artifacts and tokenizer artifacts, a packaged conda environment (with dependencies needed for inference), the Triton config file, and the Python script used for inference. The latter is mandatory when you use the Python backend, and you should use the Python file model.py. Let’s explore the configuration file of the inpaint Stable Diffusion model and understand the different options specified:

name: "sd_inpaint"
backend: "python"
max_batch_size: 8
input [
  {
    name: "prompt"
    data_type: TYPE_STRING
    dims: [
      -1
    ]
  },
  {
    name: "negative_prompt"
    data_type: TYPE_STRING
    dims: [
      -1
    ]
    optional: true
  },
  {
    name: "image"
    data_type: TYPE_STRING
    dims: [
      -1
    ]
  },
  {
    name: "mask_image"
    data_type: TYPE_STRING
    dims: [
      -1
    ]
  },
  {
    name: "gen_args"
    data_type: TYPE_STRING
    dims: [
      -1
    ]
    optional: true
  }
]
output [
  {
    name: "generated_image"
    data_type: TYPE_STRING    
    dims: [
      -1
    ]
  }
]
instance_group [
  {
    kind: KIND_GPU
  }
]
parameters: {
  key: "EXECUTION_ENV_PATH",
  value: {string_value: "/tmp/conda/sd_env.tar.gz"
  }
}

The following table explains the various parameters and values:

Key Details
name It’s not required to include the model configuration name property. In the event that the configuration doesn’t specify the model’s name, it’s presumed to be identical to the name of the model repository directory where the model is stored. However, if a name is provided, it must match the name of the model repository directory where the model is stored. sd_inpaint is the config property name.
backend This specifies the Triton framework to serve model predictions. This is a mandatory parameter. We specify python, because we’ll be using the Triton Python backend to host the Stable Diffusion models.
max_batch_size This indicates the maximum batch size that the model supports for the types of batching that can be exploited by Triton.
input→ prompt Text prompt of type string. Specify -1 to accept dynamic tensor shape.
input→ negative_prompt Negative text prompt of type string. Specify -1 to accept dynamic tensor shape.
input→ mask_image Base64 encoded mask image of type string. Specify -1 to accept dynamic tensor shape.
input→ image Base64 encoded image of type string. Specify -1 to accept dynamic tensor shape.
input→ gen_args JSON encoded additional arguments of type string. Specify -1 to accept dynamic tensor shape.
output→ generated_image Generated image of type string. Specify -1 to accept dynamic tensor shape.
instance_group You can use this this setting to place multiple run instances of a model on every GPU or on only certain GPUs. We specify KIND_GPU to make copies of the model on available GPUs.
parameters We set the conda environment path to EXECUTION_ENV_PATH.

For details about the model repository and configurations of other Stable Diffusion models, refer to the code in the GitHub repo. Each directory contains artifacts for the specific Stable Diffusion models.

Package a conda environment and extend the SageMaker Triton container

SageMaker NVIDIA Triton container images don’t contain libraries like transformer, accelerate, and diffusers to deploy and serve Stable Diffusion models. However, Triton allows you to bring additional dependencies using conda-pack. Let’s start by creating the conda environment with the necessary dependencies outlined in the environment.yml file and create a tar model artifact sd_env.tar.gz file containing the conda environment with dependencies installed in it. Run the following YML file to create a conda-pack artifact and copy the artifact to the local directory from where it will be uploaded to Amazon S3. Note that we will be uploading the conda artifacts as one of the models in the MME and invoking this model to set up the conda environment in the SageMaker hosting ML instance.

%%writefile environment.yml
name: mme_env
dependencies:
  - python=3.8
  - pip
  - pip:
      - numpy
      - torch --extra-index-url https://download.pytorch.org/whl/cu118
      - accelerate
      - transformers
      - diffusers
      - xformers
      - conda-pack

!conda env create -f environment.yml –force

Upload model artifacts to Amazon S3

SageMaker expects the .tar.gz file containing each Triton model repository to be hosted on the multi-model endpoint. Therefore, we create a tar artifact with content from the Triton model repository. We can use this S3 bucket to host thousands of model artifacts, and the SageMaker MME will use models from this location to dynamically load and serve a large number of models. We store all the Stable Diffusion models in this Amazon S3 location.

Deploy the SageMaker MME

In this section, we walk through the steps to deploy the SageMaker MME by defining container specification, SageMaker model and endpoint configurations.

Define the serving container

In the container definition, define the ModelDataUrl to specify the S3 directory that contains all the models that the SageMaker MME will use to load and serve predictions. Set Mode to MultiModel to indicate that SageMaker will create the endpoint with the MME container specifications. We set the container with an image that supports deploying MMEs with GPU. See Supported algorithms, frameworks, and instances for more details.

We see all three model artifacts in the following Amazon S3 ModelDataUrl location:

container = {"Image": mme_triton_image_uri, 
             "ModelDataUrl": model_data_url, 
             "Mode": "MultiModel"}

Create an MME object

We use the SageMaker Boto3 client to create the model using the create_model API. We pass the container definition to the create model API along with ModelName and ExecutionRoleArn:

create_model_response = sm_client.create_model(
    ModelName=sm_model_name, 
    ExecutionRoleArn=role, 
    PrimaryContainer=container
)

Define configurations for the MME

Create an MME configuration using the create_endpoint_config Boto3 API. Specify an accelerated GPU computing instance in InstanceType (we use the same instance type that we are using to host our SageMaker notebook). We recommend configuring your endpoints with at least two instances with real-life use cases. This allows SageMaker to provide a highly available set of predictions across multiple Availability Zones for the models.

create_endpoint_config_response = sm_client.create_endpoint_config(
    EndpointConfigName=endpoint_config_name,
    ProductionVariants=[
        {
            "InstanceType": instance_type,
            "InitialVariantWeight": 1,
            "InitialInstanceCount": 1,
            "ModelName": sm_model_name,
            "VariantName": "AllTraffic",
        }
    ],
)

Create an MME

Use the preceding endpoint configuration to create a new SageMaker endpoint and wait for the deployment to finish:

create_endpoint_response = sm_client.create_endpoint(
                EndpointName=endpoint_name, 
                EndpointConfigName=endpoint_config_name
)

The status will change to InService when the deployment is successful.

Generate images using different versions of Stable Diffusion models

Let’s start by invoking the base model with a prompt and getting the generated image. We pass the inputs to the base model with prompt, negative_prompt, and gen_args as a dictionary. We set the data type and shape of each input item in the dictionary and pass it as input to the model.

inputs = dict(prompt = "Infinity pool on top of a high rise overlooking Central Park",
             negative_prompt = "blur,low detail, low quality",
             gen_args = json.dumps(dict(num_inference_steps=50, guidance_scale=8))
)
payload = {
    "inputs":
        [{"name": name, "shape": [1,1], "datatype": "BYTES", "data": [data]} for name, data in inputs.items()]
}
response = runtime_sm_client.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="application/octet-stream",
        Body=json.dumps(payload),
        TargetModel="sd_base.tar.gz", 
    )
output = json.loads(response["Body"].read().decode("utf8"))["outputs"]
decode_image(output[0]["data"][0])

Prompt: Infinity pool on top of a high rise overlooking Central Park

Working with this image, we can modify it with the versatile Stable Diffusion depth model. For example, we can change the style of the image to an oil painting, or change the setting from Central Park to Yellowstone National Park simply by passing the original image along with a prompt describing the changes we would like to see.

We invoke the depth model by specifying sd_depth.tar.gz in the TargetModel of the invoke_endpoint function call. In the outputs, notice how the orientation of the original image is preserved, but for one example, the NYC buildings have been transformed into rock formations of the same shape.

inputs = dict(prompt = "highly detailed oil painting of an inifinity pool overlooking central park",
              image=image,
              gen_args = json.dumps(dict(num_inference_steps=50, strength=0.9))
              )
payload = {
    "inputs":
        [{"name": name, "shape": [1,1], "datatype": "BYTES", "data": [data]} for name, data in inputs.items()]
}
response = runtime_sm_client.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="application/octet-stream",
        Body=json.dumps(payload),
        TargetModel="sd_depth.tar.gz", 
    )
output = json.loads(response["Body"].read().decode("utf8"))["outputs"]
print("original image")
display(original_image)
print("generated image")
display(decode_image(output[0]["data"][0]))
Original image Oil painting Yellowstone Park

Another useful model is Stable Diffusion inpainting, which we can use to remove certain parts of the image. Let’s say you want to remove the tree in the following example image. We can do so by invoking the inpaint model sd_inpaint.tar.gz. To remove the tree, we need to pass a mask_image, which indicates which regions of the image should be retained and which should be filled in. The black pixel portion of the mask image indicates the regions that should remain unchanged, and the white pixels indicate what should be replaced.

image = encode_image(original_image).decode("utf8")
mask_image = encode_image(Image.open("sample_images/bertrand-gabioud-mask.png")).decode("utf8")
inputs = dict(prompt = "building, facade, paint, windows",
              image=image,
              mask_image=mask_image,
              negative_prompt = "tree, obstruction, sky, clouds",
              gen_args = json.dumps(dict(num_inference_steps=50, guidance_scale=10))
              )
payload = {
    "inputs":
        [{"name": name, "shape": [1,1], "datatype": "BYTES", "data": [data]} for name, data in inputs.items()]
}
response = runtime_sm_client.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="application/octet-stream",
        Body=json.dumps(payload),
        TargetModel="sd_inpaint.tar.gz", 
    )
output = json.loads(response["Body"].read().decode("utf8"))["outputs"]
decode_image(output[0]["data"][0])
Original image Mask image Inpaint image

In our final example, we downsize the original image that was generated earlier from its 512 x 512 resolution to 128 x 128. We then invoke the Stable Diffusion upscaler model to upscale the image back to 512 x 512. We use the same prompt to upscale the image as what we used to generate the initial image. While not necessary, providing a prompt that describes the image helps guide the upscaling process and should lead to better results.

low_res_image = output_image.resize((128, 128))
inputs = dict(prompt = "Infinity pool on top of a high rise overlooking Central Park",
             image=encode_image(low_res_image).decode("utf8")
)

payload = {
    "inputs":
        [{"name": name, "shape": [1,1], "datatype": "BYTES", "data": [data]} for name, data in inputs.items()]
}

response = runtime_sm_client.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="application/octet-stream",
        Body=json.dumps(payload),
        TargetModel="sd_upscale.tar.gz", 
    )
output = json.loads(response["Body"].read().decode("utf8"))["outputs"]
upscaled_image = decode_image(output[0]["data"][0])
Low-resolution image Upscaled image

Although the upscaled image is not as detailed as the original, it’s a marked improvement over the low-resolution one.

Optimize for memory and speed

The xformers library is a way to speed up image generation. This optimization is only available for NVIDIA GPUs. It speeds up image generation and lowers VRAM usage. We have used the xformers library for memory-efficient attention and speed. When the enable_xformers_memory_efficient_attention option is enabled, you should observe lower GPU memory usage and a potential speedup at inference time.

Clean Up

Follow the instruction in the clean up section of the notebook to delete the resource provisioned part of this blog to avoid unnecessary charges. Refer Amazon SageMaker Pricing for details the cost of the inference instances.

Conclusion

In this post, we discussed Stable Diffusion models and how you can deploy different versions of Stable Diffusion models cost-effectively using SageMaker multi-model endpoints. You can use this approach to build a creator image generation and editing tool. Check out the code samples in the GitHub repo to get started and let us know about the cool generative AI tool that you build.


About the Authors

Simon Zamarin is an AI/ML Solutions Architect whose main focus is helping customers extract value from their data assets. In his spare time, Simon enjoys spending time with family, reading sci-fi, and working on various DIY house projects.

Vikram Elango is a Sr. AI/ML Specialist Solutions Architect at AWS, based in Virginia, US. He is currently focused on generative AI, LLMs, prompt engineering, large model inference optimization, and scaling ML across enterprises. Vikram helps financial and insurance industry customers with design and architecture to build and deploy ML applications at scale. In his spare time, he enjoys traveling, hiking, cooking, and camping with his family.

João Moura is an AI/ML Specialist Solutions Architect at AWS, based in Spain. He helps customers with deep learning model training and inference optimization, and more broadly building large-scale ML platforms on AWS. He is also an active proponent of ML-specialized hardware and low-code ML solutions.

Saurabh Trikande is a Senior Product Manager for Amazon SageMaker Inference. He is passionate about working with customers and is motivated by the goal of democratizing machine learning. He focuses on core challenges related to deploying complex ML applications, multi-tenant ML models, cost optimizations, and making deployment of deep learning models more accessible. In his spare time, Saurabh enjoys hiking, learning about innovative technologies, following TechCrunch, and spending time with his family.

Read More