Amazon AWS – Page 190

Brain tumor segmentation at scale using AWS Inferentia

November 9, 2022

by Benedetto Carollo Amazon AWS

Medical imaging is an important tool for the diagnosis and localization of disease. Over the past decade, collections of medical images have grown rapidly, and open repositories such as The Cancer Imaging Archive and Imaging Data Commons have democratized access to this vast imaging data. Computational tools such as machine learning (ML) and artificial intelligence (AI) have emerged as an effective and viable option for rapid analysis of this imaging data. Many algorithms have been developed for different kinds of image analysis. These include classification, segmentation, and localization, to name a few. However, the development of the algorithm and training of the required ML model is only one piece of the larger ML/AI puzzle.

Cost-efficient and high-performance deployment of the model is also vital. Additionally, for a model to be of any use at scale, it must be deployed for inference in a reliable, scalable environment.

In this post, we discuss one possible approach of using native AWS technologies to deploy ML algorithms at scale for a medical imaging use case. We talk about segmenting a tumor from MRI brain scans and cover solution architecture, compute infrastructure, and results.

Solution overview

The solution proposed in this post is based around a trained U-net model using the popular Keras framework and with a sample dataset from the popular Kaggle competition platform.

The trained U-net model is then processed via the AWS Neuron SDK so that it can be optimized to target Amazon EC2 Inf1 instances, featuring AWS Inferentia, the first AWS ML accelerator optimized for inference.

The solution uses a managed elastic architecture with fast storage to ensure that high throughput is maintained across each layer of the solution. The following diagram describes the overall architecture.

The central idea around the proposed architecture spins around an elastic cluster of AWS Inferentia-powered containers running on Amazon Elastic Container Service (Amazon ECS) serving a U-net model optimized via the AWS Neuron SDK.

The inference nodes: AWS Inferentia

AWS offers various ways to deploy a deep learning model in the cloud. One option uses AWS Inferentia, which is a high-performance ML inference chip designed by AWS.

AWS Inferentia delivers up to 80% lower cost per inference and up to 2.3 times higher throughput than comparable current generation GPU-based Amazon Elastic Compute Cloud (Amazon EC2) instances. With Inf1 instances, you can run high-scale ML inference applications for a variety of medical imaging uses cases. The AWS Neuron SDK optimizes models for deployment onto AWS Inferentia-powered instances.

AWS Neuron consists of a compiler, runtime, and profiling tools that help optimize the performance of workloads for AWS Inferentia.

With AWS Neuron, developers can deploy neural network models using popular frameworks like PyTorch or TensorFlow on AWS Inferentia-based EC2 Inf1 instances.

The workflow to deploy a trained deep learning model into an AWS Inferentia accelerated inference node consists of the following steps:

Train a neural network model.
Process the trained model via the AWS Neuron compiler to generate an AWS Inferentia-optimized trained neural model.
Use the AWS Neuron runtime to load the AWS Inferentia-optimized model to EC2 Inf1 instances and run inference requests.

Inference at scale: An elastic architecture for AWS Inferentia

The architecture elasticity is determined by an AWS Lambda function and Amazon Simple Queue Service (Amazon SQS) queue that receives requests for segmentations initiated by simply uploading the volume that needs to be segmented into an Amazon Simple Storage Service (Amazon S3) bucket.

The AWS Inferentia ECS cluster gets fed from a highly performant Amazon FSx for Lustre file system, which accelerates compute workloads with shared storage that provides sub-millisecond latencies, up to hundreds of GBs/s of throughput, and millions of IOPS.

The following diagram outlines the architecture that enables the AWS Inferentia cluster to be elastic and scale dynamically according the number of inference requests submitted to the whole system.

In this architecture, an actor pushes an image volume to an S3 bucket. After the image volume is uploaded to THE S3 bucket, a Lambda function gets triggered using the built-in Amazon S3 event notification.

This function places the image volume S3 key into a request queue implemented via Amazon SQS. At the same time, it instructs the AWS Inferentia ECS cluster to start a new task to process the uploaded image volume.

To compliment this architecture, another Lambda function fetches the SQS queue depth and uses this value to modulate the size of the ECS cluster, adding or removing nodes according to the queue depth.

To ensure that the ECS cluster can be fed constantly with data, a highly performant FSx for Lustre file system is placed in front of the ECS cluster. Here, using the automated integration of FSx for Lustre with Amazon S3, the data uploaded into the S3 bucket landing zone is automatically made available in the FSx for Lustre file system and is ready to be consumed by the ECS cluster.

Inference results

The following sample images show the results of a brain tumor classification (multi-class segmentation) task done using the architecture described in this post.

The following figure shows the benchmark results of AWS Inferentia vs. NVIDIA Tesla V100-SXM2-16GB GPU.

Conclusion

Medical imaging is an important tool for the diagnosis and localization of disease. With the growing demand for diagnosis from various modalities, for example from emergency units, the need for automated tools to isolate and support radiologists and doctors in the diagnosis of various pathologies is becoming increasingly important.

In this post, we explored using EC2 Inf1 instance types with AWS Inferentia acceleration to build an elastic inference architecture that can support the ever-increasing inference demand while keeping costs under control.

To learn more about how AWS is accelerating innovation in healthcare, visit AWS for Health.

About the Author

Benedetto Carollo is the Senior Solution Architect for medical imaging and healthcare at Amazon Web Services in Europe, Middle East, and Africa. His work focuses on helping medical imaging and healthcare customers solve business problems by leveraging technology. Benedetto has over 15 years of experience of technology and medical imaging and has worked for companies like Canon Medical Research and Vital Images. Benedetto received his summa cum laude MSc in Software Engineering from the University of Palermo – Italy.

Serve multiple models with Amazon SageMaker and Triton Inference Server

November 9, 2022

by Zheng Zhang Amazon AWS

Amazon SageMaker is a fully managed service for data science and machine learning (ML) workflows. It helps data scientists and developers prepare, build, train, and deploy high-quality ML models quickly by bringing together a broad set of capabilities purpose-built for ML.

In 2021, AWS announced the integration of NVIDIA Triton Inference Server in SageMaker. You can use NVIDIA Triton Inference Server to serve models for inference in SageMaker. By using an NVIDIA Triton container image, you can easily serve ML models and benefit from the performance optimizations, dynamic batching, and multi-framework support provided by NVIDIA Triton. Triton helps maximize the utilization of GPU and CPU, further lowering the cost of inference.

In some scenarios, users want to deploy multiple models. For example, an application for revising English composition always includes several models, such as BERT for text classification and GECToR to grammar checking. A typical request may flow across multiple models, like data preprocessing, BERT, GECToR, and postprocessing, and they run serially as inference pipelines. If these models are hosted on different instances, the additional network latency between these instances increases the overall latency. For an application with uncertain traffic, deploying multiple models on different instances will inevitably lead to inefficient utilization of resources.

Consider another scenario, in which users develop multiple models with different versions, and each model uses a different training framework. A common practice is to use multiple containers, each of which deploys a model. But this will cause increased workload and costs for development, operation, and maintenance. In this post, we discuss how SageMaker and NVIDIA Triton Inference Server can solve this problem.

Solution overview

Let’s look at how SageMaker inference works. SageMaker invokes the hosting service by running a Docker container. The Docker container launches a RESTful inference server (such as Flask) to serve HTTP requests for inference. The inference server loads the model and listens to port 8080 providing external service. The client application sends a POST request to the SageMaker endpoint, SageMaker passes the request to the container, and returns the inference result from the container to the client.

In our architecture, we use NVIDIA Triton Inference Server, which provides concurrent runs of multiple models from different frameworks, and we use a Flask server to process client-side requests and dispatch these requests to the backend Triton server. While launching a Docker container, the Triton server and Flask server are started automatically. The Triton server loads multiple models and exposes ports 8000, 8001, and 8002 as gRPC, HTTP, and metrics server. The Flask server listens to 8080 ports and parses the original request and payload, and then invokes the local Triton backend via model name and version information. For the client side, it adds the model name and model version in the request in addition to the original payload, so that Flask is able to route the inference request to the correct model on Triton server.

The following diagram illustrates this process.

A complete API call from the client is as follows:

The client assembles the request and initiates the request to a SageMaker endpoint.
The Flask server receives and parses the request, and gets the model name, version, and payload.
The Flask server assembles the request again and routes to the corresponding endpoint of the Triton server according to the model name and version.
The Triton server runs an inference request and sends responses to the Flask server.
The Flask server receives the response message, assembles the message again, and returns it to the client.
The client receives and parses the response, and continues to subsequent business procedures.

In the following sections, we introduce the steps needed to prepare a model and build the TensorRT engine, prepare a Docker image, create a SageMaker endpoint, and verify the result.

Prepare models and build the engine

We demonstrate hosting three typical ML models in our solution: image classification (ResNet50), object detection (YOLOv5), and a natural language processing (NLP) model (BERT-base). NVIDIA Triton Inference Server supports multiple formats, including TensorFlow 1. x and 2. x, TensorFlow SavedModel, TensorFlow GraphDef, TensorRT, ONNX, OpenVINO, and PyTorch TorchScript.

The following table summarizes our model details.

Model Name	Model Size	Format
ResNet50	52M	Tensor RT
YOLOv5	38M	Tensor RT
BERT-base	133M	ONNX RT

NVIDIA provides detailed documentation describing how to generate the TensorRT engine. To achieve best performance, the TensorRT engine must be built over the device. This means the build time and runtime require the same computer capacity. For example, a TensorRT engine built on a g4dn instance can’t be deployed on a g5 instance.

You can generate your own TensorRT engines according to your needs. For test purposes, we prepared sample codes and deployable models with the TensorRT engine. The source code is also available on GitHub.

Next, we use an Amazon Elastic Compute Cloud (Amazon EC2) G4dn instance to generate the TensorRT engine with the following steps. We use YOLOv5 as an example.

Launch a G4dn.2xlarge EC2 instance with the Deep Learning AMI (Ubuntu 20.04) in the us-east-1 Region.
Open a terminal window and use the ssh command to connect to the instance.

Run the following commands one by one:

nvidia-docker run --gpus all -it --rm -v `pwd`/workspace:/workspace nvcr.io/nvidia/pytorch:22.04-py3
git clone -b v7.0.1 https://github.com/ultralytics/yolov5
pip install seaborn
pip install onnx-simplifier
cd yolov5
wget https://github.com/ultralytics/yolov5/releases/download/v6.2/yolov5s.pt
python export.py --weights yolov5s.pt --include onnx --simplify --imgsz 640 640 --device 0
onnxsim yolov5s.onnx yolov5s-sim.onnx

trtexec --onnx=yolov5s-sim.onnx --saveEngine=model.plan --explicitBatch --workspace=1024*12

Create a config.pbtxt file:

name: "yolov5s"
platform: "tensorrt_plan"
input: [
    {
        name: "images"
        data_type: TYPE_FP32
        format: FORMAT_NONE
        dims: [1, 3, 640, 640 ]
    }
]
output: [
    {
        name: "output",
        data_type: TYPE_FP32
        dims: [1,25200,85 ]
    }
]

Create the following file structure and put the generated files in the appropriate location:

mkdir yolov5s
mkdir -p yolov5s/1
cp config.pbtxt yolov5s
cp model.plan yolov5s/1

yolov5s
├── 1
│   └── model.plan
└── config.pbtxt

Test the TensorRT engine

Before we deploy to SageMaker, we start a Triton server to verify these three models are configured correctly. Use the following command to start a Triton server and load the models:

docker run --gpus all --rm -p8000:8000 -p8001:8001 -v<MODEL_ROOT_DIR>/model_repository:/models nvcr.io/nvidia/tritonserver:22.04-py3 tritonserver --model-repository=/models

If you receive the following prompt message, it means the Triton server is started correctly.

Enter nvidia-smi in the terminal to see GPU memory usage.

Client implementation for inference

The file structure is as follows:

serve – The wrapper that starts the inference server. The Python script starts the NGINX, Flask, and Triton server.
predictor.py – The Flask implementation for /ping and /invocations endpoints, and dispatching requests.
wsgi.py – The startup shell for the individual server workers.
base.py – The abstract method definition that each client requires to implement their inference method.
client folder – One folder per client:
- resnet
- bert_base
- yolov5
nginx.conf – The configuration for the NGINX primary server.

We define an abstract method to implement the inference interface, and each client implements this method:

from abc import ABC, abstractmethod
class Base(ABC):
    @abstractmethod
    def inference(self,img):
        pass

The Triton server exposes an HTTP endpoint on port 8000, a gRPC endpoint on port 8001, and a Prometheus metrics endpoint on port 8002. The following is a sample ResNet client with a gRPC call. You can implement the HTTP interface or gRPC interface according to your use case.

from base import Base
import numpy as np
import tritonclient.grpc as grpcclient
from PIL import Image
import cv2
class Resnet(Base):
    def image_transform_onnx(self, image, size: int) -> np.ndarray:
        '''Image transform helper for onnx runtime inference.'''
        img = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)  #OpenCV follows BGR convention and PIL follows RGB
        image = Image.fromarray(img)
        image = image.resize((size,size))

        # now our image is represented by 3 layers - Red, Green, Blue
        # each layer has a 224 x 224 values representing
        image = np.array(image)

        # dummy input for the model at export - torch.randn(1, 3, 224, 224)
        image = image.transpose(2,0,1).astype(np.float32)

        # our image is currently represented by values ranging between 0-255
        # we need to convert these values to 0.0-1.0 - those are the values that are expected by our model
        image /= 255
        image = image[None, ...]
        return image

    def inference(self, img):
        INPUT_SHAPE = (224, 224)

        TRITON_IP = "localhost"
        TRITON_PORT = 8001
        MODEL_NAME = "resnet"
        INPUTS = []
        OUTPUTS = []
        INPUT_LAYER_NAME = "input"
        OUTPUT_LAYER_NAME = "output"

        INPUTS.append(grpcclient.InferInput(INPUT_LAYER_NAME, [1, 3, INPUT_SHAPE[0], INPUT_SHAPE[1]], "FP32"))
        OUTPUTS.append(grpcclient.InferRequestedOutput(OUTPUT_LAYER_NAME, class_count=3))
        TRITON_CLIENT = grpcclient.InferenceServerClient(url=f"{TRITON_IP}:{TRITON_PORT}")

        INPUTS[0].set_data_from_numpy(self.image_transform_onnx(img, 224))

        results = TRITON_CLIENT.infer(model_name=MODEL_NAME, inputs=INPUTS, outputs=OUTPUTS, headers={})
        output = np.squeeze(results.as_numpy(OUTPUT_LAYER_NAME))
        #print(output)
        lista = [x.decode('utf-8') for x in output.tolist()]
        return lista

In this architecture, the NGINX, Flask, and Triton servers should be started at the beginning. Edit the serve file and add a line to start the Triton server.

Build a Docker image and push the image to Amazon ECR

The Docker file code looks as follows:

FROM nvcr.io/nvidia/tritonserver:22.04-py3 

# Add arguments to achieve the version, python and url
ARG PYTHON=python3
ARG PYTHON_PIP=python3-pip
ARG PIP=pip3

ENV LANG=C.UTF-8

RUN apt-key adv --keyserver keyserver.ubuntu.com --recv-keys A4B469963BF863CC 
 && apt-get update 
 && apt-get install -y nginx 
 && apt-get install -y libgl1-mesa-glx 
 && apt-get clean 
 && rm -rf /var/lib/apt/lists/*

RUN ${PIP} install -U --no-cache-dir 
tritonclient[all] 
torch 
torchvision 
pillow==9.1.1 
scipy==1.8.1 
transformers==4.20.1 
opencv-python==4.6.0.66 
flask 
gunicorn 
&& 

ldconfig && 
apt-get clean && 
apt-get autoremove && 
rm -rf /var/lib/apt/lists/* /tmp/* ~/* &&
mkdir -p /opt/program/models/ 
                                                                               
COPY sm /opt/program
COPY model /opt/program/models
WORKDIR /opt/program

ENTRYPOINT ["python3", "serve"]

Install and configure the aws-cli client with the following code:

sudo apt install awscli
sudo apt install git-all
aws configure
# # input AWS Access Key ID, AWS Secret Access Key, Default region name and Default output format

Run the following command to build the Docker image and push the image to Amazon Elastic Container Registry (Amazon ECR). Provide your Region and account ID.

aws ecr get-login-password --region <regionID> | docker login --username AWS --password-stdin <accountID>.dkr.ecr.<regionID>.amazonaws.com

docker build -t inference/mytriton .

docker tag inference/mytriton:latest <accountID>.dkr.ecr. <regionID>.amazonaws.com/inference/mytriton:latest

docker push <accountID>.dkr.ecr.<regionID>.amazonaws.com/inference/mytriton:latest

Create a SageMaker endpoint and test the endpoint

Now it’s time to verify the result. Launch a notebook instance with an ml.c5.xlarge instance from the SageMaker console, and create a notebook with the conda_python3 kernel. The following code snippet shows an example deployment of an inference endpoint. The source code is available in the GitHub repo.

role = get_execution_role()
sess = sage.Session()
account = sess.boto_session.client('sts').get_caller_identity()['Account']
region = sess.boto_session.region_name
image = '{}.dkr.ecr.{}.amazonaws.com/inference/mytriton:latest'.format(account, region)
model = sess.create_model(
        name="mytriton", role=role, container_defs=image)
endpoint_cfg=sess.create_endpoint_config(
        name="MYTRITONCFG",
        model_name="mytriton",
        initial_instance_count=1,
        instance_type="ml.g4dn.xlarge"
    )
endpoint=sess.create_endpoint(
        endpoint_name="MyTritonEndpoint", config_name="MYTRITONCFG")

Wait about 3 minutes until the inference server is started to verify the result.

The following code is the ResNet client request:

## resnet client
runtime = boto3.Session().client('runtime.sagemaker')
img = cv2.imread('dog.jpg')
string_img = base64.b64encode(cv2.imencode('.jpg', img)[1]).decode()
payload = json.dumps({"modelname": "resnet","payload": {"img":string_img}})

endpoint="MyTritonEndpoint"
response = runtime.invoke_endpoint(EndpointName=endpoint,ContentType="application/json",Body=payload,Accept='application/json')

out=response['Body'].read()
res=eval(out)
print(res)

We get the following response:

{'modelname': 'resnet', 'result': ['11.250000:250:250:malamute, malemute, Alaskan malamute', '9.914062:249:249:Eskimo dog, husky', '9.906250:248:248:Saint Bernard, St Bernard']}

The following code is the YOLOv5 client request:

# yolov5 client
payload = json.dumps({"modelname": "yolov5","payload": {"img":string_img}})

endpoint="MyTritonEndpoint"
response = runtime.invoke_endpoint(EndpointName=endpoint,ContentType="application/json",Body=payload,Accept='application/json')

out=response['Body'].read()
res=eval(out)
print(str(out))

We get the following response:

b'{"modelname": "yolov5", "result": [[16, 0.9168673157691956, 111.92530059814453, 258.53240966796875, 262.0159606933594, 533.407958984375, 768, 576], [2, 0.6941519379615784, 392.20037841796875, 573.6005249023438, 142.55178833007812, 224.56454467773438, 768, 576], [1, 0.5813695788383484, 131.8942413330078, 473.7420654296875, 179.61459350585938, 427.0913391113281, 768, 576], [7, 0.5316226482391357, 392.82275390625, 572.4647216796875, 144.685546875, 223.052734375, 768, 576]]}'

The following code is the BERT client request:

# bert client
text="The world has [MASK] people."

payload = json.dumps({"modelname": "bert_base","payload": {"text":text}})

endpoint="MyTritonEndpoint"
response = runtime.invoke_endpoint(EndpointName=endpoint,ContentType="application/json",Body=payload,Accept='application/json')

out=response['Body'].read()
res=eval(out)
print(res)

We get the following response:

{'modelname': 'bert_base', 'result': [{'token': 'The world has many people.', 'score': 0.16609132289886475}, {'token': 'The world has no people.', 'score': 0.07334889471530914}, {'token': 'The world has few people.', 'score': 0.0617995485663414}, {'token': 'The world has two people.', 'score': 0.03924647718667984}, {'token': 'The world has its people.', 'score': 0.023465465754270554}]}

Here we see our architecture is working as expected.

Note that hosting an endpoint will incur some costs. Therefore, delete the endpoint after you complete the test:

runtime.delete_endpoint(EndpointName=endpoint)

Cost estimation

To estimate cost, assume that you have three models, but not all of them are long-running. You’re using one endpoint for each model, and the online time of each endpoint is different. Using ml.g4dn.xlarge as an example, the total cost is about $971.52/month. The following table lists the details.

Model Name	Endpoint Running /Day	Instance Type	Cost/Month (us-east-1)
ResNet	24 hours	ml.g4dn.xlarge	0.736 * 24 * 30=$529.92
BERT	8 hours	ml.g4dn.xlarge	0.736 * 8 * 30=$176.64
YOLOv5	12 hours	ml.g4dn.xlarge	0.736 * 12 * 30=$264.96

The following table shows the cost for sharing one endpoint for three models using the preceding architecture. The total cost is about $676.8/month. From this result, we can conclude that you can save 30% in costs while also having 24/7 service from your endpoint.

Model Name	Endpoint Running /Day	Instance Type	Cost/Month (us-east-1)
ResNet, YOLOv5, BERT	24 hours	ml.g4dn.2xlarge	0.94 * 24 * 30 = $676.8

Summary

In this post, we introduced an improved architecture in which multiple models share one endpoint in SageMaker. Under some conditions, this solution can help you save costs and improve resource utilization. It is suitable for business scenarios with low concurrency and latency-insensitive requirements.

To learn more about SageMaker and AI/ML solutions, refer to Amazon SageMaker.

References

About the authors

Zheng Zhang is a Senior Specialist Solutions Architect in AWS, he focuses on helping customers accelerate model training, inference and deployment for machine learning solutions. He also has rich experience in large-scale distributed training, design AI/ML solutions.

Yinuo He is an AI/ML specialist in AWS. She has experiences in designing and developing machine learning based products to provide better user experiences. She now works to help customers succeed in their ML journey.

Model Hosting Patterns in SageMaker: Best practices in testing and updating models on SageMaker

November 9, 2022

by Deepali Rajale Amazon AWS

Amazon SageMaker is a fully managed service that provides developers and data scientists the ability to quickly build, train, and deploy machine learning (ML) models. With SageMaker, you can deploy your ML models on hosted endpoints and get inference results in real time. You can easily view the performance metrics for your endpoints in Amazon CloudWatch, automatically scale endpoints based on traffic, and update your models in production without losing any availability. SageMaker offers a wide variety of options to deploy ML models for inference in any of the following ways, depending on your use case:

For synchronous predictions that need to be served in the order of milliseconds, use SageMaker real-time inference
For workloads that have idle periods between traffic spurts and can tolerate cold starts, use Serverless Inference
For requests with large payload sizes up to 1 GB, long processing times (up to 15 minutes) and near-real-time latency requirements (seconds to minutes), use SageMaker Asynchronous Inference
To get predictions for an entire dataset, use SageMaker batch transform

Real-time inference is ideal for inference workloads where you have real time, interactive, low latency requirements. You deploy your model to SageMaker hosting services and get an endpoint that can be used for inference. These endpoints are backed by a fully managed infrastructure and support auto scaling. You can improve efficiency and cost by combining multiple models into a single endpoint using multi-model endpoints or multi-container endpoints.

There are certain use cases where you want to deploy multiple variants of the same model into production to gauge their performance, measure improvements, or run A/B tests. In such cases, SageMaker multi-variant endpoints are useful because they allow you to deploy multiple production variants of a model to the same SageMaker endpoint.

In this post, we discuss SageMaker multi-variant endpoints and best practices for optimization.

Comparing SageMaker real-time inference options

The following diagram gives a quick overview of the real-time inference options with SageMaker.

A single-model endpoint allows you to deploy one model on a container hosted on dedicated instances or serverless for low latency and high throughput. You can create a model and retrieve a SageMaker supported image for popular frameworks such as TensorFlow, PyTorch, Scikit-learn, and more. If you’re working with a custom framework for your model, you can also bring your own container that installs your dependencies.

SageMaker also supports more advanced options such as multi-model endpoints (MMEs) and multi-container endpoints (MCEs). MMEs are useful when you’re dealing with hundreds to tens of thousands of models and where you don’t need to deploy each model as an individual endpoint. MMEs allow you to host multiple models in a cost-effective, scalable manner within the same endpoint by using a shared serving container hosted on an instance. The underlying infrastructure (container and instance) remains the same, but the models are loaded and unloaded dynamically from a common S3 location, according to usage and the amount of memory available on the endpoint. Your application simply needs to include an API call with the target model to this endpoint to achieve low-latency, high-throughput inference. Instead of paying for a separate endpoint for every single model, you can host many models for the price of a single endpoint.

MCEs enable you to run up to 15 different ML containers on a single endpoint and invoke them independently. You can build these ML containers on different serving stacks (such as ML framework, model server, and algorithm), to be run on the same endpoint for cost savings. You can stitch the containers together in a serial inference pipeline or invoke the container independently. This can be ideal when you have several different ML models that have different traffic patterns and similar resource needs. Examples of when to utilize MCEs include, but are not limited to, the following:

Hosting models across different frameworks (such as TensorFlow, PyTorch, and Scikit-learn) that don’t have sufficient traffic to saturate the full capacity of an instance
Hosting models from the same framework with different ML algorithms (such as recommendations, forecasting, or classification) and handler functions
Comparisons of similar architectures running on different framework versions (such as TensorFlow 1.x vs. TensorFlow 2.x) for scenarios like A/B testing

SageMaker multi-variant endpoints (MVEs) allow you to test multiple models or model versions behind the same endpoint using production variants. Each production variant identifies a ML model and the resources deployed for hosting the model, such as the serving container and instance.

Overview of SageMaker multi-variant endpoints

In production ML workflows, data scientists and ML engineers refine models through a variety of methods, such as retraining based on data/model/concept drift, hyperparameter tuning, feature selection, framework selection, and more. Performing A/B testing between a new model and an old model with production traffic can be an effective final step in the validation process for a new model. In A/B testing, you test different variants of your models and compare how each variant performs relative to each other. You then choose the best-performing model to replace the previous model with a new version that delivers better performance than the previous version. By using production variants, you can test these ML models and different model versions behind the same endpoint. You can train these ML models using different datasets, different algorithms, and ML frameworks; deploy them to different instance types; or any combination of these options. The load balancer connected to the SageMaker endpoint provides the ability to distribute the invocation requests across multiple production variants. For example, you can distribute traffic between production variants by specifying the traffic distribution for each variant, or you can invoke a specific variant directly for each request.

You can also configure the auto scaling policy to automatically scale your variants in or out based on metrics such as requests per second.

The following diagram illustrates how MVE works in more detail.

Deploying an MVE is also very straightforward. All you need to do is define model objects with the image and model data using the create_model construct from the SageMaker Python SDK, and define the endpoint configurations using production_variant constructs to create production variants, each with its own different model and resource requirements (instance type and counts). This enables you to also test models on different instance types. To deploy, use the endpoint_from_production_variant construct to create the endpoint.

During endpoint creation, SageMaker provisions the hosting instance specified in the endpoint settings and downloads the model and inference container specified by the production variant to the hosting instance. If a successful response is returned after starting the container and performing a health check with a ping, a message indicating that the endpoint creation is complete is sent to the user. See the following code:

sm_session.create_model(
	name=model_name,
	role=role,
	container_defs={'Image':  image_uri, 'ModelDataUrl': model_url}
	)

sm_session.create_model(
	name=model_name2,
	role=role,
	container_defs={'Image':  image_uri, 'ModelDataUrl': model_url2 }
	)

variant1 = production_variant(
	model_name=model_name,
	instance_type="ml.c5.4xlarge",
	initial_instance_count=1,
	variant_name="Variant1",
	initial_weight=1
	)

variant2 = production_variant(
	model_name=model_name2,
	instance_type="ml.m5.4xlarge",
	initial_instance_count=1,
	variant_name="Variant2",
	initial_weight=1
	)

sm_session.endpoint_from_production_variants(
	name=endpoint_name,
	production_variants=[variant1,  variant2]
	)

In the preceding example, we created two variants, each with its own different model (these could also have different instance types and counts). We set an initial_weight of 1 for both variants: this means 50% of our requests go to Variant1, and the remaining 50% to Variant2. The sum of weights across both variants is 2 and each variant has weight assignment of 1. This implies each variant receives 50% of the total traffic.

Invoking the endpoint is similar to the common SageMaker construct invoke_endpoint; you can call the endpoint directly with the data as a payload:

sm_runtime.invoke_endpoint(
	EndpointName=endpoint_name,
	ContentType="text/csv",
	Body=payload
	)

SageMaker emits metrics such as Latency and Invocations for each variant in CloudWatch. For a complete list of metrics that SageMaker emits, see Monitor Amazon SageMaker with Amazon CloudWatch. You can query CloudWatch to get the number of invocations per variant, to see how invocations are split across variants by default.

To invoke a specific version of the model, specify a variant as the TargetVariant in the call to invoke_endpoint:

sm_runtime.invoke_endpoint(
	EndpointName=endpoint_name,
	ContentType="text/csv",
	Body=payload,
	TargetVariant="Variant1"
	)

You can evaluate each production variant’s performance by reviewing metrics such as accuracy, precision, recall, F1 score, and receiver operating characteristic/area under the curve for each variant using Amazon SageMaker Model Monitor. You can then decide to increase traffic to the best model by updating the weights assigned to each variant by calling UpdateEndpointWeightsAndCapacities. This changes the traffic distribution to your production variants without requiring updates to your endpoint. So instead of 50% of the traffic from the initial setup, we shift 75% of the traffic to Variant2 by assigning new weights to each variant using UpdateEndpointWeightsAndCapacities. See the following code:

sm.update_endpoint_weights_and_capacities(
	EndpointName=endpoint_name,
	DesiredWeightsAndCapacities=[
	{
		"DesiredWeight": 25,
		"VariantName": variant1["VariantName"]
	},
	{
		"DesiredWeight": 75,
		"VariantName": variant2["VariantName"]
	}
] )

When you’re satisfied with a variant’s performance, you can route 100% of the traffic to that variant. For example, you can set the weight for Variant1 to 0 and the weight for Variant2 to 1. SageMaker then sends 100% of all inference requests to Variant2. You can then safely update your endpoint and delete Variant1 from your endpoint. You can also continue testing new models in production by adding new variants to your endpoint. You can also configure these endpoints to scale automatically based on the traffic the endpoints receive.

Advantages of multi-variant endpoints

SageMaker MVEs allow you to do the following:

Deploy and test multiple variants of a model using the same SageMaker endpoint. This is useful for testing variations of a model in production. For example, suppose that you’ve deployed a model into production. You can test a variation of the model by directing a small amount of traffic, say 5%, to the new model.
Evaluate model performance in production without interrupting traffic by monitoring operational metrics for each variant in CloudWatch.
Update models in production without losing any availability. You can modify an endpoint without taking models that are already deployed into production out of service. For example, you can add new model variants, update the ML compute instance configurations of existing model variants, or change the distribution of traffic among model variants. For more information, see UpdateEndpoint and UpdateEndpointWeightsAndCapacities.

Challenges when using multi-variant endpoints

SageMaker MVEs come with the following challenges:

Load testing effort – You need to put in a fair amount of effort and resources for testing and model matrix comparisons for each variant. For an A/B test to be considered successful, you need to perform a statistical analysis of the metrics gathered from the test to determine if there is a statistically significant result. It could become challenging to minimize exploring poor performing variants. You could potentially use the multi-armed bandit optimization technique to avoid sending traffic to experiments that aren’t working and optimize performance as you test. For load testing, you could also explore Amazon SageMaker Inference Recommender to conduct extensive benchmarks based on production requirements for latency and throughput, custom traffic patterns, and instances (up to 10) that you select.
Tight coupling between model variant and endpoint – It could become tricky depending on model deployment frequency, because the endpoint may end up in updating status for each production variant being updated. SageMaker also supports deployment guardrails, which you can use to easily switch from the current model in production to a new one in a controlled way. This option introduces canary and linear traffic shifting modes so that you can have granular control over the shifting of traffic from your current model to the new one during the course of the update. With built-in safeguards such as auto-rollbacks, you can catch issues early and automatically take corrective action before they cause significant production impact.

Best practices for multi-variant endpoints

When hosting models using SageMaker MVEs, consider the following:

SageMaker is great for testing new models because you can easily deploy them into an A/B testing environment and you pay for only what you use. You’re charged per instance-hour consumed for each instance while the endpoint is running. When you’re done with your tests and not using the endpoint or the variants extensively anymore, you should delete it to save cost. You can always recreate it when you need it again because the model is stored in Amazon Simple Storage Service (Amazon S3).
You should use the most optimal instance type and size to deploy models. SageMaker currently offers ML compute instances on various instance families. An endpoint instance is running all the time (while the instance is in service). Therefore, selecting the right type of instance can have a significant impact on the total cost and performance of ML models. Load testing is the best practice to determine the appropriate instance type and fleet size, with or without auto scaling for your live endpoint to avoid over-provisioning and paying extra for capacity you don’t need.
You can monitor model performance and resource utilization in CloudWatch. You can configure a ProductionVariant to use Application Auto Scaling. To specify the metrics and target values for a scaling policy, you configure a target-tracking scaling policy. You can use either a predefined metric or a custom metric. For more information about policy configuration syntax, see TargetTrackingScalingPolicyConfiguration. For information about configuring automatic scaling, see Automatically Scale Amazon SageMaker Models. To quickly define a target-tracking scaling policy for a variant, you can choose a specific CloudWatch metric and set threshold values. For example, use metric SageMakerVariantInvocationsPerInstance to monitor the average number of times per minute that each instance for a variant is invoked, or use metric CPUUtilization to monitor the sum of work handled by a CPU. The following example uses the SageMakerVariantInvocationsPerInstance predefined metric to adjust the number of variant instances so that each instance has an InvocationsPerInstance metric of 70:

{
	"TargetValue": 70.0,
	"PredefinedMetricSpecification":
	{
		"PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
	}
}

Changing or deleting model artifacts or changing inference code after deploying a model produces unpredictable results. Before deploying models to production, it’s a good practice to check whether the model hosting in local mode is successful after sufficiently debugging the inference code snippets (like model_fn, input_fn, predict_fn, and output_fn) in the local development environment like a SageMaker notebook instance or local server. If you need to change or delete model artifacts or change inference code, modify the endpoint by providing a new endpoint configuration. After you provide the new endpoint configuration, you can change or delete the model artifacts corresponding to the old endpoint configuration.
You can use SageMaker batch transform to test production variants. Batch transform is ideal to get inferences from large datasets. You can create a separate transform job for each new model variant and use a validation dataset to test. For each transform job, specify a unique model name and location in Amazon S3 for the output file. To analyze the results, use inference pipeline logs and metrics.

Conclusion

SageMaker enables you to easily A/B test ML models in production by running multiple production variants on an endpoint. You can use SageMaker’s capabilities to test models that have been trained using different training datasets, hyperparameters, algorithms, or ML frameworks; how they perform on different instance types; or a combination of all of the above. You can provide the traffic distribution between the variants on an endpoint, and SageMaker splits the inference traffic to the variants based on the specified distribution. Alternately, if you want to test models for specific customer segments, you can specify the variant that should process an inference request by providing the TargetVariant header, and SageMaker will route the request to the variant that you specified. For more information about A/B testing, see Safely update models in production.

References

About the authors

Deepali Rajale is AI/ML Specialist Technical Account Manager at Amazon Web Services. She works with enterprise customers providing technical guidance on implementing machine learning solutions with best practices. In her spare time, she enjoys hiking, movies and hanging out with family and friends.

Dhawal Patel is a Principal Machine Learning Architect at AWS. He has worked with organizations ranging from large enterprises to mid-sized startups on problems related to distributed computing, and Artificial Intelligence. He focuses on Deep learning including NLP and Computer Vision domains. He helps customers achieve high performance model inference on SageMaker.

Saurabh Trikande is a Senior Product Manager for Amazon SageMaker Inference. He is passionate about working with customers and is motivated by the goal of democratizing machine learning. He focuses on core challenges related to deploying complex ML applications, multi-tenant ML models, cost optimizations, and making deployment of deep learning models more accessible. In his spare time, Saurabh enjoys hiking, learning about innovative technologies, following TechCrunch and spending time with his family.

“ID + Selfie” – Improving digital identity verification using AWS

November 8, 2022

by Mike Ames Amazon AWS

The COVID-19 global pandemic has accelerated the need to verify and onboard users online across several industries, such as financial services, insurance, and healthcare. When it comes to user experience it is crucial to provide a frictionless transaction while maintaining a high standard for identity verification. The question is, how do you verify real people in the digital world?

Amazon Rekognition provides pre-trained facial recognition and analysis capabilities for identity verification to your online applications, such as banking, benefits, ecommerce, and much more.

In this post, we present the “ID + Selfie” identity verification design pattern and sample code you can use to create your own identity verification REST endpoint. This is a common design pattern that you can incorporate into existing or new solutions that require face-based identity verification. The user presents a form of identification like a driver’s license or passport. The user than captures a real-time selfie with the application. We then compare the face from the document with the real-time selfie taken on their device.

The Amazon Rekognition CompareFaces API

At the core of the “ID + Selfie” design pattern is the comparison of the face in the selfie to the face on the identification document. For this, we use the Amazon Rekognition CompareFaces API. The API compares a face in the source input image with a face or faces detected in the target input image. In the following example, we compare a sample driver’s license (left) with a selfie (right).

Source	Target

The following is an example of the API code:

response = client.compare_faces(SimilarityThreshold=80,
                              SourceImage={'Bytes': s_bytes},
                              TargetImage={'Bytes': t_bytes})

for faceMatch in response['FaceMatches']:
    position = faceMatch['Face']['BoundingBox']
    similarity = str(faceMatch['Similarity'])

Several values are returned in the CompareFaces API response. We focus on the Similarity value returned in FaceMatches to validate the selfie matches the ID provided.

Understanding key tuning parameters

SimilarityThreshold is set to 80% by default and will only return results with a similarity score greater than or equal to 80%. Adjust the value by specifying the SimilarityThreshold parameter.

QualityFilter is an input parameter to filter out detected faces that don’t meet a required quality bar. The quality bar is based on a variety of common use cases. Use QualityFilter to set the quality bar by specifying LOW, MEDIUM, or HIGH. If you don’t want to filter poor quality faces, specify NONE. The default value is NONE.

Solution overview

You can create an “ID + Selfie” API for digital identity verification by deploying the following components:

A REST API with a POST method that allows us to send the selfie and identification payload and returns a response, in this case the similarity score
A function to receive the payload, convert the images to the proper format, and call the Amazon Rekognition compare_faces API.

We implement Amazon API Gateway for the REST API functionality and AWS Lambda for the function.

The following diagram illustrates the solution architecture and workflow.

The workflow contains the following steps:

The user uploads the required identification document and a selfie.
The client submits the identification document and selfie to the REST endpoint.
The REST endpoint returns a similarity score to the client.
An evaluation is done through business logic in your application. For example, if the similarity score is below 80%, it fails the digital identity check; otherwise it passes the digital identity check.
The client sends the status to the user.

Lambda code

The Lambda function converts the incoming payload from base64 to byte for each image and then sends the source (selfie) and target (identification) to the Amazon Rekognition compare_faces API and returns the similarity score received in the body of the API response. See the following code:

import boto3
import sys
import json
import base64


def lambda_handler(event, context):

  client = boto3.client('rekognition')

  payload_dict = json.loads(json.loads(event['body']))
  selfie = payload_dict['selfie']
  dl = payload_dict['dl']

  # convert text to base64
  s_base64 = dl.encode('utf-8')
  t_base64 = selfie.encode('utf-8')
  #convert base64 to bytes
  s_bytes = base64.b64decode(s_base64)
  t_bytes = base64.b64decode(t_base64)
  response = client.compare_faces(SimilarityThreshold=80,
                                SourceImage={'Bytes': s_bytes},
                                TargetImage={'Bytes': t_bytes})

  for faceMatch in response['FaceMatches']:
      position = faceMatch['Face']['BoundingBox']
      similarity = str(faceMatch['Similarity'])

  return {

    'statusCode': response['ResponseMetadata']['HTTPStatusCode'],

    'body': similarity

  }

Deploy the project

This project is available to deploy through AWS Samples with the AWS Cloud Development Kit (AWS CDK). You can clone the repository and use the following AWS CDK process to deploy to your AWS account.

Set up a user who has permissions to programmatically deploy the solution resources through the AWS CDK.
Set up the AWS Command Line Interface (AWS CLI). For instructions, refer to Configuring the AWS CLI.
If this is your first time using the AWS CDK, complete the prerequisites listed in Working with the AWS CDK in Python.
Clone the GitHub repository.
Create the virtual environment. The command you use depends on your OS:
1. If using Windows, run the following command in your terminal window from the source of the cloned repository:
```
..venvScriptsactivate
```
2. If using Mac or Linux, run the following command in your terminal window from the source of the cloned repository:
```
.venv/bin/activate
```
After activating the virtual environment, install the app’s standard dependencies:
```
python -m pip install -r requirements.txt
```
Now that the environment is set up and the requirements are met, we can issue the AWS CDK deployment command to deploy this project to AWS:
```
CDK Deploy
```

Make API calls

We need to send the payload in base64 format to the REST endpoint. We use a Python file to make the API call, which allows us to open the source and target files, convert them to base64, and send the payload to the API Gateway. This code is available in the repository.

Note that the SOURCE and TARGET file locations will be on your local file system, and the URL is the API Gateway URL generated during the creation of the project.

import requests
from base64 import b64encode
from json import dumps

TARGET = '<Selfie>.png'
SOURCE = <ID Image>.png'
URL = "https://<your api gateway>.execute-api.<region>.amazonaws.com/<deployment slot>/ips"
ENCODING = 'utf-8'
JSON_NAME = 'output.json'

# first: reading the binary stuff
with open(SOURCE, 'rb') as source_file:
    s_byte_content = source_file.read()
with open(TARGET, 'rb') as target_file:
    t_byte_content = target_file.read()

# second: base64 encode read data
s_base64_bytes = b64encode(s_byte_content)
t_base64_bytes = b64encode(t_byte_content)

# third: decode these bytes to text
s_base64_string = s_base64_bytes.decode(ENCODING)
t_base64_string = t_base64_bytes.decode(ENCODING)

# make raw data for json
raw_data = {
    " dl ": s_base64_string,
    " selfie ": t_base64_string
}

# now: encoding the data to json
json_data = dumps(raw_data, indent=2)

response = requests.post(url=URL, json=json_data)
response.raise_for_status()

print("Status Code", response.status_code)
print("Body ", response.json())

Clean up

We used the AWS CDK to build this project, so we can open our project locally and issue the following AWS CDK command to clean up the resources:

CDK Destroy

Conclusion

There you have it, the “ID + Selfie” design pattern with a simple API that you can integrate with your application to perform digital identity verification. In the next post in our series, we expand upon this pattern further by extracting text from the identification document and searching a collection of faces to prevent duplication.

To learn more, check out the Amazon Rekognition Developer Guide on detecting and analyzing faces.

About the Authors

Mike Ames is a Principal Applied AI/ML Solutions Architect with AWS. He helps companies use machine learning and AI services to combat fraud, waste, and abuse. In his spare time, you can find him mountain biking, kickboxing, or playing guitar in a 90s metal band.

Noah Donaldson is a Solutions Architect at AWS supporting federal financial organizations. He is excited about AI/ML technology that can reduce manual processes, improve customer experiences, and help solve interesting problems. Outside of work, he enjoys spending time on the ice with his son playing hockey, hunting with his oldest daughter, and shooting hoops with his youngest daughter.

Getting started with deploying real-time models on Amazon SageMaker

November 8, 2022

by Raghu Ramesha Amazon AWS

Amazon SageMaker is a fully-managed service that provides every developer and data scientist with the ability to quickly build, train, and deploy machine learning (ML) models at scale. ML is realized in inference. SageMaker offers four Inference options:

These four options can be broadly classified into Online and Batch inference options. In Online Inference, requests are expected to be processed as they arrive, and the consuming application expects a response after each request is processed. This can either happen synchronously (real-time Inference, serverless) or asynchronously (asynchronous inference). In a synchronous pattern, the consuming application is blocked and can’t proceed until it receives a response. These workloads tend to be real-time applications, such as online credit card fraud detection, where responses are expected in the order of milliseconds to seconds and request payloads are small (a few MB). In the asynchronous pattern, the application experience isn’t blocked (for example, submitting an insurance claim via a mobile app), and usually requires larger payload sizes and/or longer processing times. In Offline inference, an aggregation (batch) of inference requests are processed together, and responses are provided only after the entire batch has been processed. Usually, these workloads aren’t latency sensitive, involve large volumes (multiple GBs) of data, and are scheduled at a regular cadence (for example, run object detection on security camera footage at the end of the day or process payroll data at the end of the month).

At the bare bones, SageMaker Real-Time Inference consists of a model(s), the framework/container with which you’re working, and the infrastructure/instances that are backing your deployed endpoint. In this post, we’ll explore how you can create and invoke a Single Model Endpoint.

Choosing model deployment option

Choosing the right inference type can be difficult, and the following simple guide can help you. It’s not a strict flow chart, so if you find that another option works better for you, then feel free to use those. In particular, Real-Time Inference is a great option for hosting your models when you have low and consistent latency (in the order of milliseconds or seconds) and throughput sensitive workloads. You can control the instance type and count behind your endpoint while also configuring AutoScaling policy to handle traffic. There are two other SageMaker Inference options that you can also use to create an endpoint. Asynchronous Inference is when you have large payload sizes and near real-time latency bandwidth. This is a good option, especially for NLP and Computer Vision models that have longer preprocessing times. Serverless Inference is a great option when you have intermittent traffic and don’t want to manage infrastructure scaling. The recipe for creating an endpoint remains the same regardless of the Inference type that you choose. In this post, we’ll focus on creating a real-time instance-based endpoint, but you can easily adapt it to either of the other Inference Options based on your use-case. Lastly, Batch inference takes place offline, so you can provide a set of data that you want to get inference from and we’ll run it. This is similarly instance-based, so you can select the optimal instance for your workload. As there is no endpoint up and running, you only pay for the duration of the job. It is good for processing gigabytes of data and the job duration can be days. There are built-in features to make working with structured data easier and optimizations to automatically distribute structured data. Some example use cases are propensity modeling, predictive maintenance, and churn prediction. All of these can take place offline in bulk because it doesn’t have to react to a specific event.

Hosting a model on SageMaker Endpoints

At the crux, SageMaker Real-Time Endpoints consists of a model and the infrastructure with which you choose to back the Endpoint. SageMaker uses containers to host models, which means that you need a container that properly sets up the environment for the framework that you use for each model that you provide. For example, if you’re working with a Sklearn model, you must pass in your model scripts/data within a container that properly sets up Sklearn. Luckily, SageMaker provides managed images for popular frameworks, such as TensorFlow, PyTorch, Sklearn, and HuggingFace. You can retrieve and utilize these images using the high-level SageMaker Python SDK and inject your model scripts and data into these containers. In the case that SageMaker doesn’t have a supported container, you can also Build Your Own Container and push your own custom image, installing the dependencies that are necessary for your model.

SageMaker supports both trained and pre-trained models. In the previous paragraph when we’re talking about model scripts/data, we’re referencing this matter. You can either mount a script on your container, or if you have a pre-trained model artifact (for example, `model.joblib` for SKLearn), then you can provide this along with your image to SageMaker. To understand SageMaker Inference, there are three main entities that you’ll create in the process of Endpoint creation:

SageMaker Model Entity – Here you can pass in your trained model data/model script and your image that you’re working with, whether it’s owned by AWS or built by you.
Endpoint configuration creation – Here you define your infrastructure, meaning that you select the instance type, count, etc.
Endpoint creation – This is the REST Endpoint that hosts your model that you’re invoking to get a response. Let’s look at how you can utilize a managed SageMaker Image and your own custom-built image to deploy an endpoint.

Real-time endpoint requirements

Before creating an Endpoint, you must understand what type of Model you want to host. If it’s a Framework model, such as TensorFlow, PyTorch, or MXNet, then you can utilize one of the prebuilt Framework images.
If it’s a custom model, or you would like full flexibility in creating the container that SageMaker will run for inference, then you can build your own container.

SageMaker Endpoints are made up of a SageMaker Model and Endpoint Configuration.
If you’re using Boto3, then you would create both objects. Otherwise, if you’re utilizing the SageMaker Python SDK, then the Endpoint Configuration is created on your behalf when you use the .deploy(..) function.

SageMaker entities:

SageMaker Model:
- Contains the details of the inference image, location of the model artifacts in Amazon Simple Storage Service (Amazon S3), network configuration, and AWS Identity and Access Management (IAM) role to be used by the Endpoint.
  - SageMaker requires your model artifacts to be compressed in a .tar.gz file. SageMaker automatically extracts this .tar.gz file into the /opt/ml/model/ directory in your container. If you’re utilizing one of the framework containers, such as TensorFlow, PyTorch, or MXNet, then the container expects your TAR structure to be as follows:
    - TensorFlow
      model.tar.gz/ |--[model_version_number]/ |--variables |--saved_model.pb code/ |--inference.py |--requirements.txt
    - PyTorch
      model.tar.gz/ |- model.pth |- code/ |- inference.py |- requirements.txt # only for versions 1.3.1 and higher
    - MXNet
      model.tar.gz/ |- model-symbol.json |- model-shapes.json |- model-0000.params |- code/ |- inference.py |- requirements.txt # only for versions 1.6.0 and higher
    - Sklearn
      model.tar.gz/ |- model.joblib | code/ |- inference.py
  - When utilizing a Framework image, we can provide a custom entry-point script, where we can implement our own pre and post processing. In our case, the inference script is packaged in the model.tar.gz under the /code directory.

- Endpoint Configuration
  - Contains the infrastructure information required to deploy the SageMaker Model to the Endpoint.
  - For example, the SageMaker Model we created is specified here as well as the Instance Type and Initial Instance count.

Frameworks and BYOC

- Retrieving SageMaker images
  - This portion isn’t always necessary and abstracted out by the SageMaker Python SDK via estimators. However, if you would like to be able to retrieve a SageMaker managed image to extend on it, then you can get the images that are available via the SDK. The following is an example of retreiving a TF 2.2 image for inference.
```
import sagemaker
tf_image = sagemaker.image_uris.retreive(framework="tensorflow", region="us-east-1",
image_scope = "inference", version = "2.2", instance_type = "ml.c5.xlarge)
print(tf_image)
```
- Frameworks
  - In the case that you want to deploy a Framework Model, such as TensorFlow, PyTorch, or MXNet, then all you need is the model artifacts.
  - See the documentation for deploying models directly from model artifacts for TensorFlow, PyTorch, or MXNet.

- Choosing between 1P and BYOC
  - The SageMaker SDK also abstracts handling the image out, as you saw in the previous Frameworks section. It has ready-made estimators for Sklearn, TensorFlow, and PyTorch that automatically select the image for you based off of the version that you’ve selected. Then you can pass in a training/inference script through Script Mode into these estimators.
```
from sagemaker.pytorch import PyTorch #PyTorch Estimator within SageMaker SDK
estimator_parameters = {"entry_point": "train_deploy_pytorch_without_dependencies.py",
"source_dir": "pytorch_script","instance_type": train_instance_type,
"instance_count": 1,"hyperparameters": hyperparameters,
"role": role,"base_job_name": "pytorch-model","framework_version": "1.5",
"py_version": "py3",}

## Model Training
estimator = PyTorch(**estimator_parameters)estimator.fit(inputs)

## Deploy Trained model
pytorch_predictor = estimator.deploy(initial_instance_count=1, instance_type="ml.m5.xlarge", endpoint_name=pytorch_endpoint_name)
```
  - Not all packages and images are supported by SageMaker, and in this case you must bring your own container (BYOC). This means building a Dockerfile that will setup the proper environment for your model serving. An example of this is the Spacy NLP module, and there are no managed SageMaker containers for this framework. Therefore, you must provide a Dockerfile that installs Spacy. Within the container you also mount your model inference scripts. Let’s quickly discuss the components that you provide in a Bring Your Own Container format, as these stay consistent for most examples.
    - “nginx.conf“ is the configuration file for the nginx front-end. You won’t have to edit this file, unless you would like to tune these portions.
    - “predictor.py“ is the program that actually implements the Flask web server and model code for your application. You can have further Python files or functions in your container that you can call in this file.
    - “serve“ is the program started when the container is started for hosting. It simply launches the gunicorn server, which runs multiple instances of the Flask app defined in predictor.py. Like nginx.conf, you don’t have to edit this file unless there’s further tuning that you would like to perform.
    - “train“ is the program that is invoked when the container is run for training. You’ll modify this program to implement your training algorithm. If you’re bringing a pre-trained model or framework like Spacy, then you don’t need this file.
    - “wsgi.py“ is a small wrapper used to invoke the Flask app. You should be able to take this file as-is, unless you’ve changed the name of your predictor.py file. In that case, make sure that maps properly here.

- Custom inference script
  - SageMaker Framework containers give you the flexibility to handle pre/post processing of the request and model loading using a custom entry point script/inference.py.
  - See the documentation for creating a custom inference.py script for TensorFlow, PyTorch and MXNet.
- Custom container
  - Custom containers allow you full autonomy in using your own inference code to host your model. Your container is required to comply with certain API contracts. For example, it must respond to /invocations and /ping on port 8080. This Scikit Bring Your Own Container example showcases a common hosting pattern of NGINXGunicornFlask.

Different ways that you can interact with SageMaker Endpoints

There are many options for using SageMaker programmatically so that you can call your deployed models to get inference. The AWS Command Line Interface (AWS CLI), REST APIs, AWS CloudFormation, AWS Cloud Development Kit (AWS CDK), and AWS SDKs are common tools offered by AWS and widely supported by other AWS services. For SageMaker, we also have a SageMaker Python SDK. Now, let’s compare the different options to create, invoke, and manage SageMaker Endpoints.

In addition to SageMaker CLI, there are two ways programmatically that you can interact with Endpoints in SageMaker through the SDKs. Let’s look at some differences between SageMaker Python SDK and Boto3 Python SDK:

High-level SageMaker “Python” SDK – This SDK is an open-source library that provides higher level abstraction specifically meant for calling SageMaker APIs programmatically using Python. The Good part of this SDK is that it’s very easy to call sagemaker APIs, lots of heavy lifting is done already like calling the APIs synchronously/async mode (helps to avoid polling), simpler request/response schema, much less code, and much simpler code. SageMaker Python SDK provides several high-level abstractions for working with SageMaker. The package is meant to simplify different ML processes on SageMaker.
Low-level AWS SDK (Boto3 SDK) – This SDK works at the lower level by allowing the user to choose from the supported programming languages and call any AWS services programmatically. This isn’t just specific to SageMaker but can be used in general for all AWS services. The low-level AWS SDKs are available in various programming languages, such as .NET, Python, Java, Node.js, etc. One of the popular SDKs used is boto3 python SDK, which is popular in the data scientist community for ML. The good part of this SDK is that it’s very lightweight and available by default installed on AWS Lambda Runtime. Furthermore, you can use this SDK to interact with any AWS service outside of SageMaker.

Both of these SDKs can be utilized for the same tasks, but in some cases it’s more intuitive to use one more than the other. SageMaker Python SDK is recommended for easy testing while AWS SDK/Boto3 is recommended for production use cases for better control on performance. For example, SageMaker as a service provides pre-built and maintained images for popular frameworks, such as Sklearn, PyTorch, and TensorFlow. It can be particularly useful to use SageMaker SDK to retrieve deep learning images, train models using Estimators, and easily deploy the model using a simple API call. An example to showcase this in action can be found here.

On the other hand, sometimes you have pre-trained models or different frameworks that you may be using. This requires a greater deal of customization and the SageMaker SDK doesn’t always offer that. We have three important steps and corresponding boto3 API calls that we need to execute to deploy an endpoint: Model Creation, Endpoint Configuration Creation, and Endpoint Creation. The first two entities were abstracted out with the SageMaker SDK with our supported frameworks, but we see those details with the Boto3 SDK. An extensive example to showcase the steps involved in using a Boto3 SDK to create and manage an endpoint can be found here.

Considerations of SageMaker hosting

SageMaker Real-Time Inference has two main optimizations that you can consider: 1/ Performance optimization, and 2/ Cost optimization. Let’s first look at performance optimization, as when we’re dealing with latency sensitive workloads, every millisecond is crucial. There are different knobs that you can tune to optimize your latency and throughput. At the instance level, you can use Inference Recommender, our built-in load testing tool, to help you select the right instance type and count for your workload. Utilizing the proper combination of compute will help you with both performance and cost. You can also tune at the container and framework level.
Questions to ask yourself include:

What framework are you using?
Are there any environment variables that you can tune within your container?

An example of this is maximizing TensorFlow performance with SageMaker containers. Another example of container level optimizations is utilizing gRPC rather than REST behind your endpoint. Lastly, you can also optimize at the script level. Is your inference code taking extra time at certain blocks? Timing each and every line of your script will help you capture any bottlenecks within your code.

There are three ways to look at improving the utilization of your Real Time endpoint:

Multi-model Endpoints (MME)
- You can host thousands of models behind a single endpoint. This is perfect for use cases where you don’t need a dedicated endpoint for each one of your models. MME works best when the models are similarly sized and latency and belong to the same ML framework. These can be typically used when you don’t need to call the same model at all times. You can dynamically load the respective model onto the SageMaker Endpoint to serve your request. An example that showcases MME in action can be found here. If you want to learn more about the different caveats and best practices for hosting models on MME, then refer to the post here.
Multi-Container Endpoints (MCE)
- Instead of utilizing multiple endpoints to host multiple containers, you can look at hosting up to 15 containers on a single endpoint. Each one of these containers can be invoked directly. Therefore, you can look at hosting disparate models of different frameworks all on a single endpoint. This option is best when containers exhibit similar usage and performance characteristics. An example that showcases MCE can be found here. If you want to learn more about the different caveats and best practices for hosting models on MCE, then refer to the post here.
Serial Inference Pipeline (SIP)
- If you have a pipeline of steps in your inference logic, then you might utilize Serial Inference Pipeline (SIP). SIP lets you chain 2-15 containers together behind a single endpoint. SIP works well when you have preprocessing and post-processing steps. If you want to learn more about the design patterns for serial inference pipelines, then refer to the post here.

The second main optimization to keep in mind is cost. Real-Time Inference is one of three options within creating SageMaker Endpoints. SageMaker Endpoints are running at all times unless deleted. Therefore, you must look at improving the utilization of the endpoint which in turn provides a cost benefit.

SageMaker also offers Savings Plans. Savings Plans can reduce your costs by up to 64%. This is a 1 or 3-year term commitment to a consistent amount of usage ($/hour). See this link for more information. And see this link for best to optimize costs for Inference on Amazon SageMaker.

Conclusion

In this post, we showed you some of the best practices to choose between different model hosting options on SageMaker. We discussed the SageMaker Endpoint requirements, and also contrasted Framework and BYOC requirements and functionality. Furthermore, we talked about the different ways that you can leverage Real-Time Endpoints to host your ML models in production. in a cost-effective way, and have high performance.

See the corresponding GitHub repository and try out the examples.

About the authors

Raghu Ramesha is an ML Solutions Architect with the Amazon SageMaker Service team. He focuses on helping customers build, deploy, and migrate ML production workloads to SageMaker at scale. He specializes in machine learning, AI, and computer vision domains, and holds a master’s degree in Computer Science from UT Dallas. In his free time, he enjoys traveling and photography.

Ram Vegiraju is a ML Architect with the SageMaker Service team. He focuses on helping customers build and optimize their AI/ML solutions on Amazon SageMaker. In his spare time, he loves traveling and writing.

Marc Karp is a ML Architect with the SageMaker Service team. He focuses on helping customers design, deploy and manage ML workloads at scale. In his spare time, he enjoys traveling and exploring new places.

Dhawal Patel is a Principal Machine Learning Architect at AWS. He has worked with organizations ranging from large enterprises to mid-sized startups on problems related to distributed computing and artificial intelligence. He focuses on deep learning, including NLP and computer vision domains. He helps customers achieve high-performance model inference on Amazon SageMaker.

Predict lung cancer survival status using multimodal data on Amazon SageMaker JumpStart

November 8, 2022

by Michael Hsieh Amazon AWS

Non-small cell lung cancer (NSCLC) is the most common type of lung cancer, and is composed of tumors with significant molecular heterogeneity resulting from differences in intrinsic oncogenic signaling pathways [1]. Enabling precision medicine, anticipating patient preferences, detecting disease, and improving care quality for NSCLC patients are important topics among healthcare and life sciences (HCLS) communities.

Applying machine learning (ML) to diverse health datasets, known as multimodal machine learning (multimodal ML), is an active area of research and development. Analyzing linked patient-level data from diverse data modalities, such as genomics and medical imaging, promises to accelerate improvements in patient care. However, performing analyses of multiple modalities at scale has been challenging in on-premises or cloud environments due to the distinct infrastructure requirements of each modality. With Amazon SageMaker, you can create purpose-built pipelines and scale them to meet your needs easily, paying only for what you use.

We’re announcing a new solution on lung cancer survival prediction in Amazon SageMaker JumpStart, based on the blog posts Building Scalable Machine Learning Pipelines for Multimodal Health Data on AWS and Training Machine Learning Models on Multimodal Health Data with Amazon SageMaker. JumpStart provides pre-trained, open-source models and pre-built solution templates for a wide range of problem types to help data scientists and ML practitioners get started on training and deploying ML models quickly. This is the first HCLS solution offered by JumpStart.

The solution builds a multimodal ML model for predicting survival outcome of patients diagnosed with NSCLC. The multimodal model is trained on data derived from different modalities or domains, including medical imaging, genomic, and clinical data. Multimodal ML has been adopted in HCLS for personalized treatment, clinical decision support, and drug response prediction. In this post, we demonstrate how you can create a scalable, purpose-built ML pipeline easily with one-click deployment from JumpStart.

What’s in the dataset

Non–small cell lung cancer is the leading cause for cancer death [2]. However, no two cancer diagnoses are alike, because tumors contain significant molecular heterogeneity resulting from differences in intrinsic oncogenic signaling pathways [1]. In addition, different clinical information collected from patients may impact prognosis and treatment options. Therefore, enabling precision medicine, anticipating patient preferences, detecting disease, and improving care quality for NSCLC patients is of utmost importance in the oncology and HCLS communities.

The Non-Small Cell Lung Cancer (NSCLC) Radiogenomic dataset [3] comprises medical imaging, clinical, and genomic data collected from a cohort of early-stage NSCLC patients referred for surgical treatment. It includes Computed Tomography (CT), Positron Emission Tomography (PET)/CT images, semantic annotations of the tumors as observed on the medical images using a controlled vocabulary, segmentation maps of tumors in the CT scans, and quantitative values obtained from the PET/CT scans. The genomic data contains gene mutation and RNA sequencing data from samples of surgically excised tumor tissue. It also consists of clinical data reflective of electronic health records (EHR) such as age, gender, weight, ethnicity, smoking status, Tumor Node Metastasis (TNM) stage, histopathological grade, and survival outcome. Each data modality presents a different view of a patient.

Medical imaging data

Medical imaging biomarkers of cancer promise improvements in patient care through advances in precision medicine. Compared to genomic biomarkers, imaging biomarkers provide the advantages of being non-invasive, and characterizing a heterogeneous tumor in its entirety, as opposed to limited tissue available via biopsy [3]. In this dataset, CT and PET/CT imaging sequences were acquired for patients prior to surgical procedures. Segmentation of tumor regions were annotated by two expert thoracic radiologists. The following image is an example overlay of a tumor segmentation onto a lung CT scan (case R01-093 in the dataset).

Genomic data

Samples of surgically excised tumor tissues were analyzed with RNA sequencing technology. The dataset file that is available from the source was preprocessed using open-source tools, including STAR v.2.3 for alignment and Cufflinks v.2.0.2 for expression calls [4]. The original dataset (GSE103584_R01_NSCLC_RNAseq.txt.gz) can be found on the NCBI website. Although the original data contains more than 22,000 genes, for the purpose of demonstration, we used 21 genes from 10 highly co-expressed gene clusters (metagenes) that were identified, validated in publicly available gene-expression cohorts, and correlated with prognosis [4].

The following table shows the tabular representation of the gene expression data. Each row corresponds to a patient, and the columns represent a subset of genes selected for demonstration. The value denotes the expression level of a gene for a patient. A higher value means the corresponding gene is highly expressed in that specific tumor sample.

Case_ID	LRIG1	HPGD	GDF15	CDH2	POSTN	……
R01-024	26.7037	3.12635	13.0269	0	36.4332	……
R01-153	15.2133	5.0693	0.90866	0	32.8595	……
R01-031	5.54082	1.23083	29.8832	1.13549	34.8544	……
R01-032	12.8391	7.21931	12.0701	0	7.77297	……
R01-033	33.7975	3.19058	5.43418	0	9.84029	……

Clinical data

Clinical data was collected from medical records. It included demographics, smoking history, survival, recurrence status, histology, histopathological grading, Pathological TNM staging, and survival outcome of patients. The data is stored in CSV format, as shown in the following table.

Case ID	Survival Status	Age at Histological Diagnosis	Weight (lbs)	Smoking status	Pack Years	Quit Smoking Year	Chemotherapy	Adjuvant Treatment	EGFR mutation status	……
R01-005	Dead	84	145	Former	20	1951	No	No	Wildtype	……
R01-006	Alive	62	Not Collected	Former	Not Collected	nan	No	No	Wildtype	……
R01-007	Dead	68	Not Collected	Former	15	1968	Yes	Yes	Wildtype	……
R01-008	Alive	73	102	Nonsmoker	nan	nan	No	No	Wildtype	……
R01-009	Dead	59	133	Current	100	nan	No	……	……	……

Solution overview

Working with multimodal healthcare data for ML purposes often requires dedicated data processing pipelines and compute resources to extract relevant biomarkers and features. Collecting, storing, and managing extracted features can be challenging. In this JumpStart solution, we show you how to process each modality in separate notebooks, use Amazon SageMaker Processing for compute intensive 3D image construction, use Amazon SageMaker Feature Store to centrally store extracted features for ML modeling, run SageMaker experiments to enhance ML lineage, implement the SageMaker built-in algorithm to train an ML model without sophisticated coding, and deploy a model for real-time prediction with SageMaker deployment.

The architecture for the solution is illustrated in the following diagram.

The key services involved are as follows:

Amazon SageMaker JumpStart – JumpStart provides pre-trained, open-source models for a wide range of problem types to help you get started. JumpStart also provides solution templates that set up infrastructure for common use cases, and notebooks for ML with SageMaker. This service is where we launch our solution.
Amazon Simple Storage Service – Amazon S3 is an object storage service that offers scalability, data availability, security, and performance. Amazon S3 allows you to store data to be used by SageMaker.
Amazon SageMaker Studio notebooks – You can launch these collaborative notebooks quickly because you don’t need to set up compute instances and file storage beforehand. This is the environment you can build in.
SageMaker Processing jobs – These jobs enable you to analyze data and evaluate ML models. This managed experience helps you run your data processing workloads, such as feature engineering, data validation, and model interpretation. This service allows you to process the data that is stored in Amazon S3.
Amazon SageMaker Feature Store – You can create, share, and manage features for ML development. This is where you store processed features to use for training.
Amazon SageMaker Experiments – You can organize, track, compare, and evaluate ML experiments.
Built-in XGBoost algorithm (eXtreme gradient boosting) – This is a popular and efficient open-source implementation of the gradient boosting algorithm that is used for training the.
SageMaker training jobs – You create a managed training job in SageMaker using four attributes: the URL of the S3 bucket you’ve stored the training data, compute resources, the URL of the S3 bucket where you want to store the output of the job, and the Amazon Elastic Container Registry (Amazon ECR) path where the training code is stored.
SageMaker real-time endpoints – These endpoints are ideal for inference workloads where you have real-time, interactive, low-latency requirements. You deploy the model to SageMaker hosting services and get an endpoint that can be used for inference.

With this JumpStart solution, you can easily spin up the solution in Amazon SageMaker Studio, and follow the post to explore, train, and deploy a lung cancer survival status prediction model for learning purposes.

Prerequisites

To use the Lung Cancer Survival Prediction solution provided by JumpStart, you need to have a Studio domain. A SageMaker project and JumpStart must also be enabled.

If you don’t have a Studio domain, refer to Onboard to Amazon SageMaker Domain Using Quick Setup.

If you have a SageMaker domain, make sure that the project and JumpStart features have been enabled by following these steps:

On the SageMaker console, choose the gear icon next to Domain.
Choose Next.
In the SageMaker Projects and JumpStart section, select Enable Amazon SageMaker project templates and Amazon SageMaker JumpStart for this account and Enable Amazon SageMaker project templates and Amazon SageMaker JumpStart for Studio users.

We have completed the prerequisites needed to use JumpStart.

Deploy the solution and example demo notebook

To start using the Lung Cancer Survival Prediction solution, complete the following steps:

Open Studio.
Choose Go to SageMaker JumpStart under Jumpstart models, algorithms, and solutions.
To find the solution within JumpStart, choose Explore All Solutions.
Search for and choose Lung Cancer Survival Prediction.
Under Launch Solution, choose Launch.
Note that you can specify custom execution roles to be used throughout this solution, otherwise execution roles are created for you.
It should take a few moments for the solution to launch. You can follow along as a number of resources are launched.

Wait until the endpoint and model statuses show as Complete. You may need to refresh the page.

This solution generates a model, endpoint configuration, endpoint, and five notebooks.
To navigate to the first notebook, under Open solution in Studio, choose Open Notebook.
To navigate to other notebooks, select the folder icon and open the S3Downloads folder.
Open the jumpstart-prod-lcsp_****** folder.

Form here, you can access the following notebooks:

0_demo.ipynb – Demonstrates how to send inference requests to a pre-deployed endpoint and receive a model response.
1_preprocess_genomic_data.ipynb – Showcases how to read and process genomic data (in tabular format).
2_preprocess_clinical_data.ipynb – Demonstrates how to read and process health insurance claims data (in tabular format).
3_preprocess_imaging_data.ipynb – Showcases how to read and process medical imagining data in DICOM file format and convert to NIfTI neuroimaging format. Therefore, it takes medical imagining data in volumetric format. Note that in notebooks 1, 2, and 3, we store the output of each processing job in Feature Store.
4_train_test_model.ipynb – Demonstrates how to access multimodal features from Feature Store, train an XGBoost model, and predict the survival status of patients diagnosed with non-small cell lung cancer.

After the solution is deployed in Studio, you can start building a lung cancer survival status prediction using multimodal health data. To start, let’s look at the 0_demo.ipynb notebook, which demonstrates how to send inference requests to a pre-deployed endpoint, and get the model response (survived vs. dead).

In this notebook, you can see a preview of the datasets (which we cover in a later section), and the steps needed to make predictions with an endpoint that has already been deployed. A predictor is instantiated to begin making real-time predictions against a SageMaker endpoint. We can use the predictor’s predict function to invoke the endpoint with test data.

The endpoint returns the predicted survival status, which is used to assess model performance, as shown in the following screenshot.

Feature engineering

As described earlier, genomic, clinical, and medical imaging data is available in the NSCLC dataset for us to create a comprehensive view of a patient. To process and compute the features that later help us build a ML model, let’s start with the genomic data pipeline in 1_preprocess_genomic_data.ipynb and step through 2_preprocess_clinical_data.ipynb and 3_preprocess_imaging_data.ipynb for the clinical data pipeline and medical imaging pipeline, respectively.

Genomic

For genomic data, we read the RNA sequence data from the JumpStart solution bucket into the notebook 1_preprocess_genomic_data.ipynb:

file_name = "GSE103584_R01_NSCLC_RNAseq.txt"
input_data_bucket = f"s3://{SOLUTION_BUCKET}-{REGION}/{SOLUTION_NAME}/data"
input_data = f"{input_data_bucket}/{file_name}"
!aws s3 cp $input_data .

We then keep 21 genes from 10 highly co-expressed gene clusters (metagenes) that were identified, validated in publicly available gene-expression cohorts, and correlated with prognosis [4]:

selected_columns = ['Case_ID','LRIG1', 'HPGD', 'GDF15', 'CDH2', 'POSTN', 'VCAN', 'PDGFRA',
                    'VCAM1', 'CD44', 'CD48', 'CD4', 'LYL1', 'SPI1', 'CD37', 'VIM', 'LMO2',
                    'EGR2', 'BGN', 'COL4A1', 'COL5A1', 'COL5A2']
gen_data_t = gen_data_t[selected_columns]
data_gen = gen_data_t.fillna(0)

With the genomic features ready for analysis, we ingest the features into Feature Store as a feature group. Having the features in the Feature Store allows us to repeatedly source the important genomic features in the downstream analysis with governance.

genomic_feature_group = FeatureGroup(name=genomic_feature_group_name, 
                                     sagemaker_session=feature_store_session)
# Load feature definitions to the feature group. SageMaker FeatureStore Python SDK will auto-detect the data schema based on input data.
genomic_feature_group.load_feature_definitions(data_frame=data_gen)
genomic_feature_group.create(s3_uri=f"s3://{BUCKET}/{prefix}",
                             record_identifier_name=record_identifier_feature_name,
                             event_time_feature_name=event_time_feature_name, 
                             role_arn=sagemaker.get_execution_role(),
                             enable_online_store=True)
genomic_feature_group.ingest(data_frame=data_gen, max_workers=3, wait=True)

To see the recently ingested features in the Feature Store UI, under SageMaker resources, choose Feature Store.

From here, you can find the most recently ingested features into Feature Store. The feature group name should be sagemaker-soln-lcsp-js-******-genomic-feature-group.

Clinical

For clinical data, we perform data cleaning and processing on the data hosted on the JumpStart solution bucket and ingest into the Feature Store, as shown in 2_preprocess_clinical_data.ipynb:

file_name = "NSCLCR01Radiogenomic_DATA_LABELS_2018-05-22_1500-shifted.csv"
input_data_bucket = f"s3://{SOLUTION_BUCKET}-{REGION}/{SOLUTION_NAME}/data"
input_data = f"{input_data_bucket}/{file_name}"
!aws s3 cp $input_data .

We run one-hot encoding to convert categorical attributes to numerical attributes. We then remove columns that don’t provide useful information (such as dates), and remove rows with missing values. Afterwards, we create another feature group in Feature Store to store the clinical features, similarly to how we did in the genomic notebook.

Medical imaging

For medical imaging data, we create patient-level 3-dimensional radiomic features that explain the size, shape, and visual attributes of the tumors observed in the CT scans, and store them in a feature group in Feature Store. For each patient study, the following steps are performed. Details can be found in 3_preprocess_imaging_data.ipynb.

Read the 2D DICOM slice files for both the CT scan and tumor segmentation, combine them to 3D volumes, and save the volumes in NIfTI format.
Align CT volume and tumor segmentation so we can focus the computation inside the tumor.
Compute radiomic features describing the tumor region using the pyradiomics library. We extract 120 radiomic features of eight classes such as statistical representations of the distribution and co-occurrence of the intensity within the tumorous region of interest, and shape-based measurements describing the tumor morphologically.

It’s worth mentioning how the medical imaging pipeline is done at scale. Processing hundreds of large, high-resolution 3D images requires the right compute resource. Parallelizing the tasks can further reduce the total processing time and help you get the results sooner. We use SageMaker Processing, an on-demand, scalable feature for data processing need in SageMaker, to run DICOM to 3D volume construction, CT/tumor segmentation alignment, radiomic feature extract, and feature store ingestion. We parallelize the processing to one job per patient data, as shown in the following figure.

To launch the SageMaker Processing jobs for multiple patients in parallel efficiently, we run the utility function launch_processing_job(), which submits a configured SageMaker Processing job on one ml.r5.large instance, in a for loop.

In this case, we need to work with the service quota for the instance cleverly. The default service quota for the number of instances across processing jobs and number of ml.r5.large instances is four. If your account has a higher limit, you may run a higher number of simultaneous processing jobs (therefore a faster completion time). To request a quota increase, refer to AWS service quotas. We implemented the function wait_for_instance_quota() to check for the current job count that is in InProgress state and limit the total count in this experiment to the value set in job_limit. If the total running job count is at the limit, the function waits the number of seconds specified in the wait argument and checks the job count again. This is to account for the account-level SageMaker quota that may cause errors in the for loop.

Another challenge when we work with a large amount of processing and training jobs is keeping track of the details and the lineage of all the jobs in an experiment. We use SageMaker Experiments to achieve this. Setting up the experiment tracking is simple. When each SageMaker Processing job is submitted within launch_processing_job(), an experiment configuration is set up and is provided to run the job. See the following code:

experiment_config={'ExperimentName': experiment_name,
                   'TrialName': trial_name,
                   'TrialComponentDisplayName': f'ImageProcessing-{subject}'}
script_processor = ScriptProcessor(...)
script_processor.run(..., experiment_config=experiment_config)

We can then find the status and details of each job on the Experiments and trials menu.

Note that medical imaging processing and feature store offline store synchronization takes some time to complete for all patients, and imaging features may not be available in the offline store immediately for model training in the next section. To make sure the medical imaging features are available for training, a check of total entry count is implemented before proceeding to the next notebook 4_train_test_model.ipynb.

Modeling

As described earlier, we process the data of each modality in three separate notebooks, and create and ingest features into per-modality feature groups. At the time of training, we flexibly choose multimodal features using a SQL query against the offline store in Feature Store. We then preprocess the data and apply dimensionality reduction prior to training an XGBoost model from the SageMaker built-in algorithm for the binary classification task of predicting the survival outcome of patients. After the model is trained, we host the XGBoost model in a SageMaker real-time endpoint for testing and inference purposes.

To create a multimodal view of a patient for model training, we join the feature vectors from three modalities. This includes the following steps:

Normalize the range of independent features using feature scaling.

For ML training, run a query against the three feature groups to join the data stored in the offline store. For the given dataset, this integration results in 119 data samples, where each sample is a 215-dimensional vector.

genomic_table = <GENOMIC-TABLE-NAME>
clinical_table = <CLINICAL-TABLE-NAME>
imaging_table = <IMAGING-TABLE-NAME>

query = <FEATURE-GROUP-NAME>.athena_query()

query_string = 'SELECT '+genomic_table+'.*, '+clinical_table+'.*, '+imaging_table+'.* 
FROM '+genomic_table+' LEFT OUTER JOIN '+clinical_table+' ON '+clinical_table+'.case_id = '+genomic_table+'.case_id 
LEFT OUTER JOIN '+imaging_table+' ON '+clinical_table+'.case_id = '+imaging_table+'.subject 
ORDER BY '+clinical_table+'.case_id ASC;'

query.run(query_string=query_string, output_location='s3://'+<BUCKET>+'/'+prefix+'/query_results/')

imaging_query.wait()

multimodal_features = query.as_dataframe()

Perform principal component analysis (PCA) on the features to reduce the dimensionality and identify the most discriminative features that contribute to 95% variance in the data. This results in a dimensionality reduction from 215 features down to 45 principal components, which constitute features for the supervised learner.
```
from sklearn.decomposition import PCA

# Set variance threshold for PCA to 99%
pca_threshold = 0.99
pca = PCA(n_components = pca_threshold)
X_trainval_pca = pca.fit_transform(X_trainval_scaled)
X_test_pca = pca.transform(X_test_scaled)
```
Randomly shuffle this data and divide it into 80% for training and 20% for testing the model.
Further split the training data into 80% for training and 20% for validating the model.

To train the ML model, construct an estimator of the gradient boosting library XGBoost through a SageMaker XGBoost container. Because our objective is to train a baseline model with multimodal data, we consider default hyperparameters and don’t perform any hyperparameter tuning. The model predicts NSCLC patients’ survival status (dead or alive) in a form of probability. In addition to the model and prediction, we also generate reports to explain the model. The medical imaging pipeline produces 3D lung CT volumes and tumor segmentation for visualization purposes. See the following code

container = retrieve("xgboost", region=region, version='1.2-1')

xgb = sagemaker.estimator.Estimator(container,
                                    role, 
                                    instance_count=1, 
                                    instance_type='ml.m4.xlarge',
                                    output_path='s3://{}/{}/training-output'.format(bucket, prefix),
                                    sagemaker_session=sm_session)

xgb.set_hyperparameters(eta=0.1, objective='reg:logistic', num_round=10) 

# Train model
xgb.fit({'train': train_data, 'validation': validation_data})

After we train the model, we can deploy the model to a SageMaker endpoint to give us the ability to make predictions in real time in the next step.

Inference

To deploy the endpoint, we use the deploy() method from the trained estimator. The deploy method uses several parameters. We provide the deploy method with an instance_count, which specifies the number of compute instances to launch initially, and instance_type, which specifies the compute instance type. We then use a CSVSerilizer to serialize the incoming data of various formats to a CSV-formatted string for our endpoint. See the following code:

from sagemaker.serializers import CSVSerializer

predictor = xgb.deploy(initial_instance_count = 1, 
                       instance_type = 'ml.m5.xlarge', 
                       serializer = CSVSerializer(),
                       endpoint_name=endpoint_name)

After the endpoint has been deployed, we can make requests and receive predictions in real time. To make predictions, we use the predict() method to pass in data from the test_data data frame:

predictions = predictor.predict(test_data.to_numpy()[:,1:]).decode('utf-8')
# Predictions come back as str. Below transforms stings to float and the probability to binary label (1: dead, 0: alive)
y_predict = [1 if float(pred) > 0.5 else 0 for pred in predictions.split(',')]

The predictor returns a probability. If the probability is greater than 0.5, the patient is less likely to survive NSCLC. If the probability is less than 0.5, the patient is more likely to survive NSCLC.

Finally, we evaluate the prediction with the ground truth in the test_data using accuracy, F1 score, precision, recall, and a confusion matrix. We see that the model can accurately predict 19 out of 25 patients in the test_data, with a balanced precision and recall (and F1 score).

Clean up

After completing this solution, we delete resources instantiated from the solution to avoid incurring further charges. To delete the endpoint and model resources, run the Clean up step in 4_train_test_model.ipynb.

To delete the full JumpStart solution, go to the launched solution you deployed by choosing the JumpStart icon. Under Your launched solutions, select Lung Cancer Survival Prediction, and scroll to the bottom of the solution page. Under Delete solution, choose Delete all resources. This will delete the launched resources including the notebooks that were created.

If you now go back to the S3Downloads folder, you will notice that the Studio notebooks have been deleted.

To delete your SageMaker domain, refer to Delete an Amazon SageMaker Domain.

Conclusion

In this post, we announced a new JumpStart solution that predicts the survival status of non-small cell lung cancer patients using multimodal data. We walked you through the diverse multimodal dataset (medical imaging, genomic, and clinical records); discussed our solution that uses SageMaker features such as SageMaker Processing, Feature Store, the built-in XGBoost algorithm, and SageMaker real-time endpoints; and showed you how to easily run the solution in JumpStart, where cloud resources are managed and deployed for you with just one click.

We encourage you to launch the solution in Studio and step through the notebooks in detail to learn how to process complex healthcare multimodal data, build a survival status prediction model, and make inference on the data using SageMaker. You will then have the knowledge to build an ML solution using SageMaker for your own healthcare and life sciences datasets and use cases.

To learn more about other solutions, pre-built models, and algorithms in JumpStart, visit SageMaker JumpStart.

To learn more about multimodal and multi-omics analysis on AWS, visit Simplifying Multi-modal & Multi-omics Analysis with AWS for Health.

Disclaimer

This solution is for demonstrative purposes only. It is not for clinical use and is not a substitute for professional medical advice, diagnosis, or treatment. The associated notebooks, including the trained model and sample data, are not intended for production. It is each customers’ responsibility to determine whether they are subject to HIPAA, and if so, how best to comply with HIPAA and its implementing regulations. Before using AWS in connection with protected health information, customers must enter an AWS Business Associate Addendum (BAA) and follow its configuration requirements.

References

[1] Travis, William D., et al. “The 2015 World Health Organization classification of lung tumors: impact of genetic, clinical and radiologic advances since the 2004 classification.” Journal of thoracic oncology 10.9 (2015): 1243-1260.

[2] Jernal, A., et al. “Cancer statistics, 2002.” CA cancer J clin52.1 (2002): 23-47.

[3] Bakr, Shaimaa, et al. “A radiogenomic dataset of non-small cell lung cancer.” Scientific data 5.1 (2018): 1-9.

[4] Zhou, Mu, et al. “Non–small cell lung cancer radiogenomics map identifies relationships between molecular and imaging phenotypes with prognostic implications.” Radiology 286.1 (2018): 307.

About the Authors

Michael Hsieh is a Principal AI/ML Specialist Solutions Architect. He focuses on solving business challenges using AI/ML for customers in the healthcare and life sciences industry. As a Seattle transplant, he loves exploring the great Mother Nature the city has to offer, such as the hiking trails, scenery kayaking in the SLU, and the sunset at Shilshole Bay. As a former long-time resident of Philadelphia, he has been rooting for the Philadelphia Eagles and Philadelphia Phillies.

Olivia Choudhury, PhD, is a Senior Partner Solutions Architect at AWS. She helps partners in the healthcare and life sciences domain design, develop, and scale state-of-the-art solutions using AWS. She has a background in genomics, healthcare analytics, federated learning, and privacy-preserving machine learning. Outside of work, she plays board games, paints landscapes, and collects manga.

Curt Lockhart is an AI/ML Specialist Solutions Architect at AWS. He comes from a non-traditional background of working in the arts before his move to tech, and enjoys making machine learning approachable for each customer. Based in Seattle, you can find him venturing to local art museums, catching a concert, and wandering throughout the cities and outdoors of the Pacific Northwest.

Method predicts bias in face recognition models using unlabeled data

November 8, 2022

by Amazon AWS

Eliminating the need for annotation makes bias testing much more practical.Read More

Cost-effective data preparation for machine learning using SageMaker Data Wrangler

November 7, 2022

by Rajakumar Sampathkumar Amazon AWS

Amazon SageMaker Data Wrangler is a capability of Amazon SageMaker that makes it faster for data scientists and engineers to prepare high-quality features for machine learning (ML) applications via a visual interface. Data Wrangler reduces the time it takes to aggregate and prepare data for ML from weeks to minutes. With Data Wrangler, you can simplify the process of data preparation and feature engineering, and complete each step of the data preparation workflow, including data selection, cleansing, exploration, and visualization from a single visual interface.

In this post, we dive into different aspects of data preparation and the associated features of Data Wrangler to understand the cost components of data preparation and how Data Wrangler offers a cost-effective approach to data preparation. We also cover cost optimization best practices to further reduce data preparation costs in Data Wrangler.

Overview of exploratory data analysis (EDA) and data preparation in Data Wrangler

To understand the cost-effectiveness of Data Wrangler, it’s important to look at different aspects of EDA and data preparation phase of ML. This blog will not compare different platforms or services for EDA, but understand different steps in EDA, their cost considerations, and how Data Wrangler facilitates EDA in a cost-effective way.

The typical EDA experience of a data scientist consists of the following steps:

Launch a Jupyter notebook instance to carry out EDA.
Import required packages for data analysis and visualization.
Import the data from multiple sources.
Carry out transformations such as handling missing values and outliers, one-hot encoding, balancing data, and more to clean the data and make it ready for modeling.
Visualize the data.
Create mechanisms to repeat the steps.
Export processed data for downstream analytics or ML.

These steps are complex, and require flexibility in compute and memory requirements so you can run each step with appropriate compute and memory. You also need an integrated system that can import data from multiple sources and mechanisms to repeat or reuse so that you can apply the same EDA steps you already built to larger, similar, or different datasets as required by your downstream ML pipeline.

EDA cost considerations

The following are some of the cost considerations for EDA:

Compute

Some EDA environments require data in a certain format. In such cases, you need to process the data to the format accepted by the EDA environment. For example, if the environment accepts only CSV format but you have data in Parquet or another format, you have to convert your dataset to CSV format. Reformatting data requires compute.
Not all environments have the flexibility to change compute or memory configuration with the click of a button. You may need to have the highest compute capacity and memory footprint as applicable to each transformation you’re performing.

Storage and data transfer

Data in multiple sources has to be collected. If only selected sources are supported by the EDA environment, you may have to move your data from different sources to that single supported source, which increases both storage and data transfer cost.

Labor cost and expertise

Managing the EDA platform and the underlying compute infrastructure involves expertise, effort, and cost. When you manage the infrastructure, you have the operational burden of managing operating systems and applications such as provisioning, patching, and upgrading. Make sure to identify issues quickly. If you don’t validate the data before building your model, you have wasted a lot of resources as well as engineer time.
Note that EDA requires data science and data experience expertise.
Additionally, some EDA environments don’t offer a point-and-click interface and require you to write code to explore, visualize, and transform data, which involves labor cost.

Operations cost

To move the data from the source to carry out transformations and then to downstream ML pipelines, you may have to carry out the repetitive EDA steps again from the beginning of fetching the data in each phase of EDA, which is time consuming and carries a cumulative labor cost. If you can use the transformed data from the previous step, it doesn’t cumulatively increase cost.
Having an easy mechanism to repeat the same set of EDA steps on similar or incremental datasets saves time as well as cost from a people and compute resources perspective.

Let’s see how Data Wrangler facilitates EDA or data preparation in a cost-effective manner in regards to these different areas.

Compute

When you carry out EDA on a notebook, you may not have the flexibility to scale the compute or memory on demand, which may force you to run the transformation and visualizations in an oversized environment. If you have an undersized environment, you may run into out of memory issues. In Data Wrangler, you can choose a smaller instance type for certain transformations or analysis and then upscale the instance to a larger type and carry out complex transformations. When the complex transformation is complete, you can downscale the Data Wrangler instance to a smaller instance type. This gives you the flexibility to scale your compute based on the transformation requirements.

Data Wrangler supports a variety of instance types, and you can choose the right one for your workload, thereby eliminating the costs of oversized or undersized environments.

Storage and data transfer

In this section, we discuss some of the cost considerations for storage and data transfer.

Import

Data for ML is often available from multiple sources and in different formats. With Data Wrangler, you can import data from the following data sources: Amazon Simple Storage Service (Amazon S3), Amazon Athena, Amazon Redshift, AWS Lake Formation, Amazon SageMaker Feature Store and Snowflake. Data can be in any of the following formats: CSV, Parquet, JSON, and Optimized Row Columnar (ORC), and more data formats will be added based on customer demand. Because the important data sources are already supported in Data Wrangler, data can be directly imported from the respective sources, and you only pay for the GB-month of provisioned storage. For more information, refer to Amazon SageMaker Pricing.

All the iterative data exploration, data transformation, and visualization can be carried out within Data Wrangler itself. This eliminates further data movement compared to other environments where you may have to move the data to different locations for ingestion, transformation, and processing. From a cost perspective, this eliminates duplicate data storage as well as reduced data movement.

Data quality cost

If you don’t identify bad data and correct it early, it will become a costly problem to solve later. The Data Quality and Insights Report helps you eliminate this problem. You can use the Data Quality and Insights Report to perform an analysis of your data to gain insights into your dataset, such as the number of missing values and the number of outliers. If you have issues with your data, such as target leakage or imbalance, the insights report can bring those issues to your attention. As soon as you import your data, you can run an insights report with a click of a button. This reduces the effort of importing libraries and writing code to get the required insights on the dataset, which reduces the labor cost and expertise required.

When you create the Data Quality and Insights Report, Data Wrangler gives you the option to select a target column (the column that you’re trying to predict). When you choose a target column, Data Wrangler automatically creates a target column analysis. It also ranks the features in the order of their predictive power (see the following screenshot). This contributes to the direct business benefit of high-quality features for the downstream ML process.

Transformation

If your EDA tool supports only certain transformations, you may need to move the data to a different environment to carry out the custom transformations such as Spark jobs. Data Wrangler supports custom transformations, which can be written in PySpark, Pandas, and SQL (see the following screenshot for an example). They’re developer friendly and all seamlessly packaged into one place, reducing data movement and saving cost associated with data transfer and storage.

You may also need to carry out mathematical operations on your datasets, such as taking an absolute value of a column. If your EDA tool doesn’t support mathematical operations, you may have to carry out the operations externally, which requires additional effort and cost. Some tools might support mathematical operations on the dataset but require you to import libraries, which involves additional effort. In Data Wrangler, you can also use a custom formula to define a new column using a Spark SQL expression to query data in the current data frame without incurring any additional cost for custom transformations or custom queries.

Labor cost and expertise

Managing the EDA platform and the underlying compute infrastructure involves expertise, effort, and cost. Data Wrangler offers a selection of over 300 preconfigured data transformations written in PySpark, so you can process datasets up to hundreds of gigabytes efficiently without having to worry about writing code to transform the data. You can use transformations such as convert column type, one hot encoding, impute missing data with mean or median, rescale columns, and data/time embeddings to transform your data into formats that the models can use without even writing a single line of code. This reduces time and effort, thereby reducing labor cost.

Data Wrangler offers a point-and-click interface to visualize and validate data (see the following screenshot). No expertise is required on data engineering or analytics because all the data preparation can be done via simple point and click.

Visualization

Data Wrangler helps you understand your data and identify potential errors and extreme values with a set of robust preconfigured visualization templates. You don’t need familiarity or to spend additional time to import any external libraries or dependencies to carry out the visualizations. Histograms, scatter plots, box and whisker plots, line plots, and bar charts are all available (see the following screenshots for some examples). Templates such as histograms make it simple to create and edit your own visualizations without writing code.

Validation

Data Wrangler enables you to quickly identify inconsistencies in your data preparation workflow and diagnose issues before models are deployed into production (see the following screenshot). You can quickly identify if your prepared data will result in an accurate model so you can determine if additional feature engineering is needed to improve performance. All of this occurs before the model building phase, so there is no additional labor cost for building a model that’s not performing as expected (low performance metrics) that would result in additional transformations after the model build. The validation also results in the business benefit of better quality features.

Build scalable data preparation pipelines

When you carry out EDA you have to build data preparation pipelines that can scale with datasets (see the following screenshot). This is important for repetition as well as downstream ML processes. Typically, customers use Spark for its distributed, scalable, and in-memory processing nature; however, this requires a lot of expertise on Spark. Setting up a Spark environment is time consuming and requires expertise for optimal configuration. With Data Wrangler, you can create data processing jobs and export to Amazon S3 and Amazon feature store purely via the visual interface without having to generate, run, or manage Jupyter notebooks, which facilitates scalable data preparation pipelines without any Spark expertise. For more information, refer to Launch processing jobs with a few clicks using Amazon SageMaker Data Wrangler.

Operations cost

Integration may not be a direct cost benefit; however, there are indirect cost benefits when you work in an integrated environment such as SageMaker. Because Data Wrangler is integrated with AWS services, you can export your data preparation workflow to a Data Wrangler job notebook, and launch Amazon SageMaker Autopilot training experiment, Amazon SageMaker Pipelines notebook, or code script. You can also create a Data Wrangler processing job with one click without needing to set up and manage infrastructure to carry out repetitive steps or automation in an ML workflow.

In your Data Wrangler flow, you can export some or all of the transformations that you made to your data processing pipelines. When you export your data flow, you’re charged for the AWS resources that you use. From a cost perspective, exporting the transformation gives you the ability to repeat the transformation on additional datasets with no incremental effort.

With Data Wrangler, you can export all the transformations that you made to a dataset to a destination node with just a few clicks. This allows you to create data processing jobs and export to Amazon S3 purely via the visual interface without having to generate, run, or manage Jupyter notebooks, thereby enhancing the low-code experience.

Data Wrangler allows you to export your data preparation steps or data flow into different environments. Data Wrangler has seamless integration with other AWS services and features, such as the following:

SageMaker Feature Store – You can engineer your model features using Data Wrangler and then ingest into your feature store, which is a centralized store for features and their associated metadata
SageMaker Pipelines – You can use the data flow exported from Data Wrangler in SageMaker pipelines, which are used to build and deploy large-scale ML workflows
Amazon S3 – You can export the data to Amazon S3 and use it to create Data Wrangler jobs
Python – Finally, you can export all the steps in your data flow to a Python file, which you can manually integrate into any data processing workflow.

Such tight integration helps reduce effort, time, expertise, and cost.

Cost optimization best practices

In this section, we discuss best practices to further optimize cost in Data Wrangler.

Update Data Wrangler to the latest release

When you update Data Wrangler to the latest release, you get all the latest features, security, and overall optimizations made to Data Wrangler, which may improve its cost-effectiveness.

Use built-in Data Wrangler transformers

Use the built-in Data Wrangler transformers over custom Pandas transforms when processing larger and wider datasets.

Choose the right instance type for your Data Wrangler flow

There are two families of ml instance types supported for Data Wrangler: m5 and r5. m5 instances are general purpose instances that provide a balance between compute and memory, whereas r5 instances are designed to deliver fast performance to process large datasets in memory.

We recommend choosing an instance that is best optimized around your workloads. For example, the r5.8xlarge might have a higher price than the m5.4xlarge, but the r5.8xlarge might be better optimized for your workloads. With better optimized instances, you can run your data flows in less time at lower cost.

Process larger and wider datasets

For datasets larger than tens of gigabytes, we recommend using built-in transforms, or sampling data on import to run custom Pandas transforms interactively. In the post, we share our findings from two benchmark tests to demonstrate how to do this.

Shut down unused instances

You are charged for all running instances. To avoid incurring additional charges, shut down the instances that you aren’t using manually. To shut down an instance that is running, complete the following steps:

On your data flow page, choose the instance icon in the navigation pane under Running instances.
Choose Shut down.

If you shut down an instance used to run a flow, you can’t access the flow temporarily. If you get an error in opening the flow running an instance you previously shut down, wait for approximately 5 minutes and try opening it again.

When you’re not using Data Wrangler, it’s important to shut down the instance on which it runs to avoid incurring additional fees. For more information, refer to Shut Down Data Wrangler.

For information about shutting down Data Wrangler resources automatically, refer to Save costs by automatically shutting down idle resources within Amazon SageMaker Studio.

Export

When you export your Data Wrangler flow or transformations, you can use cost allocation tags to organize and manage the costs of those resources. You create these tags for your user profile and Data Wrangler automatically applies them to the resources used to export the data flow. For more information, see Using Cost Allocation Tags.

Pricing

Data Wrangler pricing has three components: Data Wrangler instances, Data Wrangler jobs, and ML storage. You can perform all the steps for EDA or data preparation within Data Wrangler and you pay for the instance, jobs, and storage pricing based on usage or consumption, with no upfront or licensing fees. For more information, refer to On-Demand Pricing.

Conclusion

In this post, we reviewed different cost aspects of EDA and data preparation to discover how feature-rich and integrated Data Wrangler reduces the time it takes to aggregate and prepare data for ML use cases from weeks to minutes, thereby facilitating cost-effective data preparation for ML. We also inspected the pricing components of Data Wrangler and best practices for cost optimization when using Data Wrangler for your ML data preparation requirements.

For more information, see the following resources:

About the Authors

Rajakumar Sampathkumar is a Principal Technical Account Manager at AWS, providing customer guidance on business-technology alignment and supporting the reinvention of their cloud operation models and processes. He is passionate about cloud and machine learning. Raj is also a machine learning specialist and works with AWS customers to design, deploy, and manage their AWS workloads and architectures.

Rahul Nabera is a Data Analytics Consultant in AWS Professional Services. His current work focuses on enabling customers build their data and machine learning workloads on AWS. In his spare time, he enjoys playing cricket and volleyball.

Generate images from text with the stable diffusion model on Amazon SageMaker JumpStart

November 7, 2022

by Vivek Madan Amazon AWS

In December 2020, AWS announced the general availability of Amazon SageMaker JumpStart, a capability of Amazon SageMaker that helps you quickly and easily get started with machine learning (ML). JumpStart provides one-click fine-tuning and deployment of a wide variety of pre-trained models across popular ML tasks, as well as a selection of end-to-end solutions that solve common business problems. These features remove the heavy lifting from each step of the ML process, making it easier to develop high-quality models and reducing time to deployment.

This post is the fifth in a series on using JumpStart for specific ML tasks. In the first post, we showed how you can run image classification use cases on JumpStart. In the second post, we demonstrated how to run text classification use cases. In the third post, we discussed image segmentation use cases. In the fourth post, we showed how you can run text generation use cases.

In this post, we provide a step-by-step walkthrough on how to deploy pre-trained stable diffusion models for generating images from text. We explore two ways of obtaining the same result: via JumpStart’s graphical interface on Amazon SageMaker Studio, and programmatically through JumpStart APIs.

If you want to jump straight into the JumpStart API code we go through in this post, you can refer to the following sample Jupyter notebook: Introduction to JumpStart – Text to Image.

JumpStart overview

JumpStart helps you get started with ML models for a variety of tasks without writing a single line of code. Currently, JumpStart enables you to do the following:

Deploy pre-trained models for common ML tasks – JumpStart enables you to address common ML tasks with no development effort by providing easy deployment of models pre-trained on large, publicly available datasets. The ML research community has put a large amount of effort into making a majority of recently developed models publicly available for use. JumpStart hosts a collection of over 300 models, spanning the 15 most popular ML tasks such as object detection, text classification, and text generation, making it easy for beginners to use them. These models are drawn from popular model hubs such as TensorFlow, PyTorch, Hugging Face, and MXNet.
Fine-tune pre-trained models – JumpStart allows you to fine-tune pre-trained models without needing to write your own training algorithm. In ML, the ability to transfer the knowledge learned in one domain to another domain is called transfer learning. You can use transfer learning to produce accurate models on your smaller datasets, with much lower training costs than the ones involved in training the original model. JumpStart also includes popular training algorithms based on LightGBM, CatBoost, XGBoost, and Scikit-learn, which you can train from scratch for tabular regression and classification.
Use pre-built solutions – JumpStart provides a set of 17 solutions for common ML use cases, such as demand forecasting and industrial and financial applications, which you can deploy with just a few clicks. Solutions are end-to-end ML applications that string together various AWS services to solve a particular business use case. They use AWS CloudFormation templates and reference architectures for quick deployment, which means they’re fully customizable.
Refer to notebook examples for SageMaker algorithms – SageMaker provides a suite of built-in algorithms to help data scientists and ML practitioners get started with training and deploying ML models quickly. JumpStart provides sample notebooks that you can use to quickly use these algorithms.
Review training videos and blogs – JumpStart also provides numerous blog posts and videos that teach you how to use different functionalities within SageMaker.

JumpStart accepts custom VPC settings and AWS Key Management Service (AWS KMS) encryption keys, so you can use the available models and solutions securely within your enterprise environment. You can pass your security settings to JumpStart within Studio or through the SageMaker Python SDK.

Text-to-image and stable diffusion models

Text-to-image is the task of generating realistic images describing a raw input text. Stable diffusion is a popular model for this task. It’s a latent diffusion model where input text is first embedded into a latent space via a language model. Then, a series of noise addition and removal operations are performed in the latent space with U-Net architecture. Finally, denoised output is decoded into the pixel space.

For example, with the input “Garden painted in impressionist style,” we get the following output image.

Although primarily used to generate images conditioned on text, stable diffusion models can also be used for other tasks such as inpainting, outpainting, and generating image-to-image translations guided by text.

Solution overview

The following sections provide a step-by-step demo to perform inference, both via the Studio UI and via JumpStart APIs. We walk through the following steps:

Access JumpStart through the Studio UI to deploy and run inference on the pre-trained model.
Use JumpStart programmatically with the SageMaker Python SDK to deploy the pre-trained model and run inference.

Access JumpStart through the Studio UI and run inference with a pre-trained model

In this section, we demonstrate how to train and deploy JumpStart models through the Studio UI.

The following video shows you how to find a pre-trained text-to-image model on JumpStart and deploy it. The model page contains valuable information about the model and how to use it. You can deploy any of the pre-trained models available in JumpStart. For inference, we pick the ml.p3.2xlarge instance type, because it provides the GPU acceleration needed for low inference latency at a low price point. After you configure the SageMaker hosting instance, choose Deploy. It may take 5–10 minutes until your persistent endpoint is up and running.

After a few minutes, your endpoint is operational and ready to respond to inference requests!

To accelerate your time to inference, JumpStart provides a sample notebook that shows you how to run inference on your freshly deployed endpoint. Choose Open Notebook under Use Endpoint from Studio.

Use JumpStart programmatically with the SageMaker SDK

In the preceding section, we showed how you can use the JumpStart UI to deploy a pre-trained model interactively, in a matter of a few clicks. However, you can also use JumpStart’s models programmatically by using APIs that are integrated into the SageMaker SDK.

In this section, we go over a quick example of how you can replicate the preceding process with the SageMaker SDK. We choose an appropriate pre-trained model in JumpStart, deploy this model to a SageMaker endpoint, and run inference on the deployed endpoint. All the steps in this demo are available in the accompanying notebook Introduction to JumpStart – Text to Image.

Deploy the pre-trained model

SageMaker is a platform that makes extensive use of Docker containers for build and runtime tasks. JumpStart uses the available framework-specific SageMaker Deep Learning Containers (DLCs). We first fetch any additional packages, as well as scripts to handle training and inference for the selected task. Finally, the pre-trained model artifacts are separately fetched with model_uris, which provides flexibility to the platform. You can use any number of models pre-trained on the same task with a single inference script. See the following code:

model_id, model_version = "huggingface-txt2img-stable-diffusion-v1-4", "*"

# Retrieve the inference docker container uri
deploy_image_uri = image_uris.retrieve(
    region=None,
    framework=None,  # automatically inferred from model_id
    image_scope="inference",
    model_id=model_id,
    model_version=model_version,
    instance_type=inference_instance_type,
)

# Retrieve the inference script uri
deploy_source_uri = script_uris.retrieve(model_id=model_id, model_version=model_version, script_scope="inference")

base_model_uri = model_uris.retrieve(model_id=model_id, model_version=model_version, model_scope="inference")

Next, we feed the resources into a SageMaker model instance and deploy an endpoint:

# Create the SageMaker model instance
model = Model(
    image_uri=deploy_image_uri,
    source_dir=deploy_source_uri,
    model_data=base_model_uri,
    entry_point="inference.py",  # entry point file in source_dir and present in deploy_source_uri
    role=aws_role,
    predictor_cls=Predictor,
    name=endpoint_name,
)

# deploy the Model. Note that we need to pass Predictor class when we deploy model through Model class,
# for being able to run inference through the sagemaker API.
base_model_predictor = model.deploy(
    initial_instance_count=1,
    instance_type=inference_instance_type,
    predictor_cls=Predictor,
    endpoint_name=endpoint_name,
)

After our model is deployed, we can get predictions from it in real time!

Run inference

The following code snippet gives you a glimpse of what the outputs look like. To send requests to a deployed model, input text needs to be supplied in a utf-8 encoded format.

def query(model_predictor, text):
    """Query the model predictor."""

    encoded_text = json.dumps(text).encode("utf-8")

    query_response = model_predictor.predict(
        encoded_text,
        {
            "ContentType": "application/x-text",
            "Accept": "application/json",
        },
    )
    return query_response

The endpoint response is a JSON object containing the generated image and the prompt:

def parse_response(query_response):
    response_dict = json.loads(query_response['Body'].read())
    return response_dict['generated_image'], response_dict['prompt']
    
text = "Garden painted in impressionist style"
query_response = query(model_predictor, text)
img, prompt = parse_response(query_response)
display_img_prompt(img,prompt)

We get the following image as our output.

Conclusion

In this post, we showed how to deploy a pre-trained text generation model using JumpStart. You can accomplish this without needing to write code. Try out the solution on your own and send us your comments. To learn more about JumpStart and how you can use open-source pre-trained models for a variety of other ML tasks, check out the following AWS re:Invent 2020 video.

About the Authors

Dr. Vivek Madan is an Applied Scientist with the Amazon SageMaker JumpStart team. He got his PhD from University of Illinois at Urbana-Champaign and was a Post Doctoral Researcher at Georgia Tech. He is an active researcher in machine learning and algorithm design and has published papers in EMNLP, ICLR, COLT, FOCS, and SODA conferences.

Santosh Kulkarni is an Enterprise Solutions Architect at Amazon Web Services who works with sports customers in Australia. He is passionate about building large-scale distributed applications to solve business problems using his knowledge in AI/ML, big data, and software development.

Leonardo Bachega is a Senior Scientist and Manager in the Amazon SageMaker JumpStart team. He’s passionate about building AI services for computer vision.

Run text generation with GPT and Bloom models on Amazon SageMaker JumpStart

November 7, 2022

by Vivek Madan Amazon AWS

This post is the fourth in a series on using JumpStart for specific ML tasks. In the first post, we showed how to run image classification use cases on JumpStart. In the second post, we demonstrated how to run text classification use cases. In the third post, we ran image segmentation use cases.

In this post, we provide a step-by-step walkthrough on how to deploy pre-trained text generation models. We explore two ways of obtaining the same result: via JumpStart’s graphical interface on Amazon SageMaker Studio, and programmatically through JumpStart APIs.

If you want to jump straight into the JumpStart API code we go through in this post, you can refer to the following sample Jupyter notebook: Introduction to JumpStart – Text Generation.

JumpStart overview

JumpStart helps you get started with ML models for a variety of tasks without writing a single line of code. Currently, JumpStart enables you to do the following:

Deploy pre-trained models for common ML tasks – JumpStart enables you to address common ML tasks with no development effort by providing easy deployment of models pre-trained on large, publicly available datasets. The ML research community has put a large amount of effort into making a majority of recently developed models publicly available for use. JumpStart hosts a collection of over 300 models, spanning the 15 most popular ML tasks such as object detection, text classification, and text generation, making it easy for beginners to use them. These models are drawn from popular model hubs such as TensorFlow, PyTorch, Hugging Face, and MXNet.
Fine-tune pre-trained models – JumpStart allows you to fine-tune pre-trained models without needing to write your own training algorithm. In ML, the ability to transfer the knowledge learned in one domain to another domain is called transfer learning. You can use transfer learning to produce accurate models on your smaller datasets, with much lower training costs than the ones involved in training the original model. JumpStart also includes popular training algorithms based on LightGBM, CatBoost, XGBoost, and Scikit-learn, which you can train from scratch for tabular regression and classification.
Use pre-built solutions – JumpStart provides a set of 17 solutions for common ML use cases, such as demand forecasting and industrial and financial applications, which you can deploy with just a few clicks. Solutions are end-to-end ML applications that string together various AWS services to solve a particular business use case. They use AWS CloudFormation templates and reference architectures for quick deployment, which means they’re fully customizable.
Refer to notebook examples for SageMaker algorithms – SageMaker provides a suite of built-in algorithms to help data scientists and ML practitioners get started with training and deploying ML models quickly. JumpStart provides sample notebooks that you can use to quickly use these algorithms.
Review training videos and blogs – JumpStart also provides numerous blog posts and videos that teach you how to use different functionalities within SageMaker.

Text generation, GPT-2, and Bloom

Text generation is the task of generating text that is fluent and appears indistinguishable from human-written text. It is also known as natural language generation.

GPT-2 is a popular transformer-based text generation model. It’s pre-trained on a large corpus of raw English text with no human labeling. It’s trained on the task where, given a partial sequence (sentence or piece of text), the model needs to predict the next word or token in the sequence.

Bloom is also a transformer-based text generation model and trained similarly to GPT-2. However, Bloom is pre-trained on 46 different languages and 13 programming languages. The following is an example of running text generation with the Bloom model:

Input: "Some people like dogs, some people like cats"
Output: "Some people like dogs, some people like cats some people like birds, some people like fish,"

Solution overview

The following sections provide a step-by-step demo to perform inference, both via the Studio UI and via JumpStart APIs. We walk through the following steps:

Access JumpStart through the Studio UI to deploy and run inference on the pre-trained model.
Use JumpStart programmatically with the SageMaker Python SDK to deploy the pre-trained model and run inference.

Access JumpStart through the Studio UI and run inference with a pre-trained model

In this section, we demonstrate how to train and deploy JumpStart models through the Studio UI.

The following video shows you how to find a pre-trained text generation model on JumpStart and deploy it. The model page contains valuable information about the model and how to use it. You can deploy any of the pre-trained models available in JumpStart. For inference, we pick the ml.p3.2xlarge instance type, because it provides the GPU acceleration needed for low inference latency at a low price point. After you configure the SageMaker hosting instance, choose Deploy. It may take 20–25 minutes until your persistent endpoint is up and running.

Once your endpoint is operational, it’s ready to respond to inference requests!

Use JumpStart programmatically with the SageMaker SDK

Deploy the pre-trained model

model_id, model_version = "huggingface-textgeneration-bloom-560m", "*"

# Retrieve the inference docker container uri
deploy_image_uri = image_uris.retrieve(
    region=None,
    framework=None,  # automatically inferred from model_id
    image_scope="inference",
    model_id=model_id,
    model_version=model_version,
    instance_type=inference_instance_type,
)

# Retrieve the inference script uri
deploy_source_uri = script_uris.retrieve(model_id=model_id, model_version=model_version, script_scope="inference")

base_model_uri = model_uris.retrieve(model_id=model_id, model_version=model_version, model_scope="inference")

Bloom is a very large model and can take up to 20–25 minutes to deploy. You can also use a smaller model such as GPT-2. To deploy a pre-trained GPT-2 model, you can set model_id = huggingface-textgeneration-gpt2. For a list of other available models in JumpStart, refer to JumpStart Available Model Table.

Next, we feed the resources into a SageMaker model instance and deploy an endpoint:

# Create the SageMaker model instance
model = Model(
    image_uri=deploy_image_uri,
    source_dir=deploy_source_uri,
    model_data=base_model_uri,
    entry_point="inference.py",  # entry point file in source_dir and present in deploy_source_uri
    role=aws_role,
    predictor_cls=Predictor,
    name=endpoint_name,
)

# deploy the Model. Note that we need to pass Predictor class when we deploy model through Model class,
# for being able to run inference through the sagemaker API.
base_model_predictor = model.deploy(
    initial_instance_count=1,
    instance_type=inference_instance_type,
    predictor_cls=Predictor,
    endpoint_name=endpoint_name,
)

After our model is deployed, we can get predictions from it in real time!

Run inference

The following code snippet gives you a glimpse of what the outputs look like. To send requests to a deployed model, input text needs to be supplied in a utf-8 encoded format.

def query(model_predictor, text):
    """Query the model predictor."""

    encoded_text = json.dumps(text).encode("utf-8")

    query_response = model_predictor.predict(
        encoded_text,
        {
            "ContentType": "application/x-text",
            "Accept": "application/json",
        },
    )
    return query_response

The endpoint response is a JSON object containing the input text followed by the generated text:

def parse_response(query_response):
    """Parse response and return the generated text."""

    model_predictions = json.loads(query_response)
    generated_text = model_predictions["generated_text"]
    return generated_text
    
text = "Some people like dogs, some people like cats"
query_response = query(model_predictor, text)
parse_response(query_response)

Our output is as follows:

"Some people like dogs, some people like cats some people like birds, some people like fish,"

Conclusion

About the Authors

Dr. Ashish Khetan is a Senior Applied Scientist with Amazon SageMaker built-in algorithms and helps develop machine learning algorithms. He got his PhD from University of Illinois Urbana Champaign. He is an active researcher in machine learning and statistical inference and has published many papers in NeurIPS, ICML, ICLR, JMLR, ACL, and EMNLP conferences.