Enriching real-time news streams with the Refinitiv Data Library, AWS services, and Amazon SageMaker

This post is co-authored by Marios Skevofylakas, Jason Ramchandani, and Haykaz Aramyan from Refinitiv, an LSEG Business.

Financial service providers often need to identify relevant news, analyze it, extract insights, and take actions in real time, like trading specific instruments (such as commodities, shares, funds) based on additional information or context of the news item. One such additional piece of information (which we use as an example in this post) is the sentiment of the news.

Refinitiv Data (RD) Libraries provide a comprehensive set of interfaces for uniform access to the Refinitiv Data Catalogue. The library offers multiple layers of abstraction providing different styles and programming techniques suitable for all developers, from low-latency, real-time access to batch ingestions of Refinitiv data.

In this post, we present a prototype AWS architecture that ingests our news feeds using RD Libraries and enhances them with machine learning (ML) model predictions using Amazon SageMaker, a fully managed ML service from AWS.

In an effort to design a modular architecture that could be used in a variety of use cases, like sentiment analysis, named entity recognition, and more, regardless of the ML model used for enhancement, we decided to focus on the real-time space. The reason for this decision is that real-time use cases are generally more complex and that the same architecture can also be used, with minimal adjustments, for batch inference. In our use case, we implement an architecture that ingests our real-time news feed, calculates sentiment on each news headline using ML, and re-serves the AI enhanced feed through a publisher/subscriber architecture.

Moreover, to present a comprehensive and reusable way to productionize ML models by adopting MLOps practices, we introduce the concept of infrastructure as code (IaC) during the entire MLOps lifecycle of the prototype. By using Terraform and a single entry point configurable script, we are able to instantiate the entire infrastructure, in production mode, on AWS in just a few minutes.

In this solution, we don’t address the MLOps aspect of the development, training, and deployment of the individual models. If you’re interested in learning more on this, refer to MLOps foundation roadmap for enterprises with Amazon SageMaker, which explains in detail a framework for model building, training, and deployment following best practices.

Solution overview

In this prototype, we follow a fully automated provisioning methodology in accordance with IaC best practices. IaC is the process of provisioning resources programmatically using automated scripts rather than using interactive configuration tools. Resources can be both hardware and needed software. In our case, we use Terraform to accomplish the implementation of a single configurable entry point that can automatically spin up the entire infrastructure we need, including security and access policies, as well as automated monitoring. With this single entry point that triggers a collection of Terraform scripts, one per service or resource entity, we can fully automate the lifecycle of all or parts of the components of the architecture, allowing us to implement granular control both on the DevOps as well as the MLOps side. After Terraform is correctly installed and integrated with AWS, we can replicate most operations that can be done on the AWS service dashboards.

The following diagram illustrates our solution architecture.

The architecture consists of three stages: ingestion, enrichment, and publishing. During the first stage, the real-time feeds are ingested on an Amazon Elastic Compute Cloud (Amazon EC2) instance that is created through a Refinitiv Data Library-ready AMI. The instance also connects to a data stream via Amazon Kinesis Data Streams, which triggers an AWS Lambda function.

In the second stage, the Lambda function triggered by Kinesis Data Streams sends the news headlines to a SageMaker FinBERT endpoint, which returns the calculated sentiment for the news item. This calculated sentiment is the enrichment: the Lambda function wraps the news item with it and stores the result in an Amazon DynamoDB table.

In the third stage of the architecture, a DynamoDB stream triggers a Lambda function on new item inserts. This function is integrated with an Amazon MQ server running RabbitMQ, which re-serves the AI-enhanced stream.

We chose this three-stage design, rather than having the first Lambda layer communicate directly with the Amazon MQ server or packing more functionality into the EC2 instance, to enable exploration of more complex, less tightly coupled AI architectures in the future.

Building and deploying the prototype

We present this prototype in a series of three detailed blueprints. In each blueprint, and for every service used, you will find overviews and relevant information on its technical implementation, as well as Terraform scripts that allow you to automatically start, configure, and integrate the service with the rest of the structure. At the end of each blueprint, you will find instructions on how to make sure that everything is working as expected up to that stage. The three blueprints are described later in this post.

To start the implementation of this prototype, we suggest creating a new Python environment dedicated to it and installing the necessary packages and tools separately from other environments you may have. To do so, create and activate the new environment in Anaconda using the following commands:

conda create --name rd_news_aws_terraform python=3.7
conda activate rd_news_aws_terraform

We’re now ready to install the AWS Command Line Interface (AWS CLI) toolset that will allow us to build all the necessary programmatic interactions in and between AWS services:

pip install awscli

Now that the AWS CLI is installed, we need to install Terraform. HashiCorp provides Terraform with a binary installer, which you can download and install.

After you have both tools installed, ensure that they properly work using the following commands:

terraform -help
aws --version

You’re now ready to follow the detailed blueprints on each of the three stages of the implementation.

Blueprint I: Real-time news ingestion using Amazon EC2 and Kinesis Data Streams

This blueprint represents the initial stages of the architecture that allow us to ingest the real-time news feeds. It consists of the following components:

  • Amazon EC2 preparing your instance for RD News ingestion – This section sets up an EC2 instance so that it can connect to the RD Libraries API and the real-time stream. We also show how to save an image of the created instance to ensure its reusability and scalability.
  • Real-time news ingestion from Amazon EC2 – A detailed implementation of the configurations needed to enable Amazon EC2 to connect to the RD Libraries, as well as the scripts to start the ingestion.
  • Creating and launching Amazon EC2 from the AMI – Launch a new instance from the saved AMI and transfer the ingestion files to it, all automatically using Terraform.
  • Creating a Kinesis data stream – This section provides an overview of Kinesis Data Streams and how to set up a stream on AWS.
  • Connecting and pushing data to Kinesis – Once the ingestion code is working, we need to connect it and send data to a Kinesis stream (a minimal producer sketch follows this list).
  • Testing the prototype so far – We use Amazon CloudWatch and command line tools to verify that the prototype is working up to this point and that we can continue to the next blueprint. The log of ingested data should look like the following screenshot.
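To give a flavor of the Kinesis step, here is a minimal producer sketch using boto3. The stream name, region, and record fields are hypothetical placeholders; the blueprint's actual ingestion script and payload schema may differ.

import json
import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")  # assumed region

def push_headline(headline_text, story_id):
    # Wrap the raw RD news headline in a small JSON record
    record = {"storyId": story_id, "headline": headline_text}
    kinesis.put_record(
        StreamName="rd-news-stream",            # hypothetical stream name
        Data=json.dumps(record).encode("utf-8"),
        PartitionKey=story_id,                  # spreads records across shards
    )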

Blueprint II: Real-time serverless AI news sentiment analysis using Kinesis Data Streams, Lambda, and SageMaker

In this second blueprint, we focus on the main part of the architecture: the Lambda function that ingests and analyzes the news item stream, attaches the AI inference to it, and stores it for further use. It includes the following components:

  • Lambda – Define a Terraform Lambda configuration allowing it to connect to a SageMaker endpoint.
  • Amazon S3 – To implement Lambda, we need to upload the appropriate code to Amazon Simple Storage Service (Amazon S3) and allow the Lambda function to ingest it in its environment. This section describes how we can use Terraform to accomplish that.
  • Implementing the Lambda function: Step 1, Handling the Kinesis event – In this section, we start building the Lambda function. Here, we build the Kinesis data stream response handler part only.
  • SageMaker – In this prototype, we use a pre-trained Hugging Face model that we store into a SageMaker endpoint. Here, we present how this can be achieved using Terraform scripts and how the appropriate integrations take place to allow SageMaker endpoints and Lambda functions work together.
    • At this point, you can instead use any other model that you have developed and deployed behind a SageMaker endpoint. Such a model could provide a different enhancement to the original news data, based on your needs. Optionally, this can be extrapolated to multiple models for multiple enhancements if such exist. Thanks to the rest of the architecture, any such models will enrich your data sources in real time.
  • Building the Lambda function: Step 2, Invoking the SageMaker endpoint – In this section, we build up our original Lambda function by adding the SageMaker block, which obtains a sentiment-enhanced news headline by invoking the SageMaker endpoint.
  • DynamoDB – Finally, when the AI inference is in the memory of the Lambda function, it re-bundles the item and sends it to a DynamoDB table for storage. Here, we discuss both the appropriate Python code needed to accomplish that, as well as the necessary Terraform scripts that enable these interactions.
  • Building the Lambda function: Step 3, Pushing enhanced data to DynamoDB – Here, we continue building up our Lambda function by adding the last part, which creates an entry in the DynamoDB table (a condensed sketch of the complete handler follows this list).
  • Testing the prototype so far – We can navigate to the DynamoDB table on the DynamoDB console to verify that our enhancements are appearing in the table.
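To show how the pieces of this blueprint fit together, the following is a condensed, hypothetical sketch of the finished Lambda handler. The endpoint name, table name, and payload fields are placeholders; the full implementation in the blueprint differs in its configuration and error handling.

import base64
import json
import boto3

sagemaker_runtime = boto3.client("sagemaker-runtime")
dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("rd-news-enriched")       # hypothetical table name

def lambda_handler(event, context):
    for record in event["Records"]:
        # Kinesis record payloads arrive base64-encoded
        news_item = json.loads(base64.b64decode(record["kinesis"]["data"]))

        # Ask the FinBERT endpoint for the sentiment of the headline
        response = sagemaker_runtime.invoke_endpoint(
            EndpointName="finbert-endpoint",     # hypothetical endpoint name
            ContentType="application/json",
            Body=json.dumps({"inputs": news_item["headline"]}),
        )
        sentiment = json.loads(response["Body"].read())

        # Store the enriched item; the DynamoDB stream drives the next stage.
        # Serialized as a string to sidestep DynamoDB's float restrictions.
        news_item["sentiment"] = json.dumps(sentiment)
        table.put_item(Item=news_item)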

Blueprint III: Real-time streaming using DynamoDB Streams, Lambda, and Amazon MQ

This third blueprint finalizes the prototype. It focuses on redistributing the newly created, AI-enhanced data item to a RabbitMQ server in Amazon MQ, allowing consumers to connect and retrieve the enhanced news items in real time. It includes the following components:

  • DynamoDB Streams – When the enhanced news item is in DynamoDB, we set up an event trigger that can then be captured by the appropriate Lambda function.
  • Writing the Lambda producer – This Lambda function captures the event and acts as a producer of the RabbitMQ stream. This new function introduces the concept of Lambda layers, as it uses Python libraries to implement the producer functionality (see the sketch after this list).
  • Amazon MQ and RabbitMQ consumers – The final step of the prototype is setting up the RabbitMQ service and implementing an example consumer that will connect to the message stream and receive the AI enhanced news items.
  • Final test of the prototype – We use an end-to-end process to verify that the prototype is fully working, from ingestion to re-serving and consuming the new AI-enhanced stream.
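As an illustration of the producer side, the following sketch publishes an enriched item to a RabbitMQ queue on Amazon MQ using the pika library (which would be packaged as a Lambda layer). The broker URL, credentials, and queue name are placeholders, and the blueprint's actual producer may be structured differently.

import json
import ssl
import pika

# Amazon MQ for RabbitMQ accepts TLS connections (amqps) on port 5671
params = pika.URLParameters("amqps://user:password@your-broker-id.mq.us-east-1.amazonaws.com:5671")
params.ssl_options = pika.SSLOptions(ssl.create_default_context())

def publish_enriched_item(item):
    connection = pika.BlockingConnection(params)
    channel = connection.channel()
    channel.queue_declare(queue="enriched-news", durable=True)  # hypothetical queue name
    channel.basic_publish(
        exchange="",                  # default exchange routes by queue name
        routing_key="enriched-news",
        body=json.dumps(item),
    )
    connection.close()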

At this stage, you can validate that everything has been working by navigating to the RabbitMQ dashboard, as shown in the following screenshot.

In the final blueprint, you also find a detailed test vector to make sure that the entire architecture is behaving as planned.

Conclusion

In this post, we shared a solution using ML on the cloud with AWS services like SageMaker (ML), Lambda (serverless), and Kinesis Data Streams (streaming) to enrich streaming news data provided by Refinitiv Data Libraries. The solution adds a sentiment score to news items in real time and scales the infrastructure using code.

The benefit of this modular architecture is that you can reuse it with your own model to perform other types of data augmentation, in a serverless, scalable, and cost-efficient way that can be applied on top of Refinitiv Data Library. This can add value for trading/investment/risk management workflows.

If you have any comments or questions, please leave them in the comments section.

About the Authors

Marios Skevofylakas comes from a financial services, investment banking and consulting technology background. He holds an engineering Ph.D. in Artificial Intelligence and an M.Sc. in Machine Vision. Throughout his career, he has participated in numerous multidisciplinary AI and DLT projects. He is currently a Developer Advocate with Refinitiv, an LSEG business, focusing on AI and Quantum applications in financial services.

Jason Ramchandani has worked at Refinitiv, an LSEG Business, for 8 years as Lead Developer Advocate helping to build their Developer Community. Previously he has worked in financial markets for over 15 years with a quant background in the equity/equity-linked space at Okasan Securities, Sakura Finance and Jefferies LLC. His alma mater is UCL.

Haykaz Aramyan comes from a finance and technology background. He holds a Ph.D. in Finance and an M.Sc. in Finance, Technology and Policy. Over 10 years of professional experience, Haykaz has worked on several multidisciplinary projects involving pension funds, VC funds, and technology startups. He is currently a Developer Advocate with Refinitiv, an LSEG Business, focusing on AI applications in financial services.

Georgios Schinas is a Senior Specialist Solutions Architect for AI/ML in the EMEA region. He is based in London and works closely with customers in UK and Ireland. Georgios helps customers design and deploy machine learning applications in production on AWS with a particular interest in MLOps practices and enabling customers to perform machine learning at scale. In his spare time, he enjoys traveling, cooking and spending time with friends and family.

Muthuvelan Swaminathan is an Enterprise Solutions Architect based out of New York. He works with enterprise customers providing architectural guidance in building resilient, cost-effective, innovative solutions that address their business needs and help them execute at scale using AWS products and services.

Mayur Udernani leads the AWS AI & ML business with commercial enterprises in the UK & Ireland. In his role, Mayur spends the majority of his time with customers and partners to help create impactful solutions that solve the most pressing needs of a customer or a wider industry, leveraging AWS Cloud, AI & ML services. Mayur lives in the London area. He has an MBA from the Indian Institute of Management and a Bachelor's in Computer Engineering from Mumbai University.

Best practices for load testing Amazon SageMaker real-time inference endpoints

Amazon SageMaker is a fully managed machine learning (ML) service. With SageMaker, data scientists and developers can quickly and easily build and train ML models, and then directly deploy them into a production-ready hosted environment. It provides an integrated Jupyter authoring notebook instance for easy access to your data sources for exploration and analysis, so you don’t have to manage servers. It also provides common ML algorithms that are optimized to run efficiently against extremely large data in a distributed environment.

SageMaker real-time inference is ideal for workloads that have real-time, interactive, low-latency requirements. With SageMaker real-time inference, you can deploy REST endpoints that are backed by a specific instance type with a certain amount of compute and memory. Deploying a SageMaker real-time endpoint is only the first step in the path to production for many customers. We want to be able to maximize the performance of the endpoint to achieve a target transactions per second (TPS) while adhering to latency requirements. A large part of performance optimization for inference is making sure you select the proper instance type and count to back an endpoint.

This post describes best practices for load testing a SageMaker endpoint to find the right configuration for the number and size of instances. This can help us understand the minimum provisioned instance count needed to meet our latency and TPS requirements. From there, we dive into how you can track and understand the metrics and performance of the SageMaker endpoint using Amazon CloudWatch metrics.

We first benchmark the performance of our model on a single instance to identify the TPS it can handle per our acceptable latency requirements. Then we extrapolate the findings to decide on the number of instances we need in order to handle our production traffic. Finally, we simulate production-level traffic and set up load tests for a real-time SageMaker endpoint to confirm our endpoint can handle the production-level load. The entire set of code for the example is available in the following GitHub repository.

Overview of solution

For this post, we deploy a pre-trained Hugging Face DistilBERT model from the Hugging Face Hub. This model can perform a number of tasks, but we send a payload specifically for sentiment analysis and text classification. With this sample payload, we strive to achieve 1000 TPS.

Deploy a real-time endpoint

This post assumes you are familiar with how to deploy a model. Refer to Create your endpoint and deploy your model to understand the internals behind hosting an endpoint. For now, we can quickly point to this model in the Hugging Face Hub and deploy a real-time endpoint with the following code snippet:

# HuggingFaceModel comes from the SageMaker Python SDK
import sagemaker
from sagemaker.huggingface import HuggingFaceModel

# IAM role with SageMaker permissions (this call assumes you are running inside SageMaker)
role = sagemaker.get_execution_role()

# Hub Model configuration. https://huggingface.co/models
hub = {
    'HF_MODEL_ID': 'distilbert-base-uncased',
    'HF_TASK': 'text-classification'
}

# create Hugging Face Model Class
huggingface_model = HuggingFaceModel(
    transformers_version='4.17.0',
    pytorch_version='1.10.2',
    py_version='py38',
    env=hub,
    role=role,
)

# deploy model to SageMaker Inference
predictor = huggingface_model.deploy(
    initial_instance_count=1,      # number of instances
    instance_type='ml.m5.12xlarge' # ec2 instance type
)

Let’s test our endpoint quickly with the sample payload that we want to use for load testing:


import boto3
import json

client = boto3.client('sagemaker-runtime')
content_type = "application/json"
request_body = {'inputs': "I am super happy right now."}
data = json.loads(json.dumps(request_body))
payload = json.dumps(data)
response = client.invoke_endpoint(
    EndpointName=predictor.endpoint_name,
    ContentType=content_type,
    Body=payload)
result = response['Body'].read()
result

Note that we're backing the endpoint with a single Amazon Elastic Compute Cloud (Amazon EC2) instance of type ml.m5.12xlarge, which contains 48 vCPUs and 192 GiB of memory. The number of vCPUs is a good indication of the concurrency the instance can handle. In general, it's recommended to test different instance types to make sure we have an instance whose resources are properly utilized. To see a full list of SageMaker instances and their corresponding compute power for real-time inference, refer to Amazon SageMaker Pricing.

Metrics to track

Before we can get into load testing, it’s essential to understand what metrics to track to understand the performance breakdown of your SageMaker endpoint. CloudWatch is the primary logging tool that SageMaker uses to help you understand the different metrics that describe your endpoint’s performance. You can utilize CloudWatch logs to debug your endpoint invocations; all logging and print statements you have in your inference code are captured here. For more information, refer to How Amazon CloudWatch works.

There are two different types of metrics CloudWatch covers for SageMaker: instance-level and invocation metrics.

Instance-level metrics

The first set of parameters to consider is the instance-level metrics: CPUUtilization and MemoryUtilization (for GPU-based instances, GPUUtilization). For CPUUtilization, you may see percentages above 100% at first in CloudWatch. It’s important to realize for CPUUtilization, the sum of all the CPU cores is being displayed. For example, if the instance behind your endpoint contains 4 vCPUs, this means the range of utilization is up to 400%. MemoryUtilization, on the other hand, is in the range of 0–100%.

Specifically, you can use CPUUtilization to get a deeper understanding of whether you have sufficient or even an excess amount of hardware. If you have an under-utilized instance (less than 30%), you could potentially scale down your instance type. Conversely, if you are around 80–90% utilization, it's better to pick an instance with greater compute or memory. From our tests, we suggest around 60–70% utilization of your hardware.
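If you want to pull these numbers programmatically rather than from the console, the following is a minimal sketch using boto3. The endpoint and variant names are placeholders; instance-level metrics are published under the /aws/sagemaker/Endpoints namespace.

import boto3
from datetime import datetime, timedelta

cloudwatch = boto3.client("cloudwatch")

response = cloudwatch.get_metric_statistics(
    Namespace="/aws/sagemaker/Endpoints",    # instance-level metrics
    MetricName="CPUUtilization",
    Dimensions=[
        {"Name": "EndpointName", "Value": "my-endpoint"},  # placeholder
        {"Name": "VariantName", "Value": "AllTraffic"},
    ],
    StartTime=datetime.utcnow() - timedelta(minutes=30),
    EndTime=datetime.utcnow(),
    Period=60,                               # 1-minute buckets
    Statistics=["Average", "Maximum"],
)
for point in sorted(response["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], point["Average"], point["Maximum"])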

Invocation metrics

As suggested by the name, invocation metrics are where we can track the end-to-end latency of any invocations of your endpoint. You can use the invocation metrics to capture error counts and the types of errors (5xx, 4xx, and so on) that your endpoint may be experiencing. More importantly, you can understand the latency breakdown of your endpoint calls. A lot of this can be captured with the ModelLatency and OverheadLatency metrics, as illustrated in the following diagram.

Latencies

The ModelLatency metric captures the time that inference takes within the model container behind a SageMaker endpoint. Note that the model container also includes any custom inference code or scripts that you have passed for inference. This metric is reported in microseconds as an invocation metric, and generally you can graph a percentile in CloudWatch (p99, p90, and so on) to see if you're meeting your target latency. Several factors can impact model and container latency, such as the following:

  • Custom inference script – Whether you have implemented your own container or used a SageMaker-based container with custom inference handlers, it’s best practice to profile your script to catch any operations that are specifically adding a lot of time to your latency.
  • Communication protocol – Consider REST vs. gRPC connections to the model server within the model container.
  • Model framework optimizations – This is framework specific, for example with TensorFlow, there are a number of environment variables you can tune that are TF Serving specific. Make sure to check what container you’re using and if there are any framework-specific optimizations you can add within the script or as environment variables to inject in the container.

OverheadLatency is measured from the time that SageMaker receives the request until it returns a response to the client, minus the model latency. This part is largely outside of your control and falls under the time taken by SageMaker overheads.

End-to-end latency as a whole depends on a variety of factors and isn't necessarily the sum of ModelLatency plus OverheadLatency. For example, if your client is making the InvokeEndpoint API call over the internet, then from the client's perspective, the end-to-end latency would be internet latency + ModelLatency + OverheadLatency. As such, when load testing your endpoint, it's recommended to focus on the endpoint metrics (ModelLatency, OverheadLatency, and InvocationsPerInstance) to accurately benchmark the SageMaker endpoint itself. Any issues related to end-to-end latency can then be isolated separately.
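As a rough sketch, you can retrieve a latency percentile such as p99 for ModelLatency with CloudWatch's ExtendedStatistics parameter. The endpoint and variant names are placeholders, and remember that the values are reported in microseconds.

import boto3
from datetime import datetime, timedelta

cloudwatch = boto3.client("cloudwatch")

stats = cloudwatch.get_metric_statistics(
    Namespace="AWS/SageMaker",               # invocation metrics
    MetricName="ModelLatency",
    Dimensions=[
        {"Name": "EndpointName", "Value": "my-endpoint"},  # placeholder
        {"Name": "VariantName", "Value": "AllTraffic"},
    ],
    StartTime=datetime.utcnow() - timedelta(hours=1),
    EndTime=datetime.utcnow(),
    Period=300,
    ExtendedStatistics=["p90", "p99"],
)
for point in stats["Datapoints"]:
    # Convert microseconds to milliseconds for readability
    print(point["Timestamp"], {k: v / 1000 for k, v in point["ExtendedStatistics"].items()})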

A few questions to consider for end-to-end latency:

  • Where is the client that is invoking your endpoint?
  • Are there any intermediary layers between your client and the SageMaker runtime?

Auto scaling

We don’t cover auto scaling in this post specifically, but it’s an important consideration in order to provision the correct number of instances based on the workload. Depending on your traffic patterns, you can attach an auto scaling policy to your SageMaker endpoint. There are different scaling options, such as TargetTrackingScaling, SimpleScaling, and StepScaling. This allows your endpoint to scale in and out automatically based on your traffic pattern.

A common option is target tracking, where you can specify a CloudWatch metric or custom metric that you have defined and scale out based on that. A frequent utilization of auto scaling is tracking the InvocationsPerInstance metric. After you have identified a bottleneck at a certain TPS, you can often use that as a metric to scale out to a greater number of instances to be able to handle peak loads of traffic. To get a deeper breakdown of auto scaling SageMaker endpoints, refer to Configuring autoscaling inference endpoints in Amazon SageMaker.
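The following is a hedged sketch of attaching a target tracking policy on InvocationsPerInstance with the Application Auto Scaling API. The endpoint name, variant name, capacity limits, and target value are assumptions you would replace with numbers from your own benchmark; note that the predefined metric counts invocations per instance per minute.

import boto3

autoscaling = boto3.client("application-autoscaling")
resource_id = "endpoint/my-endpoint/variant/AllTraffic"  # placeholder endpoint/variant

# Register the endpoint variant as a scalable target
autoscaling.register_scalable_target(
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    MinCapacity=1,
    MaxCapacity=5,
)

# Scale out when invocations per instance exceed the target
autoscaling.put_scaling_policy(
    PolicyName="invocations-target-tracking",
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 200.0,  # assumed invocations per instance per minute
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
        },
        "ScaleInCooldown": 300,
        "ScaleOutCooldown": 60,
    },
)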

Load testing

Although we utilize Locust to display how we can load test at scale, if you’re trying to right size the instance behind your endpoint, SageMaker Inference Recommender is a more efficient option. With third-party load testing tools, you have to manually deploy endpoints across different instances. With Inference Recommender, you can simply pass an array of the instance types you want to load test against, and SageMaker will spin up jobs for each of these instances.

Locust

For this example, we use Locust, an open-source load testing tool that you can implement using Python. Locust is similar to many other open-source load testing tools, but has a few specific benefits:

  • Easy to set up – As we demonstrate in this post, we'll pass a simple Python script that can easily be refactored for your specific endpoint and payload (see the minimal sketch after this list).
  • Distributed and scalable – Locust is event-based and utilizes gevent under the hood. This is very useful for testing highly concurrent workloads and simulating thousands of concurrent users. You can achieve high TPS with a single process running Locust, but it also has a distributed load generation feature that enables you to scale out to multiple processes and client machines, as we will explore in this post.
  • Locust metrics and UI – Locust also captures end-to-end latency as a metric. This can help supplement your CloudWatch metrics to paint a full picture of your tests. This is all captured in the Locust UI, where you can track concurrent users, workers, and more.
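For illustration only, a bare-bones HTTP locustfile looks like the following. The post's actual script (shown later) uses a custom Boto3 client rather than Locust's built-in HTTP client, but the amount of boilerplate is similarly small.

from locust import HttpUser, task, between

class SimpleUser(HttpUser):
    wait_time = between(0.5, 2)  # seconds each simulated user waits between tasks

    @task
    def index(self):
        # Each task issues one request against the host passed via -H/--host
        self.client.get("/")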

To further understand Locust, check out their documentation.

Amazon EC2 setup

You can set up Locust in whatever environment is compatible for you. For this post, we set up an EC2 instance and install Locust there to conduct our tests. We use a c5.18xlarge EC2 instance. The client-side compute power is also something to consider. When you run out of compute power on the client side, this is often not captured and is mistaken for a SageMaker endpoint error. It's important to place your client in a location with sufficient compute power to handle the load that you are testing at. For our EC2 instance, we use an Ubuntu Deep Learning AMI, but you can use any AMI as long as you can properly set up Locust on the machine. To understand how to launch and connect to your EC2 instance, refer to the tutorial Get started with Amazon EC2 Linux instances.

The Locust UI is accessible via port 8089. We can open this by adjusting our inbound security group rules for the EC2 Instance. We also open up port 22 so we can SSH into the EC2 instance. Consider scoping the source down to the specific IP address you are accessing the EC2 instance from.

Security Groups

After you’re connected to your EC2 instance, we set up a Python virtual environment and install the open-source Locust API via the CLI:

virtualenv venv #venv is the virtual environment name, you can change as you desire
source venv/bin/activate #activate virtual environment
pip install locust

We’re now ready to work with Locust for load testing our endpoint.

Locust testing

All Locust load tests are conducted based off of a Locust file that you provide. This Locust file defines a task for the load test; this is where we define our Boto3 invoke_endpoint API call. See the following code:

# Config is imported from botocore; this excerpt sits inside the client class in locust_script.py
from botocore.config import Config

config = Config(
    retries={
        'max_attempts': 0,  # disable Boto3 retries so every failure is surfaced to Locust
        'mode': 'standard'
    }
)

self.sagemaker_client = boto3.client('sagemaker-runtime', config=config)
self.endpoint_name = host.split('/')[-1]
self.region = region
self.content_type = content_type
self.payload = payload

In the preceding code, adjust your invoke endpoint call parameters to suit your specific model invocation. We call the InvokeEndpoint API with the following piece of code in the Locust file; this is our load test run point. The Locust file we're using is locust_script.py.

def send(self):

    request_meta = {
        "request_type": "InvokeEndpoint",
        "name": "SageMaker",
        "start_time": time.time(),
        "response_length": 0,
        "response": None,
        "context": {},
        "exception": None,
    }
    start_perf_counter = time.perf_counter()

    try:
        response = self.sagemaker_client.invoke_endpoint(
            EndpointName=self.endpoint_name,
            Body=self.payload,
            ContentType=self.content_type
        )
        response_body = response["Body"].read()

Now that we have our Locust script ready, we want to run distributed Locust tests to stress test our single instance to find out how much traffic our instance can handle.

Locust distributed mode is a little more nuanced than a single-process Locust test. In distributed mode, we have one primary and multiple workers. The primary instructs the workers on how to spawn and control the concurrent users that are sending requests. In our distributed.sh script, we see by default that 240 users will be distributed across the 60 workers. Note that the --headless flag in the Locust CLI removes the UI feature of Locust.

#replace with your endpoint name in format https://<<endpoint-name>>
export ENDPOINT_NAME=https://$1

export REGION=us-east-1
export CONTENT_TYPE=application/json
export PAYLOAD='{"inputs": "I am super happy right now."}'
export USERS=240
export WORKERS=60
export RUN_TIME=1m
export LOCUST_UI=false # Use Locust UI

.
.
.

locust -f $SCRIPT -H $ENDPOINT_NAME --master --expect-workers $WORKERS -u $USERS -t $RUN_TIME --csv results &
.
.
.

for (( c=1; c<=$WORKERS; c++ ))
do
locust -f $SCRIPT -H $ENDPOINT_NAME --worker --master-host=localhost &
done

./distributed.sh huggingface-pytorch-inference-2022-10-04-02-46-44-677 #to execute Distributed Locust test

We first run the distributed test on a single instance backing the endpoint. The idea here is we want to fully maximize a single instance to understand the instance count we need to achieve our target TPS while staying within our latency requirements. Note that if you want to access the UI, change the Locust_UI environment variable to True and take the public IP of your EC2 instance and map port 8089 to the URL.

The following screenshot shows our CloudWatch metrics.

CloudWatch Metrics

Eventually, we notice that although we initially achieve a TPS of 200, we start noticing 5xx errors in our EC2 client-side logs, as shown in the following screenshot.

We can also verify this by looking at our instance-level metrics, specifically CPUUtilization.

CloudWatch Metrics

Here we notice CPUUtilization at nearly 4,800%. Our ml.m5.12xlarge instance has 48 vCPUs (48 * 100 = 4,800). This is saturating the entire instance, which also helps explain our 5xx errors. We also see an increase in ModelLatency.

It seems our single instance is being overwhelmed and doesn't have the compute to sustain a load beyond the roughly 200 TPS that we are observing. Our target TPS is 1000, so let's try to increase our instance count to 5. This might have to be even higher in a production setting, because we were already observing errors at 200 TPS after a certain point.

Endpoint settings

We see in both the Locust UI and CloudWatch logs that we have a TPS of nearly 1000 with five instances backing the endpoint.

Locust

CloudWatch Metrics

If you start experiencing errors even with this hardware setup, make sure to monitor CPUUtilization to understand the full picture behind your endpoint hosting. It's crucial to understand your hardware utilization to see if you need to scale up or even down. Sometimes container-level problems lead to 5xx errors, but if CPUUtilization is low, it indicates that it's not your hardware but something at the container or model level that might be leading to these issues (for example, the proper environment variable for the number of workers isn't set). On the other hand, if you notice your instance is getting fully saturated, it's a sign that you need to either increase the current instance fleet or try out a larger instance with a smaller fleet.

Although we increased the instance count to 5 to handle 1,000 TPS, we can see that the ModelLatency metric is still high. This is due to the instances being saturated. In general, we suggest aiming to utilize the instance's resources between 60–70%.

Clean up

After load testing, make sure to clean up any resources you won't use, via the SageMaker console or through the delete_endpoint Boto3 API call. In addition, make sure to stop your EC2 instance, or whatever client setup you have, so you don't incur further charges there as well.
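For reference, a minimal cleanup sketch with Boto3 follows; the endpoint name is a placeholder, and deleting the endpoint config and model as well avoids leaving orphaned resources behind.

import boto3

sm = boto3.client("sagemaker")
endpoint_name = "my-endpoint"  # placeholder

# Look up the attached endpoint config and model before deleting the endpoint
config_name = sm.describe_endpoint(EndpointName=endpoint_name)["EndpointConfigName"]
model_name = sm.describe_endpoint_config(EndpointConfigName=config_name)["ProductionVariants"][0]["ModelName"]

sm.delete_endpoint(EndpointName=endpoint_name)
sm.delete_endpoint_config(EndpointConfigName=config_name)
sm.delete_model(ModelName=model_name)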

Summary

In this post, we described how you can load test your SageMaker real-time endpoint. We also discussed what metrics you should be evaluating when load testing your endpoint to understand your performance breakdown. Make sure to check out SageMaker Inference Recommender to further understand instance right-sizing and more performance optimization techniques.


About the Authors

Marc Karp is a ML Architect with the SageMaker Service team. He focuses on helping customers design, deploy and manage ML workloads at scale. In his spare time, he enjoys traveling and exploring new places.

Ram Vegiraju is a ML Architect with the SageMaker Service team. He focuses on helping customers build and optimize their AI/ML solutions on Amazon SageMaker. In his spare time, he loves traveling and writing.

Get smarter search results with the Amazon Kendra Intelligent Ranking and OpenSearch plugin

If you've had the opportunity to build a search application for unstructured data (wikis, informational websites, self-service help pages, internal documentation, and so on) using open source or commercial-off-the-shelf search engines, then you're probably familiar with the inherent accuracy challenges involved in getting relevant search results. The intended meaning of both query and document can be lost because the search is reduced to matching component keywords and terms. Consequently, while you get results that may contain the right words, they aren't always relevant to the user. You need your search engine to be smarter so it can rank documents based on matching the meaning or semantics of the content to the intention of the user's query.

Amazon Kendra provides a fully managed intelligent search service that automates document ingestion and provides highly accurate search and FAQ results based on content across many data sources. If you haven’t migrated to Amazon Kendra and would like to improve the quality of search results, you can use Amazon Kendra Intelligent Ranking for self-managed OpenSearch on your existing search solution.

We're delighted to introduce the new Amazon Kendra Intelligent Ranking for self-managed OpenSearch, and its companion plugin for the OpenSearch search engine! Now you can easily add intelligent ranking to your OpenSearch document queries, with no need to migrate, duplicate your OpenSearch indexes, or rewrite your applications. The difference between Amazon Kendra Intelligent Ranking for self-managed OpenSearch and the fully managed Amazon Kendra service is that while the former provides powerful semantic re-ranking for search results, the latter provides additional search accuracy improvements and functionality such as incremental learning, question answering, FAQ matching, and built-in connectors. For more information about the fully managed service, please visit the Amazon Kendra service page.

With Amazon Kendra Intelligent Ranking for self-managed OpenSearch, previous results like this:

Query: What is the address of the White House?

Hit1 (best): The president delivered an address to the nation from the White House today.

Hit2: The White House is located at: 1600 Pennsylvania Avenue NW, Washington, DC 20500

become like this: 

Query: What is the address of the White House?

Hit1 (best): The White House is located at: 1600 Pennsylvania Avenue NW, Washington, DC 20500

Hit2: The president delivered an address to the nation from the White House today.

In this post, we show you how to get started with Amazon Kendra Intelligent Ranking for self-managed OpenSearch, and we provide a few examples that demonstrate the power and value of this feature.

Components of Amazon Kendra Intelligent Ranking for self-managed OpenSearch

Prerequisites

For this tutorial, you'll need a bash terminal on Linux, Mac, or Windows Subsystem for Linux, and an AWS account. Hint: consider using an AWS Cloud9 instance or an Amazon Elastic Compute Cloud (Amazon EC2) instance.

You will:

  • Install Docker, if it’s not already installed on your system.
  • Install the latest AWS Command Line Interface (AWS CLI), if it’s not already installed.
  • Create and start OpenSearch containers, with the Amazon Kendra Intelligent Ranking plugin enabled.
  • Create test indexes, and load some sample documents.
  • Run some queries, with and without intelligent ranking, and be suitably impressed by the differences!

Install Docker

If Docker (i.e., docker and docker-compose) is not already installed in your environment, then install it. See Get Docker for directions.

Install the AWS CLI

If you don’t already have the latest version of the AWS CLI installed, then install and configure it now (see AWS CLI Getting Started). Your default AWS user credentials must have administrator access, or ask your AWS administrator to add the following policy to your user permissions:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "VisualEditor0",
            "Effect": "Allow",
            "Action": "kendra-ranking:*",
            "Resource": "*"
        }
    ]
}

Create and start OpenSearch using the Quickstart script

Download the search_processing_kendra_quickstart.sh script:

wget https://raw.githubusercontent.com/msfroh/search-relevance/quickstart-script/helpers/search_processing_kendra_quickstart.sh

Make it executable:

chmod +x ./search_processing_kendra_quickstart.sh

The quickstart script:

  1. Creates an Amazon Kendra Intelligent Ranking Rescore Execution Plan in your AWS account.
  2. Creates Docker containers for OpenSearch and its Dashboards.
  3. Configures OpenSearch to use the Kendra Intelligent Ranking Service.
  4. Starts the OpenSearch services.
  5. Provides helpful guidance for using the service.

Use the --help option to see the command line options:

./search_processing_kendra_quickstart.sh --help

Now, execute the script to automate the Amazon Kendra and OpenSearch setup:

./search_processing_kendra_quickstart.sh --create-execution-plan

That’s it! OpenSearch and OpenSearch Dashboard containers are now up and running.

Read the output message from the quickstart script, and make a note of the directory where you can run the handy docker-compose commands, and the cleanup_resources.sh script.

Try a test query to validate you can connect to your OpenSearch container:

curl -XGET --insecure -u 'admin:admin' 'https://localhost:9200'

Note that if you get the error curl(35):OpenSSL SSL_connect: SSL_ERROR_SYSCALL in connection to localhost:9200, it means that OpenSearch is still coming up. Please wait for a couple of minutes for OpenSearch to be ready and try again.

Create test indexes and load sample documents

The script below is used to create an index and load sample documents. Save it on your computer as bulk_post.sh:

#!/bin/bash
curl -u admin:admin -XPOST https://localhost:9200/_bulk --insecure --data-binary @$1 -H 'Content-Type: application/json'

Save the data files below as tinydocs.jsonl:

{ "create" : { "_index" : "tinydocs",  "_id" : "tdoc1" } }
{"title": "WhiteHouse1", "body": "The White House is located at: 1600 Pennsylvania Avenue NW, Washington, DC 20500"}
{ "create" : { "_index" : "tinydocs",  "_id" : "tdoc2" } }
{"title": "WhiteHouse2", "body": "The president delivered an address to the nation from the White House today."}

And save the data file below as dstinfo.jsonl:

(This data is adapted from a Daylight Saving Time article.)

{ "create" : { "_index" : "dstinfo",  "_id" : "dst1" } }
{"title": "Daylight Saving Time", "body": "Daylight saving time begins on the second Sunday in March at 2 a.m., and clocks are set an hour ahead, according to the Farmers’ Almanac. It lasts for eight months and ends on the first Sunday in November, when clocks are set back an hour at 2 a.m."}
{ "create" : { "_index" : "dstinfo",  "_id" : "dst2" } }
{"title":"History of daylight saving time", "body": "Founding Father Benjamin Franklin is often deemed the brain behind daylight saving time after a letter he wrote in 1784 to a Parisian newspaper, according to the Farmers’ Almanac. But Franklin’s letter suggested people simply change their routines and schedules — not the clocks — to the sun’s cycles. Perhaps surprisingly, daylight saving time had a soft rollout in the United States in 1883 to solve issues with railroad accidents, according to the U.S. Bureau of Transportation Services. It was instituted across the United States in 1918, according to the Congressional Research Service. In 2005, Congress changed it to span from March to November instead of its original timeframe of April to October."}
{ "create" : { "_index" : "dstinfo",  "_id" : "dst3" } }
{"title": "Daylight saving time participants", "body":"The United States is one of more than 70 countries that follow some form of daylight saving time, according to World Data. States can individually decide whether or not to follow it, according to the Farmers’ Almanac. Arizona and Hawaii do not, nor do parts of northeastern British Columbia in Canada. Puerto Rico and the Virgin Islands, both U.S. territories, also don’t follow daylight saving time, according to the Congressional Research Service."}
{ "create" : { "_index" : "dstinfo",  "_id" : "dst4" } }
{"title":"Benefits of daylight saving time", "body":"Those in favor of daylight saving time, whether eight months long or permanent, also vouch that it increases tourism in places such as parks or other public attractions, according to National Geographic. The longer days can keep more people outdoors later in the day."}

Make the script executable:

chmod +x ./bulk_post.sh

Now use the bulk_post.sh script to create indexes and load the data by running the two commands below:

./bulk_post.sh tinydocs.jsonl
./bulk_post.sh dstinfo.jsonl

Run sample queries

Prepare query scripts

OpenSearch queries are defined in JSON using the OpenSearch query domain specific language (DSL). For this post, we use the Linux curl command to send queries to our local OpenSearch server using HTTPS.

To make this easy, we’ve defined two small scripts to construct our query DSL and send it to OpenSearch.

The first script creates a regular OpenSearch text match query on two document fields – title and body. See OpenSearch documentation for more on the multi-match query syntax. We’ve kept the query very simple, but you can experiment later with defining alternate types of queries.

Save the script below as query_nokendra.sh:

#!/bin/bash
curl -XGET "https://localhost:9200/$1/_search?pretty" -u 'admin:admin' --insecure -H 'Content-Type: application/json' -d'
  {
    "query": {
      "multi_match": {
        "fields": ["title", "body"],
        "query": "'"$2"'"
      }
    },
    "size": 20
  }
  '

The second script is similar to the first one, but this time we add a query extension to instruct OpenSearch to invoke the Amazon Kendra Intelligent Ranking plugin as a post-processing step to re-rank the original results using the Amazon Kendra Intelligent Ranking service.

The size property determines how many OpenSearch result documents are sent to Kendra for re-ranking. Here, we specify a maximum of 20 results for re-ranking. Two properties, title_field (optional) and body_field (required), specify the document fields used for intelligent ranking.

Save the script below as query_kendra.sh:

#!/bin/bash
curl -XGET "https://localhost:9200/$1/_search?pretty" -u 'admin:admin' --insecure -H 'Content-Type: application/json' -d'
  {
    "query": {
      "multi_match": {
        "fields": ["title", "body"],
        "query": "'"$2"'"
      }
    },
    "size": 20,
    "ext": {
      "search_configuration": {
        "result_transformer": {
          "kendra_intelligent_ranking": {
            "order": 1,
            "properties": {
              "title_field": "title",
              "body_field": "body"
            }
          }
        }
      }
    }
  }
  '

Make both scripts executable:

chmod +x ./query_*kendra.sh

Run initial queries

Start with a simple query on the tinydocs index, to reproduce the example used in the post introduction.

Use the query_nokendra.sh script to search for the address of the White House:

./query_nokendra.sh tinydocs "what is the address of White House"

You see the results shown below. Observe the order of the two results, which are ranked by the score assigned by the OpenSearch text match query. Although the top scoring result does contain the keywords address and White House, it’s clear the meaning doesn’t match the intent of the question. The keywords match, but the semantics do not.

{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 2,
      "relation" : "eq"
    },
    "max_score" : 1.1619741,
    "hits" : [
      {
        "_index" : "tinydocs",
        "_id" : "tdoc2",
        "_score" : 1.1619741,
        "_source" : {
          "title" : "Whitehouse2",
          "body" : "The president delivered an address to the nation from the White House today."
        }
      },
      {
        "_index" : "tinydocs",
        "_id" : "tdoc1",
        "_score" : 1.0577903,
        "_source" : {
          "title" : "Whitehouse1",
          "body" : "The White House is located at: 1600 Pennsylvania Avenue NW, Washington, DC 20500"
        }
      }
    ]
  }
}

Now let’s run the query with Amazon Kendra Intelligent Ranking, using the query_kendra.sh script:

./query_kendra.sh tinydocs "what is the address of White House"

This time, you see the results in a different order as shown below. The Amazon Kendra Intelligent Ranking service has re-assigned the score values, and assigned a higher score to the document that more closely matches the intention of the query. From a keyword perspective, this is a poorer match because it doesn’t contain the word address; however, from a semantic perspective it’s the better response. Now you see the benefit of using the Amazon Kendra Intelligent Ranking plugin!

{
  "took" : 522,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 2,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 2,
      "relation" : "eq"
    },
    "max_score" : 0.3798389,
    "hits" : [
      {
        "_index" : "tinydocs",
        "_id" : "tdoc1",
        "_score" : 0.3798389,
        "_source" : {
          "title" : "Whitehouse1",
          "body" : "The White House is located at: 1600 Pennsylvania Avenue NW, Washington, DC 20500"
        }
      },
      {
        "_index" : "tinydocs",
        "_id" : "tdoc2",
        "_score" : 0.25906953,
        "_source" : {
          "title" : "Whitehouse2",
          "body" : "The president delivered an address to the nation from the White House today."
        }
      }
    ]
  }
}

Run additional queries and compare search results

Try the dstinfo index now to see how the same concept works with different data and queries. While you can use the scripts query_nokendra.sh and query_kendra.sh to make queries from the command line, let's instead use the OpenSearch Dashboards Compare Search Results plugin to run queries and compare search results.

Paste the local Dashboards URL, http://localhost:5601/app/searchRelevance, into your browser to access the dashboard comparison tool. Use the default credentials: username admin, password admin.

In the search bar, enter: what is daylight saving time?

For the Query 1 and Query 2 index, select dstinfo.

Copy the DSL query below and paste it in the Query panel under Query 1. This is a keyword search query.

{
  "query": { "multi_match": { "fields": ["title", "body"], "query": "%SearchText%" } }, 
  "size": 20
}

Now copy the DSL query below and paste it in the Query panel under Query 2. This query invokes the Amazon Kendra Intelligent Ranking plugin for self-managed OpenSearch to perform semantic re-ranking of the search results.

{
  "query": { "multi_match": { "fields": ["title", "body"], "query": "%SearchText%" } },
  "size": 20,
  "ext": {
    "search_configuration": {
      "result_transformer": {
        "kendra_intelligent_ranking": {
          "order": 1,
          "properties": { "title_field": "title", "body_field": "body" }
        }
      }
    }
  }
}

Choose the Search button to run the queries and observe the search results. In Result 1, the hit ranked last is probably actually the most relevant response to this query. In Result 2, the output from Amazon Kendra Intelligent Ranking has the most relevant answer correctly ranked first.

Now that you have experienced Amazon Kendra Intelligent Ranking for self-managed OpenSearch, experiment with a few queries of your own. Use the data we have already loaded or use the bulk_post.sh script to load your own data.

Explore the Amazon Kendra ranking rescore API

As you’ve seen from this post, the Amazon Kendra Intelligent Ranking plugin for OpenSearch can be conveniently used for semantic re-ranking of your search results. However, if you use a search service that doesn’t support the Amazon Kendra Intelligent Ranking plugin for self-managed OpenSearch, then you can use the Rescore function from the Amazon Kendra Intelligent Ranking API directly.

Try this API using the search results from the example query we used above: what is the address of the White House?

First, find your Execution Plan Id by running:

aws kendra-ranking list-rescore-execution-plans

The JSON below contains the search query, and the two results that were returned by the original OpenSearch match query, with their original OpenSearch scores. Replace {kendra-execution-plan_id} with your Execution Plan Id (from above) and save it as rescore_input.json:

{
    "RescoreExecutionPlanId": "{kendra-execution-plan_id}", 
    "SearchQuery": "what is the address of White House", 
    "Documents": [
        { "Id": "tdoc1",  "Title": "Whitehouse1",  "Body": "The president delivered an address to the nation from the White House today.",  "OriginalScore": 1.4484794 },
        { "Id": "tdoc2",  "Title": "Whitehouse2",  "Body": "The White House is located at: 1600 Pennsylvania Avenue NW, Washington, DC 20500",  "OriginalScore": 1.2401118 }
    ]
}

Run the CLI command below to re-score this list of documents using the Amazon Kendra Intelligent Ranking service:

aws kendra-ranking rescore --cli-input-json "`cat rescore_input.json`"

The output of a successful execution looks like the following.

{
    "ResultItems": [
        {
            "Score": 0.39321771264076233, 
            "DocumentId": "tdoc2"
        }, 
        {
            "Score": 0.328217089176178, 
            "DocumentId": "tdoc1"
        }
    ], 
    "RescoreId": "991459b0-ca9e-4ba8-b0b3-1e8e01f2ad15"
}

As expected, the document tdoc2 (containing the text body "The White House is located at: 1600 Pennsylvania Avenue NW, Washington, DC 20500") now has the higher ranking, as it's the semantically more relevant response for the query. The ResultItems list in the output contains each input DocumentId with its new Score, ranked in descending order of Score.
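If you prefer to call the API from Python rather than the AWS CLI, a minimal boto3 sketch looks like the following, assuming the same execution plan ID and documents used in the rescore_input.json example above.

import boto3

kendra_ranking = boto3.client("kendra-ranking")

response = kendra_ranking.rescore(
    RescoreExecutionPlanId="{kendra-execution-plan_id}",  # replace with your plan ID
    SearchQuery="what is the address of White House",
    Documents=[
        {"Id": "tdoc1", "Title": "Whitehouse1",
         "Body": "The president delivered an address to the nation from the White House today.",
         "OriginalScore": 1.4484794},
        {"Id": "tdoc2", "Title": "Whitehouse2",
         "Body": "The White House is located at: 1600 Pennsylvania Avenue NW, Washington, DC 20500",
         "OriginalScore": 1.2401118},
    ],
)
for item in response["ResultItems"]:
    print(item["DocumentId"], item["Score"])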

Clean up

When you're done experimenting, shut down and remove your Docker containers and Rescore Execution Plan by running the cleanup_resources.sh script created by the Quickstart script, for example:

./opensearch-kendra-ranking-docker.xxxx/cleanup_resources.sh

Conclusion

In this post, we showed you how to use the Amazon Kendra Intelligent Ranking plugin for self-managed OpenSearch to easily add intelligent ranking to your OpenSearch document queries and dramatically improve the relevance ranking of the results, while using your existing OpenSearch search engine deployments.

You can also use the Amazon Kendra Intelligent Ranking Rescore API directly to intelligently re-score and rank results from your own applications.

Read the Amazon Kendra Intelligent Ranking for self-managed OpenSearch documentation to learn more about this feature, and start planning to apply it in your production applications.


About the Authors

Abhinav JawadekarAbhinav Jawadekar is a Principal Solutions Architect focused on Amazon Kendra in the AI/ML language services team at AWS. Abhinav works with AWS customers and partners to help them build intelligent search solutions on AWS.

Bob StrahanBob Strahan is a Principal Solutions Architect in the AWS Language AI Services team.

Model hosting patterns in Amazon SageMaker, Part 1: Common design patterns for building ML applications on Amazon SageMaker

Machine learning (ML) applications are complex to deploy and often require the ability to hyper-scale, and have ultra-low latency requirements and stringent cost budgets. Use cases such as fraud detection, product recommendations, and traffic prediction are examples where milliseconds matter and are critical for business success. Strict service level agreements (SLAs) need to be met, and a typical request may require multiple steps such as preprocessing, data transformation, feature engineering, model selection logic, model aggregation, and postprocessing.

Deploying ML models at scale with optimized cost and compute efficiency can be a daunting and cumbersome task. Each model has its own merits and dependencies based on external data sources as well as the runtime environment, such as the CPU/GPU power of the underlying compute resources. An application may require multiple ML models to serve a single inference request. In certain scenarios, a request may flow across multiple models. There is no one-size-fits-all approach, and it's important for ML practitioners to look for tried-and-proven methods to address recurring ML hosting challenges. This has led to the evolution of design patterns for ML model hosting.

In this post, we explore common design patterns for building ML applications on Amazon SageMaker.

Design patterns for building ML applications

Let’s look at the following design patterns to use for hosting ML applications.

Single-model based ML applications

This is a great option when your ML use case requires a single model to serve a request. The model is deployed on a dedicated compute infrastructure with the ability to scale based on the input traffic. This option is also ideal when the client application has a low-latency (in the order of milliseconds or seconds) inference requirement.

Multi-model based ML applications

To make hosting more cost-effective, this design pattern allows you to host multiple models on the same tenant infrastructure. Multiple ML models can share the host or container resources, including caching the most-used ML models in memory, resulting in better utilization of memory and compute resources. Depending on the types of models you choose to deploy, model co-hosting may use the following methods:

  • Multi-model hosting – This option allows you to host multiple models using a shared serving container on a single endpoint. This feature is ideal when you have a large number of similar models that you can serve through a shared serving container and don't need to access all the models at the same time (see the sketch after this list).
  • Multi-container hosting – This option is ideal when you have multiple models running on different serving stacks with similar resource needs, and when individual models don’t have sufficient traffic to utilize the full capacity of the endpoint instances. Multi-container hosting allows you to deploy multiple containers that use different models or frameworks on a single endpoint. The models can be completely heterogenous, with their own independent serving stack.
  • Model ensembles – In many production use cases, there are often several upstream models feeding inputs to a given downstream model. This is where ensembles are useful. Ensemble patterns involve mixing output from one or more base models in order to reduce the generalization error of the prediction. The base models can be diverse and trained by different algorithms. Model ensembles can outperform single models because the prediction error decreases when the ensemble approach is used.

The following are common use cases of ensemble patterns and their corresponding design pattern diagrams:

  • Scatter-gather – In a scatter-gather pattern, a request for inference is routed to a number of models. An aggregator is then used to collect the responses and distill them into a single inference response. For example, an image classification use case may use three different models to perform the task. The scatter-gather pattern allows you to combine results from inferences run on the three models and pick the most probable classification (a minimal sketch of this pattern follows this list).

  • Model aggregate – In an aggregation pattern, outputs from multiple models are averaged. For classification models, multiple models’ predictions are evaluated to determine the class that received the most votes and is treated as the final output of the ensemble. For example, in a two-class classification problem to classify a set of fruits as oranges or apples, if two models vote for an orange and one model votes for an apple, then the aggregated output will be an orange. Aggregation helps combat inaccuracy in individual models and makes the output more accurate.

  • Dynamic selection – Another pattern for ensemble models is to dynamically perform model selection for the given input attributes. For example, in a given input of images of fruits, if the input contains an orange, model A will be used because it’s specialized for oranges. If the input contains an apple, model B will be used because it’s specialized for apples.

  • Serial inference ML applications – With a serial inference pattern, also known as an inference pipeline, incoming data is preprocessed before a pre-trained ML model is invoked to generate inferences. In some cases, the generated inferences need further postprocessing so that they can be easily consumed by downstream applications. An inference pipeline allows you to reuse the same preprocessing code used during model training to process the inference request data used for predictions.

  • Business logic – Productionizing ML always involves business logic. Business logic patterns cover everything that’s needed to perform an ML task that is not ML model inference. This includes, for example, loading the model from Amazon Simple Storage Service (Amazon S3), performing database lookups to validate the input, and obtaining pre-computed features from the feature store. After these business logic steps are complete, the inputs are passed through to the ML models.
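
To make the scatter-gather and aggregation ideas concrete, here is a minimal, framework-agnostic Python sketch. The model callables, payload, and class labels are illustrative placeholders; in a real deployment, each callable would wrap a call to a hosted model endpoint.

```python
from collections import Counter
from typing import Callable, List

# Hypothetical per-model scorers; in practice each could wrap a call to a
# deployed endpoint (for example, via the SageMaker runtime client).
ModelFn = Callable[[bytes], str]


def scatter_gather(payload: bytes, models: List[ModelFn]) -> str:
    """Scatter the request to every model, then gather and aggregate."""
    # Scatter: invoke each model with the same payload.
    predictions = [model(payload) for model in models]
    # Gather/aggregate: majority vote across the individual predictions.
    votes = Counter(predictions)
    winner, _ = votes.most_common(1)[0]
    return winner


# Illustrative stand-ins for three classifiers (not real models).
model_a = lambda payload: "apple"
model_b = lambda payload: "orange"
model_c = lambda payload: "orange"

print(scatter_gather(b"image-bytes", [model_a, model_b, model_c]))  # -> "orange"
```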

ML inference options

For model deployment, it’s important to work backward from your use case. What is the frequency of the prediction? Do you expect live traffic to your application and real-time response to your clients? Do you have many models trained for different subsets of data for the same use case? Does the prediction traffic fluctuate? Is latency of inference a concern? Based on these details, all the preceding design patterns can be implemented using the following deployment options:

  • Real-time inference – Real-time inference is ideal for inference workloads where you have real-time, interactive, low-latency requirements. Real-time ML inference workloads may include a single-model based ML application, where an application requires only one ML model to serve a single request, or a multi-model based ML application, where an application requires multiple ML models to serve a single request.
  • Near-real-time (asynchronous) inference – With near-real-time inference, you can queue incoming requests. This can be utilized for running inference on inputs that are hundreds of MBs. It operates in near-real time and lets users submit the input for inference and read the output from the endpoint from an S3 bucket. It is especially handy for NLP and computer vision use cases with large payloads that require longer preprocessing times.
  • Batch inference – Batch inference can be utilized for running inference offline on a large dataset. Because it runs offline, batch inference doesn’t offer the lowest latency. Here, the inference request is processed with either a scheduled or event-based trigger of a batch inference job.
  • Serverless inference – Serverless inference is ideal for workloads that have idle periods between traffic spurts and can tolerate a few extra seconds of latency (cold start) for the first invocation after an idle period. For example, a chatbot service or an application to process forms or analyze data from documents. In this case, you might want an online inference option that is able to automatically provision and scale compute capacity based on the volume of inference requests. And during idle time, it should be able to turn off compute capacity completely so that you’re not charged. Serverless inference takes away the undifferentiated heavy lifting of selecting and managing servers by automatically launching compute resources and scaling them in and out depending on traffic.

Use fitness functions to select the right ML inference option

Deciding on the right hosting option is important because it impacts the end-user experience delivered by your applications. For this purpose, we’re borrowing the concept of fitness functions, which was coined by Neal Ford and his colleagues from AWS Partner ThoughtWorks in their work Building Evolutionary Architectures. Fitness functions provide a prescriptive assessment of various hosting options based on the customer’s objectives. Fitness functions help you obtain the necessary data to allow for the planned evolution of your architecture. They set measurable values to assess how close your solution is to achieving your set goals. Fitness functions can and should be adapted as the architecture evolves to guide a desired change process. This provides architects with a tool to guide their teams while maintaining team autonomy.

There are five main fitness functions that customers care about when it comes to selecting the right ML inference option for hosting their ML models and applications.

Fitness function Description
Cost

Deploying and maintaining an ML model and ML application on a scalable framework is a critical business process, and the costs may vary greatly depending on choices made about model hosting infrastructure, hosting option, ML frameworks, ML model characteristics, optimizations, scaling policy, and more. The workloads must utilize the hardware infrastructure optimally to ensure that the cost remains in check.

This fitness function specifically refers to the infrastructure cost, which is a part of overall total cost of ownership (TCO). The infrastructure costs are the combined costs for storage, network, and compute. It’s also critical to understand other components of TCO, including operational costs and security and compliance costs.

Operational costs are the combined costs of operating, monitoring, and maintaining the ML infrastructure. The operational costs are calculated as the number of engineers required based on each scenario and the annual salary of engineers, aggregated over a specific period.

Customers using self-managed ML solutions on Amazon Elastic Compute Cloud (Amazon EC2), Amazon Elastic Container Service (Amazon ECS), and Amazon Elastic Kubernetes Service (Amazon EKS) need to build operational tooling themselves.

Customers using SageMaker incur significantly less TCO. SageMaker inference is a fully managed service and provides capabilities out of the box for deploying ML models for inference. You don’t need to provision instances, monitor instance health, manage security updates or patches, emit operational metrics, or build monitoring for your ML inference workloads. It has built-in capabilities to ensure high availability and resiliency. SageMaker supports security with end-to-end encryption at rest and in transit, including encryption of the root volume and Amazon Elastic Block Store (Amazon EBS) volume, Amazon Virtual Private Cloud (Amazon VPC) support, AWS PrivateLink, customer-managed keys, AWS Identity and Access Management (IAM) fine-grained access control, AWS CloudTrail audits, internode encryption for training, tag-based access control, network isolation, and Interactive Application Proxy.

All of these security features are provided out of the box in SageMaker, and can save businesses tens of development months of engineering effort over a 3-year period. SageMaker is a HIPAA-eligible service, and is certified under PCI, SOC, GDPR, and ISO. SageMaker also supports FIPS endpoints. For more information about TCO, refer to The total cost of ownership of Amazon SageMaker.

Inference latency – Many ML models and applications are latency critical, in which the inference latency must be within the bounds specified by a service level objective. Inference latency depends upon a multitude of factors, including model size and complexity, hardware platform, software environment, and network architecture. For example, larger and more complex models can take longer to run inference.
Throughput (transactions per second) – For model inference, optimizing throughput is crucial for performance tuning and achieving the business objective of the ML application. As we continue to advance rapidly in all aspects of ML, including low-level implementations of mathematical operations in chip design, hardware-specific libraries play a greater role in performance optimization. Various factors such as payload size, network hops, nature of hops, model graph features, operators in the model, and the CPU, GPU, and memory profile of the model hosting instances affect the throughput of the ML model.
Scaling configuration complexity – It’s crucial for the ML models or applications to run on a scalable framework that can handle the demand of varying traffic. It also allows for the maximum utilization of CPU and GPU resources and prevents over-provisioning of compute resources.
Expected traffic pattern – ML models or applications can have different traffic patterns, ranging from continuous real-time live traffic to periodic peaks of thousands of requests per second, and from infrequent, unpredictable request patterns to offline batch requests on larger datasets. Working backward from the expected traffic pattern is recommended in order to select the right hosting option for your ML model.

Deploying models with SageMaker

SageMaker is a fully managed AWS service that provides every developer and data scientist with the ability to quickly build, train, and deploy ML models at scale. With SageMaker inference, you can deploy your ML models on hosted endpoints and get inference results. SageMaker provides a wide selection of hardware and features to meet your workload requirements, allowing you to select from over 70 instance types with hardware acceleration. If you’re not sure which instance type is optimal for your workload, SageMaker can also recommend one using a feature called SageMaker Inference Recommender.

You can choose deployment options to best meet your use cases, such as real time inference, asynchronous, batch, and even serverless endpoints. In addition, SageMaker offers various deployment strategies such as canary, blue/green, shadow, and A/B testing for model deployment, along with cost-effective deployment with multi-model, multi-container endpoints, and elastic scaling. With SageMaker inference, you can view the performance metrics for your endpoints in Amazon CloudWatch, automatically scale endpoints based on traffic, and update your models in production without losing any availability.

SageMaker offers four options to deploy your model so you can start making predictions:

  • Real-time inference – This is suitable for workloads with millisecond latency requirements, payload sizes up to 6 MB, and processing times of up to 60 seconds.
  • Batch transform – This is ideal for offline predictions on large batches of data that are available up-front.
  • Asynchronous inference – This is designed for workloads that don’t have sub-second latency requirements, payload sizes up to 1 GB, and processing times of up to 15 minutes.
  • Serverless inference – With serverless inference, you can quickly deploy ML models for inference without having to configure or manage the underlying infrastructure. Additionally, you pay only for the compute capacity used to process inference requests, which is ideal for intermittent workloads.

The following diagram can help you understand the SageMaker hosting model deployment options along with the associated fitness function evaluations.

Let’s explore each of the deployment options in more detail.

Real-time inference in SageMaker

SageMaker real-time inference is recommended if you have sustained traffic and need lower and consistent latency for your requests with payload sizes up to 6 MB, and processing times of up to 60 seconds. You deploy your model to SageMaker hosting services and get an endpoint that can be used for inference. These endpoints are fully managed and support auto scaling. Real-time inference is popular for use cases where you expect a low-latency, synchronous response with predictable traffic patterns, such as personalized recommendations for products and services or transactional fraud detection use cases.

Typically, a client application sends requests to the SageMaker HTTPS endpoint to obtain inferences from a deployed model. You can deploy multiple variants of a model to the same SageMaker HTTPS endpoint. This is useful for testing variations of a model in production. Auto scaling allows you to dynamically adjust the number of instances provisioned for a model in response to changes in your workload.
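
To make this flow concrete, the following is a minimal sketch that uses the SageMaker Python SDK to deploy a real-time endpoint and the boto3 runtime client to invoke it. The S3 path, IAM role ARN, entry point script, framework versions, instance type, and endpoint name are placeholders to replace with your own values.

```python
import json

import boto3
from sagemaker.pytorch import PyTorchModel  # other framework Model classes work similarly

role = "arn:aws:iam::111122223333:role/SageMakerExecutionRole"  # placeholder role ARN

# Wrap a trained model artifact (placeholder S3 path) in a framework serving container.
model = PyTorchModel(
    model_data="s3://my-bucket/models/model.tar.gz",  # placeholder
    role=role,
    entry_point="inference.py",  # placeholder inference handler
    framework_version="1.13",
    py_version="py39",
)

# Deploy a provisioned real-time endpoint (fully managed, auto scaling capable).
model.deploy(
    initial_instance_count=1,
    instance_type="ml.c5.xlarge",
    endpoint_name="my-realtime-endpoint",  # placeholder
)

# Synchronous, low-latency invocation through the SageMaker runtime.
runtime = boto3.client("sagemaker-runtime")
response = runtime.invoke_endpoint(
    EndpointName="my-realtime-endpoint",
    ContentType="application/json",
    Body=json.dumps({"inputs": "example payload"}),
)
print(response["Body"].read())
```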

The following table provides guidance on evaluating SageMaker real-time inference based on the fitness functions.

Fitness function Description
Cost

Real-time endpoints offer synchronous responses to inference requests. Because the endpoint is always running and available to provide real-time synchronous inference responses, you pay for using the instance. Costs can quickly add up when you deploy multiple endpoints, especially if the endpoints don’t fully utilize the underlying instances. Choosing the right instance for your model helps ensure you have the most performant instance at the lowest cost for your models. Auto scaling is recommended to dynamically adjust the capacity depending on traffic to maintain steady and predictable performance at the lowest possible cost.

SageMaker extends access to Graviton2 and Graviton3-based ML instance families. AWS Graviton processors are custom built by Amazon Web Services using 64-bit Arm Neoverse cores to deliver the best price performance for your cloud workloads running on Amazon EC2. With Graviton-based instances, you have more options for optimizing the cost and performance when deploying your ML models on SageMaker.

SageMaker also supports Inf1 instances, providing high performance and cost-effective ML inference. With 1–16 AWS Inferentia chips per instance, Inf1 instances can scale in performance and deliver up to three times higher throughput and up to 50% lower cost per inference compared to the AWS GPU-based instances. To use Inf1 instances in SageMaker, you can compile your trained models using Amazon SageMaker Neo and select the Inf1 instances to deploy the compiled model on SageMaker.

You can also explore Savings Plans for SageMaker to benefit from cost savings up to 64% compared to the on-demand price.

When you create an endpoint, SageMaker attaches an EBS storage volume to each ML compute instance that hosts the endpoint. The size of the storage volume depends on the instance type. Additional cost for real-time endpoints includes cost of GB-month of provisioned storage, plus GB data processed in and GB data processed out of the endpoint instance.

Inference latency – Real-time inference is ideal when you need a persistent endpoint with millisecond latency requirements. It supports payload sizes up to 6 MB, and processing times of up to 60 seconds.
Throughput

An ideal value of inference throughput depends on factors such as the model, model input size, batch size, and endpoint instance type. As a best practice, review CloudWatch metrics for input requests and resource utilization, and select the appropriate instance type to achieve optimal throughput.

A business application can be either throughput optimized or latency optimized. For example, dynamic batching can help increase the throughput for latency-sensitive apps using real-time inference. However, there are limits to the batch size: inference latency grows as you increase the batch size to improve throughput. Therefore, real-time inference is an ideal option for latency-sensitive applications. SageMaker provides asynchronous inference and batch transform options, which are optimized to give higher throughput than real-time inference if the business application can tolerate slightly higher latency.

Scaling configuration complexity

SageMaker real-time endpoints support auto scaling out of the box. When the workload increases, auto scaling brings more instances online. When the workload decreases, auto scaling removes unnecessary instances, helping you reduce your compute cost. Without auto scaling, you need to provision for peak traffic or risk model unavailability. Unless the traffic to your model is steady throughout the day, there will be excess unused capacity. This leads to low utilization and wasted resources.

With SageMaker, you can configure different scaling options based on the expected traffic pattern. Simple scaling or target tracking scaling is ideal when you want to scale based on a specific CloudWatch metric. You can do this by choosing a specific metric and setting threshold values. The recommended metrics for this option are average CPUUtilization or SageMakerVariantInvocationsPerInstance (a minimal configuration sketch follows this table).

If you require advanced configuration, you can set a step scaling policy to dynamically adjust the number of instances to scale based on the size of the alarm breach. This helps you configure a more aggressive response when demand reaches a certain level.

You can use a scheduled scaling option when you know that the demand follows a particular schedule in the day, week, month, or year. This lets you specify a one-time schedule, a recurring schedule, or cron expressions, along with start and end times, which form the boundaries of when the auto scaling action starts and stops.

For more details, refer to Configuring autoscaling inference endpoints in Amazon SageMaker and Load test and optimize an Amazon SageMaker endpoint using automatic scaling.

Traffic pattern – Real-time inference is ideal for workloads with a continual or regular traffic pattern.
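
The scaling configuration described in the preceding table is set up through Application Auto Scaling. The following is a minimal sketch of a target tracking policy on the SageMakerVariantInvocationsPerInstance metric; the endpoint name, variant name, capacity bounds, cooldowns, and target value are assumptions to tune for your workload.

```python
import boto3

autoscaling = boto3.client("application-autoscaling")

endpoint_name = "my-realtime-endpoint"  # placeholder
variant_name = "AllTraffic"             # default production variant name
resource_id = f"endpoint/{endpoint_name}/variant/{variant_name}"

# Register the endpoint variant as a scalable target (1 to 4 instances).
autoscaling.register_scalable_target(
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    MinCapacity=1,
    MaxCapacity=4,
)

# Target tracking on invocations per instance, as recommended in the table above.
autoscaling.put_scaling_policy(
    PolicyName="invocations-target-tracking",
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 70.0,  # invocations per minute per instance (workload specific)
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
        },
        "ScaleInCooldown": 300,
        "ScaleOutCooldown": 60,
    },
)
```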

Asynchronous inference in SageMaker

SageMaker asynchronous inference is a new capability in SageMaker that queues incoming requests and processes them asynchronously. This option is ideal for requests with large payload sizes (up to 1 GB), long processing times (up to 15 minutes), and near-real-time latency requirements. Example workloads for asynchronous inference include healthcare companies processing high-resolution biomedical images or videos like echocardiograms to detect anomalies. These applications receive bursts of incoming traffic at different times in the day and require near-real-time processing at low cost. Processing times for these requests can range in the order of minutes, eliminating the need to run real-time inference. Instead, input payloads can be processed asynchronously from an object store like Amazon S3 with automatic queuing and a predefined concurrency threshold. Upon processing, SageMaker places the inference response in the previously returned Amazon S3 location. You can optionally choose to receive success or error notifications via Amazon Simple Notification Service (Amazon SNS).
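
The following is a minimal sketch of deploying and invoking an asynchronous endpoint with the SageMaker Python SDK and boto3; the container image URI, S3 locations, role ARN, and endpoint name are placeholders.

```python
import boto3
from sagemaker.async_inference import AsyncInferenceConfig
from sagemaker.model import Model

role = "arn:aws:iam::111122223333:role/SageMakerExecutionRole"  # placeholder

model = Model(
    image_uri="<framework-or-byoc-image-uri>",        # placeholder
    model_data="s3://my-bucket/models/model.tar.gz",  # placeholder
    role=role,
)

# Responses are written to S3; an SNS topic for notifications is optional.
async_config = AsyncInferenceConfig(
    output_path="s3://my-bucket/async-output/",
    max_concurrent_invocations_per_instance=4,
)

model.deploy(
    initial_instance_count=1,
    instance_type="ml.m5.xlarge",
    async_inference_config=async_config,
    endpoint_name="my-async-endpoint",  # placeholder
)

# The request payload is staged in S3; the call returns immediately with the
# S3 location where the inference result will be placed.
runtime = boto3.client("sagemaker-runtime")
response = runtime.invoke_endpoint_async(
    EndpointName="my-async-endpoint",
    InputLocation="s3://my-bucket/async-input/payload.json",
    ContentType="application/json",
)
print(response["OutputLocation"])
```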

The following table provides guidance on evaluating SageMaker asynchronous inference based on the fitness functions.

Fitness function Description
Cost – Asynchronous inference is a great choice for cost-sensitive workloads with large payloads and burst traffic. Asynchronous inference enables you to save on costs by auto scaling the instance count to zero when there are no requests to process, so you only pay when your endpoint is processing requests. Requests that are received when there are zero instances are queued for processing after the endpoint scales up.
Inference latency – Asynchronous inference is ideal for near-real-time latency requirements. The requests are placed in a queue and processed as soon as the compute is available. This typically results in tens of milliseconds in latency.
Throughput – Asynchronous inference is ideal for non-latency sensitive use cases, because applications don’t have to compromise on throughput. Requests aren’t dropped during traffic spikes because the asynchronous inference endpoint queues up requests rather than dropping them.
Scaling configuration complexity

SageMaker supports auto scaling for asynchronous endpoints. Unlike real-time hosted endpoints, asynchronous inference endpoints support scaling down the instance count to zero by setting the minimum capacity to zero. For asynchronous endpoints, SageMaker strongly recommends that you create a target-tracking scaling policy configuration for a deployed model (variant).

For use cases that can tolerate a cold start penalty of a few minutes, you can optionally scale down the endpoint instance count to zero when there are no outstanding requests and scale back up as new requests arrive so that you only pay for the duration that the endpoints are actively processing requests.

Traffic pattern – Asynchronous endpoints queue incoming requests and process them asynchronously. They’re a good option for intermittent or infrequent traffic patterns.

Batch inference in SageMaker

SageMaker batch transform is ideal for offline predictions on large batches of data that are available up-front. The batch transform feature is a high-performance and high-throughput method for transforming data and generating inferences. It’s ideal for scenarios where you’re dealing with large batches of data, don’t need subsecond latency, or need to both preprocess and transform the training data. Customers in certain domains such as advertising and marketing or healthcare often need to make offline predictions on hyperscale datasets where high throughput is often the objective of the use case and latency isn’t a concern.

When a batch transform job starts, SageMaker initializes compute instances and distributes the inference workload between them. It releases the resources when the jobs are complete, so you pay only for what was used during the run of your job. When the job is complete, SageMaker saves the prediction results in an S3 bucket that you specify. Batch inference tasks are usually good candidates for horizontal scaling. Each worker within a cluster can operate on a different subset of data without the need to exchange information with other workers. AWS offers multiple storage and compute options that enable horizontal scaling. Example workloads for SageMaker batch transform include offline applications such as banking applications for predicting customer churn where an offline job can be scheduled to run periodically.
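
A minimal batch transform sketch with the SageMaker Python SDK follows; the container image, model artifact, S3 input/output prefixes, role ARN, and tuning values such as the payload size and concurrency are placeholders to adapt to your job.

```python
from sagemaker.model import Model

model = Model(
    image_uri="<framework-or-byoc-image-uri>",        # placeholder
    model_data="s3://my-bucket/models/model.tar.gz",  # placeholder
    role="arn:aws:iam::111122223333:role/SageMakerExecutionRole",
)

# Create a transformer; SageMaker provisions instances only for the duration of the job.
transformer = model.transformer(
    instance_count=2,
    instance_type="ml.m5.xlarge",
    output_path="s3://my-bucket/batch-output/",
    strategy="MultiRecord",        # batch multiple records per request
    max_payload=6,                 # MB per request
    max_concurrent_transforms=2,   # ideally equal to the number of compute workers
)

# Run offline inference over the whole dataset; instances are released when done.
transformer.transform(
    data="s3://my-bucket/batch-input/",
    content_type="text/csv",
    split_type="Line",
)
transformer.wait()
```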

The following table provides guidance on evaluating SageMaker batch transform based on the fitness functions.

Fitness function Description
Cost – SageMaker batch transform allows you to run predictions on large or small batch datasets. You are charged for the instance type you choose, based on the duration of use. SageMaker manages the provisioning of resources at the start of the job and releases them when the job is complete. There is no additional data processing cost.
Inference latency – You can use event-based or scheduled invocation. Latency could vary depending on the size of inference data, job concurrency, complexity of the model, and compute instance capacity.
Throughput

Batch transform jobs can be done on a range of datasets, from petabytes of data to very small datasets. There is no need to resize larger datasets into small chunks of data. You can speed up batch transform jobs by using optimal values for parameters such as MaxPayloadInMB, MaxConcurrentTransforms, or BatchStrategy. The ideal value for MaxConcurrentTransforms is equal to the number of compute workers in the batch transform job.

Batch processing can increase throughput and optimize your resources because it helps complete a larger number of inferences in a certain amount of time at the expense of latency. To optimize model deployment for higher throughput, the general guideline is to increase the batch size until throughput decreases.

Scaling configuration complexity – SageMaker batch transform is used for offline inference that is not latency sensitive.
Traffic pattern – For offline inference, a batch transform job is scheduled or started using an event-based trigger.

Serverless inference in SageMaker

SageMaker serverless inference allows you to deploy ML models for inference without having to configure or manage the underlying infrastructure. Based on the volume of inference requests your model receives, SageMaker serverless inference automatically provisions, scales, and turns off compute capacity. As a result, you pay for only the compute time to run your inference code and the amount of data processed, not for idle time. You can use SageMaker’s built-in algorithms and ML framework-serving containers to deploy your model to a serverless inference endpoint or choose to bring your own container. If traffic becomes predictable and stable, you can easily update from a serverless inference endpoint to a SageMaker real-time endpoint without the need to make changes to your container image. With serverless inference, you also benefit from other SageMaker features, including built-in metrics such as invocation count, faults, latency, host metrics, and errors in CloudWatch.
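
As a sketch of this option, the following deploys a model to a serverless endpoint with the SageMaker Python SDK; the container image, model artifact, role ARN, memory size, concurrency limit, and endpoint name are assumed values.

```python
from sagemaker.model import Model
from sagemaker.serverless import ServerlessInferenceConfig

model = Model(
    image_uri="<framework-or-byoc-image-uri>",        # placeholder
    model_data="s3://my-bucket/models/model.tar.gz",  # placeholder
    role="arn:aws:iam::111122223333:role/SageMakerExecutionRole",
)

# Memory size should be at least as large as the model; concurrency caps parallel invocations.
serverless_config = ServerlessInferenceConfig(
    memory_size_in_mb=2048,
    max_concurrency=5,
)

# No instance type or count: SageMaker provisions and scales compute on demand.
model.deploy(
    serverless_inference_config=serverless_config,
    endpoint_name="my-serverless-endpoint",  # placeholder
)
```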

The following table provides guidance on evaluating SageMaker serverless inference based on the fitness functions.

Fitness function Description
Cost – With a pay-as-you-run model, serverless inference is a cost-effective option if you have infrequent or intermittent traffic patterns. You pay only for the duration for which the endpoint processes the request, and therefore can save costs if the traffic pattern is intermittent.
Inference latency

Serverless endpoints offer low inference latency (in the order of milliseconds to seconds), with the ability to scale instantly from tens to thousands of inferences within seconds based on the usage patterns, making it ideal for ML applications with intermittent or unpredictable traffic.

Because serverless endpoints provision compute resources on demand, your endpoint may experience a few extra seconds of latency (cold start) for the first invocation after an idle period. The cold start time depends on your model size, how long it takes to download your model, and the startup time of your container.

Throughput – When configuring your serverless endpoint, you can specify the memory size and maximum number of concurrent invocations. SageMaker serverless inference auto-assigns compute resources proportional to the memory you select. If you choose a larger memory size, your container has access to more vCPUs. As a general rule, the memory size should be at least as large as your model size. The memory sizes you can choose are 1024 MB, 2048 MB, 3072 MB, 4096 MB, 5120 MB, and 6144 MB. Regardless of the memory size you choose, serverless endpoints have 5 GB of ephemeral disk storage available.
Scaling configuration complexity – Serverless endpoints automatically launch compute resources and scale them in and out depending on traffic, eliminating the need to choose instance types or manage scaling policies. This takes away the undifferentiated heavy lifting of selecting and managing servers.
Traffic pattern – Serverless inference is ideal for workloads with infrequent or intermittent traffic patterns.

Model hosting design patterns in SageMaker

SageMaker inference endpoints use Docker containers for hosting ML models. Containers allow you to package software into standardized units that run consistently on any platform that supports Docker. This ensures portability across platforms, immutable infrastructure deployments, and easier change management and CI/CD implementations. SageMaker provides pre-built managed containers for popular frameworks such as Apache MXNet, TensorFlow, PyTorch, Sklearn, and Hugging Face. For a full list of available SageMaker container images, refer to Available Deep Learning Containers Images. If SageMaker doesn’t provide a supported container for your framework, you can build your own container (BYOC) and push a custom image with the dependencies that are necessary for your model.

To deploy a model on SageMaker, you need a container (SageMaker managed framework containers or BYOC) and a compute instance to host the container. SageMaker supports multiple advanced options for common ML model hosting design patterns where models can be hosted on a single container or co-hosted on a shared container.

A real-time ML application may use a single model or multiple models to serve a single prediction request. The following diagram shows various inference scenarios for an ML application.

Let’s explore a suitable SageMaker hosting option for each of the preceding inference scenarios. You can refer to the fitness functions to assess if it’s the right option for the given use case.

Hosting a single-model based ML application

There are several options to host single-model based ML applications using SageMaker hosting services depending on the deployment scenario.

Single-model endpoint

SageMaker single-model endpoints allow you to host one model on a container hosted on dedicated instances for low latency and high throughput. These endpoints are fully managed and support auto scaling. You can configure the single-model endpoint as a provisioned endpoint where you pass in endpoint infrastructure configuration such as the instance type and count, or a serverless endpoint where SageMaker automatically launches compute resources and scales them in and out depending on traffic, eliminating the need to choose instance types or manage scaling policies. Serverless endpoints are for applications with intermittent or unpredictable traffic.

The following diagram shows single-model endpoint inference scenarios.

The following table provides guidance on evaluating fitness functions for a provisioned single-model endpoint. For serverless endpoint fitness function evaluations, refer to the serverless endpoint section in this post.

Fitness function Description
Cost – You are charged for usage of the instance type you choose. Because the endpoint is always running and available, costs can quickly add up. Choosing the right instance for your model helps ensure you have the most performant instance at the lowest possible cost. Auto scaling is recommended to dynamically adjust the capacity depending on traffic to maintain steady and predictable performance at the lowest possible cost.
Inference latency – A single-model endpoint provides real-time, interactive, synchronous inference with millisecond latency.
Throughput – Throughput can be impacted by various factors, such as model input size, batch size, and endpoint instance type. It is recommended to review CloudWatch metrics for input requests and resource utilization, and select the appropriate instance type to achieve optimal throughput. SageMaker provides features to manage resources and optimize inference performance when deploying ML models: for example, you can compile models with SageMaker Neo, or choose Inf1 or GPU instances for your endpoint to improve throughput.
Scaling configuration complexity – Auto scaling is supported out of the box. SageMaker recommends choosing an appropriate scaling configuration by performing load tests.
Traffic pattern – A single-model endpoint is ideal for workloads with predictable traffic patterns.

Co-hosting multiple models

When you’re dealing with a large number of models, deploying each one on an individual endpoint with a dedicated container and instance can result in a significant increase in cost. Additionally, it becomes difficult to manage so many models in production, specifically when you don’t need to invoke all the models at the same time but still need them to be available at all times. Co-hosting multiple models on the same underlying compute resources makes it easy to manage ML deployments at scale and lowers your hosting costs through increased usage of the endpoint and its underlying compute resources. SageMaker supports advanced model co-hosting options such as multi-model endpoint (MME) for homogeneous models and multi-container endpoint (MCE) for heterogeneous models. Homogeneous models use the same ML framework on a shared serving container, whereas heterogeneous models allow you to deploy multiple serving containers that use different models or frameworks on a single endpoint.

The following diagram shows model co-hosting options using SageMaker.

SageMaker multi-model endpoints

SageMaker MMEs allow you to host multiple models using a shared serving container on a single endpoint. This is a scalable and cost-effective solution to deploy a large number of models that cater to the same use case, framework, or inference logic. MMEs can dynamically serve requests based on the model invoked by the caller. They also reduce deployment overhead because SageMaker manages loading models in memory and scaling them based on the traffic patterns to them. This feature is ideal when you have a large number of similar models that you can serve through a shared serving container and don’t need to access all the models at the same time. MMEs also enable time-sharing of memory resources across your models. This works best when the models are fairly similar in size and invocation latency, allowing MMEs to effectively use the instances across all models. SageMaker MMEs support hosting both CPU and GPU backed models. By using GPU backed models, you can lower your model deployment costs through increased usage of the endpoint and its underlying accelerated compute instances. For a real-world use case of MMEs, refer to How to scale machine learning inference for multi-tenant SaaS use cases.
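
A minimal sketch of deploying and invoking an MME with the SageMaker Python SDK follows; the shared serving container image, S3 prefix, role ARN, endpoint name, and model artifact key are placeholders, and every artifact under the prefix must be servable by the shared container.

```python
import boto3
import sagemaker
from sagemaker.model import Model
from sagemaker.multidatamodel import MultiDataModel

session = sagemaker.Session()
role = "arn:aws:iam::111122223333:role/SageMakerExecutionRole"  # placeholder

# The shared serving container used by every model behind the endpoint.
shared_container = Model(
    image_uri="<shared-serving-container-image-uri>",  # placeholder
    role=role,
)

# All model artifacts live under one S3 prefix; adding a model is as simple as
# uploading another model.tar.gz under this prefix.
mme = MultiDataModel(
    name="my-multi-model",                           # placeholder
    model_data_prefix="s3://my-bucket/mme-models/",  # placeholder
    model=shared_container,
    sagemaker_session=session,
)

mme.deploy(
    initial_instance_count=1,
    instance_type="ml.c5.xlarge",
    endpoint_name="my-mme-endpoint",  # placeholder
)

# TargetModel selects which artifact under the prefix serves this request;
# SageMaker lazily loads it into memory on first use.
runtime = boto3.client("sagemaker-runtime")
response = runtime.invoke_endpoint(
    EndpointName="my-mme-endpoint",
    TargetModel="churn/model.tar.gz",  # placeholder key relative to the prefix
    ContentType="application/json",
    Body=b'{"inputs": [1, 2, 3]}',
)
```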

The following table provides guidance on evaluating the fitness functions for MMEs.

Fitness function Description
Cost

MMEs enable using a shared serving container to host thousands of models on a single endpoint. This reduces hosting costs significantly by improving endpoint utilization compared with using single-model endpoints. For example, if you have 10 models to deploy using an ml.c5.large instance, based on SageMaker pricing, the cost of having 10 single-model persistent endpoints is: 10 * $0.102 = $1.02 per hour.

Whereas with one MME hosting the 10 models, we achieve 10 times cost savings: 1 * $0.102 = $0.102 per hour.

Inference latency

By default, MMEs cache frequently used models in memory and on disk to provide low-latency inference. The cached models are unloaded or deleted from disk only when a container runs out of memory or disk space to accommodate a newly targeted model. MMEs allow lazy loading of models, which means models are loaded into memory when invoked for the first time. This optimizes memory utilization; however, it causes response time spikes on first load, resulting in a cold start problem. Therefore, MMEs are also well suited to scenarios that can tolerate occasional cold-start-related latency penalties that occur when invoking infrequently used models.

To meet the latency and throughput goals of ML applications, GPU instances are preferred over CPU instances (given the computational power GPUs offer). With MME support for GPU, you can deploy thousands of deep learning models behind one SageMaker endpoint. MMEs can run multiple models on a GPU core, share GPU instances behind an endpoint across multiple models, and dynamically load and unload models based on the incoming traffic. With this, you can significantly save cost and achieve the best price performance. If your use case demands significantly higher transactions per second (TPS) or latency requirements, we recommend hosting the models on dedicated endpoints.

Throughput

An ideal value of MME inference throughput depends on factors such as model, payload size, and endpoint instance type. A higher amount of instance memory enables you to have more models loaded and ready to serve inference requests. You don’t need to waste time loading the model. A higher amount of vCPUs enables you to invoke more unique models concurrently. MMEs dynamically load and unload the model to and from instance memory, which may impact I/O performance.

SageMaker MMEs with GPU work using NVIDIA Triton Inference Server, which is an open-source inference serving software that simplifies the inference serving process and provides high inference performance. SageMaker loads the model to the NVIDIA Triton container’s memory on a GPU accelerated instance and serves the inference request. The GPU core is shared by all the models in an instance. If the model is already loaded in the container memory, the subsequent requests are served faster because SageMaker doesn’t need to download and load it again.

Proper performance testing and analysis is recommended for successful production deployments. SageMaker provides CloudWatch metrics for multi-model endpoints so you can determine the endpoint usage and the cache hit rate to help optimize your endpoint.

Scaling configuration complexity – SageMaker multi-model endpoints fully support auto scaling, which manages replicas of models to ensure models scale based on traffic patterns. However, proper load testing is recommended to determine the optimal instance size for auto scaling the endpoint. Right-sizing the MME fleet is important to avoid having too many models unloading. Loading hundreds of models on a few larger instances may lead to throttling in some cases, and using more and smaller instances could be preferable. To take advantage of automated model scaling in SageMaker, make sure you have instance auto scaling set up to provision additional instance capacity. Set up your endpoint-level scaling policy with either custom parameters or invocations per minute (recommended) to add more instances to the endpoint fleet. The invocation rates used to trigger an auto scale event are based on the aggregate set of predictions across the full set of models served by the endpoint.
Traffic pattern – MMEs are ideal when you have a large number of similarly sized models that you can serve through a shared serving container and don’t need to access all the models at the same time.

SageMaker multi-container endpoints

SageMaker MCEs support deploying up to 15 containers that use different models or frameworks on a single endpoint, and invoking them independently or in sequence for low-latency inference and cost savings. The models can be completely heterogenous, with their own independent serving stack. Securely hosting multiple models from different frameworks on a single instance could save you up to 90% in cost.

The MCE invocation patterns are as follows:

  • Inference pipelines – Containers in an MCE can be invoked in a linear sequence, also known as a serial inference pipeline. They are typically used to separate preprocessing, model inference, and postprocessing into independent containers. The output from the current container is passed as input to the next. They are represented as a single pipeline model in SageMaker. An inference pipeline can be deployed as an MME, where one of the containers in the pipeline can dynamically serve requests based on the model being invoked.
  • Direct invocation – With direct invocation, a request can be sent to a specific inference container hosted on an MCE, as sketched after this list.
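
The following is a minimal boto3 sketch of an MCE in direct invocation mode, as referenced above; the container images, model artifacts, role ARN, instance type, and resource names are placeholders.

```python
import boto3

sm = boto3.client("sagemaker")
runtime = boto3.client("sagemaker-runtime")
role = "arn:aws:iam::111122223333:role/SageMakerExecutionRole"  # placeholder

# Two heterogeneous serving stacks behind one endpoint, invoked directly.
sm.create_model(
    ModelName="my-multi-container-model",
    ExecutionRoleArn=role,
    Containers=[
        {
            "ContainerHostname": "tensorflow-container",
            "Image": "<tensorflow-serving-image-uri>",        # placeholder
            "ModelDataUrl": "s3://my-bucket/tf-model.tar.gz",  # placeholder
        },
        {
            "ContainerHostname": "pytorch-container",
            "Image": "<pytorch-serving-image-uri>",            # placeholder
            "ModelDataUrl": "s3://my-bucket/pt-model.tar.gz",   # placeholder
        },
    ],
    InferenceExecutionConfig={"Mode": "Direct"},  # "Serial" would chain the containers
)

sm.create_endpoint_config(
    EndpointConfigName="my-mce-config",
    ProductionVariants=[{
        "VariantName": "AllTraffic",
        "ModelName": "my-multi-container-model",
        "InstanceType": "ml.m5.xlarge",
        "InitialInstanceCount": 1,
    }],
)
sm.create_endpoint(EndpointName="my-mce-endpoint", EndpointConfigName="my-mce-config")

# Direct invocation: address a specific container by hostname.
response = runtime.invoke_endpoint(
    EndpointName="my-mce-endpoint",
    TargetContainerHostname="pytorch-container",
    ContentType="application/json",
    Body=b'{"inputs": [1, 2, 3]}',
)
```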

The following table provides guidance on evaluating the fitness functions for MCEs.

Fitness function Description
Cost – MCEs enable you to run up to 15 different ML containers on a single endpoint and invoke them independently, thereby saving costs. This option is ideal when you have multiple models running on different serving stacks with similar resource needs, and when individual models don’t have sufficient traffic to utilize the full capacity of the endpoint instances. MCEs are therefore more cost-effective than single-model endpoints. MCEs offer synchronous inference response, which means the endpoint is always available and you pay for the uptime of the instance. Cost can add up depending on the number and type of instances.
Inference latency – MCEs are ideal for running ML apps with different ML frameworks and algorithms for each model that are accessed infrequently but still require low-latency inference. The models are always available for low-latency inference and there is no cold start problem.
Throughput – MCEs are limited to up to 15 containers on a multi-container endpoint, and GPU inference is not supported due to resource contention. For multi-container endpoints using direct invocation mode, SageMaker not only provides instance-level metrics as it does with other common endpoints, but also supports per-container metrics. As a best practice, review CloudWatch metrics for input requests and resource utilization, and select the appropriate instance type to achieve optimal throughput.
Scaling configuration complexity – MCEs support auto scaling. However, in order to configure automatic scaling, it is recommended that the model in each container exhibits similar CPU utilization and latency on each inference request. This is recommended because if traffic to the multi-container endpoint shifts from a low CPU utilization model to a high CPU utilization model, but the overall call volume remains the same, the endpoint doesn’t scale out, and there may not be enough instances to handle all the requests to the high CPU utilization model.
Traffic pattern – MCEs are ideal for workloads with continual or regular traffic patterns, for hosting models across different frameworks (such as TensorFlow, PyTorch, or Sklearn) that may not have sufficient traffic to saturate the full capacity of an endpoint instance.

Hosting a multi-model based ML application

Many business applications need to use multiple ML models to serve a single prediction request to their consumers. Consider, for example, a retail company that wants to provide recommendations to its users. The ML application in this use case may want to use different custom models for recommending different categories of products. If the company wants to add personalization to the recommendations by using individual user information, the number of custom models further increases. Hosting each custom model on a distinct compute instance is not only cost prohibitive, but also leads to underutilization of the hosting resources if not all models are frequently used. SageMaker offers efficient hosting options for multi-model based ML applications.

The following diagram shows multi-model hosting options for a single endpoint using SageMaker.

Serial inference pipeline

An inference pipeline is a SageMaker model that is composed of a linear sequence of 2–15 containers that process requests for inferences on data. You use an inference pipeline to define and deploy any combination of pretrained SageMaker built-in algorithms and your own custom algorithms packaged in Docker containers. You can use an inference pipeline to combine preprocessing, predictions, and postprocessing data science tasks. The output from one container is passed as input to the next. When defining the containers for a pipeline model, you also specify the order in which the containers are run. They are represented as a single pipeline model in SageMaker. The inference pipeline can be deployed as an MME, where one of the containers in the pipeline can dynamically serve requests based on the model being invoked. You can also run a batch transform job with an inference pipeline. Inference pipelines are fully managed.
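
A minimal sketch of a two-container serial inference pipeline with the SageMaker Python SDK follows; the model artifacts, entry point scripts, framework versions, role ARN, and endpoint name are placeholders.

```python
import sagemaker
from sagemaker.pipeline import PipelineModel
from sagemaker.sklearn import SKLearnModel
from sagemaker.xgboost import XGBoostModel

session = sagemaker.Session()
role = "arn:aws:iam::111122223333:role/SageMakerExecutionRole"  # placeholder

# Container 1: feature preprocessing packaged as an SKLearn model (placeholder artifacts).
preprocessor = SKLearnModel(
    model_data="s3://my-bucket/preprocessor.tar.gz",
    role=role,
    entry_point="preprocessing.py",
    framework_version="1.2-1",
)

# Container 2: the trained predictor (placeholder artifacts).
predictor_model = XGBoostModel(
    model_data="s3://my-bucket/xgb-model.tar.gz",
    role=role,
    entry_point="inference.py",
    framework_version="1.7-1",
)

# Containers run in order on the same instance; the output of one feeds the next.
pipeline_model = PipelineModel(
    name="my-inference-pipeline",
    role=role,
    models=[preprocessor, predictor_model],
    sagemaker_session=session,
)

pipeline_model.deploy(
    initial_instance_count=1,
    instance_type="ml.m5.xlarge",
    endpoint_name="my-pipeline-endpoint",  # placeholder
)
```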

The following table provides guidance on evaluating the fitness functions for ML model hosting using a serial inference pipeline.

Fitness function Description
Cost – A serial inference pipeline enables you to run up to 15 different ML containers on a single endpoint, making it cost-effective to host the inference containers. There are no additional costs for using this feature. You pay only for the instances running on an endpoint. Cost can add up depending on the number and type of instances.
Inference latency – When an ML application is deployed as an inference pipeline, the data between different models doesn’t leave the container space. Feature processing and inferences run with low latency because the containers are co-located on the same EC2 instances.
Throughput – Within an inference pipeline model, SageMaker handles invocations as a sequence of HTTP requests. The first container in the pipeline handles the initial request, then the intermediate response is sent as a request to the second container, and so on, for each container in the pipeline. SageMaker returns the final response to the client. Throughput depends on factors such as the model, model input size, batch size, and endpoint instance type. As a best practice, review CloudWatch metrics for input requests and resource utilization, and select the appropriate instance type to achieve optimal throughput.
Scaling configuration complexity – Serial inference pipelines support auto scaling. However, in order to configure automatic scaling, it is recommended that the model in each container exhibits similar CPU utilization and latency on each inference request. This is recommended because if traffic to the endpoint shifts from a low CPU utilization model to a high CPU utilization model, but the overall call volume remains the same, the endpoint doesn’t scale out and there may not be enough instances to handle all the requests to the high CPU utilization model.
Traffic pattern – Serial inference pipelines are ideal for predictable traffic patterns with models that run sequentially on the same endpoint.

Deploying model ensembles (Triton DAG)

SageMaker offers integration with NVIDIA Triton Inference Server through Triton Inference Server Containers. These containers include NVIDIA Triton Inference Server, support for common ML frameworks, and useful environment variables that let you optimize performance on SageMaker. With NVIDIA Triton container images, you can easily serve ML models and benefit from the performance optimizations, dynamic batching, and multi-framework support provided by NVIDIA Triton. Triton helps maximize the utilization of GPU and CPU, further lowering the cost of inference.

In business use cases where ML applications use several models to serve a prediction request, if each model uses a different framework or is hosted on a separate instance, it may lead to increased workload and cost as well as an increase in overall latency. SageMaker NVIDIA Triton Inference Server supports deployment of models from all major frameworks, such as TensorFlow GraphDef, TensorFlow SavedModel, ONNX, PyTorch TorchScript, TensorRT, and Python/C++ model formats. A Triton model ensemble represents a pipeline of one or more models, or preprocessing and postprocessing logic, and the connection of input and output tensors between them. A single inference request to an ensemble triggers the run of the entire pipeline. Triton also has multiple built-in scheduling and batching algorithms that combine individual inference requests to improve inference throughput. These scheduling and batching decisions are transparent to the client requesting inference. The models can be run on CPUs or GPUs for maximum flexibility and to support heterogeneous computing requirements.

Hosting multiple GPU backed models on multi-model endpoints is supported through the SageMaker Triton Inference Server. The NVIDIA Triton Inference Server has been extended to implement an MME API contract, to integrate with MMEs. You can use the NVIDIA Triton Inference Server, which creates a model repository configuration for different framework backends, to deploy an MME with auto scaling. This feature allows you to scale hundreds of hyper-personalized models that are fine-tuned to cater to unique end-user experiences in AI applications. You can also use this feature to achieve the desired price performance for your inference application by using fractional GPUs. To learn more, refer to Run multiple deep learning models on GPU with Amazon SageMaker multi-model endpoints.
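
To illustrate one way this can be wired up, the following boto3 sketch deploys a Triton model repository (which can include ensemble definitions) as a GPU-backed MME; the SageMaker Triton container image URI, S3 prefix, instance type, and model package name are placeholders.

```python
import boto3

sm = boto3.client("sagemaker")
role = "arn:aws:iam::111122223333:role/SageMakerExecutionRole"  # placeholder

# One Triton serving container in multi-model mode; model packages (including
# ensemble definitions) live under the shared S3 prefix.
sm.create_model(
    ModelName="triton-gpu-mme",
    ExecutionRoleArn=role,
    PrimaryContainer={
        "Image": "<sagemaker-triton-inference-server-image-uri>",  # placeholder
        "ModelDataUrl": "s3://my-bucket/triton-models/",           # shared prefix
        "Mode": "MultiModel",
    },
)

sm.create_endpoint_config(
    EndpointConfigName="triton-gpu-mme-config",
    ProductionVariants=[{
        "VariantName": "AllTraffic",
        "ModelName": "triton-gpu-mme",
        "InstanceType": "ml.g4dn.xlarge",  # GPU instance shared by all models
        "InitialInstanceCount": 1,
    }],
)
sm.create_endpoint(EndpointName="triton-gpu-mme-endpoint",
                   EndpointConfigName="triton-gpu-mme-config")

# Invoke a specific (possibly ensemble) model package under the shared prefix.
runtime = boto3.client("sagemaker-runtime")
response = runtime.invoke_endpoint(
    EndpointName="triton-gpu-mme-endpoint",
    TargetModel="ensemble_model.tar.gz",  # placeholder package name
    ContentType="application/octet-stream",
    Body=b"<binary tensor payload>",
)
```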

The following table provides guidance on evaluating the fitness functions for ML model hosting using MMEs with GPU support on Triton inference containers. For single-model endpoints and serverless endpoint fitness function evaluations, refer to the earlier sections in this post.

Fitness function Description
Cost – SageMaker MMEs with GPU support using Triton Inference Server provide a scalable and cost-effective way to deploy a large number of deep learning models behind one SageMaker endpoint. With MMEs, multiple models share the GPU instance behind an endpoint. This enables you to break the linearly increasing cost of hosting multiple models and reuse infrastructure across all the models. You pay for the uptime of the instance.
Inference latency

SageMaker with Triton Inference Server is purpose-built to maximize throughput and hardware utilization with ultra-low (single-digit milliseconds) inference latency. It has a wide range of supported ML frameworks (including TensorFlow, PyTorch, ONNX, XGBoost, and NVIDIA TensorRT) and infrastructure backends, including NVIDIA GPUs, CPUs, and AWS Inferentia.

With MME support for GPU using SageMaker Triton Inference Server, you can deploy thousands of deep learning models behind one SageMaker endpoint. SageMaker loads the model to the NVIDIA Triton container’s memory on a GPU accelerated instance and serves the inference request. The GPU core is shared by all the models in an instance. If the model is already loaded in the container memory, the subsequent requests are served faster because SageMaker doesn’t need to download and load it again.

Throughput

MMEs offer capabilities for running multiple deep learning or ML models on the GPU at the same time with Triton Inference Server. This allows you to easily use the NVIDIA Triton multi-framework, high-performance inference serving together with fully managed SageMaker model deployment.

Triton supports all NVIDIA GPU-, x86-, Arm® CPU-, and AWS Inferentia-based inferencing. It offers dynamic batching, concurrent runs, optimal model configuration, model ensemble, and streaming audio and video inputs to maximize throughput and utilization. Other factors such as network and payload size may play a minimal role in the overhead associated with the inference.

Scaling configuration complexity

MMEs can scale horizontally using an auto scaling policy, and provision additional GPU compute instances based on metrics such as InvocationsPerInstance and GPUUtilization to serve any traffic surge to MME endpoints.

With Triton inference server, you can easily build a custom container that includes your model with Triton and bring it to SageMaker. SageMaker Inference will handle the requests and automatically scale the container as usage increases, making model deployment with Triton on AWS easier.

Traffic pattern

MMEs are ideal for predictable traffic patterns with models run as DAGs on the same endpoint.

SageMaker takes care of traffic shaping to the MME endpoint and maintains optimal model copies on GPU instances for best price performance. It continues to route traffic to the instance where the model is loaded. If the instance resources reach capacity due to high utilization, SageMaker unloads the least-used models from the container to free up resources to load more frequently used models.

Best practices

Consider the following best practices:

  • High cohesion and low coupling between models – Host models that have high cohesion (drive a single business functionality) in the same container and encapsulate them together for ease of upgrade and manageability. At the same time, decouple those models from other models (host them in different containers) so that you can easily upgrade one model without impacting other models. Host multiple models that use different containers behind one endpoint and invoke them independently, or add model preprocessing and postprocessing logic as a serial inference pipeline.
  • Inference latency – Group the models that drive a single business functionality and host them in a single container to minimize the number of hops and therefore minimize the overall latency. There are caveats: for example, if the grouped models use multiple frameworks, you might choose to host them in multiple containers that run on the same host to reduce latency and minimize cost.
  • Logically group ML models with high cohesion – The logical group may consist of models that are homogeneous (for example, all XGBoost models) or heterogeneous (for example, a few XGBoost and a few BERT). It may consist of models that are shared across multiple business functionalities or may be specific to fulfilling only one business functionality.
    • Shared models – If the logical group consists of shared models, the ease of upgrading the models and latency will play a major role in architecting the SageMaker endpoints. For example, if latency is a priority, it’s better to place all the models in a single container behind a single SageMaker endpoint to avoid multiple hops. The downside is that if any of the models need to be upgraded, it will result in upgrading all the relevant SageMaker endpoints hosting this model.
    • Non-shared models – If the logical group consists of only business-feature-specific models and is not shared with other groups, packaging complexity and latency become the key dimensions to consider. It’s advisable to host these models in a single container behind a single SageMaker endpoint.
  • Efficient use of hardware (CPU, GPU) – Group CPU-based models together and host them on the same host so that you can efficiently use the CPU. Similarly, group GPU-based models together so that you can efficiently use and scale them. There are hybrid workloads that require both CPU and GPU on the same host. Hosting the CPU-only and GPU-only models on the same host should be driven by high cohesion and application latency requirements. Additionally, cost, ability to scale, and blast radius on impact in case of failure are the key dimensions to look into.
  • Fitness functions – Use fitness functions as a guideline for selecting an ML hosting option.

Conclusion

When it comes to ML hosting, there is no one-size-fits-all approach. ML practitioners need to choose the right design pattern to address their ML hosting challenges. Evaluating the fitness functions provides prescriptive guidance on selecting the right ML hosting option.

For more details on each of the hosting options, refer to the following posts in this series:


About the authors

Dhawal Patel is a Principal Machine Learning Architect at AWS. He has worked with organizations ranging from large enterprises to mid-sized startups on problems related to distributed computing and artificial intelligence. He focuses on deep learning, including the NLP and computer vision domains. He helps customers achieve high-performance model inference on SageMaker.

Deepali Rajale is an AI/ML Specialist Technical Account Manager at Amazon Web Services. She works with enterprise customers, providing technical guidance on implementing machine learning solutions with best practices. In her spare time, she enjoys hiking, movies, and hanging out with family and friends.

Saurabh Trikande is a Senior Product Manager for Amazon SageMaker Inference. He is passionate about working with customers and is motivated by the goal of democratizing machine learning. He focuses on core challenges related to deploying complex ML applications, multi-tenant ML models, cost optimizations, and making deployment of deep learning models more accessible. In his spare time, Saurabh enjoys hiking, learning about innovative technologies, following TechCrunch and spending time with his family.

Best practices for creating Amazon Lex interaction models

Amazon Lex is an AWS service for building conversational interfaces into any application using voice and text, enabling businesses to add sophisticated, natural language chatbots across different channels. Amazon Lex uses machine learning (ML) to understand natural language (normal conversational text and speech). In this post, we go through a set of best practices for using ML to create a bot that will delight your customers by accurately understanding them. This allows your bot to have more natural conversations that don’t require the user to follow a set of strict instructions. Designing and building an intelligent conversational interface is very different than building a traditional application or website, and this post will help you develop some of the new skills required.

Let’s look at some of the terminology we use frequently in this post:

  • Utterance – The phrase the user says to your live bot.
  • Sample utterance – Some examples of what users might say. These are attached to intents and used to train the bot.
  • Intent – This represents what the user meant and should be clearly connected to a response or an action from the bot. For instance, an intent that responds to a user saying hello, or an intent that can respond and take action if a user wants to order a coffee. A bot has one or more intents that utterances can be mapped to.
  • Slot – A parameter that can capture specific types of information from the utterance (for example, the time of an appointment or the customer’s name). Slots are attached to intents.
  • Slot value – Either examples of what the slot should capture, or a specific list of values for a slot (for example, large, medium, and small as values for a slot for coffee sizes).

The following image shows how all these pieces fit together to make up your bot.

A diagram showing how an interaction with an Amazon Lex bot flows through automatic speech recognition, natural language understanding, fulfillment (including conversational user experience) and back to text to speech

Building a well-designed bot requires several different considerations. These include requirements gathering and discovery, conversational design, testing through automation and with users, and monitoring and optimizing your bot. Within the conversational design aspect, there are two main elements: the interaction model and the conversational or voice user experience (CUX/VUX). CUX and VUX encompass the personality of the bot, the types of responses, the flow of the conversation, variations for modality, and how the bot handles unexpected inputs or failures. The interaction model is the piece that can take what the user said (utterance) and map it to what they meant (intent). In this post, we only look at how to design and optimize your interaction model.

Because Amazon Lex uses machine learning, that puts the creator of the bot in the role of machine teacher. When we build a bot, we need to give it all the knowledge it needs about the types of conversations it will support. We do this both by how we configure the bot (intents and slots) and the training data we give it (sample utterances and slot values). The underlying service then enriches it with knowledge about language generally, enabling it to understand phrases beyond the exact data we have given it.

The best practices listed in the following sections can support you in building a bot that will give your customers a great user experience and work well for your use case.

Creating intents

Each intent is a concept you teach your bot to understand. For instance, it could be an intent that represents someone ordering a coffee, or someone greeting your bot. You need to make sure that you make it really clear and easy for the bot to recognize that a particular utterance should be matched to that intent.

Imagine if someone gave you a set of index cards with phrases on them, each sorted into piles, but with no other context or details. They then started to give you additional index cards with phrases and asked you to add them to the right pile, simply based on the phrases on the cards in each pile. If each pile represented a clear concept with similar phrasing, this would be easy. But if there were no clear topic in each, you would struggle to work out how to match them to a pile. You may even start to use other clues, like “these are all short sentences” or “only these have punctuation.”

Your bot uses similar techniques, but remember that although ML is smart, it’s not as smart as a human, and doesn’t have all the external knowledge and context a human has. If a human with no context of what your bot does might struggle to understand what was meant, your bot likely will too. The best practices in this section can help you create intents that will be recognizable and more likely to be matched with the desired utterance.

1. Each intent should represent a single concept

Each intent should represent one concept or idea, and not just a topic. It’s okay to have multiple intents that map to the same action or response if separating them gives each a clearer, cohesive concept. Let’s look at some dos and don’ts:

  • Don’t create generic intents that group multiple concepts together.

For example, the following intent combines phrases about a damaged product and more general complaint phrases:

DamageComplaint
I've received a damaged product
i received a damaged product
I'm really frustrated
Your company is terrible at deliveries
My product is broken
I got a damaged package
I'm going to return this order
I'll never buy from you again

The following intent is another example, which combines updating personal details with updating the mobile application:

UpdateNeeded
I need to update my address
Can I update the address you have for me
How do I update my telephone number
I can't get the update for the mobile app to work
Help me update my iphone app
How do I get the latest version of the mobile app

  • Do split up intents when they have very different meanings. For example, we can split up the UpdateNeeded intent from the previous example into two intents:

UpdatePersonalDetails
I need to update my address
Can I update the address you have for me
How do I update my telephone number

UpdateMobileApp
I can't get the update for the mobile app to work
Help me update my iphone app
How do I get the latest version of the mobile app

  • Do split up intents when they have the same action or response needed, but use very different phrasing. For example, the following two intents may have the same end result, but the first is directly telling us they need to tow their car, whereas the second is only indirectly hinting that they may need their car towed.

RoadsideAssistanceRequested
I need to tow my car
Can I get a tow truck
Can you send someone out to get my car

RoadsideAssistanceNeeded
I've had an accident
I hit an animal
My car broke down

2. Reduce overlap between intents

Let’s think about that stack of index cards again. If there were cards with the same (or very similar) phrases, it would be hard to know which stack to add a new card with that phrase onto. It’s the same in this case. We want really clear-cut sets of sample utterances in each intent. The following are a few strategies:

  • Don’t create intents with very similar phrasing that have similar meanings. Because Amazon Lex generalizes beyond the sample utterances, phrases that aren’t clearly tied to one specific intent can get mismatched, for instance a customer saying “I’d like to book an appointment” when there are two appointment intents, like the following:

BookDoctorsAppointment
I’d like to book a doctors appointment

BookBloodLabAppointment
I’d like to book a lab appointment

  • Do use slots to combine intents that are on the same topic and have similar phrasing. For example, by combining the two intents in the previous example, we can more accurately capture any requests for an appointment, and then use a slot to determine the correct type of appointment:

BookAppointment
I’d like to book a {appointmentType} appointment

  • Don’t create intents where one intent is a subset of another. For example, as your bot grows, it can be easy to start creating intents to capture more detailed information:

BookFlight
I'd like to book a flight
book me a round trip flight
i need to book flight one way

BookOneWayFlight
book me a one-way flight
I’d like to book a one way flight
i need to book flight one way please

  • Do use slots to capture different subsets of information within an intent. For example, instead of using different intents to capture the information on the type of flight, we can use a slot to capture this:

BookFlight
I'd like to book a flight
book me a {itineraryType} flight
i need to book flight {itineraryType}
I’d like to book a {itineraryType} flight

3. Have the right amount of data

In ML, training data is key. Hundreds or thousands of samples are often needed to get good results. You’ll be glad to hear that Amazon Lex doesn’t require a huge amount of data, and in fact you don’t want to have too many sample utterances in each intent, because they may start to diverge or add confusion. However, it is key that we provide enough sample utterances to create a clear pattern for the bot to learn from.

Consider the following:

  • Have at least 15 utterances per intent.
  • Add additional utterances incrementally (batches of 10–15) so you can test the performance in stages. A larger number of utterances is not necessarily better.
  • Review intents with a large number of utterances (over 100) to evaluate if you can either remove very similar utterances, or should split the intent into multiple intents.
  • Keep the number of utterances similar across intents. This allows recognition for each intent to be balanced, and avoids accidentally biasing your bot to certain intents.
  • Regularly review your intents based on learnings from your production bot, and continue to add and adjust the utterances. Designing and developing a bot is an iterative process that never stops.

4. Have diversity in your data

Amazon Lex is a conversational AI—its primary purpose is to chat with humans. Humans tend to have a large amount of variety in how they phrase things. When designing a bot, we want to make sure we’re capturing that range in our intent configuration. It’s important to re-evaluate and update your configuration and sample data on a regular basis, especially if you’re expanding or changing your user base over time. Consider the following recommendations:

  • Do have a diverse range of utterances in each intent. The following are examples of the types of diversity you should consider:
    • Utterance lengths – The following is an example of varying lengths:

BookFlight
book flight
I need to book a flight
I want to book a flight for my upcoming trip

    • Vocabulary – We need to align this with how our customers talk. You can capture this through user testing or by using the conversational logs from your bot. For example:

OrderFlowers
I want to buy flowers
Can I order flowers
I need to get flowers

    • Phrasing – We need a mix of utterances that represent the different ways our customers might phrase things. The following example shows utterances using “book” as a verb, “booking” as a noun, “flight booking” as a subject, and formal and informal language:

BookFlight
I need to book a flight
can you help with a flight booking
Flight booking is what I am looking for
please book me a flight
I'm gonna need a flight

    • Punctuation – We should include a range of common usage. We should also include non-grammatical usage if this is something a customer would use (especially when typing). See the following example:

OrderFlowers
I want to order flowers.
i wanted to get flowers!
Get me some flowers... please!!

    • Slot usage – Provide sample utterances that show both using and not using slots. Use different mixes of slots across those that include them. Make sure the slots have examples with different places they could appear in the utterance. For example:

CancelAppointment
Cancel appointment
Cancel my appointment with Dr. {DoctorLastName}
Cancel appointment on {AppointmentDate} with Dr. {DoctorLastName}
Cancel my appointment on {AppointmentDate}
Can you tell Dr. {DoctorLastName} to cancel my appointment
Please cancel my doctors appointment

  • Don’t keep adding utterances that are just small variances in phrasing. Amazon Lex is able to handle generalizing these for you. For example, you wouldn’t require each of these three variations as the differences are minor:

DamagedProductComplaint
I've received a damaged product
I received a damaged product
Received damaged product

  • Don’t add diversity to some intents but not to others. We need to be consistent with the forms of diversity we add. Remember the index cards from the beginning—when an utterance isn’t clear, the bot may start to use other clues, like sentence length or punctuation, to try to make a match. There are times you may want to use this to your advantage (for example, if you genuinely want to direct all one-word phrases to a particular intent), but it’s important you avoid doing this by accident.

Creating slots

We touched on some good practices involving slots in the previous section, but let’s look at some more specific best practices for slots.

5. Use short noun or adjective phrases for slots

Slots represent something that can be captured definitively as a parameter, like the size of the coffee you want to order, or the airport you’re flying to. Consider the following:

  • Use nouns or short adjectives for your slot values. Don’t use slots for things like carrier phrases (“how do I” or “what could I”) because this will reduce the ability of Amazon Lex to generalize your utterances. Try to keep slots for values you need to capture to fulfill your intent.
  • Keep slots generally to one or two words.

6. Prefer slots over explicit values

You can use slots to generalize the phrases you’re using, but we need to stick to the recommendations we just reviewed as well. To make our slot values as easy to identify as possible, we never use values included in the slot directly in sample utterances. Keep in mind the following tips:

  • Don’t explicitly include values that could be slots in the sample utterances. For example:

OrderFlowers
I want to buy roses
I want to buy lilies
I would love to order some orchids
I would love to order some roses

  • Do use slots to reduce repetition (a programmatic sketch of this pattern follows this list). For example:

OrderFlowers
I want to buy {flowers}
I would love to order some {flowers}

flowers
roses
lilies
orchids

  • Don’t mix slots and real values in the sample utterances. For example:

OrderFlowers
I want to buy {flowers}
I want to buy lilies
I would love to order some {flowers}

flowers
roses
lilies
orchids

  • Don’t have intents whose sample utterances contain only slots if the slot types are AlphaNumeric, Number, Date, or GRXML, are very broad custom slots, or include abbreviations. Instead, expand the sample utterances by adding conversational phrases that include the slot.
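
To make the slot-based pattern from the list above concrete, the following is a minimal, hypothetical sketch using the Lex V2 model-building API (boto3 lexv2-models). It assumes a draft bot and en_US locale already exist; the bot ID and prompt text are placeholders, and the intent is created first so the slot can be attached before the utterances that reference it are added.

import boto3

lex = boto3.client("lexv2-models")
BOT_ID, BOT_VERSION, LOCALE = "BOTID12345", "DRAFT", "en_US"   # placeholders

# 1. A custom slot type holding the values the slot should capture.
slot_type = lex.create_slot_type(
    slotTypeName="flowers",
    botId=BOT_ID, botVersion=BOT_VERSION, localeId=LOCALE,
    valueSelectionSetting={"resolutionStrategy": "TopResolution"},
    slotTypeValues=[{"sampleValue": {"value": v}} for v in ("roses", "lilies", "orchids")],
)

# 2. The intent, created first so the slot can be attached to it.
intent = lex.create_intent(
    intentName="OrderFlowers",
    botId=BOT_ID, botVersion=BOT_VERSION, localeId=LOCALE,
)

# 3. The slot itself, attached to the intent.
lex.create_slot(
    slotName="flowers",
    slotTypeId=slot_type["slotTypeId"],
    intentId=intent["intentId"],
    botId=BOT_ID, botVersion=BOT_VERSION, localeId=LOCALE,
    valueElicitationSetting={
        "slotConstraint": "Required",
        "promptSpecification": {
            "messageGroups": [{"message": {"plainTextMessage": {"value": "Which flowers would you like?"}}}],
            "maxRetries": 2,
        },
    },
)

# 4. Sample utterances that reference the slot instead of hard-coded values.
lex.update_intent(
    intentId=intent["intentId"],
    intentName="OrderFlowers",
    botId=BOT_ID, botVersion=BOT_VERSION, localeId=LOCALE,
    sampleUtterances=[
        {"utterance": "I want to buy {flowers}"},
        {"utterance": "I would love to order some {flowers}"},
    ],
)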

7. Keep your slot values coherent

The bot has to decide whether to match a slot based only on what it can learn from the values we have entered. If there is a lot of similarity or overlap within slots in the same intent, this can cause challenges with the right slot being matched.

  • Don’t have slots with overlapping values in the same intent. Try to combine them instead. For example:

pets
cat
dog
goldfish

animals
horse
cat
dog

8. Consider how the words will be transcribed

Amazon Lex uses automated speech recognition (ASR) to transcribe speech. This means that all inputs to your Amazon Lex interaction model are processed as text, even when using a voice bot. We need to remember that a transcription may vary from how users might type the same thing. Consider the following:

  • Enter acronyms, or other words whose letters should be pronounced individually, as single letters separated by a period and a space. This will more closely match how it will be transcribed. For example:

A. T. M.
A. W. S.
P. A.

  • Review the audio and transcriptions on a regular basis, so you can adjust your sample utterances or slot types. To do this, turn on conversation logs, and enable both text and audio logs, whenever possible.

9. Use the right options available for your slots

Many different types of slots and options are available, and using the best options for each of our slots can help the recognition of those slot values. We always want to take the time to understand the options before deciding on how to design our slots:

  • Use the restrict option to limit slots to a closed set of values. You can define synonyms for each value. This could be, for instance, the menu items in your restaurant.
  • Use the expand option when you want to be able to identify more than just the sample values you provide (for example, Name).
  • Turn obfuscation on for slots that collect sensitive data to prevent the data from being logged (shown in the sketch after this list).
  • Use runtime hints to improve slot recognition when you can narrow down the potential options at runtime. Choosing one slot might narrow down the options for another; for example, a particular type of furniture may not have all color options.
  • Use spelling styles to capture uncommon words or words with variations in spellings such as names.
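
The following is a small, hypothetical boto3 (lexv2-models) sketch of two of these options: the resolution strategy that restricts a slot type to its defined values (with synonyms) versus one that expands beyond them, and obfuscation on a sensitive slot. The bot and intent IDs are placeholders.

import boto3

lex = boto3.client("lexv2-models")
BOT_ID, BOT_VERSION, LOCALE = "BOTID12345", "DRAFT", "en_US"   # placeholders

# Restrict: the slot resolves only to the listed values; synonyms map back to them.
lex.create_slot_type(
    slotTypeName="coffeeSize",
    botId=BOT_ID, botVersion=BOT_VERSION, localeId=LOCALE,
    valueSelectionSetting={"resolutionStrategy": "TopResolution"},
    slotTypeValues=[
        {"sampleValue": {"value": "small"}, "synonyms": [{"value": "tall"}]},
        {"sampleValue": {"value": "medium"}, "synonyms": [{"value": "grande"}]},
        {"sampleValue": {"value": "large"}, "synonyms": [{"value": "venti"}]},
    ],
)

# Expand: sample values only guide recognition, so unseen values (like new names) can still be captured.
name_type = lex.create_slot_type(
    slotTypeName="customerName",
    botId=BOT_ID, botVersion=BOT_VERSION, localeId=LOCALE,
    valueSelectionSetting={"resolutionStrategy": "OriginalValue"},
    slotTypeValues=[{"sampleValue": {"value": "Jane Doe"}}],
)

# Obfuscation: mask the captured value in conversation logs for a sensitive slot.
lex.create_slot(
    slotName="customerName",
    slotTypeId=name_type["slotTypeId"],
    intentId="INTENTID123",                      # hypothetical existing intent
    botId=BOT_ID, botVersion=BOT_VERSION, localeId=LOCALE,
    obfuscationSetting={"obfuscationSettingType": "DefaultObfuscation"},
    valueElicitationSetting={
        "slotConstraint": "Required",
        "promptSpecification": {
            "messageGroups": [{"message": {"plainTextMessage": {"value": "What name is the booking under?"}}}],
            "maxRetries": 2,
        },
    },
)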

10. Use custom vocabulary for specialist domains

In most cases, a custom vocabulary is not required, but can be helpful if your users will use specialist words not common in everyday language. In this case, adding one can be helpful in making sure that your transcriptions are accurate. Keep the following in mind:

  • Do use a custom vocabulary to add words that aren’t readily recognized by Amazon Lex in voice-based conversations. This improves the speech-to-text transcription and overall customer experience.
  • Don’t use short or common words like “on,” “it,” “to,” “yes,” or “no” in a custom vocabulary.
  • Do decide how much weight to give a word based on how often the word isn’t recognized in the transcription and how rare the word is in the input. Words that are difficult to pronounce require a higher weight. Use a representative test set to determine if a weight is appropriate. You can collect an audio test set by turning on audio logging in conversation logs.
  • Do use custom slot types for lists of catalog values or entities such as product names or mutual funds.

11. GRXML slots need a strict grammar

When migrating to Amazon Lex from a service that may already have grammars in place (such as traditional automatic speech recognition engines), it is possible to reuse GRXML grammars during the new bot design process. However, when creating a completely new Amazon Lex bot, we recommend first checking if other slot types might meet your needs before using GRXML. Consider the following:

  • Do use GRXML slots only for spoken input, and not text-based interactions.
  • Don’t add the carrier phrases for the GRXML slots in the GRXML file (grammar) itself.
  • Do put carrier phrases into the slot sample utterances, such as I live in {zipCode} or {zipCode} is my zip code.
  • Do author the grammar to only capture correct slot values. For example, to capture a five-digit US ZIP code, you should only accept values that are exactly five digits.

Summary

In this post, we walked through a set of best practices that should help you as you design and build your next bot. As you take away this information, it’s important to remember that best practices are always context dependent. These aren’t rules, but guidelines to help you build a high-performing chatbot. As you keep building and optimizing your own bots, you will find some of these are more important for your use case than others, and you might add your own additional best practices. As a bot creator, you have a lot of control over how you configure your Amazon Lex bot to get the best results for your use case, and these best practices should give you a great place to start.

We can summarize the best practices in this post as follows:

  • Keep each intent to a single clear concept with a coherent set of utterances
  • Use representative, balanced, and diverse sample utterance data
  • Use slots to make intents clearer and capture data
  • Keep each slot to a single topic with a clear set of values
  • Know and use the right type of slot for your use case

For more information on Amazon Lex, check out Getting started with Amazon Lex for documentation, tutorials, how-to videos, code samples, and SDKs.


About the Author

Gillian Armstrong is a Builder Solutions Architect. She is excited about how the Cloud is opening up opportunities for more people to use technology to solve problems, and especially excited about how cognitive technologies, like conversational AI, are allowing us to interact with computers in more human ways.

Power recommendations and search using an IMDb knowledge graph – Part 3

This three-part series demonstrates how to use graph neural networks (GNNs) and Amazon Neptune to generate movie recommendations using the IMDb and Box Office Mojo Movies/TV/OTT licensable data package, which provides a wide range of entertainment metadata, including over 1 billion user ratings; credits for more than 11 million cast and crew members; 9 million movie, TV, and entertainment titles; and global box office reporting data from more than 60 countries. Many AWS media and entertainment customers license IMDb data through AWS Data Exchange to improve content discovery and increase customer engagement and retention.

The following diagram illustrates the complete architecture implemented as part of this series.

In Part 1, we discussed the applications of GNNs and how to transform and prepare our IMDb data into a knowledge graph (KG). We downloaded the data from AWS Data Exchange and processed it in AWS Glue to generate KG files. The KG files were stored in Amazon Simple Storage Service (Amazon S3) and then loaded in Amazon Neptune.

In Part 2, we demonstrated how to use Amazon Neptune ML (in Amazon SageMaker) to train the KG and create KG embeddings.

In this post, we walk you through how to apply our trained KG embeddings in Amazon S3 to out-of-catalog search use cases using Amazon OpenSearch Service and AWS Lambda. You also deploy a local web app for an interactive search experience. All the resources used in this post can be created using a single AWS Cloud Development Kit (AWS CDK) command as described later in the post.

Background

Have you ever inadvertently searched for a content title that wasn’t available in a video streaming platform? If so, instead of facing a blank search results page, you often find a list of movies in the same genre or with the same cast or crew members. That’s an out-of-catalog search experience!

Out-of-catalog search (OOC) is when you enter a search query that has no direct match in a catalog. This event frequently occurs in video streaming platforms that constantly purchase a variety of content from multiple vendors and production companies for a limited time. The absence of relevancy or mapping from a streaming company’s catalog to large knowledge bases of movies and shows can result in a sub-par search experience for customers that query OOC content, thereby lowering the interaction time with the platform. This mapping can be done by manually mapping frequent OOC queries to catalog content or can be automated using machine learning (ML).

In this post, we illustrate how to handle OOC by utilizing the power of the IMDb dataset (the premier source of global entertainment metadata) and knowledge graphs.

OpenSearch Service is a fully managed service that makes it easy for you to perform interactive log analytics, real-time application monitoring, website search, and more. OpenSearch is an open source, distributed search and analytics suite derived from Elasticsearch. OpenSearch Service offers the latest versions of OpenSearch, support for 19 versions of Elasticsearch (1.5 to 7.10 versions), as well as visualization capabilities powered by OpenSearch Dashboards and Kibana (1.5 to 7.10 versions). OpenSearch Service currently has tens of thousands of active customers with hundreds of thousands of clusters under management processing trillions of requests per month. OpenSearch Service offers kNN search, which can enhance search in use cases such as product recommendations, fraud detection, and image, video, and some specific semantic scenarios like document and query similarity. For more information about the natural language understanding-powered search functionalities of OpenSearch Service, refer to Building an NLU-powered search application with Amazon SageMaker and the Amazon OpenSearch Service KNN feature.

Solution overview

In this post, we present a solution to handle OOC situations through knowledge graph-based embedding search using the k-nearest neighbor (kNN) search capabilities of OpenSearch Service. The key AWS services used to implement this solution are OpenSearch Service, SageMaker, Lambda, and Amazon S3.

Check out Part 1 and Part 2 of this series to learn more about creating knowledge graphs and GNN embedding using Amazon Neptune ML.

Our OOC solution assumes that you have a combined KG obtained by merging a streaming company KG and IMDb KG. This can be done through simple text processing techniques that match titles along with the title type (movie, series, documentary), cast, and crew. Additionally, this joint knowledge graph has to be trained to generate knowledge graph embeddings through the pipelines mentioned in Part 1 and Part 2. The following diagram illustrates a simplified view of the combined KG.

To demonstrate the OOC search functionality with a simple example, we split the IMDb knowledge graph into customer-catalog and out-of-customer-catalog. We mark the titles that contain “Toy Story” as an out-of-customer catalog resource and the rest of the IMDb knowledge graph as customer catalog. In a scenario where the customer catalog is not enhanced or merged with external databases, a search for “toy story” would return any title that has the words “toy” or “story” in its metadata, with the OpenSearch text search. If the customer catalog was mapped to IMDb, it would be easier to glean that the query “toy story” doesn’t exist in the catalog and that the top matches in IMDb are “Toy Story,” “Toy Story 2,” “Toy Story 3,” “Toy Story 4,” and “Charlie: Toy Story” in decreasing order of relevance with text match. To get within-catalog results for each of these matches, we can generate five closest movies in customer catalog-based kNN embedding (of the joint KG) similarity through OpenSearch Service.

A typical OOC experience follows the flow illustrated in the following figure.

The following video shows the top five (number of hits) OOC results for the query “toy story” and relevant matches in the customer catalog (number of recommendations).

Here, the query is matched to the knowledge graph using text search in OpenSearch Service. We then map the embeddings of the text match to the customer catalog titles using the OpenSearch Service kNN index. Because the user query can’t be directly mapped to the knowledge graph entities, we use a two-step approach to first find title-based query similarities and then items similar to the title using knowledge graph embeddings. In the following sections, we walk through the process of setting up an OpenSearch Service cluster, creating and uploading knowledge graph indexes, and deploying the solution as a web application.
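
As a rough sketch of this two-step flow (not the exact Lambda code from the repo), the following opensearch-py snippet first runs a fuzzy text match against the title index and then a kNN query against the embedding index. The field names title and embedding, the domain endpoint, and the authentication setup (omitted here) are assumptions about the dataset and environment.

from opensearchpy import OpenSearch

# Assumed endpoint; in practice you would also configure IAM or basic auth.
client = OpenSearch(hosts=[{"host": "my-domain.us-east-1.es.amazonaws.com", "port": 443}],
                    use_ssl=True)

def out_of_catalog_search(query, num_hits=5, num_recs=5):
    # Step 1: fuzzy text match of the free-form query against IMDb titles.
    hits = client.search(index="ooc_text", body={
        "size": num_hits,
        "query": {"match": {"title": {"query": query, "fuzziness": "AUTO"}}},
    })["hits"]["hits"]

    results = []
    for hit in hits:
        # Step 2: kNN lookup of the matched title's KG embedding against catalog titles.
        # Assumes each indexed document also stores its embedding vector.
        neighbors = client.search(index="ooc_knn", body={
            "size": num_recs,
            "query": {"knn": {"embedding": {"vector": hit["_source"]["embedding"], "k": num_recs}}},
        })["hits"]["hits"]
        results.append({
            "ooc_title": hit["_source"]["title"],
            "recommendations": [n["_source"]["title"] for n in neighbors],
        })
    return results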

Prerequisites

To implement this solution, you should have an AWS account, familiarity with OpenSearch Service, SageMaker, Lambda, and AWS CloudFormation, and have completed the steps in Part 1 and Part 2 of this series.

Launch solution resources

The following architecture diagram shows the out-of-catalog workflow.

You will use the AWS Cloud Development Kit (CDK) to provision the resources required for the OOC search applications. The code to launch these resources performs the following operations:

  1. Creates a VPC for the resources.
  2. Creates an OpenSearch Service domain for the search application.
  3. Creates a Lambda function to process and load movie metadata and embeddings to OpenSearch Service indexes (**-LoadDataIntoOpenSearchLambda-**).
  4. Creates a Lambda function that takes as input the user query from a web app and returns relevant titles from OpenSearch (**-ReadFromOpenSearchLambda-**).
  5. Creates an API Gateway that adds an additional layer of security between the web app user interface and Lambda.

To get started, complete the following steps:

  1. Run the code and notebooks from Part 1 and Part 2.
  2. Navigate to the part3-out-of-catalog folder in the code repository.
  3. Launch the AWS CDK from the terminal with the command bash launch_stack.sh.
  4. Provide the two S3 file paths created in Part 2 as input:
    1. The S3 path to the movie embeddings CSV file.
    2. The S3 path to the movie node file.
  5. Wait until the script provisions all the required resources and finishes running.
  6. Copy the API Gateway URL that the AWS CDK script prints out and save it. (We use this for the Streamlit app later).

Create an OpenSearch Service domain

For illustration purposes, you create a search domain in a single Availability Zone on an r6g.large.search instance within a secure VPC and subnet. Note that the best practice would be to set up across three Availability Zones with one primary and two replica instances.
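
The AWS CDK stack provisions this domain for you; purely for illustration, an equivalent boto3 call might look like the following sketch. The engine version, volume size, and VPC subnet and security group IDs are placeholder assumptions.

import boto3

opensearch = boto3.client("opensearch")

opensearch.create_domain(
    DomainName="ooc-search-domain",
    EngineVersion="OpenSearch_1.3",          # placeholder version
    ClusterConfig={
        "InstanceType": "r6g.large.search",
        "InstanceCount": 1,                  # single AZ for illustration; use three AZs in production
    },
    EBSOptions={"EBSEnabled": True, "VolumeType": "gp3", "VolumeSize": 20},
    VPCOptions={
        "SubnetIds": ["subnet-0123456789abcdef0"],      # placeholder
        "SecurityGroupIds": ["sg-0123456789abcdef0"],   # placeholder
    },
)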

Create an OpenSearch Service index and upload data

You use Lambda functions (created using the AWS CDK launch stack command) to create the OpenSearch Service indexes. To start the index creation, complete the following steps:

  1. On the Lambda console, open the LoadDataIntoOpenSearchLambda Lambda function.
  2. On the Test tab, choose Test to create and ingest data into the OpenSearch Service index.

The following code for this Lambda function can be found in part3-out-of-catalog/cdk/ooc/lambdas/LoadDataIntoOpenSearchLambda/lambda_handler.py:

import os

# Helper functions such as merge_data, initialize_ops, create_index,
# ingest_data_into_ops, post_request, and post_request_emb are defined
# elsewhere in this lambda_handler.py file.
embedding_file = os.environ.get("embeddings_file")
movie_node_file = os.environ.get("movie_node_file")
print("Merging files")
merged_df = merge_data(embedding_file, movie_node_file)
print("Embeddings and metadata files merged")

print("Initializing OpenSearch client")
ops = initialize_ops()
indices = ops.indices.get_alias().keys()
print("Current indices are :", indices)

# This will take 5 minutes
print("Creating knn index")
# Create the index using knn settings. Creating OOC text is not needed
create_index('ooc_knn',ops)
print("knn index created!")

print("Uploading the data for knn index")
response = ingest_data_into_ops(merged_df, ops, ops_index='ooc_knn', post_method=post_request_emb)
print(response)
print("Upload complete for knn index")

print("Uploading the data for fuzzy word search index")
response = ingest_data_into_ops(merged_df, ops, ops_index='ooc_text', post_method=post_request)
print("Upload complete for fuzzy word search index")
# Create the response and add some extra content to support CORS
response = {
    "statusCode": 200,
    "headers": {
        "Access-Control-Allow-Origin": '*'
    },
    "isBase64Encoded": False
}

The function performs the following tasks:

  1. Loads the movie embeddings file and the IMDb KG movie node file that contains the movie metadata from the S3 file paths that were passed to the stack creation script launch_stack.sh.
  2. Merges the two input files to create a single dataframe for index creation.
  3. Initializes the OpenSearch Service client using the Boto3 Python library.
  4. Creates two indexes for text (ooc_text) and kNN embedding search (ooc_knn) and bulk uploads data from the combined dataframe through the ingest_data_into_ops function.

This data ingestion process takes 5–10 minutes and can be monitored through the Amazon CloudWatch logs on the Monitoring tab of the Lambda function.

You create two indexes to enable text-based search and kNN embedding-based search. The text search maps the free-form query the user enters to the titles of the movie. The kNN embedding search finds the k closest movies to the best text match from the KG latent space to return as outputs.
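
For reference, a minimal sketch of what creating these two indexes might look like with opensearch-py is shown below. The index.knn setting and knn_vector field type come from the OpenSearch kNN plugin, while the field names and the embedding dimension are assumptions about the dataset produced in Part 2.

from opensearchpy import OpenSearch

client = OpenSearch(hosts=[{"host": "my-domain.us-east-1.es.amazonaws.com", "port": 443}],
                    use_ssl=True)  # authentication omitted for brevity

# Text index for fuzzy title search.
client.indices.create(index="ooc_text", body={
    "mappings": {"properties": {"title": {"type": "text"}}},
})

# kNN index for embedding similarity search.
client.indices.create(index="ooc_knn", body={
    "settings": {"index": {"knn": True}},
    "mappings": {"properties": {
        "title": {"type": "text"},
        "embedding": {"type": "knn_vector", "dimension": 128},  # assumed KG embedding size
    }},
})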

Deploy the solution as a local web application

Now that you have a working text search and kNN index on OpenSearch Service, you’re ready to build a ML-powered web app.

We use the streamlit Python package to create a front-end illustration for this application. The IMDb-Knowledge-Graph-Blog/part3-out-of-catalog/run_imdb_demo.py Python file in our GitHub repo has the required code to launch a local web app to explore this capability.

To run the code, complete the following steps:

  1. Install the streamlit and aws_requests_auth Python packages in your local virtual Python environment using the following commands in your terminal:

pip install streamlit
pip install aws-requests-auth

  2. Replace the placeholder for the API Gateway URL in the code as follows with the one created by the AWS CDK:

api = '<ENTER URL OF THE API GATEWAY HERE>/opensearch-lambda?q={query_text}&numMovies={num_movies}&numRecs={num_recs}'

  3. Launch the web app with the command streamlit run run_imdb_demo.py from your terminal.

This script launches a Streamlit web app that can be accessed in your web browser. The URL of the web app can be retrieved from the script output, as shown in the following screenshot.

The app accepts new search strings, number of hits, and number of recommendations. The number of hits corresponds to how many matching OOC titles we should retrieve from the external (IMDb) catalog. The number of recommendations corresponds to how many nearest neighbors we should retrieve from the customer catalog based on kNN embedding search. See the following code:

search_text=st.sidebar.text_input("Please enter search text to find movies and recommendations")
num_movies= st.sidebar.slider('Number of search hits', min_value=0, max_value=5, value=1)
recs_per_movie= st.sidebar.slider('Number of recommendations per hit', min_value=0, max_value=10, value=5)
if st.sidebar.button('Find'):
    resp= get_movies()

This input (query, number of hits and recommendations) is passed to the **-ReadFromOpenSearchLambda-** Lambda function created by the AWS CDK through the API Gateway request. This is done in the following function:

def get_movies():
    result = requests.get(api.format(query_text=search_text, num_movies=num_movies, num_recs=recs_per_movie)).json()

The results returned by the Lambda function from OpenSearch Service are passed to API Gateway and displayed in the Streamlit app.

Clean up

You can delete all the resources created by the AWS CDK through the command npx cdk destroy --app "python3 appy.py" --all in the same instance (inside the cdk folder) that was used to launch the stack (see the following screenshot).

Conclusion

In this post, we showed you how to create a solution for OOC search using text and kNN-based search using SageMaker and OpenSearch Service. You used custom knowledge graph model embeddings to find nearest neighbors in your catalog to that of IMDb titles. You can now, for example, search for “The Rings of Power,” a fantasy series developed by Amazon Prime Video, on other streaming platforms and reason about how they could have optimized the search results.

For more information about the code sample in this post, see the GitHub repo. To learn more about collaborating with the Amazon ML Solutions Lab to build similar state-of-the-art ML applications, see Amazon Machine Learning Solutions Lab. For more information on licensing IMDb datasets, visit developer.imdb.com.


About the Authors

Divya Bhargavi is a Data Scientist and Media and Entertainment Vertical Lead at the Amazon ML Solutions Lab, where she solves high-value business problems for AWS customers using machine learning. She works on image/video understanding, knowledge graph recommendation systems, and predictive advertising use cases.

Gaurav Rele is a Data Scientist at the Amazon ML Solution Lab, where he works with AWS customers across different verticals to accelerate their use of machine learning and AWS Cloud services to solve their business challenges.

Matthew Rhodes is a Data Scientist I working in the Amazon ML Solutions Lab. He specializes in building Machine Learning pipelines that involve concepts such as Natural Language Processing and Computer Vision.

Karan Sindwani is a Data Scientist at Amazon ML Solutions Lab, where he builds and deploys deep learning models. He specializes in the area of computer vision. In his spare time, he enjoys hiking.

Soji Adeshina is an Applied Scientist at AWS where he develops graph neural network-based models for machine learning on graphs tasks with applications to fraud & abuse, knowledge graphs, recommender systems, and life sciences. In his spare time, he enjoys reading and cooking.

Vidya Sagar Ravipati is a Manager at the Amazon ML Solutions Lab, where he leverages his vast experience in large-scale distributed systems and his passion for machine learning to help AWS customers across different industry verticals accelerate their AI and cloud adoption.

AWS positioned in the Leaders category in the 2022 IDC MarketScape for APEJ AI Life-Cycle Software Tools and Platforms Vendor Assessment

The recently published IDC MarketScape: Asia/Pacific (Excluding Japan) AI Life-Cycle Software Tools and Platforms 2022 Vendor Assessment positions AWS in the Leaders category. This was the first and only APEJ-specific analyst evaluation focused on AI life-cycle software from IDC. The vendors evaluated for this MarketScape offer various software tools needed to support end-to-end machine learning (ML) model development, including data preparation, model building and training, model operation, evaluation, deployment, and monitoring. The tools are typically used by data scientists and ML developers from experimentation to production deployment of AI and ML solutions.

AI life-cycle tools are essential to productize AI/ML solutions. They go well beyond AI/ML experimentation, supporting deployment anywhere, performance at scale, cost optimization, and, increasingly important, systematic model risk management: explainability, robustness, drift, privacy protection, and more. Businesses need these tools to unlock the value of enterprise data assets at greater scale and faster speed.

Vendor Requirements for the IDC MarketScape

To be considered for the MarketScape, the vendor had to provide software products for various aspects of the end-to-end ML process under independent product stock-keeping units (SKUs) or as part of a general AI software platform. The products had to be based on the company’s own IP, and the products should have generated software license revenue or consumption-based software revenue for at least 12 months in APEJ as of March 2022. The company had to be among the top 15 vendors by the reported revenues of 2020–2021 in the APEJ region, according to IDC’s AI Software Tracker. AWS met the criteria and was evaluated by IDC along with eight other vendors.

The result of IDC’s comprehensive evaluation was published October 2022 in the IDC MarketScape: Asia/Pacific (Excluding Japan) AI Life-Cycle Software Tools and Platforms 2022 Vendor Assessment. AWS is positioned in the Leaders category based on current capabilities. The AWS strategy is to make continuous investments in AI/ML services to help customers innovate with AI and ML.

AWS position

“AWS is placed in the Leaders category in this exercise, receiving higher ratings in various assessment categories—the breadth of tooling services provided, options to lower cost for performance, quality of customer service and support, and pace of product innovation, to name a few.”

– Jessie Danqing Cai, Associate Research Director, Big Data & Analytics Practice, IDC Asia/Pacific.

The visual below is part of the MarketScape and shows the AWS position evaluated by capabilities and strategies.

The IDC MarketScape vendor analysis model is designed to provide an overview of the competitive fitness of ICT suppliers in a given market. The research methodology utilizes a rigorous scoring methodology based on both qualitative and quantitative criteria that results in a single graphical illustration of each vendor’s position within a given market. The Capabilities score measures vendor product, go-to-market, and business execution in the short term. The Strategy score measures alignment of vendor strategies with customer requirements in a 3–5-year time frame. Vendor market share is represented by the size of the icons.

Amazon SageMaker evaluated as part of the MarketScape

As part of the evaluation, IDC dove deep into Amazon SageMaker capabilities. SageMaker is a fully managed service to build, train, and deploy ML models for any use case with fully managed infrastructure, tools, and workflows. Since the launch of SageMaker in 2017, over 250 capabilities and features have been released.

ML practitioners such as data scientists, data engineers, business analysts, and MLOps professionals use SageMaker to break down barriers across each step of the ML workflow through their choice of integrated development environments (IDEs) or no-code interfaces. Starting with data preparation, SageMaker makes it easy to access, label, and process large amounts of structured data (tabular data) and unstructured data (photo, video, geospatial, and audio) for ML. After data is prepared, SageMaker offers fully managed notebooks for model building and reduces training time from hours to minutes with optimized infrastructure. SageMaker makes it easy to deploy ML models to make predictions at the best price-performance for any use case through a broad selection of ML infrastructure and model deployment options. Finally, the MLOps tools in SageMaker help you scale model deployment, reduce inference costs, manage models more effectively in production, and reduce operational burden.

The MarketScape calls out three strengths for AWS:

  • Functionality and offering – SageMaker provides a broad and deep set of tools for data preparation, model training, and deployment, including AWS-built silicon: AWS Inferentia for inference workloads and AWS Trainium for training workloads. SageMaker supports model explainability and bias detection through Amazon SageMaker Clarify.
  • Service delivery – SageMaker is natively available on AWS, the second largest public cloud platform in the APEJ region (based on IDC Public Cloud Services Tracker, IaaS+PaaS, 2021 data), with regions in Japan, Australia, New Zealand, Singapore, India, Indonesia, South Korea, and Greater China. Local zones are available to serve customers in ASEAN countries: Thailand, the Philippines, and Vietnam.
  • Growth opportunities – AWS actively contributes to open-source projects such as Gluon and engages with regional developer and student communities through many events, online courses, and Amazon SageMaker Studio Lab, a no-cost SageMaker notebook environment.

SageMaker launches at re:Invent 2022

SageMaker innovation continued at AWS re:Invent 2022, with eight new capabilities. The launches included three new capabilities for ML model governance. As the number of models and users within an organization increases, it becomes harder to set least-privilege access controls and establish governance processes to document model information (for example, input datasets, training environment information, model-use description, and risk rating). After models are deployed, customers also need to monitor for bias and feature drift to ensure they perform as expected. A new role manager, model cards, and model dashboard simplify access control and enhance transparency to support ML model governance.

There were also three launches related to Amazon SageMaker Studio notebooks. SageMaker Studio notebooks gives practitioners a fully managed notebook experience, from data exploration to deployment. As teams grow in size and complexity, dozens of practitioners may need to collaboratively develop models using notebooks. AWS continues to offer the best notebook experience for users, with the launch of three new features that help you coordinate and automate notebook code.

To support model deployment, new capabilities in SageMaker help you run shadow tests to evaluate a new ML model before production release by testing its performance against the currently deployed model. Shadow testing can help you catch potential configuration errors and performance issues before they impact end-users.

Finally, SageMaker launched support for geospatial ML, allowing data scientists and ML engineers to easily build, train, and deploy ML models using geospatial data. You can access geospatial data sources, purpose-built processing operations, pre-trained ML models, and built-in visualization tools to run geospatial ML faster and at scale.

Today, tens of thousands of customers use Amazon SageMaker to train models with billions of parameters and make over 1 trillion predictions per month. To learn more about SageMaker, visit the webpage and explore how fully managed infrastructure, tools, and workflows can help you accelerate ML model development.


About the author

Kimberly Madia is a Principal Product Marketing Manager with AWS Machine Learning. Her goal is to make it easy for customers to build, train, and deploy machine learning models using Amazon SageMaker. For fun outside work, Kimberly likes to cook, read, and run on the San Francisco Bay Trail.

How Thomson Reuters delivers personalized content subscription plans at scale using Amazon Personalize

This post is co-written by Hesham Fahim from Thomson Reuters.

Thomson Reuters (TR) is one of the world’s most trusted information organizations for businesses and professionals. It provides companies with the intelligence, technology, and human expertise they need to find trusted answers, enabling them to make better decisions more quickly. TR’s customers span across the financial, risk, legal, tax, accounting, and media markets.

Thomson Reuters provides market-leading products in the Tax, Legal, and News domains, which users can sign up for using a subscription licensing model. To enhance this experience for their customers, TR wanted to create a centralized recommendations platform that allowed their sales team to suggest the most relevant subscription packages, raising awareness of products that could help customers serve their markets better through tailored product selections.

Prior to building this centralized platform, TR had a legacy rules-based engine to generate renewal recommendations. The rules in this engine were predefined and written in SQL, which, aside from posing a challenge to manage, also struggled to cope with the proliferation of data from TR’s various integrated data sources. TR customer data is changing at a faster rate than the business rules can evolve to reflect changing customer needs. The key requirement for TR’s new machine learning (ML)-based personalization engine was centered around an accurate recommendation system that takes into account recent customer trends. The desired solution would be one with low operational overhead, the ability to accelerate delivering business goals, and a personalization engine that could be constantly trained with up-to-date data to deal with changing consumer habits and new products.

Personalizing the renewal recommendations based on what would be valuable products for TR’s customers was an important business challenge for the sales and marketing team. TR has a wealth of data that could be used for personalization that has been collected from customer interactions and stored within a centralized data warehouse. TR has been an early adopter of ML with Amazon SageMaker, and their maturity in the AI/ML domain meant that they had collated a significant dataset of relevant data within a data warehouse, which the team could train a personalization model with. TR has continued their AI/ML innovation and has recently developed a revamped recommendation platform using Amazon Personalize, which is a fully managed ML service that uses user interactions and items to generate recommendations for users. In this post, we explain how TR used Amazon Personalize to build a scalable, multi-tenanted recommender system that provides the best product subscription plans and associated pricing to their customers.

Solution architecture

The solution had to be designed considering TR’s core operations around understanding users through data; providing these users with personalized and relevant content from a large corpus of data was a mission-critical requirement. Having a well-designed recommendation system is key to getting quality recommendations that are customized to each user’s requirements.

The solution required collecting and preparing user behavior data, training an ML model using Amazon Personalize, generating personalized recommendations through the trained model, and driving marketing campaigns with the personalized recommendations.

TR wanted to take advantage of AWS managed services where possible to simplify operations and reduce undifferentiated heavy lifting. TR used AWS Glue DataBrew and AWS Batch jobs to perform the extract, transform, and load (ETL) jobs in the ML pipelines, and SageMaker along with Amazon Personalize to tailor the recommendations. From a training data volume and runtime perspective, the solution needed to be scalable to process millions of records within the time frame already committed to downstream consumers in TR’s business teams.

The following sections explain the components involved in the solution.

ML training pipeline

Interactions between the users and the content is collected in the form of clickstream data, which is generated as the customer clicks on the content. TR analyzes if this is part of their subscription plan or beyond their subscription plan so that they can provide additional details about the price and plan enrollment options. The user interactions data from various sources is persisted in their data warehouse.

The following diagram illustrates the ML training pipeline.
ML engine training pipeline
The pipeline starts with an AWS Batch job that extracts the data from the data warehouse and transforms the data to create interactions, users, and items datasets.

The following datasets are used to train the model:

  • Structured product data – Subscriptions, orders, product catalog, transactions, and customer details
  • Semi-structured behavior data – Users, usage, and interactions

This transformed data is stored in an Amazon Simple Storage Service (Amazon S3) bucket and imported into Amazon Personalize for ML training. Because TR wants to generate personalized recommendations for their users, they use the USER_PERSONALIZATION recipe to train ML models on their custom data, which is referred to as creating a solution version. After the solution version is created, it’s used for generating personalized recommendations for the users.
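
As a rough boto3 sketch of this step, training with the USER_PERSONALIZATION recipe and creating a solution version might look like the following; the solution name and dataset group ARN are placeholders, and the datasets are assumed to be imported already.

import boto3

personalize = boto3.client("personalize")

# Train on the datasets already imported into the dataset group.
solution = personalize.create_solution(
    name="tr-renewal-recommendations",  # hypothetical name
    datasetGroupArn="arn:aws:personalize:us-east-1:123456789012:dataset-group/tr-renewals",
    recipeArn="arn:aws:personalize:::recipe/aws-user-personalization",
)

# A solution version is the trained model artifact used for recommendations.
version = personalize.create_solution_version(
    solutionArn=solution["solutionArn"],
    trainingMode="FULL",
)
print(version["solutionVersionArn"])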

The entire workflow is orchestrated using AWS Step Functions. The alerts and notifications are captured and published to Microsoft Teams using Amazon Simple Notification Service (Amazon SNS) and Amazon EventBridge.

Generating personalized recommendations pipeline: Batch inference

Customer requirements and preferences change very often, and the latest interactions captured in clickstream data serves as a key data point to understand the changing preferences of the customer. To adapt to ever-changing customer preferences, TR generates personalized recommendations on a daily basis.

The following diagram illustrates the pipeline to generate personalized recommendations.
Pipeline to generate personalized recommendations in Batch
A DataBrew job extracts the data from the TR data warehouse for the users who are eligible to receive recommendations during renewal based on their current subscription plan and recent activity. The DataBrew visual data preparation tool makes it easy for TR data analysts and data scientists to clean and normalize data to prepare it for analytics and ML. The ability to choose from over 250 pre-built transformations within the visual data preparation tool to automate data preparation tasks, all without the need to write any code, was an important feature. The DataBrew job generates an incremental dataset for interactions and input for the batch recommendations job, and stores the output in an S3 bucket. The newly generated incremental dataset is imported into the interactions dataset. When the incremental dataset import job is successful, an Amazon Personalize batch recommendations job is triggered with the input data. Amazon Personalize generates the latest recommendations for the users provided in the input data and stores them in a recommendations S3 bucket.
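
A hedged sketch of kicking off the batch recommendations step is shown below; the S3 paths, role ARN, solution version ARN, and result count are placeholders for the values produced earlier in the pipeline.

import boto3

personalize = boto3.client("personalize")

personalize.create_batch_inference_job(
    jobName="daily-renewal-recommendations",     # hypothetical name
    solutionVersionArn="arn:aws:personalize:us-east-1:123456789012:solution/tr-renewal-recommendations/<version>",
    numResults=25,                               # recommendations per user
    jobInput={"s3DataSource": {"path": "s3://tr-recs/input/users.json"}},
    jobOutput={"s3DataDestination": {"path": "s3://tr-recs/output/"}},
    roleArn="arn:aws:iam::123456789012:role/PersonalizeBatchRole",
)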

Price optimization is the last step before the newly formed recommendations are ready to use. TR runs a cost optimization job on the recommendations generated and uses SageMaker to run custom models on the recommendations as part of this final step. An AWS Glue job curates the output generated from Amazon Personalize and transforms it into the input format required by the SageMaker custom model. TR is able to take advantage of the breadth of services that AWS provides, using both Amazon Personalize and SageMaker in the recommendation platform to tailor recommendations based on the type of customer firm and end-users.

The entire workflow is decoupled and orchestrated using Step Functions, which gives the flexibility of scaling the pipeline depending on the data processing requirements. The alerts and notifications are captured using Amazon SNS and EventBridge.

Driving email campaigns

The recommendations generated along with the pricing results are used to drive email campaigns to TR’s customers. An AWS Batch job is used to curate the recommendations for each customer and enrich them with the optimized pricing information. These recommendations are ingested into TR’s campaigning systems, which drive the following email campaigns:

  • Automated subscription renewal or upgrade campaigns with new products that might interest the customer
  • Mid-contract renewal campaigns with better offers and more relevant products and legal content materials

The information from this process is also replicated to the customer portal so customers reviewing their current subscription can see the new renewal recommendations. TR has seen a higher conversion rate from email campaigns, leading to increased sales orders, since implementing the new recommendation platform.

What’s next: Real-time recommendations pipeline

Customer requirements and shopping behaviors change in real time, and adapting recommendations to the real-time changes is key to serving the right content. After seeing a great success deploying a batch recommendation system, TR is now planning to take this solution to the next level by implementing a real-time recommendations pipeline to generate recommendations using Amazon Personalize.

The following diagram illustrates the architecture to provide real-time recommendations.
Real-time recommendations pipeline
The real-time integration starts with collecting the live user engagement data and streaming it to Amazon Personalize. As the users are interacting with TR’s applications, they generate clickstream events, which are published into Amazon Kinesis Data Streams. Then the events are ingested into TR’s centralized streaming platform, which is built on top of Amazon Managed Streaming for Apache Kafka (Amazon MSK). Amazon MSK makes it easy to ingest and process streaming data in real time with fully managed Apache Kafka. In this architecture, Amazon MSK serves as a streaming platform and performs any data transformations required on the raw incoming clickstream events. Then an AWS Lambda function is triggered to filter the events to the schema compatible with the Amazon Personalize dataset and push those events to an Amazon Personalize event tracker using the PutEvents API. This allows Amazon Personalize to learn from your users’ most recent behavior and include relevant items in recommendations.
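
A minimal sketch of what that Lambda function might send, assuming an event tracker already exists; the tracking ID and event type are placeholders.

import boto3
from datetime import datetime

personalize_events = boto3.client("personalize-events")

def publish_clickstream_event(user_id, session_id, item_id):
    # Stream one interaction so Amazon Personalize can adapt to recent behavior.
    personalize_events.put_events(
        trackingId="<EVENT_TRACKER_TRACKING_ID>",  # placeholder for your event tracker ID
        userId=user_id,
        sessionId=session_id,
        eventList=[{
            "sentAt": datetime.now(),
            "eventType": "click",                  # assumed event type
            "itemId": item_id,
        }],
    )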

TR’s web applications invoke an API deployed in Amazon API Gateway to get recommendations, which triggers a Lambda function to invoke a GetRecommendations API call with Amazon Personalize. Amazon Personalize provides the latest set of personalized recommendations curated to the user behavior, which are provided back to the web applications via Lambda and API Gateway.
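
A sketch of the Lambda handler behind that API might look like the following, assuming a deployed campaign ARN is available; the environment variable, query parameter name, and result count are placeholders.

import json
import os
import boto3

personalize_runtime = boto3.client("personalize-runtime")

def lambda_handler(event, context):
    # API Gateway passes the user ID as a query string parameter (assumed name).
    user_id = event["queryStringParameters"]["userId"]
    response = personalize_runtime.get_recommendations(
        campaignArn=os.environ["CAMPAIGN_ARN"],   # placeholder environment variable
        userId=user_id,
        numResults=10,
    )
    items = [item["itemId"] for item in response["itemList"]]
    return {"statusCode": 200, "body": json.dumps({"recommendations": items})}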

With this real-time architecture, TR can serve their customers with personalized recommendations curated to their most recent behavior and serve their needs better.

Conclusion

In this post, we showed you how TR used Amazon Personalize and other AWS services to implement a recommendation engine. Amazon Personalize enabled TR to accelerate the development and deployment of high-performance models to provide recommendations to their customers. TR is able to onboard a new suite of products within weeks now, compared to months earlier. With Amazon Personalize and SageMaker, TR is able to elevate the customer experience with better content subscription plans and prices for their customers.

If you enjoyed reading this blog and would like to learn more about Amazon Personalize and how it can help your organization build recommendation systems, please see the developer guide.


About the Authors

Hesham Fahim is a Lead Machine Learning Engineer and Personalization Engine Architect at Thomson Reuters. He has worked with organizations in academia and industry ranging from large enterprises to mid-sized startups. With a focus on scalable deep learning architectures, he has experience in mobile robotics, biomedical image analysis, and recommender systems. Away from computers, he enjoys astrophotography, reading, and long-distance biking.

Srinivasa Shaik is a Solutions Architect at AWS based in Boston. He helps Enterprise customers to accelerate their journey to the cloud. He is passionate about containers and machine learning technologies. In his spare time, he enjoys spending time with his family, cooking, and traveling.

Vamshi Krishna Enabothala is a Sr. Applied AI Specialist Architect at AWS. He works with customers from different sectors to accelerate high-impact data, analytics, and machine learning initiatives. He is passionate about recommendation systems, NLP, and computer vision areas in AI and ML. Outside of work, Vamshi is an RC enthusiast, building RC equipment (planes, cars, and drones), and also enjoys gardening.

Simone Zucchet is a Senior Solutions Architect at AWS. With over 6 years of experience as a Cloud Architect, Simone enjoys working on innovative projects that help transform the way organizations approach business problems. He helps support large enterprise customers at AWS and is part of the Machine Learning TFC. Outside of his professional life, he enjoys working on cars and photography.
