3D Artist ‘CG Geek’ Builds Massive Sci-Fi World in Record Time This Week ‘In the NVIDIA Studio’

Editor’s note: This post is part of our weekly In the NVIDIA Studio series, which celebrates featured artists, offers creative tips and tricks, and demonstrates how NVIDIA Studio technology improves creative workflows. We’re also deep diving on new GeForce RTX 40 Series GPU features, technologies and resources, and how they dramatically accelerate content creation.

3D and animation extraordinaire CG Geek completed an ambitious design challenge this week In the NVIDIA Studio — building a massive, sci-fi-inspired 3D world in only three days. The creation of the world, dubbed The Fullness of Time, was fast-tracked by his GeForce RTX 4090 GPU.

72 Hours to Build a Sci-Fi World

Animator and visual effects artist CG Geek teaches aspiring artists how to get started on his popular YouTube channel. He also shares tutorials on Blender, his favorite 3D app because “it’s open source, and the community is always challenging one another to push limits even further,” he said.

To see how far those limits could be pushed, CG Geek kicked off a timed design challenge last week as part of CES, putting together a fully rendered and animated project in only three days — powered by NVIDIA Studio technologies and his GeForce RTX 4090 GPU.

The artist polled his community on Instagram, Twitter and YouTube for a genre to use as a starting point for the project.

Sci-fi was the clear winner, so he envisioned what a far-future city skyline would look like. The first step was to populate the space with futuristic 3D buildings and skyscrapers.

CG Geek formed simple shapes in Blender, scaling them to match the sizes of real-world buildings. He then added materials and reflections to create beautifully textured structures before turning to geometry nodes, or geo nodes, a relatively recent addition to Blender and a crucial aspect of 3D modeling.

Geo nodes make modeling procedural rather than linear. The traditional process of constructing objects follows a linear pattern, with one tool used after the next and each step reversible only by manual undo operations. Geo nodes, by contrast, allow for non-linear, non-destructive workflows and the instancing of objects, creating incredibly detailed scenes from small amounts of data.
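For readers who script Blender, here is a minimal sketch (not CG Geek’s actual setup) of attaching a Geometry Nodes modifier to an object from Python; the object name “Building” is a hypothetical placeholder:

import bpy

# Hypothetical object name; use any mesh object in your scene.
obj = bpy.data.objects["Building"]

# Add a Geometry Nodes modifier and give it a fresh node tree.
mod = obj.modifiers.new(name="ProceduralDetail", type='NODES')
tree = bpy.data.node_groups.new("ProceduralDetail", 'GeometryNodeTree')
mod.node_group = tree

# Group input/output nodes form the skeleton; scattering, instancing and other
# procedural steps are then wired in between without destroying the base mesh.
tree.nodes.new("NodeGroupInput")
tree.nodes.new("NodeGroupOutput")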

Sculpting of the 3D world is nearly complete.

CG Geek scanned objects using his iPhone to create realistic 3D models from photos. He then used Adobe Photoshop to apply detailed textures, one of 30 GPU-accelerated features made possible by his GeForce RTX 4090 GPU. The RTX-accelerated Super Resolution feature, which uses AI to upscale images with higher quality, was especially useful for exporting textures across the entire piece, CG Geek said.

CG Geek added fine details like ivy and realistic wear and tear to his sci-fi buildings until he reached the desired look.

The process he used during the challenge is covered in a tutorial on building detailed, low-poly sci-fi buildings in a matter of minutes:

CG Geek’s RTX 4090 GPU enables him to use Blender Cycles’ RTX-accelerated, AI-powered OptiX ray tracing in the viewport for interactive, photorealistic movement within such a detailed environment. This virtually eliminates wait times, allowing him to create at the speed of his imagination.

CG Geek can play back the entire animation in real time without exporting, thanks to the power of the RTX 4090 GPU.

The artist quickly and easily applied realistic textures for the sand and water as well as animations. Final renders were delivered quickly with RTX-accelerated OptiX ray tracing in Blender Cycles.

It took CG Geek just 21 hours to build the futuristic metropolis and 10 hours to render it at 4K resolution.

“Currently, NVIDIA stands alone at the top of high-performance GPUs for 3D tasks like Blender,” he said. ”For real-time editing workflows, nothing comes close to beating the RTX 4090 GPU in speed.”

3D artist CG Geek.

View more of CG Geek’s work and tutorials.

Five-to-Nine Hustle, Powered by NVIDIA Studio

The hours from nine to five are typically filled with a job, classes or other responsibilities. For many artists, it’s from five to nine that the real creativity kicks in and the inspirational juices start flowing.

Make the most of your side hustling.

More than ever, creators are turning their passions into opportunities and monetizing their side hustles. NVIDIA Studio is celebrating these entrepreneurs and helping them learn, explore and take their creative endeavors to the next level:

  • With technology and resources — the latest advances in GPU-acceleration and AI-powered features help get the job done faster, plus Studio Drivers add creative app optimization and reliability to systems.
  • With education — hundreds of select tutorials, free to the public and created by creative professionals, offer everything from quick tricks and tips to multipart, in-depth series to elevate and expand the skill sets of content creators.
  • With inspiration — experience the creative journeys of interdimensional Studio artists, moving storytellers and esteemed streamers across creative fields in 3D animation, video editing, graphic design, photography and more.

Begin your side hustle journey with NVIDIA Studio.

#NewYearNewArt Challenge 

The latest NVIDIA Studio community challenge has kicked off: #NewYearNewArt.

With a new year will come new art, and we’d love to see yours! Use the hashtag #NewYearNewArt and tag @NVIDIAStudio to show off recent creations for a chance to be featured on our channels.

Access tutorials on the Studio YouTube channel and get updates directly in your inbox by subscribing to the Studio newsletter.

Read More

Forecasting Potential Misuses of Language Models for Disinformation Campaigns—and How to Reduce Risk

OpenAI researchers collaborated with Georgetown University’s Center for Security and Emerging Technology and the Stanford Internet Observatory to investigate how large language models might be misused for disinformation purposes. The collaboration included an October 2021 workshop bringing together 30 disinformation researchers, machine learning experts, and policy analysts, and culminated in a co-authored report building on more than a year of research. This report outlines the threats that language models pose to the information environment if used to augment disinformation campaigns and introduces a framework for analyzing potential mitigations. Read the full report here.

Read report

As generative language models improve, they open up new possibilities in fields as diverse as healthcare, law, education and science. But, as with any new technology, it is worth considering how they can be misused. Against the backdrop of recurring online influence operations—covert or deceptive efforts to influence the opinions of a target audience—the paper asks:

How might language models change influence operations, and what steps can be taken to mitigate this threat?

Our work brought together different backgrounds and expertise—researchers with grounding in the tactics, techniques, and procedures of online disinformation campaigns, as well as machine learning experts in the generative artificial intelligence field—to base our analysis on trends in both domains.

We believe that it is critical to analyze the threat of AI-enabled influence operations and outline steps that can be taken before language models are used for influence operations at scale. We hope our research will inform policymakers that are new to the AI or disinformation fields, and spur in-depth research into potential mitigation strategies for AI developers, policymakers, and disinformation researchers.

How Could AI Affect Influence Operations?

When researchers evaluate influence operations, they consider the actors, behaviors, and content. The widespread availability of technology powered by language models has the potential to impact all three facets:

  1. Actors: Language models could drive down the cost of running influence operations, placing them within reach of new actors and actor types. Likewise, propagandists-for-hire that automate production of text may gain new competitive advantages.

  2. Behavior: Influence operations with language models will become easier to scale, and tactics that are currently expensive (e.g., generating personalized content) may become cheaper. Language models may also enable new tactics to emerge—like real-time content generation in chatbots.

  3. Content: Text creation tools powered by language models may generate more impactful or persuasive messaging compared to propagandists, especially those who lack requisite linguistic or cultural knowledge of their target. They may also make influence operations less discoverable, since they repeatedly create new content without needing to resort to copy-pasting and other noticeable time-saving behaviors.

Our bottom-line judgment is that language models will be useful for propagandists and will likely transform online influence operations. Even if the most advanced models are kept private or controlled through application programming interface (API) access, propagandists will likely gravitate towards open-source alternatives and nation states may invest in the technology themselves.

Critical Unknowns

Many factors impact whether, and the extent to which, language models will be used in influence operations. Our report dives into many of these considerations. For example:

  • What new capabilities for influence will emerge as a side effect of well-intentioned research or commercial investment? Which actors will make significant investments in language models?
  • When will easy-to-use tools to generate text become publicly available? Will it be more effective to engineer specific language models for influence operations, rather than apply generic ones?
  • Will norms develop that disincentivize actors who wage AI-enabled influence operations? How will actor intentions develop?

While we expect to see diffusion of the technology as well as improvements in the usability, reliability, and efficiency of language models, many questions about the future remain unanswered. Because these are critical possibilities that can change how language models may impact influence operations, additional research to reduce uncertainty is highly valuable.

A Framework for Mitigations

To chart a path forward, the report lays out key stages in the language model-to-influence operation pipeline. Each of these stages is a point for potential mitigations. To successfully wage an influence operation leveraging a language model, propagandists would require that: (1) a model exists, (2) they can reliably access it, (3) they can disseminate content from the model, and (4) an end user is affected. Many possible mitigation strategies fall along these four steps, as shown below.

Illustrative mitigations at each stage of the pipeline:

  1. Model Construction: AI developers build models that are more fact-sensitive; developers spread radioactive data to make generative models detectable; governments impose restrictions on data collection; governments impose access controls on AI hardware.
  2. Model Access: AI providers impose stricter usage restrictions on language models; AI providers develop new norms around model release; AI providers close security vulnerabilities.
  3. Content Dissemination: platforms and AI providers coordinate to identify AI content; platforms require “proof of personhood” to post; entities that rely on public input take steps to reduce their exposure to misleading AI content; digital provenance standards are widely adopted.
  4. Belief Formation: institutions engage in media literacy campaigns; developers provide consumer-focused AI tools.

If a Mitigation Exists, is it Desirable?

Just because a mitigation could reduce the threat of AI-enabled influence operations does not mean that it should be put into place. Some mitigations carry their own downside risks. Others may not be feasible. While we do not explicitly endorse or rate mitigations, the paper provides a set of guiding questions for policymakers and others to consider:

  • Technical Feasibility: Is the proposed mitigation technically feasible? Does it require significant changes to technical infrastructure?
  • Social Feasibility: Is the mitigation feasible from a political, legal, and institutional perspective? Does it require costly coordination, are key actors incentivized to implement it, and is it actionable under existing law, regulation, and industry standards?
  • Downside Risk: What are the potential negative impacts of the mitigation, and how significant are they?
  • Impact: How effective would a proposed mitigation be at reducing the threat?

We hope this framework will spur ideas for other mitigation strategies, and that the guiding questions will help relevant institutions begin to consider whether various mitigations are worth pursuing.

This report is far from the final word on AI and the future of influence operations. Our aim is to define the present environment and to help set an agenda for future research. We encourage anyone interested in collaborating or discussing relevant projects to connect with us. For more, read the full report here.

Read report

Report Authors

Josh A. Goldstein (Georgetown University’s Center for Security and Emerging Technology)
Girish Sastry (OpenAI)
Micah Musser (Georgetown University’s Center for Security and Emerging Technology)
Renée DiResta (Stanford Internet Observatory)
Matthew Gentzel (Longview Philanthropy) (work done at OpenAI)
Katerina Sedova (US Department of State) (work done at Center for Security and Emerging Technology prior to government service)


The Greenest Generation: NVIDIA, Intel and Partners Supercharge AI Computing Efficiency

AI is at the heart of humanity’s most transformative innovations — from developing COVID vaccines at unprecedented speeds and diagnosing cancer to powering autonomous vehicles and understanding climate change.

Virtually every industry will benefit from adopting AI, but the technology has become more resource intensive as neural networks have increased in complexity. To avoid placing unsustainable demands on electricity generation to run this computing infrastructure, the underlying technology must be as efficient as possible.

Accelerated computing powered by NVIDIA GPUs and the NVIDIA AI platform offer the efficiency that enables data centers to sustainably drive the next generation of breakthroughs.

And now, timed with the launch of 4th Gen Intel Xeon Scalable processors, NVIDIA and its partners have kicked off a new generation of accelerated computing systems that are built for energy-efficient AI. When combined with NVIDIA H100 Tensor Core GPUs, these systems can deliver dramatically higher performance, greater scale and higher efficiency than the prior generation, providing more computation and problem-solving per watt.

The new Intel CPUs will be used in NVIDIA DGX H100 systems, as well as in more than 60 servers featuring H100 GPUs from NVIDIA partners around the world.

Supercharging Speed, Efficiency and Savings for Enterprise AI

The coming NVIDIA and Intel-powered systems will help enterprises run workloads an average of 25x more efficiently than traditional CPU-only data center servers. This incredible performance per watt means less power is needed to get jobs done, which helps ensure the power available to data centers is used as efficiently as possible to supercharge the most important work.

Compared to prior-generation accelerated systems, this new generation of NVIDIA-accelerated servers speeds up training and inference, boosting energy efficiency by 3.5x, which translates into real cost savings, with AI data centers delivering over 3x lower total cost of ownership.

New 4th Gen Intel Xeon CPUs Move More Data to Accelerate NVIDIA AI

Among the features of the new 4th Gen Intel Xeon CPU is support for PCIe Gen 5, which can double the data transfer rates from CPU to NVIDIA GPUs and networking. Increased PCIe lanes allow for a greater density of GPUs and high-speed networking within each server.

Faster memory bandwidth also improves the performance of data-intensive workloads such as AI, while networking speeds — up to 400 gigabits per second (Gbps) per connection — support faster data transfers between servers and storage.

NVIDIA DGX H100 systems and servers from NVIDIA partners with H100 PCIe GPUs come with a license for NVIDIA AI Enterprise, an end-to-end, secure, cloud-native suite of AI development and deployment software, providing a complete platform for excellence in efficient enterprise AI.

NVIDIA DGX H100 Systems Supercharge Efficiency for Supersize AI

As the fourth generation of the world’s premier purpose-built AI infrastructure, NVIDIA DGX H100 systems provide a fully optimized platform powered by the operating system of the accelerated data center, NVIDIA Base Command software.

Each DGX H100 system features eight NVIDIA H100 GPUs, 10 NVIDIA ConnectX-7 network adapters and dual 4th Gen Intel Xeon Scalable processors to deliver the performance required to build large generative AI models, large language models, recommender systems and more.

Combined with NVIDIA networking, this architecture supercharges efficient computing at scale by delivering up to 9x more performance than the previous generation and 20x to 40x more performance than unaccelerated X86 dual-socket servers for AI training and HPC workloads. If a language model previously required 40 days to train on a cluster of X86-only servers, the NVIDIA DGX H100 using Intel Xeon CPUs and ConnectX-7 powered networking could complete the same work in as little as 1-2 days.

NVIDIA DGX H100 systems are the building blocks of an enterprise-ready, turnkey NVIDIA DGX SuperPOD, which delivers up to one exaflop of AI performance, providing a leap in efficiency for large-scale enterprise AI deployment.

NVIDIA Partners Boost Data Center Efficiency 

For AI data center workloads, NVIDIA H100 GPUs enable enterprises to build and deploy applications more efficiently.

Bringing a new generation of performance and energy efficiency to enterprises worldwide, a broad portfolio of systems with H100 GPUs and 4th Gen Intel Xeon Scalable CPUs are coming soon from NVIDIA partners, including ASUS, Atos, Cisco, Dell Technologies, Fujitsu, GIGABYTE, Hewlett Packard Enterprise, Lenovo, QCT and Supermicro.

As the bellwether of the efficiency gains to come, the Flatiron Institute’s Lenovo ThinkSystem with NVIDIA H100 GPUs tops the latest Green500 list — and NVIDIA technologies power 23 of the top 30 systems on the list. The Flatiron system uses prior-generation Intel CPUs, so even more efficiency is expected from the systems now coming to market.

Additionally, connecting servers with NVIDIA ConnectX-7 networking and Intel 4th Gen Xeon Scalable processors will increase efficiency and reduce infrastructure and power consumption.

NVIDIA ConnectX-7 adapters support PCIe Gen 5 and 400 Gbps per connection using Ethernet or InfiniBand, doubling networking throughput between servers and to storage. The adapters support advanced networking, storage and security offloads. ConnectX-7 reduces the number of cables and switch ports needed, saving 17% or more on electricity needed for the networking of large GPU-accelerated HPC and AI clusters and contributing to the better energy efficiency of these new servers.

NVIDIA AI Enterprise Software Delivers Full-Stack AI Solution

These next-generation systems also deliver a leap forward in operational efficiency as they’re optimized for the NVIDIA AI Enterprise software suite.

Running on NVIDIA H100, NVIDIA AI Enterprise accelerates the data science pipeline and streamlines the development and deployment of predictive AI models to automate essential processes and gain rapid insights from data.

With an extensive library of full-stack software, including AI workflows of reference applications, frameworks, pretrained models and infrastructure optimization, the software provides an ideal foundation for scaling enterprise AI success.

To try out NVIDIA H100 running AI workflows and frameworks supported in NVIDIA AI Enterprise, sign up for NVIDIA LaunchPad free of charge.

Watch NVIDIA founder and CEO Jensen Huang speak at the 4th Gen Intel Xeon Scalable processor launch event.

Read More

Best practices for load testing Amazon SageMaker real-time inference endpoints

Amazon SageMaker is a fully managed machine learning (ML) service. With SageMaker, data scientists and developers can quickly and easily build and train ML models, and then directly deploy them into a production-ready hosted environment. It provides an integrated Jupyter authoring notebook instance for easy access to your data sources for exploration and analysis, so you don’t have to manage servers. It also provides common ML algorithms that are optimized to run efficiently against extremely large data in a distributed environment.

SageMaker real-time inference is ideal for workloads that have real-time, interactive, low-latency requirements. With SageMaker real-time inference, you can deploy REST endpoints that are backed by a specific instance type with a certain amount of compute and memory. Deploying a SageMaker real-time endpoint is only the first step in the path to production for many customers. We want to be able to maximize the performance of the endpoint to achieve a target transactions per second (TPS) while adhering to latency requirements. A large part of performance optimization for inference is making sure you select the proper instance type and count to back an endpoint.

This post describes the best practices for load testing a SageMaker endpoint to find the right configuration for the number of instances and size. This can help us understand the minimum provisioned instance requirements to meet our latency and TPS requirements. From there, we dive into how you can track and understand the metrics and performance of the SageMaker endpoint utilizing Amazon CloudWatch metrics.

We first benchmark the performance of our model on a single instance to identify the TPS it can handle per our acceptable latency requirements. Then we extrapolate the findings to decide on the number of instances we need in order to handle our production traffic. Finally, we simulate production-level traffic and set up load tests for a real-time SageMaker endpoint to confirm our endpoint can handle the production-level load. The entire set of code for the example is available in the following GitHub repository.

Overview of solution

For this post, we deploy a pre-trained Hugging Face DistilBERT model from the Hugging Face Hub. This model can perform a number of tasks, but we send a payload specifically for sentiment analysis and text classification. With this sample payload, we strive to achieve 1000 TPS.

Deploy a real-time endpoint

This post assumes you are familiar with how to deploy a model. Refer to Create your endpoint and deploy your model to understand the internals behind hosting an endpoint. For now, we can quickly point to this model in the Hugging Face Hub and deploy a real-time endpoint with the following code snippet:

# Imports not shown in the original snippet: the SageMaker Python SDK's Hugging Face wrapper
import sagemaker
from sagemaker.huggingface import HuggingFaceModel

# IAM role with SageMaker permissions (assumes a SageMaker notebook or Studio environment)
role = sagemaker.get_execution_role()

# Hub Model configuration. https://huggingface.co/models
hub = {
    'HF_MODEL_ID': 'distilbert-base-uncased',
    'HF_TASK': 'text-classification'
}

# create Hugging Face Model Class
huggingface_model = HuggingFaceModel(
    transformers_version='4.17.0',
    pytorch_version='1.10.2',
    py_version='py38',
    env=hub,
    role=role,
)

# deploy model to SageMaker Inference
predictor = huggingface_model.deploy(
    initial_instance_count=1,      # number of instances
    instance_type='ml.m5.12xlarge' # EC2 instance type
)

Let’s test our endpoint quickly with the sample payload that we want to use for load testing:


import boto3
import json

client = boto3.client('sagemaker-runtime')

content_type = "application/json"
request_body = {'inputs': "I am super happy right now."}
payload = json.dumps(request_body)

response = client.invoke_endpoint(
    EndpointName=predictor.endpoint_name,
    ContentType=content_type,
    Body=payload)
result = response['Body'].read()
result

Note that we’re backing the endpoint using a single Amazon Elastic Compute Cloud (Amazon EC2) instance of type ml.m5.12xlarge, which contains 48 vCPU and 192 GiB of memory. The number of vCPUs is a good indication of the concurrency the instance can handle. In general, it’s recommended to test different instance types to make sure we have an instance that has resources that are properly utilized. To see a full list of SageMaker instances and their corresponding compute power for real-time Inference, refer to Amazon SageMaker Pricing.
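Before starting a load test, it can also help to confirm the endpoint has finished creating. A minimal check with boto3 (reusing the predictor object from the deploy step) might look like this:

import boto3

sm_client = boto3.client('sagemaker')
status = sm_client.describe_endpoint(EndpointName=predictor.endpoint_name)['EndpointStatus']
print(status)  # should be 'InService' before you begin sending traffic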

Metrics to track

Before we can get into load testing, it’s essential to understand what metrics to track to understand the performance breakdown of your SageMaker endpoint. CloudWatch is the primary logging tool that SageMaker uses to help you understand the different metrics that describe your endpoint’s performance. You can utilize CloudWatch logs to debug your endpoint invocations; all logging and print statements you have in your inference code are captured here. For more information, refer to How Amazon CloudWatch works.

There are two different types of metrics CloudWatch covers for SageMaker: instance-level and invocation metrics.

Instance-level metrics

The first set of metrics to consider is the instance-level metrics: CPUUtilization and MemoryUtilization (and, for GPU-based instances, GPUUtilization). For CPUUtilization, you may at first see percentages above 100% in CloudWatch. It’s important to realize that CPUUtilization displays the sum across all CPU cores. For example, if the instance behind your endpoint contains 4 vCPUs, the range of utilization is up to 400%. MemoryUtilization, on the other hand, is in the range of 0–100%.

Specifically, you can use CPUUtilization to get a deeper understanding of whether you have sufficient or even an excess amount of hardware. If you have an under-utilized instance (less than 30%), you could potentially scale down your instance type. Conversely, if you are around 80–90% utilization, it would be worth picking an instance with greater compute or memory. From our tests, we suggest targeting around 60–70% utilization of your hardware.

Invocation metrics

As the name suggests, invocation metrics are where we can track the end-to-end latency of any invocations of your endpoint. You can use the invocation metrics to capture error counts and the types of errors (5xx, 4xx, and so on) your endpoint may be experiencing. More importantly, you can understand the latency breakdown of your endpoint calls. A lot of this can be captured with the ModelLatency and OverheadLatency metrics, as illustrated in the following diagram.

Latencies

The ModelLatency metric captures the time that inference takes within the model container behind a SageMaker endpoint. Note that the model container also includes any custom inference code or scripts that you have passed for inference. This invocation metric is reported in microseconds, and you can generally graph a percentile in CloudWatch (p99, p90, and so on) to see whether you’re meeting your target latency. Note that several factors can impact model and container latency, such as the following:

  • Custom inference script – Whether you have implemented your own container or used a SageMaker-based container with custom inference handlers, it’s best practice to profile your script to catch any operations that are specifically adding a lot of time to your latency.
  • Communication protocol – Consider REST vs. gRPC connections to the model server within the model container.
  • Model framework optimizations – This is framework specific; for example, with TensorFlow there are a number of TF Serving-specific environment variables you can tune. Make sure to check which container you’re using and whether there are any framework-specific optimizations you can add within the script or as environment variables to inject into the container.

OverheadLatency is measured from the time that SageMaker receives the request until it returns a response to the client, minus the model latency. This part is largely outside of your control and falls under the time taken by SageMaker overheads.

End-to-end latency as a whole depends on a variety of factors and isn’t necessarily the sum of ModelLatency plus OverheadLatency. For example, if your client is making the InvokeEndpoint API call over the internet, then from the client’s perspective the end-to-end latency would be internet + ModelLatency + OverheadLatency. For this reason, when load testing your endpoint, it’s recommended to focus on the endpoint metrics (ModelLatency, OverheadLatency, and InvocationsPerInstance) to accurately benchmark the SageMaker endpoint itself. Any issues related to end-to-end latency can then be isolated separately.

A few questions to consider for end-to-end latency:

  • Where is the client that is invoking your endpoint?
  • Are there any intermediary layers between your client and the SageMaker runtime?
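To keep an eye on the endpoint-side metrics discussed above without the CloudWatch console, you can also pull them programmatically. The following is a minimal sketch; the variant name AllTraffic is the SDK’s default and the endpoint name is the one used later in this post, so adjust both for your setup:

import boto3
from datetime import datetime, timedelta

cloudwatch = boto3.client('cloudwatch')
end = datetime.utcnow()
start = end - timedelta(minutes=30)
dimensions = [
    {'Name': 'EndpointName', 'Value': 'huggingface-pytorch-inference-2022-10-04-02-46-44-677'},
    {'Name': 'VariantName', 'Value': 'AllTraffic'},
]

# p99 ModelLatency (microseconds), an invocation metric in the AWS/SageMaker namespace
model_latency = cloudwatch.get_metric_statistics(
    Namespace='AWS/SageMaker', MetricName='ModelLatency',
    Dimensions=dimensions, StartTime=start, EndTime=end,
    Period=60, ExtendedStatistics=['p99'])

# Average CPUUtilization (percent, summed across vCPUs), an instance-level metric
cpu_util = cloudwatch.get_metric_statistics(
    Namespace='/aws/sagemaker/Endpoints', MetricName='CPUUtilization',
    Dimensions=dimensions, StartTime=start, EndTime=end,
    Period=60, Statistics=['Average'])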

Auto scaling

We don’t cover auto scaling in this post specifically, but it’s an important consideration in order to provision the correct number of instances based on the workload. Depending on your traffic patterns, you can attach an auto scaling policy to your SageMaker endpoint. There are different scaling options, such as TargetTrackingScaling, SimpleScaling, and StepScaling. This allows your endpoint to scale in and out automatically based on your traffic pattern.

A common option is target tracking, where you specify a CloudWatch metric (or a custom metric you have defined) and scale out based on it. A frequent choice is to track the InvocationsPerInstance metric. After you have identified a bottleneck at a certain TPS, you can use that as a metric to scale out to a greater number of instances and handle peak loads of traffic. To get a deeper breakdown of auto scaling SageMaker endpoints, refer to Configuring autoscaling inference endpoints in Amazon SageMaker.
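As a rough illustration of the target-tracking option (not part of the original walkthrough), registering a SageMaker variant with Application Auto Scaling and tracking SageMakerVariantInvocationsPerInstance might look like the following sketch; the endpoint name, variant name, capacity bounds, and target value are all assumptions to adjust for your workload:

import boto3

autoscaling = boto3.client('application-autoscaling')
resource_id = 'endpoint/huggingface-pytorch-inference-2022-10-04-02-46-44-677/variant/AllTraffic'

# Register the variant's instance count as a scalable target
autoscaling.register_scalable_target(
    ServiceNamespace='sagemaker',
    ResourceId=resource_id,
    ScalableDimension='sagemaker:variant:DesiredInstanceCount',
    MinCapacity=1,
    MaxCapacity=5,
)

# Scale out when invocations per instance per minute exceed the target value
autoscaling.put_scaling_policy(
    PolicyName='invocations-per-instance-target',
    ServiceNamespace='sagemaker',
    ResourceId=resource_id,
    ScalableDimension='sagemaker:variant:DesiredInstanceCount',
    PolicyType='TargetTrackingScaling',
    TargetTrackingScalingPolicyConfiguration={
        'TargetValue': 1000.0,  # assumed invocations per instance per minute
        'PredefinedMetricSpecification': {
            'PredefinedMetricType': 'SageMakerVariantInvocationsPerInstance'
        },
        'ScaleInCooldown': 300,
        'ScaleOutCooldown': 60,
    },
)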

Load testing

Although we utilize Locust to display how we can load test at scale, if you’re trying to right size the instance behind your endpoint, SageMaker Inference Recommender is a more efficient option. With third-party load testing tools, you have to manually deploy endpoints across different instances. With Inference Recommender, you can simply pass an array of the instance types you want to load test against, and SageMaker will spin up jobs for each of these instances.

Locust

For this example, we use Locust, an open-source load testing tool that you can implement using Python. Locust is similar to many other open-source load testing tools, but has a few specific benefits:

  • Easy to set up – As we demonstrate in this post, we’ll pass a simple Python script that can easily be refactored for your specific endpoint and payload.
  • Distributed and scalable – Locust is event-based and utilizes gevent under the hood. This is very useful for testing highly concurrent workloads and simulating thousands of concurrent users. You can achieve high TPS with a single process running Locust, but it also has a distributed load generation feature that enables you to scale out to multiple processes and client machines, as we will explore in this post.
  • Locust metrics and UI – Locust also captures end-to-end latency as a metric. This can help supplement your CloudWatch metrics to paint a full picture of your tests. This is all captured in the Locust UI, where you can track concurrent users, workers, and more.

To further understand Locust, check out their documentation.

Amazon EC2 setup

You can set up Locust in whatever environment is compatible for you. For this post, we set up an EC2 instance and install Locust there to conduct our tests. We use a c5.18xlarge EC2 instance. The client-side compute power is also something to consider. At times when you run out of compute power on the client side, this is often not captured, and is mistaken as a SageMaker endpoint error. It’s important to place your client in a location of sufficient compute power that can handle the load that you are testing at. For our EC2 instance, we use an Ubuntu Deep Learning AMI, but you can utilize any AMI as long as you can properly set up Locust on the machine. To understand how to launch and connect to your EC2 instance, refer to the tutorial Get started with Amazon EC2 Linux instances.

The Locust UI is accessible via port 8089. We can open this by adjusting our inbound security group rules for the EC2 Instance. We also open up port 22 so we can SSH into the EC2 instance. Consider scoping the source down to the specific IP address you are accessing the EC2 instance from.

Security Groups
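If you prefer to script these inbound rules rather than use the console, a sketch with boto3 could look like the following; the security group ID and source IP are placeholders:

import boto3

ec2 = boto3.client('ec2')
my_ip = '203.0.113.10/32'  # replace with your own public IP in CIDR form

ec2.authorize_security_group_ingress(
    GroupId='sg-0123456789abcdef0',  # placeholder security group ID
    IpPermissions=[
        {'IpProtocol': 'tcp', 'FromPort': 8089, 'ToPort': 8089,
         'IpRanges': [{'CidrIp': my_ip, 'Description': 'Locust UI'}]},
        {'IpProtocol': 'tcp', 'FromPort': 22, 'ToPort': 22,
         'IpRanges': [{'CidrIp': my_ip, 'Description': 'SSH'}]},
    ],
)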

After you’re connected to your EC2 instance, we set up a Python virtual environment and install the open-source Locust API via the CLI:

virtualenv venv #venv is the virtual environment name, you can change as you desire
source venv/bin/activate #activate virtual environment
pip install locust

We’re now ready to work with Locust for load testing our endpoint.

Locust testing

All Locust load tests are conducted based on a Locust file that you provide. This Locust file defines a task for the load test; this is where we define our Boto3 invoke_endpoint API call. See the following code:

# Imports not shown in the original excerpt; this code runs inside a class in the Locust file
import boto3
from botocore.config import Config

# Disable automatic retries so every failed invocation is surfaced to Locust
config = Config(
    retries={
        'max_attempts': 0,
        'mode': 'standard'
    }
)

self.sagemaker_client = boto3.client('sagemaker-runtime', config=config)
self.endpoint_name = host.split('/')[-1]
self.region = region
self.content_type = content_type
self.payload = payload

In the preceding code, adjust your invoke endpoint call parameters to suit your specific model invocation. We use the InvokeEndpoint API using the following piece of code in the Locust file; this is our load test run point. The Locust file we’re using is locust_script.py.

def send(self):

    request_meta = {
        "request_type": "InvokeEndpoint",
        "name": "SageMaker",
        "start_time": time.time(),
        "response_length": 0,
        "response": None,
        "context": {},
        "exception": None,
    }
    start_perf_counter = time.perf_counter()

    try:
        response = self.sagemaker_client.invoke_endpoint(
            EndpointName=self.endpoint_name,
            Body=self.payload,
            ContentType=self.content_type
        )
        response_body = response["Body"].read()
        request_meta["response_length"] = len(response_body)
    except Exception as e:
        request_meta["exception"] = e

    request_meta["response_time"] = (time.perf_counter() - start_perf_counter) * 1000
    # The full locust_script.py then fires Locust's request event with request_meta
    # so the call shows up in the Locust statistics.

Now that we have our Locust script ready, we want to run distributed Locust tests to stress test our single instance to find out how much traffic our instance can handle.

Locust distributed mode is a little more nuanced than a single-process Locust test. In distributed mode, we have one primary and multiple workers. The primary instructs the workers on how to spawn and control the concurrent users that send requests. In our distributed.sh script, we see that by default 240 users will be distributed across the 60 workers. Note that the --headless flag in the Locust CLI removes the UI feature of Locust.

#replace with your endpoint name in format https://<<endpoint-name>>
export ENDPOINT_NAME=https://$1

export REGION=us-east-1
export CONTENT_TYPE=application/json
export PAYLOAD='{"inputs": "I am super happy right now."}'
export USERS=240
export WORKERS=60
export RUN_TIME=1m
export LOCUST_UI=false # Use Locust UI

.
.
.

locust -f $SCRIPT -H $ENDPOINT_NAME --master --expect-workers $WORKERS -u $USERS -t $RUN_TIME --csv results &
.
.
.

for (( c=1; c<=$WORKERS; c++ ))
do
locust -f $SCRIPT -H $ENDPOINT_NAME --worker --master-host=localhost &
done

./distributed.sh huggingface-pytorch-inference-2022-10-04-02-46-44-677 #to execute Distributed Locust test

We first run the distributed test on a single instance backing the endpoint. The idea is to fully maximize a single instance in order to understand the instance count we need to achieve our target TPS while staying within our latency requirements. Note that if you want to access the UI, change the LOCUST_UI environment variable to true, take the public IP of your EC2 instance, and map port 8089 to the URL.

The following screenshot shows our CloudWatch metrics.

CloudWatch Metrics

Eventually, although we initially achieve a TPS of 200, we start to see 5xx errors in our EC2 client-side logs, as shown in the following screenshot.

We can also verify this by looking at our instance-level metrics, specifically CPUUtilization.

CloudWatch Metrics

Here we notice CPUUtilization at nearly 4,800%. Our ml.m5.12xlarge instance has 48 vCPUs (48 * 100% = 4,800%). This is saturating the entire instance, which also helps explain our 5xx errors. We also see an increase in ModelLatency.

Our single instance is being overwhelmed and doesn’t have the compute to sustain a load past the roughly 200 TPS we observe. Our target is 1000 TPS, so let’s try increasing our instance count to 5. This might have to be even higher in a production setting, because we were already observing errors at 200 TPS.

Endpoint settings

We see in both the Locust UI and CloudWatch logs that we have a TPS of nearly 1000 with five instances backing the endpoint.

Locust

CloudWatch Metrics

If you start experiencing errors even with this hardware setup, make sure to monitor CPUUtilization to understand the full picture behind your endpoint hosting. It’s crucial to understand your hardware utilization to see whether you need to scale up or even down. Sometimes container-level problems lead to 5xx errors, but if CPUUtilization is low, it indicates that it’s not your hardware but something at the container or model level that might be causing these issues (for example, the environment variable for the number of workers not being set properly). On the other hand, if you notice your instance is getting fully saturated, it’s a sign that you need to either increase the current instance fleet or try out a larger instance with a smaller fleet.

Although we increased the instance count to 5 to handle 1,000 TPS, we can see that the ModelLatency metric is still high. This is because the instances are still saturated. In general, we suggest aiming to utilize an instance’s resources at around 60–70%.
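As a back-of-the-envelope sizing sketch based on the numbers observed in this walkthrough (roughly 200 TPS per ml.m5.12xlarge before saturation, a 1,000 TPS target, and a 60–70% utilization goal), you could estimate the fleet size as follows:

import math

tps_per_instance = 200      # observed before the single instance saturated
target_tps = 1000           # our goal for this workload
target_utilization = 0.65   # keep instances at roughly 60-70% busy

instances = math.ceil(target_tps / (tps_per_instance * target_utilization))
print(instances)  # 8, rather than the 5 instances that just barely reach 1,000 TPS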

Clean up

After load testing, make sure to clean up any resources you won’t use, via the SageMaker console or the delete_endpoint Boto3 API call. In addition, make sure to stop your EC2 instance (or whatever client setup you have) so you don’t incur any further charges there either.
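For example, a minimal cleanup from the same notebook or script could be either of the following (they are alternatives, not steps to run together):

# Option 1: via the SageMaker Python SDK predictor object from the deploy step
predictor.delete_endpoint()

# Option 2: via boto3, if you only have the endpoint name
import boto3
boto3.client('sagemaker').delete_endpoint(EndpointName='huggingface-pytorch-inference-2022-10-04-02-46-44-677')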

Summary

In this post, we described how you can load test your SageMaker real-time endpoint. We also discussed what metrics you should be evaluating when load testing your endpoint to understand your performance breakdown. Make sure to check out SageMaker Inference Recommender to further understand instance right-sizing and more performance optimization techniques.


About the Authors

Marc Karp is an ML Architect with the SageMaker Service team. He focuses on helping customers design, deploy and manage ML workloads at scale. In his spare time, he enjoys traveling and exploring new places.

Ram Vegiraju is an ML Architect with the SageMaker Service team. He focuses on helping customers build and optimize their AI/ML solutions on Amazon SageMaker. In his spare time, he loves traveling and writing.

Read More

Optimizing TensorFlow for 4th Gen Intel Xeon Processors

Posted by Ashraf Bhuiyan, AG Ramesh from Intel, Penporn Koanantakool from Google

TensorFlow 2.9.1 was the first release to include, by default, optimizations driven by the Intel® oneAPI Deep Neural Network Library (oneDNN) for 3rd Gen Intel® Xeon® Scalable processors (Cascade Lake). Since then, Intel and Google have continued our collaboration to introduce new TensorFlow optimizations for the next generation of Intel Xeon processors.

These optimizations accelerate TensorFlow models using the new matrix-based instruction set, Intel® Advanced Matrix Extension (AMX). The Intel AMX instructions are designed to accelerate deep learning operations such as matrix multiplication and convolutions that use Google’s bfloat16 and 8-bit low-precision data types. Low-precision data types are widely used and provide significant improvement over the default 32-bit floating-point format without significant loss in accuracy.

We are happy to announce that these features are now available as a preview in the nightly build of TensorFlow on Github, and also in the Intel optimized build. TensorFlow developers can now use Intel AMX on the 4th Gen Intel® Xeon® Scalable processor (formerly known as Sapphire Rapids) using the existing mixed precision support available in TensorFlow. We are excited by the results – several popular AI models run up to 19x faster by moving from 3rd Gen to 4th Gen Intel Xeon processors using Intel AMX.

Intel’s Advanced Matrix Extension (AMX) Accelerations in 4th Gen Intel Xeon Processor

The Intel® Advanced Matrix Extension (AMX) is an x86-based extension that introduces a new programming framework for dot products of two matrices. Intel AMX serves as an AI acceleration engine and builds on capabilities such as AVX-512 (for optimized vector operations) and Deep Learning Boost (through Vector Neural Network Instructions for optimized resource utilization/caching and for lower-precision AI optimizations) in previous generations of Intel Xeon processors.

Intel AMX introduces a new type of 2-dimensional register file, called “tiles”, and a set of 12 new x86 instructions to operate on the tiles. The new TDPBF16PS instruction performs a dot product of bfloat16 tiles, and TDPBSSD performs a dot product of signed 8-bit integer tiles. Other instructions cover tile configuration and data movement to the Intel AMX unit. Further details can be found in the documentation published by Intel.

How to take advantage of AMX optimizations on 4th Gen Intel Xeon

Intel AMX optimizations are included in the official TensorFlow nightly releases. The latest stable release, 2.11, includes preliminary support; full support will be available in a subsequent stable release.

Users running TensorFlow on 4th Gen Intel Xeon processors can take advantage of the optimizations with minimal changes:

a)    For bfloat16 mixed precision, developers can accelerate their models using the Keras mixed precision API, as explained here. You can invoke auto mixed precision simply by including these lines in your code:

from tensorflow.keras import mixed_precision

policy = mixed_precision.Policy('mixed_bfloat16')
mixed_precision.set_global_policy(policy)

b)    Using Intel AMX with 8-bit quantized models requires the models to be quantized to use int8. Any existing standard models (for example, RN50, BERT, or SSD-RN34) that have been previously quantized with Intel Neural Compressor will run with no changes needed.

    Performance improvements

    The following charts show performance improvement on a 2-socket, 56-core 4th Gen Intel Xeon using Intel AMX low precision on various popular vision and language models, where the baseline is a 2-socket, 40-core 3rd Gen Intel Xeon with FP32 precision. We use the Intel Optimization for TensorFlow* preview and the launch_benchmark script from the Model Zoo for Intel® Architecture.

    Bar chart: speedup of 4th Gen Intel Xeon with AMX BF16 vs. 3rd Gen Intel Xeon with FP32 across mixed precision inference models

    Here in the chart, inference with mixed precision models on a 4th Gen Intel Xeon was 1.9x to 9.6x faster than FP32 models on a 3rd Gen Intel Xeon. (BS=x indicates a large batch size, depending on the model)

    Bar chart: speedup of 4th Gen Intel Xeon with AMX BF16 vs. 3rd Gen Intel Xeon with FP32 across mixed precision training models

    Training models with auto-mixed-precision on a 4th Gen Intel Xeon was 2.3x to 5.5x faster than FP32 models on a 3rd Gen Intel Xeon.

    Bar chart: speedup of 4th Gen Intel Xeon with AMX Int8 vs. 3rd Gen Intel Xeon with FP32 across quantized models

    Similarly, quantized model inference on a 4th Gen Intel Xeon was 3.3x to 19x faster than FP32 precision on a 3rd Gen Intel Xeon.

    In addition to the above popular models, we have tested hundreds of other models to ensure that the performance gains are observed across the board.

    Next Steps

    We are working to continuously tune and improve the Intel AMX optimizations in future releases of TensorFlow. We encourage users to optimize their AI models with Intel AMX on 4th Gen Intel Xeon processors to get a significant performance boost, not just for inference but also for pre-training, fine-tuning and transfer learning. We would like to hear from you; please provide feedback through the TensorFlow GitHub page or the oneAPI Deep Neural Network Library GitHub page.

    Acknowledgements

    The results presented in this blog are the work of many people, including the TensorFlow and oneDNN teams at Intel and our collaborators in Google’s TensorFlow team.

    From Intel: Md Faijul Amin, Mahmoud Abuzaina, Gauri Deshpande, Ashiq Imran, Kanvi Khanna, Geetanjali Krishna, Sachin Muradi, Srinivasan Narayanamoorthy, Bhavani Subramanian, Yimei Sun, Om Thakkar, Jojimon Varghese, Tatyana Primak, Shamima Najnin, Mona Minakshi, Haihao Shen, Shufan Wu, Feng Tian, Chandan Damannagari.

    From Google: Eugene Zhulenev, Antonio Sanchez, Emilio Cota.

    *For configuration details see www.intel.com/performanceindex


    Notices and Disclaimers:

    Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

    Read More

    Get smarter search results with the Amazon Kendra Intelligent Ranking and OpenSearch plugin

    If you’ve had the opportunity to build a search application for unstructured data (e.g., wikis, informational websites, self-service help pages, internal documentation) using open source or commercial off-the-shelf search engines, then you’re probably familiar with the inherent accuracy challenges involved in getting relevant search results. The intended meaning of both query and document can be lost because the search is reduced to matching component keywords and terms. Consequently, while you get results that may contain the right words, they aren’t always relevant to the user. You need your search engine to be smarter, so it can rank documents based on matching the meaning or semantics of the content to the intention of the user’s query.

    Amazon Kendra provides a fully managed intelligent search service that automates document ingestion and provides highly accurate search and FAQ results based on content across many data sources. If you haven’t migrated to Amazon Kendra and would like to improve the quality of search results, you can use Amazon Kendra Intelligent Ranking for self-managed OpenSearch on your existing search solution.

    We’re delighted to introduce the new Amazon Kendra Intelligent Ranking for self-managed OpenSearch, and its companion plugin for the OpenSearch search engine! Now you can easily add intelligent ranking to your OpenSearch document queries, with no need to migrate, duplicate your OpenSearch indexes, or rewrite your applications. The difference between Amazon Kendra Intelligent Ranking for self-managed OpenSearch and the fully managed Amazon Kendra service is that while the former provides powerful semantic re-ranking for the search results, the latter provides additional search accuracy improvements and functionality such as incremental learning, question answering, FAQ matching, and built-in connectors. For more information about the fully managed service, please visit the Amazon Kendra service page.

    With Amazon Kendra Intelligent Ranking for self-managed OpenSearch, previous results like this:

    Query: What is the address of the White House?

    Hit1 (best): The president delivered an address to the nation from the White House today.

    Hit2: The White House is located at: 1600 Pennsylvania Avenue NW, Washington, DC 20500

    become like this: 

    Query: What is the address of the White House?

    Hit1 (best): The White House is located at: 1600 Pennsylvania Avenue NW, Washington, DC 20500

    Hit2: The president delivered an address to the nation from the White House today.

    In this post, we show you how to get started with Amazon Kendra Intelligent Ranking for self-managed OpenSearch, and we provide a few examples that demonstrate the power and value of this feature.

    Components of Amazon Kendra Intelligent Ranking for self-managed OpenSearch

    Prerequisites

    For this tutorial, you’ll need a bash terminal on Linux, Mac, or Windows Subsystem for Linux, and an AWS account. Hint: consider using an Amazon Cloud9 instance or an Amazon Elastic Compute Cloud (Amazon EC2) instance.

    You will:

    • Install Docker, if it’s not already installed on your system.
    • Install the latest AWS Command Line Interface (AWS CLI), if it’s not already installed.
    • Create and start OpenSearch containers, with the Amazon Kendra Intelligent Ranking plugin enabled.
    • Create test indexes, and load some sample documents.
    • Run some queries, with and without intelligent ranking, and be suitably impressed by the differences!

    Install Docker

    If Docker (i.e., docker and docker-compose) is not already installed in your environment, then install it. See Get Docker for directions.

    Install the AWS CLI

    If you don’t already have the latest version of the AWS CLI installed, then install and configure it now (see AWS CLI Getting Started). Your default AWS user credentials must have administrator access, or ask your AWS administrator to add the following policy to your user permissions:

    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "VisualEditor0",
                "Effect": "Allow",
                "Action": "kendra-ranking:*",
                "Resource": "*"
            }
        ]
    }

    Create and start OpenSearch using the Quickstart script

    Download the search_processing_kendra_quickstart.sh script:

    wget https://raw.githubusercontent.com/msfroh/search-relevance/quickstart-script/helpers/search_processing_kendra_quickstart.sh

    Make it executable:

    chmod +x ./search_processing_kendra_quickstart.sh

    The quickstart script:

    1. Creates an Amazon Kendra Intelligent Ranking Rescore Execution Plan in your AWS account.
    2. Creates Docker containers for OpenSearch and its Dashboards.
    3. Configures OpenSearch to use the Kendra Intelligent Ranking Service.
    4. Starts the OpenSearch services.
    5. Provides helpful guidance for using the service.

    Use the --help option to see the command line options:

    ./search_processing_kendra_quickstart.sh --help

    Now, execute the script to automate the Amazon Kendra and OpenSearch setup:

    ./search_processing_kendra_quickstart.sh --create-execution-plan

    That’s it! OpenSearch and OpenSearch Dashboard containers are now up and running.

    Read the output message from the quickstart script, and make a note of the directory where you can run the handy docker-compose commands, and the cleanup_resources.sh script.

    Try a test query to validate you can connect to your OpenSearch container:

    curl -XGET --insecure -u 'admin:admin' 'https://localhost:9200'

    Note that if you get the error curl(35):OpenSSL SSL_connect: SSL_ERROR_SYSCALL in connection to localhost:9200, it means that OpenSearch is still coming up. Please wait for a couple of minutes for OpenSearch to be ready and try again.

    Create test indexes and load sample documents

    The script below is used to create an index and load sample documents. Save it on your computer as bulk_post.sh:

    #!/bin/bash
    curl -u admin:admin -XPOST https://localhost:9200/_bulk --insecure --data-binary @$1 -H 'Content-Type: application/json'

    Save the data files below as tinydocs.jsonl:

    { "create" : { "_index" : "tinydocs",  "_id" : "tdoc1" } }
    {"title": "WhiteHouse1", "body": "The White House is located at: 1600 Pennsylvania Avenue NW, Washington, DC 20500"}
    { "create" : { "_index" : "tinydocs",  "_id" : "tdoc2" } }
    {"title": "WhiteHouse2", "body": "The president delivered an address to the nation from the White House today."}

    And save the data file below as dstinfo.jsonl:

    (This data is adapted from a Daylight Saving Time article.)

    { "create" : { "_index" : "dstinfo",  "_id" : "dst1" } }
    {"title": "Daylight Saving Time", "body": "Daylight saving time begins on the second Sunday in March at 2 a.m., and clocks are set an hour ahead, according to the Farmers’ Almanac. It lasts for eight months and ends on the first Sunday in November, when clocks are set back an hour at 2 a.m."}
    { "create" : { "_index" : "dstinfo",  "_id" : "dst2" } }
    {"title":"History of daylight saving time", "body": "Founding Father Benjamin Franklin is often deemed the brain behind daylight saving time after a letter he wrote in 1784 to a Parisian newspaper, according to the Farmers’ Almanac. But Franklin’s letter suggested people simply change their routines and schedules — not the clocks — to the sun’s cycles. Perhaps surprisingly, daylight saving time had a soft rollout in the United States in 1883 to solve issues with railroad accidents, according to the U.S. Bureau of Transportation Services. It was instituted across the United States in 1918, according to the Congressional Research Service. In 2005, Congress changed it to span from March to November instead of its original timeframe of April to October."}
    { "create" : { "_index" : "dstinfo",  "_id" : "dst3" } }
    {"title": "Daylight saving time participants", "body":"The United States is one of more than 70 countries that follow some form of daylight saving time, according to World Data. States can individually decide whether or not to follow it, according to the Farmers’ Almanac. Arizona and Hawaii do not, nor do parts of northeastern British Columbia in Canada. Puerto Rico and the Virgin Islands, both U.S. territories, also don’t follow daylight saving time, according to the Congressional Research Service."}
    { "create" : { "_index" : "dstinfo",  "_id" : "dst4" } }
    {"title":"Benefits of daylight saving time", "body":"Those in favor of daylight saving time, whether eight months long or permanent, also vouch that it increases tourism in places such as parks or other public attractions, according to National Geographic. The longer days can keep more people outdoors later in the day."}

    Make the script executable:

    chmod +x ./bulk_post.sh

    Now use the bulk_post.sh script to create indexes and load the data by running the two commands below:

    ./bulk_post.sh tinydocs.jsonl
    ./bulk_post.sh dstinfo.jsonl

    Run sample queries

    Prepare query scripts

    OpenSearch queries are defined in JSON using the OpenSearch query domain-specific language (DSL). For this post, we use the Linux curl command to send queries to our local OpenSearch server over HTTPS.

    To make this easy, we’ve defined two small scripts to construct our query DSL and send it to OpenSearch.

    The first script creates a regular OpenSearch text match query on two document fields – title and body. See OpenSearch documentation for more on the multi-match query syntax. We’ve kept the query very simple, but you can experiment later with defining alternate types of queries.

    Save the script below as query_nokendra.sh:

    #!/bin/bash
    curl -XGET "https://localhost:9200/$1/_search?pretty" -u 'admin:admin' --insecure -H 'Content-Type: application/json' -d'
      {
        "query": {
          "multi_match": {
            "fields": ["title", "body"],
            "query": "'"$2"'"
          }
        },
        "size": 20
      }
      '

    The second script is similar to the first one, but this time we add a query extension to instruct OpenSearch to invoke the Amazon Kendra Intelligent Ranking plugin as a post-processing step to re-rank the original results using the Amazon Kendra Intelligent Ranking service.

    The size property determines how many OpenSearch result documents are sent to Kendra for re-ranking. Here, we specify a maximum of 20 results for re-ranking. Two properties, title_field (optional) and body_field (required), specify the document fields used for intelligent ranking.

    Save the script below as query_kendra.sh:

    #!/bin/bash
    curl -XGET "https://localhost:9200/$1/_search?pretty" -u 'admin:admin' --insecure -H 'Content-Type: application/json' -d'
      {
        "query": {
          "multi_match": {
            "fields": ["title", "body"],
            "query": "'"$2"'"
          }
        },
        "size": 20,
        "ext": {
          "search_configuration": {
            "result_transformer": {
              "kendra_intelligent_ranking": {
                "order": 1,
                "properties": {
                  "title_field": "title",
                  "body_field": "body"
                }
              }
            }
          }
        }
      }
      '

    Make both scripts executable:

    chmod +x ./query_*kendra.sh

    Run initial queries

    Start with a simple query on the tinydocs index, to reproduce the example used in the post introduction.

    Use the query_nokendra.sh script to search for the address of the White House:

    ./query_nokendra.sh tinydocs "what is the address of White House"

    You see the results shown below. Observe the order of the two results, which are ranked by the score assigned by the OpenSearch text match query. Although the top-scoring result does contain the keywords address and White House, it’s clear that its meaning doesn’t match the intent of the question. The keywords match, but the semantics do not.

    {
      "took" : 2,
      "timed_out" : false,
      "_shards" : {
        "total" : 1,
        "successful" : 1,
        "skipped" : 0,
        "failed" : 0
      },
      "hits" : {
        "total" : {
          "value" : 2,
          "relation" : "eq"
        },
        "max_score" : 1.1619741,
        "hits" : [
          {
            "_index" : "tinydocs",
            "_id" : "tdoc2",
            "_score" : 1.1619741,
            "_source" : {
              "title" : "Whitehouse2",
              "body" : "The president delivered an address to the nation from the White House today."
            }
          },
          {
            "_index" : "tinydocs",
            "_id" : "tdoc1",
            "_score" : 1.0577903,
            "_source" : {
              "title" : "Whitehouse1",
              "body" : "The White House is located at: 1600 Pennsylvania Avenue NW, Washington, DC 20500"
            }
          }
        ]
      }
    }

    Now let’s run the query with Amazon Kendra Intelligent Ranking, using the query_kendra.sh script:

    ./query_kendra.sh tinydocs "what is the address of White House"

    This time, you see the results in a different order as shown below. The Amazon Kendra Intelligent Ranking service has re-assigned the score values, and assigned a higher score to the document that more closely matches the intention of the query. From a keyword perspective, this is a poorer match because it doesn’t contain the word address; however, from a semantic perspective it’s the better response. Now you see the benefit of using the Amazon Kendra Intelligent Ranking plugin!

    {
      "took" : 522,
      "timed_out" : false,
      "_shards" : {
        "total" : 1,
        "successful" : 1,
        "skipped" : 2,
        "failed" : 0
      },
      "hits" : {
        "total" : {
          "value" : 2,
          "relation" : "eq"
        },
        "max_score" : 0.3798389,
        "hits" : [
          {
            "_index" : "tinydocs",
            "_id" : "tdoc1",
            "_score" : 0.3798389,
            "_source" : {
              "title" : "Whitehouse1",
              "body" : "The White House is located at: 1600 Pennsylvania Avenue NW, Washington, DC 20500"
            }
          },
          {
            "_index" : "tinydocs",
            "_id" : "tdoc2",
            "_score" : 0.25906953,
            "_source" : {
              "title" : "Whitehouse2",
              "body" : "The president delivered an address to the nation from the White House today."
            }
          }
        ]
      }
    }

    Run additional queries and compare search results

    Try the dstinfo index now to see how the same concept works with different data and queries. While you can use the scripts query_nokendra.sh and query_kendra.sh to make queries from the command line, let’s instead use the OpenSearch Dashboards Compare Search Results plugin to run queries and compare search results.

    Paste the local Dashboards URL http://localhost:5601/app/searchRelevance into your browser to access the dashboard comparison tool. Use the default credentials: username admin, password admin.

    In the search bar, enter: what is daylight saving time?

    For the Query 1 and Query 2 index, select dstinfo.

    Copy the DSL query below and paste it in the Query panel under Query 1. This is a keyword search query.

    {
      "query": { "multi_match": { "fields": ["title", "body"], "query": "%SearchText%" } }, 
      "size": 20
    }

    Now copy the DSL query below and paste it in the Query panel under Query 2. This query invokes the Amazon Kendra Intelligent Ranking plugin for self-managed OpenSearch to perform semantic re-ranking of the search results.

    {
      "query": { "multi_match": { "fields": ["title", "body"], "query": "%SearchText%" } },
      "size": 20,
      "ext": {
        "search_configuration": {
          "result_transformer": {
            "kendra_intelligent_ranking": {
              "order": 1,
              "properties": { "title_field": "title", "body_field": "body" }
            }
          }
        }
      }
    }

    Choose Search to run the queries and observe the search results. In Result 1, the hit ranked last is arguably the most relevant response to this query. In Result 2, the output from Amazon Kendra Intelligent Ranking correctly ranks the most relevant answer first.

    Now that you have experienced Amazon Kendra Intelligent Ranking for self-managed OpenSearch, experiment with a few queries of your own. Use the data we have already loaded or use the bulk_post.sh script to load your own data.

    Explore the Amazon Kendra ranking rescore API

    As you’ve seen from this post, the Amazon Kendra Intelligent Ranking plugin for OpenSearch can be conveniently used for semantic re-ranking of your search results. However, if you use a search service that doesn’t support the Amazon Kendra Intelligent Ranking plugin for self-managed OpenSearch, then you can use the Rescore function from the Amazon Kendra Intelligent Ranking API directly.

    Try this API using the search results from the example query we used earlier: what is the address of White House.

    First, find your Execution Plan Id by running:

    aws kendra-ranking list-rescore-execution-plans

    The JSON below contains the search query, and the two results that were returned by the original OpenSearch match query, with their original OpenSearch scores. Replace {kendra-execution-plan_id} with your Execution Plan Id (from above) and save it as rescore_input.json:

    {
        "RescoreExecutionPlanId": "{kendra-execution-plan_id}", 
        "SearchQuery": "what is the address of White House", 
        "Documents": [
            { "Id": "tdoc1",  "Title": "Whitehouse1",  "Body": "The president delivered an address to the nation from the White House today.",  "OriginalScore": 1.4484794 },
            { "Id": "tdoc2",  "Title": "Whitehouse2",  "Body": "The White House is located at: 1600 Pennsylvania Avenue NW, Washington, DC 20500",  "OriginalScore": 1.2401118 }
        ]
    }

    Run the CLI command below to re-score this list of documents using the Amazon Kendra Intelligent Ranking service:

    aws kendra-ranking rescore --cli-input-json "`cat rescore_input.json`"

    The output of a successful run looks like the following.

    {
        "ResultItems": [
            {
                "Score": 0.39321771264076233, 
                "DocumentId": "tdoc2"
            }, 
            {
                "Score": 0.328217089176178, 
                "DocumentId": "tdoc1"
            }
        ], 
        "RescoreId": "991459b0-ca9e-4ba8-b0b3-1e8e01f2ad15"
    }

    As expected, the document tdoc2 (with the body text “The White House is located at: 1600 Pennsylvania Avenue NW, Washington, DC 20500”) now has the higher ranking, because it’s the semantically more relevant response for the query. The ResultItems list in the output contains each input DocumentId with its new Score, ranked in descending order of Score.

    Clean up

    When you’re done experimenting, shut down and remove your Docker containers and Rescore Execution Plan by running the cleanup_resources.sh script created by the quickstart script, for example:

    ./opensearch-kendra-ranking-docker.xxxx/cleanup_resources.sh

    Conclusion

    In this post, we showed you how to use the Amazon Kendra Intelligent Ranking plugin for self-managed OpenSearch to add intelligent ranking to your OpenSearch document queries and dramatically improve the relevance of the results, while continuing to use your existing OpenSearch search engine deployments.

    You can also use the Amazon Kendra Intelligent Ranking Rescore API directly to intelligently re-score and rank results from your own applications.
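
    If you prefer to call the Rescore API from application code rather than the CLI, the sketch below shows an equivalent request using the AWS SDK for Python (Boto3). The execution plan ID is a placeholder you would replace with your own value; the documents mirror the rescore_input.json example above.

    import boto3

    # The kendra-ranking client exposes the Rescore API used by the plugin.
    kendra_ranking = boto3.client("kendra-ranking")

    response = kendra_ranking.rescore(
        RescoreExecutionPlanId="<your-rescore-execution-plan-id>",  # placeholder
        SearchQuery="what is the address of White House",
        Documents=[
            {
                "Id": "tdoc1",
                "Title": "Whitehouse1",
                "Body": "The president delivered an address to the nation from the White House today.",
                "OriginalScore": 1.4484794,
            },
            {
                "Id": "tdoc2",
                "Title": "Whitehouse2",
                "Body": "The White House is located at: 1600 Pennsylvania Avenue NW, Washington, DC 20500",
                "OriginalScore": 1.2401118,
            },
        ],
    )

    # Results are returned in descending order of the new semantic score.
    for item in response["ResultItems"]:
        print(item["DocumentId"], item["Score"])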

    Read the Amazon Kendra Intelligent Ranking for self-managed OpenSearch documentation to learn more about this feature, and start planning to apply it in your production applications.


    About the Authors

    Abhinav Jawadekar is a Principal Solutions Architect focused on Amazon Kendra in the AI/ML language services team at AWS. Abhinav works with AWS customers and partners to help them build intelligent search solutions on AWS.

    Bob Strahan is a Principal Solutions Architect in the AWS Language AI Services team.

    Read More

    Model hosting patterns in Amazon SageMaker, Part 1: Common design patterns for building ML applications on Amazon SageMaker

    Model hosting patterns in Amazon SageMaker, Part 1: Common design patterns for building ML applications on Amazon SageMaker

    Machine learning (ML) applications are complex to deploy: they often need to hyper-scale while meeting ultra-low latency requirements and stringent cost budgets. Use cases such as fraud detection, product recommendations, and traffic prediction are examples where milliseconds matter and are critical for business success. Strict service level agreements (SLAs) need to be met, and a typical request may require multiple steps such as preprocessing, data transformation, feature engineering, model selection logic, model aggregation, and postprocessing.

    Deploying ML models at scale with optimized cost and compute efficiencies can be a daunting and cumbersome task. Each model has its own merits and dependencies based on the external data sources as well as runtime environment such as CPU/GPU power of the underlying compute resources. An application may require multiple ML models to serve a single inference request. In certain scenarios, a request may flow across multiple models. There is no one-size-fits-all approach, and it’s important for ML practitioners to look for tried-and-proven methods to address recurring ML hosting challenges. This has led to the evolution of design patterns for ML model hosting.

    In this post, we explore common design patterns for building ML applications on Amazon SageMaker.

    Design patterns for building ML applications

    Let’s look at the following design patterns to use for hosting ML applications.

    Single-model based ML applications

    This is a great option when your ML use case requires a single model to serve a request. The model is deployed on a dedicated compute infrastructure with the ability to scale based on the input traffic. This option is also ideal when the client application has a low-latency (in the order of milliseconds or seconds) inference requirement.

    Multi-model based ML applications

    To make hosting more cost-effective, this design pattern allows you to host multiple models on the same shared infrastructure. Multiple ML models can share the host or container resources, including caching the most-used models in memory, resulting in better utilization of memory and compute resources. Depending on the types of models you choose to deploy, model co-hosting may use the following methods:

    • Multi-model hosting – This option allows you to host multiple models using a shared serving container on a single endpoint. This feature is ideal when you have a large number of similar models that you can serve through a shared serving container and don’t need to access all the models at the same time.
    • Multi-container hosting – This option is ideal when you have multiple models running on different serving stacks with similar resource needs, and when individual models don’t have sufficient traffic to utilize the full capacity of the endpoint instances. Multi-container hosting allows you to deploy multiple containers that use different models or frameworks on a single endpoint. The models can be completely heterogeneous, with their own independent serving stack.
    • Model ensembles – In many production use cases, there are often multiple upstream models feeding inputs to a given downstream model. This is where ensembles are useful. Ensemble patterns involve mixing output from one or more base models in order to reduce the generalization error of the prediction. The base models can be diverse and trained by different algorithms. Model ensembles can outperform single models because the prediction error decreases when the ensemble approach is used.

    The following are common use cases of ensemble patterns and their corresponding design pattern diagrams:

    • Scatter-gather – In a scatter-gather pattern, a request for inference is routed to a number of models. An aggregator is then used to collect the responses and distill them into a single inference response. For example, an image classification use case may use three different models to perform the task. The scatter-gather pattern allows you to combine results from inferences run on three different models and pick the most probable classification (see the sketch after this list).

    • Model aggregate – In an aggregation pattern, outputs from multiple models are averaged. For classification models, multiple models’ predictions are evaluated to determine the class that received the most votes and is treated as the final output of the ensemble. For example, in a two-class classification problem to classify a set of fruits as oranges or apples, if two models vote for an orange and one model votes for an apple, then the aggregated output will be an orange. Aggregation helps combat inaccuracy in individual models and makes the output more accurate.

    • Dynamic selection – Another pattern for ensemble models is to dynamically perform model selection for the given input attributes. For example, in a given input of images of fruits, if the input contains an orange, model A will be used because it’s specialized for oranges. If the input contains an apple, model B will be used because it’s specialized for apples.

    • Serial inference ML applications – A serial inference pattern, also known as an inference pipeline, suits use cases that need to preprocess incoming data before invoking a pretrained ML model to generate inferences. Additionally, in some cases, the generated inferences may need to be processed further so that they can be easily consumed by downstream applications. An inference pipeline allows you to reuse the same preprocessing code used during model training to process the inference request data used for predictions.

    • Business logic – Productionizing ML always involves business logic. Business logic patterns involve everything that’s needed to perform an ML task that is not ML model inference. This includes, for example, loading the model from Amazon Simple Storage Service (Amazon S3), performing database lookups to validate the input, obtaining precomputed features from the feature store, and so on. After these business logic steps are complete, the inputs are passed through to ML models.
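
    As a concrete illustration of the scatter-gather pattern described above, the following sketch fans one request out to three hypothetical SageMaker endpoints and aggregates the class predictions by majority vote. The endpoint names and the JSON response format are assumptions for illustration only.

    import json
    from collections import Counter
    from concurrent.futures import ThreadPoolExecutor

    import boto3

    runtime = boto3.client("sagemaker-runtime")

    # Hypothetical endpoint names; each hosts a different image classification model.
    ENDPOINTS = ["classifier-a", "classifier-b", "classifier-c"]

    def invoke(endpoint_name, payload):
        # Scatter: send the same request to one endpoint.
        response = runtime.invoke_endpoint(
            EndpointName=endpoint_name,
            ContentType="application/json",
            Body=json.dumps(payload),
        )
        # Assumes each model returns a JSON object like {"label": "<class>"}.
        return json.loads(response["Body"].read())["label"]

    def scatter_gather(payload):
        # Fan out to all endpoints in parallel, then gather the responses.
        with ThreadPoolExecutor(max_workers=len(ENDPOINTS)) as pool:
            labels = list(pool.map(lambda ep: invoke(ep, payload), ENDPOINTS))
        # Aggregate: majority vote across the individual model predictions.
        return Counter(labels).most_common(1)[0][0]

    print(scatter_gather({"image_url": "s3://example-bucket/image.jpg"}))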

    ML inference options

    For model deployment, it’s important to work backward from your use case. What is the frequency of the prediction? Do you expect live traffic to your application and real-time response to your clients? Do you have many models trained for different subsets of data for the same use case? Does the prediction traffic fluctuate? Is latency of inference a concern? Based on these details, all the preceding design patterns can be implemented using the following deployment options:

    • Real-time inference – Real-time inference is ideal for inference workloads where you have real-time, interactive, low-latency requirements. Real-time ML inference workloads may include a single-model based ML application, where an application requires only one ML model to serve a single request, or a multi-model based ML application, where an application requires multiple ML models to serve a single request.
    • Near-real-time (asynchronous) inference – With near-real-time inference, you can queue incoming requests. This can be used for running inference on inputs that are hundreds of MBs in size. It operates in near-real time, letting users submit input for inference and read the output from an S3 bucket. It is especially handy for NLP and computer vision use cases with large payloads that require longer preprocessing times.
    • Batch inference – Batch inference can be utilized for running inference offline on a large dataset. Because it runs offline, batch inference doesn’t offer the lowest latency. Here, the inference request is processed with either a scheduled or event-based trigger of a batch inference job.
    • Serverless inference – Serverless inference is ideal for workloads that have idle periods between traffic spurts and can tolerate a few extra seconds of latency (cold start) for the first invocation after an idle period. For example, a chatbot service or an application to process forms or analyze data from documents. In this case, you might want an online inference option that is able to automatically provision and scale compute capacity based on the volume of inference requests. And during idle time, it should be able to turn off compute capacity completely so that you’re not charged. Serverless inference takes away the undifferentiated heavy lifting of selecting and managing servers by automatically launching compute resources and scaling them in and out depending on traffic.

    Use fitness functions to select the right ML inference option

    Deciding on the right hosting option is important because it impacts the end-user experience delivered by your applications. For this purpose, we’re borrowing the concept of fitness functions, which was coined by Neal Ford and his colleagues from AWS Partner ThoughtWorks in their work Building Evolutionary Architectures. Fitness functions provide a prescriptive assessment of various hosting options based on the customer’s objectives. Fitness functions help you obtain the necessary data to allow for the planned evolution of your architecture. They set measurable values to assess how close your solution is to achieving your set goals. Fitness functions can and should be adapted as the architecture evolves to guide a desired change process. This provides architects with a tool to guide their teams while maintaining team autonomy.

    There are five main fitness functions that customers care about when it comes to selecting the right ML inference option for hosting their ML models and applications.

    Fitness function Description
    Cost

    Deploying and maintaining an ML model and application on a scalable framework is a critical business process, and the costs may vary greatly depending on choices made about model hosting infrastructure, hosting option, ML frameworks, ML model characteristics, optimizations, scaling policy, and more. Workloads must utilize the hardware infrastructure optimally to keep costs in check.

    This fitness function specifically refers to the infrastructure cost, which is a part of overall total cost of ownership (TCO). The infrastructure costs are the combined costs for storage, network, and compute. It’s also critical to understand other components of TCO, including operational costs and security and compliance costs.

    Operational costs are the combined costs of operating, monitoring, and maintaining the ML infrastructure. The operational costs are calculated as the number of engineers required based on each scenario and the annual salary of engineers, aggregated over a specific period.

    Customers using self-managed ML solutions on Amazon Elastic Compute Cloud (Amazon EC2), Amazon Elastic Container Service (Amazon ECS), and Amazon Elastic Kubernetes Service (Amazon EKS) need to build operational tooling themselves.

    Customers using SageMaker incur significantly less TCO. SageMaker inference is a fully managed service and provides capabilities out of the box for deploying ML models for inference. You don’t need to provision instances, monitor instance health, manage security updates or patches, emit operational metrics, or build monitoring for your ML inference workloads. It has built-in capabilities to ensure high availability and resiliency. SageMaker supports security with end-to-end encryption at rest and in transit, including encryption of the root volume and Amazon Elastic Block Store (Amazon EBS) volume, Amazon Virtual Private Cloud (Amazon VPC) support, AWS PrivateLink, customer-managed keys, AWS Identity and Access Management (IAM) fine-grained access control, AWS CloudTrail audits, internode encryption for training, tag-based access control, network isolation, and Interactive Application Proxy.

    All of these security features are provided out of the box in SageMaker, and can save businesses tens of development months of engineering effort over a 3-year period. SageMaker is a HIPAA-eligible service, and is certified under PCI, SOC, GDPR, and ISO. SageMaker also supports FIPS endpoints. For more information about TCO, refer to The total cost of ownership of Amazon SageMaker.

    Inference latency – Many ML models and applications are latency critical, where the inference latency must be within the bounds specified by a service level objective. Inference latency depends upon a multitude of factors, including model size and complexity, hardware platform, software environment, and network architecture. For example, larger and more complex models can take longer to run inference.
    Throughput (transactions per second) – For model inference, optimizing throughput is crucial for performance tuning and achieving the business objective of the ML application. As we continue to advance rapidly in all aspects of ML, including low-level implementations of mathematical operations in chip design, hardware-specific libraries play a greater role in performance optimization. Various factors such as payload size, network hops, nature of hops, model graph features, operators in the model, and the CPU, GPU, and memory profile of the model hosting instances affect the throughput of the ML model.
    Scaling configuration complexity – It’s crucial for the ML models or applications to run on a scalable framework that can handle the demand of varying traffic. It also allows for the maximum utilization of CPU and GPU resources and prevents over-provisioning of compute resources.
    Expected traffic pattern – ML models or applications can have different traffic patterns, ranging from continuous real-time live traffic to periodic peaks of thousands of requests per second, and from infrequent, unpredictable request patterns to offline batch requests on larger datasets. Working backward from the expected traffic pattern is recommended in order to select the right hosting option for your ML model.

    Deploying models with SageMaker

    SageMaker is a fully managed AWS service that provides every developer and data scientist with the ability to quickly build, train, and deploy ML models at scale. With SageMaker inference, you can deploy your ML models on hosted endpoints and get inference results. SageMaker provides a wide selection of hardware and features to meet your workload requirements, allowing you to choose from over 70 instance types with hardware acceleration. If you’re not sure which instance type is optimal for your workload, SageMaker can also recommend one using a feature called SageMaker Inference Recommender.

    You can choose deployment options to best meet your use cases, such as real-time inference, asynchronous inference, batch transform, and even serverless endpoints. In addition, SageMaker offers various deployment strategies such as canary, blue/green, shadow, and A/B testing for model deployment, along with cost-effective deployment options such as multi-model and multi-container endpoints and elastic scaling. With SageMaker inference, you can view the performance metrics for your endpoints in Amazon CloudWatch, automatically scale endpoints based on traffic, and update your models in production without losing any availability.

    SageMaker offers four options to deploy your model so you can start making predictions:

    • Real-time inference – This is suitable for workloads with millisecond latency requirements, payload sizes up to 6 MB, and processing times of up to 60 seconds.
    • Batch transform – This is ideal for offline predictions on large batches of data that are available up-front.
    • Asynchronous inference – This is designed for workloads that don’t have sub-second latency requirements, payload sizes up to 1 GB, and processing times of up to 15 minutes.
    • Serverless inference – With serverless inference, you can quickly deploy ML models for inference without having to configure or manage the underlying infrastructure. Additionally, you pay only for the compute capacity used to process inference requests, which is ideal for intermittent workloads.

    The following diagram can help you understand the SageMaker hosting model deployment options along with the associated fitness function evaluations.

    Let’s explore each of the deployment options in more detail.

    Real-time inference in SageMaker

    SageMaker real-time inference is recommended if you have sustained traffic and need lower and consistent latency for your requests with payload sizes up to 6 MB, and processing times of up to 60 seconds. You deploy your model to SageMaker hosting services and get an endpoint that can be used for inference. These endpoints are fully managed and support auto scaling. Real-time inference is popular for use cases where you expect a low-latency, synchronous response with predictable traffic patterns, such as personalized recommendations for products and services or transactional fraud detection use cases.

    Typically, a client application sends requests to the SageMaker HTTPS endpoint to obtain inferences from a deployed model. You can deploy multiple variants of a model to the same SageMaker HTTPS endpoint. This is useful for testing variations of a model in production. Auto scaling allows you to dynamically adjust the number of instances provisioned for a model in response to changes in your workload.
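
    As a minimal sketch of this flow (assuming you already have a trained model artifact in Amazon S3, a serving container image URI, and an IAM execution role; all names below are placeholders), you could deploy and invoke a real-time endpoint with the SageMaker Python SDK and Boto3 like this:

    import json
    import boto3
    from sagemaker.model import Model

    # Package the model artifact and serving container as a SageMaker Model.
    model = Model(
        image_uri="<serving-container-image-uri>",          # placeholder
        model_data="s3://<bucket>/model/model.tar.gz",       # placeholder
        role="<sagemaker-execution-role-arn>",               # placeholder
    )

    # Deploy to a fully managed real-time endpoint on dedicated instances.
    model.deploy(
        initial_instance_count=1,
        instance_type="ml.c5.large",
        endpoint_name="my-realtime-endpoint",
    )

    # Client applications send synchronous requests to the HTTPS endpoint.
    runtime = boto3.client("sagemaker-runtime")
    response = runtime.invoke_endpoint(
        EndpointName="my-realtime-endpoint",
        ContentType="application/json",
        Body=json.dumps({"features": [1.5, 2.3, 0.7]}),
    )
    print(response["Body"].read())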

    The following table provides guidance on evaluating SageMaker real-time inference based on the fitness functions.

    Fitness function Description
    Cost

    Real-time endpoints offer synchronous response to inference requests. Because the endpoint is always running and available to provide real-time synchronous inference response, you pay for using the instance. Costs can quickly add up when you deploy multiple endpoints, especially if the endpoints don’t fully utilize the underlying instances. Choosing the right instance for your model helps ensure you have the most performant instance at the lowest cost for your models. Auto scaling is recommended to dynamically adjust the capacity depending on traffic to maintain steady and predictable performance at the possible lowest cost.

    SageMaker extends access to Graviton2 and Graviton3-based ML instance families. AWS Graviton processors are custom built by Amazon Web Services using 64-bit Arm Neoverse cores to deliver the best price performance for your cloud workloads running on Amazon EC2. With Graviton-based instances, you have more options for optimizing the cost and performance when deploying your ML models on SageMaker.

    SageMaker also supports Inf1 instances, providing high performance and cost-effective ML inference. With 1–16 AWS Inferentia chips per instance, Inf1 instances can scale in performance and deliver up to three times higher throughput and up to 50% lower cost per inference compared to the AWS GPU-based instances. To use Inf1 instances in SageMaker, you can compile your trained models using Amazon SageMaker Neo and select the Inf1 instances to deploy the compiled model on SageMaker.

    You can also explore Savings Plans for SageMaker to benefit from cost savings up to 64% compared to the on-demand price.

    When you create an endpoint, SageMaker attaches an EBS storage volume to each ML compute instance that hosts the endpoint. The size of the storage volume depends on the instance type. Additional cost for real-time endpoints includes cost of GB-month of provisioned storage, plus GB data processed in and GB data processed out of the endpoint instance.

    Inference latency – Real-time inference is ideal when you need a persistent endpoint with millisecond latency requirements. It supports payload sizes up to 6 MB, and processing times of up to 60 seconds.
    Throughput

    An ideal value of inference throughput is subjective to factors such as model, model input size, batch size, and endpoint instance type. As a best practice, review CloudWatch metrics for input requests and resource utilization, and select the appropriate instance type to achieve optimal throughput.

    A business application can be either throughput optimized or latency optimized. For example, dynamic batching can help increase the throughput for latency-sensitive apps using real-time inference. However, there are limits to batch size, beyond which inference latency is affected: latency grows as you increase the batch size to improve throughput. Therefore, real-time inference is an ideal option for latency-sensitive applications. SageMaker provides options of asynchronous inference and batch transform, which are optimized to give higher throughput compared to real-time inference if the business applications can tolerate a slightly higher latency.

    Scaling configuration complexity

    SageMaker real-time endpoints support auto scaling out of the box. When the workload increases, auto scaling brings more instances online. When the workload decreases, auto scaling removes unnecessary instances, helping you reduce your compute cost. Without auto scaling, you need to provision for peak traffic or risk model unavailability. Unless the traffic to your model is steady throughout the day, there will be excess unused capacity. This leads to low utilization and wasted resources.

    With SageMaker, you can configure different scaling options based on the expected traffic pattern. Simple scaling or target tracking scaling is ideal when you want to scale based on a specific CloudWatch metric. You can do this by choosing a specific metric and setting threshold values. The recommended metrics for this option are average CPUUtilization or SageMakerVariantInvocationsPerInstance (see the configuration sketch after this table).

    If you require advanced configuration, you can set a step scaling policy to dynamically adjust the number of instances to scale based on the size of the alarm breach. This helps you configure a more aggressive response when demand reaches a certain level.

    You can use a scheduled scaling option when you know that the demand follows a particular schedule in the day, week, month, or year. This lets you specify a one-time schedule, a recurring schedule, or cron expressions, along with start and end times, which form the boundaries of when the auto scaling action starts and stops.

    For more details, refer to Configuring autoscaling inference endpoints in Amazon SageMaker and Load test and optimize an Amazon SageMaker endpoint using automatic scaling.

    Traffic pattern – Real-time inference is ideal for workloads with a continual or regular traffic pattern.
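
    To make the target-tracking option concrete, here is a minimal sketch (endpoint and variant names are placeholders) that registers an endpoint variant as a scalable target and attaches a policy based on SageMakerVariantInvocationsPerInstance:

    import boto3

    autoscaling = boto3.client("application-autoscaling")

    # Identify the endpoint variant to scale (placeholder names).
    resource_id = "endpoint/my-realtime-endpoint/variant/AllTraffic"

    # Allow SageMaker to scale this variant between 1 and 4 instances.
    autoscaling.register_scalable_target(
        ServiceNamespace="sagemaker",
        ResourceId=resource_id,
        ScalableDimension="sagemaker:variant:DesiredInstanceCount",
        MinCapacity=1,
        MaxCapacity=4,
    )

    # Target-tracking policy: keep invocations per instance near the target value.
    autoscaling.put_scaling_policy(
        PolicyName="invocations-target-tracking",
        ServiceNamespace="sagemaker",
        ResourceId=resource_id,
        ScalableDimension="sagemaker:variant:DesiredInstanceCount",
        PolicyType="TargetTrackingScaling",
        TargetTrackingScalingPolicyConfiguration={
            "TargetValue": 70.0,
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
            },
            "ScaleInCooldown": 300,
            "ScaleOutCooldown": 60,
        },
    )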

    Asynchronous inference in SageMaker

    SageMaker asynchronous inference queues incoming requests and processes them asynchronously. This option is ideal for requests with large payload sizes (up to 1 GB), long processing times (up to 15 minutes), and near-real-time latency requirements. Example workloads for asynchronous inference include healthcare companies processing high-resolution biomedical images or videos like echocardiograms to detect anomalies. These applications receive bursts of incoming traffic at different times in the day and require near-real-time processing at low cost. Processing times for these requests can range in the order of minutes, eliminating the need to run real-time inference. Instead, input payloads can be processed asynchronously from an object store like Amazon S3 with automatic queuing and a predefined concurrency threshold. Upon processing, SageMaker places the inference response in the previously returned Amazon S3 location. You can optionally choose to receive success or error notifications via Amazon Simple Notification Service (Amazon SNS).
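
    The following sketch (bucket, paths, and names are placeholders) deploys a model with an asynchronous inference configuration and submits a request whose payload already sits in Amazon S3:

    import boto3
    from sagemaker.model import Model
    from sagemaker.async_inference import AsyncInferenceConfig

    model = Model(
        image_uri="<serving-container-image-uri>",          # placeholder
        model_data="s3://<bucket>/model/model.tar.gz",       # placeholder
        role="<sagemaker-execution-role-arn>",               # placeholder
    )

    # Results are written to the S3 output path; SNS notifications are optional.
    model.deploy(
        initial_instance_count=1,
        instance_type="ml.m5.xlarge",
        endpoint_name="my-async-endpoint",
        async_inference_config=AsyncInferenceConfig(
            output_path="s3://<bucket>/async-output/",
        ),
    )

    # Submit a request; the input payload is read from S3 and queued for processing.
    runtime = boto3.client("sagemaker-runtime")
    response = runtime.invoke_endpoint_async(
        EndpointName="my-async-endpoint",
        InputLocation="s3://<bucket>/async-input/payload.json",
        ContentType="application/json",
    )
    print(response["OutputLocation"])  # where SageMaker will place the inference result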

    The following table provides guidance on evaluating SageMaker asynchronous inference based on the fitness functions.

    Fitness function Description
    Cost – Asynchronous inference is a great choice for cost-sensitive workloads with large payloads and burst traffic. Asynchronous inference enables you to save on costs by auto scaling the instance count to zero when there are no requests to process, so you only pay when your endpoint is processing requests. Requests that are received when there are zero instances are queued for processing after the endpoint scales up.
    Inference latency – Asynchronous inference is ideal for near-real-time latency requirements. The requests are placed in a queue and processed as soon as compute is available. This typically results in tens of milliseconds in latency.
    Throughput – Asynchronous inference is ideal for non-latency-sensitive use cases, because applications don’t have to compromise on throughput. Requests aren’t dropped during traffic spikes because the asynchronous inference endpoint queues up requests rather than dropping them.
    Scaling configuration complexity

    SageMaker supports auto scaling for asynchronous endpoints. Unlike real-time hosted endpoints, asynchronous inference endpoints support scaling down instances to zero by setting the minimum capacity to zero. For asynchronous endpoints, SageMaker strongly recommends that you create a policy configuration for target-tracking scaling for a deployed model (variant).

    For use cases that can tolerate a cold start penalty of a few minutes, you can optionally scale down the endpoint instance count to zero when there are no outstanding requests and scale back up as new requests arrive so that you only pay for the duration that the endpoints are actively processing requests.

    Traffic pattern – Asynchronous endpoints queue incoming requests and process them asynchronously. They’re a good option for intermittent or infrequent traffic patterns.

    Batch inference in SageMaker

    SageMaker batch transform is ideal for offline predictions on large batches of data that are available up-front. The batch transform feature is a high-performance and high-throughput method for transforming data and generating inferences. It’s ideal for scenarios where you’re dealing with large batches of data, don’t need subsecond latency, or need to both preprocess and transform the training data. Customers in certain domains such as advertising and marketing or healthcare often need to make offline predictions on hyperscale datasets where high throughput is often the objective of the use case and latency isn’t a concern.

    When a batch transform job starts, SageMaker initializes compute instances and distributes the inference workload between them. It releases the resources when the jobs are complete, so you pay only for what was used during the run of your job. When the job is complete, SageMaker saves the prediction results in an S3 bucket that you specify. Batch inference tasks are usually good candidates for horizontal scaling. Each worker within a cluster can operate on a different subset of data without the need to exchange information with other workers. AWS offers multiple storage and compute options that enable horizontal scaling. Example workloads for SageMaker batch transform include offline applications such as banking applications for predicting customer churn where an offline job can be scheduled to run periodically.
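
    Assuming a Model object like the one created earlier and a dataset already staged in Amazon S3 (paths below are placeholders), a batch transform job can be sketched as follows:

    from sagemaker.model import Model

    model = Model(
        image_uri="<serving-container-image-uri>",          # placeholder
        model_data="s3://<bucket>/model/model.tar.gz",       # placeholder
        role="<sagemaker-execution-role-arn>",               # placeholder
    )

    # A transformer describes the fleet that will process the batch job.
    transformer = model.transformer(
        instance_count=2,                          # workers scale horizontally over the data
        instance_type="ml.m5.xlarge",
        strategy="MultiRecord",                    # batch multiple records per request
        max_payload=6,                             # MB per request (MaxPayloadInMB)
        output_path="s3://<bucket>/batch-output/",
    )

    # Run inference offline over the whole dataset; results land in the output path.
    transformer.transform(
        data="s3://<bucket>/batch-input/",
        content_type="text/csv",
        split_type="Line",
    )
    transformer.wait()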

    The following table provides guidance on evaluating SageMaker batch transform based on the fitness functions.

    Fitness function Description
    Cost – SageMaker batch transform allows you to run predictions on large or small batch datasets. You are charged for the instance type you choose, based on the duration of use. SageMaker manages the provisioning of resources at the start of the job and releases them when the job is complete. There is no additional data processing cost.
    Inference latency – You can use event-based or scheduled invocation. Latency could vary depending on the size of inference data, job concurrency, complexity of the model, and compute instance capacity.
    Throughput

    Batch transform jobs can be done on a range of datasets, from petabytes of data to very small datasets. There is no need to resize larger datasets into small chunks of data. You can speed up batch transform jobs by using optimal values for parameters such as MaxPayloadInMB, MaxConcurrentTransforms, or BatchStrategy. The ideal value for MaxConcurrentTransforms is equal to the number of compute workers in the batch transform job.

    Batch processing can increase throughput and optimize your resources because it helps complete a larger number of inferences in a certain amount of time at the expense of latency. To optimize model deployment for higher throughput, the general guideline is to increase the batch size until throughput decreases.

    Scaling configuration complexity – SageMaker batch transform is used for offline inference that is not latency sensitive.
    Traffic pattern – For offline inference, a batch transform job is scheduled or started using an event-based trigger.

    Serverless inference in SageMaker

    SageMaker serverless inference allows you to deploy ML models for inference without having to configure or manage the underlying infrastructure. Based on the volume of inference requests your model receives, SageMaker serverless inference automatically provisions, scales, and turns off compute capacity. As a result, you pay for only the compute time to run your inference code and the amount of data processed, not for idle time. You can use SageMaker’s built-in algorithms and ML framework-serving containers to deploy your model to a serverless inference endpoint or choose to bring your own container. If traffic becomes predictable and stable, you can easily update from a serverless inference endpoint to a SageMaker real-time endpoint without the need to make changes to your container image. With serverless inference, you also benefit from other SageMaker features, including built-in metrics such as invocation count, faults, latency, host metrics, and errors in CloudWatch.
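
    A minimal sketch of a serverless deployment (again with placeholder names) looks like the following; memory size and maximum concurrency are the only capacity settings you specify:

    from sagemaker.model import Model
    from sagemaker.serverless import ServerlessInferenceConfig

    model = Model(
        image_uri="<serving-container-image-uri>",          # placeholder
        model_data="s3://<bucket>/model/model.tar.gz",       # placeholder
        role="<sagemaker-execution-role-arn>",               # placeholder
    )

    # No instance type or count: SageMaker provisions and scales compute on demand.
    model.deploy(
        endpoint_name="my-serverless-endpoint",
        serverless_inference_config=ServerlessInferenceConfig(
            memory_size_in_mb=2048,   # vCPU is allocated proportionally to memory
            max_concurrency=5,        # maximum concurrent invocations
        ),
    )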

    The following table provides guidance on evaluating SageMaker serverless inference based on the fitness functions.

    Fitness function Description
    Cost – With a pay-as-you-run model, serverless inference is a cost-effective option if you have infrequent or intermittent traffic patterns. You pay only for the duration for which the endpoint processes the request, and therefore can save costs if the traffic pattern is intermittent.
    Inference latency

    Serverless endpoints offer low inference latency (in the order of milliseconds to seconds), with the ability to scale instantly from tens to thousands of inferences within seconds based on the usage patterns, making it ideal for ML applications with intermittent or unpredictable traffic.

    Because serverless endpoints provision compute resources on demand, your endpoint may experience a few extra seconds of latency (cold start) for the first invocation after an idle period. The cold start time depends on your model size, how long it takes to download your model, and the startup time of your container.

    Throughput – When configuring your serverless endpoint, you can specify the memory size and maximum number of concurrent invocations. SageMaker serverless inference auto-assigns compute resources proportional to the memory you select. If you choose a larger memory size, your container has access to more vCPUs. As a general rule, the memory size should be at least as large as your model size. The memory sizes you can choose are 1024 MB, 2048 MB, 3072 MB, 4096 MB, 5120 MB, and 6144 MB. Regardless of the memory size you choose, serverless endpoints have 5 GB of ephemeral disk storage available.
    Scaling configuration complexity – Serverless endpoints automatically launch compute resources and scale them in and out depending on traffic, eliminating the need to choose instance types or manage scaling policies. This takes away the undifferentiated heavy lifting of selecting and managing servers.
    Traffic pattern – Serverless inference is ideal for workloads with infrequent or intermittent traffic patterns.

    Model hosting design patterns in SageMaker

    SageMaker inference endpoints use Docker containers for hosting ML models. Containers allow you to package software into standardized units that run consistently on any platform that supports Docker. This ensures portability across platforms, immutable infrastructure deployments, and easier change management and CI/CD implementations. SageMaker provides pre-built managed containers for popular frameworks such as Apache MXNet, TensorFlow, PyTorch, Sklearn, and Hugging Face. For a full list of available SageMaker container images, refer to Available Deep Learning Containers Images. If SageMaker doesn’t have a supported container, you can build your own container (BYOC) and push a custom image with the dependencies your model needs.

    To deploy a model on SageMaker, you need a container (SageMaker managed framework containers or BYOC) and a compute instance to host the container. SageMaker supports multiple advanced options for common ML model hosting design patterns where models can be hosted on a single container or co-hosted on a shared container.

    A real-time ML application may use a single model or multiple models to serve a single prediction request. The following diagram shows various inference scenarios for an ML application.

    Let’s explore a suitable SageMaker hosting option for each of the preceding inference scenarios. You can refer to the fitness functions to assess if it’s the right option for the given use case.

    Hosting a single-model based ML application

    There are several options to host single-model based ML applications using SageMaker hosting services depending on the deployment scenario.

    Single-model endpoint

    SageMaker single-model endpoints allow you to host one model on a container hosted on dedicated instances for low latency and high throughput. These endpoints are fully managed and support auto scaling. You can configure the single-model endpoint as a provisioned endpoint where you pass in endpoint infrastructure configuration such as the instance type and count, or a serverless endpoint where SageMaker automatically launches compute resources and scales them in and out depending on traffic, eliminating the need to choose instance types or manage scaling policies. Serverless endpoints are for applications with intermittent or unpredictable traffic.
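
    At the API level, a provisioned single-model endpoint is created from three resources: a model, an endpoint configuration (where the instance type and count are set), and the endpoint itself. The sketch below uses placeholder names, images, and ARNs:

    import boto3

    sm = boto3.client("sagemaker")

    # 1. Register the model: container image plus model artifact in S3.
    sm.create_model(
        ModelName="my-model",
        PrimaryContainer={
            "Image": "<serving-container-image-uri>",            # placeholder
            "ModelDataUrl": "s3://<bucket>/model/model.tar.gz",   # placeholder
        },
        ExecutionRoleArn="<sagemaker-execution-role-arn>",        # placeholder
    )

    # 2. Endpoint configuration: dedicated instances for this single model.
    sm.create_endpoint_config(
        EndpointConfigName="my-model-config",
        ProductionVariants=[
            {
                "VariantName": "AllTraffic",
                "ModelName": "my-model",
                "InstanceType": "ml.c5.large",
                "InitialInstanceCount": 1,
            }
        ],
    )

    # 3. Create the endpoint; SageMaker provisions and manages the instances.
    sm.create_endpoint(
        EndpointName="my-single-model-endpoint",
        EndpointConfigName="my-model-config",
    )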

    The following diagram shows single-model endpoint inference scenarios.

    The following table provides guidance on evaluating fitness functions for a provisioned single-model endpoint. For serverless endpoint fitness function evaluations, refer to the serverless endpoint section in this post.

    Fitness function Description
    Cost – You are charged for usage of the instance type you choose. Because the endpoint is always running and available, costs can quickly add up. Choosing the right instance for your model helps ensure you have the most performant instance at the lowest cost for your models. Auto scaling is recommended to dynamically adjust the capacity depending on traffic to maintain steady and predictable performance at the possible lowest cost.
    Inference latency – A single-model endpoint provides real-time, interactive, synchronous inference with millisecond latency requirements.
    Throughput – Throughput can be impacted by various factors, such as model input size, batch size, endpoint instance type, and so on. It is recommended to review CloudWatch metrics for input requests and resource utilization, and select the appropriate instance type to achieve optimal throughput. SageMaker provides features to manage resources and optimize inference performance when deploying ML models. You can optimize model performance using Neo, use Inf1 instances for better throughput of your SageMaker hosted models, or use a GPU instance for your endpoint.
    Scaling configuration complexity – Auto scaling is supported out of the box. SageMaker recommends choosing an appropriate scaling configuration by performing load tests.
    Traffic pattern – A single-model endpoint is ideal for workloads with predictable traffic patterns.

    Co-hosting multiple models

    When you’re dealing with a large number of models, deploying each one on an individual endpoint with a dedicated container and instance can result in a significant increase in cost. It also becomes difficult to manage so many models in production, specifically when you don’t need to invoke all the models at the same time but still need them to be available at all times. Co-hosting multiple models on the same underlying compute resources makes it easy to manage ML deployments at scale and lowers your hosting costs through increased usage of the endpoint and its underlying compute resources. SageMaker supports advanced model co-hosting options such as multi-model endpoints (MMEs) for homogeneous models and multi-container endpoints (MCEs) for heterogeneous models. Homogeneous models use the same ML framework on a shared serving container, whereas heterogeneous models allow you to deploy multiple serving containers that use different models or frameworks on a single endpoint.

    The following diagram shows model co-hosting options using SageMaker.

    SageMaker multi-model endpoints

    SageMaker MMEs allow you to host multiple models using a shared serving container on a single endpoint. This is a scalable and cost-effective solution to deploy a large number of models that cater to the same use case, framework, or inference logic. MMEs can dynamically serve requests based on the model invoked by the caller. They also reduce deployment overhead because SageMaker manages loading models in memory and scaling them based on the traffic patterns to them. This feature is ideal when you have a large number of similar models that you can serve through a shared serving container and don’t need to access all the models at the same time. Multi-model endpoints also enable time-sharing of memory resources across your models. This works best when the models are fairly similar in size and invocation latency, allowing MMEs to effectively use the instances across all models. SageMaker MMEs support hosting both CPU- and GPU-backed models. By using GPU-backed models, you can lower your model deployment costs through increased usage of the endpoint and its underlying accelerated compute instances. For a real-world use case of MMEs, refer to How to scale machine learning inference for multi-tenant SaaS use cases.
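
    A minimal MME sketch with the SageMaker Python SDK and Boto3 (the S3 prefix, image, and names are placeholders) could look like this; every model artifact under the shared prefix becomes invocable by name:

    import json
    import boto3
    from sagemaker.multidatamodel import MultiDataModel

    # All model artifacts (model-a.tar.gz, model-b.tar.gz, ...) live under one S3 prefix.
    mme = MultiDataModel(
        name="my-multi-model",
        model_data_prefix="s3://<bucket>/mme-models/",          # placeholder
        image_uri="<shared-serving-container-image-uri>",       # placeholder
        role="<sagemaker-execution-role-arn>",                  # placeholder
    )

    # One endpoint serves every model under the prefix.
    mme.deploy(
        initial_instance_count=1,
        instance_type="ml.c5.xlarge",
        endpoint_name="my-mme-endpoint",
    )

    # The target model is loaded on demand (lazily) the first time it is invoked.
    runtime = boto3.client("sagemaker-runtime")
    response = runtime.invoke_endpoint(
        EndpointName="my-mme-endpoint",
        ContentType="application/json",
        TargetModel="model-a.tar.gz",
        Body=json.dumps({"features": [1.5, 2.3]}),
    )
    print(response["Body"].read())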

    The following table provides guidance on evaluating the fitness functions for MMEs.

    Fitness function Description
    Cost

    MMEs enable using a shared serving container to host thousands of models on a single endpoint. This reduces hosting costs significantly by improving endpoint utilization compared with using single-model endpoints. For example, if you have 10 models to deploy using an ml.c5.large instance, based on SageMaker pricing, the cost of having 10 single-model persistent endpoints is: 10 * $0.102 = $1.02 per hour.

    Whereas with one MME hosting the 10 models, we achieve 10 times cost savings: 1 * $0.102 = $0.102 per hour.

    Inference latency

    By default, MMEs cache frequently used models in memory and on disk to provide low-latency inference. The cached models are unloaded or deleted from disk only when a container runs out of memory or disk space to accommodate a newly targeted model. MMEs allow lazy loading of models, which means models are loaded into memory when invoked for the first time. This optimizes memory utilization; however, it causes response time spikes on first load, resulting in a cold start problem. Therefore, MMEs are also well suited to scenarios that can tolerate occasional cold-start-related latency penalties that occur when invoking infrequently used models.

    To meet the latency and throughput goals of ML applications, GPU instances are preferred over CPU instances (given the computational power GPUs offer). With MME support for GPU, you can deploy thousands of deep learning models behind one SageMaker endpoint. MMEs can run multiple models on a GPU core, share GPU instances behind an endpoint across multiple models, and dynamically load and unload models based on the incoming traffic. With this, you can significantly save cost and achieve the best price performance. If your use case demands significantly higher transactions per second (TPS) or latency requirements, we recommend hosting the models on dedicated endpoints.

    Throughput

    An ideal value of MME inference throughput depends on factors such as model, payload size, and endpoint instance type. A higher amount of instance memory enables you to have more models loaded and ready to serve inference requests. You don’t need to waste time loading the model. A higher amount of vCPUs enables you to invoke more unique models concurrently. MMEs dynamically load and unload the model to and from instance memory, which may impact I/O performance.

    SageMaker MMEs with GPU work using NVIDIA Triton Inference Server, which is an open-source inference serving software that simplifies the inference serving process and provides high inference performance. SageMaker loads the model to the NVIDIA Triton container’s memory on a GPU accelerated instance and serves the inference request. The GPU core is shared by all the models in an instance. If the model is already loaded in the container memory, the subsequent requests are served faster because SageMaker doesn’t need to download and load it again.

    Proper performance testing and analysis is recommended for successful production deployments. SageMaker provides CloudWatch metrics for multi-model endpoints so you can determine the endpoint usage and the cache hit rate to help optimize your endpoint.

    Scaling configuration complexity – SageMaker multi-model endpoints fully support auto scaling, which manages replicas of models to ensure models scale based on traffic patterns. However, proper load testing is recommended to determine the optimal size of the instances for auto scaling the endpoint. Right-sizing the MME fleet is important to avoid having too many models unload. Loading hundreds of models on a few larger instances may lead to throttling in some cases, so using more, smaller instances could be preferred. To take advantage of automated model scaling in SageMaker, make sure you have instance auto scaling set up to provision additional instance capacity. Set up your endpoint-level scaling policy with either custom parameters or invocations per minute (recommended) to add more instances to the endpoint fleet. The invocation rates used to trigger an auto scale event are based on the aggregate set of predictions across the full set of models served by the endpoint.
    Traffic pattern – MMEs are ideal when you have a large number of similarly sized models that you can serve through a shared serving container and don’t need to access all the models at the same time.

    SageMaker multi-container endpoints

    SageMaker MCEs support deploying up to 15 containers that use different models or frameworks on a single endpoint, and invoking them independently or in sequence for low-latency inference and cost savings (see the sketch after the following list). The models can be completely heterogeneous, with their own independent serving stack. Securely hosting multiple models from different frameworks on a single instance could save you up to 90% in cost.

    The MCE invocation patterns are as follows:

    • Inference pipelines – Containers in an MCE can be invoked in a linear sequence, also known as a serial inference pipeline. They are typically used to separate preprocessing, model inference, and postprocessing into independent containers. The output from the current container is passed as input to the next. They are represented as a single pipeline model in SageMaker. An inference pipeline can be deployed as an MME, where one of the containers in the pipeline can dynamically serve requests based on the model being invoked.
    • Direct invocation – With direct invocation, a request can be sent to a specific inference container hosted on an MCE.
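
    The sketch below (placeholder names, images, and role) creates a two-container MCE in direct invocation mode and then targets one container explicitly:

    import json
    import boto3

    sm = boto3.client("sagemaker")

    # Two heterogeneous serving stacks hosted behind one endpoint.
    sm.create_model(
        ModelName="my-multi-container-model",
        Containers=[
            {
                "ContainerHostname": "tf-container",
                "Image": "<tensorflow-serving-image-uri>",             # placeholder
                "ModelDataUrl": "s3://<bucket>/tf-model/model.tar.gz",  # placeholder
            },
            {
                "ContainerHostname": "pytorch-container",
                "Image": "<pytorch-serving-image-uri>",                 # placeholder
                "ModelDataUrl": "s3://<bucket>/pt-model/model.tar.gz",   # placeholder
            },
        ],
        InferenceExecutionConfig={"Mode": "Direct"},  # invoke containers independently
        ExecutionRoleArn="<sagemaker-execution-role-arn>",  # placeholder
    )

    sm.create_endpoint_config(
        EndpointConfigName="my-mce-config",
        ProductionVariants=[
            {
                "VariantName": "AllTraffic",
                "ModelName": "my-multi-container-model",
                "InstanceType": "ml.m5.xlarge",
                "InitialInstanceCount": 1,
            }
        ],
    )
    sm.create_endpoint(EndpointName="my-mce-endpoint", EndpointConfigName="my-mce-config")

    # Direct invocation: route this request to a specific container.
    runtime = boto3.client("sagemaker-runtime")
    response = runtime.invoke_endpoint(
        EndpointName="my-mce-endpoint",
        TargetContainerHostname="pytorch-container",
        ContentType="application/json",
        Body=json.dumps({"inputs": [1.0, 2.0, 3.0]}),
    )
    print(response["Body"].read())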

    The following table provides guidance on evaluating the fitness functions for MCEs.

    Fitness function Description
    Cost MCEs enable you to run up to 15 different ML containers on a single endpoint and invoke them independently, thereby saving costs. This option is ideal when you have multiple models running on different serving stacks with similar resource needs, and when individual models don’t have sufficient traffic to utilize the full capacity of the endpoint instances. MCEs are therefore more cost effective than a single-model endpoint. MCEs offer synchronous inference response, which means the endpoint is always available and you pay for the uptime of the instance. Cost can add up depending on the number and type of instances.
    Inference latency MCEs are ideal for running ML apps with different ML frameworks and algorithms for each model that are accessed infrequently but still require low-latency inference. The models are always available for low-latency inference and there is no cold start problem.
    Throughput MCEs are limited to up to 15 containers on a multi-container endpoint, and GPU inference is not supported due to resource contention. For multi-container endpoints using direct invocation mode, SageMaker not only provides instance-level metrics as it does with other common endpoints, but also supports per-container metrics. As a best practice, review CloudWatch metrics for input requests and resource utilization, and select the appropriate instance type to achieve optimal throughput.
    Scaling configuration complexity MCEs support auto scaling. However, in order to configure automatic scaling, it is recommended that the model in each container exhibits similar CPU utilization and latency on each inference request. This is recommended because if traffic to the multi-container endpoint shifts from a low CPU utilization model to a high CPU utilization model, but the overall call volume remains the same, the endpoint doesn’t scale out, and there may not be enough instances to handle all the requests to the high CPU utilization model.
    Traffic pattern MCEs are ideal for workloads with continual or regular traffic patterns, for hosting models across different frameworks (such as TensorFlow, PyTorch, or Sklearn) that may not have sufficient traffic to saturate the full capacity of an endpoint instance.

    Hosting a multi-model based ML application

    Many business applications need to use multiple ML models to serve a single prediction request to their consumers. For example, consider a retail company that wants to provide recommendations to its users. The ML application in this use case may want to use different custom models for recommending different categories of products. If the company wants to add personalization to the recommendations by using individual user information, the number of custom models increases further. Hosting each custom model on a distinct compute instance is not only cost prohibitive, but also leads to underutilization of the hosting resources if not all models are frequently used. SageMaker offers efficient hosting options for multi-model based ML applications.

    The following diagram shows multi-model hosting options for a single endpoint using SageMaker.

    Serial inference pipeline

    An inference pipeline is a SageMaker model that is composed of a linear sequence of 2–15 containers that process requests for inferences on data. You use an inference pipeline to define and deploy any combination of pretrained SageMaker built-in algorithms and your own custom algorithms packaged in Docker containers. You can use an inference pipeline to combine preprocessing, predictions, and postprocessing data science tasks. The output from one container is passed as input to the next. When defining the containers for a pipeline model, you also specify the order in which the containers are run. They are represented as a single pipeline model in SageMaker. The inference pipeline can be deployed as an MME, where one of the containers in the pipeline can dynamically serve requests based on the model being invoked. You can also run a batch transform job with an inference pipeline. Inference pipelines are fully managed.
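    The following is a minimal sketch of defining and deploying a two-container serial inference pipeline with the SageMaker Python SDK. The S3 artifact locations, entry point scripts, framework versions, and names are hypothetical placeholders.

    import sagemaker
    from sagemaker.pipeline import PipelineModel
    from sagemaker.sklearn.model import SKLearnModel
    from sagemaker.xgboost.model import XGBoostModel

    role = sagemaker.get_execution_role()

    # Container 1: feature preprocessing (hypothetical artifact and script).
    preprocessor = SKLearnModel(
        model_data="s3://my-bucket/preprocessor/model.tar.gz",
        role=role,
        entry_point="preprocess.py",
        framework_version="1.2-1",
    )

    # Container 2: the predictor (hypothetical artifact and script).
    predictor = XGBoostModel(
        model_data="s3://my-bucket/xgboost/model.tar.gz",
        role=role,
        entry_point="inference.py",
        framework_version="1.7-1",
    )

    # Containers run in the order listed; each container's output feeds the next.
    pipeline_model = PipelineModel(
        name="my-inference-pipeline",
        role=role,
        models=[preprocessor, predictor],
    )
    pipeline_model.deploy(
        initial_instance_count=1,
        instance_type="ml.c5.xlarge",
        endpoint_name="my-pipeline-endpoint",
    )

    At inference time, a single InvokeEndpoint call against my-pipeline-endpoint runs both containers in sequence and returns the final container’s response.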

    The following table provides guidance on evaluating the fitness functions for ML model hosting using a serial inference pipeline.

    Fitness function Description
    Cost A serial inference pipeline enables you to run up to 15 different ML containers on a single endpoint, making hosting the inference containers cost-effective. There are no additional costs for using this feature. You pay only for the instances running on an endpoint. Cost can add up depending on the number and type of instances.
    Inference latency When an ML application is deployed as an inference pipeline, the data between different models doesn’t leave the container space. Feature processing and inferences run with low latency because the containers are co-located on the same EC2 instances.
    Throughput Within an inference pipeline model, SageMaker handles invocations as a sequence of HTTP requests. The first container in the pipeline handles the initial request, then the intermediate response is sent as a request to the second container, and so on, for each container in the pipeline. SageMaker returns the final response to the client. Throughput depends on factors such as the model, model input size, batch size, and endpoint instance type. As a best practice, review CloudWatch metrics for input requests and resource utilization, and select the appropriate instance type to achieve optimal throughput.
    Scaling configuration complexity Serial inference pipelines support auto scaling. However, in order to configure automatic scaling, it is recommended that the model in each container exhibits similar CPU utilization and latency on each inference request. This is recommended because if traffic to the pipeline endpoint shifts from a low CPU utilization model to a high CPU utilization model, but the overall call volume remains the same, the endpoint doesn’t scale out and there may not be enough instances to handle all the requests to the high CPU utilization model.

    Traffic pattern Serial inference pipelines are ideal for predictable traffic patterns with models that run sequentially on the same endpoint.

    Deploying model ensembles (Triton DAG)

    SageMaker offers integration with NVIDIA Triton Inference Server through Triton Inference Server Containers. These containers include NVIDIA Triton Inference Server, support for common ML frameworks, and useful environment variables that let you optimize performance on SageMaker. With NVIDIA Triton container images, you can easily serve ML models and benefit from the performance optimizations, dynamic batching, and multi-framework support provided by NVIDIA Triton. Triton helps maximize the utilization of GPU and CPU, further lowering the cost of inference.

    In business use cases where ML applications use several models to serve a prediction request, if each model uses a different framework or is hosted on a separate instance, it may lead to increased workload and cost as well as an increase in overall latency. SageMaker NVIDIA Triton Inference Server supports deployment of models from all major frameworks, including TensorFlow GraphDef, TensorFlow SavedModel, ONNX, PyTorch TorchScript, TensorRT, and Python/C++ model formats. A Triton model ensemble represents a pipeline of one or more models or preprocessing and postprocessing logic, and the connection of input and output tensors between them. A single inference request to an ensemble triggers the run of the entire pipeline. Triton also has multiple built-in scheduling and batching algorithms that combine individual inference requests to improve inference throughput. These scheduling and batching decisions are transparent to the client requesting inference. The models can be run on CPUs or GPUs for maximum flexibility and to support heterogeneous computing requirements.

    Hosting multiple GPU-backed models on multi-model endpoints is supported through the SageMaker Triton Inference Server. The NVIDIA Triton Inference Server has been extended to implement the MME API contract in order to integrate with MMEs. You can use the NVIDIA Triton Inference Server, which creates a model repository configuration for different framework backends, to deploy an MME with auto scaling. This feature allows you to scale hundreds of hyper-personalized models that are fine-tuned to cater to unique end-user experiences in AI applications. You can also use this feature to achieve the required price-performance for your inference application using fractional GPUs. To learn more, refer to Run multiple deep learning models on GPU with Amazon SageMaker multi-model endpoints.
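    As a minimal sketch under assumed names, the following creates a GPU-backed MME by pointing a Triton serving container at an S3 prefix of model artifacts and setting the container mode to MultiModel. The image URI, S3 prefix, role, instance type, and names are placeholders to replace with your own values.

    import boto3

    sm = boto3.client("sagemaker")

    # Placeholder values; the Triton image URI is account- and Region-specific.
    triton_image = "<account>.dkr.ecr.<region>.amazonaws.com/sagemaker-tritonserver:<tag>"
    model_data_prefix = "s3://my-bucket/triton-models/"   # prefix holding model.tar.gz artifacts
    role_arn = "arn:aws:iam::<account>:role/MySageMakerRole"

    sm.create_model(
        ModelName="triton-gpu-mme-model",
        ExecutionRoleArn=role_arn,
        PrimaryContainer={
            "Image": triton_image,
            "ModelDataUrl": model_data_prefix,
            "Mode": "MultiModel",   # host many models behind one endpoint
        },
    )

    sm.create_endpoint_config(
        EndpointConfigName="triton-gpu-mme-config",
        ProductionVariants=[{
            "VariantName": "AllTraffic",
            "ModelName": "triton-gpu-mme-model",
            "InstanceType": "ml.g4dn.xlarge",   # GPU-accelerated instance
            "InitialInstanceCount": 1,
        }],
    )

    sm.create_endpoint(
        EndpointName="my-gpu-mme",
        EndpointConfigName="triton-gpu-mme-config",
    )

    Individual models are then selected per request with the TargetModel parameter, as in the earlier invocation sketch.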

    The following table provides guidance on evaluating the fitness functions for ML model hosting using MMEs with GPU support on Triton inference containers. For single-model endpoints and serverless endpoint fitness function evaluations, refer to the earlier sections in this post.

    Fitness function Description
    Cost

    SageMaker MMEs with GPU support using Triton Inference Server provide a scalable and cost-effective way to deploy a large number of deep learning models behind one SageMaker endpoint. With MMEs, multiple models share the GPU instance behind an endpoint. This enables you to break the linearly increasing cost of hosting multiple models and reuse infrastructure across all the models. You pay for the uptime of the instance.
    Inference latency

    SageMaker with Triton Inference Server is purpose-built to maximize throughput and hardware utilization with ultra-low (single-digit milliseconds) inference latency. It has a wide range of supported ML frameworks (including TensorFlow, PyTorch, ONNX, XGBoost, and NVIDIA TensorRT) and infrastructure backends, including NVIDIA GPUs, CPUs, and AWS Inferentia.

    With MME support for GPU using SageMaker Triton Inference Server, you can deploy thousands of deep learning models behind one SageMaker endpoint. SageMaker loads the model to the NVIDIA Triton container’s memory on a GPU accelerated instance and serves the inference request. The GPU core is shared by all the models in an instance. If the model is already loaded in the container memory, the subsequent requests are served faster because SageMaker doesn’t need to download and load it again.

    Throughput

    MMEs offer capabilities for running multiple deep learning or ML models on the GPU at the same time with Triton Inference Server. This allows you to easily use NVIDIA Triton’s multi-framework, high-performance inference serving with SageMaker’s fully managed model deployment.

    Triton supports all NVIDIA GPU-, x86-, Arm® CPU-, and AWS Inferentia-based inferencing. It offers dynamic batching, concurrent runs, optimal model configuration, model ensembles, and streaming audio and video inputs to maximize throughput and utilization. Other factors, such as network latency and payload size, may play a minimal role in the overhead associated with inference.

    Scaling configuration complexity

    MMEs can scale horizontally using an auto scaling policy, and provision additional GPU compute instances based on metrics such as InvocationsPerInstance and GPUUtilization to serve any traffic surge to MME endpoints (see the sketch after this table).

    With Triton Inference Server, you can easily build a custom container that includes your model with Triton and bring it to SageMaker. SageMaker Inference handles the requests and automatically scales the container as usage increases, making model deployment with Triton on AWS easier.

    Traffic pattern

    MMEs are ideal for predictable traffic patterns with models run as DAGs on the same endpoint.

    SageMaker takes care of traffic shaping to the MME endpoint and maintains optimal model copies on GPU instances for best price performance. It continues to route traffic to the instance where the model is loaded. If the instance resources reach capacity due to high utilization, SageMaker unloads the least-used models from the container to free up resources to load more frequently used models.
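    Complementing the invocations-based policy shown earlier, the following is a minimal sketch of a target-tracking policy on the GPUUtilization CloudWatch metric for a GPU MME. The endpoint name, capacity bounds, target value, and cooldowns are assumptions to tune for your workload.

    import boto3

    autoscaling = boto3.client("application-autoscaling")

    endpoint_name = "my-gpu-mme"   # hypothetical GPU MME endpoint name
    resource_id = f"endpoint/{endpoint_name}/variant/AllTraffic"

    autoscaling.register_scalable_target(
        ServiceNamespace="sagemaker",
        ResourceId=resource_id,
        ScalableDimension="sagemaker:variant:DesiredInstanceCount",
        MinCapacity=1,
        MaxCapacity=8,
    )

    autoscaling.put_scaling_policy(
        PolicyName="gpu-utilization-scaling",
        ServiceNamespace="sagemaker",
        ResourceId=resource_id,
        ScalableDimension="sagemaker:variant:DesiredInstanceCount",
        PolicyType="TargetTrackingScaling",
        TargetTrackingScalingPolicyConfiguration={
            "TargetValue": 60.0,   # assumed average GPU utilization target (percent)
            "CustomizedMetricSpecification": {
                "MetricName": "GPUUtilization",
                "Namespace": "/aws/sagemaker/Endpoints",
                "Dimensions": [
                    {"Name": "EndpointName", "Value": endpoint_name},
                    {"Name": "VariantName", "Value": "AllTraffic"},
                ],
                "Statistic": "Average",
            },
            "ScaleInCooldown": 600,
            "ScaleOutCooldown": 200,
        },
    )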

    Best practices

    Consider the following best practices:

    • High cohesion and low coupling between models – Host models that have high cohesion (that drive a single business functionality) in the same container and encapsulate them together for ease of upgrade and manageability. At the same time, decouple those models from other models (host them in different containers) so that you can easily upgrade one model without impacting other models. Host multiple models that use different containers behind one endpoint and invoke them independently, or add model preprocessing and postprocessing logic as a serial inference pipeline.
    • Inference latency – Group models that drive a single business functionality and host them in a single container to minimize the number of hops and therefore minimize overall latency. There are caveats: for example, if the grouped models use multiple frameworks, you might instead host them in multiple containers that run on the same host to reduce latency and minimize cost.
    • Logically group ML models with high cohesion – The logical group may consist of models that are homogeneous (for example, all XGBoost models) or heterogeneous (for example, a few XGBoost and a few BERT). It may consist of models that are shared across multiple business functionalities or may be specific to fulfilling only one business functionality.
      • Shared models – If the logical group consists of shared models, the ease of upgrading the models and latency will play a major role in architecting the SageMaker endpoints. For example, if latency is a priority, it’s better to place all the models in a single container behind a single SageMaker endpoint to avoid multiple hops. The downside is that if any of the models need to be upgraded, it will result in upgrading all the relevant SageMaker endpoints hosting this model.
      • Non-shared models – If the logical group consists only of models specific to one business feature and is not shared with other groups, packaging complexity and latency become the key dimensions to optimize. It’s advisable to host these models in a single container behind a single SageMaker endpoint.
    • Efficient use of hardware (CPU, GPU) – Group CPU-based models together and host them on the same host so that you can efficiently use the CPU. Similarly, group GPU-based models together so that you can efficiently use and scale them. There are hybrid workloads that require both CPU and GPU on the same host. Hosting CPU-only and GPU-only models on the same host should be driven by high cohesion and application latency requirements. Additionally, cost, the ability to scale, and the blast radius in case of failure are key dimensions to consider.
    • Fitness functions – Use fitness functions as a guideline for selecting an ML hosting option.

    Conclusion

    When it comes to ML hosting, there is no one-size-fits-all approach. ML practitioners need to choose the right design pattern to address their ML hosting challenges. Evaluating the fitness functions provides prescriptive guidance on selecting the right ML hosting option.

    For more details on each of the hosting options, refer to the following posts in this series:


    About the authors

    Dhawal Patel is a Principal Machine Learning Architect at AWS. He has worked with organizations ranging from large enterprises to mid-sized startups on problems related to distributed computing and artificial intelligence. He focuses on deep learning, including the NLP and computer vision domains. He helps customers achieve high-performance model inference on SageMaker.

    Deepali Rajale is an AI/ML Specialist Technical Account Manager at Amazon Web Services. She works with enterprise customers, providing technical guidance on implementing machine learning solutions with best practices. In her spare time, she enjoys hiking, movies, and hanging out with family and friends.

    Saurabh Trikande is a Senior Product Manager for Amazon SageMaker Inference. He is passionate about working with customers and is motivated by the goal of democratizing machine learning. He focuses on core challenges related to deploying complex ML applications, multi-tenant ML models, cost optimizations, and making deployment of deep learning models more accessible. In his spare time, Saurabh enjoys hiking, learning about innovative technologies, following TechCrunch and spending time with his family.
