September 2022 – Page 19

Deploy large models on Amazon SageMaker using DJLServing and DeepSpeed model parallel inference

The last few years have seen rapid development in the field of natural language processing (NLP). Although hardware has improved, such as with the latest generation of accelerators from NVIDIA and Amazon, advanced machine learning (ML) practitioners still regularly encounter issues deploying their large language models. Today, we announce new capabilities in Amazon SageMaker that can help: you can configure the maximum Amazon EBS volume size and timeout quotas to facilitate large model inference. Coupled with model parallel inference techniques, you can now use the fully managed model deployment and management capabilities of SageMaker when working with large models with billions of parameters.

In this post, we demonstrate these new SageMaker capabilities by deploying a large, pre-trained NLP model from Hugging Face across multiple GPUs. In particular, we use the Deep Java Library (DJL) serving and tensor parallelism techniques from DeepSpeed to achieve under 0.1 second latency in a text generation use case with 6 billion parameter GPT-J. Complete example on our GitHub repository coming soon.

Large language models and the increasing necessity of model parallel inference

Language models have recently exploded in both size and popularity. In 2018, BERT-large entered the scene and, with its 340 million parameters and novel transformer architecture, set the standard on NLP task accuracy. Within just a few years, state-of-the-art NLP model size has grown by more than 500 times, with models such as OpenAI’s 175 billion parameter GPT-3 and similarly sized open-source Bloom 176 B raising the bar on NLP accuracy. This increase in the number of parameters is driven by the simple and empirically-demonstrated positive relationship between model size and accuracy: more is better. With easy access from model zoos such as Hugging Face and improved accuracy in NLP tasks such as classification and text generation, practitioners are increasingly reaching for these large models. However, deploying them can be a challenge.

Large language models can be difficult to host for low-latency inference use cases because of their size. Typically, ML practitioners simply host a model (or even multiple models) within the memory of a single accelerator device that handles inference end to end on its own. However, large language models can be too big to fit within the memory of a single accelerator, so this paradigm can’t work. For example, open-source GPT-NeoX with 20 billion parameters can require more than 80 GB of accelerator memory, which is more than triple what is available on an NVIDIA A10G, a popular GPU for inference. Practitioners have a few options to work against this accelerator memory constraint. A simple but slow approach is to use CPU memory and stream model parameters sequentially to the accelerator. However, this introduces a communication bottleneck between the CPU and GPU, which can add seconds to inference latency and is therefore unsuitable for many use cases that require fast responses. Another approach is to optimize or compress the model so that it can fit on a single device. Practitioners must implement complex techniques such as quantization, pruning, distillation, and others to reduce the memory requirements. This approach requires a lot of time and expertise and can also reduce the accuracy and generalization of a model, which can also be a non-starter for many use cases.

A third option to use model parallelism. With model parallelism, the parameters and layers of a model are partitioned and then spread across multiple accelerators. This approach allows practitioners to take advantage of both the memory and processing power of multiple accelerators at once and can deliver low-latency inference without impacting the accuracy of the model. Model parallelism is already a popular technique in training (see Introduction to Model Parallelism) and is increasingly becoming used in inference as practitioners require low-latency responses from large models.

There are two general types of model parallelism: pipeline parallelism and tensor parallelism. Pipeline parallelism splits a model between layers, so that any given layer is contained within the memory of a single GPU. In contrast, tensor parallelism splits layers such that a model layer is spread out across multiple GPUs. Both of these model parallel techniques are used in training (often together), but tensor parallelism can be a better choice for inference because batch size is often one with inference. When batch size is one, only tensor parallelism can take advantage of multiple GPUs at once when processing the forward pass to improve latency.

In this post, we use DeepSpeed to partition the model using tensor parallelism techniques. DeepSpeed Inference supports large Transformer-based models with billions of parameters. It allows you to efficiently serve large models by adapting to the best parallelism strategies for multi-GPU inference, accounting for both inference latency and cost. For more information, refer to DeepSpeed: Accelerating large-scale model inference and training via system optimizations and compression and this DeepSpeed Inference: Enabling Efficient Inference of Transformer Models at Unprecedented Scale.

Solution overview

The Deep Java Library (DJL) is an open-source, high-level, engine-agnostic Java framework for deep learning. The DJL is built with native Java concepts on top of existing deep learning frameworks. The DJL is designed to be deep learning engine agonistic. You can switch engines at any point. The DJL also provides automatic CPU/GPU choice based on hardware configuration.

Although the DJL is designed originally for Java developers to get start with ML, DJLServing is a high-performance universal model serving solution powered by the DJL that is programming language agnostic. It can serve the commonly seen model types, such the PyTorch TorchScript model, TensorFlow SavedModel bundle, Apache MXNet model, ONNX model, TensorRT model, and Python script model. DJLServing supports dynamic batching and worker auto scaling to increase throughput. You can load different versions of a model on a single endpoint. You can also serve models from different ML frameworks at the same time. What’s more, DJLServing natively supports multi-GPU by setting up MPI configurations and socket connections for inference. This frees the heavy lifting of setting up a multi-GPU environment.

Our proposed solution uses the newly announced SageMaker capabilities, DJLServing and DeepSpeed Inference, for large model inference. As of this writing, all Transformer-based models are supported. This solution is intended for parallel model inference using a single model on a single instance.

DJLServing is built with multiple layers. The routing layer is built on top of Netty. The remote requests are handled in the routing layer to distribute to workers, either threads in Java or processes in Python, to run inference. The total number of Java threads are set to 2 * cpu_core from the machine to make full usage of computing power. The worker numbers can be configured per model or the DJL’s auto-detection on hardware. The following diagram illustrates our architecture.

Inference large models on SageMaker

The following steps demonstrate how to deploy a gpt-j-6B model in SageMaker using DJL serving. This is made possible by the capability to configure the EBS volume size, model download timeout time, and startup health-check timeout time. You can try out this demo by running the following notebook.

Pull the Docker image and push to Amazon ECR

The Docker image djl-serving:0.18.0-deepspeed is our DJL serving container with DeepSpeed incorporated. We then push this image to Amazon Elastic Container Registry (Amazon ECR) for later use. See the following code:

docker pull deepjavalibrary/djl-serving:0.18.0-deepspeed

Create our model file

First, we create a file called serving.properties that contains only one line of code. This tells the DJL model server to use the Rubikon engine. Rubikon is an AWS developed large model supporting package. In this demo, it facilitates the MPI threads setup and socket connection. It also sets the number of GPUs (model slicing number) by reading in the TENSOR_PARALLEL_DEGREE parameter defined in our model.py file in the next paragraph. The file contains the following code:

engine=Rubikon

Next, we create our model.py file, which defines our model as gpt-j-6B. In our code, we read in the TENSOR_PARALLEL_DEGREE environment variable (default value is 1). This sets the number of devices over which the tensor parallel modules are distributed. Please note, DeepSpeed provides a few built-in partition logics, and gpt-j-6B is one of them. We use it by specifying replace_method and relpace_with_kernel_inject. If you have your customized model and need DeepSpeed to partition effectively, you need to change relpace_with_kernel_inject to false and add injection_policy to make the runtime partition work. For more information, refer to Initializing for Inference.

from djl_python import Input, Output
import os
import deepspeed
import torch
from transformers import pipeline, AutoModelForCausalLM, AutoTokenizer

predictor = None

def get_model():
    model_name = 'EleutherAI/gpt-j-6B'
    tensor_parallel = int(os.getenv('TENSOR_PARALLEL_DEGREE', '1'))
    local_rank = int(os.getenv('LOCAL_RANK', '0'))
    model = AutoModelForCausalLM.from_pretrained(model_name, revision="float32", torch_dtype=torch.float32)
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    
    model = deepspeed.init_inference(model,
                                     mp_size=tensor_parallel,
                                     dtype=model.dtype,
                                     replace_method='auto',
                                     replace_with_kernel_inject=True)
    generator = pipeline(task='text-generation', model=model, tokenizer=tokenizer, device=local_rank)
    return generator


def handle(inputs: Input) -> None:
    global predictor
    if not predictor:
        predictor = get_model()

    if inputs.is_empty():
        # Model server makes an empty call to warmup the model on startup
        return None

    data = inputs.get_as_string()
    result = predictor(data, do_sample=True, min_tokens=200, max_new_tokens=256)
    return Output().add(result)

We create a directory called gpt-j and copy model.py and serving.properties to this directory:

mkdir gpt-j
cp model.py gpt-j
cp serving.properties gpt-j

Lastly, we create the model file and upload it to Amazon Simple Storage Service (Amazon S3):

tar cvfz gpt-j.tar.gz gpt-j
aws s3 cp gpt-j.tar.gz s3://djl-sm-test/deepspeed/

Create a SageMaker model

We now create a SageMaker model. We use the ECR image we created earlier and the model artifact from the previous step to create the SageMaker model. In the model setup, we configure TENSOR_PARALLEL_DEGREE=2, which means the model will be partitioned along 2 GPUs. See the following code:

aws sagemaker create-model 
--model-name gpt-j 
--primary-container 
Image=<account_id>.dkr.ecr.us-east-1.amazonaws.com/djl-deepspeed:latest,ModelDataUrl=s3://djl-sm-test/deepspeed/gpt-j.tar.gz,Environment={TENSOR_PARALLEL_DEGREE=2} 
--execution-role-arn <IAM_role_arn>

After running the preceding command, you see output similar to the following:

{
    "ModelArn": "arn:aws:sagemaker:us-east-1:<account_id>:model/gpt-j"
}

Create a SageMaker endpoint

You can use any instances with multiple GPUs for testing. In this demo, we use a p3.16xlarge instance. In the following code, note how we set the ModelDataDownloadTimeoutInSeconds, ContainerStartupHealthCheckTimeoutInSeconds, and VolumeSizeInGB parameters to accommodate the large model size. The VolumeSizeInGB parameter is applicable to GPU instances supporting the EBS volume attachment.

aws sagemaker create-endpoint-config 
    --region us-east-1 
    --endpoint-config-name gpt-j-config 
    --production-variants '[
      {
        "ModelName": "gpt-j",
        "VariantName": "AllTraffic",
        "InstanceType": "ml.p3.16xlarge",
        "InitialInstanceCount": 1,
        "VolumeSizeInGB": 256,
        "ModelDataDownloadTimeoutInSeconds": 1800,
        "ContainerStartupHealthCheckTimeoutInSeconds": 3600
        }
    ]'

Lastly, we create a SageMaker endpoint:

aws sagemaker create-endpoint 
--endpoint-name gpt-j 
--endpoint-config-name gpt-j-config

You see it printed out in the following code:

{
    "EndpointArn": "arn:aws:sagemaker:us-east-1:<aws-account-id>:endpoint/gpt-j"
}

Starting the endpoint might take a while. You can try a few more times if you run into the InsufficientInstanceCapacity error.

Performance tuning

Performance tuning and optimization is an empirical process often involving multiple iterations. The number of parameters to tune is combinatorial and the set of configuration parameter values aren’t independent of each other. Various factors affect optimal parameter tuning, including payload size, type, and the number of ML models in the inference request flow graph, storage type, compute instance type, network infrastructure, application code, inference serving software runtime and configuration, and more.

SageMaker real-time inference is ideal for inference workloads where you have real-time, interactive, low-latency requirements. There are four most commonly used metrics for monitoring inference request latency for SageMaker inference endpoints:

Container latency – The time it takes to send the request, fetch the response from the model’s container, and complete inference in the container. This metric is available in Amazon CloudWatch as part of the invocation metrics published by SageMaker.
Model latency – The total time taken by all SageMaker containers in an inference pipeline. This metric is available in CloudWatch as part of the invocation metrics published by SageMaker.
Overhead latency – Measured from the time that SageMaker receives the request until it returns a response to the client, minus the model latency. This metric is available in CloudWatch as part of the invocation metrics published by SageMaker.
End-to-end latency – Measured from the time the client sends the inference request until it receives a response back. You can publish this as a custom metric in CloudWatch.

Container latency depends on several factors; the following are among the most important:

Underlying protocol (HTTP(s)/gRPC) used to communicate with the inference server
Overhead related to creating new TLS connections
Deserialization time of the request/response payload
Request queuing and batching features provided by the underlying inference server
Request scheduling capabilities provided by the underlying inference server
Underlying runtime performance of the inference server
Performance of preprocessing and postprocessing libraries before calling the model prediction function
Underlying ML framework backend performance
Model-specific and hardware-specific optimizations

In this section, we focus primarily on container latency and specifically on optimizing DJLServing running inside a SageMaker container.

Tune the ML engine for multi-threaded inference

One of the advantages of the DJL is multi-threaded inference support. It can help increase the throughput of your inference on multi-core CPUs and GPUs and reduce memory consumption compare to Python. Refer to Inference Performance Optimization for more information about optimizing the number of threads for different engines.

Tune Netty

DJLServing is built with multiple layers. The routing layer is built on top of Netty. Netty is a NIO client server framework that enables quick and easy development of network applications such as protocol servers and clients. In Netty, Channel is the main container; it contains a ChannelPipeline and is associated with an EventLoop (a container for a thread) from an EventLoopGroup. EventLoop is essentially an I/O thread and may be shared by multiple channels. ChannelHandlers are run on these EventLoop threads. This simple threading model means that you don’t need to worry about concurrency issues in the run of your ChannelHandlers. You are always guaranteed sequential runs on the same thread for a single run through your pipeline. DJLServing uses Netty’s EpollEventLoopGroup on Linux. The total number of Netty threads by default is set to 2 * the number of virtual CPUs from the machine to make full usage of computing power. Furthermore, because you don’t create large numbers of threads, your CPU isn’t overburdened by context switching. This default setting works fine in most cases; however, if you want to set the number of Netty threads for processing the incoming requests, you can do so by setting the SERVING_NUMBER_OF_NETTY_THREADS environment variable.

Tune workload management (WLM) of DJLServing

DJLServing has WorkLoadManager, which is responsible for managing the workload of the worker thread. It manages the thread pools and job queues, and scales up or down the required amount of worker threads per ML model. It has auto scaling, which adds an inference job to the job queue of the next free worker and scales up the worker thread pool for that specific model if necessary. The scaling is primarily based on the job queue depth of the model, the batch size, and the current number of worker threads in the pool. The job_queue_size controls the number of inference jobs that can be queued up at any point in time. By default, it is set to 100. If you have higher concurrency needs per model serving instance, you can increase the job_queue_size, thread pool size, and minimum or maximum thread workers for a particular model by setting the properties in serving.properties, as shown in the following example code:

serving.properties
# use minWorkers/maxWorkers for all devices
gpu.minWorkers=2
gpu.maxWorkers=3
cpu.minWorkers=2
cpu.maxWorkers=4

As of this writing, you can’t configure job_queue_size in serving.properties. The default value job_queue_size is controlled by an environment variable, and you can only configure the per-model setting with the registerModel API.

Many practitioners tend to run inference sequentially when the server is invoked with multiple independent requests. Although easier to set up, it’s usually not the best practice to utilize GPU’s compute power. To address this, DJLServing offers the built-in optimizations of dynamic batching to combine these independent inference requests on the server side to form a larger batch dynamically to increase throughput.

All the requests reach the dynamic batcher first before entering the actual job queues to wait for inference. You can set your preferred batch sizes for dynamic batching using the batch_size settings in serving.properties. You can also configure max_batch_delay to specify the maximum delay time in the batcher to wait for other requests to join the batch based on your latency requirements.

You can fine-tune the following parameters to increase the throughput per model:

batch_size – The inference batch size. The default value is 1.
max_batch_delay – The maximum delay for batch aggregation. The default value is 100 milliseconds.
max_idle_time – The maximum idle time before the worker thread is scaled down.
min_worker – The minimum number of worker processes. For the DJL’s DeepSpeed engine, min_worker is set to number of GPUs/TENSOR_PARALLEL_DEGREE.
max_worker – The maximum number of worker processes. For the DJL’s DeepSpeed engine, max_worker is set to mumber of GPUs/TENSOR_PARALLEL_DEGREE.

Tune degree of tensor parallelism

For large model support that doesn’t fit in the single accelerator device memory, the number of Python processes are determined by the total number of accelerator devices on the host. The tensor_parallel_degree is created for slicing the model and distribute to multiple accelerator devices. In this case, even if a model is too large to host on a single accelerator, it can still be handled by DJLServing and can run on multiple accelerator devices by partitioning the model. Internally, DJLServing creates multiple MPI processes (equal to tensor_parallel_degree) to manage the slice of each model on each accelerator device.

You can set the number of partitions for your model by setting the TENSOR_PARALLEL_DEGREE environment variable. Please note this configuration is a global setting and applies to all the models on the host. If the TENSOR_PARALLEL_DEGREE is less than the total number of accelerator devices (GPUs), DJLServing launches multiple Python process groups equivalent to the total number of GPUs/TENSOR_PARALLEL_DEGREE. Each Python process group consists of Python processes equivalent to TENSOR_PARALLEL_DEGREE. Each Python process group holds the full copy of the model.

Summary

In this post, we showcased the newly launched SageMaker capability to allow you to configure inference instance EBS volumes, model downloading timeout, and container startup timeout. We demonstrated this new capability in an example of deploying a large model in SageMaker. We also covered options available to tune the performance of the DJL. For more details about SageMaker and the new capability launched, refer to [!Link] and [!Link].

About the authors

Frank Liu is a Software Engineer for AWS Deep Learning. He focuses on building innovative deep learning tools for software engineers and scientists. In his spare time, he enjoys hiking with friends and family.

Qing Lan is a Software Development Engineer in AWS. He has been working on several challenging products in Amazon, including high performance ML inference solutions and high performance logging system. Qing’s team successfully launched the first Billion-parameter model in Amazon Advertising with very low latency required. Qing has in-depth knowledge on the infrastructure optimization and Deep Learning acceleration.

Qingwei Li is a Machine Learning Specialist at Amazon Web Services. He received his Ph.D. in Operations Research after he broke his advisor’s research grant account and failed to deliver the Nobel Prize he promised. Currently he helps customers in the financial service and insurance industry build machine learning solutions on AWS. In his spare time, he likes reading and teaching.

Dhawal Patel is a Principal Machine Learning Architect at AWS. He has worked with organizations ranging from large enterprises to mid-sized startups on problems related to distributed computing, and Artificial Intelligence. He focuses on Deep learning including NLP and Computer Vision domains. He helps customers achieve high performance model inference on SageMaker.

Robert Van Dusen is a Senior Product Manager at AWS.

Alan Tan is a Senior Product Manager with SageMaker leading efforts on large model inference. He’s passionate about applying Machine Learning to the area of Analytics. Outside of work, he enjoys the outdoors.

RGB-X Classification for Electronics Sorting

Effectively disassembling and recovering materials from waste electrical and electronic equipment (WEEE) is a critical step in moving global supply chains from carbon-intensive, mined materials to recycled and renewable ones. Conventional recycling processes rely on shredding and sorting waste streams, but for WEEE, which is comprised of numerous dissimilar materials, we explore targeted disassembly of numerous objects for improved material recovery. Many WEEE objects share many key features and therefore can look quite similar, but their material composition and internal component layout can…Apple Machine Learning Research

Tips to improve your Amazon Rekognition Custom Labels model

In this post, we discuss best practices to improve the performance of your computer vision models using Amazon Rekognition Custom Labels. Rekognition Custom Labels is a fully managed service to build custom computer vision models for image classification and object detection use cases. Rekognition Custom Labels builds off of the pre-trained models in Amazon Rekognition, which are already trained on tens of millions of images across many categories. Instead of thousands of images, you can get started with a small set of training images (a few hundred or less) that are specific to your use case. Rekognition Custom Labels abstracts away the complexity involved in building a custom model. It automatically inspects the training data, selects the right ML algorithms, selects the instance type, trains multiple candidate models with various hyperparameters settings, and outputs the best trained model. Rekognition Custom Labels also provides an easy-to-use interface from the AWS Management Console for managing the entire ML workflow, including labeling images, training the model, deploying the model, and visualizing the test results.

There are times when a model’s accuracy isn’t the best, and you don’t have many options to adjust the configuration parameters of the model. Behind the scenes there are multiple factors that play a key role to build a high-performing model, such as the following:

Picture angle
Image resolution
Image aspect ratio
Light exposure
Clarity and vividness of background
Color contrast
Sample data size

The following are the general steps to be followed to train a production-grade Rekognition Custom Labels model:

Review Taxonomy – This defines the list of attributes/items that you want to identify in an image.
Collect relevant data – This is the most important step, where you need to collect relevant images that should resemble what you would see in a production environment. This could involve images of objects with varying backgrounds, lighting, or camera angles. You then create a training and testing datasets by splitting the collected images. You should only include real-world images as part of the testing dataset, and shouldn’t include any synthetically generated images. Annotations of the data you collected are crucial for the model performance. Make sure the bounding boxes are tight around the objects and the labels are accurate. We discuss some tips that you can consider when building an appropriate dataset later in this post.
Review training metrics – Use the preceding datasets to train a model and review the training metrics for F1 score, precision, and recall. We will discuss in details about how to analyze the training metrics later in this post.
Evaluate the trained model – Use a set of unseen images (not used for training the model) with known labels to evaluate the predictions. This step should always be performed to make sure that the model performs as expected in a production environment.
Re-training (optional) – In general, training any machine learning model is an iterative process to achieve the desired results, a computer vision model is no different. Review the results in Step 4, to see if more images need to be added to the training data and repeat the above Steps 3 – 5.

In this post, we focus on the best practices around collecting relevant data (Step 2) and evaluating your trained metrics (Step 3) to improve your model performance.

Collect relevant data

This is the most critical stage of training a production-grade Rekognition Custom Labels model. Specifically, there are two datasets: training and testing. Training data is used for training the model, and you need to spend the effort building an appropriate training set. Rekognition Custom Labels models are optimized for F1 score on the testing dataset to select the most accurate model for your project. Therefore, it’s essential to curate a testing dataset that resembles the real world.

Number of images

We recommend having a minimum of 15-20 images per label. Having more images with more variations that reflects your use case will improve the model performance.

Balanced dataset

Ideally, each label in the dataset should have a similar number of samples. There shouldn’t be a massive disparity in the number of images per label. For example, a dataset where the highest number of images for a label is 1,000 vs. 50 images for another label resembles an imbalanced dataset. We recommend avoiding scenarios with lopsided ratio of 1:50 between the label with the least number of images vs. the label with the highest number of images.

Varying types of images

Include images in the training and test dataset that resembles what you will be using in the real world. For example, if you want to classify images of living rooms vs. bedrooms, you should include empty and furnished images of both rooms.

The following is an example image of a furnished living room.

In contrast, the following is an example of an unfurnished living room.

The following is an example image of a furnished bedroom.

The following is an example image of an unfurnished bedroom.

Varying backgrounds

Include images with different backgrounds. Images with natural context can provide better results than plain background.

The following is an example image of the front yard of a house.

The following is an example image of the front yard of a different house with a different background.

Varying lighting conditions

Include images with varying lighting so that it covers the different lighting conditions that occur during inference (for example, with and without flash). You can also include images with varying saturation, hue, and brightness.

The following is an example image of a flower under normal light.

In contrast, the following image is of the same flower under bright light.

Varying angles

Include images taken from various angles of the object. This helps the model learn different characteristics of the objects.

The following images are of the same bedroom from different angles.

There could be occasions where it’s not possible to acquire images of varying types. In those scenarios, synthetic images can be generated as part of the training dataset. For more information about common image augmentation techniques, refer to Data Augmentation.

Add negative labels

For image classification, adding negative labels can help increase model accuracy. For example, you can add a negative label, which doesn’t match any of the required labels. The following image represents the different labels used to identify fully grown flowers.

Adding the negative label not_fully_grown helps the model learn characteristics that aren’t part of the fully_grown label.

Handling label confusion

Analyze the results on the test dataset to recognize any patterns that are missed in the training or testing dataset. Sometimes it’s easy to spot such patterns by visually examining the images. In the following image, the model is struggling to resolve between a backyard vs. patio label.

In this scenario, adding more images to these labels in the dataset and also redefining the labels so that each label is distinct can help increase the accuracy of the model.

Data augmentation

Inside Rekognition Custom Labels, we perform various data augmentations for model training, including random cropping of the image, color jittering, random Gaussian noises, and more. Based on your specific use cases, it might also be beneficial to add more explicit data augmentations to your training data. For example, if you’re interested in detecting animals in both color and black and white images, you could potentially get better accuracy by adding black and white and color versions of the same images to the training data.

We don’t recommend augmentations on testing data unless the augmentations reflect your production use cases.

Review training metrics

F1 score, precision, recall, and assumed threshold are the metrics that are generated as an output of training a model using Rekognition Custom Labels. The models are optimized for the best F1 score based on the testing dataset that is provided. The assumed threshold is also generated based on the testing dataset. You can adjust the threshold based on your business requirement in terms of precision or recall.

Because the assumed thresholds are set on the testing dataset, an appropriate test set should reflect the real-world production use case. If the test dataset isn’t representative of the use case, you may see artificially high F1 scores and poor model performance on your real-world images.

These metrics are helpful when performing an initial evaluation of the model. For a production-grade system, we recommend evaluating the model against an external dataset (500–1,000 unseen images) representative of the real world. This helps evaluate how the model would perform in a production system and also identify any missing patterns and correct them by retraining the model. If you see a mismatch between F1 scores and external evaluation, we suggest you examine whether your test data is reflecting the real-world use case.

Conclusion

In this post, we walked you through the best practices for improving Rekognition Custom Labels models. We encourage you to learn more about Rekognition Custom Labels and try it out for your business-specific datasets.

About the authors

Amit Gupta is a Senior AI Services Solutions Architect at AWS. He is passionate about enabling customers with well-architected machine learning solutions at scale.

Yogesh Chaturvedi is a Solutions Architect at AWS with a focus in computer vision. He works with customers to address their business challenges using cloud technologies. Outside of work, he enjoys hiking, traveling, and watching sports.

Hao Yang is a Senior Applied Scientist at the Amazon Rekognition Custom Labels team. His main research interests are object detection and learning with limited annotations. Outside works, Hao enjoys watching films, photography, and outdoor activities.

Pashmeen Mistry is the Senior Product Manager for Amazon Rekognition Custom Labels. Outside of work, Pashmeen enjoys adventurous hikes, photography, and spending time with his family.

Use ADFS OIDC as the IdP for an Amazon SageMaker Ground Truth private workforce

To train a machine learning (ML) model, you need a large, high-quality, labeled dataset. Amazon SageMaker Ground Truth helps you build high-quality training datasets for your ML models. With Ground Truth, you can use workers from either Amazon Mechanical Turk, a vendor company of your choosing, or an internal, private workforce to enable you to create a labeled dataset. You can use the labeled dataset output from Ground Truth to train your own models. You can also use the output as a training dataset for an Amazon SageMaker model.

With Ground Truth, you can create a private workforce of employees or contractors to handle your data within your organization. This enables customers who want to keep their data within their organization to use a private workforce to support annotation workloads containing sensitive business data or personal identifiable information (PII) that can’t be handled by external parties. Alternately, if data annotation requires domain-specific subject matter expertise, you can use a private workforce to route tasks to employees, contractors, or third-party annotators with that specific domain knowledge. This workforce can be employees in your company or third-party workers who have domain and industry knowledge of your datasets. For example, if the task is to label medical images, you could create a private workforce of people knowledgeable about the images in question.

You can configure a private workforce to authenticate using OpenID Connect (OIDC) with your Identity Provider (IdP). In this post, we demonstrate how to configure OIDC with on-premises Active Directory using Active Directory Federation Service (ADFS). Once the configuration is set up, you can configure and manage work teams, track worker performance, and set up notifications when labeling tasks are available in Ground Truth.

Solution overview

When you use existing on-premises Active Directory credentials to authenticate your private workforce, you don’t need to worry about managing multiple identities in different environments. Workers use existing Active Directory credentials to federate to your labeling portal.

Prerequisites

Make sure you have the following prerequisites:

A registered public domain
An existing or newly deployed ADFS environment
An AWS Identity and Access Management (IAM) user with permissions to run SageMaker API operations

Additionally, make sure you use Ground Truth in a supported Region.

Configure Active Directory

The Ground Truth private workforce OIDC configuration requires sending a custom claim sagemaker:groups to Ground Truth from your IdP.

Create an AD group named sagemaker (be sure to use all lower-case).
Add the users that will form your private workforce to this group.

Configure ADFS

The next step is to configure an ADFS application with specific claims that Ground Truth uses to obtain Issuer, ClientId, and ClientSecret, and other optional claims from your IdP to authenticate workers by obtaining an authentication code from the configured AuthorizationEndpoint in your IdP.

For more information about the claims your IdP sends to Ground Truth, refer to Send Required and Optional Claims to Ground Truth and Amazon A2I.

Create Application Group

To create your application group, complete the following steps:

Open the ADFS Management Console
Change the ADFS Federation Service Identifier from https://${HostName}/adfs/service/trust to https://${HostName}/adfs
Choose Application Group, right-click, and choose Add Application Group.
Enter a name (for example, SageMaker Ground Truth Workforce) and description.
Under Template, for Client-Server applications, choose Server application accessing a web API.
Choose Next.
Copy and save the client ID for future reference.
For Redirect URI, use a placeholder such as https://privateworkforce.local.
Choose Add, then choose Next.
Select Generate a shared secret and save the generated value for later use, then choose Next.
In Configure Web API section, enter the client ID obtained earlier.
Choose Add, then choose Next.
Select Permit everyone under Access Control Policy, then choose Next.
Under Permitted scopes, select openid, then choose Next.
Review the configuration information, then choose Next and Close.

Configure claim descriptions

To configure your claim descriptions, complete the following steps:

In the ADFS Management Console, expand Service Section.
Right-click Claim Description and choose Add Claim Description.
For Display name, enter SageMaker Client ID.
For Short Name, enter sagemaker:client_id.
For Claim identifier, enter sagemaker:client_id.
Select the options to publish the claim to federation metadata for both accept and send.
Choose OK.
Repeat these steps for the remaining claim groups (Sagemaker Name, Sagemaker Sub, and Sagemaker Groups), as shown in the following screenshot.

Note that your claim identifier is listed as Claim Type.

Configure the application group claim rules

To configure your application group claim rules, complete the following steps:

Choose Application Groups, then choose the application group you just created.
Under Web API, choose the name shown, which opens the Web API properties.
Choose the Issuance Transform Rules tab and choose Add Rule.
Choose Transform an Incoming Claim and provide the following information:
- For Claim rule name, enter sagemaker:client_id.
- For Incoming claim type, choose OAuth Client Id.
- For Outgoing claim type, choose the claim SageMaker Client ID.
- Leave other values as default.
- Choose Finish.
Choose Add New Rule.
Choose Transform an Incoming Claim and provide the following information:
- For Claim rule name, enter sagemaker:sub.
- For Incoming claim type, choose Primary SID.
- For Outgoing claim type, choose the claim Sagemaker Sub.
- Leave other values as default.
- Choose Finish.
Choose Add New Rule.
Choose Transform an Incoming Claim and provide the following information:
- For Claim rule name, choose sagemaker:name.
- For Incoming claim type, choose Name.
- For Outgoing claim type, choose the claim Sagemaker Name.
- Leave other values as default.
- Choose Finish.
Choose Add New Rule.
Choose Send Group Membership as a Claim and provide the following information:
- For Claim rule name, enter sagemaker:groups.
- For User’s group, choose the sagemaker AD group created earlier.
- For Outgoing claim type, choose the claim Sagemaker Groups.
- For Outgoing claim value, enter sagemaker.
- Choose Finish.
Choose Apply and OK.

You should have four rules, as shown in the following screenshot.

Create and configure an OIDC IdP workforce using the SageMaker API

In this step, you create a workforce from the AWS Command Line Interface (AWS CLI) using an IAM user or role with appropriate permissions.

Run the following AWS CLI command to create a private workforce. The oidc-config parameter contains information you must obtain from the IdP. Provide the appropriate values that you obtained from your IdP:
1. client_id is the client ID, and client_secret is the client secret you obtained when creating your application group.
2. You can reconstruct AuthorizationEndpoint, TokenEndpoint, UserInfoEndpoint, LogoutEndpoint, and JwksUri by replacing only the sts.example.com portion with your ADFS endpoint.
```
aws sagemaker create-workforce --oidc-config "ClientId="9b123069-0afc-56f2-a7ce-bd8e4dc705gh",ClientSecret="vtMG9fz_D9W2Y6u4t390wQ4o-hr8VsdHxD294FsD",Issuer="https://sts.example.com/adfs",AuthorizationEndpoint="https://sts.example.com/adfs/oauth2/authorize/",TokenEndpoint="https://sts.example.com/adfs/oauth2/token/",UserInfoEndpoint="https://sts.example.com/adfs/userinfo",LogoutEndpoint="https://sts.example.com/adfs/oauth2/logout",JwksUri="https://sts.example.com/adfs/discovery/keys“ --workforce-name privatewf --region us-east-1
```
  The preceding command should successfully return the WorkforceArn. Save this output for reference later.

Use the following code to describe the created workforce to get the SubDomain.
We use this to configure the redirect URI in ADFS. After Ground Truth authenticates a worker, this URI redirects the worker to the worker portal where the workers can access labeling or human review tasks.

aws sagemaker describe-workforce --workforce-name "privatewf" --region us-east-1

{
		"Workforce": {
			"WorkforceName": "privatewf",
			"WorkforceArn": "arn:aws:sagemaker:us-east-1:206400014001:workforce/privatewf",
			"LastUpdatedDate": "2022-03-20T11:45:57.916000-07:00",
			"SourceIpConfig": {
				"Cidrs": []
			},
			"SubDomain": "drxxxxxlf0.labeling.us-east-1.sagemaker.aws",
			"OidcConfig": {
				"ClientId": "9b123069-0afc-56f2-a7ce-bd8e4dc705gh",
				"Issuer": "https://sts.example.com/adfs",
				"AuthorizationEndpoint": "https://sts.example.com/adfs/oauth2/authorize/",
				"TokenEndpoint": "https://sts.example.com/adfs/oauth2/token/",
				"UserInfoEndpoint": "https://sts.example.com/adfs/userinfo",
				"LogoutEndpoint": "https://sts.example.com/adfs/oauth2/logout",
				"JwksUri": "https://sts.example.com/adfs/discovery/keys“"
			},
			"CreateDate": "2022-03-20T11:45:57.916000-07:00"
		}
	}

Copy the SubDomain and append /oauth2/idpresponse to the end. For example, it should look like https://drxxxxxlf0.labeling.us-east-1.sagemaker.aws/oauth2/idpresponse.You use this URL to update the redirect URI in ADFS.
Choose the application you created earlier (SageMaker Ground Truth Private Workforce).
Choose the name under Server application.
Select the placeholder URL used earlier and choose Remove.
Enter the appended SubDomain value.
Choose Add.
Choose OK twice.

Validate the OIDC IdP workforce authentication response

Now that you have configured OIDC with your IdP, it’s time to validate the authentication workflow using curl.

Replace the placeholder values with your information, then enter the modified URI in your browser:

https://sts.example.com/adfs/oauth2/authorize?client_id=9b123069-0afc-56f2-a7ce-bd8e4dc705gh&redirect_uri=https://drxxxxxlf0.labeling.us-east-1.sagemaker.aws/oauth2/idpresponse&scope=openid&response_type=code

You should be prompted to log in with AD credentials. You may receive a 401 Authorization Required error.

Copy the code parameter from the browser query and use it to perform a curl with the following command. The portion you need to copy starts with code=. Replace this code with code you copied. Also, don’t forget to change the values of url, client_id, client_secret, and redirect_uri:

url is the token endpoint from ADFS.
client_id is the client ID from the application group in ADFS.

client_secret is the client secret from ADFS.

curl -k --request POST 
	  --url 'https://sts.example.com/adfs/oauth2/token/' 
	  --header 'content-type: application/x-www-form-urlencoded' 
	  --data grant_type=authorization_code 
	  --data 'client_id=9b123069-0afc-56f2-a7ce-bd8e4dc705gh' 
	  --data client_secret=vtMG9fz_D9W2Y6u4t390wQ4o-hr8VsdHxD294FsD 
	  --data code=ZE-yvYF7GUmaFmAGAUdlcg.3Oy-_lPP2QgBAJxAW8uvXYgXojg.GXiaFggY5IdmrumD00cPkdjpABXTAG25YdXJxBr64HPwyl1WJDlcr1pqvURR1ZkBsBA1DxrloTQM4IGH1LcNVIzGcoynNm151leWXnIIP11JjOdl4Jt7tGyxyymll0c0IqfQcOk0w-oU9q2k-nx3jmAK4Pmw3D0Ghhm4jL6_15gBwvY4-mY6DVDg2sGQMELj-dNzfvMuMiLJQhX5XyUJcHjW69KX9xxnHfa3MCZbp2oF_41HBtMazPqKKC04TQPvTiAeMzUZ0-Z3IQhA9_mfv28JPdpGlPOxr8QM9vu9ANCbURimjPkmHA2Gm3df9QUbsIxEtQ-OuAPWlcg5MNbqGQ 
	  --data 'redirect_uri=https://drxxxxxlf0.labeling.us-east-1.sagemaker.aws/oauth2/idpresponse'

After making the appropriate modifications, copy the entire command and run it from a terminal.
The output of the command generates an access token in JWT format.

Copy this output to the encoded box and decode it with JWT.
The decoded message should contain the required claims you configured. If the claims are present, proceed to the next step; if not, ensure you have followed all the steps outlined so far.

From the output obtained in the preceding step, run the following command from a terminal after making necessary modifications. Replace the value for Bearer with the access_token obtained in the preceding command’s output and the userinfo with your own.

curl -X POST -H 'Authorization: Bearer eyJ0eXAiOiJKV1QiLCJhbGciOiJSUzI1NiIsIng1dCI6IkNLX2k1SEtOS1B2QVdGWnhCRkZ2T2NuVUhNQSIsImtpZCI6IkNLX2k1SEtOS1B2QVdGWnhCRkZ2T2NuVUhNQSJ9.eyJhdWQiOiJ1cm46bWljcm9zb2Z0OnVzZXJpbmZvIiwiaXNzIjoiaHR0cHM6Ly9mcy5hZC5nbWluZHByby5jb20vYWRmcyIsImlhdCI6MTY0OTE5NzYzMCwibmJmIjoxNjQ5MTk3NjMwLCJleHAiOjE2NDkyMDEyMzAsImFwcHR5cGUiOiJDb25maWRlbnRpYWwiLCJhcHBpZCI6IjBlZDQ0MDYzLTMzZDUtNGYxZi1hZTg4LTQ0OTgzZDRlN2E3MiIsImF1dGhtZXRob2QiOiJ1cm46b2FzaXM6bmFtZXM6dGM6U0FNTDoyLjA6YWM6Y2xhc3NlczpQYXNzd29yZFByb3RlY3RlZFRyYW5zcG9ydCIsImF1dGhfdGltZSI6IjIwMjItMDQtMDVUMjI6Mjc6MDkuOTcxWiIsInZlciI6IjEuMCIsInNjcCI6Im9wZW5pZCIsInN1YiI6ImV2MTdTQkRUWnFXd2NXR0R1Z2s1OHRXQm4wYkRKbDBvYnAzbU9sL1hVUlk9In0.hsED4iUlQPgiiLaCyrKTKg3aKQjsKsLKPusPncRz3rNCSTp5xh8APDo33hhBx5JK-Ie2FG9Pa78dHdY_U2UtGBl2IHKmIfPcBTdkLGc1a8PlSQLvManCcEwzxAaO5J_jGdbt_P3qvy3cA6YCgNUwV3Ex9VTySLK1r-gLvnWE4zEiz_QytdlXvwFDIZi94YTgGf8b5uOQieM9pgJ0D9d-HOUw7-sKMBbZLqeYh_heNekwV3p3FQAIQyqifzl5qaftMR_J6lpOINHPtSPbl80MwHpmoDPHa0emWg6wuSZa7gpDbqDGHmuwQfbVhBdNLY8v9Nm4MA5RbSWQmqZmwG0GkA' -d '' -k -v https://sts.example.com/adfs/userinfo

The output from this command may look similar to following code:

{
    "sub":"122",
    "exp":"10000",
    "sagemaker-groups":["privateworkforce"]
    "sagemaker-name":"name",
    "sagemaker-sub":"122",
    "sagemaker-client_id":"123456"
}

Now that you have successfully validated your OIDC configuration, it’s time to create the work teams.

Create a private work team

To create a private work team, complete the following steps:

On the Ground Truth console, choose Labeling workforces.
Select Private.
In the Private teams section, select Create private team.
In the Team details section, enter a team name.
In the Add workers section, enter the name of a single user group.
All workers associated with this group in your IdP are added to this work team.

To add more than one user group, choose Add new user group and enter the names of the user groups you want to add to this work team. Enter one user group per line.
Optionally, for Ground Truth labeling jobs, if you provide an email for workers in your JWT, Ground Truth notifies workers when a new labeling task is available if you select an Amazon Simple Notification Service (Amazon SNS) topic.
Choose Create private team.

Test access to the private labeling portal

To test your access, browse to https://console.aws.amazon.com/sagemaker/groundtruth#/labeling-workforces and open the labeling portal sign-in URL in a new browser window or incognito mode.

Cost

You will be charged for the number of jobs labeled by your internal employees. For more information, refer to Amazon SageMaker Data Labeling Pricing.

Clean up

You can delete the private workforce using the SageMaker API, DeleteWorkforce. If you have work teams associated with the private workforce, you must delete them before deleting the work force. For more information, see Delete a work team.

Summary

In this post, we demonstrated how to configure an OIDC application with Active Directory Federation Services and use your existing Active Directory credentials to authenticate to a Ground Truth labeling portal.

We’d love to hear from you. Let us know what you think in the comments section.

About the authors

Adeleke Coker is a Global Solutions Architect with AWS. He works with customers globally to provide guidance and technical assistance in deploying production workloads at scale on AWS. In his spare time, he enjoys learning, reading, gaming and watching sport events.

Aishwarya Kaushal is a Senior Software Engineer at Amazon. Her focus is on solving challenging problems using machine learning, building scalable AI solutions using distributed systems and helping customers to adopt the new features/products. In her spare time, Aishwarya enjoys watching sci-fi movies, listening to music and dancing.

How Amp on Amazon used data to increase customer engagement, Part 2: Building a personalized show recommendation platform using Amazon SageMaker

Amp is a new live radio app from Amazon. With Amp, you can host your own radio show and play songs from the Amazon Music catalog, or tune in and listen to shows other Amp users are hosting. In an environment where content is plentiful and diverse, it’s important to tailor the user experience to each user’s individual taste, so they can easily find shows they like and discover new content they would enjoy.

Amp uses machine learning (ML) to provide personalized recommendations for live and upcoming Amp shows on the app’s home page. The recommendations are computed using a Random Forest model using features representing the popularity of a show (such as listen and like count), popularity of a creator (such as total number of times the recent shows were played), and personal affinities of a user to a show’s topic and creator. Affinities are computed either implicitly from the user’s behavioral data or explicitly from topics of interest (such as pop music, baseball, or politics) as provided in their user profiles.

This is Part 2 of a series on using data analytics and ML for Amp and creating a personalized show recommendation list platform. The platform has shown a 3% boost to customer engagement metrics tracked (liking a show, following a creator, enabling upcoming show notifications) since its launch in May 2022.

Refer to Part 1 to learn how behavioral data was collected and processed using the data and analytics systems.

Solution overview

The ML-based show recommender for Amp has five main components, as illustrated in the following architecture diagram:

The Amp mobile app.
Back-end services that gather the behavioral data, such as likes and follows, as well as broadcast show-related information such as status updates when shows go live.
Real-time ingestion of behavioral and show data, and real-time (online) feature computing and storage.
Batch (offline) feature computing and storage.
A Recommender System that handles incoming requests from the app backend for getting a list of shows. This includes real-time inference to rank shows based on personalized and non-personalized features.

This post focuses on parts 3, 4, and 5 in an effort to detail the following:

Real-time data ingestion and transformations using Amazon Kinesis Data Firehose and AWS Lambda.
Batch data processing via Amazon SageMaker Processing
Amazon MemoryDB for Redis and Amazon SageMaker Feature Store for storing and serving real-time and batch computed features, respectively
Real-time ranking via SageMaker inference

The following diagram shows the high-level architecture and its components.

In the following sections, we provide more details regarding real-time feature computing, batch feature computing, real-time inference, operational health, and the outcomes we observed.

Real-time feature computing

Some features, such as like and listen count for a show, need to be streamed in continuously and used as is, whereas others, such as the number of listening sessions longer than 5 minutes, need to also be transformed in real time as raw data for sessions is streamed. These types of features where values need to be computed at inference time are referred as point-in-time (PIT) features. Data for PIT features need to be updated quickly, and the latest version should be written and read with low latency (under 20 milliseconds per user for 1,000 shows). The data also needs to be in a durable storage because missing or partial data may cause deteriorated recommendations and poor customer experience. In addition to the read/write latency, PIT features also require low reflection time. Reflection time is the time it takes for a feature to be available to read after the contributing events were emitted, for example, the time between a listener liking a show and the PIT LikeCount feature being updated.

Sources of the data are the backend services directly serving the app. Some of the data are transformed into metrics that are then broadcasted via Amazon Simple Notification Service (Amazon SNS) to downstream listeners such as the ML feature transformation pipeline. An in-memory database such as MemoryDB is an ideal service for durable storage and ultra-fast performance at high volumes. The compute component that transforms and writes features to MemoryDB is Lambda. App traffic follows daily and weekly patterns of peaks and dips depending on the time and day. Lambda allows for automatic scaling to incoming volume of events. The independent nature of each individual metric transformation also makes Lambda, which is a stateless service on its own, a good fit for this problem. Putting Amazon Simple Queue Service (Amazon SQS) between Amazon SNS and Lambda not only prevents message loss but also acts as a buffer for unexpected bursts of traffic that preconfigured Lambda concurrency limits may not be sufficient to serve.

Batch feature computing

Features that use historical behavioral data to represent a user’s ever-evolving taste are more complex to compute and can’t be computed in real time. These features are computed by a batch process that runs every so often, for example once daily. Data for batch features should support fast querying for filtering and aggregation of data, and may span long periods of time, so will be larger in volume. Because batch features are also retrieved and sent as inputs for real-time inference, they should still be read with low latency.

Collecting raw data for batch feature computing doesn’t have the sub-minute reflection time requirement PIT features have, which makes it feasible to buffer the events longer and transform metrics in batch. This solution utilized Kinesis Data Firehose, a managed service to quickly ingest streaming data into several destinations, including Amazon Simple Storage Service (Amazon S3) for persisting metrics to the S3 data lake to be utilized in offline computations. Kinesis Data Firehose provides an event buffer and Lambda integration to easily collect, batch transform, and persist these metrics to Amazon S3 to be utilized later by the batch feature computing. Batch feature computations don’t have the same low latency read/write requirements as PIT features, which makes Amazon S3 the better choice because it provides low-cost, durable storage for storing these large volumes of business metrics.

Our initial ML model uses 21 batch features computed daily using data captured in the past 2 months. This data includes both playback and app engagement history per user, and grows with the number of users and frequency of app usage. Feature engineering at this scale requires an automated process to pull the required input data, process it in parallel, and export the outcome to persistent storage. The processing infrastructure is needed only for the duration of the computations. SageMaker Processing provides prebuilt Docker images that include Apache Spark and other dependencies needed to run distributed data processing jobs at large scale. The underlying infrastructure for a Processing job is fully managed by SageMaker. Cluster resources are provisioned for the duration of your job, and cleaned up when a job is complete.

Each step in the batch process—data gathering, feature engineering, feature persistence—is part of a workflow that requires error handling, retries, and state transitions in between. With AWS Step Functions, you can create a state machine and split your workflow into several steps of preprocessing and postprocessing, as well as a step to persist the features into SageMaker Feature Store or the other data to Amazon S3. A state machine in Step Functions can be triggered via Amazon EventBridge to automate batch computing to run at a set schedule, such as once every day at 10:00 PM UTC.

After the features are computed, they need to be versioned and stored to be read during inference as well as model retraining. Rather than build your own feature storage and management service, you can use SageMaker Feature Store. Feature Store is a fully managed, purpose-built repository to store, share, and manage features for ML models. It stores history of ML features in the offline store (Amazon S3) and also provides APIs to an online store to allow low-latency reads of most recent features. The offline store can serve the historical data for further model training and experimentation, and the online store can be called by your customer-facing APIs to get features for real-time inference. As we evolve our services to provide more personalized content, we anticipate training additional ML models and with the help of Feature Store, search, discover and reuse features amongst these models.

Real-time inference

Real-time inference usually requires hosting ML models behind endpoints. You could do this using web servers or containers, but this requires ML engineering effort and infrastructure to manage and maintain. SageMaker makes it easy to deploy ML models to real-time endpoints. SageMaker lets you train and upload ML models and host them by creating and configuring SageMaker endpoints. Real-time inference satisfies the low-latency requirements for ranking shows as they are browsed on the Amp home page.

In addition to managed hosting, SageMaker provides managed endpoint scaling. SageMaker inference allows you to define an auto scaling policy with minimum and maximum instance counts and a target utilization to trigger the scaling. This way, you can easily scale in or out as demand changes.

Operational health

The number of events this system handles for real-time feature computing changes accordingly with the natural pattern of app usage (higher or lower traffic based on time of day or day of the week). Similarly, the number of requests it receives for real-time inference scales with the number of concurrent app users. These services also get unexpected peaks in traffic due to self promotions in social media by popular creators. Although it’s important to ensure the system can scale up and down to serve the incoming traffic successfully and frugally, it’s also important to monitor operational metrics and alert for any unexpected operational issues to prevent loss of data and service to customers. Monitoring the health of these services is straightforward using Amazon CloudWatch. Vital service health metrics such as faults and latency of operations as well as utilization metrics such as memory, disk, and CPU usage are available out of the box using CloudWatch. Our development team uses metrics dashboards and automated monitoring to ensure we can serve our clients with high availability (99.8%) and low latency (less than 200 milliseconds end-to end to get recommended shows per user).

Measuring the outcome

Prior to the ML-based show recommender described in this post, a simpler heuristic algorithm ranked Amp shows based on a user’s personal topics of interest that are self-reported on their profile. We set up an A/B test to measure the impact of switching to ML-based recommenders with a user’s data from their past app interactions. We identified improvements in metrics such as listening duration and number of engagement actions (liking a show, following a show creator, turning on notifications) as indicators of success. A/B testing with 50% of users receiving show recommendations ranked for them via the ML-based recommender has shown a 3% boost to customer engagement metrics and a 0.5% improvement to playback duration.

Conclusion

With purpose-built services, the Amp team was able to release the personalized show recommendation API as described in this post to production in under 3 months. The system also scales well for the unpredictable loads created by well-known show hosts or marketing campaigns that could generate an influx of users. The solution uses managed services for processing, training, and hosting, which helps reduce the time spent on day-to-day maintenance of the system. We’re also able to to monitor all these managed services via CloudWatch to ensure the continued health of the systems in production.

A/B testing the first version of the Amp’s ML-based recommender against a rule-based approach (which sorts shows by customer’s topics of interest only) has shown that the ML-based recommender exposes customers to higher-quality content from more diverse topics, which results in a higher number of follows and enabled notifications. The Amp team is continuously working towards improving the models to provide highly relevant recommendations.

For more information about Feature Store, visit Amazon SageMaker Feature Store and check out other customer use cases in the AWS Machine Learning Blog.

About the authors

Tulip Gupta is a Solutions Architect at Amazon Web Services. She works with Amazon to design, build, and deploy technology solutions on AWS. She assists customers in adopting best practices while deploying solution in AWS, and is a Analytics and ML enthusiast. In her spare time, she enjoys swimming, hiking and playing board games.

David Kuo is a Solutions Architect at Amazon Web Services. He works with AWS customers to design, build and deploy technology solutions on AWS. He works with Media and Entertainment customers and has interests in machine learning technologies. In his spare time, he wonders what he should do with his spare time.

Manolya McCormick is a Sr Software Development Engineer for Amp on Amazon. She designs and builds distributed systems using AWS to serve customer facing applications. She enjoys reading and cooking new recipes at her spare time.

Jeff Christophersen is a Sr. Data Engineer for Amp on Amazon. He works to design, build, and deploy Big Data solutions on AWS that drive actionable insights. He assists internal teams in adopting scalable and automated solutions, and is a Analytics and Big Data enthusiast. In his spare time, when he is not on a pair of skis you can find him on his mountain bike.

How Amp on Amazon used data to increase customer engagement, Part 1: Building a data analytics platform

Amp, the new live radio app from Amazon, is a reinvention of radio featuring human-curated live audio shows. It’s designed to provide a seamless customer experience to listeners and creators by debuting interactive live audio shows from your favorite artists, radio DJs, podcasters, and friends.

However, as a new product in a new space for Amazon, Amp needed more relevant data to inform their decision-making process. Amp wanted a scalable data and analytics platform to enable easy access to data and perform machine leaning (ML) experiments for live audio transcription, content moderation, feature engineering, and a personal show recommendation service, and to inspect or measure business KPIs and metrics.

This post is the first in a two-part series. Part 1 shows how data was collected and processed using the data and analytics platform, and Part 2 shows how the data was used to create show recommendations using Amazon SageMaker, a fully managed ML service. The personalized show recommendation list service has shown a 3% boost to customer engagement metrics tracked (such as liking a show, following a creator, or enabling upcoming show notifications) since its launch in May 2022.

Solution overview

The data sources for Amp can be broadly categorized as either streaming (near-real time) or batch (point in time). The source data is emitted from Amp-owned systems or other Amazon systems. The two different data types are as follows:

Streaming data – This type of data mainly consists of follows, notifications (regarding users’ friends, favorite creators, or shows), activity updates, live show interactions (call-ins, co-hosts, polls, in-app chat), real-time updates on live show activities (live listen count, likes), live audio playback metrics, and other clickstream metrics from the Amp application. Amp stakeholders require this data to power ML processes or predictive models, content moderation tools, and product and program dashboards (for example, trending shows). Streaming data enables Amp customers to conduct and measure experimentation.
Batch data – This data mainly consists of catalog data, show or creator metadata, and user profile data. Batch data enables more point-in-time reporting and analytics vs. real-time.

The following diagram illustrates the high-level architecture.

The Amp data and analytics platform can be broken down into three high-level systems:

Streaming data ingestion, stream processing and transformation, and stream storage
Batch data ingestion, batch processing and transformation, and batch storage
Business intelligence (BI) and analytics

In the following sections, we discuss each component in more detail.

Streaming data ingestion, processing, transformation, and storage

Amp created a serverless streaming ingestion pipeline capable of tapping into data from sources without the need for infrastructure management, as shown in the following diagram.

The pipeline was able to ingest the Amp show catalog data (what shows are available on Amp) and pass it to the data lake for two different use cases: one for near-real-time analytics, and one for batch analytics.

As part of the ingestion pipeline, the Amp team has an Amazon Simple Queue Service (Amazon SQS) queue that receives messages from an upstream Amazon Simple Notification Service (Amazon SNS) topic that contains information on changes to shows in the catalog. These changes could be the addition of new shows or adjustments to existing ones that have been scheduled.

When the message is received by the SQS queue, it triggers the AWS Lambda function to make an API call to the Amp catalog service. The Lambda function retrieves the desired show metadata, filters the metadata, and then sends the output metadata to Amazon Kinesis Data Streams. Amazon Kinesis Data Firehose receives the records from the data stream. Kinesis Data Firehose then invokes a secondary Lambda function to perform a data transformation that flattens the JSON records received and writes the transformed records to an Amazon Simple Storage Service (Amazon S3) data lake for consumption by Amp stakeholders.

Kinesis Data Firehose enabled buffering and writing data to Amazon S3 every 60 seconds. This helped Amp teams make near-real-time programming decisions that impacted external customers.

The streaming ingestion pipeline supported the following objectives: performance, availability, scalability, and flexibility to send data to multiple downstream applications or services:

Kinesis Data Streams handles streaming data ingestion when necessary. Kinesis Data Streams supported these objectives by enabling the Amp team to quickly ingest data for analytics with minimal operational load. As a fully managed service, it reduced operational overhead, and Amp was able to scale with the product needs.
Lambda enabled the team to create lightweight functions to run API calls and perform data transformations.
Because Kinesis Data Firehose is a managed service, it was able to handle all the scaling, sharding, and monitoring needs of the streaming data without any additional overheard for the team.

Batch data ingestion, processing, transformation, and storage

Amp created a transient batch (point in time) ingestion pipeline capable of data ingestion, processing and transformation, and storage, as shown in the following diagram.

A transient extract, transform, and load (ETL) and extract, load, and transform (ELT) job approach was implemented because of the batch nature of these workloads and unknown data volumes. As a part of the workflow automation, Amazon SQS was used to trigger a Lambda function. The Lambda function then activated the AWS Glue crawler to infer the schema and data types. The crawler wrote the schema metadata to the AWS Glue Data Catalog, providing a unified metadata store for data sharing.

The ETL and ELT jobs were required to run on either a set schedule or event-driven workflow. To handle these needs, Amp used Amazon Managed Workflows for Apache Airflow (Amazon MWAA). Apache Airflow is an open-source Python-based workflow management platform. Amazon MWAA is a fully managed service that automatically handles scaling. It provides sequencing, error handling, retry logic, and state. With Amazon MWAA, Amp was able to take advantage of the the benefits of Airflow for job orchestration while not having to manage or maintain dedicated Airflow servers. Additionally, by using Amazon MWAA, Amp was able to create a code repository and workflow pipeline stored in Amazon S3 that Amazon MWAA could access. The pipeline allowed Amp data engineers to easily deploy Airflow DAGs or PySpark scripts across multiple environments.

Amp used Amazon EMR on Amazon Elastic Kubernetes Service (Amazon EKS) to configure and manage containers for their data processing and transformation jobs. Due to the unique nature of the Amp service, the initial expected data volumes that would be processed were relatively unknown. To provide flexibility as the service evolved, the team decided to go with Amazon EMR on EKS to eliminate any unnecessary operational overheard required to bootstrap and scale Amazon EMR for data processing. This approach allowed them to run transient hybrid EMR clusters backed by a mix of AWS Fargate and Amazon Elastic Compute Cloud (Amazon EC2) nodes, where all system tasks and workloads were offloaded to Fargate, while Amazon EC2 handled all the Apache Spark processing and transformation. This provided the flexibility to have a cluster with one node running, while the Amazon EKS auto scaler dynamically instantiated and bootstrapped any additional EC2 nodes that were required for the job. When the job was complete, they were automatically deleted by the cluster auto scaler. This pattern eliminated the need for the team to manage any of the cluster bootstrap actions or scaling required to respond to evolving workloads.

Amazon S3 was used as the central data lake, and data was stored in Apache Parquet (Parquet) format. Parquet is a columnar format, which speeds up data retrieval and provides efficient data compression. Amazon S3 provided the flexibility, scalability, and security needs for Amp. With Amazon S3, the Amp team was able to centralize data storage in one location and federate access to the data virtually across any service or tool within or outside of AWS. The data lake was split into two S3 buckets: one for raw data ingestion and one for transformed data output. Amazon EMR performed the transformation from raw data to transformed data. With Amazon S3 as the central data lake, Amp was able to securely expose and share the data with other teams across Amp and Amazon.

To simplify data definition, table access provisioning, and the addition and removal of tables, they used AWS Glue crawlers and the AWS Glue Data Catalog. Because Amp is a new service and constantly evolving, the team needed a way to easily define, access, and manage the tables in the data lake. The crawlers handled data definition (including schema changes) and the addition and removal of tables, while the Data Catalog served as a unified metadata store.

Business intelligence and analytics

The following diagram illustrates the architecture for the BI and analytics component.

Amp chose to store the data in the S3 data lake, and not in the data warehouse. This enabled them to access it in a unified manner through the AWS Glue Data Catalog and provided greater flexibility for data consumers. This resulted in faster data access across a variety of services or tools. With data being stored in Amazon S3, it also reduced data warehouse infrastructure costs, because the costs are a function of the compute type and the amount of data stored.

The Amazon Redshift RA3 node type was used as the compute layer to enable stakeholders to query data stored in Amazon S3. Amazon Redshift RA3 nodes decouple storage and compute, and are designed for an access pattern through the AWS Glue Data Catalog. RA3 nodes introduce Amazon Redshift Managed Storage, which is Amazon S3 backed. The combination of these features enabled Amp to right-size the clusters and provide better query performance for their customers while minimizing costs.

Amazon Redshift configuration was automated using a Lambda function, which connected to a given cluster and ran parameterized SQL statements. The SQL statements contained the logic to deploy schemas, user groups, and users, while AWS Secrets Manager was used to automatically generate, store, and rotate Amazon Redshift user passwords. The underlying configuration variables were stored in Amazon DynamoDB. The Lambda function retrieved the variables and requested temporary Amazon Redshift credentials to perform the configuration. This process enabled the Amp team to set up Amazon Redshift clusters in a consistent manner.

Business outcomes

Amp was able to achieve the following business outcomes:

Business reporting – Standard reporting required to run the business, such as daily flash reports, aggregated business review mechanisms, or project and program updates.
Product reporting – Specific reporting required to enable the inspection or measurement of key product KPIs and Metrics. This included visual reports through dashboards such as marketing promotion effectiveness, app engagement metrics, and trending shows.
ML experimentation – Enabled downstream Amazon teams to use this data to support experimentation or generate predictions and recommendations. For example, ML experimentations like a personalized show recommendation list, show categorization, and content moderation helped with Amp’s user retention.

Key benefits

By implementing a scalable, cost-efficient architecture, Amp was able to achieve the following:

Limited operational complexity – They built a flexible system that used AWS managed services wherever possible.
Use the languages of data – Amp was able to support the two most common data manipulation languages, Python and SQL, to perform platform operations, conduct ML experiments, and generate analytics. With this support, the developers with Amp were able to use languages they were familiar with.
Enable experimentation and measurement – Amp allowed developers to quickly generate the datasets needed to conduct experiments and measure the results. This helps in optimizing the Amp customer experience.
Build to learn but design to scale – Amp is a new product that is finding its market fit, and was able to focus their initial energy on building just enough features to get feedback. This enabled them to pivot toward the right product market fit with each launch. They were able to build incrementally, but plan for the long term.

Conclusion

In this post, we saw how Amp created their data analytics platform using user behavioral data from streaming and batch data sources. The key factors that drove the implementation were the need to provide a flexible, scalable, cost-efficient, and effort-efficient data analytics platform. Design choices were made evaluating various AWS services.

Part 2 of this series shows how we used this data and built out the personalized show recommendation list using SageMaker.

As next steps, we recommend doing a deep dive into each stage of your data pipeline system and making design choices that would be cost-effective and scalable for your needs. For more information, you can also check out other customer use cases in the AWS Analytics Blog.

If you have feedback about this post, submit it in the comments section.

About the authors

Build repeatable, secure, and extensible end-to-end machine learning workflows using Kubeflow on AWS

This is a guest blog post cowritten with athenahealth.

athenahealth a leading provider of network-enabled software and services for medical groups and health systems nationwide. Its electronic health records, revenue cycle management, and patient engagement tools allow anytime, anywhere access, driving better financial outcomes for its customers and enabling its provider customers to deliver better quality care.

In the artificial intelligence (AI) space, athenahealth uses data science and machine learning (ML) to accelerate business processes and provide recommendations, predictions, and insights across multiple services. From its first implementation in automated document services, touchlessly processing millions of provider-patient documents, to its more recent work in virtual assistants and improving revenue cycle performance, athenahealth continues to apply AI to help drive efficiency, service capabilities, and better outcomes for providers and their patients.

This blog post demonstrates how athenahealth uses Kubeflow on AWS (an AWS-specific distribution of Kubeflow) to build and streamline an end-to-end data science workflow that preserves essential tooling, optimizes operational efficiency, increases data scientist productivity, and sets the stage for extending their ML capabilities more easily.

Kubeflow is the open-source ML platform dedicated to making deployments of ML workflows on Kubernetes simple, portable, and scalable. Kubeflow achieves this by incorporating relevant open-source tools that integrate well with Kubernetes. Some of these projects include Argo for pipeline orchestration, Istio for service mesh, Jupyter for notebooks, Spark, TensorBoard, and Katib. Kubeflow Pipelines helps build and deploy portable, scalable ML workflows that can include steps like data extraction, preprocessing, model training, and model evaluation in the form of repeatable pipelines.

AWS is contributing to the open-source Kubeflow community by providing its own Kubeflow distribution (called Kubeflow on AWS) that helps organizations like athenahealth build highly reliable, secure, portable, and scalable ML workflows with reduced operational overhead through integration with AWS managed services. AWS provides various Kubeflow deployment options like deployment with Amazon Cognito, deployment with Amazon Relational Database Service (Amazon RDS) and Amazon Simple Storage Service (Amazon S3), and vanilla deployment. For details on service integration and available add-ons for each of these options, refer to Deployment.

Today, Kubeflow on AWS provides a clear path to using Kubeflow, augmented with the following AWS services:

Application Load Balancer for secure external traffic management over HTTPS
Amazon CloudWatch for persistent log management
Amazon Cognito for user authentication with Transport Layer Security (TLS)
AWS Deep Learning Containers (DLCs) for highly optimized Jupyter notebook server images
Amazon Elastic File System (Amazon EFS) or Amazon FSx for Lustre for a simple, scalable, and serverless file storage solution for increased training performance
Amazon Elastic Kubernetes Service (Amazon EKS) for managed Kubernetes clusters
Amazon RDS for highly scalable pipelines and a metadata store
AWS Secrets Manager to protect secrets needed to access your applications
Amazon S3 for an easy-to-use pipeline artifacts store
Access to AWS services from Katib and from pipeline pods using the AWS IAM Roles for Service Accounts (IRSA) integration with Kubeflow Profiles

Many AWS customers are taking advantage of the Kubeflow on AWS distribution, including athenahealth.

Here, the athenahealth MLOps team discuss the challenges they encountered and the solutions they created in their Kubeflow journey.

Challenges with the previous ML environment

Prior to our adoption of Kubeflow on AWS, our data scientists used a standardized set of tools and a process that allowed flexibility in the technology and workflow used to train a given model. Example components of the standardized tooling include a data ingestion API, security scanning tools, the CI/CD pipeline built and maintained by another team within athenahealth, and a common serving platform built and maintained by the MLOps team. However, as our use of AI and ML matured, the variety of tools and infrastructure created for each model grew. Although we were still able to support the existing process, we saw the following challenges on the horizon:

Maintenance and growth – Reproducing and maintaining model training environments took more effort as the number of deployed models increased. Each project maintained detailed documentation that outlined how each script was used to build the final model. In many cases, this was an elaborate process involving 5 to 10 scripts with several outputs each. These had to be manually tracked with detailed instructions on how each output would be used in subsequent processes. Maintaining this over time became cumbersome. Moreover, as the projects became more complex, the number of tools also increased. For example, most models utilized Spark and TensorFlow with GPUs, which required a larger variety of environment configurations. Over time, users would switch to newer versions of tools in their development environments but then couldn’t run older scripts when those versions became incompatible. Consequently, maintaining and augmenting older projects required more engineering time and effort. In addition, as new data scientists joined the team, knowledge transfers and onboarding took more time, because synchronizing local environments included many undocumented dependencies. Switching between projects faced the same issues because each model had its own workflows.
Security – We take security seriously, and therefore prioritize compliance with all contractual, legal, and regulatory obligations associated with ML and data science. Data must be utilized, stored, and accessed in specific ways, and we have embedded robust processes to ensure our practices comply with our legal obligations as well as align with industry best practices. Prior to Kubeflow adoption, ensuring that data was stored and accessed in a specific way involved regular verification across multiple, diverse workflows. We knew that we could improve efficiencies by consolidating these diverse workflows onto a single platform. However, that platform would need to be flexible enough to integrate well with our standardized tooling.
Operations – We also saw an opportunity to increase operational efficiency and management through centralizing the logging and monitoring of the workflows. Because each team had developed their own tools, we collected this information from each workflow individually and aggregated them.

The data science team evaluated various solutions for consolidating the workflows. In addition to addressing these requirements, we looked for a solution that would integrate seamlessly with the existing standardized infrastructure and tools. We selected Amazon EKS and Kubeflow on AWS as our workflow solution.

The data scientist development cycle incorporating Kubeflow

A data science project begins with a clean slate: no data, no code, only the business problem that can be solved with ML. The first task is a proof of concept (POC) to discover if the data holds enough signal to make an ML model effective at solving the business problem, starting with querying for the raw dataset from our Snowflake data warehouse. This stage is iterative, and the data scientists use Kubernetes pods or Kubeflow Jupyter notebooks during this process.

Our Kubeflow cluster uses the Karpenter cluster autoscaler, which makes spinning up resources easy for data scientists because they only need to focus on defining the desired instance types, while the provisioning work is done by a set of predefined Karpenter provisioners. We have separate provisioners for CPU and GPU instance types, and all the instances supported by Amazon EKS fall in one of these two categories as per our provisioner configuration. The data scientists choose instance types using node selectors, and Karpenter takes care of node lifecycle management.

After the query is developed, the data scientists extract the raw data to a location on Amazon S3, then launch a Jupyter notebook from the AWS Kubeflow UI to explore the data. The goal is to create the feature set that will be used to train the first model. This allows the data scientists to determine if there is enough signal in the data to fulfill the customer’s business need.

After the results are satisfactory, the data scientists move to the next stage of the development cycle and turn their discoveries into a robust pipeline. They convert the POC code into production-quality code that runs at scale. To ensure compliance through using approved libraries, a container is created with the appropriate base Docker image. For our data scientists, we have found that providing a standard Python, TensorFlow, and Spark base image gives sufficient flexibility for most, if not all, workloads. They can then use the Dockerfile of their component to further customize their development environment. This Dockerfile is then utilized by the CI/CD process to build the components image that will be used in production, therefore maintaining consistency between development and production environments.

We have a tool that gives data scientists the ability to launch their development environment in a pod running on Kubernetes. When this pod is running, the data scientists can then attach the Visual Studio Code IDE directly to the pod and debug their model code. After they have the code running successfully, they can then push their changes to git and a new development environment is created with the most recent changes.

The standard data science pipeline consists of stages that include extraction, preprocessing, training, and evaluation. Each stage in the pipeline appears as a component in Kubeflow, which consists of a Kubernetes pod that runs a command with some information passed in as parameters. These parameters can either be static values or references to output from a previous component. The Docker image used in the pod is built from the CI/CD process. Details on this process appear in the CI/CD workflow discussed in the next section.

Development Cycle on Kubeflow. The development workflow starts on the left with the POC. The completed model is deployed to the athenahealth model serving platform running on Amazon ECS.

CI/CD process supporting automated workflows

As part of our CI/CD process, we use Jenkins to build and test all Kubeflow component images in parallel. On successful completion, the pipeline component template contains reference pointers to the images, and the resulting pipeline is uploaded to Kubeflow. Parameters in the Jenkins pipeline allow users to launch the pipelines and run their model training tests after successful builds.

Alternatively, to maintain a short development cycle, data scientists can also launch the pipeline from their local machine, modifying any pipeline parameters they may be experimenting with.

Tooling exists to ensure the reference pointers from the CI/CD build are utilized by default. If there is a deployable artifact in the repo, then the CI/CD logic will continue to deploy the artifact to the athenahealth model serving platform (the Prediction Service) running on Amazon ECS with AWS Fargate. After all these stages have passed, the data scientist merges the code to the primary branch. The pipelines and deployable artifacts are then pushed to production.

CI/CD Deployment workflow. This diagram describes the Data Science build and deployment workflow. The CI/CD process is driven by Jenkins.

Security

In consolidating our data science workflows, we were able to centralize our approach to securing the training pipeline. In this section, we discuss our approach to data and cluster security.

Data security

Data security is of the utmost importance at athenahealth. For this reason, we develop and maintain infrastructure that is fully compliant with the regulations and standards that protect the security and integrity of these data.

To ensure we meet data compliance standards, we provision our AWS infrastructure in accordance with our athenahealth enterprise guidelines. The two main stores for data are Amazon RDS for highly scalable pipeline metadata and Amazon S3 for pipeline and model artifacts. For Amazon S3, we ensure the buckets are encrypted, HTTPS endpoints are enforced, and the bucket policies and AWS Identity and Access Management (IAM) roles follow the principles of least privilege when permitting access to the data. This is true for Amazon RDS data as well: encryption is always enabled, and the security groups and credential access follow the principle of least privilege. This standardization ensures that only authorized parties have access to the data, and this access is tracked.

In addition to these measures, the platform also undergoes security threat assessments and continuous security and compliance scans.

We also address data retention requirements via data lifecycle management for all S3 buckets that contain sensitive data. This policy automatically moves data to Amazon S3 Glacier after 30 days of creation. Exceptions to this are managed through data retrieval requests and are approved or denied on a case-by-case basis. This ensures that all workflows comply with the data retention policy. This also solves the problem with recovering data if a model performs poorly, and retraining is required, or when a new model must be evaluated against a historical iteration of an older model’s dataset.

For restricting access to Amazon S3 and Amazon RDS from within Kubeflow on AWS and Amazon EKS, we use IRSA (IAM Roles for Service Accounts), which provides IAM-based permission provisioning for resources within Kubernetes. Each tenant in Kubeflow has a unique pre-created service account which we bind to an IAM role created specifically to fulfill the tenant access requirements. User access to tenants is also restricted using the Amazon Cognito user pools group membership for each user. When a user is authenticated to the cluster, the generated token contains group claims, and Kubernetes RBAC uses this information to allow or deny access to a particular resource in the cluster. This setup is explained in more detail in the next section.

Cluster security using multi-user isolation

As we noted in the previous section, data scientists perform exploratory data analyses, run data analytics, and train ML models. To allocate resources, organize data, and manage workflows based on projects, Kubeflow on AWS provides isolation based on Kubernetes namespaces. This isolation works for interacting with the Kubeflow UI; however, it doesn’t provide any tooling to control access to the Kubernetes API using Kubectl. This means that user access can be controlled on the Kubeflow UI but not over the Kubernetes API via Kubectl.

The architecture described in the following diagram addresses this issue by unifying access to projects in Kubeflow based on group memberships. To achieve this, we took advantage of the Kubeflow on AWS manifests, which have integration with Amazon Cognito user pools. In addition, we use Kubernetes role-based access control (RBAC) to control authorization within the cluster. The user permissions are provisioned based on Amazon Cognito group membership. This information is passed to the cluster with the token generated by the OIDC client. This process is simplified thanks to the built-in Amazon EKS functionality that allows associating OIDC identity providers to authenticate with the cluster.

By default, Amazon EKS authentication is performed by the IAM authenticator, which is a tool that enables authenticating with an EKS cluster using IAM credentials. This authentication method has its merits; however, it’s not suitable for our use case because athenahealth uses Microsoft Azure Active Directory for identity service across the organization.

Kubernetes namespace isolation. Data Scientists can obtain membership to a single or multiple groups as needed for their work. Access is reviewed on a regular basis and removed as appropriate.

Azure Active Directory, being an enterprise-wide identity service, is the source of truth for controlling user access to the Kubeflow cluster. The setup for this includes creating an Azure Enterprise Application that acts as service principal and adding groups for various tenants that require access to the cluster. This setup on Azure is mirrored in Amazon Cognito by setting up a federated OIDC identity provider that outsources authentication responsibility to Azure. The access to Azure groups is controlled by SailPoint IdentityIQ, which sends access requests to the project owner to allow or deny as appropriate. In the Amazon Cognito user pool, two application clients are created: one is used to set up the authentication for the Kubernetes cluster using the OIDC identity provider, and the other to secure Kubeflow authentication into the Kubeflow UI. These clients are configured to pass group claims upon authentication with the cluster, and these group claims are used alongside RBAC to set up authorization within the cluster.

Kubernetes RBAC role bindings are set up between groups and the cluster role Kubeflow-edit, which is created upon installing Kubeflow in the cluster. This role binding ensures any user interacting with the cluster after logging in via OIDC can access the namespaces they have permissions for as defined in their group claims. Although this works for users interacting with the cluster using Kubectl, the Kubeflow UI currently doesn’t provision access to users based on group membership because it doesn’t use RBAC. Instead, it uses the Istio Authorization Policy resource to control access for users. To overcome this challenge, we developed a custom controller that synchronizes users by polling Amazon Cognito groups and adds or removes corresponding role bindings for each user rather than by group. This setup enables users to have the same level of permissions when interacting with both the Kubeflow UI and Kubectl.

Operational efficiency

In this section, we discuss how we took advantage of the open source and AWS tools available to us to manage and debug our workflows as well as to minimize the operational impact of upgrading Kubeflow.

Logging and monitoring

For logging, we utilize FluentD to push all our container logs to Amazon OpenSearch Service and system metrics to Prometheus. We then use Kibana and the Grafana UI for searching and filtering logs and metrics. The following diagram describes how we set this up.

Kubeflow Logging. We use both Grafana UI and Kibana to view and sift through the logs

The following screenshot is a Kibana UI view from our pipeline.

Sample Kibana UI View. Kibana allows for customized views.

Safe Kubeflow cluster upgrades

As we onboard users to Kubeflow on AWS, we maintain a reliable and consistent user experience while allowing the MLOps team to stay agile with releasing and integrating new features. On the surface, Kustomize seems modular for us to enable working and upgrading one component at a time without impacting others, thereby allowing us to add new capabilities with minimal disruption to the users. However, in practice there are scenarios where the best approach is to simply spin up a new Kubernetes cluster rather than applying component-level upgrades for existing clusters. We found two use cases where it made more sense to create completely new clusters:

Upgrading to a Kubernetes version where AWS does provide in-place cluster upgrades. However, it becomes difficult to test if each of the Kubeflow and Kubernetes resources are working as intended and the manifests retain backward compatibility.
Upgrading Kubeflow to a newer release where there are several features added or modified and it almost always is not a promising idea to perform in-place upgrades on an existing Kubernetes cluster.

In addressing this issue, we developed a strategy that enables us to have safe cluster replacements without impacting any existing workloads. To achieve this, we had to meet the following criteria:

Separate the Kubeflow storage and compute resources so that the pipeline metadata, pipeline artifacts, and user data are retained when deprovisioning the older cluster
Integrate with Kubeflow on AWS manifests so that when a Kubeflow version upgrade occurs, minimal changes are required
Have an effortless way to roll back if things go wrong after cluster upgrade
Have a simple interface to promote a candidate cluster to production

The following diagram illustrates this architecture.

Safe Kubeflow Cluster Upgrade. Once testing of the Kubeflow Candidate is successful, it is promoted to Kubeflow Prod through an update to Route 53.

Kubeflow on AWS manifests come pre-packaged with Amazon RDS and Amazon S3 integrations. With these managed services acting as common data stores, we can set up a blue-green deployment strategy. To achieve this, we ensured that the pipeline metadata is persisted in Amazon RDS, which works independently of the EKS cluster, and the pipeline logs and artifacts are persisted in Amazon S3. In addition to pipeline metadata and artifacts, we also set up FluentD to route pod logs to Amazon OpenSearch Service.

This ensures that the storage layer is completely separated from the compute layer and thereby enables testing changes during Kubeflow version updates on a completely new EKS cluster. After all the tests are successful, we’re able to simply change the Amazon Route 53 DNS record to the candidate cluster hosting Kubeflow. Also, we keep the old cluster running as backup for a few days just in case we need to roll back.

Benefits of Amazon EKS and Kubeflow on AWS for our ML pipeline

Amazon EKS and the Kubeflow on AWS package moved our development workflow to a pattern that strongly encourages repeatable model training. These tools allow us to have fully defined clusters with fully defined tenants and run fully defined code.

A lot of wins from building this platform are less quantitative and have more to do with how the workflows have improved for both platform developers and users. For example, MinIO was replaced with direct access to Amazon S3, which moves us closer to our original workflows and reduces the number of services we must maintain. We are also able to utilize Amazon RDS as the backend for Kubeflow, which enables easier migrations between clusters and gives us the ability to back up our pipelines nightly.

We also found the improvements in the Kubeflow integration with AWS managed services beneficial. For example, with Amazon RDS, Amazon S3, and Amazon Cognito preconfigured in the Kubeflow on AWS manifests, we save time and effort updating to newer distributions of Kubeflow. When we used to modify the official Kubeflow manifests manually, updating to a new version would take several weeks, from design to testing.

Switching to Amazon EKS gives us the opportunity to define our cluster in Kustomize (now part of Kubectl) and Terraform. It turns out that for platform work, Kubernetes and Terraform are very easy to work with after putting in enough time to learn. After many iterations, the tools available to us make it very easy to perform standard platform operations like upgrading a component or swapping out an entire development cluster. Compared to running jobs off raw Amazon Elastic Compute Cloud (Amazon EC2) instances, it’s hard to compare what an enormous difference it makes to have well-defined pods with guaranteed resource cleanup and retry mechanisms built in.

Kubernetes provides great security standards, and we have only scratched the surface of what the multi-user isolation allows us to do. We see multi-user isolation as a pattern that has more payoff in the future when the training platform produces production-level data, and we bring on developers from outside our team.

Meanwhile, Kubeflow allows us to have reproducible model training. Even with the same data, no training produces identical models, but we have the next best thing. With Kubeflow, we know exactly what code and data were used to train a model. Onboarding has greatly improved because each step in our pipeline is clearly and programmatically defined. When new data scientists have the task of fixing a bug, they need much less handholding because there is a clear structure to how outputs of code are used between stages.

Using Kubeflow also yields a lot of performance improvements compared to running on a single EC2 instance. Often in model training, data scientists need different tools and optimizations for preprocessing and training. For example, preprocessing is often run using distributed data processing tools, like Spark, whereas training is often run using GPU instances. With Kubeflow pipelines, they can specify different instance types for different stages in the pipeline. This allows them to use the powerful GPU instances in one stage and a fleet of smaller machines for distributed processing in another stage. Also, because Kubeflow pipelines describe the dependencies between stages, the pipelines can run stages in parallel.

Finally, because we created a process for adding tenants to the cluster, there is now a more formal way to register teams to a tenant on the cluster. Because we use Kubecost to track costs in our EKS cluster, it allows us to attribute cost to a single project rather than having cost attributed at the account level, which includes all data science projects. Kubecost presents a report of the money spent per namespace, which is tightly coupled to the tenant or team that is responsible for running the pipeline.

Despite all the benefits, we would caution to only undertake this kind of migration if there is total buy-in from users. Users that put in the time get a lot of benefits from using Amazon EKS and Kubernetes, but there is a significant learning curve.

Conclusion

With the implementation of the Kubeflow on AWS pipeline in our end-to-end ML infrastructure, we were able to consolidate and standardize our data science workflows while retaining our essential tooling (such as CI/CD and model serving). Our data scientists can now move between projects based on this workflow without the overhead of learning how to maintain a completely different toolset. For some of our models, we were also pleasantly surprised at the speed of the new workflow (five times faster), which allowed for more training iterations and consequently producing models with better predictions.

We also have established a solid foundation to augment our MLOps capabilities and scale the number and size of our projects. For example, as we harden our governance posture in model lineage and tracking, we have reduced our focus from over 15 workflows to just one. And when the Log4shell vulnerability came to light in late 2021, we were able to focus on a single workflow and quickly remediate as needed (performing Amazon Elastic Container Registry (Amazon ECR) scans, upgrading Amazon OpenSearch Service, updating our tooling, and more) with minimal impact to the ongoing work by the data scientists. As AWS and Kubeflow enhancements become available, we can incorporate them as we see fit.

This brings us to an important and understated aspect of our Kubeflow on AWS adoption. One of the critical outcomes of this journey is the ability to roll out upgrades and enhancements to Kubeflow seamlessly for our data scientists. Although we discussed our approach to this earlier, we also rely on the Kubeflow manifests provided by AWS. We started our Kubeflow journey as a proof of concept in 2019, prior to the release of version 1.0.0. (We’re currently on 1.4.1, evaluating 1.5. AWS is already working on the 1.6 version.) In the intervening 3 years, there have been at least six releases with substantial content. Through their disciplined approach to integrating and validating these upgrades and releasing the manifests on a predictable, reliable schedule, the Kubeflow team at AWS has been crucial in enabling the athenahealth MLOps team to plan our development roadmap, and consequently our resource allocations and areas of focus, further into the future with greater confidence.

You can follow the AWS Labs GitHub repository to track all AWS contributions to Kubeflow. You can also find AWS teams on the Kubeflow #AWS Slack Channel; your feedback there helps AWS prioritize the next features to contribute to the Kubeflow project.

About the authors

Kanwaljit Khurmi is a Senior Solutions Architect at Amazon Web Services. He works with the AWS customers to provide guidance and technical assistance helping them improve the value of their solutions when using AWS. Kanwaljit specializes in helping customers with containerized and machine learning applications.

Tyler Kalbach is a Principal Member of Technical Staff at athenahealth. Tyler has approximately 7 years of experience in Analytics, Data Science, Neural Networks, and development of Machine Learning applications in the Healthcare space. He has contributed to several Machine Learning solutions that are currently serving production traffic. Currently working as a Principal Data Scientist in athenahealth’s Engineering organization, Tyler has been part of the team that has built the new Machine Learning Training Platform for athenahealth from the inception of that effort.

Victor Krylov is a Principal Member of Technical Staff at athenahealth. Victor is an engineer and scrum master, helping data scientists build secure fast machine learning pipelines. In athenahealth he has worked on interfaces, clinical ordering, prescriptions, scheduling, analytics and now machine learning. He values cleanly written and well unit tested code, but has an unhealthy obsession with code one-liners. In his spare time he enjoys listening to podcasts while walking his dog.

Sasank Vemuri is a Lead Member of Technical Staff at athenahealth. He has experience working with developing data driven solutions across domains such as healthcare, insurance and bioinformatics. Sasank currently works with designing and developing machine learning training and inference platforms on AWS and Kubernetes that help with training and deploying ML solutions at scale.

Anu Tumkur is an Architect at athenahealth. Anu has over two decades of architecture, design, development experience building various software products in machine learning, cloud operations, big data, real-time distributed data pipelines, ad tech, data analytics, social media analytics. Anu currently works as an architect in athenahealth’s Product Engineering organization on the Machine Learning Platform and Data Pipeline teams.

William Tsen is a Senior Engineering Manager at athenahealth. He has over 20 years of engineering leadership experience building solutions in healthcare IT, big data distributed computing, intelligent optical networks, real-time video editing systems, enterprise software, and group healthcare underwriting. William currently leads two awesome teams at athenahealth, the Machine Learning Operations and DevOps engineering teams, in the Product Engineering organization.

Large language models and the increasing necessity of model parallel inference

Solution overview

Inference large models on SageMaker

Pull the Docker image and push to Amazon ECR

Create our model file

Create a SageMaker model

Create a SageMaker endpoint

Performance tuning

Tune the ML engine for multi-threaded inference

Tune Netty

Tune workload management (WLM) of DJLServing

Tune degree of tensor parallelism

Summary

About the authors

Collect relevant data

Number of images

Balanced dataset

Varying types of images

Varying backgrounds

Varying lighting conditions

Varying angles

Add negative labels

Handling label confusion

Data augmentation

Review training metrics

Conclusion

About the authors

Solution overview

Prerequisites

Configure Active Directory

Configure ADFS

Create Application Group

Configure claim descriptions

Configure the application group claim rules

Create and configure an OIDC IdP workforce using the SageMaker API

Validate the OIDC IdP workforce authentication response

Create a private work team

Test access to the private labeling portal

Cost

Clean up

Summary

About the authors

Solution overview

Real-time feature computing

Batch feature computing

Real-time inference

Operational health

Measuring the outcome

Conclusion

About the authors

Solution overview

Streaming data ingestion, processing, transformation, and storage

Batch data ingestion, processing, transformation, and storage

Business intelligence and analytics

Business outcomes

Key benefits

Conclusion

About the authors

Challenges with the previous ML environment

The data scientist development cycle incorporating Kubeflow

CI/CD process supporting automated workflows

Security

Data security

Cluster security using multi-user isolation

Operational efficiency

Logging and monitoring

Safe Kubeflow cluster upgrades

Benefits of Amazon EKS and Kubeflow on AWS for our ML pipeline

Conclusion

About the authors

Navigation

GenAI Vision Endless Possibilities

"I'm interested in things that change the world or that affect the future and wondrous, new technology where you see it, and you're like, 'Wow, how did that even happen? How is that possible?'" -- Elon Musk

Copyright © 2019-2025 Vedere AI. All Rights Reserved.