
Prepare data at scale in Amazon SageMaker Studio using serverless AWS Glue interactive sessions

September 13, 2022

by Sean Morgan Amazon AWS

Amazon SageMaker Studio is the first fully integrated development environment (IDE) for machine learning (ML). It provides a single, web-based visual interface where you can perform all ML development steps, including preparing data and building, training, and deploying models.

AWS Glue is a serverless data integration service that makes it easy to discover, prepare, and combine data for analytics, ML, and application development. AWS Glue enables you to seamlessly collect, transform, cleanse, and prepare data for storage in your data lakes and data pipelines using a variety of capabilities, including built-in transforms.

Data engineers and data scientists can now interactively prepare data at scale using their Studio notebook’s built-in integration with serverless Spark sessions managed by AWS Glue. Starting in seconds and automatically stopping compute when idle, AWS Glue interactive sessions provide an on-demand, highly-scalable, serverless Spark backend to achieve scalable data preparation within Studio. Notable benefits of using AWS Glue interactive sessions on Studio notebooks include:

No clusters to provision or manage
No idle clusters to pay for
No up-front configuration required
No resource contention for the same development environment
The exact same serverless Spark runtime and platform as AWS Glue extract, transform, and load (ETL) jobs

In this post, we show you how to prepare data at scale in Studio using serverless AWS Glue interactive sessions.

Solution overview

To implement this solution, you complete the following high-level steps:

Update your AWS Identity and Access Management (IAM) role permissions.
Launch an AWS Glue interactive session kernel.
Configure your interactive session.
Customize your interactive session and run a scalable data preparation workload.

Update your IAM role permissions

To start, you need to update your Studio user’s IAM execution role with the required permissions. For detailed instructions, refer to Permissions for Glue interactive sessions in SageMaker Studio.

You first add the managed policies to your execution role:

On the IAM console, choose Roles in the navigation pane.
Find the Studio execution role that you will use, and choose the role name to go to the role summary page.
On the Permissions tab, on the Add Permissions menu, choose Attach policies.
Select the managed policies AmazonSageMakerFullAccess and AwsGlueSessionUserRestrictedServiceRole.
Choose Attach policies.
The summary page shows your newly added managed policies. Now you add a custom policy and attach it to your execution role.
On the Add Permissions menu, choose Create inline policy.

On the JSON tab, enter the following policy:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "VisualEditor0",
            "Effect": "Allow",
            "Action": [
                "iam:GetRole",
                "iam:PassRole",
                "sts:GetCallerIdentity"
            ],
            "Resource": "*"
        }
    ]
}

Modify your role’s trust relationship:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "Service": [
                    "glue.amazonaws.com",
                    "sagemaker.amazonaws.com"
                ]
            },
            "Action": "sts:AssumeRole"
        }
    ]
}
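The same two documents can also be applied programmatically. The following is a minimal sketch, assuming a boto3 IAM client is passed in by the caller. The function name apply_glue_session_permissions, the inline policy name GlueSessionPassRole, and the role name in the usage note are hypothetical placeholders, not names from this post.

```python
import json

# Inline policy from this post: lets the role look itself up and pass itself to AWS Glue.
INLINE_POLICY = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "VisualEditor0",
            "Effect": "Allow",
            "Action": ["iam:GetRole", "iam:PassRole", "sts:GetCallerIdentity"],
            "Resource": "*",
        }
    ],
}

# Trust relationship from this post: both AWS Glue and SageMaker may assume the role.
TRUST_POLICY = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"Service": ["glue.amazonaws.com", "sagemaker.amazonaws.com"]},
            "Action": "sts:AssumeRole",
        }
    ],
}


def apply_glue_session_permissions(iam_client, role_name, policy_name="GlueSessionPassRole"):
    """Attach the inline policy and replace the role's trust relationship.

    policy_name is a hypothetical default chosen for this sketch.
    """
    iam_client.put_role_policy(
        RoleName=role_name,
        PolicyName=policy_name,
        PolicyDocument=json.dumps(INLINE_POLICY),
    )
    iam_client.update_assume_role_policy(
        RoleName=role_name,
        PolicyDocument=json.dumps(TRUST_POLICY),
    )
```

A call might look like apply_glue_session_permissions(boto3.client("iam"), "MyStudioExecutionRole"), where the role name is a placeholder for your own Studio execution role. Note that update_assume_role_policy replaces the existing trust policy rather than merging into it.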

Launch an AWS Glue interactive session kernel

If you already have existing users within your Studio domain, you may need to have them shut down and restart their Jupyter Server to pick up the new notebook kernel images.

Upon reloading, you can create a new Studio notebook and select your preferred kernel. The built-in SparkAnalytics 1.0 image should now be available, and you can choose your preferred AWS Glue kernel (Glue Scala Spark or Glue PySpark).

Configure your interactive session

You can easily configure your AWS Glue interactive session with notebook cell magics prior to initialization. Magics are small commands prefixed with % at the start of Jupyter cells that provide shortcuts to control the environment. In AWS Glue interactive sessions, magics are used for all configuration needs, including:

%region – The AWS Region in which to initialize a session. The default is the Studio Region.
%iam_role – The IAM role ARN to run your session with. The default is the user’s SageMaker execution role.
%worker_type – The AWS Glue worker type. The default is standard.
%number_of_workers – The number of workers that are allocated when a job runs. The default is five.
%idle_timeout – The number of minutes of inactivity after which a session will time out. The default is 2,880 minutes.
%additional_python_modules – A comma-separated list of additional Python modules to include in your cluster. These can come from PyPI or Amazon Simple Storage Service (Amazon S3).
%%configure – A JSON-formatted dictionary consisting of AWS Glue-specific configuration parameters for a session.

For a comprehensive list of configurable magic parameters for this kernel, use the %help magic within your notebook.

Your AWS Glue interactive session will not start until the first non-magic cell is run.

Customize your interactive session and run a data preparation workload

As an example, the following notebook cells show how you can customize your AWS Glue interactive session and run a scalable data preparation workload. In this example, we perform an ETL task to aggregate air quality data for a given city, grouping by the hour of the day.

We configure our session to save our Spark logs to an S3 bucket for real-time debugging, which we see later in this post. Be sure that the iam_role that is running your AWS Glue session has write access to the specified S3 bucket.

%help

%session_id_prefix air-analysis-
%glue_version 3.0
%idle_timeout 60
%%configure
{
"--enable-spark-ui": "true",
"--spark-event-logs-path": "s3://<BUCKET>/gis-spark-logs/"
}

Next, we load our dataset directly from Amazon S3. Alternatively, you could load data using your AWS Glue Data Catalog.

from pyspark.sql.functions import split, lower, hour
print(spark.version)
day_to_analyze = "2022-01-05"
df = spark.read.json(f"s3://openaq-fetches/realtime-gzipped/{day_to_analyze}/1641409725.ndjson.gz")
df_air = spark.read.schema(df.schema).json(f"s3://openaq-fetches/realtime-gzipped/{day_to_analyze}/*")

Finally, we filter the data for a single city and pollutant, average the readings by hour of the day, and write the transformed dataset to an output bucket location that we defined:

df_city = df_air.filter(lower(df_air.city).contains('delhi')).filter(df_air.parameter == "no2").cache()
df_avg = df_city.withColumn("Hour", hour(df_city.date.utc)).groupBy("Hour").avg("value").withColumnRenamed("avg(value)", "no2_avg")
df_avg.sort("Hour").show()

# Examples of reading / writing to other data stores: 
# https://github.com/aws-samples/aws-glue-samples/tree/master/examples/notebooks

df_avg.write.parquet(f"s3://<BUCKET>/{day_to_analyze}.parquet")
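To make the aggregation concrete, here is a plain-Python sketch of the same filter-and-average logic on a handful of made-up records. The record values and the helper name hourly_no2_average are illustrative assumptions; only the field names (city, parameter, value, date.utc) come from the Spark code above, and the timestamp format is assumed for the sample data.

```python
from collections import defaultdict
from datetime import datetime


def hourly_no2_average(records, city_substring="delhi"):
    """Average NO2 readings per hour of day for cities matching city_substring."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for rec in records:
        # Same two filters as the Spark job: case-insensitive city match, NO2 only.
        if city_substring not in rec["city"].lower():
            continue
        if rec["parameter"] != "no2":
            continue
        # Assumed timestamp format for this sample data.
        ts = datetime.strptime(rec["date"]["utc"], "%Y-%m-%dT%H:%M:%SZ")
        sums[ts.hour] += rec["value"]
        counts[ts.hour] += 1
    # Equivalent of groupBy("Hour").avg("value"), sorted by hour.
    return {h: sums[h] / counts[h] for h in sorted(sums)}


# Made-up sample records for illustration only.
sample = [
    {"city": "Delhi", "parameter": "no2", "value": 40.0, "date": {"utc": "2022-01-05T03:00:00Z"}},
    {"city": "New Delhi", "parameter": "no2", "value": 60.0, "date": {"utc": "2022-01-05T03:30:00Z"}},
    {"city": "Delhi", "parameter": "pm25", "value": 99.0, "date": {"utc": "2022-01-05T03:45:00Z"}},
    {"city": "Mumbai", "parameter": "no2", "value": 10.0, "date": {"utc": "2022-01-05T04:00:00Z"}},
    {"city": "Delhi", "parameter": "no2", "value": 30.0, "date": {"utc": "2022-01-05T04:15:00Z"}},
]
print(hourly_no2_average(sample))  # {3: 50.0, 4: 30.0}
```

The Spark version does the same work in parallel across all files for the day, which is what makes it suitable for data that doesn't fit on a single notebook instance.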

After you’ve completed your work, you can end your AWS Glue interactive session immediately by simply shutting down the Studio notebook kernel, or you could use the %stop_session magic.

Debugging and Spark UI

In the preceding example, we specified the "--enable-spark-ui": "true" argument along with a "--spark-event-logs-path" location. This configures our AWS Glue session to record the session’s logs so that we can use the Spark UI to monitor and debug our AWS Glue job in real time.

For the process of launching the Spark UI and reading those Spark logs, refer to Launching the Spark history server. In the following screenshot, we’ve launched a local Docker container that has permission to read the S3 bucket that contains our logs. Optionally, you could host an Amazon Elastic Compute Cloud (Amazon EC2) instance to do this, as described in the preceding linked documentation.

Pricing

When you use AWS Glue interactive sessions on Studio notebooks, you’re charged separately for resource usage on AWS Glue and Studio notebooks.

AWS charges for AWS Glue interactive sessions based on how long the session is active and the number of Data Processing Units (DPUs) used. You’re charged an hourly rate for the number of DPUs used to run your workloads, billed in increments of 1 second. AWS Glue interactive sessions assign a default of 5 DPUs and require a minimum of 2 DPUs. There is also a 1-minute minimum billing duration for each interactive session. To see the AWS Glue rates and pricing examples, or to estimate your costs using the AWS Pricing Calculator, see AWS Glue pricing.
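The billing rules above can be turned into a quick estimate. This is a rough sketch under stated assumptions: the function name estimate_session_cost is hypothetical, and the rate of 0.44 USD per DPU-hour is only an example value, so check AWS Glue pricing for the rate in your Region.

```python
def estimate_session_cost(dpus, active_seconds, rate_per_dpu_hour):
    """Estimate the charge for one interactive session.

    Applies the rules quoted in this post: per-second billing, a 1-minute
    minimum billing duration, and a minimum of 2 DPUs per session.
    """
    if dpus < 2:
        raise ValueError("interactive sessions require at least 2 DPUs")
    billed_seconds = max(active_seconds, 60)  # 1-minute minimum
    dpu_hours = dpus * billed_seconds / 3600
    return dpu_hours * rate_per_dpu_hour


# Example: default 5 DPUs, active for 30 minutes, at an assumed 0.44 USD per DPU-hour.
print(round(estimate_session_cost(5, 30 * 60, 0.44), 2))
```

This covers only the AWS Glue side of the bill; the Studio notebook instance is charged separately, as described next.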

Your Studio notebook runs on an EC2 instance and you’re charged for the instance type you choose, based on the duration of use. Studio assigns you a default EC2 instance type of ml.t3.medium when you select the SparkAnalytics image and associated kernel. You can change the instance type of your Studio notebook to suit your workload. For information about SageMaker Studio pricing, see Amazon SageMaker Pricing.

Conclusion

The native integration of Studio notebooks with AWS Glue interactive sessions facilitates seamless and scalable serverless data preparation for data scientists and data engineers. We encourage you to try out this new functionality in Studio!

See Prepare Data using AWS Glue Interactive Sessions for more information.

About the authors

Sean Morgan is a Senior ML Solutions Architect at AWS. He has experience in the semiconductor and academic research fields, and uses his experience to help customers reach their goals on AWS. In his free time, Sean is an active open source contributor/maintainer and is the special interest group lead for TensorFlow Addons.

Sumedha Swamy is a Principal Product Manager at Amazon Web Services. He leads the SageMaker Studio team to build it into the IDE of choice for interactive data science and data engineering workflows. He has spent the past 15 years building customer-obsessed consumer and enterprise products using machine learning. In his free time, he likes photographing the amazing geology of the American Southwest.

Amazon scientists win best-paper award for ad auction simulator

September 13, 2022

by Amazon AWS

Paper introduces a unified view of the learning-to-bid problem and presents AuctionGym, a simulation environment that enables reproducible validation of new solutions.

Save the date: Join AWS at NVIDIA GTC, September 19–22

September 12, 2022

by Jeremy Singh Amazon AWS

Register free for NVIDIA GTC to learn from experts on how AI and the evolution of the 3D internet are profoundly impacting industries—and society as a whole. We have prepared several AWS sessions to give you guidance on how to use AWS services powered by NVIDIA technology to meet your goals. Amazon Elastic Compute Cloud (Amazon EC2) instances powered by NVIDIA GPUs deliver the scalable performance needed for fast machine learning (ML) training, cost-effective ML inference, flexible remote virtual workstations, and powerful HPC computations.

AWS is a Global Diamond Sponsor of the conference.

Available sessions

Scaling Deep Learning Training on Amazon EC2 using PyTorch (Presented by Amazon Web Services) [A41454]
As deep learning models grow in size and complexity, they need to be trained using distributed architectures. In this session, we review the details of the PyTorch fully sharded data parallel (FSDP) algorithm, which enables you to train deep learning models at scale.

Tuesday, September 20, at 2:00 PM – 2:50 PM PDT
Speakers: Shubha Kumbadakone, Senior GTM Specialist, AWS ML, AWS; and Less Wright, Partner Engineer, Meta

A Developer’s Guide to Choosing the Right GPUs for Deep Learning (Presented by Amazon Web Services) [A41463]
As a deep learning developer or data scientist, choosing the right GPU for deep learning can be challenging. On AWS, you can choose from multiple NVIDIA GPU-based EC2 compute instances depending on your training and deployment requirements. We dive into how to choose the right instance for your needs in this session.

Available on demand
Speaker: Shashank Prasanna, Senior Developer Advocate, AI/ML, AWS

Real-time Design in the Cloud with NVIDIA Omniverse on Amazon EC2 (Presented by Amazon Web Services) [A4631]
In this session, we discuss how, by deploying NVIDIA Omniverse Nucleus—the Universal Scene Description (USD) collaboration engine—on EC2 On-Demand compute instances, Omniverse is able to scale to meet the demands of global teams.

Available on demand
Speaker: Kellan Cartledge, Spatial Computing Solutions Architect, AWS

5G Killer App: Making Augmented and Virtual Reality a Reality [A41234]
Extended reality (XR), which comprises augmented, virtual, and mixed realities, is consistently envisioned as one of the key killer apps for 5G, because XR requires ultra-low latency and large bandwidths to deliver wired-equivalent experiences for users. In this session, we share how Verizon, AWS, and Ericsson are collaborating to combine 5G and XR technology with NVIDIA GPUs, RTX vWS, and CloudXR to build the infrastructure for commercial XR services across a variety of industries.

Tuesday, September 20, at 1:00 PM – 1:50 PM PDT
Speakers: David Randle, Global Head of GTM for Spatial Computing, AWS; Veronica Yip, Product Manager and Product Marketing Manager, NVIDIA; Balaji Raghavachari, Executive Director, Tech Strategy, Verizon; and Peter Linder, Head of 5G Marketing, North America, Ericsson

Accelerate and Scale GNNs with Deep Graph Library and GPUs [A41386]
Graphs play important roles in many applications, including drug discovery, recommender systems, fraud detection, and cybersecurity. Graph neural networks (GNNs) are the current state-of-the-art method for computing graph embeddings in these applications. This session discusses the recent improvements of the Deep Graph Library on NVIDIA GPUs in the DGL 0.9 release cycle.

Wednesday, September 21, at 2:00 PM – 2:50 PM PDT
Speaker: Da Zheng, Senior Applied Scientist, AWS

Register for free for access to this content, and be sure to visit our sponsor page to learn more about AWS solutions powered by NVIDIA. See you there!

About the author

Jeremy Singh is a Partner Marketing Manager for storage partners within the AWS Partner Network. In his spare time, he enjoys traveling, going to the beach, and spending time with his dog Bolin.

How Medidata used Amazon SageMaker asynchronous inference to accelerate ML inference predictions up to 30 times faster

September 12, 2022

by Rajnish Jain Amazon AWS

This post is co-written with Rajnish Jain, Priyanka Kulkarni and Daniel Johnson from Medidata.

Medidata is leading the digital transformation of life sciences, creating hope for millions of patients. Medidata helps generate the evidence and insights to help pharmaceutical, biotech, medical devices, and diagnostics companies as well as academic researchers with accelerating value, minimizing risk, and optimizing outcomes for their solutions. More than one million registered users across over 1,900 customers and partners access the world’s most trusted platform for clinical development, commercial, and real-world data.

Medidata’s AI team combines unparalleled clinical data, advanced analytics, and industry expertise to help life sciences leaders reimagine what is possible, uncover breakthrough insights to make confident decisions, and pursue continuous innovation. Medidata’s AI suite of solutions is backed by an integrated team of scientists, physicians, technologists, and ex-regulatory officials—built upon Medidata’s core platform comprising over 27,000 trials and 8 million patients.

Amazon SageMaker is a fully managed machine learning (ML) platform within the secure AWS landscape. With SageMaker, data scientists and developers can quickly and easily build and train ML models, and then directly deploy them into a production-ready hosted environment. For hosting trained ML models, SageMaker offers a wide array of options. Depending on the type of traffic pattern and latency requirements, you could choose one of these several options. For example, real-time inference is suitable for persistent workloads with millisecond latency requirements, payload sizes up to 6 MB, and processing times of up to 60 seconds. With Serverless Inference, you can quickly deploy ML models for inference without having to configure or manage the underlying infrastructure, and you pay only for the compute capacity used to process inference requests, which is ideal for intermittent workloads. For requests with large unstructured data with payload sizes up to 1 GB, with processing times up to 15 mins, and near real-time latency requirements, you can use asynchronous inference. Batch transform is ideal for offline predictions on large batches of data that are available up front.
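As a rough illustration of how these limits separate the options, the following sketch encodes the thresholds quoted above in a small helper. The function name suggest_hosting_option is hypothetical, treating 1 MB as 1024 * 1024 bytes is an assumption, and Serverless Inference is left out because the paragraph gives no numeric limits for it.

```python
MB = 1024 * 1024  # assumption: binary megabytes
GB = 1024 * MB


def suggest_hosting_option(payload_bytes, processing_seconds, data_available_up_front=False):
    """Suggest a SageMaker hosting option from the limits quoted in this post."""
    if data_available_up_front:
        # Offline predictions on large batches that already exist.
        return "batch transform"
    if payload_bytes <= 6 * MB and processing_seconds <= 60:
        return "real-time inference"
    if payload_bytes <= 1 * GB and processing_seconds <= 15 * 60:
        return "asynchronous inference"
    # Anything beyond the asynchronous limits has to be handled offline.
    return "batch transform"


print(suggest_hosting_option(2 * MB, 5))      # real-time inference
print(suggest_hosting_option(200 * MB, 300))  # asynchronous inference
```

In practice the choice also depends on traffic pattern and cost, which is what the rest of this post works through for Medidata’s workload.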

In this collaborative post, we demonstrate how AWS helped Medidata take advantage of the various hosting capabilities within SageMaker to experiment with different architecture choices for predicting the operational success of proposed clinical trials. We also validate why Medidata chose SageMaker asynchronous inference for its final design and how this final architecture helped Medidata serve its customers with predictions up to 30 times faster while keeping ML infrastructure costs relatively low.

Architecture evolution

System design is not about choosing one right architecture. It’s the ability to discuss and experiment with multiple possible approaches and weigh their trade-offs in satisfying the given requirements for our use case. During this process, it’s essential to take into account prior knowledge of various types of requirements and existing common systems that can interact with our proposed design. The scalability of a system is its ability to easily and cost-effectively vary resources allocated to it so as to serve changes in load. This applies to both increasing and decreasing user numbers or requests to the system.

In the following sections, we discuss how Medidata worked with AWS in iterating over a list of possible scalable architecture designs. We especially focus on the evolution journey, design choices, and trade-offs we went through to arrive at a final choice.

SageMaker batch transform

Medidata originally used SageMaker batch transform for ML inference to develop a minimum viable product (MVP) for a new predictive solution, because the application had low usage and loose performance requirements at the time. When a batch transform job starts, SageMaker initializes compute instances and distributes the inference or preprocessing workload between them. It’s a high-performance and high-throughput method for transforming data and generating inferences. It’s ideal for scenarios where you’re dealing with large batches of data, don’t need subsecond latency, and need to either preprocess or transform the data or use a trained model to run batch predictions on it in a distributed manner. The SageMaker batch transform workflow also uses Amazon Simple Storage Service (Amazon S3) as the persistent layer, which maps to one of our data requirements.

Initially, using SageMaker batch transform worked well for the MVP, but as the requirements evolved and Medidata needed to support its customers in near real time, batch transform was no longer suitable because it is an offline method and customers needed to wait 5–15 minutes for responses. This wait consisted primarily of the startup cost for the underlying compute cluster, which has to spin up every time a batch workload needs to be processed. This architecture also required configuring Amazon CloudWatch event rules to track the progress of the batch prediction job, together with employing a database of choice to track the states and metadata of the fired job. The MVP architecture is shown in the following diagram.

The flow of this architecture is as follows:

The incoming bulk payload is persisted as an input to an S3 location. This event in turn triggers an AWS Lambda Submit function.
The Submit function kicks off a SageMaker batch transform job using the SageMaker client.
The Submit function also updates a state and metadata tracker database of choice with the job ID and sets the status of the job to inProgress. The function also updates the job ID with its corresponding metadata information.
The transient (on-demand) compute cluster required to process the payload spins up, initiating a SageMaker batch transform job. At the same time, the job also emits status notifications and other logging information to CloudWatch logs.
The CloudWatch event rule captures the status of the batch transform job and sends a status notification to an Amazon Simple Notification Service (Amazon SNS) topic configured to capture this information.
A Notification Lambda function subscribes to the SNS topic and is triggered whenever the CloudWatch event rule fires and a message is published to the topic.
The Notification function then updates the status of the transform job for success or failure in the tracking database.
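The Submit steps of this flow can be sketched as follows. This is a minimal illustration, not Medidata’s implementation: it assumes a boto3 SageMaker client (the "sagemaker" client, which exposes create_transform_job) is passed in, uses a plain dict as a stand-in for the state and metadata tracker database, and the function name submit_batch_job and the default instance type are hypothetical.

```python
def submit_batch_job(sagemaker_client, tracker, job_name, model_name,
                     input_s3_uri, output_s3_uri, instance_type="ml.m5.xlarge"):
    """Start a batch transform job and record it as inProgress in the tracker."""
    sagemaker_client.create_transform_job(
        TransformJobName=job_name,
        ModelName=model_name,
        TransformInput={
            "DataSource": {
                "S3DataSource": {"S3DataType": "S3Prefix", "S3Uri": input_s3_uri}
            }
        },
        TransformOutput={"S3OutputPath": output_s3_uri},
        # instance_type is an assumed default; size it for your own model.
        TransformResources={"InstanceType": instance_type, "InstanceCount": 1},
    )
    # The dict stands in for the database of choice mentioned in the flow.
    tracker[job_name] = {
        "status": "inProgress",
        "input": input_s3_uri,
        "output": output_s3_uri,
    }
    return job_name
```

Every call provisions a fresh transient cluster, which is the startup cost that dominated the 5–15 minute response time described earlier.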

While exploring alternative strategies and architectures, Medidata realized that the traffic pattern for the application consisted of short bursts followed by periods of inactivity. To validate the drawbacks of this existing MVP architecture, Medidata performed some initial benchmarking to understand and prioritize the bottlenecks of this pipeline. As shown in the following diagram, the largest bottleneck was the transition time before running the model for inference due to spinning up new resources with each bulk request. The definition of a bulk request here corresponds to a payload that is a collection of operational site data to be processed rather than a single instance of a request. The second biggest bottleneck was the time to save and write the output, which was also introduced due to the batch model architecture.

As the number of clients increased and usage multiplied, Medidata prioritized user experience by tightening performance requirements. Therefore, Medidata decided to replace the batch transform workflow with a faster alternative. This led to Medidata experimenting with several architecture designs involving SageMaker real-time inference, Lambda, and SageMaker asynchronous inference. In the following sections, we compare these evaluated designs in depth and analyze the technical reasons for choosing one over the other for Medidata’s use case.

SageMaker real-time inference

You can use SageMaker real-time endpoints to serve your models for predictions in real time with low latency. Serving your predictions in real time requires a model serving stack that not only has your trained model, but also a hosting stack to be able to serve those predictions. The hosting stack typically includes a type of proxy, a web server that can interact with your loaded serving code, and your trained model. Your model can then be consumed by client applications through a real-time invoke API request. The request payload sent when you invoke the endpoint is routed to a load balancer and then routed to your ML instance or instances that are hosting your models for prediction. SageMaker real-time inference comes with all of the aforementioned components and makes it relatively straightforward to host any type of ML model for synchronous real-time inference.

SageMaker real-time inference has a 60-second timeout for endpoint invocation, and the maximum payload size for invocation is capped out at 6 MB. Because Medidata’s inference logic is complex and frequently requires more than 60 seconds, real-time inference alone can’t be a viable option for dealing with bulk requests that normally require unrolling and processing many individual operational identifiers without re-architecting the existing ML pipeline. Additionally, real-time inference endpoints need to be sized to handle peak load. This could be challenging because Medidata has quick bursts of high traffic. Auto scaling could potentially fix this issue, but it would require manual tuning to ensure there are enough resources to handle all requests at any given time. Alternatively, we could manage a request queue to limit the number of concurrent requests at a given time, but this would introduce additional overhead.

Lambda

Serverless offerings like Lambda eliminate the hassle of provisioning and managing servers, and automatically take care of scaling in response to varying workloads. They can also be much cheaper for lower-volume services because they don’t run 24/7. Lambda works well for workloads that can tolerate cold starts after periods of inactivity. If a serverless function has not been run for approximately 15 minutes, the next request experiences what is known as a cold start because the function’s container must be provisioned.

Medidata built several proof of concept (POC) architecture designs to compare Lambda with other alternatives. As a first simple implementation, the ML inference code was packaged as a Docker image and deployed as a container using Lambda. To facilitate faster predictions with this setup, the invoked Lambda function requires a large provisioned memory footprint. For larger payloads, there is an extra overhead to compress the input before calling the Lambda Docker endpoint. Additional configurations are also needed for the CloudWatch event rules to save the inputs and outputs, tracking the progress of the request, and employing a database of choice to track the internal states and metadata of the fired requests. Additionally, there is also an operational overhead for reading and writing data to Amazon S3. Medidata calculated the projected cost of the Lambda approach based on usage estimates and determined it would be much more expensive than SageMaker with no added benefits.

SageMaker asynchronous inference

Asynchronous inference is one of the newest inference offerings in SageMaker that uses an internal queue for incoming requests and processes them asynchronously. This option is ideal for inferences with large payload sizes (up to 1 GB) or long-processing times (up to 15 minutes) that need to be processed as requests arrive. Asynchronous inference enables you to save on costs by autoscaling the instance count to zero when there are no requests to process, so you only pay when your endpoint is processing requests.

For use cases that can tolerate a cold start penalty of a few minutes, you can optionally scale down the endpoint instance count to zero when there are no outstanding requests and scale back up as new requests arrive so that you only pay for the duration that the endpoints are actively processing requests.
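One way to set this up is with Application Auto Scaling, registering the endpoint variant with a minimum capacity of zero and tracking the request backlog. The sketch below assumes a boto3 "application-autoscaling" client is passed in; the function name configure_scale_to_zero, the policy name, the variant name AllTraffic, and the numeric defaults are assumptions to tune for your own workload.

```python
def configure_scale_to_zero(autoscaling_client, endpoint_name, variant_name="AllTraffic",
                            max_instances=2, target_backlog_per_instance=5.0):
    """Let an asynchronous endpoint scale between zero and max_instances."""
    resource_id = f"endpoint/{endpoint_name}/variant/{variant_name}"
    dimension = "sagemaker:variant:DesiredInstanceCount"

    # MinCapacity=0 is what allows the endpoint to scale in to zero instances.
    autoscaling_client.register_scalable_target(
        ServiceNamespace="sagemaker",
        ResourceId=resource_id,
        ScalableDimension=dimension,
        MinCapacity=0,
        MaxCapacity=max_instances,
    )

    # Track the queue backlog per instance; the target value is an assumed example.
    autoscaling_client.put_scaling_policy(
        PolicyName=f"{endpoint_name}-backlog-tracking",
        ServiceNamespace="sagemaker",
        ResourceId=resource_id,
        ScalableDimension=dimension,
        PolicyType="TargetTrackingScaling",
        TargetTrackingScalingPolicyConfiguration={
            "TargetValue": target_backlog_per_instance,
            "CustomizedMetricSpecification": {
                "MetricName": "ApproximateBacklogSizePerInstance",
                "Namespace": "AWS/SageMaker",
                "Dimensions": [{"Name": "EndpointName", "Value": endpoint_name}],
                "Statistic": "Average",
            },
        },
    )
    return resource_id
```

The trade-off is the cold start: the first request after a quiet period waits for an instance to be provisioned, so this suits bursty workloads that can tolerate a delay of a few minutes.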

Creating an asynchronous inference endpoint is very similar to creating a real-time endpoint. You can use your existing SageMaker models and only need to specify additional asynchronous inference configuration parameters while creating your endpoint configuration. Additionally, you can attach an auto scaling policy to the endpoint according to your scaling requirements. To invoke the endpoint, you need to place the request payload in Amazon S3 and provide a pointer to the payload as a part of the invocation request. Upon invocation, SageMaker enqueues the request for processing and returns an output location as a response. Upon processing, SageMaker places the inference response in the previously returned Amazon S3 location. You can optionally choose to receive success or error notifications via Amazon SNS.
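The two pieces described here, the extra endpoint configuration and the S3-pointer invocation, can be sketched as follows. This assumes boto3 clients are supplied by the caller; the helper names build_async_inference_config and invoke_async, the default content type, and the concurrency default are hypothetical choices for illustration.

```python
def build_async_inference_config(output_s3_uri, success_topic_arn=None,
                                 error_topic_arn=None, max_concurrent_per_instance=4):
    """Build the AsyncInferenceConfig block passed to create_endpoint_config."""
    output_config = {"S3OutputPath": output_s3_uri}
    notification = {}
    if success_topic_arn:
        notification["SuccessTopic"] = success_topic_arn
    if error_topic_arn:
        notification["ErrorTopic"] = error_topic_arn
    if notification:
        # SNS notifications are optional, as noted in the text.
        output_config["NotificationConfig"] = notification
    return {
        "OutputConfig": output_config,
        "ClientConfig": {"MaxConcurrentInvocationsPerInstance": max_concurrent_per_instance},
    }


def invoke_async(runtime_client, endpoint_name, input_s3_uri, content_type="application/json"):
    """Enqueue a request and return the S3 location where the result will be written."""
    response = runtime_client.invoke_endpoint_async(
        EndpointName=endpoint_name,
        InputLocation=input_s3_uri,  # pointer to the payload already uploaded to S3
        ContentType=content_type,
    )
    return response["OutputLocation"]
```

The dict returned by build_async_inference_config would be passed as the AsyncInferenceConfig argument of create_endpoint_config on the SageMaker client, and invoke_async would be called with a "sagemaker-runtime" client. Because the call returns as soon as the request is queued, the caller has to poll the output location or rely on the SNS notifications to learn when the result is ready.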

Based on the different architecture designs discussed previously, we identified several bottlenecks and complexity challenges with these architectures. With the launch of asynchronous inference and based on our extensive experimentation and performance benchmarking, Medidata decided to choose SageMaker asynchronous inference for their final architecture for hosting due to a number of reasons outlined earlier. SageMaker is designed from the ground up to support ML workloads, whereas Lambda is more of a general-purpose tool. For our specific use case and workload type, SageMaker asynchronous inference is cheaper than Lambda. Also, SageMaker asynchronous inference’s timeout is much longer (15 minutes) compared to the real-time inference timeout of 60 seconds. This ensures that asynchronous inference can support all of Medidata’s workloads without modification. Additionally, SageMaker asynchronous inference queues up requests during quick bursts of traffic rather than dropping them, which was a strong requirement as per our use case. Exception and error handling is automatically taken care of for you. Asynchronous inference also makes it easy to handle large payload sizes, which is a common pattern with our inference requirements. The final architecture diagram using SageMaker asynchronous inference is shown in the following figure.

The flow of our final architecture is as follows:

The Submit function receives the bulk payload from upstream consumer applications and is set up to be event-driven. This function uploads the payload to the pre-designated Amazon S3 location.
The Submit function then invokes the SageMaker asynchronous endpoint, providing it with the Amazon S3 pointer to the uploaded payload.
The function also updates the state of the request to inProgress in the state and metadata tracker database.
The SageMaker asynchronous inference endpoint reads the input from Amazon S3 and runs the inference logic. When the ML inference succeeds or fails, the inference output is written back to Amazon S3 and the status is sent to an SNS topic.
A Notification Lambda function subscribes to the SNS topic. The function is invoked whenever a status update notification is published to the topic.
The Notification function updates the status of the request to success or failure in the state and metadata tracker database.
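The Notification side of this flow can be sketched as a small handler. This is an illustration under assumptions rather than Medidata’s code: a dict stands in for the state and metadata tracker database, the function name notification_handler is hypothetical, and the message field names (inferenceId, invocationStatus, responseParameters.outputLocation) are assumed from the shape of SageMaker asynchronous inference notifications and should be checked against a real message.

```python
import json


def notification_handler(event, tracker):
    """Update request status from SNS-delivered asynchronous inference notifications."""
    updated = []
    for record in event.get("Records", []):
        # SNS delivers the notification body as a JSON string.
        message = json.loads(record["Sns"]["Message"])
        inference_id = message["inferenceId"]
        succeeded = message.get("invocationStatus") == "Completed"
        entry = tracker.setdefault(inference_id, {})
        entry["status"] = "success" if succeeded else "failure"
        output = message.get("responseParameters", {}).get("outputLocation")
        if output:
            entry["output"] = output
        updated.append(inference_id)
    return updated


# Hand-built sample event for illustration only.
sample_event = {
    "Records": [
        {
            "Sns": {
                "Message": json.dumps(
                    {
                        "inferenceId": "req-1",
                        "invocationStatus": "Completed",
                        "responseParameters": {"outputLocation": "s3://out/req-1.out"},
                    }
                )
            }
        }
    ]
}
tracker = {"req-1": {"status": "inProgress"}}
notification_handler(sample_event, tracker)
print(tracker["req-1"]["status"])  # success
```

In a real Lambda function, the handler would take (event, context) and write to the tracking database instead of a dict. Compared with the MVP, there is no CloudWatch event rule to maintain because the endpoint publishes to SNS directly.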

To recap, the batch transform MVP architecture we started with took 5–15 minutes to run depending on the size of the input. With the switch to asynchronous inference, the new solution runs end to end in 10–60 seconds. We see a speedup of at least five times for larger inputs and up to 30 times for smaller inputs, leading to better customer satisfaction with the performance results. The revised final architecture greatly simplifies the previous asynchronous fan-out/fan-in architecture because we don’t have to worry about partitioning the incoming payload, spawning workers, and delegating and consolidating work amongst the worker Lambda functions.

Conclusion

With SageMaker asynchronous inference, Medidata’s customers using this new predictive application now get predictions up to 30 times faster. Requests aren’t dropped during traffic spikes because the asynchronous inference endpoint queues them. The built-in SNS notification replaced the custom CloudWatch event log notification that Medidata had built to notify the app when the job was complete. In this case, the asynchronous inference approach is cheaper than Lambda. SageMaker asynchronous inference is an excellent option if your team is running heavy ML workloads with burst traffic while trying to minimize cost. This is a great example of collaboration with the AWS team to push the boundaries and use bleeding edge technology for maximum efficiency.

For detailed steps on how to create, invoke, and monitor asynchronous inference endpoints, refer to the documentation, which also contains a sample notebook to help you get started. For pricing information, visit Amazon SageMaker Pricing. For examples of using asynchronous inference with unstructured data such as computer vision and natural language processing (NLP), refer to Run computer vision inference on large videos with Amazon SageMaker asynchronous endpoints and Improve high-value research with Hugging Face and Amazon SageMaker asynchronous inference endpoints, respectively.

About the authors

Rajnish Jain is a Senior Director of Engineering at Medidata AI based in NYC. Rajnish heads engineering for a suite of applications that use machine learning on AWS to help customers improve operational success of proposed clinical trials. He is passionate about the use of machine learning to solve business problems.

Priyanka Kulkarni is a Lead Software Engineer within Acorn AI at Medidata Solutions. She architects and develops solutions and infrastructure to support ML predictions at scale. She is a data-driven engineer who believes in building innovative software solutions for customer success.

Daniel Johnson is a Senior Software Engineer within Acorn AI at Medidata Solutions. He builds APIs to support ML predictions around the feasibility of proposed clinical trials.

Arunprasath Shankar is a Senior AI/ML Specialist Solutions Architect with AWS, helping global customers scale their AI solutions effectively and efficiently in the cloud. In his spare time, Arun enjoys watching sci-fi movies and listening to classical music.

Raghu Ramesha is an ML Solutions Architect with the Amazon SageMaker Service team. He focuses on helping customers build, deploy, and migrate ML production workloads to SageMaker at scale. He specializes in machine learning, AI, and computer vision domains, and holds a master’s degree in Computer Science from UT Dallas. In his free time, he enjoys traveling and photography.

Amazon and Harvard launch alliance to advance research in quantum networking

September 12, 2022

by Amazon AWS

Collaboration will seek to advance the development of a quantum internet.

Deploy large models on Amazon SageMaker using DJLServing and DeepSpeed model parallel inference

September 9, 2022

by Frank Liu Amazon AWS

The last few years have seen rapid development in the field of natural language processing (NLP). Although hardware has improved, such as with the latest generation of accelerators from NVIDIA and Amazon, advanced machine learning (ML) practitioners still regularly encounter issues deploying their large language models. Today, we announce new capabilities in Amazon SageMaker that can help: you can configure the maximum Amazon EBS volume size and timeout quotas to facilitate large model inference. Coupled with model parallel inference techniques, you can now use the fully managed model deployment and management capabilities of SageMaker when working with large models with billions of parameters.

In this post, we demonstrate these new SageMaker capabilities by deploying a large, pre-trained NLP model from Hugging Face across multiple GPUs. In particular, we use the Deep Java Library (DJL) serving and tensor parallelism techniques from DeepSpeed to achieve under 0.1 second latency in a text generation use case with 6 billion parameter GPT-J. Complete example on our GitHub repository coming soon.

Large language models and the increasing necessity of model parallel inference

Language models have recently exploded in both size and popularity. In 2018, BERT-large entered the scene and, with its 340 million parameters and novel transformer architecture, set the standard on NLP task accuracy. Within just a few years, state-of-the-art NLP model size has grown by more than 500 times, with models such as OpenAI’s 175 billion parameter GPT-3 and similarly sized open-source Bloom 176 B raising the bar on NLP accuracy. This increase in the number of parameters is driven by the simple and empirically-demonstrated positive relationship between model size and accuracy: more is better. With easy access from model zoos such as Hugging Face and improved accuracy in NLP tasks such as classification and text generation, practitioners are increasingly reaching for these large models. However, deploying them can be a challenge.

Large language models can be difficult to host for low-latency inference use cases because of their size. Typically, ML practitioners simply host a model (or even multiple models) within the memory of a single accelerator device that handles inference end to end on its own. However, large language models can be too big to fit within the memory of a single accelerator, so this paradigm can’t work. For example, open-source GPT-NeoX with 20 billion parameters can require more than 80 GB of accelerator memory, which is more than triple what is available on an NVIDIA A10G, a popular GPU for inference. Practitioners have a few options to work against this accelerator memory constraint. A simple but slow approach is to use CPU memory and stream model parameters sequentially to the accelerator. However, this introduces a communication bottleneck between the CPU and GPU, which can add seconds to inference latency and is therefore unsuitable for many use cases that require fast responses. Another approach is to optimize or compress the model so that it can fit on a single device. Practitioners must implement complex techniques such as quantization, pruning, distillation, and others to reduce the memory requirements. This approach requires a lot of time and expertise and can also reduce the accuracy and generalization of a model, which can also be a non-starter for many use cases.

A third option is to use model parallelism. With model parallelism, the parameters and layers of a model are partitioned and then spread across multiple accelerators. This approach allows practitioners to take advantage of both the memory and processing power of multiple accelerators at once and can deliver low-latency inference without impacting the accuracy of the model. Model parallelism is already a popular technique in training (see Introduction to Model Parallelism) and is increasingly used in inference as practitioners require low-latency responses from large models.
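The memory constraint described above can be checked with simple arithmetic. The following is a rough sketch that counts weights only; it assumes 4 bytes per parameter for float32 and 2 bytes for float16, and it ignores activations and framework overhead, so real requirements are higher. The function name and the dtype table are illustrative and not part of any library.

```python
# Rough weight-memory estimate: parameters x bytes per parameter.
# Assumption: weights only; activations and runtime overhead are ignored.
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "int8": 1}

A10G_MEMORY_GB = 24  # device memory of a single NVIDIA A10G


def weight_memory_gb(num_params: float, dtype: str = "float32") -> float:
    """Return the memory needed for the weights alone, in decimal GB."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


for name, params in [("GPT-J 6B", 6e9), ("GPT-NeoX 20B", 20e9)]:
    for dtype in ("float32", "float16"):
        gb = weight_memory_gb(params, dtype)
        print(f"{name} {dtype}: {gb:.0f} GB ({gb / A10G_MEMORY_GB:.1f}x one A10G)")
```

Under these assumptions, a 20 billion parameter model in float32 needs about 80 GB for its weights, which matches the figure quoted earlier and is more than three times the memory of one A10G.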

There are two general types of model parallelism: pipeline parallelism and tensor parallelism. Pipeline parallelism splits a model between layers, so that any given layer is contained within the memory of a single GPU. In contrast, tensor parallelism splits layers such that a model layer is spread out across multiple GPUs. Both of these model parallel techniques are used in training (often together), but tensor parallelism can be a better choice for inference because batch size is often one with inference. When batch size is one, only tensor parallelism can take advantage of multiple GPUs at once when processing the forward pass to improve latency.
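To make the idea of splitting a layer concrete, the following toy sketch simulates tensor parallelism for one linear layer in plain Python. It is only an illustration of the technique, not how DeepSpeed implements it: the weight matrix is split column-wise into shards (one per simulated GPU), each shard computes its slice of the output for the same input, and the slices are concatenated. All function names are hypothetical.

```python
from typing import List

Matrix = List[List[float]]


def matmul(x: Matrix, w: Matrix) -> Matrix:
    """Plain matrix product x @ w."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*w)] for row in x]


def split_columns(w: Matrix, parts: int) -> List[Matrix]:
    """Split w column-wise into equal shards, one per simulated device."""
    cols = len(w[0])
    if cols % parts != 0:
        raise ValueError("column count must be divisible by the number of parts")
    step = cols // parts
    return [[row[i * step:(i + 1) * step] for row in w] for i in range(parts)]


def tensor_parallel_linear(x: Matrix, w: Matrix, parts: int) -> Matrix:
    """Each shard computes a slice of the output; slices are concatenated per row."""
    partials = [matmul(x, shard) for shard in split_columns(w, parts)]
    return [sum((p[r] for p in partials), []) for r in range(len(x))]


x = [[1.0, 2.0]]                 # batch size 1, hidden size 2
w = [[1.0, 0.0, 2.0, 1.0],
     [0.0, 1.0, 1.0, 3.0]]       # weights of one layer, shape 2 x 4

# The sharded computation gives the same result as the unsharded layer.
assert tensor_parallel_linear(x, w, parts=2) == matmul(x, w)
```

Even with a batch size of one, both shards have work to do on every forward pass, which is why tensor parallelism can reduce latency where pipeline parallelism would leave all but one device idle.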

In this post, we use DeepSpeed to partition the model using tensor parallelism techniques. DeepSpeed Inference supports large Transformer-based models with billions of parameters. It allows you to efficiently serve large models by adapting to the best parallelism strategies for multi-GPU inference, accounting for both inference latency and cost. For more information, refer to DeepSpeed: Accelerating large-scale model inference and training via system optimizations and compression and DeepSpeed Inference: Enabling Efficient Inference of Transformer Models at Unprecedented Scale.

Solution overview

The Deep Java Library (DJL) is an open-source, high-level, engine-agnostic Java framework for deep learning. The DJL is built with native Java concepts on top of existing deep learning frameworks. Because the DJL is designed to be deep learning engine agnostic, you can switch engines at any point. The DJL also provides automatic CPU/GPU choice based on hardware configuration.

Although the DJL was originally designed for Java developers to get started with ML, DJLServing is a high-performance universal model serving solution powered by the DJL that is programming language agnostic. It can serve the commonly seen model types, such as the PyTorch TorchScript model, TensorFlow SavedModel bundle, Apache MXNet model, ONNX model, TensorRT model, and Python script model. DJLServing supports dynamic batching and worker auto scaling to increase throughput. You can load different versions of a model on a single endpoint. You can also serve models from different ML frameworks at the same time. What’s more, DJLServing natively supports multi-GPU by setting up MPI configurations and socket connections for inference. This frees you from the heavy lifting of setting up a multi-GPU environment.

Our proposed solution uses the newly announced SageMaker capabilities, DJLServing and DeepSpeed Inference, for large model inference. As of this writing, all Transformer-based models are supported. This solution is intended for parallel model inference using a single model on a single instance.

DJLServing is built with multiple layers. The routing layer is built on top of Netty. The routing layer handles remote requests and distributes them to workers, either threads in Java or processes in Python, to run inference. The total number of Java threads is set to 2 * cpu_core of the machine to make full use of its computing power. The number of workers can be configured per model or left to the DJL’s auto-detection of the hardware. The following diagram illustrates our architecture.

Inference large models on SageMaker

The following steps demonstrate how to deploy a gpt-j-6B model in SageMaker using DJL serving. This is made possible by the capability to configure the EBS volume size, model download timeout, and startup health-check timeout. You can try out this demo by running the following notebook.

Pull the Docker image and push to Amazon ECR

The Docker image djl-serving:0.18.0-deepspeed is our DJL serving container with DeepSpeed incorporated. We pull this image and then push it to Amazon Elastic Container Registry (Amazon ECR) for later use. See the following code:

docker pull deepjavalibrary/djl-serving:0.18.0-deepspeed

Create our model file

First, we create a file called serving.properties that contains only one line of code. This tells the DJL model server to use the Rubikon engine. Rubikon is an AWS-developed package for large model support. In this demo, it facilitates the MPI threads setup and socket connection. It also sets the number of GPUs (model slicing number) by reading in the TENSOR_PARALLEL_DEGREE parameter that our model.py file, described next, also reads. The file contains the following code:

engine=Rubikon

Next, we create our model.py file, which defines our model as gpt-j-6B. In our code, we read in the TENSOR_PARALLEL_DEGREE environment variable (default value is 1). This sets the number of devices over which the tensor parallel modules are distributed. Note that DeepSpeed provides built-in partition logic for a few models, and gpt-j-6B is one of them. We use it by specifying replace_method and replace_with_kernel_inject. If you have a customized model and need DeepSpeed to partition it effectively, you need to change replace_with_kernel_inject to false and add injection_policy to make the runtime partition work. For more information, refer to Initializing for Inference.

from djl_python import Input, Output
import os
import deepspeed
import torch
from transformers import pipeline, AutoModelForCausalLM, AutoTokenizer

predictor = None

def get_model():
    model_name = 'EleutherAI/gpt-j-6B'
    tensor_parallel = int(os.getenv('TENSOR_PARALLEL_DEGREE', '1'))
    local_rank = int(os.getenv('LOCAL_RANK', '0'))
    model = AutoModelForCausalLM.from_pretrained(model_name, revision="float32", torch_dtype=torch.float32)
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    
    model = deepspeed.init_inference(model,
                                     mp_size=tensor_parallel,
                                     dtype=model.dtype,
                                     replace_method='auto',
                                     replace_with_kernel_inject=True)
    generator = pipeline(task='text-generation', model=model, tokenizer=tokenizer, device=local_rank)
    return generator


def handle(inputs: Input) -> None:
    global predictor
    if not predictor:
        predictor = get_model()

    if inputs.is_empty():
        # Model server makes an empty call to warmup the model on startup
        return None

    data = inputs.get_as_string()
    result = predictor(data, do_sample=True, min_tokens=200, max_new_tokens=256)
    return Output().add(result)

We create a directory called gpt-j and copy model.py and serving.properties to this directory:

mkdir gpt-j
cp model.py gpt-j
cp serving.properties gpt-j

Lastly, we create the model file and upload it to Amazon Simple Storage Service (Amazon S3):

tar cvfz gpt-j.tar.gz gpt-j
aws s3 cp gpt-j.tar.gz s3://djl-sm-test/deepspeed/

Create a SageMaker model

We now create a SageMaker model. We use the ECR image we created earlier and the model artifact from the previous step to create the SageMaker model. In the model setup, we configure TENSOR_PARALLEL_DEGREE=2, which means the model is partitioned across 2 GPUs. See the following code:

aws sagemaker create-model \
--model-name gpt-j \
--primary-container \
Image=<account_id>.dkr.ecr.us-east-1.amazonaws.com/djl-deepspeed:latest,ModelDataUrl=s3://djl-sm-test/deepspeed/gpt-j.tar.gz,Environment={TENSOR_PARALLEL_DEGREE=2} \
--execution-role-arn <IAM_role_arn>

After running the preceding command, you see output similar to the following:

{
    "ModelArn": "arn:aws:sagemaker:us-east-1:<account_id>:model/gpt-j"
}

Create a SageMaker endpoint

You can use any instance with multiple GPUs for testing. In this demo, we use an ml.p3.16xlarge instance. In the following code, note how we set the ModelDataDownloadTimeoutInSeconds, ContainerStartupHealthCheckTimeoutInSeconds, and VolumeSizeInGB parameters to accommodate the large model size. The VolumeSizeInGB parameter is applicable to GPU instances supporting the EBS volume attachment.

aws sagemaker create-endpoint-config \
    --region us-east-1 \
    --endpoint-config-name gpt-j-config \
    --production-variants '[
      {
        "ModelName": "gpt-j",
        "VariantName": "AllTraffic",
        "InstanceType": "ml.p3.16xlarge",
        "InitialInstanceCount": 1,
        "VolumeSizeInGB": 256,
        "ModelDataDownloadTimeoutInSeconds": 1800,
        "ContainerStartupHealthCheckTimeoutInSeconds": 3600
      }
    ]'

Lastly, we create a SageMaker endpoint:

aws sagemaker create-endpoint \
--endpoint-name gpt-j \
--endpoint-config-name gpt-j-config

You see output similar to the following:

{
    "EndpointArn": "arn:aws:sagemaker:us-east-1:<aws-account-id>:endpoint/gpt-j"
}

Starting the endpoint might take a while. You can try a few more times if you run into the InsufficientInstanceCapacity error.
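Because the endpoint can take a while to start, it can help to poll its status before sending a first request. The following is a minimal sketch, not part of the original walkthrough. It assumes you pass in clients created with boto3.client("sagemaker") and boto3.client("sagemaker-runtime"), and it assumes a text/plain content type because the handler in model.py reads its input with get_as_string(). The helper names are hypothetical.

```python
import time


def wait_until_in_service(sm_client, endpoint_name, poll_seconds=30,
                          max_polls=120, sleep=time.sleep):
    """Poll describe_endpoint until the endpoint is InService or has failed."""
    for _ in range(max_polls):
        status = sm_client.describe_endpoint(EndpointName=endpoint_name)["EndpointStatus"]
        if status == "InService":
            return status
        if status == "Failed":
            raise RuntimeError(f"Endpoint {endpoint_name} failed to start")
        sleep(poll_seconds)
    raise TimeoutError(f"Endpoint {endpoint_name} not in service after {max_polls} polls")


def generate(runtime_client, endpoint_name, prompt):
    """Send a text prompt to the endpoint and return the raw response body as text."""
    response = runtime_client.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="text/plain",  # assumption: the handler expects a plain string
        Body=prompt.encode("utf-8"),
    )
    return response["Body"].read().decode("utf-8")
```

With real clients, calling wait_until_in_service(sm_client, "gpt-j") followed by generate(runtime_client, "gpt-j", "your prompt") exercises the endpoint created above.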

Performance tuning

Performance tuning and optimization is an empirical process often involving multiple iterations. The number of parameters to tune is combinatorial, and the configuration parameter values aren’t independent of each other. Various factors affect optimal parameter tuning, including payload size, type, and the number of ML models in the inference request flow graph, storage type, compute instance type, network infrastructure, application code, inference serving software runtime and configuration, and more.

SageMaker real-time inference is ideal for inference workloads where you have real-time, interactive, low-latency requirements. The four most commonly used metrics for monitoring inference request latency for SageMaker inference endpoints are:

Container latency – The time it takes to send the request, fetch the response from the model’s container, and complete inference in the container. This metric is available in Amazon CloudWatch as part of the invocation metrics published by SageMaker.
Model latency – The total time taken by all SageMaker containers in an inference pipeline. This metric is available in CloudWatch as part of the invocation metrics published by SageMaker.
Overhead latency – Measured from the time that SageMaker receives the request until it returns a response to the client, minus the model latency. This metric is available in CloudWatch as part of the invocation metrics published by SageMaker.
End-to-end latency – Measured from the time the client sends the inference request until it receives a response back. You can publish this as a custom metric in CloudWatch.
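Because end-to-end latency is measured on the client, you have to record and publish it yourself. The following sketch shows one way this could look. The metric name EndToEndLatency and the namespace Custom/Inference are hypothetical choices, and the publish helper assumes a client created with boto3.client("cloudwatch").

```python
import time


def timed_call(fn, *args, **kwargs):
    """Run fn and return (result, elapsed milliseconds) as seen by the client."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, (time.perf_counter() - start) * 1000.0


def end_to_end_datum(endpoint_name, latency_ms):
    """Build one MetricData entry for CloudWatch put_metric_data."""
    return {
        "MetricName": "EndToEndLatency",  # hypothetical custom metric name
        "Dimensions": [{"Name": "EndpointName", "Value": endpoint_name}],
        "Value": latency_ms,
        "Unit": "Milliseconds",
    }


def publish(cloudwatch_client, datum, namespace="Custom/Inference"):
    """Publish the datum as a custom metric (the namespace is an assumption)."""
    cloudwatch_client.put_metric_data(Namespace=namespace, MetricData=[datum])
```

Wrapping the inference request in timed_call and publishing the resulting datum lets you compare the client-side view with the model and overhead latency metrics that SageMaker publishes.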

Container latency depends on several factors; the following are among the most important:

Underlying protocol (HTTP(s)/gRPC) used to communicate with the inference server
Overhead related to creating new TLS connections
Deserialization time of the request/response payload
Request queuing and batching features provided by the underlying inference server
Request scheduling capabilities provided by the underlying inference server
Underlying runtime performance of the inference server
Performance of preprocessing and postprocessing libraries before calling the model prediction function
Underlying ML framework backend performance
Model-specific and hardware-specific optimizations

In this section, we focus primarily on container latency and specifically on optimizing DJLServing running inside a SageMaker container.

Tune the ML engine for multi-threaded inference

One of the advantages of the DJL is multi-threaded inference support. It can help increase the throughput of your inference on multi-core CPUs and GPUs and reduce memory consumption compared to Python. Refer to Inference Performance Optimization for more information about optimizing the number of threads for different engines.

Tune Netty

DJLServing is built with multiple layers. The routing layer is built on top of Netty. Netty is an NIO client-server framework that enables quick and easy development of network applications such as protocol servers and clients. In Netty, Channel is the main container; it contains a ChannelPipeline and is associated with an EventLoop (a container for a thread) from an EventLoopGroup. EventLoop is essentially an I/O thread and may be shared by multiple channels. ChannelHandlers are run on these EventLoop threads. This simple threading model means that you don’t need to worry about concurrency issues when your ChannelHandlers run: a single pass through your pipeline is always guaranteed to run sequentially on the same thread. DJLServing uses Netty’s EpollEventLoopGroup on Linux. By default, the total number of Netty threads is set to 2 * the number of virtual CPUs of the machine to make full use of its computing power. Furthermore, because you don’t create large numbers of threads, your CPU isn’t overburdened by context switching. This default setting works fine in most cases; however, if you want to set the number of Netty threads for processing the incoming requests, you can do so by setting the SERVING_NUMBER_OF_NETTY_THREADS environment variable.
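On SageMaker, environment variables for the container are supplied with the model definition, in the same way that TENSOR_PARALLEL_DEGREE was set earlier. The following sketch builds a container definition that also overrides the Netty thread count. The helper names and the value 32 are hypothetical, chosen only for illustration; the resulting dictionary has the shape that the boto3 create_model call accepts as PrimaryContainer.

```python
def default_netty_threads(vcpus: int) -> int:
    """Default described above: twice the number of virtual CPUs."""
    return 2 * vcpus


def container_definition(image_uri, model_data_url, tensor_parallel_degree,
                         netty_threads=None):
    """Build a container definition; environment values must be strings."""
    env = {"TENSOR_PARALLEL_DEGREE": str(tensor_parallel_degree)}
    if netty_threads is not None:
        env["SERVING_NUMBER_OF_NETTY_THREADS"] = str(netty_threads)
    return {"Image": image_uri, "ModelDataUrl": model_data_url, "Environment": env}


container = container_definition(
    image_uri="<account_id>.dkr.ecr.us-east-1.amazonaws.com/djl-deepspeed:latest",
    model_data_url="s3://djl-sm-test/deepspeed/gpt-j.tar.gz",
    tensor_parallel_degree=2,
    netty_threads=32,  # hypothetical override, for illustration only
)
print(container["Environment"])
print(default_netty_threads(64))  # an ml.p3.16xlarge has 64 vCPUs
```

Leaving netty_threads unset keeps the default, which is a reasonable starting point before you measure anything.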

Tune workload management (WLM) of DJLServing

DJLServing has a WorkLoadManager, which is responsible for managing the workload of the worker threads. It manages the thread pools and job queues, and scales the number of worker threads per ML model up or down as required. Its auto scaling adds an inference job to the job queue of the next free worker and scales up the worker thread pool for that specific model if necessary. The scaling is primarily based on the job queue depth of the model, the batch size, and the current number of worker threads in the pool. The job_queue_size controls the number of inference jobs that can be queued up at any point in time. By default, it is set to 100. If you have higher concurrency needs per model serving instance, you can increase the job_queue_size, thread pool size, and minimum or maximum thread workers for a particular model by setting the properties in serving.properties, as shown in the following example code:

serving.properties
# use minWorkers/maxWorkers for all devices
gpu.minWorkers=2
gpu.maxWorkers=3
cpu.minWorkers=2
cpu.maxWorkers=4

As of this writing, you can’t configure job_queue_size in serving.properties. The default value of job_queue_size is controlled by an environment variable, and you can only configure the per-model setting with the registerModel API.

Many practitioners tend to run inference sequentially when the server is invoked with multiple independent requests. Although this is easier to set up, it usually doesn’t make the best use of the GPU’s compute power. To address this, DJLServing offers built-in dynamic batching, which combines these independent inference requests on the server side into a larger batch to increase throughput.

All the requests reach the dynamic batcher first before entering the actual job queues to wait for inference. You can set your preferred batch sizes for dynamic batching using the batch_size settings in serving.properties. You can also configure max_batch_delay to specify the maximum delay time in the batcher to wait for other requests to join the batch based on your latency requirements.

You can fine-tune the following parameters to increase the throughput per model:

batch_size – The inference batch size. The default value is 1.
max_batch_delay – The maximum delay for batch aggregation. The default value is 100 milliseconds.
max_idle_time – The maximum idle time before the worker thread is scaled down.
min_worker – The minimum number of worker processes. For the DJL’s DeepSpeed engine, min_worker is set to the number of GPUs/TENSOR_PARALLEL_DEGREE.
max_worker – The maximum number of worker processes. For the DJL’s DeepSpeed engine, max_worker is set to the number of GPUs/TENSOR_PARALLEL_DEGREE.
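To illustrate how batch_size and max_batch_delay interact, the following is a simplified sketch of a dynamic batcher. It is not DJLServing’s implementation; it only shows the general technique of collecting requests until either the batch is full or the delay budget, counted from the first request, runs out. The function name is hypothetical.

```python
import queue
import time


def next_batch(requests: "queue.Queue", batch_size: int = 1,
               max_batch_delay_ms: int = 100) -> list:
    """Collect up to batch_size requests, waiting at most max_batch_delay_ms
    after the first request arrives for others to join the batch."""
    batch = [requests.get()]  # block until the first request arrives
    deadline = time.monotonic() + max_batch_delay_ms / 1000.0
    while len(batch) < batch_size:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break
        try:
            batch.append(requests.get(timeout=remaining))
        except queue.Empty:
            break  # delay budget used up; run with a partial batch
    return batch


pending = queue.Queue()
for i in range(5):
    pending.put(f"request-{i}")

print(next_batch(pending, batch_size=4, max_batch_delay_ms=50))  # full batch, returns at once
print(next_batch(pending, batch_size=4, max_batch_delay_ms=50))  # partial batch after the delay
```

The sketch makes the trade-off visible: a larger batch_size improves throughput when traffic is heavy, while max_batch_delay bounds the extra latency a lone request pays when traffic is light.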

Tune degree of tensor parallelism

For a large model that doesn’t fit in the memory of a single accelerator device, the number of Python processes is determined by the total number of accelerator devices on the host. The tensor_parallel_degree setting is used to slice the model and distribute it to multiple accelerator devices. In this case, even if a model is too large to host on a single accelerator, it can still be handled by DJLServing and can run on multiple accelerator devices by partitioning the model. Internally, DJLServing creates multiple MPI processes (equal to tensor_parallel_degree) to manage the slice of each model on each accelerator device.

You can set the number of partitions for your model by setting the TENSOR_PARALLEL_DEGREE environment variable. Please note this configuration is a global setting and applies to all the models on the host. If the TENSOR_PARALLEL_DEGREE is less than the total number of accelerator devices (GPUs), DJLServing launches multiple Python process groups equivalent to the total number of GPUs/TENSOR_PARALLEL_DEGREE. Each Python process group consists of Python processes equivalent to TENSOR_PARALLEL_DEGREE. Each Python process group holds the full copy of the model.
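The relationship between the number of GPUs, TENSOR_PARALLEL_DEGREE, and the number of model copies can be sketched as follows. The contiguous assignment of GPU IDs to groups is an assumption made for illustration; the text above only states how many groups there are and how many processes each one contains.

```python
def process_groups(num_gpus: int, tensor_parallel_degree: int) -> list:
    """Return GPU IDs per process group. Each group holds one full copy of
    the model, sliced across tensor_parallel_degree devices."""
    if tensor_parallel_degree < 1 or num_gpus % tensor_parallel_degree != 0:
        raise ValueError("num_gpus must be a positive multiple of tensor_parallel_degree")
    return [
        list(range(start, start + tensor_parallel_degree))  # assumed contiguous layout
        for start in range(0, num_gpus, tensor_parallel_degree)
    ]


# 8 GPUs (as on an ml.p3.16xlarge) with TENSOR_PARALLEL_DEGREE=2
groups = process_groups(num_gpus=8, tensor_parallel_degree=2)
print(len(groups), "model copies:", groups)
```

A higher degree gives each copy more memory and compute but leaves fewer copies to serve concurrent requests, so the right value depends on both model size and traffic.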

Summary

In this post, we showcased the newly launched SageMaker capability to allow you to configure inference instance EBS volumes, model downloading timeout, and container startup timeout. We demonstrated this new capability in an example of deploying a large model in SageMaker. We also covered options available to tune the performance of the DJL. For more details about SageMaker and the new capability launched, refer to [!Link] and [!Link].

About the authors

Frank Liu is a Software Engineer for AWS Deep Learning. He focuses on building innovative deep learning tools for software engineers and scientists. In his spare time, he enjoys hiking with friends and family.

Qing Lan is a Software Development Engineer in AWS. He has been working on several challenging products in Amazon, including high performance ML inference solutions and high performance logging system. Qing’s team successfully launched the first Billion-parameter model in Amazon Advertising with very low latency required. Qing has in-depth knowledge on the infrastructure optimization and Deep Learning acceleration.

Qingwei Li is a Machine Learning Specialist at Amazon Web Services. He received his Ph.D. in Operations Research after he broke his advisor’s research grant account and failed to deliver the Nobel Prize he promised. Currently he helps customers in the financial service and insurance industry build machine learning solutions on AWS. In his spare time, he likes reading and teaching.

Dhawal Patel is a Principal Machine Learning Architect at AWS. He has worked with organizations ranging from large enterprises to mid-sized startups on problems related to distributed computing, and Artificial Intelligence. He focuses on Deep learning including NLP and Computer Vision domains. He helps customers achieve high performance model inference on SageMaker.

Robert Van Dusen is a Senior Product Manager at AWS.

Alan Tan is a Senior Product Manager with SageMaker leading efforts on large model inference. He’s passionate about applying Machine Learning to the area of Analytics. Outside of work, he enjoys the outdoors.

Tips to improve your Amazon Rekognition Custom Labels model

September 9, 2022

by Amit Gupta Amazon AWS

In this post, we discuss best practices to improve the performance of your computer vision models using Amazon Rekognition Custom Labels. Rekognition Custom Labels is a fully managed service to build custom computer vision models for image classification and object detection use cases. Rekognition Custom Labels builds off of the pre-trained models in Amazon Rekognition, which are already trained on tens of millions of images across many categories. Instead of thousands of images, you can get started with a small set of training images (a few hundred or less) that are specific to your use case. Rekognition Custom Labels abstracts away the complexity involved in building a custom model. It automatically inspects the training data, selects the right ML algorithms, selects the instance type, trains multiple candidate models with various hyperparameters settings, and outputs the best trained model. Rekognition Custom Labels also provides an easy-to-use interface from the AWS Management Console for managing the entire ML workflow, including labeling images, training the model, deploying the model, and visualizing the test results.

There are times when a model’s accuracy isn’t the best, and you don’t have many options to adjust the configuration parameters of the model. Behind the scenes there are multiple factors that play a key role to build a high-performing model, such as the following:

Picture angle
Image resolution
Image aspect ratio
Light exposure
Clarity and vividness of background
Color contrast
Sample data size

The following are the general steps to be followed to train a production-grade Rekognition Custom Labels model:

Review Taxonomy – This defines the list of attributes/items that you want to identify in an image.
Collect relevant data – This is the most important step, where you collect relevant images that resemble what you would see in a production environment. This could involve images of objects with varying backgrounds, lighting, or camera angles. You then create training and testing datasets by splitting the collected images. You should only include real-world images as part of the testing dataset, and shouldn’t include any synthetically generated images. Annotations of the data you collected are crucial for the model performance. Make sure the bounding boxes are tight around the objects and the labels are accurate. We discuss some tips that you can consider when building an appropriate dataset later in this post.
Review training metrics – Use the preceding datasets to train a model and review the training metrics for F1 score, precision, and recall. We discuss how to analyze the training metrics in detail later in this post.
Evaluate the trained model – Use a set of unseen images (not used for training the model) with known labels to evaluate the predictions. This step should always be performed to make sure that the model performs as expected in a production environment.
Re-training (optional) – In general, training any machine learning model is an iterative process to achieve the desired results, and a computer vision model is no different. Review the results in Step 4 to see if more images need to be added to the training data, and repeat Steps 3–5.

In this post, we focus on the best practices around collecting relevant data (Step 2) and reviewing your training metrics (Step 3) to improve your model performance.

Collect relevant data

This is the most critical stage of training a production-grade Rekognition Custom Labels model. Specifically, there are two datasets: training and testing. Training data is used for training the model, and you need to spend the effort building an appropriate training set. Rekognition Custom Labels models are optimized for F1 score on the testing dataset to select the most accurate model for your project. Therefore, it’s essential to curate a testing dataset that resembles the real world.

Number of images

We recommend having a minimum of 15–20 images per label. Having more images with more variations that reflect your use case will improve the model performance.

Balanced dataset

Ideally, each label in the dataset should have a similar number of samples. There shouldn’t be a massive disparity in the number of images per label. For example, a dataset where one label has 1,000 images and another label has only 50 is imbalanced. We recommend avoiding scenarios with a lopsided ratio of 1:50 between the label with the fewest images and the label with the most images.
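Both recommendations are easy to check before you start training. The following sketch counts images per label and flags labels below the minimum as well as a lopsided ratio between the largest and smallest label. The thresholds come from the guidance above; the function name, the label names, and the input format (one label string per image) are assumptions for illustration.

```python
from collections import Counter

MIN_IMAGES_PER_LABEL = 15   # lower end of the recommended 15-20 minimum
MAX_IMBALANCE_RATIO = 50    # the 1:50 ratio to avoid


def check_label_balance(labels):
    """Summarize per-label image counts and flag likely dataset problems."""
    counts = Counter(labels)
    too_few = sorted(name for name, n in counts.items() if n < MIN_IMAGES_PER_LABEL)
    ratio = max(counts.values()) / min(counts.values())
    return {
        "counts": dict(counts),
        "too_few": too_few,
        "ratio": ratio,
        "lopsided": ratio >= MAX_IMBALANCE_RATIO,
    }


labels = ["living_room"] * 1000 + ["bedroom"] * 50 + ["patio"] * 10
report = check_label_balance(labels)
print(report["too_few"], report["ratio"], report["lopsided"])
```

Running a check like this on the training and testing datasets separately helps catch a label that is well represented in one but nearly absent in the other.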

Varying types of images

Include images in the training and test datasets that resemble what you will be using in the real world. For example, if you want to classify images of living rooms vs. bedrooms, you should include empty and furnished images of both rooms.

The following is an example image of a furnished living room.

In contrast, the following is an example of an unfurnished living room.

The following is an example image of a furnished bedroom.

The following is an example image of an unfurnished bedroom.

Varying backgrounds

Include images with different backgrounds. Images with natural context can provide better results than images with a plain background.

The following is an example image of the front yard of a house.

The following is an example image of the front yard of a different house with a different background.

Varying lighting conditions

Include images with varying lighting so that it covers the different lighting conditions that occur during inference (for example, with and without flash). You can also include images with varying saturation, hue, and brightness.

The following is an example image of a flower under normal light.

In contrast, the following image is of the same flower under bright light.

Varying angles

Include images taken from various angles of the object. This helps the model learn different characteristics of the objects.

The following images are of the same bedroom from different angles.

There could be occasions where it’s not possible to acquire images of varying types. In those scenarios, synthetic images can be generated as part of the training dataset. For more information about common image augmentation techniques, refer to Data Augmentation.

Add negative labels

For image classification, adding negative labels can help increase model accuracy. For example, you can add a negative label, which doesn’t match any of the required labels. The following image represents the different labels used to identify fully grown flowers.

Adding the negative label not_fully_grown helps the model learn characteristics that aren’t part of the fully_grown label.

Handling label confusion

Analyze the results on the test dataset to recognize any patterns that are missed in the training or testing dataset. Sometimes it’s easy to spot such patterns by visually examining the images. In the following image, the model is struggling to distinguish between the backyard and patio labels.

In this scenario, adding more images to these labels in the dataset and also redefining the labels so that each label is distinct can help increase the accuracy of the model.

Data augmentation

Inside Rekognition Custom Labels, we perform various data augmentations for model training, including random cropping of the image, color jittering, random Gaussian noises, and more. Based on your specific use cases, it might also be beneficial to add more explicit data augmentations to your training data. For example, if you’re interested in detecting animals in both color and black and white images, you could potentially get better accuracy by adding black and white and color versions of the same images to the training data.

We don’t recommend augmentations on testing data unless the augmentations reflect your production use cases.
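As a rough illustration of the explicit augmentation described above, the following Python sketch adds a black and white copy of every training image. It assumes images are held in memory as rows of (R, G, B) tuples; the function names and the ITU-R BT.601 luma weights are choices made for this example and are not part of Rekognition Custom Labels.

```python
def to_grayscale(pixels):
    """Return a black and white copy of an image given as rows of (R, G, B) tuples."""
    gray_rows = []
    for row in pixels:
        gray_row = []
        for r, g, b in row:
            # ITU-R BT.601 luma weights (an assumption for this sketch)
            luma = round(0.299 * r + 0.587 * g + 0.114 * b)
            gray_row.append((luma, luma, luma))
        gray_rows.append(gray_row)
    return gray_rows


def augment_with_grayscale(dataset):
    """dataset is a list of (pixels, label); returns originals plus grayscale copies."""
    augmented = list(dataset)
    for pixels, label in dataset:
        augmented.append((to_grayscale(pixels), label))
    return augmented
```

In practice you would likely use an image library for the conversion; the point is only that each augmented copy keeps the label of its source image.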

Review training metrics

F1 score, precision, recall, and assumed threshold are the metrics that are generated as an output of training a model using Rekognition Custom Labels. The models are optimized for the best F1 score based on the testing dataset that is provided. The assumed threshold is also generated based on the testing dataset. You can adjust the threshold based on your business requirement in terms of precision or recall.

Because the assumed thresholds are set on the testing dataset, an appropriate test set should reflect the real-world production use case. If the test dataset isn’t representative of the use case, you may see artificially high F1 scores and poor model performance on your real-world images.

These metrics are helpful when performing an initial evaluation of the model. For a production-grade system, we recommend evaluating the model against an external dataset (500–1,000 unseen images) representative of the real world. This helps evaluate how the model would perform in a production system and also identify any missing patterns and correct them by retraining the model. If you see a mismatch between F1 scores and external evaluation, we suggest you examine whether your test data is reflecting the real-world use case.
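To make the relationship between threshold, precision, recall, and F1 concrete, here is a minimal Python sketch that could be run against an external evaluation set. It assumes you have one confidence score and one ground-truth flag per image for a single label. The helper names are hypothetical, and the threshold search shown is only an illustration, not the method Rekognition Custom Labels uses to compute its assumed threshold.

```python
def precision_recall_f1(scores, labels, threshold):
    """scores: confidences in [0, 1]; labels: True when the image really has the label."""
    pairs = list(zip(scores, labels))
    tp = sum(1 for s, y in pairs if s >= threshold and y)
    fp = sum(1 for s, y in pairs if s >= threshold and not y)
    fn = sum(1 for s, y in pairs if s < threshold and y)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def best_f1_threshold(scores, labels):
    """Try each observed score as a threshold and keep the one with the highest F1."""
    candidates = sorted(set(scores))
    return max(candidates, key=lambda t: precision_recall_f1(scores, labels, t)[2])
```

Raising the threshold generally trades recall for precision, so comparing the two at a few candidate thresholds is a quick way to pick a value that matches your business requirement.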

Conclusion

In this post, we walked you through the best practices for improving Rekognition Custom Labels models. We encourage you to learn more about Rekognition Custom Labels and try it out for your business-specific datasets.

About the authors

Amit Gupta is a Senior AI Services Solutions Architect at AWS. He is passionate about enabling customers with well-architected machine learning solutions at scale.

Yogesh Chaturvedi is a Solutions Architect at AWS with a focus in computer vision. He works with customers to address their business challenges using cloud technologies. Outside of work, he enjoys hiking, traveling, and watching sports.

Hao Yang is a Senior Applied Scientist at the Amazon Rekognition Custom Labels team. His main research interests are object detection and learning with limited annotations. Outside of work, Hao enjoys watching films, photography, and outdoor activities.

Pashmeen Mistry is the Senior Product Manager for Amazon Rekognition Custom Labels. Outside of work, Pashmeen enjoys adventurous hikes, photography, and spending time with his family.

Use ADFS OIDC as the IdP for an Amazon SageMaker Ground Truth private workforce

September 9, 2022

by Adeleke Coker Amazon AWS

To train a machine learning (ML) model, you need a large, high-quality, labeled dataset. Amazon SageMaker Ground Truth helps you build high-quality training datasets for your ML models. With Ground Truth, you can use workers from either Amazon Mechanical Turk, a vendor company of your choosing, or an internal, private workforce to enable you to create a labeled dataset. You can use the labeled dataset output from Ground Truth to train your own models. You can also use the output as a training dataset for an Amazon SageMaker model.

With Ground Truth, you can create a private workforce of employees or contractors to handle your data within your organization. This lets customers who need to keep data in-house use a private workforce for annotation workloads containing sensitive business data or personally identifiable information (PII) that can’t be handled by external parties. Alternatively, if data annotation requires domain-specific subject matter expertise, you can use a private workforce to route tasks to employees, contractors, or third-party annotators who have domain and industry knowledge of your datasets. For example, if the task is to label medical images, you could create a private workforce of people knowledgeable about the images in question.

You can configure a private workforce to authenticate using OpenID Connect (OIDC) with your Identity Provider (IdP). In this post, we demonstrate how to configure OIDC with on-premises Active Directory using Active Directory Federation Service (ADFS). Once the configuration is set up, you can configure and manage work teams, track worker performance, and set up notifications when labeling tasks are available in Ground Truth.

Solution overview

When you use existing on-premises Active Directory credentials to authenticate your private workforce, you don’t need to worry about managing multiple identities in different environments. Workers use existing Active Directory credentials to federate to your labeling portal.

Prerequisites

Make sure you have the following prerequisites:

A registered public domain
An existing or newly deployed ADFS environment
An AWS Identity and Access Management (IAM) user with permissions to run SageMaker API operations

Additionally, make sure you use Ground Truth in a supported Region.

Configure Active Directory

The Ground Truth private workforce OIDC configuration requires sending a custom claim sagemaker:groups to Ground Truth from your IdP.

Create an AD group named sagemaker (be sure to use all lower-case).
Add the users that will form your private workforce to this group.

Configure ADFS

The next step is to configure an ADFS application with the specific claims that Ground Truth requires. Ground Truth uses the Issuer, ClientId, and ClientSecret values, along with other optional claims from your IdP, to authenticate workers by obtaining an authentication code from the AuthorizationEndpoint configured in your IdP.

For more information about the claims your IdP sends to Ground Truth, refer to Send Required and Optional Claims to Ground Truth and Amazon A2I.

Create Application Group

To create your application group, complete the following steps:

Open the ADFS Management Console.
Change the ADFS Federation Service Identifier from https://${HostName}/adfs/service/trust to https://${HostName}/adfs.
Choose Application Group, right-click, and choose Add Application Group.
Enter a name (for example, SageMaker Ground Truth Workforce) and description.
Under Template, for Client-Server applications, choose Server application accessing a web API.
Choose Next.
Copy and save the client ID for future reference.
For Redirect URI, use a placeholder such as https://privateworkforce.local.
Choose Add, then choose Next.
Select Generate a shared secret and save the generated value for later use, then choose Next.
In the Configure Web API section, enter the client ID obtained earlier.
Choose Add, then choose Next.
Select Permit everyone under Access Control Policy, then choose Next.
Under Permitted scopes, select openid, then choose Next.
Review the configuration information, then choose Next and Close.

Configure claim descriptions

To configure your claim descriptions, complete the following steps:

In the ADFS Management Console, expand Service Section.
Right-click Claim Description and choose Add Claim Description.
For Display name, enter SageMaker Client ID.
For Short Name, enter sagemaker:client_id.
For Claim identifier, enter sagemaker:client_id.
Select the options to publish the claim to federation metadata for both accept and send.
Choose OK.
Repeat these steps for the remaining claim groups (Sagemaker Name, Sagemaker Sub, and Sagemaker Groups), as shown in the following screenshot.

Note that your claim identifier is listed as Claim Type.

Configure the application group claim rules

To configure your application group claim rules, complete the following steps:

Choose Application Groups, then choose the application group you just created.
Under Web API, choose the name shown, which opens the Web API properties.
Choose the Issuance Transform Rules tab and choose Add Rule.
Choose Transform an Incoming Claim and provide the following information:
- For Claim rule name, enter sagemaker:client_id.
- For Incoming claim type, choose OAuth Client Id.
- For Outgoing claim type, choose the claim SageMaker Client ID.
- Leave other values as default.
- Choose Finish.
Choose Add New Rule.
Choose Transform an Incoming Claim and provide the following information:
- For Claim rule name, enter sagemaker:sub.
- For Incoming claim type, choose Primary SID.
- For Outgoing claim type, choose the claim Sagemaker Sub.
- Leave other values as default.
- Choose Finish.
Choose Add New Rule.
Choose Transform an Incoming Claim and provide the following information:
- For Claim rule name, enter sagemaker:name.
- For Incoming claim type, choose Name.
- For Outgoing claim type, choose the claim Sagemaker Name.
- Leave other values as default.
- Choose Finish.
Choose Add New Rule.
Choose Send Group Membership as a Claim and provide the following information:
- For Claim rule name, enter sagemaker:groups.
- For User’s group, choose the sagemaker AD group created earlier.
- For Outgoing claim type, choose the claim Sagemaker Groups.
- For Outgoing claim value, enter sagemaker.
- Choose Finish.
Choose Apply and OK.

You should have four rules, as shown in the following screenshot.

Create and configure an OIDC IdP workforce using the SageMaker API

In this step, you create a workforce from the AWS Command Line Interface (AWS CLI) using an IAM user or role with appropriate permissions.

Run the following AWS CLI command to create a private workforce. The oidc-config parameter contains information you must obtain from the IdP. Provide the appropriate values that you obtained from your IdP:
1. client_id is the client ID, and client_secret is the client secret you obtained when creating your application group.
2. You can reconstruct AuthorizationEndpoint, TokenEndpoint, UserInfoEndpoint, LogoutEndpoint, and JwksUri by replacing only the sts.example.com portion with your ADFS endpoint.
```
aws sagemaker create-workforce --oidc-config "ClientId=9b123069-0afc-56f2-a7ce-bd8e4dc705gh,ClientSecret=vtMG9fz_D9W2Y6u4t390wQ4o-hr8VsdHxD294FsD,Issuer=https://sts.example.com/adfs,AuthorizationEndpoint=https://sts.example.com/adfs/oauth2/authorize/,TokenEndpoint=https://sts.example.com/adfs/oauth2/token/,UserInfoEndpoint=https://sts.example.com/adfs/userinfo,LogoutEndpoint=https://sts.example.com/adfs/oauth2/logout,JwksUri=https://sts.example.com/adfs/discovery/keys" --workforce-name privatewf --region us-east-1
```
  The preceding command should successfully return the WorkforceArn. Save this output for reference later.
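If you prefer to script this step, the following Python sketch builds the same OIDC settings from a single ADFS host name and passes them to the SageMaker CreateWorkforce API through boto3. The helper names are hypothetical, and the endpoint paths are simply those used in the CLI example above; adjust them if your ADFS deployment differs.

```python
def build_oidc_config(adfs_host, client_id, client_secret):
    """Derive the OidcConfig fields from the ADFS host, mirroring the CLI example."""
    base = f"https://{adfs_host}/adfs"
    return {
        "ClientId": client_id,
        "ClientSecret": client_secret,
        "Issuer": base,
        "AuthorizationEndpoint": f"{base}/oauth2/authorize/",
        "TokenEndpoint": f"{base}/oauth2/token/",
        "UserInfoEndpoint": f"{base}/userinfo",
        "LogoutEndpoint": f"{base}/oauth2/logout",
        "JwksUri": f"{base}/discovery/keys",
    }


def create_private_workforce(oidc_config, workforce_name="privatewf", region="us-east-1"):
    """Call CreateWorkforce and return the WorkforceArn (needs AWS credentials)."""
    import boto3  # imported lazily so build_oidc_config works without boto3

    client = boto3.client("sagemaker", region_name=region)
    response = client.create_workforce(
        OidcConfig=oidc_config, WorkforceName=workforce_name
    )
    return response["WorkforceArn"]
```

Building the dictionary in one place keeps the five endpoint URLs consistent with the issuer, which is the part most easily mistyped on the command line.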

Use the following code to describe the created workforce to get the SubDomain.
We use this to configure the redirect URI in ADFS. After Ground Truth authenticates a worker, this URI redirects the worker to the worker portal where the workers can access labeling or human review tasks.

aws sagemaker describe-workforce --workforce-name "privatewf" --region us-east-1

{
    "Workforce": {
        "WorkforceName": "privatewf",
        "WorkforceArn": "arn:aws:sagemaker:us-east-1:206400014001:workforce/privatewf",
        "LastUpdatedDate": "2022-03-20T11:45:57.916000-07:00",
        "SourceIpConfig": {
            "Cidrs": []
        },
        "SubDomain": "drxxxxxlf0.labeling.us-east-1.sagemaker.aws",
        "OidcConfig": {
            "ClientId": "9b123069-0afc-56f2-a7ce-bd8e4dc705gh",
            "Issuer": "https://sts.example.com/adfs",
            "AuthorizationEndpoint": "https://sts.example.com/adfs/oauth2/authorize/",
            "TokenEndpoint": "https://sts.example.com/adfs/oauth2/token/",
            "UserInfoEndpoint": "https://sts.example.com/adfs/userinfo",
            "LogoutEndpoint": "https://sts.example.com/adfs/oauth2/logout",
            "JwksUri": "https://sts.example.com/adfs/discovery/keys"
        },
        "CreateDate": "2022-03-20T11:45:57.916000-07:00"
    }
}

Copy the SubDomain and append /oauth2/idpresponse to the end. For example, it should look like https://drxxxxxlf0.labeling.us-east-1.sagemaker.aws/oauth2/idpresponse. You use this URL to update the redirect URI in ADFS.
Choose the application group you created earlier (for example, SageMaker Ground Truth Workforce).
Choose the name under Server application.
Select the placeholder URL used earlier and choose Remove.
Enter the appended SubDomain value.
Choose Add.
Choose OK twice.

Validate the OIDC IdP workforce authentication response

Now that you have configured OIDC with your IdP, it’s time to validate the authentication workflow using curl.

Replace the placeholder values with your information, then enter the modified URI in your browser:

https://sts.example.com/adfs/oauth2/authorize?client_id=9b123069-0afc-56f2-a7ce-bd8e4dc705gh&redirect_uri=https://drxxxxxlf0.labeling.us-east-1.sagemaker.aws/oauth2/idpresponse&scope=openid&response_type=code

You should be prompted to log in with AD credentials. You may receive a 401 Authorization Required error.
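The authorization URI can also be assembled programmatically, which avoids copy and paste mistakes in the query string. The following Python sketch uses only the standard library; the function names are hypothetical, and the host, client ID, and SubDomain values you pass in are placeholders to be replaced with your own.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def redirect_uri_for(subdomain):
    """Append the IdP response path to the workforce SubDomain."""
    return f"https://{subdomain}/oauth2/idpresponse"


def build_authorize_url(adfs_host, client_id, subdomain):
    """Build the ADFS authorization request URL for the authorization code flow."""
    query = urlencode(
        {
            "client_id": client_id,
            "redirect_uri": redirect_uri_for(subdomain),
            "scope": "openid",
            "response_type": "code",
        }
    )
    return f"https://{adfs_host}/adfs/oauth2/authorize?{query}"


def extract_code(redirected_url):
    """Return the code query parameter from the URL the browser lands on, if any."""
    values = parse_qs(urlsplit(redirected_url).query).get("code", [])
    return values[0] if values else None
```

Note that urlencode percent-encodes the redirect URI; this is equivalent to the unencoded form shown above.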

Copy the code parameter from the browser query string and use it in the following curl command. The portion you need to copy starts with code=. Replace the code in the command with the code you copied. Also, don’t forget to change the values of url, client_id, client_secret, and redirect_uri:

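url is the token endpoint from ADFS.
client_id is the client ID from the application group in ADFS.
client_secret is the client secret from ADFS.
redirect_uri is the redirect URI you configured in ADFS (the SubDomain with /oauth2/idpresponse appended).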

curl -k --request POST \
  --url 'https://sts.example.com/adfs/oauth2/token/' \
  --header 'content-type: application/x-www-form-urlencoded' \
  --data 'grant_type=authorization_code' \
  --data 'client_id=9b123069-0afc-56f2-a7ce-bd8e4dc705gh' \
  --data 'client_secret=vtMG9fz_D9W2Y6u4t390wQ4o-hr8VsdHxD294FsD' \
  --data 'code=ZE-yvYF7GUmaFmAGAUdlcg.3Oy-_lPP2QgBAJxAW8uvXYgXojg.GXiaFggY5IdmrumD00cPkdjpABXTAG25YdXJxBr64HPwyl1WJDlcr1pqvURR1ZkBsBA1DxrloTQM4IGH1LcNVIzGcoynNm151leWXnIIP11JjOdl4Jt7tGyxyymll0c0IqfQcOk0w-oU9q2k-nx3jmAK4Pmw3D0Ghhm4jL6_15gBwvY4-mY6DVDg2sGQMELj-dNzfvMuMiLJQhX5XyUJcHjW69KX9xxnHfa3MCZbp2oF_41HBtMazPqKKC04TQPvTiAeMzUZ0-Z3IQhA9_mfv28JPdpGlPOxr8QM9vu9ANCbURimjPkmHA2Gm3df9QUbsIxEtQ-OuAPWlcg5MNbqGQ' \
  --data 'redirect_uri=https://drxxxxxlf0.labeling.us-east-1.sagemaker.aws/oauth2/idpresponse'

After making the appropriate modifications, copy the entire command and run it from a terminal.
The output of the command generates an access token in JWT format.

Copy the access token into the Encoded box of a JWT decoder and decode it.
The decoded message should contain the required claims you configured. If the claims are present, proceed to the next step; if not, ensure you have followed all the steps outlined so far.
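If you would rather not paste a token into a web page, the payload can be inspected locally. The following Python sketch decodes the payload segment of a JWT and reports which of the four claims configured earlier in this post are absent. It does not verify the signature, so treat it as a debugging aid only. The function names are hypothetical, and accepting both the colon form (sagemaker:groups) and the hyphenated form (sagemaker-groups) is an assumption based on the sample userinfo output shown later in this post.

```python
import base64
import json

# Claim suffixes configured in the ADFS steps of this post
EXPECTED_CLAIMS = ("client_id", "sub", "name", "groups")


def decode_jwt_payload(token):
    """Decode the payload of a JWT WITHOUT verifying its signature."""
    parts = token.split(".")
    if len(parts) != 3:
        raise ValueError("expected a token of the form header.payload.signature")
    segment = parts[1]
    segment += "=" * (-len(segment) % 4)  # restore stripped base64 padding
    return json.loads(base64.urlsafe_b64decode(segment))


def missing_sagemaker_claims(payload):
    """List expected claims that appear in neither colon nor hyphen form."""
    missing = []
    for suffix in EXPECTED_CLAIMS:
        if f"sagemaker:{suffix}" not in payload and f"sagemaker-{suffix}" not in payload:
            missing.append(f"sagemaker:{suffix}")
    return missing
```

An empty list from missing_sagemaker_claims suggests the claim rules are being applied; any names it returns point to the rule to recheck in ADFS.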

Using the output from the preceding step, run the following command from a terminal after making the necessary modifications. Replace the value after Bearer with the access_token from the preceding command’s output, and replace the userinfo URL with your own.

curl -X POST -H 'Authorization: Bearer eyJ0eXAiOiJKV1QiLCJhbGciOiJSUzI1NiIsIng1dCI6IkNLX2k1SEtOS1B2QVdGWnhCRkZ2T2NuVUhNQSIsImtpZCI6IkNLX2k1SEtOS1B2QVdGWnhCRkZ2T2NuVUhNQSJ9.eyJhdWQiOiJ1cm46bWljcm9zb2Z0OnVzZXJpbmZvIiwiaXNzIjoiaHR0cHM6Ly9mcy5hZC5nbWluZHByby5jb20vYWRmcyIsImlhdCI6MTY0OTE5NzYzMCwibmJmIjoxNjQ5MTk3NjMwLCJleHAiOjE2NDkyMDEyMzAsImFwcHR5cGUiOiJDb25maWRlbnRpYWwiLCJhcHBpZCI6IjBlZDQ0MDYzLTMzZDUtNGYxZi1hZTg4LTQ0OTgzZDRlN2E3MiIsImF1dGhtZXRob2QiOiJ1cm46b2FzaXM6bmFtZXM6dGM6U0FNTDoyLjA6YWM6Y2xhc3NlczpQYXNzd29yZFByb3RlY3RlZFRyYW5zcG9ydCIsImF1dGhfdGltZSI6IjIwMjItMDQtMDVUMjI6Mjc6MDkuOTcxWiIsInZlciI6IjEuMCIsInNjcCI6Im9wZW5pZCIsInN1YiI6ImV2MTdTQkRUWnFXd2NXR0R1Z2s1OHRXQm4wYkRKbDBvYnAzbU9sL1hVUlk9In0.hsED4iUlQPgiiLaCyrKTKg3aKQjsKsLKPusPncRz3rNCSTp5xh8APDo33hhBx5JK-Ie2FG9Pa78dHdY_U2UtGBl2IHKmIfPcBTdkLGc1a8PlSQLvManCcEwzxAaO5J_jGdbt_P3qvy3cA6YCgNUwV3Ex9VTySLK1r-gLvnWE4zEiz_QytdlXvwFDIZi94YTgGf8b5uOQieM9pgJ0D9d-HOUw7-sKMBbZLqeYh_heNekwV3p3FQAIQyqifzl5qaftMR_J6lpOINHPtSPbl80MwHpmoDPHa0emWg6wuSZa7gpDbqDGHmuwQfbVhBdNLY8v9Nm4MA5RbSWQmqZmwG0GkA' -d '' -k -v https://sts.example.com/adfs/userinfo

The output from this command may look similar to the following code:

{
    "sub":"122",
    "exp":"10000",
    "sagemaker-groups":["privateworkforce"],
    "sagemaker-name":"name",
    "sagemaker-sub":"122",
    "sagemaker-client_id":"123456"
}

Now that you have successfully validated your OIDC configuration, it’s time to create the work teams.

Create a private work team

To create a private work team, complete the following steps:

On the Ground Truth console, choose Labeling workforces.
Select Private.
In the Private teams section, select Create private team.
In the Team details section, enter a team name.
In the Add workers section, enter the name of a single user group.
All workers associated with this group in your IdP are added to this work team.

To add more than one user group, choose Add new user group and enter the names of the user groups you want to add to this work team. Enter one user group per line.
Optionally, for Ground Truth labeling jobs, if you provide an email for workers in your JWT, Ground Truth notifies workers when a new labeling task is available if you select an Amazon Simple Notification Service (Amazon SNS) topic.
Choose Create private team.

Test access to the private labeling portal

To test your access, browse to https://console.aws.amazon.com/sagemaker/groundtruth#/labeling-workforces and open the labeling portal sign-in URL in a new browser window or incognito mode.

Cost

You will be charged for the number of jobs labeled by your internal employees. For more information, refer to Amazon SageMaker Data Labeling Pricing.

Clean up

You can delete the private workforce using the SageMaker API, DeleteWorkforce. If you have work teams associated with the private workforce, you must delete them before deleting the workforce. For more information, see Delete a work team.
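The ordering requirement above (work teams first, then the workforce) can be sketched in Python as follows. The function name is hypothetical; the client argument is expected to be a boto3 SageMaker client, and the sketch assumes that every work team returned by ListWorkteams in the Region should be removed.

```python
def delete_workforce_and_teams(client, workforce_name):
    """Delete all work teams, then the workforce; returns the deleted team names.

    client is a boto3 SageMaker client, for example boto3.client("sagemaker").
    """
    team_names = []
    kwargs = {}
    while True:
        # Collect every page of work teams before deleting anything
        page = client.list_workteams(**kwargs)
        team_names.extend(team["WorkteamName"] for team in page.get("Workteams", []))
        next_token = page.get("NextToken")
        if not next_token:
            break
        kwargs = {"NextToken": next_token}

    for name in team_names:
        client.delete_workteam(WorkteamName=name)

    client.delete_workforce(WorkforceName=workforce_name)
    return team_names
```

Passing the client in as an argument makes it easy to dry-run the function against a stub before pointing it at a real account.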

Summary

In this post, we demonstrated how to configure an OIDC application with Active Directory Federation Services and use your existing Active Directory credentials to authenticate to a Ground Truth labeling portal.

We’d love to hear from you. Let us know what you think in the comments section.

About the authors

Adeleke Coker is a Global Solutions Architect with AWS. He works with customers globally to provide guidance and technical assistance in deploying production workloads at scale on AWS. In his spare time, he enjoys learning, reading, gaming, and watching sporting events.

Aishwarya Kaushal is a Senior Software Engineer at Amazon. Her focus is on solving challenging problems using machine learning, building scalable AI solutions using distributed systems and helping customers to adopt the new features/products. In her spare time, Aishwarya enjoys watching sci-fi movies, listening to music and dancing.

How Amp on Amazon used data to increase customer engagement, Part 2: Building a personalized show recommendation platform using Amazon SageMaker

September 9, 2022

by Tulip Gupta Amazon AWS

Amp is a new live radio app from Amazon. With Amp, you can host your own radio show and play songs from the Amazon Music catalog, or tune in and listen to shows other Amp users are hosting. In an environment where content is plentiful and diverse, it’s important to tailor the user experience to each user’s individual taste, so they can easily find shows they like and discover new content they would enjoy.

Amp uses machine learning (ML) to provide personalized recommendations for live and upcoming Amp shows on the app’s home page. The recommendations are computed using a Random Forest model using features representing the popularity of a show (such as listen and like count), popularity of a creator (such as total number of times the recent shows were played), and personal affinities of a user to a show’s topic and creator. Affinities are computed either implicitly from the user’s behavioral data or explicitly from topics of interest (such as pop music, baseball, or politics) as provided in their user profiles.

This is Part 2 of a series on using data analytics and ML for Amp and creating a personalized show recommendation list platform. The platform has shown a 3% boost to customer engagement metrics tracked (liking a show, following a creator, enabling upcoming show notifications) since its launch in May 2022.

Refer to Part 1 to learn how behavioral data was collected and processed using the data and analytics systems.

Solution overview

The ML-based show recommender for Amp has five main components, as illustrated in the following architecture diagram:

The Amp mobile app.
Back-end services that gather the behavioral data, such as likes and follows, as well as broadcast show-related information such as status updates when shows go live.
Real-time ingestion of behavioral and show data, and real-time (online) feature computing and storage.
Batch (offline) feature computing and storage.
A Recommender System that handles incoming requests from the app backend for getting a list of shows. This includes real-time inference to rank shows based on personalized and non-personalized features.

This post focuses on parts 3, 4, and 5 in an effort to detail the following:

Real-time data ingestion and transformations using Amazon Kinesis Data Firehose and AWS Lambda.
Batch data processing via Amazon SageMaker Processing
Amazon MemoryDB for Redis and Amazon SageMaker Feature Store for storing and serving real-time and batch computed features, respectively
Real-time ranking via SageMaker inference

The following diagram shows the high-level architecture and its components.

In the following sections, we provide more details regarding real-time feature computing, batch feature computing, real-time inference, operational health, and the outcomes we observed.

Real-time feature computing

Some features, such as like and listen count for a show, need to be streamed in continuously and used as is, whereas others, such as the number of listening sessions longer than 5 minutes, also need to be transformed in real time as raw session data is streamed. Features whose values need to be computed at inference time are referred to as point-in-time (PIT) features. Data for PIT features needs to be updated quickly, and the latest version should be written and read with low latency (under 20 milliseconds per user for 1,000 shows). The data also needs to be in durable storage because missing or partial data may cause deteriorated recommendations and poor customer experience. In addition to the read/write latency, PIT features also require low reflection time. Reflection time is the time it takes for a feature to be available to read after the contributing events were emitted, for example, the time between a listener liking a show and the PIT LikeCount feature being updated.

Sources of the data are the backend services directly serving the app. Some of the data is transformed into metrics that are then broadcast via Amazon Simple Notification Service (Amazon SNS) to downstream listeners such as the ML feature transformation pipeline. An in-memory database such as MemoryDB is an ideal service for durable storage and ultra-fast performance at high volumes. The compute component that transforms and writes features to MemoryDB is Lambda. App traffic follows daily and weekly patterns of peaks and dips depending on the time and day. Lambda allows for automatic scaling to the incoming volume of events. The independent nature of each individual metric transformation also makes Lambda, which is a stateless service on its own, a good fit for this problem. Putting Amazon Simple Queue Service (Amazon SQS) between Amazon SNS and Lambda not only prevents message loss but also acts as a buffer for unexpected bursts of traffic that preconfigured Lambda concurrency limits may not be sufficient to serve.
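The SNS to SQS to Lambda path described above might look roughly like the following Python handler. This is a sketch under several assumptions: the metric payload fields (type, show_id) and the key format are invented for illustration, raw message delivery is assumed to be off so that each SQS body carries the SNS envelope, and a plain dictionary stands in for MemoryDB.

```python
import json

# Stand-in for MemoryDB; a real handler would use a Redis client instead.
FEATURE_STORE = {}


def sqs_handler(event, context=None):
    """Process an SQS batch whose message bodies are SNS notification envelopes."""
    records = event.get("Records", [])
    for record in records:
        envelope = json.loads(record["body"])      # SNS envelope delivered through SQS
        metric = json.loads(envelope["Message"])   # hypothetical metric payload
        if metric.get("type") == "like":
            key = f"show:{metric['show_id']}:LikeCount"
            # Each event is handled independently, which suits a stateless function
            FEATURE_STORE[key] = FEATURE_STORE.get(key, 0) + 1
    return {"processed": len(records)}
```

Because the handler keeps no state between records other than what it writes to the store, concurrent invocations can scale out with traffic, which is the property the paragraph above relies on.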

Batch feature computing

Features that use historical behavioral data to represent a user’s ever-evolving taste are more complex to compute and can’t be computed in real time. These features are computed by a batch process that runs every so often, for example once daily. Data for batch features should support fast querying for filtering and aggregation of data, and may span long periods of time, so will be larger in volume. Because batch features are also retrieved and sent as inputs for real-time inference, they should still be read with low latency.

Collecting raw data for batch feature computing doesn’t have the sub-minute reflection time requirement PIT features have, which makes it feasible to buffer the events longer and transform metrics in batch. This solution utilized Kinesis Data Firehose, a managed service to quickly ingest streaming data into several destinations, including Amazon Simple Storage Service (Amazon S3) for persisting metrics to the S3 data lake to be utilized in offline computations. Kinesis Data Firehose provides an event buffer and Lambda integration to easily collect, batch transform, and persist these metrics to Amazon S3 to be utilized later by the batch feature computing. Batch feature computations don’t have the same low latency read/write requirements as PIT features, which makes Amazon S3 the better choice because it provides low-cost, durable storage for storing these large volumes of business metrics.
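For the Kinesis Data Firehose and Lambda integration mentioned above, a transformation function receives a batch of base64-encoded records and must return each recordId with a result of Ok, Dropped, or ProcessingFailed. The following Python sketch follows that contract; the user_id field and the rule for dropping records are assumptions made for this example and do not describe Amp's actual transformation.

```python
import base64
import json


def firehose_handler(event, context=None):
    """Transform a Firehose batch into newline-delimited JSON for Amazon S3."""
    output = []
    for record in event["records"]:
        record_id = record["recordId"]
        try:
            metric = json.loads(base64.b64decode(record["data"]))
        except ValueError:
            # Undecodable input is flagged so Firehose can route it as failed
            output.append({"recordId": record_id, "result": "ProcessingFailed", "data": record["data"]})
            continue
        if not isinstance(metric, dict) or "user_id" not in metric:
            # Hypothetical rule: metrics without a user are not needed offline
            output.append({"recordId": record_id, "result": "Dropped", "data": record["data"]})
            continue
        line = json.dumps(metric, sort_keys=True) + "\n"
        encoded = base64.b64encode(line.encode("utf-8")).decode("ascii")
        output.append({"recordId": record_id, "result": "Ok", "data": encoded})
    return {"records": output}
```

Appending a newline to each record keeps the objects written to Amazon S3 line-delimited, which makes them easier to read from Spark in the batch step.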

Our initial ML model uses 21 batch features computed daily using data captured in the past 2 months. This data includes both playback and app engagement history per user, and grows with the number of users and frequency of app usage. Feature engineering at this scale requires an automated process to pull the required input data, process it in parallel, and export the outcome to persistent storage. The processing infrastructure is needed only for the duration of the computations. SageMaker Processing provides prebuilt Docker images that include Apache Spark and other dependencies needed to run distributed data processing jobs at large scale. The underlying infrastructure for a Processing job is fully managed by SageMaker. Cluster resources are provisioned for the duration of your job, and cleaned up when a job is complete.

Each step in the batch process—data gathering, feature engineering, feature persistence—is part of a workflow that requires error handling, retries, and state transitions in between. With AWS Step Functions, you can create a state machine and split your workflow into several steps of preprocessing and postprocessing, as well as a step to persist the features into SageMaker Feature Store or the other data to Amazon S3. A state machine in Step Functions can be triggered via Amazon EventBridge to automate batch computing to run at a set schedule, such as once every day at 10:00 PM UTC.
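As an illustration of such a workflow, the following Python sketch generates an Amazon States Language definition that chains task states in order and attaches a retry policy to each one. The state names, task ARNs, and retry values are hypothetical. The schedule string is the EventBridge cron expression for 10:00 PM UTC every day.

```python
import json

# Hypothetical retry policy applied to every step
RETRY_POLICY = {
    "ErrorEquals": ["States.ALL"],
    "IntervalSeconds": 30,
    "MaxAttempts": 2,
    "BackoffRate": 2.0,
}

# EventBridge schedule expression: minute 0, hour 22 (UTC), every day
DAILY_10PM_UTC = "cron(0 22 * * ? *)"


def build_state_machine(steps):
    """steps is an ordered list of (state_name, task_arn) pairs."""
    states = {}
    for index, (name, arn) in enumerate(steps):
        state = {"Type": "Task", "Resource": arn, "Retry": [dict(RETRY_POLICY)]}
        if index + 1 < len(steps):
            state["Next"] = steps[index + 1][0]  # transition to the following step
        else:
            state["End"] = True                  # last step terminates the workflow
        states[name] = state
    return {"StartAt": steps[0][0], "States": states}


if __name__ == "__main__":
    definition = build_state_machine(
        [
            ("GatherData", "arn:aws:lambda:us-east-1:123456789012:function:gather"),
            ("EngineerFeatures", "arn:aws:lambda:us-east-1:123456789012:function:engineer"),
            ("PersistFeatures", "arn:aws:lambda:us-east-1:123456789012:function:persist"),
        ]
    )
    print(json.dumps(definition, indent=2))
```

The resulting JSON could be supplied as the definition when creating a state machine, with an EventBridge rule using the schedule expression as its trigger.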

After the features are computed, they need to be versioned and stored to be read during inference as well as model retraining. Rather than build your own feature storage and management service, you can use SageMaker Feature Store. Feature Store is a fully managed, purpose-built repository to store, share, and manage features for ML models. It stores the history of ML features in the offline store (Amazon S3) and also provides APIs to an online store to allow low-latency reads of the most recent features. The offline store can serve the historical data for further model training and experimentation, and the online store can be called by your customer-facing APIs to get features for real-time inference. As we evolve our services to provide more personalized content, we anticipate training additional ML models and, with the help of Feature Store, searching, discovering, and reusing features among these models.
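To give a sense of the online store APIs, here is a minimal Python sketch that writes one record and reads it back through the Feature Store runtime client in boto3. The feature group name, record identifier, and feature names are placeholders, and the helper functions are hypothetical conveniences for converting between a plain dictionary and the record format the API expects.

```python
def to_feature_record(features):
    """Convert {name: value} into the list-of-dicts format used by PutRecord."""
    return [
        {"FeatureName": name, "ValueAsString": str(value)}
        for name, value in features.items()
    ]


def from_feature_record(record):
    """Convert a GetRecord result back into {name: value-as-string}."""
    return {item["FeatureName"]: item["ValueAsString"] for item in record}


def write_and_read(feature_group_name, record_id, features, region="us-east-1"):
    """Write features to the online store and read them back (needs AWS credentials)."""
    import boto3  # imported lazily so the converters work without boto3

    runtime = boto3.client("sagemaker-featurestore-runtime", region_name=region)
    runtime.put_record(
        FeatureGroupName=feature_group_name,
        Record=to_feature_record(features),
    )
    response = runtime.get_record(
        FeatureGroupName=feature_group_name,
        RecordIdentifierValueAsString=record_id,
    )
    return from_feature_record(response.get("Record", []))
```

All values travel as strings in this API, so the caller is responsible for converting them back to numeric types before passing them to a model.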

Real-time inference

Real-time inference usually requires hosting ML models behind endpoints. You could do this using web servers or containers, but this requires ML engineering effort and infrastructure to manage and maintain. SageMaker makes it easy to deploy ML models to real-time endpoints. SageMaker lets you train and upload ML models and host them by creating and configuring SageMaker endpoints. Real-time inference satisfies the low-latency requirements for ranking shows as they are browsed on the Amp home page.

In addition to managed hosting, SageMaker provides managed endpoint scaling. SageMaker inference allows you to define an auto scaling policy with minimum and maximum instance counts and a target utilization to trigger the scaling. This way, you can easily scale in or out as demand changes.
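A target tracking policy of this kind is configured through Application Auto Scaling. The following Python sketch builds the two requests involved, one to register the endpoint variant as a scalable target and one to attach the policy. The endpoint name, variant name, instance counts, and target value are placeholders rather than the settings Amp uses, and the helper names are hypothetical.

```python
def scaling_requests(endpoint_name, variant_name, min_instances, max_instances, target_invocations):
    """Build request parameters for RegisterScalableTarget and PutScalingPolicy."""
    common = {
        "ServiceNamespace": "sagemaker",
        "ResourceId": f"endpoint/{endpoint_name}/variant/{variant_name}",
        "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
    }
    target = dict(common, MinCapacity=min_instances, MaxCapacity=max_instances)
    policy = dict(
        common,
        PolicyName=f"{endpoint_name}-target-tracking",
        PolicyType="TargetTrackingScaling",
        TargetTrackingScalingPolicyConfiguration={
            # Average invocations per instance per minute to aim for
            "TargetValue": float(target_invocations),
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
            },
        },
    )
    return target, policy


def apply_scaling(target, policy, region="us-east-1"):
    """Send both requests to Application Auto Scaling (needs AWS credentials)."""
    import boto3  # imported lazily so scaling_requests works without boto3

    client = boto3.client("application-autoscaling", region_name=region)
    client.register_scalable_target(**target)
    client.put_scaling_policy(**policy)
```

With target tracking, the service adds or removes instances to keep the chosen metric near the target value, bounded by the minimum and maximum capacity.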

Operational health

The number of events this system handles for real-time feature computing varies with the natural pattern of app usage (higher or lower traffic based on time of day or day of the week). Similarly, the number of requests it receives for real-time inference scales with the number of concurrent app users. These services also get unexpected peaks in traffic due to self-promotion on social media by popular creators. Although it’s important to ensure the system can scale up and down to serve the incoming traffic successfully and frugally, it’s also important to monitor operational metrics and alert on any unexpected operational issues to prevent loss of data and service to customers. Monitoring the health of these services is straightforward using Amazon CloudWatch. Vital service health metrics such as faults and latency of operations as well as utilization metrics such as memory, disk, and CPU usage are available out of the box using CloudWatch. Our development team uses metrics dashboards and automated monitoring to ensure we can serve our clients with high availability (99.8%) and low latency (less than 200 milliseconds end-to-end to get recommended shows per user).

Measuring the outcome

Prior to the ML-based show recommender described in this post, a simpler heuristic algorithm ranked Amp shows based on a user’s personal topics of interest that are self-reported on their profile. We set up an A/B test to measure the impact of switching to ML-based recommenders with a user’s data from their past app interactions. We identified improvements in metrics such as listening duration and number of engagement actions (liking a show, following a show creator, turning on notifications) as indicators of success. A/B testing with 50% of users receiving show recommendations ranked for them via the ML-based recommender has shown a 3% boost to customer engagement metrics and a 0.5% improvement to playback duration.

Conclusion

With purpose-built services, the Amp team was able to release the personalized show recommendation API as described in this post to production in under 3 months. The system also scales well for the unpredictable loads created by well-known show hosts or marketing campaigns that could generate an influx of users. The solution uses managed services for processing, training, and hosting, which helps reduce the time spent on day-to-day maintenance of the system. We’re also able to monitor all these managed services via CloudWatch to ensure the continued health of the systems in production.

A/B testing the first version of the Amp’s ML-based recommender against a rule-based approach (which sorts shows by customer’s topics of interest only) has shown that the ML-based recommender exposes customers to higher-quality content from more diverse topics, which results in a higher number of follows and enabled notifications. The Amp team is continuously working towards improving the models to provide highly relevant recommendations.

For more information about Feature Store, visit Amazon SageMaker Feature Store and check out other customer use cases in the AWS Machine Learning Blog.

About the authors

Tulip Gupta is a Solutions Architect at Amazon Web Services. She works with Amazon to design, build, and deploy technology solutions on AWS. She assists customers in adopting best practices while deploying solutions on AWS, and is an Analytics and ML enthusiast. In her spare time, she enjoys swimming, hiking, and playing board games.

David Kuo is a Solutions Architect at Amazon Web Services. He works with AWS customers to design, build and deploy technology solutions on AWS. He works with Media and Entertainment customers and has interests in machine learning technologies. In his spare time, he wonders what he should do with his spare time.

Manolya McCormick is a Sr Software Development Engineer for Amp on Amazon. She designs and builds distributed systems using AWS to serve customer-facing applications. She enjoys reading and cooking new recipes in her spare time.

Jeff Christophersen is a Sr. Data Engineer for Amp on Amazon. He works to design, build, and deploy Big Data solutions on AWS that drive actionable insights. He assists internal teams in adopting scalable and automated solutions, and is an Analytics and Big Data enthusiast. In his spare time, when he is not on a pair of skis, you can find him on his mountain bike.

How Amp on Amazon used data to increase customer engagement, Part 1: Building a data analytics platform

September 9, 2022

by Tulip Gupta Amazon AWS

Amp, the new live radio app from Amazon, is a reinvention of radio featuring human-curated live audio shows. It’s designed to provide a seamless customer experience to listeners and creators by debuting interactive live audio shows from your favorite artists, radio DJs, podcasters, and friends.

However, as a new product in a new space for Amazon, Amp needed more relevant data to inform their decision-making process. Amp wanted a scalable data and analytics platform to enable easy access to data and perform machine learning (ML) experiments for live audio transcription, content moderation, feature engineering, and a personal show recommendation service, and to inspect or measure business KPIs and metrics.

This post is the first in a two-part series. Part 1 shows how data was collected and processed using the data and analytics platform, and Part 2 shows how the data was used to create show recommendations using Amazon SageMaker, a fully managed ML service. The personalized show recommendation list service has shown a 3% boost to customer engagement metrics tracked (such as liking a show, following a creator, or enabling upcoming show notifications) since its launch in May 2022.

Solution overview

The data sources for Amp can be broadly categorized as either streaming (near-real time) or batch (point in time). The source data is emitted from Amp-owned systems or other Amazon systems. The two different data types are as follows:

Streaming data – This type of data mainly consists of follows, notifications (regarding users’ friends, favorite creators, or shows), activity updates, live show interactions (call-ins, co-hosts, polls, in-app chat), real-time updates on live show activities (live listen count, likes), live audio playback metrics, and other clickstream metrics from the Amp application. Amp stakeholders require this data to power ML processes or predictive models, content moderation tools, and product and program dashboards (for example, trending shows). Streaming data enables Amp customers to conduct and measure experimentation.
Batch data – This data mainly consists of catalog data, show or creator metadata, and user profile data. Batch data enables more point-in-time reporting and analytics vs. real-time.

The following diagram illustrates the high-level architecture.

The Amp data and analytics platform can be broken down into three high-level systems:

Streaming data ingestion, stream processing and transformation, and stream storage
Batch data ingestion, batch processing and transformation, and batch storage
Business intelligence (BI) and analytics

In the following sections, we discuss each component in more detail.

Streaming data ingestion, processing, transformation, and storage

Amp created a serverless streaming ingestion pipeline capable of tapping into data from sources without the need for infrastructure management, as shown in the following diagram.

The pipeline was able to ingest the Amp show catalog data (what shows are available on Amp) and pass it to the data lake for two different use cases: one for near-real-time analytics, and one for batch analytics.

As part of the ingestion pipeline, the Amp team has an Amazon Simple Queue Service (Amazon SQS) queue that receives messages from an upstream Amazon Simple Notification Service (Amazon SNS) topic that contains information on changes to shows in the catalog. These changes could be the addition of new shows or adjustments to existing ones that have been scheduled.

When the message is received by the SQS queue, it triggers the AWS Lambda function to make an API call to the Amp catalog service. The Lambda function retrieves the desired show metadata, filters the metadata, and then sends the output metadata to Amazon Kinesis Data Streams. Amazon Kinesis Data Firehose receives the records from the data stream. Kinesis Data Firehose then invokes a secondary Lambda function to perform a data transformation that flattens the JSON records received and writes the transformed records to an Amazon Simple Storage Service (Amazon S3) data lake for consumption by Amp stakeholders.
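The transformation step can be sketched as follows. This is a minimal illustration rather than Amp's actual function: it assumes each incoming record is a JSON object, and the `flatten` helper, the underscore key separator, and the newline-delimited output are all choices made for this example. The event and response shapes (`records`, `recordId`, `result`, base64-encoded `data`) follow the record contract that Kinesis Data Firehose uses for Lambda data transformation.

```python
import base64
import json


def flatten(obj, parent_key="", sep="_"):
    """Flatten nested dicts into one level, joining keys with `sep` (hypothetical helper)."""
    flat = {}
    for key, value in obj.items():
        full_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            flat.update(flatten(value, full_key, sep))
        else:
            flat[full_key] = value
    return flat


def lambda_handler(event, context):
    """Firehose transformation handler: decode, flatten, and re-encode each record."""
    output = []
    for record in event["records"]:
        try:
            payload = json.loads(base64.b64decode(record["data"]))
            # Assumption: one JSON object per line in the delivered S3 objects.
            line = json.dumps(flatten(payload)) + "\n"
            output.append({
                "recordId": record["recordId"],
                "result": "Ok",
                "data": base64.b64encode(line.encode("utf-8")).decode("utf-8"),
            })
        except (ValueError, AttributeError):
            # Malformed or non-object payloads are returned unchanged and marked failed.
            output.append({
                "recordId": record["recordId"],
                "result": "ProcessingFailed",
                "data": record["data"],
            })
    return {"records": output}
```

With this sketch, a record such as `{"show": {"id": "s1"}}` would be written as `{"show_id": "s1"}`, and a record that fails to parse is passed back with `ProcessingFailed` rather than dropped silently.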

Kinesis Data Firehose enabled buffering and writing data to Amazon S3 every 60 seconds. This helped Amp teams make near-real-time programming decisions that impacted external customers.

The streaming ingestion pipeline supported four objectives (performance, availability, scalability, and the flexibility to send data to multiple downstream applications or services) through the following services:

Kinesis Data Streams handles streaming data ingestion when necessary. Kinesis Data Streams supported these objectives by enabling the Amp team to quickly ingest data for analytics with minimal operational load. As a fully managed service, it reduced operational overhead, and Amp was able to scale with the product needs.
Lambda enabled the team to create lightweight functions to run API calls and perform data transformations.
Because Kinesis Data Firehose is a managed service, it was able to handle all the scaling, sharding, and monitoring needs of the streaming data without any additional overhead for the team.

Batch data ingestion, processing, transformation, and storage

Amp created a transient batch (point in time) ingestion pipeline capable of data ingestion, processing and transformation, and storage, as shown in the following diagram.

A transient extract, transform, and load (ETL) and extract, load, and transform (ELT) job approach was implemented because of the batch nature of these workloads and unknown data volumes. As a part of the workflow automation, Amazon SQS was used to trigger a Lambda function. The Lambda function then activated the AWS Glue crawler to infer the schema and data types. The crawler wrote the schema metadata to the AWS Glue Data Catalog, providing a unified metadata store for data sharing.
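The SQS-triggered Lambda function that activates the crawler might look like the following sketch. It is not Amp's code: the `CRAWLER_NAME` environment variable and the `start_crawler_if_idle` helper are hypothetical names introduced here, and the check for the `READY` state is one possible way to avoid starting a crawler that is already running. The `get_crawler` and `start_crawler` calls are part of the boto3 AWS Glue client.

```python
import os


def start_crawler_if_idle(glue_client, crawler_name):
    """Start the crawler only when it is idle; return True if it was started."""
    state = glue_client.get_crawler(Name=crawler_name)["Crawler"]["State"]
    if state != "READY":
        # The crawler is RUNNING or STOPPING; a later message can retry.
        return False
    glue_client.start_crawler(Name=crawler_name)
    return True


def lambda_handler(event, context):
    """Entry point for the SQS-triggered function."""
    import boto3  # imported here so the helper above can be exercised without AWS access

    crawler_name = os.environ["CRAWLER_NAME"]  # hypothetical configuration variable
    started = start_crawler_if_idle(boto3.client("glue"), crawler_name)
    return {
        "crawler": crawler_name,
        "started": started,
        "messages": len(event.get("Records", [])),
    }
```

Passing the client in as an argument keeps the decision logic separate from AWS access, so it can be checked locally with a stand-in client.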

The ETL and ELT jobs were required to run on either a set schedule or an event-driven workflow. To handle these needs, Amp used Amazon Managed Workflows for Apache Airflow (Amazon MWAA). Apache Airflow is an open-source Python-based workflow management platform. Amazon MWAA is a fully managed service that automatically handles scaling. It provides sequencing, error handling, retry logic, and state. With Amazon MWAA, Amp was able to take advantage of the benefits of Airflow for job orchestration while not having to manage or maintain dedicated Airflow servers. Additionally, by using Amazon MWAA, Amp was able to create a code repository and workflow pipeline stored in Amazon S3 that Amazon MWAA could access. The pipeline allowed Amp data engineers to easily deploy Airflow DAGs or PySpark scripts across multiple environments.

Amp used Amazon EMR on Amazon Elastic Kubernetes Service (Amazon EKS) to configure and manage containers for their data processing and transformation jobs. Due to the unique nature of the Amp service, the initial expected data volumes that would be processed were relatively unknown. To provide flexibility as the service evolved, the team decided to go with Amazon EMR on EKS to eliminate any unnecessary operational overhead required to bootstrap and scale Amazon EMR for data processing. This approach allowed them to run transient hybrid EMR clusters backed by a mix of AWS Fargate and Amazon Elastic Compute Cloud (Amazon EC2) nodes, where all system tasks and workloads were offloaded to Fargate, while Amazon EC2 handled all the Apache Spark processing and transformation. This provided the flexibility to have a cluster with one node running, while the Amazon EKS auto scaler dynamically instantiated and bootstrapped any additional EC2 nodes that were required for the job. When the job was complete, the cluster auto scaler automatically deleted those additional nodes. This pattern eliminated the need for the team to manage any of the cluster bootstrap actions or scaling required to respond to evolving workloads.

Amazon S3 was used as the central data lake, and data was stored in Apache Parquet (Parquet) format. Parquet is a columnar format, which speeds up data retrieval and provides efficient data compression. Amazon S3 provided the flexibility, scalability, and security needs for Amp. With Amazon S3, the Amp team was able to centralize data storage in one location and federate access to the data virtually across any service or tool within or outside of AWS. The data lake was split into two S3 buckets: one for raw data ingestion and one for transformed data output. Amazon EMR performed the transformation from raw data to transformed data. With Amazon S3 as the central data lake, Amp was able to securely expose and share the data with other teams across Amp and Amazon.

To simplify data definition, table access provisioning, and the addition and removal of tables, they used AWS Glue crawlers and the AWS Glue Data Catalog. Because Amp is a new service and constantly evolving, the team needed a way to easily define, access, and manage the tables in the data lake. The crawlers handled data definition (including schema changes) and the addition and removal of tables, while the Data Catalog served as a unified metadata store.

Business intelligence and analytics

The following diagram illustrates the architecture for the BI and analytics component.

Amp chose to store the data in the S3 data lake, and not in the data warehouse. This enabled them to access it in a unified manner through the AWS Glue Data Catalog and provided greater flexibility for data consumers. This resulted in faster data access across a variety of services or tools. With data being stored in Amazon S3, it also reduced data warehouse infrastructure costs, because the costs are a function of the compute type and the amount of data stored.

The Amazon Redshift RA3 node type was used as the compute layer to enable stakeholders to query data stored in Amazon S3. Amazon Redshift RA3 nodes decouple storage and compute, and are designed for an access pattern through the AWS Glue Data Catalog. RA3 nodes introduce Amazon Redshift Managed Storage, which is Amazon S3 backed. The combination of these features enabled Amp to right-size the clusters and provide better query performance for their customers while minimizing costs.

Amazon Redshift configuration was automated using a Lambda function, which connected to a given cluster and ran parameterized SQL statements. The SQL statements contained the logic to deploy schemas, user groups, and users, while AWS Secrets Manager was used to automatically generate, store, and rotate Amazon Redshift user passwords. The underlying configuration variables were stored in Amazon DynamoDB. The Lambda function retrieved the variables and requested temporary Amazon Redshift credentials to perform the configuration. This process enabled the Amp team to set up Amazon Redshift clusters in a consistent manner.
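As a rough illustration of the parameterized SQL step, the sketch below renders schema and user group statements from a configuration dictionary of the kind that could be loaded from DynamoDB. The configuration shape, the `render_statements` function, and the identifier check are assumptions made for this example; the post does not show Amp's actual statements. User creation is left out because it depends on the passwords managed in Secrets Manager.

```python
import re

# Conservative allow-list for identifiers, because names are interpolated into SQL text.
_IDENTIFIER = re.compile(r"[a-z_][a-z0-9_]*")


def _ident(name):
    """Return the name unchanged if it is a safe identifier, otherwise raise."""
    if not _IDENTIFIER.fullmatch(name):
        raise ValueError(f"unsafe identifier: {name!r}")
    return name


def render_statements(config):
    """Build the SQL statements for schemas and user groups from a config dict.

    Hypothetical config shape:
        {"schemas": ["analytics"], "groups": {"bi_readers": ["analytics"]}}
    """
    statements = []
    for schema in config.get("schemas", []):
        statements.append(f"CREATE SCHEMA IF NOT EXISTS {_ident(schema)};")
    for group, schemas in config.get("groups", {}).items():
        statements.append(f"CREATE GROUP {_ident(group)};")
        for schema in schemas:
            statements.append(
                f"GRANT USAGE ON SCHEMA {_ident(schema)} TO GROUP {_ident(group)};"
            )
    return statements
```

Generating every cluster's statements from the same function and the same stored variables is one way to get the consistency described above. A real deployment would also need to handle groups that already exist, because `CREATE GROUP` has no `IF NOT EXISTS` form.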

Business outcomes

Amp was able to achieve the following business outcomes:

Business reporting – Standard reporting required to run the business, such as daily flash reports, aggregated business review mechanisms, or project and program updates.
Product reporting – Specific reporting required to enable the inspection or measurement of key product KPIs and metrics. This included visual reports through dashboards such as marketing promotion effectiveness, app engagement metrics, and trending shows.
ML experimentation – Enabled downstream Amazon teams to use this data to support experimentation or generate predictions and recommendations. For example, ML experimentations like a personalized show recommendation list, show categorization, and content moderation helped with Amp’s user retention.

Key benefits

By implementing a scalable, cost-efficient architecture, Amp was able to achieve the following:

Limited operational complexity – They built a flexible system that used AWS managed services wherever possible.
Use the languages of data – Amp was able to support the two most common data manipulation languages, Python and SQL, to perform platform operations, conduct ML experiments, and generate analytics. With this support, the developers with Amp were able to use languages they were familiar with.
Enable experimentation and measurement – Amp allowed developers to quickly generate the datasets needed to conduct experiments and measure the results. This helps in optimizing the Amp customer experience.
Build to learn but design to scale – Amp is a new product that is finding its market fit, and was able to focus their initial energy on building just enough features to get feedback. This enabled them to pivot toward the right product market fit with each launch. They were able to build incrementally, but plan for the long term.

Conclusion

In this post, we saw how Amp created their data analytics platform using user behavioral data from streaming and batch data sources. The key factors that drove the implementation were the need to provide a flexible, scalable, cost-efficient, and effort-efficient data analytics platform. Design choices were made evaluating various AWS services.

Part 2 of this series shows how we used this data and built out the personalized show recommendation list using SageMaker.

As next steps, we recommend doing a deep dive into each stage of your data pipeline system and making design choices that would be cost-effective and scalable for your needs. For more information, you can also check out other customer use cases in the AWS Analytics Blog.

If you have feedback about this post, submit it in the comments section.

