How Amazon Search reduced ML inference costs by 85% with AWS Inferentia

Amazon’s product search engine indexes billions of products, serves hundreds of millions of customers worldwide, and is one of the most heavily used services in the world. The Amazon Search team develops machine learning (ML) technology that powers the Amazon.com search engine and helps customers search effortlessly. To deliver a great customer experience and operate at the massive scale required by the Amazon.com search engine, this team is always looking for ways to build more cost-effective systems with real-time latency and throughput requirements. The team constantly explores hardware and compilers optimized for deep learning to accelerate model training and inference, while reducing operational costs across the board.

In this post, we describe how Amazon Search uses AWS Inferentia, a high-performance accelerator purpose built by AWS to accelerate deep learning inference workloads. The team runs low-latency ML inference with Transformer-based NLP models on AWS Inferentia-based Amazon Elastic Compute Cloud (Amazon EC2) Inf1 instances, and saves up to 85% in infrastructure costs while maintaining strong throughput and latency performance.

Deep learning for duplicate and query intent prediction

Searching the Amazon Marketplace is a multi-task, multi-modal problem, dealing with several inputs such as ASINs (Amazon Standard Identification Number, a 10-digit alphanumeric number that uniquely identifies products), product images, textual descriptions, and queries. To create a tailored user experience, predictions from many models are used for different aspects of search. This is a challenge because the search system has thousands of models with tens of thousands of transactions per second (TPS) at peak load. We focus on two components of that experience:

  • Customer-perceived duplicate predictions – To show the most relevant list of products that match a user’s query, it’s important to identify products that customers have a hard time differentiating between
  • Query intent prediction – To adapt the search page and product layout to better suit what the customer is looking for, it’s important to predict the intent and type of the user’s query (for example, a media-related query, help query, and other query types)

Both of these predictions are made using Transformer model architectures, namely BERT-based models. In fact, both share the same BERT-based model as a basis, and each one stacks a classification/regression head on top of this backbone.

Duplicate prediction takes in various textual features for a pair of evaluated products as inputs (such as product type, title, description, and so on) and is computed periodically for large datasets. This model is trained end to end in a multi-task fashion. Amazon SageMaker Processing jobs are used to run these batch workloads periodically to automate their launch and only pay for the processing time that is used. For this batch workload use case, the requirement for inference throughput was 8,800 total TPS.

Intent prediction takes the user’s textual query as input and is needed in real time to dynamically serve everyday traffic and enhance the user experience on the Amazon Marketplace. The model is trained on a multi-class classification objective. This model is then deployed on Amazon Elastic Container Service (Amazon ECS), which enables quick auto scaling and easy deployment definition and management. Because this is a real-time use case, it required the P99 latency to be under 10 milliseconds to ensure a delightful user experience.

AWS Inferentia and the AWS Neuron SDK

EC2 Inf1 instances are powered by AWS Inferentia, the first ML accelerator purpose built by AWS to accelerate deep learning inference workloads. Inf1 instances deliver up to 2.3 times higher throughput and up to 70% lower cost per inference than comparable GPU-based EC2 instances. You can keep training your models using your framework of choice (PyTorch, TensorFlow, MXNet), and then easily deploy them on AWS Inferentia to benefit from the built-in performance optimizations. You can deploy a wide range of model types on Inf1 instances, including image recognition, object detection, natural language processing (NLP), and modern recommender models.

AWS Neuron is a software development kit (SDK) consisting of a compiler, runtime, and profiling tools that optimize the ML inference performance of the EC2 Inf1 instances. Neuron is natively integrated with popular ML frameworks such as TensorFlow and PyTorch. Therefore, you can deploy deep learning models on AWS Inferentia with the same familiar APIs provided by your framework of choice, and benefit from the boost in performance and lowest cost-per-inference in the cloud.

Since its launch, the Neuron SDK has continued to increase the breadth of models it supports while continuing to improve performance and reduce inference costs. This includes NLP models (BERTs), image classification models (ResNet, VGG), and object detection models (OpenPose and SSD).

Deploy on Inf1 instances for low latency, high throughput, and cost savings

The Amazon Search team wanted to save costs while meeting their high throughput requirement on duplication prediction, and the low latency requirement on query intent prediction. They chose to deploy on AWS Inferentia-based Inf1 instances and not only met the high performance requirements, but also saved up to 85% on inference costs.

Customer-perceived duplicate predictions

Before adopting Inf1, a dedicated Amazon EMR cluster ran on CPU-based instances. Without hardware acceleration, a large number of instances was necessary to meet the high throughput requirement of 8,800 total transactions per second. The team switched to inf1.6xlarge instances, each with 4 AWS Inferentia accelerators and 16 NeuronCores (4 cores per AWS Inferentia chip). They traced the Transformer-based model for a single NeuronCore and loaded one model per NeuronCore to maximize throughput. By taking advantage of the 16 available NeuronCores, they decreased inference costs by 85% (based on the current public Amazon EC2 on-demand pricing).

Query intent prediction

Given the P99 latency requirement of 10 milliseconds or less, the team loaded the model to every available NeuronCore on inf1.6xlarge instances. You can easily do this with PyTorch Neuron using the torch.neuron.DataParallel API. With the Inf1 deployment, the model latency was 3 milliseconds, end-to-end latency was approximately 10 milliseconds, and maximum throughput at peak load reached 16,000 TPS.
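
The following is a minimal sketch of that pattern, assuming a model already compiled for a single NeuronCore and saved as model_neuron.pt; batched_inputs stands in for a batch of preprocessed input tensors.

import torch
import torch_neuron  # registers the torch.neuron APIs

# Load a model compiled for a single NeuronCore
model_neuron = torch.jit.load("model_neuron.pt")

# Replicate the model across all visible NeuronCores; incoming batches are
# split along dimension 0 and processed on the cores in parallel
model_parallel = torch.neuron.DataParallel(model_neuron)

with torch.no_grad():
    output = model_parallel(batched_inputs)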

Get started with sample compilation and deployment code

The following is some sample code to help you get started on Inf1 instances and realize the performance and cost benefits like the Amazon Search team. We show how to compile and perform inference with a PyTorch model, using PyTorch Neuron.

First, the model is compiled with torch.neuron.trace():

import torch
import torch_neuron  # registers the torch.neuron API

# Load the CPU-traced TorchScript model and compile it for AWS Inferentia.
# `inputs`, `cores`, and `batch_size` are example variables defined elsewhere.
m = torch.jit.load(f="./cpu_model.pt", map_location=torch.device('cpu'))
m.eval()
model_neuron = torch.neuron.trace(
    m,
    inputs,
    compiler_workdir="work_" + str(cores) + "_" + str(batch_size),
    compiler_args=[
        '--fp32-cast=all', '--neuroncore-pipeline-cores=' + str(cores)
    ])
# Save the compiled model so it can be loaded later with torch.jit.load
model_neuron.save("m5_batch" + str(batch_size) + "_cores" + str(cores) +
                  "_with_extra_op_and_fp32cast.pt")

For the full list of possible arguments to the trace method, refer to PyTorch-Neuron trace Python API. As you can see, compiler arguments can be passed to the torch.neuron API directly. All FP32 operators are cast to BF16 with --fp32-cast=all, providing the highest performance while preserving dynamic range. More casting options are available to let you control the performance to model precision trade-off. The models used for both use cases were compiled for a single NeuronCore (no pipelining).

We then load the model on Inferentia with torch.jit.load, and use it for prediction. The Neuron runtime automatically loads the model to NeuronCores.

import torch
import torch_neuron  # the Neuron runtime loads compiled models onto NeuronCores

# CM_CPD_PROC and CM_CPD_M5 are paths to the preprocessing and compiled model artifacts
cm_cpd_preprocessing_jit = torch.jit.load(f=CM_CPD_PROC,
                                          map_location=torch.device('cpu'))
cm_cpd_preprocessing_jit.eval()
m5_model = torch.jit.load(f=CM_CPD_M5)
m5_model.eval()

# get_input() returns a batch of raw features (defined elsewhere)
input = get_input()
with torch.no_grad():
    batch_cm_cpd = cm_cpd_preprocessing_jit(input)
    input_ids, attention_mask, position_ids, valid_length, token_type_ids = (
        batch_cm_cpd['input_ids'].type(torch.IntTensor),
        batch_cm_cpd['attention_mask'].type(torch.HalfTensor),
        batch_cm_cpd['position_ids'].type(torch.IntTensor),
        batch_cm_cpd['valid_length'].type(torch.IntTensor),
        batch_cm_cpd['token_type_ids'].type(torch.IntTensor))
    model_res = m5_model(input_ids, attention_mask, position_ids, valid_length,
                         token_type_ids)

Conclusion

The Amazon Search team was able to reduce their inference costs by 85% using AWS Inferentia-based Inf1 instances, under heavy traffic and demanding performance requirements. AWS Inferentia and the Neuron SDK gave the team the flexibility to optimize the deployment process separately from training, and offered a shallow learning curve through well-rounded tools and familiar framework APIs.

You can unlock performance and cost benefits by getting started with the sample code provided in this post. Also, check out the end-to-end tutorials to run ML models on Inferentia with PyTorch and TensorFlow.


About the authors

João Moura is an AI/ML Specialist Solutions Architect at Amazon Web Services. He is mostly focused on NLP use cases and helping customers optimize deep learning model training and deployment. He is also an active proponent of ML-specialized hardware and low-code ML solutions.

Weiqi Zhang is a Software Engineering Manager at Search M5, where he works on productizing large-scale models for Amazon machine learning applications. His interests include information retrieval and machine learning infrastructure.

Jason Carlson is a Software Engineer for developing machine learning pipelines to help reduce the number of stolen search impressions due to customer-perceived duplicates. He mostly works with Apache Spark, AWS, and PyTorch to help deploy and feed/process data for ML models. In his free time, he likes to read and go on runs.

Shaohui Xi is an SDE at the Search Query Understanding Infra team. He leads the effort for building large-scale deep learning online inference services with low latency and high availability. Outside of work, he enjoys skiing and exploring good foods.

Zhuoqi Zhang is a Software Development Engineer at the Search Query Understanding Infra team. He works on building model serving frameworks to improve latency and throughput for deep learning online inference services. Outside of work, he likes playing basketball, snowboarding, and driving.

Haowei Sun is a software engineer in the Search Query Understanding Infra team. She works on designing APIs and infrastructure supporting deep learning online inference services. Her interests include service API design, infrastructure setup, and maintenance. Outside of work, she enjoys running, hiking, and traveling.

Jaspreet Singh is an Applied Scientist on the M5 team, where he works on large-scale foundation models to improve the customer shopping experience. His research interests include multi-task learning, information retrieval, and representation learning.

Shruti Koparkar is a Senior Product Marketing Manager at AWS. She helps customers explore, evaluate, and adopt EC2 accelerated computing infrastructure for their machine learning needs.


Amazon Comprehend Targeted Sentiment adds synchronous support

Earlier this year, Amazon Comprehend, a natural language processing (NLP) service that uses machine learning (ML) to discover insights from text, launched the Targeted Sentiment feature. With Targeted Sentiment, you can identify groups of mentions (co-reference groups) corresponding to a single real-world entity or attribute, provide the sentiment associated with each entity mention, and offer the classification of the real-world entity based on a pre-determined list of entities.

Today, we’re excited to announce the new synchronous API for targeted sentiment in Amazon Comprehend, which provides a granular understanding of the sentiments associated with specific entities in input documents.

In this post, we provide an overview of how you can get started with the Amazon Comprehend Targeted Sentiment synchronous API, walk through the output structure, and discuss three separate use cases.

Targeted sentiment use cases

Real-time targeted sentiment analysis in Amazon Comprehend has several applications that enable accurate and scalable brand and competitor insights. You can use targeted sentiment for business-critical processes such as live market research, shaping brand experience, and improving customer satisfaction.

The following is an example of using targeted sentiment for a movie review.

“Movie” is the primary entity, identified as type movie, and is mentioned two more times as “movie” and the pronoun “it.” The Targeted Sentiment API provides the sentiment towards each entity. Green indicates positive sentiment, red negative, and blue neutral.

Traditional analysis provides sentiment of the overall text, which in this case is mixed. With targeted sentiment, you can get more granular insights. In this scenario, the sentiment towards the movie is both positive and negative: positive in regards to the actors, but negative in relation to the overall quality. This can provide targeted feedback for the film team, such as to exercise more diligence in script writing, but to consider the actors for future roles.

Prominent applications of real-time sentiment analysis vary across industries. They include extracting marketing and customer insights from live social media feeds, videos, live events, or broadcasts, understanding emotions for research purposes, and deterring cyberbullying. Synchronous targeted sentiment drives business value by providing feedback within seconds so that you can make decisions in real time.

Let’s take a closer look at these various real-time targeted sentiment analysis applications and how different industries may use them:

  • Scenario 1 – Opinion mining of financial documents to determine sentiment towards a stock, person, or organization
  • Scenario 2 – Real-time call center analytics to determine granular sentiment in customer interactions
  • Scenario 3 – Monitoring organization or product feedback across social media and digital channels, and providing real-time support and resolutions

In the following sections, we discuss each use case in more detail.

Scenario 1: Financial opinion mining and trading signal generation

Sentiment analysis is crucial for market-makers and investment firms when building trading strategies. Determining granular sentiment can help traders infer what reaction the market may have towards global events, business decisions, individuals, and industry direction. This sentiment can be a determining factor on whether to buy or sell a stock or commodity.

To see how we can use the Targeted Sentiment API in these scenarios, let’s look at a statement from Federal Reserve Chair Jerome Powell on inflation.

As we can see in the example, understanding the sentiment towards inflation can inform a buy or sell decision. In this scenario, it can be inferred from the Targeted Sentiment API that Chair Powell’s opinion on inflation is negative, and this is most likely going to result in higher interest rates slowing economic growth. For most traders, this could result in a sell decision. The Targeted Sentiment API can provide traders faster and more granular insight than a traditional document review, and in an industry where speed is crucial, it can result in substantial business value.

The following is a reference architecture for using targeted sentiment in financial opinion mining and trading signal generation scenarios.

Scenario 2: Real-time contact center analysis

A positive contact center experience is crucial in delivering a strong customer experience. To help ensure positive and productive experiences, you can implement sentiment analysis to gauge customer reactions, the changing customer moods through the duration of the interaction, and the effectiveness of contact center workflows and employee training. With the Targeted Sentiment API, you can get granular information within your contact center sentiment analysis. Not only can we determine the sentiment of the interaction, but now we can see what caused the negative or positive reaction and take the appropriate action.

We demonstrate this with the following transcripts of a customer returning a malfunctioning toaster. For this example, we show sample statements that the customer is making.

As we can see, the conversation starts off fairly negative. With the Targeted Sentiment API, we’re able to determine the root cause of the negative sentiment and see it’s regarding a malfunctioning toaster. We can use this information to run certain workflows, or route to different departments.

Through the conversation, we can also see the customer wasn’t receptive to the offer of a gift card. We can use this information to improve agent training, reevaluate if we should even bring up the topic in these scenarios, or decide if this question should only be asked with a more neutral or positive sentiment.

Lastly, we can see that the service that was provided by the agent was received positively even though the customer was still upset about the toaster. We can use this information to validate agent training and reward strong agent performance.

The following is a reference architecture incorporating targeted sentiment into real-time contact center analytics.

Scenario 3: Monitoring social media for customer sentiment

Social media reception can be a deciding factor for product and organizational growth. Tracking how customers are reacting to company decisions, product launches, or marketing campaigns is critical in determining effectiveness.

We can demonstrate how to use the Targeted Sentiment API in this scenario by using Twitter reviews of a new set of headphones.

In this example, there are mixed reactions to the launch of the headphones, but there is a consistent theme of the sound quality being poor. Companies can use this information to see how users are reacting to certain attributes and see where product improvements should be made in future iterations.

The following is a reference architecture using the Targeted Sentiment API for social media sentiment analysis.

Get started with Targeted Sentiment

To use targeted sentiment on the Amazon Comprehend console, complete the following steps:

  1. On the Amazon Comprehend console, choose Launch Amazon Comprehend.
  2. For Input text, enter any text that you want to analyze.
  3. Choose Analyze.

After the document has been analyzed, the output of the Targeted Sentiment API can be found on the Targeted sentiment tab in the Insights section. Here you can see the analyzed text, each entity’s respective sentiment, and the reference group it’s associated with.

In the Application integration section, you can find the request and response for the analyzed text.

Programmatically use Targeted Sentiment

To get started with the synchronous API programmatically, you have two options:

  • detect-targeted-sentiment – This API provides the targeted sentiment for a single text document
  • batch-detect-targeted-sentiment – This API provides the targeted sentiment for a list of documents

You can interact with the API with the AWS Command Line Interface (AWS CLI) or through the AWS SDK. Before we get started, make sure that you have configured the AWS CLI, and have the required permissions to interact with Amazon Comprehend.

The Targeted Sentiment synchronous API requires two request parameters to be passed:

  • LanguageCode – The language of the text
  • Text or TextList – The UTF-8 text that is processed

The following code is an example for the detect-targeted-sentiment API:

{
    "LanguageCode": "string",
    "Text": "string"
}

The following is an example for the batch-detect-targeted-sentiment API:

{
    "LanguageCode": "string",
    "TextList": ["string"]
}

Now let’s look at some sample AWS CLI commands.

The following code is an example for the detect-targeted-sentiment API:

aws comprehend detect-targeted-sentiment \
    --region us-east-2 \
    --text "I like the burger but service was bad" \
    --language-code en

The following is an example for the batch-detect-targeted-sentiment API:

aws comprehend batch-detect-targeted-sentiment \
    --region us-east-2 \
    --text-list "We loved the Seashore Hotel! It was clean and the staff was friendly. However, the Seashore was a little too noisy at night." "I like the burger but service is bad" \
    --language-code en

The following is a sample Boto3 SDK API call:

import boto3

session = boto3.Session()
comprehend_client = session.client(service_name='comprehend', region_name='us-east-2')

The following is an example of the detect-targeted-sentiment API:

response = comprehend_client.detect_targeted_sentiment(
    LanguageCode='en',
    Text="I like the burger but service was bad"
)
print(response)

The following is an example of the batch-detect-targeted-sentiment API:

response = comprehend_client.batch_detect_targeted_sentiment(
    LanguageCode='en',
    TextList = ["I like the burger but service was bad","The staff was really sweet though"]
)

For more details about the API syntax, refer to the Amazon Comprehend Developer Guide.

API response structure

The Targeted Sentiment API provides a simple way to consume the output of your jobs. It provides a logical grouping of the entities (entity groups) detected, along with the sentiment for each entity. The following are definitions of the fields in the response (a short parsing example follows the list):

  • Entities – The significant parts of the document. For example, Person, Place, Date, Food, or Taste.
  • Mentions – The references or mentions of the entity in the document. These can be pronouns or common nouns such as “it,” “him,” “book,” and so on. These are organized in order by location (offset) in the document.
  • DescriptiveMentionIndex – The index in Mentions that gives the best depiction of the entity group. For example, “ABC Hotel” instead of “hotel,” “it,” or other common noun mentions.
  • GroupScore – The confidence that all the entities mentioned in the group are related to the same entity (such as “I,” “me,” and “myself” referring to one person).
  • Text – The text in the document that depicts the entity.
  • Type – A description of what the entity depicts.
  • Score – The model confidence that this is a relevant entity.
  • MentionSentiment – The actual sentiment found for the mention.
  • Sentiment – The string value of positive, neutral, negative, or mixed.
  • SentimentScore – The model confidence for each possible sentiment.
  • BeginOffset – The offset into the document text where the mention begins.
  • EndOffset – The offset into the document text where the mention ends.
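
To tie these fields together, the following is a minimal sketch that walks the response from the earlier detect_targeted_sentiment example and prints each entity group along with the sentiment of its mentions; it assumes the response variable from the preceding Boto3 call.

for entity in response["Entities"]:
    mentions = entity["Mentions"]
    # DescriptiveMentionIndex points to the mention(s) that best name the entity group
    best = mentions[entity["DescriptiveMentionIndex"][0]]
    print(f"Entity group: {best['Text']} (type: {best['Type']})")
    for mention in mentions:
        sentiment = mention["MentionSentiment"]["Sentiment"]
        print(f"  '{mention['Text']}' at offset {mention['BeginOffset']}: {sentiment}")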

For a more detailed breakdown, refer to Extract granular sentiment in text with Amazon Comprehend Targeted Sentiment or Output file organization.

Conclusion

Sentiment analysis remains crucial for organizations for a myriad of reasons—from tracking customer sentiment over time for businesses, to inferring whether a product is liked or disliked, to understanding opinions of users of a social network towards certain topics, or even predicting the results of campaigns. Real-time targeted sentiment can be effective for businesses, allowing them to go beyond overall sentiment analysis to explore insights to drive customer experiences using Amazon Comprehend.

To learn more about Targeted Sentiment for Amazon Comprehend, refer to Targeted sentiment.


About the authors

Raj Pathak is a Solutions Architect and Technical advisor to Fortune 50 and Mid-Sized FSI (Banking, Insurance, Capital Markets) customers across Canada and the United States. Raj specializes in Machine Learning with applications in Document Extraction, Contact Center Transformation and Computer Vision.

Wrick Talukdar is a Senior Architect with Amazon Comprehend Service team. He works with AWS customers to help them adopt machine learning on a large scale. Outside of work, he enjoys reading and photography.


Run machine learning enablement events at scale using AWS DeepRacer multi-user account mode

This post was co-written by Marius Cealera, Senior Partner Solutions Architect at AWS, Zdenko Estok, Cloud Architect at Accenture and Sakar Selimcan, Cloud Architect at Accenture.

Machine learning (ML) is a high-stakes business priority, with companies spending $306 billion on ML applications in the past 3 years. According to Accenture, companies that scale ML across a business can achieve nearly triple the return on their investments. But too many companies aren’t achieving the value they expected. Scaling ML effectively for the long term requires the professionalization of the industry and the democratization of ML literacy across the enterprise. This requires more accessible ML training, speaking to a larger number of people with diverse backgrounds.

This post shows how companies can introduce hundreds of employees to ML concepts by easily running AWS DeepRacer events at scale.

Run AWS DeepRacer events at scale

AWS DeepRacer is a simple and fun way to get started with reinforcement learning (RL), an ML technique where an agent, such as a physical or virtual AWS DeepRacer vehicle, discovers the optimal actions to take in a given environment. You can get started with RL quickly with hands-on tutorials that guide you through the basics of training RL models and testing them in an exciting, autonomous car racing experience.

“We found the user-friendly nature of DeepRacer allowed our enablement sessions to reach parts of our organizations that are usually less inclined to participate in AI/ML events,” says Zdenko Estok, a Cloud Architect at Accenture. “Our post-event statistics indicate that up to 75% of all participants to DeepRacer events are new to AI/ML and 50% are new to AWS.”

Until recently, organizations hosting private AWS DeepRacer events had to create and assign AWS accounts to every event participant. This often meant securing and monitoring usage across hundreds or even thousands of AWS accounts. The setup and participant onboarding was cumbersome and time-consuming, often limiting the size of the event. With AWS DeepRacer multi-user account management, event organizers can provide hundreds of participants access to AWS DeepRacer using a single AWS account, simplifying event management and improving the participant experience.

Build a solution around AWS DeepRacer multi-user account management

You can use AWS DeepRacer multi-user account management to set usage quotas on training hours, monitor spending on training and storage, enable and disable training, and view and manage models for every event participant. In addition, when combined with an enterprise identity provider (IdP), AWS DeepRacer multi-user account management provides a quick and frictionless onboarding experience for event participants. The following diagram explains what such a setup looks like.

Solution diagram showing AWS IAM Identity Center providing access to the AWS DeepRacer console

The solution assumes access to an AWS account.

To set up your account with AWS DeepRacer admin permissions for multi-user, follow the steps in Set up your account with AWS DeepRacer admin permissions for multi-user to attach the AWS Identity and Access Management (IAM) AWS DeepRacer Administrator policy, AWSDeepRacerAccountAdminAccess, to the user, group, or role used to administer the event. Next, navigate to the AWS DeepRacer console and activate multi-user account mode.

By activating multi-user account mode, you enable participants to train models on the AWS DeepRacer console, with all training and storage charges billed to the administrator’s AWS account. By default, a sponsoring account in multi-user mode is limited to 100 concurrent training jobs, 100 concurrent evaluation jobs, 1,000 cars, and 50 private leaderboards, shared among all sponsored profiles. You can increase these limits by contacting Customer Service.

This setup also relies on using an enterprise IdP with AWS IAM Identity Center (Successor to AWS Single Sign-On) enabled. For information on setting up IAM Identity Center with an IdP, see Enable IAM Identity Center and Connect to your external identity provider. Note that different IdPs may require slightly different setup steps. Consult your IdP’s documentation for more details.

The solution depicted here works as follows:

  1. Event participants are directed to a dedicated event portal. This can be a simple webpage where participants can enter their enterprise email address in a basic HTML form and choose Register. Registered participants can use this portal to access the AWS DeepRacer console. You can further personalize this page to gather additional user data (such as the user’s DeepRacer AWS profile or their level of AI and ML knowledge) or to add event marketing and training materials.
  2. The event portal registration form calls a custom API endpoint that stores email addresses in Amazon DynamoDB through AWS AppSync. For more information, refer to Attaching a Data Source for a sample CloudFormation template on setting up AWS AppSync with DynamoDB and calling the API from a browser client.
  3. For every new registration, an Amazon DynamoDB Streams event triggers an AWS Lambda function that calls the IdP’s API (in this case, the Azure Active Directory API) to add the participant’s identity to a dedicated event group that was previously set up with IAM Identity Center (a simplified Lambda sketch follows this list). The IAM Identity Center permission set controls the level of access racers have in the AWS account. At a minimum, this permission set should include the AWSDeepRacerDefaultMultiUserAccess managed policy. For more information, refer to Permission sets and AWS DeepRacer managed policies.
  4. If the IdP call is successful, the same Lambda function sends an email notification using Amazon Pinpoint, informing the participant the registration was successful and providing the AWS Management Console access URL generated in IAM Identity Center. For more information, refer to Send email by using the Amazon Pinpoint API.
  5. When racers choose this link, they’re asked to authenticate with their enterprise credentials, unless their current browser session is already authenticated. After authentication, racers are redirected to the AWS DeepRacer console where they can start training AWS DeepRacer models and submit them to virtual races.
  6. Event administrators use the AWS DeepRacer console to create and manage races. Race URLs can be shared with the racers through a Lambda-generated email, either as part of the initial registration flow or as a separate notification. Event administrators can monitor and limit usage directly on the AWS DeepRacer console, including estimated spending and training model hours. Administrators can also pause racer sponsorship and delete models.
  7. Finally, administrators can disable multi-user account mode after the event ends and remove participant access to the AWS account, either by removing the users from IAM Identity Center or by disabling the setup in the external IdP.
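
The following is a simplified sketch of the Lambda function from steps 3 and 4. It assumes the registration table stores the address under an email attribute; add_participant_to_idp_group is a hypothetical placeholder for the call to your IdP's group membership API (the Microsoft Graph API for Azure Active Directory in this example), and the Amazon Pinpoint application ID, sender address, and access portal URL come from environment variables.

import os
import boto3

pinpoint = boto3.client("pinpoint")

def add_participant_to_idp_group(email):
    # Hypothetical helper: call your IdP's API (for example, the Microsoft Graph
    # API for Azure Active Directory) to add the user to the event group that is
    # synced with IAM Identity Center.
    raise NotImplementedError

def handler(event, context):
    # Triggered by DynamoDB Streams for every new registration
    for record in event["Records"]:
        if record["eventName"] != "INSERT":
            continue
        email = record["dynamodb"]["NewImage"]["email"]["S"]  # assumed attribute name
        add_participant_to_idp_group(email)
        # Notify the participant with the IAM Identity Center access portal URL
        pinpoint.send_messages(
            ApplicationId=os.environ["PINPOINT_APP_ID"],
            MessageRequest={
                "Addresses": {email: {"ChannelType": "EMAIL"}},
                "MessageConfiguration": {
                    "EmailMessage": {
                        "FromAddress": os.environ["SENDER_ADDRESS"],
                        "SimpleEmail": {
                            "Subject": {"Data": "Your AWS DeepRacer event access"},
                            "TextPart": {
                                "Data": "Registration successful. Sign in at "
                                + os.environ["ACCESS_PORTAL_URL"]
                            },
                        },
                    }
                },
            },
        )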

Conclusion

AWS DeepRacer events are a great way to raise interest and increase ML knowledge across all pillars and levels of an organization. This post explains how you can couple AWS DeepRacer multi-user account mode with IAM Identity Center and an enterprise IdP to run AWS DeepRacer events at scale with minimum administrative effort, while ensuring a great participant experience.

The solution presented in this post was developed and used by Accenture to run the world’s largest private AWS DeepRacer event in 2021, with more than 2,000 racers. By working with the Accenture AWS Business Group (AABG), a strategic collaboration by Accenture and AWS, you can learn from the cultures, resources, technical expertise, and industry knowledge of two leading innovators, helping you accelerate the pace of innovation to deliver disruptive products and services. Connect with our team at accentureaws@amazon.com to engage with a network of specialists steeped in industry knowledge and skilled in strategic AWS services in areas ranging from big data to cloud native to ML.


About the authors

Marius Cealera is a senior partner solutions architect at AWS. He works closely with the Accenture AWS Business Group (AABG) to develop and implement innovative cloud solutions. When not working, he enjoys being with his family, biking and trekking in and around Luxembourg.

Zdenko Estok works as a cloud architect and DevOps engineer at Accenture. He works with AABG to develop and implement innovative cloud solutions, and specializes in Infrastructure as Code and Cloud Security. Zdenko likes to bike to the office and enjoys pleasant walks in nature.

Selimcan “Can” Sakar is a cloud-first developer and solution architect at Accenture Germany with a focus on emerging technologies such as AI/ML, IoT, and blockchain. Can suffers from Gear Acquisition Syndrome (aka G.A.S.) and likes to pursue new instruments, bikes, and sim-racing equipment in his free time.


Enable intelligent decision-making with Amazon SageMaker Canvas and Amazon QuickSight

Every company, regardless of its size, wants to deliver the best products and services to its customers. To achieve this, companies want to understand industry trends and customer behavior, and optimize internal processes and data analyses on a routine basis. This is a crucial component of a company’s success.

A very prominent part of the analyst role includes business metrics visualization (like sales revenue) and prediction of future events (like increase in demand) to make data-driven business decisions. To approach this first challenge, you can use Amazon QuickSight, a cloud-scale business intelligence (BI) service that provides easy-to-understand insights and gives decision-makers the opportunity to explore and interpret information in an interactive visual environment. For the second task, you can use Amazon SageMaker Canvas, a cloud service that expands access to machine learning (ML) by providing business analysts with a visual point-and-click interface that allows you to generate accurate ML predictions on your own.

When looking at these metrics, business analysts often identify patterns in customer behavior, in order to determine whether the company risks losing the customer. This problem is called customer churn, and ML models have a proven track record of predicting such customers with high accuracy (for an example, see Elula’s AI Solutions Help Banks Improve Customer Retention).

Building ML models can be a tricky process because it requires an expert team to manage the data preparation and ML model training. However, with Canvas, you can do that without any special knowledge and with zero lines of code. For more information, check out Predict customer churn with no-code machine learning using Amazon SageMaker Canvas.

In this post, we show you how to visualize the predictions generated from Canvas in a QuickSight dashboard, enabling intelligent decision-making via ML.

Overview of solution

In the post Predict customer churn with no-code machine learning using Amazon SageMaker Canvas, we assumed the role of a business analyst in the marketing department of a mobile phone operator, and we successfully created an ML model to identify customers with potential risk of churn. Thanks to the predictions generated by our model, we now want to make an analysis of a potential financial outcome to make data-driven business decisions about potential promotions for these clients and regions.

The architecture that will help us achieve this is shown in the following diagram.

The workflow steps are as follows:

  1. Upload a new dataset with the current customer population into Canvas.
  2. Run a batch prediction and download the results.
  3. Upload the files into QuickSight to create or update visualizations.

You can perform these steps in Canvas without writing a single line of code. For the full list of supported data sources, refer to Importing data in Amazon SageMaker Canvas.

Prerequisites

For this walkthrough, make sure that the following prerequisites are met:

Use the customer churn model

After you complete the prerequisites, you should have a model trained on historical data in Canvas, ready to be used with new customer data to predict customer churn, which you can then use in QuickSight.

  1. Create a new file churn-no-labels.csv by randomly selecting 1,500 lines from the original dataset churn.csv and removing the Churn? column.

We use this new dataset to generate predictions.
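
If you're comfortable with a few lines of Python for this one-off data preparation step, the following is one way to create the file with pandas; it assumes churn.csv is in your working directory, and any spreadsheet tool works just as well.

import pandas as pd

# Sample 1,500 rows from the original dataset and drop the target column
df = pd.read_csv("churn.csv")
sample = df.sample(n=1500, random_state=42)
sample.drop(columns=["Churn?"]).to_csv("churn-no-labels.csv", index=False)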

We complete the next steps in Canvas. You can open Canvas via the AWS Management Console, or via the SSO application provided by your cloud administrator. If you’re not sure how to access Canvas, refer to Getting started with using Amazon SageMaker Canvas.

  1. On the Canvas console, choose Datasets in the navigation pane.
  2. Choose Import.

  1. Choose Upload and choose the churn-no-labels.csv file that you created.
  2. Choose Import data.

The data import process time depends on the size of the file. In our case, it should be around 10 seconds. When it’s complete, we can see the dataset is in Ready status.

  1. To preview the first 100 rows of the dataset, choose the options menu (three dots) and choose Preview.

  1. Choose Models in the navigation pane, then choose the churn model you created as part of the prerequisites.

  1. On the Predict tab, choose Select dataset.

  1. Select the churn-no-labels.csv dataset, then choose Generate predictions.

Inference time depends on model complexity and dataset size; in our case, it takes around 10 seconds. When the job is finished, it changes its status to Ready and we can download the results.

  1. Choose the options menu (three dots), Download, and Download all values.

Optionally, we can take a quick look at the results by choosing Preview. The first two columns are predictions from the model.

We have successfully used our model to predict churn risk for our current customer population. Now we’re ready to visualize business metrics based on our predictions.

Import data to QuickSight

As we discussed previously, business analysts require predictions to be visualized together with business metrics in order to make data-driven business decisions. To do that, we use QuickSight, which provides easy-to-understand insights and gives decision-makers the opportunity to explore and interpret information in an interactive visual environment. With QuickSight, we can build visualizations like graphs and charts in seconds with a simple drag-and-drop interface. In this post, we build several visualizations to better understand business risks and how we could manage them, such as where we should launch new marketing campaigns.

To get started, complete the following steps:

  1. On the QuickSight console, choose Datasets in the navigation pane.
  2. Choose New dataset.

QuickSight supports many data sources. In this post, we use a local file, the one we previously generated in Canvas, as our source data.

  1. Choose Upload a file.

  1. Choose the recently downloaded file with predictions.

QuickSight uploads and analyzes the file.

  1. Check that everything is as expected in the preview, then choose Next.

  1. Choose Visualize.

The data is now successfully imported and we’re ready to analyze it.

Create a dashboard with business metrics of churn predictions

It’s time to analyze our data and make a clear and easy-to-use dashboard that recaps all the information necessary for data-driven business decisions. This type of dashboard is an important tool in the arsenal of a business analyst.

The following is an example dashboard that can help identify and act on the risk of customer churn.

On this dashboard, we visualize several important business metrics:

  • Customers likely to churn – The left donut chart represents the number and percentage of users with over a 50% risk of churning. This chart helps us quickly understand the size of a potential problem.
  • Potential revenue loss – The top middle donut chart represents the amount of revenue loss from users with over a 50% risk of churning. This chart helps us quickly understand the size of potential revenue loss from churn. It also suggests that we could lose several above-average customers, because the percentage of potential revenue lost is larger than the percentage of users at risk of churning.
  • Potential revenue loss by state – The top right horizontal bar chart represents the size of revenue lost versus revenue from customers not at risk of churning. This visual could help us understand which state is the most important for us from a marketing campaign perspective.
  • Details about customers at risk of churning – The bottom left table contains details about all our customers. This table could be helpful if we want to quickly look at the details of several customers with and without churn risk.

Customers likely to churn

We start by building a chart with customers at risk of churning.

  1. Under Fields list, choose the Churn? attribute.

QuickSight automatically builds a visualization.

Although the bar plot is a common visualization to understand data distribution, we prefer to use a donut chart. We can change this visual by changing its properties.

  1. Choose the donut chart icon under Visual types.
  2. Choose the current name (double-click) and change it to Customers likely to churn.

  1. To customize other visual effects (remove legend, add values, change font size), choose the pencil icon and make your changes.

As shown in the following screenshot, we increased the area of the donut, as well as added some extra information in the labels.

Potential revenue loss

Another important metric to consider when calculating the business impact of customer churn is potential revenue loss. This metric is important because the number of customers at risk doesn’t always reflect the revenue at stake. In the telecom industry, for example, we could have many inactive clients who have a high risk of churn but generate zero revenue. This chart can help us understand whether we’re in such a situation. To add this metric to our dashboard, we create a custom calculated field by providing the mathematical formula for computing potential revenue loss, then visualize it as another donut chart.

  1. On the Add menu, choose Add calculated field.

  1. Name the field Total charges.
  2. Enter the formula {Day Charge}+{Eve Charge}+{Intl Charge}+{Night Charge}.
  3. Choose Save.

  1. On the Add menu, choose Add visual.

  1. Under Visual types, choose the donut chart icon.
  2. Under Fields list, drag Churn? to Group/Color.
  3. Drag Total charges to Value.
  4. On the Value menu, choose Show as and choose Currency.
  5. Choose the pencil icon to customize other visual effects (remove legend, add values, change font size).

At this moment, our dashboard has two visualizations.

We can already observe that in total we could lose 18% (270) of customers, which equals 24% ($6,280) in revenue. Let’s explore further by analyzing potential revenue loss at the state level.

Potential revenue loss by state

To visualize potential revenue loss by state, let’s add a horizontal bar graph.

  1. On the Add menu, choose Add visual.

  1. Under Visual types, choose the horizontal bar chart icon.
  2. Under Fields list, drag Churn? to Group/Color.
  3. Drag Total charges to Value.
  4. On the Value menu, choose Show as and choose Currency.
  5. Drag State to Y axis.
  6. Choose the pencil icon to customize other visual effects (remove legend, add values, change font size).

  1. We can also sort our new visual by choosing Total charges at the bottom and choosing Descending.

This visual could help us understand which state is the most important from a marketing campaign perspective. For example, in Hawaii, we could potentially lose half our revenue ($253,000) while in Washington, this value is less than 10% ($52,000). We can also see that in Arizona, we risk losing almost every customer.

Details about customers at risk of churning

Let’s build a table with details about customers at risk of churning.

  1. On the Add menu, choose Add visual.

  1. Under Visual types, choose the table icon.
  2. Under Fields list, drag Phone, State, Int’l Plan, Vmail Plan, Churn?, and Account Length to Group by.
  3. Drag probability to Value.
  4. On the Value menu, choose Show as and Percent.

Customize your dashboard

QuickSight offers several options to customize your dashboard, such as the following.

  1. To add a name, on the Add menu, choose Add title.

  1. Enter a title (for this post, we rename our dashboard Churn analysis).

  1. To resize your visuals, choose the bottom right corner of the chart and drag to the desired size.
  2. To move a visual, choose the top center of the chart and drag it to a new location.
  3. To change the theme, choose Themes in the navigation pane.
  4. Choose your new theme (for example, Midnight), and choose Apply.

Publish your dashboard

A dashboard is a read-only snapshot of an analysis that you can share with other QuickSight users for reporting purposes. Your dashboard preserves the configuration of the analysis at the time you publish it, including such things as filtering, parameters, controls, and sort order. The data used for the analysis isn’t captured as part of the dashboard. When you view the dashboard, it reflects the current data in the datasets used by the analysis.

To publish your dashboard, complete the following steps:

  1. On the Share menu, choose Publish dashboard.

  1. Enter a name for your dashboard.
  2. Choose Publish dashboard.

Congratulations, you have successfully created a churn analysis dashboard.

Update your dashboard with a new prediction

As the model evolves and we generate new data from the business, we might need to update this dashboard with new information. Complete the following steps:

  1. Create a new file churn-no-labels-updated.csv by randomly selecting another 1,500 lines from the original dataset churn.csv and removing the Churn? column.

We use this new dataset to generate new predictions.

  1. Repeat the steps from the Use the customer churn model section of this post to get predictions for the new dataset, and download the new file.
  2. On the QuickSight console, choose Datasets in the navigation pane.
  3. Choose the dataset we created.

  1. Choose Edit dataset.

  1. On the drop-down menu, choose Update file.

  1. Choose Upload file.

  1. Choose the recently downloaded file with the predictions.
  2. Review the preview, then choose Confirm file update.

After the “File updated successfully” message appears, we can see that the file name has also changed.

  1. Choose Save & publish.

  1. When the “Saved and published successfully” message appears, you can go back to the main menu by choosing the QuickSight logo in the upper left corner.

  1. Choose Dashboards in the navigation pane and choose the dashboard we created before.

You should see your dashboard with the updated values.

We have just updated our QuickSight dashboard with the most recent predictions from Canvas.

Clean up

To avoid future charges, log out from Canvas.

Conclusion

In this post, we used an ML model from Canvas to predict customers at risk of churning and built a dashboard with insightful visualizations to help us make data-driven business decisions. We did so without writing a single line of code thanks to user-friendly interfaces and clear visualizations. This enables business analysts to be agile in building ML models, and perform analyses and extract insights in complete autonomy from data science teams.

To learn more about using Canvas, see Build, Share, Deploy: how business analysts and data scientists achieve faster time-to-market using no-code ML and Amazon SageMaker Canvas. For more information about creating ML models with a no-code solution, see Announcing Amazon SageMaker Canvas – a Visual, No Code Machine Learning Capability for Business Analysts. To learn more about the latest QuickSight features and best practices, see AWS Big Data Blog.


About the Author

Aleksandr Patrushev is AI/ML Specialist Solutions Architect at AWS, based in Luxembourg. He is passionate about the cloud and machine learning, and the way they could change the world. Outside work, he enjoys hiking, sports, and spending time with his family.

Davide Gallitelli is a Specialist Solutions Architect for AI/ML in the EMEA region. He is based in Brussels and works closely with customers throughout Benelux. He has been a developer since he was very young, starting to code at the age of 7. He started learning AI/ML at university, and has fallen in love with it since then.


Amazon SageMaker Autopilot is up to eight times faster with new ensemble training mode powered by AutoGluon

Amazon SageMaker Autopilot has added a new training mode that supports model ensembling powered by AutoGluon. Ensemble training mode in Autopilot trains several base models and combines their predictions using model stacking. For datasets less than 100 MB, ensemble training mode builds machine learning (ML) models with high accuracy quickly—up to eight times faster than hyperparameter optimization (HPO) training mode with 250 trials, and up to 5.8 times faster than HPO training mode with 100 trials. It supports a wide range of algorithms, including LightGBM, CatBoost, XGBoost, Random Forest, Extra Trees, linear models, and neural networks based on PyTorch and FastAI.

How AutoGluon builds ensemble models

AutoGluon-Tabular (AGT) is a popular open-source AutoML framework that trains highly accurate ML models on tabular datasets. Unlike existing AutoML frameworks, which primarily focus on model and hyperparameter selection, AGT succeeds by ensembling multiple models and stacking them in multiple layers. The default behavior of AGT can be summarized as follows: Given a dataset, AGT trains various base models ranging from off-the-shelf boosted trees to customized neural networks on the dataset. The predictions from the base models are used as features to build a stacking model, which learns the appropriate weight of each base model. With these learned weights, the stacking model then combines the base model’s predictions and returns the combined predictions as the final set of predictions.
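
To illustrate this default behavior outside of Autopilot, the following is a minimal sketch using the open-source AutoGluon-Tabular library; it assumes autogluon is installed and that train.csv and test.csv are local tabular files with a label column named label.

from autogluon.tabular import TabularDataset, TabularPredictor

# AutoGluon infers column types and the problem type automatically
train_data = TabularDataset("train.csv")
test_data = TabularDataset("test.csv")

# fit() trains multiple base models (boosted trees, neural networks, and so on)
# and a stacking/weighted ensemble that combines their predictions
predictor = TabularPredictor(label="label").fit(train_data)

# The leaderboard lists the base models and the ensemble (for example, WeightedEnsemble_L2)
print(predictor.leaderboard(test_data, silent=True))
predictions = predictor.predict(test_data)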

How Autopilot’s ensemble training mode works

Different datasets have characteristics that are suitable for different algorithms. Given a dataset with unknown characteristics, it’s difficult to know beforehand which algorithms will work best on a dataset. With this in mind, data scientists using AGT often create multiple custom configurations with a subset of algorithms and parameters. They run these configurations on a given dataset to find the best configuration in terms of performance and inference latency.

Autopilot is a low-code ML product that automatically builds the best ML models for your data. In the new ensemble training mode, Autopilot selects an optimal set of AGT configurations and runs multiple trials to return the best model. These trials are run in parallel to evaluate if AGT’s performance can be further improved, in terms of objective metrics or inference latency.

Results observed using OpenML benchmarks

To evaluate the performance improvements, we used OpenML benchmark datasets with sizes varying from 0.5–100 MB and ran 10 AGT trials with different combinations of algorithms and hyperparameter configurations. The tests compared ensemble training mode to HPO mode with 250 trials and HPO mode with 100 trials. The following table compares the overall Autopilot experiment runtime (in minutes) between the two training modes for various dataset sizes.

Dataset Size | HPO Mode (250 trials) | HPO Mode (100 trials) | Ensemble Mode (10 trials) | Runtime Improvement with HPO 250 | Runtime Improvement with HPO 100
< 1 MB | 121.5 mins | 88.0 mins | 15.0 mins | 8.1x | 5.9x
1–10 MB | 136.1 mins | 76.5 mins | 25.8 mins | 5.3x | 3.0x
10–100 MB | 152.7 mins | 103.1 mins | 60.9 mins | 2.5x | 1.7x

To compare performance, we use accuracy for multiclass classification problems, the F1 score for binary classification problems, and R2 for regression problems. The gains in objective metrics are shown in the following tables. We observed that ensemble training mode performed better than HPO training mode (both 100 and 250 trials).

Note that the ensemble mode shows consistent improvement over HPO mode with 250 trials irrespective of dataset size and problem types.

The following table compares accuracy for multi-class classification problems (higher is better).

Dataset Size | HPO Mode (250 trials) | HPO Mode (100 trials) | Ensemble Mode (10 trials) | Percentage Improvement over HPO 250
< 1 MB | 0.759 | 0.761 | 0.771 | 1.46%
1–5 MB | 0.941 | 0.935 | 0.957 | 1.64%
5–10 MB | 0.639 | 0.633 | 0.671 | 4.92%
10–50 MB | 0.998 | 0.999 | 0.999 | 0.11%
51–100 MB | 0.853 | 0.852 | 0.875 | 2.56%

The following table compares F1 scores for binary classification problems (higher is better).

Dataset Size | HPO Mode (250 trials) | HPO Mode (100 trials) | Ensemble Mode (10 trials) | Percentage Improvement over HPO 250
< 1 MB | 0.801 | 0.807 | 0.826 | 3.14%
1–5 MB | 0.59 | 0.587 | 0.629 | 6.60%
5–10 MB | 0.886 | 0.889 | 0.898 | 1.32%
10–50 MB | 0.731 | 0.736 | 0.754 | 3.12%
51–100 MB | 0.503 | 0.493 | 0.541 | 7.58%

The following table compares R2 for regression problems (higher is better).

Dataset Size | HPO Mode (250 trials) | HPO Mode (100 trials) | Ensemble Mode (10 trials) | Percentage Improvement over HPO 250
< 1 MB | 0.717 | 0.718 | 0.716 | 0%
1–5 MB | 0.803 | 0.803 | 0.817 | 2%
5–10 MB | 0.590 | 0.586 | 0.614 | 4%
10–50 MB | 0.686 | 0.688 | 0.684 | 0%
51–100 MB | 0.623 | 0.626 | 0.631 | 1%

In the next sections, we show how to use the new ensemble training mode in Autopilot to analyze datasets and easily build high-quality ML models.

Dataset overview

We use the Titanic dataset to predict if a given passenger survived or not. This is a binary classification problem. We focus on creating an Autopilot experiment using the new ensemble training mode and compare the results of F1 score and overall runtime with an Autopilot experiment using HPO training mode (100 trials).

Column Name | Description
Passengerid | Identification number
Survived | Survival
Pclass | Ticket class
Name | Passenger name
Sex | Sex
Age | Age in years
Sibsp | Number of siblings or spouses aboard the Titanic
Parch | Number of parents or children aboard the Titanic
Ticket | Ticket number
Fare | Passenger fare
Cabin | Cabin number
Embarked | Port of embarkation

The dataset has 890 rows and 12 columns. It contains demographic information about the passengers (age, sex, ticket class, and so on) and the Survived (yes/no) target column.

Prerequisites

Complete the following prerequisite steps:

  1. Ensure that you have an AWS account, secure access to log in to the account via the AWS Management Console, and AWS Identity and Access Management (IAM) permissions to use Amazon SageMaker and Amazon Simple Storage Service (Amazon S3) resources.
  2. Download the Titanic dataset and upload it to an S3 bucket in your account (see the example command after this list).
  3. Onboard to a SageMaker domain and access Amazon SageMaker Studio to use Autopilot. For instructions, refer Onboard to Amazon SageMaker Domain. If you’re using existing Studio, upgrade to the latest version of Studio to use the new ensemble training mode.
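
For step 2, one way to copy the downloaded file to Amazon S3 is with the AWS CLI; the file name, bucket, and prefix below are placeholders.

aws s3 cp titanic.csv s3://<your-bucket>/autopilot/titanic/titanic.csv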

Create an Autopilot experiment with ensemble training mode

When the dataset is ready, you can initialize an Autopilot experiment in Studio. For full instructions, refer to Create an Amazon SageMaker Autopilot experiment. Create an Autopilot experiment by providing an experiment name and the data input, and specifying the target data to predict in the Experiment and data details section. Optionally, you can specify the data split ratio and auto creation of the Amazon S3 output location.

For our use case, we provide an experiment name, input Amazon S3 location, and choose Survived as the target. We keep the auto split enabled and override the default output Amazon S3 location.

Next, we specify the training method in the Training method section. You can either let Autopilot select the training mode automatically using Auto based on the dataset size, or select the training mode manually for either ensembling or HPO. The details on each option are as follows:

  • Auto – Autopilot automatically chooses either ensembling or HPO mode based on your dataset size. If your dataset is larger than 100 MB, Autopilot chooses HPO, otherwise it chooses ensembling.
  • Ensembling – Autopilot uses AutoGluon’s ensembling technique to train several base models and combines their predictions using model stacking into an optimal predictive model.
  • Hyperparameter optimization – Autopilot finds the best version of a model by tuning hyperparameters using the Bayesian Optimization technique and running training jobs on your dataset. HPO selects the algorithms most relevant to your dataset and picks the best range of hyperparameters to tune the models.

For our use case, we select Ensembling as our training mode.

After this, we proceed to the Deployment and advanced settings section. Here, we deselect the Auto deploy option. Under Advanced settings, you can specify the type of ML problem that you want to solve. If nothing is provided, Autopilot automatically determines the problem type based on the data you provide. Because ours is a binary classification problem, we choose Binary classification as our problem type and F1 as our objective metric.

Finally, we review our selections and choose Create experiment.
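
If you prefer to create the experiment programmatically rather than through the Studio UI, the CreateAutoMLJob API accepts the training mode directly. The following is a minimal sketch using boto3 with the same settings as our experiment; the bucket, prefix, and role ARN are placeholders.

import boto3

sm = boto3.client("sagemaker")

sm.create_auto_ml_job(
    AutoMLJobName="titanic-ens",
    InputDataConfig=[{
        "DataSource": {"S3DataSource": {
            "S3DataType": "S3Prefix",
            "S3Uri": "s3://<your-bucket>/autopilot/titanic/",   # placeholder input location
        }},
        "TargetAttributeName": "Survived",
    }],
    OutputDataConfig={"S3OutputPath": "s3://<your-bucket>/autopilot-output/"},  # placeholder output location
    ProblemType="BinaryClassification",
    AutoMLJobObjective={"MetricName": "F1"},
    AutoMLJobConfig={"Mode": "ENSEMBLING"},   # "AUTO" and "HYPERPARAMETER_TUNING" are the other options
    RoleArn="arn:aws:iam::<account-id>:role/<sagemaker-execution-role>",        # placeholder role
)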

At this point, it’s safe to leave Studio and return later to check on the result, which you can find on the Experiments menu.

The following screenshot shows the final results of our titanic-ens ensemble training mode Autopilot job.

You can see the multiple trials that were attempted by Autopilot in ensemble training mode. Each trial returns the best model from the pool of individual model runs and stacking ensemble model runs.

To explain this a little further, let's assume Trial 1 considered all eight supported algorithms and used stacking level 2. It internally creates individual models for each algorithm as well as weighted ensemble models at stack Levels 0, 1, and 2. However, the output of Trial 1 is the best model from the pool of models created.

Similarly, let's assume Trial 2 picked up only tree-based boosting algorithms. In this case, Trial 2 internally creates an individual model for each of the three algorithms as well as the weighted ensemble models, and returns the best model from its run.

The final model returned by a trial may or may not be a weighted ensemble model, but most trials will likely return their best weighted ensemble model. Finally, based on the selected objective metric, the best model among all 10 trials is identified.

In the preceding example, our best model was the one with the highest F1 score (our objective metric). Several other useful metrics, including accuracy, balanced accuracy, precision, and recall, are also shown. In our environment, the end-to-end runtime for this Autopilot experiment was 10 minutes.

Create an Autopilot experiment with HPO training mode

Now let's perform all of the aforementioned steps to create a second Autopilot experiment with the HPO training method (default 100 trials). Apart from the training method selection, which is now Hyperparameter optimization, everything else stays the same. In HPO mode, you can specify the number of trials by setting Max candidates under Advanced settings for Runtime, but we recommend leaving it at the default. If you don't provide a value for Max candidates, Autopilot runs 100 HPO trials. In our environment, the end-to-end runtime for this Autopilot experiment was 2 hours.
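
For reference, the only change needed in the programmatic sketch shown earlier is the job configuration. The following is a hedged example of the equivalent AutoMLJobConfig value, with the default of 100 candidates stated explicitly.

# Hypothetical job configuration for HPO mode; pass this as AutoMLJobConfig to create_auto_ml_job.
# 100 candidates is the default when Max candidates isn't set.
hpo_job_config = {
    "Mode": "HYPERPARAMETER_TUNING",
    "CompletionCriteria": {"MaxCandidates": 100},
}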

Runtime and performance metric comparison

We see that for our dataset (under 1 MB), ensemble training mode not only ran 12 times faster than HPO training mode (10 minutes versus 120 minutes), but it also produced a better F1 score and improved performance metrics overall.

Training Mode F1 Score Accuracy Balanced Accuracy AUC Precision Recall Log Loss Runtime
Ensemble mode – WeightedEnsemble 0.844 0.878 0.865 0.89 0.912 0.785 0.394 10 mins
HPO mode – XGBoost 0.784 0.843 0.824 0.867 0.831 0.743 0.428 120 mins

Inference

Now that we have a winner model, we can either deploy it to an endpoint for real-time inferencing or use batch transforms to make predictions on the unlabeled dataset we downloaded earlier.
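
For example, the best candidate's inference containers can be retrieved from the completed Autopilot job and registered as a SageMaker model before creating a real-time endpoint. The following is a minimal boto3 sketch; the model, endpoint, and role names are placeholders, and the instance type is only an example.

import boto3

sm = boto3.client("sagemaker")

# Look up the best candidate produced by the Autopilot job
best = sm.describe_auto_ml_job(AutoMLJobName="titanic-ens")["BestCandidate"]

# Register the candidate's inference containers as a SageMaker model
sm.create_model(
    ModelName="titanic-ens-best",
    Containers=best["InferenceContainers"],
    ExecutionRoleArn="arn:aws:iam::<account-id>:role/<sagemaker-execution-role>",  # placeholder role
)

# Deploy the model behind a real-time endpoint (instance type is an example)
sm.create_endpoint_config(
    EndpointConfigName="titanic-ens-best-config",
    ProductionVariants=[{
        "VariantName": "AllTraffic",
        "ModelName": "titanic-ens-best",
        "InstanceType": "ml.m5.large",
        "InitialInstanceCount": 1,
    }],
)
sm.create_endpoint(
    EndpointName="titanic-ens-best",
    EndpointConfigName="titanic-ens-best-config",
)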

Summary

With the new ensemble training mode, you can run your Autopilot experiments faster without any impact on performance for datasets smaller than 100 MB. To get started, create a SageMaker Autopilot experiment on the Studio console and select Ensembling as your training mode, or let Autopilot infer the training mode automatically based on the dataset size. You can refer to the CreateAutoMLJob API reference for updates to the API, and upgrade to the latest version of Studio to use the new ensemble training mode. For more information on this feature, see Model support, metrics, and validation with Amazon SageMaker Autopilot, and to learn more about Autopilot, visit the product page.


About the authors

Janisha Anand is a Senior Product Manager in the SageMaker Low/No Code ML team, which includes SageMaker Autopilot. She enjoys coffee, staying active, and spending time with her family.

Saket Sathe is a Senior Applied Scientist in the SageMaker Autopilot team. He is passionate about building the next generation of machine learning algorithms and systems. Aside from work, he loves to read, cook, slurp ramen, and play badminton.

Abhishek Singh is a Software Engineer for the Autopilot team in AWS. He has 8+ years experience as a software developer, and is passionate about building scalable software solutions that solve customer problems. In his free time, Abhishek likes to stay active by going on hikes or getting involved in pick up soccer games.

Vadim Omeltchenko is a Sr. AI/ML Solutions Architect who is passionate about helping AWS customers innovate in the cloud. His prior IT experience was predominantly on the ground.

Read More

Configure a custom Amazon S3 query output location and data retention policy for Amazon Athena data sources in Amazon SageMaker Data Wrangler

Amazon SageMaker Data Wrangler reduces the time that it takes to aggregate and prepare data for machine learning (ML) from weeks to minutes in Amazon SageMaker Studio, the first fully integrated development environment (IDE) for ML. With Data Wrangler, you can simplify the process of data preparation and feature engineering, and complete each step of the data preparation workflow, including data selection, cleansing, exploration, and visualization, from a single visual interface. You can import data from multiple data sources such as Amazon Simple Storage Service (Amazon S3), Amazon Redshift, Snowflake, and 26 federated query data sources supported by Amazon Athena.

Starting today, when importing data from Athena data sources, you can configure the S3 query output location and data retention period to import data in Data Wrangler to control where and how long Athena stores the intermediary data. In this post, we walk you through this new feature.

Solution overview

Athena is an interactive query service that makes it easy to browse the AWS Glue Data Catalog, and analyze data in Amazon S3 and 26 federated query data sources using standard SQL. When you use Athena to import data, you can use Data Wrangler's default S3 location for the Athena query output, or specify an Athena workgroup to enforce a custom S3 location. Previously, you had to implement cleanup workflows to remove this intermediary data, or manually set up an S3 lifecycle configuration to control storage cost and meet your organization's data security requirements. This creates significant operational overhead and doesn't scale.

Data Wrangler now supports custom S3 locations and data retention periods for your Athena query output. With this new feature, you can change the Athena query output location to a custom S3 bucket. You now have a default data retention policy of 5 days for the Athena query output, and you can change this to meet your organization’s data security requirements. Based on the retention period, the Athena query output in the S3 bucket gets cleaned up automatically. After you import the data, you can perform exploratory data analysis on this dataset and store the clean data back to Amazon S3.
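
Under the hood, this retention behavior corresponds to an S3 lifecycle rule on the query output location. The following is a minimal boto3 sketch of such a rule, shown only for illustration (Data Wrangler attaches the actual policy for you, and an example of that policy appears later in this post); the bucket name and prefix are placeholders.

import boto3

s3 = boto3.client("s3")

# Placeholder bucket and prefix for the Athena query output location
s3.put_bucket_lifecycle_configuration(
    Bucket="<your-athena-output-bucket>",
    LifecycleConfiguration={
        "Rules": [{
            "ID": "athena-query-output-retention",
            "Filter": {"Prefix": "athena/"},
            "Status": "Enabled",
            "Expiration": {"Days": 5},   # objects expire after the retention period
        }]
    },
)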

The following diagram illustrates this architecture.

For our use case, we use a sample bank dataset to walk through the solution. The workflow consists of the following steps:

  1. Download the sample dataset and upload it to an S3 bucket.
  2. Set up an AWS Glue crawler to crawl the schema and store the metadata schema in the AWS Glue Data Catalog.
  3. Use Athena to access the Data Catalog to query data from the S3 bucket.
  4. Create a new Data Wrangler flow to connect to Athena.
  5. When creating the connection, set the retention TTL for the dataset.
  6. Use this connection in the workflow and store the clean data in another S3 bucket.

For simplicity, we assume that you have already set up the Athena environment (steps 1–3). We detail the subsequent steps in this post.

Prerequisites

To set up the Athena environment, refer to the User Guide for step-by-step instructions, and complete steps 1–3 as outlined in the previous section.
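
If you haven't set up the AWS Glue crawler yet (step 2), the following is a minimal boto3 sketch; the crawler, database, role, and bucket names are placeholders to replace with your own.

import boto3

glue = boto3.client("glue")

# Placeholder names; replace with your own role, database, and bucket
glue.create_crawler(
    Name="bank-data-crawler",
    Role="arn:aws:iam::<account-id>:role/<glue-crawler-role>",
    DatabaseName="bank_db",
    Targets={"S3Targets": [{"Path": "s3://<your-bucket>/bank-data/"}]},
)
glue.start_crawler(Name="bank-data-crawler")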

Import your data from Athena to Data Wrangler

To import your data, complete the following steps:

  1. On the Studio console, choose the Resources icon in the navigation pane.
  2. Choose Data Wrangler on the drop-down menu.
  3. Choose New flow.
  4. On the Import tab, choose Amazon Athena.

    A detail page opens where you can connect to Athena and write a SQL query to import from the database.
  5. Enter a name for your connection.
  6. Expand Advanced configuration.
    When connecting to Athena, Data Wrangler uses Amazon S3 to stage the queried data. By default, this data is staged at the S3 location s3://sagemaker-{region}-{account_id}/athena/ with a retention period of 5 days.
  7. For Amazon S3 location of query results, enter your S3 location.
  8. Select Data retention period and set the data retention period (for this post, 1 day).
    If you deselect this option, the data persists indefinitely. Behind the scenes, Data Wrangler attaches an S3 lifecycle configuration policy to that S3 location to clean up the data automatically. See the following example policy:

     "Rules": [
            {
                "Expiration": {
                    "Days": 1
                },
                "ID": "sm-data-wrangler-retention-policy-xxxxxxx",
                "Filter": {
                    "Prefix": "athena/test"
                },
                "Status": "Enabled"
            }
        ]

    Your SageMaker execution role needs the s3:GetLifecycleConfiguration and s3:PutLifecycleConfiguration permissions to apply the lifecycle configuration policy correctly. Without these permissions, you get error messages when you try to import the data.

    The following error message is an example of missing the GetLifecycleConfiguration permission.

    The following error message is an example of missing the PutLifecycleConfiguration permission.

  9. Optionally, for Workgroup, you can specify an Athena workgroup.
    An Athena workgroup isolates users, teams, applications, or workloads into groups, each with its own permissions and configuration settings. When you specify a workgroup, Data Wrangler inherits the workgroup settings defined in Athena. For example, if a workgroup has an S3 location defined to store query results and Override client-side settings enabled, you can't edit the S3 query result location. By default, Data Wrangler also saves the Athena connection for you. This is displayed as a new Athena tile on the Import tab. You can always reopen that connection to query and bring different data into Data Wrangler.
  10. Deselect Save connection if you don’t want to save the connection.
  11. To configure the Athena connection, choose None for Sampling to import the entire dataset.

    For large datasets, Data Wrangler allows you to import a subset of your data to build out your transformation workflow, and only process the entire dataset when you're ready. This speeds up the iteration cycle and saves processing time and cost. To learn more about the different data sampling options available, see Amazon SageMaker Data Wrangler now supports random sampling and stratified sampling.
  12. For Data catalog, choose AwsDataCatalog.
  13. For Database, choose your database.

    Data Wrangler displays the available tables. You can choose each table to check the schema and preview the data.
  14. Enter the following code in the query field:
    Select *
    From bank_additional_full

  15. Choose Run to preview the data.
  16. If everything looks good, choose Import.
  17. Enter a dataset name and choose Add to import the data into your Data Wrangler workspace.

Analyze and process data with Data Wrangler

After you load the data into Data Wrangler, you can do exploratory data analysis (EDA) and prepare the data for machine learning.

  1. Choose the plus sign next to the bank-data dataset in the data flow, and choose Add analysis.
    Data Wrangler provides built-in analyses, including a Data Quality and Insights Report, data correlation, a pre-training bias report, a summary of your dataset, and visualizations (such as histograms and scatter plots). Additionally, you can create your own custom visualization.
  2. For Analysis type, choose Data Quality and Insights Report.
    This automatically generates visualizations, analyses to identify data quality issues, and recommendations for the right transformations required for your dataset.
  3. For Target column, choose Y.
  4. Because this is a classification problem statement, for Problem type, select Classification.
  5. Choose Create.

    Data Wrangler creates a detailed report on your dataset. You can also download the report to your local machine.
  6. For data preparation, choose the plus sign next to the bank-data dataset in the data flow, and choose Add transform.
  7. Choose Add step to start building your transformations.

At the time of this writing, Data Wrangler provides over 300 built-in transformations. You can also write your own transformations using Pandas or PySpark.
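
For example, a custom transform written with Pandas might look like the following sketch. In a Data Wrangler custom transform (Python Pandas), the current dataset is exposed as a DataFrame named df; the job column used here is an assumption based on the bank marketing dataset, so adjust it to your own schema.

# Sketch of a Data Wrangler custom transform (Python Pandas).
# Data Wrangler provides the current dataset as the DataFrame `df`;
# the `job` column is an assumption based on the bank marketing dataset.
df = df.drop_duplicates()                        # remove duplicate rows
df["job"] = df["job"].str.lower().str.strip()    # normalize a categorical column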

You can now start building your transforms and analyses based on your business requirements.

Clean up

To avoid ongoing costs, delete the Data Wrangler resources using the steps below when you’re finished.

  1. Choose the Running Instances and Kernels icon.
  2. Under RUNNING APPS, choose the shutdown icon next to the sagemaker-data-wrangler-1.0 app.
  3. Choose Shut down all to confirm.

Conclusion

In this post, we provided an overview of customizing your S3 location and enabling S3 lifecycle configurations for importing data from Athena to Data Wrangler. With this feature, you can store intermediary data in a secured S3 location, and automatically remove the data copy after the retention period to reduce the risk of unauthorized access to data. We encourage you to try out this new feature. Happy building!

To learn more about Athena and SageMaker, visit the Athena User Guide and Amazon SageMaker Documentation.


About the authors

 Meenakshisundaram Thandavarayan is a Senior AI/ML specialist with AWS. He helps hi-tech strategic accounts on their AI and ML journey. He is very passionate about data-driven AI.

Harish Rajagopalan is a Senior Solutions Architect at Amazon Web Services. Harish works with enterprise customers and helps them with their cloud journey.

James Wu is a Senior AI/ML Specialist Solution Architect at AWS, helping customers design and build AI/ML solutions. James's work covers a wide range of ML use cases, with a primary interest in computer vision, deep learning, and scaling ML across the enterprise. Prior to joining AWS, James was an architect, developer, and technology leader for over 10 years, including 6 years in engineering and 4 years in the marketing and advertising industries.

Read More

Use RStudio on Amazon SageMaker to create regulatory submissions for the life sciences industry

Pharmaceutical companies seeking approval from regulatory agencies such as the US Food & Drug Administration (FDA) or Japanese Pharmaceuticals and Medical Devices Agency (PMDA) to sell their drugs on the market must submit evidence to prove that their drug is safe and effective for its intended use. A team of physicians, statisticians, chemists, pharmacologists, and other clinical scientists reviews the clinical trial submission data and proposed labeling. If the review establishes that there is sufficient statistical evidence to prove that the health benefits of the drug outweigh the risks, the drug is approved for sale.

The clinical trial submission package consists of tabulated data, analysis data, trial metadata, and statistical reports consisting of statistical tables, listings, and figures. In the case of the US FDA, the electronic common technical document (eCTD) is the standard format for submitting applications, amendments, supplements, and reports to the FDA’s Center for Biologics Evaluation and Research (CBER) and Center for Drug Evaluation and Research (CDER). For the FDA and Japanese PMDA, it’s a regulatory requirement to submit tabulated data in CDISC Standard Data Tabulation Model (SDTM), analysis data in CDISC Analysis Dataset Model (ADaM), and trial metadata in CDISC Define-XML (based on Operational Data Model (ODM)).

In this post, we demonstrate how we can use RStudio on Amazon SageMaker to create such regulatory submission deliverables. This post describes the clinical trial submission process, how we can ingest clinical trial research data, tabulate and analyze the data, and then create statistical reports—summary tables, data listings, and figures (TLF). This method can enable pharmaceutical customers to seamlessly connect to clinical data stored in their AWS environment, process it using R, and help accelerate the clinical trial research process.

Drug development process

The drug development process can broadly be divided into five major steps, as illustrated in the following figure.

Drug Development Process

It takes on average 10–15 years and approximately USD $1–3 billion for one drug out of around 10,000 potential molecules to receive a successful approval. During the early phases of research (the drug discovery phase), promising drug candidates are identified, which move further to preclinical research. During the preclinical phase, researchers try to find out the toxicity of the drug by performing in vitro experiments in the lab and in vivo experiments on animals. After preclinical testing, drugs move on to the clinical trial research phase, where they must be tested on humans to ascertain their safety and efficacy. The researchers design clinical trials and detail the study plan in the clinical trial protocol. They define the different clinical research phases—from small Phase 1 studies to determine drug safety and dosage, to bigger Phase 2 trials to determine drug efficacy and side effects, to even bigger Phase 3 and 4 trials to determine drug efficacy and safety and to monitor adverse reactions. After successful human clinical trials, the drug sponsor files a New Drug Application (NDA) to market the drug. The regulatory agencies review all the data, work with the sponsor on prescription labeling information, and approve the drug. After the drug's approval, the regulatory agencies review post-market safety reports to ensure the product's ongoing safety.

In 1997, the Clinical Data Interchange Standards Consortium (CDISC), a global, non-profit organization comprising pharmaceutical companies, CROs, biotech, academic institutions, healthcare providers, and government agencies, was started as a volunteer group. CDISC has published data standards to streamline the flow of data from collection through submission and facilitate data interchange between partners and providers. CDISC has published the following standards:

  • CDASH (Clinical Data Acquisition Standards Harmonization) – Standards for collected data
  • SDTM (Study Data Tabulation Model) – Standards for submitting tabulated data
  • ADaM (Analysis Data Model) – Standards for analysis data
  • SEND (Standard for Exchange of Nonclinical Data) – Standards for nonclinical data
  • PRM (Protocol Representation Model) – Standards for protocol

These standards can help trained reviewers analyze data more effectively and quickly using standard tools, thereby reducing drug approval times. It’s a regulatory requirement from the US FDA and Japanese PMDA to submit all tabulated data using the SDTM format.

R for clinical trial research submissions

SAS and R are two of the most widely used statistical analysis software packages in the pharmaceutical industry. When development of the SDTM standards was started by CDISC, SAS was in almost universal use in the pharmaceutical industry and at the FDA. However, R is gaining tremendous popularity nowadays because it's open source, and new packages and libraries are continuously added. Students primarily use R during their academic studies and research, and they take this familiarity with R to their jobs. R also offers support for emerging technologies such as advanced deep learning integrations.

Cloud providers such as AWS have now become the platform of choice for pharmaceutical customers to host their infrastructure. AWS also provides managed services such as SageMaker, which makes it effortless to create, train, and deploy machine learning (ML) models in the cloud. SageMaker also allows access to the RStudio IDE from anywhere via a web browser. This post details how statistical programmers and biostatisticians can ingest their clinical data into the R environment, how R code can be run, and how results are stored. We provide snippets of code that allow clinical trial data scientists to ingest XPT files into the R environment, create R data frames for SDTM and ADaM, and finally create TLF that can be stored in an Amazon Simple Storage Service (Amazon S3) object storage bucket.

RStudio on SageMaker

On November 2, 2021, AWS in collaboration with RStudio PBC announced the general availability of RStudio on SageMaker, the industry’s first fully managed RStudio Workbench IDE in the cloud. You can now bring your current RStudio license to easily migrate your self-managed RStudio environments to SageMaker in just a few simple steps. To learn more about this exciting collaboration, check out Announcing RStudio on Amazon SageMaker.

Along with the RStudio Workbench, the RStudio suite for R developers also offers RStudio Connect and RStudio Package Manager. RStudio Connect is designed to allow data scientists to publish insights, dashboards, and web applications. It makes it easy to share ML and data science insights from data scientists’ complicated work and put it in the hands of decision-makers. RStudio Connect also makes hosting and managing content simple and scalable for wide consumption.

Solution overview

In the following sections, we discuss how we can import raw data from a remote repository or S3 bucket in RStudio on SageMaker. It’s also possible to connect directly to Amazon Relational Database Service (Amazon RDS) and data warehouses like Amazon Redshift (see Connecting R with Amazon Redshift) directly from RStudio; however, this is outside the scope of this post. After data has been ingested from a couple of different sources, we process it and create R data frames for a table. Then we convert the table data frame into an RTF file and store the results back in an S3 bucket. These outputs can then potentially be used for regulatory submission purposes, provided the R packages used in the post have been validated for use for regulatory submissions by the customer.

Set up RStudio on SageMaker

For instructions on setting up RStudio on SageMaker in your environment, refer to Get started with RStudio on SageMaker. Make sure that the execution role of RStudio on SageMaker has access to download and upload data to the S3 bucket in which data is stored. To learn more about how to manage R packages and publish your analysis using RStudio on SageMaker, refer to Announcing Fully Managed RStudio on SageMaker for Data Scientists.

Ingest data into RStudio

In this step, we ingest data from various sources to make it available for our R session. We import data in SAS XPT format; however, the process is similar if you want to ingest data in other formats. One of the advantages of using RStudio on SageMaker is that if the source data is stored in your AWS accounts, then SageMaker can natively access the data using AWS Identity and Access Management (IAM) roles.

Access data stored in a remote repository

In this step, we import SDTM data from the FDA's GitHub repository. We create a local directory called data in the RStudio environment to store the data and download demographics data (dm.xpt) from the remote repository. In this context, the local directory refers to a directory created on your private Amazon EFS storage, which is attached by default to your R session environment. See the following code:

######################################################
# Step 1.1 – Ingest Data from Remote Data Repository #
######################################################

# Remote Data Path 
raw_data_url = "https://github.com/FDA/PKView/raw/master/Installation%20Package/OCP/data/clinical/DRUG000/0000/m5/datasets/test001/tabulations/sdtm"
raw_data_name = "dm.xpt"

# Create a local directory to store the downloaded files
dir.create("data")
local_file_location <- paste0(getwd(), "/data/")
download.file(paste0(raw_data_url, "/", raw_data_name), paste0(local_file_location, raw_data_name))

When this step is complete, you can see dm.xpt being downloaded by navigating to Files, data, dm.xpt.

Access data stored in Amazon S3

In this step, we download data stored in an S3 bucket in our account. We have copied contents from the FDA’s GitHub repository to the S3 bucket named aws-sagemaker-rstudio for this example. See the following code:

#####################################################
# Step 1.2 - Ingest Data from S3 Bucket             #
#####################################################
library("reticulate")

SageMaker = import('sagemaker')
session <- SageMaker$Session()

s3_bucket = "aws-sagemaker-rstudio"
s3_key = "DRUG000/test001/tabulations/sdtm/pp.xpt"

session$download_data(local_file_location, s3_bucket, s3_key)

When the step is complete, you can see pp.xpt being downloaded by navigating to Files, data, pp.xpt.

Process XPT data

Now that we have SAS XPT files available in the R environment, we need to convert them into R data frames and process them. We use the haven library to read XPT files. We merge the CDISC SDTM datasets dm and pp to create the ADPP analysis dataset. Then we create a summary statistics table using the ADPP data frame. The summary table is then exported in RTF format.

First, XPT files are read using the read_xpt function of the haven library. Then an analysis dataset is created using the sqldf function of the sqldf library. See the following code:

########################################################
# Step 2.1 - Read XPT files. Create Analysis dataset.  #
########################################################

library(haven)
library(sqldf)


# Read XPT Files, convert them to R data frame
dm = read_xpt("data/dm.xpt")
pp = read_xpt("data/pp.xpt")

# Create ADaM dataset
adpp = sqldf("select a.USUBJID
                    ,a.PPCAT as ACAT
                    ,a.PPTESTCD
                    ,a.PPTEST
                    ,a.PPDTC
                    ,a.PPSTRESN as AVAL
                    ,a.VISIT as AVISIT
                    ,a.VISITNUM as AVISITN
                    ,b.sex
                from pp a 
           left join dm b 
                  on a.usubjid = b.usubjid
             ")

Then, an output data frame is created using functions from the Tplyr and dplyr libraries:

########################################################
# Step 2.2 - Create output table                       #
########################################################

library(Tplyr)
library(dplyr)

t = tplyr_table(adpp, SEX) %>% 
  add_layer(
    group_desc(AVAL, by = "Area under the concentration-time curve", where= PPTESTCD=="AUC") %>% 
      set_format_strings(
        "n"        = f_str("xx", n),
        "Mean (SD)"= f_str("xx.x (xx.xx)", mean, sd),
        "Median"   = f_str("xx.x", median),
        "Q1, Q3"   = f_str("xx, xx", q1, q3),
        "Min, Max" = f_str("xx, xx", min, max),
        "Missing"  = f_str("xx", missing)
      )
  )  %>% 
  build()

output = t %>% 
  rename(Variable = row_label1,Statistic = row_label2,Female =var1_F, Male = var1_M) %>% 
  select(Variable,Statistic,Female, Male)

The output data frame is then stored as an RTF file in the output folder in the RStudio environment:

#####################################################
# Step 3 - Save the Results as RTF                  #
#####################################################
library(rtf)

dir.create("output")
rtf = RTF("output/tab_adpp.rtf")  
addHeader(rtf,title="Section 1 - Tables", subtitle="This Section contains all tables")
addParagraph(rtf, "Table 1 - Pharmacokinetic Parameters by Sex:\n")
addTable(rtf, output)
done(rtf)

Upload outputs to Amazon S3

After the output has been generated, we put the data back in an S3 bucket. We can achieve this by creating a SageMaker session again, if a session isn’t active already, and uploading the contents of the output folder to an S3 bucket using the session$upload_data function:

#####################################################
# Step 4 - Upload outputs to S3                     #
#####################################################
library("reticulate")

SageMaker = import('sagemaker')
session <- SageMaker$Session()
s3_bucket = "aws-sagemaker-rstudio"
output_location = "output/"
s3_folder_name = "output"
session$upload_data(output_location, s3_bucket, s3_folder_name)

With these steps, we have ingested data, processed it, and uploaded the results to be made available for submission to regulatory authorities.

Clean up

To avoid incurring any unintended costs, you need to quit your current session. In the top right corner of the page, choose the power icon. This automatically stops the underlying instance, so you stop incurring compute costs.

Challenges

The post has outlined steps for ingesting raw data stored in an S3 bucket or from a remote repository. However, there are many other sources of raw data for a clinical trial, primarily eCRF (electronic case report forms) data stored in EDC (electronic data capture) systems such as Oracle Clinical, Medidata Rave, OpenClinica, or Snowflake; lab data; data from eCOA (clinical outcome assessment) and ePRO (electronic Patient-Reported Outcomes); real-world data from apps and medical devices; and electronic health records (EHRs) at the hospitals. Significant preprocessing is involved before this data can be made usable for regulatory submissions. Building connectors to various data sources and collecting them in a centralized data repository (CDR) or a clinical data lake, while maintaining proper access controls, poses significant challenges.

Another key challenge to overcome is that of regulatory compliance. The computer system used for creating regulatory submission outputs must be compliant with appropriate regulations, such as 21 CFR Part 11, HIPAA, GDPR, or any other GxP requirements or ICH guidelines. This translates to working in a validated and qualified environment with controls for access, security, backup, and auditability in place. This also means that any R packages that are used to create regulatory submission outputs must be validated before use.

Conclusion

In this post, we saw that some of the key deliverables for an eCTD submission are CDISC SDTM and ADaM datasets and TLF. This post outlined the steps needed to create these regulatory submission deliverables by first ingesting data from a couple of sources into RStudio on SageMaker. We then saw how we can process the ingested data in XPT format; convert it into R data frames to create SDTM, ADaM, and TLF; and finally upload the results to an S3 bucket.

We hope that with the broad ideas laid out in this post, statistical programmers and biostatisticians can easily visualize the end-to-end process of loading, processing, and analyzing clinical trial research data in RStudio on SageMaker and use the learnings to define a custom workflow suited to their regulatory submissions.

Can you think of any other applications of using RStudio to help researchers, statisticians, and R programmers to make their lives easier? We would love to hear about your ideas! And if you have any questions, please share them in the comments section.

Resources

For more information, visit the following links:


About the authors

Rohit Banga is a Global Clinical Development Industry Specialist based out of London, UK. He is a biostatistician by training and helps Healthcare and LifeScience customers deploy innovative clinical development solutions on AWS. He is passionate about how data science, AI/ML, and emerging technologies can be used to solve real business problems within the Healthcare and LifeScience industry. In his spare time, Rohit enjoys skiing, BBQing, and spending time with family and friends.

Georgios Schinas is a Specialist Solutions Architect for AI/ML in the EMEA region. He is based in London and works closely with customers in UK and Ireland. Georgios helps customers design and deploy machine learning applications in production on AWS with a particular interest in MLOps practices and enabling customers to perform machine learning at scale. In his spare time, he enjoys traveling, cooking and spending time with friends and family.

Read More