Active learning workflow for Amazon Comprehend custom classification models – Part 1.

The Amazon Comprehend Custom Classification API enables you to easily build custom text classification models using your business-specific labels without learning ML. For example, your customer support organization can use Custom Classification to automatically categorize inbound requests by problem type based on how the customer has described the issue. You can use custom classifiers to automatically label support emails with appropriate issue types, route customer phone calls to the right agents, and categorize social media posts into user segments.

For custom classification, you start by creating a training job with a ground truth dataset comprising a collection of text and corresponding category labels. Upon completing the job, you have a classifier that can classify any new text into one or more named categories. When the custom classification model classifies a new unlabeled text document, it predicts a label based on the patterns it learned from the training data. Sometimes your training dataset doesn't cover all the language patterns you encounter, or after you deploy the model, you start seeing completely new data patterns. In these cases, the model may not be able to classify these new data patterns accurately. How can you ensure continuous model training to keep the model up to date with new data and patterns?

In this two-part blog series, we discuss an architecture pattern that allows you to build an active learning workflow for Amazon Comprehend custom classification models. This first post describes a workflow comprising real-time classification, feedback pipelines, and human review workflows using Amazon Augmented AI (Amazon A2I). The second post covers automated model building using the human-reviewed data, selecting the best model, and automated deployment of an endpoint for the chosen model.

Feedback loops play a pivotal role in keeping the models up to date. This feedback helps the models learn about their misclassifications and the correct labels. This process of continuously teaching the models through feedback and deploying them is called active learning.

For every prediction Amazon Comprehend Custom Classification makes, it also gives a confidence score associated with its prediction. This architecture proposes that you set an acceptable threshold and only accept the predictions with a confidence score that exceeds the threshold. All the predictions that have a confidence score less than the desired threshold are flagged for human review. The human decides whether to accept the model’s prediction or correct it.

In some instances, the model may be confident about its predictions, but the classification might be wrong. In these scenarios, the end-user applications that receive the model predictions can request explicit feedback from its users on the prediction quality. A human moderator reviews this explicit feedback and reclassifies instances where the feedback was negative. This process of generating human-verified data and using it for model retraining helps keep the models up to date, reduce data drift, and achieve higher model accuracy.

Feedback Workflow Architecture

In this section, we discuss an architectural pattern for implementing an end-to-end active learning workflow for custom classification models in Amazon Comprehend using Amazon A2I. The active learning workflow comprises the following components:

  1. Real-time classification
  2. Feedback loops
  3. Human classification
  4. Model building
  5. Model selection
  6. Model deployment

The following diagram illustrates this architecture covering the first three components. In the following sections, we walk you through each step in the workflow.

Architecture Diagram for Feedback Loops

Real-time classification

To use custom classification in Amazon Comprehend, you need to create a custom classification job that reads a ground truth dataset from an Amazon Simple Storage Service (Amazon S3) bucket and builds a classification model. After the model builds successfully, you can create an endpoint that allows you to make real-time classifications of unlabeled text. This stage is represented by steps 1–3 in the preceding architecture:

  1. The end-user application calls an API Gateway endpoint with a text that needs to be classified.
  2. The API Gateway endpoint then calls an AWS Lambda function configured to call an Amazon Comprehend endpoint.
  3. The Lambda function calls the Amazon Comprehend endpoint, which returns the unlabeled text classification and a confidence score.
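The following is a minimal sketch of such a Lambda function, not the exact code deployed by the CloudFormation template discussed later; the environment variable names, event shape, and response format are assumptions for illustration. It uses the Amazon Comprehend ClassifyDocument API against the custom endpoint and flags low-confidence predictions for review:

import json
import os

import boto3

comprehend = boto3.client("comprehend")

# Assumed environment variables; the CloudFormation template may use different names
ENDPOINT_ARN = os.environ["COMPREHEND_ENDPOINT_ARN"]
THRESHOLD = float(os.environ.get("CLASSIFICATION_SCORE_THRESHOLD", "0.5"))

def lambda_handler(event, context):
    # Assumed request shape: {"classifier": "...", "sentence": "..."}
    text = json.loads(event["body"])["sentence"]

    # Real-time classification against the Amazon Comprehend custom endpoint
    result = comprehend.classify_document(Text=text, EndpointArn=ENDPOINT_ARN)
    top = max(result["Classes"], key=lambda c: c["Score"])

    # Predictions below the threshold are flagged as implicit feedback for human review
    flagged_for_review = top["Score"] < THRESHOLD

    return {
        "statusCode": 200,
        "body": json.dumps({
            "label": top["Name"],
            "score": top["Score"],
            "flagged_for_review": flagged_for_review,
        }),
    }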

Feedback collection

When the endpoint returns the classification and the confidence score during the real-time classification, you can send instances with low-confidence scores to human review. This type of feedback is called implicit feedback.

  4. The Lambda function sends the implicit feedback to an Amazon Kinesis Data Firehose.

The other type of feedback is called explicit feedback and comes from the application's end users who use the custom classification feature. This type of feedback comprises the instances of text where the user wasn't happy with the prediction. Explicit feedback can be sent either in real time through an API or via a batch process.

  5. End-users of the application submit explicit real-time feedback through an API Gateway endpoint.
  6. The Lambda function backing the API endpoint transforms the data into a standard feedback format and writes it to the Kinesis Data Firehose delivery stream.
  7. End-users of the application can also submit explicit feedback as a batch file by uploading it to an S3 bucket.
  8. A trigger configured on the S3 bucket invokes a Lambda function.
  9. The Lambda function transforms the data into a standard feedback format and writes it to the delivery stream.
  10. Both the implicit and explicit feedback data gets sent to a delivery stream in a standard format. All this data is buffered and written to an S3 bucket.
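The following sketch shows how the feedback-handling Lambda functions can write records in a common format to the delivery stream using the Kinesis Data Firehose PutRecord API; the delivery stream name and record fields are assumptions, not the exact format used by the CloudFormation template:

import json
import os

import boto3

firehose = boto3.client("firehose")

# Assumed name; the actual delivery stream is created by the CloudFormation template
DELIVERY_STREAM = os.environ.get("FEEDBACK_DELIVERY_STREAM", "feedback-delivery-stream")

def send_feedback(classifier, sentence, predicted_label=None, source="implicit"):
    """Write one feedback record in a standard format to the delivery stream."""
    record = {
        "classifier": classifier,
        "sentence": sentence,
        "predicted_label": predicted_label,
        "source": source,  # "implicit" or "explicit"
    }
    firehose.put_record(
        DeliveryStreamName=DELIVERY_STREAM,
        Record={"Data": json.dumps(record) + "\n"},
    )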

Human classification

The human classification stage includes the following steps:

  11. A trigger configured on the feedback bucket in Step 10 invokes a Lambda function.
  12. The Lambda function creates Amazon A2I human review tasks for all the feedback data received.
  13. Workers assigned to the classification jobs log in to the human review portal and either approve the classification by the model or classify the text with the right labels.
  14. After the human review, all these instances are stored in an S3 bucket and used for retraining the models. Part 2 of this series covers the retraining workflow.
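The Lambda function in this stage can create the review tasks with the Amazon A2I runtime API, as in the following sketch; the environment variable name and the input fields are assumptions, and the exact InputContent schema depends on the worker task template you create later in this post:

import json
import os
import uuid

import boto3

a2i = boto3.client("sagemaker-a2i-runtime")

# Assumed environment variable holding the human review workflow (flow definition) ARN
FLOW_DEFINITION_ARN = os.environ["HUMAN_REVIEW_WORKFLOW_ARN"]

def start_review(sentence, predicted_label, score):
    """Create one Amazon A2I human loop for a piece of feedback."""
    response = a2i.start_human_loop(
        HumanLoopName=f"classify-review-{uuid.uuid4()}",
        FlowDefinitionArn=FLOW_DEFINITION_ARN,
        HumanLoopInput={
            "InputContent": json.dumps({
                "taskObject": sentence,              # field names assumed by the worker template
                "predictedLabel": predicted_label,
                "score": score,
            })
        },
    )
    return response["HumanLoopArn"]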

Solution overview

The next few sections of the post go over how to set up this architecture in your AWS account. We use the AG News dataset to classify news into four categories (World, Sports, Business, and Sci/Tech) and set up the implicit and explicit feedback loops. You need to complete two manual steps:

  1. Create an Amazon Comprehend custom classifier and an endpoint.
  2. Create an Amazon SageMaker private workforce, worker task template, and human review workflow.

After this, you run the provided AWS CloudFormation template to set up the rest of the architecture.

Prerequisites

Before you get started, download the dataset and upload it to Amazon S3. This dataset comprises a collection of news articles and their corresponding category labels. We have created a training dataset called train.csv from the original dataset and made it available for download.

The following screenshot shows a sample of the train.csv file.

CSV file representing the Training data set

After you download the train.csv file, upload it to an S3 bucket in your account for reference during training. For more information about uploading files, see How do I upload files and folders to an S3 bucket?

Creating a custom classifier and an endpoint

To create your classifier for classifying news, complete the following steps:

  1. On the Amazon Comprehend console, choose Custom Classification.
  2. Choose Train classifier.
  3. For Name, enter news-classifier-demo.
  4. Select Using Multi-class mode.
  5. For Training data S3 location, enter the path for train.csv in your S3 bucket, for example, s3://<your-bucketname>/train.csv.
  6. For Output data S3 location, enter the S3 bucket path where you want the output, such as s3://<your-bucketname>/.
  7. For IAM role, select Create an IAM role.
  8. For Permissions to access, choose Input and output (if specified) S3 bucket.
  9. For Name suffix, enter ComprehendCustom.

Comprehend Custom Classification Model Creation

  10. Scroll down and choose Train classifier to start the training process.

The training takes some time to complete. You can either wait to create an endpoint or come back to this step later after finishing the steps in the section Creating a private workforce, worker task template, and human review workflow.

Creating a custom classifier real-time endpoint

To create your endpoint, complete the following steps:

  1. On the Amazon Comprehend console, choose Custom Classification.
  2. From the Classifiers list, select your custom model news-classifier-demo (the model for which you want to create the endpoint).
  3. From the Actions drop-down menu, choose Create endpoint.
  4. For Endpoint name, enter classify-news-endpoint and give it one inference unit.
  5. Choose Create endpoint.
  6. Copy the endpoint ARN as shown in the following screenshot. You use it when running the CloudFormation template in a future step.

Custom Classification Model Endpoint Page

Creating a private workforce, worker task template, and human review workflow

This section walks you through creating a private workforce in Amazon SageMaker, a worker task template, and your human review workflow.

Creating a labeling workforce

  1. For this post, create a private work team and add only one user (yourself) to it. For instructions, see Create a Private Workforce (Amazon SageMaker Console).
  2. After the user accepts the invitation, add them to the workforce. For instructions, see the Add a Worker to a Work Team section of Manage a Workforce (Amazon SageMaker Console).

Creating a worker task template

To create a worker task template, complete the following steps:

  1. On the Amazon A2I console, choose Worker task templates.
  2. Choose Create template.
  3. For Template name, enter custom-classification-template.
  4. For Template type, choose Custom.
  5. In the Template editor, enter the following GitHub UI template code.
  6. Choose Create.

Worker Task Template

Creating a human review workflow

To create your human review workflow, complete the following steps:

  1. On the Amazon A2I console, choose Human review workflows.
  2. Choose Create human review workflow.
  3. For Name, enter classify-workflow.
  4. Specify an S3 bucket to store output: s3://<your bucketname>/.

Use the same bucket where you uploaded train.csv in the prerequisite step.

  5. For IAM role, select Create a new role.
  6. For Task type, choose Custom.
  7. Under Worker task template creation, select the custom classification template you created.
  8. For Task description, enter Read the instructions and review the document.
  9. Under Workers, select Private.
  10. Use the drop-down list to choose the private team that you created.
  11. Choose Create.
  12. Copy the workflow ARN (see the following screenshot) to use when initializing the CloudFormation parameters.

Human Review Workflow Page

Deploying the CloudFormation template to set up active learning feedback

Now that you have completed the manual steps, you can run the CloudFormation template to set up this architecture’s building blocks, including the real-time classification, feedback collection, and the human classification.

Before deploying the CloudFormation template, make sure you have the following to pass as parameters:

  • Custom classifier endpoint ARN
  • Amazon A2I workflow ARN
  1. Choose Launch Stack:

  2. Enter the following parameters:
    1. ComprehendEndpointARN – The endpoint ARN you copied.
    2. HumanReviewWorkflowARN – The workflow ARN you copied.
    3. ComrehendClassificationScoreThreshold – Enter 0.5, which sets a 50% confidence threshold; predictions scoring below it are routed for human review.

CloudFormation Required Parameters

  3. Choose Next until you reach the Capabilities section.
  4. Select the check box to acknowledge that AWS CloudFormation may create AWS Identity and Access Management (IAM) resources and expand the template.

For more information about these resources, see AWS IAM resources.

  5. Choose Create stack.

Acknowledgement section of the CloudFormation Page

Wait until the status of the stack changes from CREATE_IN_PROGRESS to CREATE_COMPLETE.

CloudFormation Outputs

  6. On the Outputs tab of the stack (see the following screenshot), copy the values for BatchUploadS3Bucket, FeedbackAPIGatewayID, and TextClassificationAPIGatewayID to interact with the feedback loop.
  7. Both the TextClassificationAPI and the FeedbackAPI require an API key. The CloudFormation output ApiGWKey refers to the name of the API key. Currently, this API key is associated with a usage plan that allows 2,000 requests per month.
  8. On the API Gateway console, choose either the TextClassificationAPI or the FeedbackAPI. Choose API Keys in the navigation pane, choose your API key from step 7, then expand the API key section in the right pane and copy the value.

API Key page

  9. You can manage the usage plan by following the instructions in Create, configure, and test usage plans with the API Gateway console.
  10. You can also add fine-grained authentication and authorization to your APIs. For more information on securing your APIs, see Controlling and managing access to a REST API in API Gateway.

Testing the feedback loop

In this section, we walk you through testing your feedback loop, including real-time classification, implicit and explicit feedback, and human review tasks.

Real-time classification

To interact with and test these APIs, download Postman.

The API Gateway endpoint receives an unlabeled text document from a client application and internally calls the custom classification endpoint, which returns the predicted label and a confidence score.

  1. Open Postman and enter the TextClassificationAPIGateway URL in POST method.
  2. In the Headers section, configure the API key: x-api-key: <your API key>.
  3. In the text field, enter the following JSON code (make sure you have JSON selected and enable raw):
{"classifier":"<your custom classifier name>", "sentence":"MS Dhoni retires and a billion people had mixed feelings."}
  4. Choose Send.

You get a response back with a confidence score and class, as seen in the following screenshot.

Sample JSON request to the Classify Text API endpoint.
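If you prefer to call the API programmatically instead of using Postman, a request like the following does the same thing; the URL path and the key value are placeholders for the values from your stack:

import requests

url = "https://<TextClassificationAPIGatewayID>.execute-api.<region>.amazonaws.com/<stage>/<resource>"
headers = {"x-api-key": "<your API key>", "Content-Type": "application/json"}
payload = {
    "classifier": "news-classifier-demo",
    "sentence": "MS Dhoni retires and a billion people had mixed feelings.",
}

response = requests.post(url, json=payload, headers=headers)
print(response.json())  # contains the predicted class and confidence score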

Implicit feedback

When the endpoint returns the classification and the confidence score during the real-time classification, you can route all the instances where the confidence score doesn’t meet the threshold to human review. This type of feedback is called implicit feedback. For this post, we set the threshold as 0.5 as an input to the CloudFormation stack parameter.

You can change this threshold when deploying the CloudFormation template based on your needs.

Explicit feedback

The explicit feedback comes from the end-users of the application that uses the custom classification feature. This type of feedback comprises the instances of text where the user wasn't happy with the prediction. You can send explicit feedback on the model's predicted label through the following methods:

  • Real time through an API, which is usually triggered through a like/dislike button on a UI.
  • Batch process, where a file with a collection of misclassified utterances is put together based on a user survey conducted by the customer outreach team.

Invoking the explicit real-time feedback loop

To test the Feedback API, complete the following steps:

  1. Open Postman and enter the FeedbackAPIGatewayID value from your CloudFormation stack output in POST method.
  2. In the Headers section, configure the API key: x-api-key: <your API key>.
  3. In the text field, enter the following JSON code (for classifier, enter the classifier you created, such as news-classifier-demo, and make sure you have JSON selected and enable raw):
{"classifier":"<your custom classifier name>","sentence":"Sachin is Indian Cricketer."}
  4. Choose Send.

Sample JSON request to the Feedback API endpoint.

Submitting explicit feedback as a batch file

Download the following test feedback JSON file, populate it with your data, and upload it into the BatchUploadS3Bucket created when you deployed your CloudFormation template. The following code shows some sample data in the file:

{
   "classifier":"news-classifier-demo",
   "sentences":[
      "US music firms take legal action against 754 computer users alleged to illegally swap music online.",
      "A gamer spends $26,500 on a virtual island that exists only in a PC role-playing game."
   ]
}

Uploading the file triggers the Lambda function that starts your human review loop.
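For example, you can upload the batch file with boto3; the bucket name comes from the BatchUploadS3Bucket stack output, and the local file name and object key are placeholders:

import boto3

s3 = boto3.client("s3")
s3.upload_file(
    Filename="batch_feedback.json",  # local file containing the JSON shown above
    Bucket="<BatchUploadS3Bucket value from the stack outputs>",
    Key="batch_feedback.json",
)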

Human review tasks

All the feedback collected through the implicit and explicit methods is sent for human classification. The labeling workforce can include Amazon Mechanical Turk, private teams, or AWS Marketplace vendors. For this post, we create a private workforce. The URL to the labeling portal is located on the Amazon SageMaker console, on the Labeling workforces page, on the Private tab.

Private Workforce section of the SageMaker console.

After you log in, you can see the human review tasks assigned to you. Select the task to complete and choose Start working.

Human Review Task Page

You see the tasks displayed based on the worker template used when creating the human workflow.

Human Review Task

After you complete the human classification and submit the tasks, the human-reviewed data is stored in the S3 bucket you configured when creating the human review workflow. To find it, go to the Amazon SageMaker console, choose Human review workflows, and open the output location:

Human Review Task Output Location

This human-reviewed data is used to retrain the custom classification model to learn newer patterns and improve its overall accuracy. Below is a screenshot of the human-annotated output file output.json in the S3 bucket:

Human Review Task Output payload

The process of retraining the models with human-reviewed data, selecting the best model, and automatically deploying the new endpoints completes the active learning workflow. We cover these remaining steps in Part 2 of this series.

Cleaning up

To remove all resources created throughout this process and prevent additional costs, complete the following steps:

  1. On the Amazon S3 console, delete the S3 bucket that contains the training dataset.
  2. On the Amazon Comprehend console, delete the endpoint and the classifier.
  3. On the Amazon A2I console, delete the human review workflow, worker template, and the private workforce.
  4. On the AWS CloudFormation console, delete the stack you created. (This removes the resources the CloudFormation template created.)

Conclusion

Amazon Comprehend helps you build scalable and accurate natural language processing capabilities without any machine learning experience. This post provides a reusable pattern and infrastructure for active learning workflows for custom classification models. The feedback pipelines and human review workflow help the custom classifier learn new data patterns continuously. The second part of this series covers the automatic model building, selection, and deployment of custom classification models.

For more information, see Custom Classification. You can discover other Amazon Comprehend features and get inspiration from other AWS blog posts about how to use Amazon Comprehend beyond classification.


About the Authors

Shanthan Kesharaju is a Senior Architect in the AWS ProServe team. He helps our customers with AI/ML strategy, architecture, and developing products with a purpose. Shanthan has an MBA in Marketing from Duke University and an MS in Management Information Systems from Oklahoma State University.

Mona Mona is an AI/ML Specialist Solutions Architect based out of Arlington, VA. She works with the Worldwide Public Sector team and helps customers adopt machine learning on a large scale. She is passionate about NLP and ML explainability in AI/ML.

Joyson Neville Lewis obtained his master's in Information Technology from Rutgers University in 2018. He worked as a software/data engineer before diving into the conversational AI domain in 2019, where he works with companies to connect the dots between business and AI using voice and chatbot solutions. Joyson joined Amazon Web Services in February 2018 as a Big Data Consultant for the AWS Professional Services team in NYC.

Read More

Data visualization and anomaly detection using Amazon Athena and Pandas from Amazon SageMaker

Many organizations use Amazon SageMaker for their machine learning (ML) requirements and source data from a data lake stored on Amazon Simple Storage Service (Amazon S3). The petabyte-scale source data on Amazon S3 may not always be clean because data lakes ingest data from several source systems, such as flat files, external feeds, databases, and Hadoop. It may contain extreme values in source attributes, considered outliers in the data. Outliers arise due to changes in system behavior, fraudulent behavior, human error, instrument error, missing data, or simply through natural deviations in populations. Outliers in training data can easily impact the accuracy of many ML models, like linear and logistic regression, leaving ML scientists and analysts with skewed results. Outliers can dramatically impact ML models, changing the model equation completely and producing bad predictions or estimations.

Data scientists and analysts are looking for a way to remove outliers. Analysts often come from a strong data background and are fluent in writing SQL queries and programming. The following tools are a natural choice for ML scientists to remove outliers and carry out data visualization:

  • Amazon Athena – An interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL
  • Pandas – An open-source, high-performance, easy-to-use Python library that provides data structures and data analysis tools and integrates with plotting libraries such as Matplotlib
  • Amazon SageMaker – A fully managed service that provides you with the ability to build, train, and deploy ML models quickly

To illustrate how to use Athena with Pandas for anomaly detection and visualization using Amazon SageMaker, we clean a set of New York City Taxi and Limousine Commission (TLC) Trip Record Data by removing outlier records. In this dataset, outliers are trips whose duration spans multiple days, is 0 seconds, or is less than 0 seconds. Then we use the Matplotlib library to plot graphs that visualize the trip duration values.

Solution overview

To implement this solution, you perform the following high-level steps:

  1. Create an AWS Glue Data Catalog and browse the data on the Athena console.
  2. Create an Amazon SageMaker Jupyter notebook and install PyAthena.
  3. Identify anomalies using Athena SQL and Pandas from the Jupyter notebook.
  4. Visualize data and remove outliers using Athena SQL and Pandas.

The following diagram illustrates the architecture of this solution.

Prerequisites

To follow this post, you should be familiar with the following:

  • The Amazon S3 file upload process
  • AWS Glue crawlers and the Data Catalog
  • Basic SQL queries
  • Jupyter notebooks
  • Assigning a basic AWS Identity and Access Management (IAM) policy to a role

Preparing the data

For this post, we use New York City Taxi and Limousine Commission (TLC) Trip Record Data, which is a publicly available dataset.

  1. Download the file yellow_tripdata_2019-01.csv to your local machine.
  2. Create an S3 bucket, for example s3-yellow-cab-trip-details (your bucket name will be different).
  3. Upload the file to your bucket using the Amazon S3 console.

Creating the Data Catalog and browsing the data

After you upload the data to Amazon S3, you create the Data Catalog in AWS Glue. This allows you to run SQL queries using Athena.

  1. On the AWS Glue console, create a new database.
  2. For Database name, enter db_yellow_cab_trip_details.

  3. Create an AWS Glue crawler to gather the metadata in the file and catalog it.

For this post, I use the database (db_yellow_cab_trip_details) and save the crawled tables with the prefix src_.

  4. Run the crawler.
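If you prefer to script this step instead of using the console, the following sketch creates and starts an equivalent crawler with boto3; the crawler name, IAM role, and bucket path are placeholders:

import boto3

glue = boto3.client("glue")

glue.create_crawler(
    Name="yellow-cab-trip-crawler",  # placeholder crawler name
    Role="arn:aws:iam::<account-id>:role/<glue-crawler-role>",
    DatabaseName="db_yellow_cab_trip_details",
    Targets={"S3Targets": [{"Path": "s3://s3-yellow-cab-trip-details/"}]},
    TablePrefix="src_",
)
glue.start_crawler(Name="yellow-cab-trip-crawler")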

The crawler can take 2–3 minutes to complete. You can check the status on Amazon CloudWatch.

The following screenshot shows the crawler details on the AWS Glue console.

When the crawler is complete, the table is available in the Data Catalog. All the metadata and column-level information is displayed with corresponding data types.

We can now check the data on the Athena console to make sure we can read the file as a table and run a SQL query.

Run your query with the following code:

SELECT * FROM db_yellow_cab_trip_details.src_yellow_cab_trip_details limit 10;

The following screenshot shows your output on the Athena console.

Creating a Jupyter notebook and installing PyAthena

You now create a new notebook instance from Amazon SageMaker and install PyAthena using Jupyter.

Amazon SageMaker has managed built-in Jupyter notebooks that allow you to write code in Python, Julia, R, or Scala to explore, analyze, and do modeling with a small set of data.

Make sure the role used for your notebook has access to Athena (use IAM policies to verify, and add S3FullAccess and AmazonAthenaFullAccess if needed).

To create your notebook, complete the following steps:

  1. On the Amazon SageMaker console, under Notebook, choose Notebook instances.
  2. Choose Create notebook instance.

  3. On the Create notebook instance page, enter a name and choose an instance type.

We recommend using an ml.m4.10xlarge instance, due to the size of the dataset. You should choose an appropriate instance depending on your data; costs vary for different instances.

Wait until the Notebook instance status shows as InService (this step can take up to 5 minutes).

  4. When the instance is ready, choose Open Jupyter.

  5. Open the conda_python3 kernel from the notebook instance.
  6. Enter the following commands to install PyAthena:
! pip install --upgrade pip
! pip install PyAthena

The following screenshot shows the output.

You can also install PyAthena when you create the notebook instance by using lifecycle configurations. See the following code:

#!/bin/bash
sudo -u ec2-user -i <<'EOF'
source /home/ec2-user/anaconda3/bin/activate python3
pip install --upgrade pip
pip install --upgrade  PyAthena
source /home/ec2-user/anaconda3/bin/deactivate
EOF

The following screenshot shows where you enter the preceding code in the Scripts section when creating a lifecycle configuration.

You can run a SQL query from the notebook to validate connectivity to Athena and pull data for visualization.

To import the libraries, enter the following code:

from pyathena import connect
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

from IPython.core.display import display, HTML
display(HTML("<style>.container { width:100% !important; }</style>"))

To connect to Athena, enter the following code:

conn = connect(s3_staging_dir='s3://<your-Query-result-location>',region_name='us-east-1')

To check the sample data, enter the following query:

df_sample = pd.read_sql("SELECT * FROM db_yellow_cab_trip_details.src_yellow_cab_trip_details limit 10", conn)
df_sample

The following screenshot shows the output.

Detecting anomalies with Athena, Pandas, and Amazon SageMaker

Now that we can connect to Athena, we can run SQL queries to find the records that have unusual trip_duration values.

The following Athena query checks anomalies in the trip_duration data to find the top 50 records with the maximum duration:

df_anomaly_duration = pd.read_sql("""select tpep_dropoff_datetime, tpep_pickup_datetime,
date_diff('second', cast(tpep_pickup_datetime as timestamp), cast(tpep_dropoff_datetime as timestamp)) as duration_second,
date_diff('minute', cast(tpep_pickup_datetime as timestamp), cast(tpep_dropoff_datetime as timestamp)) as duration_minute,
date_diff('hour', cast(tpep_pickup_datetime as timestamp), cast(tpep_dropoff_datetime as timestamp)) as duration_hour,
date_diff('day', cast(tpep_pickup_datetime as timestamp), cast(tpep_dropoff_datetime as timestamp)) as duration_day
from db_yellow_cab_trip_details.src_yellow_cab_trip_details
order by 3 desc limit 50""", conn)

df_anomaly_duration

The following screenshot shows the output; there are many outliers (trips with a duration greater than 1 day).

The output shows the duration in seconds, minutes, hours, and days.

The following query checks for anomalies and shows the top 50 records with the lowest minimum duration (negative value or 0 seconds):

df_anomaly_duration = pd.read_sql("""select tpep_dropoff_datetime, tpep_pickup_datetime,
date_diff('second', cast(tpep_pickup_datetime as timestamp), cast(tpep_dropoff_datetime as timestamp)) as duration_second,
date_diff('minute', cast(tpep_pickup_datetime as timestamp), cast(tpep_dropoff_datetime as timestamp)) as duration_minute,
date_diff('hour', cast(tpep_pickup_datetime as timestamp), cast(tpep_dropoff_datetime as timestamp)) as duration_hour,
date_diff('day', cast(tpep_pickup_datetime as timestamp), cast(tpep_dropoff_datetime as timestamp)) as duration_day
from db_yellow_cab_trip_details.src_yellow_cab_trip_details
order by 3 asc limit 50""", conn)

df_anomaly_duration

The following screenshot shows the output; multiple trips have a negative value or duration of 0.

Similarly, we can use different SQL queries to analyze the data and find other outliers. We can also clean the data with SQL queries and, if needed, save the cleaned data back to Amazon S3 using CTAS queries.
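For example, a CTAS query run through the same PyAthena connection can write a cleaned copy of the table back to Amazon S3; the target table name and external location are placeholders:

ctas_query = """
CREATE TABLE db_yellow_cab_trip_details.clean_yellow_cab_trip_details
WITH (format = 'PARQUET',
      external_location = 's3://<your-bucket>/clean-trip-data/') AS
SELECT *
FROM db_yellow_cab_trip_details.src_yellow_cab_trip_details
WHERE date_diff('second', cast(tpep_pickup_datetime as timestamp),
                cast(tpep_dropoff_datetime as timestamp)) > 0
  AND date_diff('day', cast(tpep_pickup_datetime as timestamp),
                cast(tpep_dropoff_datetime as timestamp)) < 1
"""
conn.cursor().execute(ctas_query)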

Visualizing the data and removing outliers

Pull the data into a Pandas DataFrame using the following Athena query, and use matplotlib.pyplot to create a graph that reveals the outliers:

df_full = pd.read_sql("""SELECT date_diff('second', cast(tpep_pickup_datetime as timestamp), cast(tpep_dropoff_datetime as timestamp)) as duration_second
from db_yellow_cab_trip_details.src_yellow_cab_trip_details""", conn)
plt.figure(figsize=(12,12))
plt.scatter(range(len(df_full["duration_second"])), np.sort(df_full["duration_second"]))
plt.xlabel('index')
plt.ylabel('duration_second')

plt.show()

The process of plotting the full dataset can take 7–10 minutes. To reduce time, add a limit to the number of records in the query:

SELECT date_diff('second', cast(tpep_pickup_datetime as timestamp), cast(tpep_dropoff_datetime as timestamp)) as duration_second 
from db_yellow_cab_trip_details.src_yellow_cab_trip_details limit 100000 

The following screenshot shows the output.

To ignore the outliers, run the following query and replot the graph after removing the outlier records in which the duration is equal to or less than 0 seconds, or longer than 1 day:

df_clean_data = pd.read_sql("""SELECT date_diff('second', cast(tpep_pickup_datetime as timestamp), cast(tpep_dropoff_datetime as timestamp)) as duration_second
from db_yellow_cab_trip_details.src_yellow_cab_trip_details
where date_diff('second', cast(tpep_pickup_datetime as timestamp), cast(tpep_dropoff_datetime as timestamp)) > 0
and date_diff('day', cast(tpep_pickup_datetime as timestamp), cast(tpep_dropoff_datetime as timestamp)) < 1""", conn)
plt.figure(figsize=(12,12))
plt.scatter(range(len(df_clean_data["duration_second"])), np.sort(df_clean_data["duration_second"]))
plt.xlabel('index')
plt.ylabel('duration_second')
plt.show()

The following screenshot shows the output.

Cleaning up

When you’re done, delete the notebook instance to avoid recurring deployment costs.

  1. On the Amazon SageMaker console, choose your notebook instance.
  2. Choose Stop.
  3. When the status shows as Stopped, choose Delete.

Conclusion

This post walked you through finding and removing outliers from your dataset and visualizing the data. We used an Amazon SageMaker notebook to run analytical queries with Athena SQL, and used Athena to read the dataset, which is saved in Amazon S3 with its metadata catalog in AWS Glue. We used Athena queries to find anomalies in the data and ignore those outliers. We also used the notebook instance to plot graphs with the matplotlib.pyplot library.

You can try this solution for your own use cases to remove outliers using Athena SQL and an Amazon SageMaker notebook. If you have comments or feedback, please leave them below.


About the Authors

Rahul Sonawane is a Senior Consultant, Big Data at the Shared Delivery Teams at Amazon Web Services.

Behram Irani is a Senior Solutions Architect, Data & Analytics at Amazon Web Services.

Read More

Football tracking in the NFL with Amazon SageMaker

With the 2020 football season kicking off, Amazon Web Services (AWS) is continuing its work with the National Football League (NFL) on several ongoing game-changing initiatives. Specifically, the NFL and AWS are teaming up to develop state-of-the-art cloud technology using machine learning (ML) aimed at aiding the officiating process through real-time football detection. As a first step in this process, the Amazon Machine Learning Solutions Lab developed a computer vision model for the challenge of football detection. In this post, we provide in-depth examples including code snippets and visualizations to demonstrate the key components of the football detection pipeline, starting with data labeling and following up with training and deployment using Amazon SageMaker and Apache MXNet Gluon.

Detecting the football in NFL Broadcast videos

The following video illustrates detecting the football frame by frame.

Football Tracking

Computer vision-based object detection techniques use deep learning algorithms to predict the location of objects in images and videos. Today, object detection has many far-reaching, high business value use cases, such as in self-driving car technology, where detecting pedestrians and vehicles is of paramount importance in ensuring safety on the roads. For the NFL, object detection technology like this is crucial as the game continues to evolve at a rapid pace. For example, they can use real-time object identification to generate new advanced analytics around player and team performance, in addition to aiding game officials in ball spotting. This technology is part of the larger suite of innovations in the AWS/NFL partnership.

The following sections of the post outline how we used NFL broadcast video data to train object detection models that analyze thousands of images to locate and classify the football from background objects.

Creating an object detection dataset with Amazon SageMaker Ground Truth

Amazon SageMaker Ground Truth is a fully managed data labeling service that makes it easy to build highly accurate training datasets for ML. With the help of Ground Truth, we created a custom object detection dataset by breaking the NFL play segments into images. It offers a user interface (UI) that allowed us to quickly spin up a bounding box labeling job, in which human annotators can quickly draw bounding box labels around thousands of football image sequences stored in an Amazon Simple Storage Service (Amazon S3) bucket. The following screenshot illustrates the object detection UI.

The labeling job outputs a manifest file that contains the S3 file path of the image, the (x, y) bounding box coordinates, and the class label (for this use case, football) needed for the model training process.

The following screenshot shows the manifest file that is automatically updated in the S3 path.

The following screenshot shows the contents of the output.manifest file.

Supercharging training with Apache MXNet Gluon and Amazon SageMaker Script Mode

Our approach to model development relied on an ML technique called transfer learning. In it, we take neural networks previously trained on similar applications with strong results and fine-tune these models on our annotated data. We converted the annotations from the labeling job to RecordIO format for compact storage and faster disk access. Neural networks have the tendency to overfit training data, leading to poor out-of-sample results. The MXNet Gluon toolkit we added provides image normalization and image augmentations, such as randomized image flipping and cropping, to help reduce overfitting during training.
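As a sketch of what that data pipeline can look like, GluonCV's detection presets apply the flipping, cropping, and normalization described above while keeping bounding boxes consistent with the augmented image; the record file names, network choice, and input size below are assumptions rather than the exact configuration we used:

from gluoncv import model_zoo
from gluoncv.data import RecordFileDetection
from gluoncv.data.transforms import presets

# Annotated frames stored in RecordIO format (placeholder file names)
train_dataset = RecordFileDetection("train.rec")
val_dataset = RecordFileDetection("val.rec")

# Pre-trained YOLOv3 network for transfer learning, reset to our single class
net = model_zoo.get_model("yolo3_darknet53_coco", pretrained=True)
net.reset_class(classes=["football"])

# Detection-aware augmentation: random flips and crops plus normalization,
# applied to both the image and its bounding boxes
train_transform = presets.yolo.YOLO3DefaultTrainTransform(608, 608, net)
val_transform = presets.yolo.YOLO3DefaultValTransform(608, 608)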

Amazon SageMaker provides a simple UI to train object detection models with no code, offering the Single Shot Detector (SSD) pre-trained model with several out-of-the-box configurations. For a more customized architecture, we use Amazon SageMaker Script Mode, which allows you to bring your own training algorithms and directly train models while staying within the user-friendly confines of Amazon SageMaker. We could train larger, more accurate models directly from Amazon SageMaker notebooks by combining Script Mode with pre-trained models like Yolov3 and Faster-RCNN with several backbone network combinations from the Gluon Model Zoo for object detection. See the following code:

import os
import sagemaker
from sagemaker.mxnet import MXNet
from mxnet import gluon
from sagemaker import get_execution_role

sagemaker_session = sagemaker.Session()

role = get_execution_role()

s3_output_path = "s3://<path to bucket where model weights will be saved>/"

model_estimator = MXNet(
    entry_point="train.py",
    role=role,
    train_instance_count=1,  # value can be more than 1 for multi node training
    train_instance_type="ml.p3.16xlarge",
    framework_version="1.6.0",
    output_path=s3_output_path,
    py_version="py3",
    distributions={"parameter_server": {"enabled": True}},
    hyperparameters={"epochs": 15},
)

model_estimator.fit("s3://<bucket path for train and validation record-io files>/")

Object detection algorithm: Background

Object detectors typically combine two key components: detection of objects in images and regression for estimating bounding box coordinates of objects. During training, object detectors are optimized to reduce both detection error and localization error (bounding box prediction error) via a loss function.

Current state-of-the-art object detectors contain deep learning architectures that use pre-trained convolutional neural networks (CNNs) like VGG-16 or ResNet-50 as base networks to perform rich feature extraction from input images. SSD predicts the relative offsets to a fixed set of boxes at every location of a convolutional feature map. Empirically, SSD underperforms other object detector algorithms on small objects like football. In contrast, YOLOv3 uses DarkNet-53 for feature extraction, which concatenates multiple feature maps together to make predictions, leading to improved performance on smaller objects.

Faster-RCNN, in comparison to both SSD and YOLOv3, uses an additional shared deep neural network to predict region proposals from the input image feature maps, which are aggregated and fed downstream in the model for object classification and bounding box prediction. Faster-RCNN empirically outperformed the other networks on small objects in our use case. One major consideration in addition to performance when choosing an object detector is model inference time: SSD and YOLOv3 tend to have fast inference times as measured in frames per second, which is a key consideration for real-time applications, whereas larger networks like Faster-RCNN have slower inference times.

Hyperparameter optimization on Amazon SageMaker

A standard metric for evaluating object detectors is mean average precision (mAP). mAP is based on the model's precision-recall (PR) curve and provides a numerical value that can be compared directly across models. You can generate a PR curve by setting the model confidence score threshold to different levels, which produces precision and recall pairs; plotting these pairs with a bit of interpolation results in the PR curve. Average precision (AP) is then defined as the area under this PR curve. When detecting multiple object classes in an image (K > 1 classes), mAP is the mean AP across all K classes.
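For intuition, computing AP from a set of precision and recall pairs looks roughly like the following; this is a generic all-point interpolation sketch, not the exact evaluation code we used:

import numpy as np

def average_precision(recalls, precisions):
    """Area under an interpolated precision-recall curve."""
    r = np.concatenate(([0.0], recalls, [1.0]))
    p = np.concatenate(([0.0], precisions, [0.0]))
    # Interpolate: make precision non-increasing as recall grows
    for i in range(len(p) - 2, -1, -1):
        p[i] = max(p[i], p[i + 1])
    # Sum the rectangle areas where recall changes
    idx = np.where(r[1:] != r[:-1])[0]
    return float(np.sum((r[idx + 1] - r[idx]) * p[idx + 1]))

# Example precision/recall pairs produced by sweeping confidence thresholds
print(average_precision(np.array([0.1, 0.4, 0.8]), np.array([0.9, 0.75, 0.5])))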

Automatic model tuning in Amazon SageMaker, also known as hyperparameter optimization, allowed us to try over 100 models with unique parameter configurations to achieve the best model possible given our data. Hyperparameter optimization uses strategies like random search and Bayesian search to tune hyperparameters in the ML algorithm. See the following sample code:

from sagemaker.tuner import IntegerParameter, CategoricalParameter, ContinuousParameter
from sagemaker.tuner import HyperparameterTuner

hyperparameter_ranges = {
    "lr": ContinuousParameter(0.001, 0.1),
    " network": CategoricalParameter(["resnet50_v1b", "resnet101_v1d"])
}

### Objective metric Regex based on print statements in script
objective_metric_name = "Validation: "
metric_definitions = [{"Name": "Validation: ", "Regex": "Validation: ([0-9\.]+)"}]

tuner = HyperparameterTuner(
    model_estimator,
    objective_metric_name,
    hyperparameter_ranges,
    metric_definitions,
    max_jobs=100,
    max_parallel_jobs=10,
)

tuner.fit("s3://<bucket path for train and validation record-io files>/")

To do this, we specified the location of our data and manifest file on Amazon S3 and chose our Amazon SageMaker instance type and object detection algorithm to use (SSD with ResNet50). Amazon SageMaker hyperparameter optimization then launched several configurations of the base model with unique hyperparameter configurations, using Bayesian search to determine which configuration achieves the best model based on a preset test metric. In our case, we optimized towards the highest mean average precision (mAP) on our held-out test data. The following graph shows a visualization of a sample set of hyperparameter optimization jobs from the hyperparameter optimization tuner object.

Deploying the model

Deploying the model required only a few additional lines of code (hosting methods) within our Amazon SageMaker notebook instance. We can simply call tuner.deploy on our hyperparameter optimization tuner to deploy the best model based on the evaluation metric that was set for the hyperparameter optimization training job. The code below demonstrates a proof-of-concept deployment on Amazon SageMaker:

predictor = tuner.deploy(initial_instance_count=1, instance_type="ml.m5.xlarge")

The model weights for each training job are stored in Amazon S3. We can deploy any of the jobs or Amazon SageMaker trained models by passing its model artifact path to an Amazon SageMaker estimator object. To do this, we referred to the preconfigured container optimized to perform inference and linked it to the model weights. After this model-container pair was created on our account, we could configure an endpoint with the instance type and number of instances needed by the NFL. See the following code:

from sagemaker.mxnet.model import MXNetModel

sagemaker_model = MXNetModel(
    model_data="s3://<path to training job model file>/model.tar.gz",
    role=role,
    py_version="py3",
    framework_version="1.6.0",
    entry_point="train.py",
)

predictor = sagemaker_model.deploy(
    initial_instance_count=1, instance_type="ml.m5.xlarge"
)

Model inference for football detection

At runtime, a client sends a request to the endpoint hosting the model container on an Amazon Elastic Compute Cloud (Amazon EC2) instance and returns the output (inference). In production, scaling endpoints for large-scale inference on the NFL broadcast videos is significantly simplified with this pipeline.
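At that point, a client call is a standard SageMaker runtime request, as in the sketch below; the endpoint name and JSON payload shape are assumptions, because the real contract is defined by the inference handler in the entry point script:

import json

import boto3

runtime = boto3.client("sagemaker-runtime")

response = runtime.invoke_endpoint(
    EndpointName="football-detection-endpoint",  # placeholder endpoint name
    ContentType="application/json",
    Body=json.dumps({"frame": "s3://<bucket>/broadcast-frames/frame_0001.jpg"}),
)
detections = json.loads(response["Body"].read())  # e.g., bounding boxes and confidence scores
print(detections)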

Sensitivity and error analysis

When exploring strategies to improve model performance, you can scale up (use larger architectures) or scale out (acquire more data). After scaling up, which we discussed earlier during model exploration, data scientists commonly collect additional data in hopes of improving model generalizability. For our use case, we specifically aimed at reducing localization error. To do this, we created several test sets that quantitatively and qualitatively helped us understand mAP in relation to specific characteristics of the input video: occlusion (high vs. low) of the football, bounding box size (small vs. large) and aspect ratio (tall vs. wide) effects, camera angle (endzone vs. sideline), and contrast (high vs. low) between the football and its background area. From this, we understood which qualitative aspect of image the model was struggling to predict, and these findings led us to strategically gather additional data to target and improve upon these areas.

Summary

The NFL uses cloud computing to create innovative experiences that introduce additional ways for fans to enjoy football while making the game more efficient and fast-paced. By combining football detection with additional new technologies, the NFL can reduce game stoppage, support officiating, and bring real-time insight into what’s happening on the field, leading to a greater connection with the game that fans love. While fans take delight in “America’s game,” they can rest assured that the NFL in collaboration with AWS is utilizing the newest and best technologies to make the game more enjoyable with a broader range of data points.

You can find full, end-to-end examples of creating custom training jobs, training state-of-the-art object detection models, implementing HPO, and model deployment on Amazon SageMaker on the AWS Labs GitHub repo. To learn more about the ML Solutions Lab, see Amazon Machine Learning Solutions Lab.


About the Authors

Michael Lopez is the Director of Football Data and Analytics at the National Football League and a Lecturer of Statistics and Research Associate at Skidmore College. At the National Football League, his work centers on how to use data to enhance and better understand the game of football.

Colby Wise is a Senior Data Scientist and manager at the Amazon Machine Learning Solutions Lab where he works with customers across different verticals to accelerate their use of machine learning and AWS cloud services to solve their business challenges.

Divya Bhargavi is a Data Scientist at the Amazon Machine Learning Solutions Lab where she develops machine learning models to address customers’ business problems. Most recently, she worked on Computer Vision solutions involving both classical and deep learning methods for a sports customer.

Read More

Improved OCR and structured data extraction with Amazon Textract

Optical character recognition (OCR) technology, which enables extracting text from an image, has been around since the mid-20th century, and continues to be a research topic today. OCR and document understanding are still vibrant areas of research because they’re both valuable and hard problems to solve.

AWS has been investing in improving OCR and document understanding technology, and our research scientists continue to publish research papers in these areas. For example, the research paper Can you read me now? Content aware rectification using angle supervision describes how to tackle the problem of document rectification, which is fundamental to the OCR process on documents. Additionally, the paper SCATTER: Selective Context Attentional Scene Text Recognizer introduces a novel way to perform scene text recognition, which is the task of recognizing text against complex image backgrounds. For more recent publications in this area, see Computer Vision.

Amazon scientists also incorporate these research findings into best-of-breed technologies such as Amazon Textract, a fully managed service that uses machine learning (ML) to identify text and data from tables and forms in documents—such as tax information from a W2, or values from a table in a scanned inventory report—and recognizes a range of document formats, including those specific to financial services, insurance, and healthcare, without requiring customization or human intervention.
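For reference, extracting text, tables, and form data from a single-page document stored in Amazon S3 takes one API call; the bucket and file names below are placeholders:

import boto3

textract = boto3.client("textract")

response = textract.analyze_document(
    Document={"S3Object": {"Bucket": "<your-bucket>", "Name": "annual-report-page.png"}},
    FeatureTypes=["TABLES", "FORMS"],  # request table and key-value (form) extraction
)

# Blocks include pages, lines, words, tables, cells, and key-value sets
for block in response["Blocks"]:
    if block["BlockType"] == "WORD":
        print(block["Text"])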

One of the advantages of a fully managed service is the automatic and periodic improvement to the underlying ML models to improve accuracy. You may need to extract information from documents that have been scanned or pictured in different lighting conditions, a variety of angles, and numerous document types. As the models are trained using data inputs that encompass these different conditions, they become better at detecting and extracting data.

In this post, we discuss a few recent updates to Amazon Textract that improve the overall accuracy of document detection and extraction.

Currency symbols

Amazon Textract now detects a set of currency symbols (Chinese yuan, Japanese yen, Indian rupee, British pound, and US dollar) and the degree symbol with more precision without much regression on existing symbol detection.

For example, the following is a sample table in a document from a company’s annual report.

The following screenshot shows the output on the Amazon Textract console before the latest update.

Amazon Textract detects all the text accurately. However, the Indian rupee symbol is recognized as an “R” instead of “₹”. The following screenshot shows the output using the updated model.

The rupee symbol is detected and extracted accurately. Similarly, the degree symbol and the other currency symbols (yuan, yen, pound, and dollar) are now supported in Amazon Textract.

Detecting rows and columns in large tables

Amazon Textract released a new table model update that more accurately detects rows and columns of large tables that span an entire page. Overall table detection and extraction of data and text within tables has also been improved.

The following is an example of a table in a personal investment account statement.

The following screenshot shows the Amazon Textract output prior to the new model update.

Even though all the rows, columns, and text are detected properly, the output also contains empty columns. The original table didn't have a clear separation between columns, so the model included extra columns.

The following screenshot shows the output after the model update.

The output now is much cleaner. Amazon Textract still extracts all the data accurately from this table and now includes the correct number of columns. You can see similar improvements in tables that span an entire page, and columns are no longer omitted.

Improved accuracy in forms

Amazon Textract now has higher accuracy on a variety of forms, especially income verification documents such as pay stubs, bank statements, and tax documents. The following screenshot shows an example of such a form.

The preceding form is not of high-quality resolution. Regardless, you may have to process such documents in your organization. The following screenshot is the Amazon Textract output using one of the previous models.

Although the older model detected many of the check boxes, it didn’t capture all of them. The following screenshot shows the output using the new model.

With this new model, Amazon Textract accurately detected all the check boxes in the document.

Summary

The improvements to the currency symbols and the degree symbol detection will be launched in the Asia Pacific (Singapore) region on September 24th, 2020, followed by other regions where Amazon Textract is available in the next few days. With the latest improvements to Amazon Textract, you can retrieve information from documents with more accuracy. Tables spanning the entire page are detected more accurately, currency symbols (yuan, yen, rupee, pound, and dollar) and the degree symbol are now supported, and key-value pairs and check boxes in financial forms are detected with more precision. To start extracting data from your documents and images, try Amazon Textract for yourself.


About the Author

Raj Copparapu is a Product Manager focused on putting machine learning in the hands of every developer.

Read More

Preventing customer churn by optimizing incentive programs using stochastic programming

In recent years, businesses are increasingly looking for ways to integrate the power of machine learning (ML) into business decision-making. This post demonstrates the use case of creating an optimal incentive program to offer customers identified as being at risk of leaving for a competitor, or churning. It extends a popular ML use case, predicting customer churn, and shows how to optimize an incentive program to address the real business goal of preventing customer churn. We use a large phone company for our use case.

Although it’s usual to treat this as a binary classification problem, the real world is less binary: people become likely to churn for some time before they actually churn. Loss of brand loyalty occurs some time before someone actually buys from a competitor. There’s frequently a slow rise in dissatisfaction over time before someone is finally driven to act. Providing the right incentive at the right time can reset a customer’s satisfaction.

This post builds on the post Gain customer insights using Amazon Aurora machine learning. There we met a telco CEO and heard his concern about customer churn. In that post, we moved from predicting customer churn to intervening in time to prevent it. We built a solution that integrates Amazon Aurora machine learning with the Amazon SageMaker built-in XGBoost algorithm to predict which customers will churn. We then integrated Amazon Comprehend to identify the customer’s sentiment when they called customer service. Lastly, we created a naïve incentive to offer customers identified as being at risk at the time they called.

In this post, we focus on replacing this naïve incentive with an optimized incentive program. Rather than using an abstract cost function, we optimize using the actual economic value of each customer and a limited incentive budget. We use a mathematical optimization approach to calculate the optimal incentive to offer each customer, based on our estimate of the probability that they’ll churn, and the probability that they’ll accept our incentive to stay.

Solution overview

Our incentive program is intended to be used in a system such as that described in the post Gain customer insights using Amazon Aurora machine learning. For simplicity, we’ve built this post so that it can run separately.

We use a Jupyter notebook running on an Amazon SageMaker notebook instance. Amazon SageMaker is a fully-managed service that enables developers and data scientists to quickly and easily build, train, and deploy ML models at any scale. In the Jupyter notebook, we first build and host an XGBoost model, identical to the one in the prior post. Then we run the optimization based on a given marketing budget and review the expected results.

Setting up the solution infrastructure

To set up the environment necessary to run this example in your own AWS account, follow Steps 0 and 1 in the post Simulate quantum systems on Amazon SageMaker to set up an Amazon SageMaker instance.

Then, as in Step 2, open a terminal. Enter the following command to copy the notebook to your Amazon SageMaker notebook instance:

wget https://aws-ml-blog.s3.amazonaws.com/artifacts/prevent_churn_by_optimizing_incentives/preventing_customer_churn_by_optimizing_incentives.ipynb

Alternatively, you can review a pre-run version of the notebook.

Building the XGBoost model

The first sections of the notebook (Setup, Data Exploration, Train, and Host) are the same as the sample notebook Amazon SageMaker Examples – Customer Churn. The exception is that we capture a copy of the data for later use and add a column to calculate each customer's total spend.

At the end of these sections, we have a running XGBoost model on an Amazon SageMaker endpoint. We can use it to predict which customers will churn.

Assessing and optimizing

In this section of the post and accompanying notebook, we focus on assessing the XGBoost model and creating our optimal incentive program.

We can assess model performance by looking at the prediction scores, as shown in the original customer churn prediction post, Amazon SageMaker Examples – Customer Churn.

So how do we calculate the minimum incentive that will give the desired result? Rather than providing a single program to all customers, can we save money and gain a better outcome by using variable incentives, customized to a customer’s churn probability and value? And if so, how?

We can do so by building on components we’ve already developed.

Assigning costs to our predictions

The costs of churn for the mobile operator depend on the specific actions that the business takes. One common approach is to assign costs, treating each customer’s prediction as binary: they churn or don’t, we predicted correctly or we didn’t. To demonstrate this approach, we must make some assumptions. We assign the true negatives the cost of $0. Our model essentially correctly identified a happy customer in this case, and we won’t offer them an incentive. An alternative is to assign the actual value of the customer’s spend to the true negatives, because this is the customer’s contribution to our overall revenue.

False negatives are the most problematic because they incorrectly predict that a churning customer will stay. We lose the customer and have to pay all the costs of acquiring a replacement customer, including foregone revenue, advertising costs, administrative costs, point of sale costs, and likely a phone hardware subsidy. Such costs typically run in the hundreds of dollars, so for this use case, we assume $500 to be the cost for each false negative. For a better estimate, our marketing department should be able to give us a value to use for the overhead, and we have the actual customer spend for each customer in our dataset.

Finally, we give an incentive to customers that our model identifies as churning. For this post, we assume a one-time retention incentive of $50. This is the cost we apply to both true positive and false positive outcomes. In the case of false positives (the customer is happy, but the model mistakenly predicted churn), we waste the concession. We probably could have spent those dollars more effectively, but it’s possible we increased the loyalty of an already loyal customer, so that’s not so bad. We revise this approach later in this post.

Mapping the customer churn threshold

In previous versions of this notebook, we’ve shown the effect of false negatives that are substantially more costly than false positives. Instead of optimizing for error based on the number of customers, we’ve used a cost function that looks like the following equation:

cost_of_replacing_customer * FN(C) + customer_value * TN(C) + incentive_offered * FP(C) + incentive_offered * TP(C)

FN(C) indicates that the false negative percentage is a function of the cutoff, C; the same holds for TN, FP, and TP. We want to find the cutoff, C, that minimizes this expression.

We start by using the same values for all customers, to give us a starting point for discussion with the business. With our estimates, this equation becomes the following:

$500 * FN(C) + $0 * TN(C) + $50 * FP(C) + $50 * TP(C)

A straightforward way to understand the impact of these numbers is to simply run a simulation over a large number of possible cutoffs. We test 100 possible values, and produce the following graph.
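
A minimal sketch of such a sweep, using synthetic scores and labels in place of the notebook’s real model output and the cost assumptions stated above, could look like this:

import numpy as np

rng = np.random.default_rng(42)
n = 1500
churned = rng.binomial(1, 0.15, n)                                   # synthetic ground truth labels
churn_prob = np.clip(churned * 0.5 + rng.uniform(0, 0.5, n), 0, 1)   # stand-in for the model scores

def total_cost(cutoff, fn_cost=500.0, tn_cost=0.0, incentive=50.0):
    predicted = churn_prob >= cutoff
    fn = np.sum(~predicted & (churned == 1))   # missed churners: pay the replacement cost
    tn = np.sum(~predicted & (churned == 0))   # happy customers correctly left alone
    fp = np.sum(predicted & (churned == 0))    # incentive wasted on loyal customers
    tp = np.sum(predicted & (churned == 1))    # incentive paid to real churners
    return fn_cost * fn + tn_cost * tn + incentive * (fp + tp)

cutoffs = np.linspace(0.01, 0.99, 100)
costs = [total_cost(c) for c in cutoffs]
best_cutoff = cutoffs[int(np.argmin(costs))]
print('Cost is minimized near a cutoff of {:.2f} for a cost of ${:,.0f}'.format(best_cutoff, min(costs)))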

The following output summarizes our results:

Cost is minimized near a cutoff of: 0.21000000000000002 for a cost of: $ 25800 for these 1500 customers.
Incentive is paid to 246 customers, for a total outlay of $ 12300
Total customer spend of these customers is $ 16324.36

The preceding chart shows how picking a threshold too low results in costs skyrocketing as all customers are given a retention incentive. Meanwhile, setting the threshold too high (such as 0.7 or above) results in too many lost customers, which ultimately grows to be nearly as costly. In between, there is a large grey area, where perhaps some more nuanced incentives would create better outcomes.

The overall cost can be minimized at $25,750 by setting the cutoff to 0.13, which is substantially better than the $100,000 or more we would expect to lose by not taking any action.

We can also calculate the dollar outlay of the program and compare to the total spend of the customers. Here we can see that paying the incentive to all predicted churn customers costs $13,750, and that these customers spend $183,700. (Your numbers may vary, depending on the specific customers chosen for the sample.)

What happens if we instead have a smaller budget for our campaign? We choose a budget of 1% of total customer monthly spend. The following output shows our results:

Total budget is: $895.90
Per customer incentive is $0.60

We can see that our cost changes. But it’s pretty clear that an incentive of approximately $0.60 is unlikely to change many people’s minds.

Can we do better? We could offer a range of incentives to customers that meet different criteria. For example, it’s worth more to the business to prevent a high spend customer from churning than a low spend customer. We could also target the grey area of customers that have less loyalty and could be swayed by another company’s advertising. We explore this in the following section.

Preventing customer churn using mathematical optimization of incentive programs

In this section, we use a more sophisticated approach to developing our customer retention program. We want to tailor our incentives to target the customers most likely to reconsider a churn decision.

Intuitively, we know that we don’t need to offer an incentive to customers with a low churn probability. Also, above some threshold, we’ve already lost the customer’s heart and mind, even if they haven’t actually left yet. So the best target for our incentive is between those two thresholds—these are the customers we can convince to stay.

The problem under investigation is inherently stochastic in that each customer might churn or not, and might accept the incentive (offer) or not. Stochastic programming [1, 2] is an approach for modeling optimization problems that involve uncertainty. Whereas deterministic optimization problems are formulated with known parameters, real-world problems almost invariably include parameters that are unknown at the time a decision should be made. An example would be the construction of an investment portfolio to maximize return. An efficient portfolio would be defined as the portfolio that maximizes the expected return for a given amount of risk (such as standard deviation), or the portfolio that minimizes the risk subject to a given expected return [3].

Our use case has the following elements:

  • We know the number of customers, 𝑁.
  • We can use the customer’s current spend as the (upper bound) estimate of the profit they generate, P.
  • We can use the churn score from our ML model as an estimate of the probability of churn, alpha.
  • We use 1% of our total revenue as our campaign budget, C.
  • The probability that the customer is swayed, beta, depends on how convincing the incentive is to the customer, which we represent as 𝛾.
  • The incentive, c, is what we want to calculate.

We set up our inputs: P (profit), alpha (our churn probabilities, from our preceding model), and C, our campaign budget. We then define the function we wish to optimize, f(ci) being the expected total profit across the 𝑁 customers.

Our goal is to optimally allocate the discount 𝑐𝑖 across the 𝑁 customers to maximize the expected total profit. Mathematically this is equivalent to the following optimization problem:
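
One plausible way to write this problem (not necessarily the exact formulation used in the original notebook), assuming the incentive outlay c_i counts against profit and the acceptance probability takes the exponential form discussed below, is:

maximize over c_1, ..., c_N:   sum over i of [ ((1 - alpha_i) + alpha_i * beta_i(c_i)) * P_i - c_i ]
subject to:                    c_1 + c_2 + ... + c_N <= C   and   c_i >= 0
where:                         beta_i(c_i) = 1 - exp(-c_i / gamma_i)   (gamma_i is the 𝛾𝑖 in the text)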

Now we can specify how likely we think each customer is to accept the offer and not churn—that is, how convincing they’ll find the incentive. We represent this as 𝛾 in the formulae.

Although this is a matter of business judgment, we can use the preceding graph to inform that judgment. In this case, the business believes that a customer with a churn probability below 0.55 is unlikely to churn, even without an incentive; on the other hand, a customer with a churn probability above 0.95 has little loyalty and is unlikely to be convinced. The real targets for the incentives are the customers with a churn probability between 0.55 and 0.95.

We could include that business insight into the optimization by setting the value for the convincing factor 𝛾𝑖 as follows:

  • 𝛾𝑖 = 100. This is equivalent to giving less importance as a deciding factor to the discount for customers whose churn probability is below 0.55 (they are loyal and less likely to churn), or greater than 0.95 (they will most likely leave despite the retention campaign).
  • 𝛾𝑖 = 1. This is equivalent to saying that the probability that customer i accepts the discount is 𝛽 = 1 − exp(−c𝑖) for customers with churn probability between 0.55 and 0.95.

When we start to offer these incentives, we can log whether or not each customer accepts the offer and remains a customer. With that information, we can learn this function from experience, and use that learned function to develop the next set of incentives.

Solving the optimization problem

A variety of open-source solvers can handle this optimization problem. Examples include SciPy’s scipy.optimize.minimize or faster open-source solvers like GEKKO, which is what we use for this post. For large-scale problems, we recommend commercial optimization solvers such as CPLEX or GUROBI.
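
To illustrate the shape of the problem (not the notebook’s GEKKO implementation), a minimal sketch with SciPy and synthetic inputs could look like the following; the objective and the acceptance probability follow the assumptions stated above.

import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
N = 50
P = rng.uniform(30, 100, N)          # estimated monthly profit per customer (synthetic)
alpha = rng.uniform(0.05, 0.95, N)   # churn probabilities from the ML model (synthetic)
gamma = np.where((alpha > 0.55) & (alpha < 0.95), 1.0, 100.0)   # convincing factor per customer
C = 0.01 * P.sum()                   # campaign budget: 1% of total customer spend

def neg_expected_profit(c):
    beta = 1.0 - np.exp(-c / gamma)          # probability the incentive is accepted
    stay = (1.0 - alpha) + alpha * beta      # probability the customer stays
    return -np.sum(stay * P - c)             # negative expected profit (we minimize)

result = minimize(neg_expected_profit,
                  x0=np.full(N, C / N),
                  bounds=[(0.0, None)] * N,
                  constraints=[{'type': 'ineq', 'fun': lambda c: C - c.sum()}],
                  method='SLSQP')
print('Total outlay: ${:.2f} of a ${:.2f} budget'.format(result.x.sum(), C))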

After the optimization task has run, we check how much of our budget has been allocated.

The total spend is $1,000.00 compared to our budget of $1,000.00, and the total customer spend is $89,589.73 for 1,500 customers.

Now we evaluate the expected total profit for the following scenarios:

  1. Optimal discount allocation, as calculated by our optimization algorithm
  2. Uniform discount allocation: every customer is offered the same incentive
  3. No discount

The following graph shows our outcomes.

Expected total profit compared to no campaign: 17%

Expected total profit compared to uniform discount allocation: 5%

Lastly, we add the discount to our customer data. We can see how the discount we offer varies across our customer base. The red vertical line shows the value of the uniform discount. The pattern of discounts we offer closely mirrors the pattern in the prediction scores, where many customers aren’t identified as likely churners, and a few are identified as highly likely to churn.

We can also see a sample of the discounts we’d be offering to individual customers. See the following table.

For each customer, we can see their total monthly spend and the optimal incentive to offer them. We can see that the discount varies by churn probability, and we’re assured that the incentive campaign fits within our budget.

Depending on the size of the total budget we allocate, we may occasionally find that we’re offering all customers a discount. This discount allocation problem reminds us of the water-filling algorithm in wireless communications [4,5], where the problem is of maximizing the mutual information between the input and the output of a channel composed of several subchannels (such as a frequency-selective channel, a time-varying channel, or a set of parallel subchannels arising from the use of multiple antennas at both sides of the link) with a global power constraint at the transmitter. More power is allocated to the channels with higher gains to maximize the sum of data rates or the capacity of all the channels. The solution to this class of problems can be interpreted as pouring a limited volume of water into a tank, the bottom of which has the stair levels determined by the inverse of the subchannel gains.

Unfortunately, our problem doesn’t have an intuitive interpretation like the water-filling problem does. Because of the nature of the objective function, the system of equations and inequalities corresponding to the KKT conditions [6] doesn’t admit a closed-form solution.

The optimal incentives calculated here are the result of an optimization routine designed to maximize an economic figure, the expected total profit. Although this approach provides a principled way for marketing teams to make systematic, quantitative, and analytics-driven decisions, it’s important to remember that the objective function being optimized is a proxy for the actual total profit. We can’t compute the actual profit, because it depends on decisions and outcomes that lie in the future (much as a portfolio’s realized return depends on future prices). But we can explore new ideas using techniques such as the potential outcomes framework [7], which we could use to design strategies for back-testing our solution.

Conclusion

We’ve now taken another step towards preventing customer churn. We built on a prior post in which we integrated our customer data with our ML model to predict churn. We can now experiment with variations on this optimization equation, and see the effect of different campaign budgets or even different theories of how they should be modeled.

To gather more data on effective incentives and customer behavior, we could also test several campaigns against different subsets of our customers. We can collect their responses—do they churn after being offered this incentive, or not—and use that data in a future ML model to further refine the incentives offered. We can use this data to learn what kinds of incentives convince customers with different characteristics to stay, and use that new function within this optimization.

We’ve empowered marketing teams with the tools to make data-driven decisions that they can quickly turn into action. This approach can drive fast iterations on incentive programs, moving at the speed with which our customers make decisions. Over to you, marketing!

 

Bibliography

[1] S. Uryasev, P. M. Pardalos, Stochastic Optimization: Algorithm and Applications, Kluwer Academic: Norwell, MA, USA, 2001.

[2] John R. Birge and François V. Louveaux. Introduction to Stochastic Programming. Springer Verlag, New York, 1997.

[3] Francis, J. C. and Kim, D. (2013). Modern portfolio theory: Foundations, analysis, and new developments (Vol. 795). John Wiley & Sons.

[4] T. M. Cover and J. A. Thomas, Elements of Information Theory. New York: Wiley, 1991.

[5] D. P. Palomar and J. R. Fonollosa, “Practical algorithms for a family of water-filling solutions,” IEEE Trans. Signal Process., vol. 53, no. 2, pp. 686–695, Feb. 2005.

[6] S. Boyd and L. Vandenberghe. Convex optimization. Cambridge university press, 2004.

[7] Imbens, G. W. and D. B. Rubin (2015): Causal Inference for Statistics, Social, and Biomedical Sciences, Cambridge University Press.


About the Authors

Marco Guerriero, PhD, is a Practice Manager for Emergent Technologies and Intelligence Platform for AWS Professional Services. He loves working on ways for emergent technologies such as AI/ML, Big Data, IoT, Quantum, and more to help businesses across different industry verticals succeed within their innovation journey.


Veronika Megler, PhD, is Principal Data Scientist for Amazon.com Consumer Packaging. Until recently, she was the Principal Data Scientist for AWS Professional Services. She enjoys adapting innovative big data, AI, and ML technologies to help companies solve new problems, and to solve old problems more efficiently and effectively. Her work has lately focused more heavily on the economic impacts of ML models and on exploring causality.


Oliver Boom is a London based consultant for the Emerging Technologies and Intelligent Platforms team at AWS. He enjoys solving large-scale analytics problems using big data, data science and dev ops, and loves working at the intersection of business and technology. In his spare time he enjoys language learning, music production and surfing.


Dr Sokratis Kartakis is a UK-based Data Science Consultant for AWS. He works with enterprise customers to help them adopt and productionize innovative Machine Learning (ML) solutions at scale solving challenging business problems. His focus areas are ML algorithms, ML Industrialization, and AI/MLOps. He enjoys spending time with his family outdoors and traveling to new destinations to discover new cultures.


Read More

Selecting the right metadata to build high-performing recommendation models with Amazon Personalize

Selecting the right metadata to build high-performing recommendation models with Amazon Personalize

In this post, we show you how to select the right metadata for your use case when building a recommendation engine using Amazon Personalize. The aim is to help you optimize your models to generate more user-relevant recommendations. We look at which metadata is most relevant to include for different use cases, and where you may get better results by excluding other metadata. We also highlight a specific use case from Pulselive, one of our customers that recently used Amazon Personalize to enhance the recommendation capabilities of their customer’s websites, resulting in a 20% increase in video consumption.

Introducing Amazon Personalize

Amazon Personalize is a managed service that enables you to improve customer engagement by powering personalized product and content recommendations, and targeted marketing promotions. Amazon Personalize uses machine learning (ML) to create high-quality recommendations that you can use to personalize your user experience across digital channels such as websites, applications, and email systems. You can get started without any prior ML experience using simple APIs to easily build sophisticated personalization capabilities in just a few clicks. Amazon Personalize automatically processes and examines your metadata, identifies what is meaningful, allows you to pick an ML algorithm, and trains and optimizes a custom model based on your metadata. All your data is encrypted to be private and secure, and is only used to create recommendations for your users.

Let’s dive into how important that metadata is to get a performant model.

The role metadata selection plays in recommendations

The goal of metadata selection in recommendation engines is to select the right data to aid the training algorithm to discover valuable information about the similarities in user preferences and behavior, in addition to the properties and similarity of the items you’re trying to recommend through the engine. The ultimate goal is to provide a personalized experience, uniquely tailored for each user, and present them with the items that are the most relevant to them.

Nowadays, companies have so many potential sources of data for capturing user behavior and deciding which items to present that it has become challenging to select which metadata to consider and which to ignore. Irrespective of the use case, a commercial website can collect large amounts of data about every aspect of each user’s behavior, such as which items they frequently interact with (watching a video or ordering an item), how long they spend on each item’s page, or even how erratic or smooth their cursor movement is while scrolling through a page. All this information can reveal a lot about a user’s preferences and what the ideal items to recommend to them would be.

There are two main categories of approaches to recommendation engines: collaborative filtering and content-based filtering.

Collaborative filtering compares the behavior of the users with each other and tries to calculate the similarity between them to find shared interests. Therefore, the recommendation engine knows that if user A has very similar behavior to user B, then user A would likely be interested in some of the items that user B has interacted with, and vice versa.

Content-based filtering looks at the actual items the users interact with. If a user has interacted with items A and B, and product C is very similar to A and B, then item C will likely be of interest to the user.

We also have hybrid models that use both user behavior and item-related data to find the underlying patterns that reveal the ideal items to recommend to each user.

Each method requires a different approach to metadata selection because they require different types of data to be collected and used for the training. For example, building collaborative filtering engines requires data related to the behavior of users on the website, whereas building a content-based engine requires more data related to the items (item-specific metadata and which users interacted with which items). A hybrid solution requires data related to both the users and the items.

As a general rule, authenticated experiences work best. When users have personal accounts that they log in to, you can provide them with a more personalized experience tailored to their needs, because you can easily track and record every aspect of their behavior (along with additional metadata), whereas it’s harder to track anonymous or guest users and map them to their previous sessions.

The problems that can occur if metadata selection isn’t done right

If metadata selection isn’t done correctly, it can potentially lead to poor recommendations that are either too generic (showing most users the most popular and commonly interacted products) or not relevant (showing items that are completely irrelevant to the unique user).

When too much information is included in training a recommendation model, it can lead to noise in the model. Metadata that has no correlation with user preferences but was included in training skews the model and makes it harder for the algorithm to find the valuable underlying patterns that allow for a successful recommender system.

This can also apply to the depth (amount of history) of the data that is used to train a model. Perhaps relevant metadata has been selected, but the freshness of the data in many cases is a stronger indicator of relevance—the most recent metadata is more relevant than historical data for the same kind of interactions. This is because user behavior and preferences vary over time and people’s interests can change rather quickly; therefore, presenting a user with a recommendation that was considered relevant to them a few months ago doesn’t guarantee that the recommendation is relevant to them today. This is why it’s important to keep your recommender system up to date with current user behavior.

Conversely, if too little information is included, the recommendation model under-performs. If you don’t include valuable information that can aid the performance of the model, the recommendation model makes suboptimal suggestions.

A wrong approach to metadata selection can make it harder for the algorithms to find the underlying patterns that connect users and items. This means that the recommendations that the users are presented with aren’t personalized as expected.

Terminology of recommendation engines

To introduce the topic further, let’s dive into some of the terminology associated with Amazon Personalize:

  • Datasets and dataset groups – Datasets contain the data used to train a recommendation model. You can use different dataset groups to serve different purposes. For example, separate applications, with their own users and items, can have their own dataset groups.
  • Recipes and solutions – Amazon Personalize uses recipes, which are the combination of the learning algorithm with the hyperparameters and datasets used. Training a model with different recipes leads to different results. The resultant models that are deployed are referred to as a solution version.
  • Campaigns – A deployed solution version is known as a campaign. A campaign allows Amazon Personalize to make recommendations for your users.

The metadata types are defined by the datasets used to train a model. In the following section, we look at how to select them.

Selecting metadata

Amazon Personalize provides recipes aimed at the two main categories of recommendation engines (collaborative filtering and content-based filtering), as well as hybrid methods. For more information about pre-defined recipes, see Choosing a Recipe.

No matter which recipe you chose to work with, Amazon Personalize has three main types of datasets that it can use to build models (solutions), and each is related to one of the following categories:

  • Users
  • Items
  • Interactions

The users and items dataset types are known as metadata types, and are only used by certain recipes. As their names imply, their metadata has fields that describe each individual user or item. User metadata could be age, gender, and geography. Typical item metadata includes color, category, shape, and price for retail items, or content category, ratings, and genre if the items we’re recommending are videos or movies.

The interactions metadata is the direct interactions of a user with an item, which is usually the most revealing information for the relationship between users and items. Some examples of interactions data can be clicks (user A clicked on item X), purchases (user actually purchased an item), amount of time spent on an item’s webpage, the addition of an item to a user’s wishlist, or even the fact that the user hovered their cursor for a few milliseconds more than usual over a certain item.

The minimum number of interactions Amazon Personalize expects in order to start making recommendations is 1,000 interactions from a minimum of 25 users. User and item metadata datasets are optional, and their importance depends on your use case and the algorithm (recipe) you’re using.

The following screenshot shows the Datasets page on the Amazon Personalize console.

What data types are supported by each category?

Each dataset type has a set of required fields, reserved keywords, and required data types, as follows:

  • Users – Required fields: USER_ID (string) plus at least one metadata field.
  • Items – Required fields: ITEM_ID (string) plus at least one metadata field. Reserved keywords: CREATION_TIMESTAMP (long).
  • Interactions – Required fields: USER_ID (string), ITEM_ID (string), and TIMESTAMP (long). Reserved keywords: EVENT_TYPE (string), IMPRESSION (string), and EVENT_VALUE (float, null).

Before you add a dataset to Amazon Personalize, you must define a schema for that dataset. Each dataset type has specific requirements. Schemas in Amazon Personalize are defined in the Avro format.

The following example code shows an interactions schema. The EVENT_TYPE and EVENT_VALUE fields are optional, and are reserved keywords recognized by Amazon Personalize. LOCATION and DEVICE are optional contextual metadata fields.

{
  "type": "record",
  "name": "Interactions",
  "namespace": "com.amazonaws.personalize.schema",
  "fields": [
      {
          "name": "USER_ID",
          "type": "string"
      },
      {
          "name": "ITEM_ID",
          "type": "string"
      },
      {
          "name": "EVENT_TYPE",
          "type": "string"
      },
      {
          "name": "EVENT_VALUE",
          "type": "float"
      },
      {
          "name": "LOCATION",
          "type": "string",
          "categorical": true
      },
      {
          "name": "DEVICE",
          "type": "string",
          "categorical": true
      },
      {
          "name": "TIMESTAMP",
          "type": "long"
      }
  ],
  "version": "1.0"
}

Creating a schema using the AWS Python SDK

To create a schema using the AWS Python SDK, complete the following steps:

  1. Define the Avro format schema that you want to use.
  2. Save the schema in a JSON file in the default Python folder.
  3. Create the schema using the following code:
import boto3

personalize = boto3.client('personalize')

with open('schema.json') as f:
    createSchemaResponse = personalize.create_schema(
        name = 'YourSchema',
        schema = f.read()
    )

schema_arn = createSchemaResponse['schemaArn']

print('Schema ARN:' + schema_arn )

Amazon Personalize returns the ARN of the new schema.

  4. Store the ARN for later use, as shown in the sketch below.
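
For instance, the stored schema ARN is needed later when you create a dataset inside a dataset group; a hypothetical continuation (the resource names are placeholders) might look like the following:

# Create a dataset group, then a dataset that uses the schema ARN from the previous step
dataset_group_response = personalize.create_dataset_group(name='YourDatasetGroup')
dataset_group_arn = dataset_group_response['datasetGroupArn']
# Wait for the dataset group to reach ACTIVE status before creating datasets in it.

dataset_response = personalize.create_dataset(
    name='YourInteractionsDataset',
    schemaArn=schema_arn,
    datasetGroupArn=dataset_group_arn,
    datasetType='Interactions')

print('Dataset ARN: ' + dataset_response['datasetArn'])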

Filtering your metadata

Amazon Personalize allows you to experiment with building different models (or solutions) based on different metadata by enabling you to filter records from your interactions dataset and set a threshold for each event type, or simply select and leave out certain event types. You can filter records from an interactions dataset in two ways:

  • Set a threshold to exclude records based on a specific value by specifying an event value in your recipe. If the records include a value that is associated with a specific event—for example, the price a user paid is associated with the purchase of an item—you can set a specific value in a recipe as a threshold to exclude records from training. The amount is called an event value.
  • Exclude records of a certain type by specifying an event type in your recipe. A dataset often includes specific types of activities, for example, purchase, click, or wishlisted. These are called event types. To include only records for specific event types in training, filter your dataset by event type in your recipe.

To filter your metadata, call the CreateSolution API. If you want to specify the event type, for example purchase, set it in the eventType parameter. If you want to specify an event value, for example 10, set it in the eventValueThreshold parameter. You can also specify an event type and an event value. You can specify an eventType, an eventType and eventValueThreshold, or neither. You can’t specify just eventValueThreshold alone. See the following code:

import boto3
 
personalize = boto3.client('personalize')

# Create the solution
create_solution_response = personalize.create_solution(
    name = "your-solution-name",
    datasetGroupArn = dataset_group_arn,
    recipeArn = recipe_arn,
    eventType = "purchase",
    solutionConfig = {
        "eventValueThreshold": "10"
    }
)

# Store the solution ARN
solution_arn = create_solution_response['solutionArn']

# Use the solution ARN to get the solution status
solution_description = personalize.describe_solution(solutionArn = solution_arn)['solution']
print('Solution status: ' + solution_description['status'])
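
After the solution is defined, a hypothetical next step, tying back to the terminology above, would be to train a solution version and deploy it as a campaign; the campaign name and provisioned throughput are placeholders:

# Train a solution version (a concrete model) for the solution
create_version_response = personalize.create_solution_version(solutionArn = solution_arn)
solution_version_arn = create_version_response['solutionVersionArn']

# When the solution version is ACTIVE, deploy it as a campaign to serve recommendations
create_campaign_response = personalize.create_campaign(
    name = "your-campaign-name",
    solutionVersionArn = solution_version_arn,
    minProvisionedTPS = 1
)
campaign_arn = create_campaign_response['campaignArn']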

When selecting metadata for a recommendation engine, it’s helpful to ask the following questions to help guide your decisions:

  • What is likely to be the strongest indicator of a good recommendation—similar users, similar items, or their combined interactions? This can help determine which metadata to select and tag in the datasets. As described, the interactions dataset is the minimum that Amazon Personalize expects, so you have to choose wisely which types of interactions (or events) you want to capture. A combination of interactions and metadata is typically recommended, but choosing which types of interactions to record is important.
  • What is the temporal value of the data? Is old data less potent? How much less? How can you use real-time APIs with real-time data to get the most relevant recommendations that reflect the users’ change of preferences over time?
  • Which metrics best show whether the recommendation engine is working well? Can you align Amazon Personalize metrics with your own KPIs? Can you construct an A/B test with live customers?

The answers to these questions can be a good guide to improve the recommendation system.

Applying metadata selection: Pulselive use case

In a recent engagement with Pulselive, an AWS customer that builds and hosts solutions for large sports organizations, we were asked to aid them in prototyping a personalized recommendation engine for one of their customers, a renowned European football club, to suggest videos to the visitors of their website according to their preferences and past behavior. Their goal was to use all the data they could to provide the website’s visitors with a tailored, highly personalized experience by recommending videos relevant to each user to increase engagement with the content.

Our initial approach was to use some of their existing recorded historical data to extract the minimum required information needed to start building Amazon Personalize solutions that can recommend the right videos to the right users. Therefore, the metadata we initially selected was the simplest form of user-video interactions—clicks—from a historical dataset of which users had clicked on which videos and at what time. We started with 30,000 user interactions.

That allowed us to build a baseline solution that used that information to evaluate the relevance of each video to each user and considered it as our starting point. The next goal was to enrich the dataset by selecting the right metadata and observing the impact that the new models had on user engagement.

At this point, we have to mention that it’s somewhat challenging to predict how well the recommendation system will do when deployed into production. Amazon Personalize provides some standard out-of-the-box metrics when a model has finished training to give you an idea of how well it did at recommending the most relevant items higher on the recommendations list (such as having a high precision or coverage). But you can only evaluate the true impact on your customers when deploying the system into production.

Pulselive chose to do A/B testing, comparing the results of their existing recommendation methods to those produced from an Amazon Personalize campaign. They started with redirecting 5% of their traffic through the Amazon Personalize campaign. After seeing good results, they eventually rolled out to 50% of the traffic being redirected to Amazon Personalize. For more information, see Increasing engagement with personalized online sports content.

Regarding metadata selection, we quickly realized that the users and items in the initial historical dataset weren’t very recent, and most of their IDs didn’t correspond to users and items that had recent activity on their production website.

Luckily, apart from an initial historical dataset, Amazon Personalize can also enrich its models in real time by allowing you to feed in interaction data from your live website. Through the use of the Amazon Personalize PutEvents API, you can record any action users take on the website and feed it into Amazon Personalize in near-real time, updating the model with the most recent user behavior and preferences. This is an important capability because it’s natural for user preferences to change over time, and you don’t want to risk presenting them with items that are either out of date or not relevant to them anymore.
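
A minimal sketch of sending such an event with the AWS Python SDK might look like the following; the tracking ID, user and item IDs, event type, and property names are placeholders rather than values from the Pulselive integration.

import json
import time
import boto3

personalize_events = boto3.client('personalize-events')

# The tracking ID comes from an event tracker created for the dataset group.
personalize_events.put_events(
    trackingId = 'your-event-tracker-id',
    userId = 'user-123',
    sessionId = 'session-456',
    eventList = [{
        'eventType': 'watch',
        'itemId': 'video-789',
        'sentAt': int(time.time()),
        'properties': json.dumps({'percentWatched': 0.85})   # hypothetical custom property
    }]
)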

This also means that you can directly connect Amazon Personalize to your website, with no historical data or any models trained, and start feeding in events. After a while, Amazon Personalize has gathered enough data to start making accurate recommendations. For more information, see Recording Events.

We spent some time discussing what other relevant user behavior metadata we could capture, and decided to start recording some to observe whether these would result in a more accurate recommendation system that would impact user engagement on the site. Two simple measures for this were seeing if recommended videos were more frequently visited and watched for longer periods.

We started recording the source of the clicks (recommended list vs. other links on the website), the amount of time a user spent on a clicked video in seconds, and the percentage of the video that time represented (because spending 1 minute on a 20-minute video is different from spending the same time on a 1-minute video and watching it in its entirety). These additions proved to be very important, because after a while user engagement started improving. We discussed and investigated providing even more detailed information about user behavior on the website, but decided to focus on the item and user metadata instead.

Items metadata was important because it allowed Amazon Personalize to have more context on the nature of each video. This ranged from broad video categories, such as interviews and games, to narrower categories, such as “Leagues” and “Friendly games,” to item-specific metadata, such as which players are featured in a video. Adding metadata about the content of each video significantly improved the personalized recommendations, because the solution gained a notion of context that helped determine what type of content each user preferred to watch.

Equally, on the user metadata side, more detailed information was provided to capture the demographics and preferences of each user. Of course, in the case of users, we had to deal with the cold-start problem (new or guest users for whom the system didn’t yet have any information). Luckily, the Amazon Personalize HRNN-Coldstart recipe proved very effective at solving this problem by quickly linking a new user’s behavior to that of existing users. The more time a guest or new user spends on the platform, the more Amazon Personalize learns about their preferences and adjusts its recommendations accordingly.

We had many options of what type of metadata to include in the interactions dataset, but it’s important to make sure we only use relevant metadata, and we had to pay attention to the balance between providing too much information to a model and providing too little.

For example, we considered recording the movement of each user’s cursor on the website and sending that to Amazon Personalize as well, which in theory could provide a marginal improvement to the recommendation system. But doing so proved to be costly and taxing on both the front end (it impacted website performance) and the back end (the volume of data the system had to record, store, and send to Amazon Personalize increased significantly). After careful consideration, we decided that cursor movement metadata wasn’t worth keeping.

After a few months, Pulselive rolled out the Amazon Personalize-based recommendation system to nearly half of their customer’s website visitors, and saw that that group’s engagement with their videos increased by 20%.

Conclusion

Recommendation engines can provide more pertinent results to users based on metadata about a user’s historical selections, or on the types of items of interest.

In this post, we looked at how to select the right metadata to get the best results when training a recommendation engine on Amazon Personalize by evaluating which metadata to include and which to exclude. We also looked at a specific use case and how an AWS customer, Pulselive, increased engagement with videos on their customer’s website by providing personalized recommendations to users.

For more information on creating recommendation engines with Amazon Personalize and metadata selection, see the following:


About the Authors

Andrew Hood is a Prototyping Engagement Manager at AWS.


Ion Kleopas is an ML Prototyping Architect at AWS.

Read More

Streamline modeling with Amazon SageMaker Studio and the Amazon Experiments SDK

Streamline modeling with Amazon SageMaker Studio and the Amazon Experiments SDK

The modeling phase is a highly iterative process in machine learning (ML) projects, where data scientists experiment with various data preprocessing and feature engineering strategies, intertwined with different model architectures, which are then trained with disparate sets of hyperparameter values. This highly iterative process with many moving parts can, over time, become a tremendous headache when it comes to keeping track of the design decisions applied in each iteration and how the training and evaluation metrics of each iteration compare to previous versions of the model.

While your head may be spinning by now, fear not! Amazon SageMaker has a solution!

This post walks you through an end-to-end example of using Amazon SageMaker Studio and the Amazon SageMaker Experiments SDK to organize, track, visualize, and compare our iterative experimentation with a Keras model. Although this use case is specific to Keras framework, you can extend the same approach to other deep learning frameworks and ML algorithms.

Amazon SageMaker is a fully managed service, created with the goal of democratizing ML by empowering developers and data scientists to quickly and cost-effectively build, train, deploy, and monitor ML models.

What Is Amazon SageMaker Experiments?

Amazon SageMaker Experiments is a capability of Amazon SageMaker that lets you effortlessly organize, track, compare, and evaluate your ML experiments. Before we dive into the hands-on exercise, let’s first take a step back and review the building blocks of an experiment and their referential relationships. The following diagram illustrates these building blocks.

Figure 1. The building blocks of Amazon SageMaker Experiments

Amazon SageMaker Experiments is composed of the following components:

  • Experiment – An ML problem that we want to solve. Each experiment consists of a collection of trials.
  • Trial – An iteration of a data science workflow related to an experiment. Each trial consists of several trial components.
  • Trial component – A stage in a given trial. For instance, as we see in our example, we create one trial component for the data preprocessing stage and one trial component for model training. In a similar fashion, we can also add a trial component for any data postprocessing.
  • Tracker – A mechanism that records various metadata about a particular trial component, including any parameters, inputs, outputs, artifacts, and metrics. A tracker can be linked to a particular trial component to assign the collected metadata to it.

Now that we’ve set a rock-solid foundation on the key building blocks of the Amazon SageMaker Experiments SDK, let’s dive into the fun hands-on component.

Prerequisites

You should have an AWS account and a sufficient level of access to create resources in the following AWS services:

Solution overview

As part of this post, we walk through the following high-level steps:

  1. Environment setup
  2. Data preprocessing and feature engineering
  3. Modeling with Amazon SageMaker Experiments
  4. Training and evaluation metric exploration
  5. Environment cleanup

Setting up the environment

We can set up our environment in a few simple steps:

  1. Clone the source code from the GitHub repo, which contains the complete demo, into your Amazon SageMaker Studio environment.
  2. Open the included Jupyter notebook and choose the Python 3 (TensorFlow 2 CPU Optimized) kernel.
  3. When the kernel is ready, install the sagemaker-experiments package, which enables us to work with the Amazon SageMaker Experiments SDK, and the s3fs package, which enables our pandas DataFrames to integrate easily with objects in Amazon S3.
  4. Import all required packages and initialize the variables (a minimal sketch of steps 3 and 4 follows this list).
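
The following is one possible shape of those cells; the bucket, prefix, and variable names are assumptions chosen to line up with the later snippets rather than the notebook’s exact code.

# Step 3: install the packages used in this walkthrough (run in a notebook cell)
# !pip install sagemaker-experiments s3fs

# Step 4: import the required packages and initialize common variables
from datetime import datetime

import boto3
import sagemaker
from smexperiments.experiment import Experiment
from smexperiments.tracker import Tracker

sm = boto3.client('sagemaker')                    # used by the snippets that follow
sm_session = sagemaker.Session()
sm_bucket = sm_session.default_bucket()           # bucket for experiment artifacts (assumption)
artifacts_path = 'abalone-experiment/artifacts'   # example S3 prefix (assumption)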

The following screenshot shows the environment setup.

Figure 2. Environment Setup

Data preprocessing and feature engineering

Excellent! Now, let’s dive into data preprocessing and feature engineering. In our use case, we use the abalone dataset from the UCI Machine Learning Repository.

Run the steps in the provided Jupyter notebook to complete all data preprocessing and feature engineering. After your data is preprocessed, it’s time for us to seamlessly capture our preprocessing strategy! Let’s create an experiment with the following code:

sm = boto3.client('sagemaker') 
ts = datetime.now().strftime('%Y-%m-%d-%H-%M-%S-%f')

abalone_experiment = Experiment.create(
    experiment_name = 'predict-abalone-age-' + ts,
    description = 'Predicting the age of an abalone based on a set of features describing it',
    sagemaker_boto_client=sm)

Now, we can create a Tracker to describe the Pre-processing Trial Component, including the location of the artifacts:

with Tracker.create(display_name='Pre-processing', sagemaker_boto_client=sm, artifact_bucket=sm_bucket, artifact_prefix=artifacts_path) as tracker:
    tracker.log_parameters({
        'train_test_split': 0.8
    })
    tracker.log_input(name='raw data', media_type='s3/uri', value=source_url)
    tracker.log_output(name='preprocessed data', media_type='s3/uri', value=processed_data_path)
    tracker.log_artifact(name='preprocessors', media_type='s3/uri', file_path='preprocessors.pickle')
    
processing_component = tracker.trial_component

Fantastic! We now have our experiment ready and we’ve already done our due diligence to capture our data preprocessing strategy. Next, let’s dive into the modeling phase.

Modeling with Amazon SageMaker Experiments

Our Keras model has two fully connected hidden layers with a variable number of neurons and variable activation functions. This flexibility enables us to pass these values as arguments to a training job and quickly parallelize our experimentation with several model architectures.

We have mean squared logarithmic error defined as the loss function, and the model is using the Adam optimization algorithm. Finally, the model tracks mean squared logarithmic error as our metric, which automatically propagates into our training trial component in our experiment, as we see shortly:

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import Adam

def model(x_train, y_train, x_test, y_test, args):
    """Generate a simple model"""
    model = Sequential([
                Dense(args.l1_size, activation=args.l1_activation, kernel_initializer='normal'),
                Dense(args.l2_size, activation=args.l2_activation, kernel_initializer='normal'),
                Dense(1, activation='linear')
    ])

    model.compile(optimizer=Adam(learning_rate=args.learning_rate),
                  loss='mean_squared_logarithmic_error',
                  metrics=['mean_squared_logarithmic_error'])
    model.fit(x_train, y_train, batch_size=args.batch_size, epochs=args.epochs, verbose=1)
    model.evaluate(x_test,y_test,verbose=1)

    return model

Fantastic! Follow the steps in the provided notebook to define the hyperparameters for experimentation and instantiate the TensorFlow estimator. Finally, let’s start our training jobs and supply the names of our experiment and trial via the experiment_config dictionary:

abalone_estimator.fit(processed_data_path,
                        job_name=job_name,
                        wait=False,
                        experiment_config={
                                        'ExperimentName': abalone_experiment.experiment_name,
                                        'TrialName': abalone_trial.trial_name,
                                        'TrialComponentDisplayName': 'Training',
                                        })
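
For context, the trial and estimator referenced in this call could be created along the following lines; the entry point name, instance type, framework version, and hyperparameter values are assumptions, not the notebook’s exact settings.

import sagemaker
from sagemaker.tensorflow import TensorFlow
from smexperiments.trial import Trial

# Create a trial in the experiment and attach the preprocessing trial component to it
abalone_trial = Trial.create(
    trial_name = 'abalone-trial-0-' + ts,
    experiment_name = abalone_experiment.experiment_name,
    sagemaker_boto_client = sm)
abalone_trial.add_trial_component(processing_component)

# TensorFlow estimator that runs the Keras training script (script name is a placeholder)
abalone_estimator = TensorFlow(
    entry_point = 'train.py',
    role = sagemaker.get_execution_role(),
    instance_count = 1,
    instance_type = 'ml.m5.xlarge',
    framework_version = '2.1',
    py_version = 'py3',
    hyperparameters = {'epochs': 100, 'batch_size': 32, 'learning_rate': 0.001,
                       'l1_size': 64, 'l1_activation': 'relu',
                       'l2_size': 32, 'l2_activation': 'relu'})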

Exploring the training and evaluation metrics

Upon completion of the training jobs, we can quickly visualize how different variations of the model compare in terms of the metrics collected during model training. For instance, let’s see how the loss has been decreasing by epoch for each variation of the model and observe the model architecture that is most effective in decreasing the loss:

  1. Choose the Amazon SageMaker Experiments List icon on the left sidebar.
  2. Choose your experiment to open it and press Shift to select all four trials.
  3. Choose any of the highlighted trials (right-click) and choose Open in trial component list.
  4. Press Shift to select the four trial components representing the training jobs and choose Add chart.
  5. Choose New chart and customize it to plot the collected metrics that you want to analyze. For our use case, choose the following:
    1. For Data type, choose Time series.
    2. For Chart type, choose Line.
    3. For X-axis dimension, choose epoch.
    4. For Y-axis, choose loss_TRAIN_last.

Figure 3. Generating plots based on the collected model training metrics

Wow! How quick and effortless was that?! I encourage you to further explore plotting various other metrics on your own. For instance, you can choose the Summary data type to generate a scatter plot and explore if there is a relationship between the size of the first hidden layer in your neural network and the mean squared logarithmic error. See the following screenshot.

Figure 4. Plot of the relationship between the size of the first hidden layer in the neural network and Mean-Squared Logarithmic Error during model evaluation

Next, let’s choose our best-performing trial (abalone-trial-0). As expected, we see two trial components. One represents our data Pre-processing, and the other reflects our model Training. When we open the Training trial component, we see that it contains all the hyperparameters, input data location, Amazon S3 location of this particular version of the model, and more.

Figure 5. Metadata about model training, automatically collected by Amazon SageMaker Experiments

Similarly, when we open the Pre-processing component, we see that it captures where the source data came from, where the processed data was stored in Amazon S3, and where we can easily find our trained encoder and scalers, which we’ve packaged into the preprocessors.pickle artifact.

Figure 6. Metadata about data pre-processing and feature engineering, automatically collected by Amazon SageMaker Experiments

Cleaning up

What a fun exploration this has been! Let’s now clean up after ourselves by running the cleanup function provided at the end of the notebook to hierarchically delete all elements of the experiment that we created in this post:

abalone_experiment.delete_all('--force')

Conclusion

You have now learned to seamlessly track the design decisions that you made during data preprocessing and model training, as well as rapidly compare and analyze the performance of various iterations of your model by using the tracked metrics of the trials in your experiment.

I hope that you enjoyed diving into the intricacies of the Amazon SageMaker Experiments SDK and exploring how Amazon SageMaker Studio smoothly integrates with it, enabling you to lose yourself in experimentation with your ML model without losing track of the hard work you’ve done! I highly encourage you to leverage the Amazon SageMaker Experiments Python SDK in your next ML engagement and I invite you to consider contributing to the further evolution of this open-sourced project.


About the Author

Ivan Kopas is a Machine Learning Engineer for AWS Professional Services, based out of the United States. Ivan is passionate about working closely with AWS customers from a variety of industries and helping them leverage AWS services to spearhead their toughest AI/ML challenges. In his spare time, he enjoys spending time with his family, working out, hanging out with friends and diving deep into the fascinating realms of economics, psychology and philosophy.


Read More