Explaining Amazon SageMaker Autopilot models with SHAP

Explaining Amazon SageMaker Autopilot models with SHAP

Machine learning (ML) models have long been considered black boxes because predictions from these models are hard to interpret. However, recently, several frameworks aiming at explaining ML models were proposed. Model interpretation can be divided into local and global explanations. A local explanation considers a single sample and answers questions like “Why does the model predict that Customer A will stop using the product?” or “Why did the ML system refuse John Doe a loan?” Another interesting question is “What should John Doe change in order to get the loan approved?” In contrast, global explanations aim at explaining the model itself and answer questions like “Which features are important for prediction?” You can use local explanations to derive global explanations by averaging many samples. For further reading on interpretable ML, see the excellent book Interpretable Machine Learning by Christoph Molnar.

In this post, we demonstrate using the popular model interpretation framework SHAP for both local and global interpretation.

SHAP

SHAP is a game theoretic framework inspired by shapley values that provides local explanations for any model. SHAP has gained popularity in recent years, probably due to its strong theoretical basis. The SHAP package contains several algorithms that, when given a sample and model, derive the SHAP value for each of the model’s input features. The SHAP value of a feature represents its contribution to the model’s prediction.

To explain models built by Amazon SageMaker Autopilot, we use SHAP’s KernelExplainer, which is a black box explainer. KernelExplainer is robust and can explain any model, so can handle the complex feature processing of Amazon SageMaker Autopilot. KernelExplainer only requires that the model support an inference functionality that, when given a sample, returns the model’s prediction for that sample. The prediction is the predicted value for regression and the class probability for classification.

SHAP includes several other explainers, such as TreeExplainer and DeepExplainer, which are specific for decision forest and neural networks, respectively. These are not black box explainers and require knowledge of the model structure and trained params. TreeExplainer and DeepExplainer are limited and, as of this writing, can’t support any feature processing.

Creating a notebook instance

You can run the example code provided in this post. It’s recommended to run the code inside an Amazon SageMaker instance type of ml.m5.xlarge or larger to accelerate running time. To launch the notebook with the example code using Amazon SageMaker Studio, complete the following steps:

  1. Launch an Amazon SageMaker Studio instance.
  2. Open terminal and clone the GitHub repogit clone https://github.com/awslabs/amazon-sagemaker-examples.git
  3. Open the notebook autopilot/model-explainability/explaining_customer_churn_model.ipynb.
  4. Use kernel Python 3 (Data Science).

Setting up the required packages

In this post, we start with a model built by Amazon SageMaker Autopilot, which was already trained on a binary classification task. See the following code:

import boto3
import pandas as pd
import sagemaker
from sagemaker import AutoML
from datetime import datetime
import numpy as np
region = boto3.Session().region_name
session = sagemaker.Session()

For instructions on creating and training an Amazon SageMaker Autopilot model, see Customer Churn Prediction with Amazon SageMaker Autopilot.

Install SHAP with the following code:

!conda install -c conda-forge -y shap
import shap
from shap import KernelExplainer
from shap import sample
from scipy.special import expit

Initialize the plugin to make the plots interactive.
shap.initjs()

Creating an inference endpoint

Create an inference endpoint for the trained model built by Amazon SageMaker Autopilot. See the following code:

autopilot_job_name = '<your_automl_job_name_here>'
autopilot_job = AutoML.attach(autopilot_job_name, sagemaker_session=session)
ep_name = 'sagemaker-automl-' + datetime.now().strftime('%Y-%m-%d-%H-%M-%S')

For classification response to work with SHAP we need the probability scores. This can be achieved by providing a list of keys for response content. The order of the keys will dictate the content order in the response. This parameter is not needed for regression.

inference_response_keys = ['predicted_label', 'probability']

Create the inference endpoint

autopilot_job.deploy(initial_instance_count=1, instance_type='ml.m5.2xlarge', inference_response_keys=inference_response_keys, endpoint_name=ep_name)

You can skip this step if an endpoint with the argument inference_response_keys set as ['predicted_label', 'probability'] was already created.

Wrapping the Amazon SageMaker Autopilot endpoint with an estimator class

For ease of use, we wrap the inference endpoint with a custom estimator class. Two inference functions are provided: predict, which returns the numeric prediction value to be used for regression, and predict_proba, which returns the class probabilities to be used for classification. See the following code:

from sagemaker.predictor import RealTimePredictor
from sagemaker.content_types import CONTENT_TYPE_CSV

class AutomlEstimator:
    def __init__(self, endpoint, sagemaker_session):
        self.predictor = RealTimePredictor(
            endpoint=endpoint,
            sagemaker_session=sagemaker_session,
            content_type=CONTENT_TYPE_CSV,
            accept=CONTENT_TYPE_CSV
        )
    
    def get_automl_response(self, x):
        if x.__class__.__name__ == 'ndarray':
            payload = ""
            for row in x:
                payload = payload + ','.join(map(str, row)) + 'n'
        else:
            payload = x.to_csv(sep=',', header=False, index=False)
        return self.predictor.predict(payload).decode('utf-8')

    # Prediction function for regression
    def predict(self, x):
        response = self.get_automl_response(x)
        # Return the first column from the response array containing the numeric prediction value (or label in case of classification)
        response = np.array([x.split(',')[0] for x in response.split('n')[:-1]])
        return response

    # Prediction function for classification
    def predict_proba(self, x):
        # Return the probability score from AutoPilot’s endpoint response
        response = self.get_automl_response(x)
        response = np.array([x.split(',')[1] for x in response.split('n')[:-1]])
        return response.astype(float)

Create an instance of AutomlEstimator:

automl_estimator = AutomlEstimator(endpoint=ep_name, sagemaker_session=session)

Data

In this notebook, we use the same dataset as used in the Customer Churn Prediction with Amazon SageMaker Autopilot GitHub repo. Follow the notebook in the GitHub repo to download the dataset if it was not previously downloaded.

Background data

KernelExplainer requires a sample of the data to be used as background data. KernelExplainer uses this data to simulate a feature being missing by replacing the feature value with a random value from the background. We use shap.sample to sample 50 rows from the dataset to be used as background data. Using more samples as background data produces more accurate results, but runtime increases. The clustering algorithms provided in SHAP only support numeric data. You can use a vector of zeros as background data to produce reasonable results.

Choosing background data is challenging. For more information, see AI Explanations Whitepaper and Runtime considerations.

churn_data = pd.read_csv('../Data sets/churn.txt')
data_without_target = churn_data.drop(columns=['Churn?'])
background_data = sample(data_without_target, 50)

Setting up KernelExplainer

Next, we create the KernelExplainer. Because it’s a black box explainer, KernelExplainer only requires a handle to the predict (or predict_proba) function and doesn’t require any other information about the model. For classification, it’s recommended to derive feature importance scores in the log-odds space because additivity is a more natural assumption there, so we use Logit. For regression, you should use Identity. See the following code:

problem_type = automl_job.describe_auto_ml_job(job_name=automl_job_name)['ResolvedAttributes']['ProblemType'] 
link = "identity" if problem_type == 'Regression' else "logit" 

The handle to predict_proba is passed to KernelExplainer since KernelSHAP requires the class probability:

explainer = KernelExplainer(automl_estimator.predict_proba, background_data, link=link)

By analyzing the background data, KernelExplainer provides us with explainer.expected_value, which is the model prediction with all features missing. Considering a customer for which we have no data at all (all features are missing), this should theoretically be the model prediction. See the following code:

Since expected_value is given in the log-odds space we convert it back to probability using expit which is the inverse function to logit

print('expected value =', expit(explainer.expected_value))
expected value = 0.21051377184689046

Local explanation with KernelExplainer

We use KernelExplainer to explain the prediction of a single sample, the first sample in the dataset. See the following code:

# Get the first sample
x = data_without_target.iloc[0:1]

ManagedEndpoint will auto delete the endpoint after calculating the SHAP values. To disable auto delete, use ManagedEndpoint(ep_name, auto_delete=False)

from managed_endpoint import ManagedEndpoint
with ManagedEndpoint(ep_name) as mep:
    shap_values = explainer.shap_values(x, nsamples='auto', l1_reg='aic')

The SHAP package includes many visualization tools. The following force_plot code provides a visualization for the SHAP values of a single sample. Since shap_values are provided in the log-odds space, we convert them back to the probability space by using Logit

shap.force_plot(explainer.expected_value, shap_values, x, link=link)

The following visualization is the result.

From this plot, we learn that the most influential feature is VMail Message, which pushes the probability down by about 7%. VMail Message = 25 makes the probability 7% lower in comparison to the notion of that feature being missing. SHAP values don’t provide the information of how increasing or decreasing VMail Message affects prediction.

In many use cases, we’re interested only in the most influential features. By setting l1_reg='num_features(5)', SHAP provides non-zero scores for only the most influential five features:

with ManagedEndpoint(ep_name) as mep:
    shap_values = explainer.shap_values(x, nsamples='auto', l1_reg='num_features(5)')
shap.force_plot(explainer.expected_value, shap_values, x, link=link)

The following visualization is the result.

KernelExplainer computation cost

KernelExplainer computation cost is dominated by the inference calls. To estimate SHAP values for a single sample, KernelExplainer calls the inference function twice: first with the sample unaugmented, and then with many randomly augmented instances of the sample. The number of augmented instances in our use case is 50 (number of samples in the background data) * 2088 (nsamples = 'auto') = 104,400. So, for this use case, the cost of running KernelExplainer for a single sample is roughly the cost of 104,400 inference calls.

Global explanation with KernelExplainer

Next, we use KernelExplainer to provide insight about the model as a whole. We do this by running KernelExplainer locally on 50 samples and aggregating the results:

X = sample(data_without_target, 50)
with ManagedEndpoint(ep_name) as mep:
    shap_values = explainer.shap_values(X, nsamples='auto', l1_reg='aic')

You can use force_plot to visualize SHAP values for many samples simultaneously, force_plot then rotates the plot of each sample by 90 degrees and stacks the plots horizontally. See the following code:

shap.force_plot(explainer.expected_value, shap_values, X, link=link)

The resulting plot is interactive (in the notebook) and can be manually analyzed.

summary_plot is another visualization tool displaying the mean absolute value of the SHAP values for each feature using a bar plot. Currently, summary_plot doesn’t support link functions, so the SHAP values are presented in the log-odds space (and not the probability space). See the following code:

shap.summary_plot(shap_values, X, plot_type="bar")

The following graph shows the results.

Conclusion

In this post, we demonstrated how to use KernelSHAP to explain models created by Amazon SageMaker Autopilot, both locally and globally. KernelExplainer is a robust black box explainer that requires only that the model support an inference functionality that, when given a sample, returns the model’s prediction for that sample. This inference functionality was provided by wrapping the Amazon SageMaker Autopilot inference endpoint with a custom estimator class.

For more information about Amazon SageMaker Autopilot, see Amazon SageMaker Autopilot.

To explore related features of Amazon SageMaker, see the following:

 


About the Authors

Yotam Elor is a Senior Applied Scientist at AWS Sagemaker. He works on Sagemaker Autopilot – AWS’s auto ML solution.

 

 

 

 

Somnath Sarkar is a Software Engineer in the AWS SageMaker Autopilot team. He enjoys machine learning in general with focus in scalable and distributed systems.

Read More

Creating an intelligent ticket routing solution using Slack, Amazon AppFlow, and Amazon Comprehend

Creating an intelligent ticket routing solution using Slack, Amazon AppFlow, and Amazon Comprehend

Support tickets, customer feedback forms, user surveys, product feedback, and forum posts are some of the documents that businesses collect from their customers and employees. The applications used to collect these case documents typically include incident management systems, social media channels, customer forums, and email. Routing these cases quickly and accurately to support groups best suited to handle them speeds up resolution times and increases customer satisfaction.

In traditional incident management systems internal to a business, assigning the case to a support group is either done by the employee during case creation or a centralized support group routing these tickets to specialized groups after case creation. Both of these scenarios have drawbacks. In the first scenario, the employee opening the case should be aware of the various support groups and their function. The decision to pick the right support group increases the cognitive overload on the employee opening the case. There is a chance of human error in both scenarios, which results in re-routing cases and thereby increasing the resolution times. These repetitive tasks result in a decrease of employee productivity.

Enterprises use business communication platforms like Slack to facilitate conversations between employees. This post provides a solution that simplifies reporting incidents through Slack and routes them to the right support groups. You can use this solution to set up a Slack channel in which employees can report many types of support issues. Individual support groups have their own private Slack channels.

Amazon AppFlow provides a no-code solution to transfer data from Slack channels into AWS securely. You can use Amazon Comprehend custom classification to classify the case documents into support groups automatically. Upon classification, you post the message in the respective private support channel by using Slack Application Programming Interface (API) integration. Depending on the incident management system, you can automate the ticket creation process using APIs.

When you combine Amazon AppFlow with Amazon Comprehend, you can implement an accurate, intelligent routing solution that eliminates the need to create and assign tickets to support groups manually. You can increase productivity by focusing on higher-priority tasks.

Solution overview

For our use case, we use the fictitious company AnyCorp Software Inc, whose programmers use a primary Slack channel to ask technical questions about four different topics. The programmer gets a reply with a ticket number that they can refer to for future communication. The question is intelligently routed to one of the five  specific channels dedicated to each topic-specific support group. The following diagram illustrates this architecture.

The solution to building this intelligent ticket routing solution comprises four main components:

  • Communication platform – A Slack application with a primary support channel for employees to report issues, four private channels for each support group and one private channel for all other issues.
  • Data transfer – A flow in Amazon AppFlow securely transfers data from the primary support channel in Slack to an Amazon Simple Storage Service (Amazon S3) bucket, scheduled to run every 1 minute.
  • Document classification – A multi-class custom document classifier in Amazon Comprehend uses ground truth data comprising issue descriptions and their corresponding support group labels. You also create an endpoint for this custom classification model.
  • Routing controller – An AWS Lambda function is triggered when Amazon AppFlow puts new incidents into the S3 bucket. For every incident received, the function calls the Amazon Comprehend custom classification model endpoint, which returns a label for the support group best suited to address the incident. After receiving the label from Amazon Comprehend, the function using the Slack API replies to the original thread in the primary support channel. The reply contains a ticket number and the support group’s name that will address the issue. Simultaneously, the function posts the issue to the private channel associated with the support group that will address the issue.

Dataset

For Amazon Comprehend to classify a document into one of the named categories, it needs to train on a dataset with known inputs and outputs. For our use case, we use the Jira Social Repository dataset hosted on GitHub. The dataset comprises issues extracted from the Jira Issue Tracking System of four popular open-source ecosystems: the Apache Software Foundation, Spring, JBoss, and CodeHaus communities. We used the Apache Software Foundation issues, filtered four categories (GROOVY, HADOOP, MAVEN, and LOG4J2) and created a CSV file for training purposes.

  1. Download the data.zip
  2. On the Amazon S3 console, choose Create bucket.
  3. For Bucket name, enter [YOUR_COMPANY]-comprehend-issue-classifier.
  4. Choose Create.
  5. Unzip the train-data.zip file and upload all files in folder into the [YOUR_COMPANY]-comprehend-issue-classifier bucket.
  6. Create another bucket named [YOUR_COMPANY]-comprehend-issue-classifier-output.

We store the output of the custom classification model training in this folder.

Your [YOUR_COMPANY]-comprehend-issue-classifier bucket should look like the following screenshot.

Training Data for Comprehend

Deploying Amazon Comprehend

To deploy Amazon Comprehend, complete the following steps:

  1. On the Amazon Comprehend console, under Customization, choose Custom classification.
  2. Choose Train classifier.
  3. For Name, enter comprehend-issue-classifier.
  4. For Classifier mode, select Using Multi-class mode.

Because our dataset has multiple classes and only one class per line, we use the multi-class mode.

  1. For S3 location, enter s3://[YOUR_COMPANY]-comprehend-issue-classifier.
  2. For Output data, choose Browse S3.
  3. Find the bucket you created in the previous step and choose the s3://[YOUR_COMPANY]-comprehend-issue-classifier-output folder.
  4. For IAM role, select Create an IAM role.
  5. For Permissions to access, choose Input and output (if specified) S3 bucket.
  6. For Name suffix, enter comprehend-issue-classifier.

Custom Classification Model Training Job Configuration

  1. Choose Train classifier.

The process can take up to 30 minutes to complete.

  1. When the training is complete and the status shows as Trained, choose comprehend-issue-classifier.
  2. In the Endpoints section, choose Create endpoint.
  3. For Endpoint name, enter comprehend-issue-classifier-endpoint.
  4. For Inference units, enter 1.
  5. Choose Create endpoint.
  6. When the endpoint is created, copy its ARN from the Endpoint details section to use later in the Lambda function.

Creating a Slack app

In this section, we create a Slack app to connect with Amazon AppFlow for our intelligent ticket routing solution. For more information, see Creating, managing, and building apps.

  • Sign in to your Slack workspace where you’d like to create the ticket routing solution, or create a new workspace.
  • Create a Slack app named TicketResolver.
  • After you create the app, in the navigation pane, under Features, choose OAuth & Permissions.
  • For Redirect URLs, enter https://console.aws.amazon.com/appflow/oauth.
  • For User Token Scopes, add the following:
    • channels:history
    • channels:read
    • chat:write
    • groups:history
    • groups:read
    • im:history
    • im:read
    • mpim:history
    • mpim:read
    • users:read
  1. In the navigation pane, under Settings, choose Basic Information.
  2. Expand Install your app to your workspace.
  3. Choose Install App to Workspace.
  4. Follow the instructions to install the app to your workspace.

  1. Create a Slack channel named testing-slack-integration. This channel is your primary channel to report issues.
  2. Create an additional five channels: groovy-issues, hadoop-issues, maven-isssues, log4j2-issues, other-issues. Mark them all as private. These will be used by the support groups designated to handle the specific issues.
  3. In your channel, choose Connect an app.

  1. Connect the TicketResolver app you created.

Deploying the AWS CloudFormation template

You can deploy this architecture using the provided AWS CloudFormation template in us-east-1.

  1. Choose Launch Stack:

  1. Provide a stack name.
  2. Provide the following parameters:
    • CategoryChannelMap, which is a mapping between Amazon Comprehend categories and your Slack channels in string format; for example, '{ "GROOVY":"groovy-issues", "HADOOP":"hadoop-issues", "MAVEN":"maven-issues", "LOG4J2":"log4j-issues", "OTHERS":"other-issues" }'
    • ComprehendClassificationScoreThreshold, which can be left with default value of 0.75
    • ComprehendEndpointArn which is created in the previous step that looks like arn:aws:comprehend:{YOUR_REGION}:{YOUR_ACCOUNT_ID}:document-classifier-endpoint/comprehend-issue-classifier-endpoint
    • Region where your AWS resources are provisioned. Default is set to us-east-1
    • SlackOAuthAccessToken, which is the OAuth access token on your Slack API page in the OAuth Tokens & Redirect URLs section

    • SlackClientID can be found under App Credentials section from your slack app home page as Client ID
    • SlackClientSecret can be found under App Credentials section from your slack app home page as Client Secret

    • SlackWorkspaceInstanceURL which can be found by clicking the down arrow next to workspace.

    • SlackChannelID which is the channel ID for the testing-slack-integration channel

    • LambdaCodeBucket which is a bucket name where your lambda code is stored. Default is set to intelligent-routing-lambda-code, which is the public bucket containing the lambda deployment package. If your AWS account is in us-east-1, no change is needed. For other regions, please download the lambda deployment package from here. Create a s3 bucket in your AWS account and upload the package, and change the parameter value to your bucket name.
    • LambdaCodeKey which is a zip file name of your lambda code. Default is set to lambda.zip, which is the name of deployment package in the public bucket. Please revise this to your file name if you had to download and upload the lambda deployment package to your bucket in step k.
  1. Choose Next
  2. In the Capabilities and transforms section, select all three check-boxes to provide acknowledgment to AWS CloudFormation to create AWS Identity and Access Management (IAM) resources and expand the template.
  3. Choose Create stack.

This process might take 15 minutes or more to complete, and creates the following resources:

  • IAM roles for the Lambda function to use
  • A Lambda function to integrate Slack with Amazon Comprehend to categorize issues typed by Slack users
  • An Amazon AppFlow Slack connection for the flow to use
  • An Amazon AppFlow flow to securely connect the Slack app with AWS services

Activating the Amazon AppFlow flow

You can create a flow on the Amazon AppFlow console.

  1. On the Amazon AppFlow console, choose View flows.
  2. Choose Activate flow.

Your SlackAppFlow is now active and runs every 1 minute to gather incremental data from the Slack channel testing-slack-integration.

Testing your integration

We can test the end-to-end integration by typing some issues related to your channels in the testing-slack-integration channel and waiting for about 1 minute for your Amazon AppFlow connection to transfer data to the S3 bucket. This triggers the Lambda function to run Amazon Comprehend analysis and return a category, and finally respond in the testing-slack-integration channel and the channel with the corresponding category with a random ticket number generated.

For example, in the following screenshot, we enter a Maven-related issue in the testing-slack-integration channel.

You see a reply from the TicketResolver app added to your original response in the testing-slack-integration channel.

Also, you see a slack message posted in channel.

Cleaning up

To avoid incurring any charges in the future, delete all the resources you created as part of this post:

  • Amazon Comprehend endpoint comprehend-issue-classifier-endpoint
  • Amazon Comprehend classifier comprehend-issue-classifier
  • Slack app TicketResolver
  • Slack channels testing-slack-integration, groovy-issues, hadoop-issues, maven-issues, log4j2-issues, and other-issues
  • S3 bucket comprehend-issue-classifier-output
  • S3 bucket comprehend-issue-classifier
  • CloudFormation stack (this removes all the resources the CloudFormation template created)

Conclusion

In this post, you learned how to use Amazon Comprehend, Amazon AppFlow, and Slack to create an intelligent issue-routing solution. For more information about securely transferring data software-as-a-service (SaaS) applications like Salesforce, Marketo, Slack, and ServiceNow to AWS, see Get Started with Amazon AppFlow. For more information about Amazon Comprehend custom classification models, see Custom Classification. You can also discover other Amazon Comprehend features and get inspiration from other AWS blog posts about using Amazon Comprehend beyond classification.

 


About the Author

Shanthan Kesharaju is a Senior Architect who helps our customers with AI/ML strategy and architecture. He is an award winning product manager and has built top trending Alexa skills. Shanthan has an MBA in Marketing from Duke University and an MS in Management Information Systems from Oklahoma State University.

 

 

So Young Yoon is a Conversation A.I. Architect at AWS Professional Services where she works with customers across multiple industries to develop specialized conversational assistants which have helped these customers provide their users faster and accurate information through natural language. Soyoung has M.S. and B.S. in Electrical and Computer Engineering from Carnegie Mellon University.

 

Read More

Real-time data labeling pipeline for ML workflows using Amazon SageMaker Ground Truth

Real-time data labeling pipeline for ML workflows using Amazon SageMaker Ground Truth

High-quality machine learning (ML) models depend on accurately labeled, high-quality training, validation, and test data. As ML and deep learning models are increasingly integrated into production environments, it’s becoming more important than ever to have customizable, real-time data labeling pipelines that can continuously receive and process unlabeled data.

For example, you may want to create a consumer-facing application that regularly collects and sends new data objects to a data labeling pipeline, which produces labels and builds a dataset for model training or retraining. This pipeline creates a positive feedback loop that leads to more accurate, sophisticated models.

Amazon SageMaker Ground Truth streaming labeling jobs provide infrastructure and resources to create a continuously running labeling job that receives new data objects on demand and sends them to human workers to be labeled. You can chain multiple streaming labeling jobs together to create more intricate and refined data labeling pipelines.

Use this blog post to learn how to set up and customize Ground Truth streaming labeling jobs.

Walkthrough overview

In addition to discussing the benefits of using a streaming labeling job, such as eliminating delays, enforcing idempotency, and customizing input data sources, this post features two Jupyter notebooks that you can use to set up streaming labeling jobs. You can use these notebooks or follow the console instructions in this post to create a streaming labeling job using a supported, language-specific AWS Software Development Kit (SDK) of your choice.

The first notebook shows you how to create Ground Truth streaming labeling jobs. This notebook supports built-in and custom task types, which allow you to quickly create data labeling pipelines for various data types such as image, text, video, video frame, 3D point cloud, and more. This walkthrough demonstrates how to use Amazon Simple Notification Service (Amazon SNS) to send secure, real-time messages to a streaming labeling job to feed new data objects to human workers for labeling. You learn how you can set up notifications to receive the output data from that labeling task in real time, as soon as workers finish labeling a data object.

When you create a streaming labeling job, you can route the output data of that job to another streaming labeling job to create more complex data labeling pipelines and for data label verification and adjustment. This is referred to as chaining labeling jobs. You can use the second notebook with this post to learn how to chain two streaming labeling jobs together.

You can run both notebooks on default mode, requiring little to no input. Default mode creates image object detection (bounding box) labeling jobs and demonstrates how to send data objects to these labeling jobs. If you have your own data objects that you want to use, you can turn off default mode.

To get started, complete the Prerequisites and Launching a notebook instance and setting up the demo notebook sections in this post to gather the resources you need to complete this tutorial and, optionally, set up the Jupyter notebooks ground_truth_create_streaming_labeling_job.ipynb and ground_truth_create_chained_streaming_labeling_job.ipynb in an Amazon SageMaker notebook instance.

The following diagram illustrates the solution architecture.

Advantages of streaming labeling jobs

The first notebook shows you how to create Ground Truth streaming labeling jobs.

Real-time input channel

You can feed objects in real time and continuously to a labeling job. Amazon SNS allows you to configure topics to feed objects in real time to a running labeling job.

Long-running workflows

You can launch labeling jobs that can run for a long time if they’re actively being fed objects. Streaming jobs are designed to be long-running workflows that keep running until you choose to stop them.

Ground Truth will stop the job if it is idle for a long time. A job is defined as idle if Ground Truth doesn’t detect any objects waiting to be labeled in the system over a certain number of days. For example, if Ground Truth doesn’t receive new data objects from the SNS input topic and all the objects fed to the system are already labeled, a timer for idle time starts. If the idle timer hits a certain number, Ground Truth stops the labeling job.

In short, if objects are actively flowing through the system at regular cadence, and you can achieve a long-running workflow. For more information about configuring idle timers, see Stop a Streaming Labeling Job.

Eliminate delays

With streaming labeling jobs, objects can flow through your data labeling pipeline faster. Streaming jobs work in a sliding window manner, where Ground Truth keeps sending objects for labeling as long as slots are available. The slots are defined by the parameter MaxConcurrentTaskCount, which defines the maximum number of objects (slots) that can be filled by objects to be sent for labeling. When MaxConcurrentTaskCount is reached, you can view the number of data objects queued in Amazon Simple Queue Services (Amazon SQS).

For example, if MaxConcurrentTaskCount is 10, and 25 objects are sent via the input SNS topic, Ground Truth sends a maximum of 10 objects to the workers at a time and a maximum of 15 remaining objects are in the Amazon Simple Queue Service (Amazon SQS) queue. If a worker works on and submits 2 objects out of the 10 that were sent, only 8 slots are currently filled, and 2 more are sent to workers from the remaining 15 objects. This way, workers have a constant flow of objects coming in from your inputs, up to a maximum of 10 objects. There aren’t any delays resulting from batching objects. As workers work on objects, new objects are pumped in constantly and you can achieve data labeling with greater speed.

Rate limiting

You can limit and control how and when you feed data to workers. When you feed objects to your input SNS topic, they’re collected in an SQS queue in your account, named GroundTruth-<labeling-job-name>. If more objects are sent to the labeling job than the MaxConcurrentTaskCount, they remain in the SQS queue. Otherwise, they are sent to workers to be labeled. Any object in the SQS queue is available for a maximum of 14 days.

For example, if MaxConcurrentTaskCount is 1000, and 2,500 objects are sent to a streaming labeling job via an input SNS topic, Ground Truth sends a maximum of 1,000 objects to the workers at a time, and initially, 1,500 remain in the SQS queue. The speed of the workers determines how quickly the 1,500 objects in the queue are sent for labeling. If these objects remain in the queue for longer than expected, this serves as an indicator that you have sent more objects than can be worked on by workers in a given timeframe. If the data objects take longer than expected to label, you can adjust input to feed objects to Amazon SNS at a slower pace. You can also change the value of MaxConcurrentTaskCount to suit the pace of the worker.

To monitor the speed and quantity of data objects being fed into the SQS queue associated with a streaming labeling job, you can set up alarms for the queue with Amazon CloudWatch. For more information, see Available CloudWatch metrics for Amazon SQS. For example, you can set up an alarm on the ApproximateAgeOfOldestMessage metric to see how close your oldest data object is to the 14-day limit. When this alarm is trigged, you can take appropriate actions, like resending the object to the input SNS topic or notifying workers that tasks will expire if not completed within a given timeframe.

Output notifications

A new SNS channel is added as an output channel for your labeling job. When a worker completes a labeling job task from a Ground Truth streaming labeling job, Ground Truth uses your output topic to publish output data to one or more endpoints that you specify. To receive notifications when a worker finishes a labeling task, you must subscribe an endpoint to your SNS output topic. For example, you can subscribe an email, an AWS Lambda function, or an SQS queue to the SNS output topic used for labeling job, and any object labeled through Ground Truth appears in real time after labeling.

In addition to the SNS output topic, you can also use the frequent Amazon Simple Storage Service (Amazon S3) output file updates in the Amazon S3 output path. All labels are added to an output manifest file in Amazon S3. You can reference this file if, for example, the real-time output notifications were missed. If the S3 bucket is versioned, you can view and access different versions of the output manifest file.

Idempotency

You can use a unique identifier to distinguish the objects you feed to a labeling job and track them in the output. You can bring your own unique identifier, or take advantage of an auto-generated identifier Ground Truth creates if you don’t supply one.

When you send a data object to your streaming labeling job using an Amazon SNS message, you can specify your deduplication key and deduplication ID. The unique identifier helps make sure that each object sent for labeling is unique. If you send two objects with the same unique identifier, the latter object is considered a duplicate. This prevents in accidental injection of objects that weren’t intended and also provides an ID to track output data when labels are generated. For more information, see Duplicate Message Handling.

Drop objects into Amazon S3

You can set up your S3 buckets to automatically publish data labeling requests to your SNS input topic any time a data object is added to the bucket. With this setup, you can drop objects into the S3 bucket and they are automatically sent to your streaming labeling job.

For more information about setting up your S3 bucket and notifications, see Send Data Objects to Your Labeling Job Using An S3 Bucket.

Solution overview

To complete this use case, use the notebook ground_truth_create_streaming_labeling_job.ipynb in the Amazon SageMaker Examples GitHub repo.

After completing the prerequisites, you can use this walkthrough to do the following:

  1. Launch a notebook instance and set up the demo notebook
  2. Launch a streaming job
  3. Monitor the job
  4. Send objects to an ongoing job
  5. Stop the labeling job

Streaming labeling jobs are launched using the Ground Truth API operation CreateLabelingJob in a supported language-specific AWS SDK.

Prerequisites

If you’re a new user of Ground Truth streaming labeling jobs, it’s recommended you review Ground Truth Streaming Labeling Jobs before completing this walkthrough.

To complete this walkthrough, you need the following:

  • An AWS account.
  • An S3 bucket in the same AWS Region you use to launch your streaming labeling job. If you’re using a demo notebook, this bucket must also be in the same Region as your Amazon SageMaker notebook instance. You can either specify this bucket in the notebook variable BUCKET, or use the default bucket in the Region that you create your notebook instance in. For more information, see How do I create an S3 Bucket?
  • An AWS Identity and Access Management (IAM) execution role with required permissions. The notebook automatically uses the role you used to create your notebook instance (see the next item in this list). Add the following permissions to this IAM role:
    • Attach managed policies AmazonSageMakerGroundTruthExecution. The following GIF demonstrates how to attach this policy to the role on the IAM console.

    • When you create your role, you specify Amazon S3 permissions. You can either allow that role to access all your resources in Amazon S3, or you can specify particular buckets. Make sure that your IAM role has access to the S3 bucket that you plan to use. This bucket must be in the same Region as your notebook instance.
  • A work team. A work team is a group of people that you select to label your data. A work team is a group of workers from a workforce, which is made up of workers engaged through Amazon Mechanical Turk, vendor-managed workers, or your own private workers that you invite to work on your tasks. Whichever workforce type you choose, Ground Truth takes care of sending tasks to workers. To preview the worker UI, use a private workforce and add yourself to the work team you use in the notebook.
    • To use a private or vendor workforce, record the Amazon Resource Name (ARN) of the work team you use—you need it in the accompanying Jupyter notebooks. The following GIF demonstrates how to quickly create a private work team on the Amazon SageMaker console.

    • If you don’t specify a private or vendor workforce, the notebook automatically uses the Mechanical Turk workforce. When you create the labeling job, you can specify the total amount you pay an individual worker for labeling a data object. To learn more, see Amazon SageMaker Ground Truth pricing.
  • If you’re not using default mode in the notebooks, you must supply a HTML worker task template. This template is used to render the human task UI that your workers use to complete tasks. You can copy your template directly to the notebooks, which provides logic to write the template to Amazon S3, or you can add the template to your S3 bucket and record the template Amazon S3 URI. For more information about sample templates, see Built-in Task Types. For more information about custom labeling workflows, see Step 2: Creating your custom labeling task template.
  • A list of label categories. The notebooks use this list to create a label category configuration file and upload it to Amazon S3. When you use default mode in the notebooks, this list is provided.
  • If you’re not using the notebooks, you need two Lambda functions to pre-process your input data (PreHumanTaskLambdaArn) and output data (AnnotationConsolidationLambdaArn). If you use one of the built-in task types, Ground Truth provides these functions.

Launching a notebook instance and setting up the demo notebook

To use the notebooks, you can launch an Amazon SageMaker notebook instance. For more information, see Create a Notebook Instance. When your notebook instance is active, complete the following steps to use the notebooks:

  1. On the Amazon SageMaker console in Notebook instances, locate your notebook instance.
  2. Choose Open Jupyter or Open Jupyter Lab.
  3. In Jupyter, choose the SageMaker Examples In Jupyter Lab, choose the Amazon SageMaker icon to see a list of example notebooks.
  4. In the Ground Truth Labeling Jobs section, select one of the following notebooks to use alongside this post. In Jupyter, choose Use next to a notebook to start using it. In Jupyter Lab, select the notebook, then choose Create Copy.
    1. ground_truth_create_streaming_labeling_job.ipynb
    2. ground_truth_create_chained_streaming_labeling_job.ipynb

Launching a streaming job

Streaming labeling jobs are created using the same API operation, CreateLabelingJob, as non-streaming labeling jobs. To create a streaming labeling job, you specify an input topic as your input data source, and an output topic as your output data source. New data objects are continuously sent to your labeling job through the input topic, and output data is sent to the output topic as soon as workers complete labeling tasks. You can configure your output topic to send a notification or trigger an event any time output data is received.

When you create a streaming labeling job, the input manifest file is optional.

You can use the Amazon SNS API operation CreateTopic to create your input and output topics, or you can use the Amazon SNS console. The response to a successful request to CreateTopic includes the topic ARN. You use the topic ARNs of your input and output topics in CreateLabelingJob in the parameters.

If the name of the topic contains GroundTruth (not case-sensitive) or SageMaker (not case-sensitive), the policy AmazonSageMakerGroundTruthExecution grants sufficient permissions to publish messages to your labeling job. If not, make sure to grant your IAM role permission to perform the actions sns:Publish and sns:Subscribe for your SNS topics.

Creating an SNS topic using the Amazon Python (Boto3) SDK

The notebook ground_truth_create_streaming_labeling_job.ipynb creates SNS topics using the AWS Python (Boto3) SDK. In the following code, replace LABELING_JOB_NAME with the name of the labeling job:

sns = boto3.client('sns')

# Create Input Topic
input_response = sns.create_topic(Name= LABELING_JOB_NAME + '-Input')
INPUT_SNS_TOPIC_ARN = input_response['TopicArn']

# Create Output Topic
output_response = sns.create_topic(Name= LABELING_JOB_NAME + '-Output')
OUTPUT_SNS_TOPIC_ARN = output_response['TopicArn']

Creating an SNS topic on the Amazon SNS console

To create an SNS topic on the Amazon SNS console, complete the following steps:

  1. On the Amazon SNS console, choose Topics.
  2. Choose Create topic.

  1. For Name, enter a name.
  2. For Display name, enter an optional display name.
  3. If required, add additional configurations for your topic, such as Encryption, Access policy, Delivery retry policy, Delivery status logging, and

After the topics are created, feed the input topic ARN to LabelingJobSnsDataSource.SnsTopicArn and the output topic ARN to OutputConfig.SnsTopicArn.

Creating a streaming labeling job using CreateLabelingJob

You must create Ground Truth streaming labeling jobs with the Amazon SageMaker API operation CreateLabelingJob.

The ground_truth_create_streaming_labeling_job.ipynb notebook walks you through creating the resources required and configuring the request.

If you’re not using this notebook, use an AWS SDK supported by CreateLabelingJob. For more information about using an API request to create a streaming labeling job, see Example: Use SageMaker API To Create Streaming Labeling Job. If you’re a new user of Ground Truth, it’s recommended that you use one of the image or text based built-in task types to familiarize yourself with Ground Truth streaming labeling jobs.

After you fill in the parameters of your request, submit the request to create a labeling job. Refer to the Use the CreateLabelingJob API to create a streaming labeling job section in the ground_truth_create_streaming_labeling_job.ipynb notebook. You can also use the AWS Command Line Interface (AWS CLI) or AWS SDK. For more information, see Example: Use SageMaker API To Create Streaming Labeling Job.

Monitoring the job

You can call DescribeLabelingJob after the job is created. Refer to the Use the DescribeLabelingJob API to describe a streaming labeling job section in the ground_truth_create_streaming_labeling_job.ipynb notebook.

Make sure the LabelingJobStatus is InProgress before feeding objects via the SNS channel. The following code is an example of how you can use DescribeLabelingJob (using the AWS Python (Boto3) SDK) to retrieve the labeling job status:

sagemaker = boto3.client('sagemaker')
sagemaker.describe_labeling_job(LabelingJobName=LABELING_JOB_NAME)['LabelingJobStatus']

If you specified the optional field S3DataSource.ManifestS3Uri in the CreateLabelingJob request, the objects in the Amazon S3 file are automatically sent to workers as soon as the labeling job starts. The LabelCounters element of the response to your DescribeLabelingJob request shows these objects as Unlabeled initially, and then HumanLabeled after they have been annotated and workers have submitted their work.

Amazon SQS offers a secure, durable, and available hosted queue. Streaming labeling jobs create an SQS queue in your account. You can check for the queue by the name GroundTruth-LABELING_JOB_NAME. The following code is an example of how you can use GetQueueUrl (using the AWS Python (Boto3) SDK) to retrieve the labeling job status:

sqs = boto3.client('sqs')
response = sqs.get_queue_url(QueueName='GroundTruth-' + LABELING_JOB_NAME.lower())

Sending objects to an ongoing job

After your labeling job has started, data objects can be fed to it through the console or the Amazon SNS API. For more information, see Send Data Objects Using Amazon SNS. The format of the SNS message that you use to send a data object to your labeling job is the same as the augmented manifest format.

For example, to send a new image object to an image classification labeling job, your message may look similar to the following:

{"source-ref": "s3://awsexamplebucket/example-image.jpg"}

If you create a text-based labeling job, your request may look similar to the following:

{"source": "Lorem ipsum dolor sit amet"}

Publishing a request on the Amazon SNS console

To publish a request to your labeling job on the Amazon SNS console, complete the following steps:

  1. On the Amazon SNS console, choose Topics.
  2. Choose your input topic.
  3. Choose Publish message.

Publishing a request using the Publish API operation

You can use the Amazon SNS API operation Publish to send a request to label a data object to your streaming labeling job via a supported AWS SDK.

The notebook demonstrates how to publish a message using this operation.

The following code is an example of how you can use the AWS Python (Boto3) SDK to send a request to Publish. Replace INPUT_TOPIC_ARN with the ARN of your input topic, and replace REQUEST with a request similar to the preceding examples.

sns = boto3.client('sns')
published_message = sns.publish(TopicArn=INPUT_TOPIC_ARN,Message=REQUEST)

After you publish a request, a call to DescribeLabelingJob shows Unlabeled incremented by 1:

"LabelCounters" : {
    'TotalLabeled': 0, 
    'HumanLabeled': 0, 
    'MachineLabeled': 0,  
    'FailedNonRetryableError': 0,  
    'Unlabeled': 1
}

Previewing the worker task

If you used a private workforce and made yourself a worker on the work team used to create the labeling job, you can navigate to your worker portal to preview the worker task. You can find the worker portal link in the Labeling workforces page on the Ground Truth console (on the Amazon SageMaker console) in the Region you used to launch the labeling job. This link is also included in the welcome email sent to you when you were added to the work team.

When a worker submits a data object after labeling it, it is sent to your output topic. Additionally, the results are periodically added to the S3 output bucket you specified when you created your labeling job in S3OutputPath.

Stopping the labeling job

You can use streaming labeling jobs in long-running workflows, and they run until you stop them. This allows you to continuously feed objects to the labeling job.

However, if the system detects no objects are available in the system to be labeled and is idle continuously for more than a certain number of days, GroundTruth attempts to stop the job. For more information, see Stopping Streaming Jobs.

You can stop your labeling job on the Ground Truth console or using the Ground Truth API operation StopLabelingJob. To use the console, complete the following steps:

  1. On the Amazon SageMaker console, choose Ground Truth. Be sure to use the Region you used to launch your labeling job.
  2. Select the labeling job you want to stop.
  3. From the Actions drop-down menu, choose Stop job.

The final cells in the notebook demonstrate how you can stop a labeling job using the AWS Python (Boto3) SDK:

sagemaker = boto3.client('sagemaker')
sagemaker.stop_labeling_job(LabelingJobName=LABELING_JOB_NAME)

When a labeling job has been successfully stopped, its status shows as Stopped.

Other Features of Streaming Labeling Jobs

The following sections cover additional features of streaming labeling jobs: sending objects to your labeling job by dropping them in an S3 bucket, and chaining multiple labeling jobs together.

Sending data objects to your labeling job using an S3 bucket

You can set up your S3 buckets to automatically publish data labeling requests to your SNS input topic any time a data object is added to the bucket. With this setup, you can drop objects into the S3 bucket and they are automatically sent to your streaming labeling job.

To configure an S3 bucket to automatically send data objects to your SNS input topic, you need to add an access policy to the input topic to allow Amazon S3 to add an event to it. The following code illustrates the type of policy to attach with your topic ARN (replace SNS-topic-ARN):

{
 "Version": "2012-10-17",
 "Id": "example-ID",
 "Statement": [
  {
   "Sid": "example-statement-ID",
   "Effect": "Allow",
   "Principal": {
    "AWS":"*"  
   },
   "Action": [
    "SNS:Publish"
   ],
   "Resource": "SNS-topic-ARN",
   "Condition": {
      "ArnLike": { "aws:SourceArn": "arn:aws:s3:*:*:<bucket-name>" },
      "StringEquals": { "aws:SourceAccount": "<bucket-owner-account-id>" }
   }
  }
 ]
}

To set up your S3 bucket to send data objects to your streaming labeling job on the Amazon S3 console, complete the following steps:

  1. On the Amazon S3 console, choose the bucket that you want to use to send data objects to your labeling job.
  2. On the Properties tab, under Advanced settings, choose Events.
  3. Choose Add notification.
  4. Give your notification a name.
  5. Select All object create events.
  6. Optionally, enter a prefix if you want to drop data objects into a prefix within the S3 bucket.
  7. If you only want to send specific types of data objects to your SNS input topic, specify a suffix. For example, to ensure only image files are sent to your SNS input topic, you can enter .jpg,.png,.jpeg.
  8. From the Send to drop-down menu, choose SNS Topic.
  9. Choose the SNS input topic you used or will use to create your labeling job.
  10. Choose Save.

The following GIF demonstrates how to set up this configuration on the Amazon S3 console.

Chaining

To create sophisticated, persistent, real-time data labeling pipelines that allow you to add multiple types of annotations to data objects, audit and verify labels, and more, you can chain multiple streaming labeling jobs.

Chaining allows you to send the output of one streaming labeling job to another streaming labeling job. For example, the output data of Job 1 can be sent to Job 2 as input, the output data of Job 2 can flow to Job n-1, and the data of Job n-1 can flow to Job n in real time.

As an example use case, you could use Job 1 to add a semantic segmentation mask to a sequence of video frames. You then use Job 2 to add bounding boxes to identify and localize data objects in each frame. Finally, you use Job 3 to verify and adjust labels as needed.

To set this up, you use the output SNS topic of Job 1 as the input SNS topic of Job 2. Similarly, you use the output SNS topic of Job 2 as the input SNS topic of Job 3, and so on. The following diagram illustrates this architecture.

After you set up your jobs this way, a data object flowing through Job 1 makes its way to Job 2 automatically after passing through Job 1. The following are some possibilities for chaining with two jobs:

  • Specify different label attribute names for jobs with a similar task type. For example, Job 1 (label data) chains to Job 2 (adjust, review, and verify annotations from Job 1).
  • Use different label attribute names for jobs with different task types. For example, Job 1 (labeling for image classification) chains to Job 2 (labeling for object detection).
  • Use the same label attribute names for both jobs. For example, Job 1 (labeling) chains to Job 2 (partial labeled data of Job 1 flows to Job 2).

You can use the notebook ground_truth_create_chained_streaming_labeling_job.ipynb to learn how to chain two streaming labeling jobs. This example demonstrates the first use case in the preceding list (different label attribute names for jobs with similar task types). When used on default mode, this notebook chains a bounding box (object detection) job (Job 1) to a bounding box audit job. Any bounding boxes annotated in Job 1 can be adjusted in Job 2 in real time. You can generalize this use case to set up quality-check workflows, in which a work team reviews the annotations of another work team.

You can also use the notebook to set up any kind of chained streaming jobs to achieve multiple job-chaining configurations.

Conclusion

This post covers the benefits of using Ground Truth streaming feature and how to create and chain streaming labeling jobs. This post merely scratches the surface of what Ground Truth streaming can do.

To get started, use one of the notebooks included in this post to launch and experiment with streaming labeling jobs, or see Create a Streaming Labeling Job.

Let us know what you think in the comments.

 


About the Authors

Priyanka Gopalakrishna is a software engineer at Amazon AI. Her focus is on solving labeling problems using machine learning and building scalable AI solutions using distributed systems.

 

 

 

 

Talia Chopra is a Technical Writer in AWS specializing in machine learning and artificial intelligence. She works with multiple teams in AWS to create technical documentation and tutorials for customers using Amazon SageMaker, MxNet, and AutoGluon.

Read More

Deploying and using the Document Understanding Solution

Deploying and using the Document Understanding Solution

Based on our day to day experience, the information we consume is entirely digital. We read the news on our mobile devices far more than we do from printed copy newspapers. Tickets for sporting events, music concerts, and airline travel are stored in apps on our phones. One could go weeks or longer without needing to have any paper currency in his or her wallet, as digital payments are ubiquitous. However, many companies across different industries still primarily operate on manual, paper-based processes. For example, healthcare payors, construction companies, and law firms deal with billions of documents and forms, making the process of finding information difficult and time-consuming. When documents are found, extracting information through manual data entry can be slow, expensive, and error prone, resulting in increases in compliance risks. Furthermore, domain experts need to identify and categorize domain-specific phrases and keywords (or entities), or use traditional Optical Character Recognition (OCR) and keyword detection software that requires manual customization. These approaches can create scrambled output and unusable results. AWS AI services such as Amazon Kendra, Amazon Textract, Amazon Comprehend, and Amazon Comprehend Medical help solve these challenges by automating data extraction and comprehension using machine learning (ML).

Overview of the Document Understanding Solution

The Document Understanding Solution (DUS) allows you to use the power of AWS AI for enterprise search, document digitization, discovery, and extraction and redaction of select information. Part of the Intelligent Document Processing services offered from AWS, this solution uses AWS artificial intelligence (AI) services to solve business problems.

Search and discovery

These challenges exist in almost every business vertical. Imagine a manufacturer that has to maintain archives of thousands – if not millions – of product and tool specifications. Without document digitization of archives, there could be massive underutilization of their highly valuable tool data and information retrieval could be complex and costly. In another example, a company in the financial industry could have 1000s of financial reports in paper format. Without a simple way to extract and digitize this data, it could take an extensive manual effort to keypunch it.

To help with these situations, DUS leverages multiple ML services, including Amazon Textract. Amazon Textract is a fully managed machine learning service that automatically extracts text and data from scanned documents that goes beyond simple optical character recognition (OCR) to identify, understand, and extract data from forms and tables. Amazon Textract will move the data from the documents to a format that can be readily searched. Next, Amazon Kendra and Amazon Elasticsearch Service (Amazon ES) are available to provide the end user search experience in DUS. Amazon Kendra is an intelligent search service powered by machine learning. Amazon Kendra uses ML to obtain better results for natural language questions, and will return an exact answer from within a document, whether that is a text snippet, FAQ, or a PDF document. In addition to Amazon Kendra, the DUS provides a rich search experience to the user through the use of Amazon Elasticsearch Service. Amazon Elasticsearch Service is a fully managed service that makes it easy for you to deploy, secure, and run Elasticsearch cost effectively at scale.

Control and compliance

In addition to search, the ability to analyze documents at scale is essential. Amazon Textract extracts text from documents, which can then be input into Amazon Comprehend or Amazon Comprehend Medical. Amazon Comprehend is a natural language processing (NLP) service that uses machine learning to find insights and relationships in text. It can identify key phrases and entities, such as places, people, and brands. Amazon Comprehend Medical is similar to Comprehend. It is a natural language processing service that makes it easy to use machine learning to extract relevant medical information from unstructured text. It can identify medical entities, such as medical conditions and medications.

Identifying these key pieces of information allows for compliance controls through redaction. For example, an insurer could use this solution to feed a workflow that automatically redacts personally identifiable information (PII) or protected health information (PHI) for their review before archiving claim forms by automatically recognizing the important key-value pairs and entities that require protection.

Other industries can also use this solution for complying with regulatory standards, such as GDPR and HIPAA. For example, this solution could be used by a law firm to redact PII, organization names or brand names. Another example includes a security agency needing to redact all vital information such as names, locations and/or dates from a case file for data security or privacy concerns.

Workflow automation

The DUS solution delivers results at scale in production workflows. Organizations can more rapidly process documents such as insurance claims and forms, and seamlessly extract tables from PDFs into CSVs to conduct additional analysis. With detection and categorization of medical entities and ICD-10-CM ontologies, medical institutes can recognize exponential savings in workforce, time, and other resources that are spent identifying and classifying patient information. All the data is stored by the solution in easily accessible formats, such as CSV and JSON files, which can be fed into downstream pipelines. Additionally, the bulk processing feature in DUS allows you to import a large number of documents directly for processing and analysis.

The following diagram illustrates the DUS architecture.


Deploying DUS

For instructions on setting up DUS, see Document Understanding Solution on AWS Solutions.

Deploying DUS sets up a web application that you can use for document understanding. The deployment includes setting up infrastructure in your AWS account and pre-loading sample documents.

Using DUS

Once you have successfully completed deploying the DUS demo, you are then provided with instructions on how to login into the application. After logging in, you are directed to the homepage, as seen below. You have three options which cover the common use-cases in document understanding solution: Discovery, Compliance, and Workflow Automation.

When you select the Discovery track you will be directed to the preloaded documents page or the Document List page. You may select one of the preloaded sample documents or upload your own document. From here, you can search for a specific document by using a phrase or keyword.

If you decide to upload your own document, choose upload your own documents above the available documents. You will then be directed to a new page to upload your own documents. This page also has sample documents from different industry verticals for you to experiment with.


Back on the Document List page, you will find some PDF and image files. Text in these documents are not actually tagged or available to use by default. However, since these documents have been processed by the solution, you will now be able to search for information within these documents. If you decide to search for a specific phrase or keyword in the search bar, then the solution will analyze the text it has extracted from the documents and provide you with search results. The search results can be displayed in three different ways; a comparative view of Amazon ES (traditional search) and Amazon Kendra (semantic search), just Amazon ES or just Amazon Kendra .

For Amazon Kendra results, you also have the option to provide feedback by either up-voting or down-voting an Amazon Kendra suggested answer.

Amazon Kendra also supports filtering based on user context. Under the Amazon Kendra results view, you can filter results based on the users for the preloaded documents. Click the Filter button to the right of the Amazon Kendra Results title. You can then select a persona and one of the suggested questions to display filtered results. Amazon Kendra will then rank results based on the selected persona. You can toggle between the various personas to compare how the results differ. For demonstration purposes, the Document Understanding Solution comes with preloaded documents and personas from the medical industry. You will be able to notice that based on the question and persona selected, results are ranked differently creating a more targeted search experience for the user.

From the Document List search results view, you can select a document that you want to further explore. This will direct you to the Document Details page. See the following image.

The following image shows the tool bar above the search bar, where you can choose to see different types of information from the document.


The tabs have the following functions:

  • Preview – Under this tab, you are able to view the original document as well as download a searchable PDF version of the document. This helps users to convert their documents – be it images or PDFs into easily searchable PDF files.
  • Raw Text – Under this tab, you can access all the text identified in the file.
  • Key-Value Pairs – Under this tab, key-value pairs from the document are highlighted. In this process, all forms in the document are identified and stored in a key-value pair format. If desired, you can download a CSV file of the key-value pairs. This is especially useful for organizations that have structured data and would want to automate their data extraction and storage workflows. For example, organizations that have a lot of forms like job applications or medical patient forms.
  • Tables – Under this tab, you can view all the tables identified in the document. Like the key-value pairs, you can download the tables in the CSV format. Companies dealing with balance sheets or with invoices would find this feature extremely useful since it allows users to easily convert tables, images and PDFs into CSV files which can then be used for further analysis.
  • Entities and Medical Entities – Under these tabs, you can find the general and medical entities in the document respectively. These entities include persons, locations, dates, PHI and medical information which helps organization to easily identify and extract critical medical data in a document.

For exploring redaction controls, choose the Compliance option on the toolbar. Here you can choose to redact information like key-value pairs, entities, medical entities or even keyword matches by switching to the respective tabs on the tool bar and choosing Redact. One example of how this feature may be useful is to consider a clinic that wants to redact PHI information before they decide to share medical records. Another example is an organization that wants to redact specific information identified as keylue pairs in forms present in their documents. As seen in the following image, you can redact information, download the redacted document and even clear redactions after use.

In terms of Workflow Automation, the Document Understanding Solution also provides some input and output capabilities via the AWS Console which makes it easier to integrate DUS into an existing pipeline. DUS supports a bulk document processing mode, in which you can simply input documents into an Amazon Simple Storage Service (Amazon S3) bucket which will be asynchronously analyzed and made available in the application. More information on bulk processing is available on the AWS Solutions Implementation Guide. Results from the different AWS AI services are all stored within Amazon S3 buckets and the corresponding metadata is available in Amazon DynamoDB tables. This helps users of the solution to build downstream pipelines from these datastores that hold the document analysis data.

Summary:

This post reviewed how you can integrate Amazon Textract, Amazon Comprehend, Amazon Comprehend Medical, and Amazon Kendra to conduct enterprise search, document digitization, document discovery, and extraction and redaction of select information.

To access the DUS source code, see Document Understanding Solution on GitHub. This solution has been made open source so that you can extend and incorporate the solution into your AWS workflows.


About the Authors

Simran Baxendale is a Program Manager in the Amazon Machine Learning Solutions Lab. She helps define, coordinate and execute program strategy for the demos applications team.

 

 

 

 

Curtis Bray is a manager in the Amazon Machine Learning Solutions Lab. He leads the demos applications team that focuses on building use case based demos that show customers how to unlock the power of AWS AI/ML services to solve real world business problems.

 

 

 

 

Alex Chirayath is an SDE in the Amazon Machine Learning Solutions Lab. He helps customers adopt AWS AI services by building solutions to address common business problems.

Read More

Training and serving H2O models using Amazon SageMaker

Training and serving H2O models using Amazon SageMaker

Model training and serving steps are two essential pieces of a successful end-to-end machine learning (ML) pipeline. These two steps often require different software and hardware setups to provide the best mix for a production environment. Model training is optimized for a low-cost, feasible total run duration, scientific flexibility, and model interpretability objectives, whereas model serving is optimized for low cost, high throughput, and low latency objectives.

Therefore, a wide-spread approach is to train a model with a popular data science language like Python or R, and create model artifact formats such as Model Object, Optimized (MOJO), Predictive Model Markup Language (PMML) or Open Neural Network Exchange (ONNX) and serve the model on a microservice (e.g., Spring Boot application) based on Open Java Development Kit (OpenJDK).

This post demonstrates how to implement this approach end-to-end using Amazon SageMaker for the popular open-source ML framework H2O. Amazon SageMaker is a fully managed service that provides every developer and data scientist the ability to build, train, and deploy ML models quickly. Amazon SageMaker is a versatile ML service, which allows you to use ML frameworks and programming languages of your choice. H2O was founded by H2O.ai, an AWS Partner Network (APN) Advanced Partner. You can choose from a wide range of options to train and deploy H2O models on the AWS Cloud, and H2O provides some design pattern examples to productionize H2O ML pipelines.

The H2O framework supports three type of model artifacts, as summarized in the following table.

Dimension Binary Models Plain Old Java Object (POJO) Model Object, Optimized (MOJO)
Definition The H2O binary model is intended for non-production ML experimentation with the features supported by a specific H2O version. A POJO is an ordinary Java object, not bounded by any special restriction. It’s a way to export a model built in H2O and implement it in a Java application. A MOJO is also a Java object, but the model tree is out of this object, because it has a generic tree-walker code to navigate the model. This allows model artifacts to be much smaller.
Use case Intended for interactive ML experimentation. Suitable for production usage. Suitable for production usage
Deployment Restrictions The model hosting image should run an H2O cluster and the same h2o version as the binary model. 1 GB maximum model artifact file size restriction for H2O. No size restriction for H2O.
Inference Performance High latency (up to a few seconds)—not recommended for production. Only slightly faster than MOJOs for binomial and regression models. Latency is typically in single-digit milliseconds. Significant inference efficiency gains over POJOs for multi-nominal and large models. Latency is typically in single-digit milliseconds.

During my trials, I explored some of the design patterns that Amazon SageMaker manages end to end, summarized in the following table.

ID Design Pattern Advantages Disadvantages
A

Train and deploy the model with the Amazon SageMaker Marketplace algorithm offered by H2O.ai

 

No effort is required to create any custom container and Amazon SageMaker algorithm resource. An older version of the h2o Python library is available. All other disadvantages in option B also apply to this option.
B

Train using a custom container with h2o Python library. Export the model artifact as H2O binary model format. Serve the model using a custom container running a Flask application and running inference by h2o Python library.

 

It’s possible to use any version of the h2o Python library. H2O binary model inference latency is significantly higher than MOJO artifacts. It’s prone to failures due to h2o Python library version incompatibility.
C

Train using a custom container with the h2o Python library. Export the model in MOJO format. Serve the model using a custom container running a Flask application and running inference by pyH2oMojo.

 

Because MOJO model format is supported, the model inference latency is lower than option B and it’s possible to use any version of the h2o Python library. Using pyH2oMojo has a higher latency and it’s prone to failures due to weak support for continuously evolving H2O versions.
D Train using a custom container with the h2o Python library. Export the model in MOJO format. Serve the model using a custom container based on Amazon Corretto running a Spring Boot application and h2o-genmodel Java library. It’s possible to use any version of h2o Python library and h2o-genmodel libraries. It offers the lowest model inference latency. The majority of data scientists prefer using only scripting languages.

It’s possible to add a few more options to the preceding list, especially if you want to run distributed training with Sparkling Water. After testing all these alternatives, I have concluded that design pattern D is the most suitable option for a wide range of use cases to productionize H2O. Design pattern D is built by a custom model training container with the h2o Python library and a custom model inference container with Spring Boot application and h2o-genmodel Java library. This post shows how to build an ML workflow based on this design pattern in the subsequent sections.

Problem and dataset

You can use the Titanic Passenger Survival dataset, which is publicly available thanks to Kaggle and encyclopedia-titanica, to build a predictive model that answers what kind of people are more likely to survive in a catastrophic shipwreck. It uses 11 independent variables such as age, gender, and passenger class to predict the binary classification target variable Survived. For this post, we split the original training dataset 80%/20% to create train.csv and validation.csv input files. The datasets are located under the /examples directory of the parent repository. This dataset requires features preprocessing operations like data imputation of null values for the Age feature and string indexing for Sex and Embarked features to train a model using the Gradient Boosting Machines (GBM) algorithm using the H2O framework.

Overview of solution

The solution in this post offers an ML training and deployment process orchestrated by AWS Step Functions and implemented with Amazon SageMaker. The following diagram illustrates the workflow.

This workflow is developed using a JSON-based language called Amazon State Language (ASL). The Step Functions API provides service integrations to Amazon SageMaker, child workflows, and other services.

Two Amazon Elastic Container Registry (Amazon ECR) images contain the code mentioned in design pattern D:

  • h2o-gbm-trainer – H2O model training Docker image running a Python application
  • h2o-gbm-predictor – H2O model inference Docker image running a Spring Boot application

The creation of a manifest.json file in an Amazon Simple Storage Service (Amazon S3) bucket initiates an event notification, which starts the pipeline. This file can be generated by a prior data preparation job, which creates the training and validation datasets during a periodical production run. Uploading this file triggers an AWS Lambda function, which collects the ML workflow run duration configurations from the manifest.json file and AWS Systems Manager Parameter Store and starts the ML workflow.

Prerequisites

Make sure that you complete all the prerequisites before starting deployment. Deploying and running this workflow involves two types of dependencies:

Deploying the ML workflow infrastructure

The infrastructure required for this post is created with an AWS CloudFormation template compliant to AWS Serverless Application Model (AWS SAM), which simplifies how to define functions, state machines, and APIs for serverless applications. I calculated the cost for a test run is less than $1 in the eu-central-1 Region. For installation instructions, see Installation.

The deployment takes approximately 2 minutes. When it’s complete, the status switches to CREATE_COMPLETE for all stacks.

The nested stacks create three serverless applications:

Creating a model training Docker image

Amazon SageMaker launches this Docker image on Amazon SageMaker training instances in the runtime. It’s a slightly modified version of the open-sourced Docker image repository by our partner H2O.AI, which extends the Amazon Linux 2 Docker image. Only the training code and its required dependencies are preserved; the H2O version is upgraded and a functionality to export MOJO model artifacts is added.

Navigate to h2o-gbm-trainer repository in your command line. Optionally, you can test it in your local PC. Build and deploy the model training Docker image to Amazon ECR using the installation command.

Creating a model inference Docker image

Amazon SageMaker launches this Docker image on Amazon SageMaker model endpoint instances in the runtime. The Amazon Corretto Docker Image (amazoncorretto:8) is extended to provide dependencies with Amazon Linux 2 Docker image and Java settings required to launch a Spring Boot application.

Depending on an open-source distribution of OpenJDK has several drawbacks, such as backward incompatibility between minor releases, delays in bug fixing, security vulnerabilities like backports, and suboptimal performance for a production service. Therefore, I used Amazon Corretto, which is a no-cost, multiplatform, secure, production-ready downstream distribution of the OpenJDK. In addition, Corretto offers performance improvements (AWS Online Tech Talk) with respect to OpenJDK (openjdk:8-alpine), which are observable during the Spring Boot application launch and model inference latency. The Spring Boot framework is preferred to build the model hosting application for the following reasons:

  • It’s easy to build a standalone production-grade microservice
  • It requires minimal Spring configuration and easy deployment
  • It’s easy to build RESTful web services
  • It scales the system resource utilization according to the intensity of the model invocations

The following image is the class diagram of the Spring Boot application created for the H2O GBM model predictor.

SagemakerController class is an entry point of this Spring Boot Java application, launched by SagemakerLauncher class in the model inference Docker image. SagemakerController class initializes the service in init() method by loading the H2O MOJO model artifact from Amazon S3 with H2O settings to impute the missing model scoring input features and loading a predictor object.

SagemakerController class also provides the /ping and /invocations REST API interfaces required by Amazon SageMaker, which are called by asynchronous and concurrent HTTPS requests to Amazon SageMaker model endpoint instances in the runtime. Amazon SageMaker reserves the /ping path for health checks during the model endpoint deployment. The /invocations path is mapped to the invoke() method, which forwards the incoming model invocation requests to the predict() method of the predictor object asynchronously. This predict() method uses Amazon SageMaker instance resources dedicated to the model inference Docker image efficiently thanks to its non-blocking asynchronous and concurrent calls.

Navigate to the h2o-gbm-predictor repository in your command line. Optionally, you can test it in your local PC. Build and deploy the model inference Docker image to Amazon ECR using the installation command.

Creating a custom Amazon SageMaker algorithm resource

After publishing the model training and inference Docker images on Amazon ECR, it’s time to create an Amazon SageMaker algorithm resource called h2o-gbm-algorithm. As displayed in the following diagram, an Amazon SageMaker algorithm resource contains training and inference Docker image URIs, Amazon SageMaker instance types, input channels, supported hyperparameters, and algorithm evaluation metrics.

Navigate to the h2o-gbm-algorithm-resource repository in your command line. Then run the installation command to create your algorithm resource.

After a few seconds, an algorithm resource is created.

Because all the required infrastructure components are now deployed, it’s time to run the ML pipeline to train and deploy H2O models.

Running the ML workflow

To start running your workflow, complete the following steps:

  1. Upload the train.csv and validation.csv files to their dedicated directories in the <s3bucket> bucket (replace <s3bucket> with the S3 bucket name in the manifest.json file):
aws s3 cp examples/train.csv s3://<s3bucket>/titanic/training/
aws s3 cp examples/validation.csv s3://<s3bucket>/titanic/validation/
  1. Upload the file under the s3://<s3bucket>/manifests directory located in the same S3 bucket specified during the ML workflow deployment:
aws s3 cp examples/manifest.json s3://<s3bucket>/manifests

As soon as the manifest.json file is uploaded to Amazon S3, Step Functions puts the ML workflow in a Running state.

Training the H2O model using Amazon SageMaker

To train your H2O model, complete the following steps:

  1. On the Step Functions console, navigate to ModelTuningWithEndpointDeploymentStateMachine to find it in Running state and observe the Model Tuning Job step.

  1. On the Amazon SageMaker console, under Training, choose Hyperparameter tuning jobs.
  2. Drill down to the tuning job in progress.

After 4 minutes, all training jobs and the model tuning job change to Completed status.

The following screenshot shows the performance and configuration details of the best training job.

  1. Navigate to the Amazon SageMaker model link to display the model definition in detail.

The following screenshot shows the detailed settings associated with the created Amazon SageMaker model resource.

Deploying the MOJO model to an auto-scaling Amazon SageMaker model endpoint

To deploy your MOJO model, complete the following steps:

  1. On the Step Functions console, navigate to ModelTuningWithEndpointDeploymentStateMachine to find it in Running state.
  2. Observe the ongoing Deploy Auto-scaling Model Endpoint step.

The following screenshot shows the Amazon SageMaker model endpoint during the deployment.

Auto-scaling model endpoint deployment takes approximately 5–6 minutes. When the endpoint is deployed, the Step Functions workflow successfully concludes.

  1. Navigate to the model endpoint that is in InService status; it’s now ready to accept incoming requests.

  1. Drill down to the model endpoint details and observe the endpoint runtime settings.

This model endpoint can scale from one to four instances, which are all behind Amazon SageMaker Runtime.

Testing the Amazon SageMaker model endpoint

For Window users, enter the following code to invoke the model endpoint:

aws sagemaker-runtime invoke-endpoint --endpoint-name survival-endpoint ^
--content-type application/jsonlines ^
--accept application/jsonlines ^
--body "{"Pclass":"3","Sex":"male","Age":"22","SibSp":"1","Parch":"0","Fare":"7.25","Embarked":"S"}"  response.json && cat response.json

For Linux and macOS users, enter the following code to invoke the model endpoint:

aws sagemaker-runtime invoke-endpoint --endpoint-name survival-endpoint 
--content-type application/jsonlines 
--accept application/jsonlines 
--body "{"Pclass":"3","Sex":"male","Age":"22","SibSp":"1","Parch":"0","Fare":"7.25","Embarked":"S"}"  response.json --cli-binary-format raw-in-base64-out && cat response.json

As displayed in the following model endpoint response, this unfortunate third-class male passenger didn’t survive (prediction is 0) according to the trained model:

{"calibratedClassProbabilities":"null","classProbabilities":"[0.686304913500942, 0.313695086499058]","prediction":"0","predictionIndex":0}

The invocation round-trip latency might be higher in the first call, but it decreases in the subsequent calls. This latency measurement from your PC to the Amazon SageMaker model endpoint also involves the network overhead of the local PC to AWS Cloud connection. To have an objective evaluation of model invocation performance, a load test based on real-life traffic expectations is essential.

Cleaning up

To stop incurring costs to your AWS account, delete the resources created in this post. For instructions, see Cleanup.

Conclusion

In this post, I explained how to use Amazon SageMaker to train and serve models for an H2O framework in a production-scale design pattern. This approach uses custom containers running a model training application built with a data science scripting language and a separate model hosting application built with a low-level language like Java, and has proven to be very robust and repeatable. You could also adapt this design pattern and its artifacts to other ML use cases.

 


About the Author

As a Machine Learning Prototyping Architect, Anil Sener builds prototypes on Machine Learning, Big Data Analytics, and Data Streaming, which accelerates the production journey on AWS for top EMEA customers. He has two masters degrees in MIS and Data Science.

 

 

Read More

This month in AWS Machine Learning: October edition

This month in AWS Machine Learning: October edition

Every day there is something new going on in the world of AWS Machine Learning—from launches to new to use cases to interactive trainings. We’re packaging some of the not-to-miss information from the ML Blog and beyond for easy perusing each month. Check back at the end of each month for the latest roundup.

Launches

This month we announced price drops for Amazon SageMaker, improvements to Amazon Personalize, and personal protective equipment (PPE) detection for Amazon Rekognition.

Use cases

Get ideas and architectures from AWS customers, partners, ML Heroes, and AWS experts on how to apply ML to your use case:

Explore more ML stories

Want more news about developments in ML? Check out the following stories:

  • How do you lead an organization through rapid change while keeping your customers at the forefront? Listen in to this Conversations with Leaders podcast where former Head of AI/ML for Zappos, Ameen Kazerouni, shares how he tackled that question, his thoughts on the core elements for approaching ML, investing in and up-skilling employees, and more.
  • The United States presidential election is days away. Dive into what candidates have said about the issues you care about using the Wall Street Journal’s recently launched Talk2020 tool. Talk2020 uses Amazon Kendra to allow you to search candidate transcripts using natural language capabilities. Start searching now and read more about how it was created.
  • This Wall Street Journal article examines the importance of ML in the advancement of healthcare. Discover how AWS customers like Livongo, Cambia Health, and Moderna are using ML to deliver higher-quality services and better patient outcomes.
  • With the increasing application of artificial intelligence and ML in sports analytics, AWS and Stats Perform partnered to bring ML-powered, real-time stats to the game of rugby, to enhance fan engagement and provide valuable insights into the game. Go the behind the scenes on the Kick Predictor, which predicts the probability of a successful penalty kick, computed in real time and broadcast live during the game.
  • With the help of Amazon ML Solutions Lab, the NFL’s Next Gen Stats team details how they developed a model to successfully predict the trajectories of defensive backs from when the pass is thrown to when the pass should arrive to the receiver.

Mark your calendars

Join us for the following exciting ML events:

  • If you missed it last month, be sure to catch up on SageMaker Fridays. Get started faster with machine learning with practical use cases and more, using Amazon SageMaker.
  • Registration is now open for re:Invent 2020. Don’t miss the machine learning keynote on December 8!

 


About the Author

Laura Jones is a product marketing lead for AWS AI/ML where she focuses on sharing the stories of AWS’s customers and educating organizations on the impact of machine learning. As a Florida native living and surviving in rainy Seattle, she enjoys coffee, attempting to ski and enjoying the great outdoors.

Read More

Introducing the COVID-19 Simulator and Machine Learning Toolkit for Predicting COVID-19 Spread

Introducing the COVID-19 Simulator and Machine Learning Toolkit for Predicting COVID-19 Spread

There have been breakthroughs in understanding COVID-19, such as how soon an exposed person will develop symptoms and how many people on average will contract the disease after contact with an exposed individual. The wider research community is actively working on accurately predicting the percent population who are exposed, recovered, or have built immunity. Researchers currently build epidemiology models and simulators using available data from agencies and institutions, as well as historical data from similar diseases such as influenza, SARS, and MERS. It’s an uphill task for any model to accurately capture all the complexities of the real world. Challenges in building these models include learning parameters that influence variations in disease spread across multiple countries or populations, being able to combine various intervention strategies (such as school closures and stay-at-home orders), and running what-if scenarios by incorporating trends from diseases similar to COVID-19. COVID-19 remains a relatively unknown disease with no historic data to predict trends.

We are now open-sourcing a toolset for researchers and data scientists to better model and understand the progression of COVID-19 in a given community over time. This toolset is comprised of a disease progression simulator and several machine learning (ML) models to test the impact of various interventions. First, the ML models help bootstrap the system by estimating the disease progression and comparing the outcomes to historical data. Next, you can run the simulator with learned parameters to play out what-if scenarios for various interventions. In the following diagram, we illustrate the interactions among the extensible building blocks in the toolset.

In this post, we describe in detail how our disease simulation works, how simulation parameters are learned using supervised learning, and predict the incidence of disease given an intervention score.

Historical trends for infectious diseases

We provide several notebooks in our open-source toolset to run what-if scenarios at the state level in the US, India, and countries in Europe. In these notebooks, we use various data sources that frequently publish the number of new cases. For example, for the US, we use the Delphi Epidata API from Carnegie Mellon University (CMU) to access various datasets, including but not limited to the Johns Hopkins Center for Systems Science and Engineering (JHU-CSSE), survey trends from Google search and Facebook, and historical data for H1N1 in 2009–2010.

We can use our notebook, covid19_data_exploration.ipynb, to overlay historical data from previous pandemics with COVID-19. For example, the following graphs compare COVID-19 to seasonal flu and the H1N1 pandemic in California, Texas, and Illinois.

The first graph shows the 7-day average of the number of incidences in California during seasonal flu, H1N1, and COVID-19.

Although COVID-19 cases peaked in summer for most states in the US, there are exceptions. In Illinois, the most cases occurred early in the year, similar to the H1N1 peak in spring.

On the other hand, in other states such as Texas, we observe a potential peak aligning with the H1N1 peak in fall.

The trends differ greatly across states and countries. Therefore, we provide notebooks that enable you to run what-if scenarios by learning from existing data and projecting into the future using anticipated peaks. 

Results from running what-if scenarios

The notebook covid19_simulator.ipynb has a comprehensive list of regions and countries across the world to run what-if scenarios. In this section, we discuss various what-if scenarios for France, Italy, the US, and Maharashtra, India. First, we use ML to predict the disease trends, including peaks and waves based on parameters specified by the user (for example, we use 3 months of COVID-19 case data to bootstrap and follow the H1N1 trend or create a second or third wave after 6 months from the first wave). Next, we play out intervention scenarios such as mild intervention or strict intervention and discuss the results.

France

For France, we considered the what-if scenario of having a stricter intervention and a second wave in 6 months but with a higher peak than the first wave, as expected in H1N1-like trends. The following graphs compare the daily number of cases and cumulative number of cases. Our projection in this scenario (orange line) closely matches with the actual curve (blue line).

Italy

For Italy, we consider the scenario of having a mild intervention policy versus a stricter intervention policy and a second wave in 6 months with a higher peak, as expected in H1N1-like trends. The first set of graphs shows the number of daily cases and total cases with a mild intervention policy.

The following graphs compare daily and total case numbers with a stricter intervention policy.

During the first wave, the milder intervention projection initially matches better, and stricter interventions result in a decline. Therefore, in this what-if scenario, we can see how our model captures varying interventions. However, although the trend with the second wave matches our predictions, using the assumption of a second wave in 6 months doesn’t align with the new trend. Therefore, the best projection for Italy’s what-if scenario should shift the second wave to where we usually expect H1N1 would have been—specifically, in fall.

US

For our US scenario, we considered having another wave in 3 months while keeping a stricter level of interventions. Overall, the US aligns better with stricter intervention scores based on the invention scoring mechanism provided by the Oxford Coronavirus Government Response Tracker, which we use for all the examples in this post and notebook. In this what-if scenario, we start observing a trend for another wave that aligns with H1N1-like trends in fall.

Maharashtra, India

Our Maharashtra, India, scenario considers having another wave in 6 months while keeping the same level of intervention policies. The graph on the left shows the actual (blue) and estimated (orange) number of cases. The graph the right is the cumulative number of cases. In this scenario, we can see the impact of experiencing a second wave is similar to the first one.

Disease simulator

We model the disease progression for each individual in a population using a finite state machine, and then report out the aggregate state of the population. We assign a probability distribution to the disease parameters for each individual, parametrized by a mean, standard deviation, and lower and upper limits. For example, you can set parameters such as individuals will develop symptoms within 2–5 days after exposure, with the majority of the population developing symptoms in 2–3 days. Similarly, you can set parameters for the recovery period, such as within 14–21 days after exposure. The stochasticity allows for variation in the population at the individual level to mimic real-world scenarios.

Our finite state machine is similar to the simulation model in COVID-19 Projections Using Machine Learning, with additional states for infection transmission by asymptomatic individuals, as shown in the following diagram. The default state machine is extensible in the sense that you can add any disease progression state to the model as long as the state transitions are well-defined from and to the new state. For example, you can add the state for having tested positive.

Our disease simulation can also capture population dynamics. The transition from one state to the next for an individual is influenced by the states of the others in the population. For example, a person transitions from a Susceptible to Exposed state based on factors such as whether the person is vulnerable due to pre-exiting health issues or interventions such as social distancing. Theoretically, our simulation model iterates each individual’s state within an automata network [4]. The state transition probabilities are driven by two types of factors:

  • Individual, disease-specific factors – The probability densities assigned to the individual on how soon the symptoms will appear dictates transitioning from an Exposed state to Onset of systems
  • Population, transmission-specific factors – The probability to transition from Susceptible to Exposed is higher for an individual with a larger social network or exposure to infected individuals

Learning simulation parameters

The simulation has different types of parameters. Some of these parameters are known or discovered by researchers and scientists, such as the number of days prior to the onset of symptoms. Other parameters such as transmission rate, which varies greatly among populations and rapidly over time, can be learned from actual data published by agencies. The following are three core simulation parameters learned by ML methods:

  • Transmission rate – Transmission rate can be derived either directly from the recent case counts of the target location or as an expected value from the transmission rates of the countries matching the transmission rate pattern of the target location. As most of the regions (country / state / county) have now surpassed the peak of the 1st wave or already entered the 2nd wave, the transmission rate can be measured more reliably from respective country’s daily confirmed-cases data itself.
  • Time (weeks) to reach the peak of the first wave of infection – This parameter can be learnt from the countries with matching transmission rate patterns. For those regions that are now beyond the peak of the 1st wave, this parameter can be captured by a sliding window analysis of the daily confirmed cases curve. In absence of sufficient matching countries, you can use a configurable range, such as 1–5 weeks.
  • Transmission control – This parameter is learnt from a configurable range by reducing the simulation error for a validation timeframe with known case counts. 100% interventions can not prevent 100% transmission. Interventions tend to control only a fraction of the overall transmission scope. This parameter is intended to represent that fraction and can vary a lot across regions. It is learnt by fitting the wave-1 data against values from a range (e.g. 0.1 to 1) through iterative trial simulations.

Learning intervention scores

We score intervention effectiveness in the following three ways:

  • Fitting score stringency index – Fits the daily intervention scores in a regression model with the OXCGRT-provided stringency index (score_stringency_idx) as the dependent variable. Subsequently, it extracts the intervention effectiveness scores as the ensemble-based regression model’s feature importance.
  • Fitting confirmed case counts – Fits the daily intervention scores in a regression model with the changes in confirmed case counts (moving average) as the dependent variable. Feature importance scores indicate intervention effectiveness.
  • Observing case count variations – Measures the changes in total case count by turning off the interventions one by one. Scores the interventions in proportion to the respective changes resulting from it being turned off.

Finally, these three scores can be combined using a configurable weighted average. Although these approaches would be affected by the co-occurrences and correlations among the interventions, as a whole, it can represent approximate relative effectiveness scores of the interventions.

Limitations of our toolset

Our toolset has the following limitations:

  • Our disease model expects multiple waves of infections following Gaussian distribution, a pattern quite evident in past influenza pandemics [2].
  • Our disease model doesn’t include death rates; however, it’s relatively straightforward to extend the finite state machine.
  • Country-level intervention scores, to some extent, could be applicable at lower levels, like states, and thus have been used accordingly. However, a more accurate approach would be to gather and use the intervention scores at respective regional levels.
  • The population size needs to be large enough due to underlying probabilistic components, such as 1,000 or more individuals.
  • Although we assumed that on average one person can infect three people, based on a recent study published in Oxford Academic [cite], we anticipate this number to vary across populations. Therefore, it’s a configurable parameter in our base. Several studies indicate that individuals who don’t exhibit symptoms are the largest transmitters.
  • Our model is designed to simulate the first 2 waves of the infection in our code repository and additional waves can be added as required.

Toolset architecture deep dive

In this section, we dive deep into the five main components of the toolset architecture.

  • Bootstrapping – This block exposes the configurable parameters. The configurable parameters can be adjusted between 0.1 to 1.0. Similarly, wave1_weeks was initially 2 weeks; now its range is 1–5 weeks.
  • Infection Wave(s) Analysis: We do a sliding window analysis of the smoothened daily confirmed cases data to detect the starting point and the peak of the 1st and 2nd. This information is subsequently used to infer the parameters of the underlying probability distributions that work at the core of the simulator.
  • Intervention effectiveness scorer – We use supervised learning to estimate an intervention effectiveness score for a population using the research data from OXCGRT [1] or similar sources. Then we create a weighted average score.
  • Optimizer – The optimization model iteratively varies the parameters to be learned and reduces the error in predicting the incidence rate based on the historical data.
  • Predictions – After the simulation parameters are learned for a specific country or population, we can use the intervention effectiveness scores (the what-if scenario of a given disease progression pattern over time, such as the second peak in June) to run our simulation to predict the relative impact of an intervention in future.

Toolset inputs and outputs

The inputs are as follows:

  • Daily infection counts from over 60 countries
  • Daily country-level rating of over 10 interventions, such as stay-at-home orders or school closures
  • A disease incidence pattern with peaks and their timing and duration
  • Simulation duration

Our output is the incidence rate over the course of the simulation.

Summary

Our open-source code simulates COVID-19 case projections at various regional granularity levels. The output is the projection of the total confirmed cases over a specific timeline for a target state or a country, for a given degree of intervention.

Our solution first tries to understand the approximate time to peak and expected case rates of the daily COVID-19 cases for the target entity (state/country) by analysis of the disease incidence patterns. Next, it selects the best (optimal) parameters using optimization techniques on a simulation model. Finally, it generates the projections of daily and cumulative confirmed cases, starting from the beginning of the outbreak uptil a specified length of time in the future.

To get started, we have provided a few sample simulations at state and country levels in the covid19_simulator.ipynb notebook in https://github.com/aws-samples/covid19-simulation, which you can run on Amazon SageMaker or a local environment.

 

References

[1] Oxford Coronavirus Government Response Tracker https://www.bsg.ox.ac.uk/research/research-projects/coronavirus-government-response-tracker

[2] Mummert A, Weiss H, Long LP, Amigó JM, Wan XF (2013) A Perspective on Multiple Waves of Influenza Pandemics. PLOS ONE 8(4): e60343. https://doi.org/10.1371/journal.pone.0060343

[3] Viceconte, Giulio, and Nicola Petrosillo. “COVID-19 R0: Magic number or conundrum?.” Infectious disease reports vol. 12,1 8516. 24 Feb. 2020, doi:10.4081/idr.2020.8516 https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7073717/

[4] https://nyuscholars.nyu.edu/en/publications/automata-networks-and-artificial-intelligence)


FAQs

Q. What is incidence rate vs. prevalence rate?

Incidence refers to the occurrence of new cases of disease or injury in a population over a specified period of time. Incidence rate is a measure of incidence that incorporates time directly into the denominator. Prevalence differs from incidence, in that prevalence includes all cases, both new and pre-existing, in the population at the specified time, whereas incidence is limited to new cases only.

Q. How are the interventions effectiveness scored?

Interventions effectiveness can be scored in the following three ways:

  • Fitting score stringency index – Fits the daily intervention scores in a regression model with the OXCGRT-provided stringency index (score_stringency_idx) as the dependent variable. Subsequently, it extracts the intervention effectiveness scores as the ensemble-based regression model’s feature importance.
  • Fitting confirmed case counts – Fits the daily intervention scores in a regression model with the changes in confirmed case counts (moving average) as the dependent variable. Feature importance scores indicate intervention effectiveness.
  • Observing case count variations – Measures the changes in total case count by turning off the interventions one by one. Scores the interventions in proportion to the respective changes resulting from it being turned off.

Finally, these three scores can be combined using a configurable weighted average. Although these approaches would be affected by the co-occurrences and correlations among the interventions, as a whole, it can represent an approximate relative effectiveness scores of the interventions.

Q. How are we computing expected transmission rate and time to peak?

Given the latest case rate and transmission rate growth patterns of a region, the solution first identifies the countries that have exhibited similar patterns in the past and eventually plateaued. From these reference countries, it determines the time-to-peak and mean/median daily growth rates until the peak. When a considerable number (default 5) of countries found with matching patterns, then the expected transmission rate, and time-to-peak are computed as weighted averages using pattern and population similarity levels.

Q. What parameters are learned?

The solution always learns the transmission probability for the location in context (country or state) by fitting the simulation model outcome against the confirmed case counts. Optionally, it can also learn the optimal time (in weeks) to reach the peak of the infection spread. Prior to optimization, the time to reach the peak is approximated from the data of other countries having similar transmission patterns.

Q. Can the simulation work starting at any point of the disease progression timeline?

No. One of the key parameters in the solution is the recent transmission rate change (growth). If the simulation starts at a very early stage of disease progression, the transmission rate might be too low to come up with realistic future projections. Similarly, if the simulation starts at the plateau or declining phase, the transmission rate change might be negative and therefore generate incorrect projections. This solution works best on the moderate-to-high-growth phase of the disease progression.

 


About the Authors

Tomal Deb is a Data Scientist in the Amazon Machine Learning Solutions Lab. He has worked on a wide range of data science problems involving NLP, Recommender Systems, Forecasting , Numerical Optimization, etc.

 

 

 

 

Sahika Genc is a Principal Applied Scientist in the AWS AI team. Her current research focus is deep reinforcement learning (RL) for smart automation and robotics. Previously, she was a senior research scientist in the Artificial Intelligence and Learning Laboratory at the General Electric (GE) Global Research Center, where she led science teams on healthcare analytics for patient monitoring.

 

 

 

 

Sunil Mallya is a Principal Deep Learning Scientist in the AWS AI team. He leads engineering for Amazon Comprehend and enjoys solving problems in the area of NLP. In addition, Sunil also enjoys working on Reinforcement Learning and Autonomous Cars.

 

 

 

 

Atanu Roy is a Principal Deep Learning Architect in the Amazon ML Solutions Lab and leads the team for India. He spends most of his spare time and money on his solo travels.

 

 

 

 

Vinay Hanumaiah is a Deep Learning Architect at Amazon ML Solutions Lab, where he helps customers build AI and ML solutions to accelerate their business challenges. Prior to this, he contributed to the launch of AWS DeepLens and Amazon Personalize. In his spare time, he enjoys time with his family and is an avid rock climber.

 

 

 

 

Nate Slater leads the US West and APAC/Japan/China business for the Amazon Machine Learning Solutions Lab.

 

 

 

 

 

Taha A. Kass-Hout, MD, MS, is General Manager, Machine Learning & Chief Medical Officer at Amazon Web Services (AWS).

Read More