Amazon AWS – Page 301

Redacting PII from application log output with Amazon Comprehend

January 20, 2021

by Pradeep Singh Amazon AWS

Amazon Comprehend is a natural language processing (NLP) service that uses machine learning (ML) to find insights and relationships in text. The service can extract people, places, sentiments, and topics in unstructured data. You can now use Amazon Comprehend ML capabilities to detect and redact personally identifiable information (PII) in application logs, customer emails, support tickets, and more. No ML experience required. Redacting PII entities helps you protect privacy and comply with local laws and regulations.

Use case: Applications printing PII data in log output

Some applications print PII data in their log output inadvertently. In some cases, this may be due to developers forgetting to remove debug statements before deploying the application in production, and in other cases it may be due to legacy applications that are handed down and are difficult to update. PII can also get printed in stack traces. It’s generally a mistake to have PII present in such logs. Correlation IDs and primary keys are better identifiers than PII when debugging applications.

PII in application logs can quickly propagate to downstream systems, compounding security concerns. For example it may get submitted to search and analytics systems where it’s searchable and viewable by everyone. It may also be stored in object storage such as Amazon Simple Storage Service (Amazon S3) for analytics purposes. With the PII detection API of Amazon Comprehend, you can remove PII from application log output before such a log statement even gets printed.

In this post, I take the use case of a Java application that is generating log output with PII. The initial log output goes through filter-like processing that redacts PII before the log statement is output by the application. You can take a similar approach for other programming languages.

The application can be repackaged by changing its log format file, such as log4j.xml, and adding one Java class from this sample project, or adding this Java class as a dependency in the form of a .jar file.

The sample application is available in the following GitHub repo.

PII entity types

The following table lists some of the entity types Amazon Comprehend detects.

PII Entity Types	Description
EMAIL	An email address, such as marymajor@email.com.
NAME	An individual’s name. This entity type does not include titles, such as Mr., Mrs., Miss, or Dr. Amazon Comprehend does not apply this entity type to names that are part of organizations or addresses. For example, Amazon Comprehend recognizes the “John Doe Organization” as an organization, and it recognizes “Jane Doe Street” as an address.
PHONE	A phone number. This entity type also includes fax and pager numbers.
SSN	A Social Security Number (SSN) is a 9-digit number that is issued to US citizens, permanent residents, and temporary working residents. Amazon Comprehend also recognizes Social Security Numbers when only the last 4 digits are present.

For the full list, see Detect Personally Identifiable Information (PII).

The API response from Amazon Comprehend includes the entity type, its begin offset, end offset, and a confidence score. For this post, we use all of them.

Application overview

Our example application is a very simple application that simulates opening a bank account for a user. In its current form, the log output looks like the following code. We can see this by making requests to the endpoint payment:

curl localhost:8080/payment
2020-09-29T10:29:04,115 INFO [http-nio-8080-exec-1] c.e.l.c.PaymentController: Processing user User(name=Terina, ssn=626031641, email=mel.swift@Taylor.com, description=Ea minima omnis autem illo.)
2020-09-29T10:29:04,711 INFO [http-nio-8080-exec-2] c.e.l.c.PaymentController: User Napoleon, SSN 366435036, opened an account
2020-09-29T10:29:05,253 INFO [http-nio-8080-exec-4] c.e.l.c.PaymentController: User Cristen, SSN 197961488, opened an account
2020-09-29T10:29:05,673 INFO [http-nio-8080-exec-5] c.e.l.c.PaymentController: Processing user User(name=Giuseppe, ssn=713425581, email=elijah.dach@Shawnna.com, description=Impedit asperiores in magnam exercitationem.)

The output prints Name, SSN, and Email. This PII data is being generated by the java-faker library, which is a Java port of the well-known Ruby gem. See the following code:

        <dependency>
            <groupId>com.github.javafaker</groupId>
            <artifactId>javafaker</artifactId>
            <version>1.0.2</version>
        </dependency>

Log4j 2

Log4j 2 is a common Java library used for logging. Appenders in Log4j are responsible for delivering log events to their destinations, which can be console, file, and more. Log4j also has a RewriteAppender that lets you rewrite the log message before it is output. RewriteAppender works in conjunction with a RewritePolicy that provides the implementation for changing the log output.

The sample application uses the following log4j.xml file for log configuration:

<?xml version="1.0" encoding="UTF-8"?>
<!-- status="trace" -->
<Configuration packages="com.example.logging">
    <Appenders>
        <Console name="Console" target="SYSTEM_OUT">
            <PatternLayout
                    pattern="%style{%d{ISO8601}}{black} %highlight{%-5level }[%style{%t}{bright,blue}] %style{%C{1.}}{bright,yellow}: %msg%n%throwable" />
        </Console>
        <Rewrite name="Rewrite">
            <SensitiveDataPolicy
                    maskMode="MASK"
                    mask="*"
                    minScore="0.9"
                    entitiesToReplace="SSN,EMAIL"
            />
            <AppenderRef ref="Console" />
        </Rewrite>
    </Appenders>

    <Loggers>
        <!-- LOG everything at INFO level -->
        <Root level="info">
            <AppenderRef ref="Console" />
        </Root>
        <!-- LOG "com.example*" at DEBUG level -->
        <Logger name="com.example" level="debug" additivity="false">
            <AppenderRef ref="Rewrite" />
        </Logger>
    </Loggers>

</Configuration>

SensitiveDataPolicy

The Log4j RewritePolicy we created for this project is named SensitiveDataPolicy. It uses four parameters:

maskMode – This parameter has two modes:
- REPLACE – The policy replaces discovered entities with their type names. For example, in case of social security numbers, the replaced string is [SSN].
- MASK – The policy replaces the discovered entity with a string consisting of the character provided as a mask parameter.
mask – The character to use to replace the discovered entity with. Only relevant if maskMode is MASK.
minScore – The minimum confidence score acceptable to us.
entitiesToReplace – A comma-separated list of entity type names that we want to replace. For example, we’re choosing to replace social security number and email, so the string value we provide is SSN,EMAIL. Amazon Comprehend also detects NAME in our application, but it’s printed as is.

Choosing redaction vs. masking is a matter of preference. Redaction is usually preferred when the context needs to be preserved, such as in natural text, whereas masking is best for maintaining text length as well as structured data such as formatted files or key-value pairs.

Detecting PII is as simple as making an API call to Amazon Comprehend using the AWS SDK and providing the text to analyze:

        DetectPiiEntitiesRequest piiEntitiesRequest =
                DetectPiiEntitiesRequest.builder()
                        .languageCode("en")
                        .text(msg.getFormattedMessage())
                        .build();

        DetectPiiEntitiesResponse piiEntitiesResponse = comprehendClient.detectPiiEntities(piiEntitiesRequest);

Asynchronous logging

Because our policy makes synchronous calls to Amazon Comprehend for PII detection, we want this processing to happen asynchronously, outside of customer request loop, to avoid introducing latency. For instructions, see Asynchronous Loggers for Low-Latency Logging. We add the Disruptor library to our classpath by adding it to pom.xml:

        <dependency>
            <groupId>com.lmax</groupId>
            <artifactId>disruptor</artifactId>
            <version>3.4.2</version>
        </dependency>

We also need to set a system property. After we package our application with mvn package, we can run it as in the following code:

java -jar target/comprehend-logging.jar -Dlog4j2.contextSelector=org.apache.logging.log4j.core.async.AsyncLoggerContextSelector

Updated log output

The log output from this application now looks like the following. We can see that SSN and Email are being suppressed.

2020-09-29T12:52:30,423 INFO  [http-nio-8080-exec-6] ?: User Willa, SSN *********, opened an account
2020-09-29T12:52:30,824 INFO  [http-nio-8080-exec-8] ?: User Vania, SSN *********, opened an account
2020-09-29T12:52:31,245 INFO  [http-nio-8080-exec-9] ?: Processing user User(name=Laronda, ssn=*********, email=******************************, description=Doloremque culpa iure dolore omnis.)
2020-09-29T12:52:31,637 INFO  [http-nio-8080-exec-1] ?: Processing user User(name=Tommye, ssn=*********, email=*************************, description=Corporis sed tempore.)

Conclusion

We learned how to use Amazon Comprehend to redact sensitive data natively within next-generation applications. For information about applying it as a postprocessing technique for logs in storage, see Detecting and redacting PII using Amazon Comprehend. The API lets you have complete control over the entities that are important for your use case and lets you either mask or redact the information.

For more information about Amazon Comprehend availability and quotas, see Amazon Comprehend endpoints and quotas.

About the Author

Pradeep Singh is a Solutions Architect at Amazon Web Services. He helps AWS customers take advantage of AWS services to design scalable and secure applications. His expertise spans Application Architecture, Containers, Analytics and Machine Learning.

Building, automating, managing, and scaling ML workflows using Amazon SageMaker Pipelines

January 19, 2021

by Sean Morgan Amazon AWS

We recently announced Amazon SageMaker Pipelines, the first purpose-built, easy-to-use continuous integration and continuous delivery (CI/CD) service for machine learning (ML). SageMaker Pipelines is a native workflow orchestration tool for building ML pipelines that take advantage of direct Amazon SageMaker integration. Three components improve the operational resilience and reproducibility of your ML workflows: pipelines, model registry, and projects. These workflow automation components enable you to easily scale your ability to build, train, test, and deploy hundreds of models in production, iterate faster, reduce errors due to manual orchestration, and build repeatable mechanisms.

SageMaker projects introduce MLOps templates that automatically provision the underlying resources needed to enable CI/CD capabilities for your ML development lifecycle. You can use a number of built-in templates or create your own custom template. You can use SageMaker Pipelines independently to create automated workflows; however, when used in combination with SageMaker projects, the additional CI/CD capabilities are provided automatically. The following screenshot shows how the three components of SageMaker Pipelines can work together in an example SageMaker project.

This post focuses on using an MLOps template to bootstrap your ML project and establish a CI/CD pattern from sample code. We show how to use the built-in build, train, and deploy project template as a base for a customer churn classification example. This base template enables CI/CD for training ML models, registering model artifacts to the model registry, and automating model deployment with manual approval and automated testing.

MLOps template for building, training, and deploying models

We start by taking a detailed look at what AWS services are launched when this build, train, and deploy MLOps template is launched. Later, we discuss how to modify the skeleton for a custom use case.

To get started with SageMaker projects, you must first enable it on the Amazon SageMaker Studio console. This can be done for existing users or while creating new ones. For more information, see SageMaker Studio Permissions Required to Use Projects.

In SageMaker Studio, you can now choose the Projects menu on the Components and registries menu.

On the projects page, you can launch a preconfigured SageMaker MLOps template. For this post, we choose MLOps template for model building, training, and deployment.

Launching this template starts a model building pipeline by default, and while there is no cost for using SageMaker Pipelines itself, you will be charged for the services launched. Cost varies by Region. A single run of the model build pipeline in us-east-1 is estimated to cost less than $0.50. Models approved for deployment incur the cost of the SageMaker endpoints (test and production) for the Region using an ml.m5.large instance.

After the project is created from the MLOps template, the following architecture is deployed.

Included in the architecture are the following AWS services and resources:

The MLOps templates that are made available through SageMaker projects are provided via an AWS Service Catalog portfolio that automatically gets imported when a user enables projects on the Studio domain.
Two repositories are added to AWS CodeCommit:
- The first repository provides scaffolding code to create a multi-step model building pipeline including the following steps: data processing, model training, model evaluation, and conditional model registration based on accuracy. As you can see in the pipeline.py file, this pipeline trains a linear regression model using the XGBoost algorithm on the well-known UCI Abalone dataset. This repository also includes a build specification file, used by AWS CodePipeline and AWS CodeBuild to run the pipeline automatically.
- The second repository contains code and configuration files for model deployment, as well as test scripts required to pass the quality gate. This repo also uses CodePipeline and CodeBuild, which run an AWS CloudFormation template to create model endpoints for staging and production.
Two CodePipeline pipelines:
- The ModelBuild pipeline automatically triggers and runs the pipeline from end to end whenever a new commit is made to the ModelBuild CodeCommit repository.
- The ModelDeploy pipeline automatically triggers whenever a new model version is added to the model registry and the status is marked as Approved. Models that are registered with Pending or Rejected statuses aren’t deployed.
An Amazon Simple Storage Service (Amazon S3) bucket is created for output model artifacts generated from the pipeline.
SageMaker Pipelines uses the following resources:
- This workflow contains the directed acyclic graph (DAG) that trains and evaluates our model. Each step in the pipeline keeps track of the lineage and intermediate steps can be cached for quickly re-running the pipeline. Outside of templates, you can also create pipelines using the SDK.
- Within SageMaker Pipelines, the SageMaker model registry tracks the model versions and respective artifacts, including the lineage and metadata for how they were created. Different model versions are grouped together under a model group, and new models registered to the registry are automatically versioned. The model registry also provides an approval workflow for model versions and supports deployment of models in different accounts. You can also use the model registry through the boto3 package.
Two SageMaker endpoints:
- After a model is approved in the registry, the artifact is automatically deployed to a staging endpoint followed by a manual approval step.
- If approved, it’s deployed to a production endpoint in the same AWS account.

All SageMaker resources, such as training jobs, pipelines, models, and endpoints, as well as AWS resources listed in this post, are automatically tagged with the project name and a unique project ID tag.

Modifying the sample code for a custom use case

After your project has been created, the architecture described earlier is deployed and the visualization of the pipeline is available on the Pipelines drop-down menu within SageMaker Studio.

To modify the sample code from this launched template, we first need to clone the CodeCommit repositories to our local SageMaker Studio instance. From the list of projects, choose the one that was just created. On the Repositories tab, you can select the hyperlinks to locally clone the CodeCommit repos.

ModelBuild repo

The ModelBuild repository contains the code for preprocessing, training, and evaluating the model. The sample code trains and evaluates a model on the UCI Abalone dataset. We can modify these files to solve our own customer churn use case. See the following code:

|-- codebuild-buildspec.yml
|-- CONTRIBUTING.md
|-- pipelines
| |-- abalone
| | |-- evaluate.py
| | |-- __init__.py
| | |-- pipeline.py
| | |-- preprocess.py
| |-- get_pipeline_definition.py
| |-- __init__.py
| |-- run_pipeline.py
| |-- _utils.py
| |-- __version__.py
|-- README.md
|-- sagemaker-pipelines-project.ipynb
|-- setup.cfg
|-- setup.py
|-- tests
| -- test_pipelines.py
|-- tox.ini

We now need a dataset accessible to the project.

Open a new SageMaker notebook inside Studio and run the following cells:

!wget http://dataminingconsultant.com/DKD2e_data_sets.zip
!unzip -o DKD2e_data_sets.zip
!mv "Data sets" Datasets

import os
import boto3
import sagemaker
prefix = 'sagemaker/DEMO-xgboost-churn'
region = boto3.Session().region_name
default_bucket = sagemaker.session.Session().default_bucket()
role = sagemaker.get_execution_role()

RawData = boto3.Session().resource('s3')
.Bucket(default_bucket).Object(os.path.join(prefix, 'data/RawData.csv'))
.upload_file('./Datasets/churn.txt')

print(os.path.join("s3://",default_bucket, prefix, 'data/RawData.csv'))

Rename the abalone directory to customer_churn. This requires us to modify the path inside codebuild-buildspec.yml as shown in the sample repository. See the following code:
```
run-pipeline --module-name pipelines.customer-churn.pipeline 
```

Replace the preprocess.py code with the customer churn preprocessing script found in the sample repository.
Replace the pipeline.py code with the customer churn pipeline script found in the sample repository.
1. Be sure to replace the “InputDataUrl” default parameter with the Amazon S3 URL obtained in step 1:
```
input_data = ParameterString(
    name="InputDataUrl",
   default_value=f"s3://YOUR_BUCKET/RawData.csv",
)
```
2. Update the conditional step to evaluate the classification model:
```
# Conditional step for evaluating model quality and branching execution
cond_lte = ConditionGreaterThanOrEqualTo(
    left=JsonGet(step=step_eval, property_file=evaluation_report, json_path="binary_classification_metrics.accuracy.value"), right=0.8
)
```
One last thing to note is the default ModelApprovalStatus is set to PendingManualApproval. If our model has greater than 80% accuracy, it’s added to the model registry, but not deployed until manual approval is complete.

Replace the evaluate.py code with the customer churn evaluation script found in the sample repository. One piece of the code we’d like to point out is that, because we’re evaluating a classification model, we need to update the metrics we’re evaluating and associating with trained models:

report_dict = {
  "binary_classification_metrics": {
      "accuracy": {
          "value": acc,
           "standard_deviation" : "NaN"
       },
       "auc" : {
          "value" : roc_auc,
          "standard_deviation": "NaN"
       },
   },
}

evaluation_output_path = '/opt/ml/processing/evaluation/evaluation.json'
with open(evaluation_output_path, 'w') as f:
    f.write(json.dumps(report_dict))

The JSON structure of these metrics are required to match the format of sagemaker.model_metrics for complete integration with the model registry.

ModelDeploy repo

The ModelDeploy repository contains the AWS CloudFormation buildspec for the deployment pipeline. We don’t make any modifications to this code because it’s sufficient for our customer churn use case. It’s worth noting that model tests can be added to this repo to gate model deployment. See the following code:

├── build.py
├── buildspec.yml
├── endpoint-config-template.yml
├── prod-config.json
├── README.md
├── staging-config.json
└── test
    ├── buildspec.yml
    └── test.py

Triggering a pipeline run

Committing these changes to the CodeCommit repository (easily done on the Studio source control tab) triggers a new pipeline run, because an Amazon EventBridge event monitors for commits. After a few moments, we can monitor the run by choosing the pipeline inside the SageMaker project.

The following screenshot shows our pipeline details. Choosing the pipeline run displays the steps of the pipeline, which you can monitor.

When the pipeline is complete, you can go to the Model groups tab inside the SageMaker project and inspect the metadata attached to the model artifacts.

If everything looks good, we can manually approve the model.

This approval triggers the ModelDeploy pipeline and exposes an endpoint for real-time inference.

Conclusion

SageMaker Pipelines enables teams to leverage best practice CI/CD methods within their ML workflows. In this post, we showed how a data scientist can modify a preconfigured MLOps template for their own modeling use case. Among the many benefits is that the changes to the source code can be tracked, associated metadata can be tied to trained models for deployment approval, and repeated pipeline steps can be cached for reuse. To learn more about SageMaker Pipelines, check out the website and the documentation. Try SageMaker Pipelines in your own workflows today.

About the Authors

Sean Morgan is an AI/ML Solutions Architect at AWS. He previously worked in the semiconductor industry, using computer vision to improve product yield. He later transitioned to a DoD research lab where he specialized in adversarial ML defense and network security. In his free time, Sean is an active open-source contributor and maintainer, and is the special interest group lead for TensorFlow Addons.

Hallie Weishahn is an AI/ML Specialist Solutions Architect at AWS, focused on leading global standards for MLOps. She previously worked as an ML Specialist at Google Cloud Platform. She works with product, engineering, and key customers to build repeatable architectures and drive product roadmaps. She provides guidance and hands-on work to advance and scale machine learning use cases and technologies. Troubleshooting top issues and evaluating existing architectures to enable integrations from PoC to a full deployment is her strong suit.

Shelbee Eigenbrode is an AI/ML Specialist Solutions Architect at AWS. Her current areas of depth include DevOps combined with ML/AI. She has been in technology for 23 years, spanning multiple roles and technologies. With over 35 patents granted across various technology domains, her passion for continuous innovation combined with a love of all things data turned her focus to the field of d ata science. Combining her backgrounds in data, DevOps, and machine learning, her current passion is helping customers to not only embrace data science but also ensure all models have a path to production by adopting MLOps practices. In her spare time, she enjoys reading and spending time with family, including her fur family (aka dogs), as well as friends.

Labeling mixed-source, industrial datasets with Amazon SageMaker Ground Truth

January 19, 2021

by Dr. Markus Bestehorn Amazon AWS

Prior to using any kind of supervised machine learning (ML) algorithm, data has to be labeled. Amazon SageMaker Ground Truth simplifies and accelerates this task. Ground Truth uses pre-defined templates to assign labels that classify the content of images or videos or verify existing labels. Ground Truth allows you to define workflows for labeling various kinds of data, such as text, video, or images, without writing any code. Although these templates are applicable to a wide range of use cases in which the data to be labeled is in a single format or from a single source, industrial workloads often require labeling data from different sources and in different formats. This post explores the use case of industrial welding data consisting of sensor readings and images to show how to implement customized, complex, mixed-source labeling workflows using Ground Truth.

For this post, you deploy an AWS CloudFormation template in your AWS account to provision the foundational resources to get started with implementing of this labeling workflow. This provides you with hands-on experience for the following topics:

Creating a private labeling workforce in Ground Truth
Creating a custom labeling job using the Ground Truth framework with the following components:
- Designing a pre-labeling AWS Lambda function that pulls data from different sources and runs a format conversion where necessary
- Implementing a customized labeling user interface in Ground Truth using crowd templates that dynamically loads the data generated by the pre-labeling Lambda function
- Consolidating labels from multiple workers using a customized post-labeling Lambda function
Configuring a custom labeling job using Ground Truth with a customized interface for displaying multiple pieces of data that have to be labeled as a single item

Prior to diving deep into the implementation, I provide an introduction into the use case and show how the Ground Truth custom labeling framework eases the implementation of highly complex labeling workflows. To make full use of this post, you need an AWS account on which you can deploy CloudFormation templates. The total cost incurred on your account for following this post is under $1.

Labeling complex datasets for industrial welding quality control

Although the mechanisms discussed in this post are generally applicable to any labeling workflow with different data formats, I use data from a welding quality control use case. In this use case, the manufacturing company running the welding process wants to predict whether the welding result will be OK or if a number of anomalies have occurred during the process. To implement this using a supervised ML model, you need to obtain labeled data with which to train the ML model, such as datasets representing welding processes that need to be labeled to indicate whether the process was normal or not. We implement this labeling process (not the ML or modeling process) using Ground Truth, which allows welding experts to make assessments about the result of a welding and assign this result to a dataset consisting of images and sensor data.

The CloudFormation template creates an Amazon Simple Storage Service (Amazon S3) bucket in your AWS account that contains images (prefix images) and CSV files (prefix sensor_data). The images contain pictures taken during an industrial welding process similar to the following, where a welding beam is applied onto a metal surface (for image source, see TIG Stainless Steel 304):

The CSV files contain sensor data representing current, electrode position, and voltage measured by sensors on the welding machine. For the full dataset, see the GitHub repo. A raw sample of this CSV data is as follows:

0|96.19|1023|420|4.5|4.5|1|8
0.1|96.13|894|424|4.5|4.5|1|8
0.2|96.06|884|425|4.5|4.5|1|8
0.3|96.05|884|426|4.5|4.5|1|8
0.4|96.12|887|426|4.5|4.5|1|8
0.5|96.17|902|426|4.5|4.5|2|8
0.6|95.82|974|426|4.5|4.5|2|8
0.7|95.45|1304|426|4.5|4.5|3|8
0.8|95.15|1410|428|4.5|4.5|3|8
0.9|94.96|1446|428|4.5|4.5|3|8
1|94.79|1464|428|4.5|4.5|3|8
...

The first column of the data is a timestamp in milliseconds normalized to the start of the welding process. Each row consists of various sensor values associated with the timestamp. The first row is the electrode position, the second row is the current, and the third row is the voltage (the other values are irrelevant here). For instance, the row with timestamp 1, 100 milliseconds after the start of the welding process, has an electrode position of 94.79, a current of 1464, and a voltage of 428.

Because it’s difficult for humans to make assessments using the raw CSV data, I also show how to preprocess such data on the fly for labeling and turn it into more easily readable plots. This way, the welding experts can view the images and the plots to make their assessment about the welding process.

Deploying the CloudFormation template

To simplify the setup and configurations needed in the following, I created a CloudFormation template that deploys several foundations into your AWS account. To start this process, complete the following steps:

Sign in to your AWS account.
Choose one of the following links, depending on which AWS Region you’re using:

us-east-1
us-west-2
eu-west-1
eu-central-1
ap-northeast-1
ap-southeast-1

Keep all the parameters as they are and select I acknowledge that AWS CloudFormation might create IAM resources with custom names and I acknowledge that AWS CloudFormation might require the following capability: CAPABILITY_AUTO_EXPAND.
Choose Create stack to start the deployment.

The deployment takes about 3–5 minutes, during which time a bucket with data to label, some AWS Lambda functions, and an AWS Identity and Access Management (IAM) role are deployed. The process is complete when the status of the deployment switches to CREATE_COMPLETE.

The Outputs tab has additional information, such as the Amazon S3 path to the manifest file, which you use throughout this post. Therefore, it’s recommended to keep this browser tab open and follow the rest of the post in another tab.

Creating a Ground Truth labeling workforce

Ground Truth offers three options for defining workforces that complete the labeling: Amazon Mechanical Turk, vendor-specific workforces, and private workforces. In this section, we configure a private workforce because we want to complete the labeling ourselves. Create a private workforce with the following steps:

On the Amazon SageMaker console, under Ground Truth, choose Labeling workforces.
On the Private tab, choose Create private team.

Enter a name for the labeling workforce. For our use case, I enter welding-experts.
Select Invite new workers by email.
Enter your e-mail address, an organization name, and a contact e-mail (which may be the same as the one you just entered).
Choose Create private team.

The console confirms the creation of the labeling workforce at the top of the screen. When you refresh the page, the new workforce shows on the Private tab, under Private teams.

You also receive an e-mail with login instructions, including a temporary password and a link to open the login page.

Choose the link and use your e-mail and temporary password to authenticate and change the password for the login.

It’s recommended to keep this browser tab open so you don’t have to log in again. This concludes all necessary steps to create your workforce.

Configuring a custom labeling job

In this section, we create a labeling job and use this job to explain the details and data flow of a custom labeling job.

On the Amazon SageMaker console, under Ground Truth, choose Labeling jobs.
Choose Create labeling job.

Enter a name for your labeling job, such as WeldingLabelJob1.
Choose Manual data setup.
For Input dataset location, enter the ManifestS3Path value from the CloudFormation stack Outputs
For Output dataset location, enter the ProposedOutputPath value from the CloudFormation stack Outputs
For IAM role, choose Enter a custom IAM role ARN.
Enter the SagemakerServiceRoleArn value from the CloudFormation stack Outputs
For the task type, choose Custom.
Choose Next.

The IAM role is a customized role created by the CloudFormation template that allows Ground Truth to invoke Lambda functions and access Amazon S3.

Choose to use a private labeling workforce.
From the drop-down menu, choose the workforce welding-experts.
For task timeout and task expiration time, 1 hour is sufficient.
The number of workers per dataset object is 1.
In the Lambda functions section, for Pre-labeling task Lambda function, choose the function that starts with PreLabelingLambda-.
For Post-labeling task Lambda function, choose the function that starts with PostLabelingLambda-.

Enter the following code into the templates section. This HTML code specifies the interface that the workers in the private label workforce see when labeling items. For our use case, the template displays four images, and the categories to classify welding results is as follows:

<script src="https://assets.crowd.aws/crowd-html-elements.js"></script>
<crowd-form>
  <crowd-classifier
    name="WeldingClassification"
    categories="['Good Weld', 'Burn Through', 'Contamination', 'Lack of Fusion', 'Lack of Shielding Gas', 'High Travel Speed', 'Not sure']"
    header="Please classify the welding process."
  >
      <classification-target>
        <div>
          <h3>Welding Image</h3>
	      	<p><strong>Welding Camera Image </strong>{{ task.input.image.title }}</p>
	      	<p><a href="{{ task.input.image.file | grant_read_access }}" target="_blank">Download Image</a></p>
	      	<p>
	      		<img style="height: 30vh; margin-bottom: 10px" src="{{ task.input.image.file | grant_read_access }}"/>
	      	</p>
	    </div>
	    <hr/>
        <div>
          <h3>Current Graph</h3>
	      	<p><strong>Current Graph </strong>{{ task.input.current.title }}</p>
	      	<p><a href="{{ task.input.current.file | grant_read_access }}" target="_blank">Download Current Plot</a></p>
	      	<p>
	      		<img style="height: 30vh; margin-bottom: 10px" src="{{ task.input.current.file | grant_read_access }}"/>
	      	</p>
	    </div>
        <hr/>
        <div>
          <h3>Electrode Position Graph</h3>
	      	<p><strong>Electrode Position Graph </strong>{{ task.input.electrode.title }}</p>
	      	<p><a href="{{ task.input.electrode.file | grant_read_access }}" target="_blank">Download Electrode Position Plot</a></p>
	      	<p>
	      		<img style="height: 30vh; margin-bottom: 10px" src="{{ task.input.electrode.file | grant_read_access }}"/>
	      	</p>
	    </div>
        <hr/>
        <div>
          <h3>Voltage Graph</h3>
	      	<p><strong>Voltage Graph </strong>{{ task.input.voltage.title }}</p>
	      	<p><a href="{{ task.input.voltage.file | grant_read_access }}" target="_blank">Download Voltage Plot</a></p>
	      	<p>
	      		<img style="height: 30vh; margin-bottom: 10px" src="{{ task.input.voltage.file | grant_read_access }}"/>
	      	</p>
	    </div>
      </classification-target>



    <full-instructions header="Classification Instructions">
      <p>Read the task carefully and inspect the image as well as the plots.</p>
      <p>
		  The image is a picture taking during the welding process. The plots show the corresponding sensor data for
		  the electrode position, the voltage and the current measured during the welding process.
	  </p>
    </full-instructions>

    <short-instructions>
      <p>Read the task carefully and inspect the image as well as the plots</p>
    </short-instructions>
  </crowd-classifier>
</crowd-form>

The wizard for creating the labeling job has a preview function in the section Custom labeling task setup, which you can use to check if all configurations work properly.

To preview the interface, choose Preview.

This opens a new browser tab and shows a test version of the labeling interface, similar to the following screenshot.

To create the labeling job, choose Create.

Ground Truth sets up the labeling job as specified, and the dashboard shows its status.

Assigning labels

To finalize the labeling job that you configured, you log in to the worker portal and assign labels to different data items consisting of images and data plots. The details on how the different components of the labeling job work together are explained in the next section.

On the Amazon SageMaker console, under Ground Truth, choose Labeling workforces.
On the Private tab, choose the link for Labeling portal sign-in URL.

When Ground Truth is finished preparing the labeling job, you can see it listed in the Jobs section. If it’s not showing up, wait a few minutes and refresh the tab.

Choose Start working.

This launches the labeling UI, which allows you to assign labels to mixed datasets consisting of welding images and plots for current, electrode position, and voltage.

For this use case, you can assign seven different labels to a single dataset. These different classes and labels are defined in the HTML of the UI, but you can also insert them dynamically using the pre-labeling Lambda function (discussed in the next section). Because we don’t actually use the labeled data for ML purposes, you can assign the labels randomly to the five items that are displayed by Ground Truth for this labeling job.

After labeling all the items, the UI switches back to the list with available jobs. This concludes the section about configuring and launching the labeling job. In the next section, I explain the mechanics of a custom labeling job in detail and also dive deep into the different elements of the HTML interface.

Custom labeling deep dive

A custom labeling job combines the data to be labeled with three components to create a workflow that allows workers from the labeling workforce to assign labels to each item in the dataset:

Pre-labeling Lambda function – Generates the content to be displayed on the labeling interface using the manifest file specified during the configuration of the labeling job. For this use case, the function also converts the CSV files into human readable plots and stores these plots as images in the S3 bucket under the prefix plots.
Labeling interface – Uses the output of the pre-labeling function to generate a user interface. For this use case, the interface displays four images (the picture taken during the welding process and the three graphs for current, electrode position, and voltage) and a form that allows workers to classify the welding process.
Label consolidation Lambda function – Allows you to implement custom strategies to consolidate classifications of one or several workers into a single response. For our workforce, this is very simple because there is only a single worker whose labels are consolidated into a file, which is stored by Ground Truth into Amazon S3.

Before we analyze these three components, I provide insights into the structure of the manifest file, which describes the data sources for the labeling job.

Manifest and dataset files

The manifest file is a file conforming to the JSON lines format, in which each line represents one item to label. Ground Truth expects either a key source or source-ref in each line of the file. For this use case, I use source, and the mapped value must be a string representing an Amazon S3 path. For this post, we only label five items, and the JSON lines are similar to the following code:

{"source": "s3://iiot-custom-label-blog-bucket-unn4d0l4j0/dataset/dataset-1.json"}

For our use case with multiple input formats and files, each line in the manifest points to a dataset file that is also stored on Amazon S3. Our dataset is a JSON document, which contains references to the welding images and the CSV file with the sensor data:

{
  "sensor_data": {"s3Path": "s3://iiot-custom-label-blog-bucket-unn4d0l4j0/sensor_data/weld.1.csv"},
  "image": {"s3Path": "s3://iiot-custom-label-blog-bucket-unn4d0l4j0/images/weld.1.png"}
}

Ground Truth takes each line of the manifest file and triggers the pre-labeling Lambda function, which we discuss next.

Pre-labeling Lambda function

A pre-labeling Lambda function creates a JSON object that is used to populate the item-specific portions of the labeling interface. For more information, see Processing with AWS Lambda.

Before Ground Truth displays an item for labeling to a worker, it runs the pre-labeling function and forwards the information in the manifest’s JSON line to the function. For our use case, the event passed to the function is as follows:

{
  "version": "2018-10-06", 
  "labelingJobArn": "arn:aws:sagemaker:eu-west-1:XXX:labeling-job/weldinglabeljob1",
  "dataObject": { 
    "source": "s3://iiot-custom-label-blog-bucket-unn4d0l4j0/dataset/dataset-1.json" 
  }
}

Although I omit the implementation details here (for those interested, the code is deployed with the CloudFormation template for review), the function for our labeling job uses this input to complete the following steps:

Download the file referenced in the source field of the input (see the preceding code).
Download the dataset file that is referenced in the source
Download a CSV file containing the sensor data. The dataset file is expected to have a reference to this CSV file.
Generate plots for current, electrode position, and voltage from the contents of the CSV file.
Upload the plot files to Amazon S3.
Generate a JSON object containing the references to the aforementioned plot files and the welding image referenced in the dataset file.

When these steps are complete, the function returns a JSON object with two parts :

taskInput – Fully customizable JSON object that contains information to be displayed on the labeling UI.
isHumanAnnotationRequired – A string representing a Boolean value (True or False), which you can use to exclude objects from being labeled by humans. I don’t use this flag for this use case because we want to label all the provided data items.

For more information, see Processing with AWS Lambda.

Because I want to show the welding images and the three graphs for current, electrode position, and voltage, the result of the Lambda function is as follows for the first dataset:

{
  "taskInput": { 
    "image": { 
      "file": "s3://iiot-custom-label-blog-bucket-unn4d0l4j0/images/weld.1.png", 
      "title": " from image at s3://iiot-custom-label-blog-bucket-unn4d0l4j0/images/weld.1.png"
    }, 
    "voltage": { 
      "file": "s3://iiot-custom-label-blog-bucket-unn4d0l4j0/plots/weld.1.csv-current.png", 
      "title": " from file at plots/weld.1.csv-current.png"
    },
    "electrode": { 
      "file": "s3://iiot-custom-label-blog-bucket-unn4d0l4j0/plots/weld.1.csv-electrode_pos.png", 
      "title": " from file at plots/weld.1.csv-electrode_pos.png" 
    }, 
    "current": { 
      "file": "s3://iiot-custom-label-blog-bucket-unn4d0l4j0/plots/weld.1.csv-voltage.png", 
      "title": " from file at plots/weld.1.csv-voltage.png" 
    } 
  }, 
  "isHumanAnnotationRequired": "true"
}

In the preceding code, the taskInput is fully customizable; the function returns the Amazon S3 paths to the images to display, and also a title, which has some non-functional text. Next, I show how to access these different parts of the taskInput JSON object when building the customized labeling UI displayed to workers by Ground Truth.

Labeling UI: Accessing taskInput content

Ground Truth uses the output of the Lambda function to fill in content into the HTML skeleton that is provided at the creation of the labeling job. In general, the contents of the taskInput output object is accessed using task.input in the HTML code.

For instance, to retrieve the Amazon S3 path where the welding image is stored from the output, you need to access the path taskInput/image/file. Because the taskInput object from the function output is mapped to task.input in the HTML, the corresponding reference to the welding image file is task.input.image.file. This reference is directly integrated into the HTML code of the labeling UI to display the welding image:

<img style="height: 30vh; margin-bottom: 10px" src="{{ task.input.image.file | grant_read_access }}"/>

The grant_read_access filter is needed for files in S3 buckets that aren’t publicly accessible. This makes sure that the URL passed to the browser contains a short-lived access token for the image and thereby avoids having to make resources publicly accessible for labeling jobs. This is often mandatory because the data to be labeled, such as machine data, is confidential. Because the pre-labeling function has also converted the CSV files into plots and images, their integration into the UI is analogous.

Label consolidation Lambda function

The second Lambda function that was configured for the custom labeling job runs when all workers have labeled an item or the time limit of the labeling job is reached. The key task of this function is to derive a single label from the responses of the workers. Additionally, the function can be for any kind of further processing of the labeled data, such as storing them on Amazon S3 in a format ideally suited for the ML pipeline that you use.

Although there are different possible strategies to consolidate labels, I focus on the cornerstones of the implementation for such a function and show how they translate to our use case. The consolidation function is triggered by an event similar to the following JSON code:

{ 
  "version": "2018-10-06", 
  "labelingJobArn": "arn:aws:sagemaker:eu-west-1:261679111194:labeling-job/weldinglabeljob1", 
  "payload": { 
    "s3Uri": "s3://iiot-custom-label-blog-bucket-unn4d0l4j0/output/WeldingLabelJob1/annotations/consolidated-annotation/consolidation-request/iteration-1/2020-09-15_16:16:11.json" 
  }, 
  "labelAttributeName": "WeldingLabelJob1", 
  "roleArn": "arn:aws:iam::261679111194:role/AmazonSageMaker-Service-role-unn4d0l4j0", 
  "outputConfig": "s3://iiot-custom-label-blog-bucket-unn4d0l4j0/output/WeldingLabelJob1/annotations", 
  "maxHumanWorkersPerDataObject": 1 
}

The key item in this event is the payload, which contains an s3Uri pointing to a file stored on Amazon S3. This payload file contains the list of datasets that have been labeled and the labels assigned to them by workers. The following code is an example of such a list entry:

{ 
  "datasetObjectId": "4", 
  "dataObject": { 
    "s3Uri": "s3://iiot-custom-label-blog-bucket-unn4d0l4j0/dataset/dataset-5.json" 
  }, 
  "annotations": [ 
    { 
      "workerId": "private.eu-west-1.abd2ec3e354db315",
      "annotationData": { 
          "content":"{"WeldingClassification":{"label":"Not sure"}}"
      } 
    } 
  ] 
}

Along with an identifier that you could use to determine which worker labeled the item, each entry lists for each dataset which labels have been assigned. For example, in the case of multiple workers, there are multiple entries in annotations. Because I created a single worker that labeled all the items for this post, there is only a single entry. The file dataset-5.json has been labeled with Not Sure for the classifier WeldingClassification.

The label consolidation function has to iterate over all list entries and determine for each dataset a label to use as the ground truth for supervised ML training. Ground Truth expects the function to return a list containing an entry for each dataset item with the following structure:

{ 
  "datasetObjectId": "4", 
  "consolidatedAnnotation": { 
    "content": { 
      "WeldingLabelJob1": {
         "WeldingClassification": "Not sure" 
      } 
    }
  } 
}

Each entry of the returned list must contain the datasetObjectId for the corresponding entry in the payload file and a JSON object consolidatedAnnotation, which contains an object content. Ground Truth expects content to contain a key that equals the name of the labeling job, (for our use case, WeldingLabelJob1). For more information, see Processing with AWS Lambda.
You can change this behavior when you create the labeling job by selecting I want to specify a label attribute name different from the labeling job name and entering a label attribute name.

The content inside this key equaling the name of the labeling job is freely configurable and can be arbitrarily complex. For our use case, it’s enough to return the assigned label Not Sure. If any of these formatting requirements are not met, Ground Truth assumes the labeling job didn’t run properly and failed.

Because I specified output as the desired prefix during the creation of the labeling job, the requirements are met, and Ground Truth uploads the list of JSON entries into the bucket and prefix specified during the creation of the consolidated labels, and they are uploaded with the following prefix:

output/WeldingLabelJob1/annotations/consolidated-annotation/consolidation-response/iteration-1/

You can use such files for training ML algorithms in Amazon SageMaker or for further processing.

Cleaning up

To avoid incurring future charges, delete all resources created for this post.

On the AWS CloudFormation console, choose Stacks.
Select the stack iiot-custom-label-blog.
Choose Delete.

This step removes all files and the S3 bucket from your account. The process takes about 3–5 minutes.

Conclusion

Supervised ML requires labeled data, and Ground Truth provides a platform for creating labeling workflows. This post showed how to build a complex industrial IoT labeling workflow, in which data from multiple sources needs to be considered for labeling items. The post explained how to create a custom labeling job and provided details on the mechanisms Ground Truth requires to implement such a workflow. To get started with writing your own custom labeling job, refer to the custom labeling documentation page for Ground Truth and potentially re-deploy the CloudFormation template of this post to get a sample for the pre-labeling and consolidation lambdas. Additionally, the blog post “Creating custom labeling jobs with AWS Lambda and Amazon SageMaker Ground Truth” provides additional insights into building custom labeling jobs.

About the Author

As a Principal Prototyping Engagement Manager, Dr. Markus Bestehorn is responsible for building business-critical prototypes with AWS customers, and is a specialist for IoT and machine learning. His “career” started as a 7-year-old when he got his hands on a computer with two 5.25” floppy disks, no hard disk, and no mouse, on which he started writing BASIC, and later C as well as C++ programs. He holds a PhD in computer science and all currently available AWS certifications. When he’s not on the computer, he runs or climbs mountains.

Building predictive disease models using Amazon SageMaker with Amazon HealthLake normalized data

January 19, 2021

by Ujjwal Ratan Amazon AWS

In this post, we walk you through the steps to build machine learning (ML) models in Amazon SageMaker with data stored in Amazon HealthLake using two example predictive disease models we trained on sample data using the MIMIC-III dataset. This dataset was developed by the MIT lab for Computational Physiology and consists of de-identified healthcare data associated with approximately 60,000 ICU admissions. The dataset includes multiple attributes about the patients like their demographics, vital signs, and medications, along with their clinical notes. We first developed the models using the structured data such as demographics, vital signs, and medications. Then we augmented these models with additional data extracted and normalized from clinical notes to test and compare their performance. In both these experiments, we found an improvement in model performance when modelled as a supervised learning (classification) or an unsupervised learning (clustering) problem. We present our findings and the setup of the experiments in this post.

Why multiple modalities?

Modality can be defined as the classification of a single independent sensory input/output between a computer and a human. For example, we can see objects and hear sounds by using our senses. These can be considered as two separate modalities. Datasets that represent multiple modalities are categorized as a multi-modal dataset. For instance, images can consist of tags that help search and organize them, and textual data can contain images to explain what’s in the image. When medical practitioners make clinical decisions, it’s usually based on information gathered from a variety of healthcare data modalities. A physician looks at patient’s observations, their past history, their scans, and even physical characteristics of the patient during the visit to make a definitive diagnosis. ML models need to take this into account when trying to achieve real-world performance. The post Building a medical image search platform on AWS shows how you can combine features from medical images and their corresponding radiology reports to create a medical image search platform. The challenge with creating such models is the preprocessing of these multi-modal datasets and extracting appropriate features from them.

Amazon HealthLake makes it easier to train models on multi-modal data

Amazon HealthLake is a HIPAA eligible service that enables healthcare providers, health insurance companies, and pharmaceutical companies to store, transform, query, and analyze health data on the AWS Cloud at petabyte scale. As part of the transformation, Amazon HealthLake tags and indexes unstructured data using specialized ML models. These tags and indexes can be used to query and search as well as understand relationships in the data for analytics.

When you export data from Amazon HealthLake, it adds a resource called DocumentReference to the output. This resource consists of clinical entities (Like medications, medical conditions, anatomy, and Protected Health Information (PHI)), the RxNorm codes for medications, and the ICD10 codes for medical conditions that are automatically derived from the unstructured notes about the patients. These are additional attributes about the patients that are embedded within the unstructured portions of their clinical records and would have been largely ignored for downstream analysis. Combining the structured data from the EHR with these attributes provides a more holistic picture of the patient and their conditions. To help determine the value of these attributes, we created a couple of experiments around clinical outcome prediction.

Architecture overview

The following diagram illustrates the architecture for our experiments.

You can export the normalized data to an Amazon Simple Storage Service (Amazon S3) bucket using the Export API. Then we use AWS Glue to crawl and build a catalog of the data. This catalog is shared by Amazon Athena to run the queries directly off of the exported data from Colossus. Athena also normalizes the JSON format files to rows and columns for easy querying. The DocumentReference resource JSON file is processed separately to extract indexed data derived from the unstructured portions of the patient records. The file consists of an extension tag that has a hierarchical JSON output consisting of patient attributes. There are multiple ways to process this file (like using Python-based JSON parsers or even string-based regex and pattern matching). For an example implementation, see the section Connecting Athena with HealthLake in the post Population health applications with Amazon HealthLake – Part 1: Analytics and monitoring using Amazon QuickSight.

Example setup

Accessing the MIMIC-III dataset requires you to request access. As part of this post, we don’t distribute any data but instead provide the setup steps so you can replicate these experiments when you have access to MIMIC-III. We also publish our conclusions and findings from the results.

For the first experiment, we build a binary disease classification model to predict patients with congestive heart failure (CHF). We measure its performance using accuracy, ROC, and confusion matrix for both structured and unstructured patient records. For the second experiment, we cluster a cohort of patients into a fixed number of groups and visualize the cluster separation before and after the addition of the unstructured patient records. For both our experiments, we build a baseline model and compare it with the multi-modal model, where we combine existing structured data with additional features (ICD-10 codes and Rx-Norm codes) in our training set.

These experiments are not intended to produce a state-of-the-art model on real-world datasets; its purpose is to demonstrate how you can utilize features exported from Amazon Healthlake for training models on structured and unstructured patient records to improve your overall model performance.

Features and data normalization

We took a variety of features related to patient encounters to train our models. This included the patient demographics (gender, marital status), the clinical conditions, procedures, medications, and observations. Because each patient could have multiple encounters consisting of multiple observations, clinical conditions, procedures, and medications, we normalized the data and converted each of these features into a list. This allowed us to get a training set with all these features (as a list) for each patient.

Similarly, for the unstructured features that Amazon Healthlake converted into the DocumentReference resource, we extracted the ICD-10 codes and Rx-Norm codes (using the methods described in the architecture) and converted them into feature vectors.

Feature engineering and model

For the categorical attributes in our dataset, we used a label encoder to convert the attributes into a numerical representation. For all other list attributes, we used term frequency-inverse document frequency (FI-IDF) vectors. This high-dimensional dataset was then shuffled and divided into 80% train and 20% test sets for training and evaluation of the models, respectively. For training our model, we used the gradient boosting library XGBoost. We considered mostly default hyperparameters and didn’t perform any hyperparameter tuning, because our objective was only to train a baseline model with structured patient records and then show improvement on those results with the unstructured features. Adopting better hyperparameters or changing to other feature engineering and modelling approaches can likely improve these results.

Example 1: Predicting patients with a congestive heart failure

For the first experiment, we took 500 patients with a positive CHF diagnosis. For the negative class, we randomly selected 500 patients who didn’t have a CHF diagnosis. We removed the clinical conditions from the positive class of patients that were directly related to CHF. For example, all the patients in the positive class were expected to have ICD-9 code 428, which stands for CHF. We filtered that out from the positive class to make sure the model is not overfitting on the clinical condition.

Baseline model

Our baseline model had an accuracy of 85.8%. The following graph shows the ROC curve.

The following graph shows the confusion matrix.

Amazon HealthLake augmented model

Our Amazon HealthLake augmented model had an accuracy of 89.1%. The following graph shows the ROC curve.

The following graph shows the confusion matrix.

Adding the features extracted from Amazon HealthLake allowed us to improve the model accuracy from 85% to 89% and also the AUC from 0.86 to 0.89. If you look at the confusion matrices for the two models, the false positives reduced from 20 to 13 and the false negatives reduced from 27 to 20.

Optimizing healthcare is about ensuring the patient is associated with their peers and the right cohort. As patient data is added or changes, it’s important to continuously identify and reduce false negative and positive identifiers for overall improvement in the quality of care.

To better explain the performance improvements, we picked a patient from the false negative cohort in the first model who moved to true positive in the second model. We plotted a word cloud for the top medical conditions for this patient for the first and the second model, as shown in the following images.

There is a clear difference between the medical conditions of the patient before and after the addition of features from Amazon HealthLake. The word cloud for model 2 is richer, with more medical conditions indicative of CHF than the one for model 1. The data embedded within the unstructured notes for this patient extracted by Amazon HealthLake helped this patient move from a false negative category to a true positive.

These numbers are based on synthetic experimental data we used from a subset of MIMIC-III patients. In a real-world scenario with higher-volume of patients, these numbers may differ.

Example 2: Grouping patients diagnosed with sepsis

For the second experiment, we took 500 patients with a positive sepsis diagnosis. We grouped these patients on the basis of their structured clinical records using k-means clustering. To show that this is a repeatable pattern, we chose the same feature engineering techniques as described in experiment 1. We didn’t divide the data into training and testing datasets because we were implementing an unsupervised learning algorithm.

We first analyzed the optimal number of clusters of the grouping using the Elbow method and arrived at the curve shown in the following graph.

This allowed us to determine that six clusters were the optimal number in our patient grouping.

Baseline model

We reduced the dimensionality of the input data using Principal Component Analysis (PCA) to two and plotted the following scatter plot.

The following were the counts of patients across each cluster:

Cluster 1
Number of patients: 44

Cluster 2
Number of patients: 30

Cluster 3
Number of patients: 109

Cluster 4
Number of patients: 66

Cluster 5
Number of patients: 106

Cluster 6
Number of patients: 145

We found that the at least four of the six clusters had a distinct overlap of patients. That means the structured clinical features weren’t enough to clearly divide the patients into six groups.

Enhanced model

For the enhanced model, we added the ICD-10 codes and their corresponding descriptions for each patient as extracted from Amazon HealthLake. However, this time, we could see a clear separation of the patient groups.

We also saw a change in distribution across the six clusters:

Cluster 1
Number of patients: 54

Cluster 2
Number of patients: 154

Cluster 3
Number of patients: 64

Cluster 4
Number of patients: 44

Cluster 5
Number of patients: 109

Cluster 6
Number of patients: 75

As you can see, adding features from the unstructured data for the patients allows us to improve the clustering model to clearly divide the patients into six clusters. We even saw that some patients moved across clusters, denoting that the model became better at recognizing those patients based on their unstructured clinical records.

Conclusion

In this post, we demonstrated how you can easily use SageMaker to build ML models on your data in Amazon HealthLake. We also demonstrated the advantages of augmenting data from unstructured clinical notes to improve the accuracy of disease prediction models. We hope this body of work provides you with examples of how to build ML models using SageMaker with your data stored and normalized in Amazon HealthLake and improve model performance for clinical outcome predictions. To learn more about Amazon HealthLake, please check the website and technical documentation for more information.

About the Authors

Ujjwal Ratan is a Principal Machine Learning Specialist in the Global Healthcare and Lifesciences team at Amazon Web Services. He works on the application of machine learning and deep learning to real world industry problems like medical imaging, unstructured clinical text, genomics, precision medicine, clinical trials and quality of care improvement. He has expertise in scaling machine learning/deep learning algorithms on the AWS cloud for accelerated training and inference. In his free time, he enjoys listening to (and playing) music and taking unplanned road trips with his family.

Nihir Chadderwala is an AI/ML Solutions Architect on the Global Healthcare and Life Sciences team. His background is building Big Data and AI-powered solutions to customer problems in variety of domains such as software, media, automotive, and healthcare. In his spare time, he enjoys playing tennis, watching and reading about Cosmos.

Parminder Bhatia is a science leader in the AWS Health AI, currently building deep learning algorithms for clinical domain at scale. His expertise is in machine learning and large scale text analysis techniques in low resource settings, especially in biomedical, life sciences and healthcare technologies. He enjoys playing soccer, water sports and traveling with his family.

Prime Video’s work on sports field registration, recap/intro detection

January 15, 2021

by admin Amazon AWS

Two papers at WACV propose neural models for enhancing video-streaming experiences.Read More

Automating Amazon Personalize solution using the AWS Step Functions Data Science SDK

January 14, 2021

by Neel Sendas Amazon AWS

Machine learning (ML)-based recommender systems aren’t a new concept across organizations such as retail, media and entertainment, and education, but developing such a system can be a resource-intensive task—from data labelling, training and inference, to scaling. You also need to apply continuous integration, continuous deployment, and continuous training to your ML model, or MLOps. The MLOps model helps build code and integration across ML tools and frameworks. Moreover, building a recommender system requires managing and orchestrating multiple workflows, for example waiting for your training jobs to finish and trigger deployments.

In this post, we show you how to use Amazon Personalize to create a serverless recommender system for movie recommendations with no ML experience required. Creating an Amazon Personalize solution workflow involves multiple steps, such as preparing and importing the data, choosing a recipe and creating a solution and finally generating recommendations. End-users or your data scientists can orchestrate these steps using AWS Lambda functions and AWS Step Functions. However, writing JSON code to manage such workflows can be complicated for your data scientists.

You can orchestrate an Amazon Personalize workflow using the AWS Step Functions Data Science SDK for Python using an Amazon SageMaker Jupyter notebook. The AWS Step Functions Data Science SDK is an open-source library that allows data scientists to easily create workflows that process and publish ML models using Amazon SageMaker Jupyter notebooks and Step Functions. You can create multi-step ML workflows in Python that orchestrate AWS infrastructure at scale, without having to provision and integrate the AWS services separately.

Solution and services overview

We have orchestrated the Amazon Personalize training workflow steps such as preparing and importing the data, choosing a recipe and creating a solution using AWS Step functions and AWS lambda functions to create a below Amazon Personalize end to end automated workflow:

We walk you through the below steps using the accompanying Amazon SageMaker Jupyter notebook. In order to create the above Amazon Personalize workflow, we are going to follow below detailed steps:

Complete the prerequisites of setting up permissions and preparing the dataset.
Set up AWS Lambda function and AWS Step function task states.
Add wait conditions to each AWS Step function task states.
Add a control flow and link the states.
Define and run the workflows by combining all the above steps.
Generate recommendations.

Prerequisites

Before you build your workflow and generate recommendations, you must set up a notebook instance, AWS Identity and Access Management (IAM) roles, and create an Amazon Simple Storage Service (Amazon S3) bucket.

Creating a notebook instance in Amazon SageMaker

Create an Amazon SageMaker notebook Instance by following instructions here. For creating IAM role for your notebook, make sure your notebook IAM role has AmazonS3FullAccess and AmazonPersonalizeFullAccess in order to access Amazon S3 and Amazon Personalize APIs through the Sagemaker notebook.

When the notebook is active, choose Open Jupyter.
On the Jupyter dashboard, choose New.
Choose Terminal.

In the terminal, enter the following code:

cd SageMaker 
git clone https://github.com/aws-samples/personalize-data-science-sdk-workflow.git

Open the notebook by choosing Personalize-Stepfunction-Workflow.ipynb in the root folder.

You’re now ready to run the following steps through the notebook cells. You can find the complete notebook for this solution in the GitHub repo.

Setting up Step Function Execution role

You need a Step Functions execution role so that you can create and invoke workflows in Step Functions. For instructions, see Create a role for Step Functions or go through the steps in the notebook Create an execution role for Step Functions.

Attach an inline policy to the role you created for notebook instance in previous step. Enter the JSON policy by copying the policy from the notebook.

Preparing your dataset

To prepare your dataset, complete the following steps:

Create an S3 bucket to store the training dataset and provide the bucket name and file name as ’movie-lens-100k.csv’ in the notebook step Setup S3 location and filename. See the following code:
```
bucket = "<SAMPLE BUCKET NAME>" # replace with the name of your S3 bucket
filename = "<SAMPLE FILE NAME>" # replace with a name that you want to save the dataset under
```

Attach the policy to the S3 bucket by running the Attach policy to Amazon S3 bucket step in the notebook.
Create an IAM role for Amazon Personalize by running the Create Personalize Role step in the notebook using Python boto3 create_role API.

To preprocess the dataset and upload to Amazon S3, run the Data-Preparation step of the following notebook cell to download the movie-lens dataset and select the movies that have a rating of 2 or above from the dataset:

!wget -N http://files.grouplens.org/datasets/movielens/ml-100k.zip
!unzip -o ml-100k.zip
data = pd.read_csv('./ml-100k/u.data', sep='t', names=['USER_ID', 'ITEM_ID', 'RATING', 'TIMESTAMP'])
pd.set_option('display.max_rows', 5)
data

The following screenshot shows your output.

Run the following code to upload your data to the S3 bucket that you created before:

data = data[data['RATING'] > 2] # keep only movies rated 2 and above
data2 = data[['USER_ID', 'ITEM_ID', 'TIMESTAMP']] 
data2.to_csv(filename, index=False)

boto3.Session().resource('s3').Bucket(bucket).Object(filename).upload_file(filename)

Set up AWS Lambda function and AWS Step function task states.

After you complete the prerequisite steps, you would need do to below:

Setup Lambda functions in the AWS Console: You need to setup AWS Lambda function for each Amazon Personalize tasks such as creating a dataset, choosing a recipe and creating a solution using Amazon Personalize AWS SDK for Python (Boto3). With the Python code provided in the Lambda GitHub repo. This repository has python code for various lambda functions, you need to copy the code and create lambda functions following the steps in this documentation “build a Lambda function on the Lambda console.”

Note: While creating an execution role, make sure you provide AmazonPersonalizeFullAccess along with AWSLambdaBasicExecutionRole permission policy to the lambda role.

Orchestrate each of these lambda function as AWS Step function task states: A Task state represents a single unit of work performed by a state machine. AWS Step Functions can invoke Lambda functions directly from a task state. For more information see Creating a Step Functions State Machine That Uses Lambda. Below is a step function workflow diagram to orchestrate the lambda functions we are going to cover in this section:

We already created Lambda functions for each of the Amazon Personalize steps using the Amazon Personalize AWS SDK for Python (Boto3) with the Python code provided in the Lambda GitHub repo. We deep dive and explain each step in this section. Alternatively, you can build a Lambda function on the Lambda console.

Creating a schema

Before you add a dataset to Amazon Personalize, you must define a schema for that dataset. For more information, see Datasets and Schemas. First we would create a step function task state to create a schema using below code in the walkthrough notebook.

lambda_state_schema = LambdaStep(
    state_id="create schema",
    parameters={  
        "FunctionName": "stepfunction-create-schema", #replace with the name of the function you created
        "Payload": {  
           "input": "personalize-stepfunction-schema"
        }
    },
    result_path='$'    
)

The above step function is invoking the AWS lambda function Lambda function stepfunction-create-schema.py. This lambda code snippet above is uses Personalize.create_schema() boto 3 APIs to create schema.

import boto3

client = boto3.client('personalize')
create_schema_response = personalize.create_schema(
        name = event['input'],
        schema = json.dumps(schema)
    )

This API creates an Amazon Personalize schema from the specified schema string. The schema you create must be in Avro JSON format.

Creating a dataset group

After you create a schema, similarly, we are going to create a step function task states for creating a dataset group .A dataset group contains related datasets that supply data for training a model. A dataset group can contain at most three datasets, one for each type of dataset: Interactions, Items and Users.

To train a model (create a solution), you need a dataset group that contains an Interactions dataset.

Run the notebook steps to create state id “create dataset” below:

lambda_state_createdataset = LambdaStep(
    state_id="create dataset",
    parameters={  
        "FunctionName": "stepfunctioncreatedataset", #replace with the name of the function you created
        
        "Payload": {  
           "schemaArn.$": '$.schemaArn',
           "datasetGroupArn.$": '$.datasetGroupArn',        
        } 
        
        
    },
    result_path = '$'
)

The above step function task state is invoking the AWS lambda function stepfunctioncreatedatagroup.py. Below is a code snippet of using personalize.create_dataset() to create a dataset:

create_dataset_response = personalize.create_dataset(

name = "personalize-stepfunction-dataset",

datasetType = dataset_type,

datasetGroupArn = event['datasetGroupArn'],

schemaArn = event['schemaArn']

)

This API creates an empty dataset and adds it to the specified dataset group.

Creating a dataset

In this step, you create a step function task state for automating an empty dataset and add it to the specified dataset group.

Run through the notebook steps to create a dataset step function. You can review the underlying AWS Lambda function stepfunctioncreatedataset.py. We use the same API personalize.create_dataset Python boto3 API to automate dataset creation which we used to create a dataset group in above step.

Importing your data

When you complete Step 1: Creating a Dataset Group and Step 2: Creating a Dataset and a Schema, you’re ready to import your training data into Amazon Personalize. When you import data, you can choose to import records in bulk, import records individually, or both, depending on your business requirements and the amount of historical data you have collected. If you have a large amount of historical records, we recommend you import data in bulk and add data incrementally as necessary.

Run through the notebook steps to create a step functions task state to import a dataset with state_id=”create dataset import job”. Let’s review the code snippet for underlying Lambda function stepfunction-createdatasetimportjob.py.

create_dataset_import_job_response = personalize.create_dataset_import_job(

jobName = "stepfunction-dataset-import-job",

datasetArn = datasetArn,

dataSource = {

"dataLocation": "s3://{}/{}".format(bucket, filename)

},

roleArn = roleArn

)

We are using personalize.create_dataset_import_job() boto3 APIs to import the dataset in the above lambda code. This API creates a job that imports training data from your data source (an Amazon S3 bucket) to an Amazon Personalize dataset.

Creating a solution

After you prepare and import the data, you’re ready to create a solution. A solution refers to the combination of an Amazon Personalize recipe, customized parameters, and one or more solution versions (trained models). A recipe is an Amazon Personalize term specifying an appropriate algorithm to train for a given use case. After you create a solution with a solution version, you can create a campaign to deploy the solution version and get recommendations. We will create a step function task states for choosing a recipe and creating a solution.

Choosing a recipe, configuring a solution and creating a solution version

Run the notebook cell Choose a recipe to create a step function task state to generate state ID state_id=”select recipe and create solution”. This state is representing the AWS lambda function stepfunction_select-recipe_create-solution.py. Below is the code snippet:

list_recipes_response = personalize.list_recipes()

recipe_arn = "arn:aws:personalize:::recipe/aws-user-personalization" # aws-user-personalization selected for demo purposes

#list_recipes_response
  create_solution_response = personalize.create_solution(

name = "stepfunction-solution",

datasetGroupArn = event['dataset_group_arn'],

recipeArn = recipe_arn

)

We are using personalize.list_recipes() boto3 APIs to return a list of available recipes and personalize.craete_solution() boto 3 API to create the configuration for training a model. A trained model is known as a solution.

We use the algorithm aws-user-personalization for this post. For more information, see Choosing a Recipe and Configuring a Solution.

After you choose a recipe and configure your solution, you’re ready to create a solution version, which refers to a trained ML model.

Run the notebook cell Create Solution Version to create a step function task state with the State ID state_id=” create solution version” and recipes using the underlying Lambda function stepfunction_create_solution_version.py. Below is the lambda code snippet using personalize.create_solution_version() boto3 API which trains or retrains an active solution.

create_solution_version_response = personalize.create_solution_version(


solutionArn = event['solution_arn']

)

A solution is created using the CreateSolution operation and must be in the ACTIVE state before calling CreateSolutionVersion . A new version of the solution is created every time you call this operation. For more information, see Creating a Solution Version.

Creating a campaign

A campaign is used to make recommendations for your users. You create a campaign by deploying a solution version.

Run the notebook cell Create Campaign to create step function task state with the state ID state_id=” create campaign” using the underlying Lambda function stepfunction_getsolution_metric_create_campaign.py. This lambda is using below API

get_solution_metrics_response = personalize.get_solution_metrics(
        solutionVersionArn = event['solution_version_arn']
    )

    create_campaign_response = personalize.create_campaign(
        name = "stepfunction-campaign",
        solutionVersionArn = event['solution_version_arn'],
        minProvisionedTPS = 1
    )

personalize.get_solution_metrics() gets the metrics for the specified solution version and personalize.create_campaign() creates a campaign by deploying a solution version.

In this section, we covered all the Step function task states and their AWS lambda functions needed to orchestrate the Amazon Personalize workflow. Go to next section to add the wait states to these step function states.

Adding a wait state to your steps

In this section, we add wait states because these steps need to wait for previous steps to finish before running the next step. For example, you should create a dataset after you create a dataset group.

Run the notebook steps from Wait for Schema to be ready to Wait for Campaign to be ACTIVE to make sure these Step Functions or Lambda steps are waiting for each step to run before they trigger the next step.

The following code is an example of what a wait state looks like to wait for the schema to be created:

wait_state_schema = Wait(
    state_id="Wait for create schema - 5 secs",
    seconds=5
)

We added a wait time of 5 seconds for the state ID, which means it should wait for 5 seconds before going to next state.

Adding a control flow and linking states by using the choice state

The AWS Step Functions Data Science SDK’s choice state supports branching logic based on the outputs from previous steps. You can create dynamic and complex workflows by adding this state.

After you define these steps, chain them together into a logical sequence. Run all the notebook steps to add choices while automating Amazon Personalize datasets, recipes, solutions, and the campaign.

The following code is an example of the choice state for the create_campaign workflow:

create_campaign_choice_state = Choice(
    state_id="Is the Campaign ready?"
)
create_campaign_choice_state.add_choice(
    rule=ChoiceRule.StringEquals(variable=lambda_state_campaign_status.output()['Payload']['status'], value='ACTIVE'),
    next_step=Succeed("CampaignCreatedSuccessfully")     
)
create_campaign_choice_state.add_choice(
    ChoiceRule.StringEquals(variable=lambda_state_campaign_status.output()['Payload']['status'], value='CREATE PENDING'),
    next_step=wait_state_campaign
)
create_campaign_choice_state.add_choice(
    ChoiceRule.StringEquals(variable=lambda_state_campaign_status.output()['Payload']['status'], value='CREATE IN_PROGRESS'),
    next_step=wait_state_campaign
)

create_campaign_choice_state.default_choice(next_step=Fail("CreateCampaignFailed"))

Defining and running workflows

You create a workflow that runs a group of Lambda functions (steps) in a specific order for example one Lambda function’s output passes to the next Lambda function’s input. We’re now ready to define the workflow definition for each step in Amazon Personalize:

Dataset workflow
Dataset import workflow
Recipe and solution workflow
Create campaign workflow
Main workflow to orchestrate the four preceding workflows for Amazon Personalize

After completing these steps, we can run our main workflow. To learn more about Step function workflows refer this link.

Dataset workflow

Run the following code to generate a workflow definition for dataset creation using the Lambda function state defined by Step Functions:

Dataset_workflow_definition=Chain([lambda_state_schema,
                                   wait_state_schema,
                                   lambda_state_datasetgroup,
                                   wait_state_datasetgroup,
                                   lambda_state_datasetgroupstatus
                                  ])
Dataset_workflow = Workflow(
    name="Dataset-workflow",
    definition=Dataset_workflow_definition,
    role=workflow_execution_role
)

The following screenshot shows the dataset workflow view.

Dataset import workflow

Run the following code to generate a workflow definition for dataset creation using the Lambda function state defined by Step Functions:

DatasetImport_workflow_definition=Chain([lambda_state_createdataset,
                                   wait_state_dataset,
                                   lambda_state_datasetimportjob,
                                   wait_state_datasetimportjob,
                                   lambda_state_datasetimportjob_status,
                                   datasetimportjob_choice_state
                                  ])

The following screenshot shows the dataset Import workflow view.

Recipe and solution workflow

Run the notebook to generate a workflow definition for dataset creation using the Lambda function state defined by Step Functions:

Create_receipe_sol_workflow_definition=Chain([lambda_state_select_receipe_create_solution,
                                   wait_state_receipe,
                                   lambda_create_solution_version,
                                   wait_state_solutionversion,
                                   lambda_state_solutionversion_status,
                                   solutionversion_choice_state
                                  ])

The following screenshot shows the create recipe workflow view.

Campaign workflow

Run the notebook to generate a workflow definition for dataset creation using the Lambda function state defined by Step Functions:

Create_Campaign_workflow_definition=Chain([lambda_create_campaign,
                                   wait_state_campaign,
                                   lambda_state_campaign_status,
                                   wait_state_datasetimportjob,
                                   create_campaign_choice_state
                                  ])

The following screenshot shows the campaign workflow view.

Main workflow

Now we automate all four workflow definitions by combining them into a single workflow definition in order to automate all the task states to create the dataset, recipe, solution, and campaign in Amazon Personalize, including the wait times and choice states. Run the following notebook cell to generate a main workflow definition:

Main_workflow_definition=Chain([call_dataset_workflow_state,
                                call_datasetImport_workflow_state,
                                call_receipe_solution_workflow_state,
                                call_campaign_solution_workflow_state
                               ])

Running the workflow

Run the following code to trigger an Amazon Personalize automated workflow to create a dataset, recipe, solution, and campaign using the underlying Lambda function and Step Functions states:

Main_workflow_execution = Main_workflow.execute()
Main_workflow_execution.render_progress()

Dataset Workflow	Dataset Import Workflow

Create recipe and Solution Workflow	Campaign Workflow

Alternatively you can Inspect in AWS Step Functions console to see the status of each step in progress.

To see detailed progress of each step whether its success or failed run the code below:

Main_workflow_execution.list_events(html=True)

Note: Wait till the above execution finishes.

Generating recommendations

Now that you have a successful campaign, you can use it to generate a recommendation workflow. The following screenshot shows your recommendation workflow view.

Running the recommendation workflow

Run the following code to trigger a recommendation workflow using the underlying Lambda function and Step Functions states:

recommendation_workflow_execution = recommendation_workflow.execute()
recommendation_workflow_execution.render_progress()

Use the following code to display the movie recommendations generated by the workflow:

item_list = recommendation_workflow_execution.get_output()['Payload']['item_list']

print("Recommendations:")
for item in item_list:
    np.int(item['itemId'])
    item_title = items.loc[items['ITEM_ID'] == np.int(item['itemId'])].values[0][-1]
    print(item_title)

You can find the complete notebook for this solution in the GitHub repo.

Cleaning up

Make sure to clean up the Amazon Personalize and the state machines created in this post to avoid incurring any charges. On the Amazon Personalize console, delete these resources in the following order:

Summary

Amazon Personalize makes it easy for developers to add highly personalized recommendations to customers who use their applications. It uses the same machine learning (ML) technology used by Amazon.com for real-time personalized recommendations – no ML expertise required. This post shows how to use the AWS Step Functions Data Science SDK to automate the process of orchestrating Amazon Personalize workflows such as such as dataset group, dataset, dataset import job, solution, solution version, campaign, and recommendations. This automation of creating a recommender function using Step Functions can improve the CI/CD pipeline of the model training and deployment process.

For additional technical documentation and example notebooks related to the SDK, see Introducing the AWS Step Functions Data Science SDK for Amazon SageMaker.

About the Authors

Neel Sendas is a Senior Technical Account Manager at Amazon Web Services. Neel works with enterprise customers to design, deploy, and scale cloud applications to achieve their business goals. He has worked on various ML use cases, ranging from anomaly detection to predictive product quality for manufacturing and logistics optimization. When he is not helping customers, he dabbles in golf and salsa dancing.

Mona Mona is an AI/ML Specialist Solutions Architect based out of Arlington, VA. She works with the World Wide Public Sector team and helps customers adopt machine learning on a large scale. She is passionate about NLP and ML explain-ability areas in AI/ML.

Swami Sivasubramanian: Machine learning is ‘one of the most disruptive technologies we will encounter in our generation’

January 14, 2021

by admin Amazon AWS

AWS vice president of machine learning delivers first-ever ML keynote at ninth annual conference.Read More

Using machine learning to predict vessel time of arrival with Amazon SageMaker

January 13, 2021

by Samir Araujo Amazon AWS

According to the International Chamber of Shipping, 90% of world commerce happens at sea. Vessels are transporting every possible kind of commodity, including raw materials and semi-finished and finished goods, making ocean transportation a key component of the global supply chain. Manufacturers, retailers, and the end consumer are reliant on hundreds of thousands of ships carrying freight across the globe, delivering their precious cargo at the port of discharge after navigating for days or weeks.

As soon as a vessel arrives at its port of call, off-loading operations begin. Bulk cargo, containers, and vehicles are discharged, depending on the kind of vessel. Complex landside operations are triggered by cargo off-loading, involving multiple actors. Terminal operators, trucking companies, railways, customs, and logistic service providers work together to make sure that goods are delivered according to a specific SLA to the consignee in the most efficient way.

Business problem

Shipping companies publicly advertise their vessels’ estimated time of arrival (ETA) in port, and downstream supply chain activities are planned accordingly. However, delays often occur, and the ETA might differ from the vessel’s actual time of arrival (ATA), for instance due to technical or weather-related issues. This impacts the entire supply chain, in many instances reducing productivity and increasing waste and inefficiencies.

Predicting the exact time a vessel arrives in a port and starts off-loading operations poses remarkable challenges. Today, a majority of companies rely on experience and improvisation to respectively guess ATA and cope with its fluctuations. Very few providers are leveraging machine learning (ML) techniques to scientifically predict ETA and help companies create better planning for their supply chain. In this post, we’ll show how to use Amazon SageMaker, a fully managed service that provides every developer and data scientist with the ability to build, train, and deploy ML models quickly, to predict the arrival time of vessels.

Study

Vessel ETA prediction is a very complex problem. It involves a huge number of variables and a lot of uncertainty. So when you decide to apply a technique like ML on a problem like that, it’s crucial to have a baseline (such as an expert user or a rule-based engine) to compare the performance and understand if your model is good enough.

This work is a study of the challenge of accurately predicting the vessel ETA. It’s not a complete solution, but it can be seen as a reference for you to implement your own sound and complete model, based on your data and expertise. The solution includes the following high-level steps:

Reduce the problem to a single vessel voyage (when the vessel departs from one given port and gets to another).
Explore a temporal dataset.
Identify the spatiotemporal aspects in the checkpoints sent by each vessel.
From a given checkpoint, predict the ETA in days for the vessel to reach the destination port (inside a given vessel voyage).

The following image shows multiple vessel voyages of the same vessel in different colors. A shipping is composed of multiple voyages.

Methodology

Vesseltracker, an AWS customer focused on maritime transportation intelligence, shared with us a sample of the historical data they collect from vessels (checkpoints) and ports (port calls) every day. The checkpoints contain the main characteristics of each vessel, plus their current geoposition, speed, direction, draught, and more. The port calls are the dates and times of each vessel’s arrival or departure.

Because we had to train a ML model to predict continuous values, we decided to experiment with some regression algorithms like XGBoost, Random Forest, and MLP. At the end of the experiments (including hyperparameter optimization), we opted for the Random Forest Regressor, given it gave us a better performance.

To train a regressor, we had to transform the data and prepare one feature to be the label, in this case, the number of days (float) that the vessel takes to get to the destination port.

For feature engineering, it’s important to highlight the following steps:

Identify each vessel voyage in the temporal dataset. Join with the port calls to mark the departure and the arrival checkpoints.
Compute backward from the destination port to the departure port the accumulated time per checkpoint.
Apply any geo-hashing mechanism to encode the GPS (latitude, longitude) and transform it into a useful feature.
Compute the great-circle distance between each sequential pair of geopositions from checkpoints.
Because the vessel changes the speed over time, we need to compute a new feature (called efficiency) that helps the model ponder the vessel displacement (speed and performance) before computing the remaining time.
Use the historical data of all voyages and the great-circle distance between each checkpoint to create an in-memory graph that shows us all the paths and distances between each segment (checkpoints).

With this graph, you can compute the distance between the current position of the vessel and the destination port. This process resulted in a new feature called accum_dist, or accumulated distance. As the feature importance analysis shows, because this feature has a high linear correlation with the target, it has a higher importance to the model.

Amazon SageMaker

We chose Amazon SageMaker to manage the entire pipeline of our study. SageMaker is a fully managed service that provides every developer and data scientist with the ability to build, train, and deploy ML models quickly. SageMaker removes the heavy lifting from each step of the ML process to make it easier to develop high-quality models.

Traditional ML development is a complex, expensive, and iterative process made even harder because there are no integrated tools for the entire ML workflow. You need to stitch together tools and workflows, which is time consuming and error prone. SageMaker solves this challenge by providing all the components used for ML in a single toolset so models get to production faster with much less effort and at lower cost.

Dataset

The following tables show samples of the data we used to create the dataset. The port calls (shown in the first table) are expressed by a few attributes from the vessel, the port identification, and timestamps of the arrival and departure events of a given vessel.

	imo	arrival_date	departure_date	port
0	0	2019-01-07	2019-01-08	GUAM
1	0	2019-01-11	2019-01-12	NAHA
2	0	2019-01-12	2019-01-17	NAHA

Then we have the vessel checkpoints. A checkpoint is a message sent by each vessel at a frequency X (in this case, approximately 1 day) that contains information about the vessel itself and its current status. By joining both tables, we enrich the checkpoints with information about the vessel departure and arrival, which is crucial to enclose all the other checkpoints sent between these two events. The following table is an example of vessel checkpoint data. You can click on the table for an enlarged view.

In the next table, both tables are joined and cleaned, with the geolocation encoded and the accumulated time and distance calculated. This view is for one particular vessel, which shows all the checkpoints that belong to one given vessel voyage. You can click on the table for an enlarged view.

Finally, we have the dataset used to train our model. The first column is the label or the target value our model tries to create a regression. The rest of the columns are the features the decision tree uses during training. You can click on the table for an enlarged view.

Model and results

After preparing the data, it’s time to train our model. We used a technique called k-fold cross validation to create six different combinations of training and validation data (approximately 80% and 20%) to explore the variation of the data as much as possible. With the native support for Scikit-learn on SageMaker, we only had to create a Python script with the training code and share it to a SageMaker Estimator, prepared with the SageMaker Python library. See the following code:

model = RandomForestRegressor(
        n_estimators=est, verbose=0, n_jobs=4, criterion='mse',
        max_leaf_nodes=1500, max_depth=depth, random_state=0
    )

After training our model, we used a metric called R2 to evaluate the model performance. R2 measures the proportion of the variance in the dependent variable that is predictable from the independent variables. A poor model has a low R2 and a useful model has an R2 as close as possible to 1.0 (or 100%).

In this case, we expected a model that predicted values with a high level of correlation with the testing data. With this combination of data preparation, algorithm selection, and hyperparameters optimization and cross validation, our model achieved an R2 score of 0.9473.

This result isn’t bad, but it doesn’t mean that we can’t improve the solution. We can minimize the accumulated error by the model by adding important features to the dataset. These features can help the model better understand all the low-level nuances and conditions from each checkpoint that can cause a delay. Some examples include weather conditions from the geolocation of the vessel, port conditions, accidents, extraordinary events, seasonality, and holidays in the port countries.

Then we have the feature importance (shown in the following graph). It’s a measurement of how strong or important a given feature from the dataset is for the prediction itself. Each feature has a different importance, and we want to keep only those features that are impactful for the model in the dataset.

The graph shows that accumulated distance is the most important feature (which is expected, given the high correlation with the target), followed by efficiency (an artificial feature we created to ponder the impact of the vessel displacement over time). In third place, we have destination port, closer to the encoded geoposition.

You can download the notebooks created for this experiment, to see all the details of the implementation. Click on the links bellow to get them:

Data preparation Notebook: data exploration, data transformation and feature engineering applied to the raw data to create the dataset
Training Notebook: model training using Amazon SageMaker – Scikit-learn Estimator.

You will need Amazon SageMaker to run these notebooks so create a SageMaker Studio Domain in your AWS Account, upload the notebooks to your new environment and run your own experiments!

Summary

The use of ML in predicting vessel time of arrival can substantially increase the accuracy of land-side operations planning and implementation, in comparison to traditional, manual estimation methodologies that are used widely across the industry. We’re working with and shipping companies to improve the accuracy of our model, as well as on add other relevant features. If your company is interested in learning more about our model and how it can be consumed, please reach out to our Head of World Wide Technology for Transportation and Logistics, Michele Sancricca, at msancric@amazon.com.

About the Authors

Samir Araújo is an AI/ML Solutions Architect at AWS. He helps customers creating AI/ML solutions for solving their business challenges, using the AWS platform. He has been working on several AI/ML projects related to Computer Vision, Natural Language Processing, Forecasting, ML at the edge, etc. He likes playing with hardware and automation projects in his free time and he has a particular interest for robotics.

Michele Sancricca is the AWS Worldwide Head of Technology for Transportation and Logistics. Previously, he worked as Head of Supply Chain Products for Amazon Global Mile and led the Digital Transformation Division of Mediterranean Shipping Company. A retired Lieutenant Commander, Michele spent 12 years in the Italian Navy as Telecommunication Officer and Commanding Officer.

Creating high-quality machine learning models for financial services using Amazon SageMaker Autopilot

January 13, 2021

by Sumon Samanta Amazon AWS

Machine learning (ML) is used throughout the financial services industry to perform a wide variety of tasks, such as fraud detection, market surveillance, portfolio optimization, loan solvency prediction, direct marketing, and many others. This breadth of use cases has created a need for lines of business to quickly generate high-quality and performant models that can be produced with little to no code. This reduces the long cycles for taking use cases from concept to production and generates business value. In this post, we explore how to use Amazon SageMaker Autopilot for some common use cases in the financial services industry.

Autopilot automatically generates pipelines, trains and tunes the best ML models for classification or regression tasks on tabular data, while allowing you to maintain full control and visibility. Autopilot enables automatic creation of ML models without requiring any ML experience. Autopilot automatically analyzes the dataset, processes the data into features, and trains multiple optimized ML models.

Data scientists in financial services often work on tasks where the datasets are highly imbalanced (heavily skewed towards examples of one class). Examples of such tasks include credit card fraud (where a very small fraction of the transactions are actually fraudulent) or bankruptcy (only few corporations file for bankruptcy). We demonstrate how Autopilot automatically handles class imbalance without requiring any additional inputs from the user.

Autopilot recently announced the ability to tune models using the Area Under a Curve (AUC) metric in addition to F1 as the objective metric (which is the default objective for binary classification tasks), more specifically the area under the Receiver Operating Characteristic (ROC) curve. In this post, we show how using the AUC as the model evaluation metric for highly imbalanced data allows Autopilot to generate high-quality models.

Our first use case is to detect credit card fraud based on various anonymized attributes. The dataset is highly imbalanced, with over 99% of the transactions being non-fraudulent. Our second use case is to predict bankruptcy of Polish companies [2]. Here, bankruptcy is similarly a binary response variable (will bankrupt = 1, will not bankrupt = 0), with 96% of the companies not becoming bankrupt.

Prerequisites

To reproduce these steps in your own environment, you must complete the following prerequisites:

Create an AWS Identity and Access Management (IAM) role that allows the Amazon SageMaker notebook access to Amazon Simple Storage Service (Amazon S3) for storing data.
Create an Amazon SageMaker notebook instance.
Create an S3 bucket to store the outputs of your machine learning models and any data.

Credit card fraud detection

In fraud detection tasks, companies are interested in maintaining a very low false positive rate while correctly identifying the fraudulent transactions to the greatest extent possible. A false positive can lead to a company canceling or placing a hold on a customers’ card over a legitimate transaction, which leads to a poor customer experience. As a result, accuracy is not the best metric to consider for this problem; better metrics are the AUC and the F1 score.

The following code shows data for a credit card fraud task:

import pandas as pd 
fraud_df = pd.read_csv('creditcard.csv') 
fraud_df.head(5)

You can click on the previous table to expand for better viewing.

Class 0 and class 1 correspond to No Fraud and Fraud accordingly. As we can see, other than Amount, other columns are anonymized. A key differentiator of Autopilot is its ability to process raw data directly, without the need for data processing on the part of data scientists. For example, Autopilot automatically converts categorical features into numerical values, handles missing values (as we show in the second example), and performs simple text preprocessing.

Using the AWS boto3 API or the AWS Command Line Interface (AWS CLI), we upload the data to Amazon S3 in CSV format:

import boto3
s3 = boto3.client('s3')
s3.upload_file(file_name, bucket, object_name=None)

fraud_df = pd.read_csv(<your S3 file location>)

Now, we select all columns except Class as features and Class as target:

X = fraud_df[set(fraud_df.columns) - set(['Class'])]
y = fraud_df['Class']
print (y.value_counts())
0    284315
1       492

The binary label column Class is highly imbalanced, which is a typical occurrence in financial use cases. We can verify how well Autopilot handles this highly imbalanced data.

In the following code, we demonstrate how to configure Autopilot in Jupyter notebooks. We have to provide train and test files, and to set TargetAttributeName as Class, this is the target column (the column we predict):

auto_ml_job_name = 'automl-creditcard-fraud'
import boto3
sm = boto3.client('sagemaker')
import sagemaker  
session = sagemaker.Session()

prefix = 'sagemaker/' + auto_ml_job_name
bucket = session.default_bucket()
training_data = pd.DataFrame(X_train)
training_data['Class'] = list(y_train)
test_data = pd.DataFrame(X_test)

train_file = 'train_data.csv';
training_data.to_csv(train_file, index=False, header=True)
train_data_s3_path = session.upload_data(path=train_file, key_prefix=prefix + "/train")
print('Train data uploaded to: ' + train_data_s3_path)

test_file = 'test_data.csv';
test_data.to_csv(test_file, index=False, header=False)
test_data_s3_path = session.upload_data(path=test_file, key_prefix=prefix + "/test")
print('Test data uploaded to: ' + test_data_s3_path)
input_data_config = [{
      'DataSource': {
        'S3DataSource': {
          'S3DataType': 'S3Prefix',
          'S3Uri': 's3://{}/{}/train'.format(bucket,prefix)
        }
      },
      'TargetAttributeName': 'Class'
    }
  ]

Next, we create the Autopilot job. For this post, we set ProblemType='BinaryClassification' and job_objective='AUC'. If you don’t set these fields, Autopilot automatically determines the type of supervised learning problem by analyzing the data and uses the default metric for that problem type. The default metric for binary classification is F1. We explicitly set these parameters because we want to optimize AUC.

from sagemaker.automl.automl import AutoML
from time import gmtime, strftime, sleep
from sagemaker import get_execution_role

timestamp_suffix = strftime('%d-%H-%M-%S', gmtime())
base_job_name = 'automl-card-fraud' 

target_attribute_name = 'Class'
role = get_execution_role()
automl = AutoML(role=role,
                target_attribute_name=target_attribute_name,
                base_job_name=base_job_name,
                sagemaker_session=session,
                problem_type='BinaryClassification',
                job_objective={'MetricName': 'AUC'},
                max_candidates=100)

For more information about the parameters for job configuration, see create-auto-ml-job.

After the Autopilot job is created, we call the fit() function to run it:

automl.fit(train_file, job_name=base_job_name, wait=False, logs=False)
describe_response = automl.describe_auto_ml_job()
print (describe_response)
job_run_status = describe_response['AutoMLJobStatus']
    
while job_run_status not in ('Failed', 'Completed', 'Stopped'):
    describe_response = automl.describe_auto_ml_job()
    job_run_status = describe_response['AutoMLJobStatus']
    print (job_run_status)
    sleep(30)
print ('completed')

When the job is complete, we can select the best candidate based on the AUC objective metric:

best_candidate = automl.describe_auto_ml_job()['BestCandidate']
best_candidate_name = best_candidate['CandidateName']
print("CandidateName: " + best_candidate_name)
print("FinalAutoMLJobObjectiveMetricName: " + best_candidate['FinalAutoMLJobObjectiveMetric']['MetricName'])
print("FinalAutoMLJobObjectiveMetricValue: " + str(best_candidate['FinalAutoMLJobObjectiveMetric']['Value']))
CandidateName: tuning-job-1-7e8f6c9dffe840a0bf-009-636d28c2
FinalAutoMLJobObjectiveMetricName: validation:auc
FinalAutoMLJobObjectiveMetricValue: 0.9890000224113464

We now create the Autopilot model object using the model artifacts from the Autopilot job in Amazon S3, and the inference container from the best candidate after running the tuning job. In addition to the predicted label, we’re interested in the probability of the prediction—we use this probability later to plot the AUC and precision and recall graphs.

model_name = 'automl-cardfraud-model-' + timestamp_suffix
inference_response_keys = ['predicted_label', 'probability']
model = automl.create_model(name=best_candidate_name,
candidate=best_candidate,inference_response_keys=inference_response_keys)

After the model is created, we can generate inferences for the test set using the following code. During inference time, Autopilot orchestrates deployment of the inference pipeline, including feature engineering and the ML algorithm on the inference machine.

s3_transform_output_path = 's3://{}/{}/inference-results/'.format(bucket, prefix);
output_path = s3_transform_output_path + best_candidate['CandidateName'] +'/'
transformer=model.transformer(instance_count=1, 
                          instance_type='ml.m5.xlarge',
                          assemble_with='Line',
                          output_path=output_path)
transformer.transform(data=test_data_s3_path, split_type='Line', content_type='text/csv', wait=False)

describe_response = sm.describe_transform_job(TransformJobName = transform_job_name)
job_run_status = describe_response['TransformJobStatus']
print (job_run_status)

while job_run_status not in ('Failed', 'Completed', 'Stopped'):
    describe_response = sm.describe_transform_job(TransformJobName = transform_job_name)
    job_run_status = describe_response['TransformJobStatus']
    print (describe_response)
    sleep(30)
print ('transform job completed with status : ' + job_run_status)

Finally, we read the inference and predicted data into a dataframe:

import json
import io
from urllib.parse import urlparse

def get_csv_from_s3(s3uri, file_name):
    parsed_url = urlparse(s3uri)
    bucket_name = parsed_url.netloc
    prefix = parsed_url.path[1:].strip('/')
    s3 = boto3.resource('s3')
    obj = s3.Object(bucket_name, '{}/{}'.format(prefix, file_name))
    return obj.get()["Body"].read().decode('utf-8')    
pred_csv = get_csv_from_s3(transformer.output_path, '{}.out'.format(test_file))
data_auc=pd.read_csv(io.StringIO(pred_csv), header=None)
data_auc.columns= ['label', 'proba']

Model metrics

Common metrics to compare classifiers are the ROC curve and the precision-recall curve. The ROC curve is a plot of the true positive rate against the false positive rate for various thresholds. The higher the prediction quality of the classification model, the more the ROC curve is skewed toward the top left.

The precision-recall curve demonstrates the trade-off between precision and recall, with the best models having a precision-recall curve that is flat initially and drops steeply as the recall approaches 1. The higher the precision and recall, the more the curve is skewed towards the upper right.

To optimize for the F1 score, we simply repeat the steps from earlier, setting the job_objective={'MetricName': 'F1'} and rerunning the Autopilot job. Because the steps are identical, we don’t repeat them in this section. Please note, F1 objective is default for binary classification problems. The following code plots the ROC curve:

import matplotlib.pyplot as plt
colors = ['blue','green']
model_names = ['Objective : AUC','Objective : F1']
models = [data_auc,data_f1]
from sklearn import metrics
for i in range(0,len(models)):
    fpr, tpr, _ = metrics.roc_curve(y_test, models[i]['proba'])
    fpr, tpr, _  = metrics.roc_curve(y_test, models[i]['proba'])
    auc_score = metrics.auc(fpr, tpr)
    plt.plot(fpr, tpr, label=str('Auto Pilot {:.2f} '+ model_names[i]).format(auc_score),color=colors[i]) 
        
plt.xlim([-0.1,1.1])
plt.ylim([-0.1,1.1])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.legend(loc='lower right')
plt.title('ROC Cuve')

The following plot shows the results.

In the preceding AUC ROC plot, Autopilot models provide high AUC when optimizing both objective metrics. We also didn’t select any specific model or tune any hyperparameters; Autopilot did all that heavy lifting for us.

Finally, we plot the precision-recall curves for the trained Autopilot model:

from sklearn.metrics import precision_recall_curve
from sklearn.metrics import f1_score, precision_score, recall_score
from sklearn.metrics import plot_precision_recall_curve
import matplotlib.pyplot as plt
from sklearn import metrics

colors = ['blue','green']
model_names = ['Objective : AUC','Objective : F1']
models = [data_auc,data_f1]

print ('model ', 'F1 ', 'precision ', 'recall ')
for i in range(0,len(models)):
precision, recall, _ = precision_recall_curve(y_test, models[i]['proba'])
print (model_names[i],f1_score(y_test, np.array(models[i]['label'])),precision_score(y_test, models[i]['label']),recall_score(y_test, models[i]['label']) )
plt.plot(recall,precision,color=colors[i],label=model_names[i])

plt.xlabel('Recall')
plt.ylabel('Precision')
plt.title('Precision-Recall Curve')
plt.legend(loc='upper right')
plt.show()

                    F1          precision      recall 
Objective : AUC 0.8164          0.872          0.7676
Objective : F1  0.7968          0.8947         0.7183

The following plot shows the results.

As we can see from the plot, Autopilot models provide good precision and recall, because the graph is heavily skewed toward the top-right corner.

Autopilot outputs

In addition to handling the heavy lifting of building and training the models, Autopilot provides visibility into the steps taken to build the models by generating two notebooks: CandidateDefinitionNotebook and DataExplorationNotebook.

You can use the candidate definition notebook to interactively step through the steps taken by Autopilot to arrive at the best candidate. You can also use this notebook to override various runtime parameters like parallelism, hardware used, algorithms explored, feature engineering scripts, hyperparameter tuning ranges, and more.

You can download the notebook from the following Amazon S3 location:

automl.describe_auto_ml_job()['AutoMLJobArtifacts']['CandidateDefinitionNotebookLocation']

The notebook also outlines the various feature engineering steps taken to build the models. The models are indexed by their model type and the feature engineering pipeline. For example, as shown in the Tuning Job Result Overview, the winning model corresponds to the pipeline dpp1-xgboost:

best_candidate_name = best_candidate['CandidateName']
print(best_candidate). From there if we look at 
print (describe_response)

If we search for ModelDataUrl, we can find Autopilot used dpp1-xgboost 'ModelDataUrl': 's3://sagemaker-us-east-1-<ACCOUNT-NUM>/automl-card-fraud-7/tuning/automl-car-dpp1-xgb/tuning-job-1-7e8f6c9dffe840a0bf-009-636d28c2/output/model.tar.gz'.

dpp1-xgboost is a data transformation strategy that transforms numeric features using RobustImputer. It merges all the generated features and applies RobustPCA followed by RobustStandardScaler. The transformed data is used to tune an XGBoost model.

From the candidate definition notebook, we can also see that Autopilot automatically applied up-weighting to the minority class using scale_pos_weight. This improves prediction quality for imbalanced datasets where the model doesn’t see many examples of the minority class during training. You can change the scale_pos_weight to a different value:

STATIC_HYPERPARAMETERS = {
    'xgboost': {
        'objective': 'binary:logistic',
        'scale_pos_weight': 568.6114285714285,
    },
}

The data exploration notebook generates a report that provides insights about the input dataset, such as the missing values or the data types for the different features:

automl.describe_auto_ml_job()['AutoMLJobArtifacts']['DataExplorationNotebookLocation']

Having described in detail the use of Autopilot to detect credit card fraud, we now briefly discuss a second task: predicting the bankruptcy of companies.

Predicting bankruptcy of Polish companies

For this post, we explore the various economic attributes in the Polish companies bankruptcy data dataset. There are 64 features and a target attribute class. We rename the column class to bankrupt (not bankrupt = 0, bankrupt = 1) for clarity. As noted before, this dataset is also highly imbalanced, with 96% of the data in the non-bankrupt category.

You can click on the previous table to expand for better viewing.

We followed the same process for running and configuring Autopilot as in the credit card fraud use case. However, unlike the credit card fraud dataset, this dataset contains missing values. Because Autopilot automatically handles missing values, we simply pass the raw data to Autopilot.

We don’t repeat the code steps in this section; we merely show the ROC and precision-recall curves. Autopilot again yields high-quality models as evidenced from the AUC, ROC, and precision-recall curves. For bankruptcy prediction, incorrectly predicting false negatives can lead to poor investment decisions, and incorrectly predicting that solvent companies may go bankrupt might lead to missed opportunities.

To boost model performance, Autopilot also automatically up-weights the minority class label, penalizing the model for mis-classifying the minority class during training. The following plot shows the precision-recall curve.

The following plot shows the ROC curve.

As we can see from these plots, for bankruptcy, the AUC objective is slightly better than F1. Autopilot can generate accurate predictions for a complex event like bankruptcy without any specialized manual feature-engineering steps.

Cleaning up

The Autopilot job creates many underlying artifacts, such as dataset splits, preprocessing scripts, and preprocessed data. To avoid incurring costs, delete these resources using the following code:

#s3 = boto3.resource('s3')
#bucket = s3.Bucket(bucket)
 
#job_outputs_prefix = '{}/output/{}'.format(prefix,auto_ml_job_name)
#bucket.objects.filter(Prefix=job_outputs_prefix).delete()

Conclusion

In this post, we demonstrated how to create ML models without any prior knowledge of algorithms using Autopilot. For imbalanced data, which is common in financial services use cases, we showed that using objective metrics such as AUC and F1 along with the automatic minority class up-weighting can lead to high-quality models. Autopilot provides the flexibility of AutoML with the control and detail of a do-it-yourself approach by unveiling the underlying metadata and the code used to preprocess the data and train the models. Importantly, AutoPilot works on datasets of all sizes ranging from few MBs to hundreds of GBs without you having to set up the underlying infrastructure. Finally, note that Amazon SageMaker Studio provides a UI for you to build, train, and deploy models using Autopilot with little to no code. For more information about tuning, training, and deploying Autopilot models, see Create a machine learning model automatically with Amazon SageMaker Autopilot.

References

[1] Yeh, I. C., & Lien, C. H. (2009). The comparisons of data mining techniques for the predictive accuracy of probability of default of credit card clients. Expert Systems with Applications, 36(2), 2473-2480.

[2] Zieba, M., Tomczak, S. K., & Tomczak, J. M. (2016). Ensemble Boosted Trees with Synthetic Features Generation in Application to Bankruptcy Prediction. Expert Systems with Applications.

About the Authors

Sumon Samanta is a Senior Specialist Architect for Global Financial Services at AWS. Previously, he worked as a Quantitative Developer at several investment banks to develop pricing and risk systems.

Stefan Natu is a Sr. Machine Learning Specialist at Amazon Web Services. He is focused on helping financial services customers build end-to-end machine learning solutions on AWS. In his spare time, he enjoys reading machine learning blogs, playing the guitar, and exploring the food scene in New York City.

Ilya Epshteyn is a solutions architect with AWS. He helps customers to innovate on AWS by building highly available, scalable, and secure architectures. He enjoys spending time outdoors and building Lego creations with his kids.

Miroslav Miladinovic is a Software Development Manager at Amazon SageMaker.

Jean Baptiste Faddoul is an Applied Science Manager working on SageMaker Autopilot and Automatic Model Tuning

Yotam Elor is a Senior Applied Scientist at AWS Sagemaker. He works on Sagemaker Autopilot – AWS’s auto ML solution.

How to train procedurally generated game-like environments at scale with Amazon SageMaker RL

January 13, 2021

by Jonathan Chung Amazon AWS

A gym is a toolkit for developing and comparing reinforcement learning algorithms. Procgen Benchmark is a suite of 16 procedurally-generated gym environments designed to benchmark both sample efficiency and generalization in reinforcement learning. These environments are associated with the paper Leveraging Procedural Generation to Benchmark Reinforcement Learning (citation). Compared to Gym Retro, these environments have the following benefits:

Faster – Gym Retro environments are already fast, but Procgen environments can run over four times faster.
Non-deterministic – Gym Retro environments are always the same, so you can memorize a sequence of actions that gets the highest reward. Procgen environments are randomized so this isn’t possible.
Customizable – If you install from source, you can perform experiments where you change the environments, or build your own environments. The environment-specific code for each environment is often less than 300 lines. This is almost impossible with Gym Retro.

This post demonstrates how to use the Amazon SageMaker reinforcement learning starter kit for the NeurIPS 2020 – Procgen competition hosted on AIcrowd. The competition was held from June to November 2020, and results can be found here but you can still try out the solution on your own. Our solution allows participants using AIcrowd’s existing neurips2020-procgen-starter-kit to get started with SageMaker seamlessly without making any algorithmic changes. It also helps you reduce the time and effort required to build your sample-efficient reinforcement learning solutions using homogenous and heteregeneous scaling.

Finally, our solution utilizes Spot Instances to reduce cost. The cost savings with Spot GPU Instances is approximately 70% for GPU instances such as ml.p3.2x and ml.p3.8x when training with a popular state-of-the-art reinforcement learning algorithm, Proximate Policy Optimization, and a multi-layer convolutional neural network as the agent’s policy.

Architecture

As part of the solution, we use the following services:

Amazon Simple Storage Service (Amazon S3) to store datasets
A SageMaker notebook to preprocess and visualize the data, and to train the deep learning model
An Amazon Virtual Private Cloud (Amazon VPC) or AWS VPC
AWS Lambda

SageMaker reinforcement learning uses Ray and RLLib the same as in the starter kit. SageMaker supports distributed reinforcement learning in a single SageMaker ML instance with just a few lines of configuration by using the Ray RLlib library.

A typical SageMaker reinforcement learning job for an actor-critic algorithm uses GPU instances to learn a policy network and CPU instances to collect experiences for faster training at optimized costs. SageMaker allows you to achieve this by spinning up two jobs within the same Amazon VPC, and the communications between the instances are taken care of automatically.

Cost

You can contact AICrowd to get credits to use any AWS service.

You’re responsible for the cost of the AWS services used while running this solution, and should set up a budget alert when you’ve reached 90% of your allotted credits. For more information, see Amazon SageMaker Pricing.

As of September 1, 2020, SageMaker training costs (excluding notebook instances) are as follows:

c5.4xlarge – $0.952 per hour (16 vCPU)
g4dn.4xlarge – $1.686 per hour (1 GPU, 16 vCPU)
p3.2xlarge – $4.284 per hour (1 GPU, 8 vCPU)

Launching the solution

To launch the solution, complete the following steps:

While signed in to your AWS account, choose the following link to create the AWS CloudFormation stack for the Region you want to run your notebook:

You’re redirected to the AWS CloudFormation console to create your stack.

Acknowledge the use of the instance type for your SageMaker notebook and training instance.

Make sure that your AWS account has the limits for required instances. If you need to increase the limits for the instances you want to use, contact AWS Support.

As the final parameter, provide the name of the S3 bucket for the solution.

The default name is neurips-2020. You should provide a unique name to make sure there are no conflicts with your existing S3 buckets. An S3 bucket name is globally unique, and the namespace is shared by all AWS accounts. This means that after a bucket is created, the name of that bucket can’t be used by another AWS account in any AWS Region until the bucket is deleted.

Choose Create stack.

You can monitor the progress of your stack by choosing the Event tab or refreshing your screen. If you encounter any error during stack creation (such as confirming again that your S3 bucket name is unique), you can delete the stack and launch it again. When stack creation is complete, go to the SageMaker console. Your notebook should already be created and its status should read InService.

You’re now ready to start training!

On the SageMaker console, choose Notebook instances.
Locate the rl-procgen-neurips instance and choose Open Jupyter or Open JupyterLab.
Choose the notebook 1_train.ipynb.

You can use the cells following the training to run evaluations, do rollouts, and visualize your outcome.

Configuring algorithm parameters and the agent’s neural network model

To configure your RLLib algorithm parameters, go to your notebook folder and open source/train-sagemaker-distributed-{}.py. A subset of algorithm parameters are provided for PPO, but for the full set of algorithm-specific parameters, see Proximal Policy Optimization (PPO). For baselines provided in the starter kit, refer to experiments{}.yaml files and copy additional parameters to the RLLib configuration parameters in the source/train-sagemaker-distributed-.py.

To check whether your model is using the correct parameters, go to the S3 bucket and navigate to the JSON file with the parameters. For example, {Amazon SageMaker training job} >output>intermediate>training>{PPO_procgen_env_wrapper_}>param.json.

To add a custom model, create a file inside the models/ directory and name it models/my_vision_network.py.

For a working implementation of how to add a custom model, see the GitHub repo. You can set the custom_model field in the experiment .yaml file to my_vision_network to use that model.

Make sure that the model is registered. If you get an error that your model isn’t registered, go to train-sagmaker.py or train-sagmaker-distributed.py and edit def register_algorithms_and_preprocessors(self) by adding the following code:

ModelCatalog.register_custom_model("impala_cnn_tf", ImpalaCNN)

Distributed training with multiple instances

SageMaker supports distributed reinforcement learning in a single SageMaker ML instance with just a few lines of configuration by using the Ray RLlib library.

In homogeneous scaling, you use multiple instances with the same type (typically CPU instances) for a single SageMaker job. A single CPU core is reserved for the driver, and you can use the remaining as rollout workers, which generate experiences through environmental simulations. The number of available CPU cores increases with multiple instances. Homogeneous scaling is beneficial when experience collection is the bottleneck of the training workflow; for example, when your environment is computationally heavy.

With more rollout workers, neural network updates can often become the bottleneck. In this case, you could use heterogeneous scaling, in which you use different instance types together. A typical choice is to use GPU instances to perform network optimization and CPU instances to collect experiences for faster training at optimized costs. SageMaker allows you to achieve this by spinning up two jobs within the same Amazon VPC, and the communications between the instances are taken care of automatically.

To run distributed training with multiple instances, use 2_train-homo-distributed-cpu.ipynb / 3_train-homo-distributed-gpu.ipynb and train-hetero-distributed.ipynb for homogenous and heterogenous scaling, respectively. The configurable parameters for distributed training are stored in source/train-sagemaker-distributed.py. You don’t have to configure ray_num_cpus or ray_num_gpus.

Make sure you scale num_workers and train_batch_size to reflect the number of instances in the notebook. For example, if you set train_instance_count = 5 for a p3.2xlarge instance, the maximum number of workers is 39. See the following code:

"num_workers": 8*5 -1, # adjust based on total number of CPUs available in the cluster, e.g., p3.2xlarge has 8 CPUs and 1 CPU is reserved for resource allocation
  "num_gpus": 0.2, # adjust based on number of GPUs available in a single node, e.g., p3.2xlarge has 1 GPU
  "num_gpus_per_worker": 0.1, # adjust based on number of GPUs, e.g., p3.2x large (1 GPU - num_gpus) / num_workers = 0.1
  "rollout_fragment_length": 140,
  "train_batch_size": 64 * (8*5 -1),

To use a Spot Instance, you need to set the flag train_use_spot_instances = True in the final cell of train-homo-distributed.ipynb or train-hetero-distributed.ipynb. You can also use the MaxWaitTimeInSeconds parameter to control the total duration of your training job (actual training time plus waiting time).

Summary

We compared our starter kit with three different GPU instances (ml.g4n.4x, ml.p3.2x, and ml.p3.8x) using single and multiple instances. On all GPU instances, our Spot Instance training provided a 70% cost reduction. This means that you spend less than $1 with the starter kit hyperparameters for the competition’s benchmarking solution, such as, 8 MM steps, using an ml.p3.2x instance.

Our starter kit allows you to run multiple instances of ml.p3.2x with 1 GPU versus a single instance of ml.p3.8x with 4 GPUs. We observed that running a single instance of ml.p3.8x with 4 GPUs is more cost-effective than running five instances of ml.p3.2x (=5 GPUs) due to communication overhead. The single instance training with ml.p3.8x converges in 20 minutes, helping you iterate faster to meet the competition deadline.

Finally, we observed that ml.g4n.4x instances provide an additional 40% cost reduction over 70% reduction from Spot Instances. However, it takes longer to train: 45 minutes with ml.p3.8x versus 70 minutes with ml.g4n.4x.

To get started with this solution, sign in to your AWS account and choose the quick create to launch the CloudFormation stack for the Region you want to run your notebook.

About the Authors

Jonathan Chung is an Applied scientist in AWS. He works on applying deep learning to various applications including games and document analysis. He enjoys cooking and visiting historical cities around the world.

Anna Luo is an Applied Scientist in AWS. She works on utilizing reinforcement learning techniques for different domains including supply chain and recommender system. Her current personal goal is to master snowboarding.

Sahika Genc is a Principal Applied Scientist in the AWS AI team. Her current research focus is deep reinforcement learning (RL) for smart automation and robotics. Previously, she was a senior research scientist in the Artificial Intelligence and Learning Laboratory at the General Electric (GE) Global Research Center, where she led science teams on healthcare analytics for patient monitoring.