Building your own brand detection and visibility using Amazon SageMaker Ground Truth and Amazon Rekognition Custom Labels – Part 1: End-to-end solution

According to Gartner, 58% of marketing leaders believe brand is a critical driver of buyer behavior for prospects, and 65% believe it’s a critical driver of buyer behavior for existing customers. Companies spend huge amounts of money on advertising to raise brand visibility and awareness. In fact, according to Gartner, CMOs spend over 21% of their marketing budgets on advertising. Brands have to continuously maintain and improve their image, understand their presence on the web and in media content, and measure the effectiveness of their marketing efforts. All of these are top priorities for every marketer. However, calculating the ROI of such advertising can be augmented with artificial intelligence (AI)-powered tools that deliver more accurate results.

Nowadays, brand owners are naturally more than a little interested in finding out how effectively their outlays are working for them. However, it’s difficult to quantitatively assess just how much brand exposure a given campaign or event delivers. The current approach to computing such statistics involves manually annotating broadcast material, which is time-consuming and expensive.

In this post, we show you how to mitigate these challenges by using Amazon Rekognition Custom Labels to train a custom computer vision model to detect brand logos without requiring machine learning (ML) expertise, and Amazon SageMaker Ground Truth to quickly build a training dataset from unlabeled video samples that can be used for training.

For this use case, we want to build a company brand detection and brand visibility application that allows you to submit a video sample for a given marketing event to evaluate how long your logo was displayed in the entire video and where in the video frame the logo was detected.

Solution overview

Amazon Rekognition Custom Labels is an automated ML (AutoML) feature that enables you to train custom ML models for image analysis without requiring ML expertise. Upload a small dataset of labeled images specific to your business use case, and Amazon Rekognition Custom Labels takes care of the heavy lifting of inspecting the data, selecting an ML algorithm, training a model, and calculating performance metrics.

No ML expertise is required to build your own model. The ease of use and intuitive setup of Amazon Rekognition Custom Labels allows any user to bring their own dataset for their use case, label it into separate folders, and launch the Amazon Rekognition Custom Labels training and validation.

The solution is built on a serverless architecture, which means you don’t have to provision your own servers. You pay for what you use. As demand grows or decreases, the compute power adapts accordingly.

This solution demonstrates an end-to-end workflow from preparing a training dataset using Ground Truth to training a model using Amazon Rekognition Custom Labels to identify and detect brand logos in video files. The solution has three main components: data labeling, model training, and running inference.

Data labeling

Three types of labeling workforces are available with Ground Truth:

  • Amazon Mechanical Turk – An option to engage a team of global, on-demand workers
  • Vendors – Third-party data labeling services listed on AWS Marketplace
  • Private labelers – Your own teams of private labelers to label the brand logo impressions frame by frame from the video

For this post, we use the private workforce option.

Training the model using Amazon Rekognition Custom Labels

After the labeling job is complete, we train our brand logo detection model using these labeled images. The solution in this post creates an Amazon Rekognition Custom Labels project and custom model. Amazon Rekognition Custom Labels automatically inspects the labeled data provided, selects the right ML algorithms and techniques, trains a model, and provides model performance metrics.

Running inference

When our model is trained, Amazon Rekognition Custom Labels provides an inference endpoint. We can then upload and analyze images or video files using the inference endpoint. The web user interface presents a bar chart that shows the distribution of detected custom labels per minute in the analyzed video.
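
If you prefer to call the inference endpoint programmatically rather than through the web UI, the following minimal Python (boto3) sketch shows the general pattern; the project version ARN, bucket, frame key, and confidence threshold are placeholders you would replace with your own values.

import boto3

rekognition = boto3.client("rekognition")

# Placeholder values -- replace with your trained model and an extracted video frame
PROJECT_VERSION_ARN = "arn:aws:rekognition:us-east-1:111122223333:project/DemoProject/version/DemoProject.v1/1234567890123"
BUCKET = "my-frames-bucket"
FRAME_KEY = "frames/frame-000123.jpg"

# Detect the custom labels (brand logos) in a single frame stored in Amazon S3
response = rekognition.detect_custom_labels(
    ProjectVersionArn=PROJECT_VERSION_ARN,
    Image={"S3Object": {"Bucket": BUCKET, "Name": FRAME_KEY}},
    MinConfidence=60,
)

for label in response["CustomLabels"]:
    box = label.get("Geometry", {}).get("BoundingBox", {})
    print(label["Name"], round(label["Confidence"], 1), box)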

Architecture overview

The solution uses a serverless architecture. The following architectural diagram illustrates an overview of the solution.

The solution is composed of two AWS Step Functions state machines:

  • Training – Manages extracting image frames from uploaded videos, creating and waiting on a Ground Truth labeling job, and creating and training an Amazon Rekognition Custom Labels model. You can then use the model to run brand logo detection analysis.
  • Analysis – Handles analyzing video or image files. It manages extracting image frames from the video files, starting the custom label model, running the inference, and shutting down the custom label model.

The solution provides a built-in mechanism to manage your custom label model runtime, ensuring the model is shut down when it’s not needed to keep your cost to a minimum.
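
The state machines handle this for you, but the underlying Amazon Rekognition Custom Labels calls look roughly like the following Python (boto3) sketch, which starts the model before analysis and stops it afterward (the project and version identifiers are placeholders):

import boto3

rekognition = boto3.client("rekognition")

# Placeholders -- replace with your own project and model version
PROJECT_ARN = "arn:aws:rekognition:us-east-1:111122223333:project/DemoProject/1234567890123"
PROJECT_VERSION_ARN = "arn:aws:rekognition:us-east-1:111122223333:project/DemoProject/version/DemoProject.v1/1234567890123"

# Start the model (you're billed while the model is running)
rekognition.start_project_version(ProjectVersionArn=PROJECT_VERSION_ARN, MinInferenceUnits=1)

# Wait until the model reports RUNNING before calling DetectCustomLabels
waiter = rekognition.get_waiter("project_version_running")
waiter.wait(ProjectArn=PROJECT_ARN, VersionNames=["DemoProject.v1"])

# ... run DetectCustomLabels on the extracted frames here ...

# Stop the model when the analysis is done to avoid unnecessary cost
rekognition.stop_project_version(ProjectVersionArn=PROJECT_VERSION_ARN)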

The web application communicates with the backend state machines using an Amazon API Gateway RESTful endpoint. The endpoint is protected and requires a valid AWS Identity and Access Management (IAM) credential. Authentication to the web application is done through an Amazon Cognito user pool, where an authenticated user is issued a secure, time-bounded, temporary credential that can then be used to access “scoped” resources such as uploading video and image files to an Amazon Simple Storage Service (Amazon S3) bucket, invoking the API Gateway RESTful API endpoint to create a new training project, or running inference with the Amazon Rekognition Custom Labels model you trained and built. We use Amazon CloudFront to host the static content residing in an S3 bucket (web), which is protected through an origin access identity (OAI).

Prerequisites

For this walkthrough, you should have an AWS account with appropriate IAM permissions to launch the provided AWS CloudFormation template.

Deploying the solution

You can deploy the solution using a CloudFormation template with AWS Lambda-backed custom resources. To deploy the solution, use one of the following CloudFormation templates and follow the instructions:

AWS Region CloudFormation Template URL
US East (N. Virginia)
US East (Ohio)
US West (Oregon)
EU (Ireland)
  1. Sign in to the AWS Management Console with your IAM user name and password.
  2. On the Create stack page, choose Next.
  3. On the Specify stack details page, for FFmpeg Component, choose AGREE AND INSTALL.
  4. For Email, enter a valid email address to use for administrative purposes.
  5. For Price Class, choose Use Only U.S., Canada and Europe [PriceClass_100].
  6. Choose Next.
  7. On the Review stack page, under Capabilities, select both check boxes.
  8. Choose Create stack.

The stack creation takes roughly 25 minutes to complete; we’re using AWS CodeBuild to dynamically build the FFmpeg component, and the Amazon CloudFront distribution takes about 15 minutes to propagate to the edge locations.

After the stack is created, you should receive an invitation email from no-reply@verificationmail.com. The email contains a CloudFront URL link to access the demo portal, your login username, and a temporary password.

Choose the URL link to open the web portal with Mozilla Firefox or Google Chrome. After you enter your user name and temporary credentials, you’re prompted to create a new password.

Solution walkthrough

In this section, we walk you through the following high-level steps:

  1. Setting up a labeling team.
  2. Creating and training your model.
  3. Completing a video object detection labeling job.

Setting up a labeling team

The first time you log in to the web portal, you’re prompted to create your labeling workforce. The labeling workforce defines members of your labelers who are given labeling tasks to work on when you start a new training project. Choose Yes to configure your labeling team members.

You can also navigate to the Labeling Team tab to manage members from your labeling team at any time.

Follow the instructions to add an email and choose Confirm and add members. See the following animation to walk you through the steps.

The newly added member receives two email notifications. The first email contains the credential for the labeler to access the labeling portal. It’s important to note that the labeler is only given access to consume a labeling job created by Ground Truth. They don’t have access to any AWS resources other than working on the labeling job.

A second email “AWS Notification – Subscription Confirmation” contains instructions to confirm your subscription to an Amazon Simple Notification Service (Amazon SNS) topic so the labeler gets notified whenever a new labeling task is ready to consume.

Creating and training your first model

Let’s start to train our first model to identify logos for AWS and AWS DeepRacer. For this post, we use the video file AWS DeepRacer TV – Ep 1 Amsterdam.

  1. On the navigation menu, choose Training.
  2. Choose Option 1 to train a model to identify the logos with bounding boxes.
  3. For Project name, enter DemoProject.
  4. Choose Add label.
  5. Add the labels AWS and DeepRacer.
  6. Drag and drop a video file to the drop area.

You can drop multiple video files or JPEG and PNG image files.

  7. Choose Create project.

The following GIF animation illustrates the process.

At this point, Ground Truth creates a labeling job, and the labeler receives an email notification when the job is ready to consume.

Completing the video object detection labeling job

Ground Truth recently launched a new set of pre-built templates that help label video files. For our post, we use the video object detection task template. For more information, see New – Label Videos with Amazon SageMaker Ground Truth.

The training workflow is currently paused, waiting for the labelers to work on the labeling job.

  1. After the labeler receives an email notification that a job is ready for them, they can log in to the labeling portal and start the job by choosing Start working.
  2. For Label category, choose a label.
  3. Draw bounding boxes around the AWS or AWS DeepRacer logos.

You can use the Predict next button to predict the bounding box in subsequent frames.

The following GIF animation demonstrates the labeling flow.

After the labeler completes the job, the backend training workflow resumes and collects the labeled images from the Ground Truth labeling job and starts the model training by creating an Amazon Rekognition Custom Labels project. The time to train a model varies from an hour to a few hours depending on the complexity of the objects (labels) and the size of your training dataset. Amazon Rekognition Custom Labels automatically splits the dataset 80/20 to create the training dataset and test dataset, respectively.
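
Under the hood, this step corresponds roughly to the following Amazon Rekognition Custom Labels calls. The Python (boto3) sketch below is illustrative; the bucket names and the Ground Truth output manifest location are placeholders that the solution derives from the labeling job for you.

import boto3

rekognition = boto3.client("rekognition")

# Placeholders -- in the solution these come from the Ground Truth labeling job output
TRAINING_MANIFEST = {"S3Object": {"Bucket": "my-labels-bucket", "Name": "output/manifests/output.manifest"}}
OUTPUT_BUCKET = "my-training-output-bucket"

# Create the project and start training; TestingData AutoCreate lets the service
# split the labeled dataset into training and test sets automatically
project = rekognition.create_project(ProjectName="DemoProject")
rekognition.create_project_version(
    ProjectArn=project["ProjectArn"],
    VersionName="DemoProject.v1",
    OutputConfig={"S3Bucket": OUTPUT_BUCKET, "S3KeyPrefix": "evaluation"},
    TrainingData={"Assets": [{"GroundTruthManifest": TRAINING_MANIFEST}]},
    TestingData={"AutoCreate": True},
)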

Running inference to detect brand logos

After the model is trained, let’s upload a video file and run predictions with the model we trained.

  1. On the navigation menu, choose Analysis.
  2. Choose Start new analysis.
  3. Specify the following:
    1. Project name – The project we created in Amazon Rekognition Custom Labels.
    2. Project version – The specific version of the trained model.
    3. Inference units – Your desired number of inference units, so you can dial the inference endpoint capacity up or down. For example, if you require higher transactions per second (TPS), use a larger number of inference units.
  4. Drag and drop image files (JPEG, PNG) or video files (MP4, MOV) to the drop area.
  5. When the upload is complete, choose Done and wait for the analysis process to finish.

The analysis workflow starts the trained Amazon Rekognition Custom Labels model, waits for it to be ready, runs inference frame by frame, and shuts the model down when it’s no longer in use.

The following GIF animation demonstrates the analysis flow.

Viewing prediction results

The solution provides an overall statistic of the detected brands distributed across the video. The following screenshot shows that the AWS DeepRacer logo is detected in about 25% of the video overall and in approximately 60% of the 00:01:00–00:02:00 timespan. In contrast, the AWS logo is detected at a much lower rate. For this post, we only used one video to train the model, and it contained relatively few AWS logos. We can improve the accuracy by retraining the model with more video files.
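
The per-minute distribution in the bar chart is straightforward to compute once you have frame-level detections. The following short Python sketch illustrates the idea, assuming a hypothetical list of (timestamp in seconds, detected label name) tuples produced by the analysis step.

from collections import Counter, defaultdict

# Hypothetical frame-level detections: (timestamp in seconds, label name)
detections = [(62.0, "DeepRacer"), (62.5, "DeepRacer"), (63.0, "AWS"), (121.0, "DeepRacer")]

frames_per_minute = defaultdict(Counter)
for timestamp, label in detections:
    minute = int(timestamp // 60)
    frames_per_minute[minute][label] += 1

for minute in sorted(frames_per_minute):
    print(f"{minute:02d}:00-{minute:02d}:59", dict(frames_per_minute[minute]))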

You can expand the shot element view to see how the brand logos are detected frame by frame.

If you choose a frame to view, it shows the logo with a confidence score. The images that are grayed out are the ones in which no logo was detected. The following image shows that the AWS DeepRacer logo is detected at frame #10237 with a confidence score of 82%.

Another image shows that the AWS logo is detected with a confidence score of 60%.

Cleaning up

To delete the demo solution, simply delete the CloudFormation stack that you deployed earlier. However, deleting the CloudFormation stack doesn’t remove the following resources, which you must clean up manually to avoid potential recurring costs:

  • S3 buckets (web, source, and logs)
  • Amazon Rekognition Custom Labels project (trained model)

Conclusion

This post demonstrated how to use Amazon Rekognition Custom Labels to detect brand logos in images and videos. No ML expertise is required to build your own model. The ease of use and intuitive setup of Amazon Rekognition Custom Labels allows you to bring your own dataset, label it into separate folders, and launch the Amazon Rekognition Custom Labels training and validation. We created the required infrastructure, demonstrated installing and running the UI, and discussed the security and cost of the infrastructure.

In the second post in this series, we deep dive into data labeling from a video file using Ground Truth to prepare the data for the training phase. We also explain the technical details of how we use Amazon Rekognition Custom Labels to train the model. In the third post in this series, we dive into the inference phase and show you the rich set of statistics for your brand visibility in a given video file.

For more information about the code sample in this post, see the GitHub repo.


About the Authors

Ken Shek is a Global Vertical Solutions Architect for Media and Entertainment in the EMEA region. He helps media customers design, develop, and deploy workloads onto the AWS Cloud using AWS best practices. Ken graduated from the University of California, Berkeley, and received his master’s degree in Computer Science from Northwestern Polytechnical University.

Amit Mukherjee is a Sr. Partner Solutions Architect with a focus on data analytics and AI/ML. He works with AWS Partners and customers to provide them with architectural guidance for building highly secure, scalable data analytics platforms and adopting machine learning at scale.

Sameer Goel is a Solutions Architect in Seattle who drives customers’ success by building prototypes on cutting-edge initiatives. Prior to joining AWS, Sameer graduated with a master’s degree from Northeastern University in Boston, with a concentration in data science. He enjoys building and experimenting with creative projects and applications.

Model serving made easier with Deep Java Library and AWS Lambda

Developing and deploying a deep learning model involves many steps: gathering and cleansing data, designing the model, fine-tuning model parameters, evaluating the results, and going through it again until a desirable result is achieved. Then comes the final step: deploying the model.

AWS Lambda is one of the most cost-effective services that let you run code without provisioning or managing servers. It offers many advantages when working with serverless infrastructure. When you break down the logic of your deep learning service into a single Lambda function for a single request, things become much simpler and easier to scale. You can forget all about the resource handling needed for the parallel requests coming into your model. If your usage is sparse and tolerant of higher latency, Lambda is a great choice among the various solutions.

Now, let’s say you’ve decided to use Lambda to deploy your model. You go through the process, but it becomes confusing or complex because of the various setup steps needed to run your models. Namely, you face issues with the Lambda size limits and with managing the model dependencies within it.

Deep Java Library (DJL) is a deep learning framework designed to make your life easier. DJL uses various deep learning backends (such as Apache MXNet, PyTorch, and TensorFlow) for your use case and is easy to set up and integrate within your Java application! Thanks to its excellent dependency management design, DJL makes it extremely simple to create a project that you can deploy on Lambda. DJL helps alleviate some of the problems we mentioned by downloading the prepackaged framework dependencies so you don’t have to package them yourself, and loads your models from a specified location such as Amazon Simple Storage Service (Amazon S3) so you don’t need to figure out how to push your models to Lambda.

This post covers how to get your models running on Lambda with DJL in 5 minutes.

About DJL

Deep Java Library (DJL) is a deep learning framework written in Java, supporting both training and inference. DJL is built on top of modern deep learning engines (such as TensorFlow, PyTorch, and MXNet). You can easily use DJL to train your model or deploy your favorite models from a variety of engines without any additional conversion. It contains a powerful model zoo design that allows you to manage trained models and load them in a single line. The built-in model zoo currently supports more than 70 pre-trained and ready-to-use models from GluonCV, HuggingFace, TorchHub, and Keras.

Prerequisites

You need the following items to proceed:

In this post, we follow along with the steps from the following GitHub repo.

Building and deploying on AWS

First, we need to ensure we’re in the correct code directory. We need to create an S3 bucket for storage, an AWS CloudFormation stack, and the Lambda function with the following code:

cd lambda-model-serving
./gradlew deploy

This creates the following:

  • An S3 bucket with the name stored in bucket-name.txt
  • A CloudFormation stack named djl-lambda and a template file named out.yml
  • A Lambda function named DJL-Lambda

Now we have our model deployed on a serverless API. The next section invokes the Lambda function.

Invoking the Lambda function

We can invoke the Lambda function with the following code:

aws lambda invoke --function-name DJL-Lambda --payload '{"inputImageUrl":"https://djl-ai.s3.amazonaws.com/resources/images/kitten.jpg"}' build/output.json

The output is stored in build/output.json:

cat build/output.json
[
  {
    "className": "n02123045 tabby, tabby cat",
    "probability": 0.48384541273117065
  },
  {
    "className": "n02123159 tiger cat",
    "probability": 0.20599405467510223
  },
  {
    "className": "n02124075 Egyptian cat",
    "probability": 0.18810519576072693
  },
  {
    "className": "n02123394 Persian cat",
    "probability": 0.06411759555339813
  },
  {
    "className": "n02127052 lynx, catamount",
    "probability": 0.01021555159240961
  }
]
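
If you prefer to invoke the function from Python rather than the AWS CLI, a roughly equivalent boto3 sketch looks like the following; the function name and payload simply mirror the CLI call above.

import json
import boto3

lambda_client = boto3.client("lambda")

# Same payload as the CLI example above
payload = {"inputImageUrl": "https://djl-ai.s3.amazonaws.com/resources/images/kitten.jpg"}

response = lambda_client.invoke(
    FunctionName="DJL-Lambda",
    Payload=json.dumps(payload),
)

# The classification results come back in the response payload
print(json.loads(response["Payload"].read()))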

Cleaning up

Use the cleanup scripts to clean up the resources and tear down the services created in your AWS account:

./cleanup.sh

Cost analysis

What happens if we try to set this up on an Amazon Elastic Compute Cloud (Amazon EC2) instance and compare the cost to Lambda? An EC2 instance needs to run continuously so that it can receive requests at any time, which means you’re paying for the idle time when it’s not in use. If we use a cheap t3.micro instance with 2 vCPUs and 1 GB of memory (knowing that some of this memory is used by the operating system and other tasks), the cost comes out to $7.48 a month, roughly the cost of 1.6 million Lambda requests. A more powerful instance such as t3.small with 2 vCPUs and 4 GB of memory comes out to $29.95 a month, roughly the cost of 2.57 million Lambda requests.
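
To make the comparison concrete, the following small Python sketch derives the implied break-even cost per million Lambda requests from the figures quoted above. It's illustrative only; actual Lambda pricing depends on the memory size and duration you configure.

# Monthly EC2 cost and the roughly equivalent number of Lambda requests, from the text above
comparisons = {
    "t3.micro": (7.48, 1_600_000),
    "t3.small": (29.95, 2_570_000),
}

for instance, (monthly_cost, lambda_requests) in comparisons.items():
    cost_per_million = monthly_cost / (lambda_requests / 1_000_000)
    print(f"{instance}: ~${cost_per_million:.2f} per million Lambda requests at break-even")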

There are pros and cons to using either Lambda or Amazon EC2 for hosting, and it comes down to requirements and cost. Lambda is the ideal choice if your requirements allow for sparse usage and higher latency, because of Lambda’s cold startup (roughly 5 seconds) when it isn’t used frequently; it’s cheaper than Amazon EC2 if you aren’t using it much, although the first call can be slow. Subsequent requests become faster, but if Lambda sits idle for 30–45 minutes, it goes back to cold-start mode.

Amazon EC2, on the other hand, is better if you require low-latency calls all the time or are making enough requests that Lambda would cost more (shown in the following chart).

Minimal package size

DJL automatically downloads the deep learning framework at runtime, allowing for a smaller package size. We use the following dependency:

runtimeOnly "ai.djl.mxnet:mxnet-native-auto:1.7.0-backport"

This auto-detection dependency results in a .zip file less than 3 MB. The downloaded MXNet native library file is stored in a /tmp folder that takes up about 155 MB of space. We can further reduce this to 50 MB if we use a custom build of MXNet without MKL support.

The MXNet native library is stored in an S3 bucket, and the framework download latency is negligible when compared to the Lambda startup time.

Model loading

The DJL model zoo offers many easy options to deploy models:

  • Bundling the model in a .zip file
  • Loading models from a custom model zoo
  • Loading models from an S3 bucket (supports Amazon SageMaker trained model .tar.gz format)

We use the MXNet model zoo to load the model. By default, because we didn’t specify any model, it uses the resnet-18 model, but you can change this by passing in an artifactId parameter in the request:

aws lambda invoke --function-name DJL-Lambda --payload '{"artifactId": "squeezenet", "inputImageUrl":"https://djl-ai.s3.amazonaws.com/resources/images/kitten.jpg"}' build/output.json

Limitations

There are certain limitations when using serverless APIs, specifically in AWS Lambda:

  • GPU instances are not yet available, as of this writing
  • Lambda has a 512 MB limit for the /tmp folder
  • If the endpoint isn’t frequently used, cold startup can be slow

As mentioned earlier, this way of hosting your models on Lambda is ideal when requests are sparse and the requirement allows for higher latency calls due to the Lambda cold startup. If your requirements require low latency for all requests, we recommend using AWS Elastic Beanstalk with EC2 instances.

Conclusion

In this post, we demonstrated how to easily launch serverless APIs using DJL. To do so, we just need to run the gradle deployment command, which creates the S3 bucket, CloudFormation stack, and Lambda function. This creates an endpoint to accept parameters to run your own deep learning models.

Deploying your models with DJL on Lambda is a great and cost-effective method if your usage is sparse and your requirements allow for the higher latency of Lambda cold starts. Using DJL allows your team to focus more on designing, building, and improving your ML models, while keeping costs low and keeping the deployment process easy and scalable.

For more information on DJL and its other features, see Deep Java Library.

Follow our GitHub repo, demo repository, Slack channel, and Twitter for more documentation and examples of DJL!


About the Author

Frank Liu is a Software Engineer for AWS Deep Learning. He focuses on building innovative deep learning tools for software engineers and scientists. In his spare time, he enjoys hiking with friends and family.

Multi-account model deployment with Amazon SageMaker Pipelines

Amazon SageMaker Pipelines is the first purpose-built CI/CD service for machine learning (ML). It helps you build, automate, manage, and scale end-to-end ML workflows and apply DevOps best practices of CI/CD to ML (also known as MLOps).

Creating multiple accounts to organize all the resources of your organization is a good DevOps practice. A multi-account strategy is important not only to improve governance but also to increase security and control of the resources that support your organization’s business. This strategy allows many different teams inside your organization to experiment, innovate, and integrate faster, while keeping the production environment safe and available for your customers.

Pipelines makes it easy to apply the same strategy to deploying ML models. Imagine a use case in which you have three different AWS accounts, one for each environment: data science, staging, and production. The data scientist has the freedom to run experiments and train and optimize different models any time in their own account. When a model is good enough to be deployed in production, the data scientist just needs to flip the model approval status to Approved. After that, an automated process deploys the model on the staging account. Here you can automate testing of the model with unit tests or integration tests or test the model manually. After a manual or automated approval, the model is deployed to the production account, which is a more tightly controlled environment used to serve inferences on real-world data. With Pipelines, you can implement a ready-to-use multi-account environment.

In this post, you learn how to use Pipelines to implement your own multi-account ML pipeline. First, you learn how to configure your environment and prepare it to use a predefined template as a SageMaker project for training and deploying a model in two different accounts: staging and production. Then, you see in detail how this custom template was created and how to create and customize templates for your own SageMaker projects.

Preparing the environment

In this section, you configure three different AWS accounts and use SageMaker Studio to create a project that integrates a CI/CD pipeline with the ML pipeline created by a data scientist. The following diagram shows the reference architecture of the environment that is created by the SageMaker custom project and how AWS Organizations integrates the different accounts.

The diagram contains three different accounts, managed by Organizations. Also, three different user roles (which may be the same person) operate this environment:

  • ML engineer – Responsible for provisioning the SageMaker Studio project that creates the CI/CD pipeline, model registry, and other resources
  • Data scientist – Responsible for creating the ML pipeline that ends with a trained model registered to the model group (also referred to as model package group)
  • Approver – Responsible for testing the model deployed to the staging account and approving the production deployment

It’s possible to run a similar solution without Organizations, if you prefer (although not recommended). But you need to prepare the permissions and the trust relationship between your accounts manually and modify the template to remove the Organizations dependency. Also, if you’re an enterprise with multiple AWS accounts and teams, it’s highly recommended that you use AWS Control Tower for provisioning the accounts and Organizations. AWS Control Tower provides the easiest way to set up and govern a new and secure multi-account AWS environment. For this post, we only discuss implementing the solution with Organizations.

But before you move on, you need to complete the following steps, which are detailed in the next sections:

  1. Create an AWS account to be used by the data scientists (data science account).
  2. Create and configure a SageMaker Studio domain in the data science account.
  3. Create two additional accounts for production and staging.
  4. Create an organizational structure using Organizations, then invite and integrate the additional accounts.
  5. Configure the permissions required to run the pipelines and deploy models on external accounts.
  6. Import the SageMaker project template for deploying models in multiple accounts and make it available for SageMaker Studio.

Configuring SageMaker Studio in your account

Pipelines provides built-in support for MLOps templates to make it easy for you to use CI/CD for your ML projects. These MLOps templates are defined as Amazon CloudFormation templates and published via AWS Service Catalog. These are made available to data scientists via SageMaker Studio, an IDE for ML. To configure Studio in your account, complete the following steps:

  1. Prepare your SageMaker Studio domain.
  2. Enable SageMaker project templates and SageMaker JumpStart for this account and Studio users.

If you have an existing domain, you can simply edit the settings for the domain or individual users to enable this option. Enabling this option creates two different AWS Identity and Access Management (IAM) roles in your AWS account:

  • AmazonSageMakerServiceCatalogProductsLaunchRole – Used by SageMaker to run the project templates and create the required infrastructure resources
  • AmazonSageMakerServiceCatalogProductsUseRole – Used by the CI/CD pipeline to run a job and deploy the models on the target accounts

If you created your SageMaker Studio domain before re:Invent 2020, it’s recommended that you refresh your environment by saving all the work in progress. On the File menu, choose Shutdown, and confirm your choice.

  3. Create and prepare two other AWS accounts for staging and production, if you don’t have them yet.

Configuring Organizations

You need to add the data science account and the two additional accounts to a structure in Organizations. Organizations helps you to centrally manage and govern your environment as you grow and scale your AWS resources. It’s free and benefits your governance strategy.

Each account must be added to a different organizational unit (OU).

  1. On the Organizations console, create a structure of OUs like the following:
  • Root
    • multi-account-deployment (OU)
      • 111111111111 (data science account—SageMaker Studio)
      • production (OU)
        • 222222222222 (AWS account)
      • staging (OU)
        • 333333333333 (AWS account)

After configuring the organization, each account owner receives an invite. The owners need to accept the invites, otherwise the accounts aren’t included in the organization.

  2. Now you need to enable trusted access with AWS Organizations (“Enable all features” and “Enable trusted access in the StackSets”).

This process allows your data science account to provision resources in the target accounts. If you don’t do this, the deployment process fails. Also, this feature set is the preferred way to work with Organizations, and it includes consolidated billing features.

  3. Next, on the Organizations console, choose Organize accounts.
  4. Choose staging.
  5. Note down the OU ID.
  6. Repeat this process for the production OU.

Configuring the permissions

You need to create a SageMaker execution role in each additional account. These roles are assumed by AmazonSageMakerServiceCatalogProductsUseRole in the data science account to deploy the endpoints in the target accounts and test them.

  1. Sign in to the AWS Management Console with the staging account.
  2. Run the following CloudFormation template.

This template creates a new SageMaker role for you.

  3. Provide the following parameters:
    1. SageMakerRoleSuffix – A short string (maximum 10 lowercase alphanumeric characters, with no spaces) that is added to the role name after the following prefix: sagemaker-role-. The final role name is sagemaker-role-<<sagemaker_role_suffix>>.
    2. PipelineExecutionRoleArn – The ARN of the role from the data science account that assumes the SageMaker role you’re creating. To find the ARN, sign in to the console with the data science account. On the IAM console, choose Roles and search for AmazonSageMakerServiceCatalogProductsUseRole. Choose this role and copy the ARN (arn:aws:iam::<<data_science_account_id>>:role/service-role/AmazonSageMakerServiceCatalogProductsUseRole).
  4. After creating this role in the staging account, repeat this process for the production account.

In the data science account, you now configure the policy of the Amazon Simple Storage Service (Amazon S3) bucket used to store the trained model. For this post, we use the default SageMaker bucket of the current Region. It has the following name format: sagemaker-<<region>>-<<aws_account_id>>.

  5. On the Amazon S3 console, search for this bucket, providing the Region you’re using and the ID of the data science account.

If you don’t find it, create a new bucket following this name format.

  6. On the Permissions tab, add the following policy:
    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {
                    "AWS": [
                        "arn:aws:iam::<<staging_account_id>>:root",
                        "arn:aws:iam::<<production_account_id>>:root"
                    ]
                },
                "Action": [
                    "s3:GetObject",
                    "s3:ListBucket"
                ],
                "Resource": [
                    "arn:aws:s3:::sagemaker-<<region>>-<<aws_account_id>>",
                    "arn:aws:s3:::sagemaker-<<region>>-<<aws_account_id>>/*"
                ]
            }
        ]
    }

  7. Save your settings.

The target accounts now have permission to read the trained model during deployment.
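
If you prefer to apply the bucket policy programmatically instead of through the console, a minimal Python (boto3) sketch looks like the following; the bucket name, Region, and account IDs are placeholders matching the example accounts used in this post.

import json
import boto3

s3 = boto3.client("s3")

# Placeholders -- replace with your Region and account IDs
bucket = "sagemaker-us-east-1-111111111111"
policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"AWS": [
            "arn:aws:iam::222222222222:root",  # production account
            "arn:aws:iam::333333333333:root",  # staging account
        ]},
        "Action": ["s3:GetObject", "s3:ListBucket"],
        "Resource": [f"arn:aws:s3:::{bucket}", f"arn:aws:s3:::{bucket}/*"],
    }],
}

s3.put_bucket_policy(Bucket=bucket, Policy=json.dumps(policy))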

The next step is to add new permissions to the roles AmazonSageMakerServiceCatalogProductsUseRole and AmazonSageMakerServiceCatalogProductsLaunchRole.

  1. In the data science account, on the IAM console, choose Roles.
  2. Find the AmazonSageMakerServiceCatalogProductsUseRole role and choose it.
  3. Add a new policy and enter the following JSON code.
  4. Save your changes.
  5. Now, find the AmazonSageMakerServiceCatalogProductsLaunchRole role, choose it and add a new policy with the following content:
    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "VisualEditor0",
                "Effect": "Allow",
                "Action": "s3:GetObject",
                "Resource": "arn:aws:s3:::aws-ml-blog/artifacts/sagemaker-pipeline-blog-resources/*"
            }
        ]
    }

  6. Save your changes.

That’s it! Your environment is almost ready. You only need one more step and you can start training and deploying models in different accounts.

Importing the custom SageMaker Studio project template

In this step, you import your custom project template.

  1. Sign in to the console with the data science account.
  2. On the AWS Service Catalog console, under Administration, choose Portfolios.
  3. Choose Create a new portfolio.
  4. Name the portfolio SageMaker Organization Templates.
  5. Download the following template to your computer.
  6. Choose the new portfolio.
  7. Choose Upload a new product.
  8. For Product name, enter Multi Account Deployment.
  9. For Description, enter Multi account deployment project.
  10. For Owner, enter your name.
  11. Under Version details, for Method, choose Use a template file.
  12. Choose Upload a template.
  13. Upload the template you downloaded.
  14. For Version title, choose 1.0.

The remaining parameters are optional.

  15. Choose Review.
  16. Review your settings and choose Create product.
  17. Choose Refresh to list the new product.
  18. Choose the product you just created.
  19. On the Tags tab, add the following tag to the product:
    1. Key – sagemaker:studio-visibility
    2. Value – True

Back in the portfolio details, you see something similar to the following screenshot (with different IDs).

  20. On the Constraints tab, choose Create constraint.
  21. For Product, choose Multi Account Deployment (the product you just created).
  22. For Constraint type, choose Launch.
  23. Under Launch Constraint, for Method, choose Select IAM role.
  24. Choose AmazonSageMakerServiceCatalogProductsLaunchRole.
  25. Choose Create.
  26. On the Groups, roles, and users tab, choose Add groups, roles, users.
  27. On the Roles tab, select the role you used when configuring your SageMaker Studio domain.
  28. Choose Add access.

If you don’t remember which role you selected, in your data science account, go to the SageMaker console and choose Amazon SageMaker Studio. In the Studio Summary section, locate the attribute Execution role. Search for the name of this role in the previous step.

You’re done! Now it’s time to create a project using this template.

Creating your project

In the previous sections, you prepared the multi-account environment. The next step is to create a project using your new template.

  1. Sign in to the console with the data science account.
  2. On the SageMaker console, open SageMaker Studio with your user.
  3. Choose the Components and registries icon.
  4. On the drop-down menu, choose Projects.
  5. Choose Create project.

On the Create project page, SageMaker templates is chosen by default. This option lists the built-in templates. However, you want to use the template you prepared for the multi-account deployment.

  6. Choose Organization templates.
  7. Choose Multi Account Deployment.
  8. Choose Select project template.

If you can’t see the template, make sure you completed all the steps correctly in the previous section.

  9. In the Project details section, for Name, enter iris-multi-01.

The project name must have 15 characters or fewer.

  10. In the Project template parameters, use the names of the roles you created in each target account (staging and production) and provide the following properties:
    1. SageMakerExecutionRoleStagingName
    2. SageMakerExecutionRoleProdName
  11. Retrieve the OU IDs you created earlier for the staging and production OUs and provide the following properties:
    1. OrganizationalUnitStagingId
    2. OrganizationalUnitProdId
  12. Choose Create project.

Provisioning all the resources takes a few minutes, after which the project is listed in the Projects section. When you choose the project, a tab opens with the project’s metadata. The Model groups tab shows a model group with the same name as your project. It was also created during the project provisioning.

The environment is now ready for the data scientist to start training the model.

Training a model

Now that your project is ready, it’s time to train a model.

  1. Download the example notebook to use for this walkthrough.
  2. Choose the Folder icon to change the work area to file management.
  3. Choose the Create folder icon.
  4. Enter a name for the folder.
  5. Choose the folder name.
  6. Choose the Upload file icon.
  7. Choose the Jupyter notebook you downloaded and upload it to the new directory.
  8. Choose the notebook to open a new tab.

You’re prompted to choose a kernel.

  9. Choose Python3 (Data Science).
  10. Choose Select.

  11. In the second cell of the notebook, replace the project_name variable with the name you gave your project (for this post, iris-multi-01).

You can now run the Jupyter notebook. This notebook creates a very simple pipeline with only two steps: train and register model. It uses the iris dataset and the XGBoost built-in container as the algorithm.

  12. Run the whole notebook.

The process takes some time after you run the cell containing the following code:

start_response = pipeline.start(parameters={
    "TrainingInstanceCount": "1"
})

This starts the training job, which takes approximately 3 minutes to complete. After the training is finished, the next cell of the Jupyter notebook gets the latest version of the model in the model registry and marks it as Approved. Alternatively, you can approve a model from the SageMaker Studio UI. On the Model groups tab, choose the model group and desired version. Choose Update status and Approve before saving.
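
The approval cell in the notebook boils down to a couple of SageMaker API calls. A minimal Python (boto3) sketch of the same idea, assuming the model group shares the project name as described above, looks like this:

import boto3

sm = boto3.client("sagemaker")

model_package_group = "iris-multi-01"  # the model group created with the project

# Get the latest registered model version in the group
packages = sm.list_model_packages(
    ModelPackageGroupName=model_package_group,
    SortBy="CreationTime",
    SortOrder="Descending",
    MaxResults=1,
)
latest_arn = packages["ModelPackageSummaryList"][0]["ModelPackageArn"]

# Flip the approval status, which triggers the CI/CD deployment pipeline
sm.update_model_package(ModelPackageArn=latest_arn, ModelApprovalStatus="Approved")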

This is the end of the data scientist’s job but the beginning of running the CI/CD pipeline.

Amazon EventBridge monitors the model registry. The listener starts a new deployment job with the provisioned AWS CodePipeline workflow (created when you launched the SageMaker Studio project).

  1. On the CodePipeline console, choose the pipeline starting with the prefix sagemaker-, followed by the name of your project.

Shortly after you approve your model, the deployment pipeline starts running. Wait for the pipeline to reach the DeployStaging stage, which can take approximately 10 minutes to complete. After deploying the first endpoint in the staging account, the pipeline tests it and then moves to the next step, ApproveDeployment. In this step, it waits for manual approval.

  2. Choose Review.
  3. Enter an approval reason in the text box.
  4. Choose Approve.

The model is now deployed in the production account.

You can also monitor the pipeline on the AWS CloudFormation console, to see the stacks and stack sets the pipeline creates to deploy endpoints in the target accounts. To see the deployed endpoints for each account, sign in to the SageMaker console as either the staging account or production account and choose Endpoints on the navigation pane.

Cleaning up

To clean up all the resources you provisioned in this example, complete the following steps:

  1. Sign in to the console with your main account.
  2. On the AWS CloudFormation console, choose StackSets and delete the following items (endpoints):
    1. Prod – sagemaker-<<sagemaker-project-name>>-<<project-id>>-deploy-prod
    2. Staging – sagemaker-<<sagemaker-project-name>>-<<project-id>>-deploy-staging
  3. In your laptop or workstation terminal, use the AWS Command Line Interface (AWS CLI) and enter the following code to delete your project:
    aws sagemaker delete-project --project-name iris-multi-01

Make sure you’re using the latest version of the AWS CLI.

Building and customizing a template for your own SageMaker project

SageMaker projects and SageMaker MLOps project templates are powerful features that you can use to automatically create and configure the whole infrastructure required to train, optimize, evaluate, and deploy ML models. A SageMaker project is an AWS Service Catalog provisioned product that enables you to easily create an end-to-end ML solution. For more information, see the AWS Service Catalog Administrator Guide.

A product is a CloudFormation template managed by AWS Service Catalog. For more information about templates and their requirements, see AWS CloudFormation template formats.

ML engineers can design multiple environments and express all the details of this setup as a CloudFormation template, using the concept of infrastructure as code (IaC). You can also integrate these different environments and tasks using a CI/CD pipeline. SageMaker projects provide an easy, secure, and straightforward way of wrapping the infrastructure complexity in the format of a simple project, which other ML engineers and data scientists can launch many times.

The following diagram illustrates the main steps you need to complete in order to create and publish your custom SageMaker project template.

We described these steps in more detail in the sections Importing the custom SageMaker Studio Project template and Creating your project.

As an ML engineer, you can design and create a new CloudFormation template for the project, prepare an AWS Service Catalog portfolio, and add a new product to it.

Both data scientists and ML engineers can use SageMaker Studio to create a new project with the custom template. SageMaker invokes AWS Service Catalog and starts provisioning the infrastructure described in the CloudFormation template.

As a data scientist, you can now start training the model. After you register it in the model registry, the CI/CD pipeline runs automatically and deploys the model on the target accounts.

If you look at the CloudFormation template from this post in a text editor, you can see that it implements the architecture we outline in this post.

The following code is a snippet of the template:

Description: Toolchain template which provides the resources needed to represent infrastructure as code.
  This template specifically creates a CI/CD pipeline to deploy a given inference image and pretrained Model to two stages in CD -- staging and production.
Parameters:
  SageMakerProjectName:
    Type: String
  SageMakerProjectId:
    Type: String
…
<<other parameters>>
…
Resources:
  MlOpsArtifactsBucket:
    Type: AWS::S3::Bucket
    DeletionPolicy: Retain
    Properties:
      BucketName: …
…
  ModelDeployCodeCommitRepository:
    Type: AWS::CodeCommit::Repository
    Properties:
      RepositoryName: …
      RepositoryDescription: …
      Code:
        S3:
          Bucket: …
          Key: …
…
  ModelDeployBuildProject:
    Type: AWS::CodeBuild::Project
…
  ModelDeployPipeline:
    Type: AWS::CodePipeline::Pipeline
…

The template has two key sections: Parameters (input parameters of the template) and Resources. SageMaker project templates require that you add two input parameters to your template: SageMakerProjectName and SageMakerProjectId. These parameters are used internally by SageMaker Studio. You can add other parameters if needed.

In the Resources section of the snippet, you can see that it creates the following:

  • A new S3 bucket used by the CI/CD pipeline to store the intermediary artifacts passed from one stage to another.
  • An AWS CodeCommit repository to store the artifacts used during the deployment and testing stages.
  • An AWS CodeBuild project to get the artifacts, and validate and configure them for the project. In the multi-account template, this project also creates a new model registry, used by the CI/CD pipeline to deploy new models.
  • A CodePipeline workflow that orchestrates all the steps of the CI/CD pipelines.

Each time you register a new model to the model registry or push a new artifact to the CodeCommit repo, this CodePipeline workflow starts. These events are captured by an EventBridge rule, provisioned by the same template. The CI/CD pipeline contains the following stages:

  • Source – Reads the artifacts from the CodeCommit repository and shares with the other steps.
  • Build – Runs the CodeBuild project to do the following:
    • Verify if a model registry is already created, and create one if needed.
    • Prepare a new CloudFormation template that is used by the next two deployment stages.
  • DeployStaging – Contains the following components:
    • DeployResourcesStaging – Gets the CloudFormation template prepared in the Build step and deploys a new stack. This stack deploys a new SageMaker endpoint in the target account.
    • TestStaging – Invokes a second CodeBuild project that runs a custom Python script that tests the deployed endpoint.
    • ApproveDeployment – A manual approval step. If approved, it moves to the next stage to deploy an endpoint in production, or ends the workflow if not approved.
  • DeployProd – Similar to DeployStaging, it uses the same CloudFormation template but with different input parameters. It deploys a new SageMaker endpoint in the production account. 
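
As mentioned before the stage list, an EventBridge rule provisioned by the same template captures the model registry and CodeCommit events that start this workflow. As a rough illustration (not the template's exact definition), creating such a rule for model package state changes with Python (boto3) might look like the following; the rule name and model group name are placeholders.

import json
import boto3

events = boto3.client("events")

# Sketch of an EventBridge rule reacting to model package approval changes
# (the project template provisions an equivalent rule for you)
pattern = {
    "source": ["aws.sagemaker"],
    "detail-type": ["SageMaker Model Package State Change"],
    "detail": {"ModelPackageGroupName": ["iris-multi-01"]},  # placeholder group name
}

events.put_rule(
    Name="model-registry-listener-sketch",  # hypothetical rule name
    EventPattern=json.dumps(pattern),
)
# A real rule also needs a target (for example, the CodePipeline workflow), added via put_targets.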

You can start a new training process and register your model to the model registry associated with the SageMaker project. Use the Jupyter notebook provided in this post and customize your own ML pipeline to prepare your dataset and train, optimize, and test your models before deploying them. For more information about these features, see Automate MLOps with SageMaker Projects. For more Pipelines examples, see the GitHub repo.

Conclusions and next steps

In this post, you saw how to prepare your own environment to train and deploy ML models in multiple AWS accounts by using SageMaker Pipelines.

With SageMaker projects, the governance and security of your environment can be significantly improved if you start managing your ML projects as a library of SageMaker project templates.

As a next step, try to modify the SageMaker project template and customize it to address your organization’s needs. Add as many steps as you want and keep in mind that you can capture the CI/CD events and notify users or call other services to build comprehensive solutions.


About the Author

Samir Araújo is an AI/ML Solutions Architect at AWS. He helps customers solve their business challenges by creating AI/ML solutions using the AWS platform. He has been working on several AI/ML projects related to computer vision, natural language processing, forecasting, ML at the edge, and more. He likes playing with hardware and automation projects in his free time, and he has a particular interest in robotics.

Redacting PII from application log output with Amazon Comprehend

Amazon Comprehend is a natural language processing (NLP) service that uses machine learning (ML) to find insights and relationships in text. The service can extract people, places, sentiments, and topics in unstructured data. You can now use Amazon Comprehend ML capabilities to detect and redact personally identifiable information (PII) in application logs, customer emails, support tickets, and more. No ML experience required. Redacting PII entities helps you protect privacy and comply with local laws and regulations.

Use case: Applications printing PII data in log output

Some applications print PII data in their log output inadvertently. In some cases, this may be due to developers forgetting to remove debug statements before deploying the application in production, and in other cases it may be due to legacy applications that are handed down and are difficult to update. PII can also get printed in stack traces. It’s generally a mistake to have PII present in such logs. Correlation IDs and primary keys are better identifiers than PII when debugging applications.

PII in application logs can quickly propagate to downstream systems, compounding security concerns. For example, it may get submitted to search and analytics systems where it’s searchable and viewable by everyone. It may also be stored in object storage such as Amazon Simple Storage Service (Amazon S3) for analytics purposes. With the PII detection API of Amazon Comprehend, you can remove PII from application log output before such a log statement is even printed.

In this post, I take the use case of a Java application that is generating log output with PII. The initial log output goes through filter-like processing that redacts PII before the log statement is output by the application. You can take a similar approach for other programming languages.

The application can be repackaged by changing its log format file, such as log4j.xml, and adding one Java class from this sample project, or adding this Java class as a dependency in the form of a .jar file.

The sample application is available in the following GitHub repo.

PII entity types

The following table lists some of the entity types Amazon Comprehend detects.

PII Entity Type – Description
EMAIL – An email address, such as marymajor@email.com.
NAME – An individual’s name. This entity type does not include titles, such as Mr., Mrs., Miss, or Dr. Amazon Comprehend does not apply this entity type to names that are part of organizations or addresses. For example, Amazon Comprehend recognizes the “John Doe Organization” as an organization, and it recognizes “Jane Doe Street” as an address.
PHONE – A phone number. This entity type also includes fax and pager numbers.
SSN – A Social Security Number (SSN) is a 9-digit number that is issued to US citizens, permanent residents, and temporary working residents. Amazon Comprehend also recognizes Social Security Numbers when only the last 4 digits are present.

For the full list, see Detect Personally Identifiable Information (PII).

The API response from Amazon Comprehend includes the entity type, its begin offset, end offset, and a confidence score. For this post, we use all of them.
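
The application in this post calls the API from Java (shown later), but for reference, the same call and the offset-based redaction logic look roughly like the following Python (boto3) sketch; the sample text, score threshold, and entity list mirror the configuration used later in the post.

import boto3

comprehend = boto3.client("comprehend")

text = "User Napoleon, SSN 366435036, opened an account"  # sample log line from this post
response = comprehend.detect_pii_entities(Text=text, LanguageCode="en")

# Replace each detected entity with a mask, using the returned offsets;
# walk backwards so earlier offsets stay valid while we edit the string
redacted = text
for entity in sorted(response["Entities"], key=lambda e: e["BeginOffset"], reverse=True):
    if entity["Score"] >= 0.9 and entity["Type"] in ("SSN", "EMAIL"):
        start, end = entity["BeginOffset"], entity["EndOffset"]
        redacted = redacted[:start] + "*" * (end - start) + redacted[end:]

print(redacted)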

Application overview

Our example application is a very simple application that simulates opening a bank account for a user. In its current form, the log output looks like the following code. We can see this by making requests to the endpoint payment:

curl localhost:8080/payment
2020-09-29T10:29:04,115 INFO [http-nio-8080-exec-1] c.e.l.c.PaymentController: Processing user User(name=Terina, ssn=626031641, email=mel.swift@Taylor.com, description=Ea minima omnis autem illo.)
2020-09-29T10:29:04,711 INFO [http-nio-8080-exec-2] c.e.l.c.PaymentController: User Napoleon, SSN 366435036, opened an account
2020-09-29T10:29:05,253 INFO [http-nio-8080-exec-4] c.e.l.c.PaymentController: User Cristen, SSN 197961488, opened an account
2020-09-29T10:29:05,673 INFO [http-nio-8080-exec-5] c.e.l.c.PaymentController: Processing user User(name=Giuseppe, ssn=713425581, email=elijah.dach@Shawnna.com, description=Impedit asperiores in magnam exercitationem.)

The output prints Name, SSN, and Email. This PII data is being generated by the java-faker library, which is a Java port of the well-known Ruby gem. See the following code:

        <dependency>
            <groupId>com.github.javafaker</groupId>
            <artifactId>javafaker</artifactId>
            <version>1.0.2</version>
        </dependency>

Log4j 2

Log4j 2 is a common Java library used for logging. Appenders in Log4j are responsible for delivering log events to their destinations, which can be console, file, and more. Log4j also has a RewriteAppender that lets you rewrite the log message before it is output. RewriteAppender works in conjunction with a RewritePolicy that provides the implementation for changing the log output.

The sample application uses the following log4j.xml file for log configuration:

<?xml version="1.0" encoding="UTF-8"?>
<!-- status="trace" -->
<Configuration packages="com.example.logging">
    <Appenders>
        <Console name="Console" target="SYSTEM_OUT">
            <PatternLayout
                    pattern="%style{%d{ISO8601}}{black} %highlight{%-5level }[%style{%t}{bright,blue}] %style{%C{1.}}{bright,yellow}: %msg%n%throwable" />
        </Console>
        <Rewrite name="Rewrite">
            <SensitiveDataPolicy
                    maskMode="MASK"
                    mask="*"
                    minScore="0.9"
                    entitiesToReplace="SSN,EMAIL"
            />
            <AppenderRef ref="Console" />
        </Rewrite>
    </Appenders>

    <Loggers>
        <!-- LOG everything at INFO level -->
        <Root level="info">
            <AppenderRef ref="Console" />
        </Root>
        <!-- LOG "com.example*" at DEBUG level -->
        <Logger name="com.example" level="debug" additivity="false">
            <AppenderRef ref="Rewrite" />
        </Logger>
    </Loggers>

</Configuration>

SensitiveDataPolicy

The Log4j RewritePolicy we created for this project is named SensitiveDataPolicy. It uses four parameters:

  • maskMode – This parameter has two modes:
    • REPLACE – The policy replaces discovered entities with their type names. For example, in the case of Social Security numbers, the replaced string is [SSN].
    • MASK – The policy replaces the discovered entity with a string consisting of the character provided as a mask parameter.
  • mask – The character to use to replace the discovered entity with. Only relevant if maskMode is MASK.
  • minScore – The minimum confidence score acceptable to us.
  • entitiesToReplace – A comma-separated list of entity type names that we want to replace. For example, we’re choosing to replace social security number and email, so the string value we provide is SSN,EMAIL. Amazon Comprehend also detects NAME in our application, but it’s printed as is.

Choosing redaction vs. masking is a matter of preference. Redaction is usually preferred when the context needs to be preserved, such as in natural text, whereas masking is best for maintaining text length as well as structured data such as formatted files or key-value pairs.

Detecting PII is as simple as making an API call to Amazon Comprehend using the AWS SDK and providing the text to analyze:

        DetectPiiEntitiesRequest piiEntitiesRequest =
                DetectPiiEntitiesRequest.builder()
                        .languageCode("en")
                        .text(msg.getFormattedMessage())
                        .build();

        DetectPiiEntitiesResponse piiEntitiesResponse = comprehendClient.detectPiiEntities(piiEntitiesRequest);

Asynchronous logging

Because our policy makes synchronous calls to Amazon Comprehend for PII detection, we want this processing to happen asynchronously, outside of the customer request loop, to avoid introducing latency. For instructions, see Asynchronous Loggers for Low-Latency Logging. We add the Disruptor library to our classpath by adding it to pom.xml:

        <dependency>
            <groupId>com.lmax</groupId>
            <artifactId>disruptor</artifactId>
            <version>3.4.2</version>
        </dependency>

We also need to set a system property; note that it must be passed before the -jar option so that the JVM picks it up. After we package our application with mvn package, we can run it as in the following code:

java -Dlog4j2.contextSelector=org.apache.logging.log4j.core.async.AsyncLoggerContextSelector -jar target/comprehend-logging.jar

Updated log output

The log output from this application now looks like the following. We can see that SSN and Email are being suppressed.

2020-09-29T12:52:30,423 INFO  [http-nio-8080-exec-6] ?: User Willa, SSN *********, opened an account
2020-09-29T12:52:30,824 INFO  [http-nio-8080-exec-8] ?: User Vania, SSN *********, opened an account
2020-09-29T12:52:31,245 INFO  [http-nio-8080-exec-9] ?: Processing user User(name=Laronda, ssn=*********, email=******************************, description=Doloremque culpa iure dolore omnis.)
2020-09-29T12:52:31,637 INFO  [http-nio-8080-exec-1] ?: Processing user User(name=Tommye, ssn=*********, email=*************************, description=Corporis sed tempore.)

Conclusion

We learned how to use Amazon Comprehend to redact sensitive data natively within next-generation applications. For information about applying it as a postprocessing technique for logs in storage, see Detecting and redacting PII using Amazon Comprehend. The API lets you have complete control over the entities that are important for your use case and lets you either mask or redact the information.

For more information about Amazon Comprehend availability and quotas, see Amazon Comprehend endpoints and quotas.


About the Author

Pradeep Singh is a Solutions Architect at Amazon Web Services. He helps AWS customers take advantage of AWS services to design scalable and secure applications. His expertise spans Application Architecture, Containers, Analytics and Machine Learning.

 

Read More

Building, automating, managing, and scaling ML workflows using Amazon SageMaker Pipelines

We recently announced Amazon SageMaker Pipelines, the first purpose-built, easy-to-use continuous integration and continuous delivery (CI/CD) service for machine learning (ML). SageMaker Pipelines is a native workflow orchestration tool for building ML pipelines that take advantage of direct Amazon SageMaker integration. Three components improve the operational resilience and reproducibility of your ML workflows: pipelines, model registry, and projects. These workflow automation components enable you to easily scale your ability to build, train, test, and deploy hundreds of models in production, iterate faster, reduce errors due to manual orchestration, and build repeatable mechanisms.

SageMaker projects introduce MLOps templates that automatically provision the underlying resources needed to enable CI/CD capabilities for your ML development lifecycle. You can use a number of built-in templates or create your own custom template. You can use SageMaker Pipelines independently to create automated workflows; however, when used in combination with SageMaker projects, the additional CI/CD capabilities are provided automatically. The following screenshot shows how the three components of SageMaker Pipelines can work together in an example SageMaker project.

This post focuses on using an MLOps template to bootstrap your ML project and establish a CI/CD pattern from sample code. We show how to use the built-in build, train, and deploy project template as a base for a customer churn classification example. This base template enables CI/CD for training ML models, registering model artifacts to the model registry, and automating model deployment with manual approval and automated testing.

MLOps template for building, training, and deploying models

We start by taking a detailed look at what AWS services are launched when this build, train, and deploy MLOps template is launched. Later, we discuss how to modify the skeleton for a custom use case.

To get started with SageMaker projects, you must first enable it on the Amazon SageMaker Studio console. This can be done for existing users or while creating new ones. For more information, see SageMaker Studio Permissions Required to Use Projects.

In SageMaker Studio, you can now choose Projects on the Components and registries menu.

On the projects page, you can launch a preconfigured SageMaker MLOps template. For this post, we choose MLOps template for model building, training, and deployment.

Launching this template starts a model building pipeline by default, and while there is no cost for using SageMaker Pipelines itself, you will be charged for the services launched. Cost varies by Region. A single run of the model build pipeline in us-east-1 is estimated to cost less than $0.50. Models approved for deployment incur the cost of the SageMaker endpoints (test and production) for the Region using an ml.m5.large instance.

After the project is created from the MLOps template, the following architecture is deployed.

Included in the architecture are the following AWS services and resources:

  • The MLOps templates that are made available through SageMaker projects are provided via an AWS Service Catalog portfolio that automatically gets imported when a user enables projects on the Studio domain.
  • Two repositories are added to AWS CodeCommit:
    • The first repository provides scaffolding code to create a multi-step model building pipeline including the following steps: data processing, model training, model evaluation, and conditional model registration based on accuracy. As you can see in the pipeline.py file, this pipeline trains a regression model using the XGBoost algorithm on the well-known UCI Abalone dataset. This repository also includes a build specification file, used by AWS CodePipeline and AWS CodeBuild to run the pipeline automatically.
    • The second repository contains code and configuration files for model deployment, as well as test scripts required to pass the quality gate. This repo also uses CodePipeline and CodeBuild, which run an AWS CloudFormation template to create model endpoints for staging and production.
  • Two CodePipeline pipelines:
    • The ModelBuild pipeline automatically triggers and runs the pipeline from end to end whenever a new commit is made to the ModelBuild CodeCommit repository.
    • The ModelDeploy pipeline automatically triggers whenever a new model version is added to the model registry and the status is marked as Approved. Models that are registered with Pending or Rejected statuses aren’t deployed.
  • An Amazon Simple Storage Service (Amazon S3) bucket is created for output model artifacts generated from the pipeline.
  • SageMaker Pipelines uses the following resources:
    • The pipeline – This workflow contains the directed acyclic graph (DAG) that trains and evaluates our model. Each step in the pipeline keeps track of its lineage, and intermediate steps can be cached for quickly re-running the pipeline. Outside of templates, you can also create pipelines using the SDK.
    • The model registry – Within SageMaker Pipelines, the SageMaker model registry tracks the model versions and respective artifacts, including the lineage and metadata for how they were created. Different model versions are grouped together under a model group, and new models registered to the registry are automatically versioned. The model registry also provides an approval workflow for model versions and supports deployment of models in different accounts. You can also use the model registry through the boto3 package, as shown in the sketch after this list.
  • Two SageMaker endpoints:
    • After a model is approved in the registry, the artifact is automatically deployed to a staging endpoint followed by a manual approval step.
    • If approved, it’s deployed to a production endpoint in the same AWS account.

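To illustrate the last point, the following is a minimal sketch (not part of the template) of how the model registry can be inspected and a model version approved through the boto3 package; the model package group name is a hypothetical placeholder.

import boto3

sm_client = boto3.client("sagemaker")

# List the model versions registered in a (hypothetical) model package group
response = sm_client.list_model_packages(
    ModelPackageGroupName="customer-churn-models",
    SortBy="CreationTime",
    SortOrder="Descending",
)
latest_package_arn = response["ModelPackageSummaryList"][0]["ModelPackageArn"]

# Approving a model version is what triggers the ModelDeploy pipeline
sm_client.update_model_package(
    ModelPackageArn=latest_package_arn,
    ModelApprovalStatus="Approved",
)
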
All SageMaker resources, such as training jobs, pipelines, models, and endpoints, as well as AWS resources listed in this post, are automatically tagged with the project name and a unique project ID tag.

Modifying the sample code for a custom use case

After your project has been created, the architecture described earlier is deployed and the visualization of the pipeline is available on the Pipelines drop-down menu within SageMaker Studio.

To modify the sample code from this launched template, we first need to clone the CodeCommit repositories to our local SageMaker Studio instance. From the list of projects, choose the one that was just created. On the Repositories tab, you can select the hyperlinks to locally clone the CodeCommit repos.

ModelBuild repo

The ModelBuild repository contains the code for preprocessing, training, and evaluating the model. The sample code trains and evaluates a model on the UCI Abalone dataset. We can modify these files to solve our own customer churn use case. See the following code:

|-- codebuild-buildspec.yml
|-- CONTRIBUTING.md
|-- pipelines
| |-- abalone
| | |-- evaluate.py
| | |-- __init__.py
| | |-- pipeline.py
| | |-- preprocess.py
| |-- get_pipeline_definition.py
| |-- __init__.py
| |-- run_pipeline.py
| |-- _utils.py
| |-- __version__.py
|-- README.md
|-- sagemaker-pipelines-project.ipynb
|-- setup.cfg
|-- setup.py
|-- tests
| -- test_pipelines.py
|-- tox.ini

We now need a dataset accessible to the project.

  1. Open a new SageMaker notebook inside Studio and run the following cells:
    !wget http://dataminingconsultant.com/DKD2e_data_sets.zip
    !unzip -o DKD2e_data_sets.zip
    !mv "Data sets" Datasets
    
    import os
    import boto3
    import sagemaker
    prefix = 'sagemaker/DEMO-xgboost-churn'
    region = boto3.Session().region_name
    default_bucket = sagemaker.session.Session().default_bucket()
    role = sagemaker.get_execution_role()
    
    RawData = boto3.Session().resource('s3')\
        .Bucket(default_bucket).Object(os.path.join(prefix, 'data/RawData.csv'))\
        .upload_file('./Datasets/churn.txt')
    
    print(os.path.join("s3://",default_bucket, prefix, 'data/RawData.csv'))

  2. Rename the abalone directory to customer_churn. This requires us to modify the path inside codebuild-buildspec.yml as shown in the sample repository. See the following code:
    run-pipeline --module-name pipelines.customer_churn.pipeline

  3. Replace the preprocess.py code with the customer churn preprocessing script found in the sample repository.
  4. Replace the pipeline.py code with the customer churn pipeline script found in the sample repository.
    1. Be sure to replace the “InputDataUrl” default parameter with the Amazon S3 URL obtained in step 1:
      input_data = ParameterString(
          name="InputDataUrl",
         default_value=f"s3://YOUR_BUCKET/RawData.csv",
      )

    2. Update the conditional step to evaluate the classification model:
      # Conditional step for evaluating model quality and branching execution
      cond_lte = ConditionGreaterThanOrEqualTo(
          left=JsonGet(step=step_eval, property_file=evaluation_report, json_path="binary_classification_metrics.accuracy.value"), right=0.8
      )

    One last thing to note is that the default ModelApprovalStatus is set to PendingManualApproval. If our model has greater than 80% accuracy, it’s added to the model registry, but not deployed until manual approval is complete.

  5. Replace the evaluate.py code with the customer churn evaluation script found in the sample repository. One piece of the code we’d like to point out is that, because we’re evaluating a classification model, we need to update the metrics we’re evaluating and associating with trained models:
    report_dict = {
        "binary_classification_metrics": {
            "accuracy": {
                "value": acc,
                "standard_deviation": "NaN"
            },
            "auc": {
                "value": roc_auc,
                "standard_deviation": "NaN"
            },
        },
    }
    
    evaluation_output_path = '/opt/ml/processing/evaluation/evaluation.json'
    with open(evaluation_output_path, 'w') as f:
        f.write(json.dumps(report_dict))

The JSON structure of these metrics is required to match the format of sagemaker.model_metrics for complete integration with the model registry.

ModelDeploy repo

The ModelDeploy repository contains the AWS CloudFormation buildspec for the deployment pipeline. We don’t make any modifications to this code because it’s sufficient for our customer churn use case. It’s worth noting that model tests can be added to this repo to gate model deployment. See the following code:

├── build.py
├── buildspec.yml
├── endpoint-config-template.yml
├── prod-config.json
├── README.md
├── staging-config.json
└── test
    ├── buildspec.yml
    └── test.py

Triggering a pipeline run

Committing these changes to the CodeCommit repository (easily done on the Studio source control tab) triggers a new pipeline run, because an Amazon EventBridge event monitors for commits. After a few moments, we can monitor the run by choosing the pipeline inside the SageMaker project.

The following screenshot shows our pipeline details.

Choosing the pipeline run displays the steps of the pipeline, which you can monitor.

When the pipeline is complete, you can go to the Model groups tab inside the SageMaker project and inspect the metadata attached to the model artifacts.

If everything looks good, we can manually approve the model.

This approval triggers the ModelDeploy pipeline and exposes an endpoint for real-time inference.

Conclusion

SageMaker Pipelines enables teams to leverage best practice CI/CD methods within their ML workflows. In this post, we showed how a data scientist can modify a preconfigured MLOps template for their own modeling use case. Among the many benefits is that the changes to the source code can be tracked, associated metadata can be tied to trained models for deployment approval, and repeated pipeline steps can be cached for reuse. To learn more about SageMaker Pipelines, check out the website and the documentation. Try SageMaker Pipelines in your own workflows today.


About the Authors

Sean Morgan is an AI/ML Solutions Architect at AWS. He previously worked in the semiconductor industry, using computer vision to improve product yield. He later transitioned to a DoD research lab where he specialized in adversarial ML defense and network security. In his free time, Sean is an active open-source contributor and maintainer, and is the special interest group lead for TensorFlow Addons.

 

Hallie Weishahn is an AI/ML Specialist Solutions Architect at AWS, focused on leading global standards for MLOps. She previously worked as an ML Specialist at Google Cloud Platform. She works with product, engineering, and key customers to build repeatable architectures and drive product roadmaps. She provides guidance and hands-on work to advance and scale machine learning use cases and technologies. Troubleshooting top issues and evaluating existing architectures to enable integrations from PoC to a full deployment is her strong suit.

 

Shelbee Eigenbrode is an AI/ML Specialist Solutions Architect at AWS. Her current areas of depth include DevOps combined with ML/AI. She has been in technology for 23 years, spanning multiple roles and technologies. With over 35 patents granted across various technology domains, her passion for continuous innovation combined with a love of all things data turned her focus to the field of data science. Combining her backgrounds in data, DevOps, and machine learning, her current passion is helping customers to not only embrace data science but also ensure all models have a path to production by adopting MLOps practices. In her spare time, she enjoys reading and spending time with family, including her fur family (aka dogs), as well as friends.

Read More

Labeling mixed-source, industrial datasets with Amazon SageMaker Ground Truth

Prior to using any kind of supervised machine learning (ML) algorithm, data has to be labeled. Amazon SageMaker Ground Truth simplifies and accelerates this task. Ground Truth uses pre-defined templates to assign labels that classify the content of images or videos or verify existing labels. Ground Truth allows you to define workflows for labeling various kinds of data, such as text, video, or images, without writing any code. Although these templates are applicable to a wide range of use cases in which the data to be labeled is in a single format or from a single source, industrial workloads often require labeling data from different sources and in different formats. This post explores the use case of industrial welding data consisting of sensor readings and images to show how to implement customized, complex, mixed-source labeling workflows using Ground Truth.

For this post, you deploy an AWS CloudFormation template in your AWS account to provision the foundational resources needed to implement this labeling workflow. This gives you hands-on experience with the following topics:

  • Creating a private labeling workforce in Ground Truth
  • Creating a custom labeling job using the Ground Truth framework with the following components:
    • Designing a pre-labeling AWS Lambda function that pulls data from different sources and runs a format conversion where necessary
    • Implementing a customized labeling user interface in Ground Truth using crowd templates that dynamically loads the data generated by the pre-labeling Lambda function
    • Consolidating labels from multiple workers using a customized post-labeling Lambda function
  • Configuring a custom labeling job using Ground Truth with a customized interface for displaying multiple pieces of data that have to be labeled as a single item

Prior to diving deep into the implementation, I provide an introduction to the use case and show how the Ground Truth custom labeling framework eases the implementation of highly complex labeling workflows. To make full use of this post, you need an AWS account on which you can deploy CloudFormation templates. The total cost incurred on your account for following this post is under $1.

Labeling complex datasets for industrial welding quality control

Although the mechanisms discussed in this post are generally applicable to any labeling workflow with different data formats, I use data from a welding quality control use case. In this use case, the manufacturing company running the welding process wants to predict whether the welding result will be OK or whether a number of anomalies occurred during the process. To implement this using a supervised ML model, you need to obtain labeled data with which to train the ML model, such as datasets representing welding processes that need to be labeled to indicate whether the process was normal or not. We implement this labeling process (not the ML or modeling process) using Ground Truth, which allows welding experts to make assessments about the result of a welding process and assign this result to a dataset consisting of images and sensor data.

The CloudFormation template creates an Amazon Simple Storage Service (Amazon S3) bucket in your AWS account that contains images (prefix images) and CSV files (prefix sensor_data). The images contain pictures taken during an industrial welding process similar to the following, where a welding beam is applied onto a metal surface (for image source, see TIG Stainless Steel 304):

 

The CSV files contain sensor data representing current, electrode position, and voltage measured by sensors on the welding machine. For the full dataset, see the GitHub repo. A raw sample of this CSV data is as follows:

0|96.19|1023|420|4.5|4.5|1|8
0.1|96.13|894|424|4.5|4.5|1|8
0.2|96.06|884|425|4.5|4.5|1|8
0.3|96.05|884|426|4.5|4.5|1|8
0.4|96.12|887|426|4.5|4.5|1|8
0.5|96.17|902|426|4.5|4.5|2|8
0.6|95.82|974|426|4.5|4.5|2|8
0.7|95.45|1304|426|4.5|4.5|3|8
0.8|95.15|1410|428|4.5|4.5|3|8
0.9|94.96|1446|428|4.5|4.5|3|8
1|94.79|1464|428|4.5|4.5|3|8
...

The first column of the data is a timestamp in milliseconds normalized to the start of the welding process. Each row consists of various sensor values associated with that timestamp: the second column is the electrode position, the third is the current, and the fourth is the voltage (the other values are irrelevant here). For instance, the row with timestamp 1, 100 milliseconds after the start of the welding process, has an electrode position of 94.79, a current of 1464, and a voltage of 428.

Because it’s difficult for humans to make assessments using the raw CSV data, I also show how to preprocess such data on the fly for labeling and turn it into more easily readable plots. This way, the welding experts can view the images and the plots to make their assessment about the welding process.
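
To give a feel for this on-the-fly preprocessing, the following is a small illustrative sketch (not the code deployed by the CloudFormation template) that reads the pipe-delimited CSV file and renders one plot per sensor; the column names are hypothetical placeholders chosen to match the description above.

import pandas as pd
import matplotlib.pyplot as plt

# Column layout as described above; the last four columns are not used
columns = ["timestamp", "electrode_pos", "current", "voltage",
           "col5", "col6", "col7", "col8"]
df = pd.read_csv("weld.1.csv", sep="|", header=None, names=columns)

for sensor in ["electrode_pos", "current", "voltage"]:
    plt.figure()
    plt.plot(df["timestamp"], df[sensor])
    plt.xlabel("Time since start of welding process")
    plt.ylabel(sensor)
    plt.savefig(f"weld.1.csv-{sensor}.png")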

Deploying the CloudFormation template

To simplify the setup and configurations needed in the following, I created a CloudFormation template that deploys several foundations into your AWS account. To start this process, complete the following steps:

  1. Sign in to your AWS account.
  2. Choose one of the following links, depending on which AWS Region you’re using:
us-east-1
us-west-2
eu-west-1
eu-central-1
ap-northeast-1
ap-southeast-1
  3. Keep all the parameters as they are and select I acknowledge that AWS CloudFormation might create IAM resources with custom names and I acknowledge that AWS CloudFormation might require the following capability: CAPABILITY_AUTO_EXPAND.
  4. Choose Create stack to start the deployment.

The deployment takes about 3–5 minutes, during which time a bucket with data to label, some AWS Lambda functions, and an AWS Identity and Access Management (IAM) role are deployed. The process is complete when the status of the deployment switches to CREATE_COMPLETE.

The Outputs tab has additional information, such as the Amazon S3 path to the manifest file, which you use throughout this post. Therefore, it’s recommended to keep this browser tab open and follow the rest of the post in another tab.

Creating a Ground Truth labeling workforce

Ground Truth offers three options for defining workforces that complete the labeling: Amazon Mechanical Turk, vendor-specific workforces, and private workforces. In this section, we configure a private workforce because we want to complete the labeling ourselves. Create a private workforce with the following steps:

  1. On the Amazon SageMaker console, under Ground Truth, choose Labeling workforces.
  2. On the Private tab, choose Create private team.

  3. Enter a name for the labeling workforce. For our use case, I enter welding-experts.
  4. Select Invite new workers by email.
  5. Enter your e-mail address, an organization name, and a contact e-mail (which may be the same as the one you just entered).
  6. Choose Create private team.

The console confirms the creation of the labeling workforce at the top of the screen. When you refresh the page, the new workforce shows on the Private tab, under Private teams.

You also receive an e-mail with login instructions, including a temporary password and a link to open the login page.

  7. Choose the link and use your e-mail and temporary password to authenticate and change the password for the login.

It’s recommended to keep this browser tab open so you don’t have to log in again. This concludes all necessary steps to create your workforce.

Configuring a custom labeling job

In this section, we create a labeling job and use this job to explain the details and data flow of a custom labeling job.

  1. On the Amazon SageMaker console, under Ground Truth, choose Labeling jobs.
  2. Choose Create labeling job.

  3. Enter a name for your labeling job, such as WeldingLabelJob1.
  4. Choose Manual data setup.
  5. For Input dataset location, enter the ManifestS3Path value from the CloudFormation stack Outputs tab.
  6. For Output dataset location, enter the ProposedOutputPath value from the CloudFormation stack Outputs tab.
  7. For IAM role, choose Enter a custom IAM role ARN.
  8. Enter the SagemakerServiceRoleArn value from the CloudFormation stack Outputs tab.
  9. For the task type, choose Custom.
  10. Choose Next.

The IAM role is a customized role created by the CloudFormation template that allows Ground Truth to invoke Lambda functions and access Amazon S3.

  11. Choose to use a private labeling workforce.
  12. From the drop-down menu, choose the workforce welding-experts.
  13. For task timeout and task expiration time, 1 hour is sufficient.
  14. The number of workers per dataset object is 1.
  15. In the Lambda functions section, for Pre-labeling task Lambda function, choose the function that starts with PreLabelingLambda-.
  16. For Post-labeling task Lambda function, choose the function that starts with PostLabelingLambda-.
  17. Enter the following code into the templates section. This HTML code specifies the interface that the workers in the private labeling workforce see when labeling items. For our use case, the template displays four images, and the categories to classify welding results are as follows:
    <script src="https://assets.crowd.aws/crowd-html-elements.js"></script>
    <crowd-form>
      <crowd-classifier
        name="WeldingClassification"
        categories="['Good Weld', 'Burn Through', 'Contamination', 'Lack of Fusion', 'Lack of Shielding Gas', 'High Travel Speed', 'Not sure']"
        header="Please classify the welding process."
      >
          <classification-target>
            <div>
              <h3>Welding Image</h3>
    	      	<p><strong>Welding Camera Image </strong>{{ task.input.image.title }}</p>
    	      	<p><a href="{{ task.input.image.file | grant_read_access }}" target="_blank">Download Image</a></p>
    	      	<p>
    	      		<img style="height: 30vh; margin-bottom: 10px" src="{{ task.input.image.file | grant_read_access }}"/>
    	      	</p>
    	    </div>
    	    <hr/>
            <div>
              <h3>Current Graph</h3>
    	      	<p><strong>Current Graph </strong>{{ task.input.current.title }}</p>
    	      	<p><a href="{{ task.input.current.file | grant_read_access }}" target="_blank">Download Current Plot</a></p>
    	      	<p>
    	      		<img style="height: 30vh; margin-bottom: 10px" src="{{ task.input.current.file | grant_read_access }}"/>
    	      	</p>
    	    </div>
            <hr/>
            <div>
              <h3>Electrode Position Graph</h3>
    	      	<p><strong>Electrode Position Graph </strong>{{ task.input.electrode.title }}</p>
    	      	<p><a href="{{ task.input.electrode.file | grant_read_access }}" target="_blank">Download Electrode Position Plot</a></p>
    	      	<p>
    	      		<img style="height: 30vh; margin-bottom: 10px" src="{{ task.input.electrode.file | grant_read_access }}"/>
    	      	</p>
    	    </div>
            <hr/>
            <div>
              <h3>Voltage Graph</h3>
    	      	<p><strong>Voltage Graph </strong>{{ task.input.voltage.title }}</p>
    	      	<p><a href="{{ task.input.voltage.file | grant_read_access }}" target="_blank">Download Voltage Plot</a></p>
    	      	<p>
    	      		<img style="height: 30vh; margin-bottom: 10px" src="{{ task.input.voltage.file | grant_read_access }}"/>
    	      	</p>
    	    </div>
          </classification-target>
    
    
    
        <full-instructions header="Classification Instructions">
          <p>Read the task carefully and inspect the image as well as the plots.</p>
          <p>
    		  The image is a picture taking during the welding process. The plots show the corresponding sensor data for
    		  the electrode position, the voltage and the current measured during the welding process.
    	  </p>
        </full-instructions>
    
        <short-instructions>
          <p>Read the task carefully and inspect the image as well as the plots</p>
        </short-instructions>
      </crowd-classifier>
    </crowd-form>
    

The wizard for creating the labeling job has a preview function in the section Custom labeling task setup, which you can use to check if all configurations work properly.

  18. To preview the interface, choose Preview.

This opens a new browser tab and shows a test version of the labeling interface, similar to the following screenshot.

  19. To create the labeling job, choose Create.

Ground Truth sets up the labeling job as specified, and the dashboard shows its status.

Assigning labels

To finalize the labeling job that you configured, you log in to the worker portal and assign labels to different data items consisting of images and data plots. The details on how the different components of the labeling job work together are explained in the next section.

  1. On the Amazon SageMaker console, under Ground Truth, choose Labeling workforces.
  2. On the Private tab, choose the link for Labeling portal sign-in URL.

When Ground Truth is finished preparing the labeling job, you can see it listed in the Jobs section. If it’s not showing up, wait a few minutes and refresh the tab.

  3. Choose Start working.

This launches the labeling UI, which allows you to assign labels to mixed datasets consisting of welding images and plots for current, electrode position, and voltage.

For this use case, you can assign seven different labels to a single dataset. These different classes and labels are defined in the HTML of the UI, but you can also insert them dynamically using the pre-labeling Lambda function (discussed in the next section). Because we don’t actually use the labeled data for ML purposes, you can assign the labels randomly to the five items that are displayed by Ground Truth for this labeling job.

After labeling all the items, the UI switches back to the list with available jobs. This concludes the section about configuring and launching the labeling job. In the next section, I explain the mechanics of a custom labeling job in detail and also dive deep into the different elements of the HTML interface.

Custom labeling deep dive

A custom labeling job combines the data to be labeled with three components to create a workflow that allows workers from the labeling workforce to assign labels to each item in the dataset:

  • Pre-labeling Lambda function – Generates the content to be displayed on the labeling interface using the manifest file specified during the configuration of the labeling job. For this use case, the function also converts the CSV files into human readable plots and stores these plots as images in the S3 bucket under the prefix plots.
  • Labeling interface – Uses the output of the pre-labeling function to generate a user interface. For this use case, the interface displays four images (the picture taken during the welding process and the three graphs for current, electrode position, and voltage) and a form that allows workers to classify the welding process.
  • Label consolidation Lambda function – Allows you to implement custom strategies to consolidate classifications of one or several workers into a single response. For our workforce, this is very simple because there is only a single worker whose labels are consolidated into a file, which is stored by Ground Truth into Amazon S3.

Before we analyze these three components, I provide insights into the structure of the manifest file, which describes the data sources for the labeling job.

Manifest and dataset files

The manifest file is a file conforming to the JSON lines format, in which each line represents one item to label. Ground Truth expects either a key source or source-ref in each line of the file. For this use case, I use source, and the mapped value must be a string representing an Amazon S3 path. For this post, we only label five items, and the JSON lines are similar to the following code:

{"source": "s3://iiot-custom-label-blog-bucket-unn4d0l4j0/dataset/dataset-1.json"}

For our use case with multiple input formats and files, each line in the manifest points to a dataset file that is also stored on Amazon S3. Our dataset is a JSON document, which contains references to the welding images and the CSV file with the sensor data:

{
  "sensor_data": {"s3Path": "s3://iiot-custom-label-blog-bucket-unn4d0l4j0/sensor_data/weld.1.csv"},
  "image": {"s3Path": "s3://iiot-custom-label-blog-bucket-unn4d0l4j0/images/weld.1.png"}
}
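
For illustration only, the following sketch shows how such manifest and dataset files could be generated; the bucket name is a hypothetical placeholder, and the CloudFormation template already creates these files for you.

import json

bucket = "my-labeling-bucket"  # hypothetical bucket name

with open("manifest.jsonl", "w") as manifest:
    for i in range(1, 6):
        # One dataset file per item, pointing to the welding image and the sensor CSV
        dataset = {
            "sensor_data": {"s3Path": f"s3://{bucket}/sensor_data/weld.{i}.csv"},
            "image": {"s3Path": f"s3://{bucket}/images/weld.{i}.png"},
        }
        with open(f"dataset-{i}.json", "w") as f:
            json.dump(dataset, f)
        # One JSON line per item in the manifest, referencing the dataset file
        manifest.write(json.dumps(
            {"source": f"s3://{bucket}/dataset/dataset-{i}.json"}) + "\n")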

Ground Truth takes each line of the manifest file and triggers the pre-labeling Lambda function, which we discuss next.

Pre-labeling Lambda function

A pre-labeling Lambda function creates a JSON object that is used to populate the item-specific portions of the labeling interface. For more information, see Processing with AWS Lambda.

Before Ground Truth displays an item for labeling to a worker, it runs the pre-labeling function and forwards the information in the manifest’s JSON line to the function. For our use case, the event passed to the function is as follows:

{
  "version": "2018-10-06", 
  "labelingJobArn": "arn:aws:sagemaker:eu-west-1:XXX:labeling-job/weldinglabeljob1",
  "dataObject": { 
    "source": "s3://iiot-custom-label-blog-bucket-unn4d0l4j0/dataset/dataset-1.json" 
  }
}

Although I omit the implementation details here (for those interested, the code is deployed with the CloudFormation template for review), the function for our labeling job uses this input to complete the following steps (a simplified sketch follows the list):

  1. Download the dataset file referenced in the source field of the input (see the preceding code).
  2. Download the CSV file containing the sensor data that is referenced in the dataset file.
  3. Generate plots for current, electrode position, and voltage from the contents of the CSV file.
  4. Upload the plot files to Amazon S3.
  5. Generate a JSON object containing the references to the aforementioned plot files and the welding image referenced in the dataset file.

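The following is a simplified sketch of such a pre-labeling function; it covers steps 1 and 5 and omits the plot generation and upload for brevity. The function deployed by the CloudFormation template remains the authoritative implementation.

import json
import boto3

s3 = boto3.client("s3")

def split_s3_path(s3_path):
    # "s3://bucket/key" -> ("bucket", "key")
    bucket, _, key = s3_path.replace("s3://", "").partition("/")
    return bucket, key

def lambda_handler(event, context):
    # Step 1: download the dataset file referenced in the manifest line
    bucket, key = split_s3_path(event["dataObject"]["source"])
    dataset = json.loads(s3.get_object(Bucket=bucket, Key=key)["Body"].read())

    # Steps 2-4 (downloading the CSV, generating the plots, and uploading
    # them to the "plots" prefix) are omitted here for brevity.

    # Step 5: assemble the taskInput object consumed by the labeling UI
    task_input = {
        "image": {
            "file": dataset["image"]["s3Path"],
            "title": " from image at " + dataset["image"]["s3Path"],
        },
        # The current, electrode, and voltage entries would reference the
        # generated plot images on Amazon S3.
    }
    return {"taskInput": task_input, "isHumanAnnotationRequired": "true"}
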
When these steps are complete, the function returns a JSON object with two parts:

  • taskInput – Fully customizable JSON object that contains information to be displayed on the labeling UI.
  • isHumanAnnotationRequired – A string representing a Boolean value (True or False), which you can use to exclude objects from being labeled by humans. I don’t use this flag for this use case because we want to label all the provided data items.

For more information, see Processing with AWS Lambda.

Because I want to show the welding images and the three graphs for current, electrode position, and voltage, the result of the Lambda function is as follows for the first dataset:

{
  "taskInput": { 
    "image": { 
      "file": "s3://iiot-custom-label-blog-bucket-unn4d0l4j0/images/weld.1.png", 
      "title": " from image at s3://iiot-custom-label-blog-bucket-unn4d0l4j0/images/weld.1.png"
    }, 
    "voltage": { 
      "file": "s3://iiot-custom-label-blog-bucket-unn4d0l4j0/plots/weld.1.csv-current.png", 
      "title": " from file at plots/weld.1.csv-current.png"
    },
    "electrode": { 
      "file": "s3://iiot-custom-label-blog-bucket-unn4d0l4j0/plots/weld.1.csv-electrode_pos.png", 
      "title": " from file at plots/weld.1.csv-electrode_pos.png" 
    }, 
    "current": { 
      "file": "s3://iiot-custom-label-blog-bucket-unn4d0l4j0/plots/weld.1.csv-voltage.png", 
      "title": " from file at plots/weld.1.csv-voltage.png" 
    } 
  }, 
  "isHumanAnnotationRequired": "true"
}

In the preceding code, the taskInput is fully customizable; the function returns the Amazon S3 paths to the images to display, and also a title, which has some non-functional text. Next, I show how to access these different parts of the taskInput JSON object when building the customized labeling UI displayed to workers by Ground Truth.

Labeling UI: Accessing taskInput content

Ground Truth uses the output of the Lambda function to fill in content into the HTML skeleton that is provided at the creation of the labeling job. In general, the contents of the taskInput output object is accessed using task.input in the HTML code.

For instance, to retrieve the Amazon S3 path where the welding image is stored from the output, you need to access the path taskInput/image/file. Because the taskInput object from the function output is mapped to task.input in the HTML, the corresponding reference to the welding image file is task.input.image.file. This reference is directly integrated into the HTML code of the labeling UI to display the welding image:

<img style="height: 30vh; margin-bottom: 10px" src="{{ task.input.image.file | grant_read_access }}"/>

The grant_read_access filter is needed for files in S3 buckets that aren’t publicly accessible. This makes sure that the URL passed to the browser contains a short-lived access token for the image and thereby avoids having to make resources publicly accessible for labeling jobs. This is often mandatory because the data to be labeled, such as machine data, is confidential. Because the pre-labeling function has also converted the CSV files into plots and images, their integration into the UI is analogous.

Label consolidation Lambda function

The second Lambda function that was configured for the custom labeling job runs when all workers have labeled an item or the time limit of the labeling job is reached. The key task of this function is to derive a single label from the responses of the workers. Additionally, the function can be used for any kind of further processing of the labeled data, such as storing it on Amazon S3 in a format ideally suited for the ML pipeline that you use.

Although there are different possible strategies to consolidate labels, I focus on the cornerstones of the implementation for such a function and show how they translate to our use case. The consolidation function is triggered by an event similar to the following JSON code:

{ 
  "version": "2018-10-06", 
  "labelingJobArn": "arn:aws:sagemaker:eu-west-1:261679111194:labeling-job/weldinglabeljob1", 
  "payload": { 
    "s3Uri": "s3://iiot-custom-label-blog-bucket-unn4d0l4j0/output/WeldingLabelJob1/annotations/consolidated-annotation/consolidation-request/iteration-1/2020-09-15_16:16:11.json" 
  }, 
  "labelAttributeName": "WeldingLabelJob1", 
  "roleArn": "arn:aws:iam::261679111194:role/AmazonSageMaker-Service-role-unn4d0l4j0", 
  "outputConfig": "s3://iiot-custom-label-blog-bucket-unn4d0l4j0/output/WeldingLabelJob1/annotations", 
  "maxHumanWorkersPerDataObject": 1 
}

The key item in this event is the payload, which contains an s3Uri pointing to a file stored on Amazon S3. This payload file contains the list of datasets that have been labeled and the labels assigned to them by workers. The following code is an example of such a list entry:

{ 
  "datasetObjectId": "4", 
  "dataObject": { 
    "s3Uri": "s3://iiot-custom-label-blog-bucket-unn4d0l4j0/dataset/dataset-5.json" 
  }, 
  "annotations": [ 
    {
      "workerId": "private.eu-west-1.abd2ec3e354db315",
      "annotationData": {
          "content": "{\"WeldingClassification\":{\"label\":\"Not sure\"}}"
      }
    } 
  ] 
}

Along with an identifier that you could use to determine which worker labeled the item, each entry lists the labels that have been assigned to the dataset. In the case of multiple workers, there are multiple entries in annotations. Because I created a single worker that labeled all the items for this post, there is only a single entry. The file dataset-5.json has been labeled with Not Sure for the classifier WeldingClassification.

The label consolidation function has to iterate over all list entries and determine for each dataset a label to use as the ground truth for supervised ML training. Ground Truth expects the function to return a list containing an entry for each dataset item with the following structure:

{ 
  "datasetObjectId": "4", 
  "consolidatedAnnotation": { 
    "content": { 
      "WeldingLabelJob1": {
         "WeldingClassification": "Not sure" 
      } 
    }
  } 
}

Each entry of the returned list must contain the datasetObjectId for the corresponding entry in the payload file and a JSON object consolidatedAnnotation, which contains an object content. Ground Truth expects content to contain a key that equals the name of the labeling job (for our use case, WeldingLabelJob1). For more information, see Processing with AWS Lambda. You can change this behavior when you create the labeling job by selecting I want to specify a label attribute name different from the labeling job name and entering a label attribute name.

The content inside this key equaling the name of the labeling job is freely configurable and can be arbitrarily complex. For our use case, it’s enough to return the assigned label Not Sure. If any of these formatting requirements are not met, Ground Truth assumes the labeling job didn’t run properly and failed.
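
To make these requirements concrete, the following is a minimal sketch of a consolidation function that assumes a single annotation per data object, as is the case for the private workforce used in this post; the function deployed by the CloudFormation template remains the reference implementation.

import json
import boto3

s3 = boto3.client("s3")

def lambda_handler(event, context):
    # Download the payload file that lists all labeled data objects
    payload_uri = event["payload"]["s3Uri"]
    bucket, _, key = payload_uri.replace("s3://", "").partition("/")
    annotations = json.loads(s3.get_object(Bucket=bucket, Key=key)["Body"].read())

    label_attribute_name = event["labelAttributeName"]  # e.g. WeldingLabelJob1
    consolidated = []
    for item in annotations:
        # Take the first worker's response; with multiple workers, a real
        # implementation could apply majority voting here instead.
        worker_response = json.loads(
            item["annotations"][0]["annotationData"]["content"])
        consolidated.append({
            "datasetObjectId": item["datasetObjectId"],
            "consolidatedAnnotation": {
                "content": {
                    label_attribute_name: {
                        "WeldingClassification":
                            worker_response["WeldingClassification"]["label"]
                    }
                }
            },
        })
    return consolidated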

Because these formatting requirements are met and I specified output as the desired prefix during the creation of the labeling job, Ground Truth uploads the consolidated labels as a list of JSON entries to the bucket under the following prefix:

output/WeldingLabelJob1/annotations/consolidated-annotation/consolidation-response/iteration-1/

You can use such files for training ML algorithms in Amazon SageMaker or for further processing.

Cleaning up

To avoid incurring future charges, delete all resources created for this post.

  1. On the AWS CloudFormation console, choose Stacks.
  2. Select the stack iiot-custom-label-blog.
  3. Choose Delete.

This step removes all files and the S3 bucket from your account. The process takes about 3–5 minutes.

Conclusion

Supervised ML requires labeled data, and Ground Truth provides a platform for creating labeling workflows. This post showed how to build a complex industrial IoT labeling workflow, in which data from multiple sources needs to be considered for labeling items. The post explained how to create a custom labeling job and provided details on the mechanisms Ground Truth requires to implement such a workflow. To get started with writing your own custom labeling job, refer to the custom labeling documentation for Ground Truth, and consider re-deploying the CloudFormation template of this post to get a sample of the pre-labeling and consolidation Lambda functions. The blog post “Creating custom labeling jobs with AWS Lambda and Amazon SageMaker Ground Truth” provides further insights into building custom labeling jobs.


About the Author

As a Principal Prototyping Engagement Manager, Dr. Markus Bestehorn is responsible for building business-critical prototypes with AWS customers, and is a specialist for IoT and machine learning. His “career” started as a 7-year-old when he got his hands on a computer with two 5.25” floppy disks, no hard disk, and no mouse, on which he started writing BASIC, and later C as well as C++ programs. He holds a PhD in computer science and all currently available AWS certifications. When he’s not on the computer, he runs or climbs mountains.

Read More

Building predictive disease models using Amazon SageMaker with Amazon HealthLake normalized data

In this post, we walk you through the steps to build machine learning (ML) models in Amazon SageMaker with data stored in Amazon HealthLake, using two example predictive disease models we trained on sample data from the MIMIC-III dataset. This dataset was developed by the MIT Lab for Computational Physiology and consists of de-identified healthcare data associated with approximately 60,000 ICU admissions. The dataset includes multiple attributes about the patients, like their demographics, vital signs, and medications, along with their clinical notes. We first developed the models using the structured data such as demographics, vital signs, and medications. Then we augmented these models with additional data extracted and normalized from clinical notes to test and compare their performance. In both experiments, whether framed as a supervised learning (classification) or an unsupervised learning (clustering) problem, we found an improvement in model performance. We present our findings and the setup of the experiments in this post.

Why multiple modalities?

Modality can be defined as the classification of a single independent sensory input/output between a computer and a human. For example, we can see objects and hear sounds using our senses; these can be considered two separate modalities. Datasets that represent multiple modalities are categorized as multi-modal datasets. For instance, images can include tags that help search and organize them, and textual data can contain images to explain what’s in the image. When medical practitioners make clinical decisions, it’s usually based on information gathered from a variety of healthcare data modalities. A physician looks at a patient’s observations, their past history, their scans, and even physical characteristics of the patient during the visit to make a definitive diagnosis. ML models need to take this into account when trying to achieve real-world performance. The post Building a medical image search platform on AWS shows how you can combine features from medical images and their corresponding radiology reports to create a medical image search platform. The challenge with creating such models is the preprocessing of these multi-modal datasets and extracting appropriate features from them.

Amazon HealthLake makes it easier to train models on multi-modal data

Amazon HealthLake is a HIPAA eligible service that enables healthcare providers, health insurance companies, and pharmaceutical companies to store, transform, query, and analyze health data on the AWS Cloud at petabyte scale. As part of the transformation, Amazon HealthLake tags and indexes unstructured data using specialized ML models. These tags and indexes can be used to query and search as well as understand relationships in the data for analytics.

When you export data from Amazon HealthLake, it adds a resource called DocumentReference to the output. This resource consists of clinical entities (like medications, medical conditions, anatomy, and Protected Health Information (PHI)), the RxNorm codes for medications, and the ICD-10 codes for medical conditions that are automatically derived from the unstructured notes about the patients. These are additional attributes about the patients that are embedded within the unstructured portions of their clinical records and would have been largely ignored for downstream analysis. Combining the structured data from the EHR with these attributes provides a more holistic picture of the patient and their conditions. To help determine the value of these attributes, we created a couple of experiments around clinical outcome prediction.

Architecture overview

The following diagram illustrates the architecture for our experiments.

You can export the normalized data to an Amazon Simple Storage Service (Amazon S3) bucket using the Export API. Then we use AWS Glue to crawl and build a catalog of the data. This catalog is used by Amazon Athena to run queries directly on the data exported from Amazon HealthLake. Athena also normalizes the JSON format files to rows and columns for easy querying. The DocumentReference resource JSON file is processed separately to extract indexed data derived from the unstructured portions of the patient records. The file consists of an extension tag that has a hierarchical JSON output consisting of patient attributes. There are multiple ways to process this file (like using Python-based JSON parsers or even string-based regex and pattern matching). For an example implementation, see the section Connecting Athena with HealthLake in the post Population health applications with Amazon HealthLake – Part 1: Analytics and monitoring using Amazon QuickSight.

Example setup

Accessing the MIMIC-III dataset requires you to request access. As part of this post, we don’t distribute any data but instead provide the setup steps so you can replicate these experiments when you have access to MIMIC-III. We also publish our conclusions and findings from the results.

For the first experiment, we build a binary disease classification model to predict patients with congestive heart failure (CHF). We measure its performance using accuracy, ROC, and confusion matrix for both structured and unstructured patient records. For the second experiment, we cluster a cohort of patients into a fixed number of groups and visualize the cluster separation before and after the addition of the unstructured patient records. For both our experiments, we build a baseline model and compare it with the multi-modal model, where we combine existing structured data with additional features (ICD-10 codes and Rx-Norm codes) in our training set.

These experiments aren’t intended to produce state-of-the-art models on real-world datasets; their purpose is to demonstrate how you can utilize features exported from Amazon HealthLake for training models on structured and unstructured patient records to improve your overall model performance.

Features and data normalization

We took a variety of features related to patient encounters to train our models. This included the patient demographics (gender, marital status), the clinical conditions, procedures, medications, and observations. Because each patient could have multiple encounters consisting of multiple observations, clinical conditions, procedures, and medications, we normalized the data and converted each of these features into a list. This allowed us to get a training set with all these features (as a list) for each patient.

Similarly, for the unstructured features that Amazon HealthLake converted into the DocumentReference resource, we extracted the ICD-10 codes and Rx-Norm codes (using the methods described in the architecture) and converted them into feature vectors.

Feature engineering and model

For the categorical attributes in our dataset, we used a label encoder to convert the attributes into a numerical representation. For all other list attributes, we used term frequency-inverse document frequency (TF-IDF) vectors. This high-dimensional dataset was then shuffled and divided into 80% train and 20% test sets for training and evaluation of the models, respectively. For training our model, we used the gradient boosting library XGBoost. We considered mostly default hyperparameters and didn’t perform any hyperparameter tuning, because our objective was only to train a baseline model with structured patient records and then show improvement on those results with the unstructured features. Adopting better hyperparameters or changing to other feature engineering and modelling approaches can likely improve these results.
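
The following is an illustrative sketch of this feature engineering and training approach; the file name, column names, and label column are hypothetical placeholders and not part of the MIMIC-III schema.

import numpy as np
import pandas as pd
import xgboost as xgb
from sklearn.preprocessing import LabelEncoder
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# One row per patient; list attributes are stored as space-separated strings
df = pd.read_csv("patient_features.csv")

# Label-encode the categorical demographic attributes
for col in ["gender", "marital_status"]:
    df[col] = LabelEncoder().fit_transform(df[col].astype(str))

# TF-IDF vectors for a list-valued attribute such as clinical conditions
conditions_tfidf = TfidfVectorizer().fit_transform(df["conditions"].astype(str))

# Combine the encoded demographics with the TF-IDF features
X = np.hstack([df[["gender", "marital_status"]].to_numpy(),
               conditions_tfidf.toarray()])
y = df["chf_label"].to_numpy()

# 80/20 shuffled split, then XGBoost with mostly default hyperparameters
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, shuffle=True, random_state=42)
model = xgb.XGBClassifier()
model.fit(X_train, y_train)
print("Accuracy:", accuracy_score(y_test, model.predict(X_test)))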

Example 1: Predicting patients with a congestive heart failure

For the first experiment, we took 500 patients with a positive CHF diagnosis. For the negative class, we randomly selected 500 patients who didn’t have a CHF diagnosis. We removed the clinical conditions from the positive class of patients that were directly related to CHF. For example, all the patients in the positive class were expected to have ICD-9 code 428, which stands for CHF. We filtered that out from the positive class to make sure the model is not overfitting on the clinical condition.

Baseline model

Our baseline model had an accuracy of 85.8%. The following graph shows the ROC curve.

The following graph shows the confusion matrix.

Amazon HealthLake augmented model

Our Amazon HealthLake augmented model had an accuracy of 89.1%. The following graph shows the ROC curve.

The following graph shows the confusion matrix.

Adding the features extracted from Amazon HealthLake allowed us to improve the model accuracy from 85% to 89% and also the AUC from 0.86 to 0.89. If you look at the confusion matrices for the two models, the false positives reduced from 20 to 13 and the false negatives reduced from 27 to 20.
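
For reference, the metrics discussed here can be computed as in the following sketch, which assumes a trained binary classifier and a held-out test set such as those produced in the earlier training sketch.

from sklearn.metrics import accuracy_score, roc_auc_score, confusion_matrix

def evaluate_model(model, X_test, y_test):
    # Class predictions for accuracy and the confusion matrix,
    # predicted probabilities of the positive class for the AUC
    y_pred = model.predict(X_test)
    y_prob = model.predict_proba(X_test)[:, 1]
    print("Accuracy:", accuracy_score(y_test, y_pred))
    print("AUC:", roc_auc_score(y_test, y_prob))
    print("Confusion matrix:\n", confusion_matrix(y_test, y_pred))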

Optimizing healthcare is about ensuring the patient is associated with their peers and the right cohort. As patient data is added or changes, it’s important to continuously identify and reduce false negative and positive identifiers for overall improvement in the quality of care.

To better explain the performance improvements, we picked a patient from the false negative cohort in the first model who moved to true positive in the second model. We plotted a word cloud for the top medical conditions for this patient for the first and the second model, as shown in the following images.

There is a clear difference between the medical conditions of the patient before and after the addition of features from Amazon HealthLake. The word cloud for model 2 is richer, with more medical conditions indicative of CHF than the one for model 1. The data embedded within the unstructured notes for this patient extracted by Amazon HealthLake helped this patient move from a false negative category to a true positive.

These numbers are based on experimental data from a subset of MIMIC-III patients. In a real-world scenario with a higher volume of patients, these numbers may differ.

Example 2: Grouping patients diagnosed with sepsis

For the second experiment, we took 500 patients with a positive sepsis diagnosis. We grouped these patients on the basis of their structured clinical records using k-means clustering. To show that this is a repeatable pattern, we chose the same feature engineering techniques as described in experiment 1. We didn’t divide the data into training and testing datasets because we were implementing an unsupervised learning algorithm.

We first analyzed the optimal number of clusters for the grouping using the elbow method and arrived at the curve shown in the following graph.

This allowed us to determine that six was the optimal number of clusters for our patient grouping.
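The elbow analysis can be sketched as follows, assuming `X` holds the engineered feature matrix for the 500 sepsis patients (built with the same techniques as in experiment 1).

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Fit k-means for a range of k and record the inertia (within-cluster sum of squares)
ks = range(1, 11)
inertias = []
for k in ks:
    km = KMeans(n_clusters=k, random_state=0, n_init=10)
    km.fit(X)
    inertias.append(km.inertia_)

# The "elbow" where the curve flattens suggests the number of clusters (six in our case)
plt.plot(ks, inertias, marker="o")
plt.xlabel("Number of clusters (k)")
plt.ylabel("Inertia")
plt.show()
```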

Baseline model

We reduced the dimensionality of the input data to two components using principal component analysis (PCA) and plotted the following scatter plot.
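A sketch of the projection and scatter plot is shown below, again assuming `X` is the feature matrix; `labels` holds the cluster assignments from a six-cluster k-means fit.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

# Cluster assignments for the six-cluster model
labels = KMeans(n_clusters=6, random_state=0, n_init=10).fit_predict(X)

# Project the high-dimensional features onto two principal components for plotting
coords = PCA(n_components=2).fit_transform(X)
plt.scatter(coords[:, 0], coords[:, 1], c=labels, cmap="tab10", s=10)
plt.xlabel("PC 1")
plt.ylabel("PC 2")
plt.show()

# Count of patients in each cluster
print(np.bincount(labels))
```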

The following were the counts of patients across each cluster:

  * Cluster 1 – 44 patients
  * Cluster 2 – 30 patients
  * Cluster 3 – 109 patients
  * Cluster 4 – 66 patients
  * Cluster 5 – 106 patients
  * Cluster 6 – 145 patients

We found that at least four of the six clusters had a distinct overlap of patients, which means the structured clinical features weren't enough to clearly divide the patients into six groups.

Enhanced model

For the enhanced model, we added the ICD-10 codes and their corresponding descriptions for each patient, as extracted by Amazon HealthLake. This time, we could see a clear separation of the patient groups.
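One way to append the extracted codes to the feature matrix is sketched below. The `healthlake_df` DataFrame with `icd10_codes` and `icd10_descriptions` columns is a hypothetical per-patient export of the Amazon HealthLake extractions, and `X` is the structured feature matrix from the baseline clustering.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# TF-IDF features over the extracted ICD-10 codes and their descriptions
code_tfidf = TfidfVectorizer().fit_transform(healthlake_df["icd10_codes"].fillna("")).toarray()
desc_tfidf = TfidfVectorizer().fit_transform(healthlake_df["icd10_descriptions"].fillna("")).toarray()

# Append the new features to the structured feature matrix and re-cluster
X_enhanced = np.hstack([X, code_tfidf, desc_tfidf])
labels_enhanced = KMeans(n_clusters=6, random_state=0, n_init=10).fit_predict(X_enhanced)
print(np.bincount(labels_enhanced))
```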

We also saw a change in distribution across the six clusters:

  * Cluster 1 – 54 patients
  * Cluster 2 – 154 patients
  * Cluster 3 – 64 patients
  * Cluster 4 – 44 patients
  * Cluster 5 – 109 patients
  * Cluster 6 – 75 patients

As you can see, adding features from the patients' unstructured data allowed us to improve the clustering model so that it clearly divides the patients into six clusters. We also saw that some patients moved across clusters, indicating that the model became better at grouping those patients based on their unstructured clinical records.

Conclusion

In this post, we demonstrated how you can easily use SageMaker to build ML models on your data in Amazon HealthLake. We also demonstrated the advantages of augmenting structured data with features from unstructured clinical notes to improve the accuracy of disease prediction models. We hope this body of work provides you with examples of how to build ML models using SageMaker with your data stored and normalized in Amazon HealthLake, and how to improve model performance for clinical outcome predictions. To learn more about Amazon HealthLake, check out the website and technical documentation.


About the Authors

Ujjwal Ratan is a Principal Machine Learning Specialist in the Global Healthcare and Life Sciences team at Amazon Web Services. He works on the application of machine learning and deep learning to real-world industry problems like medical imaging, unstructured clinical text, genomics, precision medicine, clinical trials, and quality of care improvement. He has expertise in scaling machine learning and deep learning algorithms on the AWS Cloud for accelerated training and inference. In his free time, he enjoys listening to (and playing) music and taking unplanned road trips with his family.

 

Nihir Chadderwala is an AI/ML Solutions Architect on the Global Healthcare and Life Sciences team. His background is building big data and AI-powered solutions to customer problems in a variety of domains, such as software, media, automotive, and healthcare. In his spare time, he enjoys playing tennis and watching and reading about the cosmos.

 

Parminder Bhatia is a science leader in AWS Health AI, currently building deep learning algorithms for the clinical domain at scale. His expertise is in machine learning and large-scale text analysis techniques in low-resource settings, especially in biomedical, life sciences, and healthcare technologies. He enjoys playing soccer, water sports, and traveling with his family.
