How Logz.io accelerates ML recommendations and anomaly detection solutions with Amazon SageMaker

Logz.io is an AWS Partner Network (APN) Advanced Technology Partner with AWS Competencies in DevOps, Security, and Data & Analytics. Logz.io offers a software as a service (SaaS) observability platform based on best-in-class open-source software solutions for log, metric, and tracing analytics. Customers are sending an increasing amount of data to Logz.io from various data sources to manage the health and performance of their applications and services. It can be overwhelming for new users who are looking to navigate across the various dashboards built over time, process different alert notifications, and connect the dots when troubleshooting production issues.

Mean time to detect (MTTD) and mean time to resolution (MTTR) are key metrics for our customers. They’re calculated by measuring the time from when a user in our platform starts to investigate an issue (such as a production service being down) to the point when they stop performing actions in the platform related to that specific investigation.

To help customers reduce MTTD and MTTR, Logz.io is turning to machine learning (ML) to provide recommendations for relevant dashboards and queries and perform anomaly detection via self-learning. As a result, the average user is equipped with the aggregated experience of their entire company, leveraging the wisdom of many. We found that our solution can reduce MTTR by up to 20%.

As MTTD decreases, users can identify the problem and resolve it faster. Our data semantic layer contains semantics for starting and stopping an investigation, and the popularity of each action users take with respect to a specific alert.

In this post, we share how Logz.io used Amazon SageMaker to reduce the time and effort for our proof of concept (POC), experiments from research to production evaluation, and how we reduced our production inference cost.

The challenge

Before Logz.io adopted SageMaker, the time from research to POC testing and production experiments was quite lengthy. This was because we needed to create Spark jobs to collect, clean, and normalize the data, and this DevOps work was required for each data source. DevOps and data engineering skills aren’t part of our ML team, which created a strong dependency between the teams.

Another challenge was to provide an ML inference service to our products while achieving an optimal cost vs. performance ratio. Our optimal scenario is supporting as many models as possible per computing unit, while providing high concurrency from customers with many models. We had flexibility on our inference time, because the initial window of the data stream for the inference service is a 5-minute bucket of logs.

Research phase

Data science is an iterative process that requires an interactive development environment for research, validating the data output on every iteration and data processing. Therefore, we encourage our ML researchers to use notebooks.

To accelerate the iteration cycle, we wanted to test our notebooks’ code on real production data, while running it at scale. Furthermore, we wanted to avoid the DevOps and data engineering bottleneck during the initial test in production, while still being able to view the outputs and estimate the code runtime.

To implement this, we wanted to give our data science team full control and end-to-end responsibility from research to the initial test in production. We needed them to easily pull data, while preserving data access management and monitoring that access. They also needed to easily deploy their custom POC notebooks into production in a scalable manner, while monitoring the runtime and expected costs.

Evaluation phase

During this phase, we evaluated a few ML platforms in order to support both training and serving requirements. We found that SageMaker is the most appropriate for our use cases because it supports both training and inference. Furthermore, it’s customizable, so we can tailor it according to our preferred research process.

Initially, we started from local notebooks, testing various libraries. We ran into problems pulling massive amounts of data from production. Later, we got stuck at a point in the modeling phase where jobs took many hours on a local machine.

We evaluated many solutions and finally chose the following architecture:

  • DataPlate – The open-source version of DataPlate helped us pull and join our data easily by utilizing our Spark Amazon EMR clusters with a simple SQL, while monitoring the data access
  • SageMaker notebook instance and processing jobs – These gave us scalability of runtime and flexibility of machine types and ML frameworks, while letting us collaborate on our code via a Git connection

Research phase solution architecture

The following diagram illustrates the solution architecture of the research phase, which consists of the following components:

  • SageMaker notebooks – Data scientists use these notebooks to conduct their research.
  • AWS Lambda function – AWS Lambda is a serverless solution that runs a processing job on demand. The job uses a Docker container with the notebook we want to run during our experiment, together with all the common files needed to support the notebook (requirements.txt and the multi-processing functions code in a separate notebook).
  • Amazon ECR – Amazon Elastic Container Registry (Amazon ECR) stores our Docker container.
  • SageMaker Processing job – We can run this data processing job on any ML machine, and it runs our notebook with parameters.
  • DataPlate – This service helps us use SQL and join several data sources easily. It translates the SQL to Spark code and optimizes it, while monitoring data access and helping reduce data breaches. The Xtra version provided even more capabilities.
  • Amazon EMR – This service runs our data extractions as workloads over Spark, contacting all our data resources.

With the SageMaker notebook instance lifecycle configuration, we can control the maximum notebook instance runtime, using the autostop.py template script.

After testing the ML frameworks, we chose the SageMaker MXNet kernel for our clustering and ranking phases.

To test the notebook code on our production data, we encapsulated the notebook in a Docker container stored in Amazon ECR and ran it as a processing job to validate the maximum runtime on different types of machines.

The Docker container also helps us share resources among notebook tests. In some cases, a notebook calls other notebooks to utilize multi-processing by splitting big data frames into smaller data frames that can run simultaneously on each vCPU of a large machine type.

The real-time production inference solution

In the research phase, we used Parquet files in Amazon Simple Storage Service (Amazon S3) to maintain our recommendations. These are consumed once a day by our engineering pipeline to attach the recommendations to our alerts mechanism.

However, our roadmap requires a higher refresh rate, and pulling once a day isn’t enough in the long term, because we want to provide recommendations even during the investigation.

To implement this solution at scale, we tested most of the SageMaker endpoint solutions in our anomaly-detection research. We tested 500 pre-built models on single-endpoint machines of various types and used concurrent multi-threaded clients to send requests to the endpoint. We measured the response time, CPU, memory, and other metrics (for more information, see Monitor Amazon SageMaker with Amazon CloudWatch). We found that the multi-model endpoint is a perfect fit for our use cases.

A multi-model endpoint can reduce our costs dramatically in comparison to a single endpoint, or even to Flask (or other Python) web services running on Kubernetes. Our first assumption was that we must provide a single endpoint, using a small 4-vCPU machine, for each customer, which on average queries four dedicated models, because each vCPU serves one model. With the multi-model endpoint, we could aggregate more customers on a single multi-model endpoint machine.

We had a model and encoding files per customer, and after load tests, we determined that we could serve 50 customers, each using 10 models, even on the smallest ml.t2.medium instance.

In this stage, we considered using multi-model endpoints. Multi-model endpoints provide a scalable and cost-effective solution to deploy a large number of models, enabling you to host multiple models with a single inference container. This reduces hosting costs by improving endpoint utilization compared to using multiple small single-model endpoints that each serve a single customer. It also reduces deployment overhead because SageMaker manages loading models in memory and scaling them based on the traffic patterns to them.

Furthermore, an advantage of the multi-model endpoint is that if you have a high inference rate from specific customers, the framework keeps the most recently served models in memory for better performance.
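
To illustrate the pattern, the following is a minimal boto3 sketch of how a multi-model endpoint can be created and invoked. All names, the container image, the S3 prefix, and the request payload are hypothetical placeholders rather than our production setup.

import boto3

sm = boto3.client('sagemaker')
smr = boto3.client('sagemaker-runtime')

# A multi-model endpoint points at an S3 prefix that holds many model archives,
# for example one <customer>/<model>.tar.gz per customer model.
sm.create_model(
    ModelName='recommendations-mme',
    ExecutionRoleArn='<SAGEMAKER_EXECUTION_ROLE_ARN>',
    PrimaryContainer={
        'Image': '<INFERENCE_CONTAINER_IMAGE_URI>',
        'Mode': 'MultiModel',
        'ModelDataUrl': 's3://<BUCKET>/models/',  # prefix, not a single artifact
    },
)

sm.create_endpoint_config(
    EndpointConfigName='recommendations-mme-config',
    ProductionVariants=[{
        'VariantName': 'AllTraffic',
        'ModelName': 'recommendations-mme',
        'InstanceType': 'ml.t2.medium',
        'InitialInstanceCount': 1,
    }],
)

sm.create_endpoint(
    EndpointName='recommendations-mme',
    EndpointConfigName='recommendations-mme-config',
)

# At inference time, TargetModel selects which model archive to load and serve;
# recently served models stay cached in memory.
response = smr.invoke_endpoint(
    EndpointName='recommendations-mme',
    TargetModel='customer-42/anomaly-model.tar.gz',
    ContentType='application/json',
    Body=b'{"logs_window": "..."}',
)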

After we estimated costs for multi-model endpoints vs. standard endpoints, we found that this approach could potentially lead to a cost reduction of approximately 80%.

The outcome

In this section, we review the steps and the outcome of the process.

We use the notebook lifecycle configuration to enable running notebooks as processing jobs by encapsulating each notebook in a Docker container, which lets us validate the code faster and use the autostop mechanism:

#!/bin/bash

# Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License"). You
# may not use this file except in compliance with the License. A copy of
# the License is located at
#
#     http://aws.amazon.com/apache2.0/
#
# or in the "license" file accompanying this file. This file is
# distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF
# ANY KIND, either express or implied. See the License for the specific
# language governing permissions and limitations under the License.

set -e

# OVERVIEW
# This script installs the sagemaker_run_notebook extension package in SageMaker Notebook Instance
#
# There are two parameters you need to set:
# 1. S3_LOCATION is the place in S3 where you put the extension tarball
# 2. TARBALL is the name of the tar file that you uploaded to S3. You should just need to check
#    that you have the version right.
sudo -u ec2-user -i <<'EOF'
# PARAMETERS
VERSION=0.18.0
EXTENSION_NAME=sagemaker_run_notebook
# Set up the user setting and workspace directories
mkdir -p /home/ec2-user/SageMaker/.jupyter-user/{workspaces,user-settings}
# Run in the conda environment that the Jupyter server uses so that our changes are picked up
source /home/ec2-user/anaconda3/bin/activate JupyterSystemEnv
# Install the extension and rebuild JupyterLab so it picks up the new UI
aws s3 cp s3://aws-emr-resources-11111111-us-east-1/infra-sagemaker/sagemaker_run_notebook-0.18.0-Logz-latest.tar.gz ./sagemaker_run_notebook-0.18.0-Logz-latest.tar.gz
pip install sagemaker_run_notebook-0.18.0-Logz-latest.tar.gz

jupyter lab build
source /home/ec2-user/anaconda3/bin/deactivate
EOF

# sudo -u ec2-user -i <<'EOF'
# PARAMETERS
for PACKAGE in pandas dataplate awswrangler==2.0.0 ipynb==0.5.1 prison==0.1.3 PyMySQL==0.10.1 requests==2.25.0 scipy==1.5.4 dtaidistance joblib sagemaker_run_notebook-0.18.0-Logz-latest.tar.gz fuzzywuzzy==0.18.0; do
  echo $PACKAGE

  # Note that "base" is special environment name, include it there as well.
  for env in base /home/ec2-user/anaconda3/envs/*; do
      env_name=$(basename "$env")
      # Skip JupyterSystemEnv; its packages are handled in the block above.
      if [ "$env_name" = 'JupyterSystemEnv' ]; then
          continue
      fi
      source /home/ec2-user/anaconda3/bin/activate "$env_name"
      pip install --upgrade "$PACKAGE"
      source /home/ec2-user/anaconda3/bin/deactivate
  done
done
jupyter lab build

# Tell Jupyter to use the user-settings and workspaces directory on the EBS
# volume.
echo "export JUPYTERLAB_SETTINGS_DIR=/home/ec2-user/SageMaker/.jupyter-user/user-settings" >> /etc/profile.d/jupyter-env.sh
echo "export JUPYTERLAB_WORKSPACES_DIR=/home/ec2-user/SageMaker/.jupyter-user/workspaces" >> /etc/profile.d/jupyter-env.sh

# The Jupyter server needs to be restarted to pick up the server part of the
# extension. This needs to be done as root.
initctl restart jupyter-server --no-wait

# OVERVIEW
# This script stops a SageMaker notebook once it's idle for more than 1 hour (default time)
# You can change the idle time for stop using the environment variable below.
# If you want the notebook to stop only when no browsers are open, remove the --ignore-connections flag
#
# Note that this script will fail if either condition is not met
#   1. Ensure the Notebook Instance has internet connectivity to fetch the example config
#   2. Ensure the Notebook Instance execution role has permissions for sagemaker:StopNotebookInstance to stop the notebook
#       and sagemaker:DescribeNotebookInstance to describe the notebook.
# PARAMETERS
IDLE_TIME=3600

echo "Fetching the autostop script"
wget https://raw.githubusercontent.com/aws-samples/amazon-sagemaker-notebook-instance-lifecycle-config-samples/master/scripts/auto-stop-idle/autostop.py

echo "Starting the SageMaker autostop script in cron"

(crontab -l 2>/dev/null; echo "*/5 * * * * /usr/bin/python $PWD/autostop.py --time $IDLE_TIME --ignore-connections") | crontab -

We clone the sagemaker-run-notebook GitHub project, and add the following to the container:

  • Our pip requirements
  • The ability to run notebooks from within a notebook, which gives us multi-processing behavior to utilize all the ml.m5.12xlarge instance cores

This enables us to run workflows that consist of many notebooks running as processing jobs in a single line of code, while defining the instance type to run on.

Because we can add parameters to the notebook, we can scale our processing by running it simultaneously over different hours, days, or months to pull and process data.
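
To illustrate the pattern, here is a minimal sketch that launches a parameterized notebook run as a SageMaker Processing job using the SageMaker Python SDK’s ScriptProcessor (not the exact sagemaker-run-notebook API). The image URI, role, and the arguments accepted by execute.py are placeholders.

from sagemaker.processing import ScriptProcessor

# The container is the one we pushed to Amazon ECR; execute.py is the papermill
# driver copied into the image (see the Dockerfile later in this post).
processor = ScriptProcessor(
    image_uri='<ACCOUNT_ID>.dkr.ecr.<REGION>.amazonaws.com/<REPO>:latest',
    command=['python3'],
    role='<SAGEMAKER_EXECUTION_ROLE_ARN>',
    instance_type='ml.m5.12xlarge',  # the instance type is just a parameter
    instance_count=1,
)

# Papermill-style parameters are passed through to the notebook, so the same
# notebook can be run for different time windows in parallel jobs.
processor.run(
    code='execute.py',
    arguments=[
        '--notebook', 'research_notebook.ipynb',
        '--parameters', '{"start_date": "2021-06-01", "days": 30}',
    ],
)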

We can also create scheduling jobs that run notebooks (and even limit the run time).

We can also observe the last runs and their details, such as processing time.

With papermill, which is used in the container, we can view the output of every run, which helps us debug in production.

Our notebook output review is in the form of a standard read-only notebook.
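
Inside the container, the execution follows the standard papermill pattern, sketched below. The notebook name, output location, and parameters are illustrative, and writing the output directly to Amazon S3 assumes papermill’s S3 I/O support is available in the container.

import papermill as pm

# papermill executes the notebook with injected parameters and writes an output
# notebook that keeps every cell's output, which is what we review after a run.
pm.execute_notebook(
    input_path='research_notebook.ipynb',
    output_path='s3://<BUCKET>/notebook-runs/research_notebook-output.ipynb',
    parameters={'start_date': '2021-06-01', 'days': 1},
)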

Multi-processing helps us scale each notebook’s processing and utilize all of its cores. We generated functions in other notebooks that can do heavy processing, such as the following (a sketch of the splitting pattern follows this list):

  • Explode JSONs
  • Find relevant rows in a DataFrame while the main notebook splits the DataFrame into #cpu-cores elements
  • Run clustering per alert type actions simultaneously
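
The following is a minimal sketch of that splitting pattern using joblib (which is in our pip requirements). The filtering logic and the score column are hypothetical stand-ins for the heavy processing done in the helper notebooks.

import multiprocessing as mp

import numpy as np
import pandas as pd
from joblib import Parallel, delayed


def find_relevant_rows(chunk: pd.DataFrame) -> pd.DataFrame:
    # Placeholder for the heavy per-chunk work done in the helper notebooks.
    return chunk[chunk['score'] > 0.5]


def process_dataframe(df: pd.DataFrame) -> pd.DataFrame:
    n_cores = mp.cpu_count()
    # Split the big DataFrame into #cpu-cores smaller frames and process them in parallel.
    chunks = np.array_split(df, n_cores)
    results = Parallel(n_jobs=n_cores)(delayed(find_relevant_rows)(c) for c in chunks)
    return pd.concat(results, ignore_index=True)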

We then add these functional notebooks into the container that runs the notebook as a processing job. See the following Docker file (notice the COPY commands):

ARG BASE_IMAGE=need_an_image
FROM $BASE_IMAGE

ENV JUPYTER_ENABLE_LAB yes
ENV PYTHONUNBUFFERED TRUE

COPY requirements.txt /tmp/requirements.txt
RUN pip install papermill jupyter nteract-scrapbook boto3 requests==2.20.1
RUN pip install -r /tmp/requirements.txt

ENV PYTHONUNBUFFERED=TRUE
ENV PATH="/opt/program:${PATH}"

# Set up the program in the image
COPY multiprocessDownloadNormalizeFunctions.ipynb /tmp/multiprocessDownloadNormalizeFunctions.ipynb
COPY multiprocessFunctions.ipynb /tmp/multiprocessFunctions.ipynb
COPY run_notebook execute.py /opt/program/
ENTRYPOINT ["/bin/bash"]

# because there is a bug where you have to be root to access the directories
USER root

Results

During the research phase, we evaluated the option of running our notebooks as is to experiment and see how our code performs on all our relevant data, not just a sample. We found that encapsulating our notebooks in processing jobs is a great fit for us, because we don’t need to rewrite code, we can utilize the power of AWS compute optimized and memory optimized instances, and we can follow the status of the process easily.

During the inference assessment, we evaluated various SageMaker endpoint solutions. We found that using a multi-model endpoint can help us serve approximately 50 customers, each having multiple (approximately 10) models, on a single instance, which meets our low-latency constraints and therefore saves us up to 80% of the cost.

With this solution architecture, we were able to reduce the MTTR of our customers, which is a main metric for measuring success using our platform. It reduces the total time from the point of responding to our alert link, which describes an issue in the customer’s systems, to the point when they’re done investigating the problem using our platform. During the investigation phase, we measure the users’ actions with and without our ML recommendation solution. This helps us provide recommendations for the best action to resolve the specific issue faster and pinpoint anomalies to identify the actual cause of the problem.

Conclusion and next steps

In this post, we shared how Logz.io used SageMaker to improve MTTD and MTTR.

As a next step, we’re considering expanding the solution with the following features:

We encourage you to try out SageMaker notebooks. For more examples, check out the SageMaker examples GitHub repo.


About the Authors

Amit Gross leads the research department of Logz.io, which is responsible for the AI solutions of all Logz.io products, from the research phase to the integration phase. Prior to Logz.io, Amit managed both data science and security research groups at Here Inc. and Cellebrite Inc. Amit holds an M.Sc. in computer science from Tel Aviv University.

Yaniv Vaknin is a Machine Learning Specialist at Amazon Web Services. Prior to AWS, Yaniv held leadership positions with AI startups and Enterprise including co-founder and CEO of Dipsee.ai. Yaniv works with AWS customers to harness the power of Machine Learning to solve real world tasks and derive value. In his spare time, Yaniv enjoys playing soccer with his boys.

Eitan Sela is a Machine Learning Specialist Solutions Architect with Amazon Web Services. He works with AWS customers to provide guidance and technical assistance, helping them build and operate machine learning solutions on AWS. In his spare time, Eitan enjoys jogging and reading the latest machine learning articles.


Detect mitotic figures in whole slide images with Amazon Rekognition

More than a hundred years after its introduction, histology remains the gold standard in tumor diagnosis and prognosis. Anatomic pathologists evaluate histology to stratify cancer patients into different groups depending on their tumor genotypes and phenotypes, and their clinical outcome [1,2]. However, human evaluation of histological slides is subjective and not repeatable [3]. Furthermore, histological assessment is a time-consuming process that requires highly trained professionals.

With significant technological advances in the last decade, techniques such as whole slide imaging (WSI) and deep learning (DL) are now widely available. WSI is the scanning of conventional microscopy glass slides to produce a single, high-resolution image from those slides. This allows for the digitization and collection of large sets of pathology images, which would otherwise have been prohibitively time-consuming and expensive to assemble. The availability of such datasets creates new and innovative ways of accelerating diagnosis, such as using machine learning (ML) to aid pathologists by quickly identifying features of interest.

In this post, we will explore how developers without previous ML experience can use Amazon Rekognition Custom Labels to train a model that classifies cellular features. Amazon Rekognition Custom Labels is a feature of Amazon Rekognition that enables you to build your own specialized ML-based image analysis capabilities to detect unique objects and scenes integral to your specific use case. In particular, we use a dataset containing whole slide images of canine mammary carcinoma [1] to demonstrate how to process these images and train a model that detects mitotic figures. This dataset has been used with permission from Prof. Dr. Marc Aubreville, who has kindly agreed to allow us to use it for this post. For more information, see the Acknowledgments section at the end of this post.

Solution overview

The solution consists of two components:

  • An Amazon Rekognition Custom Labels model — To enable Amazon Rekognition to detect mitotic figures, we complete the following steps:

    • Sample the WSI dataset to produce adequately sized images using Amazon SageMaker Studio and Python code running on a Jupyter notebook. Studio is a web-based, integrated development environment (IDE) for ML that provides all the tools you need to take your models from experimentation to production while boosting your productivity. We use Studio to split the images into smaller ones to train our model.
    • Train an Amazon Rekognition Custom Labels model to recognize mitotic figures in hematoxylin-eosin samples using the data prepared in the previous step.
  • A frontend application — This demonstrates how to use a model like the one we trained in the previous step.

The following diagram illustrates the solution architecture.

All the necessary resources to deploy the implementation discussed in this post and the code for the whole section are available on GitHub. You can clone or fork the repository, make any changes you desire, and run it yourself.

In the next steps, we walk through the code to understand the different steps involved in obtaining and preparing the data, training the model, and using it from a sample application.

Costs

When running the steps in this walkthrough, you incur small costs from using the following AWS services:

  • Amazon Rekognition
  • AWS Fargate
  • Application Load Balancer
  • AWS Secrets Manager

Additionally, if no longer within the Free Tier period or conditions, you may incur costs from the following services:

  • CodePipeline
  • CodeBuild
  • Amazon ECR
  • Amazon SageMaker

If you complete the cleanup steps correctly after finishing this walkthrough, you may expect costs to be less than 10 USD, if the Amazon Rekognition Custom Labels model and the web application run for one hour or less.

Prerequisites

To complete all steps, you need the following:

Training the mitotic figure classification model

We run all the steps required to train the model from a Studio notebook. If you have never used Studio before, you may need to onboard first. For more information, see Onboard Quickly to Amazon SageMaker Studio.

Some of the following steps require more RAM than what is available in a standard ml.t3.medium notebook. Make sure that you have selected an ml.m5.large notebook. You should see a 2 vCPU + 8 GiB indication on the upper right corner of the page.

The code for this section is available as a Jupyter notebook file.

After onboarding to Studio, follow these instructions to grant Studio the necessary permissions to call Amazon Rekognition on your behalf.

Dependencies

To begin with, we need to complete the following steps:

  1. Update Linux packages and install the required dependencies, such as OpenSlide:
    !apt update > /dev/null && apt dist-upgrade -y > /dev/null
    !apt install -y build-essential openslide-tools python-openslide libgl1-mesa-glx > /dev/null

  2. Install the fastai and SlideRunner libraries using pip:
    !pip install SlideRunner SlideRunner_dataAccess fastai==1.0.61 > /dev/null

  3. Download the dataset (we provide a script to do this automatically):
    from dataset import download_dataset
    download_dataset()

Process the dataset

We will begin by importing some of the packages that we use throughout the data preparation stage. Then, we download and load the annotation database for this dataset. This database contains the positions in the whole slide images of the mitotic figures (the features we want to classify). See the following code:

%reload_ext autoreload
%autoreload 2
import os
from typing import List
import urllib.request
import numpy as np
from SlideRunner.dataAccess.database import Database
from pathlib import Path

DATABASE_URL = 'https://github.com/DeepPathology/MITOS_WSI_CMC/raw/master/databases/MITOS_WSI_CMC_MEL.sqlite'
DATABASE_FILENAME = 'MITOS_WSI_CMC_MEL.sqlite'

Path("./databases").mkdir(parents=True, exist_ok=True)
local_filename, headers = urllib.request.urlretrieve(
    DATABASE_URL,
    filename=os.path.join('databases', DATABASE_FILENAME),
)

Because we’re using SageMaker, we create a new SageMaker session object to ease tasks such as uploading our dataset to an Amazon Simple Storage Service (Amazon S3) bucket. We also use the S3 bucket that SageMaker creates by default to upload our processed image files.

The slidelist_test array contains the IDs of the slides that we use as part of the test dataset to evaluate the performance of the trained model. See the following code:

import sagemaker
sm_session = sagemaker.Session()

size=512
bucket_name = sm_session.default_bucket()

database = Database()
database.open(os.path.join('databases', DATABASE_FILENAME))

slidelist_test = ['14','18','3','22','10','15','21']

The next step is to obtain a set of areas of training and test slides, along with the labels in them, from which we can take smaller areas to use to train our model. The code for get_slides is in the sampling.py file in GitHub.

from sampling import get_slides

image_size = 512

lbl_bbox, training_slides, test_slides, files = get_slides(database, slidelist_test, negative_class=1, size=image_size)

We want to randomly sample from the training and test slides. We use the lists of training and test slides and randomly select a file n_training_images times for training, and n_test_images times for testing:

n_training_images = 500
n_test_images = int(0.2 * n_training_images)

training_files = list([
    (y, files[y]) for y in np.random.choice(
        [x for x in training_slides], n_training_images)
])
test_files = list([
    (y, files[y]) for y in np.random.choice(
        [x for x in test_slides], n_test_images)
])

Next, we create a directory for training images and one for test images:

Path("rek_slides/training").mkdir(parents=True, exist_ok=True)
Path("rek_slides/test").mkdir(parents=True, exist_ok=True)

Before we produce the smaller images needed to train the model, we need some helper code that produces the metadata needed to describe the training and test data. The following code makes sure that a given bounding box surrounding the features of interest (mitotic figures) is well within the zone we’re cutting, and produces a line of JSON that describes the image and the features in it in Amazon SageMaker Ground Truth format, which is the format Amazon Rekognition Custom Labels requires. For more information about this manifest file for object detection, see Object localization in manifest files.

def check_bbox(x_start: int, y_start: int, bbox) -> bool:
    return (bbox._left > x_start and
            bbox._right < x_start + image_size and
            bbox._top > y_start and
            bbox._bottom < y_start + image_size)
            

def get_annotation_json_line(filename, channel, annotations, labels):
    
    objects = list([{'confidence' : 1} for i in range(0, len(annotations))])
    
    return json.dumps({
        'source-ref': f's3://{bucket_name}/data/{channel}/{filename}',
        'bounding-box': {
            'image_size': [{
                'width': size,
                'height': size,
                'depth': 3
            }],
            'annotations': annotations,
        },
        'bounding-box-metadata': {
            'objects': objects,
            'class-map': dict({ x: str(x) for x in labels }),
            'type': 'groundtruth/object-detection',
            'human-annotated': 'yes',
            'creation-date': datetime.datetime.now().isoformat(),
            'job-name': 'rek-pathology',
        }
    })


def generate_annotations(x_start: int, y_start: int, bboxes, labels, filename: str, channel: str):
    annotations = []
    
    for bbox in bboxes:
        if check_bbox(x_start, y_start, bbox):
            # Get coordinates relative to this slide.
            x0 = bbox.left - x_start
            y0 = bbox.top - y_start
            
            annotation = {
                'class_id': 1,
                'top': y0,
                'left': x0,
                'width': bbox.right - bbox.left,
                'height': bbox.bottom - bbox.top
            }
            
            annotations.append(annotation)
    
    return get_annotation_json_line(filename, channel, annotations, labels)

With the generate_annotations function in place, we can write the code to produce the training and test images:

import datetime
import json
import random

from fastai import *
from fastai.vision import *
from tqdm.notebook import tqdm


# Margin size, in pixels, for training images. This is the space we leave on
# each side for the bounding box(es) to be well into the image.
margin_size = 64

training_annotations = []
test_annotations = []


def check_bbox(x_start: int, y_start: int, bbox) -> bool:
    return (bbox._left > x_start and
            bbox._right < x_start + image_size and
            bbox._top > y_start and
            bbox._bottom < y_start + image_size)


def generate_images(file_list) -> None:
    for f_idx in tqdm(range(0, len(file_list)), desc='Writing training images...'):
        slide_idx, f = file_list[f_idx]
        bboxes = lbl_bbox[slide_idx][0]
        labels = lbl_bbox[slide_idx][1]

        # Calculate the minimum and maximum horizontal and vertical positions
        # that bounding boxes should have within the image.
        x_min = min(map(lambda x: x.left, bboxes)) - margin_size
        y_min = min(map(lambda x: x.top, bboxes)) - margin_size
        x_max = max(map(lambda x: x.right, bboxes)) + margin_size
        y_max = max(map(lambda x: x.bottom, bboxes)) + margin_size

        result = False
        while not result:
            x_start = random.randint(x_min, x_max - image_size)
            y_start = random.randint(y_min, y_max - image_size)

            for bbox in bboxes:
                if check_bbox(x_start, y_start, bbox):
                    result = True
                    break

        filename = f'slide_{f_idx}.png'
        channel = 'test' if slide_idx in test_slides else 'training'
        annotation = generate_annotations(x_start, y_start, bboxes, labels, filename, channel)

        if channel == 'training':
            training_annotations.append(annotation)
        else:
            test_annotations.append(annotation)

        img = Image(pil2tensor(f.get_patch(x_start, y_start) / 255., np.float32))
        img.save(f'rek_slides/{channel}/{filename}')

generate_images(training_files)
generate_images(test_files)

The last step towards having all of the required data is to write a manifest.json file for each of the datasets:

with open('rek_slides/training/manifest.json', 'w') as mf:
    mf.write("\n".join(training_annotations))

with open('rek_slides/test/manifest.json', 'w') as mf:
    mf.write("\n".join(test_annotations))

Transfer the files to S3

We use the upload_data method that the SageMaker session object exposes to upload the images and manifest files to the default SageMaker S3 bucket:

import sagemaker


sm_session = sagemaker.Session()
data_location = sm_session.upload_data(
    './rek_slides',
    bucket=bucket_name,
)

Train an Amazon Rekognition Custom Labels model

With the data already in Amazon S3, we can get to training a custom model. We use the Boto3 library to create an Amazon Rekognition client and create a project:

import boto3

project_name = 'rek-mitotic-figures-workshop'

rek = boto3.client('rekognition')
response = rek.create_project(ProjectName=project_name)

# If you have already created the project, use the describe_projects call to
# retrieve the project ARN.
# response = rek.describe_projects()['ProjectDescriptions'][0]

project_arn = response['ProjectArn']

With the project ready to use, you now need a project version that points to the training and test datasets in Amazon S3. Each version ideally points to a different dataset (or a different version of it). This enables us to have different versions of a model, compare their performance, and switch between them as needed. See the following code:

version_name = '1'

output_config = {
    'S3Bucket': bucket_name,
    'S3KeyPrefix': 'output',
}

training_dataset = {
    'Assets': [
        {
            'GroundTruthManifest': {
                'S3Object': {
                    'Bucket': bucket_name,
                    'Name': 'data/training/manifest.json'
                }
            },
        },
    ]
}

testing_dataset = {
    'Assets': [
        {
            'GroundTruthManifest': {
                'S3Object': {
                    'Bucket': bucket_name,
                    'Name': 'data/test/manifest.json'
                }
            },
        },
    ]
}


def describe_project_versions():
    describe_response = rek.describe_project_versions(
        ProjectArn=project_arn,
        VersionNames=[version_name],
    )

    for model in describe_response['ProjectVersionDescriptions']:
        print(f"Status: {model['Status']}")
        print(f"Message: {model['StatusMessage']}")
    
    return describe_response
    
    
response = rek.create_project_version(
    VersionName=version_name,
    ProjectArn=project_arn,
    OutputConfig=output_config,
    TrainingData=training_dataset,
    TestingData=testing_dataset,
)

waiter = rek.get_waiter('project_version_training_completed')
waiter.wait(
    ProjectArn=project_arn,
    VersionNames=[version_name],
)

describe_response = describe_project_versions()

After we create the project version, Amazon Rekognition automatically starts the training process. The training time depends on several factors, such as the size and number of the images, the number of classes, and so on. In this case, for 500 images, the training takes about 90 minutes to finish.

Test the model

After training, every model in Amazon Rekognition Custom Labels is in the STOPPED state. To use it for inference, you need to start it. We get the project version ARN from the project version description and pass it to the start_project_version method. Notice the MinInferenceUnits parameter — we start with one inference unit. The actual maximum number of transactions per second (TPS) that this inference unit supports depends on the complexity of your model. To learn more about TPS, refer to this blog post.

model_arn = describe_response['ProjectVersionDescriptions'][0]['ProjectVersionArn']

response = rek.start_project_version(
    ProjectVersionArn=model_arn,
    MinInferenceUnits=1,
)
waiter = rek.get_waiter('project_version_running')
waiter.wait(
    ProjectArn=project_arn,
    VersionNames=[version_name],
)

When your project version is listed as RUNNING, you can start sending images to Amazon Rekognition for inference.

We use one of the files in the test dataset to test the newly started model. You can use any suitable PNG or JPEG file instead.

import io

from matplotlib import pyplot as plt
from PIL import Image, ImageDraw


# We'll use one of our test images to try out our model.
with open('./rek_slides/test/slide_0.png', 'rb') as image_file:
    image_bytes=image_file.read()


# Send the image data to the model.
response = rek.detect_custom_labels(
    ProjectVersionArn=model_arn,
    Image={
        'Bytes': image_bytes
    }
)

img = Image.open(io.BytesIO(image_bytes))
draw = ImageDraw.Draw(img)

for custom_label in response['CustomLabels']:
    geometry = custom_label['Geometry']['BoundingBox']
    w = geometry['Width'] * img.width
    h = geometry['Height'] * img.height
    l = geometry['Left'] * img.width
    t = geometry['Top'] * img.height
    draw.rectangle([l, t, l + w, t + h], outline=(0, 0, 255, 255), width=5)

plt.imshow(np.asarray(img))

Streamlit application

To demonstrate the integration with Amazon Rekognition, we use a very simple Python application. We use the Streamlit library to build a spartan user interface, where we prompt the user to upload an image file.

We use the Boto3 library and the detect_custom_labels method, together with the project version ARN, to invoke the inference endpoint. The response is a JSON document that contains the positions and classes of the different objects detected in the image. In our case, these are the mitotic figures that the algorithm has found in the image we sent to the endpoint. See the following code:

import os

import boto3
import io
import streamlit as st
from PIL import Image, ImageDraw


rek_client = boto3.client('rekognition')


uploaded_file = st.file_uploader('Image file')
if uploaded_file is not None:
    image_bytes = uploaded_file.read()
    result = rek_client.detect_custom_labels(
        ProjectVersionArn='<YOUR_PROJECT_ARN_HERE>',
        Image={
            'Bytes': image_bytes
        }
    )
    img = Image.open(io.BytesIO(image_bytes))
    draw = ImageDraw.Draw(img)

    st.write(result['CustomLabels'])
    for custom_label in result['CustomLabels']:
        st.write(f"Label {custom_label['Name']}, confidence {custom_label['Confidence']}")
        geometry = custom_label['Geometry']['BoundingBox']
        w = geometry['Width'] * img.width
        h = geometry['Height'] * img.height
        l = geometry['Left'] * img.width
        t = geometry['Top'] * img.height
        st.write(f"Left, top = ({l}, {t}), width, height = ({w}, {h})")
        draw.rectangle([l, t, l + w, t + h], outline=(0, 0, 255, 255), width=5)

    st_img = st.image(img)

Deploy the application to AWS

To deploy the application, we use an AWS CDK script. The whole project can be found on GitHub. Let’s look at the different resources deployed by the script.

Create an Amazon ECR repository

As the first step towards setting up our deployment, we create an Amazon ECR repository, where we can store our application container images:

aws ecr create-repository --repository-name rek-wsi

Create and store your GitHub token in AWS Secrets Manager

CodePipeline needs a GitHub Personal Access Token to monitor your GitHub repository for changes and pull code. To create the token, follow the instructions in the GitHub documentation. The token requires the following GitHub scopes:

  • The repo scope, which is used for full control to read and pull artifacts from public and private repositories into a pipeline.
  • The admin:repo_hook scope, which is used for full control of repository hooks.

After creating the token, store it in a new secret in AWS Secrets Manager as follows:

aws secretsmanager create-secret --name rek-wsi/github --secret-string '{"oauthToken":"YOUR-TOKEN-VALUE-HERE"}'

Write configuration parameters to AWS Systems Manager Parameter Store

The AWS CDK script reads some configuration parameters from AWS Systems Manager Parameter Store, such as the name and owner of the GitHub repository, and target account and Region. Before launching the AWS CDK script, you need to create these parameters in your own account.

You can do that by using the AWS CLI. Simply invoke the put-parameter command with a name, a value, and the type of the parameter:

aws ssm put-parameter --name <PARAMETER-NAME> --value <PARAMETER-VALUE> --type <PARAMETER_TYPE>

The following is a list of all parameters required by the AWS CDK script. All of them are of type String:

  • /rek_wsi/prod/accountId — The ID of the account where we deploy the application.
  • /rek_wsi/prod/ecr_repo_name — The name of the Amazon ECR repository where the container images are stored.
  • /rek_wsi/prod/github/branch — The branch in the GitHub repository from which CodePipeline needs to pull the code.
  • /rek_wsi/prod/github/owner — The owner of the GitHub repository.
  • /rek_wsi/prod/github/repo — The name of the GitHub repository where our code is stored.
  • /rek_wsi/prod/github/token — The name or ARN of the secret in Secrets Manager that contains your GitHub authentication token. This is necessary for CodePipeline to be able to communicate with GitHub.
  • /rek_wsi/prod/region — The region where we will deploy the application.

Notice the prod segment in all parameter names. Although we don’t need this level of detail for such a simple example, it enables us to reuse this approach with other projects where different environments may be necessary.
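
For example, the following boto3 sketch creates all of the parameters in one go; the values shown are placeholders for your own account, Region, and repository details.

import boto3

ssm = boto3.client('ssm')

# Placeholder values; replace them with your own account, Region, and repo details.
parameters = {
    '/rek_wsi/prod/accountId': '111122223333',
    '/rek_wsi/prod/ecr_repo_name': 'rek-wsi',
    '/rek_wsi/prod/github/branch': 'main',
    '/rek_wsi/prod/github/owner': '<GITHUB-OWNER>',
    '/rek_wsi/prod/github/repo': '<GITHUB-REPO>',
    '/rek_wsi/prod/github/token': 'rek-wsi/github',
    '/rek_wsi/prod/region': 'us-east-1',
}

for name, value in parameters.items():
    ssm.put_parameter(Name=name, Value=value, Type='String', Overwrite=True)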

Resources created by the AWS CDK script

We need our application, running in a Fargate task, to have permissions to invoke Amazon Rekognition. So we first create an AWS Identity and Access Management (IAM) task role with the AmazonRekognitionReadOnlyAccess managed policy attached to it. Notice that the assumed_by parameter in the following code takes the ecs-tasks.amazonaws.com service principal. This is because we’re using Amazon ECS as the orchestrator, so we need Amazon ECS to assume the role and pass the credentials to the Fargate task.

streamlit_task_role = iam.Role(
    self, 'StreamlitTaskRole',
    assumed_by=iam.ServicePrincipal('ecs-tasks.amazonaws.com'),
    description='ECS Task Role assumed by the Streamlit task deployed to ECS+Fargate',
    managed_policies=[
        iam.ManagedPolicy.from_managed_policy_arn(
            self, 'RekognitionReadOnlyPolicy',
            managed_policy_arn='arn:aws:iam::aws:policy/AmazonRekognitionReadOnlyAccess'
        ),
    ],
)

Once built, our application container image sits in a private Amazon ECR repository. We need an object that describes it that we can pass when creating the Fargate service:

ecs_container_image = ecs.ContainerImage.from_ecr_repository(
    repository=ecr.Repository.from_repository_name(self, 'ECRRepo', 'rek-wsi'),
    tag='latest'
)

We create a new VPC and cluster for this application. You can modify this part to use your own VPC by using the from_lookup method of the Vpc class:

vpc = ec2.Vpc(self, 'RekWSI', max_azs=3)
cluster = ecs.Cluster(self, 'RekWSICluster', vpc=vpc)

Now that we have a VPC and cluster to deploy to, we create the Fargate service. We use 0.25 vCPU and 512 MB RAM for this task, and we place a public Application Load Balancer (ALB) in front of it. Once deployed, we use the ALB CNAME to access the application. See the following code:

fargate_service = ecs_patterns.ApplicationLoadBalancedFargateService(
    self, 'RekWSIECSApp',
    cluster=cluster,
    cpu=256,
    memory_limit_mib=512,
    desired_count=1,
    task_image_options=ecs_patterns.ApplicationLoadBalancedTaskImageOptions(
        image=ecs_container_image,
        container_port=8501,
        task_role=streamlit_task_role,
    ),
    public_load_balancer=True,
)

To automatically build and deploy a new container image every time we push code to our main branch, we create a simple pipeline consisting of a GitHub source action and a build step. Here is where we use the secrets we stored in AWS Secrets Manager and AWS Systems Manager Parameter Store in the previous steps.

pipeline = codepipeline.Pipeline(self, 'RekWSIPipeline')

# Create an artifact that points at the code pulled from GitHub.
source_output = codepipeline.Artifact()

# Create a source stage that pulls the code from GitHub. The repo parameters are
# stored in SSM, and the OAuth token in Secrets Manager.
source_action = codepipeline_actions.GitHubSourceAction(
    action_name='GitHub',
    output=source_output,
    oauth_token=SecretValue.secrets_manager(
        ssm.StringParameter.value_from_lookup(self, '/rek_wsi/prod/github/token'),
        json_field='oauthToken'),
    trigger=codepipeline_actions.GitHubTrigger.WEBHOOK,
    owner=ssm.StringParameter.value_from_lookup(self, '/rek_wsi/prod/github/owner'),
    repo=ssm.StringParameter.value_from_lookup(self, '/rek_wsi/prod/github/repo'),
    branch=ssm.StringParameter.value_from_lookup(self, '/rek_wsi/prod/github/branch'),
)

# Add the source stage to the pipeline.
pipeline.add_stage(
    stage_name='GitHub',
    actions=[source_action]
)

CodeBuild needs permissions to push container images to Amazon ECR. To grant these permissions, we add the AmazonEC2ContainerRegistryFullAccess policy to a bespoke IAM role that the CodeBuild service principal can assume:

# Create an IAM role that grants CodeBuild access to Amazon ECR to push containers.
build_role = iam.Role(
    self,
    'RekWsiCodeBuildAccessRole',
    assumed_by=iam.ServicePrincipal('codebuild.amazonaws.com'),
)

# Permissions are granted through an AWS managed policy, AmazonEC2ContainerRegistryFullAccess.
managed_ecr_policy = iam.ManagedPolicy.from_managed_policy_arn(
    self, 'cb_ecr_policy',
    managed_policy_arn='arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryFullAccess',
)
build_role.add_managed_policy(policy=managed_ecr_policy)

The CodeBuild project logs into the private Amazon ECR repository, builds the Docker image with the Streamlit application, and pushes the image into the repository together with an appspec.yaml and an imagedefinitions.json file.

The appspec.yaml file describes the task (port, Fargate platform version, and so on), while the imagedefinitions.json file maps the names of the container images to their corresponding Amazon ECR URI. See the following code:

container_name = fargate_service.task_definition.default_container.container_name
build_project = codebuild.PipelineProject(
    self,
    'RekWSIProject',
    build_spec=codebuild.BuildSpec.from_object({
        'version': '0.2',
        'phases': {
            'pre_build': {
                'commands': [
                    'env',
                    'COMMIT_HASH=$(echo $CODEBUILD_RESOLVED_SOURCE_VERSION | cut -c 1-7)',
                    'export TAG=${COMMIT_HASH:=latest}',
                    'aws ecr get-login-password --region $AWS_DEFAULT_REGION | '
                    'docker login --username AWS '
                    '--password-stdin $AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com',
                ]
            },
            'build': {
                'commands': [
                    # Build the Docker image
                    'cd streamlit_app && docker build -t $IMAGE_REPO_NAME:$IMAGE_TAG .',
                    # Tag the image
                    'docker tag $IMAGE_REPO_NAME:$IMAGE_TAG '
                    '$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com/$IMAGE_REPO_NAME:$IMAGE_TAG',
                ]
            },
            'post_build': {
                'commands': [
                    # Push the container into ECR.
                    'docker push '
                    '$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com/$IMAGE_REPO_NAME:$IMAGE_TAG',
                    # Generate imagedefinitions.json
                    'cd ..',
                    "printf '[{"name":"%s","imageUri":"%s"}]' "
                    f"{container_name} "
                    "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com/$IMAGE_REPO_NAME:$IMAGE_TAG "
                    "> imagedefinitions.json",
                    'ls -l',
                    'pwd',
                    'sed -i s"|REGION_NAME|$AWS_DEFAULT_REGION|g" appspec.yaml',
                    'sed -i s"|ACCOUNT_ID|$AWS_ACCOUNT_ID|g" appspec.yaml',
                    'sed -i s"|TASK_NAME|$IMAGE_REPO_NAME|g" appspec.yaml',
                    f'sed -i s"|CONTAINER_NAME|{container_name}|g" appspec.yaml',
                ]
            }
        },
        'artifacts': {
            'files': [
                'imagedefinitions.json',
                'appspec.yaml',
            ],
        },
    }),
    environment=codebuild.BuildEnvironment(
        build_image=codebuild.LinuxBuildImage.STANDARD_5_0,
        privileged=True,
    ),
    environment_variables={
        'AWS_ACCOUNT_ID':
            codebuild.BuildEnvironmentVariable(value=self.account),
        'IMAGE_REPO_NAME':
            codebuild.BuildEnvironmentVariable(
                value=ssm.StringParameter.value_from_lookup(self, '/rek_wsi/prod/ecr_repo_name')),
        'IMAGE_TAG':
            codebuild.BuildEnvironmentVariable(value='latest'),
    },
    role=build_role,
)

Finally, we put the different pipeline stages together. The last action is the EcsDeployAction, which takes the container image built in the previous stage and does a rolling update of the tasks in our ECS cluster:

# Create an artifact to store the build output.
build_output = codepipeline.Artifact()
# Create a build action that ties the build project, the source artifact from the
# previous stage, and the output artifact together.
build_action = codepipeline_actions.CodeBuildAction(
    action_name='Build',
    project=build_project,
    input=source_output,
    outputs=[build_output],
)
# Add the build stage to the pipeline.
pipeline.add_stage(
    stage_name='Build',
    actions=[build_action]
)
deploy_action = codepipeline_actions.EcsDeployAction(
    action_name='Deploy',
    service=fargate_service.service,
    # image_file=build_output
    input=build_output,
)
pipeline.add_stage(
    stage_name='Deploy',
    actions=[deploy_action],
)

Cleanup

To avoid incurring future costs, clean up the resources you created as part of this solution.

Amazon Rekognition Custom Labels model

Before you shut down your Studio notebook, make sure you stop the Amazon Rekognition Custom Labels model. If you don’t, it continues to incur costs.

rek.stop_project_version(
    ProjectVersionArn=model_arn,
)

Alternatively, you can use the Amazon Rekognition console to stop the service:

  1. On the Amazon Rekognition console, choose Use Custom Labels in the navigation pane.
  2. Choose Projects in the navigation pane.
  3. Choose version 1 of the rek-mitotic-figures-workshop project.
  4. On the Use Model tab, choose Stop.

Streamlit application

To destroy all resources associated to the Streamlit application, run the following code from the AWS CDK application directory:

cdk destroy RekWsiStack

AWS Secrets Manager

To delete the GitHub token, follow the instructions in the documentation.

Conclusion

In this post, we walked through the necessary steps to train an Amazon Rekognition Custom Labels model for a digital pathology application using real-world data. We then learned how to use the model from a simple application deployed to Fargate through a CI/CD pipeline.

Amazon Rekognition Custom Labels enables you to build ML-enabled healthcare applications that you can easily deploy using services like Fargate, CodeBuild, and CodePipeline.

Can you think of any applications to help researchers, doctors, or their patients to make their lives easier? If so, use the code in this walkthrough to build your next application. And if you have any questions, please share them in the comments section.

Acknowledgments

We would like to thank Prof. Dr. Marc Aubreville for kindly giving us permission to use the MITOS_WSI_CMC dataset for this blog post. The dataset can be found on GitHub.

References

[1] Aubreville, M., Bertram, C.A., Donovan, T.A. et al. A completely annotated whole slide image dataset of canine breast cancer to aid human breast cancer research. Sci Data 7, 417 (2020). https://doi.org/10.1038/s41597-020-00756-z

[2] Khened, M., Kori, A., Rajkumar, H. et al. A generalized deep learning framework for whole-slide image segmentation and analysis. Sci Rep 11, 11579 (2021). https://doi.org/10.1038/s41598-021-90444-8

[3] PNAS March 27, 2018 115 (13) E2970-E2979; first published March 12, 2018; https://doi.org/10.1073/pnas.1717139115


About the Author

Pablo Nuñez Pölcher, MSc, is a Senior Solutions Architect working for the Public Sector team with Amazon Web Services. Pablo focuses on helping healthcare public sector customers build new, innovative products on AWS in accordance with best practices. He received his M.Sc. in Biological Sciences from Universidad de Buenos Aires. In his spare time, he enjoys cycling and tinkering with ML-enabled embedded devices.

Razvan Ionasec, PhD, MBA, is the technical leader for healthcare at Amazon Web Services in Europe, Middle East, and Africa. His work focuses on helping healthcare customers solve business problems by leveraging technology. Previously, Razvan was the global head of artificial intelligence (AI) products at Siemens Healthineers in charge of AI-Rad Companion, the family of AI-powered and cloud-based digital health solutions for imaging. He holds 30+ patents in AI/ML for medical imaging and has published 70+ international peer-reviewed technical and clinical publications on computer vision, computational modeling, and medical image analysis. Razvan received his PhD in Computer Science from the Technical University Munich and MBA from University of Cambridge, Judge Business School.


Distributed fine-tuning of a BERT Large model for a Question-Answering Task using Hugging Face Transformers on Amazon SageMaker

From training new models to deploying them in production, Amazon SageMaker offers the most complete set of tools for startups and enterprises to harness the power of machine learning (ML) and Deep Learning.

With its Transformers open-source library and ML platform, Hugging Face makes transfer learning and the latest ML models accessible to the global AI community, reducing the time needed for data scientists and ML engineers in companies around the world to take advantage of every new scientific advancement.

Applying Transformers to new NLP tasks or domains requires fine-tuning of large language models, a technique leveraging the accumulated knowledge of pre-trained models to adapt them to a new task or specific type of documents in an additional, efficient training process.

Fine-tuning the model to produce accurate predictions for the business problem at hand requires training large Transformers models such as BERT, BART, RoBERTa, and T5, which can be challenging to perform in a scalable way.

Hugging Face has been working closely with SageMaker to deliver ready-to-use Deep Learning Containers (DLCs) that make training and deploying the latest Transformers models easier and faster than ever. Because features such as SageMaker Data Parallel (SMDP), SageMaker Model Parallel (SMMP), and S3 pipe mode are integrated into the container, using these containers drastically reduces the time it takes companies to create Transformers-based ML solutions such as question answering, text and image generation, and search result optimization, and improves customer support automation, conversational interfaces, semantic search, document analysis, and many more applications.

In this post, we focus on the deep integration of SageMaker distributed libraries with Hugging Face, which enables data scientists to accelerate training and fine-tuning of Transformers models from days to hours, all in SageMaker.

Overview of distributed training

ML practitioners and data scientists face two scaling challenges when training models: scaling model size (number of parameters and layers) and scaling training data. Scaling either the model size or training data can result in better accuracy, but there can be cases in deep learning where the amount of memory on the accelerator (CPU or GPU) limits the combination of the size of the training data and the size of the model. For example, when training a large language model, the batch size is often limited to a small number of samples, which can result in a less accurate model.

Distributed training can split up the workload to train the model among multiple processors, called workers. These workers operate in parallel to speed up model training.

Based on what we want to scale (model or data) there are two approaches to distributed training: data parallel and model parallel.

Data parallel is the most common approach to distributed training. Data parallelism entails creating a copy of the model architecture and weights on different accelerators. Then, rather than passing in the entire training set to a single accelerator, we can partition the training set across the different accelerators, and get through the training set faster. Although this adds the step of the accelerators needing to communicate their gradient information back to a parameter server, this time is more than offset by the speed boost of iterating over a fraction of the entire dataset per accelerator. Because of this, data parallelism can significantly help reduce training times. For example, training a single model without parallelization takes 4 hours. Using distributed training can reduce that to 24 minutes. SageMaker distributed training also implements cutting-edge techniques in gradient updates.

A model parallel approach is used with models too large to fit on one accelerator (GPU). This approach implements a parallelization strategy where the model architecture is divided into shards and placed onto different accelerators. The configuration of each of these shards depends on the neural network architecture, and typically includes several layers. The communication between the accelerators occurs each time the training data passes from one of the shards to the next.

To summarize, you should use distributed training data parallelism for time-intensive tasks due to large datasets or when you want to accelerate your training experiments. You should use model parallelism when your model can’t fit onto one accelerator.

Prerequisites

To perform distributed training of Hugging Face Transformers models in SageMaker, you need to complete the following prerequisites:

Implement distributed training

The Hugging Face Transformers library provides a Trainer API that is optimized to train or fine-tune the models the library provides. You can also use it on your own models if they work the same way as Transformers models; see Trainer for more details. This API is used in our example scripts, which show how to preprocess the data for various NLP tasks; you can use these scripts as templates for writing a script that solves your own custom problem. The promise of the Trainer API is that such a script works out of the box on any distributed setup, including SageMaker.

The Trainer API takes everything needed for the training. This includes your datasets, your model (or a function that returns your model), a compute_metrics function that returns the metrics you want to track from the arrays of predictions and labels, your optimizer and learning rate scheduler (good defaults are provided), as well as all the hyperparameters you can tune for your training, grouped in a data class called TrainingArguments. With all of that, it exposes three methods—train, evaluate, and predict—to train your model, get the metric results on any dataset, or get the predictions on any dataset. To learn more about the Trainer object, refer to Fine-tuning a model with the Trainer API and the video The Trainer API, which walks you through a simple example.
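As a minimal sketch of this API (the model name, datasets, and hyperparameter values below are illustrative placeholders, and train_dataset and eval_dataset are assumed to be already tokenized datasets), a Trainer-based script might look like the following:

import numpy as np
from transformers import (AutoModelForSequenceClassification, Trainer,
                          TrainingArguments)

def compute_metrics(eval_pred):
    # The Trainer passes a (predictions, labels) tuple; return the metrics to track.
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {"accuracy": float((preds == labels).mean())}

model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

training_args = TrainingArguments(
    output_dir="/opt/ml/model",          # where SageMaker expects the model artifacts
    num_train_epochs=2,
    per_device_train_batch_size=4,
    fp16=True,                           # mixed precision if the hardware supports it
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,         # placeholder: your tokenized training dataset
    eval_dataset=eval_dataset,           # placeholder: your tokenized evaluation dataset
    compute_metrics=compute_metrics,
)

trainer.train()                          # runs the (possibly distributed) training loop
metrics = trainer.evaluate()             # distributed evaluation on eval_dataset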

Behind the scenes, the Trainer API starts by analyzing the environment in which you are launching your script when you create the TrainingArguments. For instance, if you launched your training with SageMaker, it looks at the SM_FRAMEWORK_PARAMS variable in the environment to detect if you enabled SageMaker data parallelism or model parallelism. Then it gets the relevant variables (such as the rank of the process or the world size) from the environment before performing the necessary initialization steps (such as smdistributed.dataparallel.torch.distributed.init_process_group()).
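Conceptually, that environment check resembles the following sketch. You never need to write this yourself because the Trainer does it for you, and the exact parameter key shown here is an internal detail of the SageMaker SDK, included only as an assumption for illustration.

import json
import os

# Hypothetical illustration of the environment inspection the Trainer performs.
framework_params = json.loads(os.environ.get("SM_FRAMEWORK_PARAMS", "{}"))
if framework_params.get("sagemaker_distributed_dataparallel_enabled", False):
    import smdistributed.dataparallel.torch.distributed as dist
    dist.init_process_group()
    rank, world_size = dist.get_rank(), dist.get_world_size()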

The Trainer contains the whole training loop, so it can adjust the necessary steps to make sure the smdistributed.dataparallel backend is used when necessary without you having to change a line of code in your script. It can still run (albeit much slower) on your local machine for debugging. It handles sharding your dataset such that each process sees different samples automatically, with a reshuffle at each epoch, synchronizing your gradients before the optimization step, mixed precision training if you activated it, gradient accumulation if you can’t fit a big batch size on your GPUs, and many more optimizations.

If you activated model parallelism, it makes sure the processes that have to see the same data (if their dp_rank is the same) get the same batches, and that processes with different dp_rank don’t see the same samples, again with a reshuffle at each epoch. It makes sure the state dictionaries of the model or optimizers are properly synchronized when checkpointing, and again handles all the optimizations such as mixed precision and gradient accumulation.

When using the evaluate and predict methods, the Trainer performs a distributed evaluation, to take advantage of all your GPUs. It properly handles splitting your data for each process (process of the same dp_rank if model parallelism is activated) and makes sure that the predictions are properly gathered in the same order as the dataset you’re using before they are sent to the compute_metrics function or just returned. Using the Trainer API is not mandatory. Users can still use Keras or PyTorch within Hugging Face. However, the Trainer API can provide a helpful abstraction layer.

Train a model using SageMaker Hugging Face Estimators

An Estimator is a high-level interface for SageMaker training and handles end-to-end SageMaker training and deployment tasks. The training of your script is invoked when you call fit on a HuggingFace Estimator. In the Estimator, you define which fine-tuning script to use as entry_point, which instance_type to use, and which hyperparameters are passed in. For more information about HuggingFace parameters, see Hugging Face Estimator.

Distributed training: Data parallel

In this example, we use the new Hugging Face DLCs and SageMaker SDK to train a distributed Seq2Seq-transformer model on the question answering task using the Transformers and datasets libraries. The bert-large-uncased-whole-word-masking model is fine-tuned on the squad dataset.

The following code samples show you steps of creating a HuggingFace estimator for distributed training with data parallelism.

  1. Choose a Hugging Face Transformers script:
    # git configuration to download our fine-tuning script
    git_config = {'repo': 'https://github.com/huggingface/transformers.git','branch': 'v4.6.1'}

When you create a HuggingFace Estimator, you can specify a training script that is stored in a GitHub repository as the entry point for the Estimator, so you don’t have to download the scripts locally. You can use git_config to run the Hugging Face Transformers example scripts, setting the branch that matches your transformers_version. For example, if you use transformers_version 4.6.1, you have to set 'branch':'v4.6.1'.

  2. Configure training hyperparameters that are passed into the training job:
    # hyperparameters, which are passed into the training job
    hyperparameters={
        'model_name_or_path': 'bert-large-uncased-whole-word-masking',
        'dataset_name':'squad',
        'do_train': True,
        'do_eval': True,
        'fp16': True,
        'per_device_train_batch_size': 4,
        'per_device_eval_batch_size': 4,
        'num_train_epochs': 2,
        'max_seq_length': 384,
        'max_steps': 100,
        'pad_to_max_length': True,
        'doc_stride': 128,
        'output_dir': '/opt/ml/model'
    }

As hyperparameters, we can pass any of the Seq2SeqTrainingArguments as well as the arguments defined in the training script.

  3. Define the distribution parameters in the HuggingFace Estimator:
    # configuration for running training on smdistributed Data Parallel
    distribution = {'smdistributed':{'dataparallel':{ 'enabled': True }}}

You can use the SageMaker data parallelism library out of the box for distributed training. We added the functionality of data parallelism directly into the Trainer. To enable data parallelism, you can simply add a single parameter to your HuggingFace Estimator to let your Trainer-based code use it automatically.

  4. Create a HuggingFace Estimator including the parameters defined in the previous steps and start training:
from sagemaker.huggingface import HuggingFace
# estimator
huggingface_estimator = HuggingFace(entry_point='run_qa.py',
                                    source_dir='./examples/pytorch/question-answering',
                                    git_config=git_config,
                                    instance_type= 'ml.p3.16xlarge',
                                    instance_count= 2,
                                    volume_size= 200,
                                    role= <SageMaker Role>, # IAM role,
                                    transformers_version='4.6',
                                    pytorch_version='1.7',
                                    py_version='py36',
                                    distribution= distribution,
                                    hyperparameters = hyperparameters)

# starting the train job 
huggingface_estimator.fit()

The Hugging Face Transformers repository contains several examples and scripts for fine-tuning models on tasks from language modeling to token classification. In our case, we use run_qa.py from the examples/pytorch/question-answering examples.

smdistributed.dataparallel supports model training on SageMaker with the following instance types only. For best performance, we recommend using an instance type that supports Elastic Fabric Adapter (EFA):

  • ml.p3.16xlarge
  • ml.p3dn.24xlarge (Recommended)
  • ml.p4d.24xlarge (Recommended)

To get the best performance and the most out of SMDataParallel, you should use at least two instances, but you can also use one for testing this example.

The following example notebook provides more detailed step-by-step guidance.

Distributed training: Model parallel

For distributed training with model parallelism, we use the Hugging Face Transformers and datasets library together with the SageMaker SDK for sequence classification on the General Language Understanding Evaluation (GLUE) benchmark on a multi-node, multi-GPU cluster using the SageMaker model parallelism library.

As with data parallelism, we first set the git configuration, training hyperparameters, and distribution parameters in the HuggingFace Estimator:

# git configuration to download our fine-tuning script
git_config = {'repo': 'https://github.com/huggingface/transformers.git','branch': 'v4.6.1'}

# hyperparameters, which are passed into the training job
hyperparameters={
    'model_name_or_path':'roberta-large',
    'task_name': 'mnli',
    'per_device_train_batch_size': 16,
    'per_device_eval_batch_size': 16,
    'do_train': True,
    'do_eval': True,
    'do_predict': True,
    'num_train_epochs': 2,
    'output_dir':'/opt/ml/model',
    'max_steps': 500,
}

# configuration for running training on smdistributed Model Parallel
mpi_options = {
    "enabled" : True,
    "processes_per_host" : 8,
}
smp_options = {
    "enabled":True,
    "parameters": {
        "microbatches": 4,
        "placement_strategy": "spread",
        "pipeline": "interleaved",
        "optimize": "speed",
        "partitions": 4,
        "ddp": True,
    }
}

distribution={
    "smdistributed": {"modelparallel": smp_options},
    "mpi": mpi_options
}

The model parallelism library internally uses MPI, so to use model parallelism, MPI must be enabled using the distribution parameter. processes_per_host in the preceding code specifies the number of processes MPI should launch on each host. We suggest these values for development and testing. At production time, you can contact AWS Support if you need extensive GPU capacity. For more information, see Run a SageMaker Distributed Model Parallel Training Job.

The following example notebook contains the complete code scripts.

Spot Instances

With the Hugging Face framework extension for the SageMaker Python SDK, we can also take advantage of fully managed Amazon Elastic Compute Cloud (Amazon EC2) Spot Instances and save up to 90% of our training cost.

Unless your training job completes quickly, we recommend that you use checkpointing with managed spot training, which requires you to define checkpoint_s3_uri.

To use Spot Instances with the HuggingFace Estimator, we have to set the use_spot_instances parameter to True and define your max_wait and max_run time. For more information about the managed spot training lifecycle, see Managed Spot Training in Amazon SageMaker.

The following is a code snippet for setting up a spot training Estimator:

from sagemaker.huggingface import HuggingFace

# hyperparameters, which are passed into the training job
hyperparameters={'epochs': 1,
                 'train_batch_size': 32,
                 'model_name':'distilbert-base-uncased',
                 'output_dir':'/opt/ml/checkpoints'
                 }

# s3 uri where our checkpoints will be uploaded during training
job_name = "using-spot"
checkpoint_s3_uri = f's3://{sess.default_bucket()}/{job_name}/checkpoints'

huggingface_estimator = HuggingFace(entry_point='train.py',
                            source_dir='./scripts',
                            instance_type='ml.p3.2xlarge',
                            instance_count=1,
                            base_job_name=job_name,
                            checkpoint_s3_uri=checkpoint_s3_uri,
                            use_spot_instances=True,
                            max_wait=3600, # This should be equal to or greater than max_run in seconds
                            max_run=1000, # expected max run in seconds
                            role=role,
                            transformers_version='4.6',
                            pytorch_version='1.7',
                            py_version='py36',
                            hyperparameters = hyperparameters)

The following notebook contains the complete code scripts.

Conclusion

In this post, we discussed distributed training of Hugging Face Transformers using SageMaker. We first reviewed the use cases for data parallelism vs. model parallelism. Data parallelism is typically appropriate when training is bottlenecked by compute (though it isn’t limited to that case), whereas model parallelism is required when a model can’t fit within the memory provided on a single accelerator. We then showed how to train with both methods.

In the data parallelism use case we discussed, training a model on a single p3.2xlarge instance (with a single GPU) takes 4 hours and costs roughly $15 at the time of this writing. With data parallelism, we can train the same model in 24 minutes at a cost of $28. Although the cost has doubled, this has reduced the training time by a factor of 10. For a situation in which you need to train many models within a short period of time, data parallelism can enable this at a relatively low cost increase. As for the model parallelism use case, it adds the ability to train models that could not have been previously trained at all due to hardware limitations. Both features enable new workflows for ML practitioners, and are readily accessible through the HuggingFace Estimator as a part of the SageMaker Python SDK. Deploying these models to hosted endpoints follows the same procedure as for other Estimators.

This integration enables other features that are part of the SageMaker ecosystem. For example, you can use Spot Instances by adding a simple flag to the Estimator for additional cost-optimization. As a next step, you can find and run the training demo and example notebook.


About the Authors

Archis Joglekar is an AI/ML Partner Solutions Architect in the Emerging Technologies team. He is interested in performant, scalable deep learning and scientific computing using the building blocks at AWS. His past experiences range from computational physics research to machine learning platform development in academia, national labs, and startups. His time away from the computer is spent playing soccer and with friends and family.

James Yi is a Sr. AI/ML Partner Solutions Architect in the Emerging Technologies team at Amazon Web Services. He is passionate about working with enterprise customers and partners to design, deploy and scale AI/ML applications to derive their business values. Outside of work, he enjoys playing soccer, traveling and spending time with his family.

Philipp Schmid is a Machine Learning Engineer and Tech Lead at Hugging Face, where he leads the collaboration with the Amazon SageMaker team. He is passionate about democratizing, optimizing, and productionizing cutting-edge NLP models and improving the ease of use for Deep Learning.

Sylvain Gugger is a Research Engineer at Hugging Face and one of the main maintainers of the Transformers library. He loves open source software and helps the community use it.

Jeff Boudier builds products at Hugging Face, creator of Transformers, the leading open-source ML library. Previously Jeff was a co-founder of Stupeflix, acquired by GoPro, where he served as director of Product Management, Product Marketing, Business Development and Corporate Development.

Read More

Detect NLP data drift using custom Amazon SageMaker Model Monitor

Natural language understanding is applied in a wide range of use cases, from chatbots and virtual assistants, to machine translation and text summarization. To ensure that these applications are running at an expected level of performance, it’s important that data in the training and production environments is from the same distribution. When the data that is used for inference (production data) differs from the data used during model training, we encounter a phenomenon known as data drift. When data drift occurs, the model is no longer relevant to the data in production and likely performs worse than expected. It’s important to continuously monitor the inference data and compare it to the data used during training.

You can use Amazon SageMaker to quickly build, train, and deploy machine learning (ML) models at any scale. As a proactive measure against model degradation, you can use Amazon SageMaker Model Monitor to continuously monitor the quality of your ML models in real time. With Model Monitor, you can also configure alerts to notify and trigger actions if any drift in model performance is observed. Early and proactive detection of these deviations enables you to take corrective actions, such as collecting new ground truth training data, retraining models, and auditing upstream systems, without having to manually monitor models or build additional tooling.

Model Monitor offers four different types of monitoring capabilities to detect and mitigate model drift in real time:

  • Data quality – Helps detect change in data schemas and statistical properties of independent variables and alerts when a drift is detected.
  • Model quality – For monitoring model performance characteristics such as accuracy or precision in real time, Model Monitor allows you to ingest the ground truth labels collected from your applications. Model Monitor automatically merges the ground truth information with prediction data to compute the model performance metrics.
  • Model bias – Model Monitor is integrated with Amazon SageMaker Clarify to improve visibility into potential bias. Although your initial data or model may not be biased, changes in the world may cause bias to develop over time in a model that has already been trained.
  • Model explainability – Drift detection alerts you when a change occurs in the relative importance of feature attributions.

In this post, we discuss the types of data quality drift that are applicable to text data. We also present an approach to detecting data drift in text data using Model Monitor.

Data drift in NLP

Data drift can be classified into three categories depending on whether the distribution shift is happening on the input or on the output side, or whether the relationship between the input and the output has changed.

Covariate shift

In a covariate shift, the distribution of inputs changes over time, but the conditional distribution P(y|x) doesn’t change. This type of drift is called covariate shift because the problem arises due to a shift in the distribution of the covariates (features). For example, in an email spam classification model, distribution of training data (email corpora) may diverge from the distribution of data during scoring.

Label shift

While covariate shift focuses on changes in the feature distribution, label shift focuses on changes in the distribution of the class variable. This type of shifting is essentially the reverse of covariate shift. An intuitive way to think about it might be to consider an unbalanced dataset. If the spam to non-spam ratio of emails in our training set is 50%, but in reality 10% of our emails are non-spam, then the target label distribution has shifted.

Concept shift

Concept shift is different from covariate and label shift in that it’s not related to the data distribution or the class distribution, but instead is related to the relationship between the two variables. For example, email spammers often use a variety of concepts to pass the spam filter models, and the concept of emails used during training may change as time goes by.
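To summarize the three categories in the same notation used above, where x denotes the model inputs and y the labels:

  • Covariate shift – P(x) changes between training and production, but P(y|x) stays the same.
  • Label shift – P(y) changes, but P(x|y) stays the same.
  • Concept shift – P(y|x) itself changes, even if P(x) stays the same.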

Now that we understand the different types of data drift, let’s see how we can use Model Monitor to detect covariate shift in text data.

Solution overview

Unlike tabular data, which is structured and bounded, textual data is complex, high dimensional, and free form. To efficiently detect drift in NLP, we work with embeddings, which are low-dimensional representations of the text. You can obtain embeddings using various language models such as Word2Vec and transformer-based models like BERT. These models project high-dimensional data into low-dimensional spaces while preserving the semantic information of the text. The results are dense and contextually meaningful vectors, which can be used for various downstream tasks, including monitoring for data drift.

In our solution, we use embeddings to detect the covariate shift of English sentences. We utilize Model Monitor to facilitate continuous monitoring for a text classifier that is deployed to a production environment. Our approach consists of the following steps:

  1. Fine-tune a BERT model using SageMaker.
  2. Deploy a fine-tuned BERT classifier as a real-time endpoint with data capture enabled.
  3. Create a baseline dataset that consists of a sample of the sentences used to train the BERT classifier.
  4. Create a custom SageMaker monitoring job to calculate the cosine similarity between the data captured in production and the baseline dataset.

The following diagram illustrates the solution workflow:

Fine-tune a BERT model

In this post, we use the Corpus of Linguistic Acceptability (CoLA), a dataset of 10,657 English sentences labeled as grammatical or ungrammatical from published linguistics literature. We use SageMaker training to fine-tune a BERT model on the CoLA dataset by defining a PyTorch Estimator class. For more information on how to use this SDK with PyTorch, see Use PyTorch with the SageMaker Python SDK. Calling the fit() method of the estimator launches the training job:

from sagemaker.pytorch import PyTorch

# place to save model artifact
output_path = f"s3://{bucket}/{model_prefix}"

estimator = PyTorch(
    entry_point="train_deploy.py",
    source_dir="code",
    role=role,
    framework_version="1.7.1",
    py_version="py3",
    instance_count=1,
    instance_type="ml.p3.2xlarge",
    output_path=output_path,
    hyperparameters={
        "epochs": 1,
        "num_labels": 2,
        "backend": "gloo",
    },
    disable_profiler=True, # disable debugger
)
estimator.fit({"training": inputs_train, "testing": inputs_test})

Deploy the model

After training our model, we host it on a SageMaker endpoint. To make the endpoint load the model and serve predictions, we implement a few methods in train_deploy.py:

  • model_fn() – Loads the saved model and returns a model object that can be used for model serving. The SageMaker PyTorch model server loads our model by invoking model_fn.
  • input_fn() – Deserializes and prepares the prediction input. In this example, our request body is first serialized to JSON and then sent to the model serving endpoint. Therefore, in input_fn(), we first deserialize the JSON-formatted request body and return the input as a torch.tensor, as required for BERT.
  • predict_fn() – Performs the prediction and returns the result.
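A condensed sketch of what these handlers in train_deploy.py might look like is shown below; the tokenizer choice and the way predictions are returned are simplified assumptions, not the exact implementation.

import json

import torch
from transformers import BertForSequenceClassification, BertTokenizer

def model_fn(model_dir):
    # Load the fine-tuned model artifacts saved by the training job.
    model = BertForSequenceClassification.from_pretrained(model_dir)
    model.eval()
    return model

def input_fn(request_body, request_content_type):
    # Deserialize the JSON request and tokenize the sentences into tensors.
    sentences = json.loads(request_body)
    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    return tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")

def predict_fn(input_data, model):
    # Run the forward pass and return the predicted class per sentence.
    with torch.no_grad():
        logits = model(**input_data).logits
    return logits.argmax(dim=-1).tolist()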

Enable Model Monitor data capture

We enable Model Monitor data capture to record the input data into the Amazon Simple Storage Service (Amazon S3) bucket to reference it later:

data_capture_config = DataCaptureConfig(enable_capture=True,
                                        sampling_percentage=100,
                                        destination_s3_uri=s3_capture_upload_path)

Then we create a real-time SageMaker endpoint with the model created in the previous step:

predictor = estimator.deploy(endpoint_name='nlp-data-drift-bert-endpoint',
                                            initial_instance_count=1,
                                            instance_type="ml.m4.xlarge",
                                            data_capture_config=data_capture_config)

Inference

We run predictions using the predictor object that we created in the previous step, after setting the JSON serializer and deserializer that the inference endpoint uses:

print("Sending test traffic to the endpoint {}. nPlease wait...".format(endpoint_name))

result = predictor.predict([
"Thanks so much for driving me home",
"Thanks so much for cooking dinner. I really appreciate it",
"Nice to meet you, Sergio. So, where are you from"
])
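The serializer and deserializer mentioned above can be attached to the predictor with a couple of lines like the following (SageMaker Python SDK v2 style), so that request and response payloads are handled as JSON:

from sagemaker.serializers import JSONSerializer
from sagemaker.deserializers import JSONDeserializer

# Send requests as JSON and parse JSON responses from the endpoint.
predictor.serializer = JSONSerializer()
predictor.deserializer = JSONDeserializer()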

The real-time endpoint is configured to capture data from both the request and the response, and that data is stored in Amazon S3. You can then view the captured data that the monitoring schedule analyzes.

Create a baseline

We use a fine-tuned BERT model to extract sentence embedding features from the training data. We use these vectors as high-quality feature inputs for cosine distance comparison because BERT produces dynamic word representations with semantic context. Complete the following steps to get the sentence embeddings:

  1. Use a BERT tokenizer to get token IDs for each token (input_id) in the input sentence and mask to indicate which elements in the input sequence are tokens vs. padding elements (attention_mask_id). We use the BERT tokenizer.encode_plus function to get these values for each input sentence:
#Add instantiation of tokenizer
encoded_dict = tokenizer.encode_plus(
               sent,                      # Input Sentence to encode.
               add_special_tokens = True, # Add '[CLS]' and '[SEP]'
               max_length = 64,           # Pad sentence to max_length
               pad_to_max_length = True,  # Pad/truncate sentence to max_length
               return_attention_mask = True, #BERT model needs attention_mask
               return_tensors = 'pt',     # Return pytorch tensors.
               )
input_ids = encoded_dict['input_ids']
attention_mask_ids = encoded_dict['attention_mask']

input_ids and attention_mask_ids are passed to the model to fetch the hidden states of the network. hidden_states has four dimensions, in the following order:

  • Layer number (BERT has 12 layers)
  • Batch number (1 sentence)
  • Word token indexes
  • Hidden units (768 features)
  2. Use the second-to-last hidden layer to get a single vector (the sentence embedding) by calculating the average of all the token vectors in the sentence:
outputs = model(input_ids, attention_mask_ids) # forward pass to model
hidden_states = outputs[2]                     # token vectors
token_vecs = hidden_states[-2][0]              # second-to-last layer hidden states
sentence_embedding = torch.mean(token_vecs, dim=0) # average the token vectors
  3. Convert the sentence embeddings to NumPy arrays and store them in an Amazon S3 location as the baseline that Model Monitor uses:
sentence_embeddings_list = []
for i in sentence_embeddings:
    sentence_embeddings_list.append(i.numpy())

np.save('embeddings.npy', sentence_embeddings_list)

#Upload the sentence embedding to S3
!aws s3 cp embeddings.npy s3://{bucket}/{model_prefix}/embeddings/

Evaluation script

Model Monitor provides a pre-built container with the ability to analyze the data captured from endpoints for tabular datasets. If you want to bring your own container, Model Monitor provides extension points that you can use. When you create a MonitoringSchedule, Model Monitor ultimately kicks off processing jobs. Therefore, the container needs to be aware of the processing job contract. We need to create an evaluation script that is compatible with container contract inputs and outputs.
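As a rough sketch of what honoring this contract looks like inside the evaluation script, the processing job mounts the captured data into the container and passes its configuration through environment variables. The variable names and default paths below follow the Model Monitor container contract, but treat the details as assumptions for illustration rather than a definitive implementation.

import json
import os
from types import SimpleNamespace

# Configuration the monitoring job passes to the custom container (illustrative).
env = SimpleNamespace(
    dataset_source=os.environ.get("dataset_source", "/opt/ml/processing/input/endpoint"),
    output_path=os.environ.get("output_path", "/opt/ml/processing/resultdata"),
    sagemaker_endpoint_name=os.environ.get("sagemaker_endpoint_name", ""),
    sagemaker_monitoring_schedule_name=os.environ.get("sagemaker_monitoring_schedule_name", ""),
    max_ratio_threshold=float(os.environ.get("THRESHOLD", "0.5")),
)

# Captured data is stored as JSON Lines files; each record holds a request/response pair.
records = []
for root, _, files in os.walk(env.dataset_source):
    for name in files:
        if name.endswith(".jsonl"):
            with open(os.path.join(root, name)) as f:
                records.extend(json.loads(line) for line in f if line.strip())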

Model Monitor runs the evaluation code on all the samples that are captured during the monitoring schedule. For each inference data point, we calculate the sentence embedding using the same logic described earlier. Cosine similarity is used as a distance metric to measure the similarity of an inference data point to the sentence embeddings in the baseline. Mathematically, it measures the cosine of the angle between two sentence embedding vectors. A high cosine similarity score indicates similar sentence embeddings. A low cosine similarity score indicates data drift. We calculate the average of all the cosine similarity scores, and if it’s less than the threshold, it gets captured in the violation report. Based on the use case, you can use other distance metrics like Manhattan or Euclidean distance to measure the similarity of sentence embeddings.
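For reference, the cosine similarity between an inference embedding u and a baseline embedding v is the quantity computed as 1 minus the cosine distance in the snippet that follows:

\text{cosine\_similarity}(u, v) = \frac{u \cdot v}{\lVert u \rVert \, \lVert v \rVert} = 1 - \text{cosine\_distance}(u, v)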

The following diagram shows how we use SageMaker Model Monitor to establish a baseline and detect data drift using cosine similarity.

The following is the code for calculating the violations; the complete evaluation script is available on GitHub:

from scipy.spatial.distance import cosine # SciPy cosine distance

cosine_score = 0
for embed_item in embedding_list: # all sentence embeddings from baseline
    cosine_score += (1 - cosine(input_sentence_embedding, embed_item)) # cosine similarity between input sentence embedding and baseline embedding
cosine_score_avg = cosine_score/(len(embedding_list)) # average cosine score of input sentence
if cosine_score_avg < env.max_ratio_threshold: # compare average cosine score against a threshold
    sent_cosine_dict[record] = cosine_score_avg # capture details for violation report
    violations.append({
            "sentence": record,
            "avg_cosine_score": cosine_score_avg,
            "feature_name": "sent_cosine_score",
            "constraint_check_type": "baseline_drift_check",
            "endpoint_name" : env.sagemaker_endpoint_name,
            "monitoring_schedule_name": env.sagemaker_monitoring_schedule_name
        })

Measure data drift using Model Monitor

In this section, we focus on measuring data drift using Model Monitor. Model Monitor’s pre-built monitors are powered by Deequ, a library built on top of Apache Spark for defining unit tests for data that measure data quality in large datasets. You don’t need to write any code to use these pre-built monitoring capabilities. You also have the flexibility to monitor models with your own code for custom analysis. You can collect and review all metrics emitted by Model Monitor in Amazon SageMaker Studio, so you can visually analyze your model performance without writing additional code.

In certain scenarios, for instance when the data is non-tabular, the default processing job (powered by Deequ) doesn’t suffice because it only supports tabular datasets. The pre-built monitors may not be able to generate the sophisticated metrics needed to detect drift, which makes it necessary to bring your own metrics. In the next sections, we describe how to bring your own metrics by building a custom container.

Build the custom Model Monitor container

We use the evaluation script from the previous section to build a Docker container and push it to Amazon Elastic Container Registry (Amazon ECR):

#Build a docker container and push to ECR

account_id = boto3.client('sts').get_caller_identity().get('Account')
ecr_repository = 'nlp-data-drift-bert-v1'
tag = ':latest'
region = boto3.session.Session().region_name
sm = boto3.client('sagemaker')
uri_suffix = 'amazonaws.com'
if region in ['cn-north-1', 'cn-northwest-1']:
    uri_suffix = 'amazonaws.com.cn'
processing_repository_uri = f'{account_id}.dkr.ecr.{region}.{uri_suffix}/{ecr_repository + tag}'
# Creating the ECR repository and pushing the container image

!docker build -t $ecr_repository docker

!$(aws ecr get-login --region $region --registry-ids $account_id --no-include-email)

!aws ecr create-repository --repository-name $ecr_repository

!docker tag {ecr_repository + tag} $processing_repository_uri
!docker push $processing_repository_uri

When the custom Docker container is in Amazon ECR, we can schedule a model monitoring job and generate a violations report, as demonstrated in the next sections.

Schedule a model monitoring job

To schedule a model monitoring job, we create an instance of Model Monitor and in the image_uri, we refer to the Docker container that we created in the previous section:

from sagemaker.model_monitor import ModelMonitor

monitor = ModelMonitor(
    base_job_name='nlp-data-drift-bert-v1',
    role=role,
    image_uri=processing_repository_uri,
    instance_count=1,
    instance_type='ml.m5.large',
    env={ 'THRESHOLD':'0.5', 'bucket': bucket },
)

We schedule the monitoring job using the create_monitoring_schedule API. You can schedule the monitoring job on an hourly or daily basis. You configure where the job writes its results using the destination parameter, as shown in the following code:

from sagemaker.model_monitor import CronExpressionGenerator, MonitoringOutput
from sagemaker.processing import ProcessingInput, ProcessingOutput

destination = f's3://{sagemaker_session.default_bucket()}/{prefix}/{endpoint_name}/monitoring_schedule'

processing_output = ProcessingOutput(
    output_name='result',
    source='/opt/ml/processing/resultdata',
    destination=destination,
)
output = MonitoringOutput(source=processing_output.source, destination=processing_output.destination)

monitor.create_monitoring_schedule(
    monitor_schedule_name='nlp-data-drift-bert-schedule',
    output=output,
    endpoint_input=predictor.endpoint_name,
    schedule_cron_expression=CronExpressionGenerator.hourly(),
)

To describe and list the monitoring schedule and its runs, you can use the following commands:

monitor.describe_schedule()
print(monitor.list_executions())

Data drift violation report

When the model monitoring job is complete, you can navigate to the destination S3 path to access the violation reports. This report contains all the inputs whose average cosine score (avg_cosine_score) is below the threshold configured as an environment variable THRESHOLD:0.5 in the ModelMonitor instance. This is an indication that the data observed during inference is drifting beyond the established baseline.

The following code shows the generated violation report:

{
    "violations": [
        {
            "feature_name": "sent_cosine_score",
            "constraint_check_type": "baseline_drift_check",
            "sentence": "Thanks so much for driving me home",
            "avg_cosine_score": 0.36653404209142876
        },
        {
            "feature_name": "sent_cosine_score",
            "constraint_check_type": "baseline_drift_check",
            "sentence": "Thanks so much for cooking dinner. I really appreciate it",
            "avg_cosine_score": 0.34974955975723576
        },
        {
            "feature_name": "sent_cosine_score",
            "constraint_check_type": "baseline_drift_check",
            "sentence": "Nice to meet you, Sergio. So, where are you from",
            "avg_cosine_score": 0.378982806084463
        }
    ]
}

Finally, based on this observation, you can configure your model for retraining. You can also enable Amazon Simple Notification Service (Amazon SNS) notifications to send alerts when violations occur.
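If you choose to wire this up with Amazon SNS, a small helper along the following lines could publish an alert whenever the violation report is non-empty; the topic ARN, bucket, and report key are placeholders you would supply.

import json

import boto3

def notify_if_violations(report_bucket, report_key, topic_arn):
    """Read a Model Monitor violation report from S3 and publish an SNS alert if it has entries."""
    s3 = boto3.client("s3")
    body = s3.get_object(Bucket=report_bucket, Key=report_key)["Body"].read()
    violations = json.loads(body).get("violations", [])
    if violations:
        boto3.client("sns").publish(
            TopicArn=topic_arn,  # placeholder: your SNS topic
            Subject="NLP data drift detected",
            Message=json.dumps(violations[:10], indent=2),  # include the first few violations
        )
    return len(violations)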

Conclusion

Model Monitor enables you to maintain the high quality of your models in production. In this post, we highlighted the challenges with monitoring data drift on unstructured data like text, and provided an intuitive approach to detect data drift using a custom monitoring script. You can find the code associated with the post in the following GitHub repository. Additionally, you can customize the solution to utilize other distance metrics such as maximum mean discrepancy (MMD), a non-parametric distance metric to compute marginal distribution between source and target distribution on the embedded space.


About the Authors

Vikram Elango is an AI/ML Specialist Solutions Architect at Amazon Web Services, based in Virginia, USA. Vikram helps financial and insurance industry customers with design, thought leadership to build and deploy machine learning applications at scale. He is currently focused on natural language processing, responsible AI, inference optimization and scaling ML across the enterprise. In his spare time, he enjoys traveling, hiking, cooking and camping with his family.

Raghu Ramesha is a ML Solutions Architect with the Amazon SageMaker Service team. He focuses on helping customers migrate ML production workloads to SageMaker at scale. He specializes in machine learning, AI, and computer vision domains, and holds a master’s degree in Computer Science from UT Dallas. In his free time, he enjoys traveling and photography.

Tony Chen is a Machine Learning Solutions Architect at Amazon Web Services, helping customers design scalable and robust machine learning capabilities in the cloud. As a former data scientist and data engineer, he leverages his experience to help tackle some of the most challenging problems organizations face with operationalizing machine learning.

Read More

Computer vision-based anomaly detection using Amazon Lookout for Vision and AWS Panorama

This is the second post in the two-part series on how Tyson Foods Inc. is using computer vision applications at the edge to automate industrial processes inside their meat processing plants. In Part 1, we discussed an inventory counting application at packaging lines built with Amazon SageMaker and AWS Panorama. In this post, we discuss a vision-based anomaly detection solution at the edge for predictive maintenance of industrial equipment.

Operational excellence is a key priority at Tyson Foods. Predictive maintenance is an essential asset for achieving this objective by continuously improving overall equipment effectiveness (OEE). In 2021, Tyson Foods launched a machine learning (ML) based computer vision project to identify failing product carriers during production to prevent them from impacting team member safety, operations, or product quality. When a product carrier breaks or moves into the wrong position, production must be stopped. If it’s not caught in time, it poses a threat to team member safety and machinery. With a manual inspection method, an operator inspects 8,000 pins per line. This is a slow and challenging task because attention to detail is critical. ML practitioners at Tyson Foods have built computer vision models to automate the inspection process and detect anomalies continuously. This process can enable the maintenance team to reduce the cycle time and improve the reliability of inspecting 8,000 pins.

Developing a custom ML model to analyze images and detect anomalies, and making these models run efficiently at the edge is a challenging task. This requires specialized expertise, time, and resources. The entire development cycle may take months to complete. With the approaches mentioned in Part 1 of this series, we completed the project for monitoring the condition of the product carriers at Tyson Foods in record time using AWS Managed Services such as Amazon Lookout for Vision.

Solution overview

The patterns, code, and infrastructure designed for the tray counting use case in Part 1 were readily replicated in the product carrier project. Although at first glance these projects may seem very different, at their core they are made up of the same five components: image capture, labeling, model training, frame deduplication, and inference.

This post demonstrates how to set up a computer vision-based anomaly detection solution for failing product carriers (or similar manufacturing line assembly) using AWS Panorama and Lookout for Vision. The workflow begins with inference via an object detection model on an AWS Panorama device at the edge. The object detection model crops the image and passes the result to the Lookout for Vision anomaly detection model that classifies the pin images. The anomalous pin images and model results are sent to the cloud and available for additional processing.

The following diagram illustrates this architecture.

Prerequisites

To follow along with this post, you need the following:

Train an object detection model

The first stage of our multi-model inference design is an SSD object detection model trained to detect product carriers (pins) and flags. The cropped pin images are used to train the anomaly classification model in Lookout for Vision. The flag, which marks the beginning of the product carrier line, helps us track each loop cycle and deduplicate anomaly detections.

The following image is an example inference result from the pin detection SSD model.

Train an anomaly classification model using Lookout for Vision

Lookout for Vision is a fully managed ML service that uses computer vision to help identify visual defects in objects. It allows you to build an anomaly detection model quickly with little-to-no code and requires very little data to start (minimum 20 normal and 10 anomaly images). Training a Lookout for Vision model follows a four-step process:

  1. Create a Lookout for Vision project.
  2. Build a product carrier dataset.
  3. Train and tune the Lookout for Vision model.
  4. Export the Lookout for Vision model for inference.

In this section, we walk you through Steps 1–3.

Create a Lookout for Vision project

For instructions on creating a Lookout for Vision project, see Creating your project.

Build a product carrier dataset

The dataset for Lookout for Vision must consist of square images in JPG or PNG format, with a minimum size of 64 x 64 pixels and a maximum size of 4096 x 4096 pixels. To generate a dataset that satisfies these requirements, we crop each bounding box and resize it while preserving the original aspect ratio, using the following Python code. We add this code to the image capture pipeline described in Part 1 to generate the final 150 x 150 pixel images for Lookout for Vision.

def crop_n_resize_image(self, img, bbox, size, padColor=0):

    # crop images ==============================
    crop = img[bbox[1]:bbox[3],bbox[0]:bbox[2]].copy()
    
    # cropped image size
    h, w = crop.shape[:2]
    # designed crop image sizes
    sh, sw = size

    # interpolation method
    if h > sh or w > sw: # shrinking image
        interp = cv2.INTER_AREA
    else: # stretching image
        interp = cv2.INTER_CUBIC

    # aspect ratio of image
    aspect = w/h 

    # compute scaling and pad sizing
    if aspect > 1: # horizontal image
        new_w = sw
        new_h = np.round(new_w/aspect).astype(int)
        pad_vert = (sh-new_h)/2
        pad_top, pad_bot = np.floor(pad_vert).astype(int), np.ceil(pad_vert).astype(int)
        pad_left, pad_right = 0, 0
    elif aspect < 1: # vertical image
        new_h = sh
        new_w = np.round(new_h*aspect).astype(int)
        pad_horz = (sw-new_w)/2
        pad_left, pad_right = np.floor(pad_horz).astype(int), np.ceil(pad_horz).astype(int)
        pad_top, pad_bot = 0, 0
    else: # square image
        new_h, new_w = sh, sw
        pad_left, pad_right, pad_top, pad_bot = 0, 0, 0, 0

    # set pad color
    if len(img.shape) == 3 and not isinstance(padColor, (list, tuple, np.ndarray)): # color image but only one color provided
        padColor = [padColor]*3

    # scale and pad
    scaled_img = cv2.resize(crop, (new_w, new_h), interpolation=interp)
    scaled_img = cv2.copyMakeBorder(scaled_img, pad_top, pad_bot, pad_left, pad_right, borderType=cv2.BORDER_CONSTANT, value=padColor)

    return scaled_img

The following are examples of processed product carrier images.

We label the images through Amazon SageMaker Ground Truth, which returns a label manifest file. This file is imported into Lookout for Vision to create the anomaly detection dataset. You can label the images within the Lookout for Vision platform, but we didn’t use that approach in this project. The following screenshot shows the labeled dataset on the Lookout for Vision console.

Train and tune the Lookout for Vision model

Training an anomaly detection model in Lookout for Vision is as simple as a click of a button. Lookout for Vision automatically holds out 20% of the data as a test set to validate the model performance. The key to generating good model results is to focus on labeling and image quality. The initial image size used was too small, and critical details were lost due to resolution. Increasing the resolution from 64 x 64 to 150 x 150 resulted in a significant jump in model accuracy. To tune the labels, the development team spent a significant amount of time with subject matter experts from the plant to utilize their knowledge in designing the definitions for each class. It was imperative that these class definitions were very clear, and it took a few iterations to get them perfect. The following screenshot shows the results achieved with well-established class definitions.

Develop the AWS Panorama application

The AWS Panorama application is an inference container deployed to the AWS Panorama Appliance to process input video streams, run inference, and output video results using the AWS Panorama SDK. Most of the inference code is the same as in Part 1; the following features are added specifically for this product carrier use case:

  • Build a frame inference trigger
  • Run Lookout for Vision inference
  • Deduplicate and isolate pin location

Build a frame inference trigger

For this use case, our product carriers are moving continuously across the video frame, and the same pins may be detected repeatedly until they move out of the camera view. To avoid sending duplicated pins to the Lookout for Vision model for anomaly classification and wasting compute resources, we developed a software trigger in our inference code to downsample the frames and reduce the number of duplicated pins sent for inference. In the following screenshot, the minimum number of pins detected is 8 and the maximum number of pins detected is 10.

The logic determines the trigger using a product carrier ID, which is a counter of the number of new product carriers that have moved into the camera view. We increment it by detecting when the number of bounding boxes in a frame reaches the maximum value. As shown in the preceding figure, there are a minimum and a maximum number of bounding boxes that can be detected at any given time. The count oscillates between the minimum and maximum values, and each oscillation corresponds to a new product carrier moving into the camera view. The following figure illustrates the oscillation pattern. Because a camera frame can only fit six product carriers, we know an entire frame has shifted off when the product carrier ID increments by 6.
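The following minimal sketch illustrates this counting logic; the thresholds and variable names are illustrative, and the real inference code also handles flag detection and other edge cases.

# Illustrative trigger: count new product carriers as the bounding-box count
# oscillates between its minimum and maximum values.
MIN_PINS, MAX_PINS = 8, 10       # observed min/max pin detections per frame
CARRIERS_PER_FRAME = 6           # a camera frame fits six product carriers

carrier_id = 0                   # counter of carriers that have entered the view
at_max = False

def update_trigger(num_boxes):
    """Return True when an entire frame of carriers has shifted off and inference should run."""
    global carrier_id, at_max
    triggered = False
    if num_boxes >= MAX_PINS and not at_max:
        carrier_id += 1          # a new product carrier just moved fully into view
        at_max = True
        triggered = (carrier_id % CARRIERS_PER_FRAME == 0)
    elif num_boxes <= MIN_PINS:
        at_max = False           # count dropped back; wait for the next peak
    return triggered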

Run Lookout for Vision inference

We crop the bounding boxes from the frame image and process them using the same resize function described earlier, and then forward these images to the Lookout for Vision model for anomaly classification. In response, the Lookout for Vision model produces a label (normal or anomaly) and confidence score.

Isolate pin locations and deduplicate anomaly detections

Lastly, for this use case, it was important to identify the relative location of the product carriers and generate only one entry per bad pin to avoid duplication. To track the pin location, the inference code uses the flag as a point of reference and counts the product carrier ID. When an anomaly is detected, the product carrier ID is recorded with the pin image to provide the location reference relative to the flag. We also use this flag to help us deduplicate the anomaly detections and track when an entire product carrier line has looped around. A cycle ID parameter gets incremented every time the flag appears, and all the parameters like the product carrier ID reset to 0 to start a new cycle.

Deploy models at the edge with AWS Panorama

When we have the models and the inference code ready, we package the object detection model, inference code, and camera stream into a container and deploy to AWS Panorama using the same deployment pattern described in Part 1.

Email alerts

Whenever the system detects an anomaly, the image containing the defective pin is sent to Amazon S3 for storage, and the metadata associated with it is sent to AWS IoT SiteWise. At the end of each shift, an EventBridge event triggers a Lambda function, which uses the images and metadata to send a summary email to the plant staff. The plant staff uses this information when making repairs during shift change.
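A Lambda handler for this end-of-shift summary might look roughly like the following sketch using Amazon SES; the bucket name, prefix, and email addresses are placeholders, and the actual implementation details may differ.

import boto3

s3 = boto3.client("s3")
ses = boto3.client("ses")

def handler(event, context):
    # Triggered by EventBridge at the end of a shift; emails a list of anomaly images.
    objects = s3.list_objects_v2(Bucket="anomaly-images-bucket", Prefix="anomalies/")  # placeholders
    keys = [obj["Key"] for obj in objects.get("Contents", [])]
    body = "Defective pins detected this shift:\n" + "\n".join(keys[:50])
    ses.send_email(
        Source="alerts@example.com",
        Destination={"ToAddresses": ["maintenance-team@example.com"]},
        Message={
            "Subject": {"Data": "Product carrier anomaly summary"},
            "Body": {"Text": {"Data": body}},
        },
    )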

Conclusion

In this post, we demonstrated how to set up a vision-based anomaly detection system in a production environment using Lookout for Vision and AWS Panorama. With this solution, plants can save 1 hour of team member time per day per line. This would save this plant alone an estimated 15,000 hours of skilled labor annually. This would free up the time of valuable Tyson team members to complete other, more complex tasks.

The models trained in this process performed well. The SSD pin detection model achieved 95% accuracy across both classes. The Lookout for Vision model was tuned to perform at 99.1% accuracy for failing pin detection. Despite running two models, the inference code easily kept up with the line speed, running at around 10 FPS.

By far the most exciting result of this project was the speedup in development time. Although this project utilizes two models and more complex application code than the project in Part 1, it took 12% less developer time to complete. This agility is only possible because of the repeatable patterns established in Part 1 and using managed services from AWS. This combination made our final solutions faster to scale and industry ready. Learn more about Amazon Lookout for Vision by going to the Amazon Lookout for Vision Resources page. You can also view other examples of AWS Panorama in action by going to the GitHub repo.


About the Authors

Audrey Timmerman is a Sr Applications Developer at Tyson Foods. She is a Computer Engineering Graduate from the University of Arkansas and has been on the Emerging Technology team at Tyson Foods for 2 years. She has an interest in computer vision, machine learning, and IoT applications.

James Wu is a Senior Customer Solutions Manager at AWS, based in Dallas, TX. He works with customers to accelerate their cloud journey and fast-track their business value realization. In addition to that, James is also passionate about developing and scaling large AI/ ML solutions across various domains. Prior to joining AWS, he led a multi-discipline innovation technology team with ML engineers and software developers for a top global firm in the market and advertising industry.

Farooq Sabir is a Senior AI/ML Specialist Solutions Architect at AWS. He holds a PhD in Electrical Engineering from the University of Texas at Austin. He helps customers solve their business problems using data science, machine learning, artificial intelligence, and numerical optimization.

Elizabeth Samara Rubio is a Principal Specialist in the WWSO at Amazon Web Services, driving new AI/ML and computer vision solutions across industries, including industrial and manufacturing sectors. Prior to joining Amazon, Elizabeth was a Managing Director at Accenture leading North America Industry X growth and strategy, Divisional Vice President at AMETEK, and Business Unit Manager at Cognex.

Shreyas Subramanian is an AI/ML specialist Solutions Architect, and helps customers by using Machine Learning to solve their business challenges on the AWS Cloud.

Read More