Acoustic anomaly detection using Amazon Lookout for Equipment

As the modern factory becomes more connected, manufacturers are increasingly using a range of inputs (such as process data, audio, and visual) to increase their operational efficiency. Companies use this information to monitor equipment performance and anticipate failures using predictive maintenance techniques powered by machine learning (ML) and artificial intelligence (AI). Although traditional sensors built into the equipment can be informative, audio and visual inspection can also provide insights into the health of the asset. However, leveraging this data and gaining actionable insights can be highly manual and resource prohibitive.

Koch Ag & Energy Solutions, LLC (KAES) took the opportunity to collaborate with Amazon ML Solutions Lab to learn more about alternative acoustic anomaly detection solutions and to get another set of eyes on their existing solution.

The ML Solutions Lab team used the existing data collected by KAES equipment in the field for an in-depth acoustic data exploration. In collaboration with the lead data scientist at KAES, the ML Solutions Lab team engaged with an internal team at Amazon that had participated in the Detection and Classification of Acoustic Scenes and Events 2020 competition and won high marks for their efforts. After reviewing the documentation from Giri et al. (2020), the team presented some very interesting insights into the acoustic data:

  • Industrial data is relatively stationary, so the recorded audio window size can be longer in duration
  • Inference intervals could be increased from 1 second to 10–30 seconds
  • The sampling rates for the recorded sounds could be lowered and still retain the pertinent information

Furthermore, the team investigated two different approaches to feature engineering that KAES hadn’t previously explored. The first was an average-spectral featurizer; the second was an advanced deep-learning-based featurizer (a VGGish network). For this effort, the team didn’t need the classifier for the VGGish classes. Instead, they removed the top-level classifier layer and kept the network as a feature extractor. With this feature extraction approach, the network converts audio input into a high-level 128-dimensional embedding, which can be fed as input to another ML model. Compared to raw audio features, such as waveforms and spectrograms, this deep learning embedding is more semantically meaningful. The ML Solutions Lab team also designed an optimized API for processing all the audio files, which decreased the I/O time by more than 90% and the overall processing time by around 70%.
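To make the first approach concrete, the following is a minimal sketch of an average-spectral featurizer. It is illustrative only (not the code used in this engagement), and the sampling rate, FFT size, and clip length are placeholder values.

import numpy as np
from scipy.signal import spectrogram

def average_spectral_features(waveform: np.ndarray, sample_rate: int, n_fft: int = 1024) -> np.ndarray:
    """Collapse an audio window into a single spectral feature vector.

    Computes a spectrogram and averages it over time, which suits relatively
    stationary industrial audio recorded over 10-30 second windows.
    """
    _freqs, _times, spec = spectrogram(waveform, fs=sample_rate, nperseg=n_fft)
    log_spec = np.log1p(spec)      # compress the dynamic range
    return log_spec.mean(axis=1)   # average over time: one value per frequency bin

# Example usage on a synthetic 10-second clip sampled at 16 kHz
clip = np.random.randn(10 * 16000)
features = average_spectral_features(clip, sample_rate=16000)
print(features.shape)  # (n_fft // 2 + 1,) = (513,)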

Anomaly detection with Amazon Lookout for Equipment

To implement these solutions, the ML Solutions Lab team used Amazon Lookout for Equipment, a new service that helps to enable predictive maintenance. Amazon Lookout for Equipment uses AI to learn the normal operating patterns of industrial equipment and alert users to abnormal equipment behavior. Amazon Lookout for Equipment helps organizations take action before machine failures occur and avoid unplanned downtime.

Successfully implementing predictive maintenance depends on using the data collected from industrial equipment sensors, under their unique operating conditions, and then applying sophisticated ML techniques to build a custom model that can detect abnormal machine conditions before machine failures occur.

Amazon Lookout for Equipment analyzes the data from industrial equipment sensors to automatically train a specific ML model for that equipment with no ML expertise required. It learns the multivariate relationships between the sensors (tags) that define the normal operating modes of the equipment. With this service, you can reduce the number of manual data science steps and resource hours to develop a model. Furthermore, Amazon Lookout for Equipment uses the unique ML model to analyze incoming sensor data in near-real time to accurately identify early warning signs that could lead to machine failures with little or no manual intervention. This enables detecting equipment abnormalities with speed and precision, quickly diagnosing issues, taking action to reduce expensive downtime, and reducing false alerts.

With KAES, the ML Solutions Lab team developed a proof of concept pipeline that demonstrated the data ingestion steps for both sound and machine telemetry. The team used the telemetry data to identify the machine operating states and inform which audio data was relevant for training. For example, a pump at low speed has a certain auditory signature, whereas a pump at high speed may have a different auditory signature. The relationship between measurements like RPMs (speed) and the sound is key to understanding machine performance and health. The ML training time decreased from around 6 hours to less than 20 minutes when using Amazon Lookout for Equipment, which enabled faster model explorations.
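As an illustration of how telemetry can gate which audio windows are used for training, the following sketch attaches the most recent telemetry reading to each audio feature window and keeps only the windows recorded in a high-speed operating mode. The column names, timestamps, and RPM threshold are hypothetical, not values from the KAES engagement.

import pandas as pd

# Hypothetical inputs: one row per 10-second audio window and per telemetry sample
audio_features = pd.DataFrame({
    "timestamp": pd.date_range("2021-01-01", periods=6, freq="10s"),
    "spectral_energy": [0.9, 1.1, 3.2, 3.4, 1.0, 0.8],
})
telemetry = pd.DataFrame({
    "timestamp": pd.date_range("2021-01-01", periods=3, freq="30s"),
    "pump_rpm": [450, 1800, 1750],
})

# Attach the most recent telemetry reading to each audio window
merged = pd.merge_asof(
    audio_features.sort_values("timestamp"),
    telemetry.sort_values("timestamp"),
    on="timestamp",
    direction="backward",
)

# Keep only audio recorded in the high-speed operating mode
high_speed_audio = merged[merged["pump_rpm"] > 1000]
print(high_speed_audio)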

This pipeline can serve as the foundation to build and deploy anomaly detection models for new assets. After sufficient data is ingested into the Amazon Lookout for Equipment platform, inference can begin and anomaly detections can be identified.

“We needed a solution to detect acoustic anomalies and potential failures of critical manufacturing machinery,” says Dave Kroening, IT Leader at KAES. “Within a few weeks, the experts at the ML Solutions Lab worked with our internal team to develop an alternative, state-of-the-art, deep neural net embedding sound featurization technique and a prototype for acoustic anomaly detection. We were very pleased with the insight that the ML Solutions Lab team provided us regarding our data and educating us on the possibilities of using Amazon Lookout for Equipment to build and deploy anomaly detection models for new assets.”

By merging the sound data with the machine telemetry data and then using Amazon Lookout for Equipment, we can derive important relationships between the telemetry data and the acoustic signals. We can learn the normal healthy operating conditions and healthy sounds in varying operating modes.

If you’d like help accelerating the use of ML in your products and services, please contact the ML Solutions Lab.


About the Authors

Michael Robinson is a Lead Data Scientist at Koch Ag & Energy Solutions, LLC (KAES). His work focuses on computer vision, acoustics, and data engineering. He leverages technical knowledge to solve unique challenges for KAES. In his spare time, he enjoys golfing, photography, and traveling.

 

 

Dave Kroening is an IT Leader with Koch Ag & Energy Solutions, LLC (KAES). His work focuses on building out a vision and strategy for initiatives that can create long term value. This includes exploring, assessing, and developing opportunities that have a potential to disrupt the Operating capability within KAES. He and his team also help to discover and experiment with technologies that can create a competitive advantage. In his spare time he enjoys spending time with his family, snowboarding, and racing.

 

Mehdi Noori is a Data Scientist at the Amazon ML Solutions Lab, where he works with customers across various verticals, and helps them to accelerate their cloud migration journey, and to solve their ML problems using state-of-the-art solutions and technologies. Mehdi attended MIT as a postdoctoral researcher and obtained his Ph.D. in Engineering from UCF.

 

 

Xin Chen is a senior manager at Amazon ML Solutions Lab, where he leads the Automotive Vertical and helps AWS customers across different industries identify and build machine learning solutions to address their organization’s highest return-on-investment machine learning opportunities. Xin obtained his Ph.D. in Computer Science and Engineering from the University of Notre Dame.

 

 

Yunzhi Shi is a data scientist at the Amazon ML Solutions Lab where he helps AWS customers address business problems with AI and cloud capabilities. Recently, he has been building computer vision, search, and forecast solutions for customers from various industrial verticals. Yunzhi obtained his Ph.D. in Geophysics from the University of Texas at Austin.

 

 

Dan Volk is a Data Scientist at Amazon ML Solutions Lab, where he helps AWS customers across various industries accelerate their AI and cloud adoption. Dan has worked in several fields including manufacturing, aerospace, and sports, and holds a Master’s in Data Science from UC Berkeley.

 

 

 

Brant Swidler is the Technical Product Manager for Amazon Lookout for Equipment. He focuses on leading product development, including data science and engineering efforts. Brant comes from an industrial background in the oil and gas industry and has a B.S. in Mechanical and Aerospace Engineering from Washington University in St. Louis and an MBA from the Tuck School of Business at Dartmouth.

Read More

Win a digital car and personalize your racer profile on the AWS DeepRacer console

AWS DeepRacer is the fastest way to get rolling with machine learning, giving developers the chance to learn ML hands-on with a 1/18th scale autonomous car, a 3D virtual racing simulator, and the world’s largest global autonomous car racing league. With the 2021 AWS DeepRacer League Virtual Circuit now underway, developers have five times more opportunities to win physical prizes, such as exclusive AWS DeepRacer merchandise, AWS DeepRacer Evo devices, and even an expenses-paid trip to AWS re:Invent 2021 to compete in the AWS DeepRacer Championship Cup.

To win physical prizes, show us your skills by racing in one of the AWS monthly qualifiers, becoming a Pro by finishing in the top 10% of an Open race leaderboard, or qualifying for the championship by winning a monthly Pro division finale. To make ML more fun and accessible to every developer, the AWS DeepRacer League is taking prizing a step further and introducing new digital car customizations for every participant in the league. For each month that you participate, you’ll earn a reward exclusive to that race and division. After all, if your ML model is getting rewarded, shouldn’t you get rewarded too?

Digital rewards: Collect them all and showcase your collection

Digital rewards are unique cars, paint jobs, and body kits that are stored in a new section of the AWS DeepRacer console: your racer profile. Unlocking a new reward is like giving your model the West Coast Customs car treatment. While X to the Z might be famous for kitting out your ride in the streets, A to the Z is here to hook you up in the virtual simulator!

No two rewards are exactly alike, and each month will introduce new rewards to be earned in each racing division to add to your collection. You’ll need to race every month to collect all of the open division digital rewards in your profile. If you advance to the Pro division, you’ll unlock twice the rewards, with an additional Pro division reward each month that’s only available to the fastest in the league.

If you participated in the March 2021 races, you’ll see some special deliveries dropping into your racer profile starting today. Open division racers will receive the white box van, and Pro division racers will receive both the white box van and the Pro Exclusive AWS DeepRacer Van. Despite their size, they’re just as fast as any other vehicle you race on the console—they’re merely skins and don’t change the agent’s capabilities or performance.

But that’s not all—AWS DeepRacer will keep the rewards coming with surprise limited edition rewards throughout the season for specific racing achievements and milestones. But you’ll have to keep racing to find them! When it’s time to celebrate, the confetti will fall! The next time you log in and access your racer profile, you’ll see the celebration to commemorate your achievement. After a new digital reward is added to your racer profile, you can choose the reward to open your garage and start personalizing your car and action space, or assign it to any existing model in your garage using the Mod vehicle feature.

When you select that model to race in the league, you’ll see your customized vehicle in your race evaluation video. You can also head over to the race leaderboard to watch other racers’ evaluations and check out which customizations they’re using for their models to size up the competition.

Customize your racer profile and avatar

While new digital rewards allow you to customize your car on the track, the new Your racer profile page allows you to customize your personal appearance across the AWS DeepRacer console. With the new avatar tool, you can select from a variety of options to replicate your real-life style or try out a completely new appearance to showcase your personality. Your racer profile also allows you to designate your country, which adds a flag to your avatar and in-console live races, giving you the opportunity to represent your region and see where other competitors are racing from all over the globe.

Your avatar appears on each race leaderboard page for you to see, and if you’re on top of the leaderboard, everyone can see your avatar in first position, claiming victory! If you qualify to participate in a live race such as the monthly Pro division finale, your avatar is also featured each time you’re on the track. In addition to housing the avatar tool and your digital rewards, the Your racer profile page also provides useful stats such as your division, the number of races you have won, and how long you have been racing in the AWS DeepRacer League.

Get rolling today

The April 2021 races are just getting underway in the Virtual Circuit. Head over to the AWS DeepRacer League today to get rolling, or sign in to the AWS DeepRacer console to start customizing your avatar and collecting digital rewards!

To see the new avatars in action, tune into the AWS DeepRacer League LIVE Pro Finale at 5:30pm PST, the second Thursday of every month on the AWS Twitch channel. The first race will take place on April 8th.


About the Author

Joe Fontaine is the Marketing Program Manager for AWS AI/ML Developer Devices. He is passionate about making machine learning more accessible to all through hands-on educational experiences. Outside of work he enjoys freeride mountain biking, aerial cinematography, and exploring the wilderness with his dogs. He is proud to be a recent inductee to the “rad dads” club.

Read More

Improve operational efficiency with integrated equipment monitoring with TensorIoT powered by AWS

Machine downtime has a dramatic impact on your operational efficiency. Unexpected machine downtime is even worse. Detecting industrial equipment issues at an early stage and using that data to inform proper maintenance can give your company a significant increase in operational efficiency.

Customers see value in detecting abnormal behavior in industrial equipment to improve maintenance lifecycles. However, implementing advanced maintenance approaches has multiple challenges. One major challenge is the sheer volume of data recorded from sensors and logs, as well as managing equipment and site metadata. These different forms of data may either be inaccessible or spread across disparate systems that can impede access and processing. After this data is consolidated, the next step is gaining insights to prioritize the most operationally efficient maintenance strategy.

A range of data processing tools exist today, but most require significant manual effort to implement or maintain, which acts as a barrier to use. Furthermore, managing advanced analytics such as machine learning (ML) requires either in-house or external data scientists to manage models for each type of equipment. This can lead to a high cost of implementation and can be daunting for operators that manage hundreds or thousands of sensors in a refinery or hundreds of turbines on a wind farm.

Real-time data capture and monitoring of your IoT assets with TensorIoT

TensorIoT, an AWS Advanced Consulting Partner, is no stranger to the difficulties companies face when looking to harness their data to improve their business practices. TensorIoT creates products and solutions to help companies benefit from the power of ML and IoT.

“Regardless of size or industry, companies are seeking to achieve greater situational awareness, gain actionable insight, and make more confident decisions,” says John Traynor, TensorIoT VP of Products.

For industrial customers, TensorIoT is adept at integrating sensors and machine data with AWS tools into a holistic system that keeps operators informed about the status of their equipment at all times. TensorIoT uses AWS IoT Greengrass with AWS IoT SiteWise and other AWS Cloud services to help clients collect data from both direct equipment measurements and add-on sensors through connected devices to measure factors such as humidity, temperature, pressure, power, and vibration, giving a holistic view of machine operation. To help businesses gain increased understanding of their data and processes, TensorIoT created SmartInsights, a product that incorporates data from multiple sources for analysis and visualization. Clear visualization tools combined with advanced analytics mean that the assembled data is easy to understand and actionable for users. This is seen in the following screenshot, which shows the specific site where an anomaly occurred and a ranking based on production or process efficiency.

TensorIoT built the connectivity to ingest data into Amazon Lookout for Equipment (an industrial equipment monitoring service that detects abnormal equipment behavior) for analysis, and then used SmartInsights as the visualization tool for users to act on the outcome. Whether an operational manager wants to visualize the health of an asset or send automated push notifications to maintenance teams, such as an alarm or an Amazon Simple Notification Service (Amazon SNS) message, SmartInsights keeps industrial sites and factory floors operating at peak performance for even the most complex device hierarchies. Powered by AWS, TensorIoT helps companies rapidly and precisely detect equipment abnormalities, diagnose issues, and take immediate action to reduce expensive downtime.

Simplify machine learning with Amazon Lookout for Equipment

ML offers industrial companies the ability to automatically discover new insights from data that is being collected across systems and equipment types. In the past, however, industrial ML-enabled solutions such as equipment condition monitoring have been reserved for the most critical or expensive assets, due to the high cost of developing and managing the required models. Traditionally, a data scientist needed to go through dozens of steps to build an initial model for industrial equipment monitoring that can detect abnormal behavior. Amazon Lookout for Equipment automates these traditional data science steps to open up more opportunities for a broader set of equipment than ever before. Amazon Lookout for Equipment reduces the heavy lifting to create ML algorithms so you can take advantage of industrial equipment monitoring to identify anomalies, and gain new actionable insights that help you improve your operations and avoid downtime.

Historically, ML models can also be complex to manage due to changing or new operations. Amazon Lookout for Equipment is making it easier and faster to get feedback from the engineers closest to the equipment by enabling direct feedback and iteration of these models. That means that a maintenance engineer can prioritize which insights are the most important to detect based on current operations, such as process, signal, or equipment issues. Amazon Lookout for Equipment enables the engineer to label these events to continue to refine and prioritize so the insights stay relevant over the life of the asset.

Combining TensorIoT and Amazon Lookout for Equipment has never been easier

To delve deeper into how to visualize near real-time insights gained from Amazon Lookout for Equipment, let’s explore the process. It’s important to have historic and failure data so we can train the model to learn what patterns occur before failure. When trained, the model can create inferences about pending events from new, live data from that equipment. Historically, this has been a time-consuming barrier to adoption because each piece of equipment requires separate training due to its unique operation; Amazon Lookout for Equipment removes much of that effort, and SmartInsights visualizes the results.

For our example, we start by identifying a suitable dataset where we have sensor and other operational data from a piece of equipment, as well as historic data about when the equipment has been operating outside of specifications or has failed, if available.

To demonstrate how to use Amazon Lookout for Equipment and visualize results in near real time in SmartInsights, we used a publicly available set of wind turbine data. Our dataset from the La Haute Borne wind farm spanned several hundred thousand rows and over 100 columns of data from a variety of sensors on the equipment. Data included the rotor speed, pitch angle, generator bearing temperatures, gearbox bearing temperatures, oil temperature, multiple power measurements, wind speed and direction, outdoor temperature, and more. The maximum, average, and other statistical characteristics were also stored for each data point.

The following table is a subset of the columns used in our analysis.

Variable_name Variable_long_name Unit_long_name
Turbine Wind_turbine_name
Time Date_time
Ba Pitch_angle deg
Cm Converter_torque Nm
Cosphi Power_factor
Db1t Generator_bearing_1_temperature deg_C
Db2t Generator_bearing_2_temperature deg_C
DCs Generator_converter_speed rpm
Ds Generator_speed rpm
Dst Generator_stator_temperature deg_C
Gb1t Gearbox_bearing_1_temperature deg_C
Gb2t Gearbox_bearing_2_temperature deg_C
Git Gearbox_inlet_temperature deg_C
Gost Gearbox_oil_sump_temperature deg_C
Na_c Nacelle_angle_corrected deg
Nf Grid_frequency Hz
Nu Grid_voltage V
Ot Outdoor_temperature deg_C
P Active_power kW
Pas Pitch_angle_setpoint
Q Reactive_power kVAr
Rbt Rotor_bearing_temperature deg_C
Rm Torque Nm
Rs Rotor_speed rpm
Rt Hub_temperature deg_C
S Apparent_power kVA
Va Vane_position deg
Va1 Vane_position_1 deg
Va2 Vane_position_2 deg
Wa Absolute_wind_direction deg
Wa_c Absolute_wind_direction_corrected deg
Ws Wind_speed m/s
Ws1 Wind_speed_1 m/s
Ws2 Wind_speed_2 m/s
Ya Nacelle_angle deg
Yt Nacelle_temperature deg_C

Using Amazon Lookout for Equipment consists of three stages: ingestion, training, and inference (or detection). After the model is trained with available historical data, inference can happen automatically on a selected time interval, such as every 5 minutes or 1 hour.
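As a sketch of what the ingestion stage can look like, the snippet below reshapes a wide turbine export into one timestamped CSV per sensor, which is one layout you can stage in Amazon S3 for Amazon Lookout for Equipment. The file name, column names, and resampling interval are assumptions based on the table above, not the exact pipeline used here.

import pandas as pd

# Hypothetical export of the La Haute Borne data: one row per 10-minute sample
raw = pd.read_csv("la_haute_borne.csv", parse_dates=["Timestamp"])

sensors = ["Ws", "Rs", "P", "Gb1t", "Gb2t", "Ot"]  # subset of the tags listed above

for sensor in sensors:
    series = (
        raw.set_index("Timestamp")[sensor]
        .resample("5min")          # align to the inference interval you plan to use
        .mean()
        .dropna()
        .reset_index()
    )
    # One file per sensor, for example s3://<bucket>/turbine-1/Ws/Ws.csv after upload
    series.to_csv(f"turbine-1_{sensor}.csv", index=False)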

First, let’s look at the Amazon Lookout for Equipment side of the process. In this example, we trained using historic data and evaluated the model against 1 year of historic data. Based on these results, 148 of the 150 events were detected with an average forewarning time of 18 hours.

For each of the events, a diagnostic of the key contributing sensors is given to support evaluation of the root cause, as shown in the following screenshot.

SmartInsights provides visualization of data from each asset and incorporates the events from Amazon Lookout for Equipment. SmartInsights can then pair the original measurements with the anomalies identified by Amazon Lookout for Equipment using the common timestamp. This allows SmartInsights to show measurements and anomalies on a common timescale and gives the operator context to these events. In the following graphical representation, a green bar is overlaid on top of the anomalies. You can deep dive by evaluating the diagnostics against the asset to determine when and how to respond to the event.

With the wind turbine data used in our example, SmartInsights provided visual evidence of the events with forewarning based on results from Amazon Lookout for Equipment. In a production environment, the prediction could create a notification or alert to operating personnel, or trigger a work order in another application to dispatch personnel to take corrective action before failure.

SmartInsights supports triggering alerts in response to certain conditions. For example, you can configure SmartInsights to send a message to a Slack channel or send a text message. Because SmartInsights is built on AWS, the notification endpoint can be any destination supported by Amazon SNS. For example, the following view of SmartInsights on a mobile device contains a list of alerts that have been triggered within a certain time window, to which a SmartInsights user can subscribe.
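Because notifications ultimately flow through Amazon SNS, a minimal alert publisher can look like the following sketch; the topic ARN, message fields, and recommended action are placeholders, and SmartInsights’ own integration is likely more involved.

import json
import boto3

sns = boto3.client("sns")

def publish_anomaly_alert(asset_id: str, score: float, topic_arn: str) -> None:
    """Push a simple anomaly notification to an SNS topic (email, SMS, chat bridge, ...)."""
    message = {
        "asset": asset_id,
        "anomaly_score": score,
        "action": "Inspect gearbox bearing temperatures",  # placeholder recommendation
    }
    sns.publish(
        TopicArn=topic_arn,
        Subject=f"Anomaly detected on {asset_id}",
        Message=json.dumps(message),
    )

# Hypothetical usage
publish_anomaly_alert("turbine-1", 0.92, "arn:aws:sns:us-east-1:123456789012:equipment-alerts")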

The following architecture diagram shows how Amazon Lookout for Equipment is used with SmartInsights. For many applications, Amazon Lookout for Equipment provides an accelerated path to anomaly detection without the need to hire a data scientist, helping the solution meet its business return on investment.

Maximize uptime, increase safety, and improve machine efficiency

Condition-based maintenance is beneficial for your business on a multitude of levels:

  • Maximized uptime – When maintenance events are predicted, you decide the optimal scheduling to minimize the impact on your operational efficiency.
  • Increased safety – Condition-based maintenance ensures that your equipment remains in safe operating conditions, which protects your operators and your machinery by catching issues before they become problems.
  • Improved machine efficiency – As your machines undergo normal wear and tear, their efficiency decreases. Condition-based maintenance keeps your machines in optimal conditions and extends the lifespan of your equipment.

Conclusion

Even before the release of Amazon Lookout for Equipment, TensorIoT helped industrial manufacturers innovate their machinery through the implementation of modern architectures, sensors for legacy augmentation, and ML to make the newly acquired data intelligible and actionable. With Amazon Lookout for Equipment and its own solutions, TensorIoT helps make your assets even smarter.

To explore how you can use Amazon Lookout for Equipment with SmartInsights to more rapidly gain insight into pending equipment failures and reduce downtime, get in touch with TensorIoT via contact@tensoriot.com.

Details on how to start using Amazon Lookout for Equipment are available on the webpage.


About the Authors

Alicia Trent is a Worldwide Business Development Manager at Amazon Web Services. She has 15 years of experience in Technology across industrial sectors and is a graduate of the Georgia Institute of Technology, where she earned a BS degree in chemical and biomolecular engineering, and an MS degree in mechanical engineering.

Dastan Aitzhanov is a Solutions Architect in Applied AI with Amazon Web Services. He specializes in architecting and building scalable cloud-based platforms with an emphasis on Machine Learning, Internet of Things, and Big Data driven applications. When not working, he enjoys going camping, skiing, and just spending time in the great outdoors with his family.

Nicholas Burden is a Senior Technical Evangelist at TensorIoT, where he focuses on translating complex technical jargon into digestible information. He has over a decade of technical writing experience and a Master’s in Professional Writing from USC. Outside of work, he enjoys tending to an ever-growing collection of houseplants and spending time with pets and family.

Read More

Object detection with Detectron2 on Amazon SageMaker

Deep learning is at the forefront of most machine learning (ML) implementations across a broad set of business verticals. Driven by the highly flexible nature of neural networks, the boundary of what is possible has been pushed to a point where neural networks can outperform humans in a variety of tasks, such as object detection tasks in the context of computer vision (CV) problems.

Object detection, which is one type of CV task, has many applications in various fields like medicine, retail, or agriculture. For example, retail businesses want to be able to detect stock keeping units (SKUs) in store shelf images to analyze buyer trends or identify when product restock is necessary. Object detection models allow you to implement these diverse use cases and automate your in-store operations.

In this post, we discuss Detectron2, an object detection and segmentation framework released by Facebook AI Research (FAIR), and its implementation on Amazon SageMaker to solve a dense object detection task for retail. This post includes an associated sample notebook, which you can run to demonstrate all the features discussed in this post. For more information, see the GitHub repository.

Toolsets used in this solution

To implement this solution, we use Detectron2, PyTorch, SageMaker, and the public SKU-110K dataset.

Detectron2

Detectron2 is a ground-up rewrite of Detectron that started with maskrcnn-benchmark. The platform is now implemented in PyTorch. With a new, more modular design, Detectron2 is flexible and extensible, and provides fast training on single or multiple GPU servers. Detectron2 includes high-quality implementations of state-of-the-art object detection algorithms, including DensePose, panoptic feature pyramid networks, and numerous variants of the pioneering Mask R-CNN model family also developed by FAIR. Its extensible design makes it easy to implement cutting-edge research projects without having to fork the entire codebase.

PyTorch

PyTorch is an open-source, deep learning framework that makes it easy to develop ML models and deploy them to production. With PyTorch’s TorchScript, developers can seamlessly transition between eager mode, which performs computations immediately for easy development, and graph mode, which creates computational graphs for efficient implementations in production environments. PyTorch also offers distributed training, deep integration into Python, and a rich ecosystem of tools and libraries, which makes it popular with researchers and engineers.

An example of that rich ecosystem of tools is TorchServe, a recently released model-serving framework for PyTorch that helps deploy trained models at scale without having to write custom code. TorchServe is built and maintained by AWS in collaboration with Facebook and is available as part of the PyTorch open-source project. For more information, see the TorchServe GitHub repo and Model Server for PyTorch Documentation.

Amazon SageMaker

SageMaker is a fully managed service that provides every developer and data scientist with the ability to build, train, and deploy ML models quickly. SageMaker removes the heavy lifting from each step of the ML process to make it easier to develop high-quality models.

Dataset

For our use case, we use the SKU-110K dataset introduced by Goldman et al. in the paper “Precise Detection in Densely Packed Scenes” (Proceedings of the 2019 Conference on Computer Vision and Pattern Recognition). This dataset contains 11,762 images of store shelves from around the world. Researchers use this dataset to test object detection algorithms on dense scenes. The term density here refers to the number of objects per image. The average number of items per image is 147.4, which is 19 times more than in the COCO dataset. Moreover, the images contain multiple identical objects grouped together that are challenging to separate. The dataset contains bounding box annotations on SKUs. The categories of product aren’t distinguished because the bounding box labels only indicate the presence or absence of an item.

Introduction to Detectron2

Detectron2 is FAIR’s next generation software system that implements state-of-the-art object detection algorithms. It’s a ground-up rewrite of the previous version, Detectron, and it originates from maskrcnn-benchmark. The following screenshot is an example of the high-level structure of the Detectron2 repo, which will make more sense when we explore configuration files and network architectures later in this post.

For more information about the general layout of computer vision and deep learning architectures, see A Survey of the Recent Architectures of Deep Convolutional Neural Networks.

Additionally, if this is your first introduction to Detectron2, see the official documentation to learn more about the feature-rich capabilities of Detectron2. For the remainder of this post, we solely focus on implementation details pertaining to deploying Detectron2-powered object detection on SageMaker rather than discussing the underlying computer vision-specific theory.

Update the SageMaker role

To build custom training and serving containers, you need to attach additional Amazon Elastic Container Registry (Amazon ECR) permissions to your SageMaker AWS Identity and Access Management (IAM) role. You can use an AWS-authored policy (such as AmazonEC2ContainerRegistryPowerUser) or create your own custom policy. For more information, see How Amazon SageMaker Works with IAM.

Update the dataset

Detectron2 includes a set of utilities for data loading and visualization. However, you need to register your custom dataset to use Detectron2’s data utilities. You can do this by using the function register_dataset in the catalog.py file from the GitHub repo. This function iterates on the training, validation, and test sets. At each iteration, it calls the function aws_file_mode, which returns a list of annotations given the path to the folder that contains the images and the path to the augmented manifest file that contains the annotations. Augmented manifest files are the output format of Amazon SageMaker Ground Truth bounding box jobs. You can reuse the code associated with this post on your own data labeled for object detection with Ground Truth.
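For reference, a single line of such an augmented manifest looks roughly like the following; the attribute name (sku), file name, and pixel values are illustrative, and the exact keys depend on how the labeling job or conversion script was configured.

# One JSON line per image, in the augmented manifest format produced by Ground Truth
example_line = {
    "source-ref": "s3://my-bucket/detectron2/data/training/train_0001.jpg",
    "sku": {
        "image_size": [{"width": 2448, "height": 3264, "depth": 3}],
        "annotations": [
            {"class_id": 0, "left": 1020, "top": 745, "width": 110, "height": 240},
            {"class_id": 0, "left": 1190, "top": 750, "width": 105, "height": 235},
        ],
    },
    "sku-metadata": {
        "class-map": {"0": "SKU"},
        "type": "groundtruth/object-detection",
        "human-annotated": "yes",
    },
}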

Let’s prepare the SKU-110K dataset so that training, validation, and test images are in dedicated folders, and the annotations are in augmented manifest file format. First, import the required packages, define the S3 bucket, and set up the SageMaker session:

from pathlib import Path
from urllib import request
import tarfile
from typing import Sequence, Mapping, Optional
from tqdm import tqdm
from datetime import datetime
import tempfile
import json

import pandas as pd
import numpy as np
import boto3
import sagemaker

bucket = "my-bucket" # TODO: replace with your bucker
prefix_data = "detectron2/data"
prefix_model = "detectron2/training_artefacts"
prefix_code = "detectron2/model"
prefix_predictions = "detectron2/predictions"
local_folder = "cache"

sm_session = sagemaker.Session(default_bucket=bucket)
role = sagemaker.get_execution_role()

Then, download the dataset:

sku_dataset = ("SKU110K_fixed", "http://trax-geometry.s3.amazonaws.com/cvpr_challenge/SKU110K_fixed.tar.gz")

if not (Path(local_folder) / sku_dataset[0]).exists():
    compressed_file = tarfile.open(fileobj=request.urlopen(sku_dataset[1]), mode="r|gz")
    compressed_file.extractall(path=local_folder)
else:
    print(f"Using the data in `{local_folder}` folder")
path_images = Path(local_folder) / sku_dataset[0] / "images"
assert path_images.exists(), f"{path_images} not found"

prefix_to_channel = {
    "train": "training",
    "val": "validation",
    "test": "test",
}
for channel_name in prefix_to_channel.values():
    if not (path_images.parent / channel_name).exists():
        (path_images.parent / channel_name).mkdir()

for path_img in path_images.iterdir():
    for prefix in prefix_to_channel:
        if path_img.name.startswith(prefix):
            path_img.replace(path_images.parent / prefix_to_channel[prefix] / path_img.name)

Next, upload the image files to Amazon Simple Storage Service (Amazon S3) using the utilities from the SageMaker Python SDK:

channel_to_s3_imgs = {}

for channel_name in prefix_to_channel.values():
    inputs = sm_session.upload_data(
        path=str(path_images.parent / channel_name),
        bucket=bucket,
        key_prefix=f"{prefix_data}/{channel_name}"
    )
    print(f"{channel_name} images uploaded to {inputs}")
    channel_to_s3_imgs[channel_name] = inputs

SKU-110K annotations are stored in CSV files. The following function converts the annotations to JSON lines (refer to the GitHub repo to see the implementation):

def create_annotation_channel(
    channel_id: str, path_to_annotation: Path, bucket_name: str, data_prefix: str,
    img_annotation_to_ignore: Optional[Sequence[str]] = None
) -> Sequence[Mapping]:
    r"""Change format from original to augmented manifest files

    Parameters
    ----------
    channel_id : str
        name of the channel, i.e. training, validation or test
    path_to_annotation : Path
        path to annotation file
    bucket_name : str
        bucket where the data are uploaded
    data_prefix : str
        bucket prefix
    img_annotation_to_ignore : Optional[Sequence[str]]
        annotations from these images are ignored because the corresponding images are corrupted; defaults to None

    Returns
    -------
    Sequence[Mapping]
        List of JSON lines; each line contains the annotations for a single image. This recreates the
        format of augmented manifest files that are generated by Amazon SageMaker Ground Truth
        labeling jobs
    """
    …

channel_to_annotation_path = {
    "training": Path(local_folder) / sku_dataset[0] / "annotations" / "annotations_train.csv",
    "validation": Path(local_folder) / sku_dataset[0] / "annotations" / "annotations_val.csv",
    "test": Path(local_folder) / sku_dataset[0] / "annotations" / "annotations_test.csv",
}
channel_to_annotation = {}

for channel in channel_to_annotation_path:
    annotations = create_annotation_channel(
        channel,
        channel_to_annotation_path[channel],
        bucket,
        prefix_data,
        CORRUPTED_IMAGES[channel]
    )
    print(f"Number of {channel} annotations: {len(annotations)}")
    channel_to_annotation[channel] = annotations

Finally, upload the manifest files to Amazon S3:

def upload_annotations(p_annotations, p_channel: str):
    rsc_bucket = boto3.resource("s3").Bucket(bucket)
    
    json_lines = [json.dumps(elem) for elem in p_annotations]
    to_write = "n".join(json_lines)

    with tempfile.NamedTemporaryFile(mode="w") as fid:
        fid.write(to_write)
        rsc_bucket.upload_file(fid.name, f"{prefix_data}/annotations/{p_channel}.manifest")

for channel_id, annotations in channel_to_annotation.items():
    upload_annotations(annotations, channel_id)

Visualize the dataset

Detectron2 provides toolsets to inspect datasets. You can visualize the dataset input images and their ground truth bounding boxes. First, you need to add the dataset to the Detectron2 catalog:

import random
from typing import Sequence, Mapping
import cv2
from matplotlib import pyplot as plt
from detectron2.data import DatasetCatalog, MetadataCatalog
from detectron2.utils.visualizer import Visualizer
# custom code
from datasets.catalog import register_dataset, DataSetMeta

ds_name = "sku110k"
metadata = DataSetMeta(name=ds_name, classes=["SKU",])
channel_to_ds = {"test": ("data/test/", "data/test.manifest")}
register_dataset(
    metadata=metadata, label_name="sku", channel_to_dataset=channel_to_ds,
)

You can now plot annotations on an image as follows:

dataset_samples: Sequence[Mapping] = DatasetCatalog.get(f"{ds_name}_test")
sample = random.choice(dataset_samples)
fname = sample["file_name"]
print(fname)
img = cv2.imread(fname)
visualizer = Visualizer(
    img[:, :, ::-1], metadata=MetadataCatalog.get(f"{ds_name}_test"), scale=1.0
)
out = visualizer.draw_dataset_dict(sample)

plt.imshow(out.get_image())
plt.axis("off")
plt.tight_layout()
plt.show()

The following picture shows an example of ground truth bounding boxes on a test image.

Distributed training on Detectron2

You can use Docker containers with SageMaker to train Detectron2 models. In this post, we describe how you can run distributed Detectron2 training jobs for a larger number of iterations across multiple nodes and GPU devices on a SageMaker training cluster.

The process includes the following steps:

  1. Create a training script capable of running and coordinating training tasks in a distributed environment.
  2. Prepare a custom Docker container with configured training runtime and training scripts.
  3. Build and push the training container to Amazon ECR.
  4. Initialize training jobs via the SageMaker Python SDK.

Prepare the training script for the distributed cluster

The sku-110k folder contains the source code that we use to train the custom Detectron2 model. The script training.py is the entry point of the training process. The following sections of the script are worth discussing in detail:

  • __main__ guard – The SageMaker Python SDK runs the code inside the main guard when used for training. The train function is called with the script arguments.
  • _parse_args() – This function parses arguments from the command line and from the SageMaker environment. For example, you can choose which model to train among Faster-RCNN and RetinaNet. The SageMaker environment variables define the input channel locations and where the model artifacts are stored. The number of GPUs and the number of hosts define the properties of the training cluster.
  • train() – We use the Detectron2 launch utility to start training on multiple nodes; a sketch of this hand-off appears after this list.
  • _train_impl()– This is the actual training script, which is run on all processes and GPU devices. This function runs the following steps:
    • Register the custom dataset to Detectron2’s catalog.
    • Create the configuration node for training.
    • Fit the training dataset to the chosen object detection architecture.
    • Save the training artifacts and run the evaluation on the test set if the current node is the primary.
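The following is a hedged sketch (not the actual training.py from the repo) of how such a script typically reads the SageMaker environment variables and hands off to Detectron2’s launch utility. Argument names, the port, and default values are illustrative assumptions.

import argparse
import json
import os

from detectron2.engine import launch

def _parse_args():
    parser = argparse.ArgumentParser()
    parser.add_argument("--model-type", choices=["faster_rcnn", "retinanet"], default="faster_rcnn")
    # SageMaker injects these values as environment variables on every training node
    parser.add_argument("--train-channel", default=os.environ.get("SM_CHANNEL_TRAINING"))
    parser.add_argument("--model-dir", default=os.environ.get("SM_MODEL_DIR"))
    parser.add_argument("--num-gpus", type=int, default=int(os.environ.get("SM_NUM_GPUS", 1)))
    parser.add_argument("--hosts", default=json.loads(os.environ.get("SM_HOSTS", '["algo-1"]')))
    parser.add_argument("--current-host", default=os.environ.get("SM_CURRENT_HOST", "algo-1"))
    return parser.parse_args()

def _train_impl(args):
    ...  # register the dataset, build the Detectron2 config, run the trainer

def train(args):
    machine_rank = args.hosts.index(args.current_host)
    launch(
        _train_impl,
        num_gpus_per_machine=args.num_gpus,
        num_machines=len(args.hosts),
        machine_rank=machine_rank,
        dist_url=f"tcp://{args.hosts[0]}:29500",  # the first host coordinates the process group
        args=(args,),
    )

if __name__ == "__main__":
    train(_parse_args())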

Prepare the training container

We build a custom container with the specific Detectron2 training runtime environment. As a base image, we use the latest SageMaker PyTorch container and further extend it with Detectron2 requirements. We first need to make sure that we have access to the public Amazon ECR (to pull the base PyTorch image) and our account registry (to push the custom container). The following example code shows how to log in to both registries prior to building and pushing your custom containers:

# logging in to the ECR registry that hosts the SageMaker Deep Learning Containers
!aws ecr get-login-password --region us-east-2 | docker login --username AWS --password-stdin 763104351884.dkr.ecr.us-east-2.amazonaws.com
# logging in to your private ECR registry
!aws ecr get-login-password --region us-east-2 | docker login --username AWS --password-stdin <YOUR-ACCOUNT-ID>.dkr.ecr.us-east-2.amazonaws.com

After you successfully authenticate with Amazon ECR, you can build the Docker image for training. This Dockerfile runs the following instructions:

  1. Define the base container.
  2. Install the required dependencies for Detectron2.
  3. Copy the training script and the utilities to the container.
  4. Build Detectron2 from source.

Build and push the custom training container

We provide a simple bash script to build a local training container and push it to your account registry. If needed, you can specify a different image name, tag, or Dockerfile. The following code is a short snippet of the Dockerfile:

# Build an image of Detectron2 that can do distributed training on Amazon SageMaker, using the SageMaker PyTorch container as the base image
# from https://github.com/aws/sagemaker-pytorch-container
ARG REGION=us-east-1

FROM 763104351884.dkr.ecr.${REGION}.amazonaws.com/pytorch-training:1.6.0-gpu-py36-cu101-ubuntu16.04


############# Detectron2 pre-built binaries Pytorch default install ############
RUN pip install --upgrade torch==1.6.0+cu101 torchvision==0.7.0+cu101 -f https://download.pytorch.org/whl/torch_stable.html

############# Detectron2 section ##############
RUN pip install \
   --no-cache-dir pycocotools~=2.0.0 \
   --no-cache-dir detectron2 -f https://dl.fbaipublicfiles.com/detectron2/wheels/cu101/torch1.6/index.html

ENV FORCE_CUDA="1"
# Build D2 only for Volta architecture - V100 chips (ml.p3 AWS instances)
ENV TORCH_CUDA_ARCH_LIST="Volta" 

# Set a fixed model cache directory. Detectron2 requirement
ENV FVCORE_CACHE="/tmp"

############# SageMaker section ##############

COPY container_training/sku-110k /opt/ml/code
WORKDIR /opt/ml/code

ENV SAGEMAKER_SUBMIT_DIRECTORY /opt/ml/code
ENV SAGEMAKER_PROGRAM training.py

WORKDIR /

# Starts PyTorch distributed framework
ENTRYPOINT ["bash", "-m", "start_with_right_hostname.sh"]

Schedule the training job

You’re now ready to schedule your distributed training job. First, you need to do several common imports and configurations, which are described in detail in our companion notebook. Second, it’s important to specify which metrics you want to track during the training, which you can do by creating a JSON file with the appropriate regular expressions for each metric of interest. See the following example code:

metrics = [
    {"Name": "training:loss", "Regex": "total_loss: ([0-9\.]+)",},
    {"Name": "training:loss_cls", "Regex": "loss_cls: ([0-9\.]+)",},
    {"Name": "training:loss_box_reg", "Regex": "loss_box_reg: ([0-9\.]+)",},
    {"Name": "training:loss_rpn_cls", "Regex": "loss_rpn_cls: ([0-9\.]+)",},
    {"Name": "training:loss_rpn_loc", "Regex": "loss_rpn_loc: ([0-9\.]+)",},
    {"Name": "validation:loss", "Regex": "total_val_loss: ([0-9\.]+)",},
    {"Name": "validation:loss_cls", "Regex": "val_loss_cls: ([0-9\.]+)",},
    {"Name": "validation:loss_box_reg", "Regex": "val_loss_box_reg: ([0-9\.]+)",},
    {"Name": "validation:loss_rpn_cls", "Regex": "val_loss_rpn_cls: ([0-9\.]+)",},
    {"Name": "validation:loss_rpn_loc", "Regex": "val_loss_rpn_loc: ([0-9\.]+)",},
]

Finally, you create the estimator to start the distributed training job by calling the fit method:

training_instance = "ml.p3.8xlarge"
od_algorithm = "faster_rcnn" # choose one in ("faster_rcnn", "retinanet")

d2_estimator = Estimator(
    image_uri=training_image_uri,
    role=role,
    sagemaker_session=training_session,
    instance_count=1,
    instance_type=training_instance,
    hyperparameters=training_job_hp,
    metric_definitions=metrics,
    output_path=f"s3://{bucket}/{prefix_model}",
    base_job_name=f"detectron2-{od_algorithm.replace('_', '-')}",
)

d2_estimator.fit(
    {
        "training": training_channel,
        "validation": validation_channel,
        "test": test_channel,
        "annotation": annotation_channel,
    },
    wait=training_instance == "local",
)

Benchmark the training job performance

This set of steps allows you to scale the training performance as needed without changing a single line of code. You just have to pick your training instance and the size of your cluster. Detectron2 automatically adapts to the training cluster size by using the launch utility. The following table compares the training runtime in seconds of jobs running for 3,000 iterations.

Training cluster Faster-RCNN (seconds) RetinaNet (seconds)
ml.p3.2xlarge – 1 node 2,685 2,636
ml.p3.8xlarge – 1 node 774 742
ml.p3.16xlarge – 1 node 439 400
ml.p3.16xlarge – 2 nodes 338 311

The training time decreases for both Faster-RCNN and RetinaNet as the total number of GPUs increases. The distribution efficiency is approximately 85% and 75% when moving from an instance with a single GPU to instances with four and eight GPUs, respectively.
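The distribution efficiency quoted above can be checked with a simple speedup-per-GPU calculation over the single-GPU baseline, as in the following sketch (the GPU counts correspond to the instance types in the table).

# Scaling efficiency = (single-GPU time / multi-GPU time) / number of GPUs
runs = {
    # configuration: (num_gpus, faster_rcnn_seconds, retinanet_seconds)
    "ml.p3.2xlarge x1": (1, 2685, 2636),
    "ml.p3.8xlarge x1": (4, 774, 742),
    "ml.p3.16xlarge x1": (8, 439, 400),
    "ml.p3.16xlarge x2": (16, 338, 311),
}

_, base_frcnn, base_retina = runs["ml.p3.2xlarge x1"]
for name, (gpus, frcnn, retina) in runs.items():
    eff_frcnn = (base_frcnn / frcnn) / gpus * 100
    eff_retina = (base_retina / retina) / gpus * 100
    print(f"{name}: Faster-RCNN {eff_frcnn:.0f}%, RetinaNet {eff_retina:.0f}%")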

Deploy the trained model to a remote endpoint

To deploy your trained model remotely, you need to prepare, build, and push a custom serving container and deploy this custom container for serving via the SageMaker SDK.

Build and push the custom serving container

We use the SageMaker inference container as a base image. This image includes a pre-installed PyTorch model server to host your PyTorch model, so no additional configuration or installation is required. For more information about the Docker files and shell scripts to push and build the containers, see the GitHub repo.

For this post, we build Detectron2 for the Volta and Turing chip architectures. The Volta architecture is used to run SageMaker batch transform on P3 instance types. If you need real-time prediction, you should use G4 instance types because they provide an optimal price-performance compromise. Amazon Elastic Compute Cloud (Amazon EC2) G4 instances provide the latest generation NVIDIA T4 GPUs, AWS custom Intel Cascade Lake CPUs, up to 100 Gbps of networking throughput, up to 1.8 TB of local NVMe storage, and direct access to GPU libraries such as CUDA and cuDNN.
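If you opt for a real-time endpoint on a G4 instance rather than batch transform, the deployment call through the SageMaker Python SDK can look like the following sketch. This is not part of the companion notebook; the model artifact URI, image URI, and serializer choice are placeholders that depend on how your serving entry point parses requests.

import sagemaker
from sagemaker.pytorch import PyTorchModel
from sagemaker.serializers import IdentitySerializer
from sagemaker.deserializers import JSONDeserializer

# Placeholder values: reuse the artifacts and serving image built earlier in this post
model = PyTorchModel(
    name="d2-sku110k-realtime",
    model_data="s3://<bucket>/detectron2/training_artefacts/<job-name>/output/model.tar.gz",
    role=sagemaker.get_execution_role(),
    entry_point="predict_sku110k.py",
    source_dir="container_serving",
    image_uri="<account-id>.dkr.ecr.<region>.amazonaws.com/<serving-image>:latest",
    framework_version="1.6.0",
)

predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.g4dn.xlarge",  # NVIDIA T4 GPU for cost-effective real-time inference
    serializer=IdentitySerializer(content_type="application/x-image"),
    deserializer=JSONDeserializer(),
)

# Send one shelf image; the custom entry point decides how the payload is parsed
with open("shelf_example.jpg", "rb") as fid:
    detections = predictor.predict(fid.read())

predictor.delete_endpoint()  # avoid charges from an idle endpoint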

Run batch transform jobs on the test set

The SageMaker Python SDK gives a simple way of running inference on a batch of images. You can get the predictions on the SKU-110K test set by running the following code:

model = PyTorchModel(
    name = "d2-sku110k-model",
    model_data=training_job_artifact,
    role=role,
    sagemaker_session = sm_session,
    entry_point="predict_sku110k.py",
    source_dir="container_serving",
    image_uri=serve_image_uri,
    framework_version="1.6.0",
    code_location=f"s3://{bucket}/{prefix_code}",
)
transformer = model.transformer(
    instance_count=1,
    instance_type="ml.p3.2xlarge", # "ml.p2.xlarge"
    output_path=inference_output,
    max_payload=16
)
transformer.transform(
    data=test_channel,
    data_type="S3Prefix",
    content_type="application/x-image",
    wait=False,
) 

The batch transform saves the predictions to an S3 bucket. You can evaluate your trained models by comparing the predictions to the ground truth. We use the pycocotools library to compute the metrics that official competitions use to evaluate object detection algorithms; a minimal sketch of this evaluation appears after the following list. The authors who published the SKU-110K dataset took into account three measures in their paper “Precise Detection in Densely Packed Scenes” (Goldman et al.):

  • Average Precision (AP) at 0.5:0.95 Intersection over Union (IoU)
  • AP at 75% IoU, i.e. AP75
  • Average Recall (AR) at 0.5:0.95 IoU
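Here is the evaluation sketch referenced above, assuming the ground truth and the batch transform output have already been converted to COCO-format JSON files; the file names are placeholders.

from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

# Hypothetical COCO-format files built from the test manifest and the batch transform output
coco_gt = COCO("ground_truth_coco.json")
coco_dt = coco_gt.loadRes("predictions_coco.json")

evaluator = COCOeval(coco_gt, coco_dt, iouType="bbox")
evaluator.evaluate()
evaluator.accumulate()
evaluator.summarize()  # prints AP@[0.5:0.95], AP75, AR, and the other COCO metrics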

You can refer to the COCO website for the whole list of metrics that characterize the performance of an object detector on the COCO dataset. The following table compares the results from the paper to those obtained on SageMaker with Detectron2.

Source Model AP AP75 AR
From the paper by Goldman et al. RetinaNet 0.46 0.39 0.53
From the paper by Goldman et al. Faster-RCNN 0.04 0.01 0.05
From the paper by Goldman et al. Custom method 0.49 0.56 0.55
Detectron2 on Amazon SageMaker RetinaNet 0.47 0.54 0.55
Detectron2 on Amazon SageMaker Faster-RCNN 0.49 0.53 0.55

We use SageMaker hyperparameter tuning jobs to optimize the hyperparameters of the object detectors. Faster-RCNN has the same performance in terms of AP and AR as the model proposed by Goldman et al., which was specifically conceived for object detection in dense scenes. Our Faster-RCNN loses three points on AP75. However, this may be an acceptable performance decrease depending on the business use case. Moreover, the advantage of our solution is that it doesn’t require any custom implementation because it relies only on Detectron2 modules. This shows that you can use Detectron2 on SageMaker to train, at scale, object detectors that compete with state-of-the-art solutions in challenging contexts such as dense scenes.

Summary

This post only scratches the surface of what is possible when deploying Detectron2 on the SageMaker platform. We hope that you found this introductory use case useful and we look forward to seeing what you build on AWS with this new tool in your ML toolset!


About the Authors

Vadim Dabravolski is Sr. AI/ML Architect at AWS. Areas of interest include distributed computations and data engineering, computer vision, and NLP algorithms. When not at work, he is catching up on his reading list (anything around business, technology, politics, and culture) and jogging in NYC boroughs.

 

 

Paolo Irrera is a Data Scientist at the Amazon Machine Learning Solutions Lab where he helps customers address business problems with ML and cloud capabilities. He holds a PhD in Computer Vision from Telecom ParisTech, Paris.

 

Read More

Save the date for the AWS Machine Learning Summit: June 2, 2021

On June 2, 2021, don’t miss the opportunity to hear from some of the brightest minds in machine learning (ML) at the free virtual AWS Machine Learning Summit. Machine learning is one of the most disruptive technologies we will encounter in our generation. It’s improving customer experience, creating more efficiencies in operations, and spurring new innovations and discoveries, like helping researchers discover new vaccines and helping autonomous drones sail our world’s oceans. But we’re just scratching the surface of what is possible. This Summit, which is open to all and free to attend, brings together industry luminaries, AWS customers, and leading ML experts to share the latest in machine learning. You’ll learn about the latest science breakthroughs in ML, how ML is impacting business, best practices in building ML, and how to get started now without prior ML expertise.

Hear from ML leaders from across AWS, Amazon, and the industry, including Swami Sivasubramanian, VP of AI and Machine Learning, AWS; Bratin Saha, VP of Machine Learning, AWS; and Yoelle Maarek, VP of Research, Alexa Shopping, who will share a keynote on how we’re applying customer-obsessed science to advance ML. Andrew Ng, founder and CEO of Landing AI and founder of deeplearning.ai, will join Swami Sivasubramanian in a fireside chat about the future of ML, the skills that are fundamental for the next generation of ML practitioners, and how we can bridge the gap from proof of concept to production in ML. You’ll also get an inside look at trends in deep learning and natural language in a powerhouse fireside chat with Amazon distinguished scientists Alex Smola and Bernhard Schölkopf, and Alexa AI senior principal scientist Dilek Hakkani-Tur.

Pick from over 30 sessions across four tracks, which offer something for anyone who is interested in ML. Advanced practitioners and data scientists can learn about scientific breakthroughs and dive deep into the tools for building ML. Business and technical leaders can learn from their peers about implementing organization-wide ML initiatives. And developers can learn how to perform ML without needing any prior experience.

The science of machine learning

Advanced practitioners will get a technical deep dive into the groundbreaking work that ML scientists within AWS, Amazon, and beyond are doing to advance the science of ML in areas including computer vision, natural language processing, bias, and more. Speakers include two Amazon Scholars, Michael Kearns and Kathleen McKeown. Kearns is a professor in the Computer and Information Science department at the University of Pennsylvania, where he holds the National Center Chair. He is co-author of the book “The Ethical Algorithm: The Science of Socially Aware Algorithm Design,” and joined Amazon as a scholar in June 2020. McKeown is the Henry and Gertrude Rothschild professor of computer science at Columbia University, and the founding director of the school’s Data Science Institute. She joined Amazon as a scholar in 2019.

The impact of machine learning

Business leaders will learn from AWS customers that are leading the way in ML adoption. Customers including 3M, AstraZeneca, Vanguard, and Latent Space will share how they’re applying ML to create efficiencies, deliver new revenue streams, and launch entirely new products and business models. You’ll get best practices for scaling ML in an organization and showing impact.

How machine learning is done

Data scientists and ML developers will get practical deep dives into tools that can speed up the entire ML lifecycle, from building to training to deploying ML models. Sessions include how to choose the right algorithms, more accurate and speedy data prep, model explainability, and more.

Machine learning: no expertise required

If you’re a developer who wants to apply ML and AI to a use case but doesn’t have the expertise, this track is for you. Learn how to use AWS AI services and other tools to get started with your ML project right away, for use cases including contact center intelligence, personalization, intelligent document processing, business metrics analysis, computer vision, and more.

For more details, visit the website.


About the Author

Laura Jones is a product marketing lead for AWS AI/ML where she focuses on sharing the stories of AWS’s customers and educating organizations on the impact of machine learning. As a Florida native living and surviving in rainy Seattle, she enjoys coffee, attempting to ski and enjoying the great outdoors.

Read More

Use computer vision to detect crop disease through image analysis with Amazon Rekognition Custom Labels

Currently, many diseases affect farming and lead to significant economic losses due to reduced yield and loss of quality produce. The health of a crop or plant is often assessed by the condition of its leaves. For farmers, it is crucial to identify these symptoms early. Early identification is key to controlling diseases before they spread too far. However, manually identifying whether a leaf is infected, the type of the infection, and the required disease control solution is a hard problem to solve. Current methods can be error prone and very costly. This is where an automated machine learning (ML) solution for computer vision (CV) can help. Typically, building complex machine learning models requires hundreds of thousands of labeled images, along with expertise in data science. In this post, we showcase how you can build an end-to-end disease detection, identification, and resolution recommendation solution using Amazon Rekognition Custom Labels.

Amazon Rekognition is a fully managed service that provides CV capabilities for analyzing images and video at scale, using deep learning technology without requiring ML expertise. Amazon Rekognition Custom Labels, an automated ML feature of Amazon Rekognition, lets you quickly train custom CV models specific to your business needs, simply by bringing labeled images.

Solution overview

We create a custom model to detect plant leaf diseases. To do so, we follow these steps:

  1. Create a project in Amazon Rekognition Custom Labels.
  2. Create a dataset with images containing multiple types of plant leaf diseases.
  3. Train the model and evaluate the performance.
  4. Test the new custom model using the automatically generated API endpoint.

Amazon Rekognition Custom Labels lets you manage the ML model training process on the Amazon Rekognition console, which simplifies the end-to-end model development and inference process.

Creating your project

To create your plant leaf disease detection project, complete the following steps:

  1. On the Amazon Rekognition console, choose Custom Labels.
  2. Choose Get Started.
  3. For Project name, enter plant-leaf-disease-detection.
  4. Choose Create project.

You can also create a project from the Projects page, which you can access via the navigation pane.
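If you prefer to script this step, you can create the project with the AWS SDK instead of the console. The following is a minimal boto3 sketch; it assumes your AWS credentials and Region are already configured.

```python
import boto3

# Assumes AWS credentials and a default Region are already configured for boto3.
rekognition = boto3.client("rekognition")

# Create the Amazon Rekognition Custom Labels project used in this post.
response = rekognition.create_project(ProjectName="plant-leaf-disease-detection")
print("Project ARN:", response["ProjectArn"])
```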

Creating your dataset

To create your leaf disease detection model, you first need to create a dataset to train the model with. For this post, our dataset is composed of three categories of plant leaf disease images: bacterial leaf blight, brown spots, and leaf smut.

The following images show examples of bacterial leaf blight.

The following images show examples of brown spots.

The following images show examples of leaf smut.

We sourced our images from the UCI Machine Learning Repository (Prajapati, H. B., Shah, J. P., and Dabhi, V. K. Detection and classification of rice plant diseases. Intelligent Decision Technologies, 2017, 11(3), 357–373, doi: 10.3233/IDT-170301; Dua, D. and Graff, C. (2019). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science).

To create your dataset, complete the following steps:

  1. Create an Amazon Simple Storage Service (Amazon S3) bucket.

For this post, we create an S3 bucket called plan-leaf-disease-data.

  2. Create three folders inside this bucket called Bacterial-Leaf-Blight, Brown-Spot, and Leaf-Smut to store the images of each disease category.
  3. Upload each category of image files to its respective folder.
  4. On the Amazon Rekognition console, under Datasets, choose Create dataset.
  5. Select Import images from Amazon S3 bucket.
  6. For S3 folder location, enter the S3 bucket path.
  7. For automatic labeling, select Automatically attach a label to my images based on the folder they’re stored in.

This labels each image with the name of the folder it’s stored in.

You can now see the generated S3 bucket permissions policy.

  8. Copy the JSON policy.
  9. Navigate to the S3 bucket.
  10. On the Permissions tab, under Bucket policy, choose Edit.
  11. Enter the JSON policy you copied.
  12. Choose Save changes.
  13. Choose Submit.

You can see that image labeling is organized based on the folder name.
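If you’d rather stage the images from a script, the following boto3 sketch uploads one local folder per disease category into matching S3 prefixes, so the folder-based automatic labeling works as described above. The local ./leaf-images directory layout is an assumption for illustration.

```python
import boto3
from pathlib import Path

s3 = boto3.client("s3")
bucket = "plan-leaf-disease-data"  # bucket name used in this post
labels = ["Bacterial-Leaf-Blight", "Brown-Spot", "Leaf-Smut"]

# Hypothetical local layout: ./leaf-images/<label>/<image>.jpg
local_root = Path("leaf-images")

for label in labels:
    for image_path in (local_root / label).glob("*.jpg"):
        # The S3 folder (prefix) name becomes the image label during dataset import.
        s3.upload_file(str(image_path), bucket, f"{label}/{image_path.name}")
        print(f"Uploaded {image_path.name} to s3://{bucket}/{label}/")
```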

Training your model

After you label your images, you’re ready to train your model.

  1. Choose Train Model.
  2. For Choose project, choose your project plant-leaf-disease-detection.
  3. For Choose training dataset, choose your dataset plant-leaf-disease-dataset.

As part of model training, Amazon Rekognition Custom Labels requires a labeled test dataset. Amazon Rekognition Custom Labels uses the test dataset to verify how well your trained model predicts the correct labels and to generate evaluation metrics. Images in the test dataset aren’t used to train your model and should represent the same types of images you plan to analyze with your model.

  4. For Create test set, select how you want to create your test dataset.

Amazon Rekognition Custom Labels provides three options:

  • Choose an existing test dataset
  • Create a new test dataset
  • Split training dataset

For this post, we select Split training dataset and let Amazon Rekognition hold back 20% of the images for testing and use the remaining 80% of the images to train the model.

Our model took approximately 1 hour to train. The training time required for your model depends on many factors, including the number of images provided in the dataset and the complexity of the model.
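Training can also be started through the API instead of the console. The following boto3 sketch shows one way to do it, assuming you have a SageMaker Ground Truth–format manifest for your labeled images; the project ARN, manifest location, and version name below are placeholders. Setting TestingData with AutoCreate mirrors the training-dataset split used in this post.

```python
import time
import boto3

rekognition = boto3.client("rekognition")

# Placeholders -- substitute your own project ARN, bucket, and manifest key.
project_arn = "<your-project-arn>"
version_name = "plant-leaf-disease-detection.v1"

rekognition.create_project_version(
    ProjectArn=project_arn,
    VersionName=version_name,
    OutputConfig={"S3Bucket": "plan-leaf-disease-data", "S3KeyPrefix": "training-output"},
    TrainingData={"Assets": [{"GroundTruthManifest": {"S3Object": {
        "Bucket": "plan-leaf-disease-data", "Name": "manifests/train.manifest"}}}]},
    TestingData={"AutoCreate": True},  # hold back part of the training data for testing
)

# Poll until training finishes; this can take an hour or more.
while True:
    description = rekognition.describe_project_versions(
        ProjectArn=project_arn, VersionNames=[version_name]
    )["ProjectVersionDescriptions"][0]
    if description["Status"] != "TRAINING_IN_PROGRESS":
        break
    time.sleep(120)

print("Training status:", description["Status"])
```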

When training is complete, Amazon Rekognition Custom Labels outputs key quality metrics, including F1 score, precision, recall, and the assumed threshold for each label. For more information about metrics, see Metrics for Evaluating Your Model.

Our evaluation results show that our model has a precision of 1.0 for Bacterial-Leaf-Blight and Brown-Spot, which means that no objects were mistakenly identified (false positives) in our test set. Our model also didn’t miss any objects in our test set (false negatives), which is reflected in our recall score of 1.0. You can often use the F1 score as an overall quality score because it takes both precision and recall into account. Finally, we see that the assumed thresholds used to generate the F1 score, precision, and recall metrics are 0.62, 0.69, and 0.54 for Bacterial-Leaf-Blight, Brown-Spot, and Leaf-Smut, respectively. By default, our model returns predictions above the assumed threshold for each label.

We can also choose View test results to see how our model performed on each test image. The following screenshot shows an example of a correctly identified image of bacterial leaf blight during the model testing (true positive).
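You can also read the headline metrics programmatically. The sketch below continues from the hypothetical training example above (same placeholder project ARN and version name): the overall F1 score is returned directly by the API, while the per-label precision, recall, and assumed thresholds are written to an evaluation summary file in S3.

```python
import boto3

rekognition = boto3.client("rekognition")

# Placeholders -- same project and version name as in the training sketch.
project_arn = "<your-project-arn>"
version_name = "plant-leaf-disease-detection.v1"

description = rekognition.describe_project_versions(
    ProjectArn=project_arn, VersionNames=[version_name]
)["ProjectVersionDescriptions"][0]

evaluation = description["EvaluationResult"]
print("Overall F1 score:", evaluation["F1Score"])
# Per-label metrics are in the evaluation summary file the service writes to S3.
print("Summary file:", evaluation["Summary"]["S3Object"])
```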

Testing your model

Your plant disease detection model is now ready for use. Amazon Rekognition Custom Labels provides the API calls for starting, using, and stopping your model; you don’t need to manage any infrastructure. For more information, see Starting or Stopping an Amazon Rekognition Custom Labels Model (Console).
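As a sketch of that lifecycle with boto3 (the ARNs and version name below are placeholders), you start the model, wait for it to reach the RUNNING state, run your inference calls, and stop it when you’re done:

```python
import time
import boto3

rekognition = boto3.client("rekognition")

# Placeholders -- substitute your own ARNs and version name.
project_arn = "<your-project-arn>"
model_arn = "<your-model-version-arn>"
version_name = "plant-leaf-disease-detection.v1"

# Start the model with one inference unit; you're billed while it runs.
rekognition.start_project_version(ProjectVersionArn=model_arn, MinInferenceUnits=1)

# Wait until the model is RUNNING before sending inference requests.
while True:
    status = rekognition.describe_project_versions(
        ProjectArn=project_arn, VersionNames=[version_name]
    )["ProjectVersionDescriptions"][0]["Status"]
    if status == "RUNNING":
        break
    time.sleep(30)

# ... call DetectCustomLabels here (see the inference sketch later in this post) ...

# Stop the model when you're done to avoid further inference charges.
rekognition.stop_project_version(ProjectVersionArn=model_arn)
```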

In addition to using the API, you can also use the Custom Labels Demonstration. This CloudFormation template enables you to set up a custom, password-protected UI where you can start and stop your models and run demonstration inferences.

Once deployed, you can access the application in a web browser at the address specified in the url output of the CloudFormation stack created when you deployed the solution.

  1. Choose Start the model.

  2. Provide the number of inference units required. For this example, we use a value of 1.

You’re charged for the amount of time, in minutes, that the model is running. For more information, see Inference hours.

It might take a while to start.

  3. Choose the model name.

  4. Choose Upload.

A window opens for you to choose the plant leaf image from your local drive.

The model detects the disease in the uploaded leaf image and returns a confidence score. It also gives a pest control recommendation based on the type of disease.
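A minimal inference sketch with boto3 might look like the following. The local test image, the MinConfidence value, and the recommendation lookup table are assumptions for illustration; the actual pest control text would come from your own agronomy guidance.

```python
import boto3

rekognition = boto3.client("rekognition")
model_arn = "<your-model-version-arn>"  # placeholder

# Hypothetical lookup table mapping each detected disease to a recommendation.
RECOMMENDATIONS = {
    "Bacterial-Leaf-Blight": "<pest control recommendation for bacterial leaf blight>",
    "Brown-Spot": "<pest control recommendation for brown spots>",
    "Leaf-Smut": "<pest control recommendation for leaf smut>",
}

# Read a local test image and send the raw bytes to the running model.
with open("test-leaf.jpg", "rb") as image_file:
    response = rekognition.detect_custom_labels(
        ProjectVersionArn=model_arn,
        Image={"Bytes": image_file.read()},
        MinConfidence=50,
    )

for label in response["CustomLabels"]:
    print(f"{label['Name']}: {label['Confidence']:.1f}% confidence")
    print("Recommendation:", RECOMMENDATIONS.get(label["Name"], "No recommendation available"))
```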

Cleaning up

To avoid incurring unnecessary charges, delete the resources used in this walkthrough when they’re no longer needed: stop the model, delete the Amazon Rekognition Custom Labels project, and delete the S3 bucket and its contents.
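If you scripted the earlier steps, a cleanup sketch like the following removes everything created in this walkthrough; the ARNs and bucket name are placeholders, and it assumes the project has a single model version. Note that a model version can only be deleted once it has reached the STOPPED state, and a project can only be deleted once all of its versions are gone.

```python
import time
import boto3

rekognition = boto3.client("rekognition")
s3 = boto3.resource("s3")

# Placeholders -- substitute your own ARNs and bucket name.
model_arn = "<your-model-version-arn>"
project_arn = "<your-project-arn>"

# Stop the running model and wait for it to reach STOPPED.
rekognition.stop_project_version(ProjectVersionArn=model_arn)
while rekognition.describe_project_versions(ProjectArn=project_arn)[
    "ProjectVersionDescriptions"][0]["Status"] != "STOPPED":
    time.sleep(30)

# Delete the model version, then the project.
rekognition.delete_project_version(ProjectVersionArn=model_arn)
rekognition.delete_project(ProjectArn=project_arn)

# Empty and delete the S3 bucket that held the dataset.
bucket = s3.Bucket("plan-leaf-disease-data")
bucket.objects.all().delete()
bucket.delete()
```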

Conclusion

In this post, we showed you how to build a custom model that detects plant leaf diseases with Amazon Rekognition Custom Labels. The feature makes it easy to train a custom computer vision model simply by bringing labeled images, without needing ML expertise.

For more information about using custom labels, see What Is Amazon Rekognition Custom Labels?


About the Authors

Dhiraj Thakur is a Solutions Architect with Amazon Web Services. He works with AWS customers and partners to provide guidance on enterprise cloud adoption, migration, and strategy. He is passionate about technology and enjoys building and experimenting in the analytics and AI/ML space.

 

 

Sameer Goel is a Solutions Architect in Seattle, who drives customer success by building prototypes on cutting-edge initiatives. Prior to joining AWS, Sameer graduated with a master’s degree from NEU Boston, with a concentration in data science. He enjoys building and experimenting with AI/ML projects on Raspberry Pi.
