Build a computer vision model using Amazon Rekognition Custom Labels and compare the results with a custom trained TensorFlow model

Building accurate computer vision models to detect objects in images requires deep knowledge of each step in the process—from labeling, processing, and preparing the training and validation data, to making the right model choice and tuning the model’s hyperparameters adequately to achieve the maximum accuracy. Fortunately, these complex steps are simplified by Amazon Rekognition Custom Labels, a service of Amazon Rekognition that enables you to build your own custom computer vision models for image classification and object detection tasks without requiring any prior computer vision expertise or advanced programming skills.

In this post, we showcase how we can train a model to detect bees in images using Amazon Rekognition Custom Labels. We also compare these results against a custom-trained TensorFlow model (DIY model). We use Amazon SageMaker as the platform to develop and train our model. Finally, we demonstrate how to build a serverless architecture to process new images using Amazon Rekognition APIs.

When and where to use each model

Before diving deeper, it is important to understand the use cases that drive the decision of which model to use, whether it’s an Amazon Rekognition Custom Labels model or a DIY model.

Amazon Rekognition Custom Labels models are a great choice when our desired goal is to achieve maximum quality results in our task quickly. These models are heavily optimized and fine-tuned to perform at a high accuracy and recall. This is a cloud service, so when the model is trained, images must be uploaded to the cloud to be analyzed. A great advantage of this service is that the user doesn’t need to have expertise to run this training pipeline. You can do it on the AWS Management Console with just a few clicks, and it takes care of the heavy lifting of training and fine-tuning the model for you. Then, a simple set of API calls is offered, tailored to this specific model, for you to apply when needed.

DIY models are the choice for advanced users with expertise in machine learning (ML). They allow you to control every aspect of the model, and tune the training data and the necessary parameters as needed. This requires advanced coding skills. These models trade off accuracy for latency: you can run them faster at the expense of lower qualitative performance. This lower latency fits really well in low bandwidth scenarios where the model needs to be deployed on the edge. For instance, IoT devices that support these models can host and run them and only upload the inference results to the cloud, which reduces the amount of data sent upstream.

Overview of solution

To build our DIY model, we follow the solution from the GitHub repo TensorFlow 2 Object Detection API SageMaker, which consists of these steps:

  1. Download and prepare our bee dataset.
  2. Train the model using a SageMaker custom container instance.
  3. Test the model using a SageMaker model endpoint.

After we have our DIY model, we can proceed with the steps to build our bee detection model using Amazon Rekognition Custom Labels:

  1. Deploy a serverless architecture using AWS CloudFormation.
  2. Download and prepare our bee dataset.
  3. Create a project in Amazon Rekognition Custom Labels and import the dataset.
  4. Train the Amazon Rekognition Custom Labels model.
  5. Test the Amazon Rekognition Custom Labels model using the automatically generated API endpoint using Amazon Simple Storage Service (Amazon S3) events.

Amazon Rekognition Custom Labels lets you manage the ML model training process on the Amazon Rekognition console, which simplifies the end-to-end process. After we train both models, we can compare them.

Set up the environment

We prepare our serverless environment using the CloudFormation template on GitHub. On the AWS CloudFormation console, we create a new stack and use the template.yaml file present in the root folder of our code repository. We provide a unique Amazon Simple Storage Service (Amazon S3) bucket name when prompted, where our images are downloaded for further processing. We also provide a name for the inference processing Amazon Simple Queue Service (Amazon SQS) queue, as well as an AWS Key Management Service (AWS KMS) alias to securely encrypt the inference pipeline.

The architecture diagram is as follows, and it is used for detecting objects in new images as they are uploaded to our bucket.

Following the first notebook (1_prepare_data), we download and store our images in a bucket in Amazon S3. The dataset is already curated and annotated, and the images used have been licensed under CC0. For convenience, the dataset is stored in a single .zip archive: dataset.zip.

Inside the dataset folder, the manifest file output.manifest contains the bounding box annotations of the dataset. The Amazon S3 references of these images belong to a different S3 bucket where the images were annotated originally. To import this manifest in Amazon Rekognition Custom Labels, the notebook rewrites the manifest file according to the bucket name we chose.

Train your DIY model

To establish a comparison between a DIY and Amazon Rekognition Custom Labels model, we follow the steps in the following public repository that demonstrates how to train a TensorFlow2 model using the same dataset.

We follow the steps described in this repository to train an EfficientNet object detector using our bee dataset. We modify the training notebook so that it runs for 10,000 steps. The model trains for about 2 hours, achieving an average precision of 83% and a recall of 56%.

Create your Amazon Rekognition Custom Labels project

To create your bee detection project, complete the following steps:

  1. On the Amazon Rekognition console, choose Amazon Rekognition Custom Labels.
  2. Choose Get Started.
  3. For Project name, enter bee-detection.
  4. Choose Create project.

Import your dataset

We created a manifest using the first notebook (1_prepare_data) that contains the Amazon S3 URIs of our image annotations. We follow these steps to import our manifest into Amazon Rekognition Custom Labels:

  1. On the Amazon Rekognition Custom Labels console, choose Create dataset.
  2. Select Import images labeled by Amazon SageMaker Ground Truth.
  3. Name your dataset (for example, bee_dataset).
  4. Enter the Amazon S3 URI of the manifest file that we created.
  5. Copy the bucket policy that appears on the console.
  6. Open the Amazon S3 console in a new tab and access the bucket where the images are stored.
  7. On the Permissions tab, enter the bucket policy to allow access of the dataset by Amazon Rekognition Custom Labels.
  8. Go back to the dataset creation console and choose Submit.

Train your model

After the dataset is imported into Amazon Rekognition Custom Labels, we can train a model immediately.

  1. Choose Train Model from the dataset page.
  2. For Choose project, choose your bee-detection project.
  3. For Choose training dataset, choose your bee_dataset dataset.

As part of model training, Amazon Rekognition Custom Labels requires a labeled test dataset to validate the model training. Amazon Rekognition Custom Labels uses the test dataset to verify how well your trained model predicts the correct labels and to generate evaluation metrics. Images in the test dataset are not used to train your model and should represent the same types of images you use your model to analyze.

  1. For Create test set, select how you want to provide your test dataset.

Amazon Rekognition Custom Labels provides three options:

  • Choose an existing test dataset
  • Create a new test dataset
  • Split training dataset

For this post, we choose to split our training dataset, which sets aside 20% of our dataset for testing the model.

  1. Select Split training dataset.
  2. Choose Train.

Our model took approximately 1.5 hours to train. The model achieved an average precision of 99% with a recall of 90% on the test data. The training time required for your model depends on many factors, including the number of images provided in the dataset and the complexity of the model. When training is complete, Amazon Rekognition Custom Labels outputs key quality metrics including F1 score, precision, recall, and the assumed threshold for each label. For more information about metrics, see Metrics for evaluating your model.

Serverless inference architecture

After our model is trained, Amazon Rekognition Custom Labels provides the API calls for starting, using, and stopping your model. In the environment setup section, we set up a serverless architecture to process test images that are uploaded to our S3 bucket via Amazon S3 events. It uses an AWS Lambda function to call the inference API, and manages these API calls using Amazon SQS.

We’re ready now to start applying our trained model to new images. We first need to start the project model version via the Amazon Rekognition Custom Labels console.

We take note of our model’s ARN and update the Lambda function bee-detection-inference with it. This indicates which endpoint we must invoke to retrieve the object detection results. We can also change the assumed threshold to accept or reject results with a low confidence score.

Now it’s time to start uploading our test images to our S3 bucket prefix (s3://your-bucket/test_images). We can either use the Amazon S3 console or the AWS Command Line Interface (AWS CLI). We choose some test images present in our bee detection dataset and upload them using the console. As the images are uploaded, they’re queued in Amazon SQS and then processed by our Lambda function, leaving the result with the same file name, plus the .json suffix.

We visualize the results of the JSON response from our Amazon Rekognition Custom Labels model using the second notebook (2_visualize_images). The following is an example of a response output:

{'CustomLabels': [{'Name': 'bee',
   			   'Confidence': 99.9679946899414,
   			   'Geometry': {'BoundingBox': {'Width': 0.17472000420093536,
     'Height': 0.23267999291419983,
     'Left': 0.34907999634742737,
     'Top': 0.36125999689102173}}}],
 'ResponseMetadata': {'RequestId': '4f98fdc8-a7d3-4251-b21e-484baf958efb',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'content-type': 'application/x-amz-json-1.1',
   'date': 'Thu, 11 Mar 2021 15:23:39 GMT',
   'x-amzn-requestid': '4f98fdc8-a7d3-4251-b21e-484baf958efb',
   'content-length': '202',
   'connection': 'keep-alive'},
  'RetryAttempts': 0}}

This bee is detected with a confidence of 99.97%

In the following image on the left, we find six bees over 99.4% confidence, which is our optimal threshold. The image on the right shows the same result with a threshold of 90% (15 bees).

Clean up

When you’re done, remember to follow these steps to avoid incurring in unnecessary charges:

  1. Stop the model version on the Amazon Rekognition Custom Labels console.
  2. Empty the S3 bucket that was created where images were uploaded.
  3. Delete the CloudFormation stack to remove all provisioned resources.

Comparison with a custom DIY model

The performance of our Amazon Rekognition Custom Labels model is quantitatively better than our DIY model, achieving almost perfect precision (99%). It is noticeable how it’s also able to prevent false negatives, yielding a very robust recall of 90%, smashing the 56% recall of our DIY model. This is partly due to the optimized tuning that Amazon Rekognition Custom Labels applies to the model, and the assumed thresholds that it yields after training to achieve the best performance at test time.

For the first example, our single bee is detected at a much lower confidence score (64%), and with a rather large bounding box that doesn’t reflect accurately the size of the bee.

For the more challenging picture, we must lower our threshold to 81% to find the very first detection (left), and lower it even more to 50% to find 7 bees (right).

Playing with this threshold can be risky. Setting a very low threshold can detect more bees (better recall), but at the same time find false detections, lowering our model precision. However, Amazon Rekognition Custom Labels can detect bees with a much higher confidence, which allows us to set a higher threshold for a much better overall performance.

Conclusion

In this post, we showed you how to create a computer vision object detection model with Amazon Rekognition Custom Labels using annotated data, and compared the results with a custom DIY model. Amazon Rekognition Custom Labels brings a great advantage over using your own models. Amazon Rekognition Custom Labels enables you to build and optimize your own specialized computer vision models to detect unique objects without the need of advanced programming knowledge.

With more experiments with other model architectures and hyperparameters, an ML scientist can improve the DIY model we tested in this post. The Amazon Rekognition Custom Labels value proposition is that it does these experiments on your behalf, thereby reducing the time to get a usable model and its development costs. Finally, we also showed how to set up a minimal serverless architecture to process new images using our trained model.

For more information about using custom labels, see What Is Amazon Rekognition Custom Labels?


About the Author

Raúl Díaz García is a Sr Data Scientist in the EMEA SDT IoT Team. Raúl works with customers across the EMEA region, where he helps them enable solutions related to Computer Vision and Machine Learning in the IoT space.

Read More

Training Machine Learning Models More Efficiently with Dataset Distillation

For a machine learning (ML) algorithm to be effective, useful features must be extracted from (often) large amounts of training data. However, this process can be made challenging due to the costs associated with training on such large datasets, both in terms of compute requirements and wall clock time. The idea of distillation plays an important role in these situations by reducing the resources required for the model to be effective. The most widely known form of distillation is model distillation (a.k.a. knowledge distillation), where the predictions of large, complex teacher models are distilled into smaller models.

An alternative option to this model-space approach is dataset distillation [1, 2], in which a large dataset is distilled into a synthetic, smaller dataset. Training a model with such a distilled dataset can reduce the required memory and compute. For example, instead of using all 50,000 images and labels of the CIFAR-10 dataset, one could use a distilled dataset consisting of only 10 synthesized data points (1 image per class) to train an ML model that can still achieve good performance on the unseen test set.

Top: Natural (i.e., unmodified) CIFAR-10 images. Bottom: Distilled dataset (1 image per class) on CIFAR-10 classification task. Using only these 10 synthetic images as training data, a model can achieve test set accuracy of ~51%.

In “Dataset Meta-Learning from Kernel Ridge Regression”, published in ICLR 2021, and “Dataset Distillation with Infinitely Wide Convolutional Networks”, presented at NeurIPS 2021, we introduce two novel dataset distillation algorithms, Kernel Inducing Points (KIP) and Label Solve (LS), which optimize datasets using the loss function arising from kernel regression (a classical machine learning algorithm that fits a linear model to features defined through a kernel). Applying the KIP and LS algorithms, we obtain very efficient distilled datasets for image classification, reducing the datasets to 1, 10, or 50 data points per class while still obtaining state-of-the-art results on a number of benchmark image classification datasets. Additionally, we are also excited to release our distilled datasets to benefit the wider research community.

Methodology
One of the key theoretical insights of deep neural networks (DNN) in recent years has been that increasing the width of DNNs results in more regular behavior that makes them easier to understand. As the width is taken to infinity, DNNs trained by gradient descent converge to the familiar and simpler class of models arising from kernel regression with respect to the neural tangent kernel (NTK), a kernel that measures input similarity by computing dot products of gradients of the neural network. Thanks to the Neural Tangents library, neural kernels for various DNN architectures can be computed in a scalable manner.

We utilized the above infinite-width limit theory of neural networks to tackle dataset distillation. Dataset distillation can be formulated as a two-stage optimization process: an “inner loop” that trains a model on learned data, and an “outer loop” that optimizes the learned data for performance on natural (i.e., unmodified) data. The infinite-width limit replaces the inner loop of training a finite-width neural network with a simple kernel regression. With the addition of a regularizing term, the kernel regression becomes a kernel ridge-regression (KRR) problem. This is a highly valuable outcome because the kernel ridge regressor (i.e., the predictor from the algorithm) has an explicit formula in terms of its training data (unlike a neural network predictor), which means that one can easily optimize the KRR loss function during the outer loop.

The original data labels can be represented by one-hot vectors, i.e., the true label is given a value of 1 and all other labels are given values of 0. Thus, an image of a cat would have the label “cat” assigned a 1 value, while the labels for “dog” and “horse” would be 0. The labels we use involve a subsequent mean-centering step, where we subtract the reciprocal of the number of classes from each component (so 0.1 for 10-way classification) so that the expected value of each label component across the dataset is normalized to zero.

While the labels for natural images appear in this standard form, the labels for our learned distilled datasets are free to be optimized for performance. Having obtained the kernel ridge regressor from the inner loop, the KRR loss function in the outer loop computes the mean-square error between the original labels of natural images and the labels predicted by the kernel ridge regressor. KIP optimizes the support data (images and possibly labels) by minimizing the KRR loss function through gradient-based methods. The Label Solve algorithm directly solves for the set of support labels that minimizes the KRR loss function, generating a unique dense label vector for each (natural) support image.

Example of labels obtained by label solving. Left and Middle: Sample images with possible labels listed below. The raw, one-hot label is shown in blue and the final LS generated dense label is shown in orange. Right: The covariance matrix between original labels and learned labels. Here, 500 labels were distilled from the CIFAR-10 dataset. A test accuracy of 69.7% is achieved using these labels for kernel ridge-regression.

Distributed Computation
For simplicity, we focus on architectures that consist of convolutional neural networks with pooling layers. Specifically, we focus on the so-called “ConvNet” architecture and its variants because it has been featured in other dataset distillation studies. We used a slightly modified version of ConvNet that has a simple architecture given by three blocks of convolution, ReLu, and 2×2 average pooling and then a final linear readout layer, with an additional 3×3 convolution and ReLu layer prepended (see our GitHub for precise details).

ConvNet architecture used in DC/DSA. Ours has an additional 3×3 Conv and ReLu prepended.

To compute the neural kernels needed in our work, we used the Neural Tangents library.

The first stage of this work, in which we applied KRR, focused on fully-connected networks, whose kernel elements are cheap to compute. But a hurdle facing neural kernels for models with convolutional layers plus pooling is that the computation of each kernel element between two images scales as the square of the number of input pixels (due to the capturing of pixel-pixel correlations by the kernel). So, for the second stage of this work, we needed to distribute the computation of the kernel elements and their gradients across many devices.

Distributed computation for large scale metalearning.

We invoke a client-server model of distributed computation in which a server distributes independent workloads to a large pool of client workers. A key part of this is to divide the backpropagation step in a way that is computationally efficient (explained in detail in the paper).

We accomplish this using the open-source tools Courier (part of DeepMind’s Launchpad), which allows us to distribute computations across GPUs working in parallel, and JAX, for which novel usage of the jax.vjp function enables computationally efficient gradients. This distributed framework allows us to utilize hundreds of GPUs per distillation of the dataset, for both the KIP and LS algorithms. Given the compute required for such experiments, we are releasing our distilled datasets to benefit the wider research community.

Examples
Our first set of distilled images above used KIP to distill CIFAR-10 down to 1 image per class while keeping the labels fixed. Next, in the below figure, we compare the test accuracy of training on natural MNIST images, KIP distilled images with labels fixed, and KIP distilled images with labels optimized. We highlight that learning the labels provides an effective, albeit mysterious benefit to distilling datasets. Indeed the resulting set of images provides the best test performance (for infinite-width networks) despite being less interpretable.

MNIST dataset distillation with trainable and non-trainable labels. Top: Natural MNIST data. Middle: Kernel Inducing Point distilled data with fixed labels. Bottom: Kernel Inducing Point distilled data with learned labels.

Results
Our distilled datasets achieve state-of-the-art performance on benchmark image classification datasets, improving performance beyond previous state-of-the-art models that used convolutional architectures, Dataset Condensation (DC) and Dataset Condensation with Differentiable Siamese Augmentation (DSA). In particular, for CIFAR-10 classification tasks, a model trained on a dataset consisting of only 10 distilled data entries (1 image / class, 0.02% of the whole dataset) achieves a 64% test set accuracy. Here, learning labels and an additional image preprocessing step leads to a significant increase in performance beyond the 50% test accuracy shown in our first figure (see our paper for details). With 500 images (50 images / class, 1% of the whole dataset), the model reaches 80% test set accuracy. While these numbers are with respect to neural kernels (using the KRR infinite width limit), these distilled datasets can be used to train finite-width neural networks as well. In particular, for 10 data points on CIFAR-10, a finite-width ConvNet neural network achieves 50% test accuracy with 10 images and 68% test accuracy using 500 images, which are still state-of-the-art results. We provide a simple Colab notebook demonstrating this transfer to a finite-width neural network.

Dataset distillation using Kernel Inducing Points (KIP) with a convolutional architecture outperforms prior state-of-the-art models (DC/DSA) on all benchmark settings on image classification tasks. Label Solve (LS, middle columns) while only distilling information in the labels could often (e.g. CIFAR-10 10, 50 data points per class) outperform prior state-of-the-art models as well.

In some cases, our learned datasets are more effective than a natural dataset one hundred times larger in size.

Conclusion
We believe that our work on dataset distillation opens up many interesting future directions. For instance, our algorithms KIP and LS have demonstrated the effectiveness of using learned labels, an area that remains relatively underexplored. Furthermore, we expect that utilizing efficient kernel approximation methods can help to reduce computational burden and scale up to larger datasets. We hope this work encourages researchers to explore other applications of dataset distillation, including neural architecture search and continual learning, and even potential applications to privacy.

Anyone interested in the KIP and LS learned datasets for further analysis is encouraged to check out our papers [ICLR 2021, NeurIPS 2021] and open-sourced code and datasets available on Github.

Acknowledgement
This project was done in collaboration with Zhourong Chen, Roman Novak and Lechao Xiao. We would like to acknowledge special thanks to Samuel S. Schoenholz, who proposed and helped develop the overall strategy for our distributed KIP learning methodology.


1Now at DeepMind.  

Read More

‘AI 2041: Ten Visions for Our Future’: AI Pioneer Kai-Fu Lee Discusses His New Work of Fiction

One of AI’s greatest champions has turned to fiction to answer the question: how will technology shape our world in the next 20 years?

Kai-Fu Lee, CEO of Sinovation Ventures and a former president of Google China, spoke with NVIDIA AI Podcast host Noah Kravitz about AI 2041: Ten Visions for Our Future. The book, his fourth available in the U.S. and first work of fiction, was in collaboration with Chinese sci-fi writer Chen Qiufan, also known as Stanley Chan.

Lee and Chan blend their expertise in scientific forecasting and speculative fiction in this collection of short stories, which was published in September.

Among Lee’s books is the New York Times bestseller AI Superpowers: China, Silicon Valley, and the New World Order, which he spoke about on a 2018 episode of the AI Podcast.

Key Points From This Episode:

  • Each of AI 2041‘s stories takes place in a different country and tackles various AI-related topics. For example, one story follows a teenage girl in Mumbai who rebels when AI gets in the way of her romantic endeavors. Another is about virtual teachers in Seoul who offer orphaned twins new ways to learn and connect.
  • Lee added written commentaries to go along with each story, covering what real technologies are used in it, how those technologies work, as well as their potential upsides and downsides.

Tweetables:

As AI still seems to be an intimidating topic for many, “I wanted to see if we could create a new, innovative piece of work that is not only accessible, but also engaging and entertaining for more people.” — Kai-Fu Lee [1:48]

“By the end of the 10th story, [readers] will have taken their first lesson in AI.” — Kai-Fu Lee [2:02]

You Might Also Like:

Investor, AI Pioneer Kai-Fu Lee on the Future of AI in the US, China

Kai-Fu Lee talks about his book, AI Superpowers: China, Silicon Valley, and the New World Order, which ranked No. 6 on the New York Times Business Books bestsellers list.

Real or Not Real? Attorney Steven Frank Uses Deep Learning to Authenticate Art

Steven Frank is a partner at the law firm Morgan Lewis, specializing in intellectual property and commercial technology law. He talks about working with his wife, Andrea Frank, a professional curator of art images, to authenticate artistic masterpieces with AI’s help.

Author Cade Metz Talks About His New Book “Genius Makers”

Call it Moneyball for AI. In his book, Genius Makers, the New York Times writer Cade Metz tells the funny, inspiring — and ultimately triumphant — tale of how a dogged group of AI researchers bet their careers on the long-dismissed technology of deep learning.

Subscribe to the AI Podcast

Get the AI Podcast through iTunes, Google Podcasts, Google Play, Castbox, DoggCatcher, Overcast, PlayerFM, Pocket Casts, Podbay, PodBean, PodCruncher, PodKicker, Soundcloud, Spotify, Stitcher and TuneIn. If your favorite isn’t listed here, drop us a note.

Make the AI Podcast Better

Have a few minutes to spare? Fill out this listener survey. Your answers will help us make a better podcast.

The post ‘AI 2041: Ten Visions for Our Future’: AI Pioneer Kai-Fu Lee Discusses His New Work of Fiction appeared first on The Official NVIDIA Blog.

Read More

The Unsupervised Reinforcement Learning Benchmark

img0.png

The shortcomings of supervised RL

Reinforcement Learning (RL) is a powerful paradigm for solving many problems of interest in AI, such as controlling autonomous vehicles, digital assistants, and resource allocation to name a few. We’ve seen over the last five years that, when provided with an extrinsic reward function, RL agents can master very complex tasks like playing Go, Starcraft, and dextrous robotic manipulation. While large-scale RL agents can achieve stunning results, even the best RL agents today are narrow. Most RL algorithms today can only solve the single task they were trained on and do not exhibit cross-task or cross-domain generalization capabilities.

A side-effect of the narrowness of today’s RL systems is that today’s RL agents are also very data inefficient. If we were to train AlphaGo-like agents on many tasks each agent would likely require billions of training steps because today’s RL agents don’t have the capabilities to reuse prior knowledge to solve new tasks more efficiently. RL as we know it is supervised – agents overfit to a specific extrinsic reward which limits their ability to generalize.

Efficient PyTorch: Tensor Memory Format Matters

Ensuring the right memory format for your inputs can significantly impact the running time of your PyTorch vision models. When in doubt, choose a Channels Last memory format.

When dealing with vision models in PyTorch that accept multimedia (for example image Tensorts) as input, the Tensor’s memory format can significantly impact the inference execution speed of your model on mobile platforms when using the CPU backend along with XNNPACK. This holds true for training and inference on server platforms as well, but latency is particularly critical for mobile devices and users.

Outline of this article

  1. Deep Dive into matrix storage/memory representation in C++. Introduction to Row and Column major order.
  2. Impact of looping over a matrix in the same or different order as the storage representation, along with an example.
  3. Introduction to Cachegrind; a tool to inspect the cache friendliness of your code.
  4. Memory formats supported by PyTorch Operators.
  5. Best practices example to ensure efficient model execution with XNNPACK optimizations

Matrix Storage Representation in C++

Images are fed into PyTorch ML models as multi-dimensional Tensors. These Tensors have specific memory formats. To understand this concept better, let’s take a look at how a 2-d matrix may be stored in memory.

Broadly speaking, there are 2 main ways of efficiently storing multi-dimensional data in memory.

  1. Row Major Order: In this format, the matrix is stored in row order, with each row stored before the next row in memory. I.e. row N comes before row N+1.
  2. Column Major Order: In this format, the matrix is stored in column-order, with each column stored before the next column in memory. I.e. column N comes before column N+1.

You can see the differences graphically below.

C++ stores multi-dimensional data in row-major format.

C++ stores multi-dimensional data in row-major format.

Efficiently accessing elements of a 2d matrix

Similar to the storage format, there are 2 ways to access data in a 2d matrix.

  1. Loop Over Rows first: All elements of a row are processed before any element of the next row.
  2. Loop Over Columns first: All elements of a column are processed before any element of the next column.

For maximum efficiency, one should always access data in the same format in which it is stored. I.e. if the data is stored in row-major order, then one should try to access it in that order.

The code below (main.cpp) shows 2 ways of accessing all the elements of a 2d 4000×4000 matrix.

#include <iostream>
#include <chrono>

// loop1 accesses data in matrix 'a' in row major order,
// since i is the outer loop variable, and j is the
// inner loop variable.
int loop1(int a[4000][4000]) {
 int s = 0;
 for (int i = 0; i < 4000; ++i) {
   for (int j = 0; j < 4000; ++j) {
     s += a[i][j];
   }
 }
 return s;
}

// loop2 accesses data in matrix 'a' in column major order
// since j is the outer loop variable, and i is the
// inner loop variable.
int loop2(int a[4000][4000]) {
 int s = 0;
 for (int j = 0; j < 4000; ++j) {
   for (int i = 0; i < 4000; ++i) {
     s += a[i][j];
   }
 }
 return s;
}

int main() {
 static int a[4000][4000] = {0};
 for (int i = 0; i < 100; ++i) {
   int x = rand() % 4000;
   int y = rand() % 4000;
   a[x][y] = rand() % 1000;
 }

 auto start = std::chrono::high_resolution_clock::now();
 auto end = start;
 int s = 0;

#if defined RUN_LOOP1
 start = std::chrono::high_resolution_clock::now();

 s = 0;
 for (int i = 0; i < 10; ++i) {
   s += loop1(a);
   s = s % 100;
 }
 end = std::chrono::high_resolution_clock::now();

 std::cout << "s = " << s << std::endl;
 std::cout << "Time for loop1: "
   << std::chrono::duration<double, std::milli>(end - start).count()
   << "ms" << std::endl;
#endif

#if defined RUN_LOOP2
 start = std::chrono::high_resolution_clock::now();
 s = 0;
 for (int i = 0; i < 10; ++i) {
   s += loop2(a);
   s = s % 100;
 }
 end = std::chrono::high_resolution_clock::now();

 std::cout << "s = " << s << std::endl;
 std::cout << "Time for loop2: "
   << std::chrono::duration<double, std::milli>(end - start).count()
   << "ms" << std::endl;
#endif
}


Lets build and run this program and see what it prints.

g++ -O2 main.cpp -DRUN_LOOP1 -DRUN_LOOP2
./a.out


Prints the following:

s = 70
Time for loop1: 77.0687ms
s = 70
Time for loop2: 1219.49ms

loop1() is 15x faster than loop2(). Why is that? Let’s find out below!

Measure cache misses using Cachegrind

Cachegrind is a cache profiling tool used to see how many I1 (first level instruction), D1 (first level data), and LL (last level) cache misses your program caused.

Let’s build our program with just loop1() and just loop2() to see how cache friendly each of these functions is.

Build and run/profile just loop1()

g++ -O2 main.cpp -DRUN_LOOP1
valgrind --tool=cachegrind ./a.out

Prints:

==3299700==
==3299700== I   refs:      643,156,721
==3299700== I1  misses:          2,077
==3299700== LLi misses:          2,021
==3299700== I1  miss rate:        0.00%
==3299700== LLi miss rate:        0.00%
==3299700==
==3299700== D   refs:      160,952,192  (160,695,444 rd   + 256,748 wr)
==3299700== D1  misses:     10,021,300  ( 10,018,723 rd   +   2,577 wr)
==3299700== LLd misses:     10,010,916  ( 10,009,147 rd   +   1,769 wr)
==3299700== D1  miss rate:         6.2% (        6.2%     +     1.0%  )
==3299700== LLd miss rate:         6.2% (        6.2%     +     0.7%  )
==3299700==
==3299700== LL refs:        10,023,377  ( 10,020,800 rd   +   2,577 wr)
==3299700== LL misses:      10,012,937  ( 10,011,168 rd   +   1,769 wr)
==3299700== LL miss rate:          1.2% (        1.2%     +     0.7%  )

Build and run/profile just loop2()

g++ -O2 main.cpp -DRUN_LOOP2
valgrind --tool=cachegrind ./a.out

Prints:

==3300389==
==3300389== I   refs:      643,156,726
==3300389== I1  misses:          2,075
==3300389== LLi misses:          2,018
==3300389== I1  miss rate:        0.00%
==3300389== LLi miss rate:        0.00%
==3300389==
==3300389== D   refs:      160,952,196  (160,695,447 rd   + 256,749 wr)
==3300389== D1  misses:    160,021,290  (160,018,713 rd   +   2,577 wr)
==3300389== LLd misses:     10,014,907  ( 10,013,138 rd   +   1,769 wr)
==3300389== D1  miss rate:        99.4% (       99.6%     +     1.0%  )
==3300389== LLd miss rate:         6.2% (        6.2%     +     0.7%  )
==3300389==
==3300389== LL refs:       160,023,365  (160,020,788 rd   +   2,577 wr)
==3300389== LL misses:      10,016,925  ( 10,015,156 rd   +   1,769 wr)
==3300389== LL miss rate:          1.2% (        1.2%     +     0.7%  )

The main differences between the 2 runs are:

  1. D1 misses: 10M v/s 160M
  2. D1 miss rate: 6.2% v/s 99.4%

As you can see, loop2() causes many many more (~16x more) L1 data cache misses than loop1(). This is why loop1() is ~15x faster than loop2().

Memory Formats supported by PyTorch Operators

While PyTorch operators expect all tensors to be in Channels First (NCHW) dimension format, PyTorch operators support 3 output memory formats.

  1. Contiguous: Tensor memory is in the same order as the tensor’s dimensions.
  2. ChannelsLast: Irrespective of the dimension order, the 2d (image) tensor is laid out as an HWC or NHWC (N: batch, H: height, W: width, C: channels) tensor in memory. The dimensions could be permuted in any order.
  3. ChannelsLast3d: For 3d tensors (video tensors), the memory is laid out in THWC (Time, Height, Width, Channels) or NTHWC (N: batch, T: time, H: height, W: width, C: channels) format. The dimensions could be permuted in any order.

The reason that ChannelsLast is preferred for vision models is because XNNPACK (kernel acceleration library) used by PyTorch expects all inputs to be in Channels Last format, so if the input to the model isn’t channels last, then it must first be converted to channels last, which is an additional operation.

Additionally, most PyTorch operators preserve the input tensor’s memory format, so if the input is Channels First, then the operator needs to first convert to Channels Last, then perform the operation, and then convert back to Channels First.

When you combine it with the fact that accelerated operators work better with a channels last memory format, you’ll notice that having the operator return back a channels-last memory format is better for subsequent operator calls or you’ll end up having every operator convert to channels-last (should it be more efficient for that specific operator).

From the XNNPACK home page:

“All operators in XNNPACK support NHWC layout, but additionally allow custom stride along the Channel dimension”.

PyTorch Best Practice

The best way to get the most performance from your PyTorch vision models is to ensure that your input tensor is in a Channels Last memory format before it is fed into the model.

You can get even more speedups by optimizing your model to use the XNNPACK backend (by simply calling optimize_for_mobile() on your torchscripted model). Note that XNNPACK models will run slower if the inputs are contiguous, so definitely make sure it is in Channels-Last format.

Working example showing speedup

Run this example on Google Colab – note that runtimes on colab CPUs might not reflect accurate performance; it is recommended to run this code on your local machine.

import torch
from torch.utils.mobile_optimizer import optimize_for_mobile
import torch.backends.xnnpack
import time

print("XNNPACK is enabled: ", torch.backends.xnnpack.enabled, "n")

N, C, H, W = 1, 3, 200, 200
x = torch.rand(N, C, H, W)
print("Contiguous shape: ", x.shape)
print("Contiguous stride: ", x.stride())
print()

xcl = x.to(memory_format=torch.channels_last)
print("Channels-Last shape: ", xcl.shape)
print("Channels-Last stride: ", xcl.stride())

## Outputs:
 
# XNNPACK is enabled:  True
 
# Contiguous shape:  torch.Size([1, 3, 200, 200])
# Contiguous stride:  (120000, 40000, 200, 1)
 
# Channels-Last shape:  torch.Size([1, 3, 200, 200])
# Channels-Last stride:  (120000, 1, 600, 3)

The input shape stays the same for contiguous and channels-last formats. Internally however, the tensor’s layout has changed as you can see in the strides. Now, the number of jumps required to go across channels is only 1 (instead of 40000 in the contiguous tensor).
This better data locality means convolution layers can access all the channels for a given pixel much faster. Let’s see now how the memory format affects runtime:

from torchvision.models import resnet34, resnet50, resnet101

m = resnet34(pretrained=False)
# m = resnet50(pretrained=False)
# m = resnet101(pretrained=False)

def get_optimized_model(mm):
  mm = mm.eval()
  scripted = torch.jit.script(mm)
  optimized = optimize_for_mobile(scripted)  # explicitly call the xnnpack rewrite 
  return scripted, optimized


def compare_contiguous_CL(mm):
  # inference on contiguous
  start = time.perf_counter()
  for i in range(20):
    mm(x)
  end = time.perf_counter()
  print("Contiguous: ", end-start)

  # inference on channels-last
  start = time.perf_counter()
  for i in range(20):
    mm(xcl)
  end = time.perf_counter()
  print("Channels-Last: ", end-start)

with torch.inference_mode():
  scripted, optimized = get_optimized_model(m)

  print("Runtimes for torchscripted model: ")
  compare_contiguous_CL(scripted.eval())
  print()
  print("Runtimes for mobile-optimized model: ")
  compare_contiguous_CL(optimized.eval())

   
## Outputs (on an Intel Core i9 CPU):
 
# Runtimes for torchscripted model:
# Contiguous:  1.6711160129999598
# Channels-Last:  1.6678222839999535
 
# Runtimes for mobile-optimized model:
# Contiguous:  0.5712863490000473
# Channels-Last:  0.46113000699995155

Conclusion

The Memory Layout of an input tensor can significantly impact a model’s running time. For Vision Models, prefer a Channels Last memory format to get the most out of your PyTorch models.

References

Read More

Build GAN with PyTorch and Amazon SageMaker

GAN is a generative ML model that is widely used in advertising, games, entertainment, media, pharmaceuticals, and other industries. You can use it to create fictional characters and scenes, simulate facial aging, change image styles, produce chemical formulas synthetic data, and more.

For example, the following images show the effect of picture-to-picture conversion.

The following images show the effect of synthesizing scenery based on semantic layout.

This post walks you through building your first GAN model using Amazon SageMaker. This is a journey of learning GAN from the perspective of practical engineering experiences, as well as opening a new AI/ML domain of generative models.

We also introduce a use case of one of the hottest GAN applications in the synthetic data generation area. We hope this gives you a tangible sense on how GAN is used in real-life scenarios.

Overview of solution

Among the following two pictures of handwritten digits, one of them is actually generated by a GAN model. Can you tell which one?

The main topic of this article is to use ML techniques to generate synthetic handwritten digits. To achieve this goal, you personally experience the training of a GAN model. Generating synthetic handwritten digits is basically the same as the basic principles and engineering processes of portrait generation, although their data, algorithm complexity, and accuracy requirements are different.

Generative Adversarial Networks by Ian Goodfellow et al. is a deep neural network architecture consisting of a generator network and a discriminator network. The generator synthesizes data and tries to deceive the discriminator, whereas the discriminator authenticates the data and tries to correctly identify all synthesized data. In the process of training iterations, the two networks continue to evolve and confront until they reach an equilibrium state (Nash equilibrium). The discriminator can no longer recognize synthesized data anymore, at which point the training process is over.

To train a GAN model, we need to start with some tools and services that are efficient and necessary for ML practices on AWS. As the working environment, SageMaker is a fully managed ML service. It offers all mainstream ML frameworks as managed container images, such as Scikit-Learn, XGBoost, MXNet, TensorFlow, PyTorch, and more. The SageMaker SDK is an open-source development kit for SageMaker that allows you to use SageMaker and other AWS services, for example, accessing data in an Amazon Simple Storage Service (Amazon S3) bucket, or training a model with a managed Amazon Elastic Compute Cloud (Amazon EC2) instance.

With SageMaker end-to-end ML functionality, you can focus on the model building work and easily train a variety of GAN models, without overheads in infrastructure and framework maintenance.

The following diagram illustrates our architecture.

The training data comes from the S3 storage bucket, and is loaded into the local storage of the training instance. The managed training frameworks and managed algorithms serve in the form of container images in Amazon Elastic Container Registry (Amazon ECR), which are combined with the custom training code when the training container is launched. The training output is collected and sent to a specified S3 bucket. In the following sections, we learn how to use these resources via the SageMaker SDK.

We use AWS services such as Amazon SageMaker and Amazon S3, which incur certain cloud resource usage fees.

Set up the development environment

SageMaker provides a managed Jupyter notebook instance, for model building, training, and more. You can carry out ML activities effectively and easily via Jupyter notebooks. For instructions on setting up your Jupyter notebook working environment, see Get Started with Amazon SageMaker Notebook Instances.

Alternatively, you may want to work with Amazon SageMaker Studio. For instructions, see Get Started with Studio Notebooks.

Download the source code

The source code is available in SageMaker Examples GitHub repository.

  1. On the Git menu, choose Clone a Repository.
  2. Enter the clone URI of the repository (https://github.com/aws/amazon-sagemaker-examples.git).
  3. Choose Clone.

When the download is complete, browse the source code structure through the file browser.

  1. Open the notebook build_gan_with_pytorch.ipynb, which is under the folder /amazon-sagemaker-examples/advanced_functionality/pytorch_bring_your_own_gan/.
  2. In the Select Kernel pop-up, choose conda_pytorch_latest_p36.

If using a Studio environment, select the Python3 (PyTorch 1.6 Python 3.6 GPU Optimized) kernel instead.

The code and notebooks used in this post are available on GitHub, and are all verified with Python 3.6, PyTorch 1.5, and SageMaker-managed JupyterLab.

Deep convolutional generative adversarial network (DCGAN)

In 2016, Alec Radford et al. published the paper “Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks”. This pioneered the application of convolutional neural networks to GAN. In the algorithm design, the full connected layers are replaced with convolutional layers, which improves the stability of training in the image generation scenarios.

Network structure

The generator network uses a stride transposed convolutional layers to improve the resolution of the tensor. The input shape is (batch_size, 100) and the output shape is (batch_size, 64, 64, 3). In other words, the network accepts a 100-dimensional uniform distribution vector, and then undergoes continuous transformation until the final image is generated.

The discriminator network receives pictures in (64, 64, 3) format, uses 2D convolutional layers for downsampling, and finally passes them to the full connected layer for classification.

The training process of the DCGAN model can be roughly divided into three sub-processes.

Firstly, the generator network uses a random number as input to generate a synthetic picture. Then it uses the authentic picture and the synthetic picture to train the discriminator network and update the parameters. Finally, it updates the generator network parameters.

Code structure

The file structure of the project directory pytorch_bring_your_own_gan is as follows:

├── data
├── src
│   └── train.py
├── tmp
└── build_gan_with_pytorch.ipynb

The file train.py contains three classes: the generator network Generator, the discriminator network Discriminator, and a wrapper class for a single batch training process. See the following code:

class Generator(nn.Module):
...

class Discriminator(nn.Module):
...

class DCGAN(object):
    """
    A wrapper class for Generator and Discriminator,
    'train_step' method is for single batch training.
    """
...

The train.py file also contains several functions, which are used to facilitate training of the networks of Generator and Discriminator. Some of the major functions are as follows:

def parse_args():
...

def get_datasets(dataset_name, ...):
...

def train(dataloader, hps, ...):
...

Model development

During development, you may run the train.py script directly from the Linux command line. You can specify input data channels, model hyperparameters, and training output storage via command line arguments (for more information, see Use PyTorch with the SageMaker Python SDK):

python src/train.py --dataset mnist 
        --model-dir '/home/myhome/byos-pytorch-gan/model' 
        --output-dir '/home/myhome/byos-pytorch-gan/tmp' 
        --data-dir '/home/myhome/byos-pytorch-gan/data' 
        --hps '{"beta1":0.5, "dataset":"mnist", "epochs":18,
            "learning-rate":0.0002, "log-interval":64, "nz":100, "ngf":28, "ndf":28}'

Such design of the training script parameter not only provides a good debugging method, but also is a protocol and prerequisite for integration with SageMaker containers. This takes into account the flexibility of model development and the portability of the training environment.

Model training and validation

Find and open the notebook file build_gan_with_pytorch.ipynb, which introduces and runs the training process. Some of the code in this section is omitted; refer to the notebook for details.

Download data

Many public datasets are available on the internet that are very helpful for ML engineering and scientific research, such as algorithm study and evaluation. We use the MNIST dataset, which is a handwritten digits dataset, to train a DCGAN model, and eventually generate some synthetic handwritten digits. See the following code:

from sagemaker.s3 import S3Downloader as s3down
s3down.download('s3://sagemaker-sample-files/datasets/image/MNIST/pytorch/', './data')

Prepare the data

The PyTorch framework has a torchvision.datasets package, which provides access to several datasets. You can use the following commands to read the pre-downloaded MNIST dataset from local storage, for later use:

from torchvision import datasets

dataroot = './data'
trainset = datasets.MNIST(root=dataroot, train=True, download=False)
testset = datasets.MNIST(root=dataroot, train=False, download=False)

The SageMaker SDK creates a default S3 bucket for you to access various files and data that you may need in the ML engineering lifecycle. We can get the name of this bucket through the default_bucket method of the sagemaker.session.Session class in the SageMaker SDK:

from sagemaker.session import Session

sess = Session()

# S3 bucket for saving code and model artifacts.
# Feel free to specify a different bucket here if you wish.
bucket = sess.default_bucket()

The SageMaker SDK provides tools for operating AWS services. For example, the S3Downloader class is used to download objects in Amazon S3, and S3Uploader is used to upload local files to Amazon S3. You upload the dataset files to Amazon S3 for model training. During model training, we don’t download data from the internet in order to avoid network latency caused by fetching data from the internet. This also avoids possible security risks due to direct access to the internet. See the following code:

import os
from sagemaker.s3 import S3Uploader as s3up

s3_data_location = s3up.upload(os.path.join(dataroot, "MNIST"),
    f"s3://{bucket}/{prefix}/data/mnist")

Train the model

Via the sagemaker.get_execution_role() method, the notebook can get the role pre-assigned to the notebook instance. This role is used to obtain training resources, such as downloading training framework images, allocating EC2 instances, and so on.

The hyperparameters used in the model training task can be defined in the notebook so that it’s separated from the algorithm and training code. The hyperparameters are passed in when the training task is created and dynamically combined with the training task. See the following code:

import json

hps = {
         'seed': 0,
         'learning-rate': 0.0002,
         'epochs': 18,
         'pin-memory': 1,
         'beta1': 0.5,
         'nz': 100,
         'ngf': 28,
         'ndf': 28,
         'batch-size': 128,
         'log-interval': 20,
     }

The PyTorch class from the sagemaker.pytorch package is an estimator for the PyTorch framework. You can use it to create and run training tasks. In the parameter list, instance_type specifies the type of the training instance, such as CPU or GPU instances. The directory containing the training script and model code is specified by source_dir, and the training script name must be clearly defined by entry_point. These parameters are passed to the training job along with other parameters, and they determine the environment settings of the training task. See the following code:

from sagemaker.pytorch import PyTorch

my_estimator = PyTorch(role=role,
                        entry_point='train.py',
                        source_dir='src',
                        output_path=s3_model_artifacts_location,
                        code_location=s3_custom_code_upload_location,
                        instance_count=1,
                        instance_type='ml.g4dn.2xlarge',
                        use_spot_instances=False,
                        framework_version='1.5.0',
                        py_version='py3',
                        hyperparameters=hps)

Pay special attention to the use_spot_instances parameter. The value of True here means that you want to use Spot Instances to train the model. Because ML training usually requires a large amount of computing resources to run for a long time, using Spot Instances can help you control your cost. Spot Instances may save cost up to 90% vs. On-Demand Instances. Depending on the instance type, Region, and time, the actual price might be different.

You have created a PyTorch object, and you can use it to fit pre-uploaded training data on Amazon S3. The following command initiates the training job, and the training data is loaded into the training instance local storage in the form of an input channel named MNIST. When the training task starts, the training data is already available on the local file system of the training instance, and the training script train.py can access the data from the local disk afterwards.

# Start training
my_estimator.fit({'MNIST': s3_data_location}, wait=False)

Depending on the training instance you choose, the training process may last from dozens of minutes to hours. We recommend setting the wait parameter to False, which detaches the notebook from the training job. In scenarios with long training time and many training logs, it can prevent the notebook context from being lost due to network interruption or session timeout. After the notebook is detached from the training task, the output is temporarily invisible. Run the following code to allow the notebook to obtain and resume the previous training session:

%%time
from sagemaker.estimator import Estimator

# Attaching previous training session
training_job_name = my_estimator.latest_training_job.name
attached_estimator = Estimator.attach(training_job_name)

Because the model was designed to use the GPU power to accelerate training, it’s much faster on GPU instances than on CPU instances. For example, the g4dn.2xlarge instance take about 12 minutes, whereas the c5.xlarge instance may take more than 6 hours. The current model doesn’t support multi-instance training, so an instance_count parameter with a value more than 1 doesn’t bring extra benefits in training time optimization.

When the training job is complete, the trained model is collected and uploaded to Amazon S3. The upload location is specified by the output_path parameter, which is provided when creating the PyTorch object.

Test the model

You download the trained model from Amazon S3 to the local file system of the notebook instance, where this Jupyter notebook is running. The following code loads and runs the model, and then generates a picture of handwritten digits from a random number as input:

import matplotlib.pyplot as plt
import numpy as np
import torch
from src.train import Generator

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

params = {'nz': hps['nz'], 'nc': 1, 'ngf': hps['ngf']}
model = load_model("./tmp/model.pth", model_cls=Generator, params=params, device=device, strict=False)
img = generate_fake_handwriting(model, num_images=64, nz=hps['nz'], device=device)

plt.imshow(np.asarray(img))

Use case: Synthetic data boosting handwritten text recognition

GAN and DCGAN have been derived into a remarkable number of variants that address different problems in their respective domains. Let’s look at one use case, which is designed to reduce the effort and cost in data collection and annotation, as well as improve the performance of a handwriting text recognition system.

ScrabbleGAN (see also the GitHub repo), introduced by scientists from Amazon, is a semi-supervised approach to synthesize handwritten text images that are versatile both in style and lexicon. It relies on a novel generative model that can generate images of words with an arbitrary length. The generator can manipulate the resulting text style, for instance, whether the text is cursive, or how thin the pen stroke is.

Problem definition

Optical character recognition (OCR), especially handwritten text recognition (HTR) systems, have seen significant performance improvements in the deep learning era. However, deep learning-based HTR is limited by the number of training examples. In other words, data gathering and labeling are challenging and costly tasks.

Targeting the lack of versatile, annotated handwritten text, and the difficulty to obtain it, Amazon scientists introduced a semi-supervised learning solution by creating realistic, synthesized text, reducing the need for annotations and enriching the variety of training data in both style and lexicon.

Network architecture

In contrast to the vast majority of text-related networks that rely on recurrent neural networks (RNNs), ScrabbleGAN introduces a novel fully convolutional handwritten text generation architecture, which allows for arbitrarily long outputs. This architecture learns character embeddings without the need for character-level annotation.

Handwriting is a local process—each letter is influenced by its predecessor and successor. The attention of the synthesizer is focused on the immediate neighbors of the current letter, and the generator G is designed to mimic this process. Instead of generating the image out of an entire word representation, each convolutional-upsampling layer widens the receptive field, as well as the overlap between two neighboring characters. This overlap allows adjacent characters to interact, and creates a smooth transition. The style of each image is controlled by a noise vector z given as input to the network. To generate the same style for the entire word or sentence, this noise vector is kept constant throughout the generation of all the characters in the input.

The purpose of the discriminator D is to identify synthetic images generated by G from the real ones. It also discriminates between such images based on the handwriting output style. The discriminator architecture has to account for the varying length of the generated image, therefore it’s designed to be convolutional, and is essentially a concatenation of separate binary classifiers with overlapping receptive fields. Because it’s designed not to rely on character-level annotations, it doesn’t use class supervision for each of these classifiers, therefore unlabeled images can be used to train D. A pooling layer aggregates scores from all classifiers into the final discriminator output.

While discriminator D promotes real-looking images, the recognizer R promotes readable text, in essence identifying between gibberish and real text. Generated images are penalized by comparing the recognized text in the output of R to the one that was given as input to G. R is trained only on real, labeled, handwritten samples.

Most recognition networks use a recurrent module, which learns an implicit language model that helps it identify the correct character even if it’s not written clearly. Although this quality is usually desired in a handwriting recognition model, in this synthetic data case, it may lead the network to correctly read characters that weren’t written clearly by the generator G. Therefore, the recurrent head of the recognition network isn’t excluded, and only the convolutional backbone is used.

Conclusion

The PyTorch framework, one of the most popular deep learning frameworks, has been advancing rapidly, and is widely recognized and applied in recent years. More and more new models have been composed with PyTorch, and a remarkable number of existing models are being migrated from other frameworks to PyTorch. It has already become one of the de facto mainstream deep learning frameworks.

SageMaker is closely integrated with a variety of AWS services, such as EC2 instances of various types, Amazon S3, and Amazon ECR. It provides an end-to-end, consistent ML experience for ML practitioners of all frameworks. SageMaker continues to support mainstream ML frameworks, including PyTorch. ML algorithms and models developed with PyTorch can be easily transplanted to a SageMaker environment by using the fully managed Jupyter notebook, Spot training instances, Amazon ECR, the SageMaker SDK, and more. This lowers the overhead of ML engineering and infrastructure operation, improves productivity and efficiency, and reduces operation and maintenance costs.

Synthetic data, generated by GAN, is rich and versatile in features, and can be produced in substantial amounts. Therefore, you can use it to improve the performance of a model by enriching the training set. Moreover, this technique can reduce effort and cost in data gathering and labeling.

DCGAN is a landmark in the field of generative adversarial networks, and it’s the cornerstone of many modern complex generative adversarial networks today. We explore some of the most recent and interesting variants of GANs in later posts. The introduction and engineering practices discussed in this post can help you understand the principles and engineering methods for GAN in general. Try out your first generative model, available as an example of SageMaker, have fun, and see you next time.


About the Author

Laurence MIAO, Solutions Architect at AWS. Laurence is specialized in AI/ML. He helps customers empower their business with AI/ML on AWS. Before AWS, Laurence served in a variety of software projects and organizations. His tech spectrum covers high-performance internet applications, enterprise information system integration, DevOps, cloud computing, and Machine Learning.

Read More

Process Amazon Redshift data and schedule a training pipeline with Amazon SageMaker Processing and Amazon SageMaker Pipelines

Customers in many different domains tend to work with multiple sources for their data: object-based storage like Amazon Simple Storage Service (Amazon S3), relational databases like Amazon Relational Database Service (Amazon RDS), or data warehouses like Amazon Redshift. Machine learning (ML) practitioners are often driven to work with objects and files instead of databases and tables from the different frameworks they work with. They also prefer local copies of such files in order to reduce the latency of accessing them.

Nevertheless, ML engineers and data scientists might be required to directly extract data from data warehouses with SQL-like queries to obtain the datasets that they can use for training their models.

In this post, we use the Amazon SageMaker Processing API to run a query against an Amazon Redshift cluster, create CSV files, and perform distributed processing. As an extra step, we also train a simple model to predict the total sales for new events, and build a pipeline with Amazon SageMaker Pipelines to schedule it.

Prerequisites

This post uses the sample data that is available when creating a Free Tier cluster in Amazon Redshift. As a prerequisite, you should create your cluster and attach to it an AWS Identity and Access Management (IAM) role with the correct permissions. For instructions on creating the cluster with the sample dataset, see Using a sample dataset. For instructions on associating the role with the cluster, see Authorizing access to the Amazon Redshift Data API.

You can then use your IDE of choice to open the notebooks. This content has been developed and tested using SageMaker Studio on a ml.t3.medium instance. For more information about using Studio, refer to the following resources:

Define the query

Now that your Amazon Redshift cluster is up and running, and loaded with the sample dataset, we can define the query to extract data from our cluster. According to the documentation for the sample database, this application helps analysts track sales activity for the fictional TICKIT website, where users buy and sell tickets online for sporting events, shows, and concerts. In particular, analysts can identify ticket movement over time, success rates for sellers, and the best-selling events, venues, and seasons.

Analysts may be tasked to solve a very common ML problem: predict the number of tickets sold given the characteristics of an event. Because we have two fact tables and five dimensions in our sample database, we have some data that we can work with. For the sake of this example, we try to use information from the venue in which the event takes place as well as its date. The SQL query looks like the following:

SELECT sum(s.qtysold) AS total_sold, e.venueid, e.catid, d.caldate, d.holiday
from sales s, event e, date d
WHERE s.eventid = e.eventid and e.dateid = d.dateid
GROUP BY e.venueid, e.catid, d.caldate, d.holiday

We can run this query in the query editor to test the outcomes and change it to include additional information if needed.

Extract the data from Amazon Redshift and process it with SageMaker Processing

Now that we’re happy with our query, we need to make it part of our training pipeline.

A typical training pipeline consists of three phases:

  • Preprocessing – This phase reads the raw dataset and transforms it into a format that matches the input required by the model for its training
  • Training – This phase reads the processed dataset and uses it to train the model
  • Model registration – In this phase, we save the model for later usage

Our first task is to use a SageMaker Processing job to load the dataset from Amazon Redshift, preprocess it, and store it to Amazon S3 for the training model to pick up. SageMaker Processing allows us to directly read data from different resources, including Amazon S3, Amazon Athena, and Amazon Redshift. SageMaker Processing allows us to configure access to the cluster by providing the cluster and database information, and use our previously defined SQL query as part of a RedshiftDatasetDefinition. We use the SageMaker Python SDK to create this object, and you can check the definition and the parameters needed on the GitHub page. See the following code:

from sagemaker.dataset_definition.inputs import RedshiftDatasetDefinition

rdd = RedshiftDatasetDefinition(
    cluster_id="THE-NAME-OF-YOUR-CLUSTER",
    database="THE-NAME-OF-YOUR-DATABASE",
    db_user="YOUR-DB-USERNAME",
    query_string="THE-SQL-QUERY-FROM-THE-PREVIOUS-STEP",
    cluster_role_arn="THE-IAM-ROLE-ASSOCIATED-TO-YOUR-CLUSTER",
    output_format="CSV",
    output_s3_uri="WHERE-IN-S3-YOU-WANT-TO-STORE-YOUR-DATA"
)

Then, you can define the DatasetDefinition. This object is responsible for defining how SageMaker Processing uses the dataset loaded from Amazon Redshift:

from sagemaker.dataset_definition.inputs import DatasetDefinition

dd = DatasetDefinition(
    data_distribution_type='ShardedByS3Key', # This tells SM Processing to shard the data across instances
    local_path='/opt/ml/processing/input/data/', # Where SM Processing will save the data in the container
    redshift_dataset_definition=rdd # Our ResdhiftDataset
)

Finally, you can use this object as input of your processor of choice. For this post, we wrote a very simple scikit-learn script that cleans the dataset, performs some transformations, and splits the dataset for training and testing. You can check the code in the file processing.py.

We can now instantiate the SKLearnProcessor object, where we define the framework version that we plan on using, the amount and type of instances that we spin up as part of our processing cluster, and the execution role that contains the right permissions. Then, we can pass the parameter dataset_definition as the input of the run() method. This method accepts our processing.py script as the code to run, given some inputs (namely, our RedshiftDatasetDefinition), generates some outputs (a train and a test dataset), and stores both to Amazon S3. We run this operation synchronously thanks to the parameter wait=True:

from sagemaker.sklearn import SKLearnProcessor
from sagemaker.processing import ProcessingInput, ProcessingOutput
from sagemaker import get_execution_role

skp = SKLearnProcessor(
    framework_version='0.23-1',
    role=get_execution_role(),
    instance_type='ml.m5.large',
    instance_count=1
)
skp.run(
    code='processing/processing.py',
    inputs=[ProcessingInput(
        dataset_definition=dd,
        destination='/opt/ml/processing/input/data/',
        s3_data_distribution_type='ShardedByS3Key'
    )],
    outputs = [
        ProcessingOutput(
            output_name="train", 
            source="/opt/ml/processing/train"
        ),
        ProcessingOutput(
            output_name="test", 
            source="/opt/ml/processing/test"
        ),
    ],
    wait=True
)

With the outputs created by the processing job, we can move to the training step, by means of the sagemaker.sklearn.SKLearn() Estimator:

from sagemaker.sklearn import SKLearn

s = SKLearn(
    entry_point='training/script.py',
    framework_version='0.23-1',
    instance_type='ml.m5.large',
    instance_count=1,
    role=get_execution_role()
)
s.fit({
    'train':skp.latest_job.outputs[0].destination, 
    'test':skp.latest_job.outputs[1].destination
})

To learn more about the SageMaker Training API and Scikit-learn Estimator, see Using Scikit-learn with the SageMaker Python SDK.

Define a training pipeline

Now that we have proven that we can read data from Amazon Redshift, preprocess it, and use it to train a model, we can define a pipeline that reproduces these steps, and schedule it to run. To do so, we use SageMaker Pipelines. Pipelines is the first purpose-built, easy-to-use continuous integration and continuous delivery (CI/CD) service for ML. With Pipelines, you can create, automate, and manage end-to-end ML workflows at scale.

Pipelines are composed of steps. These steps define the actions that the pipeline takes, and the relationships between steps using properties. We already know that our pipelines are composed of three steps:

Furthermore, to make the pipeline definition dynamic, Pipelines allows us to define parameters, which are values that we can provide at runtime when the pipeline starts.

The following code is a snippet that shows the definition of a processing step. The step requires the definition of a processor, which is very similar to the one defined previously during the preprocessing discovery phase, but this time using the parameters of Pipelines. The others parameters, code, inputs, and outputs are the same as we have defined previously:

#### PROCESSING STEP #####

# PARAMETERS
processing_instance_type = ParameterString(name='ProcessingInstanceType', default_value='ml.m5.large')
processing_instance_count = ParameterInteger(name='ProcessingInstanceCount', default_value=2)

# PROCESSOR
skp = SKLearnProcessor(
    framework_version='0.23-1',
    role=get_execution_role(),
    instance_type=processing_instance_type,
    instance_count=processing_instance_count
)

# DEFINE THE STEP
processing_step = ProcessingStep(
    name='ProcessingStep',
    processor=skp,
    code='processing/processing.py',
    inputs=[ProcessingInput(
        dataset_definition=dd,
        destination='/opt/ml/processing/input/data/',
        s3_data_distribution_type='ShardedByS3Key'
    )],
    outputs = [
        ProcessingOutput(output_name="train", source="/opt/ml/processing/output/train"),
        ProcessingOutput(output_name="test", source="/opt/ml/processing/output/test"),
    ]
)

Very similarly, we can define the training step, but we use the outputs from the processing step as inputs:

# TRAININGSTEP
training_step = TrainingStep(
    name='TrainingStep',
    estimator=s,
    inputs={
        "train": TrainingInput(s3_data=processing_step.properties.ProcessingOutputConfig.Outputs["train"].S3Output.S3Uri),
        "test": TrainingInput(s3_data=processing_step.properties.ProcessingOutputConfig.Outputs["test"].S3Output.S3Uri)
    }
)

Finally, let’s add the model step, which registers the model to SageMaker for later use (for real-time endpoints and batch transform):

# MODELSTEP
model_step = CreateModelStep(
    name="Model",
    model=model,
    inputs=CreateModelInput(instance_type='ml.m5.xlarge')
)

With all the pipeline steps now defined, we can define the pipeline itself as a pipeline object comprising a series of those steps. ParallelStep and Condition steps also are possible. Then we can update and insert (upsert) the definition to Pipelines with the .upsert() command:

#### PIPELINE ####
pipeline = Pipeline(
    name = 'Redshift2Pipeline',
    parameters = [
        processing_instance_type, processing_instance_count,
        training_instance_type, training_instance_count,
        inference_instance_type
    ],
    steps = [
        processing_step, 
        training_step,
        model_step
    ]
)
pipeline.upsert(role_arn=role)

After we upsert the definition, we can start the pipeline with the pipeline object’s start() method, and wait for the end of its run:

execution = pipeline.start()
execution.wait()

After the pipeline starts running, we can view the run on the SageMaker console. In the navigation pane, under Components and registries, choose Pipelines. Choose the Redshift2Pipeline pipeline, and then choose the specific run to see its progress. You can choose each step to see additional details such as the output, logs, and additional information. Typically, this pipeline should take about 10 minutes to complete.

Conclusions

In this post, we created a SageMaker pipeline that reads data from Amazon Redshift natively without requiring additional configuration or services, processed it via SageMaker Processing, and trained a scikit-learn model. We can now do the following:

If you want additional notebooks to play with, check out the following:


About the Author

Davide Gallitelli is a Specialist Solutions Architect for AI/ML in the EMEA region. He is based in Brussels and works closely with customers throughout Benelux. He has been a developer since he was very young, starting to code at the age of 7. He started learning AI/ML at university, and has fallen in love with it since then.

Read More

Add AutoML functionality with Amazon SageMaker Autopilot across accounts

AutoML is a powerful capability, provided by Amazon SageMaker Autopilot, that allows non-experts to create machine learning (ML) models to invoke in their applications.

The problem that we want to solve arises when, due to governance constraints, Amazon SageMaker resources can’t be deployed in the same AWS account where they are used.

Examples of such a situation are:

  • A multi-account enterprise setup of AWS where the Autopilot resources must be deployed in a specific AWS account (the trusting account), and should be accessed from trusted accounts
  • A software as a service (SaaS) provider that offers AutoML to their users and adopts the resources in the customer AWS account so that the billing is associated to the end customer

This post walks through an implementation using the SageMaker Python SDK. It’s divided into two sections:

  • Create the AWS Identity and Access Management (IAM) resources needed for cross-account access
  • Perform the Autopilot job, deploy the top model, and make predictions from the trusted account accessing the trusting account

The solution described in this post is provided in the Jupyter notebook available in this GitHub repository.

For a full explanation of Autopilot, you can refer to the examples available in GitHub, particularly Top Candidates Customer Churn Prediction with Amazon SageMaker Autopilot and Batch Transform (Python SDK).

Prerequisites

We have two AWS accounts:

  • Customer (trusting) account – Where the SageMaker resources are deployed
  • SaaS (trusted) account – Drives the training and prediction activities

You have to create a user for each account, with programmatic access enabled and the IAMFullAccess managed policy associated.

You have to configure the user profiles in the .aws/credentials file:

  • customer_config for the user configured in the customer account
  • saas_config for the user configured in the SaaS account

To update the SageMaker SDK, run the following command in your Python environment:

!pip install --update sagemaker

The procedure has been tested in the SageMaker environment conda_python3.

Common modules and initial definitions

Import common Python modules used in the script:

import boto3
import json
import sagemaker
from botocore.exceptions import ClientError

Let’s define the AWS Region that will host the resources:

REGION = boto3.Session().region_name

and the reference to the dataset for the training of the model:

DATASET_URI = "s3://sagemaker-sample-files/datasets/tabular/synthetic/churn.txt"

Set up the IAM resources

The following diagram illustrates the IAM entities that we create, which allow the cross-account implementation of the Autopilot job.

On the customer account, we define the single role customer_trusting_saas, which consolidates the permissions for Amazon Simple Storage Service (Amazon S3) and SageMaker access needed for the following:

  • The local SageMaker service that performs the Autopilot actions
  • The principal in the SaaS account that initiates the actions in the customer account

On the SaaS account, we define the following:

  • The AutopilotUsers group with the policy required to assume the customer_trusting_saas role via AWS Security Token Service (AWS STS)
  • The saas_user, which is a member of the AutopilotUsers group and is the actual principal triggering the Autopilot actions

For additional security, in the cross-account trust relationship, we use the external ID to mitigate the confused deputy problem.

Let’s proceed with the setup.

For each of the two accounts, we complete the following tasks:

  1. Create the Boto3 session with the profile of the respective configuration user.
  2. Retrieve the AWS account ID by means of AWS STS.
  3. Create the IAM client that performs the configuration steps in the account.

For the customer account, use the following code:

customer_config_session = boto3.session.Session(profile_name="customer_config")
CUSTOMER_ACCOUNT_ID = customer_config_session.client("sts").get_caller_identity()["Account"]
customer_iam_client = customer_config_session.client("iam")

Use the following code in the SaaS account:

saas_config_session = boto3.session.Session(profile_name="saas_config")
SAAS_ACCOUNT_ID = saas_config_session.client("sts").get_caller_identity()["Account"]
saas_iam_client = saas_config_session.client("iam")

Set up the IAM entities in the customer account

Let’s first define the role needed to perform cross-account tasks from the SaaS account in the customer account.

For simplicity, the same role is adopted for trusting SageMaker in the customer account. Ideally, consider splitting this role into two roles with fine-grained permissions in line with the principle of granting least privilege.

The role name and the references to the ARN of the SageMaker AWS managed policies are as follows:

CUSTOMER_TRUST_SAAS_ROLE_NAME = "customer_trusting_saas"
CUSTOMER_TRUST_SAAS_ROLE_ARN = "arn:aws:iam::{}:role/{}".format(CUSTOMER_ACCOUNT_ID, CUSTOMER_TRUST_SAAS_ROLE_NAME)
SAGEMAKERFULLACCESS_POLICY_ARN = "arn:aws:iam::aws:policy/AmazonSageMakerFullAccess"

The following customer managed policy gives the role the permissions to access the Amazon S3 resources that are needed for the SageMaker tasks and for the cross-account copy of the dataset.

We restrict the access to the S3 buckets dedicated to SageMaker in the AWS Region for the customer account. See the following code:

CUSTOMER_S3_POLICY_NAME = "customer_s3"
CUSTOMER_S3_POLICY = 
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
         "s3:GetObject",
         "s3:PutObject",
         "s3:DeleteObject",
         "s3:ListBucket"
      ],
      "Resource": [
         "arn:aws:s3:::sagemaker-{}-{}".format(REGION, CUSTOMER_ACCOUNT_ID),
         "arn:aws:s3:::sagemaker-{}-{}/*".format(REGION, CUSTOMER_ACCOUNT_ID)
      ]
    }
  ]
}

Then we define the external ID to mitigate the confused deputy problem:

EXTERNAL_ID = "XXXXX"

The trust relationships policy allows the principals from the trusted account and SageMaker to assume the role:

CUSTOMER_TRUST_SAAS_POLICY = 
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::{}:root".format(SAAS_ACCOUNT_ID)
      },
      "Action": "sts:AssumeRole",
      "Condition": {
        "StringEquals": {
          "sts:ExternalId": EXTERNAL_ID
        }
      }
    },
    {
      "Effect": "Allow",
      "Principal": {
        "Service": "sagemaker.amazonaws.com"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}

For simplicity, we don’t include the management of the exceptions in the following snippets. See the Jupyter notebook for the full code.

We create the customer managed policy in the customer account, create the new role, and attach the two policies. We use the maximum session duration parameter to manage long-running jobs. See the following code:

MAX_SESSION_DURATION = 10800
create_policy_response = customer_iam_client.create_policy(PolicyName=CUSTOMER_S3_POLICY_NAME,
                                                           PolicyDocument=json.dumps(CUSTOMER_S3_POLICY))
customer_s3_policy_arn = create_policy_response["Policy"]["Arn"]

create_role_response = customer_iam_client.create_role(RoleName=CUSTOMER_TRUST_SAAS_ROLE_NAME,
                                                       AssumeRolePolicyDocument=json.dumps(CUSTOMER_TRUST_SAAS_POLICY),
                                                       MaxSessionDuration=MAX_SESSION_DURATION)

customer_iam_client.attach_role_policy(RoleName=CUSTOMER_TRUST_SAAS_ROLE_NAME,
                                       PolicyArn=customer_s3_policy_arn)
customer_iam_client.attach_role_policy(RoleName=CUSTOMER_TRUST_SAAS_ROLE_NAME,
                                       PolicyArn=SAGEMAKERFULLACCESS_POLICY_ARN)

Set up IAM entities in the SaaS account

We define the following in the SaaS account:

  • A group of users allowed to perform the Autopilot job in the customer account
  • A policy associated with the group for assuming the role defined in the customer account
  • A policy associated with the group for uploading data to Amazon S3 and managing bucket policies
  • A user that is responsible for the implementation of the Autopilot jobs – the user has programmatic access
  • A user profile to store the user access key and secret in the file for the credentials

Let’s start with defining the name of the group (AutopilotUsers):

SAAS_USER_GROUP_NAME = "AutopilotUsers"

The first policy refers to the customer account ID and the role:

SAAS_ASSUME_ROLE_POLICY_NAME = "saas_assume_customer_role"
SAAS_ASSUME_ROLE_POLICY = 
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "sts:AssumeRole",
            "Resource": "arn:aws:iam::{}:role/{}".format(CUSTOMER_ACCOUNT_ID, CUSTOMER_TRUST_SAAS_ROLE_NAME)
        }
    ]
}

The second policy is needed to download the dataset, and to manage the Amazon S3 bucket used by SageMaker:

SAAS_S3_POLICY_NAME = "saas_s3"
SAAS_S3_POLICY = 
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:GetObject"
            ],
            "Resource": [
                "arn:aws:s3:::{}".format(DATASET_URI.split('://')[1])
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "s3:CreateBucket",
                "s3:GetObject",
                "s3:PutObject",
                "s3:DeleteObject",
                "s3:PutBucketPolicy",
                "s3:DeleteBucketPolicy"
            ],
            "Resource": [
                "arn:aws:s3:::sagemaker-{}-{}".format(REGION, SAAS_ACCOUNT_ID),
                "arn:aws:s3:::sagemaker-{}-{}/*".format(REGION, SAAS_ACCOUNT_ID)
            ]
        }
    ]
}

For simplicity, we give the same value to the user name and user profile:

SAAS_USER_PROFILE = SAAS_USER_NAME = "saas_user"

Now we create the two new managed policies. Next, we create the group, attach the policies to the group, create the user with programmatic access, and insert the user into the group. See the following code:

create_policy_response = saas_iam_client.create_policy(PolicyName=SAAS_ASSUME_ROLE_POLICY_NAME,
                                                       PolicyDocument=json.dumps(SAAS_ASSUME_ROLE_POLICY))      
saas_assume_role_policy_arn = create_policy_response["Policy"]["Arn"]

create_policy_response = saas_iam_client.create_policy(PolicyName=SAAS_S3_POLICY_NAME,
                                                       PolicyDocument=json.dumps(SAAS_S3_POLICY))
saas_s3_policy_arn = create_policy_response["Policy"]["Arn"]

saas_iam_client.create_group(GroupName=SAAS_USER_GROUP_NAME)

saas_iam_client.attach_group_policy(GroupName=SAAS_USER_GROUP_NAME,PolicyArn=saas_assume_role_policy_arn)
saas_iam_client.attach_group_policy(GroupName=SAAS_USER_GROUP_NAME,PolicyArn=saas_s3_policy_arn)

saas_iam_client.create_user(UserName=SAAS_USER_NAME)
saas_iam_client.create_access_key(UserName=SAAS_USER_NAME)

add_user_to_group(GroupName=SAAS_USER_GROUP_NAME,UserName=SAAS_USER_NAME)

Update the credentials file

Create the user profile for saas_user in the .aws/credentials file:

from pathlib import Path
import configparser

credentials_config = configparser.ConfigParser()
credentials_config.read(str(Path.home()) + "/.aws/credentials")

if not credentials_config.has_section(SAAS_USER_PROFILE):
    credentials_config.add_section(SAAS_USER_PROFILE)
    
credentials_config[SAAS_USER_PROFILE]["aws_access_key_id"] = create_akey_response["AccessKey"]["AccessKeyId"]
credentials_config[SAAS_USER_PROFILE]["aws_secret_access_key"] = create_akey_response["AccessKey"]["SecretAccessKey"]

with open(str(Path.home()) + "/.aws/credentials", "w") as configfile:
    credentials_config.write(configfile, space_around_delimiters=False)

This completes the configuration of IAM entities that are needed for the cross-account implementation of the Autopilot job.

Autopilot cross-account access

This is the core objective of the post, where we demonstrate the main differences with respect to the single-account scenario.

First, we prepare the dataset the Autopilot job uses for training the models.

Data

We reuse the same dataset adopted in the SageMaker example: Top Candidates Customer Churn Prediction with Amazon SageMaker Autopilot and Batch Transform (Python SDK).

For a full explanation of the data, refer to the original example.

We skip the data inspection and proceed directly to the focus of this post, which is the cross-account Autopilot job invocation.

Download the churn dataset with the following AWS Command Line Interface (AWS CLI) command:

!aws s3 cp $DATASET_URI ./ --profile saas_user

Split the dataset for the Autopilot job and the inference phase

After you load the dataset, split it into two parts:

  • 80% for the Autopilot job to train the top model
  • 20% for testing the model that we deploy

Autopilot applies a cross-validation resampling procedure, on the dataset passed as input, to all candidate algorithms to test their ability to predict data they have not been trained on.

Split the dataset with the following code:

import pandas as pd
import numpy as np

churn = pd.read_csv("./churn.txt")
train_data = churn.sample(frac=0.8,random_state=200)
test_data = churn.drop(train_data.index)
test_data_no_target = test_data.drop(columns=["Churn?"])

Let’s save the training data into a file locally that we pass to the fit method of the AutoML estimator:

train_file = "train_data.csv"
train_data.to_csv(train_file, index=False, header=True)

Autopilot training job, deployment, and prediction overview

The training, deployment, and prediction process is illustrated in the following diagram.

The following are the steps for the cross-account invocation:

  1. Initiate a session as saas_user in the SaaS account and load the profile from the credentials.
  2. Assume the role in the customer account via the AWS STS.
  3. Set up and train the AutoML estimator in the customer account.
  4. Deploy the top candidate model proposed by AutoML in the customer account.
  5. Invoke the deployed model endpoint for the prediction on test data.

Initiate the user session in the SaaS account

The setup procedure of IAM entities, explained at the beginning of the post, created the saas_user, identified by the saas_user profile in the .aws/credentials file. We initiate a Boto3 session with this profile:

saas_user_session = boto3.session.Session(profile_name=SAAS_USER_PROFILE, 
                                          region_name=REGION)

The saas_user inherits from the AutopilotUsers group the permission to assume the customer_trusting_saas role in the customer account.

Assume the role in the customer account via AWS STS

AWS STS provides the credentials for a temporary session that is initiated in the customer account:

saas_sts_client = saas_user_session.client("sts", region_name=REGION)

The default session duration (the DurationSeconds parameter) is 1 hour. We set it to the maximum duration session value set for the role. If the session expires, you can recreate it by performing the following steps again. See the following code:

assumed_role_object = saas_sts_client.assume_role(RoleArn=CUSTOMER_TRUST_SAAS_ROLE_ARN,
                                                  RoleSessionName="sagemaker_autopilot",
                                                  ExternalId=EXTERNAL_ID,
                                                  DurationSeconds=MAX_SESSION_DURATION)

assumed_role_credentials = assumed_role_object["Credentials"]
			 
assumed_role_session = boto3.Session(aws_access_key_id=assumed_role_credentials["AccessKeyId"],
                                     aws_secret_access_key=assumed_role_credentials["SecretAccessKey"],
                                     aws_session_token=assumed_role_credentials["SessionToken"],
                                     region_name=REGION)
									 
sagemaker_session = sagemaker.Session(boto_session=assumed_role_session)

The sagemaker_session parameter is needed for using the high-level AutoML estimator.

Set up and train the AutoML estimator in the customer account

We use the AutoML estimator from the SageMaker Python SDK to invoke the Autopilot job to train a set of candidate models for the training data.

The setup of the AutoML object is similar to the single-account scenario, but with the following differences for the cross-account invocation:

  • The role for SageMaker access in the customer account is CUSTOMER_TRUST_SAAS_ROLE_ARN
  • The sagemaker_session is the temporary session created by AWS STS

See the following code:

target_attribute_name = "Churn?"

from sagemaker import AutoML
from time import gmtime, strftime, sleep

timestamp_suffix = strftime("%d-%H-%M-%S", gmtime())
base_job_name = "automl-churn-sdk-" + timestamp_suffix

target_attribute_name = "Churn?"
target_attribute_values = np.unique(train_data[target_attribute_name])
target_attribute_true_value = target_attribute_values[1] # 'True.'

automl = AutoML(role=CUSTOMER_TRUST_SAAS_ROLE_ARN,
                target_attribute_name=target_attribute_name,
                base_job_name=base_job_name,
                sagemaker_session=sagemaker_session,
                max_candidates=10)

We now launch the Autopilot job by calling the fit method of the AutoML estimator in the same way as in the single-account example. We consider the following alternative options for providing the training dataset to the estimator.

First option: upload a local file and train by fit method

We simply pass the training dataset by referring to the local file that the fit method uploads into the default Amazon S3 bucket used by SageMaker in the customer account:

automl.fit(train_file, job_name=base_job_name, wait=False, logs=False)

Second option: cross-account copy

Most likely, the training dataset is located in an Amazon S3 bucket owned by the SaaS account. We copy the dataset from the SaaS account into the customer account and refer to the URI of the copy in the fit method.

  1. Upload the dataset into a local bucket of the SaaS account. For convenience, we use the SageMaker default bucket in the Region.
    DATA_PREFIX = "auto-ml-input-data"
    local_session = sagemaker.Session(boto_session=saas_user_session)
    local_session_bucket = local_session.default_bucket()
    train_data_s3_path = local_session.upload_data(path=train_file,key_prefix=DATA_PREFIX)

  2. To allow the cross-account copy, we set the following policy in the local bucket, only for the time needed for the copy operation:
    train_data_s3_arn = "arn:aws:s3:::{}/{}/{}".format(local_session_bucket,DATA_PREFIX,train_file)
    bucket_policy = {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {
                    "AWS": CUSTOMER_TRUST_SAAS_ROLE_ARN
                },
                "Action": "s3:GetObject",
                "Resource": train_data_s3_arn
            }
        ]
    }
    bucket_policy = json.dumps(bucket_policy)
    
    saas_s3_client = saas_user_session.client("s3")
    saas_s3_client.put_bucket_policy(Bucket=local_session_bucket,Policy=bucket_policy)

  3. Then the copy is performed by the assumed role in the customer account:
    assumed_role_s3_client = boto3.client("s3",
                                           aws_access_key_id=assumed_role_credentials["AccessKeyId"],
                                           aws_secret_access_key=assumed_role_credentials["SecretAccessKey"],
                                           aws_session_token=assumed_role_credentials["SessionToken"])
    target_train_key = "{}/{}".format(DATA_PREFIX, train_file)
    assumed_role_s3_client.copy_object(Bucket=sagemaker_session.default_bucket(), 
                                       CopySource=train_data_s3_path.split("://")[1], 
                                       Key=target_train_key)

  4. Delete the bucket policy so that the access has been granted only for the time of the copy:
    saas_s3_client.delete_bucket_policy(Bucket=local_session_bucket)

  5. Finally, we launch the Autopilot job, passing the URI of the object copy:
target_train_uri = "s3://{}/{}".format(sagemaker_session.default_bucket(), 
                                       target_train_key)
automl.fit(target_train_uri, job_name=base_job_name, wait=False, logs=False)

Another option is to refer to the URI of the source dataset in the bucket in SaaS account. In this case, the bucket policy should include the s3:ListBucket action for the source bucket.

The bucket policy should be assigned for the duration of all the training and allow the s3:ListBucket action for the source bucket, including a statement like the following:

{
  "Effect": "Allow",
  "Principal": {
     "AWS": "arn:aws:iam::CUSTOMER_ACCOUNT_ID:role/customer_trusting_saas"
  },
  "Action": "s3:ListBucket",
  "Resource": "arn:aws:s3:::sagemaker-REGION-SAAS_ACCOUNT_ID"
}

We can use the describe_auto_ml_job method to track the status of our SageMaker Autopilot job:

describe_response = automl.describe_auto_ml_job()
print (describe_response["AutoMLJobStatus"] + " - " + describe_response["AutoMLJobSecondaryStatus"])
job_run_status = describe_response["AutoMLJobStatus"]

while job_run_status not in ("Failed", "Completed", "Stopped"):
    describe_response = automl.describe_auto_ml_job()
    job_run_status = describe_response["AutoMLJobStatus"]
    
    print(describe_response["AutoMLJobStatus"] + " - " + describe_response["AutoMLJobSecondaryStatus"])
    sleep(30)

Because an Autopilot job can take a long time, if the session token expires during the fit, you can create a new session following the steps described earlier and retrieve the current Autopilot job reference by implementing the following code:

automl = AutoML.attach(auto_ml_job_name=base_job_name,sagemaker_session=sagemaker_session)

Deploy the top candidate model proposed by AutoML

The Autopilot job trains and returns a set of trained candidate models, identifying among them the top candidate that optimizes the evaluation metric related to the ML problem.

In this post, we only demonstrate the deployment of the top candidate proposed by AutoML, but you can choose a different candidate that better fits your business criteria.

First, we review the performance achieved by the top candidate in the cross-validation:

best_candidate = automl.describe_auto_ml_job()["BestCandidate"]
best_candidate_name = best_candidate["CandidateName"]
print("n")
print("CandidateName: " + best_candidate_name)
print("FinalAutoMLJobObjectiveMetricName: " + best_candidate["FinalAutoMLJobObjectiveMetric"]["MetricName"])
print("FinalAutoMLJobObjectiveMetricValue: " + str(best_candidate["FinalAutoMLJobObjectiveMetric"]["Value"]))

If the performance is good enough for our business criteria, we deploy the top candidate in the customer account:

from sagemaker.predictor import Predictor
from sagemaker.serializers import CSVSerializer
from sagemaker.deserializers import CSVDeserializer

inference_response_keys = ["predicted_label", "probability"]

predictor = automl.deploy(initial_instance_count=1,
                          instance_type="ml.m5.large",
                          inference_response_keys=inference_response_keys,
                          predictor_cls=Predictor,
                          serializer=CSVSerializer(),
                          deserializer=CSVDeserializer())

print("Created endpoint: {}".format(predictor.endpoint_name))

The instance is deployed and billed to the customer account.

Prediction on test data

Finally, we access the model endpoint for the prediction of the label output for the test data:

predictor.predict(test_data_no_target.to_csv(sep=",", 
                                             header=False, 
                                             index=False))

If the session token expires after the deployment of the endpoint, you can recreate a new session following the steps described earlier and connect to the already deployed endpoint by implementing the following code:

predictor = Predictor(predictor.endpoint_name, 
                      sagemaker_session = sagemaker_session,
                      serializer=CSVSerializer(), 
                      deserializer=CSVDeserializer())

Clean up

To avoid incurring unnecessary charges, delete the endpoints and resources that were created when deploying the model after they are no longer needed.

Delete the model endpoint

The model endpoint is deployed in a container that is always active. We delete it first to avoid consumption of credits:

predictor.delete_endpoint()

Delete the artifacts generated by the Autopilot job

Delete all the artifacts created by the Autopilot job, such as the generated candidate models, scripts, and notebook.

We use the high-level resource for Amazon S3 to simplify the operation:

assumed_role_s3_resource = boto3.resource("s3",
                                          aws_access_key_id=assumed_role_credentials["AccessKeyId"],
                                          aws_secret_access_key=assumed_role_credentials["SecretAccessKey"],
                                          aws_session_token=assumed_role_credentials["SessionToken"])

s3_bucket = assumed_role_s3_resource.Bucket(automl.sagemaker_session.default_bucket())
s3_bucket.objects.filter(Prefix=base_job_name).delete()

Delete the training dataset copied into the customer account

Delete the training dataset in the customer account with the following code:

from urllib.parse import urlparse

train_data_uri = automl.describe_auto_ml_job()["InputDataConfig"][0][ "DataSource"]["S3DataSource"]["S3Uri"]

o = urlparse(train_data_uri, allow_fragments=False)
assumed_role_s3_resource.Object(o.netloc, o.path.lstrip("/")).delete()

Clean up IAM resources

We delete the IAM resources in reverse order to the creation phase.

  1. Remove the user from the group, and the profile from the credentials, and delete the user:
    saas_iam_client.remove_user_from_group(GroupName = SAAS_USER_GROUP_NAME,
                                           UserName = SAAS_USER_NAME)
                                          
    credentials_config.remove_section(SAAS_USER_PROFILE)
    with open(str(Path.home()) + "/.aws/credentials", "w") as configfile:
        credentials_config.write(configfile, space_around_delimiters=False)
        
    user_access_keys = saas_iam_client.list_access_keys(UserName=SAAS_USER_NAME)
    for AccessKeyId in [element["AccessKeyId"] for element in user_access_keys["AccessKeyMetadata"]]:
        saas_iam_client.delete_access_key(UserName=SAAS_USER_NAME, AccessKeyId=AccessKeyId)
    	
    saas_iam_client.delete_user(UserName=SAAS_USER_NAME)

  2. Detach the policies from the group in the SaaS account, and delete the group and policies:
    attached_group_policies = saas_iam_client.list_attached_group_policies(GroupName=SAAS_USER_GROUP_NAME)
    for PolicyArn in [element["PolicyArn"] for element in attached_group_policies["AttachedPolicies"]]:
        saas_iam_client.detach_group_policy(GroupName=SAAS_USER_GROUP_NAME, PolicyArn=PolicyArn)
        
    saas_iam_client.delete_group(GroupName=SAAS_USER_GROUP_NAME)
    saas_iam_client.delete_policy(PolicyArn=saas_assume_role_policy_arn)
    saas_iam_client.delete_policy(PolicyArn=saas_s3_policy_arn)

  3. Detach the AWS policies from the role in the customer account, then delete the role and the policy:
    attached_role_policies = customer_iam_client.list_attached_role_policies(RoleName=CUSTOMER_TRUST_SAAS_ROLE_NAME)
    for PolicyArn in [element["PolicyArn"] for element in attached_role_policies["AttachedPolicies"]]:
        customer_iam_client.detach_role_policy(RoleName=CUSTOMER_TRUST_SAAS_ROLE_NAME, PolicyArn=PolicyArn)
    
    customer_iam_client.delete_role(RoleName=CUSTOMER_TRUST_SAAS_ROLE_NAME)
    customer_iam_client.delete_policy(PolicyArn=customer_s3_policy_arn)

Conclusion

This post described a possible implementation, using the SageMaker Python SDK, of an Autopilot training job, model deployment, and prediction in a cross-account configuration. The originating account owns the data for the training and it delegates the activities to the account hosting the SageMaker resources.

You can use the API calls shown in this post to incorporate AutoML capabilities into a SaaS application, by delegating the management and billing of SageMaker resources to the customer account.

SageMaker decouples the environment where the data scientist drives the analysis from the containers that perform each phase of the ML process.

This capability simplifies other cross-account scenarios. For example: a SaaS provider who owns sensitive data, instead of sharing its data with the customer, could expose certified training algorithms and generate models on behalf of the customer. The customer will receive the trained model at the end of the Autopilot job.

For more examples of how to integrate Autopilot into SaaS products, see the following posts:


About the Authors

Francesco Polimeni is a Sr Solutions Architect at AWS with focus on Machine Learning. He has over 20 years of experience in professional services and pre-sales organizations for IT management software solutions.

Mehmet Bakkaloglu is a Sr Solutions Architect at AWS. He has vast experience in data analytics and cloud architecture, having provided technical leadership for transformation programs and pre-sales activities in a variety of sectors.

Read More