How The Chefz serves the perfect meal with Amazon Personalize

This is a guest post by Ramzi Alqrainy, Chief Technology Officer, The Chefz.

The Chefz is a Saudi-based online food delivery startup, founded in 2016. At the core of The Chefz’s business model is enabling its customers to order food and sweets from elite restaurants, bakeries, and chocolate shops. In this post, we explain how The Chefz uses Amazon Personalize filters to apply business rules to recommendations for end users, increasing revenue by 35%.

Food delivery is a growing but extremely competitive industry. The biggest challenge in the industry is maintaining customer loyalty, which requires a comprehensive understanding of the customer’s preferences, excellent response times for on-time delivery, and good food quality. These three factors determine the most important metric for The Chefz: customer satisfaction. Demand fluctuates, with spikes in order volumes at lunch and dinner times and on special days such as Mother’s Day, the football final, the Ramadan pre-dawn (Suhoor) and sunset (Iftar) meals, or the Eid festive holidays. During these times, demand can increase by up to 300%, adding one more critical challenge: recommending the perfect meal based on the time of day, especially during Ramadan.

The perfect meal at the right time

To make the ordering process more deterministic and to cater to peak demand times, The Chefz team divided the day into periods. During Ramadan, the day is divided into Iftar and Suhoor; on regular days, it consists of four periods: breakfast, lunch, dinner, and dessert. The technology that underpins this deterministic ordering process is Amazon Personalize, a powerful recommendation engine that combines these periods with the customer’s location to provide relevant recommendations.

This ensures the customer receives restaurant and meal recommendations based on their preference and from a nearby location so that it arrives quickly at their doorstep.

This recommendation engine based on Amazon Personalize is the key ingredient in how The Chefz’s customers enjoy personalized restaurant meal recommendations, rather than random recommendations for categories of favorites.

The personalization journey

The Chefz started its personalization journey by using Amazon Personalize to offer restaurant recommendations based on previous interactions, user metadata (such as age, nationality, and diet), restaurant metadata (such as category and the food types offered), and live tracking of customer interactions on The Chefz mobile application and web portal. The initial deployment phases of Amazon Personalize led to a 10% increase in customer interactions with the portal.

Although that was a milestone, delivery time remained a problem for many customers, particularly during rush hour. To address this, the data science team added location as an additional feature to the user metadata so that recommendations take into account both user preference and location, improving delivery time.

The next step in the recommendation journey was to consider annual timing, especially Ramadan, and the time of day. These considerations ensured The Chefz could recommend heavier meals or restaurants that serve Iftar meals at Ramadan sunset, and lighter meals in the late evening. To solve this challenge, the data science team used Amazon Personalize filters, updated by AWS Lambda functions that are triggered on a schedule by an Amazon CloudWatch cron rule.

The following architecture shows the automated process for applying the filters:

  1. A CloudWatch event uses a cron expression to schedule when a Lambda function is invoked.
  2. When the Lambda function is invoked, it applies the appropriate filter to the recommendation requests to enforce the business rules (a minimal sketch of such a handler follows this list).
  3. Recommended meals and restaurants are delivered to end users on the application.
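
The post doesn’t include the Lambda code itself. As one illustration of how a time-of-day business rule can be enforced with Amazon Personalize filters, the following minimal Python sketch selects a filter based on the current period and passes it to the GetRecommendations API. The campaign ARN, filter ARNs, period boundaries, and the handler are all hypothetical placeholders, not The Chefz’s actual implementation.

import datetime
import boto3

personalize_runtime = boto3.client("personalize-runtime")

# Hypothetical ARNs; replace with your own campaign and time-of-day filters.
CAMPAIGN_ARN = "arn:aws:personalize:us-east-1:123456789012:campaign/restaurant-recommendations"
FILTER_ARNS = {
    "breakfast": "arn:aws:personalize:us-east-1:123456789012:filter/breakfast-items",
    "lunch": "arn:aws:personalize:us-east-1:123456789012:filter/lunch-items",
    "dinner": "arn:aws:personalize:us-east-1:123456789012:filter/dinner-items",
    "dessert": "arn:aws:personalize:us-east-1:123456789012:filter/dessert-items",
}

def current_period(now=None):
    # Map the hour of the day to one of the meal periods (illustrative boundaries).
    hour = (now or datetime.datetime.utcnow()).hour
    if hour < 11:
        return "breakfast"
    if hour < 16:
        return "lunch"
    if hour < 21:
        return "dinner"
    return "dessert"

def handler(event, context):
    # Request recommendations for a user, constrained by the active time-of-day filter.
    user_id = event["userId"]
    response = personalize_runtime.get_recommendations(
        campaignArn=CAMPAIGN_ARN,
        userId=user_id,
        filterArn=FILTER_ARNS[current_period()],
        numResults=10,
    )
    return [item["itemId"] for item in response["itemList"]]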

Conclusion

Amazon Personalize enabled The Chefz to apply context about individual customers and their circumstances, and to deliver customized recommendations based on business rules such as special deals and offers through its mobile application. This increased revenue by 35% per month and doubled customer orders at recommended restaurants.

“The customer is at the heart of everything we do at The Chefz, and we’re working tirelessly to improve and enhance their experience. With Amazon Personalize, we are able to achieve personalization at scale across our entire customer base, which was previously impossible.”

- Ramzi Alqrainy, CTO at The Chefz.


About the authors

Ramzi Alqrainy is Chief Technology Officer at The Chefz. Ramzi is a contributor to Apache Solr and Slack, a technical reviewer, and has published many IEEE papers focusing on search and data functions.

Mohamed Ezzat is a Senior Solutions Architect at AWS with a focus in machine learning. He works with customers to address their business challenges using cloud technologies. Outside of work, he enjoys playing table tennis.


Distributed training with Amazon EKS and Torch Distributed Elastic

Distributed deep learning model training is becoming increasingly important as data sizes are growing in many industries. Many applications in computer vision and natural language processing now require training of deep learning models, which are growing exponentially in complexity and are often trained with hundreds of terabytes of data. It then becomes important to use a vast cloud infrastructure to scale the training of such large models.

Developers can use open-source frameworks such as PyTorch to easily design intuitive model architectures. However, scaling the training of these models across multiple nodes can be challenging due to increased orchestration complexity.

Distributed model training mainly consists of two paradigms:

  • Model parallel – In model parallel training, the model itself is so large that it can’t fit in the memory of a single GPU, and multiple GPUs are needed to train the model. OpenAI’s GPT-3 model, with 175 billion trainable parameters (approximately 350 GB in size), is a good example of this.
  • Data parallel – In data parallel training, the model can reside in a single GPU, but because the data is so large, it can take days or weeks to train a model. Distributing the data across multiple GPU nodes can significantly reduce the training time (see the sketch after this list).
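
To make the data parallel paradigm concrete, the following is a minimal PyTorch DistributedDataParallel training-loop sketch of the kind launched by torchrun or Torch Distributed Elastic. The dataset path, model choice, and hyperparameters are illustrative placeholders, not the training script used in this post.

import os
import torch
import torch.distributed as dist
import torchvision
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

def main():
    # torchrun / Torch Distributed Elastic set RANK, WORLD_SIZE, and LOCAL_RANK for each worker.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = torchvision.models.resnet50().cuda(local_rank)
    model = DDP(model, device_ids=[local_rank])

    # Hypothetical mount point; the post mounts ImageNet from FSx for Lustre.
    dataset = torchvision.datasets.ImageFolder(
        "/fsx-shared/ILSVRC/Data/CLS-LOC/train",
        transform=torchvision.transforms.Compose([
            torchvision.transforms.RandomResizedCrop(224),
            torchvision.transforms.ToTensor(),
        ]),
    )
    # DistributedSampler gives each worker a disjoint shard of the data.
    sampler = DistributedSampler(dataset)
    loader = DataLoader(dataset, batch_size=64, sampler=sampler, num_workers=4)

    criterion = torch.nn.CrossEntropyLoss().cuda(local_rank)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

    for epoch in range(1):
        sampler.set_epoch(epoch)  # reshuffle shards each epoch
        for images, labels in loader:
            images, labels = images.cuda(local_rank), labels.cuda(local_rank)
            optimizer.zero_grad()
            loss = criterion(model(images), labels)
            loss.backward()   # gradients are all-reduced across workers here
            optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()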

In this post, we provide an example architecture to train PyTorch models using the Torch Distributed Elastic framework in a distributed data parallel fashion using Amazon Elastic Kubernetes Service (Amazon EKS).

Prerequisites

To replicate the results reported in this post, the only prerequisite is an AWS account. In this account, we create an EKS cluster and an Amazon FSx for Lustre file system. We also push container images to an Amazon Elastic Container Registry (Amazon ECR) repository in the account. Instructions to set up these components are provided as needed throughout the post.

EKS clusters

Amazon EKS is a managed container service to run and scale Kubernetes applications on AWS. With Amazon EKS, you can efficiently run distributed training jobs using the latest Amazon Elastic Compute Cloud (Amazon EC2) instances without needing to install, operate, and maintain your own control plane or nodes. It is a popular orchestrator for machine learning (ML) and AI workflows. A typical EKS cluster in AWS looks like the following figure.

We have released an open-source project, AWS DevOps for EKS (aws-do-eks), which provides a large collection of easy-to-use and configurable scripts and tools to provision EKS clusters and run distributed training jobs. This project is built following the principles of the Do Framework: Simplicity, Flexibility, and Universality. You can configure your desired cluster by using the eks.conf file and then launch it by running the eks-create.sh script. Detailed instructions are provided in the GitHub repo.

Train PyTorch models using Torch Distributed Elastic

Torch Distributed Elastic (TDE) is a native PyTorch library for training large-scale deep learning models where it’s critical to scale compute resources dynamically based on availability. The TorchElastic Controller for Kubernetes is a native Kubernetes implementation for TDE that automatically manages the lifecycle of the pods and services required for TDE training. It allows for dynamically scaling compute resources during training as needed. It also provides fault-tolerant training by recovering jobs from node failure.

In this post, we discuss the steps to train PyTorch EfficientNet-B7 and ResNet50 models using ImageNet data in a distributed fashion with TDE. We use the PyTorch DistributedDataParallel API and the Kubernetes TorchElastic controller, and run our training jobs on an EKS cluster containing multiple GPU nodes. The following diagram shows the architecture for this model training.

TorchElastic for Kubernetes consists mainly of two components: the TorchElastic Kubernetes Controller (TEC) and an etcd key-value store. The controller is responsible for monitoring and managing the training jobs, and etcd keeps track of the worker nodes for distributed synchronization and peer discovery.

In order for the training pods to access the data, we need a shared data volume that can be mounted by each pod. Some options for shared volumes through Container Storage Interface (CSI) drivers included in AWS DevOps for EKS are Amazon Elastic File System (Amazon EFS) and FSx for Lustre.

Cluster setup

In our cluster configuration, we use one c5.2xlarge instance for system pods. We use three p4d.24xlarge instances as worker pods to train an EfficientNet model. For ResNet50 training, we use p3.8xlarge instances as worker pods. Additionally, we use an FSx shared file system to store our training data and model artifacts.

AWS p4d.24xlarge instances are equipped with Elastic Fabric Adapter (EFA) to provide networking between nodes. We discuss EFA more later in the post. To enable communication through EFA, we need to configure the cluster setup through a .yaml file. An example file is provided in the GitHub repository.

After this .yaml file is properly configured, we can launch the cluster using the script provided in the GitHub repo:

./eks-create.sh

Refer to the GitHub repo for detailed instructions.

There is practically no difference between running jobs on p4d.24xlarge and p3.8xlarge instances; the steps described in this post work for both. The only difference is the availability of EFA on p4d.24xlarge instances. For smaller models like ResNet50, EFA networking has minimal impact on training speed compared to standard networking.

FSx for Lustre file system

FSx for Lustre is designed for high-performance computing workloads and provides sub-millisecond latency using solid-state drive storage volumes. We chose FSx for Lustre because it delivered better performance as we scaled to a large number of nodes. An important detail to note is that an FSx for Lustre file system exists in a single Availability Zone, so all nodes accessing the file system must be in the same Availability Zone as the file system. One way to achieve this is to specify the relevant Availability Zone in the cluster .yaml file for the specific node groups before creating the cluster. Alternatively, we can modify the network settings of the Auto Scaling group for these nodes after the cluster is set up and limit it to a single subnet. This can be easily done on the Amazon EC2 console.

Assuming that the EKS cluster is up and running, and the subnet ID for the Availability Zone is known, we can set up an FSx file system by providing the necessary information in the fsx.conf file as described in the readme and running the deploy.sh script in the fsx folder. This sets up the correct policy and security group for accessing the file system. The script also installs the CSI driver for FSx as a daemonset. Finally, we can create the FSx persistent volume claim in Kubernetes by applying a single .yaml file:

kubectl apply -f fsx-pvc-dynamic.yaml

This creates an FSx file system in the Availability Zone specified in the fsx.conf file, and also creates a persistent volume claim fsx-pvc, which can be mounted by any of the pods in the cluster in a read-write-many (RWX) fashion.

In our experiment, we used the complete ImageNet dataset, which contains more than 12 million training images divided into 1,000 classes. The data can be downloaded from the ImageNet website. The original tarball has several directories, but for our model training, we’re only interested in ILSVRC/Data/CLS-LOC/, which includes the train and val subdirectories. Before training, we need to rearrange the images in the val subdirectory to match the directory structure required by the PyTorch ImageFolder class. This can be done using a simple Python script after the data is copied to the persistent volume in the next step.
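
The repo’s imagenet_data_prep.py is the authoritative version of this rearrangement. A minimal sketch of the idea follows, assuming a hypothetical CSV file that maps each validation image name to its class label (the paths and file format here are placeholders).

import csv
import shutil
from pathlib import Path

VAL_DIR = Path("/fsx-shared/ILSVRC/Data/CLS-LOC/val")        # hypothetical mount point
MAPPING_CSV = Path("/fsx-shared/val_image_to_class.csv")     # hypothetical file: image_name,class_label

def rearrange_val_images():
    # ImageFolder expects val/<class_label>/<image>.JPEG, so move each image
    # into a subdirectory named after its class.
    with open(MAPPING_CSV) as f:
        for image_name, class_label in csv.reader(f):
            class_dir = VAL_DIR / class_label
            class_dir.mkdir(exist_ok=True)
            src = VAL_DIR / f"{image_name}.JPEG"
            if src.exists():
                shutil.move(str(src), str(class_dir / src.name))

if __name__ == "__main__":
    rearrange_val_images()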

To copy the data from an Amazon Simple Storage Service (Amazon S3) bucket to the FSx file system, we create a Docker image that includes scripts for this task. An example Dockerfile and a shell script are included in the csi folder within the GitHub repo. We can build the image using the build.sh script and then push it to Amazon ECR using the push.sh script. Before using these scripts, we need to provide the correct URI for the ECR repository in the .env file in the root folder of the GitHub repo. After we push the Docker image to Amazon ECR, we can launch a pod to copy the data by applying the relevant .yaml file:

kubectl apply -f fsx-data-prep-pod.yaml

The pod automatically runs the script data-prep.sh to copy the data from Amazon S3 to the shared volume. Because the ImageNet data has more than 12 million files, the copy process takes a couple of hours. The Python script imagenet_data_prep.py is also run to rearrange the val dataset as expected by PyTorch.

Network acceleration

We can use Elastic Fabric Adapter (EFA) in combination with supported EC2 instance types to accelerate network traffic between the GPU nodes in your cluster. This can be useful when running large distributed training jobs where standard network communication may be a bottleneck. Scripts to deploy and test the EFA device plugin in the EKS cluster that we use here are included in the efa-device-plugin folder in the GitHub repo. To enable a job with EFA in your EKS cluster, in addition to the cluster nodes having the necessary hardware and software, the EFA device plugin needs to be deployed to the cluster, and your job container needs to have compatible CUDA and NCCL versions installed.

To demonstrate running NCCL tests and evaluate the performance of EFA on p4d.24xlarge instances, we first deploy the Kubeflow MPI Operator by running the corresponding deploy.sh script in the mpi-operator folder. We then update the test-efa-nccl.yaml manifest so that the limits and requests for the vpc.amazonaws.com/efa resource are set to 4, which bundles the four available EFA adapters in the p4d.24xlarge nodes together to provide maximum throughput.

Run kubectl apply -f ./test-efa-nccl.yaml to apply the test and then display the logs of the test pod. The following line in the log output confirms that EFA is being used:

NCCL INFO NET/OFI Selected Provider is efa

The test results should look similar to the following output:

[1,0]<stdout>:#                                                       out-of-place                       in-place
[1,0]<stdout>:#       size         count      type   redop     time   algbw   busbw  error     time   algbw   busbw  error
[1,0]<stdout>:#        (B)    (elements)                       (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
[1,0]<stdout>:           8             2     float     sum    629.7    0.00    0.00  2e-07    631.4    0.00    0.00  1e-07
[1,0]<stdout>:          16             4     float     sum    630.5    0.00    0.00  1e-07    628.1    0.00    0.00  1e-07
[1,0]<stdout>:          32             8     float     sum    627.6    0.00    0.00  1e-07    628.2    0.00    0.00  1e-07
[1,0]<stdout>:          64            16     float     sum    633.6    0.00    0.00  1e-07    628.4    0.00    0.00  6e-08
[1,0]<stdout>:         128            32     float     sum    627.5    0.00    0.00  6e-08    632.0    0.00    0.00  6e-08
[1,0]<stdout>:         256            64     float     sum    634.5    0.00    0.00  6e-08    636.5    0.00    0.00  6e-08
[1,0]<stdout>:         512           128     float     sum    634.8    0.00    0.00  6e-08    635.2    0.00    0.00  6e-08
[1,0]<stdout>:        1024           256     float     sum    646.6    0.00    0.00  2e-07    643.6    0.00    0.00  2e-07
[1,0]<stdout>:        2048           512     float     sum    745.0    0.00    0.01  5e-07    746.0    0.00    0.01  5e-07
[1,0]<stdout>:        4096          1024     float     sum    958.2    0.00    0.01  5e-07    955.8    0.00    0.01  5e-07
[1,0]<stdout>:        8192          2048     float     sum    963.0    0.01    0.02  5e-07    954.5    0.01    0.02  5e-07
[1,0]<stdout>:       16384          4096     float     sum    955.0    0.02    0.03  5e-07    955.5    0.02    0.03  5e-07
[1,0]<stdout>:       32768          8192     float     sum    975.5    0.03    0.06  5e-07   1009.0    0.03    0.06  5e-07
[1,0]<stdout>:       65536         16384     float     sum   1353.4    0.05    0.09  5e-07   1343.5    0.05    0.09  5e-07
[1,0]<stdout>:      131072         32768     float     sum   1395.9    0.09    0.18  5e-07   1392.6    0.09    0.18  5e-07
[1,0]<stdout>:      262144         65536     float     sum   1476.7    0.18    0.33  5e-07   1536.3    0.17    0.32  5e-07
[1,0]<stdout>:      524288        131072     float     sum   1560.3    0.34    0.63  5e-07   1568.3    0.33    0.63  5e-07
[1,0]<stdout>:     1048576        262144     float     sum   1599.2    0.66    1.23  5e-07   1595.3    0.66    1.23  5e-07
[1,0]<stdout>:     2097152        524288     float     sum   1671.1    1.25    2.35  5e-07   1672.5    1.25    2.35  5e-07
[1,0]<stdout>:     4194304       1048576     float     sum   1785.1    2.35    4.41  5e-07   1780.3    2.36    4.42  5e-07
[1,0]<stdout>:     8388608       2097152     float     sum   2133.6    3.93    7.37  5e-07   2135.0    3.93    7.37  5e-07
[1,0]<stdout>:    16777216       4194304     float     sum   2650.9    6.33   11.87  5e-07   2649.9    6.33   11.87  5e-07
[1,0]<stdout>:    33554432       8388608     float     sum   3422.0    9.81   18.39  5e-07   3478.7    9.65   18.09  5e-07
[1,0]<stdout>:    67108864      16777216     float     sum   4783.2   14.03   26.31  5e-07   4782.6   14.03   26.31  5e-07
[1,0]<stdout>:   134217728      33554432     float     sum   7216.9   18.60   34.87  5e-07   7240.9   18.54   34.75  5e-07
[1,0]<stdout>:   268435456      67108864     float     sum    12738   21.07   39.51  5e-07    12802   20.97   39.31  5e-07
[1,0]<stdout>:   536870912     134217728     float     sum    24375   22.03   41.30  5e-07    24403   22.00   41.25  5e-07
[1,0]<stdout>:  1073741824     268435456     float     sum    47904   22.41   42.03  5e-07    47893   22.42   42.04  5e-07
[1,4]<stdout>:test-efa-nccl-worker-0:33:33 [4] NCCL INFO comm 0x7fd4a0000f60 rank 4 nranks 16 cudaDev 4 busId 901c0 - Destroy COMPLETE
[1,0]<stdout>:# Out of bounds values : 0 OK
[1,0]<stdout>:# Avg bus bandwidth    : 8.23785

We can observe in the test results that the maximum throughput is about 42 GB/sec and the average bus bandwidth is approximately 8 GB/sec.

We also conducted experiments with a single EFA adapter enabled as well as no EFA adapters. All results are summarized in the following table.

Number of EFA Adapters   Net/OFI Selected Provider   Avg. Bandwidth (GB/s)   Max. Bandwidth (GB/s)
4                        efa                         8.24                    42.04
1                        efa                         3.02                    5.89
0                        socket                      0.97                    2.38

We also found that for relatively small models like ResNet50, accelerated networking reduces the training time per epoch by only 5–8% at a batch size of 64. For larger models and smaller batch sizes, when increased network communication of weights is needed, accelerated networking has a greater impact: we observed a 15–18% decrease in epoch training time when training EfficientNet-B7 with a batch size of 1. The actual impact of EFA on your training depends on the size of your model.

GPU monitoring

Before running the training job, we can also set up Amazon CloudWatch metrics to visualize GPU utilization during training. This helps determine whether resources are being used optimally and identify potential resource starvation and bottlenecks in the training process.

The relevant scripts to set up CloudWatch are located in the gpu-metrics folder. First, we create a Docker image with amazon-cloudwatch-agent and nvidia-smi. We can use the Dockerfile in the gpu-metrics folder to create this image. Assuming that the ECR registry is already set in the .env file from the previous step, we can build and push the image using build.sh and push.sh. After this, running the deploy.sh script automatically completes the setup. It launches a daemonset with amazon-cloudwatch-agent and pushes various metrics to CloudWatch. The GPU metrics appear under the CWAgent namespace on the CloudWatch console. The rest of the cluster metrics show under the ContainerInsights namespace.
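
Once the daemonset is running, one quick way to confirm that GPU metrics are flowing is to list what the agent publishes under the CWAgent namespace. The following sketch uses the standard CloudWatch list_metrics API; the exact metric names depend on the agent configuration deployed by the script, so treat the output as exploratory.

import boto3

cloudwatch = boto3.client("cloudwatch")

# List whatever the CloudWatch agent daemonset publishes under the CWAgent namespace.
paginator = cloudwatch.get_paginator("list_metrics")
for page in paginator.paginate(Namespace="CWAgent"):
    for metric in page["Metrics"]:
        dims = [d["Value"] for d in metric.get("Dimensions", [])]
        print(metric["MetricName"], dims)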

Model training

All the scripts needed for PyTorch training are located in the elasticjob folder in the GitHub repo. Before launching the training job, we need to run the etcd server, which the TEC uses for worker discovery and coordination. The deploy.sh script in the elasticjob folder does exactly that.

To take advantage of EFA on p4d.24xlarge instances, we need to use a specific Docker image available in the Amazon ECR Public Gallery that supports NCCL communication through EFA. We just need to copy our training code to this Docker image. The Dockerfile under the samples folder creates an image to be used when running the training job on p4d instances. As always, we can use the build.sh and push.sh scripts in the folder to build and push the image.

The imagenet-efa.yaml file describes the training job. This .yaml file sets up the resources needed for running the training job and also mounts the persistent volume with the training data set up in the previous section.

A couple of things are worth pointing out here. The number of replicas should be set to the number of nodes available in the cluster. In our case, we set this to 3 because we had three p4d.24xlarge nodes. In the imagenet-efa.yaml file, the nvidia.com/gpu parameter under resources and nproc_per_node under args should be set to the number of GPUs per node, which in the case of p4d.24xlarge is 8. Also, the worker argument for the Python script sets the number of CPUs per process. We chose this to be 4 because, in our experiments, this provides optimal performance when running on p4d.24xlarge instances. These settings are necessary in order to maximize the use of all the hardware resources available in the cluster.

When the job is running, we can observe the GPU usage in CloudWatch for all the GPUs in the cluster. The following is an example from one of our training jobs with three p4d.24xlarge nodes in the cluster. Here we’ve selected one GPU from each node. With the settings mentioned earlier, the GPU usage is close to 100% during the training phase of the epoch for all of the nodes in the cluster.

For training a ResNet50 model using p3.8xlarge instances, we need exactly the same steps as described for the EfficientNet training using p4d.24xlarge. We can also use the same Docker image. As mentioned earlier, p3.8xlarge instances aren’t equipped with EFA. However, for the ResNet50 model, this is not a significant drawback. The imagenet-fsx.yaml script provided in the GitHub repository sets up the training job with appropriate resources for the p3.8xlarge node type. The job uses the same dataset from the FSx file system.

GPU scaling

We ran some experiments to observe how the training time scales for the EfficientNet-B7 model as the number of GPUs increases. To do this, we changed the number of replicas from 1 to 3 in our training .yaml file for each training run. We only measured the time for a single epoch while using the complete ImageNet dataset. The following figure shows the results of our GPU scaling experiment. The red dotted line represents the ideal reduction in training time, extrapolated from the 8-GPU run, as the number of GPUs increases. As we can see, the scaling is quite close to the ideal.

Similarly, we obtained the GPU scaling plot for ResNet50 training on p3.8xlarge instances. For this case, we changed the replicas in our .yaml file from 1 to 4. The results of this experiment are shown in the following figure.

Clean up

It’s important to spin down resources after model training in order to avoid costs associated with running idle instances. With each script that creates resources, the GitHub repo provides a matching script to delete them. To clean up our setup, we must delete the FSx file system before deleting the cluster because it’s associated with a subnet in the cluster’s VPC. To delete the FSx file system, we just need to run the following command (from inside the fsx folder):

kubectl delete -f fsx-pvc-dynamic.yaml
./delete.sh

Note that this will not only delete the persistent volume, it will also delete the FSx file system, and all the data on the file system will be lost. When this step is complete, we can delete the cluster by using the following script in the eks folder:

./eks-delete.sh

This will delete all the existing pods, remove the cluster, and delete the VPC created in the beginning.

Conclusion

In this post, we detailed the steps needed for running PyTorch distributed data parallel model training on EKS clusters. This task may seem daunting, but the AWS DevOps for EKS project created by the ML Frameworks team at AWS provides all the necessary scripts and tools to simplify the process and make distributed model training easily accessible.

For more information on the technologies used in this post, visit Amazon EKS and Torch Distributed Elastic. We encourage you to apply the approach described here to your own distributed training use cases.


About the authors

Imran Younus is a Principal Solutions Architect for the ML Frameworks team at AWS. He focuses on large scale machine learning and deep learning workloads across AWS services like Amazon EKS and AWS ParallelCluster. He has extensive experience in applications of deep learning in computer vision and industrial IoT. Imran obtained his PhD in high energy particle physics, where he analyzed experimental data at petabyte scales.

Alex Iankoulski is a full-stack software and infrastructure architect who likes to do deep, hands-on work. He is currently a Principal Solutions Architect for Self-managed Machine Learning at AWS. In his role he focuses on helping customers with containerization and orchestration of ML and AI workloads on container-powered AWS services. He is also the author of the open source Do framework and a Docker captain who loves applying container technologies to accelerate the pace of innovation while solving the world’s biggest challenges. During the past 10 years, Alex has worked on combating climate change, democratizing AI and ML, making travel safer, healthcare better, and energy smarter.


Learn How Amazon SageMaker Clarify Helps Detect Bias

Bias detection in data and model outcomes is a fundamental requirement for building responsible artificial intelligence (AI) and machine learning (ML) models. Unfortunately, detecting bias isn’t an easy task for the vast majority of practitioners due to the large number of ways in which it can be measured and different factors that can contribute to a biased outcome. For instance, an imbalanced sampling of the training data may result in a model that is less accurate for certain subsets of the data. Bias may also be introduced by the ML algorithm itself—even with a well-balanced training dataset, the outcomes might favor certain subsets of the data as compared to the others.

To detect bias, you must have a thorough understanding of different types of bias and the corresponding bias metrics. For example, at the time of this writing, Amazon SageMaker Clarify offers 21 different metrics to choose from.

In this post, we use an income prediction use case (predicting user incomes from input features like education and number of hours worked per week) to demonstrate different types of biases and the corresponding metrics in SageMaker Clarify. We also develop a framework to help you decide which metrics matter for your application.

Introduction to SageMaker Clarify

ML models are being increasingly used to help make decisions across a variety of domains, such as financial services, healthcare, education, and human resources. In many situations, it’s important to understand why the ML model made a specific prediction and also whether the predictions were impacted by bias.

SageMaker Clarify provides tools for both of these needs, but in this post we only focus on the bias detection functionality. To learn more about explainability, check out Explaining Bundesliga Match Facts xGoals using Amazon SageMaker Clarify.

SageMaker Clarify is a part of Amazon SageMaker, which is a fully managed service to build, train, and deploy ML models.

Examples of questions about bias

To ground the discussion, the following are some sample questions that ML builders and their stakeholders may have regarding bias. The list consists of some general questions that may be relevant for several ML applications, as well as questions about specific applications like document retrieval.

Given the groups of interest in the training data (for example, men vs. women), you might ask which metrics to use to answer the following questions:

  • Does the group representation in the training data reflect the real world?
  • Do the target labels in the training data favor one group over the other by assigning it more positive labels?
  • Does the model have different accuracy for different groups?
  • In a model whose purpose is to identify qualified candidates for hiring, does the model have the same precision for different groups?
  • In a model whose purpose is to retrieve documents relevant to an input query, does the model retrieve relevant documents from different groups in the same proportion?

In the rest of this post, we develop a framework for how to consider answering these questions and others through the metrics available in SageMaker Clarify.

Use case and context

This post uses an existing example of a SageMaker Clarify job from the Fairness and Explainability with SageMaker Clarify notebook and explains the generated bias metric values. The notebook trains an XGBoost model on the UCI Adult dataset (Dua, D. and Graff, C. (2019). UCI Machine Learning Repository. Irvine, CA: University of California, School of Information and Computer Science).
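
For orientation, a Clarify bias analysis is typically configured and launched from the SageMaker Python SDK along the following lines. This is a sketch with placeholder S3 paths, role ARN, and column names; the linked notebook contains the authoritative configuration for this dataset.

from sagemaker import Session, clarify

session = Session()
role = "arn:aws:iam::123456789012:role/SageMakerExecutionRole"  # placeholder execution role

clarify_processor = clarify.SageMakerClarifyProcessor(
    role=role,
    instance_count=1,
    instance_type="ml.m5.xlarge",
    sagemaker_session=session,
)

# Placeholder column list and S3 paths; the notebook defines the real ones.
headers = ["income", "sex", "age", "education", "hours_per_week"]

bias_data_config = clarify.DataConfig(
    s3_data_input_path="s3://my-bucket/adult/train.csv",
    s3_output_path="s3://my-bucket/clarify-bias-report",
    label="income",              # 1 means income >= $50,000 (the favorable label)
    headers=headers,
    dataset_type="text/csv",
)

bias_config = clarify.BiasConfig(
    label_values_or_threshold=[1],   # favorable label value
    facet_name="sex",                # the facet with respect to which bias is measured
    facet_values_or_threshold=[0],   # facet value of interest
)

# Pretraining metrics need only the data; posttraining metrics also require the trained model.
clarify_processor.run_pre_training_bias(
    data_config=bias_data_config,
    data_bias_config=bias_config,
    methods="all",
)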

The ML task in this dataset is to predict whether a person has a yearly income of more or less than $50,000. The following table shows some instances along with their features. Measuring bias in income prediction is important because we could use these predictions to inform decisions like discount offers and targeted marketing.

Bias terminology

Before diving deeper, let’s review some essential terminology. For a complete list of terms, see Amazon SageMaker Clarify Terms for Bias and Fairness.

  • Label – The target feature that the ML model is trained to predict. An observed label refers to the label value observed in the data used to train or test the model. A predicted label is the value predicted by the ML model. Labels could be binary, and are often encoded as 0 and 1. We assume 1 to represent a favorable or positive label (for example, income more than or equal to $50,000), and 0 to represent an unfavorable or negative label. Labels could also consist of more than two values. Even in these cases, one or more of the values constitute favorable labels. For the sake of simplicity, this post only considers binary labels. For details on handling labels with more than two values and labels with continuous values (for example, in regression), see Amazon AI Fairness and Explainability Whitepaper.
  • Facet – A column or feature with respect to which bias is measured. In our example, the facet is sex and takes two values: woman and man, encoded as female and male in the data (this data is extracted from the 1994 Census and enforces a binary option). Although the post considers a single facet with only two values, for more complex cases involving multiple facets or facets with more than two values, see Amazon AI Fairness and Explainability Whitepaper.
  • Bias – A significant imbalance in the input data or model predictions across different facet values. What constitutes “significant” depends on your application. For most metrics, a value of 0 implies no imbalance. Bias metrics in SageMaker Clarify are divided into two categories:

    • Pretraining – When present, pretraining bias indicates imbalances in the data only.
    • Posttraining – Posttraining bias additionally considers the predictions of the models.

Let’s examine each category separately.

Pretraining bias

Pretraining bias metrics in SageMaker Clarify answer the following question: Do all facet values have equal (or similar) representation in the data? It’s important to inspect the data for pretraining bias because it may translate into posttraining bias in the model predictions. For instance, a model trained on imbalanced data where one facet value appears very rarely can exhibit substantially worse accuracy for that facet value. Equal representation can be calculated over the following:

  • The whole training data irrespective of the labels
  • The subset of the training data with positive labels only
  • Each label separately

The following figure provides a summary of how each metric fits into each of the three categories.

Some categories consist of more than one metric. The basic metrics (grey boxes) answer the question about bias in that category in the simplest form. Metrics in white boxes additionally cover special cases (for example, Simpson’s paradox) and user preferences (for example, focusing on certain parts of the population when computing predictive performance).

Facet value representation irrespective of labels

The only metric in this category is Class Imbalance (CI). The goal of this metric is to measure if all the facet values have equal representation in the data.

CI is the difference in the fraction of the data constituted by the two facet values. In our example dataset, for the facet sex, the breakdown (shown in the pie chart) shows that women constitute 32.4% of the training data, whereas men constitute 67.6%. As a result:

CI = 0.676 - 0.324 = 0.352

A severely high class imbalance could lead to worse predictive performance for the facet value with smaller representation.

Facet value representation at the level of positive labels only

Another way to measure equal representation is to check whether all facet values contain a similar fraction of samples with positive observed labels. Positive labels consist of favorable outcomes (for example, loan granted, selected for the job), so analyzing positive labels separately helps assess if the favorable decisions are distributed evenly.

In our example dataset, the observed labels break down into positive and negative values, as shown in the following figure.

11.4% of all women and 31.4% of all men have the positive label (dark shaded region in the left and right bars). The Difference in Positive Proportions in Labels (DPL) measures this difference.

DPL = 0.314 - 0.114 = 0.20

The advanced metric in this category, Conditional Demographic Disparity in Labels (CDDL), measures the differences in the positive labels, but stratifies them with respect to another variable. This metric helps control for the Simpson’s paradox, a case where a computation over the whole data shows bias, but the bias disappears when grouping the data with respect to some side-information.

The 1973 UC Berkeley Admissions Study provides an example. According to the data, men were admitted at a higher rate than women. However, when examined at the level of individual university departments, women were admitted at a similar or higher rate in each department. This observation can be explained by Simpson’s paradox, which arose here because women applied to departments that were more competitive. As a result, fewer women were admitted overall compared to men, even though department by department they were admitted at a similar or higher rate.

For more detail on how CDDL is computed, see Amazon AI Fairness and Explainability Whitepaper.

Facet value representation at the level of each label separately

Equality in representation can also be measured for each individual label, not just the positive label.

Metrics in this category compute the difference in the label distribution of different facet values. The label distribution for a facet value contains all the observed label values, along with the fraction of samples with that label’s value. For instance, in the figure showing labels distributions, 88.6% of women have a negative observed label and 11.4% have a positive observed label. So the label distribution for women is [0.886, 0.114] and for men is [0.686, 0.314].

The basic metric in this category, Kullback-Leibler divergence (KL), measures this difference as:

KL = [0.686 x log(0.686/0.886)] + [0.314 x log(0.314/0.114)] = 0.143
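
As a check on the arithmetic, the pretraining metrics discussed so far can be reproduced directly from the proportions quoted in this section. The following short Python sketch uses those rounded figures (natural log for KL, with the men’s label distribution as the first argument, as in the formula above).

import math

# Facet proportions and observed positive-label rates quoted in this post (rounded).
frac_men, frac_women = 0.676, 0.324
pos_rate_men, pos_rate_women = 0.314, 0.114

# Class Imbalance (CI): difference in facet representation.
ci = frac_men - frac_women                          # 0.352

# Difference in Positive Proportions in Labels (DPL).
dpl = pos_rate_men - pos_rate_women                 # 0.200

# Kullback-Leibler divergence (KL) between the two label distributions.
men_dist = [1 - pos_rate_men, pos_rate_men]         # [0.686, 0.314]
women_dist = [1 - pos_rate_women, pos_rate_women]   # [0.886, 0.114]
kl = sum(p * math.log(p / q) for p, q in zip(men_dist, women_dist))  # ~0.143

print(f"CI = {ci:.3f}, DPL = {dpl:.3f}, KL = {kl:.3f}")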

The advanced metrics in this category, Jensen-Shannon divergence (JS), Lp-norm (LP), Total Variation Distance (TVD), and Kolmogorov-Smirnov (KS), also measure the difference between the distributions but have different mathematical properties. Barring special cases, they deliver insights similar to KL. For example, although the KL value can be infinity when a facet value contains no samples with a certain label (for example, no men with a negative label), JS avoids these infinite values. For more detail on these differences, see Amazon AI Fairness and Explainability Whitepaper.

Relationship between DPL (Category 2) and distribution-based metrics of KL/JS/LP/TVD/KS (Category 3)

Distribution-based metrics are more naturally applicable to non-binary labels. For binary labels, because the imbalance in the positive label determines the imbalance in the negative label, the distribution metrics deliver the same insights as DPL, so you can just use DPL in such cases.

Posttraining bias

Posttraining bias metrics in SageMaker Clarify help us answer two key questions:

  • Are all facet values represented at a similar rate in positive (favorable) model predictions?
  • Does the model have similar predictive performance for all facet values?

The following figure shows how the metrics map to each of these questions. The second question can be further broken down depending on which label the performance is measured with respect to.

Equal representation in positive model predictions

Metrics in this category check whether all facet values contain a similar fraction of samples with a positive predicted label from the model. This class of metrics is very similar to the pretraining metrics DPL and CDDL; the only difference is that this category considers predicted labels instead of observed labels.

In our example dataset, 4.5% of all women are assigned the positive label by the model, and 13.7% of all men are assigned the positive label.

The basic metric in this category, Difference in Positive Proportions in Predicted Labels (DPPL), measures the difference in the positive class assignments.

DPPL = 0.137 - 0.045 = 0.092

Notice how in the training data, a higher fraction of men had a positive observed label. In a similar manner, a higher fraction of men are assigned a positive predicted label.

Moving on to the advanced metrics in this category, Disparate Impact (DI) measures the same disparity in positive class assignments, but instead of the difference, it computes the ratio:

DI = 0.045 / 0.137 = 0.328

Both DI and DPPL convey qualitatively similar insights but differ at some corner cases. For instance, ratios tend to explode to very large numbers if the denominator is small. Take an example of the numbers 0.1 and 0.0001. The ratio is 0.1/0.0001 = 10,000 whereas the difference is 0.1 – 0.0001 ≈ 0.1. Unlike the other metrics where a value of 0 implies no bias, for DI, no bias corresponds to a value of 1.
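
For a quick check of the arithmetic, both metrics follow directly from the positive prediction rates quoted above; the rounded figures reproduce the values in this section.

# Positive predicted-label rates quoted in this post.
pred_pos_rate_men, pred_pos_rate_women = 0.137, 0.045

dppl = pred_pos_rate_men - pred_pos_rate_women   # 0.092
di = pred_pos_rate_women / pred_pos_rate_men     # ~0.328 (a value of 1 would mean no bias)

print(f"DPPL = {dppl:.3f}, DI = {di:.3f}")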

Conditional Demographic Disparity in Predicted Labels (CDDPL) measures the disparity in facet value representation in the positive label, but just like the pretraining metric of CDDL, it also controls for the Simpson’s paradox.

Counterfactual Fliptest (FT) measures if similar samples from the two facet values receive similar decisions from the model. A model assigning different decisions to two samples that are similar to each other but differ in the facet values could be considered biased against the facet value being assigned the unfavorable (negative) label. Given the first facet value (women), it assesses whether similar members with the other facet value (men) have a different model prediction. Similar members are chosen based on the k-nearest neighbor algorithm.

Equal performance

The model predictions might have similar representation in positive labels from different facet values, yet the model performance on these groups might significantly differ. In many applications, having a similar predictive performance across different facet values can be desirable. The metrics in this category measure the difference in predictive performance across facet values.

Because the data can be sliced in many different ways based on the observed or predicted labels, there are many different ways to measure predictive performance.

Equal predictive performance irrespective of labels

You could consider the model performance on the whole data, irrespective of the observed or the predicted labels – that is, the overall accuracy.

The following figure shows how the model classifies inputs from the two facet values in our example dataset. True negatives (TN) are cases where both the observed and predicted label were 0. False positives (FP) are misclassifications where the observed label was 0 but the predicted label was 1. True positives (TP) and false negatives (FN) are defined similarly.

For each facet value, the overall model performance, that is, the accuracy for that facet value, is:

Accuracy = (TN + TP) / (TN + FP + FN + TP)

With this formula, the accuracy for women is 0.930 and for men is 0.815. This leads to the only metric in this category, Accuracy Difference (AD):

AD = 0.815 - 0.930 = -0.115

AD = 0 means that the accuracy for both groups is the same. Larger (positive or negative) values indicate larger differences in accuracy.

Equal performance on positive labels only

You could restrict the model performance analysis to positive labels only. For instance, if the application is about detecting defects on an assembly line, it may be desirable to check that non-defective parts (positive label) of different kinds (facet values) are classified as non-defective at the same rate. This quantity is referred to as recall, or true positive rate:

Recall = TP / (TP + FN)

In our example dataset, the recall for women is 0.389, and the recall for men is 0.425. This leads to the basic metric in this category, the Recall Difference (RD):

RD = 0.425 - 0.389 = 0.036

Now let’s consider the three advanced metrics in this category, see which user preferences they encode, and how they differ from the basic metric of RD.

First, instead of measuring the performance on the positive observed labels, you could measure it on the positive predicted labels. Given a facet value, such as women, and all the samples with that facet value that are predicted to be positive by the model, how many are actually correctly classified as positive? This quantity is referred to as acceptance rate (AR), or precision:

AR = TP / (TP + FP)

In our example, the AR for women is 0.977, and the AR for men is 0.970. This leads to the Difference in Acceptance Rate (DAR):

DAR = 0.970 - 0.977 = -0.007

Another way to measure bias is to combine the previous two metrics and compare the positive predictions the model assigns to a facet value with the observed positive labels for that facet value. SageMaker Clarify measures this as the ratio between the number of observed positive labels for that facet value and the number of predicted positive labels, and refers to it as conditional acceptance (CA):

CA = (TP + FN) / (TP + FP)

In our example, the CA for women is 2.510 and for men is 2.283. The difference in CA leads to the final metric in this category, Difference in Conditional Acceptance (DCA):

DCA = 2.283 - 2.510 = -0.227

Equal performance on negative labels only

In a manner similar to positive labels, bias can also be computed as the performance difference on the negative labels. Considering negative labels separately can be important in certain applications. For instance, in our defect detection example, we might want to detect defective parts (negative label) of different kinds (facet value) at the same rate.

The basic metric in this category, specificity, is analogous to the recall (true positive rate) metric. Specificity computes the accuracy of the model on samples with this facet value that have an observed negative label:

Specificity = TN / (TN + FP)

In our example (see the confusion tables), the specificity for women and men is 0.999 and 0.994, respectively. Consequently, the Specificity Difference (SD) is:

SD = 0.994 - 0.999 = -0.005

Moving on, just like the acceptance rate metric, the analogous quantity for negative labels—the rejection rate (RR)—is:

RR = TN / (TN + FN)

The RR for women is 0.927 and for men is 0.791, leading to the Difference in Rejection Rate (DRR) metric:

DRR = 0.927 - 0.791 = -0.136

Finally, the negative label analogue of conditional acceptance, the conditional rejection (CR), is the ratio between the number of observed negative labels for that facet value, and the number of predicted negative labels:

CR = (TN + FP) / (TN + FN)

The CR for women is 0.928 and for men is 0.796. The final metric in this category is Difference in Conditional Rejection (DCR):

DCR = 0.796 - 0.928 = 0.132

Equal performance on positive vs. negative labels

SageMaker Clarify combines the previous two categories by considering the model’s error ratio on the positive and negative labels. Specifically, for each facet value, SageMaker Clarify computes the ratio between false negatives (FN) and false positives (FP). In our example, the FN/FP ratio for women is 679/10 = 67.9 and for men is 3678/84 = 43.786. This leads to the Treatment Equality (TE) metric, which measures the difference between the FN/FP ratios:

TE = 67.9 - 43.786 = 24.114
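
All of the performance-based metrics in the preceding sections derive from the same four per-facet confusion-matrix counts, so they are easy to compute side by side. The following sketch collects the formulas from this post in one place; the counts shown are placeholders for illustration, not the actual values from this dataset.

def performance_metrics(tn, fp, fn, tp):
    # Per-facet quantities as defined in this post.
    return {
        "accuracy":               (tn + tp) / (tn + fp + fn + tp),
        "recall":                 tp / (tp + fn),
        "acceptance_rate":        tp / (tp + fp),         # precision
        "conditional_acceptance": (tp + fn) / (tp + fp),  # observed / predicted positives
        "specificity":            tn / (tn + fp),
        "rejection_rate":         tn / (tn + fn),
        "conditional_rejection":  (tn + fp) / (tn + fn),  # observed / predicted negatives
        "fn_fp_ratio":            fn / fp,                # used for Treatment Equality
    }

# Placeholder confusion-matrix counts for two facet values (not this dataset's values).
facet_a = performance_metrics(tn=900, fp=10, fn=60, tp=30)
facet_b = performance_metrics(tn=1500, fp=80, fn=300, tp=220)

# Each difference metric (AD, RD, DAR, DCA, SD, DRR, DCR, TE) is the per-facet
# quantity for one facet minus the same quantity for the other facet.
differences = {name: facet_b[name] - facet_a[name] for name in facet_a}
print(differences)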

The following screenshot shows how you can use SageMaker Clarify with Amazon SageMaker Studio to show the values as well as ranges and short descriptions of different bias metrics.

Questions about bias: Which metrics to start with?

Recall the sample questions about bias at the start of this post. Having gone through the metrics from different categories, consider the questions again. To answer the first question, which concerns the representations of different groups in the training data, you could start with the Class Imbalance (CI) metric. Similarly, for the remaining questions, you can start by looking into Difference in Positive Proportions in Labels (DPL), Accuracy Difference (AD), Difference in Acceptance Rate (DAR), and Recall Difference (RD), respectively.

Bias without facet values

For the ease of exposition, this description of posttraining metrics excluded the Generalized Entropy Index (GE) metric. This metric measures bias without considering the facet value, and can be helpful in assessing how the model errors are distributed. For details, refer to Generalized entropy (GE).

Conclusion

In this post, you saw how the 21 different metrics in SageMaker Clarify measure bias at different stages of the ML pipeline. You learned about various metrics via an income prediction use case, how to choose metrics for your use case, and which ones you could start with.

Get started with your responsible AI journey by assessing bias in your ML models by using the demo notebook Fairness and Explainability with SageMaker Clarify. You can find the detailed documentation for SageMaker Clarify, including the formal definition of metrics, at What Is Fairness and Model Explainability for Machine Learning Predictions. For the open-source implementation of the bias metrics, refer to the aws-sagemaker-clarify GitHub repository. For a detailed discussion including limitations, refer to Amazon AI Fairness and Explainability Whitepaper.


About the authors

Bilal Zafar is an Applied Scientist at AWS, working on Fairness, Explainability and Security in Machine Learning.


Denis V. Batalov is a Solutions Architect for AWS, specializing in Machine Learning. He’s been with Amazon since 2005. Denis has a PhD in the field of AI. Follow him on Twitter: @dbatalov.

Michele Donini is a Sr Applied Scientist at AWS. He leads a team of scientists working on Responsible AI and his research interests are Algorithmic Fairness and Explainable Machine Learning.


Create a batch recommendation pipeline using Amazon Personalize with no code

With personalized content more likely to drive customer engagement, businesses continuously seek to provide tailored content based on their customer’s profile and behavior. Recommendation systems in particular seek to predict the preference an end-user would give to an item. Some common use cases include product recommendations on online retail stores, personalizing newsletters, generating music playlist recommendations, or even discovering similar content on online media services.

However, it can be challenging to create an effective recommendation system due to complexities in model training, algorithm selection, and platform management. Amazon Personalize enables developers to improve customer engagement through personalized product and content recommendations with no machine learning (ML) expertise required. Developers can start to engage customers right away by using captured user behavior data. Behind the scenes, Amazon Personalize examines this data, identifies what is meaningful, selects the right algorithms, trains and optimizes a personalization model that is customized for your data, and provides recommendations via an API endpoint.

Although providing recommendations in real time can help boost engagement and satisfaction, sometimes this might not actually be required, and performing this in batch on a scheduled basis can simply be a more cost-effective and manageable option.

This post shows you how to use AWS services to not only create recommendations but also operationalize a batch recommendation pipeline. We walk through the end-to-end solution without writing a single line of code, covering how to prepare the data with AWS Glue and how to generate and automate batch recommendations with Amazon Personalize and AWS Step Functions.

Solution overview

In this solution, we use the MovieLens dataset. This dataset includes 86,000 ratings of movies from 2,113 users. We attempt to use this data to generate recommendations for each of these users.

Data preparation is very important to ensure we get customer behavior data into a format that is ready for Amazon Personalize. The architecture described in this post uses AWS Glue, a serverless data integration service, to transform the raw data into a format that is ready for Amazon Personalize to consume. The solution then uses Amazon Personalize to create batch recommendations for all users by running a batch inference job. Finally, we use an AWS Step Functions workflow so that the Amazon Personalize APIs can be called in an automated, repeatable manner.

The following diagram demonstrates this solution.
Architecture Diagram

We will build this solution with the following steps:

  1. Build a data transformation job to transform our raw data using AWS Glue.
  2. Build an Amazon Personalize solution with the transformed dataset.
  3. Build a Step Functions workflow to orchestrate the generation of batch inferences.

Prerequisites

You need the following for this walkthrough:

Build a data transformation job to transform raw data with AWS Glue

With Amazon Personalize, input data needs to have a specific schema and file format. Data from interactions between users and items must be in CSV format with specific columns, whereas the list of users for which you want to generate recommendations must be in JSON format. In this section, we use AWS Glue Studio to transform the raw input data into the required structures and format for Amazon Personalize.

AWS Glue Studio provides a graphical interface that is designed for easy creation and running of extract, transform, and load (ETL) jobs. You can visually create data transformation workloads through simple drag-and-drop operations.

We first prepare our source data in Amazon Simple Storage Service (Amazon S3), then we transform the data without code.

  1. On the Amazon S3 console, create an S3 bucket with three folders: raw, transformed, and curated.
  2. Download the MovieLens dataset and upload the uncompressed file named user_ratedmovies-timestamp.dat to your bucket under the raw folder.
  3. On the AWS Glue Studio console, choose Jobs in the navigation pane.
  4. Select Visual with a source and target, then choose Create.
    AWS Glue Studio - Create Job
  5. Choose the first node called Data source – S3 bucket. This is where we specify our input data.
  6. On the Data source properties tab, select S3 location and browse to your uploaded file.
  7. For Data format, choose CSV, and for Delimiter, choose Tab.
    AWS Glue Studio - S3
  8. We can choose the Output schema tab to verify that the schema has inferred the columns correctly.
  9. If the schema doesn’t match your expectations, choose Edit to edit the schema.
    AWS Glue Studio - Fields

Next, we transform this data to follow the schema requirements for Amazon Personalize.

  1. Choose the Transform – Apply Mapping node and, on the Transform tab, update the target key and data types.
    Amazon Personalize, at minimum, expects the following structure for the interactions dataset:
    • user_id (string)
    • item_id (string)
    • timestamp (long, in Unix epoch time format)
      AWS Glue Studio - Field mapping

In this example, we exclude the poorly rated movies in the dataset.

  1. To do so, remove the last node called S3 bucket and add a filter node on the Transform tab.
  2. Choose Add condition and filter out data where rating < 3.5.
    AWS Glue Studio - Output

We now write the output back to Amazon S3.

  1. Expand the Target menu and choose Amazon S3.
  2. For S3 Target Location, choose the folder named transformed.
  3. Choose CSV as the format and suffix the Target Location with interactions/.

Next, we output a list of users that we want to get recommendations for.

  1. Choose the ApplyMapping node again, and then expand the Transform menu and choose ApplyMapping.
  2. Drop all fields except for user_id and rename that field to userId. Amazon Personalize expects that field to be named userId.
  3. Expand the Target menu again and choose Amazon S3.
  4. This time, choose JSON as the format, and then choose the transformed S3 folder and suffix it with batch_users_input/.

This produces a JSON list of users as input for Amazon Personalize. We should now have a diagram that looks like the following.

AWS Glue Studio - Entire Workflow

We are now ready to run our transform job.

  1. On the IAM console, create a role called glue-service-role and attach the following managed policies:
    • AWSGlueServiceRole
    • AmazonS3FullAccess

For more information on how to create IAM service roles, refer to Creating a role to delegate permissions to an AWS service.

  1. Navigate back to your AWS Glue Studio job, and choose the Job details tab.
  2. Set the job name as batch-personalize-input-transform-job.
  3. Choose the newly created IAM role.
  4. Keep the default values for everything else.
    AWS Glue Studio - Job details
  5. Choose Save.
  6. When you’re ready, choose Run and monitor the job in the Runs tab.
  7. When the job is complete, navigate to the Amazon S3 console to validate that your output file has been successfully created.

We have now shaped our data into the format and structure that Amazon Personalize requires. The transformed dataset should have the following fields and format:

  • Interactions dataset – CSV format with fields USER_ID, ITEM_ID, TIMESTAMP
  • User input dataset – JSON format with element userId

Build an Amazon Personalize solution with the transformed dataset

With our interactions dataset and user input data in the right format, we can now create our Amazon Personalize solution. In this section, we create our dataset group, import our data, and then create a batch inference job. A dataset group organizes resources into containers for Amazon Personalize components.

  1. On the Amazon Personalize console, choose Create dataset group.
  2. For Domain, select Custom.
  3. Choose Create dataset group and continue.
    Amazon Personalize - create dataset group

Next, create the interactions dataset.

  1. Enter a dataset name and select Create new schema.
  2. Choose Create dataset and continue.
    Amazon Personalize - create interactions dataset
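If you prefer to script these steps, the following is a minimal Boto3 sketch of the same setup: a custom dataset group, a schema matching the minimal USER_ID, ITEM_ID, TIMESTAMP structure, and the interactions dataset. The resource names are illustrative placeholders, not values used elsewhere in this post.

import json
import boto3

personalize = boto3.client("personalize")

# Create a custom dataset group (wait for it to become ACTIVE before creating the dataset)
dataset_group = personalize.create_dataset_group(name="batch-recommendations-dsg")

# Minimal interactions schema: USER_ID, ITEM_ID, TIMESTAMP
interactions_schema = {
    "type": "record",
    "name": "Interactions",
    "namespace": "com.amazonaws.personalize.schema",
    "fields": [
        {"name": "USER_ID", "type": "string"},
        {"name": "ITEM_ID", "type": "string"},
        {"name": "TIMESTAMP", "type": "long"},
    ],
    "version": "1.0",
}
schema = personalize.create_schema(
    name="batch-recommendations-interactions-schema",
    schema=json.dumps(interactions_schema),
)

# Create the interactions dataset inside the dataset group
dataset = personalize.create_dataset(
    name="batch-recommendations-interactions",
    datasetType="Interactions",
    datasetGroupArn=dataset_group["datasetGroupArn"],
    schemaArn=schema["schemaArn"],
)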

We now import the interactions data that we had created earlier.

  1. Navigate to the S3 bucket in which we created our interactions CSV dataset.
  2. On the Permissions tab, add the following bucket access policy so that Amazon Personalize has access. Update the policy to include your bucket name.
    {
       "Version":"2012-10-17",
       "Id":"PersonalizeS3BucketAccessPolicy",
       "Statement":[
          {
             "Sid":"PersonalizeS3BucketAccessPolicy",
             "Effect":"Allow",
             "Principal":{
                "Service":"personalize.amazonaws.com"
             },
             "Action":[
                "s3:GetObject",
                "s3:ListBucket",
                "s3:PutObject"
             ],
             "Resource":[
                "arn:aws:s3:::<your-bucket-name>",
                "arn:aws:s3:::<your-bucket-name> /*"
             ]
          }
       ]
    }

Navigate back to Amazon Personalize and choose Create your dataset import job. Our interactions dataset should now be importing into Amazon Personalize. Wait for the import job to complete with a status of Active before continuing to the next step. This should take approximately 8 minutes.
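You can also script the import. The following Boto3 sketch creates the same dataset import job; the dataset ARN, S3 path, and IAM role ARN are placeholders that you replace with your own values.

import boto3

personalize = boto3.client("personalize")

import_job = personalize.create_dataset_import_job(
    jobName="interactions-import",
    datasetArn="arn:aws:personalize:<region>:<account-id>:dataset/<dataset-group-name>/INTERACTIONS",
    dataSource={"dataLocation": "s3://<your-bucket-name>/transformed/interactions/"},
    roleArn="arn:aws:iam::<account-id>:role/<personalize-import-role>",
)

# Poll until the import job status is ACTIVE before training a solution
status = personalize.describe_dataset_import_job(
    datasetImportJobArn=import_job["datasetImportJobArn"]
)["datasetImportJob"]["status"]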

  1. On the Amazon Personalize console, choose Overview in the navigation pane and choose Create solution.
    Amazon Personalize - Dashboard
  2. Enter a solution name.
  3. For Solution type, choose Item recommendation.
  4. For Recipe, choose the aws-user-personalization recipe.
  5. Choose Create and train solution.
    Amazon Personalize - create solution

The solution now trains against the interactions dataset that was imported with the user personalization recipe. Monitor the status of this process under Solution versions. Wait for it to complete before proceeding. This should take approximately 20 minutes.
Amazon Personalize - Status
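For reference, the same solution and solution version can also be created with Boto3. This sketch assumes the dataset group created earlier and uses the aws-user-personalization recipe; the solution name and dataset group ARN are placeholders.

import boto3

personalize = boto3.client("personalize")

solution = personalize.create_solution(
    name="user-personalization-solution",
    datasetGroupArn="arn:aws:personalize:<region>:<account-id>:dataset-group/<dataset-group-name>",
    recipeArn="arn:aws:personalize:::recipe/aws-user-personalization",
)

# Training a solution version is the step that takes roughly 20 minutes
solution_version = personalize.create_solution_version(solutionArn=solution["solutionArn"])
print(solution_version["solutionVersionArn"])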

We now create our batch inference job, which generates recommendations for each of the users present in the JSON input.

  1. In the navigation pane, under Custom resources, choose Batch inference jobs.
  2. Enter a job name, and for Solution, choose the solution created earlier.
  3. Choose Create batch inference job.
    Amazon Personalize - create batch inference job
  4. For Input data configuration, enter the S3 path of where the batch_users_input file is located.

This is the JSON file that contains userId.

  1. For Output data configuration path, choose the curated path in S3.
  2. Choose Create batch inference job.

This process takes approximately 30 minutes. When the job is finished, recommendations for each of the users specified in the user input file are saved in the S3 output location.
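The batch inference job can be created with Boto3 as well, which is essentially what the Step Functions workflow in the next section automates. In this sketch, the solution version ARN, S3 paths, and role ARN are placeholders.

import boto3

personalize = boto3.client("personalize")

batch_job = personalize.create_batch_inference_job(
    jobName="batch-recommendations",
    solutionVersionArn="arn:aws:personalize:<region>:<account-id>:solution/<solution-name>/<version-id>",
    jobInput={"s3DataSource": {"path": "s3://<your-bucket-name>/transformed/batch_users_input/"}},
    jobOutput={"s3DataDestination": {"path": "s3://<your-bucket-name>/curated/"}},
    roleArn="arn:aws:iam::<account-id>:role/<personalize-batch-role>",
    numResults=25,  # 25 recommendations per user is also the default
)
print(batch_job["batchInferenceJobArn"])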

We have successfully generated a set of recommendations for all of our users. However, we have only implemented the solution using the console so far. To make sure that this batch inferencing runs regularly with the latest set of data, we need to build an orchestration workflow. In the next section, we show you how to create an orchestration workflow using Step Functions.

Build a Step Functions workflow to orchestrate the batch inference workflow

To orchestrate your pipeline, complete the following steps:

  1. On the Step Functions console, choose Create State Machine.
  2. Select Design your workflow visually, then choose Next.
    AWS Step Functions - Create workflow
  3. Drag the CreateDatasetImportJob node from the left (you can search for this node in the search box) onto the canvas.
  4. Choose the node, and you should see the configuration API parameters on the right. Record the ARN.
  5. Enter your own values in the API Parameters text box.

This calls the CreateDatasetImportJob API with the parameter values that you specify.

AWS Step Functions Workflow

  1. Drag the CreateSolutionVersion node onto the canvas.
  2. Update the API parameters with the ARN of the solution that you noted down.

This creates a new solution version with the newly imported data by calling the CreateSolutionVersion API.

  1. Drag the CreateBatchInferenceJob node onto the canvas and similarly update the API parameters with the relevant values.

Make sure that you use the $.SolutionVersionArn syntax to retrieve the solution version ARN parameter from the previous step. These API parameters are passed to the CreateBatchInferenceJob API.

AWS Step Functions Workflow

We need to build wait logic into the Step Functions workflow to make sure the recommendation batch inference job finishes before the workflow completes.

  1. Find and drag in a Wait node.
  2. In the configuration for Wait, enter 300 seconds.

This is an arbitrary value; you should alter this wait time according to your specific use case.

  1. Choose the CreateBatchInferenceJob node again and navigate to the Error handling tab.
  2. For Catch errors, enter Personalize.ResourceInUseException.
  3. For Fallback state, choose Wait.

This step enables us to periodically check the status of the job; the workflow only exits the loop when the job is complete.

  1. For ResultPath, enter $.errorMessage.

This effectively means that when the “resource in use” exception is received, the job waits for x seconds before trying again with the same inputs.

AWS Step Functions Workflow

  1. Choose Save, and then choose Start the execution.

We have successfully orchestrated our batch recommendation pipeline for Amazon Personalize. As an optional step, you can use Amazon EventBridge to schedule a trigger of this workflow on a regular basis. For more details, refer to EventBridge (CloudWatch Events) for Step Functions execution status changes.
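If you schedule the workflow with EventBridge, the following Boto3 sketch creates a rule that starts the state machine once a day. The rule name, state machine ARN, and the IAM role that allows EventBridge to start executions are placeholders.

import boto3

events = boto3.client("events")

# Create (or update) a scheduled rule that fires once a day
events.put_rule(
    Name="personalize-batch-recommendations-daily",
    ScheduleExpression="rate(1 day)",
    State="ENABLED",
)

# Point the rule at the Step Functions state machine built above
events.put_targets(
    Rule="personalize-batch-recommendations-daily",
    Targets=[
        {
            "Id": "batch-recommendation-workflow",
            "Arn": "arn:aws:states:<region>:<account-id>:stateMachine:<state-machine-name>",
            "RoleArn": "arn:aws:iam::<account-id>:role/<eventbridge-invoke-stepfunctions-role>",
        }
    ],
)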

Clean up

To avoid incurring future charges, delete the resources that you created for this walkthrough.

Conclusion

In this post, we demonstrated how to create a batch recommendation pipeline by using a combination of AWS Glue, Amazon Personalize, and Step Functions, without needing a single line of code or ML experience. We used AWS Glue to prep our data into the format that Amazon Personalize requires. Then we used Amazon Personalize to import the data, create a solution with a user personalization recipe, and create a batch inferencing job that generates a default of 25 recommendations for each user, based on past interactions. We then orchestrated these steps using Step Functions so that we can run these jobs automatically.

As a next step, you might want to explore user segmentation, one of the newer recipes in Amazon Personalize, to create user segments for each row of the input data. For more details, refer to Getting batch recommendations and user segments.


About the author

Maxine Wee

Maxine Wee is an AWS Data Lab Solutions Architect. Maxine works with customers on their use cases, designs solutions to solve their business problems, and guides them through building scalable prototypes. Prior to her journey with AWS, Maxine helped customers implement BI, data warehousing, and data lake projects in Australia.

Read More

Use Amazon SageMaker pipeline sharing to view or manage pipelines across AWS accounts

On August 9, 2022, we announced the general availability of cross-account sharing of Amazon SageMaker Pipelines entities. You can now use cross-account support for Amazon SageMaker Pipelines to share pipeline entities across AWS accounts and access shared pipelines directly through Amazon SageMaker API calls.

Customers are increasingly adopting multi-account architectures for deploying and managing machine learning (ML) workflows with SageMaker Pipelines. This involves building workflows in development or experimentation (dev) accounts, deploying and testing them in a testing or pre-production (test) account, and finally promoting them to production (prod) accounts to integrate with other business processes. You can benefit from cross-account sharing of SageMaker pipelines in the following use cases:

  • When data scientists build ML workflows in a dev account, those workflows are then deployed by an ML engineer as a SageMaker pipeline into a dedicated test account. To further monitor those workflows, data scientists now require cross-account read-only permission to the deployed pipeline in the test account.
  • ML engineers, ML admins, and compliance teams, who manage deployment and operations of those ML workflows from a shared services account, also require visibility into the deployed pipeline in the test account. They might also require additional permissions for starting, stopping, and retrying those ML workflows.

In this post, we present an example multi-account architecture for developing and deploying ML workflows with SageMaker Pipelines.

Solution overview

A multi-account strategy helps you achieve data, project, and team isolation while supporting software development lifecycle steps. Cross-account pipeline sharing supports a multi-account strategy, removing the overhead of logging in and out of multiple accounts and improving ML testing and deployment workflows by sharing resources directly across multiple accounts.

In this example, we have a data science team that uses a dedicated dev account for the initial development of the SageMaker pipeline. This pipeline is then handed over to an ML engineer, who creates a continuous integration and continuous delivery (CI/CD) pipeline in their shared services account to deploy this pipeline into a test account. To still be able to monitor and control the deployed pipeline from their respective dev and shared services accounts, resource shares are set up with AWS Resource Access Manager in the test and dev accounts. With this setup, the ML engineer and the data scientist can now monitor and control the pipelines in the dev and test accounts from their respective accounts, as shown in the following figure.

In the workflow, the data scientist and ML engineer perform the following steps:

  1. The data scientist (DS) builds a model pipeline in the dev account.
  2. The ML engineer (MLE) productionizes the model pipeline and creates a pipeline (for this post, we call it sagemaker-pipeline).
  3. sagemaker-pipeline code is committed to an AWS CodeCommit repository in the shared services account.
  4. The data scientist creates an AWS RAM resource share for sagemaker-pipeline and shares it with the shared services account, which accepts the resource share.
  5. From the shared services account, ML engineers are now able to describe, monitor, and administer the pipeline runs in the dev account using SageMaker API calls.
  6. A CI/CD pipeline triggered in the shared service account builds and deploys the code to the test account using AWS CodePipeline.
  7. The CI/CD pipeline creates and runs sagemaker-pipeline in the test account.
  8. After running sagemaker-pipeline in the test account, the CI/CD pipeline creates a resource share for sagemaker-pipeline in the test account.
  9. A resource share from the test sagemaker-pipeline with read-only permissions is created with the dev account, which accepts the resource share.
  10. The data scientist is now able to describe and monitor the test pipeline run status using SageMaker API calls from the dev account.
  11. A resource share from the test sagemaker-pipeline with extended permissions is created with the shared services account, which accepts the resource share.
  12. The ML engineer is now able to describe, monitor, and administer the test pipeline run using SageMaker API calls from the shared services account.

In the following sections, we go into more detail and provide a demonstration on how to set up cross-account sharing for SageMaker pipelines.

How to create and share SageMaker pipelines across accounts

In this section, we walk through the necessary steps to create and share pipelines across accounts using AWS RAM and the SageMaker API.

Set up the environment

First, we need to set up a multi-account environment to demonstrate cross-account sharing of SageMaker pipelines:

  1. Set up two AWS accounts (dev and test). You can set this up as member accounts of an organization or as independent accounts.
  2. If you’re setting up your accounts as members of an organization, you can enable resource sharing with your organization. With this setting, when you share resources in your organization, AWS RAM doesn’t send invitations to principals. Principals in your organization gain access to shared resources without exchanging invitations.
  3. In the test account, launch Amazon SageMaker Studio and run the notebook train-register-deploy-pipeline-model. This creates an example pipeline in your test account. To simplify the demonstration, we use SageMaker Studio in the test account to launch the pipeline. For real-life projects, you should use Studio only in the dev account and launch the SageMaker pipeline in the test account using your CI/CD tooling.

Follow the instructions in the next section to share this pipeline with the dev account.

Set up a pipeline resource share

To share your pipeline with the dev account, complete the following steps:

  1. On the AWS RAM console, choose Create resource share.
  2. For Select resource type, choose SageMaker Pipelines.
  3. Select the pipeline you created in the previous step.
  4. Choose Next.
  5. For Permissions, choose your associated permissions.
  6. Choose Next.
    Next, you decide how you want to grant access to principals.
  7. If you need to share the pipeline only within your organization accounts, select Allow sharing only within your organization; otherwise, select Allow sharing with anyone.
  8. For Principals, choose your principal type (you can use an AWS account, organization, or organizational unit, based on your sharing requirement). For this post, we share with anyone at the AWS account level.
  9. Select your principal ID.
  10. Choose Next.
  11. On the Review and create page, verify your information is correct and choose Create resource share.
  12. Navigate to your destination account (for this post, your dev account).
  13. On the AWS RAM console, under Shared with me in the navigation pane, choose Resource shares.
  14. Choose your resource share and choose Accept resource share.

Resource sharing permissions

When creating your resource share, you can choose from one of two supported permission policies to associate with the SageMaker pipeline resource type. Both policies grant access to any selected pipeline and all of its runs.

The AWSRAMDefaultPermissionSageMakerPipeline policy allows the following read-only actions:

"sagemaker:DescribePipeline"
"sagemaker:DescribePipelineDefinitionForExecution"
"sagemaker:DescribePipelineExecution"
"sagemaker:ListPipelineExecutions"
"sagemaker:ListPipelineExecutionSteps"
"sagemaker:ListPipelineParametersForExecution"
"sagemaker:Search"

The AWSRAMPermissionSageMakerPipelineAllowExecution policy includes all of the read-only permissions from the default policy, and also allows shared accounts to start, stop, and retry pipeline runs.

The extended pipeline run permission policy allows the following actions:

"sagemaker:DescribePipeline"
"sagemaker:DescribePipelineDefinitionForExecution"
"sagemaker:DescribePipelineExecution"
"sagemaker:ListPipelineExecutions"
"sagemaker:ListPipelineExecutionSteps"
"sagemaker:ListPipelineParametersForExecution"
"sagemaker:StartPipelineExecution"
"sagemaker:StopPipelineExecution"
"sagemaker:RetryPipelineExecution"
"sagemaker:Search"

Access shared pipeline entities through direct API calls

In this section, we walk through how you can use various SageMaker Pipeline API calls to gain visibility into pipelines running in remote accounts that have been shared with you. For testing the APIs against the pipeline running in the test account from the dev account, log in to the dev account and use AWS CloudShell.

For cross-account SageMaker Pipelines API calls, you always need to use your pipeline ARN as the pipeline identifier. This also applies to commands that take a pipeline name, where you pass your pipeline ARN as the pipeline name.

To get your pipeline ARN, in your test account, navigate to your pipeline details in Studio via SageMaker Resources.

Choose Pipelines on your resources list.

Choose your pipeline and go to your pipeline Settings tab. You can find the pipeline ARN with your Metadata information. For this example, your ARN is defined as "arn:aws:sagemaker:us-east-1:<account-id>:pipeline/serial-inference-pipeline".
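Alternatively, you can resolve the ARN with an API call from the test account. The following Boto3 sketch looks up the pipeline by the name used in this example.

import boto3

sm = boto3.client("sagemaker")

# In the test account, resolve the pipeline ARN from its name
pipeline_arn = sm.describe_pipeline(PipelineName="serial-inference-pipeline")["PipelineArn"]
print(pipeline_arn)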

ListPipelineExecutions

This API call lists the runs of your pipeline. Run the following command from CloudShell, or from the AWS Command Line Interface (AWS CLI) configured with the appropriate AWS Identity and Access Management (IAM) role, replacing $SHARED_PIPELINE_ARN with your pipeline ARN:

aws sagemaker list-pipeline-executions --pipeline-name $SHARED_PIPELINE_ARN

The response lists all the runs of your pipeline with their PipelineExecutionArn, StartTime, PipelineExecutionStatus, and PipelineExecutionDisplayName:

{
  "PipelineExecutionSummaries": [
    {
      "PipelineExecutionArn": "arn:aws:sagemaker:<region>:<account_id>:pipeline/<pipeline_name>/execution/<execution_id>",
      "StartTime": "2022-08-10T11:32:05.543000+00:00",
      "PipelineExecutionStatus": "Executing",
      "PipelineExecutionDisplayName": "execution-321"
    },
    {
      "PipelineExecutionArn": "arn:aws:sagemaker:<region>:<account_id>:pipeline/<pipeline_name>/execution/<execution_id>",
      "StartTime": "2022-08-10T11:28:03.680000+00:00",
      "PipelineExecutionStatus": "Stopped",
      "PipelineExecutionDisplayName": "test"
    },
    {
      "PipelineExecutionArn": "arn:aws:sagemaker:<region>:<account_id>:pipeline/<pipeline_name>/execution/<execution_id>",
      "StartTime": "2022-08-10T11:03:47.406000+00:00",
      "PipelineExecutionStatus": "Succeeded",
      "PipelineExecutionDisplayName": "execution-123"
    }
  ]
}

DescribePipeline

This API call describes the detail of your pipeline. Run the following command, replacing $SHARED_PIPELINE_ARN with your pipeline ARN:

aws sagemaker describe-pipeline --pipeline-name $SHARED_PIPELINE_ARN

The response provides the metadata of your pipeline, as well as information about when it was created and last modified:

Output (truncated):
{
"PipelineArn": "arn:aws:sagemaker:<region>:<account-id>:pipeline/<pipeline_name>",
"PipelineName": "serial-inference-pipeline",
"PipelineDisplayName": "serial-inference-pipeline",
"PipelineDefinition": "{"Version": "2020-12-01", "Metadata": {}, "Parameters": [{"Name": "TrainingInstanceType", "Type": "String", "DefaultValue": "ml.m5.xlarge"}, {"Name": "ProcessingInstanceType", "Type": "String", "DefaultValue": "ml.m5.xlarge"}, {"Name": "ProcessingInstanceCount", "Type": "Integer", "DefaultValue": 1}, {"Name": "InputData", "Type":

..

"PipelineStatus": "Active",
"CreationTime": "2022-08-08T21:33:39.159000+00:00",
"LastModifiedTime": "2022-08-08T21:48:14.274000+00:00",
"CreatedBy": {},
"LastModifiedBy": {}
}

DescribePipelineExecution

This API call describes the detail of your pipeline run. Run the following command, replacing $PIPELINE_EXECUTION_ARN with the ARN of one of the pipeline runs returned by ListPipelineExecutions:

aws sagemaker describe-pipeline-execution \
  --pipeline-execution-arn $PIPELINE_EXECUTION_ARN

The response provides details on your pipeline run, including the PipelineExecutionStatus, ExperimentName, and TrialName:

{
  "PipelineArn": "arn:aws:sagemaker:<region>:<account_id>:pipeline/<pipeline_name>",
  "PipelineExecutionArn": "arn:aws:sagemaker:<region>:<account_id>:pipeline/<pipeline_name>/execution/<execution_id>",
  "PipelineExecutionDisplayName": "execution-123",
  "PipelineExecutionStatus": "Succeeded",
  "PipelineExperimentConfig": {
  "ExperimentName": "<pipeline_name>",
  "TrialName": "<execution_id>"
},
  "CreationTime": "2022-08-10T11:03:47.406000+00:00",
  "LastModifiedTime": "2022-08-10T11:15:01.102000+00:00",
  "CreatedBy": {},
  "LastModifiedBy": {}
}

StartPipelineExecution

This API call starts a pipeline run. Run the following command, replacing $SHARED_PIPELINE_ARN with your pipeline ARN and $CLIENT_REQUEST_TOKEN with a unique, case-sensitive identifier that you generate for this run. The identifier must be 32–128 characters long. For instance, you can generate a string using the AWS CLI command aws kms generate-random.

aws sagemaker start-pipeline-execution \
  --pipeline-name $SHARED_PIPELINE_ARN \
  --client-request-token $CLIENT_REQUEST_TOKEN

As a response, this API call returns the PipelineExecutionArn of the started run:

{
  "PipelineExecutionArn": "arn:aws:sagemaker:<region>:<account_id>:pipeline/<pipeline_name>/execution/<execution_id>"
}

StopPipelineExecution

This API call stops a pipeline run. Run the following command, replacing $PIPELINE_EXECUTION_ARN with the ARN of your running pipeline run and $CLIENT_REQUEST_TOKEN with a unique, case-sensitive identifier that you generate for this call. The identifier must be 32–128 characters long. For instance, you can generate a string using the AWS CLI command aws kms generate-random.

aws sagemaker stop-pipeline-execution \
  --pipeline-execution-arn $PIPELINE_EXECUTION_ARN \
  --client-request-token $CLIENT_REQUEST_TOKEN

As a response, this API call returns the PipelineExecutionArn of the stopped pipeline run:

{
  "PipelineExecutionArn": "arn:aws:sagemaker:<region>:<account_id>:pipeline/<pipeline_name>/execution/<execution_id>"
}
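The same calls work from Python if you prefer Boto3 over the AWS CLI. The following sketch mirrors the list, describe, and start operations against the shared pipeline; the pipeline ARN is a placeholder.

import uuid
import boto3

sm = boto3.client("sagemaker")
shared_pipeline_arn = "arn:aws:sagemaker:<region>:<account_id>:pipeline/<pipeline_name>"

# List the runs of the shared pipeline (pass the ARN wherever a pipeline name is expected)
executions = sm.list_pipeline_executions(PipelineName=shared_pipeline_arn)
for summary in executions["PipelineExecutionSummaries"]:
    print(summary["PipelineExecutionArn"], summary["PipelineExecutionStatus"])

# Describe one of the runs
if executions["PipelineExecutionSummaries"]:
    execution_arn = executions["PipelineExecutionSummaries"][0]["PipelineExecutionArn"]
    print(sm.describe_pipeline_execution(PipelineExecutionArn=execution_arn)["PipelineExecutionStatus"])

# Start a new run (requires the AWSRAMPermissionSageMakerPipelineAllowExecution permission)
started = sm.start_pipeline_execution(
    PipelineName=shared_pipeline_arn,
    ClientRequestToken=str(uuid.uuid4()),  # unique, 32-128 character token
)
print(started["PipelineExecutionArn"])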

Conclusion

Cross-account sharing of SageMaker pipelines allows you to securely share pipeline entities across AWS accounts and access shared pipelines through direct API calls, without having to log in and out of multiple accounts.

In this post, we dove into the functionality to show how you can share pipelines across accounts and access them via SageMaker API calls.

As a next step, you can use this feature for your next ML project.

Resources

To get started with SageMaker Pipelines and sharing pipelines across accounts, refer to the following resources:


About the authors

Ram Vittal is an ML Specialist Solutions Architect at AWS. He has over 20 years of experience architecting and building distributed, hybrid, and cloud applications. He is passionate about building secure and scalable AI/ML and big data solutions to help enterprise customers with their cloud adoption and optimization journey to improve their business outcomes. In his spare time, he enjoys tennis, photography, and action movies.

Maira Ladeira Tanke is an ML Specialist Solutions Architect at AWS. With a background in data science, she has 9 years of experience architecting and building ML applications with customers across industries. As a technical lead, she helps customers accelerate their achievement of business value through emerging technologies and innovative solutions. In her free time, Maira enjoys traveling and spending time with her family someplace warm.

Gabriel Zylka is a Professional Services Consultant at AWS. He works closely with customers to accelerate their cloud adoption journey. Specialized in the MLOps domain, he focuses on productionizing machine learning workloads by automating end-to-end machine learning lifecycles and helping achieve desired business outcomes. In his spare time, he enjoys traveling and hiking in the Bavarian Alps.

Read More

Explore Amazon SageMaker Data Wrangler capabilities with sample datasets

Data preparation is the process of collecting, cleaning, and transforming raw data to make it suitable for insight extraction through machine learning (ML) and analytics. Data preparation is crucial for ML and analytics pipelines. Your model and insights will only be as reliable as the data you use for training them. Flawed data will produce poor results regardless of the sophistication of your algorithms and analytical tools.

Amazon SageMaker Data Wrangler is a service to help data scientists and data engineers simplify and accelerate tabular and time series data preparation and feature engineering through a visual interface. You can import data from multiple data sources, such as Amazon Simple Storage Service (Amazon S3), Amazon Athena, Amazon Redshift, Snowflake, and Databricks, and process your data with over 300 built-in data transformations and a library of code snippets, so you can quickly normalize, transform, and combine features without writing any code. You can also bring your custom transformations in PySpark, SQL, or Pandas.

Previously, customers wanting to explore Data Wrangler needed to bring their own datasets; we’ve changed that. Starting today, you can begin experimenting with Data Wrangler’s features even faster by using a sample dataset and following suggested actions to easily navigate the product for the first time. In this post, we walk you through this process.

Solution overview

Data Wrangler offers a pre-loaded version of the well-known Titanic dataset, which is widely used to teach and experiment with ML. Data Wrangler’s suggested actions help first-time customers discover features such as Data Wrangler’s Data Quality and Insights Report, a feature that verifies data quality and helps detect abnormalities in your data.

In this post, we create a sample flow with the pre-loaded sample Titanic dataset to show how you can start experimenting with Data Wrangler’s features faster. We then use the processed Titanic dataset to create a classification model to tell us whether a passenger will survive or not, using the training functionality, which allows you to launch an Amazon SageMaker Autopilot experiment within any of the steps in a Data Wrangler flow. Along the way, we can explore Data Wrangler features through the product suggestions that surface in Data Wrangler. These suggestions can help you accelerate your learning curve with Data Wrangler by recommending actions and next steps.

Prerequisites

To get all the features described in this post, you need to be running the latest kernel version of Data Wrangler. Any new flow you create uses the latest kernel; for existing flows, you need to update the Data Wrangler application first.

Import the Titanic dataset

The Titanic dataset is a public dataset widely used to teach and experiment with ML. You can use it to create an ML model that predicts which passengers will survive the Titanic shipwreck. Data Wrangler now incorporates this dataset as a sample dataset that you can use to get started with Data Wrangler more quickly. In this post, we perform some data transformations using this dataset.

Let’s create a new Data Wrangler flow and call it Titanic. Data Wrangler shows you two options: you can either import your own dataset or you can use the sample dataset (the Titanic dataset).

You’re presented with a loading bar that indicates the progress of the dataset being imported into Data Wrangler. Click through the carousel to learn more about how Data Wrangler helps you import, prepare, and process datasets for ML. Wait until the bar is fully loaded; this indicates that your dataset is imported and ready for use.

The Titanic dataset is now loaded into our flow. For a description of the dataset, refer to Titanic – Machine Learning from Disaster.

Explore Data Wrangler features

As a first-time Data Wrangler user, you now see suggested actions to help you navigate the product and discover interesting features. Let’s follow the suggested advice.

  1. Choose the plus sign to get a list of options to modify the dataset.
  2. Choose Get data insights.

    This opens the Analysis tab on the data, in which you can create a Data Quality and Insights Report. When you create this report, Data Wrangler gives you the option to select a target column. A target column is a column that you’re trying to predict. When you choose a target column, Data Wrangler automatically creates a target column analysis. It also ranks the features in the order of their predictive power. When you select a target column, you must specify whether you’re trying to solve a regression or a classification problem.
  3. Choose the column survived as the target column because that’s the value we want to predict.
  4. For Problem type, select Classification, because we want to know whether a passenger belongs to the survived or not survived classes.
  5. Choose Create.
    This creates an analysis on your dataset that contains relevant points like a summary of the dataset, duplicate rows, anomalous samples, feature details, and more. To learn more about the Data Quality and Insights Report, refer to Accelerate data preparation with data quality and insights in Amazon SageMaker Data Wrangler and Get Insights On Data and Data Quality.
    Let’s get a quick look at the dataset itself.
  6. Choose the Data tab to visualize the data as a table.
    Let’s now generate some example data visualizations.
  7. Choose the Analysis tab to start visualizing your data. You can generate three histograms: the first two visualize the number of people that survived based on the sex and class columns, as shown in the following screenshots. The third visualizes the ages of the people that boarded the Titanic.
    Let’s now perform some transformations on the data.
  8. First, drop the columns ticket, cabin, and name.
  9. Next, perform one-hot encoding on the categorical columns embarked, sex, and home.dest.
  10. Finally, fill in missing values for the columns boat and body with a 0 value.
    Your dataset now looks something like the following screenshot.
  11. Now split the dataset into three sets: a training set with 70% of the data, a validation set with 20% of the data, and a test set with 10% of the data.
    The splits done here use a stratified split on the survived variable and are just for the sake of the demonstration. (A pandas equivalent of the transformations in steps 8–11 is sketched after this list.)
    Now let’s configure the destination of our data.
  12. Choose the plus sign on each Dataset node, choose Add destination, and choose S3 to add an Amazon S3 destination for the transformed datasets.
  13. In the Add a destination pane, you can configure the Amazon S3 details to store your processed datasets.
    Our Titanic flow should now look like the following screenshot.
    You can now transform all the data using SageMaker processing jobs.
  14. Choose Create job.
  15. Keep the default values and choose Next.
  16. Choose Run.
    A new SageMaker processing job is now created. You can see the job’s details and track its progress on the SageMaker console under Processing jobs.
    When the processing job is complete, you can navigate to any of the S3 locations specified for storing the datasets and query the data just to confirm that the processing was successful. You can now use this data to feed your ML projects.
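For reference, the transformations in steps 8–11 map to a few lines of pandas outside of Data Wrangler. The following is only a rough sketch, assuming a local copy of the Titanic CSV with the column names used above; Data Wrangler performs the equivalent steps visually and as a managed job.

import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("titanic.csv")  # assumed local copy of the sample dataset

# Step 8: drop unused columns
df = df.drop(columns=["ticket", "cabin", "name"])

# Step 9: one-hot encode the categorical columns
df = pd.get_dummies(df, columns=["embarked", "sex", "home.dest"])

# Step 10: fill missing values for boat and body with 0
df[["boat", "body"]] = df[["boat", "body"]].fillna(0)

# Step 11: stratified 70/20/10 split on the survived column
train, rest = train_test_split(df, test_size=0.3, stratify=df["survived"], random_state=42)
validation, test = train_test_split(rest, test_size=1 / 3, stratify=rest["survived"], random_state=42)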

Launch an Autopilot experiment to create a classifier

You can now launch Autopilot experiments directly from Data Wrangler and use the data at any of the steps in the flow to automatically train a model on the data.

  1. Choose the Dataset node called Titanic_dataset (train) and navigate to the Train tab.
    Before training, you need to first export your data to Amazon S3.
  2. Follow the instructions to export your data to an S3 location of your choice.
    You can specify to export the data in CSV or Parquet format for increased efficiency. Additionally, you can specify an AWS Key Management Service (AWS KMS) key to encrypt your data.
    On the next page, you configure your Autopilot experiment.
  3. Unless your data is split into several parts, leave the default value under Connect your data.
  4. For this demonstration, leave the default values for Experiment name and Output data location.
  5. Under Advanced settings, expand Machine learning problem type.
  6. Choose Binary classification as the problem type and Accuracy as the objective metric.
    You specify these two values manually even though Autopilot is capable of inferring them from the data.
  7. Leave the rest of the fields with the default values and choose Create Experiment.
    Wait for a couple of minutes until the Autopilot experiment is complete, and you will see a leaderboard like the following with each of the models obtained by Autopilot.

You can now choose to deploy any of the models in the leaderboard for inference.
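For completeness, the same experiment can be launched programmatically with the SageMaker Python SDK instead of the Data Wrangler UI. This is a sketch under assumptions: the S3 paths are placeholders, and the exported training CSV is assumed to contain the survived column as the target.

import sagemaker
from sagemaker.automl.automl import AutoML

session = sagemaker.Session()
role = sagemaker.get_execution_role()  # assumes you run this from Studio or a SageMaker notebook

automl = AutoML(
    role=role,
    target_attribute_name="survived",
    problem_type="BinaryClassification",
    job_objective={"MetricName": "Accuracy"},
    output_path="s3://<your-bucket>/titanic-autopilot-output/",  # placeholder
    max_candidates=10,
    sagemaker_session=session,
)

# Train on the dataset exported from the Data Wrangler flow (placeholder path)
automl.fit(inputs="s3://<your-bucket>/titanic-train/", wait=False)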

Clean up

When you’re not using Data Wrangler, it’s important to shut down the instance on which it runs to avoid incurring additional fees.

To avoid losing work, save your data flow before shutting Data Wrangler down.

  1. To save your data flow in Amazon SageMaker Studio, choose File, then choose Save Data Wrangler Flow.
    Data Wrangler automatically saves your data flow every 60 seconds.
  2. To shut down the Data Wrangler instance, in Studio, choose Running Instances and Kernels.
  3. Under RUNNING APPS, choose the shutdown icon next to the sagemaker-data-wrangler-1.0 app.
  4. Choose Shut down all to confirm.
    Data Wrangler runs on an ml.m5.4xlarge instance. This instance disappears from RUNNING INSTANCES when you shut down the Data Wrangler app.

After you shut down the Data Wrangler app, it has to restart the next time you open a Data Wrangler flow file. This can take a few minutes.

Conclusion

In this post, we demonstrated how you can use the new sample dataset on Data Wrangler to explore Data Wrangler’s features without needing to bring your own data. We also presented two additional features: the loading page to let you visually track the progress of the data being imported into Data Wrangler, and product suggestions that provide useful tips to get started with Data Wrangler. We went further to show how you can create SageMaker processing jobs and launch Autopilot experiments directly from the Data Wrangler user interface.

To learn more about using data flows with Data Wrangler, refer to Create and Use a Data Wrangler Flow and Amazon SageMaker Pricing. To get started with Data Wrangler, see Prepare ML Data with Amazon SageMaker Data Wrangler. To learn more about Autopilot and AutoML on SageMaker, visit Automate model development with Amazon SageMaker Autopilot.


About the authors

David Laredo is a Prototyping Architect at AWS Envision Engineering in LATAM, where he has helped develop multiple machine learning prototypes. Previously, he worked as a Machine Learning Engineer and has been doing machine learning for over 5 years. His areas of interest are NLP, time series, and end-to-end ML.

Parth Patel is a Solutions Architect at AWS in the San Francisco Bay Area. Parth guides customers to accelerate their journey to the cloud and helps them adopt the AWS Cloud successfully. He focuses on ML and application modernization.

Read More

Run image segmentation with Amazon SageMaker JumpStart

In December 2020, AWS announced the general availability of Amazon SageMaker JumpStart, a capability of Amazon SageMaker that helps you quickly and easily get started with machine learning (ML). JumpStart provides one-click fine-tuning and deployment of a wide variety of pre-trained models across popular ML tasks, as well as a selection of end-to-end solutions that solve common business problems. These features remove the heavy lifting from each step of the ML process, making it easier to develop high-quality models and reducing time to deployment.

This post is the third in a series on using JumpStart for specific ML tasks. In the first post, we showed how you can run image classification use cases on JumpStart. In the second post, we showed how you can run text classification use cases on JumpStart. In this post, we provide a step-by-step walkthrough on how to fine-tune and deploy an image segmentation model, using trained models from MXNet. We explore two ways of obtaining the same result: via JumpStart’s graphical interface on Amazon SageMaker Studio, and programmatically through JumpStart APIs.

If you want to jump straight into the JumpStart API code we explain in this post, you can refer to the following sample Jupyter notebooks:

JumpStart overview

JumpStart helps you get started with ML models for a variety of tasks without writing a single line of code. At the time of writing, JumpStart enables you to do the following:

  • Deploy pre-trained models for common ML tasks – JumpStart enables you to address common ML tasks with no development effort by providing easy deployment of models pre-trained on large, publicly available datasets. The ML research community has put a large amount of effort into making a majority of recently developed models publicly available for use. JumpStart hosts a collection of over 300 models, spanning the 15 most popular ML tasks such as object detection, text classification, and text generation, making it easy for beginners to use them. These models are drawn from popular model hubs such as TensorFlow, PyTorch, Hugging Face, and MXNet.
  • Fine-tune pre-trained models – JumpStart allows you to fine-tune pre-trained models with no need to write your own training algorithm. In ML, the ability to transfer the knowledge learned in one domain to another domain is called transfer learning. You can use transfer learning to produce accurate models on your smaller datasets, with much lower training costs than the ones involved in training the original model. JumpStart also includes popular training algorithms based on LightGBM, CatBoost, XGBoost, and Scikit-learn, which you can train from scratch for tabular regression and classification.
  • Use pre-built solutions – JumpStart provides a set of 17 solutions for common ML use cases, such as demand forecasting and industrial and financial applications, which you can deploy with just a few clicks. Solutions are end-to-end ML applications that string together various AWS services to solve a particular business use case. They use AWS CloudFormation templates and reference architectures for quick deployment, which means they’re fully customizable.
  • Refer to notebook examples for SageMaker algorithms – SageMaker provides a suite of built-in algorithms to help data scientists and ML practitioners get started with training and deploying ML models quickly. JumpStart provides sample notebooks that you can use to quickly use these algorithms.
  • Review training videos and blogs – JumpStart also provides numerous blog posts and videos that teach you how to use different functionalities within SageMaker.

JumpStart accepts custom VPC settings and AWS Key Management Service (AWS KMS) encryption keys, so you can use the available models and solutions securely within your enterprise environment. You can pass your security settings to JumpStart within Studio or through the SageMaker Python SDK.

Semantic segmentation

Semantic segmentation delineates each class of objects appearing in an input image. It tags (classifies) each pixel of the input image with a class label from a predefined set of classes. Multiple objects of the same class are mapped to the same mask.

The model available for fine-tuning builds a fully convolutional network (FCN) “head” on top of the base network. The fine-tuning step fine-tunes the FCNHead while keeping the parameters of the rest of the model frozen, and returns the fine-tuned model. The objective is to minimize per-pixel softmax cross entropy loss to train the FCN. The model returned by fine-tuning can be further deployed for inference.

The input directory should look like the following code if the training data contains two images. The names of the .png files can be anything.

input_directory
    |--images
        |--abc.png
        |--def.png
    |--masks
        |--abc.png
        |--def.png
    class_label_to_prediction_index.json

The mask files should have class label information for each pixel.

Instance segmentation

Instance segmentation detects and delineates each distinct object of interest appearing in an image. It tags every pixel with an instance label. Whereas semantic segmentation assigns the same tag to pixels of multiple objects of the same class, instance segmentation further labels pixels corresponding to each occurrence of an object on the image with a separate tag.

Currently, JumpStart offers inference-only models for instance segmentation and doesn’t support fine-tuning.

The following images illustrate the difference between the inference in semantic segmentation and instance segmentation. The original image has two people in the image. Semantic segmentation treats multiple people in the image as one entity: Person. However, instance segmentation identifies individual people within the Person category.

Solution overview

The following sections provide a step-by-step demo to perform semantic segmentation with JumpStart, both via the Studio UI and via JumpStart APIs.

We walk through the following steps:

  1. Access JumpStart through the Studio UI:
    1. Run inference on the pre-trained model.
    2. Fine-tune the pre-trained model.
  2. Use JumpStart programmatically with the SageMaker Python SDK:
    1. Run inference on the pre-trained model.
    2. Fine-tune the pre-trained model.

We also discuss additional advanced features of JumpStart.

Access JumpStart through the Studio UI

In this section, we demonstrate how to train and deploy JumpStart models through the Studio UI.

Run inference on the pre-trained model

The following video shows you how to find a pre-trained semantic segmentation model on JumpStart and deploy it. The model page contains valuable information about the model, how to use it, expected data format, and some fine-tuning details. You can deploy any of the pre-trained models available in JumpStart. For inference, we pick the ml.g4dn.xlarge instance type. It provides the GPU acceleration needed for low inference latency, but at a lower price point. After you configure the SageMaker hosting instance, choose Deploy. It may take 5–10 minutes until your persistent endpoint is up and running.

After a few minutes, your endpoint is operational and ready to respond to inference requests.

Similarly, you can deploy a pre-trained instance segmentation model by following the same steps in the preceding video while searching for instance segmentation instead of semantic segmentation in the JumpStart search bar.

Fine-tune the pre-trained model

The following video shows how to find and fine-tune a semantic segmentation model in JumpStart. In the video, we fine-tune the model using the PennFudanPed dataset, provided by default in JumpStart, which you can download under the Apache 2.0 License.

Fine-tuning on your own dataset involves taking the correct formatting of data (as explained on the model page), uploading it to Amazon Simple Storage Service (Amazon S3), and specifying its location in the data source configuration. We use the same hyperparameter values set by default (number of epochs, learning rate, and batch size). We also use a GPU-backed ml.p3.2xlarge as our SageMaker training instance.

You can monitor your training job running directly on the Studio console, and are notified upon its completion. After training is complete, you can deploy the fine-tuned model from the same page that holds the training job details. The deployment workflow is the same as deploying a pre-trained model.

Use JumpStart programmatically with the SageMaker SDK

In the preceding sections, we showed how you can use the JumpStart UI to deploy a pre-trained model and fine-tune it interactively, in a matter of a few clicks. However, you can also use JumpStart’s models and easy fine-tuning programmatically by using APIs that are integrated into the SageMaker SDK. We now go over a quick example of how you can replicate the preceding process. All the steps in this demo are available in the accompanying notebooks Introduction to JumpStart – Instance Segmentation and Introduction to JumpStart – Semantic Segmentation.

Run inference on the pre-trained model

In this section, we choose an appropriate pre-trained model in JumpStart, deploy this model to a SageMaker endpoint, and run inference on the deployed endpoint.

SageMaker is a platform based on Docker containers. JumpStart uses the available framework-specific SageMaker Deep Learning Containers (DLCs). We fetch any additional packages, as well as scripts to handle training and inference for the selected task. Finally, the pre-trained model artifacts are separately fetched with model_uris, which provides flexibility to the platform. You can use any number of models pre-trained for the same task with a single training or inference script. See the following code:

model_id, model_version = "mxnet-semseg-fcn-resnet50-ade", "*"

# Retrieve the inference docker container uri
deploy_image_uri = image_uris.retrieve(
    region=None,
    framework=None,  # automatically inferred from model_id
    image_scope="inference",
    model_id=model_id,
    model_version=model_version,
    instance_type=inference_instance_type,
)

# Retrieve the inference script uri
deploy_source_uri = script_uris.retrieve(model_id=model_id, model_version=model_version, script_scope="inference")

base_model_uri = model_uris.retrieve(model_id=model_id, model_version=model_version, model_scope="inference")

For instance segmentation, we can instead set model_id to one of the JumpStart instance segmentation models; the is in those identifiers corresponds to instance segmentation, just as semseg here corresponds to semantic segmentation.

Next, we feed the resources into a SageMaker model instance and deploy an endpoint:

from sagemaker.model import Model
from sagemaker.predictor import Predictor

# aws_role, endpoint_name, and inference_instance_type are defined earlier in the accompanying notebook

# Create the SageMaker model instance
model = Model(
    image_uri=deploy_image_uri,
    source_dir=deploy_source_uri,
    model_data=base_model_uri,
    entry_point="inference.py",  # entry point file in source_dir and present in deploy_source_uri
    role=aws_role,
    predictor_cls=Predictor,
    name=endpoint_name,
)

# Deploy the model. Note that we need to pass the Predictor class when we deploy the model through the Model class,
# to be able to run inference through the SageMaker API.
base_model_predictor = model.deploy(
    initial_instance_count=1,
    instance_type=inference_instance_type,
    predictor_cls=Predictor,
    endpoint_name=endpoint_name,
)

After a few minutes, our model is deployed and we can get predictions from it in real time!

The following code snippet gives you a glimpse of what semantic segmentation looks like: the predicted mask for each pixel is visualized. To get inferences from a deployed model, supply an input image in binary format; the endpoint responds with a predicted label for each pixel in the image. We use the query and parse_response helper functions, which are defined in the accompanying notebook:

query_response = query(base_model_predictor, pedestrian_img)
predictions, labels, image_labels = parse_response(query_response)
print("Objects present in the picture:", image_labels)

Fine-tune the pre-trained model

To fine-tune a selected model, we need to get that model’s URI, as well as that of the training script and the container image used for training. Thankfully, these three inputs depend solely on the model name, version (for a list of the available models, see JumpStart Available Model Table), and the type of instance you want to train on. This is demonstrated in the following code snippet:

from sagemaker import image_uris, model_uris, script_uris

model_id, model_version = "mxnet-semseg-fcn-resnet50-ade", "*"
training_instance_type = "ml.p3.2xlarge"

train_image_uri = image_uris.retrieve(
    region=None,
    framework=None,
    model_id=model_id,
    model_version=model_version,
    image_scope="training",
    instance_type=training_instance_type,
)

# Retrieve the training script
train_source_uri = script_uris.retrieve(model_id=model_id, model_version=model_version, script_scope="training")

# Retrieve the pre-trained model tarball to further fine-tune
train_model_uri = model_uris.retrieve(model_id=model_id, model_version=model_version, model_scope="training")

We retrieve the model_id corresponding to the same model we used previously. You can now fine-tune this JumpStart model on your own custom dataset using the SageMaker SDK. We use a dataset that is publicly hosted on Amazon S3, conveniently focused on semantic segmentation. The dataset should be structured for fine-tuning as explained in the previous section. See the following example code:

from sagemaker.estimator import Estimator
from sagemaker.utils import name_from_base
from sagemaker import hyperparameters as hp

# aws_region, aws_role, and s3_output_location are defined earlier in the accompanying notebook

# URI of your training dataset
training_data_bucket = f"jumpstart-cache-prod-{aws_region}"
training_data_prefix = "training-datasets/PennFudanPed_SemSeg/"
training_dataset_s3_path = f"s3://{training_data_bucket}/{training_data_prefix}"
training_job_name = name_from_base(f"jumpstart-example-{model_id}-transfer-learning")

# Retrieve the default hyperparameters for this model and task
hyperparameters = hp.retrieve_default(model_id=model_id, model_version=model_version)

# Create SageMaker Estimator instance
semseg_estimator = Estimator(
    role=aws_role,
    image_uri=train_image_uri,
    source_dir=train_source_uri,
    model_uri=train_model_uri,
    entry_point="transfer_learning.py",
    instance_count=1,
    instance_type=training_instance_type,
    max_run=360000,
    hyperparameters=hyperparameters,
    output_path=s3_output_location,
)

# Launch a SageMaker training job by passing the S3 path of the training data
semseg_estimator.fit({"training": training_dataset_s3_path}, logs=True)

We obtain the same default hyperparameters for our selected model as the ones we saw in the previous section, using sagemaker.hyperparameters.retrieve_default(). We then instantiate a SageMaker estimator and call the .fit method to start fine-tuning our model, passing it the Amazon S3 URI for our training data. The entry_point script provided is named transfer_learning.py (the same for other tasks and models), and the input data channel passed to .fit must be named training.

While the algorithm trains, you can monitor its progress either in the SageMaker notebook where you’re running the code itself, or on Amazon CloudWatch. When training is complete, the fine-tuned model artifacts are uploaded to the Amazon S3 output location specified in the training configuration. You can now deploy the model in the same manner as the pre-trained model.

Advanced features

In addition to fine-tuning and deploying pre-trained models, JumpStart offers many advanced features.

The first is automatic model tuning. This allows you to automatically tune your ML models to find the hyperparameter values that yield the best accuracy, within the ranges you provide, through the SageMaker API.
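As an illustration, the following sketch wraps the estimator from the previous section in a SageMaker HyperparameterTuner. The hyperparameter name, objective metric name, and metric regex are assumptions for illustration only; check the JumpStart training script you use for the actual names it emits.

from sagemaker.tuner import ContinuousParameter, HyperparameterTuner

# Hypothetical range and metric; replace with the names your training script actually uses
hyperparameter_ranges = {"learning-rate": ContinuousParameter(1e-4, 1e-1, scaling_type="Logarithmic")}

tuner = HyperparameterTuner(
    estimator=semseg_estimator,
    objective_metric_name="validation_loss",
    objective_type="Minimize",
    hyperparameter_ranges=hyperparameter_ranges,
    metric_definitions=[{"Name": "validation_loss", "Regex": "validation loss: ([0-9\\.]+)"}],  # assumed log format
    max_jobs=6,
    max_parallel_jobs=2,
)

tuner.fit({"training": training_dataset_s3_path})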

The second is incremental training. This allows you to train a model you have already fine-tuned using an expanded dataset that contains an underlying pattern not accounted for in previous fine-tuning runs, which resulted in poor model performance. Incremental training saves both time and resources because you don’t need to retrain the model from scratch.

Conclusion

In this post, we showed how to fine-tune and deploy a pre-trained semantic segmentation model, and how to adapt it for instance segmentation using JumpStart. You can accomplish this without needing to write code. Try out the solution on your own and send us your comments.

To learn more about JumpStart and how you can use open-source pre-trained models for a variety of other ML tasks, check out the following AWS re:Invent 2020 video.


About the Authors

Dr. Vivek Madan is an Applied Scientist with the Amazon SageMaker JumpStart team. He got his PhD from University of Illinois at Urbana-Champaign and was a Post Doctoral Researcher at Georgia Tech. He is an active researcher in machine learning and algorithm design and has published papers in EMNLP, ICLR, COLT, FOCS, and SODA conferences.

Santosh Kulkarni is an Enterprise Solutions Architect at Amazon Web Services who works with sports customers in Australia. He is passionate about building large-scale distributed applications to solve business problems using his knowledge in AI/ML, big data, and software development.

Leonardo Bachega is a senior scientist and manager in the Amazon SageMaker JumpStart team. He’s passionate about building AI services for computer vision.

Read More