AWS Inferentia is now available in 11 AWS Regions, with best-in-class performance for running object detection models at scale

AWS has expanded the availability of Amazon EC2 Inf1 instances to four new AWS Regions, bringing the total number of supported Regions to 11: US East (N. Virginia, Ohio), US West (Oregon), Asia Pacific (Mumbai, Singapore, Sydney, Tokyo), Europe (Frankfurt, Ireland, Paris), and South America (São Paulo).

Amazon EC2 Inf1 instances are powered by AWS Inferentia chips, which are custom-designed to provide you with the lowest cost per inference in the cloud and lower the barriers for everyday developers to use machine learning (ML) at scale. Customers using models such as YOLO v3 and YOLO v4 can get up to 1.85 times higher throughput and up to 40% lower cost per inference compared to the EC2 G4 GPU-based instances.

As you scale your use of deep learning across new applications, you may be bound by the high cost of running trained ML models in production. In many cases, up to 90% of the infrastructure cost spent on developing and running an ML application is on inference, making the need for high-performance, cost-effective ML inference infrastructure critical. Inf1 instances are built from the ground up to deliver faster performance and more cost-effective ML inference than comparable GPU-based instances. This gives you the performance and cost structure you need to confidently deploy your deep learning models across a broad set of applications.

AWS Neuron SDK performance and support for new ML models

You can deploy your ML models to Inf1 instances natively with popular ML frameworks such as TensorFlow, PyTorch, and MXNet. You can deploy your existing models to Amazon EC2 Inf1 instances with minimal code changes by using the AWS Neuron SDK, which is integrated with these popular ML frameworks. This gives you the freedom to maintain hardware portability and take advantage of the latest technologies without being tied to vendor-specific software libraries.
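For example, compiling an existing PyTorch model for Inferentia is a short workflow. The following is a minimal sketch, assuming the torch-neuron package from the Neuron SDK is installed and using a torchvision ResNet-50 as a stand-in model (the file name is a placeholder; the Neuron documentation covers the equivalent flow for TensorFlow and MXNet):

import torch
import torch_neuron  # Neuron SDK integration for PyTorch; registers torch.neuron
from torchvision import models

# Any traceable PyTorch model can be used; ResNet-50 is only a stand-in here.
model = models.resnet50(pretrained=True)
model.eval()

# Compile the model for Inferentia using an example input shape.
example = torch.zeros([1, 3, 224, 224])
model_neuron = torch.neuron.trace(model, example_inputs=[example])

# Save the compiled artifact; reload it on an Inf1 instance with torch.jit.load.
model_neuron.save("resnet50_neuron.pt")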

Since its launch, the Neuron SDK has seen a dramatic improvement in the breadth of models that deliver best-in-class performance at a fraction of the cost. This includes natural language processing models like the popular BERT, image classification models (ResNet and VGG), and object detection models (OpenPose and SSD). The latest Neuron release (1.8.0) provides optimizations that improve performance of YOLO v3 and v4, VGG16, SSD300, and BERT. It also improves operational deployments of large-scale inference applications, with a session management agent incorporated into all supported ML frameworks and a new neuron tool that allows you to easily scale monitoring of large fleets of inference applications.

Customer success stories

Since the launch of Inf1 instances, a broad spectrum of customers, from large enterprises to startups, as well as Amazon services, have begun using them to run production workloads.

Anthem is one of the nation’s leading health benefits companies, serving the healthcare needs of over 40 million members across dozens of states. They use deep learning to automate the generation of actionable insights from customer opinions via natural language models.

“Our application is computationally intensive and needs to be deployed in a highly performant manner,” says Numan Laanait, PhD, Principal AI/Data Scientist at Anthem. “We seamlessly deployed our deep learning inferencing workload onto Amazon EC2 Inf1 instances powered by the AWS Inferentia processor. The new Inf1 instances provide two times higher throughput than GPU-based instances and allowed us to streamline our inference workloads.”

Condé Nast, another AWS customer, has a global portfolio that encompasses over 20 leading media brands, including Wired, Vogue, and Vanity Fair.

“Within a few weeks, our team was able to integrate our recommendation engine with AWS Inferentia chips,” says Paul Fryzel, Principal Engineer in AI Infrastructure at Condé Nast. “This union enables multiple runtime optimizations for state-of-the-art natural language models on SageMaker’s Inf1 instances. As a result, we observed a 72% reduction in cost compared to the previously deployed GPU instances.”

Getting started

The easiest and quickest way to get started with Inf1 instances is via Amazon SageMaker, a fully managed service for building, training, and deploying ML models. If you prefer to manage your own ML application development platforms, you can get started by either launching Inf1 instances with AWS Deep Learning AMIs, which include the Neuron SDK, or by using Inf1 instances via Amazon Elastic Kubernetes Service (Amazon EKS) or Amazon Elastic Container Service (Amazon ECS) for containerized ML applications.

For more information, see Amazon EC2 Inf1 Instances.


About the Author

Gadi Hutt is a Sr. Director, Business Development at AWS. Gadi has over 20 years of experience in engineering and business disciplines. He started his career as an embedded software engineer and later moved to product lead positions. Since 2013, Gadi has led Annapurna Labs technical business development and product management, focused on hardware acceleration software and hardware products like the EC2 FPGA F1 instances and AWS Inferentia alongside its Neuron SDK, accelerating machine learning in the cloud.

Read More

Moving from notebooks to automated ML pipelines using Amazon SageMaker and AWS Glue

A typical machine learning (ML) workflow involves processes such as data extraction, data preprocessing, feature engineering, model training and evaluation, and model deployment. As data changes over time, when you deploy models to production, you want your model to learn continually from the stream of data. This means supporting the model’s ability to autonomously learn and adapt in production as new data is added.

In practice, data scientists often work with Jupyter notebooks for development work and find it hard to translate from notebooks to automated pipelines. To achieve the two main functions of an ML service in production, namely retraining (retrain the model on newer labeled data) and inference (use the trained model to get predictions), you might primarily use the following:

  • Amazon SageMaker – A fully managed service that provides developers and data scientists the ability to build, train, and deploy ML models quickly
  • AWS Glue – A fully managed extract, transform, and load (ETL) service that makes it easy to prepare and load data

In this post, we demonstrate how to orchestrate an ML training pipeline using AWS Glue workflows and train and deploy the models using Amazon SageMaker. For this use case, you use AWS Glue workflows to build an end-to-end ML training pipeline that covers data extraction, data processing, training, and deploying models to Amazon SageMaker endpoints.

Use case

For this use case, we use the DBpedia Ontology classification dataset to build a model that performs multi-class classification. We trained the model using the BlazingText algorithm, which is a built-in Amazon SageMaker algorithm that can classify unstructured text data into multiple classes.

This post doesn’t go into the details of the model, but demonstrates a way to build an ML pipeline that builds and deploys any ML model.

Solution overview

The following diagram summarizes the approach for the retraining pipeline.

The workflow contains the following elements:

  • AWS Glue crawler – You can use a crawler to populate the Data Catalog with tables. This is the primary method used by most AWS Glue users. A crawler can crawl multiple data stores in a single run. Upon completion, the crawler creates or updates one or more tables in your Data Catalog. ETL jobs that you define in AWS Glue use these Data Catalog tables as sources and targets.
  • AWS Glue triggers – Triggers are Data Catalog objects that you can use to either manually or automatically start one or more crawlers or ETL jobs. You can design a chain of dependent jobs and crawlers by using triggers.
  • AWS Glue job – An AWS Glue job encapsulates a script that connects source data, processes it, and writes it to a target location.
  • AWS Glue workflow – An AWS Glue workflow can chain together AWS Glue jobs, data crawlers, and triggers, and build dependencies between the components. When the workflow is triggered, it follows the chain of operations as described in the preceding image.

The workflow begins by downloading the training data from Amazon Simple Storage Service (Amazon S3), followed by running data preprocessing steps and dividing the data into train, test, and validate sets in AWS Glue jobs. The training step runs as a Python shell job in AWS Glue, which starts a training job in Amazon SageMaker based on a set of hyperparameters.
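The repository's TrainingJob.py handles this step; the following is a simplified sketch of how a Glue Python shell job might start a BlazingText training job with boto3. The bucket, role ARN, job name, and hyperparameters are placeholders rather than the values used by the actual script:

import boto3

sm = boto3.client("sagemaker")

# Placeholder values; the real script resolves these from SSM Parameter Store.
role_arn = "arn:aws:iam::111122223333:role/GlueSageMakerRole"
bucket = "my-ml-pipeline-bucket"
algorithm_image = "433757028032.dkr.ecr.us-west-2.amazonaws.com/blazingtext:latest"

sm.create_training_job(
    TrainingJobName="dbpedia-blazingtext-demo",
    AlgorithmSpecification={"TrainingImage": algorithm_image, "TrainingInputMode": "File"},
    RoleArn=role_arn,
    HyperParameters={"mode": "supervised", "epochs": "10", "min_count": "2"},
    InputDataConfig=[{
        "ChannelName": "train",
        "DataSource": {"S3DataSource": {
            "S3DataType": "S3Prefix",
            "S3Uri": f"s3://{bucket}/train/",
            "S3DataDistributionType": "FullyReplicated",
        }},
        "ContentType": "text/plain",
    }],
    OutputDataConfig={"S3OutputPath": f"s3://{bucket}/model-artifacts/"},
    ResourceConfig={"InstanceType": "ml.c5.2xlarge", "InstanceCount": 1, "VolumeSizeInGB": 30},
    StoppingCondition={"MaxRuntimeInSeconds": 3600},
)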

When the training job is complete, an endpoint is created and hosted on Amazon SageMaker. This AWS Glue job takes a few minutes to complete because it waits until the endpoint is in InService status.
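A hedged sketch of that wait, using the boto3 waiter for SageMaker endpoints (the endpoint name is a placeholder):

import boto3

sm = boto3.client("sagemaker")
endpoint_name = "dbpedia-blazingtext-endpoint"  # placeholder name

# Block until the endpoint is usable; this is why the Glue job takes a few minutes.
sm.get_waiter("endpoint_in_service").wait(EndpointName=endpoint_name)
status = sm.describe_endpoint(EndpointName=endpoint_name)["EndpointStatus"]
print("Endpoint status:", status)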

At the end of the workflow, a message is sent to an Amazon Simple Queue Service (Amazon SQS) queue, which you can use to integrate with the rest of the application. You can also use the queue to trigger an action to send emails to data scientists that signal the completion of training, add records to management or log tables, and more.
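The notification itself is a single SQS message; a minimal sketch (the queue URL and message fields are placeholders, not the stack's actual format) looks like the following:

import json
import boto3

sqs = boto3.client("sqs")
queue_url = "https://sqs.us-east-1.amazonaws.com/111122223333/ml-pipeline-notifications"  # placeholder

# Downstream consumers (email notifications, log tables) read this message.
sqs.send_message(
    QueueUrl=queue_url,
    MessageBody=json.dumps({"status": "TRAINING_COMPLETE", "endpoint": "dbpedia-blazingtext-endpoint"}),
)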

Setting up the environment

To set up the environment, complete the following steps:

  1. Configure the AWS Command Line Interface (AWS CLI) and a profile to use to run the code. For instructions, see Configuring the AWS CLI.
  2. Make sure you have the Unix utility wget installed on your machine to download the DBpedia dataset from the internet.
  3. Download the following code into your local directory.

Organization of code

The code to build the pipeline has the following directory structure:

--Glue workflow orchestration
    --glue_scripts
        --DataExtractionJob.py
        --DataProcessingJob.py
        --MessagingQueueJob.py
        --TrainingJob.py
    --base_resources.template
    --deploy.sh
    --glue_resources.template
The code directory is divided into three parts:

  • AWS CloudFormation templates – The directory has two AWS CloudFormation templates: glue_resources.template and base_resources.template. The glue_resources.template template creates the AWS Glue workflow-related resources, and base_resources.template creates the Amazon S3, AWS Identity and Access Management (IAM), and SQS queue resources. The CloudFormation templates create the resources and write their names and ARNs to AWS Systems Manager Parameter Store, which allows easy and secure access to ARNs further in the workflow.
  • AWS Glue scripts – The folder glue_scripts holds the scripts that correspond to each AWS Glue job. This includes the ETL as well as model training and deploying scripts. The scripts are copied to the correct S3 bucket when the bash script runs.
  • Bash script – A wrapper script deploy.sh is the entry point to running the pipeline. It runs the CloudFormation templates and creates resources in the dev, test, and prod environments. You use the environment name, also referred to as stage in the script, as a prefix to the resource names. The bash script performs other tasks, such as downloading the training data and copying the scripts to their respective S3 buckets. However, in a real-world use case, you can extract the training data from databases as a part of the workflow using crawlers.

Implementing the solution

Complete the following steps:

  1. Go to the deploy.sh file and replace the algorithm_image value with the <ecr_path> for your Region.

The following code example is a path for Region us-west-2:

algorithm_image="433757028032.dkr.ecr.us-west-2.amazonaws.com/blazingtext:latest"

For more information about BlazingText parameters, see Common parameters for built-in algorithms.

  2. Enter the following code in your terminal:
    sh deploy.sh -s dev AWS_PROFILE=your_profile_name

This step sets up the infrastructure of the pipeline.

  3. On the AWS CloudFormation console, check that the templates have the status CREATE_COMPLETE.
  4. On the AWS Glue console, manually start the pipeline.

In a production scenario, you can trigger this manually through a UI or automate it by scheduling the workflow to run at the prescribed time. The workflow provides a visual of the chain of operations and the dependencies between the jobs.

  5. To begin the workflow, in the Workflow section, select DevMLWorkflow.
  6. From the Actions drop-down menu, choose Run.
  7. View the progress of your workflow on the History tab and select the latest RUN ID.

The workflow takes approximately 30 minutes to complete. The following screenshot shows the view of the workflow post-completion.

  8. After the workflow is successful, open the Amazon SageMaker console.
  9. Under Inference, choose Endpoints.

The following screenshot shows that the endpoint the workflow deployed is ready.

Amazon SageMaker also provides details about the model metrics calculated on the validation set in the training job window. You can further enhance model evaluation by invoking the endpoint using a test set and calculating the metrics as necessary for the application.
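For example, you can invoke the endpoint with held-out text and compare the returned labels against the ground truth. The following is a minimal sketch using the BlazingText JSON interface; the endpoint name is a placeholder:

import json
import boto3

runtime = boto3.client("sagemaker-runtime")
endpoint_name = "dbpedia-blazingtext-endpoint"  # placeholder name

payload = {"instances": ["Convair was an American aircraft manufacturing company."]}
response = runtime.invoke_endpoint(
    EndpointName=endpoint_name,
    ContentType="application/json",
    Body=json.dumps(payload),
)
predictions = json.loads(response["Body"].read())
print(predictions)  # each prediction contains a label list and a probability list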

Cleaning up

Make sure to delete the Amazon SageMaker hosting services—endpoints, endpoint configurations, and model artifacts. Delete both CloudFormation stacks to roll back all other resources. See the following code:

def delete_resources(self):
    # 'sagemaker' is the boto3 SageMaker client created elsewhere in the cleanup script;
    # the endpoint, endpoint configuration, and model share the same name.
    endpoint_name = self.endpoint

    try:
        sagemaker.delete_endpoint(EndpointName=endpoint_name)
        print("Deleted Test Endpoint ", endpoint_name)
    except Exception as e:
        print("Model endpoint deletion failed")

    try:
        sagemaker.delete_endpoint_config(EndpointConfigName=endpoint_name)
        print("Deleted Test Endpoint Configuration ", endpoint_name)
    except Exception as e:
        print("Endpoint config deletion failed")

    try:
        sagemaker.delete_model(ModelName=endpoint_name)
        print("Deleted Test Endpoint Model ", endpoint_name)
    except Exception as e:
        print("Model deletion failed")

Conclusion

This post describes a way to build an automated ML pipeline that not only trains and deploys ML models using a managed service such as Amazon SageMaker, but also performs ETL within a managed service such as AWS Glue. A managed service unburdens you from allocating and managing resources, such as Spark clusters, and makes it easy to move from notebook setups to production pipelines.


About the Authors

Sai Sharanya Nalla is a Data Scientist at AWS Professional Services. She works with customers to develop and implement AI and ML solutions on AWS. In her spare time, she enjoys listening to podcasts and audiobooks, long walks, and engaging in outreach activities.


Inchara B Diwakar is a Data Scientist at AWS Professional Services. She designs and engineers ML solutions at scale, with experience across healthcare, manufacturing and retail verticals. Outside of work, she enjoys the outdoors, traveling and a good read.

Read More

BERT inference on G4 instances using Apache MXNet and GluonNLP: 1 million requests for 20 cents

Bidirectional Encoder Representations from Transformers (BERT) [1] has become one of the most popular models for natural language processing (NLP) applications. BERT can outperform other models in several NLP tasks, including question answering and sentence classification.

Training the BERT model on large datasets is expensive and time consuming, and achieving low latency when performing inference on this model is challenging. Latency and throughput are key factors to deploy a model in production. In this post, we focus on optimizing these factors for BERT inference tasks. We also compare the cost of deploying BERT on different Amazon Elastic Compute Cloud (Amazon EC2) instances.

When running inference on the BERT-base model, the g4dn.xlarge GPU instance achieves between 2.6 and 5 times lower latency (3.8 on average) than a c5.24xlarge CPU instance. The g4dn.xlarge instance also achieves the best cost-effectiveness ratio (cost per request) compared to c5.xlarge, c5.24xlarge, and m5.xlarge CPU instances. Specifically, the cost of processing 1 million BERT-inference requests with sequence length 128 is $0.20 on g4dn.xlarge, whereas on c5.xlarge (the best of these CPU instances), the cost is $3.31—the GPU instance is 16.5 times more efficient.

We achieved these results after a set of GPU optimizations on MXNet, described in the section Optimizing BERT model performance on MXNET 1.6 and 1.7 of this post.

Amazon EC2 G4 instances

G4 instances are optimized for machine learning application deployments. They’re equipped with NVIDIA T4 GPUs, powered by Tensor Cores, and deliver groundbreaking AI performance: up to 65 TFLOPS in FP16 precision and up to 130 TOPS in INT8 precision.

Amazon EC2 offers a variety of G4 instances with one or multiple GPUs, and with different amounts of vCPU and memory. You can perform BERT inference below 5 milliseconds on a single T4 GPU with 16 GB, such as on a g4dn.xlarge instance (the cost of this instance at the time of writing is $0.526 per hour on demand in the US East (N. Virginia) Region).

For more information about G4 instances, see Amazon EC2 G4 Instances.

GluonNLP and MXNet

GluonNLP is a deep learning framework built on top of MXNet, which was specifically designed for NLP applications. It extends MXNet, providing NLP models, datasets, and examples.

GluonNLP includes an efficient implementation of the BERT model, scripts for training and performing inference, and several datasets (such as GLUE benchmark and SQuAD). For more information, see GluonNLP: NLP made easy.

For this post, we use the GluonNLP BERT implementation to perform inference on NLP tasks. Specifically, we use MXNet version 1.7 and GluonNLP version 0.10.0.

BERT-base inference results

We present results performing two different BERT tasks: question answering and classification (sentiment analysis using the Stanford Sentiment Treebank (SST2) dataset). We achieved the results after a set of GPU optimizations on MXNet.

In the following graphs, we compare latency achieved by a single GPU on a g4dn.xlarge instance with FP16 precision vs. the most efficient CPU instance in terms of latency, c5.24xlarge with INT8 precision, MKL BLAS and 24 OpenMP threads.

The following graph shows BERT-base latency on c5.24xlarge (INT8) and g4dn.xlarge (FP16) instances performing a classification inference task (SST2 dataset). Different sequence length values (80, 128, 384) and different batch sizes (1, 4, 8, 16, 32, 64, 128, 300) are shown. For sequence length 128, the values are included as labels.

The following graph shows BERT-base latency on c5.24xlarge (INT8) and g4dn.xlarge (FP16) instances performing a question answering inference task (SQuAD dataset). Different sequence length values (80, 128, 384) and different batch sizes (1, 4, 8, 16, 32, 64, 128, 300) are shown. For sequence length 128, the values are included as labels.


In the following two graphs, we present a cost comparison between several instances based on the throughput (sentences/s) and the cost of each instance on demand (cost per hour) in the US East (N. Virginia) Region.

The following graph shows dollars per 1 million sequence classification requests, for different instances, batch size 128, and several sequence lengths (80, 128, and 384). The on-demand price of each instance per hour was based on the US East (N. Virginia) Region: $0.192 for m5.xlarge, $0.17 for c5.xlarge, $4.08 for c5.24xlarge, and $0.526 for g4dn.xlarge.

The following graph shows dollars per 1 million question answering requests, for different instances, batch size 128, and several sequence lengths (80, 128, and 384). The on-demand price of each instance per hour was based on the US East (N. Virginia) Region: $0.192 for m5.xlarge, $0.17 for c5.xlarge, $4.08 for c5.24xlarge, and $0.526 for g4dn.xlarge.
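The conversion from throughput to cost per million requests is straightforward. The following sketch shows the arithmetic; the throughput value is an illustrative assumption chosen to be consistent with the $0.20 headline figure, not a measured number:

def cost_per_million(price_per_hour, sentences_per_second):
    # Dollars needed to process 1 million requests at a sustained throughput.
    seconds_needed = 1_000_000 / sentences_per_second
    return price_per_hour * seconds_needed / 3600

# Example: a g4dn.xlarge at $0.526/hour sustaining ~730 sentences/s is about $0.20 per million.
print(round(cost_per_million(0.526, 730), 2))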

Deploying BERT on G4 instances

You can easily reproduce the results in the preceding section on a g4dn.xlarge instance. You can start from a pretrained model and fine-tune it for a specific task before running inference, or you can download one of the following fine-tuned models:

Then complete the following steps:

  1. To initialize a G4 instance, on the Amazon EC2 console, choose Deep Learning AMI (Ubuntu 18.04) Version 28.1 (or later) and a G4 instance.
  2. Connect to the instance and set up MXNet 1.7 and GluonNLP 0.10.x:
pip install mxnet-cu102==1.7.0
git clone --branch v0.10.x https://github.com/dmlc/gluon-nlp.git
cd gluon-nlp; pip install -e .; cd scripts/bert
python setup.py install

The command python setup.py install generates a custom graph pass (bertpass_lib.so) that optimizes the graph and therefore improves performance. It can be passed to the inference script as an argument.

  3. If you didn’t download any fine-tuned parameters, you can now fine-tune your model to specify a sequence length and use a GPU.
    • For a question answering task, run the following script (approximately 180 minutes):
python3 finetune_squad.py --max_seq_length 128 --gpu
    • For a classification task, run the following script:
python3 finetune_classifier.py --task_name [task_name] --max_len 128 --gpu 0

In the preceding code, task choices include ‘MRPC’, ‘QQP’, ‘QNLI’, ‘RTE’, ‘STS-B’, ‘CoLA’, ‘MNLI’, ‘WNLI’, ‘SST’ (refers to SST2), ‘XNLI’, ‘LCQMC’, and ‘ChnSentiCorp’. Computation time depends on the specific task. For SST, it should take less than 15 minutes.

By default, these scripts run 3 epochs (to achieve the published accuracy in [1]).

They generate an output file, output_dir/net.params, where the fine-tuned parameters are stored and from where they can be loaded at the inference step. The scripts also perform a prediction test to check accuracy.

You should get an F1 score of 85 or higher in question answering, and a validation metric higher than 0.92 in the SST classification task.

You can now perform inference using validation datasets.

  4. Force MXNet to use FP32 precision in Softmax and LayerNorm layers for better accuracy when using FP16.

These two layers are susceptible to overflow, so we recommend always using FP32. MXNet takes care of it if you set the following:

export MXNET_SAFE_ACCUMULATION=1
  5. Activate True FP16 computation for performance purposes.

General matrix multiply operations don’t present accuracy issues in this model. By default, they’re computed using FP32 accumulation (for more information, see the section Optimizing BERT model performance on MXNET 1.6 and 1.7 in this post), but you can activate the FP16 accumulation setting:

export MXNET_FC_TRUE_FP16=1
  6. Run inference:
python3 deploy.py --model_parameters [path_to_finetuned_params] --task [_task_] --gpu 0 --dtype float16 --custom_pass=bertpass_lib.so

In the preceding code, the task can be one of ‘QA’, ‘embedding’, ‘MRPC’, ‘QQP’, ‘QNLI’, ‘RTE’, ‘STS-B’, ‘CoLA’, ‘MNLI’, ‘WNLI’, ‘SST’, ‘XNLI’, ‘LCQMC’, or ‘ChnSentiCorp’ [1].

This command exports the model (JSON and parameter files) into the output directory (output_dir/[task_name]), and performs inference using the validation dataset corresponding to each task.

It reports the average latency and throughput.

The second time you run it, you can skip the export step by adding the tag --only_infer and specifying the exported model to use by adding --exported_model followed by the prefix name of the JSON or parameter files.

Optimal latency is achieved on G4 instances with FP16 precision. We recommend adding the flag --dtype float16 and activating MXNET_FC_TRUE_FP16 when performing inference. These flags shouldn’t reduce the final accuracy in your results.

By default, all these scripts use BERT-base (12 transformer-encoder layers). If you want to use BERT-large, use the flag --bert_model bert_24_1024_16 when calling the scripts.

Optimizing BERT model performance on MXNet 1.6 and 1.7

Computationally, the BERT model is mainly dominated by general matrix multiply operations (GEMMs). They represent up to 56% of time consumed when performing inference. The following chart shows the percentage of computational time spent on each operation type performing BERT-base inference (sequence length 128 and batch size 128).

MXNet uses the cuBLAS library to efficiently compute these GEMMs on the GPU. These GEMMs belong to the multi-head self-attention part of the model (4 GEMMs per transformer layer), and the feed-forward network (2 GEMMs per transformer layer).

In this section, we discuss optimizing the most computational-consuming operations.

The following table shows the improvement of each optimization. The performance improvements were achieved by the different GPU BERT optimizations implemented on MXNet and GluonNLP, performing a question answering inference task (SQuAD dataset), and using a sequence length of 128. Speedup achieved is shown for different batch sizes.

LayerNorm, Softmax and AddBias

Although LayerNorm was already optimized for GPUs on MXNet 1.5, the implementation of Softmax was optimized in MXNet 1.6. The new implementation improves inference performance on GPUs by optimizing the device memory accesses and using the CUDA registers and shared memory during reduction operations more efficiently. Additionally, you have the option to apply a max_length mask within the C++ Softmax operator, which removes the need to apply the mask at the Python level.

The addition of bias terms following GEMMs was also optimized. Instead of using an mshadow broadcast summation, a custom CUDA kernel is now attached to the FullyConnected layer, which includes efficient device memory accesses.

Multi-head self-attention

The following equation defines the attention mechanism used in the BERT model [2], where Q represents the query, K the key, V the value, and dk the inner dimension of these three matrices:
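$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V$$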

Three different linear projections (FullyConnected: GEMMs and Bias-Addition) are performed to obtain Q, K, and V from the same input (when the same input is used, the mechanism is called self-attention), but with different weights:

    • Q = input · Wq^T
    • K = input · Wk^T
    • V = input · Wv^T

The input size is (BatchSize, SeqLength, EmbeddingDim), and each weight tensor W size is (ProjectionDim, EmbeddingDim).

In multi-head attention, as many projections and attention functions as the number of heads are applied to the input, augmenting the dimensions of the weights so that each W size is ((NumHeads x ProjectionDim), EmbeddingDim).

All these projections are independent, so we can compute them in parallel within the same operation, producing an output whose size is (BatchSize, SeqLength, 3 x NumHeads x ProjectionDim). That is, GluonNLP uses a single FullyConnected layer to compute Q, K, and V.

To compute the attention function (the preceding equation), we first need to compute the dot product QKT. We need to perform this computation independently for each head, with m=SeqLength number of Q rows, n=SeqLength number of K columns, and k=ProjectionDim size of vectors in the dot product. We can use a batched dot product operation, where the number of batches is (BatchSize x NumHeads), to compute all the dot products within the same operation.

However, to perform such an operation in cuBLAS, we need to have the batches and heads dimensions contiguous (in order to have a regular pattern to express distance between batches), but that isn’t the case by default (SeqLength dimension is between them). To avoid rearranging Q, K, and V, GluonNLP transposes the input so that its shape is (SeqLength, BatchSize, EmbeddingDim), and Q, K, and V are directly projected into a tensor with shape (SeqLength, BatchSize, 3 x NumHeads x ProjectionDim).

Moreover, to avoid splitting the joint QKV output, we can compute the projections in an interleaved fashion, laying out the applied weights Wq, Wk, and Wv of each individual head contiguously. The following diagram depicts the interleaved projection operation, where P is the projection size, and we end with a joint QKV output with shape (SeqLength, BatchSize, NumHeads x 3 x ProjectionDim).

This strategy allows us to compute QKT from a unique joint input tensor with cuBLASGEMMStridedBatched, setting the number of batches to (BatchSize x NumHeads) and the stride to (3 x ProjectionDim). We also use a strided batched GEMM to compute the dot product of V (same stride as before) with the output of the Softmax function. We implemented MXNet operators that deal with this cuBLAS configuration.
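The following NumPy sketch only illustrates the shapes involved; unlike the cuBLAS strided batched GEMM, which reads Q and K directly from the interleaved tensor with a stride of 3 x ProjectionDim, NumPy has to rearrange the data explicitly:

import numpy as np

seq_len, batch, num_heads, proj = 128, 8, 12, 64

# Joint interleaved projection output: (SeqLength, BatchSize, NumHeads x 3 x ProjectionDim)
qkv = np.random.rand(seq_len, batch, num_heads * 3 * proj).astype(np.float32)

# View the last axis as (NumHeads, 3, ProjectionDim) to pull out Q and K per head.
qkv = qkv.reshape(seq_len, batch, num_heads, 3, proj)
q, k = qkv[:, :, :, 0, :], qkv[:, :, :, 1, :]

# Make (BatchSize x NumHeads) the leading batch dimension, then batch the dot products.
q = q.transpose(1, 2, 0, 3).reshape(batch * num_heads, seq_len, proj)
k = k.transpose(1, 2, 0, 3).reshape(batch * num_heads, seq_len, proj)
scores = np.matmul(q, k.transpose(0, 2, 1))
print(scores.shape)  # (BatchSize x NumHeads, SeqLength, SeqLength)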

True FP16

Since MXNet 1.7, you can compute GEMMs entirely in FP16 precision. By default, when the data type is FP16, MXNet sets cuBLAS to internally use FP32 accumulation. You can now set the environment variable MXNET_FC_TRUE_FP16 to 1 to force MXNet to use FP16 as the cuBLAS internal computation type.

Pointwise fusion and prearrangement of MHA weights and bias using a custom graph pass

Finally, the feed-forward part of the model, which happens after each transformer layer, uses the Gaussian Error Linear Unit (GELU) as its activation function. This operation follows a feed-forward FullyConnected operation, which includes bias addition. We use the MXNet functionality of custom graph passes to detach the bias addition from the FullyConnected operation and fuse it with GELU through the pointwise fusion mechanism.

In our custom graph pass for BERT, we also prearrange the weights and bias terms for the multi-head self-attention computation so that we avoid any overhead at runtime. As explained earlier, the weights need to be interleaved, and the bias terms need to be joined into a single tensor. We do this before exporting the model. This strategy shows benefits in small batch size cases.

Conclusion

In this post, we presented an efficient solution for performing BERT inference tasks on EC2 G4 GPU instances. We showed how a set of MXNet optimizations boost GPU performance, achieving speeds up to twice as fast in both question answering and classification tasks.

We have shown that g4dn.xlarge instances offer lower latency (below 4 milliseconds with batch size 1) than any EC2 CPU instance, and g4dn.xlarge is 3.8 times better than c5.24xlarge on average. Finally, g4dn.xlarge offers the best cost per million requests ratio—16 times better than CPU instances (c5.xlarge) on average.


Acknowledgments

We would like to thank Triston Cao, Murat Guney from NVIDIA, Sandeep Krishnamurthy from Amazon, the Amazon-MXNet team, and the NVIDIA MXNet team for their feedback and support.

Disclaimer

The content and opinions in this post are those of the third-party authors and AWS is not responsible for the content or accuracy of this post.

References

  1. Devlin, Jacob, et al. “Bert: Pre-training of deep bidirectional transformers for language understanding.” arXiv preprint arXiv:1810.04805 (2018).
  2. Vaswani, Ashish, et al. “Attention is all you need.” Advances in neural information processing systems. 2017.

About the Authors

Moises Hernandez Fernandez is an AI DevTech Engineer at NVIDIA. He works on accelerating NLP applications on GPUs. Before joining NVIDIA, he conducted research into brain connectivity, optimizing the analysis of diffusion MRI using GPUs. Moises received a PhD in Neurosciences from the University of Oxford.

Haibin Lin is a former Applied Scientist at Amazon Web Services. He works on distributed systems, deep learning, and NLP. He is a PPMC member and committer of Apache MXNet, and a major contributor to the GluonNLP toolkit. He finished his M.S. in Computer Science at Carnegie Mellon University, advised by Andy Pavlo. Prior to that, he received a B.Eng. in Computer Science jointly from the University of Hong Kong and Shanghai Jiao Tong University.

Przemyslaw Tredak is a senior developer technology engineer on the Deep Learning Frameworks team at NVIDIA. He is a committer of Apache MXNet and leads the MXNet team at NVIDIA.

Anish Mohan is a Machine Learning Architect at Nvidia and the technical lead for ML/DL engagements with key Nvidia customers in the greater Seattle region. Before Nvidia, he was at Microsoft’s AI Division, working to develop and deploy AI/ML algorithms and solutions.

Read More

Active learning workflow for Amazon Comprehend custom classification models – Part 1.

The Amazon Comprehend Custom Classification API enables you to easily build custom text classification models using your business-specific labels without learning ML. For example, your customer support organization can use Custom Classification to automatically categorize inbound requests by problem type based on how the customer has described the issue. You can use custom classifiers to automatically label support emails with appropriate issue types, route customer phone calls to the right agents, and categorize social media posts into user segments.

For custom classification, you start by creating a training job with a ground truth dataset comprising a collection of text and corresponding category labels. Upon completing the job, you have a classifier that can classify any new text into one or more named categories. When the custom classification model classifies a new unlabeled text document, it predicts based on what it has learned from the training data. Sometimes you may not have a training dataset with varied language patterns, or after you deploy the model, you start seeing completely new data patterns. In these cases, the model may not be able to classify these new data patterns accurately. How can we ensure continuous model training to keep it up to date with new data and patterns?

In this two-part blog series, we discuss an architecture pattern that allows you to build an active learning workflow for Amazon Comprehend custom classification models. This first post describes a workflow comprising real-time classification, feedback pipelines, and human review workflows using Amazon Augmented AI (Amazon A2I). The second post covers automated model building using the human-reviewed data, selecting the best model, and automated deployment of an endpoint for the chosen model.

Feedback loops play a pivotal role in keeping the models up to date. This feedback helps the models learn from their misclassifications and arrive at the right classifications. This process of continuously teaching the models through feedback and deploying them is called active learning.

For every prediction Amazon Comprehend Custom Classification makes, it also gives a confidence score associated with its prediction. This architecture proposes that you set an acceptable threshold and only accept the predictions with a confidence score that exceeds the threshold. All the predictions that have a confidence score less than the desired threshold are flagged for human review. The human decides whether to accept the model’s prediction or correct it.
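The following is a minimal sketch of that threshold logic using the boto3 Comprehend client; the endpoint ARN and threshold value are placeholders, and the stack's actual Lambda code may differ:

import boto3

comprehend = boto3.client("comprehend")
ENDPOINT_ARN = "arn:aws:comprehend:us-east-1:111122223333:document-classifier-endpoint/classify-news-endpoint"  # placeholder
THRESHOLD = 0.5

def classify_with_review_flag(text):
    response = comprehend.classify_document(Text=text, EndpointArn=ENDPOINT_ARN)
    top = max(response["Classes"], key=lambda c: c["Score"])
    # Predictions below the threshold are routed to human review (implicit feedback).
    needs_review = top["Score"] < THRESHOLD
    return top["Name"], top["Score"], needs_review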

In some instances, the model may be confident about its predictions, but the classification might be wrong. In these scenarios, the end-user applications that receive the model predictions can request explicit feedback from its users on the prediction quality. A human moderator reviews this explicit feedback and reclassifies instances where the feedback was negative. This process of generating human-verified data and using it for model retraining helps keep the models up to date, reduce data drift, and achieve higher model accuracy.

Feedback workflow architecture

In this section, we discuss an architectural pattern for implementing an end-to-end active learning workflow for custom classification models in Amazon Comprehend using Amazon A2I. The active learning workflow comprises the following components:

  1. Real-time classification
  2. Feedback loops
  3. Human classification
  4. Model building
  5. Model selection
  6. Model deployment

The following diagram illustrates this architecture covering the first three components. In the following sections, we walk you through each step in the workflow.

Architecture Diagram for Feedback Loops

Real-time classification

To use custom classification in Amazon Comprehend, you need to create a custom classification job that reads a ground truth dataset from an Amazon Simple Storage Service (Amazon S3) bucket and builds a classification model. After the model builds successfully, you can create an endpoint that allows you to make real-time classifications of unlabeled text. This stage is represented by steps 1–3 in the preceding architecture:

  1. The end-user application calls an API Gateway endpoint with a text that needs to be classified.
  2. The API Gateway endpoint then calls an AWS Lambda function configured to call an Amazon Comprehend endpoint.
  3. The Lambda function calls the Amazon Comprehend endpoint, which returns the unlabeled text classification and a confidence score.

Feedback collection

When the endpoint returns the classification and the confidence score during the real-time classification, you can send instances with low-confidence scores to human review. This type of feedback is called implicit feedback.

  4. The Lambda function sends the implicit feedback to an Amazon Kinesis Data Firehose delivery stream.

The other type of feedback is called explicit feedback and comes from the application’s end-users that use the custom classification feature. This type of feedback comprises the instances of text where the user wasn’t happy with the prediction. Explicit feedback can be sent either in real-time through an API or a batch process.

  5. End-users of the application submit explicit real-time feedback through an API Gateway endpoint.
  6. The Lambda function backing the API endpoint transforms the data into a standard feedback format and writes it to the Kinesis Data Firehose delivery stream (see the sketch after this list).
  7. End-users of the application can also submit explicit feedback as a batch file by uploading it to an S3 bucket.
  8. A trigger configured on the S3 bucket invokes a Lambda function.
  9. The Lambda function transforms the data into a standard feedback format and writes it to the delivery stream.
  10. Both the implicit and explicit feedback data gets sent to the delivery stream in a standard format. All this data is buffered and written to an S3 bucket.
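A minimal sketch of the Firehose write in steps 4, 6, and 9 follows; the delivery stream name and record fields are illustrative, not the exact format the CloudFormation stack uses:

import json
import boto3

firehose = boto3.client("firehose")
DELIVERY_STREAM = "feedback-delivery-stream"  # placeholder name

def send_feedback(classifier, sentence, predicted_label, source):
    # One feedback record in a standard format; Firehose buffers records and writes them to S3.
    record = {
        "classifier": classifier,
        "sentence": sentence,
        "predicted_label": predicted_label,
        "source": source,  # "implicit" or "explicit"
    }
    firehose.put_record(
        DeliveryStreamName=DELIVERY_STREAM,
        Record={"Data": (json.dumps(record) + "\n").encode("utf-8")},
    )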

Human classification

The human classification stage includes the following steps:

  11. A trigger configured on the feedback bucket in Step 10 invokes a Lambda function.
  12. The Lambda function creates Amazon A2I human review tasks for all the feedback data received (see the sketch after this list).
  13. Workers assigned to the classification jobs log in to the human review portal and either approve the classification by the model or classify the text with the right labels.
  14. After the human review, all these instances are stored in an S3 bucket and used for retraining the models. Part 2 of this series covers the retraining workflow.
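A minimal sketch of step 12 with the boto3 Amazon A2I runtime client follows; the flow definition ARN is a placeholder, and the InputContent keys must match the variables your worker task template expects:

import json
import uuid
import boto3

a2i = boto3.client("sagemaker-a2i-runtime")
FLOW_DEFINITION_ARN = "arn:aws:sagemaker:us-east-1:111122223333:flow-definition/classify-workflow"  # placeholder

def start_review(sentence, predicted_label):
    # Each feedback instance becomes one human loop, reviewed in the labeling portal.
    a2i.start_human_loop(
        HumanLoopName=f"feedback-{uuid.uuid4()}",
        FlowDefinitionArn=FLOW_DEFINITION_ARN,
        HumanLoopInput={
            "InputContent": json.dumps(
                {"taskObject": {"sentence": sentence, "predicted_label": predicted_label}}
            )
        },
    )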

Solution overview

The next few sections of the post go over how to set up this architecture in your AWS account. We classify news into four categories: World, Sports, Business, and Sci/Tech, using the AG News dataset for custom classification, and set up the implicit and explicit feedback loop. You need to complete two manual steps:

  1. Create an Amazon Comprehend custom classifier and an endpoint.
  2. Create an Amazon SageMaker private workforce, worker task template, and human review workflow.

After this, you run the provided AWS CloudFormation template to set up the rest of the architecture.

Prerequisites

Before you get started, download the dataset and upload it to Amazon S3. This dataset comprises a collection of news articles and their corresponding category labels. We have created a training dataset called train.csv from the original dataset and made it available for download.

The following screenshot shows a sample of the train.csv file.

CSV file representing the Training data set

After you download the train.csv file, upload it to an S3 bucket in your account for reference during training. For more information about uploading files, see How do I upload files and folders to an S3 bucket?

Creating a custom classifier and an endpoint

To create your classifier for classifying news, complete the following steps:

  1. On the Amazon Comprehend console, choose Custom Classification.
  2. Choose Train classifier.
  3. For Name, enter news-classifier-demo.
  4. Select Using Multi-class mode.
  5. For Training data S3 location, enter the path for train.csv in your S3 bucket, for example, s3://<your-bucketname>/train.csv.
  6. For Output data S3 location, enter the S3 bucket path where you want the output, such as s3://<your-bucketname>/.
  7. For IAM role, select Create an IAM role.
  8. For Permissions to access, choose Input and output (if specified) S3 bucket.
  9. For Name suffix, enter ComprehendCustom.

Comprehend Custom Classification Model Creation

  10. Scroll down and choose Train Classifier to start the training process.

The training takes some time to complete. You can either wait to create an endpoint or come back to this step later after finishing the steps in the section Creating a private workforce, worker task template, and human review workflow.

Creating a custom classifier real-time endpoint

To create your endpoint, complete the following steps:

  1. On the Amazon Comprehend console, choose Custom Classification.
  2. From the Classifiers list, choose the name of the custom model for which you want to create the endpoint and select your model news-classifier-demo.
  3. From the Actions drop-down menu, choose Create endpoint.
  4. For Endpoint name, enter classify-news-endpoint and give it one inference unit.
  5. Choose Create endpoint
  6. Copy the endpoint ARN as shown in the following screenshot. You use it when running the CloudFormation template in a future step.

Custom Classification Model Endpoint Page

Creating a private workforce, worker task template, and human review workflow

This section walks you through creating a private workforce in Amazon SageMaker, a worker task template, and your human review workflow.

Creating a labeling workforce

  1. For this post, you will create a private work team and add only one user (you) to it. For instructions, see Create a Private Workforce (Amazon SageMaker Console).
  2. Once the user accepts the invitation, you need to add them to the workforce. For instructions, see the Add a Worker to a Work Team section in Manage a Workforce (Amazon SageMaker Console).

Creating a worker task template

To create a worker task template, complete the following steps:

  1. On the Amazon A2I console, choose Worker task templates.
  2. Choose Create template.
  3. For Template name, enter custom-classification-template.
  4. For Template type, choose Custom.
  5. In the Template editor, enter the following GitHub UI template code.
  6. Choose Create.

Worker Task Template

Creating a human review workflow

To create your human review workflow, complete the following steps:

  1. On the Amazon A2I console, choose Human review workflows.
  2. Choose Create human review workflow.
  3. For Name, enter classify-workflow.
  4. Specify an S3 bucket to store output: s3://<your bucketname>/.

Use the same bucket where you downloaded your train.csv in the prerequisite step.

  5. For IAM role, select Create a new role.
  6. For Task type, choose Custom.
  7. Under Worker task template creation, select the custom classification template you created.
  8. For Task description, enter Read the instructions and review the document.
  9. Under Workers, select Private.
  10. Use the drop-down list to choose the private team that you created.
  11. Choose Create.
  12. Copy the workflow ARN (see the following screenshot) to use when initializing the CloudFormation parameters.

Human Review Workflow Page

Deploying the CloudFormation template to set up active learning feedback

Now that you have completed the manual steps, you can run the CloudFormation template to set up this architecture’s building blocks, including the real-time classification, feedback collection, and the human classification.

Before deploying the CloudFormation template, make sure you have the following to pass as parameters:

  • Custom classifier endpoint ARN
  • Amazon A2I workflow ARN
  1. Choose Launch Stack:

  2. Enter the following parameters:
    1. ComprehendEndpointARN – The endpoint ARN you copied.
    2. HumanReviewWorkflowARN – The workflow ARN you copied.
    3. ComrehendClassificationScoreThreshold – Enter 0.5, which means a 50% threshold for low confidence score.

CloudFormation Required Parameters

  3. Choose Next until you reach the Capabilities section.
  4. Select the check box to acknowledge that AWS CloudFormation may create AWS Identity and Access Management (IAM) resources and expand the template.

For more information about these resources, see AWS IAM resources.

  5. Choose Create stack.

Acknowledgement section of the CloudFormation Page

Wait until the status of the stack changes from CREATE_IN_PROGRESS to CREATE_COMPLETE.

CloudFormation Outputs

  6. On the Outputs tab of the stack (see the following screenshot), copy the values for BatchUploadS3Bucket, FeedbackAPIGatewayID, and TextClassificationAPIGatewayID to interact with the feedback loop.
  7. Both the TextClassificationAPI and the FeedbackAPI require an API key. The CloudFormation output ApiGWKey refers to the name of the API key. Currently, this API key is associated with a usage plan that allows 2,000 requests per month.
  8. On the API Gateway console, choose either the TextClassification API or the FeedbackAPI. Choose API Keys from the left navigation pane, choose your API key from step 7, expand the API key section in the right pane, and copy the value.

API Key page

  9. You can manage the usage plan by following the instructions in Create, configure, and test usage plans with the API Gateway console.
  10. You can also add fine-grained authentication and authorization to your APIs. For more information on securing your APIs, see Controlling and managing access to a REST API in API Gateway.

Testing the feedback loop

In this section, we walk you through testing your feedback loop, including real-time classification, implicit and explicit feedback, and human review tasks.

Real-time classification

To interact with and test these APIs, you need to download Postman.

The API Gateway endpoint receives an unlabeled text document from a client application and internally calls the custom classification endpoint, which returns the predicted label and a confidence score.

  1. Open Postman and enter the TextClassificationAPIGateway URL in POST method.
  2. In the Headers section, configure the API key: x-api-key: <<Your API key>>.
  3. In the text field, enter the following JSON code (make sure you have JSON selected and enable raw):
{"classifier":"<your custom classifier name>", "sentence":"MS Dhoni retires and a billion people had mixed feelings."}
  4. Choose Send.

You get a response back with a confidence score and class, as seen in the following screenshot.

Sample JSON request to the Classify Text API endpoint.
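If you prefer to script the same call instead of using Postman, a minimal sketch with the Python requests library looks like the following; the URL path, API key, and classifier name are placeholders:

import requests

url = "https://<api-id>.execute-api.us-east-1.amazonaws.com/<stage>/<resource>"  # your TextClassificationAPIGateway URL
headers = {"x-api-key": "<Your API key>", "Content-Type": "application/json"}
payload = {
    "classifier": "news-classifier-demo",
    "sentence": "MS Dhoni retires and a billion people had mixed feelings.",
}

response = requests.post(url, headers=headers, json=payload)
print(response.json())  # predicted class and confidence score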

Implicit feedback

When the endpoint returns the classification and the confidence score during the real-time classification, you can route all the instances where the confidence score doesn’t meet the threshold to human review. This type of feedback is called implicit feedback. For this post, we set the threshold as 0.5 as an input to the CloudFormation stack parameter.

You can change this threshold when deploying the CloudFormation template based on your needs.

Explicit feedback

The explicit feedback comes from the end-users of the application that uses the custom classification feature. This type of feedback comprises the instances of text where the user wasn’t happy with the prediction. You can send explicit feedback on the model’s predicted label through the following methods:

  • Real time through an API, which is usually triggered through a like/dislike button on a UI.
  • Batch process, where a file with a collection of misclassified utterances is put together based on a user survey conducted by the customer outreach team.

Invoking the explicit real-time feedback loop

To test the Feedback API, complete the following steps:

  1. Open Postman and enter the FeedbackAPIGatewayID value from your CloudFormation stack output in POST method.
  2. In the Headers section, configure the API key: x-api-key: <<Your API key>>.
  3. In the text field, enter the following JSON code (for classifier, enter the classifier you created, such as news-classifier-demo, and make sure you have JSON selected and enable raw):
{"classifier":"<your custom classifier name>","sentence":"Sachin is Indian Cricketer."}
  4. Choose Send.

Sample JSON request to the Feedback API endpoint.

Submitting explicit feedback as a batch file

Download the following test feedback JSON file, populate it with your data, and upload it into the BatchUploadS3Bucket created when you deployed your CloudFormation template. The following code shows some sample data in the file:

{
   "classifier":"news-classifier-demo",
   "sentences":[
      "US music firms take legal action against 754 computer users alleged to illegally swap music online.",
      "A gamer spends $26,500 on a virtual island that exists only in a PC role-playing game."
   ]
}

Uploading the file triggers the Lambda function that starts your human review loop.

Human review tasks

All the feedback collected through the implicit and explicit methods is sent for human classification. The labeling workforce can include Amazon Mechanical Turk, private teams, or AWS Marketplace vendors. For this post, we create a private workforce. The URL to the labeling portal is located on the Amazon SageMaker console, on the Labeling workforces page, on the Private tab.

Private Workforce section of the SageMaker console.

After you log in, you can see the human review tasks assigned to you. Select the task to complete and choose Start working.

Human Review Task Page

You see the tasks displayed based on the worker template used when creating the human workflow.

Human Review Task

After you complete the human classification and submit the tasks, the human-reviewed data is stored in the S3 bucket you configured when creating the human review workflow. Go to Amazon SageMaker -> Human review workflows -> output location:

Human Review Task Output Location

This human-reviewed data is used to retrain the custom classification model to learn newer patterns and improve its overall accuracy. The following is a screenshot of the human-annotated output file output.json in the S3 bucket:

Human Review Task Output payload

The process of retraining the models with human-reviewed data, selecting the best model, and automatically deploying the new endpoints completes the active learning workflow. We cover these remaining steps in Part 2 of this series.

Cleaning up

To remove all resources created throughout this process and prevent additional costs, complete the following steps:

  1. On the Amazon S3 console, delete the S3 bucket that contains the training dataset.
  2. On the Amazon Comprehend console, delete the endpoint and the classifier.
  3. On the Amazon A2I console, delete the human review workflow, worker template, and the private workforce.
  4. On the AWS CloudFormation console, delete the stack you created. (This removes the resources the CloudFormation template created.)

Conclusion

Amazon Comprehend helps you build scalable and accurate natural language processing capabilities without any machine learning experience. This post provides a reusable pattern and infrastructure for active learning workflows for custom classification models. The feedback pipelines and human review workflow help the custom classifier learn new data patterns continuously. The second part of this series covers the automatic model building, selection, and deployment of custom classification models.

For more information, see Custom Classification. You can discover other Amazon Comprehend features and get inspiration from other AWS blog posts about how to use Amazon Comprehend beyond classification.


About the Authors

Shanthan Kesharaju is a Senior Architect in the AWS ProServe team. He helps customers with AI/ML strategy, architecture, and developing products with a purpose. Shanthan has an MBA in Marketing from Duke University and an MS in Management Information Systems from Oklahoma State University.


Mona Mona is an AI/ML Specialist Solutions Architect based out of Arlington, VA. She works with World Wide Public Sector team and helps customers adopt machine learning on a large scale. She is passionate about NLP and ML Explainability areas in AI/ML.


Joyson Neville Lewis obtained his master’s in Information Technology from Rutgers University in 2018. He has worked as a Software/Data engineer before diving into the Conversational AI domain in 2019, where he works with companies to connect the dots between business and AI using voice and chatbot solutions. Joyson joined Amazon Web Services in February of 2018 as a Big Data Consultant for AWS Professional Services team in NYC.

Read More

Data visualization and anomaly detection using Amazon Athena and Pandas from Amazon SageMaker

Many organizations use Amazon SageMaker for their machine learning (ML) requirements and source data from a data lake stored on Amazon Simple Storage Service (Amazon S3). The petabyte-scale source data on Amazon S3 may not always be clean because data lakes ingest data from several source systems, such as flat files, external feeds, databases, and Hadoop. It may contain extreme values in source attributes, considered outliers in the data. Outliers arise due to changes in system behavior, fraudulent behavior, human error, instrument error, missing data, or simply through natural deviations in populations. Outliers in training data can easily impact the accuracy of many ML models, like linear and logistic regression. These anomalies leave ML scientists and analysts facing skewed results. Outliers can dramatically impact ML models and change the model equation completely, leading to bad predictions or estimations.

Data scientists and analysts are looking for a way to remove outliers. Analysts often come from a strong data background and are fluent in writing SQL queries and using programming languages. The following tools are a natural choice for ML scientists to remove outliers and carry out data visualization:

  • Amazon Athena – An interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL
  • Pandas – An open-source, high-performance, easy-to-use library that provides data structures and data analysis tools, and works with plotting libraries like matplotlib, for the Python programming language
  • Amazon SageMaker – A fully managed service that provides you with the ability to build, train, and deploy ML models quickly

To illustrate how to use Athena with Pandas for anomaly detection and visualization using Amazon SageMaker, we clean a set of New York City Taxi and Limousine Commission (TLC) Trip Record Data by removing outlier records. In this dataset, outliers are records where a taxi trip’s duration spans multiple days, is 0 seconds, or is less than 0 seconds. Then we use the Pandas matplotlib library to plot graphs to visualize trip duration values.

Solution overview

To implement this solution, you perform the following high-level steps:

  1. Create an AWS Glue Data Catalog and browse the data on the Athena console.
  2. Create an Amazon SageMaker Jupyter notebook and install PyAthena.
  3. Identify anomalies using Athena SQL-Pandas from the Jupyter notebook.
  4. Visualize data and remove outliers using Athena SQL-Pandas.

The following diagram illustrates the architecture of this solution.

Prerequisites

To follow this post, you should be familiar with the following:

  • The Amazon S3 file upload process
  • AWS Glue crawlers and the Data Catalog
  • Basic SQL queries
  • Jupyter notebooks
  • Assigning a basic AWS Identity and Access Management (IAM) policy to a role

Preparing the data

For this post, we use New York City Taxi and Limousine Commission (TLC) Trip Record Data, which is a publicly available dataset.

  1. Download the file yellow_tripdata_2019-01.csv to your local machine.
  2. Create the S3 bucket s3-yellow-cab-trip-details (your bucket name will be different).
  3. Upload the file to your bucket using the Amazon S3 console.

Creating the Data Catalog and browsing the data

After you upload the data to Amazon S3, you create the Data Catalog in AWS Glue. This allows you to run SQL queries using Athena.

  1. On the AWS Glue console, create a new database.
  2. For Database name, enter db_yellow_cab_trip_details.

  3. Create an AWS Glue crawler to gather the metadata in the file and catalog it.

For this post, we use the database (db_yellow_cab_trip_details) and save the tables with the prefix src_.

  4. Run the crawler.

The crawler can take 2–3 minutes to complete. You can check the status on Amazon CloudWatch.
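If you prefer to script these steps instead of using the console, a minimal boto3 sketch might look like the following (the crawler name and IAM role are placeholders, not values from this post):

import boto3

glue = boto3.client('glue')

# create a crawler that catalogs the CSV file into our database
glue.create_crawler(
    Name='yellow-cab-trip-details-crawler',
    Role='<your-glue-service-role>',
    DatabaseName='db_yellow_cab_trip_details',
    TablePrefix='src_',
    Targets={'S3Targets': [{'Path': 's3://s3-yellow-cab-trip-details/'}]},
)

glue.start_crawler(Name='yellow-cab-trip-details-crawler')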

The following screenshot shows the crawler details on the AWS Glue console.

When the crawler is complete, the table is available in the Data Catalog. All the metadata and column-level information is displayed with corresponding data types.

We can now check the data on the Athena console to make sure we can read the file as a table and run a SQL query.

Run your query with the following code:

SELECT * FROM db_yellow_cab_trip_details.src_yellow_cab_trip_details limit 10;

The following screenshot shows your output on the Athena console.

Creating a Jupyter notebook and installing PyAthena

You now create a new notebook instance from Amazon SageMaker and install PyAthena using Jupyter.

Amazon SageMaker has managed built-in Jupyter notebooks that allow you to write code in Python, Julia, R, or Scala to explore, analyze, and do modeling with a small set of data.

Make sure the role used for your notebook has access to Athena (use IAM policies to verify, and attach the AmazonS3FullAccess and AmazonAthenaFullAccess managed policies if needed).

To create your notebook, complete the following steps:

  1. On the Amazon SageMaker console, under Notebook, choose Notebook instances.
  2. Choose Create notebook instance.

  3. On the Create notebook instance page, enter a name and choose an instance type.

We recommend using an ml.m4.10xlarge instance, due to the size of the dataset. You should choose an appropriate instance depending on your data; costs vary for different instances.

Wait until the Notebook instance status shows as InService (this step can take up to 5 minutes).

  4. When the instance is ready, choose Open Jupyter.

  5. Open the conda_python3 kernel from the notebook instance.
  6. Enter the following commands to install PyAthena:
! pip install --upgrade pip
! pip install PyAthena

The following screenshot shows the output.

You can also install PyAthena when you create the notebook instance by using lifecycle configurations. See the following code:

#!/bin/bash
sudo -u ec2-user -i <<'EOF'
source /home/ec2-user/anaconda3/bin/activate python3
pip install --upgrade pip
pip install --upgrade  PyAthena
source /home/ec2-user/anaconda3/bin/deactivate
EOF

The following screenshot shows where you enter the preceding code in the Scripts section when creating a lifecycle configuration.
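You can also create the lifecycle configuration programmatically. The following is a minimal sketch using boto3; the configuration name is hypothetical:

import base64
import boto3

sm = boto3.client('sagemaker')

on_create_script = """#!/bin/bash
sudo -u ec2-user -i <<'EOF'
source /home/ec2-user/anaconda3/bin/activate python3
pip install --upgrade pip PyAthena
source /home/ec2-user/anaconda3/bin/deactivate
EOF
"""

sm.create_notebook_instance_lifecycle_config(
    NotebookInstanceLifecycleConfigName='install-pyathena',   # hypothetical name
    OnCreate=[{'Content': base64.b64encode(on_create_script.encode('utf-8')).decode('utf-8')}],
)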

You can run a SQL query from the notebook to validate connectivity to Athena and pull data for visualization.

To import the libraries, enter the following code:

from pyathena import connect
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

from IPython.core.display import display, HTML
display(HTML("<style>.container { width:100% !important; }</style>"))

To connect to Athena, enter the following code:

conn = connect(s3_staging_dir='s3://<your-Query-result-location>',region_name='us-east-1')

To check the sample data, enter the following query:

df_sample = pd.read_sql("SELECT * FROM db_yellow_cab_trip_details.src_yellow_cab_trip_details limit 10", conn)
df_sample

The following screenshot shows the output.

Detecting anomalies with Athena, Pandas, and Amazon SageMaker

Now that we can connect to Athena, we can run SQL queries to find the records that have unusual trip_duration values.

The following Athena query checks anomalies in the trip_duration data to find the top 50 records with the maximum duration:

df_anomaly_duration = pd.read_sql("""select tpep_dropoff_datetime, tpep_pickup_datetime,
date_diff('second', cast(tpep_pickup_datetime as timestamp), cast(tpep_dropoff_datetime as timestamp)) as duration_second,
date_diff('minute', cast(tpep_pickup_datetime as timestamp), cast(tpep_dropoff_datetime as timestamp)) as duration_minute,
date_diff('hour', cast(tpep_pickup_datetime as timestamp), cast(tpep_dropoff_datetime as timestamp)) as duration_hour,
date_diff('day', cast(tpep_pickup_datetime as timestamp), cast(tpep_dropoff_datetime as timestamp)) as duration_day
from db_yellow_cab_trip_details.src_yellow_cab_trip_details
order by 3 desc limit 50""", conn)

df_anomaly_duration

The following screenshot shows the output; there are many outliers (trips with a duration greater than 1 day).

The output shows the duration in seconds, minutes, hours, and days.

The following query checks for anomalies and shows the top 50 records with the lowest minimum duration (negative value or 0 seconds):

df_anomaly_duration = pd.read_sql("""select tpep_dropoff_datetime, tpep_pickup_datetime,
date_diff('second', cast(tpep_pickup_datetime as timestamp), cast(tpep_dropoff_datetime as timestamp)) as duration_second,
date_diff('minute', cast(tpep_pickup_datetime as timestamp), cast(tpep_dropoff_datetime as timestamp)) as duration_minute,
date_diff('hour', cast(tpep_pickup_datetime as timestamp), cast(tpep_dropoff_datetime as timestamp)) as duration_hour,
date_diff('day', cast(tpep_pickup_datetime as timestamp), cast(tpep_dropoff_datetime as timestamp)) as duration_day
from db_yellow_cab_trip_details.src_yellow_cab_trip_details
order by 3 asc limit 50""", conn)

df_anomaly_duration

The following screenshot shows the output; multiple trips have a negative value or duration of 0.

Similarly, we can use different SQL queries to analyze the data and find other outliers. We can also clean the data using SQL queries and, if needed, save the cleaned data in Amazon S3 with CTAS (CREATE TABLE AS SELECT) queries, as shown in the following sketch.
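For example, a hypothetical CTAS query run through the same PyAthena connection might look like the following (the target bucket, prefix, and table name are placeholders):

ctas_query = """
CREATE TABLE db_yellow_cab_trip_details.clean_yellow_cab_trip_details
WITH (
    format = 'PARQUET',
    external_location = 's3://<your-clean-data-bucket>/clean_yellow_cab_trip_details/'
) AS
SELECT *
FROM db_yellow_cab_trip_details.src_yellow_cab_trip_details
WHERE date_diff('second', cast(tpep_pickup_datetime as timestamp),
                cast(tpep_dropoff_datetime as timestamp)) > 0
  AND date_diff('day', cast(tpep_pickup_datetime as timestamp),
                cast(tpep_dropoff_datetime as timestamp)) < 1
"""

cursor = conn.cursor()   # reuse the PyAthena connection created earlier
cursor.execute(ctas_query)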

Visualizing the data and removing outliers

Pull the data into a Pandas DataFrame using the following Athena query, and use matplotlib.pyplot to create a graph that shows the outliers:

df_full = pd.read_sql("""SELECT date_diff('second', cast(tpep_pickup_datetime as timestamp), cast(tpep_dropoff_datetime as timestamp)) as duration_second
from db_yellow_cab_trip_details.src_yellow_cab_trip_details""", conn)
plt.figure(figsize=(12,12))
plt.scatter(range(len(df_full["duration_second"])), np.sort(df_full["duration_second"]))
plt.xlabel('index')
plt.ylabel('duration_second')

plt.show()

The process of plotting the full dataset can take 7–10 minutes. To reduce time, add a limit to the number of records in the query:

SELECT date_diff('second', cast(tpep_pickup_datetime as timestamp), cast(tpep_dropoff_datetime as timestamp)) as duration_second 
from db_yellow_cab_trip_details.src_yellow_cab_trip_details limit 100000 

The following screenshot shows the output.

To ignore the outliers, run the following query and replot the graph after removing the outlier records whose duration is less than or equal to 0 seconds or longer than 1 day:

df_clean_data = pd.read_sql("""SELECT date_diff('second', cast(tpep_pickup_datetime as timestamp), cast(tpep_dropoff_datetime as timestamp)) as duration_second
from db_yellow_cab_trip_details.src_yellow_cab_trip_details
where date_diff('second', cast(tpep_pickup_datetime as timestamp), cast(tpep_dropoff_datetime as timestamp)) > 0
and date_diff('day', cast(tpep_pickup_datetime as timestamp), cast(tpep_dropoff_datetime as timestamp)) < 1""", conn)
plt.figure(figsize=(12,12))
plt.scatter(range(len(df_clean_data["duration_second"])), np.sort(df_clean_data["duration_second"]))
plt.xlabel('index')
plt.ylabel('duration_second')
plt.show()

The following screenshot shows the output.

Cleaning up

When you’re done, delete the notebook instance to avoid recurring deployment costs.

  1. On the Amazon SageMaker console, choose your notebook instance.
  2. Choose Stop.

  3. When the status shows as Stopped, choose Delete.

Conclusion

This post walked you through finding and removing outliers from your dataset and visualizing the data. We used an Amazon SageMaker notebook to run analytical queries using Athena SQL, and used Athena to read the dataset, which is saved in Amazon S3 with its metadata cataloged in AWS Glue. We used Athena queries to find anomalies in the data and exclude those outliers, and we used the notebook instance to plot graphs with the matplotlib.pyplot library.

You can try this solution for your own use cases to remove outliers using Athena SQL and an Amazon SageMaker notebook. If you have comments or feedback, please leave them below.


About the Authors

Rahul Sonawane is a Senior Consultant, Big Data at the Shared Delivery Teams at Amazon Web Services.

Behram Irani is a Senior Solutions Architect, Data & Analytics at Amazon Web Services.

Football tracking in the NFL with Amazon SageMaker

With the 2020 football season kicking off, Amazon Web Services (AWS) is continuing its work with the National Football League (NFL) on several ongoing game-changing initiatives. Specifically, the NFL and AWS are teaming up to develop state-of-the-art cloud technology using machine learning (ML) aimed at aiding the officiating process through real-time football detection. As a first step in this process, the Amazon Machine Learning Solutions Lab developed a computer vision model for the challenge of football detection. In this post, we provide in-depth examples including code snippets and visualizations to demonstrate the key components of the football detection pipeline, starting with data labeling and following up with training and deployment using Amazon SageMaker and Apache MXNet Gluon.

Detecting the football in NFL Broadcast videos

The following video illustrates detecting the football frame by frame.

Football Tracking

Computer vision-based object detection techniques use deep learning algorithms to predict the location of objects in images and videos. Today, object detection has many far-reaching, high business value use cases, such as in self-driving car technology, where detecting pedestrians and vehicles is of paramount importance in ensuring safety on the roads. For the NFL, object detection technology like this is crucial as the game continues to evolve at a rapid pace. For example, the league can use real-time object identification to generate new advanced analytics around player and team performance, in addition to aiding game officials in ball spotting. This technology is part of the larger suite of innovations in the AWS/NFL partnership.

The following sections of the post outline how we used NFL broadcast video data to train object detection models that analyze thousands of images to locate and classify the football from background objects.

Creating an object detection dataset with Amazon SageMaker Ground Truth

Amazon SageMaker Ground Truth is a fully managed data labeling service that makes it easy to build highly accurate training datasets for ML. With the help of Ground Truth, we created a custom object detection dataset by breaking the NFL play segments into images. Ground Truth offers a user interface (UI) that allowed us to quickly spin up a bounding box labeling job, in which human annotators draw bounding box labels around thousands of football image sequences stored in an Amazon Simple Storage Service (Amazon S3) bucket. The following screenshot illustrates the object detection UI.

The labeling job outputs a manifest file that contains the S3 file path of the image, the (x, y) bounding box coordinates, and the class label (for this use case, football) needed for the model training process.

The following screenshot shows the manifest file that is automatically updated in the S3 path.

The following screenshot shows the contents of the output.manifest file.

Supercharging training with Apache MXNet Gluon and Amazon SageMaker Script Mode

Our approach to model development relied on an ML technique called transfer learning, in which we take neural networks previously trained on similar applications with strong results and fine-tune them on our annotated data. We converted the annotations from the labeling job to RecordIO format for compact storage and faster disk access. Neural networks have a tendency to overfit training data, leading to poor out-of-sample results, so the MXNet Gluon toolkit provides image normalization and image augmentations, such as randomized image flipping and cropping, to help reduce overfitting during training.
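For illustration, an augmentation pipeline built from the Gluon vision transforms might look like the following (a sketch, not the exact training script; the crop size and normalization constants are typical defaults, not values from this project):

from mxnet.gluon.data.vision import transforms

train_transform = transforms.Compose([
    transforms.RandomResizedCrop(512),            # randomized cropping
    transforms.RandomFlipLeftRight(),             # randomized horizontal flips
    transforms.ToTensor(),                        # HWC uint8 -> CHW float32 in [0, 1]
    transforms.Normalize([0.485, 0.456, 0.406],   # ImageNet mean
                         [0.229, 0.224, 0.225]),  # ImageNet std
])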

Amazon SageMaker provides a simple UI to train object detection models with no code, offering the Single Shot Detector (SSD) pre-trained model with several out-of-the-box configurations. For a more customized architecture, we use Amazon SageMaker Script Mode, which allows you to bring your own training algorithms and directly train models while staying within the user-friendly confines of Amazon SageMaker. We could train larger, more accurate models directly from Amazon SageMaker notebooks by combining Script Mode with pre-trained models like YOLOv3 and Faster-RCNN, with several backbone network combinations from the Gluon Model Zoo for object detection. See the following code:

import os
import sagemaker
from sagemaker.mxnet import MXNet
from mxnet import gluon
from sagemaker import get_execution_role

sagemaker_session = sagemaker.Session()

role = get_execution_role()

s3_output_path = "s3://<path to bucket where model weights will be saved>/"

model_estimator = MXNet(
    entry_point="train.py",
    role=role,
    train_instance_count=1,  # value can be more than 1 for multi node training
    train_instance_type="ml.p3.16xlarge",
    framework_version="1.6.0",
    output_path=s3_output_path,
    py_version="py3",
    distributions={"parameter_server": {"enabled": True}},
    hyperparameters={"epochs": 15},
)

model_estimator.fit("s3://<bucket path for train and validation record-io files>/")

Object detection algorithm: Background

Object detectors typically combine two key components: detection of objects in images and regression for estimating bounding box coordinates of objects. During training, object detectors are optimized to reduce both detection error and localization error (bounding box prediction error) via a loss function.

Current state-of-the-art object detectors contain deep learning architectures that use pre-trained convolutional neural networks (CNNs) like VGG-16 or ResNet-50 as base networks to perform rich feature extraction from input images. SSD predicts the relative offsets to a fixed set of boxes at every location of a convolutional feature map. Empirically, SSD underperforms other object detector algorithms on small objects like football. In contrast, YOLOv3 uses DarkNet-53 for feature extraction, which concatenates multiple feature maps together to make predictions, leading to improved performance on smaller objects.

Faster-RCNN, in comparison to both SSD and YOLOv3, uses an additional shared deep neural network to predict region proposals from the input image feature maps, which are then fed downstream in the model for object classification and bounding box prediction. Faster-RCNN empirically outperformed the other networks on small objects in our use case. In addition to performance, a major consideration when choosing object detectors is model inference time: SSD and YOLOv3 tend to have fast inference times as measured in frames per second, which is a key consideration for real-time applications, whereas larger networks like Faster-RCNN have slower inference times.

Hyperparameter optimization on Amazon SageMaker

A standard metric for evaluating object detectors is mean average precision (mAP). mAP is based on the model's precision-recall (PR) curve and provides a single numerical metric that can be compared directly across models. You can generate PR curves by setting the model confidence score threshold to different levels, resulting in precision and recall pairs. Plotting these pairs with a bit of interpolation results in a PR curve, and average precision (AP) is defined as the area under this curve. When you detect multiple classes of objects in an image (K > 1 classes), mAP is the mean AP across all K classes.
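As an illustration (our own sketch, not code from the NFL pipeline), AP can be computed from precision and recall pairs using the interpolated-precision convention as follows:

import numpy as np

def average_precision(recall, precision):
    # sort the (recall, precision) pairs by recall and pad the curve at both ends
    order = np.argsort(recall)
    r = np.concatenate(([0.0], np.asarray(recall, dtype=float)[order], [1.0]))
    p = np.concatenate(([0.0], np.asarray(precision, dtype=float)[order], [0.0]))
    # interpolation: precision at recall r is the best precision at any recall >= r
    for i in range(len(p) - 2, -1, -1):
        p[i] = max(p[i], p[i + 1])
    # AP is the area under the resulting step-wise PR curve
    return float(np.sum((r[1:] - r[:-1]) * p[1:]))

# mAP over K classes is then the mean of the per-class AP values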

Automatic model tuning in Amazon SageMaker, also known as hyperparameter optimization, allowed us to try over 100 models with unique parameter configurations to achieve the best model possible given our data. Hyperparameter optimization uses strategies like random search and Bayesian search to help tune hyperparameters in the ML algorithm. See the following sample code:

from sagemaker.tuner import IntegerParameter, CategoricalParameter, ContinuousParameter
from sagemaker.tuner import HyperparameterTuner

hyperparameter_ranges = {
    "lr": ContinuousParameter(0.001, 0.1),
    "network": CategoricalParameter(["resnet50_v1b", "resnet101_v1d"]),
}

### Objective metric Regex based on print statements in script
objective_metric_name = "Validation: "
metric_definitions = [{"Name": "Validation: ", "Regex": "Validation: ([0-9\.]+)"}]

tuner = HyperparameterTuner(
    model_estimator,
    objective_metric_name,
    hyperparameter_ranges,
    metric_definitions,
    max_jobs=100,
    max_parallel_jobs=10,
)

tuner.fit("s3://<bucket path for train and validation record-io files>/")

To do this, we specified the location of our data and manifest file on Amazon S3 and chose our Amazon SageMaker instance type and object detection algorithm to use (SSD with ResNet50). Amazon SageMaker hyperparameter optimization then launched several configurations of the base model with unique hyperparameter configurations, using Bayesian search to determine which configuration achieves the best model based on a preset test metric. In our case, we optimized towards the highest mean average precision (mAP) on our held-out test data. The following graph shows a visualization of a sample set of hyperparameter optimization jobs from the hyperparameter optimization tuner object.

Deploying the model

Deploying the model required only a few additional lines of code (hosting methods) within our Amazon SageMaker notebook instance. We can simply call tuner.deploy on our hyperparameter optimization tuner to deploy the best model based on the evaluation metric that was set for the hyperparameter optimization training job. The code below demonstrates a proof-of-concept deployment on Amazon SageMaker:

predictor = tuner.deploy(initial_instance_count=1, instance_type="ml.m5.xlarge")

The model weights for each training job are stored in Amazon S3. We can deploy any of the jobs or Amazon SageMaker trained models by passing its model artifact path to an Amazon SageMaker estimator object. To do this, we referred to the preconfigured container optimized to perform inference and linked it to the model weights. After this model-container pair was created on our account, we could configure an endpoint with the instance type and number of instances needed by the NFL. See the following code:

from sagemaker.mxnet.model import MXNetModel

sagemaker_model = MXNetModel(
    model_data="s3://<path to training job model file>/model.tar.gz",
    role=role,
    py_version="py3",
    framework_version="1.6.0",
    entry_point="train.py",
)

predictor = sagemaker_model.deploy(
    initial_instance_count=1, instance_type="ml.m5.xlarge"
)

Model inference for football detection

At runtime, a client sends a request to the endpoint hosting the model container on an Amazon Elastic Compute Cloud (Amazon EC2) instance and returns the output (inference). In production, scaling endpoints for large-scale inference on the NFL broadcast videos is significantly simplified with this pipeline.
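For example, a client could call the endpoint through the SageMaker runtime API. The following is a minimal sketch; the endpoint name and payload format are assumptions that depend on the inference script's input and output functions:

import json
import boto3

runtime = boto3.client('sagemaker-runtime')

# 'football-detector' is a placeholder endpoint name; the request and response
# formats depend on the input_fn/output_fn defined in the inference script
response = runtime.invoke_endpoint(
    EndpointName='football-detector',
    ContentType='application/json',
    Body=json.dumps({'s3_uri': 's3://<bucket>/frames/frame_0001.jpg'}),
)
detections = json.loads(response['Body'].read())   # e.g., classes, scores, bounding boxes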

Sensitivity and error analysis

When exploring strategies to improve model performance, you can scale up (use larger architectures) or scale out (acquire more data). After scaling up, which we discussed earlier during model exploration, data scientists commonly collect additional data in hopes of improving model generalizability. For our use case, we specifically aimed at reducing localization error. To do this, we created several test sets that quantitatively and qualitatively helped us understand mAP in relation to specific characteristics of the input video: occlusion (high vs. low) of the football, bounding box size (small vs. large) and aspect ratio (tall vs. wide) effects, camera angle (endzone vs. sideline), and contrast (high vs. low) between the football and its background area. From this, we understood which qualitative aspect of image the model was struggling to predict, and these findings led us to strategically gather additional data to target and improve upon these areas.

Summary

The NFL uses cloud computing to create innovative experiences that introduce additional ways for fans to enjoy football while making the game more efficient and fast-paced. By combining football detection with additional new technologies, the NFL can reduce game stoppage, support officiating, and bring real-time insight into what’s happening on the field, leading to a greater connection with the game that fans love. While fans take delight in “America’s game,” they can rest assured that the NFL in collaboration with AWS is utilizing the newest and best technologies to make the game more enjoyable with a broader range of data points.

You can find full, end-to-end examples of creating custom training jobs, training state-of-the-art object detection models, implementing HPO, and model deployment on Amazon SageMaker on the AWS Labs GitHub repo. To learn more about the ML Solutions Lab, see Amazon Machine Learning Solutions Lab.


About the Authors

Michael Lopez is the Director of Football Data and Analytics at the National Football League and a Lecturer of Statistics and Research Associate at Skidmore College. At the National Football League, his work centers on how to use data to enhance and better understand the game of football.

Colby Wise is a Senior Data Scientist and manager at the Amazon Machine Learning Solutions Lab where he works with customers across different verticals to accelerate their use of machine learning and AWS cloud services to solve their business challenges.

Divya Bhargavi is a Data Scientist at the Amazon Machine Learning Solutions Lab where she develops machine learning models to address customers’ business problems. Most recently, she worked on Computer Vision solutions involving both classical and deep learning methods for a sports customer.

Improved OCR and structured data extraction with Amazon Textract

Optical character recognition (OCR) technology, which enables extracting text from an image, has been around since the mid-20th century, and continues to be a research topic today. OCR and document understanding are still vibrant areas of research because they’re both valuable and hard problems to solve.

AWS has been investing in improving OCR and document understanding technology, and our research scientists continue to publish research papers in these areas. For example, the research paper Can you read me now? Content aware rectification using angle supervision describes how to tackle the problem of document rectification, which is fundamental to the OCR process on documents. Additionally, the paper SCATTER: Selective Context Attentional Scene Text Recognizer introduces a novel way to perform scene text recognition, which is the task of recognizing text against complex image backgrounds. For more recent publications in this area, see Computer Vision.

Amazon scientists also incorporate these research findings into best-of-breed technologies such as Amazon Textract, a fully managed service that uses machine learning (ML) to identify text and data from tables and forms in documents—such as tax information from a W2, or values from a table in a scanned inventory report—and recognizes a range of document formats, including those specific to financial services, insurance, and healthcare, without requiring customization or human intervention.
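As an illustration, extracting text, tables, and form data from a document stored in Amazon S3 takes a single API call; the bucket and file names below are placeholders:

import boto3

textract = boto3.client('textract')

# analyze a scanned page for tables and form key-value pairs
response = textract.analyze_document(
    Document={'S3Object': {'Bucket': '<your-bucket>', 'Name': 'annual-report-page.png'}},
    FeatureTypes=['TABLES', 'FORMS'],
)

# results come back as Blocks (PAGE, LINE, WORD, TABLE, CELL, KEY_VALUE_SET, ...)
lines = [block['Text'] for block in response['Blocks'] if block['BlockType'] == 'LINE']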

One of the advantages of a fully managed service is the automatic and periodic improvement to the underlying ML models to improve accuracy. You may need to extract information from documents that have been scanned or pictured in different lighting conditions, a variety of angles, and numerous document types. As the models are trained using data inputs that encompass these different conditions, they become better at detecting and extracting data.

In this post, we discuss a few recent updates to Amazon Textract that improve the overall accuracy of document detection and extraction.

Currency symbols

Amazon Textract now detects a set of currency symbols (Chinese yuan, Japanese yen, Indian rupee, British pound, and US dollar) and the degree symbol with more precision without much regression on existing symbol detection.

For example, the following is a sample table in a document from a company’s annual report.

The following screenshot shows the output on the Amazon Textract console before the latest update.

Amazon Textract detects all the text accurately. However, the Indian rupee symbol is recognized as an “R” instead of “₹”. The following screenshot shows the output using the updated model.

The rupee symbol is detected and extracted accurately. Similarly, the degree symbol and the other currency symbols (yuan, yen, pound, and dollar) are now supported in Amazon Textract.

Detecting rows and columns in large tables

Amazon Textract released a new table model update that more accurately detects rows and columns of large tables that span an entire page. Overall table detection and extraction of data and text within tables has also been improved.

The following is an example of a table in a personal investment account statement.

The following screenshot shows the Amazon Textract output prior to the new model update.

Even though all the rows, columns, and text are detected properly, the output also contains empty columns. The original table didn't have a clear separation between columns, so the model included extra columns.

The following screenshot shows the output after the model update.

The output now is much cleaner. Amazon Textract still extracts all the data accurately from this table and now includes the correct number of columns. Similar performance improvements can be seen in tables that span an entire page, where columns are no longer omitted.

Improved accuracy in forms

Amazon Textract now has higher accuracy on a variety of forms, especially income verification documents such as pay stubs, bank statements, and tax documents. The following screenshot shows an example of such a form.

The preceding form is not a high-resolution scan. Regardless, you may have to process such documents in your organization. The following screenshot is the Amazon Textract output using one of the previous models.

Although the older model detected many of the check boxes, it didn’t capture all of them. The following screenshot shows the output using the new model.

With this new model, Amazon Textract accurately detected all the check boxes in the document.

Summary

The improvements to currency symbol and degree symbol detection will launch in the Asia Pacific (Singapore) Region on September 24, 2020, followed by the other Regions where Amazon Textract is available in the next few days. With the latest improvements to Amazon Textract, you can retrieve information from documents with more accuracy. Tables spanning an entire page are detected more accurately, currency symbols (yuan, yen, rupee, pound, and dollar) and the degree symbol are now supported, and key-value pairs and check boxes in financial forms are detected with more precision. To start extracting data from your documents and images, try Amazon Textract for yourself.


About the Author

Raj Copparapu is a Product Manager focused on putting machine learning in the hands of every developer.

Preventing customer churn by optimizing incentive programs using stochastic programming

In recent years, businesses have been increasingly looking for ways to integrate the power of machine learning (ML) into business decision-making. This post demonstrates the use case of creating an optimal incentive program to offer customers identified as being at risk of leaving for a competitor, or churning. It extends a popular ML use case, predicting customer churn, and shows how to optimize an incentive program to address the real business goal of preventing customer churn. We use a large phone company for our use case.

Although it’s usual to treat this as a binary classification problem, the real world is less binary: people become likely to churn for some time before they actually churn. Loss of brand loyalty occurs some time before someone actually buys from a competitor. There’s frequently a slow rise in dissatisfaction over time before someone is finally driven to act. Providing the right incentive at the right time can reset a customer’s satisfaction.

This post builds on the post Gain customer insights using Amazon Aurora machine learning. There we met a telco CEO and heard his concern about customer churn. In that post, we moved from predicting customer churn to intervening in time to prevent it. We built a solution that integrates Amazon Aurora machine learning with the Amazon SageMaker built-in XGBoost algorithm to predict which customers will churn. We then integrated Amazon Comprehend to identify the customer’s sentiment when they called customer service. Lastly, we created a naïve incentive to offer customers identified as being at risk at the time they called.

In this post, we focus on replacing this naïve incentive with an optimized incentive program. Rather than using an abstract cost function, we optimize using the actual economic value of each customer and a limited incentive budget. We use a mathematical optimization approach to calculate the optimal incentive to offer each customer, based on our estimate of the probability that they’ll churn, and the probability that they’ll accept our incentive to stay.

Solution overview

Our incentive program is intended to be used in a system such as that described in the post Gain customer insights using Amazon Aurora machine learning. For simplicity, we’ve built this post so that it can run separately.

We use a Jupyter notebook running on an Amazon SageMaker notebook instance. Amazon SageMaker is a fully-managed service that enables developers and data scientists to quickly and easily build, train, and deploy ML models at any scale. In the Jupyter notebook, we first build and host an XGBoost model, identical to the one in the prior post. Then we run the optimization based on a given marketing budget and review the expected results.

Setting up the solution infrastructure

To set up the environment necessary to run this example in your own AWS account, follow Steps 0 and 1 in the post Simulate quantum systems on Amazon SageMaker to set up an Amazon SageMaker instance.

Then, as in Step 2, open a terminal. Enter the following command to copy the notebook to your Amazon SageMaker notebook instance:

wget https://aws-ml-blog.s3.amazonaws.com/artifacts/prevent_churn_by_optimizing_incentives/preventing_customer_churn_by_optimizing_incentives.ipynb

Alternatively, you can review a pre-run version of the notebook.

Building the XGBoost model

The first sections of the notebook (Setup, Data Exploration, Train, and Host) are the same as the sample notebook Amazon SageMaker Examples – Customer Churn. The exception is that we capture a copy of the data for later use and add a column to calculate each customer's total spend.

At the end of these sections, we have a running XGBoost model on an Amazon SageMaker endpoint. We can use it to predict which customers will churn.

Assessing and optimizing

In this section of the post and accompanying notebook, we focus on assessing the XGBoost model and creating our optimal incentive program.

We can assess model performance by looking at the prediction scores, as shown in the original customer churn prediction post, Amazon SageMaker Examples – Customer Churn.

So how do we calculate the minimum incentive that will give the desired result? Rather than providing a single program to all customers, can we save money and gain a better outcome by using variable incentives, customized to a customer’s churn probability and value? And if so, how?

We can do so by building on components we’ve already developed.

Assigning costs to our predictions

The costs of churn for the mobile operator depend on the specific actions that the business takes. One common approach is to assign costs, treating each customer’s prediction as binary: they churn or don’t, we predicted correctly or we didn’t. To demonstrate this approach, we must make some assumptions. We assign the true negatives the cost of $0. Our model essentially correctly identified a happy customer in this case, and we won’t offer them an incentive. An alternative is to assign the actual value of the customer’s spend to the true negatives, because this is the customer’s contribution to our overall revenue.

False negatives are the most problematic because they incorrectly predict that a churning customer will stay. We lose the customer and have to pay all the costs of acquiring a replacement customer, including foregone revenue, advertising costs, administrative costs, point of sale costs, and likely a phone hardware subsidy. Such costs typically run in the hundreds of dollars, so for this use case, we assume $500 to be the cost for each false negative. For a better estimate, our marketing department should be able to give us a value to use for the overhead, and we have the actual customer spend for each customer in our dataset.

Finally, we give an incentive to customers that our model identifies as churning. For this post, we assume a one-time retention incentive of $50. This is the cost we apply to both true positive and false positive outcomes. In the case of false positives (the customer is happy, but the model mistakenly predicted churn), we waste the concession. We probably could have spent those dollars more effectively, but it’s possible we increased the loyalty of an already loyal customer, so that’s not so bad. We revise this approach later in this post.

Mapping the customer churn threshold

In previous versions of this notebook, we’ve shown the effect of false negatives that are substantially more costly than false positives. Instead of optimizing for error based on the number of customers, we’ve used a cost function that looks like the following equation:

cost_of_replacing_customer * FN(C) + customer_value * TN(C) + incentive_offered * FP(C) + incentive_offered * TP(C)

FN(C) means that the false negative percentage is a function of the cutoff, C, and similarly for TN, FP, and TP. We want to find the cutoff, C, where the result of the expression is smallest.

We start by using the same values for all customers, to give us a starting point for discussion with the business. With our estimates, this equation becomes the following:

$500 * FN(C) + $0 * TN(C) + $50 * FP(C) + $50 * TP(C)

A straightforward way to understand the impact of these numbers is to simply run a simulation over a large number of possible cutoffs. We test 100 possible values, and produce the following graph.
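A minimal sketch of that simulation, assuming y_true (1 if the customer churned) and churn_scores (the model's churn probabilities) are NumPy arrays for the test customers, might look like the following:

import numpy as np

# cost assumptions from the preceding discussion
COST_FN, COST_TN, COST_FP, COST_TP = 500, 0, 50, 50

def total_cost(cutoff, y_true, churn_scores):
    y_pred = (churn_scores >= cutoff).astype(int)
    fn = np.sum((y_pred == 0) & (y_true == 1))
    tn = np.sum((y_pred == 0) & (y_true == 0))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    tp = np.sum((y_pred == 1) & (y_true == 1))
    return COST_FN * fn + COST_TN * tn + COST_FP * fp + COST_TP * tp

cutoffs = np.linspace(0.01, 1.0, 100)
costs = [total_cost(c, y_true, churn_scores) for c in cutoffs]
best_cutoff = cutoffs[int(np.argmin(costs))]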

The following output summarizes our results:

Cost is minimized near a cutoff of: 0.21000000000000002 for a cost of: $ 25800 for these 1500 customers.
Incentive is paid to 246 customers, for a total outlay of $ 12300
Total customer spend of these customers is $ 16324.36

The preceding chart shows how picking a threshold too low results in costs skyrocketing as all customers are given a retention incentive. Meanwhile, setting the threshold too high (such as 0.7 or above) results in too many lost customers, which ultimately grows to be nearly as costly. In between, there is a large grey area, where perhaps some more nuanced incentives would create better outcomes.

The overall cost can be minimized at $25,750 by setting the cutoff to 0.13, which is substantially better than the $100,000 or more we would expect to lose by not taking any action.

We can also calculate the dollar outlay of the program and compare to the total spend of the customers. Here we can see that paying the incentive to all predicted churn customers costs $13,750, and that these customers spend $183,700. (Your numbers may vary, depending on the specific customers chosen for the sample.)

What happens if we instead have a smaller budget for our campaign? We choose a budget of 1% of total customer monthly spend. The following output shows our results:

Total budget is: $895.90
Per customer incentive is $0.60

We can see that our cost changes. But it’s pretty clear that an incentive of approximately $0.60 is unlikely to change many people’s minds.

Can we do better? We could offer a range of incentives to customers that meet different criteria. For example, it’s worth more to the business to prevent a high spend customer from churning than a low spend customer. We could also target the grey area of customers that have less loyalty and could be swayed by another company’s advertising. We explore this in the following section.

Preventing customer churn using mathematical optimization of incentive programs

In this section, we use a more sophisticated approach to developing our customer retention program. We want to tailor our incentives to target the customers most likely to reconsider a churn decision.

Intuitively, we know that we don’t need to offer an incentive to customers with a low churn probability. Also, above some threshold, we’ve already lost the customer’s heart and mind, even if they haven’t actually left yet. So the best target for our incentive is between those two thresholds—these are the customers we can convince to stay.

The problem under investigation is inherently stochastic in that each customer might churn or not, and might accept the incentive (offer) or not. Stochastic programming [1, 2] is an approach for modeling optimization problems that involve uncertainty. Whereas deterministic optimization problems are formulated with known parameters, real-world problems almost invariably include parameters that are unknown at the time a decision should be made. An example would be the construction of an investment portfolio to maximize return. An efficient portfolio would be defined as the portfolio that maximizes the expected return for a given amount of risk (such as standard deviation), or the portfolio that minimizes the risk subject to a given expected return [3].

Our use case has the following elements:

  • We know the number of customers, 𝑁.
  • We can use the customer’s current spend as the (upper bound) estimate of the profit they generate, P.
  • We can use the churn score from our ML model as an estimate of the probability of churn, alpha.
  • We use 1% of our total revenue as our campaign budget, C.
  • The probability that the customer is swayed, beta, depends on how convincing the incentive is to the customer, which we represent as 𝛾.
  • The incentive, c, is what we want to calculate.

We set up our inputs: P (profit), alpha (our churn probabilities, from our preceding model), and C, our campaign budget. We then define the function we wish to optimize, f(ci) being the expected total profit across the 𝑁 customers.

Our goal is to optimally allocate the discount 𝑐𝑖 across the 𝑁 customers to maximize the expected total profit. Mathematically this is equivalent to the following optimization problem:
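The original post presents this formulation as an image; one plausible way to write it out, based on the quantities defined above and on the exponential acceptance probability assumed below, is the following (our reconstruction, not the exact formulation from the notebook):

\max_{c_1,\dots,c_N \ge 0} \;\; \sum_{i=1}^{N} \Big[ P_i \big( (1-\alpha_i) + \alpha_i \, \beta_i(c_i) \big) - c_i \Big]
\quad \text{subject to} \quad \sum_{i=1}^{N} c_i \le C,
\qquad \text{where } \beta_i(c_i) = 1 - e^{-c_i/\gamma_i}.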

Now we can specify how likely we think each customer is to accept the offer and not churn—that is, how convincing they’ll find the incentive. We represent this as 𝛾 in the formulae.

Although this is a matter of business judgment, we can use the preceding graph to inform that judgment. In this case, the business believes that if the churn probability is below 0.55, they are unlikely to churn, even without an incentive; on the other hand, if the customer’s churn probability is above 0.95, the customer has little loyalty and is unlikely to be convinced. The real targets for the incentives are the customers with churn probability between 0.55–0.95.

We could include that business insight into the optimization by setting the value for the convincing factor 𝛾𝑖 as follows:

  • 𝛾𝑖 = 100. This is equivalent to giving less importance as a deciding factor to the discount for customers whose churn probability is below 0.55 (they are loyal and less likely to churn), or greater than 0.95 (they will most likely leave despite the retention campaign).
  • 𝛾𝑖 = 1. This is equivalent to saying that the probability customer i will accept the discount is equal to 𝛽ᵢ = 1 − e^(−cᵢ/𝛾ᵢ) = 1 − e^(−cᵢ) for customers with churn probability between 0.55 and 0.95.

When we start to offer these incentives, we can log whether or not each customer accepts the offer and remains a customer. With that information, we can learn this function from experience, and use that learned function to develop the next set of incentives.

Solving the optimization problem

A variety of open-source solvers can solve this optimization problem for us. Examples include SciPy's scipy.optimize.minimize or faster open-source solvers like GEKKO, which is what we use for this post. For large-scale problems, we recommend commercial optimization solvers like CPLEX or GUROBI.
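The following is a minimal GEKKO sketch of the allocation problem (our illustration under the assumptions above, not the exact notebook code; the acceptance probability 1 − e^(−cᵢ/𝛾ᵢ) is an assumed form):

import numpy as np
from gekko import GEKKO

def optimal_discounts(P, alpha, gamma, C):
    # P: per-customer profit estimates, alpha: churn probabilities,
    # gamma: convincing factors, C: total campaign budget
    N = len(P)
    m = GEKKO(remote=False)
    c = [m.Var(lb=0) for _ in range(N)]      # discount per customer, c_i >= 0
    m.Equation(m.sum(c) <= C)                # total outlay stays within the budget
    # expected profit: loyal customers stay anyway; at-risk customers stay
    # with probability beta_i = 1 - exp(-c_i / gamma_i) (an assumed form)
    expected_profit = m.sum([
        P[i] * ((1 - alpha[i]) + alpha[i] * (1 - m.exp(-c[i] / gamma[i]))) - c[i]
        for i in range(N)
    ])
    m.Obj(-expected_profit)                  # GEKKO minimizes, so negate
    m.options.SOLVER = 3                     # IPOPT handles this nonlinear program
    m.solve(disp=False)
    return np.array([ci.value[0] for ci in c])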

After the optimization task has run, we check how much of our budget has been allocated.

The total spend is $1,000.00 compared to our budget of $1,000.00, and the total customer spend is $89,589.73 for 1,500 customers.

Now we evaluate the expected total profit for the following scenarios:

  1. Optimal discount allocation, as calculated by our optimization algorithm
  2. Uniform discount allocation: every customer is offered the same incentive
  3. No discount

The following graph shows our outcomes.

Expected total profit compared to no campaign: 17%

Expected total profit compared to uniform discount allocation: 5%

Lastly, we add the discount to our customer data. We can see how the discount we offer varies across our customer base. The red vertical line shows the value of the uniform discount. The pattern of discounts we offer closely mirrors the pattern in the prediction scores, where many customers aren’t identified as likely churners, and a few are identified as highly likely to churn.

We can also see a sample of the discounts we’d be offering to individual customers. See the following table.

For each customer, we can see their total monthly spend and the optimal incentive to offer them. We can see that the discount varies by churn probability, and we’re assured that the incentive campaign fits within our budget.

Depending on the size of the total budget we allocate, we may occasionally find that we’re offering all customers a discount. This discount allocation problem reminds us of the water-filling algorithm in wireless communications [4,5], where the problem is of maximizing the mutual information between the input and the output of a channel composed of several subchannels (such as a frequency-selective channel, a time-varying channel, or a set of parallel subchannels arising from the use of multiple antennas at both sides of the link) with a global power constraint at the transmitter. More power is allocated to the channels with higher gains to maximize the sum of data rates or the capacity of all the channels. The solution to this class of problems can be interpreted as pouring a limited volume of water into a tank, the bottom of which has the stair levels determined by the inverse of the subchannel gains.

Unfortunately, our problem doesn't have an intuitive interpretation like the water-filling problem. This is because, due to the nature of the objective function, the system of equations and inequalities corresponding to the KKT conditions [6] doesn't admit a closed-form solution.

The optimal incentives calculated here are the result of an optimization routine designed to maximize an economic figure, which is the expected total profit. Although this approach provides a principled way for marketing teams to make systematic, quantitative, and analytics-driven decisions, it’s also important to recall that the objective function to be optimized is a proxy measure to the actual total profit. It goes without saying that we can’t compute the actual profit based on future decisions (this would paradoxically imply maximizing the actual return based on future values of the stocks). But we can explore new ideas using techniques such as the potential outcomes work [7], which we could use to design strategies for back-testing our solution.

Conclusion

We’ve now taken another step towards preventing customer churn. We built on a prior post in which we integrated our customer data with our ML model to predict churn. We can now experiment with variations on this optimization equation, and see the effect of different campaign budgets or even different theories of how they should be modeled.

To gather more data on effective incentives and customer behavior, we could also test several campaigns against different subsets of our customers. We can collect their responses—do they churn after being offered this incentive, or not—and use that data in a future ML model to further refine the incentives offered. We can use this data to learn what kinds of incentives convince customers with different characteristics to stay, and use that new function within this optimization.

We’ve empowered marketing teams with the tools to make data-driven decisions that they can quickly turn into action. This approach can drive fast iterations on incentive programs, moving at the speed with which our customers make decisions. Over to you, marketing!

Bibliography

[1] S. Uryasev, P. M. Pardalos, Stochastic Optimization: Algorithm and Applications, Kluwer Academic: Norwell, MA, USA, 2001.

[2] John R. Birge and François V. Louveaux. Introduction to Stochastic Programming. Springer Verlag, New York, 1997.

[3] Francis, J. C. and Kim, D. (2013). Modern portfolio theory: Foundations, analysis, and new developments (Vol. 795). John Wiley & Sons.

[4] T. M. Cover and J. A. Thomas, Elements of Information Theory. New York: Wiley, 1991.

[5] D. P. Palomar and J. R. Fonollosa, “Practical algorithms for a family of water-filling solutions,” IEEE Trans. Signal Process., vol. 53, no. 2, pp. 686–695, Feb. 2005.

[6] S. Boyd and L. Vandenberghe. Convex optimization. Cambridge university press, 2004.

[7] Imbens, G. W. and D. B. Rubin (2015): Causal Inference for Statistics, Social, and Biomedical Sciences, Cambridge University Press.


About the Authors

Marco Guerriero, PhD, is a Practice Manager for Emergent Technologies and Intelligence Platform for AWS Professional Services. He loves working on ways for emergent technologies such as AI/ML, Big Data, IoT, Quantum, and more to help businesses across different industry verticals succeed within their innovation journey.

Veronika Megler, PhD, is Principal Data Scientist for Amazon.com Consumer Packaging. Until recently, she was the Principal Data Scientist for AWS Professional Services. She enjoys adapting innovative big data, AI, and ML technologies to help companies solve new problems, and to solve old problems more efficiently and effectively. Her work has lately been focused more heavily on economic impacts of ML models and exploring causality.

Oliver Boom is a London-based consultant for the Emerging Technologies and Intelligent Platforms team at AWS. He enjoys solving large-scale analytics problems using big data, data science, and DevOps, and loves working at the intersection of business and technology. In his spare time, he enjoys language learning, music production, and surfing.

Dr Sokratis Kartakis is a UK-based Data Science Consultant for AWS. He works with enterprise customers to help them adopt and productionize innovative machine learning (ML) solutions at scale, solving challenging business problems. His focus areas are ML algorithms, ML industrialization, and AI/MLOps. He enjoys spending time with his family outdoors and traveling to new destinations to discover new cultures.

Selecting the right metadata to build high-performing recommendation models with Amazon Personalize

In this post, we show you how to select the right metadata for your use case when building a recommendation engine using Amazon Personalize. The aim is to help you optimize your models to generate more user-relevant recommendations. We look at which metadata is most relevant to include for different use cases, and where you may get better results by excluding other metadata. We also highlight a specific use case from Pulselive, one of our customers that recently used Amazon Personalize to enhance the recommendation capabilities of their customer’s websites, resulting in a 20% increase in video consumption.

Introducing Amazon Personalize

Amazon Personalize is a managed service that enables you to improve customer engagement by powering personalized product and content recommendations, and targeted marketing promotions. Amazon Personalize uses machine learning (ML) to create high-quality recommendations that you can use to personalize your user experience across digital channels such as websites, applications, and email systems. You can get started without any prior ML experience using simple APIs to easily build sophisticated personalization capabilities in just a few clicks. Amazon Personalize automatically processes and examines your metadata, identifies what is meaningful, allows you to pick an ML algorithm, and trains and optimizes a custom model based on your metadata. All your data is encrypted to be private and secure, and is only used to create recommendations for your users.

Let’s dive into how important that metadata is to get a performant model.

The role metadata selection plays in recommendations

The goal of metadata selection in recommendation engines is to select the right data to aid the training algorithm to discover valuable information about the similarities in user preferences and behavior, in addition to the properties and similarity of the items you’re trying to recommend through the engine. The ultimate goal is to provide a personalized experience, uniquely tailored for each user, and present them with the items that are the most relevant to them.

Nowadays, there are so many sources of data that a company could potentially use to capture user behavior and understand which items to present to them that it has become challenging to accurately select which metadata to consider and which to ignore. Irrespective of the use case, a commercial website can use large amounts of data about every aspect of each user’s behavior on the website, such as which items they’re frequently interacting with (watching a video or ordering an item), how long they spend on each item’s page, or even how erratic or smooth the movement of their cursor is while scrolling through a page. All this information can reveal a lot about a user’s preferences and what would be the ideal items to recommend to them.

There are two main categories of approaches to recommendation engines: collaborative filtering and content-based filtering.

Collaborative filtering compares the behavior of the users with each other and tries to calculate the similarity between them to find shared interests. Therefore, the recommendation engine knows that if user A has very similar behavior to user B, then user A would likely be interested in some of the items that user B has interacted with, and vice versa.

Content-based filtering looks at the actual items the users interact with. If a user has interacted with items A and B, and product C is very similar to A and B, then item C will likely be of interest to the user.

We also have hybrid models that use both user behavior and item-related data to find the underlying patterns that reveal the ideal items to recommend to each user.

Each method requires a different approach to metadata selection because they require different types of data to be collected and used for the training. For example, building collaborative filtering engines requires data related to the behavior of users on the website, whereas building a content-based engine requires more data related to the items (item-specific metadata and which users interacted with which items). A hybrid solution requires data related to both the users and the items.
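To make the collaborative filtering idea concrete, the following toy sketch (our own illustration, not part of Amazon Personalize) scores user-to-user similarity from a small interaction matrix:

import numpy as np

# rows are users, columns are items; 1 means the user interacted with the item
interactions = np.array([
    [1, 1, 0, 0],   # user A
    [1, 1, 1, 0],   # user B
    [0, 0, 1, 1],   # user C
])

def cosine_similarity(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

print(cosine_similarity(interactions[0], interactions[1]))  # A vs. B: high, shared interests
print(cosine_similarity(interactions[0], interactions[2]))  # A vs. C: low, little overlap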

As a general rule, authenticated experiences are most optimal. When your users have personal accounts that they log in to, you can provide them with a more personalized experience tailored to their needs because you can easily track and record every aspect of their behavior (along with additional metadata), whereas it’s harder to track anonymous or guest users and map them to their previous sessions.

The problems that can occur if metadata selection isn’t done right

If metadata selection isn’t done correctly, it can potentially lead to poor recommendations that are either too generic (showing most users the most popular and commonly interacted products) or not relevant (showing items that are completely irrelevant to the unique user).

When too much information is included in training a recommendation model, it can lead to noise in the model. Metadata that has no correlation with user preferences but was included in training skews the model and makes it harder for the algorithm to find the valuable underlying patterns that allow for a successful recommender system.

This can also apply to the depth (amount of history) of the data that is used to train a model. Perhaps relevant metadata has been selected, but the freshness of the data in many cases is a stronger indicator of relevance—the most recent metadata is more relevant than historical data for the same kind of interactions. This is because user behavior and preferences vary over time and people’s interests can change rather quickly; therefore, presenting a user with a recommendation that was considered relevant to them a few months ago doesn’t guarantee that the recommendation is relevant to them today. This is why it’s important to keep your recommender system up to date with current user behavior.

Conversely, if too little information is included, the recommendation model under-performs. If you don’t include valuable information that can aid the performance of the model, the recommendation model makes suboptimal suggestions.

A wrong approach to metadata selection can make it harder for the algorithms to find the underlying patterns that connect users and items. This means that the recommendations that the users are presented with aren’t personalized as expected.

Terminology of recommendation engines

To introduce the topic further, let’s dive into some of the terminology associated with Amazon Personalize:

  • Datasets and dataset groups – Datasets contain the data used to train a recommendation model. You can use different dataset groups to serve different purposes. For example, separate applications, with their own users and items, can have their own dataset groups.
  • Recipes and solutions – Amazon Personalize uses recipes, which are the combination of the learning algorithm with the hyperparameters and datasets used. Training a model with different recipes leads to different results. The resultant models that are deployed are referred to as a solution version.
  • Campaigns – A deployed solution version is known as a campaign. A campaign allows Amazon Personalize to make recommendations for your users.

Metadata types are dictated in the datasets used to train a model. In the following section, we look at how to do that.

Selecting metadata

Amazon Personalize uses different recipes that are aimed towards either of the two main categories of recommendation engines—collaborative filtering and content-based filtering—and also the hybrid methods. For more information about pre-defined recipes, see Choosing a Recipe.

No matter which recipe you choose to work with, Amazon Personalize has three main types of datasets that it can use to build models (solutions), each related to one of the following categories:

  • Users
  • Items
  • Interactions

The Users and Items dataset types are known as metadata types and are only used by certain recipes. As their names imply, they contain fields that describe each individual user or item. User metadata could include age, gender, and geography. Typical item metadata includes color, category, shape, and price, or, if the items being recommended are videos or movies, content category, ratings, and genre.

The Interactions dataset captures the direct interactions of users with items, which is usually the most revealing information about the relationship between users and items. Examples of interaction data include clicks (user A clicked on item X), purchases (a user actually bought an item), the amount of time spent on an item's webpage, the addition of an item to a user's wishlist, or even the fact that a user hovered their cursor over a certain item for a few milliseconds longer than usual.

The minimum number of interactions Amazon Personalize expects in order to start making recommendations is 1,000 interactions from a minimum of 25 users. User and item metadata datasets are optional, and their importance depends on your use case and the algorithm (recipe) you’re using.

The following screenshot shows the Datasets page on the Amazon Personalize console.

What data types are supported by each category?

Each dataset has a set of required fields, reserved keywords, and required datatypes, as shown in the following table.

Dataset Type | Required Fields | Reserved Keywords
Users | USER_ID (string), one metadata field | (none)
Items | ITEM_ID (string), one metadata field | CREATION_TIMESTAMP (long)
Interactions | USER_ID (string), ITEM_ID (string), TIMESTAMP (long) | EVENT_TYPE (string), IMPRESSION (string), EVENT_VALUE (float, null)

Before you add a dataset to Amazon Personalize, you must define a schema for that dataset. Each dataset type has specific requirements. Schemas in Amazon Personalize are defined in the Avro format.

The following example code shows an interactions schema. The EVENT_TYPE and EVENT_VALUE fields are optional, and are reserved keywords recognized by Amazon Personalize. LOCATION and DEVICE are optional contextual metadata fields.

{
  "type": "record",
  "name": "Interactions",
  "namespace": "com.amazonaws.personalize.schema",
  "fields": [
      {
          "name": "USER_ID",
          "type": "string"
      },
      {
          "name": "ITEM_ID",
          "type": "string"
      },
      {
          "name": "EVENT_TYPE",
          "type": "string"
      },
      {
          "name": "EVENT_VALUE",
          "type": "float"
      },
      {
          "name": "LOCATION",
          "type": "string",
          "categorical": true
      },
      {
          "name": "DEVICE",
          "type": "string",
          "categorical": true
      },
      {
          "name": "TIMESTAMP",
          "type": "long"
      }
  ],
  "version": "1.0"
}
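
For comparison, an Items metadata schema follows the same pattern: ITEM_ID plus at least one metadata field, with CREATION_TIMESTAMP available as a reserved keyword. The following is a minimal sketch in which GENRE is a hypothetical categorical metadata field:

{
  "type": "record",
  "name": "Items",
  "namespace": "com.amazonaws.personalize.schema",
  "fields": [
      {
          "name": "ITEM_ID",
          "type": "string"
      },
      {
          "name": "GENRE",
          "type": "string",
          "categorical": true
      },
      {
          "name": "CREATION_TIMESTAMP",
          "type": "long"
      }
  ],
  "version": "1.0"
}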

Creating a schema using the AWS Python SDK

To create a schema using the AWS Python SDK, complete the following steps:

  1. Define the Avro format schema that you want to use.
  2. Save the schema in a JSON file in the same directory as your Python script (the following code assumes the file is named schema.json).
  3. Create the schema using the following code:
import boto3

personalize = boto3.client('personalize')

# Register the Avro schema stored in schema.json with Amazon Personalize
with open('schema.json') as f:
    create_schema_response = personalize.create_schema(
        name = 'YourSchema',
        schema = f.read()
    )

schema_arn = create_schema_response['schemaArn']

print('Schema ARN: ' + schema_arn)

Amazon Personalize returns the ARN of the new schema.

  4. Store the ARN for later use.
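
As a rough sketch of the steps that typically follow (the dataset group name, S3 location, and IAM role below are placeholders), the stored schema ARN is used to create a dataset inside a dataset group and to import your data from Amazon S3:

import boto3

personalize = boto3.client('personalize')

# schema_arn is the ARN returned by create_schema in the previous step

# Create a dataset group to hold the datasets for this application
dataset_group_arn = personalize.create_dataset_group(
    name = 'your-dataset-group'
)['datasetGroupArn']

# Create an Interactions dataset that uses the schema created earlier
dataset_arn = personalize.create_dataset(
    name = 'your-interactions-dataset',
    schemaArn = schema_arn,
    datasetGroupArn = dataset_group_arn,
    datasetType = 'Interactions'
)['datasetArn']

# Import interaction records from a CSV file in Amazon S3 (placeholder bucket and IAM role)
import_job_arn = personalize.create_dataset_import_job(
    jobName = 'your-import-job',
    datasetArn = dataset_arn,
    dataSource = {'dataLocation': 's3://your-bucket/interactions.csv'},
    roleArn = 'arn:aws:iam::123456789012:role/YourPersonalizeRole'
)['datasetImportJobArn']

Each create call is asynchronous, so in practice you wait for the dataset group, dataset, and import job to become ACTIVE before continuing.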

Filtering your metadata

Amazon Personalize allows you to experiment with building different models (or solutions) based on different metadata by enabling you to filter records from your interactions dataset and set a threshold for each event type, or simply select and leave out certain event types. You can filter records from an interactions dataset in two ways:

  • Set a threshold to exclude records based on a specific value by specifying an event value in your recipe. If the records include a value that is associated with a specific event—for example, the price a user paid is associated with the purchase of an item—you can set a specific value in a recipe as a threshold to exclude records from training. The amount is called an event value.
  • Exclude records of a certain type by specifying an event type in your recipe. A dataset often includes specific types of activities, for example, purchase, click, or wishlisted. These are called event types. To include only records for specific event types in training, filter your dataset by event type in your recipe.

To filter your metadata, call the CreateSolution API. To specify an event type, for example purchase, set it in the eventType parameter; to specify an event value, for example 10, set it in the eventValueThreshold parameter of the solution configuration. You can specify an eventType on its own, an eventType together with an eventValueThreshold, or neither, but you can't specify an eventValueThreshold alone. See the following code:

import boto3
 
personalize = boto3.client('personalize')

# Create the solution
create_solution_response = personalize.create_solution(
    name = "your-solution-name",
    datasetGroupArn = dataset_group_arn,
    recipeArn = recipe_arn,
    "eventType": "purchase",
    solutionConfig = {
        "eventValueThreshold": "10"
    }
)

# Store the solution ARN
solution_arn = create_solution_response['solutionArn']

# Use the solution ARN to get the solution status
solution_description = personalize.describe_solution(solutionArn = solution_arn)['solution']
print('Solution status: ' + solution_description['status'])

When selecting metadata for a recommendation engine, it’s helpful to ask the following questions to help guide your decisions:

  • What is likely to be the strongest indicator of a good recommendation—similar users, similar items, or their combined interactions? This can help determine which metadata to select and tag in the datasets. As described, the interactions dataset is the minimum that Amazon Personalize expects, so you have to choose wisely which types of interactions (or events) you want to capture. A combination of interactions and metadata is typically recommended, but choosing which types of interactions to record is important.
  • What is the temporal value of the data? Is old data less potent? How much less? How can you use real-time APIs with real-time data to get the most relevant recommendations that reflect the users’ change of preferences over time?
  • Which metrics best show whether the recommendation engine is working well? Can you align Amazon Personalize metrics with your own KPIs? Can you construct an A/B test with live customers?

The answers to these questions can be a good guide to improve the recommendation system.

Applying metadata selection: Pulselive use case

In a recent engagement with Pulselive, an AWS customer that builds and hosts solutions for large sports organizations, we were asked to help prototype a personalized recommendation engine for one of their customers, a renowned European football club, to suggest videos to visitors of the club's website according to their preferences and past behavior. The goal was to use all available data to provide the website's visitors with a tailored, highly personalized experience by recommending videos relevant to each user, thereby increasing engagement with the content.

Our initial approach was to use some of their existing recorded historical data to extract the minimum required information needed to start building Amazon Personalize solutions that can recommend the right videos to the right users. Therefore, the metadata we initially selected was the simplest form of user-video interactions—clicks—from a historical dataset of which users had clicked on which videos and at what time. We started with 30,000 user interactions.

This allowed us to build a baseline solution that used that information to evaluate the relevance of each video to each user, and we treated it as our starting point. The next goal was to enrich the dataset by selecting the right metadata and observing the impact that the new models had on user engagement.

At this point, we have to mention that it's somewhat challenging to predict how well a recommendation system will do when deployed into production. Amazon Personalize provides standard out-of-the-box metrics when a model finishes training, such as precision and coverage, which give you an idea of how well it ranks the most relevant items higher on the recommendation list. But you can only evaluate the true impact on your customers by deploying the system into production.
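
As an illustration (the solution version ARN is a placeholder), these offline metrics can be retrieved with the GetSolutionMetrics API and compared across solutions trained on different metadata:

import boto3

personalize = boto3.client('personalize')

# Retrieve offline metrics (precision at K, coverage, and so on) for a trained solution version
metrics_response = personalize.get_solution_metrics(
    solutionVersionArn = 'arn:aws:personalize:us-east-1:123456789012:solution/your-solution/version-id'
)

for metric_name, metric_value in metrics_response['metrics'].items():
    print(f'{metric_name}: {metric_value}')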

Pulselive chose to do A/B testing, comparing the results of their existing recommendation methods to those produced by an Amazon Personalize campaign. They started by redirecting 5% of their traffic through the Amazon Personalize campaign and, after seeing good results, eventually increased this to 50% of traffic. For more information, see Increasing engagement with personalized online sports content.

Regarding metadata selection, we quickly realized that the users and items in the initial historical dataset weren’t very recent, and most of their IDs didn’t correspond to users and items that had recent activity on their production website.

Luckily, apart from an initial historical dataset, Amazon Personalize can also enrich its models in real time by allowing you to feed in interaction data from your live website. Through the use of the Amazon Personalize PutEvents API, you can record any action users take on the website and feed it into Amazon Personalize in near-real time, updating the model with the most recent user behavior and preferences. This is an important capability because it’s natural for user preferences to change over time, and you don’t want to risk presenting them with items that are either out of date or not relevant to them anymore.
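
As a minimal sketch of what this can look like (the tracking ID comes from an event tracker created beforehand, and all IDs below are placeholders), a click event can be recorded as follows:

from datetime import datetime, timezone

import boto3

personalize_events = boto3.client('personalize-events')

# Record a single click event against an existing event tracker
personalize_events.put_events(
    trackingId = 'your-event-tracker-id',  # from create_event_tracker (placeholder)
    userId = 'user-123',
    sessionId = 'session-456',
    eventList = [
        {
            'eventType': 'click',
            'itemId': 'video-789',
            'sentAt': datetime.now(timezone.utc)
        }
    ]
)

The same mechanism can carry additional signals; for example, the percentage of a video watched could be sent as the eventValue of a watch event.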

This also means that you can connect Amazon Personalize directly to your website, with no historical data or trained models, and start feeding in events. After a while, Amazon Personalize gathers enough data to start making accurate recommendations. For more information, see Recording Events.

We spent some time discussing what other relevant user behavior metadata we could capture, and decided to start recording some to observe whether these would result in a more accurate recommendation system that would impact user engagement on the site. Two simple measures for this were seeing if recommended videos were more frequently visited and watched for longer periods.

We started recording the source of each click (the recommended list vs. other links on the website), the amount of time a user spent on a clicked video in seconds, and the percentage of the video that time represented (because spending 1 minute on a 20-minute video is very different from spending the same minute watching a 1-minute video in its entirety). These additions proved to be very important: after a while, user engagement started improving. We discussed and investigated providing even more detailed information about user behavior on the website, but decided to focus instead on the item and user metadata.

Items metadata was important because it allowed Amazon Personalize to have more context about the nature of each video. This ranged from general video categories, such as interviews and games, to more specific categories, such as “Leagues” and “Friendly games,” to even finer-grained metadata, such as which players are featured in a video. Adding metadata about the content of each video significantly improved the personalized recommendations because the solution had a notion of context that helped determine what type of content each user preferred to watch.

Similarly, on the user metadata side, we provided more detailed information to capture the demographics and preferences of each user. Of course, in the case of users, we had to deal with the cold-start problem (new or guest users for whom the system didn't yet have any information). Luckily, the Amazon Personalize HRNN-Coldstart recipe proved very effective at solving this problem by quickly linking a new user's behavior to that of existing users. The more time a guest or new user spends on the platform, the more Amazon Personalize learns about their preferences and adjusts its recommendations accordingly.
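
As a sketch of how that recipe is selected (the solution name is arbitrary and the dataset group ARN is a placeholder for one created in earlier steps), the cold-start recipe is referenced by its ARN when creating a solution:

import boto3

personalize = boto3.client('personalize')

# List the available recipes to find the one you need
for recipe in personalize.list_recipes()['recipes']:
    print(recipe['recipeArn'])

# Create a solution that uses the HRNN-Coldstart recipe (placeholder dataset group ARN)
dataset_group_arn = 'arn:aws:personalize:us-east-1:123456789012:dataset-group/your-dataset-group'

create_solution_response = personalize.create_solution(
    name = 'coldstart-solution',
    datasetGroupArn = dataset_group_arn,
    recipeArn = 'arn:aws:personalize:::recipe/aws-hrnn-coldstart'
)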

We had many options for what type of metadata to include in the interactions dataset, but it was important to use only relevant metadata and to strike a balance between providing a model with too much information and providing too little.

For example, we considered recording the movement of each user's cursor on the website and sending that to Amazon Personalize as well, which in theory could provide a marginal improvement to the performance of the recommendation system. But doing so proved to be expensive and taxing both on the front end (it impacted website performance) and the back end (the volume of data the system had to record, store, and send to Amazon Personalize increased significantly). Therefore, after careful consideration, we decided that cursor movement metadata wasn't worth keeping.

After a few months, Pulselive rolled out the Amazon Personalize-based recommendation system to nearly half of their customer’s website visitors, and saw that that group’s engagement with their videos increased by 20%.

Conclusion

Recommendation engines can provide more pertinent results to users based on metadata about a user’s historical selections, or on the types of items of interest.

In this post, we looked at how to select the right metadata to get the best results when training a recommendation engine on Amazon Personalize by evaluating which metadata to include and which to exclude. We also looked at a specific use case and how an AWS customer, Pulselive, increased engagement with videos on their customer’s website by providing personalized recommendations to users.

For more information about creating recommendation engines with Amazon Personalize and selecting metadata, see the Amazon Personalize documentation.


About the Authors

Andrew Hood is a Prototyping Engagement Manager at AWS.

Ion Kleopas is an ML Prototyping Architect at AWS.
