Predict customer churn with no-code machine learning using Amazon SageMaker Canvas

Understanding customer behavior is top of mind for every business today. Gaining insights into why and how customers buy can help grow revenue. But losing customers (also called customer churn) is always a risk, and insights into why customers leave can be just as important for maintaining revenues and profits. Machine learning (ML) can help with insights, but up until now you needed ML experts to build models to predict churn, the lack of which could delay insight-driven actions by businesses to retain customers.

In this post, we show you how business analysts can build a customer churn ML model with Amazon SageMaker Canvas, no code required. Canvas provides business analysts with a visual point-and-click interface that allows you to build models and generate accurate ML predictions on your own—without requiring any ML experience or having to write a single line of code.

Overview of solution

For this post, we assume the role of a marketing analyst in the marketing department of a mobile phone operator. We have been tasked with identifying customers that are potentially at risk of churning. We have access to service usage and other customer behavior data, and want to know if this data can help explain why a customer would leave. If we can identify factors that explain churn, then we can take corrective actions to change predicted behavior, such as running targeted retention campaigns.

To do this, we use the data we have in a CSV file, which contains information about customer usage and churn. We use Canvas to perform the following steps:

  1. Import the churn dataset from Amazon Simple Storage Service (Amazon S3).
  2. Train and build the churn model.
  3. Analyze the model results.
  4. Test predictions against the model.

For our dataset, we use a synthetic dataset from a telecommunications mobile phone carrier. This sample dataset contains 5,000 records, where each record uses 21 attributes to describe the customer profile. The attributes are as follows:

  • State – The US state in which the customer resides, indicated by a two-letter abbreviation; for example, OH or NJ
  • Account Length – The number of days that this account has been active
  • Area Code – The three-digit area code of the customer’s phone number
  • Phone – The remaining seven-digit phone number
  • Int’l Plan – Whether the customer has an international calling plan (yes/no)
  • VMail Plan – Whether the customer has a voice mail feature (yes/no)
  • VMail Message – The average number of voice mail messages per month
  • Day Mins – The total number of calling minutes used during the day
  • Day Calls – The total number of calls placed during the day
  • Day Charge – The billed cost of daytime calls
  • Eve Mins, Eve Calls, Eve Charge – The minutes used, calls placed, and billed cost for evening calls
  • Night Mins, Night Calls, Night Charge – The minutes used, calls placed, and billed cost for nighttime calls
  • Intl Mins, Intl Calls, Intl Charge – The minutes used, calls placed, and billed cost for international calls
  • CustServ Calls – The number of calls placed to customer service
  • Churn? – Whether the customer left the service (true/false)

The last attribute, Churn?, is the attribute that we want the ML model to predict. The target attribute is binary, meaning our model predicts the output as one of two categories (True or False).
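Before importing the file into Canvas, it can help to confirm that the columns and the binary target look as expected. The following is a minimal sketch using only the Python standard library. The column spellings are transcribed from the attribute list above and may not match the header in your file exactly (for example, the apostrophe in Int'l Plan), and the in-memory sample rows are placeholders, not real data.

```python
import csv
import io

# Column names transcribed from the attribute list; exact spellings in the
# real file are an assumption and may need adjusting.
EXPECTED_COLUMNS = [
    "State", "Account Length", "Area Code", "Phone", "Int'l Plan",
    "VMail Plan", "VMail Message", "Day Mins", "Day Calls", "Day Charge",
    "Eve Mins", "Eve Calls", "Eve Charge", "Night Mins", "Night Calls",
    "Night Charge", "Intl Mins", "Intl Calls", "Intl Charge",
    "CustServ Calls", "Churn?",
]
TARGET_COLUMN = "Churn?"


def review_churn_csv(file_obj):
    """Count rows, list expected columns that are missing, and collect target values."""
    reader = csv.DictReader(file_obj)
    header = reader.fieldnames or []
    missing = [name for name in EXPECTED_COLUMNS if name not in header]
    target_values = set()
    row_count = 0
    for row in reader:
        row_count += 1
        value = row.get(TARGET_COLUMN)
        if value is not None:
            target_values.add(value)
    return {
        "rows": row_count,
        "missing_columns": missing,
        "target_values": sorted(target_values),
    }


# Placeholder stand-in for churn.csv: a header plus two dummy rows.
sample_rows = [["0"] * 20 + ["True"], ["0"] * 20 + ["False"]]
sample_text = "\n".join(
    [",".join(EXPECTED_COLUMNS)] + [",".join(row) for row in sample_rows]
)
report = review_churn_csv(io.StringIO(sample_text))
print(report)
```

To check a real file, pass an open file handle instead of the in-memory sample, for example open("churn.csv", newline=""). A binary target should report exactly two distinct values.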

Prerequisites

A cloud admin with an AWS account with appropriate permissions is required to complete the following prerequisites:

Create a customer churn model

First, let’s download the churn dataset and review the file to make sure all the data is there. Then complete the following steps:

  1. Sign in to the AWS Management Console, using an account with the appropriate permissions to access Canvas.
  2. Log in to the Canvas console.

This is where we can manage our datasets and create models.

  1. Choose Import.

Canvas Import Button Select

  1. Choose Upload and select the churn.csv file.
  2. Choose Import data to upload it to Canvas.

Canvas select data from s3

The import process takes approximately 10 seconds (this can vary depending on dataset size). When it’s complete, we can see the dataset is in Ready status.

Canvas Ready Dataset

  1. To preview the first 100 rows of the dataset, hover your mouse over the eye icon.

Canvas View Dataset

A preview of the dataset appears. Here we can verify that our data is correct.

Canvas Verify Data

After we confirm that the imported dataset is ready, we create our model.

  1. Choose New model.

Canvas New Models

  1. Select the churn.csv dataset and choose Select dataset.

Canvas Select Dataset

Now we configure the build model process.

  1. For Target columns, choose the Churn? column.

For Model type, Canvas automatically recommends the model type, in this case 2 category prediction (what a data scientist would call binary classification). This is suitable for our use case because we have only two possible prediction values: True or False, so we go with the recommendation Canvas made.

Canvas Build Model

We now validate some assumptions. We want to get a quick view into whether our target column can be predicted by the other columns. We can get a fast view into the model’s estimated accuracy and column impact (the estimated importance of each column in predicting the target column).

  1. Select all 21 columns and choose Preview model.

This feature uses a subset of our dataset and only a single pass at modeling. For our use case, the preview model takes approximately 2 minutes to build.

Canvas Preview Model

As shown in the following screenshot, the Phone and State columns have much less impact on our prediction. We want to be careful when removing text input because it can contain important discrete, categorical features contributing to our prediction. Here, the phone number is just the equivalent of an account number—not of value in predicting other accounts’ likelihood of churn, and the customer’s state doesn’t impact our model much.

  1. We remove these columns because they have no major feature importance.

Canvas Feature Engineering

  2. After we remove the Phone and State columns, let’s run the preview again.

As shown in the following screenshot, the model accuracy increased by 0.1%. Our preview model has a 95.9% estimated accuracy, and the columns with the biggest impact are Night Calls, Eve Mins, and Night Charge. This gives us insight into which columns impact the performance of our model the most. We need to be careful when doing feature selection: if a single feature is extremely impactful on a model’s outcome, that is a primary indicator of target leakage, meaning the feature won’t be available at the time of prediction. In this case, a few columns showed very similar impact, so we continue to build our model.

Canvas Feature Engineering After

Canvas offers two build options:

  • Standard build – Builds the best model from an optimized process powered by AutoML; speed is exchanged for greatest accuracy
  • Quick build – Builds a model in a fraction of the time compared to a standard build; potential accuracy is exchanged for speed
  1. For this post, we choose the Standard build option because we want to have the very best model and we are willing to spend additional time waiting for the result.

Canvas Standard Build

The build process can take 2–4 hours. During this time, Canvas tests hundreds of candidate pipelines, selecting the best model to present to us. In the following screenshot, we can see the expected build time and progress.

Canvas Analyze Model

Evaluate model performance

When the model building process is complete, the model correctly predicted whether a customer churned 97.9% of the time. This seems fine, but as analysts we want to dive deeper and see if we can trust the model to make decisions based on it. On the Scoring tab, we can review a visual plot of our predictions mapped to their outcomes. This gives us deeper insight into our model.

Canvas separates the dataset into training and test sets. The training dataset is the data Canvas uses to build the model. The test set is used to see if the model performs well with new data. The Sankey diagram in the following screenshot shows how the model performed on the test set. To learn more, refer to Evaluating Your Model’s Performance in Amazon SageMaker Canvas.

Canvas Analyze Model Score

To get more detailed insights beyond what is displayed in the Sankey diagram, business analysts can use a confusion matrix analysis for their business solutions. For example, we want to better understand the likelihood of the model making false predictions. We can see this in the Sankey diagram, but want more insights, so we choose Advanced metrics. We’re presented with a confusion matrix, which displays the performance of a model in a visual format with the following values, specific to the positive class—we’re measuring based on whether they will in fact churn, so our positive class is True in this example:

  • True Positive (TP) – The number of True results that were correctly predicted as True
  • True Negative (TN) – The number of False results that were correctly predicted as False
  • False Positive (FP) – The number of False results that were wrongly predicted as True
  • False Negative (FN) – The number of True results that were wrongly predicted as False

We can use this matrix chart to determine not only how accurate our model is, but when it is wrong, how often that might be and how it’s wrong.
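The four values defined above can be computed from any list of actual and predicted labels. The following sketch shows one way to count them and derive accuracy, precision, recall, and F1, with True as the positive class. The function names and the six sample labels are illustrative only; they are not Canvas output.

```python
from typing import Dict, Sequence


def confusion_counts(actual: Sequence[bool], predicted: Sequence[bool]) -> Dict[str, int]:
    """Count TP, TN, FP, and FN with True (churn) as the positive class."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    counts = {"TP": 0, "TN": 0, "FP": 0, "FN": 0}
    for a, p in zip(actual, predicted):
        if a and p:
            counts["TP"] += 1   # churned, predicted churn
        elif not a and not p:
            counts["TN"] += 1   # stayed, predicted stay
        elif not a and p:
            counts["FP"] += 1   # stayed, but predicted churn
        else:
            counts["FN"] += 1   # churned, but predicted stay
    return counts


def summary_metrics(counts: Dict[str, int]) -> Dict[str, float]:
    """Derive common metrics; ratios with a zero denominator are reported as 0.0."""
    tp, tn, fp, fn = counts["TP"], counts["TN"], counts["FP"], counts["FN"]
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Illustrative labels only.
actual = [True, True, False, False, True, False]
predicted = [True, False, False, True, True, False]
counts = confusion_counts(actual, predicted)
metrics = summary_metrics(counts)
print(counts, metrics)
```

For a churn use case, recall is often the metric to watch, because a false negative is a customer who leaves without ever being flagged for a retention campaign.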

Canvas F1 Matrix

The advanced metrics look good, and we can trust the model result. We see very low false positives and false negatives. A false positive occurs when the model predicts that a customer in the dataset will churn and they actually don’t; a false negative occurs when the model predicts that the customer won’t churn and they actually do. High numbers for either might make us reconsider whether we can use the model to make decisions.

Let’s go back to the Overview tab to review the impact of each column. This information can help the marketing team gain insights that lead to actions that reduce customer churn. For example, we can see that both low and high CustServ Calls increase the likelihood of churn. The marketing team can take actions to prevent customer churn based on these learnings. Examples include creating a detailed FAQ on websites to reduce customer service calls, and running education campaigns with customers on the FAQ that can keep engagement up.

Our model looks pretty accurate. We can directly perform an interactive prediction on the Predict tab, either in batch or single (real-time) prediction. In this example, we made a few changes to certain column values and performed a real-time prediction. Canvas shows us the prediction result along with the confidence level.

Canvas Predict Inference

Let’s say we have an existing customer who has the following usage: Night Mins is 40 and Eve Mins is 40. We can run a prediction, and our model returns a confidence score of 93.2% that this customer will churn (True). We might now choose to provide promotional discounts to retain this customer.

Running one prediction is great for individual what-if analysis, but we also need to run predictions on many records at once. Canvas is able to run batch predictions, which allows you to run predictions at scale.

Conclusion

In this post, we showed how a business analyst can create a customer churn model with SageMaker Canvas using sample data. Canvas allows your business analysts to create accurate ML models and generate predictions using a no-code, visual, point-and-click interface. A marketing analyst can now use this information to run targeted retention campaigns and test new campaign strategies faster, leading to a reduction in customer churn.

Analysts can take this to the next level by sharing their models with data scientist colleagues. The data scientists can view the Canvas model in Amazon SageMaker Studio, where they can explore the choices Canvas AutoML made, validate model results, and even productionalize the model with a few clicks. This can accelerate ML-based value creation and help scale improved outcomes faster.

To learn more about using Canvas, see Build, Share, Deploy: how business analysts and data scientists achieve faster time-to-market using no-code ML and Amazon SageMaker Canvas. For more information about creating ML models with a no-code solution, see Announcing Amazon SageMaker Canvas – a Visual, No Code Machine Learning Capability for Business Analysts.


About the Authors

Henry Robalino is a Solutions Architect at AWS, based out of NJ. He is passionate about cloud and machine learning, and the role they can play in society. He achieves this by working with customers to help them achieve their business goals using the AWS Cloud. Outside of work, you can find Henry traveling or exploring the outdoors with his fur daughter Arly.

Chaoran Wang is a Solutions Architect at AWS, based in Dallas, TX. He has been working at AWS since graduating from the University of Texas at Dallas in 2016 with a master’s in Computer Science. Chaoran helps customers build scalable, secure, and cost-effective applications and find solutions to solve their business challenges on the AWS Cloud. Outside work, Chaoran loves spending time with his family and two dogs, Biubiu and Coco.

Deploy and manage machine learning pipelines with Terraform using Amazon SageMaker

AWS customers are relying on Infrastructure as Code (IaC) to design, develop, and manage their cloud infrastructure. IaC ensures that customer infrastructure and services are consistent, scalable, and reproducible, while being able to follow best practices in the area of development operations (DevOps).

One possible approach to manage AWS infrastructure and services with IaC is Terraform, which allows developers to organize their infrastructure in reusable code modules. This aspect is increasingly gaining importance in the area of machine learning (ML). Developing and managing ML pipelines, including training and inference with Terraform as IaC, lets you easily scale for multiple ML use cases or Regions without having to develop the infrastructure from scratch. Furthermore, it provides consistency for the infrastructure (for example, instance type and size) for training and inference across different implementations of the ML pipeline. This lets you route requests and incoming traffic to different Amazon SageMaker endpoints.

In this post, we show you how to deploy and manage ML pipelines using Terraform and Amazon SageMaker.

Solution overview

This post provides code and walks you through the steps necessary to deploy AWS infrastructure for ML pipelines with Terraform for model training and inference using Amazon SageMaker. The ML pipeline is managed via AWS Step Functions to orchestrate the different steps implemented in the ML pipeline, as illustrated in the following figure.

Step Function Steps

Step Functions starts an AWS Lambda function, generating a unique job ID, which is then used when starting a SageMaker training job. Step Functions also creates a model, endpoint configuration, and endpoint used for inference. Additional resources include the following:

The ML-related code for training and inference with a Docker image relies mainly on existing work in the following GitHub repository.

The following diagram illustrates the solution architecture:

Architecture Diagram

We walk you through the following high-level steps:

  1. Deploy your AWS infrastructure with Terraform.
  2. Push your Docker image to Amazon ECR.
  3. Run the ML pipeline.
  4. Invoke your endpoint.

Repository structure

You can find the repository containing the code and data used for this post in the following GitHub repository.

The repository includes the following directories:

  • /terraform – Consists of the following subfolders:

    • ./infrastructure – Contains the main.tf file calling the ML pipeline module, in addition to variable declarations that we use to deploy the infrastructure
    • ./ml-pipeline-module – Contains the Terraform ML pipeline module, which we can reuse
  • /src – Consists of the following subfolders:

    • ./container – Contains example code for training and inference with the definitions for the Docker image
    • ./lambda_function – Contains the Python code for the Lambda function generating configurations, such as a unique job ID for the SageMaker training job
  • /data – Contains the following file:

    • ./iris.csv – Contains data for training the ML model
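The Lambda function in src/lambda_function is described as generating configurations such as a unique job ID for the SageMaker training job. The repository's actual implementation is not shown in this post, so the following is only a sketch of one way such a handler could look. The event key job_name_prefix, the default prefix, and the shape of the return value are assumptions made for illustration.

```python
import uuid
from datetime import datetime, timezone

# SageMaker training job names are limited to 63 characters
# (letters, digits, and hyphens).
MAX_JOB_NAME_LENGTH = 63


def lambda_handler(event, context):
    """Return a unique, SageMaker-compatible job ID (hypothetical event/return shape)."""
    # "job_name_prefix" is a hypothetical input key, not taken from the repository.
    prefix = str(event.get("job_name_prefix", "ml-pipeline"))
    timestamp = datetime.now(timezone.utc).strftime("%Y%m%d-%H%M%S")
    suffix = uuid.uuid4().hex[:8]  # random part keeps IDs unique within one second

    # Leave room for "-<timestamp>-<suffix>" so the full name stays within the limit.
    reserved = len(timestamp) + len(suffix) + 2
    prefix = prefix[: MAX_JOB_NAME_LENGTH - reserved].strip("-") or "job"

    job_id = f"{prefix}-{timestamp}-{suffix}"
    return {"statusCode": 200, "id": job_id}


first = lambda_handler({}, None)
second = lambda_handler({"job_name_prefix": "churn"}, None)
print(first["id"], second["id"])
```

A timestamp plus a short random suffix keeps the IDs sortable and readable in the console while still avoiding collisions when the pipeline is started several times in quick succession.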

Prerequisites

For this walkthrough, you should have the following prerequisites:

Deploy your AWS infrastructure with Terraform

To deploy the ML pipeline, you need to adjust a few variables and names according to your needs. The code for this step is in the /terraform directory.

When initializing for the first time, open the file terraform/infrastructure/terraform.tfvars and adjust the variable project_name to the name of your project, in addition to the variable region if you want to deploy in another Region. You can also change additional variables such as instance types for training and inference.

Then use the following commands to deploy the infrastructure with Terraform:

export AWS_PROFILE=<your_aws_cli_profile_name>
cd terraform/infrastructure
terraform init
terraform plan
terraform apply

Check the output and make sure that the planned resources appear correctly, and confirm with yes in the apply stage if everything is correct. Then go to the Amazon ECR console (or check the output of Terraform in the terminal) and get the URL for your ECR repository that you created via Terraform.

The output should look similar to the following displayed output, including the ECR repository URL:

Apply complete! Resources: 19 added, 0 changed, 0 destroyed.

Outputs:

ecr_repository_url = <account_number>.dkr.ecr.eu-west-1.amazonaws.com/ml-pipeline-terraform-demo

Push your Docker image to Amazon ECR

For the ML pipeline and SageMaker to train and provision a SageMaker endpoint for inference, you need to provide a Docker image and store it in Amazon ECR. You can find an example in the directory src/container. If you have already applied the AWS infrastructure from the earlier step, you can push the Docker image as described. After your Docker image is developed, you can take the following actions and push it to Amazon ECR (adjust the Amazon ECR URL according to your needs):

cd src/container
export AWS_PROFILE=<your_aws_cli_profile_name>
aws ecr get-login-password --region eu-west-1 | docker login --username AWS --password-stdin <account_number>.dkr.ecr.eu-west-1.amazonaws.com
docker build -t ml-training .
docker tag ml-training:latest <account_number>.dkr.ecr.eu-west-1.amazonaws.com/<ecr_repository_name>:latest
docker push <account_number>.dkr.ecr.eu-west-1.amazonaws.com/<ecr_repository_name>

If you have already applied the AWS infrastructure with Terraform, you can push the changes of your code and Docker image directly to Amazon ECR without deploying via Terraform again.

Run the ML pipeline

To train and run the ML pipeline, go to the Step Functions console and start the execution of the state machine. You can check the progress of each step in the visualization of the state machine. You can also check the SageMaker training job progress and the status of your SageMaker endpoint.

Start Step Function

After you successfully run the state machine in Step Functions, you can see that the SageMaker endpoint has been created. On the SageMaker console, choose Inference in the navigation pane, then Endpoints. Make sure to wait for the status to change to InService.

SageMaker Endpoint Status

Invoke your endpoint

To invoke your endpoint (in this example, for the iris dataset), you can use the following Python script with the AWS SDK for Python (Boto3). You can do this from a SageMaker notebook, or embed the following code snippet in a Lambda function:

import boto3
from io import StringIO
import pandas as pd

# Runtime client used to send inference requests to a deployed endpoint.
client = boto3.client('sagemaker-runtime')

endpoint_name = 'Your endpoint name'  # Replace with the name of your endpoint.
content_type = "text/csv"  # The MIME type of the input data in the request body.

# One iris sample, serialized as a single CSV row without header or index.
payload = pd.DataFrame([[1.5, 0.2, 4.4, 2.6]])
csv_file = StringIO()
payload.to_csv(csv_file, sep=",", header=False, index=False)
payload_as_csv = csv_file.getvalue()

response = client.invoke_endpoint(
    EndpointName=endpoint_name,
    ContentType=content_type,
    Body=payload_as_csv
)

# The response body is a stream; read and decode it to get the predicted label.
label = response['Body'].read().decode('utf-8')
print(label)

Clean up

You can destroy the infrastructure created by Terraform with the command terraform destroy, but you need to delete the data and files in the S3 buckets first. Furthermore, the SageMaker endpoint (or multiple SageMaker endpoints if run multiple times) is created via Step Functions and not managed via Terraform. This means that the deployment happens when running the ML pipeline with Step Functions. Therefore, make sure you delete the SageMaker endpoint or endpoints created via the Step Functions ML pipeline as well to avoid unnecessary costs. Complete the following steps:

  1. On the Amazon S3 console, delete the dataset in the S3 training bucket.
  2. Delete all the models you trained via the ML pipeline in the S3 models bucket, either via the Amazon S3 console or the AWS CLI.
  3. Destroy the infrastructure created via Terraform:
    cd terraform/infrastructure
    terraform destroy

  4. Delete the SageMaker endpoints, endpoint configuration, and models created via Step Functions, either on the SageMaker console or via the AWS CLI.
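For the last step, the following sketch shows how the endpoint, its endpoint configuration, and the associated models could be deleted with Boto3. It assumes a SageMaker client such as boto3.client("sagemaker") is passed in; the function name and return value are illustrative. The endpoint and its configuration are described before anything is deleted, because the configuration and model names are looked up through them.

```python
def delete_endpoint_resources(sm_client, endpoint_name):
    """Delete an endpoint plus the endpoint configuration and models behind it."""
    # Look up the dependent resources first; this is not possible after deletion.
    endpoint = sm_client.describe_endpoint(EndpointName=endpoint_name)
    config_name = endpoint["EndpointConfigName"]
    config = sm_client.describe_endpoint_config(EndpointConfigName=config_name)
    model_names = [
        variant["ModelName"]
        for variant in config["ProductionVariants"]
        if "ModelName" in variant
    ]

    sm_client.delete_endpoint(EndpointName=endpoint_name)
    sm_client.delete_endpoint_config(EndpointConfigName=config_name)
    for model_name in model_names:
        sm_client.delete_model(ModelName=model_name)

    return {"endpoint": endpoint_name, "config": config_name, "models": model_names}


class _FakeSageMakerClient:
    """Stand-in client that records calls, for a dry run without AWS access."""

    def __init__(self):
        self.calls = []

    def describe_endpoint(self, EndpointName):
        return {"EndpointConfigName": "demo-config"}

    def describe_endpoint_config(self, EndpointConfigName):
        return {"ProductionVariants": [{"ModelName": "demo-model"}]}

    def delete_endpoint(self, EndpointName):
        self.calls.append(("delete_endpoint", EndpointName))

    def delete_endpoint_config(self, EndpointConfigName):
        self.calls.append(("delete_endpoint_config", EndpointConfigName))

    def delete_model(self, ModelName):
        self.calls.append(("delete_model", ModelName))


fake = _FakeSageMakerClient()
deleted = delete_endpoint_resources(fake, "demo-endpoint")
print(deleted)
```

If the pipeline was run several times, call the function once per endpoint name. Models that are shared between endpoint configurations should only be deleted once no remaining configuration references them.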

Conclusion

Congratulations! You’ve deployed an ML pipeline using SageMaker with Terraform. This example solution shows how you can easily deploy AWS infrastructure and services for ML pipelines in a reusable fashion. This allows you to scale for multiple use cases or Regions, and enables training and deploying ML models with one click in a consistent way. Furthermore, you can run the ML pipeline multiple times, for example, when new data is available or you want to change the algorithm code. You can also choose to route requests or traffic to different SageMaker endpoints.

I encourage you to explore adding security features and adopting security best practices according to your needs and potential company standards. Additionally, embedding this solution into your CI/CD pipelines will give you further capabilities in adopting and establishing DevOps best practices and standards according to your requirements.


About the Author

Oliver Zollikofer is a Data Scientist at Amazon Web Services. He enables global enterprise customers to build, train, and deploy machine learning models, and to manage the ML model lifecycle with MLOps. He also builds and architects related cloud solutions.

Achieve hyperscale performance for model serving using NVIDIA Triton Inference Server on Amazon SageMaker

Machine learning (ML) applications are complex to deploy and often require multiple ML models to serve a single inference request. A typical request may flow across multiple models with steps like preprocessing, data transformations, model selection logic, model aggregation, and postprocessing. This has led to the evolution of common design patterns such as serial inference pipelines, ensembles (scatter gather), and business logic workflows, with the entire workflow of the request realized as a Directed Acyclic Graph (DAG). However, as workflows get more complex, the overall response time, or latency, of these applications increases, which in turn impacts the overall user experience. Furthermore, if these components are hosted on different instances, the additional network latency between these instances increases the overall latency.

Consider an example of a popular ML use case for a virtual assistant in customer support. A typical request might have to go through several steps involving speech recognition, natural language processing (NLP), dialog state tracking, dialog policy, text generation, and finally text to speech. Furthermore, to make the user interaction more personalized, you might also use state-of-the-art, transformer-based NLP models like different versions of BERT, BART, and GPT. The end result is long response times for these model ensembles and a poor customer experience.

A common pattern to drive lower response times without compromising overall throughput is to host these models on the same instance along with the lightweight business logic embedded in it. These models can further be encapsulated within single or multiple containers on the same instance in order to provide isolation for running processes and keep latency low. Additionally, overall latency also depends on inference application logic, model optimizations, underlying infrastructure (including compute, storage, and networking), and the underlying web server taking inference requests. NVIDIA Triton Inference Server is an open-source inference serving software with features to maximize throughput and hardware utilization with ultra-low (single-digit milliseconds) inference latency. It has wide support of ML frameworks (including TensorFlow, PyTorch, ONNX, XGBoost, and NVIDIA TensorRT) and infrastructure backends, including GPUs, CPUs, and AWS Inferentia. Additionally, Triton Inference Server is integrated with Amazon SageMaker, a fully managed end-to-end ML service, providing real-time inference options including single and multi-model hosting. These inference options include hosting multiple models within the same container behind a single endpoint, and hosting multiple models with multiple containers behind a single endpoint.

In November 2021, we announced the integration of Triton Inference Server on SageMaker. AWS worked closely with NVIDIA to enable you to get the best of both worlds and make model deployment with Triton on AWS easier.

In this post, we look at best practices for deploying transformer models at scale on GPUs using Triton Inference Server on SageMaker. First, we start with a summary of key concepts around latency in SageMaker, and an overview of performance tuning guidelines. Next, we provide an overview of Triton and its features as well as example code for deploying on SageMaker. Finally, we perform load tests using SageMaker Inference Recommender and summarize the insights and conclusions from load testing of a popular transformer model provided by Hugging Face.

You can review the notebook we used to deploy models and perform load tests on your own using the code on GitHub.

Performance tuning and optimization for model serving on SageMaker

Performance tuning and optimization is an empirical process often involving multiple iterations. The number of parameters to tune is combinatorial, and the configuration parameter values aren’t independent of each other. Various factors affect optimal parameter tuning, including payload size, type, and the number of ML models in the inference request flow graph, storage type, compute instance type, network infrastructure, application code, inference serving software runtime and configuration, and more.

If you’re using SageMaker for deploying ML models, you have to select a compute instance with the best price-performance, which is a complicated and iterative process that can take weeks of experimentation. First, you need to choose the right ML instance type out of over 70 options based on the resource requirements of your models and the size of the input data. Next, you need to optimize the model for the selected instance type. Lastly, you need to provision and manage infrastructure to run load tests and tune cloud configuration for optimal performance and cost. All this can delay model deployment and time to market. Additionally, you need to evaluate the trade-offs between latency, throughput, and cost to select the optimal deployment configuration. SageMaker Inference Recommender automatically selects the right compute instance type, instance count, container parameters, and model optimizations for inference to maximize throughput, reduce latency, and minimize cost.

Real-time inference and latency in SageMaker

SageMaker real-time inference is ideal for inference workloads where you have real-time, interactive, low-latency requirements. The four most commonly used metrics for monitoring inference request latency for SageMaker inference endpoints are as follows:

  • Container latency – The time it takes to send the request, fetch the response from the model’s container, and complete inference in the container. This metric is available in Amazon CloudWatch as part of the Invocation Metrics published by SageMaker.
  • Model latency – The total time taken by all SageMaker containers in an inference pipeline. This metric is available in Amazon CloudWatch as part of the Invocation Metrics published by SageMaker.
  • Overhead latency – Measured from the time that SageMaker receives the request until it returns a response to the client, minus the model latency. This metric is available in Amazon CloudWatch as part of the Invocation Metrics published by SageMaker.
  • End-to-end latency – Measured from the time the client sends the inference request until it receives a response back. Customers can publish this as a custom metric in Amazon CloudWatch.

The following diagram illustrates these components.
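Of the four metrics, end-to-end latency is the only one you have to measure yourself on the client side. The following sketch times an arbitrary request function and summarizes the samples with a nearest-rank percentile. The function names are illustrative, and the stand-in request only sleeps briefly; in practice you would pass a function that calls your endpoint.

```python
import math
import time
from typing import Callable, List, Sequence


def measure_end_to_end_ms(send_request: Callable[[], object], samples: int = 10) -> List[float]:
    """Time each call from just before sending until the response has returned."""
    latencies = []
    for _ in range(samples):
        start = time.perf_counter()
        send_request()
        latencies.append((time.perf_counter() - start) * 1000.0)
    return latencies


def percentile(values: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile (pct in the range 0-100) of a non-empty sequence."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100.0 * len(ordered)))
    return ordered[rank - 1]


def fake_request():
    """Stand-in for a real endpoint invocation."""
    time.sleep(0.001)


latencies_ms = measure_end_to_end_ms(fake_request, samples=5)
print(f"p50={percentile(latencies_ms, 50):.2f} ms, p99={percentile(latencies_ms, 99):.2f} ms")
```

The resulting values could then be published as a custom metric, for example with the CloudWatch put_metric_data API, so they can be compared with the model and overhead latency that SageMaker publishes.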

Container latency depends on several factors; the following are among the most important:

  • Underlying protocol (HTTP(s)/gRPC) used to communicate with the inference server
  • Overhead related to creating new TLS connections
  • Deserialization time of the request/response payload
  • Request queuing and batching features provided by the underlying inference server
  • Request scheduling capabilities provided by the underlying inference server
  • Underlying runtime performance of the inference server
  • Performance of preprocessing and postprocessing libraries before calling the model prediction function
  • Underlying ML framework backend performance
  • Model-specific and hardware-specific optimizations

In this post, we focus primarily on optimizing container latency along with overall throughput and cost. Specifically, we explore performance tuning Triton Inference Server running inside a SageMaker container.

Use case overview

Deploying and scaling NLP models in a production setup can be quite challenging. NLP models are often very large in size, containing millions of model parameters. Optimal model configurations are required to satisfy the stringent performance and scalability requirements of production-grade NLP applications.

In this post, we benchmark an NLP use case using a SageMaker real-time endpoint based on a Triton Inference Server container and recommend performance tuning optimizations for our ML use case. We use a large, pre-trained transformer-based Hugging Face BERT large uncased model, which has about 336 million model parameters. The input sentence used for the binary classification model is padded and truncated to a maximum input sequence length of 512 tokens. The inference load test simulates 500 invocations per second (30,000 maximum invocations per minute) and ModelLatency of less than 0.5 seconds (500 milliseconds).

The following table summarizes our benchmark configuration.

Model Name                Hugging Face bert-large-uncased
Model Size                1.25 GB
Latency Requirement       0.5 seconds (500 milliseconds)
Invocations per Second    500 requests (30,000 per minute)
Input Sequence Length     512 tokens
ML Task                   Binary classification
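The figures in this configuration can be cross-checked with a few lines of arithmetic. The sketch below assumes that the weights are stored as 32-bit floats and that the listed model size is in binary gigabytes; neither is stated in the table. It also applies Little's law to estimate how many requests are in flight at once if every request takes the full latency budget.

```python
# Figures taken from the benchmark configuration.
PARAMETERS = 336_000_000
INVOCATIONS_PER_SECOND = 500
LATENCY_BUDGET_SECONDS = 0.5

# Assumption: 4 bytes per parameter (FP32 weights).
BYTES_PER_PARAMETER = 4

model_size_bytes = PARAMETERS * BYTES_PER_PARAMETER
model_size_gib = model_size_bytes / 2**30  # binary gigabytes

invocations_per_minute = INVOCATIONS_PER_SECOND * 60

# Little's law: average requests in flight = arrival rate x time in system.
# Using the full latency budget gives an upper estimate for the target load.
in_flight_requests = INVOCATIONS_PER_SECOND * LATENCY_BUDGET_SECONDS

print(f"Approximate model size: {model_size_gib:.2f} GiB")
print(f"Invocations per minute: {invocations_per_minute}")
print(f"Requests in flight at the latency limit: {in_flight_requests:.0f}")
```

Under these assumptions, the endpoint has to sustain on the order of a few hundred concurrent requests, which is why batching and the number of concurrent model instances matter so much for this benchmark.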

NVIDIA Triton Inference Server

Triton Inference Server is specifically designed to enable scalable, rapid, and easy deployment of models in production. Triton supports a variety of major AI frameworks, including TensorFlow, TensorRT, PyTorch, XGBoost and ONNX. With the Python and C++ custom backend, you can also implement your inference workload for more customized use cases.

Most importantly, Triton provides a simple configuration-based setup to host your models, which exposes a rich set of performance optimization features you can use with little coding effort.

Triton increases inference performance by maximizing hardware utilization with different optimization techniques (concurrent model runs and dynamic batching are the most frequently used). Finding the optimal model configuration from the various combinations of dynamic batch sizes and numbers of concurrent model instances is key to achieving real-time inference at a low serving cost with Triton.

Dynamic batching

Many practitioners tend to run inference sequentially when the server is invoked with multiple independent requests. Although this approach is easier to set up, it usually doesn’t make the best use of the GPU’s compute power. To address this, Triton offers built-in dynamic batching, which combines these independent inference requests on the server side into a larger batch to increase throughput. The following diagram illustrates the Triton runtime architecture.

In the preceding architecture, all the requests reach the dynamic batcher first before entering the actual model scheduler queues to wait for inference. You can set your preferred batch sizes for dynamic batching using the preferred_batch_size settings in the model configuration. (Note that the formed batch size can be no larger than the max_batch_size the model supports.) You can also configure max_queue_delay_microseconds to specify the maximum time the batcher waits for other requests to join the batch, based on your latency requirements.

The following code snippet shows how you can add this feature with model configuration files to set dynamic batching with a preferred batch size of 16 for the actual inference. With the current settings, the model instance is invoked instantly when the preferred batch size of 16 is met or the delay time of 100 microseconds has elapsed since the first request reached the dynamic batcher.

dynamic_batching {
    preferred_batch_size: 16
    max_queue_delay_microseconds: 100
}
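To make the batching rule concrete, the following is a minimal Python sketch of the behavior described above. It is a simplified model for illustration only, not Triton's actual scheduler: the function name form_batches and its parameters are hypothetical, and it assumes a batch is dispatched either when it reaches the preferred size or when the queue delay of its oldest request has elapsed.

```python
from typing import List


def form_batches(
    arrivals_us: List[int],
    preferred_batch_size: int = 16,
    max_queue_delay_us: int = 100,
) -> List[List[int]]:
    """Group request arrival times (in microseconds) into batches."""
    batches: List[List[int]] = []
    current: List[int] = []
    for t in sorted(arrivals_us):
        # The open batch timed out before this request arrived: dispatch it.
        if current and t - current[0] > max_queue_delay_us:
            batches.append(current)
            current = []
        current.append(t)
        # The preferred batch size is reached: dispatch immediately.
        if len(current) == preferred_batch_size:
            batches.append(current)
            current = []
    if current:
        # Leftover requests are dispatched once their queue delay expires.
        batches.append(current)
    return batches


# 20 requests arriving 1 microsecond apart.
burst = form_batches(list(range(20)))
# 3 sparse requests; the third arrives long after the 100-microsecond window.
sparse = form_batches([0, 50, 500])
print([len(b) for b in burst], [len(b) for b in sparse])
```

In this toy model, a dense burst fills a batch of 16 right away, whereas sparse traffic is sent in smaller batches after the delay expires, which is the trade-off that max_queue_delay_microseconds controls.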

Running models concurrently

Another essential optimization offered in Triton to maximize hardware utilization without additional latency overhead is concurrent model execution, which allows multiple models or multiple copies of the same model to run in parallel. This feature enables Triton to handle multiple inference requests simultaneously, which increases the inference throughput by utilizing otherwise idle compute power on the hardware.

The following figure showcases how you can easily configure different model deployment policies with only a few lines of code changes. For example, configuration A (left) shows that you can broadcast the same configuration of two model instances of bert-large-uncased to all available GPUs. In contrast, configuration B (middle) shows a different configuration for GPU 0 only, without changing the policies on the other GPUs. You can also deploy instances of different models on a single GPU, as shown in configuration C (right).

In configuration C, the compute instance can handle two concurrent requests for the DistilGPT-2 model and seven concurrent requests for the bert-large-uncased model in parallel. With these optimizations, the hardware resources can be better utilized for the serving process, thereby improving the throughput and providing better cost-efficiency for your workload.

TensorRT

NVIDIA TensorRT is an SDK for high-performance deep learning inference that works seamlessly with Triton. TensorRT, which supports every major deep learning framework, includes an inference optimizer and runtime that delivers low latency and high throughput to run inferences with massive volumes of data via powerful optimizations.

TensorRT optimizes the graph to minimize memory footprint by freeing unnecessary memory and efficiently reusing it. Additionally, TensorRT compilation fuses the sparse operations inside the model graph to form a larger kernel to avoid the overhead of multiple small kernel launches. Kernel auto-tuning helps you fully utilize the hardware by selecting the best algorithm on your target GPU. CUDA streams enable models to run in parallel to maximize your GPU utilization for best performance. Last but not least, the quantization technique can fully use the mixed-precision acceleration of the Tensor cores to run the model in FP32, TF32, FP16, and INT8 to achieve the best inference performance.

Triton on SageMaker hosting

SageMaker hosting services are the set of SageMaker features aimed at making model deployment and serving easier. It provides a variety of options to easily deploy, auto scale, monitor, and optimize ML models tailored for different use cases. This means that you can optimize your deployments for all types of usage patterns, from persistent and always available with serverless options, to transient, long-running, or batch inference needs.

Under the SageMaker hosting umbrella is also the set of SageMaker inference Deep Learning Containers (DLCs), which come prepackaged with the appropriate model server software for their corresponding supported ML framework. This enables you to achieve high inference performance with no model server setup, which is often the most complex technical aspect of model deployment and, in general, isn’t part of a data scientist’s skill set. Triton Inference Server is now available on SageMaker DLCs.

This breadth of options, modularity, and ease of use of different serving frameworks makes SageMaker and Triton a powerful match.

SageMaker Inference Recommender for benchmarking test results

We use SageMaker Inference Recommender to run our experiments. SageMaker Inference Recommender offers two types of jobs: default and advanced, as illustrated in the following diagram.

The default job provides recommendations on instance types with just the model and a sample payload to benchmark. In addition to instance recommendations, the service also offers runtime parameters that improve performance. The default job’s recommendations are intended to narrow down the instance search. In some cases, it could be the instance family, and in others, it could be the specific instance types. The results of the default job are then fed into the advanced job.

The advanced job offers more controls to further fine-tune performance. These controls simulate the real environment and production requirements. Among these controls is the traffic pattern, which aims to stage the request pattern for the benchmarks. You can set ramps or steady traffic by using the traffic pattern’s multiple phases. For example, an InitialNumberOfUsers of 1, SpawnRate of 1, and DurationInSeconds of 600 may result in ramp traffic of 10 minutes with 1 concurrent user at the beginning and 10 at the end. Additionally, on the controls, MaxInvocations and ModelLatencyThresholds set the threshold of production, so when one of the thresholds is exceeded, the benchmarking stops.
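The ramp in the preceding example can be sketched with a few lines of Python. This is only an illustration under an assumed model: the helper users_per_minute is hypothetical, and it assumes SpawnRate users are added at the start of each minute after the first, which reproduces the 1-to-10 ramp described above.

```python
from typing import List


def users_per_minute(initial_users: int, spawn_rate: int, duration_s: int) -> List[int]:
    """Concurrent users during each minute of a single traffic phase.

    Assumption: spawn_rate new users join at the start of every minute
    after the first one.
    """
    minutes = duration_s // 60
    return [initial_users + spawn_rate * m for m in range(minutes)]


ramp = users_per_minute(initial_users=1, spawn_rate=1, duration_s=600)
print(f"first minute: {ramp[0]} user(s), last minute: {ramp[-1]} user(s)")
```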

Finally, recommendation metrics include throughput, latency at maximum throughput, and cost per inference, so it’s easy to compare them.

We use the advanced job type of SageMaker Inference Recommender to run our experiments to gain additional control over the traffic patterns, and fine-tune the configuration of the serving container.

Experiment setup

We use the custom load test feature of SageMaker Inference Recommender to benchmark the NLP profile outlined in our use case. We first define the following prerequisites related to the NLP model and ML task. SageMaker Inference Recommender uses this information to pull an inference Docker image from Amazon Elastic Container Registry (Amazon ECR) and register the model with the SageMaker model registry.

Domain | NATURAL_LANGUAGE_PROCESSING
Task | FILL_MASK
Framework | PYTORCH: 1.6.0
Model | bert-large-uncased

The traffic pattern configurations in SageMaker Inference Recommender allow us to define different phases for the custom load test. The load test starts with two initial users and spawns two new users every minute, for a total duration of 25 minutes (1500 seconds), as shown in the following code:

"TrafficPattern": {
    "TrafficType": "PHASES",
    "Phases": [
        {
            "InitialNumberOfUsers": 2,
            "SpawnRate": 2,
            "DurationInSeconds": 1500
        }
    ]
}
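For context, the traffic pattern is one part of the input to an advanced job. The following sketch assembles the phase above together with the thresholds from our use case (30,000 invocations per minute and 500 milliseconds of model latency) into plain Python dictionaries. The field names are meant to mirror the CreateInferenceRecommendationsJob API and should be verified against the current API reference; the P95 percentile is an assumption for illustration, because this post doesn't state which percentile was used. Nothing is submitted to SageMaker here.

```python
import json

# Traffic phases for the load test, as shown above.
traffic_pattern = {
    "TrafficType": "PHASES",
    "Phases": [
        {"InitialNumberOfUsers": 2, "SpawnRate": 2, "DurationInSeconds": 1500}
    ],
}

# Thresholds at which benchmarking stops.
stopping_conditions = {
    "MaxInvocations": 30000,  # invocations per minute
    "ModelLatencyThresholds": [
        # "P95" is an assumed percentile, not taken from the post.
        {"Percentile": "P95", "ValueInMilliseconds": 500}
    ],
}

total_duration_s = sum(p["DurationInSeconds"] for p in traffic_pattern["Phases"])
print(f"Total load test duration: {total_duration_s / 60:.0f} minutes")
print(json.dumps({"TrafficPattern": traffic_pattern,
                  "StoppingConditions": stopping_conditions}, indent=2))
```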

We experiment with load testing the same model in two different states. The PyTorch-based experiments use the standard, unaltered PyTorch model. For the TensorRT-based experiments, we convert the PyTorch model into a TensorRT engine beforehand.

We apply different combinations of the performance optimization features on these two models, summarized in the following table.

Configuration Name | Configuration Description | Model Configuration
pt-base | PyTorch baseline | Base PyTorch model, no changes
pt-db | PyTorch with dynamic batching | dynamic_batching {}
pt-ig | PyTorch with multiple model instances | instance_group [ { count: 2 kind: KIND_GPU } ]
pt-ig-db | PyTorch with multiple model instances and dynamic batching | dynamic_batching {}, instance_group [ { count: 2 kind: KIND_GPU } ]
trt-base | TensorRT baseline | PyTorch model compiled with TensorRT trtexec utility
trt-db | TensorRT with dynamic batching | dynamic_batching {}
trt-ig | TensorRT with multiple model instances | instance_group [ { count: 2 kind: KIND_GPU } ]
trt-ig-db | TensorRT with multiple model instances and dynamic batching | dynamic_batching {}, instance_group [ { count: 2 kind: KIND_GPU } ]
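Because the four variants per model differ only in two settings, the configuration fragments can be generated rather than written by hand. The following is a minimal sketch, assuming a hypothetical helper named render_config; it only produces the text fragments shown in the table and doesn't write a complete config.pbtxt file.

```python
def render_config(instance_count: int = 0, dynamic_batching: bool = False) -> str:
    """Return the model configuration fragment for one benchmark variant."""
    parts = []
    if dynamic_batching:
        # Empty block: enable dynamic batching with default settings.
        parts.append("dynamic_batching {}")
    if instance_count > 0:
        # Run several copies of the model on each available GPU.
        parts.append(
            "instance_group [\n"
            "  {\n"
            f"    count: {instance_count}\n"
            "    kind: KIND_GPU\n"
            "  }\n"
            "]"
        )
    return "\n".join(parts)


variants = {
    "base": render_config(),
    "db": render_config(dynamic_batching=True),
    "ig": render_config(instance_count=2),
    "ig-db": render_config(instance_count=2, dynamic_batching=True),
}

for name, fragment in variants.items():
    print(f"# {name}")
    print(fragment or "(no changes)")
    print()
```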

Test results and observations

We conducted load tests for three instance types within the same g4dn family: ml.g4dn.xlarge, ml.g4dn.2xlarge and ml.g4dn.12xlarge. All g4dn instance types have access to NVIDIA T4 Tensor Core GPUs, and 2nd Generation Intel Cascade Lake processors. The logic behind the choice of instance types was to have both an instance with only one GPU available, as well as an instance with access to multiple GPUs—four in the case of ml.g4dn.12xlarge. Additionally, we wanted to test if increasing the vCPU capacity on the instance with only one available GPU would yield a cost-performance ratio improvement.

Let’s go over the speedup from each individual optimization first. The following graph shows that TensorRT optimization provides a 50% reduction in model latency compared to native PyTorch on the ml.g4dn.xlarge instance. On the multi-GPU ml.g4dn.12xlarge instance, the reduction grows to more than 3x. Meanwhile, the throughput improvement of about 30% is consistent on both instances, resulting in better cost-effectiveness after applying TensorRT optimizations.

With dynamic batching, we can get close to a 2x improvement in throughput for the TensorRT model (trt-db compared with trt-base) on the same hardware across all the instance types we tested (ml.g4dn.xlarge, ml.g4dn.2xlarge, and ml.g4dn.12xlarge), without a noticeable increase in latency.

Similarly, concurrent model execution enables us to obtain about a 3–4x improvement in throughput by maximizing GPU utilization on the ml.g4dn.xlarge instance, and about a 2x improvement on both the ml.g4dn.2xlarge instance and the multi-GPU ml.g4dn.12xlarge instance. This throughput increase comes without any overhead in latency.

Better still, we can integrate all these optimizations to provide the best performance by utilizing the hardware resources to the fullest. The following table and graphs summarize the results we obtained in our experiments.

Configuration Name | Model optimization | Dynamic Batching | Instance group config | Instance type | vCPUs | GPUs | GPU Memory (GB) | Initial Instance Count[1] | Invocations per min per Instance | Model Latency | Cost per Hour[2]
pt-base | NA | No | NA | ml.g4dn.xlarge | 4 | 1 | 16 | 62 | 490 | 1500 | 45.6568
pt-db | NA | Yes | NA | ml.g4dn.xlarge | 4 | 1 | 16 | 57 | 529 | 1490 | 41.9748
pt-ig | NA | No | 2 | ml.g4dn.xlarge | 4 | 1 | 16 | 34 | 906 | 868 | 25.0376
pt-ig-db | NA | Yes | 2 | ml.g4dn.xlarge | 4 | 1 | 16 | 34 | 892 | 1158 | 25.0376
trt-base | TensorRT | No | NA | ml.g4dn.xlarge | 4 | 1 | 16 | 47 | 643 | 742 | 34.6108
trt-db | TensorRT | Yes | NA | ml.g4dn.xlarge | 4 | 1 | 16 | 28 | 1078 | 814 | 20.6192
trt-ig | TensorRT | No | 2 | ml.g4dn.xlarge | 4 | 1 | 16 | 14 | 2202 | 1273 | 10.3096
trt-db-ig | TensorRT | Yes | 2 | ml.g4dn.xlarge | 4 | 1 | 16 | 10 | 3192 | 783 | 7.364
pt-base | NA | No | NA | ml.g4dn.2xlarge | 8 | 1 | 32 | 56 | 544 | 1500 | 52.64
pt-db | NA | Yes | NA | ml.g4dn.2xlarge | 8 | 1 | 32 | 59 | 517 | 1500 | 55.46
pt-ig | NA | No | 2 | ml.g4dn.2xlarge | 8 | 1 | 32 | 29 | 1054 | 960 | 27.26
pt-ig-db | NA | Yes | 2 | ml.g4dn.2xlarge | 8 | 1 | 32 | 30 | 1017 | 992 | 28.2
trt-base | TensorRT | No | NA | ml.g4dn.2xlarge | 8 | 1 | 32 | 42 | 718 | 1494 | 39.48
trt-db | TensorRT | Yes | NA | ml.g4dn.2xlarge | 8 | 1 | 32 | 23 | 1335 | 499 | 21.62
trt-ig | TensorRT | No | 2 | ml.g4dn.2xlarge | 8 | 1 | 32 | 23 | 1363 | 1017 | 21.62
trt-db-ig | TensorRT | Yes | 2 | ml.g4dn.2xlarge | 8 | 1 | 32 | 22 | 1369 | 963 | 20.68
pt-base | NA | No | NA | ml.g4dn.12xlarge | 48 | 4 | 192 | 15 | 2138 | 906 | 73.35
pt-db | NA | Yes | NA | ml.g4dn.12xlarge | 48 | 4 | 192 | 15 | 2110 | 907 | 73.35
pt-ig | NA | No | 2 | ml.g4dn.12xlarge | 48 | 4 | 192 | 8 | 3862 | 651 | 39.12
pt-ig-db | NA | Yes | 2 | ml.g4dn.12xlarge | 48 | 4 | 192 | 8 | 3822 | 642 | 39.12
trt-base | TensorRT | No | NA | ml.g4dn.12xlarge | 48 | 4 | 192 | 11 | 2892 | 279 | 53.79
trt-db | TensorRT | Yes | NA | ml.g4dn.12xlarge | 48 | 4 | 192 | 6 | 5356 | 278 | 29.34
trt-ig | TensorRT | No | 2 | ml.g4dn.12xlarge | 48 | 4 | 192 | 6 | 5210 | 328 | 29.34
trt-db-ig | TensorRT | Yes | 2 | ml.g4dn.12xlarge | 48 | 4 | 192 | 6 | 5235 | 439 | 29.34
[1] Initial instance count in the above table is the recommended number of instances to use with an autoscaling policy to maintain the throughput and latency requirements for your workload.
[2] Cost per hour in the above table is calculated based on the Initial instance count and price for the instance type.
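The instance count and cost columns can be reproduced with simple arithmetic. The following sketch rests on two assumptions that are inferred from the table rather than stated in the post: the initial instance count appears to equal the 30,000 invocations-per-minute target divided by the per-instance throughput, rounded up, and the hourly prices are the ones implied by dividing cost per hour by instance count (they are not taken from a price list and may not match current pricing).

```python
import math

TARGET_INVOCATIONS_PER_MIN = 30_000  # 500 invocations per second

# Hourly prices implied by the table (cost per hour / initial instance count).
IMPLIED_PRICE_PER_HOUR = {
    "ml.g4dn.xlarge": 0.7364,
    "ml.g4dn.2xlarge": 0.94,
    "ml.g4dn.12xlarge": 4.89,
}


def instances_needed(invocations_per_min_per_instance: int) -> int:
    """Smallest instance count that covers the target invocation rate."""
    return math.ceil(TARGET_INVOCATIONS_PER_MIN / invocations_per_min_per_instance)


def cost_per_hour(instance_type: str, instance_count: int) -> float:
    """Footnote [2]: instance count multiplied by the hourly instance price."""
    return round(instance_count * IMPLIED_PRICE_PER_HOUR[instance_type], 4)


for name, instance_type, throughput in [
    ("pt-base", "ml.g4dn.xlarge", 490),
    ("trt-db-ig", "ml.g4dn.xlarge", 3192),
    ("trt-db", "ml.g4dn.12xlarge", 5356),
]:
    count = instances_needed(throughput)
    print(name, instance_type, count, cost_per_hour(instance_type, count))
```

Seen this way, every gain in per-instance throughput translates directly into fewer instances and a lower hourly cost for the same workload.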

Results mostly validate the impact that was expected of different performance optimization features:

  • TensorRT compilation has the most reliable impact across all instance types. Invocations per minute per instance increased by 30–35%, with a consistent cost reduction of approximately 25% when comparing the TensorRT engine (trt-base) with the default PyTorch BERT model (pt-base). The other performance tuning features we tested build on and compound this gain.
  • Loading two models on each GPU (instance group) almost doubled throughput. Invocations per minute per instance increased approximately 80–90%, yielding a cost reduction in the 50% range, almost as if we were using two GPUs. In fact, Amazon CloudWatch metrics for our experiments on g4dn.2xlarge (as an example) confirm that both CPU and GPU utilization double when we configure an instance group of two models.
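As a quick check of the TensorRT figures, the following sketch recomputes the relative changes between pt-base and trt-base from the values in the results table. The helper pct_change is a hypothetical name introduced only for this illustration.

```python
# (invocations per min per instance, cost per hour) copied from the results table.
results = {
    "ml.g4dn.xlarge": {"pt-base": (490, 45.6568), "trt-base": (643, 34.6108)},
    "ml.g4dn.2xlarge": {"pt-base": (544, 52.64), "trt-base": (718, 39.48)},
    "ml.g4dn.12xlarge": {"pt-base": (2138, 73.35), "trt-base": (2892, 53.79)},
}


def pct_change(old: float, new: float) -> float:
    """Relative change from old to new, in percent."""
    return (new - old) / old * 100.0


summary = {}
for instance_type, rows in results.items():
    base_throughput, base_cost = rows["pt-base"]
    trt_throughput, trt_cost = rows["trt-base"]
    summary[instance_type] = (
        pct_change(base_throughput, trt_throughput),
        pct_change(base_cost, trt_cost),
    )
    throughput_gain, cost_delta = summary[instance_type]
    print(f"{instance_type}: throughput {throughput_gain:+.1f}%, cost {cost_delta:+.1f}%")
```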

Further performance and cost-optimization tips

The benchmark presented in this post just scratched the surface of the possible features and techniques that you can use with Triton to improve inference performance. These range from data preprocessing techniques, such as sending binary payloads to the model server or payloads with bigger batches, to native Triton features, such as the following:

  • Model warmup, which prevents initial, slow inference requests by completely initializing the model before the first inference request is received.
  • Response cache, which caches repeated requests.
  • Model ensembling, which enables you to create a pipeline of one or more models and the connection of input and output tensors between those models. This opens the possibility of adding preprocessing and postprocessing steps, or even inference with other models, to the processing flow for each request.

We expect to test and benchmark these techniques and features in a future post, so stay tuned!

Conclusion

In this post, we explored a few parameters that you can use to maximize the performance of your SageMaker real-time endpoint for serving PyTorch BERT models with Triton Inference Server. We used SageMaker Inference Recommender to perform the benchmarking tests to fine-tune these parameters. These parameters are in essence related to TensorRT-based model optimization, leading to almost 50% improvement in response times compared to the non-optimized version. Additionally, running models concurrently and using Triton’s dynamic batching led to almost a 70% increase in throughput. Fine-tuning these parameters led to an overall reduction of inference cost as well.

The best way to derive the correct values is through experimentation. However, to start building empirical knowledge on performance tuning and optimization, you can observe the combinations of different Triton-related parameters and their effect on performance across ML models and SageMaker ML instances.

SageMaker provides the tools to remove the undifferentiated heavy lifting from each stage of the ML lifecycle, thereby facilitating the rapid experimentation and exploration needed to fully optimize your model deployments.

You can find the notebook used for load testing and deployment on GitHub. You can update Triton configurations and SageMaker Inference Recommender settings to best fit your use case to achieve cost-effective and best-performing inference workloads.


About the Authors

Vikram Elango is an AI/ML Specialist Solutions Architect at Amazon Web Services, based in Virginia, USA. Vikram helps financial and insurance industry customers with design and thought leadership to build and deploy machine learning applications at scale. He is currently focused on natural language processing, responsible AI, inference optimization, and scaling ML across the enterprise. In his spare time, he enjoys traveling, hiking, cooking, and camping with his family.

João Moura is an AI/ML Specialist Solutions Architect at Amazon Web Services. He mostly focuses on NLP use-cases and helping customers optimize Deep Learning model training and deployment. He is also an active proponent of low-code ML solutions and ML-specialized hardware.

Mohan Gandhi is a Senior Software Engineer at AWS. He has been with AWS for the last 9 years and has worked on various AWS services like EMR, EFA and RDS on Outposts. Currently, he is focused on improving the SageMaker Inference Experience. In his spare time, he enjoys hiking and running marathons.

Dhawal Patel is a Principal Machine Learning Architect at AWS. He has worked with organizations ranging from large enterprises to mid-sized startups on problems related to distributed computing, and Artificial Intelligence. He focuses on Deep learning including NLP and Computer Vision domains. He helps customers achieve high performance model inference on SageMaker.

Santosh Bhavani is a Senior Technical Product Manager with the Amazon SageMaker Elastic Inference team. He focuses on helping SageMaker customers accelerate model inference and deployment. In his spare time, he enjoys traveling, playing tennis, and drinking lots of Pu’er tea.

Jiahong Liu is a Solution Architect on the Cloud Service Provider team at NVIDIA. He assists clients in adopting machine learning and AI solutions that leverage NVIDIA accelerated computing to address their training and inference challenges. In his leisure time, he enjoys origami, DIY projects, and playing basketball.


Build a corporate credit ratings classifier using graph machine learning in Amazon SageMaker JumpStart

Today, we’re releasing a new solution for financial graph machine learning (ML) in Amazon SageMaker JumpStart. JumpStart helps you quickly get started with ML and provides a set of solutions for the most common use cases that can be trained and deployed with just a few clicks.

The new JumpStart solution (Graph-Based Credit Scoring) demonstrates how to construct a corporate network from SEC filings (long-form text data), combine this with financial ratios (tabular data), and use graph neural networks (GNNs) to build credit rating prediction models. In this post, we explain how you can use this fully customizable solution for credit scoring, so you can accelerate your graph ML journey. Graph ML is becoming a fruitful area for financial ML because it enables the use of network data in conjunction with traditional tabular datasets. For more information, see Amazon at WSDM: The future of graph neural networks.

Solution overview

You can improve credit scoring by exploiting data on business linkages, for which you may construct a graph, denoted as CorpNet (short for corporate network) in this solution. You can then apply graph ML classification using GNNs on this graph and a tabular feature set for the nodes, to see if you can build a better ML model by further exploiting the information in network relationships. Therefore, this solution offers a template for business models that exploit network data, such as using supply chain relationship graphs, social network graphs, and more.

The solution develops several new artifacts by constructing a corporate network and generating synthetic financial data, and combines both forms of data to create models using graph ML.

The solution shows how to construct a network of connected companies using the MD&A section from SEC 10-K/Q filings. Companies with similar forward-looking statements are likely to be connected for credit events. These connections are represented in a graph. For graph node features, the solution uses the variables in the Altman Z-score model and the industry category of each firm. These are provided in a synthetic dataset made available for demonstration purposes. The graph data and tabular data are used to fit a rating classifier using GNNs. For illustrative purposes, we compare the performance of models with and without the graph information.

Use the Graph-Based Credit Scoring solution

To start using JumpStart, see Getting started with Amazon SageMaker. The JumpStart card for the Graph-Based Credit Scoring solution is available through Amazon SageMaker Studio.

  1. Choose the model card, then choose Launch to initiate the solution.

The solution generates a model for inference and an endpoint to use with a notebook.

  2. Wait until they’re ready and the status shows as Complete.
  3. Choose Open Notebook to open the first notebook, which is for training and endpoint deployment.

You can work through this notebook to learn how to use this solution and then modify it for other applications on your own data. The solution comes with synthetic data and uses a subset of it to exemplify the steps needed to train the model, deploy it to an endpoint, and then invoke the endpoint for inference. The notebook also contains code to deploy an endpoint of your own.

  4. To open the second notebook (used for inference), choose Use Endpoint in Notebook next to the endpoint artifact.

In this notebook, you can see how to prepare the data to invoke the example endpoint to perform inference on a batch of examples.

The endpoint returns predicted ratings, which are used to assess model performance, as shown in the following screenshot of the last code block of the inference notebook.

You can use this solution as a template for a graph-enhanced credit rating model. You’re not restricted to the feature set in this example—you can change both the graph data and tabular data for your own use case. The extent of code changes required is minimal. We recommend working through our template example to understand the structure of the solution, and then modify it as needed.

This solution is for demonstrative purposes only. It is not financial advice and should not be relied on as financial or investment advice. The associated notebooks, including the trained model, use synthetic data, and are not intended for production use. Although text from SEC filings is used, the financial data is synthetically and randomly generated and has no relation to any company’s true financials. Therefore, the synthetically generated ratings also don’t have any relation to any real company’s true rating.

Data used in the solution

The dataset has synthetic tabular data such as various accounting ratios (numerical) and industry codes (categorical). The dataset has N = 3,286 rows. Rating labels are also added. These are the node features to be used with graph ML.

The dataset also contains a corporate graph, which is undirected and unweighted. This solution allows you to adjust the structure of the graph by varying the way in which links are included. Each company in the tabular dataset is represented by a node in the corporate graph. The function construct_network_data() helps construct the graph, which comprises lists of source nodes and destination nodes.

Rating labels are used for classification using GNNs, which can be multi-category for all ratings or binary, divided between investment grade (AAA, AA, A, BBB) and non-investment grade (BB, B, CCC, CC, C, D). D here stands for defaulted.
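The binary split can be expressed as a small mapping. The following is a minimal sketch; the function name to_binary_label and the encoding (1 for investment grade, 0 for non-investment grade) are assumptions for illustration and may differ from the encoding used in the solution notebook.

```python
INVESTMENT_GRADE = {"AAA", "AA", "A", "BBB"}
NON_INVESTMENT_GRADE = {"BB", "B", "CCC", "CC", "C", "D"}  # D = defaulted


def to_binary_label(rating: str) -> int:
    """Return 1 for investment grade and 0 for non-investment grade."""
    if rating in INVESTMENT_GRADE:
        return 1
    if rating in NON_INVESTMENT_GRADE:
        return 0
    raise ValueError(f"Unknown rating: {rating!r}")


labels = [to_binary_label(r) for r in ["AAA", "BBB", "BB", "D"]]
print(labels)
```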

The complete code to read in the data and run the solution is provided in the solution notebook. The following screenshot shows the structure of the synthetic tabular data.

The graph information is passed in to the Deep Graph Library and combined with the tabular data to undertake graph ML. If you bring your own graph, simply supply it as a set of source nodes and destination nodes.
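As an illustration of this format, the following sketch turns a list of undirected company links into source and destination node lists. It is not the solution's construct_network_data() function: the helper to_edge_lists is hypothetical, and storing each undirected link in both directions is an assumption about how you might represent an undirected graph. In the Deep Graph Library, lists like these are typically passed to dgl.graph((src, dst)).

```python
from typing import Iterable, List, Tuple


def to_edge_lists(pairs: Iterable[Tuple[int, int]]) -> Tuple[List[int], List[int]]:
    """Expand undirected links into source and destination node lists."""
    src: List[int] = []
    dst: List[int] = []
    # Deduplicate links so that (0, 1) and (1, 0) count as one connection.
    unique_links = sorted({tuple(sorted(pair)) for pair in pairs})
    for u, v in unique_links:
        if u == v:
            continue  # skip self-links
        # Store each undirected link in both directions.
        src += [u, v]
        dst += [v, u]
    return src, dst


# Three companies: 0 and 1 are linked (listed twice), and 1 and 2 are linked.
src_nodes, dst_nodes = to_edge_lists([(0, 1), (1, 0), (1, 2)])
print(src_nodes, dst_nodes)
```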

Model training

For comparison, we first train a model only on tabular data using AutoGluon, mimicking the traditional approach to credit rating of companies. We then add in the graph data and use GNNs for training. Full details are provided in the notebook, and a brief overview is offered in this post. The notebook also offers a quick overview of graph ML with selected references.

Training the GNN is undertaken as follows. We use an adaptation of the GraphSAGE model implemented in the Deep Graph Library.

  1. Read in graph data from Amazon Simple Storage Service (Amazon S3) and create the source and destination node lists for CorpNet.
  2. Read in the graph node feature sets (train and test). Normalize the data as required.
  3. Set tunable hyperparameters. Call the specialized graph ML container running PyTorch to fit the GNN without hyperparameter optimization (HPO).
  4. Repeat graph ML with HPO.

To make implementation straightforward and stable, we run model training in a container using the following code (the setup code prior to this training code is in the solution notebook):

from sagemaker.pytorch import PyTorch
from time import strftime, gmtime

training_job_name = sagemaker_config["SolutionPrefix"] + "-gcn-training"
print(
    "You can go to SageMaker -> Training -> Training jobs -> "
    f"a job name started with {training_job_name} to monitor training job "
    "status and details."
)

estimator = PyTorch(
    entry_point='train_dgl_pytorch_entry_point.py',
    source_dir='graph_convolutional_network',
    role=role, 
    instance_count=1, 
    instance_type='ml.g4dn.xlarge',
    framework_version="1.9.0",
    py_version='py38',
    hyperparameters=hyperparameters,
    output_path=output_location,
    code_location=output_location,
    sagemaker_session=sess,
    base_job_name=training_job_name,
)

estimator.fit({'train': input_location})

The current training process is undertaken in a transductive setting, where the features of the test dataset (not including the target column) are used to construct the graph and therefore the test nodes are included in the training process. At the end of training, the predictions on the test dataset are generated and saved in output_location in the S3 bucket.

Even though the training is transductive, the labels of the test dataset aren’t used for training, and our exercise is aimed at predicting these labels using node embeddings for the test dataset nodes. An important feature of GraphSAGE is that inductive learning on new observations that aren’t part of the graph is also possible, though not exploited in this solution.

Hyperparameter optimization

This solution is further extended by conducting HPO on the GNN. This is done within SageMaker. See the following code:

from sagemaker.tuner import (
    IntegerParameter,
    CategoricalParameter,
    ContinuousParameter,
    HyperparameterTuner,
)

# Static hyperparameters we do not tune
hyperparameters = {
    "n-layers": 2,
    "aggregator-type": "pool",
    "target-column": target_column
}
# Dynamic hyperparameters to tune and their searching ranges. 
# For demonstration purpose, we skip the architecture search by skipping 
# tuning the hyperparameters such as 'skip_rnn_num_layers', 'rnn_num_layers', etc.
hyperparameter_ranges = {
    "n-hidden": CategoricalParameter([32, 64, 128, 256, 512, 1024]),
    'dropout': ContinuousParameter(0.0, 0.6),
    'weight-decay': ContinuousParameter(1e-5, 1e-2),
    'n-epochs': IntegerParameter(70, 120), #80, 160
    'lr': ContinuousParameter(0.002, 0.02),
}

We then set up the training objective, to maximize the F1 score in this case:

objective_metric_name = "Validation F1"
metric_definitions = [{"Name": "Validation F1", "Regex": r"Validation F1 (\S+)"}]
objective_type = "Maximize"
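To see how the metric definition works, the following sketch applies the same regular expression to a sample log line with Python's re module. The log line is hypothetical; the actual format depends on what the training script prints.

```python
import re

# Same pattern as in metric_definitions: capture the token after "Validation F1".
metric_regex = r"Validation F1 (\S+)"

# Hypothetical line, as it might appear in the training job logs.
log_line = "Epoch 00042 | Validation F1 0.8123"

match = re.search(metric_regex, log_line)
validation_f1 = float(match.group(1)) if match else None
print(validation_f1)
```

SageMaker scans the training logs with this pattern and uses the captured value as the objective metric for the tuning job.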

Establish the chosen environment and training resources on SageMaker:

estimator_tuning = PyTorch(
    entry_point='train_dgl_pytorch_entry_point.py',
    source_dir='graph_convolutional_network',
    role=role, 
    instance_count=1, 
    instance_type='ml.g4dn.xlarge',
    framework_version="1.9.0",
    py_version='py38',
    hyperparameters=hyperparameters,
    output_path=output_location,
    code_location=output_location,
    sagemaker_session=sess,
    base_job_name=training_job_name,
)

Finally, run the training job with hyperparameter optimization:

import time

tuning_job_name = sagemaker_config["SolutionPrefix"] + "-gcn-hpo"
print(
    f"You can go to SageMaker -> Training -> Hyperparameter tuning jobs -> a job name started with {tuning_job_name} to monitor HPO tuning status and details.\n"
    "Note. You will be unable to successfully run the following cells until the tuning job completes. This step may take around 2 hours."
)

tuner = HyperparameterTuner(
    estimator_tuning,  # using the estimator defined in previous section
    objective_metric_name,
    hyperparameter_ranges,
    metric_definitions,
    max_jobs=30,
    max_parallel_jobs=10,
    objective_type=objective_type,
    base_tuning_job_name=tuning_job_name,
)

start_time = time.time()

tuner.fit({'train': input_location})

hpo_training_job_time_duration = time.time() - start_time

Results

The inclusion of network data and hyperparameter optimization yields improved results. The performance metrics in the following table demonstrate the benefit of adding in CorpNet to standard tabular datasets used for credit scoring.

The results for AutoGluon don’t use the graph, only the tabular data. When we add in the graph data and use HPO, we get a material gain in performance.

Model | F1 Score | ROC AUC | Accuracy | MCC | Balanced Accuracy | Precision | Recall
AutoGluon | 0.72 | 0.74323 | 0.68037 | 0.35233 | 0.67323 | 0.68528 | 0.75843
GCN Without HPO | 0.64 | 0.84498 | 0.69406 | 0.45619 | 0.71154 | 0.88177 | 0.50281
GCN With HPO | 0.81 | 0.87116 | 0.78082 | 0.563 | 0.77081 | 0.75119 | 0.89045

(Note: MCC is the Matthews Correlation Coefficient; https://en.wikipedia.org/wiki/Phi_coefficient.)
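The F1 scores in the table follow from the reported precision and recall, because F1 is their harmonic mean. The following sketch recomputes them as a sanity check; the helper name f1_score is introduced here for illustration and isn't part of the solution code.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) copied from the results table.
reported = {
    "AutoGluon": (0.68528, 0.75843, 0.72),
    "GCN Without HPO": (0.88177, 0.50281, 0.64),
    "GCN With HPO": (0.75119, 0.89045, 0.81),
}

recomputed = {}
for model, (precision, recall, reported_f1) in reported.items():
    recomputed[model] = round(f1_score(precision, recall), 2)
    print(f"{model}: recomputed F1 = {recomputed[model]:.2f}, reported F1 = {reported_f1:.2f}")
```

The comparison also shows why the GCN without HPO has a lower F1 score than AutoGluon despite its higher precision: its recall is much lower, and the harmonic mean penalizes that imbalance.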

Clean up

After you’re done using this notebook, delete the model artifacts and other resources to avoid incurring further charges. You need to manually delete resources that you may have created while running the notebook, such as S3 buckets for model artifacts, training datasets, processing artifacts, and Amazon CloudWatch log groups.

Summary

In this post, we introduced a graph-based credit scoring solution in JumpStart to help you accelerate your graph ML journey. The notebook provides a pipeline that you can modify to exploit graphs alongside existing tabular models and obtain better performance.

To get started, you can find the Graph-Based Credit Scoring solution in JumpStart in SageMaker Studio.


About the Authors

Dr. Sanjiv Das is an Amazon Scholar and the Terry Professor of Finance and Data Science at Santa Clara University. He holds post-graduate degrees in Finance (M.Phil and Ph.D. from New York University) and Computer Science (M.S. from UC Berkeley), and an MBA from the Indian Institute of Management, Ahmedabad. Prior to being an academic, he worked in the derivatives business in the Asia-Pacific region as a Vice President at Citibank. He works on multimodal machine learning in the area of financial applications.

Dr. Xin Huang is an Applied Scientist for Amazon SageMaker JumpStart and Amazon SageMaker built-in algorithms. He focuses on developing scalable machine learning algorithms. His research interests are in the areas of natural language processing, deep learning on tabular data, and robust analysis of non-parametric space-time clustering.

Soji Adeshina is an Applied Scientist at AWS, where he develops graph neural network-based models for machine learning on graphs tasks with applications to fraud and abuse, knowledge graphs, recommender systems, and life sciences. In his spare time, he enjoys reading and cooking.

Patrick Yang is a Software Development Engineer at Amazon SageMaker. He focuses on building machine learning tools and products for customers.


Increase your content reach with automated document-to-speech conversion using Amazon AI services

Reading the printed word opens up a world of information, imagination, and creativity. However, scanned books and documents may be difficult for people with vision impairment and learning disabilities to consume. In addition, some people prefer to listen to text-based content versus reading it. A document-to-speech solution extends the reach of digital content by giving text content a voice. It has uses across different industry sectors, such as:

  • Entertainment – You can create your own audiobooks.
  • Education – Students can convert their lecture notes to speech and access them anywhere.
  • Patient care – Dosage instructions and precautions are typically in small fonts and hard to read. With this solution, you could take a picture, convert to speech, and listen to the instructions in order to avoid potential harm.

The document-to-speech solution converts scanned books or documents taken on a mobile phone or handheld device automatically to speech. This solution extends the capabilities of Amazon Polly. We extract text from scanned documents using Amazon Textract, and then convert the text to speech using Amazon Polly. Solution benefits include mobility and freedom for the user plus enhanced learning capabilities for early readers.

The idea originated from a favorite parent-child activity of Harry Pan, one of the blog authors: reading books. “My son enjoys storybooks, but is too young to read on his own. I love reading to him, but sometimes I need to work or tend to household chores. This sparked an idea to build a document-to-speech solution that could read to him when I was busy.”

Overview of solution

The solution is an event-driven serverless architecture that uses Amazon AI services to convert scanned documents to speech. Amazon Textract and Amazon Polly belong to the topmost layer of the AWS machine learning (ML) stack. These services allow developers to easily add intelligence to any application without prior ML knowledge.

Amazon Textract is an ML service that automatically extracts text, handwriting, and data from scanned documents. It goes beyond simple optical character recognition (OCR) to identify, understand, and extract data from forms and tables. Amazon Textract uses ML to read and process any type of document, accurately extracting text, handwriting, tables, and other data without any manual effort.
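
In this solution, Amazon Textract is called asynchronously and reports completion through Amazon SNS. The following is a minimal sketch of that pattern, not the repository's actual Lambda code. The function names are hypothetical, and the client is passed in as a parameter (for example, `boto3.client("textract")`) so that the logic is easy to test.

```python
def start_text_detection(textract, bucket, key, sns_topic_arn, role_arn):
    """Start an asynchronous text detection job and return its job ID."""
    response = textract.start_document_text_detection(
        DocumentLocation={"S3Object": {"Bucket": bucket, "Name": key}},
        # Textract publishes a completion message to this SNS topic.
        NotificationChannel={"SNSTopicArn": sns_topic_arn, "RoleArn": role_arn},
    )
    return response["JobId"]


def collect_text(textract, job_id):
    """Join all detected lines of a finished job, following pagination.

    Assumes the job has already succeeded (for example, because the
    SNS completion notification has been received).
    """
    lines = []
    next_token = None
    while True:
        kwargs = {"JobId": job_id}
        if next_token:
            kwargs["NextToken"] = next_token
        page = textract.get_document_text_detection(**kwargs)
        # Keep LINE blocks only; PAGE and WORD blocks would duplicate the text.
        lines.extend(
            block["Text"]
            for block in page.get("Blocks", [])
            if block["BlockType"] == "LINE"
        )
        next_token = page.get("NextToken")
        if not next_token:
            return "\n".join(lines)


# Usage (requires boto3 and AWS credentials; all names below are placeholders):
# import boto3
# textract = boto3.client("textract")
# job_id = start_text_detection(textract, "my-bucket", "scan.png", topic_arn, role_arn)
```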

Amazon Polly is a text-to-speech service that turns text into lifelike speech, allowing you to create applications that talk and to build entirely new categories of speech-enabled products. Amazon Polly uses advanced deep learning technologies to synthesize speech that sounds like a human voice.
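
For longer texts, Amazon Polly offers asynchronous synthesis tasks that write the audio to Amazon S3 and can notify an SNS topic on completion, which is the pattern this solution relies on. The sketch below illustrates the call under stated assumptions: the function name, the MP3 output format, and the default voice are illustrative choices, not the repository's settings, and the client is injected (for example, `boto3.client("polly")`).

```python
def start_speech_task(polly, text, output_bucket, sns_topic_arn, voice_id="Joanna"):
    """Start an asynchronous speech synthesis task.

    Returns the task ID and the S3 URI where the audio will be written.
    """
    response = polly.start_speech_synthesis_task(
        Text=text,
        OutputFormat="mp3",
        OutputS3BucketName=output_bucket,
        SnsTopicArn=sns_topic_arn,  # Polly publishes a message here when done
        VoiceId=voice_id,
    )
    task = response["SynthesisTask"]
    return task["TaskId"], task["OutputUri"]


# Usage (requires boto3 and AWS credentials; names are placeholders):
# import boto3
# task_id, audio_uri = start_speech_task(
#     boto3.client("polly"), "Once upon a time", "my-audio-bucket", topic_arn
# )
```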

There are significant advantages of using Amazon AI services:

  • They take little effort; you can integrate these APIs into any application
  • They offer highly scalable and cost-effective solutions
  • Your organization can shift its focus from development of custom models to business outcomes

The solution also uses Amazon API Gateway to quickly stand up APIs that the web UI can invoke to perform operations like uploading documents and converting scanned documents to speech. API Gateway provides a scalable way to create, publish, and maintain secure APIs. In this solution, we also use API Gateway WebSocket support to establish a persistent connection between the web UI and the backend, so the backend can keep sending progress updates to the user in real time.
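
To illustrate how a backend function can push a progress update over an API Gateway WebSocket connection, here is a minimal sketch. The message fields (`stage`, `percent`) and the function name are hypothetical; the repository may use a different schema. The client is assumed to be a `boto3.client("apigatewaymanagementapi", endpoint_url=...)` pointed at the WebSocket API's connection endpoint.

```python
import json


def send_progress(api_client, connection_id, stage, percent):
    """Send a JSON progress message to one connected WebSocket client."""
    payload = json.dumps({"stage": stage, "percent": percent}).encode("utf-8")
    # post_to_connection raises GoneException if the client has disconnected;
    # a production handler would likely catch that and stop sending updates.
    api_client.post_to_connection(ConnectionId=connection_id, Data=payload)
    return payload


# Usage (requires boto3; the endpoint URL and connection ID are placeholders):
# import boto3
# client = boto3.client(
#     "apigatewaymanagementapi",
#     endpoint_url="https://<api-id>.execute-api.<region>.amazonaws.com/<stage>",
# )
# send_progress(client, connection_id, "converting-to-text", 25)
```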

We use AWS Lambda functions to trigger Amazon Textract and Amazon Polly asynchronous jobs. Lambda is a highly available and scalable compute service that lets you run code without provisioning resources.

We use an AWS Step Functions state machine to orchestrate two parallel Lambda functions – one to moderate text and the other to store text in Amazon Simple Storage Service (Amazon S3). Step Functions is a serverless orchestration service to define application workflows as a series of event-driven steps.
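
The workflow can be expressed in Amazon States Language (ASL). The sketch below builds a definition with a Parallel state for the moderate-text and store-text functions, preceded and followed by the retrieve-text and convert-text-to-audio steps described later in this post. The state names and the ARN mapping are assumptions for illustration; the repository defines its state machine through AWS CDK and may structure it differently.

```python
import json


def build_state_machine_definition(lambda_arns):
    """Return an ASL definition as a dict, given a mapping of function name to ARN."""

    def task(function_name, next_state=None):
        state = {"Type": "Task", "Resource": lambda_arns[function_name]}
        if next_state:
            state["Next"] = next_state
        else:
            state["End"] = True
        return state

    return {
        "StartAt": "RetrieveText",
        "States": {
            "RetrieveText": task("retrieve-text", next_state="ModerateAndStore"),
            "ModerateAndStore": {
                "Type": "Parallel",
                # If any branch fails, the whole Parallel state fails, which is
                # how a moderation error can stop further processing.
                "Branches": [
                    {
                        "StartAt": "ModerateText",
                        "States": {"ModerateText": task("moderate-text")},
                    },
                    {
                        "StartAt": "StoreText",
                        "States": {"StoreText": task("store-text")},
                    },
                ],
                "Next": "ConvertTextToAudio",
            },
            "ConvertTextToAudio": task("convert-text-to-audio"),
        },
    }


# Placeholder ARNs, for illustration only.
names = ["retrieve-text", "moderate-text", "store-text", "convert-text-to-audio"]
arns = {n: f"arn:aws:lambda:us-east-1:123456789012:function:{n}" for n in names}
print(json.dumps(build_state_machine_definition(arns), indent=2))
```

Note that the output of a Parallel state is an array with one entry per branch, so the step that follows it needs to pick the converted text out of that array.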

Architecture and code

As described in the previous section, we use two key AI services, Amazon Textract and Amazon Polly, to build a document-to-speech conversion solution. One additional service that we haven’t touched upon is AWS Amplify. Amplify allows front-end developers to quickly build extensible, full stack web and mobile apps. With Amplify, you can easily configure a backend, connect an application to it within minutes, and scale effortlessly. We use Amplify to host a web UI that allows users to upload their scanned documents.

You can also use your own UI without Amplify. As we dive deep into this solution, we show how you can use any client application to connect to the backend to convert documents to speech – as long as they support REST and WebSocket APIs. The web UI here is simply to demonstrate key features of this solution. As of this writing, the solution supports JPEG, PNG, and PDF input formats, and the English language.

The following diagram illustrates the solution architecture.

We walk through this architecture by following the path of a single user request:

  1. The user visits the web UI hosted on Amplify. The UI code is the index.html file in the client folder of the code repository.
  2. The user chooses a JPG, PDF, or PNG file to upload using the web UI.
  3. The user initiates the Convert & Play Audio process from the web UI, which uploads the input file to an S3 bucket, through a REST API hosted on API Gateway.
  4. When the upload is complete, the document-to-speech conversion starts as a background process:
    1. During the conversion, the web client keeps a persistent WebSocket connection with the API Gateway. This allows the backend processes (Lambda functions) to continuously send progress updates to the web client.
    2. The request goes through the API Gateway and triggers the Lambda function convert-images-to-text. This function calls Amazon Textract asynchronously to convert the document to text.
    3. When the image-to-text conversion is complete, Amazon Textract sends a notification to Amazon Simple Notification Service (Amazon SNS).
    4. The notification triggers the Lambda function on-textract-ready, which kicks off a Step Functions state machine.
    5. The state machine orchestrates the following steps:
      1. It runs the Lambda function retrieve-text to obtain the converted text from Amazon Textract.
      2. It then runs Lambda functions moderate-text and store-text in parallel. moderate-text stops further processing when undesirable words are detected, and store-text stores a copy of the converted text to an S3 bucket.
      3. After the parallel steps are complete, the state machine runs the Lambda function convert-text-to-audio, which invokes Amazon Polly asynchronously with the converted text, for speech conversion. The state machine finishes after this step.
    6. Similar to Amazon Textract, Amazon Polly sends a notification to Amazon SNS when the job is done. The notification triggers the Lambda function on-polly-ready, which sends a final message to the web UI along with the Amazon S3 location of the converted audio file.
  5. The web UI downloads the final converted audio file from Amazon S3 via a REST API, and then plays it for the user.
  6. The application uses an Amazon DynamoDB table to track job information such as Amazon Textract job ID, Amazon Polly job ID, and more.
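
The hand-off from the Amazon Textract notification to the state machine (the step handled by on-textract-ready) can be sketched as follows. This is an illustrative handler under stated assumptions, not the repository's code: the function signature, the injected Step Functions client (for example, `boto3.client("stepfunctions")`), and the `textract_job_id` input key are hypothetical.

```python
import json


def on_textract_ready(event, sfn, state_machine_arn):
    """Start one state machine execution per successful Textract job.

    `event` is the SNS event that Lambda receives; each record carries the
    Textract completion message as a JSON string.
    """
    started = []
    for record in event["Records"]:
        message = json.loads(record["Sns"]["Message"])
        if message.get("Status") != "SUCCEEDED":
            continue  # skip failed jobs; a real handler might notify the user
        response = sfn.start_execution(
            stateMachineArn=state_machine_arn,
            input=json.dumps({"textract_job_id": message["JobId"]}),
        )
        started.append(response["executionArn"])
    return started
```

Passing only the job ID keeps the execution input small; the retrieve-text step can then fetch the full text from Amazon Textract itself.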

The code is hosted on GitHub and is deployed using AWS Cloud Development Kit (AWS CDK), an open-source software development framework to define cloud application resources using familiar programming languages. AWS CDK provisions resources in a repeatable manner through AWS CloudFormation.

Prerequisites

The only prerequisite to deploy this solution is an AWS account.

Deploy the solution

The following steps detail how to deploy the application:

  1. Sign in to your AWS account.
  2. On the AWS Cloud9 console, open an existing environment, or choose Create environment to create a new one.
  3. In your AWS Cloud9 IDE, on the Window menu, choose New Terminal to open a terminal.

All the following steps are done in the same terminal.

  4. Clone the git repository and enter the project directory:
     git clone --depth 1 https://github.com/aws-samples/scanned-documents-to-speech.git
     cd scanned-documents-to-speech
  5. Create a Python virtual environment:
     python3 -m venv .venv
  6. After the virtual environment is created, activate it:
     source .venv/bin/activate
  7. After the virtual environment is activated, install the required dependencies:
     pip install -r requirements.txt
  8. Synthesize the CloudFormation templates from the AWS CDK code:
     cdk synth
  9. Deploy the AWS CDK application and capture the AWS CDK outputs needed later:
     cdk deploy --all --outputs-file cdk-outputs.json

You must confirm the changes to be deployed for each stack. You can check the stack creation progress on the AWS CloudFormation console.

  10. To visit the web client, run the following command and follow its output to kick off front-end deployment and use the web client:
     ./extract-cdk-outputs.py cdk-outputs.json

Key things to note:

  • The extract-cdk-outputs.py script prints out the URL of the web UI. The script also prints out strings of the S3 bucket name, file API endpoint, and conversion API endpoint, which need to be set on the web UI before uploading a document.
  • You can set the list of undesirable words in a variable in the moderate-text Lambda function.
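
As a rough illustration of the moderation step, the following sketch checks text against a word list and raises an error so that further processing stops. The variable name, the example words, and the exception class are all hypothetical; check the moderate-text function in the repository for the actual variable and behavior.

```python
import re

# Hypothetical word list; the repository keeps its own list in a variable.
UNDESIRABLE_WORDS = {"darn", "heck"}


class UndesirableTextError(Exception):
    """Raised when the text contains a word from the undesirable list."""


def moderate_text(text, banned=UNDESIRABLE_WORDS):
    """Return the text unchanged, or raise if it contains a banned word."""
    # Compare whole lowercase words so "heckle" does not match "heck".
    words = set(re.findall(r"[a-z']+", text.lower()))
    found = sorted(words & set(banned))
    if found:
        raise UndesirableTextError("Undesirable words found: " + ", ".join(found))
    return text
```

Raising an exception, rather than returning a flag, lets the orchestrating workflow treat a moderation hit as a failed step.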

Use the application

The following steps demonstrate how to use the application via the web UI.

  1. Following the last step of the deployment, fill in the fields for S3 Bucket Name, File Endpoint, and Conversion Endpoint in the web UI.
  2. Choose Choose File to upload an input file.
  3. Choose Convert & Play Audio.

The web UI shows the progress of the ongoing conversion.

The web UI plays the audio automatically when the conversion is complete.

Clean up

Run the following command to delete all resources and avoid incurring future charges:

cdk destroy --all

Conclusion

In this post, we demonstrated a solution to quickly deploy a document-to-speech conversion application by using two powerful AI services: Amazon Textract and Amazon Polly. We showed how the solution works and provided a detailed walkthrough of the code and deployment steps. This solution is meant to be a reference architecture or quick start that you can further enhance. Notably, you could add support for more human languages, add a queue for buffering incoming requests, and authenticate users.

As discussed in this post, we see multiple use cases for this solution across different industry verticals. Give it a try and let us know how this solved your use case by leaving feedback in the comments section. You can access the resources for the solution in the document to speech GitHub repository.


About the Authors

Harry Pan is an ISV Solutions Architect at Amazon Web Services based in the San Francisco Bay Area, where he helps software companies achieve their business goals by building well-architected IT systems. He loves spending his spare time with his family, as well as playing tennis, coding in Haskell, and traveling.

Chaitra Mathur is a Principal Solutions Architect at AWS. She guides partners and customers in building highly scalable, reliable, secure, and cost-effective solutions on AWS. In her spare time, she enjoys reading, yoga, and spending time with her daughters.
