How Kakao Games automates lifetime value prediction from game data using Amazon SageMaker and AWS Glue

How Kakao Games automates lifetime value prediction from game data using Amazon SageMaker and AWS Glue

This post is co-written with Suhyoung Kim, General Manager at KakaoGames Data Analytics Lab.

Kakao Games is a top video game publisher and developer headquartered in South Korea. It specializes in developing and publishing games on PC, mobile, and virtual reality (VR) serving globally. In order to maximize its players’ experience and improve the efficiency of operations and marketing, they are continuously adding new in-game items and providing promotions to their players. The result of these events can be evaluated afterwards so that they make better decisions in the future.

However, this approach is reactive. If we can forecast lifetime value (LTV), we can take a proactive approach. In other words, these activities can be planned and run based on the forecasted LTV, which determines the players’ values through their lifetime in the game. With this proactive approach, Kakao Games can launch the right events at the right time. If the forecasted LTV for some players is decreasing, this means that the players are likely to leave soon. Kakao Games can then create a promotional event not to leave the game. This makes it important to accurately forecast the LTV of their players. LTV is the measurement adopted by not only gaming companies but also any kind of service with long-term customer engagement. Statistical methods and machine learning (ML) methods are actively developed and adopted to maximize the LTV.

In this post, we share how Kakao Games and the Amazon Machine Learning Solutions Lab teamed up to build a scalable and reliable LTV prediction solution by using AWS data and ML services such as AWS Glue and Amazon SageMaker.

We chose one of the most popular games of Kakao Games, ODIN, as the target game for the project. ODIN is a popular massively multiplayer online roleplaying game (MMORPG) for PC and mobile devices published and operated by Kakao Games. It was launched in June 2021 and has been ranked within the top three in revenue in Korea.

Kakao Games ODIN

Challenges

In this section, we discuss challenges around various data sources, data drift caused by internal or external events, and solution reusability. These challenges are typically faced when we implement ML solutions and deploy them into a production environment.

Player behavior affected by internal and external events

It’s challenging to forecast the LTV accurately, because there are many dynamic factors affecting player behavior. These include game promotions, newly added items, holidays, banning accounts for abuse or illegal play, or unexpected external events like sport events or severe weather conditions. This means that the model working this month might not work well next month.

We can utilize external events as ML features along with the game-related logs and data. For example, Amazon Forecast supports related time series data like weather, prices, economic indicators, or promotions to reflect internal and external related events. Another approach is to refresh ML models regularly when data drift is observed. For our solution, we chose the latter method because the related event data wasn’t available and we weren’t sure how reliable the existing data was.

Continuous ML model retraining is one method to overcome this challenge by relearning from the most recent data. This requires not only well-designed features and ML architecture, but also data preparation and ML pipelines that can automate the retraining process. Otherwise, the ML solution can’t be efficiently operated in the production environment due to the complexity and poor repeatability.

It’s not sufficient to retrain the model using the latest training dataset. The retrained model might not give a more accurate forecasting result than the existing one, so we can’t simply replace the model with the new one without any evaluation. We need to be able to go back to the previous model if the new model starts to underperform for some reason.

To solve this problem, we had to design a strong data pipeline to create the ML features from the raw data and MLOps.

Multiple data sources

ODIN is an MMORPG where the game players interact with each other, and there are various events such as level-up, item purchase, and gold (game money) hunting. It produces about 300 GB logs every day from its more than 10 million players across the world. The gaming logs are of different types, such as player login, player activity, player purchases, and player level-ups. These types of data are historical raw data from an ML perspective. For example, each log is written in the format of timestamp, user ID, and event information. The interval of logs is not uniform. Also, there is static data describing the players such as their age and registration date, which is non-historical data. LTV prediction modeling requires these two types of data as its input because they complement each other to represent the player’s characteristics and behavior.

For this solution, we decided to define the tabular dataset combining the historical features with the fixed number of aggregated steps along with the static player features. The aggregated historical features are generated through multiple steps from the number of game logs, which are stored in Amazon Athena tables. In addition to the challenge of defining the features for the ML model, it’s critical to automate the feature generation process so that we can get ML features from the raw data for ML inference and model retraining.

To solve this problem, we build an extract, transform, and load (ETL) pipeline that can be run automatically and repeatedly for training and inference dataset creation.

Scalability to other games

Kakao Games has other games with long-term player engagements just like ODIN. Naturally, LTV prediction benefits those games as well. Because most of the games share similar log types, they want to reuse this ML solution to other games. We can fulfill this requirement by using the common log and attributes among different games when we design the ML model. But there is still an engineering challenge. The ETL pipeline, MLOps pipeline, and ML inference should be rebuilt in a different AWS account. Manual deployment of this complex solution isn’t scalable and the deployed solution is hard to maintain.

To solve this problem, we make the ML solution auto-deployable with a few configuration changes.

Solution overview

The ML solution for LTV forecasting is composed of four components: the training dataset ETL pipeline, MLOps pipeline, inference dataset ETL pipeline, and ML batch inference.

The training and inference ETL pipeline creates ML features from the game logs and the player’s metadata stored in Athena tables, and stores the resulting feature data in an Amazon Simple Storage Service (Amazon S3) bucket. ETL requires multiple transformation steps, and the workflow is implemented using AWS Glue. The MLOps trains ML models, evaluates the trained model against the existing model, and then registers the trained model to the model registry if it outperforms the existing model. These are all implemented as a single ML pipeline using Amazon SageMaker Pipelines, and all the ML trainings are managed via Amazon SageMaker Experiments. With SageMaker Experiments, ML engineers can find which training and evaluation datasets, hyperparameters, and configurations were used for each ML model during the training or later. ML engineers no longer need to manage this training metadata separately.

The last component is the ML batch inference, which is run regularly to predict LTV for the next couple of weeks.

The following figure shows how these components work together as a single ML solution.

ML Ops architecture

The solution architecture has been implemented using the AWS Cloud Development Kit (AWS CDK) to promote infrastructure as code (IaC), making it easy to version control and deploy the solution across different AWS accounts and Regions

In the following sections, we discuss each component in more detail.

Data pipeline for ML feature generation

Game logs stored in Athena backed by Amazon S3 go through the ETL pipelines created as Python shell jobs in AWS Glue. It enables running Python scripts with AWS Glue for feature exaction to generate the training-ready dataset. Corresponding tables in each phase are created in Athena. We use AWS Glue for running the ETL pipeline due to its serverless architecture and flexibility in generating different versions of the dataset by passing in various start and end dates. Refer to Accessing parameters using getResolvedOptions to learn more about how to pass the parameters to an AWS Glue job. With this method, the dataset can be created to cover a period of as short as 4 weeks, supporting the game in its early stages. For instance, the input start date and prediction start date for each version of dataset are parsed via the following code:

import sys
from awsglue.utils import getResolvedOptions

args = getResolvedOptions(
    sys.argv,
    [
        'JOB_NAME',
        'db_name',
        'ds_version',
        'input_start_date',
        'prediction_start_date',
        'bucket',
        'prefix',
        'timestamp'
    ]
)

AWS Glue jobs are designed and divided into different stages and triggered sequentially. Each job is configured to take in positional and key-value pair arguments to run customized ETL pipelines. One key parameter is the start and end date of the data that is used in training. This is because the start and end date of data likely span different holidays, and serve as a direct factor in determining the length of dataset. To observe this parameter’s impact on model performances, we created nine different dataset versions (with different start dates and length of training period).

Specifically, we created dataset versions with different start dates (shifted by 4 weeks) and different training periods (12 weeks, 16 weeks, 20 weeks, 24 weeks, and 28 weeks) in nine Athena databases backed by Amazon S3. Each version of the dataset contains the features describing player characteristics and in-game purchase activity time series data.

ML model

We selected AutoGluon for model training implemented with SageMaker pipelines. AutoGluon is a toolkit for automated machine learning (AutoML). It enables easy-to-use and easy-to-extend AutoML with a focus on automated stack ensembling, deep learning, and real-world applications spanning image, text, and tabular data.

You can use AutoGluon standalone to train ML models or in conjunction with Amazon SageMaker Autopilot, a feature of SageMaker that provides a fully managed environment for training and deploying ML models.

In general, you should use AutoGluon with Autopilot if you want to take advantage of the fully managed environment provided by SageMaker, including features such as automatic scaling and resource management, as well as easy deployment of trained models. This can be especially useful if you’re new to ML and want to focus on training and evaluating models without worrying about the underlying infrastructure.

You can also use AutoGluon standalone when you want to train ML models in a customized way. In our case, we used AutoGluon with SageMaker to realize a two-stage prediction, including churn classification and lifetime value regression. In this case, the players that stopped purchasing game items are considered as having churned.

Let’s talk about the modeling approach for LTV prediction and the effectiveness of the model retraining against the data drift symptom, which means the internal or external events that change a player’s purchase pattern.

First, the modeling processes were separated into two stages, including a binary classification (classifying a player as churned or not) and a regression model that was trained to predict the LTV value for non-churned players:

  • Stage 1 – Target values for LTV are converted into a binary label, LTV = 0 and LTV > 0. AutoGluon TabularPredictor is trained to maximize F1 score.
  • Stage 2 – A regression model using AutoGluon TabularPredictor is used to train the model on users with LTV > 0 for actual LTV regression.

During the model testing phase, the test data goes through the two models sequentially:

  • Stage 1 – The binary classification model runs on test data to get the binary prediction 0 (user having LTV = 0, churned) or 1 (user having LTV > 0, not churned).
  • Stage 2 – Players predicted with LTV > 0 go through the regression model to get the actual LTV value predicted. Combined with the user predicted as having LTV = 0, the final LTV prediction result is generated.

Model artifacts associated with the training configurations for each experiment and for each version of dataset are stored in an S3 bucket after the training, and also registered to the SageMaker Model Registry within the SageMaker Pipelines run.

To test if there is any data drift due to using the same model trained on the dataset v1 (12 weeks starting from October), we run inference on dataset v1, v2 (starting time shifted forward by 4 weeks), v3 (shifted forward by 8 weeks), and so on for v4 and v5. The following table summarizes model performance. The metric used for comparison is minmax score, whose range is 0–1. It gives a higher number when the LTV prediction is closer to the true LTV value.

Dataset Version Minmax Score Difference with v1
v1 0.68756
v2 0.65283 -0.03473
v3 0.66173 -0.02584
v4 0.69633 0.00877
v5 0.71533 0.02777

A performance drop is seen on dataset v2 and v3, which is consistent with the analysis performed on various modeling approaches having decreasing performance on dataset v2 and v3. For v4 and v5, the model shows equivalent performance, and even shows a slight improvement on v5 without model retraining. However, when comparing model v1 performance on dataset v5 (0.71533) vs. model v5 performance on dataset v5 (0.7599), model retraining is improving performance significantly.

Training pipeline

SageMaker Pipelines provides easy ways to compose, manage, and reuse ML workflows; select the best models for deploying into production; track the models automatically; and integrate CI/CD into ML pipelines.

In the training step, a SageMaker Estimator is constructed with the following code. Unlike the normal SageMaker Estimator to create a training job, we pass a SageMaker pipeline session to SageMaker_session instead of a SageMaker session:

from sagemaker.estimator import Estimator
from sagemaker.workflow.pipeline_context import PipelineSession

pipeline_session = PipelineSession()

ltv_train = Estimator(
    image_uri=image_uri,
    instance_type=instance_type,
    instance_count=1,
    output_path=output_path,
    base_job_name=f'{base_jobname_prefix}/train',
    role=role,
    source_dir=source_dir,
    entry_point=entry_point,
    sagemaker_session=pipeline_session,
    hyperparameters=hyperparameters
)

The base image is retrieved by the following code:

image_uri = SageMaker.image_uris.retrieve(
        "AutoGluon",
        region=region,
        version=framework_version,
        py_version=py_version,
        image_scope="training",
        instance_type=instance_type,
)

The trained model goes through the evaluation process, where the target metric is minmax. A score larger than the current best LTV minmax score will lead to a model register step, whereas a lower LTV minmax score won’t lead to the current registered model version being updated. The model evaluation on the holdout test dataset is implemented as a SageMaker Processing job.

The evaluation step is defined by the following code:

step_eval = ProcessingStep(
        name=f"EvaluateLTVModel-{ds_version}",
        processor=script_eval,
        inputs=[
            ProcessingInput(
                source=step_train.properties.ModelArtifacts.S3ModelArtifacts,
                destination="/opt/ml/processing/model",
            ),
            ProcessingInput(
                source=test,
                input_name='test',
                destination="/opt/ml/processing/test",
            ),
        ],
        outputs=[
            ProcessingOutput(output_name="evaluation", source="/opt/ml/processing/evaluation"),
        ],
        code=os.path.join(BASE_DIR, "evaluate_weekly.py"),
        property_files=[evaluation_report],
        job_arguments=["--test-fname", os.path.basename(test)],
    )

When the model evaluation is complete, we need to compare the evaluation result (minmax) with the existing model’s performance. We define another pipeline step, step_cond.

With all the necessary steps defined, the ML pipeline can be constructed and run with the following code:

# training pipeline
training_pipeline = Pipeline(
    name=f'odin-ltv-{ds_version}', 
    parameters=[
        processing_instance_count,
        model_approval_status,
        dataset_version,
        train_data,
        test_data,
        output_path,
        batch_instance_types,
        model_metrics,
        best_ltv_minmax_score
    ],
    steps=[step_train, step_eval, step_cond]
)

### start execution
execution = training_pipeline.start(
    parameters=dict(
        DatasetVersion=ds_version,
    )
)

The whole workflow is trackable and visualized in Amazon SageMaker Studio, as shown in the following graph. The ML training jobs are tracked by the SageMaker Experiment automatically so that you can find the ML training configuration, hyperparameters, dataset, and trained model of each training job. Choose each of the modules, logs, parameters, output, and so on to examine them in detail.

SaegMaker Pipelines

Automated batch inference

In the case of LTV prediction, batch inference is preferred to real-time inference because the predicted LTV is used for the offline downstream tasks normally. Just like creating ML features from the training dataset through the multi-step ETL, we have to create the ML features as an input to the LTV prediction model. We reuse the same workflow of AWS Glue to convert the players’ data into the ML features, but the data split and the label generation are not performed. The resulting ML feature is stored in the designated S3 bucket, which is monitored by an AWS Lambda trigger. When the ML feature file is dropped into the S3 bucket, the Lambda function runs automatically, which starts the SageMaker batch transform job using the latest and approved LTV model found in the SageMaker Model Registry. When the batch transform is complete, the output or predicted LTV values for each player are saved to the S3 bucket so that any downstream task can pick up the result. This architecture is described in the following diagram.

Data ETL pipeline

With this pipeline combining the ETL task and the batch inference, the LTV prediction is done simply running the AWS Glue ETL workflow regularly, such as once a week or once a month. AWS Glue and SageMaker manage their underlying resources, which means that this pipeline doesn’t require you to keep any resource running all the time. Therefore, this architecture using managed services is cost effective for batch tasks.

Deployable solution using the AWS CDK

The ML pipeline itself is defined and run using Pipelines, but the data pipeline and the ML model inference code including the Lambda function are out of the scope of Pipelines. To make this solution deployable so that we can apply this to other games, we defined the data pipeline and ML model inference using the AWS CDK. This way, the engineering team and data science team have the flexibility to manage, update, and control the whole ML solution without having to manage the infrastructure manually using the AWS Management Console.

Conclusion

In this post, we discussed how we could solve data drift and complex ETL challenges by building an automated data pipeline and ML pipeline utilizing managed services such as AWS Glue and SageMaker, and how to make it a scalable and repeatable ML solution to be adopted by other games using the AWS CDK.

“In this era, games are more than just content. They bring people together and have boundless potential and value when it comes to enjoying our lives. At Kakao Games, we dream of a world filled with games anyone can easily enjoy. We strive to create experiences where players want to stay playing and create bonds through community. The MLSL team helped us build a scalable LTV prediction ML solution using AutoGluon for AutoML, Amazon SageMaker for MLOps, and AWS Glue for data pipeline. This solution automates the model retraining for data or game changes, and can easily be deployed to other games via the AWS CDK. This solution helps us optimize our business processes, which in turn helps us stay ahead in the game.”

SuHyung Kim, Head of Data Analytics Lab, Kakao Games.

To learn more about related features of SageMaker and the AWS CDK, check out the following:

Amazon ML Solutions Lab

The Amazon ML Solutions Lab pairs your team with ML experts to help you identify and implement your organization’s highest-value ML opportunities. If you want to accelerate your use of ML in your products and processes, please contact the Amazon ML Solutions Lab.


About the Authors

Suhyoung Kim is a General Manager at KakaoGames Data Analytics Lab. He is responsible for gathering and analyzing data, and especially concern for the economy of online games.

Muhyun Kim is a data scientist at Amazon Machine Learning Solutions Lab. He solves customer’s various business problems by applying machine learning and deep learning, and also helps them gets skilled.

Sheldon Liu is a Data Scientist at Amazon Machine Learning Solutions Lab. As an experienced machine learning professional skilled in architecting scalable and reliable solutions, he works with enterprise customers to address their business problems and deliver effective ML solutions.

Alex Chirayath is a Senior Machine Learning Engineer at the Amazon ML Solutions Lab. He leads teams of data scientists and engineers to build AI applications to address business needs.

Gonsoo Moon, AI/ML Specialist Solutions Architect at AWS, has worked together customers to solve their ML problems using AWS AI/ML services. In the past, he had experience in developing machine learning services in the manufacture industry as well as in a large scale of service development, data analysis and system development in the portal and gaming industry. In his spare time, Gonsoo takes a walk and plays with children.

Read More

Simplify continuous learning of Amazon Comprehend custom models using Comprehend flywheel

Simplify continuous learning of Amazon Comprehend custom models using Comprehend flywheel

Amazon Comprehend is a managed AI service that uses natural language processing (NLP) with ready-made intelligence to extract insights about the content of documents. It develops insights by recognizing the entities, key phrases, language, sentiments, and other common elements in a document. The ability to train custom models through the Custom classification and Custom entity recognition features of Comprehend has enabled customers to explore out-of-the-box NLP capabilities tied to their requirements without having to take the approach of building classification and entity recognition models from scratch.

Today, users invest a significant amount of resources to build, train, and maintain custom models. However, these models are sensitive to changes in the real world. For example, since 2020, COVID has become a new entity type that businesses need to extract from documents. In order to do so, customers have to retrain their existing entity extraction models with new training data that includes COVID. Custom Comprehend users need to manually monitor model performance to assess drifts, maintain data to retrain models, and select the right models that improve performance.

Comprehend flywheel is a new Amazon Comprehend resource that simplifies the process of improving a custom model over time. You can use a flywheel to orchestrate the tasks associated with training and evaluating new custom model versions. You can create a flywheel to use an existing trained model, or Amazon Comprehend can create and train a new model for the flywheel. Flywheel creates a data lake (in Amazon S3) in your account where all the training and test data for all versions of the model are managed and stored. Periodically, the new labeled data (to retrain the model) can be made available to flywheel by creating datasets. To incorporate the new datasets into your custom model, you create and run a flywheel iteration. A flywheel iteration is a workflow that uses the new datasets to evaluate the active model version and to train a new model version.

Based on the quality metrics for the existing and new model versions, you set the active model version to be the version of the flywheel model that you want to use for inference jobs. You can use the flywheel active model version to run custom analysis (real-time or asynchronous jobs). To use the flywheel model for real-time analysis, you must create an endpoint for the flywheel.

This post demonstrates how you can build a custom text classifier (no prior ML knowledge needed) that can assign a specific label to a given text. We will also illustrate how flywheel can be used to orchestrate the training of a new model version and improve the accuracy of the model using new labeled data.

Prerequisites

To complete this walkthrough, you need an AWS account and access to create resources in AWS Identity and Access Management (IAM), Amazon S3 and Amazon Comprehend within the account.

  • Configure IAM user permissions for users to access flywheel operations (CreateFlywheel, DeleteFlywheel, UpdateFlywheel, CreateDataset, StartFlywheelIteration).
  • (Optional) Configure permissions for AWS KMS keys for AWS KMS keys for the datalake.
  • Create a data access role that authorizes Amazon Comprehend to access the datalake.

For information about creating IAM policies for Amazon Comprehend, see Permissions to perform Amazon Comprehend actions. 

In this post, we use the Yahoo corpus from Text Understanding from scratch by Xiang Zhang and Yann LeCun. The data can be accessed from AWS Open Data Registry. Please refer to section 4, “Preparing data,” from the post Building a custom classifier using Amazon Comprehend for the script and detailed information on data preparation and structure.

Alternatively, for even more convenience, you can download the prepared data by entering the following two command lines:

Admin:~/environment $ aws s3 cp s3://aws-blogs-artifacts-public/artifacts/ML-13607/custom-classifier-partial-dataset.csv .

Admin:~/environment $ aws s3 cp s3://aws-blogs-artifacts-public/artifacts/ML-13607/custom-classifier-complete-dataset.csv .

We will be using the custom-classifier-partial-dataset.csv (about 15,000 documents) dataset to create the initial version of the custom classifier.  Next, we will create a flywheel to orchestrate the retraining of the initial version of the model using the complete dataset custom-classifier-complete-dataset.csv (about 100,000 documents). Upon retraining the model by triggering a flywheel iteration, we evaluate the model performance metrics of the two versions of the custom model and choose the better-performing one as the active model version and demonstrate real-time custom classification using the same.

Solution overview

Please find the following steps to set up the environment and the data lake to create a Comprehend flywheel iteration to retrain the custom models.

  1. Setting up the environment
  2. Creating S3 buckets
  3. Training the custom classifier
  4. Creating a flywheel
  5. Configuring datasets
  6. Triggering flywheel iterations
  7. Update active model version
  8. Using flywheel for custom classification
  9. Cleaning up the resources

1. Setting up the environment

You can interact with Amazon Comprehend via the AWS Management ConsoleAWS Command Line Interface (AWS CLI), or Amazon Comprehend API. For more information, refer to Getting started with Amazon Comprehend.

In this post, we use AWS CLI to create and manage the resources. AWS Cloud9 is a cloud-based integrated development environment (IDE) that lets you write, run, and debug your code. It includes a code editor, debugger, and terminal. AWS Cloud9 comes prepackaged with AWS CLI.

Please refer to  Creating an environment in AWS Cloud9 to set up the environment.

2. Creating S3 buckets

  1. Create two S3 buckets
    • One for managing the datasets custom-classifier-partial-dataset.csv and custom-classifier-complete-dataset.csv.
    • One for the data lake for Comprehend flywheel.
  2. Create the first bucket using the following command (replace ‘123456789012’ with your account ID):
    $ aws s3api create-bucket --acl private --bucket '123456789012-comprehend' --region us-east-1

  3. Create the bucket to be used as the data lake for flywheel:
    $ aws s3api create-bucket --acl private --bucket '123456789012-comprehend-flywheel-datalake' --region us-east-1

  4. Upload the training datasets to the “123456789012-comprehend” bucket:
    $ aws s3 cp custom-classifier-partial-dataset.csv s3://123456789012-comprehend/
    
    $ aws s3 cp custom-classifier-complete-dataset.csv s3://123456789012-comprehend/

3. Training the custom classifier

Use the following command to create a custom classifier: yahoo-answers-version1 using the dataset: custom-classifier-partial-dataset.csv. Replace the data access role ARN and the S3 bucket locations with your own.

$ aws comprehend create-document-classifier  --document-classifier-name "yahoo-answers-version1"  --data-access-role-arn arn:aws:iam::123456789012:role/comprehend-data-access-role  --input-data-config S3Uri=s3://123456789012-comprehend/custom-classifier-partial-dataset.csv  --output-data-config S3Uri=s3://123456789012-comprehend/TrainingOutput/ --language-code en

The above API call results in the following output:

{  "DocumentClassifierArn": "arn:aws:comprehend:us-east-1:123456789012:document-classifier/yahoo-answers-version1"}

CreateDocumentClassifier starts the training of the custom classifier model. In order to further track the progress of the training, use DescribeDocumentClassifier.

$ aws comprehend describe-document-classifier --document-classifier-arn arn:aws:comprehend:us-east-1:123456789012:document-classifier/yahoo-answers-version1

{ "DocumentClassifierProperties": { "DocumentClassifierArn": "arn:aws:comprehend:us-east-1:123456789012:document-classifier/yahoo-answers-version1", "LanguageCode": "en", "Status": "TRAINED", "SubmitTime": "2022-09-22T21:17:53.380000+05:30", "EndTime": "2022-09-22T23:04:52.243000+05:30", "TrainingStartTime": "2022-09-22T21:21:55.670000+05:30", "TrainingEndTime": "2022-09-22T23:04:17.057000+05:30", "InputDataConfig": { "DataFormat": "COMPREHEND_CSV", "S3Uri": "s3://123456789012-comprehend/custom-classifier-partial-dataset.csv" }, "OutputDataConfig": { "S3Uri": "s3://123456789012-comprehend/TrainingOutput/333997476486-CLR-4ea35141e42aa6b2eb2b3d3aadcbe731/output/output.tar.gz" }, "ClassifierMetadata": { "NumberOfLabels": 10, "NumberOfTrainedDocuments": 13501, "NumberOfTestDocuments": 1500, "EvaluationMetrics": { "Accuracy": 0.6827, "Precision": 0.7002, "Recall": 0.6906, "F1Score": 0.693, "MicroPrecision": 0.6827, "MicroRecall": 0.6827, "MicroF1Score": 0.6827, "HammingLoss": 0.3173 } }, "DataAccessRoleArn": "arn:aws:iam::123456789012:role/comprehend-data-access-role", "Mode": "MULTI_CLASS" }}
Console view of the initial version of the custom classifier as a result of the create-document-classifier command previously described:

Console view of the initial version of the custom classifier as a result of the create-document-classifier command previously described

Model Performance

Model Performance

Once Status shows TRAINED, the classifier is ready to use. The initial version of the model has an F1-score of 0.69. F1-score is an important evaluation metric in machine learning. It sums up the predictive performance of a model by combining two otherwise competing metrics—precision and recall.

4. Create a flywheel

As the next step, create a new version of the model with the updated dataset (custom-classifier-complete-dataset.csv). For retraining, we will be using Comprehend flywheel to help orchestrate and simplify the process of retraining the model.

You can create a flywheel for an existing trained model (as in our case) or train a new model for the flywheel. When you create a flywheel, Amazon Comprehend creates a data lake to hold all the data that the flywheel needs, such as the training data and test data for each version of the model.  When Amazon Comprehend creates the data lake, it sets up the following folder structure in the Amazon S3 location.

Datasets 
Annotations pool 
Model datasets 
       (data for each version of the model) 
       VersionID-1 
                Training 
                Test 
                ModelStats 
       VersionID-2 
                Training 
                Test 
                ModelStats 

Warning: Amazon Comprehend manages the data lake folder organization and contents. If you modify the datalake folders, your flywheel may not operate correctly.

How to create a flywheel (for the existing custom model):

Note: If you create a flywheel for an existing trained model version, the model type and model configuration are preconfigured.

Be sure to replace the model ARN, data access role, and data lake S3 URI with your resource’s ARNs. Use the second S3 bucket  123456789012-comprehend-flywheel-datalake created in the “Setting up S3 buckets” step as the data lake for flywheel.

$ aws comprehend create-flywheel --flywheel-name custom-model-flywheel-test --active-model-arn arn:aws:comprehend:us-east-1:123456789012:document-classifier/yahoo-answers-version1 -- data-access-role-arn arn:aws:iam::123456789012:role/comprehend-data-access-role --data-lake-s3-uri s3://123456789012-comprehend-flywheel-datalake/

The above API call results in a FlyWheelArn.

{ "FlywheelArn": "arn:aws:comprehend:us-east-1:123456789012:flywheel/custom-model-flywheel-test"}
Console view of the Flywheel

Console view of the flywheel

5. Configuring datasets

To add labeled training or test data to a flywheel, use the Amazon Comprehend console or API to create a dataset.

  1. Create an inputConfig.json file containing the following content:
    {"DataFormat": "COMPREHEND_CSV","DocumentClassifierInputDataConfig": {"S3Uri": "s3://123456789012-comprehend/custom-classifier-complete-dataset.csv"}}

  2. Use the relevant flywheel ARN from your account to create the dataset.
    $ aws comprehend create-dataset --flywheel-arn "arn:aws:comprehend:us-east-1:123456789012:flywheel/custom-model-flywheel-test" --dataset-name "training-dataset-complete" --dataset-type "TRAIN" --description "my training dataset" --input-data-config file://inputConfig.json

  3. This results in the creation of a dataset:
    {   "DatasetArn": "arn:aws:comprehend:us-east-1:123456789012:flywheel/custom-model-flywheel-test/dataset/training-dataset-complete"   }
    {   "DatasetArn": "arn:aws:comprehend:us-east-1:123456789012:flywheel/custom-model-flywheel-test/dataset/training-dataset-complete"   }

6. Triggering flywheel iterations

Use flywheel iterations to help you create and manage new model versions. Users can also view per-dataset metrics in the “model stats” folder in the data lake in S3 bucket. Run the following command to start the flywheel iteration:

$ aws comprehend start-flywheel-iteration --flywheel-arn  "arn:aws:comprehend:us-east-1:123456789012:flywheel/custom-model-flywheel-test"

The response contains the following content :

{ "FlywheelArn": "arn:aws:comprehend:us-east-1:123456789012:flywheel/custom-model-flywheel-test", "FlywheelIterationId": "20220922T192911Z"}

When you run the flywheel, it creates a new iteration that trains and evaluates a new model version with the updated dataset. You can promote the new model version if its performance is superior to the existing active model version.

Result of the flywheel iteration

Result of the flywheel iteration

7. Update active model version

We notice that the model performance has improved as a result of the recent iteration (highlighted above). To promote the new model version as the active model version for inferences, use UpdateFlywheel API call:

$  aws comprehend update-flywheel --flywheel-arn arn:aws:comprehend:us-east-1:123456789012:flywheel/custom-model-flywheel-test --active-model-arn  "arn:aws:comprehend:us-east-1:123456789012:document-classifier/yahoo-answers-version1/version/Comprehend-Generated-v1-1b235dd0"

The response contains the following contents, which shows that the newly trained model is being promoted as the active version:

{"FlywheelProperties": {"FlywheelArn": "arn:aws:comprehend:us-east-1:123456789012:flywheel/custom-model-flywheel-test","ActiveModelArn": "arn:aws:comprehend:us-east-1:123456789012:document-classifier/yahoo-answers-version1/version/Comprehend-Generated-v1-1b235dd0","DataAccessRoleArn": "arn:aws:iam::123456789012:role/comprehend-data-access-role","TaskConfig": {"LanguageCode": "en","DocumentClassificationConfig": {"Mode": "MULTI_CLASS"}},"DataLakeS3Uri": "s3://123456789012-comprehend-flywheel-datalake/custom-model-flywheel-test/schemaVersion=1/20220922T175848Z/","Status": "ACTIVE","ModelType": "DOCUMENT_CLASSIFIER","CreationTime": "2022-09-22T23:28:48.959000+05:30","LastModifiedTime": "2022-09-23T07:05:54.826000+05:30","LatestFlywheelIteration": "20220922T192911Z"}}

8. Using flywheel for custom classification

You can use the flywheel’s active model version to run analysis jobs for custom classification. This can be for both real-time analysis or for asynchronous classification jobs.

  • Asynchronous jobs: Use the StartDocumentClassificationJob API request to start an asynchronous job for custom classification. Provide the FlywheelArn parameter instead of the DocumentClassifierArn.
  • Real-time analysis: You use an endpoint to run real-time analysis. When you create the endpoint, you configure it with the flywheel ARN instead of a model ARN. When you run the real-time analysis, select the endpoint associated with the flywheel. Amazon Comprehend runs the analysis using the active model version of the flywheel.

Run the following command to create the endpoint:

$ aws comprehend —endpoint-name custom-classification-endpoint —model-arn arn:aws:comprehend:us-east-1:123456789012:flywheel/custom-model-flywheel-test —desired-inference-units 1

Warning: You will be charged for this endpoint from the time it is created until it is deleted. Ensure you delete the endpoint when not in use to avoid charges.

For API, use the ClassifyDocument API operation. Provide the endpoint of the flywheel for the EndpointArn parameter OR use the console to classify documents in real time.

Pricing details

Flywheel APIs are free of charge. However, you will be billed for custom model training and management. You are charged $3 per hour for model training (billed by the second) and $0.50 per month for custom model management. For synchronous custom classification and entities inference requests, you provision an endpoint with the appropriate throughput. For more details, please visit Comprehend Pricing.

9. Cleaning up the resources

As discussed, you are charged from the time that you start your endpoint until it is deleted. Once you no longer need your endpoint, you should delete it so that you stop incurring costs from it. You can easily create another endpoint whenever you need it from the Endpoints section. For more information, refer to Deleting endpoints.

Conclusion

In this post, we walked through the capabilities of Comprehend flywheel and how it simplifies the process of retraining and improving custom models over time. As part of the next steps, you can explore the following:

  • Create and manage Comprehend flywheel resources from other mediums such as SDK and console.
  • In this blog, we created a flywheel for an already trained custom model. You can explore the option of creating a flywheel and training a model for it from scratch.
  • Use flywheel for custom entity recognizers.

There are many possibilities, and we are excited to see how you use Amazon Comprehend for your NLP use cases. Happy learning and experimentation!


About the Author

Supreeth S Angadi is a Greenfield Startup Solutions Architect at AWS and a member of AI/ML technical field community. He works closely with ML Core , SaaS and Fintech startups to help accelerate their journey to the cloud. Supreeth likes spending his time with family and friends, loves playing football and follows the sport immensely. His day is incomplete without a walk and playing fetch with his ‘DJ’ (Golden Retriever).

Read More

Introducing the Amazon Comprehend flywheel for MLOps

Introducing the Amazon Comprehend flywheel for MLOps

The world we live in is rapidly changing, and so are the data and features that companies and customers use to train their models. Retraining models to keep them in sync with these changes is critical to maintain accuracy. Therefore, you need an agile and dynamic approach to keep models up to date and adapt them to new inputs. This combination of great models and continuous adaptation is what will lead to a successful machine learning (ML) strategy.

Today, we are excited to announce the launch of Amazon Comprehend flywheel—a one-stop machine learning operations (MLOps) feature for an Amazon Comprehend model. In this post, we demonstrate how you can create an end-to-end workflow with an Amazon Comprehend flywheel.

Solution overview

Amazon Comprehend is a fully managed service that uses natural language processing (NLP) to extract insights about the content of documents. It helps you extract information by recognizing sentiments, key phrases, entities, and much more, allowing you to take advantage of state-of-the-art models and adapt them for your specific use case.

MLOps focuses on the intersection of data science and data engineering in combination with existing DevOps practices to streamline model delivery across the ML development lifecycle. MLOps is the discipline of integrating ML workloads into release management, CI/CD, and operations. MLOps requires the integration of software development, operations, data engineering, and data science.

This is why Amazon Comprehend is introducing the flywheel. The flywheel is intended to be your one stop to perform MLOPs for your Amazon Comprehend models. This new feature will allow you to keep your models up to date, improve upon your models, and deploy the best version faster.

The following diagram represents the model lifecycle inside an Amazon Comprehend flywheel.

The current process to create a new model consists of a sequence of steps. First, you gather data and prepare the dataset. Then, you train the model using this dataset. After the model is trained, it’s evaluated for accuracy and performance. Finally, you deploy the model to an endpoint to perform inference. When new models are created, these steps need to be repeated, and the endpoint needs to be manually updated.

An Amazon Comprehend flywheel automates this ML process, from data ingestion to deploying the model in production. With this new feature, you can manage training and testing of the created models inside Amazon Comprehend. This feature also allows you to automate model retraining after new datasets are ingested and available in the flywheel´s data lake.

The flywheel provides integration with custom classification and custom entity recognition APIs, and can help different roles such as data engineers and developers automate and manage the NLP workflow with no-code services.

First, let’s introduce some concepts:

  • Flywheel – A flywheel is an AWS resource that orchestrates the ongoing training of a model for custom classification or custom entity recognition.
  • Dataset – A dataset is a set of training or test data that is used in a single flywheel. Flywheel uses the training datasets to train new model versions and evaluate their performance on the test datasets.
  • Data lake – A flywheel’s data lake is a location in your Amazon Simple Storage Service (Amazon S3) bucket that stores all its datasets and model artifacts. Each flywheel has its own dedicated data lake.
  • Flywheel iteration – A flywheel iteration is a run of the flywheel when triggered by the user. Depending on the availability of new train or test datasets, the flywheel will train a new model version or assess the performance of the active model on new test data.
  • Active model – An active model is the selected version of the model by the user for predictions. As the performance of the model is improved with new flywheel iterations, you can change the active version to the one that has the best performance.

The following diagram illustrates the flywheel workflow.

flywheel workflow

These steps are detailed as follows:

  • Create a flywheel – A flywheel automates the training of model versions for a custom classifier or custom entity recognizer. You can either select an existing Amazon Comprehend model as a starting point for the flywheel or you can start from scratch with no models. In both cases, a flywheel’s data lake location must be specified for the flywheel.
  • Data ingestion – You can create new datasets for training or testing in the flywheel. All the training and test data for all versions of the model are managed and stored in the flywheel’s data lake created in your S3 bucket. The supported file formats are CSV and augmented manifest from an S3 location. You can find more information for preparing the dataset for custom classification and custom entity recognition.
  • Train and evaluate the model – When you don’t indicate the model ARN (Amazon Resource Name) to use, it implies that a new one is going to be built from scratch. For that, the first iteration of flywheel will create the model based on the train dataset uploaded. For successive iterations, these are the possible cases:
    • If no new train or test datasets are uploaded since the last iteration, the flywheel iteration will finish without any change.
    • If there are only new test datasets since the last iteration, the flywheel iteration will report the performance of the current active model based on the new test datasets.
    • If there are only new train datasets, the flywheel iteration will train a new model.
    • If there are new train and test datasets, the flywheel iteration will train a new model and report the performance of the current active model.
  • Promote new active model version – Based on the performance of the different flywheel iterations, you can update the active model version to the best one.
  • Deploy an endpoint – After running a flywheel iteration and updating the active model version, you can run real-time (synchronous) inference on your model. You can create an endpoint with the flywheel ARN, which will by default use the currently active model version. When the active model for the flywheel changes, the endpoint automatically starts using the new active model without any customer intervention. An endpoint includes all the managed resources that make your custom model available for real-time inference.

In the following sections, we demonstrate the different ways to create a new Amazon Comprehend flywheel.

Prerequisites

You need the following:

  • An active AWS account
  • An S3 bucket for your data location
  • An AWS Identity and Access Management (IAM) role with permissions to create an Amazon Comprehend flywheel and permissions to read and write to your data location S3 bucket

Create a flywheel with AWS CloudFormation

To start using an Amazon Comprehend flywheel with AWS CloudFormation, you need the following information about the AWS::Comprehend::Flywheel resource:

  • DataAccessRoleArn – The ARN of the IAM role that grants Amazon Comprehend permission to access the flywheel data
  • DataLakeS3Uri – The Amazon S3 URI of the flywheel’s data lake location
  • FlywheelName – The name for the flywheel

For more information, refer to AWS CloudFormation documentation.

Create a flywheel on the Amazon Comprehend console

In this example, we demonstrate how to build a flywheel for a custom classifier model on the Amazon Comprehend console that figures out the topic of the news.

Create a dataset

First, you need to create the dataset. For this post, we use the AG News Classification Dataset. In this dataset, data is classified in four news categories: WORLD, SPORTS, BUSINESS, and SCI_TEC.

Run the notebook following the steps to preprocess data in the Comprehend Immersion Day Lab 2 for the training and testing dataset and save the data in Amazon S3.

Create a flywheel

Now we can create our flywheel. Complete the following steps:

  1. On the Amazon Comprehend console, choose Flywheels in the navigation pane.
    choose Flywheels in the navigation pane.
  2. Choose Create new flywheel.
    create new flywheel

You can create a new flywheel from an existing model or create a new model. In this case, we create a new model from scratch.

  1. For Flywheel name, enter a name (for this example, custom-news-flywheel).
  2. Leave the Model field empty.
  3. Select Custom classification for Custom model type.
  4. For Language, leave the setting as English.
  5. Select Using Multi-label mode for Classifier mode.
  6. For Custom labels, enter BUSINESS,SCI_TECH,SPORTS,WORLD.
    For Custom labels, enter BUSINESS,SCI_TECH,SPORTS,WORLD.
  7. For the encryption settings, keep Use AWS owned key.
  8. For the flywheel’s data lake location, select an S3 URI in your account that can be dedicated to this flywheel.

Each flywheel has an S3 data lake location where it stores flywheel assets and artifacts such as datasets and model statistics. Make sure not to modify or delete any objects from this location because it’s meant to be managed exclusively by the flywheel.

  1. Choose Create an IAM role and enter a name for the role (CustomNewsFlywheelRole in our case).
  2. Choose Create.

It will take a couple of minutes to create the flywheel. Once created, the status will change to Active.

Once created, the status will change to Active.

  1. On the custom-news-flywheel details page, choose Create dataset.
    Create dataset
  2. For Dataset name, enter a name for the training dataset.
  3. Leave CSV file for Data format.
  4. Choose Training and select the training dataset from the S3 bucket.
  5. Choose Create.
  6. Repeat these steps to create a test dataset.
  7. After the uploaded dataset status changes to Completed, go to the Flywheel iterations tab and choose Run flywheel.
    go to the Flywheel iterations tab and choose Run flywheel.
  8. When the training is complete, go to the Model versions tab, select the recently trained model, and choose Make active model.

You can also observe the objective metrics F1 score, precision, and recall.

objective metrics F1 score, precision, and recall.

  1. Return to the Datasets tab and choose Create dataset in the Test datasets section.
    Create dataset in the Test datasets section.
  2. Enter the location of text.csv in the S3 bucket.
    Enter the location of text.csv in the S3 bucket.

Wait until the status shows as Completed. This will create metrics on the active model using the test dataset.

status shows as Completed.

If you choose Custom classification in the navigation pane, you can see all the document classifier models, even the ones trained using flywheels.

document classifier models

Create an endpoint

To create your model endpoint, complete the following steps:

  1. On the Amazon Comprehend console, navigate to the flywheel you created.
  2. On the Endpoints tab, choose Create endpoint.
    Create endpoint.
  3. Name the endpoint news-topic.
  4. Under Classification models and flywheels, the active model version is already selected.
    active model version is already selected.
  5. For Inference Units, choose 1 IU.
  6. Select the acknowledgement check box, then choose Create endpoint.
  7. After the endpoint has been created and is active, navigate to Use in real-time analysis on the endpoint’s details page.
  8. Test the model by entering text in the Input text box.
  9. Under Results, check the labels for the news topics.
    labels for the news topics.

Create an asynchronous analysis job

To create an analysis job, complete the following steps:

  1. On the Amazon Comprehend console, navigate to the active model version.
  2. Choose Create job.
    create job
  3. For Name, enter batch-news.
  4. For Analysis type¸ choose Custom classification.
  5. For Classification models and flywheels, choose the flywheel you created (custom-news-flywheel).
    create analysis job
  6. Browse Amazon S3 to select the input file with the different news texts we want to create the analysis with and then choose One document per line (one news text per line).

The following screenshot shows the document uploaded for this exercise.

the document uploaded for this exercise.

  1. Choose where you want to save the output file in your S3 location.
  2. For Access permissions, choose the IAM role CustomNewsFlywheelRole that you created earlier.
  3. Choose Create job.
  4. When the job is complete, download the output file and check the predictions.
    download the output file

Clean up

To avoid future charges, clean up the resources you created.

  1. On the Amazon Comprehend console, choose Flywheels in the navigation pane.
  2. Select your flywheel and choose Delete.
    delete flywheel
  3. Delete any endpoints you created.
  4. Empty and delete the S3 buckets you created.

Conclusion

In this post, we saw how an Amazon Comprehend flywheel serves as a one-stop shop to perform MLOPs for your Amazon Comprehend models. We also discussed its value proposition and introduced basic flywheel concepts. Then we walked you through the different steps starting from creating a flywheel to creating an endpoint.

Learn more about Simplify continuous learning of Amazon Comprehend custom models using Comprehend flywheel. Try it out now and get started with our newly launched service, the Amazon Comprehend flywheel.


About the Authors

Alberto Menendez is an Associate DevOps Consultant in Professional Services at AWS and a member of Comprehend Champions. He loves helping accelerate customers´ journey to the cloud and creating solutions to solve their business challenges. In his free time, he enjoys practicing sports, especially basketball and padel, spending time with family and friends, and learning about technology.

Irene Arroyo Delgado is an Associate AI/ML Consultant in Professional Services at AWS and a member of Comprehend Champions. She focuses on productionizing ML workloads to achieve customers’ desired business outcomes by automating end-to-end ML lifecycles. She has experience building performant ML platforms and their integration with a data lake on AWS. In her free time, Irene enjoys traveling and hiking in the mountains.

Shweta Thapa is a Solutions Architect in Enterprise Engaged at AWS and a member of Comprehend Champions. She enjoys helping her customers with their journey and growth in the cloud, listening to their business needs, and offering them the best solutions. In her free time, Shweta enjoys going out for a run, traveling, and most of all spending time with her baby daughter.

Read More

Extract non-PHI data from Amazon HealthLake, reduce complexity, and increase cost efficiency with Amazon Athena and Amazon SageMaker Canvas

Extract non-PHI data from Amazon HealthLake, reduce complexity, and increase cost efficiency with Amazon Athena and Amazon SageMaker Canvas

In today’s highly competitive market, performing data analytics using machine learning (ML) models has become a necessity for organizations. It enables them to unlock the value of their data, identify trends, patterns, and predictions, and differentiate themselves from their competitors. For example, in the healthcare industry, ML-driven analytics can be used for diagnostic assistance and personalized medicine, while in health insurance, it can be used for predictive care management.

However, organizations and users in industries where there is potential health data, such as in healthcare or in health insurance, must prioritize protecting the privacy of people and comply with regulations. They are also facing challenges in using ML-driven analytics for an increasing number of use cases. These challenges include a limited number of data science experts, the complexity of ML, and the low volume of data due to restricted Protected Health Information (PHI) and infrastructure capacity.

Organizations in the healthcare, clinical, and life sciences are facing several challenges in using ML for data analytics:

  • Low volume of data – Due to restrictions on private, protected, and sensitive health information, the volume of usable data is often limited, reducing the accuracy of ML models
  • Limited talent – Hiring ML talent is hard enough, but hiring talent that has not only ML experience but also deep medical knowledge is even harder
  • Infrastructure management – Provisioning infrastructure specialized for ML is a difficult and time-consuming task, and companies would rather focus on their core competencies than manage complex technical infrastructure
  • Prediction of multi-modal issues – When predicting the likelihood of multi-faceted medical events, such as a stroke, different factors such as medical history, lifestyle, and demographic information must be combined

A possible scenario is that you are a healthcare technology company with a team of 30 non-clinical physicians researching and investigating medical cases. This team has the knowledge and intuition in healthcare but not the ML skills to build models and generate predictions. How can you deploy a self-service environment that allows these clinicians to generate predictions themselves for multivariate questions like, “How can I access useful data while being compliant with health regulations and without compromising privacy?” And how can you do that without exploding the number of servers the SysOps folks need to manage?

This post addresses all these problems simultaneously in one solution. First, it automatically anonymizes the data from Amazon HealthLake. Then, it uses that data with serverless components and no-code self-service solutions like Amazon SageMaker Canvas to eliminate the ML modeling complexity and abstract away the underlying infrastructure.

A modern data strategy gives you a comprehensive plan to manage, access, analyze, and act on data. AWS provides the most complete set of services for the entire end-to-end data journey for all workloads, all types of data, and all desired business outcomes.

Solution overview

This post shows that by anonymizing sensitive data from Amazon HealthLake and making it available to SageMaker Canvas, organizations can empower more stakeholders to use ML models that can generate predictions for multi-modal problems, such as stroke prediction, without writing ML code, while limiting access to sensitive data. And we want to automate that anonymization to make this as scalable and amenable to self-service as possible. Automating also allows you to iterate the anonymization logic to meet your compliance requirements, and provides the ability to re-run the pipeline as your population’s health data changes.

The dataset used in this solution is generated by Synthea™, a Synthetic Patient Population Simulator and open-source project under the Apache License 2.0.

The workflow includes a hand-off between cloud engineers and domain experts. The former can deploy the pipeline. The latter can verify if the pipeline is correctly anonymizing the data and then generate predictions without code. At the end of the post, we’ll look at additional services to verify the anonymization.

The high-level steps involved in the solution are as follows:

  1. Use AWS Step Functions to orchestrate the health data anonymization pipeline.
  2. Use Amazon Athena queries for the following:
    1. Extract non-sensitive structured data from Amazon HealthLake.
    2. Use natural language processing (NLP) in Amazon HealthLake to extract non-sensitive data from unstructured blobs.
  3. Perform one-hot encoding with Amazon SageMaker Data Wrangler.
  4. Use SageMaker Canvas for analytics and predictions.

The following diagram illustrates the solution architecture.

Architecture Diagram

Prepare the data

First, we generate a fictional patient population using Synthea™ and import that data into a newly created Amazon HealthLake data store. The result is a simulation of the starting point from where a healthcare technology company could run the pipeline and solution described in this post.

When Amazon HealthLake ingests data, it automatically extracts meaning from unstructured data, such as doctors notes, into separate structured fields, such as patient names and medical conditions. To accomplish this on the unstructured data in DocumentReference FHIR resources, Amazon HealthLake transparently triggers Amazon Comprehend Medical, where entities, ontologies, and their relationships are extracted and added back to Amazon HealthLake as discreet data within the extension segment of records.

We can use Step Functions to streamline the collection and preparation of the data. The entire workflow is visible in one place, with any errors or exceptions highlighted, allowing for a repeatable, auditable, and extendable process.

Query the data using Athena

By running Athena SQL queries directly on Amazon HealthLake, we are able to select only those fields that are not personally identifying; for example, not selecting name and patient ID, and reducing birthdate to birth year. And by using Amazon HealthLake, our unstructured data (the text field in DocumentReference) automatically comes with a list of detected PHI, which we can use to mask the PHI in the unstructured data. In addition, because generated Amazon HealthLake tables are integrated with AWS Lake Formation, you can control who gets access down to the field level.

The following is an excerpt from an example of unstructured data found in a synthetic DocumentReference record:

# History of Present Illness
Marquis
is a 45 year-old. Patient has a history of hypertension, viral sinusitis (disorder), chronic obstructive bronchitis (disorder), stress (finding), social isolation (finding).
# Social History
Patient is married. Patient quit smoking at age 16.
Patient currently has UnitedHealthcare.
# Allergies
No Known Allergies.
# Medications
albuterol 5 mg/ml inhalation solution; amlodipine 2.5 mg oral tablet; 60 actuat fluticasone propionate 0.25 mg/actuat / salmeterol 0.05 mg/actuat dry powder inhaler
# Assessment and Plan
Patient is presenting with stroke.

We can see that Amazon HeathLake NLP interprets this as containing the condition “stroke” by querying for the condition record that has the same patient ID and displays “stroke.” And we can take advantage of the fact that entities found in DocumentReference are automatically labeled SYSTEM_GENERATED:

SELECT
  code.coding[1].code,
  code.coding[1].display
FROM
  condition
WHERE
  split(subject.reference, '/')[2] = 'fbfe46b4-70b1-8f61-c343-d241538ebb9b'
AND
  meta.tag[1].display = 'SYSTEM_GENERATED'
AND regexp_like(code.coding[1].display, 'Cerebellar stroke syndrome')

The result is as follows:

G46.4, Cerebellar stroke syndrome

The data collected in Amazon HealthLake can now be effectively used for analytics thanks to the ability to select specific condition codes, such as G46.4, rather than having to interpret entire notes. This data is then stored as a CSV file in Amazon Simple Storage Service (Amazon S3).

Note: When implementing this solution, please follow the instructions on turning on HealthLake’s integrated NLP feature via a support case before ingesting data into your HealthLake data store.

Perform one-hot encoding

To unlock the full potential of the data, we use a technique called one-hot encoding to convert categorical columns, like the condition column, into numerical data.

One of the challenges of working with categorical data is that it is not as amenable to being used in many machine learning algorithms. To overcome this, we use one-hot encoding, which converts each category in a column to a separate binary column, making the data suitable for a wider range of algorithms. This is done using Data Wrangler, which has built-in functions for this:

The built-in function for one-hot encoding in SageMaker Data Wrangler

The built-in function for one-hot encoding in SageMaker Data Wrangler

One-hot encoding transforms each unique value in the categorical column into a binary representation, resulting in a new set of columns for each unique value. In the example below, the condition column is transformed into six columns, each representing one unique value. After one-hot encoding, the same rows would turn into a binary representation.

Before_and_After_One-hot_encoding_tables

With the data now encoded, we can move on to using SageMaker Canvas for analytics and predictions.

Use SageMaker Canvas for analytics and predictions

The final CSV file then becomes the input for SageMaker Canvas, which healthcare analysts (business users) can use to generate predictions for multivariate problems like stroke prediction without needing to have expertise in ML. No special permissions are required because the data doesn’t contain any sensitive information.

In the example of stroke prediction, SageMaker Canvas was able to achieve an accuracy rate of 99.829% through the use of advanced ML models, as shown in the following screenshot.

The analyze screen within SageMaker Canvas showing 99.829% as how often the model correctly predicts stroke

In the next screenshot, you can see that, according to the model’s prediction, this patient has a 53% chance of not having a stroke.

SageMaker Canvas Predict screen showing that the prediction is No stroke based on the input of not in labor force among other inputs.

You might posit that you can create this prediction using rule-based logic in a spreadsheet. But do those rules tell you the feature importance—for example, that 4.9% of the prediction is based on whether or not they have ever smoked tobacco? And what if, in addition to the current columns like smoking status and blood pressure, you add 900 more columns (features)? Would you still be able to use a spreadsheet to maintain and manage the combinations of all those dimensions? Real-life scenarios lead to many combinations, and the challenge is to manage this at scale with the right level of effort.

Now that we have this model, we can start making batch or single predictions, asking what-if questions. For example, what if this person keeps all variables the same but, as of the previous two encounters with the medical system, is classified as Full-time employment instead of Not in labor force?

According to our model, and the synthetic data we fed it from Synthea, the person is at a 62% risk of having a stroke.

SageMaker Canvas predict screen showing Yes as the prediction and full time employment as an input.

As we can tell from the circled 12% and 10% feature importance of the conditions from the two most recent encounters with the medical system, whether they are full-time employed or not has a big impact on their risk of stroke. Beyond the findings of this model, there is research that demonstrates a similar link:

These studies have used large population-based samples and controlled for other risk factors, but it’s important to note that they are observational in nature and do not establish causality. Further research is needed to fully understand the relationship between full-time employment and stroke risk.

Enhancements and alternate methods

To further validate compliance, we can use services like Amazon Macie, which will scan the CSV files in the S3 bucket and alert us if there is any sensitive data. This helps increase the confidence level of the anonymized data.

In this post, we used Amazon S3 as the input data source for SageMaker Canvas. However, we can also import data into SageMaker Canvas directly from Amazon RedShift and Snowflake—popular enterprise data warehouse services used by many customers to organize their data and popular third-party solutions. This is especially important for customers who already have their data in either Snowflake or Amazon Redshift being used for other BI analytics.

By using Step Functions to orchestrate the solution, the solution is more extensible. Instead of a separate trigger to invoke Macie, you can add another step to the end of the pipeline to call Macie to double-check for PHI. If you want to add rules to monitor your data pipeline’s quality over time, you can add a step for AWS Glue Data Quality.

And if you want to add more bespoke integrations, Step Functions lets you scale out to handle as much data or as little data as you need in parallel and only pay for what you use. The parallelization aspect is useful when you are processing hundreds of GB of data, because you don’t want to try to jam all that into one function. Instead, you want to break it up and run it in parallel so you’re not waiting for it to process in a single queue. This is similar to a check-out line at the store—you don’t want to have a single cashier.

Clean up

To avoid incurring future session charges, log out of SageMaker Canvas.

Log out button in SageMaker Canvas

Conclusion

In this post, we showed that predictions for critical health issues like stroke can be done by medical professionals using complex ML models but without the need for coding. This will vastly increase the pool of resources by including people who have specialized domain knowledge but no ML experience. Also, using serverless and managed services allows existing IT people to manage infrastructure challenges like availability, resilience, and scalability with less effort.

You can use this post as a starting point to investigate other complex multi-modal predictions, which are key to guiding the health industry toward better patient care. Coming soon, we will have a GitHub repository to help engineers more quickly launch the kind of ideas we presented in this post.

Experience the power of SageMaker Canvas today, and build your models using a user-friendly graphical interface, with the 2-month Free Tier that SageMaker Canvas offers. You don’t need any coding knowledge to get started, and you can experiment with different options to see how your models perform.

Resources

To learn more about SageMaker Canvas, refer to the following:

To learn more about other use cases that you can solve with SageMaker Canvas, check out the following:

To learn more about Amazon HealthLake, refer to the following:


About the Authors

Yann Stoneman headshot -- white male in 30s with slight beard and glasses smilingYann Stoneman is a Solutions Architect at AWS based out of Boston, MA and is a member of the AI/ML Technical Field Community (TFC). Yann earned his Bachelors at The Juilliard School. When he’s not modernizing workloads for global enterprises, Yann plays piano, tinkers in React and Python, and regularly YouTubes about his cloud journey.

Ramesh Dwarakanath headshotRamesh Dwarakanath is a Principal Solutions Architect at AWS based out of Boston, MA. He works with Enterprises in the Northeast area on their Cloud journey. His areas of interest are Containers and DevOps. In his spare time, Ramesh enjoys tennis, racquetball.

Bakha Nurzhanov's headshot with dark hair and slightly smilingBakha Nurzhanov is an Interoperability Solutions Architect at AWS, and is a member of the Healthcare and Life Sciences technical field community at AWS. Bakha earned his Masters in Computer Science from the University of Washington and in his spare time Bakha enjoys spending time with family, reading, biking, and exploring new places.

Scott Schreckengaust's headshotScott Schreckengaust has a degree in biomedical engineering and has been inventing devices alongside scientists on the bench since the beginning of his career. He loves science, technology, and engineering with decades of experiences in startups to large multi-national organizations within the Healthcare and Life Sciences domain. Scott is comfortable scripting robotic liquid handlers, programming instruments, integrating homegrown systems into enterprise systems, and developing complete software deployments from scratch in regulatory environments. Besides helping people out, he thrives on building — enjoying the journey of hashing out customer’s scientific workflows and their issues then converting those into viable solutions.

Read More

Build a GNN-based real-time fraud detection solution using the Deep Graph Library without using external graph storage

Build a GNN-based real-time fraud detection solution using the Deep Graph Library without using external graph storage

Fraud detection is an important problem that has applications in financial services, social media, ecommerce, gaming, and other industries. This post presents an implementation of a fraud detection solution using the Relational Graph Convolutional Network (RGCN) model to predict the probability that a transaction is fraudulent through both the transductive and inductive inference modes. You can deploy our implementation to an Amazon SageMaker endpoint as a real-time fraud detection solution, without requiring external graph storage or orchestration, thereby significantly reducing the deployment cost of the model.

Businesses looking for a fully-managed AWS AI service for fraud detection can also use Amazon Fraud Detector, which you can use to identify suspicious online payments, detect new account fraud, prevent trial and loyalty program abuse, or improve account takeover detection.

Solution overview

The following diagram describes an exemplar financial transaction network that includes different types of information. Each transaction contains information like device identifiers, Wi-Fi IDs, IP addresses, physical locations, telephone numbers, and more. We represent the transaction datasets through a heterogeneous graph that contains different types of nodes and edges. Then, the fraud detection problem is handled as a node classification task on this heterogeneous graph.

RGCN graph construction diagram

Graph neural networks (GNNs) have shown great promise in tackling fraud detection problems, outperforming popular supervised learning methods like gradient-boosted decision trees or fully connected feed-forward networks on benchmarking datasets. In a typical fraud detection setup, during the training phase, a GNN model is trained on a set of labeled transactions. Each training transaction is provided with a binary label denoting if it is fraudulent. This trained model can then be used to detect fraudulent transactions among a set of unlabeled transactions during the inference phase. Two different modes of inference exist: transductive inference vs. inductive inference (which we discuss more later in this post).

GNN-based models, like RGCN, can take advantage of topological information, combining both graph structure and features of nodes and edges to learn a meaningful representation that distinguishes malicious transactions from legitimate transactions. RGCN can effectively learn to represent different types of nodes and edges (relations) via heterogeneous graph embedding. In the preceding diagram, each transaction is being modeled as a target node, and several entities associated with each transaction get modeled as non-target node types, like ProductCD and P_emaildomain. Target nodes have numerical and categorical features assigned, whereas other node types are featureless. The RGCN model learns an embedding for each non-target node type. For the embedding of a target node, a convolutional operation is used to compute its embedding using its features and neighborhood embeddings. In the rest of the post, we use the terms GNN and RGCN interchangeably.

It’s worth noting that alternative strategies, such as treating the non-target entities as features and one-hot-encoding them, would often be infeasible because of the large cardinalities of these entities. Conversely, encoding them as graph entities enables the GNN model to take advantage of the implicit topology in the entity relationships. For example, transactions that share a phone number with known fraudulent transactions are more likely to be fraudulent too.

The graph representation employed by GNNs creates some complexity in their implementation. This is especially true for applications such as fraud detection, in which the graph representation may get augmented during inference with newly added nodes that correspond to entities not known during model training. This inference scenario is usually referred to as inductive mode. In contrast, transductive mode is a scenario that assumes the graph representation constructed during model training won’t change during inference. GNN models are often evaluated in transductive mode by constructing graph representations from a combined set of training and test examples, while masking test labels during back-propagation. This ensures the graph representation is static, and there the GNN model doesn’t require implementation of operations to extend the graph with new nodes during inference. Unfortunately, static graph representation can’t be assumed when detecting fraudulent transactions in a real-world setting. Therefore, support for inductive inference is required when deploying GNN models for fraud detection to production environments.

In addition, detecting fraudulent transactions in real time is crucial, especially in business cases where there is only one chance of stopping illegal activities. For example, fraudulent users can behave maliciously just once with an account and never use the same account again. Real-time inference on GNN models introduces additional complexity to the implementation. It is often necessary to implement subgraph extraction operations to support real-time inference. The subgraph extraction operation is needed to reduce inference latency when graph representation is large and performing inference on the entire graph becomes prohibitively expensive. An algorithm for real-time inductive inference with an RGCN model runs as follows:

  1. Given a batch of transactions and a trained RGCN model, extend graph representation with entities from the batch.
  2. Assign embedding vectors of new non-target nodes with the mean embedding vector of their respective node type.
  3. Extract a subgraph induced by k-hop out-neighborhood of the target nodes from the batch.
  4. Perform inference on the subgraph and return prediction scores for the batch’s target nodes.
  5. Clean up the graph representation by removing newly added nodes (this step ensures the memory requirement for model inference stays constant).

The key contribution of this post is to present an RGCN model implementing the real-time inductive inference algorithm. You can deploy our RGCN implementation to a SageMaker endpoint as a real-time fraud detection solution. Our solution doesn’t require external graph storage or orchestration, and significantly reduces the deployment cost of the RGCN model for fraud detection tasks. The model also implements transductive inference mode, enabling us to carry out experiments to compare model performance in inductive and transductive modes. The model code and notebooks with experiments can be accessed from the AWS Examples GitHub repo.

This post builds on the post Build a GNN-based real-time fraud detection solution using Amazon SageMaker, Amazon Neptune, and the Deep Graph Library. The previous post built a RGCN-based real-time fraud detection solution using SageMaker, Amazon Neptune, and the Deep Graph Library (DGL). The prior solution used a Neptune database as external graph storage, required AWS Lambda for orchestration for real-time inference, and only included experiments in transductive mode.

The RGCN model introduced in this post implements all operations of the real-time inductive inference algorithm using only the DGL as a dependency, and doesn’t require external graph storage or orchestration for deployment.

We first evaluate the performance of the RGCN model in transductive and inductive modes on a benchmark dataset. As expected, model performance in inductive mode is slightly lower than in transductive mode. We also study the effect of hyperparameter k on model performance. The hyperparameter k controls the number of hops performed to extract a subgraph in Step 3 of the real-time inference algorithm. Higher values of k will produce larger subgraphs and can lead to better inference performance at the expense of higher latency. As such, we also conduct timing experiments to evaluate the feasibility of the RGCN model for a real-time application.

Dataset

We use the IEEE-CIS fraud dataset, the same dataset that was used in the previous post. The dataset contains over 590,000 transaction records that have a binary fraud label (the isFraud column). The data is split into two tables: transaction and identity. However, not all transaction records have corresponding identity information. We join the two tables on the TransactionID column, which leaves us with a total of 144,233 transaction records. We sort the table by transaction timestamp (the TransactionDT column) and create an 80/20 percentage split by time, producing 115,386 and 28,847 transactions for training and testing, respectively.

For more details on the dataset and how to format it to suit the input requirement of the DGL, refer to Detecting fraud in heterogeneous networks using Amazon SageMaker and Deep Graph Library.

Graph construction

We use the TransactionID column to generate target nodes. We use the following columns to generate 11 types of non-target nodes:

  • card1 through card6
  • ProductCD
  • addr1 and addr2
  • P_emaildomain and R_emaildomain

We use 38 columns as categorical features of target nodes:

  • M1 through M9
  • DeviceType and DeviceInfo
  • id_12 through id_38

We use 382 columns as numerical features of target nodes:

  • TransactionAmt
  • dist1 and dist2
  • id_01 through id_11
  • C1 through C14
  • D1 through D15
  • V1 through V339

Our graph constructed from the training transactions contains 217,935 nodes and 2,653,878 edges.

Hyperparameters

Other parameters are set to match the parameters reported in the previous post. The following snippet illustrates training the RGCN model in transductive and inductive modes:

import pandas as pd
from fgnn.fraud_detector import FraudRGCN

# overload default hyperparameters defined in FraudRGCN constructor
params = {
    "embedding_size": 64,
    "n_layers": 2,
    "n_epochs": 150,
    "n_hidden": 16,
    "dropout": 0.2,
    "weight_decay": 5e-05,
    "lr": 0.01
}

# load train and test splits
df_train = pd.read_parquet('./data/train.parquet')
df_test = pd.read_parquet('./data/test.parquet')

# train RGCN model in inductive mode
fd_ind = FraudRGCN()
fd_ind.train_fg(df_train, params=params)

# train RGCN model in transductive mode
fd_trs = FraudRGCN()
# create boolean array to identify test examples
test_mask = [False]*len(df_train) + [True]*len(df_test)
# concatenate train and test examaples
df_combined = pd.concat([df_train, df_test], ignore_index=True)	
# test_mask must be passed in transductive mode, 
# so test labels are masked-out during back-propagation
fd.train_fg(df_combined, params=params, test_mask=test_mask)

# predict on both models extracting subgraph with 2 k-hops
fraud_proba_ind = fd_ind.predict(df_test, k=2)
fraud_proba_trs = fd_trs.predict(df_test, k=2)

Inductive vs. transductive mode

We perform five trials for inductive and five trials for transductive mode. For each trial, we train an RGCN model and save it to disk, obtaining 10 models. We evaluate each model on test examples while increasing the number of hops (parameter k) used to extract a subgraph for inference, setting k to 1, 2, and 3. We predict on all test examples at once, and compute the ROC AUC score for each trial. The following plot shows the mean and 95% confidence intervals of AUC scores.

Inductive vs Transductive model performance

We can see that performance in transductive mode is slightly higher than in inductive mode. For k=2, mean AUC scores for inductive and transductive modes are 0.876 and 0.883, respectively. This is expected because the RGCN model is able to learn embeddings of all entity nodes in transductive mode, including those in the test set. In contrast, inductive mode only allows the model to learn embeddings of entity nodes that are present in the training examples, and therefore some nodes have to be mean-filled during inference. At the same time, the drop in performance between transductive and inductive modes is not significant, and even in inductive mode, the RGCN model achieves good performance with an AUC of 0.876. We also observe that model performance doesn’t improve for values of k>2. This implies that setting k=2 would extract a sufficiently large subgraph during inference, resulting in optimal performance. This observation is also confirmed by our next experiment.

It’s also worth noting that, for transductive mode, our model’s AUC of 0.883 is higher than the corresponding AUC of 0.870 reported in the previous post. We use more columns as numerical and categorical features of target nodes, which can explain the higher AUC score. We also note that the experiments in the previous post only performed a single trial.

Inference on a small batch

For this experiment, we evaluate the RGCN model in a small batch inference setting. We use five models that were trained in inductive mode in the previous experiment. We compare performance of these models when predicting in two settings: full and small batch inference. For full batch inference, we predict on the entire test set, as was done in the previous experiment. For small batch inference, we predict in small batches by partitioning the test set into 28 batches of equal size with approximately 1,000 transactions in each batch. We compute AUC scores for both settings using different values of k. The following plot shows the mean and 95% confidence intervals for full and small batch inference settings.

Inductive model performance for full-batch vs small-batch

We observe that performance for small batch inference when k=1 is lower than for full batch. However, small batch inference performance matches full batch when k>1. This can be attributed to much smaller subgraphs being extracted for small batches. We confirm this by comparing subgraph sizes with the size of the entire graph constructed from the training transactions. We compare graph sizes in terms of number of nodes. For k=1, the average subgraph size for small batch inference is less than 2% of the training graph. And for full batch inference when k=1, the subgraph size is 22%. When k=2, subgraph sizes for small and full batch inference are 54% and 64%, respectively. Finally, subgraph sizes for both inference settings reach 100% for k=3. In other words, when k>1, the subgraph for a small batch becomes sufficiently large, enabling small batch inference to reach the same performance as full batch inference.

We also record prediction latency for every batch. We perform our experiments on an ml.r5.12xlarge instance, but you can use a smaller instance with 64 G memory to run the same experiments. The following plot shows the mean and 95% confidence intervals of small batch prediction latencies for different values of k.

Timing results for inductive small-batch

The latency includes all five steps of the real-time inductive inference algorithm. We see that when k=2, predicting on 1,030 transactions takes 5.4 seconds on average, resulting in a throughput of 190 transactions per second. This confirms that the RGCN model implementation is suitable for real-time fraud detection. We also note that the previous post did not provide hard latency values for their implementation.

Conclusion

The RGCN model released with this post implements the algorithm for real-time inductive inference, and doesn’t require external graph storage or orchestration. The parameter k in Step 3 of the algorithm specifies the number of hops performed to extract the subgraph for inference, and results in a trade-off between model accuracy and prediction latency. We used the IEEE-CIS fraud dataset in our experiments, and empirically validated that the optimal value of parameter k for this dataset is 2, achieving an AUC score of 0.876 and prediction latency of less than 6 seconds per 1,000 transactions.

This post provided a step-by-step process for training and evaluating an RGCN model for real-time fraud detection. The included model class implements methods for the entire model lifecycle, including serialization and deserialization methods. This enables the model to be used for real-time fraud detection. You can train the model as a PyTorch SageMaker estimator and then deploy it to a SageMaker endpoint by using the following notebook as a template. The endpoint is able to predict fraud on small batches of raw transactions in real time. You can also use Amazon SageMaker Inference Recommender to select the best instance type and configuration for the inference endpoint based on your workloads.

For more information about this topic and implementation, we encourage you to explore and test our scripts on your own. You can access the notebooks and related model class code from the AWS Examples GitHub repo.


About the Authors

Dmitriy BespalovDmitriy Bespalov is a Senior Applied Scientist at the Amazon Machine Learning Solutions Lab, where he helps AWS customers across different industries accelerate their AI and cloud adoption.

Ryan BrandRyan Brand is an Applied Scientist at the Amazon Machine Learning Solutions Lab. He has specific experience in applying machine learning to problems in healthcare and life sciences. In his free time, he enjoys reading history and science fiction.

Yanjun QiYanjun Qi is a Senior Applied Science Manager at the Amazon Machine Learning Solution Lab. She innovates and applies machine learning to help AWS customers speed up their AI and cloud adoption.

Read More

Tune ML models for additional objectives like fairness with SageMaker Automatic Model Tuning

Tune ML models for additional objectives like fairness with SageMaker Automatic Model Tuning

Model tuning is the experimental process of finding the optimal parameters and configurations for a machine learning (ML) model that result in the best possible desired outcome with a validation dataset. Single objective optimization with a performance metric is the most common approach for tuning ML models. However, in addition to predictive performance, there may be multiple objectives which need to be considered for certain applications. For example,

  1. Fairness – The aim here is to encourage models to mitigate bias in model outcomes between certain sub-groups in the data, especially when humans are subject to algorithmic decisions. For example, a credit lending application should not only be accurate but also unbiased to different population sub-groups.
  2. Inference time – The aim here is to reduce the inference time during model invocation. For example, a speech recognition system must not only understand different dialects of the same language accurately, but also operate within a specified time limit that is acceptable by the business process.
  3. Energy efficiency – The aim here is to train smaller energy-efficient models. For example, neural network models are compressed for usage on mobile devices and thus naturally reduce their energy consumption by reducing the number of FLOPS required for a pass through the network.

Multi-objective optimization methods represent different trade-offs between the desired metrics. This can involve finding a global minimum of an objective function subject to a set of constraints on different metrics being simultaneously satisfied.

Amazon SageMaker Automatic Model Tuning (AMT) finds the best version of a model by running many SageMaker training jobs on your dataset using the algorithm and ranges of hyperparameters. It then chooses the hyperparameter values that result in a model that performs the best, as measured by a metric (e.g., accuracy, auc, recall) that you define. With Amazon SageMaker automatic model tuning, you can find the best version of your model by running training jobs on your dataset with several search strategies, such as Bayesian, Random search, Grid search, and Hyperband.

Amazon SageMaker Clarify can detect potential bias during data preparation, after model training, and in your deployed model. Currently, it offers 21 different metrics to choose from. These metrics are also openly available with the smclarify python package and github repository here. You can learn more about measuring bias with metrics from Amazon SageMaker Clarify at Learn how Amazon SageMaker Clarify helps detect bias.

In this blog we show you how to automatically tune a ML model with Amazon SageMaker AMT for both accuracy and fairness objectives by creating a single combined metric. We demonstrate a financial services use case of credit risk prediction with an accuracy metric of Area Under the Curve (AUC) to measure performance and a bias metric of Difference in Positive Proportions in Predicted Labels (DPPL) from SageMaker Clarify to measure the imbalance in model predictions for different demographic groups. The code for this example is available on GitHub.

Fairness in Credit Risk prediction

The credit lending industry relies heavily on credit scores for processing loan applications. Generally, credit scores reflect an applicant’s history of borrowing and paying back money, and lenders refer to them when determining an individual’s creditworthiness. Payment firms and banks are interested to build systems that can help identify the risk associated with a particular application and provide competitive credit products. Machine learning (ML) models can be used to build such a system that processes historical applicant data and predicts the credit risk profile. Data can include financial and employment history of the applicant, their demographics, and the new credit/loan context. There is always some statistical uncertainty with any model that predicts whether a particular applicant will default in the future. The systems need to provide a tradeoff between rejecting applications that might default over time and accepting applications that are eventually creditworthy.

Business owners of such a system need to ensure the validity and quality of the models as per existing and upcoming regulatory compliance requirements. They are obligated to treat customers fairly and provide transparency in their decision making. They might want to ensure that positive model predictions are not imbalanced across different groups (for example, gender, race, ethnicity, immigration status and others). Once the required data is collected, the ML model training typically optimizes for prediction performance as a primary objective with a metric like classification accuracy or AUC score. Alternatively, a model with a given performance objective can be constrained with a fairness metric to ensure certain requirements are maintained. One such technique to constrain the model is fairness-aware hyperparameter tuning. By applying these strategies, the best candidate model can have lower bias than the unconstrained model while maintaining a high predictive performance.

Scenario - Credit Risk

In the scenario depicted in this schematic,

  1. The ML model is built with historical customer credit profile data. The model training and hyperparameter tuning process maximizes for multiple objectives including classification accuracy and fairness. The model is deployed to an existing business process in a production system.
  2. A new customer credit profile is evaluated for credit risk. If low risk, it can go through an automated process. High risk applications could include human review before a final acceptance or rejection decision.

The decisions and metrics gathered during design and development, deployment and operations can be documented with SageMaker Model Cards and shared with the stakeholders.

This use case demonstrates how to reduce model bias against a specific group by fine tuning hyperparameters for a combined objective metric of both accuracy and fairness with SageMaker Automatic Model Tuning. We use the South German Credit dataset (South German Credit Data Set) .

The applicant data can be split into following categories:

  1. Demographic
  2. Financial Data
  3. Employment History
  4. Loan purpose

Credit Risk Data Structure

In this example, we specifically look at the ‘Foreign worker’ demographic and tune a model that predicts credit application decisions with high accuracy and low bias against that particular subgroup.

There are various bias metrics that can be used to evaluate fairness of the system with respect to specific sub-groups in the data. Here, we use the absolute value of Difference in Positive Proportions in Predicted Labels (DPPL) from SageMaker Clarify. In simple terms, DPPL measures the difference in positive class (good credit) assignments between non-foreign workers and foreign workers.

For example, if 4.5% of all foreign workers are assigned the positive label by the model, and 13.7% of all non-foreign workers are assigned the positive label, then DPPL = 0.137 – 0.045 = 0.092.

Solution Architecture

The figure below displays a high level overview of the architecture of an Automatic Model Tuning job with XGBoost on Amazon SageMaker.

High Level Solution Architecture

In the solution, SageMaker Processing preprocesses the training dataset from Amazon S3. Amazon SageMaker Automatic Tuning instantiates multiple SageMaker training jobs with their associated EC2 instances and EBS volumes. The container for the algorithm (XGBoost) is loaded from Amazon ECR in each job. SageMaker AMT finds the best version of a model by running many training jobs on the preprocessed dataset using the specified algorithm script and range of hyperparameters. The output metrics are logged in Amazon CloudWatch for monitoring.

The hyperparameters we are tuning in this use case are as follows:

  1. eta – Step size shrinkage used in updates to prevent overfitting.
  2. min_child_weight – Minimum sum of instance weight (hessian) needed in a child.
  3. gamma – Minimum loss reduction required to make a further partition on a leaf node of the tree.
  4. max_depth – Maximum depth of a tree.

The definition of these hyperparameters and others available with SageMaker AMT can be found here.

First, we demonstrate a baseline scenario of a single performance objective metric for tuning hyperparameters with Automatic Model Tuning. Then, we demonstrate the optimized scenario of a multi-objective metric specified as a combination of performance metric and fairness metric.

Single Metric Hyperparameter Tuning (Baseline)

There is a choice of multiple metrics for a tuning job to evaluate the individual training jobs. As per the code snippet below, we specify the single objective metric as  objective_metric_name. The hyperparameter tuning job returns the training job that gave the best value for the chosen objective metric.

In this baseline scenario, we are tuning for Area Under Curve (AUC) as seen below. It is important to note that we are only optimizing AUC, and not optimizing for other metrics such as fairness.

from sagemaker.tuner import IntegerParameter, CategoricalParameter, ContinuousParameter, HyperparameterTuner

hyperparameter_ranges = {'eta': ContinuousParameter(0, 1),
'min_child_weight': IntegerParameter(1, 10),
'gamma': IntegerParameter(1, 5),
'max_depth': IntegerParameter(1, 10)}

objective_metric_name = 'validation:auc'

tuner = HyperparameterTuner(estimator,
objective_metric_name,
hyperparameter_ranges,
max_jobs=100,
max_parallel_jobs=10,
)

tuning_job_name = "xgb-tuner-{}".format(strftime("%d-%H-%M-%S", gmtime()))
inputs = {'train': train_data_path, 'validation': val_data_path}
tuner.fit(inputs, job_name=tuning_job_name)
tuner.wait()
tuner_metrics = sagemaker.HyperparameterTuningJobAnalytics(tuning_job_name)

In this context max jobs allows us to specify how many times a single training job will be tuned, and finding the best training job from there.

Multi Objective Hyperparameter Tuning (Fairness Optimized)

We want to optimize multiple objective metrics with hyperparameter tuning as described in this paper. However, SageMaker AMT still accepts only a single metric as input.

To address this challenge, we express multiple metrics as a single metric function and optimize this metric:

  • maxM(y1​,y2​,θ)
  • y1​,y2​ are different metrics. For example AUC score and DPPL.
  • M(⋅,⋅,θ)is a scalarization function and is parameterized by a fixed parameter

Higher weight favors that particular objective in model tuning. Weights may wary from case to case and you might need to try different weights for your use case. In this example, weights for AUC and DPPL have been set heuristically. Let’s walk through how this would look like in code. You can see the training job returning a single metric based on a combination function of AUC Score for performance and DPPL for fairness. The hyperparameter optimization ranges for multiple objectives are the same as the single objective. We are passing the validation metric as “auc” but behind the scenes we are returning the results of the combined metric function as described last in the list of functions below:

Here is the function Multi Objective optimization:

objective_metric_name = 'validation:auc'
tuner = HyperparameterTuner(estimator,
                            objective_metric_name,
                            hyperparameter_ranges,
                            max_jobs=100,
                            max_parallel_jobs=10
                           )

Here is the function for computing AUC score:

def eval_auc_score(predt, dtrain):
   fY = [1 if p > 0.5 else 0 for p in predt]
   y = dtrain.get_label()
   auc_score = roc_auc_score(y, fY)
   return auc_score

Here is the function for computing DPPL score:

def eval_dppl(predt, dtrain):
    dtrain_np = dmatrix_to_numpy(dtrain)
    # groups: an np array containing 1 or 2
    groups = dtrain_np[:, -1]
    # sensitive_facet_index: boolean column indicating sensitive group
    sensitive_facet_index = pd.Series(groups - 1, dtype=bool)
    # positive_predicted_label_index: boolean column indicating positive predicted labels
    positive_label_index = pd.Series(predt > 0.5)
    return abs(DPPL(predt, sensitive_facet_index, positive_label_index))

Here is the function for the Combined Metric:

def eval_combined_metric(predt, dtrain):
    auc_score = eval_auc_score(predt, dtrain)
    DPPL = eval_dppl(predt, dtrain)
    # Assign weight of 3 to AUC and 1 to DPPL
    # Maximize (1-DPPL) for the purpose of minimizing DPPL 
    combined_metric = ((3*auc_score)+(1-DPPL))/4       
    print("DPPL, AUC Score, Combined Metric: ", DPPL, auc_score, combined_metric)
    return "auc", combined_metric

Experiments & Results

Synthetic data generation for bias dataset

The original South German Credit dataset contained 1000 records, and we generated 100 more records synthetically to create a dataset where the bias in model predictions disfavors Foreign Workers. This is done to simulate bias that could manifest itself in the real world. New records of foreign workers labeled as “bad credit” applicants were extrapolated from existing foreign workers with the same label.

There are many libraries/techniques to create synthetic data and we use Synthetic Data Vault (DPPLV).

From the following code snippet we can see how synthetic data is generated with DPPLV with the South German Credit Data Set:

# Parameters for generated data
# How many rows of synthetic data
num_rows = 100

# Select all foreign workers who were accepted (foreign_worker value 1 credit_risk 1)
ForeignWorkerData = training_data.loc[(training_data['foreign_worker'] == 1) & (training_data['credit_risk'] == 1)]

# Fit Foreign Worker data to SDV model
model = GaussianCopula()
model.fit(ForeignWorkerData)

# Generate Synthetic foreign worker data based on rows stated
SynthForeignWorkers = model.sample(Rows)

We generated 100 new synthetic records of Foreign Workers based on Foreign Workers who were accepted in the original dataset. We will now take those records and convert the “credit_risk” label to 0 (bad credit). This will mark these Foreign Workers unfairly as bad credit, hence inserting bias into our dataset

SynthForeignWorkers.loc[SynthForeignWorkers['credit_risk'] == 1, 'credit_risk'] = 0

We explore the bias in the dataset through the graphs below.
Credit Risk Ratio for Foreign Workers

The pie graph on top shows the percentage of Non-Foreign Workers labelled as good credit or bad credit, and the bottom pie graph shows the same for Foreign Workers. The percentage of Foreign workers labeled as  “bad credit” is 75.90% and far outweigh the 30.70% of Non-Foreign workers labeled the same. The stack bar displays the almost similar percentage breakdown of total workers across the category of Foreign & Non-Foreign workers.

We want to avoid the ML model from learning a strong bias against Foreign Workers either through explicit features or implicit proxy features in the data. With an additional fairness objective, we guide the ML model to mitigate the bias of lower creditworthiness towards Foreign Workers.

Model performance after tuning for both performance and fairness

This chart depicts the density plot of up to 100 tuning jobs run by SageMaker AMT and their corresponding combined objective metric values. Although we have set max jobs to 100, it is changeable under the discretion of the user. The combined metric was a combination of AUC and DPPL with a function of: (3*AUC + (1-DPPL)) / 4. The reason that we use (1-DPPL) instead of (DPPL) is because we would like to maximize the combined objective for the lowest DPPL possible (lower DPPL means lower bias against foreign workers). The plot shows how AMT helps identify the best hyperparameters for the XGBoost model that returns the highest combined evaluation metric value of 0.68.

Model performance with combined metric
Model Performance with Combined Metrics

Below we take a look at the pareto front chart for the individual metrics of AUC and DPPL. A Pareto Front chart is used here to visually represent the trade-offs between multiple objectives, in this case the two metric values (AUC & DPPL). Points on the curve front are considered as equally good and one metric cannot be improved without degrading the other. The Pareto chart allows us to see how different jobs performed against the baseline (red circle) in terms of both metrics. It also shows us the most optimal job (green triangle). The position of the red circle and green triangle are important because it allows us to understand if our combined metric is actually performing as expected and truly optimizing for both metrics. The code to generate the pareto front chart is included in the notebook in GitHub.

Pareto Chart

In this scenario, a lower DPPL value is more desirable (less bias), while higher AUC is better (increased performance).

Here, the baseline (red circle) represents the scenario where the objective metric is AUC alone. In other words, the baseline does not consider DPPL at all and optimizes only for AUC (no fine tuning for fairness). We see the baseline has a good AUC score of 0.74, but does not perform well on fairness with a DPPL score of 0.75.

The Optimized model (green triangle) represents the best candidate model when fine-tuned for a combined metric with weight ratio of 3:1 for AUC:DPPL. We see the optimized model has a good AUC score of 0.72, and also a low DPPL score of 0.43 (low bias). This tuning job found a model configuration where DPPL can be significantly lower than the baseline, without a significant drop in AUC.  Models with even lower DPPL scores can be identified by moving the green triangle further left along the Pareto Front. We thus achieved the combined objective of a well performing model with fairness for Foreign Worker sub-groups.

In the chart below, we can see the results of the predictions from the baseline model and the optimized model. The optimized model with a combined objective of performance and fairness predicts a positive outcome for 30.6% Foreign Workers as opposed to the 13.9% from the baseline model. The optimized model thus reduces the model bias against this sub-group.

Percentage of Accepted Workers

Conclusion

The blog shows you to implement multi-objective optimization with SageMaker Automatic Model Tuning for real-world applications. In many instances, collected data in the real world may be biased against certain subgroups. Multi-objective optimization using automatic model tuning enables customers to build ML models easily that optimize fairness in addition to accuracy. We demonstrate an example of credit risk prediction and specifically look at fairness for foreign workers. We show that it is possible to maximize for another metric like fairness while continuing to train models with high performance. If what you have read has piqued your interest you may try out the code example hosted in Github here.


About the authors

Munish Dabra is a Senior Solutions Architect at Amazon Web Services (AWS). His current areas of focus are AI/ML, Data Analytics and Observability. He has a strong background in designing and building scalable distributed systems. He enjoys helping customers innovate and transform their business in AWS. LinkedIn: /mdabra

Hasan Poonawala is a Senior AI/ML Specialist Solutions Architect at AWS, Hasan helps customers design and deploy machine learning applications in production on AWS. He has over 12 years of work experience as a data scientist, machine learning practitioner, and software developer. In his spare time, Hasan loves to explore nature and spend time with friends and family.

Mohammad (Moh) Tahsin is an associate AI/ML Specialist Solutions Architect for AWS. Moh has experience teaching students about responsible AI concepts, and is passionate about conveying these concepts through cloud based architectures. In his spare time he loves to lift weights, play games, and explore nature.

Xingchen Ma is an Applied Scientist at AWS. He works in service team for SageMaker Automatic Model Tuning.

Rahul Sureka is an Enterprise Solution Architect at AWS based out of India. Rahul has more than 22 years of experience in architecting and leading large business transformation programs across multiple industry segments. His areas of interests are data and analytics, streaming, and AI/ML applications.

Read More

Accelerate disaster response with computer vision for satellite imagery using Amazon SageMaker and Amazon Augmented AI

Accelerate disaster response with computer vision for satellite imagery using Amazon SageMaker and Amazon Augmented AI

Flooding Image

In recent years, advances in computer vision have enabled researchers, first responders, and governments to tackle the challenging problem of processing global satellite imagery to understand our planet and our impact on it. AWS recently released Amazon SageMaker geospatial capabilities to provide you with satellite imagery and geospatial state-of-the-art machine learning (ML) models, reducing barriers for these types of use cases. For more information, refer to Preview: Use Amazon SageMaker to Build, Train, and Deploy ML Models Using Geospatial Data.

Many agencies, including first responders, are using these offerings to gain large-scale situational awareness and prioritize relief efforts in geographical areas that have been struck by natural disasters. Often these agencies are dealing with disaster imagery from low altitude and satellite sources, and this data is often unlabeled and difficult to use. State-of-the-art computer vision models often underperform when looking at satellite images of a city that a hurricane or wildfire has hit. Given the lack of these datasets, even state-of-the-art ML models are often unable to deliver the accuracy and precision required to predict standard FEMA disaster classifications.

Geospatial datasets contain useful metadata such as latitude and longitude coordinates, and timestamps, which can provide context for these images. This is especially helpful in improving the accuracy of geospatial ML for disaster scenes, because these images are inherently messy and chaotic. Buildings are less rectangular, vegetation has sustained damage, and linear roads have been interrupted by flooding or mudslides. Because labeling these massive datasets is expensive, manual, and time-consuming, the development of ML models that can automate image labeling and annotation is critical.

To train this model, we need a labeled ground truth subset of the Low Altitude Disaster Imagery (LADI) dataset. This dataset consists of human and machine annotated airborne images collected by the Civil Air Patrol in support of various disaster responses from 2015-2019. These LADI datasets focus on the Atlantic hurricane seasons and coastal states along the Atlantic Ocean and Gulf of Mexico. Two key distinctions are the low altitude, oblique perspective of the imagery and disaster-related features, which are rarely featured in computer vision benchmarks and datasets. The teams used existing FEMA categories for damage such as flooding, debris, fire and smoke, or landslides, which standardized the label categories. The solution is then able to make predictions on the rest of the training data, and route lower-confidence results for human review.

In this post, we describe our design and implementation of the solution, best practices, and the key components of the system architecture.

Solution overview

In brief, the solution involved building three pipelines:

  • Data pipeline – Extracts the metadata of the images
  • Machine learning pipeline – Classifies and labels images
  • Human-in-the-loop review pipeline – Uses a human team to review results

The following diagram illustrates the solution architecture.

Solution Architecture

Given the nature of a labeling system like this, we designed a horizontally scalable architecture that would handle ingestion spikes without over-provisioning by using a serverless architecture. We use a one-to-many pattern from Amazon Simple Queue Service (Amazon SQS) to AWS Lambda in multiple spots to support these ingestion spikes, offering resiliency.

Using an SQS queue for processing Amazon Simple Storage Service (Amazon S3) events helps us control the concurrency of downstream processing (Lambda functions, in this case) and handle the incoming spikes of data. Queuing incoming messages also acts as a buffer storage in case of any failures downstream.

Given the highly parallel needs, we chose Lambda to process our images. Lambda is a serverless compute service that lets us run code without provisioning or managing servers, creating workload-aware cluster scaling logic, maintaining event integrations, and managing runtimes.

We use Amazon OpenSearch Service as our central data store to take advantage of its highly scalable, fast searches and integrated visualization tool, OpenSearch Dashboards. It enables us to iteratively add context to the image, without having to recompile or rescale, and handle schema evolution.

Amazon Rekognition makes it easy to add image and video analysis into our applications, using proven, highly scalable, deep learning technology. With Amazon Rekognition, we get a good baseline of detected objects.

In the following sections, we dive into each pipeline in more detail.

Data pipeline

The following diagram shows the workflow of the data pipeline.

Data Pipeline

The LADI data pipeline starts with the ingestion of raw data images from the FEMA Common Alerting Protocol (CAP) into an S3 bucket. As we ingest the images into the raw data bucket, they are processed in near-real time in two steps:

  1. The S3 bucket triggers event notifications for all object creations, creating messages in the SQS queue for each image ingested.
  2. The SQS queue concurrently invokes the preprocessing Lambda functions on the image.

The Lambda functions perform the following preprocessing steps:

  1. Compute the UUID for each image, providing a unique identifier for each image. This ID will identify the image for its entire lifecycle.
  2. Extract metadata such as GPS coordinates, image size, GIS information, and S3 location from the image and persist it into OpenSearch.
  3. Based on a lookup against FIPS codes, the function moves the image into the curated data S3 bucket. We partition the data by the image’s FIPS-State-code/FIPS-County-code/Year/Month.

Machine learning pipeline

The ML pipeline starts from the images landing in the curated data S3 bucket in the data pipeline step, which triggers the following steps:

  1. Amazon S3 generates a message into another SQS queue for each object created in the curated data S3 bucket.
  2. The SQS queue concurrently triggers Lambda functions to run the ML inference job on the image.

The Lambda functions perform the following actions:

  1. Send each image to Amazon Rekognition for object detection, storing the returned labels and respective confidence scores.
  2. Compose the Amazon Rekognition output into input parameters for our Amazon SageMaker multi-model endpoint. This endpoint hosts our ensemble of classifiers, which are trained for specific sets of damage labels.
  3. Pass the results of the SageMaker endpoint to Amazon Augmented AI (Amazon A2I).

The following diagram illustrates the pipeline workflow.

Machine Learning Pipeline

Human-in-the-loop review pipeline

The following diagram illustrates the human-in-the-loop (HIL) pipeline.

HIL pipeline

With Amazon A2I, we can configure thresholds that will trigger a human review by a private team when a model yields a low confidence prediction. We can also use Amazon A2I to provide an ongoing audit of our model’s predictions. The workflow steps are as follows:

  1. Amazon A2I routes high confidence predictions into OpenSearch Service, updating the image’s label data.
  2. Amazon A2I routes low confidence predictions to the private team to annotate images manually.
  3. The human reviewer completes the annotation, generating a human annotation output file that is stored in the HIL Output S3 bucket.
  4. The HIL Output S3 bucket triggers a Lambda function that parses the human annotations output and updates the image’s data in OpenSearch Service.

By routing the human annotation results back to the data store, we can retrain the ensemble models and iteratively improve the model’s accuracy.

With our high-quality results now stored in OpenSearch Service, we’re able to perform geospatial and temporal search via a REST API, using Amazon API Gateway and Geoserver. OpenSearch Dashboard also enables users to search and run analytics with this dataset.

Results

The following code shows an example of our results.

Output snapshot

With this new pipeline, we create a human backstop for models that are not yet fully performant. This new ML pipeline has been put into production for use with a Civil Air Patrol Image Filter microservice that allows for filtering of Civil Air Patrol images in Puerto Rico. This enables first responders to view the extent of damage and view images associated with that damage following hurricanes. The AWS Data Lab, AWS Open Data Program, Amazon Disaster Response team, and AWS human-in-the-loop team worked with customers to develop an open-source pipeline that can be used to analyze Civil Air Patrol data stored in the Open Data Program registry on demand following any natural disaster. For more information about the pipeline architecture and an overview of the collaboration and impact, check out the video Focusing on disaster response with Amazon Augmented AI, the AWS Open Data Program and AWS Snowball.

Conclusion

As climate change continues to increase the frequency and intensity of storms and wildfires, we continue to see the importance of using ML to understand the impact of these events on local communities. These new tools can accelerate disaster response efforts and allow us to use the data from these post-event analyses to improve the prediction accuracy of these models with active learning. These new ML models can automate data annotation, which enables us to infer the extent of damage from each of these events as we overlay damage labels with map data. That cumulative data can also help improve our ability to predict damage for future disaster events, which can inform mitigation strategies. This in turn can improve the resilience of communities, economies, and ecosystems by giving decision-makers the information they need to develop data-driven polices to address these emerging environmental threats.

In this blog post we discussed using computer vision on satellite imagery. This solution is meant to be a reference architecture or a quick start guide that you can customize for your own needs.

Give it a whirl and let us know how this solved your use case by leaving feedback in the comments section. For more information, see Amazon SageMaker geospatial capabilities.


About the Authors

Vamshi Krishna Enabothala is a Sr. Applied AI Specialist Architect at AWS. He works with customers from different sectors to accelerate high-impact data, analytics, and machine learning initiatives. He is passionate about recommendation systems, NLP, and computer vision areas in AI and ML. Outside of work, Vamshi is an RC enthusiast, building RC equipment (planes, cars, and drones), and also enjoys gardening.

Morgan Dutton is a Senior Technical Program Manager with the Amazon Augmented AI and Amazon SageMaker Ground Truth team. She works with enterprise, academic and public sector customers to accelerate adoption of machine learning and human-in-the-loop ML services.

Sandeep Verma is a Sr. Prototyping Architect with AWS. He enjoys diving deep into customer challenges and building prototypes for customers to accelerate innovation. He has a background in AI/ML, founder of New Knowledge, and generally passionate about tech. In his free time, he loves traveling and skiing with his family.

Read More

Achieve high performance at scale for model serving using Amazon SageMaker multi-model endpoints with GPU

Achieve high performance at scale for model serving using Amazon SageMaker multi-model endpoints with GPU

Amazon SageMaker multi-model endpoints (MMEs) provide a scalable and cost-effective way to deploy a large number of machine learning (ML) models. It gives you the ability to deploy multiple ML models in a single serving container behind a single endpoint. From there, SageMaker manages loading and unloading the models and scaling resources on your behalf based on your traffic patterns. You will benefit from sharing and reusing hosting resources and a reduced operational burden of managing a large quantity of models.

In November 2022, MMEs added support for GPUs, which allows you to run multiple models on a single GPU device and scale GPU instances behind a single endpoint. This satisfies the strong MME demand for deep neural network (DNN) models that benefit from accelerated compute with GPUs. These include computer vision (CV), natural language processing (NLP), and generative AI models. The reasons for the demand include the following:

  • DNN models are typically large in size and complexity and continue growing at a rapid pace. Taking NLP models as an example, many of them exceed billions of parameters, which requires GPUs to satisfy low latency and high throughput requirements.
  • We have observed an increased need for customizing these models to deliver hyper-personalized experiences to individual users. As the quantity of these models increases, there is a need for an easier solution to deploy and operationalize many models at scale.
  • GPU instances are expensive and you want to reuse these instances as much as possible to maximize the GPU utilization and reduce operating cost.

Although all these reasons point to MMEs with GPU as an ideal option for DNN models, it’s advised to perform load testing to find the right endpoint configuration that satisfies your use case requirements. Many factors can influence the load testing results, such as instance type, number of instances, model size, and model architecture. In addition, load testing can help guide the auto scaling strategies using the right metrics rather than iterative trial and error methods.

For those reasons, we put together this post to help you perform proper load testing on MMEs with GPU and find the best configuration for your ML use case. We share our load testing results for some of the most popular DNN models in NLP and CV hosted using MMEs on different instance types. We summarize the insights and conclusion from our test results to help you make an informed decision on configuring your own deployments. Along the way, we also share our recommended approach to performing load testing for MMEs on GPU. The tools and technique recommended determine the optimum number of models that can be loaded per instance type and help you achieve the best price-performance.

Solution overview

For an introduction to MMEs and MMEs with GPU, refer to Create a Multi-Model Endpoint and Run multiple deep learning models on GPU with Amazon SageMaker multi-model endpoints. For the context of load testing in this post, you can download our sample code from the GitHub repo to reproduce the results or use it as a template to benchmark your own models. There are two notebooks provided in the repo: one for load testing CV models and another for NLP. Several models of varying sizes and architectures were benchmarked on different type of GPU instances: ml.g4dn.2xlarge, ml.g5.2xlarge, and ml.p3.2xlarge. This should provide a reasonable cross section of performance across the following metrics for each instance and model type:

  • Max number of models that can be loaded into GPU memory
  • End-to-end response latency observed on the client side for each inference query
  • Max throughput of queries per second that the endpoint can process without error
  • Max current users per instances before a failed request is observed

The following table lists the models tested.

Use Case Model Name Size On Disk Number of Parameters
CV resnet50 100Mb 25M
CV convnext_base 352Mb 88M
CV vit_large_patch16_224 1.2Gb 304M
NLP bert-base-uncased 436Mb 109M
NLP roberta-large 1.3Gb 335M

The following table lists the GPU instances tested.

Instance Type GPU Type Num of GPUs GPU Memory (GiB)
ml.g4dn.2xlarge NVIDIA T4 GPUs 1 16
ml.g5.2xlarge NVIDIA A10G Tensor Core GPU 1 24
ml.p3.2xlarge NVIDIA® V100 Tensor Core GPU 1 16

As previously mentioned, the code example can be adopted to other models and instance types.

Note that MMEs currently only support single GPU instances. For the list of supported instance types, refer to Supported algorithms, frameworks, and instances.

The benchmarking procedure is comprised of the following steps:

  1. Retrieve a pre-trained model from a model hub.
  2. Prepare the model artifact for serving on SageMaker MMEs (see Run multiple deep learning models on GPU with Amazon SageMaker multi-model endpoints for more details).
  3. Deploy a SageMaker MME on a GPU instance.
  4. Determine the maximum number of models that can be loaded into the GPU memory within a specified threshold.
  5. Use the Locust Load Testing Framework to simulate traffic that randomly invokes models loaded on the instance.
  6. Collect data and analyze the results.
  7. Optionally, repeat Steps 2–6 after compiling the model to TensorRT.

Steps 4 and 5 warrant a deeper look. Models within a SageMaker GPU MME are loaded into memory in a dynamic fashion. Therefore, in Step 4, we upload an initial model artifact to Amazon Simple Storage Service (Amazon S3) and invoke the model to load it into memory. After the initial invocation, we measure the amount of GPU memory consumed, make a copy of the initial model, invoke the copy of the model to load it into memory, and again measure the total amount of GPU memory consumed. This process is repeated until a specified percent threshold of GPU memory utilization is reached. For the benchmark, we set the threshold to 90% to provide a reasonable memory buffer for inferencing on larger batches or leaving some space to load other less-frequently used models.

Simulate user traffic

After we have determined the number of models, we can run a load test using the Locust Load Testing Framework. The load test simulates user requests to random models and automatically measures metrics such as response latency and throughput.

Locust supports custom load test shapes that allow you to define custom traffic patterns. The shape that was used in this benchmark is shown in the following chart. In the first 30 seconds, the endpoint is warmed up with 10 concurrent users. After 30 seconds, new users are spawned at a rate of two per second, reaching 20 concurrent users at the 40-second mark. The endpoint is then benchmarked steadily with 20 concurrent users until the 60-second mark, at which point Locust again begins to ramp up users at two per second until 40 concurrent users. This pattern of ramping up and steady testing is repeated until the endpoint is ramped up to 200 concurrent users. Depending on your use case, you may want to adjust the load test shape in the locust_benchmark_sm.py to more accurately reflect your expected traffic patterns. For example, if you intend to host larger language models, a load test with 200 concurrent users may not be feasible for a model hosted on a single instance, and you may therefore want to reduce the user count or increase the number of instances. You may also want to extend the duration of the load test to more accurately gauge the endpoint’s stability over a longer period of time.

stages = [
{"duration": 30, "users": 10, "spawn_rate": 5},
{"duration": 60, "users": 20, "spawn_rate": 1},
{"duration": 90, "users": 40, "spawn_rate": 2},
…
]

Note that we have only benchmarked the endpoint with homogeneous models all running on a consistent serving bases using either PyTorch or TensorRT. This is because MMEs are best suited for hosting many models with similar characteristics, such as memory consumption and response time. The benchmarking templates provided in the GitHub repo can still be used to determine whether serving heterogeneous models on MMEs would yield the desired performance and stability.

Benchmark results for CV models

Use the cv-benchmark.ipynb notebook to run load testing for computer vision models. You can adjust the pre-trained model name and instance type parameters to performance load testing on different model and instance type combinations. We purposely tested three CV models in different size ranges from smallest to largest: resnet50 (25M), convnext_base (88M), and vit_large_patch16_224 (304M). You may need to adjust to code if you pick a model outside of this list. additionally, the notebook defaults the input image shape to a 224x224x3 image tensor. Remember to adjust the input shape accordingly if you need to benchmark models that take a different-sized image.

After running through the entire notebook, you will get several performance analysis visualizations. The first two detail the model performance with respect to increasing concurrent users. The following figures are the example visualizations generated for the ResNet50 model running on ml.g4dn.2xlarge, comparing PyTorch (left) vs. TensorRT (right). The top line graphs show the model latency and throughput on the y-axis with increasing numbers of concurrent client workers reflected on the x-axis. The bottom bar charts show the count of successful and failed requests.

Looking across all the computer vision models we tested, we observed the following:

  • Latency (in milliseconds) is higher, and throughput (requests per second) is lower for bigger models (resnet50 > convnext_base > vit_large_patch16_224).
  • Latency increase is proportional with the number of users as more requests are queued up on the inference server.
  • Large models consume more compute resources and can reach their maximum throughput limits with fewer users than a smaller model. This is observed with the vit_large_patch16_224 model, which recorded the first failed request at 140 concurrent users. Being significantly larger than the other two models tested, it had the most overall failed requests at higher concurrency as well. This is a clear signal that the endpoint would need to scale beyond a single instance if the intent is to support more than 140 concurrent users.

At the end of the notebook run, you also get a summary comparison of PyTorch vs. TensorRT models for each of the four key metrics. From our benchmark testing, the CV models all saw a boost in model performance after TensorRT compilation. Taking our ResNet50 model as the example again, latency decreased by 32% while throughput increased by 18%. Although the maximum number of concurrent users stayed the same for ResNet50, the other two models both saw a 14% improvement in the number of concurrent users that they can support. The TensorRT performance improvement, however, came at the expense of higher memory utilization, resulting in fewer models loaded by MMEs. The impact is more for models using a convolutional neural network (CNN). In fact, our ResNet50 model consumed approximately twice the GPU memory going from PyTorch to TensorRT, resulting in 50% fewer models loaded (46 vs. 23). We diagnose this behavior further in the following section.

Benchmark results for NLP models

For the NLP models, use the nlp-benchmark.ipynb notebook to run the load test. The setup of the notebook should look very similar. We tested two NLP models: bert-base-uncased (109M) and roberta-large (335M). The pre-trained model and the tokenizer are both downloaded from the Hugging Face hub, and the test payload is generated from the tokenizer using a sample string. Max sequence length is defaulted at 128. If you need to test longer strings, remember to adjust that parameter. Running through the NLP notebook generates the same set of visualizations: Pytorch (left) vs TensorRT (right).


From these, we observed even more performance benefit of TensorRT for NLP models. Taking the roberta-large model on an ml.g4dn.2xlarge instance for example, inference latency decreased dramatically from 180 milliseconds to 56 milliseconds (a 70% improvement), while throughput improved by 406% from 33 requests per second to 167. Additionally, the maximum number of concurrent users increased by 50%; failed requests were not observed until we reached 180 concurrent users, compared to 120 for the original PyTorch model. In terms of memory utilization, we saw one fewer model loaded for TensorRT (from nine models to eight). However, the negative impact is much smaller compared to what we observed with the CNN-based models.

Analysis on memory utilization

The following table shows the full analysis on memory utilization impact going from PyTorch to TensorRT. We mentioned earlier that CNN-based models are impacted more negatively. The ResNet50 model had an over 50% reduction in number of models loaded across all three GPU instance types. Convnext_base had an even larger reduction at approximately 70% across the board. On the other hand, the impact to the transformer models is small or mixed. vit_large_patch16_224 and roberta-large had an average reduction of approximately 20% and 3%, respectively, while bert-base-uncased had an approximately 40% improvement.

Looking at all the data points as a whole in regards to the superior performance in latency, throughput, and reliability, and the minor impact on the maximum number of models loaded, we recommend the TensorRT model for transformer-based model architectures. For CNNs, we believe further cost performance analysis is needed to make sure the performance benefit outweighs the cost of additional hosting infrastructure.

ML Use Case Architecture Model Name Instance Type Framework Max Models Loaded Diff (%) Avg. Diff (%)
CV CNN Resnet50 ml.g4dn.2xlarge PyTorch 46 -50% -50%
TensorRT 23
ml.g5.2xlarge PyTorch 70 -51%
TensorRT 34
ml.p3.2xlarge PyTorch 49 -51%
TensorRT 24
Convnext_base ml.g4dn.2xlarge PyTorch 33 -50% -70%
TensorRT 10
ml.g5.2xlarge PyTorch 50 -70%
TensorRT 16
ml.p3.2xlarge PyTorch 35 -69%
TensorRT 11
Transformer vit_large_patch16_224 ml.g4dn.2xlarge PyTorch 10 -30% -20%
TensorRT 7
ml.g5.2xlarge PyTorch 15 -13%
TensorRT 13
ml.p3.2xlarge PyTorch 11 -18%
TensorRT 9
NLP Roberta-large ml.g4dn.2xlarge PyTorch 9 -11% -3%
TensorRT 8
ml.g5.2xlarge PyTorch 13 0%
TensorRT 13
ml.p3.2xlarge PyTorch 9 0%
TensorRT 9
Bert-base-uncased ml.g4dn.2xlarge PyTorch 26 62% 40%
TensorRT 42
ml.g5.2xlarge PyTorch 39 28%
TensorRT 50
ml.p3.2xlarge PyTorch 28 29%
TensorRT 36

The following tables list our complete benchmark results for all the metrics across all three GPU instances types.

ml.g4dn.2xlarge

Use Case Architecture Model Name Number of Parameters Framework Max Models Loaded Diff (%) Latency (ms) Diff (%) Throughput (qps) Diff (%) Max Concurrent Users Diff (%)
CV CNN resnet50 25M PyTorch 46 -50% 164 -32% 120 18% 180 NA
TensorRT 23 . 111 . 142 . 180 .
convnext_base 88M PyTorch 33 -70% 154 -22% 64 102% 140 14%
TensorRT 10 . 120 . 129 . 160 .
Transformer vit_large_patch16_224 304M PyTorch 10 -30% 425 -69% 26 304% 140 14%
TensorRT 7 . 131 . 105 . 160 .
NLP bert-base-uncased 109M PyTorch 26 62% 70 -39% 105 142% 140 29%
TensorRT 42 . 43 . 254 . 180 .
roberta-large 335M PyTorch 9 -11% 187 -70% 33 406% 120 50%
TensorRT 8 . 56 . 167 . 180 .

ml.g5.2xlarge

Use Case Architecture Model Name Number of Parameters Framework Max Models Loaded Diff (%) Latency (ms) Diff (%) Throughput (qps) Diff (%) Max Concurrent Users Diff (%)
CV CNN resnet50 25M PyTorch 70 -51% 159 -31% 146 14% 180 11%
TensorRT 34 . 110 . 166 . 200 .
convnext_base 88M PyTorch 50 -68% 149 -23% 134 13% 180 0%
TensorRT 16 . 115 . 152 . 180 .
Transformer vit_large_patch16_224 304M PyTorch 15 -13% 149 -22% 105 35% 160 25%
TensorRT 13 . 116 . 142 . 200 .
NLP bert-base-uncased 109M PyTorch 39 28% 65 -29% 183 38% 180 11%
TensorRT 50 . 46 . 253 . 200 .
roberta-large 335M PyTorch 13 0% 97 -38% 121 46% 140 14%
TensorRT 13 . 60 . 177 . 160 .

ml.p3.2xlarge

Use Case Architecture Model Name Number of Parameters Framework Max Models Loaded Diff (%) Latency (ms) Diff (%) Throughput (qps) Diff (%) Max Concurrent Users Diff (%)
CV CNN resnet50 25M PyTorch 49 -51% 197 -41% 94 18% 160 -12%
TensorRT 24 . 117 . 111 . 140 .
convnext_base 88M PyTorch 35 -69% 178 -23% 89 11% 140 14%
TensorRT 11 .137 137 . 99 . 160 .
Transformer vit_large_patch16_224 304M PyTorch 11 -18% 186 -28% 83 23% 140 29%
TensorRT 9 . 134 . 102 . 180 .
NLP bert-base-uncased 109M PyTorch 28 29% 77 -40% 133 59% 140 43%
TensorRT 36 . 46 . 212 . 200 .
roberta-large 335M PyTorch 9 0% 108 -44% 88 60% 160 0%
TensorRT 9 . 61 . 141 . 160 .

The following table summarizes the results across all instance types. The ml.g5.2xlarge instance provides the best performance, whereas the ml.p3.2xlarge instance generally underperforms despite being the most expensive of the three. The g5 and g4dn instances demonstrate the best value for inference workloads.

Use Case Architecture Model Name Number of Parameters Framework Instance Type Max Models Loaded Diff (%) Latency (ms) Diff (%) Throughput (qps) Diff (%) Max Concurrent Users
CV CNN resnet50 25M PyTorch ml.g5.2xlarge 70 . 159 . 146 . 180
. . . . . ml.p3.2xlarge 49 . 197 . 94 . 160
. . . . . ml.g4dn.2xlarge 46 . 164 . 120 . 180
CV CN resnet50 25M TensorRT ml.g5.2xlarge 34 -51% 110 -31% 166 14% 200
. . . . . ml.p3.2xlarge 24 -51% 117 -41% 111 18% 200
. . . . . ml.g4dn.2xlarge 23 -50% 111 -32% 142 18% 180
NLP Transformer bert-base-uncased 109M Pytorch ml.g5.2xlarge 39 . 65 . 183 . 180
. . . . . ml.p3.2xlarge 28 . 77 . 133 . 140
. . . . . ml.g4dn.2xlarge 26 . 70 . 105 . 140
NLP Transformer bert-base-uncased 109M TensorRT ml.g5.2xlarge 50 28% 46 -29% 253 38% 200
. . . . . ml.p3.2xlarge 36 29% 46 -40% 212 59% 200
. . . . . ml.g4dn.2xlarge 42 62% 43 -39% 254 142% 180

Clean up

After you complete your load test, clean up the generated resources to avoid incurring additional charges. The main resources are the SageMaker endpoints and model artifact files in Amazon S3. To make it easy for you, the notebook files have the following cleanup code to help you delete them:

delete_endpoint(sm_client, sm_model_name, endpoint_config_name, endpoint_name)

! aws s3 rm --recursive {trt_mme_path}

Conclusion

In this post, we shared our test results and analysis for various deep neural network models running on SageMaker multi-model endpoints with GPU. The results and insights we shared should provide a reasonable cross section of performance across different metrics and instance types. In the process, we also introduced our recommended approach to run benchmark testing for SageMaker MMEs with GPU. The tools and sample code we provided can help you quickstart your benchmark testing and make a more informed decision on how to cost-effectively host hundreds of DNN models on accelerated compute hardware. To get started with benchmarking your own models with MME support for GPU, refer to Supported algorithms, frameworks, and instances and the GitHub repo for additional examples and documentation.


About the authors

James Wu is a Senior AI/ML Specialist Solution Architect at AWS. helping customers design and build AI/ML solutions. James’s work covers a wide range of ML use cases, with a primary interest in computer vision, deep learning, and scaling ML across the enterprise. Prior to joining AWS, James was an architect, developer, and technology leader for over 10 years, including 6 years in engineering and 4 years in marketing & advertising industries.

Vikram Elango is an AI/ML Specialist Solutions Architect at Amazon Web Services, based in Virginia USA. Vikram helps financial and insurance industry customers with design, thought leadership to build and deploy machine learning applications at scale. He is currently focused on natural language processing, responsible AI, inference optimization and scaling ML across the enterprise. In his spare time, he enjoys traveling, hiking, cooking and camping with his family.

Simon Zamarin is an AI/ML Solutions Architect whose main focus is helping customers extract value from their data assets. In his spare time, Simon enjoys spending time with family, reading sci-fi, and working on various DIY house projects.

Saurabh Trikande is a Senior Product Manager for Amazon SageMaker Inference. He is passionate about working with customers and is motivated by the goal of democratizing machine learning. He focuses on core challenges related to deploying complex ML applications, multi-tenant ML models, cost optimizations, and making deployment of deep learning models more accessible. In his spare time, Saurabh enjoys hiking, learning about innovative technologies, following TechCrunch and spending time with his family.

Read More