Improved ML model deployment using Amazon SageMaker Inference Recommender

Each machine learning (ML) system has a unique service level agreement (SLA) requirement with respect to latency, throughput, and cost metrics. With advancements in hardware design, a wide range of CPU- and GPU-based infrastructures are available to help you speed up inference performance. Also, you can build these ML systems with a combination of ML models, tasks, frameworks, libraries, tools, and inference engines, making it important to evaluate the ML system performance for the best possible deployment configurations. You need recommendations on finding the most cost-effective ML serving infrastructure and the right combination of software configuration to achieve the best price-performance to scale these applications.

Amazon SageMaker Inference Recommender is a capability of Amazon SageMaker that reduces the time required to get ML models in production by automating load testing and model tuning across SageMaker ML instances. In this post, we highlight some of the recent updates to Inference Recommender:

  • SageMaker Python SDK support for running Inference Recommender
  • Inference Recommender usability improvements
  • New APIs that provide flexibility in running Inference Recommender
  • Deeper integration with Amazon CloudWatch for logging and metrics

Credit card fraud detection use case

Any fraudulent activity that is not detected and mitigated immediately can cause significant financial loss. In particular, fraudulent credit card transactions must be identified right away to protect the financial health of both the individual and the company. In this post, we discuss a credit card fraud detection use case, and learn how to use Inference Recommender to find the optimal inference instance type and ML system configurations that can detect fraudulent credit card transactions in milliseconds.

We demonstrate how to set up Inference Recommender jobs for a credit card fraud detection use case. We train an XGBoost model for a classification task on a credit card fraud dataset. We use Inference Recommender with a custom load test to meet inference SLA requirements: peak concurrency of 30,000 transactions per minute while serving prediction results in less than 100 milliseconds. Based on Inference Recommender’s instance type recommendations, we can find the right real-time serving ML instances that yield the right price-performance for this use case. Finally, we deploy the model to a SageMaker real-time endpoint to get prediction results.

The following table summarizes the details of our use case.

Model Framework: XGBoost
Model Size: 10 MB
End-to-End Latency: 100 milliseconds
Invocations per Second: 500 (30,000 per minute)
ML Task: Binary Classification
Input Payload: 10 KB

We use a synthetically created credit card fraud dataset. The dataset contains 28 numerical features, the time of the transaction, the transaction amount, and a class target variable. The class column corresponds to whether or not a transaction is fraudulent. The majority of the data is non-fraudulent (284,315 samples), with only 492 samples corresponding to fraudulent examples. In the data, Class is the target classification variable (fraudulent vs. non-fraudulent) in the first column, followed by the other variables.
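
As a quick sanity check, you can confirm this imbalance before training. The following is a minimal pandas sketch; the creditcard.csv file name is an assumption for illustration, but the Class column matches the dataset described above.

import pandas as pd

# Minimal sketch: inspect the class imbalance described above.
# The creditcard.csv file name is illustrative only.
df = pd.read_csv("creditcard.csv")
print(df["Class"].value_counts())   # 0 = non-fraudulent, 1 = fraudulent
print(f"{df['Class'].mean():.4%} of transactions are fraudulent")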

In the following sections, we show how to use Inference Recommender to get ML hosting instance type recommendations and find optimal model configurations to achieve better price-performance for your inference application.

Which ML instance type and configurations should you select?

With Inference Recommender, you can run two types of jobs: default and advanced.

The default Inference Recommender job runs a set of load tests to recommend the right ML instance types for any ML use case. SageMaker real-time deployment supports a wide range of ML instances to host and serve the credit card fraud detection XGBoost model. The default job can run a load test on a selection of instances that you provide in the job configuration. If you have an existing endpoint for this use case, you can run this job to find the cost-optimized, performant instance type. Inference Recommender compiles and optimizes the model for the specific hardware of the inference endpoint instance type using Amazon SageMaker Neo. It’s important to note that not all compilations result in improved performance. Inference Recommender reports compilation details when the following conditions are met:

  • Successful compilation of the model using Neo. Compilation can fail due to issues such as an invalid payload or data type; in that case, compilation information is not available.
  • Successful inference using the compiled model that shows performance improvement, which appears in the inference job response.

An advanced job is a custom load test job that allows you to perform extensive benchmarks based on your ML application SLA requirements, such as latency, concurrency, and traffic pattern. You can configure a custom traffic pattern to simulate credit card transactions. Additionally, you can define the end-to-end model latency to predict if a transaction is fraudulent and define the maximum concurrent transactions to the model for prediction. Inference Recommender uses this information to run a performance benchmark load test. The latency, concurrency, and cost metrics from the advanced job help you make informed decisions about the ML serving infrastructure for mission-critical applications.

Solution overview

The following diagram shows the solution architecture for training an XGBoost model on the credit card fraud dataset, running a default job for instance type recommendation, and performing load testing to decide the optimal inference configuration for the best price-performance.

The diagram shows the following steps:

  1. Train an XGBoost model to classify credit card transactions as fraudulent or legitimate. Deploy the trained model to a SageMaker real-time endpoint. Package the model artifacts and sample payload (.tar.gz format), and upload them to Amazon Simple Storage Service (Amazon S3) so Inference Recommender can use them when the job runs. Note that the training step in this post is optional.
  2. Configure and run a default Inference Recommender job on a list of supported instance types to find the right ML instance type that gives the best price-performance for this use case.
  3. Optionally, run a default Inference Recommender job on an existing endpoint.
  4. Configure and run an advanced Inference Recommender job to perform a custom load test to simulate user interactions with the credit card fraud detection application. This helps you find the right configurations to satisfy latency, concurrency, and cost for this use case.
  5. Analyze the default and advanced Inference Recommender job results, which include ML instance type recommendation latency, performance, and cost metrics.

A complete example is available in our GitHub notebook.

Prerequisites

To use Inference Recommender, make sure to meet the prerequisites.

Python SDK support for Inference Recommender

We recently released Python SDK support for Inference Recommender. You can now run default and advanced jobs using a single function: right_size. Based on the parameters of the function call, Inference Recommender infers if it should run default or advanced jobs. This greatly simplifies the use of Inference Recommender using the Python SDK. To run the Inference Recommender job, complete the following steps:

  1. Create a SageMaker model by specifying the framework, version, and image scope:
    import sagemaker
    from sagemaker.model import Model

    model = Model(
        model_data=model_url,
        role=role,
        image_uri=sagemaker.image_uris.retrieve(
            framework="xgboost",
            region=region,
            version="1.5-1",
            py_version="py3",
            image_scope="inference",
        ),
        sagemaker_session=sagemaker_session,
    )

  2. Optionally, register the model in the SageMaker model registry. Note that parameters such as domain and task during model package creation are also optional parameters in the recent release.
    model_package = model.register(
        content_types=["text/csv"],
        response_types=["text/csv"],
        model_package_group_name=model_package_group_name,
        image_uri=model.image_uri,
        approval_status="Approved",
        framework="XGBOOST"
    )

  3. Run the right_size function on the supported ML inference instance types using the following configuration. Because XGBoost is a memory-intensive algorithm, we provide ml.m5 type instances to get instance type recommendations. You can call the right_size function on the model registry object as well.
    model.right_size(
        sample_payload_url=sample_payload_url,
        supported_content_types=["text/csv"],
        supported_instance_types=["ml.m5.large", 
                                  "ml.m5.xlarge", 
                                  "ml.m5.2xlarge", 
                                  "ml.m5.4xlarge", 
                                  "ml.m5.12xlarge"],
        framework="XGBOOST",
        job_name="credit-card-fraud-default-job"
    )
    INFO:sagemaker:Advance Job parameters were not specified. Running Default job...

  4. Define additional parameters to the right_size function to run an advanced job and custom load test on the model:
    1. Configure the traffic pattern using the phases parameter. In the first phase, we start the load test with two initial users and add two new users every minute for 2 minutes. In the following phase, we start the load test with six initial users and add two new users every minute for 2 minutes. Stopping conditions for the load tests are a p95 end-to-end latency of 100 milliseconds and concurrency to support 30,000 transactions per minute (500 transactions per second).
    2. We tune the endpoint against the environment variable OMP_NUM_THREADS with values [3, 4, 5], aiming to meet the latency requirement of 100 milliseconds and achieve a maximum concurrency of 30,000 invocations per minute. The goal is to find which value of OMP_NUM_THREADS provides the best performance.
from sagemaker.parameter import CategoricalParameter 
from sagemaker.inference_recommender.inference_recommender_mixin import (  
    Phase,  
    ModelLatencyThreshold 
) 
hyperparameter_ranges = [ 
    { 
        "instance_types": CategoricalParameter(["ml.m5.4xlarge"]), 
        "OMP_NUM_THREADS": CategoricalParameter(["3", "4", "6"]),
    } 
] 
phases = [ 
    Phase(duration_in_seconds=120, initial_number_of_users=2, spawn_rate=2), 
    Phase(duration_in_seconds=120, initial_number_of_users=6, spawn_rate=2) 
] 
model_latency_thresholds = [ 
    ModelLatencyThreshold(percentile="P95", value_in_milliseconds=100) 
]

model.right_size( 
    sample_payload_url=sample_payload_url, 
    supported_content_types=["text/csv"], 
    framework="XGBOOST", 
    job_duration_in_seconds=7200, 
    hyperparameter_ranges=hyperparameter_ranges, 
    phases=phases, # TrafficPattern 
    max_invocations=30000, # StoppingConditions 
    model_latency_thresholds=model_latency_thresholds,
    job_name="credit-card-fraud-advanced-job"
)
INFO:sagemaker:Advance Job parameters were specified. Running Advanced job...

Run Inference Recommender jobs using the Boto3 API

You can use the Boto3 API to launch Inference Recommender default and advanced jobs. You need to use the Boto3 API (create_inference_recommendations_job) to run Inference Recommender jobs on an existing endpoint. Inference Recommender infers the framework and version from the existing SageMaker real-time endpoint. The Python SDK doesn’t support running Inference Recommender jobs on existing endpoints.

The following code snippet shows how to create a default job:

import boto3

sagemaker_client = boto3.client("sagemaker")

sagemaker_client.create_inference_recommendations_job(
    JobName = "credit-card-fraud-default-job",
    JobType = 'Default',
    RoleArn = <ROLE_ARN>,
    InputConfig = {
        'ModelPackageVersionArn': <MODEL_PACKAGE_ARN>, # optional
        'Endpoints': [{'EndpointName': <ENDPOINT_NAME>}]
    }
)

Later in this post, we discuss the parameters needed to configure an advanced job.

Configure a traffic pattern using the TrafficPattern parameter. In the first phase, we start a load test with two initial users (InitialNumberOfUsers) and add two new users (SpawnRate) every minute for 2 minutes (DurationInSeconds). In the following phase, we start the load test with six initial users and add two new users every minute for 2 minutes. Stopping conditions (StoppingConditions) for the load tests are a p95 end-to-end latency (ModelLatencyThresholds) of 100 milliseconds (ValueInMilliseconds) and concurrency to support 30,000 transactions per minute or 500 transactions per second (MaxInvocations). See the following code:

env_parameter_ranges = [{"Name": "OMP_NUM_THREADS", "Value": ["3", "4", "5"]}]

sagemaker_client.create_inference_recommendations_job(
    JobName=load_test_job_name,
    JobType='Advanced',
    RoleArn=role_arn,
    InputConfig={
        'ModelPackageVersionArn': model_package_arn, # optional
        'JobDurationInSeconds': 7200,
        'TrafficPattern': {
            'TrafficType': 'PHASES',
            'Phases': [
                {'InitialNumberOfUsers': 2,
                 'SpawnRate': 2,
                 'DurationInSeconds': 120},
                {'InitialNumberOfUsers': 6,
                 'SpawnRate': 2,
                 'DurationInSeconds': 120}
            ]
        },
        'ResourceLimit': {'MaxNumberOfTests': 10, 'MaxParallelOfTests': 3},
        'EndpointConfigurations': [
            {'InstanceType': 'ml.m5.4xlarge',
             'EnvironmentParameterRanges': {
                 'CategoricalParameterRanges': env_parameter_ranges
             }}
        ],
    },
    StoppingConditions={
        'MaxInvocations': 30000,
        'ModelLatencyThresholds': [
            {'Percentile': 'P95',
             'ValueInMilliseconds': 100}
        ]
    }
)

Inference Recommender job results and metrics

The results of the default Inference Recommender job contain a list of endpoint configuration recommendations, including instance type, instance count, and environment variables. The results contain configurations for SAGEMAKER_MODEL_SERVER_WORKERS and OMP_NUM_THREADS associated with the latency, concurrency, and throughput metrics. OMP_NUM_THREADS is a tunable model server environment parameter. As shown in the following table, with an ml.m5.4xlarge instance with SAGEMAKER_MODEL_SERVER_WORKERS=3 and OMP_NUM_THREADS=3, we got a throughput of 32,628 invocations per minute and model latency under 10 milliseconds. Compared to the ml.m5.xlarge instance configuration, ml.m5.4xlarge delivered a 100% improvement in latency and an approximate 115% increase in concurrency. It was also 66% more cost-effective than the ml.m5.12xlarge instance configuration while achieving comparable latency and throughput.

Instance Type Initial Instance Count OMP_NUM_THREADS Cost Per Hour (USD) Max Invocations (per minute) Model Latency (ms) CPU Utilization (%) Memory Utilization (%) SageMaker Model Server Workers
ml.m5.xlarge 1 2 0.23 15189 18 108.864 1.62012 1
ml.m5.4xlarge 1 3 0.922 32628 9 220.57001 0.69791 3
ml.m5.large 1 2 0.115 13793 19 106.34 3.24398 1
ml.m5.12xlarge 1 4 2.765 32016 4 215.32401 0.44658 7
ml.m5.2xlarge 1 2 0.461 32427 13 248.673 1.43109 3

We have included CloudWatch helper functions in the notebook. You can use the functions to get detailed charts of your endpoints during the load test. The charts have details on invocation metrics like invocations, model latency, overhead latency, and more, and instance metrics such as CPUUtilization and MemoryUtilization. The following example shows the CloudWatch metrics for our ml.m5.4xlarge model configuration.
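
If you prefer to query these invocation metrics directly rather than through the notebook helpers, the following is a minimal Boto3 sketch against the AWS/SageMaker CloudWatch namespace; the endpoint name, variant name, and time range are placeholders.

import boto3
from datetime import datetime, timedelta

# Minimal sketch: query the endpoint invocation metrics emitted to the
# AWS/SageMaker CloudWatch namespace during the load test.
cloudwatch = boto3.client("cloudwatch")
end = datetime.utcnow()
start = end - timedelta(hours=2)

for metric in ["Invocations", "ModelLatency", "OverheadLatency"]:
    stats = cloudwatch.get_metric_statistics(
        Namespace="AWS/SageMaker",
        MetricName=metric,
        Dimensions=[
            {"Name": "EndpointName", "Value": "<ENDPOINT_NAME>"},
            {"Name": "VariantName", "Value": "<VARIANT_NAME>"},
        ],
        StartTime=start,
        EndTime=end,
        Period=60,
        # Invocations is a count, so sum it; the latency metrics
        # (reported in microseconds) are averaged per period.
        Statistics=["Sum"] if metric == "Invocations" else ["Average"],
    )
    print(metric, sorted(stats["Datapoints"], key=lambda d: d["Timestamp"])[:3])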

You can visualize Inference Recommender job results in Amazon SageMaker Studio by choosing Inference Recommender under Deployments in the navigation pane. With a deployment goal for this use case (low latency, high throughput, default cost), the default Inference Recommender job recommended an ml.m5.4xlarge instance because it provided the best latency performance and throughput to support a maximum 34,600 invocations per minute (576 TPS). You can use these metrics to analyze and find the best configurations that satisfy the latency, concurrency, and cost requirements of your ML application.

We recently introduced ListInferenceRecommendationsJobSteps, which allows you to analyze subtasks in an Inference Recommender job. The following code snippet shows how to use the list_inference_recommendations_job_steps Boto3 API to get the list of subtasks. This can help with debugging Inference Recommender job failures at the step level. This functionality is not supported in the Python SDK yet.

import boto3

sm_client = boto3.client("sagemaker", region_name=region)
list_job_steps_response = sm_client.list_inference_recommendations_job_steps(
    JobName='<JOB_NAME>')
print(list_job_steps_response)

The following code shows the response:

{
    "Steps": [
        {
            "StepType": "BENCHMARK",
            "JobName": "SMPYTHONSDK-<JOB_NAME>",
            "Status": "COMPLETED",
            "InferenceBenchmark": {
                "Metrics": {
                    "CostPerHour": 1.8359999656677246,
                    "CostPerInference": 1.6814110495033674e-06,
                    "MaxInvocations": 18199,
                    "ModelLatency": 40,
                    "CpuUtilization": 106.06400299072266,
                    "MemoryUtilization": 0.3920480012893677
                },
                "EndpointConfiguration": {
                    "EndpointName": "sm-epc-<ENDPOINTNAME>",
                    "VariantName": "sm-epc-<VARIANTNAME>",
                    "InstanceType": "ml.c5.9xlarge",
                    "InitialInstanceCount": 1
                },
                "ModelConfiguration": {
                    "EnvironmentParameters": [
                        {
                            "Key": "SAGEMAKER_MODEL_SERVER_WORKERS",
                            "ValueType": "String",
                            "Value": "1"
                        },
                        {
                            "Key": "OMP_NUM_THREADS",
                            "ValueType": "String",
                            "Value": "28"
                        }
                    ]
                }
            }
        },
     ...... <TRUNCATED>
    "ResponseMetadata": {
        "RequestId": "<RequestId>",
        "HTTPStatusCode": 200,
        "HTTPHeaders": {
            "x-amzn-requestid": "<x-amzn-requestid>",
            "content-type": "application/x-amz-json-1.1",
            "content-length": "1443",
            "date": "Mon, 20 Feb 2023 16:53:30 GMT"
        },
        "RetryAttempts": 0
    }
}

Run an advanced Inference Recommender job

Next, we run an advanced Inference Recommender job to find optimal configurations such as SAGEMAKER_MODEL_SERVER_WORKERS and OMP_NUM_THREADS on an ml.m5.4xlarge instance type. We set the hyperparameters of the advanced job to run a load test on different combinations:

hyperparameter_ranges = [ 
    { 
        "instance_types": CategoricalParameter(["ml.m5.4xlarge"]), 
        "OMP_NUM_THREADS": CategoricalParameter(["3", "4", "6"]),
    } 
]

You can view the advanced Inference Recommender job results on the Studio console, as shown in the following screenshot.

Using the Boto3 API or CLI commands, you can access all the metrics from the advanced Inference Recommender job results. InitialInstanceCount is the number of instances that you should provision in the endpoint to meet ModelLatencyThresholds and MaxInvocations mentioned in StoppingConditions. The following table summarizes our results.

Instance Type Initial Instance Count OMP_NUM_THREADS Cost Per Hour (USD) Max Invocations (per minute) Model Latency (ms) CPU Utilization (%) Memory Utilization (%)
ml.m5.2xlarge 2 3 0.922 39688 6 86.732803 3.04769
ml.m5.2xlarge 2 4 0.922 42604 6 177.164993 3.05089
ml.m5.2xlarge 2 5 0.922 39268 6 125.402 3.08665
ml.m5.4xlarge 2 3 1.844 38174 4 102.546997 2.68003
ml.m5.4xlarge 2 4 1.844 39452 4 141.826004 2.68136
ml.m5.4xlarge 2 5 1.844 40472 4 107.825996 2.70936
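
The following is a minimal Boto3 sketch of pulling these recommendation metrics programmatically with the DescribeInferenceRecommendationsJob API; the job name matches the advanced job created earlier in this post.

import boto3

# Minimal sketch: retrieve the recommendation metrics for a completed job.
sm_client = boto3.client("sagemaker")
job = sm_client.describe_inference_recommendations_job(
    JobName="credit-card-fraud-advanced-job"
)
for rec in job["InferenceRecommendations"]:
    endpoint_config = rec["EndpointConfiguration"]
    metrics = rec["Metrics"]
    print(
        endpoint_config["InstanceType"],
        endpoint_config["InitialInstanceCount"],
        metrics["MaxInvocations"],
        metrics["ModelLatency"],
        metrics["CostPerHour"],
    )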

Clean up

Follow the instructions in the notebook to delete all the resources created as part of this post to avoid incurring additional charges.

Summary

Finding the right ML serving infrastructure, including instance type, model configurations, and auto scaling policies, can be tedious. This post showed how you can use the Inference Recommender Python SDK and Boto3 APIs to launch default and advanced jobs to find the optimal inference infrastructure and configurations. We also discussed the new improvements to Inference Recommender, including Python SDK support and usability improvements. Check out our GitHub repository to get started.


About the Authors

Shiva Raaj Kotini works as a Principal Product Manager in the AWS SageMaker inference product portfolio. He focuses on model deployment, performance tuning, and optimization in SageMaker for inference.

John Barboza is a Software Engineer at AWS. He has extensive experience working on distributed systems. His current focus is on improving the SageMaker inference experience. In his spare time, he enjoys cooking and biking.

Mohan Gandhi is a Senior Software Engineer at AWS. He has been with AWS for the last 10 years and has worked on various AWS services like Amazon EMR, Amazon EFA, and Amazon RDS. Currently, he is focused on improving the SageMaker inference experience. In his spare time, he enjoys hiking and marathons.

Ram Vegiraju is an ML Architect with the SageMaker service team. He focuses on helping customers build and optimize their AI/ML solutions on Amazon SageMaker. In his spare time, he loves traveling and writing.

Vikram Elango is a Sr. AI/ML Specialist Solutions Architect at AWS, based in Virginia, US. He is currently focused on generative AI, LLMs, prompt engineering, large model inference optimization, and scaling ML across enterprises. Vikram helps financial and insurance industry customers with design and thought leadership to build and deploy machine learning applications at scale. In his spare time, he enjoys traveling, hiking, cooking, and camping with his family.

Amazon Comprehend document classifier adds layout support for higher accuracy

The ability to effectively handle and process enormous amounts of documents has become essential for enterprises in the modern world. Due to the continuous influx of information that all enterprises deal with, manually classifying documents is no longer a viable option. Document classification models can automate the procedure and help organizations save time and resources. Traditional categorization techniques, such as manual processing and keyword-based searches, become less efficient and more time-consuming as the volume of documents increases. This inefficiency causes lower productivity and higher operating expenses. Additionally, it can prevent crucial information from being accessible when needed, which could lead to a poor customer experience and impact decision-making. At AWS re:Invent 2022, Amazon Comprehend, a natural language processing (NLP) service that uses machine learning (ML) to discover insights from text, launched support for native document types. This new feature gave you the ability to classify documents in native formats (PDF, TIFF, JPG, PNG, DOCX) using Amazon Comprehend.

Today, we are excited to announce that Amazon Comprehend now supports custom classification model training with documents like PDF, Word, and image formats. You can now train bespoke document classification models on native documents that support layout in addition to text, increasing the accuracy of the results.

In this post, we provide an overview of how you can get started with training an Amazon Comprehend custom document classification model.

Overview

The capacity to understand the relative placements of objects within a defined space is referred to as layout awareness. In this case, it aids the model in understanding how headers, subheadings, tables, and graphics relate to one another inside a document. The model can more effectively categorize a document based on its content when it’s aware of the structure and layout of the text.

In this post, we walk through the data preparation steps involved, demonstrate the model training process, and discuss the benefits of using the new custom document classification model in Amazon Comprehend. As a best practice, you should consider the following points before you begin training the custom document classification model.

Evaluate your document classification needs

Identify the various types of documents that you may need to classify, along with the different classes or categories to support your use case. Determine the suitable classification structure or taxonomy after evaluating the amount and types of documents that need to be categorized. Document types may include PDF, Word, images, and so on. Ensure you have authorized access to a diverse set of labeled documents either via a document management system or other storage mechanisms.

Prepare your data

Ensure that the document files you intend to use for model training aren’t encrypted or locked—for example, make sure that your PDF files aren’t encrypted or locked with a password. You must decrypt such files before you can use them for training purposes. Label a sample of your documents with the appropriate categories or labels (classes). Determine whether single-label classification (multi-class mode) or multi-label classification is appropriate for your use case. Multi-class mode associates only a single class with each document, whereas multi-label mode associates one or more classes with a document.

Consider model evaluation

Use the labeled dataset to train the model so it can learn to classify new documents accurately and evaluate how the newly trained model version performs by understanding the model metrics. To understand the metrics provided by Amazon Comprehend post-model training, refer to Custom classifier metrics. After the training process is complete, you can begin classifying documents asynchronously or in real time. We walk through how to train a custom classification model in the following sections.

Prepare the training data

Before we train our custom classification model, we need to prepare the training data. Training data consists of a set of labeled documents, which can be pre-identified documents from a document repository that you already have access to. For our example, we trained a custom classification model with a few different document types that are typically found in a health insurance claim adjudication process: patient discharge summary, invoices, receipts, and so on. We also need to prepare an annotations file in CSV format. The following is an example of the annotations CSV data required for training:

 discharge_summary,summary-1.pdf,1
 discharge_summary,summary-2.pdf,1
 invoice,invoice-1.pdf,1
 invoice,invoice-1.pdf,2
 invoice,invoice-2.pdf,1

The annotations CSV file must contain three columns. The first column contains the desired class (label) for the document, the second column is the document name (file name), and the last column is the page number of the document that you want to include in the training dataset. Because the training process supports native multi-page PDF and DOCX files, you must specify the page number in case the document is a multi-page document. If you want to include all pages of a multi-page document in the training dataset, you must specify each page as a separate line in the CSV annotations file. For example, in the preceding annotations file, invoice-1.pdf is a two-page document, and we want to include both pages in the classification dataset. Because image formats such as JPG and PNG contain a single page, the page number (third column) value for those files must always be 1. If your dataset contains multi-frame (multi-page) TIF files, you must split them into separate TIF files in order to use them in the training process.

We prepared an annotations file called test.csv with the appropriate data to train a custom classification model. For each sample document, the CSV file contains the class that document belongs to, the location of the document in Amazon Simple Storage Service (Amazon S3), such as path/to/prefix/document.pdf, and the page number (if applicable). Because most of our documents are either single-page DOCX, PDF files, or TIF, JPG, or PNG files, the page number assigned is 1. Because our annotations CSV and sample documents are all under the same Amazon S3 prefix, we don’t need to explicitly specify the prefix in the second column. We also prepared at least 10 document samples for each class, and we used a mix of JPG, PNG, DOCX, PDF, and TIF files for training the model. Note that it’s usually recommended to have a diverse set of sample documents for model training to avoid overfitting of the model, which impacts its ability to recognize new documents. It’s also recommended that the number of samples per class is balanced, although an exactly equal number of samples per class isn’t required. Next, we upload the test.csv annotations file and all the documents into Amazon S3. The following image shows part of our annotations CSV file.
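
If you have many labeled documents, you can generate the annotations CSV programmatically. The following is a minimal sketch; the labels, file names, and page counts below are illustrative and would come from your own labeled document inventory.

import csv

# Minimal sketch: write one annotations row per page of each labeled document.
labeled_docs = [
    ("discharge_summary", "summary-1.pdf", 1),   # single-page document
    ("invoice", "invoice-1.pdf", 2),             # two-page document: one row per page
]

with open("test.csv", "w", newline="") as f:
    writer = csv.writer(f)
    for label, file_name, num_pages in labeled_docs:
        for page in range(1, num_pages + 1):
            writer.writerow([label, file_name, page])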

Train a custom classification model

Now that we have the annotations file and all our sample documents ready, we set up a custom classification model and train it. Before you begin setting up custom classification model training, make sure that the annotations CSV and sample documents exist in an Amazon S3 location.

  1. On the Amazon Comprehend console, choose Custom classification in the navigation pane.
  2. Choose Create new model.
  3. For Model name, enter a unique name.
  4. For Version name, enter a unique version name.
  5. For Training model type, select Native documents.

This tells Amazon Comprehend that you intend to use native document types to train the model instead of serialized text.

  6. For Classifier mode, select Using single-label mode.

This mode tells the classifier that we intend to classify documents into a single class. If you need to train a model with multi-label mode, meaning a document may belong to one or more than one class, you must set up the annotations file appropriately by specifying the classes of the document separated by a special character in the annotations CSV file. In that case, you would select the Using multi-label mode option.

  7. For Annotation location on S3, enter the path of the annotations CSV file.
  8. For Training data location on S3, enter the Amazon S3 location where your documents reside.
  9. Leave all other options as default in this section.
  10. In the Output data section, specify an Amazon S3 location for your output.

This is optional, but it’s a good practice to provide an output location because Amazon Comprehend will generate the post-model training evaluation metrics in this location. This data is useful to evaluate model performance, iterate, and improve the accuracy of your model.

  11. In the IAM role section, choose an appropriate AWS Identity and Access Management (IAM) role that allows Amazon Comprehend to access the Amazon S3 location and write and read from it.
  12. Choose Create to initiate the model training.

The model may take several minutes to train, depending on the number of classes and the dataset size. You can review the training status on the Custom classification page. The job shows a Submitted status right after it’s submitted and changes to Training when model training begins. After your model is trained, the Version status changes to Trained. If Amazon Comprehend finds inconsistencies in your training data, the status shows In error along with an alert with the appropriate error message so that you can take corrective action and restart the training process with the corrected data.

In this post, we demonstrated the steps to train a custom classifier model using the Amazon Comprehend console. You can also use the AWS SDK in any language (for example, Boto3 for Python) or the AWS Command Line Interface (AWS CLI) to initiate a custom classification model training. With either the SDK or AWS CLI, you can use the CreateDocumentClassifier API to initiate the model training, and subsequently use the DescribeDocumentClassifier API to check the status of the model.
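
The following is a minimal Boto3 sketch of that flow; the classifier name, S3 locations, and role ARN are placeholders, and the InputDataConfig fields specific to native-document (layout) training are intentionally abbreviated, so check the CreateDocumentClassifier API reference for the exact structure.

import time
import boto3

comprehend = boto3.client("comprehend")

# Minimal sketch of the call shape only; values in angle brackets are placeholders.
response = comprehend.create_document_classifier(
    DocumentClassifierName="claims-doc-classifier",   # illustrative name
    VersionName="v1",
    DataAccessRoleArn="<DATA_ACCESS_ROLE_ARN>",
    LanguageCode="en",
    Mode="MULTI_CLASS",                               # single-label mode
    InputDataConfig={
        "S3Uri": "s3://<BUCKET>/<PREFIX>/test.csv",   # annotations CSV
        # Additional fields point Amazon Comprehend at the native documents;
        # see the API reference for the exact structure.
    },
    OutputDataConfig={"S3Uri": "s3://<BUCKET>/<PREFIX>/output/"},
)
classifier_arn = response["DocumentClassifierArn"]

# Poll training status with DescribeDocumentClassifier
while True:
    props = comprehend.describe_document_classifier(
        DocumentClassifierArn=classifier_arn
    )["DocumentClassifierProperties"]
    print(props["Status"])
    if props["Status"] in ("TRAINED", "IN_ERROR"):
        break
    time.sleep(60)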

After the model is trained, you can perform either real-time analysis or asynchronous (batch) analysis jobs on new documents. To perform real-time classification on documents, you must deploy an Amazon Comprehend real-time endpoint with the trained custom classification model. Real-time endpoints are best suited for use cases that require low-latency, real-time inference results, whereas for classifying a large set of documents, an asynchronous analysis job is more appropriate. To learn how you can perform asynchronous inference on new documents using a trained classification model, refer to Introducing one-step classification and entity recognition with Amazon Comprehend for intelligent document processing.

Benefits of the layout-aware custom classification model

The new classifier model offers a number of improvements. It’s not only easier to train the new model, but you can also train a new model with just a few samples for each class. Additionally, you no longer have to extract serialized plain text out of scanned or digital documents such as images or PDFs to prepare the training dataset. The following are some additional noteworthy improvements that you can expect from the new classification model:

  • Improved accuracy – The model now takes into account the layout and structure of documents, which leads to a better understanding of the structure and content of the documents. This helps distinguish between documents with similar text but different layouts or structures, resulting in increased classification accuracy.
  • Robustness – The model now handles variations in document structure and formatting. This makes it better suited for classifying documents from different sources with varying layouts or formatting styles, which is a common challenge in real-world document classification tasks. It’s compatible with several document types natively, making it versatile and applicable to different industries and use cases.
  • Reduced manual intervention – Higher accuracy leads to less manual intervention in the classification process. This can save time and resources, and increase operational efficiency in your document processing workload.

Conclusion

The new Amazon Comprehend document classification model, which incorporates layout awareness, is a game-changer for businesses dealing with large volumes of documents. By understanding the structure and layout of documents, this model offers improved classification accuracy and efficiency. Implementing a robust and accurate document classification solution using a layout-aware model can help your business save time, reduce operational costs, and enhance decision-making processes.

As a next step, we encourage you to try the new Amazon Comprehend custom classification model via the Amazon Comprehend console. We also recommend revisiting our custom classification model improvement announcements from last year and visiting the GitHub repository for code samples.


About the authors

Anjan Biswas is a Senior AI Services Solutions Architect with a focus on AI/ML and Data Analytics. Anjan is part of the world-wide AI services team and works with customers to help them understand and develop solutions to business problems with AI and ML. Anjan has over 14 years of experience working with global supply chain, manufacturing, and retail organizations, and is actively helping customers get started and scale on AWS AI services.

Godwin Sahayaraj Vincent is an Enterprise Solutions Architect at AWS who is passionate about Machine Learning and providing guidance to customers  to design, deploy and manage their AWS workloads and architectures. In his spare time, he loves to play cricket with his friends and tennis with his three kids.

Wrick Talukdar is a Senior Architect with the Amazon Comprehend Service team. He works with AWS customers to help them adopt machine learning on a large scale. Outside of work, he enjoys reading and photography.

Use streaming ingestion with Amazon SageMaker Feature Store and Amazon MSK to make ML-backed decisions in near-real time

Businesses are increasingly using machine learning (ML) to make near-real-time decisions, such as placing an ad, assigning a driver, recommending a product, or even dynamically pricing products and services. ML models make predictions given a set of input data known as features, and data scientists easily spend more than 60% of their time designing and building these features. Furthermore, highly accurate predictions depend on timely access to feature values that change quickly over time, adding even more complexity to the job of building a highly available and accurate solution. For example, a model for a ride-sharing app can choose the best price for a ride from the airport, but only if it knows the number of ride requests received in the past 10 minutes and the number of passengers projected to land in the next 10 minutes. A routing model in a call center app can pick the best available agent for an incoming call, but it’s only effective if it knows the customer’s latest web session clicks.

Although the business value of near-real-time ML predictions is enormous, the architecture required to deliver them reliably, securely, and with good performance is complicated. Solutions need high-throughput updates and low-latency retrieval of the most recent feature values in milliseconds, something most data scientists aren’t prepared to deliver. As a result, some enterprises have spent millions of dollars inventing their own proprietary infrastructure for feature management. Other firms have limited their ML applications to simpler patterns like batch scoring until ML vendors provide more comprehensive off-the-shelf solutions for online feature stores.

To address these challenges, Amazon SageMaker Feature Store provides a fully managed central repository for ML features, making it easy to securely store and retrieve features without having to build and maintain your own infrastructure. Feature Store lets you define groups of features, use batch ingestion and streaming ingestion, retrieve the latest feature values with single-digit millisecond latency for highly accurate online predictions, and extract point-in-time correct datasets for training. Instead of building and maintaining these infrastructure capabilities, you get a fully managed service that scales as your data grows, enables sharing features across teams, and lets your data scientists focus on building great ML models aimed at game-changing business use cases. Teams can now deliver robust features once and reuse them many times in a variety of models that may be built by different teams.

This post walks through a complete example of how you can couple streaming feature engineering with Feature Store to make ML-backed decisions in near-real time. We show a credit card fraud detection use case that updates aggregate features from a live stream of transactions and uses low-latency feature retrievals to help detect fraudulent transactions. Try it out for yourself by visiting our GitHub repo.

Credit card fraud use case

Stolen credit card numbers can be bought in bulk on the dark web from previous leaks or hacks of organizations that store this sensitive data. Fraudsters buy these card lists and attempt to make as many transactions as possible with the stolen numbers until the card is blocked. These fraud attacks typically happen in a short time frame, and this can be easily spotted in historical transactions because the velocity of transactions during the attack differs significantly from the cardholder’s usual spending pattern.

The following table shows a sequence of transactions from one credit card where the cardholder first has a genuine spending pattern, and then experiences a fraud attack starting on November 4.

cc_num trans_time amount fraud_label
…1248 Nov-01 14:50:01 10.15 0
…1248 Nov-02 12:14:31 32.45 0
…1248 Nov-02 16:23:12 3.12 0
…1248 Nov-04 02:12:10 1.01 1
…1248 Nov-04 02:13:34 22.55 1
…1248 Nov-04 02:14:05 90.55 1
…1248 Nov-04 02:15:10 60.75 1
…1248 Nov-04 13:30:55 12.75 0

For this post, we train an ML model to spot this kind of behavior by engineering features that describe an individual card’s spending pattern, such as the number of transactions or the average transaction amount from that card in a certain time window. This model protects cardholders from fraud at the point of sale by detecting and blocking suspicious transactions before the payment can complete. The model makes predictions in a low-latency, real-time context and relies on receiving up-to-the-minute feature calculations so it can respond to an ongoing fraud attack. In a real-world scenario, features related to cardholder spending patterns would only form part of the model’s feature set, and we can include information about the merchant, the cardholder, the device used to make the payment, and any other data that may be relevant to detecting fraud.

Because our use case relies on profiling an individual card’s spending patterns, it’s crucial that we can identify credit cards in a transaction stream. Most publicly available fraud detection datasets don’t provide this information, so we use the Python Faker library to generate a set of transactions covering a 5-month period. This dataset contains 5.4 million transactions spread across 10,000 unique (and fake) credit card numbers, and is intentionally imbalanced to match the reality of credit card fraud (only 0.25% of the transactions are fraudulent). We vary the number of transactions per day per card, as well as the transaction amounts. See our GitHub repo for more details.
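
The following is a minimal sketch of the idea behind the generator (see the GitHub repo for the actual code); the card count, date range, amounts, and fraud rate here are illustrative, and the real generator injects bursts of fraudulent transactions per card rather than independent random labels.

import random
from datetime import datetime, timedelta
from faker import Faker

# Minimal sketch: generate synthetic transactions keyed by fake card numbers.
fake = Faker()
cards = [fake.credit_card_number() for _ in range(100)]
start = datetime(2023, 1, 1)

transactions = []
for _ in range(10_000):
    transactions.append(
        {
            "cc_num": random.choice(cards),
            "trans_time": fake.date_time_between(
                start_date=start, end_date=start + timedelta(days=150)
            ),
            "amount": round(random.uniform(1.0, 500.0), 2),
            # roughly 0.25% of transactions labeled fraudulent (illustrative;
            # the actual generator simulates bursts of fraud per card)
            "fraud_label": 1 if random.random() < 0.0025 else 0,
        }
    )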

Overview of the solution

We want our fraud detection model to classify credit card transactions by noticing a burst of recent transactions that differs significantly from the cardholder’s usual spending pattern. Sounds simple enough, but how do we build it?

The following diagram shows our overall solution architecture. We feel that this same pattern will work well for a variety of streaming aggregation use cases. At a high level, the pattern involves the following five pieces:

  1. Feature store – We use Feature Store to provide a repository of features with high-throughput writes and secure low-latency reads, using feature values that are organized into multiple feature groups.
  2. Batch ingestion – Batch ingestion takes labeled historical credit card transactions and creates the aggregate features and ratios needed for training the fraud detection model. We use an Amazon SageMaker Processing job and the built-in Spark container to calculate aggregate weekly counts and transaction amount averages and ingest them into the feature store for use in online inference.
  3. Model training and deployment – This aspect of our solution is straightforward. We use Amazon SageMaker to train a model using the built-in XGBoost algorithm on aggregated features created from historical transactions. The model is deployed to a SageMaker endpoint, where it handles fraud detection requests on live transactions.
  4. Streaming ingestion – An Amazon Kinesis Data Analytics for Apache Flink application backed by Apache Kafka topics in Amazon Managed Streaming for Apache Kafka (Amazon MSK) calculates aggregated features from a transaction stream, and an AWS Lambda function updates the online feature store. Apache Flink is a popular framework and engine for processing data streams.
  5. Streaming predictions – Lastly, we make fraud predictions on a stream of transactions, using Lambda to pull aggregate features from the online feature store. We use the latest feature data to calculate transaction ratios and then call the fraud detection endpoint.

Prerequisites

We provide an AWS CloudFormation template to create the prerequisite resources for this solution. The following table lists the stacks available for different Regions.

AWS Region
us-east-1
us-east-2
us-west-1
eu-west-1
ap-northeast-1

In the following sections, we explore each component of our solution in more detail.

Feature store

ML models rely on well-engineered features coming from a variety of data sources, with transformations as simple as calculations or as complicated as a multi-step pipeline that takes hours of compute time and complex coding. Feature Store enables the reuse of these features across teams and models, which improves data scientist productivity, speeds up time to market, and ensures consistency of model input.

Each feature inside Feature Store is organized into a logical grouping called a feature group. You decide which feature groups you need for your models. Each one can have dozens, hundreds, or even thousands of features. Feature groups are managed and scaled independently, but they’re all available for search and discovery across teams of data scientists responsible for many independent ML models and use cases.

ML models often require features from multiple feature groups. A key aspect of a feature group is how often its feature values need to be updated or materialized for downstream training or inference. You refresh some features hourly, nightly, or weekly, and a subset of features must be streamed to the feature store in near-real time. Streaming all feature updates would lead to unnecessary complexity, and could even lower the quality of data distributions by not giving you the chance to remove outliers.

In our use case, we create a feature group called cc-agg-batch-fg for aggregated credit card features updated in batch, and one called cc-agg-fg for streaming features.

The cc-agg-batch-fg feature group is updated nightly, and provides aggregate features looking back over a 1-week time window. Recalculating 1-week aggregations on streaming transactions doesn’t offer meaningful signals, and would be a waste of resources.

Conversely, our cc-agg-fg feature group must be updated in a streaming fashion, because it offers the latest transaction counts and average transaction amounts looking back over a 10-minute time window. Without streaming aggregation, we couldn’t spot the typical fraud attack pattern of a rapid sequence of purchases.

By isolating features that are recalculated nightly, we can improve ingestion throughput for our streaming features. Separation lets us optimize the ingestion for each group independently. When designing for your use cases, keep in mind that models requiring features from a large number of feature groups may want to make multiple retrievals from the feature store in parallel to avoid adding excessive latency to a real-time prediction workflow.

The feature groups for our use case are shown in the following table.

cc-agg-fg cc-agg-batch-fg
cc_num (record id) cc_num (record id)
trans_time trans_time
num_trans_last_10m num_trans_last_1w
avg_amt_last_10m avg_amt_last_1w

Each feature group must have one feature used as a record identifier (for this post, the credit card number). The record identifier acts as a primary key for the feature group, enabling fast lookups as well as joins across feature groups. An event time feature is also required, which enables the feature store to track the history of feature values over time. This becomes important when looking back at the state of features at a specific point in time.

In each feature group, we track the number of transactions per unique credit card and its average transaction amount. The only difference between our two groups is the time window used for aggregation. We use a 10-minute window for streaming aggregation, and a 1-week window for batch aggregation.

With Feature Store, you have the flexibility to create feature groups that are offline only, online only, or both online and offline. An online store provides high-throughput writes and low-latency retrievals of feature values, which is ideal for online inference. An offline store is provided using Amazon Simple Storage Service (Amazon S3), giving firms a highly scalable repository, with a full history of feature values, partitioned by feature group. The offline store is ideal for training and batch scoring use cases.

When you enable a feature group to provide both online and offline stores, SageMaker automatically synchronizes feature values to an offline store, continuously appending the latest values to give you a full history of values over time. Another benefit of feature groups that are both online and offline is that they help avoid the problem of training and inference skew. SageMaker lets you feed both training and inference with the same transformed feature values, ensuring consistency to drive more accurate predictions. The focus in our post is to demonstrate online feature streaming, so we implemented online-only feature groups.
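
For reference, the following is a minimal sketch of how the streaming feature group could be defined with the Boto3 CreateFeatureGroup API; the feature names follow the preceding table, and the feature types are assumptions for illustration.

import boto3

# Minimal sketch: define the online-only cc-agg-fg feature group.
sagemaker_client = boto3.client("sagemaker")
sagemaker_client.create_feature_group(
    FeatureGroupName="cc-agg-fg",
    RecordIdentifierFeatureName="cc_num",
    EventTimeFeatureName="trans_time",
    OnlineStoreConfig={"EnableOnlineStore": True},   # online store only
    FeatureDefinitions=[
        {"FeatureName": "cc_num", "FeatureType": "String"},
        {"FeatureName": "trans_time", "FeatureType": "Fractional"},
        {"FeatureName": "num_trans_last_10m", "FeatureType": "Integral"},
        {"FeatureName": "avg_amt_last_10m", "FeatureType": "Fractional"},
    ],
)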

Batch ingestion

To materialize our batch features, we create a feature pipeline that runs as a SageMaker Processing job on a nightly basis. The job has two responsibilities: producing the dataset for training our model, and populating the batch feature group with the most up-to-date values for aggregate 1-week features, as shown in the following diagram.

Each historical transaction used in the training set is enriched with aggregated features for the specific credit card involved in the transaction. We look back over two separate sliding time windows: 1 week back, and the preceding 10 minutes. The actual features used to train the model include the following ratios of these aggregated values:

  • amt_ratio1 = avg_amt_last_10m / avg_amt_last_1w
  • amt_ratio2 = transaction_amount / avg_amt_last_1w
  • count_ratio = num_trans_last_10m / num_trans_last_1w

For example, count_ratio is the transaction count from the prior 10 minutes divided by the transaction count from the last week.

Our ML model can learn patterns of normal activity vs. fraudulent activity from these ratios, rather than relying on raw counts and transaction amounts. Spending patterns on different cards vary greatly, so normalized ratios provide a better signal to the model than the aggregated amounts themselves.

You may be wondering why our batch job is computing features with a 10-minute lookback. Isn’t that only relevant for online inference? We need the 10-minute window on historical transactions to create an accurate training dataset. This is critical for ensuring consistency with the 10-minute streaming window that will be used in near-real time to support online inference.
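
The following is a minimal pandas sketch of the idea behind these per-card sliding-window aggregates and ratio features (the solution itself uses a SageMaker Processing job with Spark); the sample rows and helper names are illustrative.

import pandas as pd

# Minimal sketch: compute per-card sliding-window aggregates and ratios.
df = pd.DataFrame(
    {
        "cc_num": ["1248", "1248", "1248", "9843"],
        "trans_time": pd.to_datetime(
            ["2023-11-04 02:12:10", "2023-11-04 02:13:34",
             "2023-11-04 02:14:05", "2023-11-04 10:30:00"]
        ),
        "amount": [1.01, 22.55, 90.55, 45.00],
    }
).sort_values("trans_time")

def add_window_features(group, window, suffix):
    group = group.copy()
    rolled = group.rolling(window, on="trans_time")["amount"]
    group[f"num_trans_last_{suffix}"] = rolled.count()
    group[f"avg_amt_last_{suffix}"] = rolled.mean()
    return group

df = df.groupby("cc_num", group_keys=False).apply(add_window_features, "10min", "10m")
df = df.groupby("cc_num", group_keys=False).apply(add_window_features, "7D", "1w")

# Ratios used as model inputs
df["amt_ratio1"] = df["avg_amt_last_10m"] / df["avg_amt_last_1w"]
df["amt_ratio2"] = df["amount"] / df["avg_amt_last_1w"]
df["count_ratio"] = df["num_trans_last_10m"] / df["num_trans_last_1w"]
print(df)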

The resulting training dataset from the processing job can be saved directly as a CSV for model training, or it can be bulk ingested into an offline feature group that can be used for other models and by other data science teams to address a wide variety of other use cases. For example, we can create and populate a feature group called cc-transactions-fg. Our training job can then pull a specific training dataset based on the needs for our specific model, selecting specific date ranges and a subset of features of interest. This approach enables multiple teams to reuse feature groups and maintain fewer feature pipelines, leading to significant cost savings and productivity improvements over time. This example notebook demonstrates the pattern of using Feature Store as a central repository from which data scientists can extract training datasets.

In addition to creating a training dataset, we use the PutRecord API to put the 1-week feature aggregations into the online feature store nightly. The following code demonstrates putting a record into an online feature group given specific feature values, including a record identifier and an event time:

import time
import boto3

feature_store_client = boto3.client('sagemaker-featurestore-runtime')

record = [{'FeatureName': 'cc_num',
           'ValueAsString': str(cc_num)},
          {'FeatureName': 'avg_amt_last_1w',
           'ValueAsString': str(avg_amt_last_1w)},
          {'FeatureName': 'num_trans_last_1w',
           'ValueAsString': str(num_trans_last_1w)}]
event_time_feature = {'FeatureName': 'trans_time',
                      'ValueAsString': str(int(round(time.time())))}
record.append(event_time_feature)
response = feature_store_client.put_record(
    FeatureGroupName='cc-agg-batch-fg', Record=record)

ML engineers often build a separate version of feature engineering code for online features based on the original code written by data scientists for model training. This can deliver the desired performance, but is an extra development step, and introduces more chance for training and inference skew. In our use case, we show how using SQL for aggregations can enable a data scientist to provide the same code for both batch and streaming.

Streaming ingestion

Feature Store delivers single-digit millisecond retrieval of pre-calculated features, and it can also play an effective role in solutions requiring streaming ingestion. Our use case demonstrates both. Weekly lookback is handled as a pre-calculated feature group, materialized nightly as shown earlier. Now let’s dive into how we calculate features aggregated on the fly over a 10-minute window and ingest them into the feature store for later online inference.

In our use case, we ingest live credit card transactions to a source MSK topic, and use a Kinesis Data Analytics for Apache Flink application to create aggregate features in a destination MSK topic. The application is written using Apache Flink SQL. Flink SQL makes it simple to develop streaming applications using standard SQL; because it remains ANSI-SQL 2011 compliant, it’s easy to learn if you have ever worked with a database or SQL-like system. Apart from SQL, we can also build Java and Scala applications in Amazon Kinesis Data Analytics using open-source libraries based on Apache Flink. We then use a Lambda function to read the destination MSK topic and ingest the aggregate features into a SageMaker feature group for inference.

To produce aggregate counts and average amounts looking back over a 10-minute window, we use the following Flink SQL query on the input topic and pipe the results to the destination topic:

SELECT 
 cc_num, 
 COUNT(*) OVER LAST_10_MINUTES as cc_count,
 AVG(amount) OVER LAST_10_MINUTES as avg_amount
FROM 
 cctopic
WINDOW LAST_10_MINUTES AS (
 PARTITION BY cc_num
 ORDER BY proc_ts
 RANGE INTERVAL '10' MINUTE PRECEDING
 );

The results of the query look like the following:

cc_num amount datetime num_trans_last_10m avg_amt_last_10m
…1248 50.00 Nov-01,22:01:00 1 74.99
…9843 99.50 Nov-01,22:02:30 1 99.50
…7403 100.00 Nov-01,22:03:48 1 100.00
…1248 200.00 Nov-01,22:03:59 2 125.00
…0732 26.99 Nov-01,22:04:15 1 26.99
…1248 50.00 Nov-01,22:04:28 3 100.00
…1248 500.00 Nov-01,22:05:05 4 200.00

In this example, notice that the final row has a count of four transactions in the last 10 minutes from the credit card ending with 1248, and a corresponding average transaction amount of $200.00. The SQL query is consistent with the one used to drive creation of our training dataset, helping to avoid training and inference skew.

As transactions stream into the Kinesis Data Analytics for Apache Flink aggregation app, the app sends the aggregate results to our Lambda function, as shown in the following diagram. The Lambda function takes these features and populates the cc-agg-fg feature group.

We send the latest feature values to the feature store from Lambda using a simple call to the PutRecord API. The following is the core piece of Python code for storing the aggregate features:

import time

import boto3

# Runtime client for the low-latency Feature Store PutRecord and GetRecord APIs
featurestore_runtime = boto3.client(service_name='sagemaker-featurestore-runtime')

# cc_num, avg_amt_last_10m, and num_trans_last_10m come from the aggregate message
# read off the destination MSK topic
record = [{'FeatureName': 'cc_num', 
           'ValueAsString': str(cc_num)},
          {'FeatureName': 'avg_amt_last_10m', 
           'ValueAsString': str(avg_amt_last_10m)},
          {'FeatureName': 'num_trans_last_10m', 
           'ValueAsString': str(num_trans_last_10m)},
          {'FeatureName': 'evt_time', 
           'ValueAsString': str(int(round(time.time())))}]
featurestore_runtime.put_record(FeatureGroupName='cc-agg-fg', 
                                Record=record)

We prepare the record as a list of named value pairs, including the current time as the event time. The Feature Store API ensures that this new record follows the schema that we identified when we created the feature group. If a record for this primary key already existed, it is now overwritten in the online store.
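
For reference, the following is a minimal sketch of how the Lambda handler might unpack the MSK trigger event before building the record shown above. The JSON message shape (cc_num, cc_count, avg_amount) is an assumption based on the aliases produced by the Flink SQL query, and ingest_aggregate_features is a hypothetical wrapper around the PutRecord call shown earlier, not the exact code from our function.

import base64
import json

def lambda_handler(event, context):
    # MSK triggers deliver records grouped by topic-partition, with base64-encoded values
    for records in event['records'].values():
        for rec in records:
            agg = json.loads(base64.b64decode(rec['value']))
            # Field names follow the aliases produced by the Flink SQL query;
            # ingest_aggregate_features is a hypothetical wrapper around put_record
            ingest_aggregate_features(cc_num=agg['cc_num'],
                                      num_trans_last_10m=agg['cc_count'],
                                      avg_amt_last_10m=agg['avg_amount'])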

Streaming predictions

Now that we have streaming ingestion keeping the feature store up to date with the latest feature values, let’s look at how we make fraud predictions.

We create a second Lambda function that uses the source MSK topic as a trigger. For each new transaction event, the Lambda function first retrieves the batch and streaming features from Feature Store. To detect anomalies in credit card behavior, our model looks for spikes in recent purchase amounts or purchase frequency. The Lambda function computes simple ratios between the 1-week aggregations and the 10-minute aggregations. It then invokes the SageMaker model endpoint using those ratios to make the fraud prediction, as shown in the following diagram.

We use the following code to retrieve feature values on demand from the feature store before calling the SageMaker model endpoint:

featurestore_runtime = boto3.client(service_name='sagemaker-featurestore-runtime')
response = featurestore_runtime.get_record(
    FeatureGroupName=feature_group_name,
    RecordIdentifierValueAsString=record_identifier_value)

SageMaker also supports retrieving multiple feature records with a single call, even if they are from different feature groups.
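
As a rough sketch of how that could look for our use case, the following uses the BatchGetRecord API to fetch both the 10-minute aggregates and the weekly lookback features in one call and then derives the ratio features described earlier. The weekly feature group name (cc-agg-batch-fg) and the 1-week feature names are illustrative assumptions, not the exact names used in our solution.

import boto3

featurestore_runtime = boto3.client(service_name='sagemaker-featurestore-runtime')

def get_ratio_features(cc_num):
    # Fetch the streaming (10-minute) and batch (1-week) aggregates in a single call
    response = featurestore_runtime.batch_get_record(
        Identifiers=[
            {'FeatureGroupName': 'cc-agg-fg',
             'RecordIdentifiersValueAsString': [str(cc_num)]},
            {'FeatureGroupName': 'cc-agg-batch-fg',  # illustrative name for the weekly feature group
             'RecordIdentifiersValueAsString': [str(cc_num)]},
        ])
    features = {f['FeatureName']: f['ValueAsString']
                for rec in response['Records']
                for f in rec['Record']}
    # Ratios of recent behavior to the 1-week baseline (1-week feature names are illustrative)
    amt_ratio = float(features['avg_amt_last_10m']) / float(features['avg_amt_last_1w'])
    count_ratio = float(features['num_trans_last_10m']) / float(features['num_trans_last_1w'])
    return [amt_ratio, count_ratio]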

Finally, with the model input feature vector assembled, we call the model endpoint to predict if a specific credit card transaction is fraudulent.

import json

import boto3

sagemaker_runtime = boto3.client(service_name='runtime.sagemaker')

# features is the assembled list of model input values (as strings), serialized as CSV
request_body = ','.join(features)
response = sagemaker_runtime.invoke_endpoint(
    EndpointName=ENDPOINT_NAME,
    ContentType='text/csv',
    Body=request_body)
probability = json.loads(response['Body'].read().decode('utf-8'))

In this example, the model came back with a probability of 98% that the specific transaction was fraudulent, and it was able to use near-real-time aggregated input features based on the most recent 10 minutes of transactions on that credit card.

Test the end-to-end solution

To demonstrate the full end-to-end workflow of our solution, we simply send credit card transactions into our MSK source topic. Our automated Kinesis Data Analytics for Apache Flink aggregation takes over from there, maintaining a near-real-time view of transaction counts and amounts in Feature Store, with a sliding 10-minute lookback window. These features are combined with the 1-week aggregate features that were already ingested to the feature store in batch, letting us make fraud predictions on each transaction.
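
If you want to reproduce this test yourself, a minimal sketch of a transaction producer might look like the following. It assumes the kafka-python client and a source topic named cctopic to match the Flink SQL table above; the broker addresses, authentication settings, and message schema are placeholders that depend on your MSK configuration.

import json

from kafka import KafkaProducer  # pip install kafka-python

# Broker addresses are placeholders; authentication settings depend on your MSK cluster
producer = KafkaProducer(
    bootstrap_servers=['<broker-1>:9092', '<broker-2>:9092'],
    value_serializer=lambda v: json.dumps(v).encode('utf-8'),
)

def send_transaction(cc_num, amount):
    # Message schema is illustrative; it must match what the Flink SQL source table expects
    producer.send('cctopic', {'cc_num': cc_num, 'amount': amount})

# Simulate a fraud attack: many back-to-back transactions on the same card
for _ in range(10):
    send_transaction('0123456789001248', 500.00)
producer.flush()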

We send a single transaction from three different credit cards. We then simulate a fraud attack on a fourth credit card by sending many back-to-back transactions in seconds. The output from our Lambda function is shown in the following screenshot. As expected, the first three one-off transactions are predicted as NOT FRAUD. Of the 10 fraudulent transactions, the first is predicted as NOT FRAUD, and the rest are all correctly identified as FRAUD. Notice how the aggregate features are kept current, helping drive more accurate predictions.

Conclusion

We have shown how Feature Store can play a key role in the solution architecture for critical operational workflows that need streaming aggregation and low-latency inference. With an enterprise-ready feature store in place, you can use both batch ingestion and streaming ingestion to feed feature groups, and access feature values on demand to perform online predictions for significant business value. ML features can now be shared at scale across many teams of data scientists and thousands of ML models, improving data consistency, model accuracy, and data scientist productivity. Feature Store is available now, and you can try out this entire example. Let us know what you think.

Special thanks to everyone who contributed to the previous blog post with a similar architecture: Paul Hargis, James Leoni, and Arunprasath Shankar.


About the Authors

Mark Roy is a Principal Machine Learning Architect for AWS, helping customers design and build AI/ML solutions. Mark’s work covers a wide range of ML use cases, with a primary interest in feature stores, computer vision, deep learning, and scaling ML across the enterprise. He has helped companies in many industries, including insurance, financial services, media and entertainment, healthcare, utilities, and manufacturing. Mark holds six AWS certifications, including the ML Specialty Certification. Prior to joining AWS, Mark was an architect, developer, and technology leader for over 25 years, including 19 years in financial services.

Raj Ramasubbu is a Senior Analytics Specialist Solutions Architect focused on big data and analytics and AI/ML with Amazon Web Services. He helps customers architect and build highly scalable, performant, and secure cloud-based solutions on AWS. Raj provided technical expertise and leadership in building data engineering, big data analytics, business intelligence, and data science solutions for over 18 years prior to joining AWS. He helped customers in various industry verticals like healthcare, medical devices, life science, retail, asset management, car insurance, residential REIT, agriculture, title insurance, supply chain, document management, and real estate.

Prabhakar Chandrasekaran is a Senior Technical Account Manager with AWS Enterprise Support. Prabhakar enjoys helping customers build cutting-edge AI/ML solutions on the cloud. He also works with enterprise customers providing proactive guidance and operational assistance, helping them improve the value of their solutions when using AWS. Prabhakar holds six AWS and six other professional certifications. With over 20 years of professional experience, Prabhakar was a data engineer and a program leader in the financial services space prior to joining AWS.

Read More

How Sportradar used the Deep Java Library to build production-scale ML platforms for increased performance and efficiency

How Sportradar used the Deep Java Library to build production-scale ML platforms for increased performance and efficiency

This is a guest post co-written with Fred Wu from Sportradar.

Sportradar is the world’s leading sports technology company, at the intersection of sports, media, and betting. More than 1,700 sports federations, media outlets, betting operators, and consumer platforms across 120 countries rely on Sportradar know-how and technology to boost their business.

Sportradar uses data and technology to:

  • Keep betting operators ahead of the curve with the products and services they need to manage their sportsbook
  • Give media companies the tools to engage more with fans
  • Give teams, leagues, and federations the data they need to thrive
  • Keep the industry clean by detecting and preventing fraud, doping, and match fixing

This post demonstrates how Sportradar used Amazon’s Deep Java Library (DJL) on AWS alongside Amazon Elastic Kubernetes Service (Amazon EKS) and Amazon Simple Storage Service (Amazon S3) to build a production-ready machine learning (ML) inference solution that preserves essential tooling in Java, optimizes operational efficiency, and increases the team’s productivity by providing better performance and accessibility to logs and system metrics.

The DJL is a deep learning framework built from the ground up to support users of Java and JVM languages like Scala, Kotlin, and Clojure. Most deep learning frameworks today are built for Python, which overlooks the large number of Java developers and the teams with existing Java code bases that want to integrate the increasingly powerful capabilities of deep learning. With the DJL, that integration is simple.

In this post, the Sportradar team discusses the challenges they encountered and the solutions they created to build their model inference platform using the DJL.

Business requirements

We are the US squad of the Sportradar AI department. Since 2018, our team has been developing a variety of ML models to enable betting products for NFL and NCAA football. We recently developed four more models.

The fourth down decision models for the NFL and NCAA predict the probabilities of the outcome of a fourth down play. The outcome could be a field goal attempt, a regular offensive play (going for it), or a punt.

The drive outcome models for the NFL and NCAA predict the probabilities of the outcome of the current drive. A drive outcome could be an end of half, field goal attempt, touchdown, turnover, turnover on downs, or punt.

Our models are the building blocks of other models that generate a list of live betting markets, including spread, total, win probability, next score type, next team to score, and more.

The business requirements for our models are as follows:

  • The model predictor should be able to load the pre-trained model file one time, then make predictions on many plays
  • We have to generate the probabilities for each play under 50-millisecond latency
  • The model predictor (feature extraction and model inference) has to be written in Java, so that the other team can import it as a Maven dependency

Challenges with the in-place system

The main challenge is bridging the gap between model training in Python and model inference in Java. Our data scientists train the models in Python using tools like PyTorch and save them as PyTorch scripts. Our original plan was to also host the models in Python and use gRPC to communicate with another service, which would use the Java gRPC client to send requests.

However, this solution came with a few issues. Mainly, we saw network overhead between the two services running in separate runtime environments or pods, which resulted in higher latency. But the maintenance overhead was the main reason we abandoned this solution. We had to build the gRPC server and the client program separately and keep the protocol buffer files consistent and up to date. Then we needed to Dockerize the application, write a deployment YAML file, deploy the gRPC server to our Kubernetes cluster, and make sure it was reliable and auto-scalable.

Another problem was whenever an error occurred on the gRPC server side, the application client only got a vague error message instead of a detailed error traceback. The client had to reach out to the gRPC server maintainer to learn exactly which part of the code caused the error.

Ideally, we instead want to load the model PyTorch scripts, extract the features from model input, and run model inference entirely in Java. Then we can build and publish it as a Maven library, hosted on our internal registry, which our service team could import into their own Java projects. When we did our research online, the Deep Java Library showed up on the top. After reading a few blog posts and DJL’s official documentation, we were sure DJL would provide the best solution to our problem.

Solution overview

The following diagram compares the previous and updated architecture.

The following diagram outlines the workflow of the DJL solution.

workflow

The steps are as follows:

  1. Training the models – Our data scientists train the models using PyTorch and save the models as torch scripts. These models are then pushed to an Amazon Simple Storage Service (Amazon S3) bucket using DVC, a version control tool for ML models.
  2. Implementing feature extraction and feeding ML features – The framework team pulls the models from Amazon S3 into a Java repository where they implement feature extraction and feed ML features into the predictor. They use the DJL PyTorch engine to initialize the model predictor.
  3. Packaging and publishing the inference code and models – The GitLab CI/CD pipeline packages and publishes the JAR file that contains the inference code and models to an internal Apache Archiva registry.
  4. Importing the inference library and making calls – The Java client imports the inference library as a Maven dependency. All inference calls are made via Java function calls within the same Kubernetes pod. Because there are no gRPC calls, the inferencing response time is improved. Furthermore, the Java client can easily roll back the inference library to a previous version if needed. In contrast, the server-side error is not transparent for the client side in gRPC-based solutions, making error tracking difficult.

We have seen a stable inferencing runtime and reliable prediction results. The DJL solution offers several advantages over gRPC-based solutions:

  • Improved response time – With no gRPC calls, the inferencing response time is improved
  • Easy rollbacks and upgrades – The Java client can easily roll back the inference library to a previous version or upgrade to a new version
  • Transparent error tracking – In the DJL solution, the client can receive detailed error traceback messages in case of inferencing errors

Deep Java Library overview

The DJL is a full deep learning framework that supports the deep learning lifecycle from building a model and training it on a dataset to deploying it in production. It has intuitive helpers and utilities for modalities like computer vision, natural language processing, audio, time series, and tabular data. DJL also features an easy-to-use model zoo of hundreds of pre-trained models that can be used out of the box and integrated into existing systems.

It is also a fully Apache-2.0 licensed open-source project and can be found on GitHub. The DJL was created at Amazon and open-sourced in 2019. Today, DJL’s open-source community is led by Amazon and has grown to include contributors from many countries, companies, and educational institutions. The DJL continues to grow in its ability to support different hardware, models, and engines. It also includes support for new hardware like ARM (both in servers like AWS Graviton and laptops with Apple M1) and AWS Inferentia.

The architecture of DJL is engine agnostic. It aims to be an interface describing what deep learning could look like in the Java language, but leaves room for multiple different implementations that could provide different capabilities or hardware support. Most popular frameworks today such as PyTorch and TensorFlow are built using a Python front end that connects to a high-performance C++ native backend. The DJL can use this to connect to these same native backends to take advantage of their work on hardware support and performance.

For this reason, many DJL users also use it for inference only. That is, they will train a model using Python and then load it using the DJL for deployment as part of their existing Java production system. Because the DJL utilizes the same engine that powers Python, it’s able to run without any decrease in performance or loss in accuracy. This is exactly the strategy that we found to support the new models.

The following diagram illustrates the workflow under the hood.

djl

When the DJL loads, it finds all the engine implementations available in the class path using Java’s ServiceLoader. In this case, it detects the DJL PyTorch engine implementation, which will act as the bridge between the DJL API and the PyTorch Native.

The engine then works to load the PyTorch Native. By default, it downloads the appropriate native binary based on your OS, CPU architecture, and CUDA version, making it almost effortless to use. You can also provide the binary using one of the many available native JAR files, which are more reliable for production environments that often have limited network access for security.

Once loaded, the DJL uses the Java Native Interface to translate all the easy high-level functionalities in DJL into the equivalent low-level native calls. Every operation in the DJL API is hand-crafted to best fit the Java conventions and make it easily accessible. This also includes dealing with native memory, which is not supported by the Java Garbage Collector.

Although all these details are within the library, calling it from a user standpoint couldn’t be easier. In the following section, we walk through this process.

How Sportradar implemented DJL

Because we train our models using PyTorch, we use the DJL’s PyTorch engine for the model inference.

Loading the model is incredibly easy. All it takes is to build a criteria describing the model to load and where it is from. Then, we load it and use the model to create a new predictor session. See the following code:

// Representative sketch of loading the TorchScript model with the DJL Criteria API;
// the model location variable is illustrative
Criteria<float[], Classifications> criteria = Criteria.builder()
        .setTypes(float[].class, Classifications.class)
        .optModelUrls(modelUrl)
        .optTranslator(new MyTranslator())
        .optEngine("PyTorch")
        .build();

ZooModel<float[], Classifications> model = criteria.loadModel();
Predictor<float[], Classifications> predictor = model.newPredictor();

For our model, we also have a custom translator, which we call MyTranslator. We use the translator to encapsulate the preprocessing code that converts from a convenient Java type into the input expected by the model and the postprocessing code that converts from the model output into a convenient output. In our case, we chose to use a float[] as the input type and the built-in DJL classifications as the output type.

It’s pretty amazing that with just a few lines of code, the DJL loads the PyTorch scripts and our custom translator, and then the predictor is ready to make the predictions.

Conclusion

Sportradar’s product built on the DJL solution went live before the 2022–23 NFL regular season started, and it has been running smoothly since then. In the future, Sportradar plans to re-platform existing models hosted on gRPC servers to the DJL solution.

The DJL continues to grow in many different ways. The most recent release, v0.21.0, has many improvements, including updated engine support, improvements on Spark, Hugging Face batch tokenizers, an NDScope for easier memory management, and enhancements to the time series API. It also has the first major release of DJL Zero, a new API that aims to support both using pre-trained models and training your own custom deep learning models, even with zero knowledge of deep learning.

The DJL also features a model server called DJL Serving. It makes it simple to host a model on an HTTP server from any of the 10 supported engines, including the Python engine to support Python code. DJL Serving v0.21.0 includes faster transformer support, Amazon SageMaker multi-model endpoint support, updates for Stable Diffusion, improvements for DeepSpeed, and updates to the management console. You can now use it to deploy large models with model parallel inference using DeepSpeed and SageMaker.

There is also much upcoming with the DJL. The largest area under development is large language model support for models like ChatGPT or Stable Diffusion. There is also work to support streaming inference requests in DJL Serving, as well as improvements to the demos and the Spark extension. Of course, there is also standard continuing work, including features, fixes, engine updates, and more.

For more information on the DJL and its other features, see Deep Java Library.

Follow our GitHub repo, demo repository, Slack channel, and Twitter for more documentation and examples of the DJL!


About the authors

Fred Wu is a Senior Data Engineer at Sportradar, where he leads infrastructure, DevOps, and data engineering efforts for various NBA and NFL products. With extensive experience in the field, Fred is dedicated to building robust and efficient data pipelines and systems to support cutting-edge sports analytics.

Zach Kimberg is a Software Developer in the Amazon AI org. He works to enable the development, training, and production inference of deep learning. There, he helped found and continues to develop the DeepJavaLibrary project.

Kanwaljit Khurmi is a Principal Solutions Architect at Amazon Web Services. He works with the AWS customers to provide guidance and technical assistance helping them improve the value of their solutions when using AWS. Kanwaljit specializes in helping customers with containerized and machine learning applications.

Read More

Financial text generation using a domain-adapted fine-tuned large language model in Amazon SageMaker JumpStart

Financial text generation using a domain-adapted fine-tuned large language model in Amazon SageMaker JumpStart

Large language models (LLMs) with billions of parameters are currently at the forefront of natural language processing (NLP). These models are shaking up the field with their incredible abilities to generate text, analyze sentiment, translate languages, and much more. With access to massive amounts of data, LLMs have the potential to revolutionize the way we interact with language. Although LLMs are capable of performing various NLP tasks, they are considered generalists and not specialists. In order to train an LLM to become an expert in a particular domain, fine-tuning is usually required.

One of the major challenges in training and deploying LLMs with billions of parameters is their size, which can make it difficult to fit them into single GPUs, the hardware commonly used for deep learning. The sheer scale of these models requires high-performance computing resources, such as specialized GPUs with large amounts of memory. Additionally, the size of these models can make them computationally expensive, which can significantly increase training and inference times.

In this post, we demonstrate how we can use Amazon SageMaker JumpStart to easily fine-tune a large language text generation model on a domain-specific dataset in the same way you would train and deploy any model on Amazon SageMaker. In particular, we show how you can fine-tune the GPT-J 6B language model for financial text generation using both the JumpStart SDK and Amazon SageMaker Studio UI on a publicly available dataset of SEC filings.

JumpStart helps you quickly and easily get started with machine learning (ML) and provides a set of solutions for the most common use cases that can be trained and deployed readily with just a few steps. All the steps in this demo are available in the accompanying notebook Fine-tuning text generation GPT-J 6B model on a domain specific dataset.

Solution overview

In the following sections, we provide a step-by-step demonstration for fine-tuning an LLM for text generation tasks via both the JumpStart Studio UI and Python SDK. In particular, we discuss the following topics:

  • An overview of the SEC filing data in the financial domain that the model is fine-tuned on
  • An overview of the LLM GPT-J 6B model we have chosen to fine-tune
  • A demonstration of two different ways we can fine-tune the LLM using JumpStart:
    • Use JumpStart programmatically with the SageMaker Python SDK
    • Access JumpStart using the Studio UI
  • An evaluation of the fine-tuned model by comparing it with the pre-trained model without fine-tuning

Fine-tuning refers to the process of taking a pre-trained language model and training it for a different but related task using specific data. This approach is also known as transfer learning, which involves transferring the knowledge learned from one task to another. LLMs like GPT-J 6B are trained on massive amounts of unlabeled data and can be fine-tuned on smaller datasets, making the model perform better in a specific domain.

As an example of how performance improves when the model is fine-tuned, consider asking it the following question:

“What drives sales growth at Amazon?”

Without fine-tuning, the response would be:

“Amazon is the world’s largest online retailer. It is also the world’s largest online marketplace. It is also the world”

With fine tuning, the response is:

“Sales growth at Amazon is driven primarily by increased customer usage, including increased selection, lower prices, and increased convenience, and increased sales by other sellers on our websites.”

The improvement from fine-tuning is evident.

We use financial text from SEC filings to fine-tune a GPT-J 6B LLM for financial applications. In the next sections, we introduce the data and the LLM that will be fine-tuned.

SEC filing dataset

SEC filings are critical for regulation and disclosure in finance. Filings notify the investor community about companies’ business conditions and the future outlook of the companies. The text in SEC filings covers the entire gamut of a company’s operations and business conditions. Because of their potential predictive value, these filings are good sources of information for investors. Although these SEC filings are publicly available to anyone, downloading parsed filings and constructing a clean dataset with added features is a time-consuming exercise. We make this possible in a few API calls in the JumpStart Industry SDK.

Using the SageMaker API, we downloaded annual reports (10-K filings; see How to Read a 10-K for more information) for a large number of companies. We select Amazon’s SEC filing reports for years 2021–2022 as the training data to fine-tune the GPT-J 6B model. In particular, we concatenate the company’s SEC filing reports from different years into a single text file, except for the “Management Discussion and Analysis” section, which contains forward-looking statements by the company’s management and is used as the validation data.

The expectation is that after fine-tuning the GPT-J 6B text generation model on the financial SEC documents, the model is able to generate insightful financial related textual output, and therefore can be used to solve multiple domain-specific NLP tasks.

GPT-J 6B large language model

GPT-J 6B is an open-source, 6-billion-parameter model released by Eleuther AI. GPT-J 6B has been trained on a large corpus of text data and is capable of performing various NLP tasks such as text generation, text classification, and text summarization. Although this model is impressive on a number of NLP tasks without the need for any fine-tuning, in many cases you will need to fine-tune the model on a specific dataset for the NLP tasks you are trying to solve. Use cases include custom chatbots, idea generation, entity extraction, classification, and sentiment analysis.

Access LLMs on SageMaker

Now that we have identified the dataset and the model we are going to fine-tune on, JumpStart provides two avenues to get started using text generation fine-tuning: the SageMaker SDK and Studio.

Use JumpStart programmatically with the SageMaker SDK

We now go over an example of how you can use the SageMaker JumpStart SDK to access an LLM (GPT-J 6B) and fine-tune it on the SEC filing dataset. Upon completion of fine-tuning, we will deploy the fine-tuned model and make inference against it. All the steps in this post are available in the accompanying notebook: Fine-tuning text generation GPT-J 6B model on domain specific dataset.

In this example, JumpStart uses the SageMaker Hugging Face Deep Learning Container (DLC) and DeepSpeed library to fine-tune the model. The DeepSpeed library is designed to reduce computing power and memory use and to train large distributed models with better parallelism on existing computer hardware. It supports single node distributed training, utilizing gradient checkpointing and model parallelism to train large models on a single SageMaker training instance with multiple GPUs. With JumpStart, we integrate the DeepSpeed library with the SageMaker Hugging Face DLC for you and take care of everything under the hood. You can easily fine-tune the model on your domain-specific dataset without manually setting it up.

Fine-tune the pre-trained model on domain-specific data

To fine-tune a selected model, we need to get that model’s URI, as well as the training script and the container image used for training. To make things easy, these three inputs depend solely on the model name, version (for a list of the available models, see Built-in Algorithms with pre-trained Model Table), and the type of instance you want to train on. This is demonstrated in the following code snippet:

from sagemaker import image_uris, model_uris, script_uris, hyperparameters

model_id, model_version = "huggingface-textgeneration1-gpt-j-6b", "*"
training_instance_type = "ml.g5.12xlarge"

# Retrieve the docker image
train_image_uri = image_uris.retrieve(
    region=None,
    framework=None,
    model_id=model_id,
    model_version=model_version,
    image_scope="training",
    instance_type=training_instance_type,
)

# Retrieve the training script
train_source_uri = script_uris.retrieve(
    model_id=model_id, model_version=model_version, script_scope="training"
)

# Retrieve the pre-trained model tarball to further fine-tune
train_model_uri = model_uris.retrieve(
    model_id=model_id, model_version=model_version, model_scope="training"
)

We retrieve the model_id corresponding to the same model we want to use. In this case, we fine-tune huggingface-textgeneration1-gpt-j-6b.

Defining hyperparameters involves setting the values for various parameters used during the training process of an ML model. These parameters can affect the model’s performance and accuracy. In the following step, we establish the hyperparameters by utilizing the default settings and specifying custom values for parameters such as epochs and learning_rate:

from sagemaker import hyperparameters

# Retrieve the default hyper-parameters for fine-tuning the model
hyperparameters = hyperparameters.retrieve_default(model_id=model_id, model_version=model_version)

# [Optional] Override default hyperparameters with custom values
hyperparameters["epochs"] = "6"

hyperparameters["learning_rate"] = "2e-04"
print(hyperparameters)

JumpStart provides an extensive list of hyperparameters available to tune. The following list provides an overview of part of the key hyperparameters utilized in fine-tuning the model. For a full list of hyperparameters, see the notebook Fine-tuning text generation GPT-J 6B model on domain specific dataset.

  • epochs – The maximum number of epochs (complete passes over the training dataset) to run during fine-tuning.
  • learning_rate – Controls the step size or learning rate of the optimization algorithm during training.
  • eval_steps – Specifies how many steps to run before evaluating the model on the validation set during training. The validation set is a subset of the data that is not used for training, but instead is used to check the performance of the model on unseen data.
  • weight_decay – Controls the regularization strength during model training. Regularization is a technique that helps prevent the model from overfitting the training data, which can result in better performance on unseen data.
  • fp16 – Controls whether to use fp16 16-bit (mixed) precision training instead of 32-bit training.
  • evaluation_strategy – The evaluation strategy used during training.
  • gradient_accumulation_steps – The number of steps to accumulate gradients for before performing a backward/update pass.

For further details regarding hyperparameters, refer to the official Hugging Face Trainer documentation.

You can now fine-tune this JumpStart model on your own custom dataset using the SageMaker SDK. We use the SEC filing data we described earlier. The train and validation data is hosted under train_dataset_s3_path and validation_dataset_s3_path. The supported format of the data includes CSV, JSON, and TXT. For the CSV and JSON data, the text data is used from the column called text or the first column if no column called text is found. Because this is for text generation fine-tuning, no ground truth labels are required. The following code is an SDK example of how to fine-tune the model:

import sagemaker
from sagemaker.estimator import Estimator
from sagemaker.utils import name_from_base
from sagemaker.tuner import HyperparameterTuner
from sagemaker.huggingface import HuggingFace

train_dataset_s3_path = "s3://jumpstart-cache-prod-us-west-2/training-datasets/tc/data.csv"
validation_dataset_s3_path = "s3://jumpstart-cache-prod-us-west-2/training-datasets/tc/data.csv"

# The execution role and an S3 output location are assumed to be defined, as in the
# accompanying notebook; the output prefix below is illustrative
aws_role = sagemaker.get_execution_role()
s3_output_location = f"s3://{sagemaker.Session().default_bucket()}/jumpstart-example-output"

training_job_name = name_from_base(f"jumpstart-example-{model_id}")

metric_definitions=[
    {'Name': 'train:loss', 'Regex': "'loss': ([0-9]+.[0-9]+)"},
    {'Name': 'eval:loss', 'Regex': "'eval_loss': ([0-9]+.[0-9]+)"},
    {'Name': 'eval:runtime', 'Regex': "'eval_runtime': ([0-9]+.[0-9]+)"},
    {'Name': 'eval:samples_per_second', 'Regex': "'eval_samples_per_second': ([0-9]+.[0-9]+)"},
    {'Name': 'eval:eval_steps_per_second', 'Regex': "'eval_steps_per_second': ([0-9]+.[0-9]+)"},
]

# Create SageMaker Estimator instance
tg_estimator = Estimator(
    role=aws_role,
    image_uri=train_image_uri,
    source_dir=train_source_uri,
    model_uri=train_model_uri,
    entry_point="transfer_learning.py",
    instance_count=1,
    instance_type=training_instance_type,
    hyperparameters=hyperparameters,
    output_path=s3_output_location,
    base_job_name=training_job_name,
    enable_network_isolation=True,
    metric_definitions=metric_definitions
)

# Launch a SageMaker Training job by passing s3 path of the training data
tg_estimator.fit({"train": train_dataset_s3_path, "validation": validation_dataset_s3_path}, logs=True)

After we have set up the SageMaker Estimator with the required hyperparameters, we call the .fit method to start fine-tuning our model, passing it the Amazon Simple Storage Service (Amazon S3) URIs for our training and validation data. As you can see, the entry_point script provided is named transfer_learning.py (the same for other tasks and models), and the input data channels passed to .fit must be named train and validation.

JumpStart also supports hyperparameter optimization with SageMaker automatic model tuning. For details, see the example notebook.
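
As a rough illustration of what that could look like for this estimator, we could minimize the eval:loss metric defined earlier. The search range and job counts below are arbitrary choices for the sketch, not the settings used in the example notebook.

from sagemaker.tuner import ContinuousParameter, HyperparameterTuner

# Illustrative search space; the example notebook defines the exact ranges
hyperparameter_ranges = {
    "learning_rate": ContinuousParameter(1e-5, 1e-3, scaling_type="Logarithmic"),
}

tuner = HyperparameterTuner(
    estimator=tg_estimator,
    objective_metric_name="eval:loss",
    objective_type="Minimize",
    hyperparameter_ranges=hyperparameter_ranges,
    metric_definitions=metric_definitions,
    max_jobs=6,
    max_parallel_jobs=2,
)

tuner.fit({"train": train_dataset_s3_path, "validation": validation_dataset_s3_path})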

Deploy the fine-tuned model

When training is complete, you can deploy your fine-tuned model. To do so, all we need to obtain is the inference script URI (the code that determines how the model is used for inference once deployed) and the inference container image URI, which includes an appropriate model server to host the model we chose. See the following code:

import boto3
import sagemaker
from sagemaker.predictor import Predictor
from sagemaker import image_uris
from sagemaker.utils import name_from_base

sagemaker_session = sagemaker.Session(boto_session=boto3.Session(region_name="us-west-2"))

# Instance type used to host the fine-tuned model
inference_instance_type = "ml.g5.12xlarge"

# Retrieve the inference docker container uri
deploy_image_uri = image_uris.retrieve(
    region=None,
    framework=None,
    image_scope="inference",
    model_id=model_id,
    model_version=model_version,
    instance_type=inference_instance_type,
)

endpoint_name = name_from_base(f"jumpstart-example-{model_id}")

# Use the estimator from the previous step to deploy to a SageMaker endpoint
finetuned_predictor = tg_estimator.deploy(
    initial_instance_count=1,
    instance_type=inference_instance_type,
    image_uri=deploy_image_uri,
    endpoint_name=endpoint_name,
)

After a few minutes, our model is deployed and we can get predictions from it in real time!
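
For example, a minimal sketch of invoking the endpoint with boto3 might look like the following. The JSON payload schema (a text_inputs field plus optional generation parameters) is the schema we assume for the JumpStart text generation container; check the accompanying notebook for the exact request and response format.

import json

import boto3

sagemaker_runtime = boto3.client("sagemaker-runtime", region_name="us-west-2")

# Assumed request schema: JSON body with "text_inputs" plus optional generation parameters
payload = {
    "text_inputs": "This Form 10-K report shows that",
    "max_length": 400,
    "num_return_sequences": 1,
    "top_k": 250,
    "top_p": 0.8,
    "do_sample": True,
}

response = sagemaker_runtime.invoke_endpoint(
    EndpointName=endpoint_name,
    ContentType="application/json",
    Body=json.dumps(payload).encode("utf-8"),
)
print(json.loads(response["Body"].read()))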

Access JumpStart through the Studio UI

Another way to fine-tune and deploy JumpStart models is through the Studio UI. This UI provides a low-code/no-code solution to fine-tuning LLMs.

On the Studio console, choose Models, notebooks, solutions under SageMaker JumpStart in the navigation pane.

In the search bar, search for the model you want to fine-tune and deploy.

In our case, we chose the GPT-J 6B model card. Here we can directly fine-tune or deploy the LLM.

Model evaluation

When evaluating an LLM, we can use perplexity (PPL). PPL is a common measure of how well a language model is able to predict the next word in a sequence. In simpler terms, it’s a way to measure how well the model can understand and generate human-like language.

A lower perplexity score means that the model is shown to perform better at predicting the next word. In practical terms, we can use perplexity to compare different language models and determine which one performs better on a given task. We can also use it to track the performance of a single model over time. For more details, refer to Perplexity of fixed-length models.
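
Concretely, for a tokenized sequence of N tokens, perplexity is the exponentiated average negative log-likelihood the model assigns to the sequence:

PPL = \exp\left(-\frac{1}{N}\sum_{i=1}^{N} \log p(x_i \mid x_{<i})\right)

In this setup, the validation loss reported by the fine-tuning job is an average negative log-likelihood per token, so perplexity can be computed as exp(eval_loss).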

We evaluate the model’s performance through a comparison of its pre- and post-fine-tuning performance. PPL is emitted in the training job’s Amazon CloudWatch logs. In addition, we look at the output generated by the model in response to specific test prompts.
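
If you prefer to pull the metric programmatically rather than read the logs, a quick sketch with the SageMaker SDK follows; the metric name matches the metric_definitions set on the estimator earlier, and the training job name is a placeholder.

import math

from sagemaker.analytics import TrainingJobAnalytics

# Replace with the name of the completed fine-tuning job (visible on the SageMaker console)
job_name = "<your-training-job-name>"

df = TrainingJobAnalytics(training_job_name=job_name, metric_names=["eval:loss"]).dataframe()
final_eval_loss = df["value"].iloc[-1]
print("Validation perplexity:", math.exp(final_eval_loss))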

Evaluation metric on the validation data Before fine-tuning After fine-tuning
Perplexity (PPL) 8.147 1.437

The model’s enhanced performance is demonstrated by two experiments:

  • Text generation – We show that the fine-tuned GPT-J generates text that resembles the SEC dataset, indicating that it has learned to emulate the language patterns and conventions of the financial regulatory domain
  • Zero-shot learning – We demonstrate that the fine-tuned GPT-J shows significantly better performance than the non-fine-tuned GPT-J on zero-shot question answering

For the text generation experiment, we use the following inference parameters to achieve our results: max_length of 400 (number of tokens returned), num_return_sequences of 1 (number of response text examples), top_k of 250 (pick from one of the top 250 possible continuations), top_p of 0.8 (select only from continuations that are at least 80% probable), do_sample set to True (sampling), and temperature of 1 (a higher temperature yields greater randomness in responses). For more details about the parameters, refer to GenerationConfig. The following table compares the fine-tuned and non-fine-tuned GPT-J models on a text completion task.

Inference example Output from GPT-J 6B Before Fine-Tuning Output from GPT-J 6B After Fine-Tuning
This Form 10-K report shows that

This Form 10-K report shows that:

The Companys net income attributable to the Company for the year ended December 31, 2016 was $3,923,000, or $0.21 per diluted share, compared to $3,818,000, or $0.21 per diluted share, for the year ended December 31, 2015.

The Companys net income attributable to the Company for the year ended December 31, 2016 was $4,816,000, or $0.28 per diluted share, compared to $4,726,000, or $0.27 per diluted share, for the year ended December 31, 2015.

The Companys net income attributable to the Company for the year ended December 31, 2016 was $5,828,000, or $0.34 per diluted share, compared to $5,716,000, or $0.33 per diluted share, for the year ended December 31, 2015.

On August 21, 2009, the Company filed a Form 10-Q for the quarter ended December 31, 2008.

On August 21, 2009, the Company filed a Form 10-Q for the quarter ended September 30, 2008.

On August 21, 2009, the Company filed a Form 10-Q for the quarter ended March 31, 2009.

This Form 10-K report shows that:(1) The registrant’s name, address, and telephone number, are as follows:Amazon.com, Inc. c/o Ernst & Young LLPSeattle, Washington Attention: Corporate Secretary (2) For the purpose of Article III of the Securities Exchange Act of 1934, the registrant’s name and address are as follows:(3) The registrant’s Exchange Act reportable time period is from and includingJanuary 1, 2020 to the present.(4) The registrant’s subjective assessment is that the financial statements, and other financial information included in this report, fairlypresent in all material respects the financial condition, results of operations and cashflows of the registrant as of, and for, the periods presented in this report.(5) The registrant’s management’s assessment is that the financial statements, and other financial information included in this report, fairlypresent in all material respects the financial condition, results of operations and cashflows of the registrant as of, and for, the periods presented in this report./s/ Ernst & Young LLPSeattle, WashingtonJanuary 29, 2020EX-31.1 5 amzn-20201231xex311.htm EX-31.1 DocumentExhibit 31.1CERTIFICATIONSI, Andrew R. Jassy, certify that:1. I have reviewed this Form 10-K of Amazon.com, Inc.;2. Based on my knowledge, this report does not contain any untrue statementof a material fact or omit to state a material fact necessary to make thestatements made, in light of the circumstances under which such statementswere made, not misleading with respect to the period covered by this report;
We serve consumers through

We serve consumers through:

Our Blog

Are you a business owner looking to grow your business? Do you want to create a more effective and efficient business? Do you want to learn how to grow your business? If you answered “yes” to any of these questions, then you need to read this article.

The first step to growing your business is to understand your business. You must be able to answer the following questions about your business:

Who are your customers?

What is your product or service?

What is your unique selling proposition?

What is your marketing strategy?

What is your sales forecast?

What is your sales forecast?

What is your sales forecast?

What is your sales forecast?

What is your sales forecast?

We serve consumers through: Amazon.com, a retailer of physical and digital products; Amazon Web Services, which provides technology infrastructure services that enable virtually any type of business; Amazon Enterprise Solutions, which provides companies with custom technology and services to enable better business processes and improve efficiency and productivity; and Audible, a digital audiobook publisher. We serve developers and enterprises of all sizes through AWS, which offers a broad set of global compute, storage, database, and other service offerings. We also manufacture and sell electronic devices. We have organized our operations into two principal segments: North America and International. See Item 8 of Part II, “”Financial Statements and Supplementary Data-Note 12-Segment Information.”” See Item 7 of Part II, “”Management’s Discussion and Analysis of Financial Condition and Results ofOperations-Results of Operations-Supplemental Information”” for supplemental informationabout our net sales.ConsumersWe serve consumers through our online and physical stores and focus on selection,price, and convenience. We design our stores to enable hundreds of millions ofunique products to be sold by us and by third parties across dozens of productcategories. Customers access our offerings through our websites, mobile apps,Alexa, and physically visiting our stores. We also manufacture and sell Kindledevices. We strive to offer our customers the lowest prices possible throughlow everyday product pricing and shipping offers, and to improve ouroperating efficiencies so that we can continue to lower prices for ourcustomers. We also provide easy-to-use functionality, fast and reliablefulfillment, and timely customer service.In addition, we offer Amazon Prime, a membership program that includes unlimitedfree shipping on over 100 million items, access to unlimited streaming of tens ofthousands of movies and TV episodes, and other benefits.
This year we particularly focus on

This year we particularly focus on:

The role of the health care system in the prevention of obesity and diabetes

The role of the health care system in the prevention of obesity and diabetes

The role of the health care system in the prevention of obesity and diabetes

The role of the health care system in the prevention of obesity and diabetes

The role of the health care system in the prevention of obesity and diabetes

The role of the health care system in the prevention of obesity and diabetes

The role of the health care system in the prevention of obesity and diabetes

This year we particularly focus on: -Continued international expansion, including the development of newbusinesses in existing international market segments, as well as thedevelopment of new international market segments; -Expanding our customer base, including increasing sales to existingcustomers and attracting new customers; -Developing new business opportunities and entering into commercialagreements, including through licensing and other arrangements, with othercompanies, or through the acquisition of other companies; -Developing technology that provides for enhanced customer service andefficiently delivers our products and services; -Developing new product and service offerings and incorporating these intoexisting and new product and service offerings; -Developing efficient and effective systems for supporting our business;and -Other business and strategic initiatives. We believe that offering low prices to our customers is fundamental to ourfuture success. One way we offer lower prices is through free-shipping offersthat result in a net cost to us in delivering products, and through membershipin Amazon Prime, which provides free-shipping on millions of items andaccess to movies and other content. We also offer other promotions that enableus to turn a lower net cost of sales.We have organized our operations into two principal segments: North Americaand International. See Item 8 of Part II, “”Financial Statements andSupplementary Data-Note 12-Segment Information.”” See Item 7 of Part II,””Management’s Discussion and Analysis of Financial Condition and Results ofOperations-Results of Operations-Supplemental Information”” for supplementalinformation about our net sales.ConsumersWe serve consumers through our retail websites and physical stores and focuson selection, price, and convenience.

For the second experiment with a zero-shot question answering task, we set the max_length to be a different value (40, 200, and 40, respectively) based on the output we expect from each question. The parameters of top_k, top_p, do_sample, temperature are set as 250, 0.8, False, and 0, respectively. We set temperature close to 0 to control the randomness of the output, whereas in the text generation task we set it close to 1 to make the model more creative. The following table compares the results. For each question, the output from fine-tuned GPT-J is more accurate than the non-fine-tuned one.

Input Output from GPT-J 6B Before Fine-Tuning Output from GPT-J 6B After Fine-Tuning
What is Item 7 of the 10-K SEC filing about?:

What is Item 7 of the 10-K SEC filing about?:

Item 7 of the 10-K SEC filing is the following:

“The Company has not received any material adverse

Item 7. Management’s Discussion and Analysis of Financial Condition and Results ofOperations Forward-Looking Statements
How did the COVID-19 pandemic impact Amazon’s business?:

How did the COVID-19 pandemic impact Amazon’s business?:

Amazon’s business is being impacted by the COVID-19 pandemic in a number of ways.

The company is seeing a decline in the number of Prime memberships, which is a big part of its business.

The company is also seeing a decline in the number of Prime memberships.

Amazon is also seeing a decline in the number of orders placed on its website.

The company is also seeing a decline in the number of people who are shopping on its website.

Amazon is also seeing a decline in the number of people who are shopping on its website.

Amazon is also seeing a decline in the number of people who are shopping on its website.

Amazon is also seeing a decline in the number of people who are shopping on its website.

Amazon is also seeing a decline in the number of people who are shopping

How did the COVID-19 pandemic impact Amazon’s business?:The COVID-19 pandemic has impacted our business in several ways. First, it has disrupted our supply chain, and it has affected customer demand, which in turn has affected our sales and operating results. Second, it has increased our spending on advertising and marketing, which may not be effective in the long run. Third, it has increased our spending on technology infrastructure and computing capacity, which may not be effective in the long run. Fourth, it has increased our spending on fulfillment and customer service, which may not be effective in the long run. Finally, it has increased our spending on content, which may not be effective in the long run. See Item 8 of Part II, “Financial Statements and Supplementary Data — Note 1 — Description of Business and Accounting Policies.
What drives sales growth at Amazon?: Amazon is the world’s largest online retailer. It is also the world’s largest online marketplace. It is also the world’ Sales growth at Amazon is driven primarily by increased customer usage, including increased selection, lower prices, and increased convenience, and increased sales by other sellers on our websites.

Clean up

To avoid ongoing charges, delete the SageMaker inference endpoints. You can delete the endpoints via the SageMaker console or from the notebook using the following commands:

# Delete the SageMaker endpoint and the attached resources
finetuned_predictor.delete_model()
finetuned_predictor.delete_endpoint()

Conclusion

JumpStart is a capability in SageMaker that allows you to quickly get started with ML. JumpStart uses open-source, pre-trained models to solve common ML problems like image classification, object detection, text classification, sentence pair classification, and question answering.

In this post, we showed you how to fine-tune and deploy a pre-trained LLM (GPT-J 6B) for text generation based on the SEC filing dataset. We demonstrated how the model transformed into a finance domain expert by undergoing the fine-tuning process on just two annual reports of the company. This fine-tuning enabled the model to generate content with an understanding of financial topics and greater precision. Try out the solution on your own and let us know how it goes in the comments.

Important: This post is for demonstrative purposes only. It is not financial advice and should not be relied on as financial or investment advice. The post used models pre-trained on data obtained from the SEC EDGAR database. You are responsible for complying with EDGAR’s access terms and conditions if you use SEC data.

To learn more about JumpStart, check out the other JumpStart posts on the AWS Machine Learning Blog.


About the Authors

Dr. Xin Huang is a Senior Applied Scientist for Amazon SageMaker JumpStart and Amazon SageMaker built-in algorithms. He focuses on developing scalable machine learning algorithms. His research interests are in the area of natural language processing, explainable deep learning on tabular data, and robust analysis of non-parametric space-time clustering. He has published many papers in ACL, ICDM, KDD conferences, and Royal Statistical Society: Series A.

Marc Karp is an ML Architect with the Amazon SageMaker Service team. He focuses on helping customers design, deploy, and manage ML workloads at scale. In his spare time, he enjoys traveling and exploring new places.

Dr. Sanjiv Das is an Amazon Scholar and the Terry Professor of Finance and Data Science at Santa Clara University. He holds post-graduate degrees in Finance (M.Phil and PhD from New York University) and Computer Science (MS from UC Berkeley), and an MBA from the Indian Institute of Management, Ahmedabad. Prior to being an academic, he worked in the derivatives business in the Asia-Pacific region as a Vice President at Citibank. He works on multimodal machine learning in the area of financial applications.

Arun Kumar Lokanatha is a Senior ML Solutions Architect with the Amazon SageMaker Service team. He focuses on helping customers build, train, and migrate ML production workloads to SageMaker at scale. He specializes in deep learning, especially in the area of NLP and CV. Outside of work, he enjoys running and hiking.

Read More

Announcing the updated Microsoft OneDrive connector (V2) for Amazon Kendra

Announcing the updated Microsoft OneDrive connector (V2) for Amazon Kendra

Amazon Kendra is an intelligent search service powered by machine learning (ML), enabling organizations to provide relevant information to customers and employees, when they need it.

Amazon Kendra uses ML algorithms to enable users to use natural language queries to search for information scattered across multiple data sources in an enterprise, including commonly used document storage systems like Microsoft OneDrive.

OneDrive is an online cloud storage service that allows you to host your content and have it automatically sync across multiple devices. Amazon Kendra can index document formats like Microsoft OneNote, HTML, PDF, Microsoft Word, Microsoft PowerPoint, Microsoft Excel, Rich Text, JSON, XML, CSV, XSLT, and plain text.

We’re excited to announce that we have updated the OneDrive connector for Amazon Kendra to add even more capabilities. For example, we have added support to search OneNote documents. Additionally, you can now choose to use identity or ACL information to make your searches more granular.

The connector indexes documents along with their access control information so that search results are limited to only those documents the user is allowed to access. To filter search results based on user access rights, the connector provides an identity crawler that loads principal information, such as user and group mappings, into a principal store.

In this post, we demonstrate how to configure multiple data sources in Amazon Kendra to provide a central place to search across your document repository.

Solution overview

For our solution, we demonstrate how to index a OneDrive repository or folder using the Amazon Kendra connector for OneDrive. The solution consists of the following steps:

  1. Create and configure an app on Microsoft Azure Portal and get the authentication credentials.
  2. Create a OneDrive data source via the Amazon Kendra console.
  3. Index the data in the OneDrive repository.
  4. Run a sample query to get the information.
  5. Filter the query by users or groups.

Prerequisites

To try out the Amazon Kendra connector for OneDrive, you need the following:

Configure an Azure application and assign connection permissions

Before we set up the OneDrive data source, we need a few details about the OneDrive repository. Complete the following steps:

  1. Log in to Azure.
  2. After logging in with your account credentials, choose App registrations, then choose New registration.
  3. Give an appropriate name to your application and register the application.
  4. Collect the information about the client ID, tenant ID, and other details of the application.
  5. To get a client secret, choose Add a certificate or secret under Client credentials.
  6. Choose New client secret and provide the proper description and expiry.
  7. Note the client-id, tenant-id, and secret-id values. We use these for authenticating the OAuth2 application.
  8. Navigate to App, choose API permissions in the navigation pane, and choose Add a permission.
  9. Choose Microsoft Graph.
  10. Under Application permissions, enter File in the search bar and under Files, select Files.Read.All.
  11. Choose Add permissions.
  12. Similarly, add the following permissions on the Microsoft Graph option for the application you created:
    1. Group.Read.All
    2. Notes.Read.All

On completion, the API permissions will look like the following screenshot.

Configure the Amazon Kendra connector for OneDrive

To configure the Amazon Kendra connector, complete the following steps:

  1. On the Amazon Kendra console, choose Create an Index.
  2. For Index name, enter a name for the index (for example, my-onedrive-index).
  3. Enter an optional description.
  4. Choose Create a new role.
  5. For Role name, enter an IAM role name.
  6. Configure optional encryption settings and tags.
  7. Choose Next.
  8. In the Configure user access control section, select Yes under Access control settings.
  9. For Token type, choose JSON from the drop-down menu.
  10. Leave the remaining values as their default values.
  11. Choose Next.

Before we move to the next configuration step, we need to provide Amazon Kendra with a role that has the permissions necessary for connecting to the site. These include permission to get and decrypt the AWS Secrets Manager secret that contains the application ID and secret key necessary to connect to the OneDrive site.

  1. Open another tab for the AWS account, and on the IAM console, navigate to the role that you created earlier (for example, AmazonKendra-us-west-2-onedrive).
  2. Choose Add permissions and Create inline policy.
  3. For Service, choose Kendra.
  4. For Actions, choose Write and specify BatchPutDocument.
  5. For Resources, choose All resources.
  6. Choose Review policy.
  7. For Name, enter a name (for example, BatchPutPolicy).
  8. Choose Create policy.
  9. Add this policy to the role you created.
  10. Additionally, attach the SecretsManagerReadWrite AWS managed policy to the role.
  11. Return to the Amazon Kendra tab.
  12. Select Developer edition and choose Create.

This creates and propagates the IAM role and then creates the Amazon Kendra index, which can take up to 30 minutes.
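
If you prefer to script these permissions instead of using the console, the following is a minimal boto3 sketch that attaches an equivalent inline policy and the managed policy to the role. The role name matches the example above, and the inline policy mirrors the console choices (BatchPutDocument on all resources):

import json
import boto3

iam = boto3.client("iam")

# Inline policy equivalent to the BatchPutPolicy created in the console steps above
batch_put_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "kendra:BatchPutDocument",
            "Resource": "*",  # matches the "All resources" console choice
        }
    ],
}

iam.put_role_policy(
    RoleName="AmazonKendra-us-west-2-onedrive",  # the role created earlier
    PolicyName="BatchPutPolicy",
    PolicyDocument=json.dumps(batch_put_policy),
)

# Attach the SecretsManagerReadWrite AWS managed policy, as described above
iam.attach_role_policy(
    RoleName="AmazonKendra-us-west-2-onedrive",
    PolicyArn="arn:aws:iam::aws:policy/SecretsManagerReadWrite",
)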

  1. Return to the Amazon Kendra console, choose Data sources in the navigation pane, and choose Add data source.
  2. Under OneDrive connector V2.0, choose Add connector.
  3. For Data source name, enter a name (for example, my-onedrive).
  4. Enter an optional description.
  5. Choose Next.
  6. For OneDrive Tenant ID, enter the tenant ID you gathered earlier.
  7. For Configure VPC and security group, leave the default (No VPC).
  8. Keep the Identity crawler is on option selected. This imports identity information into the index.
  9. For IAM role, choose Create a new role.
  10. Enter a role name, such as AmazonKendra-us-west-2-onedrive, then choose Next.
  11. In the Authentication section, choose Create and add a secret.
  12. Create a secret with clientId and clientSecret as keys.
  13. Add their respective values with the information you collected earlier.
  14. Choose Next.
  15. In the Configure sync settings section, add the OneDrive users whose documents you want to index.
  16. Select the sync mode for the index. For this post, we select New, modified or deleted content sync.
  17. Choose the frequency of indexing as Run on demand, then choose Next.

Field mappings allow you to set the searchability and relevance of fields. For example, the lastUpdatedAt field can be used to sort or boost the ranking of documents based on how recently they were updated.

  1. Keep all the defaults in the Set field mappings section and choose Next.
  2. On the review page, choose Add data source.
  3. Choose Sync now.

The sync can take up to 30 minutes to complete.

Test the solution

Now that you have indexed the content from OneDrive, you can test it by querying the index.

  1. Go to your index on the Amazon Kendra console and choose Search indexed content in the navigation pane.
  2. Enter a search term and press Enter.

Notice that without a token, the ACLs prevent a search result from being returned.

  1. Expand Test query with an access token and choose Apply token.
  2. Enter the appropriate token with a user who has permissions to read the file and choose Apply.
  3. Search for information present in OneDrive again.

You can verify that Amazon Kendra presents the ranked results as expected.
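
If you want to run the same check programmatically, the following is a minimal boto3 sketch of querying the index with a user access token, assuming the JSON (JWT) token type configured earlier. The index ID, query text, and token are placeholders:

import boto3

kendra = boto3.client("kendra")

response = kendra.query(
    IndexId="YOUR_INDEX_ID",                 # placeholder index ID
    QueryText="quarterly sales report",      # example search term
    UserContext={"Token": "eyJhbGciOi..."},  # JWT for a user allowed to read the documents
)

# Print the result type and document title for each returned item
for item in response["ResultItems"]:
    print(item["Type"], item.get("DocumentTitle", {}).get("Text"))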

Congratulations, you have configured Amazon Kendra to index and search documents in OneDrive and control access to them using ACLs.

Conclusion

With the Microsoft OneDrive V2 connector for Amazon Kendra, organizations can tap into commonly used enterprise document stores, securely using intelligent search powered by Amazon Kendra. You can enhance the search experience by integrating the data source with the Custom Document Enrichment (CDE) capability in Amazon Kendra to perform additional attribute mapping logic and even custom content transformation during ingestion.


About the authors

Pravinchandra Varma is a Senior Customer Delivery Architect with the AWS Professional Services team and is passionate about applications of machine learning and artificial intelligence services.

Supratim Barat is a Software Development Engineer with the AWS Kendra Yellowbadge Team and is a blockchain and cybersecurity enthusiast.


How RallyPoint and AWS are personalizing job recommendations to help military veterans and service providers transition back into civilian life using Amazon Personalize


This post was co-written with Dave Gowel, CEO of RallyPoint. In his own words, “RallyPoint is an online social and professional network for veterans, service members, family members, caregivers, and other civilian supporters of the US armed forces. With two million members on the platform, the company provides a comfortable place for this deserving population to connect with each other and programs designed to support them.”

All those who serve – and those who support them – often face a variety of employment challenges when a servicemember transitions back into civilian life. RallyPoint has identified the transition period to a civilian career as a major opportunity to improve the quality of life for this population by creating automated and compelling job recommendations. However, the team historically employed a rule-based curation method to recommend jobs throughout its user experience, which doesn’t allow members to get job recommendations personalized to their individual experience, expertise, and interests.

“To improve this experience for its members, we at RallyPoint wanted to explore how machine learning (ML) could help. We don’t want our servicemembers, veterans, and their loved ones to waste time searching for a fulfilling civilian career path when they decide to leave the military. It should be an easy process. We want our members to tell us about their military experiences, any schools they’ve attended, and their personal preferences. Then by leveraging what we know from our millions of military and veteran members, relevant open jobs should be easily surfaced instead of laboriously searched. This free service for our members is also expected to drive revenue by at least seven figures from employers seeking the right military and veteran talent, allowing us to build more free capabilities for our members.”

This blog post summarizes how the Amazon Machine Learning Solutions Lab (MLSL) partnered with RallyPoint to drive a 35% improvement in personalized career recommendations and a 66x increase in coverage, among other improvements for RallyPoint members, over the existing rule-based implementation.

“MLSL helped RallyPoint save and improve the lives of the US military community. Fortunate to work on multiple complex and impactful projects with MLSL to support the most deserving of populations, RallyPoint accelerated growth in multiple core organizational metrics in the process. MLSL’s high caliber talent, culture, and focus on aiding our realization of measurable and compelling results from machine learning investments enabled us to reduce suicide risk, improve career transition, and speed up important connections for our service members, veterans, and their families.”

Screenshot of the RallyPoint Website

*Photo provided by the RallyPoint team.

The following sections cover the business and technical challenges, the approach taken by the AWS and RallyPoint teams, and the performance of the implemented solution, which leverages Amazon Personalize.

Amazon Personalize makes it easy for developers to build applications capable of delivering a wide array of personalization experiences, including specific product recommendations, personalized product re-ranking, and customized direct marketing. Amazon Personalize is a fully managed ML service that goes beyond rigid, static rule-based recommendation systems by training, tuning, and deploying custom ML models to deliver highly customized recommendations to customers across industries such as retail and media and entertainment.

Business and Technical challenges

Multiple business challenges inspired this partnership. The most pertinent was the clickthrough rate on the top 10 recommended jobs on the RallyPoint website. RallyPoint analyzed user engagement within their platform and discovered that they needed to increase the number of relevant jobs that users are clicking. The idea is that the more relevant a recommended job is, the higher the likelihood of members applying to those jobs, leading to improved employment outcomes.

The next challenge was to increase member engagement with the job services offered on the site. RallyPoint offers the opportunity for people to “Build your brand and engage the military community, advertise your products and services, run recruitment marketing campaigns, post jobs, and search veteran talent.” They once again identified an opportunity to apply Amazon Personalize to help more people transition to civilian life, and sought to improve their click-to-customer conversion numbers, leading to better outcomes for RallyPoint’s direct customers.

From a technical perspective, as in many traditional recommender system problems, data sparsity and a long tail were challenges to overcome. The sample set of de-identified, already publicly shared data included thousands of anonymized user profiles, with more than fifty user metadata points, but many had inconsistent or missing metadata/profile information. To tackle this, the team leveraged the Amazon Personalize cold start recommendation functionality for relevant users.

Solution overview

To solve the problem, MLSL collaborated with RallyPoint to construct a custom Amazon Personalize pipeline for RallyPoint. Some of the services used include Amazon Simple Storage Service (Amazon S3), Amazon SageMaker Notebook Instances, and Amazon Personalize. The following diagram illustrates the solution architecture.

The anonymized raw data used for the solution consisted of a history of interactions with job postings, along with metadata on user profiles and job positions. This data was stored in Amazon S3. The MLSL team used SageMaker notebook instances to prepare the data as input to Amazon Personalize. This step included data preprocessing, feature engineering, and creating the dataset groups and schemas required by Amazon Personalize. For more information, refer to Creating a Custom dataset group.
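
As an illustration of this preparation step, the following is a minimal boto3 sketch of creating a dataset group and an interactions schema. The names and schema fields shown are assumptions for illustration, not RallyPoint’s actual configuration:

import json
import boto3

personalize = boto3.client("personalize")

# Create the dataset group that will hold the interactions, users, and items datasets
dataset_group = personalize.create_dataset_group(name="rallypoint-jobs")
dataset_group_arn = dataset_group["datasetGroupArn"]

# A minimal interactions schema with the required fields
interactions_schema = {
    "type": "record",
    "name": "Interactions",
    "namespace": "com.amazonaws.personalize.schema",
    "fields": [
        {"name": "USER_ID", "type": "string"},
        {"name": "ITEM_ID", "type": "string"},
        {"name": "TIMESTAMP", "type": "long"},
    ],
    "version": "1.0",
}

schema = personalize.create_schema(
    name="rallypoint-interactions-schema",
    schema=json.dumps(interactions_schema),
)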

The next step was to create a solution in Amazon Personalize. A solution refers to the combination of an Amazon Personalize recipe, customized parameters, and one or more solution versions. For more information refer to Creating a solution. The team used the User-Personalization recipe to generate user-specific job recommendations for users in a validation set. The Amazon Personalize outputs, including the job recommendations and performance metrics, are stored in an Amazon S3 bucket for further analysis.
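
A hedged sketch of this step follows; the recipe ARN is the standard User-Personalization recipe, while the solution name is illustrative and dataset_group_arn is assumed to come from the previous sketch:

import boto3

personalize = boto3.client("personalize")

# Create a solution that uses the User-Personalization recipe
solution = personalize.create_solution(
    name="rallypoint-user-personalization",
    datasetGroupArn=dataset_group_arn,
    recipeArn="arn:aws:personalize:::recipe/aws-user-personalization",
)

# Train a solution version (a trained model) for this solution
solution_version = personalize.create_solution_version(
    solutionArn=solution["solutionArn"]
)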

In the final step, the team used a notebook instance to prepare the output recommendations for external evaluation by human annotators, as described in the Using Domain Experts section.

Evaluation of Amazon Personalize results

The performance of an Amazon Personalize solution version can be evaluated using offline metrics, online metrics, and A/B testing. Offline metrics allow you to view the effects of modifying hyperparameters and algorithms used to train your models, calculated against historical data. Online metrics are the empirical results observed in your user’s interactions with real-time recommendations provided in a live environment (such as clickthrough rate). A/B testing is an online method of comparing the performance of multiple solution versions to a default solution. Users are randomly assigned to either the control (default) group or one of the treatment (test) groups. The control group users receive recommendations from the default solution (baseline), whereas each of the treatment groups interact with a different solution version. Statistical significance tests are used to compare the performance metrics (such as clickthrough rate or latency) and business metrics (such as revenue) to that of the default solution.

Amazon Personalize measures offline metrics while training a solution version. The team used offline metrics such as Mean Reciprocal Rank (MRR), normalized discounted cumulative gain (NDCG@k), Precision@k, and Coverage. For the definitions of all available offline metrics, refer to Metric definitions.
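
Once a solution version finishes training, these offline metrics can also be retrieved programmatically, as in the following sketch (the solution version ARN is a placeholder):

import boto3

personalize = boto3.client("personalize")

metrics = personalize.get_solution_metrics(
    solutionVersionArn="arn:aws:personalize:us-east-1:111122223333:solution/rallypoint-user-personalization/1"
)

# Keys include coverage, mean_reciprocal_rank_at_25, and precision_at_10, among others
print(metrics["metrics"])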

Although Amazon Personalize provides an extensive list of offline metrics that the team can use to objectively measure the performance of solutions during training, online metrics and A/B testing are recommended to track and validate model performance. One caveat to these tests is that they require users to interact with Amazon Personalize recommendations in real time. Because the RallyPoint Amazon Personalize model wasn’t deployed prior to this publication, the team didn’t have results to report for these tests.

Using Domain Experts

A/B testing is the preferred method of analyzing the quality of a recommendation system; however, using domain experts to annotate recommendations is a viable precursor. Because online testing was not an option, the team tested the robustness of the recommendations by asking domain experts at RallyPoint to annotate the recommendations generated by the models and count the number of job positions the experts agreed should be recommended (given a user’s information and indicated preferences) as the number of “correct” recommendations. This metric was used to compare solution versions. A popularity solution (the current rule-based criteria), which consisted of recommending the top five most popular job positions to every user, was used as a baseline. In addition, a solution with default settings was used as another baseline model, called the Amazon Personalize baseline solution.

Results

Using the best performing model resulted in a 35% improvement in the number of “correct” recommendations over the Amazon Personalize baseline solution and a 54% improvement over the popularity solution. The team also achieved a 66x improvement in coverage, a 30x improvement in MRR, and a 2x improvement in precision@10 compared to the popularity solution, as well as up to a 2x increase in MRR and precision@10 compared to the Amazon Personalize baseline solution.

Summary

RallyPoint recognized an opportunity to better serve their customers with more personalized career recommendations. They began their user personalization journey with customer obsession in mind, partnering with the Machine Learning Solutions Lab. Through this solution, RallyPoint now has the opportunity to give their users more valuable career recommendations. Incorporating this improved recommendation system into their website will result in RallyPoint users seeing more relevant jobs in their career feed, easing the path to more fulfilling careers and an improved quality of life for their members.

Use Amazon Personalize to provide an individualized experience for your users today! If you’d like to collaborate with experts to bring ML solutions to your organization, contact the Amazon ML Solutions Lab.

Additional resources

For more information about Amazon Personalize, see the following:


About the Authors

Dave Gowel is an Army veteran and the CEO of RallyPoint. Dave is a graduate of West Point and the US Army Ranger School, served in Iraq as a tank platoon leader, and taught as an assistant professor at the Massachusetts Institute of Technology ROTC program. RallyPoint is the third technology company for which Dave has been CEO.

Matthew Rhodes is a Data Scientist working in the Amazon ML Solutions Lab. He specializes in building machine learning pipelines that involve concepts such as natural language processing and computer vision.

Amin Tajgardoon is an Applied Scientist at the Amazon ML Solutions Lab. He has an extensive background in computer science and machine learning. In particular, Amin’s focus has been on deep learning and forecasting, prediction explanation methods, model drift detection, probabilistic generative models, and applications of AI in the healthcare domain.

Yash Shah is a Science Manager in the Amazon ML Solutions Lab. He and his team of applied scientists and machine learning engineers work on a range of machine learning use cases from healthcare, sports, automotive and manufacturing.

Vamshi Krishna Enabothala is a Sr. Applied AI Specialist Architect at AWS. He works with customers from different sectors to accelerate high-impact data, analytics, and machine learning initiatives. He is passionate about recommendation systems, NLP, and computer vision areas in AI and ML. Outside of work, Vamshi is an RC enthusiast, building RC equipment (planes, cars, and drones), and also enjoys gardening.

Greg Tolmie is an Account Manager on the AWS Public Sector ISV partners team. Greg supports a portfolio of AWS public sector ISV partners to help them grow and mature their adoption of AWS services while maximizing benefits of the AWS partner network.


Generate actionable insights for predictive maintenance management with Amazon Monitron and Amazon Kinesis


Reliability managers and technicians in industrial environments such as manufacturing production lines, warehouses, and industrial plants are keen to improve equipment health and uptime to maximize product output and quality. Machine and process failures are often addressed by reactive activity after incidents happen or by costly preventive maintenance, where you run the risk of over-maintaining the equipment or missing issues that could happen between the periodic maintenance cycles. Predictive condition-based maintenance is a proactive strategy that is better than reactive or preventive ones. Indeed, this approach combines continuous monitoring, predictive analytics, and just-in-time action. This enables maintenance and reliability teams to service equipment only when necessary, based on the actual equipment condition.

There have been common challenges with condition-based monitoring when generating actionable insights for large industrial asset fleets. These challenges include, but are not limited to, building and maintaining a complex infrastructure of sensors collecting data from the field, obtaining a reliable high-level summary of industrial asset fleets, efficiently managing failure alerts, identifying possible root causes of anomalies, and effectively visualizing the state of industrial assets at scale.

Amazon Monitron is an end-to-end condition monitoring solution that enables you to start monitoring equipment health with the aid of machine learning (ML) in minutes, so you can implement predictive maintenance and reduce unplanned downtime. It includes sensor devices to capture vibration and temperature data, a gateway device to securely transfer data to the AWS Cloud, the Amazon Monitron service that analyzes the data for anomalies with ML, and a companion mobile app to track potential failures in your machinery. Your field engineers and operators can directly use the app to diagnose and plan maintenance for industrial assets.

From the operational technology (OT) team standpoint, using the Amazon Monitron data also opens up new ways to improve how they operate large industrial asset fleets thanks to AI. OT teams can reinforce their organization’s predictive maintenance practice by building a consolidated view across multiple hierarchies (assets, sites, and plants). They can combine actual measurements and ML inference results with unacknowledged alarms, sensor or gateway connectivity status, or asset state transitions to build a high-level summary for the scope (asset, site, or project) they are focused on.

With the recently launched Amazon Monitron Kinesis data export v2 feature, your OT team can stream incoming measurement data and inference results from Amazon Monitron via Amazon Kinesis to Amazon Simple Storage Service (Amazon S3) to build an Internet of Things (IoT) data lake. By leveraging the latest data export schema, you can obtain sensor connectivity status, gateway connectivity status, measurement classification results, closure reason codes, and details of asset state transition events.

Use cases overview

The enriched data stream Amazon Monitron now exposes enables you to implement several key use cases, such as automated work order creation, enriching an operational single pane of glass, or automating failure reporting. Let’s dive into these use cases.

You can use the Amazon Monitron Kinesis data export v2 to create work orders in Enterprise Asset Management (EAM) systems such as Infor EAM, SAP Asset Management, or IBM Maximo. For example, in the video Avoiding mechanical issues with predictive maintenance & Amazon Monitron, you can discover how our Amazon Fulfillment Centers are avoiding mechanical issues on conveyor belts with Amazon Monitron sensors integrated with third-party software, such as the EAM system used at Amazon, as well as with the chat rooms technicians use. This shows how you can naturally integrate Amazon Monitron insights into your existing workflows. Stay tuned in the coming months to read the next installment of this series with an actual implementation of how this integration works.

You can also use the data stream to ingest Amazon Monitron insights back into a shop floor system such as a Supervisory Control and Data Acquisition (SCADA) or a Historian. Shop floor operators are more efficient when all the insights about their assets and processes are provided in a single pane of glass. In this concept, Amazon Monitron doesn’t become yet another tool technicians have to monitor, but another data source with insights provided in the single view they are already used to. Later this year, we will also describe an architecture you can use to perform this task and send Amazon Monitron feedback to major third-party SCADA systems and Historians.

Last but not least, the new data stream from Amazon Monitron includes the asset state transitions and closure codes provided by users when acknowledging alarms (which trigger the transition to a new state). Thanks to this data, you can automatically build visualizations that provide real-time reporting of the failures and actions taken while operating your assets.

Your team can then build a broader data analytics dashboard to support your industrial fleet management practice by combining this asset state data with Amazon Monitron measurement data and other IoT data across large industrial asset fleets, using key AWS services that we describe in this post. We explain how to build an IoT data lake, the workflow to produce and consume the data, and a summary dashboard to visualize Amazon Monitron sensor data and inference results. We use an Amazon Monitron dataset coming from about 780 sensors installed in an industrial warehouse, which has been running for more than 1 year. For the detailed Amazon Monitron installation guide, refer to Getting started with Amazon Monitron.

Solution overview

Amazon Monitron provides ML inference of asset health status after 21 days of the ML model training period for each asset. In this solution, the measurement data and ML inference from these sensors are exported to Amazon S3 via Amazon Kinesis Data Streams by using the latest Amazon Monitron data export feature. As soon as Amazon Monitron IoT data is available in Amazon S3, a database and table are created in Amazon Athena by using an AWS Glue crawler. You can query Amazon Monitron data via AWS Glue tables with Athena, and visualize the measurement data and ML inference with Amazon Managed Grafana. With Amazon Managed Grafana, you can create, explore, and share observability dashboards with your team, and spend less time managing your Grafana infrastructure. In this post, you connect Amazon Managed Grafana to Athena, and learn how to build a data analytics dashboard with Amazon Monitron data to help you plan industrial asset operations at scale.

The following screenshot is an example of what you can achieve at the end of this post. This dashboard is divided into three sections:

  • Plant View – Analytical information from all sensors across plants; for example, the overall counts of various states of sensors (Healthy, Warning, or Alarm), number of unacknowledged and acknowledged alarms, gateway connectivity, and average time for maintenance
  • Site View – Site-level statistics, such as asset status statistics at each site, total number of days that an alarm remains unacknowledged, top/bottom performing assets at each site, and more
  • Asset View – Summary information for the Amazon Monitron project at the asset level, such as the alarm type for an unacknowledged alarm (ISO or ML), the timeline for an alarm, and more

These panels are examples that can help strategic operational planning, but they are not exclusive. You can use a similar workflow to customize the dashboard according to your targeted KPI.



Architecture overview

The solution you will build in this post combines Amazon Monitron, Kinesis Data Streams, Amazon Kinesis Data Firehose, Amazon S3, AWS Glue, Athena, and Amazon Managed Grafana.

The following diagram illustrates the solution architecture. Amazon Monitron sensors measure and detect anomalies from equipment. Both measurement data and ML inference outputs are exported at a frequency of once per hour to a Kinesis data stream, and they are delivered to Amazon S3 via Kinesis Data Firehose with a 1-minute buffer. The exported Amazon Monitron data is in JSON format. An AWS Glue crawler analyzes the Amazon Monitron data in Amazon S3 at a chosen frequency of once per hour, builds a metadata schema, and creates tables in Athena. Finally, Amazon Managed Grafana uses Athena to query the Amazon S3 data, allowing dashboards to be built to visualize both measurement data and device health status.

To build this solution, you complete the following high-level steps:

  1. Enable a Kinesis Data Stream export from Amazon Monitron and create a data stream.
  2. Configure Kinesis Data Firehose to deliver data from the data stream to an S3 bucket.
  3. Build the AWS Glue crawler to create a table of Amazon S3 data in Athena.
  4. Create a dashboard of Amazon Monitron devices with Amazon Managed Grafana.

Prerequisites

For this walkthrough, you should have the following prerequisites:

Additionally, make sure that all the resources you deploy are in the same Region.

Enable a Kinesis data stream export from Amazon Monitron and create a data stream

To configure your data stream export, complete the following steps:

  1. On the Amazon Monitron console, from your project’s main page, choose Start live data export.
  2. Under Select Amazon Kinesis data stream, choose Create a new data stream.
  3. Under Data stream configuration, enter your data stream name.
  4. For Data stream capacity, choose On-demand.
  5. Choose Create data stream.

Note that any live data export enabled after April 4th, 2023 will stream data following the Kinesis Data Streams v2 schema. If you have an existing data export that was enabled before this date, the schema will follow the v1 format.

You can now see live data export information on the Amazon Monitron console with your specified Kinesis data stream.
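
If you prefer to create the data stream ahead of time with the AWS SDK rather than from the Amazon Monitron console, a minimal sketch looks like the following (the stream name is a placeholder); you can then select the existing stream instead of creating a new one:

import boto3

kinesis = boto3.client("kinesis")

# Create an on-demand data stream (no shard count needed in on-demand mode)
kinesis.create_stream(
    StreamName="monitron-live-data",
    StreamModeDetails={"StreamMode": "ON_DEMAND"},
)

# Wait until the stream is active before enabling the Amazon Monitron export to it
waiter = kinesis.get_waiter("stream_exists")
waiter.wait(StreamName="monitron-live-data")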

Configure Kinesis Data Firehose to deliver data to an S3 bucket

To configure your Firehose delivery stream, complete the following steps:

  1. On the Kinesis console, choose Delivery streams in the navigation pane.
  2. Choose Create delivery stream.
  3. For Source, select Amazon Kinesis Data Streams.
  4. For Destination, select Amazon S3.
  5. Under Source settings, for Kinesis data stream, enter the ARN of your Kinesis data stream.
  6. Under Delivery stream name, enter a name for your delivery stream.
  7. Under Destination settings, choose an S3 bucket or enter a bucket URI. You can either use an existing S3 bucket to store Amazon Monitron data, or you can create a new S3 bucket.
  8. Enable dynamic partitioning using inline parsing for JSON:
    • Choose Enabled for Dynamic partitioning.
    • Choose Enabled for Inline parsing for JSON.
    • Under Dynamic partitioning keys, add the following partition keys and their JQ expressions:
      project – .projectName| "project=\(.)"
      site – .eventPayload.siteName| "site=\(.)"
      asset – .eventPayload.assetName| "asset=\(.)"
      position – .eventPayload.positionName| "position=\(.)"
      time – .timestamp| sub(" [0-9]{2}:[0-9]{2}:[0-9]{2}.[0-9]{3}$"; "")| "time=\(.)"
  9. Choose Apply dynamic partitioning keys and confirm that the generated S3 bucket prefix is !{partitionKeyFromQuery:project}/!{partitionKeyFromQuery:site}/!{partitionKeyFromQuery:asset}/!{partitionKeyFromQuery:position}/!{partitionKeyFromQuery:time}/.
  10. Enter a prefix for S3 bucket error output prefix. Any JSON payload that doesn’t contain the keys described earlier will be delivered to this prefix. For instance, the gatewayConnected and gatewayDisconnected events are not linked to a given asset or position, so they won’t contain the assetName and positionName fields. Specifying this optional prefix here allows you to monitor this location and process these events accordingly.
  11. Choose Create delivery stream.

You can inspect the Amazon Monitron data in the S3 bucket. Note that Amazon Monitron exports live data once per hour, so wait up to an hour before inspecting the data.

This Kinesis Data Firehose setup enables dynamic partitioning, and the S3 objects delivered will use the following key format:

/project={projectName}/site={siteDisplayName}/asset={assetDisplayName}/position={sensorPositionDisplayName}/time={yyyy-mm-dd 00:00:00}/{filename}.
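
To quickly confirm that objects are landing under this key format, you can list a few keys with boto3; the bucket name is a placeholder, and the prefix may need adjusting if your delivery stream adds its own leading prefix:

import boto3

s3 = boto3.client("s3")

resp = s3.list_objects_v2(
    Bucket="your-monitron-export-bucket",  # placeholder bucket name
    Prefix="project=",                     # keys start with the project partition
    MaxKeys=10,
)

for obj in resp.get("Contents", []):
    print(obj["Key"])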

Build the AWS Glue crawler to create a table of Amazon S3 data in Athena

After the live data has been exported to Amazon S3, we use an AWS Glue crawler to generate the metadata tables. In this post, we use AWS Glue crawlers to automatically infer database and table schema from Amazon Monitron data exported in Amazon S3, and store the associated metadata in the AWS Glue Data Catalog. Athena then uses the table metadata from the Data Catalog to find, read, and process the data in Amazon S3. Complete the following steps to create your database and table schema:

  1. On the AWS Glue console, choose Crawlers in the navigation pane.
  2. Choose Create crawler.
  3. Enter a name for the crawler (for example, XXX_xxxx_monitron).
  4. Choose Next.
  5. For Is your data already mapped to Glue tables, choose Not yet.
  6. For Data Source, choose S3.
  7. For Location of S3 data, choose In this Account, and enter the path of your S3 bucket directory you set up in the previous section (s3://YourBucketName).
  8. For Repeat crawls of S3 data stores, select Crawl all sub-folders.
  9. Finally, choose Next.
  10. Select Create new IAM role and enter a name for the role.
  11. Choose Next.
  12. Select Add Database, and enter a name for the database. This creates the Athena database where your metadata tables are located after the crawler is complete.
  13. For Crawler Schedule, select a preferred time-based scheduler (for example, hourly) to refresh the Amazon Monitron data in the database, and choose Next.
  14. Review the crawler details and choose Create.
  15. On the Crawlers page of the AWS Glue console, select the crawler you created and choose Run crawler.

You may need to wait a few minutes, depending on the size of the data. When it’s complete, the crawler’s status shows as Ready. To see the metadata tables, navigate to your database on the Databases page and choose Tables in the navigation pane.

You can also view data by choosing Table data on the console.

You’re redirected to the Athena console to view the top 10 records of the Amazon Monitron data in Amazon S3.
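
You can run the same kind of preview query through the Athena API, as in the following sketch; the database name, table name, and query result location are placeholders:

import time
import boto3

athena = boto3.client("athena")

query = athena.start_query_execution(
    QueryString='SELECT * FROM "your_monitron_db"."your_monitron_table" LIMIT 10;',
    QueryExecutionContext={"Database": "your_monitron_db"},
    ResultConfiguration={"OutputLocation": "s3://your-athena-results-bucket/monitron/"},
)
qid = query["QueryExecutionId"]

# Poll until the query finishes, then fetch the results
while True:
    state = athena.get_query_execution(QueryExecutionId=qid)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    results = athena.get_query_results(QueryExecutionId=qid)
    print(results["ResultSet"]["Rows"][:3])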

Create a dashboard of Amazon Monitron devices with Amazon Managed Grafana

In this section, we build a customized dashboard with Amazon Managed Grafana to visualize the Amazon Monitron data in Amazon S3, so that the OT team can get streamlined access to assets in alarm across their whole Amazon Monitron sensor fleet. This will enable the OT team to plan next-step actions based on the possible root causes of the anomalies.

To create a Grafana workspace, complete the following steps:

  1. Ensure that your user role is admin or editor.
  2. On the Amazon Managed Grafana console, choose Create workspace.
  3. For Workspace name, enter a name for the workspace.
  4. Choose Next.
  5. For Authentication access, select AWS IAM Identity Center (successor to AWS Single Sign-On). You can use the same AWS IAM Identity Center user that you used to set up your Amazon Monitron project.
  6. Choose Next.
  7. For this first workspace, confirm that Service managed is selected for Permission type. This selection enables Amazon Managed Grafana to automatically provision the permissions you need for the AWS data sources that you use for this workspace.
  8. Choose Current account.
  9. Choose Next.
  10. Confirm the workspace details, and choose Create workspace. The workspace details page appears. Initially, the status is CREATING.
  11. Wait until the status is ACTIVE to proceed to the next step.

To configure your Athena data source, complete the following steps:

  1. On the Amazon Managed Grafana console, choose the workspace you want to work on.
  2. On the Data sources tab, select Amazon Athena, and choose Actions, Enable service-managed policy.
  3. Choose Configure in Grafana in the Amazon Athena row.
  4. Sign in to the Grafana workspace console using IAM Identity Center if necessary. The user should have the Athena access policy attached to the user or role to have access to the Athena data source. See AWS managed policy: AmazonGrafanaAthenaAccess for more info.
  5. On the Grafana workspace console, in the navigation pane, choose the lower AWS icon (there are two) and then choose Athena on the Data sources menu.
  6. Select the default Region that you want the Athena data source to query from, select the accounts that you want, then choose Add data source.
  7. Follow the steps to configure Athena details.

If your workgroup in Athena doesn’t have an output location configured already, you need to specify an S3 bucket and folder to use for query results. After setting up the data source, you can view or edit it in the Configuration pane.

In the following subsections, we demonstrate several panels in the Amazon Monitron dashboard built in Amazon Managed Grafana to gain operational insights. The Athena data source provides a standard SQL query editor that we’ll use to analyze the Amazon Monitron data to generate desired analytics.

First, if there are many sensors in the Amazon Monitron project and they are in different states (healthy, warning, alarm, and needs maintenance), the OT team wants to see at a glance how many sensor positions are in each state. You can obtain this information as a pie chart widget in Grafana via the following Athena query:

SELECT *
FROM (
    SELECT latest_status,
           COUNT(assetdisplayname) OVER (PARTITION BY latest_status) AS asset_health_count
    FROM (
        SELECT timestamp, sitedisplayname, assetdisplayname,
               assetState.newState AS latest_status,
               RANK() OVER (PARTITION BY assetdisplayname ORDER BY timestamp DESC) AS rnk
        FROM "AwsDataCatalog"."Replace with your Athena database name"."Replace with your Athena table name"
    ) tt
    WHERE tt.rnk = 1
)
GROUP BY latest_status, asset_health_count;

The following screenshot shows a panel with the latest distribution of Amazon Monitron sensor status.

To format your SQL query for Amazon Monitron data, refer to Understanding the data export schema.

Next, your OT team may want to plan predictive maintenance based on assets that are in alarm status, and therefore they want to quickly know the total number of acknowledged vs. unacknowledged alarms. You can show this summary information of the alarm state as simple stats panels in Grafana:

SELECT COUNT(*)
FROM (
    SELECT timestamp, sitedisplayname, assetdisplayname,
           assetState.newState AS latest_status,
           RANK() OVER (PARTITION BY assetdisplayname ORDER BY timestamp DESC) AS rnk
    FROM "AwsDataCatalog"."Replace with your Athena database name"."Replace with your Athena table name"
) tt
WHERE tt.rnk = 1 AND tt.latest_status = 'Alarm';

The following panel shows acknowledged and unacknowledged alarms.

The OT team can also query the amount of time the sensors remain in alarm status, so that they can decide their maintenance priority:

SELECT c.assetdisplayname, b.sensorpositiondisplayname, b.number_of_days_in_alarm_state
FROM (
    SELECT a.assetdisplayname, a.sensorpositiondisplayname,
           COUNT(*)/24+1 AS number_of_days_in_alarm_state
    FROM (
        SELECT *
        FROM "AwsDataCatalog"."Replace with your Athena database name"."Replace with your Athena table name"
        WHERE (assetState.newState = 'ALARM' AND assetState.newState = assetState.previousState)
        ORDER BY timestamp DESC
    ) a
    GROUP BY a.assetdisplayname, a.sensorpositiondisplayname
) b
INNER JOIN (
    SELECT *
    FROM (
        SELECT timestamp, sitedisplayname, assetdisplayname,
               assetState.newState AS latest_status,
               RANK() OVER (PARTITION BY assetdisplayname ORDER BY timestamp DESC) AS rnk
        FROM "AwsDataCatalog"."Replace with your Athena database name"."Replace with your Athena table name"
    ) tt
    WHERE tt.rnk = 1 AND tt.latest_status = 'ALARM'
) c
ON b.assetdisplayname = c.assetdisplayname;

The output of this analysis can be visualized with a bar chart in Grafana, where the assets in alarm state are easy to spot, as shown in the following screenshot.

To analyze top/bottom asset performance based on the total amount of time the assets are in an alarm or need maintenance state, use the following query:

SELECT s.sitedisplayname, s.assetdisplayname, COUNT(s.timestamp)/24 AS trouble_time
FROM (
    SELECT timestamp, sitedisplayname, assetdisplayname, sensorpositiondisplayname, assetState.newState
    FROM "AwsDataCatalog"."Replace with your Athena database name"."Replace with your Athena table name"
    WHERE assetState.newState = 'ALARM' OR assetState.newState = 'NEEDS_MAINTENANCE'
) AS s
GROUP BY s.assetdisplayname, s.sitedisplayname
ORDER BY trouble_time, s.assetdisplayname ASC
LIMIT 5;

The following bar gauge is used to visualize the preceding query output, with the top performing assets showing 0 days of alarm states, and the bottom performing assets showing accumulated alarming states over the past year.

To help the OT team understand the possible root cause of an anomaly, the alarm types can be displayed for these assets still in alarm state with the following query:

SELECT a.assetdisplayname, a.sensorpositiondisplayname, a.latest_status,
       CASE WHEN a.temperatureML != 'HEALTHY' THEN 'TEMP'
            WHEN a.vibrationISO != 'HEALTHY' THEN 'VIBRATION_ISO'
            ELSE 'VIBRATION_ML'
       END AS alarm_type
FROM (
    SELECT sitedisplayname, assetdisplayname, sensorpositiondisplayname,
           models.temperatureML.persistentClassificationOutput AS temperatureML,
           models.vibrationISO.persistentClassificationOutput AS vibrationISO,
           models.vibrationML.persistentClassificationOutput AS vibrationML,
           assetState.newState AS latest_status
    FROM (
        SELECT *,
               RANK() OVER (PARTITION BY assetdisplayname, sensorpositiondisplayname ORDER BY timestamp DESC) AS rnk
        FROM "AwsDataCatalog"."Replace with your Athena database name"."Replace with your Athena table name"
    ) tt
    WHERE tt.rnk = 1 AND assetState.newState = 'ALARM'
) a
WHERE (a.temperatureML != 'HEALTHY' OR a.vibrationISO != 'HEALTHY' OR a.vibrationML != 'HEALTHY');

You can visualize this analysis as a table in Grafana. In this Amazon Monitron project, two alarms were triggered by ML models for vibration measurement.

The Amazon Managed Grafana dashboard is shown here for illustration purposes. You can adapt the dashboard design according to your own business needs.

Failure Reports

When a user acknowledges an alarm in the Amazon Monitron app, the associated assets transition to a new state. The user also has the opportunity to provide some details about this alarm:

  • Failure cause – This can be one of the following: ADMINISTRATION, DESIGN, FABRICATION, MAINTENANCE, OPERATION, OTHER, QUALITY, WEAR, or UNDETERMINED
  • Failure mode – This can be one of the following: NO_ISSUE, BLOCKAGE, CAVITATION, CORROSION, DEPOSIT, IMBALANCE, LUBRICATION, MISALIGNMENT, OTHER, RESONANCE, ROTATING_LOOSENESS, STRUCTURAL_LOOSENESS, TRANSMITTED_FAULT, or UNDETERMINED
  • Action taken – This can be ADJUST, CLEAN, LUBRICATE, MODIFY, OVERHAUL, REPLACE, NO_ACTION, or OTHER

The event payload associated to the asset state transition contains all this information, the previous state of the asset, and the new state of the asset. Stay tuned for an update of this post with more details on how you can use this information in an additional Grafana panel to build Pareto charts of the most common failures and actions taken across your assets.
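
In the meantime, the following is a minimal sketch of how such Pareto-style counts could be computed from exported records. The payload field names used here (eventPayload, assetState, closureCode, failureMode, actionTaken) are assumptions based on the schema described above, so check them against your actual exported data; the local file is a hypothetical dump of records pulled from the data lake:

import json
from collections import Counter

failure_modes = Counter()
actions_taken = Counter()

# Hypothetical local file with one exported JSON record per line
with open("monitron_export_sample.jsonl") as f:
    for line in f:
        event = json.loads(line)
        payload = event.get("eventPayload", {})
        state = payload.get("assetState", {})
        # Only asset state transitions acknowledged by a user carry closure details
        if "closureCode" in state:
            failure_modes[state["closureCode"].get("failureMode", "UNKNOWN")] += 1
            actions_taken[state["closureCode"].get("actionTaken", "UNKNOWN")] += 1

# Pareto-style ordering: most frequent failure modes and actions first
print(failure_modes.most_common())
print(actions_taken.most_common())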

Conclusion

Enterprise customers of Amazon Monitron are looking for a solution to build an IoT data lake with Amazon Monitron’s live data, so they can manage multiple Amazon Monitron projects and assets and generate analytics reports across them. This post provided a detailed walkthrough of a solution to build this IoT data lake with the latest Amazon Monitron Kinesis data export v2 feature. The solution also showed how to use other AWS services, such as AWS Glue and Athena, to query the data and generate analytics outputs, and how to visualize those outputs in frequently refreshed dashboards with Amazon Managed Grafana.

As a next step, you can expand this solution by sending ML inference results to other EAM systems that you might use for work order management. This will allow your operation team to integrate Amazon Monitron with other enterprise applications, and improve their operation efficiency. You can also start building more in-depth insights into your failure modes and actions taken by processing the asset state transitions and the closure codes that are now part of the Kinesis data stream payload.


About the authors

Julia Hu is a Sr. AI/ML Solutions Architect at Amazon Web Services. She has extensive experience in IoT architecture and Applied Data Science, and is part of both the Machine Learning and IoT Technical Field Community. She works with customers, ranging from start-ups to enterprises, to develop AWSome IoT machine learning (ML) solutions, at the Edge and in the Cloud. She enjoys leveraging latest IoT and big data technology to scale up her ML solution, reduce latency, and accelerate industry adoption.

Bishr Tabbaa is a solutions architect at Amazon Web Services. Bishr specializes in helping customers with machine learning, security, and observability applications. Outside of work, he enjoys playing tennis, cooking, and spending time with family.

Shalika Pargal is a Product Manager at Amazon Web Services. Shalika focuses on building AI products and services for Industrial customers. She brings significant experience at the intersection of Product, Industrial and Business Development. She recently shared Monitron’s success story at re:Invent 2022.

Garry Galinsky is a Principal Solutions Architect supporting Amazon on AWS. He has been involved with Monitron since its debut and has helped integrate and deploy the solution into Amazon’s worldwide fulfillment network. He recently shared Amazon’s Monitron success story at re:Invent 2022.

Michaël Hoarau is an AI/ML Specialist Solutions Architect at AWS who alternates between data scientist and machine learning architect, depending on the moment. He is passionate about bringing the AI/ML power to the shop floors of his industrial customers and has worked on a wide range of ML use cases, ranging from anomaly detection to predictive product quality or manufacturing optimization. He published a book on time series analysis in 2022 and regularly writes about this topic on LinkedIn and Medium. When not helping customers develop the next best machine learning experiences, he enjoys observing the stars, traveling, or playing the piano.


Deploy large models at high performance using FasterTransformer on Amazon SageMaker


Sparked by the release of large AI models like AlexaTM, GPT, OpenChatKit, BLOOM, GPT-J, GPT-NeoX, FLAN-T5, OPT, Stable Diffusion, and ControlNet, the popularity of generative AI has seen a recent boom. Businesses are beginning to evaluate new cutting-edge applications of the technology in text, image, audio, and video generation that have the potential to revolutionize the services they provide and the ways they interact with customers. However, as the size and complexity of the deep learning models that power generative AI continue to grow, deployment can be a challenging task. Advanced techniques such as model parallelism and quantization become necessary to achieve latency and throughput requirements. Without expertise in using these techniques, many customers struggle to get started with hosting large models for generative AI applications.

This post can help! We begin by discussing different types of model optimizations that can be used to boost performance before you deploy your model. Then, we highlight how Amazon SageMaker large model inference deep learning containers (LMI DLCs) can help with optimization and deployment. Finally, we include code examples using LMI DLCs and FasterTransformer model parallelism to deploy models like flan-t5-xxl and flan-ul2. You can find an accompanying example notebook in the SageMaker examples repository.

Large model deployment pipeline

Major steps in any model inference workflow include loading a model into memory and handling inference requests on this in-memory model through a model server. Large models complicate this process because loading a 350 GB model such as BLOOM-176B can take tens of minutes, which materially impacts endpoint startup time. Furthermore, because these models can’t fit within the memory of a single accelerator, the model must be organized and partitioned such that it can be spread across the memory of multiple accelerators; then, model servers must handle processes and communication across multiple accelerators. Beyond model loading, partitioning, and serving, compression techniques are increasingly necessary to achieve performance goals (such as subsecond latency) for customers working with large models. Quantization and compression can reduce model size and serving cost by reducing the precision of weights or reducing the number of parameters via pruning or distillation. Compilation can optimize the computation graph and fuse operators to reduce memory and compute requirements of a model. Achieving low latency for large language models (LLMs) requires improvements in all the steps in the inference workflow: compilation, model loading, compression (runtime quantization), partitioning (tensor or pipeline parallelism), and model serving. At a high level, partitioning (with kernel optimization) brings down inference latency up to 66% (for example, BLOOM-176B from 30 seconds to 10 seconds), compilation by 20%, and compression by 50% (fp32 to fp16). An example pipeline for large model hosting with runtime partitioning is illustrated in the following diagram.

Overview of large model inference optimization techniques

With the large model deployment pipeline in mind, we now explore the optimizations. Optimizations can be critical to achieve latency and throughput goals. However, you need to be thoughtful about which optimizations you use and to what degree, because the accuracy of your model can be affected.

The following diagram is a high-level overview of different inference optimization techniques. Optimization approaches can be at the hardware or software level. We focus only on software optimization techniques in this post.

Optimized kernels and compilation

Today, optimized kernels are the greatest source of performance improvement for LMI (for example, DeepSpeed’s kernels reduced BLOOM-176B latency by three times). Fused kernel operators are model specific, and different model parallel libraries have different approaches. DeepSpeed created an injection policy for each model family. DeepSpeed has handwritten PyTorch modules and CUDA kernels that can speed up parts of the model. Meanwhile, FasterTransformer rewrites the model in pure C++ and CUDA to speed up the model as a whole. PyTorch 2.0 offers an open portal (via torch.compile) to allow easy compilation into different platforms. To bring cost- and performance-optimized serving for LLMs to SageMaker, we offer SageMaker LMI containers that provide the best open-source compilation stack on a per-model basis, like T5 with FasterTransformer and GPT-J with DeepSpeed.

Compilation or integration to optimized runtime

ML compilers, such as Amazon SageMaker Neo, apply techniques such as operator fusion, memory planning, graph optimizations, and automatic integration to optimized inference libraries. Because inference includes only a forward propagation, intermediate tensors between layers are discarded instead of stored for reuse in back-propagation. The graph optimization techniques improve the inference throughput and have a small impact on model memory footprints. Relative to other optimization techniques, compilation for inference provides a limited benefit for reducing a model’s memory requirements. Several runtime libraries for GPU are available today, such as FasterTransformer, TensorRT, and ONNX Runtime.

Model compression

Model compression is a collection of approaches that researchers and practitioners can use to reduce the size of their model, realize faster speed, and reduce hosting cost. Model compression techniques primarily include knowledge distillation, pruning, and quantization. Most compression technologies are challenging for LLMs due to requiring additional training cycles to improve the accuracy of compressed models.

Quantization

Quantization is the process of mapping values from a larger or continuous set of numbers to a smaller set of numbers (for example, INT8 {-128:127}, uINT8 {0:255}). Using a smaller set of numbers reduces memory use and complexity of computations, but the decreased precision can degrade the accuracy of the model. The level of quantization can be adjusted to fit size constraints and accuracy needs. For example, a model quantized to FP8 will be about half the size of a model in FP16 but at the expense of reduced accuracy.

Quantization has shown great and consistent success for inference tasks by reducing the size of the model up to 75%, offering 2–4 times throughput improvements and cost savings.

The success of quantization is because it’s broadly applicable across a range of models and use cases with approximately 1% accuracy/score loss, if a proper technique is used. It doesn’t require changing model architecture. Typically, it starts with an existing floating-point model and quantizes it to obtain a fixed-point quantized model. Quantizing from FP32 to INT8 reduces the model size by 75%, but the accuracy/score loss impact is often less than a point.
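
To make the mapping concrete, the following framework-agnostic toy example quantizes a small FP32 weight matrix to INT8 using a scale and zero point, then dequantizes it to measure the reconstruction error:

import numpy as np

weights = np.random.randn(4, 4).astype(np.float32)

# Map the observed float range onto the signed INT8 range [-128, 127]
scale = (weights.max() - weights.min()) / 255.0
zero_point = np.round(-128 - weights.min() / scale)

# Quantize: scale, shift, round, and clip into INT8
q = np.clip(np.round(weights / scale + zero_point), -128, 127).astype(np.int8)

# Dequantize to approximate the original values; the gap is the quantization error
deq = (q.astype(np.float32) - zero_point) * scale
print("max abs error:", np.abs(weights - deq).max())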

Distillation

With distillation, a larger teacher model transfers knowledge to a smaller student model. The model size can be reduced until the student model can fit on an edge device or smaller cloud-based hardware, but accuracy decreases as the model is reduced. There is no industry standard for distillation, and many techniques are experimental. Distillation requires more work by the customer in tuning and trial and error to shrink the model without affecting accuracy. For more information, refer to Knowledge distillation in deep learning and its applications.

Pruning

Pruning is a model compression technique that reduces the number of operations by removing parameters. To minimize the impact to model accuracy, parameters are first ranked by importance. Parameters that are less important are set to zero or connections to the neuron are removed. This decreases the number of operations with minimal impact to model accuracy. For example, when using a pre-trained model for a narrow use case, parts of the larger model that are less relevant to your application could be pruned away to reduce size without significantly degrading performance for your task.

Model partitioning

A model that can’t fit on a single accelerator’s memory must be split into multiple partitions. At a high level, there are two fundamental approaches to partitioning the model (model parallelism): tensor parallelism and pipeline parallelism.

Tensor parallelism is also called intra-layer model parallelism. In this approach, each one of the layers is partitioned across the workers (accelerators). On the positive side, we can handle models with very large layers, because the layers are split across workers. Therefore, we no longer need to fit at least a single layer on a worker, as was the case for pipeline parallelism. However, this leads to an all-to-all communication pattern between the workers after each one of the layers, so there’s a heavy burden on the GPU/accelerator interconnect.
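
As a toy illustration of this idea, the following NumPy sketch mimics tensor parallelism on a single linear layer: the weight matrix is split column-wise across two hypothetical workers, each computes a partial output, and the partial outputs are then gathered (in a real system, this gather is the all-to-all communication across accelerators mentioned above):

import numpy as np

hidden, out_dim = 8, 6
x = np.random.randn(1, hidden)
W = np.random.randn(hidden, out_dim)

# Split the weight matrix column-wise across two workers
W0, W1 = np.split(W, 2, axis=1)

# Each worker holds only its shard and computes a partial output
y0 = x @ W0
y1 = x @ W1

# Communication step: gather the partial outputs into the full result
y_parallel = np.concatenate([y0, y1], axis=1)

assert np.allclose(y_parallel, x @ W)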

Pipeline parallelism partitions the model into layers. Each worker may end up with having one or more layers. This approach uses point-to-point communication and therefore introduces lower communication overhead compared to tensor parallelism. However, this approach won’t be useful if a layer can’t fit into a single worker’s or accelerator’s memory. This approach is also prone to pipeline idleness and may reduce the scaling efficiency.

Open-source frameworks like DeepSpeed, Hugging Face Accelerate, and FasterTransformer allow per-model optimization to shard the model. For DeepSpeed in particular, the partitioning algorithm is tightly coupled with fused kernel operators. SageMaker LMI containers come with pre-integrated model partitioning frameworks like FasterTransformer, DeepSpeed, Hugging Face Accelerate, and Transformers-NeuronX. Currently, DeepSpeed, FasterTransformer, and Hugging Face Accelerate shard the model at model loading time. Runtime model partitioning can take more than 10 minutes (OPT-66B) and consume extensive CPU, GPU, and accelerator memory. Ahead-of-time (AOT) partitioning can help reduce model loading times. With AOT, models are partitioned before deployment, and the partitions are kept ready for downstream optimization and subsequent ingestion by model parallel frameworks. When model parallel frameworks are fed already partitioned models, runtime partitioning doesn’t happen. This improves model loading time and reduces CPU, GPU, and accelerator memory consumption. DeepSpeed and FasterTransformer support pre-partitioning and saving models.

Prompt engineering

Prompt engineering refers to efforts to extract accurate, consistent, and fair outputs from large models, such as text-to-image synthesizers or large language models. LLMs are trained on large-scale bodies of text, so they encode a great deal of factual information about the world. A prompt consists of text and optionally an image given to a pre-trained model for a prediction task. A prompt may also include additional components like context, a task (instruction, question, and so on), an image or text, and training samples. Prompt engineering also provides a way for LLMs to do few-shot generalization, in which a machine learning model trained on a set of generic tasks learns a new or related task from just a handful of examples. For more information, refer to EMNLP: Prompt engineering is the new feature engineering. Refer to the following GitHub repo for more information about getting the most out of your large models using prompt engineering on SageMaker.

Model downloading and loading

Large language models incur long download times (for example, 40 minutes to download BLOOM-176B). In 2022, SageMaker Hosting added support for larger Amazon Elastic Block Store (Amazon EBS) volumes up to 500 GB, a longer download timeout of up to 60 minutes, and a longer container startup time of 60 minutes. You can enable this configuration to deploy LLMs on SageMaker. SageMaker LMI containers include model download optimization using the s5cmd library to speed up model download and container startup times, which in turn speeds up auto scaling on SageMaker.
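
For reference, these limits are exposed as parameters on the endpoint configuration; the following boto3 sketch shows where they are set. The configuration name, model name, instance type, and sizes are placeholders, and the volume size applies only to instance types backed by EBS storage:

import boto3

sm = boto3.client("sagemaker")

sm.create_endpoint_config(
    EndpointConfigName="llm-endpoint-config",       # placeholder
    ProductionVariants=[
        {
            "VariantName": "AllTraffic",
            "ModelName": "my-large-model",          # a model you have already created
            "InstanceType": "ml.p3.8xlarge",        # placeholder instance type
            "InitialInstanceCount": 1,
            # Larger EBS volume for the model artifacts (EBS-backed instance types only)
            "VolumeSizeInGB": 256,
            # Allow up to 60 minutes for model download and container startup
            "ModelDataDownloadTimeoutInSeconds": 3600,
            "ContainerStartupHealthCheckTimeoutInSeconds": 3600,
        }
    ],
)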

Diving deep into SageMaker LMI containers

SageMaker maintains large model inference containers with popular open-source libraries for hosting large models such as GPT, T5, OPT, BLOOM, and Stable Diffusion on AWS infrastructure. With these containers, you can use corresponding open-source libraries such as DeepSpeed, Accelerate, FasterTransformer, and Transformers-NeuronX to partition model parameters using model parallelism techniques to use the memory of multiple GPUs or accelerators for inference. Transformers-NeuronX is a model parallel library introduced by the AWS Neuron team for AWS Inferentia and AWS Trainium to support LLMs. It supports tensor parallelism across Neuron cores.

The LMI container uses DJLServing as the pre-built integrated model server; pre-built integrated model partitioning frameworks like DeepSpeed, Accelerate, FasterTransformer, and Transformers-NeuronX; support for PyTorch; and comes with pre-installed cuDNN, cuBLAS, NCCL CUDA Toolkit for GPUs, MKL for CPU, and the Neuron SDK and runtime for running models on AWS Inferentia and Trainium.

Pre-integrated model partitioning frameworks in SageMaker LMI containers

SageMaker LMI comes with pre-integrated model partitioning frameworks to suit your performance and model support requirements.

Most of the model parallel frameworks support both pipeline and tensor parallelism. Pipeline parallelism is a simpler implementation than tensor parallelism. However, due to its sequential operating nature, it's slower than tensor parallelism. Pipeline parallelism and tensor parallelism can also be combined.

The following table summarizes the different model partitioning frameworks to help you select the right framework for deploying your models on SageMaker.

| | Hugging Face Accelerate | DeepSpeed | FasterTransformer | Transformers-NeuronX (Inf2/Trn1) |
| Model Parallel | Pipeline Parallelism | Pipeline and Tensor Parallelism | Pipeline and Tensor Parallelism | Tensor Parallelism |
| Load Hugging Face checkpoints | Yes | Yes | Yes | Yes |
| Runtime partition | Yes | Yes | Yes | No |
| Ahead-of-time partition | No | Yes | Yes | No |
| Model partitioning on CPU memory | No | No | Yes | No |
| Supported models | All Hugging Face models | All GPT family, Stable Diffusion, and T5 family | GPT2/OPT/BLOOM/T5 | GPT2/OPT/GPTJ/GPT-NeoX* |
| Streaming tokens | | | | |
| Fast model loading | | | | |
| Model loading speed | Medium | Fast | Fast | |
| Performance on model types | All other non-optimized models | GPT family | T5 and BLOOM | All supported models |
| Hardware support | CPU/GPU | GPU | GPU | Inf2/Trn1 |
| SM MME support | | | | |
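
In practice, the framework you choose from this table maps to the engine line in the container's serving.properties file. The following snippet is an illustrative sketch (the parallel degree is a placeholder, and the built-in handler names are assumptions based on the LMI built-in handlers); the FasterTransformer variant is covered in detail later in this post.

# Illustrative serving.properties choices per partitioning framework (placeholder values)

# DeepSpeed
engine=DeepSpeed
option.entryPoint=djl_python.deepspeed
option.tensor_parallel_degree=4

# Hugging Face Accelerate (served through the Python engine)
engine=Python
option.entryPoint=djl_python.huggingface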

Large model deployment pipeline on SageMaker

SageMaker LMI containers offer a low-code/no-code mechanism to set up your large model deployment pipeline with the following capabilities:

  • Faster model download time using s5cmd
  • Pre-built optimized model parallel frameworks including Transformers-NeuronX, DeepSpeed, Hugging Face Accelerate, and FasterTransformer
  • Pre-built foundation software stack including PyTorch, NCCL, and MPI
  • Low-code/no-code deployment of large models by configuring serving.properties
  • SageMaker-compatible containers

The following diagram gives an overview of a SageMaker LMI deployment pipeline you can use to deploy your models.

Deploy a FLAN-T5-XXL model on SageMaker using the newly released LMI container version

FasterTransformer is a library implementing an accelerated engine for the inference of transformer-based neural networks, with a special emphasis on large models, spanning many GPUs and nodes in a distributed manner. FasterTransformer contains the implementation of the highly optimized version of the transformer block that contains the encoder and decoder parts. With this block, you can run the inference of both the full encoder-decoder architectures like T5, as well as encoder-only models such as BERT, or decoder-only models such as GPT. It’s written in C++/CUDA and relies on the highly optimized cuBLAS, cuBLASLt, and cuSPARSELt libraries. This allows you to build the fastest transformer inference pipeline on GPU.

The FasterTransformer model parallel library is now available in the SageMaker LMI container, adding support for popular models such as flan-t5-xxl and flan-ul2. The library has been designed to handle large models that require multiple GPUs or accelerators and nodes in a distributed manner.

Runtime architecture of hosting a model using an LMI container’s FasterTransformer engine on SageMaker

The FasterTransformer engine in an LMI container supports loading model weights from an Amazon Simple Storage Service (Amazon S3) path or the Hugging Face Hub. After fetching the model, it converts the Hugging Face model checkpoint into FasterTransformer-supported partitioned model artifacts based on input parameters like the tensor parallel degree, and loads the partitioned artifacts across the GPU devices. It loads models faster by using multi-process loading in Python, supports AOT compilation, and uses the CPU to partition the model. SageMaker LMI containers improve performance by downloading models from Amazon S3 using s5cmd and by providing the FasterTransformer engine, which gives developers a layer of abstraction that loads the model in Hugging Face checkpoint or PyTorch bin format and uses the FasterTransformer library to convert it into a FasterTransformer-compatible format. These steps happen during container startup, so the model is loaded in memory before inference requests come in. The FasterTransformer engine then provides high-performance C++ and CUDA implementations for the models to run inference. This helps improve container startup time and reduce inference latency. The following diagram illustrates the runtime architecture of serving models using FasterTransformer on SageMaker. For more information about DJLServing's runtime architecture, refer to Deploy large models on Amazon SageMaker using DJLServing and DeepSpeed model parallel inference.

Use SageMaker LMI container images

To use a SageMaker LMI container to host a FLAN-T5 model, we have a no-code option or a bring-your-own-script option. We showcase the bring-your-own-script option in this post. The first step in the process is to use the right LMI container image. An example notebook is available in the GitHub repo.

Use the following code to retrieve the SageMaker LMI container image, with the Region set to the specific Region you're running the notebook in:

import sagemaker
from sagemaker import image_uris
sess = sagemaker.session.Session()  # SageMaker session used throughout the notebook
inference_image_uri = image_uris.retrieve(
    framework="djl-fastertransformer", region=sess.boto_session.region_name, version="0.21.0"
)

Download the model weights

An LMI container allows us to download the model weights from the Hugging Face Hub at run time when spinning up the instance for deployment. However, that takes longer because it’s dependent on the network and on the provider. The faster option is to download the model weights into Amazon S3 and then use the LMI container to download them to the container from Amazon S3. This is also a preferred method when we need to scale up our instances. In this post, we showcase how to download the weights to Amazon S3 and then use them when configuring the container. See the following code:

model_name = "google/flan-t5-xxl"
# Only download pytorch checkpoint files
allow_patterns = ["*.json", "*.pt", "*.bin", "*.txt", "*.model"]
# - Leverage the snapshot library to download the model since the model is stored in repository using LFS
model_download_path = snapshot_download(
    repo_id=model_name,
    cache_dir=local_model_path,
    allow_patterns=allow_patterns,
)

# define a variable to contain the s3url of the location that has the model
pretrained_model_location = f"s3://{model_bucket}/{s3_model_prefix}/"

model_artifact = sess.upload_data(path=model_download_path, key_prefix=s3_model_prefix)

Create the model configuration and inference script

First, we create a file called serving.properties that configures the container. This tells the DJL model server to use the FasterTransformer engine to load and shard the model weights. Second, we point to the S3 URI where the model weights have been uploaded. The LMI container then downloads the model artifacts from Amazon S3 using s5cmd. The file contains the following code:

engine=FasterTransformer
option.tensor_parallel_degree=4
option.s3url={{s3url}}

For the no-code option, the key change is to specify the entry_point as the built-in handler. We specify the value as djl_python.fastertransformer. For more details, refer to the GitHub repo. You can modify this code for your own use case as needed. A complete example that illustrates the no-code option can be found in the following notebook. The serving.properties file will now look like the following code:

engine=FasterTransformer
option.entryPoint=djl_python.fastertransformer
option.s3url={{s3url}}
option.tensor_parallel_degree=4

Next, we create our model.py file, which defines the code needed to load and then serve the model. The only mandatory method is handle(inputs). We continue to use the functional programming paradigm to build the other helpful methods like load_model(), pipeline_generate(), and more. In our code, we read in the tensor_parallel_degree property value (the default value is 1). This sets the number of devices over which the tensor parallel modules are distributed. Second, we get the model weights downloaded under the /tmp location on the container, referenceable through the model_dir property. To load the model, we use the FasterTransformer init method as shown in the following code. Note that we load the full-precision weights in FP32. You can also quantize the model at runtime by setting dtype = "fp16" in the following code and setting tensor_parallel_degree = 2 in serving.properties. However, note that the FP16 version of this model may not provide similar performance in terms of output quality as the FP32 version. In addition, refer to an existing issue related to the impact on model quality when using FasterTransformer with the T5 model for certain NLP tasks.

import fastertransformer as ft
from djl_python import Input, Output
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    T5Tokenizer,
    T5ForConditionalGeneration,
)
import os
import logging
import math
import torch


def load_model(properties):
    model_name = "google/flan-t5-xxl"
    tensor_parallel_degree = properties["tensor_parallel_degree"]
    pipeline_parallel_degree = 1
    model_location = properties["model_dir"]
    if "model_id" in properties:
        model_location = properties["model_id"]
    logging.info(f"Loading model in {model_location}")

    tokenizer = T5Tokenizer.from_pretrained(model_location)
    dtype = "fp32"
    model = ft.init_inference(
        model_location, tensor_parallel_degree, pipeline_parallel_degree, dtype
    )
    return model, tokenizer


model = None
tokenizer = None


def handle(inputs: Input):
    """
    inputs: Contains the configurations from serving.properties
    """
    global model, tokenizer

    if not model:
        model, tokenizer = load_model(inputs.get_properties())

    if inputs.is_empty():
        # Model server makes an empty call to warmup the model on startup
        return None

    data = inputs.get_as_json()

    input_sentences = data["inputs"]
    params = data["parameters"]

    outputs = model.pipeline_generate(input_sentences, **params)
    result = {"outputs": outputs}

    return Output().add_as_json(result)
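
Before creating the SageMaker model in the next section, serving.properties and model.py are packaged into a tarball and uploaded to Amazon S3. The following is a minimal sketch of that step, assuming both files sit in the working directory; the prefix and the resulting s3_code_artifact variable name are assumptions chosen to match the code that follows.

# Minimal packaging sketch: bundle serving.properties and model.py and upload the tarball to S3.
# The prefix is an example assumption; s3_code_artifact is referenced when creating the SageMaker model.
import tarfile

s3_code_prefix = "flan-t5-xxl/code"

with tarfile.open("model.tar.gz", "w:gz") as tar:
    tar.add("serving.properties")
    tar.add("model.py")

s3_code_artifact = sess.upload_data("model.tar.gz", key_prefix=s3_code_prefix)
print(f"Code artifact uploaded to: {s3_code_artifact}")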

Create a SageMaker endpoint for inference

In this section, we go through the steps to create a SageMaker model and endpoint for inference.

Create a SageMaker model

We now create a SageMaker model. We use the Amazon Elastic Container Registry (Amazon ECR) image for the LMI container and the model artifact from the previous step to create the SageMaker model. In the model setup, we configure tensor_parallel_degree to 4 in serving.properties, which means the model is partitioned across 4 GPUs. See the following code:

import boto3
from sagemaker.utils import name_from_base

sm_client = boto3.client("sagemaker")  # SageMaker control plane client
role = sagemaker.get_execution_role()  # IAM role used by the endpoint

model_name = name_from_base("flan-xxl-fastertransformer")
print(model_name)
create_model_response = sm_client.create_model(
    ModelName=model_name,
    ExecutionRoleArn=role,
    PrimaryContainer={
        "Image": inference_image_uri,
        "ModelDataUrl": s3_code_artifact,
    },
)
model_arn = create_model_response["ModelArn"]
print(f"Created Model: {model_arn}")

Create a SageMaker endpoint for inference

You can use any instance with multiple GPUs for testing. In this demo, we use a g5.12xlarge instance. In the following code, note how we set ModelDataDownloadTimeoutInSeconds and ContainerStartupHealthCheckTimeoutInSeconds. We don't set the VolumeSizeInGB parameter because this instance comes with local SSD storage. The VolumeSizeInGB parameter is applicable to GPU instances that support EBS volume attachment.

endpoint_config_name = f"{model_name}-config"  # name for the endpoint configuration

endpoint_config_response = sm_client.create_endpoint_config(
    EndpointConfigName=endpoint_config_name,
    ProductionVariants=[
        {
            "VariantName": "variant1",
            "ModelName": model_name,
            "InstanceType": "ml.g5.12xlarge",
            "InitialInstanceCount": 1,
            # "VolumeSizeInGB": 200,
            "ModelDataDownloadTimeoutInSeconds": 600,
            "ContainerStartupHealthCheckTimeoutInSeconds": 600,
        },
    ],
)

Lastly, we create a SageMaker endpoint:

endpoint_name = f"{model_name}-endpoint"  # name for the endpoint
create_endpoint_response = sm_client.create_endpoint(
    EndpointName=endpoint_name, EndpointConfigName=endpoint_config_name
)

Starting the endpoint might take a few minutes. If you run into an InsufficientInstanceCapacity error, you can try again after a while, or you can raise a request to AWS to increase the instance limit in your account.
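
While the endpoint is being created, you can poll its status with the standard describe_endpoint call and block until it is in service; the following is a small sketch using the same sm_client and endpoint_name as above.

import time

# Poll the endpoint status until creation finishes
resp = sm_client.describe_endpoint(EndpointName=endpoint_name)
status = resp["EndpointStatus"]
while status == "Creating":
    time.sleep(60)
    resp = sm_client.describe_endpoint(EndpointName=endpoint_name)
    status = resp["EndpointStatus"]
    print(f"Endpoint status: {status}")

print(f"Endpoint {endpoint_name} is {status}")  # expect InService when ready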

Invoke the model

This is a generative model, so we pass in text as a prompt, and the model completes the sentence and returns the results.

You can pass a batch of prompts as input to the model. This is done by setting inputs to the list of prompts; the model then returns a result for each prompt (see the batched example after the following code). The text generation can be configured using appropriate parameters.

import json

smr_client = boto3.client("sagemaker-runtime")  # SageMaker runtime client for invocations

# We set the prompt in the "inputs" field, which matches what we extract in model.py
response_model = smr_client.invoke_endpoint(
    EndpointName=endpoint_name,
    Body=json.dumps({
        "batch_size": 1,
        "inputs": "Amazon.com is an awesome site",
        "parameters": {},
    }),
    ContentType="application/json",
)
response_model["Body"].read().decode("utf8")

Model parameters at inference time

The following code lists the set of default parameters that are used by the model. You can set these arguments to specific values of your choice when invoking the endpoint.

default_args = dict(
    inputs_embeds=None,
    beam_width=1,
    max_seq_len=200,
    top_k=1,
    top_p=0.0,
    beam_search_diversity_rate=0.0,
    temperature=1.0,
    len_penalty=0.0,
    repetition_penalty=1.0,
    presence_penalty=None,
    min_length=0,
    random_seed=0,
    is_return_output_log_probs=False,
    is_return_cum_log_probs=False,
    is_return_cross_attentions=False,
    bad_words_list=None,
    stop_words_list=None,
)

The following code has a sample invocation to the endpoint we deployed. We use the max_seq_len parameter to control the number of tokens that are generated and temperature to control the randomness of the generated text.

smr_client.invoke_endpoint(
    EndpointName=endpoint_name,
    Body=json.dumps(
        {
            "inputs": [
                "Title: ”University has a new facility coming up“\nGiven the above title of an imaginary article, imagine the article.n"
            ],
            "parameters": {"max_seq_len": 200, "temperature": 0.7},
            "padding": True,
        }
    ),
    ContentType="application/json",
)["Body"].read().decode("utf8")

Clean up

When you’re done testing the model, delete the endpoint to save costs if the endpoint is no longer required:

# Delete the endpoint
sm_client.delete_endpoint(EndpointName=endpoint_name)
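
If you also want to remove the endpoint configuration and the model created earlier, you can delete them with the same client; this is an optional addition using standard SageMaker APIs.

# Optionally delete the endpoint configuration and the model as well
sm_client.delete_endpoint_config(EndpointConfigName=endpoint_config_name)
sm_client.delete_model(ModelName=model_name)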

Performance tuning

If you intend to use this post and accompanying notebook with a different model, you may want to explore some of the tunable parameters that SageMaker, DeepSpeed, and the DJL offer. Iteratively experimenting with these parameters can have a material impact on the latency, throughput, and cost of your hosted large model. To learn more about tuning parameters such as number of workers, degree of tensor parallelism, job queue size, and others, refer to DJLServing configurations and Deploy large models on Amazon SageMaker using DJLServing and DeepSpeed model parallel inference.
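
For reference, several of these knobs live in the same serving.properties file used earlier. The snippet below is an illustrative sketch of commonly tuned DJLServing options; the values are placeholders, not recommendations.

# Illustrative tuning knobs in serving.properties (placeholder values)
engine=FasterTransformer
# number of devices each model copy is sharded across
option.tensor_parallel_degree=4
# minimum and maximum worker processes per model
minWorkers=1
maxWorkers=1
# number of requests buffered per model before new requests are rejected
job_queue_size=100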

Benchmarking results on hosting FLAN-T5 model on SageMaker

The following table summarizes our benchmarking results.

| Model | Model Partitioning and Optimization Engine | Quantization | Batch Size | Tensor Parallel Degree | Number of Workers | Inference Latency P50 (ms) | Inference Latency P90 (ms) | Inference Latency P99 (ms) | Data Quality |
| flan-t5-xxl | FasterTransformer | FP32 | 4 | 4 | 1 | 327.39 | 331.01 | 612.73 | Normal |

For our benchmark, we used four different types of tasks that form a single batch and benchmarked the Flan-T5-XXL model. FasterTransformer used a tensor parallel degree of 4 (the model gets partitioned across four accelerator devices on the same host). From our benchmark observations, FasterTransformer was the most performant in terms of latency and throughput compared to other frameworks for hosting this model. The p99 inference latency was 612 milliseconds.

Conclusion

In this post, we gave an overview of large model hosting challenges and how SageMaker LMI containers help you address these challenges with their low-code/no-code capabilities. We showcased how to host large models with high performance on SageMaker using FasterTransformer in the SageMaker LMI container, and demonstrated this new capability by deploying a FLAN-T5-XXL model on SageMaker. We also covered the options available to tune the performance of your models using different model optimization approaches and how SageMaker LMI containers offer low-code/no-code options for hosting and optimizing large models.


About the authors

Dhawal Patel is a Principal Machine Learning Architect at AWS. He has worked with organizations ranging from large enterprises to mid-sized startups on problems related to distributed computing and artificial intelligence. He focuses on deep learning, including the NLP and computer vision domains. He helps customers achieve high-performance model inference on SageMaker.

Rohith Nallamaddi is a Software Development Engineer at AWS. He works on optimizing deep learning workloads on GPUs and building high-performance ML inference and serving solutions. Prior to this, he worked on building microservices on AWS for the Amazon F3 business. Outside of work, he enjoys playing and watching sports.

Robert Van Dusen is a Senior Product Manager with Amazon SageMaker. He leads deep learning model optimization for applications such as large model inference.

Rupinder Grewal is a Sr. AI/ML Specialist Solutions Architect with AWS. He currently focuses on model serving and MLOps on SageMaker. Prior to this role, he worked as a Machine Learning Engineer building and hosting models. Outside of work, he enjoys playing tennis and biking on mountain trails.

Pinak Panigrahi works with customers to build machine learning-driven solutions to solve strategic business problems on AWS. When not occupied with machine learning, he can be found taking a hike, reading a book, or catching up with sports.

Qing Lan is a Software Development Engineer at AWS. He has been working on several challenging products at Amazon, including high-performance ML inference solutions and a high-performance logging system. Qing's team successfully launched the first billion-parameter model in Amazon Advertising while meeting very low latency requirements. Qing has in-depth knowledge of infrastructure optimization and deep learning acceleration.
