Deploy BLOOM-176B and OPT-30B on Amazon SageMaker with large model inference Deep Learning Containers and DeepSpeed


The last few years have seen rapid development in the field of deep learning. Although hardware has improved, such as with the latest generation of accelerators from NVIDIA and Amazon, advanced machine learning (ML) practitioners still regularly encounter issues deploying their large deep learning models for applications such as natural language processing (NLP).

In an earlier post, we discussed capabilities and configurable settings in Amazon SageMaker model deployment that can make inference with these large models easier. Today, we announce a new Amazon SageMaker Deep Learning Container (DLC) that you can use to get started with large model inference in a matter of minutes. This DLC packages some of the most popular open-source libraries for model parallel inference, such as DeepSpeed and Hugging Face Accelerate.

In this post, we use a new SageMaker large model inference DLC to deploy two of the most popular large NLP models: BigScience’s BLOOM-176B and Meta’s OPT-30B from the Hugging Face repository. In particular, we use Deep Java Library (DJL) serving and tensor parallelism techniques from DeepSpeed to achieve 0.1 second latency per token in a text generation use case.

You can find our complete example notebooks in our GitHub repository.

Large model inference techniques

Language models have recently exploded in both size and popularity. With easy access from model zoos such as Hugging Face and improved accuracy and performance in NLP tasks such as classification and text generation, practitioners are increasingly reaching for these large models. However, large models are often too big to fit within the memory of a single accelerator. For example, the BLOOM-176B model can require more than 350 gigabytes of accelerator memory, which far exceeds the memory of any single hardware accelerator available today. This necessitates the use of model parallel techniques from libraries like DeepSpeed and Hugging Face Accelerate to distribute a model across multiple accelerators for inference. In this post, we use the SageMaker large model inference container to generate and compare latency and throughput performance using these two open-source libraries.
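To see where a figure like 350 GB comes from, a rough estimate multiplies the parameter count by the bytes stored per parameter. The following sketch assumes 2 bytes per parameter for FP16 weights and 1 byte for INT8 weights, and it counts only the weights, not activations or runtime overhead. The function name is illustrative and not part of any library.

```python
def weight_memory_gb(num_params: float, bytes_per_param: int) -> float:
    """Approximate memory for model weights only, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9

bloom_params = 176e9  # BLOOM-176B
opt_params = 30e9     # OPT-30B

print(weight_memory_gb(bloom_params, 2))  # FP16 weights
print(weight_memory_gb(bloom_params, 1))  # INT8 weights
print(weight_memory_gb(opt_params, 2))    # FP16 weights
```

Under these assumptions, the FP16 weights of BLOOM-176B alone come to about 352 GB, which is consistent with the requirement quoted above.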

DeepSpeed and Accelerate use different techniques to optimize large language models for inference. The key difference is DeepSpeed’s use of optimized kernels. These kernels can dramatically improve inference latency by reducing bottlenecks in the computation graph of the model. Optimized kernels can be difficult to develop and are typically specific to a particular model architecture; DeepSpeed supports popular large models such as OPT and BLOOM with these optimized kernels. In contrast, Hugging Face’s Accelerate library doesn’t include optimized kernels at the time of writing. As we discuss in our results section, this difference is responsible for much of the performance edge that DeepSpeed has over Accelerate.

A second difference between DeepSpeed and Accelerate is the type of model parallelism. Accelerate uses pipeline parallelism to partition a model between the hidden layers of a model, whereas DeepSpeed uses tensor parallelism to partition the layers themselves. Pipeline parallelism is a flexible approach that supports more model types and can improve throughput when larger batch sizes are used. Tensor parallelism requires more communication between GPUs because model layers can be spread across multiple devices, but can improve inference latency by engaging multiple GPUs simultaneously. You can learn more about parallelism techniques in Introduction to Model Parallelism and Model Parallelism.
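The difference between the two partitioning styles can be illustrated with a toy example in plain Python. This is only a sketch of the idea, not how DeepSpeed or Accelerate implement it: there are no GPUs involved, the "devices" are just loop iterations, and all function names are made up for illustration. It assumes the number of columns (or layers) divides evenly by the number of devices.

```python
from typing import List

Matrix = List[List[float]]

def matmul(x: Matrix, w: Matrix) -> Matrix:
    """Plain matrix multiply: (n x k) @ (k x m) -> (n x m)."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*w)] for row in x]

def split_columns(w: Matrix, parts: int) -> List[Matrix]:
    """Split a weight matrix column-wise into equal shards, one per device."""
    step = len(w[0]) // parts
    return [[row[i * step:(i + 1) * step] for row in w] for i in range(parts)]

def tensor_parallel_layer(x: Matrix, w: Matrix, devices: int) -> Matrix:
    """Each device multiplies by its own shard of the layer; the partial
    outputs are then concatenated, which is the communication step."""
    partial = [matmul(x, shard) for shard in split_columns(w, devices)]
    return [sum((p[r] for p in partial), []) for r in range(len(x))]

def pipeline_parallel(x: Matrix, layers: List[Matrix], devices: int) -> Matrix:
    """Whole layers are assigned to devices in contiguous stages,
    and the stages run one after another."""
    per_stage = len(layers) // devices
    for stage in range(devices):
        for w in layers[stage * per_stage:(stage + 1) * per_stage]:
            x = matmul(x, w)
    return x

x = [[1.0, 2.0]]
w1 = [[1.0, 0.0, 2.0, 1.0], [0.0, 1.0, 1.0, 3.0]]      # 2 x 4 layer
w2 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 0.0]]  # 4 x 2 layer

tp_out = tensor_parallel_layer(x, w1, devices=2)
pp_out = pipeline_parallel(x, [w1, w2], devices=2)
print(tp_out, pp_out)
```

In the tensor parallel case, both devices work on the same layer at the same time, which is why it can reduce latency at the cost of extra communication. In the pipeline case, each device holds complete layers and only passes activations to the next stage.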

Solution overview

To effectively host large language models, we need features and support in the following key areas:

  • Building and testing solutions – Given the iterative nature of ML development, we need the ability to build, rapidly iterate, and test how the inference endpoint will behave when these models are hosted, including the ability to fail fast. These models can typically be hosted only on larger instances like p4d or g5, and given the size of the models, it can take a while to spin up an inference instance and run any test iteration. Local testing usually has constraints because you need an instance of similar size to test on, and these instances aren’t easy to obtain.
  • Deploying and running at scale – The model files need to be loaded onto the inference instances, which presents a challenge in itself given the size. For example, creating a tar archive of BLOOM-176B takes about 1 hour, and extracting and loading it takes another hour. We need an alternate mechanism to allow easy access to the model files.
  • Loading the model as singleton – For a multi-worker process, we need to ensure the model gets loaded only once so we don’t run into race conditions or waste resources. In this post, we show a way to load directly from Amazon Simple Storage Service (Amazon S3). However, this only works if we use the default settings of the DJL. Furthermore, any scaling of the endpoints needs to be able to spin up in a few minutes, which calls for reconsidering how the models might be loaded and distributed.
  • Sharding frameworks – These models typically need to be sharded, usually by tensor parallelism or pipeline sharding, and there are more advanced techniques such as ZeRO sharding built on top of tensor sharding. For more information about sharding techniques, refer to Model Parallelism. To achieve this, we can use various combinations of frameworks from NVIDIA, DeepSpeed, and others. This requires the ability to test bring-your-own-container (BYOC) or first-party (1P) containers, iterate over solutions, and run benchmarking tests. You might also want to test various hosting options like asynchronous, serverless, and others.
  • Hardware selection – Your choice in hardware is determined by all the aforementioned points and further traffic patterns, use case needs, and model sizes.

In this post, we use DeepSpeed’s optimized kernels and tensor parallelism techniques to host BLOOM-176B and OPT-30B on SageMaker. We also compare results from Accelerate to demonstrate the performance benefits of optimized kernels and tensor parallelism. For more information on DeepSpeed and Accelerate, refer to DeepSpeed Inference: Enabling Efficient Inference of Transformer Models at Unprecedented Scale and Incredibly Fast BLOOM Inference with DeepSpeed and Accelerate.

We use DJLServing as the model serving solution in this example. DJLServing is a high-performance universal model serving solution powered by the Deep Java Library (DJL) that is programming language agnostic. To learn more about the DJL and DJLServing, refer to Deploy large models on Amazon SageMaker using DJLServing and DeepSpeed model parallel inference.

It’s worth noting that optimized kernels can result in precision changes and a modified computation graph, which could theoretically result in changed model behavior. Although this could occasionally change the inference outcome, we do not expect these differences to materially impact the basic evaluation metrics of a model. Nevertheless, practitioners are advised to confirm the model outputs are as expected when using these kernels.

The following steps demonstrate how to deploy a BLOOM-176B model in SageMaker using DJLServing and a SageMaker large model inference container. The complete example is also available in our GitHub repository.

Using the DJLServing SageMaker DLC image

Use the following DJLServing SageMaker DLC image URI, replacing <region> with the AWS Region in which you are running the notebook:

763104351884.dkr.ecr.<region>.amazonaws.com/djl-inference:0.19.0-deepspeed0.7.3-cu113
# example uri might be like 763104351884.dkr.ecr.us-east-1.amazonaws.com/djl-inference:0.19.0-deepspeed0.7.3-cu113
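In a notebook, the URI can be built from the Region with a small helper such as the following sketch. The account ID and image tag are copied from the URI above; the function name is our own and not part of the SageMaker SDK.

```python
def djl_inference_image_uri(region: str) -> str:
    """Build the large model inference DLC image URI for a given AWS Region."""
    account = "763104351884"             # account ID from the URI in this post
    tag = "0.19.0-deepspeed0.7.3-cu113"  # image tag from the URI in this post
    return f"{account}.dkr.ecr.{region}.amazonaws.com/djl-inference:{tag}"

inference_image_uri = djl_inference_image_uri("us-east-1")
print(inference_image_uri)
```

The resulting inference_image_uri is the value passed as the container image when creating the SageMaker model later in this post.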

Create our model file

First, we create a file called serving.properties that contains only one line of code. This tells the DJL model server to use the DeepSpeed engine. The file contains the following code:

engine=DeepSpeed

serving.properties is a file defined by DJLServing that is used to configure per-model configuration.

Next, we create our model.py file, which defines the code needed to load and then serve the model. In our code, we read in the TENSOR_PARALLEL_DEGREE environment variable (the default value is 1). This sets the number of devices over which the tensor parallel modules are distributed. Note that DeepSpeed provides a few built-in partition definitions, including one for BLOOM models. We use it by specifying replace_method and replace_with_kernel_inject. If you have a customized model and need DeepSpeed to partition it effectively, you need to change replace_with_kernel_inject to false and add injection_policy to make the runtime partition work. For more information, refer to Initializing for Inference. For our example, we used the pre-partitioned BLOOM model on DeepSpeed.

Second, in the model.py file, we also load the model from Amazon S3 after the endpoint has been spun up. The model is loaded into the /tmp space on the container because SageMaker maps /tmp to the Amazon Elastic Block Store (Amazon EBS) volume that is mounted when we specify the endpoint creation parameter VolumeSizeInGB. For instances like p4d, which come with a pre-built instance volume, we can continue to use /tmp on the container. See the following code:

from djl_python import Input, Output
import os
import deepspeed
import torch
import torch.distributed as dist
import sys
import subprocess
import time
from glob import glob
from transformers import pipeline, AutoConfig, AutoModelForCausalLM, AutoTokenizer
from transformers.models.opt.modeling_opt import OPTDecoderLayer

predictor = None

def check_config():
    local_rank = os.getenv('LOCAL_RANK')
    
    if not local_rank:
        return False
    return True
    
def get_model():

    if not check_config():
        raise Exception("DJL:DeepSpeed configurations are not default. This code does not support non default configurations") 
    
    tensor_parallel = int(os.getenv('TENSOR_PARALLEL_DEGREE', '1'))
    local_rank = int(os.getenv('LOCAL_RANK', '0'))
    model_dir = "/tmp/model"
    bucket = os.environ.get("MODEL_S3_BUCKET")
    key_prefix = os.environ.get("MODEL_S3_PREFIX")
    print(f"rank: {local_rank}")
    if local_rank == 0:
        if f"{model_dir}/DONE" not in glob(f"{model_dir}/*"):
            print("Starting Model downloading files")
            try:
                proc_run = subprocess.run(
                    ["aws", "s3", "cp", "--recursive", f"s3://{bucket}/{key_prefix}", model_dir],
                    stderr=subprocess.PIPE,  # capture stderr so it can be reported on failure
                    text=True,
                )
                proc_run.check_returncode()  # raises CalledProcessError if the copy failed
                print("Model downloading finished")
                # write the marker file only after a successful download. Could use dist.barrier() but this makes it easier to check if model is downloaded in case of retry
                with open(f"{model_dir}/DONE", "w") as f:
                    f.write("download_complete")

            except subprocess.CalledProcessError as e:
                print("Model download failed: Error:\nreturn code: ", e.returncode, "\nOutput: ", e.stderr)
                raise  # FAIL FAST
                               
    dist.barrier()
                
    
    tokenizer = AutoTokenizer.from_pretrained(model_dir)
    
    # has to be FP16 as Int8 model loading not yet supported
    with deepspeed.OnDevice(dtype=torch.float16, device="meta"):
        model = AutoModelForCausalLM.from_config(
            AutoConfig.from_pretrained(model_dir), torch_dtype=torch.bfloat16
        )
    model = model.eval()
    
    model = deepspeed.init_inference(
        model,
        mp_size=tensor_parallel,
        dtype=torch.int8,
        base_dir = model_dir,
        checkpoint=os.path.join(model_dir, "ds_inference_config.json"),
        replace_method='auto',
        replace_with_kernel_inject=True
    )

    model = model.module
    dist.barrier()
    return model, tokenizer
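The DONE marker file in get_model lets a restarted container skip a download that has already finished. The following is a minimal, standalone sketch of that pattern, with the marker written only after the download succeeds. The download_once and fake_download names are hypothetical, and the fake download stands in for the aws s3 cp call so that the sketch runs without AWS access.

```python
import os
import tempfile
from typing import Callable

def download_once(model_dir: str, download: Callable[[str], None]) -> bool:
    """Run `download` only if no DONE marker exists; return True if a download happened."""
    marker = os.path.join(model_dir, "DONE")
    if os.path.exists(marker):
        return False  # a previous attempt already finished successfully
    os.makedirs(model_dir, exist_ok=True)
    download(model_dir)  # expected to raise on failure, so no marker is written
    with open(marker, "w") as f:
        f.write("download_complete")
    return True

def fake_download(target_dir: str) -> None:
    """Stand-in for the `aws s3 cp --recursive` call in model.py."""
    with open(os.path.join(target_dir, "weights.bin"), "w") as f:
        f.write("placeholder")

with tempfile.TemporaryDirectory() as tmp:
    model_dir = os.path.join(tmp, "model")
    first = download_once(model_dir, fake_download)   # performs the download
    second = download_once(model_dir, fake_download)  # skipped because DONE exists
    print(first, second)
```

In model.py, only the process with local rank 0 runs this step, and dist.barrier() keeps the other ranks waiting until the files are in place.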

DJLServing manages the runtime installation of any pip packages defined in requirements.txt. This file contains:

awscli
boto3

We have created a directory called code and the model.py, serving.properties, and requirements.txt files are already created in this directory. To view the files, you can run the following code from the terminal:

mkdir -p code
cat code/model.py 
cat code/serving.properties 
cat code/requirements.txt 

The following figure shows the structure of the model.tar.gz.

Lastly, we create the model file and upload it to Amazon S3:

tar cvfz model.tar.gz code
s3_code_artifact = sess.upload_data("model.tar.gz", bucket, s3_code_prefix)
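If you prefer to stay in Python rather than call tar, the same archive layout can be produced with the standard library. The following sketch writes serving.properties and requirements.txt with the contents shown earlier and packages the code directory. It assumes model.py has already been placed in code/ separately, and the helper name is our own.

```python
import os
import tarfile
import tempfile

def build_model_archive(workdir: str) -> str:
    """Create code/serving.properties and code/requirements.txt, then package code/ as model.tar.gz."""
    code_dir = os.path.join(workdir, "code")
    os.makedirs(code_dir, exist_ok=True)
    with open(os.path.join(code_dir, "serving.properties"), "w") as f:
        f.write("engine=DeepSpeed\n")
    with open(os.path.join(code_dir, "requirements.txt"), "w") as f:
        f.write("awscli\nboto3\n")
    # model.py is assumed to have been written to code/ already; it is not generated here.
    archive_path = os.path.join(workdir, "model.tar.gz")
    with tarfile.open(archive_path, "w:gz") as tar:
        tar.add(code_dir, arcname="code")  # same layout as `tar cvfz model.tar.gz code`
    return archive_path

with tempfile.TemporaryDirectory() as workdir:
    archive = build_model_archive(workdir)
    with tarfile.open(archive) as tar:
        members = sorted(tar.getnames())
    print(members)
```

The archive contains only the serving code and configuration. The model weights themselves stay in Amazon S3 and are downloaded by model.py when the endpoint starts.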

Download and store the model from Hugging Face (Optional)

We have provided the steps in this section in case you want to download the model to Amazon S3 and use it from there. The steps are provided in the Jupyter file on GitHub. The following screenshot shows a snapshot of the steps.

Create a SageMaker model

We now create a SageMaker model. We use the DJLServing DLC image from Amazon Elastic Container Registry (Amazon ECR) shown earlier and the model artifact from the previous step to create the SageMaker model. In the model setup, we configure TENSOR_PARALLEL_DEGREE=8, which means the model is partitioned across 8 GPUs. See the following code:

PrimaryContainer={
        "Image": inference_image_uri,
        "ModelDataUrl": s3_code_artifact,
        "Environment": {
            "MODEL_S3_BUCKET": bucket,
            "MODEL_S3_PREFIX": s3_model_prefix,
            "TENSOR_PARALLEL_DEGREE": "8",
        },
},

After you run the preceding cell in the Jupyter file, you see output similar to the following:

{
    "ModelArn": "arn:aws:sagemaker:us-east-1:<account_id>:model/bloom-djl-ds-<date_time>"
}
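The PrimaryContainer fragment can be assembled by a small helper, which makes it easier to reuse the same setup for OPT-30B with a different tensor parallel degree. The following is a sketch under assumptions: the helper name, bucket, and prefix values are placeholders, and the commented create_model call assumes you already have a boto3 SageMaker client (sm_client), a model name, and an execution role ARN (role).

```python
from typing import Any, Dict

def build_primary_container(image_uri: str, code_artifact: str, bucket: str,
                            model_prefix: str, tensor_parallel_degree: int) -> Dict[str, Any]:
    """Assemble the PrimaryContainer argument; environment values must be strings."""
    return {
        "Image": image_uri,
        "ModelDataUrl": code_artifact,
        "Environment": {
            "MODEL_S3_BUCKET": bucket,
            "MODEL_S3_PREFIX": model_prefix,
            "TENSOR_PARALLEL_DEGREE": str(tensor_parallel_degree),
        },
    }

# Placeholder values for illustration only.
container = build_primary_container(
    image_uri="763104351884.dkr.ecr.us-east-1.amazonaws.com/djl-inference:0.19.0-deepspeed0.7.3-cu113",
    code_artifact="s3://example-bucket/code/model.tar.gz",
    bucket="example-bucket",
    model_prefix="models/bloom-176b",
    tensor_parallel_degree=8,
)
# The dictionary would then be passed to the SageMaker client, for example:
# sm_client.create_model(ModelName=model_name, ExecutionRoleArn=role, PrimaryContainer=container)
print(container["Environment"]["TENSOR_PARALLEL_DEGREE"])
```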

Create a SageMaker endpoint

You can use any instances with multiple GPUs for testing. In this demo, we use a p4d.24xlarge instance. In the following code, note how we set the ModelDataDownloadTimeoutInSeconds, ContainerStartupHealthCheckTimeoutInSeconds, and VolumeSizeInGB parameters to accommodate the large model size. The VolumeSizeInGB parameter is applicable to GPU instances supporting the EBS volume attachment.

endpoint_config_response = sm_client.create_endpoint_config(
    EndpointConfigName=endpoint_config_name,
    ProductionVariants=[
        {
            "VariantName": "variant1",
            "ModelName": model_name,
            "InstanceType": "ml.p4d.24xlarge",
            "InitialInstanceCount": 1,
            #"VolumeSizeInGB" : 200,
            "ModelDataDownloadTimeoutInSeconds": 2400,
            "ContainerStartupHealthCheckTimeoutInSeconds": 2400,
        },
    ],
)

Lastly, we create a SageMaker endpoint:

create_endpoint_response = sm_client.create_endpoint(
    EndpointName=f"{endpoint_name}", EndpointConfigName=endpoint_config_name
)

You see output similar to the following:

{
    "EndpointArn": "arn:aws:sagemaker:us-east-1:<aws-account-id>:endpoint/bloom-djl-ds-<date_time>"
}

Starting the endpoint might take a while. You can try a few more times if you run into the InsufficientInstanceCapacity error, or you can raise a request to AWS to increase the limit in your account.
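Because startup can take a long time, it helps to poll the endpoint status instead of checking by hand. The following sketch polls describe_endpoint until the status is no longer Creating. The wait_for_endpoint helper and the fake client are our own illustrations; the fake client stands in for a boto3 SageMaker client so that the sketch runs without AWS access.

```python
import time
from typing import Callable

def wait_for_endpoint(sm_client, endpoint_name: str, poll_seconds: float = 60.0,
                      max_polls: int = 60, sleep: Callable[[float], None] = time.sleep) -> str:
    """Poll describe_endpoint until the status leaves 'Creating' or polls run out."""
    status = "Creating"
    for _ in range(max_polls):
        status = sm_client.describe_endpoint(EndpointName=endpoint_name)["EndpointStatus"]
        if status != "Creating":
            break
        sleep(poll_seconds)
    return status

class FakeSageMakerClient:
    """Stand-in for a boto3 SageMaker client so the sketch runs without AWS access."""
    def __init__(self, statuses):
        self._statuses = iter(statuses)

    def describe_endpoint(self, EndpointName):
        return {"EndpointName": EndpointName, "EndpointStatus": next(self._statuses)}

fake = FakeSageMakerClient(["Creating", "Creating", "InService"])
final_status = wait_for_endpoint(fake, "bloom-djl-ds-example", sleep=lambda s: None)
print(final_status)
```

With a real client, you would pass sm_client and your endpoint name, and treat any final status other than InService as a failure to investigate.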

Performance tuning

If you intend to use this post and accompanying notebook with a different model, you may want to explore some of the tunable parameters that SageMaker, DeepSpeed, and the DJL offer. Iteratively experimenting with these parameters can have a material impact on the latency, throughput, and cost of your hosted large model. To learn more about tuning parameters such as number of workers, degree of tensor parallelism, job queue size, and others, refer to DJL Serving configurations and Deploy large models on Amazon SageMaker using DJLServing and DeepSpeed model parallel inference.

Results

In this post, we used DeepSpeed to host BLOOM-176B and OPT-30B on SageMaker ML instances. The following table summarizes our performance results, including a comparison with Hugging Face’s Accelerate. Latency reflects the number of milliseconds it takes to produce a 256-token string four times (batch_size=4) from the model. Throughput reflects the number of tokens produced per second for each test. For Hugging Face Accelerate, we used the library’s default loading with GPU memory mapping. For DeepSpeed, we used its faster checkpoint loading mechanism.

Latency columns are for a 4 x 256 token output.

Model       Library     Model Precision  Batch Size  Parallel Degree  Instance      Time to Load (s)  Latency P50 (ms)  Latency P90 (ms)  Latency P99 (ms)  Throughput (tokens/sec)
BLOOM-176B  DeepSpeed   INT8             4           8                p4d.24xlarge  74.9              27,564            27,580            32,179            37.1
BLOOM-176B  Accelerate  INT8             4           8                p4d.24xlarge  669.4             92,694            92,735            103,292           11.0
OPT-30B     DeepSpeed   FP16             4           4                g5.24xlarge   239.4             11,299            11,302            11,576            90.6
OPT-30B     Accelerate  FP16             4           4                g5.24xlarge   533.8             63,734            63,737            67,605            16.1

From a latency perspective, DeepSpeed is about 3.4 times faster for BLOOM-176B and 5.6 times faster for OPT-30B than Accelerate. DeepSpeed’s optimized kernels are responsible for much of this difference in latency. Given these results, we recommend using DeepSpeed over Accelerate if your model of choice is supported.
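The throughput and speedup figures can be reproduced from the latency numbers. The following sketch assumes that throughput is derived from the P50 latency (the post does not state which percentile was used, but the P50 values match the reported throughput) and that each request produces 4 x 256 tokens.

```python
BATCH_SIZE = 4
TOKENS_PER_SEQUENCE = 256

# P50 latency in milliseconds, copied from the results table.
p50_latency_ms = {
    ("BLOOM-176B", "DeepSpeed"): 27_564,
    ("BLOOM-176B", "Accelerate"): 92_694,
    ("OPT-30B", "DeepSpeed"): 11_299,
    ("OPT-30B", "Accelerate"): 63_734,
}

def throughput_tokens_per_sec(latency_ms: float) -> float:
    """Tokens generated per second for one batch of 4 x 256 tokens."""
    return BATCH_SIZE * TOKENS_PER_SEQUENCE / (latency_ms / 1000)

def speedup(model: str) -> float:
    """How many times lower DeepSpeed's P50 latency is than Accelerate's."""
    return p50_latency_ms[(model, "Accelerate")] / p50_latency_ms[(model, "DeepSpeed")]

for (model, library), latency in p50_latency_ms.items():
    print(model, library, round(throughput_tokens_per_sec(latency), 1))
for model in ("BLOOM-176B", "OPT-30B"):
    print(model, "speedup:", round(speedup(model), 1))
```

For BLOOM-176B with DeepSpeed, 1,024 tokens in 27.564 seconds is about 37.1 tokens per second, or roughly 0.1 seconds per token for each of the four sequences in the batch.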

It’s also worth noting that model loading times with DeepSpeed were much shorter, making it a better option if you anticipate needing to quickly scale up your number of endpoints. Accelerate’s more flexible pipeline parallelism technique may be a better option if you have models or model precisions that aren’t supported by DeepSpeed.

These results also demonstrate the difference in latency and throughput between model sizes. In our tests, OPT-30B generates 2.4 times as many tokens per unit time as BLOOM-176B, on an instance type that is more than three times cheaper. On a price per unit throughput basis, OPT-30B on a g5.24xlarge instance is 8.9 times better than BLOOM-176B on a p4d.24xlarge instance. If you have strict latency, throughput, or cost limitations, consider using the smallest model that still meets your functional requirements.

Clean up

As a best practice, delete idle resources when you no longer need them. The following code shows how to delete the endpoint, endpoint configuration, and model:

# - Delete the end point
sm_client.delete_endpoint(EndpointName=endpoint_name)

# - In case the end point failed we still want to delete the model
sm_client.delete_endpoint_config(EndpointConfigName=endpoint_config_name)
sm_client.delete_model(ModelName=model_name)

Optionally, delete the model checkpoint from your S3 bucket:

!aws s3 rm --recursive s3://<your_bucket>/{s3_model_prefix}

Conclusion

In this post, we demonstrated how to use SageMaker large model inference containers to host two large language models, BLOOM-176B and OPT-30B. We used DeepSpeed’s model parallel techniques with multiple GPUs on a single SageMaker ML instance.

For more details about Amazon SageMaker and its large model inference capabilities, refer to Amazon SageMaker now supports deploying large models through configurable volume size and timeout quotas and Real-time inference.


About the authors

Simon Zamarin is an AI/ML Solutions Architect whose main focus is helping customers extract value from their data assets. In his spare time, Simon enjoys spending time with family, reading sci-fi, and working on various DIY house projects.

Rupinder Grewal is a Sr. AI/ML Specialist Solutions Architect with AWS. He currently focuses on serving models and MLOps on SageMaker. Prior to this role, he worked as a Machine Learning Engineer building and hosting models. Outside of work, he enjoys playing tennis and biking on mountain trails.

Frank Liu is a Software Engineer for AWS Deep Learning. He focuses on building innovative deep learning tools for software engineers and scientists. In his spare time, he enjoys hiking with friends and family.

Alan Tan is a Senior Product Manager with SageMaker leading efforts on large model inference. He’s passionate about applying Machine Learning to the area of Analytics. Outside of work, he enjoys the outdoors.

Dhawal Patel is a Principal Machine Learning Architect at AWS. He has worked with organizations ranging from large enterprises to mid-sized startups on problems related to distributed computing and artificial intelligence. He focuses on deep learning, including the NLP and computer vision domains. He helps customers achieve high-performance model inference on SageMaker.

Qing Lan is a Software Development Engineer at AWS. He has worked on several challenging products at Amazon, including high-performance ML inference solutions and a high-performance logging system. Qing’s team successfully launched the first billion-parameter model in Amazon Advertising with very low latency requirements. Qing has in-depth knowledge of infrastructure optimization and deep learning acceleration.

Qingwei Li is a Machine Learning Specialist at Amazon Web Services. He received his Ph.D. in Operations Research after he broke his advisor’s research grant account and failed to deliver the Nobel Prize he promised. Currently he helps customers in the financial service and insurance industry build machine learning solutions on AWS. In his spare time, he likes reading and teaching.

Robert Van Dusen is a Senior Product Manager with Amazon SageMaker. He leads deep learning model optimization for applications such as large model inference.

Siddharth Venkatesan is a Software Engineer in AWS Deep Learning. He currently focuses on building solutions for large model inference. Prior to AWS, he worked in the Amazon Grocery org building new payment features for customers worldwide. Outside of work, he enjoys skiing, the outdoors, and watching sports.


Use GitHub Samples with Amazon SageMaker Data Wrangler


Amazon SageMaker Data Wrangler is a UI-based data preparation tool that helps perform data analysis, preprocessing, and visualization with features to clean, transform, and prepare data faster. Data Wrangler pre-built flow templates help make data preparation quicker for data scientists and machine learning (ML) practitioners by helping you accelerate and understand best practice patterns for data flows using common datasets.

You can use Data Wrangler flows to perform the following tasks:

  • Data visualization – Examining statistical properties for each column in the dataset, building histograms, studying outliers
  • Data cleaning – Removing duplicates, dropping or filling entries with missing values, removing outliers
  • Data enrichment and feature engineering – Processing columns to build more expressive features, selecting a subset of features for training

This post will help you understand Data Wrangler using the following sample pre-built flows on GitHub. The repository showcases tabular data transformations, time series data transformations, and joined dataset transformations. Each requires different types of transformations because of the nature of the data. Standard tabular or cross-sectional data is collected at a specific point in time. In contrast, time series data is captured repeatedly over time, with each successive data point dependent on its past values.

Let’s look at an example of how we can use the sample data flow for tabular data.

Prerequisites

Data Wrangler is an Amazon SageMaker feature available within Amazon SageMaker Studio, so we need to follow the Studio onboarding process to spin up the Studio environment and notebooks. Although you can choose from a few authentication methods, the simplest way to create a Studio domain is to follow the Quick start instructions. The Quick start uses the same default settings as the standard Studio setup. You can also choose to onboard using AWS IAM Identity Center (successor to AWS Single Sign-On) for authentication (see Onboard to Amazon SageMaker Domain Using IAM Identity Center).

Import the dataset and flow files into Data Wrangler using Studio

The following steps outline how to import data into SageMaker to be consumed by Data Wrangler:

Initialize Data Wrangler via the Studio UI by choosing New data flow.


Clone the GitHub repo to download the flow files into your Studio environment.

When the clone is complete, you should be able to see the repository content in the left pane.

Choose the file Hotel-Bookings-Classification.flow to import the flow file into Data Wrangler.

If you use the time series or joined data flow, the flow will appear under a different name. After the flow has been imported, you should see the following screenshot. It shows errors because we need to make sure that the flow file points to the correct data source in Amazon Simple Storage Service (Amazon S3).

Choose Edit dataset to bring up all your S3 buckets. Next, choose the dataset hotel_bookings.csv from your S3 bucket for running through the tabular data flow.

Note that if you’re using the joined data flow, you may have to import multiple datasets into Data Wrangler.

In the right pane, make sure COMMA is chosen as the delimiter and Sampling is set to First K. Our dataset is small enough to run Data Wrangler transformations on the full dataset, but we wanted to highlight how you can import the dataset. If you have a large dataset, consider using sampling. Choose Import to import this dataset to Data Wrangler.

After the dataset is imported, Data Wrangler automatically validates the dataset and detects the data types. You can see that the errors have gone away because we’re pointing to the correct dataset. The flow editor now shows two blocks showcasing that the data was imported from a source and data types recognized. You can also edit the data types if needed.

The following screenshot shows our data types.

Let’s look at some of the transforms done as a part of this tabular flow. If you’re using the time series or joined data flows, check out some common transforms on the GitHub repo. We performed some basic exploratory data analysis using data insights reports that studied the target leakage and feature collinearity in the dataset, table summary analyses, and quick modeling capability. Explore the steps on the GitHub repo.

Now we drop columns based on the recommendations provided by the Data Insights and Quality Report.

  • For target leakage, drop reservation_status.
  • For redundant columns, drop days_in_waiting_list, hotel, reserved_room_type, arrival_date_month, reservation_status_date, babies, and arrival_date_day_of_month.
  • Based on linear correlation results, drop columns arrival_date_week_number and arrival_date_year because the correlation values for these feature (column) pairs are greater than the recommended threshold of 0.90.
  • Based on non-linear correlation results, drop reservation_status. This column was already marked to be dropped based on the target leakage analysis.
  • Process numeric values (min-max scaling) for lead_time, stays_in_weekend_nights, stays_in_weekday_nights, is_repeated_guest, prev_cancellations, prev_bookings_not_canceled, booking_changes, adr, total_of_special_requests, and required_car_parking_spaces.
  • One-hot encode categorical variables like meal, is_repeated_guest, market_segment, assigned_room_type, deposit_type, and customer_type.
  • Balance the target variable using random oversampling to address class imbalance.
  • Use the quick modeling capability to handle outliers and missing values.
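Data Wrangler applies these transforms through its UI, but it can help to see what two of them compute. The following plain Python sketch illustrates min-max scaling and one-hot encoding. It is not Data Wrangler’s implementation; the function names and sample values are made up, and only the column names come from the flow described above.

```python
from typing import Dict, List

def min_max_scale(values: List[float]) -> List[float]:
    """Rescale values linearly to the range [0, 1]; constant columns map to 0.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def one_hot_encode(values: List[str]) -> Dict[str, List[int]]:
    """Return one 0/1 indicator column per distinct category, in sorted order."""
    return {cat: [1 if v == cat else 0 for v in values] for cat in sorted(set(values))}

# Made-up sample rows; only the column names come from the hotel bookings flow.
lead_time = [0.0, 50.0, 100.0, 200.0]
meal = ["BB", "HB", "BB", "SC"]

scaled = min_max_scale(lead_time)
encoded = one_hot_encode(meal)
print(scaled)
print(encoded)
```

Scaling puts numeric columns with very different ranges on a common footing, and one-hot encoding turns each category into its own indicator column so that models don’t read an ordering into category labels.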

Export to Amazon S3

Now we have gone through the different transforms and are ready to export the data to Amazon S3. This option creates a SageMaker processing job, which runs the Data Wrangler processing flow and saves the resulting dataset to a specified S3 bucket. Follow the next steps to set up the export to Amazon S3:

Choose the plus sign next to a collection of transformation elements and choose Add destination, then Amazon S3.

  • For Dataset name, enter a name for the new dataset, for example NYC_export.
  • For File type, choose CSV.
  • For Delimiter, choose Comma.
  • For Compression, choose None.
  • For Amazon S3 location, use the same bucket name that we created earlier.
  • Choose Add destination.

Choose Create job.

For Job name, enter a name or keep the autogenerated option and choose destination. We have only one destination, S3:testingtabulardata, but you might have multiple destinations from different steps in your workflow. Leave the KMS key ARN field empty and choose Next.

Now you have to configure the compute capacity for a job. You can keep all default values for this example.

  • For Instance type, use ml.m5.4xlarge.
  • For Instance count, use 2.
  • You can explore Additional configuration, but keep the default settings.
  • Choose Run.

Now your job has started, and it takes some time to process 6 GB of data according to our Data Wrangler processing flow. The job costs around $2 USD per hour of runtime, because ml.m5.4xlarge costs $0.922 USD per hour and we’re using two of them.
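The cost of a processing job scales with the instance count, the hourly rate, and the runtime. The following sketch uses the instance count and hourly rate quoted above; the runtimes are illustrative, and the function name is our own.

```python
def processing_job_cost_usd(instance_count: int, hourly_rate_usd: float, runtime_minutes: float) -> float:
    """Estimated cost = instances x hourly rate x runtime in hours."""
    return instance_count * hourly_rate_usd * runtime_minutes / 60

# Rate and instance count from the post; runtimes are illustrative.
for minutes in (10, 60):
    print(minutes, round(processing_job_cost_usd(2, 0.922, minutes), 2))
```

Under these assumptions, a 10-minute job on two ml.m5.4xlarge instances costs about $0.31 USD, and a full hour costs about $1.84 USD.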

If you choose the job name, you’re redirected to a new window with the job details.

On the job details page, you can see all the parameters from the previous steps.

When the job status changes to Completed, you can also check the Processing time (seconds) value. This processing job takes around 5–10 minutes to complete.

When the job is complete, the train and test output files are available in the corresponding S3 output folders. You can find the output location from the processing job configurations.

After the Data Wrangler processing job is complete, we can check the results saved in our S3 bucket. Don’t forget to update the job_name variable with your job name.

You can now use this exported data for running ML models.

Clean up

Delete your S3 buckets and your Data Wrangler flow in order to delete the underlying resources and prevent unwanted costs after you finish the experiment.

Conclusion

In this post, we showed how you can import the tabular pre-built data flow into Data Wrangler, plug it against our dataset, and export the results to Amazon S3. If your use cases require you to manipulate time series data or join multiple datasets, you can go through the other pre-built sample flows in the GitHub repo.

After you have imported a pre-built data prep workflow, you can integrate it with Amazon SageMaker Processing, Amazon SageMaker Pipelines, and Amazon SageMaker Feature Store to simplify the task of processing, sharing, and storing ML training data. You can also export this sample data flow to a Python script and create a custom ML data prep pipeline, thereby accelerating your release velocity.

We encourage you to check out our GitHub repository to get hands-on practice and find new ways to improve model accuracy! To learn more about SageMaker, visit the Amazon SageMaker Developer Guide.


About the Authors

Isha Dua is a Senior Solutions Architect based in the San Francisco Bay Area. She helps AWS Enterprise customers grow by understanding their goals and challenges, and guides them on how they can architect their applications in a cloud-native manner while making sure they are resilient and scalable. She’s passionate about machine learning technologies and environmental sustainability.


Transfer learning for TensorFlow object detection models in Amazon SageMaker


Amazon SageMaker provides a suite of built-in algorithms, pre-trained models, and pre-built solution templates to help data scientists and machine learning (ML) practitioners get started on training and deploying ML models quickly. You can use these algorithms and models for both supervised and unsupervised learning. They can process various types of input data, including tabular, image, and text.

This post is the second in a series on the new built-in algorithms in SageMaker. In the first post, we showed how SageMaker provides a built-in algorithm for image classification. Today, we announce that SageMaker provides a new built-in algorithm for object detection using TensorFlow. This supervised learning algorithm supports transfer learning for many pre-trained models available in TensorFlow. It takes an image as input and outputs the objects present in the image along with the bounding boxes. You can fine-tune these pre-trained models using transfer learning even when a large number of training images aren’t available. It’s available through the SageMaker built-in algorithms as well as through the SageMaker JumpStart UI in Amazon SageMaker Studio. For more information, refer to Object Detection Tensorflow and the example notebook Introduction to SageMaker Tensorflow – Object Detection.

Object detection with TensorFlow in SageMaker provides transfer learning on many pre-trained models available in TensorFlow Hub. According to the number of class labels in the training data, a new randomly initialized object detection head replaces the existing head of the TensorFlow model. Either the whole network, including the pre-trained model, or only the top layer (object detection head) can be fine-tuned on the new training data. In this transfer learning mode, you can achieve training even with a smaller dataset.

How to use the new TensorFlow object detection algorithm

This section describes how to use the TensorFlow object detection algorithm with the SageMaker Python SDK. For information on how to use it from the Studio UI, see SageMaker JumpStart.

The algorithm supports transfer learning for the pre-trained models listed in TensorFlow models. Each model is identified by a unique model_id. The following code shows how to fine-tune a ResNet50 V1 FPN model identified by model_id tensorflow-od1-ssd-resnet50-v1-fpn-640x640-coco17-tpu-8 on a custom training dataset. For each model_id, in order to launch a SageMaker training job through the Estimator class of the SageMaker Python SDK, you need to fetch the Docker image URI, training script URI, and pre-trained model URI through the utility functions provided in SageMaker. The training script URI contains all the necessary code for data processing, loading the pre-trained model, model training, and saving the trained model for inference. The pre-trained model URI contains the pre-trained model architecture definition and the model parameters. Note that the Docker image URI and the training script URI are the same for all the TensorFlow object detection models. The pre-trained model URI is specific to the particular model. The pre-trained model tarballs have been pre-downloaded from TensorFlow and saved with the appropriate model signature in Amazon Simple Storage Service (Amazon S3) buckets, such that the training job runs in network isolation. See the following code:

import sagemaker
from sagemaker import image_uris, model_uris, script_uris
from sagemaker.estimator import Estimator

model_id, model_version = "tensorflow-od1-ssd-resnet50-v1-fpn-640x640-coco17-tpu-8", "*"
training_instance_type = "ml.p3.2xlarge"

# Retrieve the docker image
train_image_uri = image_uris.retrieve(model_id=model_id, model_version=model_version, image_scope="training", instance_type=training_instance_type, region=None, framework=None)
# Retrieve the training script
train_source_uri = script_uris.retrieve(model_id=model_id, model_version=model_version, script_scope="training")
# Retrieve the pre-trained model tarball for transfer learning
train_model_uri = model_uris.retrieve(model_id=model_id, model_version=model_version, model_scope="training")

# Session, execution role, and Region used by the snippets that follow
sess = sagemaker.Session()
aws_role = sagemaker.get_execution_role()
aws_region = sess.boto_region_name

output_bucket = sess.default_bucket()
output_prefix = "jumpstart-example-tensorflow-od-training"
s3_output_location = f"s3://{output_bucket}/{output_prefix}/output"

With these model-specific training artifacts, you can construct an object of the Estimator class:

# Create SageMaker Estimator instance
tf_od_estimator = Estimator(
    role=aws_role,
    image_uri=train_image_uri,
    source_dir=train_source_uri,
    model_uri=train_model_uri,
    entry_point="transfer_learning.py",
    instance_count=1,
    instance_type=training_instance_type,
    max_run=360000,
    hyperparameters=hyperparameters,  # dictionary retrieved in the hyperparameter step that follows
    output_path=s3_output_location,
)

Next, for transfer learning on your custom dataset, you might need to change the default values of the training hyperparameters, which are listed in Hyperparameters. You can fetch a Python dictionary of these hyperparameters with their default values by calling hyperparameters.retrieve_default, update them as needed, and then pass them to the Estimator class. Note that the default values of some of the hyperparameters are different for different models. For large models, the default batch size is smaller and the train_only_top_layer hyperparameter is set to True. The hyperparameter train_only_top_layer defines which model parameters change during the fine-tuning process. If train_only_top_layer is True, parameters of the classification layers change and the rest of the parameters remain constant during the fine-tuning process. On the other hand, if train_only_top_layer is False, all parameters of the model are fine-tuned. See the following code:

from sagemaker import hyperparameters

# Retrieve the default hyperparameters for fine-tuning the model
hyperparameters = hyperparameters.retrieve_default(model_id=model_id, model_version=model_version)
# [Optional] Override default hyperparameters with custom values
hyperparameters["epochs"] = "5"
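The override step can also be wrapped in a small helper that rejects misspelled keys. The following is a minimal sketch, not part of the SageMaker SDK: the function name override_hyperparameters and the default values in example_defaults are hypothetical, and it assumes that hyperparameter values are passed as strings, as in the preceding snippet.

```python
def override_hyperparameters(defaults, overrides):
    """Return a copy of `defaults` with `overrides` applied as strings.

    Raises KeyError for keys that are not present in the defaults, which
    catches typos such as "epoch" instead of "epochs".
    """
    unknown = set(overrides) - set(defaults)
    if unknown:
        raise KeyError(f"Unknown hyperparameters: {sorted(unknown)}")
    merged = dict(defaults)
    merged.update({name: str(value) for name, value in overrides.items()})
    return merged


# Hypothetical defaults for illustration only; real defaults come from
# hyperparameters.retrieve_default and differ from model to model.
example_defaults = {"epochs": "3", "batch_size": "4", "train_only_top_layer": "True"}

tuned = override_hyperparameters(
    example_defaults, {"epochs": 5, "train_only_top_layer": False}
)
print(tuned)
```

The helper leaves the original dictionary untouched, so the defaults remain available for comparison.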

We provide the PennFudanPed dataset as a default dataset for fine-tuning the models. The dataset comprises images of pedestrians. The following code provides the default training dataset hosted in S3 buckets:

# Sample training data is available in this bucket
training_data_bucket = f"jumpstart-cache-prod-{aws_region}"
training_data_prefix = "training-datasets/PennFudanPed_COCO_format/"

training_dataset_s3_path = f"s3://{training_data_bucket}/{training_data_prefix}"

Finally, to launch the SageMaker training job for fine-tuning the model, call .fit on the object of the Estimator class, while passing the S3 location of the training dataset:

# Launch a SageMaker Training job by passing s3 path of the training data
tf_od_estimator.fit({"training": training_dataset_s3_path}, logs=True)

For more information about how to use the new SageMaker TensorFlow object detection algorithm for transfer learning on a custom dataset, deploy the fine-tuned model, run inference on the deployed model, and deploy the pre-trained model as is without first fine-tuning on a custom dataset, see the following example notebook: Introduction to SageMaker TensorFlow – Object Detection.

Input/output interface for the TensorFlow object detection algorithm

You can fine-tune each of the pre-trained models listed in TensorFlow Models to any given dataset comprising images belonging to any number of classes. The objective is to minimize prediction error on the input data. The model returned by fine-tuning can be further deployed for inference. The following are the instructions for how the training data should be formatted for input to the model:

  • Input – A directory with sub-directory images and a file annotations.json.
  • Output – There are two outputs. First is a fine-tuned model, which can be deployed for inference or further trained using incremental training. Second is a file which maps class indexes to class labels; this is saved along with the model.

The input directory should look like the following example:

input_directory
      | -- images
            |--abc.png
            |--def.png
      |--annotations.json

The annotations.json file should have information for bounding_boxes and their class labels. It should have a dictionary with the keys "images" and "annotations". The value for the "images" key should be a list of entries, one for each image of the form {"file_name": image_name, "height": height, "width": width, "id": image_id}. The value of the "annotations" key should be a list of entries, one for each bounding box of the form {"image_id": image_id, "bbox": [xmin, ymin, xmax, ymax], "category_id": bbox_label}.
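As an illustration of this layout, the following sketch builds the dictionary for two images and serializes it. The helper name build_annotations, the image sizes, and the box coordinates are made up for the example; only the key names come from the description above.

```python
import json


def build_annotations(images, boxes):
    """Build the annotations dictionary described above.

    images: list of (file_name, height, width) tuples.
    boxes:  list of (file_name, [xmin, ymin, xmax, ymax], label) tuples.
    """
    image_ids = {}
    image_entries = []
    for image_id, (file_name, height, width) in enumerate(images):
        image_ids[file_name] = image_id
        image_entries.append(
            {"file_name": file_name, "height": height, "width": width, "id": image_id}
        )

    annotation_entries = [
        {"image_id": image_ids[file_name], "bbox": list(bbox), "category_id": label}
        for file_name, bbox, label in boxes
    ]
    return {"images": image_entries, "annotations": annotation_entries}


# Made-up image sizes and boxes, for illustration only.
annotations = build_annotations(
    images=[("abc.png", 480, 640), ("def.png", 480, 640)],
    boxes=[("abc.png", [10, 20, 110, 220], 0), ("def.png", [5, 5, 50, 60], 0)],
)
annotations_json = json.dumps(annotations, indent=2)
print(annotations_json)
```

Writing annotations_json to a file named annotations.json next to the images directory gives the input layout shown earlier.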

Inference with the TensorFlow object detection algorithm

The generated models can be hosted for inference and support encoded .jpg, .jpeg, and .png image formats as the application/x-image content type. The input image is resized automatically. The output contains the boxes, predicted classes, and scores for each prediction. The TensorFlow object detection model processes a single image per request and outputs only one line in the JSON. The following is an example of a response in JSON:

accept: application/json;verbose

{"normalized_boxes":[[xmin1, xmax1, ymin1, ymax1],....], "classes":[class_idx1, class_idx2,...], "scores":[score_1, score_2,...], "labels": [label1, label2, ...], "tensorflow_model_output":<original output of the model>}

If accept is set to application/json, then the model only outputs predicted boxes, classes, and scores. For more details on training and inference, see the sample notebook Introduction to SageMaker TensorFlow – Object Detection.
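A response of this shape can be post-processed on the client side. The sketch below is an assumption-laden example rather than part of the algorithm: parse_detections and the sample payload are hypothetical, and the code assumes the normalized box order [xmin, xmax, ymin, ymax] shown in the example response above.

```python
import json


def parse_detections(response_body, image_width, image_height, min_score=0.5):
    """Keep detections above `min_score` and convert boxes to pixel units."""
    result = json.loads(response_body)
    detections = []
    for box, class_idx, score in zip(
        result["normalized_boxes"], result["classes"], result["scores"]
    ):
        if score < min_score:
            continue
        # Assumed order, following the example response above.
        xmin, xmax, ymin, ymax = box
        detections.append(
            {
                "class": class_idx,
                "score": score,
                "xmin": round(xmin * image_width),
                "xmax": round(xmax * image_width),
                "ymin": round(ymin * image_height),
                "ymax": round(ymax * image_height),
            }
        )
    return detections


# Hand-written payload for illustration; not real model output.
sample_body = json.dumps(
    {
        "normalized_boxes": [[0.1, 0.5, 0.2, 0.8], [0.0, 0.1, 0.0, 0.1]],
        "classes": [0, 0],
        "scores": [0.9, 0.3],
    }
)
detections = parse_detections(sample_body, image_width=640, image_height=480)
print(detections)
```

Only the first box passes the score threshold in this example; lowering min_score keeps both.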

Use SageMaker built-in algorithms through the JumpStart UI

You can also use SageMaker TensorFlow object detection and any of the other built-in algorithms with a few clicks via the JumpStart UI. JumpStart is a SageMaker feature that allows you to train and deploy built-in algorithms and pre-trained models from various ML frameworks and model hubs through a graphical interface. It also allows you to deploy fully fledged ML solutions that string together ML models and various other AWS services to solve a targeted use case.

Following are two videos that show how you can replicate the same fine-tuning and deployment process we just went through with a few clicks via the JumpStart UI.

Fine-tune the pre-trained model

Here is the process to fine-tune the same pre-trained object detection model.

Deploy the fine-tuned model

After model training is finished, you can directly deploy the model to a persistent, real-time endpoint with one click.

Conclusion

In this post, we announced the launch of the SageMaker TensorFlow object detection built-in algorithm. We provided example code on how to do transfer learning on a custom dataset using a pre-trained model from TensorFlow using this algorithm.

For more information, check out the documentation and the example notebook.


About the authors

Dr. Vivek Madan is an Applied Scientist with the Amazon SageMaker JumpStart team. He got his PhD from University of Illinois at Urbana-Champaign and was a Post Doctoral Researcher at Georgia Tech. He is an active researcher in machine learning and algorithm design and has published papers in EMNLP, ICLR, COLT, FOCS, and SODA conferences.

João Moura is an AI/ML Specialist Solutions Architect at Amazon Web Services. He is mostly focused on NLP use cases and helping customers optimize deep learning model training and deployment. He is also an active proponent of low-code ML solutions and ML-specialized hardware.

Dr. Ashish Khetan is a Senior Applied Scientist with Amazon SageMaker built-in algorithms and helps develop machine learning algorithms. He got his PhD from University of Illinois Urbana-Champaign. He is an active researcher in machine learning and statistical inference and has published many papers in NeurIPS, ICML, ICLR, JMLR, ACL, and EMNLP conferences.

Transfer learning for TensorFlow text classification models in Amazon SageMaker

Amazon SageMaker provides a suite of built-in algorithms, pre-trained models, and pre-built solution templates to help data scientists and machine learning (ML) practitioners get started training and deploying ML models quickly. You can use these algorithms and models for both supervised and unsupervised learning. They can process various types of input data, including tabular, image, and text.

This post is the third in a series on the new built-in algorithms in SageMaker. In the first post, we showed how SageMaker provides a built-in algorithm for image classification. In the second post, we showed how SageMaker provides a built-in algorithm for object detection. Today, we announce that SageMaker provides a new built-in algorithm for text classification using TensorFlow. This supervised learning algorithm supports transfer learning for many pre-trained models available in TensorFlow hub. It takes a piece of text as input and outputs the probability for each of the class labels. You can fine-tune these pre-trained models using transfer learning even when a large corpus of text isn’t available. It’s available through the SageMaker built-in algorithms, as well as through the SageMaker JumpStart UI in Amazon SageMaker Studio. For more information, refer to Text Classification and the example notebook Introduction to JumpStart – Text Classification.

Text classification with TensorFlow in SageMaker provides transfer learning on many pre-trained models available in TensorFlow Hub. According to the number of class labels in the training data, a classification layer is attached to the pre-trained TensorFlow Hub model. The classification layer consists of a dropout layer and a dense (fully connected) layer with an L2 regularizer, which is initialized with random weights. The model training has hyperparameters for the dropout rate of the dropout layer and the L2 regularization factor of the dense layer. Then, either the whole network, including the pre-trained model, or only the top classification layer can be fine-tuned on the new training data. In this transfer learning mode, training can be achieved even with a smaller dataset.

How to use the new TensorFlow text classification algorithm

This section describes how to use the TensorFlow text classification algorithm with the SageMaker Python SDK. For information on how to use it from the Studio UI, see SageMaker JumpStart.

The algorithm supports transfer learning for the pre-trained models listed in TensorFlow Models. Each model is identified by a unique model_id. The following code shows how to fine-tune a BERT base model identified by model_id tensorflow-tc-bert-en-uncased-L-12-H-768-A-12-2 on a custom training dataset. For each model_id, to launch a SageMaker training job through the Estimator class of the SageMaker Python SDK, you must fetch the Docker image URI, training script URI, and pre-trained model URI through the utility functions provided in SageMaker. The training script URI contains all of the necessary code for data processing, loading the pre-trained model, model training, and saving the trained model for inference. The pre-trained model URI contains the pre-trained model architecture definition and the model parameters. The pre-trained model URI is specific to the particular model. The pre-trained model tarballs have been pre-downloaded from TensorFlow and saved with the appropriate model signature in Amazon Simple Storage Service (Amazon S3) buckets, so that the training job runs in network isolation. See the following code:

import sagemaker
from sagemaker import image_uris, model_uris, script_uris
from sagemaker.estimator import Estimator

model_id, model_version = "tensorflow-tc-bert-en-uncased-L-12-H-768-A-12-2", "*"
training_instance_type = "ml.p3.2xlarge"

# Retrieve the docker image
train_image_uri = image_uris.retrieve(model_id=model_id, model_version=model_version, image_scope="training", instance_type=training_instance_type, region=None, framework=None)
# Retrieve the training script
train_source_uri = script_uris.retrieve(model_id=model_id, model_version=model_version, script_scope="training")
# Retrieve the pre-trained model tarball for transfer learning
train_model_uri = model_uris.retrieve(model_id=model_id, model_version=model_version, model_scope="training")

# Session, execution role, and Region used by the snippets that follow
sess = sagemaker.Session()
aws_role = sagemaker.get_execution_role()
aws_region = sess.boto_region_name

output_bucket = sess.default_bucket()
output_prefix = "jumpstart-example-tensorflow-tc-training"
s3_output_location = f"s3://{output_bucket}/{output_prefix}/output"

With these model-specific training artifacts, you can construct an object of the Estimator class:

# Create SageMaker Estimator instance
tf_tc_estimator = Estimator(
    role=aws_role,
    image_uri=train_image_uri,
    source_dir=train_source_uri,
    model_uri=train_model_uri,
    entry_point="transfer_learning.py",
    instance_count=1,
    instance_type=training_instance_type,
    max_run=360000,
    hyperparameters=hyperparameters,  # dictionary retrieved in the hyperparameter step that follows
    output_path=s3_output_location,
)

Next, for transfer learning on your custom dataset, you might need to change the default values of the training hyperparameters, which are listed in Hyperparameters. You can fetch a Python dictionary of these hyperparameters with their default values by calling hyperparameters.retrieve_default, update them as needed, and then pass them to the Estimator class. Note that the default values of some of the hyperparameters are different for different models. For large models, the default batch size is smaller and the train_only_top_layer hyperparameter is set to True. The hyperparameter train_only_top_layer defines which model parameters change during the fine-tuning process. If train_only_top_layer is True, then parameters of the classification layers change and the rest of the parameters remain constant during the fine-tuning process. On the other hand, if train_only_top_layer is False, then all of the parameters of the model are fine-tuned. See the following code:

from sagemaker import hyperparameters

# Retrieve the default hyperparameters for fine-tuning the model
hyperparameters = hyperparameters.retrieve_default(model_id=model_id, model_version=model_version)
# [Optional] Override default hyperparameters with custom values
hyperparameters["epochs"] = "5"

We provide SST2 as the default dataset for fine-tuning the models. The dataset contains positive and negative movie reviews. It has been downloaded from TensorFlow under the Apache 2.0 License. The following code provides the default training dataset hosted in S3 buckets:

# Sample training data is available in this bucket
training_data_bucket = f"jumpstart-cache-prod-{aws_region}"
training_data_prefix = "training-datasets/SST2/"

training_dataset_s3_path = f"s3://{training_data_bucket}/{training_data_prefix}"

Finally, to launch the SageMaker training job for fine-tuning the model, call .fit on the object of the Estimator class, while passing the Amazon S3 location of the training dataset:

# Launch a SageMaker Training job by passing s3 path of the training data
tf_tc_estimator.fit({"training": training_dataset_s3_path}, logs=True)

For more information about how to use the new SageMaker TensorFlow text classification algorithm for transfer learning on a custom dataset, deploy the fine-tuned model, run inference on the deployed model, and deploy the pre-trained model as is without first fine-tuning on a custom dataset, see the following example notebook: Introduction to JumpStart – Text Classification.

Input/output interface for the TensorFlow text classification algorithm

You can fine-tune each of the pre-trained models listed in TensorFlow Models to any given dataset made up of text sentences with any number of classes. The pre-trained model attaches a classification layer to the Text Embedding model and initializes the layer parameters to random values. The output dimension of the classification layer is determined based on the number of classes detected in the input data. The objective is to minimize classification errors on the input data. The model returned by fine-tuning can be further deployed for inference.

The following instructions describe how the training data should be formatted for input to the model:

  • Input – A directory containing a data.csv file. Each row of the first column should have integer class labels between 0 and the number of classes. Each row of the second column should have the corresponding text data.
  • Output – A fine-tuned model that can be deployed for inference or further trained using incremental training. A file mapping class indexes to class labels is saved along with the models.

The following is an example of an input CSV file. Note that the file should not have any header. The file should be hosted in an S3 bucket with a path similar to the following: s3://bucket_name/input_directory/. Note that the trailing / is required.

|0 |hide new secretions from the parental units|
|0 |contains no wit , only labored gags|
|1 |that loves its characters and communicates something rather beautiful about human nature|
|...|...|
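A file in this format can be produced with Python's standard csv module. The following is a minimal sketch; the helper name write_training_csv is hypothetical, and the rows reuse the sample reviews from the table above. The csv writer quotes any text that contains a comma, so such rows still parse as two columns.

```python
import csv
import io


def write_training_csv(rows, file_obj):
    """Write (label, text) pairs as a header-less, two-column CSV."""
    writer = csv.writer(file_obj)
    for label, text in rows:
        writer.writerow([int(label), text])


rows = [
    (0, "hide new secretions from the parental units"),
    (0, "contains no wit , only labored gags"),
    (1, "that loves its characters and communicates something rather beautiful about human nature"),
]

# An in-memory buffer stands in for data.csv in this example.
buffer = io.StringIO()
write_training_csv(rows, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

To create the actual file, pass a handle opened with open("data.csv", "w", newline="") instead of the in-memory buffer, then upload the file to the S3 input directory.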

Inference with the TensorFlow text classification algorithm

The generated models can be hosted for inference and support text as the application/x-text content type. The output contains the probability values, class labels for all of the classes, and the predicted label corresponding to the class index with the highest probability encoded in the JSON format. The model processes a single string per request and outputs only one line. The following is an example of a JSON format response:

accept: application/json;verbose
{"probabilities": [prob_0, prob_1, prob_2, ...],
 "labels": [label_0, label_1, label_2, ...],
 "predicted_label": predicted_label}

If accept is set to application/json, then the model only outputs probabilities. For more details on training and inference, see the sample notebook Introduction to JumpStart – Text Classification.
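Either response variant can be reduced to a single prediction on the client side. The sketch below is illustrative only: top_prediction and the sample payloads are hypothetical, and the label names "negative" and "positive" are placeholders rather than the labels the model actually returns.

```python
import json


def top_prediction(response_body):
    """Return (label, probability) for the most likely class.

    Falls back to the class index when the response has no "labels" key,
    which is the case for the non-verbose application/json output.
    """
    result = json.loads(response_body)
    probabilities = result["probabilities"]
    best_index = max(range(len(probabilities)), key=probabilities.__getitem__)
    label = result["labels"][best_index] if "labels" in result else best_index
    return label, probabilities[best_index]


# Hand-written payloads for illustration; not real model output.
verbose_body = json.dumps(
    {
        "probabilities": [0.1, 0.9],
        "labels": ["negative", "positive"],
        "predicted_label": "positive",
    }
)
plain_body = json.dumps({"probabilities": [0.7, 0.3]})

print(top_prediction(verbose_body))
print(top_prediction(plain_body))
```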

Use SageMaker built-in algorithms through the JumpStart UI

You can also use SageMaker TensorFlow text classification and any of the other built-in algorithms with a few clicks via the JumpStart UI. JumpStart is a SageMaker feature that lets you train and deploy built-in algorithms and pre-trained models from various ML frameworks and model hubs through a graphical interface. Furthermore, it lets you deploy fully-fledged ML solutions that string together ML models and various other AWS services to solve a targeted use case.

Following are two videos that show how you can replicate the same fine-tuning and deployment process we just went through with a few clicks via the JumpStart UI.

Fine-tune the pre-trained model

Here is the process to fine-tune the same pre-trained text classification model.

Deploy the fine-tuned model

After model training is finished, you can directly deploy the model to a persistent, real-time endpoint with one click.

Conclusion

In this post, we announced the launch of the SageMaker TensorFlow text classification built-in algorithm. We provided example code for how to do transfer learning on a custom dataset using a pre-trained model from TensorFlow hub using this algorithm.

For more information, check out the documentation and the example notebook Introduction to JumpStart – Text Classification.


About the authors

Dr. Vivek Madan is an Applied Scientist with the Amazon SageMaker JumpStart team. He got his PhD from University of Illinois at Urbana-Champaign and was a Post Doctoral Researcher at Georgia Tech. He is an active researcher in machine learning and algorithm design and has published papers in EMNLP, ICLR, COLT, FOCS and SODA conferences.

João Moura is an AI/ML Specialist Solutions Architect at Amazon Web Services. He is mostly focused on NLP use-cases and helping customers optimize deep learning model training and deployment. He is also an active proponent of low-code ML solutions and ML-specialized hardware.

Dr. Ashish Khetan is a Senior Applied Scientist with Amazon SageMaker built-in algorithms and helps develop machine learning algorithms. He got his PhD from University of Illinois Urbana-Champaign. He is an active researcher in machine learning and statistical inference and has published many papers in NeurIPS, ICML, ICLR, JMLR, ACL, and EMNLP conferences.

Intelligent document processing with AWS AI and Analytics services in the insurance industry: Part 2

In Part 1 of this series, we discussed intelligent document processing (IDP), and how IDP can accelerate claims processing use cases in the insurance industry. We discussed how we can use AWS AI services to accurately categorize claims documents along with supporting documents. We also discussed how to extract various types of documents in an insurance claims package, such as forms, tables, or specialized documents such as invoices, receipts, or ID documents. We looked into the challenges in legacy document processes, which are time-consuming, error-prone, expensive, and difficult to scale, and how you can use AWS AI services to help implement your IDP pipeline.

In this post, we walk you through advanced IDP features for document extraction, querying, and enrichment. We also look into how to further use the extracted structured information from claims data to get insights using AWS Analytics and visualization services. We highlight how extracted structured data from IDP can help detect fraudulent claims using AWS Analytics services.

Intelligent document processing with AWS AI and Analytics services in the insurance industry

Solution overview

The following diagram illustrates the phases of IDP using AWS AI services. In Part 1, we discussed the first three phases of the IDP workflow. In this post, we expand on the extraction step and the remaining phases, which include integrating IDP with AWS Analytics services.

The different phases of intelligent document processing in insurance industry

We use these analytics services for further insights and visualizations, and to detect fraudulent claims using structured, normalized data from IDP. The following diagram illustrates the solution architecture.

IDP architecture diagram

The phases we discuss in this post use the following key services:

  • Amazon Comprehend Medical is a HIPAA-eligible natural language processing (NLP) service that uses machine learning (ML) models that have been pre-trained to understand and extract health data from medical text, such as prescriptions, procedures, or diagnoses.
  • AWS Glue is a part of the AWS Analytics services stack, and is a serverless data integration service that makes it easy to discover, prepare, and combine data for analytics, ML, and application development.
  • Amazon Redshift is another service in the Analytics stack. Amazon Redshift is a fully managed, petabyte-scale data warehouse service in the cloud.

Prerequisites

Before you get started, refer to Part 1 for a high-level overview of the insurance use case with IDP and details about the data capture and classification stages.

For more information regarding the code samples, refer to our GitHub repo.

Extraction phase

In Part 1, we saw how to use Amazon Textract APIs to extract information like forms and tables from documents, and how to analyze invoices and identity documents. In this post, we enhance the extraction phase with Amazon Comprehend to extract default and custom entities specific to custom use cases.

Insurance carriers often come across dense text in insurance claims applications, such as a patient’s discharge summary letter (see the following example image). It can be difficult to automatically extract information from such types of documents where there is no definite structure. To address this, we can use the following methods to extract key business information from the document:

Discharge summary sample

Extract default entities with the Amazon Comprehend DetectEntities API

We run the following code on the sample medical transcription document:

import boto3

comprehend = boto3.client('comprehend')

response = comprehend.detect_entities(Text=text, LanguageCode='en')

# Print entities from the response JSON
for entity in response['Entities']:
    print(f'{entity["Type"]} : {entity["Text"]}')

The following screenshot shows a collection of entities identified in the input text. The output has been shortened for the purposes of this post. Refer to the GitHub repo for a detailed list of entities.
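For downstream processing, it can help to group the detected entities by type and drop low-confidence matches. The following sketch assumes the documented DetectEntities response shape (an "Entities" list whose items carry "Type", "Text", and "Score"). The helper name group_entities, the 0.8 threshold, and the sample response are made up for illustration.

```python
from collections import defaultdict


def group_entities(response, min_score=0.8):
    """Group entity texts by entity type, keeping confident, unique matches."""
    grouped = defaultdict(list)
    for entity in response["Entities"]:
        if entity["Score"] < min_score:
            continue
        texts = grouped[entity["Type"]]
        if entity["Text"] not in texts:
            texts.append(entity["Text"])
    return dict(grouped)


# Hand-written response for illustration; not real service output.
sample_response = {
    "Entities": [
        {"Type": "PERSON", "Text": "John Doe", "Score": 0.99},
        {"Type": "DATE", "Text": "07-SEP-2020", "Score": 0.97},
        {"Type": "PERSON", "Text": "John Doe", "Score": 0.95},
        {"Type": "OTHER", "Text": "NMK43717", "Score": 0.55},
    ]
}

print(group_entities(sample_response))
```

With the default threshold, the low-scoring OTHER entity is dropped and the repeated PERSON entry appears once.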

Extract custom entities with Amazon Comprehend custom entity recognition

The response from the DetectEntities API includes the default entities. However, we’re interested in knowing specific entity values, such as the patient’s name (denoted by the default entity PERSON), or the patient’s ID (denoted by the default entity OTHER). To recognize these custom entities, we train an Amazon Comprehend custom entity recognizer model. We recommend following the comprehensive steps on how to train and deploy a custom entity recognition model in the GitHub repo.

After we deploy the custom model, we can use the helper function get_entities() to retrieve custom entities like PATIENT_NAME and PATIENT_D from the API response:

import pandas as pd

def get_entities(text):
    try:
        # Detect entities with the custom entity recognizer endpoint
        entities_custom = comprehend.detect_entities(LanguageCode="en",
                          Text=text, EndpointArn=ER_ENDPOINT_ARN)
        df_custom = pd.DataFrame(entities_custom["Entities"], columns = ['Text',
                    'Type', 'Score'])
        # Keep one row per distinct entity text
        df_custom = df_custom.drop_duplicates(subset=['Text']).reset_index()
        return df_custom
    except Exception as e:
        print(e)

# call the get_entities() function 
response = get_entities(text) 
#print the response from the get_entities() function
print(response)

The following screenshot shows our results.

Enrichment phase

In the document enrichment phase, we perform enrichment functions on healthcare-related documents to draw valuable insights. We look at the following types of enrichment:

  • Extract domain-specific language – We use Amazon Comprehend Medical to extract medical-specific ontologies like ICD-10-CM, RxNorm, and SNOMED CT
  • Redact sensitive information – We use Amazon Comprehend to redact personally identifiable information (PII), and Amazon Comprehend Medical for protected health information (PHI) redaction

Extract medical information from unstructured medical text

Documents such as medical providers’ notes and clinical trial reports include dense medical text. Insurance claims carriers need to identify the relationships among the extracted health information from this dense text and link them to medical ontologies like ICD-10-CM, RxNorm, and SNOMED CT codes. This is very valuable in automating claim capture, validation, and approval workflows for insurance companies to accelerate and simplify claim processing. Let’s look at how we can use the Amazon Comprehend Medical InferICD10CM API to detect possible medical conditions as entities and link them to their codes:

comprehend_med = boto3.client('comprehendmedical')

cm_json_data = comprehend_med.infer_icd10_cm(Text=text)

print("\nMedical coding\n========")

# Print every candidate ICD-10-CM concept linked to each detected entity
for entity in cm_json_data["Entities"]:
    for icd in entity["ICD10CMConcepts"]:
        description = icd["Description"]
        code = icd["Code"]
        print(f'{description}: {code}')

For the input text, which we can pass in from the Amazon Textract DetectDocumentText API, the InferICD10CM API returns the following output (shortened for brevity).

Extract medical information from unstructured medical text

Similarly, we can use the Amazon Comprehend Medical InferRxNorm API to identify medications and the InferSNOMEDCT API to detect medical entities within healthcare-related insurance documents.
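Because each entity can be linked to several candidate concepts, a claims workflow may want only the highest-scoring code per entity. The sketch below assumes the InferICD10CM response shape used in the preceding code, plus the "Score" field that each concept carries. The helper name top_icd10_codes, the 0.5 threshold, and the sample response with its scores are illustrative.

```python
def top_icd10_codes(response, min_score=0.5):
    """Return (entity text, code, description) for each entity's best concept."""
    results = []
    for entity in response["Entities"]:
        concepts = entity.get("ICD10CMConcepts", [])
        if not concepts:
            continue
        best = max(concepts, key=lambda concept: concept["Score"])
        if best["Score"] >= min_score:
            results.append((entity["Text"], best["Code"], best["Description"]))
    return results


# Hand-written response for illustration; the scores are made up.
sample_icd_response = {
    "Entities": [
        {
            "Text": "hypertension",
            "ICD10CMConcepts": [
                {"Description": "Secondary hypertension, unspecified", "Code": "I15.9", "Score": 0.42},
                {"Description": "Essential (primary) hypertension", "Code": "I10", "Score": 0.81},
            ],
        },
        {"Text": "fatigue", "ICD10CMConcepts": []},
    ]
}

print(top_icd10_codes(sample_icd_response))
```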

Perform PII and PHI redaction

Insurance claims packages are subject to extensive privacy regulations and compliance requirements because they contain both PII and PHI data. Insurance carriers can reduce compliance risk by redacting information like policy numbers or the patient’s name.

Let’s look at an example of a patient’s discharge summary. We use the Amazon Comprehend DetectPiiEntities API to detect PII entities within the document and protect the patient’s privacy by redacting these entities:

resp = call_textract(input_document=f's3://{data_bucket}/idp/textract/dr-note-sample.png')
text = get_string(textract_json=resp, output_type=[Textract_Pretty_Print.LINES])

# call Amazon Comprehend Detect PII Entities API
entity_resp = comprehend.detect_pii_entities(Text=text, LanguageCode="en")

# collect the type and the matching text span of each detected entity
pii = []
for entity in entity_resp['Entities']:
    pii_entity = {}
    pii_entity['Type'] = entity['Type']
    pii_entity['Text'] = text[entity['BeginOffset']:entity['EndOffset']]
    pii.append(pii_entity)
print(pii)

We get the following PII entities in the response from the detect_pii_entities() API:

response from the detect_pii_entities() API

We can then redact the PII entities that were detected from the documents by utilizing the bounding box geometry of the entities from the document. For that, we use a helper tool called amazon-textract-overlayer. For more information, refer to Textract-Overlayer. The following screenshots compare a document before and after redaction.

Similar to the Amazon Comprehend DetectPiiEntities API, we can also use the DetectPHI API to detect PHI data in the clinical text being examined. For more information, refer to Detect PHI.
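As a text-level complement to the image redaction described above, the character offsets returned by DetectPiiEntities can also be used to mask the extracted text directly. The following is a minimal sketch under that assumption; the function name redact_text, the mask character, and the sample text and entities are hypothetical, and this is not the amazon-textract-overlayer approach.

```python
def redact_text(text, entities, mask_char="*"):
    """Replace each detected span with mask characters of the same length."""
    chars = list(text)
    for entity in entities:
        start, end = entity["BeginOffset"], entity["EndOffset"]
        # Overwrite the span in place so that all other offsets stay valid
        chars[start:end] = mask_char * (end - start)
    return "".join(chars)


# Made-up text and entities in the shape of entity_resp['Entities']
sample_text = "Patient John Doe, phone 555-0100."
sample_entities = [
    {"Type": "NAME", "BeginOffset": 8, "EndOffset": 16},
    {"Type": "PHONE", "BeginOffset": 24, "EndOffset": 32},
]

print(redact_text(sample_text, sample_entities))
```

Masking with same-length strings keeps the offsets of the remaining entities valid, so the spans can be processed in any order.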

Review and validation phase

In the document review and validation phase, we can verify whether the claim package meets the business’s requirements, because we have all the information collected from the documents in the package from earlier stages. We can do this by introducing a human in the loop who reviews and validates all the fields, or by using an auto-approval process for low-dollar claims, before sending the package to downstream applications. We can use Amazon Augmented AI (Amazon A2I) to automate the human review process for insurance claims processing.

Now that we have all required data extracted and normalized from claims processing using AI services for IDP, we can extend the solution to integrate with AWS Analytics services such as AWS Glue and Amazon Redshift to solve additional use cases and provide further analytics and visualizations.

Detect fraudulent insurance claims

In this post, we implement a serverless architecture where the extracted and processed data is stored in a data lake and is used to detect fraudulent insurance claims using ML. We use Amazon Simple Storage Service (Amazon S3) to store the processed data. We can then use AWS Glue or Amazon EMR to cleanse the data and add additional fields to make it consumable for reporting and ML. After that, we use Amazon Redshift ML to build a fraud detection ML model. Finally, we build reports using Amazon QuickSight to get insights into the data.

Set up Amazon Redshift external schema

For the purpose of this example, we have created a sample dataset that emulates the output of an ETL (extract, transform, and load) process, and use AWS Glue Data Catalog as the metadata catalog. First, we create a database named idp_demo in the Data Catalog and an external schema in Amazon Redshift called idp_insurance_demo (see the following code). We use an AWS Identity and Access Management (IAM) role to grant permissions to the Amazon Redshift cluster to access Amazon S3 and Amazon SageMaker. For more information about how to set up this IAM role with least privilege, refer to Cluster and configure setup for Amazon Redshift ML administration.

CREATE EXTERNAL SCHEMA idp_insurance_demo
FROM DATA CATALOG
DATABASE 'idp_demo' 
IAM_ROLE '<<<your IAM Role here>>>'
CREATE EXTERNAL DATABASE IF NOT EXISTS;

Create Amazon Redshift external table

The next step is to create an external table in Amazon Redshift referencing the S3 location where the file is located. In this case, our file is a comma-separated text file. We also want to skip the header row from the file, which can be configured in the table properties section. See the following code:

create external table idp_insurance_demo.claims(id INTEGER,
date_of_service date,
patients_address_city VARCHAR,
patients_address_state VARCHAR,
patients_address_zip VARCHAR,
patient_status VARCHAR,
insured_address_state VARCHAR,
insured_address_zip VARCHAR,
insured_date_of_birth date,
insurance_plan_name VARCHAR,
total_charges DECIMAL(14,4),
fraud VARCHAR,
duplicate varchar,
invalid_claim VARCHAR
)
row format delimited
fields terminated by ','
stored as textfile
location '<<<S3 path where file is located>>>'
table properties ( 'skip.header.line.count'='1');

Create training and test datasets

After we create the external table, we prepare our dataset for ML by splitting it into a training set and a test set. We create a new external table called claims_train, which consists of all records with ID <= 850000 from the claims table. This is the training set that we train our ML model on:

CREATE EXTERNAL TABLE
idp_insurance_demo.claims_train
row format delimited
fields terminated by ','
stored as textfile
location '<<<S3 path where file is located>>>/train'
table properties ( 'skip.header.line.count'='1')
AS select * from idp_insurance_demo.claims where id <= 850000;

We create another external table called claims_test that consists of all records with ID > 850000 to be the test set that we test the ML model on:

CREATE EXTERNAL TABLE
idp_insurance_demo.claims_test
row format delimited
fields terminated by ','
stored as textfile
location '<<<S3 path where file is located>>>/test'
table properties ( 'skip.header.line.count'='1')
AS select * from idp_insurance_demo.claims where id > 850000;

Create an ML model with Amazon Redshift ML

Now we create the model using the CREATE MODEL command (see the following code). We select the relevant columns from the claims_train table that can determine a fraudulent transaction. The goal of this model is to predict the value of the fraud column; therefore, fraud is added as the prediction target. After the model is trained, it creates a function named insurance_fraud_model. This function is used for inference while running SQL statements to predict the value of the fraud column for new records.

CREATE MODEL idp_insurance_demo.insurance_fraud_model
FROM (SELECT 
total_charges ,
fraud ,
duplicate,
invalid_claim
FROM idp_insurance_demo.claims_train
)
TARGET fraud
FUNCTION insurance_fraud_model
IAM_ROLE '<<<your IAM Role here>>>'
SETTINGS (
S3_BUCKET '<<<S3 bucket where model artifacts will be stored>>>'
);

Evaluate ML model metrics

After we create the model, we can run queries to check the accuracy of the model. We use the insurance_fraud_model function to predict the value of the fraud column for new records. Run the following query on the claims_test table to create a confusion matrix:

SELECT
fraud,
idp_insurance_demo.insurance_fraud_model (total_charges ,duplicate,invalid_claim ) as fraud_calculated,
count(1)
FROM idp_insurance_demo.claims_test
GROUP BY fraud , fraud_calculated;
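The rows returned by this query (actual label, predicted label, count) can be reduced to summary metrics. The following is a minimal sketch of that arithmetic; the function name classification_metrics, the "Y"/"N" label values, and the counts are assumptions for illustration and are not results from the sample dataset.

```python
def classification_metrics(rows, positive_label):
    """Compute accuracy, precision, and recall from (actual, predicted, count) rows."""
    tp = fp = fn = tn = 0
    for actual, predicted, count in rows:
        if actual == positive_label and predicted == positive_label:
            tp += count  # fraud correctly flagged
        elif actual != positive_label and predicted == positive_label:
            fp += count  # legitimate claim flagged as fraud
        elif actual == positive_label:
            fn += count  # fraud that was missed
        else:
            tn += count  # legitimate claim correctly passed
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Made-up confusion matrix rows: (fraud, fraud_calculated, count)
sample_rows = [("Y", "Y", 80), ("Y", "N", 20), ("N", "Y", 10), ("N", "N", 890)]
metrics = classification_metrics(sample_rows, positive_label="Y")
print(metrics)
```

For fraud detection, recall on the fraud class is often more informative than overall accuracy, because fraudulent claims are typically a small share of all claims.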

Detect fraud using the ML model

After we create the new model, as new claims data is inserted into the data warehouse or data lake, we can use the insurance_fraud_model function to flag fraudulent transactions. We do this by first loading the new data into a temporary table. Then we use the insurance_fraud_model function to calculate the fraud flag for each new transaction and insert the data along with the flag into the final table, which in this case is the claims table.

Visualize the claims data

When the data is available in Amazon Redshift, we can create visualizations using QuickSight. We can then share the QuickSight dashboards with business users and analysts. To create the QuickSight dashboard, you first need to create an Amazon Redshift dataset in QuickSight. For instructions, refer to Creating a dataset from a database.

After you create the dataset, you can create a new analysis in QuickSight using the dataset. The following are some sample reports we created:

  • Total number of claims by state, grouped by the fraud field – This chart shows us the proportion of fraudulent transactions compared to the total number of transactions in a particular state.
  • Sum of the total dollar value of the claims, grouped by the fraud field – This chart shows us the proportion of dollar amount of fraudulent transactions compared to the total dollar amount of transactions in a particular state.
  • Total number of transactions per insurance company, grouped by the fraud field – This chart shows us how many claims were filed for each insurance company and how many of them are fraudulent.

  • Total sum of fraudulent transactions by state displayed on a US map – This chart shows only the fraudulent transactions and displays the total charges for those transactions by state on the map. The darker shade of blue indicates higher total charges. We can further analyze this by city within that state and by zip code within the city to better understand the trends.

Clean up

To prevent incurring future charges to your AWS account, delete the resources that you provisioned in the setup by following the instructions in the Cleanup section in our repo.

Conclusion

In this two-part series, we saw how to build an end-to-end IDP pipeline with little or no ML experience. We explored a claims processing use case in the insurance industry and how IDP can help automate this use case using services such as Amazon Textract, Amazon Comprehend, Amazon Comprehend Medical, and Amazon A2I. In Part 1, we demonstrated how to use AWS AI services for document extraction. In Part 2, we extended the extraction phase and performed data enrichment. Finally, we extended the structured data extracted from IDP for further analytics, and created visualizations to detect fraudulent claims using AWS Analytics services.

We recommend reviewing the security sections of the Amazon Textract, Amazon Comprehend, and Amazon A2I documentation and following the guidelines provided. To learn more about the pricing of the solution, review the pricing details of Amazon Textract, Amazon Comprehend, and Amazon A2I.


About the Authors

Chinmayee Rane is an AI/ML Specialist Solutions Architect at Amazon Web Services. She is passionate about applied mathematics and machine learning. She focuses on designing intelligent document processing solutions for AWS customers. Outside of work, she enjoys salsa and bachata dancing.


Uday Narayanan is an Analytics Specialist Solutions Architect at AWS. He enjoys helping customers find innovative solutions to complex business challenges. His core areas of focus are data analytics, big data systems, and machine learning. In his spare time, he enjoys playing sports, binge-watching TV shows, and traveling.


Sonali Sahu is leading the Intelligent Document Processing AI/ML Solutions Architect team at Amazon Web Services. She is a passionate technophile and enjoys working with customers to solve complex problems using innovation. Her core area of focus is artificial intelligence and machine learning for intelligent document processing.

Intelligent document processing with AWS AI services in the insurance industry: Part 1

The goal of intelligent document processing (IDP) is to help your organization make faster and more accurate decisions by applying AI to process your paperwork. This two-part series highlights the AWS AI technologies that insurance companies can use to speed up their business processes. These AI technologies can be used across insurance use cases such as claims, underwriting, customer correspondence, contracts, or handling dispute resolutions. This series focuses on a claims processing use case in the insurance industry; for more information about the fundamental concepts of the AWS IDP solution, refer to the following two-part series.

Claims processing consists of multiple checkpoints in a workflow that reviews a claim, verifies its authenticity, and determines the correct financial responsibility in order to adjudicate it. Insurance companies take claims through these checkpoints before adjudication. If a claim passes all these checkpoints without issues, the insurance company approves it and processes any payment. However, the company may require additional supporting information to adjudicate a claim. This process is often manual, making it expensive, error-prone, and time-consuming. Insurance customers can use AWS AI services to automate the document processing pipeline for claims processing.

In this two-part series, we take you through how you can automate and intelligently process documents at scale using AWS AI services for an insurance claims processing use case.

Intelligent document processing with AWS AI and Analytics services in the insurance industry

Solution overview

The following diagram represents each stage that we typically see in an IDP pipeline. We walk through each of these stages and how they connect to the steps involved in a claims application process, starting from when an application is submitted, to investigating and closing the application. In this post, we cover the technical details of the data capture, classification, and extraction stages. In Part 2, we expand the document extraction stage and continue to document enrichment, review and verification, and extend the solution to provide analytics and visualizations for a claims fraud use case.

The different phases of intelligent document processing in insurance industry

The following architecture diagram shows the different AWS services used during the phases of the IDP pipeline according to different stages of a claims processing application.

IDP architecture diagram

The solution uses the following key services:

  • Amazon Textract is a machine learning (ML) service that automatically extracts text, handwriting, and data from scanned documents. It goes beyond simple optical character recognition (OCR) to identify, understand, and extract data from forms and tables. Amazon Textract uses ML to read and process any type of document, accurately extracting text, handwriting, tables, and other data with no manual effort.
  • Amazon Comprehend is a natural language processing (NLP) service that uses ML to extract insights from text. Amazon Comprehend can detect entities such as person, location, date, quantity, and more. It can also detect the dominant language and personally identifiable information (PII), and classify documents into their relevant class.
  • Amazon Augmented AI (Amazon A2I) is an ML service that makes it easy to build the workflows required for human review. Amazon A2I brings human review to all developers, removing the undifferentiated heavy lifting associated with building human review systems or managing large numbers of human reviewers. Amazon A2I integrates with both Amazon Textract and Amazon Comprehend to provide the ability to introduce human review or validation within the IDP workflow.

Prerequisites

In the following sections, we walk through the different services relating to the first three phases of the architecture, i.e., the data capture, classification and extraction phases.

Refer to our GitHub repository for full code samples along with the document samples in the claims processing packet.

Data capture phase

Claims and their supporting documents can arrive through various channels, such as fax, email, an admin portal, and more. You can store these documents in highly scalable and durable storage such as Amazon Simple Storage Service (Amazon S3). These documents can be of various file types, such as PDF, JPEG, PNG, TIFF, and more, and can come in various formats and layouts.

Classification phase

In the document classification stage, we can combine Amazon Comprehend with Amazon Textract to convert text to document context to classify the documents that are stored in the data capture stage. We can then use custom classification in Amazon Comprehend to organize documents into classes that we defined in the claims processing packet. Custom classification is also helpful for automating the document verification process and identifying any missing documents from the packet. There are two steps in custom classification, as shown in the architecture diagram:

  1. Extract text using Amazon Textract from all the documents in the data storage to prepare training data for the custom classifier.
  2. Train an Amazon Comprehend custom classification model (also called a document classifier) to recognize the classes of interest based on the text content.

Document classification of insurance claims packet

After the Amazon Comprehend custom classification model is trained, we can use the real-time endpoint to classify documents. Amazon Comprehend returns all classes of documents with a confidence score linked to each class in an array of key-value pairs (Doc_name: Confidence_score). We recommend going through the detailed document classification sample code on GitHub.
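To illustrate how the returned class scores might be used, the following sketch picks the highest-scoring class from a response shaped like the output of the Boto3 comprehend client's classify_document call (a Classes list of Name and Score entries). The function name top_class, the min_score threshold, and the sample class names and scores are assumptions for illustration.

```python
def top_class(response, min_score=0.5):
    """Return (name, score) of the best class, or None if it falls below min_score."""
    classes = response.get("Classes", [])
    if not classes:
        return None
    best = max(classes, key=lambda c: c["Score"])
    # Treat low-confidence predictions as unclassified so they can be reviewed manually
    if best["Score"] < min_score:
        return None
    return best["Name"], best["Score"]


# Made-up response; in practice it would come from
# comprehend.classify_document(Text=document_text, EndpointArn=endpoint_arn)
sample_response = {
    "Classes": [
        {"Name": "CMS1500", "Score": 0.92},
        {"Name": "INVOICE", "Score": 0.05},
        {"Name": "DRIVERS_LICENSE", "Score": 0.03},
    ]
}

print(top_class(sample_response))
```

Returning None below the threshold gives a natural hook for routing uncertain documents to a human reviewer.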

Extraction phase

In the extraction phase, we extract data from documents using Amazon Textract and Amazon Comprehend. For this post, we use the following sample documents in the claims processing packet: a Centers for Medicare & Medicaid Services (CMS)-1500 claim form, a driver’s license and insurance ID, and an invoice.

Extract data from a CMS-1500 claim form

The CMS-1500 form is the standard claim form used by a non-institutional provider or supplier to bill Medicare carriers.

It’s important to process the CMS-1500 form accurately; otherwise, errors can slow down the claims process or delay payment by the carrier. With the Amazon Textract AnalyzeDocument API, we can extract text from documents faster and with higher accuracy, and then draw further insights from the claim form. The following is a sample document of a CMS-1500 claim form.

A CMS1500 Claim form

We now use the AnalyzeDocument API to extract two FeatureTypes, FORMS and TABLES, from the document:

from IPython.display import display, JSON
form_resp = textract.analyze_document(Document={'S3Object':{"Bucket": data_bucket, "Name": cms_key}}, FeatureTypes=['FORMS', 'TABLES'])

# print tables
print(get_string(textract_json=form_resp, output_type=[Textract_Pretty_Print.TABLES], table_format=Pretty_Print_Table_Format.fancy_grid))

# using our constructed helper function - values returned as a dictionary

display(JSON(getformkeyvalue(form_resp), root="Claim Form"))

The following results have been shortened for better readability. For more detailed information, see our GitHub repo.

The FORMS extraction is identified as key-value pairs.
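The getformkeyvalue helper used earlier lives in the GitHub repo and is not reproduced here. As a rough sketch of how such a helper could work, the following function walks the Blocks of an AnalyzeDocument response and pairs each KEY block with its VALUE block through their relationships. The function name form_key_values and the tiny sample response are made up for illustration, and selection elements (check boxes) are ignored for brevity.

```python
def form_key_values(response):
    """Build a {key text: value text} dict from Textract KEY_VALUE_SET blocks."""
    blocks = {block["Id"]: block for block in response["Blocks"]}

    def related_ids(block, rel_type):
        # Collect the IDs of all blocks linked by the given relationship type
        return [
            block_id
            for rel in block.get("Relationships", [])
            if rel["Type"] == rel_type
            for block_id in rel["Ids"]
        ]

    def text_of(block):
        # Join the WORD children of a key or value block
        words = [
            blocks[i]["Text"]
            for i in related_ids(block, "CHILD")
            if blocks[i]["BlockType"] == "WORD"
        ]
        return " ".join(words)

    result = {}
    for block in response["Blocks"]:
        if block["BlockType"] == "KEY_VALUE_SET" and "KEY" in block.get("EntityTypes", []):
            values = [text_of(blocks[i]) for i in related_ids(block, "VALUE")]
            result[text_of(block)] = " ".join(values)
    return result


# Minimal made-up response containing one key-value pair
sample_response = {
    "Blocks": [
        {"Id": "k1", "BlockType": "KEY_VALUE_SET", "EntityTypes": ["KEY"],
         "Relationships": [{"Type": "VALUE", "Ids": ["v1"]}, {"Type": "CHILD", "Ids": ["w1"]}]},
        {"Id": "v1", "BlockType": "KEY_VALUE_SET", "EntityTypes": ["VALUE"],
         "Relationships": [{"Type": "CHILD", "Ids": ["w2", "w3"]}]},
        {"Id": "w1", "BlockType": "WORD", "Text": "Name:"},
        {"Id": "w2", "BlockType": "WORD", "Text": "Jane"},
        {"Id": "w3", "BlockType": "WORD", "Text": "Doe"},
    ]
}

print(form_key_values(sample_response))
```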

The TABLES extraction contains cells, merged cells, and column headers within a detected table in the claim form.

Tables extraction from CMS1500 form

Extract data from ID documents

For identity documents like an insurance ID, which can have different layouts, we can use the Amazon Textract AnalyzeDocument API. We use the FeatureType FORMS as the configuration for the AnalyzeDocument API to extract the key-value pairs from the insurance ID (see the following sample):

Run the following code:

ins_form_resp = textract.analyze_document(Document={'S3Object':{"Bucket": data_bucket, "Name": ins_card_key}}, FeatureTypes=['FORMS'])

# using our constructed helper function - values returned as a dictionary

display(JSON(getformkeyvalue(ins_form_resp), root="Insurance card"))

We get the key-value pairs in the result array, as shown in the following screenshot.

For ID documents like a US driver’s license or US passport, Amazon Textract provides specialized support to automatically extract key terms without the need for templates or formats, unlike what we saw earlier for the insurance ID example. With the AnalyzeID API, businesses can quickly and accurately extract information from ID documents that have different templates or formats. The AnalyzeID API returns two categories of data types:

  • Key-value pairs available on the ID such as date of birth, date of issue, ID number, class, and restrictions
  • Implied fields on the document that may not have explicit keys associated with them, such as name, address, and issuer

We use the following sample US driver’s license from our claims processing packet.

Run the following code:

from tabulate import tabulate
# once again using the Textract response parser
from trp.trp2_analyzeid import TAnalyzeIdDocument, TAnalyzeIdDocumentSchema

ID_resp = textract.analyze_id(DocumentPages=[{'S3Object':{"Bucket": data_bucket, "Name": key}}])

t_doc = TAnalyzeIdDocumentSchema().load(ID_resp)

list_of_results = t_doc.get_values_as_list()
print(tabulate([x[1:3] for x in list_of_results]))

The following screenshot shows our result.

From the results screenshot, you can observe that the results include certain keys that do not appear on the driver’s license itself. For example, Veteran is not a key found on the license; however, it’s a pre-populated key that AnalyzeID supports, due to the differences found in licenses between states.
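Because AnalyzeID returns its full set of normalized keys even when a field is absent from the document, it can be convenient to drop the empty ones before further processing. The following is a minimal sketch that assumes the documented response shape (IdentityDocuments, each with IdentityDocumentFields holding Type and ValueDetection). The function name non_empty_id_fields and the sample values are hypothetical.

```python
def non_empty_id_fields(response):
    """Return {field name: value} for AnalyzeID fields that have a detected value."""
    fields = {}
    for document in response.get("IdentityDocuments", []):
        for field in document.get("IdentityDocumentFields", []):
            key = field["Type"]["Text"]
            value = field["ValueDetection"]["Text"]
            if value:  # skip pre-populated keys with no value on this document
                fields[key] = value
    return fields


# Made-up response fragment in the shape of ID_resp
sample_response = {
    "IdentityDocuments": [
        {
            "IdentityDocumentFields": [
                {"Type": {"Text": "FIRST_NAME"}, "ValueDetection": {"Text": "JANE"}},
                {"Type": {"Text": "LAST_NAME"}, "ValueDetection": {"Text": "DOE"}},
                {"Type": {"Text": "VETERAN"}, "ValueDetection": {"Text": ""}},
            ]
        }
    ]
}

print(non_empty_id_fields(sample_response))
```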

Extract data from invoices and receipts

Similar to the AnalyzeID API, the AnalyzeExpense API provides specialized support for invoices and receipts to extract relevant information such as vendor name, subtotal and total amounts, and more from any format of invoice documents. You don’t need any template or configuration for extraction. Amazon Textract uses ML to understand the context of ambiguous invoices as well as receipts.

The following is a sample medical insurance invoice.

A sample of insurance invoice

We use the AnalyzeExpense API to see a list of standardized fields. Fields that aren’t recognized as standard fields are categorized as OTHER:

expense_resp = textract.analyze_expense(Document={'S3Object':{"Bucket": data_bucket, "Name": invc_key}})

# print invoice summary

print(get_expensesummary_string(textract_json=expense_resp, table_format=Pretty_Print_Table_Format.fancy_grid))

# print invoice line items

print(get_expenselineitemgroups_string(textract_json=expense_resp, table_format=Pretty_Print_Table_Format.fancy_grid))

We get the following list of fields as key-value pairs (see screenshot on the left) and the entire row of individual line items purchased (see screenshot on the right) in the results.
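To work with these results programmatically rather than as printed tables, the summary fields can be separated into standardized fields and those categorized as OTHER. The following sketch assumes the AnalyzeExpense response shape (ExpenseDocuments, each with SummaryFields holding Type, LabelDetection, and ValueDetection). The function name split_summary_fields and the sample values are made up for illustration.

```python
def split_summary_fields(response):
    """Split AnalyzeExpense summary fields into standard fields and OTHER fields."""
    standard, other = {}, []
    for document in response.get("ExpenseDocuments", []):
        for field in document.get("SummaryFields", []):
            field_type = field.get("Type", {}).get("Text", "OTHER")
            value = field.get("ValueDetection", {}).get("Text", "")
            if field_type == "OTHER":
                # Non-standard fields are identified by the label found on the document
                label = field.get("LabelDetection", {}).get("Text", "")
                other.append((label, value))
            else:
                standard[field_type] = value
    return standard, other


# Made-up response fragment in the shape of expense_resp
sample_response = {
    "ExpenseDocuments": [
        {
            "SummaryFields": [
                {"Type": {"Text": "VENDOR_NAME"}, "ValueDetection": {"Text": "Example Clinic"}},
                {"Type": {"Text": "TOTAL"}, "ValueDetection": {"Text": "$250.00"}},
                {"Type": {"Text": "OTHER"}, "LabelDetection": {"Text": "Policy No"},
                 "ValueDetection": {"Text": "ABC-123"}},
            ]
        }
    ]
}

standard_fields, other_fields = split_summary_fields(sample_response)
print(standard_fields)
print(other_fields)
```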

Conclusion

In this post, we showcased the common challenges in claims processing, and how we can use AWS AI services to automate an intelligent document processing pipeline to automatically adjudicate a claim. We saw how to classify documents into various document classes using an Amazon Comprehend custom classifier, and how to use Amazon Textract to extract data from unstructured, semi-structured, structured, and specialized document types.

In Part 2, we expand on the extraction phase with Amazon Textract. We also use Amazon Comprehend pre-defined entities and custom entities to enrich the data, and show how to extend the IDP pipeline to integrate with analytics and visualization services for further processing.

We recommend reviewing the security sections of the Amazon Textract, Amazon Comprehend, and Amazon A2I documentation and following the guidelines provided. To learn more about the pricing of the solution, review the pricing details of Amazon Textract, Amazon Comprehend, and Amazon A2I.


About the Authors

Chinmayee Rane is an AI/ML Specialist Solutions Architect at Amazon Web Services. She is passionate about applied mathematics and machine learning. She focuses on designing intelligent document processing solutions for AWS customers. Outside of work, she enjoys salsa and bachata dancing.


Sonali Sahu is leading the Intelligent Document Processing AI/ML Solutions Architect team at Amazon Web Services. She is a passionate technophile and enjoys working with customers to solve complex problems using innovation. Her core area of focus is artificial intelligence and machine learning for intelligent document processing.


Tim Condello is a Senior AI/ML Specialist Solutions Architect at Amazon Web Services. His focus is natural language processing and computer vision. Tim enjoys taking customer ideas and turning them into scalable solutions.

Improving stability and flexibility of ML pipelines at Amazon Packaging Innovation with Amazon SageMaker Pipelines

To delight customers and minimize packaging waste, Amazon must select the optimal packaging type for billions of packages shipped every year. If too little protection is used for a fragile item such as a coffee mug, the item will arrive damaged and Amazon risks losing the customer’s trust. Using too much protection will result in increased costs and overfull recycling bins. With hundreds of millions of products available, a scalable decision mechanism is needed to continuously learn from product testing and customer feedback.

To solve these problems, the Amazon Packaging Innovation team developed machine learning (ML) models that classify whether products are suitable for Amazon packaging types such as mailers, bags, or boxes, or could even be shipped with no additional packaging. Previously, the team developed a custom pipeline based on AWS Step Functions to perform weekly training and daily or monthly inference jobs. However, over time the pipeline didn’t provide sufficient flexibility to launch models with new architectures. Development for the new pipelines presented an overhead and required coordination between data scientists and developers. To overcome these difficulties and improve speed of deploying new models and architectures, the team chose to orchestrate model training and inference with Amazon SageMaker Pipelines.

In this post, we discuss the previous orchestration architecture based on Step Functions, outline training and inference architectures using Pipelines, and highlight the flexibility the Amazon Packaging Innovation team achieved.

Challenges of the former ML pipeline at Amazon Packaging Innovation

To incorporate continuous feedback about performance of packages, a new model is trained every week using a growing number of labels. The inference for the entire inventory of products is performed monthly, and a daily inference is performed to deliver just-in-time predictions for the newly added inventory.

To automate the process of training multiple models and provide predictions, the team had developed a custom pipeline based on Step Functions to orchestrate the following steps:

  • Data preparation for training and inference jobs and loading of predictions to the database (Amazon Redshift) with AWS Glue.
  • Model training and inference with Amazon SageMaker.
  • Calculation of model performance metrics on the validation set with AWS Batch.
  • Using Amazon DynamoDB to store model configurations (such as data split ratio for training and validation, model artifact location, model type, and number of instances for training and inference), model performance metrics, and the latest successfully trained model version.
  • Calculation of the differences in the model performance scores, changes in the distribution of the training labels, and comparing the size of the input data between the previous and the new model versions with AWS Lambda functions.
  • Given the large number of steps, the pipeline also required a reliable alarming system at each step to alert the stakeholders of any issues. This was accomplished via a combination of Amazon Simple Queue Service (Amazon SQS) and Amazon Simple Notification Service (Amazon SNS). The alarms were created to notify the business stakeholders, data scientists, and developers about any failed steps and large deviations in the model and data metrics.

After using this solution for nearly 2 years, the team realized that this implementation only worked well for a typical ML workflow where a single model was trained and scored on a validation dataset. However, the solution wasn’t sufficiently flexible for complex models and wasn’t resilient to failures. For example, the architecture didn’t easily accommodate sequential model training. It was difficult to add or remove a step without duplicating the entire pipeline and modifying the infrastructure. Even simple changes in the data processing steps such as adjusting the data split ratio or selecting a different set of features required coordination from both a data scientist and a developer. When the pipeline failed at any step, it had to be restarted from the beginning, which resulted in repeated runs and increased cost. To avoid repeated runs and restart from the failed step instead, the team would create a new copy of an abridged state machine. This troubleshooting led to a proliferation of the state machines, each starting from the commonly failing steps. Finally, if a training job encountered a deviation in the distribution of labels, model score, or number of labels, a data scientist had to review the model and its metrics manually. Then a data scientist would access a DynamoDB table with the model versions and update the table to ensure that the correct model was used for the next inference job.

The maintenance of this architecture required at least one dedicated resource and an additional full-time resource for development. Given the difficulties of expanding the pipeline to accommodate new use cases, the data scientists had begun developing their own workflows, which in turn had led to a growing code base, multiple data tables with similar data schemas, and decentralized model monitoring. Accumulation of these issues had resulted in lower team productivity and increased overhead.

To address these challenges, the Amazon Packaging Innovation team evaluated other existing solutions for MLOps, including SageMaker Pipelines (December 2020 release announcement). Pipelines is a capability of SageMaker for building, managing, automating, and scaling end-to-end ML workflows. Pipelines allows you to reduce the number of steps across the entire ML workflow and is flexible enough to allow data scientists to define a custom ML workflow. It takes care of monitoring and logging the steps. It also comes with a model registry that automatically versions new models. The model registry has built-in approval workflows to select models for inference in production. Pipelines also supports caching of steps called with the same arguments. If a previous successful run of a step with the same arguments is found, its results are reused, which allows for an easy restart without recomputing the successfully completed steps.

In the evaluation process, Pipelines stood out from the other solutions for its flexibility and availability of features for supporting and expanding current and future workflows. Switching to Pipelines freed up developers’ time from platform maintenance and troubleshooting and redirected attention towards the addition of the new features. In this post, we present the design for training and inference workflows at the Amazon Packaging Innovation team using Pipelines. We also discuss the benefits and the reduction in costs the team realized by switching to Pipelines.

Training pipeline

The Amazon Packaging Innovation team trains models for every package type using a growing number of labels. The following diagram outlines the entire process.

PackagingInnovation-training-architecture

The workflow begins by extracting labels and features from an Amazon Redshift database and unloading the data to Amazon Simple Storage Service (Amazon S3) via a scheduled extract, transform, and load (ETL) job. Along with the input data, a file object with the model type and parameters is placed in the S3 bucket. This file serves as the pipeline trigger via a Lambda function.
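The trigger described above could look roughly like the following Lambda handler, which reads the S3 object from the event notification and starts a pipeline execution through the Boto3 SageMaker client's start_pipeline_execution call. This is a sketch under stated assumptions, not the team's implementation: the pipeline name, the ConfigS3Uri parameter name, and the sample event are hypothetical.

```python
from urllib.parse import unquote_plus

PIPELINE_NAME = "packaging-training-pipeline"  # hypothetical pipeline name


def build_start_request(event, pipeline_name=PIPELINE_NAME):
    """Translate an S3 event notification into start_pipeline_execution arguments."""
    record = event["Records"][0]["s3"]
    bucket = record["bucket"]["name"]
    key = unquote_plus(record["object"]["key"])  # object keys arrive URL-encoded
    return {
        "PipelineName": pipeline_name,
        "PipelineParameters": [
            # "ConfigS3Uri" is a made-up pipeline parameter pointing at the trigger file
            {"Name": "ConfigS3Uri", "Value": f"s3://{bucket}/{key}"},
        ],
    }


def lambda_handler(event, context, sagemaker_client=None):
    """Start the pipeline when the trigger file lands in Amazon S3."""
    if sagemaker_client is None:
        import boto3  # imported lazily so the sketch can be exercised without AWS access
        sagemaker_client = boto3.client("sagemaker")
    response = sagemaker_client.start_pipeline_execution(**build_start_request(event))
    return {"PipelineExecutionArn": response["PipelineExecutionArn"]}


# Made-up S3 event fragment
sample_event = {
    "Records": [
        {"s3": {"bucket": {"name": "my-bucket"}, "object": {"key": "configs/model+config.json"}}}
    ]
}
print(build_start_request(sample_event))
```

Passing the client as an optional argument keeps the handler easy to exercise locally with a stand-in object.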

The next steps are completely customizable and defined entirely by a data scientist using the SageMaker Python SDK for Pipelines. In the scenario we present in this post, the input data is split into training and validation sets and saved back in an S3 bucket by launching a SageMaker Processing job.

When the data is ready in Amazon S3, a SageMaker training job starts. After the model is successfully trained and created, the model evaluation step is performed on the validation data via a SageMaker batch transform job. The model metrics are then compared to the previous week’s model metrics using a SageMaker Processing job. The team has defined multiple custom criteria for evaluating deviations in the model performance. The model is either rejected or approved based on these criteria. If the model is rejected, the previous approved model is used for the next inference jobs. If the model is approved, its version is registered and that model is used for inference jobs. The stakeholders receive a notification about the outcome via Amazon CloudWatch alarms.
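
The team's actual criteria are not described in this post, so the following is only a sketch of what such a comparison step might compute. It assumes that higher metric values are better and that a relative drop of more than 2% in any metric rejects the model; both the rule and the threshold are hypothetical. The returned strings match the Approved and Rejected approval statuses of the model registry.

```python
def decide_approval(new_metrics, previous_metrics, max_relative_drop=0.02):
    """Compare this run's metrics with the previous run's and return an approval status."""
    for name, previous_value in previous_metrics.items():
        new_value = new_metrics.get(name)
        if new_value is None:
            # A metric that was reported before but is now missing counts as a failure.
            return "Rejected"
        if previous_value > 0:
            relative_drop = (previous_value - new_value) / previous_value
            if relative_drop > max_relative_drop:
                return "Rejected"
    return "Approved"


print(decide_approval({"f1": 0.90}, {"f1": 0.91}))
print(decide_approval({"f1": 0.80}, {"f1": 0.91}))
```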

The following screenshot from Amazon SageMaker Studio shows the steps of the training pipeline.

PackagingInnovation-SMP-training

Pipelines tracks each pipeline run, which you can monitor in Studio. Alternatively, you can query the progress of the run using Boto3 or the AWS Command Line Interface (AWS CLI). You can visualize the model metrics in Studio and compare different model versions.
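
For example, a minimal Boto3 sketch for checking on a run could look like the following. The helper name is ours, and it reads only the first page of step results; a run with many steps would need to follow the NextToken value in the response.

```python
def summarize_execution(execution_arn, sm_client=None):
    """Return the overall status of a pipeline run and the status of each step."""
    if sm_client is None:
        import boto3  # imported lazily so the function can be tested with a stub client

        sm_client = boto3.client("sagemaker")

    execution = sm_client.describe_pipeline_execution(
        PipelineExecutionArn=execution_arn
    )
    steps = sm_client.list_pipeline_execution_steps(
        PipelineExecutionArn=execution_arn
    )
    return {
        "status": execution["PipelineExecutionStatus"],
        "steps": {
            step["StepName"]: step["StepStatus"]
            for step in steps["PipelineExecutionSteps"]
        },
    }
```

The execution ARN passed in is the value returned when the pipeline run is started.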

Inference pipeline

The Amazon Packaging Innovation team refreshes predictions for the entire inventory of products monthly. Daily predictions are generated to provide just-in-time packaging recommendations for newly added inventory using the latest trained model. This requires the inference pipeline to run daily with different volumes of data. The following diagram illustrates this workflow.

PackagingInnovation-inference-architecture

Similar to the training pipeline, the inference begins with unloading the data from Amazon Redshift to an S3 bucket. A file object placed in Amazon S3 triggers the Lambda function that initiates the inference pipeline. The features are prepared for inference and the data is split into appropriately sized files using a SageMaker Processing job. Next, the pipeline identifies the latest approved model, runs the predictions, and loads them to an S3 bucket. Finally, the predictions are loaded back to Amazon Redshift using the Amazon Redshift Data API through Boto3 within the SageMaker Processing job.
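
One way to look up the latest approved model is to query the model registry with Boto3, as in the following sketch. The helper name and the model package group name in the usage comment are hypothetical, and the post does not say how the team's pipeline performs this lookup.

```python
def latest_approved_model_package_arn(model_package_group_name, sm_client=None):
    """Return the ARN of the most recently created approved model version."""
    if sm_client is None:
        import boto3  # imported lazily so the function can be tested with a stub client

        sm_client = boto3.client("sagemaker")

    response = sm_client.list_model_packages(
        ModelPackageGroupName=model_package_group_name,
        ModelApprovalStatus="Approved",
        SortBy="CreationTime",
        SortOrder="Descending",
        MaxResults=1,
    )
    summaries = response["ModelPackageSummaryList"]
    if not summaries:
        raise LookupError(
            f"No approved model found in group {model_package_group_name!r}"
        )
    return summaries[0]["ModelPackageArn"]


# Example call (hypothetical group name):
# arn = latest_approved_model_package_arn("packaging-models")
```

Because rejected versions never receive the Approved status, this lookup naturally falls back to the previously approved model when the newest one was rejected.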

The following screenshot from Studio shows the inference pipeline details.

Benefits of choosing to architect ML workflows with SageMaker Pipelines

In this section, we discuss the gains the Amazon Packaging Innovation team realized by switching to Pipelines for model training and inference.

Out-of-the-box production-level MLOps features

While the team was comparing different internal and external options for its next ML pipeline solution, a single data scientist was able to prototype and develop a full version of an ML workflow with Pipelines in a Studio Jupyter environment in less than 3 weeks. Even at the prototyping stage, it became clear that Pipelines provided all the infrastructure components required for a production-level workflow: model versioning, caching, and alarms. Immediate availability of these features meant that no additional time would be spent developing and customizing them. This was a clear demonstration of value, which convinced the Amazon Packaging Innovation team that Pipelines was the right solution.

Flexibility in developing ML models

The biggest gain for the data scientists on the team was the ability to experiment easily and iterate through different models. Regardless of what framework they preferred for their ML work and the number of steps and features it involved, Pipelines accommodated their needs. The data scientists were empowered to experiment without having to wait for a slot in a software development sprint to add a feature or step.

Reduced costs

The Pipelines capability of SageMaker is free: you pay only for the compute resources and the storage associated with training and inference. However, when thinking about the cost, you need to account not only for the cost of the services used but also for the developer hours needed to maintain, debug, and patch the workflow. Orchestrating with Pipelines is simpler because it consists of fewer pieces and familiar infrastructure. Previously, adding a new feature required at least two people (a data scientist and a software engineer) on the Amazon Packaging Innovation team to implement it. With the redesigned pipeline, engineering efforts are now directed toward additional custom infrastructure around the pipeline, such as creating a single repository for tracking the machine learning code, simplifying model deployment across AWS accounts, and developing integrated ETL jobs and common reusable functions.

The ability to cache steps that are called with the same inputs also contributed to the reduction in cost, because the team rarely needed to rerun the entire pipeline. Instead, they could restart it from the point of failure.
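
To make the idea concrete, the following toy sketch shows why argument-keyed caching lets a rerun resume from the point of failure. It only illustrates the concept and is not how Pipelines implements caching; the function name, the in-memory dictionary, and the step functions are all made up for this example.

```python
import hashlib
import json


def run_with_cache(steps, cache):
    """Run (name, func, args) steps in order, reusing outputs cached for identical arguments."""
    results = {}
    for name, func, args in steps:
        # The cache key depends on the step name and its arguments only.
        key = hashlib.sha256(
            json.dumps([name, args], sort_keys=True).encode()
        ).hexdigest()
        if key not in cache:
            # The output is stored only if the step finishes without raising.
            cache[key] = func(**args)
        results[name] = cache[key]
    return results


def prepare(n):
    return n * 2


def train(lr):
    return f"model trained with lr={lr}"


cache = {}
steps = [("prepare", prepare, {"n": 2}), ("train", train, {"lr": 0.1})]
run_with_cache(steps, cache)          # both steps run
print(run_with_cache(steps, cache))   # both results come from the cache
```

If a later step raises, the outputs of the earlier steps are already in the cache, so the next run skips them and starts its real work at the step that failed.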

Conclusion

The Amazon Packaging Innovation team trains ML models on a monthly basis and regularly updates predictions for the recommended product packaging types. These recommendations helped them achieve multiple team- and company-wide goals by reducing waste and delighting customers with each order. The training and inference pipelines must run reliably on a regular basis yet allow for constant improvement of the models.

Transitioning to Pipelines allowed the team to deploy four new multi-modal model architectures to production in under 2 months. Deploying a new model using the previous architecture would have required 5 days (with the same model architecture) to 1 month (with a new model architecture). Deploying the same model using Pipelines enabled the team to reduce the development time to 4 hours with the same model architecture and to 5 days with a new model architecture. That amounts to a savings of almost 80% of working hours.

About the Authors

Ankur Shukla is a Principal Data Scientist at AWS-ProServe based in Palo Alto. Ankur has more than 15 years of consulting experience working directly with customers and helping them solve business problems with technology. He leads multiple global applied science and ML-Ops initiatives within AWS. In his free time, he enjoys reading and spending time with family.

Akash Singla is a Sr. System Dev Engineer with the Amazon Packaging Innovation team. He has more than 17 years of experience solving critical business problems through technology for several business verticals. He currently focuses on upgrading NAWS infrastructure for a variety of packaging-centric applications to scale them better.

Vitalina Komashko is a Data Scientist with AWS Professional Services. She holds a PhD in Pharmacology and Toxicology but transitioned to data science from experimental work because she wanted “to own data generation and the interpretation of the results.” Earlier in her career, she worked with biotech and pharma companies. At AWS, she enjoys solving problems for customers from a variety of industries and learning about their unique challenges.

Prasanth Meiyappan is a Sr. Applied Scientist who has been with Amazon Packaging Innovation for 4+ years. He has 6+ years of industry experience in machine learning and has shipped products to improve the search customer experience and the customer packaging experience. Prasanth is passionate about sustainability and has a PhD in statistical modeling of climate change.

Matthew Bales is a Sr. Research Scientist working to optimize package type selection using customer feedback and machine learning. Prior to Amazon, Matt worked as a postdoc performing simulations of particle physics in Germany and, in a previous life, as a production manager of radioactive medical implant devices at a startup. He holds a Ph.D. in Physics from the University of Michigan.
