Face-off Probability, part of NHL Edge IQ: Predicting face-off winners in real time during televised games

Face-off Probability is the National Hockey League’s (NHL) first advanced statistic using machine learning (ML) and artificial intelligence. It uses real-time Player and Puck Tracking (PPT) data to show viewers which player is likely to win a face-off before the puck is dropped, and gives broadcasters and viewers the opportunity to dive deeper into the importance of face-off matchups and the differences in player abilities. Drawing on 10 years of historical data, hundreds of thousands of face-offs were used to engineer over 70 features that feed the model to provide real-time probabilities. Broadcasters can now discuss how a key face-off win by a player led to a goal, or how the chances of winning a face-off decrease when a team’s face-off specialist is waived out of a draw. Fans can see visual, real-time predictions that show them the importance of a key part of the game.

In this post, we focus on how the ML model for Face-off Probability was developed and the services used to put the model into production. We also share the key technical challenges that were solved during construction of the Face-off Probability model.

How it works

Imagine the following scenario: It’s a tie game between two NHL teams that will determine who moves forward. We’re in the third period with 1:22 left to play. Two players from opposite teams line up to take the draw at the face-off dot closest to one of the nets. The linesman notices a defensive player encroaching into the face-off circle and waives the defending team’s player out of the face-off due to the violation. A less experienced defensive player comes in to take the draw as his replacement. The attacking team wins the face-off, gets possession of the puck, and immediately scores to take the lead. The score holds up for the remaining minute of the game and decides who moves forward. Which player was favored to win the face-off before the initial duo was changed? How much did the defending team’s probability of winning the face-off drop because of the violation that forced a different player to take the draw? Face-off Probability, the newest NHL Edge IQ statistic powered by AWS, can now answer these questions.

When there is a stoppage in play, Face-off Probability generates predictions for who will win the upcoming face-off based on the players on the ice, the location of the face-off, and the current game situation. The predictions are generated throughout the stoppage until the game clock starts running again. Predictions occur at sub-second latency and are triggered any time there is a change in the players involved in the face-off.

Overcoming key obstacles for face-off probability

Predicting face-off probability in real-time broadcasts can be broken down into two specific sub-problems:

  • Modeling the face-off event as an ML problem, understanding the requirements and limitations, preparing the data, engineering the data signals, exploring algorithms, and ensuring reliability of results
  • Detecting a face-off event during the game from a stream of PPT events, collecting parameters needed for prediction, calling the model, and submitting results to broadcasters

Predicting the probability of a player winning a face-off in real time on a televised broadcast posed several technical challenges. These included determining the features and modeling methods needed to predict an event with a large amount of inherent uncertainty, and working out how to use streaming PPT sensor data to identify where a face-off is occurring, which players are involved, and the probability of each player winning the face-off, all within hundreds of milliseconds.

Building an ML model for difficult-to-predict events

Predicting events such as face-off winning probabilities during a live game is a complex task that requires a significant amount of quality historical data and data streaming capabilities. To identify and understand the important signals in such a rich data environment, developing ML models requires extensive subject matter expertise. The Amazon Machine Learning Solutions Lab partnered with NHL hockey and data experts to work backward from their target goal of enhancing the fan experience. By continuously listening to the NHL’s expertise and testing hypotheses, AWS’s scientists engineered over 100 features that correlate with the face-off outcome. The team grouped these features into three categories:

  • Historical statistics on player performances such as the number of face-offs a player has taken and won in the last five seasons, the number of face-offs the player has taken and won in previous games, a player’s winning percentages over several time windows, and the head-to-head winning percentage for each player in the face-off
  • Player characteristics such as height, weight, handedness, and years in the league
  • In-game situational data that might affect a player’s performance, such as the score of the game, the elapsed time in the game to that point, where the face-off is located, the strength of each team, and which player has to put their stick down for the face-off first

AWS’s ML scientists framed the problem as binary classification: either the home player wins the face-off or the away player does. With data from more than 200,000 historical face-offs, they trained a LightGBM model to predict which of the two players involved in a face-off is more likely to win.
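
As an illustration of this setup, the following is a minimal sketch of training a binary face-off classifier with LightGBM’s scikit-learn API. The data source, column names, and hyperparameters are placeholders for illustration, not the NHL’s actual feature set.

import lightgbm as lgb
import pandas as pd
from sklearn.model_selection import train_test_split

# Placeholder feature table: one row per historical face-off with the engineered
# features (player statistics, player characteristics, game situation) and a
# binary label indicating whether the home player won the draw.
df = pd.read_parquet("faceoffs.parquet")
X = df.drop(columns=["home_player_won"])
y = df["home_player_won"]

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, stratify=y)

# Binary classifier; the raw predicted probability is the on-air statistic.
model = lgb.LGBMClassifier(objective="binary", n_estimators=500, learning_rate=0.05)
model.fit(X_train, y_train, eval_set=[(X_val, y_val)])

p_home_win = model.predict_proba(X_val)[:, 1]  # probability the home player wins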

Determining if a face-off is about to occur and which players are involved

When a whistle blows and the play is stopped, Face-off Probability begins to make predictions. However, Face-off Probability first has to determine where the face-off is occurring and which player from each team is involved in the face-off. The data stream indicates events as they occur but doesn’t provide information on when an event is likely to occur in the future. As such, the sensor data of the players on the ice is needed to determine if and where a face-off is about to happen.

The PPT system produces real-time locations and velocities for players on the ice at up to 60 events per second. These locations and velocities are used to determine where the face-off is happening on the ice and whether it’s likely to happen soon. By measuring how close the players are to known face-off locations and how stationary they are, Face-off Probability can determine that a face-off is likely to occur and identify the two players who will be involved in it.

Determining the correct cut-off distance for proximity to a face-off location and the corresponding cut-off velocity for stationary players was accomplished using a decision tree model. With PPT data from the 2020-2021 season, we built a model to predict the likelihood that a face-off is occurring at a specified location given the average distance of each team to the location and the velocities of the players. The decision tree provided the cut-offs for each metric, which we included as rules-based logic in the streaming application.
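
A minimal sketch of how such cut-offs might be derived with a shallow scikit-learn decision tree follows. The column names and tree parameters are assumptions for illustration; the actual model was trained on 2020-2021 PPT data.

import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Placeholder frame built from PPT data: for each candidate face-off location and
# moment in time, the average distance of each team's players to that location,
# their average speed, and whether a face-off actually occurred there.
df = pd.read_csv("ppt_faceoff_candidates.csv")
features = ["avg_dist_home", "avg_dist_away", "avg_speed_home", "avg_speed_away"]

tree = DecisionTreeClassifier(max_depth=3, min_samples_leaf=500)
tree.fit(df[features], df["faceoff_occurred"])

# The learned split thresholds become the rules-based logic in the streaming application.
print(export_text(tree, feature_names=features))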

With the correct face-off location determined, the player taking the face-off for each team was identified as the player from that team closest to the known location. This provided the application with the flexibility to identify the correct players while also being able to adjust to a new player having to take the face-off if the current player is waived out due to an infraction. Making and updating the prediction for the correct player was a key focus for the real-time usability of the model in broadcasts, which we describe further in the next section.
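
The selection logic itself is simple; the following sketch shows one way it could look, with hypothetical player and location structures.

import math

def closest_player(players, faceoff_xy):
    """Return the player nearest to the face-off dot.
    players: list of dicts like {"id": ..., "team": ..., "x": ..., "y": ...}."""
    fx, fy = faceoff_xy
    return min(players, key=lambda p: math.hypot(p["x"] - fx, p["y"] - fy))

def faceoff_participants(on_ice_players, faceoff_xy):
    home = [p for p in on_ice_players if p["team"] == "home"]
    away = [p for p in on_ice_players if p["team"] == "away"]
    # Re-evaluated on every location update, so if a player is waived out of the
    # draw, whoever moves in closest automatically becomes the new participant.
    return closest_player(home, faceoff_xy), closest_player(away, faceoff_xy)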

Model development and training

To develop the model, we used more than 200,000 historical face-off data points, along with the custom feature set engineered in collaboration with the subject matter experts. We looked at features like in-game situations, historical performance of the players taking the face-off, player-specific characteristics, and head-to-head performances of the players taking the face-off, both in the current season and over their careers. Collectively, this resulted in over 100 features created from a combination of raw and derived data.

To assess different features and how they might influence the model, we conducted extensive feature analysis as part of the exploratory phase. We used a mix of univariate and multivariate tests. For multivariate tests, we used decision tree visualization techniques for interpretability. To assess statistical significance, we used chi-squared and Kolmogorov-Smirnov (KS) tests to check for dependence and distribution differences.
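
For reference, the following sketch shows how such univariate screening might look with SciPy; the feature names are hypothetical examples.

import pandas as pd
from scipy.stats import chi2_contingency, ks_2samp

df = pd.read_parquet("faceoffs.parquet")  # placeholder feature table

# Chi-squared test of independence for a categorical feature vs. the outcome.
contingency = pd.crosstab(df["handedness_matchup"], df["home_player_won"])
chi2, p_value, dof, _ = chi2_contingency(contingency)
print(f"handedness_matchup: chi2={chi2:.1f}, p={p_value:.4f}")

# Two-sample KS test comparing a numeric feature's distribution for wins vs. losses.
wins = df.loc[df["home_player_won"] == 1, "h2h_win_pct"]
losses = df.loc[df["home_player_won"] == 0, "h2h_win_pct"]
ks_stat, p_value = ks_2samp(wins, losses)
print(f"h2h_win_pct: KS={ks_stat:.3f}, p={p_value:.4f}")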

A decision tree showing how the model estimates based on the underlying data and features

We explored classification techniques and models with the expectation that the raw probabilities would be treated as the predictions. We evaluated algorithms including nearest neighbors, decision trees, neural networks, and collaborative filtering, tried different sampling strategies (filtering, random, stratified, and time-based sampling), and measured performance using Area Under the Curve (AUC), calibration distribution, and Brier score loss. In the end, we found that the LightGBM model worked best, with well-calibrated accuracy metrics.
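
A minimal sketch of computing these metrics with scikit-learn, assuming held-out labels and predicted probabilities from a model like the one above:

from sklearn.calibration import calibration_curve
from sklearn.metrics import brier_score_loss, roc_auc_score

def evaluate_probabilities(y_test, p_test):
    """Report AUC, Brier score loss, and a simple calibration table for a
    held-out set (y_test = actual outcomes, p_test = predicted probabilities)."""
    print(f"AUC   = {roc_auc_score(y_test, p_test):.3f}")
    print(f"Brier = {brier_score_loss(y_test, p_test):.3f}")
    observed, predicted = calibration_curve(y_test, p_test, n_bins=10)
    for pred, obs in zip(predicted, observed):
        print(f"predicted {pred:.2f} -> observed home win rate {obs:.2f}")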

To evaluate the performance of the models, we used multiple techniques. We used a test set that the trained model was never exposed to. Additionally, the teams conducted extensive manual assessments of the results, looking at edge cases and trying to understand the nuances of how the model decided why a certain player should have been favored to win or lose a face-off.

Based on the information collected from manual reviews, we adjusted the features when required, or ran further iterations on the model to confirm that its performance was as expected.

Deploying Face-off Probability for real-time use during national television broadcasts

One of the goals of the project was not just to predict the winner of the face-off, but to build a foundation for solving a number of similar problems in a real-time and cost-efficient way. That goal helped determine which components to use in the final architecture.

Architecture diagram for the Face-off Probability application

The first important component is Amazon Kinesis Data Streams, a serverless streaming data service that acts as a decoupler between the specific implementation of the PPT data provider and the consuming applications, thereby protecting the latter from disruptive changes in the former. It also supports the enhanced fan-out feature, which allows up to 20 parallel consumers to connect to a stream, each with a low latency of around 70 milliseconds and its own dedicated throughput of 2 MB/s per shard.

PPT events don’t come for all players at once, but arrive discretely for each player as well as other events in the game. Therefore, to implement the upcoming face-off detection algorithm, the application needs to maintain a state.

The second important component of the architecture is Amazon Kinesis Data Analytics for Apache Flink. Apache Flink is a distributed, high-throughput, low-latency streaming data flow engine that provides a convenient and easy-to-use DataStream API, and it supports stateful processing functions, checkpointing, and parallel processing out of the box. This helps speed up development and provides access to low-level routines and components, which allows for a flexible design and implementation of applications.

Kinesis Data Analytics provides the underlying infrastructure for your Apache Flink applications. It eliminates the need to deploy and configure a Flink cluster on Amazon Elastic Compute Cloud (Amazon EC2) or Kubernetes, which reduces maintenance complexity and costs.

The third crucial component is Amazon SageMaker. Although we used SageMaker to build the model, we also needed to make a decision early in the project: should scoring be implemented inside the face-off detecting application itself, complicating the implementation, or should the face-off detecting application call SageMaker remotely and sacrifice some latency due to communication over the network? To make an informed decision, we performed a series of benchmarks to verify SageMaker latency and scalability, and validated that average latency was less than 100 milliseconds under load, which was within our expectations.
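
A benchmark of this kind can be as simple as timing repeated invocations of the endpoint; the following sketch illustrates the idea, with a hypothetical endpoint name and payload shape.

import json
import time
import boto3

runtime = boto3.client("sagemaker-runtime")
payload = json.dumps({"features": [0.0] * 100})  # placeholder feature vector

latencies = []
for _ in range(1000):
    start = time.perf_counter()
    runtime.invoke_endpoint(
        EndpointName="faceoff-probability-endpoint",  # hypothetical endpoint name
        ContentType="application/json",
        Body=payload,
    )
    latencies.append((time.perf_counter() - start) * 1000)

latencies.sort()
print(f"p50={latencies[len(latencies) // 2]:.1f} ms  p99={latencies[int(len(latencies) * 0.99)]:.1f} ms")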

With the main parts of high-level architecture decided, we started to work on the internal design of the face-off detecting application. A computation model of the application is depicted in the following diagram.

A diagram of the computation model of the face-off detecting application

The compute model of the face-off detecting application can be modeled as a simple finite-state machine, where each incoming message transitions the system from one state to another while performing some computation along with that transition. The application maintains several data structures to keep track of the following:

  • Changes in the game state – The current period number, status and value of the game clock, and score
  • Changes in the player’s state – If the player is currently on the ice or on the bench, the current coordinates on the field, and the current velocity
  • Changes in the player’s personal face-off stats – The success rate of one player vs. another, and so on

The algorithm checks each location update event of a player to decide whether a face-off prediction should be made and whether the result should be submitted to broadcasters. Taking into account that each player’s location is updated roughly every 80 milliseconds and that players move much more slowly during stoppages than during play, we can conclude that the situation between two updates doesn’t change drastically. If the application called SageMaker for predictions and sent predictions to broadcasters every time a new location update event was received and all conditions were satisfied, SageMaker and the broadcasters would be overwhelmed with duplicate requests.

To avoid all this unnecessary noise, the application keeps track of a combination of parameters for which predictions were already made, along with the result of the prediction, and caches them in memory to avoid expensive duplicate requests to SageMaker. Also, it keeps track of what predictions were already sent to broadcasters and makes sure that only new predictions are sent or the previously sent ones are sent again only if necessary. Testing showed that this approach reduces the amount of outgoing traffic by more than 100 times.
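
The production application implements this logic in Apache Flink; the following Python sketch only illustrates the de-duplication idea, with placeholder helpers standing in for the real SageMaker call and the broadcaster feed.

prediction_cache = {}         # face-off situation -> probability already computed
sent_to_broadcasters = set()  # situations whose prediction has already been published

def call_sagemaker(situation):            # placeholder for the real endpoint call
    return 0.5

def publish_to_broadcasters(situation, probability):  # placeholder for the real feed
    print(situation, probability)

def on_location_update(game_state, faceoff_location, home_player, away_player):
    situation = (faceoff_location, home_player, away_player,
                 game_state["period"], game_state["clock"], game_state["score"])

    if situation not in prediction_cache:
        prediction_cache[situation] = call_sagemaker(situation)  # expensive remote call

    if situation not in sent_to_broadcasters:
        publish_to_broadcasters(situation, prediction_cache[situation])
        sent_to_broadcasters.add(situation)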

Another optimization technique that we used was grouping requests to SageMaker and performing them asynchronously in parallel. For example, if we have four new combinations of face-off parameters for which we need to get predictions from SageMaker, we know that each request will take less than 100 milliseconds. If we perform each request synchronously, one by one, the total response time can approach 400 milliseconds. But if we group all four requests, submit them asynchronously, and wait for the result of the entire group before moving forward, we effectively parallelize the requests and the total response time stays under 100 milliseconds, just as for a single request.
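
The actual application does this inside Flink; the sketch below shows the same grouping idea in Python with a thread pool, again using a hypothetical endpoint name.

import json
from concurrent.futures import ThreadPoolExecutor

import boto3

runtime = boto3.client("sagemaker-runtime")

def score_one(feature_vector):
    response = runtime.invoke_endpoint(
        EndpointName="faceoff-probability-endpoint",  # hypothetical endpoint name
        ContentType="application/json",
        Body=json.dumps({"features": feature_vector}),
    )
    return json.loads(response["Body"].read())

def score_batch(feature_vectors):
    # Submit the whole group in parallel and wait for all results, so four
    # predictions cost roughly the same wall-clock time as one.
    with ThreadPoolExecutor(max_workers=len(feature_vectors)) as pool:
        return list(pool.map(score_one, feature_vectors))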

Summary

NHL Edge IQ, powered by AWS, is bringing fans closer to the action with advanced analytics and new ML stats. In this post, we showed insights into the building and deployment of the new Face-off Probability model, the first on-air ML statistic for the NHL. Be sure to keep an eye out for the probabilities generated by Face-off Probability in upcoming NHL games.

To find full examples of building custom training jobs for SageMaker, visit Bring your own training-completed model with SageMaker by building a custom container. For examples of using Amazon Kinesis for streaming, refer to Learning Amazon Kinesis Development.

To learn more about the partnership between AWS and the NHL, visit NHL Innovates with AWS Cloud Services. If you’d like to collaborate with experts to bring ML solutions to your organization, contact the Amazon ML Solutions Lab.


About the Authors

Ryan Gillespie is a Sr. Data Scientist with AWS Professional Services. He has an MSc from Northwestern University and an MBA from the University of Toronto. He has previous experience in the retail and mining industries.

Yash Shah is a Science Manager in the Amazon ML Solutions Lab. He and his team of applied scientists and machine learning engineers work on a range of machine learning use cases across healthcare, sports, automotive, and manufacturing.

Alexander Egorov is a Principal Data Architect, specializing in streaming technologies. He helps organizations to design and build platforms for processing and analyzing streaming data in real time.

Miguel Romero Calvo is an Applied Scientist at the Amazon ML Solutions Lab where he partners with AWS internal teams and strategic customers to accelerate their business through ML and cloud adoption.

Erick Martinez is a Sr. Media Application Architect with over 25 years of experience, focused on media and entertainment. He is experienced in all aspects of the systems development lifecycle, from discovery and requirements gathering through design, implementation, testing, deployment, and operation.

Read More

Redact sensitive data from streaming data in near-real time using Amazon Comprehend and Amazon Kinesis Data Firehose

Near-real-time delivery of data and insights enables businesses to rapidly respond to their customers’ needs. Real-time data can come from a variety of sources, including social media, IoT devices, infrastructure monitoring, call center monitoring, and more. Due to the breadth and depth of data being ingested from multiple sources, businesses look for solutions to protect their customers’ privacy and keep sensitive data from being accessed from end systems. You previously had to rely on personally identifiable information (PII) rules engines that could flag false positives or miss data, or you had to build and maintain custom machine learning (ML) models to identify PII in your streaming data. You also needed to implement and maintain the infrastructure necessary to support these engines or models.

To help streamline this process and reduce costs, you can use Amazon Comprehend, a natural language processing (NLP) service that uses ML to find insights and relationships like people, places, sentiments, and topics in unstructured text. You can now use Amazon Comprehend ML capabilities to detect and redact PII in customer emails, support tickets, product reviews, social media, and more. No ML experience is required. For example, you can analyze support tickets and knowledge articles to detect PII entities and redact the text before you index the documents. After that, documents are free of PII entities and users can consume the data. Redacting PII entities helps you protect your customers’ privacy and comply with local laws and regulations.

In this post, you learn how to implement Amazon Comprehend into your streaming architectures to redact PII entities in near-real time using Amazon Kinesis Data Firehose with AWS Lambda.

This post is focused on redacting data from select fields that are ingested into a streaming architecture using Kinesis Data Firehose, where you want to create, store, and maintain additional derivative copies of the data for consumption by end-users or downstream applications. If you’re using Amazon Kinesis Data Streams or have additional use cases outside of PII redaction, refer to Translate, redact and analyze streaming data using SQL functions with Amazon Kinesis Data Analytics, Amazon Translate, and Amazon Comprehend, where we show how you can use Amazon Kinesis Data Analytics Studio powered by Apache Zeppelin and Apache Flink to interactively analyze, translate, and redact text fields in streaming data.

Solution overview

The following figure shows an example architecture for performing PII redaction of streaming data in real time, using Amazon Simple Storage Service (Amazon S3), Kinesis Data Firehose data transformation, Amazon Comprehend, and AWS Lambda. Additionally, we use the AWS SDK for Python (Boto3) for the Lambda functions. As indicated in the diagram, the S3 raw bucket contains non-redacted data, and the S3 redacted bucket contains redacted data after using the Amazon Comprehend DetectPiiEntities API within a Lambda function.

Costs involved

In addition to Kinesis Data Firehose, Amazon S3, and Lambda costs, this solution will incur usage costs from Amazon Comprehend. The amount you pay is a function of the total number of records, how many of them contain PII, and the number of characters processed by the Lambda function. For more information, refer to Amazon Kinesis Data Firehose pricing, Amazon Comprehend Pricing, and AWS Lambda Pricing.

As an example, let’s assume you have 10,000 logs records, and the key value you want to redact PII from is 500 characters. Out of the 10,000 log records, 50 are identified as containing PII. The cost details are as follows:

Contains PII Cost:

  • Size of each key value = 500 characters (1 unit = 100 characters)
  • Number of units (100 characters) per record (minimum is 3 units) = 5
  • Total units = 10,000 (records) x 5 (units per record) x 1 (Amazon Comprehend requests per record) = 50,000
  • Price per unit = $0.000002
    • Total cost for identifying log records with PII using ContainsPiiEntities API = $0.1 [50,000 units x $0.000002] 

Redact PII Cost:

  • Total units containing PII = 50 (records) x 5 (units per record) x 1 (Amazon Comprehend requests per record) = 250
  • Price per unit = $0.0001
    • Total cost for identifying location of PII using DetectPiiEntities API = [number of units] x [cost per unit] = 250 x $0.0001 = $0.025

Total Cost for identification and redaction:

  • Total cost: $0.1 (validation if field contains PII) + $0.025 (redact fields that contain PII) = $0.125

Deploy the solution with AWS CloudFormation

For this post, we provide an AWS CloudFormation streaming data redaction template, which provides the full details of the implementation to enable repeatable deployments. Upon deployment, this template creates two S3 buckets: one to store the raw sample data ingested from the Amazon Kinesis Data Generator (KDG), and one to store the redacted data. Additionally, it creates a Kinesis Data Firehose delivery stream with DirectPUT as input, and a Lambda function that calls the Amazon Comprehend ContainsPiiEntities and DetectPiiEntities APIs to identify and redact PII data. The Lambda function relies on user input in the environment variables to determine what key values need to be inspected for PII.

The Lambda function in this solution limits payload sizes to 100 KB. If a payload is provided where the text is greater than 100 KB, the Lambda function will skip it.

To deploy the solution, complete the following steps:

  1. Launch the CloudFormation stack in US East (N. Virginia) us-east-1.
  2. Enter a stack name, and leave the other parameters at their defaults.
  3. Select I acknowledge that AWS CloudFormation might create IAM resources with custom names.
  4. Choose Create stack.

Deploy resources manually

If you prefer to build the architecture manually instead of using AWS CloudFormation, complete the steps in this section.

Create the S3 buckets

Create your S3 buckets with the following steps:

  1. On the Amazon S3 console, choose Buckets in the navigation pane.
  2. Choose Create bucket.
  3. Create one bucket for your raw data and one for your redacted data.
  4. Note the names of the buckets you just created.

Create the Lambda function

To create and deploy the Lambda function, complete the following steps:

  1. On the Lambda console, choose Create function.
  2. Choose Author from scratch.
  3. For Function Name, enter AmazonComprehendPII-Redact.
  4. For Runtime, choose Python 3.9.
  5. For Architecture, select x86_64.
  6. For Execution role, select Create a new role with Lambda permissions.
  7. After you create the function, enter the following code:
    import json
    import boto3
    import os
    import base64
    import sys
    
    def lambda_handler(event, context):
        
        output = []
        
        for record in event['records']:
            
            # Gathers keys from environment variables and makes a list of desired keys to check for PII
            rawkeys = os.environ['keys']
            splitkeys = rawkeys.split(", ")
            print(splitkeys)
            #decode base64
            #Kinesis data is base64 encoded so decode here
            payloadraw=base64.b64decode(record["data"]).decode('utf-8')
            #Loads decoded payload into json
            payloadjsonraw = json.loads(payloadraw)
            
            # Creates Comprehend client
            comprehend_client = boto3.client('comprehend')
            
            
            # This code handles the logic to check for keys, identify if PII exists, and redact PII if available.
            for i in payloadjsonraw:
                # checks if the key found in the message matches a redaction key
                if i in splitkeys:
                    print("Redact key found, checking for PII")
                    payload = str(payloadjsonraw[i])
                    # check if payload size is less than 100KB
                    if sys.getsizeof(payload) < 99999:
                        print('Size is less than 100KB checking if value contains PII')
                        # Runs Comprehend ContainsPiiEntities API call to see if key value contains PII
                        pii_identified = comprehend_client.contains_pii_entities(Text=payload, LanguageCode='en')
                        
                        # If PII is not found, skip over key
                        if (pii_identified['Labels']) == []:
                            print('No PII found')
                        else:
                        # if PII is found, run through redaction logic
                            print('PII found redacting')
                            # Runs Comprehend DetectPiiEntities call to find exact location of PII
                            response = comprehend_client.detect_pii_entities(Text=payload, LanguageCode='en')
                            entities = response['Entities']
                            # creates redacted_payload which will be redacted
                            redacted_payload = payload
                            # runs through a loop that gathers necessary values from Comprehend API response and redacts values
                            for entity in entities:
                                char_offset_begin = entity['BeginOffset']
                                char_offset_end = entity['EndOffset']
                                redacted_payload = redacted_payload[:char_offset_begin] + '*'*(char_offset_end-char_offset_begin) + redacted_payload[char_offset_end:]
                            # replaces original value with redacted value
                            payloadjsonraw[i] = redacted_payload
                            print(str(payloadjsonraw[i]))
                    else:
                        print ('Size is more than 100KB, skipping inspection')
                else:
                    print("Key value not found in redaction list")
            
            redacteddata = json.dumps(payloadjsonraw)
            
            # adds inspected record to output
            output_record = {
                'recordId': record['recordId'],
                'result': 'Ok',
                'data' : base64.b64encode(redacteddata.encode('utf-8'))
            }
            output.append(output_record)
            print(output_record)
            
        print('Successfully processed {} records.'.format(len(event['records'])))
        
        return {'records': output}

  8. Choose Deploy.
  9. In the navigation pane, choose Configuration.
  10. Navigate to Environment variables.
  11. Choose Edit.
  12. For Key, enter keys.
  13. For Value, enter the key values you want to redact PII from, separated by a comma and space. For example, enter Tweet1, Tweet2 if you’re using the sample test data provided in the next section of this post.
  14. Choose Save.
  15. Navigate to General configuration.
  16. Choose Edit.
  17. Change the value of Timeout to 1 minute.
  18. Choose Save.
  19. Navigate to Permissions.
  20. Choose the role name under Execution Role.
    You’re redirected to the AWS Identity and Access Management (IAM) console.
  21. For Add permissions, choose Attach policies.
  22. Enter Comprehend into the search bar and choose the policy ComprehendFullAccess.
  23. Choose Attach policies.

Create the Firehose delivery stream

To create your Firehose delivery stream on the console, complete the following steps (a boto3 sketch of the equivalent configuration follows the list):

  1. On the Kinesis Data Firehose console, choose Create delivery stream.
  2. For Source, select Direct PUT.
  3. For Destination, select Amazon S3.
  4. For Delivery stream name, enter ComprehendRealTimeBlog.
  5. Under Transform source records with AWS Lambda, select Enabled.
  6. For AWS Lambda function, enter the ARN for the function you created, or browse to the function AmazonComprehendPII-Redact.
  7. For Buffer Size, set the value to 1 MB.
  8. For Buffer Interval, leave it as 60 seconds.
  9. Under Destination Settings, select the S3 bucket you created for the redacted data.
  10. Under Backup Settings, select the S3 bucket that you created for the raw records.
  11. Under Permission, either create or update an IAM role, or choose an existing role with the proper permissions.
  12. Choose Create delivery stream.
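
If you prefer to script this step, the following boto3 sketch creates an equivalent delivery stream. The account ID, role ARN, and bucket names are placeholders you would replace with your own.

import boto3

firehose = boto3.client("firehose")

firehose.create_delivery_stream(
    DeliveryStreamName="ComprehendRealTimeBlog",
    DeliveryStreamType="DirectPut",
    ExtendedS3DestinationConfiguration={
        "RoleARN": "arn:aws:iam::111122223333:role/firehose-delivery-role",  # placeholder
        "BucketARN": "arn:aws:s3:::my-redacted-bucket",                      # placeholder
        "BufferingHints": {"SizeInMBs": 1, "IntervalInSeconds": 60},
        "ProcessingConfiguration": {
            "Enabled": True,
            "Processors": [{
                "Type": "Lambda",
                "Parameters": [{
                    "ParameterName": "LambdaArn",
                    "ParameterValue": "arn:aws:lambda:us-east-1:111122223333:function:AmazonComprehendPII-Redact",  # placeholder
                }],
            }],
        },
        # Back up the raw (non-redacted) records to the second bucket.
        "S3BackupMode": "Enabled",
        "S3BackupConfiguration": {
            "RoleARN": "arn:aws:iam::111122223333:role/firehose-delivery-role",  # placeholder
            "BucketARN": "arn:aws:s3:::my-raw-bucket",                           # placeholder
        },
    },
)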

Deploy the streaming data solution with the Kinesis Data Generator

You can use the Kinesis Data Generator (KDG) to ingest sample data to Kinesis Data Firehose and test the solution. To simplify this process, we provide a Lambda function and CloudFormation template to create an Amazon Cognito user and assign appropriate permissions to use the KDG.

  1. On the Amazon Kinesis Data Generator page, choose Create a Cognito User with CloudFormation. You’re redirected to the AWS CloudFormation console to create your stack.
  2. Provide a user name and password for the user with which you log in to the KDG.
  3. Leave the other settings at their defaults and create your stack.
  4. On the Outputs tab, choose the KDG UI link.
  5. Enter your user name and password to log in.

Send test records and validate redaction in Amazon S3

To test the solution, complete the following steps:

  1. Log in to the KDG URL you created in the previous step.
  2. Choose the Region where the AWS CloudFormation stack was deployed.
  3. For Stream/delivery stream, choose the delivery stream you created (if you used the template, it has the format accountnumber-awscomprehend-blog).
  4. Leave the other settings at their defaults.
  5. For the record template, you can create your own tests or use the following template. If you’re using the provided sample data for testing, make sure the keys environment variable in the AmazonComprehendPII-Redact Lambda function is set to Tweet1, Tweet2; if you deployed via CloudFormation, update the environment variable in the created Lambda function the same way. The sample test data follows:
    {"User":"12345", "Tweet1":" Good morning, everybody. My name is Van Bokhorst Serdar, and today I feel like sharing a whole lot of personal information with you. Let's start with my Email address SerdarvanBokhorst@dayrep.com. My address is 2657 Koontz Lane, Los Angeles, CA. My phone number is 818-828-6231.", "Tweet2": "My Social security number is 548-95-6370. My Bank account number is 940517528812 and routing number 195991012. My credit card number is 5534816011668430, Expiration Date 6/1/2022, my C V V code is 121, and my pin 123456. Well, I think that's it. You know a whole lot about me. And I hope that Amazon comprehend is doing a good job at identifying PII entities so you can redact my personal information away from this streaming record. Let's check"}

  6. Choose Send Data, and allow a few seconds for records to be sent to your stream.
  7. After a few seconds, stop the KDG generator and check your S3 buckets for the delivered files.

The following is an example of the raw data in the raw S3 bucket:

{"User":"12345", "Tweet1":" Good morning, everybody. My name is Van Bokhorst Serdar, and today I feel like sharing a whole lot of personal information with you. Let's start with my Email address SerdarvanBokhorst@dayrep.com. My address is 2657 Koontz Lane, Los Angeles, CA. My phone number is 818-828-6231.", "Tweet2": "My Social security number is 548-95-6370. My Bank account number is 940517528812 and routing number 195991012. My credit card number is 5534816011668430, Expiration Date 6/1/2022, my C V V code is 121, and my pin 123456. Well, I think that's it. You know a whole lot about me. And I hope that Amazon comprehend is doing a good job at identifying PII entities so you can redact my personal information away from this streaming record. Let's check"}

The following is an example of the redacted data in the redacted S3 bucket:

{"User":"12345", "Tweet1":"Good morning, everybody. My name is *******************, and today I feel like sharing a whole lot of personal information with you. Let's start with my Email address ****************************. My address is ********************************** My phone number is ************.", "Tweet"2: "My Social security number is ***********. My Bank account number is ************ and routing number *********. My credit card number is ****************, Expiration Date ********, my C V V code is ***, and my pin ******. Well, I think that's it. You know a whole lot about me. And I hope that Amazon comprehend is doing a good job at identifying PII entities so you can redact my personal information away from this streaming record. Let's check"}

The sensitive information has been removed from the redacted messages, providing confidence that you can share this data with end systems.

Cleanup

When you’re finished experimenting with this solution, clean up your resources by using the AWS CloudFormation console to delete all the resources deployed in this example. If you followed the manual steps, you will need to manually delete the two buckets, the AmazonComprehendPII-Redact function, the ComprehendRealTimeBlog stream, the log group for the ComprehendRealTimeBlog stream, and any IAM roles that were created.

Conclusion

This post showed you how to integrate PII redaction into your near-real-time streaming architecture and reduce data processing time by performing redaction in flight. In this scenario, you provide the redacted data to your end-users and a data lake administrator secures the raw bucket for later use. You could also build additional processing with Amazon Comprehend to identify tone or sentiment, identify entities within the data, and classify each message.

We provided individual steps for each service as part of this post, and also included a CloudFormation template that allows you to provision the required resources in your account. This template should be used for proof of concept or testing scenarios only. Refer to the developer guides for Amazon Comprehend, Lambda, and Kinesis Data Firehose for any service limits.

To get started with PII identification and redaction, see Personally identifiable information (PII). With the example architecture in this post, you could integrate any of the Amazon Comprehend APIs with near-real-time data using Kinesis Data Firehose data transformation. To learn more about what you can build with your near-real-time data with Kinesis Data Firehose, refer to the Amazon Kinesis Data Firehose Developer Guide. This solution is available in all AWS Regions where Amazon Comprehend and Kinesis Data Firehose are available.


About the authors

Joe Morotti is a Solutions Architect at Amazon Web Services (AWS), helping enterprise customers across the Midwest US. He has held a wide range of technical roles and enjoys showing customers the art of the possible. In his free time, he enjoys spending quality time with his family, exploring new places, and overanalyzing his sports team’s performance.

Sriharsh Adari is a Senior Solutions Architect at Amazon Web Services (AWS), where he helps customers work backwards from business outcomes to develop innovative solutions on AWS. Over the years, he has helped multiple customers on data platform transformations across industry verticals. His core areas of expertise include technology strategy, data analytics, and data science. In his spare time, he enjoys playing tennis, binge-watching TV shows, and playing the tabla.

Read More

Reduce cost and development time with Amazon SageMaker Pipelines local mode

Creating robust and reusable machine learning (ML) pipelines can be a complex and time-consuming process. Developers usually test their processing and training scripts locally, but the pipelines themselves are typically tested in the cloud. Creating and running a full pipeline during experimentation adds unwanted overhead and cost to the development lifecycle. In this post, we detail how you can use Amazon SageMaker Pipelines local mode to run ML pipelines locally to reduce both pipeline development and run time while reducing cost. After the pipeline has been fully tested locally, you can easily rerun it with Amazon SageMaker managed resources with just a few lines of code changes.

Overview of the ML lifecycle

One of the main drivers for new innovations and applications in ML is the availability and amount of data, along with cheaper compute options. In several domains, ML has proven capable of solving problems previously unsolvable with classical big data and analytical techniques, and the demand for data science and ML practitioners is increasing steadily. At a very high level, the ML lifecycle consists of many different parts, but building an ML model usually involves the following general steps:

  1. Data cleansing and preparation (feature engineering)
  2. Model training and tuning
  3. Model evaluation
  4. Model deployment (or batch transform)

In the data preparation step, data is loaded, massaged, and transformed into the type of inputs, or features, the ML model expects. Writing the scripts to transform the data is typically an iterative process, where fast feedback loops are important to speed up development. It’s normally not necessary to use the full dataset when testing feature engineering scripts, which is why you can use the local mode feature of SageMaker Processing. This allows you to run locally and update the code iteratively, using a smaller dataset. When the final code is ready, it’s submitted to the remote processing job, which uses the complete dataset and runs on SageMaker managed instances.

The development process is similar to the data preparation step for both model training and model evaluation steps. Data scientists use the local mode feature of SageMaker Training to iterate quickly with smaller datasets locally, before using all the data in a SageMaker managed cluster of ML-optimized instances. This speeds up the development process and eliminates the cost of running ML instances managed by SageMaker while experimenting.

As an organization’s ML maturity increases, you can use Amazon SageMaker Pipelines to create ML pipelines that stitch together these steps, creating more complex ML workflows that process, train, and evaluate ML models. SageMaker Pipelines is a fully managed service for automating the different steps of the ML workflow, including data loading, data transformation, model training and tuning, and model deployment. Until recently, you could develop and test your scripts locally but had to test your ML pipelines in the cloud. This made iterating on the flow and form of ML pipelines a slow and costly process. Now, with the added local mode feature of SageMaker Pipelines, you can iterate and test your ML pipelines similarly to how you test and iterate on your processing and training scripts. You can run and test your pipelines on your local machine, using a small subset of data to validate the pipeline syntax and functionalities.

SageMaker Pipelines

SageMaker Pipelines provides a fully automated way to run simple or complex ML workflows. With SageMaker Pipelines, you can create ML workflows with an easy-to-use Python SDK, and then visualize and manage your workflow using Amazon SageMaker Studio. Your data science teams can be more efficient and scale faster by storing and reusing the workflow steps you create in SageMaker Pipelines. You can also use pre-built templates that automate the infrastructure and repository creation to build, test, register, and deploy models within your ML environment. These templates are automatically available to your organization, and are provisioned using AWS Service Catalog products.

SageMaker Pipelines brings continuous integration and continuous deployment (CI/CD) practices to ML, such as maintaining parity between development and production environments, version control, on-demand testing, and end-to-end automation, which helps you scale ML throughout your organization. DevOps practitioners know that some of the main benefits of using CI/CD techniques include an increase in productivity via reusable components and an increase in quality through automated testing, which leads to faster ROI for your business objectives. These benefits are now available to MLOps practitioners by using SageMaker Pipelines to automate the training, testing, and deployment of ML models. With local mode, you can now iterate much more quickly while developing scripts for use in a pipeline. Note that local pipeline instances can’t be viewed or run within the Studio IDE; however, additional viewing options for local pipelines will be available soon.

The SageMaker SDK provides a general purpose local mode configuration that allows developers to run and test supported processors and estimators in their local environment. You can use local mode training with multiple AWS-supported framework images (TensorFlow, MXNet, Chainer, PyTorch, and Scikit-Learn) as well as images you supply yourself.

SageMaker Pipelines, which builds a Directed Acyclic Graph (DAG) of orchestrated workflow steps, supports many activities that are part of the ML lifecycle. In local mode, the following steps are supported:

  • Processing job steps – A simplified, managed experience on SageMaker to run data processing workloads, such as feature engineering, data validation, model evaluation, and model interpretation
  • Training job steps – An iterative process that teaches a model to make predictions by presenting examples from a training dataset
  • Hyperparameter tuning jobs – An automated way to evaluate and select the hyperparameters that produce the most accurate model
  • Conditional run steps – A step that provides a conditional run of branches in a pipeline
  • Model step – Using CreateModel arguments, this step can create a model for use in transform steps or later deployment as an endpoint
  • Transform job steps – A batch transform job that generates predictions from large datasets, and runs inference when a persistent endpoint isn’t needed
  • Fail steps – A step that stops a pipeline run and marks the run as failed

Solution overview

Our solution demonstrates the essential steps to create and run SageMaker Pipelines in local mode, which means using local CPU, RAM, and disk resources to load and run the workflow steps. Your local environment could be running on a laptop, using popular IDEs like VSCode or PyCharm, or it could be hosted by SageMaker using classic notebook instances.

Local mode allows data scientists to stitch together steps, which can include processing, training, and evaluation jobs, and run the entire workflow locally. When you’re done testing locally, you can rerun the pipeline in a SageMaker managed environment by replacing the LocalPipelineSession object with PipelineSession, which brings consistency to the ML lifecycle.

For this notebook sample, we use a standard publicly available dataset, the UCI Machine Learning Abalone Dataset. The goal is to train an ML model to determine the age of an abalone snail from its physical measurements. At the core, this is a regression problem.

All of the code required to run this notebook sample is available on GitHub in the amazon-sagemaker-examples repository. In this notebook sample, each pipeline workflow step is created independently and then wired together to create the pipeline. We create the following steps:

  • Processing step (feature engineering)
  • Training step (model training)
  • Processing step (model evaluation)
  • Condition step (model accuracy)
  • Create model step (model)
  • Transform step (batch transform)
  • Register model step (model package)
  • Fail step (run failed)

The following diagram illustrates our pipeline.

Prerequisites

To follow along in this post, you need the following:

After these prerequisites are in place, you can run the sample notebook as described in the following sections.

Build your pipeline

In this notebook sample, we use SageMaker Script Mode for most of the ML processes, which means that we provide the actual Python code (scripts) to perform the activity and pass a reference to this code. Script Mode provides great flexibility to control the behavior within the SageMaker processing by allowing you to customize your code while still taking advantage of SageMaker pre-built containers like XGBoost or Scikit-Learn. The custom code is written to a Python script file using cells that begin with the magic command %%writefile, like the following:

%%writefile code/evaluation.py

The primary enabler of local mode is the LocalPipelineSession object, which is instantiated from the Python SDK. The following code segments show how to create a SageMaker pipeline in local mode. Although you can configure a local data path for many of the local pipeline steps, Amazon S3 is the default location to store the data output by the transformation. The new LocalPipelineSession object is passed to the Python SDK in many of the SageMaker workflow API calls described in this post. Notice that you can use the local_pipeline_session variable to retrieve references to the S3 default bucket and the current Region name.

from sagemaker.workflow.pipeline_context import LocalPipelineSession

# Create a `LocalPipelineSession` object so that each 
# pipeline step will run locally
# To run this pipeline in the cloud, you must change 
# the `LocalPipelineSession()` to `PipelineSession()`
local_pipeline_session = LocalPipelineSession()
region = local_pipeline_session.boto_region_name

default_bucket = local_pipeline_session.default_bucket()
prefix = "sagemaker-pipelines-local-mode-example"

Before we create the individual pipeline steps, we set some parameters used by the pipeline. Some of these parameters are string literals, whereas others are created as special enumerated types provided by the SDK. The enumerated typing ensures that valid settings are provided to the pipeline, such as this one, which is passed to the ConditionLessThanOrEqualTo step further down:

mse_threshold = ParameterFloat(name="MseThreshold", default_value=7.0)
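
The other names used by later cells (instance_type, the instance counts, and the input_data parameter) are defined near the top of the sample notebook; the following is a minimal sketch of how they might look, with illustrative values and a placeholder S3 path.

from sagemaker.workflow.parameters import ParameterString

# Plain Python values consumed by the step definitions below; only input_data and
# mse_threshold are registered as pipeline parameters. Values are illustrative.
instance_type = "ml.m5.xlarge"
processing_instance_count = 1
training_instance_count = 1
transform_instance_count = 1

input_data = ParameterString(
    name="InputData",
    default_value=f"s3://{default_bucket}/{prefix}/data/abalone-dataset.csv",  # placeholder path
)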

To create a data processing step, which is used here to perform feature engineering, we use the SKLearnProcessor to load and transform the dataset. We pass the local_pipeline_session variable to the class constructor, which instructs the workflow step to run in local mode:

from sagemaker.sklearn.processing import SKLearnProcessor

framework_version = "1.0-1"

sklearn_processor = SKLearnProcessor(
    framework_version=framework_version,
    instance_type=instance_type,
    instance_count=processing_instance_count,
    base_job_name="sklearn-abalone-process",
    role=role,
    sagemaker_session=local_pipeline_session,
)

Next, we create our first actual pipeline step, a ProcessingStep object, as imported from the SageMaker SDK. The processor arguments are returned from a call to the SKLearnProcessor run() method. This workflow step is combined with other steps towards the end of the notebook to indicate the order of operation within the pipeline.

from sagemaker.processing import ProcessingInput, ProcessingOutput
from sagemaker.workflow.steps import ProcessingStep

processor_args = sklearn_processor.run(
    inputs=[
        ProcessingInput(source=input_data, destination="/opt/ml/processing/input"),
    ],
    outputs=[
        ProcessingOutput(output_name="train", source="/opt/ml/processing/train"),
        ProcessingOutput(output_name="validation", source="/opt/ml/processing/validation"),
        ProcessingOutput(output_name="test", source="/opt/ml/processing/test"),
    ],
    code="code/preprocessing.py",
)

step_process = ProcessingStep(name="AbaloneProcess", step_args=processor_args)

Next, we provide code to establish a training step by first instantiating a standard estimator using the SageMaker SDK. We pass the same local_pipeline_session variable to the estimator, named xgb_train, as the sagemaker_session argument. Because we want to train an XGBoost model, we must generate a valid image URI by specifying the following parameters, including the framework and several version parameters:

from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput

model_path = f"s3://{default_bucket}/{prefix}/model"
image_uri = sagemaker.image_uris.retrieve(
    framework="xgboost",
    region=region,
    version="1.5-1",
    py_version="py3",
    instance_type=instance_type,
)

xgb_train = Estimator(
    image_uri=image_uri,
    entry_point="code/abalone.py",
    instance_type=instance_type,
    instance_count=training_instance_count,
    output_path=model_path,
    role=role,
    sagemaker_session=local_pipeline_session,
)

We can optionally call additional estimator methods, for example set_hyperparameters(), to provide hyperparameter settings for the training job. Now that we have an estimator configured, we’re ready to create the actual training step. Once again, we import the TrainingStep class from the SageMaker SDK library:

from sagemaker.workflow.steps import TrainingStep

step_train = TrainingStep(name="AbaloneTrain", step_args=train_args)
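
The train_args value referenced above is returned by calling the estimator’s fit() method, which the sample notebook does before creating the TrainingStep. A minimal sketch, assuming the processing step outputs are wired in as the training and validation channels:

train_args = xgb_train.fit(
    inputs={
        "train": TrainingInput(
            s3_data=step_process.properties.ProcessingOutputConfig.Outputs["train"].S3Output.S3Uri,
            content_type="text/csv",
        ),
        "validation": TrainingInput(
            s3_data=step_process.properties.ProcessingOutputConfig.Outputs["validation"].S3Output.S3Uri,
            content_type="text/csv",
        ),
    }
)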

Next, we build another processing step to perform model evaluation. This is done by creating a ScriptProcessor instance and passing the local_pipeline_session object as a parameter:

from sagemaker.processing import ScriptProcessor

script_eval = ScriptProcessor(
    image_uri=image_uri,
    command=["python3"],
    instance_type=instance_type,
    instance_count=processing_instance_count,
    base_job_name="script-abalone-eval",
    role=role,
    sagemaker_session=local_pipeline_session,
)
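
The evaluation step (step_eval) and the evaluation_report property file referenced later by the condition step are built from this processor. The following is a minimal sketch following the same pattern as the earlier processing step; the input wiring and output names are assumptions based on the notebook’s structure:

from sagemaker.workflow.properties import PropertyFile

evaluation_report = PropertyFile(
    name="EvaluationReport", output_name="evaluation", path="evaluation.json"
)

eval_args = script_eval.run(
    inputs=[
        ProcessingInput(
            source=step_train.properties.ModelArtifacts.S3ModelArtifacts,
            destination="/opt/ml/processing/model",
        ),
        ProcessingInput(
            source=step_process.properties.ProcessingOutputConfig.Outputs["test"].S3Output.S3Uri,
            destination="/opt/ml/processing/test",
        ),
    ],
    outputs=[
        ProcessingOutput(output_name="evaluation", source="/opt/ml/processing/evaluation"),
    ],
    code="code/evaluation.py",
)

step_eval = ProcessingStep(
    name="AbaloneEval", step_args=eval_args, property_files=[evaluation_report]
)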

To enable deployment of the trained model, either to a SageMaker real-time endpoint or to a batch transform, we need to create a Model object by passing the model artifacts, the proper image URI, and optionally our custom inference code. We then pass this Model object to a ModelStep, which is added to the local pipeline. See the following code:

from sagemaker.model import Model

model = Model(
    image_uri=image_uri,
    model_data=step_train.properties.ModelArtifacts.S3ModelArtifacts,
    source_dir="code",
    entry_point="inference.py",
    role=role,
    sagemaker_session=local_pipeline_session,
)

from sagemaker.workflow.model_step import ModelStep

step_create_model = ModelStep(name="AbaloneCreateModel", 
    step_args=model.create(instance_type=instance_type)
)

Next, we create a batch transform step where we submit a set of feature vectors and perform inference. We first need to create a Transformer object and pass the local_pipeline_session parameter to it. Then we create a TransformStep, passing the required arguments, and add this to the pipeline definition:

from sagemaker.transformer import Transformer

transformer = Transformer(
    model_name=step_create_model.properties.ModelName,
    instance_type=instance_type,
    instance_count=transform_instance_count,
    output_path=f"s3://{default_bucket}/{prefix}/transform",
    sagemaker_session=local_pipeline_session,
)

from sagemaker.workflow.steps import TransformStep

transform_args = transformer.transform(transform_data, content_type="text/csv")

step_transform = TransformStep(name="AbaloneTransform", step_args=transform_args)

Finally, we want to add a branch condition to the workflow so that we only run batch transform if the results of model evaluation meet our criteria. We can indicate this conditional by adding a ConditionStep with a particular condition type, like ConditionLessThanOrEqualTo. We then enumerate the steps for the two branches, essentially defining the if/else or true/false branches of the pipeline. The if_steps provided in the ConditionStep (step_create_model, step_transform) are run whenever the condition evaluates to True.

from sagemaker.workflow.conditions import ConditionLessThanOrEqualTo
from sagemaker.workflow.condition_step import ConditionStep
from sagemaker.workflow.functions import JsonGet

cond_lte = ConditionLessThanOrEqualTo(
    left=JsonGet(
        step_name=step_eval.name,
        property_file=evaluation_report,
        json_path="regression_metrics.mse.value",),
    right=mse_threshold,
)

step_cond = ConditionStep(
    name="AbaloneMSECond",
    conditions=[cond_lte],
    if_steps=[step_create_model, step_transform],
    else_steps=[step_fail],
)
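
The step_fail step referenced in else_steps is a FailStep, which stops the pipeline run and marks it as failed when the MSE threshold isn’t met. A minimal sketch (the step name and message are assumptions):

from sagemaker.workflow.fail_step import FailStep
from sagemaker.workflow.functions import Join

step_fail = FailStep(
    name="AbaloneMSEFail",
    error_message=Join(on=" ", values=["Execution failed because MSE >", mse_threshold]),
)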

The following diagram illustrates this conditional branch and the associated if/else steps. Only one branch is run, based on the outcome of the model evaluation step as compared in the condition step.

Now that we have all our steps defined, and the underlying class instances created, we can combine them into a pipeline. We provide some parameters, and crucially define the order of operation by simply listing the steps in the desired order. Note that the TransformStep isn’t shown here because it’s the target of the conditional step, and was provided as a step argument to the ConditionStep earlier.

from sagemaker.workflow.pipeline import Pipeline

pipeline_name = f"LocalModelPipeline"
pipeline = Pipeline(
    name=pipeline_name,
    parameters=[
        input_data,
        mse_threshold,
    ],
    steps=[step_process, step_train, step_eval, step_cond],
    sagemaker_session=local_pipeline_session,
)

To run the pipeline, you must call two methods: pipeline.upsert(), which uploads the pipeline to the underlying service, and pipeline.start(), which starts running the pipeline. You can use various other methods to interrogate the run status, list the pipeline steps, and more. Because we used the local mode pipeline session, these steps are all run locally on your processor. The cell output beneath the start method shows the output from the pipeline:

pipeline.upsert(role_arn=role)
execution = pipeline.start()

You should see a message at the bottom of the cell output similar to the following:

Pipeline execution d8c3e172-089e-4e7a-ad6d-6d76caf987b7 SUCCEEDED
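
A minimal sketch of the kind of calls available on the execution object, assuming it exposes the describe() and list_steps() methods of the standard pipeline execution interface:

# Overall status of the run
print(execution.describe()["PipelineExecutionStatus"])

# Per-step status
for step in execution.list_steps():
    print(step["StepName"], step["StepStatus"])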

Revert to managed resources

After we’ve confirmed that the pipeline runs without errors and we’re satisfied with the flow and form of the pipeline, we can recreate the pipeline but with SageMaker managed resources and rerun it. The only change required is to use the PipelineSession object instead of LocalPipelineSession:

from sagemaker.workflow.pipeline_context import LocalPipelineSession
from sagemaker.workflow.pipeline_context import PipelineSession

local_pipeline_session = LocalPipelineSession()
pipeline_session = PipelineSession()

This informs the service to run each step that references this session object on SageMaker managed resources. Because the change is small, we illustrate only the required code change in the following cell, but the same substitution of the local_pipeline_session object with the pipeline_session object needs to be applied to every cell that uses it.

from sagemaker.sklearn.processing import SKLearnProcessor

framework_version = "1.0-1"

sklearn_processor = SKLearnProcessor(
    framework_version=framework_version,
    instance_type=instance_type,
    instance_count=processing_instance_count,
    base_job_name="sklearn-abalone-process",
    role=role,
    sagemaker_session=pipeline_session,  # non-local session
)

After the local session object has been replaced everywhere, we recreate the pipeline and run it with SageMaker managed resources:

from sagemaker.workflow.pipeline import Pipeline

pipeline_name = "LocalModelPipeline"
pipeline = Pipeline(
    name=pipeline_name,
    parameters=[
        input_data,
        mse_threshold,
    ],
    steps=[step_process, step_train, step_eval, step_cond],
    sagemaker_session=pipeline_session, # non-local session
)

pipeline.upsert(role_arn=role)
execution = pipeline.start()

Clean up

If you want to keep the Studio environment tidy, you can use the following methods to delete the SageMaker pipeline and the model. The full code can be found in the sample notebook.

import boto3

# Delete models
sm_client = boto3.client("sagemaker")
model_prefix = "AbaloneCreateModel"
delete_models(sm_client, model_prefix)

# delete managed pipeline
pipeline_to_delete = 'SM-Managed-Pipeline'
delete_sagemaker_pipeline(sm_client, pipeline_to_delete)
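
The two helper functions are defined in the sample notebook. As a rough sketch of what equivalent logic looks like with the boto3 SageMaker client (the notebook implementation may differ):

def delete_models(sm_client, prefix):
    """Delete every SageMaker model whose name contains the given prefix (sketch)."""
    for model in sm_client.list_models(NameContains=prefix)["Models"]:
        sm_client.delete_model(ModelName=model["ModelName"])

def delete_sagemaker_pipeline(sm_client, pipeline_name):
    """Delete a SageMaker pipeline by name (sketch)."""
    sm_client.delete_pipeline(PipelineName=pipeline_name)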

Conclusion

Until recently, you could use the local mode feature of SageMaker Processing and SageMaker Training to iterate on your processing and training scripts locally, before running them on all the data with SageMaker managed resources. With the new local mode feature of SageMaker Pipelines, ML practitioners can now apply the same method when iterating on their ML pipelines, stitching the different ML workflows together. When the pipeline is ready for production, running it with SageMaker managed resources requires just a few lines of code changes. This shortens pipeline run time during development, leading to faster development cycles, while reducing the cost of SageMaker managed resources.

To learn more, visit Amazon SageMaker Pipelines or Use SageMaker Pipelines to Run Your Jobs Locally.


About the authors

Paul Hargis has focused his efforts on machine learning at several companies, including AWS, Amazon, and Hortonworks. He enjoys building technology solutions and teaching people how to make the most of it. Prior to his role at AWS, he was lead architect for Amazon Exports and Expansions, helping amazon.com improve the experience for international shoppers. Paul likes to help customers expand their machine learning initiatives to solve real-world problems.

Niklas Palm is a Solutions Architect at AWS in Stockholm, Sweden, where he helps customers across the Nordics succeed in the cloud. He’s particularly passionate about serverless technologies along with IoT and machine learning. Outside of work, Niklas is an avid cross-country skier and snowboarder as well as a master egg boiler.

Kirit Thadaka is an ML Solutions Architect working in the SageMaker Service SA team. Prior to joining AWS, Kirit worked in early-stage AI startups followed by some time consulting in various roles in AI research, MLOps, and technical leadership.

Read More

Create high-quality data for ML models with Amazon SageMaker Ground Truth

Machine learning (ML) has improved business across industries in recent years—from the recommendation system on your Prime Video account, to document summarization and efficient search with Alexa’s voice assistance. However, the question remains of how to incorporate this technology into your business. Unlike traditional rule-based methods, ML automatically infers patterns from data so as to perform your task of interest. Although this bypasses the need to curate rules for automation, it also means that ML models can only be as good as the data on which they’re trained. However, data creation is often a challenging task. At the Amazon Machine Learning Solutions Lab, we’ve repeatedly encountered this problem and want to ease this journey for our customers. If you want to offload this process, you can use Amazon SageMaker Ground Truth Plus.

By the end of this post, you’ll be able to achieve the following:

  • Understand the business processes involved in setting up a data acquisition pipeline
  • Identify AWS Cloud services for supporting and expediting your data labeling pipeline
  • Run a data acquisition and labeling task for custom use cases
  • Create high-quality data following business and technical best practices

Throughout this post, we focus on the data creation process and rely on AWS services to handle the infrastructure and process components. Namely, we use Amazon SageMaker Ground Truth to handle the labeling infrastructure pipeline and user interface. This service uses a point-and-go approach to collect your data from Amazon Simple Storage Service (Amazon S3) and set up a labeling workflow. For labeling, it provides you with the built-in flexibility to acquire data labels using your private team, an Amazon Mechanical Turk workforce, or your preferred labeling vendor from AWS Marketplace. Lastly, you can use AWS Lambda and Amazon SageMaker notebooks to process, visualize, or quality control the data—either pre- or post-labeling.

Now that all of the pieces have been laid down, let’s start the process!

The data creation process

Contrary to common intuition, the first step for data creation is not data collection. Working backward from the users to articulate the problem is crucial. For example, what do users care about in the final artifact? Where do experts believe the signals relevant to the use case reside in the data? What information about the use case environment could be provided to the model? If you don’t know the answers to those questions, don’t worry. Give yourself some time to talk with users and field experts to understand the nuances. This initial understanding will orient you in the right direction and set you up for success.

For this post, we assume that you have covered this initial process of user requirement specification. The next three sections walk you through the subsequent process of creating quality data: planning, source data creation, and data annotation. Piloting loops at the data creation and annotation steps are vital for ensuring the efficient creation of labeled data. This involves iterating between data creation, annotation, quality assurance, and updating the pipeline as necessary.

The following figure provides an overview of the steps required in a typical data creation pipeline. You can work backward from the use case to identify the data that you need (Requirements Specification), build a process to obtain the data (Planning), implement the actual data acquisition process (Data Collection and Annotation), and assess the results. Pilot runs, highlighted with dashed lines, let you iterate on the process until a high-quality data acquisition pipeline has been developed.

Overview of steps required in a typical data creation pipeline.

Planning

A standard data creation process can be time-consuming and a waste of valuable human resources if conducted inefficiently. Why would it be time-consuming? To answer this question, we must understand the scope of the data creation process. To assist you, we have collected a high-level checklist and description of key components and stakeholders that you must consider. Answering these questions can be difficult at first. Depending on your use case, only some of these may be applicable.

  • Identify the legal point of contact for required approvals – Using data for your application can require license or vendor contract review to ensure compliance with company policies and use cases. It’s important to identify your legal support throughout the data acquisition and annotation steps of the process.
  • Identify the security point of contact for data handling – Leakage of purchased data might result in serious fines and repercussions for your company. It’s important to identify your security support throughout the data acquisition and annotation steps to ensure secure practices.
  • Detail use case requirements and define source data and annotation guidelines – Creating and annotating data is difficult due to the high specificity required. Stakeholders, including data generators and annotators, must be completely aligned to avoid wasting resources. To this end, it’s common practice to use a guidelines document that specifies every aspect of the annotation task: exact instructions, edge cases, an example walkthrough, and so on.
  • Align on expectations for collecting your source data – Consider the following:

    • Conduct research on potential data sources – For example, public datasets, existing datasets from other internal teams, self-collected, or purchased data from vendors.
    • Perform quality assessment – Create an analysis pipeline with relation to the final use case.
  • Align on expectations for creating data annotations – Consider the following:

    • Identify the technical stakeholders – This is usually an individual or team in your company capable of using the technical documentation regarding Ground Truth to implement an annotation pipeline. These stakeholders are also responsible for quality assessment of the annotated data to make sure that it meets the needs of your downstream ML application.
    • Identify the data annotators – These individuals use predetermined instructions to add labels to your source data within Ground Truth. They may need to possess domain knowledge depending on your use case and annotation guidelines. You can use a workforce internal to your company, or pay for a workforce managed by an external vendor.
  • Ensure oversight of the data creation process – As you can see from the preceding points, data creation is a detailed process that involves numerous specialized stakeholders. Therefore, it’s crucial to monitor it end to end toward the desired outcome. Having a dedicated person or team oversee the process can help you ensure a cohesive, efficient data creation process.

Depending on the route that you decide to take, you must also consider the following:

  • Create the source dataset – This refers to instances when existing data isn’t suitable for the task at hand, or legal constraints prevent you from using it. In that case, you must create the data yourself, using internal teams or external vendors (next point). This is often the case for highly specialized domains or areas with little public research, for example, a physician’s common questions, garment lay down, or sports expertise.
  • Research vendors and conduct an onboarding process – When external vendors are used, a contracting and onboarding process must be set in place between both entities.

In this section, we reviewed the components and stakeholders that we must consider. However, what does the actual process look like? In the following figure, we outline a process workflow for data creation and annotation. The iterative approach uses small batches of data called pilots to decrease turnaround time, detect errors early on, and avoid wasting resources in the creation of low-quality data. We describe these pilot rounds later in this post. We also cover some best practices for data creation, annotation, and quality control.

The following figure illustrates the iterative development of a data creation pipeline. Vertically, we find the data sourcing block (green) and the annotation block (blue). Both blocks have independent pilot rounds (Data creation/Annotation, QAQC, and Update). Successive pilots produce higher-quality source data, which in turn supports increasingly higher-quality annotations.

During the iterative development of a data creation or annotation pipeline, small batches of data are used for independent pilots. Each pilot round has a data creation or annotation phase, some quality assurance and quality control of the results, and an update step to refine the process. After these processes are finessed through successive pilots, you can proceed to large-scale data creation and annotation.

Overview of iterative development in a data creation pipeline.

Source data creation

The input creation process revolves around staging your items of interest, which depend on your task type. These could be images (newspaper scans), videos (traffic scenes), 3D point clouds (medical scans), or simply text (subtitle tracks, transcriptions). In general, when staging your task-related items, make sure of the following:

  • Reflect the real-world use case for the eventual AI/ML system – The setup for collecting images or videos for your training data should closely match the setup for your input data in the real-world application. This means having consistent placement surfaces, lighting sources, or camera angles.
  • Account for and minimize variability sources – Consider the following:

    • Develop best practices for maintaining data collection standards – Depending on the granularity of your use case, you may need to specify requirements to guarantee consistency among your data points. For example, if you’re collecting image or video data from single camera points, you may need to make sure of the consistent placement of your objects of interest, or require a quality check for the camera before a data capture round. This can avoid issues like camera tilt or blur, and minimize downstream overheads like removing out-of-frame or blurry images, as well as needing to manually center the image frame on your area of interest.
    • Pre-empt test time sources of variability – If you anticipate variability in any of the attributes mentioned so far during test time, make sure that you can capture those variability sources during training data creation. For example, if you expect your ML application to work in multiple different light settings, you should aim to create training images and videos at various light settings. Depending on the use case, variability in camera positioning can also influence the quality of your labels.
  • Incorporate prior domain knowledge when available – Consider the following:

    • Inputs on sources of error – Domain practitioners can provide insights into sources of error based on their years of experience. They can provide feedback on the best practices for the previous two points: What settings reflect the real-world use case best? What are the possible sources of variability during data collection, or at the time of use?
    • Domain-specific data collection best practices – Although your technical stakeholders may already have a good idea of the technical aspects to focus on in the images or videos collected, domain practitioners can provide feedback on how best to stage or collect the data such that these needs are met.

Quality control and quality assurance of the created data

Now that you have set up the data collection pipeline, it might be tempting to go ahead and collect as much data as possible. Wait a minute! We must first check whether the data collected through this setup is suitable for your real-world use case. We can use some initial samples and iteratively improve the setup through the insights gained from analyzing that sample data. Work closely with your technical, business, and annotation stakeholders during the pilot process. This will make sure that your resultant pipeline meets business needs while generating ML-ready labeled data with minimal overhead.

Annotations

The annotation of inputs is where we add the magic touch to our data—the labels! Depending on your task type and data creation process, you may need manual annotators, or you can use off-the-shelf automated methods. The data annotation pipeline itself can be a technically challenging task. Ground Truth eases this journey for your technical stakeholders with its built-in repertoire of labeling workflows for common data sources. With a few additional steps, it also enables you to build custom labeling workflows beyond preconfigured options.

Ask yourself the following questions when developing a suitable annotation workflow:

  • Do I need a manual annotation process for my data? In some cases, automated labeling services may be sufficient for the task at hand. Reviewing the documentation and available tools can help you identify if manual annotation is necessary for your use case (for more information, see What is data labeling?). The data creation process can allow for varying levels of control regarding the granularity of your data annotation. Depending on this process, you can also sometimes bypass the need for manual annotation. For more information, refer to Build a custom Q&A dataset using Amazon SageMaker Ground Truth to train a Hugging Face Q&A NLU model.
  • What forms my ground truth? In most cases, the ground truth will come from your annotation process—that’s the whole point! In others, the user may have access to ground truth labels. This can significantly speed up your quality assurance process, or reduce the overhead required for multiple manual annotations.
  • What is the upper bound for the amount of deviance from my ground truth state? Work with your end-users to understand the typical errors around these labels, the sources of such errors, and the desired reduction in errors. This will help you identify which aspects of the labeling task are most challenging or are likely to have annotation errors.
  • Are there preexisting rules used by the users or field practitioners to label these items? Use and refine these guidelines to build a set of instructions for your manual annotators.

Piloting the input annotation process

When piloting the input annotation process, consider the following:

  • Review the instructions with the annotators and field practitioners – Instructions should be concise and specific. Ask for feedback from your users (Are the instructions accurate? Can we revise any instructions to make sure that they are understandable by non-field practitioners?) and annotators (Is everything understandable? Is the task clear?). If possible, add an example of good and bad labeled data to help your annotators identify what is expected, and what common labeling errors might look like.
  • Collect data for annotations – Review the data with your customer to make sure that it meets the expected standards, and to align on expected outcomes from the manual annotation.
  • Provide examples to your pool of manual annotators as a test run – What is the typical variance among the annotators in this set of examples? Study the variance for each annotation within a given image to identify the consistency trends among annotators. Then compare the variances across the images or video frames to identify which labels are challenging to place.

Quality control of the annotations

Annotation quality control has two main components: assessing consistency between the annotators, and assessing the quality of the annotations themselves.

You can assign multiple annotators to the same task (for example, three annotators label the key points on the same image), and measure the average value alongside the standard deviation of these labels among the annotators. Doing so helps you identify any outlier annotations (incorrect label used, or label far away from the average annotation), which can guide actionable outcomes, such as refining your instructions or providing further training to certain annotators.
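
To make this concrete, the following minimal sketch computes the consensus location and spread for hypothetical keypoint annotations from three annotators and flags annotations that fall outside an assumed pixel tolerance; the array layout and the tolerance are illustrative assumptions, not a Ground Truth output format:

import numpy as np

# Hypothetical annotations: 3 annotators x 4 keypoints x (x, y) for one image
annotations = np.array([
    [[102, 200], [150, 210], [175, 305], [220, 310]],
    [[100, 198], [152, 212], [178, 300], [221, 308]],
    [[101, 202], [149, 208], [240, 355], [219, 312]],  # third annotator misplaced the third keypoint
], dtype=float)

consensus = np.median(annotations, axis=0)   # robust consensus location per keypoint
spread = annotations.std(axis=0)             # per-keypoint spread among annotators

# Distance of every individual annotation from the consensus, in pixels
distance = np.linalg.norm(annotations - consensus, axis=2)

# Flag annotations more than an assumed 20-pixel tolerance away from the consensus
outliers = distance > 20.0
print("Per-keypoint spread (pixels):\n", spread)
print("Outlier annotations (annotator x keypoint):\n", outliers)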

Assessing the quality of the annotations themselves is tied to annotator variability and, when available, to domain experts or ground truth information. Are there certain labels (across all of your images) where the average variance between annotators is consistently high? Are any labels far off from your expectations of where they should be, or what they should look like?

Based on our experience, a typical quality control loop for data annotation can look like this:

  • Iterate on the instructions or image staging based on results from the test run – Are any objects occluded, or does image staging not match the expectations of annotators or users? Are the instructions misleading, or did you miss any labels or common errors in your exemplar images? Can you refine the instructions for your annotators?
  • If you are satisfied that you have addressed any issues from the test run, do a batch of annotations – For testing the results from the batch, follow the same quality assessment approach of assessing inter-annotator and inter-image label variabilities.

Conclusion

This post serves as a guide for business stakeholders to understand the complexities of data creation for AI/ML applications. The processes described also serve as a guide for technical practitioners to generate quality data while optimizing business constraints such as personnel and costs. If not done well, a data creation and labeling pipeline can take upwards of 4–6 months.

With the guidelines and suggestions outlined in this post, you can preempt roadblocks, reduce time to completion, and minimize the costs in your journey toward creating high-quality data.


About the authors

Jasleen Grewal is an Applied Scientist at Amazon Web Services, where she works with AWS customers to solve real world problems using machine learning, with special focus on precision medicine and genomics. She has a strong background in bioinformatics, oncology, and clinical genomics. She is passionate about using AI/ML and cloud services to improve patient care.

Boris Aronchik is a Manager in the Amazon AI Machine Learning Solutions Lab, where he leads a team of ML scientists and engineers to help AWS customers realize business goals leveraging AI/ML solutions.

Miguel Romero Calvo is an Applied Scientist at the Amazon ML Solutions Lab where he partners with AWS internal teams and strategic customers to accelerate their business through ML and cloud adoption.

Lin Lee Cheong is a Senior Scientist and Manager with the Amazon ML Solutions Lab team at Amazon Web Services. She works with strategic AWS customers to explore and apply artificial intelligence and machine learning to discover new insights and solve complex problems.

Read More

Automate your time series forecasting in Snowflake using Amazon Forecast

This post is a joint collaboration with Andries Engelbrecht and James Sun of Snowflake, Inc.

The cloud computing revolution has enabled businesses to capture and retain corporate and organizational data without capacity planning or data retention constraints. Now, with diverse and vast reserves of longitudinal data, companies are increasingly able to find novel and impactful ways to use their digital assets to make better and informed decisions when making short-term and long-term planning decisions. Time series forecasting is a unique and essential science that allows companies to make surgical planning decisions to help balance customer service levels against often competing goals of optimal profitability.

At AWS, we sometimes work with customers who have selected our technology partner Snowflake to deliver a cloud data platform experience. Having a platform that can recall years and years of historical data is powerful—but how can you use this data to look ahead and use yesterday’s evidence to plan for tomorrow? Imagine not only having what has happened available in Snowflake—your single version of the truth—but also an adjacent set of non-siloed data that offers a probabilistic forecast for days, weeks, or months into the future.

In a collaborative supply chain, sharing information between partners can improve performance, increase competitiveness, and reduce wasted resources. Sharing your future forecasts can be facilitated with Snowflake Data Sharing, which enables you to seamlessly collaborate with your business partners securely and identify business insights. If many partners share their forecasts, it can help control the bullwhip effect in the connected supply chain. You can effectively use Snowflake Marketplace to monetize your predictive analytics from datasets produced in Amazon Forecast.

In this post, we discuss how to implement an automated time series forecasting solution using Snowflake and Forecast.

Essential AWS services that enable this solution

Forecast provides several state-of-the-art time series algorithms and manages the allocation of enough distributed computing capacity to meet the needs of nearly any workload. With Forecast, you don’t get one model; you get the strength of many models that are further optimized into a uniquely weighted model for each time series in the set. In short, the service delivers all the science, data handling, and resource management into a simple API call.

AWS Step Functions provides a process orchestration mechanism that manages the overall workflow. The service encapsulates API calls with Amazon Athena, AWS Lambda, and Forecast to create an automated solution that harvests data from Snowflake, uses Forecast to convert historical data into future predictions, and then creates the data inside Snowflake.

Athena federated queries can connect to several enterprise data sources, including Amazon DynamoDB, Amazon Redshift, Amazon OpenSearch Service, MySQL, PostgreSQL, Redis, and other popular third-party data stores, such as Snowflake. Data connectors run as Lambda functions—you can use this source code to help launch the Amazon Athena Lambda Snowflake Connector and connect with AWS PrivateLink or through a NAT Gateway.

Solution overview

One of the things we often do at AWS is work to help customers realize their goals while also removing the burden of the undifferentiated heavy lifting. With this in mind, we propose the following solution to assist AWS and Snowflake customers perform the following steps:

  1. Export data from Snowflake. You can use flexible metadata to unload the necessary historical data driven by a ready-to-go workflow.
  2. Import data into Forecast. No matter the use case, industry, or scale, importing prepared data inputs is easy and automated.
  3. Train a state-of-the-art time series model. You can automate time series forecasting without managing the underlying data science or hardware provisioning.
  4. Generate inference against the trained model. Forecast-produced outputs are easy to consume for any purpose. They’re available as simple CSV or Parquet files on Amazon Simple Storage Service (Amazon S3).
  5. Use history and future predictions side by side directly in Snowflake.

The following diagram illustrates how to implement an automated workflow that enables Snowflake customers to benefit from highly accurate time series predictions supported by Forecast, an AWS managed service. Transcending use case and industry, the design offered here first extracts historical data from Snowflake. Next, the workflow submits the prepared data for time series computation. Lastly, future period predictions are available natively in Snowflake, creating a seamless user experience for joint AWS and Snowflake customers.

Although this architecture only highlights the key technical details, the solution is simple to put together, sometimes within 1–2 business days. We provide you with working sample code to help remove the undifferentiated heavy lifting of creating the solution alone and without a head start. After you discover how to implement this pattern for one workload, you can repeat the forecasting process for any data held in Snowflake. In the sections that follow, we outline the key steps that enable you to build an automated pipeline.

Extract historical data from Snowflake

In this first step, you use SQL to define what data you want forecasted and let an Athena Federated Query connect to Snowflake, run your customized SQL, and persist the resulting record set on Amazon S3. Forecast requires historical training data to be available on Amazon S3 before ingestion; therefore, Amazon S3 serves as an intermediate storage buffer between Snowflake and Forecast. We feature Athena in this design to enable Snowflake and other heterogeneous data sources. If you prefer, another approach is using the Snowflake COPY command and storage integration to write query results to Amazon S3.

Regardless of the transport mechanism used, we now outline the kind of data Forecast needs and how data is defined, prepared, and extracted. In the section that follows, we describe how to import data into Forecast.

The following screenshot depicts what a set of data might look like in its native Snowflake schema.

Although this screenshot shows how the data looks in its natural state, Forecast requires data to be shaped into three different datasets:

  • Target time series – This is a required dataset containing the target variable and is used to train and predict a future value. Alone, this dataset serves as a univariate time series model.
  • Related time series – This is an optional dataset that contains temporal variables that should have a relationship to the target variable. Examples include variable pricing, promotional efforts, hyperlocal event traffic, and economic outlook data, or anything else you feel might help explain variance in the target time series and produce a better forecast. The related time series dataset turns your univariate model into a multivariate model to help improve accuracy.
  • Item metadata – This is an optional dataset containing categorical data about the forecasted item. Item metadata often helps boost performance for newly launched products, which we term a cold start.

With the scope of each of the Forecast datasets defined, you can write queries in Snowflake that source the correct data fields from the necessary source tables with the proper filters to get the desired subset of data. The following are three example SQL queries used to generate each dataset that Forecast needs for a specific food demand planning scenario.

We start with the target time series query:

select LOCATION_ID, ITEM_ID, 
DATE_DEMAND as TIMESTAMP, QTY_DEMAND as TARGET_VALUE 
from DEMO.FOOD_DEMAND

The optional related time series query pulls covariates such as price and promotional flags:

select LOCATION_ID,ITEM_ID, DATE_DEMAND as TIMESTAMP,
CHECKOUT_PRICE, BASE_PRICE,
EMAILER_FOR_PROMOTION, HOMEPAGE_FEATURED
from DEMO.FOOD_DEMAND

The item metadata query fetches distinct categorical values that help give dimension and further define the forecasted item:

select DISTINCT ITEM_ID, FOOD_CATEGORY, FOOD_CUISINE
from DEMO.FOOD_DEMAND

With the source queries defined, we can connect to Snowflake through an Athena Federated Query to submit the queries and persist the resulting datasets for forecasting use. For more information, refer to Query Snowflake using Athena Federated Query and join with data in your Amazon S3 data lake.
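
As an illustration of this step, the following minimal sketch submits the target time series query through Athena and polls until the result set has been written to Amazon S3. The data source (catalog) name, database, and output location are assumptions that depend on how you deployed the Snowflake connector:

import time
import boto3

athena = boto3.client("athena")

# Assumed catalog, database, and output location; adjust to your connector deployment
response = athena.start_query_execution(
    QueryString="""
        select LOCATION_ID, ITEM_ID,
               DATE_DEMAND as TIMESTAMP, QTY_DEMAND as TARGET_VALUE
        from DEMO.FOOD_DEMAND
    """,
    QueryExecutionContext={"Catalog": "snowflake", "Database": "DEMO"},
    ResultConfiguration={"OutputLocation": "s3://bucket/folder/tts/"},
)

query_id = response["QueryExecutionId"]
while True:
    state = athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(5)
print(f"Query {query_id} finished with state {state}")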

The Athena Snowflake Connector GitHub repo helps install the Snowflake connector. The Forecast MLOps GitHub repo helps orchestrate all macro steps defined in this post, and makes them repeatable without writing code.

Import data into Forecast

After we complete the previous step, a target time series dataset is in Amazon S3 and ready for import into Forecast. In addition, the optional related time series and item metadata datasets may also be prepared and ready for ingestion. With the provided Forecast MLOps solution, all you have to do here is initiate the Step Functions state machine responsible for importing data—no code is necessary. Forecast launches a cluster for each of the datasets you have provided and makes the data ready for the service to use for ML model building and model inference.
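
For orientation, the import that the state machine performs corresponds to the Forecast CreateDatasetImportJob API. A minimal sketch of the equivalent boto3 call for the target time series follows; the job name, ARNs, S3 path, and timestamp format are placeholders:

import boto3

forecast = boto3.client("forecast")

# Placeholder names, ARNs, and paths; the MLOps solution supplies these from its configuration
forecast.create_dataset_import_job(
    DatasetImportJobName="food_demand_tts_import",
    DatasetArn="arn:aws:forecast:us-east-1:111122223333:dataset/food_demand_tts",
    DataSource={
        "S3Config": {
            "Path": "s3://bucket/folder/tts/",
            "RoleArn": "arn:aws:iam::111122223333:role/forecast-import-role",
        }
    },
    TimestampFormat="yyyy-MM-dd",
)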

Create a time series ML model with accuracy statistics

After data has been imported, highly accurate time series models are created simply by calling an API. This step is encapsulated inside a Step Functions state machine that initiates the Forecast API to start model training. After the predictor model is trained, the state machine exports the model statistics and predictions during the backtest window to Amazon S3. Backtest exports are queryable from Snowflake as an external stage, as shown in the following screenshot. If you prefer, you can store the data in an internal stage. The point is to use the backtest metrics to evaluate the performance spread of the time series in the dataset you provided.
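
The underlying calls resemble the following minimal sketch, which trains an AutoPredictor and exports the backtest results to Amazon S3. The names, forecast horizon, frequency, and ARNs are placeholders, and the MLOps solution wraps these calls inside Step Functions:

import boto3

forecast = boto3.client("forecast")

# Train a predictor (placeholder dataset group ARN, horizon, and frequency)
predictor = forecast.create_auto_predictor(
    PredictorName="food_demand_predictor",
    ForecastHorizon=14,
    ForecastFrequency="D",
    DataConfig={"DatasetGroupArn": "arn:aws:forecast:us-east-1:111122223333:dataset-group/food_demand"},
)

# In practice, wait until the predictor status is ACTIVE before exporting
forecast.create_predictor_backtest_export_job(
    PredictorBacktestExportJobName="food_demand_backtest_export",
    PredictorArn=predictor["PredictorArn"],
    Destination={
        "S3Config": {
            "Path": "s3://bucket/folder/backtest-export/",
            "RoleArn": "arn:aws:iam::111122223333:role/forecast-export-role",
        }
    },
)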

Create future predictions

With the model trained from the previous step, a purpose-built Step Functions state machine calls the Forecast API to create future-dated forecasts. Forecast provisions a cluster to perform the inference and pulls the imported target time series, related time series, and item metadata datasets through a named predictor model created in the previous step. After the predictions are generated, the state machine writes them to Amazon S3, where, once again, they can be queried in place as a Snowflake external stage or moved into Snowflake as an internal stage.
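
The corresponding calls behind this step look roughly like the following sketch, which creates a forecast from an existing predictor and exports it to Amazon S3; the predictor ARN, names, and destination are placeholders:

import boto3

forecast = boto3.client("forecast")

# Placeholder predictor ARN from the training step
predictor_arn = "arn:aws:forecast:us-east-1:111122223333:predictor/food_demand_predictor"

forecast_job = forecast.create_forecast(
    ForecastName="food_demand_forecast",
    PredictorArn=predictor_arn,
)

# In practice, wait for the forecast to become ACTIVE, then export the
# future-dated predictions as files on Amazon S3
forecast.create_forecast_export_job(
    ForecastExportJobName="food_demand_forecast_export",
    ForecastArn=forecast_job["ForecastArn"],
    Destination={
        "S3Config": {
            "Path": "s3://bucket/folder/forecast/",
            "RoleArn": "arn:aws:iam::111122223333:role/forecast-export-role",
        }
    },
)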

Use the future-dated prediction data directly in Snowflake

AWS hasn’t built a fully automated solution for this step; however, with the solution in this post, data was already produced by Forecast in the previous two steps. You may treat the outputs as actionable events or build business intelligence dashboards on the data. You may also use the data to create future manufacturing plans and purchase orders, estimate future revenue, build staffing resource plans, and more. Every use case is different, but the point of this step is to deliver the predictions to the correct consuming systems in your organization or beyond.

The following code snippet shows how to query Amazon S3 data directly from within Snowflake:

CREATE or REPLACE FILE FORMAT mycsvformat
type = 'CSV'
field_delimiter = ','
empty_field_as_null = TRUE
ESCAPE_UNENCLOSED_FIELD = None
skip_header = 1;

CREATE or REPLACE STORAGE INTEGRATION amazon_forecast_integration
TYPE = EXTERNAL_STAGE
STORAGE_PROVIDER = S3
STORAGE_AWS_ROLE_ARN = 'arn:aws:iam::nnnnnnnnnn:role/snowflake-forecast-poc-role'
ENABLED = true
STORAGE_ALLOWED_LOCATIONS = (
's3://bucket/folder/forecast',
's3://bucket/folder/backtest-export/accuracy-metrics-values',
's3://bucket/folder/backtest-export/forecasted-values');

CREATE or REPLACE STAGE backtest_accuracy_metrics
storage_integration = amazon_forecast_integration
url = 's3://bucket/folder/backtest-export/accuracy-metrics-values'
file_format = mycsvformat;

CREATE or REPLACE EXTERNAL TABLE FOOD_DEMAND_BACKTEST_ACCURACY_METRICS (
ITEM_ID varchar AS (value:c1::varchar),
LOCATION_ID varchar AS (value:c2::varchar),
backtest_window varchar AS (value:c3::varchar),
backtestwindow_start_time varchar AS (value:c4::varchar),
backtestwindow_end_time varchar AS (value:c5::varchar),
wQL_10 varchar AS (value:c6::varchar),
wQL_30 varchar AS (value:c7::varchar),
wQL_50 varchar AS (value:c8::varchar),
wQL_70 varchar AS (value:c9::varchar),
wQL_90 varchar AS (value:c10::varchar),
AVG_wQL varchar AS (value:c11::varchar),
RMSE varchar AS (value:c12::varchar),
WAPE varchar AS (value:c13::varchar),
MAPE varchar AS (value:c14::varchar),
MASE varchar AS (value:c15::varchar)
)
with location = @backtest_accuracy_metrics
FILE_FORMAT = (TYPE = CSV FIELD_DELIMITER = ',' SKIP_HEADER = 1);

For more information about setting up permissions, refer to Option 1: Configuring a Snowflake Storage Integration to Access Amazon S3. Additionally, you can use the AWS Service Catalog to configure Amazon S3 storage integration; more information is available on the GitHub repo.

Initiate a schedule-based or event-based workflow

After you install a solution for your specific workload, your final step is to automate the process on a schedule that makes sense for your unique requirement, such as daily or weekly. The main thing is to decide how to start the process. One method is to use Snowflake to invoke the Step Functions state machine and then orchestrate the steps serially. Another approach is to chain state machines together and start the overall run through an Amazon EventBridge rule, which you can configure to run from an event or scheduled task—for example, at 9:00 PM GMT-8 each Sunday night.
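
For the schedule-based option, a minimal sketch that creates an EventBridge rule to start a Step Functions state machine at 9:00 PM GMT-8 each Sunday (05:00 UTC Monday) might look like the following; the rule name, state machine ARN, and role are placeholders:

import boto3

events = boto3.client("events")

# 9:00 PM GMT-8 each Sunday is 05:00 UTC on Monday
events.put_rule(
    Name="weekly-forecast-run",
    ScheduleExpression="cron(0 5 ? * MON *)",
    State="ENABLED",
)

# Target the Step Functions state machine that orchestrates the end-to-end workflow
events.put_targets(
    Rule="weekly-forecast-run",
    Targets=[
        {
            "Id": "forecast-state-machine",
            "Arn": "arn:aws:states:us-east-1:111122223333:stateMachine:ForecastWorkflow",
            "RoleArn": "arn:aws:iam::111122223333:role/eventbridge-invoke-stepfunctions",
        }
    ],
)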

Conclusion

With the most experience; the most reliable, scalable, and secure cloud; and the most comprehensive set of services and solutions, AWS is the best place to unlock value from your data and turn it into insight. In this post, we showed you how to create an automated time series forecasting workflow. Better forecasting can lead to higher customer service outcomes, less waste, less idle inventory, and more cash on the balance sheet.

If you’re ready to automate and improve forecasting, we’re here to help support you on your journey. Contact your AWS or Snowflake account team to get started today and ask for a forecasting workshop to see what kind of value you can unlock from your data.


About the Authors

Bosco Albuquerque is a Sr. Partner Solutions Architect at AWS and has over 20 years of experience working with database and analytics products from enterprise database vendors and cloud providers. He has helped technology companies design and implement data analytics solutions and products.

Frank Dallezotte is a Sr. Solutions Architect at AWS and is passionate about working with independent software vendors to design and build scalable applications on AWS. He has experience creating software, implementing build pipelines, and deploying these solutions in the cloud.

Andries Engelbrecht is a Principal Partner Solutions Architect at Snowflake and works with strategic partners. He is actively engaged with strategic partners like AWS supporting product and service integrations as well as the development of joint solutions with partners. Andries has over 20 years of experience in the field of data and analytics.

Charles Laughlin is a Principal AI/ML Specialist Solutions Architect and works on the Time Series ML team at AWS. He helps shape the Amazon Forecast service roadmap and collaborates daily with diverse AWS customers to help transform their businesses using cutting-edge AWS technologies and thought leadership. Charles holds an M.S. in Supply Chain Management and has spent the past decade working in the consumer packaged goods industry.

James Sun is a Senior Partner Solutions Architect at Snowflake. James has over 20 years of experience in storage and data analytics. Prior to Snowflake, he held several senior technical positions at AWS and MapR. James holds a PhD from Stanford University.

Read More

Achieve four times higher ML inference throughput at three times lower cost per inference with Amazon EC2 G5 instances for NLP and CV PyTorch models

Amazon Elastic Compute Cloud (Amazon EC2) G5 instances are the first and only instances in the cloud to feature NVIDIA A10G Tensor Core GPUs, which you can use for a wide range of graphics-intensive and machine learning (ML) use cases. With G5 instances, ML customers get high performance and a cost-efficient infrastructure to train and deploy larger and more sophisticated models for natural language processing (NLP), computer vision (CV), and recommender engine use cases.

The purpose of this post is to showcase the performance benefits of G5 instances for large-scale ML inference workloads. We do this by comparing the price-performance (measured as $ per million inferences) for NLP and CV models with G4dn instances. We start by describing our benchmarking approach and then present throughput vs. latency curves across batch sizes and data type precision. In comparison to G4dn instances, we find that G5 instances deliver consistently lower cost per million inferences for both full precision and mixed precision modes for the NLP and CV models while achieving higher throughput and lower latency.

Benchmarking approach

To develop a price-performance study between G5 and G4dn, we need to measure throughput, latency, and cost per million inferences as a function of batch size. We also study the impact of full precision vs. mixed precision. Both the model graph and inputs are loaded into CUDA prior to inferencing.

As shown in the following architecture diagram, we first create respective base container images with CUDA for the underlying EC2 instance (G4dn, G5). To build the base container images, we start with AWS Deep Learning Containers, which are pre-packaged Docker images that let you deploy deep learning environments in minutes. The images contain the required deep learning PyTorch libraries and tools. You can add your own libraries and tools on top of these images for a higher degree of control over monitoring, compliance, and data processing.

Then we build a model-specific container image that encapsulates the model configuration, model tracing, and related code to run forward passes. All container images are loaded into Amazon Elastic Container Registry (Amazon ECR) to allow for horizontal scaling of these models for various model configurations. We use Amazon Simple Storage Service (Amazon S3) as a common data store to download configuration and upload benchmark results for summarization. You can use this architecture to recreate and reproduce the benchmark results, and repurpose it to benchmark various model types (such as Hugging Face models, PyTorch models, and other custom models) across EC2 instance types (CPU, GPU, Inf1).

With this experiment set up, our goal is to study latency as a function of throughput. This curve is important for application design to arrive at a cost-optimal infrastructure for the target application. To achieve this, we simulate different loads by queuing up queries from multiple threads and then measuring the round-trip time for each completed request. Throughput is measured based on the number of completed requests per unit clock time. Furthermore, you can vary the batch sizes and other variables like sequence length and full precision vs. half precision to comprehensively sweep the design space and arrive at indicative performance metrics. In our study, the throughput vs. latency curve is determined through a parametric sweep of batch size and the number of queries from multi-threaded clients. Every request can be batched to ensure full utilization of the accelerator, especially for small requests that may not fully utilize the compute node. You can also adopt this setup to identify the client-side batch size for optimal performance.

In summary, we can represent this problem mathematically as: (Throughput, Latency) = function of (Batch Size, Number of threads, Precision).
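
The following minimal sketch illustrates this measurement loop: multiple client threads queue requests against a placeholder infer(batch) function, round-trip latencies are recorded, and throughput is derived from wall-clock time. The infer function and batch contents are stand-ins for the containerized model server used in the actual benchmark:

import time
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def infer(batch):
    # Placeholder for the model forward pass running in the benchmark container
    time.sleep(0.005)
    return len(batch)

def run_load(batch_size, num_threads, num_requests=200):
    latencies = []

    def one_request(_):
        batch = [None] * batch_size
        start = time.perf_counter()
        infer(batch)
        latencies.append(time.perf_counter() - start)

    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        list(pool.map(one_request, range(num_requests)))
    wall_clock = time.perf_counter() - start

    throughput = num_requests * batch_size / wall_clock     # inferences per second
    p95_latency_ms = 1000 * np.percentile(latencies, 95)
    return throughput, p95_latency_ms

for batch_size in (1, 8, 16, 32):
    print(batch_size, run_load(batch_size, num_threads=4))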

This means that, given the exhaustive space, the number of experiments can be large. Fortunately, each experiment can be run independently. We recommend using AWS Batch to perform this horizontally scaled benchmarking in compressed time without an increase in benchmarking cost compared to a linear approach to testing. The code for replicating the results is available in the GitHub repository prepared for AWS re:Invent 2021. The repository supports benchmarking on different accelerators. You can refer to the GPU aspect of the code to build the container (Dockerfile-gpu) and then refer to the code inside Container-Root for specific examples for BERT and ResNet50.

We used the preceding approach to develop performance studies across two model types: Bert-base-uncased (110 million parameters, NLP) and ResNet50 (25.6 million parameters, CV). The following table summarizes the model details.

Model Type | Model Details
NLP | twmkn9/bert-base-uncased-squad2, 110 million parameters, sequence length = 128
CV | ResNet50, 25.6 million parameters

Additionally, to benchmark across data types (full, half precision), we use torch.cuda.amp, which provides convenient methods to handle mixed precision where some operations use the torch.float32 (float) data type and other operations use torch.float16 (half). For example, operators like linear layers and convolutions are much faster with float16, whereas others like reductions often require the dynamic range of float32. Automatic mixed precision tries to match each operator to its appropriate data type to optimize the network’s runtime and memory footprint.
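
The following is a minimal sketch of how such a mixed precision inference pass looks in PyTorch; a randomly initialized ResNet50 stands in for the traced benchmark models, and a CUDA device is assumed:

import torch
import torchvision

device = torch.device("cuda")
model = torchvision.models.resnet50().eval().to(device)
batch = torch.randn(32, 3, 224, 224, device=device)

with torch.no_grad():
    # Full precision (float32) forward pass
    fp32_out = model(batch)

    # Mixed precision: eligible ops (convolutions, linear layers) run in float16
    with torch.cuda.amp.autocast():
        amp_out = model(batch)

print(fp32_out.dtype, amp_out.dtype)  # torch.float32 vs. torch.float16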

Benchmarking results

For a fair comparison, we selected G4dn.4xlarge and G5.4xlarge instances with similar attributes, as listed in the following table.

Instance | GPUs | GPU Memory (GiB) | vCPUs | Memory (GiB) | Instance Storage (GB) | Network Performance (Gbps) | EBS Bandwidth (Gbps) | Linux On-Demand Pricing (us-east-1)
G5.4xlarge | 1 | 24 | 16 | 64 | 1x 600 NVMe SSD | Up to 25 | 8 | $1.624/hour
G4dn.4xlarge | 1 | 16 | 16 | 64 | 1x 225 NVMe SSD | Up to 25 | 4.75 | $1.204/hour

In the following sections, we compare ML inference performance of BERT and RESNET50 models with a grid sweep approach for specific batch sizes (32, 16, 8, 4, 1) and data type precision (full and half precision) to arrive at the throughput vs. latency curve. Additionally, we investigate the effect of throughput vs. batch size for both full and half precision. Lastly, we measure cost per million inferences as a function of batch size. The consolidated results across these experiments are summarized later in this post.

Throughput vs. latency

The following figures compare G4dn and G5 instances for NLP and CV workloads at both full and half precision. In comparison to G4dn instances, the G5 instance delivers about five times higher throughput (full precision) and about 2.5 times higher throughput (half precision) for the BERT base model, and about 2–2.5 times higher throughput for the ResNet50 model. Overall, from a performance perspective, G5 is the preferred choice as batch size increases, for both models and for both full and mixed precision.

The following graphs compare throughput and P95 latency at full and half precision for BERT.

The following graphs compare throughput and P95 latency at full and half precision for ResNet50.

Throughput and latency vs. batch size

The following graphs show the throughput as a function of the batch size. At low batch sizes, the accelerator isn’t functioning to its fullest capacity and as the batch size increases, throughput is increased at the cost of latency. The throughput curve asymptotes to a maximum value that is a function of the accelerator performance. The curve has two distinct features: a rising section and a flat asymptotic section. For a given model, a performant accelerator (G5) is able to stretch the rising section to higher batch sizes than G4dn and asymptote at a higher throughput. Also, there is a linear trade-off between latency and batch size. Therefore, if the application is latency bound, we can use P95 latency vs. batch size to determine the optimum batch size. However, if the objective is to maximize throughput at the lowest latency, it’s better to select the batch size corresponding to the “knee” between the rising and the asymptotic sections, because any further increase in batch size would result in the same throughput at a worse latency. To achieve the best price-performance ratio, targeting higher throughput at lowest latency, you’re better off horizontally scaling this optimum through multiple inference servers rather than just increasing the batch size.
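
One simple way to locate that knee from a measured sweep is to stop increasing the batch size once the marginal throughput gain drops below a chosen fraction. The following sketch uses illustrative numbers (loosely based on the BERT full precision results later in this post) and an assumed 10% gain threshold, not a prescribed rule:

# Illustrative sweep results: batch size -> (throughput, p95 latency in ms)
sweep = {1: (160, 6), 8: (642, 13), 16: (651, 25), 32: (723, 44)}

def knee_batch_size(sweep, min_gain=0.10):
    """Return the largest batch size whose throughput gain over the previous
    measured point is still above min_gain (10% by default)."""
    sizes = sorted(sweep)
    best = sizes[0]
    for prev, cur in zip(sizes, sizes[1:]):
        gain = sweep[cur][0] / sweep[prev][0] - 1.0
        if gain < min_gain:
            break
        best = cur
    return best

print(knee_batch_size(sweep))  # 8 for these illustrative numbers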

Cost vs. batch size

In this section, we present the comparative results of inference costs ($ per million inferences) versus the batch size. From the following figure, we can clearly observe that the cost (measured as $ per million inferences) is consistently lower with G5 than with G4dn, for both full and half precision.

The following table summarizes throughput, latency, and cost ($ per million inferences) comparisons for the BERT and ResNet50 models across both precision modes for specific batch sizes. In spite of a higher cost per instance, G5 consistently outperforms G4dn across all aspects of inference latency, throughput, and cost ($ per million inferences), for all batch sizes. Combining the different metrics into a cost ($ per million inferences), the BERT model (batch size 32, full precision) is 3.7 times more favorable on G5 than on G4dn, and the ResNet50 model (batch size 32, full precision) is 1.6 times more favorable.

Model | Batch Size | Precision | Throughput G5 (batch size x requests/sec) | Throughput G4dn | Latency G5 (ms) | Latency G4dn (ms) | $/million inferences G5 (On-Demand) | $/million inferences G4dn (On-Demand) | Cost Benefit (G5 over G4dn)
Bert-base-uncased | 32 | Full | 723 | 154 | 44 | 208 | $0.6 | $2.2 | 3.7X
Bert-base-uncased | 32 | Mixed | 870 | 410 | 37 | 79 | $0.5 | $0.8 | 1.6X
Bert-base-uncased | 16 | Full | 651 | 158 | 25 | 102 | $0.7 | $2.1 | 3.0X
Bert-base-uncased | 16 | Mixed | 762 | 376 | 21 | 43 | $0.6 | $0.9 | 1.5X
Bert-base-uncased | 8 | Full | 642 | 142 | 13 | 57 | $0.7 | $2.3 | 3.3X
Bert-base-uncased | 8 | Mixed | 681 | 350 | 12 | 23 | $0.7 | $1.0 | 1.4X
Bert-base-uncased | 1 | Full | 160 | 116 | 6 | 9 | $2.8 | $2.9 | 1.0X
Bert-base-uncased | 1 | Mixed | 137 | 102 | 7 | 10 | $3.3 | $3.3 | 1.0X
ResNet50 | 32 | Full | 941 | 397 | 34 | 82 | $0.5 | $0.8 | 1.6X
ResNet50 | 32 | Mixed | 1533 | 851 | 21 | 38 | $0.3 | $0.4 | 1.3X
ResNet50 | 16 | Full | 888 | 384 | 18 | 42 | $0.5 | $0.9 | 1.8X
ResNet50 | 16 | Mixed | 1474 | 819 | 11 | 20 | $0.3 | $0.4 | 1.3X
ResNet50 | 8 | Full | 805 | 340 | 10 | 24 | $0.6 | $1.0 | 1.7X
ResNet50 | 8 | Mixed | 1419 | 772 | 6 | 10 | $0.3 | $0.4 | 1.3X
ResNet50 | 1 | Full | 202 | 164 | 5 | 6 | $2.2 | $2.0 | 0.9X
ResNet50 | 1 | Mixed | 196 | 180 | 5 | 6 | $2.3 | $1.9 | 0.8X

Additional inference benchmarks

In addition to the BERT base and ResNet50 results in the prior sections, we present additional benchmarking results for other commonly used large NLP and CV models in PyTorch. The performance benefit of G5 over G4dn is presented for BERT Large models at various precisions, and for YOLOv5 models of various sizes. For the code for replicating the benchmark, refer to NVIDIA Deep Learning Examples for Tensor Cores. These results show the benefit of using G5 over G4dn for a wide range of inference tasks spanning different model types.

Model | Precision | Batch Size | Sequence Length | Throughput G5 (sent/sec) | Throughput G4dn (sent/sec) | Speedup Over G4dn
BERT-large | FP16 | 1 | 128 | 93.5 | 40.31 | 2.3
BERT-large | FP16 | 4 | 128 | 264.2 | 87.4 | 3.0
BERT-large | FP16 | 8 | 128 | 392.1 | 107.5 | 3.6
BERT-large | FP32 | 1 | 128 | 68.4 | 22.67 | 3.0
BERT-large | FP32 | 4 | 128 | 118.5 | 32.21 | 3.7
BERT-large | FP32 | 8 | 128 | 132.4 | 34.67 | 3.8

Model | GFLOPS | Number of Parameters | Preprocessing (ms) | Inference (ms) | Non-max suppression (NMS/image)
YOLOv5s | 16.5 | 7.2M | 0.2 | 3.6 | 4.5
YOLOv5m | 49.1 | 21M | 0.2 | 6.5 | 4.5
YOLOv5l | 109.3 | 46M | 0.2 | 9.1 | 3.5
YOLOv5x | 205.9 | 86M | 0.2 | 14.4 | 1.3

Conclusion

In this post, we showed that for inference with large NLP and CV PyTorch models, EC2 G5 instances are a better choice than G4dn instances. Although the on-demand hourly cost of G5 instances is higher than that of G4dn instances, their higher performance achieves 2–5 times the throughput at any precision for NLP and CV models, which makes the cost per million inferences 1.5–3.5 times more favorable than with G4dn instances. Even for latency-bound applications, G5 is 2.5–5 times better than G4dn for NLP and CV models.

In summary, AWS G5 instances are an excellent choice for your inference needs from both a performance and a cost-per-inference perspective. The universality of the CUDA framework and the scale and depth of the G5 instance pool on AWS provide you with a unique ability to perform inference at scale.


About the authors

Ankur Srivastava is a Sr. Solutions Architect in the ML Frameworks Team. He focuses on helping customers with self-managed distributed training and inference at scale on AWS. His experience includes industrial predictive maintenance, digital twins, probabilistic design optimization and has completed his doctoral studies from Mechanical Engineering at Rice University and post-doctoral research from Massachusetts Institute of Technology.

Sundar Ranganathan is the Head of Business Development, ML Frameworks on the Amazon EC2 team. He focuses on large-scale ML workloads across AWS services like Amazon EKS, Amazon ECS, Elastic Fabric Adapter, AWS Batch, and Amazon SageMaker. His experience includes leadership roles in product management and product development at NetApp, Micron Technology, Qualcomm, and Mentor Graphics.

Mahadevan Balasubramaniam is a Principal Solutions Architect for Autonomous Computing with nearly 20 years of experience in the area of physics-infused deep learning, building, and deploying digital twins for industrial systems at scale. Mahadevan obtained his PhD in Mechanical Engineering from the Massachusetts Institute of Technology and has over 25 patents and publications to his credit.

Amr Ragab is a Principal Solutions Architect for EC2 Accelerated Platforms for AWS, devoted to helping customers run computational workloads at scale. In his spare time he likes traveling and finding new ways to integrate technology into daily life.

Read More