InformedIQ automates verifications for Origence’s auto lending using machine learning

This post was co-written with Robert Berger and Adine Deford from InformedIQ.

InformedIQ is the leader in AI-based software used by the nation’s largest financial institutions to automate loan processing verifications and consumer credit applications in real time per the lenders’ policies. They improve regulatory compliance, reduce cost, and increase accuracy by decreasing human error rates that are caused by the repetitive nature of tasks. Informed partnered with Origence (the nation’s leading lending technology solutions and services provider for 1,130 credit unions serving over 64 million members) to power Origence’s document process automation functionality for indirect lending to automatically identify documents and validate financing policies, creating a better credit union and dealer experience for their network of over 15,000 dealers. To date, $110 billion in auto loans have originated with Informed’s automation, which is 8% of all US auto loans. Six of the top 10 consumer lenders trust Informed’s technology.

In this post, we learn about the challenges Informed faced and how machine learning (ML) solved them.

Problem statement

Manual loan verification document processing is time-consuming. The verification includes consumer stipulations like proof of residence, identity, insurance, and income. It can be prone to human error due to the repetitive nature of tasks.

With ML and automation, Informed can provide a software solution that is available 24/7, over holidays and weekends. The solution calculates and clears stipulations in under 30 seconds, versus an average of 7 days for manual loan verifications, with 99% accuracy and without conscious or unconscious bias.

Solution overview

Informed uses a wide range of AWS offerings and capabilities, including Amazon SageMaker and Amazon Textract in their ML stack to power Origence’s document process automation functionality. The solution automatically extracts data and classifies documents (for example, driver’s license, paystub, W2 form, or bank statement), providing the required fields for the consumer verifications used to determine if the lender will grant the loan. Through accurate income calculations and validation of applicant data, loan documents, and documented classification, loans are processed faster and more accurately, with reduced human errors and fraud risk, and added operational efficiency. This helps in creating a better consumer, credit union, and dealer experience.

To classify and extract information needed to validate information in accordance with a set of configurable funding rules, Informed uses a series of proprietary rules and heuristics, text-based neural networks, and image-based deep neural networks, including Amazon Textract OCR via the DetectDocumentText API and other statistical models. The Informed API model can be broken down into five functional steps, as shown in the following diagram: image processing, classification, image feature computations, extractions, and stipulation verification rules, before determining the decision.

Given a sequence of pages for various document types (bank statement, driver’s license, paystub, SSI award letter, and so on), the image processing step performs the necessary image enhancements for each page and invokes multiple APIs, including Amazon Textract OCR for image to text conversion. The rest of the processing steps use the OCR text obtained from image processing and the image for each page.
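
Informed's production pipeline is proprietary, but a minimal sketch of the image-to-text step, calling the Amazon Textract DetectDocumentText API on a single page stored in Amazon S3, might look like the following (bucket and object names are illustrative):

import boto3

# Illustrative bucket and document names; Informed's actual pipeline is proprietary.
textract = boto3.client("textract")

response = textract.detect_document_text(
    Document={"S3Object": {"Bucket": "example-loan-docs", "Name": "applications/paystub-page-1.png"}}
)

# Collect the OCR text lines that the downstream classification and extraction steps consume
ocr_lines = [
    block["Text"] for block in response["Blocks"] if block["BlockType"] == "LINE"
]
print("\n".join(ocr_lines))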

Main advantages

Informed provides solutions to the auto lending industry that reduce manual processes, support compliance and quality, mitigate risk, and deliver significant cost savings to their customers. Let’s dive into two main advantages of the solution.

Automation at scale with efficiency

The adoption of AWS Cloud technologies and capabilities has helped Informed address a wider range of document types and onboard new partners. Informed has developed integrated, AI/ML-enabled solutions, and continuously strives for innovation to better serve clients.

Almost the entirety of the Informed SaaS service is hosted and enabled by AWS services. Informed is able to offload the undifferentiated heavy lifting for scalable infrastructure and focus on their business objectives. Their architecture includes load balancers, Amazon API Gateway, Amazon Elastic Container Service (Amazon ECS) containers, serverless AWS Lambda, Amazon DynamoDB, and Amazon Relational Database Service (Amazon RDS), in addition to ML technologies like Amazon Textract and SageMaker.

Reducing cost in document extraction

Informed uses new features from Amazon Textract to improve the accuracy of data extraction from documents such as bank statements and paystubs. Amazon Textract is an AI/ML service that automatically extracts text, handwriting, and other forms of metadata from scanned documents, forms, and tables in ways that make further ML processing more efficient and accurate. Informed uses the Amazon Textract OCR and AnalyzeDocument APIs for both tables and forms as part of the verification process. Informed’s artificial intelligence modeling engine performs complex calculations, ensuring accuracy, identifying omissions, and combating fraud. With AWS, they continue to advance the accuracy and speed of the solution, helping lenders become more efficient by lowering loan processing costs and reducing time to process and fund. With a 99% accuracy rate for field prediction, dealers and credit unions can now focus less on collecting and validating data and more on developing strong customer relationships.
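
As a hedged illustration of the tables-and-forms extraction (not Informed's production code), a call to the Amazon Textract AnalyzeDocument API with the TABLES and FORMS feature types might look like this, again with illustrative object names:

import boto3

textract = boto3.client("textract")

# Illustrative object name; the production post-processing and verification rules are proprietary.
response = textract.analyze_document(
    Document={"S3Object": {"Bucket": "example-loan-docs", "Name": "statements/bank-statement-page-1.png"}},
    FeatureTypes=["TABLES", "FORMS"],
)

# Count the key-value pairs and table cells returned for downstream verification rules
kv_blocks = [b for b in response["Blocks"] if b["BlockType"] == "KEY_VALUE_SET"]
cell_blocks = [b for b in response["Blocks"] if b["BlockType"] == "CELL"]
print(f"Found {len(kv_blocks)} key-value blocks and {len(cell_blocks)} table cells")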

“Partnering with Informed.IQ to integrate their leading AI-based technology allows us to advance our lending systems’ capabilities and performance, further streamlining the overall loan process for our credit unions and their members”

– Brian Hendricks, Chief Product Officer at Origence.

Conclusion

Informed is constantly improving the accuracy, efficiency, and breadth of their automated loan document verifications. This solution can benefit any lending document verification process, such as personal and student loans, HELOCs, and powersports. The adoption of AWS Cloud technologies and capabilities has helped Informed address the growing complexity of the lending process and improve the dealer and customer experience. With AWS, the company continues to add enhancements, built on serverless computing, that help lenders become more efficient and lower loan processing costs.

Now that you have learned about how ML and automation can solve the loan document verification process, you can get started using Amazon Textract. You can also try out intelligent document processing workshops. Visit Automated data processing from documents to learn more about reference architectures, code samples, industry use cases, blog posts, and more.


About the authors

Robert Berger is the Chief Architect at InformedIQ. He is leading the transformation of the InformedIQ SaaS into a full serverless microservice architecture leveraging AWS Cloud, DevOps, and data-oriented programming. He has been a principal or founder in several other start-ups, including InterNex, MetroFi, UltraDevices, Runa, Mist Systems, and Omnyway.

Adine Deford is the VP of Marketing at Informed.IQ. She has more than 25 years of technology marketing experience serving industry leaders, world class marketing agencies and technology start-ups.

Jessica Oliveira is an Account Manager at AWS who provides guidance and support to SMB customers in Northern California. She is passionate about building strategic collaborations to help ensure her customer’s success. Outside of work, she enjoys traveling, learning about different languages and cultures, and spending time with her family.

Malini Chatterjee is a Senior Solutions Architect at AWS. She provides guidance to AWS customers on their workloads across a variety of AWS technologies. She brings a breadth of expertise in data analytics and machine learning. Prior to joining AWS, she was architecting data solutions in the financial industry. She is very interested in the Amazon Future Engineer program, which enables middle school and high school kids to see the art of the possible in STEM. She is very passionate about semi-classical dancing and performs in community events. She loves traveling and spending time with her family.


Prevent account takeover at login with the new Account Takeover Insights model in Amazon Fraud Detector

Digital is the new normal, and there’s no going back. Every year, consumers visit, on average, 191 websites or services requiring a user name and password, and the digital footprint is expected to grow exponentially. So much exposure naturally brings added risks like account takeover (ATO).

Each year, bad actors compromise billions of accounts through stolen credentials, phishing, social engineering, and multiple forms of ATO. To put it into perspective: account takeover fraud increased by 90% to an estimated $11.4 billion in 2021 compared with 2020. Beyond the financial impact, ATOs damage the customer experience, threaten brand loyalty and reputation, and strain fraud teams as they manage chargebacks and customer claims.

Many companies, even those with sophisticated fraud teams, use rules-based solutions to detect compromised accounts because they’re simple to create. To bolster their defenses and reduce friction for legitimate users, businesses are increasingly investing in AI and machine learning (ML) to detect account takeovers.

AWS can help you improve your fraud mitigation with solutions like Amazon Fraud Detector. This fully managed AI service allows you to identify potentially fraudulent online activities by enabling you to train custom ML fraud detection models without ML expertise.

This post discusses how to create a real-time detector endpoint using the new Account Takeover Insights (ATI) model in Amazon Fraud Detector.

Overview of solution

Amazon Fraud Detector relies on specific models with tailored algorithms, enrichments, and feature transformations to detect fraudulent events across multiple use cases. The newly launched ATI model is a low-latency fraud detection ML model designed to detect potentially compromised accounts and ATO fraud. The ATI model detects up to four times more ATO fraud than traditional rules-based account takeover solutions while minimizing the level of friction for legitimate users.

The ATI model is trained using a dataset containing your business’s historical login events. Event labels are optional for model training because the ATI model uses an innovative approach to unsupervised learning. The model differentiates events generated by the actual account owner (legit events) from those generated by bad actors (anomalous events).

Amazon Fraud Detector derives the user’s past behavior by continuously aggregating the data provided. Examples of user behavior include the number of times the user signed in from a specific IP address. With these additional enrichments and aggregates, Amazon Fraud Detector can generate strong model performance from a small set of inputs from your login events.

For a real-time prediction, you call the GetEventPrediction API after a user presents valid login credentials to quantify the risk of ATO. In response, you receive a model score between 0 and 1,000, where 0 indicates low fraud risk and 1,000 indicates high fraud risk, along with an outcome based on a set of business rules you define. You can then take the appropriate action on your end: approve the login, deny the login, or challenge the user by enforcing an additional identity verification.

You can also use the ATI model to asynchronously evaluate account logins and take action based on the outcome, such as adding the account to an investigation queue so a human reviewer can determine if further action should be taken.

The following steps outline the process of training an ATI model and publishing a detector endpoint to generate fraud predictions:

  • Prepare and validate the data.
  • Define the entity, event type, event variables, and event labels (optional).
  • Upload event data.
  • Initiate model training.
  • Evaluate the model.
  • Create a detector endpoint and define business rules.
  • Get real-time predictions.

Prerequisites

Before getting started, complete the following prerequisite steps:

  • Create or choose an Amazon S3 bucket, upload your historical login event dataset (CSV file) to it, and copy the S3 URI of the file; you will need both the bucket name and the URI in later steps.

Prepare and validate the data

Amazon Fraud Detector requires that you provide your user account login data in a CSV file encoded in the UTF-8 format. For the ATI, you must provide certain event metadata and event variables in the header line of your CSV file.

The required event metadata is as follows:

  • EVENT_ID – A unique identifier for the login event.
  • ENTITY_TYPE – The entity that performs the login event, such as a merchant or a customer.
  • ENTITY_ID – An identifier for the entity performing the login event.
  • EVENT_TIMESTAMP – The timestamp when the login event occurred. The timestamp format must be in ISO 8601 standard in UTC.
  • EVENT_LABEL (optional) – A label that classifies the event as fraudulent or legitimate. You can use any labels, such as fraud, legit, 1, or 0.

Event metadata must be in uppercase letters. Labels aren’t required for login events. However, we recommend including EVENT_LABEL metadata and providing labels for your login events if available. If you provide labels, Amazon Fraud Detector uses them to automatically calculate an Account Takeover Discovery Rate and display it in the model performance metrics.

The ATI model has both required and optional variables. Event variable names must be in lowercase letters.

The following table summarizes the mandatory variables.

Category | Variable type | Description
IP address | IP_ADDRESS | The IP address used in the login event
Browser and device | USERAGENT | The browser, device, and OS used in the login event
Valid credentials | VALIDCRED | Indicates whether the credentials used for the login are valid

The following table summarizes the optional variables.

Category | Variable type | Description
Browser and device | FINGERPRINT | The unique identifier for a browser or device fingerprint
Session ID | SESSION_ID | The identifier for an authentication session
Label | EVENT_LABEL | A label that classifies the event as fraudulent or legitimate (such as fraud, legit, 1, or 0)
Timestamp | LABEL_TIMESTAMP | The timestamp when the label was last updated; required if EVENT_LABEL is provided

You can provide additional variables. However, Amazon Fraud Detector won’t include these variables for training an ATI model.
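
To make the expected layout concrete, a minimal synthetic example of such a CSV file might look like the following; the lowercase variable names (ip_address, user_agent, valid_cred) are illustrative and simply need to map to the IP_ADDRESS, USERAGENT, and VALIDCRED variable types:

EVENT_ID,ENTITY_TYPE,ENTITY_ID,EVENT_TIMESTAMP,EVENT_LABEL,LABEL_TIMESTAMP,ip_address,user_agent,valid_cred
login-0001,customer,user-123,2022-06-01T10:15:30Z,0,2022-06-01T10:15:30Z,198.51.100.24,Mozilla/5.0 (Windows NT 10.0; Win64; x64),true
login-0002,customer,user-456,2022-06-01T10:16:02Z,1,2022-06-02T08:00:00Z,203.0.113.77,Mozilla/5.0 (iPhone; CPU iPhone OS 15_5 like Mac OS X),true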

Dataset preparation

As you start to prepare your login data, you must meet the following requirements:

  • Provide at least 1,500 entities (individual user accounts), each with at least two associated login events
  • Your dataset must cover at least 30 days of login events

The following configurations are optional:

  • Your dataset can include examples of unsuccessful login events
  • You can optionally label these unsuccessful logins as fraudulent or legitimate
  • You can prepare historical data with login events spanning more than 6 months and include 100,000 entities

We provide a sample dataset for testing purposes that you can use to get started.

Data validation

Before creating your ATI model, Amazon Fraud Detector checks if the metadata and variables you included in your dataset for training the model meet the size and format requirements. For more information, see Dataset validation. If the dataset doesn’t pass validation, a model isn’t created. For details on common dataset errors, see Common event dataset errors.

Define the entity, event type, and event variables

In this section, we walk through the steps to create an entity, event type, and event variables. Optionally, you can also define event labels.

Define the entity

The entity defines who is performing the event. To create an entity, complete the following steps:

  • On the Amazon Fraud Detector console, in the navigation pane, choose Entities.
  • Choose Create.
  • Enter an entity name and optional description.
  • Choose Create entity.

Define the event and event variables

An event is a business activity evaluated for fraud risk; this event is performed by the entity we just created. The event type defines the structure for an event sent to Amazon Fraud Detector, including variables of the event, the entity performing the event, and, if available, the labels that classify the event.

To create an event, complete the following steps:

  • On the Amazon Fraud Detector console, in the navigation pane, choose Events.
  • Choose Create.
  • For Name, enter a name for your event type.
  • For Entity, choose the entity created in the previous step.

Define the event variables

For event variables, complete the following steps:

  • In the Create IAM role section, enter the name of the S3 bucket where you uploaded your training data.
    The bucket name must exactly match the bucket that contains your dataset; otherwise, you get an access denied exception error.
  • Choose Create role.

  • For Data location, enter the path to your training data (the S3 URI you copied during the prerequisite steps), and choose Upload.

Amazon Fraud Detector extracts the headers from your training dataset and creates a variable for each header. Make sure to assign the variable to the correct variable type. As part of the model training process, Amazon Fraud Detector uses the variable type associated with the variable to perform variable enrichment and feature engineering. For more details about variable types, see Variable types.

Define event labels (optional)

Labels are used to categorize individual events as either fraud or legitimate. Event labels are optional for model training because the ATI model uses an innovative approach to unsupervised learning. The model differentiates events generated by the actual account owner (legit events) from those generated by abusive actors (anomalous events). We recommend you include EVENT_LABEL metadata and provide labels for your login events if available. If you provide labels, Amazon Fraud Detector uses them to automatically calculate an Account Takeover Discovery Rate and display it in the model performance metrics.

To define the labels and finish creating the event type, complete the following steps:

  • Define two labels (for this post, 1 and 0).
  • Choose Create event type.

Upload event data

In this section, we walk through the steps to upload event data to the service for model training.

ATI models are trained on a dataset stored internally in Amazon Fraud Detector. By storing event data in Amazon Fraud Detector, you can train models that use auto-computed variables to improve performance, simplify model retraining, and update fraud labels to close the machine learning feedback loop. See Stored events for more information on storing your event dataset with Amazon Fraud Detector.

After you define your event, navigate to the Stored events tab. On the Stored events tab, you can see information about your dataset, such as the number of events stored and the total size of the dataset in MB. Because you just created this event type, there are no stored events yet. On this page, you can turn event ingestion on or off. When event ingestion is on, you can upload historical event data to Amazon Fraud Detector and automatically store event data from predictions in real time.

The easiest way to store historical data is by uploading a CSV file and importing the events. Alternatively, you can stream the data into Amazon Fraud Detector using the SendEvent API (see our GitHub repository for sample notebooks). To import the event from a CSV file, complete the following steps:

  • Under Import events data, choose New import.
    You likely need to create a new IAM role. The import events feature requires both read and write access to Amazon S3.

  • Create a new IAM role and provide the S3 buckets for input and output files.
    The IAM role you create grants Amazon Fraud Detector access to these buckets to read input files and store output files. If you don’t plan to store output files in a separate bucket, enter the same bucket name for both.
  • Choose Create role.

  • Enter the location of the CSV file that contains your event data. This should be the S3 URI you copied earlier.
  • Choose Start to begin importing the events.

The import time varies based on the number of events you’re importing. For a dataset with 20,000 events, the process takes around 12 minutes, and after you refresh the page, the status changes to Completed. If the status changes to Error, choose the job name to show why the import failed.
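
If you prefer to stream events rather than import a CSV file, a minimal sketch of the SendEvent call might look like the following; the event type, entity type, and variable names are assumptions that must match what you defined earlier:

import boto3

frauddetector = boto3.client("frauddetector")

# Illustrative values; the event type, entity type, and variable names must match your definitions.
frauddetector.send_event(
    eventId="login-0003",
    eventTypeName="sample_account_login",
    eventTimestamp="2022-06-01T10:17:45Z",
    entities=[{"entityType": "customer", "entityId": "user-789"}],
    eventVariables={
        "ip_address": "198.51.100.24",
        "user_agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
        "valid_cred": "true",
    },
    # Optionally pass assignedLabel and labelTimestamp if the outcome of the login is already known
)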

Initiate model training

After successfully importing the events, you have all the pieces to initiate model training. To train a model, complete the following steps:

  • On the Amazon Fraud Detector console, in the navigation pane, choose Models.
  • Choose Add model and select Create model.
  • For Model name, enter the desired name for your model.
  • For Model type, select Account Takeover Insights.
  • For Event type, choose the event type you created earlier.

  • Under Historical event data, you can specify the date range of events to train the model if needed.
  • Choose Next.

  • Configure training by identifying the variables used as inputs to the model.
  • After evaluating the variables, choose Next.

It’s a best practice to include all the available variables, even if you’re unsure about their value to the model. After the model is trained, Amazon Fraud Detector provides a ranked list of each variable’s impact on the model performance, so you can know whether to include that variable in future model training. If labels are provided, Amazon Fraud Detector uses them to evaluate and display model performance in terms of the model’s discovery rate.

If labels aren’t provided, Amazon Fraud Detector uses negative sampling to provide examples of anomalous login attempts that help the model distinguish between legitimate and fraudulent activities. This produces precise risk scores that improve the model’s ability to capture incorrectly flagged legitimate activities.

After reviewing the model configured in the first two steps, choose Create and train the model.

You can see the model in training status in the console page. Creating and training the model takes approximately 45 minutes to complete. When the model has stopped training, you can check model performance by choosing the model version.

Evaluate model performance and deploy the model

In this section, we walk through the steps to review and evaluate the model performance.

Amazon Fraud Detector validates model performance using 15% of your data that wasn’t used to train the model and provides performance metrics. You need to consider these metrics and your business objectives to define a threshold that aligns with your business model. For further details on the metrics and how to determine thresholds, see Model performance metrics.

ATI is an anomaly detection model rather than a classification model; therefore, the evaluation metrics differ from classification models. When your ATI model has finished training, you can see the Anomaly Separation Index (ASI), a holistic measure of the model’s ability to identify high-risk anomalous logins. An ASI of 75% or more is considered good, 90% or more is considered high, and below 75% is considered poor.

To assist in choosing the right balance, Amazon Fraud Detector provides the following metrics to evaluate ATI model performance:

  • Anomaly Separation Index (ASI) – Summarizes the overall ability of the model to separate anomalous activities from the expected behavior of users. A model with no separability power will have the lowest possible ASI score of 0.5. In contrast, the model with a high separability power will have the highest possible ASI score of 1.0.
  • Challenge Rate (CR) – At the selected score threshold, the percentage of login events that the model would recommend challenging in the form of a one-time password, multi-factor authentication, identity verification, investigation, and so on.
  • Anomaly Discovery Rate (ADR) – Quantifies the percentage of anomalies the model can detect at the selected score threshold. A lower score threshold increases the percentage of anomalies captured by the model. Still, it would also require challenging a more significant percentage of login events, leading to higher customer friction.
  • ATO Discovery Rate (ATODR) – Quantifies the percentage of account compromise events that the model can detect at the selected score threshold. This metric is only available if 50 or more entities with at least one labeled ATO event are present in the ingested dataset.

In the following example, we have an ASI of 0.96 (high), which indicates a high ability to separate anomalous activities from the normal behavior of users. By writing a rule using a model score threshold of 500, you challenge or create friction on 6% of all login activities, catching 96% of anomalous activities.

Another important metric is the model variable importance. Variable importance gives you an understanding of how the different variables relate to the model performance. You can have two types of variables: raw and aggregate variables. Raw variables are the ones that were defined based on the dataset, whereas aggregate variables are a combination of multiple variables that are enriched and have an aggregated importance value.

For more information about variable importance, see Model variable importance.

A variable (raw or aggregate) with a much higher importance value relative to the rest could indicate that the model might be overfitting. In contrast, variables with relatively low importance values could just be noise.

After reviewing the model performance and deciding what model score thresholds align with your business model, you can deploy the model version. For that, on the Actions menu, choose Deploy model version. With the model deployed, we create a detector endpoint and perform real-time prediction.

Create a detector endpoint and define business rules

Amazon Fraud Detector uses detector endpoints to generate fraud predictions. A detector contains detection logic, such as trained models and business rules, for a specific event you want to evaluate for fraud. Detection logic uses rules to tell Amazon Fraud Detector how to interpret the data associated with the model.

To create a detector, complete the following steps:

  • On the Amazon Fraud Detector console, in the navigation pane, choose Detectors.
  • Choose Create detector.
  • For Detector name, enter a name.
  • Optionally, describe your detector.
  • For Event type, choose the same event type as the model created earlier.
  • Choose Next.

  • On the Add model (optional) page, choose Add model.

  • To add a model, choose the model you trained and published during the model training steps and choose the active version.
  • Choose Add model.

As part of the next step, you create the business rules that define an outcome. A rule is a condition that tells Amazon Fraud Detector how to interpret variable values during a fraud prediction. A rule consists of one or more variables, a logic expression, and one or more outcomes. An outcome is the result of a fraud prediction and is returned if the rule matches during an evaluation.

  • Define decline_rule as $your_model_name_insightscore >= 950 with the outcome deny_login.
  • Define friction_rule as $your_model_name_insightscore >= 855 and $your_model_name_insightscore < 950 with the outcome challenge_login.
  • Define approve_rule as $your_model_name_insightscore < 855 with the outcome approve_login.

Outcomes are strings returned in the GetEventPrediction API response. You can use outcomes to trigger actions in calling applications and downstream systems, or simply to identify which logins are likely to be fraudulent or legitimate.
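
If you prefer to script these rules instead of creating them in the console, a hedged sketch using the CreateRule API might look like the following. The detector ID, the model name in the expressions, and the outcome names are assumptions that must match your own resources, and the outcomes must already exist (for example, created with PutOutcome).

import boto3

frauddetector = boto3.client("frauddetector")

# Assumes a detector named "ato_detector" and the outcomes deny_login, challenge_login,
# and approve_login already exist; replace your_model_name with your ATI model's name.
rules = [
    ("decline_rule", "$your_model_name_insightscore >= 950", ["deny_login"]),
    ("friction_rule", "$your_model_name_insightscore >= 855 and $your_model_name_insightscore < 950", ["challenge_login"]),
    ("approve_rule", "$your_model_name_insightscore < 855", ["approve_login"]),
]

for rule_id, expression, outcomes in rules:
    frauddetector.create_rule(
        ruleId=rule_id,
        detectorId="ato_detector",
        expression=expression,
        language="DETECTORPL",
        outcomes=outcomes,
    )

If you create the rules in the console as described above, continue with the following steps.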

  • On the Add Rules page, choose Next after you finish adding all your rules.

  • In the Configure rule execution section, choose the mode for your rules engine.
    The Amazon Fraud Detector rules engine has two modes: first matched or all matched. First matched mode is for sequential rule runs, returning the outcome for the first condition met. The other mode is all matched, which evaluates all rules and returns outcomes from all the matching rules. In this example, we use the first matched mode for our detector.

After this process, you’re ready to create your detector and run some tests.

  • To run a test, go to your newly created detector and choose the detector version you want to use.
  • Provide the variable values as requested and choose Run test.

As a result of the test, you receive the risk score and the outcome based on your business rules.

You can also search past predictions by going to the left panel and choosing Search past predictions. The prediction is based on each variable’s contribution to the overall likelihood of a fraudulent event. The following screenshot is an example of a past prediction showing the input variables and how they influenced the fraud prediction score.

Get real-time predictions

To get real-time predictions and integrate Amazon Fraud Detector into your workflow, we need to publish the detector endpoint. Complete the following steps:

  • Go to the newly created detector and choose the detector version, which will be version 1.
  • On the Actions menu, choose Publish.

You can perform real-time predictions with the published detector by calling the GetEventPrediction API. The following is a sample Python code for calling the GetEventPrediction API:

import boto3

fraudDetector = boto3.client('frauddetector')

# Call the published detector to score a single event in real time
response = fraudDetector.get_event_prediction(
    detectorId='sample_detector',
    eventId='802454d3-f7d8-482d-97e8-c4b6db9a0428',
    eventTypeName='sample_transaction',
    eventTimestamp='2021-01-13T23:18:21Z',
    entities=[{'entityType': 'customer', 'entityId': '12345'}],
    eventVariables={
        'email_address': 'johndoe@exampledomain.com',
        'ip_address': '1.2.3.4'
    }
)

# The response includes model scores and the outcomes returned by your matching rules
print(response['modelScores'], response['ruleResults'])

Conclusion

Amazon Fraud Detector relies on specific models with tailored algorithms, enrichments, and feature transformations to detect fraudulent events across multiple use cases. In this post, you learned how to ingest data, train and deploy a model, write business rules, and publish a detector to generate real-time fraud prediction on potentially compromised accounts.

Visit Amazon Fraud Detector to learn more about the service, or check out our GitHub repo for code samples, notebooks, and synthetic datasets.


About the authors

Marcel Pividal is a Sr. AI Services Solutions Architect in the World-Wide Specialist Organization. Marcel has more than 20 years of experience solving business problems through technology for Fintechs, Payment Providers, Pharma, and government agencies. His current areas of focus are Risk Management, Fraud Prevention, and Identity Verification.

Mike Ames is a data scientist turned identity verification solution specialist with extensive experience developing machine learning and AI solutions to protect organizations from fraud, waste, and abuse. In his spare time, you can find him hiking, mountain biking, or playing frisbee with his dog Max.


Metrics for evaluating content moderation in Amazon Rekognition and other content moderation services

Content moderation is the process of screening and monitoring user-generated content online. To provide a safe environment for both users and brands, platforms must moderate content to ensure that it falls within preestablished guidelines of acceptable behavior that are specific to the platform and its audience.

When a platform moderates content, acceptable user-generated content (UGC) can be created and shared with other users. Inappropriate, toxic, or banned behaviors can be prevented, blocked in real time, or removed after the fact, depending on the content moderation tools and procedures the platform has in place.

You can use Amazon Rekognition Content Moderation to detect content that is inappropriate, unwanted, or offensive, to create a safer user experience, provide brand safety assurances to advertisers, and comply with local and global regulations.

In this post, we discuss the key elements needed to evaluate the performance aspect of a content moderation service in terms of various accuracy metrics, and provide an example using the Amazon Rekognition Content Moderation APIs.

What to evaluate

When evaluating a content moderation service, we recommend the following steps.

Before you can evaluate the performance of the API on your use cases, you need to prepare a representative test dataset. The following are some high-level guidelines:

  • Collection – Take a large enough random sample (images or videos) of the data you eventually want to run through Amazon Rekognition. For example, if you plan to moderate user-uploaded images, you can take a week’s worth of user images for the test. We recommend choosing a set that has enough images without getting too large to process (such as 1,000–10,000 images), although larger sets are better.
  • Definition – Use your application’s content guidelines to decide which types of unsafe content you’re interested in detecting from the Amazon Rekognition moderation concepts taxonomy. For example, you may be interested in detecting all types of explicit nudity and graphic violence or gore.
  • Annotation – Now you need a human-generated ground truth for your test set using the chosen labels, so that you can compare machine predictions against them. This means that each image is annotated for the presence or absence of your chosen concepts. To annotate your image data, you can use Amazon SageMaker Ground Truth (GT) to manage the image annotation process, including image labeling, consolidating annotations, and processing the annotation output.

Get predictions on your test dataset with Amazon Rekognition

Next, you want to get predictions on your test dataset.

The first step is to decide on a minimum confidence score (a threshold value, such as 50%) at which you want to measure results. Our default threshold is set to 50, which offers a good balance between retrieving large amounts of unsafe content without incurring too many false predictions on safe content. However, your platform may have different business needs, so you should customize this confidence threshold as needed. You can use the MinConfidence parameter in your API requests to balance detection of content (recall) vs. the accuracy of detection (precision). If you reduce MinConfidence, you are likely to detect most of the inappropriate content, but are also likely to pick up content that is not actually inappropriate. If you increase MinConfidence, you are likely to ensure that all your detected content is truly inappropriate, but some content may not be tagged. We suggest experimenting with a few MinConfidence values on your dataset and quantitatively selecting the best value for your data domain.

Next, run each sample (image or video) of your test set through the Amazon Rekognition moderation API (DetectModerationLabels).
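
A minimal sketch of this step for images, sweeping a few candidate MinConfidence thresholds from a single API call per image, might look like the following (bucket, key, and function names are illustrative):

import boto3

rekognition = boto3.client("rekognition")

def predict_unsafe(bucket, key, thresholds=(50, 70, 80)):
    # Request labels at the lowest candidate confidence so each threshold can be evaluated client-side
    response = rekognition.detect_moderation_labels(
        Image={"S3Object": {"Bucket": bucket, "Name": key}},
        MinConfidence=min(thresholds),
    )
    labels = response["ModerationLabels"]
    # For each threshold, the image is "unsafe" if any label meets that confidence
    return {t: any(l["Confidence"] >= t for l in labels) for t in thresholds}

print(predict_unsafe("example-moderation-test-set", "images/img_0001.jpg"))

Calling the API once per image at the lowest candidate threshold and filtering client-side avoids repeated calls when you compare thresholds.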

Measure model accuracy on images

You can assess the accuracy of a model by comparing human-generated ground truth annotations with the model predictions. You repeat this comparison for every image independently and then aggregate over the whole test set:

  • Per-image results – A model prediction is defined as the pair {label_name, confidence_score} (where the confidence score >= the threshold you selected earlier). For each image, a prediction is considered correct when it matches the ground truth (GT). A prediction is one of the following options:

    • True Positive (TP): both prediction and GT are “unsafe”
    • True Negative (TN): both prediction and GT are “safe”
    • False Positive (FP): the prediction says “unsafe”, but the GT is “safe”
    • False Negative (FN): the prediction is “safe”, but the GT is “unsafe”
  • Aggregated results over all images – Next, you can aggregate these predictions into dataset-level results:

    • False positive rate (FPR) – The percentage of safe images in the test set that are wrongly flagged by the model as containing unsafe content: FPR = FP / (TN + FP).
    • False negative rate (FNR) – The percentage of unsafe images in the test set that are missed by the model: FNR = FN / (FN + TP).
    • True positive rate (TPR) – Also called recall, this computes the percentage of unsafe content (ground truth) that is correctly discovered or predicted by the model: TPR = TP / (TP + FN) = 1 – FNR.
    • Precision – The percentage of correct unsafe predictions with respect to the total number of unsafe predictions made: Precision = TP / (TP + FP).

Let’s explore an example. Let’s assume that your test set contains 10,000 images: 9,950 safe and 50 unsafe. The model correctly predicts 9,800 out of 9,950 images as safe and 45 out of 50 as unsafe:

  • TP = 45
  • TN = 9800
  • FP = 9950 – 9800 = 150
  • FN = 50 – 45 = 5
  • FPR = 150 / (9800 + 150) = 0.015 = 1.5%
  • FNR = 5 / (5 + 45) = 0.1 = 10%
  • TPR/Recall = 45 / (45 + 5) = 0.9 = 90%
  • Precision = 45 / (45 + 150) = 0.23 = 23%
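
The same numbers can be reproduced with a few lines of Python, which is a convenient sanity check when you script the evaluation; the counts below are the ones from the example above:

# Counts from the example test set above
TP, TN, FP, FN = 45, 9800, 150, 5

fpr = FP / (TN + FP)          # 150 / 9950  ≈ 1.5%
fnr = FN / (FN + TP)          # 5 / 50      = 10%
recall = TP / (TP + FN)       # 45 / 50     = 90%
precision = TP / (TP + FP)    # 45 / 195    ≈ 23%

print(f"FPR={fpr:.1%}, FNR={fnr:.1%}, Recall={recall:.1%}, Precision={precision:.1%}")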

Measure model accuracy on videos

If you want to evaluate the performance on videos, a few additional steps are necessary:

  1. Sample a subset of frames from each video. We suggest sampling uniformly with a rate of 0.3–1 frames per second (fps). For example, if a video is encoded at 24 fps and you want to sample one frame every 3 seconds (0.3 fps), you need to select one every 72 frames.
  2. Run these sampled frames through Amazon Rekognition content moderation. You can either use our video API, which already samples frames for you (at a rate of 3 fps), or use the image API, in which case you want to sample more sparsely. We recommend the latter option, given the redundancy of information in videos (consecutive frames are very similar).
  3. Compute the per-frame results as explained in the previous section (per-image results).
  4. Aggregate results over the whole test set. Here you have two options, depending on the type of outcome that matters for your business:
    1. Frame-level results – This considers all the sampled frames as independent images and aggregates the results exactly as explained earlier for images (FPR, FNR, recall, precision). If some videos are considerably longer than others, they will contribute more frames to the total count, making the comparison unbalanced. In that case, we suggest changing the initial sampling strategy to a fixed number of frames per video. For example, you could uniformly sample 50–100 frames per video (assuming videos are at least 2–3 minutes long).
    2. Video-level results – For some use cases, it doesn’t matter whether the model is capable of correctly predicting 50% or 99% of the frames in a video. Even a single wrong unsafe prediction on a single frame could trigger a downstream human evaluation, and only videos with 100% correct predictions are truly considered correct. If this is your use case, we suggest you compute FPR/FNR/TPR over the frames of each video and categorize each video as shown in the following table:
Video ID | Accuracy (Results Aggregated Over All the Frames of the Video) | Per-Video Categorization
. | Total FP = 0 and Total FN = 0 | Perfect predictions
. | Total FP > 0 | False Positive (FP)
. | Total FN > 0 | False Negative (FN)

After you have computed these for each video independently, you can then compute all the metrics we introduced earlier:

  • The percentage of videos that are wrongly flagged (FP) or missed (FN)
  • Precision and recall

Measure performance against goals

Finally, you need to interpret these results in the context of your goals and capabilities.

First, consider your business needs in regards to the following:

  • Data – Learn about your data (daily volume, type of data, and so on) and the distribution of your unsafe vs. safe content. For example, is it balanced (50/50), skewed (10/90), or very skewed (1/99, meaning that only 1% is unsafe)? Understanding such distribution can help you define your actual metric goals. For example, the amount of safe content is often an order of magnitude larger than the amount of unsafe content (very skewed), making this almost an anomaly detection problem. Within this scenario, the number of false positives may outnumber the number of true positives, and you can use your data information (distribution skewness, volume of data, and so on) to decide the FPR you can work with.
  • Metric goals – What are the most critical aspects of your business? Lowering the FPR often comes at the cost of a higher FNR (and vice versa), and it’s important to find the right balance that works for you. If you can’t miss any unsafe content, you likely want close to 0% FNR (100% recall). However, this will incur the largest number of false positives, and you need to decide the target (maximum) FPR you can work with, based on your post-prediction pipeline. You may want to allow some level of false negatives to be able to find a better balance and lower your FPR: for example, accepting a 5% FNR instead of 0% could reduce the FPR from 2% to 0.5%, considerably reducing the amount of flagged content.

Next, ask yourself what mechanisms you will use to parse the flagged images. Even though the APIs may not provide 0% FPR and FNR, they can still bring huge savings and scale (for example, by only flagging 3% of your images, you have already filtered out 97% of your content). When you pair the API with a downstream mechanism, like a human workforce that reviews the flagged content, you can easily reach your goals (for example, 0.5% flagged content). Note how this pairing is considerably cheaper than having to do a human review on 100% of your content.

When you have decided on your downstream mechanisms, we suggest you evaluate the throughput that you can support. For example, if you have a workforce that can only verify 2% of your daily content, then your target goal from our content moderation API is a flag rate (FPR+TPR) of 2%.

Finally, if obtaining ground truth annotations is too hard or too expensive (for example, your volume of data is too large), we suggest annotating the small number of images flagged by the API. Although this doesn’t allow for FNR evaluations (because your data doesn’t contain any false negatives), you can still measure TPR and FPR.

In the following section, we provide a solution for image moderation evaluation. You can take a similar approach for video moderation evaluation.

Solution overview

The following diagram illustrates the various AWS services you can use to evaluate the performance of Amazon Rekognition content moderation on your test dataset.

The content moderation evaluation has the following steps:

  1. Upload your evaluation dataset into Amazon Simple Storage Service (Amazon S3).
  2. Use Ground Truth to assign ground truth moderation labels.
  3. Generate the predicted moderation labels using the Amazon Rekognition pre-trained moderation API using a few threshold values. (For example, 70%, 75% and 80%).
  4. Assess the performance for each threshold by computing true positives, true negatives, false positives, and false negatives. Determine the optimum threshold value for your use case.
  5. Optionally, you can tailor the size of the workforce based on true and false positives, and use Amazon Augmented AI (Amazon A2I) to automatically send all flagged content to your designated workforce for a manual review.

The following sections provide the code snippets for steps 1, 2, and 3. For complete end-to-end source code, refer to the provided Jupyter notebook.

Prerequisites

Before you get started, complete the following steps to set up the Jupyter notebook:

  1. Create a notebook instance in Amazon SageMaker.
  2. When the notebook is active, choose Open Jupyter.
  3. On the Jupyter dashboard, choose New, and choose Terminal.
  4. In the terminal, enter the following code:
    cd SageMaker
    git clone https://github.com/aws-samples/amazon-rekognition-code-samples.git

  5. Open the notebook for this post: content-moderation-evaluation/Evaluating-Amazon-Rekognition-Content-Moderation-Service.ipynb.
  6. Upload your evaluation dataset to Amazon Simple Storage Service (Amazon S3).

We will now go through steps 2 through 4 in the Jupyter notebook.

Use Ground Truth to assign moderation labels

To assign labels in Ground Truth, complete the following steps:

  1. Create a manifest input file for your Ground Truth job and upload it to Amazon S3.
  2. Create the labeling configuration, which contains all moderation labels that are needed for the Ground Truth labeling job. To check the limit for the number of label categories you can use, refer to Label Category Quotas. In the following code snippet, we use five labels (refer to the hierarchical taxonomy used in Amazon Rekognition for more details) plus one label (Safe_Content) that marks content as safe:
    import json
    import boto3
    
    s3 = boto3.client("s3")  # BUCKET and EXP_NAME are defined earlier in the notebook
    
    # customize CLASS_LIST to include all labels that can be used to classify sample data; up to 10 labels are supported
    # keep the names easy to match with the content moderation service's supported taxonomy
    CLASS_LIST = ["<label_1>", "<label_2>", "<label_3>", "<label_4>", "<label_5>", "Safe_Content"]
    print("Label space is {}".format(CLASS_LIST))
    
    json_body = {"labels": [{"label": label} for label in CLASS_LIST]}
    with open("class_labels.json", "w") as f:
        json.dump(json_body, f)
    
    s3.upload_file("class_labels.json", BUCKET, EXP_NAME + "/class_labels.json")

  3. Create a custom worker task template to provide the Ground Truth workforce with labeling instructions and upload it to Amazon S3.
    The Ground Truth label job is defined as an image classification (multi-label) task. Refer to the source code for instructions to customize the instruction template.
  4. Decide which workforce you want to use to complete the Ground Truth job. You have two options (refer to the source code for details):
    1. Use a private workforce in your own organization to label the evaluation dataset.
    2. Use a public workforce to label the evaluation dataset.
  5. Create and submit a Ground Truth labeling job. You can also adjust the following code to configure the labeling job parameters to meet your specific business requirements. Refer to the source code for complete instructions on creating and configuring the Ground Truth job.
    human_task_config = {
        "AnnotationConsolidationConfig": {
            "AnnotationConsolidationLambdaArn": acs_arn,
        },
        "PreHumanTaskLambdaArn": prehuman_arn,
        "MaxConcurrentTaskCount": 200,  # 200 images will be sent at a time to the workteam.
        "NumberOfHumanWorkersPerDataObject": 3,  # 3 separate workers will be required to label each image.
        "TaskAvailabilityLifetimeInSeconds": 21600,  # Your workteam has 6 hours to complete all pending tasks.
        "TaskDescription": task_description,
        "TaskKeywords": task_keywords,
        "TaskTimeLimitInSeconds": 180,  # Each image must be labeled within 3 minutes.
        "TaskTitle": task_title,
        "UiConfig": {
            "UiTemplateS3Uri": "s3://{}/{}/instructions.template".format(BUCKET, EXP_NAME),
        },
    }
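
With human_task_config in place, submitting the job might look like the following sketch; the job name, label attribute name, role ARN, and S3 paths are placeholders (the accompanying notebook builds the real values, including the workteam configuration):

import time
import boto3

sagemaker_client = boto3.client("sagemaker")
job_name = "ground-truth-cm-" + str(int(time.time()))
role_arn = "arn:aws:iam::<account-id>:role/<ground-truth-execution-role>"  # placeholder

# BUCKET, EXP_NAME, and human_task_config are defined earlier in the notebook;
# the workteam ARN is added to human_task_config based on the workforce you chose.
sagemaker_client.create_labeling_job(
    LabelingJobName=job_name,
    LabelAttributeName="moderation-labels",
    InputConfig={
        "DataSource": {"S3DataSource": {"ManifestS3Uri": f"s3://{BUCKET}/{EXP_NAME}/input.manifest"}}
    },
    OutputConfig={"S3OutputPath": f"s3://{BUCKET}/{EXP_NAME}/output/"},
    RoleArn=role_arn,
    LabelCategoryConfigS3Uri=f"s3://{BUCKET}/{EXP_NAME}/class_labels.json",
    HumanTaskConfig=human_task_config,
)
print("Labeling job name is: " + job_name)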

After the job is submitted, you should see output similar to the following:

Labeling job name is: ground-truth-cm-1662738403

Wait for labeling job on the evaluation dataset to complete successfully, then continue to the next step.

Use the Amazon Rekognition moderation API to generate predicted moderation labels

The following code snippet shows how to use the Amazon Rekognition moderation API to generate moderation labels:

import boto3

client = boto3.client('rekognition')

def moderate_image(photo, bucket):
    # Returns the number of moderation labels detected; a nonzero count means the image was flagged as unsafe
    response = client.detect_moderation_labels(Image={'S3Object': {'Bucket': bucket, 'Name': photo}})
    return len(response['ModerationLabels'])

Assess the performance

You first retrieved ground truth moderation labels from the Ground Truth labeling job results for the evaluation dataset, then you ran the Amazon Rekognition moderation API to get predicted moderation labels for the same dataset. Because this is a binary classification problem (safe vs. unsafe content), we count true positives, true negatives, false positives, and false negatives, assuming unsafe content is positive, and then calculate the corresponding evaluation metrics (FPR, FNR, recall, and precision) as defined earlier.

The following code snippet shows how to calculate those metrics:

FPR = FP / (FP + TN)
FNR = FN / (FN + TP)
Recall = TP / (TP + FN)
Precision = TP / (TP + FP)

Conclusion

This post discusses the key elements needed to evaluate the performance aspect of your content moderation service in terms of various accuracy metrics. However, accuracy is only one of the many dimensions that you need to evaluate when choosing a particular content moderation service. It’s critical that you include other parameters, such as the service’s total feature set, ease of use, existing integrations, privacy and security, customization options, scalability implications, customer service, and pricing. To learn more about content moderation in Amazon Rekognition, visit Amazon Rekognition Content Moderation.


About the authors

Amit Gupta is a Senior AI Services Solutions Architect at AWS. He is passionate about enabling customers with well-architected machine learning solutions at scale.

Davide Modolo is an Applied Science Manager at AWS AI Labs. He has a PhD in computer vision from the University of Edinburgh (UK) and is passionate about developing new scientific solutions for real-world customer problems. Outside of work, he enjoys traveling and playing any kind of sport, especially soccer.

Jian Wu is a Senior Enterprise Solutions Architect at AWS. He’s been with AWS for 6 years working with customers of all sizes. He is passionate about helping customers to innovate faster via the adoption of the Cloud and AI/ML. Prior to joining AWS, Jian spent 10+ years focusing on software development, system implementation and infrastructure management. Aside from work, he enjoys staying active and spending time with his family.


Face-off Probability, part of NHL Edge IQ: Predicting face-off winners in real time during televised games

Face-off Probability is the National Hockey League’s (NHL) first advanced statistic using machine learning (ML) and artificial intelligence. It uses real-time Player and Puck Tracking (PPT) data to show viewers which player is likely to win a face-off before the puck is dropped, and provides broadcasters and viewers the opportunity to dive deeper into the importance of face-off matches and the differences in player abilities. Based on 10 years of historical data, hundreds of thousands of face-offs were used to engineer over 70 features fed into the model to provide real-time probabilities. Broadcasters can now discuss how a key face-off win by a player led to a goal or how the chances of winning a face-off decrease as a team’s face-off specialist is waved out of a draw. Fans can see visual, real-time predictions that show them the importance of a key part of the game.

In this post, we focus on how the ML model for Face-off Probability was developed and the services used to put the model into production. We also share the key technical challenges that were solved during construction of the Face-off Probability model.

How it works

Imagine the following scenario: It’s a tie game between two NHL teams that will determine who moves forward. We’re in the third period with 1:22 left to play. Two players from opposite teams line up to take the draw at the face-off dot closest to one of the nets. The linesman notices a defensive player encroaching into the face-off circle and waves that player out of the face-off due to the violation. A less experienced defensive player comes in to take the draw as his replacement. The attacking team wins the face-off, gets possession of the puck, and immediately scores to take the lead. The score holds up for the remaining minute of the game and decides who moves forward. Which player was favored to win the face-off before the initial duo was changed? How much did the defensive team’s probability of winning the face-off decrease because of the violation that forced a different player to take the draw? Face-off Probability, the newest NHL Edge IQ statistic powered by AWS, can now answer these questions.

When there is a stoppage in play, Face-off Probability generates predictions for who will win the upcoming face-off based on the players on the ice, the location of the face-off, and the current game situation. The predictions are generated throughout the stoppage until the game clock starts running again. Predictions occur at sub-second latency and are triggered any time there is a change in the players involved in the face-off.

An NHL faceoff shot from up top

Overcoming key obstacles for face-off probability

Predicting face-off probability in real-time broadcasts can be broken down into two specific sub-problems:

  • Modeling the face-off event as an ML problem, understanding the requirements and limitations, preparing the data, engineering the data signals, exploring algorithms, and ensuring reliability of results
  • Detecting a face-off event during the game from a stream of PPT events, collecting parameters needed for prediction, calling the model, and submitting results to broadcasters

Predicting the probability of a player winning a face-off in real time on a televised broadcast has several technical challenges that had to be overcome. These included determining the features required and modeling methods to predict an event that has a large amount of uncertainty, and determining how to use streaming PPT sensor data to identify where a face-off is occurring, the players involved, and the probability of each player winning the face-off, all within hundreds of milliseconds.

Players huddling in a shot of a Faceoff during a game

Building an ML model for difficult-to-predict events

Predicting events such as face-off winning probabilities during a live game is a complex task that requires a significant amount of quality historic data and data streaming capabilities. To identify and understand the important signals in such a rich data environment, the development of ML models requires extensive subject matter expertise. The Amazon Machine Learning Solutions Lab partnered with NHL hockey and data experts to work backward from their target goal of enhancing their fan experience. By continuously listening to NHL’s expertise and testing hypotheses, AWS’s scientists engineered over 100 features that correlate to the face-off event. In particular, the team classified this feature set into one of three categories:

  • Historical statistics on player performances such as the number of face-offs a player has taken and won in the last five seasons, the number of face-offs the player has taken and won in previous games, a player’s winning percentages over several time windows, and the head-to-head winning percentage for each player in the face-off
  • Player characteristics such as height, weight, handedness, and years in the league
  • In-game situational data that might affect a player’s performance, such as the score of the game, the elapsed time in the game to that point, where the face-off is located, the strength of each team, and which player has to put their stick down for the face-off first

AWS’s ML scientists considered the problem as a binary classification problem: either the home player wins the face-off or the away player wins the face-off. With data from more than 200,000 historical face-offs, they used a LightGBM model to predict which of the two players involved with a face-off event is likely to win.
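
The NHL’s production feature set and training code aren’t public, but a minimal sketch of this framing, using LightGBM for binary classification with illustrative (hypothetical) feature names and a placeholder faceoffs.csv file, might look like the following:

import lightgbm as lgb
import pandas as pd
from sklearn.model_selection import train_test_split

# Placeholder table of engineered features plus a label; categorical inputs are assumed to be numerically encoded
df = pd.read_csv("faceoffs.csv")
features = ["home_win_pct_5yr", "away_win_pct_5yr", "head_to_head_win_pct",
            "score_differential", "seconds_elapsed", "home_player_height_cm"]
X, y = df[features], df["home_player_won"]  # 1 if the home player won the face-off

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Gradient-boosted trees; the raw predicted probabilities are used as the face-off win predictions
model = lgb.LGBMClassifier(n_estimators=500, learning_rate=0.05)
model.fit(X_train, y_train)
probs = model.predict_proba(X_test)[:, 1]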

Determining if a face-off is about to occur and which players are involved

When a whistle blows and the play is stopped, Face-off Probability begins to make predictions. However, Face-off Probability has to first determine where the face-off is occurring and which player from each team is involved in the face-off. The data stream indicates events as they occur but doesn’t provide information on when an event is likely to occur in the future. As such, the sensor data of the players on the ice is needed to determine if and where a face-off is about to happen.

The PPT system produces real-time locations and velocities for players on the ice at up to 60 events per second. These locations and velocities were used to determine where the face-off is happening on the ice and if it’s likely to happen soon. By knowing how close the players are to known face-off locations and how stationary the players were, Face-off Probability was able to determine that a face-off was likely to occur and the two players that would be involved in the face-off.

Determining the correct cut-off distance for proximity to a face-off location and the corresponding cut-off velocity for stationary players was accomplished using a decision tree model. With PPT data from the 2020-2021 season, we built a model to predict the likelihood that a face-off is occurring at a specified location given the average distance of each team to the location and the velocities of the players. The decision tree provided the cut-offs for each metric, which we included as rules-based logic in the streaming application.
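
As an illustration of this approach, the following sketch fits a shallow decision tree on aggregated distance and speed features and prints the learned split thresholds, which could then be hard-coded as rules in the streaming application. The data file and column names are hypothetical.

import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical training frame: one row per stoppage, labeled with whether a
# face-off actually occurred at the candidate dot shortly afterward
df = pd.read_csv("ppt_stoppage_windows.csv")
features = ["avg_home_dist_to_dot", "avg_away_dist_to_dot", "avg_player_speed"]

tree = DecisionTreeClassifier(max_depth=3, min_samples_leaf=500)
tree.fit(df[features], df["faceoff_at_dot"])

# The split thresholds become the distance and velocity cut-offs
# embedded as rules-based logic in the streaming application
print(export_text(tree, feature_names=features))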

With the correct face-off location determined, the player taking the face-off for each team was identified as the player from that team closest to the known location. This gave the application the flexibility to identify the correct players while also adjusting if a new player has to take the face-off because the original player is waved out due to an infraction. Making and updating the prediction for the correct player was a key focus for the real-time usability of the model in broadcasts, which we describe further in the next section.

Model development and training

To develop the model, we used more than 200,000 historical face-off data points, along with the custom feature set engineered in collaboration with the subject matter experts. We looked at features like in-game situations, the historical performance of the players taking the face-off, player-specific characteristics, and the head-to-head record of those players, both in the current season and over their careers. Collectively, this resulted in over 100 features created from a combination of raw and derived data.

To assess different features and how they might influence the model, we conducted extensive feature analysis as part of the exploratory phase. We used a mix of univariate and multivariate tests. For multivariate tests, we used decision tree visualization techniques for interpretability. To assess statistical significance, we used chi-squared and Kolmogorov-Smirnov (KS) tests to check for dependence and differences in distributions.
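
The following sketch shows how such significance tests might look with SciPy, assuming a feature table with one row per historical face-off; the file and column names are hypothetical.

import pandas as pd
from scipy.stats import chi2_contingency, ks_2samp

df = pd.read_csv("faceoff_features.csv")  # hypothetical feature table

# Chi-squared test: is the handedness matchup associated with the outcome?
contingency = pd.crosstab(df["handedness_matchup"], df["home_player_won"])
chi2, p_chi, dof, _ = chi2_contingency(contingency)
print(f"chi-squared={chi2:.1f}, p={p_chi:.4f}")

# KS test: does head-to-head win percentage differ between wins and losses?
wins = df.loc[df["home_player_won"] == 1, "head_to_head_win_pct"]
losses = df.loc[df["home_player_won"] == 0, "head_to_head_win_pct"]
ks_stat, p_ks = ks_2samp(wins, losses)
print(f"KS={ks_stat:.3f}, p={p_ks:.4f}")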

A decision tree showing how the model estimates based on the underlying data and features

We explored classification techniques and models with the expectation that the raw probabilities would be treated as the predictions. We evaluated algorithms including nearest neighbors, decision trees, neural networks, and collaborative filtering, while trying different sampling strategies (filtering, random, stratified, and time-based sampling), and assessed performance using Area Under the Curve (AUC), the calibration distribution, and Brier score loss. In the end, we found that the LightGBM model worked best, with well-calibrated accuracy metrics.
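
As a sketch of how these metrics can be computed with scikit-learn, the following reuses the y_test labels and predicted probabilities (probs) from the earlier training sketch:

from sklearn.calibration import calibration_curve
from sklearn.metrics import brier_score_loss, roc_auc_score

auc = roc_auc_score(y_test, probs)
brier = brier_score_loss(y_test, probs)
frac_positive, mean_predicted = calibration_curve(y_test, probs, n_bins=10)

print(f"AUC={auc:.3f}, Brier score loss={brier:.3f}")
# A well-calibrated model has the observed win rate close to the
# predicted probability in every bin
for observed, predicted in zip(frac_positive, mean_predicted):
    print(f"predicted {predicted:.2f} -> observed {observed:.2f}")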

To evaluate the performance of the models, we used multiple techniques, including a held-out test set that the trained model was never exposed to. Additionally, the teams conducted extensive manual assessments of the results, looking at edge cases and trying to understand why the model judged that a certain player should have won or lost a given face-off.

Based on feedback from the manual reviewers, we adjusted features when required or iterated on the model to confirm that its performance was as expected.

Deploying Face-off Probability for real-time use during national television broadcasts

One of the goals of the project was not just to predict the winner of the face-off, but to build a foundation for solving a number of similar problems in a real-time and cost-efficient way. That goal helped determine which components to use in the final architecture.

architecture diagram for faceoff application

The first important component is Amazon Kinesis Data Streams, a serverless streaming data service that acts as a decoupler between the specific implementation of the PPT data provider and the consuming applications, protecting the latter from disruptive changes in the former. Kinesis Data Streams also offers enhanced fan-out, which supports up to 20 parallel consumers, each with a low latency of around 70 milliseconds and its own dedicated throughput of 2 MB/s per shard.

PPT events don’t come for all players at once, but arrive discretely for each player as well as other events in the game. Therefore, to implement the upcoming face-off detection algorithm, the application needs to maintain a state.

The second important component of the architecture is Amazon Kinesis Data Analytics for Apache Flink. Apache Flink is a distributed, high-throughput, low-latency streaming data flow engine that provides a convenient and easy-to-use DataStream API, and it supports stateful processing functions, checkpointing, and parallel processing out of the box. This speeds up development and provides access to low-level routines and components, which allows for a flexible design and implementation of applications.

Kinesis Data Analytics provides the underlying infrastructure for your Apache Flink applications. It eliminates a need to deploy and configure a Flink cluster on Amazon Elastic Compute Cloud (Amazon EC2) or Kubernetes, which reduces maintenance complexity and costs.

The third crucial component is Amazon SageMaker. We used SageMaker to build the model, but we also had to make a decision early in the project: should scoring be implemented inside the face-off detecting application itself, complicating the implementation, or should the application call SageMaker remotely and sacrifice some latency to communication over the network? To make an informed decision, we performed a series of benchmarks to verify SageMaker latency and scalability, and validated that the average latency was less than 100 milliseconds under load, which was within our expectations.
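
The benchmark itself can be as simple as timing repeated calls to the endpoint with the AWS SDK. The following sketch assumes a deployed endpoint name and a payload format, both of which are illustrative rather than the actual ones used in the project.

import json
import statistics
import time

import boto3

runtime = boto3.client("sagemaker-runtime")
payload = json.dumps({"features": [0.54, 0.47, 0.51, 1, 0, 2.0]})  # illustrative feature vector

latencies = []
for _ in range(200):
    start = time.perf_counter()
    runtime.invoke_endpoint(
        EndpointName="faceoff-probability-endpoint",  # hypothetical endpoint name
        ContentType="application/json",
        Body=payload,
    )
    latencies.append((time.perf_counter() - start) * 1000)

latencies.sort()
print(f"avg={statistics.mean(latencies):.1f} ms, p95={latencies[int(len(latencies) * 0.95)]:.1f} ms")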

With the main parts of high-level architecture decided, we started to work on the internal design of the face-off detecting application. A computation model of the application is depicted in the following diagram.

a diagram representing the flowchart/computation model of the faceoff application

The face-off detecting application can be modeled as a simple finite-state machine, where each incoming message transitions the system from one state to another while performing some computation along with that transition. The application maintains several data structures to keep track of the following (a simplified sketch of this state follows the list):

  • Changes in the game state – The current period number, status and value of the game clock, and score
  • Changes in the player’s state – If the player is currently on the ice or on the bench, the current coordinates on the field, and the current velocity
  • Changes in the player’s personal face-off stats – The success rate of one player vs. another, and so on
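
The following is a simplified sketch of how this state might be represented in Python. The field names are illustrative, and the actual Flink application would hold equivalent state in its own state primitives rather than plain Python objects.

from dataclasses import dataclass, field
from typing import Dict, Tuple

@dataclass
class GameState:
    period: int = 1
    clock_seconds: float = 1200.0
    clock_running: bool = False
    home_score: int = 0
    away_score: int = 0

@dataclass
class PlayerState:
    on_ice: bool = False
    x: float = 0.0
    y: float = 0.0
    speed: float = 0.0

@dataclass
class FaceoffDetectorState:
    game: GameState = field(default_factory=GameState)
    players: Dict[int, PlayerState] = field(default_factory=dict)
    # head-to-head face-off success rates keyed by (home_player_id, away_player_id)
    head_to_head_win_pct: Dict[Tuple[int, int], float] = field(default_factory=dict)

    def on_location_update(self, player_id: int, x: float, y: float, speed: float) -> None:
        # Each PPT location event updates only the state of the player it belongs to
        state = self.players.setdefault(player_id, PlayerState())
        state.on_ice, state.x, state.y, state.speed = True, x, y, speed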

The algorithm checks each player location update event to decide whether a face-off prediction should be made and whether the result should be submitted to broadcasters. Because each player’s location is updated roughly every 80 milliseconds and players move much more slowly during game pauses than during play, the situation doesn’t change drastically between two updates. If the application called SageMaker for predictions and sent predictions to broadcasters every time a new location update event was received and all conditions were satisfied, SageMaker and the broadcasters would be overwhelmed with duplicate requests.

To avoid this unnecessary noise, the application keeps track of the combinations of parameters for which predictions were already made, along with the results, and caches them in memory to avoid expensive duplicate requests to SageMaker. It also keeps track of which predictions were already sent to broadcasters, so that only new predictions are sent and previously sent ones are resent only when necessary. Testing showed that this approach reduces the amount of outgoing traffic by a factor of more than 100.
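
The following sketch illustrates the idea of the in-memory deduplication: predictions are cached per unique combination of face-off parameters, and a result is forwarded to broadcasters only the first time that combination is seen. The helper names and key scheme are illustrative, and invoke_model stands in for the actual SageMaker call.

import hashlib
import json

prediction_cache = {}         # parameter combination -> predicted probability
sent_to_broadcasters = set()  # parameter combinations already published

def cache_key(faceoff_params: dict) -> str:
    # Stable key for a combination of parameters (location, players, game situation)
    return hashlib.sha256(json.dumps(faceoff_params, sort_keys=True).encode()).hexdigest()

def get_prediction(faceoff_params: dict, invoke_model) -> float:
    key = cache_key(faceoff_params)
    if key not in prediction_cache:
        # Only call SageMaker for combinations that haven't been scored yet
        prediction_cache[key] = invoke_model(faceoff_params)
    return prediction_cache[key]

def should_publish(faceoff_params: dict) -> bool:
    # Forward a prediction to broadcasters only the first time it is seen
    key = cache_key(faceoff_params)
    if key in sent_to_broadcasters:
        return False
    sent_to_broadcasters.add(key)
    return True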

Another optimization technique that we used was grouping requests to SageMaker and performing them asynchronously in parallel. For example, if we have four new combinations of face-off parameters for which we need to get predictions from SageMaker, we know that each request will take less than 100 milliseconds. If we perform each request synchronously one by one, the total response time will be under 400 milliseconds. But if we group all four requests, submit them asynchronously, and wait for the result for the entire group before moving forward, we effectively parallelize requests and the total response time will be under 100 milliseconds, just like for only one request.
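
A minimal sketch of this batching pattern using a thread pool is shown below; the endpoint name, payloads, and response shape are assumptions for illustration.

import json
import time
from concurrent.futures import ThreadPoolExecutor

import boto3

runtime = boto3.client("sagemaker-runtime")

def score(features: dict) -> float:
    response = runtime.invoke_endpoint(
        EndpointName="faceoff-probability-endpoint",  # hypothetical endpoint name
        ContentType="application/json",
        Body=json.dumps(features),
    )
    return json.loads(response["Body"].read())["probability"]  # assumed response shape

# Four new parameter combinations scored in parallel; wall-clock time is roughly
# that of the slowest single request rather than the sum of all four
batch = [{"features": [0.54, 0.47, 0.51, 1, 0, 2.0]} for _ in range(4)]  # illustrative payloads
start = time.perf_counter()
with ThreadPoolExecutor(max_workers=len(batch)) as pool:
    probabilities = list(pool.map(score, batch))
print(f"{len(batch)} predictions in {(time.perf_counter() - start) * 1000:.0f} ms")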

Summary

NHL Edge IQ, powered by AWS, is bringing fans closer to the action with advanced analytics and new ML stats. In this post, we showed insights into the building and deployment of the new Face-off Probability model, the first on-air ML statistic for the NHL. Be sure to keep an eye out for the probabilities generated by Face-off Probability in upcoming NHL games.

To find full examples of building custom training jobs for SageMaker, visit Bring your own training-completed model with SageMaker by building a custom container. For examples of using Amazon Kinesis for streaming, refer to Learning Amazon Kinesis Development.

To learn more about the partnership between AWS and the NHL, visit NHL Innovates with AWS Cloud Services. If you’d like to collaborate with experts to bring ML solutions to your organization, contact the Amazon ML Solutions Lab.


About the Authors

Ryan Gillespie is a Sr. Data Scientist with AWS Professional Services. He has a MSc from Northwestern University and a MBA from the University of Toronto. He has previous experience in the retail and mining industries.

Yash Shah is a Science Manager in the Amazon ML Solutions Lab. He and his team of applied scientists and machine learning engineers work on a range of machine learning use cases across healthcare, sports, automotive, and manufacturing.

Alexander Egorov is a Principal Data Architect, specializing in streaming technologies. He helps organizations to design and build platforms for processing and analyzing streaming data in real time.

Miguel Romero Calvo is an Applied Scientist at the Amazon ML Solutions Lab where he partners with AWS internal teams and strategic customers to accelerate their business through ML and cloud adoption.

Erick Martinez is a Sr. Media Application Architect with 25+ years of experience, with a focus on media and entertainment. He is experienced in all aspects of the systems development lifecycle, from discovery and requirements gathering through design, implementation, testing, deployment, and operation.

Read More

Redact sensitive data from streaming data in near-real time using Amazon Comprehend and Amazon Kinesis Data Firehose

Near-real-time delivery of data and insights enable businesses to rapidly respond to their customers’ needs. Real-time data can come from a variety of sources, including social media, IoT devices, infrastructure monitoring, call center monitoring, and more. Due to the breadth and depth of data being ingested from multiple sources, businesses look for solutions to protect their customers’ privacy and keep sensitive data from being accessed from end systems. You previously had to rely on personally identifiable information (PII) rules engines that could flag false positives or miss data, or you had to build and maintain custom machine learning (ML) models to identify PII in your streaming data. You also needed to implement and maintain the infrastructure necessary to support these engines or models.

To help streamline this process and reduce costs, you can use Amazon Comprehend, a natural language processing (NLP) service that uses ML to find insights and relationships like people, places, sentiments, and topics in unstructured text. You can now use Amazon Comprehend ML capabilities to detect and redact PII in customer emails, support tickets, product reviews, social media, and more. No ML experience is required. For example, you can analyze support tickets and knowledge articles to detect PII entities and redact the text before you index the documents. After that, documents are free of PII entities and users can consume the data. Redacting PII entities helps you protect your customer’s privacy and comply with local laws and regulations.

In this post, you learn how to implement Amazon Comprehend into your streaming architectures to redact PII entities in near-real time using Amazon Kinesis Data Firehose with AWS Lambda.

This post is focused on redacting data from select fields that are ingested into a streaming architecture using Kinesis Data Firehose, where you want to create, store, and maintain additional derivative copies of the data for consumption by end-users or downstream applications. If you’re using Amazon Kinesis Data Streams or have additional use cases outside of PII redaction, refer to Translate, redact and analyze streaming data using SQL functions with Amazon Kinesis Data Analytics, Amazon Translate, and Amazon Comprehend, where we show how you can use Amazon Kinesis Data Analytics Studio powered by Apache Zeppelin and Apache Flink to interactively analyze, translate, and redact text fields in streaming data.

Solution overview

The following figure shows an example architecture for performing PII redaction of streaming data in real time, using Amazon Simple Storage Service (Amazon S3), Kinesis Data Firehose data transformation, Amazon Comprehend, and AWS Lambda. Additionally, we use the AWS SDK for Python (Boto3) for the Lambda functions. As indicated in the diagram, the S3 raw bucket contains non-redacted data, and the S3 redacted bucket contains redacted data after using the Amazon Comprehend DetectPiiEntities API within a Lambda function.

Costs involved

In addition to Kinesis Data Firehose, Amazon S3, and Lambda costs, this solution will incur usage costs from Amazon Comprehend. The amount you pay is a factor of the total number of records that contain PII and the characters that are processed by the Lambda function. For more information, refer to Amazon Kinesis Data Firehose pricing, Amazon Comprehend Pricing, and AWS Lambda Pricing.

As an example, let’s assume you have 10,000 logs records, and the key value you want to redact PII from is 500 characters. Out of the 10,000 log records, 50 are identified as containing PII. The cost details are as follows:

Contains PII Cost:

  • Size of each key value = 500 characters (1 unit = 100 characters)
  • Number of units (100 characters) per record (minimum is 3 units) = 5
  • Total units = 10,000 (records) x 5 (units per record) x 1 (Amazon Comprehend requests per record) = 50,000
  • Price per unit = $0.000002
    • Total cost for identifying log records with PII using ContainsPiiEntities API = $0.1 [50,000 units x $0.000002] 

Redact PII Cost:

  • Total units containing PII = 50 (records) x 5 (units per record) x 1 (Amazon Comprehend requests per record) = 250
  • Price per unit = $0.0001
    • Total cost for identifying location of PII using DetectPiiEntities API = [number of units] x [cost per unit] = 250 x $0.0001 = $0.025

Total Cost for identification and redaction:

  • Total cost: $0.1 (validation if field contains PII) + $0.025 (redact fields that contain PII) = $0.125

Deploy the solution with AWS CloudFormation

For this post, we provide an AWS CloudFormation streaming data redaction template, which provides the full details of the implementation to enable repeatable deployments. Upon deployment, this template creates two S3 buckets: one to store the raw sample data ingested from the Amazon Kinesis Data Generator (KDG), and one to store the redacted data. Additionally, it creates a Kinesis Data Firehose delivery stream with DirectPUT as input, and a Lambda function that calls the Amazon Comprehend ContainsPiiEntities and DetectPiiEntities API to identify and redact PII data. The Lambda function relies on user input in the environment variables to determine what key values need to be inspected for PII.

The Lambda function in this solution limits payload sizes to 100 KB. If the text in a payload is greater than 100 KB, the Lambda function skips it.

To deploy the solution, complete the following steps:

  1. Launch the CloudFormation stack in US East (N. Virginia) us-east-1:
  2. Enter a stack name, and leave the other parameters at their defaults.
  3. Select I acknowledge that AWS CloudFormation might create IAM resources with custom names.
  4. Choose Create stack.

Deploy resources manually

If you prefer to build the architecture manually instead of using AWS CloudFormation, complete the steps in this section.

Create the S3 buckets

Create your S3 buckets with the following steps:

  1. On the Amazon S3 console, choose Buckets in the navigation pane.
  2. Choose Create bucket.
  3. Create one bucket for your raw data and one for your redacted data.
  4. Note the names of the buckets you just created.

Create the Lambda function

To create and deploy the Lambda function, complete the following steps:

  1. On the Lambda console, choose Create function.
  2. Choose Author from scratch.
  3. For Function Name, enter AmazonComprehendPII-Redact.
  4. For Runtime, choose Python 3.9.
  5. For Architecture, select x86_64.
  6. For Execution role, select Create a new role with Lambda permissions.
  7. After you create the function, enter the following code (a sketch for invoking this handler locally with a sample Firehose event follows these setup steps):
    import json
    import boto3
    import os
    import base64
    import sys
    
    def lambda_handler(event, context):
        
        output = []
        
        for record in event['records']:
            
            # Gathers keys from environment variables and makes a list of desired keys to check for PII
            rawkeys = os.environ['keys']
            splitkeys = rawkeys.split(", ")
            print(splitkeys)
            #decode base64
            #Kinesis data is base64 encoded so decode here
            payloadraw=base64.b64decode(record["data"]).decode('utf-8')
            #Loads decoded payload into json
            payloadjsonraw = json.loads(payloadraw)
            
            # Creates Comprehend client
            comprehend_client = boto3.client('comprehend')
            
            
            # This code handles the logic to check keys, identify whether PII exists, and redact PII if found.
            for i in payloadjsonraw:
                # checks if the key found in the message matches a key in the redaction list
                if i in splitkeys:
                    print("Redact key found, checking for PII")
                    payload = str(payloadjsonraw[i])
                    # check if payload size is less than 100KB
                    if sys.getsizeof(payload) < 99999:
                        print('Size is less than 100KB checking if value contains PII')
                        # Runs Comprehend ContainsPiiEntities API call to see if key value contains PII
                        pii_identified = comprehend_client.contains_pii_entities(Text=payload, LanguageCode='en')
                        
                        # If PII is not found, skip over key
                        if (pii_identified['Labels']) == []:
                            print('No PII found')
                        else:
                        # if PII is found, run through redaction logic
                            print('PII found redacting')
                            # Runs Comprehend DetectPiiEntities call to find exact location of PII
                            response = comprehend_client.detect_pii_entities(Text=payload, LanguageCode='en')
                            entities = response['Entities']
                            # creates redacted_payload which will be redacted
                            redacted_payload = payload
                            # runs through a loop that gathers necessary values from Comprehend API response and redacts values
                            for entity in entities:
                                char_offset_begin = entity['BeginOffset']
                                char_offset_end = entity['EndOffset']
                                redacted_payload = redacted_payload[:char_offset_begin] + '*'*(char_offset_end-char_offset_begin) + redacted_payload[char_offset_end:]
                            # replaces original value with redacted value
                            payloadjsonraw[i] = redacted_payload
                            print(str(payloadjsonraw[i]))
                    else:
                        print ('Size is more than 100KB, skipping inspection')
                else:
                    print("Key value not found in redaction list")
            
            redacteddata = json.dumps(payloadjsonraw)
            
            # adds the processed record to the output
            output_record = {
                'recordId': record['recordId'],
                'result': 'Ok',
                'data' : base64.b64encode(redacteddata.encode('utf-8'))
            }
            output.append(output_record)
            print(output_record)
            
        print('Successfully processed {} records.'.format(len(event['records'])))
        
        return {'records': output}

  8. Choose Deploy.
  9. In the navigation pane, choose Configuration.
  10. Navigate to Environment variables.
  11. Choose Edit.
  12. For Key, enter keys.
  13. For Value, enter the key values you want to redact PII from, separated by a comma and space. For example, enter Tweet1, Tweet2 if you’re using the sample test data provided in the next section of this post.
  14. Choose Save.
  15. Navigate to General configuration.
  16. Choose Edit.
  17. Change the value of Timeout to 1 minute.
  18. Choose Save.
  19. Navigate to Permissions.
  20. Choose the role name under Execution Role.
    You’re redirected to the AWS Identity and Access Management (IAM) console.
  21. For Add permissions, choose Attach policies.
  22. Enter Comprehend into the search bar and choose the policy ComprehendFullAccess.
  23. Choose Attach policies.
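
If you want to exercise the handler before wiring it to Kinesis Data Firehose, the following local smoke test builds a Firehose-style transformation event and calls the handler directly. It assumes the code above is saved locally as lambda_function.py and that AWS credentials and a Region with Amazon Comprehend access are configured; the record ID and sample text are illustrative.

import base64
import json
import os

os.environ["keys"] = "Tweet1, Tweet2"  # keys the handler should inspect
from lambda_function import lambda_handler  # assumes the handler code is saved as lambda_function.py

sample = {"User": "12345", "Tweet1": "My phone number is 818-828-6231.", "Tweet2": "Nothing sensitive here."}
event = {
    "records": [
        {
            "recordId": "49590338271490256608559692538361571095921575989136588898",
            "approximateArrivalTimestamp": 1664028820000,
            "data": base64.b64encode(json.dumps(sample).encode("utf-8")).decode("utf-8"),
        }
    ]
}

result = lambda_handler(event, None)
print(base64.b64decode(result["records"][0]["data"]).decode("utf-8"))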

Create the Firehose delivery stream

To create your Firehose delivery stream, complete the following steps:

  1. On the Kinesis Data Firehose console, choose Create delivery stream.
  2. For Source, select Direct PUT.
  3. For Destination, select Amazon S3.
  4. For Delivery stream name, enter ComprehendRealTimeBlog.
  5. Under Transform source records with AWS Lambda, select Enabled.
  6. For AWS Lambda function, enter the ARN for the function you created, or browse to the function AmazonComprehendPII-Redact.
  7. For Buffer Size, set the value to 1 MB.
  8. For Buffer Interval, leave it as 60 seconds.
  9. Under Destination Settings, select the S3 bucket you created for the redacted data.
  10. Under Backup Settings, select the S3 bucket that you created for the raw records.
  11. Under Permission, either create or update an IAM role, or choose an existing role with the proper permissions.
  12. Choose Create delivery stream.

Deploy the streaming data solution with the Kinesis Data Generator

You can use the Kinesis Data Generator (KDG) to ingest sample data to Kinesis Data Firehose and test the solution. To simplify this process, we provide a Lambda function and CloudFormation template to create an Amazon Cognito user and assign appropriate permissions to use the KDG.

  1. On the Amazon Kinesis Data Generator page, choose Create a Cognito User with CloudFormation. You’re redirected to the AWS CloudFormation console to create your stack.
  2. Provide a user name and password for the user with which you log in to the KDG.
  3. Leave the other settings at their defaults and create your stack.
  4. On the Outputs tab, choose the KDG UI link.
  5. Enter your user name and password to log in.

Send test records and validate redaction in Amazon S3

To test the solution, complete the following steps:

  1. Log in to the KDG URL you created in the previous step.
  2. Choose the Region where the AWS CloudFormation stack was deployed.
  3. For Stream/delivery stream, choose the delivery stream you created (if you used the template, it has the format accountnumber-awscomprehend-blog).
  4. Leave the other settings at their defaults.
  5. For the record template, you can create your own tests, or use the following template. If you’re using the provided sample data for testing, make sure the environment variables in the AmazonComprehendPII-Redact Lambda function are set to Tweet1, Tweet2 (this applies whether you created the function manually or deployed it via CloudFormation). The sample test data is as follows:
    {"User":"12345", "Tweet1":" Good morning, everybody. My name is Van Bokhorst Serdar, and today I feel like sharing a whole lot of personal information with you. Let's start with my Email address SerdarvanBokhorst@dayrep.com. My address is 2657 Koontz Lane, Los Angeles, CA. My phone number is 818-828-6231.", "Tweet2": "My Social security number is 548-95-6370. My Bank account number is 940517528812 and routing number 195991012. My credit card number is 5534816011668430, Expiration Date 6/1/2022, my C V V code is 121, and my pin 123456. Well, I think that's it. You know a whole lot about me. And I hope that Amazon comprehend is doing a good job at identifying PII entities so you can redact my personal information away from this streaming record. Let's check"}

  6. Choose Send Data, and allow a few seconds for records to be sent to your stream.
  7. After a few seconds, stop the KDG generator and check your S3 buckets for the delivered files.

The following is an example of the raw data in the raw S3 bucket:

{"User":"12345", "Tweet1":" Good morning, everybody. My name is Van Bokhorst Serdar, and today I feel like sharing a whole lot of personal information with you. Let's start with my Email address SerdarvanBokhorst@dayrep.com. My address is 2657 Koontz Lane, Los Angeles, CA. My phone number is 818-828-6231.", "Tweet2": "My Social security number is 548-95-6370. My Bank account number is 940517528812 and routing number 195991012. My credit card number is 5534816011668430, Expiration Date 6/1/2022, my C V V code is 121, and my pin 123456. Well, I think that's it. You know a whole lot about me. And I hope that Amazon comprehend is doing a good job at identifying PII entities so you can redact my personal information away from this streaming record. Let's check"}

The following is an example of the redacted data in the redacted S3 bucket:

{"User":"12345", "Tweet1":"Good morning, everybody. My name is *******************, and today I feel like sharing a whole lot of personal information with you. Let's start with my Email address ****************************. My address is ********************************** My phone number is ************.", "Tweet"2: "My Social security number is ***********. My Bank account number is ************ and routing number *********. My credit card number is ****************, Expiration Date ********, my C V V code is ***, and my pin ******. Well, I think that's it. You know a whole lot about me. And I hope that Amazon comprehend is doing a good job at identifying PII entities so you can redact my personal information away from this streaming record. Let's check"}

The sensitive information has been removed from the redacted messages, providing confidence that you can share this data with end systems.

Cleanup

When you’re finished experimenting with this solution, clean up your resources by using the AWS CloudFormation console to delete all the resources deployed in this example. If you followed the manual steps, you will need to manually delete the two buckets, the AmazonComprehendPII-Redact function, the ComprehendRealTimeBlog stream, the log group for the ComprehendRealTimeBlog stream, and any IAM roles that were created.

Conclusion

This post showed you how to integrate PII redaction into your near-real-time streaming architecture and reduce data processing time by performing redaction in flight. In this scenario, you provide the redacted data to your end-users and a data lake administrator secures the raw bucket for later use. You could also build additional processing with Amazon Comprehend to identify tone or sentiment, identify entities within the data, and classify each message.

We provided individual steps for each service as part of this post, and also included a CloudFormation template that allows you to provision the required resources in your account. This template should be used for proof of concept or testing scenarios only. Refer to the developer guides for Amazon Comprehend, Lambda, and Kinesis Data Firehose for any service limits.

To get started with PII identification and redaction, see Personally identifiable information (PII). With the example architecture in this post, you could integrate any of the Amazon Comprehend APIs with near-real-time data using Kinesis Data Firehose data transformation. To learn more about what you can build with your near-real-time data with Kinesis Data Firehose, refer to the Amazon Kinesis Data Firehose Developer Guide. This solution is available in all AWS Regions where Amazon Comprehend and Kinesis Data Firehose are available.


About the authors

Joe Morotti is a Solutions Architect at Amazon Web Services (AWS), helping Enterprise customers across the Midwest US. He has held a wide range of technical roles and enjoys showing customers the art of the possible. In his free time, he enjoys spending quality time with his family, exploring new places, and overanalyzing his sports team’s performance.

Sriharsh Adari is a Senior Solutions Architect at Amazon Web Services (AWS), where he helps customers work backwards from business outcomes to develop innovative solutions on AWS. Over the years, he has helped multiple customers with data platform transformations across industry verticals. His core areas of expertise include technology strategy, data analytics, and data science. In his spare time, he enjoys playing tennis, binge-watching TV shows, and playing tabla.

Read More

Reduce cost and development time with Amazon SageMaker Pipelines local mode

Creating robust and reusable machine learning (ML) pipelines can be a complex and time-consuming process. Developers usually test their processing and training scripts locally, but the pipelines themselves are typically tested in the cloud. Creating and running a full pipeline during experimentation adds unwanted overhead and cost to the development lifecycle. In this post, we detail how you can use Amazon SageMaker Pipelines local mode to run ML pipelines locally to reduce both pipeline development and run time while reducing cost. After the pipeline has been fully tested locally, you can easily rerun it with Amazon SageMaker managed resources with just a few lines of code changes.

Overview of the ML lifecycle

One of the main drivers for new innovations and applications in ML is the availability and amount of data along with cheaper compute options. In several domains, ML has proven capable of solving problems previously unsolvable with classical big data and analytical techniques, and the demand for data science and ML practitioners is increasing steadily. From a very high level, the ML lifecycle consists of many different parts, but the building of an ML model usually consists of the following general steps:

  1. Data cleansing and preparation (feature engineering)
  2. Model training and tuning
  3. Model evaluation
  4. Model deployment (or batch transform)

In the data preparation step, data is loaded, massaged, and transformed into the type of inputs, or features, the ML model expects. Writing the scripts to transform the data is typically an iterative process, where fast feedback loops are important to speed up development. It’s normally not necessary to use the full dataset when testing feature engineering scripts, which is why you can use the local mode feature of SageMaker Processing. This allows you to run locally and update the code iteratively, using a smaller dataset. When the final code is ready, it’s submitted to the remote processing job, which uses the complete dataset and runs on SageMaker managed instances.

The development process is similar to the data preparation step for both model training and model evaluation steps. Data scientists use the local mode feature of SageMaker Training to iterate quickly with smaller datasets locally, before using all the data in a SageMaker managed cluster of ML-optimized instances. This speeds up the development process and eliminates the cost of running ML instances managed by SageMaker while experimenting.

As an organization’s ML maturity increases, you can use Amazon SageMaker Pipelines to create ML pipelines that stitch together these steps, creating more complex ML workflows that process, train, and evaluate ML models. SageMaker Pipelines is a fully managed service for automating the different steps of the ML workflow, including data loading, data transformation, model training and tuning, and model deployment. Until recently, you could develop and test your scripts locally but had to test your ML pipelines in the cloud. This made iterating on the flow and form of ML pipelines a slow and costly process. Now, with the added local mode feature of SageMaker Pipelines, you can iterate and test your ML pipelines similarly to how you test and iterate on your processing and training scripts. You can run and test your pipelines on your local machine, using a small subset of data to validate the pipeline syntax and functionalities.

SageMaker Pipelines

SageMaker Pipelines provides a fully automated way to run simple or complex ML workflows. With SageMaker Pipelines, you can create ML workflows with an easy-to-use Python SDK, and then visualize and manage your workflow using Amazon SageMaker Studio. Your data science teams can be more efficient and scale faster by storing and reusing the workflow steps you create in SageMaker Pipelines. You can also use pre-built templates that automate the infrastructure and repository creation to build, test, register, and deploy models within your ML environment. These templates are automatically available to your organization, and are provisioned using AWS Service Catalog products.

SageMaker Pipelines brings continuous integration and continuous deployment (CI/CD) practices to ML, such as maintaining parity between development and production environments, version control, on-demand testing, and end-to-end automation, which helps you scale ML throughout your organization. DevOps practitioners know that some of the main benefits of using CI/CD techniques include an increase in productivity via reusable components and an increase in quality through automated testing, which leads to faster ROI for your business objectives. These benefits are now available to MLOps practitioners by using SageMaker Pipelines to automate the training, testing, and deployment of ML models. With local mode, you can now iterate much more quickly while developing scripts for use in a pipeline. Note that local pipeline instances can’t be viewed or run within the Studio IDE; however, additional viewing options for local pipelines will be available soon.

The SageMaker SDK provides a general purpose local mode configuration that allows developers to run and test supported processors and estimators in their local environment. You can use local mode training with multiple AWS-supported framework images (TensorFlow, MXNet, Chainer, PyTorch, and Scikit-Learn) as well as images you supply yourself.

SageMaker Pipelines, which builds a Directed Acyclic Graph (DAG) of orchestrated workflow steps, supports many activities that are part of the ML lifecycle. In local mode, the following steps are supported:

  • Processing job steps – A simplified, managed experience on SageMaker to run data processing workloads, such as feature engineering, data validation, model evaluation, and model interpretation
  • Training job steps – An iterative process that teaches a model to make predictions by presenting examples from a training dataset
  • Hyperparameter tuning jobs – An automated way to evaluate and select the hyperparameters that produce the most accurate model
  • Conditional run steps – A step that provides a conditional run of branches in a pipeline
  • Model step – Using CreateModel arguments, this step can create a model for use in transform steps or later deployment as an endpoint
  • Transform job steps – A batch transform job that generates predictions from large datasets, and runs inference when a persistent endpoint isn’t needed
  • Fail steps – A step that stops a pipeline run and marks the run as failed

Solution overview

Our solution demonstrates the essential steps to create and run SageMaker Pipelines in local mode, which means using local CPU, RAM, and disk resources to load and run the workflow steps. Your local environment could be running on a laptop, using popular IDEs like VSCode or PyCharm, or it could be hosted by SageMaker using classic notebook instances.

Local mode allows data scientists to stitch together steps, which can include processing, training, and evaluation jobs, and run the entire workflow locally. When you’re done testing locally, you can rerun the pipeline in a SageMaker managed environment by replacing the LocalPipelineSession object with PipelineSession, which brings consistency to the ML lifecycle.

For this notebook sample, we use a standard publicly available dataset, the UCI Machine Learning Abalone Dataset. The goal is to train an ML model to determine the age of an abalone snail from its physical measurements. At the core, this is a regression problem.

All of the code required to run this notebook sample is available on GitHub in the amazon-sagemaker-examples repository. In this notebook sample, each pipeline workflow step is created independently and then wired together to create the pipeline. We create the following steps:

  • Processing step (feature engineering)
  • Training step (model training)
  • Processing step (model evaluation)
  • Condition step (model accuracy)
  • Create model step (model)
  • Transform step (batch transform)
  • Register model step (model package)
  • Fail step (run failed)

The following diagram illustrates our pipeline.

Prerequisites

To follow along in this post, you need the following:

After these prerequisites are in place, you can run the sample notebook as described in the following sections.

Build your pipeline

In this notebook sample, we use SageMaker Script Mode for most of the ML processes, which means that we provide the actual Python code (scripts) to perform the activity and pass a reference to this code. Script Mode provides great flexibility to control the behavior within the SageMaker processing by allowing you to customize your code while still taking advantage of SageMaker pre-built containers like XGBoost or Scikit-Learn. The custom code is written to a Python script file using cells that begin with the magic command %%writefile, like the following:

%%writefile code/evaluation.py

The primary enabler of local mode is the LocalPipelineSession object, which is instantiated from the Python SDK. The following code segments show how to create a SageMaker pipeline in local mode. Although you can configure a local data path for many of the local pipeline steps, Amazon S3 is the default location to store the data output by the transformation. The new LocalPipelineSession object is passed to the Python SDK in many of the SageMaker workflow API calls described in this post. Notice that you can use the local_pipeline_session variable to retrieve references to the S3 default bucket and the current Region name.

from sagemaker.workflow.pipeline_context import LocalPipelineSession

# Create a `LocalPipelineSession` object so that each 
# pipeline step will run locally
# To run this pipeline in the cloud, you must change 
# the `LocalPipelineSession()` to `PipelineSession()`
local_pipeline_session = LocalPipelineSession()
region = local_pipeline_session.boto_region_name

default_bucket = local_pipeline_session.default_bucket()
prefix = "sagemaker-pipelines-local-mode-example"

Before we create the individual pipeline steps, we set some parameters used by the pipeline. Some of these parameters are string literals, whereas others are created as special typed parameter classes provided by the SDK. This typing ensures that valid settings are provided to the pipeline, such as the following parameter, which is passed to the ConditionLessThanOrEqualTo condition further down:

from sagemaker.workflow.parameters import ParameterFloat

mse_threshold = ParameterFloat(name="MseThreshold", default_value=7.0)

To create a data processing step, which is used here to perform feature engineering, we use the SKLearnProcessor to load and transform the dataset. We pass the local_pipeline_session variable to the class constructor, which instructs the workflow step to run in local mode:

from sagemaker.sklearn.processing import SKLearnProcessor

framework_version = "1.0-1"

sklearn_processor = SKLearnProcessor(
    framework_version=framework_version,
    instance_type=instance_type,
    instance_count=processing_instance_count,
    base_job_name="sklearn-abalone-process",
    role=role,
    sagemaker_session=local_pipeline_session,
)

Next, we create our first actual pipeline step, a ProcessingStep object, as imported from the SageMaker SDK. The processor arguments are returned from a call to the SKLearnProcessor run() method. This workflow step is combined with other steps towards the end of the notebook to indicate the order of operation within the pipeline.

from sagemaker.processing import ProcessingInput, ProcessingOutput
from sagemaker.workflow.steps import ProcessingStep

processor_args = sklearn_processor.run(
    inputs=[
        ProcessingInput(source=input_data, destination="/opt/ml/processing/input"),
    ],
    outputs=[
        ProcessingOutput(output_name="train", source="/opt/ml/processing/train"),
        ProcessingOutput(output_name="validation", source="/opt/ml/processing/validation"),
        ProcessingOutput(output_name="test", source="/opt/ml/processing/test"),
    ],
    code="code/preprocessing.py",
)

step_process = ProcessingStep(name="AbaloneProcess", step_args=processor_args)

Next, we provide code to establish a training step by first instantiating a standard estimator using the SageMaker SDK. We pass the same local_pipeline_session variable to the estimator, named xgb_train, as the sagemaker_session argument. Because we want to train an XGBoost model, we must generate a valid image URI by specifying the following parameters, including the framework and several version parameters:

import sagemaker
from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput

model_path = f"s3://{default_bucket}/{prefix}/model"
image_uri = sagemaker.image_uris.retrieve(
    framework="xgboost",
    region=region,
    version="1.5-1",
    py_version="py3",
    instance_type=instance_type,
)

xgb_train = Estimator(
    image_uri=image_uri,
    entry_point="code/abalone.py",
    instance_type=instance_type,
    instance_count=training_instance_count,
    output_path=model_path,
    role=role,
    sagemaker_session=local_pipeline_session,
)

We can optionally call additional estimator methods, for example set_hyperparameters(), to provide hyperparameter settings for the training job. Now that we have an estimator configured, we’re ready to create the actual training step. Once again, we import the TrainingStep class from the SageMaker SDK library:

from sagemaker.workflow.steps import TrainingStep

step_train = TrainingStep(name="AbaloneTrain", step_args=train_args)
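
The train_args passed to the TrainingStep come from calling the estimator’s fit() method with the pipeline session, which returns step arguments instead of starting a training job; in the notebook, that cell runs before the TrainingStep is created. The following sketch wires the training and validation channels to the outputs of the processing step; the channel layout mirrors the public abalone sample notebook, so treat the exact output names as assumptions.

train_args = xgb_train.fit(
    inputs={
        "train": TrainingInput(
            s3_data=step_process.properties.ProcessingOutputConfig.Outputs["train"].S3Output.S3Uri,
            content_type="text/csv",
        ),
        "validation": TrainingInput(
            s3_data=step_process.properties.ProcessingOutputConfig.Outputs["validation"].S3Output.S3Uri,
            content_type="text/csv",
        ),
    }
)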

Next, we build another processing step to perform model evaluation. This is done by creating a ScriptProcessor instance and passing the local_pipeline_session object as a parameter:

from sagemaker.processing import ScriptProcessor

script_eval = ScriptProcessor(
    image_uri=image_uri,
    command=["python3"],
    instance_type=instance_type,
    instance_count=processing_instance_count,
    base_job_name="script-abalone-eval",
    role=role,
    sagemaker_session=local_pipeline_session,
)
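
The condition step shown later reads the evaluation metrics through a property file attached to this evaluation step. The following sketch shows how step_eval and evaluation_report might be wired up, with the metrics written to evaluation.json by code/evaluation.py; the input and output paths mirror the public abalone sample and should be treated as assumptions (ProcessingInput, ProcessingOutput, and ProcessingStep were imported earlier).

from sagemaker.workflow.properties import PropertyFile

evaluation_report = PropertyFile(
    name="EvaluationReport",
    output_name="evaluation",
    path="evaluation.json",
)

eval_args = script_eval.run(
    inputs=[
        ProcessingInput(
            source=step_train.properties.ModelArtifacts.S3ModelArtifacts,
            destination="/opt/ml/processing/model",
        ),
        ProcessingInput(
            source=step_process.properties.ProcessingOutputConfig.Outputs["test"].S3Output.S3Uri,
            destination="/opt/ml/processing/test",
        ),
    ],
    outputs=[
        ProcessingOutput(output_name="evaluation", source="/opt/ml/processing/evaluation"),
    ],
    code="code/evaluation.py",
)

step_eval = ProcessingStep(
    name="AbaloneEval",
    step_args=eval_args,
    property_files=[evaluation_report],
)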

To enable deployment of the trained model, either to a SageMaker real-time endpoint or to a batch transform, we need to create a Model object by passing the model artifacts, the proper image URI, and optionally our custom inference code. We then pass this Model object to a ModelStep, which is added to the local pipeline. See the following code:

from sagemaker.model import Model

model = Model(
    image_uri=image_uri,
    model_data=step_train.properties.ModelArtifacts.S3ModelArtifacts,
    source_dir="code",
    entry_point="inference.py",
    role=role,
    sagemaker_session=local_pipeline_session,
)

from sagemaker.workflow.model_step import ModelStep

step_create_model = ModelStep(name="AbaloneCreateModel", 
    step_args=model.create(instance_type=instance_type)
)

Next, we create a batch transform step where we submit a set of feature vectors and perform inference. We first need to create a Transformer object and pass the local_pipeline_session parameter to it. Then we create a TransformStep, passing the required arguments, and add this to the pipeline definition:

from sagemaker.transformer import Transformer

transformer = Transformer(
    model_name=step_create_model.properties.ModelName,
    instance_type=instance_type,
    instance_count=transform_instance_count,
    output_path=f"s3://{default_bucket}/{prefix}/transform",
    sagemaker_session=local_pipeline_session,
)

from sagemaker.workflow.steps import TransformStep

transform_args = transformer.transform(transform_data, content_type="text/csv")

step_transform = TransformStep(name="AbaloneTransform", step_args=transform_args)

Finally, we want to add a branch condition to the workflow so that we only run batch transform if the results of model evaluation meet our criteria. We can indicate this conditional by adding a ConditionStep with a particular condition type, like ConditionLessThanOrEqualTo. We then enumerate the steps for the two branches, essentially defining the if/else or true/false branches of the pipeline. The if_steps provided in the ConditionStep (step_create_model, step_transform) are run whenever the condition evaluates to True.

from sagemaker.workflow.conditions import ConditionLessThanOrEqualTo
from sagemaker.workflow.condition_step import ConditionStep
from sagemaker.workflow.functions import JsonGet

cond_lte = ConditionLessThanOrEqualTo(
    left=JsonGet(
        step_name=step_eval.name,
        property_file=evaluation_report,
        json_path="regression_metrics.mse.value",),
    right=mse_threshold,
)

step_cond = ConditionStep(
    name="AbaloneMSECond",
    conditions=[cond_lte],
    if_steps=[step_create_model, step_transform],
    else_steps=[step_fail],
)
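
The step_fail referenced in else_steps is a FailStep, which stops the pipeline run and marks it as failed when the MSE exceeds the threshold. A minimal sketch follows (the error message wording is illustrative, and in the notebook this step is defined before the condition step):

from sagemaker.workflow.fail_step import FailStep
from sagemaker.workflow.functions import Join

step_fail = FailStep(
    name="AbaloneMSEFail",
    error_message=Join(on=" ", values=["Execution failed because MSE is greater than", mse_threshold]),
)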

The following diagram illustrates this conditional branch and the associated if/else steps. Only one branch is run, based on the outcome of the model evaluation step as compared in the condition step.

Now that we have all our steps defined, and the underlying class instances created, we can combine them into a pipeline. We provide some parameters, and crucially define the order of operation by simply listing the steps in the desired order. Note that the TransformStep isn’t shown here because it’s the target of the condition step and was provided as a step argument to the ConditionStep earlier.

from sagemaker.workflow.pipeline import Pipeline

pipeline_name = f"LocalModelPipeline"
pipeline = Pipeline(
    name=pipeline_name,
    parameters=[
        input_data,
        mse_threshold,
    ],
    steps=[step_process, step_train, step_eval, step_cond],
    sagemaker_session=local_pipeline_session,
)

To run the pipeline, you must call two methods: pipeline.upsert(), which uploads the pipeline to the underlying service, and pipeline.start(), which starts running the pipeline. You can use various other methods to interrogate the run status, list the pipeline steps, and more. Because we used the local mode pipeline session, these steps are all run locally on your processor. The cell output beneath the start method shows the output from the pipeline:

pipeline.upsert(role_arn=role)
execution = pipeline.start()

You should see a message at the bottom of the cell output similar to the following:

Pipeline execution d8c3e172-089e-4e7a-ad6d-6d76caf987b7 SUCCEEDED

Revert to managed resources

After we’ve confirmed that the pipeline runs without errors and we’re satisfied with the flow and form of the pipeline, we can recreate the pipeline but with SageMaker managed resources and rerun it. The only change required is to use the PipelineSession object instead of LocalPipelineSession:

from sagemaker.workflow.pipeline_context import LocalPipelineSession
from sagemaker.workflow.pipeline_context import PipelineSession

local_pipeline_session = LocalPipelineSession()
pipeline_session = PipelineSession()

This informs the service to run each step referencing this session object on SageMaker managed resources. Given the small change, we illustrate only the required code changes in the following code cell, but the same change would need to be implemented on each cell using the local_pipeline_session object. The changes are, however, identical across all cells because we’re only substituting the local_pipeline_session object with the pipeline_session object.

from sagemaker.sklearn.processing import SKLearnProcessor

framework_version = "1.0-1"

sklearn_processor = SKLearnProcessor(
    framework_version=framework_version,
    instance_type=instance_type,
    instance_count=processing_instance_count,
    base_job_name="sklearn-abalone-process",
    role=role,
    sagemaker_session=pipeline_session,  # non-local session
)

After the local session object has been replaced everywhere, we recreate the pipeline and run it with SageMaker managed resources:

from sagemaker.workflow.pipeline import Pipeline

pipeline_name = f"LocalModelPipeline"
pipeline = Pipeline(
    name=pipeline_name,
    parameters=[
        input_data,
        mse_threshold,
    ],
    steps=[step_process, step_train, step_eval, step_cond],
    sagemaker_session=pipeline_session, # non-local session
)

pipeline.upsert(role_arn=role)
execution = pipeline.start()

Clean up

If you want to keep the Studio environment tidy, you can use the following methods to delete the SageMaker pipeline and the model. The full code can be found in the sample notebook.

# delete models
import boto3

sm_client = boto3.client("sagemaker")
model_prefix="AbaloneCreateModel"
delete_models(sm_client, model_prefix)

# delete managed pipeline
pipeline_to_delete = 'SM-Managed-Pipeline'
delete_sagemaker_pipeline(sm_client, pipeline_to_delete)

Conclusion

Until recently, you could use the local mode feature of SageMaker Processing and SageMaker Training to iterate on your processing and training scripts locally, before running them on all the data with SageMaker managed resources. With the new local mode feature of SageMaker Pipelines, ML practitioners can now apply the same method when iterating on their ML pipelines, stitching the different ML workflows together. When the pipeline is ready for production, running it with SageMaker managed resources requires just a few lines of code changes. This reduces the pipeline run time during development, leading to more rapid pipeline development with faster development cycles, while reducing the cost of SageMaker managed resources.

To learn more, visit Amazon SageMaker Pipelines or Use SageMaker Pipelines to Run Your Jobs Locally.


About the authors

Paul Hargis has focused his efforts on machine learning at several companies, including AWS, Amazon, and Hortonworks. He enjoys building technology solutions and teaching people how to make the most of them. Prior to his role at AWS, he was lead architect for Amazon Exports and Expansions, helping amazon.com improve the experience for international shoppers. Paul likes to help customers expand their machine learning initiatives to solve real-world problems.

Niklas Palm is a Solutions Architect at AWS in Stockholm, Sweden, where he helps customers across the Nordics succeed in the cloud. He’s particularly passionate about serverless technologies along with IoT and machine learning. Outside of work, Niklas is an avid cross-country skier and snowboarder as well as a master egg boiler.

Kirit Thadaka is an ML Solutions Architect working in the SageMaker Service SA team. Prior to joining AWS, Kirit worked in early-stage AI startups followed by some time consulting in various roles in AI research, MLOps, and technical leadership.
