Amazon SageMaker Automatic Model Tuning now supports three new completion criteria for hyperparameter optimization

Amazon SageMaker has announced support for three new completion criteria for Amazon SageMaker automatic model tuning, providing you with an additional set of levers to control when the tuning job stops while it searches for the best hyperparameter configuration for your model.

In this post, we discuss these new completion criteria, when to use them, and some of the benefits they bring.

SageMaker automatic model tuning

Automatic model tuning, also called hyperparameter tuning, finds the best version of a model as measured by the metric we choose. It spins up many training jobs on the dataset provided, using the algorithm chosen and the hyperparameter ranges specified. Each training job can be stopped early when the objective metric isn’t improving significantly, which is known as early stopping.

Until now, there were limited ways to control the overall tuning job, such as specifying the maximum number of training jobs. However, the selection of this parameter value is heuristic at best. A larger value increases tuning costs, and a smaller value may not yield the best version of the model at all times.

SageMaker automatic model tuning solves these challenges by giving you multiple completion criteria for the tuning job. These criteria are applied at the tuning job level rather than at the level of each individual training job, which means they operate at a higher abstraction layer.

Benefits of tuning job completion criteria

With better control over when the tuning job stops, you save costs by not running a computationally expensive job for longer than necessary. You can also ensure that the job doesn’t stop too early, so that you get a model of sufficient quality to meet your objectives. You can choose to stop the tuning job when the models are no longer improving after a set of iterations or when the estimated residual improvement doesn’t justify the compute resources and time.

In addition to the existing completion criterion based on the maximum number of training jobs, MaxNumberOfTrainingJobs, automatic model tuning introduces the option to stop tuning based on a maximum tuning time, improvement monitoring, and convergence detection.

Let’s explore each of these criteria.

Maximum tuning time

Previously, you had the option to define a maximum number of training jobs as a resource limit setting to control the tuning budget in terms of compute resource. However, this can lead to unnecessarily longer or shorter training times than needed or desired.

With the addition of the maximum tuning time criterion, you can now allocate your training budget in terms of the amount of time to run the tuning job and automatically terminate the job after a specified amount of time defined in seconds.

"ResourceLimits": {
    "MaxParallelTrainingJobs": 10,
    "MaxNumberOfTrainingJobs": 100,
    "MaxRuntimeInSeconds": 3600
}

As seen above, we use the MaxRuntimeInSeconds to define the tuning time in seconds. Setting the tuning time limit helps you limit the duration of the tuning job and also the projected cost of the experiment.

The total cost before any contractual discount can be estimated with the following formula, where InstanceCost is the per-second cost of the instances used by one training job:
EstimatedCost = MaxRuntimeInSeconds * MaxParallelTrainingJobs * InstanceCost

The maximum runtime in seconds can therefore be used to bound both cost and runtime. In other words, it’s a budget control completion criterion.
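The cost formula above can be sketched in a few lines of Python. This is only an illustration: the function name and the hourly price used here are assumptions, not values from SageMaker pricing, and the sketch assumes the per-second cost covers all instances used by one training job.

```python
def estimate_max_tuning_cost(
    max_runtime_in_seconds: int,
    max_parallel_training_jobs: int,
    instance_cost_per_second: float,
) -> float:
    """Estimate the cost if every parallel slot is busy for the full runtime."""
    return (
        max_runtime_in_seconds
        * max_parallel_training_jobs
        * instance_cost_per_second
    )


# Hypothetical price: an instance billed at 0.36 USD per hour.
cost_per_second = 0.36 / 3600

# Same limits as the ResourceLimits example: 3600 seconds, 10 parallel jobs.
estimated_cost = estimate_max_tuning_cost(3600, 10, cost_per_second)
print(f"Estimated cost: {estimated_cost:.2f} USD")
```

Because the estimate scales linearly with both the runtime limit and the parallelism, halving either one halves the projected budget.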

This feature is part of a resource control criteria and doesn’t take into account the convergence of the models. As we see later in this post, this criteria can be used in combination with other stopping criteria to achieve cost control without sacrificing accuracy.

Desired target metric

Another previously introduced criteria is to define the target objective goal upfront. The criteria monitors the performance of the best model based on a specific objective metric and stops tuning when the models reach the defined threshold in relation to a specified objective metric.

With the TargetObjectiveMetricValue criteria, we can instruct SageMaker to stop tuning the model after the objective metric of the best model has reached the specified value:

{
    "TuningJobCompletionCriteria": {
        "TargetObjectiveMetricValue": 0.95
    },
    "HyperParameterTuningJobObjective": {
        "MetricName": "validation:auc",
        "Type": "Maximize"
    }
}

In this example, we instruct SageMaker to stop tuning the model when the objective metric of the best model has reached 0.95.

This method is useful when you have a specific target that you want your model to reach, such as a certain level of accuracy, precision, recall, F1-score, AUC, log-loss, and so on.

A typical use case for this criteria would be for a user who is already familiar with the model performance at given thresholds. A user in the exploration phase may first tune the model with a small subset of a larger dataset to identify a satisfactory evaluation metric threshold to target when training with the full dataset.

Improvement monitoring

This criterion monitors the models’ convergence after each iteration and stops the tuning if the models don’t improve after a defined number of training jobs. See the following configuration:

"TuningJobCompletionCriteria": {
    "BestObjectiveNotImproving": {
        "MaxNumberOfTrainingJobsNotImproving": 10
    }
}

In this case, we set MaxNumberOfTrainingJobsNotImproving to 10, which means that if the objective metric of the best model doesn’t improve over 10 training jobs, the tuning job is stopped and the best model and metric are reported.

Improvement monitoring should be used to tune a tradeoff between model quality and overall workflow duration in a way that is likely transferable between different optimization problems.
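To make the stopping rule concrete, the following Python sketch simulates it on a list of objective values. It is a simplified model under stated assumptions, not SageMaker's implementation: training jobs are treated as finishing one at a time, any strictly better value counts as an improvement, and the function name `jobs_until_stop` is hypothetical.

```python
from typing import Optional, Sequence


def jobs_until_stop(
    objective_values: Sequence[float],
    max_jobs_not_improving: int,
    maximize: bool = True,
) -> int:
    """Return how many training jobs run before the tuning job would stop."""
    best: Optional[float] = None
    not_improving = 0
    for jobs_run, value in enumerate(objective_values, start=1):
        if best is None:
            improved = True
        elif maximize:
            improved = value > best
        else:
            improved = value < best

        if improved:
            best = value
            not_improving = 0  # reset the counter on every improvement
        else:
            not_improving += 1

        if not_improving >= max_jobs_not_improving:
            return jobs_run
    # The criterion never triggered; all jobs ran.
    return len(objective_values)


# Hypothetical validation AUC values reported by successive training jobs.
auc_values = [0.70, 0.80, 0.79, 0.78, 0.80, 0.75]
stopped_after = jobs_until_stop(auc_values, max_jobs_not_improving=3)
print(f"Tuning stops after {stopped_after} training jobs")
```

A smaller threshold stops the job sooner at the risk of missing a late improvement, which is the tradeoff between model quality and workflow duration described above.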

Convergence detection

Convergence detection is a completion criterion that lets automatic model tuning decide when to stop tuning. Generally, automatic model tuning will stop tuning when it estimates that no significant improvement can be achieved. See the following configuration:

"TuningJobCompletionCriteria": {
    "ConvergenceDetected": {
        "CompleteOnConvergence": "Enabled"
    }
}

The criteria is best suited when you initially don’t know what stopping settings to select.

It’s also useful if you don’t know what target objective metric is reasonable for a good prediction given the problem and dataset in hand, and would rather have the tuning job complete when it is no longer improving.

Experiment with a comparison of completion criteria

In this experiment, given a regression task, we run 3 tuning experiments to find the optimal model within a search space of 2 hyperparameters having 200 hyperparameter configurations in total using the direct marketing dataset.

With everything else being equal, the first model was tuned with the BestObjectiveNotImproving completion criteria, the second model was tuned with the CompleteOnConvergence and the third model was tuned with no completion criteria defined.

When describing each job, we can observe that setting the BestObjectiveNotImproving criteria led to the most efficient use of resources and time relative to the objective metric, with significantly fewer jobs run.

The CompleteOnConvergence criteria was also able to stop tuning halfway through the experiment resulting in fewer training jobs and shorter training time compared to not setting a criteria.

While not setting a completion criteria resulted in a costly experiment, defining the MaxRuntimeInSeconds as part of the resource limit would be one way of minimizing the cost.

The results above show that when defining a completion criteria, Amazon SageMaker is able to intelligently stop the tuning process when it detects that the model is less likely to improve beyond the current result.

Note that the completion criteria supported in SageMaker automatic model tuning are not mutually exclusive and can be used concurrently when tuning a model.

When more than one completion criteria is defined, the tuning job completes when any of the criteria is met.

For example, a combination of a resource limit criteria like maximum tuning time with a convergence criteria, such as improvement monitoring or convergence detection, may produce both optimal cost control and an optimal objective metric.
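As a sketch of such a combination, the following Python snippet assembles a tuning job configuration that sets a maximum tuning time together with improvement monitoring and convergence detection. The helper name `build_tuning_job_config` is hypothetical, the objective metric and limits are reused from the earlier examples, and required settings such as parameter ranges and the training job definition are omitted for brevity.

```python
import json


def build_tuning_job_config(
    max_runtime_in_seconds: int,
    max_jobs_not_improving: int,
    complete_on_convergence: bool = True,
) -> dict:
    """Combine a time budget with improvement-based completion criteria."""
    completion_criteria = {
        "BestObjectiveNotImproving": {
            "MaxNumberOfTrainingJobsNotImproving": max_jobs_not_improving
        }
    }
    if complete_on_convergence:
        completion_criteria["ConvergenceDetected"] = {
            "CompleteOnConvergence": "Enabled"
        }

    return {
        "Strategy": "Bayesian",
        "HyperParameterTuningJobObjective": {
            "MetricName": "validation:auc",
            "Type": "Maximize",
        },
        "ResourceLimits": {
            "MaxParallelTrainingJobs": 10,
            "MaxNumberOfTrainingJobs": 100,
            "MaxRuntimeInSeconds": max_runtime_in_seconds,
        },
        "TuningJobCompletionCriteria": completion_criteria,
    }


config = build_tuning_job_config(
    max_runtime_in_seconds=3600, max_jobs_not_improving=10
)
print(json.dumps(config, indent=2))
```

A dictionary of this shape is what the HyperParameterTuningJobConfig argument of the CreateHyperParameterTuningJob API expects, once the omitted settings are added. Whichever criterion is met first ends the tuning job.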

Conclusion

In this post, we discussed how you can now intelligently stop your tuning job by selecting a set of completion criteria newly introduced in SageMaker, such as maximum tuning time, improvement monitoring, or convergence detection.

We demonstrated with an experiment that intelligent stopping based on improvement observation across iterations may lead to significantly better budget and time management compared to not defining a completion criteria.

We also showed that these criteria are not mutually exclusive and can be used concurrently when tuning a model, to take advantage of both budget control and optimal convergence.

For more details on how to configure and run automatic model tuning, refer to Specify the Hyperparameter Tuning Job Settings.


About the Authors

Doug Mbaya is a Senior Partner Solutions Architect with a focus on data and analytics. Doug works closely with AWS partners, helping them integrate data and analytics solutions in the cloud.

Chaitra Mathur is a Principal Solutions Architect at AWS. She guides customers and partners in building highly scalable, reliable, secure, and cost-effective solutions on AWS. She is passionate about Machine Learning and helps customers translate their ML needs into solutions using AWS AI/ML services. She holds 5 certifications including the ML Specialty certification. In her spare time, she enjoys reading, yoga, and spending time with her daughters.

Iaroslav Shcherbatyi is a Machine Learning Engineer at AWS. He works mainly on improvements to the Amazon SageMaker platform and helping customers best use its features. In his spare time, he likes to go to the gym, do outdoor sports such as ice skating or hiking, and catch up on new AI research.

Create powerful self-service experiences with Amazon Lex on Talkdesk CX Cloud contact center

This blog post is co-written with Bruno Mateus, Jonathan Diedrich and Crispim Tribuna at Talkdesk.

Contact centers are using artificial intelligence (AI) and natural language processing (NLP) technologies to build a personalized customer experience and deliver effective self-service support through conversational bots.

This is the first of a two-part series dedicated to the integration of Amazon Lex with the Talkdesk CX Cloud contact center. In this post, we describe a solution architecture that combines the powerful resources of Amazon Lex and Talkdesk CX Cloud for the voice channel. In the second part of this series, we describe how to use the Amazon Lex chatbot UI with Talkdesk CX Cloud to allow customers to transition from a chatbot conversation to a live agent within the same chat window.

The benefits of Amazon Lex and Talkdesk CX Cloud are exemplified by WaFd Bank, a full-service US commercial bank with 200 locations that manages $20 billion in assets. The bank has invested in a digital transformation of its contact center to provide exceptional service to its clients. WaFd has pioneered an omnichannel banking experience that combines the advanced conversational AI capabilities of Amazon Lex voice and chat bots with Talkdesk Financial Services Experience Cloud for Banking.

“We wanted to combine the power of Amazon Lex’s conversational AI capabilities with the Talkdesk modern, unified contact center solution. This gives us the best of both worlds, enabling WaFd to serve its clients in the best way possible.”

-Dustin Hubbard, Chief Technology Officer at WaFd Bank.

To support WaFd’s vision, Talkdesk has extended its self-service virtual agent voice and chat capabilities with an integration with Amazon Lex and Amazon Polly. Additionally, the combination of Talkdesk Identity voice authentication with an Amazon Lex voicebot allows WaFd clients to resolve common banking transactions on their own. Tasks like account balance lookups are completed in seconds, a 90% reduction in time compared to WaFd’s legacy system. The newly designed Amazon Lex website chatbot has led to a substantial decrease in voicemail volume as its chatbot UI seamlessly integrates with Talkdesk systems.

In the following sections, we provide an overview of the components that make this integration possible. We then present the solution architecture, highlight its main components, and describe the customer journey from interacting with Amazon Lex to escalation to an agent. We end by explaining how contact centers can keep AI models up to date using Talkdesk AI Trainer.

Solution overview

The solution consists of the following key components:

  • Amazon Lex – Amazon Lex combines with Amazon Polly to automate customer service interactions by adding conversational AI capabilities to your contact center. Amazon Lex delivers fast responses to customers’ most common questions and seamlessly hands over complex cases to a human agent. Augmenting your contact center operations with Amazon Lex bots provides an enhanced customer experience and helps you build an omnichannel experience, allowing customers to engage across phone lines, websites, and messaging platforms.
  • Talkdesk CX Cloud contact center – Talkdesk, Inc. is a global cloud contact center leader for customer-obsessed companies. Talkdesk CX Cloud offers enterprise scale with consumer simplicity to deliver speed, agility, reliability, and security. As an AWS Partner, Talkdesk is using AI capabilities like Amazon Transcribe, a speech-to-text service, with the Talkdesk Agent Assist and Talkdesk Customer Experience Analytics products across a number of languages and accents. Talkdesk has extended its self-service virtual agent voice and chat capabilities with an integration with Amazon Lex and Amazon Polly. These virtual agents can automate routine tasks as well as seamlessly elevate complex interactions to a live agent.
  • Authentication and voice biometrics with Talkdesk Identity – Talkdesk Identity provides fraud protection through self-service authentication using voice biometrics. Voice biometrics solutions provide contact centers with improved levels of security while streamlining the authentication process for the customer. This secure and efficient authentication experience allows contact centers to handle a wide range of self-service functionalities. For example, customers can check their balance, schedule a funds transfer, or activate/deactivate a card using a banking bot.

The following diagram illustrates our solution architecture.

The voice authentication call flow implemented in Talkdesk interacts with Amazon Lex as follows:

  • When a phone call is initiated, a customer lookup is performed using the incoming caller’s phone number. If multiple customers are retrieved, further information, like date of birth, is requested in order to narrow down the list to a unique customer record.
  • If the caller is identified and has previously enrolled in voice biometrics, the caller will be prompted to say their voice pass code. If successful, the caller is offered an authenticated Amazon Lex experience.
  • If a caller is identified and not enrolled in voice biometrics, they can work with an agent to verify their identity and record their voice print as the password. For more information, visit the Talkdesk Voice Biometric documentation.
  • If the caller is not identified or not enrolled in voice biometrics, the caller can interact with Amazon Lex to perform tasks that don’t require authentication, or they can request a transfer to an agent.

How Talkdesk integrates with Amazon Lex

When the call reaches Talkdesk Virtual Agent, Talkdesk uses the continuous streaming capability of the Amazon Lex API to enable conversation with the Amazon Lex bot. Talkdesk Virtual Agent has an Amazon Lex adapter that initiates an HTTP/2 bidirectional event stream through the StartConversation API operation. Talkdesk Virtual Agent and the Amazon Lex bot start exchanging information in real time following the sequence of events for an audio conversation. For more information, refer to Starting a stream to a bot.

All the context data from Talkdesk Studio is sent to Amazon Lex through session attributes established on the initial ConfigurationEvent. The Amazon Lex voicebot has been equipped with a welcome intent, which is invoked by Talkdesk to initiate the conversation and play a welcome message. In Amazon Lex, a session attribute is set to ensure the welcome intent and its message are used only once in any conversation. The greeting message can be customized to include the name of the authenticated caller, if provided from the Talkdesk system in session attributes.

The following diagram shows the basic components and events used to enable communications.

Agent escalation from Amazon Lex

If a customer requests agent assistance, all necessary information to ensure the customer is routed to the correct agent is made available by Amazon Lex to Talkdesk Studio through session attributes.

Examples of session attributes include:

  • A flag to indicate the customer requests agent assistance
  • The reason for the escalation, used by Talkdesk to route the call appropriately
  • Additional data regarding the call to provide the agent with contextual information about the customer and their earlier interaction with the bot
  • The sentiment of the interaction
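The following Python sketch illustrates how a consumer of these session attributes might turn them into routing context for an agent. It is purely illustrative: the attribute names (agentRequested, escalationReason, interactionSummary, sentiment) and the function name are hypothetical and are not taken from the Talkdesk or Amazon Lex integration. The only property relied on is that Amazon Lex session attributes are a map of string keys to string values.

```python
from typing import Dict


def build_escalation_context(session_attributes: Dict[str, str]) -> Dict[str, str]:
    """Return routing context if the caller asked for an agent, else an empty dict."""
    # Session attribute values are strings, so the flag is compared as text.
    if session_attributes.get("agentRequested", "false").lower() != "true":
        return {}

    return {
        # Hypothetical mapping: the escalation reason selects the agent queue.
        "queue": session_attributes.get("escalationReason", "general"),
        "summary": session_attributes.get("interactionSummary", ""),
        "sentiment": session_attributes.get("sentiment", "unknown"),
    }


# Hypothetical attributes set by the bot before handing over the call.
attributes = {
    "agentRequested": "true",
    "escalationReason": "card_dispute",
    "interactionSummary": "Caller asked about a declined card payment.",
    "sentiment": "NEGATIVE",
}
context = build_escalation_context(attributes)
print(context)
```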

Training

Talkdesk AI Trainer is a human-in-the-loop tool that is included in the operational flow of Talkdesk CX Cloud. It performs the continuous training and improvement of AI models by real agents without the need for specialized data science teams.

Talkdesk developed a connector that allows AI Trainer to automatically collect intent data from Amazon Lex intent models. Non-technical users can easily fine-tune these models to support Talkdesk AI products such as Talkdesk Virtual Agent. The connector was built by using the Amazon Lex Model Building API with the AWS SDK for Java 2.x.

It is possible to train intent data from Amazon Lex using real-world conversations between customers and (virtual) agents by:

  • Requesting feedback of intent classifications with a low confidence level
  • Adding new training phrases to intents
  • Adding synonyms or regular expressions to slot types

AI Trainer receives data from Amazon Lex, namely intents and slot types. This data is then displayed and managed on Talkdesk AI Trainer, along with all the events that are part of the conversational orchestration taking place in Talkdesk Virtual Agent. Through the AI Trainer quality system or agreement, supervisors or administrators decide which improvements will be introduced in the Amazon Lex model and reflected in Talkdesk Virtual Agent.

Adjustments to production can be easily published on AI Trainer and sent to Amazon Lex. Continuously training AI models ensures that AI products reflect the evolution of the business and the latest needs of customers. This in turn helps increase the automation rate via self-servicing and resolve cases faster, resulting in higher customer satisfaction.

Conclusion

In this post, we presented how the power of Amazon Lex conversational AI capabilities can be combined with the Talkdesk modern, unified contact center solution through the Amazon Lex API. We explained how Talkdesk voice biometrics offers the caller a self-service authenticated experience and how Amazon Lex provides contextual information to the agent to assist the caller more efficiently.

We are excited about the new possibilities that the integration of Amazon Lex and Talkdesk CX Cloud solutions offers to our clients. We at AWS Professional Services and Talkdesk are available to help you and your team implement your vision of an omnichannel experience.

The next post in this series will provide guidance on how to integrate an Amazon Lex chatbot to Talkdesk Studio, and how to enable customers to interact with a live agent from the chatbot.


About the authors


Grazia Russo Lassner is a Senior Consultant with the AWS Professional Services Natural Language AI team. She specializes in designing and developing conversational AI solutions using AWS technologies for customers in various industries. Outside of work, she enjoys beach weekends, reading the latest fiction books, and family.

Cecil Patterson is a Natural Language AI consultant with AWS Professional Services based in North Texas. He has many years of experience working with large enterprises to enable and support global infrastructure solutions. Cecil uses his experience and diverse skill set to build exceptional conversational solutions for customers of all types.

Bruno Mateus is a Principal Engineer at Talkdesk. With over 20 years of experience in the software industry, he specializes in large-scale distributed systems. When not working, he enjoys spending time outside with his family, trekking, mountain bike riding, and motorcycle riding.

Jonathan Diedrich is a Principal Solutions Consultant at Talkdesk. He works on enterprise and strategic projects to ensure technical execution and adoption. Outside of work, he enjoys ice hockey and games with his family.

Crispim Tribuna is a Senior Software Engineer at Talkdesk currently focusing on the AI-based virtual agent project. He has over 17 years of experience in computer science, with a focus on telecommunications, IPTV, and fraud prevention. In his free time, he enjoys spending time with his family, running (he has completed three marathons), and riding motorcycles.

Image classification model selection using Amazon SageMaker JumpStart

Researchers continue to develop new model architectures for common machine learning (ML) tasks. One such task is image classification, where images are accepted as input and the model attempts to classify the image as a whole with object label outputs. With many models available today that perform this image classification task, an ML practitioner may ask questions like: “What model should I fine-tune and then deploy to achieve the best performance on my dataset?” And an ML researcher may ask questions like: “How can I generate my own fair comparison of multiple model architectures against a specified dataset while controlling training hyperparameters and computer specifications, such as GPUs, CPUs, and RAM?” The former question addresses model selection across model architectures, while the latter question concerns benchmarking trained models against a test dataset.

In this post, you will see how the TensorFlow image classification algorithm of Amazon SageMaker JumpStart can simplify the implementations required to address these questions. Together with the implementation details in a corresponding example Jupyter notebook, you will have tools available to perform model selection by exploring pareto frontiers, where improving one performance metric, such as accuracy, is not possible without worsening another metric, such as throughput.

Solution overview

The following figure illustrates the model selection trade-off for a large number of image classification models fine-tuned on the Caltech-256 dataset, which is a challenging set of 30,607 real-world images spanning 256 object categories. Each point represents a single model, point sizes are scaled with respect to the number of parameters comprising the model, and the points are color-coded based on their model architecture. For example, the light green points represent the EfficientNet architecture; each light green point is a different configuration of this architecture with unique fine-tuned model performance measurements. The figure shows the existence of a pareto frontier for model selection, where higher accuracy is exchanged for lower throughput. Ultimately, the selection of a model along the pareto frontier, or the set of pareto efficient solutions, depends on your model deployment performance requirements.

If test accuracy and test throughput are the frontiers of interest, the set of pareto efficient solutions from the preceding figure is listed in the following table. Rows are sorted such that test throughput is increasing and test accuracy is decreasing.

| Model Name | Number of Parameters | Test Accuracy | Test Top 5 Accuracy | Throughput (images/s) | Duration per Epoch (s) |
|---|---|---|---|---|---|
| swin-large-patch4-window12-384 | 195.6M | 96.4% | 99.5% | 0.3 | 2278.6 |
| swin-large-patch4-window7-224 | 195.4M | 96.1% | 99.5% | 1.1 | 698.0 |
| efficientnet-v2-imagenet21k-ft1k-l | 118.1M | 95.1% | 99.2% | 4.5 | 1434.7 |
| efficientnet-v2-imagenet21k-ft1k-m | 53.5M | 94.8% | 99.1% | 8.0 | 769.1 |
| efficientnet-v2-imagenet21k-m | 53.5M | 93.1% | 98.5% | 8.0 | 765.1 |
| efficientnet-b5 | 29.0M | 90.8% | 98.1% | 9.1 | 668.6 |
| efficientnet-v2-imagenet21k-ft1k-b1 | 7.3M | 89.7% | 97.3% | 14.6 | 54.3 |
| efficientnet-v2-imagenet21k-ft1k-b0 | 6.2M | 89.0% | 97.0% | 20.5 | 38.3 |
| efficientnet-v2-imagenet21k-b0 | 6.2M | 87.0% | 95.6% | 21.5 | 38.2 |
| mobilenet-v3-large-100-224 | 4.6M | 84.9% | 95.4% | 27.4 | 28.8 |
| mobilenet-v3-large-075-224 | 3.1M | 83.3% | 95.2% | 30.3 | 26.6 |
| mobilenet-v2-100-192 | 2.6M | 80.8% | 93.5% | 33.5 | 23.9 |
| mobilenet-v2-100-160 | 2.6M | 80.2% | 93.2% | 40.0 | 19.6 |
| mobilenet-v2-075-160 | 1.7M | 78.2% | 92.8% | 41.8 | 19.3 |
| mobilenet-v2-075-128 | 1.7M | 76.1% | 91.1% | 44.3 | 18.3 |
| mobilenet-v1-075-160 | 2.0M | 75.7% | 91.0% | 44.5 | 18.2 |
| mobilenet-v1-100-128 | 3.5M | 75.1% | 90.7% | 47.4 | 17.4 |
| mobilenet-v1-075-128 | 2.0M | 73.2% | 90.0% | 48.9 | 16.8 |
| mobilenet-v2-075-96 | 1.7M | 71.9% | 88.5% | 49.4 | 16.6 |
| mobilenet-v2-035-96 | 0.7M | 63.7% | 83.1% | 50.4 | 16.3 |
| mobilenet-v1-025-128 | 0.3M | 59.0% | 80.7% | 50.8 | 16.2 |
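A set of pareto efficient solutions like this one can be extracted with a simple dominance check. The following Python sketch is one way to do it and is not necessarily how the accompanying notebook computes the frontier. Three of the sample rows come from the table; the model named hypothetical-dominated is invented to show a point that gets filtered out.

```python
from typing import List, Tuple

# (model name, test accuracy in %, test throughput in images/s)
Model = Tuple[str, float, float]


def pareto_frontier(models: List[Model]) -> List[Model]:
    """Keep models that no other model beats on both accuracy and throughput."""
    frontier = []
    for name, accuracy, throughput in models:
        dominated = any(
            other_acc >= accuracy
            and other_thr >= throughput
            and (other_acc > accuracy or other_thr > throughput)
            for _, other_acc, other_thr in models
        )
        if not dominated:
            frontier.append((name, accuracy, throughput))
    # Sort by increasing throughput, matching the table layout.
    return sorted(frontier, key=lambda model: model[2])


candidates = [
    ("efficientnet-b5", 90.8, 9.1),
    ("hypothetical-dominated", 85.0, 5.0),  # worse than efficientnet-b5 on both
    ("mobilenet-v1-025-128", 59.0, 50.8),
    ("swin-large-patch4-window12-384", 96.4, 0.3),
]
frontier = pareto_frontier(candidates)
for name, accuracy, throughput in frontier:
    print(f"{name}: {accuracy}% accuracy, {throughput} images/s")
```

The pairwise check is quadratic in the number of models, which is fine for the few hundred models discussed in this post.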

This post provides details on how to implement large-scale Amazon SageMaker benchmarking and model selection tasks. First, we introduce JumpStart and the built-in TensorFlow image classification algorithms. We then discuss high-level implementation considerations, such as JumpStart hyperparameter configurations, metric extraction from Amazon CloudWatch Logs, and launching asynchronous hyperparameter tuning jobs. Finally, we cover the implementation environment and parameterization leading to the pareto efficient solutions in the preceding table and figure.

Introduction to JumpStart TensorFlow image classification

JumpStart provides one-click fine-tuning and deployment of a wide variety of pre-trained models across popular ML tasks, as well as a selection of end-to-end solutions that solve common business problems. These features remove the heavy lifting from each step of the ML process, making it easier to develop high-quality models and reducing time to deployment. The JumpStart APIs allow you to programmatically deploy and fine-tune a vast selection of pre-trained models on your own datasets.

The JumpStart model hub provides access to a large number of TensorFlow image classification models that enable transfer learning and fine-tuning on custom datasets. As of this writing, the JumpStart model hub contains 135 TensorFlow image classification models across a variety of popular model architectures from TensorFlow Hub, including residual networks (ResNet), MobileNet, EfficientNet, Inception, Neural Architecture Search Networks (NASNet), Big Transfer (BiT), shifted window (Swin) transformers, Class-Attention in Image Transformers (CaiT), and Data-Efficient Image Transformers (DeiT).

Vastly different internal structures comprise each model architecture. For instance, ResNet models utilize skip connections to allow for substantially deeper networks, whereas transformer-based models use self-attention mechanisms that eliminate the intrinsic locality of convolution operations in favor of more global receptive fields. In addition to the diverse feature sets these different structures provide, each model architecture has several configurations that adjust the model size, shape, and complexity within that architecture. This results in hundreds of unique image classification models available on the JumpStart model hub. Combined with built-in transfer learning and inference scripts that encompass many SageMaker features, the JumpStart API is a great launching point for ML practitioners to get started training and deploying models quickly.

Refer to Transfer learning for TensorFlow image classification models in Amazon SageMaker and the following example notebook to learn about SageMaker TensorFlow image classification in more depth, including how to run inference on a pre-trained model as well as fine-tune the pre-trained model on a custom dataset.

Large-scale model selection considerations

Model selection is the process of selecting the best model from a set of candidate models. This process may be applied across models of the same type with different parameter weights and across models of different types. Examples of model selection across models of the same type include fitting the same model with different hyperparameters (for example, learning rate) and early stopping to prevent the overfitting of model weights to the train dataset. Model selection across models of different types includes selecting the best model architecture (for example, Swin vs. MobileNet) and selecting the best model configurations within a single model architecture (for example, mobilenet-v1-025-128 vs. mobilenet-v3-large-100-224).

The considerations outlined in this section enable all of these model selection processes on a validation dataset.

Select hyperparameter configurations

TensorFlow image classification in JumpStart has a large number of available hyperparameters that can adjust the transfer learning script behaviors uniformly for all model architectures. These hyperparameters relate to data augmentation and preprocessing, optimizer specification, overfitting controls, and trainable layer indicators. You are encouraged to adjust the default values of these hyperparameters as necessary for your application:

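import sagemaker.hyperparameters

# Example JumpStart model ID; replace it with the model you want to fine-tune.
model_id: str = "tensorflow-ic-imagenet-mobilenet-v2-100-224-classification-4"
model_version: str = "*"  # "*" selects the latest available version

# Returns a dictionary of the default hyperparameters for this model.
hyperparameters = sagemaker.hyperparameters.retrieve_default(
    model_id=model_id, model_version=model_version
)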

For this analysis and the associated notebook, all hyperparameters are set to default values except for learning rate, number of epochs, and early stopping specification. Learning rate is adjusted as a categorical parameter by the SageMaker automatic model tuning job. Because each model has unique default hyperparameter values, the discrete list of possible learning rates includes the default learning rate as well as one-fifth the default learning rate. This launches two training jobs for a single hyperparameter tuning job, and the training job with the best reported performance on the validation dataset is selected. Because the number of epochs is set to 10, which is greater than the default hyperparameter setting, the selected best training job doesn’t always correspond to the default learning rate. Finally, an early stopping criterion is utilized with a patience, or the number of epochs to continue training with no improvement, of three epochs.
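The learning rate candidates and overrides described above can be sketched as follows. This is an illustration rather than the notebook's code: the helper name is hypothetical, the default learning rate is a made-up value, and the hyperparameter names in the overrides (epochs, early_stopping, early_stopping_patience) are assumptions about the JumpStart TensorFlow image classification script.

```python
from typing import Dict, List


def learning_rate_candidates(default_learning_rate: float) -> List[str]:
    """Return the default learning rate and one-fifth of it as categorical values."""
    return [str(default_learning_rate), str(default_learning_rate / 5)]


# Hypothetical default; in practice this comes from the retrieved defaults.
default_hyperparameters: Dict[str, str] = {"learning_rate": "0.001", "epochs": "5"}

candidates = learning_rate_candidates(
    float(default_hyperparameters["learning_rate"])
)

# Assumed hyperparameter names for epochs and early stopping.
overrides = {
    "epochs": "10",
    "early_stopping": "True",
    "early_stopping_patience": "3",
}
tuned_hyperparameters = {**default_hyperparameters, **overrides}

print(candidates)
print(tuned_hyperparameters)
```

The list of candidates is what would be passed to a categorical parameter range, for example sagemaker.tuner.CategoricalParameter, so that each tuning job launches exactly two training jobs.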

One default hyperparameter setting of particular importance is train_only_on_top_layer, where, if set to True, the model’s feature extraction layers are not fine-tuned on the provided training dataset. The optimizer will only train parameters in the top fully connected classification layer with output dimensionality equal to the number of class labels in the dataset. By default, this hyperparameter is set to True, which is a setting targeted for transfer learning on small datasets. You may have a custom dataset where the feature extraction from the pre-training on the ImageNet dataset is not sufficient. In these cases, you should set train_only_on_top_layer to False. Although this setting will increase training time, you will extract more meaningful features for your problem of interest, thereby increasing accuracy.

Extract metrics from CloudWatch Logs

The JumpStart TensorFlow image classification algorithm reliably logs a variety of metrics during training that are accessible to SageMaker Estimator and HyperparameterTuner objects. The constructor of a SageMaker Estimator has a metric_definitions keyword argument, which can be used to evaluate the training job by providing a list of dictionaries with two keys: Name for the name of the metric, and Regex for the regular expression used to extract the metric from the logs. The accompanying notebook shows the implementation details. The following table lists the available metrics and associated regular expressions for all JumpStart TensorFlow image classification models.

| Metric name | Regular expression |
| --- | --- |
| number of parameters | "- Number of parameters: ([0-9\.]+)" |
| number of trainable parameters | "- Number of trainable parameters: ([0-9\.]+)" |
| number of non-trainable parameters | "- Number of non-trainable parameters: ([0-9\.]+)" |
| train dataset metric | f"- {metric}: ([0-9\.]+)" |
| validation dataset metric | f"- val_{metric}: ([0-9\.]+)" |
| test dataset metric | f"- Test {metric}: ([0-9\.]+)" |
| train duration | "- Total training duration: ([0-9\.]+)" |
| train duration per epoch | "- Average training duration per epoch: ([0-9\.]+)" |
| test evaluation latency | "- Test evaluation latency: ([0-9\.]+)" |
| test latency per sample | "- Average test latency per sample: ([0-9\.]+)" |
| test throughput | "- Average test throughput: ([0-9\.]+)" |

The built-in transfer learning script provides a variety of train, validation, and test dataset metrics within these definitions, as represented by the f-string replacement values. The exact metrics available vary based on the type of classification being performed. All compiled models have a loss metric, which is represented by a cross-entropy loss for either a binary or categorical classification problem. The former is used when there is one class label; the latter is used if there are two or more class labels. If there is only a single class label, then the following metrics are computed, logged, and extractable via the f-string regular expressions in the preceding table: number of true positives (true_pos), number of false positives (false_pos), number of true negatives (true_neg), number of false negatives (false_neg), precision, recall, area under the receiver operating characteristic (ROC) curve (auc), and area under the precision-recall (PR) curve (prc). Similarly, if there are six or more class labels, a top-5 accuracy metric (top_5_accuracy) is also computed, logged, and extractable via the preceding regular expressions.
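To make the Name and Regex pairs concrete, the following sketch builds metric definitions from the f-string patterns in the preceding table and applies them to a log line. The metric names such as `train_accuracy`, the sample log line, and the helper functions are assumptions made for illustration; the actual log format may differ.

```python
import re


def build_metric_definitions(metrics):
    """Return Name/Regex dictionaries for train and validation metrics."""
    definitions = []
    for metric in metrics:
        definitions.append(
            {"Name": f"train_{metric}", "Regex": rf"- {metric}: ([0-9\.]+)"}
        )
        definitions.append(
            {"Name": f"val_{metric}", "Regex": rf"- val_{metric}: ([0-9\.]+)"}
        )
    return definitions


def extract_metrics(log_line, definitions):
    """Apply each regular expression and keep the first captured value."""
    values = {}
    for definition in definitions:
        match = re.search(definition["Regex"], log_line)
        if match:
            values[definition["Name"]] = float(match.group(1))
    return values


metric_definitions = build_metric_definitions(["loss", "accuracy"])

# Hypothetical log line used only to exercise the regular expressions.
sample_line = "Epoch 3 - loss: 0.4213 - accuracy: 0.8571 - val_loss: 0.5120 - val_accuracy: 0.8125"
values = extract_metrics(sample_line, metric_definitions)
print(values)
```

A list shaped like `metric_definitions` is what the metric_definitions keyword argument expects, with one dictionary per metric.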

During training, metrics specified to a SageMaker Estimator are emitted to CloudWatch Logs. When the training is complete, you can invoke the SageMaker DescribeTrainingJob API and inspect the FinalMetricDataList key in the JSON response:

tuner: sagemaker.tuner.HyperparameterTuner
session: sagemaker.Session

training_job_name = tuner.best_training_job()
description = session.describe_training_job(training_job_name)
metrics = description["FinalMetricDataList"]

This API requires only the job name to be provided to the query, so, once completed, metrics can be obtained in future analyses so long as the training job name is appropriately logged and recoverable. For this model selection task, hyperparameter tuning job names are stored and subsequent analyses reattach a HyperparameterTuner object given the tuning job name, extract the best training job name from the attached hyperparameter tuner, and then invoke the DescribeTrainingJob API as described earlier to obtain metrics associated with the best training job.
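Once the response is available, the metric list can be flattened into a dictionary for later analysis. The following is a minimal sketch: the response shown is a hypothetical, truncated example, and `final_metrics_to_dict` is a helper name introduced here for illustration.

```python
def final_metrics_to_dict(description):
    """Map metric names to their final values from a DescribeTrainingJob response."""
    return {
        entry["MetricName"]: entry["Value"]
        for entry in description.get("FinalMetricDataList", [])
    }


# Hypothetical, truncated response used only for illustration.
example_description = {
    "TrainingJobName": "example-training-job",
    "FinalMetricDataList": [
        {"MetricName": "val_accuracy", "Value": 0.8125},
        {"MetricName": "train duration", "Value": 412.0},
    ],
}

final_metrics = final_metrics_to_dict(example_description)
print(final_metrics)
```

Using `.get` with an empty default keeps the helper safe for jobs that reported no metrics.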

Launch asynchronous hyperparameter tuning jobs

Refer to the corresponding notebook for implementation details on asynchronously launching hyperparameter tuning jobs, which uses the Python standard library’s concurrent.futures module, a high-level interface for asynchronously running callables. Several SageMaker-related considerations are implemented in this solution:

  • Each AWS account is affiliated with SageMaker service quotas. You should view your current limits to fully utilize your resources and potentially request resource limit increases as needed.
  • Frequent API calls to create many simultaneous hyperparameter tuning jobs may exceed the API rate limits and raise throttling exceptions. A resolution to this is to create a SageMaker Boto3 client with a custom retry configuration.
  • What happens if your script encounters an error or the script is stopped before completion? For such a large model selection or benchmarking study, you can log tuning job names and provide convenience functions to reattach hyperparameter tuning jobs that already exist:
tuning_job_name: str
session: sagemaker.Session

tuner = sagemaker.tuner.HyperparameterTuner.attach(tuning_job_name, session)
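
The asynchronous launch pattern discussed in this section can be sketched with concurrent.futures as follows. This is not the notebook’s implementation: `launch_tuning_job` is a placeholder that only returns a made-up job name, where a real version would create and start a hyperparameter tuning job, and the model IDs are hypothetical.

```python
import concurrent.futures


def launch_tuning_job(model_id):
    """Placeholder: a real version would create and start a tuning job here."""
    return f"tuning-job-for-{model_id}"


def launch_all(model_ids, max_workers=4):
    """Launch one job per model ID concurrently and collect names or errors."""
    job_names, errors = {}, {}
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
        future_to_model = {
            executor.submit(launch_tuning_job, model_id): model_id
            for model_id in model_ids
        }
        for future in concurrent.futures.as_completed(future_to_model):
            model_id = future_to_model[future]
            try:
                job_names[model_id] = future.result()
            except Exception as error:
                # Record the failure and keep going so one error does not stop the study.
                errors[model_id] = error
    return job_names, errors


job_names, errors = launch_all(["model-a", "model-b", "model-c"])
print(job_names)
```

Persisting `job_names` as soon as each future completes is what makes it possible to reattach to existing tuning jobs after an interruption.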

Analysis details and discussion

The analysis in this post performs transfer learning for model IDs in the JumpStart TensorFlow image classification algorithm on the Caltech-256 dataset. All training jobs were performed on the SageMaker training instance ml.g4dn.xlarge, which contains a single NVIDIA T4 GPU.

The test dataset is evaluated on the training instance at the end of training. Model selection is performed prior to the test dataset evaluation to set model weights to the epoch with the best validation set performance. Test throughput is not optimized: the dataset batch size is set to the default training hyperparameter batch size, which isn’t adjusted to maximize GPU memory usage; reported test throughput includes data loading time because the dataset isn’t pre-cached; and distributed inference across multiple GPUs isn’t utilized. For these reasons, this throughput is a good relative measurement, but actual throughput would depend heavily on your inference endpoint deployment configurations for the trained model.

Although the JumpStart model hub contains many image classification architecture types, this Pareto frontier is dominated by select Swin, EfficientNet, and MobileNet models. Swin models are larger and relatively more accurate, whereas MobileNet models are smaller, relatively less accurate, and suitable for resource constraints of mobile devices. It’s important to note that this frontier is conditioned on a variety of factors, including the exact dataset used and the fine-tuning hyperparameters selected. You may find that your custom dataset produces a different set of Pareto-efficient solutions, and you may desire longer training times with different hyperparameters, such as more data augmentation or fine-tuning more than just the top classification layer of the model.
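
As a rough illustration of how a set of Pareto-efficient models can be identified, the following sketch keeps every model that no other model beats on both axes. It assumes two metrics that are both maximized (for example, test accuracy and test throughput); the model names and numbers are invented for the example and are not results from this analysis.

```python
def pareto_frontier(points):
    """Return names of points not dominated by any other point (both axes maximized)."""
    frontier = []
    for name, accuracy, throughput in points:
        dominated = any(
            other_acc >= accuracy
            and other_thr >= throughput
            and (other_acc > accuracy or other_thr > throughput)
            for _, other_acc, other_thr in points
        )
        if not dominated:
            frontier.append(name)
    return frontier


# Hypothetical (name, accuracy, throughput) triples used only for illustration.
models = [
    ("large", 0.90, 10.0),
    ("medium", 0.80, 20.0),
    ("small", 0.70, 15.0),
    ("tiny", 0.60, 40.0),
]
print(pareto_frontier(models))
```

Here the model named small is excluded because medium is both more accurate and faster.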

Conclusion

In this post, we showed how to run large-scale model selection or benchmarking tasks using the JumpStart model hub. This solution can help you choose the best model for your needs. We encourage you to try out and explore this solution on your own dataset.

About the authors

Dr. Kyle Ulrich is an Applied Scientist with the Amazon SageMaker built-in algorithms team. His research interests include scalable machine learning algorithms, computer vision, time series, Bayesian non-parametrics, and Gaussian processes. His PhD is from Duke University and he has published papers in NeurIPS, Cell, and Neuron.

Dr. Ashish Khetan is a Senior Applied Scientist with Amazon SageMaker built-in algorithms and helps develop machine learning algorithms. He got his PhD from University of Illinois Urbana Champaign. He is an active researcher in machine learning and statistical inference and has published many papers in NeurIPS, ICML, ICLR, JMLR, ACL, and EMNLP conferences.

Predict football punt and kickoff return yards with fat-tailed distribution using GluonTS

Today, the NFL is continuing their journey to increase the number of statistics provided by the Next Gen Stats Platform to all 32 teams and fans alike. With advanced analytics derived from machine learning (ML), the NFL is creating new ways to quantify football, and to provide fans with the tools needed to increase their knowledge of the games within the game of football. For the 2022 season, the NFL aimed to leverage player-tracking data and new advanced analytics techniques to better understand special teams.

The goal of the project was to predict how many yards a returner would gain on a punt or kickoff play. One of the challenges when building predictive models for punt and kickoff returns is the presence of very rare events, such as touchdowns, that have significant importance in the dynamics of a game. A data distribution with fat tails is common in real-world applications, where rare events have significant impact on the overall performance of the models. Using a robust method to accurately model distribution over extreme events is crucial for better overall performance.

In this post, we demonstrate how to use Spliced Binned-Pareto distribution implemented in GluonTS to robustly model such fat-tailed distributions.

We first describe the dataset used. Next, we present the data preprocessing and other transformation methods applied to the dataset. We then explain the details of the ML methodology and model training procedures. Finally, we present the model performance results.

Dataset

In this post, we used two datasets to build separate models for punt and kickoff returns. The player tracking data contains each player’s position, direction, acceleration, and more (in x,y coordinates). There are around 3,000 punt plays and 4,000 kickoff plays from four NFL seasons (2018–2021). Touchdowns are very rare in both datasets: only 0.23% of punt plays and 0.8% of kickoff plays. The yardage distributions for punts and kickoffs also differ: they are similar in shape but shifted, as shown in the following figure.

Punts and kickoff return yards distribution

Data preprocessing and feature engineering

First, the tracking data was filtered for just the data related to punts and kickoff returns. The player data was used to derive features for model development:

  • X – Player position along the long axis of the field
  • Y – Player position along the short axis of the field
  • S – Speed in yards/second; replaced by Dis*10 to make it more accurate (Dis is the distance in the past 0.1 seconds)
  • Dir – Angle of player motion (degrees)

From the preceding data, each play was transformed into a 10×11×14 array, with 10 offensive players (excluding the ball carrier), 11 defenders, and 14 derived features:

  • sX – x speed of a player
  • sY – y speed of a player
  • s – Speed of a player
  • aX – x acceleration of a player
  • aY – y acceleration of a player
  • relX – x distance of player relative to ball carrier
  • relY – y distance of player relative to ball carrier
  • relSx – x speed of player relative to ball carrier
  • relSy – y speed of player relative to ball carrier
  • relDist – Euclidean distance of player relative to ball carrier
  • oppX – x distance of offense player relative to defense player
  • oppY – y distance of offense player relative to defense player
  • oppSx – x speed of offense player relative to defense player
  • oppSy – y speed of offense player relative to defense player

To augment the data, the X and Y position values were also mirrored to account for the right and left field positions. The data preprocessing and feature engineering were adapted from the winner of the NFL Big Data Bowl competition on Kaggle.
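
A few of the derived features can be sketched for a single player as follows. This is an illustration only: it assumes the direction angle is measured counterclockwise from the x axis, which may not match the convention of the tracking data, and the dictionary layout and function names are hypothetical.

```python
import math


def velocity_components(speed, direction_deg):
    """Split speed into x and y components.

    Assumed convention: angle in degrees, counterclockwise from the x axis.
    """
    angle = math.radians(direction_deg)
    return speed * math.cos(angle), speed * math.sin(angle)


def relative_features(player, carrier):
    """Compute features of a player relative to the ball carrier.

    Both arguments are dicts with the keys x, y, s, and dir.
    """
    p_sx, p_sy = velocity_components(player["s"], player["dir"])
    c_sx, c_sy = velocity_components(carrier["s"], carrier["dir"])
    rel_x = player["x"] - carrier["x"]
    rel_y = player["y"] - carrier["y"]
    return {
        "sX": p_sx,
        "sY": p_sy,
        "relX": rel_x,
        "relY": rel_y,
        "relSx": p_sx - c_sx,
        "relSy": p_sy - c_sy,
        "relDist": math.hypot(rel_x, rel_y),
    }


# Hypothetical positions and speeds used only for illustration.
defender = {"x": 13.0, "y": 24.0, "s": 2.0, "dir": 0.0}
ball_carrier = {"x": 10.0, "y": 20.0, "s": 0.0, "dir": 0.0}
features = relative_features(defender, ball_carrier)
print(features)
```

Repeating this for every offensive and defensive player pair is what produces the per-play array of derived features.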

ML methodology and model training

Because we’re interested in all possible outcomes from the play, including the probability of a touchdown, we can’t simply predict the average yards gained as a regression problem. We need to predict the full probability distribution of all possible yard gains, so we framed the problem as a probabilistic prediction.

One way to implement probabilistic predictions is to assign the yards gained to several bins (such as less than 0, from 0–1, from 1–2, …, from 14–15, more than 15) and predict the bin as a classification problem. The downside of this approach is that we want small bins to have a high definition picture of the distribution, but small bins mean fewer data points per bin and our distribution, especially the tails, may be poorly estimated and irregular.

Another way to implement probabilistic predictions is to model the output as a continuous probability distribution with a limited number of parameters (for example, a Gaussian or Gamma distribution) and predict the parameters. This approach gives a very high definition and regular picture of the distribution, but is too rigid to fit the true distribution of yards gained, which is multi-modal and heavy tailed.

To get the best of both methods, we use Spliced Binned-Pareto distribution (SBP), which has bins for the center of the distribution where a lot of data is available, and Generalized Pareto distribution (GPD) at both ends, where rare but important events can happen, like a touchdown. The GPD has two parameters: one for scale and one for tail heaviness, as seen in the following graph (source: Wikipedia).

By splicing the GPD with the binned distribution (see the following left graph) on both sides, we obtain the following SBP on the right. The lower and upper thresholds where splicing is done are hyperparameters.

Binned and SBP distributions
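
To illustrate the role of the two GPD parameters, the following sketch evaluates the GPD survival function and uses it for the upper tail of a spliced distribution. It is a simplified illustration, not the GluonTS implementation; the function names and the `tail_mass` parameter (the probability mass above the upper threshold) are assumptions introduced here.

```python
import math


def gpd_survival(excess, scale, shape):
    """P(X > excess) for a generalized Pareto distribution located at 0."""
    if excess <= 0:
        return 1.0
    if shape == 0:
        # The shape-zero case reduces to an exponential tail.
        return math.exp(-excess / scale)
    base = 1.0 + shape * excess / scale
    if base <= 0:
        # Beyond the finite upper end point that exists when shape < 0.
        return 0.0
    return base ** (-1.0 / shape)


def upper_tail_probability(y, threshold, tail_mass, scale, shape):
    """P(Y > y) for y above the upper splice threshold."""
    return tail_mass * gpd_survival(y - threshold, scale, shape)


# A larger shape parameter gives a heavier tail for the same scale.
print(gpd_survival(20.0, scale=2.0, shape=0.0))
print(gpd_survival(20.0, scale=2.0, shape=0.5))
```

With the same scale, the heavier-tailed setting assigns far more probability to a 20-yard excess, which is the behavior needed to represent rare long returns.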

As a baseline, we used the model that won our NFL Big Data Bowl competition on Kaggle. This model uses CNN layers to extract features from the prepared data, and predicts the outcome as a “1 yard per bin” classification problem. For our model, we kept the feature extraction layers from the baseline and only modified the last layer to output SBP parameters instead of probabilities for each bin, as shown in the following figure (image edited from the post 1st place solution The Zoo).

Model Architecture

We used the SBP distribution provided by GluonTS. GluonTS is a Python package for probabilistic time series modeling, but the SBP distribution is not specific to time series, and we were able to repurpose it for regression. For more information on how to use GluonTS SBP, see the following demo notebook.

Models were trained and cross-validated on the 2018, 2019, and 2020 seasons and tested on the 2021 season. To avoid leakage during cross-validation, we grouped all plays from the same game into the same fold.
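
The grouping of plays by game can be sketched as follows. This is a minimal illustration with a simple round-robin assignment of games to folds; the game identifiers are hypothetical and the actual fold assignment used in the project isn’t specified.

```python
def assign_group_folds(game_ids, n_folds):
    """Map each play to a fold so that all plays of a game share a fold."""
    unique_games = sorted(set(game_ids))
    game_to_fold = {game: i % n_folds for i, game in enumerate(unique_games)}
    return [game_to_fold[game] for game in game_ids]


# Hypothetical game ID for each play, used only for illustration.
play_game_ids = ["g1", "g1", "g2", "g3", "g3", "g3", "g4"]
folds = assign_group_folds(play_game_ids, n_folds=2)
print(folds)
```

Because the fold is looked up by game rather than by play, no game can contribute plays to both the training and validation sides of a split.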

For evaluation, we kept the metric used in the Kaggle competition, the continuous ranked probability score (CRPS), which can be seen as an alternative to the log-likelihood that is more robust to outliers. We also used the Pearson correlation coefficient and the RMSE as general and interpretable accuracy metrics. Furthermore, we looked at the probability of a touchdown and probability plots to evaluate calibration.

The model was trained on the CRPS loss using Stochastic Weight Averaging and early stopping.
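
For a distribution defined over discrete one-yard bins, the CRPS of a single play can be sketched as the sum of squared differences between the predicted cumulative distribution and the step function of the observed outcome. The following is an illustrative, unnormalized version; the normalization and the handling of the GPD tails used in the project may differ.

```python
def crps_discrete(probabilities, observed_index):
    """Sum over bins of (predicted CDF - observed step function) squared."""
    total = 0.0
    cdf = 0.0
    for index, probability in enumerate(probabilities):
        cdf += probability
        step = 1.0 if index >= observed_index else 0.0
        total += (cdf - step) ** 2
    return total


# All probability mass on bin 3.
point_forecast = [0.0, 0.0, 0.0, 1.0, 0.0, 0.0]
print(crps_discrete(point_forecast, observed_index=3))
print(crps_discrete(point_forecast, observed_index=4))
```

A forecast that puts all mass on the observed bin scores zero, and the score grows with the distance between the predicted mass and the outcome, so lower is better.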

To deal with the irregularity of the binned part of the output distributions, we used two techniques:

  • A smoothness penalty proportional to the squared difference between two consecutive bins
  • Ensembling models trained during cross-validation
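
The smoothness penalty from the first bullet can be sketched as follows. The function name and the exact scaling are assumptions made for illustration; in training, a term like this would be added to the loss.

```python
def smoothness_penalty(bin_probabilities, weight):
    """Weight times the sum of squared differences between consecutive bins."""
    differences = (
        later - earlier
        for earlier, later in zip(bin_probabilities, bin_probabilities[1:])
    )
    return weight * sum(difference ** 2 for difference in differences)


print(smoothness_penalty([0.25, 0.25, 0.25, 0.25], weight=5))
print(smoothness_penalty([0.0, 1.0, 0.0], weight=5))
```

A flat set of bins incurs no penalty, whereas an isolated spike is penalized on both of its sides.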

Model performance results

For each dataset, we performed a grid search over the following options:

  • Probabilistic models
    • Baseline was one probability per yard
    • SBP was one probability per yard in the center, GPD in the tails
  • Distribution smoothing
    • No smoothing (smoothness penalty = 0)
    • Smoothness penalty = 5
    • Smoothness penalty = 10
  • Training and inference procedure
    • 10-fold cross-validation and ensemble inference (k10)
    • Training on train and validation data for 10 epochs or 20 epochs

Then we looked at the metrics for the top five models sorted by CRPS (lower is better).

For kickoff data, the SBP model slightly outperforms the baseline in terms of CRPS, but more importantly, it estimates the touchdown probability better (the true probability is 0.80% in the test set). We see that the best models use 10-fold ensembling (k10) and no smoothness penalty, as shown in the following table.

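| Training | Model | Smoothness | CRPS | RMSE | CORR % | P(touchdown) % |
| --- | --- | --- | --- | --- | --- | --- |
| k10 | SBP | 0 | 4.071 | 9.641 | 47.15 | 0.78 |
| k10 | Baseline | 0 | 4.074 | 9.62 | 47.585 | 0.306 |
| k10 | Baseline | 5 | 4.075 | 9.626 | 47.43 | 0.274 |
| k10 | SBP | 5 | 4.079 | 9.656 | 46.977 | 0.682 |
| k10 | Baseline | 10 | 4.08 | 9.621 | 47.519 | 0.265 |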

The following plot of the observed frequencies and predicted probabilities indicates a good calibration of our best model, with an RMSE of 0.27 between the two distributions. Note the high-yardage occurrences (for example, 100 yards) in the tail of the true (blue) empirical distribution; the SBP captures their probabilities better than the baseline method.

Kickoff observed frequencies and predicted probability distribution

For punt data, the baseline outperforms the SBP, perhaps because the tails of extreme yardage have fewer realizations. It is therefore a better trade-off to capture the modes of the distribution in the 0–10 yard range. Contrary to the kickoff data, the best model uses a smoothness penalty. The following table summarizes our findings.

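| Training | Model | Smoothness | CRPS | RMSE | CORR % | P(touchdown) % |
| --- | --- | --- | --- | --- | --- | --- |
| k10 | Baseline | 5 | 3.961 | 8.313 | 35.227 | 0.547 |
| k10 | Baseline | 0 | 3.972 | 8.346 | 34.227 | 0.579 |
| k10 | Baseline | 10 | 3.978 | 8.351 | 34.079 | 0.555 |
| k10 | SBP | 5 | 3.981 | 8.342 | 34.971 | 0.723 |
| k10 | SBP | 0 | 3.991 | 8.378 | 33.437 | 0.677 |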

The following plot of observed frequencies (in blue) and predicted probabilities for the two best punt models indicates that the non-smoothed model (in orange) is slightly better calibrated than the smoothed model (in green) and may be a better choice overall.

Punt true and predicted probabilities

Conclusion

In this post, we showed how to build predictive models with fat-tailed data distribution. We used Spliced Binned-Pareto distribution, implemented in GluonTS, which can robustly model such fat-tailed distributions. We used this technique to build models for punt and kickoff returns. We can apply this solution to similar use cases where there are very few events in the data, but those events have significant impact on the overall performance of the models.

If you would like help with accelerating the use of ML in your products and services, please contact the Amazon ML Solutions Lab program.


About the Authors

Tesfagabir Meharizghi is a Data Scientist at the Amazon ML Solutions Lab where he helps AWS customers across various industries such as healthcare and life sciences, manufacturing, automotive, and sports and media, accelerate their use of machine learning and AWS cloud services to solve their business challenges.

Marc van Oudheusden is a Senior Data Scientist with the Amazon ML Solutions Lab team at Amazon Web Services. He works with AWS customers to solve business problems with artificial intelligence and machine learning. Outside of work you may find him at the beach, playing with his children, surfing or kitesurfing.

Panpan Xu is a Senior Applied Scientist and Manager with the Amazon ML Solutions Lab at AWS. She is working on research and development of Machine Learning algorithms for high-impact customer applications in a variety of industrial verticals to accelerate their AI and cloud adoption. Her research interest includes model interpretability, causal analysis, human-in-the-loop AI and interactive data visualization.

Kyeong Hoon (Jonathan) Jung is a senior software engineer at the National Football League. He has been with the Next Gen Stats team for the last seven years, helping to build out the platform, from streaming the raw data and building microservices to process it to building APIs that expose the processed data. He has collaborated with the Amazon Machine Learning Solutions Lab in providing clean data for them to work with as well as providing domain knowledge about the data itself. Outside of work, he enjoys cycling in Los Angeles and hiking in the Sierras.

Michael Chi is a Senior Director of Technology overseeing Next Gen Stats and Data Engineering at the National Football League. He has a degree in Mathematics and Computer Science from the University of Illinois at Urbana Champaign. Michael first joined the NFL in 2007 and has primarily focused on technology and platforms for football statistics. In his spare time, he enjoys spending time with his family outdoors.

Mike Band is a Senior Manager of Research and Analytics for Next Gen Stats at the National Football League. Since joining the team in 2018, he has been responsible for ideation, development, and communication of key stats and insights derived from player-tracking data for fans, NFL broadcast partners, and the 32 clubs alike. Mike brings a wealth of knowledge and experience to the team with a master’s degree in analytics from the University of Chicago, a bachelor’s degree in sport management from the University of Florida, and experience in both the scouting department of the Minnesota Vikings and the recruiting department of Florida Gator Football.

Analyze and visualize multi-camera events using Amazon SageMaker Studio Lab

The National Football League (NFL) is one of the most popular sports leagues in the United States and is the most valuable sports league in the world. The NFL, Biocore, and AWS are committed to advancing human understanding around the diagnosis, prevention, and treatment of sports-related injuries to make the game of football safer. More information regarding the NFL Player Health and Safety efforts is available on the NFL website.

The AWS Professional Services team has partnered with the NFL and Biocore to provide machine learning (ML)-based solutions for identifying helmet impacts from game footage using computer vision (CV) techniques. With multiple camera views available from each game, we have developed solutions to identify helmet impacts from each of these views and merge the helmet impact results.

The motivation behind utilizing multiple camera views comes from the limitation of information when the impact events are captured with only one view. With only one perspective, some players might occlude each other or be blocked by other objects on the field. Therefore, adding more perspectives allows our ML system to identify more impacts that aren’t visible in a single view. To showcase the results of our fusion process and how the team uses visualization tools to help evaluate the model performance, we have developed a codebase to visually overlay the multiple view detection results. This process helps identify the actual number of impacts individual players experience by removing duplicate impacts detected in multiple views.

In this post, we use the publicly available dataset from the NFL – Impact Detection Kaggle competition and show results for merging two views. The dataset includes helmet bounding boxes at every frame and impact labels found in each video. In particular, we focus on deduplicating and visualizing videos with the ID 57583_000082 in endzone and sideline views. You can download the endzone and sideline videos, and also the ground truth labels.

Prerequisites

The solution requires the following:

Get started on SageMaker Studio Lab and install the required packages

You can run the notebook from the GitHub repository or from SageMaker Studio Lab. In this post, we run the notebook from a SageMaker Studio Lab environment. We chose SageMaker Studio Lab because it is free and provides powerful CPU and GPU user sessions and 15 GB of persistent storage that automatically saves your environment, enabling you to pick up where you left off. To use SageMaker Studio Lab, request and set up a new account. After the account is approved, complete the following steps:

  1. Visit the aws-samples GitHub repo.
  2. In the README section, choose Open Studio Lab.

This redirects you to your SageMaker Studio Lab environment.

  3. Select your CPU compute type, then choose Start Runtime.
  4. After the runtime starts, choose Copy to Project, which opens a new window with the Jupyter Lab environment.

Now you’re ready to use the notebook!

  5. Open fuse_and_visualize_multiview_impacts.ipynb and follow the instructions in the notebook.

The first cell in the notebook installs the necessary Python packages such as pandas and OpenCV:

%pip install pandas
%pip install opencv-contrib-python-headless

Import all the necessary Python packages and set pandas options for better visualization experience:

import os
import cv2
import pandas as pd
import numpy as np
pd.set_option('mode.chained_assignment', None)

We use pandas for ingesting and parsing through the CSV file with the annotated helmet bounding boxes as well as impacts. We use NumPy mainly for manipulating arrays and matrices. We use OpenCV for reading, writing, and manipulating image data in Python.

Prepare the data by fusing results from two views

To fuse the two perspectives together, we use the train_labels.csv from the Kaggle competition as an example because it contains ground truth impacts from both the endzone and sideline. The following function takes the input dataset and outputs a fused dataframe that is deduplicated for all the plays in the input dataset:

def prep_data(df):
    df['game_play'] = df['gameKey'].astype('str') + '_' + df['playID'].astype('str').str.zfill(6)
    return df

def dedup_view(df, windows):
    # define view
    df = df.sort_values(by='frame')
    view_columns = ['frame', 'left', 'width', 'top', 'height', 'video']
    common_columns = ['game_play', 'label', 'view', 'impactType']
    label_cleaned = df[view_columns + common_columns]
    
    # rename columns
    sideline_column_rename = {col: 'Sideline_' + col for col in view_columns}
    endzone_column_rename = {col: 'Endzone_' + col for col in view_columns}
    sideline_columns = list(sideline_column_rename.values())

    # create two dataframes, one for sideline, one for endzone
    # (rename returns new frames, which avoids modifying a slice of label_cleaned)
    label_endzone = label_cleaned.query('view == "Endzone"').rename(columns=endzone_column_rename)
    label_sideline = label_cleaned.query('view == "Sideline"').rename(columns=sideline_column_rename)

    # add the duplicate flag to both views and empty sideline columns to the endzone rows
    label_sideline['is_dup'] = False
    for col in sideline_columns:
        label_endzone[col] = np.nan
    label_endzone['is_dup'] = False

    # iterate over endzone rows to find matching sideline impacts and mark duplicates
    for index, row in label_endzone.iterrows():
        player = row['label']
        frame = row['Endzone_frame']
        impact_type = row['impactType']
        sideline_row = label_sideline[(label_sideline['label'] == player) &
                                      ((label_sideline['Sideline_frame'] >= frame - windows // 2) &
                                       (label_sideline['Sideline_frame'] <= frame + windows // 2 + 1)) &
                                      (label_sideline['is_dup'] == False) &
                                      (label_sideline['impactType'] == impact_type)]

        if len(sideline_row) > 0:
            sideline_index = sideline_row.index[0]
            # use a single .loc call; chained assignment may silently write to a copy
            label_sideline.loc[sideline_index, 'is_dup'] = True

            for col in sideline_columns:
                label_endzone.loc[index, col] = sideline_row.iloc[0][col]
            label_endzone.loc[index, 'is_dup'] = True

    # keep unmatched sideline impacts plus all endzone impacts (matched ones carry the sideline columns)
    not_dup_sideline = label_sideline[label_sideline['is_dup'] == False]
    final_output = pd.concat([not_dup_sideline, label_endzone])
    return final_output

def fuse_df(raw_df, windows):
    outputs = []
    all_game_play = raw_df['game_play'].unique()
    for game_play in all_game_play:
        df = raw_df.query('game_play == @game_play')
        output = dedup_view(df, windows)
        outputs.append(output)

    output_df = pd.concat(outputs)
    output_df['gameKey'] = output_df['game_play'].apply(lambda x: x.split('_')[0]).map(int)
    output_df['playID'] = output_df['game_play'].apply(lambda x: x.split('_')[1]).map(int)
    return output_df

To run the function, we run the following code block to provide the location of the train_labels.csv data and then perform data preparation to add an additional column and extract only the impact rows. After running the function, we save the output to a dataframe variable called fused_df.

# read the annotated impact data from train_labels.csv
ground_truth = pd.read_csv('train_labels.csv')

# prepare game_play column using pipe(prep_data) function in pandas then filter the dataframe for just rows with impacts
ground_truth = ground_truth.pipe(prep_data).query('impact == 1')

# loop over all the unique game_plays and deduplicate the impact results from sideline and endzone
fused_df = fuse_df(ground_truth, windows=30)

The following screenshot shows the ground truth.

The following screenshot shows the fused dataframe examples.

Graph and video code

After we fuse the impact results, we use the generated fused_df to overlay the results onto our endzone and sideline videos and merge the two views together. We use the following function for this, and the inputs needed are the paths to the endzone video, sideline video, fused_df dataframe, and the final output path for the newly generated video. The functions used in this section are described in the markdown section of the notebook used in SageMaker Studio Lab.

def get_video_and_metadata(vid_path): 
    vid = cv2.VideoCapture(vid_path)
    total_frame_number = vid.get(cv2.CAP_PROP_FRAME_COUNT)
    width = int(vid.get(cv2.CAP_PROP_FRAME_WIDTH))
    height = int(vid.get(cv2.CAP_PROP_FRAME_HEIGHT))
    fps = vid.get(cv2.CAP_PROP_FPS)
    return vid, total_frame_number, width, height, fps

def overlay_impacts(frame, fused_df, game_key, play_id, frame_cnt, h1):
    # look for duplicates 
    # adjacent string literals are concatenated, so the query can span several lines
    duplicates = fused_df.query(f"gameKey == {int(game_key)} and "
                                f"playID == {int(play_id)} and "
                                "is_dup == True and "
                                "Sideline_frame == @frame_cnt")

    frame_has_impact = False 
    
    if len(duplicates) > 0: 
        for duplicate in duplicates.itertuples(index=False): 
            if frame_cnt == duplicate.Sideline_frame: 
                frame_has_impact = True 

            if frame_has_impact: 
                cv2.rectangle(frame, #frame to be edited 
                              (int(duplicate.Sideline_left), int(duplicate.Sideline_top)), #(x,y) of top left corner 
                              (int(duplicate.Sideline_left) + int(duplicate.Sideline_width), int(duplicate.Sideline_top) + int(duplicate.Sideline_height)), #(x,y) of bottom right corner 
                              (0,0,255), #RED boxes
                              thickness=3)

                cv2.rectangle(frame, #frame to be edited
                              (int(duplicate.Endzone_left), int(duplicate.Endzone_top)+ h1), #(x,y) of top left corner
                              (int(duplicate.Endzone_left) + int(duplicate.Endzone_width), int(duplicate.Endzone_top) + int(duplicate.Endzone_height) + h1), #(x,y) of bottom right corner
                              (0,0,255), #RED boxes
                              thickness=3)
 
                cv2.line(frame, #frame to be edited
                         (int(duplicate.Sideline_left), int(duplicate.Sideline_top)), #(x,y) of point 1 in a line
                         (int(duplicate.Endzone_left), int(duplicate.Endzone_top) + h1), #(x,y) of point 2 in a line
                         (255, 255, 255), # WHITE lines
                         thickness=4)
 
    else:
        # if no duplicates, look for sideline then endzone and add to the view
        sl_impacts = fused_df.query(f"gameKey == {int(game_key)} and "
                                    f"playID == {int(play_id)} and "
                                    "is_dup == False and "
                                    "view == 'Sideline' and "
                                    "Sideline_frame == @frame_cnt")

        if len(sl_impacts) > 0:
            for impact in sl_impacts.itertuples(index=False):
                if frame_cnt == impact.Sideline_frame:
                    frame_has_impact = True

                if frame_has_impact:
                    cv2.rectangle(frame, #frame to be edited
                                  (int(impact.Sideline_left), int(impact.Sideline_top)), #(x,y) of top left corner
                                  (int(impact.Sideline_left) + int(impact.Sideline_width), int(impact.Sideline_top) + int(impact.Sideline_height)), #(x,y) of bottom right corner
                                  (0, 255, 255), #YELLOW BOXES
                                  thickness=3)

        ez_impacts = fused_df.query(f"gameKey == {int(game_key)} and "
                                    f"playID == {int(play_id)} and "
                                    "is_dup == False and "
                                    "view == 'Endzone' and "
                                    "Endzone_frame == @frame_cnt")

        if len(ez_impacts) > 0:
            for impact in ez_impacts.itertuples(index=False):
                if frame_cnt == impact.Endzone_frame:
                    frame_has_impact = True

                if frame_has_impact:
                    cv2.rectangle(frame, #frame to be edited
                                  (int(impact.Endzone_left), int(impact.Endzone_top) + h1), #(x,y) of top left corner
                                  (int(impact.Endzone_left) + int(impact.Endzone_width), int(impact.Endzone_top) + int(impact.Endzone_height) + h1), #(x,y) of bottom right corner
                                  (0, 255, 255), #YELLOW BOXES
                                  thickness=3)

    return frame, frame_has_impact

def generate_impact_video(ez_vid_path:str,
                          sl_vid_path:str,
                          fused_df:pd.DataFrame,
                          output_path:str,
                          freeze_impacts=True):
    
    # define the video codec to be used for the output video
    VIDEO_CODEC = "MP4V"

    # parse game_key and play_id information from the name of the files
    game_key = os.path.basename(ez_vid_path).split('_')[0] # parse game_key
    play_id = os.path.basename(ez_vid_path).split('_')[1] # parse play_id
 
    # get metadata such as total frame number, width, height and frames per second (FPS) from endzone (ez) and sideline (sl) videos
    ez_vid, ez_total_frame_number, ez_width, ez_height, ez_fps = get_video_and_metadata(ez_vid_path)
    sl_vid, sl_total_frame_number, sl_width, sl_height, sl_fps = get_video_and_metadata(sl_vid_path)

    # define a video writer for the output video
    output_video = cv2.VideoWriter(output_path, #output file name
                                   cv2.VideoWriter_fourcc(*VIDEO_CODEC), #Video codec
                                   ez_fps, #frames per second in the output video
                                  (ez_width, ez_height+sl_height)) # frame size with stacking video vertically

    # find shorter video and use the total frame number from the shorter video for the output video
    total_frame_number = int(min(ez_total_frame_number, sl_total_frame_number))

    # iterate through each frame from endzone and sideline
    for frame_cnt in range(total_frame_number):
        frame_has_impact = False
        frame_near_impact = False

        # reading frames from both endzone and sideline
        ez_ret, ez_frame = ez_vid.read()
        sl_ret, sl_frame = sl_vid.read()

        # creating strings to be added to the output frames
        img_name = f"Game key: {game_key}, Play ID: {play_id}, Frame: {frame_cnt}"
        video_frame = f'{game_key}_{play_id}_{frame_cnt}'

        if ez_ret and sl_ret:
            h, w, c = ez_frame.shape
            h1, w1, c1 = sl_frame.shape

            if h != h1 or w != w1: # resize the endzone frame to match the sideline frame
                ez_frame = cv2.resize(ez_frame, (w1, h1))
                h, w = h1, w1 # keep the endzone dimensions in sync after resizing
 
            frame = np.concatenate((sl_frame, ez_frame), axis=0) # stack the frames vertically

            frame, frame_has_impact = overlay_impacts(frame, fused_df, game_key, play_id, frame_cnt, h1)

            cv2.putText(frame, #image frame to be modified
                        img_name, #string to be inserted
                        (30, 30), #(x,y) location of the string
                        cv2.FONT_HERSHEY_SIMPLEX, #font
                        1, #scale
                        (255, 255, 255), #WHITE letters
                        thickness=2)

            cv2.putText(frame, #image frame to be modified
                        str(frame_cnt), #frame count string to be inserted
                        (w1-75, h1-20), #(x,y) location of the string in the top view
                        cv2.FONT_HERSHEY_SIMPLEX, #font
                        1, #scale
                        (255, 255, 255), # WHITE letters
                        thickness=2)

            cv2.putText(frame, #image frame to be modified
                        str(frame_cnt), #frame count string to be inserted
                        (w1-75, h1+h-20), #(x,y) location of the string in the bottom view
                        cv2.FONT_HERSHEY_SIMPLEX, #font
                        1, #scale
                        (255, 255, 255), # WHITE letters
                        thickness=2)

            output_video.write(frame)

            # Freeze for 60 frames on impacts
            if frame_has_impact and freeze_impacts:
                for _ in range(60):
                    output_video.write(frame)
        else:
            break

    output_video.release()
    return

To run these functions, we can provide an input as shown in the following code, which generates a video called output.mp4:

generate_impact_video('57583_000082_Endzone.mp4',
                      '57583_000082_Sideline.mp4',
                      fused_df,
                      'output.mp4')

This generates a video as shown in the following example, where the red bounding boxes mark impacts found in both the endzone and sideline views, and the yellow bounding boxes mark impacts found in only one of the two views.

Conclusion

In this post, we demonstrated how the NFL, Biocore, and the AWS ProServe teams are working together to improve impact detection by fusing results from multiple views. This allows the teams to debug and visualize how the model is performing qualitatively. This process can easily be scaled up to three or more views; in our projects, we have utilized up to seven different views. Detecting helmet impacts by watching videos from only one view can be difficult due to view obstruction, but detecting impacts from multiple views and fusing the results allows us to improve our model performance.

To experiment with this solution, visit the aws-samples GitHub repo and refer to the fuse_and_visualize_multiview_impacts.ipynb notebook. Similar techniques can also be applied to other industries such as manufacturing, retail, and security, where having multiple views would benefit the ML system to better identify targets with a more comprehensive view.

For more information regarding NFL Player Health and Safety, visit the NFL website and NFL Explained: Innovation in Player Health & Safety.


About the authors

Chris Boomhower is a Machine Learning Engineer at AWS Professional Services. Chris has over 6 years of experience developing supervised and unsupervised Machine Learning solutions across various industries. Today, he spends most of his time helping customers in the sports, healthcare, and agriculture industries design and build scalable, end-to-end Machine Learning solutions.

Ben Fenker is a Senior Data Scientist in AWS Professional Services and has helped customers build and deploy ML solutions in industries ranging from sports to healthcare to manufacturing. He has a Ph.D. in physics from Texas A&M University and 6 years of industry experience. Ben enjoys baseball, reading, and raising his kids.

Sam Huddleston is a Principal Data Scientist at Biocore LLC, who serves as the Technology Lead for the NFL’s Digital Athlete program. Biocore is a team of world-class engineers based in Charlottesville, Virginia, that provides research, testing, biomechanics expertise, modeling and other engineering services to clients dedicated to the understanding and reduction of injury.

Jarvis Lee is a Senior Data Scientist with AWS Professional Services. He has been with AWS for over five years, working with customers on machine learning and computer vision problems. Outside of work, he enjoys riding bicycles.

Tyler Mullenbach is the Global Practice Lead for ML with AWS Professional Services. He is responsible for driving the strategic direction of ML for Professional Services and ensuring that customers realize transformative business achievements through the adoption of ML technologies.

Kevin Song is a Data Scientist at AWS Professional Services. He holds a PhD in Biophysics and has over 5 years of industry experience in building computer vision and machine learning solutions.

Betty Zhang is a data scientist with 10 years of experience in data and technology. Her passion is to build innovative machine learning solutions to drive transformational changes for companies. In her spare time, she enjoys traveling, reading and learning about new technologies.


How to decide between Amazon Rekognition image and video API for video moderation


Almost 80% of today’s web content is user-generated, creating a deluge of content that organizations struggle to analyze with human-only processes. The availability of consumer information helps them make decisions, from buying a new pair of jeans to securing home loans. In a recent survey, 79% of consumers stated they rely on user videos, comments, and reviews more than ever and 78% of them said that brands are responsible for moderating such content. 40% said that they would disengage with a brand after a single exposure to toxic content.

Amazon Rekognition has two sets of APIs that help you moderate images or videos to keep digital communities safe and engaged.

One approach to moderating videos is to model video data as a sample of image frames and use image content moderation models to process the frames individually. This approach allows the reuse of image-based models. Some customers have asked if they could use this approach to moderate videos by sampling image frames and sending them to the Amazon Rekognition image moderation API. They are curious about how this solution compares with the Amazon Rekognition video moderation API.

We recommend using the Amazon Rekognition video moderation API to moderate video content. It’s designed and optimized for video moderation, offering better performance and lower costs. However, there are specific use cases where the image API solution is optimal.

This post compares the two video moderation solutions in terms of accuracy, cost, performance, and architecture complexity to help you choose the best solution for your use case.

Moderate videos using the video moderation API

The Amazon Rekognition video content moderation API is the standard solution used to detect inappropriate or unwanted content in videos. It runs as an asynchronous operation on video content stored in an Amazon Simple Storage Service (Amazon S3) bucket. The analysis results are returned as an array of moderation labels, each with a confidence score and a timestamp indicating when the label was detected.

The video content moderation API uses the same machine learning (ML) model as the image moderation API. The output is filtered for noisy false positive results. The workflow is optimized for latency by parallelizing operations like decode, frame extraction, and inference.

The following diagram shows the logical steps of how to use the Amazon Rekognition video moderation API to moderate videos.

Rekognition Content Moderation Video API diagram

The steps are as follows:

  1. Upload videos to an S3 bucket.
  2. Call the video moderation API in an AWS Lambda function (or customized script on premises) with the video file location as a parameter. The API manages the heavy lifting of video decoding, sampling, and inference. You can either implement heartbeat logic to check the moderation job status until it completes, or use Amazon Simple Notification Service (Amazon SNS) to implement an event-driven pattern. For detailed examples of the video moderation API, refer to the following Jupyter notebook.
  3. Store the moderation result as a file in an S3 bucket or database.
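
The heartbeat variant of step 2 can be sketched roughly as follows. This is a minimal illustration rather than the notebook's implementation: it assumes `client` is a `boto3.client("rekognition")` object, and the function name `moderate_video`, the bucket and key arguments, and the polling interval are placeholders chosen for this example.

```python
import time


def moderate_video(client, bucket, key, min_confidence=50.0, poll_seconds=5.0):
    """Start a moderation job for a video in S3 and poll until it finishes.

    `client` is assumed to be boto3.client("rekognition"); `bucket` and `key`
    are placeholders for your own S3 location.
    """
    job = client.start_content_moderation(
        Video={"S3Object": {"Bucket": bucket, "Name": key}},
        MinConfidence=min_confidence,
    )
    job_id = job["JobId"]

    # Heartbeat loop: check the job status until it leaves IN_PROGRESS.
    while True:
        page = client.get_content_moderation(JobId=job_id)
        if page["JobStatus"] != "IN_PROGRESS":
            break
        time.sleep(poll_seconds)

    if page["JobStatus"] != "SUCCEEDED":
        raise RuntimeError(f"Moderation job {job_id} ended as {page['JobStatus']}")

    # Collect (timestamp in ms, label name, confidence), following pagination.
    results = []
    while True:
        for item in page["ModerationLabels"]:
            label = item["ModerationLabel"]
            results.append((item["Timestamp"], label["Name"], label["Confidence"]))
        token = page.get("NextToken")
        if not token:
            return results
        page = client.get_content_moderation(JobId=job_id, NextToken=token)
```

For production workloads, the event-driven pattern with Amazon SNS avoids holding a Lambda function open while the job runs; the polling loop above is mainly convenient for scripts and experiments.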

Moderate videos using the image moderation API

Instead of using the video content moderation API, some customers choose to independently sample frames from videos and detect inappropriate content by sending the images to the Amazon Rekognition DetectModerationLabels API. Image results are returned in real time with labels for inappropriate or offensive content along with a confidence score.

The following diagram shows the logical steps of the image API solution.

Rekognition Content Moderation Video Image Sampling Diagram

The steps are as follows:

1. Use a customized application or script as an orchestrator, starting with loading the video to the local file system.
2. Decode the video.
3. Sample image frames from the video at a chosen interval, such as two frames per second. Then iterate through all the images to:

3.a. Send each image frame to the image moderation API.
3.b. Store the moderation results in a file or database.
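
The sampling loop in steps 2 and 3 might look like the sketch below. It is an illustration under stated assumptions, not the code used in the evaluation: `client` is assumed to be a `boto3.client("rekognition")` object, OpenCV is assumed for decoding, and the helper names (`sampled_frame_indices`, `iter_jpeg_frames`, `moderate_frames`) are invented for this example.

```python
def sampled_frame_indices(total_frames, fps, samples_per_second=2.0):
    """Return the frame indices to sample at roughly `samples_per_second`."""
    step = max(fps / samples_per_second, 1.0)
    indices = []
    position = 0.0
    while int(position) < total_frames:
        indices.append(int(position))
        position += step
    return indices


def iter_jpeg_frames(video_path, samples_per_second=2.0):
    """Decode a local video and yield (timestamp_ms, jpeg_bytes) per sample."""
    import cv2  # imported lazily so the other helpers work without OpenCV

    capture = cv2.VideoCapture(video_path)
    fps = capture.get(cv2.CAP_PROP_FPS)
    total = int(capture.get(cv2.CAP_PROP_FRAME_COUNT))
    try:
        for index in sampled_frame_indices(total, fps, samples_per_second):
            capture.set(cv2.CAP_PROP_POS_FRAMES, index)
            ok, frame = capture.read()
            if not ok:
                break
            encoded_ok, buffer = cv2.imencode(".jpg", frame)
            if encoded_ok:
                yield int(1000 * index / fps), buffer.tobytes()
    finally:
        capture.release()


def moderate_frames(client, frames, min_confidence=50.0):
    """Send each sampled frame to the image moderation API and collect labels."""
    results = []
    for timestamp_ms, image_bytes in frames:
        response = client.detect_moderation_labels(
            Image={"Bytes": image_bytes}, MinConfidence=min_confidence
        )
        for label in response["ModerationLabels"]:
            results.append((timestamp_ms, label["Name"], label["Confidence"]))
    return results
```

A call such as `moderate_frames(client, iter_jpeg_frames("video.mp4"))` ties the pieces together, where `video.mp4` stands in for a file already downloaded to the local file system.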

Compared with the video API solution, which requires only a light Lambda function to orchestrate API calls, the image sampling solution is CPU intensive and requires more compute resources. You can host the application using AWS services such as Lambda, Amazon Elastic Container Service (Amazon ECS), Amazon Elastic Kubernetes Service (Amazon EKS), AWS Fargate, or Amazon Elastic Compute Cloud (Amazon EC2).

Evaluation dataset

To evaluate both solutions, we use a sample dataset consisting of 200 short-form videos. The videos range from 10 seconds to 45 minutes. 60% of the videos are less than 2 minutes long. This sample dataset is used to test the performance, cost, and accuracy metrics for both solutions. The results compare the Amazon Rekognition image API sampling solution to the video API solution.

To test the image API solution, we use open-source libraries (ffmpeg and OpenCV) to sample images at a rate of two frames per second (one frame every 500 milliseconds). This rate mimics the sampling frequency used by the video content moderation API. Each image is sent to the image content moderation API to generate labels.

To test the video API solution, we send the videos directly to the video content moderation API to generate labels.

Results summary

We focus on the following key results:

  • Accuracy – Both solutions offer similar accuracy (false positive and false negative percentages) using the same sampling frequency of two frames per second
  • Cost – The image API sampling solution is more expensive than the video API solution using the same sampling frequency of two frames per second
    • The image API sampling solution cost can be reduced by sampling fewer frames per second
  • Performance – On average, the video API has a 425% faster processing time than the image API solution for the sample dataset
    • The image API solution performs better in situations with a high frame sample interval and on videos shorter than 90 seconds
  • Architecture complexity – The video API solution has a low architecture complexity, whereas the image API sampling solution has a medium architecture complexity

Accuracy

We tested both solutions using the sample set and the same sampling frequency of two frames per second. The results demonstrated that both solutions provide a similar false positive and true positive ratio. This result is expected because under the hood, Amazon Rekognition uses the same ML model for both the video and image moderation APIs.

To learn more about metrics for evaluating content moderation, refer to Metrics for evaluating content moderation in Amazon Rekognition and other content moderation services.

Cost

The cost analysis demonstrates that the image API solution is more expensive than the video API solution if you use the same sampling frequency of two frames per second. The image API solution can be more cost effective if you reduce the number of frames sampled per second.

The two primary factors that impact the cost of a content moderation solution are the Amazon Rekognition API costs and compute costs. The default pricing is $0.10 per minute for the video content moderation API and $0.001 per image for the image content moderation API. A 60-second video produces 120 frames at a rate of two frames per second. The video API therefore costs $0.10 to moderate a 60-second video, whereas the image API costs $0.12 (120 frames × $0.001).

The price calculation is based on the official price in Region us-east-1 at the time of writing this post. For more information, refer to Amazon Rekognition pricing.

The cost analysis looks at the total cost to generate content moderation labels for the 200 videos in the sample set. The calculations are based on us-east-1 pricing. If you’re using another Region, modify the parameters with the pricing for that Region. The 200 videos contain 4271.39 minutes of content and generate 512,567 image frames at a sampling rate of two frames per second.
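
The Amazon Rekognition portion of these costs follows directly from the per-minute and per-image prices quoted above. The short sketch below reproduces that arithmetic for the sample set; the function names are illustrative, and the prices are the us-east-1 figures from this post, so adjust them for your Region.

```python
VIDEO_PRICE_PER_MINUTE = 0.10   # video moderation API price quoted in this post
IMAGE_PRICE_PER_IMAGE = 0.001   # image moderation API price quoted in this post


def video_api_cost(minutes):
    """Rekognition cost of moderating `minutes` of video with the video API."""
    return minutes * VIDEO_PRICE_PER_MINUTE


def image_api_cost(minutes, frames_per_second):
    """Rekognition cost of sampling frames and calling the image API."""
    frames = minutes * 60 * frames_per_second
    return frames * IMAGE_PRICE_PER_IMAGE


def break_even_frames_per_second():
    """Sampling rate at which both APIs cost the same per minute of video."""
    return VIDEO_PRICE_PER_MINUTE / (60 * IMAGE_PRICE_PER_IMAGE)


sample_minutes = 4271.39  # total duration of the 200 sample videos
print(round(video_api_cost(sample_minutes), 2))        # 427.14
print(round(image_api_cost(sample_minutes, 2), 2))     # 512.57
print(round(image_api_cost(sample_minutes, 1), 2))     # 256.28
print(round(break_even_frames_per_second(), 2))        # 1.67
```

Under these prices, the image API becomes cheaper on Rekognition charges alone once you sample fewer than about 1.67 frames per second, before compute costs are added.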

This comparison doesn’t consider other costs, such as Amazon S3 storage. We use Lambda as an example to calculate the AWS compute cost. Compute costs take into account the number of requests to Lambda and AWS Step Functions to run the analysis. The Lambda memory/CPU setting is estimated based on the Amazon EC2 specifications. This cost estimate uses a 4 GB, 2-second Lambda request per image API call. Lambda functions have a maximum invocation timeout limit of 15 minutes. For longer videos, you may need to implement iteration logic using Step Functions to reduce the number of frames processed per Lambda call. The actual Lambda settings and cost patterns may differ depending on your requirements. We recommend testing the solution end to end for a more accurate cost estimation.

The following table summarizes the costs.

| Type | Amazon Rekognition Costs | Compute Costs | Total Cost |
| --- | --- | --- | --- |
| Video API Solution | $427.14 | $0 (Free tier) | $427.14 |
| Image API Solution: Two frames per second | $512.57 | $164.23 | $676.80 |
| Image API Solution: One frame per second | $256.28 | $82.12 | $338.40 |
Performance

On average, the video API solution processes videos about four times faster than the image API solution. The image API solution performs better in situations with a high frame sample interval and on videos shorter than 90 seconds.

This analysis measures performance as the average processing time in seconds per video. It looks at the total and average time to generate content moderation labels for the 200 videos in the sample set. The processing time is measured from the video upload to the result output and includes each step in the image sampling and video API process.

The video API solution has an average processing time of 35.2 seconds per video for the sample set. By comparison, the image API solution has an average processing time of 156.24 seconds per video. On average, the video API performs about four times faster than the image API solution. The following table summarizes these findings.

| Type | Average Processing Time (All Videos) | Average Processing Time (Videos Under 1.5 Minutes) |
| --- | --- | --- |
| Video API Solution | 35.2 seconds | 24.05 seconds |
| Image API Solution: Two frames per second | 156.24 seconds | 8.45 seconds |
| Difference | 425% | -185% |

The image API performs better than the video API when the video is shorter than 90 seconds. This is because the video API manages tasks through a queue, which adds lead time. The image API can also perform better if you use a lower sampling frequency. Increasing the frame interval to over 5 seconds can decrease the processing time by 6–10 times. Note that increasing the interval introduces the risk of missing inappropriate content between frame samples.

Architecture complexity

The video API solution has a low architecture complexity. You can set up a serverless pipeline or run a script to retrieve content moderation results. Amazon Rekognition manages the heavy computing and inference. The application orchestrating the Amazon Rekognition APIs can be hosted on a light machine.

The image API solution has a medium architecture complexity. The application logic has to orchestrate additional steps to store videos on the local drive, run image processing to capture frames, and call the image API. The server hosting the application requires higher computing capacity to support the local image processing. For the evaluation, we launched an EC2 instance with 4 vCPUs and 8 GB of RAM to support two parallel threads. Higher compute requirements may lead to additional operational overhead.

Optimal use cases for the image API solution

The image API solution is ideal for three specific use cases when processing videos.

The first is real-time video streaming. You can capture image frames from a live video stream and send the images to the image moderation API.

The second use case is content moderation with a low frame sampling rate requirement. The image API solution is more cost-effective and performant if you sample frames at a low frequency. It’s important to note that there will be a trade-off between cost and accuracy. Sampling frames at a lower rate may increase the risk of missing frames with inappropriate content.

The third use case is for the early detection of inappropriate content in video. The image API solution is flexible and allows you to stop processing and flag the video early on, saving cost and time.
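
Early detection can be sketched as a loop that stops at the first flagged frame. This is only an illustration: `client` is assumed to be a `boto3.client("rekognition")` object, `frames` is assumed to be an iterable of `(timestamp_ms, image_bytes)` pairs produced by your own frame sampler, and the function name and confidence threshold are arbitrary choices for this example.

```python
def first_flagged_frame(client, frames, min_confidence=80.0):
    """Return (timestamp_ms, label names) for the first flagged frame, or None.

    Frames after the first hit are never sent to the API, which is where
    the cost and time savings come from.
    """
    for timestamp_ms, image_bytes in frames:
        response = client.detect_moderation_labels(
            Image={"Bytes": image_bytes}, MinConfidence=min_confidence
        )
        labels = response["ModerationLabels"]
        if labels:
            return timestamp_ms, [label["Name"] for label in labels]
    return None
```

Because `frames` can be a generator, decoding also stops as soon as the function returns, so the remainder of the video is never processed.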

Conclusion

The video moderation API is ideal for most video moderation use cases. It’s more cost effective and performant than the image API solution when you sample frames at a frequency such as two frames per second. Additionally, it has a low architectural complexity and reduced operational overhead requirements.

The following table summarizes our findings to help you maximize the use of the Amazon Rekognition image and video APIs for your specific video moderation use cases. Although these results are averages achieved during testing and by some of our customers, they should give you ideas to balance the use of each API.

| | Video API Solution | Image API Solution |
| --- | --- | --- |
| Accuracy | Same accuracy | Same accuracy |
| Cost | Lower cost using the default image sampling interval | Lower cost if you reduce the number of frames sampled per second (sacrifice accuracy) |
| Performance | Faster for videos longer than 90 seconds | Faster for videos shorter than 90 seconds |
| Architecture Complexity | Low complexity | Medium complexity |

Amazon Rekognition content moderation can not only help your business protect and keep customers safe and engaged, but also contribute to your ongoing efforts to maximize the return on your content moderation investment. Learn more about Content Moderation on AWS and our Content Moderation ML use cases.


About the authors

Lana Zhang is a Sr. Solutions Architect at the AWS WWSO AI Services team, with expertise in AI and ML for content moderation and computer vision. She is passionate about promoting AWS AI services and helping customers transform their business solutions.

Brigit Brown is a Solutions Architect at Amazon Web Services. Brigit is passionate about helping customers find innovative solutions to complex business challenges using machine learning and artificial intelligence. Her core areas of depth are natural language processing and content moderation.
