Have you ever thought about how artificial intelligence could be used to detect events during live sports broadcasts? With machine learning (ML) techniques, we introduce a scalable multimodal solution for event detection on sports video data. Recent developments in deep learning show that event detection algorithms are performing well on sports data [1]; however, they’re dependent upon the quality and amount of data used in model development. This post explains a deep learning-based approach developed by the Amazon Machine Learning Solutions Lab for sports event detection using Amazon SageMaker. This approach minimizes the impact of low-quality data in terms of labeling and image quality while improving the performance of event detection. Our solution uses a multimodal architecture utilizing video, static images, audio, and optical flow data to develop and fine-tune a model, followed by boosting and a postprocessing algorithm.
We used sports video data that included static 2D images and frames over time and audio data, which enabled us to train separate models in parallel. The outlined approach also enhances the performance of event detection by consolidating the models’ outcomes into one decision-maker using a boosting technique.
In this post, we first give an overview of the data. We then explain the preprocessing workflow, modeling strategy, postprocessing, and present the results.
In this exploratory research study, we used the Sports-1 Million dataset [2], which includes 400 classes of short video clips of sports. The videos include the audio channel, enabling us to extract audio samples for multimodal model development. Among the sports in the dataset, we selected the most frequently occurring sports based on their number of data samples, resulting in 89 sports.
We then consolidated the sports in similar categories, resulting in 25 overall classes. The final list of selected sports for modeling is:
['americanfootball', 'athletics', 'badminton', 'baseball', 'basketball', 'bowling', 'boxing', 'cricket', 'cycling', 'fieldhockey', 'football', 'formula1', 'golf', 'gymnastics', 'handball', 'icehockey', 'lacrosse', 'rugby', 'skiing', 'soccer', 'swimming', 'tabletennis', 'tennis', 'volleyball', 'wrestling']
The following graph shows the number of video samples per sports category. Each video is cut into 1-second intervals.
Data processing pipeline
The temporal modeling in this solution uses video clips with 1-second-long durations. Therefore, we first extracted 1-second length video clips from each data example. The average length of videos in the dataset is around 20 seconds, resulting in approximately 190,000 1-second video clips. We passed each second-level video clip through a frame extraction pipeline and, depending on the frames per second (fps) of the video clip, extracted the corresponding number of frames, and stored them in an Amazon Simple Storage Service (Amazon S3) bucket. The total number of frames extracted was around 3.8 million. We performed multi-processing on a SageMaker notebook using an Amazon Elastic Compute Cloud (Amazon EC2) ml.c5.large instance with 64 cores to parallelize the I/O heavy clip-extraction process. Parallelization reduced the clip extraction from hours to minutes.
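As a rough illustration of the clip and frame bookkeeping described above, the following sketch computes the 1-second windows for a video and the number of frames they yield. The function names are hypothetical, and dropping a trailing partial second is an assumption; the post doesn't say how partial seconds were handled. The figures in the post imply about 20 frames per clip (3.8 million frames over 190,000 clips).

```python
from typing import List, Tuple


def one_second_windows(duration_s: float) -> List[Tuple[int, int]]:
    """Return (start, end) second boundaries for each full 1-second clip."""
    full_seconds = int(duration_s)  # assumption: a trailing partial second is dropped
    return [(start, start + 1) for start in range(full_seconds)]


def expected_frame_count(duration_s: float, fps: float) -> int:
    """Frames extracted across all 1-second clips of one video."""
    return len(one_second_windows(duration_s)) * round(fps)


windows = one_second_windows(20.4)
print(len(windows), windows[:2])       # clip count and the first two windows
print(expected_frame_count(20.4, 20))  # frames for a ~20-second video at 20 fps
```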
To train the ML algorithms, we split the data using stratified sampling on the original clips, which prevented potential information leakage down the pipeline. In a classification setting, stratifying helps ensure that the training, validation, and test sets have approximately the same percentage of samples of each target class as the complete set. We split the data into 80/10/10 portions for training, validation, and test sets, respectively. We then reflected this splitting pattern on the 1-second video clips level and the corresponding extracted frames level.
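The splitting idea can be sketched with the standard library alone. This is a minimal illustration, not the code used in the study: the `video_labels` input format, the function name, and the toy data are assumptions. The key point is that the split is decided per original video, and every 1-second clip or frame inherits the split of its parent video, so no video contributes to more than one set.

```python
import random
from collections import defaultdict


def stratified_split(video_labels, fractions=(0.8, 0.1, 0.1), seed=0):
    """Split video IDs into train/val/test so each class keeps similar proportions.

    video_labels: dict mapping a video ID to its class label (hypothetical format).
    """
    by_class = defaultdict(list)
    for video_id, label in video_labels.items():
        by_class[label].append(video_id)

    rng = random.Random(seed)
    splits = {"train": [], "val": [], "test": []}
    for label in sorted(by_class):
        ids = sorted(by_class[label])
        rng.shuffle(ids)
        n_train = int(len(ids) * fractions[0])
        n_val = int(len(ids) * fractions[1])
        splits["train"] += ids[:n_train]
        splits["val"] += ids[n_train:n_train + n_val]
        splits["test"] += ids[n_train + n_val:]  # remainder goes to the test set
    return splits


# Toy data: 20 videos for each of two classes.
videos = {f"{sport}_{i}": sport for sport in ("golf", "tennis") for i in range(20)}
splits = stratified_split(videos)

# Every 1-second clip or frame inherits the split of its parent video.
split_of_video = {vid: name for name, ids in splits.items() for vid in ids}
print({name: len(ids) for name, ids in splits.items()})
```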
Next, we fine-tuned the ResNet50 architecture using the extracted frames. Additionally, we trained a ResNet50 architecture using dense optical flow features extracted from the frames for each 1-second clip. Finally, we extracted audio features from 1-second clips and implemented an audio model. Each approach represented a modality in the final multimodal technique. The following diagram illustrates the architecture of the data processing and pipeline.
The rest of this section details each modality.
Computer vision
We used two separate computer vision-based approaches to fit the data. First, we used the ResNet50 architecture to fine-tune the multi-class classification algorithm using RGB frames. Second, we used the ResNet50 architecture with the same fine-tuning strategy against optical flow frames. ResNet50 is a widely used image classifier that has been applied successfully in many business applications.
We used a two-step fine-tuning approach: we first unfroze the last layer, added two flattened layers to the network, and fine-tuned the results for 10 epochs; we then saved the weights of this model, unfroze all the layers, and trained the entire network on the preceding sports data for 30 epochs. We used TensorFlow with Horovod for training on AWS Deep Learning AMI (DLAMI) instances. You can also use SageMaker Pipe mode to set up Horovod.
Horovod, an open-source framework for distributed deep learning, is available for use with most popular deep learning toolkits, like TensorFlow, Keras, PyTorch, and Apache MXNet. It uses the all-reduce algorithm for fast distributed training rather than using a parameter server approach, and it includes multiple optimization methods to make distributed training faster.
Since completing this project, SageMaker has introduced a new data parallelism library optimized for AWS, which allows you to use existing Horovod APIs. For more information, see New – Managed Data Parallelism in Amazon SageMaker Simplifies Training on Large Datasets.
Optical flow
For the second modality, we used an optical flow approach. Applying a classifier such as ResNet50 to image data addresses only the relationships of objects within the same frame and disregards time information. A model trained this way assumes that frames are independent and unrelated.
To capture the relationships between consecutive frames, such as for recognizing human actions, we can use optical flow. Optical flow is the motion of objects between consecutive frames of a sequence, caused by the relative movement between the object and the camera. We performed a dense optical flow algorithm on the images extracted from each 1-second video. We used OpenCV’s implementation of Gunnar Farnebäck’s algorithm, which is explained in Farnebäck’s 2003 article “Two-Frame Motion Estimation Based on Polynomial Expansion” [3].
Audio event detection
ML-based audio modeling formed the third stream of our multimodal event detection solution, where audio samples were extracted from 1-second videos, resulting in audio segments in M4A format.
To explore the performance of audio models, two types of features broadly used in digital signal processing were extracted from the audio samples: Mel Spectrogram (MelSpec) and Mel-Frequency Cepstrum coefficient (MFCC). A modified version of MobileNet, a state-of-the-art architecture for audio data classification, was employed for the model development [4].
The audio processing pipeline consists of three steps, including MelSpec and MFCC features and MobileNetV2 model development:
- First, MelSpec is the spectrogram of an audio segment, computed with the fast Fourier transform and mapped onto the Mel scale. Research has shown that the human auditory system perceives frequency non-linearly, so the Mel scale spaces frequency bands such that they sound equally distant to a human listener. For our use case, MelSpec features with 128 points were calculated for model development.
- Second, MFCC is a feature derived from MelSpec by applying a linear cosine transformation, which research has shown can improve classification performance for audible sound. MFCC features with 33 points were extracted from the audio data; however, the model based on this feature was unable to compete with the MelSpec model, suggesting that MFCC often performs better with sequence models.
- Finally, the audio model MobileNetV2 was adopted for our data and trained for 100 epochs with preloaded ImageNet weights. MobileNetV2 [5] is a convolutional neural network architecture that seeks to perform well on mobile devices. It’s based on an inverted residual structure, where the residual connections occur between the bottleneck layers.
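To make the Mel scale concrete, here is a small sketch that converts between hertz and mels and places 128 band centers evenly on the Mel scale. It uses the HTK-style formula, which is one common convention; the post doesn't state which variant or frequency range was used, so the 0 to 8,000 Hz range and the function names are assumptions for illustration.

```python
import math


def hz_to_mel(freq_hz: float) -> float:
    """HTK-style conversion from hertz to mels."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


def mel_to_hz(mel: float) -> float:
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_band_centers(n_mels: int, f_min: float, f_max: float) -> list:
    """Center frequencies (Hz) of n_mels bands spaced evenly on the Mel scale."""
    mel_min, mel_max = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (mel_max - mel_min) / (n_mels + 1)
    return [mel_to_hz(mel_min + step * (i + 1)) for i in range(n_mels)]


# 128 bands as in the post; the frequency range is an assumption.
centers = mel_band_centers(128, 0.0, 8000.0)
print(len(centers))
# Bands that are equally spaced in mels grow wider in hertz as frequency rises.
print(centers[1] - centers[0] < centers[-1] - centers[-2])
```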
The postprocessing step employs a boosting algorithm to do the following:
- Obtain video-level performance from the frame level
- Incorporate the outputs of the three models into the decision-making process
- Enhance the model performance in prediction using a defined class-model strategy obtained from validation sets and applied to test sets
First, the postprocessing module generated 1-second-level predicted classes and their probabilities for RGB, optical flow, and audio models. We then used a majority voting algorithm to assign the predicted class at the 1-second-level during inference.
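A minimal sketch of majority voting at the 1-second level follows. The post doesn't specify how the probability of the winning class is aggregated, so averaging the probabilities of the frames that voted for the winner is an assumption, as are the function name and the toy values. On ties, `Counter.most_common` returns the label that appeared first.

```python
from collections import Counter


def second_level_prediction(frame_labels, frame_probs):
    """Majority vote over the frame predictions of one 1-second clip.

    Returns the winning label and the mean probability of the frames that voted for it.
    """
    winner, _ = Counter(frame_labels).most_common(1)[0]
    winning_probs = [p for label, p in zip(frame_labels, frame_probs) if label == winner]
    return winner, sum(winning_probs) / len(winning_probs)


labels = ["tennis", "tennis", "badminton", "tennis"]
probs = [0.9, 0.7, 0.6, 0.8]
print(second_level_prediction(labels, probs))
```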
Next, the 1-second-level computer vision and audio labels were converted to video-level performance. The results on the validation sets were compared to create a table of classes based on the model-class performance strategy for multimodal prediction against testing sets.
In the final stage, testing sets were passed through the prediction module, resulting in three labels and probabilities.
In this work, the RGB models resulted in the highest performance for all classes except badminton, where the audio model gave the best performance. The optical flow models didn’t compete with the other two models, although some research has shown that optical flow-based models could generate better results for certain datasets. The final prediction was performed by incorporating all three labels based on the predefined table to output the most probable classes.
The boosting algorithm of the prediction module is described as follows:
- Split videos into 1-second segments.
- Extract frames and audio signals.
- Prepare RGB frames and MelSpec features.
- Pass RGB frames through the ResNet50 trained on RGB samples and obtain prediction labels per frame.
- Pass MelSpec features through the MobileNet trained on audio samples and obtain prediction labels for each 1-second audio sample.
- Calculate 1-second-level RGB labels and probabilities.
- Use a predefined table (obtained from validation results).
- If a class for which the audio model performed best on the validation sets (badminton in this work) is found among the two labels associated with a 1-second sample, vote for the audio model (get the label and probability from the audio model). Otherwise, vote for the RGB model (get the label and probability from the RGB model).
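The voting rule in the last step can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the set of audio-preferred classes stands in for the predefined class-model table (the post reports badminton as the only class where audio won), the function names are hypothetical, and aggregating fused 1-second labels to a video label by majority vote is an assumption.

```python
# Classes for which the audio model won on the validation sets (badminton in the post).
AUDIO_PREFERRED = {"badminton"}


def fuse_second(rgb_pred, audio_pred, audio_preferred=AUDIO_PREFERRED):
    """Pick one (label, probability) pair for a 1-second sample.

    rgb_pred and audio_pred are (label, probability) tuples.
    """
    labels = {rgb_pred[0], audio_pred[0]}
    if labels & set(audio_preferred):
        return audio_pred  # vote for the audio model
    return rgb_pred        # otherwise vote for the RGB model


def video_level_label(fused_seconds):
    """Aggregate fused 1-second labels into one video label by majority vote (assumed)."""
    counts = {}
    for label, _ in fused_seconds:
        counts[label] = counts.get(label, 0) + 1
    return max(counts, key=counts.get)


seconds = [
    fuse_second(("tennis", 0.8), ("badminton", 0.6)),
    fuse_second(("tennis", 0.7), ("tennis", 0.5)),
    fuse_second(("badminton", 0.9), ("badminton", 0.4)),
]
print(seconds)
print(video_level_label(seconds))
```

Because the table is derived from validation results, it can be regenerated after retraining without changing the voting code.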
The following graph shows the averaged frame-level F1 scores of the three models against two validation datasets; the error bars represent the standard deviations.
Similarly, the following graph compares the F1 scores for three models per class measured for two testing datasets before postprocessing (average and standard deviation as error bars).
After applying the multimodal prediction module to the testing datasets to convert frame-level and 1-second-level predictions, the postprocessed video-level metrics were produced (see the following graph) and showed a significant improvement from the frame-level single modality to the video-level multimodal outputs.
As previously mentioned, the class-model table was prepared using the comparison of three models for validation sets.
The analysis demonstrated that the multimodal approach could improve the performance of multi-class event detection by 5.10%, 55.68%, and 34.2% for single RGB, optical flow, and audio models, respectively. In addition, the confusion matrices for postprocessed testing datasets, shown in the following figures, indicated that the multimodal approach could predict most classes in a challenging 25-class event detection task.
The following figure shows the video-level confusion matrix of the first testing dataset after postprocessing.
The following figure shows the video-level confusion matrix of the second testing dataset after postprocessing.
The modeling workflow explained in this post assumes that the data examples in the dataset are all relevant, are all labeled correctly, and have similar distributions among each class. However, the authors’ manual observation of the data sometimes found substantial differences in video footage from one sample to another in the same class. Therefore, one of the areas of improvement that can have great impact on the performance of the model is to further prune the dataset to only include the relevant training examples and provide better labeling.
We used the multimodal model prediction against the testing dataset to generate the following demo for 25 sports, where the bars demonstrate the probability of each class per second (we called it 1-second-level prediction).
This post outlined a multimodal event detection approach using a combination of RGB, optical flow, and audio models through robust ResNet50 and MobileNet architectures implemented on SageMaker. The results of this study demonstrated that, by using a parallel model development, multimodal event detection improved the performance of a challenging 25-class event detection task in sports.
A dynamic postprocessing module enables you to update predictions after new training to enhance the model’s performance against new data.
About Amazon ML Solutions Lab
The Amazon ML Solutions Lab pairs your team with ML experts to help you identify and implement your organization’s highest value ML opportunities. If you’d like help accelerating your use of ML in your products and processes, please contact the Amazon ML Solutions Lab.
Editor’s note: The dataset used in this post is for non-commercial demonstration and exploratory research.
[1] Vats, Kanav, Mehrnaz Fani, Pascale Walters, David A. Clausi, and John Zelek. “Event Detection in Coarsely Annotated Sports Videos via Parallel Multi-Receptive Field 1D Convolutions.” In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pp. 882-883. 2020.
[2] Karpathy, Andrej, George Toderici, Sanketh Shetty, Thomas Leung, Rahul Sukthankar, and Li Fei-Fei. “Large-scale video classification with convolutional neural networks.” In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1725-1732. 2014.
[3] Farnebäck, Gunnar. “Two-frame motion estimation based on polynomial expansion.” In Scandinavian Conference on Image Analysis, pp. 363-370. Springer, Berlin, Heidelberg, 2003.
[4] Adapa, Sainath. “Urban sound tagging using convolutional neural networks.” arXiv preprint arXiv:1909.12699 (2019).
[5] Sandler, Mark, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen. “Mobilenetv2: Inverted residuals and linear bottlenecks.” In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4510-4520. 2018.
About the Authors
Saman Sarraf is a Data Scientist at the Amazon ML Solutions Lab. His background is in applied machine learning including deep learning, computer vision, and time series data prediction.
Mehdi Noori is a Data Scientist at the Amazon ML Solutions Lab, where he works with customers across various verticals, and helps them to accelerate their cloud migration journey and solve their ML problems using state-of-the-art solutions and technologies.
Utilizing XGBoost training reports to improve your models
In 2019, AWS unveiled Amazon SageMaker Debugger, a SageMaker capability that enables you to automatically detect a variety of issues that may arise while a model is being trained. SageMaker Debugger captures model state data at specified intervals during a training job. With this data, SageMaker Debugger can detect training issues or anomalies by leveraging built-in or user-defined rules. In addition to detecting issues during the training job, you can analyze the captured state data afterwards to evaluate model performance and identify areas for improvement. This task is made easier with the newly launched XGBoost training report feature. With a minimal amount of code changes, SageMaker Debugger generates a comprehensive report outlining key information that you can use to evaluate and improve the model.
This post shows you an end-to-end example of training an XGBoost model on SageMaker and how to enable the automatic XGBoost report functionality in SageMaker Debugger to quickly and easily evaluate model performance and identify areas of improvement for your model. Even if you don’t have a lot of data science experience, you can still gauge how well the model performs and identify areas of improvement based on information provided by the report. The code from this post is available in the GitHub repo.
For this example, we use the dataset from the Kaggle ATLAS Higgs Boson Machine Learning Challenge 2014. With this dataset, we train a machine learning (ML) model to automatically classify Higgs Boson events from others (such as background noise) generated from simulated proton-proton collisions in CERN’s Large Hadron Collider. The data can be obtained directly from CERN. Let’s go through the steps of obtaining the data and configuring the training job. You can follow along with a Jupyter notebook.
- We start with the relevant imports:
import requests
from io import BytesIO
import pandas as pd
import boto3
import s3fs
from datetime import datetime
import time
import sagemaker
from sagemaker.estimator import Estimator
from sagemaker import image_uris
from sagemaker.inputs import TrainingInput
from sagemaker.debugger import Rule, rule_configs
from IPython.display import FileLink, FileLinks
- Then we set up variables that we later need to configure the SageMaker training job:
# setup sagemaker variables
role = sagemaker.get_execution_role()
sess = sagemaker.session.Session()
bucket = sess.default_bucket()
key_prefix = "higgs-boson"
region = sess._region_name
s3 = s3fs.S3FileSystem(anon=False)
xgboost_container = image_uris.retrieve("xgboost", region, "1.2-1")
- We obtain data and prepare it for training:
# obtain data from CERN and load it into a DataFrame
data_url = ""
gz_file = BytesIO(requests.get(data_url).content)
gz_file.flush()
df = pd.read_csv(gz_file, compression="gzip")
# identify feature, label, and unused columns
non_feature_cols = ["EventId", "Weight", "KaggleSet", "KaggleWeight", "Label"]
feature_cols = [col for col in df.columns if col not in non_feature_cols]
label_col = "Label"
df["Label"] = df["Label"].apply(lambda x: 1 if x=="s" else 0)
# take subsets of data per the original Kaggle competition
train_data = df.loc[df["KaggleSet"] == "t", [label_col, *feature_cols]]
test_data = df.loc[df["KaggleSet"] == "b", [label_col, *feature_cols]]
# upload data to S3
for name, dataset in zip(["train", "test"], [train_data, test_data]):
    sess.upload_string_as_file_body(body=dataset.to_csv(index=False, header=False),
                                    bucket=bucket,
                                    key=f"{key_prefix}/input/{name}.csv"
                                    )
# configure data inputs for SageMaker training
train_input = TrainingInput(f"s3://{bucket}/{key_prefix}/input/train.csv", content_type="text/csv")
validation_input = TrainingInput(f"s3://{bucket}/{key_prefix}/input/test.csv", content_type="text/csv")
Setting up a training job with XGBoost training report
We only need to make one code change to the typical process for launching a training job: adding the create_xgboost_report rule to the Estimator. SageMaker takes care of the rest. A companion SageMaker processing job spins up to analyze the XGBoost model and produce the report. This analysis is done at no additional cost. See the following additional code:
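# add a rule to generate the XGBoost Report
rules = [Rule.sagemaker(rule_configs.create_xgboost_report())]
hyperparameters = {
    "max_depth": "6",
    "eta": "0.1",
    "objective": "binary:logistic",
    "num_round": "100",
}
# NOTE: instance_count and instance_type are assumed values; choose ones that fit your workload
estimator = Estimator(role=role, image_uri=xgboost_container, instance_count=1, instance_type="ml.m5.2xlarge",
                      rules=rules, hyperparameters=hyperparameters, sagemaker_session=sess)
estimator.fit({"train": train_input, "validation": validation_input})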
Analyzing models with the XGBoost training report
When the training job is complete, SageMaker automatically starts the processing job to generate the XGBoost report. We write a few lines of code to check the status of the processing job. When it’s complete, we download it to our local drive for further review. The following code downloads the report upon its completion, and provides a hyperlink directly within the notebook for easy viewing:
import os
# get name of profiler report
profiler_report_name = [rule["RuleConfigurationName"]
                        for rule in estimator.latest_training_job.rule_job_summary()
                        if "Profiler" in rule["RuleConfigurationName"]][0]
# get name of the processing job that builds the XGBoost report
xgb_profile_job_name = [rule["RuleEvaluationJobArn"].split("/")[-1]
                        for rule in estimator.latest_training_job.rule_job_summary()
                        if "CreateXgboostReport" in rule["RuleConfigurationName"]][0]
base_output_path = os.path.dirname(estimator.latest_job_debugger_artifacts_path())
rule_output_path = os.path.join(base_output_path, "rule-output/")
xgb_report_path = os.path.join(rule_output_path, "CreateXgboostReport")
profile_report_path = os.path.join(rule_output_path, profiler_report_name)
# poll until the report-generating processing job has finished
while True:
    xgb_job_info = sess.sagemaker_client.describe_processing_job(ProcessingJobName=xgb_profile_job_name)
    if xgb_job_info["ProcessingJobStatus"] == "Completed":
        break
    print(f"Job Status: {xgb_job_info['ProcessingJobStatus']}")
    time.sleep(30)
# download both reports from S3 to the local drive
s3.download(xgb_report_path, "reports/xgb/", recursive=True)
s3.download(profile_report_path, "reports/profiler/", recursive=True)
display("Click link below to view the profiler report", FileLink("reports/profiler/profiler-output/profiler-report.html"))
display("Click link below to view the XGBoost Training report", FileLink("reports/xgb/xgboost_report.html"))
Before we dive into the training report, let’s take a quick look at the SageMaker Debugger report, which by default is generated after every training job. This report provides key metrics around resource utilization such as network, I/O, and CPU. In the following example, we can see the median CPU utilization was at around 55% while memory utilization was consistently under 5%. This tells us that we can reduce costs by utilizing a smaller training instance.
Now let’s dive into the training report. SageMaker Debugger automatically generates the following key insights on our model:
- Distribution of labels – Detects imbalanced datasets
- Loss graph – Detects over-fitting or over training
- Feature importance metrics – Identifies redundant or uninformative features
- Confusion matrix and evaluation metrics – Evaluates performance at the individual class level and identifies concentrations of errors
- Accuracy rate per iteration – Shows how accuracy improved for each class over each round of boosting
- Receiver operating characteristic curve – Shows how the model performs under different probability thresholds
- Distribution of residuals – Helps determine if residuals are a result of random error or missing information
We pick a few items from the report for demonstration purposes.
Distribution of true labels of the dataset
This visualization shows the distribution of labeled classes (for classification) or values (for regression) in your original dataset. An imbalanced dataset could result in poor predictive performance unless properly handled. In this particular example, there’s a slight imbalance between the negative and positive label.
Loss vs. step graph
This visualization compares the loss from the training dataset against the validation dataset. This particular model appears to be over-fitting on the training set because the validation error remains relatively flat after about 30 boosting rounds, even though the training loss continues to improve.
Feature importance
This visualization shows you feature importance by weight, gain, and coverage. Gain, which measures the relative contribution of each feature, is typically the most relevant one for most use cases. For this particular model, we see that a handful of features provide the bulk of the contribution, while a large number contribute little to no gain to the model’s predictive performance. It’s usually a good practice to drop uninformative features from the model because they add noise and may result in over-fitting.
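One simple way to act on a gain ranking is to keep only the top features that together account for most of the total gain. The sketch below is a hypothetical heuristic, not something the report does for you: the 95% threshold, the function name, and the gain values are all assumptions for illustration.

```python
def features_covering_gain(gain_by_feature, coverage=0.95):
    """Return the smallest set of top features whose gain sums to `coverage` of the total."""
    total = sum(gain_by_feature.values())
    kept, running = [], 0.0
    ranked = sorted(gain_by_feature.items(), key=lambda kv: kv[1], reverse=True)
    for name, gain in ranked:
        if running >= coverage * total:
            break
        kept.append(name)
        running += gain
    return kept


# Hypothetical gain values; real ones come from the report or the trained model.
gains = {"f_a": 50.0, "f_b": 30.0, "f_c": 16.0, "f_d": 3.0, "f_e": 1.0}
kept = features_covering_gain(gains)
print(kept)
```

Any pruning like this should be validated by retraining and comparing validation metrics, because low-gain features can still matter in combination.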
Confusion matrix and ROC curve
There are a number of additional visualizations that show you the common things data scientists often look at, such as the confusion matrix, ROC curve, and F1 score. For more information, see Debugger XGBoost Training Report Walkthrough.
From the following confusion matrix, we can see that the model does a better job at predicting class 0 than class 1. This can be explained by the imbalanced label distribution we showed at the beginning (there are more instances of class 0 than class 1). One remedy is to make the label distribution more balanced via data resampling techniques.
SageMaker Debugger automatically generates and reports the performance metrics such as F1 score and accuracy. You can also see a classification report, such as the following.
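For readers less familiar with these metrics, the following sketch shows how precision, recall, F1 score, and accuracy follow from the four cells of a binary confusion matrix. The counts are hypothetical and are not the values from the report.

```python
def binary_metrics(tn, fp, fn, tp):
    """Precision, recall, F1, and accuracy for the positive class (label 1)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tn + fp + fn + tp)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}


# Hypothetical counts for illustration only.
metrics = binary_metrics(tn=80, fp=20, fn=40, tp=60)
print(metrics)
```

A model can have reasonable accuracy while recall for the minority class stays low, which is why the per-class view matters for imbalanced data.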
Fine-tuning performance
From the training report’s outputs, we can see several areas where the model can be fine-tuned to improve performance, notably the following:
- The loss vs. step graph indicates that the validation error stopped improving after about 30 rounds, so we can reduce the number of boosting rounds or enable early stopping to mitigate over-training.
- The feature importance graph shows a large number of uninformative features that could potentially be removed to reduce over-fitting and improve predictive performance on unseen datasets.
- Based on the confusion matrix and the classification report, the recall score is somewhat low, meaning we’ve misclassified a large number of signal events. Tuning the parameter that weights the positive class (for example, scale_pos_weight in XGBoost) to adjust for the imbalance in the dataset could help improve this.
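As a sketch of how the first and third suggestions might translate into hyperparameters, the following computes a starting value for XGBoost's `scale_pos_weight` as the ratio of negative to positive examples (a commonly recommended starting point) and adds `early_stopping_rounds`. The label data, the patience of 10 rounds, and the function name are assumptions; the other values repeat the hyperparameters used earlier in this post.

```python
def suggest_scale_pos_weight(labels):
    """Ratio of negative to positive examples, a common starting value for scale_pos_weight."""
    positives = sum(1 for y in labels if y == 1)
    negatives = len(labels) - positives
    return negatives / positives


# Hypothetical label column with a 2:1 imbalance.
toy_labels = [0] * 200 + [1] * 100

tuned_hyperparameters = {
    "max_depth": "6",
    "eta": "0.1",
    "objective": "binary:logistic",
    "num_round": "100",
    "early_stopping_rounds": "10",  # assumed patience; stop when validation loss stalls
    "scale_pos_weight": str(suggest_scale_pos_weight(toy_labels)),
}
print(tuned_hyperparameters["scale_pos_weight"])
```

Early stopping relies on the validation channel, which the training job in this post already provides.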
In this post, we generated an XGBoost training report and profiler report using SageMaker Debugger. With these, we got reports for both the model performance and the resource utilization during training automatically. We then walked through the XGBoost training report and identified a number of issues that we can alleviate with some hyperparameter tuning.
For more about SageMaker Debugger, see SageMaker Debugger XGBoost Training Report and SageMaker Debugger Profiling Report.
About the Authors
Simon Zamarin is an AI/ML Solutions Architect whose main focus is helping customers extract value from their data assets. In his spare time, Simon enjoys spending time with family, reading sci-fi, and working on various DIY house projects.
Lu Huang is a Senior Product Manager on the AWS Deep Engine team, managing SageMaker Debugger.
Satadal Bhattacharjee is Principal Product Manager at AWS AI. He leads the machine learning engine PM team on projects such as SageMaker and optimizes machine learning frameworks such as TensorFlow, PyTorch, and MXNet.
Qingwei Li is a Machine Learning Specialist at Amazon Web Services. He received his Ph.D. in Operations Research after he broke his advisor’s research grant account and failed to deliver the Nobel Prize he promised. Currently he helps customers in the financial service and insurance industry build machine learning solutions on AWS. In his spare time, he likes reading and teaching.
Nihal Harish is an engineer at AWS AI. He loves working at the intersection of distributed systems and machine learning. Outside of work, he enjoys long distance running and playing tennis.
Integrating Amazon Polly with legacy IVR systems by converting output to WAV format
Amazon Web Services (AWS) offers a rich stack of artificial intelligence (AI) and machine learning (ML) services that help automate several components of the customer service industry. Amazon Polly, an AI generated text-to-speech service, enables you to automate and scale your interactive voice solutions, helping to improve productivity and reduce costs.
You might face common implementation challenges when updating or modifying legacy interactive voice response (IVR) systems that don’t support file formats such as MP3 and PCM. Amazon Polly, in order to minimize response latency, produces synthesis in real-time and streams the results back to the customer in a streamable format (MP3, Ogg/Vorbis or raw PCM samples) while the request is being processed. WAV audio format is not streamable by definition, but a WAV file can be easily created from a PCM stream generated by Polly at the end of synthesis, when all samples are collected and the length of the result can be calculated. This post shows you how to convert Amazon Polly output to a common audio format like WAV.
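To illustrate why the WAV file can only be finalized once all samples are collected, the following sketch wraps a block of raw PCM bytes in a WAV container using Python's standard `wave` module. It doesn't call Amazon Polly; one second of silence stands in for the synthesized audio, and the function name is hypothetical. The WAV header records the data length, which is why the full sample count must be known.

```python
import io
import wave


def pcm_to_wav_bytes(pcm_bytes, rate=16000, channels=1, sample_width=2):
    """Wrap raw PCM samples in a WAV container and return the file contents."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav_file:
        wav_file.setnchannels(channels)
        wav_file.setsampwidth(sample_width)
        wav_file.setframerate(rate)
        wav_file.writeframes(pcm_bytes)  # header length fields are filled in from the data
    return buffer.getvalue()


# One second of 16-bit mono silence at 16 kHz as stand-in PCM data.
silence = b"\x00\x00" * 16000
wav_bytes = pcm_to_wav_bytes(silence)
print(wav_bytes[:4], len(wav_bytes) - len(silence))
```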
Converting Amazon Polly file output to WAV
One of the challenges with legacy systems is that they may not support Amazon Polly file outputs like MP3. The Amazon Polly SynthesizeSpeech API call doesn’t support WAV output natively, but some legacy IVRs require audio in WAV file format. Many of these applications are written in Python and Java.
The following sample code helps in situations where the audio must be in WAV file format, which Amazon Polly doesn’t support natively. The sample code converts files from PCM to WAV in Python for inputs given in both SSML and text.
#The following sample code snippet converts files from PCM to WAV in Python for both SSML and non SSML text
#Importing libraries
import boto3
import wave
import os
from botocore.exceptions import BotoCoreError, ClientError
#Initializing variables
CHANNELS = 1 #Polly's output is a mono audio stream
RATE = 16000 #Polly supports 16000Hz and 8000Hz output for PCM format
OUTPUT_FILE_IN_WAVE = "sample_SSML.wav" #WAV format Output file name
WAV_SAMPLE_WIDTH_BYTES = 2 # Polly's output is a stream of 16-bits (2 bytes) samples
#Initializing Polly Client
polly = boto3.client("polly")
#Input text for conversion
INPUT = "<speak>Hi! I'm Matthew. Hope you are doing well. This is a sample PCM to WAV conversion for SSML. I am a Neural voice and have a conversational style. </speak>" # Input in SSML
WORD = "<speak>"
try:
    if WORD in INPUT: #Checking for SSML input
        #Calling Polly synchronous API with text type as SSML
        response = polly.synthesize_speech(Text=INPUT, TextType="ssml", OutputFormat="pcm",VoiceId="Matthew", SampleRate="16000") #the input to sampleRate is a string value.
    else:
        #Calling Polly synchronous API with text type as plain text
        response = polly.synthesize_speech(Text=INPUT, TextType="text", OutputFormat="pcm",VoiceId="Matthew", SampleRate="16000")
except (BotoCoreError, ClientError) as error:
    #Stopping here because there is no audio to convert
    raise SystemExit(f"Polly request failed: {error}")
#Processing the response to audio stream
STREAM = response.get("AudioStream")
#Writing the collected PCM samples into a WAV container
with wave.open(OUTPUT_FILE_IN_WAVE, "wb") as wav_file:
    wav_file.setnchannels(CHANNELS)
    wav_file.setsampwidth(WAV_SAMPLE_WIDTH_BYTES)
    wav_file.setframerate(RATE)
    wav_file.writeframes(STREAM.read())
The following is the sample output from the preceding code:
You can convert Amazon Polly output from PCM to WAV so that you can use Amazon Polly in your legacy IVR, enabling it to support WAV file format output. Try this out for yourself and let us know how it goes in the comments!
You can further refine the converted file using the powerful capabilities available in Amazon Polly like the SynthesizeSpeech request, managing lexicons, reserved characters in SSML, and controlling volume, speaking rate, and pitch.
About the Author
Abhishek Soni is a Partner Solutions Architect at AWS. He works with customers to provide technical guidance for the best outcome of workloads on AWS.
Introducing Amazon SageMaker Reinforcement Learning Components for open-source Kubeflow pipelines
This blog post was co-authored by AWS and Max Kelsen. Max Kelsen is one of Australia’s leading Artificial Intelligence (AI) and Machine Learning (ML) solutions businesses. The company delivers innovation, directly linked to the generation of business value and competitive advantage to customers in Australia and globally, including Fortune 500 companies. Max Kelsen is also dedicated to reinvesting our expertise and profits to solve the challenges of humankind, focusing on Genomics, AI Safety, and Quantum Computing.
Robots require the integration of technologies such as image recognition, sensing, artificial intelligence, machine learning (ML), and reinforcement learning (RL) in ways that are new to the field of robotics. Today, we’re launching Amazon SageMaker Reinforcement Learning Kubeflow Components supporting AWS RoboMaker, a cloud robotics service, for orchestrating robotics ML workflows. Orchestrating robotics operations to train, simulate, and deploy RL applications is difficult and time-consuming. Now, with SageMaker RL components and pipelines, it’s faster to experiment and manage robotics ML workflows from perception to controls and optimization, and create end-to-end solutions without having to rebuild each time.
Robots are being used more widely in society for purposes that are increasing in sophistication, such as complex assembly, picking and packing, last-mile delivery, environmental monitoring, search and rescue, and assisted surgery. Robotics often involves training complex sequences of behaviors. RL is an emerging ML technique that can help develop solutions for exactly these kinds of problems. It learns complex behaviors without requiring any labeled training data, and can make short-term decisions while optimizing for a long-term goal. For example, when a robot interacts with its environment, this mostly takes place in a simulator. The robot receives a positive or negative reward for actions that it takes. Rewards are computed by a user-defined function that outputs a numeric representation of the actions that should be incentivized. The agent tries to maximize positive rewards, and as a result the model learns an optimal strategy for decision-making.
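To make the reward idea concrete, the following is a minimal sketch of what a user-defined reward function might look like for a hypothetical navigation task. The function name, the goal radius, and the reward values are illustrative assumptions; they are not part of any AWS RoboMaker or SageMaker RL API.

```python
import math


def compute_reward(robot_xy, target_xy, collided, goal_radius=0.5):
    """Return a numeric reward for one simulation step (hypothetical task)."""
    if collided:
        # Strongly discourage collisions.
        return -10.0
    distance = math.dist(robot_xy, target_xy)
    if distance <= goal_radius:
        # Large bonus for reaching the target.
        return 10.0
    # Otherwise, a small penalty that shrinks as the robot gets closer.
    return -0.1 * distance


print(compute_reward((0.0, 0.0), (3.0, 4.0), collided=False))
```

An agent that maximizes this signal is pushed toward the target and away from collisions, which is the sense in which the reward function encodes the behavior to be incentivized.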
SageMaker and AWS RoboMaker are two different services streamlined to serve two separate personas: data scientists and roboticists, respectively. SageMaker is a fully managed service that provides every developer and data scientist with the ability to build, train, and deploy ML models quickly. SageMaker RL builds on top of SageMaker, adding pre-packaged RL toolkits and making it easy to integrate any simulation environment. AWS RoboMaker is the most complete cloud solution for robotic developers to simulate, test, and securely deploy robotic applications at scale. Its managed stacks for Robot Operating System (ROS) and Gazebo, an open-source robot simulation software, free up engineering resources and enable you to start building quickly. The task of stitching together machine learning workflows for robotics using Amazon SageMaker and AWS RoboMaker is non-trivial, consuming valuable time for both data scientists and roboticists.
With Amazon SageMaker RL Components for Kubernetes, you can use SageMaker RL Components in your Kubeflow pipelines to invoke and parallelize SageMaker training jobs and AWS RoboMaker simulation jobs as steps in your RL training workflow, without having to worry about how it runs under the hood. The following diagram illustrates the pipeline workflow for SageMaker RL Components.
When you use SageMaker Components in your Kubeflow pipeline, you simply load the components and describe your pipeline using the Kubeflow Pipelines SDK. SageMaker RL uses open-source libraries such as Anyscale’s Ray to start training an RL agent by collecting experience from Gazebo (open-source software for simulating populations of robots in complex indoor and outdoor environments) in AWS RoboMaker using ROS (a set of software libraries and tools that help you build robot applications). When the training is complete, the RL agent model is stored in an Amazon Simple Storage Service (Amazon S3) bucket, and an Amazon SageMaker inference node can be created for deployment in production. You can then download the model to the robot with the same ROS structure from the simulation to perform the required tasks.
Although our solution is implemented in Kubeflow components and pipelines, it’s not specific to Kubeflow and can be generalized to MLOps workflows in Argo and Kubernetes to orchestrate parallel robotics ML jobs.
Use case: Woodside Energy deploys robotics in oil and gas environments
Woodside Energy uses AWS RoboMaker with Amazon SageMaker Kubeflow operators to train, tune, and deploy reinforcement learning agents to their robots to perform manipulation tasks that are repetitive or dangerous. This framework will allow the team to iterate and deploy at scale.
“Our team and our partners wanted to start exploring using machine learning methods for robotics manipulation,” says Kyle Saltmarsh, Robotics Engineer at Woodside Energy. “Before we could do this effectively, we needed a framework that would allow us to train, test, tune, and deploy these models efficiently. Utilizing Kubeflow components and pipelines with SageMaker and RoboMaker provides us with this framework and we are excited to have our roboticists and data scientists focus their efforts and time on algorithms and implementation.”
Woodside and AWS engaged Max Kelsen to assist in the development and contribution of the RoboMaker and RLEstimator components that enable the pipelines described in this project. Max Kelsen leverages open source throughout most of its work, and views participation in these communities as strategically important to delivering the best outcomes for our clients.
In the following image, Ripley, a custom-built robotics platform by Woodside Energy, is getting ready to perform a double block and bleed, a manual pump shutdown procedure that involves turning multiple valves in sequence. Ripley is based on a Clearpath Robotics Husky equipped with two Universal Robotics UR5 arms, Intel RealSense D435 cameras on each wrist, and a Kodak PixPro body camera. The reinforcement learning formulation utilizes the joint states and camera views as inputs to the agent and outputs optimal trajectories for valve manipulation.
Getting started with SageMaker RL components
In a typical Kubeflow pipeline, each component encapsulates your logic in a container image. As a developer or data scientist, you bring in your training, data preprocessing, model serving, or other logic wrapped in a Kubeflow Pipelines ContainerOp function, which builds your code into a new container. Alternatively, you can put the code into a custom container image and push it to a container registry such as Amazon Elastic Container Registry (Amazon ECR). When the pipeline runs, the component’s container is instantiated on one of the worker nodes on the Kubernetes cluster running Kubeflow, and your logic is implemented. Pipeline components can read outputs from the previous components and create outputs that the next component in the pipeline can consume.
When you use SageMaker Components in your Kubeflow pipeline, rather than encapsulating your logic in a custom container, you simply load the components and describe your pipeline using the Kubeflow Pipelines SDK. When the pipeline runs, your instructions are translated into a SageMaker job or deployment. This workload runs on the fully managed infrastructure of SageMaker. You also get all the benefits of a typical SageMaker capability, including Managed Spot Training, automatic scaling of endpoints, and more.
You have separate VPCs for orchestration and simulation. The reason is that no direct communication is needed between the RLEstimator or AWS RoboMaker jobs and the Kubeflow Pipelines components. The components interact directly with the AWS RoboMaker and SageMaker APIs, but not the jobs themselves. The components poll the APIs for the status of the jobs and any related Amazon CloudWatch Logs, and the responses are reflected back to the Kubeflow Pipelines UI. This offers a single interface for viewing the status of the running jobs.
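The polling behavior described above can be sketched as a simple loop. This is only an illustration of the pattern: get_status is a hypothetical callable standing in for a call to the SageMaker or AWS RoboMaker API, and the status strings are assumptions rather than values taken from either service.

```python
import time


def wait_for_job(get_status, poll_seconds=30,
                 terminal=("Completed", "Failed", "Stopped")):
    """Poll get_status() until it returns a terminal state, then return it."""
    while True:
        status = get_status()
        print(f"Job status: {status}")
        if status in terminal:
            return status
        time.sleep(poll_seconds)


# Stand-in for a real status lookup: replays a fixed sequence of states.
_fake_states = iter(["InProgress", "InProgress", "Completed"])
final_state = wait_for_job(lambda: next(_fake_states), poll_seconds=0)
```

Because only the API is polled, the component never needs a network path to the job itself, which is why the orchestration and simulation VPCs can stay separate.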
The orchestration VPC utilizes both public and private subnets and a NAT gateway. The Amazon Elastic Kubernetes Service (Amazon EKS) worker nodes are launched into a private network, and use a route to the NAT gateway in the public subnet to interact with AWS APIs, and also to pull public Docker images to run on the cluster. For this post, we allow public access to the EKS cluster endpoint. This allows you to run kubectl port forwarding from your local machine and, by doing so, open up a tunnel to access the Kubeflow UI. In a production system, we suggest placing the Kubeflow service behind an Application Load Balancer (ALB) and securing it using AWS Identity and Access Management (IAM).
To run the following use case, you need the following:
- Kubernetes cluster – You can use your existing cluster or create a new one. The fastest way to get one up and running is to launch an EKS cluster using eksctl. For instructions, see Getting started with eksctl. Create a simple cluster with two CPU nodes to run this example. We tested this example on two c5.xlarge instances. You just need enough node resources to run the SageMaker Component containers and Kubeflow. Training and deployments run on the SageMaker and AWS RoboMaker managed infrastructure.
- Kubeflow Pipelines – Install Kubeflow Pipelines on your cluster. For instructions, see Step 1 in Deploying Kubeflow Pipelines. Your Kubeflow Pipelines version must be 0.5.0 or above. Optionally, you can install all of Kubeflow, which includes Kubeflow Pipelines.
- SageMaker and AWS RoboMaker components prerequisites – For instructions on setting up IAM roles and permissions, see Amazon SageMaker Components for Kubeflow Pipelines. You need three IAM roles for the following:
- Kubeflow pipeline pods to access SageMaker and AWS RoboMaker and launch training and simulation jobs.
- Amazon SageMaker execution role to access other AWS resources such as Amazon S3.
- AWS RoboMaker execution role to access other AWS resources such as Amazon S3.
You can launch an EKS cluster from your laptop, desktop, Amazon Elastic Compute Cloud (Amazon EC2) instance, or SageMaker notebook instance. This instance is typically called a gateway instance. Because Amazon EKS offers a fully managed control plane, you only use the gateway instance to interact with the Kubernetes API and worker nodes. The instance should have a role that allows for interaction with the EKS cluster. The code in the examples here was run from a local device with access to the EKS cluster.
Solution overview
The code, configuration files, and Jupyter notebooks used in this post are available on GitHub. The following walkthrough is provided to explain the key concepts. Rather than copying code from these steps, we recommend running the prepared Jupyter notebook. In this post, we walk through the following high-level steps:
- Configure your dependent resources.
- Clone the example repository and install dependencies.
- Open the example Jupyter notebook.
- Install the Kubeflow Pipelines SDK and load SageMaker pipeline components.
- Prepare your training datasets and upload them to Amazon S3.
- Create your Kubernetes pipeline.
- Compile and run your pipeline.
Configuring your dependent resources
If you’re following the proposed architecture from this post, you run the simulation jobs in a private subnet. To ensure that the running jobs have connectivity to AWS resources, add VPC endpoints for the following services:
- Amazon S3
- CloudWatch
Next, create an S3 bucket to host your simulation job and RLEstimator job source files. The jobs also use this bucket to communicate by writing config files. The bucket should be in the same Region that you’re running the rest of your infrastructure, because VPC endpoints are locked to accessing resources within the same Region.
Finally, you need to configure an IAM role with access to the S3 bucket and with the AmazonSageMakerFullAccess and AWSRoboMaker_FullAccess policies attached.
Cloning the example repository and installing dependencies
Open a terminal and SSH to the EC2 gateway instance that you use to communicate with your EKS cluster. After you log in, clone the example repository to access the example Jupyter notebook. See the following code:
git clone
cd kubeflow-pipelines-robomaker-examples
pip install -r requirements.txt
Opening the example Jupyter notebook
As part of the previous step, you installed Jupyter. To open the Jupyter notebook on your gateway instance, complete the following steps:
- Launch JupyterLab on your gateway instance and access it on your local machine with the following code:
jupyter lab
- If you’re running the JupyterLab server on an EC2 instance, set up a tunnel to the EC2 instance so you can access the JupyterLab client on your local laptop or desktop. (If you’re using Amazon Linux instead of Ubuntu, change the username accordingly. Update the IP address of the EC2 instance and use the appropriate key pair.) See the following code:
ssh -N -L -L -i ~/.ssh/<key_pair>.pem ubuntu@<IP_ADDRESS>
You can now access Jupyter lab at http://localhost:8888 on your local machine.
- Access the Kubeflow dashboard by running the following on your gateway instance:
kubectl port-forward svc/istio-ingressgateway -n istio-system 8081:80
You can now access the Kubeflow dashboard at http://localhost:8081.
Open the example Jupyter notebook (kfp-robomaker-example.ipynb).
SageMaker RLEstimator supports two modes for training jobs (the GitHub repo includes one Jupyter notebook for the latter approach):
- Bring your own Docker container image – In this mode, you can provide your own Docker container for training. Build your container with your training scripts and push it to Amazon ECR, which is a container registry. SageMaker pulls your container image, instantiates it, and runs training.
- Bring your own training script (script mode) – In this mode, you don’t have to deal with Docker containers. Simply bring your RLEstimator training scripts written in popular frameworks such as TensorFlow, PyTorch, and MXNet, or popular RL toolkits such as Coach and Ray, and upload them to Amazon S3. SageMaker automatically pulls the appropriate container, downloads your training scripts, and runs them. The kfp-robomaker-example.ipynb Jupyter notebook implements this approach.
The following example takes a closer look at the second approach (bringing your own training script). You walk through all the important steps in the kfp-robomaker-example.ipynb Jupyter notebook. Having it open makes it easy for you to follow along.
The following screenshot shows the kfp-robomaker-example.ipynb notebook.
Installing Kubeflow Pipelines SDK and loading SageMaker pipeline components
To install the SDK and load the pipeline components, complete the following steps:
- Install the Kubeflow Pipelines SDK with the following code:
pip install kfp --upgrade
- Import Kubeflow Pipeline packages in Python with the following code:
import kfp
from kfp import components
from kfp.components import func_to_container_op
from kfp import dsl
- Load SageMaker Components in Python with the following code:
robomaker_create_sim_app_op = components.load_component_from_url(
    ' 4aa11c3c7f6f068fdb135e1af4a0af5bb1d72d17 /components/aws/sagemaker/create_simulation_app/component.yaml')
robomaker_sim_job_op = components.load_component_from_url(
    ' 4aa11c3c7f6f068fdb135e1af4a0af5bb1d72d17/components/aws/sagemaker/simulation_job/component.yaml')
robomaker_delete_sim_app_op = components.load_component_from_url(
    ' 4aa11c3c7f6f068fdb135e1af4a0af5bb1d72d17/components/aws/sagemaker/delete_simulation_app/component.yaml')
sagemaker_rlestimator_op = components.load_component_from_url(
    ' 4aa11c3c7f6f068fdb135e1af4a0af5bb1d72d17/components/aws/sagemaker/rlestimator/component.yaml')
Preparing training datasets and uploading to Amazon S3
To prepare and upload the source code for SageMaker and AWS RoboMaker, enter the following code:
import boto3
s3 = boto3.resource('s3')
role = "<your_role_name>"
bucket_name = "<your_bucket_name>"
s3.meta.client.upload_file("sourcedir.tar.gz", bucket_name, "sagemaker-sources/sourcedir.tar.gz")
print(f"\nUploaded to S3 location: {bucket_name}/sagemaker-sources/sourcedir.tar.gz")
s3.meta.client.upload_file("output.tar", bucket_name, "robomaker-sources/output.tar")
print(f"\nUploaded to S3 location: {bucket_name}/robomaker-sources/output.tar")
Here we upload a sourcedir.tar.gz file that contains some object_tracker code that the SageMaker RLEstimator training job uses. We also upload an output.tar file, which contains a colcon bundle that is used to create an AWS RoboMaker simulation application.
Creating a Kubeflow pipeline using AWS RoboMaker and SageMaker Components
You can express a Kubeflow pipeline as a function decorated with @dsl.pipeline, as shown in the following code and in kfp-robomaker-example.ipynb. For more information, see Overview of Kubeflow Pipelines.
@dsl.pipeline(
    name="SageMaker & RoboMaker pipeline",
    description="SageMaker & RoboMaker Reinforcement Learning job where the jobs work together to train an RL model",
)
def sagemaker_robomaker_rl_job(
+ "".join(random.choice(string.ascii_lowercase) for i in range(10)),
"s3Bucket": bucket_name,
"s3Key": "robomaker-sources/output.tar",
"architecture": "X86_64",
In this code example, you create a new function called sagemaker_robomaker_rl_job() and define arguments that are common to all the steps in the pipeline. Within the function, you then define your pipeline components:
- Creating the simulation application
- RLEstimator training job
- AWS RoboMaker simulation jobs
- Deleting the simulation application
Component 1: Creating the simulation application
This component describes options for creating an AWS RoboMaker simulation application from a colcon bundle file. See the following code:
robomaker_create_sim_app = robomaker_create_sim_app_op(
).set_display_name('Create RoboMaker Sim App')
The options include the simulation software name and version, the robot software name and version, and the sources, which are a link to a colcon bundle in Amazon S3.
Component 2: RLEstimator training job
This component describes a SageMaker RLEstimator training job. The job receives data from the AWS RoboMaker simulation jobs and uses that data to train a model. The job issues new policy weights to the simulation jobs while training. See the following code:
rlestimator_training_toolkit_ray = sagemaker_rlestimator_op(
"": "0.95",
"robomaker.config.app_arn": robomaker_create_sim_app.outputs["arn"],
"robomaker.config.num_workers": "3",
"robomaker.config.packageName": "object_tracker_simulation",
"robomaker.config.launchFile": "local_client.launch",
"robomaker.config.policyServerPort": "9000",
"robomaker.config.iamRole": assume_role,
"robomaker.config.sagemakerBucket": input_bucket_name,
).set_display_name('Start RLEstimator Training')
The options include a reference to the source directory in Amazon S3 where the source code is stored, hyperparameters that are used to configure the training job, and VPC configuration to define where the training job is run. The RLEstimator job spins up a local Redis server that it uses to issue the new policy weights that are consumed by the simulation jobs.
Components 3, 4, and 5: Multiple AWS RoboMaker simulation jobs
These components describe three AWS RoboMaker simulation jobs that use the simulation application created in the robomaker_create_sim_app component. The following code shows one of the components:
robomaker_simulation_job_1 = robomaker_sim_job_op(
"packageName": "object_tracker_simulation",
"launchFile": "local_client.launch",
"environmentVariables": {
"RLCAMP_SAGEMAKER_BUCKET": input_bucket_name,
).set_display_name('RoboMaker Simulation 1')
Options include the simulation app launch configuration, which includes environment variables that can configure the S3 bucket used for communication between the RLEstimator job and these simulation jobs.
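Because the three simulation jobs share the same launch configuration and differ only in their display names, their settings can be generated in a loop. The following is a minimal sketch using plain dictionaries; the helper names and the example bucket name are hypothetical, and only the keys shown in the preceding excerpt are used. The real component takes additional arguments that the excerpt doesn’t show.

```python
def build_launch_config(bucket_name):
    """Build the launch configuration shared by every simulation job."""
    return {
        "packageName": "object_tracker_simulation",
        "launchFile": "local_client.launch",
        "environmentVariables": {
            # Bucket that the training and simulation jobs use to communicate.
            "RLCAMP_SAGEMAKER_BUCKET": bucket_name,
        },
    }


def build_simulation_jobs(bucket_name, num_jobs=3):
    """Return (display_name, launch_config) pairs, one per simulation job."""
    return [
        (f"RoboMaker Simulation {i}", build_launch_config(bucket_name))
        for i in range(1, num_jobs + 1)
    ]


jobs = build_simulation_jobs("my-example-bucket")
for display_name, config in jobs:
    print(display_name, config["environmentVariables"])
```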
Component 6: Deleting the simulation application
This component is used to delete the simulation application that we created. There is a soft limit of 40 simulation applications per AWS account, so it makes sense to clean up automatically as we create new pipeline runs. See the following code:
robomaker_delete_sim_app = robomaker_delete_sim_app_op(
region=region, arn=robomaker_create_sim_app.outputs["arn"],
).set_display_name('Delete RoboMaker Sim App')
The only options are the Region to use to interact with the AWS RoboMaker API and the ARN of the simulation application to delete. We use the .after() method to define when this component should run as part of the pipeline.
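The ordering constraints between the six components can be pictured as a dependency graph. The following sketch uses the Python standard library (graphlib, available in Python 3.9 and later) rather than Kubeflow Pipelines, and the step names are illustrative. It assumes that the delete step waits for the training job and all three simulation jobs; the excerpt doesn’t show which steps are passed to .after().

```python
from graphlib import TopologicalSorter

# Each key lists the steps that must finish before it can start.
dependencies = {
    "create_sim_app": set(),
    # The training and simulation jobs consume the application ARN.
    "rlestimator_training": {"create_sim_app"},
    "simulation_1": {"create_sim_app"},
    "simulation_2": {"create_sim_app"},
    "simulation_3": {"create_sim_app"},
    # Assumed: cleanup runs only after every job has finished.
    "delete_sim_app": {
        "rlestimator_training",
        "simulation_1",
        "simulation_2",
        "simulation_3",
    },
}

order = list(TopologicalSorter(dependencies).static_order())
print(order)
```

Any valid order starts with the create step and ends with the delete step; the four jobs in between have no dependencies on each other, so a pipeline runner is free to start them in parallel.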
Compiling and running your pipeline
Using the Kubeflow pipeline compiler, you compile the pipeline, create an experiment, and run the pipeline. See the following code:
client = kfp.Client()
aws_experiment = client.create_experiment(name='rm-kfp-experiment')
exp_name = f'sagemaker_robomaker_rl_job-{time.strftime("%Y-%m-%d-%H-%M-%S", time.gmtime())}'
my_run = client.run_pipeline(, exp_name, '')
The following is an annotated screenshot of a Kubeflow pipeline after it finishes running. All the steps are SageMaker and AWS RoboMaker capabilities running as part of a Kubeflow pipeline.
In this post, we discussed using SageMaker RL Components to build open-source Kubeflow pipelines. If you have questions or comments about SageMaker RL Components or AWS RoboMaker components, please leave a comment or create an issue on the Kubeflow Pipelines GitHub repo.
About the Authors
Alex Chung is a Senior Product Manager with AWS in enterprise machine learning systems. His role is to make AWS MLOps products more accessible from Kubernetes and custom environments. He’s passionate about accelerating ML adoption for a large body of users to solve global economic and societal problems. Outside machine learning, he is also a board member at a Silicon Valley nonprofit for donating stock to charity.
Kyle Saltmarsh is a robotics engineer from the Intelligent Assets and Robotics group at Woodside Energy. Kyle enjoys rock climbing, long walks on the beach and robot learning.
Leonard O’Sullivan is a Senior Technical Engineer at Max Kelsen, with numerous AWS certifications including SysOps Admin, Developer and Solution Architect. Leonard has more than 8 years of software development experience, and his passion is creating extensible, maintainable and readable code, with a focus on optimizing workflows and removing bottlenecks. Leonard’s current position centres around the automation and optimization of Machine Learning Operations. In his free time, you can find him playing soccer, eating pizza or trying new activities such as axe throwing and hang gliding.
Nicholas Therkelsen-Terry is CEO and Co-Founder of Max Kelsen, a machine learning and artificial intelligence solutions company. Nick has a broad range of expertise spanning across business, economics, sales, management and law. Nick has a deep theoretical and applied understanding of cutting-edge machine learning techniques and has been widely recognized as an expert and thought-leader in this field. Nick is a founding member and board representative of the Queensland AI Hub, a large investment supporting the development of the AI industry, creating more jobs and providing aspiring AI engineers with a space of their own to contribute to Australia’s innovation growth.
Nicholas Thomson is a Software Development Engineer with AWS Deep Learning. He helps build the open-source deep learning infrastructure projects that power Amazon AI. In his free time, he enjoys playing pool or building proof of concept websites.
Ragha Prasad is a software engineer on the AWS RoboMaker team. He is primarily interested in robotics and artificial intelligence. In his spare time, he likes to travel, work on art projects, and catch up on documentaries.
Sahika Genc is a senior applied scientist at Amazon artificial intelligence (AI). Her research interests are in smart automation, robotics, predictive control and optimization, and reinforcement learning (RL), and she serves in the industrial committee for the International Federation of Automatic Control. She leads science teams in scalable autonomous driving and automation systems, including consumer products such as AWS DeepRacer and SageMaker RL. Previously, she was a senior research scientist in the Artificial Intelligence and Learning Laboratory at the General Electric (GE) Global Research Center.
Analyzing open-source ML pipeline models in real time using Amazon SageMaker Debugger
Open-source workflow managers are popular because they make it easy to orchestrate machine learning (ML) jobs for production. Taking models into production following a GitOps pattern is best managed by a container-friendly workflow manager, also known as MLOps. Kubeflow Pipelines (KFP) is one of the Kubernetes-based workflow managers used today. However, it doesn’t provide all the functionality you need for a best-in-class data science and ML engineer experience. A common issue when developing ML models is having access to the tensor-level metadata of how the job is performing. For extremely large models such as for natural language processing (NLP) and computer vision (CV), this can be critical to avoid wasted GPU resources. However, most training frameworks become a black box after starting to train a model.
Amazon SageMaker is a managed ML platform from AWS to build, train, and deploy ML models at scale. SageMaker Components for Kubeflow Pipelines offer the flexibility to run steps of your KFP workflows on SageMaker instead of on your Kubernetes cluster, which provides the extra capabilities of SageMaker to develop high-quality models. SageMaker Debugger offers the capability to debug ML models during training by identifying and detecting problems with the models in near-real time. This feature can be used when training models within Kubeflow Pipelines through the SageMaker Training component. When combined, you can ensure that if your training jobs aren’t continuously improving with decreasing loss rate, the job ends early, thereby saving both cost and time.
SageMaker Debugger allows you to capture and analyze the state from training with minimal code changes. The state is composed of the following:
- The parameters being learned by the model, such as weights and biases for neural networks
- The changes applied to these parameters by the optimizer, called gradients
- The optimization parameters themselves
- Scalar values, such as accuracies and losses
- The output of each layer
The monitoring of these states is done through rules. SageMaker includes a variety of predefined rules, and you can also make custom rules using Python. For more information, see Amazon SageMaker Debugger – Debug Your Machine Learning Models.
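To give a feel for the kind of check a rule performs, the following is a plain Python sketch of a "loss not decreasing" test over a sequence of recorded loss values. It doesn’t use the SageMaker Debugger rule API; the function name and the patience and min_delta parameters are assumptions made for illustration.

```python
def loss_not_decreasing(losses, patience=3, min_delta=1e-4):
    """Return True if the loss fails to improve for `patience` steps in a row."""
    best = float("inf")
    stale_steps = 0
    for loss in losses:
        if loss < best - min_delta:
            # Meaningful improvement: remember it and reset the counter.
            best = loss
            stale_steps = 0
        else:
            stale_steps += 1
            if stale_steps >= patience:
                return True
    return False


print(loss_not_decreasing([0.9, 0.7, 0.5, 0.4, 0.35]))  # False
print(loss_not_decreasing([0.9, 0.9, 0.9, 0.9, 0.9]))   # True
```

A rule of this kind that fires during training is what lets a stalled job be stopped early instead of running to completion.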
In this post, we go over how to deploy a simple pipeline featuring a training component that has a debugger enabled.
Using SageMaker Debugger for Kubeflow Pipelines with XGBoost
This post demonstrates how adding additional parameters to configure the debugger component can allow us to easily find issues within a model. We train a gradient-boosting model on the Modified National Institute of Standards and Technology (MNIST) dataset using Kubeflow Pipelines. The MNIST dataset contains images of handwritten digits from 0–9 and is a popular ML problem. The MNIST dataset contains 60,000 training images and 10,000 test images.
This post walks through the following steps:
- Generating your data
- Cloning the sample repository
- Creating the training pipeline
- Adding debugger parameters
- Compiling the pipeline
- Deploying the training pipeline through Kubeflow Pipelines
- Reading the debugger output
To run the example in this post, you need the following prerequisites:
- Kubernetes cluster – You can use your existing cluster or create a new one. The fastest way to get one up and running on AWS is to launch an Amazon Elastic Kubernetes Service (Amazon EKS) cluster using eksctl. For instructions, see Getting started with eksctl. Create a small cluster with one node to run this example. We tested this example on an Amazon Elastic Compute Cloud (Amazon EC2) c5.xlarge instance. You just need enough node resources to run the SageMaker Component containers and Kubeflow. Training and deployments run on the SageMaker managed infrastructure.
- Kubeflow Pipelines – Install Kubeflow Pipelines on your cluster. For instructions, see Step 1 in Deploying Kubeflow Pipelines. Your Kubeflow Pipelines version must be 0.5.1 or newer. Optionally, you can install all of Kubeflow, which includes Kubeflow Pipelines.
- SageMaker Components prerequisites – For instructions on setting up AWS Identity and Access Management (IAM) roles and permissions, see SageMaker Components for Kubeflow Pipelines. You need two IAM roles:
- Kubeflow pipeline pods to access SageMaker and launch jobs and deployments.
- SageMaker to access other AWS resources such as Amazon Simple Storage Service (Amazon S3) and Amazon Elastic Container Registry (Amazon ECR).
You can run this example from any instance that has Python installed and access to the Kubernetes cluster where Kubeflow pipelines is installed.
Generating your training data
This post uses a SageMaker prebuilt container to train an XGBoost model on the MNIST dataset. We include a Python file that uploads the MNIST dataset to an S3 bucket in the format that the XGBoost prebuilt container expects.
- Create an S3 bucket. This post uses the us-east-1 Region.
- Create a new Python file with the following code:
import pickle, gzip, numpy, urllib.request, json
from urllib.parse import urlparse

###################################################################
# This is the only thing that you need to change to run this code
# Give the name of your S3 bucket
bucket = '<bucket-name>'
# If you are going to use the default values of the pipeline then
# give a bucket name which is in us-east-1 region
###################################################################

# Load the dataset
urllib.request.urlretrieve("", "mnist.pkl.gz")
with gzip.open('mnist.pkl.gz', 'rb') as f:
    train_set, valid_set, test_set = pickle.load(f, encoding='latin1')

# Upload dataset to S3
from sagemaker.amazon.common import write_numpy_to_dense_tensor
import io
import boto3

train_data_key = 'mnist_kmeans_example/train_data'
test_data_key = 'mnist_kmeans_example/test_data'
train_data_location = 's3://{}/{}'.format(bucket, train_data_key)
test_data_location = 's3://{}/{}'.format(bucket, test_data_key)
print('training data will be uploaded to: {}'.format(train_data_location))
print('test data will be uploaded to: {}'.format(test_data_location))

# Convert the training data into the format required by
# the SageMaker XGBoost algorithm
buf = io.BytesIO()
write_numpy_to_dense_tensor(buf, train_set[0], train_set[1])
buf.seek(0)  # Rewind so the upload starts at the beginning of the buffer
boto3.resource('s3').Bucket(bucket).Object(train_data_key).upload_fileobj(buf)

# Convert the test data into the format required by XGBoost algorithm
buf = io.BytesIO()  # Fresh buffer so the training data isn't uploaded again
write_numpy_to_dense_tensor(buf, test_set[0], test_set[1])
buf.seek(0)
boto3.resource('s3').Bucket(bucket).Object(test_data_key).upload_fileobj(buf)

# Convert the valid data into the format required by XGBoost algorithm
numpy.savetxt('valid-data.csv', valid_set[0], delimiter=',', fmt='%g')
s3_client = boto3.client('s3')
input_key = "{}/valid_data.csv".format("mnist_kmeans_example/input")
s3_client.upload_file('valid-data.csv', bucket, input_key)
- Replace <bucket-name> with the name of the bucket you created.
This script requires you to install Python3, boto3, and NumPy.
- Run this script using python3.
- Verify that the data was successfully uploaded.
In your S3 bucket, you should now see a folder called mnist_kmeans_example, and under input, there should be a CSV file named valid_data.csv.
Cloning the sample repository
In a terminal window, clone the Kubeflow pipelines repository and navigate to the directory with the sample code:
git clone
cd pipelines/samples/contrib/aws-samples/sagemaker_debugger_demo
We now go over how to create the training pipeline. This folder contains what the final pipeline should be.
Creating a training pipeline
Create a Python file as our training pipeline. The pipeline specified has poor hyperparameters and results in a poor model. It doesn’t yet have a debugger configured, but can still be compiled and submitted as a training job, and outputs a model.
See the following code:
#!/usr/bin/env python3
import kfp
import json
import os
import copy
from kfp import components
from kfp import dsl
cur_file_dir = os.path.dirname(__file__)
components_dir = os.path.join(cur_file_dir, '../../../../components/aws/sagemaker/')
sagemaker_train_op = components.load_component_from_file(components_dir + '/train/component.yaml')
def training_input(input_name, s3_uri, content_type):
    # Describe one input channel for the SageMaker training job
    return {
        "ChannelName": input_name,
        "DataSource": {"S3DataSource": {"S3Uri": s3_uri, "S3DataType": "S3Prefix"}},
        "ContentType": content_type
    }

bad_hyperparameters = {
    'max_depth': '5',
    'eta': '0',
    'gamma': '4',
    'min_child_weight': '6',
    'silent': '0',
    'subsample': '0.7',
    'num_round': '50'
}

@dsl.pipeline(
    name='XGBoost Training Pipeline with bad hyperparameters',
    description='SageMaker training job test with debugger'
)
def training(role_arn="", bucket_name="my-bucket"):
    train_channels = [
        training_input("train", f"s3://{bucket_name}/mnist_kmeans_example/input/valid_data.csv", 'text/csv')
    ]
    training = sagemaker_train_op(
        # Refer this link for xgboost Registry URLs:
        ...
    )

if __name__ == '__main__':
    kfp.compiler.Compiler().compile(training, __file__ + '.zip')
Adding debugger parameters
To enable SageMaker Debugger in your training jobs, you need to define the additional parameters to configure the debugger.
First, use debug_hook_config to select the tensor groups you want to collect for analysis and specify the frequency at which you want to save them. debug_hook_config takes in two parameters:
- S3OutputPath – Points to the Amazon S3 URI where we intend to store our debugging tensors. SageMaker takes care of uploading these tensors transparently during the run.
- CollectionConfigurations – Enumerates named collections of tensors we want to save. Collections are a convenient way to organize relevant tensors under the same umbrella to make it easy to navigate them during analysis. In this particular example, one of the collections we instruct SageMaker Debugger to save is named metrics. We also instruct SageMaker Debugger to save metrics every three iterations.
# Collections of tensors we want to save
collections = {
    'feature_importance': {
        'save_interval': '5'
    },
    'losses': {
        'save_interval': '10'
    },
    'average_shap': {
        'save_interval': '5'
    },
    'metrics': {
        'save_interval': '3'
    }
}

# Helper method to format CollectionConfigurations
def format_collection_config(collection_dict):
    output = []
    for key, val in collection_dict.items():
        output.append({'CollectionName': key, 'CollectionParameters': val})
    return output

# Helper method to format debug_hook_config
def training_debug_hook(s3_uri, collection_dict):
    return {
        'S3OutputPath': s3_uri,
        'CollectionConfigurations': format_collection_config(collection_dict)
    }

# Provide the debug_hook_config input to the pipeline
@dsl.pipeline(...)
def training(role_arn="", bucket_name="my-bucket"):
    ...
    # debug_hook_config containing S3OutputPath and collections to be saved
    training = sagemaker_train_op(
        debug_hook_config=training_debug_hook(f's3://{bucket_name}/mnist_kmeans_example/hook_config', collections),
    )
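To see what these helpers actually hand to the training component, the following sketch restates them in compact form and prints the resulting dictionary. It runs without KFP or AWS access; the bucket name my-bucket is simply the placeholder default from the pipeline signature.

```python
import json

collections = {
    "feature_importance": {"save_interval": "5"},
    "losses": {"save_interval": "10"},
    "average_shap": {"save_interval": "5"},
    "metrics": {"save_interval": "3"},
}


def format_collection_config(collection_dict):
    # One entry per collection, in the order the dictionary was written.
    return [
        {"CollectionName": name, "CollectionParameters": params}
        for name, params in collection_dict.items()
    ]


def training_debug_hook(s3_uri, collection_dict):
    return {
        "S3OutputPath": s3_uri,
        "CollectionConfigurations": format_collection_config(collection_dict),
    }


hook_config = training_debug_hook(
    "s3://my-bucket/mnist_kmeans_example/hook_config", collections
)
print(json.dumps(hook_config, indent=2))
```

Note that the save intervals are written as strings rather than integers, matching the values used in the pipeline code.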
We also need to specify what rules we want to activate for automatic analysis using debug_rules_config. In this example, we use two SageMaker built-in rules: Overtraining and LossNotDecreasing. As the names suggest, the rules attempt to evaluate if the loss is not decreasing in the tensors captured by the debugging hook during training and also if the model is being over-trained (validation loss should not increase). See the following code:
# Helper method to format debug_rules_config
def training_debug_rules(rule_name, parameters):
    return {
        'RuleConfigurationName': rule_name,
        # Refer this link for Debugger Registry URLs:
        'RuleEvaluatorImage': '',
        'RuleParameters': parameters
    }

# Provide the debug_rule_config input to the pipeline
@dsl.pipeline(...)
def training(role_arn="", bucket_name="my-bucket"):
    ...
    # Rules and rule parameters
    train_debug_rules = [
        training_debug_rules("LossNotDecreasing", {"rule_to_invoke": "LossNotDecreasing", "tensor_regex": ".*"}),
        training_debug_rules("Overtraining", {'rule_to_invoke': 'Overtraining', 'patience_train': '10', 'patience_validation': '20'}),
    ]
    training = sagemaker_train_op(
        # Provide the debug_rule_config as input to the pipeline
        ...
    )
For more information about SageMaker rules and the configurations best suited for using them, see Amazon SageMaker Debugger RulesConfig.
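Because a malformed rule dictionary only surfaces as an error once the training job is submitted, a small local check can save a round trip. The sketch below is a hypothetical helper (check_rule_config is not part of KFP or the SageMaker SDK). It assumes, per the SageMaker API reference, that RuleConfigurationName and RuleEvaluatorImage are required and that RuleParameters maps strings to strings.

```python
REQUIRED_RULE_KEYS = {"RuleConfigurationName", "RuleEvaluatorImage"}


def check_rule_config(rule):
    """Return a list of problems found in one rule dictionary (empty if none)."""
    problems = []
    missing = REQUIRED_RULE_KEYS - rule.keys()
    if missing:
        problems.append(f"missing keys: {sorted(missing)}")
    elif not rule["RuleEvaluatorImage"]:
        problems.append("RuleEvaluatorImage is empty")
    for name, value in rule.get("RuleParameters", {}).items():
        # Numeric settings such as patience_train must be passed as strings.
        if not isinstance(value, str):
            problems.append(
                f"parameter {name!r} is {type(value).__name__}, expected str"
            )
    return problems


example_rule = {
    "RuleConfigurationName": "Overtraining",
    "RuleEvaluatorImage": "",  # placeholder: fill in the image for your Region
    "RuleParameters": {"rule_to_invoke": "Overtraining", "patience_train": 10},
}
for problem in check_rule_config(example_rule):
    print(problem)
```

Running the check over every entry in train_debug_rules before compiling is one way to catch an empty evaluator image or an integer parameter early.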
The following code shows what the pipeline looks like after configuring the debug hook and rules:
#!/usr/bin/env python3
import kfp
import json
import os
import copy
from kfp import components
from kfp import dsl

cur_file_dir = os.path.dirname(__file__)
components_dir = os.path.join(cur_file_dir, '../../../../components/aws/sagemaker/')
sagemaker_train_op = components.load_component_from_file(components_dir + '/train/component.yaml')

def training_input(input_name, s3_uri, content_type):
    return {
        "ChannelName": input_name,
        "DataSource": {"S3DataSource": {"S3Uri": s3_uri, "S3DataType": "S3Prefix"}},
        "ContentType": content_type
    }

def training_debug_hook(s3_uri, collection_dict):
    return {
        'S3OutputPath': s3_uri,
        'CollectionConfigurations': format_collection_config(collection_dict)
    }

def training_debug_rules(rule_name, parameters):
    return {
        'RuleConfigurationName': rule_name,
        # Refer this link for Debugger Registry URLs:
        'RuleEvaluatorImage': '',
        'RuleParameters': parameters
    }

def format_collection_config(collection_dict):
    output = []
    for key, val in collection_dict.items():
        output.append({'CollectionName': key, 'CollectionParameters': val})
    return output

collections = {
    'feature_importance': {
        'save_interval': '5'
    },
    'losses': {
        'save_interval': '10'
    },
    'average_shap': {
        'save_interval': '5'
    },
    'metrics': {
        'save_interval': '3'
    }
}

bad_hyperparameters = {
    'max_depth': '5',
    'eta': '0',
    'gamma': '4',
    'min_child_weight': '6',
    'silent': '0',
    'subsample': '0.7',
    'num_round': '50'
}

@dsl.pipeline(
    name='XGBoost Training Pipeline with bad hyperparameters',
    description='SageMaker training job test with debugger'
)
def training(role_arn="", bucket_name="my-bucket"):
    train_channels = [
        training_input("train", f"s3://{bucket_name}/mnist_kmeans_example/input/valid_data.csv", 'text/csv')
    ]
    train_debug_rules = [
        training_debug_rules("LossNotDecreasing", {"rule_to_invoke": "LossNotDecreasing", "tensor_regex": ".*"}),
        training_debug_rules("Overtraining", {'rule_to_invoke': 'Overtraining', 'patience_train': '10', 'patience_validation': '20'}),
    ]
    training = sagemaker_train_op(
        # Refer this link for xgboost Registry URLs:
        ...,
        debug_hook_config=training_debug_hook(f's3://{bucket_name}/mnist_kmeans_example/hook_config', collections),
    )

if __name__ == '__main__':
    kfp.compiler.Compiler().compile(training, __file__ + '.zip')
Compiling the pipeline
Our pipeline is now complete and ready to be compiled using the following command:
dsl-compile --py --output debugger-component-demo.tar.gz
This creates debugger-component-demo.tar.gz in the same folder; this is the file we upload as our training job.
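Before uploading, it can be worth confirming that the archive was written and contains a pipeline definition. The following sketch uses only the standard library; list_package_members is a hypothetical helper name. KFP v1 compilers typically store a single pipeline.yaml in the archive, but treat that file name as an assumption rather than a guarantee.

```python
import sys
import tarfile


def list_package_members(package_path):
    """Return the file names stored in a compiled .tar.gz pipeline package."""
    with tarfile.open(package_path, "r:gz") as archive:
        return archive.getnames()


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python3 <this script> debugger-component-demo.tar.gz
    print(list_package_members(sys.argv[1]))
```

An empty list or a read error at this point usually means the compile step did not finish, which is cheaper to discover here than in the KFP UI.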
Deploying the pipeline
Now use kubectl to open up the KFP UI in our browser so we have access to the interface where we can upload the pipeline.
- In a new terminal window, run the following command (it’s possible to create pipelines and submit training jobs from the AWS Command Line Interface (AWS CLI)):
$ kubectl port-forward -n kubeflow service/ml-pipeline-ui 8080:80
- Access the KFP UI by opening http://localhost:8080/ in your browser.
- Create a new pipeline and upload the compiled specification (the .tar.gz file) as a new pipeline template.
- Provide the role_arn and bucket_name you created as pipeline inputs.
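The upload and run can also be scripted with the KFP SDK instead of the UI. The following is a minimal sketch, assuming the port-forward above is still running and that the SDK can reach the API through http://localhost:8080 (some installations need a different host path). The function names, run name, and experiment name are hypothetical, and the role ARN and bucket are placeholders.

```python
def submit_compiled_pipeline(client, package_path, role_arn, bucket_name):
    """Start a run from a compiled package using an existing kfp.Client."""
    return client.create_run_from_pipeline_package(
        package_path,
        arguments={"role_arn": role_arn, "bucket_name": bucket_name},
        run_name="debugger-demo-run",          # hypothetical run name
        experiment_name="sagemaker-debugger",  # hypothetical experiment name
    )


def main():
    import kfp  # requires the Kubeflow Pipelines SDK

    client = kfp.Client(host="http://localhost:8080")
    submit_compiled_pipeline(
        client,
        "debugger-component-demo.tar.gz",
        role_arn="<role-arn>",        # placeholder: the IAM role you created
        bucket_name="<bucket-name>",  # placeholder: the bucket you created
    )
```

The argument keys match the parameter names of the training function, which is how KFP maps pipeline inputs to values.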
Reading the debugger output
When the training is complete, the logs display the status of each debugger rule.
The following screenshot shows an example of what the status of each debugger rule should be when the training job is complete.
We see here that our debugger rules haven’t found any issues with the model being overtrained. However, the debug rules indicate that our loss isn’t decreasing over time as it should.
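The same rule statuses are available programmatically from the SageMaker DescribeTrainingJob response, under DebugRuleEvaluationStatuses. The sketch below assumes boto3 with configured credentials; summarize_rule_statuses and main are hypothetical helper names, and the training job name must be supplied by you.

```python
def summarize_rule_statuses(describe_response):
    """Map each debugger rule name to its evaluation status."""
    statuses = describe_response.get("DebugRuleEvaluationStatuses", [])
    return {
        status["RuleConfigurationName"]: status["RuleEvaluationStatus"]
        for status in statuses
    }


def main(training_job_name):
    import boto3  # imported lazily; only needed for the live lookup

    response = boto3.client("sagemaker").describe_training_job(
        TrainingJobName=training_job_name
    )
    for rule, status in summarize_rule_statuses(response).items():
        print(f"{rule}: {status}")
```

For the run described here, you would expect LossNotDecreasing to report IssuesFound and Overtraining to report NoIssuesFound.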
The following screenshot shows the Amazon CloudWatch Logs, also printed on the Logs tab, which indeed show that the train-rmse is staying steady at 0.5 and isn’t decreasing.
Our loss isn’t decreasing because our hyperparameters have been initialized suboptimally, specifically eta, which has been set to a poor value. eta determines the model’s learning rate and is currently at 0. This is clearly erroneous because it means that the subsequent steps aren’t progressing from the initial step. To address this, use a non-zero learning rate; for example, set eta in hyperparameters to 0.2. With this change, the LossNotDecreasing rule is not triggered because train-rmse keeps decreasing steadily throughout the entire training duration. Rerunning the pipeline with the fix results in a model with no issues found.
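The effect of eta can be illustrated with a toy boosting loop. This is not XGBoost: as a simplifying assumption, each round fits only the mean residual and adds eta times that value to the predictions. The targets and the starting prediction of 0.5 are arbitrary illustration values.

```python
import math


def rmse(targets, predictions):
    squared = [(t - p) ** 2 for t, p in zip(targets, predictions)]
    return math.sqrt(sum(squared) / len(targets))


def boost(targets, eta, num_round, base_score=0.5):
    """Toy boosting: each round adds eta * (mean residual) to every prediction."""
    predictions = [base_score] * len(targets)
    history = [rmse(targets, predictions)]
    for _ in range(num_round):
        residuals = [t - p for t, p in zip(targets, predictions)]
        update = sum(residuals) / len(residuals)  # simplest possible weak learner
        predictions = [p + eta * update for p in predictions]
        history.append(rmse(targets, predictions))
    return history


targets = [0.0, 1.0, 2.0, 3.0, 4.0]
flat = boost(targets, eta=0.0, num_round=5)
falling = boost(targets, eta=0.2, num_round=5)
print("eta=0.0:", [round(v, 3) for v in flat])
print("eta=0.2:", [round(v, 3) for v in falling])
```

With eta at 0, every update is multiplied by zero, so the error never moves from its initial value; with eta at 0.2, the error shrinks each round. This is the pattern the LossNotDecreasing rule is looking for.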
Model debugging tools are critical to reduce total time, cost, and resources spent on creating a model. Using SageMaker Debugger in your Kubeflow Pipelines lets you go beyond just looking at scalars like losses and accuracies during training. You can get full visibility into all tensors flowing through the graph during training. Furthermore, it helps you monitor your training in near-real time using rules, and provides alerts if it detects an inconsistency in the training flow, which ultimately reduces costs and improves your company’s effectiveness on ML.
To get started using Kubeflow Pipelines with SageMaker, see the GitHub repo. You can also explore our native integration of SageMaker Operators for Kubernetes for MLOps.
About the Authors
Alex Chung is a Senior Product Manager with AWS in Deep Learning. His role is to make AWS Deep Learning products more accessible and cater to a wider audience. He’s passionate about social impact and technology, getting his regular gym workout, and cooking healthy meals.
Suraj Kota is a Software Engineer specialized in Machine Learning infrastructure. He builds tools that make it easy to get started with and scale machine learning workloads on AWS. He worked on the Amazon Deep Learning Containers, Deep Learning AMI, SageMaker Operators for Kubernetes, and other open source integrations like Kubeflow.
Dustin Luong is a Software Development Engineering Intern with AWS in Deep Engines. He works on developing SageMaker integrations with open source platforms like Kubernetes and Kubeflow Pipelines. He’s currently a student at UC Berkeley and in his spare time he enjoys playing basketball, hiking, and playing board games.