Predict football punt and kickoff return yards with fat-tailed distribution using GluonTS


Today, the NFL is continuing their journey to increase the number of statistics provided by the Next Gen Stats Platform to all 32 teams and fans alike. With advanced analytics derived from machine learning (ML), the NFL is creating new ways to quantify football, and to provide fans with the tools needed to increase their knowledge of the games within the game of football. For the 2022 season, the NFL aimed to leverage player-tracking data  and new advanced analytics techniques to better understand special teams.

The goal of the project was to predict how many yards a returner would gain on a punt or kickoff play. One of the challenges when building predictive models for punt and kickoff returns is the presence of very rare events, such as touchdowns, that have significant importance in the dynamics of a game. Data distributions with fat tails are common in real-world applications, where rare events have a significant impact on the overall performance of the models. Using a robust method to accurately model the distribution over extreme events is crucial for better overall performance.

In this post, we demonstrate how to use Spliced Binned-Pareto distribution implemented in GluonTS to robustly model such fat-tailed distributions.

We first describe the dataset used. Next, we present the data preprocessing and other transformation methods applied to the dataset. We then explain the details of the ML methodology and model training procedures. Finally, we present the model performance results.

Dataset

In this post, we used two datasets to build separate models for punt and kickoff returns. The player tracking data contains each player's position, direction, acceleration, and more (in x,y coordinates). There are around 3,000 and 4,000 plays from four NFL seasons (2018–2021) for punt and kickoff plays, respectively. In addition, there are very few punt- and kickoff-related touchdowns in the datasets: only 0.23% and 0.8% of plays, respectively. The punt and kickoff data also have different distributions; for example, the true yardage distributions for kickoffs and punts have a similar shape but are shifted relative to each other, as shown in the following figure.

Punts and kickoff return yards distribution

Data preprocessing and feature engineering

First, the tracking data was filtered for just the data related to punts and kickoff returns. The player data was used to derive features for model development:

  • X – Player position along the long axis of the field
  • Y – Player position along the short axis of the field
  • S – Speed in yards/second; replaced by Dis*10 for better accuracy (Dis is the distance traveled in the previous 0.1 seconds)
  • Dir – Angle of player motion (degrees)

From the preceding data, each play was transformed into a 10x11x14 tensor covering 10 offensive players (excluding the ball carrier), 11 defenders, and 14 derived features:

  • sX – x speed of a player
  • sY – y speed of a player
  • s – Speed of a player
  • aX – x acceleration of a player
  • aY – y acceleration of a player
  • relX – x distance of player relative to ball carrier
  • relY – y distance of player relative to ball carrier
  • relSx – x speed of player relative to ball carrier
  • relSy – y speed of player relative to ball carrier
  • relDist – Euclidean distance of player relative to ball carrier
  • oppX – x distance of offense player relative to defense player
  • oppY – y distance of offense player relative to defense player
  • oppSx – x speed of offense player relative to defense player
  • oppSy – y speed of offense player relative to defense player

To augment the data, the X and Y position values were also mirrored to account for right and left field orientations. The data preprocessing and feature engineering were adapted from the winning entry of the NFL Big Data Bowl competition on Kaggle.
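
The post doesn't include the feature extraction code, but the following sketch illustrates how the relative features and the left/right mirroring could be computed. The input layout (dictionaries of per-player arrays), the feature ordering, and the 53.3-yard field width are assumptions made here for illustration only.

import numpy as np

def play_tensor(carrier, offense, defense):
    """Illustrative construction of the 10x11x14 play tensor described above.

    carrier: dict with scalars 'x', 'y', 'sx', 'sy' for the ball carrier
    offense: dict of arrays of shape (10,) for the other offensive players
    defense: dict of arrays of shape (11,) for the defenders
    """
    feats = np.zeros((10, 11, 14), dtype=np.float32)
    for i in range(10):
        for j in range(11):
            # indices 0-4 (sX, sY, s, aX, aY) hold raw kinematics of a player;
            # which player they describe is an assumption, so they are left out here
            feats[i, j, 5] = offense['x'][i] - carrier['x']        # relX
            feats[i, j, 6] = offense['y'][i] - carrier['y']        # relY
            feats[i, j, 7] = offense['sx'][i] - carrier['sx']      # relSx
            feats[i, j, 8] = offense['sy'][i] - carrier['sy']      # relSy
            feats[i, j, 9] = np.hypot(feats[i, j, 5], feats[i, j, 6])  # relDist
            feats[i, j, 10] = offense['x'][i] - defense['x'][j]    # oppX
            feats[i, j, 11] = offense['y'][i] - defense['y'][j]    # oppY
            feats[i, j, 12] = offense['sx'][i] - defense['sx'][j]  # oppSx
            feats[i, j, 13] = offense['sy'][i] - defense['sy'][j]  # oppSy
    return feats

def mirror_left_right(x, y, direction, field_width=53.3):
    """Mirror a position and motion direction across the long axis of the field."""
    return x, field_width - y, (360.0 - direction) % 360.0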

ML methodology and model training

Because we’re interested in all possible outcomes from the play, including the probability of a touchdown, we can’t simply predict the average yards gained as a regression problem. We need to predict the full probability distribution of all possible yard gains, so we framed the problem as a probabilistic prediction.

One way to implement probabilistic predictions is to assign the yards gained to several bins (such as less than 0, 0–1, 1–2, …, 14–15, more than 15) and predict the bin as a classification problem. The downside of this approach is that we want small bins to get a high-definition picture of the distribution, but smaller bins mean fewer data points per bin, so the distribution, especially at the tails, may be poorly estimated and irregular.

Another way to implement probabilistic predictions is to model the output as a continuous probability distribution with a limited number of parameters (for example, a Gaussian or Gamma distribution) and predict the parameters. This approach gives a very high definition and regular picture of the distribution, but is too rigid to fit the true distribution of yards gained, which is multi-modal and heavy tailed.

To get the best of both methods, we use the Spliced Binned-Pareto (SBP) distribution, which uses bins for the center of the distribution, where plenty of data is available, and a Generalized Pareto distribution (GPD) at both ends, where rare but important events such as touchdowns can happen. The GPD has two parameters: one for scale and one for tail heaviness, as seen in the following graph (source: Wikipedia).

By splicing the GPD with the binned distribution (see the following left graph) on both sides, we obtain the following SBP on the right. The lower and upper thresholds where splicing is done are hyperparameters.

Binned and SBP distributions
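
To build intuition for the GPD's two parameters (this is an illustration, not code from the project), the following sketch uses scipy.stats.genpareto, whose shape parameter c controls tail heaviness; the threshold, scale, and shape values are arbitrary.

from scipy.stats import genpareto

# c is the GPD shape (tail heaviness); scale stretches the tail.
# A larger c puts more probability on extreme return yardage beyond the splice threshold.
for c in (0.1, 0.3, 0.6):
    tail_prob = genpareto.sf(40, c, loc=0, scale=5)  # P(exceedance beyond the threshold > 40 yards)
    print(f"shape c={c}: P(exceedance > 40) = {tail_prob:.4f}")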

As a baseline, we used the model that won our NFL Big Data Bowl competition on Kaggle. This model uses CNN layers to extract features from the prepared data and predicts the outcome as a “1 yard per bin” classification problem. For our model, we kept the feature extraction layers from the baseline and only modified the last layer to output SBP parameters instead of probabilities for each bin, as shown in the following figure (image adapted from the “1st place solution The Zoo” post).

Model Architecture

We used the SBP distribution provided by GluonTS. GluonTS is a Python package for probabilistic time series modeling, but the SBP distribution is not specific to time series, and we were able to repurpose it for regression. For more information on how to use GluonTS SBP, see the following demo notebook.

Models were trained and cross-validated on the 2018, 2019, and 2020 seasons and tested on the 2021 season. To avoid leakage during cross-validation, we grouped all plays from the same game into the same fold.

For evaluation, we kept the metric used in the Kaggle competition, the continuous ranked probability score (CRPS), which can be seen as an alternative to the log-likelihood that is more robust to outliers. We also used the Pearson correlation coefficient and the RMSE as general and interpretable accuracy metrics. Furthermore, we looked at the probability of a touchdown and probability plots to evaluate calibration.
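
For reference, the CRPS of a predictive distribution F at an observation y is the integral of (F(z) - 1{z >= y})^2 over z. The following is a minimal numpy sketch of how it can be approximated for a binned forecast; the bin edges, probabilities, and observed value are illustrative, not project code.

import numpy as np

def crps_binned(bin_edges, bin_probs, y):
    """Approximate CRPS of a binned predictive distribution at observation y.

    bin_edges: array of n+1 yard boundaries; bin_probs: array of n probabilities summing to 1.
    """
    right_edges = bin_edges[1:]
    widths = np.diff(bin_edges)
    cdf = np.cumsum(bin_probs)                # predictive CDF evaluated at the right bin edges
    step = (right_edges >= y).astype(float)   # the observation's step CDF at the same points
    return float(np.sum((cdf - step) ** 2 * widths))

# example: a uniform forecast over -10..40 yards and an observed gain of 7 yards
edges = np.arange(-10, 41, dtype=float)
probs = np.full(len(edges) - 1, 1.0 / (len(edges) - 1))
print(crps_binned(edges, probs, y=7.0))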

The model was trained on the CRPS loss using Stochastic Weight Averaging and early stopping.
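
The post doesn't share the exact training loop, but a minimal PyTorch sketch of combining Stochastic Weight Averaging (via torch.optim.swa_utils) with early stopping might look like the following. The model, data loaders, CRPS loss, learning rates, and patience are placeholders, and in practice SWA is usually enabled only after a warm-up period.

import torch
from torch.optim.swa_utils import AveragedModel, SWALR

def train_with_swa(model, train_loader, val_loader, crps_loss, epochs=30, patience=5):
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    swa_model = AveragedModel(model)          # running average of the weights
    swa_scheduler = SWALR(optimizer, swa_lr=1e-4)
    best_val, stale_epochs = float("inf"), 0

    for epoch in range(epochs):
        model.train()
        for x, y in train_loader:
            optimizer.zero_grad()
            loss = crps_loss(model(x), y)
            loss.backward()
            optimizer.step()
        swa_model.update_parameters(model)    # fold this epoch's weights into the average
        swa_scheduler.step()

        model.eval()
        with torch.no_grad():
            val = sum(crps_loss(model(x), y).item() for x, y in val_loader) / len(val_loader)
        if val < best_val:
            best_val, stale_epochs = val, 0
        else:
            stale_epochs += 1
            if stale_epochs >= patience:      # early stopping
                break
    # if the model uses batch norm, refresh its statistics before inference:
    # torch.optim.swa_utils.update_bn(train_loader, swa_model)
    return swa_model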

To deal with the irregularity of the binned part of the output distributions, we used two techniques:

  • A smoothness penalty proportional to the squared difference between two consecutive bins (sketched after this list)
  • Ensembling models trained during cross-validation
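
A minimal sketch of the smoothness penalty, assuming the binned part of the output is available as a per-bin probability vector (names and the weight value are illustrative):

import torch

def smoothness_penalty(bin_probs, weight=5.0):
    """Penalize large jumps between adjacent bin probabilities.

    bin_probs: tensor of shape (batch_size, num_bins), e.g. the softmax of the binned head.
    """
    diffs = bin_probs[:, 1:] - bin_probs[:, :-1]
    return weight * (diffs ** 2).sum(dim=1).mean()

# illustrative total loss: CRPS term plus the smoothness regularizer
# loss = crps_loss(bin_probs, y) + smoothness_penalty(bin_probs, weight=5.0)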

Model performance results

For each dataset, we performed a grid search over the following options:

  • Probabilistic models
    • Baseline was one probability per yard
    • SBP was one probability per yard in the center and a generalized Pareto distribution in the tails
  • Distribution smoothing
    • No smoothing (smoothness penalty = 0)
    • Smoothness penalty = 5
    • Smoothness penalty = 10
  • Training and inference procedure
    • 10 folds cross-validation and ensemble inference (k10)
    • Training on train and validation data for 10 epochs or 20 epochs

Then we looked at the metrics for the top five models sorted by CRPS (lower is better).

For kickoff data, the SBP model slightly outperforms the baseline in terms of CRPS, but more importantly, it estimates the touchdown probability much better (the true probability is 0.80% in the test set). We also see that the best models use 10-fold ensembling (k10) and no smoothness penalty, as shown in the following table.

Training Model Smoothness CRPS RMSE CORR % P(touchdown)%
k10 SBP 0 4.071 9.641 47.15 0.78
k10 Baseline 0 4.074 9.62 47.585 0.306
k10 Baseline 5 4.075 9.626 47.43 0.274
k10 SBP 5 4.079 9.656 46.977 0.682
k10 Baseline 10 4.08 9.621 47.519 0.265

The following plot of observed frequencies and predicted probabilities indicates good calibration for our best model, with an RMSE of 0.27 between the two distributions. Note the high-yardage occurrences (for example, 100 yards) in the tail of the true (blue) empirical distribution, whose probabilities the SBP captures better than the baseline method.

Kickoff observed frequencies and predicted probability distribution

For punt data, the baseline outperforms the SBP, perhaps because extreme yardage has even fewer realizations, so it's a better trade-off to capture the modality of the peaks between 0–10 yards. Also, contrary to the kickoff data, the best model uses a smoothness penalty. The following table summarizes our findings.

Training Model Smoothness CRPS RMSE CORR % P(touchdown)%
k10 Baseline 5 3.961 8.313 35.227 0.547
k10 Baseline 0 3.972 8.346 34.227 0.579
k10 Baseline 10 3.978 8.351 34.079 0.555
k10 SBP 5 3.981 8.342 34.971 0.723
k10 SBP 0 3.991 8.378 33.437 0.677

The following plot of observed frequencies (in blue) and predicted probabilities for the two best punt models indicates that the non-smoothed model (in orange) is slightly better calibrated than the smoothed model (in green) and may be a better choice overall.

Punt true and predicted probabilities

Conclusion

In this post, we showed how to build predictive models with fat-tailed data distribution. We used Spliced Binned-Pareto distribution, implemented in GluonTS, which can robustly model such fat-tailed distributions. We used this technique to build models for punt and kickoff returns. We can apply this solution to similar use cases where there are very few events in the data, but those events have significant impact on the overall performance of the models.

If you would like help with accelerating the use of ML in your products and services, please contact the Amazon ML Solutions Lab program.


About the Authors

Tesfagabir Meharizghi is a Data Scientist at the Amazon ML Solutions Lab where he helps AWS customers across various industries such as healthcare and life sciences, manufacturing, automotive, and sports and media, accelerate their use of machine learning and AWS cloud services to solve their business challenges.

Marc van Oudheusden is a Senior Data Scientist with the Amazon ML Solutions Lab team at Amazon Web Services. He works with AWS customers to solve business problems with artificial intelligence and machine learning. Outside of work you may find him at the beach, playing with his children, surfing or kitesurfing.

Panpan Xu is a Senior Applied Scientist and Manager with the Amazon ML Solutions Lab at AWS. She is working on research and development of Machine Learning algorithms for high-impact customer applications in a variety of industrial verticals to accelerate their AI and cloud adoption. Her research interest includes model interpretability, causal analysis, human-in-the-loop AI and interactive data visualization.

Kyeong Hoon (Jonathan) Jung is a senior software engineer at the National Football League. He has been with the Next Gen Stats team for the last seven years, helping to build out the platform, from streaming the raw data and building microservices to process it, to building APIs that expose the processed data. He has collaborated with the Amazon Machine Learning Solutions Lab in providing clean data for them to work with as well as providing domain knowledge about the data itself. Outside of work, he enjoys cycling in Los Angeles and hiking in the Sierras.

Michael Chi is a Senior Director of Technology overseeing Next Gen Stats and Data Engineering at the National Football League. He has a degree in Mathematics and Computer Science from the University of Illinois at Urbana Champaign. Michael first joined the NFL in 2007 and has primarily focused on technology and platforms for football statistics. In his spare time, he enjoys spending time with his family outdoors.

  Mike Band is a Senior Manager of Research and Analytics for Next Gen Stats at the National Football League. Since joining the team in 2018, he has been responsible for ideation, development, and communication of key stats and insights derived from player-tracking data for fans, NFL broadcast partners, and the 32 clubs alike. Mike brings a wealth of knowledge and experience to the team with a master’s degree in analytics from the University of Chicago, a bachelor’s degree in sport management from the University of Florida, and experience in both the scouting department of the Minnesota Vikings and the recruiting department of Florida Gator Football.


Analyze and visualize multi-camera events using Amazon SageMaker Studio Lab


The National Football League (NFL) is one of the most popular sports leagues in the United States and is the most valuable sports league in the world. The NFL, Biocore, and AWS are committed to advancing human understanding around the diagnosis, prevention, and treatment of sports-related injuries to make the game of football safer. More information regarding the NFL Player Health and Safety efforts is available on the NFL website.

The AWS Professional Services team has partnered with the NFL and Biocore to provide machine learning (ML)-based solutions for identifying helmet impacts from game footage using computer vision (CV) techniques. With multiple camera views available from each game, we have developed solutions to identify helmet impacts from each of these views and merge the helmet impact results.

The motivation for using multiple camera views is that information is limited when impact events are captured with only one view: some players might occlude each other or be blocked by other objects on the field, so adding more perspectives allows our ML system to identify impacts that aren't visible in a single view. To showcase the results of our fusion process and how the team uses visualization tools to help evaluate model performance, we developed a codebase to visually overlay the multi-view detection results. This process helps identify the actual number of impacts individual players experience by removing duplicate impacts detected in multiple views.

In this post, we use the publicly available dataset from the NFL – Impact Detection Kaggle competition and show results for merging two views. The dataset includes helmet bounding boxes at every frame and impact labels found in each video. In particular, we focus on deduplicating and visualizing videos with the ID 57583_000082 in endzone and sideline views. You can download the endzone and sideline videos, and also the ground truth labels.

Prerequisites

The solution requires the following:

  • A SageMaker Studio Lab account (or another Jupyter environment)
  • The endzone and sideline videos and the ground truth labels from the NFL – Impact Detection Kaggle competition, described in the previous section

Get started on SageMaker Studio Lab and install the required packages

You can run the notebook from the GitHub repository or from SageMaker Studio Lab. In this post, we run the notebook from a SageMaker Studio Lab environment. We chose SageMaker Studio Lab because it's free and provides powerful CPU and GPU user sessions along with 15 GB of persistent storage that automatically saves your environment, enabling you to pick up where you left off. To use SageMaker Studio Lab, request and set up a new account. After the account is approved, complete the following steps:

  1. Visit the aws-samples GitHub repo.
  2. In the README section, choose Open Studio Lab.

sagemaker-studio-button

This redirects you to your SageMaker Studio Lab environment.

  3. Select your CPU compute type, then choose Start Runtime.
  4. After the runtime starts, choose Copy to Project, which opens a new window with the Jupyter Lab environment.

Now you’re ready to use the notebook!

  5. Open fuse_and_visualize_multiview_impacts.ipynb and follow the instructions in the notebook.

The first cell in the notebook installs the necessary Python packages such as pandas and OpenCV:

%pip install pandas
%pip install opencv-contrib-python-headless

Import all the necessary Python packages and set pandas options for better visualization experience:

import os
import cv2
import pandas as pd
import numpy as np
pd.set_option('mode.chained_assignment', None)

We use pandas for ingesting and parsing through the CSV file with the annotated helmet bounding boxes as well as impacts. We use NumPy mainly for manipulating arrays and matrices. We use OpenCV for reading, writing, and manipulating image data in Python.

Prepare the data by fusing results from two views

To fuse the two perspectives together, we use the train_labels.csv from the Kaggle competition as an example because it contains ground truth impacts from both the endzone and sideline. The following function takes the input dataset and outputs a fused dataframe that is deduplicated for all the plays in the input dataset:

def prep_data(df):
    df['game_play'] = df['gameKey'].astype('str') + '_' + df['playID'].astype('str').str.zfill(6)
    return df

def dedup_view(df, windows):
    # sort by frame and keep only the columns used for matching
    df = df.sort_values(by='frame')
    view_columns = ['frame', 'left', 'width', 'top', 'height', 'video']
    common_columns = ['game_play', 'label', 'view', 'impactType']
    label_cleaned = df[view_columns + common_columns]
    
    # rename columns
    sideline_column_rename = {col: 'Sideline_' + col for col in view_columns}
    endzone_column_rename = {col: 'Endzone_' + col for col in view_columns}
    sideline_columns = list(sideline_column_rename.values())

    # create two dataframes, one for sideline, one for endzone
    label_endzone = label_cleaned.query('view == "Endzone"')
    label_endzone.rename(columns=endzone_column_rename, inplace=True)
    label_sideline = label_cleaned.query('view == "Sideline"')
    label_sideline.rename(columns=sideline_column_rename, inplace=True)

    # initialize duplicate flags and add empty sideline columns to the endzone dataframe
    label_sideline['is_dup'] = False
    for column in sideline_columns:
        label_endzone[column] = np.nan
    label_endzone['is_dup'] = False

    # iterate over endzone rows to find matching sideline rows and dedup
    for index, row in label_endzone.iterrows():
        player = row['label']
        frame = row['Endzone_frame']
        impact_type = row['impactType']
        sideline_row = label_sideline[(label_sideline['label'] == player) & 
                                      ((label_sideline['Sideline_frame'] >= frame - windows // 2) &
                                       (label_sideline['Sideline_frame'] <= frame + windows // 2 + 1)) &
                                      (label_sideline['is_dup'] == False) & 
                                      (label_sideline['impactType'] == impact_type)]

        if len(sideline_row) > 0:
            sideline_index = sideline_row.index[0]
            label_sideline['is_dup'].loc[sideline_index] = True

            for col in sideline_columns:
                label_endzone[col].loc[index] = sideline_row.iloc[0][col]
            label_endzone['is_dup'].loc[index] = True

    # combine the unmatched sideline rows with the (matched and unmatched) endzone rows
    not_dup_sideline = label_sideline[label_sideline['is_dup'] == False]
    final_output = pd.concat([not_dup_sideline, label_endzone])
    return final_output

def fuse_df(raw_df, windows):
    outputs = []
    all_game_play = raw_df['game_play'].unique()
    for game_play in all_game_play:
        df = raw_df.query('game_play ==@game_play')
        output = dedup_view(df, windows)
        outputs.append(output)

    output_df = pd.concat(outputs)
    output_df['gameKey'] = output_df['game_play'].apply(lambda x: x.split('_')[0]).map(int)
    output_df['playID'] = output_df['game_play'].apply(lambda x: x.split('_')[1]).map(int)
    return output_df

To run these functions, we use the following code block to provide the location of the train_labels.csv data, perform data preparation to add the game_play column, and extract only the impact rows. We save the output to a dataframe variable called fused_df.

# read the annotated impact data from train_labels.csv
ground_truth = pd.read_csv('train_labels.csv')

# prepare game_play column using pipe(prep_data) function in pandas then filter the dataframe for just rows with impacts
ground_truth = ground_truth.pipe(prep_data).query('impact == 1')

# loop over all the unique game_plays and deduplicate the impact results from sideline and endzone
fused_df = fuse_df(ground_truth, windows=30)
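
As a quick, illustrative sanity check of the fused output, you can count how many impacts were matched across both views versus kept from a single view:

# impacts matched across both views carry is_dup == True after fusion
print(fused_df['is_dup'].value_counts())
print(f"{int(fused_df['is_dup'].sum())} impacts were matched in both the endzone and sideline views")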

The following screenshot shows the ground truth.

The following screenshot shows the fused dataframe examples.

Graph and video code

After we fuse the impact results, we use the generated fused_df to overlay the results onto our endzone and sideline videos and merge the two views together. We use the following functions for this; the inputs needed are the paths to the endzone video, the sideline video, the fused_df dataframe, and the final output path for the newly generated video. The functions used in this section are described in the markdown sections of the notebook used in SageMaker Studio Lab.

def get_video_and_metadata(vid_path): 
    vid = cv2.VideoCapture(vid_path)
    total_frame_number = vid.get(cv2.CAP_PROP_FRAME_COUNT)
    width = int(vid.get(cv2.CAP_PROP_FRAME_WIDTH))
    height = int(vid.get(cv2.CAP_PROP_FRAME_HEIGHT))
    fps = vid.get(cv2.CAP_PROP_FPS)
    return vid, total_frame_number, width, height, fps

def overlay_impacts(frame, fused_df, game_key, play_id, frame_cnt, h1):
    # look for duplicates 
    duplicates = fused_df.query(f"gameKey == {int(game_key)} and "
                                f"playID == {int(play_id)} and "
                                "is_dup == True and "
                                "Sideline_frame == @frame_cnt")

    frame_has_impact = False 
    
    if len(duplicates) > 0: 
        for duplicate in duplicates.itertuples(index=False): 
            if frame_cnt == duplicate.Sideline_frame: 
                frame_has_impact = True 

            if frame_has_impact: 
                cv2.rectangle(frame, #frame to be edited 
                              (int(duplicate.Sideline_left), int(duplicate.Sideline_top)), #(x,y) of top left corner 
                              (int(duplicate.Sideline_left) + int(duplicate.Sideline_width), int(duplicate.Sideline_top) + int(duplicate.Sideline_height)), #(x,y) of bottom right corner 
                              (0,0,255), #RED boxes
                              thickness=3)

                cv2.rectangle(frame, #frame to be edited
                              (int(duplicate.Endzone_left), int(duplicate.Endzone_top)+ h1), #(x,y) of top left corner
                              (int(duplicate.Endzone_left) + int(duplicate.Endzone_width), int(duplicate.Endzone_top) + int(duplicate.Endzone_height) + h1), #(x,y) of bottom right corner
                              (0,0,255), #RED boxes
                              thickness=3)
 
                cv2.line(frame, #frame to be edited
                         (int(duplicate.Sideline_left), int(duplicate.Sideline_top)), #(x,y) of point 1 in a line
                         (int(duplicate.Endzone_left), int(duplicate.Endzone_top) + h1), #(x,y) of point 2 in a line
                         (255, 255, 255), # WHITE lines
                         thickness=4)
 
    else:
        # if no duplicates, look for sideline-only then endzone-only impacts and add them to the view
        sl_impacts = fused_df.query(f"gameKey == {int(game_key)} and "
                                    f"playID == {int(play_id)} and "
                                    "is_dup == False and "
                                    "view == 'Sideline' and "
                                    "Sideline_frame == @frame_cnt")

        if len(sl_impacts) > 0:
            for impact in sl_impacts.itertuples(index=False):
                if frame_cnt == impact.Sideline_frame:
                    frame_has_impact = True

                if frame_has_impact:
                    cv2.rectangle(frame, #frame to be edited
                                  (int(impact.Sideline_left), int(impact.Sideline_top)), #(x,y) of top left corner
                                  (int(impact.Sideline_left) + int(impact.Sideline_width), int(impact.Sideline_top) + int(impact.Sideline_height)), #(x,y) of bottom right corner
                                  (0, 255, 255), #YELLOW BOXES
                                  thickness=3)

        ez_impacts = fused_df.query(f"gameKey == {int(game_key)} and "
                                    f"playID == {int(play_id)} and "
                                    "is_dup == False and "
                                    "view == 'Endzone' and "
                                    "Endzone_frame == @frame_cnt")

        if len(ez_impacts) > 0:
            for impact in ez_impacts.itertuples(index=False):
                if frame_cnt == impact.Endzone_frame:
                    frame_has_impact = True

                if frame_has_impact:
                    cv2.rectangle(frame, #frame to be edited
                                  (int(impact.Endzone_left), int(impact.Endzone_top) + h1), #(x,y) of top left corner
                                  (int(impact.Endzone_left) + int(impact.Endzone_width), int(impact.Endzone_top) + int(impact.Endzone_height) + h1), #(x,y) of bottom right corner
                                  (0, 255, 255), #YELLOW BOXES
                                  thickness=3)

    return frame, frame_has_impact

def generate_impact_video(ez_vid_path:str,
                          sl_vid_path:str,
                          fused_df:pd.DataFrame,
                          output_path:str,
                          freeze_impacts=True):
    
    # define the video codec to be used for the output video
    VIDEO_CODEC = "MP4V"

    # parse game_key and play_id information from the name of the files
    game_key = os.path.basename(ez_vid_path).split('_')[0] # parse game_key
    play_id = os.path.basename(ez_vid_path).split('_')[1] # parse play_id
 
    # get metadata such as total frame number, width, height and frames per second (FPS) from endzone (ez) and sideline (sl) videos
    ez_vid, ez_total_frame_number, ez_width, ez_height, ez_fps = get_video_and_metadata(ez_vid_path)
    sl_vid, sl_total_frame_number, sl_width, sl_height, sl_fps = get_video_and_metadata(sl_vid_path)

    # define a video writer for the output video
    output_video = cv2.VideoWriter(output_path, #output file name
                                   cv2.VideoWriter_fourcc(*VIDEO_CODEC), #Video codec
                                   ez_fps, #frames per second in the output video
                                  (ez_width, ez_height+sl_height)) # frame size with stacking video vertically

    # find shorter video and use the total frame number from the shorter video for the output video
    total_frame_number = int(min(ez_total_frame_number, sl_total_frame_number))

    # iterate through each frame from endzone and sideline
    for frame_cnt in range(total_frame_number):
        frame_has_impact = False
        frame_near_impact = False

        # reading frames from both endzone and sideline
        ez_ret, ez_frame = ez_vid.read()
        sl_ret, sl_frame = sl_vid.read()

        # creating strings to be added to the output frames
        img_name = f"Game key: {game_key}, Play ID: {play_id}, Frame: {frame_cnt}"
        video_frame = f'{game_key}_{play_id}_{frame_cnt}'

        if ez_ret == True and sl_ret == True:
            h, w, c = ez_frame.shape
            h1,w1,c1 = sl_frame.shape
 
            if h != h1 or w != w1: # resize images if they're different
                ez_frame = cv2.resize(ez_frame,(w1,h1))
 
            frame = np.concatenate((sl_frame, ez_frame), axis=0) # stack the frames vertically

            frame, frame_has_impact = overlay_impacts(frame, fused_df, game_key, play_id, frame_cnt, h1)

            cv2.putText(frame, #image frame to be modified
                        img_name, #string to be inserted
                        (30, 30), #(x,y) location of the string
                        cv2.FONT_HERSHEY_SIMPLEX, #font
                        1, #scale
                        (255, 255, 255), #WHITE letters
                        thickness=2)

            cv2.putText(frame, #image frame to be modified
                        str(frame_cnt), #frame count string to be inserted
                        (w1-75, h1-20), #(x,y) location of the string in the top view
                        cv2.FONT_HERSHEY_SIMPLEX, #font
                        1, #scale
                        (255, 255, 255), # WHITE letters
                        thickness=2)

            cv2.putText(frame, #image frame to be modified
                        str(frame_cnt), #frame count string to be inserted
                        (w1-75, h1+h-20), #(x,y) location of the string in the bottom view
                        cv2.FONT_HERSHEY_SIMPLEX, #font
                        1, #scale
                        (255, 255, 255), # WHITE letters
                        thickness=2)

            output_video.write(frame)

            # Freeze for 60 frames on impacts
            if frame_has_impact and freeze_impacts:
                for _ in range(60):
                    output_video.write(frame)
        else:
            break


    output_video.release()
    return

To run these functions, we can provide an input as shown in the following code, which generates a video called output.mp4:

generate_impact_video('57583_000082_Endzone.mp4',
                      '57583_000082_Sideline.mp4',
                      fused_df,
                      'output.mp4')

This generates a video as shown in the following example, where the red bounding boxes are impacts found in both endzone and sideline views, and the yellow bounding boxes are impacts that are found in just one view in either the endzone or sideline.

Conclusion

In this post, we demonstrated how the NFL, Biocore, and the AWS ProServe teams are working together to improve impact detection by fusing results from multiple views. This allows the teams to debug and visualize how the model is performing qualitatively. This process can easily be scaled up to three or more views; in our projects, we have utilized up to seven different views. Detecting helmet impacts by watching videos from only one view can be difficult due to view obstruction, but detecting impacts from multiple views and fusing the results allows us to improve our model performance.

To experiment with this solution, visit the aws-samples GitHub repo and refer to the fuse_and_visualize_multiview_impacts.ipynb notebook. Similar techniques can also be applied to other industries such as manufacturing, retail, and security, where having multiple views would benefit the ML system to better identify targets with a more comprehensive view.

For more information regarding NFL Player Health and Safety, visit the NFL website and NFL Explained: Innovation in Player Health & Safety.


About the authors

Chris Boomhower is a Machine Learning Engineer at AWS Professional Services. Chris has over 6 years of experience developing supervised and unsupervised machine learning solutions across various industries. Today, he spends most of his time helping customers in the sports, healthcare, and agriculture industries design and build scalable, end-to-end machine learning solutions.

Ben Fenker is a Senior Data Scientist in AWS Professional Services and has helped customers build and deploy ML solutions in industries ranging from sports to healthcare to manufacturing. He has a Ph.D. in physics from Texas A&M University and 6 years of industry experience. Ben enjoys baseball, reading, and raising his kids.

Sam Huddleston is a Principal Data Scientist at Biocore LLC, who serves as the Technology Lead for the NFL’s Digital Athlete program. Biocore is a team of world-class engineers based in Charlottesville, Virginia, that provides research, testing, biomechanics expertise, modeling and other engineering services to clients dedicated to the understanding and reduction of injury.

Jarvis Lee is a Senior Data Scientist with AWS Professional Services. He has been with AWS for over five years, working with customers on machine learning and computer vision problems. Outside of work, he enjoys riding bicycles.

Tyler Mullenbach is the Global Practice Lead for ML with AWS Professional Services. He is responsible for driving the strategic direction of ML for Professional Services and ensuring that customers realize transformative business achievements through the adoption of ML technologies.

Kevin Song is a Data Scientist at AWS Professional Services. He holds a PhD in Biophysics and has over 5 years of industry experience in building computer vision and machine learning solutions.

Betty Zhang is a data scientist with 10 years of experience in data and technology. Her passion is to build innovative machine learning solutions to drive transformational changes for companies. In her spare time, she enjoys traveling, reading and learning about new technologies.


How to decide between Amazon Rekognition image and video API for video moderation


Almost 80% of today's web content is user-generated, creating a deluge of content that organizations struggle to analyze with human-only processes. The availability of this consumer-generated information helps people make decisions, from buying a new pair of jeans to securing a home loan. In a recent survey, 79% of consumers stated they rely on user videos, comments, and reviews more than ever, 78% said that brands are responsible for moderating such content, and 40% said they would disengage with a brand after a single exposure to toxic content.

Amazon Rekognition has two sets of APIs that help you moderate images or videos to keep digital communities safe and engaged.

One approach to moderate videos is to model video data as a sample of image frames and use image content moderation models to process the frames individually. This approach allows the reuse of image-based models. Some customers have asked if they could use this approach to moderate videos by sampling image frames and sending them to the Amazon Rekognition image moderation API. They are curious about how this solution compares with the Amazon Rekognition video moderation API.

We recommend using the Amazon Rekognition video moderation API to moderate video content. It’s designed and optimized for video moderation, offering better performance and lower costs. However, there are specific use cases where the image API solution is optimal.

This post compares the two video moderation solutions in terms of accuracy, cost, performance, and architecture complexity to help you choose the best solution for your use case.

Moderate videos using the video moderation API

The Amazon Rekognition video content moderation API is the standard solution for detecting inappropriate or unwanted content in videos. It operates asynchronously on video content stored in an Amazon Simple Storage Service (Amazon S3) bucket. The analysis results are returned as an array of moderation labels, each with a confidence score and a timestamp indicating when the label was detected.

The video content moderation API uses the same machine learning (ML) model that powers image moderation. The output is filtered to remove noisy false positive results, and the workflow is optimized for latency by parallelizing operations such as decoding, frame extraction, and inference.

The following diagram shows the logical steps of how to use the Amazon Rekognition video moderation API to moderate videos.

Rekognition Content Moderation Video API diagram

The steps are as follows:

  1. Upload videos to an S3 bucket.
  2. Call the video moderation API from an AWS Lambda function (or a customized script on premises) with the video file location as a parameter; a minimal boto3 sketch follows this list. The API manages the heavy lifting of video decoding, sampling, and inference. You can either implement heartbeat logic to check the moderation job status until it completes, or use Amazon Simple Notification Service (Amazon SNS) to implement an event-driven pattern. For detailed examples of the video moderation API, refer to the following Jupyter notebook.
  3. Store the moderation result as a file in an S3 bucket or database.
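
The following is a minimal boto3 sketch of step 2 that polls until the asynchronous job finishes. The bucket and object names are placeholders, and a production setup would typically rely on Amazon SNS notifications instead of polling.

import time
import boto3

rekognition = boto3.client('rekognition')

# start the asynchronous moderation job (bucket and key are placeholders)
job = rekognition.start_content_moderation(
    Video={'S3Object': {'Bucket': 'my-video-bucket', 'Name': 'videos/sample.mp4'}},
    MinConfidence=50,
)

# simple heartbeat loop; an event-driven pattern with Amazon SNS avoids polling
while True:
    result = rekognition.get_content_moderation(JobId=job['JobId'])
    if result['JobStatus'] in ('SUCCEEDED', 'FAILED'):
        break
    time.sleep(10)

# print each detected label with its timestamp (pagination via NextToken omitted)
for label in result.get('ModerationLabels', []):
    print(label['Timestamp'], label['ModerationLabel']['Name'], label['ModerationLabel']['Confidence'])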

Moderate videos using the image moderation API

Instead of using the video content moderation API, some customers choose to independently sample frames from videos and detect inappropriate content by sending the images to the Amazon Rekognition DetectModerationLabels API. Image results are returned in real time with labels for inappropriate or offensive content, along with confidence scores.

The following diagram shows the logical steps of the image API solution.

Rekognition Content Moderation Video Image Sampling Diagram

The steps are as follows (a minimal code sketch follows the list):

  1. Use a customized application or script as an orchestrator to load the video to the local file system.
  2. Decode the video.
  3. Sample image frames from the video at a chosen interval, such as two frames per second, then iterate through all the images to:
    a. Send each image frame to the image moderation API.
    b. Store the moderation results in a file or database.
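
The following is a minimal sketch of steps 2–3 using OpenCV and boto3; the file name, sampling rate, and confidence threshold are placeholders.

import boto3
import cv2

rekognition = boto3.client('rekognition')
video = cv2.VideoCapture('sample.mp4')            # placeholder local file
fps = video.get(cv2.CAP_PROP_FPS)
step = max(int(fps // 2), 1)                      # keep roughly two frames per second

results, frame_idx = [], 0
while True:
    ok, frame = video.read()
    if not ok:
        break
    if frame_idx % step == 0:
        _, jpeg = cv2.imencode('.jpg', frame)     # encode the sampled frame as JPEG bytes
        response = rekognition.detect_moderation_labels(
            Image={'Bytes': jpeg.tobytes()},
            MinConfidence=50,
        )
        # record the frame timestamp (ms) and any labels returned for it
        results.append((int(video.get(cv2.CAP_PROP_POS_MSEC)), response['ModerationLabels']))
    frame_idx += 1
video.release()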

Compared with the video API solution, which requires only a light Lambda function to orchestrate API calls, the image sampling solution is CPU intensive and requires more compute resources. You can host the application using AWS services such as Lambda, Amazon Elastic Container Service (Amazon ECS), Amazon Elastic Kubernetes Service (Amazon EKS), AWS Fargate, or Amazon Elastic Compute Cloud (Amazon EC2).

Evaluation dataset

To evaluate both solutions, we use a sample dataset consisting of 200 short-form videos. The videos range from 10 seconds to 45 minutes. 60% of the videos are less than 2 minutes long. This sample dataset is used to test the performance, cost, and accuracy metrics for both solutions. The results compare the Amazon Rekognition image API sampling solution to the video API solution.

To test the image API solution, we use open-source libraries (ffmpeg and OpenCV) to sample images at a rate of two frames per second (one frame every 500 milliseconds). This rate mimics the sampling frequency used by the video content moderation API. Each image is sent to the image content moderation API to generate labels.

To test the video API solution, we send the videos directly to the video content moderation API to generate labels.

Results summary

We focus on the following key results:

  • Accuracy – Both solutions offer similar accuracy (false positive and false negative percentages) using the same sampling frequency of two frames per second
  • Cost – The image API sampling solution is more expensive than the video API solution using the same sampling frequency of two frames per second
    • The image API sampling solution cost can be reduced by sampling fewer frames per second
  • Performance – On average, the video API has a 425% faster processing time than the image API solution for the sample dataset
    • The image API solution performs better in situations with a high frame sample interval and on videos less than 90 seconds
  • Architecture complexity – The video API solution has a low architecture complexity, whereas the image API sampling solution has a medium architecture complexity

Accuracy

We tested both solutions using the sample set and the same sampling frequency of two frames per second. The results demonstrated that both solutions provide a similar false positive and true positive ratio. This result is expected because under the hood, Amazon Rekognition uses the same ML model for both the video and image moderation APIs.

To learn more about metrics for evaluating content moderation, refer to Metrics for evaluating content moderation in Amazon Rekognition and other content moderation services.

Cost

The cost analysis demonstrates that the image API solution is more expensive than the video API solution if you use the same sampling frequency of two frames per second. The image API solution can be more cost effective if you reduce the number of frames sampled per second.

The two primary factors that impact the cost of a content moderation solution are the Amazon Rekognition API costs and the compute costs. The default pricing is $0.10 per minute of video for the video content moderation API and $0.001 per image for the image content moderation API. A 60-second video produces 120 frames at a rate of two frames per second, so the video API costs $0.10 to moderate the video, whereas the image API costs $0.12.

The price calculation is based on the official price in Region us-east-1 at the time of writing this post. For more information, refer to Amazon Rekognition pricing.

The cost analysis looks at the total cost to generate content moderation labels for the 200 videos in the sample set. The calculations are based on us-east-1 pricing. If you’re using another Region, modify the parameters with the pricing for that Region. The 200 videos contain 4271.39 minutes of content and generate 512,567 image frames at a sampling rate of two frames per second.
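
As a quick check, the following calculation reproduces the Amazon Rekognition line items in the cost table that follows (compute costs are estimated separately, as described next):

video_minutes = 4271.39
image_frames = 512_567

video_api_cost = video_minutes * 0.10          # $0.10 per minute of video
image_api_cost_2fps = image_frames * 0.001     # $0.001 per image at two frames per second
image_api_cost_1fps = image_api_cost_2fps / 2  # half as many frames at one frame per second

print(f"Video API: ${video_api_cost:,.2f}")               # ~$427.14
print(f"Image API (2 fps): ${image_api_cost_2fps:,.2f}")  # ~$512.57
print(f"Image API (1 fps): ${image_api_cost_1fps:,.2f}")  # ~$256.28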

This comparison doesn't consider other costs, such as Amazon S3 storage. We use Lambda as an example to calculate the AWS compute cost. Compute costs take into account the number of requests to Lambda and AWS Step Functions needed to run the analysis. The Lambda memory/CPU setting is estimated based on the Amazon EC2 specifications; this estimate uses a 4 GB, 2-second Lambda invocation per image API call. Lambda functions have a maximum invocation timeout of 15 minutes, so for longer videos you may need to implement iteration logic using Step Functions to reduce the number of frames processed per Lambda call. The actual Lambda settings and cost patterns may differ depending on your requirements; it's recommended to test the solution end to end for a more accurate cost estimate.

The following table summarizes the costs.

Type Amazon Rekognition Costs Compute Costs Total Cost
Video API Solution $427.14 $0 (Free tier) $427.14
Image API Solution: Two frames per second $512.57 $164.23 $676.80
Image API Solution: One frame per second $256.28 $82.12 $338.40

Performance

On average, the video API solution has a four times faster processing time than the image API solution. The image API solution performs better in situations with a high frame sample interval and on videos shorter than 90 seconds.

This analysis measures performance as the average processing time in seconds per video. It looks at the total and average time to generate content moderation labels for the 200 videos in the sample set. The processing time is measured from the video upload to the result output and includes each step in the image sampling and video API process.

The video API solution has an average processing time of 35.2 seconds per video for the sample set. This is compared to the image API solution with an average processing time of 156.24 seconds per video for the sample set. On average, the video API performs four times faster than the image API solution. The following table summarizes these findings.

Type Average Processing Time (All Videos) Average Processing Time (Videos Under 1.5 Minutes)
Video API Solution 35.2 seconds 24.05 seconds
Image API Solution: Two frames per second 156.24 seconds 8.45 seconds
Difference 425% -185%

The image API performs better than the video API when the video is shorter than 90 seconds because the video API queues its tasks, which adds lead time. The image API can also perform better if you use a lower sampling frequency: increasing the frame interval to over 5 seconds can decrease the processing time by 6–10 times. However, increasing the interval introduces the risk of missing inappropriate content that appears between sampled frames.

Architecture complexity

The video API solution has a low architecture complexity. You can set up a serverless pipeline or run a script to retrieve content moderation results. Amazon Rekognition manages the heavy computing and inference. The application orchestrating the Amazon Rekognition APIs can be hosted on a light machine.

The image API solution has a medium architecture complexity. The application logic has to orchestrate additional steps to store videos on the local drive, run image processing to capture frames, and call the image API. The server hosting the application requires higher compute capacity to support the local image processing. For the evaluation, we launched an EC2 instance with 4 vCPUs and 8 GB of RAM to support two parallel threads. The higher compute requirements may lead to additional operational overhead.

Optimal use cases for the image API solution

The image API solution is ideal for three specific use cases when processing videos.

The first is real-time video streaming. You can capture image frames from a live video stream and send the images to the image moderation API.

The second use case is content moderation with a low frame sampling rate requirement. The image API solution is more cost-effective and performant if you sample frames at a low frequency. It’s important to note that there will be a trade-off between cost and accuracy. Sampling frames at a lower rate may increase the risk of missing frames with inappropriate content.

The third use case is for the early detection of inappropriate content in video. The image API solution is flexible and allows you to stop processing and flag the video early on, saving cost and time.

Conclusion

The video moderation API is ideal for most video moderation use cases. It’s more cost effective and performant than the image API solution when you sample frames at a frequency such as two frames per second. Additionally, it has a low architectural complexity and reduced operational overhead requirements.

The following table summarizes our findings to help you maximize the use of the Amazon Rekognition image and video APIs for your specific video moderation use cases. Although these results are averages achieved during testing and by some of our customers, they should give you ideas to balance the use of each API.

Criterion Video API Solution Image API Solution
Accuracy Same accuracy Same accuracy
Cost Lower cost using the default image sampling interval Lower cost if you reduce the number of frames sampled per second (at the expense of accuracy)
Performance Faster for videos longer than 90 seconds Faster for videos shorter than 90 seconds
Architecture Complexity Low complexity Medium complexity

Amazon Rekognition content moderation can not only help your business keep customers safe and engaged, but also contribute to your ongoing efforts to maximize the return on your content moderation investment. Learn more about Content Moderation on AWS and our Content Moderation ML use cases.


About the authors

Lana Zhang is a Sr. Solutions Architect at the AWS WWSO AI Services team, with expertise in AI and ML for content moderation and computer vision. She is passionate about promoting AWS AI services and helping customers transform their business solutions.

Brigit Brown is a Solutions Architect at Amazon Web Services. Brigit is passionate about helping customers find innovative solutions to complex business challenges using machine learning and artificial intelligence. Her core areas of depth are natural language processing and content moderation.


Scaling distributed training with AWS Trainium and Amazon EKS


Recent developments in deep learning have led to increasingly large models such as GPT-3, BLOOM, and OPT, some of which are already in excess of 100 billion parameters. Although larger models tend to be more powerful, training such models requires significant computational resources. Even with the use of advanced distributed training libraries like FSDP and DeepSpeed, it’s common for training jobs to require hundreds of accelerator devices for several weeks or months at a time.

In late 2022, AWS announced the general availability of Amazon EC2 Trn1 instances powered by AWS Trainium—a purpose-built machine learning (ML) accelerator optimized to provide a high-performance, cost-effective, and massively scalable platform for training deep learning models in the cloud. Trn1 instances are available in a number of sizes (see the following table), with up to 16 Trainium accelerators per instance.

Instance Size Trainium Accelerators Accelerator Memory (GB) vCPUs Instance Memory (GiB) Network Bandwidth (Gbps)
trn1.2xlarge 1 32 8 32 Up to 12.5
trn1.32xlarge 16 512 128 512 800
trn1n.32xlarge (coming soon) 16 512 128 512 1600

Trn1 instances can either be deployed as standalone instances for smaller training jobs, or in highly scalable ultraclusters that support distributed training across tens of thousands of Trainium accelerators. All Trn1 instances support the standalone configuration, whereas Trn1 ultraclusters require trn1.32xlarge or trn1n.32xlarge instances. In an ultracluster, multiple Trn1 instances are co-located in a given AWS Availability Zone and are connected with high-speed, low-latency, Elastic Fabric Adapter (EFA) networking that provides 800 Gbps of nonblocking network bandwidth per instance for collective compute operations. The trn1n.32xlarge instance type, launching in early 2023, will increase this bandwidth to 1600 Gbps per instance.

Many enterprise customers choose to deploy their deep learning workloads using Kubernetes—the de facto standard for container orchestration in the cloud. AWS customers often deploy these workloads using Amazon Elastic Kubernetes Service (Amazon EKS). Amazon EKS is a managed Kubernetes service that simplifies the creation, configuration, lifecycle, and monitoring of Kubernetes clusters while still offering the full flexibility of upstream Kubernetes.

Today, we are excited to announce official support for distributed training jobs using Amazon EKS and EC2 Trn1 instances. With this announcement, you can now easily run large-scale containerized training jobs within Amazon EKS while taking full advantage of the price-performance, scalability, and ease of use offered by Trn1 instances.

Along with this announcement, we are also publishing a detailed tutorial that guides you through the steps required to run a multi-instance distributed training job (BERT phase 1 pre-training) using Amazon EKS and Trn1 instances. In this post, you will learn about the solution architecture and review several key steps from the tutorial. Refer to the official tutorial repository for the complete end-to-end workflow.

To follow along, broad familiarity with core AWS services such as Amazon Elastic Compute Cloud (Amazon EC2) and Amazon EKS is assumed, and basic familiarity with deep learning and PyTorch is helpful.

Solution architecture

The following diagram illustrates the solution architecture.

The solution consists of the following main components:

  • An EKS cluster
  • An EKS node group consisting of trn1.32xlarge instances
  • The AWS Neuron SDK
  • EKS plugins for Neuron and EFA
  • An Amazon Elastic Container Registry (Amazon ECR) repository
  • A training container image
  • An Amazon FSx for Lustre file system
  • A Volcano batch scheduler and etcd server
  • The TorchX universal job launcher
  • The TorchX DDP module for Trainium

At the heart of the solution is an EKS cluster that provides you with core Kubernetes management functionality via an EKS service endpoint. One of the benefits of Amazon EKS is that the service actively monitors and scales the control plane based on load, which ensures high performance for large workloads such as distributed training. Inside the EKS cluster is a node group consisting of two or more trn1.32xlarge Trainium-based instances residing in the same Availability Zone.

The Neuron SDK is the software stack that provides the driver, compiler, runtime, framework integration (for example, PyTorch Neuron), and user tools that allow you to access the benefits of the Trainium accelerators. The Neuron device driver runs directly on the EKS nodes (Trn1 instances) and provides access to the Trainium chips from within the training containers that are launched on the nodes. Neuron and EFA plugins are installed within the EKS cluster to provide access to the Trainium chips and EFA networking devices required for distributed training.

An ECR repository is used to store the training container images. These images contain the Neuron SDK (excluding the Neuron driver, which runs directly on the Trn1 instances), PyTorch training script, and required dependencies. When a training job is launched on the EKS cluster, the container images are first pulled from Amazon ECR onto the EKS nodes, and the PyTorch worker containers are then instantiated from the images.

Shared storage is provided using a high-performance FSx for Lustre file system that exists in the same Availability Zone as the trn1.32xlarge instances. Creation and attachment of the FSx for Lustre file system to the EKS cluster is mediated by the Amazon FSx for Lustre CSI driver. In this solution, the shared storage is used to store the training dataset and any logs or artifacts created during the training process.

The solution uses the TorchX universal job launcher to launch distributed training jobs within Amazon EKS. TorchX has two important dependencies: the Volcano batch scheduler and the etcd server. Volcano handles the scheduling and queuing of training jobs, while the etcd server is a key-value store used by TorchElastic for synchronization and peer discovery during job startup.

When a training job is launched using TorchX, the launch command uses the provided TorchX distributed DDP module for Trainium to configure the overall training job and then run the appropriate torchrun commands on each of the PyTorch worker pods. When a job is running, it can be monitored using standard Kubernetes tools (such as kubectl) or via standard ML toolsets such as TensorBoard.

Solution overview

Let’s look at the important steps of this solution. Throughout this overview, we refer to the Launch a Multi-Node PyTorch Neuron Training Job on Trainium Using TorchX and EKS tutorial on GitHub.

Create an EKS cluster

To get started with distributed training jobs in Amazon EKS with Trn1 instances, you first create an EKS cluster as outlined in the tutorial on GitHub. Cluster creation can be achieved using standard tools such as eksctl and AWS CloudFormation.

Create an EKS node group

Next, we need to create an EKS node group containing two or more trn1.32xlarge instances in a supported Region. In the tutorial, AWS CloudFormation is used to create a Trainium-specific EC2 launch template, which ensures that the Trn1 instances are launched with an appropriate Amazon Machine Image (AMI) and the correct EFA network configuration needed to support distributed training. The AMI also includes the Neuron device driver that provides support for the Trainium accelerator chips. With the eksctl Amazon EKS management tool, you can easily create a Trainium node group using a basic YAML manifest that references the newly created launch template. For example:

apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig

metadata:
  name: my-trn1-cluster
  region: us-west-2
  version: "1.23"

iam:
  withOIDC: true

availabilityZones: ["us-west-xx","us-west-yy"]

managedNodeGroups:
  - name: trn1-ng1
    launchTemplate:
      id: TRN1_LAUNCH_TEMPLATE_ID
    minSize: 2
    desiredCapacity: 2
    maxSize: 2
    availabilityZones: ["us-west-xx"]
    privateNetworking: true
    efaEnabled: true

In the preceding manifest, several attributes are configured to allow for the use of Trn1 instances in the EKS cluster. First, metadata.region is set to one of the Regions that supports Trn1 instances (currently us-east-1 and us-west-2). Next, for availabilityZones, Amazon EKS requires that two Availability Zones be specified. One of these Availability Zones must support the use of Trn1 instances, while the other can be chosen at random. The tutorial shows how to determine which Availability Zones allow for Trn1 instances within your AWS account. The same Trn1-supporting Availability Zone must also be specified using the availabilityZones attribute associated with the EKS node group. efaEnabled is set to true to configure the nodes with the appropriate EFA network configuration that is required for distributed training. Lastly, the launchTemplate.id attribute associated with the node group points to the EC2 launch template created via AWS CloudFormation in an earlier step.

Assuming that you have already applied the CloudFormation template and installed the eksctl management tool, you can create a Trainium-capable EKS node group by running the following code:

> eksctl create nodegroup -f TEMPLATE.yaml

Install Kubernetes plugins for Trainium and EFA devices

With the node group in place, the next step is to install Kubernetes plugins that provide support for the Trainium accelerators (via the Neuron plugin) and the EFA devices (via the EFA plugin). These plugins can easily be installed on the cluster using the standard kubectl management tool as shown in the tutorial.

To use the TorchX universal PyTorch launcher to launch distributed training jobs, two prerequisites are required: the Volcano batch scheduler, and the etcd server. Much like the Neuron and EFA plugins, we can use the kubectl tool to install Volcano and the etcd server on the EKS cluster.

Attach shared storage to the EKS cluster

In the tutorial, FSx for Lustre is used to provide a high-performance shared file system that can be accessed by the various EKS worker pods. This shared storage is used to host the training dataset, as well as any artifacts and logs created during the training process. The tutorial describes how to create and attach the shared storage to the cluster using the Amazon FSx for Lustre CSI driver.

Create a training container image

Next, we need to create a training container image that includes the PyTorch training script along with any dependencies. An example Dockerfile is included in the tutorial, which incorporates the BERT pre-training script along with its software dependencies. The Dockerfile is used to build the training container image, and the image is then pushed to an ECR repository from which the PyTorch workers are able to pull the image when a training job is launched on the cluster.

Set up the training data

Before launching a training job, the training data is first copied to the shared storage volume on FSx for Lustre. The tutorial outlines how to create a temporary Kubernetes pod that has access to the shared storage volume, and shows how to log in to the pod in order to download and extract the training dataset using standard Linux shell commands.

With the various infrastructure and software prerequisites in place, we can now focus on the Trainium aspects of the solution.

Precompile your model

The Neuron SDK supports PyTorch through an integration layer called PyTorch Neuron. By default, PyTorch Neuron operates with just-in-time compilation, where the various neural network compute graphs within a training job are compiled as they are encountered during the training process. For larger models, it can be more convenient to use the provided neuron_parallel_compile tool to precompile and cache the various compute graphs in advance so as to avoid graph compilation at training time. Before launching the training job on the EKS cluster, the tutorial shows how to first launch a precompilation job via TorchX using the neuron_parallel_compile tool. Upon completion of the precompilation job, the Neuron compiler will have identified and compiled all of the neural network compute graphs, and cached them to the shared storage volume for later use during the actual BERT pre-training job.

Launch the distributed training job

With precompilation complete, TorchX is then used to launch a 64-worker distributed training job across two trn1.32xlarge instances, with 32 workers per instance. We use 32 workers per instance because each trn1.32xlarge instance contains 16 Trainium accelerators, with each accelerator providing 2 NeuronCores. Each NeuronCore can be accessed as a unique PyTorch XLA device in the training script. An example TorchX launch command from the tutorial looks like the following code:

torchx run \
    -s kubernetes --workspace="file:///$PWD/docker" \
    -cfg queue=test,image_repo=$ECR_REPO \
    lib/trn1_dist_ddp.py:generateAppDef \
    --name berttrain \
    --script_args "--batch_size 16 --grad_accum_usteps 32 \
        --data_dir /data/bert_pretrain_wikicorpus_tokenized_hdf5_seqlen128 \
        --output_dir /data/output" \
    --nnodes 2 \
    --nproc_per_node 32 \
    --image $ECR_REPO:bert_pretrain \
    --script dp_bert_large_hf_pretrain_hdf5.py \
    --bf16 True \
    --cacheset bert-large

The various command line arguments in the preceding TorchX command are described in detail in the tutorial. However, the following arguments are most important in configuring the training job:

  • -cfg queue=test – Specifies the Volcano queue to be used for the training job
  • -cfg image_repo – Specifies the ECR repository to be used for the TorchX container images
  • --script_args – Specifies any arguments that should be passed to the PyTorch training script
  • --nnodes and --nproc_per_node – The number of instances and workers per instance to use for the training job
  • --script – The name of the PyTorch training script to launch within the training container
  • --image – The path to the training container image in Amazon ECR
  • --bf16 – Whether or not to enable the BF16 data type
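
As noted above, each NeuronCore is exposed to the training script as a separate PyTorch XLA device. The following is a minimal sketch (not taken from the tutorial's BERT script) of how a worker typically acquires its device and steps the optimizer through torch-xla; the model and data loader are placeholders:

import torch
import torch_xla.core.xla_model as xm  # provided by the PyTorch Neuron (torch-xla) environment

def train_worker(model: torch.nn.Module, train_loader):
    # Each of the 32 workers on a trn1.32xlarge instance binds to one NeuronCore,
    # which torch-xla exposes as an XLA device.
    device = xm.xla_device()
    model = model.to(device)
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

    for inputs, labels in train_loader:
        inputs, labels = inputs.to(device), labels.to(device)
        optimizer.zero_grad()
        loss = torch.nn.functional.cross_entropy(model(inputs), labels)
        loss.backward()
        # optimizer_step applies the update and marks the XLA step, which triggers
        # compilation (or a cache hit from precompilation) and execution on the NeuronCore.
        xm.optimizer_step(optimizer)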

Monitor the training job

After the training job has been launched, there are various ways in which the job can be monitored. The tutorial shows how to monitor basic training script metrics on the command line using kubectl, how to visually monitor training script progress in TensorBoard (see the following screenshot), and how to monitor Trainium accelerator utilization using the neuron-top tool from the Neuron SDK.

Clean up or reuse the environment

When the training job is complete, the cluster can then be reused or reconfigured for additional training jobs. For example, the EKS node group can quickly be scaled up using the eksctl command in order to support training jobs that require additional Trn1 instances. Similarly, the provided Dockerfile and TorchX launch commands can easily be modified to support additional deep learning models and distributed training topologies.

If the cluster is no longer required, the tutorial also includes all steps required to remove the EKS infrastructure and related resources.

Conclusion

In this post, we explored how Trn1 instances and Amazon EKS provide a managed platform for high-performance, cost-effective, and massively scalable distributed training of deep learning models. We also shared a comprehensive tutorial showing how to run a real-world multi-instance distributed training job in Amazon EKS using Trn1 instances, and highlighted several of the key steps and components in the solution. This tutorial content can easily be adapted for other models and workloads, and provides you with a foundational solution for distributed training of deep learning models in AWS.

To learn more about how to get started with Trainium-powered Trn1 instances, refer to the Neuron documentation.


About the Authors

Scott Perry is a Solutions Architect on the Annapurna ML accelerator team at AWS. Based in Canada, he helps customers deploy and optimize deep learning training and inference workloads using AWS Inferentia and AWS Trainium. His interests include large language models, deep reinforcement learning, IoT, and genomics.

Lorea Arrizabalaga is a Solutions Architect aligned to the UK Public Sector, where she helps customers design ML solutions with Amazon SageMaker. She is also part of the Technical Field Community dedicated to hardware acceleration and helps with testing and benchmarking AWS Inferentia and AWS Trainium workloads.

Read More

Amazon SageMaker built-in LightGBM now offers distributed training using Dask

Amazon SageMaker built-in LightGBM now offers distributed training using Dask

Amazon SageMaker provides a suite of built-in algorithms, pre-trained models, and pre-built solution templates to help data scientists and machine learning (ML) practitioners get started on training and deploying ML models quickly. You can use these algorithms and models for both supervised and unsupervised learning. They can process various types of input data, including tabular, image, and text.

Starting today, the SageMaker LightGBM algorithm offers distributed training using the Dask framework for both tabular classification and regression tasks. It's available through the SageMaker Python SDK. The supported data format can be either CSV or Parquet. We conducted extensive benchmarking experiments on four publicly available datasets with various settings to validate its performance.

Customers are increasingly interested in training models on large datasets with SageMaker LightGBM, which can take a day or even longer. In these cases, you might be able to speed up the process by distributing training over multiple machines or processes in a cluster. This post discusses how SageMaker LightGBM helps you set up and launch distributed training, without the expense and difficulty of directly managing your training clusters.

Problem statement

Machine learning has become an essential tool for extracting insights from large amounts of data. From image and speech recognition to natural language processing and predictive analytics, ML models have been applied to a wide range of problems. As datasets continue to grow in size and complexity, traditional training methods can become increasingly time-consuming and resource-intensive. This is where distributed training comes into play.

Distributed training is a technique that allows for the parallel processing of large amounts of data across multiple machines or devices. By splitting the data and training multiple models in parallel, distributed training can significantly reduce training time and improve the performance of models on big data. In recent years, distributed training has been a popular mechanism in training deep neural networks for use cases such as large language models (LLMs), image generation and classification, and text generation tasks using frameworks like PyTorch, TensorFlow, and MXNet. In this post, we discuss how distributed training can be applied to tabular data (a common type of data found in many industries such as finance, healthcare, and retail) using Dask and the LightGBM algorithm for tasks such as regression and classification.

Dask is an open-source parallel computing library that allows for distributed parallel processing of large datasets in Python. It's designed to work with the existing Python and data science ecosystem, such as NumPy and Pandas. When it comes to distributed training, Dask can be used to parallelize the data loading, preprocessing, and model training tasks, and it integrates well with popular ML algorithms like LightGBM. LightGBM is a gradient boosting framework that uses tree-based learning algorithms and is designed to be efficient and scalable for training large models on big data. As of v3.2.0, LightGBM integrates with Dask to allow distributed learning across multiple machines to produce a single model.
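
To make the integration concrete, the following is a minimal, self-contained sketch of LightGBM's Dask API on a local cluster. It is not part of the SageMaker tutorial (the built-in container sets up the multi-instance Dask cluster for you, as described later); the synthetic data and parameters are illustrative:

import dask.array as da
from dask.distributed import Client, LocalCluster
from lightgbm import DaskLGBMRegressor  # requires lightgbm >= 3.2.0 with dask installed

if __name__ == "__main__":
    # Local stand-in for the multi-instance cluster that SageMaker manages for you
    cluster = LocalCluster(n_workers=2, threads_per_worker=2)
    client = Client(cluster)

    # Synthetic data split into chunks that Dask distributes across the workers
    X = da.random.random((100_000, 20), chunks=(10_000, 20))
    y = da.random.random((100_000,), chunks=(10_000,))

    # Each worker trains on its local chunks; LightGBM combines the results into one model
    model = DaskLGBMRegressor(n_estimators=100, tree_learner="data")
    model.fit(X, y)

    # Predictions come back lazily as a Dask array
    print(model.predict(X)[:5].compute())

    client.close()
    cluster.close()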

How distributed training works

Distributed training for tree-based algorithms is a technique that is used when the dataset is too large to be processed on a single instance or when the computational resources of a single instance are not sufficient to train the tree-based model in a reasonable amount of time. It allows a model to be trained across multiple instances or machines, rather than on a single machine. This is done by dividing the dataset into smaller subsets, called chunks, and distributing them among the available instances. Each instance then trains a model on its assigned chunk of data, and the results are later combined using aggregation algorithms to form a single model.

In tree-based models like LightGBM, the main computational cost is in the building of the tree structure. This is typically done by sorting and selecting subsets of the data.

Now, let’s explore how LightGBM does the parallel training. LightGBM can use three types of parallelism:

  • Data parallelism – This is the most basic form of parallelism. The data is divided horizontally into smaller subsets and distributed among multiple instances. Each instance constructs a local histogram, the histograms are merged, and then a split is performed using a reduce-scatter algorithm. A local histogram is constructed by dividing the instance's subset of the data into discrete bins and counting the number of data points in each bin. This histogram-based algorithm speeds up training and reduces memory usage.
  • Feature parallelism – In feature parallelism, each machine is responsible for training on a subset of the features of the model, rather than a subset of the data. This can be useful when working with datasets that have a large number of features, because it allows for more efficient use of resources. It works by finding the best local split point in each instance, then communicating the best split to the other instances. The LightGBM implementation maintains all features of the data on every machine to reduce the cost of communicating the best splits.
  • Voting parallelism – In voting parallelism, the data is divided into smaller subsets and distributed among multiple machines. Each machine trains a model on its assigned subset of data, and the results are later combined to form a single, larger model. However, instead of using the gradients from all the machines to update the model parameters, a voting mechanism is used to decide which gradients to use. This can be useful when working with datasets that have a lot of noise or outliers, because it can help reduce their impact on the final model. At the time of writing, the LightGBM integration with Dask supports only the data and voting parallelism types.

SageMaker will automatically set up and manage a Dask cluster when using multiple instances with the LightGBM built-in container.

Solution overview

When a training job using LightGBM is started with multiple instances, we first create a Dask cluster. One instance acts as the Dask scheduler, and the remaining instances have Dask workers, where each worker has multiple threads. Each worker in the cluster has part of the data to perform the distributed computations, as illustrated in the following figure.

Enable distributed training

The requirements for the input data are as follows:

  • The supported input data format for training can be either CSV or Parquet. You are allowed to put more than one data file under both train and validation channels. If multiple files are identified, the algorithm will concatenate all of them as the training or validation data. The name of the data file can be any string as long as it ends with .csv or .parquet.
  • For each data file, the algorithm requires that the target variable is in the first column and that it should not have a header record. This follows the convention of the SageMaker XGBoost algorithm.
  • If your predictors include categorical features, you can provide a JSON file named cat_index.json in the same location as your training data. This file should contain a Python dictionary, where the key can be any string and the value is a list of unique integers. Each integer in the value list should indicate the column index of the corresponding categorical features in your data file. The index starts with value 1, because value 0 corresponds to the target variable. The cat_index.json file should be put under the training data directory, as shown in the following example.
  • Only CPU instance types are supported for distributed training.

Let’s use data in CSV format as an example. The train and validation data can be structured as follows:

-- training_dataset_s3_path
    -- data_1.csv
    -- data_2.csv
    -- data_3.csv
    -- cat_index.json
    
-- validation_dataset_s3_path
    -- data_1.csv
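
For reference, the categorical index file can be generated with a few lines of Python. The key name and column indices below are illustrative; only the file name and the value format (a list of feature column indices, starting at 1) are prescribed by the algorithm:

import json

# Assume columns 3 and 7 of each data file hold categorical features.
# Index 0 is the target variable, so feature indices start at 1.
cat_index = {"categorical_columns": [3, 7]}

with open("cat_index.json", "w") as f:
    json.dump(cat_index, f)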

You can specify the input type to be either text/csv or application/x-parquet:

from sagemaker.inputs import TrainingInput

content_type = "text/csv" # or "application/x-parquet"

train_input = TrainingInput(
    training_dataset_s3_path, content_type=content_type
)

validation_input = TrainingInput(
    validation_dataset_s3_path, content_type=content_type
)

Before distributed training, you can retrieve the default hyperparameters of LightGBM and override them with custom values:

from sagemaker import hyperparameters

# Retrieve the default hyper-parameters for LightGBM
hyperparameters = hyperparameters.retrieve_default(
    model_id=train_model_id, model_version=train_model_version
)

# [Optional] Override default hyperparameters with custom values
hyperparameters["num_boost_round"] = "500"

# Specify either 'data' or 'voting' parallelism for distributed training.
# Feature parallelism is not supported in the LightGBM Dask integration;
# see https://github.com/microsoft/LightGBM/issues/3834
hyperparameters["tree_learner"] = "voting"

To enable distributed training, simply set the instance_count argument of the sagemaker.estimator.Estimator class to more than 1. The rest of the work is taken care of under the hood. See the following example code:

from sagemaker.estimator import Estimator
from sagemaker.utils import name_from_base

training_job_name = name_from_base("sagemaker-built-in-distributed-lgb")

# Create SageMaker Estimator instance
tabular_estimator = Estimator(
    role=aws_role,
    image_uri=train_image_uri,
    source_dir=train_source_uri,
    model_uri=train_model_uri,
    entry_point="transfer_learning.py",
    instance_count=4,  # select the instance count to use for distributed training
    volume_size=30,  # size in GB of the storage volume for input and output data during training (default: 30)
    instance_type=training_instance_type,
    max_run=360000,
    hyperparameters=hyperparameters,
    output_path=s3_output_location,
)

# Launch a SageMaker Training job by passing s3 path of the training data
tabular_estimator.fit(
    {
        "train": train_input,
        "validation": validation_input,
    }, logs=True, job_name=training_job_name
)

The following screenshots show a successful training job log from the notebook. The logs from different Amazon Elastic Compute Cloud (Amazon EC2) machines are marked by different colors.

The distributed training is also compatible with SageMaker automatic model tuning. For details, see the example notebook.
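
As a rough sketch of pairing the distributed estimator with automatic model tuning, you can wrap it in a HyperparameterTuner. The objective metric name and search ranges below are assumptions; refer to the example notebook for the exact metric names emitted by the algorithm:

from sagemaker.tuner import ContinuousParameter, HyperparameterTuner, IntegerParameter

# Hypothetical search ranges for two common LightGBM hyperparameters
hyperparameter_ranges = {
    "learning_rate": ContinuousParameter(1e-4, 0.1, scaling_type="Logarithmic"),
    "num_boost_round": IntegerParameter(100, 1000),
}

tuner = HyperparameterTuner(
    estimator=tabular_estimator,  # the distributed estimator defined earlier
    objective_metric_name="rmse",  # assumed metric name; check the algorithm's documentation
    hyperparameter_ranges=hyperparameter_ranges,
    objective_type="Minimize",
    max_jobs=10,
    max_parallel_jobs=2,
)

# Every tuning trial runs the same multi-instance distributed training configuration
tuner.fit({"train": train_input, "validation": validation_input})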

Benchmarking

We conducted benchmarking experiments to validate the performance of distributed training in SageMaker LightGBM on four different publicly available datasets for regression, binary, and multi-class classification tasks. The experiment details are as follows:

  • Each dataset is split into training, validation, and test data following the 80/20/10 split rule. For each dataset and instance type and count, we train LightGBM on the training data; record metrics such as billable time (per instance), total runtime, average training loss at the end of the last built tree over all instances, and validation loss at the end of the last built tree; and evaluate its performance on the hold-out test data.
  • For each trial, we use the exact same set of hyperparameter values, with the number of trees being 500 except for the lending dataset. For the lending dataset, we use 100 as the number of trees because it’s sufficient to get optimal results on the hold-out test data.
  • Each number presented in the table is averaged over three trials.
  • Because each model is trained with one fixed set of hyperparameter values, the evaluation metric numbers on the hold-out test data can be further improved with hyperparameter optimization.

Billable time refers to the absolute wall-clock time. The total runtime is the elapsed time of the distributed training run, which includes the billable time plus the time to spin up instances and install dependencies. For the validation loss at the end of the last built tree, we didn't average over all instances (as we did for the training loss) because all of the validation data is assigned to a single instance, and therefore only that instance has the validation loss metric. Out of Memory (OOM) means the dataset hit an out-of-memory error during training. The loss functions and evaluation metrics used are binary and multi-class logloss, L2, accuracy, F1, ROC AUC, F1 macro, F1 micro, R2, MAE, and MSE.

The expectation is that as the instance count increases, the billable time (per instance) and total runtime decrease, while the average training loss and validation loss at the end of the last built tree and the evaluation scores on the hold-out test data remain the same.

We conducted three experiments:

  • Benchmark on three publicly available datasets using CSV as the input data format
  • Benchmark on a different dataset using Parquet as the input data format
  • Compare the model performance on different instance types given a certain instance count

The datasets we used are lending club loan data, ad-tracking fraud detection data, code data, and NYC taxi data. The data statistics are presented as follows.

Dataset Size Number of Examples Number of Features Problem Type
lending club loan ~10 GB 1,439,141 955 Binary classification
ad-tracking fraud detection ~10 GB 145,716,493 9 Binary classification
code ~10 GB 18,268,221 9 Multi-class classification (number of classes in target: 10)
NYC taxi ~0.5 GB 83,601,440 8 Regression

The following table contains the benchmarking results for the first three datasets using CSV as the data input format. For demonstration purposes, we removed the categorical features from the lending club loan data. The data statistics are shown in the preceding table. The experiment results matched our expectations.

Dataset Instance Count (m5.2xlarge) Billable Time per Instance (seconds) Total Runtime (seconds) Average Training Loss over all Instances at the End of the Last Built Tree Validation Loss at the End of the Last Built Tree Evaluation Metrics on Hold-Out Test Data
lending club loan . . . Binary logloss Binary logloss Accuracy (%) F1 (%) ROC AUC (%)
. 1 Out of Memory
. 2 Out of Memory
. 4 461 614 0.034 0.039 98.9 96.6 99.7
. 6 375 561 0.034 0.039 98.9 96.6 99.7
. 8 359 549 0.034 0.039 98.9 96.7 99.7
. 10 338 522 0.036 0.037 98.9 96.6 99.7
.
ad-tracking fraud detection . . . Binary logloss Binary logloss Accuracy (%) F1 (%) ROC AUC (%)
. 1 Out of Memory
. 2 Out of Memory
. 4 2649 2773 0.038 0.039 99.8 43.2 80.4
. 6 1926 2047 0.039 0.04 99.8 44.5 79.7
. 8 1677 1780 0.04 0.04 99.8 45.3 79.2
. 10 1595 1744 0.04 0.041 99.8 43 79.3
.
code . . . Multiclass logloss Multiclass logloss Accuracy (%) F1 Macro (%) F1 Micro (%)
. 1 5329 5414 0.937 0.947 65.6 59.3 65.6
. 2 3175 3294 0.94 0.942 65.5 59 65.5
. 4 2593 2695 0.937 0.942 65.6 59.3 65.6
. 8 2253 2377 0.938 0.943 65.6 59.3 65.6
. 10 2160 2285 0.937 0.942 65.6 59.3 65.6

The following table contains the benchmarking results using NYC taxi data with Parquet as the input data format. For the NYC taxi data, we use the yellow trip taxi records from 2009–2022. We follow the example notebook to conduct feature processing. The processed data takes up 8.5 GB of disk space when saved in CSV format, and only 0.55 GB when saved in Parquet format.

A pattern similar to the one in the preceding table is observed. As the instance count increases, the billable time (per instance) and total runtime decrease, while the average training loss and validation loss at the end of the last built tree and the evaluation scores on the hold-out test data remain approximately the same.

Dataset Instance Count (m5.4xlarge) Billable Time per Instance (seconds) Total Runtime (seconds) Average Training Loss over all Instances at the End of the Last Built Tree Validation Loss at the End of the Last Built Tree Evaluation Metrics on Hold-Out Test Data
NYC taxi . . . L2 L2 R2 (%) MSE MAE
. 1 951 1036 6.543 6.543 54.7 42.8 2.7
. 2 635 727 6.545 6.545 54.7 42.8 2.7
. 4 501 628 6.637 6.639 53.4 44.1 2.8
. 6 435 552 6.74 6.74 52 45.4 2.8
. 8 410 510 6.919 6.924 52.3 44.9 2.9

We also conduct benchmarking experiments and compare the performance under different instance types using the code dataset. For a certain instance count, as the instance type becomes larger, the billable time and total runtime decrease.

. ml.m5.2xlarge ml.m5.4xlarge ml.m5.12xlarge
Instance Count Billable Time per Instance (seconds) Total Runtime (seconds) Billable Time per Instance (seconds) Total Runtime (seconds) Billable Time per Instance (seconds) Total Runtime (seconds)
1 5329 5414 2793 2904 1302 1394
2 3175 3294 1911 2000 1006 1098
4 2593 2695 1451 1557 891 973

Conclusion

With the power of Dask’s distributed computing framework and LightGBM’s efficient gradient boosting algorithm, data scientists and developers can train models on large datasets faster and more efficiently than using traditional single-node methods. The SageMaker LightGBM algorithm makes the process of setting up distributed training using the Dask framework for both tabular classification and regression tasks much easier. The algorithm is now available through the SageMaker Python SDK. The supported data format can be either CSV or Parquet. Extensive benchmarking experiments were conducted on four publicly available datasets with various settings to validate its performance.

You can bring your own dataset and try these new algorithms on SageMaker, and check out the example notebook for the built-in algorithms, which is available on GitHub.


About the authors

Dr. Xin Huang is an Applied Scientist for Amazon SageMaker JumpStart and Amazon SageMaker built-in algorithms. He focuses on developing scalable machine learning algorithms. His research interests are in the area of natural language processing, explainable deep learning on tabular data, and robust analysis of non-parametric space-time clustering. He has published many papers in ACL, ICDM, KDD conferences, and Royal Statistical Society: Series A journal.

Will Badr is a Principal AI/ML Specialist SA who works as part of the global Amazon Machine Learning team. Will is passionate about using technology in innovative ways to positively impact the community. In his spare time, he likes to go diving, play soccer and explore the Pacific Islands.

Dr. Li Zhang is a Principal Product Manager-Technical for Amazon SageMaker JumpStart and Amazon SageMaker built-in algorithms, a service that helps data scientists and machine learning practitioners get started with training and deploying their models, and uses reinforcement learning with Amazon SageMaker. His past work as a principal research staff member and master inventor at IBM Research has won the test of time paper award at IEEE INFOCOM.

Read More

Build a water consumption forecasting solution for a water utility agency using Amazon Forecast

Build a water consumption forecasting solution for a water utility agency using Amazon Forecast

Amazon Forecast is a fully managed service that uses machine learning (ML) to generate highly accurate forecasts, without requiring any prior ML experience. Forecast is applicable in a wide variety of use cases, including estimating supply and demand for inventory management, travel demand forecasting, workforce planning, and computing cloud infrastructure usage.

You can use Forecast to seamlessly conduct what-if analyses up to 80% faster to analyze and quantify the potential impact of business levers on your demand forecasts. A what-if analysis helps you investigate and explain how different scenarios might affect the baseline forecast created by Forecast. With Forecast, there are no servers to provision or ML models to build manually. Additionally, you only pay for what you use, and there is no minimum fee or upfront commitment. To use Forecast, you only need to provide historical data for what you want to forecast, and, optionally, any additional data that you believe may impact your forecasts.

Water utility providers have several forecasting use cases, but primary among them is predicting water consumption in an area or building to meet demand. It's also important for utility providers to forecast the increased consumption caused by new apartments added to a building or new houses in an area. Predicting water consumption accurately is critical to avoid any service interruptions to the customer.

This post explores using Forecast to address this use case by using historical time series data.

Solution overview

Water is a natural resource and very critical to industry, agriculture, households, and our lives. Accurate water consumption forecasting is critical to make sure that an agency can run day-to-day operations efficiently. Water consumption forecasting is particularly challenging because demand is dynamic, and seasonal weather changes can have an impact. Predicting water consumption accurately is important so customers don’t face any service interruptions and in order to provide a stable service while maintaining low prices. Improved forecasting enables you to plan ahead to structure more cost-effective future contracts. The following are the two most common use cases:

  • Better demand management – As a utility provider agency, you need to find a balance between aggregate water supply and demand. The agency collects information such as the number of people living in an apartment and the number of apartments in a building before providing service. You need to store sufficient water in order to meet demand. Moreover, demand forecasting has become more challenging for the following reasons:
    • The demand isn’t stable at all times and varies throughout the day. For example, water consumption at midnight is much less compared to in the morning.
    • Weather can also have an impact on the overall consumption. For example, water consumption is higher in the summer than the winter in the northern hemisphere, and the other way around in the southern hemisphere.
    • There is not enough rainfall, water storage capacity (lakes, reservoirs), or water filtering. During the summer, supply can't always keep up with demand. Water agencies have to forecast carefully in order to acquire other sources, which may be more expensive. Therefore, it's critical for utility agencies to find alternative water sources like harvesting rainwater, capturing condensation from air handling units, or reclaiming wastewater.
  • Conducting a what-if analysis for increased demand – Demand for water is rising due to multiple reasons. This includes a combination of population growth, economic development, and changing consumption patterns. Let’s imagine a scenario where an existing apartment building builds an extension and the number of households and people increase by a certain percentage. Now you need to do an analysis to forecast the supply for increased demand. This also helps you make a cost-effective contract for increased demand.

Forecasting can be challenging because you first need accurate models to forecast demand and then a quick and simple way to reproduce the forecast across a range of scenarios.

This post focuses on a solution to perform water consumption forecasting and a what-if analysis. This post doesn’t consider weather data for model training. However, you can add weather data, given its correlation to water consumption.

Prerequisites

Before getting started, we set up our resources. For this post, we use the us-east-1 Region.

  1. Create an Amazon Simple Storage Service (Amazon S3) bucket for storing the historical time series data. For instructions, refer to Create your first S3 bucket.
  2. Download data files from the GitHub repo and upload to the newly created S3 bucket.
  3. Create a new AWS Identity and Access Management (IAM) role. For instructions, see Set Up Permissions for Amazon Forecast. Be sure to provide the name of your S3 bucket.

Create a dataset group and datasets

This post demonstrates two use cases related to water demand forecast: forecasting the water demand based on past water consumption, and conducting a what-if analysis for increased demand.

Forecast can accept three types of datasets: target time series (TTS), related time series (RTS), and item metadata (IM). Target time series data defines the historical demand for the resources you’re predicting. The target time series dataset is mandatory. A related time series dataset includes time-series data that isn’t included in a target time series dataset and might improve the accuracy of your predictor.

In our example, the target time series dataset contains item_id and timestamp dimensions, and the complementary related time series dataset includes no_of_consumer. An important note with this dataset: the TTS ends on 2023-01-01, and the RTS ends on 2023-01-15. When performing what-if scenarios, it’s important to manipulate RTS variables beyond your known time horizon in TTS.

To conduct a what-if analysis, we need to import two CSV files representing the target time series data and the related time series data. Our example target time series file contains the item_id, timestamp, and demand, and our related time series file contains the item_id, timestamp, and no_of_consumer.
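
To make the two schemas concrete, the following sketch builds a few illustrative rows of each file with pandas. The demand and consumer values are invented for illustration; the actual files are in the GitHub repo:

import pandas as pd

# Target time series (TTS): historical water demand per item, ending 2023-01-01
tts = pd.DataFrame(
    {
        "item_id": ["A_10001", "A_10001", "A_10001"],
        "timestamp": ["2022-12-30", "2022-12-31", "2023-01-01"],
        "demand": [1523.0, 1498.5, 1510.2],
    }
)

# Related time series (RTS): number of consumers, extending to 2023-01-15 so that
# what-if scenarios can modify it over the forecast horizon
rts = pd.DataFrame(
    {
        "item_id": ["A_10001", "A_10001"],
        "timestamp": ["2023-01-14", "2023-01-15"],
        "no_of_consumer": [120, 120],
    }
)

# Forecast maps CSV columns by position to the dataset schema, so no header row is written
tts.to_csv("target_time_series.csv", index=False, header=False)
rts.to_csv("related_time_series.csv", index=False, header=False)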

To import your data, complete the following steps:

  1. On the Forecast console, choose View dataset groups.

  2. Choose Create dataset group.

  3. For Dataset group name, enter a name (for this post, water_consumption_datasetgroup).
  4. For Forecasting domain, choose a forecasting domain (for this post, Custom).
  5. Choose Next.
  6. On the Create target time series dataset page, provide the dataset name, frequency of your data, and data schema.
  7. On the Dataset import details page, enter a dataset import name.
  8. For Import file type, select CSV and enter the data location.
  9. Choose the IAM role you created earlier as a prerequisite.
  10. Choose Start.

You’re redirected to the dashboard that you can use to track progress.

  1. To import the related time series file, on the dashboard, choose Import.
  2. On the Create related time series dataset page, provide the dataset name and data schema.
  3. On the Dataset import details page, enter a dataset import name.
  4. For Import file type, select CSV and enter the data location.
  5. Choose the IAM role you created earlier.
  6. Choose Start.
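
If you prefer to script these imports with the Forecast APIs (the approach taken in the companion notebook), the following is a rough boto3 sketch for the target time series; the related time series follows the same pattern. The schema attribute names, S3 path, and role ARN are placeholders to replace with your own:

import boto3

forecast = boto3.client("forecast", region_name="us-east-1")

# Dataset group for the water consumption use case
dsg = forecast.create_dataset_group(
    DatasetGroupName="water_consumption_datasetgroup",
    Domain="CUSTOM",
)

# Target time series dataset: item_id, timestamp, and the demand value
tts_dataset = forecast.create_dataset(
    DatasetName="water_consumption_tts",
    Domain="CUSTOM",
    DatasetType="TARGET_TIME_SERIES",
    DataFrequency="D",
    Schema={
        "Attributes": [
            {"AttributeName": "item_id", "AttributeType": "string"},
            {"AttributeName": "timestamp", "AttributeType": "timestamp"},
            {"AttributeName": "target_value", "AttributeType": "float"},
        ]
    },
)

forecast.update_dataset_group(
    DatasetGroupArn=dsg["DatasetGroupArn"],
    DatasetArns=[tts_dataset["DatasetArn"]],
)

# Import the TTS file from Amazon S3 (repeat with a second dataset and import job for the RTS file)
forecast.create_dataset_import_job(
    DatasetImportJobName="water_consumption_tts_import",
    DatasetArn=tts_dataset["DatasetArn"],
    DataSource={
        "S3Config": {
            "Path": "s3://YOUR_BUCKET/forecast/target_time_series.csv",
            "RoleArn": "arn:aws:iam::123456789012:role/ForecastS3AccessRole",
        }
    },
    TimestampFormat="yyyy-MM-dd",
)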

Train a predictor

Next, we train a predictor.

  1. On the dashboard, choose Start under Train a predictor.
  2. On the Train predictor page, enter a name for your predictor.
  3. Specify how long in the future you want to forecast and at what frequency.
  4. Specify the number of quantiles you want to forecast for.

Forecast uses AutoPredictor to create predictors. For more information, refer to Training Predictors.

  1. Choose Create.
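
Continuing the boto3 sketch from the import step, the API equivalent is roughly the following; the horizon, frequency, and quantiles are illustrative, and each import job must reach ACTIVE status before you train:

# Train an AutoPredictor over a 30-day horizon at daily frequency
predictor = forecast.create_auto_predictor(
    PredictorName="water_consumption_predictor",
    ForecastHorizon=30,
    ForecastFrequency="D",
    ForecastTypes=["0.10", "0.50", "0.90"],
    DataConfig={"DatasetGroupArn": dsg["DatasetGroupArn"]},
)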

Create a forecast

After our predictor is trained (this can take approximately 3.5 hours), we create a forecast. You will know that your predictor is trained when you see the View predictors button on your dashboard.

  1. Choose Start under Generate forecasts on the dashboard.
  2. On the Create a forecast page, enter a forecast name.
  3. For Predictor, choose the predictor that you created.
  4. Optionally, specify the forecast quantiles.
  5. Specify the items to generate a forecast for.
  6. Choose Start.
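
The API equivalent, once the predictor is ACTIVE, is roughly:

# Generate the baseline forecast from the trained predictor
baseline_forecast = forecast.create_forecast(
    ForecastName="water_consumption_forecast",
    PredictorArn=predictor["PredictorArn"],
    ForecastTypes=["0.10", "0.50", "0.90"],
)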

Query your forecast

You can query a forecast using the Query forecast option. By default, the complete range of the forecast is returned. You can request a specific date range within the complete forecast. When you query a forecast, you must specify filtering criteria. A filter is a key-value pair. The key is one of the schema attribute names (including forecast dimensions) from one of the datasets used to create the forecast. The value is a valid value for the specified key. You can specify multiple key-value pairs. The returned forecast will only contain items that satisfy all the criteria.

  1. Choose Query forecast on the dashboard.
  2. Provide the filter criteria for start date and end date.
  3. Specify your forecast key and value.
  4. Choose Get Forecast.

The following screenshot shows the forecasted water consumption for the same apartment (item ID A_10001) using the forecast model.
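
Programmatically, the same query goes through the forecastquery client. The item ID matches the example above; the date range is illustrative and must fall within the forecast horizon:

forecast_query = boto3.client("forecastquery", region_name="us-east-1")

response = forecast_query.query_forecast(
    ForecastArn=baseline_forecast["ForecastArn"],
    StartDate="2023-01-02T00:00:00",
    EndDate="2023-01-15T00:00:00",
    Filters={"item_id": "A_10001"},
)
# Predictions are keyed by quantile, for example response["Forecast"]["Predictions"]["p90"]
print(list(response["Forecast"]["Predictions"].keys()))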

Create a what-if analysis

At this point, we have created our baseline forecast and can now conduct a what-if analysis. Let's imagine a scenario where an existing apartment building adds an extension, and the number of households and people increases by 20%. Now you need to do an analysis to forecast the supply required for this increased demand.

There are three stages to conducting a what-if analysis: setting up the analysis, creating the what-if forecast by defining what is changed in the scenario, and comparing the results.

  1. To set up your analysis, choose Explore what-if analysis on the dashboard.
  2. Choose Create.
  3. Enter a unique name and choose the baseline forecast.
  4. Choose the items in your dataset you want to conduct a what-if analysis for. You have two options:
    • Select all items is the default, which we choose in this post.
    • If you want to pick specific items, choose Select items with a file and import a CSV file containing the unique identifier for the corresponding item and any associated dimensions.
  5. Choose Create what-if analysis.
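
The corresponding API call, continuing the boto3 sketch, looks roughly like this; omitting the time series selector selects all items, matching the default chosen above:

# Set up the what-if analysis on top of the baseline forecast
what_if_analysis = forecast.create_what_if_analysis(
    WhatIfAnalysisName="water_demand_whatif_analysis",
    ForecastArn=baseline_forecast["ForecastArn"],
)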

Create a what-if forecast

Next, we create a what-if forecast to define the scenario we want to analyze.

  1. In the What-if forecast section, choose Create.
  2. Enter a name of your scenario.
  3. You can define your scenario through two options:
    • Use transformation functions – Use the transformation builder to transform the related time series data you imported. For this walkthrough, we evaluate how the demand for an item in our dataset changes when the number of consumers increases by 20% compared to the baseline forecast.
    • Define the what-if forecast with a replacement dataset – Replace the related time series dataset you imported.

For our example, we create a scenario where we increase no_of_consumer by 20% for item ID A_10001 (no_of_consumer is a feature in the dataset). You need this analysis to forecast the water supply required to meet the increased demand. This analysis also helps you make a cost-effective contract based on the water demand forecast.

  1. For What-if forecast definition method, select Use transformation functions.
  2. Choose Multiply as our operator, no_of_consumer as our time series, and enter 1.2.
  3. Choose Add condition.
  4. Choose Equals as the operation and enter A_10001 for item_id.
  5. Choose Create.
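
This transformation-based scenario maps to roughly the following API call; the scenario name is illustrative, and the action multiplies no_of_consumer by 1.2 for item A_10001 only:

# Scenario: 20% more consumers for item A_10001
what_if_forecast = forecast.create_what_if_forecast(
    WhatIfForecastName="water_demand_plus_20_percent",
    WhatIfAnalysisArn=what_if_analysis["WhatIfAnalysisArn"],
    TimeSeriesTransformations=[
        {
            "Action": {
                "AttributeName": "no_of_consumer",
                "Operation": "MULTIPLY",
                "Value": 1.2,
            },
            "TimeSeriesConditions": [
                {
                    "AttributeName": "item_id",
                    "AttributeValue": "A_10001",
                    "Condition": "EQUALS",
                }
            ],
        }
    ],
)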

Compare the forecasts

We can now compare the what-if forecasts for both our scenarios, comparing a 20% increase in consumers with the baseline demand.

  1. On the analysis insights page, navigate to the Compare what-if forecasts section.
  2. For item_id, enter the item to analyze (in our scenario, enter A_10001).
  3. For What-if forecasts, choose water_demand_whatif_analysis.
  4. Choose Compare what-if.
  5. You can choose the baseline forecast for the analysis.

The following graph shows the resulting demand for our scenario. The red line shows the forecast of future water consumption for the 20% increase in population. The P90 forecast type indicates that the true value is expected to be lower than the predicted value 90% of the time. You can use this demand forecast to effectively manage the water supply for the increased demand and avoid any service interruptions.

Export your data

To export your data to CSV, complete the following steps:

  1. Choose Create export.
  2. Enter a name for your export file (for this post, water_demand_export).
  3. Specify the scenarios to be exported by selecting the scenarios on the What-If Forecast drop-down menu.

You can export multiple scenarios at once in a combined file.

  1. For Export location, specify the Amazon S3 location.
  2. To begin the export, choose Create Export.
  3. To download the export, navigate to S3 file path location on the Amazon S3 console, select the file, and choose Download.

The export file will contain the timestamp, item_id, and forecasts for each quantile for all scenarios selected (including the base scenario).
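
If you prefer to script the export, the corresponding call is roughly the following; the S3 path and role ARN are placeholders:

# Export the what-if forecast (optionally alongside other scenarios) to Amazon S3 as CSV
forecast.create_what_if_forecast_export(
    WhatIfForecastExportName="water_demand_export",
    WhatIfForecastArns=[what_if_forecast["WhatIfForecastArn"]],
    Destination={
        "S3Config": {
            "Path": "s3://YOUR_BUCKET/forecast/exports/",
            "RoleArn": "arn:aws:iam::123456789012:role/ForecastS3AccessRole",
        }
    },
    Format="CSV",
)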

Clean up the resources

To avoid incurring future charges, remove the resources created by this solution:

  1. Delete the Forecast resources you created.
  2. Delete the S3 bucket.

Conclusion

In this post, we showed how easy it is to use Forecast and its underlying system architecture to predict water demand using water consumption data. A what-if scenario analysis is a critical tool to help navigate through the uncertainties of business. It provides foresight and a mechanism to stress-test ideas, leaving businesses more resilient, better prepared, and in control of their future. Other utility providers, such as electricity or gas providers, can use Forecast to build solutions and meet utility demand in a cost-effective way.

The steps in this post demonstrated how to build the solution on the AWS Management Console. To directly use Forecast APIs for building the solution, follow the notebook in our GitHub repo.

We encourage you to learn more by visiting the Amazon Forecast Developer Guide and try out the end-to-end solution enabled by these services with a dataset relevant to your business KPIs.


About the Author

Dhiraj Thakur is a Solutions Architect with Amazon Web Services. He works with AWS customers and partners to provide guidance on enterprise cloud adoption, migration, and strategy. He is passionate about technology and enjoys building and experimenting in the analytics and AI/ML space.

Read More