Amazon AWS – Page 172

Analyze and visualize multi-camera events using Amazon SageMaker Studio Lab

February 2, 2023

by Kevin Song Amazon AWS

The National Football League (NFL) is one of the most popular sports leagues in the United States and is the most valuable sports league in the world. The NFL, BioCore, and AWS are committed to advancing human understanding around the diagnosis, prevention, and treatment of sports-related injuries to make the game of football safer. More information regarding the NFL Player Health and Safety efforts is available on the NFL website.

The AWS Professional Services team has partnered with the NFL and Biocore to provide machine learning (ML)-based solutions for identifying helmet impacts from game footage using computer vision (CV) techniques. With multiple camera views available from each game, we have developed solutions to identify helmet impacts from each of these views and merge the helmet impact results.

The motivation behind utilizing multiple camera views comes from the limitation of information when the impact events are captured with only one view. With only one perspective, some players might occlude each other or be blocked by other objects on the field. Therefore, adding more perspectives allows our ML system to identify more impacts that aren’t visible in a single view. To showcase the results of our fusion process and how the team uses visualization tools to help evaluate the model performance, we have developed a codebase to visually overlay the multiple view detection results. This process helps identify the actual number of impacts individual players experience by removing duplicate impacts detected in multiple views.

In this post, we use the publicly available dataset from the NFL – Impact Detection Kaggle competition and show results for merging two views. The dataset includes helmet bounding boxes at every frame and impact labels found in each video. In particular, we focus on deduplicating and visualizing videos with the ID 57583_000082 in endzone and sideline views. You can download the endzone and sideline videos, and also the ground truth labels.

Prerequisites

The solution requires the following:

An Amazon SageMaker Studio Lab account
A Kaggle account for downloading the data

Get started on SageMaker Studio Lab and install the required packages

You can run the notebook from the GitHub repository or from SageMaker Studio Lab. In this post, we run the notebook from a SageMaker Studio Lab environment. We are choosing SageMaker Studio Lab because it is free, provides powerful CPU and GPU user sessions, and 15GB of persistent storage that will automatically save your environment, enabling you to pick up where you left off. To use SageMaker Studio Lab, request and set up a new account. After the account is approved, complete the following steps:

Visit the aws-samples GitHub repo.
In the README section, choose Open Studio Lab.

This redirects you to your SageMaker Studio Lab environment.

Select your CPU compute type, then choose Start Runtime.
After the runtime starts, choose Copy to Project, which opens a new window with the Jupyter Lab environment.

Now you’re ready to use the notebook!

Open fuse_and_visualize_multiview_impacts.ipynb and follow the instructions in the notebook.

The first cell in the notebook installs the necessary Python packages such as pandas and OpenCV:

%pip install pandas
%pip install opencv-contrib-python-headless

Import all the necessary Python packages and set pandas options for better visualization experience:

import os
import cv2
import pandas as pd
import numpy as np
pd.set_option('mode.chained_assignment', None)

We use pandas for ingesting and parsing through the CSV file with the annotated helmet bounding boxes as well as impacts. We use NumPy mainly for manipulating arrays and matrices. We use OpenCV for reading, writing, and manipulating image data in Python.

Prepare the data by fusing results from two views

To fuse the two perspectives together, we use the train_labels.csv from the Kaggle competition as an example because it contains ground truth impacts from both the endzone and sideline. The following function takes the input dataset and outputs a fused dataframe that is deduplicated for all the plays in the input dataset:

def prep_data(df):
    df['game_play'] = df['gameKey'].astype('str') + '_' + df['playID'].astype('str').str.zfill(6)
    return df

def dedup_view(df, windows):
    # define view
    df = df.sort_values(by='frame')
    view_columns = ['frame', 'left', 'width', 'top', 'height', 'video']
    common_columns = ['game_play', 'label', 'view', 'impactType']
    label_cleaned = df[view_columns + common_columns]
    
    # rename columns
    sideline_column_rename = {col: 'Sideline_' + col for col in view_columns}
    endzone_column_rename = {col: 'Endzone_' + col for col in view_columns}
    sideline_columns = list(sideline_column_rename.values())

    # create two dataframes, one for sideline, one for endzone
    label_endzone = label_cleaned.query('view == "Endzone"')
    label_endzone.rename(columns=endzone_column_rename, inplace=True)
    label_sideline = label_cleaned.query('view == "Sideline"')
    label_sideline.rename(columns=sideline_column_rename, inplace=True)

    # prepare sideline labels
    label_sideline['is_dup'] = False
    for columns in sideline_columns:
        label_endzone[columns] = np.nan
    label_endzone['is_dup'] = False

    # iterrate endzone rows to find matches and dedup
    for index, row in label_endzone.iterrows():
        player = row['label']
        frame = row['Endzone_frame']
        impact_type = row['impactType']
        sideline_row = label_sideline[(label_sideline['label'] == player) & 
                                      ((label_sideline['Sideline_frame'] >= frame - windows // 2) &
                                       (label_sideline['Sideline_frame'] <= frame + windows // 2 + 1)) &
                                      (label_sideline['is_dup'] == False) & 
                                      (label_sideline['impactType'] == impact_type)]

        if len(sideline_row) > 0:
            sideline_index = sideline_row.index[0]
            label_sideline['is_dup'].loc[sideline_index] = True

            for col in sideline_columns:
                label_endzone[col].loc[index] = sideline_row.iloc[0][col]
            label_endzone['is_dup'].loc[index] = True

    # calculate overlap perc
    not_dup_sideline = label_sideline[label_sideline['is_dup'] == False]
    final_output = pd.concat([not_dup_sideline, label_endzone])
    return final_output

def fuse_df(raw_df, windows):
    outputs = []
    all_game_play = raw_df['game_play'].unique()
    for game_play in all_game_play:
        df = raw_df.query('game_play ==@game_play')
        output = dedup_view(df, windows)
        outputs.append(output)

    output_df = pd.concat(outputs)
    output_df['gameKey'] = output_df['game_play'].apply(lambda x: x.split('_')[0]).map(int)
    output_df['playID'] = output_df['game_play'].apply(lambda x: x.split('_')[1]).map(int)
    return output_df

To run the function, we run the following code block to provide the location of the train_labels.csv data and then perform data preparation to add an additional column and extract only the impact rows. After running the function, we save the output to a dataframe variable called fused_df.

# read the annotated impact data from train_labels.csv
ground_truth = pd.read_csv('train_labels.csv')

# prepare game_play column using pipe(prep_data) function in pandas then filter the dataframe for just rows with impacts
ground_truth = ground_truth.pipe(prep_data).query('impact == 1')

# loop over all the unique game_plays and deduplicate the impact results from sideline and endzone
fused_df = fuse_df(ground_truth, windows=30)

The following screenshot shows the ground truth.

The following screenshot shows the fused dataframe examples.

Graph and video code

After we fuse the impact results, we use the generated fused_df to overlay the results onto our endzone and sideline videos and merge the two views together. We use the following function for this, and the inputs needed are the paths to the endzone video, sideline video, fused_df dataframe, and the final output path for the newly generated video. The functions used in this section are described in the markdown section of the notebook used in SageMaker Studio Lab.

def get_video_and_metadata(vid_path): 
    vid = cv2.VideoCapture(vid_path)
    total_frame_number = vid.get(cv2.CAP_PROP_FRAME_COUNT)
    width = int(vid.get(cv2.CAP_PROP_FRAME_WIDTH))
    height = int(vid.get(cv2.CAP_PROP_FRAME_HEIGHT))
    fps = vid.get(cv2.CAP_PROP_FPS)
    return vid, total_frame_number, width, height, fps

def overlay_impacts(frame, fused_df, game_key, play_id, frame_cnt, h1):
    # look for duplicates 
    duplicates = fused_df.query(f"gameKey == {int(game_key)} and 
                                  playID == {int(play_id)} and 
                                  is_dup == True and 
                                  Sideline_frame == @frame_cnt") 

    frame_has_impact = False 
    
    if len(duplicates) > 0: 
        for duplicate in duplicates.itertuples(index=False): 
            if frame_cnt == duplicate.Sideline_frame: 
                frame_has_impact = True 

            if frame_has_impact: 
                cv2.rectangle(frame, #frame to be edited 
                              (int(duplicate.Sideline_left), int(duplicate.Sideline_top)), #(x,y) of top left corner 
                              (int(duplicate.Sideline_left) + int(duplicate.Sideline_width), int(duplicate.Sideline_top) + int(duplicate.Sideline_height)), #(x,y) of bottom right corner 
                              (0,0,255), #RED boxes
                              thickness=3)

                cv2.rectangle(frame, #frame to be edited
                              (int(duplicate.Endzone_left), int(duplicate.Endzone_top)+ h1), #(x,y) of top left corner
                              (int(duplicate.Endzone_left) + int(duplicate.Endzone_width), int(duplicate.Endzone_top) + int(duplicate.Endzone_height) + h1), #(x,y) of bottom right corner
                              (0,0,255), #RED boxes
                              thickness=3)
 
                cv2.line(frame, #frame to be edited
                         (int(duplicate.Sideline_left), int(duplicate.Sideline_top)), #(x,y) of point 1 in a line
                         (int(duplicate.Endzone_left), int(duplicate.Endzone_top) + h1), #(x,y) of point 2 in a line
                         (255, 255, 255), # WHITE lines
                         thickness=4)
 
            else:
                # if no duplicates, look for sideline then endzone and add to the view
                sl_impacts = fused_df.query(f"gameKey == {int(game_key)} and 
                                              playID == {int(play_id)} and 
                                              is_dup == False and 
                                              view == 'Sideline' and 
                                              Sideline_frame == @frame_cnt")

                if len(sl_impacts) > 0:
                    for impact in sl_impacts.itertuples(index=False):
                        if frame_cnt == impact.Sideline_frame:
                            frame_has_impact = True

                        if frame_has_impact:
                            cv2.rectangle(frame, #frame to be edited
                                          (int(impact.Sideline_left), int(impact.Sideline_top)), #(x,y) of top left corner
                                          (int(impact.Sideline_left) + int(impact.Sideline_width), int(impact.Sideline_top) + int(impact.Sideline_height)), #(x,y) of bottom right corner
                                          (0, 255, 255), #YELLOW BOXES
                                          thickness=3)

                ez_impacts = fused_df.query(f"gameKey == {int(game_key)} and 
                                              playID == {int(play_id)} and 
                                              is_dup == False and 
                                              view == 'Endzone' and 
                                              Endzone_frame == @frame_cnt")

                if len(ez_impacts) > 0:
                    for impact in ez_impacts.itertuples(index=False):
                        if frame_cnt == impact.Endzone_frame:
                            frame_has_impact = True

                        if frame_has_impact:
                            cv2.rectangle(frame, #frame to be edited
                                          (int(impact.Endzone_left), int(impact.Endzone_top)+ h1), #(x,y) of top left corner
                                          (int(impact.Endzone_left) + int(impact.Endzone_width), int(impact.Endzone_top) + int(impact.Endzone_height) + h1 ), #(x,y) of bottom right corner
                                          (0, 255, 255), #YELLOW BOXES
                                          thickness=3)

    return frame, frame_has_impact

def generate_impact_video(ez_vid_path:str,
                          sl_vid_path:str,
                          fused_df:pd.DataFrame,
                          output_path:str,
                          freeze_impacts=True):
    
    #define video codec to be used for
    VIDEO_CODEC = "MP4V"

    # parse game_key and play_id information from the name of the files
    game_key = os.path.basename(ez_vid_path).split('_')[0] # parse game_key
    play_id = os.path.basename(ez_vid_path).split('_')[1] # parse play_id
 
    # get metadata such as total frame number, width, height and frames per second (FPS) from endzone (ez) and sideline (sl) videos
    ez_vid, ez_total_frame_number, ez_width, ez_height, ez_fps = get_video_and_metadata(ez_vid_path)
    sl_vid, sl_total_frame_number, sl_width, sl_height, sl_fps = get_video_and_metadata(sl_vid_path)

    # define a video writer for the output video
    output_video = cv2.VideoWriter(output_path, #output file name
                                   cv2.VideoWriter_fourcc(*VIDEO_CODEC), #Video codec
                                   ez_fps, #frames per second in the output video
                                  (ez_width, ez_height+sl_height)) # frame size with stacking video vertically

    # find shorter video and use the total frame number from the shorter video for the output video
    total_frame_number = int(min(ez_total_frame_number, sl_total_frame_number))

    # iterate through each frame from endzone and sideline
    for frame_cnt in range(total_frame_number):
        frame_has_impact = False
        frame_near_impact = False

        # reading frames from both endzone and sideline
        ez_ret, ez_frame = ez_vid.read()
        sl_ret, sl_frame = sl_vid.read()

        # creating strings to be added to the output frames
        img_name = f"Game key: {game_key}, Play ID: {play_id}, Frame: {frame_cnt}"
        video_frame = f'{game_key}_{play_id}_{frame_cnt}'

        if ez_ret == True and sl_ret == True:
            h, w, c = ez_frame.shape
            h1,w1,c1 = sl_frame.shape
 
            if h != h1 or w != w1: # resize images if they're different
                ez_frame = cv2.resize(ez_frame,(w1,h1))
 
            frame = np.concatenate((sl_frame, ez_frame), axis=0) # stack the frames vertically

            frame, frame_has_impact = overlay_impacts(frame, fused_df, game_key, play_id, frame_cnt, h1)

            cv2.putText(frame, #image frame to be modified
                        img_name, #string to be inserted
                        (30, 30), #(x,y) location of the string
                        cv2.FONT_HERSHEY_SIMPLEX, #font
                        1, #scale
                        (255, 255, 255), #WHITE letters
                        thickness=2)

            cv2.putText(frame, #image frame to be modified
                        str(frame_cnt), #frame count string to be inserted
                        (w1-75, h1-20), #(x,y) location of the string in the top view
                        cv2.FONT_HERSHEY_SIMPLEX, #font
                        1, #scale
                        (255, 255, 255), # WHITE letters
                        thickness=2)

            cv2.putText(frame, #image frame to be modified
                        str(frame_cnt), #frame count string to be inserted
                        (w1-75, h1+h-20), #(x,y) location of the string in the bottom view
                        cv2.FONT_HERSHEY_SIMPLEX, #font
                        1, #scale
                        (255, 255, 255), # WHITE letters
                        thickness=2)

            output_video.write(frame)

            # Freeze for 60 frames on impacts
            if frame_has_impact and freeze_impacts:
                for _ in range(60):
                    output_video.write(frame)
        else:
            break

        frame_cnt += 1

    output_video.release()
    return

To run these functions, we can provide an input as shown in the following code, which generates a video called output.mp4:

generate_impact_video('57583_000082_Endzone.mp4',
                      '57583_000082_Sideline.mp4',
                      fused_df,
                      'output.mp4')

This generates a video as shown in the following example, where the red bounding boxes are impacts found in both endzone and sideline views, and the yellow bounding boxes are impacts that are found in just one view in either the endzone or sideline.

Conclusion

In this post, we demonstrated how the NFL, Biocore, and the AWS ProServe teams are working together to improve impact detection by fusing results from multiple views. This allows the teams to debug and visualize how the model is performing qualitatively. This process can easily be scaled up to three or more views; in our projects, we have utilized up to seven different views. Detecting helmet impacts by watching videos from only one view can be difficult due to view obstruction, but detecting impacts from multiple views and fusing the results allows us to improve our model performance.

To experiment with this solution, visit the aws-samples GitHub repo and refer to the fuse_and_visualize_multiview_impacts.ipynb notebook. Similar techniques can also be applied to other industries such as manufacturing, retail, and security, where having multiple views would benefit the ML system to better identify targets with a more comprehensive view.

For more information regarding NFL Player Health and Safety, visit the NFL website and NFL Explained: Innovation in Player Health & Safety.

About the authors

Chris Boomhower is a Machine Learning Engineer at AWS Professional Services. Chris has over 6 years experience developing supervised and unsupervised Machine Learning solutions across various industries. Today, he spends most his time helping customers in sports, healthcare, and agriculture industries design and build scalable, end-to-end, Machine Learning solutions.

Ben Fenker is a Senior Data Scientist in AWS Professional Services and has helped customers build and deploy ML solutions in industries ranging from sports to healthcare to manufacturing. He has a Ph.D. in physics from Texas A&M University and 6 years of industry experience. Ben enjoys baseball, reading, and raising his kids.

Sam Huddleston is a Principal Data Scientist at Biocore LLC, who serves as the Technology Lead for the NFL’s Digital Athlete program. Biocore is a team of world-class engineers based in Charlottesville, Virginia, that provides research, testing, biomechanics expertise, modeling and other engineering services to clients dedicated to the understanding and reduction of injury.

Jarvis Lee is a Senior Data Scientist with AWS Professional Services. He has been with AWS for over five years, working with customers on machine learning and computer vision problems. Outside of work, he enjoys riding bicycles.

Tyler Mullenbach is the Global Practice Lead for ML with AWS Professional Services. He is responsible for driving the strategic direction of ML for Professional Services and ensuring that customers realize transformative business achievements through the adoption of ML technologies.

Kevin Song is a Data Scientist at AWS Professional Services. He holds a PhD in Biophysics and has over 5 years of industry experience in building computer vision and machine learning solutions.

Betty Zhang is a data scientist with 10 years of experience in data and technology. Her passion is to build innovative machine learning solutions to drive transformational changes for companies. In her spare time, she enjoys traveling, reading and learning about new technologies.

Amazon’s quantum computing papers at QIP 2023

February 2, 2023

by Amazon AWS

Research on “super-Grover” optimization, quantum algorithms for topological data analysis, and simulation of physical systems displays the range of Amazon’s interests in quantum computing.Read More

How to decide between Amazon Rekognition image and video API for video moderation

February 1, 2023

by Lana Zhang Amazon AWS

Almost 80% of today’s web content is user-generated, creating a deluge of content that organizations struggle to analyze with human-only processes. The availability of consumer information helps them make decisions, from buying a new pair of jeans to securing home loans. In a recent survey, 79% of consumers stated they rely on user videos, comments, and reviews more than ever and 78% of them said that brands are responsible for moderating such content. 40% said that they would disengage with a brand after a single exposure to toxic content.

Amazon Rekognition has two sets of APIs that help you moderate images or videos to keep digital communities safe and engaged.

One approach to moderate videos is to model video data as a sample of image frames and use image content moderation models to process the frames individually. This approach allows the reuse of image-based models. Some customers have asked if they could use this approach to moderate videos by sampling image frames and sending them to the Amazon Rekognition image moderation API. They are curious about how this solution compares with the Amazon Rekognition video moderation API.

We recommend using the Amazon Rekognition video moderation API to moderate video content. It’s designed and optimized for video moderation, offering better performance and lower costs. However, there are specific use cases where the image API solution is optimal.

This post compares the two video moderation solutions in terms of accuracy, cost, performance, and architecture complexity to help you choose the best solution for your use case.

Moderate videos using the video moderation API

The Amazon Rekognition video content moderation API is the standard solution used to detect inappropriate or unwanted content in videos. It performs as an asynchronous operation on video content stored in an Amazon Simple Storage Service (Amazon S3) bucket. The analysis results are returned as an array of moderation labels along with a confidence score and timestamp indicating when the label was detected.

The video content moderation API uses the same machine learning (ML) model for image moderation. The output is filtered for noisy false positive results. The workflow is optimized for latency by parallelizing operations like decode, frame extraction, and inference.

The following diagram shows the logical steps of how to use the Amazon Rekognition video moderation API to moderate videos.

The steps are as follows:

Upload videos to an S3 bucket.
Call the video moderation API in an AWS Lambda function (or customized script on premises) with the video file location as a parameter. The API manages the heavy lifting of video decoding, sampling, and inference. You can either implement a heartbeat logic to check the moderation job status until it completes, or use Amazon Simple Notification Service (Amazon SNS) to implement an event-driven pattern. For details about the video moderation API, refer to the following Jupyter notebook for detailed examples.
Store the moderation result as a file in an S3 bucket or database.

Moderate videos using the image moderation API

Instead of using the video content moderation API, some customers choose to independently sample frames from videos and detect inappropriate content by sending the images to the Amazon Rekognition DetectModerationLabels API. Image results are returned in real time with labels for inappropriate content or offensive content along with a confidence score.

The following diagram shows the logical steps of the image API solution.

The steps are as follows:

1. Use a customized application or script as an orchestrator, from loading the video to the local file system.
2. Decode the video.
3. Sample image frames from the video at a chosen interval, such as two frames per second. Then iterate through all the images to:

3.a. Send each image frame to the image moderation API.
3.b. Store the moderation results in a file or database.

Compare this with the video API solution, which requires a light Lambda function to orchestrate API calls. The image sampling solution is CPU intensive and requires more compute resources. You can host the application using AWS services such as Lambda, Amazon Elastic Container Service (Amazon ECS), Amazon Elastic Kubernetes Service (Amazon EKS), AWS Fargate, or Amazon Elastic Compute Cloud (Amazon EC2).

Evaluation dataset

To evaluate both solutions, we use a sample dataset consisting of 200 short-form videos. The videos range from 10 seconds to 45 minutes. 60% of the videos are less than 2 minutes long. This sample dataset is used to test the performance, cost, and accuracy metrics for both solutions. The results compare the Amazon Rekognition image API sampling solution to the video API solution.

To test the image API solution, we use open-source libraries (ffmpeg and OpenCV) to sample images at a rate of two frames per second (one frame every 500 milliseconds). This rate mimics the sampling frequency used by the video content moderation API. Each image is sent to the image content moderation API to generate labels.

To test the video sampling solution, we send the videos directly to the video content moderation API to generate labels.

Results summary

We focus on the following key results:

Accuracy – Both solutions offer similar accuracy (false positive and false negative percentages) using the same sampling frequency of two frames per second
Cost – The image API sampling solution is more expensive than the video API solution using the same sampling frequency of two frames per second
- The image API sampling solution cost can be reduced by sampling fewer frames per second
Performance – On average, the video API has a 425% faster processing time than the image API solution for the sample dataset
- The image API solution performs better in situations with a high frame sample interval and on videos less than 90 seconds
Architecture complexity – The video API solution has a low architecture complexity, whereas the image API sampling solution has a medium architecture complexity

Accuracy

We tested both solutions using the sample set and the same sampling frequency of two frames per second. The results demonstrated that both solutions provide a similar false positive and true positive ratio. This result is expected because under the hood, Amazon Rekognition uses the same ML model for both the video and image moderation APIs.

To learn more about metrics for evaluating content moderation, refer to Metrics for evaluating content moderation in Amazon Rekognition and other content moderation services.

Cost

The cost analysis demonstrates that the image API solution is more expensive than the video API solution if you use the same sampling frequency of two frames per second. The image API solution can be more cost effective if you reduce the number of frames sampled per second.

The two primary factors that impact the cost of a content moderation solution are the Amazon Rekognition API costs and compute costs. The default pricing for the video content moderation API is $0.10 per minute and $0.001 per image for the image content moderation API. A 60-second video produces 120 frames using a rate of two frames per second. The video API costs $0.10 to moderate a 60-second video, whereas the image API costs $0.120.

The price calculation is based on the official price in Region us-east-1 at the time of writing this post. For more information, refer to Amazon Rekognition pricing.

The cost analysis looks at the total cost to generate content moderation labels for the 200 videos in the sample set. The calculations are based on us-east-1 pricing. If you’re using another Region, modify the parameters with the pricing for that Region. The 200 videos contain 4271.39 minutes of content and generate 512,567 image frames at a sampling rate of two frames per second.

This comparison doesn’t consider other costs, such as Amazon S3 storage. We use Lambda as an example to calculate the AWS compute cost. Compute costs take into account the number of requests to Lambda and AWS Step Functions to run the analysis. The Lambda memory/CPU setting is estimated based on the Amazon EC2 specifications. This cost estimate uses a four GB, 2-second Lambda request per image API call. Lambda functions have a maximum invocation timeout limit of 15 minutes. For longer videos, the user may need to implement iteration logic using Step Functions to reduce the number of frames processed per Lambda call. The actual Lambda settings and cost patterns may differ depending on your requirements. It’s recommended to test the solution end to end for a more accurate cost estimation.

The following table summarizes the costs.

Type	Amazon Rekognition Costs	Compute Costs	Total Cost
Video API Solution	$427.14	$0 (Free tier)	$427.14
Image API Solution: Two frames per second	$512.57	$164.23	$676.80
Image API Solution: One frame per second	$256.28	$82.12	$338.40

Performance

On average, the video API solution has a four times faster processing time than the image API solution. The image API solution performs better in situations with a high frame sample interval and on videos shorter than 90 seconds.

This analysis measures performance as the average processing time in seconds per video. It looks at the total and average time to generate content moderation labels for the 200 videos in the sample set. The processing time is measured from the video upload to the result output and includes each step in the image sampling and video API process.

The video API solution has an average processing time of 35.2 seconds per video for the sample set. This is compared to the image API solution with an average processing time of 156.24 seconds per video for the sample set. On average, the video API performs four times faster than the image API solution. The following table summarizes these findings.

Type	Average Processing Time (All Videos)	Average Processing Time (Videos Under 1.5 Minutes)
Video API Solution	35.2 seconds	24.05 seconds
Image API Solution: Two frames per second	156.24 seconds	8.45 seconds
Difference	425%	-185%

The image API performs better than the video API when the video is shorter than 90 seconds. This is because the video API has a queue managing the tasks that has a lead time. The image API can also perform better if you have a lower sampling frequency. Increasing the frame interval to over 5 seconds can decrease the processing time by 6–10 times. It’s important to note that increasing intervals introduces the risk of missed identification of inappropriate content between frame samples.

Architecture complexity

The video API solution has a low architecture complexity. You can set up a serverless pipeline or run a script to retrieve content moderation results. Amazon Rekognition manages the heavy computing and inference. The application orchestrating the Amazon Rekognition APIs can be hosted on a light machine.

The image API solution has a medium architecture complexity. The application logic has to orchestrate additional steps to store videos on the local drive, run image processing to capture frames, and call the image API. The server hosting the application requires higher computing capacity to support the local image processing. For the evaluation, we launched an EC2 instance with 4 vCPU and 8 G RAM to support two parallel threads. Higher compute requirements may lead to additional operation overhead.

Optimal use cases for the image API solution

The image API solution is ideal for three specific use cases when processing videos.

The first is real-time video streaming. You can capture image frames from a live video stream and send the images to the image moderation API.

The second use case is content moderation with a low frame sampling rate requirement. The image API solution is more cost-effective and performant if you sample frames at a low frequency. It’s important to note that there will be a trade-off between cost and accuracy. Sampling frames at a lower rate may increase the risk of missing frames with inappropriate content.

The third use case is for the early detection of inappropriate content in video. The image API solution is flexible and allows you to stop processing and flag the video early on, saving cost and time.

Conclusion

The video moderation API is ideal for most video moderation use cases. It’s more cost effective and performant than the image API solution when you sample frames at a frequency such as two frames per second. Additionally, it has a low architectural complexity and reduced operational overhead requirements.

The following table summarizes our findings to help you maximize the use of the Amazon Rekognition image and video APIs for your specific video moderation use cases. Although these results are averages achieved during testing and by some of our customers, they should give you ideas to balance the use of each API.

.	Video API Solution	Image API Solution
Accuracy	Same accuracy	.
Cost	Lower cost using the default image sampling interval	Lower cost if you reduce the number of frames sampled per second (sacrifice accuracy)
Performance	Faster for videos longer than 90 seconds	Faster for videos less than 90 seconds
Architecture Complexity	Low complexity	Medium complexity

Amazon Rekognition content moderation can not only help your business protect and keep customers safe and engaged, but also contribute to your ongoing efforts to maximize the return on your content moderation investment. Learn more about Content Moderation on AWS and our Content Moderation ML use cases.

About the authors

Lana Zhang is a Sr. Solutions Architect at the AWS WWSO AI Services team, with expertise in AI and ML for content moderation and computer vision. She is passionate about promoting AWS AI services and helping customers transform their business solutions.

Brigit Brown is a Solutions Architect at Amazon Web Services. Brigit is passionate about helping customers find innovative solutions to complex business challenges using machine learning and artificial intelligence. Her core areas of depth are natural language processing and content moderation.

How Amazon’s AZ3 chip makes neural networks run more efficiently

February 1, 2023

by Amazon AWS

Specialized circuitry for compression and for both inline and on-the-fly decompression minimize data movement.Read More

Scaling distributed training with AWS Trainium and Amazon EKS

February 1, 2023

by Scott Perry Amazon AWS

Recent developments in deep learning have led to increasingly large models such as GPT-3, BLOOM, and OPT, some of which are already in excess of 100 billion parameters. Although larger models tend to be more powerful, training such models requires significant computational resources. Even with the use of advanced distributed training libraries like FSDP and DeepSpeed, it’s common for training jobs to require hundreds of accelerator devices for several weeks or months at a time.

In late 2022, AWS announced the general availability of Amazon EC2 Trn1 instances powered by AWS Trainium—a purpose-built machine learning (ML) accelerator optimized to provide a high-performance, cost-effective, and massively scalable platform for training deep learning models in the cloud. Trn1 instances are available in a number of sizes (see the following table), with up to 16 Trainium accelerators per instance.

Instance Size	Trainium Accelerators	Accelerator Memory (GB)	vCPUs	Instance Memory (GiB)	Network Bandwidth (Gbps)
trn1.2xlarge	1	32	8	32	Up to 12.5
trn1.32xlarge	16	512	128	512	800
trn1n.32xlarge (coming soon)	16	512	128	512	1600

Trn1 instances can either be deployed as standalone instances for smaller training jobs, or in highly scalable ultraclusters that support distributed training across tens of thousands of Trainium accelerators. All Trn1 instances support the standalone configuration, whereas Trn1 ultraclusters require trn1.32xlarge or trn1n.32xlarge instances. In an ultracluster, multiple Trn1 instances are co-located in a given AWS Availability Zone and are connected with high-speed, low-latency, Elastic Fabric Adapter (EFA) networking that provides 800 Gbps of nonblocking network bandwidth per instance for collective compute operations. The trn1n.32xlarge instance type, launching in early 2023, will increase this bandwidth to 1600 Gbps per instance.

Many enterprise customers choose to deploy their deep learning workloads using Kubernetes—the de facto standard for container orchestration in the cloud. AWS customers often deploy these workloads using Amazon Elastic Kubernetes Service (Amazon EKS). Amazon EKS is a managed Kubernetes service that simplifies the creation, configuration, lifecycle, and monitoring of Kubernetes clusters while still offering the full flexibility of upstream Kubernetes.

Today, we are excited to announce official support for distributed training jobs using Amazon EKS and EC2 Trn1 instances. With this announcement, you can now easily run large-scale containerized training jobs within Amazon EKS while taking full advantage of the price-performance, scalability, and ease of use offered by Trn1 instances.

Along with this announcement, we are also publishing a detailed tutorial that guides you through the steps required to run a multi-instance distributed training job (BERT phase 1 pre-training) using Amazon EKS and Trn1 instances. In this post, you will learn about the solution architecture and review several key steps from the tutorial. Refer to the official tutorial repository for the complete end-to-end workflow.

To follow along, a broad familiarity with core AWS services such as Amazon Elastic Compute Cloud (Amazon EC2) and Amazon EKS is implied, and basic familiarity with deep learning and PyTorch would be helpful.

Solution architecture

The following diagram illustrates the solution architecture.

The solution consists of the following main components:

An EKS cluster
An EKS node group consisting of trn1.32xlarge instances
The AWS Neuron SDK
EKS plugins for Neuron and EFA
An Amazon Elastic Container Registry (Amazon ECR) Rrepository
A training container image
An Amazon FSx for Lustre file system
A Volcano batch scheduler and etcd server
The TorchX universal job launcher
The TorchX DDP module for Trainium

At the heart of the solution is an EKS cluster that provides you with core Kubernetes management functionality via an EKS service endpoint. One of the benefits of Amazon EKS is that the service actively monitors and scales the control plane based on load, which ensures high performance for large workloads such as distributed training. Inside the EKS cluster is a node group consisting of two or more trn1.32xlarge Trainium-based instances residing in the same Availability Zone.

The Neuron SDK is the software stack that provides the driver, compiler, runtime, framework integration (for example, PyTorch Neuron), and user tools that allow you to access the benefits of the Trainium accelerators. The Neuron device driver runs directly on the EKS nodes (Trn1 instances) and provides access to the Trainium chips from within the training containers that are launched on the nodes. Neuron and EFA plugins are installed within the EKS cluster to provide access to the Trainium chips and EFA networking devices required for distributed training.

An ECR repository is used to store the training container images. These images contain the Neuron SDK (excluding the Neuron driver, which runs directly on the Trn1 instances), PyTorch training script, and required dependencies. When a training job is launched on the EKS cluster, the container images are first pulled from Amazon ECR onto the EKS nodes, and the PyTorch worker containers are then instantiated from the images.

Shared storage is provided using a high-performance FSx for Lustre file system that exists in the same Availability Zone as the trn1.32xlarge instances. Creation and attachment of the FSx for Lustre file system to the EKS cluster is mediated by the Amazon FSx for Lustre CSI driver. In this solution, the shared storage is used to store the training dataset and any logs or artifacts created during the training process.

The solution uses the TorchX universal job launcher to launch distributed training jobs within Amazon EKS. TorchX has two important dependencies: the Volcano batch scheduler and the etcd server. Volcano handles the scheduling and queuing of training jobs, while the etcd server is a key-value store used by TorchElastic for synchronization and peer discovery during job startup.

When a training job is launched using TorchX, the launch command uses the provided TorchX distributed DDP module for Trainium to configure the overall training job and then run the appropriate torchrun commands on each of the PyTorch worker pods. When a job is running, it can be monitored using standard Kubernetes tools (such as kubectl) or via standard ML toolsets such as TensorBoard.

Solution overview

Let’s look at the important steps of this solution. Throughout this overview, we refer to the Launch a Multi-Node PyTorch Neuron Training Job on Trainium Using TorchX and EKS tutorial on GitHub.

Create an EKS cluster

To get started with distributed training jobs in Amazon EKS with Trn1 instances, you first create an EKS cluster as outlined in the tutorial on GitHub. Cluster creation can be achieved using standard tools such as eksctl and AWS CloudFormation.

Create an EKS node group

Next, we need to create an EKS node group containing two or more trn1.32xlarge instances in a supported Region. In the tutorial, AWS CloudFormation is used to create a Trainium-specific EC2 launch template, which ensures that the Trn1 instances are launched with an appropriate Amazon Machine Image (AMI) and the correct EFA network configuration needed to support distributed training. The AMI also includes the Neuron device driver that provides support for the Trainium accelerator chips. With the eksctl Amazon EKS management tool, you can easily create a Trainium node group using a basic YAML manifest that references the newly created launch template. For example:

apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig

metadata:
  name: my-trn1-cluster
  region: us-west-2
  version: "1.23"

iam:
  withOIDC: true

availabilityZones: ["us-west-xx","us-west-yy"]

managedNodeGroups:
  - name: trn1-ng1
    launchTemplate:
      id: TRN1_LAUNCH_TEMPLATE_ID
    minSize: 2
    desiredCapacity: 2
    maxSize: 2
    availabilityZones: ["us-west-xx"]
    privateNetworking: true
    efaEnabled: true

In the preceding manifest, several attributes are configured to allow for the use of Trn1 instances in the EKS cluster. First, metadata.region is set to one of the Regions that supports Trn1 instances (currently us-east-1 and us-west-2). Next, for availabilityZones, Amazon EKS requires that two Availability Zones be specified. One of these Availability Zones must support the use of Trn1 instances, while the other can be chosen at random. The tutorial shows how to determine which Availability Zones will allow for Trn1 instances within your AWS account. The same Trn1-supporting Availability Zone must also be specified using the availabiltyZones attribute associated with the EKS node group. efaEnabled is set to true to configure the nodes with the appropriate EFA network configuration that is required for distributed training. Lastly, the launchTemplate.id attribute associated with the node group points to the EC2 launch template created via AWS CloudFormation in an earlier step.

Assuming that you have already applied the CloudFormation template and installed the eksctl management tool, you can create a Trainium-capable EKS node group by running the following code:

> eksctl create nodegroup -f TEMPLATE.yaml

Install Kubernetes plugins for Trainium and EFA devices

With the node group in place, the next step is to install Kubernetes plugins that provide support for the Trainium accelerators (via the Neuron plugin) and the EFA devices (via the EFA plugin). These plugins can easily be installed on the cluster using the standard kubectl management tool as shown in the tutorial.

To use the TorchX universal PyTorch launcher to launch distributed training jobs, two prerequisites are required: the Volcano batch scheduler, and the etcd server. Much like the Neuron and EFA plugins, we can use the kubectl tool to install Volcano and the etcd server on the EKS cluster.

Attach shared storage to the EKS cluster

In the tutorial, FSx for Lustre is used to provide a high-performance shared file system that can be accessed by the various EKS worker pods. This shared storage is used to host the training dataset, as well as any artifacts and logs creating during the training process. The tutorial describes how to create and attach the shared storage to the cluster using the Amazon FSx for Lustre CSI driver.

Create a training container image

Next, we need to create a training container image that includes the PyTorch training script along with any dependencies. An example Dockerfile is included in the tutorial, which incorporates the BERT pre-training script along with its software dependencies. The Dockerfile is used to build the training container image, and the image is then pushed to an ECR repository from which the PyTorch workers are able to pull the image when a training job is launched on the cluster.

Set up the training data

Before launching a training job, the training data is first copied to the shared storage volume on FSx for Lustre. The tutorial outlines how to create a temporary Kubernetes pod that has access to the shared storage volume, and shows how to log in to the pod in order to download and extract the training dataset using standard Linux shell commands.

With the various infrastructure and software prerequisites in place, we can now focus on the Trainium aspects of the solution.

Precompile your model

The Neuron SDK supports PyTorch through an integration layer called PyTorch Neuron. By default, PyTorch Neuron operates with just-in-time compilation, where the various neural network compute graphs within a training job are compiled as they are encountered during the training process. For larger models, it can be more convenient to use the provided neuron_parallel_compile tool to precompile and cache the various compute graphs in advance so as to avoid graph compilation at training time. Before launching the training job on the EKS cluster, the tutorial shows how to first launch a precompilation job via TorchX using the neuron_parallel_compile tool. Upon completion of the precompilation job, the Neuron compiler will have identified and compiled all of the neural network compute graphs, and cached them to the shared storage volume for later use during the actual BERT pre-training job.

Launch the distributed training job

With precompilation complete, TorchX is then used to launch a 64-worker distributed training job across two trn1.32xlarge instances, with 32 workers per instance. We use 32 workers per instance because each trn1.32xlarge instance contains 16 Trainium accelerators, with each accelerator providing 2 NeuronCores. Each NeuronCore can be accessed as a unique PyTorch XLA device in the training script. An example TorchX launch command from the tutorial looks like the following code:

    torchx run 
    -s kubernetes --workspace="file:///$PWD/docker" 
    -cfg queue=test,image_repo=$ECR_REPO 
    lib/trn1_dist_ddp.py:generateAppDef 
    --name berttrain 
    --script_args "--batch_size 16 --grad_accum_usteps 32 
        --data_dir /data/bert_pretrain_wikicorpus_tokenized_hdf5_seqlen128 
        --output_dir /data/output" 
    --nnodes 2 
    --nproc_per_node 32 
    --image $ECR_REPO:bert_pretrain 
    --script dp_bert_large_hf_pretrain_hdf5.py 
    --bf16 True 
    --cacheset bert-large

The various command line arguments in the preceding TorchX command are described in detail in the tutorial. However, the following arguments are most important in configuring the training job:

-cfg queue=test – Specifies the Volcano queue to be used for the training job
-cfg image_repo – Specifies the ECR repository to be used for the TorchX container images
–script_args – Specifies any arguments that should be passed to the PyTorch training script
–nnodes and –nproc_per_node – The number of instances and workers per instance to use for the training job
–script – The name of the PyTorch training script to launch within the training container
–image – The path to the training container image in Amazon ECR
–bf16 – Whether or not to enable BF16 data type

Monitor the training job

After the training job has been launched, there are various ways in which the job can be monitored. The tutorial shows how to monitor basic training script metrics on the command line using kubectl, how to visually monitor training script progress in TensorBoard (see the following screenshot), and how to monitor Trainium accelerator utilization using the neuron-top tool from the Neuron SDK.

Clean up or reuse the environment

When the training job is complete, the cluster can then be reused or re-configured for additional training jobs. For example, the EKS node group can quickly be scaled up using the eksctl command in order to support training jobs that require additional Trn1 instances. Similarly, the provided Dockerfile and TorchX launch commands can easily be modified to support additional deep learning models and distributing training topologies.

If the cluster in no longer required, the tutorial also includes all steps required to remove the EKS infrastructure and related resources.

Conclusion

In this post, we explored how Trn1 instances and Amazon EKS provide a managed platform for high-performance, cost-effective, and massively scalable distributed training of deep learning models. We also shared a comprehensive tutorial showing how to run a real-world multi-instance distributed training job in Amazon EKS using Trn1 instances, and highlighted several of the key steps and components in the solution. This tutorial content can easily be adapted for other models and workloads, and provides you with a foundational solution for distributed training of deep learning models in AWS.

To learn more about how to get started with Trainium-powered Trn1 instances, refer to the Neuron documentation.

About the Authors

Scott Perry is a Solutions Architect on the Annapurna ML accelerator team at AWS. Based in Canada, he helps customers deploy and optimize deep learning training and inference workloads using AWS Inferentia and AWS Trainium. His interests include large language models, deep reinforcement learning, IoT, and genomics.

Lorea Arrizabalaga is a Solutions Architect aligned to the UK Public Sector, where she helps customers design ML solutions with Amazon SageMaker. She is also part of the Technical Field Community dedicated to hardware acceleration and helps with testing and benchmarking AWS Inferentia and AWS Trainium workloads.

Where machine learning models meet mobility and human behavior

January 31, 2023

by Amazon AWS

Mahdieh Allahviranloo is using her expertise in transportation engineering to assist last-mile delivery operations at Amazon.Read More

Amazon SageMaker built-in LightGBM now offers distributed training using Dask

January 30, 2023

by Xin Huang Amazon AWS

Amazon SageMaker provides a suite of built-in algorithms, pre-trained models, and pre-built solution templates to help data scientists and machine learning (ML) practitioners get started on training and deploying ML models quickly. You can use these algorithms and models for both supervised and unsupervised learning. They can process various types of input data, including tabular, image, and text.

Starting today, the SageMaker LightGBM algorithm offers distributed training using the Dask framework for both tabular classification and regression tasks. They’re available through the SageMaker Python SDK. The supported data format can be either CSV or Parquet. Extensive benchmarking experiments on four publicly available datasets with various settings are conducted to validate its performance.

Customers are increasingly interested in training models on large datasets with SageMaker LightGBM, which can take a day or even longer. In these cases, you might be able to speed up the process by distributing training over multiple machines or processes in a cluster. This post discusses how SageMaker LightGBM helps you set up and launch distributed training, without the expense and difficulty of directly managing your training clusters.

Problem statement

Machine learning has become an essential tool for extracting insights from large amounts of data. From image and speech recognition to natural language processing and predictive analytics, ML models have been applied to a wide range of problems. As datasets continue to grow in size and complexity, traditional training methods can become increasingly time-consuming and resource-intensive. This is where distributed training comes into play.

Distributed training is a technique that allows for the parallel processing of large amounts of data across multiple machines or devices. By splitting the data and training multiple models in parallel, distributed training can significantly reduce training time and improve the performance of models on big data. In recent years, distributed training has been a popular mechanism in training deep neural networks for use cases such as large language models (LLMs), image generation and classification, and text generation tasks using frameworks like PyTorch, TensorFlow, and MXNet. In this post, we discuss how distributed training can be applied to tabular data (a common type of data found in many industries such as finance, healthcare, and retail) using Dask and the LightGBM algorithm for tasks such as regression and classification.

Dask is an open-source parallel computing library that allows for distributed parallel processing of large datasets in Python. It’s designed to work with the existing Python and data science ecosystem such as NumPy and Pandas. When it comes to distributed training, Dask can be used to parallelize the data loading, preprocessing, and model training tasks, and it integrates well with popular ML algorithms like LightGBM. LightGBM is a gradient boosting framework that uses tree-based learning algorithms, which is designed to be efficient and scalable for training large models on big data. Combining these two powerful libraries, LightGBM v3.2.0 is now integrated with Dask to allow distributed learning across multiple machines to produce a single model.

How distributed training works

Distributed training for tree-based algorithms is a technique that is used when the dataset is too large to be processed on a single instance or when the computational resources of a single instance are not sufficient to train the tree-based model in a reasonable amount of time. It allows a model to be trained across multiple instances or machines, rather than on a single machine. This is done by dividing the dataset into smaller subsets, called chunks, and distributing them among the available instances. Each instance then trains a model on its assigned chunk of data, and the results are later combined using aggregation algorithms to form a single model.

In tree-based models like LightGBM, the main computational cost is in the building of the tree structure. This is typically done by sorting and selecting subsets of the data.

Now, let’s explore how LightGBM does the parallel training. LightGBM can use three types of parallelism:

Data parallelism – This is the most basic form of data parallelism. The data is divided horizontally into smaller subsets and distributed among multiple instances. Each instance constructs its local histogram, and all histograms are merged, then a split is performed using a reduce scatter algorithm. A histogram in local instances is constructed by dividing the subset of the local data into discrete bins, and counting the number of data points in each bin. This histogram-based algorithm helps speed up the training and reduces memory usage.
Feature parallelism – In feature parallelism, each machine is responsible for training a subset of the features of the model, rather than a subset of the data. This can be useful when working with datasets that have a large number of features, because it allows for more efficient use of resources. It works by finding the best local split point in each instance, then communicates the best split with the other instances. LightGBM implementation maintains all features of the data in every machine to reduce the cost of communicating the best splits.
Voting parallelism – In voting parallelism, the data is divided into smaller subsets and distributed among multiple machines. Each machine trains a model on its assigned subset of data, and the results are later combined to form a single, larger model. However, instead of using the gradients from all the machines to update the model parameters, a voting mechanism is used to decide which gradients to use. This can be useful when working with datasets that have a lot of noise or outliers, because it can help reduce the impact of these on the final model. At the time of writing this post, LightGBM integration with Dask only supports data and voting parallelism types.

SageMaker will automatically set up and manage a Dask cluster when using multiple instances with the LightGBM built-in container.

Solution overview

When a training job using LightGBM is started with multiple instances, we first create a Dask cluster. One instance acts as the Dask scheduler, and the remaining instances have Dask workers, where each worker has multiple threads. Each worker in the cluster has part of the data to perform the distributed computations, as illustrated in the following figure.

Enable distributed training

The requirements for the input data are as follows:

The supported input data format for training can be either CSV or Parquet. You are allowed to put more than one data file under both train and validation channels. If multiple files are identified, the algorithm will concatenate all of them as the training or validation data. The name of the data file can be any string as long as it ends with .csv or .parquet.
For each data file, the algorithm requires that the target variable is in the first column and that it should not have a header record. This follows the convention of the SageMaker XGBoost algorithm.
If your predictors include categorical features, you can provide a JSON file named cat_index.json in the same location as your training data. This file should contain a Python dictionary, where the key can be any string and the value is a list of unique integers. Each integer in the value list should indicate the column index of the corresponding categorical features in your data file. The index starts with value 1, because value 0 corresponds to the target variable. The cat_index.json file should be put under the training data directory, as shown in the following example.
The instance type supported by distributed training is CPU.

Let’s use data in CSV format as an example. The train and validation data can be structured as follows:

-- training_dataset_s3_path
    -- data_1.csv
    -- data_2.csv
    -- data_3.csv
    -- cat_idx.json
    
-- validation_dataset_s3_path
    -- data_1.csv

You can specify the input type to be either text/csv or application/x-parquet:

from sagemaker.inputs import TrainingInput

content_type = "text/csv" # or "application/x-parquet"

train_input = TrainingInput(
    training_dataset_s3_path, content_type=content_type
)

validation_input = TrainingInput(
    validation_dataset_s3_path, content_type=content_type
)

Before distributed training, you can retrieve the default hyperparameters of LightGBM and override them with custom values:

from sagemaker import hyperparameters

# Retrieve the default hyper-parameters for LightGBM
hyperparameters = hyperparameters.retrieve_default(
    model_id=train_model_id, model_version=train_model_version
)

# [Optional] Override default hyperparameters with custom values
hyperparameters[
    "num_boost_round"
] = "500" 

hyperparameters["tree_learner"] = "voting" ### specify either 'data' or 'voting' parallelism for distributed training. Unfortnately, for dask lightgbm, the 'feature' is not supported. See github issue: https://github.com/microsoft/LightGBM/issues/3834

To enable distributed training, you can simply specify the argument instance_count in the class sagemaker.estimator.Estimator to be more than 1. The rest of work is taken care of under the hood. See the following example code:

from sagemaker.estimator import Estimator
from sagemaker.utils import name_from_base

training_job_name = name_from_base("sagemaker-built-in-distributed-lgb")

# Create SageMaker Estimator instance
tabular_estimator = Estimator(
    role=aws_role,
    image_uri=train_image_uri,
    source_dir=train_source_uri,
    model_uri=train_model_uri,
    entry_point="transfer_learning.py",
    instance_count=4, ### select the instance count you would like to use for distributed training
    volume_size=30, ### volume_size (int or PipelineVariable): Size in GB of the storage volume to use for storing input and output data during training (default: 30).
    instance_type=training_instance_type,
    max_run=360000,
    hyperparameters=hyperparameters,
    output_path=s3_output_location,
)

# Launch a SageMaker Training job by passing s3 path of the training data
tabular_estimator.fit(
    {
        "train": train_input,
        "validation": validation_input,
    }, logs=True, job_name=training_job_name
)

The following screenshots show a successful training job log from the notebook. The logs from different Amazon Elastic Compute Cloud (Amazon EC2) machines are marked by different colors.

The distributed training is also compatible with SageMaker automatic model tuning. For details, see the example notebook.

Benchmarking

We conducted benchmarking experiments to validate the performance of distributed training in SageMaker LightGBM on four different publicly available datasets for regression, binary, and multi-class classification tasks. The experiment details are as follows:

Each dataset is split into training, validation, and test data following the 80/20/10 split rule. For each dataset and instance type and count, we train LightGBM on the training data; record metrics such as billable time (per instance), total runtime, average training loss at the end of the last built tree over all instances, and validation loss at the end of the last built tree; and evaluate its performance on the hold-out test data.
For each trial, we use the exact same set of hyperparameter values, with the number of trees being 500 except for the lending dataset. For the lending dataset, we use 100 as the number of trees because it’s sufficient to get optimal results on the hold-out test data.
Each number presented in the table is averaged over three trials.
Because each model is trained with one fixed set of hyperparameter values, the evaluation metric numbers on the hold-out test data can be further improved with hyperparameter optimization.

Billable time refers to the absolute wall-clock time. The total runtime is the elastic time running the distributed training, which includes the billable time and time to spin up instances and install dependencies. For the validation loss at the end of the last built tree, we didn’t do the average over all the instances as the training loss because all of the validation data is assigned to a single instance and therefore only that instance has the validation loss metric. Out of Memory (OOM) means the dataset hit the out of memory error during training. The loss function and evaluation metrics used are binary and multi-class logloss, L2, accuracy, F1, ROC AUC, F1 macro, F1 micro, R2, MAE, and MSE.

The expectation is that as the instance count increases, the billable time (per instance) and total runtime decreases, while the average training loss and validation loss at the end of the last built tree and evaluation scores on the hold-out test data remain the same.

We conducted three experiments:

Benchmark on three publicly available datasets using CSV as the input data format
Benchmark on a different dataset using Parquet as the input data format
Compare the model performance on different instance types given a certain instance count

The datasets we used are lending club loan data, ad-tracking fraud detection data, code data, and NYC taxi data. The data statistics are presented as follows.

Dataset	Size	Number of Examples	Number of Features	Problem Type
lending club loan	~10 G	1, 439, 141	955	Binary classification
ad-tracking fraud detection	~10 G	145, 716, 493	9	Binary classification
code	~10 G	18, 268, 221	9	Multi-class classification (number of classes in target: 10)
NYC taxi	~0.5 G	83, 601, 440	8	Regression

The following table contains the benchmarking results for the first three datasets using CSV as the data input format. For demonstration purposes, we removed the categorical features for the lending club loan data. The data statistics are shown in the table. The experiment results matched our expectations.

Dataset	Instance Count (m5.2xlarge)	Billable Time per Instance (seconds)	Total Runtime (seconds)	Average Training Loss over all Instances at the End of the Last Built Tree	Validation Loss at the End of the Last Built Tree	Evaluation Metrics on Hold-Out Test Data
lending club loan	.	.	.	Binary logloss	Binary logloss	Accuracy (%)	F1 (%)	ROC AUC (%)
.	1	Out of Memory
.	2	Out of Memory
.	4	461	614	0.034	0.039	98.9	96.6	99.7
.	6	375	561	0.034	0.039	98.9	96.6	99.7
.	8	359	549	0.034	0.039	98.9	96.7	99.7
.	10	338	522	0.036	0.037	98.9	96.6	99.7
.
ad-tracking fraud detection	.	.	.	Binary logloss	Binary logloss	Accuracy (%)	F1 (%)	ROC AUC (%)
.	1	Out of Memory
.	2	Out of Memory
.	4	2649	2773	0.038	0.039	99.8	43.2	80.4
.	6	1926	2047	0.039	0.04	99.8	44.5	79.7
.	8	1677	1780	0.04	0.04	99.8	45.3	79.2
.	10	1595	1744	0.04	0.041	99.8	43	79.3
.
code	.	.	.	Multiclass logloss	Multiclass logloss	Accuracy (%)	F1 Macro (%)	F1 Micro (%)
.	1	5329	5414	0.937	0.947	65.6	59.3	65.6
.	2	3175	3294	0.94	0.942	65.5	59	65.5
.	4	2593	2695	0.937	0.942	65.6	59.3	65.6
.	8	2253	2377	0.938	0.943	65.6	59.3	65.6
.	10	2160	2285	0.937	0.942	65.6	59.3	65.6

The following table contains the benchmarking results using NYC taxi data with Parquet as the input data format. For the NYC taxi data, we use the yellow trip taxi records from 2009–2022. We follow the example notebook to conduct feature processing. The processed data takes 8.5 G of disk memory when saved as CSV format, and only 0.55 G when saved as Parquet format.

A similar pattern shown in the preceding table is observed. As the instance count increases, the billable time (per instance) and total runtime decreases, while the average training loss and validation loss at the end of the last built tree and evaluation scores on the hold-out test data remain the same.

Dataset	Instance Count (m5.4xlarge)	Billable Time per Instance (seconds)	Total Runtime (seconds)	Average Training Loss over all Instances at the End of the Last Built Tree	Validation Loss at the End of the Last Built Tree	Evaluation Metrics on Hold-Out Test Data
NYC taxi	.	.	.	L2	L2	R2 (%)	MSE	MAE
.	1	951	1036	6.543	6.543	54.7	42.8	2.7
.	2	635	727	6.545	6.545	54.7	42.8	2.7
.	4	501	628	6.637	6.639	53.4	44.1	2.8
.	6	435	552	6.74	6.74	52	45.4	2.8
.	8	410	510	6.919	6.924	52.3	44.9	2.9

We also conduct benchmarking experiments and compare the performance under different instance types using the code dataset. For a certain instance count, as the instance type becomes larger, the billable time and total runtime decrease.

.	ml.m5.2xlarge		ml.m5.4xlarge		ml.m5.12xlarge
Instance Count	Billable Time per Instance (seconds)	Total Runtime (seconds)	Billable Time per Instance (seconds)	Total Runtime (seconds)	Billable Time per Instance (seconds)	Total Runtime (seconds)
1	5329	5414	2793	2904	1302	1394
2	3175	3294	1911	2000	1006	1098
4	2593	2695	1451	1557	891	973

Conclusion

With the power of Dask’s distributed computing framework and LightGBM’s efficient gradient boosting algorithm, data scientists and developers can train models on large datasets faster and more efficiently than using traditional single-node methods. The SageMaker LightGBM algorithm makes the process of setting up distributed training using the Dask framework for both tabular classification and regression tasks much easier. The algorithm is now available through the SageMaker Python SDK. The supported data format can be either CSV or Parquet. Extensive benchmarking experiments were conducted on four publicly available datasets with various settings to validate its performance.

You can bring your own dataset and try these new algorithms on SageMaker, and check out the example notebook to use the built-in algorithms available on GitHub.

About the authors

Dr. Xin Huang is an Applied Scientist for Amazon SageMaker JumpStart and Amazon SageMaker built-in algorithms. He focuses on developing scalable machine learning algorithms. His research interests are in the area of natural language processing, explainable deep learning on tabular data, and robust analysis of non-parametric space-time clustering. He has published many papers in ACL, ICDM, KDD conferences, and Royal Statistical Society: Series A journal.

Will Badr is a Principal AI/ML Specialist SA who works as part of the global Amazon Machine Learning team. Will is passionate about using technology in innovative ways to positively impact the community. In his spare time, he likes to go diving, play soccer and explore the Pacific Islands.

Dr. Li Zhang is a Principal Product Manager-Technical for Amazon SageMaker JumpStart and Amazon SageMaker built-in algorithms, a service that helps data scientists and machine learning practitioners get started with training and deploying their models, and uses reinforcement learning with Amazon SageMaker. His past work as a principal research staff member and master inventor at IBM Research has won the test of time paper award at IEEE INFOCOM.

Build a water consumption forecasting solution for a water utility agency using Amazon Forecast

January 30, 2023

by Dhiraj Thakur Amazon AWS

Amazon Forecast is a fully managed service that uses machine learning (ML) to generate highly accurate forecasts, without requiring any prior ML experience. Forecast is applicable in a wide variety of use cases, including estimating supply and demand for inventory management, travel demand forecasting, workforce planning, and computing cloud infrastructure usage.

You can use Forecast to seamlessly conduct what-if analyses up to 80% faster to analyze and quantify the potential impact of business levers on your demand forecasts. A what-if analysis helps you investigate and explain how different scenarios might affect the baseline forecast created by Forecast. With Forecast, there are no servers to provision or ML models to build manually. Additionally, you only pay for what you use, and there is no minimum fee or upfront commitment. To use Forecast, you only need to provide historical data for what you want to forecast, and, optionally, any additional data that you believe may impact your forecasts.

Water utility providers have several forecasting use cases, but primary among them is predicting water consumption in an area or building to meet the demand. Also, it’s important for utility providers to forecast the increased consumption demand because of more apartments added in a building or more houses in the area. Predicting water consumption accurately is critical to avoid any service interruptions to the customer.

This post explores using Forecast to address this use case by using historical time series data.

Solution overview

Water is a natural resource and very critical to industry, agriculture, households, and our lives. Accurate water consumption forecasting is critical to make sure that an agency can run day-to-day operations efficiently. Water consumption forecasting is particularly challenging because demand is dynamic, and seasonal weather changes can have an impact. Predicting water consumption accurately is important so customers don’t face any service interruptions and in order to provide a stable service while maintaining low prices. Improved forecasting enables you to plan ahead to structure more cost-effective future contracts. The following are the two most common use cases:

Better demand management – As a utility provider agency, you need to find a balance between water demand and supply. The agency collects information like number of people living in an apartment and number of apartments in a building before providing service. As a utility agency, you must balance aggregate supply and demand. You need to store sufficient water in order to meet the demand. Moreover, demand forecasting has become more challenging for the following reasons:
- The demand isn’t stable at all times and varies throughout the day. For example, water consumption at midnight is much less compared to in the morning.
- Weather can also have an impact on the overall consumption. For example, water consumption is higher in the summer than the winter in the northern hemisphere, and the other way around in the southern hemisphere.
- There is not enough rainfall or water storage mechanisms (lakes, reservoirs), or water filtering is insufficient. During the summer, demand can’t always keep up with supply. The water agencies have to forecast carefully to acquire other sources, which may be more expensive. Therefore, it’s critical for utility agencies to find alternative water sources like harvesting rainwater, capturing condensation from air handling units, or reclaiming wastewater.
Conducting a what-if analysis for increased demand – Demand for water is rising due to multiple reasons. This includes a combination of population growth, economic development, and changing consumption patterns. Let’s imagine a scenario where an existing apartment building builds an extension and the number of households and people increase by a certain percentage. Now you need to do an analysis to forecast the supply for increased demand. This also helps you make a cost-effective contract for increased demand.

Forecasting can be challenging because you first need accurate models to forecast demand and then a quick and simple way to reproduce the forecast across a range of scenarios.

This post focuses on a solution to perform water consumption forecasting and a what-if analysis. This post doesn’t consider weather data for model training. However, you can add weather data, given its correlation to water consumption.

Prerequisites

Before getting started, we set up our resources. For this post, we use the us-east-1 Region.

Create an Amazon Simple Storage Service (Amazon S3) bucket for storing the historical time series data. For instructions, refer to Create your first S3 bucket.
Download data files from the GitHub repo and upload to the newly created S3 bucket.
Create a new AWS Identity and Access Management (IAM) role. For instructions, see Set Up Permissions for Amazon Forecast. Be sure to provide the name of your S3 bucket.

Create a dataset group and datasets

This post demonstrates two use cases related to water demand forecast: forecasting the water demand based on past water consumption, and conducting a what-if analysis for increased demand.

Forecast can accept three types of datasets: target time series (TTS), related time series (RTS), and item metadata (IM). Target time series data defines the historical demand for the resources you’re predicting. The target time series dataset is mandatory. A related time series dataset includes time-series data that isn’t included in a target time series dataset and might improve the accuracy of your predictor.

In our example, the target time series dataset contains item_id and timestamp dimensions, and the complementary related time series dataset includes no_of_consumer. An important note with this dataset: the TTS ends on 2023-01-01, and the RTS ends on 2023-01-15. When performing what-if scenarios, it’s important to manipulate RTS variables beyond your known time horizon in TTS.

To conduct a what-if analysis, we need to import two CSV files representing the target time series data and the related time series data. Our example target time series file contains the item_id, timestamp, and demand, and our related time series file contains the product item_id, timestamp, and no_of consumer.

To import your data, complete the following steps:

On the Forecast console, choose View dataset groups.
Choose Create dataset group.
For Dataset group name, enter a name (for this post, water_consumption_datasetgroup).
For Forecasting domain, choose a forecasting domain (for this post, Custom).
Choose Next.
On the Create target time series dataset page, provide the dataset name, frequency of your data, and data schema.
On the Dataset import details page, enter a dataset import name.
For Import file type, select CSV and enter the data location.
Choose the IAM role you created earlier as a prerequisite.
Choose Start.

You’re redirected to the dashboard that you can use to track progress.

To import the related time series file, on the dashboard, choose Import.
On the Create related time series dataset page, provide the dataset name and data schema.
On the Dataset import details page, enter a dataset import name.
For Import file type, select CSV and enter the data location.
Choose the IAM role you created earlier.
Choose Start.

Train a predictor

Next, we train a predictor.

On the dashboard, choose Start under Train a predictor.
On the Train predictor page, enter a name for your predictor.
Specify how long in the future you want to forecast and at what frequency.
Specify the number of quantiles you want to forecast for.

Forecast uses AutoPredictor to create predictors. For more information, refer to Training Predictors.

Choose Create.

Create a forecast

After our predictor is trained (this can take approximately 3.5 hours), we create a forecast. You will know that your predictor is trained when you see the View predictors button on your dashboard.

Choose Start under Generate forecasts on the dashboard.
On the Create a forecast page, enter a forecast name.
For Predictor, choose the predictor that you created.
Optionally, specify the forecast quantiles.
Specify the items to generate a forecast for.
Choose Start.

Query your forecast

You can query a forecast using the Query forecast option. By default, the complete range of the forecast is returned. You can request a specific date range within the complete forecast. When you query a forecast, you must specify filtering criteria. A filter is a key-value pair. The key is one of the schema attribute names (including forecast dimensions) from one of the datasets used to create the forecast. The value is a valid value for the specified key. You can specify multiple key-value pairs. The returned forecast will only contain items that satisfy all the criteria.

Choose Query forecast on the dashboard.
Provide the filter criteria for start date and end date.
Specify your forecast key and value.
Choose Get Forecast.

The following screenshot shows the forecast energy consumption for the same apartment (item ID A_10001) using the forecast model.

Create a what-if analysis

At this point, we have created our baseline forecast can now conduct a what-if analysis. Let’s imagine a scenario where an existing apartment building adds an extension, and the number of households and people increases by 20%. Now you need to do an analysis to forecast increased supply based on increased demand.

There are three stages to conducting a what-if analysis: setting up the analysis, creating the what-if forecast by defining what is changed in the scenario, and comparing the results.

To set up your analysis, choose Explore what-if analysis on the dashboard.
Choose Create.
Enter a unique name and choose the baseline forecast.
Choose the items in your dataset you want to conduct a what-if analysis for. You have two options:
- Select all items is the default, which we choose in this post.
- If you want to pick specific items, choose Select items with a file and import a CSV file containing the unique identifier for the corresponding item and any associated dimensions.
Choose Create what-if analysis.

Create a what-if forecast

Next, we create a what-if forecast to define the scenario we want to analyze.

In the What-if forecast section, choose Create.
Enter a name of your scenario.
You can define your scenario through two options:
- Use transformation functions – Use the transformation builder to transform the related time series data you imported. For this walkthrough, we evaluate how the demand for an item in our dataset changes when the number of consumers increases by 20% when compared to the price in the baseline forecast.
- Define the what-if forecast with a replacement dataset – Replace the related time series dataset you imported.

For our example, we create a scenario where we increase no_of_consumer by 20% applicable to item ID A_10001, and no_of_consumer is a feature in the dataset. You need this analysis to forecast and meet the water supply for increased demand. This analysis also helps you make a cost-effective contract based on the water demand forecast.

For What-if forecast definition method, select Use transformation functions.
Choose Multiply as our operator, no_of_consumer as our time series, and enter 1.2.
Choose Add condition.
Choose Equals as the operation and enter A_10001 for item_id.
Choose Create.

Compare the forecasts

We can now compare the what-if forecasts for both our scenarios, comparing a 20% increase in consumers with the baseline demand.

On the analysis insights page, navigate to the Compare what-if forecasts section.
For item_id, enter the item to analyze (in our scenario, enter A_10001).
For What-if forecasts, choose water_demand_whatif_analyis.
Choose Compare what-if.
You can choose the baseline forecast for the analysis.

The following graph shows the resulting demand for our scenario. The red line shows the forecast of future water consumption for 20% increased population. The P90 forecast type indicates the true value is expected to be lower than the predicted value 90% of the time. You can use this demand forecast to effectively manage water supply for increased demand and avoid any service interruptions.

Export your data

To export your data to CSV, complete the following steps:

Choose Create export.
Enter a name for your export file (for this post, water_demand_export).
Specify the scenarios to be exported by selecting the scenarios on the What-If Forecast drop-down menu.

You can export multiple scenarios at once in a combined file.

For Export location, specify the Amazon S3 location.
To begin the export, choose Create Export.
To download the export, navigate to S3 file path location on the Amazon S3 console, select the file, and choose Download.

The export file will contain the timestamp, item_id, and forecasts for each quantile for all scenarios selected (including the base scenario).

Clean up the resources

To avoid incurring future charges, remove the resources created by this solution:

Delete the Forecast resources you created.
Delete the S3 bucket.

Conclusion

In this post, we showed you how easy to use how to use Forecast and its underlying system architecture to predict water demand using water consumption data. A what-if scenario analysis is a critical tool to help navigate through the uncertainties of business. It provides foresight and a mechanism to stress-test ideas, leaving businesses more resilient, better prepared, and in control of their future. Other utility providers like electricity or gas providers can use Forecast to build solutions and meet utility demand in a cost-effective way.

The steps in this post demonstrated how to build the solution on the AWS Management Console. To directly use Forecast APIs for building the solution, follow the notebook in our GitHub repo.

We encourage you to learn more by visiting the Amazon Forecast Developer Guide and try out the end-to-end solution enabled by these services with a dataset relevant to your business KPIs.

About the Author

Dhiraj Thakur is a Solutions Architect with Amazon Web Services. He works with AWS customers and partners to provide guidance on enterprise cloud adoption, migration, and strategy. He is passionate about technology and enjoys building and experimenting in the analytics and AI/ML space.

Amazon Research Awards recipients announced

January 30, 2023

by Amazon AWS

Awardees, who represent 24 universities in seven countries, have access to Amazon public datasets, along with AWS AI/ML services and tools.Read More

Best Egg achieved three times faster ML model training with Amazon SageMaker Automatic Model Tuning

January 26, 2023

by Tristan Miller Amazon AWS

This post is co-authored by Tristan Miller from Best Egg.

Best Egg is a leading financial confidence platform that provides lending products and resources focused on helping people feel more confident as they manage their everyday finances. Since March 2014, Best Egg has delivered $22 billion in consumer personal loans with strong credit performance, welcomed almost 637,000 members to the recently launched Best Egg Financial Health platform, and empowered over 180,000 cardmembers who carry the new Best Egg Credit Card in their wallet.

Amazon SageMaker is a fully managed machine learning (ML) service providing various tools to build, train, optimize, and deploy ML models. SageMaker provides automated model tuning, which manages the undifferentiated heavy lifting of provisioning and managing compute infrastructure to run several iterations and select the optimized model candidate from training.

To help you efficiently tune your required hyperparameters and determine the best-performing model, this post will discuss how Best Egg used SageMaker hyperparameter tuning with warm pools and achieved a three-fold improvement in model training time.

Use case overview

Risk credit analysts use credit rating models when lending or offering a credit card to customers by taking a variety of user attributes into account. This statistical model generates a final score, or Good Bad Indicator (GBI), which determines whether to approve or reject a credit application. ML insights facilitate decision-making. To assess the risk of credit applications, ML uses various data sources, thereby predicting the risk that a customer will be delinquent.

The challenge

A significant problem in the financial sector is that there is no universally accepted method or structure for dealing with the overwhelming array of possibilities that must be considered at any one time. It’s difficult to standardize the tools that teams use in order to promote transparency and tracking across the board. The application of ML can help those in the finance industry make better judgments regarding pricing, risk management, and consumer behavior. Data scientists train multiple ML algorithms to examine millions of consumer data records, identify anomalies, and evaluate if a person is eligible for credit.

SageMaker can run automated hyperparameter tuning based on multiple optimization techniques such as grid search, Bayesian, random search, and Hyperband. Automatic model tuning makes it easy to zero in on the optimal model configuration, freeing up time and money for better use elsewhere in the financial sector. As part of hyperparameter tuning, SageMaker runs several iterations of the training code on the training dataset with various hyperparameter combinations. SageMaker then determines the best model candidate with the optimal hyperparameters based on the objective metric configured.

Best Egg was able to automate hyperparameter tuning with the automated hyperparameter optimization (HPO) feature of SageMaker and parallelize it. However, each hyperparameter tuning job could take hours, and selecting the best model candidate took many hyperparameter tuning jobs run over the course of several days. Hyperparameter tuning jobs could be slow due to the nature of the iterative tasks that HPO runs under the hood. Every time a training job is initiated, new resource provisioning occurs, which consumes a significant amount of time before the training actually begins. This is a common problem that data scientists face when training their models. Time efficiency was a major pain point because these long-running training jobs were impeding productivity and data scientists were stuck on these jobs for hours.

Solution overview

The following diagram represents the different components used in this solution.

The Best Egg data science team uses Amazon SageMaker Studio for building and running Jupyter notebooks. SageMaker processing jobs run feature engineering pipelines on the input dataset to generate features. Best Egg trains multiple credit models using classification and regression algorithms. The data science team must sometimes work with limited training data in the order of tens of thousands of records given the nature of their use cases. Best Egg runs SageMaker training jobs with automated hyperparameter tuning powered by Bayesian optimization. To reduce variance, Best Egg uses k-fold cross validation as part of their custom container to evaluate the trained model.

The trained model artifact is registered and versioned in the SageMaker model registry. Inference is run in two ways—real time and batch—based on the user requirements. The trained model artifact is hosted on a SageMaker real-time endpoint using the built-in auto scaling and load balancing features. The model is also scored through batch transform jobs scheduled on a daily basis. The whole pipeline is orchestrated through Amazon SageMaker Pipelines, consisting of a sequence of steps such as a processing step for feature engineering, a tuning step for training and automated model tuning, and a model step for registering the artifact.

With respect to the core problem of long-running hyperparameter tuning jobs, Best Egg explored the recently released warm pools feature managed by SageMaker. SageMaker Managed Warm Pools allows you to retain and reuse provisioned infrastructure after the completion of a training job to reduce latency for repetitive workloads, such as iterative experimentation or consecutively running jobs where specific job configuration parameters like instance type or count match with the previous runs. This allowed Best Egg to reuse the existing infrastructure for their repetitive training jobs without wasting time on infrastructure provisioning.

Deep Dive into Model Tuning and Benefits of Warm Pools

SageMaker Automated Model Tuning leverages Warm Pools by default for any tuning job as of August 2022 (announcement). This makes it straightforward to reap the benefits of Warm Pools as you just need to launch a tuning job and SageMaker Automatic Model Tuning will automatically use Warm Pools between subsequent training jobs launched as part of the tuning. When each training job completes, the provisioned resources are kept alive in a warm pool so that the next training job launched as part of the tuning will start on the same pool with minimal startup overhead.

The below workflow depicts a series of training job runs using warm pool.

After the first training job is complete, the instances used for training are retained in the warm pool cluster.
The next training job triggered will use the instance in the warm pool to run, eliminating the cold start time needed to prepare the instance to start up.
Likewise, if more training jobs come in with instance type, instance count, volume & networking criteria similar to the warm pool cluster resources, then the matched instances will be used for running the jobs.
Once the training job is completed, the instances will be retained in the warm pool waiting for new jobs.
The maximum length of time that a warm pool cluster can continue running consecutive training jobs is 7 days.
- As long as the cluster is healthy and the warm pool is within the specified time duration, the warm pool status is Available.
- The warm pool stays Available until it identifies a matching training job for reuse. If the warm pool status is Terminated, then this is the end of the warm pool lifecycle.

The following diagram illustrates this workflow.

How Best Egg benefitted: Improvements and data points

Best Egg noticed that with warm pools, their training jobs on SageMaker were running faster by a factor of 3. In one credit model project, the best model was selected from eight different HPO jobs, each of which had 40 iterations with five parallel jobs at a time. Each iteration took about 1 minute to compute, whereas without warm pools they typically took 5 minutes each. In total, the process took 2 hours of computation time, with additional input from the data scientist adding up to about half a business day. Without warm pools, we estimate that the computation would have taken 6 hours alone, likely spread out over the course of 2–3 business days.

Summary

In conclusion, this post discussed elements of Best Egg’s business and the company’s ML landscape. We reviewed how Best Egg was able to speed up its model training and tuning by enabling warm pools for their hyperparameter tuning jobs on SageMaker. We also explained how simple it is to implement warm pools for your training jobs with a simple configuration. At AWS, we recommend our readers start exploring warm pools for iterative and repetitive training jobs.

About the Authors

Tristan Miller is a Lead Data Scientist at Best Egg. He builds and deploys ML models to make important underwriting and marketing decisions. He develops bespoke solutions to address specific problems, as well as automation to increase efficiency and scale. He is also a skilled origamist.

Valerio Perrone is an Applied Science Manager at AWS. He leads the science and engineering team owning the service for automatic model tuning across Amazon SageMaker. Valerio’s expertise lies in developing algorithms for large-scale machine learning and statistical models, with a focus on data-driven decision making and the democratization of artificial intelligence

Ganapathi Krishnamoorthi is a Senior ML Solutions Architect at AWS. Ganapathi provides prescriptive guidance to startup and enterprise customers, helping them design and deploy cloud applications at scale. He is specialized in machine learning and is focused on helping customers use AI/ML for their business outcomes. When not at work, he enjoys exploring the outdoors and listening to music

Ajjay Govindaram is a Sr. Solutions Architect at AWS. He works with strategic customers who are using AI/ML to solve complex business problems. His experience lies in providing technical direction as well as design assistance for modest to large-scale AI/ML application deployments. His knowledge ranges from application architecture to big data, analytics, and machine learning. He enjoys listening to music while resting, experiencing the outdoors, and spending time with his loved ones.

Hariharan Suresh is a Senior Solutions Architect at AWS. He is passionate about databases, machine learning, and designing innovative solutions. Prior to joining AWS, Hariharan was a product architect, core banking implementation specialist, and developer, and worked with BFSI organizations for over 11 years. Outside of technology, he enjoys paragliding and cycling.

Prerequisites

Get started on SageMaker Studio Lab and install the required packages

Prepare the data by fusing results from two views

Graph and video code

Conclusion

About the authors

Moderate videos using the video moderation API

Moderate videos using the image moderation API

Evaluation dataset

Results summary

Accuracy

Cost

Performance

Architecture complexity

Optimal use cases for the image API solution

Conclusion

About the authors

Solution architecture

Solution overview

Create an EKS cluster

Create an EKS node group

Install Kubernetes plugins for Trainium and EFA devices

Attach shared storage to the EKS cluster

Create a training container image

Set up the training data

Precompile your model

Launch the distributed training job

Monitor the training job

Clean up or reuse the environment

Conclusion

About the Authors

Problem statement

How distributed training works

Solution overview

Enable distributed training

Benchmarking

Conclusion

About the authors

Solution overview

Prerequisites

Create a dataset group and datasets

Train a predictor

Create a forecast

Query your forecast

Create a what-if analysis

Create a what-if forecast

Compare the forecasts

Export your data

Clean up the resources

Conclusion

About the Author

Use case overview

The challenge

Solution overview

Deep Dive into Model Tuning and Benefits of Warm Pools

How Best Egg benefitted: Improvements and data points

Summary

About the Authors

Navigation

GenAI Vision Endless Possibilities

"I'm interested in things that change the world or that affect the future and wondrous, new technology where you see it, and you're like, 'Wow, how did that even happen? How is that possible?'" -- Elon Musk

Copyright © 2019-2025 Vedere AI. All Rights Reserved.